Pipeline Overview

The MetaBeeAI pipeline is composed of five core submodules, each responsible for a distinct stage of the literature review and analysis process.

PDFs → Vision AI Processing → LLM Analysis → Human Review → Benchmarking → Analysis

For details on each stage, see the corresponding section of the submodule documentation.

Stages

  1. PDF Processing → Structured JSON
     - Submodule: process_pdfs
     - Purpose: Convert PDFs into structured JSON text with layout and coordinate data
     - Input: PDF files
     - Output: JSON chunks representing extracted text and layout elements
     - API Reference: Process PDFs

  2. LLM Question Answering → Extracted Information
     - Submodule: metabeeai_llm
     - Purpose: Use large language models to extract structured answers and citations from processed text
     - Input: JSON chunks
     - Output: Structured question–answer pairs with traceable sources
     - API Reference: MetaBeeAI LLM

  3. Human Review & Annotation → Validated Answers
     - Submodule: llm_review_software
     - Purpose: Provide a graphical interface for human review and validation of LLM answers
     - Input: LLM-generated answers
     - Output: Human-verified and annotated answers
     - API Reference: LLM Review Software

  4. Benchmarking → Performance Metrics
     - Submodule: llm_benchmarking
     - Purpose: Evaluate model performance against human-reviewed ground truth
     - Input: LLM and reviewer answers
     - Output: Quantitative metrics, comparisons, and performance plots
     - API Reference: LLM Benchmarking

  5. Data Analysis → Insights
     - Submodule: query_database
     - Purpose: Aggregate validated data across studies and perform trend and network analyses
     - Input: Structured and benchmarked results
     - Output: Analytical summaries, visualizations, and derived datasets
     - API Reference: Query Database
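
The stage list above can be summarized as a small data structure, which may help when reasoning about what each submodule consumes and produces. The following is only an illustrative sketch: the `Stage` class, the `STAGES` list, and the `describe` helper are hypothetical names introduced here and are not part of the MetaBeeAI API. Only the stage titles, submodule names, and input/output descriptions are taken from the list above (outputs are abbreviated).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical record describing one pipeline stage (not a MetaBeeAI class)."""
    title: str
    submodule: str
    consumes: str
    produces: str


# Titles, submodule names, inputs, and outputs mirror the documented stage list.
STAGES = [
    Stage("PDF Processing", "process_pdfs",
          "PDF files", "JSON chunks"),
    Stage("LLM Question Answering", "metabeeai_llm",
          "JSON chunks", "Structured question-answer pairs"),
    Stage("Human Review & Annotation", "llm_review_software",
          "LLM-generated answers", "Human-verified and annotated answers"),
    Stage("Benchmarking", "llm_benchmarking",
          "LLM and reviewer answers", "Quantitative metrics"),
    Stage("Data Analysis", "query_database",
          "Structured and benchmarked results", "Analytical summaries"),
]


def describe(stages):
    """Return one human-readable summary line per stage, numbered from 1."""
    return [
        f"{i}. {s.title} ({s.submodule}): {s.consumes} -> {s.produces}"
        for i, s in enumerate(stages, start=1)
    ]


if __name__ == "__main__":
    for line in describe(STAGES):
        print(line)
```

Running the module prints one line per stage in pipeline order. The sketch only describes the stages; it does not invoke any of the submodules, whose actual entry points are documented in their respective API references.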