PDF Processing Pipeline#
This submodule provides scripts for converting PDF scientific papers into structured JSON format suitable for LLM-based information extraction.
Overview#
The PDF processing pipeline converts raw PDF papers into structured JSON chunks through four main steps:
Split PDFs - Break papers into single-page documents or overlapping 2-page segments
Vision API Processing - Extract text and structure using Landing AI’s Vision Agentic API
Merge JSON Files - Combine individual page JSON files into a single document
Deduplicate Chunks - Remove duplicate text chunks (primarily for overlapping pages)
Installation#
This submodule is part of the metabeeai package. Install it via:
pip install metabeeai
Or if installing from source:
pip install -e /path/to/MetaBeeAI
Required dependencies: PyPDF2, requests, python-dotenv, termcolor (pathlib is also used, but it ships with the Python standard library)
Quick Start#
Prerequisites#
Environment Setup:
# Activate your virtual environment
source ../venv/bin/activate # On Mac/Linux
# Or: ..\venv\Scripts\activate # On Windows
API Keys: Configure your API keys in the .env file:
# Copy the example environment file
cp ../env.example ../.env
# Edit the .env file and add your API key
# LANDING_AI_API_KEY=your_landing_ai_api_key_here
The .env file is located in the project root directory and is hidden from git for security.
Input Data Format: Your papers must be organized as follows:
papers/
├── 95UKMIEY/
│ └── 95UKMIEY_main.pdf
├── CX9M8HCM/
│ └── CX9M8HCM_main.pdf
├── V7984AAU/
│ └── V7984AAU_main.pdf
...
Each paper should have:
A folder with an alphanumeric name (e.g., 95UKMIEY, CX9M8HCM, 001, etc.)
A PDF file named {folder_name}_main.pdf inside that folder
Folders are processed in alphanumeric (lexicographic) order
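Before running the pipeline, it can help to check the layout described above. The following is a minimal sketch, not part of the metabeeai package: the helper name find_papers is hypothetical, and it only checks that each folder contains a file named {folder_name}_main.pdf.

```python
from pathlib import Path


def find_papers(papers_dir):
    """Return (ready, missing): folder names with and without {folder_name}_main.pdf."""
    ready, missing = [], []
    folders = [p for p in Path(papers_dir).iterdir() if p.is_dir()]
    # Sorting by name gives the lexicographic order described above
    for folder in sorted(folders, key=lambda p: p.name):
        pdf = folder / f"{folder.name}_main.pdf"
        (ready if pdf.is_file() else missing).append(folder.name)
    return ready, missing
```

Calling find_papers("papers") returns two lists of folder names; anything in the second list needs its PDF renamed or moved before processing.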
Basic Usage - Complete Pipeline#
When the metabeeai package is installed, use the CLI command:
Run all steps for all papers in directory:
metabeeai process-pdfs
Run all steps for a range of papers (alphanumeric order):
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM
Run with a custom directory:
metabeeai process-pdfs --dir /path/to/papers --start 95UKMIEY --end CX9M8HCM
Merge-only mode (skip expensive PDF splitting and API processing):
# Process all papers - only merge and deduplicate
metabeeai process-pdfs --merge-only
# Process specific papers - only merge and deduplicate
metabeeai process-pdfs --merge-only --start 95UKMIEY --end CX9M8HCM
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Core Files#
1. process_all.py - Main Pipeline Runner#
Purpose: Orchestrates all four steps of the PDF processing pipeline in sequence.
Usage (CLI - Recommended):
When the metabeeai package is installed, use the CLI command:
# Process all papers (all steps)
metabeeai process-pdfs
# Process papers in a specific range (alphanumeric order)
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM
# Process papers from a starting folder to the end
metabeeai process-pdfs --start 95UKMIEY
# Merge-only mode (skip expensive PDF splitting and API processing)
metabeeai process-pdfs --merge-only
# Merge-only for specific papers
metabeeai process-pdfs --merge-only --start 95UKMIEY --end CX9M8HCM
# Filter out marginalia chunks during merging
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM --filter-chunk-type marginalia
# Split into overlapping 2-page documents
metabeeai process-pdfs --pages 2
# Skip specific steps
metabeeai process-pdfs --skip-split --skip-api
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Command-line options:
--start FOLDER: First folder name to process (optional; defaults to first folder in alphanumeric order)
--end FOLDER: Last folder name to process (optional; defaults to last folder in alphanumeric order)
--dir PATH: Custom papers directory (default: from config/env)
--merge-only: Only run merge and deduplication steps (skip expensive PDF splitting and API processing)
--skip-split: Skip PDF splitting step
--skip-api: Skip Vision API processing step
--skip-merge: Skip JSON merging step
--skip-deduplicate: Skip chunk deduplication step
--filter-chunk-type TYPE [TYPE ...]: Filter out specific chunk types (e.g., marginalia, figure)
Output: Creates the following files for each paper:
papers/XXX/pages/main_p01-02.pdf, main_p02-03.pdf, etc. (split PDFs; main_p01.pdf, main_p02.pdf, etc. in single-page mode)
papers/XXX/pages/main_p01-02.pdf.json, etc. (API responses)
papers/XXX/pages/merged_v2.json (final merged and deduplicated file)
2. split_pdf.py - PDF Splitter#
Purpose: Splits multi-page PDFs into either single-page documents or overlapping 2-page segments.
Why overlapping pages?: Scientific papers often have content that spans across pages (tables, paragraphs). Overlapping 2-page segments ensure we don’t lose information at page boundaries.
Usage:
Run via the CLI with the main pipeline command (skip later steps):
# Split into single-page documents (default)
metabeeai process-pdfs --skip-api --skip-merge --skip-deduplicate --dir /path/to/papers
# Split into single-page documents (explicit)
metabeeai process-pdfs --pages 1 --skip-api --skip-merge --skip-deduplicate --dir /path/to/papers
# Split into overlapping 2-page documents
metabeeai process-pdfs --pages 2 --skip-api --skip-merge --skip-deduplicate --dir /path/to/papers
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Command-line options:
--pages {1,2}: Number of pages per split (default: 1)
1 = single-page documents (main_p01.pdf, main_p02.pdf, etc.)
2 = overlapping 2-page documents (main_p01-02.pdf, main_p02-03.pdf, etc.)
How it works:
Finds all {folder_name}_main.pdf files in paper folders
Creates a pages/ subdirectory in each paper folder
Generates split PDFs based on the --pages option:
Single-page mode (--pages 1):
main_p01.pdf (page 1)
main_p02.pdf (page 2)
main_p03.pdf (page 3)
etc.
Overlapping 2-page mode (--pages 2):
main_p01-02.pdf (pages 1-2)
main_p02-03.pdf (pages 2-3)
main_p03-04.pdf (pages 3-4)
etc.
Example:
Single-page mode: A 10-page paper generates 10 split PDFs
2-page mode: A 10-page paper generates 9 overlapping split PDFs
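The naming scheme and the counts in the example above can be reproduced with a few lines of Python. This is an illustrative sketch only: split_names is a hypothetical helper, not a function from split_pdf.py, and it generates file names rather than PDFs.

```python
def split_names(num_pages, pages_per_split=1):
    """List the split-PDF names for one paper, following the naming shown above."""
    if pages_per_split == 1:
        return [f"main_p{p:02d}.pdf" for p in range(1, num_pages + 1)]
    if pages_per_split == 2:
        # Each segment starts one page after the previous one,
        # so neighbouring segments share exactly one page.
        return [f"main_p{p:02d}-{p + 1:02d}.pdf" for p in range(1, num_pages)]
    raise ValueError("pages_per_split must be 1 or 2")
```

For a 10-page paper, split_names(10, 1) has 10 entries and split_names(10, 2) has 9. How the real splitter treats a one-page paper in 2-page mode is not stated here; this sketch simply returns an empty list in that case.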
3. va_process_papers.py - Vision API Processor#
Purpose: Processes each split PDF through Landing AI’s Vision Agentic Document Analysis API to extract text and structure.
Usage:
Run via the CLI with the main pipeline command (assumes PDFs already split):
# Process all papers
metabeeai process-pdfs --skip-split --skip-merge --skip-deduplicate --dir data/papers
# Start from a specific folder (useful for resuming - alphanumeric order)
metabeeai process-pdfs --skip-split --skip-merge --skip-deduplicate --dir data/papers --start 95UKMIEY
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Command-line options:
--dir PATH: Papers directory (default: data/papers)
--start FOLDER: Starting folder name (alphanumeric order, e.g., 95UKMIEY, CX9M8HCM)
How it works:
Processes folders in alphanumeric order
Finds all split PDF files in papers/{FOLDER}/pages/
Sends each PDF to the Vision Agentic API
Saves the JSON response as {pdf_filename}.json
Skips files that already have JSON outputs (resume-friendly)
Logs all processing activity with timestamps
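The resume-friendly behaviour in the steps above amounts to checking for a JSON file next to each split PDF. The sketch below shows that check in isolation. pending_pdfs is a hypothetical helper and makes no API calls; it assumes the main_p*.pdf and {pdf_filename}.json naming described in this document.

```python
from pathlib import Path


def pending_pdfs(pages_dir):
    """Return split PDFs in pages_dir that have no '<name>.pdf.json' beside them."""
    pdfs = sorted(Path(pages_dir).glob("main_p*.pdf"), key=lambda p: p.name)
    # A PDF counts as processed once its JSON response file exists
    return [p for p in pdfs if not p.with_name(p.name + ".json").exists()]
```

Running it on papers/95UKMIEY/pages before an API run gives a rough idea of how many calls remain for that paper.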
Output: Creates JSON files with this structure:
{
"data": {
"chunks": [
{
"chunk_id": "unique_id",
"text": "Extracted text content...",
"chunk_type": "paragraph",
"grounding": [
{
"page": 0,
"bbox": [x1, y1, x2, y2]
}
],
"metadata": {...}
}
]
}
}
Important: This step requires a valid LANDING_AI_API_KEY in your .env file.
4. merger.py - JSON Merger#
Purpose: Combines individual page JSON files into a single merged_v2.json file per paper, handling overlapping pages correctly.
Usage:
Run via the CLI:
# Merge all papers
metabeeai process-pdfs --merge-only
# Filter out marginalia chunks
metabeeai process-pdfs --merge-only --filter-chunk-type marginalia
# Filter multiple chunk types
metabeeai process-pdfs --merge-only --filter-chunk-type marginalia figure
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Command-line options:
--basepath PATH: Base path containing the papers/ folder
--filter-chunk-type TYPE [TYPE ...]: Chunk types to exclude from output
How it works:
Finds all main_*.json files in papers/XXX/pages/
Adjusts page numbers to account for overlapping pages
Merges all chunks into a single JSON structure
Optionally filters out specified chunk types
Saves as merged_v2.json
Page number adjustment: Since pages overlap, the merger maps overlapping pages to the same global page number to avoid duplication.
Example:
File 1 (pages 1-2): Global pages 1-2
File 2 (pages 2-3): Page 2 maps to global page 2, page 3 becomes global page 3
File 3 (pages 3-4): Page 3 maps to global page 3, page 4 becomes global page 4
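One way to express the mapping in the example above is to read the starting page from the split file name and add the page offset within that file. This is a sketch under two assumptions, not the code in merger.py: page numbers inside each split file are 0-indexed relative to that file, and the result should be a 0-indexed page of the whole paper (so the "global page 1" of the example is 0 here). The function name global_page is hypothetical.

```python
import re


def global_page(split_name, local_page):
    """Map a 0-indexed page inside one split file to a 0-indexed page of the paper."""
    match = re.match(r"main_p(\d+)(?:-\d+)?\.pdf", split_name)
    if match is None:
        raise ValueError(f"unexpected split file name: {split_name}")
    first_page = int(match.group(1))  # 1-based page on which this segment starts
    return first_page - 1 + local_page
```

With this rule, the second page of main_p01-02.pdf and the first page of main_p02-03.pdf both map to the same global page, which is what lets the later deduplication step treat their chunks as copies.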
Output format:
{
"data": {
"chunks": [
{
"chunk_id": "unique_id",
"text": "...",
"chunk_type": "paragraph",
"grounding": [{"page": 0, "bbox": [...]}],
"metadata": {...}
}
]
}
}
5. deduplicate_chunks.py - Chunk Deduplication Module#
Purpose: Provides functions to identify and remove duplicate text chunks that result from overlapping pages.
Usage as a module:
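import json
from metabeeai.process_pdfs.deduplicate_chunks import analyze_chunk_uniqueness, deduplicate_chunks
# Load the chunk list from a merged file (example path)
with open("papers/95UKMIEY/pages/merged_v2.json") as f:
    chunks = json.load(f)["data"]["chunks"]
# Analyze duplication
analysis = analyze_chunk_uniqueness(chunks)
print(f"Found {analysis['duplicate_chunks']} duplicates")
# Deduplicate
deduplicated_chunks = deduplicate_chunks(chunks)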
Key functions:
analyze_chunk_uniqueness(chunks) - Returns statistics about duplicates
deduplicate_chunks(chunks) - Removes duplicates while preserving all chunk IDs
get_duplicate_summary(chunks) - Human-readable summary of duplicates
process_merged_json_file(path) - Process a single merged JSON file
How deduplication works:
Groups chunks by text content
For duplicate groups, keeps one chunk but preserves all chunk IDs
Adds metadata about the deduplication process
Duplicate handling: When duplicates are found, the deduplicated chunk includes:
chunk_id: Primary chunk ID (first occurrence)
chunk_ids: List of all chunk IDs with the same text
original_chunk_id: Reference to the original ID
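The grouping idea described above can be sketched in plain Python. This is not the implementation in deduplicate_chunks.py: dedupe_by_text is a hypothetical name, it omits original_chunk_id, and the duplication_rate formula (duplicates removed as a percentage of original chunks) is an assumption.

```python
def dedupe_by_text(chunks):
    """Keep the first chunk for each distinct text and collect the IDs of its copies."""
    kept = {}  # text -> surviving chunk (dicts preserve insertion order)
    for chunk in chunks:
        text = chunk["text"]
        if text not in kept:
            survivor = dict(chunk)  # copy so the input list is left untouched
            survivor["chunk_ids"] = [chunk["chunk_id"]]
            kept[text] = survivor
        else:
            kept[text]["chunk_ids"].append(chunk["chunk_id"])
    unique = list(kept.values())
    removed = len(chunks) - len(unique)
    info = {
        "original_chunks": len(chunks),
        "unique_chunks": len(unique),
        "duplicates_removed": removed,
        # Assumed definition: share of original chunks that were duplicates
        "duplication_rate": 100.0 * removed / len(chunks) if chunks else 0.0,
    }
    return unique, info
```

Because the comparison is exact string equality, two chunks that differ by a single space or hyphen are both kept.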
6. batch_deduplicate.py - Batch Deduplication Runner#
Purpose: Processes all merged_v2.json files in a directory to remove duplicates.
Usage:
Run via the CLI as part of merge-only mode (dedup runs after merge):
# Deduplicate all papers (merge-only already runs dedup)
metabeeai process-pdfs --merge-only
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Command-line options:
--base-dir PATH: Base directory containing paper folders
--start-paper N: First paper number to process (for numeric folders only)
--end-paper N: Last paper number to process (for numeric folders only)
--dry-run: Analyze files without making changes
--output FILE: Save results summary to file
--verbose, -v: Enable verbose logging
Note: When called from process_all.py, the folder list is automatically provided to support alphanumeric folder names.
How it works:
Finds all paper folders with merged_v2.json files
Analyzes each file for duplicate chunks
Deduplicates and overwrites the file (unless --dry-run)
Generates a summary report
Output: Creates a summary JSON file with:
{
"status": "completed",
"total_papers": 10,
"processed_papers": 10,
"total_duplicates_removed": 145,
"results": [...]
}
Individual Script Usage#
Running Steps Separately#
If you need to run individual steps (useful for debugging or resuming):
Step 1: Split PDFs#
metabeeai process-pdfs --skip-api --skip-merge --skip-deduplicate --dir /path/to/papers
Step 2: Process with Vision API#
metabeeai process-pdfs --skip-split --skip-merge --skip-deduplicate --dir /path/to/papers
Step 3: Merge JSON files#
metabeeai process-pdfs --skip-split --skip-api --skip-deduplicate --dir /path/to/data/papers
Step 4: Deduplicate chunks#
metabeeai process-pdfs --skip-split --skip-api --skip-merge --dir /path/to/data/papers
Note: For Python module syntax alternatives, see the Alternative: Python Module Syntax section below.
Input Data Format#
Required Directory Structure#
papers/
├── 95UKMIEY/
│ └── 95UKMIEY_main.pdf
├── CX9M8HCM/
│ └── CX9M8HCM_main.pdf
├── V7984AAU/
│ └── V7984AAU_main.pdf
...
Requirements:
Each paper must be in a folder with an alphanumeric name (any format: 95UKMIEY, 001, ABC123, etc.)
PDF file must be named {folder_name}_main.pdf (matching the folder name)
Folders are processed in alphanumeric (lexicographic) order
PDFs should be complete scientific papers (not split or partial)
PDF Requirements#
Format: Valid PDF files
Content: Text-based PDFs work best (scanned PDFs may have lower quality)
Size: No strict limits, but very large files may take longer to process
Pages: Multi-page documents are fully supported
Output Data Format#
Final Output: merged_v2.json#
The pipeline produces a merged_v2.json file for each paper with the following structure:
{
"data": {
"chunks": [
{
"chunk_id": "unique_chunk_identifier",
"text": "The extracted text content from the PDF...",
"chunk_type": "paragraph",
"grounding": [
{
"page": 0,
"bbox": [x1, y1, x2, y2]
}
],
"chunk_ids": ["id1", "id2"],
"metadata": {
"confidence": 0.95,
"font_size": 12,
...
}
}
]
},
"deduplication_info": {
"original_chunks": 500,
"unique_chunks": 450,
"duplicates_removed": 50,
"duplication_rate": 10.0,
"duplicate_groups": 25
}
}
Field Descriptions:#
chunk_id: Unique identifier for this chunk
text: Extracted text content
chunk_type: Type of content (paragraph, heading, table, figure, marginalia, etc.)
grounding: Location information
page: Page number (0-indexed)
bbox: Bounding box coordinates [x1, y1, x2, y2]
chunk_ids: List of all chunk IDs with identical text (after deduplication)
metadata: Additional information from the Vision API
deduplication_info: Statistics about the deduplication process
This format is designed to be consumed by the LLM pipeline in ../metabeeai_llm/.
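As a quick check of a finished file, the fields described above can be read with the standard json module. This is a minimal sketch that assumes only the structure shown in this section; summarize_merged is a hypothetical helper, not part of the package.

```python
import json
from collections import Counter


def summarize_merged(path):
    """Count chunks per chunk_type and list the 0-indexed pages that carry content."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    chunks = doc["data"]["chunks"]
    types = Counter(c.get("chunk_type", "unknown") for c in chunks)
    pages = sorted({g["page"] for c in chunks for g in c.get("grounding", [])})
    # deduplication_info is only present once the deduplication step has run
    return types, pages, doc.get("deduplication_info", {})
```

A gap in the returned page list, or an unexpectedly large marginalia count, is a cheap signal that a split file was missed or that filtering is worth enabling.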
Understanding the Process Flow#
Complete Pipeline Flow#
Raw PDF → Split PDF → Vision API → Individual JSONs → Merged JSON → Deduplicated JSON
Detailed steps (example with overlapping 2-page mode):
Input:
95UKMIEY_main.pdf (10 pages)
After Splitting (with --pages 2):
pages/main_p01-02.pdf
pages/main_p02-03.pdf
pages/main_p03-04.pdf
...
pages/main_p09-10.pdf
Or with single-page mode (--pages 1, default):
pages/main_p01.pdf
pages/main_p02.pdf
pages/main_p03.pdf
...
pages/main_p10.pdf
After API Processing:
pages/main_p01-02.pdf.json (or main_p01.pdf.json in single-page mode)
pages/main_p02-03.pdf.json (or main_p02.pdf.json in single-page mode)
...
pages/main_p09-10.pdf.json (or main_p10.pdf.json in single-page mode)
After Merging:
pages/merged_v2.json (contains all chunks with adjusted page numbers)
After Deduplication:
pages/merged_v2.json (duplicates removed, chunk IDs preserved)
Troubleshooting#
“LANDING_AI_API_KEY not found”#
Cause: API key not configured in .env file
Fix:
cp ../env.example ../.env
# Edit .env and add your LANDING_AI_API_KEY
“PDF file not found”#
Cause: PDF file not named correctly or in wrong location
Fix: Ensure PDFs are named {folder_name}_main.pdf and are in the correct folder
“No merged_v2.json files found”#
Cause: Merger step hasn’t been run yet or failed
Fix: Run metabeeai process-pdfs --merge-only --dir /path/to/data/papers first, or use metabeeai process-pdfs --skip-split --skip-api to run merge and deduplication steps
API processing is slow#
Cause: Vision API processes each page individually
Solution: This is normal. Processing time depends on:
Number of papers
Pages per paper
API response time
If the run is interrupted, rerun the same command; files that already have JSON output are skipped
Duplicate chunks remain after deduplication#
Cause: Chunks might have slight text differences
Fix: Check the deduplication_info in merged_v2.json for statistics
Note: Only exact text matches are considered duplicates
Out of API quota#
Cause: Too many API calls
Fix:
The script automatically skips already-processed files
Use the --start parameter to resume from a specific paper
Contact Landing AI to increase your quota
Advanced Usage#
Merge-Only Mode (Cost-Effective)#
If you’ve already run the expensive PDF splitting and Vision API processing steps, you can use --merge-only to only run the merge and deduplication steps:
# Process all papers - merge and deduplicate only
metabeeai process-pdfs --merge-only
# Process specific papers - merge and deduplicate only
metabeeai process-pdfs --merge-only --start 95UKMIEY --end CX9M8HCM
This is useful when:
You’ve already processed PDFs through the Vision API
You want to re-run merging with different filter options
You want to re-deduplicate after manual edits to JSON files
You’re testing the merge/deduplication logic without API costs
Note: Merge-only mode validates that JSON files exist (not PDFs) and automatically skips the split and API steps.
Processing All Papers Automatically#
If you don’t specify --start and --end, the pipeline will automatically detect and process all folders in your papers directory:
# Process all papers found in the directory
metabeeai process-pdfs
# Process all papers with merge-only
metabeeai process-pdfs --merge-only
The script will:
Scan the papers directory for all subfolders
Sort them alphanumerically (lexicographic order: 283C6B42, 3ZHNVADM, 4KV2ZB36, etc.)
Process from the first to the last folder found
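The ordering and range selection described above can be illustrated with ordinary string comparison. This is a sketch, not the code in process_all.py: select_folders is a hypothetical helper, and treating --end as inclusive is an assumption based on its description as the last folder to process.

```python
def select_folders(names, start=None, end=None):
    """Sort folder names lexicographically and keep those within [start, end]."""
    ordered = sorted(names)
    return [
        name for name in ordered
        if (start is None or name >= start) and (end is None or name <= end)
    ]
```

Lexicographic order compares character by character, so unpadded numeric names sort as 10, 100, 9 rather than 9, 10, 100. Zero-padded names such as 001 avoid that surprise when selecting a range.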
Filtering Chunk Types#
You can filter out specific chunk types during merging:
# Remove marginalia (page numbers, headers, footers)
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM --filter-chunk-type marginalia
# Remove multiple types
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM --filter-chunk-type marginalia figure
# When running merger separately
metabeeai process-pdfs --merge-only --dir /path/to/data/papers --filter-chunk-type marginalia
Common chunk types to filter:
marginalia - Headers, footers, page numbers
figure - Figure captions (if you only want main text)
table - Table content (if you only want prose)
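The same kind of filtering can be applied to an already merged chunk list in Python, for example when experimenting with which types to drop. This is a sketch of the idea only; drop_chunk_types is a hypothetical helper and not the merger's own code.

```python
def drop_chunk_types(chunks, excluded):
    """Return the chunks whose chunk_type is not in the excluded collection."""
    excluded = set(excluded)
    return [c for c in chunks if c.get("chunk_type") not in excluded]
```

For example, drop_chunk_types(chunks, ["marginalia", "figure"]) mirrors the effect of passing both types to --filter-chunk-type, applied to a list that is already in memory.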
Resuming Processing#
If processing is interrupted, the pipeline is resume-friendly:
# API processing automatically skips existing JSON files
metabeeai process-pdfs --skip-split --skip-merge --skip-deduplicate --dir /path/to/papers --start 95UKMIEY
# Process all with resumption from a specific folder
metabeeai process-pdfs --start 95UKMIEY
# Deduplication can be re-run on specific papers (numeric folders)
metabeeai process-pdfs --merge-only --start 50 --end 100
Dry Run Mode#
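The --dry-run option listed for batch_deduplicate.py analyzes files without modifying them. It is not among the metabeeai process-pdfs options documented above, so the following command is not a dry run; it merges and deduplicates, and overwrites merged_v2.json:
metabeeai process-pdfs --merge-only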
Performance Tips#
Parallel Processing: The Vision API processes one file at a time. For faster processing, consider running multiple instances on different paper ranges:
# Terminal 1
metabeeai process-pdfs --start 283C6B42 --end 76DQP2DC
# Terminal 2
metabeeai process-pdfs --start 8BV8BLU8 --end ZTRRIKQ3
Resume from Failures: If processing fails partway through, use --skip-split and --start to resume:
metabeeai process-pdfs --start 95UKMIEY --skip-split
Monitor Progress: Check log files created in the papers directory:
tail -f papers/processing_log_*.txt
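For the parallel-processing tip above, the --start and --end values for each terminal can be derived from the folder list rather than picked by hand. The sketch below is illustrative only: split_into_ranges is a hypothetical helper, and it assumes that both range ends are inclusive and that workers is at least 1.

```python
def split_into_ranges(names, workers):
    """Return (start, end) folder-name pairs covering the sorted names in contiguous blocks."""
    ordered = sorted(names)
    size, extra = divmod(len(ordered), workers)
    ranges, i = [], 0
    for w in range(workers):
        # The first `extra` workers take one additional folder each
        n = size + (1 if w < extra else 0)
        if n:
            ranges.append((ordered[i], ordered[i + n - 1]))
        i += n
    return ranges
```

Each returned pair maps directly onto one metabeeai process-pdfs --start ... --end ... invocation. All instances share the same API quota, so more terminals do not necessarily mean proportionally faster processing.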
Dependencies#
Core dependencies are included when installing the metabeeai package:
PyPDF2 - PDF manipulation
requests - API calls
python-dotenv - Environment variable management
termcolor - Colored console output
pathlib - Path operations (built-in)
If installing from source, dependencies can be installed via:
pip install -r requirements.txt
Next Steps#
After processing your PDFs:
Verify output files:
ls papers/95UKMIEY/pages/merged_v2.json
Check deduplication statistics in the JSON file
Proceed to the LLM pipeline:
metabeeai llm --papers 95UKMIEY CX9M8HCM
Alternative: Python Module Syntax#
Instead of using the CLI commands, you can also run the scripts directly as Python modules. This is useful if you need to integrate the functionality into other Python scripts or prefer direct module execution.
Running the Complete Pipeline#
# Process all papers (all steps)
metabeeai process-pdfs
# Process papers in a specific range (alphanumeric order)
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM
# Process papers from a starting folder to the end
metabeeai process-pdfs --start 95UKMIEY
# Merge-only mode (skip expensive PDF splitting and API processing)
metabeeai process-pdfs --merge-only
# Filter out marginalia chunks during merging
metabeeai process-pdfs --start 95UKMIEY --end CX9M8HCM --filter-chunk-type marginalia
# Split into overlapping 2-page documents
metabeeai process-pdfs --pages 2
Running Individual Steps#
# Step 1: Split PDFs
metabeeai process-pdfs --pages 2 --skip-api --skip-merge --skip-deduplicate --dir /path/to/papers
# Step 2: Process with Vision API
metabeeai process-pdfs --skip-split --skip-merge --skip-deduplicate --dir /path/to/papers --start 95UKMIEY
# Step 3: Merge JSON files
metabeeai process-pdfs --merge-only --dir /path/to/data/papers --filter-chunk-type marginalia
# Step 4: Deduplicate chunks
metabeeai process-pdfs --merge-only --dir /path/to/papers
Using Functions Programmatically#
from metabeeai.process_pdfs.split_pdf import split_pdfs
from metabeeai.process_pdfs.va_process_papers import process_papers
from metabeeai.process_pdfs.merger import process_all_papers
from metabeeai.process_pdfs.batch_deduplicate import batch_deduplicate
from metabeeai.process_pdfs.deduplicate_chunks import deduplicate_chunks, analyze_chunk_uniqueness
# Use functions programmatically in your own scripts
# ...
All command-line arguments are identical between CLI commands and Python module syntax. The only difference is the invocation method.