Query Database - Data Extraction and Analysis#

This submodule provides tools for querying and analyzing structured data extracted from scientific papers processed through the MetaBeeAI pipeline.

Overview#

The MetaBeeAI pipeline generates structured JSON data from scientific papers, stored in a queriable format. This submodule provides scripts to:

  1. Query the database - Extract structured information from answers.json files

  2. Perform analyses - Generate statistical summaries, trends, and network visualizations

  3. Create visualizations - Produce plots and network diagrams showing relationships

Installation#

This submodule is part of the metabeeai package. Install it via:

pip install metabeeai

Or if installing from source:

pip install -e /path/to/MetaBeeAI

Required dependencies: pandas, numpy, matplotlib, seaborn, networkx, python-dotenv


Understanding the Database Structure#

Data Storage Format#

The MetaBeeAI pipeline stores extracted information from papers in a structured JSON format. Each processed paper contains an answers.json file with question-answer pairs:

Location: {METABEEAI_DATA_DIR}/papers/{paper_id}/answers.json

Structure:

{
  "QUESTIONS": {
    "bee_species": {
      "answer": "Apis mellifera, Bombus terrestris",
      "reason": "Species mentioned in methodology section",
      "chunk_ids": ["chunk_001", "chunk_003"]
    },
    "pesticides": {
      "answer": "Imidacloprid at 10 ppb, oral exposure, 7 days",
      "reason": "Pesticide details found in methods",
      "chunk_ids": ["chunk_005", "chunk_007"]
    },
    "additional_stressors": {
      "answer": "Nosema ceranae infection, 10^5 spores per bee",
      "reason": "Additional stressor mentioned in study design",
      "chunk_ids": ["chunk_012"]
    },
    "significance": {
      "answer": "1. Imidacloprid impaired learning. 2. Combined with Nosema increased mortality.",
      "reason": "Key findings extracted from results section",
      "chunk_ids": ["chunk_020", "chunk_021"]
    }
  }
}

Database Configuration#

The database location is configured via the METABEEAI_DATA_DIR environment variable:

Set in .env file (project root):

METABEEAI_DATA_DIR=/path/to/your/data/directory

Directory structure:

{METABEEAI_DATA_DIR}/
└── papers/
    ├── 001/
    │   ├── answers.json          # LLM-generated answers
    │   ├── answers_extended.json # Reviewer answers (optional)
    │   └── pages/
    │       └── merged_v2.json   # Processed PDF chunks
    ├── 002/
    │   └── answers.json
    ├── 95UKMIEY/
    │   └── answers.json
    └── ...

Querying the Database#

All scripts in this submodule automatically query the database by:

  1. Reading METABEEAI_DATA_DIR from your .env file

  2. Scanning {METABEEAI_DATA_DIR}/papers/ for all paper folders

  3. Loading answers.json files from each paper folder

  4. Extracting structured data based on question types

  5. Saving results to query_database/output/

No manual database queries needed - The scripts handle all data access automatically.


Data Extraction Scripts#

These scripts query the database and extract structured information from specific question types:

1. investigate_bee_species.py - Extract Bee Species Data#

Purpose: Queries the database for bee species information across all papers.

What it queries: Extracts data from the bee_species question in each answers.json file.

Usage:

python -m metabeeai.query_database.investigate_bee_species

Output:

  • output/bee_species_data.json - Structured bee species data

Data structure:

[
  {
    "paper_id": "729",
    "species_name": "Apis mellifera carnica",
    "genus": "Apis",
    "species": "mellifera",
    "subspecies": "carnica",
    "common_name": "Carniolan honey bee"
  }
]

What it does:

  1. Scans all paper directories for answers.json files

  2. Extracts bee species information from the bee_species question

  3. Parses species names into taxonomic components (genus, species, subspecies)

  4. Identifies common names where available

  5. Handles multiple species per paper

  6. Validates and standardizes species names


2. investigate_pesticides.py - Extract Pesticide Data#

Purpose: Queries the database for pesticide exposure information.

What it queries: Extracts data from the pesticides question in each answers.json file.

Usage:

python -m metabeeai.query_database.investigate_pesticides

Output:

  • output/pesticides_data.json - Structured pesticide data

Data structure:

[
  {
    "paper_id": "729",
    "pesticide_name": "imidacloprid",
    "dose": "10 ppb",
    "exposure_method": "oral",
    "duration": "7 days"
  }
]

What it does:

  1. Extracts pesticide information from the pesticides question

  2. Parses chemical names, doses, exposure methods, and durations

  3. Handles multiple pesticides per paper

  4. Standardizes pesticide names (converts to lowercase)

  5. Extracts quantitative dose information when available

  6. Categorizes exposure methods (oral, contact, topical, field, etc.)


3. investigate_additional_stressors.py - Extract Stressor Data#

Purpose: Queries the database for non-pesticide stressors tested in studies.

What it queries: Extracts data from the additional_stressors question in each answers.json file.

Usage:

python -m metabeeai.query_database.investigate_additional_stressors

Output:

  • output/additional_stressors_data.json - Structured stressor data

Data structure:

[
  {
    "paper_id": "729",
    "stressor_type": "Pathogen",
    "stressor_name": "Nosema ceranae",
    "details": "10^5 spores per bee, oral inoculation, 7 days"
  }
]

What it does:

  1. Extracts additional stressor information from the additional_stressors question

  2. Categorizes stressors by type (Temperature, Pathogen, Parasite, Nutritional, Chemical, Environmental)

  3. Extracts stressor names and application details

  4. Handles multiple stressors per paper

  5. Excludes pesticides (captured separately by investigate_pesticides.py)


4. investigate_significance.py - Extract Findings Data#

Purpose: Queries the database for key findings and significance statements.

What it queries: Extracts data from the significance question in each answers.json file.

Usage:

python -m metabeeai.query_database.investigate_significance

Output:

  • output/significance_data.json - Structured findings data

Data structure:

[
  {
    "paper_id": "729",
    "finding": "Imidacloprid at 10 ppb impaired learning and memory",
    "category": "Cognitive Effects"
  }
]

What it does:

  1. Extracts key findings from the significance question

  2. Parses individual findings from multi-point answers

  3. Categorizes findings by impact type

  4. Preserves both original and processed versions of data

  5. Handles quantitative results and effect descriptions


Analysis Scripts#

These scripts analyze the extracted data and generate visualizations:

5. trend_analysis.py - Trend and Co-occurrence Analysis#

Purpose: Analyzes trends and co-occurrence patterns between bee species and pesticides.

What it analyzes: Uses extracted data from bee_species_data.json and pesticides_data.json.

Usage:

python -m metabeeai.query_database.trend_analysis

Input:

  • output/bee_species_data.json - From investigate_bee_species.py

  • output/pesticides_data.json - From investigate_pesticides.py

Output:

  • output/trend_analysis_plots/top_bee_pesticide_combinations.png - Most common bee-pesticide pairs

  • output/trend_analysis_plots/most_studied_bee_species.png - Most frequently studied species

  • output/trend_analysis_plots/most_tested_nicotinic_pesticides.png - Most tested neonicotinoids

  • output/trend_analysis_report.txt - Statistical summary report

What it does:

  1. Filters data:

    • Focuses on nicotinic cholinergic pesticides (neonicotinoids, sulfoximines, butenolides, spinosyns)

    • Standardizes bee species names (uses genus + species when available)

    • Excludes papers without proper species identification

  2. Calculates statistics:

    • Counts studies per bee species

    • Counts studies per pesticide

    • Identifies bee-pesticide co-occurrences

    • Ranks top combinations

  3. Generates visualizations:

    • Bar charts of most studied species and pesticides

    • Co-occurrence frequency plots

    • Top combinations rankings

  4. Creates summary report:

    • Total papers analyzed

    • Unique species and pesticides

    • Study distribution metrics

    • Top 20 bee-pesticide combinations

Nicotinic pesticides included:

  • Neonicotinoids: imidacloprid, thiamethoxam, clothianidin, acetamiprid, thiacloprid, dinotefuran, nitenpyram

  • Sulfoximines: sulfoxaflor

  • Butenolides: flupyradifurone

  • Spinosyns: spinosad, spinetoram


6. network_analysis.py - Network Visualization#

Purpose: Creates network visualizations showing relationships between bees, pesticides, and stressors.

What it analyzes: Uses extracted data from all three extraction scripts.

Usage:

python -m metabeeai.query_database.network_analysis

Input:

  • output/bee_species_data.json - From investigate_bee_species.py

  • output/pesticides_data.json - From investigate_pesticides.py

  • output/additional_stressors_data.json - From investigate_additional_stressors.py

Output:

  • output/network_plots/tripartite_network.png - 3-way network: bees, pesticides, stressors

  • output/network_statistics.txt - Network connectivity statistics

  • output/pesticide_stressor_summary.txt - Co-occurrence analysis

What it does:

  1. Creates tripartite network:

    • Three node types: bee species (green), pesticides (red), stressors (blue)

    • Edge thickness proportional to number of studies

    • Shows complex relationships between all three factors

    • Uses force-directed layout for clarity

  2. Calculates network statistics:

    • Node counts (species, pesticides, stressors)

    • Edge counts (relationships)

    • Degree centrality (most connected nodes)

    • Network density metrics

    • Papers with single vs. multiple factors

  3. Analyzes pesticide-stressor interactions:

    • Papers testing pesticides only

    • Papers testing stressors only

    • Papers testing both (interaction studies)

    • Top pesticide-stressor combinations

    • Identifies nicotinic vs. other pesticides

Network interpretation:

  • Node size: Number of connections

  • Edge thickness: Number of studies with that combination

  • Color coding: Green (bees), Red (pesticides), Blue (stressors)


Complete Workflow#

Step 1: Configure Database Location#

Ensure your .env file contains:

METABEEAI_DATA_DIR=/path/to/your/data/directory

Step 2: Extract Data (Query Database)#

Run all extraction scripts to query the database and extract structured data:

# Extract bee species data
python -m metabeeai.query_database.investigate_bee_species

# Extract pesticide data
python -m metabeeai.query_database.investigate_pesticides

# Extract additional stressors data
python -m metabeeai.query_database.investigate_additional_stressors

# Extract significance/findings data
python -m metabeeai.query_database.investigate_significance

Step 3: Run Analyses#

Analyze the extracted data and create visualizations:

# Analyze trends and co-occurrences
metabeeai trend-analysis

# Create network visualizations
metabeeai network-analysis

Step 4: Review Outputs#

Check the generated files:

ls -lh output/
ls -lh output/trend_analysis_plots/
ls -lh output/network_plots/

Output File Structure#

query_database/output/
├── bee_species_data.json                          # Bee species extraction results
├── pesticides_data.json                           # Pesticide extraction results
├── additional_stressors_data.json                 # Stressor extraction results
├── significance_data.json                         # Findings extraction results
├── trend_analysis_plots/
│   ├── top_bee_pesticide_combinations.png         # Most common pairs
│   ├── most_studied_bee_species.png               # Species frequency
│   └── most_tested_nicotinic_pesticides.png       # Pesticide frequency
├── trend_analysis_report.txt                      # Statistical summary
├── network_plots/
│   └── tripartite_network.png                     # 3-way network visualization
├── network_statistics.txt                         # Network metrics
└── pesticide_stressor_summary.txt                 # Interaction analysis

Understanding the Outputs#

Trend Analysis Report#

The trend_analysis_report.txt provides:

  • Dataset overview: Total papers, unique species, unique pesticides

  • Top bee species: Most frequently studied species (e.g., Apis mellifera)

  • Top pesticides: Most tested nicotinic pesticides (e.g., imidacloprid)

  • Top combinations: Most common bee-pesticide pairs tested together

  • Study distribution: How studies are distributed across species and pesticides

Network Statistics#

The network_statistics.txt provides:

  • Network size: Number of nodes and edges

  • Connectivity: Most connected species, pesticides, stressors

  • Degree centrality: Which factors appear in most studies

  • Network density: How interconnected the research landscape is

Pesticide-Stressor Summary#

The pesticide_stressor_summary.txt provides:

  • Interaction patterns: Papers testing pesticides alone, stressors alone, or both

  • Co-occurrence: Most common pesticide-stressor combinations

  • Pesticide classification: Which pesticides are nicotinic cholinergic

  • Research gaps: Understudied combinations


Prerequisites#

  1. Data Requirements:

    • Must have run the LLM pipeline first (metabeeai llm)

    • Each paper must have an answers.json file

    • Database location configured via METABEEAI_DATA_DIR in .env

  2. Expected data structure:

{METABEEAI_DATA_DIR}/
└── papers/
    ├── 001/
    │   └── answers.json
    ├── 002/
    │   └── answers.json
    ├── 95UKMIEY/
    │   └── answers.json
    └── ...
  1. Environment Configuration:

    • Create .env file in project root (if not exists)

    • Add: METABEEAI_DATA_DIR=/path/to/your/data/directory


Data Quality Notes#

Bee Species Standardization#

The scripts automatically standardize bee species names:

  • Priority 1: Use genus + species when both available

  • Priority 2: Use species_name field if genus/species missing

  • Filtered out: “Species not specified” entries

  • Example: “Apis mellifera carnica” → genus: “Apis”, species: “mellifera”, subspecies: “carnica”

Pesticide Name Cleaning#

Pesticide names are automatically cleaned:

  • Converted to lowercase for consistency

  • Chemical names preferred over trade names

  • Standardized spelling (e.g., “imidacloprid” not “Imidacloprid”)

Stressor Categorization#

Stressors are automatically categorized into types:

  • Pathogen: Bacteria, viruses, fungi

  • Parasite: Varroa, Nosema, tracheal mites

  • Temperature: Heat stress, cold stress

  • Nutritional: Diet restriction, pollen quality

  • Chemical: Non-pesticide chemicals

  • Environmental: Other environmental factors


Advanced Usage#

Custom Database Queries#

You can write custom scripts to query the database:

import json
import os
from pathlib import Path

# Get papers directory from config
from metabeeai.config import get_papers_dir

papers_dir = get_papers_dir()

# Query all answers.json files
for paper_folder in os.listdir(papers_dir):
    answers_path = Path(papers_dir) / paper_folder / "answers.json"

    if answers_path.exists():
        with open(answers_path, 'r') as f:
            answers = json.load(f)

        # Extract specific question
        bee_species = answers.get("QUESTIONS", {}).get("bee_species", {})
        print(f"Paper {paper_folder}: {bee_species.get('answer', 'N/A')}")

Filtering by Date Range#

If your data includes date information, you can filter papers:

# Example: Add to any investigate_*.py script
import datetime

# Filter papers from 2020 onwards
if paper_metadata.get('year') and int(paper_metadata['year']) >= 2020:
    # Process this paper
    pass

Custom Analysis#

You can load the extracted JSON files for custom analyses:

import json
import pandas as pd

# Load extracted data
with open('output/bee_species_data.json', 'r') as f:
    bee_data = json.load(f)

bee_df = pd.DataFrame(bee_data)

# Custom analysis
genus_counts = bee_df['genus'].value_counts()
print(f"Most studied genus: {genus_counts.index[0]}")

Troubleshooting#

“METABEEAI_DATA_DIR not set”#

  • Fix: Add to .env file: METABEEAI_DATA_DIR=/path/to/data

“No answers.json files found”#

  • Cause: LLM pipeline hasn’t been run yet

  • Fix: Run metabeeai llm first to generate answers.json files

Empty output files#

  • Cause: No data found for that question type

  • Check: Review sample answers.json files to ensure questions were answered

Network visualization issues#

  • Cause: Missing data files from extraction phase

  • Fix: Run all investigate_*.py scripts before network_analysis.py

“Species not specified” appears frequently#

  • Cause: LLM couldn’t extract species names from papers

  • Expected: These are automatically filtered out in trend and network analyses


Dependencies#

Core dependencies are included when installing the metabeeai package:

  • pandas - Data manipulation

  • numpy - Numerical operations

  • matplotlib - Plotting

  • seaborn - Statistical visualization

  • networkx - Network analysis

  • python-dotenv - Environment variable management

If installing from source, dependencies can be installed via:

pip install -r requirements.txt

Alternative: Python Module Syntax#

All scripts can be run directly as Python modules. The examples above use this syntax. For programmatic access:

# Import functions for custom scripts
from metabeeai.query_database.investigate_bee_species import get_papers_dir, extract_bee_species
from metabeeai.query_database.investigate_pesticides import extract_pesticides_from_file
from metabeeai.query_database.trend_analysis import analyze_co_occurrence

# Use functions programmatically
# ...