name: sragent
description: Query the Sequence Read Archive (SRA), retrieve scientific publications, and analyze genomics metadata using the SRAgent toolkit. Supports accession conversion (GSE→SRX→SRR), BigQuery metadata queries, manuscript downloads from multiple sources, and scRNA-seq technology identification. Use when working with SRA/GEO datasets, finding publications, or analyzing single-cell sequencing experiments.

SRAgent: Sequence Read Archive Data and Publication Retrieval

Overview

SRAgent is an agentic workflow system for working with the NCBI Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO) databases. It automates literature discovery, metadata extraction, and manuscript retrieval for genomics datasets.

Setup Instructions

1. Install SRAgent

SRAgent requires Python ≥3.11. First check whether SRAgent is already installed:

which SRAgent

If SRAgent is not installed, follow the instructions below.

Install using uv:

# Clone the repository
git clone https://github.com/ArcInstitute/SRAgent.git
cd SRAgent

# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate

# Install the package
uv pip install .

Verify installation:

SRAgent --help

2. Configure environment variables

SRAgent reads the following environment variables. Some are required and some are optional, as noted under each one:

OPENAI_API_KEY=sk-openai-...
- Needed to use OpenAI models
ANTHROPIC_API_KEY=sk-ant-...
- Needed to use Claude models
DYNACONF
- Needed to switch between Claude and OpenAI models
EMAIL=user@example.com
- Needed for using the Entrez API
NCBI_API_KEY=your-ncbi-key
- Optional for increased rate limits when using the Entrez API
CORE_API_KEY=your-core-key
- Optional for paper downloads from the CORE API
GCP_PROJECT_ID=your-project-id
- Needed for using Google BigQuery
GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
- Needed for using Google BigQuery

If a needed variable is not already set, prompt the user to provide its value and export it in the shell, for example: export MY_SECRET_VAR=my-secret-value.
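Before prompting, it can help to see which variables are actually missing. The following is a minimal sketch, not part of SRAgent: the function name `check_environment` is hypothetical, and the split into required and optional names is an assumption drawn from the notes above (at least one model key is assumed to be needed, and the BigQuery variables are treated as optional because SRAgent can run without BigQuery).

```python
import os

# Variable names come from the list above. Which ones count as required
# is an assumption based on the note next to each variable.
REQUIRED = ["EMAIL"]
MODEL_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # assumed: at least one is needed
OPTIONAL = [
    "NCBI_API_KEY",
    "CORE_API_KEY",
    "GCP_PROJECT_ID",
    "GOOGLE_APPLICATION_CREDENTIALS",
]


def check_environment(env=None):
    """Return (missing_required, missing_optional) as lists of variable names."""
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    # Either model key is enough, so report them as one combined entry.
    if not any(env.get(name) for name in MODEL_KEYS):
        missing_required.append(" or ".join(MODEL_KEYS))
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional


missing_required, missing_optional = check_environment()
for name in missing_required:
    print(f"missing (required): {name}")
for name in missing_optional:
    print(f"missing (optional): {name}")
```

Only the names reported as required need to block a run; the optional ones affect rate limits, CORE downloads, and BigQuery access.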

3. Configure Settings

SRAgent uses a settings file (settings.yml) to configure models and behavior. The default configuration works for most users, but you can customize it.

Option A: Use Default Settings

No action needed - SRAgent ships with sensible defaults.

Option B: Custom Settings File

See ./references/example-settings.yml for an example settings file that you can modify as needed.

4. Verify Setup

Test your configuration:

# Check which model is being used
python -c "from SRAgent.agents.utils import load_settings; s = load_settings(); print(s['models']['default'])"

# Test basic functionality
SRAgent entrez "Convert GSE121737 to SRX accessions"

Core Capabilities

1. Accession Conversion

Convert between different genomics database accession formats:

GEO Series: GSE* → SRA Study (SRP*)
SRA Study: SRP*/PRJNA* → SRA Experiments (SRX*)
SRA Experiment: SRX*/ERX* → SRA Runs (SRR*/ERR*)

2. Metadata Extraction

Query comprehensive metadata from SRA/GEO:

Sequencing platform (Illumina, PacBio, Oxford Nanopore)
Library preparation technology (10X Genomics, Smart-seq, etc.)
Organism, tissue, cell type
Study design and experimental details
Single-cell vs bulk RNA-seq identification

3. BigQuery Analysis

Leverage NCBI's BigQuery dataset for large-scale queries:

Batch accession conversions
Technology identification across studies
Filtering by platform, assay type, organism
Study/experiment/run relationship mapping

4. Publication Retrieval

Automatically find and download manuscripts:

Link SRA accessions to PubMed publications
Extract DOIs from PubMed records
Download full-text PDFs from multiple sources:
- Preprint servers (arXiv, bioRxiv, medRxiv)
- CORE API
- Europe PMC
- Unpaywall
Batch processing with CSV input

When to Use This Skill

Use SRAgent when the user:

Mentions SRA, GEO, or genomics accessions (GSE, SRP, SRX, SRR)
Needs to convert between accession formats
Wants metadata about sequencing experiments
Needs to find or download papers associated with datasets
References the Sequence Read Archive (SRA), European Nucleotide Archive (ENA), or Gene Expression Omnibus (GEO)

Available Commands

Command 1: `SRAgent entrez`

Purpose: Low-level NCBI Entrez database queries

Best for:

Simple accession conversions
Quick dataset summaries
Cross-database linking
When you know exactly what Entrez tool to use (esearch, efetch, elink)

Examples:

# Convert GEO to SRX
SRAgent --no-progress --no-summaries entrez "Convert GSE121737 to SRX accessions"

# Summarize a dataset
SRAgent --no-progress --no-summaries entrez "Summarize SRX4967527"

# Link to publications
SRAgent --no-progress --no-summaries entrez "Find publications for GSE196830"

Command 2: `SRAgent sragent`

Purpose: Comprehensive metadata extraction with multiple tools

Best for:

Complex metadata queries
Technology identification
When simple Entrez queries aren't enough
Determining if data is single-cell

Tools available:

Entrez agent (all databases)
BigQuery (large-scale queries)
NCBI web scraping
sra-stat (direct sequence file analysis)

Examples:

# Check sequencing technology
SRAgent --no-progress --no-summaries sragent "Which 10X Genomics technology was used for ERX11887200?"

# Comprehensive summary
SRAgent --no-progress --no-summaries sragent "Summarize SRX4967527"

# Verify data type
SRAgent --no-progress --no-summaries sragent "Is SRX4967527 single-cell RNA-seq data?"

# Get organism info
SRAgent --no-progress --no-summaries sragent "What organism was sequenced in study PRJNA498286?"

Command 3: `SRAgent papers`

Purpose: Find and download manuscripts associated with SRA accessions

Best for:

Downloading papers for datasets
Batch retrieval of publications
Enriching CSV files with DOIs and download paths

Input formats:

Single accession: SRX4967527
Study accession: SRP167700 or PRJNA498286
CSV file with accession column

Examples:

# Single experiment
SRAgent --no-progress --no-summaries papers SRX4967527

# Entire study
SRAgent --no-progress --no-summaries papers PRJNA498286

# Batch from CSV
SRAgent --no-progress --no-summaries papers accessions.csv --output-dir papers/

# Custom accession column name
SRAgent --no-progress --no-summaries papers my-data.csv --accession-column "experiment_id"

# Control concurrency
SRAgent --no-progress --no-summaries papers accessions.csv --max-concurrency 3

Output:

PDFs saved to --output-dir/<accession>/
Console summary showing:
- PubMed IDs found
- DOIs extracted
- Download success/failure status
Updated CSV (when input is CSV) with columns:
- pubmed_id
- doi
- download_path
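After a batch run, the updated CSV can be summarized to see how many accessions ended up with a publication. This is a sketch under assumptions: it relies only on the three column names listed above, assumes that a missing value appears as an empty cell, and uses a made-up sample rather than real SRAgent output. The function name `summarize_enriched_csv` is hypothetical.

```python
import csv
import io


def summarize_enriched_csv(handle):
    """Count rows that have a PubMed ID, a DOI, and a download path."""
    counts = {"rows": 0, "pubmed_id": 0, "doi": 0, "download_path": 0}
    for row in csv.DictReader(handle):
        counts["rows"] += 1
        for column in ("pubmed_id", "doi", "download_path"):
            # Assumption: an empty or absent cell means "not found".
            if (row.get(column) or "").strip():
                counts[column] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = io.StringIO(
    "accession,pubmed_id,doi,download_path\n"
    "SRX0000001,12345678,10.0000/example,papers/SRX0000001/paper.pdf\n"
    "SRX0000002,,,\n"
)
print(summarize_enriched_csv(sample))
```

For a real file, pass an open handle instead, for example `open("accessions.csv", newline="")`.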

Usage Patterns

Pattern 1: Dataset Investigation Workflow

# Step 1: Convert GEO accession to SRX
SRAgent --no-progress --no-summaries entrez "Convert GSE121737 to SRX accessions"

# Step 2: Get detailed metadata
SRAgent --no-progress --no-summaries sragent "For each SRX from GSE121737, determine: Is it single-cell? What library prep?"

# Step 3: Find associated publications
SRAgent --no-progress --no-summaries papers GSE121737 --output-dir manuscripts/

Pattern 2: Technology Verification

# Check if dataset meets specific criteria
SRAgent --no-progress --no-summaries sragent "Is SRX4967527 Illumina paired-end single-cell RNA-seq data?"

# Get specific technology details
SRAgent --no-progress --no-summaries sragent "Which 10X Genomics chemistry was used: SRX4967527?"

# Verify organism
SRAgent --no-progress --no-summaries sragent "What organism is SRX4967527?"

Pattern 3: Batch Processing

# Create CSV with accessions
cat > accessions.csv << EOF
accession
SRX4967527
SRX4967528
SRX4967529
EOF

# Download all papers
SRAgent --no-progress --no-summaries \
  papers accessions.csv \
    --output-dir papers/ \
    --max-concurrency 5

# Result: CSV enriched with DOIs and download paths

Pattern 4: Study-Level Analysis

# Get all experiments in a study
SRAgent --no-progress --no-summaries entrez "List all SRX accessions for study SRP167700"

# Or use a BioProject accession
SRAgent --no-progress --no-summaries entrez "Convert PRJNA498286 to SRX accessions"

# Then analyze the study
SRAgent --no-progress --no-summaries sragent "Summarize the library prep technologies used in PRJNA498286"

Implementation Guide for Claude

Running SRAgent Commands

When the user needs SRAgent functionality, use the bash tool:

# Example: Convert accessions
result = bash_tool(
    command="SRAgent --no-progress --no-summaries entrez 'Convert GSE121737 to SRX accessions'",
    description="Converting GEO accession to SRX format"
)

# Example: Get metadata
result = bash_tool(
    command="SRAgent --no-progress --no-summaries sragent 'Which 10X technology was used for SRX4967527?'",
    description="Determining library preparation technology"
)

# Example: Download papers
result = bash_tool(
    command="SRAgent --no-progress --no-summaries papers SRX4967527 --output-dir /home/claude/papers",
    description="Downloading manuscripts for dataset"
)
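
Where `bash_tool` is not available, the same calls can be made from plain Python. The sketch below uses only the standard library; `build_command` and `run_sragent` are hypothetical helper names, and the default timeout is an arbitrary assumption. It uses only the flags and subcommands shown in this document.

```python
import shlex
import subprocess

# Global flags used throughout this document's examples.
BASE_FLAGS = ["--no-progress", "--no-summaries"]
SUBCOMMANDS = {"entrez", "sragent", "papers"}


def build_command(subcommand, argument, extra_args=()):
    """Build the argument list for one SRAgent call."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown SRAgent subcommand: {subcommand}")
    return ["SRAgent", *BASE_FLAGS, subcommand, argument, *extra_args]


def run_sragent(subcommand, argument, extra_args=(), timeout=600):
    """Run SRAgent and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if the call exceeds the timeout (in seconds).
    """
    completed = subprocess.run(
        build_command(subcommand, argument, extra_args),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return completed.stdout


# Show the command line without running it.
argv = build_command("entrez", "Convert GSE121737 to SRX accessions")
print(shlex.join(argv))
```

Passing the prompt as a single list element avoids shell quoting problems when a prompt contains apostrophes or question marks.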

Working with CSV Files

When processing batch data:

import pandas as pd

# User provides accessions - create CSV
accessions = ["SRX4967527", "SRX4967528", "SRX4967529"]
df = pd.DataFrame({"accession": accessions})
df.to_csv("/home/claude/accessions.csv", index=False)

# Run SRAgent papers command
result = bash_tool(
    command="SRAgent --no-progress --no-summaries papers /home/claude/accessions.csv --output-dir /home/claude/papers",
    description="Batch downloading papers for multiple accessions"
)

# Read enriched CSV
enriched_df = pd.read_csv("/home/claude/accessions.csv")
# Now has: accession, pubmed_id, doi, download_path columns

Accession Format Reference

GEO (Gene Expression Omnibus)

Series: GSE + 5-7 digits (e.g., GSE121737)
Sample: GSM + 6-7 digits (e.g., GSM3457845)

SRA (Sequence Read Archive)

Study: SRP + 6 digits (e.g., SRP167700)
- Or BioProject: PRJNA + 6 digits (e.g., PRJNA498286)
Experiment: SRX + 7-8 digits (e.g., SRX4967527)
Run: SRR + 7-8 digits (e.g., SRR8124405)

ENA (European Nucleotide Archive)

Study: ERP + 6 digits or PRJEB + 6 digits
Experiment: ERX + 7-8 digits (e.g., ERX11887200)
Run: ERR + 7-8 digits
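
The formats above can be turned into a quick classifier for user-supplied accessions. This is a minimal sketch, not SRAgent's own validation: the name `classify_accession` is hypothetical, and the digit counts are copied from the lists above, so real accessions outside those ranges return `None` and should be treated as unrecognized, not as invalid.

```python
import re

# Prefixes and digit counts are taken from the reference above.
ACCESSION_PATTERNS = {
    "GEO series": r"GSE\d{5,7}",
    "GEO sample": r"GSM\d{6,7}",
    "SRA study": r"SRP\d{6}",
    "BioProject": r"PRJ(?:NA|EB)\d{6}",
    "SRA experiment": r"SRX\d{7,8}",
    "SRA run": r"SRR\d{7,8}",
    "ENA study": r"ERP\d{6}",
    "ENA experiment": r"ERX\d{7,8}",
    "ENA run": r"ERR\d{7,8}",
}


def classify_accession(accession):
    """Return the accession type, or None if no pattern matches."""
    text = accession.strip().upper()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if re.fullmatch(pattern, text):
            return kind
    return None


for example in ["GSE121737", "PRJNA498286", "SRX4967527", "ERX11887200"]:
    print(example, "->", classify_accession(example))
```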

Hierarchical Relationships

GEO Series (GSE)
    ↓
SRA Study (SRP) = BioProject (PRJNA)
    ↓
SRA Experiment (SRX) ← Links to → Publications (PubMed ID, DOI)
    ↓
SRA Run (SRR) [actual sequence files]

Common Single-Cell Technologies

SRAgent can identify these scRNA-seq technologies:

10X Genomics

Chromium Single Cell 3' (v1, v2, v3)
Chromium Single Cell 5'
Chromium Single Cell ATAC
Chromium Single Cell Multiome
Visium Spatial

Other Platforms

Smart-seq2 / Smart-seq3
Drop-seq
inDrop
Seq-Well
CEL-Seq2
MARS-seq
Quartz-Seq

Detection Strategy

SRAgent uses multiple signals:

Library prep metadata fields
Study descriptions and titles
PubMed abstracts
Sequence file characteristics (when using sra-stat)

Working Without BigQuery

If you don't have Google Cloud credentials:

# SRAgent gracefully falls back to Entrez-only queries
# BigQuery features will be skipped with a warning

# These still work without BigQuery:
SRAgent --no-progress --no-summaries entrez "Convert GSE121737 to SRX accessions"
SRAgent --no-progress --no-summaries papers SRX4967527

# This will warn but proceed:
SRAgent --no-progress --no-summaries sragent "Which 10X technology for SRX4967527?"
# (Uses Entrez + web scraping instead of BigQuery)

Performance Optimization

# For large batch operations, adjust concurrency
SRAgent --no-progress --no-summaries papers large-dataset.csv \
  --max-concurrency 10 \
  --recursion-limit 150

# For paper downloads specifically
SRAgent --no-progress --no-summaries papers accessions.csv \
  --core-api-key "$CORE_API_KEY" \
  --email "$EMAIL" \
  --max-concurrency 5

Troubleshooting

"ModuleNotFoundError: No module named 'SRAgent'"

# Ensure package is installed
cd SRAgent
uv pip install .

# Verify installation
python -c "import SRAgent; print(SRAgent.__file__)"

"Rate limit exceeded" (NCBI)

# Get NCBI API key: https://www.ncbi.nlm.nih.gov/account/settings/
export NCBI_API_KEY="your-ncbi-api-key"

# Reduce the number of concurrent requests
SRAgent papers accessions.csv --max-concurrency 3

Paper downloads fail

Check: Is DOI found?
- Some datasets may not have linked publications
- Check PubMed link manually first
Check: Multiple sources attempted?
- SRAgent tries: preprints → CORE → Europe PMC → Unpaywall
- Some papers are paywalled (no open access)
Check: Network/authentication
- CORE requires API key: export CORE_API_KEY="..."
- Some sources may be blocked by institution firewall
- Cloudflare may block automated access to some preprint servers

Resources

SRAgent Documentation

./references/metadata-fields.md
- All metadata fields that SRAgent can extract from SRA/GEO databases
./references/quick-reference.md
- Quick reference for SRAgent commands
./references/usage-examples.md
- Usage examples for SRAgent
./references/example-settings.yml
- Example settings file for SRAgent

External Resources

GitHub: https://github.com/ArcInstitute/SRAgent
Paper: bioRxiv 2025.02.27.640494 (scBaseCount manuscript)
NCBI Entrez: https://www.ncbi.nlm.nih.gov/books/NBK25500/
SRA Database: https://www.ncbi.nlm.nih.gov/sra
GEO Database: https://www.ncbi.nlm.nih.gov/geo/
