| name | curate-bulk-rnaseq |
| description | Process bulk RNA-seq datasets for VEuPathDB resources |
Bulk RNA-seq Dataset Curation
This skill guides processing of bulk RNA-seq datasets for VEuPathDB resources.
Prerequisites Check
This workflow requires the following repositories in veupathdb-repos/:
- ApiCommonPresenters
- EbrcModelCommon
First, run the repository status check to verify that the repositories are present (note: this script is located in the skill directory):
bash scripts/check-repos.sh ApiCommonPresenters EbrcModelCommon
If repositories are missing, the script will provide clone instructions.
Branch Confirmation: After verifying repositories exist, check their current branches and status using git -C <path>, then confirm with the user before proceeding.
Example:
git -C veupathdb-repos/ApiCommonPresenters branch --show-current
git -C veupathdb-repos/ApiCommonPresenters status -sb
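The branch check above can be collected for both repositories in one pass. The following is a minimal Node.js sketch, not part of the skill's scripts: the helper names `gitIn` and `summarizeRepos` are hypothetical, and the command runner is injectable so the logic can be tried without a real checkout.

```javascript
const { execFileSync } = require("node:child_process");
const path = require("node:path");

// Repositories named in the prerequisites list.
const REPOS = ["ApiCommonPresenters", "EbrcModelCommon"];

// Run a git subcommand inside a repository via `git -C <path>`, so the
// working directory never changes. `run` defaults to execFileSync.
function gitIn(repoPath, args, run = execFileSync) {
  return run("git", ["-C", repoPath, ...args], { encoding: "utf8" }).trim();
}

// Collect the current branch and short status for each required repository.
// (hypothetical helper, shown for illustration only)
function summarizeRepos(baseDir = "veupathdb-repos", run = execFileSync) {
  return REPOS.map((name) => {
    const repoPath = path.join(baseDir, name);
    return {
      name,
      branch: gitIn(repoPath, ["branch", "--show-current"], run),
      status: gitIn(repoPath, ["status", "-sb"], run),
    };
  });
}
```

Calling `summarizeRepos()` from the curation workspace directory returns one entry per repository, which can then be shown to the user for confirmation.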
Working Directory (Curation Workspace Directory)
IMPORTANT: All commands in this workflow must be run from your curation workspace directory (the directory that contains veupathdb-repos/ as a subdirectory).
For Claude Code:
- DO NOT use cd commands to change into subdirectories
- Use git -C <path> for git operations in subdirectories
- Use absolute paths or relative paths from the curation workspace directory
The workflow creates:
- tmp/ - Intermediate files (gitignored)
- delivery/bulk-rnaseq/<BIOPROJECT>/ - Pipeline outputs (gitignored)
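A quick way to confirm that the current directory is the curation workspace is to test for the veupathdb-repos/ subdirectory before running anything else. This is a sketch under that single assumption; `isCurationWorkspace` and `fromWorkspace` are hypothetical helper names, not functions provided by the skill.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Return true when `dir` looks like a curation workspace, i.e. it contains
// veupathdb-repos/ as a direct subdirectory.
function isCurationWorkspace(dir = process.cwd()) {
  const repos = path.join(dir, "veupathdb-repos");
  return fs.existsSync(repos) && fs.statSync(repos).isDirectory();
}

// Resolve a workspace-relative path to an absolute one, so no `cd` is needed.
function fromWorkspace(workspace, ...segments) {
  return path.resolve(workspace, ...segments);
}
```

For example, `fromWorkspace(process.cwd(), "tmp", "PRJNA1018599_sra_metadata.json")` yields an absolute path that stays valid regardless of where a later command is launched.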
Required Information
Gather the following before starting:
- VEuPathDB project - Valid projects listed in resources/valid-projects.json
- BioProject accession (e.g., PRJNA1018599)
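Both inputs can be sanity-checked before Step 1. The sketch below makes two assumptions that are not confirmed by this document: that accessions use the PRJNA, PRJEB, or PRJDB prefixes followed by digits, and that resources/valid-projects.json can be reduced to a flat array of project names. The function names are hypothetical.

```javascript
// Assumption: BioProject accessions start with PRJNA (NCBI), PRJEB (ENA) or
// PRJDB (DDBJ) followed by digits; extend the pattern if other prefixes occur.
const BIOPROJECT_PATTERN = /^PRJ(NA|EB|DB)\d+$/;

function isBioProjectAccession(value) {
  return typeof value === "string" && BIOPROJECT_PATTERN.test(value);
}

// Assumption: `validProjects` is a flat array of project names, e.g. derived
// from resources/valid-projects.json; the real file may be shaped differently.
function checkInputs(project, bioproject, validProjects) {
  const errors = [];
  if (!validProjects.includes(project)) {
    errors.push(`Unknown project: ${project}`);
  }
  if (!isBioProjectAccession(bioproject)) {
    errors.push(`Malformed BioProject accession: ${bioproject}`);
  }
  return errors; // an empty array means both inputs look usable
}
```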
Optional: Journal Article PDF
If a journal article is available for this dataset, providing it enhances the curation workflow:
- Better descriptions: Abstract and introduction provide richer experiment context
- Strandedness detection: Methods section reveals library prep protocol
- Contact identification: Author affiliations clarify roles (experimentalist, analyst, submitter)
- Sample annotation context: Methods help decode unclear sample metadata
To include a PDF:
- Download the article PDF
- Copy it to tmp/<BIOPROJECT>_article.pdf (e.g., tmp/PRJNA1018599_article.pdf)
- Tell Claude the PDF is available when starting Step 1
The PDF will be processed by a subagent once in Step 1 and extracted data saved to tmp/<BIOPROJECT>_pdf_extracted.json for use throughout the workflow.
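Because the PDF is meant to be processed only once, it can help to check which of the two files already exists before deciding whether extraction is still needed. The sketch below uses the file naming described above; the three state labels and the helper names are assumptions made for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// File names follow the tmp/<BIOPROJECT>_... convention described above.
function pdfPaths(bioproject, workspace = ".") {
  return {
    article: path.join(workspace, "tmp", `${bioproject}_article.pdf`),
    extracted: path.join(workspace, "tmp", `${bioproject}_pdf_extracted.json`),
  };
}

// "extracted": the JSON is already saved and can be reused.
// "pending":   a PDF was supplied but has not been extracted yet.
// "absent":    no article was provided.
// (hypothetical labels, not defined by the skill)
function pdfState(bioproject, workspace = ".") {
  const { article, extracted } = pdfPaths(bioproject, workspace);
  if (fs.existsSync(extracted)) return "extracted";
  if (fs.existsSync(article)) return "pending";
  return "absent";
}
```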
Workflow Overview
Step 1: Fetch Metadata (and Extract PDF)
Fetch run-level metadata from ENA and sample attributes from NCBI BioSample. If a journal article PDF is available, extract key information for use in later steps.
Commands:
node scripts/fetch-sra-metadata.js <BIOPROJECT>
Output: tmp/<BIOPROJECT>_sra_metadata.json
Optional - Fetch MINiML for GEO-linked datasets:
node scripts/fetch-miniml.js <BIOPROJECT>
Output: tmp/<GSE>_family.xml (if GEO-linked)
Optional - Extract PDF data:
If tmp/<BIOPROJECT>_article.pdf is present, a subagent will extract it (do not read it yourself).
Output (on success): tmp/<BIOPROJECT>_pdf_extracted.json
Detailed instructions: Step 1 - Fetch Metadata
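For orientation, run-level metadata for a BioProject can be requested from ENA over HTTP. The sketch below uses the ENA Portal API file report endpoint as an illustration; it is an assumption that scripts/fetch-sra-metadata.js works this way, and the field selection and function names are hypothetical.

```javascript
// Assumption: the ENA Portal API "filereport" endpoint is used here for
// illustration; scripts/fetch-sra-metadata.js may query ENA differently.
const ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport";

// Illustrative field selection, not the script's actual list.
const RUN_FIELDS = [
  "run_accession",
  "sample_accession",
  "library_layout",
  "library_strategy",
  "fastq_ftp",
];

// Build the request URL for all read runs under a BioProject.
function enaRunReportUrl(bioproject, fields = RUN_FIELDS) {
  const url = new URL(ENA_FILEREPORT);
  url.searchParams.set("accession", bioproject);
  url.searchParams.set("result", "read_run");
  url.searchParams.set("fields", fields.join(","));
  url.searchParams.set("format", "json");
  return url.toString();
}

// Fetch the report (needs network access and a Node version with global fetch).
async function fetchRunReport(bioproject) {
  const response = await fetch(enaRunReportUrl(bioproject));
  if (!response.ok) throw new Error(`ENA request failed: ${response.status}`);
  return response.json();
}
```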
Step 2: Analyze Samples
Claude analyzes the fetched metadata to:
- Identify experimental factors (attributes that vary between samples)
- Generate sample annotations with meaningful labels
- Group technical replicates
- Determine strand specificity
Output: tmp/<BIOPROJECT>_sample_annotations.json
Detailed instructions: Step 2 - Analyze Samples
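Two of the analysis tasks in Step 2 are mechanical enough to sketch: finding attributes that vary between samples, and grouping runs that belong to the same sample. The input shapes below (`attributes`, `run`, `sample`) are assumptions, since the layout of tmp/<BIOPROJECT>_sra_metadata.json is not described here, and treating runs that share a sample as technical replicates is only a first approximation that still needs curator judgment.

```javascript
// Assumption: each sample is { id, attributes: { key: value, ... } }.
// An experimental factor is any attribute with more than one distinct value
// across samples; a missing attribute counts as its own value (undefined).
function findExperimentalFactors(samples) {
  const keys = new Set(samples.flatMap((s) => Object.keys(s.attributes)));
  return [...keys].filter((key) => {
    const distinct = new Set(samples.map((s) => s.attributes[key]));
    return distinct.size > 1;
  });
}

// Assumption: each run is { run, sample }. Runs sharing a sample accession
// are grouped together as candidate technical replicates.
function groupRunsBySample(runs) {
  const groups = new Map();
  for (const { run, sample } of runs) {
    if (!groups.has(sample)) groups.set(sample, []);
    groups.get(sample).push(run);
  }
  return groups;
}
```

Attributes that are constant across all samples (for example a shared strain) drop out of the factor list, which leaves only the values worth encoding in sample labels.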
Step 3: Curate Contacts
Identify and curate contact entries from GEO contributors or BioProject submitters.
Actions:
- Search existing contacts in veupathdb-repos/EbrcModelCommon/Model/lib/xml/datasetPresenters/contacts/allContacts.xml
- Create new contact entries if needed
- Present choices to curator for review
Detailed instructions: Step 3 - Curate Contacts
Step 4: Generate Presenter XML
Generate the datasetPresenter XML, review/edit it, then insert into the presenter file.
Command:
node scripts/generate-presenter-xml.js <BIOPROJECT> <PROJECT> <PRIMARY_CONTACT_ID> [ADDITIONAL_CONTACT_IDS...]
Output: tmp/<BIOPROJECT>_presenter.xml
Workflow:
- Generate initial XML with script (saves to tmp/)
- Review and edit the temp file to fill in TODOs (shortDisplayName, pubmedIds, etc.)
- Insert finalized XML into presenter file
Target file: veupathdb-repos/ApiCommonPresenters/Model/lib/xml/datasetPresenters/<PROJECT>.xml
Detailed instructions: Step 4 - Generate Presenter
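Before inserting the XML into the presenter file, it is worth confirming that no placeholders remain. The sketch below assumes the generator marks unfinished fields with the literal text TODO; `findTodos` is a hypothetical helper, not one of the skill's scripts.

```javascript
// Return the 1-based line number and trimmed text of every line that still
// contains the literal marker "TODO" (assumed placeholder convention).
function findTodos(xmlText) {
  return xmlText
    .split(/\r?\n/)
    .map((text, index) => ({ line: index + 1, text: text.trim() }))
    .filter((entry) => entry.text.includes("TODO"));
}
```

Reading tmp/<BIOPROJECT>_presenter.xml with `fs.readFileSync(file, "utf8")` and passing the result to `findTodos` gives a checklist of lines to edit; an empty array suggests the file is ready for insertion.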
Step 5: Generate Delivery Outputs
Generate pipeline configuration files for the data processing team.
Commands:
bash scripts/check-delivery-dirs.sh bulk-rnaseq <BIOPROJECT>
node scripts/generate-analysis-config.js <BIOPROJECT> [--strand-specific]
node scripts/generate-samplesheet.js <BIOPROJECT> [strandedness]
The strandedness argument accepts: stranded, unstranded, or auto. If omitted, the script checks _pdf_extracted.json and _sample_annotations.json before falling back to auto.
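The fallback order described above can be pictured as a small resolver. This is a sketch, not the code in scripts/generate-samplesheet.js: the top-level `strandedness` key in both JSON files and the precedence of the PDF data over the sample annotations are assumptions.

```javascript
const ALLOWED = ["stranded", "unstranded", "auto"];

// Assumption: both parsed JSON objects expose a top-level `strandedness`
// field; the real key names are not given in this document.
function resolveStrandedness(cliValue, pdfExtracted, sampleAnnotations) {
  // 1. An explicit command-line argument always wins.
  if (cliValue !== undefined) {
    if (!ALLOWED.includes(cliValue)) {
      throw new Error(`strandedness must be one of: ${ALLOWED.join(", ")}`);
    }
    return cliValue;
  }
  // 2. Otherwise consult the PDF extraction, then the sample annotations.
  for (const source of [pdfExtracted, sampleAnnotations]) {
    const value = source && source.strandedness;
    if (value === "stranded" || value === "unstranded") return value;
  }
  // 3. Nothing conclusive was found, so fall back to auto.
  return "auto";
}
```

Passing `null` for a file that does not exist keeps the resolver usable when no article PDF was provided.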
Outputs in delivery/bulk-rnaseq/<BIOPROJECT>/:
- analysisConfig.xml - Pipeline configuration
- samplesheet.csv - Also for the processing pipeline
Detailed instructions: Step 5 - Generate Outputs
Next Steps
After completing this workflow:
- Review generated XML for TODO fields that require curator input
- Commit changes to dataset branch (curator handles git operations)
- Create pull request for review (curator handles PR creation)
- Deliver output files from delivery/bulk-rnaseq/<BIOPROJECT>/ to the data processing team
Resources
- Step 1 - Fetch Metadata
- Step 2 - Analyze Samples
- Step 3 - Curate Contacts
- Step 4 - Generate Presenter
- Step 5 - Generate Outputs
- PDF Extraction
- Editing Large XML Files
- Valid VEuPathDB Projects
Scripts
- scripts/fetch-sra-metadata.js - Fetches SRA run metadata from ENA + BioSample attributes from NCBI
- scripts/fetch-miniml.js - Fetches MINiML XML for GEO-linked datasets
- scripts/generate-presenter-xml.js - Generates RNA-seq datasetPresenter XML
- scripts/generate-analysis-config.js - Generates analysisConfig.xml for pipeline
- scripts/generate-samplesheet.js - Generates/delivers samplesheet.csv and sampleAnnotations.json
- scripts/check-repos.sh - Validates veupathdb-repos/ repository setup (synced from shared/)
- scripts/check-delivery-dirs.sh - Creates delivery directory structure (synced from shared/)