| name | bio-seq |
| description | Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration. |
| user_invocable | true |
Sequence I/O
Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).
When to Use This Skill
This skill should be used when:
- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)
When NOT to Use This Skill
- NGS alignment files (SAM/BAM/VCF) → Use pysam
- BLAST searches → Use gget (quick) or blat-integration (large-scale)
- Multiple sequence alignment → Use msa-advanced
- Phylogenetic analysis → Use etetoolkit
- NCBI database queries → Use pubmed-database or gene-database
Tool Selection Guide
| Task | Tool | Reference |
|---|---|---|
| Parse FASTA/GenBank/FASTQ | Bio.SeqIO | biopython_seqio.md |
| Convert file formats | Bio.SeqIO.convert() | biopython_seqio.md |
| Sequence operations | Bio.Seq | biopython_seqio.md |
| Large FASTA random access | pysam.FastaFile + faidx | faidx.md |
| GC%, Tm, molecular weight | Bio.SeqUtils | utilities.md |
Quick Start
Installation
uv pip install biopython pysam
Read FASTA
from Bio import SeqIO
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
Convert GenBank to FASTA
from Bio import SeqIO
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
Random Access with faidx
import pysam
# Create index (once)
pysam.faidx("reference.fasta")
# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based, half-open: 1,000 bases (1-based positions 1001-2000)
fasta.close()
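To show why an index makes random access cheap, here is a minimal pure-Python sketch of the idea behind a .fai index. It is not how pysam or htslib is implemented. The helper names build_fai and fetch_slice are hypothetical, and the sketch assumes every sequence line in a record (except the last) has the same width, which is the constraint faidx also places on its input.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Map each record name to (length, offset, linebases, linewidth).

    offset    = byte position of the record's first base
    linebases = bases per full sequence line
    linewidth = bytes per full sequence line, including the newline
    """
    index = {}
    with open(fasta_path, "rb") as handle:
        name = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # The first base starts right after the header line.
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:  # the first sequence line fixes the line geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def fetch_slice(fasta_path, index, name, start, end):
    """Return bases [start, end) using 0-based, half-open coordinates."""
    length, offset, linebases, linewidth = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    # Translate base positions to byte positions, accounting for newline bytes.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases
    with open(fasta_path, "rb") as handle:
        handle.seek(first)  # jump straight to the window; no scan of the file
        chunk = handle.read(last - first + 1)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


# Illustrative data only: two small records with 10-base lines.
FASTA_BYTES = b">chr1 demo\nACGTACGTAC\nGTTTGGCCAA\nTT\n>chr2\nGGGG\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    with open(path, "wb") as out:
        out.write(FASTA_BYTES)
    index = build_fai(path)
    window = fetch_slice(path, index, "chr1", 8, 12)  # spans a line break

print(index["chr1"], window)
```

Because the byte offset is computed arithmetically, the cost of a fetch depends on the window size rather than on the size of the reference file.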
Sequence Operations
from Bio.Seq import Seq
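seq = Seq("ATGCGATCGATC")  # length is a multiple of 3, so translate() sees no partial codon
print(seq.complement())          # TACGCTAGCTAG
print(seq.reverse_complement())  # GATCGATCGCAT
print(seq.translate())           # MRSI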
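For readers who want to see what these operations compute, the following is a dependency-free sketch, not Bio.Seq itself. It assumes unambiguous DNA (A, C, G, T only) and the standard genetic code, and it silently drops a trailing partial codon, which is a simplification chosen for this example. The function names are hypothetical.

```python
# Complement table for unambiguous DNA; IUPAC ambiguity codes are left out on purpose.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code (NCBI table 1), with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_CODON_TABLE = {
    b1 + b2 + b3: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


def complement(dna):
    """Swap each base for its Watson-Crick partner."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(_CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "ATGCGATCGATCG"
print(complement(dna))          # TACGCTAGCTAGC
print(reverse_complement(dna))  # CGATCGATCGCAT
print(translate(dna))           # MRSI
```

For real work, Bio.Seq is preferable because it also handles ambiguity codes, RNA, and alternative codon tables.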
Reference Documentation
Consult the appropriate reference file for detailed documentation:
references/biopython_seqio.md
- Bio.Seq object and sequence operations
- Bio.SeqIO for file parsing and writing
- SeqRecord object and annotations
- Supported file formats
- Format conversion patterns
references/faidx.md
- Creating FASTA index with pysam.faidx()
- pysam.FastaFile for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)
references/utilities.md
- GC content calculation (gc_fraction)
- Molecular weight (molecular_weight)
- Melting temperature (MeltingTemp)
- Codon usage analysis
- Restriction enzyme sites
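As a rough illustration of two of the statistics listed above, here is a pure-Python sketch. It is not the Bio.SeqUtils implementation: gc_fraction_simple counts only G and C and ignores ambiguity codes, and wallace_tm applies the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a coarse estimate for short oligonucleotides. Both function names are hypothetical.

```python
def gc_fraction_simple(seq):
    """Fraction of bases that are G or C; returns 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Wallace rule estimate in degrees Celsius: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


oligo = "ATGCGATCGATCG"  # 6 A/T bases and 7 G/C bases
print(f"GC fraction: {gc_fraction_simple(oligo):.3f}")
print(f"Wallace Tm:  {wallace_tm(oligo)} C")
```

Nearest-neighbor models with salt correction are generally more accurate than the Wallace rule; the reference file covers the library routines for those.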
references/formats.md
- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation
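To make the FASTQ quality line concrete, the sketch below decodes it by hand. It assumes Phred+33 (Sanger / Illumina 1.8+) encoding and strict four-line records with no line wrapping; older Phred+64 files would need a different offset. The helper names are hypothetical.

```python
import io


def phred33_to_scores(quality):
    """Decode a Phred+33 quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality]


def error_probability(score):
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


def read_fastq(handle):
    """Yield (read_id, sequence, scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        yield header[1:], sequence, phred33_to_scores(quality)


# Illustrative record only.
FASTQ_TEXT = "@read1\nACGT\n+\nII5!\n"
fastq_records = list(read_fastq(io.StringIO(FASTQ_TEXT)))

for read_id, sequence, scores in fastq_records:
    print(read_id, sequence, scores)
```

Here the character I decodes to Q40 (error probability 0.0001), 5 decodes to Q20 (0.01), and ! decodes to Q0.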
Coordinate Systems
Biopython: Uses Python-style 0-based, half-open intervals for slicing.
pysam.FastaFile.fetch():
- Numeric arguments: 0-based, half-open (fetch("chr1", 999, 2000) = 0-based positions 999-1999, i.e. 1-based positions 1000-2000)
- Region strings: 1-based, inclusive (fetch(region="chr1:1000-2000") = 1-based positions 1000-2000)
Both calls return the same 1,001 bases.
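The conversion between the two conventions is small enough to write down. This sketch uses only the standard library; the function names are hypothetical, and the region parser accepts only the simple contig:start-end form (with optional thousands separators).

```python
import re


def region_to_half_open(region):
    """Convert a 1-based inclusive region string to (contig, start, end), 0-based half-open."""
    match = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"unsupported region string: {region!r}")
    contig = match.group(1)
    start_1based = int(match.group(2).replace(",", ""))
    end_1based = int(match.group(3).replace(",", ""))
    # Only the start shifts: an inclusive 1-based end equals an exclusive 0-based end.
    return contig, start_1based - 1, end_1based


def half_open_to_region(contig, start, end):
    """Convert 0-based half-open coordinates back to a 1-based inclusive region string."""
    return f"{contig}:{start + 1}-{end}"


contig, start, end = region_to_half_open("chr1:1000-2000")
print(contig, start, end, "length:", end - start)
print(half_open_to_region("chr1", 999, 2000))
```

A useful property of half-open intervals is that the length is simply end minus start, with no off-by-one correction.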
Common Pitfalls
- Coordinate confusion: Biopython slicing and numeric fetch() arguments are 0-based, half-open; region strings are 1-based, inclusive
- Missing faidx index: Random access requires a .fai index file alongside the FASTA
- Format mismatch: Verify that the file format matches the format string passed to SeqIO.parse()
- Iterator exhaustion: SeqIO.parse() returns an iterator; convert it to a list if multiple passes are needed
- Large files: Use iterators, not list(), for memory efficiency
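The last two pitfalls pull in opposite directions, so a small demonstration may help. The generator below is a standard-library stand-in for a lazy parser such as SeqIO.parse(), not Biopython code; parse_fasta is a hypothetical name and the FASTA text is illustrative.

```python
import io


def parse_fasta(handle):
    """Lazily yield (record_id, sequence), one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


FASTA_TEXT = ">seq1\nACGT\nAC\n>seq2\nGGCC\n"

# Iterator exhaustion: the second loop over the same iterator sees nothing.
records = parse_fasta(io.StringIO(FASTA_TEXT))
first_pass = [name for name, _ in records]
second_pass = [name for name, _ in records]
print(first_pass, second_pass)

# Streaming: aggregate in a single pass without holding every record in memory.
total_bases = sum(len(seq) for _, seq in parse_fasta(io.StringIO(FASTA_TEXT)))
print(total_bases)
```

When several passes are needed over a file too large to hold in a list, re-opening the file and parsing again is usually the safer trade-off.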