BLAST
Use BLAST for finding regions of similarity between biological sequences. This skill covers running BLAST searches and parsing results.
Running BLAST Locally
from Bio.Blast.Applications import NcbiblastpCommandline, NcbiblastnCommandline
# Protein BLAST (blastp)
blastp_cline = NcbiblastpCommandline(
query="query.fasta",
db="nr",
evalue=0.001,
outfmt=5, # XML format
out="blast_results.xml"
)
stdout, stderr = blastp_cline()
# Nucleotide BLAST (blastn)
blastn_cline = NcbiblastnCommandline(
query="query.fasta",
db="nt",
evalue=0.001,
outfmt=5,
out="blastn_results.xml"
)
stdout, stderr = blastn_cline()
NCBI BLAST Web Service
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST search online
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH"
result_handle = NCBIWWW.qblast(
"blastp", # Program
"nr", # Database
sequence, # Query sequence
hitlist_size=50, # Max hits
expect=0.001 # E-value threshold
)
# Save results
with open("blast_results.xml", "w") as f:
f.write(result_handle.read())
Parsing BLAST Results
from Bio.Blast import NCBIXML
# Parse XML results
with open("blast_results.xml") as result_handle:
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
print(f"Query: {blast_record.query}")
print(f"Database: {blast_record.database}")
for alignment in blast_record.alignments[:10]:
print(f"\n Hit: {alignment.hit_def[:60]}")
print(f" Accession: {alignment.accession}")
for hsp in alignment.hsps:
print(f" E-value: {hsp.expect}")
print(f" Score: {hsp.score}")
print(f" Identities: {hsp.identities}/{hsp.align_length}")
print(f" Query: {hsp.query[:50]}...")
print(f" Match: {hsp.match[:50]}...")
print(f" Subject: {hsp.sbjct[:50]}...")
Tabular Output Parsing
import pandas as pd
# Run BLAST with tabular output (format 6)
blastp_cline = NcbiblastpCommandline(
query="query.fasta",
db="nr",
evalue=0.001,
outfmt="6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore",
out="blast_results.tsv"
)
stdout, stderr = blastp_cline()
# Parse tabular results
columns = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch',
'gapopen', 'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']
df = pd.read_csv("blast_results.tsv", sep='\t', names=columns)
# Filter high-quality hits
significant_hits = df[(df['evalue'] < 1e-10) & (df['pident'] > 50)]
print(significant_hits)
Creating Local BLAST Database
# Create protein database
makeblastdb -in proteins.fasta -dbtype prot -out my_protein_db
# Create nucleotide database
makeblastdb -in sequences.fasta -dbtype nucl -out my_nucl_db
from Bio.Blast.Applications import NcbimakeblastdbCommandline
# Create database programmatically
makedb_cline = NcbimakeblastdbCommandline(
input_file="proteins.fasta",
dbtype="prot",
out="my_protein_db"
)
stdout, stderr = makedb_cline()
BLAST Program Selection
| Program |
Query |
Database |
Use Case |
| blastp |
Protein |
Protein |
Protein similarity |
| blastn |
Nucleotide |
Nucleotide |
DNA/RNA similarity |
| blastx |
Nucleotide |
Protein |
Translated query |
| tblastn |
Protein |
Nucleotide |
Search translated DB |
| tblastx |
Nucleotide |
Nucleotide |
Both translated |
Common Parameters
- evalue: Expectation value threshold (default: 10)
- word_size: Word size for initial match
- gapopen: Gap opening penalty
- gapextend: Gap extension penalty
- matrix: Scoring matrix (BLOSUM62, PAM250, etc.)
- num_threads: Number of CPU threads