| name | rrna-prediction-patterns |
| description | Guidelines for rRNA detection and prediction including prokaryotic (16S, 23S, 5S) and eukaryotic (18S, 28S, 5.8S, 5S) ribosomal RNA. Covers sequence patterns, secondary structure, HMM models, blast-based detection, length validation, and quality scoring for rRNA identification. |
rRNA Prediction Patterns
Purpose
Provide comprehensive guidance for implementing rRNA (ribosomal RNA) detection and prediction in both prokaryotic and eukaryotic organisms. This skill helps ensure accurate identification, proper validation, and robust handling of rRNA sequences.
When to Use
This skill activates when:
- Working with rRNA detection or prediction code
- Implementing sequence analysis for 16S, 18S, 23S, 28S, 5S, or 5.8S rRNA
- Building or using HMM models for ribosomal RNA
- Processing genomic sequences for rRNA identification
- Writing code in
backend/rrna/or related directories - Handling BLAST searches for rRNA sequences
- Validating rRNA sequence quality or completeness
rRNA Types Overview
Prokaryotic rRNA
- 16S rRNA: ~1,500 bp, small subunit (SSU), most commonly used for phylogenetics
- 23S rRNA: ~2,900 bp, large subunit (LSU)
- 5S rRNA: ~120 bp, small ribosomal component
Eukaryotic rRNA
- 18S rRNA: ~1,800 bp, small subunit (SSU), equivalent to prokaryotic 16S
- 28S rRNA: ~3,300-5,000 bp, large subunit (LSU)
- 5.8S rRNA: ~160 bp, unique to eukaryotes
- 5S rRNA: ~120 bp, similar to prokaryotic version
Core Detection Methods
1. Hidden Markov Models (HMM)
Best Practice: Use established HMM profiles from databases like Rfam or create custom profiles.
from hmmer import HMMer
def detect_rrna_with_hmm(sequence: str, rrna_type: str) -> dict:
"""
Detect rRNA using HMM profiles.
Args:
sequence: DNA/RNA sequence
rrna_type: One of ['16S', '18S', '23S', '28S', '5S', '5.8S']
Returns:
Detection results with score, e-value, and coordinates
"""
hmm_profile = load_hmm_profile(rrna_type)
results = HMMer.search(
profile=hmm_profile,
sequence=sequence,
e_value_threshold=1e-5
)
return {
'detected': results.has_hits(),
'score': results.best_score,
'e_value': results.best_e_value,
'start': results.best_hit.start if results.has_hits() else None,
'end': results.best_hit.end if results.has_hits() else None,
'confidence': calculate_confidence(results.best_score)
}
Key Considerations:
- Use appropriate e-value thresholds (typically 1e-5 to 1e-10)
- Validate sequence length against expected ranges
- Check for partial vs. complete genes
- Consider domain-specific models (bacteria, archaea, eukarya)
2. BLAST-Based Detection
Best Practice: Use specialized rRNA databases like SILVA or RDP.
from Bio.Blast import NCBIWWW, NCBIXML
def blast_search_rrna(sequence: str, db: str = "silva") -> list:
"""
Search for rRNA using BLAST against specialized databases.
Args:
sequence: Query sequence
db: Database name (silva, rdp, greengenes)
Returns:
List of matches with identity, coverage, and annotation
"""
result_handle = NCBIWWW.qblast(
program="blastn",
database=db,
sequence=sequence,
hitlist_size=10,
expect=1e-5
)
matches = []
for record in NCBIXML.parse(result_handle):
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.identities / hsp.align_length > 0.85: # 85% identity
matches.append({
'hit_def': alignment.hit_def,
'identity': hsp.identities / hsp.align_length,
'coverage': hsp.align_length / record.query_length,
'e_value': hsp.expect,
'rrna_type': extract_rrna_type(alignment.hit_def)
})
return matches
Key Considerations:
- Minimum 85% identity for reliable rRNA identification
- Coverage should be >70% for full-length predictions
- Parse hit definitions carefully to extract rRNA type
- Handle ambiguous matches (multiple possible types)
3. Pattern-Based Detection
Best Practice: Use conserved sequence motifs and secondary structure patterns.
import re
from typing import List, Tuple
# Conserved regions in 16S rRNA
CONSERVED_16S_PATTERNS = {
'v1_region': r'GACGA[ACT]G[AG][ACT]TGCGA',
'v3_region': r'GTGGAGGATGAAGG[CT]TTA',
'v4_region': r'TAGC[CT]GGCGGAC',
'shine_dalgarno_region': r'[AG]GGAGG',
}
def detect_conserved_regions(sequence: str, rrna_type: str) -> dict:
"""
Detect conserved sequence patterns specific to rRNA types.
Args:
sequence: DNA sequence
rrna_type: Type of rRNA to search for
Returns:
Dictionary of detected regions with positions
"""
patterns = get_patterns_for_type(rrna_type)
detected_regions = {}
for region_name, pattern in patterns.items():
matches = list(re.finditer(pattern, sequence, re.IGNORECASE))
if matches:
detected_regions[region_name] = [
{'start': m.start(), 'end': m.end(), 'sequence': m.group()}
for m in matches
]
return detected_regions
Conserved Regions to Check:
- 16S: V1-V9 variable regions flanked by conserved regions
- 18S: V2, V4, V7, V9 regions
- 23S/28S: Peptidyl transferase center, GTPase-associated center
- 5S: Highly conserved sequence throughout
4. Length Validation
Critical: Always validate sequence length against expected ranges.
LENGTH_RANGES = {
'16S': {'min': 1200, 'max': 1800, 'typical': 1500},
'18S': {'min': 1600, 'max': 2000, 'typical': 1800},
'23S': {'min': 2500, 'max': 3200, 'typical': 2900},
'28S': {'min': 3000, 'max': 5500, 'typical': 4000},
'5S': {'min': 100, 'max': 140, 'typical': 120},
'5.8S': {'min': 140, 'max': 180, 'typical': 160},
}
def validate_rrna_length(sequence: str, predicted_type: str) -> dict:
"""
Validate if sequence length matches expected rRNA type.
Returns:
Validation result with completeness assessment
"""
length = len(sequence)
ranges = LENGTH_RANGES.get(predicted_type)
if not ranges:
return {'valid': False, 'reason': 'Unknown rRNA type'}
completeness = length / ranges['typical']
return {
'valid': ranges['min'] <= length <= ranges['max'],
'length': length,
'expected_range': (ranges['min'], ranges['max']),
'completeness': min(completeness, 1.0),
'status': get_completeness_status(completeness)
}
def get_completeness_status(completeness: float) -> str:
"""Classify sequence completeness."""
if completeness >= 0.95:
return 'complete'
elif completeness >= 0.70:
return 'nearly_complete'
elif completeness >= 0.40:
return 'partial'
else:
return 'fragment'
Quality Scoring
Best Practice: Implement multi-factor quality scoring for predictions.
from dataclasses import dataclass
from typing import Optional
@dataclass
class RRNAQuality:
"""Quality metrics for rRNA prediction."""
hmm_score: float
blast_identity: Optional[float]
length_validity: bool
completeness: float
conserved_regions_found: int
total_conserved_regions: int
def overall_quality(self) -> str:
"""Calculate overall quality classification."""
score = 0
# HMM score contribution (0-40 points)
if self.hmm_score > 100:
score += 40
elif self.hmm_score > 50:
score += 30
elif self.hmm_score > 20:
score += 20
# BLAST identity contribution (0-30 points)
if self.blast_identity:
if self.blast_identity > 0.95:
score += 30
elif self.blast_identity > 0.85:
score += 20
elif self.blast_identity > 0.75:
score += 10
# Length validity (0-10 points)
if self.length_validity:
score += 10
# Completeness (0-10 points)
score += self.completeness * 10
# Conserved regions (0-10 points)
if self.total_conserved_regions > 0:
ratio = self.conserved_regions_found / self.total_conserved_regions
score += ratio * 10
# Classify based on score
if score >= 80:
return 'high'
elif score >= 60:
return 'medium'
elif score >= 40:
return 'low'
else:
return 'very_low'
def score_rrna_prediction(
sequence: str,
hmm_results: dict,
blast_results: list,
predicted_type: str
) -> RRNAQuality:
"""
Comprehensive quality scoring for rRNA prediction.
Args:
sequence: The predicted rRNA sequence
hmm_results: Results from HMM search
blast_results: Results from BLAST search
predicted_type: Predicted rRNA type
Returns:
RRNAQuality object with detailed metrics
"""
length_validation = validate_rrna_length(sequence, predicted_type)
conserved = detect_conserved_regions(sequence, predicted_type)
best_blast_identity = None
if blast_results:
best_blast_identity = max(r['identity'] for r in blast_results)
return RRNAQuality(
hmm_score=hmm_results.get('score', 0),
blast_identity=best_blast_identity,
length_validity=length_validation['valid'],
completeness=length_validation['completeness'],
conserved_regions_found=len(conserved),
total_conserved_regions=get_expected_conserved_count(predicted_type),
)
Secondary Structure Considerations
Important: rRNA has complex secondary structures that aid in identification.
def predict_secondary_structure(sequence: str) -> dict:
"""
Predict secondary structure features of rRNA.
Uses RNA folding algorithms to identify stems, loops, and hairpins
characteristic of rRNA.
"""
# Use ViennaRNA or similar
from RNA import fold
structure, mfe = fold(sequence)
features = {
'structure': structure,
'mfe': mfe, # Minimum free energy
'stem_loops': count_stem_loops(structure),
'hairpins': count_hairpins(structure),
'internal_loops': count_internal_loops(structure),
}
# Validate against expected secondary structure
return validate_secondary_structure(features, expected_for_rrna_type)
Key Secondary Structure Features:
- 16S/18S: Highly conserved stems (helices 1-50 for 16S)
- 23S/28S: Peptidyl transferase center with specific folding
- 5S: Alpha-helix, loop, beta-helix pattern
- 5.8S: Specific interaction sites with 28S
Error Handling
Critical: Handle ambiguous and failed predictions gracefully.
class RRNADetectionError(Exception):
"""Base exception for rRNA detection errors."""
pass
class AmbiguousRRNATypeError(RRNADetectionError):
"""Raised when multiple rRNA types match equally well."""
def __init__(self, candidates: List[str]):
self.candidates = candidates
super().__init__(f"Ambiguous rRNA type: {', '.join(candidates)}")
class NoRRNADetectedError(RRNADetectionError):
"""Raised when no rRNA is detected in sequence."""
pass
def detect_rrna_with_error_handling(sequence: str) -> dict:
"""
Detect rRNA with comprehensive error handling.
Returns:
Detection results or error information
"""
try:
# Try multiple methods
hmm_results = {}
blast_results = []
for rrna_type in ['16S', '18S', '23S', '28S', '5S', '5.8S']:
try:
hmm_results[rrna_type] = detect_rrna_with_hmm(sequence, rrna_type)
except Exception as e:
# Log but continue
log_detection_error(rrna_type, 'hmm', str(e))
# Determine best match
best_matches = [
(rtype, results['score'])
for rtype, results in hmm_results.items()
if results['detected']
]
if not best_matches:
raise NoRRNADetectedError("No rRNA detected by any method")
# Sort by score
best_matches.sort(key=lambda x: x[1], reverse=True)
# Check for ambiguity
if len(best_matches) > 1:
if best_matches[0][1] - best_matches[1][1] < 10: # Similar scores
raise AmbiguousRRNATypeError([m[0] for m in best_matches[:2]])
predicted_type = best_matches[0][0]
# Additional validation
blast_results = blast_search_rrna(sequence)
quality = score_rrna_prediction(
sequence,
hmm_results[predicted_type],
blast_results,
predicted_type
)
return {
'success': True,
'predicted_type': predicted_type,
'quality': quality.overall_quality(),
'details': {
'hmm': hmm_results[predicted_type],
'blast': blast_results,
'quality_metrics': quality,
}
}
except AmbiguousRRNATypeError as e:
return {
'success': False,
'error_type': 'ambiguous',
'candidates': e.candidates,
'message': str(e),
'suggestion': 'Try additional validation methods or manual inspection'
}
except NoRRNADetectedError as e:
return {
'success': False,
'error_type': 'not_detected',
'message': str(e),
'suggestion': 'Sequence may not contain rRNA or quality may be too low'
}
except Exception as e:
return {
'success': False,
'error_type': 'unknown',
'message': str(e),
'suggestion': 'Check sequence format and input parameters'
}
Best Practices Summary
✅ DO
- Use multiple detection methods (HMM + BLAST + patterns)
- Validate sequence length against expected ranges
- Calculate comprehensive quality scores
- Handle ambiguous predictions gracefully
- Check for conserved sequence regions
- Consider secondary structure when uncertain
- Use domain-specific databases (SILVA, RDP, Rfam)
- Log all detection attempts for debugging
- Provide confidence scores with predictions
- Test against known rRNA sequences
❌ DON'T
- Rely on a single detection method
- Ignore length validation
- Accept predictions without quality assessment
- Assume all detections are accurate
- Skip error handling for edge cases
- Use generic BLAST databases for rRNA
- Forget to handle partial sequences
- Mix up prokaryotic and eukaryotic types
- Ignore ambiguous matches
- Skip validation of conserved regions
Common Pitfalls
Pseudogenes: Some genomic regions resemble rRNA but are non-functional
- Solution: Check for stop codons, frameshifts in surrounding ORFs
Contamination: Sample may contain rRNA from multiple organisms
- Solution: Cluster sequences by similarity before classification
Chimeras: PCR artifacts can create hybrid rRNA sequences
- Solution: Use chimera detection tools (UCHIME, ChimeraSlayer)
Partial Sequences: Fragmented assemblies may contain incomplete rRNA
- Solution: Report completeness percentage, handle fragments explicitly
Testing Strategy
# Example test cases
def test_rrna_detection():
"""Test suite for rRNA detection."""
# Test case 1: Known 16S sequence
ecoli_16s = load_sequence("test_data/ecoli_16s.fasta")
result = detect_rrna_with_error_handling(ecoli_16s)
assert result['success']
assert result['predicted_type'] == '16S'
assert result['quality'] in ['high', 'medium']
# Test case 2: Known 18S sequence
human_18s = load_sequence("test_data/human_18s.fasta")
result = detect_rrna_with_error_handling(human_18s)
assert result['predicted_type'] == '18S'
# Test case 3: Non-rRNA sequence
random_seq = generate_random_dna(1000)
result = detect_rrna_with_error_handling(random_seq)
assert not result['success']
assert result['error_type'] == 'not_detected'
# Test case 4: Partial rRNA
partial_16s = ecoli_16s[:800] # First 800bp only
result = detect_rrna_with_error_handling(partial_16s)
if result['success']:
assert result['details']['quality_metrics'].status == 'partial'
Related Skills: phylogenetic-methods
External Resources:
- Rfam database: https://rfam.org/
- SILVA rRNA database: https://www.arb-silva.de/
- Ribosomal Database Project: https://rdp.cme.msu.edu/
Line Count: < 500 lines ✅