| name | genomics |
| description | Genomics and transcriptomics analysis strategies |
Genomics and Transcriptomics Analysis
When to Use This Skill
- When data contains gene expression measurements (RNA-seq, microarray)
- When analyzing differential gene expression
- When performing pathway or gene set enrichment analysis
- When interpreting genetic variants or mutations
Core Concepts
Gene Expression Data Types
RNA-seq counts:
- Raw read counts per gene
- Requires normalization (TPM, RPKM, DESeq2)
- Suitable for differential expression analysis
Microarray intensities:
- Probe fluorescence intensities
- Log-transformed, background-corrected
- Legacy platform, less common now
Single-cell RNA-seq:
- Expression per cell (not bulk tissue)
- High sparsity (many zeros)
- Specialized analysis methods
Gene Nomenclature
Human genes:
- Official symbols: HUGO Gene Nomenclature Committee (HGNC)
- Example: TP53 (tumor protein p53)
- Italicized in publications
Mouse genes:
- Similar to human but capitalization differs
- Example: Tp53 (first letter capital, rest lowercase)
Protein names:
- Not italicized
- Example: p53 protein
Always verify gene symbols - aliases and outdated names are common.
Differential Expression Analysis
Workflow
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
# Load expression data (genes × samples)
# Rows = genes, Columns = samples
expr_data = pd.read_csv("expression_data.csv", index_col=0)
# Define groups
group1_samples = ["Sample1", "Sample2", "Sample3"]
group2_samples = ["Sample4", "Sample5", "Sample6"]
results = []
for gene in expr_data.index:
group1_expr = expr_data.loc[gene, group1_samples]
group2_expr = expr_data.loc[gene, group2_samples]
# T-test
t_stat, p_value = ttest_ind(group1_expr, group2_expr)
# Fold change
mean1 = group1_expr.mean()
mean2 = group2_expr.mean()
log2fc = np.log2(mean1 / mean2) if mean2 > 0 else np.nan
results.append({
"gene": gene,
"log2FC": log2fc,
"p_value": p_value,
"mean_group1": mean1,
"mean_group2": mean2
})
results_df = pd.DataFrame(results)
# Multiple testing correction
results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1]
# Define significant genes
significant = results_df[
(results_df["p_adj"] < 0.05) &
(abs(results_df["log2FC"]) > 1) # 2-fold change
]
print(f"Significant genes: {len(significant)}")
print(f"Upregulated: {sum(significant['log2FC'] > 0)}")
print(f"Downregulated: {sum(significant['log2FC'] < 0)}")
Volcano Plot
Visualize differential expression:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.scatter(
results_df["log2FC"],
-np.log10(results_df["p_adj"]),
alpha=0.5, s=10, c="gray"
)
# Highlight significant genes
sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1)
plt.scatter(
results_df.loc[sig_mask, "log2FC"],
-np.log10(results_df.loc[sig_mask, "p_adj"]),
alpha=0.7, s=20, c="red", label="Significant"
)
plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(adjusted p-value)")
plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5)
plt.axvline(-1, linestyle="--", color="black", linewidth=0.5)
plt.axvline(1, linestyle="--", color="black", linewidth=0.5)
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)
Gene Set Enrichment
Simple Pathway Enrichment
When: You have a list of significant genes and want to know which pathways are affected
# Define gene sets (pathways)
gene_sets = {
"Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...],
"Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...],
"DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...],
# ... more pathways
}
# Fisher's exact test for enrichment
from scipy.stats import fisher_exact
all_genes = set(expr_data.index)
sig_genes = set(significant["gene"])
enrichment_results = []
for pathway, pathway_genes in gene_sets.items():
pathway_genes = set(pathway_genes) & all_genes # Only genes in dataset
# 2x2 contingency table
a = len(sig_genes & pathway_genes) # Sig & in pathway
b = len(sig_genes - pathway_genes) # Sig & not in pathway
c = len(pathway_genes - sig_genes) # Not sig & in pathway
d = len(all_genes - sig_genes - pathway_genes) # Not sig & not in pathway
oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater')
enrichment_results.append({
"pathway": pathway,
"overlap": a,
"pathway_size": len(pathway_genes),
"odds_ratio": oddsratio,
"p_value": p_value
})
enrich_df = pd.DataFrame(enrichment_results)
enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1]
enrich_df = enrich_df.sort_values("p_adj")
print(enrich_df.head(10))
Gene Ontology (GO) Terms
Common GO categories:
- Biological Process (BP): What the gene does (e.g., "cell cycle", "apoptosis")
- Molecular Function (MF): Biochemical activity (e.g., "kinase activity")
- Cellular Component (CC): Where it acts (e.g., "nucleus", "mitochondrion")
Resources:
- Gene Ontology: http://geneontology.org/
- Enrichr: https://maayanlab.cloud/Enrichr/ (web-based enrichment)
- DAVID: https://david.ncifcrf.gov/
KEGG Pathway Enrichment
KEGG = Kyoto Encyclopedia of Genes and Genomes
Provides curated pathway maps for:
- Metabolic pathways
- Signaling pathways
- Disease pathways
Example pathways:
- hsa04110: Cell cycle
- hsa04210: Apoptosis
- hsa04151: PI3K-Akt signaling
Common Analysis Patterns
Pattern 1: Transcription Factor Activity
Observation: Many genes upregulated
Hypothesis: Shared transcription factor (TF)
Test:
# Check if significant genes share TF binding motifs
tf_targets = {
"TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...],
"MYC": ["CDK4", "CCND1", "E2F1", ...],
# ... more TFs
}
# Test for enrichment (same as pathway enrichment)
Interpretation: Enrichment suggests TF is active/inactive in condition
Pattern 2: Pathway Coordination
Observation: Genes in same pathway all up/down together
Interpretation: Pathway-level regulation (not individual genes)
Example:
All glycolysis genes ↑↑ → Increased glycolysis
All oxidative phosphorylation genes ↓↓ → Metabolic shift
Pattern 3: Compensatory Response
Observation: Opposite regulation of related pathways
Example:
De novo biosynthesis genes ↓
Salvage pathway genes ↑
→ Metabolic switch to energy-efficient salvage
Correlation Analysis
Co-expression Networks
When: Identify genes that change together
from scipy.stats import pearsonr
# Compute pairwise correlations
genes = significant["gene"].tolist()[:50] # Top 50 for tractability
corr_matrix = expr_data.loc[genes].T.corr()
# Filter high correlations
high_corr = []
for i in range(len(genes)):
for j in range(i+1, len(genes)):
if abs(corr_matrix.iloc[i, j]) > 0.8:
high_corr.append({
"gene1": genes[i],
"gene2": genes[j],
"correlation": corr_matrix.iloc[i, j]
})
print(f"High correlations (|r| > 0.8): {len(high_corr)}")
Interpretation:
- Positive correlation → co-regulated (same pathway, shared TF)
- Negative correlation → antagonistic regulation
Network Visualization
import networkx as nx
# Build network
G = nx.Graph()
for item in high_corr:
G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"]))
# Find communities (clusters of co-expressed genes)
from networkx.algorithms import community
communities = community.greedy_modularity_communities(G)
for i, comm in enumerate(communities):
print(f"Community {i}: {list(comm)}")
Literature Search Strategies
Effective Queries
For gene function:
"[GENE] function"
"[GENE] role in [PROCESS]"
"[GENE] knockout phenotype"
For pathway context:
"[GENE] pathway"
"[GENE] interacting proteins"
"[GENE] regulation"
For disease relevance:
"[GENE] [DISEASE]"
"[GENE] mutation [DISEASE]"
Key Databases
- NCBI Gene: Gene summaries and references
- UniProt: Protein function and domains
- STRING: Protein-protein interactions
- GeneCards: Comprehensive gene info
- PubMed: Literature search
Genomics-Specific Hypotheses
Template Hypotheses
H1: Transcriptional Regulation
"Condition X activates transcription factor [TF], upregulating
target genes [G1, G2, G3] in pathway [P]"
H2: Pathway Activation
"Condition X activates [pathway], evidenced by coordinated
upregulation of pathway genes and increased activity signature"
H3: Epigenetic Regulation
"Condition X alters chromatin state at [locus], changing
expression of genes [G1, G2]"
H4: Post-transcriptional Regulation
"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2],
explaining decreased protein levels despite unchanged mRNA"
Quality Control
Before interpreting results:
- Check for batch effects (PCA colored by batch)
- Verify sample labels are correct
- Check for outlier samples (hierarchical clustering)
- Confirm expression distribution (should be roughly normal after log transform)
- Verify normalization (samples should have similar distributions)
Common Pitfalls
❌ Ignoring log transformation
- Expression data should be log-transformed for most analyses
- Fold changes are linear differences in log space
❌ Using nominal p-values for many genes
- Always correct for multiple testing (FDR)
- Use adjusted p-values for significance
❌ Overinterpreting small fold changes
- log2FC < 0.5 (1.4-fold) may not be biologically meaningful
- Use stricter thresholds for noisy data
❌ Confusing gene expression with protein activity
- mRNA ≠ protein levels
- Protein activity may require post-translational modifications
❌ Cherry-picking genes
- Don't select genes to fit a story
- Use unbiased pathway enrichment
Integration with Other Data Types
Transcriptomics + Metabolomics
Strategy:
1. Identify differentially expressed metabolic enzymes
2. Map to KEGG pathways
3. Check if corresponding metabolites are changed
4. Build integrated metabolic model
Example:
Gene: PHGDH (phosphoglycerate dehydrogenase) ↑
Metabolite: Serine ↑
→ Integrated finding: Increased serine biosynthesis
Transcriptomics + Proteomics
Compare mRNA vs protein changes:
- Concordant (both up/down) → transcriptional regulation
- Discordant (mRNA ≠ protein) → post-transcriptional regulation
Key Principle
Gene expression is the messenger, not the message.
mRNA changes indicate potential for protein changes. Always consider:
- Post-transcriptional regulation (miRNA, RNA stability)
- Translational control
- Protein stability and degradation
- Post-translational modifications
Connect expression changes to phenotype through pathways and functional validation.