Genomics and transcriptomics analysis strategies


SKILL.md

name: genomics
description: Genomics and transcriptomics analysis strategies

Genomics and Transcriptomics Analysis

When to Use This Skill

  • When data contains gene expression measurements (RNA-seq, microarray)
  • When analyzing differential gene expression
  • When performing pathway or gene set enrichment analysis
  • When interpreting genetic variants or mutations

Core Concepts

Gene Expression Data Types

RNA-seq counts:

  • Raw read counts per gene
  • Requires normalization (e.g., TPM or RPKM/FPKM for comparing expression levels; DESeq2 size factors for differential expression)
  • Suitable for differential expression analysis
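
To make the normalization step concrete, here is a minimal sketch of a TPM conversion on a toy count matrix. The gene names, counts, and gene lengths are invented for illustration, and `counts_to_tpm` is a hypothetical helper, not part of any library. For differential expression, dedicated tools such as DESeq2 start from raw counts and apply their own normalization.

```python
import pandas as pd

# Toy raw counts (genes x samples) and gene lengths in kilobases.
# All values are illustrative assumptions.
counts = pd.DataFrame(
    {"S1": [100, 200, 700], "S2": [50, 400, 50]},
    index=["geneA", "geneB", "geneC"],
)
lengths_kb = pd.Series([1.0, 2.0, 7.0], index=counts.index)

def counts_to_tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize, then scale each sample to 1e6."""
    rate = counts.div(lengths_kb, axis=0)            # reads per kilobase
    return rate.div(rate.sum(axis=0), axis=1) * 1e6  # each column sums to 1e6

tpm = counts_to_tpm(counts, lengths_kb)
print(tpm.round(1))
```

Because every sample sums to one million after conversion, TPM values are comparable within a sample but are not a substitute for count-based statistical models.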

Microarray intensities:

  • Probe fluorescence intensities
  • Log-transformed, background-corrected
  • Legacy platform, less common now

Single-cell RNA-seq:

  • Expression per cell (not bulk tissue)
  • High sparsity (many zeros)
  • Specialized analysis methods

Gene Nomenclature

Human genes:

  • Official symbols: HUGO Gene Nomenclature Committee (HGNC)
  • Example: TP53 (tumor protein p53)
  • Italicized in publications

Mouse genes:

  • Similar symbols to human, but only the first letter is capitalized
  • Example: Trp53 (mouse ortholog of human TP53)

Protein names:

  • Not italicized
  • Example: p53 protein

Always verify gene symbols - aliases and outdated names are common.
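
One way to sketch symbol verification is a lookup against an alias table. The table below is a small hand-written assumption for illustration; in practice it would be built from a current HGNC download. `normalize_symbols` is a hypothetical helper.

```python
# Hypothetical alias table (alias or previous symbol -> approved symbol).
# Illustrative only; build the real table from a current HGNC export.
ALIASES = {"P53": "TP53", "HER2": "ERBB2", "MLL": "KMT2A"}

def normalize_symbols(symbols, aliases=ALIASES):
    """Map known aliases to approved symbols and report what changed."""
    normalized, changed = [], {}
    for s in symbols:
        cleaned = s.strip()
        # Look up case-insensitively, but leave unknown symbols untouched
        approved = aliases.get(cleaned.upper(), cleaned)
        if approved != s:
            changed[s] = approved
        normalized.append(approved)
    return normalized, changed

normalized, changed = normalize_symbols(["TP53", "p53", "HER2", "BRCA1"])
print(normalized)
print(changed)
```

Reporting the changed symbols, rather than silently rewriting them, makes it easier to catch wrong mappings before they reach the analysis.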

Differential Expression Analysis

Workflow

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Load expression data (genes × samples)
# Rows = genes, Columns = samples
expr_data = pd.read_csv("expression_data.csv", index_col=0)

# Define groups
group1_samples = ["Sample1", "Sample2", "Sample3"]
group2_samples = ["Sample4", "Sample5", "Sample6"]

results = []

for gene in expr_data.index:
    group1_expr = expr_data.loc[gene, group1_samples]
    group2_expr = expr_data.loc[gene, group2_samples]

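    # Two-sample t-test (Student's by default; pass equal_var=False for Welch's)
    t_stat, p_value = ttest_ind(group1_expr, group2_expr)

    # Fold change (assumes linear-scale, non-negative expression values).
    # Both means must be positive, otherwise log2 yields ±inf or NaN.
    mean1 = group1_expr.mean()
    mean2 = group2_expr.mean()
    log2fc = np.log2(mean1 / mean2) if mean1 > 0 and mean2 > 0 else np.nan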

    results.append({
        "gene": gene,
        "log2FC": log2fc,
        "p_value": p_value,
        "mean_group1": mean1,
        "mean_group2": mean2
    })

results_df = pd.DataFrame(results)

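# Multiple testing correction (Benjamini-Hochberg FDR).
# Drop genes with undefined p-values first (e.g., zero variance in both groups),
# since NaNs would otherwise corrupt the adjusted p-values.
results_df = results_df.dropna(subset=["p_value"]).reset_index(drop=True)
results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1]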

# Define significant genes
significant = results_df[
    (results_df["p_adj"] < 0.05) &
    (abs(results_df["log2FC"]) > 1)  # 2-fold change
]

print(f"Significant genes: {len(significant)}")
print(f"Upregulated: {sum(significant['log2FC'] > 0)}")
print(f"Downregulated: {sum(significant['log2FC'] < 0)}")

Volcano Plot

Visualize differential expression:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(
    results_df["log2FC"],
    -np.log10(results_df["p_adj"]),
    alpha=0.5, s=10, c="gray"
)

# Highlight significant genes
sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1)
plt.scatter(
    results_df.loc[sig_mask, "log2FC"],
    -np.log10(results_df.loc[sig_mask, "p_adj"]),
    alpha=0.7, s=20, c="red", label="Significant"
)

plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(adjusted p-value)")
plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5)
plt.axvline(-1, linestyle="--", color="black", linewidth=0.5)
plt.axvline(1, linestyle="--", color="black", linewidth=0.5)
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)

Gene Set Enrichment

Simple Pathway Enrichment

When: You have a list of significant genes and want to know which pathways are affected

# Define gene sets (pathways)
gene_sets = {
    "Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...],
    "Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...],
    "DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...],
    # ... more pathways
}

# Fisher's exact test for enrichment
from scipy.stats import fisher_exact

all_genes = set(expr_data.index)
sig_genes = set(significant["gene"])

enrichment_results = []

for pathway, pathway_genes in gene_sets.items():
    pathway_genes = set(pathway_genes) & all_genes  # Only genes in dataset

    # 2x2 contingency table
    a = len(sig_genes & pathway_genes)  # Sig & in pathway
    b = len(sig_genes - pathway_genes)  # Sig & not in pathway
    c = len(pathway_genes - sig_genes)  # Not sig & in pathway
    d = len(all_genes - sig_genes - pathway_genes)  # Not sig & not in pathway

    oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater')

    enrichment_results.append({
        "pathway": pathway,
        "overlap": a,
        "pathway_size": len(pathway_genes),
        "odds_ratio": oddsratio,
        "p_value": p_value
    })

enrich_df = pd.DataFrame(enrichment_results)
enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1]
enrich_df = enrich_df.sort_values("p_adj")

print(enrich_df.head(10))
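
The one-sided Fisher's exact test used above is equivalent to a hypergeometric tail probability, which can help when reasoning about what the p-value means. The sketch below checks this on made-up counts; the table layout follows the a, b, c, d convention of the enrichment loop.

```python
from scipy.stats import fisher_exact, hypergeom

# Illustrative 2x2 counts (assumed values, not real data)
a, b, c, d = 8, 42, 12, 938

N = a + b + c + d  # background size (all genes in the dataset)
K = a + c          # pathway genes present in the background
n = a + b          # significant genes

p_fisher = fisher_exact([[a, b], [c, d]], alternative="greater")[1]
# P(overlap >= a) when drawing n genes at random from N, of which K are in the pathway
p_hyper = hypergeom.sf(a - 1, N, K, n)

print(p_fisher, p_hyper)
```

Both values answer the same question: how likely is an overlap at least this large if the significant genes were a random draw from the background?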

Gene Ontology (GO) Terms

Common GO categories:

  • Biological Process (BP): What the gene does (e.g., "cell cycle", "apoptosis")
  • Molecular Function (MF): Biochemical activity (e.g., "kinase activity")
  • Cellular Component (CC): Where it acts (e.g., "nucleus", "mitochondrion")

KEGG Pathway Enrichment

KEGG = Kyoto Encyclopedia of Genes and Genomes

Provides curated pathway maps for:

  • Metabolic pathways
  • Signaling pathways
  • Disease pathways

Example pathways:

  • hsa04110: Cell cycle
  • hsa04210: Apoptosis
  • hsa04151: PI3K-Akt signaling

Common Analysis Patterns

Pattern 1: Transcription Factor Activity

Observation: Many genes upregulated

Hypothesis: Shared transcription factor (TF)

Test:

# Check whether significant genes are enriched for known targets of a TF
tf_targets = {
    "TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...],
    "MYC": ["CDK4", "CCND1", "E2F1", ...],
    # ... more TFs
}

# Test for enrichment (same as pathway enrichment)

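Interpretation: Enrichment of a TF's targets among the significant genes suggests that the TF's activity differs between conditions; the direction of change of the targets indicates whether it is more or less active.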
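
A minimal sketch of the TF-target enrichment test, reusing the same Fisher's exact logic as the pathway enrichment. The background, significant genes, and target sets below are synthetic placeholders (`G0` to `G99`, `TF_A`, `TF_B`), and `enrichment_p` is a hypothetical helper.

```python
from scipy.stats import fisher_exact

def enrichment_p(sig_genes, background, target_genes):
    """One-sided Fisher's exact test for over-representation of targets among sig_genes."""
    background = set(background)
    sig = set(sig_genes) & background
    targets = set(target_genes) & background
    a = len(sig & targets)               # significant and target
    b = len(sig - targets)               # significant, not target
    c = len(targets - sig)               # target, not significant
    d = len(background - sig - targets)  # neither
    p = fisher_exact([[a, b], [c, d]], alternative="greater")[1]
    return a, p

# Synthetic example: 100 background genes, the first 10 are "significant"
background = [f"G{i}" for i in range(100)]
sig_genes = background[:10]
tf_targets_toy = {"TF_A": background[:8], "TF_B": background[50:58]}

tf_results = {tf: enrichment_p(sig_genes, background, targets)
              for tf, targets in tf_targets_toy.items()}
for tf, (overlap, p) in tf_results.items():
    print(f"{tf}: overlap={overlap}, p={p:.3g}")
```

When testing many TFs, the resulting p-values need the same multiple testing correction as the pathway results.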

Pattern 2: Pathway Coordination

Observation: Genes in same pathway all up/down together

Interpretation: Pathway-level regulation (not individual genes)

Example:

All glycolysis genes ↑↑ → Increased glycolysis
All oxidative phosphorylation genes ↓↓ → Metabolic shift
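
Pathway coordination can be summarized per pathway rather than gene by gene. The sketch below uses invented log2 fold changes and a hand-picked handful of genes per pathway; the `frac_up` and `mean_log2FC` column names are assumptions for illustration.

```python
import pandas as pd

# Invented log2 fold changes for a few glycolysis and OXPHOS genes
log2fc = pd.Series({
    "HK2": 1.8, "PFKP": 1.2, "LDHA": 2.1,
    "NDUFA4": -1.1, "COX5A": -0.9, "ATP5F1A": -1.4,
})
pathways = {
    "Glycolysis": ["HK2", "PFKP", "LDHA"],
    "Oxidative phosphorylation": ["NDUFA4", "COX5A", "ATP5F1A"],
}

rows = []
for name, genes in pathways.items():
    fc = log2fc.reindex(genes).dropna()  # ignore genes missing from the data
    rows.append({
        "pathway": name,
        "n_genes": len(fc),
        "mean_log2FC": fc.mean(),
        "frac_up": (fc > 0).mean(),      # 1.0 = all up, 0.0 = all down
    })

summary = pd.DataFrame(rows).set_index("pathway")
print(summary)
```

A `frac_up` near 0 or 1 together with a sizeable mean fold change is the kind of coordinated shift this pattern describes; values near 0.5 suggest no pathway-level direction.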

Pattern 3: Compensatory Response

Observation: Opposite regulation of related pathways

Example:

De novo biosynthesis genes ↓
Salvage pathway genes ↑
→ Metabolic switch to energy-efficient salvage

Correlation Analysis

Co-expression Networks

When: Identify genes that change together

# Compute pairwise Pearson correlations (the pandas default for .corr())
# Use the 50 most significant genes for tractability
genes = significant.sort_values("p_adj")["gene"].tolist()[:50]
corr_matrix = expr_data.loc[genes].T.corr()

# Filter high correlations
high_corr = []
for i in range(len(genes)):
    for j in range(i+1, len(genes)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            high_corr.append({
                "gene1": genes[i],
                "gene2": genes[j],
                "correlation": corr_matrix.iloc[i, j]
            })

print(f"High correlations (|r| > 0.8): {len(high_corr)}")

Interpretation:

  • Positive correlation → suggests co-regulation (same pathway, shared TF)
  • Negative correlation → suggests antagonistic regulation
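
As a sketch of how the two cases can be separated in practice, the pair extraction can be done without nested loops by masking the upper triangle of the correlation matrix. The expression values below are simulated, and the gene names `g1` to `g4` are placeholders.

```python
import numpy as np
import pandas as pd

# Simulated expression (genes x samples): g2 tracks g1, g3 mirrors g1, g4 is noise
rng = np.random.default_rng(0)
base = rng.normal(size=12)
toy_expr = pd.DataFrame({
    "g1": base,
    "g2": base + rng.normal(scale=0.05, size=12),
    "g3": -base,
    "g4": rng.normal(size=12),
}).T

corr = toy_expr.T.corr()

# Keep each gene pair once (upper triangle, excluding the diagonal)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

high = pairs[pairs.abs() > 0.8]
positive = high[high > 0]  # candidate co-regulated pairs
negative = high[high < 0]  # candidate antagonistic pairs
print(positive)
print(negative)
```

With only a handful of samples, high correlations can arise by chance, so pairs found this way are candidates for follow-up rather than evidence of regulation.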

Network Visualization

import networkx as nx

# Build network
G = nx.Graph()
for item in high_corr:
    G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"]))

# Find communities (clusters of co-expressed genes)
from networkx.algorithms import community
communities = community.greedy_modularity_communities(G)

for i, comm in enumerate(communities):
    print(f"Community {i}: {list(comm)}")

Literature Search Strategies

Effective Queries

For gene function:

"[GENE] function"
"[GENE] role in [PROCESS]"
"[GENE] knockout phenotype"

For pathway context:

"[GENE] pathway"
"[GENE] interacting proteins"
"[GENE] regulation"

For disease relevance:

"[GENE] [DISEASE]"
"[GENE] mutation [DISEASE]"

Key Databases

  1. NCBI Gene: Gene summaries and references
  2. UniProt: Protein function and domains
  3. STRING: Protein-protein interactions
  4. GeneCards: Comprehensive gene info
  5. PubMed: Literature search

Genomics-Specific Hypotheses

Template Hypotheses

H1: Transcriptional Regulation

"Condition X activates transcription factor [TF], upregulating
target genes [G1, G2, G3] in pathway [P]"

H2: Pathway Activation

"Condition X activates [pathway], evidenced by coordinated
upregulation of pathway genes and increased activity signature"

H3: Epigenetic Regulation

"Condition X alters chromatin state at [locus], changing
expression of genes [G1, G2]"

H4: Post-transcriptional Regulation

"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2],
explaining decreased protein levels despite unchanged mRNA"

Quality Control

Before interpreting results:

  • Check for batch effects (PCA colored by batch)
  • Verify sample labels are correct
  • Check for outlier samples (hierarchical clustering)
  • Confirm expression distribution (should be roughly normal after log transform)
  • Verify normalization (samples should have similar distributions)
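
A rough sketch of the batch-effect and distribution checks, using a PCA computed with NumPy's SVD. The data are simulated with an artificial batch shift; the sample names, batch labels, and effect size are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
batch = pd.Series(["A", "A", "A", "B", "B", "B"], index=samples)

# Simulated log-scale expression: 200 genes x 6 samples
log_expr = pd.DataFrame(rng.normal(8, 1, size=(200, 6)), columns=samples)
# Artificial batch effect: shift the first 100 genes in batch B
log_expr.loc[:99, ["S4", "S5", "S6"]] += 3.0

# PCA on samples: center each gene across samples, then SVD
centered = log_expr.sub(log_expr.mean(axis=1), axis=0)
u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
explained = s**2 / np.sum(s**2)
pcs = pd.DataFrame(vt[:2].T * s[:2], index=samples, columns=["PC1", "PC2"])

print(pcs.assign(batch=batch))
print("Variance explained (PC1, PC2):", explained[:2].round(2))

# Per-sample medians should be similar after normalization
print(log_expr.median())
```

If samples separate by batch rather than by biological group along the first components, the batch should be modeled or corrected before interpreting differential expression.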

Common Pitfalls

Ignoring log transformation

  • Expression data should be log-transformed for most analyses
  • Log fold changes are differences in log space: log2(A/B) = log2(A) - log2(B)
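
A small numeric sketch of why the scale matters, using made-up values: the log of a ratio equals the difference of logs, but the mean of log values is not the log of the mean.

```python
import numpy as np

# Made-up linear-scale expression for one gene in two conditions
a, b = 40.0, 10.0
log2fc_ratio = np.log2(a / b)           # ratio on the linear scale
log2fc_diff = np.log2(a) - np.log2(b)   # difference on the log scale
print(log2fc_ratio, log2fc_diff)

# Averaging before or after the log gives different answers
group = np.array([10.0, 1000.0])
log_of_mean = np.log2(group.mean())
mean_of_logs = np.log2(group).mean()
print(log_of_mean, mean_of_logs)
```

If the expression matrix is already log-transformed, compute the log fold change as a difference of group means rather than as the log of their ratio.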

Using nominal p-values for many genes

  • Always correct for multiple testing (FDR)
  • Use adjusted p-values for significance
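
To show what the adjustment does, here is a minimal hand-written sketch of the Benjamini-Hochberg procedure on five invented p-values. `bh_adjust` is a hypothetical helper for illustration; in an analysis, `multipletests(..., method="fdr_bh")` from statsmodels is the usual choice.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (illustrative implementation)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted

p_nominal = [0.001, 0.008, 0.039, 0.041, 0.6]
p_adj = bh_adjust(p_nominal)
print(p_adj)  # → [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

In this example the third and fourth tests pass a nominal 0.05 cutoff but not the adjusted one, which is exactly the case the pitfall warns about.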

Overinterpreting small fold changes

  • |log2FC| < 0.5 (about 1.4-fold) may not be biologically meaningful
  • Use stricter thresholds for noisy data

Confusing gene expression with protein activity

  • mRNA ≠ protein levels
  • Protein activity may require post-translational modifications

Cherry-picking genes

  • Don't select genes to fit a story
  • Use unbiased pathway enrichment

Integration with Other Data Types

Transcriptomics + Metabolomics

Strategy:

1. Identify differentially expressed metabolic enzymes
2. Map to KEGG pathways
3. Check if corresponding metabolites are changed
4. Build integrated metabolic model

Example:

Gene: PHGDH (phosphoglycerate dehydrogenase) ↑
Metabolite: Serine ↑
→ Integrated finding: Increased serine biosynthesis
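
The strategy above can be sketched as a table join. All numbers below are invented, and the enzyme-to-metabolite map is a hand-written assumption that links each enzyme to the end product of its pathway; a real analysis would derive the mapping from KEGG or a similar resource.

```python
import pandas as pd

# Invented differential expression and metabolite results
de_genes = pd.DataFrame({
    "gene": ["PHGDH", "PSAT1", "LDHA"],
    "log2FC": [1.6, 1.2, -0.2],
    "p_adj": [0.001, 0.01, 0.7],
})
metabolites = pd.DataFrame({
    "metabolite": ["Serine", "Lactate"],
    "log2FC": [0.9, 0.1],
    "p_adj": [0.02, 0.8],
})
# Assumed enzyme -> pathway end-product map (illustrative, not from a database)
enzyme_product = pd.DataFrame({
    "gene": ["PHGDH", "PSAT1", "LDHA"],
    "metabolite": ["Serine", "Serine", "Lactate"],
})

merged = (
    enzyme_product
    .merge(de_genes, on="gene")
    .merge(metabolites, on="metabolite", suffixes=("_gene", "_met"))
)
both_sig = (merged["p_adj_gene"] < 0.05) & (merged["p_adj_met"] < 0.05)
same_direction = (merged["log2FC_gene"] * merged["log2FC_met"]) > 0
merged["supported"] = both_sig & same_direction
print(merged[["gene", "metabolite", "log2FC_gene", "log2FC_met", "supported"]])
```

Agreement between enzyme and metabolite changes is supporting evidence only; metabolite levels also depend on consumption, transport, and enzyme activity.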

Transcriptomics + Proteomics

Compare mRNA vs protein changes:

  • Concordant (both up/down) → transcriptional regulation
  • Discordant (mRNA ≠ protein) → post-transcriptional regulation
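
A minimal sketch of the concordance classification on invented fold changes. The gene names are placeholders and the cutoff `THRESH` is an assumed value that should be tuned to the noise level of each platform.

```python
import pandas as pd

# Invented mRNA and protein log2 fold changes
changes = pd.DataFrame(
    {"mRNA_log2FC": [1.5, -1.3, 0.1, 1.4],
     "protein_log2FC": [1.2, -1.1, -1.6, 0.0]},
    index=["geneA", "geneB", "geneC", "geneD"],
)
THRESH = 1.0  # assumed cutoff for calling a change

def call(x):
    """Label a log2 fold change as up, down, or flat."""
    if x > THRESH:
        return "up"
    if x < -THRESH:
        return "down"
    return "flat"

mrna_call = changes["mRNA_log2FC"].map(call)
prot_call = changes["protein_log2FC"].map(call)

changes["status"] = "unchanged"
changes.loc[(mrna_call == prot_call) & (mrna_call != "flat"), "status"] = "concordant"
changes.loc[mrna_call != prot_call, "status"] = "discordant"
print(changes)
```

Discordant genes such as `geneC` and `geneD` here are the candidates for post-transcriptional explanations like miRNA targeting or altered protein stability.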

Key Principle

Gene expression is the messenger, not the message.

mRNA changes indicate potential for protein changes. Always consider:

  • Post-transcriptional regulation (miRNA, RNA stability)
  • Translational control
  • Protein stability and degradation
  • Post-translational modifications

Connect expression changes to phenotype through pathways and functional validation.