| name | bulk-rna-seq-batch-correction-with-combat |
| title | Bulk RNA-seq batch correction with ComBat |
| description | Use omicverse's pyComBat wrapper to remove batch effects from merged bulk RNA-seq or microarray cohorts, export corrected matrices, and benchmark pre/post correction visualisations. |
Bulk RNA-seq batch correction with ComBat
Overview
Apply this skill when a user has multiple bulk expression matrices measured across different batches and needs to harmonise them
before downstream analysis. It follows t_bulk_combat.ipynb, w
hich demonstrates the pyComBat workflow on ovarian cancer microarray cohorts.
Instructions
- Import core libraries
- Load
omicverse as ov,anndata,pandas as pd, andmatplotlib.pyplot as plt. - Call
ov.ov_plot_set()(aliasedov.plot_set()in some releases) to align figures with omicverse styling.
- Load
- Load each batch separately
- Read the prepared pickled matrices (or user-provided expression tables) with
pd.read_pickle(...)/pd.read_csv(...). - Transpose to gene × sample before wrapping them in
anndata.AnnDataobjects soadata.obsstores sample metadata. - Assign a
batchcolumn for every cohort (adata.obs['batch'] = '1','2', ...). Encourage descriptive labels when availa ble.
- Read the prepared pickled matrices (or user-provided expression tables) with
- Concatenate on shared genes
- Use
anndata.concat([adata1, adata2, adata3], merge='same')to retain the intersection of genes across batches. - Confirm the combined
adatareports balanced sample counts per batch; if not, prompt users to re-check inputs.
- Use
- Run ComBat batch correction
- Execute
ov.bulk.batch_correction(adata, batch_key='batch'). - Explain that corrected values are stored in
adata.layers['batch_correction']while the original counts remain inadata.X.
- Execute
- Export corrected and raw matrices
- Obtain DataFrames via
adata.to_df().T(raw) andadata.to_df(layer='batch_correction').T(corrected). - Encourage saving both tables (
.to_csv(...)) plus the harmonised AnnData (adata.write_h5ad('adata_batch.h5ad', compressio n='gzip')).
- Obtain DataFrames via
- Benchmark the correction
- For per-sample variance checks, draw before/after boxplots and recolour boxes using
ov.utils.red_color,blue_color,gree n_colorpalettes to match batches. - Copy raw counts to a named layer with
adata.layers['raw'] = adata.X.copy()before PCA. - Run
ov.pp.pca(adata, layer='raw', n_pcs=50)andov.pp.pca(adata, layer='batch_correction', n_pcs=50). - Visualise embeddings with
ov.utils.embedding(..., basis='raw|original|X_pca', color='batch', frameon='small')and repeat fo r the corrected layer to verify mixing.
- For per-sample variance checks, draw before/after boxplots and recolour boxes using
- Troubleshooting tips
- Mismatched gene identifiers cause dropped features—remind users to harmonise feature names (e.g., gene symbols) before conca tenation.
- pyComBat expects log-scale intensities or similarly distributed counts; recommend log-transforming strongly skewed matrices.
- If
batch_correctionlayer is missing, ensure thebatch_keymatches the column name inadata.obs.
Examples
- "Combine three GEO ovarian cohorts, run ComBat, and export both the raw and corrected CSV matrices."
- "Plot PCA embeddings before and after batch correction to confirm that batches 1–3 overlap."
- "Save the harmonised AnnData file so I can reload it later for downstream DEG analysis."
References
- Tutorial notebook:
t_bulk_combat.ipynb - Example inputs:
omicverse_guide/docs/Tutorials-bulk/data/combat/ - Quick copy/paste commands:
reference.md