| name | tcga-bulk-data-preprocessing-with-omicverse |
| title | TCGA bulk data preprocessing with omicverse |
| description | Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files. |
TCGA bulk data preprocessing with omicverse
Overview
Follow this skill to recreate the preprocessing routine from t_tcga.ipynb. It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through ov.bulk.pyTCGA.
Instructions
- Gather required downloads
  - Confirm the user has:
    - gdc_sample_sheet.<date>.tsv from the TCGA Sample Sheet export.
    - The decompressed gdc_download_xxxxx directory containing expression archives.
    - The clinical.cart.<date> directory with clinical XML/JSON files.
  - Mention that sample data are available under omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/.
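Before constructing the helper, it can help to confirm that the three inputs exist. The following is a minimal sketch using only the standard library; the function name check_tcga_inputs is hypothetical and not part of omicverse.

```python
from pathlib import Path


def check_tcga_inputs(sample_sheet, download_dir, clinical_dir):
    """Return a list of problems; an empty list means the inputs look usable.

    Hypothetical helper (not an omicverse function). It only checks that
    the paths exist and that the directories are not empty.
    """
    problems = []
    sheet = Path(sample_sheet)
    if not sheet.is_file():
        problems.append(f"sample sheet not found: {sheet}")
    for label, folder in (("download", download_dir), ("clinical", clinical_dir)):
        folder = Path(folder)
        if not folder.is_dir():
            problems.append(f"{label} directory not found: {folder}")
        elif not any(folder.iterdir()):
            # An empty folder usually means the archive was never extracted.
            problems.append(f"{label} directory is empty: {folder}")
    return problems
```

If the returned list is not empty, report its entries to the user and stop before initialising the helper.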
- Initialise the TCGA helper
  - Import omicverse as ov (and scanpy as sc if plotting), then call ov.plot_set().
  - Instantiate aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir).
  - Run aml_tcga.adata_init() to build the AnnData object with raw counts, FPKM, and TPM layers.
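The initialisation step can be sketched as one small function. It uses only the calls named in this step; the wrapper name init_tcga is hypothetical, and the three arguments stand for the user's own paths.

```python
def init_tcga(sample_sheet_path, download_dir, clinical_dir):
    """Build a pyTCGA helper and populate its AnnData object.

    Hypothetical wrapper around the calls named in this skill.
    """
    # Imported inside the function so this module loads without omicverse.
    import omicverse as ov

    ov.plot_set()  # apply omicverse plotting defaults
    aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
    aml_tcga.adata_init()  # parse the expression archives into aml_tcga.adata
    return aml_tcga
```

After the call, the AnnData object is available as aml_tcga.adata.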
- Persist the dataset
  - Encourage saving the initial AnnData: aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip').
  - When reloading, reconstruct the class with the same paths and call aml_tcga.adata_read(<path>).
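A save and reload round trip might look like the sketch below. The function names save_raw and reload_raw are hypothetical; creating the parent folder first is an added convenience, not something the tutorial does.

```python
from pathlib import Path


def save_raw(aml_tcga, out_path):
    """Write the initialised AnnData to disk, creating the parent folder first."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    aml_tcga.adata.write_h5ad(str(out_path), compression="gzip")
    return out_path


def reload_raw(sample_sheet_path, download_dir, clinical_dir, h5ad_path):
    """Rebuild the helper with the original paths, then load the saved AnnData."""
    # Imported inside the function so this module loads without omicverse.
    import omicverse as ov

    aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
    aml_tcga.adata_read(str(h5ad_path))  # replaces the adata_init() step
    return aml_tcga
```

Reloading skips the slow archive parsing, so prefer it whenever a saved file already exists.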
- Initialise metadata and clinical information
  - Populate sample metadata using aml_tcga.adata_meta_init() to convert gene IDs to symbols and attach patient info.
  - Add survival attributes via aml_tcga.survial_init() (note the intentional spelling in the API).
- Perform survival analyses
  - Plot gene-level survival curves with aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True).
  - To process all genes, call aml_tcga.survial_analysis_all(); warn that it may take time.
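Starting from an initialised or reloaded helper, the metadata and per-gene survival steps can be chained as in this sketch. The wrapper name plot_gene_survival and its two guard checks are hypothetical additions. The gene check assumes that adata.var_names holds gene symbols once adata_meta_init() has run.

```python
def plot_gene_survival(aml_tcga, gene, layer="deseq_normalize"):
    """Attach metadata and survival fields, then run the per-gene analysis.

    Hypothetical wrapper; the guards below are not part of omicverse.
    """
    aml_tcga.adata_meta_init()  # gene IDs -> symbols, attach patient info
    aml_tcga.survial_init()     # spelling follows the omicverse API

    adata = aml_tcga.adata
    # Assumption: after adata_meta_init() the var index holds gene symbols.
    if gene not in adata.var_names:
        raise KeyError(f"gene {gene!r} not found in adata.var_names")
    if layer not in adata.layers:
        available = list(adata.layers.keys())
        raise KeyError(f"layer {layer!r} not found; available: {available}")
    return aml_tcga.survival_analysis(gene, layer=layer, plot=True)
```

Failing early with the list of available layers gives the user something concrete to act on when the requested layer is missing.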
- Export results
  - Save enriched metadata to a new AnnData file (aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')).
  - Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
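One way to share results outside Python is to write the gene-level table to CSV. The sketch below assumes the survival statistics end up as columns of aml_tcga.adata.var, which this skill does not confirm, so inspect the columns before exporting. The function name export_var_table is hypothetical.

```python
def export_var_table(aml_tcga, out_csv, columns=None):
    """Write adata.var, or a subset of its columns, to CSV and return the path.

    Hypothetical helper. Assumption: per-gene survival statistics are
    stored in adata.var; check aml_tcga.adata.var.columns first.
    """
    table = aml_tcga.adata.var  # a pandas DataFrame in AnnData
    if columns is not None:
        missing = [c for c in columns if c not in table.columns]
        if missing:
            raise KeyError(f"columns not in adata.var: {missing}")
        table = table[list(columns)]
    table.to_csv(out_csv)
    return out_csv
```

Passing columns=None exports the whole table, which is the safer default when the column names are not yet known.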
- Troubleshooting tips
  - Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
  - The helper expects matching case IDs between the sample sheet and expression files; direct users to re-download if the IDs do not align.
  - Survival plots require clinical dates; if they are absent, instruct users to check the clinical_cart contents.
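A quick alignment check can narrow down a mismatch between the sample sheet and the downloaded archives. The sketch below does not compare case IDs directly. As a proxy, it assumes the sample sheet has a tab-separated "File ID" column and that the download directory holds one subfolder per file ID, and it lists the sheet entries that have no folder. Both assumptions and the function name missing_file_ids are illustrative, so adjust id_column if the export differs.

```python
import csv
from pathlib import Path


def missing_file_ids(sample_sheet, download_dir, id_column="File ID"):
    """Return sample-sheet file IDs that have no matching download subfolder.

    Hypothetical helper. Assumes a tab-separated sheet with an id_column
    header and one subfolder per file ID inside download_dir.
    """
    with open(sample_sheet, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise KeyError(f"column {id_column!r} not found in {sample_sheet}")
        sheet_ids = [row[id_column] for row in reader]

    present = {p.name for p in Path(download_dir).iterdir() if p.is_dir()}
    return [file_id for file_id in sheet_ids if file_id not in present]
```

A non-empty result suggests an incomplete download or a sample sheet exported from a different cart.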
Examples
- "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- "Reload a saved AnnData file, attach survival annotations, and export the updated .h5ad."
- "Run survival analysis for all genes and store the enriched dataset."
References
- Tutorial notebook: t_tcga.ipynb
- Sample dataset: data/TCGA_OV/
- Quick copy/paste commands: reference.md