---
name: single-cell-preprocessing-with-omicverse
title: Single-cell preprocessing with omicverse
description: Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.
---
# Single-cell preprocessing with omicverse

## Overview
Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.
## Instructions
- Set up the environment
  - Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
  - Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
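  - A minimal setup sketch of the calls above (the `%` magics are notebook-only, so they appear here as comments):

    ```python
    # Notebook magics (run as cell magics in Jupyter):
    #   %load_ext autoreload
    #   %autoreload 2
    import scanpy as sc
    import omicverse as ov

    ov.plot_set(font_path='Arial')  # ov.ov_plot_set() in legacy notebooks
    ```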
- Prepare input data
  - Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
  - Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
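  - A loading sketch, assuming the tarball was extracted to the `data/` path above:

    ```python
    import scanpy as sc

    adata = sc.read_10x_mtx(
        'data/filtered_gene_bc_matrices/hg19/',  # folder with matrix.mtx / genes.tsv / barcodes.tsv
        var_names='gene_symbols',                # index genes by symbol rather than Ensembl ID
        cache=True,                              # write an .h5ad cache for faster re-reads
    )
    adata.var_names_make_unique()  # gene symbols can collide, so deduplicate
    ```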
- Perform quality control (QC)
  - Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU/CPU–GPU pipelines; omit `doublets_method` on pure GPU, where Scrublet is not yet supported.
  - Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
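  - A QC sketch for the CPU/CPU–GPU pipelines (note the parameter really is spelled `tresh` in the tutorials):

    ```python
    # Filter on mitochondrial fraction, UMI counts, detected genes, and Scrublet doublets
    adata = ov.pp.qc(
        adata,
        tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250},
        doublets_method='scrublet',  # omit on pure-GPU stacks
    )
    ```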
- Store raw counts before transformations
  - Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
- Normalise and select HVGs
  - Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson-residual HVG detection (set `target_sum=None` on GPU, which keeps the defaults).
  - For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store the reconstructed counts in `adata.layers['recover_counts']`.
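  - A combined sketch of the counts snapshot plus normalisation, as the notebooks sequence them:

    ```python
    # Keep the raw counts retrievable before any transformation
    ov.utils.store_layers(adata, layers='counts')

    # Shift-log normalisation + Pearson-residual HVG detection
    adata = ov.pp.preprocess(adata, mode='shiftlog|pearson',
                             n_HVGs=2000, target_sum=5e5)
    ```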
- Manage `.raw` and layer recovery
  - Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
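  - A sketch of the `.raw` snapshot and counts retrieval; the `highly_variable_features` flag is the column the tutorials show `ov.pp.preprocess` writing:

    ```python
    # Snapshot the normalised data, then subset to the selected HVGs
    adata.raw = adata
    adata = adata[:, adata.var.highly_variable_features]

    # Recover the stored raw counts into a separate object for comparison
    adata_counts = adata.raw.to_adata().copy()
    ov.utils.retrieve_layers(adata_counts, layers='counts')
    ```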
- Scale, reduce, and embed
  - Scale features using `ov.pp.scale(adata)` (layers hold the scaled matrices), followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
  - Construct neighbourhood graphs with:
    - `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
    - `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
    - `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
  - Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)`, depending on the notebook variant.
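  - A baseline-notebook sketch of the scale → PCA → neighbours → embedding chain:

    ```python
    ov.pp.scale(adata)                          # writes the 'scaled' layer
    ov.pp.pca(adata, layer='scaled', n_pcs=50)  # stores 'scaled|original|X_pca' in .obsm

    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
                    use_rep='scaled|original|X_pca')

    # MDE embedding computed from the PCA representation
    adata.obsm['X_mde'] = ov.utils.mde(adata.obsm['scaled|original|X_pca'])
    ```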
- Cluster and annotate
  - Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour-graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
  - IMPORTANT - Defensive checks: when generating code that plots by clustering results (e.g., `color='leiden'`), always check that the clustering has been performed first:

    ```python
    # Check whether leiden clustering exists; if not, run it
    if 'leiden' not in adata.obs:
        if 'neighbors' not in adata.uns:
            ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
        ov.single.leiden(adata, resolution=1.0)
    ```

  - Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column passed to the `color=` parameter exists in `adata.obs` before plotting.
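  - A guarded plotting sketch following the check above (`basis='X_mde'` and `frameon='small'` are assumptions borrowed from the tutorials' plotting style):

    ```python
    # Plot only a key that actually exists in .obs
    color_key = 'leiden'
    if color_key in adata.obs.columns:
        ov.pl.embedding(adata, basis='X_mde', color=color_key, frameon='small')
    ```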
- Document outputs
  - Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and exporting figures with Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
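  - A save sketch (file paths are illustrative):

    ```python
    import matplotlib.pyplot as plt

    adata.write('write/pbmc3k_preprocessed.h5ad')  # intermediate AnnData
    plt.savefig('write/pbmc3k_mde.png', dpi=300,
                bbox_inches='tight')               # current figure, e.g. the embedding
    ```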
- Notebook-specific notes
  - Baseline (`t_preprocess.ipynb`): focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after the `retrieve_layers` demonstrations.
  - CPU–GPU mixed (`t_preprocess_cpu.ipynb`): highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
  - GPU (`t_preprocess_gpu.ipynb`): requires a CUDA-capable GPU, the RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
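  - A GPU-variant sketch of the transfer round trip (assumes a working RAPIDS stack):

    ```python
    ov.pp.anndata_to_GPU(adata)   # move matrices to GPU memory

    ov.pp.neighbors(adata, n_neighbors=15,
                    use_rep='scaled|original|X_pca', method='cagra')
    ov.pp.umap(adata)

    ov.pp.anndata_to_CPU(adata)   # bring results back for plotting/saving
    ```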
- Troubleshooting tips
  - If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
  - Address GPU import errors by confirming the conda environment matches the RAPIDS version for the installed CUDA driver (`nvidia-smi`).
  - For `ov.pp.preprocess` dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
  - When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
  - Pipeline dependency errors: when encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
    - Always check that required preprocessing steps (neighbours, PCA) exist before running dependent operations.
    - Check that clustering results exist in `adata.obs` before trying to colour plots by them.
    - Use defensive checks in generated code to handle incomplete pipelines gracefully (see the helper sketch after this list).
  - Code generation best practice: generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution; users may run steps in separate sessions or skip intermediate steps.
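A hypothetical helper bundling the prerequisite checks above (the name `ensure_clustered` and its defaults are illustrative, not omicverse API):

```python
def ensure_clustered(adata, rep='scaled|original|X_pca', resolution=1.0):
    """Run any missing prerequisites so plotting by 'leiden' cannot fail."""
    if rep not in adata.obsm:
        ov.pp.scale(adata)
        ov.pp.pca(adata, layer='scaled', n_pcs=50)
    if 'neighbors' not in adata.uns:
        ov.pp.neighbors(adata, n_neighbors=15, use_rep=rep)
    if 'leiden' not in adata.obs:
        ov.single.leiden(adata, resolution=resolution)
```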
## Critical API Reference - Batch Column Handling

### Batch Column Validation - REQUIRED Before Batch Operations

IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.

CORRECT usage:
```python
# Step 1: Check whether the batch column exists; create a default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch-correction methods
```
WRONG - DO NOT USE:

```python
# WRONG! Using the batch column without validation can cause NaN errors
# ov.pp.combat(adata, batch='batch')  # may fail if batch has NaN values!

# WRONG! Assuming the batch column exists
# adata.obs['batch'].unique()  # KeyError if the column doesn't exist!
```
### Common Batch-Related Pitfalls

- NaN values in the batch column: always use `fillna()` before batch operations.
- Missing batch column: always check existence before use.
- Non-categorical batch: convert to category for memory efficiency.
- Mixed data types: ensure a consistent string type before categorisation.
```python
# Complete defensive batch preparation pattern:
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare a batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    # Cast to object first: fillna on an existing categorical column raises
    # if 'unknown' is not already one of its categories
    adata.obs[batch_key] = (
        adata.obs[batch_key]
        .astype(object)
        .fillna('unknown')
        .astype(str)
        .astype('category')
    )
    return adata
```
## Highly Variable Genes (HVG) - Small Dataset Handling

### LOESS Failure with Small Batches

IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as:

```
ValueError: Extrapolation not allowed with blending
```

CORRECT - use a try/except fallback pattern:
```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',   # expects raw counts
        n_top_genes=2000,
        batch_key='batch'     # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fall back to a simpler method for small datasets
        # (note: the 'seurat' flavor expects log-normalised data)
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # works with any size
            n_top_genes=2000
        )
    else:
        raise
```
Alternative - use the `cell_ranger` flavor for batch-aware HVG:

```python
# The cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # no LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```
### Best Practices for Batch-Aware HVG

- Check batch sizes before HVG selection: small batches (<500 cells) will cause LOESS to fail.
- Prefer `seurat` or `cell_ranger` when batch sizes vary significantly.
- Use `seurat_v3` only when all batches have >500 cells.
- Always wrap in try/except when the dataset size is unknown.
```python
# Safe batch-aware HVG pattern
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with an automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata, flavor='seurat_v3', n_top_genes=n_top_genes, batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches (LOESS failure)
        sc.pp.highly_variable_genes(
            adata, flavor='seurat', n_top_genes=n_top_genes
        )
```
## Examples

- "Download PBMC3k counts, run QC with Scrublet, normalise with `shiftlog|pearson`, and compute MDE + UMAP embeddings on CPU."
- "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell-cycle phases before Leiden clustering."
- "Provision a RAPIDS environment, transfer the AnnData to GPU, run `method='cagra'` neighbours, and return the embeddings to CPU for plotting."
## References
- Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
- Quick copy/paste commands: `reference.md`