| name | single-cell-annotation-skills-with-omicverse |
| title | Single-cell annotation skills with omicverse |
| description | Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities. |
Single-cell annotation skills with omicverse
Overview
Use this skill to reproduce and adapt the single-cell annotation playbook captured in the omicverse tutorials: SCSA (`t_cellanno.ipynb`), MetaTiME (`t_metatime.ipynb`), CellVote (`t_cellvote.md` and `t_cellvote_pbmc3k.ipynb`), CellMatch (`t_cellmatch.ipynb`), GPTAnno (`t_gptanno.ipynb`), and label transfer (`t_anno_trans.ipynb`). Each section below lists the required inputs, the training and inference steps, and how to read the outputs.
Instructions
SCSA automated cluster annotation
- Data requirements: PBMC3k raw counts from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) or the processed `sample/rna.h5ad`. Download instructions are embedded in the notebook; unpack to `data/filtered_gene_bc_matrices/hg19/`. Ensure an SCSA SQLite database is available (e.g. `pySCSA_2024_v1_plus.db` from the Figshare/Drive links listed in the tutorial) and point `model_path` to its location.
- Preprocessing & model fit: Load with `sc.read_10x_mtx`, run QC (`ov.pp.qc`), normalization and HVG selection (`ov.pp.preprocess`), scaling (`ov.pp.scale`), PCA (`ov.pp.pca`), neighbors, Leiden clustering, and compute rank markers (`sc.tl.rank_genes_groups`). Instantiate `scsa = ov.single.pySCSA(...)`, choosing `target='cellmarker'` or `'panglaodb'`, the tissue scope, and the thresholds (`foldchange`, `pvalue`).
- Inference & interpretation: Call `scsa.cell_anno(clustertype='leiden', result_key='scsa_celltype_cellmarker')` or `scsa.cell_auto_anno` to append predictions to `adata.obs`. Compare them with manual marker-based labels via `ov.utils.embedding` or `sc.pl.dotplot`, inspect marker dictionaries (`ov.single.get_celltype_marker`), and query supported tissues with `scsa.get_model_tissue()`. Use the Ro/e (ratio of observed to expected cell numbers) and proportion helpers (`ov.utils.roe`, `ov.utils.plot_cellproportion`) to validate abundance trends.
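
The abundance check mentioned in the last bullet rests on a simple ratio: observed cell counts divided by the counts expected if group and cell type were independent. The following is a minimal, standard-library sketch of that ratio, not the implementation behind `ov.utils.roe`; the function name `ro_e` and the toy labels are hypothetical.

```python
from collections import Counter


def ro_e(groups, cell_types):
    """Return {(group, cell_type): observed / expected} for paired labels.

    Expected counts assume independence between group and cell type:
    expected = group_total * cell_type_total / number_of_cells.
    """
    if len(groups) != len(cell_types):
        raise ValueError("groups and cell_types must have the same length")
    observed = Counter(zip(groups, cell_types))
    group_totals = Counter(groups)
    type_totals = Counter(cell_types)
    n_cells = len(groups)
    ratios = {}
    for group in group_totals:
        for cell_type in type_totals:
            expected = group_totals[group] * type_totals[cell_type] / n_cells
            ratios[(group, cell_type)] = observed[(group, cell_type)] / expected
    return ratios


# Hypothetical toy labels: two samples, two annotated cell types.
samples = ["ctrl", "ctrl", "ctrl", "ctrl", "stim", "stim", "stim", "stim"]
labels = ["T cell", "T cell", "T cell", "B cell",
          "T cell", "B cell", "B cell", "B cell"]
ratios = ro_e(samples, labels)
for key in sorted(ratios):
    print(key, ratios[key])
```

Values above 1 indicate a cell type that is enriched in a group relative to expectation, values below 1 indicate depletion.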
MetaTiME tumour microenvironment states
- Data requirements: Batched TME AnnData with an scVI latent embedding. The tutorial uses `TiME_adata_scvi.h5ad` from Figshare (https://figshare.com/ndownloader/files/41440050). If starting from counts, run scVI (`scvi.model.SCVI`) first to populate `adata.obsm['X_scVI']`.
- Preprocessing & model fit: Optionally subset to non-malignant cells via `adata.obs['isTME']`. Rebuild neighbors on the latent representation (`sc.pp.neighbors(adata, use_rep="X_scVI")`) and embed with pymde (`adata.obsm['X_mde'] = ov.utils.mde(...)`). Initialise `TiME_object = ov.single.MetaTiME(adata, mode='table')` and, if finer granularity is desired, over-cluster with `TiME_object.overcluster(resolution=8, clustercol='overcluster')`.
- Inference & interpretation: Run `TiME_object.predictTiME(save_obs_name='MetaTiME')` to assign minor states and `Major_MetaTiME`. Visualise using `TiME_object.plot` or `sc.pl.embedding`. Interpret the outputs by comparing cluster-level distributions and confirming that the `MetaTiME` and `Major_MetaTiME` columns align with expected niches.
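
Comparing cluster-level distributions can be as simple as tabulating, for each over-cluster, the fraction of cells assigned to each state. The sketch below uses plain Python lists standing in for `adata.obs['overcluster']` and `adata.obs['Major_MetaTiME']`; the helper `label_fractions`, the state names, and the 0.8 purity cut-off are illustrative assumptions rather than part of MetaTiME.

```python
from collections import Counter, defaultdict


def label_fractions(clusters, labels):
    """Return {cluster: {label: fraction of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return {
        cluster: {label: n / sum(counter.values()) for label, n in counter.items()}
        for cluster, counter in counts.items()
    }


# Hypothetical values; real state names come from the MetaTiME output.
overcluster = ["0", "0", "0", "0", "1", "1"]
major_state = ["T", "T", "T", "Myeloid", "B", "B"]

fractions = label_fractions(overcluster, major_state)

# Flag clusters whose dominant state covers less than 80% of cells
# (the threshold is an arbitrary choice for illustration).
mixed = [c for c, f in fractions.items() if max(f.values()) < 0.8]
print(fractions)
print("mixed clusters:", mixed)
```

Clusters flagged as mixed are natural candidates for a closer look at marker expression before the labels are accepted.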
CellVote consensus labelling
- Data requirements: A clustered AnnData (e.g. PBMC3k at the path given by the `CELLVOTE_PBMC3K` environment variable, or `data/pbmc3k.h5ad`) plus at least two precomputed annotation columns (simulated in the tutorial as `scsa_annotation`, `gpt_celltype`, `gbi_celltype`). Prepare per-cluster marker genes via `sc.tl.rank_genes_groups`.
- Preprocessing & model fit: After standard preprocessing (normalize, log1p, HVGs, PCA, neighbors, Leiden), build a marker dictionary with `marker_dict = top_markers_from_rgg(adata, 'leiden', topn=10)` or via `ov.single.get_celltype_marker`. Instantiate `cv = ov.single.CellVote(adata)`.
- Inference & interpretation: Call `cv.vote(clusters_key='leiden', cluster_markers=marker_dict, celltype_keys=[...], species='human', organization='PBMC', provider='openai', model='gpt-4o-mini')`. Offline examples monkey-patch arbitration to avoid API calls; online voting requires valid credentials. Final consensus labels are written to `adata.obs['CellVote_celltype']`. Compare each cluster's majority vote with the input sources (`adata.obs[['leiden', 'scsa_annotation', ...]]`) to justify decisions.
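
To compare the consensus with the input sources, it helps to have a plain per-cluster majority vote as a baseline. The sketch below is not how `CellVote` arbitrates (the source describes an LLM-backed vote informed by marker genes); it only counts labels across annotation columns. The function `majority_vote` and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(clusters, *annotation_columns):
    """Return {cluster: (winning_label, share_of_votes)}.

    Every cell contributes one vote per annotation column.
    """
    votes = defaultdict(Counter)
    for column in annotation_columns:
        if len(column) != len(clusters):
            raise ValueError("every annotation column must match clusters in length")
        for cluster, label in zip(clusters, column):
            votes[cluster][label] += 1
    result = {}
    for cluster, counter in votes.items():
        label, n_votes = counter.most_common(1)[0]
        result[cluster] = (label, n_votes / sum(counter.values()))
    return result


# Hypothetical per-cell values for adata.obs['leiden'] and three annotation columns.
leiden = ["0", "0", "1", "1"]
scsa_annotation = ["T cell", "T cell", "B cell", "B cell"]
gpt_celltype = ["T cell", "T cell", "B cell", "Plasma cell"]
gbi_celltype = ["NK cell", "T cell", "B cell", "B cell"]

consensus = majority_vote(leiden, scsa_annotation, gpt_celltype, gbi_celltype)
for cluster, (label, share) in sorted(consensus.items()):
    print(cluster, label, round(share, 2))
```

A low vote share marks clusters where the sources disagree and where the arbitration result deserves the most scrutiny.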
CellMatch ontology mapping
- Data requirements: Annotated AnnData such as `pertpy.dt.haber_2017_regions()` with `adata.obs['cell_label']`. Download the Cell Ontology JSON (`cl.json`) via `ov.single.download_cl(...)` or the manual links, and optionally the Cell Taxonomy resource (`Cell_Taxonomy_resource.txt`). Ensure access to a SentenceTransformer model (`sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-base-en-v1.5`, etc.), downloaded to `local_model_dir` if working offline.
- Preprocessing & model fit: Create the mapper with `ov.single.CellOntologyMapper(cl_obo_file='new_ontology/cl.json', model_name='sentence-transformers/all-MiniLM-L6-v2', local_model_dir='./my_models')`. Run `mapper.map_adata(...)` to assign ontology-derived labels/IDs, optionally enabling taxonomy matching (`use_taxonomy=True` after calling `load_cell_taxonomy_resource`).
- Inference & interpretation: Explore mapping summaries (`mapper.print_mapping_summary_taxonomy`) and inspect embeddings coloured by `cell_ontology`, `cell_ontology_cl_id`, or `enhanced_cell_ontology`. Use helper queries such as `mapper.find_similar_cells('T helper cell')`, `mapper.get_cell_info(...)`, and category browsing to validate ontology coverage.
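
The mapping step amounts to a nearest-neighbour search: embed the free-text label, embed every ontology term, and pick the term with the highest cosine similarity. The toy sketch below replaces the SentenceTransformer embeddings with bag-of-words counts so it runs without any model download; it illustrates only the matching step and is not the `CellOntologyMapper` implementation. All function names and the term list are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Crude stand-in for a sentence embedding: lower-cased token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(label, ontology_terms):
    """Return (term, similarity) for the ontology term closest to the label."""
    query = bag_of_words(label)
    scored = [(cosine(query, bag_of_words(term)), term) for term in ontology_terms]
    score, term = max(scored)
    return term, score


# Hypothetical subset of ontology term names.
terms = ["enterocyte", "goblet cell", "paneth cell", "stem cell"]
term, score = best_match("Goblet", terms)
print(term, round(score, 3))
```

With real sentence embeddings the similarity also captures synonyms and paraphrases, which is why low-scoring matches are worth reviewing by hand.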
GPTAnno LLM-powered annotation
- Data requirements: The same PBMC3k dataset (raw matrix or `.h5ad`) and cluster assignments. Access to an LLM endpoint: configure `AGI_API_KEY` for OpenAI-compatible providers (`provider='openai'`, `'qwen'`, `'kimi'`, etc.), or supply a local model path for `ov.single.gptcelltype_local`.
- Preprocessing & model fit: Follow the QC, normalization, HVG, scaling, PCA, neighbor, Leiden, and marker discovery steps described above (reusing outputs from the SCSA workflow). Build the marker dictionary automatically with `ov.single.get_celltype_marker(adata, clustertype='leiden', rank=True, key='rank_genes_groups', foldchange=2, topgenenumber=5)`.
- Inference & interpretation: Invoke `ov.single.gptcelltype(...)`, specifying the tissue/species context and the desired provider/model. Post-process responses to keep clean labels (`result[key].split(': ')[-1]...`) and write them to `adata.obs['gpt_celltype']`. Compare embeddings (`ov.pl.embedding(..., color=['leiden','gpt_celltype'])`) to verify cluster identities. If operating offline, call `ov.single.gptcelltype_local` with a downloaded instruction-tuned checkpoint.
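
The post-processing step can be sketched as a small cleaning function plus a per-cell lookup. The response strings below are invented for illustration, and the cleaning rules after the first `split(': ')` (dropping a parenthetical note and a trailing period) are assumptions, since the exact chain used in the tutorial is elided above. The helpers `clean_label` and `labels_per_cell` are hypothetical.

```python
def clean_label(response):
    """Keep the text after the last ': ', then drop a parenthetical note
    and any trailing period."""
    label = response.split(": ")[-1]
    label = label.split(" (")[0]
    return label.strip().rstrip(".")


def labels_per_cell(cluster_ids, responses, default="Unknown"):
    """Map each cell's cluster id to its cleaned label."""
    cleaned = {cluster: clean_label(text) for cluster, text in responses.items()}
    return [cleaned.get(cluster, default) for cluster in cluster_ids]


# Hypothetical raw responses keyed by Leiden cluster.
responses = {
    "0": "Cluster 0: CD4+ T cells (high IL7R expression)",
    "1": "1: B cells.",
}
# Hypothetical per-cell cluster assignments; cluster "2" has no response.
cell_clusters = ["0", "1", "1", "2"]

gpt_labels = labels_per_cell(cell_clusters, responses)
print(gpt_labels)
```

In an AnnData workflow the resulting list would be assigned to `adata.obs['gpt_celltype']`; keeping an explicit default makes clusters without a usable response easy to spot.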
Weighted KNN annotation transfer
- Data requirements: Cross-modal GLUE outputs with aligned embeddings, e.g. `data/analysis_lymph/rna-emb.h5ad` (annotated RNA) and `data/analysis_lymph/atac-emb.h5ad` (query ATAC), where both contain `obsm['X_glue']`.
- Preprocessing & model fit: Load both modalities, optionally concatenate for QC plots, and compute a shared low-dimensional embedding with `ov.utils.mde`. Train a neighbour model with `knn_transformer = ov.utils.weighted_knn_trainer(train_adata=rna, train_adata_emb='X_glue', n_neighbors=15)`.
- Inference & interpretation: Transfer labels via `labels, uncert = ov.utils.weighted_knn_transfer(query_adata=atac, query_adata_emb='X_glue', label_keys='major_celltype', knn_model=knn_transformer, ref_adata_obs=rna.obs)`. Store predictions in `atac.obs['transf_celltype']` and uncertainties in `atac.obs['transf_celltype_unc']`; copy the predictions to `major_celltype` if you want consistent naming. Visualise (`ov.utils.embedding`) and inspect uncertainty to flag ambiguous cells.
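
The idea behind weighted KNN transfer is that each query cell inherits the label carrying the most weight among its nearest reference cells, and the leftover weight serves as an uncertainty score. The standard-library sketch below uses inverse-distance weights and defines uncertainty as one minus the winning label's weight share; both choices are assumptions made for illustration, and the kernel and uncertainty definition used by `ov.utils.weighted_knn_transfer` may differ. The function name, toy coordinates, and the 0.2 threshold are hypothetical.

```python
import math
from collections import defaultdict


def knn_label_transfer(ref_points, ref_labels, query_points, k=3):
    """Return (labels, uncertainties), one entry per query point.

    Each of the k nearest reference points votes with weight 1 / (distance + eps).
    Uncertainty is 1 minus the winning label's share of the total weight.
    """
    eps = 1e-9
    labels, uncertainties = [], []
    for query in query_points:
        neighbours = sorted(
            (math.dist(query, point), label)
            for point, label in zip(ref_points, ref_labels)
        )[:k]
        weights = defaultdict(float)
        for distance, label in neighbours:
            weights[label] += 1.0 / (distance + eps)
        best = max(weights, key=weights.get)
        labels.append(best)
        uncertainties.append(1.0 - weights[best] / sum(weights.values()))
    return labels, uncertainties


# Hypothetical 2-D coordinates standing in for a shared embedding such as X_glue.
rna_emb = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
rna_labels = ["T", "T", "B", "B"]
atac_emb = [(0.0, 0.5), (3.0, 3.5)]

pred, unc = knn_label_transfer(rna_emb, rna_labels, atac_emb, k=3)
ambiguous = [i for i, u in enumerate(unc) if u > 0.2]  # illustrative threshold
print(pred, [round(u, 3) for u in unc], ambiguous)
```

The first query cell sits inside the T-cell neighbourhood and gets a near-zero uncertainty, while the second lies between the two groups and is flagged as ambiguous.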
Examples
- "Run SCSA with both CellMarker and PanglaoDB references on PBMC3k, then benchmark against manual marker assignments before feeding the results into CellVote."
- "Annotate tumour microenvironment states in the MetaTiME Figshare dataset, highlight Major_MetaTiME classes, and export the label distribution per patient."
- "Download Cell Ontology resources, map `haber_2017_regions` clusters to ontology terms, and enrich ambiguous clusters using Cell Taxonomy hints."
- "Propagate RNA-derived `major_celltype` labels onto GLUE-integrated ATAC cells and report clusters with high transfer uncertainty."
References
- Tutorials and notebooks: `t_cellanno.ipynb`, `t_metatime.ipynb`, `t_cellvote.md`, `t_cellvote_pbmc3k.ipynb`, `t_cellmatch.ipynb`, `t_gptanno.ipynb`, `t_anno_trans.ipynb`.
- Sample data & assets: PBMC3k matrix from 10x Genomics, MetaTiME `TiME_adata_scvi.h5ad` (Figshare), SCSA database downloads, GLUE embeddings under `data/analysis_lymph/`, Cell Ontology `cl.json`, and the Cell Taxonomy resource.
- Quick copy commands: `reference.md`.