| name | scarches-complete |
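| description | scArches single-cell deep learning reference atlas framework - 100% documentation coverage (26 HTML files, including the complete API, tutorials, model training, and multi-modal integration) |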
Scarches-Complete Skill
Comprehensive assistance with scArches (single-cell architecture surgery) development, generated from official documentation. scArches enables integration of newly produced single-cell datasets into integrated reference atlases through decentralized training and model surgery.
When to Use This Skill
This skill should be triggered when:
- Building reference atlases using scVI, trVAE, scANVI, totalVI, or expiMap models
- Mapping query datasets to existing reference atlases for cell type annotation
- Performing cell type label transfer from reference to query datasets
- Integrating multi-modal data (CITE-seq, scRNA-seq + ATAC, TCR + transcriptome)
- Analyzing spatial transcriptomics data with SageNet
- Working with gene programs and pathway analysis using expiMap
- Training deep generative models for single-cell data integration
- Debugging scArches models or optimization issues
- Learning best practices for single-cell reference mapping
Quick Reference
Essential Code Patterns
Import and Setup
import warnings
warnings.simplefilter(action='ignore')
import scanpy as sc
import torch
import scarches as sca
import numpy as np
import gdown
Reference Model Training (expiMap)
# Prepare data with gene annotations
sca.utils.add_annotations(adata, 'reactome.gmt', min_genes=12, clean=True)
adata._inplace_subset_var(adata.varm['I'].sum(1)>0)
# Initialize and train model
intr_cvae = sca.models.EXPIMAP(
    adata=adata,
    condition_key='study',
    hidden_layer_sizes=[256, 256, 256],
    recon_loss='nb'
)
# Train with early stopping
early_stopping_kwargs = {
    "early_stopping_metric": "val_unweighted_loss",
    "threshold": 0,
    "patience": 50,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}
intr_cvae.train(
    n_epochs=400,
    alpha_epoch_anneal=100,
    alpha=0.7,
    alpha_kl=0.5,
    early_stopping_kwargs=early_stopping_kwargs,
    use_early_stopping=True
)
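To make the roles of `patience`, `lr_patience`, and `lr_factor` more concrete, here is a minimal, library-independent sketch of patience-based early stopping with learning-rate reduction. It reflects an assumption about how such settings typically interact and is not scArches' own implementation; the function name `simulate_early_stopping` and the loss values are hypothetical.

```python
def simulate_early_stopping(val_losses, patience, lr_patience, lr_factor,
                            lr=1e-3, threshold=0.0):
    """Walk through validation losses and apply patience-based rules.

    Returns (stop_epoch, final_lr, epochs_where_lr_was_reduced).
    """
    best = float("inf")
    epochs_since_best = 0      # drives early stopping
    epochs_since_lr_cut = 0    # drives learning-rate reduction
    lr_cuts = []
    for epoch, loss in enumerate(val_losses):
        if best - loss > threshold:
            # Improvement: record it and reset both counters.
            best = loss
            epochs_since_best = 0
            epochs_since_lr_cut = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_cut += 1
        if epochs_since_best >= patience:
            # No improvement for `patience` epochs: stop training.
            return epoch, lr, lr_cuts
        if epochs_since_lr_cut >= lr_patience:
            # Plateau shorter than `patience`: shrink the learning rate first.
            lr *= lr_factor
            lr_cuts.append(epoch)
            epochs_since_lr_cut = 0
    return len(val_losses) - 1, lr, lr_cuts


# Hypothetical run: the loss improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.7] * 10
stop_epoch, final_lr, lr_cuts = simulate_early_stopping(
    losses, patience=5, lr_patience=2, lr_factor=0.1
)
```

Because `lr_patience` (13) is smaller than `patience` (50) in the settings above, the learning rate would be reduced several times on a plateau before training stops.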
Query Dataset Mapping
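# Adapt the pretrained reference model to the query data (architecture surgery)
model = sca.models.SCANVI.load_query_data(adata_query, reference_model)
# Fine-tune on query data; SCANVI follows the scvi-tools training API
# (max_epochs and plan_kwargs rather than n_epochs and lr)
model.train(
    max_epochs=100,
    train_size=1.0,
    plan_kwargs={"lr": 1e-4, "weight_decay": 0.0}
)
# Get latent representation
latent = model.get_latent_representation()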
Cell Type Label Transfer
# Train weighted KNN classifier on the reference embedding stored in obsm['X_emb']
knn_model = sca.utils.weighted_knn_trainer(
    train_adata,
    train_adata_emb='X_emb',
    n_neighbors=50
)
# Transfer labels to query; returns predicted labels and per-cell uncertainties
labels, uncert = sca.utils.weighted_knn_transfer(
    query_adata,
    query_adata_emb='X_emb',
    ref_adata_obs=train_adata.obs,
    label_keys='cell_type',
    knn_model=knn_model,
    threshold=0.5
)
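The idea behind weighted KNN label transfer can be sketched without any library: each reference neighbor votes for its label with a weight that decays with distance, and a query cell whose winning label carries too little of the total weight is reported as unknown. The following is a simplified illustration under stated assumptions (Gaussian kernel, bandwidth set to the mean neighbor distance, hypothetical function name `transfer_label`); it is not the weighting scheme scArches itself uses.

```python
import math
from collections import defaultdict


def transfer_label(query_point, ref_points, ref_labels, n_neighbors=3,
                   threshold=0.5):
    """Return (label, uncertainty) for one query embedding."""
    # Euclidean distance from the query to every reference embedding.
    dists = [math.dist(query_point, p) for p in ref_points]
    nearest = sorted(range(len(ref_points)), key=dists.__getitem__)[:n_neighbors]

    # Gaussian kernel; the bandwidth choice here is an assumption of this sketch.
    bandwidth = max(sum(dists[i] for i in nearest) / len(nearest), 1e-12)
    votes = defaultdict(float)
    for i in nearest:
        votes[ref_labels[i]] += math.exp(-(dists[i] / bandwidth) ** 2)

    total = sum(votes.values())
    best_label = max(votes, key=votes.get)
    # Uncertainty is the share of weight that did NOT go to the winning label.
    uncertainty = 1.0 - votes[best_label] / total
    if uncertainty > threshold:
        return "Unknown", uncertainty
    return best_label, uncertainty


# Toy 2-D embedding with two well-separated cell types.
ref_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell"]

label, uncertainty = transfer_label((0.05, 0.05), ref_points, ref_labels)
```

A query that sits between the two clusters receives votes from both labels, so its uncertainty rises and a stricter `threshold` turns the prediction into "Unknown".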
Multi-modal Integration (mvTCR)
# Initialize mvTCR model for TCR + transcriptome
model = sca.models.mvTCR.models.mixture_modules.moe.MoEModel(
    adata_train,
    params_architecture,
    balanced_sampling='clonotype',
    metadata=['clonotype', 'Sample', 'Type'],
    conditional='Cohort'
)
# Train model
model.train(n_epochs=200, early_stop=5)
Model Sharing with Zenodo
# Upload trained model to Zenodo
# (check the module path and argument names against your installed scArches version)
download_link = sca.zenodo.upload_model(
    model=trained_model,
    deposition_id='your_deposition_id',
    access_token='your_token',
    model_name='my_scarches_model'
)
# Download the model from Zenodo; the call returns the directory it was extracted to
extract_dir = sca.zenodo.download_model(
    download_link,
    save_path='models/',
    make_dir=True
)
Reference Files
This skill includes comprehensive documentation in references/:
Core Documentation
api_reference.md - Complete API reference for all scArches functions and classes
- Zenodo integration utilities for model sharing
- Utility functions for annotations and KNN classification
- Model training and inference methods
getting_started.md - Installation and introduction guide
- Installation via pip, conda, or from source
- Overview of scArches capabilities and model types
- Quick start examples and basic workflow
training_tips.md - Best practices for model training
- Loss function selection (nb, zinb, mse)
- Hyperparameter recommendations
- Architecture guidance for different data complexities
Advanced Tutorials
tutorials_advanced.md - Specialized model tutorials
- mvTCR: Multi-modal TCR + transcriptome integration
- Human Lung Cell Atlas mapping and classification
- Advanced cell type label transfer techniques
tutorials_surgery_pipeline.md - Complete surgery workflow
- Reference model construction
- Query dataset preparation and mapping
- Joint analysis and visualization
tutorials_treearches.md - Hierarchical cell type analysis
- Tree-based cell type discovery
- Novel cell state identification
- Hierarchical annotation transfer
Working with This Skill
For Beginners
- Start with getting_started.md to understand scArches concepts and installation
- Follow the basic surgery pipeline for your first reference mapping project
- Use the Quick Reference examples as templates for common tasks
- Consult training_tips.md before training your first models
For Intermediate Users
- Explore api_reference.md for detailed function documentation
- Try advanced tutorials for specialized applications (multi-modal, spatial)
- Use Zenodo integration for model sharing and collaboration
- Experiment with different models (scVI, scANVI, expiMap, totalVI) based on your data
For Advanced Users
- Dive into model-specific tutorials for complex use cases
- Optimize hyperparameters using training tips and experimentation
- Implement custom workflows using the comprehensive API
- Contribute models to community atlases using Zenodo sharing
Key Concepts
Model Types
- scVI: Count-based integration using raw counts, assumes NB/ZINB distribution
- trVAE: Supports normalized or count data with MMD loss for better integration
- scANVI: Requires cell type labels for reference, enables classification
- expiMap: Incorporates gene programs for interpretable representation learning
- totalVI: Multi-modal CITE-seq reference construction
- treeArches: Hierarchical cell type discovery and novel state identification
- SageNet: Spatial transcriptomics mapping to coordinate frameworks
- mvTCR: T-cell receptor + transcriptome joint analysis
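The list above can be read as a rough decision table. The sketch below encodes that reading as a small helper; the function name `suggest_model`, its parameters, and the order in which the rules are checked are assumptions made for illustration, not part of scArches.

```python
def suggest_model(modality="rna", has_cell_type_labels=False,
                  raw_counts=True, wants_gene_programs=False,
                  wants_hierarchy=False):
    """Pick a model name from the list above using simple heuristics."""
    # Modality-specific models come first.
    by_modality = {
        "cite-seq": "totalVI",
        "tcr": "mvTCR",
        "spatial": "SageNet",
    }
    if modality in by_modality:
        return by_modality[modality]
    # Task-specific models for plain scRNA-seq.
    if wants_gene_programs:
        return "expiMap"
    if wants_hierarchy:
        return "treeArches"
    # General-purpose integration models.
    if raw_counts:
        return "scANVI" if has_cell_type_labels else "scVI"
    return "trVAE"


choice = suggest_model(has_cell_type_labels=True)
```

Here `choice` is `"scANVI"`, since labeled raw-count scRNA-seq data matches the scANVI entry.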
Core Workflow
- Reference Construction: Train model on integrated reference dataset
- Model Surgery: Adapt pretrained model for query datasets
- Query Mapping: Project query data into reference latent space
- Downstream Analysis: Clustering, classification, trajectory analysis
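The model surgery step can be pictured as follows: the condition (batch) input of the pretrained network gains extra weights for the new query conditions, and only those new weights are trained while the reference weights stay frozen. The sketch below illustrates that idea with plain Python lists. It assumes this simplified view of surgery; scArches' real implementation works on PyTorch modules and differs in detail, and both function names are hypothetical.

```python
def extend_condition_weights(weights, n_new_conditions):
    """Append zero-initialised columns for new query conditions.

    `weights` is a list of rows, one weight per known condition in each row.
    Returns (extended_weights, trainable_mask); the mask is True only for
    the newly added columns.
    """
    n_old = len(weights[0])
    extended = [row + [0.0] * n_new_conditions for row in weights]
    mask = [[col >= n_old for col in range(n_old + n_new_conditions)]
            for _ in weights]
    return extended, mask


def masked_update(weights, grads, mask, lr=0.1):
    """Gradient step that changes only entries where the mask is True."""
    return [[w - lr * g if m else w
             for w, g, m in zip(w_row, g_row, m_row)]
            for w_row, g_row, m_row in zip(weights, grads, mask)]


# Reference model trained on two conditions; the query adds a third one.
ref_weights = [[0.5, -0.2], [0.1, 0.3]]
extended, mask = extend_condition_weights(ref_weights, n_new_conditions=1)
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
updated = masked_update(extended, grads, mask)
```

After the update, the first two columns still hold the reference weights and only the third column has moved, which is why the reference latent space is preserved during query mapping.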
Data Requirements
- Raw counts preferred for scVI, scANVI, totalVI
- Normalized data acceptable for trVAE (set recon_loss='mse')
- Highly variable genes: Minimum 2000, increase to 5000 for complex datasets
- Cell type labels: Required for scANVI reference, optional for query
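These requirements can be collected into a small pre-flight check. The helper below is a hedged sketch: the name `suggest_settings`, the returned keys, and the choice of `'nb'` for trVAE on count data are assumptions for illustration and do not correspond to a scArches function.

```python
def suggest_settings(model, raw_counts, complex_dataset=False):
    """Translate the data requirements above into a small settings dict."""
    settings = {
        # 2000 highly variable genes as the baseline, 5000 for complex datasets.
        "n_top_genes": 5000 if complex_dataset else 2000,
        "recon_loss": None,
        "warnings": [],
    }
    if model == "trVAE":
        # 'mse' for normalized data; 'nb' for counts is an assumption of this sketch.
        settings["recon_loss"] = "nb" if raw_counts else "mse"
    elif model in {"scVI", "scANVI", "totalVI"} and not raw_counts:
        settings["warnings"].append(f"{model} works best with raw counts")
    if model == "scANVI":
        settings["warnings"].append("reference data needs cell type labels")
    return settings


trvae_settings = suggest_settings("trVAE", raw_counts=False)
scanvi_settings = suggest_settings("scANVI", raw_counts=False, complex_dataset=True)
```

The resulting `n_top_genes` value could then be passed to a highly variable gene selection step before model setup.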
Resources
references/
Comprehensive documentation extracted from official sources containing:
- Detailed API documentation with parameter descriptions
- Step-by-step tutorials with real datasets
- Code examples with proper syntax highlighting
- Links to original documentation for further reading
scripts/
Add helper scripts for:
- Data preprocessing pipelines
- Model training automation
- Batch effect evaluation
- Visualization utilities
assets/
Store:
- Example datasets and preprocessing results
- Trained model checkpoints
- Configuration templates
- Visualization templates
Notes
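- This skill was generated from a locally served copy of the official scArches documentation (http://127.0.0.1:9180); that address is only reachable on the machine that produced the scrape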
- Reference files preserve original structure and examples
- All code examples extracted from actual tutorials and API docs
- Training recommendations based on empirical best practices
Updating
To refresh this skill with updated documentation:
- Re-run the documentation scraper with current scArches version
- Update reference files with latest API changes and tutorials
- Verify code examples against newest scArches release
- Test training workflows with updated hyperparameters
Common Use Cases
Cell Type Annotation
# Map query to reference and transfer labels
query_adata = sc.read_h5ad('query_data.h5ad')  # read with scanpy (imported as sc)
model = sca.models.SCANVI.load_query_data(query_adata, ref_model)
model.train(max_epochs=400)
predictions = model.predict(query_adata)
Multi-modal Integration
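# CITE-seq data integration (the batch and protein key names are placeholders for your data)
sca.models.TOTALVI.setup_anndata(adata, batch_key='batch', protein_expression_obsm_key='protein_expression')
model = sca.models.TOTALVI(adata)
model.train()
# totalVI learns one joint latent space for RNA and protein
latent = model.get_latent_representation()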
Spatial Mapping
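# Map dissociated scRNA-seq cells onto a spatial reference
# (sketch after the SageNet tutorial; 'class_' and the tag are placeholders, verify the calls against your scArches version)
sg_obj = sca.models.sagenet(device=torch.device('cpu'))
sg_obj.train(spatial_ref, comm_columns=['class_'], tag='spatial_ref', epochs=15)
sg_obj.map_query(query_sc)  # mapping results are stored on query_sc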
Gene Program Analysis
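# Analyze a query in the context of known pathways: map it onto a trained expiMap reference,
# then read out gene-program activities (expimap_model is the trained reference; 'study' is its condition key)
q_model = sca.models.EXPIMAP.load_query_data(query_data, expimap_model)
q_model.train(n_epochs=400, alpha_epoch_anneal=100, alpha_kl=0.1, weight_decay=0.0)
gp_activities = q_model.get_latent(query_data.X, query_data.obs['study'], mean=False, only_active=True)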