| name | scanpy-single-cell-analysis |
| description | Python-based single cell analysis using Scanpy for .h5ad files in conda scanpy environment |
Scanpy Single Cell Analysis Skill
This skill provides specialized capabilities for single cell RNA-seq analysis using Scanpy in the conda scanpy environment. It is specifically designed to work with .h5ad files.
🚨 最高优先级规则:强制使用Here Document (<< 'EOF')执行方式
⚠️ 必须遵循的执行规则
- 🚫 严格禁止: 使用
python -c "..."方式执行Python代码 - ✅ 强制要求: 所有Python代码必须使用Here Document格式执行
- 🔒 无例外: 任何情况下都不能偏离此执行方式
📋 强制执行模板
# ✅ 唯一允许的执行方式
source ~/miniconda3/etc/profile.d/conda.sh
conda activate scanpy
python << 'EOF'
# === Python代码必须写在EOF之间 ===
import scanpy as sc
import pandas as pd
import numpy as np
# 您的分析代码
print("Hello from scanpy!")
EOF
❌ 被严格禁止的执行方式
# 🚫 严格禁止:这些方式都不允许使用
python -c "import scanpy; print('error')"
python -c 'import scanpy; print("error")'
python << EOF # 🚫 必须使用单引号'EOF'
# code
EOF
🔍 为什么这是强制规则?
- 📖 多行代码支持: 复杂的单细胞分析需要多行脚本
- 🛡️ 无转义问题: 避免引号和特殊字符的转义地狱
- 🎨 代码格式保持: 缩进和换行完美保留
- 🔧 环境隔离: 确保在正确的conda环境中执行
- 🔒 字符串保护: 单引号'EOF'确保所有字符原样保留
🚨 重要原则:跨平台数据转换标准
数据转换黄金法则
- ✅ 只转换原始counts: 转换时使用
layer="counts",不转换标准化数据 - ✅ 目标平台标准化: 让目标平台(Scanpy)处理自己的标准化流程
- ✅ 状态自动检查: 每次加载后自动检查数据状态,避免重复标准化
- ✅ 失败即终止: 数据状态无法识别时立即终止任务
为什么这个原则很重要?
- 🎯 保证数据完整性: 原始counts是单细胞分析的生物学基础
- 🔄 确保一致性: 跨平台使用相同的标准化方法
- 🛡️ 防止错误: 自动检测避免重复标准化或数据损坏
- 📊 可重现性: 任何人都可以从counts重新开始分析
When to Use This Skill
This skill should be used when:
- Working with single cell RNA-seq data in .h5ad format
- Analyzing data in a Python environment with Scanpy
- Needing to perform standard single cell analysis workflows including quality control, normalization, clustering, and visualization
- Requiring batch correction and data integration using scVI
- Processing H5AD files converted from Seurat
Prerequisites
This skill requires:
- Conda environment named 'scanpy' with Scanpy and scvi-tools installed
- Single cell data in .h5ad format
- For scVI integration: batch information in the AnnData object
Core Capabilities
1. Data Loading
Load single cell data from .h5ad files using Scanpy.
2. Quality Control
Perform quality control analysis including:
- Cell filtering based on gene counts and mitochondrial percentage
- Visualization of QC metrics
3. Normalization and Feature Selection
- Normalize data using log normalization
- Identify highly variable genes
4. Dimensionality Reduction
- Scale data for PCA
- Perform principal component analysis
- Compute neighborhood graph
5. Clustering
- Cluster cells using the Louvain algorithm
- Visualize clusters with UMAP
6. Marker Identification
- Find marker genes for cell clusters
- Visualize gene expression patterns
7. scVI Integration Analysis
- Perform batch correction and data integration
- Generate latent representations for downstream analysis
- Obtain normalized gene expression values
Usage Instructions
Setting Up the Environment
Before using this skill, activate the scanpy conda environment:
conda activate scanpy
📋 标准执行模板(强制要求)
⚠️ 注意:以下模板是唯一允许的执行方式
# 所有Scanpy分析的标准模板(强制使用)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate scanpy
python << 'EOF'
# === 导入库 ===
import scanpy as sc
import pandas as pd
import numpy as np
# === 数据加载 ===
adata = sc.read_h5ad('input_file.h5ad')
# === 数据检查 ===
print(f'数据维度: {adata.shape}')
if 'celltype' in adata.obs.columns:
print(adata.obs['celltype'].value_counts())
# === 分析流程 ===
# ... 您的分析代码
# === 结果导出 ===
results_df.to_csv('output.csv', index=False)
EOF
🚨 重要提醒:技能中的所有代码示例都必须遵循Here Document执行方式!
Loading Data
⚠️ 重要: 每次加载H5AD后必须检查数据状态(第一步)
任何H5AD文件读取后,必须首先检查X矩阵是否为整数(原始counts):
import scanpy as sc
import numpy as np
def check_h5ad_data_state(adata, verbose=True):
"""
检查H5AD文件的X矩阵是否为整数(原始counts)
Args:
adata: AnnData对象
verbose: 是否显示详细信息
Returns:
str: 'raw_counts', 'log_normalized', 'scaled', or 'unknown'
"""
if verbose:
print(f'🔍 检查H5AD数据状态...')
print(f'数据维度: {adata.shape}')
# 检查X矩阵的基本信息
max_val = adata.X.max()
min_val = adata.X.min()
is_integer = np.all(adata.X == adata.X.astype(int))
data_type = str(adata.X.dtype)
if verbose:
print(f'X矩阵信息:')
print(f' 最大值: {max_val:.3f}')
print(f' 最小值: {min_val:.3f}')
print(f' 是否为整数: {is_integer}')
print(f' 数据类型: {data_type}')
# 检查样本值
if hasattr(adata.X, 'toarray'):
sample_vals = adata.X[:5, :5].toarray()
else:
sample_vals = adata.X[:5, :5]
print(f' 样本值 (前5x5): \n{sample_vals}')
# 检查layers
if hasattr(adata, 'layers') and adata.layers:
if verbose:
print(f'可用layers: {list(adata.layers.keys())}')
# 判断数据状态
if max_val > 20 and is_integer:
if verbose:
print('✅ 检测到原始counts数据')
return "raw_counts"
elif max_val <= 10 and not is_integer:
if verbose:
print('✅ 检测到log标准化数据')
return "log_normalized"
elif max_val > 10 and max_val < 50 and not is_integer:
if verbose:
print('⚠️ 检测到标准化但未log转换的数据')
return "scaled"
else:
if verbose:
print('❌ 数据状态未知')
return "unknown"
def load_and_validate_h5ad(file_path, verbose=True):
"""
加载H5AD文件并验证数据状态(标准流程)
Args:
file_path: H5AD文件路径
verbose: 是否显示详细信息
Returns:
tuple: (adata, needs_normalization)
Raises:
ValueError: 当数据状态无法识别时终止任务
"""
if verbose:
print(f"🔄 加载H5AD文件: {file_path}")
adata = sc.read_h5ad(file_path)
data_state = check_h5ad_data_state(adata, verbose)
if data_state == "raw_counts":
if verbose:
print("✅ 检测到原始counts数据,需要进行标准化")
return adata, True
elif data_state == "log_normalized":
if verbose:
print("✅ 数据已log标准化,可直接使用")
return adata, False
elif data_state == "scaled":
if verbose:
print("⚠️ 数据已标准化但未log转换,建议log转换")
return adata, True
else:
error_msg = f"❌ 数据状态未知: {data_state}。最大值: {adata.X.max():.3f}"
raise ValueError(error_msg)
# 标准使用流程
if __name__ == "__main__":
# 加载并验证数据
adata, needs_norm = load_and_validate_h5ad('your_data.h5ad')
# 根据数据状态决定后续处理
if needs_norm:
print("🔄 进行数据标准化...")
# 标准化流程
else:
print("✅ 跳过标准化,直接分析...")
# 直接分析流程
# 推荐的标准加载方式
adata, needs_norm = load_and_validate_h5ad('path/to/your/file.h5ad')
Quality Control Analysis
Perform quality control with the qc_analysis function:
# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)
# Filter cells based on QC metrics
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
Normalization and Feature Selection
⚠️ 重要: 根据数据状态决定是否标准化
# 推荐的标准化工作流
adata, needs_norm = load_and_validate_h5ad('path/to/your/file.h5ad')
# 条件标准化 - 只对原始counts进行标准化
if needs_norm:
print("🔄 进行数据标准化...")
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
# Log transform
sc.pp.log1p(adata)
print("✅ 标准化完成")
else:
print("✅ 数据已标准化,跳过标准化步骤")
# Identify highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
print(f"📊 识别到 {sum(adata.var.highly_variable)} 个高变基因")
为什么需要条件标准化?
- ✅ 避免在已标准化数据上重复标准化
- ✅ 确保分析的一致性和可重现性
- ✅ 防止数据损坏或信息丢失
- ✅ 自动适应不同来源的H5AD文件
Dimensionality Reduction
Scale data and perform PCA:
# Scale the data
sc.pp.scale(adata, max_value=10)
# Run PCA
sc.tl.pca(adata, svd_solver='arpack')
Clustering Analysis
Compute neighbors and find clusters:
# Compute neighbors
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# Find clusters
sc.tl.louvain(adata, resolution=0.5)
# Run UMAP for visualization
sc.tl.umap(adata)
sc.pl.umap(adata, color=['louvain'])
Marker Gene Identification
Find marker genes for clusters:
# Find marker genes (using default parameters)
sc.tl.rank_genes_groups(adata, 'louvain')
# Extract marker genes
markers = sc.get.rank_genes_groups_df(adata, None)
Gene Expression Visualization
Create dotplots to visualize gene expression across cell types:
# 📋 基因分组(字典格式)- 用户指定基因时需要整理成这种格式
gene_groups = {
'Fibro1': ['Gene1', 'Gene2', 'Gene3'],
'Fibro2': ['Gene4', 'Gene5', 'Gene6'],
'Fibro3': ['Gene7', 'Gene8', 'Gene9'],
'Fibro4': ['Gene10', 'Gene11', 'Gene12']
}
# Create dotplot with standardization
sc.pl.dotplot(adata,
var_names=gene_groups,
groupby='celltype.clean',
standard_scale='var') # Standardize each gene across cells
📋 Important Dotplot Parameters:
var_names: Genes to visualize (use dictionary format for grouped display)groupby: Column to group cells by (e.g., 'celltype.clean', 'louvain')standard_scale='var': Standardize each gene across cells (recommended for better visualization)use_raw=False: Use normalized data instead of raw countslog=True: Use log scale for expression valuescolor_map: Colormap for expression (default: 'Reds')
scVI Integration Analysis
Perform batch correction and integration using scvi-tools:
# Import scvi
import scvi
# Set up the AnnData object for scVI
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
# Initialize and train the scVI model
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()
# Get the latent representation
adata.obsm["X_scVI"] = scvi_model.get_latent_representation()
# Get normalized expression
adata.layers["scvi_normalized"] = scvi_model.get_normalized_expression()
# Perform clustering on the scVI latent representation
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.louvain(adata, key_added="louvain_scvi")
# Visualize with UMAP using scVI latent representation
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(adata, color=["louvain_scvi", "batch"])
References
- Scanpy documentation: https://scanpy.readthedocs.io/
- scvi-tools documentation: https://docs.scvi-tools.org/
- Python single cell analysis best practices