Claude Code Plugins

Community-maintained marketplace

Feedback

scanpy-single-cell-analysis

@Ketomihine/my_skills
0
0

Python-based single cell analysis using Scanpy for .h5ad files in conda scanpy environment

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name scanpy-single-cell-analysis
description Python-based single cell analysis using Scanpy for .h5ad files in conda scanpy environment

Scanpy Single Cell Analysis Skill

This skill provides specialized capabilities for single cell RNA-seq analysis using Scanpy in the conda scanpy environment. It is specifically designed to work with .h5ad files.

🚨 最高优先级规则:强制使用Here Document (<< 'EOF')执行方式

⚠️ 必须遵循的执行规则

  • 🚫 严格禁止: 使用 python -c "..." 方式执行Python代码
  • 强制要求: 所有Python代码必须使用Here Document格式执行
  • 🔒 无例外: 任何情况下都不能偏离此执行方式

📋 强制执行模板

# ✅ 唯一允许的执行方式
source ~/miniconda3/etc/profile.d/conda.sh
conda activate scanpy

python << 'EOF'
# === Python代码必须写在EOF之间 ===
import scanpy as sc
import pandas as pd
import numpy as np

# 您的分析代码
print("Hello from scanpy!")
EOF

❌ 被严格禁止的执行方式

# 🚫 严格禁止:这些方式都不允许使用
python -c "import scanpy; print('error')"
python -c 'import scanpy; print("error")'
python << EOF  # 🚫 必须使用单引号'EOF'
# code
EOF

🔍 为什么这是强制规则?

  1. 📖 多行代码支持: 复杂的单细胞分析需要多行脚本
  2. 🛡️ 无转义问题: 避免引号和特殊字符的转义地狱
  3. 🎨 代码格式保持: 缩进和换行完美保留
  4. 🔧 环境隔离: 确保在正确的conda环境中执行
  5. 🔒 字符串保护: 单引号'EOF'确保所有字符原样保留

🚨 重要原则:跨平台数据转换标准

数据转换黄金法则

  • 只转换原始counts: 转换时使用layer="counts",不转换标准化数据
  • 目标平台标准化: 让目标平台(Scanpy)处理自己的标准化流程
  • 状态自动检查: 每次加载后自动检查数据状态,避免重复标准化
  • 失败即终止: 数据状态无法识别时立即终止任务

为什么这个原则很重要?

  • 🎯 保证数据完整性: 原始counts是单细胞分析的生物学基础
  • 🔄 确保一致性: 跨平台使用相同的标准化方法
  • 🛡️ 防止错误: 自动检测避免重复标准化或数据损坏
  • 📊 可重现性: 任何人都可以从counts重新开始分析

When to Use This Skill

This skill should be used when:

  • Working with single cell RNA-seq data in .h5ad format
  • Analyzing data in a Python environment with Scanpy
  • Needing to perform standard single cell analysis workflows including quality control, normalization, clustering, and visualization
  • Requiring batch correction and data integration using scVI
  • Processing H5AD files converted from Seurat

Prerequisites

This skill requires:

  • Conda environment named 'scanpy' with Scanpy and scvi-tools installed
  • Single cell data in .h5ad format
  • For scVI integration: batch information in the AnnData object

Core Capabilities

1. Data Loading

Load single cell data from .h5ad files using Scanpy.

2. Quality Control

Perform quality control analysis including:

  • Cell filtering based on gene counts and mitochondrial percentage
  • Visualization of QC metrics

3. Normalization and Feature Selection

  • Normalize data using log normalization
  • Identify highly variable genes

4. Dimensionality Reduction

  • Scale data for PCA
  • Perform principal component analysis
  • Compute neighborhood graph

5. Clustering

  • Cluster cells using the Louvain algorithm
  • Visualize clusters with UMAP

6. Marker Identification

  • Find marker genes for cell clusters
  • Visualize gene expression patterns

7. scVI Integration Analysis

  • Perform batch correction and data integration
  • Generate latent representations for downstream analysis
  • Obtain normalized gene expression values

Usage Instructions

Setting Up the Environment

Before using this skill, activate the scanpy conda environment:

conda activate scanpy

📋 标准执行模板(强制要求)

⚠️ 注意:以下模板是唯一允许的执行方式

# 所有Scanpy分析的标准模板(强制使用)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate scanpy

python << 'EOF'
# === 导入库 ===
import scanpy as sc
import pandas as pd
import numpy as np

# === 数据加载 ===
adata = sc.read_h5ad('input_file.h5ad')

# === 数据检查 ===
print(f'数据维度: {adata.shape}')
if 'celltype' in adata.obs.columns:
    print(adata.obs['celltype'].value_counts())

# === 分析流程 ===
# ... 您的分析代码

# === 结果导出 ===
results_df.to_csv('output.csv', index=False)
EOF

🚨 重要提醒:技能中的所有代码示例都必须遵循Here Document执行方式!

Loading Data

⚠️ 重要: 每次加载H5AD后必须检查数据状态(第一步)

任何H5AD文件读取后,必须首先检查X矩阵是否为整数(原始counts):

import scanpy as sc
import numpy as np

def check_h5ad_data_state(adata, verbose=True):
    """
    检查H5AD文件的X矩阵是否为整数(原始counts)

    Args:
        adata: AnnData对象
        verbose: 是否显示详细信息

    Returns:
        str: 'raw_counts', 'log_normalized', 'scaled', or 'unknown'
    """

    if verbose:
        print(f'🔍 检查H5AD数据状态...')
        print(f'数据维度: {adata.shape}')

    # 检查X矩阵的基本信息
    max_val = adata.X.max()
    min_val = adata.X.min()
    is_integer = np.all(adata.X == adata.X.astype(int))
    data_type = str(adata.X.dtype)

    if verbose:
        print(f'X矩阵信息:')
        print(f'  最大值: {max_val:.3f}')
        print(f'  最小值: {min_val:.3f}')
        print(f'  是否为整数: {is_integer}')
        print(f'  数据类型: {data_type}')

        # 检查样本值
        if hasattr(adata.X, 'toarray'):
            sample_vals = adata.X[:5, :5].toarray()
        else:
            sample_vals = adata.X[:5, :5]
        print(f'  样本值 (前5x5): \n{sample_vals}')

    # 检查layers
    if hasattr(adata, 'layers') and adata.layers:
        if verbose:
            print(f'可用layers: {list(adata.layers.keys())}')

    # 判断数据状态
    if max_val > 20 and is_integer:
        if verbose:
            print('✅ 检测到原始counts数据')
        return "raw_counts"
    elif max_val <= 10 and not is_integer:
        if verbose:
            print('✅ 检测到log标准化数据')
        return "log_normalized"
    elif max_val > 10 and max_val < 50 and not is_integer:
        if verbose:
            print('⚠️  检测到标准化但未log转换的数据')
        return "scaled"
    else:
        if verbose:
            print('❌ 数据状态未知')
        return "unknown"

def load_and_validate_h5ad(file_path, verbose=True):
    """
    加载H5AD文件并验证数据状态(标准流程)

    Args:
        file_path: H5AD文件路径
        verbose: 是否显示详细信息

    Returns:
        tuple: (adata, needs_normalization)

    Raises:
        ValueError: 当数据状态无法识别时终止任务
    """

    if verbose:
        print(f"🔄 加载H5AD文件: {file_path}")

    adata = sc.read_h5ad(file_path)
    data_state = check_h5ad_data_state(adata, verbose)

    if data_state == "raw_counts":
        if verbose:
            print("✅ 检测到原始counts数据,需要进行标准化")
        return adata, True
    elif data_state == "log_normalized":
        if verbose:
            print("✅ 数据已log标准化,可直接使用")
        return adata, False
    elif data_state == "scaled":
        if verbose:
            print("⚠️  数据已标准化但未log转换,建议log转换")
        return adata, True
    else:
        error_msg = f"❌ 数据状态未知: {data_state}。最大值: {adata.X.max():.3f}"
        raise ValueError(error_msg)

# 标准使用流程
if __name__ == "__main__":
    # 加载并验证数据
    adata, needs_norm = load_and_validate_h5ad('your_data.h5ad')

    # 根据数据状态决定后续处理
    if needs_norm:
        print("🔄 进行数据标准化...")
        # 标准化流程
    else:
        print("✅ 跳过标准化,直接分析...")
        # 直接分析流程

# 推荐的标准加载方式
adata, needs_norm = load_and_validate_h5ad('path/to/your/file.h5ad')

Quality Control Analysis

Perform quality control with the qc_analysis function:

# Calculate QC metrics
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)

# Filter cells based on QC metrics
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]

Normalization and Feature Selection

⚠️ 重要: 根据数据状态决定是否标准化

# 推荐的标准化工作流
adata, needs_norm = load_and_validate_h5ad('path/to/your/file.h5ad')

# 条件标准化 - 只对原始counts进行标准化
if needs_norm:
    print("🔄 进行数据标准化...")
    # Normalize to 10,000 counts per cell
    sc.pp.normalize_total(adata, target_sum=1e4)
    # Log transform
    sc.pp.log1p(adata)
    print("✅ 标准化完成")
else:
    print("✅ 数据已标准化,跳过标准化步骤")

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]

print(f"📊 识别到 {sum(adata.var.highly_variable)} 个高变基因")

为什么需要条件标准化?

  • ✅ 避免在已标准化数据上重复标准化
  • ✅ 确保分析的一致性和可重现性
  • ✅ 防止数据损坏或信息丢失
  • ✅ 自动适应不同来源的H5AD文件

Dimensionality Reduction

Scale data and perform PCA:

# Scale the data
sc.pp.scale(adata, max_value=10)

# Run PCA
sc.tl.pca(adata, svd_solver='arpack')

Clustering Analysis

Compute neighbors and find clusters:

# Compute neighbors
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# Find clusters
sc.tl.louvain(adata, resolution=0.5)

# Run UMAP for visualization
sc.tl.umap(adata)
sc.pl.umap(adata, color=['louvain'])

Marker Gene Identification

Find marker genes for clusters:

# Find marker genes (using default parameters)
sc.tl.rank_genes_groups(adata, 'louvain')

# Extract marker genes
markers = sc.get.rank_genes_groups_df(adata, None)

Gene Expression Visualization

Create dotplots to visualize gene expression across cell types:

# 📋 基因分组(字典格式)- 用户指定基因时需要整理成这种格式
gene_groups = {
    'Fibro1': ['Gene1', 'Gene2', 'Gene3'],
    'Fibro2': ['Gene4', 'Gene5', 'Gene6'],
    'Fibro3': ['Gene7', 'Gene8', 'Gene9'],
    'Fibro4': ['Gene10', 'Gene11', 'Gene12']
}

# Create dotplot with standardization
sc.pl.dotplot(adata,
              var_names=gene_groups,
              groupby='celltype.clean',
              standard_scale='var')  # Standardize each gene across cells

📋 Important Dotplot Parameters:

  • var_names: Genes to visualize (use dictionary format for grouped display)
  • groupby: Column to group cells by (e.g., 'celltype.clean', 'louvain')
  • standard_scale='var': Standardize each gene across cells (recommended for better visualization)
  • use_raw=False: Use normalized data instead of raw counts
  • log=True: Use log scale for expression values
  • color_map: Colormap for expression (default: 'Reds')

scVI Integration Analysis

Perform batch correction and integration using scvi-tools:

# Import scvi
import scvi

# Set up the AnnData object for scVI
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")

# Initialize and train the scVI model
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()

# Get the latent representation
adata.obsm["X_scVI"] = scvi_model.get_latent_representation()

# Get normalized expression
adata.layers["scvi_normalized"] = scvi_model.get_normalized_expression()

# Perform clustering on the scVI latent representation
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.louvain(adata, key_added="louvain_scvi")

# Visualize with UMAP using scVI latent representation
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(adata, color=["louvain_scvi", "batch"])

References