name	cellphonedb
description	Comprehensive skill for CellPhoneDB - Database of cell type markers and cell-cell communication analysis for single-cell data. Use for cell type annotation, ligand-receptor analysis, cell-cell interaction inference, and communication network visualization.

Cellphonedb Skill

Comprehensive assistance with CellPhoneDB development, generated from official documentation.

When to Use This Skill

This skill should be triggered when you need to:

Data Preparation & Analysis:

Prepare meta and counts data files for CellPhoneDB analysis
Validate and preprocess single-cell RNA-seq data for interaction analysis
Subsample counts data for computational efficiency
Set up proper cell type annotations and metadata formatting

Cell-Cell Communication Analysis:

Identify significant ligand-receptor interactions between cell types
Perform statistical analysis of cell-type specific communication
Analyze spatial microenvironments and neighborhood interactions
Query and filter interaction results based on expression thresholds

Advanced Applications:

Integrate transcription factor activity with receptor signaling (CellSign module)
Perform differential expression analysis for interaction-specific genes
Visualize communication networks and interaction scores
Analyze complex multi-subunit interactions and heteromeric complexes

Database Management:

Work with CellPhoneDB database files and versions
Extract protein and complex data for web applications
Handle gene synonym mappings and database updates
Manage custom CellPhoneDB database creation

Quick Reference

Data Preparation and Validation

import pandas as pd
import numpy as np
from cellphonedb.src.core.exceptions.ParseCountsException import ParseCountsException

# Validate meta DataFrame - ensure correct columns and indexes
def validate_meta(meta_raw):
    """Re-formats meta_raw if need be to ensure correct columns and indexes are present"""
    meta = meta_raw.copy()
    # Ensure proper indexing and column structure
    return meta

# Validate counts DataFrame - ensure float32 type and cell consistency
def validate_counts(counts, meta):
    """Ensure that counts values are of type float32, and that all cells in meta exist in counts"""
    if not len(counts.columns):
        raise ParseCountsException('Counts values are not decimal values', 'Incorrect file format')

    try:
        if np.any(counts.dtypes.values != np.dtype('float32')):
            counts = counts.astype(np.float32)
    except Exception:
        raise ParseCountsException

    meta.index = meta.index.astype(str)

    if np.any(~meta.index.isin(counts.columns)):
        raise ParseCountsException("Some cells in meta did not exist in counts",
                                   "Maybe incorrect file format")

    if np.any(~counts.columns.isin(meta.index)):
        counts = counts.loc[:, counts.columns.isin(meta.index)]

    return counts

Database Operations and Data Extraction

from typing import Tuple
import pandas as pd
import zipfile
import io

# Extract interaction data from CellPhoneDB database
def get_interactions_genes_complex(cpdb_file_path) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame, dict, dict]:
    """Returns a tuple of four DataFrames containing data from CellPhoneDB database"""

    # Extract csv files from database zip file
    dbTableDFs = extract_dataframes_from_db(cpdb_file_path)

    # Process gene synonym mappings
    gene_synonym2gene_name = {}
    if 'gene_synonym_to_gene_name' in dbTableDFs:
        gs2gn = dbTableDFs['gene_synonym_to_gene_name']
        gene_synonym2gene_name = dict(zip(gs2gn['Gene Synonym'], gs2gn['Gene Name']))

    # Process multidata table and convert boolean columns
    mtTable = dbTableDFs['multidata_table']
    MULTIDATA_TABLE_BOOLEAN_COLS = ['receptor', 'other', 'secreted_highlight',
                                   'transmembrane', 'secreted', 'peripheral', 'integrin', 'is_complex']

    for col in MULTIDATA_TABLE_BOOLEAN_COLS:
        mtTable[col] = mtTable[col].astype(bool)

    # Build genes table by merging gene, protein, and multidata tables
    genes = pd.merge(dbTableDFs['gene_table'], dbTableDFs['protein_table'],
                    left_on='protein_id', right_on='id_protein')
    genes = pd.merge(genes, mtTable, left_on='protein_multidata_id', right_on='id_multidata')

    # Build interactions table with proper suffixes
    multidata_expanded = pd.concat([
        pd.merge(dbTableDFs['protein_table'], mtTable, left_on='protein_multidata_id', right_on='id_multidata'),
        pd.merge(mtTable, dbTableDFs['complex_table'], left_on='id_multidata', right_on='complex_multidata_id')
    ], ignore_index=True, sort=True)

    interactions = pd.merge(dbTableDFs['interaction_table'], multidata_expanded, how='left',
                           left_on=['multidata_1_id'], right_on=['id_multidata'])
    interactions = pd.merge(interactions, multidata_expanded, how='left',
                           left_on=['multidata_2_id'], right_on=['id_multidata'], suffixes=('_1', '_2'))

    # Set indices for final dataframes
    interactions.set_index('id_interaction', drop=True, inplace=True)

    return interactions, genes, complex_composition, complex_expanded, gene_synonym2gene_name, receptor2tfs

Installation and Setup

# Install Python and Jupyter Notebook
# Follow instructions at https://docs.conda.io/en/latest/miniconda.html
conda create -n cpdb python=3.8
conda activate cpdb
pip install notebook

# Clone CellPhoneDB repository
cd <your_working_directory>
git clone git@github.com:ventolab/CellphoneDB.git
cd CellphoneDB/cellphonedb/notebooks

# Start Jupyter notebook
jupyter notebook
# Navigate to http://localhost:8888/notebooks/notebooks/cellphonedb.ipynb

Analysis Methods Selection

# METHOD 1: Simple analysis - interaction means
# Use for quick exploration without statistical testing
cellphonedb method statistical_analysis meta.txt counts.txt --output-path results/

# METHOD 2: Statistical analysis - significance testing
# Use for identifying significant cell-type specific interactions
cellphonedb method statistical_analysis meta.txt counts.txt --output-path results/ --subsampling --threads 4

# METHOD 3: Differential expression analysis
# Use for custom comparisons with provided DEGs file
cellphonedb method degs_analysis meta.txt counts.txt degs.txt --output-path results/

# METHOD 4: Spatial microenvironments analysis
# Add spatial context to interaction analysis
cellphonedb method statistical_analysis meta.txt counts.txt --output-path results/ --microenvironments microenv.txt

Data Format Requirements

# Meta file format (tab-separated):
# cell_name    cell_type
# cell1        T_cell
# cell2        B_cell
# cell3        T_cell

# Counts file format (tab-separated, genes as rows, cells as columns):
# Gene    cell1    cell2    cell3
# EGFR    5.2      0.0      3.1
# CD3D    8.7      1.2      9.4

# DEGs file format for METHOD 3 (tab-separated):
# gene        cluster    pval    avg_log2FC
# IL2RA       T_cell     0.001   2.3
# MS4A1       B_cell     0.0005  3.1

Microenvironments and Spatial Analysis

# Microenvironments file format (tab-separated):
# cell_type    microenvironment
# T_cell       immune_compartment
# B_cell       immune_compartment
# epithelial   tissue_compartment

# Run analysis with spatial constraints
cellphonedb method statistical_analysis meta.txt counts.txt \
    --output-path results/ \
    --microenvironments microenv.txt \
    --threshold 0.1  # Minimum expression fraction

CellSign Module Integration

# Prepare transcription factor activity file
# Format: cell_type    TF1    TF2    TF3
#          T_cell       1.2    0.8    0.5
#          B_cell       0.3    1.1    0.9

# Run analysis with TF activity integration
cellphonedb method statistical_analysis meta.txt counts.txt \
    --output-path results/ \
    --active-tfs tf_activity.txt \
    --threshold 0.1

Database Path Management

import os

def get_db_path(user_dir_root, db_version):
    """Retrieves the path to the local database file corresponding to db_version"""
    return os.path.join(user_dir_root, "releases", db_version)

# Example usage:
user_dir = "/path/to/cellphonedb/data"
db_version = "v5.0"
db_path = get_db_path(user_dir, db_version)
# Returns: "/path/to/cellphonedb/data/releases/v5.0"

Key Concepts

Analysis Methods

METHOD 1 (Simple Analysis): Calculates mean interaction expression without statistical testing. Fast exploration tool.
METHOD 2 (Statistical Analysis): Permutation-based statistical testing for cell-type specific interactions using empirical shuffling.
METHOD 3 (DEGs Analysis): Custom differential expression-based approach using user-provided marker genes or DEGs.

Statistical Testing Framework

Permutation approach: Randomly shuffles cluster labels 1000+ times to create null distribution
P-value calculation: Proportion of permuted means ≥ actual mean
Multiple testing correction: Built-in methods for controlling false discovery rate
Expression thresholds: Default 10% of cells (configurable) must express interacting partners

Database Structure

Multidata table: Central table containing proteins, complexes, and their properties
Interactions table: Curated ligand-receptor pairs with directionality and classification
Complex composition: Multi-subunit protein complexes and their components
Gene synonym mapping: Alternate gene names for comprehensive coverage

CellSign Integration

Receptor-TF relationships: 211 curated high-specificity receptor-transcription factor pairs
Activity status: Uses TF activity as downstream sensor for receptor activation
Enhanced confidence: Adds extra evidence layer for cell-cell interaction predictions

Reference Files

This skill includes comprehensive documentation in references/:

api_reference.md - Technical Implementation

Essential for developers and advanced users:

Data preprocessing functions: Complete implementations for meta and counts validation
Database utilities: Source code for data extraction and processing
Counts preprocessing: Float32 conversion, cell consistency checking, error handling
Protein and complex data extraction: Functions for web application integration

user_guide.md - Complete Analysis Workflow

Comprehensive guide for all analysis methods:

Installation instructions: Python environment setup, Jupyter configuration
Three analysis methods: Detailed explanations, use cases, and interpretation
Statistical framework: Permutation testing, p-value calculation, significance thresholds
Advanced features: Spatial microenvironments, CellSign integration, scoring methodology
Output interpretation: Understanding means, pvalues, significant_means, and deconvoluted files

other.md - Getting Started Resources

Quick start and setup information:

Installation procedures: Conda/miniconda setup, Jupyter notebook configuration
Quick start workflow: From data upload to analysis completion
Example notebooks: Step-by-step guided analysis with sample datasets

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with installation: Follow the user_guide.md setup instructions for Python and Jupyter
Prepare your data: Use the interactive notebook format at http://localhost:8888/notebooks/cellphonedb.ipynb
Try METHOD 1 first: Simple analysis without statistical testing to understand data structure
Review output formats: Understand means.csv and deconvoluted.csv structure

For Intermediate Users

Master statistical analysis: Use METHOD 2 for rigorous significance testing of interactions
Optimize thresholds: Adjust expression thresholds based on your dataset characteristics
Implement subsampling: Use geometric sketching for large datasets (>100k cells)
Add spatial context: Incorporate microenvironment information for tissue-specific interactions

For Advanced Users

Custom DEG analysis: Use METHOD 3 for complex experimental designs and hierarchical comparisons
CellSign integration: Incorporate transcription factor activity for enhanced confidence
Database customization: Create custom CellPhoneDB databases with organism-specific interactions
Batch processing: Implement automated pipelines for multiple datasets or conditions

Navigation Tips

Data format first: Always ensure meta.txt and counts.txt follow exact format requirements
Method selection flow: METHOD 1 (exploration) → METHOD 2 (standard analysis) → METHOD 3 (custom comparisons)
Threshold tuning: Adjust expression thresholds (default 0.1) based on sequencing depth and biological context
Result validation: Cross-reference significant interactions with known biology and literature

Resources

references/

Organized documentation extracted from official sources:

Complete API documentation with function implementations and error handling
Step-by-step analysis workflows for all three methods
Statistical framework explanations with permutation testing details
Advanced integration guides for spatial and transcription factor analysis
Real code examples from the official CellPhoneDB codebase

scripts/

Add your automation scripts here:

Data preprocessing pipelines for multiple datasets
Batch analysis workflows for systematic studies
Result visualization and network analysis tools
Custom statistical testing frameworks

assets/

Store templates and reference materials:

Input file templates (meta.txt, counts.txt, DEGs formats)
Output interpretation guides and examples
Network visualization templates and scripts
Analysis workflow checklists

Notes

Data Requirements

Counts data: Raw counts (not normalized) required for statistical methods
Meta information: Cell barcodes and corresponding cell type annotations
Expression threshold: Default 10% of cells must express gene to consider interaction
Cell type consistency: Minimum cell numbers per type recommended for statistical power

Performance Considerations

Large datasets: Use subsampling for datasets >100k cells to improve runtime
Memory usage: Consider sparse matrix representations for large count matrices
Parallel processing: Use --threads parameter for multi-core acceleration
Database caching: Local database storage speeds up repeated analyses

Common Pitfalls

Normalized data: Using normalized counts with statistical methods (requires raw counts)
Format mismatch: Incorrect tab-separated format or header inconsistencies
Low-expressed genes: Setting expression thresholds too low leading to spurious interactions
Cell type naming: Inconsistent cell type labels between meta and analysis files

Updating

To refresh this skill with updated documentation:

Check the official CellPhoneDB documentation at https://cellphonedb.readthedocs.io/en/latest/
Re-run the scraper with updated source URLs if available
The skill will preserve existing structure while incorporating new methods and features
Database updates and new interaction curation will be automatically integrated

For the most current information, always cross-reference with the official CellPhoneDB documentation and GitHub repository.

cellphonedb

Install Skill

SKILL.md