---
name: excel-analyzer
description: Analyze Excel files and automatically generate custom query skills for conversational data interaction. Also provides standalone conversion of Excel files with multiple sheets to Parquet datasets. Handles files with millions of rows, thousands of columns, formulas, and multiple sheets. Converts data to optimized Parquet format and creates complete skill packages with documentation, query helpers, and formula mappings.
---
# Excel Analyzer Skill

This skill analyzes Excel files and automatically generates custom Claude Agent Skills tailored to each file, enabling conversational querying of complex Excel data.
## What This Skill Does
The Excel Analyzer provides two main capabilities:
### Full Analysis & Skill Generation (Workflow 1)
Performs comprehensive analysis of Excel files and generates a complete, ready-to-use query skill that includes:
- Schema Extraction: Analyzes all sheets, columns, data types, and relationships
- Formula Analysis: Extracts Excel formulas with their locations and dependencies
- Data Conversion: Converts Excel to optimized Parquet format for 30x faster queries
- Skill Generation: Creates a complete skill with documentation, query helpers, and data dictionaries
- Documentation: Generates comprehensive SKILL.md with usage instructions and examples
### Standalone Parquet Conversion (Workflow 2)
Quickly converts Excel files with multiple sheets to Parquet datasets:
- Multi-Sheet Detection: Automatically detects all sheets in the Excel workbook
- Parquet Conversion: Converts each sheet to a separate compressed Parquet file
- Organized Storage: Stores files in `.claude/skills/{skill-name}/resources/parquet_data/`
- Ready for Analysis: Output files are immediately usable with Polars, Pandas, or PyArrow
## When to Use This Skill
Use this skill when you need to:
- Enable conversational AI querying of complex Excel files
- Work with large Excel files (100K+ rows, hundreds of columns)
- Analyze Excel files containing formulas and computed values
- Create reusable query interfaces for Excel datasets
- Convert Excel data to more efficient formats (Parquet)
- Generate data dictionaries and schema documentation automatically
- Build custom skills for specific business datasets
## Environment Setup and Dependency Management
CRITICAL: All generated skills MUST include automatic environment detection and dependency management to avoid polluting the global Python environment.
When generating a new skill, the skill generator MUST add the following capabilities to the generated SKILL.md:
### 1. Operating System Detection

The generated skill must detect the current OS to use the correct commands:

```python
import platform
import sys

os_type = platform.system()  # 'Windows', 'Linux', or 'Darwin' (macOS)
python_executable = sys.executable
python_version = sys.version
```
### 2. Virtual Environment Detection

The generated skill must detect and use virtual environments in this priority order:
- Check whether it is already running in an activated venv (`VIRTUAL_ENV` environment variable)
- Look for a `.venv/` directory (modern convention)
- Look for a `venv/` directory
- Look for an `env/` directory
- Only fall back to system Python if no venv is found (with a warning)
Detection logic:
```python
import os
from pathlib import Path

def detect_venv():
    """Detect a virtual environment in the current directory."""
    # Check if already in an activated venv
    if os.environ.get('VIRTUAL_ENV'):
        return Path(os.environ['VIRTUAL_ENV'])

    # Check for common venv directory names
    venv_candidates = ['.venv', 'venv', 'env']
    for venv_name in venv_candidates:
        venv_path = Path(venv_name)
        # Verify it's a valid venv by checking for pyvenv.cfg
        if venv_path.exists() and (venv_path / 'pyvenv.cfg').exists():
            return venv_path.absolute()

    return None  # No venv found; will use system Python
```
### 3. Dependency Installation

The generated skill must:
- Check whether required packages are installed in the detected venv
- Automatically install missing dependencies from `requirements.txt`
- NEVER install packages globally unless no venv exists
Installation logic:
```python
import subprocess
import importlib.util
import platform
from pathlib import Path

def check_package_installed(package_name):
    """Check whether a package is installed."""
    return importlib.util.find_spec(package_name) is not None

def install_dependencies(requirements_file, venv_path=None):
    """Install missing dependencies into the virtual environment."""
    if not Path(requirements_file).exists():
        return

    # Determine the pip executable
    if venv_path:
        if platform.system() == 'Windows':
            pip_executable = venv_path / 'Scripts' / 'pip.exe'
        else:
            pip_executable = venv_path / 'bin' / 'pip'
    else:
        pip_executable = 'pip'

    # Install dependencies
    cmd = [str(pip_executable), 'install', '-r', str(requirements_file)]
    if not venv_path:
        # Add --break-system-packages only when using system Python
        cmd.append('--break-system-packages')
    subprocess.run(cmd, check=True)
```
### 4. Complete Setup Function
Every generated skill MUST include this setup function at the beginning of SKILL.md:
```python
def setup_environment():
    """
    Detect the environment and ensure all dependencies are installed.
    This function MUST be called before any skill operations.
    """
    import platform
    import sys
    import os
    import subprocess
    from pathlib import Path

    # Detect OS
    os_type = platform.system()
    print(f"Detected OS: {os_type}")
    print(f"Python version: {sys.version}")
    print(f"Python executable: {sys.executable}")

    # Detect virtual environment
    venv_path = None
    if os.environ.get('VIRTUAL_ENV'):
        venv_path = Path(os.environ['VIRTUAL_ENV'])
        print(f"Already in activated venv: {venv_path}")
    else:
        # Look for a venv in the current directory
        for venv_name in ['.venv', 'venv', 'env']:
            candidate = Path(venv_name)
            if candidate.exists() and (candidate / 'pyvenv.cfg').exists():
                venv_path = candidate.absolute()
                print(f"Found virtual environment: {venv_path}")
                break

    if not venv_path:
        print("WARNING: No virtual environment detected. Will use system Python.")
        print("Consider creating a venv: python -m venv .venv")

    # Check and install dependencies
    skill_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
    requirements_file = skill_dir / 'requirements.txt'
    if requirements_file.exists():
        print(f"Checking dependencies from: {requirements_file}")

        # Determine the pip command
        if venv_path:
            if os_type == 'Windows':
                pip_cmd = [str(venv_path / 'Scripts' / 'pip.exe')]
            else:
                pip_cmd = [str(venv_path / 'bin' / 'pip')]
        else:
            # `python -m pip` is more robust than guessing the pip executable's path
            pip_cmd = [sys.executable, '-m', 'pip']

        # Install missing dependencies
        try:
            cmd = pip_cmd + ['install', '-r', str(requirements_file), '-q']
            if not venv_path:
                cmd.append('--break-system-packages')
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                print("✓ All dependencies installed successfully")
            else:
                print(f"Warning: Dependency installation had issues: {result.stderr}")
        except Exception as e:
            print(f"Error installing dependencies: {e}")

    return {
        'os': os_type,
        'python_version': sys.version,
        'python_executable': sys.executable,
        'venv_path': str(venv_path) if venv_path else None,
        'venv_active': bool(venv_path),
        'skill_dir': str(skill_dir),
    }
```
### 5. Usage in Generated Skills
Every generated skill's SKILL.md must include instructions to call `setup_environment()` BEFORE any operations:

```python
# At the start of any generated skill usage:
from skill_setup import setup_environment

# MUST run this first
env_info = setup_environment()

# Now safe to proceed with skill operations
from query_helper import QueryHelper
helper = QueryHelper('data_parquet')
# ... rest of the code
```
The skill generator MUST create a `skill_setup.py` file in every generated skill directory containing the complete `setup_environment()` function.
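As a rough illustration, the generator could emit this file by serializing the setup function's source (a minimal sketch; `write_skill_setup` is a hypothetical helper, and the real generator may use templates instead):

```python
import inspect
from pathlib import Path

def write_skill_setup(skill_dir: Path, setup_fn) -> None:
    """Write skill_setup.py containing the given setup function's source."""
    header = "# Auto-generated by excel-analyzer. Do not edit by hand.\n\n"
    source = inspect.getsource(setup_fn)  # requires setup_fn to be defined in a file
    (skill_dir / "skill_setup.py").write_text(header + source, encoding="utf-8")
```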
## Prerequisites

Required Python packages (will be auto-installed if missing):

```bash
pip install openpyxl polars pyarrow pandas --break-system-packages
```
## Workflow Overview

```
Step 1: Analyze Excel File
        ↓
analyzer.py → Extracts schema, formulas, dependencies, samples
        ↓
Step 2: Generate Query Skill
        ↓
skill_generator.py → Creates custom skill package with environment setup
        ↓
Step 3: Use Generated Skill
        ↓
Auto-detects venv, installs dependencies, queries data conversationally
```
## Step-by-Step Usage

### Workflow Option 1: Full Analysis + Skill Generation (Recommended)
For comprehensive analysis and automatic skill generation with documentation, query helpers, and formula mapping.
#### Step 1A: Analyze the Excel File

```bash
python scripts/analyzer.py \
  --file /path/to/your/data.xlsx \
  --output-dir ./analysis_output
```
What this does:
- Loads the Excel file and analyzes all sheets
- Extracts comprehensive schema (columns, types, samples)
- Identifies and documents all formulas
- Builds formula dependency graphs (see the sketch after this list)
- Extracts stratified sample data (50 rows per sheet)
- Converts all sheets to Parquet format
- Saves analysis results as JSON files
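As a rough illustration of the formula-extraction step, the sketch below shows how formula cells and their referenced cells can be collected with openpyxl (illustrative only; the actual analyzer records more metadata):

```python
import re
import openpyxl

# Matches cell references such as A1, $B$2, or Sheet1!C3 (simplified)
CELL_REF = re.compile(r"(?:'?[\w ]+'?!)?\$?[A-Z]{1,3}\$?\d+")

def extract_formulas(xlsx_path):
    """Map each formula cell to its expression and referenced cells."""
    wb = openpyxl.load_workbook(xlsx_path)  # no data_only, so formulas are kept
    formula_map = {}
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith('='):
                    key = f"{ws.title}!{cell.coordinate}"
                    formula_map[key] = {
                        'formula': cell.value,
                        'depends_on': CELL_REF.findall(cell.value),
                    }
    return formula_map
```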
Output files created:
```
analysis_output/
├── analysis_results.json      # Complete analysis
├── data_dictionary.json       # Schema and column metadata
├── formula_map.json           # All formulas with locations
├── dependencies.json          # Formula dependency graph
├── sample_data.json           # Representative data samples
└── resources/parquet_data/    # Converted Parquet files
    ├── Sheet1.parquet
    └── Sheet2.parquet
```
#### Step 1B: Generate the Query Skill

```bash
# Option 1: Auto-detect the .claude/skills directory (recommended)
python scripts/skill_generator.py \
  --analysis-dir ./analysis_output

# Option 2: Specify a custom output directory
python scripts/skill_generator.py \
  --analysis-dir ./analysis_output \
  --output-dir ./custom_skills_directory
```
Default Behavior: If `--output-dir` is not specified, the skill is created in `.claude/skills/` in the current working directory, keeping all generated skills organized alongside the excel-analyzer skill.
What this does:
- Reads the analysis results
- Generates a complete skill package
- Creates comprehensive SKILL.md documentation
- Generates query_helper.py with utility functions
- Creates formula_functions.py for Excel formula equivalents
- Creates skill_setup.py with environment detection and dependency management
- Copies all data files and Parquet data
- Includes requirements.txt for dependencies
Generated skill structure:
```
.claude/skills/{filename}-query/
├── SKILL.md                    # Complete documentation with setup instructions
├── skill_setup.py              # Environment detection and dependency management
├── data_dictionary.json        # Schema reference
├── formula_map.json            # Formula documentation
├── dependencies.json           # Formula dependency graph
├── sample_data.json            # Sample data
├── query_helper.py             # Query utility class
├── formula_functions.py        # Formula equivalents
├── requirements.txt            # Dependencies
└── resources/parquet_data/     # Converted Parquet files
    ├── Sheet1.parquet
    └── Sheet2.parquet
```
Note: The skill name is automatically derived from the Excel filename with a `-query` suffix (e.g., `sales_2024.xlsx` becomes `sales_2024-query`).
#### Step 1C: Use the Generated Skill
The generated skill can be used immediately with Claude Agent SDK and will automatically handle environment setup:
```python
# Example: Query the data (environment setup happens automatically)
from skill_setup import setup_environment
from query_helper import QueryHelper
import polars as pl

# Set up the environment (detects venv, installs dependencies)
env_info = setup_environment()

# Initialize
helper = QueryHelper('data_parquet')

# Load a sheet
df = helper.load_sheet('Sheet1', lazy=True)

# Query with filters
result = helper.query_by_filter(
    'Sheet1',
    [pl.col('age') > 25, pl.col('status') == 'active'],
    columns=['name', 'age', 'revenue'],
    limit=100
)
print(result)

# Aggregate data
aggregated = helper.aggregate_by_group(
    'Sheet1',
    group_by=['category'],
    aggregations={'revenue': 'sum', 'orders': 'count'},
    limit=50
)
print(aggregated)
```
### Workflow Option 2: Standalone Parquet Conversion (Quick & Simple)
For users who just need to convert Excel files with multiple sheets to Parquet datasets without the full analysis and skill generation. This is ideal when you:
- Already have your own query tools/scripts
- Just need efficient data storage format
- Want to quickly convert Excel to Parquet for use with Polars/Pandas
- Don't need formula analysis or auto-generated documentation
#### Step 2A: Convert Excel to Parquet

```bash
# Basic usage (auto-detects skill name from the filename)
python scripts/excel_to_parquet.py --file /path/to/your/data.xlsx

# Specify a custom skill name
python scripts/excel_to_parquet.py --file /path/to/sales_data.xlsx --skill-name sales-2024
```
What this does:
- Detects all sheets in the Excel file
- Converts each sheet to a separate Parquet file with zstd compression
- Stores files in `.claude/skills/{skill-name}/resources/parquet_data/`
- Sanitizes sheet names for valid filenames (see the sketch after this list)
- Provides a detailed conversion summary with row/column counts and file sizes
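The core of this conversion can be sketched as follows (assuming openpyxl and Polars; the real script adds progress reporting and error handling):

```python
import re
from pathlib import Path

import openpyxl
import polars as pl

def sanitize(name: str) -> str:
    """Replace characters that are invalid in filenames with underscores."""
    return re.sub(r'[^\w\-]', '_', name)

def excel_to_parquet(xlsx_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wb = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    for ws in wb.worksheets:
        rows = ws.iter_rows(values_only=True)
        try:
            header = [str(h) for h in next(rows)]  # first row as column names
        except StopIteration:
            continue  # skip empty sheets
        df = pl.DataFrame([list(r) for r in rows], schema=header, orient='row')
        df.write_parquet(out / f"{sanitize(ws.title)}.parquet", compression='zstd')
```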
Output structure:
```
.claude/skills/{skill-name}/
└── resources/
    └── parquet_data/
        ├── Sheet1.parquet
        ├── Sheet2.parquet
        ├── Sales_Data.parquet
        └── Monthly_Report.parquet
```
Example output:
```
🚀 Starting Excel to Parquet conversion...
📄 File: sales_data.xlsx
📁 Skill: sales-data
💾 Output: .claude/skills/sales-data/resources/parquet_data

✅ Created directory structure: .claude/skills/sales-data/resources/parquet_data
📊 Loading workbook...
✅ Found 3 sheet(s): Summary, Q1_Sales, Q2_Sales

🔄 Converting sheets to Parquet...
  [1/3] Summary... ✅ (150 rows × 8 cols, 0.05 MB)
  [2/3] Q1_Sales... ✅ (5000 rows × 15 cols, 0.45 MB)
  [3/3] Q2_Sales... ✅ (5200 rows × 15 cols, 0.47 MB)

======================================================================
📊 CONVERSION SUMMARY
======================================================================
✅ Successful: 3/3

📦 Generated Parquet files:
  • Summary.parquet (150 rows, 0.05 MB)
  • Q1_Sales.parquet (5000 rows, 0.45 MB)
  • Q2_Sales.parquet (5200 rows, 0.47 MB)

📏 Total size: 0.97 MB
📁 Location: .claude/skills/sales-data/resources/parquet_data
======================================================================
🎉 Conversion complete!
```
#### Step 2B: Use the Parquet Files
Once converted, you can use the Parquet files directly with Polars, Pandas, or PyArrow:
```python
import polars as pl

# Load a specific sheet
df = pl.read_parquet('.claude/skills/sales-data/resources/parquet_data/Q1_Sales.parquet')

# Query the data
result = df.filter(pl.col('revenue') > 10000).select(['customer', 'revenue', 'date'])
print(result)

# Aggregate
summary = df.group_by('region').agg([
    pl.col('revenue').sum().alias('total_revenue'),
    pl.col('orders').count().alias('order_count')
])
print(summary)
```
With Pandas:
```python
import pandas as pd

# Load a specific sheet
df = pd.read_parquet('.claude/skills/sales-data/resources/parquet_data/Q1_Sales.parquet')

# Query the data
result = df[df['revenue'] > 10000][['customer', 'revenue', 'date']]
print(result)
```
## Comparison: Workflow 1 vs Workflow 2
| Feature | Workflow 1 (Full Analysis) | Workflow 2 (Parquet Only) |
|---|---|---|
| Parquet Conversion | ✅ Yes | ✅ Yes |
| Schema Analysis | ✅ Yes | ❌ No |
| Formula Extraction | ✅ Yes | ❌ No |
| Auto-generated Skill | ✅ Yes | ❌ No |
| Query Helper Classes | ✅ Yes | ❌ No |
| Documentation | ✅ Complete | ❌ No |
| Dependencies Setup | ✅ Automatic | ⚠️ Manual |
| Use Case | Full-featured skills | Quick data conversion |
| Best For | Production, reusable skills | One-off conversions |
## Supported Features

### Excel File Characteristics
- Size: Up to 100MB files (tested with 500K rows × 10K columns)
- Formats: .xlsx, .xlsm
- Sheets: Multiple sheets per file
- Formulas: Native Excel formulas (SUM, AVERAGE, IF, VLOOKUP, etc.)
- Data Types: Text, numeric, datetime, boolean, mixed
- Special Features: Named ranges, formatted tables, computed columns
### Analysis Capabilities
- Schema Detection: Automatic column type inference
- Formula Extraction: Identifies all formula cells and their expressions
- Dependency Graphs: Maps formula dependencies across cells and sheets
- Data Profiling: Sample values, null counts, unique values (see the sketch after this list)
- Performance: Handles large files with streaming processing
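A simplified version of the profiling step, using Polars (illustrative; `profile_columns` is a hypothetical helper):

```python
import polars as pl

def profile_columns(df: pl.DataFrame) -> pl.DataFrame:
    """Per-column null count, distinct count, and one sample value."""
    samples = []
    for c in df.columns:
        non_null = df[c].drop_nulls()
        samples.append(str(non_null[0]) if len(non_null) else None)
    return pl.DataFrame({
        'column': df.columns,
        'nulls': [df[c].null_count() for c in df.columns],
        'unique': [df[c].n_unique() for c in df.columns],
        'sample': samples,
    })
```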
### Generated Query Features
- Lazy Loading: Memory-efficient data loading (a simplified helper sketch follows this list)
- Streaming Queries: Process large datasets without memory overflow
- Filter Operations: Complex multi-condition filtering
- Aggregations: Group by, sum, mean, count, min, max
- Pagination: Automatic result limiting and pagination
- Formula Support: Pre-computed values and formula documentation
- Environment Management: Automatic venv detection and dependency installation
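To make these behaviors concrete, here is a simplified sketch of how a generated `query_helper.py` might implement them (illustrative only; the generated code may differ):

```python
from pathlib import Path

import polars as pl

class QueryHelper:
    def __init__(self, parquet_dir: str):
        self.parquet_dir = Path(parquet_dir)

    def load_sheet(self, sheet: str, lazy: bool = True):
        """Lazily scan (or eagerly read) one sheet's Parquet file."""
        path = self.parquet_dir / f"{sheet}.parquet"
        return pl.scan_parquet(path) if lazy else pl.read_parquet(path)

    def query_by_filter(self, sheet, filters, columns=None, limit=100):
        """Apply filter expressions lazily; only `limit` rows are materialized."""
        lf = self.load_sheet(sheet, lazy=True)
        for f in filters:
            lf = lf.filter(f)
        if columns:
            lf = lf.select(columns)
        return lf.limit(limit).collect()
```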
## Technical Details

### Performance Characteristics
| File Size | Rows | Columns | Analysis Time | Query Time |
|---|---|---|---|---|
| Small | <10K | <100 | 1-2 min | <1 sec |
| Medium | 100K | <500 | 5-10 min | <10 sec |
| Large | 500K | <2K | 20-30 min | 1-5 min |
| Very Large | 1M+ | 5-10K | 1-2 hours | 5-15 min |
### Memory Management
- Uses Polars streaming mode to handle datasets larger than available RAM (see the example after this list)
- Lazy loading prevents loading entire datasets into memory
- Parquet columnar format enables efficient column selection
- Read-only mode for initial analysis reduces memory footprint
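For example, with Polars the Parquet data can be scanned lazily and collected with the streaming engine, so only the needed columns and chunks ever enter memory:

```python
import polars as pl

# Nothing is read into memory at scan time
lf = pl.scan_parquet('resources/parquet_data/Sheet1.parquet')

# Only matching rows of the two selected columns are materialized;
# on older Polars versions use .collect(streaming=True) instead
result = (
    lf.filter(pl.col('revenue') > 0)
      .select(['customer', 'revenue'])
      .collect(engine='streaming')
)
```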
### Formula Handling
Supported Formula Functions:
- Arithmetic: SUM, AVERAGE, COUNT, MIN, MAX
- Logical: IF, AND, OR, NOT
- Lookup: VLOOKUP, HLOOKUP, INDEX, MATCH
- Conditional: SUMIF, COUNTIF, AVERAGEIF
- Text: CONCATENATE, LEFT, RIGHT, MID, LEN
Formula Conversion:
Excel formulas are documented in `formula_map.json` and can be converted to Polars expressions. Complex formulas may require manual review and adjustment.
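For example, an Excel `SUMIF` typically maps to a filtered aggregation in Polars (a hand-written illustration with toy data, not generated output):

```python
import polars as pl

df = pl.DataFrame({'B': [50, 150, 200], 'C': [1.0, 2.0, 3.0]})

# Excel: =SUMIF(B:B, ">100", C:C)
# Polars: sum C over rows where B > 100
total = df.filter(pl.col('B') > 100).select(pl.col('C').sum()).item()
print(total)  # 5.0
```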
### Data Type Inference
The analyzer automatically infers data types from samples (a simplified sketch follows this list):
- Numeric: Integers and floats (including percentages, currencies)
- Text: Strings and categorical data
- Datetime: Dates, times, and timestamps
- Boolean: True/False values
- Mixed: Columns with multiple data types
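A minimal sketch of this sample-based inference (the analyzer's actual heuristics may be more involved; `infer_column_type` is a hypothetical helper):

```python
from datetime import date, datetime

def infer_column_type(samples: list) -> str:
    """Infer a coarse column type from sample values."""
    seen = set()
    for v in samples:
        if v is None:
            continue  # nulls don't vote
        if isinstance(v, bool):
            seen.add('boolean')  # check bool before int: bool is an int subclass
        elif isinstance(v, (int, float)):
            seen.add('numeric')
        elif isinstance(v, (date, datetime)):
            seen.add('datetime')
        else:
            seen.add('text')
    if not seen:
        return 'text'  # all-null column; default to text
    return seen.pop() if len(seen) == 1 else 'mixed'
```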
## Examples

### Example 1: Sales Data Analysis

```bash
# Analyze sales data
python scripts/analyzer.py \
  --file sales_2024.xlsx \
  --output-dir ./sales_analysis

# Generate query skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./sales_analysis

# Now you can query: "Show me total revenue by region for Q1"
# The skill will be available at: .claude/skills/sales_2024-query/
# Environment setup happens automatically!
```
### Example 2: Customer Database

```bash
# Analyze customer database
python scripts/analyzer.py \
  --file customer_data.xlsx \
  --output-dir ./customer_analysis

# Generate skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./customer_analysis

# Query: "Find all active customers over age 30 with purchases > $1000"
# The skill will be available at: .claude/skills/customer_data-query/
```
### Example 3: Financial Reports

```bash
# Analyze financial report with formulas
python scripts/analyzer.py \
  --file financial_report.xlsx \
  --output-dir ./finance_analysis

# Generate skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./finance_analysis

# Query: "Compare revenue vs expenses across all departments"
# The skill will be available at: .claude/skills/financial_report-query/
```
## Troubleshooting

### Common Issues

Q: "File not found" error
A: Ensure the Excel file path is absolute or relative to the current directory

Q: "Memory error during analysis"
A: The Excel file may be too large. Try analyzing on a machine with more RAM or reduce the file size
Q: "Formula parsing error"
A: Some complex or custom formulas may not parse correctly. Check `formula_map.json` for details
Q: "Parquet conversion failed"
A: Install the required packages: `pip install polars pyarrow --break-system-packages`
Q: Generated skill doesn't load
A: Ensure all JSON files are valid (see the snippet below). Check for Unicode or encoding issues in the Excel data
Q: "Virtual environment not detected"
A: The skill will fall back to system Python with a warning. Consider creating a venv: `python -m venv .venv`
Q: "Dependency installation failed"
A: Check that pip is available in your venv. Try manually: `pip install -r requirements.txt`
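To check for invalid JSON files in a generated skill, a quick standard-library scan works (the skill path is an example):

```python
import json
from pathlib import Path

for path in Path('.claude/skills/sales_2024-query').glob('*.json'):
    try:
        json.loads(path.read_text(encoding='utf-8'))
        print(f"OK   {path.name}")
    except (json.JSONDecodeError, UnicodeDecodeError) as e:
        print(f"BAD  {path.name}: {e}")
```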
### Performance Optimization
For large files:
- Run analysis on a machine with sufficient RAM (8GB+ recommended)
- Use SSD storage for faster I/O
- Close other applications during analysis
- Consider splitting very large files into multiple smaller files
## Best Practices
- Start Small: Test with a smaller sample of your data first
- Clean Data: Remove unnecessary sheets or columns before analysis
- Document Formulas: Complex formulas should be documented in Excel comments
- Validate Results: Compare generated skill queries against Excel calculations
- Version Control: Keep analysis outputs and generated skills in version control
- Iterate: Refine generated skills based on usage patterns
- Use Virtual Environments: Always work within a venv to avoid dependency conflicts
## Limitations
- Maximum tested file size: 100MB (500K rows × 10K columns)
- Formula conversion may require manual review for complex cases
- Custom VBA macros are not analyzed or converted
- Pivot tables are not directly supported (analyze source data instead)
- External data connections are not followed
## Additional Resources

For detailed information on:
- Formula conversion patterns: see `references/formula_conversion.md`
- Query patterns and examples: see `references/query_patterns.md`
- Performance tuning: see `references/performance_guide.md`
- Advanced usage: see `references/advanced_usage.md`
## Skill Outputs
After running both scripts, you will have:
- Analysis Directory: Contains all analysis results and Parquet data
- Generated Skill: A complete, ready-to-use Claude Agent Skill with environment management
- Documentation: Comprehensive SKILL.md in the generated skill
- Query Tools: Helper classes and functions for data querying
- Data Dictionary: Complete schema and metadata documentation
- Environment Setup: Automatic venv detection and dependency installation
## Integration with Claude Agent SDK

The generated skills work seamlessly with the Claude Agent SDK:

```python
from claude_agent_sdk import Agent

# Initialize the agent with the generated skill
agent = Agent(skills_dir="./.claude/skills")

# Query naturally (environment setup happens automatically)
response = agent.query(
    "Show me the top 10 customers by revenue in 2024"
)
print(response)
```