---
name: excel-analyzer
description: Analyze Excel files and automatically generate custom query skills for conversational data interaction. Also provides standalone conversion of Excel files with multiple sheets to Parquet datasets. Handles files with millions of rows, thousands of columns, formulas, and multiple sheets. Converts data to optimized Parquet format and creates complete skill packages with documentation, query helpers, and formula mappings.
---
# Excel Analyzer Skill

This skill analyzes Excel files and automatically generates custom Claude Agent Skills tailored to each file, enabling conversational querying of complex Excel data.
## What This Skill Does
The Excel Analyzer provides two main capabilities:
### Full Analysis & Skill Generation (Workflow 1)
Performs comprehensive analysis of Excel files and generates a complete, ready-to-use query skill that includes:
- Schema Extraction: Analyzes all sheets, columns, data types, and relationships
- Formula Analysis: Extracts Excel formulas with their locations and dependencies
- Data Conversion: Converts Excel to optimized Parquet format for 30x faster queries
- Skill Generation: Creates a complete skill with documentation, query helpers, and data dictionaries
- Documentation: Generates comprehensive SKILL.md with usage instructions and examples
### Standalone Parquet Conversion (Workflow 2)
Quickly converts Excel files with multiple sheets to Parquet datasets:
- Multi-Sheet Detection: Automatically detects all sheets in the Excel workbook
- Parquet Conversion: Converts each sheet to a separate compressed Parquet file
- Organized Storage: Stores files in `.claude/skills/{skill-name}/resources/parquet_data/`
- Ready for Analysis: Output files are immediately usable with Polars, Pandas, or PyArrow
## When to Use This Skill
Use this skill when you need to:
- Enable conversational AI querying of complex Excel files
- Work with large Excel files (100K+ rows, hundreds of columns)
- Analyze Excel files containing formulas and computed values
- Create reusable query interfaces for Excel datasets
- Convert Excel data to more efficient formats (Parquet)
- Generate data dictionaries and schema documentation automatically
- Build custom skills for specific business datasets
## Environment Setup and Dependency Management
CRITICAL: All generated skills MUST include automatic environment detection and dependency management to avoid polluting the global Python environment.
When generating a new skill, the skill generator MUST add the following capabilities to the generated SKILL.md:
### 1. Operating System Detection

The generated skill must detect the current OS to use the correct commands:

```python
import platform
import sys

os_type = platform.system()  # 'Windows', 'Linux', or 'Darwin' (macOS)
python_executable = sys.executable
python_version = sys.version
```
### 2. Virtual Environment Detection

The generated skill must detect and use virtual environments in this priority order:
- Check whether it is already running in an activated venv (`VIRTUAL_ENV` environment variable)
- Look for a `.venv/` directory (modern convention)
- Look for a `venv/` directory
- Look for an `env/` directory
- Only fall back to system Python if no venv is found (with a warning)
Detection logic:
```python
import os
from pathlib import Path

def detect_venv():
    """Detect a virtual environment in the current directory."""
    # Check if already in an activated venv
    if os.environ.get('VIRTUAL_ENV'):
        return Path(os.environ['VIRTUAL_ENV'])

    # Check for common venv directory names
    venv_candidates = ['.venv', 'venv', 'env']
    for venv_name in venv_candidates:
        venv_path = Path(venv_name)
        # Verify it's a valid venv by checking for pyvenv.cfg
        if venv_path.exists() and (venv_path / 'pyvenv.cfg').exists():
            return venv_path.absolute()

    return None  # No venv found; will use system Python
```
### 3. Dependency Installation

The generated skill must:
- Check whether required packages are installed in the detected venv
- Automatically install missing dependencies from `requirements.txt`
- NEVER install packages globally unless no venv exists
Installation logic:
```python
import subprocess
import importlib.util
import platform
from pathlib import Path

def check_package_installed(package_name):
    """Check whether a package is installed."""
    return importlib.util.find_spec(package_name) is not None

def install_dependencies(requirements_file, venv_path=None):
    """Install missing dependencies into the virtual environment."""
    if not Path(requirements_file).exists():
        return

    # Determine the pip executable
    if venv_path:
        if platform.system() == 'Windows':
            pip_executable = venv_path / 'Scripts' / 'pip.exe'
        else:
            pip_executable = venv_path / 'bin' / 'pip'
    else:
        pip_executable = 'pip'

    # Install dependencies
    cmd = [str(pip_executable), 'install', '-r', str(requirements_file)]
    if not venv_path:
        # Add --break-system-packages only when using system Python
        cmd.append('--break-system-packages')
    subprocess.run(cmd, check=True)
```
### 4. Complete Setup Function
Every generated skill MUST include this setup function at the beginning of SKILL.md:
```python
def setup_environment():
    """
    Detect the environment and ensure all dependencies are installed.
    This function MUST be called before any skill operations.
    """
    import platform
    import sys
    import os
    import subprocess
    from pathlib import Path

    # Detect OS
    os_type = platform.system()
    print(f"Detected OS: {os_type}")
    print(f"Python version: {sys.version}")
    print(f"Python executable: {sys.executable}")

    # Detect virtual environment
    venv_path = None
    if os.environ.get('VIRTUAL_ENV'):
        venv_path = Path(os.environ['VIRTUAL_ENV'])
        print(f"Already in activated venv: {venv_path}")
    else:
        # Look for a venv in the current directory
        for venv_name in ['.venv', 'venv', 'env']:
            candidate = Path(venv_name)
            if candidate.exists() and (candidate / 'pyvenv.cfg').exists():
                venv_path = candidate.absolute()
                print(f"Found virtual environment: {venv_path}")
                break

    if not venv_path:
        print("WARNING: No virtual environment detected. Will use system Python.")
        print("Consider creating a venv: python -m venv .venv")

    # Check and install dependencies
    skill_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
    requirements_file = skill_dir / 'requirements.txt'
    if requirements_file.exists():
        print(f"Checking dependencies from: {requirements_file}")

        # Determine the pip command
        if venv_path:
            if os_type == 'Windows':
                pip_cmd = [str(venv_path / 'Scripts' / 'pip.exe')]
            else:
                pip_cmd = [str(venv_path / 'bin' / 'pip')]
        else:
            # `python -m pip` is more robust than guessing the pip executable's path
            pip_cmd = [sys.executable, '-m', 'pip']

        # Install missing dependencies
        try:
            cmd = pip_cmd + ['install', '-r', str(requirements_file), '-q']
            if not venv_path:
                cmd.append('--break-system-packages')
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0:
                print("✓ All dependencies installed successfully")
            else:
                print(f"Warning: Dependency installation had issues: {result.stderr}")
        except Exception as e:
            print(f"Error installing dependencies: {e}")

    return {
        'os': os_type,
        'python_version': sys.version,
        'python_executable': sys.executable,
        'venv_path': str(venv_path) if venv_path else None,
        'venv_active': bool(venv_path),
        'skill_dir': str(skill_dir),
    }
```
### 5. Usage in Generated Skills
Every generated skill's SKILL.md must include instructions to call `setup_environment()` BEFORE any operations:

```python
# At the start of any generated skill usage:
from skill_setup import setup_environment

# MUST run this first
env_info = setup_environment()

# Now safe to proceed with skill operations
from query_helper import QueryHelper
helper = QueryHelper('data_parquet')
# ... rest of the code
```
The skill generator MUST create a `skill_setup.py` file in every generated skill directory containing the complete `setup_environment()` function.
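As a rough illustration, the generator could emit this file by serializing the setup function's source (a minimal sketch; `write_skill_setup` is a hypothetical helper, and the real generator may use templates instead):

```python
import inspect
from pathlib import Path

def write_skill_setup(skill_dir: Path, setup_fn) -> None:
    """Write skill_setup.py containing the given setup function's source."""
    header = "# Auto-generated by excel-analyzer. Do not edit by hand.\n\n"
    source = inspect.getsource(setup_fn)  # requires setup_fn to be defined in a file
    (skill_dir / "skill_setup.py").write_text(header + source, encoding="utf-8")
```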
## Prerequisites

Required Python packages (will be auto-installed if missing):

```bash
pip install openpyxl polars pyarrow pandas --break-system-packages
```
## Workflow Overview

```
Step 1: Analyze Excel File
        ↓
analyzer.py → Extracts schema, formulas, dependencies, samples
        ↓
Step 2: Generate Query Skill
        ↓
skill_generator.py → Creates custom skill package with environment setup
        ↓
Step 3: Use Generated Skill
        ↓
Auto-detects venv, installs dependencies, queries data conversationally
```
## Step-by-Step Usage

### Workflow Option 1: Full Analysis + Skill Generation (Recommended)
For comprehensive analysis and automatic skill generation with documentation, query helpers, and formula mapping.
#### Step 1A: Analyze the Excel File

```bash
python scripts/analyzer.py \
  --file /path/to/your/data.xlsx \
  --output-dir ./analysis_output
```
What this does:
- Loads the Excel file and analyzes all sheets
- Extracts comprehensive schema (columns, types, samples)
- Identifies and documents all formulas
- Builds formula dependency graphs (see the sketch after this list)
- Extracts stratified sample data (50 rows per sheet)
- Converts all sheets to Parquet format
- Saves analysis results as JSON files
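As a rough illustration of the formula-extraction step, the sketch below shows how formula cells and their referenced cells can be collected with openpyxl (illustrative only; the actual analyzer records more metadata):

```python
import re
import openpyxl

# Matches cell references such as A1, $B$2, or Sheet1!C3 (simplified)
CELL_REF = re.compile(r"(?:'?[\w ]+'?!)?\$?[A-Z]{1,3}\$?\d+")

def extract_formulas(xlsx_path):
    """Map each formula cell to its expression and referenced cells."""
    wb = openpyxl.load_workbook(xlsx_path)  # no data_only, so formulas are kept
    formula_map = {}
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith('='):
                    key = f"{ws.title}!{cell.coordinate}"
                    formula_map[key] = {
                        'formula': cell.value,
                        'depends_on': CELL_REF.findall(cell.value),
                    }
    return formula_map
```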
Output files created:
```
analysis_output/
├── analysis_results.json      # Complete analysis
├── data_dictionary.json       # Schema and column metadata
├── formula_map.json           # All formulas with locations
├── dependencies.json          # Formula dependency graph
├── sample_data.json           # Representative data samples
└── resources/parquet_data/    # Converted Parquet files
    ├── Sheet1.parquet
    └── Sheet2.parquet
```
#### Step 1B: Generate the Query Skill

```bash
# Option 1: Auto-detect the .claude/skills directory (recommended)
python scripts/skill_generator.py \
  --analysis-dir ./analysis_output

# Option 2: Specify a custom output directory
python scripts/skill_generator.py \
  --analysis-dir ./analysis_output \
  --output-dir ./custom_skills_directory
```
Default Behavior: If `--output-dir` is not specified, the skill is created in `.claude/skills/` in the current working directory, keeping all generated skills organized alongside the excel-analyzer skill.
What this does:
- Reads the analysis results
- Generates a complete skill package
- Creates comprehensive SKILL.md documentation
- Generates query_helper.py with utility functions
- Creates formula_functions.py for Excel formula equivalents
- Creates skill_setup.py with environment detection and dependency management
- Copies all data files and Parquet data
- Includes requirements.txt for dependencies
Generated skill structure:
```
.claude/skills/{filename}-query/
├── SKILL.md                    # Complete documentation with setup instructions
├── skill_setup.py              # Environment detection and dependency management
├── data_dictionary.json        # Schema reference
├── formula_map.json            # Formula documentation
├── dependencies.json           # Formula dependency graph
├── sample_data.json            # Sample data
├── query_helper.py             # Query utility class
├── formula_functions.py        # Formula equivalents
├── requirements.txt            # Dependencies
└── resources/parquet_data/     # Converted Parquet files
    ├── Sheet1.parquet
    └── Sheet2.parquet
```
Note: The skill name is automatically derived from the Excel filename with a `-query` suffix (e.g., `sales_2024.xlsx` becomes `sales_2024-query`).
#### Step 1C: Use the Generated Skill
The generated skill can be used immediately with Claude Agent SDK and will automatically handle environment setup:
```python
# Example: Query the data (environment setup happens automatically)
from skill_setup import setup_environment
from query_helper import QueryHelper
import polars as pl

# Set up the environment (detects venv, installs dependencies)
env_info = setup_environment()

# Initialize
helper = QueryHelper('data_parquet')

# Load a sheet
df = helper.load_sheet('Sheet1', lazy=True)

# Query with filters
result = helper.query_by_filter(
    'Sheet1',
    [pl.col('age') > 25, pl.col('status') == 'active'],
    columns=['name', 'age', 'revenue'],
    limit=100
)
print(result)

# Aggregate data
aggregated = helper.aggregate_by_group(
    'Sheet1',
    group_by=['category'],
    aggregations={'revenue': 'sum', 'orders': 'count'},
    limit=50
)
print(aggregated)
```
### Workflow Option 2: Standalone Parquet Conversion (Quick & Simple)
For users who just need to convert Excel files with multiple sheets to Parquet datasets without the full analysis and skill generation. This is ideal when you:
- Already have your own query tools/scripts
- Just need efficient data storage format
- Want to quickly convert Excel to Parquet for use with Polars/Pandas
- Don't need formula analysis or auto-generated documentation
#### Step 2A: Convert Excel to Parquet

```bash
# Basic usage (auto-detects skill name from the filename)
python scripts/excel_to_parquet.py --file /path/to/your/data.xlsx

# Specify a custom skill name
python scripts/excel_to_parquet.py --file /path/to/sales_data.xlsx --skill-name sales-2024
```
What this does:
- Detects all sheets in the Excel file
- Converts each sheet to a separate Parquet file with zstd compression
- Stores files in `.claude/skills/{skill-name}/resources/parquet_data/`
- Sanitizes sheet names for valid filenames (see the sketch after this list)
- Provides a detailed conversion summary with row/column counts and file sizes
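The core of this conversion can be sketched as follows (assuming openpyxl and Polars; the real script adds progress reporting and error handling):

```python
import re
from pathlib import Path

import openpyxl
import polars as pl

def sanitize(name: str) -> str:
    """Replace characters that are invalid in filenames with underscores."""
    return re.sub(r'[^\w\-]', '_', name)

def excel_to_parquet(xlsx_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wb = openpyxl.load_workbook(xlsx_path, read_only=True, data_only=True)
    for ws in wb.worksheets:
        rows = ws.iter_rows(values_only=True)
        try:
            header = [str(h) for h in next(rows)]  # first row as column names
        except StopIteration:
            continue  # skip empty sheets
        df = pl.DataFrame([list(r) for r in rows], schema=header, orient='row')
        df.write_parquet(out / f"{sanitize(ws.title)}.parquet", compression='zstd')
```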
Output structure:
```
.claude/skills/{skill-name}/
└── resources/
    └── parquet_data/
        ├── Sheet1.parquet
        ├── Sheet2.parquet
        ├── Sales_Data.parquet
        └── Monthly_Report.parquet
```
Example output:
```
🚀 Starting Excel to Parquet conversion...
📄 File: sales_data.xlsx
📁 Skill: sales-data
💾 Output: .claude/skills/sales-data/resources/parquet_data

✅ Created directory structure: .claude/skills/sales-data/resources/parquet_data
📊 Loading workbook...
✅ Found 3 sheet(s): Summary, Q1_Sales, Q2_Sales

🔄 Converting sheets to Parquet...
  [1/3] Summary... ✅ (150 rows × 8 cols, 0.05 MB)
  [2/3] Q1_Sales... ✅ (5000 rows × 15 cols, 0.45 MB)
  [3/3] Q2_Sales... ✅ (5200 rows × 15 cols, 0.47 MB)

======================================================================
📊 CONVERSION SUMMARY
======================================================================
✅ Successful: 3/3

📦 Generated Parquet files:
  • Summary.parquet (150 rows, 0.05 MB)
  • Q1_Sales.parquet (5000 rows, 0.45 MB)
  • Q2_Sales.parquet (5200 rows, 0.47 MB)

📏 Total size: 0.97 MB
📁 Location: .claude/skills/sales-data/resources/parquet_data
======================================================================
🎉 Conversion complete!
```
#### Step 2B: Use the Parquet Files
Once converted, you can use the Parquet files directly with Polars, Pandas, or PyArrow:
```python
import polars as pl

# Load a specific sheet
df = pl.read_parquet('.claude/skills/sales-data/resources/parquet_data/Q1_Sales.parquet')

# Query the data
result = df.filter(pl.col('revenue') > 10000).select(['customer', 'revenue', 'date'])
print(result)

# Aggregate
summary = df.group_by('region').agg([
    pl.col('revenue').sum().alias('total_revenue'),
    pl.col('orders').count().alias('order_count')
])
print(summary)
```
With Pandas:
```python
import pandas as pd

# Load a specific sheet
df = pd.read_parquet('.claude/skills/sales-data/resources/parquet_data/Q1_Sales.parquet')

# Query the data
result = df[df['revenue'] > 10000][['customer', 'revenue', 'date']]
print(result)
```
## Comparison: Workflow 1 vs Workflow 2
| Feature | Workflow 1 (Full Analysis) | Workflow 2 (Parquet Only) |
|---|---|---|
| Parquet Conversion | ✅ Yes | ✅ Yes |
| Schema Analysis | ✅ Yes | ❌ No |
| Formula Extraction | ✅ Yes | ❌ No |
| Auto-generated Skill | ✅ Yes | ❌ No |
| Query Helper Classes | ✅ Yes | ❌ No |
| Documentation | ✅ Complete | ❌ No |
| Dependencies Setup | ✅ Automatic | ⚠️ Manual |
| Use Case | Full-featured skills | Quick data conversion |
| Best For | Production, reusable skills | One-off conversions |
## Supported Features

### Excel File Characteristics
- Size: Up to 100MB files (tested with 500K rows × 10K columns)
- Formats: .xlsx, .xlsm
- Sheets: Multiple sheets per file
- Formulas: Native Excel formulas (SUM, AVERAGE, IF, VLOOKUP, etc.)
- Data Types: Text, numeric, datetime, boolean, mixed
- Special Features: Named ranges, formatted tables, computed columns
### Analysis Capabilities
- Schema Detection: Automatic column type inference
- Formula Extraction: Identifies all formula cells and their expressions
- Dependency Graphs: Maps formula dependencies across cells and sheets
- Data Profiling: Sample values, null counts, unique values (see the sketch after this list)
- Performance: Handles large files with streaming processing
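A simplified version of the profiling step, using Polars (illustrative; `profile_columns` is a hypothetical helper):

```python
import polars as pl

def profile_columns(df: pl.DataFrame) -> pl.DataFrame:
    """Per-column null count, distinct count, and one sample value."""
    samples = []
    for c in df.columns:
        non_null = df[c].drop_nulls()
        samples.append(str(non_null[0]) if len(non_null) else None)
    return pl.DataFrame({
        'column': df.columns,
        'nulls': [df[c].null_count() for c in df.columns],
        'unique': [df[c].n_unique() for c in df.columns],
        'sample': samples,
    })
```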
### Generated Query Features
- Lazy Loading: Memory-efficient data loading (a simplified helper sketch follows this list)
- Streaming Queries: Process large datasets without memory overflow
- Filter Operations: Complex multi-condition filtering
- Aggregations: Group by, sum, mean, count, min, max
- Pagination: Automatic result limiting and pagination
- Formula Support: Pre-computed values and formula documentation
- Environment Management: Automatic venv detection and dependency installation
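To make these behaviors concrete, here is a simplified sketch of how a generated `query_helper.py` might implement them (illustrative only; the generated code may differ):

```python
from pathlib import Path

import polars as pl

class QueryHelper:
    def __init__(self, parquet_dir: str):
        self.parquet_dir = Path(parquet_dir)

    def load_sheet(self, sheet: str, lazy: bool = True):
        """Lazily scan (or eagerly read) one sheet's Parquet file."""
        path = self.parquet_dir / f"{sheet}.parquet"
        return pl.scan_parquet(path) if lazy else pl.read_parquet(path)

    def query_by_filter(self, sheet, filters, columns=None, limit=100):
        """Apply filter expressions lazily; only `limit` rows are materialized."""
        lf = self.load_sheet(sheet, lazy=True)
        for f in filters:
            lf = lf.filter(f)
        if columns:
            lf = lf.select(columns)
        return lf.limit(limit).collect()
```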
## Technical Details

### Performance Characteristics
| File Size | Rows | Columns | Analysis Time | Query Time |
|---|---|---|---|---|
| Small | <10K | <100 | 1-2 min | <1 sec |
| Medium | 100K | <500 | 5-10 min | <10 sec |
| Large | 500K | <2K | 20-30 min | 1-5 min |
| Very Large | 1M+ | 5-10K | 1-2 hours | 5-15 min |
### Memory Management
- Uses Polars streaming mode to handle datasets larger than available RAM (see the example after this list)
- Lazy loading prevents loading entire datasets into memory
- Parquet columnar format enables efficient column selection
- Read-only mode for initial analysis reduces memory footprint
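For example, with Polars the Parquet data can be scanned lazily and collected with the streaming engine, so only the needed columns and chunks ever enter memory:

```python
import polars as pl

# Nothing is read into memory at scan time
lf = pl.scan_parquet('resources/parquet_data/Sheet1.parquet')

# Only matching rows of the two selected columns are materialized;
# on older Polars versions use .collect(streaming=True) instead
result = (
    lf.filter(pl.col('revenue') > 0)
      .select(['customer', 'revenue'])
      .collect(engine='streaming')
)
```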
### Formula Handling
Supported Formula Functions:
- Arithmetic: SUM, AVERAGE, COUNT, MIN, MAX
- Logical: IF, AND, OR, NOT
- Lookup: VLOOKUP, HLOOKUP, INDEX, MATCH
- Conditional: SUMIF, COUNTIF, AVERAGEIF
- Text: CONCATENATE, LEFT, RIGHT, MID, LEN
Formula Conversion:
Excel formulas are documented in `formula_map.json` and can be converted to Polars expressions. Complex formulas may require manual review and adjustment.
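For example, an Excel `SUMIF` typically maps to a filtered aggregation in Polars (a hand-written illustration with toy data, not generated output):

```python
import polars as pl

df = pl.DataFrame({'B': [50, 150, 200], 'C': [1.0, 2.0, 3.0]})

# Excel: =SUMIF(B:B, ">100", C:C)
# Polars: sum C over rows where B > 100
total = df.filter(pl.col('B') > 100).select(pl.col('C').sum()).item()
print(total)  # 5.0
```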
### Data Type Inference
The analyzer automatically infers data types from samples (a simplified sketch follows this list):
- Numeric: Integers and floats (including percentages, currencies)
- Text: Strings and categorical data
- Datetime: Dates, times, and timestamps
- Boolean: True/False values
- Mixed: Columns with multiple data types
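A minimal sketch of this sample-based inference (the analyzer's actual heuristics may be more involved; `infer_column_type` is a hypothetical helper):

```python
from datetime import date, datetime

def infer_column_type(samples: list) -> str:
    """Infer a coarse column type from sample values."""
    seen = set()
    for v in samples:
        if v is None:
            continue  # nulls don't vote
        if isinstance(v, bool):
            seen.add('boolean')  # check bool before int: bool is an int subclass
        elif isinstance(v, (int, float)):
            seen.add('numeric')
        elif isinstance(v, (date, datetime)):
            seen.add('datetime')
        else:
            seen.add('text')
    if not seen:
        return 'text'  # all-null column; default to text
    return seen.pop() if len(seen) == 1 else 'mixed'
```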
## Examples

### Example 1: Sales Data Analysis

```bash
# Analyze sales data
python scripts/analyzer.py \
  --file sales_2024.xlsx \
  --output-dir ./sales_analysis

# Generate query skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./sales_analysis

# Now you can query: "Show me total revenue by region for Q1"
# The skill will be available at: .claude/skills/sales_2024-query/
# Environment setup happens automatically!
```
### Example 2: Customer Database

```bash
# Analyze customer database
python scripts/analyzer.py \
  --file customer_data.xlsx \
  --output-dir ./customer_analysis

# Generate skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./customer_analysis

# Query: "Find all active customers over age 30 with purchases > $1000"
# The skill will be available at: .claude/skills/customer_data-query/
```
### Example 3: Financial Reports

```bash
# Analyze financial report with formulas
python scripts/analyzer.py \
  --file financial_report.xlsx \
  --output-dir ./finance_analysis

# Generate skill (auto-saves to .claude/skills/)
python scripts/skill_generator.py \
  --analysis-dir ./finance_analysis

# Query: "Compare revenue vs expenses across all departments"
# The skill will be available at: .claude/skills/financial_report-query/
```
## Troubleshooting

### Common Issues

Q: "File not found" error
A: Ensure the Excel file path is absolute or relative to the current directory

Q: "Memory error during analysis"
A: The Excel file may be too large. Try analyzing on a machine with more RAM or reduce the file size
Q: "Formula parsing error"
A: Some complex or custom formulas may not parse correctly. Check `formula_map.json` for details
Q: "Parquet conversion failed"
A: Install the required packages: `pip install polars pyarrow --break-system-packages`
Q: Generated skill doesn't load
A: Ensure all JSON files are valid (see the snippet below). Check for Unicode or encoding issues in the Excel data
Q: "Virtual environment not detected"
A: The skill will fall back to system Python with a warning. Consider creating a venv: `python -m venv .venv`
Q: "Dependency installation failed"
A: Check that pip is available in your venv. Try manually: `pip install -r requirements.txt`
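To check for invalid JSON files in a generated skill, a quick standard-library scan works (the skill path is an example):

```python
import json
from pathlib import Path

for path in Path('.claude/skills/sales_2024-query').glob('*.json'):
    try:
        json.loads(path.read_text(encoding='utf-8'))
        print(f"OK   {path.name}")
    except (json.JSONDecodeError, UnicodeDecodeError) as e:
        print(f"BAD  {path.name}: {e}")
```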
### Performance Optimization
For large files:
- Run analysis on a machine with sufficient RAM (8GB+ recommended)
- Use SSD storage for faster I/O
- Close other applications during analysis
- Consider splitting very large files into multiple smaller files
## Best Practices
- Start Small: Test with a smaller sample of your data first
- Clean Data: Remove unnecessary sheets or columns before analysis
- Document Formulas: Complex formulas should be documented in Excel comments
- Validate Results: Compare generated skill queries against Excel calculations
- Version Control: Keep analysis outputs and generated skills in version control
- Iterate: Refine generated skills based on usage patterns
- Use Virtual Environments: Always work within a venv to avoid dependency conflicts
## Limitations
- Maximum tested file size: 100MB (500K rows × 10K columns)
- Formula conversion may require manual review for complex cases
- Custom VBA macros are not analyzed or converted
- Pivot tables are not directly supported (analyze source data instead)
- External data connections are not followed
## Additional Resources

For detailed information on:
- Formula conversion patterns: see `references/formula_conversion.md`
- Query patterns and examples: see `references/query_patterns.md`
- Performance tuning: see `references/performance_guide.md`
- Advanced usage: see `references/advanced_usage.md`
## Skill Outputs
After running both scripts, you will have:
- Analysis Directory: Contains all analysis results and Parquet data
- Generated Skill: A complete, ready-to-use Claude Agent Skill with environment management
- Documentation: Comprehensive SKILL.md in the generated skill
- Query Tools: Helper classes and functions for data querying
- Data Dictionary: Complete schema and metadata documentation
- Environment Setup: Automatic venv detection and dependency installation
## Integration with Claude Agent SDK

The generated skills work seamlessly with the Claude Agent SDK:

```python
from claude_agent_sdk import Agent

# Initialize the agent with the generated skill
agent = Agent(skills_dir="./.claude/skills")

# Query naturally (environment setup happens automatically)
response = agent.query(
    "Show me the top 10 customers by revenue in 2024"
)
print(response)
```