| name | excel-parser |
| description | Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends optimal processing strategy - either high-speed Pandas mode for standard tables or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure where optimization between speed and accuracy is needed. |
Excel Parser
Table of Contents
- Overview
- Core Philosophy: Scout Pattern
- When to Use This Skill
- Processing Workflow
- Complexity Scoring Rules
- Path A: Pandas Standard Mode
- Path B: HTML Semantic Mode
- Best Practices
- Troubleshooting
- Dependencies
- Resources
Overview
Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).
Core Philosophy: Scout Pattern
Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:
- Metadata Scanning - Use `openpyxl` to scan file structure without loading data
- Complexity Scoring - Calculate a score based on merged cells, row count, and layout
- Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
- Optimized Execution - Execute with the most appropriate tool for the file type
Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"
When to Use This Skill
Use excel-parser when:
- Processing Excel/CSV files with unknown structure or varying complexity
- Handling files ranging from simple data tables to complex financial reports
- Optimizing the trade-off between processing speed and extraction accuracy
- Working with files that may contain merged cells, multi-level headers, or irregular layouts
Skip this skill when:
- File structure is already known and documented
- Processing simple, well-structured tables with confirmed format
- Using predefined scripts for specific file formats
Processing Workflow
Step 1: Analyze File Complexity
Use `scripts/complexity_analyzer.py` to scan file metadata:

```bash
python scripts/complexity_analyzer.py <file_path> [sheet_name]
```
What it analyzes (without loading data):
- Merged cell distribution (shallow vs deep in the table)
- Row count and data continuity
- Empty row interruptions (indicates multi-table layouts)
Output (JSON format):
```json
{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}
```
Step 2: Route to Optimal Strategy
Based on complexity analysis:
- `is_complex = false` → Use Path A (Pandas Standard Mode)
- `is_complex = true` → Use Path B (HTML Semantic Mode)
Step 3: Execute Processing
Follow the selected path's workflow to extract data.
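Putting the three steps together, a minimal driver might look like the following. This is a sketch: it assumes the analyzer prints its JSON report to stdout (as shown above), and the two `process_with_*` helpers are hypothetical stand-ins for Paths A and B.

```python
import json
import subprocess
import sys

# Run the scout and parse its JSON report (assumes the analyzer writes JSON to stdout)
report = json.loads(subprocess.run(
    [sys.executable, "scripts/complexity_analyzer.py", "report.xlsx"],
    capture_output=True, text=True, check=True,
).stdout)

# Route on the scout's recommendation
if report["recommended_strategy"] == "pandas":
    process_with_pandas("report.xlsx")  # Path A: two-phase Pandas read (hypothetical helper)
else:
    process_with_html("report.xlsx")    # Path B: semantic HTML conversion (hypothetical helper)
```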
Complexity Scoring Rules
Rule 1: Deep Merged Cells
- Condition: Merged cells appearing beyond row 5
- Interpretation: Complex table structure (not just header formatting)
- Decision: Mark as complex if >2 deep merges detected
- Example: Financial reports with merged category labels in data region
Rule 2: Empty Row Interruptions
- Condition: Multiple empty rows within the table
- Interpretation: Multiple sub-tables in single sheet
- Decision: Mark as complex if >2 empty row interruptions found
- Example: Summary table + detail table in one sheet
Rule 3: Row Count Override
- Condition: Total rows >1000
- Interpretation: Too large for HTML processing (token explosion)
- Decision: Force Pandas mode regardless of complexity
- Rationale: HTML conversion would exceed token limits
Rule 4: Default (Standard Table)
- Condition: No deep merges, continuous data, moderate size
- Interpretation: Standard data table
- Decision: Use Pandas for optimal speed
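The shipped `scripts/complexity_analyzer.py` is the authoritative implementation; the following is a minimal sketch of how the four rules could be expressed with `openpyxl`. Unlike the real scout, this simple version parses the sheet fully; treat the threshold constants and field names as illustrative.

```python
from openpyxl import load_workbook

DEEP_MERGE_START_ROW = 5   # Rule 1: merges starting past this row are "deep"
MAX_DEEP_MERGES = 2        # Rule 1 threshold
MAX_EMPTY_BREAKS = 2       # Rule 2 threshold
ROW_COUNT_OVERRIDE = 1000  # Rule 3: force Pandas above this

def analyze_complexity(path, sheet_name=None):
    wb = load_workbook(path, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    # Rule 1: merged ranges that start below the header region
    deep_merges = sum(1 for rng in ws.merged_cells.ranges
                      if rng.min_row > DEEP_MERGE_START_ROW)

    # Rule 2: fully empty rows inside the used range (multi-table hint)
    empty_breaks = sum(1 for row in ws.iter_rows(values_only=True)
                       if all(v is None for v in row))

    total_rows = ws.max_row

    # Rule 3: large files always go to Pandas (HTML would explode token usage)
    if total_rows > ROW_COUNT_OVERRIDE:
        is_complex = False
        reasons = [f"Row count exceeds {ROW_COUNT_OVERRIDE}, forcing Pandas mode"]
    else:
        # Rules 1, 2, and 4: complex only if merges or interruptions pass thresholds
        is_complex = deep_merges > MAX_DEEP_MERGES or empty_breaks > MAX_EMPTY_BREAKS
        reasons = [f"{deep_merges} deep merges, {empty_breaks} empty interruptions"]

    return {
        "is_complex": is_complex,
        "recommended_strategy": "html" if is_complex else "pandas",
        "reasons": reasons,
        "stats": {"total_rows": total_rows, "deep_merges": deep_merges,
                  "empty_interruptions": empty_breaks},
    }
```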
Path A: Pandas Standard Mode
When: Simple/large tables (most common case)
Strategy: The agent analyzes ONLY the first 20 rows to determine the header position, then uses Pandas to read the full data at native speed.
Workflow:
1. Sample First 20 Rows
   - Read only the first 20 rows using `pd.read_excel(..., nrows=20)`
   - Convert to CSV format for analysis
2. Determine Header Position
   - Examine the sampled rows to identify which row contains the actual column headers
   - Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names
3. Read Full Data
   - Use `pd.read_excel(..., header=<detected_row>)` to load complete data
   - The `header` parameter ensures proper column naming
Token Cost: ~500 tokens (only 20 rows analyzed)
Processing Speed: Very fast (Pandas native speed)
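A minimal sketch of this two-phase read (a hypothetical helper, not the shipped router; in practice the header index comes from the agent's inspection of the printed sample, hard-coded here for illustration):

```python
import pandas as pd

def read_with_detected_header(path, sheet_name=0):
    # Phase 1: sample only the first 20 rows, headerless, and render them as CSV text
    sample = pd.read_excel(path, sheet_name=sheet_name, header=None, nrows=20)
    print(sample.to_csv(index=False, header=False))  # the agent inspects this output

    # Phase 2: once the header row is identified, read the full file at native speed
    detected_row = 0  # placeholder: the agent supplies this after inspecting the sample
    return pd.read_excel(path, sheet_name=sheet_name, header=detected_row)
```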
For implementation details, see `references/smart_excel_router.py`.
Path B: HTML Semantic Mode
When: Complex/irregular tables (merged cells, multi-level headers)
Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data with an understanding of the visual layout.
Workflow:
1. Convert to Semantic HTML
   - Load the workbook with `openpyxl`
   - Build an HTML table preserving merged cell spans
   - Use `rowspan` and `colspan` attributes to maintain structure
2. Extract Structured Data
   - Analyze HTML table structure
   - Identify hierarchical headers from merged cells
   - Extract data preserving semantic relationships
Token Cost: Higher (full HTML structure analyzed)
Processing Speed: Slower (semantic extraction)
Use Case: Only for small (<1000 rows), complex files where Pandas would fail
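For the conversion step, a minimal sketch (assuming default, non-read-only `openpyxl` mode so merged ranges are exposed; the helper name is illustrative):

```python
from html import escape
from openpyxl import load_workbook

def sheet_to_semantic_html(path, sheet_name=None):
    wb = load_workbook(path, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    # Record span sizes at each merge anchor, and which cells the merges swallow
    spans, covered = {}, set()
    for rng in ws.merged_cells.ranges:
        spans[(rng.min_row, rng.min_col)] = (rng.max_row - rng.min_row + 1,
                                             rng.max_col - rng.min_col + 1)
        covered.update((r, c)
                       for r in range(rng.min_row, rng.max_row + 1)
                       for c in range(rng.min_col, rng.max_col + 1)
                       if (r, c) != (rng.min_row, rng.min_col))

    rows = []
    for r in range(1, ws.max_row + 1):
        cells = []
        for c in range(1, ws.max_column + 1):
            if (r, c) in covered:
                continue  # rendered by a rowspan/colspan anchor above or to the left
            value = ws.cell(row=r, column=c).value
            rowspan, colspan = spans.get((r, c), (1, 1))
            attrs = ((f' rowspan="{rowspan}"' if rowspan > 1 else "")
                     + (f' colspan="{colspan}"' if colspan > 1 else ""))
            cells.append(f"<td{attrs}>{escape(str(value)) if value is not None else ''}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```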
For implementation details, see `references/smart_excel_router.py`.
Best Practices
1. Trust the Scout
Always run the complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on the wrong approach.
2. Respect the Row Count Rule
Never attempt HTML mode on files >1000 rows. Token limits will cause failures.
3. Pandas First for Unknown Files
When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.
4. Cache Analysis Results
If processing multiple sheets from the same file, run the analysis once and cache the results.
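A minimal sketch using the standard library (assuming an `analyze_complexity` function like the illustrative scorer earlier in this document):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_analysis(path, sheet_name=None):
    # Arguments are hashable strings, so repeat calls per (file, sheet) hit the cache
    return analyze_complexity(path, sheet_name)
```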
5. Preserve Original Files
Never modify the original Excel file during analysis or processing.
Troubleshooting
File Cannot Be Opened
- Symptom: `FileNotFoundError` or permission errors
- Causes: Invalid path, file locked by another process, insufficient permissions
- Solutions:
  - Verify the file path is correct and the file exists
  - Close the file if open in Excel or another application
  - Check read permissions on the file
Corrupted File Errors
- Symptom: `BadZipFile` or `InvalidFileException`
- Causes: Incomplete download, file corruption, wrong file extension
- Solutions:
  - Re-download or obtain a fresh copy of the file
  - Verify the file is an actual Excel file (not a CSV with an .xlsx extension)
  - Try opening it in Excel to confirm file integrity
Memory Issues with Large Files
- Symptom: `MemoryError` or system slowdown
- Causes: File too large for available RAM
- Solutions:
  - Use `read_only=True` mode in openpyxl
  - For CSV input, process the file in chunks with the Pandas `chunksize` parameter; note that `pd.read_excel` does not accept `chunksize` (see the sketch below)
  - Increase system memory or use a machine with more RAM
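A sketch of both streaming options (`handle` and `handle_row` are hypothetical per-chunk and per-row consumers):

```python
import pandas as pd
from openpyxl import load_workbook

# CSV: stream in 50k-row chunks instead of loading everything at once
for chunk in pd.read_csv("large_file.csv", chunksize=50_000):
    handle(chunk)  # hypothetical: aggregate, filter, or write out each piece

# .xlsx: read-only mode streams rows without building the full in-memory cell tree
wb = load_workbook("large_file.xlsx", read_only=True)
for row in wb.active.iter_rows(values_only=True):
    handle_row(row)  # hypothetical per-row consumer
wb.close()  # read-only workbooks should be closed explicitly
```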
Encoding Problems
- Symptom: Garbled text or `UnicodeDecodeError`
- Causes: Non-UTF-8 encoding in source data
- Solutions:
  - Specify the encoding when reading CSV, e.g. `pd.read_csv(..., encoding='gbk')` (fallback sketch below)
  - For Excel, data is usually UTF-8; check how the source data was generated
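A common fallback pattern (a sketch; which candidate encodings to try depends on the data source):

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "gbk", "latin-1")):
    # Try each candidate encoding until one decodes cleanly
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")
```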
HTML Mode Token Overflow
- Symptom: Truncated output or API errors
- Causes: Complex file exceeds token limits despite row count check
- Solutions:
  - Force Pandas mode even for complex files
  - Split the sheet into smaller ranges and process separately
  - Extract only essential columns before HTML conversion
Incorrect Header Detection
- Symptom: Wrong columns or data shifted
- Causes: Unusual header patterns not caught by sampling
- Solutions:
  - Manually specify the header row if known (examples below)
  - Increase the sample size beyond 20 rows
  - Use HTML mode for better structure understanding
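For instance, when the header position is known or the table has stacked headers (hedged examples using standard Pandas options):

```python
import pandas as pd

# Pin the header manually: here the real column names sit on the third row
df = pd.read_excel("report.xlsx", header=2)

# Two stacked header rows become a MultiIndex on the columns
df_multi = pd.read_excel("report.xlsx", header=[0, 1])
```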
Dependencies
Required Python packages:
- `openpyxl` - Metadata scanning and Excel file manipulation
- `pandas` - High-speed data reading and manipulation
Resources
This skill includes:
- `scripts/complexity_analyzer.py` - Standalone executable for complexity analysis
- `references/smart_excel_router.py` - Complete implementation reference with both processing paths