Claude Code Plugins

Community-maintained marketplace

excel-parser

@Harryoung/efka

Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends the optimal processing strategy: high-speed Pandas mode for standard tables, or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure, where a balance between speed and accuracy is needed.

Install Skill

1. Download skill

2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: excel-parser
description: Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis. Analyzes file structure (merged cells, row count, table layout) using lightweight metadata scanning, then recommends the optimal processing strategy: high-speed Pandas mode for standard tables, or semantic HTML mode for complex reports. Use when processing Excel/CSV files with unknown or varying structure, where a balance between speed and accuracy is needed.

Excel Parser

Overview

Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

  1. Metadata Scanning - Use openpyxl to scan file structure without loading data
  2. Complexity Scoring - Calculate score based on merged cells, row count, and layout
  3. Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
  4. Optimized Execution - Execute with the most appropriate tool for the file type

Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

When to Use This Skill

Use excel-parser when:

  • Processing Excel/CSV files with unknown structure or varying complexity
  • Handling files ranging from simple data tables to complex financial reports
  • You need to balance processing speed against extraction accuracy
  • Working with files that may contain merged cells, multi-level headers, or irregular layouts

Skip this skill when:

  • File structure is already known and documented
  • Processing simple, well-structured tables with confirmed format
  • Using predefined scripts for specific file formats

Processing Workflow

Step 1: Analyze File Complexity

Use scripts/complexity_analyzer.py to scan file metadata:

python scripts/complexity_analyzer.py <file_path> [sheet_name]

What it analyzes (without loading data):

  • Merged cell distribution (shallow vs deep in the table)
  • Row count and data continuity
  • Empty row interruptions (indicates multi-table layouts)

Output (JSON format):

{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}
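The scan's core logic might look like the following sketch. This is an assumption about what complexity_analyzer.py does internally, not the shipped script: in the real analyzer the row values and merged ranges would come from an openpyxl read-only workbook, while here they are passed in directly so the logic stays self-contained.

```python
def scan_metadata(rows, merged_ranges, deep_merge_row=5):
    """Collect the stats the router needs without heavy parsing.

    rows          -- list of row tuples (cell values, None for empty cells)
    merged_ranges -- list of (min_row, min_col, max_row, max_col), 1-based
    """
    total_rows = len(rows)

    # Rule 1 input: merged cells that start below the header region.
    deep_merges = sum(1 for (r1, _, _, _) in merged_ranges
                      if r1 > deep_merge_row)

    # Rule 2 input: fully empty rows after data has started. Each such row
    # counts as one interruption here (a sketch; the real script may count
    # runs of empty rows instead).
    empty_interruptions = 0
    seen_data = False
    for row in rows:
        if all(v is None for v in row):
            if seen_data:
                empty_interruptions += 1
        else:
            seen_data = True

    return {
        "total_rows": total_rows,
        "deep_merges": deep_merges,
        "empty_interruptions": empty_interruptions,
    }
```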

Step 2: Route to Optimal Strategy

Based on complexity analysis:

  • is_complex = false → Use Path A (Pandas Standard Mode)
  • is_complex = true → Use Path B (HTML Semantic Mode)

Step 3: Execute Processing

Follow the selected path's workflow to extract data.

Complexity Scoring Rules

Rule 1: Deep Merged Cells

  • Condition: Merged cells appearing beyond row 5
  • Interpretation: Complex table structure (not just header formatting)
  • Decision: Mark as complex if >2 deep merges detected
  • Example: Financial reports with merged category labels in data region

Rule 2: Empty Row Interruptions

  • Condition: Multiple empty rows within the table
  • Interpretation: Multiple sub-tables in single sheet
  • Decision: Mark as complex if >2 empty row interruptions found
  • Example: Summary table + detail table in one sheet

Rule 3: Row Count Override

  • Condition: Total rows >1000
  • Interpretation: Too large for HTML processing (token explosion)
  • Decision: Force Pandas mode regardless of complexity
  • Rationale: HTML conversion would exceed token limits

Rule 4: Default (Standard Table)

  • Condition: No deep merges, continuous data, moderate size
  • Interpretation: Standard data table
  • Decision: Use Pandas for optimal speed
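Taken together, the four rules reduce to a single routing function. The thresholds below (>2 deep merges, >2 interruptions, >1000 rows) are lifted directly from the rules as written; the function itself is a sketch, not the shipped router.

```python
def choose_strategy(stats, max_html_rows=1000):
    """Apply Rules 1-4 to the stats produced by the metadata scan."""
    reasons = []

    # Rule 3 wins outright: large files always go to Pandas.
    if stats["total_rows"] > max_html_rows:
        reasons.append(f"Row count exceeds {max_html_rows}, forcing Pandas mode")
        return {"is_complex": False, "recommended_strategy": "pandas",
                "reasons": reasons}

    # Rules 1 and 2: structural complexity pushes toward HTML mode.
    if stats["deep_merges"] > 2:
        reasons.append("Deep merged cells detected in data region")
    if stats["empty_interruptions"] > 2:
        reasons.append("Multiple empty-row interruptions (multi-table layout)")

    if reasons:
        return {"is_complex": True, "recommended_strategy": "html",
                "reasons": reasons}

    # Rule 4: default to Pandas for standard tables.
    reasons.append("No structural complexity detected")
    return {"is_complex": False, "recommended_strategy": "pandas",
            "reasons": reasons}
```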

Path A: Pandas Standard Mode

When: Simple/large tables (most common case)

Strategy: The agent analyzes ONLY the first 20 rows to determine the header position, then uses Pandas to read the full data at native speed.

Workflow:

  1. Sample First 20 Rows

    • Read only the first 20 rows using pd.read_excel(..., nrows=20)
    • Convert to CSV format for analysis
  2. Determine Header Position

    • Examine the sampled rows to identify which row contains the actual column headers
    • Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names
  3. Read Full Data

    • Use pd.read_excel(..., header=<detected_row>) to load complete data
    • The header parameter ensures proper column naming

Token Cost: ~500 tokens (only 20 rows analyzed)
Processing Speed: Very fast (Pandas native speed)
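The header-detection step can be sketched as below. The heuristic shown (first fully populated, all-text row with unique labels) is an assumption, not necessarily the skill's exact logic; the commented pd.read_excel calls show where it would plug in.

```python
def detect_header_row(sample_rows):
    """Return the index of the most likely header row in a small sample."""
    for i, row in enumerate(sample_rows):
        non_empty = [v for v in row if v not in (None, "")]
        # A header row is typically fully populated with unique text labels.
        if (len(non_empty) == len(row)
                and all(isinstance(v, str) for v in row)
                and len(set(row)) == len(row)):
            return i
    return 0  # fall back to the first row

# How it would plug into Path A (requires pandas):
#   sample = pd.read_excel(path, header=None, nrows=20).values.tolist()
#   header = detect_header_row(sample)
#   df = pd.read_excel(path, header=header)
```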

For implementation details, see references/smart_excel_router.py

Path B: HTML Semantic Mode

When: Complex/irregular tables (merged cells, multi-level headers)

Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data understanding the visual layout.

Workflow:

  1. Convert to Semantic HTML

    • Load workbook with openpyxl
    • Build HTML table preserving merged cell spans
    • Use rowspan and colspan attributes to maintain structure
  2. Extract Structured Data

    • Analyze HTML table structure
    • Identify hierarchical headers from merged cells
    • Extract data preserving semantic relationships

Token Cost: Higher (full HTML structure analyzed)
Processing Speed: Slower (semantic extraction)
Use Case: Only for small (<1000 rows), complex files where Pandas would fail
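The conversion step can be sketched as follows. In the real path the grid and merged ranges come from an openpyxl workbook; here they are plain Python values so the rowspan/colspan handling is visible on its own.

```python
def grid_to_html(rows, merged_ranges):
    """rows: list of lists of values; merged_ranges: (r1, c1, r2, c2), 1-based."""
    # Map each merge's top-left cell to its span; mark covered cells to skip.
    spans, covered = {}, set()
    for r1, c1, r2, c2 in merged_ranges:
        spans[(r1, c1)] = (r2 - r1 + 1, c2 - c1 + 1)
        for r in range(r1, r2 + 1):
            for c in range(c1, c2 + 1):
                if (r, c) != (r1, c1):
                    covered.add((r, c))

    out = ["<table>"]
    for r, row in enumerate(rows, start=1):
        cells = []
        for c, value in enumerate(row, start=1):
            if (r, c) in covered:
                continue  # swallowed by a span above or to the left
            rs, cs = spans.get((r, c), (1, 1))
            attrs = ((f' rowspan="{rs}"' if rs > 1 else "")
                     + (f' colspan="{cs}"' if cs > 1 else ""))
            cells.append(f"<td{attrs}>{'' if value is None else value}</td>")
        out.append("<tr>" + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)
```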

For implementation details, see references/smart_excel_router.py

Best Practices

1. Trust the Scout

Always run the complexity analysis before processing. The metadata scan is fast (under one second) and prevents wasted effort on the wrong approach.

2. Respect the Row Count Rule

Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

3. Pandas First for Unknown Files

When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

4. Cache Analysis Results

If processing multiple sheets from the same file, run the analysis once and cache the results.
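A minimal way to follow this advice is to memoize the scout per (file, sheet) pair. The analyze_complexity function below is a stub standing in for the real analysis, with a call counter so the cache's effect is visible.

```python
from functools import lru_cache

CALLS = {"n": 0}  # visible side effect so the cache's behavior can be checked

def analyze_complexity(file_path, sheet_name=None):
    """Stub for the real scout; the genuine version would scan the file."""
    CALLS["n"] += 1
    return {"is_complex": False, "recommended_strategy": "pandas"}

@lru_cache(maxsize=None)
def cached_analysis(file_path, sheet_name=None):
    # Repeated (file, sheet) lookups hit the cache instead of re-scanning.
    return analyze_complexity(file_path, sheet_name)
```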

5. Preserve Original Files

Never modify the original Excel file during analysis or processing.

Troubleshooting

File Cannot Be Opened

  • Symptom: FileNotFoundError or permission errors
  • Causes: Invalid path, file locked by another process, insufficient permissions
  • Solutions:
    • Verify file path is correct and file exists
    • Close the file if open in Excel or another application
    • Check read permissions on the file

Corrupted File Errors

  • Symptom: BadZipFile or InvalidFileException
  • Causes: Incomplete download, file corruption, wrong file extension
  • Solutions:
    • Re-download or obtain fresh copy of the file
    • Verify file is actual Excel format (not CSV with .xlsx extension)
    • Try opening in Excel to confirm file integrity

Memory Issues with Large Files

  • Symptom: MemoryError or system slowdown
  • Causes: File too large for available RAM
  • Solutions:
    • Use read_only=True mode in openpyxl
    • Process CSV files in chunks with pd.read_csv(..., chunksize=...); for Excel, read row ranges via the skiprows/nrows parameters (read_excel has no chunksize)
    • Increase system memory or use machine with more RAM
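For CSV inputs, a chunked pass needs only the standard library (pandas users would reach for pd.read_csv(..., chunksize=...)); a minimal sketch:

```python
import csv
import io
from itertools import islice

def process_in_chunks(lines, chunk_size, handle):
    """Stream a CSV in fixed-size row chunks instead of loading it whole."""
    reader = csv.reader(lines)
    header = next(reader)  # keep the header out of the data chunks
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        handle(header, chunk)

# Usage: tally chunk sizes without materializing the file in memory.
totals = []
data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
process_in_chunks(data, chunk_size=2,
                  handle=lambda hdr, rows: totals.append(len(rows)))
```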

Encoding Problems

  • Symptom: Garbled text or UnicodeDecodeError
  • Causes: Non-UTF8 encoding in source data
  • Solutions:
    • Specify encoding when reading CSV: pd.read_csv(..., encoding='gbk')
    • For .xlsx files, text is stored as UTF-8 inside the file container, so encoding issues usually originate in how the source data was generated

HTML Mode Token Overflow

  • Symptom: Truncated output or API errors
  • Causes: Complex file exceeds token limits despite row count check
  • Solutions:
    • Force Pandas mode even for complex files
    • Split sheet into smaller ranges and process separately
    • Extract only essential columns before HTML conversion

Incorrect Header Detection

  • Symptom: Wrong columns or data shifted
  • Causes: Unusual header patterns not caught by sampling
  • Solutions:
    • Manually specify header row if known
    • Increase sample size beyond 20 rows
    • Use HTML mode for better structure understanding

Dependencies

Required Python packages:

  • openpyxl - Metadata scanning and Excel file manipulation
  • pandas - High-speed data reading and manipulation

Resources

This skill includes:

  • scripts/complexity_analyzer.py - Standalone executable for complexity analysis
  • references/smart_excel_router.py - Complete implementation reference with both processing paths