| name | xlsx |
| description | Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Provides intelligent header detection and semantic column mapping for robust data import. |
| license | Proprietary. See LICENSE.txt for complete terms. |
Intelligence Capabilities (New)
When working with imperfect input files (e.g., user uploads, legacy system exports), apply the following intelligent processing strategies.
1. Intelligent Header Detection
Full Guide: header-detection.md
Never assume the first row is the header. Use the following heuristics:
- Semantic Analysis: Look for row content matching known keywords using Levenshtein distance (e.g., "Telefome" ≈ "Telefone").
- Data Pattern Analysis: The header is typically the last text-heavy row before data-heavy rows (dates, emails, CPFs).
- Context: Headers have high density (few empty cells) and are often below metadata rows (titles, dates).
2. Semantic Column Mapping
Full Guide: column-mapping.md
Map source columns to target fields using specific scoring:
- Semantic (40%): Normalization + Synonyms (e.g., "Zap" -> "Phone").
- Pattern (30%): Data validation samples (e.g., a column matching `\d{11}` -> "CPF").
- Context (30%): Position and frequency of occurrence.
3. Real-World Examples
See: examples.md for handling:
- Metadata headers (Reports with titles)
- Ambiguous columns (Date formats)
- Unknown columns
Requirements for Outputs
All Excel files
Zero Formula Errors
- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)
Preserve Existing Templates (when updating templates)
- Study and EXACTLY match existing format, style, and conventions when modifying files
- Never impose standardized formatting on files with established patterns
- Existing template conventions ALWAYS override these guidelines
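As a sketch of this approach with openpyxl (file name and cell addresses are illustrative), copy an existing cell's styles rather than applying defaults:
# Hypothetical sketch: match an existing template's formatting when adding data
from copy import copy
from openpyxl import load_workbook
wb = load_workbook('template.xlsx')
sheet = wb.active
template_cell = sheet['B5']  # an existing, correctly styled cell
new_cell = sheet['B6']
new_cell.value = 1234
new_cell.font = copy(template_cell.font)  # reuse the template's font
new_cell.fill = copy(template_cell.fill)
new_cell.number_format = template_cell.number_format
new_cell.alignment = copy(template_cell.alignment)
wb.save('template.xlsx')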
Financial models
Color Coding Standards
Unless otherwise specified by the user or an existing template:
Industry-Standard Color Conventions
- Blue text (RGB: 0,0,255): Hardcoded inputs and numbers users will change for scenarios
- Black text (RGB: 0,0,0): ALL formulas and calculations
- Green text (RGB: 0,128,0): Links pulling from other worksheets within same workbook
- Red text (RGB: 255,0,0): External links to other files
- Yellow background (RGB: 255,255,0): Key assumptions needing attention or cells that need to be updated
Number Formatting Standards
Required Format Rules
- Years: Format as text strings (e.g., "2024" not "2,024")
- Currency: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)")
- Zeros: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-")
- Percentages: Default to 0.0% format (one decimal)
- Multiples: Format as 0.0x for valuation multiples (EV/EBITDA, P/E)
- Negative numbers: Use parentheses (123) not minus -123
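As a sketch, these rules map onto openpyxl number formats like so (cell addresses and values are illustrative):
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A1'] = '2024'  # years stored as text strings, avoiding "2,024"
ws['B1'] = 1500000
ws['B1'].number_format = '$#,##0;($#,##0);"-"'  # currency: parentheses for negatives, "-" for zero
ws['C1'] = 0.125
ws['C1'].number_format = '0.0%'  # one-decimal percentage
ws['D1'] = 8.4
ws['D1'].number_format = '0.0"x"'  # valuation multiple displayed as 8.4x
wb.save('formats.xlsx')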
Formula Construction Rules
Assumptions Placement
- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells
- Use cell references instead of hardcoded values in formulas
- Example: Use =B5*(1+$B$6) instead of =B5*1.05
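A minimal sketch of this pattern (cell addresses are illustrative):
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A6'] = 'Growth rate'
ws['B6'] = 0.05  # assumption lives in its own cell
ws['B5'] = 1000  # base-year value
ws['B7'] = '=B5*(1+$B$6)'  # formulas reference the assumption cell...
ws['B8'] = '=B7*(1+$B$6)'  # ...so scenarios update by editing B6 alone
wb.save('model.xlsx')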
Formula Error Prevention
- Verify all cell references are correct
- Check for off-by-one errors in ranges
- Ensure consistent formulas across all projection periods
- Test with edge cases (zero values, negative numbers)
- Verify no unintended circular references
Documentation Requirements for Hardcodes
- Document hardcodes with a cell comment, or in an adjacent cell if at the end of a table. Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]"
- Examples:
- "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]"
- "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]"
- "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity"
- "Source: FactSet, 8/20/2025, Consensus Estimates Screen"
XLSX creation, editing, and analysis
Overview
A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks.
Important Requirements
LibreOffice Required for Formula Recalculation: You can assume LibreOffice is installed for recalculating formula values with the recalc.py script. The script automatically configures LibreOffice on first run.
Reading and analyzing data
Data analysis with pandas
For data analysis, visualization, and basic operations, use pandas which provides powerful data manipulation capabilities:
import pandas as pd
# Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
# Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
# Write Excel
df.to_excel('output.xlsx', index=False)
Excel File Workflows
CRITICAL: Use Formulas, Not Hardcoded Values
Always use Excel formulas instead of calculating values in Python and hardcoding them. This ensures the spreadsheet remains dynamic and updateable.
❌ WRONG - Hardcoding Calculated Values
# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth # Hardcodes 0.15
# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg # Hardcodes 42.5
✅ CORRECT - Using Excel Formulas
# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'
# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'
# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'
This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
Common Workflow
- Choose tool: pandas for data, openpyxl for formulas/formatting
- Create/Load: Create new workbook or load existing file
- Modify: Add/edit data, formulas, and formatting
- Save: Write to file
- Recalculate formulas (MANDATORY IF USING FORMULAS): Use the recalc.py script: `python recalc.py output.xlsx`
- Verify and fix any errors:
  - The script returns JSON with error details
  - If `status` is `errors_found`, check `error_summary` for specific error types and locations
  - Fix the identified errors and recalculate again
- Common errors to fix:
  - `#REF!`: Invalid cell references
  - `#DIV/0!`: Division by zero
  - `#VALUE!`: Wrong data type in formula
  - `#NAME?`: Unrecognized formula name
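A sketch of driving this loop from Python, assuming recalc.py prints its JSON report to stdout:
import json
import subprocess
result = subprocess.run(['python', 'recalc.py', 'output.xlsx'], capture_output=True, text=True)
report = json.loads(result.stdout)  # assumes the JSON report is written to stdout
if report['status'] == 'errors_found':
    print(report['error_summary'])  # fix the listed cells, then rerun recalc.py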
Creating new Excel files
# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])
# Add formula
sheet['B2'] = '=SUM(A1:A10)'
# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')
# Column width
sheet.column_dimensions['A'].width = 20
wb.save('output.xlsx')
Editing existing Excel files
# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook
# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active # or wb['SheetName'] for specific sheet
# Working with multiple sheets
for sheet_name in wb.sheetnames:
sheet = wb[sheet_name]
print(f"Sheet: {sheet_name}")
# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2) # Insert row at position 2
sheet.delete_cols(3) # Delete column 3
# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
Recalculating formulas
Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided recalc.py script to recalculate formulas:
python recalc.py <excel_file> [timeout_seconds]
Example:
python recalc.py output.xlsx 30
The script:
- Automatically sets up LibreOffice macro on first run
- Recalculates all formulas in all sheets
- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.)
- Returns JSON with detailed error locations and counts
- Works on both Linux and macOS
Formula Verification Checklist
Quick checks to ensure formulas work correctly:
Essential Verification
- Test 2-3 sample references: Verify they pull correct values before building full model
- Column mapping: Confirm Excel columns match (e.g., column 64 = BL, not BK)
- Row offset: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6)
Common Pitfalls
- NaN handling: Check for null values with `pd.notna()`
- Far-right columns: FY data often in columns 50+
- Multiple matches: Search all occurrences, not just the first
- Division by zero: Check denominators before using `/` in formulas (#DIV/0!)
- Wrong references: Verify all cell references point to intended cells (#REF!)
- Cross-sheet references: Use the correct format (Sheet1!A1) for linking sheets
Formula Testing Strategy
- Start small: Test formulas on 2-3 cells before applying broadly
- Verify dependencies: Check all cells referenced in formulas exist
- Test edge cases: Include zero, negative, and very large values
Interpreting recalc.py Output
The script returns JSON with error details:
{
"status": "success", // or "errors_found"
"total_errors": 0, // Total error count
"total_formulas": 42, // Number of formulas in file
"error_summary": { // Only present if errors found
"#REF!": {
"count": 2,
"locations": ["Sheet1!B5", "Sheet1!C10"]
}
}
}
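A small sketch that walks this structure to list every failing cell (sample data taken from the example report above):
report = {
    'status': 'errors_found',
    'total_errors': 2,
    'total_formulas': 42,
    'error_summary': {'#REF!': {'count': 2, 'locations': ['Sheet1!B5', 'Sheet1!C10']}},
}
for error_type, details in report.get('error_summary', {}).items():
    for location in details['locations']:
        print(f'{error_type} at {location}')  # e.g., "#REF! at Sheet1!B5"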
Best Practices
Library Selection
- pandas: Best for data analysis, bulk operations, and simple data export
- openpyxl: Best for complex formatting, formulas, and Excel-specific features
Working with openpyxl
- Cell indices are 1-based (row=1, column=1 refers to cell A1)
- Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)`
- Warning: If a file is opened with `data_only=True` and saved, formulas are replaced with values and permanently lost
- For large files: Use `read_only=True` for reading or `write_only=True` for writing
- Formulas are preserved but not evaluated - use recalc.py to update values
Working with pandas
- Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})`
- For large files, read specific columns: `pd.read_excel('file.xlsx', usecols='A,C,E')`
- Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])`
Code Style Guidelines
IMPORTANT: When generating Python code for Excel operations:
- Write minimal, concise Python code without unnecessary comments
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
For Excel files themselves:
- Add comments to cells with complex formulas or important assumptions
- Document data sources for hardcoded values
- Include notes for key calculations and model sections
Intelligent Spreadsheet Import and Analysis
Overview
When importing or analyzing spreadsheets, especially for data import workflows, intelligent detection and mapping capabilities are essential for handling real-world spreadsheets that may have:
- Non-standard layouts (metadata rows, empty rows, merged cells)
- Varied column naming (synonyms, abbreviations, typos)
- Multiple data types requiring pattern recognition
- Contextual relationships between columns
This section provides advanced techniques for intelligent header detection, column mapping, and data pattern analysis.
Intelligent Header Detection
Overview
Header detection is the process of identifying which row in a spreadsheet contains the column names. Real-world spreadsheets often have:
- Title rows or metadata at the top
- Empty separator rows
- Multi-line headers (merged cells)
- Headers that don't start at row 1
Detection Strategy: Multi-Factor Analysis
Use a combination of semantic analysis, pattern recognition, and contextual clues to identify headers with high confidence.
1. Semantic Analysis
String Similarity Algorithms
Use similarity metrics to compare row cells against known header keywords. Implement Levenshtein distance, Jaro-Winkler similarity, and normalization functions to handle:
- Accent removal (á → a)
- Case insensitivity
- Whitespace normalization
- Synonym matching
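A minimal Python sketch of this normalization using only the standard library:
import unicodedata
def normalize_header(header: str) -> str:
    # Lowercase, strip accents (á → a), and collapse whitespace
    text = unicodedata.normalize('NFD', header.lower())
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    return ' '.join(text.split())
assert normalize_header('  Endereço   Completo ') == 'endereco completo'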
Synonym Recognition
Maintain a knowledge base of header synonyms and variations in multiple languages (Portuguese, English) and handle:
- Regional variations ("celular" vs "móvel")
- Common typos ("E-mail" vs "Email")
- Abbreviations ("Tel" = "Telefone")
- Context-dependent names ("Cliente" = "Nome" in CRM context)
2. Pattern-Based Data Type Detection
Analyze sample data values (first 5-10 rows) to infer column types:
- CPF: 11 digits, format XXX.XXX.XXX-XX
- Email: Contains @ and valid domain
- Phone: 10-11 digits, Brazilian formats
- Date: Multiple formats (DD/MM/YYYY, MM/DD/YYYY, Excel serial)
- Monetary: R$, points, commas (Brazilian format)
- CEP: 8 digits, format XXXXX-XXX
Use pattern matching with regex and validation functions to detect types with confidence scores.
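As a sketch, such a pattern table might look like this (regexes simplified and illustrative; a production validator should also verify CPF check digits, as shown later):
import re
PATTERNS = {  # simplified, illustrative regexes
    'cpf':   re.compile(r'^\d{3}\.?\d{3}\.?\d{3}-?\d{2}$'),
    'email': re.compile(r'^[^\s@]+@[^\s@]+\.[^\s@]+$'),
    'phone': re.compile(r'^\(?\d{2}\)?\s?\d{4,5}-?\d{4}$'),  # Brazilian 10-11 digits
    'cep':   re.compile(r'^\d{5}-?\d{3}$'),
}
def guess_type(value: str) -> str:
    for name, pattern in PATTERNS.items():
        if pattern.match(value.strip()):
            return name
    return 'text'
assert guess_type('123.456.789-00') == 'cpf'
assert guess_type('01310-100') == 'cep'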
3. Contextual Analysis
Consider multiple contextual factors:
- Position: Headers usually in first 5 rows
- Density: Headers have many filled cells
- Consistency: Headers are text, data rows vary in type
- Structure: Headers have similar length/format
Multi-Factor Scoring
Calculate header likelihood score using:
- Text ratio (30%): Headers are usually text, not numbers
- Keyword matches (40%): Match against known header keywords
- Fill ratio (20%): Headers usually have many filled cells
- Position bonus (10%): Headers usually in first rows
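A compact sketch of that weighted score (the keyword set is an illustrative subset; normalize_header is the helper sketched above):
HEADER_KEYWORDS = {'nome', 'email', 'telefone', 'cpf'}  # illustrative subset
def header_score(row: list, row_index: int) -> float:
    if not row:
        return 0.0
    cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
    text_cells = [c for c in cells if not c.replace('.', '', 1).isdigit()]
    score = len(text_cells) / len(row) * 30  # text ratio (30%)
    keyword_hits = sum(normalize_header(c) in HEADER_KEYWORDS for c in cells)
    score += min(keyword_hits * 10, 40)      # keyword matches (40%)
    score += len(cells) / len(row) * 20      # fill ratio (20%)
    score += 10 if row_index < 5 else 0      # position bonus (10%)
    return score  # 0-100; scores of 80+ map to the high-confidence band below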
Thresholds and Confidence Levels
- Confidence ≥ 0.8: High confidence - use automatically
- Confidence 0.6-0.8: Medium confidence - show to user for confirmation
- Confidence < 0.6: Low confidence - require user selection
Handling Edge Cases
- Multiple Header Rows: Detect merged cells spanning multiple rows
- Empty Rows: Skip empty separator rows before data
- Metadata Rows: Identify and skip title/metadata rows at top
Intelligent Column Mapping
Overview
Column mapping matches spreadsheet column names to system fields. Real-world spreadsheets use varied naming that requires intelligent matching.
Multi-Factor Scoring System
Use weighted scoring to determine best mapping:
- Semantic Similarity (40%): String similarity and synonym matching
- Data Pattern Match (30%): Type inferred from actual values
- Position Context (15%): Expected position of field
- Frequency/Common Name (15%): Most common names get priority
Expanded Knowledge Base
Maintain comprehensive synonym mappings including:
- Portuguese and English variations
- Regional differences
- Common typos and abbreviations
- Context-dependent synonyms
Example Knowledge Base Structure:
# Comprehensive header knowledge base for student/CRM imports
EXPANDED_HEADER_KNOWLEDGE = {
# Name variations (regional, typos, languages)
'name': [
# Portuguese standard
'nome', 'nome completo', 'aluno', 'cliente', 'pessoa',
'nome do aluno', 'nome do cliente', 'razao social', 'razão social',
# English
'name', 'full name', 'client name', 'student name', 'person', 'fullname',
# Abbreviations
'nm', 'nome compl', 'nome compl.',
# Common typos
'nme', 'nomee', 'nom', 'nomes',
# Context variations
'cadastro', 'registro', 'identificação'
],
# Email variations
'email': [
# Portuguese
'email', 'e-mail', 'correio eletrônico', 'correio eletronico',
'correio', 'endereço de email', 'endereco de email', 'endereço email',
# English
'mail', 'e-mail address', 'email address', 'mail address',
# Abbreviations
        'mail addr', 'email addr',
# Common typos
'emial', 'emal', 'eail', 'emai',
# Variations
        'correio eletr.'
],
# Phone variations
'phone': [
# Portuguese
'telefone', 'celular', 'whatsapp', 'contato', 'telefone celular',
'telefone fixo', 'fone', 'tel', 'telefone residencial',
# English
'phone', 'mobile', 'cell phone', 'telephone', 'contact number',
'mobile phone', 'cell',
# Regional
'móvel', 'movel', 'cel',
# Abbreviations
        'whats', 'wpp',
# Common typos
'telefne', 'telefonee', 'celularr'
],
# CPF variations
'cpf': [
'cpf', 'documento', 'documento cpf', 'cpf/cnpj', 'cpf cnpj',
'identificação', 'identificacao', 'id', 'rg', 'documento de identidade',
'doc', 'document', 'identifier', 'cpf cnpj', 'documento único',
'documento unico', 'cpf documento', 'número cpf', 'numero cpf'
],
# Address variations
'address': [
'endereço', 'endereco', 'rua', 'logradouro', 'street', 'address',
'endereço completo', 'endereco completo', 'full address', 'endereço residencial',
'endereco residencial', 'logradouro completo', 'rua endereço'
],
'addressNumber': [
'número', 'numero', 'nº', 'n°', 'num', 'número do endereço',
'numero do endereco', 'address number', 'house number', 'nro'
],
'complement': [
'complemento', 'complement', 'apto', 'apartamento', 'bloco',
'bloco apto', 'andar', 'sala', 'complemento endereço'
],
'neighborhood': [
'bairro', 'neighborhood', 'distrito', 'bairro residencial',
'bairro endereço', 'zona', 'região', 'regiao'
],
'city': [
'cidade', 'municipio', 'município', 'city', 'municipality',
'munic', 'cid', 'município de residência', 'cidade residência'
],
'state': [
'estado', 'uf', 'state', 'province', 'est', 'uf estado',
'estado uf', 'unidade federativa', 'estado brasileiro'
],
'zipCode': [
'cep', 'código postal', 'codigo postal', 'zip', 'postal code',
'cep código', 'postal', 'cep código postal', 'zip code',
'código postal cep'
],
'country': [
'país', 'pais', 'country', 'nacionalidade', 'país de origem',
'pais de origem', 'country of origin'
],
# Professional fields
'profession': [
'profissão', 'profissao', 'graduação', 'graduacao', 'formação',
'formacao', 'profession', 'graduation', 'formation', 'curso',
'área de atuação', 'area de atuacao', 'especialidade', 'especialização'
],
'hasClinic': [
'tem clínica', 'tem clinica', 'possui clínica', 'possui clinica',
'tem estabelecimento', 'has clinic', 'own clinic', 'clinic owner',
'clínica própria', 'clinica propria'
],
'clinicName': [
'nome clínica', 'nome clinica', 'clínica nome', 'clinica nome',
'nome estabelecimento', 'nome do estabelecimento', 'clinic name',
'estabelecimento', 'nome da clínica'
],
'clinicCity': [
'cidade clínica', 'cidade clinica', 'cidade estabelecimento',
'município clínica', 'municipio clinica', 'clinic city',
'cidade da clínica'
],
# Status and dates
'status': [
'status', 'situação', 'situacao', 'estado', 'condição', 'condicao',
'status aluno', 'situação aluno', 'status do aluno'
],
'birthDate': [
'data nasc', 'data de nascimento', 'nascimento', 'data nascimento',
'data nasc.', 'birth date', 'date of birth', 'dob', 'data nascimento aluno',
        'nasc'
],
'saleDate': [
'data da venda', 'data venda', 'data de venda', 'venda data',
'data contrato', 'data do contrato', 'sale date', 'contract date',
'data assinatura', 'data de assinatura'
],
'startDate': [
'data início', 'data inicio', 'data_inicio', 'start date',
'inicio', 'data de início', 'data de inicio', 'data início curso',
'início curso', 'inicio curso'
],
# Sales and marketing
'salesperson': [
'vendedor', 'vendedora', 'vendedor responsável', 'vendedor responsavel',
'salesperson', 'seller', 'vendedor da venda', 'responsável venda',
'responsavel venda'
],
'leadSource': [
'origem lead', 'origem', 'fonte', 'lead source', 'source',
'origem do lead', 'fonte do lead', 'como conheceu', 'origem cliente',
'canal', 'canal de origem', 'fonte de origem'
],
'cohort': [
'turma', 'cohort', 'turma curso', 'turma do curso', 'grupo',
'turma aluno', 'classe', 'batch', 'lote'
],
# Financial fields
'totalValue': [
'valor total', 'valortotal', 'valor_total', 'total', 'amount',
'valor', 'preço', 'preco', 'price', 'total value', 'total amount',
'valor do curso', 'valor curso', 'valor contrato', 'valor total curso',
'montante', 'valor pago', 'valor investido'
],
'installments': [
'parcelas', 'número de parcelas', 'numero de parcelas',
'qtd parcelas', 'quantidade de parcelas', 'installments',
'parc', 'qtd parc', 'número parcelas', 'numero parcelas',
'total parcelas', 'qtd de parcelas', 'quantidade parcelas'
],
'installmentValue': [
'valor parcela', 'valor_parcela', 'valorparcela', 'installment value',
'valor da parcela', 'valor de parcela', 'parcela valor',
'valor mensal', 'mensalidade', 'valor por parcela'
],
'paymentStatus': [
'status pagamento', 'status_pagamento', 'payment status',
'pagamento', 'status de pagamento', 'situação pagamento',
'situacao pagamento', 'payment', 'status pag', 'situação pag',
'condição pagamento', 'condicao pagamento'
],
'paidInstallments': [
'parcelas pagas', 'parcelas_pagas', 'paid installments',
'parcelas quitadas', 'parcelas pagas total', 'qtd parcelas pagas',
'quantidade parcelas pagas', 'parcelas já pagas', 'parcelas ja pagas'
],
# Professional ID
'professionalId': [
'registro', 'registro profissional', 'coren', 'cro', 'crm', 'crf',
'número registro', 'numero registro', 'registro número',
'professional id', 'professional registration', 'número conselho',
'numero conselho', 'conselho profissional', 'registro conselho'
],
# Contract status
'contractStatus': [
'status contrato', 'status do contrato', 'situação contrato',
        'situacao contrato', 'contract status',
'condição contrato', 'condicao contrato', 'estado contrato'
]
}
from typing import Dict, List, Tuple

def find_best_match(column_header: str, knowledge_base: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """
    Find the best matching schema fields for a column header.
    Returns a list of (field_name, confidence_score) tuples sorted by score.
    Assumes the normalize_header() and jaro_winkler_similarity() helpers
    described under String Similarity Algorithms are available.
    """
normalized_header = normalize_header(column_header)
matches = []
for field_name, synonyms in knowledge_base.items():
best_score = 0.0
for synonym in synonyms:
normalized_synonym = normalize_header(synonym)
# Exact match
if normalized_header == normalized_synonym:
best_score = 1.0
break
# Contains match
elif normalized_synonym in normalized_header or normalized_header in normalized_synonym:
best_score = max(best_score, 0.9)
else:
# Similarity match
similarity = jaro_winkler_similarity(normalized_header, normalized_synonym)
if similarity > 0.7:
best_score = max(best_score, similarity * 0.85)
if best_score > 0.5: # Only include reasonable matches
matches.append((field_name, best_score))
# Sort by score descending
matches.sort(key=lambda x: x[1], reverse=True)
return matches
Intelligent Suggestions
When confidence is low (< 70%), provide multiple mapping options:
- Show top 3-5 suggestions ordered by score
- Include reasoning for each suggestion
- Highlight pattern mismatches (e.g., column named "ID" but contains CPF values)
Data Pattern Analysis
Overview
Pattern analysis identifies data types and formats by examining actual values, not just column names. This is crucial for:
- Validating mappings
- Detecting errors
- Normalizing data formats
- Providing user feedback
Pattern Recognition
Define comprehensive patterns for common data types:
- CPF: Brazilian tax ID with validation algorithm
- Email: Standard email format validation
- Phone (Brazilian): 10-11 digits with formatting
- Date: Multiple format support (DD/MM/YYYY, Excel serial)
- Monetary (Brazilian): R$ format with comma decimal separator
- CEP: Brazilian postal code (8 digits)
- UF: Brazilian state abbreviations (2 letters)
Validation and Normalization
For each pattern type:
- Regex matching: Initial pattern detection
- Validator function: Domain-specific validation (e.g., CPF check digits)
- Normalizer function: Convert to standard format
- Formatter function: Display in user-friendly format
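For example, a sketch of the CPF validator/normalizer pair using the standard check-digit algorithm:
import re
def normalize_cpf(value: str) -> str:
    return re.sub(r'\D', '', value)  # '123.456.789-00' → '12345678900'
def validate_cpf(value: str) -> bool:
    cpf = normalize_cpf(value)
    if len(cpf) != 11 or cpf == cpf[0] * 11:  # wrong length or all same digits
        return False
    for pos in (9, 10):  # verify both check digits
        total = sum(int(d) * w for d, w in zip(cpf[:pos], range(pos + 1, 1, -1)))
        if (total * 10) % 11 % 10 != int(cpf[pos]):
            return False
    return True
assert not validate_cpf('111.111.111-11')  # rejected: all same digits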
Pattern Detection Process
- Sample first 10-20 rows of data
- Test each value against pattern regexes
- Run validator functions for matched patterns
- Calculate match ratio and confidence
- Return best match with confidence score
Integration with Existing Code
Integration with xlsx-helper.ts
Enhance detectHeaderRow() function in src/lib/xlsx-helper.ts:
Current Implementation:
// Current basic implementation
export function detectHeaderRow(rows: unknown[][], maxRowsToScan = 15): HeaderDetectionResult {
// Basic keyword matching and scoring
}
Enhanced Implementation:
// Enhanced with multi-factor analysis
export function detectHeaderRowEnhanced(
rows: unknown[][],
maxRowsToScan = 15
): HeaderDetectionResult {
const candidates: Array<{
rowIndex: number;
score: number;
headers: string[];
}> = [];
// Factor 1: Text ratio (headers are usually text)
const calculateTextRatio = (row: unknown[]): number => {
const textCells = row.filter(
cell => cell && !String(cell).trim().match(/^\d+([.,]\d+)?$/)
);
return textCells.length / Math.max(row.length, 1);
};
// Factor 2: Keyword matches
const countKeywordMatches = (row: unknown[]): number => {
const normalizedRow = row.map(cell =>
normalizeHeader(String(cell || ''))
);
return normalizedRow.filter(cell =>
HEADER_KEYWORDS.some(kw => cell.includes(kw) || kw.includes(cell))
).length;
};
// Factor 3: Fill ratio
const calculateFillRatio = (row: unknown[]): number => {
const nonEmpty = row.filter(cell => cell && String(cell).trim());
return nonEmpty.length / Math.max(row.length, 1);
};
// Factor 4: Pattern consistency (compare with next row)
const checkPatternConsistency = (
row: unknown[],
nextRow: unknown[],
index: number
): number => {
if (index + 1 >= rows.length) return 0;
const rowTypes = row.map(cell => typeof cell);
const nextTypes = nextRow.map(cell => typeof cell);
// Headers should have different types than data
const typeOverlap = rowTypes.filter((t, i) => t === nextTypes[i]).length;
return typeOverlap < rowTypes.length / 2 ? 5 : 0;
};
// Calculate scores for each candidate row
for (let i = 0; i < Math.min(rows.length, maxRowsToScan); i++) {
const row = rows[i];
if (!row || row.length === 0) continue;
let score = 0;
// Factor 1: Text ratio (30 points max)
score += calculateTextRatio(row) * 30;
// Factor 2: Keyword matches (40 points max)
score += Math.min(countKeywordMatches(row) * 10, 40);
// Factor 3: Fill ratio (20 points max)
score += calculateFillRatio(row) * 20;
// Factor 4: Position bonus (10 points)
if (i < 5) score += 10;
// Factor 5: Pattern consistency (5 points)
score += checkPatternConsistency(row, rows[i + 1] || [], i);
const headers = row
.map(h => String(h ?? '').trim())
.filter(Boolean);
if (headers.length >= 2) {
candidates.push({
rowIndex: i,
score,
headers,
});
}
}
if (candidates.length === 0) {
return {
headerRowIndex: 0,
confidence: 0,
headers: rows[0]?.map(h => String(h ?? '').trim()).filter(Boolean) || [],
candidates: [],
};
}
candidates.sort((a, b) => b.score - a.score);
const best = candidates[0];
// Calculate confidence
let confidence: number;
if (candidates.length === 1) {
confidence = Math.min(best.score / 100, 1);
} else {
const scoreDiff = best.score - candidates[1].score;
const relativeConfidence = scoreDiff / Math.max(best.score, 1);
const absoluteConfidence = best.score / 100;
confidence = Math.min((relativeConfidence + absoluteConfidence) / 2, 1);
}
return {
headerRowIndex: best.rowIndex,
confidence: Math.round(confidence * 100) / 100,
headers: best.headers,
candidates: candidates.slice(0, 5),
};
}
// Helper function for header normalization
function normalizeHeader(header: string): string {
return header
.toLowerCase()
.normalize('NFD')
.replace(/[\u0300-\u036f]/g, '') // Remove accents
.trim()
.replace(/\s+/g, ' ');
}
Integration with csv-validator.ts
Enhance mapCSVHeaders() function:
Current Implementation:
// Current static mapping
export function mapCSVHeaders(headers: string[]): Record<string, string> {
const mapping: Record<string, string> = {};
for (const header of headers) {
const normalized = header.trim().toLowerCase();
if (HEADER_MAP[normalized]) {
mapping[header] = HEADER_MAP[normalized];
}
}
return mapping;
}
Enhanced Implementation:
// Enhanced with pattern analysis and fuzzy matching
export interface ColumnMappingResult {
mapping: Record<string, string>;
confidence: Record<string, number>;
suggestions: Record<string, Array<{ field: string; score: number; reason: string }>>;
}
export function mapCSVHeadersIntelligent(
headers: string[],
sampleRows: Record<string, unknown>[],
maxSuggestions = 3
): ColumnMappingResult {
const mapping: Record<string, string> = {};
const confidence: Record<string, number> = {};
const suggestions: Record<string, Array<{ field: string; score: number; reason: string }>> = {};
for (const header of headers) {
// Get column data
const columnData = sampleRows
.map(row => row[header])
.filter(v => v !== undefined && v !== null)
.slice(0, 10); // Sample first 10 rows
// Detect pattern
const pattern = detectColumnPattern(columnData);
// Find best matches
const matches = findBestSchemaMatches(header, pattern, headers.indexOf(header));
if (matches.length > 0 && matches[0].score >= 0.5) {
mapping[header] = matches[0].field;
confidence[header] = matches[0].score;
}
// Store suggestions for low confidence matches
if (matches.length > 0 && matches[0].score < 0.7) {
suggestions[header] = matches.slice(0, maxSuggestions).map(m => ({
field: m.field,
score: m.score,
reason: m.reason,
}));
}
}
return { mapping, confidence, suggestions };
}
// Pattern detection function
function detectColumnPattern(values: unknown[]): {
type: string;
confidence: number;
matchRatio: number;
} {
if (values.length === 0) {
return { type: 'empty', confidence: 0, matchRatio: 0 };
}
// CPF pattern
const cpfPattern = /^\d{3}[.\-]?\d{3}[.\-]?\d{3}[.\-]?\d{2}$/;
const cpfMatches = values.filter(v => cpfPattern.test(String(v).replace(/\s/g, '')));
if (cpfMatches.length / values.length > 0.7) {
return {
type: 'cpf',
confidence: cpfMatches.length / values.length,
matchRatio: cpfMatches.length / values.length,
};
}
// Email pattern
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const emailMatches = values.filter(v => emailPattern.test(String(v).toLowerCase()));
if (emailMatches.length / values.length > 0.7) {
return {
type: 'email',
confidence: emailMatches.length / values.length,
matchRatio: emailMatches.length / values.length,
};
}
// Phone pattern (Brazilian)
const phonePattern = /^[\d\s\(\)\-\+]{10,15}$/;
const phoneMatches = values.filter(v => {
const digits = String(v).replace(/\D/g, '');
return phonePattern.test(String(v)) && digits.length >= 10 && digits.length <= 11;
});
if (phoneMatches.length / values.length > 0.7) {
return {
type: 'phone',
confidence: phoneMatches.length / values.length,
matchRatio: phoneMatches.length / values.length,
};
}
return { type: 'text', confidence: 0.3, matchRatio: 0 };
}
// Find best schema matches with scoring
function findBestSchemaMatches(
header: string,
pattern: { type: string; confidence: number },
columnIndex: number
): Array<{ field: string; score: number; reason: string }> {
const normalizedHeader = normalizeHeader(header);
const matches: Array<{ field: string; score: number; reason: string }> = [];
// Check against expanded knowledge base
for (const [field, synonyms] of Object.entries(EXPANDED_HEADER_KNOWLEDGE)) {
let semanticScore = 0;
let reason = '';
// Check synonyms
for (const synonym of synonyms) {
const normalizedSynonym = normalizeHeader(synonym);
if (normalizedHeader === normalizedSynonym) {
semanticScore = 1.0;
reason = `Exact match with '${synonym}'`;
break;
} else if (normalizedSynonym.includes(normalizedHeader) ||
normalizedHeader.includes(normalizedSynonym)) {
semanticScore = Math.max(semanticScore, 0.9);
reason = `Contains keyword '${synonym}'`;
} else {
// Fuzzy match
const similarity = jaroWinklerSimilarity(normalizedHeader, normalizedSynonym);
if (similarity > 0.7) {
semanticScore = Math.max(semanticScore, similarity * 0.85);
reason = `Similar to '${synonym}' (${Math.round(similarity * 100)}% similarity)`;
}
}
}
// Pattern match bonus
const patternScore = getPatternMatchScore(field, pattern.type);
const totalScore = semanticScore * 0.4 + patternScore * 0.3 +
getPositionScore(field, columnIndex) * 0.15 +
getFrequencyScore(field, normalizedHeader) * 0.15;
if (totalScore >= 0.3) {
matches.push({
field,
score: totalScore,
reason: reason || `Pattern matches: ${pattern.type}`,
});
}
}
return matches.sort((a, b) => b.score - a.score);
}
// Jaro-Winkler similarity (simplified)
function jaroWinklerSimilarity(s1: string, s2: string): number {
// Use a library like 'string-similarity' in production
// Simplified version here
if (s1 === s2) return 1.0;
if (s1.length === 0 || s2.length === 0) return 0.0;
// Simple character overlap
const longer = s1.length > s2.length ? s1 : s2;
const shorter = s1.length > s2.length ? s2 : s1;
const matches = shorter.split('').filter(c => longer.includes(c)).length;
return matches / Math.max(s1.length, s2.length);
}
Recommended Enhancements
- Add pattern detection: Create a `detectColumnPattern()` function in `xlsx-helper.ts`
- Expand knowledge base: Add all synonyms from `EXPANDED_HEADER_KNOWLEDGE` to `HEADER_MAP` in `csv-validator.ts`
- Add validation feedback: Show pattern detection results in UI components
- Highlight mismatches: Alert when header name doesn't match data pattern
- Add confidence indicators: Display confidence scores in mapping UI
- Store user corrections: Learn from user mappings to improve future suggestions
Practical Examples
Example 1: Detecting Header in Spreadsheet with Metadata
Spreadsheet Structure:
Row 0: "Relatório de Alunos - 2024"
Row 1: (empty)
Row 2: "Nome" | "E-mail" | "Telefone" | "CPF"
Row 3: "João Silva" | "joao@email.com" | "11999999999" | "123.456.789-00"
Row 4: "Maria Santos" | "maria@email.com" | "21988888888" | "987.654.321-00"
Detection Process:
rows = [
["Relatório de Alunos - 2024"],
[],
["Nome", "E-mail", "Telefone", "CPF"],
["João Silva", "joao@email.com", "11999999999", "123.456.789-00"],
["Maria Santos", "maria@email.com", "21988888888", "987.654.321-00"]
]
result = detect_header_row(rows, max_rows_to_scan=15)
# Result:
# {
# 'header_row_index': 2,
# 'confidence': 0.85,
# 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF'],
# 'candidates': [
# {'row_index': 2, 'score': 85.0, 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF']},
# {'row_index': 0, 'score': 15.0, 'headers': ['Relatório de Alunos - 2024']}
# ]
# }
Scoring Breakdown for Row 2:
- Text ratio: 100% (all cells are text) → 30 points
- Keyword matches: 4 matches (nome, email, telefone, cpf) → 40 points
- Fill ratio: 100% (all cells filled) → 20 points
- Position bonus: Row 2 (< 5) → 10 points
- Total: 100 points → confidence 0.85 after normalizing against the next-best candidate (score 15)
Mappings with Pattern Analysis:
# Analyze each column
mappings = map_columns_intelligently(
headers=['Nome', 'E-mail', 'Telefone', 'CPF'],
rows=rows,
header_row_index=2
)
# Results:
# {
# 'Nome': ('name', 0.95), # Exact keyword match + text pattern
# 'E-mail': ('email', 0.90), # Synonym match + email pattern
# 'Telefone': ('phone', 0.92), # Keyword match + phone pattern
# 'CPF': ('cpf', 0.88) # Keyword match + CPF pattern
# }
Example 2: Ambiguous Column Name
Column: "Contato"
Sample Data:
"Contato"
"11999999999"
"21988888888"
"(11) 3333-4444"
Analysis Process:
column_header = "Contato"
column_data = ["11999999999", "21988888888", "(11) 3333-4444"]
column_index = 2
# Step 1: Semantic analysis
semantic_score = find_synonym_match("Contato", "phone")
# Result: 0.85 (contato ≈ telefone/phone)
# Step 2: Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'phone', 'confidence': 0.90, 'format': 'brazilian'}
# Step 3: Position context
position_score = 0.70 # Column 2, expected phone position 2
# Step 4: Calculate total score
total_score = (
semantic_score * 0.40 + # 0.85 * 0.40 = 0.34
pattern['confidence'] * 0.30 + # 0.90 * 0.30 = 0.27
position_score * 0.15 + # 0.70 * 0.15 = 0.105
0.6 * 0.15 # Frequency score = 0.09
)
# Total: 0.805 → Map to 'phone' with 80.5% confidence
Result: Map to phone (0.805 total score)
Alternative Scenario - If pattern was emails:
column_data = ["joao@email.com", "maria@email.com", "pedro@email.com"]
pattern = detect_data_pattern(column_data)
# Result: {'type': 'email', 'confidence': 0.95}
# Semantic score for email: 0.60 (contato can mean contact/email)
total_score = 0.60 * 0.40 + 0.95 * 0.30 + 0.70 * 0.15 + 0.4 * 0.15
# Total: 0.69 → Map to 'email' with 69% confidence
Example 3: Type Detection by Pattern (Header Name Mismatch)
Column: "ID" Values:
"ID"
"123.456.789-00"
"987.654.321-00"
"111.222.333-44"
Analysis:
column_header = "ID"
column_data = ["123.456.789-00", "987.654.321-00", "111.222.333-44"]
# Step 1: Header name suggests numeric ID
# But pattern analysis reveals CPF format
pattern = detect_data_pattern(column_data)
# Result: {
# 'type': 'cpf',
# 'confidence': 0.95,
# 'match_ratio': 1.0,
# 'format': 'brazilian_tax_id'
# }
# Step 2: Validate CPF format
for value in column_data:
is_valid = validate_cpf(value)
# All values pass CPF validation
# Step 3: Mapping decision
# Header name: "ID" → suggests 'id' field (numeric)
# But pattern: CPF → suggests 'cpf' field
# Pattern confidence (0.95) > Semantic confidence (0.30)
# Decision: Map to 'cpf' based on pattern evidence
Result: Map to cpf (0.95 confidence), not numeric id
User Feedback:
⚠️ Column "ID" appears to contain CPF values (confidence: 95%)
Suggested mapping: CPF
Reason: Data pattern matches CPF format, not numeric ID
Example 4: Low Confidence - Multiple Suggestions
Column: "Dados"
Sample Data:
"Dados"
"João Silva"
"Maria Santos"
"Pedro Oliveira"
Analysis:
column_header = "Dados"
column_data = ["João Silva", "Maria Santos", "Pedro Oliveira"]
column_index = 0
# Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'text', 'confidence': 0.5}
# Find all potential matches
matches = find_best_match("Dados", EXPANDED_HEADER_KNOWLEDGE)
# Results:
# [
# ('name', 0.45), # Position suggests name, text pattern
# ('email', 0.35), # Could be email if pattern matched
# ('phone', 0.25), # Low match
# ]
# Generate suggestions with reasoning
suggestions = [
{
'field': 'name',
'score': 0.45,
'confidence': 'medium',
'reasons': [
'Position suggests name field (first column)',
'Text pattern matches name format',
'Values appear to be full names'
]
},
{
'field': 'email',
'score': 0.35,
'confidence': 'low',
'reasons': [
'Could be email if pattern matched',
'Position sometimes used for email'
]
},
{
'field': '_skip',
'score': 0.30,
'confidence': 'low',
'reasons': [
'Unclear mapping - consider skipping',
'May be metadata or notes column'
]
}
]
UI Action: Show dropdown with suggestions for user selection
TypeScript Implementation:
// In mapping UI component
const suggestions = getMappingSuggestions("Dados", columnData, 0);
// Display in UI
<Select value={mapping["Dados"]} onValueChange={handleChange}>
<SelectTrigger>
<SelectValue placeholder="Select mapping..." />
</SelectTrigger>
<SelectContent>
{suggestions.map(suggestion => (
<SelectItem key={suggestion.field} value={suggestion.field}>
{suggestion.field} ({Math.round(suggestion.score * 100)}% confidence)
<div className="text-xs text-muted-foreground">
{suggestion.reasons[0]}
</div>
</SelectItem>
))}
</SelectContent>
</Select>
Example 5: Regional Variations
Column: "Móvel" (Portuguese for mobile/phone)
Sample Data:
"Móvel"
"11999999999"
"21988888888"
"11987654321"
Analysis:
column_header = "Móvel"
column_data = ["11999999999", "21988888888", "11987654321"]
# Step 1: Semantic analysis with accent handling
normalized_header = normalize_header("Móvel") # "movel"
synonyms = EXPANDED_HEADER_KNOWLEDGE['phone']
# ['telefone', 'celular', 'móvel', 'movel', 'whatsapp', ...]
# Check synonym chain: móvel → movel → celular → telefone → phone
semantic_score = 0.88 # Strong match through synonym chain
# Step 2: Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'phone', 'confidence': 0.92, 'format': 'brazilian'}
# Step 3: Total score
total_score = 0.88 * 0.40 + 0.92 * 0.30 + 0.70 * 0.15 + 0.8 * 0.15
# Total: 0.90 → Map to 'phone' with 90% confidence
Result: Map to phone (0.90 confidence)
Key Learning: Accent normalization is crucial for Portuguese text matching.
Example 6: Handling Multiple Data Blocks
Spreadsheet Structure:
Row 0: "Relatório de Vendas - Q1 2024"
Row 1: (empty)
Row 2: "Nome" | "Email" | "Telefone"
Row 3: "João" | "joao@..." | "11999..."
Row 4: "Maria" | "maria@..." | "21988..."
Row 5: (empty)
Row 6: "Resumo" | "Total" | "Média"
Row 7: "Vendas" | "15000" | "7500"
Detection Strategy:
# First pass: Detect main data header
result1 = detect_header_row(rows[:10])
# Detects row 2 as header (confidence: 0.85)
# Second pass: Detect summary section
result2 = detect_header_row(rows[5:10])
# Detects row 6 as summary header (confidence: 0.75)
# Handle separately
main_data = extract_data_with_headers(rows, result1.header_row_index)
summary_data = extract_data_with_headers(rows[5:], result2.header_row_index)
Example 7: Merged Cells / Multi-line Headers
Spreadsheet Structure:
Row 0: "Dados Pessoais" (merged across columns A-D)
Row 1: "Nome" | "Email" | "Telefone" | "CPF"
Row 2: "João" | "joao@..." | "11999..." | "123..."
Detection:
# Detect multi-line header
def detect_multiline_header(rows, start_row):
# Check if next row also looks like header
current_row = rows[start_row]
next_row = rows[start_row + 1] if start_row + 1 < len(rows) else []
# If both rows are mostly text and similar structure
if is_header_like(current_row) and is_header_like(next_row):
return start_row + 1 # Header spans 2 rows
return start_row
# Result: Header spans rows 0-1, data starts at row 2
header_row_index = 1 # Use second row as actual column headers
Example 8: Error Detection and Validation
Column: "CPF" Values:
"CPF"
"123.456.789-00" ✓ Valid
"987.654.321-00" ✓ Valid
"111.111.111-11" ✗ Invalid (all same digits)
"123.456.789" ✗ Invalid (missing check digits)
Validation Process:
column_data = [
"123.456.789-00",
"987.654.321-00",
"111.111.111-11",
"123.456.789"
]
validation_results = []
for i, value in enumerate(column_data):
is_valid, normalized, error = validate_and_normalize_value(value, 'cpf')
validation_results.append({
'row': i + 2, # +2 for header row and 0-index
'value': value,
'valid': is_valid,
'normalized': normalized,
'error': error
})
# Results:
# [
# {'row': 2, 'valid': True, 'normalized': '12345678900', 'error': None},
# {'row': 3, 'valid': True, 'normalized': '98765432100', 'error': None},
# {'row': 4, 'valid': False, 'normalized': None, 'error': 'CPF with all same digits'},
# {'row': 5, 'valid': False, 'normalized': None, 'error': 'CPF missing check digits'}
# ]
# User feedback:
# ⚠️ Row 4: Invalid CPF - all same digits
# ⚠️ Row 5: Invalid CPF - missing check digits
Example 9: Learning from User Corrections
Scenario: User corrects mapping "Contato" → email instead of suggested phone
Learning System:
// Store user correction
interface UserCorrection {
columnHeader: string;
suggestedField: string;
userSelectedField: string;
columnData: unknown[];
timestamp: number;
}
// Learn from correction
function learnFromCorrection(correction: UserCorrection) {
// Analyze why user chose different field
const pattern = detectColumnPattern(correction.columnData);
// If pattern matches user's choice better, update knowledge base
if (pattern.type === 'email' && correction.userSelectedField === 'email') {
// Add "contato" as email synonym in specific contexts
updateKnowledgeBase('email', {
context: 'when_pattern_is_email',
synonyms: [...existingSynonyms, 'contato']
});
}
// Track for future suggestions
trackCorrection(correction);
}
Future Improvement: Next time "Contato" with email pattern appears, suggest email first.
Best Practices Summary
- Always use multi-factor analysis - Don't rely on single method
- Sample data for pattern detection - Use first 10-20 rows, not all data
- Provide fallbacks - When confidence is low, show suggestions
- Validate patterns - Use domain-specific validators (CPF, CEP, etc.)
- Normalize early - Clean and normalize data during detection
- User feedback - Show confidence scores and reasoning
- Learn from corrections - Track user corrections to improve future mappings
- Handle edge cases - Empty rows, merged cells, multiple data blocks
Performance Considerations
- Limit scanning: Only scan first 15 rows for headers
- Sample data: Use first 10-20 rows for pattern detection
- Cache results: Store detection results to avoid re-computation
- Lazy evaluation: Only analyze columns that need mapping
- Parallel processing: Analyze multiple columns simultaneously if possible
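A sketch combining the sampling and caching points (reusing the guess_type() sketch from the pattern section; the cache key must be a hashable tuple):
from functools import lru_cache
SAMPLE_SIZE = 20  # pattern detection only needs the first 10-20 rows
@lru_cache(maxsize=256)
def cached_pattern(sample: tuple) -> str:
    guesses = [guess_type(str(v)) for v in sample if v is not None]
    return max(set(guesses), key=guesses.count) if guesses else 'empty'
def detect_column(values: list) -> str:
    return cached_pattern(tuple(values[:SAMPLE_SIZE]))  # tuples are hashable, so results cache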