| name | xlsx |
| description | Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. Provides intelligent header detection and semantic column mapping for robust data import. |
| license | Proprietary. See LICENSE.txt for complete terms. |
Intelligence Capabilities (New)
When working with imperfect input files (e.g., user uploads, legacy system exports), apply the following intelligent processing strategies.
1. Intelligent Header Detection
Full Guide: header-detection.md
Never assume the first row is the header. Use the following heuristics:
- Semantic Analysis: Look for row content matching known keywords using Levenshtein distance (e.g., "Telefome" ≈ "Telefone").
- Data Pattern Analysis: The header is typically the last text-heavy row before data-heavy rows (dates, emails, CPFs).
- Context: Headers have high density (few empty cells) and are often below metadata rows (titles, dates).
2. Semantic Column Mapping
Full Guide: column-mapping.md
Map source columns to target fields using specific scoring:
- Semantic (40%): Normalization + Synonyms (e.g., "Zap" -> "Phone").
- Pattern (30%): Data validation samples (e.g., a column matching `\d{11}` -> "CPF").
- Context (30%): Position and frequency of occurrence.
3. Real-World Examples
See: examples.md for handling:
- Metadata headers (Reports with titles)
- Ambiguous columns (Date formats)
- Unknown columns
Requirements for Outputs
All Excel files
Zero Formula Errors
- Every Excel model MUST be delivered with ZERO formula errors (#REF!, #DIV/0!, #VALUE!, #N/A, #NAME?)
Preserve Existing Templates (when updating templates)
- Study and EXACTLY match existing format, style, and conventions when modifying files
- Never impose standardized formatting on files with established patterns
- Existing template conventions ALWAYS override these guidelines
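As a sketch of this approach with openpyxl (file name and cell addresses are illustrative), copy an existing cell's styles rather than applying defaults:
# Hypothetical sketch: match an existing template's formatting when adding data
from copy import copy
from openpyxl import load_workbook
wb = load_workbook('template.xlsx')
sheet = wb.active
template_cell = sheet['B5']  # an existing, correctly styled cell
new_cell = sheet['B6']
new_cell.value = 1234
new_cell.font = copy(template_cell.font)  # reuse the template's font
new_cell.fill = copy(template_cell.fill)
new_cell.number_format = template_cell.number_format
new_cell.alignment = copy(template_cell.alignment)
wb.save('template.xlsx')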
Financial models
Color Coding Standards
Unless otherwise specified by the user or an existing template:
Industry-Standard Color Conventions
- Blue text (RGB: 0,0,255): Hardcoded inputs and numbers users will change for scenarios
- Black text (RGB: 0,0,0): ALL formulas and calculations
- Green text (RGB: 0,128,0): Links pulling from other worksheets within same workbook
- Red text (RGB: 255,0,0): External links to other files
- Yellow background (RGB: 255,255,0): Key assumptions needing attention or cells that need to be updated
Number Formatting Standards
Required Format Rules
- Years: Format as text strings (e.g., "2024" not "2,024")
- Currency: Use $#,##0 format; ALWAYS specify units in headers ("Revenue ($mm)")
- Zeros: Use number formatting to make all zeros "-", including percentages (e.g., "$#,##0;($#,##0);-")
- Percentages: Default to 0.0% format (one decimal)
- Multiples: Format as 0.0x for valuation multiples (EV/EBITDA, P/E)
- Negative numbers: Use parentheses (123) not minus -123
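As a sketch, these rules map onto openpyxl number formats like so (cell addresses and values are illustrative):
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A1'] = '2024'  # years stored as text strings, avoiding "2,024"
ws['B1'] = 1500000
ws['B1'].number_format = '$#,##0;($#,##0);"-"'  # currency: parentheses for negatives, "-" for zero
ws['C1'] = 0.125
ws['C1'].number_format = '0.0%'  # one-decimal percentage
ws['D1'] = 8.4
ws['D1'].number_format = '0.0"x"'  # valuation multiple displayed as 8.4x
wb.save('formats.xlsx')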
Formula Construction Rules
Assumptions Placement
- Place ALL assumptions (growth rates, margins, multiples, etc.) in separate assumption cells
- Use cell references instead of hardcoded values in formulas
- Example: Use =B5*(1+$B$6) instead of =B5*1.05
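A minimal sketch of this pattern (cell addresses are illustrative):
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws['A6'] = 'Growth rate'
ws['B6'] = 0.05  # assumption lives in its own cell
ws['B5'] = 1000  # base-year value
ws['B7'] = '=B5*(1+$B$6)'  # formulas reference the assumption cell...
ws['B8'] = '=B7*(1+$B$6)'  # ...so scenarios update by editing B6 alone
wb.save('model.xlsx')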
Formula Error Prevention
- Verify all cell references are correct
- Check for off-by-one errors in ranges
- Ensure consistent formulas across all projection periods
- Test with edge cases (zero values, negative numbers)
- Verify no unintended circular references
Documentation Requirements for Hardcodes
- Document hardcodes with a cell comment, or in an adjacent cell if at the end of a table. Format: "Source: [System/Document], [Date], [Specific Reference], [URL if applicable]"
- Examples:
- "Source: Company 10-K, FY2024, Page 45, Revenue Note, [SEC EDGAR URL]"
- "Source: Company 10-Q, Q2 2025, Exhibit 99.1, [SEC EDGAR URL]"
- "Source: Bloomberg Terminal, 8/15/2025, AAPL US Equity"
- "Source: FactSet, 8/20/2025, Consensus Estimates Screen"
XLSX creation, editing, and analysis
Overview
A user may ask you to create, edit, or analyze the contents of an .xlsx file. You have different tools and workflows available for different tasks.
Important Requirements
LibreOffice Required for Formula Recalculation: You can assume LibreOffice is installed for recalculating formula values with the recalc.py script. The script automatically configures LibreOffice on first run.
Reading and analyzing data
Data analysis with pandas
For data analysis, visualization, and basic operations, use pandas which provides powerful data manipulation capabilities:
import pandas as pd
# Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
# Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
# Write Excel
df.to_excel('output.xlsx', index=False)
Excel File Workflows
CRITICAL: Use Formulas, Not Hardcoded Values
Always use Excel formulas instead of calculating values in Python and hardcoding them. This ensures the spreadsheet remains dynamic and updateable.
❌ WRONG - Hardcoding Calculated Values
# Bad: Calculating in Python and hardcoding result
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
# Bad: Computing growth rate in Python
growth = (df.iloc[-1]['Revenue'] - df.iloc[0]['Revenue']) / df.iloc[0]['Revenue']
sheet['C5'] = growth # Hardcodes 0.15
# Bad: Python calculation for average
avg = sum(values) / len(values)
sheet['D20'] = avg # Hardcodes 42.5
✅ CORRECT - Using Excel Formulas
# Good: Let Excel calculate the sum
sheet['B10'] = '=SUM(B2:B9)'
# Good: Growth rate as Excel formula
sheet['C5'] = '=(C4-C2)/C2'
# Good: Average using Excel function
sheet['D20'] = '=AVERAGE(D2:D19)'
This applies to ALL calculations - totals, percentages, ratios, differences, etc. The spreadsheet should be able to recalculate when source data changes.
Common Workflow
- Choose tool: pandas for data, openpyxl for formulas/formatting
- Create/Load: Create new workbook or load existing file
- Modify: Add/edit data, formulas, and formatting
- Save: Write to file
- Recalculate formulas (MANDATORY IF USING FORMULAS): Use the recalc.py script: `python recalc.py output.xlsx`
- Verify and fix any errors:
  - The script returns JSON with error details
  - If `status` is `errors_found`, check `error_summary` for specific error types and locations
  - Fix the identified errors and recalculate again
- Common errors to fix:
  - `#REF!`: Invalid cell references
  - `#DIV/0!`: Division by zero
  - `#VALUE!`: Wrong data type in formula
  - `#NAME?`: Unrecognized formula name
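A sketch of driving this loop from Python, assuming recalc.py prints its JSON report to stdout:
import json
import subprocess
result = subprocess.run(['python', 'recalc.py', 'output.xlsx'], capture_output=True, text=True)
report = json.loads(result.stdout)  # assumes the JSON report is written to stdout
if report['status'] == 'errors_found':
    print(report['error_summary'])  # fix the listed cells, then rerun recalc.py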
Creating new Excel files
# Using openpyxl for formulas and formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active
# Add data
sheet['A1'] = 'Hello'
sheet['B1'] = 'World'
sheet.append(['Row', 'of', 'data'])
# Add formula
sheet['B2'] = '=SUM(A1:A10)'
# Formatting
sheet['A1'].font = Font(bold=True, color='FF0000')
sheet['A1'].fill = PatternFill('solid', start_color='FFFF00')
sheet['A1'].alignment = Alignment(horizontal='center')
# Column width
sheet.column_dimensions['A'].width = 20
wb.save('output.xlsx')
Editing existing Excel files
# Using openpyxl to preserve formulas and formatting
from openpyxl import load_workbook
# Load existing file
wb = load_workbook('existing.xlsx')
sheet = wb.active # or wb['SheetName'] for specific sheet
# Working with multiple sheets
for sheet_name in wb.sheetnames:
sheet = wb[sheet_name]
print(f"Sheet: {sheet_name}")
# Modify cells
sheet['A1'] = 'New Value'
sheet.insert_rows(2) # Insert row at position 2
sheet.delete_cols(3) # Delete column 3
# Add new sheet
new_sheet = wb.create_sheet('NewSheet')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
Recalculating formulas
Excel files created or modified by openpyxl contain formulas as strings but not calculated values. Use the provided recalc.py script to recalculate formulas:
python recalc.py <excel_file> [timeout_seconds]
Example:
python recalc.py output.xlsx 30
The script:
- Automatically sets up LibreOffice macro on first run
- Recalculates all formulas in all sheets
- Scans ALL cells for Excel errors (#REF!, #DIV/0!, etc.)
- Returns JSON with detailed error locations and counts
- Works on both Linux and macOS
Formula Verification Checklist
Quick checks to ensure formulas work correctly:
Essential Verification
- Test 2-3 sample references: Verify they pull correct values before building full model
- Column mapping: Confirm Excel columns match (e.g., column 64 = BL, not BK)
- Row offset: Remember Excel rows are 1-indexed (DataFrame row 5 = Excel row 6)
Common Pitfalls
- NaN handling: Check for null values with `pd.notna()`
- Far-right columns: FY data often in columns 50+
- Multiple matches: Search all occurrences, not just the first
- Division by zero: Check denominators before using `/` in formulas (#DIV/0!)
- Wrong references: Verify all cell references point to intended cells (#REF!)
- Cross-sheet references: Use the correct format (Sheet1!A1) for linking sheets
Formula Testing Strategy
- Start small: Test formulas on 2-3 cells before applying broadly
- Verify dependencies: Check all cells referenced in formulas exist
- Test edge cases: Include zero, negative, and very large values
Interpreting recalc.py Output
The script returns JSON with error details:
{
"status": "success", // or "errors_found"
"total_errors": 0, // Total error count
"total_formulas": 42, // Number of formulas in file
"error_summary": { // Only present if errors found
"#REF!": {
"count": 2,
"locations": ["Sheet1!B5", "Sheet1!C10"]
}
}
}
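A small sketch that walks this structure to list every failing cell (sample data taken from the example report above):
report = {
    'status': 'errors_found',
    'total_errors': 2,
    'total_formulas': 42,
    'error_summary': {'#REF!': {'count': 2, 'locations': ['Sheet1!B5', 'Sheet1!C10']}},
}
for error_type, details in report.get('error_summary', {}).items():
    for location in details['locations']:
        print(f'{error_type} at {location}')  # e.g., "#REF! at Sheet1!B5"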
Best Practices
Library Selection
- pandas: Best for data analysis, bulk operations, and simple data export
- openpyxl: Best for complex formatting, formulas, and Excel-specific features
Working with openpyxl
- Cell indices are 1-based (row=1, column=1 refers to cell A1)
- Use `data_only=True` to read calculated values: `load_workbook('file.xlsx', data_only=True)`
- Warning: If a file is opened with `data_only=True` and saved, formulas are replaced with values and permanently lost
- For large files: Use `read_only=True` for reading or `write_only=True` for writing
- Formulas are preserved but not evaluated - use recalc.py to update values
Working with pandas
- Specify data types to avoid inference issues: `pd.read_excel('file.xlsx', dtype={'id': str})`
- For large files, read specific columns: `pd.read_excel('file.xlsx', usecols='A,C,E')`
- Handle dates properly: `pd.read_excel('file.xlsx', parse_dates=['date_column'])`
Code Style Guidelines
IMPORTANT: When generating Python code for Excel operations:
- Write minimal, concise Python code without unnecessary comments
- Avoid verbose variable names and redundant operations
- Avoid unnecessary print statements
For Excel files themselves:
- Add comments to cells with complex formulas or important assumptions
- Document data sources for hardcoded values
- Include notes for key calculations and model sections
Intelligent Spreadsheet Import and Analysis
Overview
When importing or analyzing spreadsheets, especially for data import workflows, intelligent detection and mapping capabilities are essential for handling real-world spreadsheets that may have:
- Non-standard layouts (metadata rows, empty rows, merged cells)
- Varied column naming (synonyms, abbreviations, typos)
- Multiple data types requiring pattern recognition
- Contextual relationships between columns
This section provides advanced techniques for intelligent header detection, column mapping, and data pattern analysis.
Intelligent Header Detection
Overview
Header detection is the process of identifying which row in a spreadsheet contains the column names. Real-world spreadsheets often have:
- Title rows or metadata at the top
- Empty separator rows
- Multi-line headers (merged cells)
- Headers that don't start at row 1
Detection Strategy: Multi-Factor Analysis
Use a combination of semantic analysis, pattern recognition, and contextual clues to identify headers with high confidence.
1. Semantic Analysis
String Similarity Algorithms
Use similarity metrics to compare row cells against known header keywords. Implement Levenshtein distance, Jaro-Winkler similarity, and normalization functions to handle:
- Accent removal (á → a)
- Case insensitivity
- Whitespace normalization
- Synonym matching
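A minimal Python sketch of this normalization using only the standard library:
import unicodedata
def normalize_header(header: str) -> str:
    # Lowercase, strip accents (á → a), and collapse whitespace
    text = unicodedata.normalize('NFD', header.lower())
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    return ' '.join(text.split())
assert normalize_header('  Endereço   Completo ') == 'endereco completo'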
Synonym Recognition
Maintain a knowledge base of header synonyms and variations in multiple languages (Portuguese, English) and handle:
- Regional variations ("celular" vs "móvel")
- Common typos ("E-mail" vs "Email")
- Abbreviations ("Tel" = "Telefone")
- Context-dependent names ("Cliente" = "Nome" in CRM context)
2. Pattern-Based Data Type Detection
Analyze sample data values (first 5-10 rows) to infer column types:
- CPF: 11 digits, format XXX.XXX.XXX-XX
- Email: Contains @ and valid domain
- Phone: 10-11 digits, Brazilian formats
- Date: Multiple formats (DD/MM/YYYY, MM/DD/YYYY, Excel serial)
- Monetary: R$, points, commas (Brazilian format)
- CEP: 8 digits, format XXXXX-XXX
Use pattern matching with regex and validation functions to detect types with confidence scores.
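As a sketch, such a pattern table might look like this (regexes simplified and illustrative; a production validator should also verify CPF check digits, as shown later):
import re
PATTERNS = {  # simplified, illustrative regexes
    'cpf':   re.compile(r'^\d{3}\.?\d{3}\.?\d{3}-?\d{2}$'),
    'email': re.compile(r'^[^\s@]+@[^\s@]+\.[^\s@]+$'),
    'phone': re.compile(r'^\(?\d{2}\)?\s?\d{4,5}-?\d{4}$'),  # Brazilian 10-11 digits
    'cep':   re.compile(r'^\d{5}-?\d{3}$'),
}
def guess_type(value: str) -> str:
    for name, pattern in PATTERNS.items():
        if pattern.match(value.strip()):
            return name
    return 'text'
assert guess_type('123.456.789-00') == 'cpf'
assert guess_type('01310-100') == 'cep'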
3. Contextual Analysis
Consider multiple contextual factors:
- Position: Headers usually in first 5 rows
- Density: Headers have many filled cells
- Consistency: Headers are text, data rows vary in type
- Structure: Headers have similar length/format
Multi-Factor Scoring
Calculate header likelihood score using:
- Text ratio (30%): Headers are usually text, not numbers
- Keyword matches (40%): Match against known header keywords
- Fill ratio (20%): Headers usually have many filled cells
- Position bonus (10%): Headers usually in first rows
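A compact sketch of that weighted score (the keyword set is an illustrative subset; normalize_header is the helper sketched above):
HEADER_KEYWORDS = {'nome', 'email', 'telefone', 'cpf'}  # illustrative subset
def header_score(row: list, row_index: int) -> float:
    if not row:
        return 0.0
    cells = [str(c).strip() for c in row if c is not None and str(c).strip()]
    text_cells = [c for c in cells if not c.replace('.', '', 1).isdigit()]
    score = len(text_cells) / len(row) * 30  # text ratio (30%)
    keyword_hits = sum(normalize_header(c) in HEADER_KEYWORDS for c in cells)
    score += min(keyword_hits * 10, 40)      # keyword matches (40%)
    score += len(cells) / len(row) * 20      # fill ratio (20%)
    score += 10 if row_index < 5 else 0      # position bonus (10%)
    return score  # 0-100; scores of 80+ map to the high-confidence band below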
Thresholds and Confidence Levels
- Confidence ≥ 0.8: High confidence - use automatically
- Confidence 0.6-0.8: Medium confidence - show to user for confirmation
- Confidence < 0.6: Low confidence - require user selection
Handling Edge Cases
- Multiple Header Rows: Detect merged cells spanning multiple rows
- Empty Rows: Skip empty separator rows before data
- Metadata Rows: Identify and skip title/metadata rows at top
Intelligent Column Mapping
Overview
Column mapping matches spreadsheet column names to system fields. Real-world spreadsheets use varied naming that requires intelligent matching.
Multi-Factor Scoring System
Use weighted scoring to determine best mapping:
- Semantic Similarity (40%): String similarity and synonym matching
- Data Pattern Match (30%): Type inferred from actual values
- Position Context (15%): Expected position of field
- Frequency/Common Name (15%): Most common names get priority
Expanded Knowledge Base
Maintain comprehensive synonym mappings including:
- Portuguese and English variations
- Regional differences
- Common typos and abbreviations
- Context-dependent synonyms
Example Knowledge Base Structure:
# Comprehensive header knowledge base for student/CRM imports
EXPANDED_HEADER_KNOWLEDGE = {
# Name variations (regional, typos, languages)
'name': [
# Portuguese standard
'nome', 'nome completo', 'aluno', 'cliente', 'pessoa',
'nome do aluno', 'nome do cliente', 'razao social', 'razão social',
# English
'name', 'full name', 'client name', 'student name', 'person', 'fullname',
# Abbreviations
'nm', 'nome compl', 'nome compl.',
# Common typos
'nme', 'nomee', 'nom', 'nomes',
# Context variations
'cadastro', 'registro', 'identificação'
],
# Email variations
'email': [
# Portuguese
'email', 'e-mail', 'correio eletrônico', 'correio eletronico',
'correio', 'endereço de email', 'endereco de email', 'endereço email',
# English
'mail', 'e-mail address', 'email address', 'mail address',
# Abbreviations
        'mail addr', 'email addr',
# Common typos
'emial', 'emal', 'eail', 'emai',
# Variations
        'correio eletr.'
],
# Phone variations
'phone': [
# Portuguese
'telefone', 'celular', 'whatsapp', 'contato', 'telefone celular',
'telefone fixo', 'fone', 'tel', 'telefone residencial',
# English
'phone', 'mobile', 'cell phone', 'telephone', 'contact number',
'mobile phone', 'cell',
# Regional
'móvel', 'movel', 'cel',
# Abbreviations
        'whats', 'wpp',
# Common typos
'telefne', 'telefonee', 'celularr'
],
# CPF variations
'cpf': [
'cpf', 'documento', 'documento cpf', 'cpf/cnpj', 'cpf cnpj',
'identificação', 'identificacao', 'id', 'rg', 'documento de identidade',
'doc', 'document', 'identifier', 'cpf cnpj', 'documento único',
'documento unico', 'cpf documento', 'número cpf', 'numero cpf'
],
# Address variations
'address': [
'endereço', 'endereco', 'rua', 'logradouro', 'street', 'address',
'endereço completo', 'endereco completo', 'full address', 'endereço residencial',
'endereco residencial', 'logradouro completo', 'rua endereço'
],
'addressNumber': [
'número', 'numero', 'nº', 'n°', 'num', 'número do endereço',
'numero do endereco', 'address number', 'house number', 'nro'
],
'complement': [
'complemento', 'complement', 'apto', 'apartamento', 'bloco',
'bloco apto', 'andar', 'sala', 'complemento endereço'
],
'neighborhood': [
'bairro', 'neighborhood', 'distrito', 'bairro residencial',
'bairro endereço', 'zona', 'região', 'regiao'
],
'city': [
'cidade', 'municipio', 'município', 'city', 'municipality',
'munic', 'cid', 'município de residência', 'cidade residência'
],
'state': [
'estado', 'uf', 'state', 'province', 'est', 'uf estado',
'estado uf', 'unidade federativa', 'estado brasileiro'
],
'zipCode': [
'cep', 'código postal', 'codigo postal', 'zip', 'postal code',
'cep código', 'postal', 'cep código postal', 'zip code',
'código postal cep'
],
'country': [
'país', 'pais', 'country', 'nacionalidade', 'país de origem',
'pais de origem', 'country of origin'
],
# Professional fields
'profession': [
'profissão', 'profissao', 'graduação', 'graduacao', 'formação',
'formacao', 'profession', 'graduation', 'formation', 'curso',
'área de atuação', 'area de atuacao', 'especialidade', 'especialização'
],
'hasClinic': [
'tem clínica', 'tem clinica', 'possui clínica', 'possui clinica',
'tem estabelecimento', 'has clinic', 'own clinic', 'clinic owner',
'clínica própria', 'clinica propria'
],
'clinicName': [
'nome clínica', 'nome clinica', 'clínica nome', 'clinica nome',
'nome estabelecimento', 'nome do estabelecimento', 'clinic name',
'estabelecimento', 'nome da clínica'
],
'clinicCity': [
'cidade clínica', 'cidade clinica', 'cidade estabelecimento',
'município clínica', 'municipio clinica', 'clinic city',
'cidade da clínica'
],
# Status and dates
'status': [
'status', 'situação', 'situacao', 'estado', 'condição', 'condicao',
'status aluno', 'situação aluno', 'status do aluno'
],
'birthDate': [
'data nasc', 'data de nascimento', 'nascimento', 'data nascimento',
'data nasc.', 'birth date', 'date of birth', 'dob', 'data nascimento aluno',
        'nasc'
],
'saleDate': [
'data da venda', 'data venda', 'data de venda', 'venda data',
'data contrato', 'data do contrato', 'sale date', 'contract date',
'data assinatura', 'data de assinatura'
],
'startDate': [
'data início', 'data inicio', 'data_inicio', 'start date',
'inicio', 'data de início', 'data de inicio', 'data início curso',
'início curso', 'inicio curso'
],
# Sales and marketing
'salesperson': [
'vendedor', 'vendedora', 'vendedor responsável', 'vendedor responsavel',
'salesperson', 'seller', 'vendedor da venda', 'responsável venda',
'responsavel venda'
],
'leadSource': [
'origem lead', 'origem', 'fonte', 'lead source', 'source',
'origem do lead', 'fonte do lead', 'como conheceu', 'origem cliente',
'canal', 'canal de origem', 'fonte de origem'
],
'cohort': [
'turma', 'cohort', 'turma curso', 'turma do curso', 'grupo',
'turma aluno', 'classe', 'batch', 'lote'
],
# Financial fields
'totalValue': [
'valor total', 'valortotal', 'valor_total', 'total', 'amount',
'valor', 'preço', 'preco', 'price', 'total value', 'total amount',
'valor do curso', 'valor curso', 'valor contrato', 'valor total curso',
'montante', 'valor pago', 'valor investido'
],
'installments': [
'parcelas', 'número de parcelas', 'numero de parcelas',
'qtd parcelas', 'quantidade de parcelas', 'installments',
'parc', 'qtd parc', 'número parcelas', 'numero parcelas',
'total parcelas', 'qtd de parcelas', 'quantidade parcelas'
],
'installmentValue': [
'valor parcela', 'valor_parcela', 'valorparcela', 'installment value',
'valor da parcela', 'valor de parcela', 'parcela valor',
'valor mensal', 'mensalidade', 'valor por parcela'
],
'paymentStatus': [
'status pagamento', 'status_pagamento', 'payment status',
'pagamento', 'status de pagamento', 'situação pagamento',
'situacao pagamento', 'payment', 'status pag', 'situação pag',
'condição pagamento', 'condicao pagamento'
],
'paidInstallments': [
'parcelas pagas', 'parcelas_pagas', 'paid installments',
'parcelas quitadas', 'parcelas pagas total', 'qtd parcelas pagas',
'quantidade parcelas pagas', 'parcelas já pagas', 'parcelas ja pagas'
],
# Professional ID
'professionalId': [
'registro', 'registro profissional', 'coren', 'cro', 'crm', 'crf',
'número registro', 'numero registro', 'registro número',
'professional id', 'professional registration', 'número conselho',
'numero conselho', 'conselho profissional', 'registro conselho'
],
# Contract status
'contractStatus': [
'status contrato', 'status do contrato', 'situação contrato',
        'situacao contrato', 'contract status',
'condição contrato', 'condicao contrato', 'estado contrato'
]
}
from typing import Dict, List, Tuple

def find_best_match(column_header: str, knowledge_base: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """
    Find the best matching schema fields for a column header.
    Returns a list of (field_name, confidence_score) tuples sorted by score.
    Assumes the normalize_header() and jaro_winkler_similarity() helpers
    described under String Similarity Algorithms are available.
    """
normalized_header = normalize_header(column_header)
matches = []
for field_name, synonyms in knowledge_base.items():
best_score = 0.0
for synonym in synonyms:
normalized_synonym = normalize_header(synonym)
# Exact match
if normalized_header == normalized_synonym:
best_score = 1.0
break
# Contains match
elif normalized_synonym in normalized_header or normalized_header in normalized_synonym:
best_score = max(best_score, 0.9)
else:
# Similarity match
similarity = jaro_winkler_similarity(normalized_header, normalized_synonym)
if similarity > 0.7:
best_score = max(best_score, similarity * 0.85)
if best_score > 0.5: # Only include reasonable matches
matches.append((field_name, best_score))
# Sort by score descending
matches.sort(key=lambda x: x[1], reverse=True)
return matches
Intelligent Suggestions
When confidence is low (< 70%), provide multiple mapping options:
- Show top 3-5 suggestions ordered by score
- Include reasoning for each suggestion
- Highlight pattern mismatches (e.g., column named "ID" but contains CPF values)
Data Pattern Analysis
Overview
Pattern analysis identifies data types and formats by examining actual values, not just column names. This is crucial for:
- Validating mappings
- Detecting errors
- Normalizing data formats
- Providing user feedback
Pattern Recognition
Define comprehensive patterns for common data types:
- CPF: Brazilian tax ID with validation algorithm
- Email: Standard email format validation
- Phone (Brazilian): 10-11 digits with formatting
- Date: Multiple format support (DD/MM/YYYY, Excel serial)
- Monetary (Brazilian): R$ format with comma decimal separator
- CEP: Brazilian postal code (8 digits)
- UF: Brazilian state abbreviations (2 letters)
Validation and Normalization
For each pattern type:
- Regex matching: Initial pattern detection
- Validator function: Domain-specific validation (e.g., CPF check digits)
- Normalizer function: Convert to standard format
- Formatter function: Display in user-friendly format
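For example, a sketch of the CPF validator/normalizer pair using the standard check-digit algorithm:
import re
def normalize_cpf(value: str) -> str:
    return re.sub(r'\D', '', value)  # '123.456.789-00' → '12345678900'
def validate_cpf(value: str) -> bool:
    cpf = normalize_cpf(value)
    if len(cpf) != 11 or cpf == cpf[0] * 11:  # wrong length or all same digits
        return False
    for pos in (9, 10):  # verify both check digits
        total = sum(int(d) * w for d, w in zip(cpf[:pos], range(pos + 1, 1, -1)))
        if (total * 10) % 11 % 10 != int(cpf[pos]):
            return False
    return True
assert not validate_cpf('111.111.111-11')  # rejected: all same digits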
Pattern Detection Process
- Sample first 10-20 rows of data
- Test each value against pattern regexes
- Run validator functions for matched patterns
- Calculate match ratio and confidence
- Return best match with confidence score
Integration with Existing Code
Integration with xlsx-helper.ts
Enhance detectHeaderRow() function in src/lib/xlsx-helper.ts:
Current Implementation:
// Current basic implementation
export function detectHeaderRow(rows: unknown[][], maxRowsToScan = 15): HeaderDetectionResult {
// Basic keyword matching and scoring
}
Enhanced Implementation:
// Enhanced with multi-factor analysis
export function detectHeaderRowEnhanced(
rows: unknown[][],
maxRowsToScan = 15
): HeaderDetectionResult {
const candidates: Array<{
rowIndex: number;
score: number;
headers: string[];
}> = [];
// Factor 1: Text ratio (headers are usually text)
const calculateTextRatio = (row: unknown[]): number => {
const textCells = row.filter(
cell => cell && !String(cell).trim().match(/^\d+([.,]\d+)?$/)
);
return textCells.length / Math.max(row.length, 1);
};
// Factor 2: Keyword matches
const countKeywordMatches = (row: unknown[]): number => {
const normalizedRow = row.map(cell =>
normalizeHeader(String(cell || ''))
);
return normalizedRow.filter(cell =>
HEADER_KEYWORDS.some(kw => cell.includes(kw) || kw.includes(cell))
).length;
};
// Factor 3: Fill ratio
const calculateFillRatio = (row: unknown[]): number => {
const nonEmpty = row.filter(cell => cell && String(cell).trim());
return nonEmpty.length / Math.max(row.length, 1);
};
// Factor 4: Pattern consistency (compare with next row)
const checkPatternConsistency = (
row: unknown[],
nextRow: unknown[],
index: number
): number => {
if (index + 1 >= rows.length) return 0;
const rowTypes = row.map(cell => typeof cell);
const nextTypes = nextRow.map(cell => typeof cell);
// Headers should have different types than data
const typeOverlap = rowTypes.filter((t, i) => t === nextTypes[i]).length;
return typeOverlap < rowTypes.length / 2 ? 5 : 0;
};
// Calculate scores for each candidate row
for (let i = 0; i < Math.min(rows.length, maxRowsToScan); i++) {
const row = rows[i];
if (!row || row.length === 0) continue;
let score = 0;
// Factor 1: Text ratio (30 points max)
score += calculateTextRatio(row) * 30;
// Factor 2: Keyword matches (40 points max)
score += Math.min(countKeywordMatches(row) * 10, 40);
// Factor 3: Fill ratio (20 points max)
score += calculateFillRatio(row) * 20;
// Factor 4: Position bonus (10 points)
if (i < 5) score += 10;
// Factor 5: Pattern consistency (5 points)
score += checkPatternConsistency(row, rows[i + 1] || [], i);
const headers = row
.map(h => String(h ?? '').trim())
.filter(Boolean);
if (headers.length >= 2) {
candidates.push({
rowIndex: i,
score,
headers,
});
}
}
if (candidates.length === 0) {
return {
headerRowIndex: 0,
confidence: 0,
headers: rows[0]?.map(h => String(h ?? '').trim()).filter(Boolean) || [],
candidates: [],
};
}
candidates.sort((a, b) => b.score - a.score);
const best = candidates[0];
// Calculate confidence
let confidence: number;
if (candidates.length === 1) {
confidence = Math.min(best.score / 100, 1);
} else {
const scoreDiff = best.score - candidates[1].score;
const relativeConfidence = scoreDiff / Math.max(best.score, 1);
const absoluteConfidence = best.score / 100;
confidence = Math.min((relativeConfidence + absoluteConfidence) / 2, 1);
}
return {
headerRowIndex: best.rowIndex,
confidence: Math.round(confidence * 100) / 100,
headers: best.headers,
candidates: candidates.slice(0, 5),
};
}
// Helper function for header normalization
function normalizeHeader(header: string): string {
return header
.toLowerCase()
.normalize('NFD')
.replace(/[\u0300-\u036f]/g, '') // Remove accents
.trim()
.replace(/\s+/g, ' ');
}
Integration with csv-validator.ts
Enhance mapCSVHeaders() function:
Current Implementation:
// Current static mapping
export function mapCSVHeaders(headers: string[]): Record<string, string> {
const mapping: Record<string, string> = {};
for (const header of headers) {
const normalized = header.trim().toLowerCase();
if (HEADER_MAP[normalized]) {
mapping[header] = HEADER_MAP[normalized];
}
}
return mapping;
}
Enhanced Implementation:
// Enhanced with pattern analysis and fuzzy matching
export interface ColumnMappingResult {
mapping: Record<string, string>;
confidence: Record<string, number>;
suggestions: Record<string, Array<{ field: string; score: number; reason: string }>>;
}
export function mapCSVHeadersIntelligent(
headers: string[],
sampleRows: Record<string, unknown>[],
maxSuggestions = 3
): ColumnMappingResult {
const mapping: Record<string, string> = {};
const confidence: Record<string, number> = {};
const suggestions: Record<string, Array<{ field: string; score: number; reason: string }>> = {};
for (const header of headers) {
// Get column data
const columnData = sampleRows
.map(row => row[header])
.filter(v => v !== undefined && v !== null)
.slice(0, 10); // Sample first 10 rows
// Detect pattern
const pattern = detectColumnPattern(columnData);
// Find best matches
const matches = findBestSchemaMatches(header, pattern, headers.indexOf(header));
if (matches.length > 0 && matches[0].score >= 0.5) {
mapping[header] = matches[0].field;
confidence[header] = matches[0].score;
}
// Store suggestions for low confidence matches
if (matches.length > 0 && matches[0].score < 0.7) {
suggestions[header] = matches.slice(0, maxSuggestions).map(m => ({
field: m.field,
score: m.score,
reason: m.reason,
}));
}
}
return { mapping, confidence, suggestions };
}
// Pattern detection function
function detectColumnPattern(values: unknown[]): {
type: string;
confidence: number;
matchRatio: number;
} {
if (values.length === 0) {
return { type: 'empty', confidence: 0, matchRatio: 0 };
}
// CPF pattern
const cpfPattern = /^\d{3}[.\-]?\d{3}[.\-]?\d{3}[.\-]?\d{2}$/;
const cpfMatches = values.filter(v => cpfPattern.test(String(v).replace(/\s/g, '')));
if (cpfMatches.length / values.length > 0.7) {
return {
type: 'cpf',
confidence: cpfMatches.length / values.length,
matchRatio: cpfMatches.length / values.length,
};
}
// Email pattern
const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const emailMatches = values.filter(v => emailPattern.test(String(v).toLowerCase()));
if (emailMatches.length / values.length > 0.7) {
return {
type: 'email',
confidence: emailMatches.length / values.length,
matchRatio: emailMatches.length / values.length,
};
}
// Phone pattern (Brazilian)
const phonePattern = /^[\d\s\(\)\-\+]{10,15}$/;
const phoneMatches = values.filter(v => {
const digits = String(v).replace(/\D/g, '');
return phonePattern.test(String(v)) && digits.length >= 10 && digits.length <= 11;
});
if (phoneMatches.length / values.length > 0.7) {
return {
type: 'phone',
confidence: phoneMatches.length / values.length,
matchRatio: phoneMatches.length / values.length,
};
}
return { type: 'text', confidence: 0.3, matchRatio: 0 };
}
// Find best schema matches with scoring
function findBestSchemaMatches(
header: string,
pattern: { type: string; confidence: number },
columnIndex: number
): Array<{ field: string; score: number; reason: string }> {
const normalizedHeader = normalizeHeader(header);
const matches: Array<{ field: string; score: number; reason: string }> = [];
// Check against expanded knowledge base
for (const [field, synonyms] of Object.entries(EXPANDED_HEADER_KNOWLEDGE)) {
let semanticScore = 0;
let reason = '';
// Check synonyms
for (const synonym of synonyms) {
const normalizedSynonym = normalizeHeader(synonym);
if (normalizedHeader === normalizedSynonym) {
semanticScore = 1.0;
reason = `Exact match with '${synonym}'`;
break;
} else if (normalizedSynonym.includes(normalizedHeader) ||
normalizedHeader.includes(normalizedSynonym)) {
semanticScore = Math.max(semanticScore, 0.9);
reason = `Contains keyword '${synonym}'`;
} else {
// Fuzzy match
const similarity = jaroWinklerSimilarity(normalizedHeader, normalizedSynonym);
if (similarity > 0.7) {
semanticScore = Math.max(semanticScore, similarity * 0.85);
reason = `Similar to '${synonym}' (${Math.round(similarity * 100)}% similarity)`;
}
}
}
// Pattern match bonus
const patternScore = getPatternMatchScore(field, pattern.type);
const totalScore = semanticScore * 0.4 + patternScore * 0.3 +
getPositionScore(field, columnIndex) * 0.15 +
getFrequencyScore(field, normalizedHeader) * 0.15;
if (totalScore >= 0.3) {
matches.push({
field,
score: totalScore,
reason: reason || `Pattern matches: ${pattern.type}`,
});
}
}
return matches.sort((a, b) => b.score - a.score);
}
// Jaro-Winkler similarity (simplified)
function jaroWinklerSimilarity(s1: string, s2: string): number {
// Use a library like 'string-similarity' in production
// Simplified version here
if (s1 === s2) return 1.0;
if (s1.length === 0 || s2.length === 0) return 0.0;
// Simple character overlap
const longer = s1.length > s2.length ? s1 : s2;
const shorter = s1.length > s2.length ? s2 : s1;
const matches = shorter.split('').filter(c => longer.includes(c)).length;
return matches / Math.max(s1.length, s2.length);
}
Recommended Enhancements
- Add pattern detection: Create a `detectColumnPattern()` function in `xlsx-helper.ts`
- Expand knowledge base: Add all synonyms from `EXPANDED_HEADER_KNOWLEDGE` to `HEADER_MAP` in `csv-validator.ts`
- Add validation feedback: Show pattern detection results in UI components
- Highlight mismatches: Alert when header name doesn't match data pattern
- Add confidence indicators: Display confidence scores in mapping UI
- Store user corrections: Learn from user mappings to improve future suggestions
Practical Examples
Example 1: Detecting Header in Spreadsheet with Metadata
Spreadsheet Structure:
Row 0: "Relatório de Alunos - 2024"
Row 1: (empty)
Row 2: "Nome" | "E-mail" | "Telefone" | "CPF"
Row 3: "João Silva" | "joao@email.com" | "11999999999" | "123.456.789-00"
Row 4: "Maria Santos" | "maria@email.com" | "21988888888" | "987.654.321-00"
Detection Process:
rows = [
["Relatório de Alunos - 2024"],
[],
["Nome", "E-mail", "Telefone", "CPF"],
["João Silva", "joao@email.com", "11999999999", "123.456.789-00"],
["Maria Santos", "maria@email.com", "21988888888", "987.654.321-00"]
]
result = detect_header_row(rows, max_rows_to_scan=15)
# Result:
# {
# 'header_row_index': 2,
# 'confidence': 0.85,
# 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF'],
# 'candidates': [
# {'row_index': 2, 'score': 85.0, 'headers': ['Nome', 'E-mail', 'Telefone', 'CPF']},
# {'row_index': 0, 'score': 15.0, 'headers': ['Relatório de Alunos - 2024']}
# ]
# }
Scoring Breakdown for Row 2:
- Text ratio: 100% (all cells are text) → 30 points
- Keyword matches: 4 matches (nome, email, telefone, cpf) → 40 points
- Fill ratio: 100% (all cells filled) → 20 points
- Position bonus: Row 2 (< 5) → 10 points
- Total: 100 points → confidence 0.85 after normalizing against the next-best candidate (score 15)
Mappings with Pattern Analysis:
# Analyze each column
mappings = map_columns_intelligently(
headers=['Nome', 'E-mail', 'Telefone', 'CPF'],
rows=rows,
header_row_index=2
)
# Results:
# {
# 'Nome': ('name', 0.95), # Exact keyword match + text pattern
# 'E-mail': ('email', 0.90), # Synonym match + email pattern
# 'Telefone': ('phone', 0.92), # Keyword match + phone pattern
# 'CPF': ('cpf', 0.88) # Keyword match + CPF pattern
# }
Example 2: Ambiguous Column Name
Column: "Contato"
Sample Data:
"Contato"
"11999999999"
"21988888888"
"(11) 3333-4444"
Analysis Process:
column_header = "Contato"
column_data = ["11999999999", "21988888888", "(11) 3333-4444"]
column_index = 2
# Step 1: Semantic analysis
semantic_score = find_synonym_match("Contato", "phone")
# Result: 0.85 (contato ≈ telefone/phone)
# Step 2: Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'phone', 'confidence': 0.90, 'format': 'brazilian'}
# Step 3: Position context
position_score = 0.70 # Column 2, expected phone position 2
# Step 4: Calculate total score
total_score = (
semantic_score * 0.40 + # 0.85 * 0.40 = 0.34
pattern['confidence'] * 0.30 + # 0.90 * 0.30 = 0.27
position_score * 0.15 + # 0.70 * 0.15 = 0.105
0.6 * 0.15 # Frequency score = 0.09
)
# Total: 0.805 → Map to 'phone' with 80.5% confidence
Result: Map to phone (0.805 total score)
Alternative Scenario - If pattern was emails:
column_data = ["joao@email.com", "maria@email.com", "pedro@email.com"]
pattern = detect_data_pattern(column_data)
# Result: {'type': 'email', 'confidence': 0.95}
# Semantic score for email: 0.60 (contato can mean contact/email)
total_score = 0.60 * 0.40 + 0.95 * 0.30 + 0.70 * 0.15 + 0.4 * 0.15
# Total: 0.69 → Map to 'email' with 69% confidence
Example 3: Type Detection by Pattern (Header Name Mismatch)
Column: "ID" Values:
"ID"
"123.456.789-00"
"987.654.321-00"
"111.222.333-44"
Analysis:
column_header = "ID"
column_data = ["123.456.789-00", "987.654.321-00", "111.222.333-44"]
# Step 1: Header name suggests numeric ID
# But pattern analysis reveals CPF format
pattern = detect_data_pattern(column_data)
# Result: {
# 'type': 'cpf',
# 'confidence': 0.95,
# 'match_ratio': 1.0,
# 'format': 'brazilian_tax_id'
# }
# Step 2: Validate CPF format
for value in column_data:
is_valid = validate_cpf(value)
# All values pass CPF validation
# Step 3: Mapping decision
# Header name: "ID" → suggests 'id' field (numeric)
# But pattern: CPF → suggests 'cpf' field
# Pattern confidence (0.95) > Semantic confidence (0.30)
# Decision: Map to 'cpf' based on pattern evidence
Result: Map to cpf (0.95 confidence), not numeric id
User Feedback:
⚠️ Column "ID" appears to contain CPF values (confidence: 95%)
Suggested mapping: CPF
Reason: Data pattern matches CPF format, not numeric ID
Example 4: Low Confidence - Multiple Suggestions
Column: "Dados"
Sample Data:
"Dados"
"João Silva"
"Maria Santos"
"Pedro Oliveira"
Analysis:
column_header = "Dados"
column_data = ["João Silva", "Maria Santos", "Pedro Oliveira"]
column_index = 0
# Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'text', 'confidence': 0.5}
# Find all potential matches
matches = find_best_match("Dados", EXPANDED_HEADER_KNOWLEDGE)
# Results:
# [
# ('name', 0.45), # Position suggests name, text pattern
# ('email', 0.35), # Could be email if pattern matched
# ('phone', 0.25), # Low match
# ]
# Generate suggestions with reasoning
suggestions = [
{
'field': 'name',
'score': 0.45,
'confidence': 'medium',
'reasons': [
'Position suggests name field (first column)',
'Text pattern matches name format',
'Values appear to be full names'
]
},
{
'field': 'email',
'score': 0.35,
'confidence': 'low',
'reasons': [
'Could be email if pattern matched',
'Position sometimes used for email'
]
},
{
'field': '_skip',
'score': 0.30,
'confidence': 'low',
'reasons': [
'Unclear mapping - consider skipping',
'May be metadata or notes column'
]
}
]
UI Action: Show dropdown with suggestions for user selection
TypeScript Implementation:
// In mapping UI component
const suggestions = getMappingSuggestions("Dados", columnData, 0);
// Display in UI
<Select value={mapping["Dados"]} onValueChange={handleChange}>
<SelectTrigger>
<SelectValue placeholder="Select mapping..." />
</SelectTrigger>
<SelectContent>
{suggestions.map(suggestion => (
<SelectItem key={suggestion.field} value={suggestion.field}>
{suggestion.field} ({Math.round(suggestion.score * 100)}% confidence)
<div className="text-xs text-muted-foreground">
{suggestion.reasons[0]}
</div>
</SelectItem>
))}
</SelectContent>
</Select>
Example 5: Regional Variations
Column: "Móvel" (Portuguese for mobile/phone)
Sample Data:
"Móvel"
"11999999999"
"21988888888"
"11987654321"
Analysis:
column_header = "Móvel"
column_data = ["11999999999", "21988888888", "11987654321"]
# Step 1: Semantic analysis with accent handling
normalized_header = normalize_header("Móvel") # "movel"
synonyms = EXPANDED_HEADER_KNOWLEDGE['phone']
# ['telefone', 'celular', 'móvel', 'movel', 'whatsapp', ...]
# Check synonym chain: móvel → movel → celular → telefone → phone
semantic_score = 0.88 # Strong match through synonym chain
# Step 2: Pattern detection
pattern = detect_data_pattern(column_data)
# Result: {'type': 'phone', 'confidence': 0.92, 'format': 'brazilian'}
# Step 3: Total score
total_score = 0.88 * 0.40 + 0.92 * 0.30 + 0.70 * 0.15 + 0.8 * 0.15
# Total: 0.90 → Map to 'phone' with 90% confidence
Result: Map to phone (0.90 confidence)
Key Learning: Accent normalization is crucial for Portuguese text matching.
Example 6: Handling Multiple Data Blocks
Spreadsheet Structure:
Row 0: "Relatório de Vendas - Q1 2024"
Row 1: (empty)
Row 2: "Nome" | "Email" | "Telefone"
Row 3: "João" | "joao@..." | "11999..."
Row 4: "Maria" | "maria@..." | "21988..."
Row 5: (empty)
Row 6: "Resumo" | "Total" | "Média"
Row 7: "Vendas" | "15000" | "7500"
Detection Strategy:
# First pass: Detect main data header
result1 = detect_header_row(rows[:10])
# Detects row 2 as header (confidence: 0.85)
# Second pass: Detect summary section
result2 = detect_header_row(rows[5:10])
# Detects row 6 as summary header (confidence: 0.75)
# Handle separately
main_data = extract_data_with_headers(rows, result1.header_row_index)
summary_data = extract_data_with_headers(rows[5:], result2.header_row_index)
Example 7: Merged Cells / Multi-line Headers
Spreadsheet Structure:
Row 0: "Dados Pessoais" (merged across columns A-D)
Row 1: "Nome" | "Email" | "Telefone" | "CPF"
Row 2: "João" | "joao@..." | "11999..." | "123..."
Detection:
# Detect multi-line header
def detect_multiline_header(rows, start_row):
# Check if next row also looks like header
current_row = rows[start_row]
next_row = rows[start_row + 1] if start_row + 1 < len(rows) else []
# If both rows are mostly text and similar structure
if is_header_like(current_row) and is_header_like(next_row):
return start_row + 1 # Header spans 2 rows
return start_row
# Result: Header spans rows 0-1, data starts at row 2
header_row_index = 1 # Use second row as actual column headers
Example 8: Error Detection and Validation
Column: "CPF" Values:
"CPF"
"123.456.789-00" ✓ Valid
"987.654.321-00" ✓ Valid
"111.111.111-11" ✗ Invalid (all same digits)
"123.456.789" ✗ Invalid (missing check digits)
Validation Process:
column_data = [
"123.456.789-00",
"987.654.321-00",
"111.111.111-11",
"123.456.789"
]
validation_results = []
for i, value in enumerate(column_data):
is_valid, normalized, error = validate_and_normalize_value(value, 'cpf')
validation_results.append({
'row': i + 2, # +2 for header row and 0-index
'value': value,
'valid': is_valid,
'normalized': normalized,
'error': error
})
# Results:
# [
# {'row': 2, 'valid': True, 'normalized': '12345678900', 'error': None},
# {'row': 3, 'valid': True, 'normalized': '98765432100', 'error': None},
# {'row': 4, 'valid': False, 'normalized': None, 'error': 'CPF with all same digits'},
# {'row': 5, 'valid': False, 'normalized': None, 'error': 'CPF missing check digits'}
# ]
# User feedback:
# ⚠️ Row 4: Invalid CPF - all same digits
# ⚠️ Row 5: Invalid CPF - missing check digits
Example 9: Learning from User Corrections
Scenario: User corrects mapping "Contato" → email instead of suggested phone
Learning System:
// Store user correction
interface UserCorrection {
columnHeader: string;
suggestedField: string;
userSelectedField: string;
columnData: unknown[];
timestamp: number;
}
// Learn from correction
function learnFromCorrection(correction: UserCorrection) {
// Analyze why user chose different field
const pattern = detectColumnPattern(correction.columnData);
// If pattern matches user's choice better, update knowledge base
if (pattern.type === 'email' && correction.userSelectedField === 'email') {
// Add "contato" as email synonym in specific contexts
updateKnowledgeBase('email', {
context: 'when_pattern_is_email',
synonyms: [...existingSynonyms, 'contato']
});
}
// Track for future suggestions
trackCorrection(correction);
}
Future Improvement: Next time "Contato" with email pattern appears, suggest email first.
Best Practices Summary
- Always use multi-factor analysis - Don't rely on single method
- Sample data for pattern detection - Use first 10-20 rows, not all data
- Provide fallbacks - When confidence is low, show suggestions
- Validate patterns - Use domain-specific validators (CPF, CEP, etc.)
- Normalize early - Clean and normalize data during detection
- User feedback - Show confidence scores and reasoning
- Learn from corrections - Track user corrections to improve future mappings
- Handle edge cases - Empty rows, merged cells, multiple data blocks
Performance Considerations
- Limit scanning: Only scan first 15 rows for headers
- Sample data: Use first 10-20 rows for pattern detection
- Cache results: Store detection results to avoid re-computation
- Lazy evaluation: Only analyze columns that need mapping
- Parallel processing: Analyze multiple columns simultaneously if possible
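A sketch combining the sampling and caching points (reusing the guess_type() sketch from the pattern section; the cache key must be a hashable tuple):
from functools import lru_cache
SAMPLE_SIZE = 20  # pattern detection only needs the first 10-20 rows
@lru_cache(maxsize=256)
def cached_pattern(sample: tuple) -> str:
    guesses = [guess_type(str(v)) for v in sample if v is not None]
    return max(set(guesses), key=guesses.count) if guesses else 'empty'
def detect_column(values: list) -> str:
    return cached_pattern(tuple(values[:SAMPLE_SIZE]))  # tuples are hashable, so results cache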