| name | Construction Document Ingestion |
| description | Process construction estimate documents (Excel/CSV/PDF) from Supabase and convert them to EstimationElement format. Use when users need to ingest construction estimates, DQE (Descriptive Quantitative Estimate), or devis files. |
# Construction Document Ingestion
Process construction estimate documents (Excel, CSV, PDF) and extract building elements into EstimationElement format for cost estimation.
## When to Use This Skill
Use this skill when the user asks about:
- Processing construction estimate documents (DQE, devis, BPU)
- Extracting building cost data from Excel/CSV/PDF files
- Converting construction estimates to estimation-ready format
- Ingesting data from Vectorworks, spreadsheets, or PDF estimates
- Downloading and processing files from Supabase storage
## Agent Integration (Recommended)

### Phase 1 of Estimation Workflow
This skill is Phase 1 in the multi-phase estimation workflow:

```
Phase 0: file_manager creates workspace
         └─> temp_files/temp_project_{project_id}/

Phase 1: Ingestion processes document   ← THIS SKILL
         Input:  Supabase URL to Excel/CSV/PDF file
         Output: temp_files/temp_project_{project_id}/ingestion_uuid.json

Phase 2: context_extraction analyzes building
         └─> Uses ingestion output
```
### Function Signature
```python
def process_excel_input(supabase_url: str, output_path: str) -> str:
    """
    Agent-compatible entry point for construction document processing.

    Phase 1 of the Estimation Workflow:
    - Downloads the file from the Supabase URL
    - Converts it to markdown using LlamaParse
    - Extracts structure with an LLM (Claude Haiku)
    - Performs code-based data extraction
    - Transforms the result to EstimationElement format

    Args:
        supabase_url: Supabase public URL to an Excel/CSV/PDF file
        output_path: Path to save the output JSON (should be in the temp directory),
            e.g. "temp_files/temp_project_{project_id}/ingestion_{uuid}.json"

    Returns:
        str: Path to the processed output JSON file

    Raises:
        ValueError: If the file type cannot be detected from the URL
        FileNotFoundError: If the download fails
        Exception: For processing errors
    """
```
### Usage in Agent Workflow
```python
import os
import uuid

# Phase 0: file_manager creates the workspace
workspace_dir = file_manager.create_workspace(project_id="123")
# Returns: "temp_files/temp_project_123/"

# Phase 1: Ingestion downloads and processes the file
supabase_url = "https://your-supabase-url.com/storage/v1/object/public/bucket/file.xlsx"
output_path = os.path.join(workspace_dir, f"ingestion_{uuid.uuid4()}.json")
result_path = ingestion.process_excel_input(supabase_url, output_path)
# Returns: "temp_files/temp_project_123/ingestion_xyz.json"

# Phase 2: context_extraction uses the ingestion output
context = context_extraction.extract(result_path)
```
## CLI Usage (For Local Testing)

### Run the Script Directly
```bash
# Process a file from a Supabase URL (output to the temp directory)
python process_vectorworks_input.py \
  --supabase-url https://your-supabase.com/storage/v1/object/public/bucket/estimate.xlsx \
  --output temp_files/temp_project_123/ingestion_abc.json

# Process a CSV file
python process_vectorworks_input.py \
  --supabase-url https://your-supabase.com/storage/v1/object/public/bucket/devis.csv \
  --output temp_files/output.json

# Process a PDF file
python process_vectorworks_input.py \
  --supabase-url https://your-supabase.com/storage/v1/object/public/bucket/dqe.pdf \
  --output temp_files/result.json
```
**Important:** Always pass the `--output` flag with a path inside the `temp_files` directory to ensure proper workspace isolation.
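The document does not show the script's own argument parsing; the sketch below is a hypothetical wrapper illustrating how the two flags could map onto `process_excel_input` (the flag names come from the examples above, everything else is an assumption).

```python
import argparse

from process_vectorworks_input import process_excel_input

def main() -> None:
    # Hypothetical CLI wiring: flag names match the documented examples.
    parser = argparse.ArgumentParser(description="Ingest a construction estimate document")
    parser.add_argument("--supabase-url", required=True, help="Public Supabase URL of the file")
    parser.add_argument("--output", required=True, help="Output JSON path inside temp_files/")
    args = parser.parse_args()
    result_path = process_excel_input(args.supabase_url, args.output)
    print(result_path)

if __name__ == "__main__":
    main()
```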
### Call as a Python Function
For programmatic use in other Python code:
```python
from process_vectorworks_input import process_excel_input

# Download from Supabase and process
supabase_url = "https://your-supabase.com/.../estimate.xlsx"
output_path = "temp_files/temp_project_123/ingestion_xyz.json"
result = process_excel_input(supabase_url, output_path)
# Returns: str (path to the output file)
```
## Input Format

### Supported File Types
The script automatically detects file type from URL extension:
- Excel: `.xlsx`, `.xls`
- CSV: `.csv`
- PDF: `.pdf`
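A minimal sketch of extension-based detection, assuming a helper like the hypothetical `detect_file_type` below (the real implementation is not shown in this document):

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions the pipeline accepts; the helper name and exact behavior are assumptions.
SUPPORTED_EXTENSIONS = {".xlsx", ".xls", ".csv", ".pdf"}

def detect_file_type(supabase_url: str) -> str:
    """Guess the file type from the extension in the URL path."""
    suffix = Path(urlparse(supabase_url).path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Cannot detect a supported file type from URL: {supabase_url}")
    return suffix.lstrip(".")

# Example: detect_file_type(".../bucket/estimate.xlsx") -> "xlsx"
```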
### Expected Document Structure
Construction estimate documents (DQE, devis, BPU) with:
#### Hierarchical Sections
```
I - Section Title (e.g., "Préparation de chantier")
  1.1 - Item description
  1.2 - Item description
II - Another Section
  2.1 - Subsection
    2.1.1 - Item description
```
#### Data Columns
The script looks for these columns (case-insensitive):
- `U` or `Unité`: unit of measurement
- `Q` or `Quantité`: quantity
- `PU` or `Prix Unitaire`: unit price
- `Total HT` or `Total Hors Taxe`: total price excluding tax
- `Limite de prestation`: service limit / lot assignment
### Example Excel/CSV Structure
Item ID | Description | U | Q | PU | Total HT | Limite de prestation
--------|--------------------------|----|----|---------|----------|---------------------
I | Préparation de chantier | U | Q | PU | Total HT | Limite de prestation
1.1 | Installation de chantier | F | 1 | 24000 | 24000.0 | ?
1.2 | Signalisation | F | - | 0 | 0.0 | Hors lot
II | Démolitions | U | Q | PU | Total HT |
2.1.1 | Dépose de bordures | ml | 50 | 14 | 700.0 | Lot VRD
## Processing Pipeline
1. Download from Supabase - fetches the file and detects its type (xlsx, csv, pdf)
2. Convert to Markdown - uses LlamaParse to convert the document
3. Read XLSX Directly - for Excel files, reads with pandas for accuracy
4. Pre-process Markdown - cleans empty rows and formatting
5. LLM Structure Extraction - Claude Haiku extracts the ToC and schema
6. Column Filtering - keeps the first 2 columns plus columns matching valid keywords (U, Q, PU, Total HT, etc.); see the sketch after this list
7. Chunk by Sections - splits the document using the Table of Contents
8. Code-based Extraction - extracts items matching patterns, skips subtotals
9. Transform to EstimationElement - converts to the final output format
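The column-filtering step keeps the first two columns and scans the remaining headers for the keywords listed under Data Columns. A minimal sketch of that logic, assuming a pandas DataFrame; the exact keyword list and header scan depth used by the real pipeline are assumptions:

```python
import pandas as pd

# Keywords that mark quantity/price columns (per the Data Columns section);
# the precise keyword set in the real pipeline is an assumption.
VALID_KEYWORDS = {"u", "unité", "q", "quantité", "pu", "prix unitaire",
                  "total ht", "total hors taxe", "limite de prestation"}

def filter_columns(df: pd.DataFrame, header_scan_rows: int = 10) -> pd.DataFrame:
    """Keep the first two columns plus any column whose header cell matches a keyword."""
    keep = list(df.columns[:2])
    for col in df.columns[2:]:
        header_cells = df[col].head(header_scan_rows).astype(str).str.strip().str.lower()
        if header_cells.isin(VALID_KEYWORDS).any():
            keep.append(col)
    return df[keep]
```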
## Field Mapping

### Project-Level Fields
- `project.name` ← Default: "Construction Project"
- `project.description` ← Default: "Extracted from construction estimate"
### Building Element Fields

#### Identity (Required)
- `project_id` ← Generated UUID
- `building_id` ← Item ID from the document (1.1, 2.1.1, etc.)
- `building_name` ← Item description (falls back to the item ID)
- `building_description` ← Item description
#### Section Context

- `building_elm_section` ← Section ID + section title (e.g., "I - Préparation de chantier")
#### Quantities (Dynamic)

All other fields are prefixed with `building_elm_`; a worked example element follows this list:

- `building_elm_unit` ← U column (Unité)
- `building_elm_quantity` ← Q column (Quantité)
- `building_elm_unit_price` ← PU column (Prix Unitaire)
- `building_elm_total_ht` ← Total HT column
- `building_elm_total_price` ← alternative total column
- `building_elm_lot_assignment` ← Limite de prestation
- ...and any other dynamic columns found in the document
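For illustration, this is how row `2.1.1` from the example table would serialize under the mapping above; the exact EstimationElement field set and any extra dynamic columns are assumptions:

```python
import json
import uuid

# Hypothetical element built from the "Dépose de bordures" row of the example table.
element = {
    "project_id": str(uuid.uuid4()),
    "building_id": "2.1.1",
    "building_name": "Dépose de bordures",
    "building_description": "Dépose de bordures",
    "building_elm_section": "II - Démolitions",
    "building_elm_unit": "ml",
    "building_elm_quantity": 50,
    "building_elm_unit_price": 14,
    "building_elm_total_ht": 700.0,
    "building_elm_lot_assignment": "Lot VRD",
}
print(json.dumps(element, indent=2, ensure_ascii=False))
```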
## Validation
The script validates using Pydantic schemas:
- ✅ Document structure (TableOfContentsSection, DocumentSchema)
- ✅ Construction items (ConstructionItem)
- ✅ Output format (EstimationElement)
- ✅ Required fields (item_id, section_id, description)
- ✅ Data types (strings, floats, integers)
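The schema definitions themselves are not shown in this document. A minimal sketch of what the `ConstructionItem` model might look like: only `item_id`, `section_id`, and `description` are confirmed required fields above; the optional quantity fields are assumptions.

```python
from typing import Optional

from pydantic import BaseModel

class ConstructionItem(BaseModel):
    # Required identity fields (listed under Validation)
    item_id: str
    section_id: str
    description: str
    # Quantity columns are optional because some rows (e.g. "Hors lot") omit them;
    # this field set is an assumption, not the actual schema.
    unit: Optional[str] = None
    quantity: Optional[float] = None
    unit_price: Optional[float] = None
    total_ht: Optional[float] = None
```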
## Error Handling
The script handles:
- ❌ Invalid URL format → ValueError with details
- ❌ Download failure → Logs error, raises exception
- ❌ Unsupported file type → ValueError
- ❌ LlamaParse failure → Logs error, raises exception
- ❌ LLM extraction failure → Logs error, raises exception
- ❌ Invalid item data → Logs warning, skips item
- ❌ Column filtering issues → Logs details, continues
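Callers can branch on these exception types. A minimal sketch of a wrapper: the `safe_ingest` helper is hypothetical; only `process_excel_input` and the raised exception types come from the function docstring above.

```python
import logging
from typing import Optional

from process_vectorworks_input import process_excel_input

logger = logging.getLogger(__name__)

def safe_ingest(supabase_url: str, output_path: str) -> Optional[str]:
    """Wrap the entry point so callers can distinguish bad input from processing failures."""
    try:
        return process_excel_input(supabase_url, output_path)
    except ValueError as exc:          # unsupported / undetectable file type
        logger.error("Invalid input URL: %s", exc)
    except FileNotFoundError as exc:   # download failed
        logger.error("Download failed: %s", exc)
    except Exception as exc:           # LlamaParse / LLM / transform errors
        logger.exception("Processing failed: %s", exc)
    return None
```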
## Examples

### CLI Usage
```bash
# Process an Excel file from Supabase
python process_vectorworks_input.py \
  --supabase-url https://your-supabase.com/.../estimate.xlsx \
  --output temp_files/temp_project_456/output.json

# Output:
# 2025-11-11 10:30:15 - INFO - Starting ingestion pipeline...
# 2025-11-11 10:30:15 - INFO - Downloading file from: https://...
# 2025-11-11 10:30:15 - INFO - Detected file type: xlsx
# 2025-11-11 10:30:15 - INFO - File downloaded to: /tmp/estimate.xlsx
# 2025-11-11 10:30:16 - INFO - Converting xlsx file to markdown
# 2025-11-11 10:30:16 - INFO - Reading XLSX directly for accurate data extraction...
# 2025-11-11 10:30:16 - INFO - Read 250 rows, 16 columns from XLSX
# 2025-11-11 10:30:16 - INFO - Filtering columns based on valid keywords...
# 2025-11-11 10:30:16 - INFO - Found keyword 'u' in column 3 at row 7
# 2025-11-11 10:30:16 - INFO - Found keyword 'q' in column 4 at row 7
# 2025-11-11 10:30:16 - INFO - Total columns after: 8
# 2025-11-11 10:30:17 - INFO - Extracting document structure with LLM...
# 2025-11-11 10:30:18 - INFO - Extracted 10 sections from ToC
# 2025-11-11 10:30:18 - INFO - Chunking DataFrame by sections...
# 2025-11-11 10:30:18 - INFO - Extracting items using code-based extraction...
# 2025-11-11 10:30:18 - INFO - Section I: extracted 6 items
# 2025-11-11 10:30:18 - INFO - Section II: extracted 15 items
# 2025-11-11 10:30:19 - INFO - Total extracted: 85 items
# 2025-11-11 10:30:19 - INFO - Transformed 85 elements
# 2025-11-11 10:30:19 - INFO - Pipeline complete. Output saved to: temp_files/temp_project_456/output.json
# 2025-11-11 10:30:19 - INFO - Extracted 85 elements
```
## Logging
The script provides detailed logging:
- INFO: Processing progress, entity counts, found keywords, column filtering
- WARNING: Missing data, fallback strategies used
- ERROR: Validation failures, processing errors
- DEBUG: Detailed extraction info, item-by-item logging
Log output goes to stderr (doesn't interfere with JSON output).
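The logging setup is not shown in this document; a minimal configuration consistent with the log format in the CLI example and the `LOG_LEVEL` variable below might look like this (assumed, not confirmed by the source):

```python
import logging
import os
import sys

# Send logs to stderr so stdout stays free for JSON output;
# LOG_LEVEL is the optional environment variable documented below.
logging.basicConfig(
    stream=sys.stderr,
    level=os.environ.get("LOG_LEVEL", "INFO").upper(),
    format="%(asctime)s - %(levelname)s - %(message)s",
)
```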
## Platform Compatibility
- ✅ Windows (UTF-8 encoding handled)
- ✅ macOS
- ✅ Linux
- ✅ Python 3.8+
## Environment Variables
```bash
# Required
ANTHROPIC_API_KEY=sk-ant-...   # For the Claude Haiku LLM
LLAMA_CLOUD_API_KEY=llx-...    # For LlamaParse document conversion

# Optional
LOG_LEVEL=INFO                 # Logging level (DEBUG, INFO, WARNING, ERROR)
```
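A small startup check can fail fast when the required keys are missing. This snippet is hypothetical; only the key names come from the list above.

```python
import os

# Required API keys as documented above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "LLAMA_CLOUD_API_KEY")

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise OSError(f"Missing required environment variables: {', '.join(missing)}")
```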
## Notes
- Supabase URLs: Must be publicly accessible or properly authenticated
- File Type Detection: Automatic from URL extension
- Column Filtering: Always keeps first 2 columns, filters rest by keywords
- LLM Usage: Claude Haiku for structure extraction (cost-effective)
- Excel vs CSV: Excel preferred for numeric precision
- PDF Support: Works but may have OCR issues with complex layouts
- Output Format: Always pretty-printed JSON with indent=2
- Workspace Isolation: Output files saved in temp directories
- File Naming: Output files should use the pattern `ingestion_{uuid}.json`
- Path Handling: Always write output under the `temp_files` directory
- Dynamic Fields: All document-specific fields are prefixed with `building_elm_`