name	table-extractor
description	Extract tables from PDFs and images to CSV, Excel, or JSON. Supports scanned documents via OCR, multi-page PDFs, and complex table structures.

Table Extractor

Name: table-extractor
Author: dkyazzentwatwa

Extract tables from PDFs and images into structured data formats.

Features

PDF Tables: Extract tables from digital PDFs
Image Tables: OCR-based extraction from images
Multiple Tables: Extract all tables from a document
Format Export: CSV, Excel, JSON output
Table Detection: Auto-detect table boundaries
Column Alignment: Smart column detection
Multi-Page: Process entire PDF documents

Quick Start

from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV (index=False omits the DataFrame index column)
tables[0].to_csv("table.csv", index=False)

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)

CLI Usage

# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
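The `--pages 1-3` form above implies some range parsing on the CLI side. The script's actual parser is not shown here, so the following is only a minimal sketch of how such a spec could be expanded. The helper name `parse_page_range` and the comma-separated form are assumptions, and whether the CLI counts pages from 0 or 1 is not stated in this document.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "1-3" or "1,4-6" into a sorted list of page numbers.

    Hypothetical helper: the real table_extractor.py may parse --pages differently.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("1-3"))      # [1, 2, 3]
print(parse_page_range("1,4-6,4"))  # [1, 4, 5, 6]
```

The resulting list has the same shape as the `pages` argument of `load_pdf`, though any offset between CLI and API page numbering would need to be checked against the script itself.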

API Reference

TableExtractor Class

from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    def __init__(self) -> None: ...

    # Loading
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> 'TableExtractor': ...
    def load_image(self, filepath: str) -> 'TableExtractor': ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor': ...
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor': ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...
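The loading and configuration methods above are annotated as returning 'TableExtractor', which suggests they can be chained. Whether the real class returns `self` is not confirmed here, so the sketch below uses a hypothetical stand-in class, `ExtractorConfig`, purely to illustrate the pattern; its attribute names are assumptions.

```python
class ExtractorConfig:
    """Hypothetical stand-in for illustration; this is not the real TableExtractor."""

    def __init__(self) -> None:
        self.ocr_enabled = False
        self.ocr_lang = "eng"
        self.column_mode = "auto"

    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> "ExtractorConfig":
        self.ocr_enabled = enabled
        self.ocr_lang = lang
        return self  # returning self is what makes chaining possible

    def set_column_detection(self, mode: str = "auto") -> "ExtractorConfig":
        self.column_mode = mode
        return self


# Each call returns the same object, so the calls can be strung together.
config = ExtractorConfig().set_ocr(lang="eng").set_column_detection("auto")
print(config.ocr_enabled, config.ocr_lang, config.column_mode)  # True eng auto
```

If the real class behaves this way, a configured extractor could be built and loaded in a single expression instead of several statements.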

Supported Formats

Input

PDF documents (text-based and scanned)
Images: PNG, JPEG, TIFF, BMP
Screenshots with tables

Output

CSV (one file per table)
Excel (multiple sheets)
JSON (array of tables)
Pandas DataFrame

Table Detection

# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)
# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
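Because detection returns plain dictionaries, the results can be screened before running a full extraction. The following is a small sketch that works only on the dictionary shape shown above. The helper names `filter_tables` and `largest_table`, the thresholds, and the sample bounding-box numbers are all made up for illustration.

```python
from typing import Dict, List, Optional


def filter_tables(tables_info: List[Dict], min_rows: int = 2, min_cols: int = 2) -> List[Dict]:
    """Keep only detections with at least min_rows rows and min_cols columns."""
    return [
        info for info in tables_info
        if info["rows"] >= min_rows and info["cols"] >= min_cols
    ]


def largest_table(tables_info: List[Dict]) -> Optional[Dict]:
    """Return the detection with the most cells (rows * cols), or None if the list is empty."""
    if not tables_info:
        return None
    return max(tables_info, key=lambda info: info["rows"] * info["cols"])


# Sample data in the documented shape; the bbox values are invented.
sample = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 8, "cols": 3, "bbox": (50, 450, 350, 650)},
]

print([info["index"] for info in filter_tables(sample, min_rows=9)])  # [0]
print(largest_table(sample)["index"])                                  # 0
```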

Example Workflows

PDF Report Tables

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)

Scanned Document

extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)

Dependencies

pdfplumber>=0.10.0
pillow>=10.0.0
pandas>=2.0.0
pytesseract>=0.3.10 (for OCR)
opencv-python>=4.8.0
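A quick way to see which of these packages are present is to probe for their import names, two of which differ from the pip names (`PIL` for pillow, `cv2` for opencv-python). pytesseract is a wrapper around the Tesseract OCR engine, so the `tesseract` executable also has to be installed separately. The sketch below uses only the standard library; the function name `check_dependencies` is an assumption, and the check does not verify the minimum versions listed above.

```python
import importlib.util
import shutil
from typing import Dict

# pip package name -> module name used in "import ..."
IMPORT_NAMES = {
    "pdfplumber": "pdfplumber",
    "pillow": "PIL",
    "pandas": "pandas",
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
}


def check_dependencies() -> Dict[str, bool]:
    """Report which packages are importable, without actually importing them."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in IMPORT_NAMES.items()
    }
    # pytesseract shells out to the Tesseract binary, so look for it on PATH too.
    status["tesseract (binary)"] = shutil.which("tesseract") is not None
    return status


for name, available in check_dependencies().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```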
