docx-advanced-patterns

@belumume/claude-skills

Advanced python-docx patterns for nested tables, complex cells, and content extraction beyond .text property. Techniques for forms, checklists, and complex layouts.

SKILL.md

name: docx-advanced-patterns
description: Advanced python-docx patterns for nested tables, complex cells, and content extraction beyond .text property. Techniques for forms, checklists, and complex layouts.

DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic .text extraction.

When to Use This Skill

Invoke this skill when working with DOCX files that have:

  • Nested tables within table cells
  • Forms with checkbox options
  • Complex multi-row cell layouts
  • Checklists with embedded options
  • Cell content that the .text property does not return

Use alongside the official docx skill for comprehensive document handling.

Core Pattern: Nested Table Extraction

Problem

python-docx's cell.text property joins only the text of the cell's own paragraphs. It does not descend into tables nested inside the cell, so anything stored in a nested table is missing from the result.

Symptom:

cell.text  # Returns: '' or '\n'
# But cell visually contains content!
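
A rough model of why this happens, assuming cell.text is simply the newline-joined text of the cell's direct paragraphs. The stand-in objects below are hypothetical and only mimic the .paragraphs, .tables, .rows, and .cells attributes, so the sketch runs without a real document:

```python
from types import SimpleNamespace

def approximate_cell_text(cell):
    """Rough model of cell.text: join the text of the cell's direct paragraphs only."""
    return '\n'.join(p.text for p in cell.paragraphs)

# Hypothetical stand-ins for python-docx objects (not the real classes).
inner_cell = SimpleNamespace(paragraphs=[SimpleNamespace(text='High potential')], tables=[])
nested_table = SimpleNamespace(rows=[SimpleNamespace(cells=[inner_cell])])
# The outer cell holds one empty paragraph plus a nested table.
outer_cell = SimpleNamespace(paragraphs=[SimpleNamespace(text='')], tables=[nested_table])

print(repr(approximate_cell_text(outer_cell)))  # ''
print(len(outer_cell.tables))                   # 1
```

The outer cell reports empty text even though its nested table holds a label, which is the symptom described above.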

Detection

Check if a cell contains nested tables:

if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction

Solution (Simple)

def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Skip rows whose first column holds only a checkbox glyph
                    if first_col_text and first_col_text not in ['☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''
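
The function above keeps only the labels. When the checkbox state matters too, a minimal sketch along the same lines can read both columns. It assumes the layout described in the comments (column 0 = label, column 1 = checkbox) and that the boxes are stored as literal ☐, ☑, or ☒ characters; checkboxes stored as Word content controls or form fields may not show up in .text at all. The name extract_checkbox_options and the stand-in objects are hypothetical:

```python
from types import SimpleNamespace

CHECKED_MARKS = {'☑', '☒'}    # glyphs treated as "selected" (assumption)
UNCHECKED_MARKS = {'☐'}

def extract_checkbox_options(cell):
    """Return (label, checked) pairs from two-column nested tables in a cell.

    checked is True, False, or None when column 1 holds no recognized glyph.
    """
    options = []
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            cells = nested_row.cells
            if len(cells) < 2:
                continue  # not a label/checkbox row
            label = cells[0].text.strip()
            mark = cells[1].text.strip()
            if not label:
                continue
            if mark in CHECKED_MARKS:
                checked = True
            elif mark in UNCHECKED_MARKS:
                checked = False
            else:
                checked = None
            options.append((label, checked))
    return options

# Hypothetical stand-ins for python-docx objects, so the sketch runs without a file.
def _stub_cell(text):
    return SimpleNamespace(text=text)

demo_cell = SimpleNamespace(tables=[SimpleNamespace(rows=[
    SimpleNamespace(cells=[_stub_cell('High potential'), _stub_cell('☑')]),
    SimpleNamespace(cells=[_stub_cell('Moderate potential'), _stub_cell('☐')]),
    SimpleNamespace(cells=[_stub_cell('Low potential'), _stub_cell('')]),
])])

print(extract_checkbox_options(demo_cell))
# [('High potential', True), ('Moderate potential', False), ('Low potential', None)]
```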

Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.

    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''

Usage Examples

Example 1: Extracting Form Checkbox Options

Document Structure:

Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐

Extraction:

from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential

Example 2: Processing All Cells in a Table

def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                # Process content (translate, analyze, etc.).
                # do_something_with is a placeholder for your own function.
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
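
One caveat when looping over row.cells: python-docx reports a merged cell once for every grid position it spans, so the loop above can process the same content more than once. A minimal sketch of one way to skip the repeats, assuming each cell wrapper exposes its underlying XML element as the private attribute _tc (an implementation detail that could change between versions). The helper name iter_unique_cells and the stand-in objects are hypothetical:

```python
from types import SimpleNamespace

def iter_unique_cells(table):
    """Yield each distinct cell of a table once, skipping merged-cell repeats."""
    seen = set()
    for row in table.rows:
        for cell in row.cells:
            # _tc is a private python-docx attribute (assumption); keeping the
            # element itself in the set avoids relying on id() values.
            if cell._tc in seen:
                continue
            seen.add(cell._tc)
            yield cell

# Stand-in table: the first row has one cell merged across two columns.
merged_tc = object()
demo_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[SimpleNamespace(_tc=merged_tc, text='Merged'),
                           SimpleNamespace(_tc=merged_tc, text='Merged')]),
    SimpleNamespace(cells=[SimpleNamespace(_tc=object(), text='A'),
                           SimpleNamespace(_tc=object(), text='B')]),
])

print([cell.text for cell in iter_unique_cells(demo_table)])  # ['Merged', 'A', 'B']
```

Replacing the two nested loops with a single loop over iter_unique_cells(table) keeps the rest of the processing unchanged.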

Example 3: Detecting Nested Tables

def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)

for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")

Common Use Cases

1. Government Forms

Forms often use nested tables for checkbox grids:

def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}

    for table in doc.tables:
        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # need a question cell and an options cell

            # First cell = question
            question = cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if cells[1].tables:
                options = extract_cell_content_with_nested_tables(cells[1])
                responses[question] = options.split('\n') if options else []

    return responses

2. Evaluation Forms

Extract rating scales and options:

def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        for row in table.rows[1:]:  # skip the first (header) row
            cells = row.cells
            if len(cells) < 2:
                continue  # need a criterion cell and a rating cell

            # Get criterion
            criterion = cells[0].text.strip()

            # Get rating options (often nested)
            rating_options = extract_cell_content_with_nested_tables(cells[1])

            evaluations.append({
                'criterion': criterion,
                'options': rating_options.split('\n') if rating_options else []
            })

    return evaluations

3. Complex Data Tables

Extract structured data from cells with nested layouts:

def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data

Integration with Official docx Skill

This skill complements the official docx skill:

Official docx skill provides:

  • Document creation (docx-js)
  • Basic text extraction (pandoc)
  • Tracked changes workflows
  • Comment handling
  • XML access for complex cases

This skill provides:

  • Nested table extraction
  • Complex cell content handling
  • Form and checklist processing
  • Advanced content extraction patterns

Use together:

# For basic operations: open the document with python-docx
from docx import Document

# For nested table handling: use this skill.
# Assumes the functions above are saved in a local module named docx_advanced.py
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)

Performance Considerations

For Large Documents:

Cache nested table checks:

def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)

    return cache

# Usage
cache = build_nested_table_cache(doc)

for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text

Troubleshooting

Issue: Extraction returns empty despite visible content

Diagnosis:

cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")

Fix: Use extract_cell_content_with_nested_tables(cell)

Issue: Checkbox characters (such as ☐) appear in output

Fix: Filter them out:

text = cell.text.strip()
# Remove checkbox unicode characters, then trim the whitespace they leave behind
clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '').strip()
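
For multi-line text, the same filtering can be done in one pass with a translation table. This is a minimal sketch: the helper name strip_checkbox_chars is hypothetical, and the glyph set covers only the three characters listed in this skill, so extend it if your documents use other symbols:

```python
CHECKBOX_CHARS = '☐☑☒'  # extend with any other glyphs your documents use
_STRIP_CHECKBOXES = str.maketrans('', '', CHECKBOX_CHARS)

def strip_checkbox_chars(text):
    """Remove checkbox glyphs line by line and drop lines left empty."""
    lines = (line.translate(_STRIP_CHECKBOXES).strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

print(strip_checkbox_chars('☐ High potential\n☑\nLow potential ☒'))
# High potential
# Low potential
```

Working line by line keeps the line breaks between options intact, which a whole-string whitespace cleanup would not.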

Issue: Multi-line content not preserved

Fix: Join with newlines:

'\n'.join(text_parts)  # Preserves line structure

Best Practices

  1. Always check for nested tables first:

    if cell.tables:
        content = extract_cell_content_with_nested_tables(cell)
    else:
        content = cell.text
    
  2. Handle checkbox characters:

    CHECKBOX_CHARS = {'☐', '☑', '☒'}
    if text and text not in CHECKBOX_CHARS:
        pass  # process text here
    
  3. Preserve structure:

    # Use newlines to maintain line breaks
    '\n'.join(lines)
    
  4. Test with sample documents:

    def test_extraction():
        doc = Document('sample_form.docx')
        cell = doc.tables[0].rows[1].cells[0]
    
        extracted = extract_cell_content_with_nested_tables(cell)
        assert 'High potential' in extracted
        assert 'Moderate potential' in extracted
    

Reference Implementation

See REFERENCE.md for:

  • Complete working examples
  • Integration patterns
  • Advanced recursive extraction
  • Performance optimization techniques

Contributing to Anthropic Skills

This pattern is not currently in the official docx skill. If you find it useful, consider contributing:

  1. Fork https://github.com/anthropics/skills
  2. Add to document-skills/docx/SKILL.md
  3. Submit pull request with:
    • Pattern description
    • Code examples
    • Use cases

Success Criteria

Pattern is working if:

  • Cells with nested tables return full content
  • Checkbox options are extracted correctly
  • Form fields are readable
  • No content is lost during extraction
  • Structure is preserved (line breaks maintained)