docx-advanced-patterns

@belumume/claude-skills

Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond the basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.

Install Skill

  1. Download the skill
  2. Enable skills in Claude

     Open claude.ai/settings/capabilities and find the "Skills" section

  3. Upload to Claude

     Click "Upload skill" and select the downloaded ZIP file

Note: Read through the skill's instructions to verify it before using it.

SKILL.md

name: docx-advanced-patterns
description: Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond the basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.
version: 1.0.0
dependencies: python>=3.8, python-docx>=0.8.11

DOCX Advanced Patterns Skill

Specialized patterns for python-docx that handle complex document structures not covered by basic .text extraction.

When to Use This Skill

Invoke this skill when working with DOCX files that have:

  • Nested tables within table cells
  • Forms with checkbox options
  • Complex multi-row cell layouts
  • Checklists with embedded options
  • Cell content that is visible in the document but missing from the .text property

Use alongside the official docx skill for comprehensive document handling.

Core Pattern: Nested Table Extraction

Problem

python-docx's cell.text property joins the text of the paragraphs that are direct children of the cell. It does not descend into tables nested inside the cell, so text held in a nested table is left out.

Symptom:

cell.text  # Returns: '' or '\n'
# But cell visually contains content!
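
To see why this happens, it can help to look at the underlying XML. The sketch below uses only the standard library and a hand-written, simplified w:tc element (an assumption for illustration, not taken from a real document). The cell's only direct paragraph is empty, while the visible text sits in paragraphs inside the nested w:tbl.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W}

# Hand-written, simplified cell (illustrative assumption): one nested table
# followed by the empty paragraph that closes the cell.
CELL_XML = f"""
<w:tc xmlns:w="{W}">
  <w:tbl>
    <w:tr>
      <w:tc><w:p><w:r><w:t>High potential</w:t></w:r></w:p></w:tc>
      <w:tc><w:p><w:r><w:t>☐</w:t></w:r></w:p></w:tc>
    </w:tr>
  </w:tbl>
  <w:p/>
</w:tc>
""".strip()


def paragraph_text(p):
    """Concatenate the w:t runs of one paragraph (tabs and breaks ignored)."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NSMAP))


cell = ET.fromstring(CELL_XML)

# Direct child paragraphs only: roughly what cell.text looks at.
direct_text = "\n".join(paragraph_text(p) for p in cell.findall("w:p", NSMAP))

# Every paragraph at any depth, in document order.
all_text = [paragraph_text(p) for p in cell.iter(f"{{{W}}}p")]
all_text = [t for t in all_text if t]

print(repr(direct_text))
print(all_text)
```

The direct-children view is empty, while the full walk finds both the label and the checkbox glyph, which is the gap the patterns below are meant to close.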

Detection

Check if a cell contains nested tables:

if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction

Solution (Simple)

def extract_cell_content_with_nested_tables(cell):
    """
    Extract text from a cell, including labels from nested tables.

    Direct paragraphs come first, followed by the first-column text of
    each nested-table row (one level deep). Other columns are skipped.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Newline-joined text, or '' if nothing was found
    """
    checkbox_chars = {'☐', '☑', '☒'}
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            # For checkbox lists: Column 0 = label, Column 1 = checkbox
            # Extract text from first column only
            if nested_row.cells:
                first_col_text = nested_row.cells[0].text.strip()
                # Skip empty cells and cells holding only a checkbox character
                if first_col_text and first_col_text not in checkbox_chars:
                    text_parts.append(first_col_text)

    return '\n'.join(text_parts)

Solution (Recursive for Deep Nesting)

For documents with multiple levels of table nesting:

def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.

    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['☐', '☑', '☒']:
                text_parts.append(para_text)

        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
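
One caveat with the recursive pattern: python-docx returns a horizontally merged cell once for each grid column it spans, so its text can be collected more than once. Below is a minimal sketch of one way to guard against that, assuming each cell exposes its underlying XML element as _tc (as python-docx cells do). The extract_unique_text name and the Fake* classes are hypothetical; the stand-ins exist only so the example runs without a document.

```python
from dataclasses import dataclass, field

CHECKBOX_CHARS = {"☐", "☑", "☒"}


def extract_unique_text(cell):
    """Recursively collect text, visiting each underlying cell element once."""
    seen = set()   # underlying elements already visited (kept alive here)
    parts = []

    def walk(c):
        key = getattr(c, "_tc", c)   # assumption: _tc identifies the real cell
        if key in seen:
            return                   # same merged cell seen via another column
        seen.add(key)
        for para in c.paragraphs:
            text = para.text.strip()
            if text and text not in CHECKBOX_CHARS:
                parts.append(text)
        for table in c.tables:
            for row in table.rows:
                for nested in row.cells:
                    walk(nested)

    walk(cell)
    return "\n".join(parts)


# Hypothetical stand-ins that mimic the python-docx attributes used above.
@dataclass(eq=False)
class FakePara:
    text: str


@dataclass(eq=False)
class FakeCell:
    paragraphs: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    _tc: object = field(default_factory=object)


@dataclass(eq=False)
class FakeRow:
    cells: list


@dataclass(eq=False)
class FakeTable:
    rows: list


merged = FakeCell(paragraphs=[FakePara("Spans two columns")])
# Same underlying element wrapped twice, as happens for a merged cell.
merged_again = FakeCell(paragraphs=merged.paragraphs, _tc=merged._tc)
nested = FakeTable(rows=[FakeRow(cells=[merged, merged_again])])
outer = FakeCell(paragraphs=[FakePara("Question 1")], tables=[nested])

result = extract_unique_text(outer)
print(result)
```

Holding the elements in the seen set keeps them alive for the duration of the walk, which is what makes the membership test reliable.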

Usage Examples

Example 1: Extracting Form Checkbox Options

Document Structure:

Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐

Extraction:

from docx import Document

doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]

# Wrong way - returns empty
basic_text = cell.text
print(basic_text)  # Output: '' or '\n'

# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
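
If the checked state matters and not just the labels, the same nested table can be read as (label, checked) pairs. This is a sketch under two assumptions: the label is in column 0 and the checkbox glyph in column 1, as in the structure above, and both ☑ and ☒ count as checked. The extract_checkbox_states name is hypothetical, and the SimpleNamespace objects are stand-ins for python-docx cells.

```python
from types import SimpleNamespace

# Assumption: either of these glyphs means the option is selected.
CHECKED = {"☑", "☒"}


def extract_checkbox_states(cell):
    """Return [(label, is_checked), ...] for every nested-table row."""
    states = []
    for table in cell.tables:
        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # no checkbox column in this row
            label = cells[0].text.strip()
            mark = cells[1].text.strip()
            if label:
                states.append((label, mark in CHECKED))
    return states


# Hypothetical stand-ins: only the attributes used above are provided.
def fake_row(label, mark):
    return SimpleNamespace(
        cells=[SimpleNamespace(text=label), SimpleNamespace(text=mark)]
    )


form_cell = SimpleNamespace(tables=[SimpleNamespace(rows=[
    fake_row("High potential", "☐"),
    fake_row("Moderate potential", "☒"),
    fake_row("Low potential", "☐"),
])])

states = extract_checkbox_states(form_cell)
for label, checked in states:
    print(f"{label}: {'checked' if checked else 'unchecked'}")
```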

Example 2: Processing All Cells in a Table

def process_table_with_nested_content(table, handler):
    """Process all cells, handling nested tables.

    handler: any callable that takes the extracted text (translate,
    analyze, etc.) and returns the processed result.
    """
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)

            if content:
                processed = handler(content)
                print(f"Cell content: {processed}")

Example 3: Detecting Nested Tables

def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })

    return nested_cells

# Usage
from docx import Document

doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)

for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
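
Note that doc.tables lists only top-level tables, so the analysis above reports a single level of nesting. To decide between the simple and the recursive extractor, a depth check along these lines may help. max_nesting_depth is a hypothetical helper, and the stand-in objects exist only to make the sketch runnable without a document.

```python
from types import SimpleNamespace


def max_nesting_depth(cell):
    """Return 0 for a cell without nested tables, 1 for one level, and so on."""
    depth = 0
    for table in cell.tables:
        depth = max(depth, 1)
        for row in table.rows:
            for nested_cell in row.cells:
                depth = max(depth, 1 + max_nesting_depth(nested_cell))
    return depth


# Hypothetical stand-ins that provide .tables, .rows and .cells only.
def fake_cell(*tables):
    return SimpleNamespace(tables=list(tables))


def fake_table(*cells):
    return SimpleNamespace(rows=[SimpleNamespace(cells=list(cells))])


leaf = fake_cell()
one_level = fake_cell(fake_table(leaf))
two_levels = fake_cell(fake_table(fake_cell(fake_table(leaf))))

for name, cell in [("leaf", leaf), ("one_level", one_level), ("two_levels", two_levels)]:
    print(name, max_nesting_depth(cell))
```

A depth of 1 means the simple extractor sees every nested table; anything deeper calls for the recursive version.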

Common Use Cases

1. Government Forms

Forms often use nested tables for checkbox grids:

def extract_form_responses(doc):
    """Extract the checkbox options offered for each form question"""
    responses = {}

    for table in doc.tables:
        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # Row has no options column

            # First cell = question
            question = cells[0].text.strip()

            # Second cell = checkbox options (nested table)
            if cells[1].tables:
                options = extract_cell_content_with_nested_tables(cells[1])
                # ''.split('\n') would give [''], so handle the empty case
                responses[question] = options.split('\n') if options else []

    return responses

2. Evaluation Forms

Extract rating scales and options:

def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []

    for table in doc.tables:
        # Skip the header row
        for row in list(table.rows)[1:]:
            cells = row.cells
            if len(cells) < 2:
                continue  # Row has no rating column

            # Get criterion
            criterion = cells[0].text.strip()

            # Get rating options (often nested)
            rating_options = extract_cell_content_with_nested_tables(cells[1])

            evaluations.append({
                'criterion': criterion,
                # ''.split('\n') would give [''], so handle the empty case
                'options': rating_options.split('\n') if rating_options else []
            })

    return evaluations

3. Complex Data Tables

Extract structured data from cells with nested layouts:

def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }

    # Direct paragraphs
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break

    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)

    return data

Integration with Official docx Skill

This skill complements the official docx skill:

Official docx skill provides:

  • Document creation (docx-js)
  • Basic text extraction (pandoc)
  • Tracked changes workflows
  • Comment handling
  • XML access for complex cases

This skill provides:

  • Nested table extraction
  • Complex cell content handling
  • Form and checklist processing
  • Advanced content extraction patterns

Use together:

# python-docx supplies the document object model
from docx import Document

# The extraction helpers come from this skill. docx_advanced is assumed
# to be a local module where you saved the functions shown above.
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)

Performance Considerations

For Large Documents:

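If you make more than one pass over the same document, record which cells contain nested tables on the first pass so that later passes can skip the check: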

def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}

    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)

    return cache

# Usage
cache = build_nested_table_cache(doc)

for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text

Troubleshooting

Issue: Extraction returns empty despite visible content

Diagnosis:

cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")

if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")

Fix: Use extract_cell_content_with_nested_tables(cell)

Issue: Checkbox characters (such as ☐) appear in output

Fix: Filter them out:

text = cell.text.strip()
# Remove checkbox unicode characters
clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '').strip()
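
When a glyph shares a paragraph with its label (for example "☐ High potential"), an exact-match filter will not catch it. Here is a minimal sketch using str.translate. The strip_checkboxes name is hypothetical, and the character set is limited to the three glyphs this document already mentions; forms that use other symbols would need them added.

```python
# Assumption: only these three glyphs need removing.
CHECKBOX_CHARS = "☐☑☒"
_STRIP_TABLE = str.maketrans("", "", CHECKBOX_CHARS)


def strip_checkboxes(text):
    """Remove checkbox glyphs and tidy the whitespace they leave behind."""
    cleaned = text.translate(_STRIP_TABLE)
    # Collapse runs of whitespace within each line, keep line breaks intact.
    lines = (" ".join(line.split()) for line in cleaned.splitlines())
    # Drop lines that held nothing but a glyph.
    return "\n".join(line for line in lines if line)


sample = "☐ High potential\n☑  Moderate potential\n☒"
print(strip_checkboxes(sample))
```

Building the translation table once at import time avoids chaining one replace call per glyph.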

Issue: Multi-line content not preserved

Fix: Join with newlines:

'\n'.join(text_parts)  # Preserves line structure

Best Practices

  1. Always check for nested tables first:

    if cell.tables:
        content = extract_cell_content_with_nested_tables(cell)
    else:
        content = cell.text
    
  2. Handle checkbox characters:

    CHECKBOX_CHARS = {'☐', '☑', '☒'}
    if text and text not in CHECKBOX_CHARS:
        text_parts.append(text)  # Process text
    
  3. Preserve structure:

    # Use newlines to maintain line breaks
    '\n'.join(lines)
    
  4. Test with sample documents:

    def test_extraction():
        doc = Document('sample_form.docx')
        cell = doc.tables[0].rows[1].cells[0]
    
        extracted = extract_cell_content_with_nested_tables(cell)
        assert 'High potential' in extracted
        assert 'Moderate potential' in extracted
    

Reference Implementation

See REFERENCE.md for:

  • Complete working examples
  • Integration patterns
  • Advanced recursive extraction
  • Performance optimization techniques

Contributing to Anthropic Skills

This pattern is not currently in the official docx skill. If you find it useful, consider contributing:

  1. Fork https://github.com/anthropics/skills
  2. Add the pattern to document-skills/docx/SKILL.md
  3. Submit a pull request with:
    • Pattern description
    • Code examples
    • Use cases

Success Criteria

The pattern is working if:

  • Cells with nested tables return full content
  • Checkbox options are extracted correctly
  • Form fields are readable
  • No content is lost during extraction
  • Structure is preserved (line breaks maintained)
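
The "no content is lost" criterion can be checked mechanically. The sketch below compares an extractor's output against a full recursive walk of the cell and reports any paragraph text that is missing. find_lost_text and all_paragraph_texts are hypothetical helpers, and the stand-in objects mimic only the python-docx attributes the code touches.

```python
from types import SimpleNamespace

CHECKBOX_CHARS = {"☐", "☑", "☒"}


def all_paragraph_texts(cell):
    """Yield every non-empty, non-checkbox paragraph text at any depth."""
    for para in cell.paragraphs:
        text = para.text.strip()
        if text and text not in CHECKBOX_CHARS:
            yield text
    for table in cell.tables:
        for row in table.rows:
            for nested_cell in row.cells:
                yield from all_paragraph_texts(nested_cell)


def find_lost_text(cell, extracted):
    """Return the paragraph texts that are missing from `extracted`."""
    extracted_lines = set(extracted.split("\n"))
    return [t for t in all_paragraph_texts(cell) if t not in extracted_lines]


# Hypothetical stand-ins for python-docx cells, rows and tables.
def fake_cell(texts=(), tables=()):
    return SimpleNamespace(
        paragraphs=[SimpleNamespace(text=t) for t in texts],
        tables=list(tables),
    )


inner = SimpleNamespace(rows=[SimpleNamespace(cells=[
    fake_cell(["High potential"]),
    fake_cell(["see appendix"]),   # second-column text
])])
outer = fake_cell(tables=[inner])

# A first-column-only extraction would miss the second column.
lost = find_lost_text(outer, "High potential")
print(lost)
```

An empty result suggests the extractor captured everything; a non-empty one points at columns or nesting levels it skipped.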