| name | docx-advanced-patterns |
| description | Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts. |
| version | 1.0.0 |
| dependencies | python>=3.8, python-docx>=0.8.11 |
DOCX Advanced Patterns Skill
Specialized patterns for python-docx that handle complex document structures not covered by basic .text extraction.
When to Use This Skill
Invoke this skill when working with DOCX files that have:
- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that doesn't appear with the .text property
Use alongside the official docx skill for comprehensive document handling.
Core Pattern: Nested Table Extraction
Problem
python-docx's cell.text property only extracts direct paragraph text - it does not traverse nested tables within cells.
Symptom:
cell.text # Returns: '' or '\n'
# But cell visually contains content!
Detection
Check if a cell contains nested tables:
if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction
Solution (Simple)
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.
    Args:
        cell: python-docx _Cell object
    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []
    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)
    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)
    return '\n'.join(text_parts) if text_parts else ''
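The function above keeps only the labels and drops the checkbox column. Where the ticked state matters, a minimal sketch along the same lines can pair each label with the glyph found in the second column. It assumes the checkboxes are literal ☐/☑/☒ characters in column 1 (Word content controls or legacy form fields may need XML-level handling instead); `extract_checkbox_states`, `CHECKED_MARKS`, and `UNCHECKED_MARK` are illustrative names, not part of python-docx.

```python
# Glyphs treated as "ticked" / "unticked" (an assumption about the document)
CHECKED_MARKS = ("☑", "☒")
UNCHECKED_MARK = "☐"


def extract_checkbox_states(cell):
    """Return [(label, checked), ...] for two-column nested checkbox tables.

    checked is True, False, or None when no recognizable glyph is present.
    """
    results = []
    for nested_table in cell.tables:
        for nested_row in nested_table.rows:
            cells = nested_row.cells
            if len(cells) < 2:
                continue  # Not a label/checkbox pair
            label = cells[0].text.strip()
            mark = cells[1].text.strip()
            if not label:
                continue
            if any(m in mark for m in CHECKED_MARKS):
                checked = True
            elif UNCHECKED_MARK in mark:
                checked = False
            else:
                checked = None
            results.append((label, checked))
    return results
```

Returning `None` for unrecognized marks keeps "not ticked" distinct from "could not tell", which is useful when a form mixes glyph styles.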
Solution (Recursive for Deep Nesting)
For documents with multiple levels of table nesting:
def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.
    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            if para_text and para_text not in ['', '☐', '☑', '☒']:
                text_parts.append(para_text)
        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
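One caveat when walking row.cells: python-docx reports a horizontally merged cell once for each grid column it spans, so the recursive walk above can emit the same text several times. A hedged sketch of one workaround is to skip cells whose underlying XML element has already been seen in the row. It relies on the private `_tc` attribute of `_Cell`, which is an implementation detail and may change; `iter_unique_cells` is an illustrative helper name.

```python
def iter_unique_cells(row):
    """Yield each distinct cell of a row once, skipping merged-cell repeats."""
    seen = []  # Underlying elements already yielded, compared by identity
    for cell in row.cells:
        # _tc is the wrapped XML element; fall back to the cell object itself
        element = getattr(cell, "_tc", cell)
        if any(element is earlier for earlier in seen):
            continue
        seen.append(element)
        yield cell
```

To use it, the inner loop of the recursive function would iterate over `iter_unique_cells(nested_row)` instead of `nested_row.cells`.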
Usage Examples
Example 1: Extracting Form Checkbox Options
Document Structure:
Table Cell contains:
  Nested Table:
    Row 1: "High potential" | ☐
    Row 2: "Moderate potential" | ☐
    Row 3: "Low potential" | ☐
Extraction:
from docx import Document
doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]
# Wrong way - returns empty
basic_text = cell.text
print(basic_text) # Output: '' or '\n'
# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
Example 2: Processing All Cells in a Table
def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)
            if content:
                # Process content (translate, analyze, etc.)
                # do_something_with is a placeholder for your own function
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
Example 3: Detecting Nested Tables
def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })
    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)
for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
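Note that doc.tables lists only top-level tables, and the function above inspects a single level of nesting. A minimal sketch for walking every table at any depth follows; `iter_tables_with_depth` is a hypothetical helper, and it accepts anything that exposes `.tables` (a Document or a cell).

```python
def iter_tables_with_depth(container, depth=0):
    """Yield (table, depth) for every table under container, depth-first.

    depth is 0 for tables directly under container and grows by one
    for each level of cell nesting.
    """
    for table in container.tables:
        yield table, depth
        for row in table.rows:
            for cell in row.cells:
                # Recurse into the cell; merged cells may be visited repeatedly
                yield from iter_tables_with_depth(cell, depth + 1)
```

For example, `max((d for _, d in iter_tables_with_depth(doc)), default=-1)` gives the deepest nesting level in a document, or -1 when it has no tables.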
Common Use Cases
1. Government Forms
Forms often use nested tables for checkbox grids:
def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}
    for table in doc.tables:
        for row in table.rows:
            if len(row.cells) < 2:
                continue  # Skip rows without an options cell
            # First cell = question
            question = row.cells[0].text.strip()
            # Second cell = checkbox options (nested table)
            if row.cells[1].tables:
                options = extract_cell_content_with_nested_tables(row.cells[1])
                # splitlines() yields [] rather than [''] when nothing was extracted
                responses[question] = options.splitlines()
    return responses
2. Evaluation Forms
Extract rating scales and options:
def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []
    for table in doc.tables:
        # Start at the second row (the first is skipped, as in a header row)
        for row in table.rows[1:]:
            if len(row.cells) < 2:
                continue  # Skip rows without a rating cell
            # Get criterion
            criterion = row.cells[0].text.strip()
            # Get rating options (often nested)
            rating_cell = row.cells[1]
            rating_options = extract_cell_content_with_nested_tables(rating_cell)
            evaluations.append({
                'criterion': criterion,
                # splitlines() yields [] rather than [''] for empty content
                'options': rating_options.splitlines()
            })
    return evaluations
3. Complex Data Tables
Extract structured data from cells with nested layouts:
def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }
    # Direct paragraphs: keep the first non-empty one
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break
    # Nested table data
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)
    return data
Integration with Official docx Skill
This skill complements the official docx skill:
Official docx skill provides:
- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases
This skill provides:
- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns
Use together:
# For basic operations: use official skill
from docx import Document
# For nested table handling: use this skill
from docx_advanced import extract_cell_content_with_nested_tables
# Combine both
doc = Document('complex_form.docx') # Official
for table in doc.tables: # Official
    for row in table.rows: # Official
        for cell in row.cells: # Official
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)
Performance Considerations
For Large Documents:
Cache nested table checks:
def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)
    return cache

# Usage
cache = build_nested_table_cache(doc)
for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text
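The cache above walks the document once to build the index and a second time to use it. When the index is not needed for anything else, a single pass that decides per cell may be simpler. The following is a sketch under that assumption; `iter_cell_contents` is an illustrative name, and the nested-table extractor is passed in as the `extract` argument rather than imported.

```python
def iter_cell_contents(doc, extract):
    """Yield ((table_idx, row_idx, col_idx), content) for every top-level cell.

    extract: callable applied to cells that contain nested tables, for
             example extract_cell_content_with_nested_tables.
    """
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                # Fall back to plain .text when there is nothing nested
                content = extract(cell) if cell.tables else cell.text
                yield (t_idx, r_idx, c_idx), content
```

Because it is a generator, callers can stop early or stream results without holding every cell's text in memory.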
Troubleshooting
Issue: Extraction returns empty despite visible content
Diagnosis:
cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")
if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")
Fix: Use extract_cell_content_with_nested_tables(cell)
Issue: Checkbox characters (such as ☐) appear in output
Fix: Filter them out:
text = cell.text.strip()
# Remove checkbox unicode characters, then trim any whitespace they leave behind
clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '').strip()
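The same filtering can be done in one pass with str.translate, which avoids a growing chain of replace calls as more glyphs are added. This is a minimal sketch; `CHECKBOX_CHARS` here is a plain string of the three glyphs this document mentions, and `strip_checkbox_chars` is an illustrative helper name.

```python
# Glyphs to remove; extend this string if a form uses other symbols
CHECKBOX_CHARS = "☐☑☒"

# Translation table that maps each checkbox glyph to "delete"
_STRIP_TABLE = str.maketrans("", "", CHECKBOX_CHARS)


def strip_checkbox_chars(text):
    """Remove checkbox glyphs and trim surrounding whitespace."""
    return text.translate(_STRIP_TABLE).strip()
```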
Issue: Multi-line content not preserved
Fix: Join with newlines:
'\n'.join(text_parts) # Preserves line structure
Best Practices
Always check for nested tables first:
if cell.tables:
    content = extract_cell_content_with_nested_tables(cell)
else:
    content = cell.text
Handle checkbox characters:
CHECKBOX_CHARS = ['', '☐', '☑', '☒']
if text not in CHECKBOX_CHARS:
    # Process text
    ...
Preserve structure:
# Use newlines to maintain line breaks
'\n'.join(lines)
Test with sample documents:
def test_extraction():
    doc = Document('sample_form.docx')
    cell = doc.tables[0].rows[1].cells[0]
    extracted = extract_cell_content_with_nested_tables(cell)
    assert 'High potential' in extracted
    assert 'Moderate potential' in extracted
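The sample test depends on a sample_form.docx file that may not be at hand. As a hedged alternative, an equivalent fixture can be built in memory with python-docx itself. `build_sample_form` is an illustrative helper, and its layout simply mirrors the "High/Moderate/Low potential" example used earlier in this document.

```python
def build_sample_form():
    """Build a Document whose cell (row 1, col 0) holds a nested checkbox table."""
    # Imported lazily so that merely defining the helper does not require python-docx
    from docx import Document

    doc = Document()
    table = doc.add_table(rows=2, cols=2)
    table.cell(0, 0).text = "Question"

    # Nest a 3x2 table inside the cell: column 0 = label, column 1 = checkbox
    cell = table.cell(1, 0)
    nested = cell.add_table(rows=3, cols=2)
    labels = ["High potential", "Moderate potential", "Low potential"]
    for row, label in zip(nested.rows, labels):
        row.cells[0].text = label
        row.cells[1].text = "☐"
    return doc
```

A test can then call `build_sample_form()`, take `doc.tables[0].rows[1].cells[0]`, and pass that cell to `extract_cell_content_with_nested_tables` without any file on disk.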
Reference Implementation
See REFERENCE.md for:
- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques
Contributing to Anthropic Skills
This pattern is not currently in the official docx skill. If you find it useful, consider contributing:
- Fork https://github.com/anthropics/skills
- Add to document-skills/docx/SKILL.md
- Submit pull request with:
  - Pattern description
  - Code examples
  - Use cases
Success Criteria
Pattern is working if:
- Cells with nested tables return full content
- Checkbox options are extracted correctly
- Form fields are readable
- No content is lost during extraction
- Structure is preserved (line breaks maintained)