name	docx-to-markdown
description	Extract content and images from Microsoft Word (.docx) documents and convert to Markdown format. Use this when migrating Word documents to Markdown-based systems like Obsidian, converting documentation, or extracting structured content from .docx files. Handles text formatting, tables, headings, and image extraction.

Word Document to Markdown Conversion Skill

This skill helps extract and convert Microsoft Word (.docx) documents to Markdown format, including extracting embedded images.

When to Use This Skill

Use this skill when you need to:

Migrate Word documents to Markdown-based note systems (Obsidian, Notion, etc.)
Extract content from .docx files for documentation
Convert structured Word documents to plain text with formatting preserved
Extract images embedded in Word documents
Analyze Word document structure (headings, tables, formatting)

Prerequisites

The skill requires the python-docx library. If it is not already installed, create a virtual environment and install it there:

python3 -m venv .venv-docx
source .venv-docx/bin/activate
pip install python-docx

Core Functionality

The skill provides:

Text extraction - Preserves headings, bold, italic, and basic formatting
Table conversion - Converts Word tables to Markdown table format
Image extraction - Extracts all embedded images to a specified directory
Structure preservation - Maintains document hierarchy and organization

Usage Instructions

Step 1: Setup Environment

First, ensure python-docx is available. If using a virtual environment:

source .venv-docx/bin/activate  # or create if needed

Step 2: Create Extraction Script

Create a Python script using the template below, or use the pre-made extract_docx.py script.

Step 3: Run Extraction

python extract_docx.py <input.docx> <image_output_dir> [text_output_file]

Example:

python extract_docx.py "My Document.docx" ./images ./extracted_content.md
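The command line above could be parsed along these lines. This is a minimal sketch using the standard argparse module; the function name parse_cli and the choice to leave text_output_file as None when it is omitted are assumptions for illustration, not details taken from extract_docx.py.

```python
import argparse


def parse_cli(argv=None):
    """Parse: <input.docx> <image_output_dir> [text_output_file]."""
    parser = argparse.ArgumentParser(prog="extract_docx.py")
    parser.add_argument("input_docx", help="path to the .docx file")
    parser.add_argument("image_output_dir", help="directory for extracted images")
    # Optional third positional argument; None when omitted (assumed behaviour).
    parser.add_argument("text_output_file", nargs="?", default=None,
                        help="Markdown output file")
    return parser.parse_args(argv)


# Passing an explicit list mirrors the example invocation above.
args = parse_cli(["My Document.docx", "./images", "./extracted_content.md"])
print(args.input_docx, args.image_output_dir, args.text_output_file)
```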

Step 4: Review and Refine

After extraction:

Review the extracted markdown file
Check that images were extracted correctly
Update image references if renaming is needed
Migrate content to target system (Obsidian, etc.)

Python Script Template

The extraction script includes these key functions:

def extract_images(doc, output_dir):
    """Extract all images from the document to specified directory"""
    # Extracts images with numbered filenames
    # Returns map of image IDs to filenames

def extract_full_content(doc):
    """Extract all content with Markdown formatting"""
    # Converts headings: # ## ### ####
    # Converts formatting: **bold** *italic*
    # Converts tables to Markdown format
    # Returns full document as string

def extract_table_as_markdown(table):
    """Convert Word table to Markdown table format"""
    # Creates properly formatted Markdown tables
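As an illustration of the table step, here is a minimal sketch of how rows of cell text might be rendered as a Markdown pipe table. The helper name rows_to_markdown is hypothetical, and treating the first row as the header is an assumption, since Word tables do not always have one.

```python
def rows_to_markdown(rows):
    """Render a list of rows (each a list of cell strings) as a Markdown pipe table.

    The first row is treated as the header (an assumption).
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Pipes and newlines would break the table layout, so escape or flatten them.
        return cell.replace("|", "\\|").replace("\n", " ").strip()

    def render(row):
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


# With python-docx, the rows could be collected like this:
#   rows = [[cell.text for cell in row.cells] for row in table.rows]
print(rows_to_markdown([["Name", "Role"], ["vm-web-01", "Web server"]]))
```

Keeping the rendering separate from python-docx makes it easy to try on plain lists before running it against a real document.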

Example Workflow for Obsidian Migration

When migrating a Word document to Obsidian:

Extract content and images:

python extract_docx.py handbook.docx /path/to/obsidian/vault/_assets extracted.md

Review extracted content:

Check heading hierarchy matches expectations
Verify table formatting is correct
Ensure all images were extracted

Rename images (if needed):

Give images meaningful names based on context
Update image references in the markdown file
Use Obsidian's ![[image.png]] syntax

Migrate content to Obsidian:

Copy sections to appropriate notes
Create wiki-links between notes
Verify image references work
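The rename and reference-update steps in this workflow could be automated roughly as follows. This sketch assumes the extracted Markdown uses standard image links of the form ![alt](path), which may not match what the script actually emits; the function name to_obsidian_embeds and the renames mapping are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Matches standard Markdown image links: ![alt text](path/to/file.png)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def to_obsidian_embeds(markdown, renames=None):
    """Rewrite ![alt](path) links as Obsidian ![[file]] embeds.

    `renames` maps old file names to new ones (hypothetical argument).
    """
    renames = renames or {}

    def replace(match):
        # Obsidian embeds are written here with the bare file name.
        filename = PurePosixPath(match.group(1)).name
        return "![[" + renames.get(filename, filename) + "]]"

    return IMAGE_LINK.sub(replace, markdown)


text = "See ![diagram](_assets/handbook_image_001.png) for details."
print(to_obsidian_embeds(text, {"handbook_image_001.png": "vm_naming_diagram.png"}))
```

The image files themselves still have to be renamed on disk to match the new names.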

Customization

Adjust Heading Levels

Modify the extract_full_content() function to change how headings are detected:

if 'Heading 1' in style_name:
    content.append(f'\n## {text}\n')  # Make H1 → H2
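A more general form of this adjustment is to shift every heading by a fixed offset. The sketch below assumes style names of the form "Heading N", as python-docx reports for Word's built-in heading styles via paragraph.style.name; the helper heading_to_markdown and its offset parameter are hypothetical.

```python
import re

HEADING_STYLE = re.compile(r"Heading (\d+)")


def heading_to_markdown(style_name, text, offset=0, max_level=6):
    """Return a Markdown heading line for a 'Heading N' style, else None."""
    match = HEADING_STYLE.search(style_name)
    if match is None:
        return None
    # Clamp to the range Markdown supports (# through ######).
    level = max(1, min(int(match.group(1)) + offset, max_level))
    return "#" * level + " " + text


print(heading_to_markdown("Heading 1", "Resource Naming", offset=1))
```

With offset=1 a Word "Heading 1" becomes a Markdown H2, which leaves room for a single H1 title per note.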

Custom Image Naming

Modify extract_images() to use custom naming:

# Instead of: handbook_image_001.png
# Use context-based names: vm_naming_diagram.png
image_filename = f'custom_name_{image_count}.{ext}'
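One way to derive context-based names is to slugify a nearby heading or caption. This is only a sketch: the helper image_name is hypothetical, and how the context string is obtained (for example, from the most recent heading) is left as an assumption.

```python
import re


def image_name(context, index, ext):
    """Build a file name such as vm_naming_diagram_001.png from a context string."""
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    slug = re.sub(r"[^a-z0-9]+", "_", context.lower()).strip("_")
    if not slug:
        slug = "image"  # fall back when the context has no usable characters
    return f"{slug}_{index:03d}.{ext}"


print(image_name("VM Naming: Diagram", 1, "png"))
```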

Filter Content

Add filtering logic to extract only specific sections:

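current_section = None  # initialise once, before looping over paragraphs

# Inside the paragraph loop: remember the most recent top-level heading
if 'Heading 1' in style_name:
    current_section = text

# Only include content from specific sections
if current_section in ['Resource Naming', 'Security']:
    content.append(text)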
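Put together as a loop, the filtering logic might look like the sketch below. It works on plain (style_name, text) pairs so it can be tried without a Word file; the function name filter_sections is hypothetical.

```python
def filter_sections(paragraphs, wanted):
    """Keep only paragraphs under the top-level headings named in `wanted`.

    `paragraphs` is an iterable of (style_name, text) pairs.
    """
    content = []
    current_section = None
    for style_name, text in paragraphs:
        # A new "Heading 1" starts a new section; lower-level headings do not.
        if "Heading 1" in style_name:
            current_section = text
        if current_section in wanted:
            content.append(text)
    return content


# With python-docx, the pairs could be built like this:
#   pairs = [(p.style.name, p.text) for p in doc.paragraphs]
sample = [
    ("Heading 1", "Overview"),
    ("Normal", "Skipped text"),
    ("Heading 1", "Security"),
    ("Normal", "Kept text"),
]
print(filter_sections(sample, {"Security"}))
```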

Troubleshooting

Issue: ModuleNotFoundError: No module named 'docx'

Solution: Activate the virtual environment and install python-docx (pip install python-docx)

Issue: Images not extracting

Solution: Ensure Word doc has embedded (not linked) images
Check that image output directory exists and is writable

Issue: Tables not formatting correctly

Solution: Complex merged cells may need manual adjustment
Review and fix table alignment in extracted markdown

Issue: Formatting lost

Solution: Word uses many complex styles; script preserves basic formatting only
Manually enhance critical formatting after extraction

Best Practices

Always review extracted content - Automated conversion isn't perfect
Backup original Word doc - Keep source material intact
Extract to temporary location first - Review before moving to final destination
Rename images meaningfully - Numbered images are hard to maintain
Test with small documents first - Validate workflow before large migrations

Integration with Other Tools

This skill works well with:

Obsidian - Wiki-style note-taking
GitHub/GitLab wikis - Documentation repositories
Static site generators - Hugo, Jekyll, MkDocs
Markdown editors - Typora, Mark Text, VS Code

Output Format

Text Content

Headings converted to #, ##, ###, ####
Bold text: **text**
Italic text: *text*
Tables: Markdown pipe format
Lists: Preserved as-is
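The bold and italic conversion can be sketched as a pure function over runs. The helper runs_to_markdown is hypothetical and is not necessarily how the script does it; in python-docx, run.bold and run.italic can be True, False, or None (inherited), so the sketch assumes they are coerced with bool() first.

```python
def runs_to_markdown(runs):
    """Join (text, bold, italic) tuples into a Markdown string."""
    parts = []
    for text, bold, italic in runs:
        stripped = text.strip()
        if not stripped:
            parts.append(text)  # whitespace-only runs pass through unchanged
            continue
        # Keep surrounding whitespace outside the markers so "**word** " stays valid.
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        if bold and italic:
            marker = "***"
        elif bold:
            marker = "**"
        elif italic:
            marker = "*"
        else:
            marker = ""
        parts.append(lead + marker + stripped + marker + trail)
    return "".join(parts)


# With python-docx, the tuples could be built like this:
#   runs = [(r.text, bool(r.bold), bool(r.italic)) for r in paragraph.runs]
print(runs_to_markdown([("Use ", False, False), ("strong", True, False),
                        (" and ", False, False), ("soft", False, True)]))
```

Adjacent runs with the same formatting are not merged here, so a bold phrase split across two runs would come out as two separately marked pieces and may need manual cleanup.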

Images

Extracted as separate files (PNG, JPG, etc.)
Numbered sequentially: handbook_image_001.png
Saved to specified output directory
Referenced in text (manual update may be needed)
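For reference, a .docx file is a ZIP archive, and embedded images are normally stored under word/media/ inside it. The sketch below extracts them with the standard library alone rather than through python-docx, so it is an alternative technique and not the script's own method; the function name extract_media and the prefix parameter are hypothetical.

```python
import os
import zipfile


def extract_media(docx_path, output_dir, prefix="image"):
    """Copy files under word/media/ out of a .docx, numbering them sequentially.

    Returns a dict mapping archive member names to the new file names.
    """
    os.makedirs(output_dir, exist_ok=True)  # create the output directory if missing
    mapping = {}
    with zipfile.ZipFile(docx_path) as archive:
        # Sorted by archive name, which is not necessarily document order.
        media = sorted(name for name in archive.namelist()
                       if name.startswith("word/media/") and not name.endswith("/"))
        for count, name in enumerate(media, start=1):
            ext = os.path.splitext(name)[1].lower() or ".bin"
            filename = f"{prefix}_{count:03d}{ext}"
            with open(os.path.join(output_dir, filename), "wb") as out:
                out.write(archive.read(name))
            mapping[name] = filename
    return mapping


# Example call (paths are placeholders):
#   extract_media("handbook.docx", "./images", prefix="handbook_image")
```

Because the numbering follows archive order, the references in the extracted text still need the manual verification noted above.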

Limitations

Does not preserve complex Word formatting (colors, fonts, advanced styles)
Merged table cells may not convert perfectly
Comments and tracked changes are not extracted
Embedded objects (Excel sheets, etc.) are not extracted
Page layout information (margins, headers/footers) is lost
Images are extracted but references need manual verification

Future Enhancements

Consider extending this skill to:

Auto-detect image context for better naming
Preserve more complex formatting
Handle embedded Excel tables
Extract comments and tracked changes
Generate table of contents automatically
Split large documents into multiple files based on headings
