| name | docx-to-markdown |
| description | Extract content and images from Microsoft Word (.docx) documents and convert to Markdown format. Use this when migrating Word documents to Markdown-based systems like Obsidian, converting documentation, or extracting structured content from .docx files. Handles text formatting, tables, headings, and image extraction. |
Word Document to Markdown Conversion Skill
This skill helps extract and convert Microsoft Word (.docx) documents to Markdown format, including extracting embedded images.
When to Use This Skill
Use this skill when you need to:
- Migrate Word documents to Markdown-based note systems (Obsidian, Notion, etc.)
- Extract content from .docx files for documentation
- Convert structured Word documents to plain text with formatting preserved
- Extract images embedded in Word documents
- Analyze Word document structure (headings, tables, formatting)
Prerequisites
The skill requires the python-docx library. If it is not already installed, create a virtual environment and install it:
python3 -m venv .venv-docx
source .venv-docx/bin/activate
pip install python-docx
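Before running the extraction, it can help to confirm that python-docx is importable from the active interpreter. The check below is a minimal sketch that uses only the standard library; the helper name `has_module` is hypothetical and not part of the skill's script. Note that python-docx is imported under the module name `docx`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("docx"):
        print("python-docx is available")
    else:
        print("python-docx is missing; run: pip install python-docx")
```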
Core Functionality
The skill provides:
- Text extraction - Preserves headings, bold, italic, and basic formatting
- Table conversion - Converts Word tables to Markdown table format
- Image extraction - Extracts all embedded images to a specified directory
- Structure preservation - Maintains document hierarchy and organization
Usage Instructions
Step 1: Setup Environment
First, ensure python-docx is available. If using a virtual environment:
source .venv-docx/bin/activate # or create if needed
Step 2: Create Extraction Script
Create a Python script using the template below, or use the pre-made extract_docx.py script.
Step 3: Run Extraction
python extract_docx.py <input.docx> <image_output_dir> [text_output_file]
Example:
python extract_docx.py "My Document.docx" ./images ./extracted_content.md
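The command-line shape shown above (two required arguments and one optional one) could be parsed roughly as follows. This is a sketch only: the argument names and the `build_parser` helper are assumptions, and the real extract_docx.py may handle its arguments differently, in particular when the output file is omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching: <input.docx> <image_output_dir> [text_output_file]."""
    parser = argparse.ArgumentParser(
        prog="extract_docx.py",
        description="Convert a .docx file to Markdown and extract its images.",
    )
    parser.add_argument("input_docx", help="path to the source .docx file")
    parser.add_argument("image_output_dir", help="directory for extracted images")
    parser.add_argument(
        "text_output_file",
        nargs="?",       # zero or one value, which makes the argument optional
        default=None,    # assumption: None means no output file was given
        help="optional Markdown output file",
    )
    return parser


# Parse the example invocation from this section instead of sys.argv.
args = build_parser().parse_args(
    ["My Document.docx", "./images", "./extracted_content.md"]
)
print(args.input_docx, args.image_output_dir, args.text_output_file)
```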
Step 4: Review and Refine
After extraction:
- Review the extracted markdown file
- Check that images were extracted correctly
- Update image references if renaming is needed
- Migrate content to target system (Obsidian, etc.)
Python Script Template
The extraction script includes these key functions:
def extract_images(doc, output_dir):
    """Extract all images from the document to specified directory"""
    # Extracts images with numbered filenames
    # Returns map of image IDs to filenames
def extract_full_content(doc):
    """Extract all content with Markdown formatting"""
    # Converts headings: # ## ### ####
    # Converts formatting: **bold** *italic*
    # Converts tables to Markdown format
    # Returns full document as string
def extract_table_as_markdown(table):
    """Convert Word table to Markdown table format"""
    # Creates properly formatted Markdown tables
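To make the table step concrete, here is a minimal sketch of rendering rows of cell text as a Markdown pipe table. It is not the skill's actual `extract_table_as_markdown`; the helper name `rows_to_markdown` and the sample rows are hypothetical. With python-docx, the rows could be built as `[[cell.text for cell in row.cells] for row in table.rows]`.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows of cell text as a Markdown pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: str) -> str:
        # Pipes and newlines would break the table syntax, so neutralise them.
        return cell.replace("|", "\\|").replace("\n", " ").strip()

    def render(row: List[str]) -> str:
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (width - len(row))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


print(rows_to_markdown([["Name", "Value"], ["alpha", "1"], ["beta", "2"]]))
```

Treating the first row as the header is a design choice made here because Markdown pipe tables require a header row; Word tables do not always have one, so the result may need manual adjustment.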
Example Workflow for Obsidian Migration
When migrating a Word document to Obsidian:
- Extract content and images:
  python extract_docx.py handbook.docx /path/to/obsidian/vault/_assets extracted.md
- Review extracted content:
  - Check heading hierarchy matches expectations
  - Verify table formatting is correct
  - Ensure all images were extracted
- Rename images (if needed):
  - Give images meaningful names based on context
  - Update image references in the markdown file
  - Use Obsidian's ![[image.png]] syntax
- Migrate content to Obsidian:
  - Copy sections to appropriate notes
  - Create wiki-links between notes
  - Verify image references work
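The rename-and-update step in this workflow can be partly automated. The sketch below assumes the extracted file contains standard Markdown image links of the form `![alt](path/name.png)`; the source does not specify which reference format the script emits, so treat that as an assumption. The helper name `to_obsidian_embeds` and the rename map are hypothetical.

```python
import re
from typing import Dict

# Matches ![alt](target) where target has no spaces or closing parenthesis.
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def to_obsidian_embeds(markdown: str, renames: Dict[str, str]) -> str:
    """Replace ![alt](path) links with ![[name]] embeds, applying optional renames."""

    def swap(match):
        # Keep only the file name; Obsidian resolves embeds by name within the vault.
        filename = match.group(1).rsplit("/", 1)[-1]
        return "![[" + renames.get(filename, filename) + "]]"

    return IMAGE_LINK.sub(swap, markdown)


text = "See ![diagram](images/handbook_image_001.png) for details."
print(to_obsidian_embeds(text, {"handbook_image_001.png": "vm_naming_diagram.png"}))
```

The rename map only rewrites references; the image files themselves still need to be renamed on disk to match.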
Customization
Adjust Heading Levels
Modify the extract_full_content() function to change how headings are detected:
if 'Heading 1' in style_name:
    content.append(f'\n## {text}\n') # Make H1 → H2
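Rather than editing one branch per heading level, the shift can be expressed as a single offset. The following is a sketch under the assumption that heading styles are named "Heading N", as in the snippet above; the helper `heading_prefix` and its parameters are hypothetical and not part of the skill's script.

```python
import re
from typing import Optional

HEADING_STYLE = re.compile(r"Heading (\d+)")


def heading_prefix(style_name: str, offset: int = 0, max_level: int = 6) -> Optional[str]:
    """Return the Markdown '#' prefix for a 'Heading N' style, or None for other styles."""
    match = HEADING_STYLE.search(style_name)
    if match is None:
        return None
    # Shift the level by `offset` and clamp it to the range 1..max_level.
    level = min(max(int(match.group(1)) + offset, 1), max_level)
    return "#" * level


print(heading_prefix("Heading 1", offset=1), "Overview")
```

With `offset=1`, every Word heading is demoted by one level, which leaves room for a note title as the only H1.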
Custom Image Naming
Modify extract_images() to use custom naming:
# Instead of: handbook_image_001.png
# Use context-based names: vm_naming_diagram.png
image_filename = f'custom_name_{image_count}.{ext}'
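One way to derive context-based names is to turn a short label (for example, the nearest heading or caption) into a filesystem-friendly slug. This is a sketch only; the helper `image_filename_for` and the idea of passing in a label are assumptions, since the source does not say where the context would come from.

```python
import re


def image_filename_for(label: str, index: int, ext: str) -> str:
    """Build a name like 'vm_naming_diagram_001.png' from a free-text label."""
    # Lowercase the label and collapse anything that is not a-z or 0-9 into '_'.
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    # Fall back to a generic stem when the label has no usable characters.
    return f"{slug or 'image'}_{index:03d}.{ext.lstrip('.')}"


print(image_filename_for("VM Naming Diagram", 1, "png"))
```

Keeping the numeric suffix avoids collisions when two images share the same label.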
Filter Content
Add filtering logic to extract only specific sections:
current_section = None
if 'Heading 1' in style_name:
    current_section = text
# Only include content from specific sections
if current_section in ['Resource Naming', 'Security']:
    content.append(text)
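The fragment above leaves the surrounding loop implicit. A self-contained sketch of the same idea is shown below, operating on `(style_name, text)` pairs so that it runs without a Word file; with python-docx, such pairs could be produced as `(p.style.name, p.text) for p in doc.paragraphs`. The function name `filter_sections` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def filter_sections(paragraphs: Iterable[Tuple[str, str]], wanted: Set[str]) -> List[str]:
    """Keep text that falls under one of the wanted 'Heading 1' sections."""
    current_section = None  # set once, before the loop, so it persists across paragraphs
    kept = []
    for style_name, text in paragraphs:
        if "Heading 1" in style_name:
            current_section = text
        # The heading itself and everything until the next Heading 1 is kept.
        if current_section in wanted:
            kept.append(text)
    return kept


sample = [
    ("Heading 1", "Introduction"),
    ("Normal", "Welcome."),
    ("Heading 1", "Security"),
    ("Normal", "Rotate keys regularly."),
]
print(filter_sections(sample, {"Resource Naming", "Security"}))
```

The important detail is that `current_section` is initialised outside the loop; resetting it on every paragraph would drop all body text.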
Troubleshooting
Issue: ModuleNotFoundError: No module named 'docx'
- Solution: Activate the virtual environment and install python-docx (the package is imported as docx)
Issue: Images not extracting
- Solution: Ensure Word doc has embedded (not linked) images
- Check that image output directory exists and is writable
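A quick way to tell whether a document contains embedded images at all is to look inside the file itself: a .docx is a ZIP package, and Word stores embedded media under `word/media/`. Linked images have no file there. The diagnostic below is a standard-library sketch and is independent of the skill's script; the function name `list_embedded_media` is hypothetical.

```python
import zipfile
from typing import List


def list_embedded_media(docx_path) -> List[str]:
    """Return the archive paths of media files stored inside a .docx package."""
    with zipfile.ZipFile(docx_path) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


# Usage (path is a placeholder):
# print(list_embedded_media("handbook.docx"))
```

An empty list suggests the images are linked rather than embedded, or that the document has none.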
Issue: Tables not formatting correctly
- Solution: Complex merged cells may need manual adjustment
- Review and fix table alignment in extracted markdown
Issue: Formatting lost
- Solution: Word uses many complex styles; script preserves basic formatting only
- Manually enhance critical formatting after extraction
Best Practices
- Always review extracted content - Automated conversion isn't perfect
- Backup original Word doc - Keep source material intact
- Extract to temporary location first - Review before moving to final destination
- Rename images meaningfully - Numbered images are hard to maintain
- Test with small documents first - Validate workflow before large migrations
Integration with Other Tools
This skill works well with:
- Obsidian - Wiki-style note-taking
- GitHub/GitLab wikis - Documentation repositories
- Static site generators - Hugo, Jekyll, MkDocs
- Markdown editors - Typora, Mark Text, VS Code
Output Format
Text Content
- Headings converted to #, ##, ###, ####
- Bold text: **text**
- Italic text: *text*
- Tables: Markdown pipe format
- Lists: Preserved as-is
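The bold and italic mapping listed above can be illustrated with a small sketch that works on `(text, bold, italic)` tuples. With python-docx, such tuples could be built as `(run.text, bool(run.bold), bool(run.italic)) for run in paragraph.runs`. The helper name `runs_to_markdown` is hypothetical, and the skill's script may handle edge cases differently.

```python
from typing import Iterable, Tuple


def runs_to_markdown(runs: Iterable[Tuple[str, bool, bool]]) -> str:
    """Join (text, bold, italic) runs into one Markdown string."""
    parts = []
    for text, bold, italic in runs:
        core = text.strip()
        if not core:
            parts.append(text)  # leave whitespace-only runs unwrapped
            continue
        marker = "***" if bold and italic else "**" if bold else "*" if italic else ""
        # Keep surrounding whitespace outside the markers; '**word **' is not valid emphasis.
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        parts.append(f"{lead}{marker}{core}{marker}{trail}")
    return "".join(parts)


print(runs_to_markdown([("Use ", False, False), ("clear ", True, False), ("names", False, True)]))
```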
Images
- Extracted as separate files (PNG, JPG, etc.)
- Numbered sequentially: handbook_image_001.png
- Saved to specified output directory
- Referenced in text (manual update may be needed)
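The numbered-file behaviour described above can be sketched with the standard library alone, because a .docx is a ZIP package with embedded media under `word/media/`. This is an alternative illustration, not the skill's `extract_images` (which works on a python-docx document object); the function name `extract_media` and its `prefix` parameter are hypothetical.

```python
import os
import zipfile
from typing import Dict


def extract_media(docx_path, output_dir: str, prefix: str = "image") -> Dict[str, str]:
    """Copy files under word/media/ into output_dir using numbered names.

    Returns a map from the path inside the archive to the new file name.
    """
    os.makedirs(output_dir, exist_ok=True)
    mapping = {}
    with zipfile.ZipFile(docx_path) as package:
        names = sorted(
            n for n in package.namelist()
            if n.startswith("word/media/") and not n.endswith("/")
        )
        for count, name in enumerate(names, start=1):
            ext = os.path.splitext(name)[1].lstrip(".") or "bin"
            filename = f"{prefix}_{count:03d}.{ext}"
            with open(os.path.join(output_dir, filename), "wb") as out:
                out.write(package.read(name))
            mapping[name] = filename
    return mapping


# Usage (paths are placeholders):
# extract_media("handbook.docx", "./images", prefix="handbook_image")
```

The numbering here follows the sorted archive names, which is not necessarily the order in which images appear in the document, so references still need manual verification.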
Limitations
- Does not preserve complex Word formatting (colors, fonts, advanced styles)
- Merged table cells may not convert perfectly
- Comments and tracked changes are not extracted
- Embedded objects (Excel sheets, etc.) are not extracted
- Page layout information (margins, headers/footers) is lost
- Images are extracted but references need manual verification
Future Enhancements
Consider extending this skill to:
- Auto-detect image context for better naming
- Preserve more complex formatting
- Handle embedded Excel tables
- Extract comments and tracked changes
- Generate table of contents automatically
- Split large documents into multiple files based on headings
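As a starting point for the last idea, splitting the extracted Markdown at top-level headings needs no Word-specific code. The sketch below is one possible approach, not an existing feature of the skill; the function name `split_by_top_heading` and the "preamble" key for text before the first heading are assumptions.

```python
from typing import Dict, List


def split_by_top_heading(markdown: str) -> Dict[str, str]:
    """Split Markdown into {title: body} chunks at each top-level '# ' heading."""
    sections: Dict[str, List[str]] = {}
    title = "preamble"  # holds any text that appears before the first heading
    for line in markdown.splitlines():
        if line.startswith("# "):
            title = line[2:].strip()
            sections.setdefault(title, [])
            continue
        sections.setdefault(title, []).append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


doc = "# Naming\nUse prefixes.\n## Details\nMore.\n# Security\nRotate keys."
for name, body in split_by_top_heading(doc).items():
    print(name, "->", len(body), "characters")
```

This simple version treats any line starting with `# ` as a heading, including one inside a code block, and merges sections that share a title, so the output should be reviewed before writing files.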