Claude Code Plugins

Community-maintained marketplace

docx-to-markdown

@dsswift/claude

Extract content and images from Microsoft Word (.docx) documents and convert to Markdown format. Use this when migrating Word documents to Markdown-based systems like Obsidian, converting documentation, or extracting structured content from .docx files. Handles text formatting, tables, headings, and image extraction.

Install Skill

  1. Download the skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Review the skill's instructions before using it.

SKILL.md

name: docx-to-markdown
description: Extract content and images from Microsoft Word (.docx) documents and convert to Markdown format. Use this when migrating Word documents to Markdown-based systems like Obsidian, converting documentation, or extracting structured content from .docx files. Handles text formatting, tables, headings, and image extraction.

Word Document to Markdown Conversion Skill

This skill helps extract and convert Microsoft Word (.docx) documents to Markdown format, including extracting embedded images.

When to Use This Skill

Use this skill when you need to:

  • Migrate Word documents to Markdown-based note systems (Obsidian, Notion, etc.)
  • Extract content from .docx files for documentation
  • Convert structured Word documents to plain text with formatting preserved
  • Extract images embedded in Word documents
  • Analyze Word document structure (headings, tables, formatting)

Prerequisites

The skill requires the python-docx library. If it is not already installed, create a virtual environment and install it there:

python3 -m venv .venv-docx
source .venv-docx/bin/activate
pip install python-docx

Core Functionality

The skill provides:

  1. Text extraction - Preserves headings, bold, italic, and basic formatting
  2. Table conversion - Converts Word tables to Markdown table format
  3. Image extraction - Extracts all embedded images to a specified directory
  4. Structure preservation - Maintains document hierarchy and organization

Usage Instructions

Step 1: Setup Environment

First, ensure python-docx is available. If you are using a virtual environment, activate it:

source .venv-docx/bin/activate  # or create if needed

Step 2: Create Extraction Script

Create a Python script using the template below, or use the pre-made extract_docx.py script.

Step 3: Run Extraction

python extract_docx.py <input.docx> <image_output_dir> [text_output_file]

Example:

python extract_docx.py "My Document.docx" ./images ./extracted_content.md
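The command-line shape above (two required arguments, one optional) could be parsed as in the following sketch. This is an assumption about how such a script might read its arguments, not the actual contents of extract_docx.py; the function name `parse_args` and the argument names are hypothetical. What the script does when the optional output file is omitted is not stated here, so the sketch simply leaves it as `None`.

```python
import argparse


def parse_args(argv=None):
    """Parse: <input.docx> <image_output_dir> [text_output_file]."""
    parser = argparse.ArgumentParser(
        prog="extract_docx.py",
        description="Extract text and images from a .docx file as Markdown.",
    )
    parser.add_argument("input_docx", help="path to the Word document")
    parser.add_argument("image_output_dir", help="directory for extracted images")
    # Optional third positional argument; None when it is not supplied.
    parser.add_argument("text_output_file", nargs="?", default=None,
                        help="optional path for the Markdown output")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["My Document.docx", "./images", "./extracted_content.md"])
    print(args.input_docx, args.image_output_dir, args.text_output_file)
```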

Step 4: Review and Refine

After extraction:

  1. Review the extracted markdown file
  2. Check that images were extracted correctly
  3. Update image references if renaming is needed
  4. Migrate content to target system (Obsidian, etc.)

Python Script Template

The extraction script includes these key functions:

def extract_images(doc, output_dir):
    """Extract all images from the document to specified directory"""
    # Extracts images with numbered filenames
    # Returns map of image IDs to filenames

def extract_full_content(doc):
    """Extract all content with Markdown formatting"""
    # Converts headings: # ## ### ####
    # Converts formatting: **bold** *italic*
    # Converts tables to Markdown format
    # Returns full document as string

def extract_table_as_markdown(table):
    """Convert Word table to Markdown table format"""
    # Creates properly formatted Markdown tables

Example Workflow for Obsidian Migration

When migrating a Word document to Obsidian:

  1. Extract content and images:

python extract_docx.py handbook.docx /path/to/obsidian/vault/_assets extracted.md

  2. Review extracted content:
     • Check heading hierarchy matches expectations
     • Verify table formatting is correct
     • Ensure all images were extracted
  3. Rename images (if needed):
     • Give images meaningful names based on context
     • Update image references in the markdown file
     • Use Obsidian's ![[image.png]] syntax
  4. Migrate content to Obsidian:
     • Copy sections to appropriate notes
     • Create wiki-links between notes
     • Verify image references work
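The renaming and reference-update steps above can be partly automated. The sketch below assumes the extracted file references images with standard Markdown syntax, `![alt](path/to/file.png)`; the exact form the script emits is not documented here, so check the output first. The names `to_obsidian_embeds` and `renames` are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Matches standard Markdown image links: ![alt text](path/to/file.ext)
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def to_obsidian_embeds(markdown, renames=None):
    """Rewrite Markdown image links as Obsidian ![[file]] embeds.

    `renames` optionally maps old filenames to new, more meaningful ones.
    """
    renames = renames or {}

    def replace(match):
        filename = PurePosixPath(match.group(1)).name  # drop any directory part
        return f"![[{renames.get(filename, filename)}]]"

    return IMAGE_LINK.sub(replace, markdown)


if __name__ == "__main__":
    text = "See ![diagram](_assets/handbook_image_001.png) for naming rules."
    print(to_obsidian_embeds(text, {"handbook_image_001.png": "vm_naming_diagram.png"}))
    # prints: See ![[vm_naming_diagram.png]] for naming rules.
```

The rename map only changes the references. The image files themselves still have to be renamed on disk to match.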

Customization

Adjust Heading Levels

Modify the extract_full_content() function to change how headings are detected:

if 'Heading 1' in style_name:
    content.append(f'\n## {text}\n')  # Make H1 → H2
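For context, a paragraph-level converter with an adjustable heading shift might look like the sketch below. It is not the skill's actual `extract_full_content()`. The names `paragraph_to_markdown`, `run_to_markdown`, and `heading_offset` are hypothetical. The code relies only on `paragraph.style.name`, `paragraph.text`, `paragraph.runs`, and each run's `text`, `bold`, and `italic`, all of which python-docx provides. The demo uses `SimpleNamespace` stand-ins so it runs without a document.

```python
import re
from types import SimpleNamespace


def run_to_markdown(run):
    """Wrap a run's text in Markdown emphasis based on its bold/italic flags."""
    text = run.text
    if not text.strip():
        return text  # leave whitespace-only runs unwrapped
    if run.bold and run.italic:
        return f"***{text}***"
    if run.bold:
        return f"**{text}**"
    if run.italic:
        return f"*{text}*"
    return text


def paragraph_to_markdown(paragraph, heading_offset=0):
    """Convert one paragraph; heading_offset=1 turns 'Heading 1' into '##'."""
    match = re.match(r"Heading (\d+)", paragraph.style.name)
    if match:
        # Clamp to the 1..6 range that Markdown headings support.
        level = max(1, min(int(match.group(1)) + heading_offset, 6))
        return f"{'#' * level} {paragraph.text.strip()}"
    return "".join(run_to_markdown(run) for run in paragraph.runs)


if __name__ == "__main__":
    heading = SimpleNamespace(style=SimpleNamespace(name="Heading 1"),
                              text="Resource Naming", runs=[])
    print(paragraph_to_markdown(heading, heading_offset=1))
```

Runs whose bold text ends in a space will produce markers such as `**word **`, which some Markdown renderers do not treat as emphasis, so spot-check emphasized text after conversion.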

Custom Image Naming

Modify extract_images() to use custom naming:

# Instead of: handbook_image_001.png
# Use context-based names: vm_naming_diagram.png
image_filename = f'custom_name_{image_count}.{ext}'
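To show where a naming scheme like this plugs in, here is a standard-library sketch of image extraction. It swaps in a different technique from the skill's python-docx based `extract_images()`: a .docx file is a ZIP archive, and embedded images are normally stored under `word/media/`, so the sketch reads them directly with `zipfile`. The names `extract_media` and `prefix` are hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, output_dir, prefix="image"):
    """Copy files under word/media/ to output_dir with numbered names.

    Returns a dict mapping archive member names to the new filenames.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    mapping = {}
    with zipfile.ZipFile(docx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for count, name in enumerate(media, start=1):
            ext = PurePosixPath(name).suffix.lstrip(".") or "bin"
            filename = f"{prefix}_{count:03d}.{ext}"
            (out / filename).write_bytes(archive.read(name))
            mapping[name] = filename
    return mapping


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        for source, target in extract_media(sys.argv[1], sys.argv[2]).items():
            print(f"{source} -> {target}")
```

Numbering here follows the sorted archive names, not the order in which images appear in the document, so context-based renaming remains a manual step.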

Filter Content

Add filtering logic to extract only specific sections:

current_section = None
if 'Heading 1' in style_name:
    current_section = text

# Only include content from specific sections
if current_section in ['Resource Naming', 'Security']:
    content.append(text)

Troubleshooting

Issue: ModuleNotFoundError: No module named 'docx'

  • Solution: Activate virtual environment and install python-docx
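A quick way to confirm which interpreter is running and whether it can see the library is sketched below. The function name `python_docx_status` is hypothetical. Note that python-docx is imported as `docx`, which is why the error message names that module.

```python
import importlib.util
import sys


def python_docx_status():
    """Report whether the 'docx' module is importable from this interpreter."""
    return {
        "installed": importlib.util.find_spec("docx") is not None,
        # Inside a venv, sys.prefix differs from the base installation prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    for key, value in python_docx_status().items():
        print(f"{key}: {value}")
```

If `installed` is false while `in_virtualenv` is also false, the virtual environment is probably not activated in the current shell.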

Issue: Images not extracting

  • Solution: Ensure the Word doc has embedded (not linked) images
  • Check that image output directory exists and is writable

Issue: Tables not formatting correctly

  • Solution: Complex merged cells may need manual adjustment
  • Review and fix table alignment in extracted markdown

Issue: Formatting lost

  • Solution: Word uses many complex styles; script preserves basic formatting only
  • Manually enhance critical formatting after extraction

Best Practices

  1. Always review extracted content - Automated conversion isn't perfect
  2. Back up the original Word doc - Keep source material intact
  3. Extract to temporary location first - Review before moving to final destination
  4. Rename images meaningfully - Numbered images are hard to maintain
  5. Test with small documents first - Validate workflow before large migrations

Integration with Other Tools

This skill works well with:

  • Obsidian - Wiki-style note-taking
  • GitHub/GitLab wikis - Documentation repositories
  • Static site generators - Hugo, Jekyll, MkDocs
  • Markdown editors - Typora, Mark Text, VS Code

Output Format

Text Content

  • Headings converted to #, ##, ###, ####
  • Bold text: **text**
  • Italic text: *text*
  • Tables: Markdown pipe format
  • Lists: Preserved as-is

Images

  • Extracted as separate files (PNG, JPG, etc.)
  • Numbered sequentially: handbook_image_001.png
  • Saved to specified output directory
  • Referenced in text (manual update may be needed)

Limitations

  • Does not preserve complex Word formatting (colors, fonts, advanced styles)
  • Merged table cells may not convert perfectly
  • Comments and tracked changes are not extracted
  • Embedded objects (Excel sheets, etc.) are not extracted
  • Page layout information (margins, headers/footers) is lost
  • Images are extracted but references need manual verification

Future Enhancements

Consider extending this skill to:

  • Auto-detect image context for better naming
  • Preserve more complex formatting
  • Handle embedded Excel tables
  • Extract comments and tracked changes
  • Generate table of contents automatically
  • Split large documents into multiple files based on headings