| name | pandoc-pdf-generation |
| description | Use when generating PDFs from markdown with Pandoc - covers differences from Python-Markdown, blank line rules, fix scripts for labels/anchors/metadata, and visual testing workflow |
Pandoc PDF Generation Best Practices
Overview
This skill documents lessons learned from generating PDF documents from markdown using Pandoc, drawing from experiences with MkDocs HTML generation and applying systematic validation approaches.
Critical Differences: Pandoc vs Python-Markdown
Supported Features
| Feature | Python-Markdown (MkDocs) | Pandoc (PDF) |
|---|---|---|
| Roman numerals (i., ii.) | ❌ Not supported | ✅ Supported |
| Grid tables | ⚠️ Needs extension | ✅ Native support |
| LaTeX commands (\pagebreak) | ❌ Renders as text | ✅ Native support |
| Nested list indent | 4 spaces (strict) | More flexible |
| Footnotes continuation | 4-space indent required | More flexible |
Key Insight: Pandoc is MORE capable than Python-Markdown, but this means markdown that works for PDF might break in MkDocs!
Cross-Renderer Compatibility ✅
Good News: Some formatting rules work consistently across both renderers!
Blank Line Rules (Universal):
- ✅ Blank line after bold labels before lists - Works in both MkDocs and Pandoc
- ✅ Blank line after plain text labels before lists - Works in both MkDocs and Pandoc
- ✅ Blank line after HTML anchors before headers - Works in both MkDocs and Pandoc
- ✅ Blank lines between consecutive metadata fields - Works in both MkDocs and Pandoc
Validation Method:
# Generate both outputs
mkdocs build --clean
./scripts/generate-pdf.sh
# Check MkDocs HTML rendering
grep -A 5 "For complete details, see:" site/soc2-type1/index.html
# Should show: <ul><li>...</li></ul>
# Check Pandoc PDF rendering
pdftotext output/Documentation.pdf - | grep -A 5 "For complete details, see:"
# Should show: • Bullet point
Implication: Fix markdown once, works for both HTML and PDF! This makes maintaining shared source files much easier.
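The grep checks above can also be scripted. The following is a minimal sketch, assuming the PDF text has already been extracted (for example with pdftotext) and passed in as a string; the function name and the regular expression are illustrative assumptions, and the heuristic will miss some cases and flag some false positives.

```python
import re

# Heuristic: a colon followed by two or more " - item" runs on ONE line
# suggests a list that was collapsed into a paragraph.
COLLAPSED_LIST = re.compile(r":\s+-\s+\S.*\s-\s+\S")


def find_collapsed_lists(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like inline-rendered lists."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if COLLAPSED_LIST.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "Technology Changes: - New system implementations - Software upgrades\n"
        "• Termination: Immediate revocation\n"
    )
    for number, line in find_collapsed_lists(sample):
        print(f"line {number}: possible collapsed list: {line}")
```

A hit is only a hint about where to look; the visual inspection described later is still needed.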
Shared Markdown Source Strategy
The Challenge
When the same markdown files feed both MkDocs (HTML) and Pandoc (PDF), there are four ways to handle the differences:
Option 1: Optimize for MkDocs (Current Approach)
- ✅ Clean HTML rendering
- ⚠️ PDF might have issues
- 3-space indents, no \pagebreak, etc.
Option 2: Optimize for Pandoc
- ✅ Perfect PDF output
- ❌ MkDocs rendering breaks
Option 3: Separate Sources (Best for large projects)
- Maintain docs/ for MkDocs
- Maintain pdf-source/ for PDF
- Use scripts to sync common content
Option 4: Conditional Formatting (Advanced)
- Use Pandoc filters to handle differences
- Use MkDocs plugins for HTML-specific needs
- Keep single source, transform during build
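As one illustration of "transform during build", a small preprocessing step could drop PDF-only LaTeX commands before the files reach MkDocs, since Python-Markdown renders them as text. This is a sketch under assumptions: the function name and the list of commands are hypothetical, and it does not skip fenced code blocks.

```python
import re

# Lines consisting solely of one of these LaTeX commands are dropped.
# The command list is an assumption; extend it to match your sources.
LATEX_ONLY_LINE = re.compile(r"^\s*\\(pagebreak|newpage)\s*$")


def strip_latex_for_html(markdown_text: str) -> str:
    """Remove standalone LaTeX page-break lines from markdown text."""
    kept = [
        line
        for line in markdown_text.split("\n")
        if not LATEX_ONLY_LINE.match(line)
    ]
    return "\n".join(kept)


if __name__ == "__main__":
    source = "Intro\n\n\\pagebreak\n\n## Next Section\n"
    print(strip_latex_for_html(source))
```

The PDF build would keep reading the untouched source, so the page breaks still apply there.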
PDF Generation Testing Workflow
Phase 1: Generate PDF (2 minutes)
./scripts/generate-pdf.sh
Check for errors:
- LaTeX errors (process exits non-zero)
- Missing file errors
- Font warnings (informational, not critical)
Phase 2: Visual Inspection (10 minutes)
CRITICAL: Actually open and read the PDF!
open output/Documentation.pdf
Checklist:
- Cover page renders correctly
- TOC is accurate and complete
- All section headers are styled as headers (not plain text or literal ##)
- Bullet lists render as bullets (not inline text with dashes)
- Numbered lists render correctly (not inline text)
- Bold labels before lists have proper spacing
- Plain text labels before lists have proper spacing
- Metadata fields appear on separate lines (not run together)
- Tables fit on pages (no overflow)
- Code blocks are formatted correctly
- Page breaks are reasonable (not mid-paragraph)
- Footnotes work (if applicable)
- No missing content
- Font rendering acceptable
- Total page count reasonable
Phase 3: Specific Checks (5 minutes)
Check the specific sections the user mentioned.
For example, if the user says "these should be bullet points":
- Find the section in the PDF
- Compare it to the markdown source
- Verify the markdown has proper bullets:
    **Access Removal:**

    - Item one
    - Item two
- Check that the PDF rendering matches the markdown intent
Phase 4: Commit (only if passes)
git add output/Documentation.pdf
git commit -m "docs: regenerate PDF with [specific improvements]"
Common PDF Issues and Solutions
Issue 1: Headers Render as Plain Text
Symptom: Text that should be headers (H2, H3) appears as regular paragraphs in PDF.
Root Cause: The line is missing its ## prefix, or it is not separated from the surrounding text by blank lines.
Check markdown:
# ✅ CORRECT - Header
## User Identification and Authentication
# ❌ WRONG - Plain text
User Identification and Authentication
Solution: Ensure headers have a ## prefix and a blank line before and after.
Issue 2: Bullets Render as Plain Text
Symptom: Text shows dashes/bullets as characters, not formatted lists.
Root Cause:
- Missing blank line before list
- Incorrect indentation
- Markdown not recognized as list
Check markdown:
# ✅ CORRECT
**Access Removal:**

- Termination: Immediate revocation
- Role change: Adjusted within 5 days

# ❌ WRONG - No blank line
**Access Removal:**
- Termination: Immediate revocation
Solution:
- Add a blank line before the list
- Verify proper indentation (0 spaces for root-level)
- Use consistent markers (- or *)
Issue 3: Font Warnings for Unicode Characters
Symptom:
[WARNING] Missing character: There is no ├ (U+251C) in font [lmmono10-regular]
Root Cause: Default LaTeX font doesn't support all Unicode characters (box-drawing, emojis, etc.)
Solutions:
Option 1: Change Font
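# In pandoc command
--pdf-engine=xelatex
--variable mainfont="DejaVu Sans"
# The warning above names a monospace font, so set monofont as well
--variable monofont="DejaVu Sans Mono"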
Option 2: Remove Special Characters
# Replace tree diagrams with ASCII (BSD/macOS sed syntax; with GNU sed, use -i without the '' argument)
sed -i '' 's/├/+/g' file.md
sed -i '' 's/─/-/g' file.md
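The same replacement can be done portably in Python, which avoids the sed differences between platforms. This is a minimal sketch: the mappings for ├ and ─ follow the sed commands above, while the entries for └ and │ and the function name are assumptions added for illustration.

```python
# Mapping from box-drawing characters to ASCII stand-ins.
# "├" and "─" mirror the sed commands; "└" and "│" are assumed extras.
BOX_TO_ASCII = str.maketrans({
    "├": "+",
    "└": "+",
    "│": "|",
    "─": "-",
})


def asciify_tree(text: str) -> str:
    """Replace box-drawing characters so the default LaTeX font can render them."""
    return text.translate(BOX_TO_ASCII)


if __name__ == "__main__":
    tree = "├── docs\n│   └── index.md\n"
    print(asciify_tree(tree))
```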
Option 3: Accept Warnings
- If characters are cosmetic (tree diagrams)
- If they don't affect content comprehension
- Document as "known limitation"
Issue 4: Tables Don't Fit on Page
Symptom: Tables overflow page width, text cut off.
Solutions:
Option 1: Rotate Table (Landscape)
\begin{landscape}
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data | Data | Data |
\end{landscape}
Note: the landscape environment comes from the pdflscape (or lscape) LaTeX package, which must be loaded, for example via header-includes. Pandoc passes the contents of a raw \begin{...} ... \end{...} block through as LaTeX, so a pipe table placed inside may not be converted. If that happens, define wrapper macros for the two commands in the header and call those instead.
Option 2: Smaller Font in Table
\small
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data | Data | Data |
\normalsize
Option 3: Redesign Table
- Split into multiple tables
- Use abbreviations
- Rotate headers vertically
Issue 5: Bad Page Breaks
Symptom: Headers at bottom of page, orphaned content.
Solutions:
Option 1: Manual Page Breaks
\pagebreak
## Next Section
Option 2: Pandoc Variables
--variable pagestyle=headings
--variable geometry:margin=1in
Option 3: LaTeX Penalties
\widowpenalty=10000
\clubpenalty=10000
Issue 6: Bold Labels Before Lists Render Inline
Symptom: Bold labels followed by lists render as inline text instead of separate formatted list.
Example in PDF:
Technology Changes: - New system implementations - Software upgrades - Infrastructure modifications
Root Cause: Pandoc requires a blank line between a bold label (format: **Label:**) and the list that follows it.
Check markdown:
# ❌ WRONG - No blank line
**Technology Changes:**
- New system implementations
- Software upgrades

# ✅ CORRECT - Blank line after label
**Technology Changes:**

- New system implementations
- Software upgrades
Solution: Add a blank line between the bold label and the list.
Automated Detection:
# Find all bold labels immediately followed by lists
grep -n '^\*\*[^*]*:\*\*$' file.md | while IFS=: read -r num _; do
    # num is the line number reported by grep -n
    next=$((num + 1))
    nextline=$(sed -n "${next}p" file.md)
    if [[ $nextline =~ ^[-*][[:space:]] ]]; then
        echo "Line $num: Missing blank line after bold label"
    fi
done
Automated Fix: Use fix_pandoc_lists.py script (see Automation section below).
Issue 7: Headers Show Literal ## Characters
Symptom: Headers render as plain text with literal ## characters visible.
Example in PDF:
## Fraud Risk Assessment
Root Cause: Pandoc requires a blank line between an HTML anchor tag and the markdown header that follows it.
Check markdown:
# ❌ WRONG - No blank line after anchor
<a name="fraud-risk"></a>
## Fraud Risk Assessment

# ✅ CORRECT - Blank line after anchor
<a name="fraud-risk"></a>

## Fraud Risk Assessment
Why This Happens: Pandoc treats HTML and markdown as separate contexts. Without a blank line, it doesn't recognize the ## as a markdown header.
Solution: Add a blank line between the HTML anchor and the header.
Automated Detection:
# Find anchors immediately followed by headers
grep -n '^<a name=' file.md | while IFS=: read -r num _; do
    # num is the line number reported by grep -n
    next=$((num + 1))
    nextline=$(sed -n "${next}p" file.md)
    if [[ $nextline =~ ^## ]]; then
        echo "Line $num: Missing blank line after anchor"
    fi
done
Automated Fix: Use fix_pandoc_anchors.py script (see Automation section below).
Issue 8: Metadata Fields Run Together
Symptom: Consecutive metadata fields render on single line instead of separate lines.
Example in PDF:
Title: Report Name Author: Your Name Date: January 2025
Root Cause: Pandoc requires blank lines between consecutive paragraphs. Without them, it merges lines into continuous text flow.
Check markdown:
# ❌ WRONG - No blank lines between
**Organization:** Example Corp
**Audit Type:** SOC 2 Type 1
**Scope:** Security (CC1-CC9)

# ✅ CORRECT - Blank lines between each
**Organization:** Example Corp

**Audit Type:** SOC 2 Type 1

**Scope:** Security (CC1-CC9)
Solution: Add blank lines between consecutive bold label lines.
Automated Detection:
# Find consecutive bold label lines
# -F: makes $1 the line number from grep -n, so consecutive matches compare numerically
grep -n '^\*\*[^*]*:\*\* ' file.md | \
awk -F: 'NR > 1 && $1 == prev + 1 {print "Lines " prev "-" $1 ": Consecutive bold labels"} {prev = $1}'
Automated Fix: Use fix_pandoc_metadata.py script (see Automation section below).
Issue 9: Plain Text Labels Before Lists Render Inline
Symptom: Plain text (not bold) ending with colon followed by list renders inline.
Example in PDF:
The security program aligns with: - SOC 2 - ISO 27001 - NIST Framework
Root Cause: Same as Issue 6, but for plain text labels instead of bold.
Check markdown:
# ❌ WRONG - No blank line
The security program aligns with:
- SOC 2 Trust Services Criteria
- ISO 27001 control framework

# ✅ CORRECT - Blank line after plain text label
The security program aligns with:

- SOC 2 Trust Services Criteria
- ISO 27001 control framework
Solution: Add a blank line after any text ending with : when it is followed by a list.
Automated Fix: Enhanced fix_pandoc_lists.py handles both bold and plain text labels.
Automation: Fix Scripts
Script 1: fix_pandoc_lists.py
Purpose: Fix bold and plain text labels before lists.
Usage:
python3 fix_pandoc_lists.py
What it fixes:
- Bold labels before lists: **Label:** → blank line → list
- Plain text labels before lists: Text: → blank line → list
Example output:
Processing 03-risk-assessment.md...
Line 186: Added blank line after '**Technology Changes:**'
Line 265: Added blank line after 'The security program aligns with:'
✅ Fixed 03-risk-assessment.md
Script location: Project root directory
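The script itself is not reproduced in this document. The following is a minimal sketch of the core logic it describes, not the actual fix_pandoc_lists.py: the function name and regular expression are assumptions, and the sketch does not skip fenced code blocks or front matter.

```python
import re

# A list item: optional indent, then -, *, + or "1." followed by whitespace.
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")


def add_blank_lines_before_lists(lines: list[str]) -> list[str]:
    """Insert a blank line between a label line and a list that directly follows it."""
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        stripped = line.rstrip()
        # Plain labels end with ":", bold labels end with ":**".
        is_label = stripped.endswith(":") or stripped.endswith(":**")
        # List items that end with ":" (parents of nested lists) are left alone.
        is_item = bool(LIST_ITEM.match(line))
        if is_label and not is_item and LIST_ITEM.match(next_line):
            out.append("")
    return out


if __name__ == "__main__":
    source = [
        "**Technology Changes:**",
        "- New system implementations",
        "- Software upgrades",
    ]
    print("\n".join(add_blank_lines_before_lists(source)))
```

Because a label already followed by a blank line is left untouched, running the function twice gives the same result as running it once.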
Script 2: fix_pandoc_anchors.py
Purpose: Fix HTML anchors before headers.
Usage:
python3 fix_pandoc_anchors.py
What it fixes:
- <a name="..."></a> → blank line → ## Header
Example output:
Processing 03-risk-assessment.md...
Line 141: Added blank line after '<a name="fraud-risk"></a>'
✅ Fixed 03-risk-assessment.md
Script location: Project root directory
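As with the previous script, the source is not shown here. A minimal sketch of the anchor fix might look like the following; it is not the actual fix_pandoc_anchors.py, the function name and patterns are assumptions, and it only handles anchors written exactly as <a name="..."></a> on their own line.

```python
import re

ANCHOR = re.compile(r'^<a name="[^"]*"></a>\s*$')
HEADER = re.compile(r"^#{1,6}\s")


def separate_anchors_from_headers(text: str) -> str:
    """Insert a blank line between an HTML anchor line and a header on the next line."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        followed_by_header = i + 1 < len(lines) and HEADER.match(lines[i + 1])
        if ANCHOR.match(line) and followed_by_header:
            out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    source = '<a name="fraud-risk"></a>\n## Fraud Risk Assessment\n'
    print(separate_anchors_from_headers(source))
```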
Script 3: fix_pandoc_metadata.py
Purpose: Fix consecutive bold label metadata fields.
Usage:
python3 fix_pandoc_metadata.py
What it fixes:
- Consecutive **Label:** value lines → add blank lines between them
Example output:
Processing index.md...
Line 3: Added blank line after '**Organization:** Example Corp'
Line 4: Added blank line after '**Audit Type:** SOC 2 Type 1'
✅ Fixed index.md
Script location: Project root directory
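A sketch of the metadata fix, under the same caveats: this is an illustration of the described behavior rather than the actual fix_pandoc_metadata.py, and the function name and pattern are assumptions.

```python
import re

# A metadata field: a bold label ending in a colon, then a space and a value.
BOLD_FIELD = re.compile(r"^\*\*[^*]+:\*\* \S")


def split_metadata_fields(lines: list[str]) -> list[str]:
    """Insert a blank line between two consecutive '**Label:** value' lines."""
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        next_is_field = i + 1 < len(lines) and BOLD_FIELD.match(lines[i + 1])
        if BOLD_FIELD.match(line) and next_is_field:
            out.append("")
    return out


if __name__ == "__main__":
    fields = [
        "**Organization:** Example Corp",
        "**Audit Type:** SOC 2 Type 1",
        "**Scope:** Security (CC1-CC9)",
    ]
    print("\n".join(split_metadata_fields(fields)))
```

Bold labels with no value on the same line (the case handled by the list fix) do not match the pattern, so the two fixes do not interfere.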
Running All Fix Scripts
Complete fix workflow:
# Fix all Pandoc formatting issues
python3 fix_pandoc_lists.py # Lists after labels
python3 fix_pandoc_anchors.py # Anchors before headers
python3 fix_pandoc_metadata.py # Consecutive metadata
# Regenerate PDF
./scripts/generate-pdf.sh
# Visual verification
open output/Documentation.pdf
When to run:
- After adding new content with lists
- After modifying metadata sections
- After adding HTML anchors
- Before committing PDFs
- When user reports inline rendering issues
Pandoc Command Reference
Basic PDF Generation
pandoc file.md -o output.pdf \
--from markdown \
--to pdf \
--pdf-engine=xelatex
With TOC and Sections
pandoc file.md -o output.pdf \
--from markdown \
--to pdf \
--pdf-engine=xelatex \
--toc \
--toc-depth=3 \
--number-sections
With Metadata
pandoc file.md -o output.pdf \
--from markdown \
--to pdf \
--pdf-engine=xelatex \
--metadata title="Document Title" \
--metadata author="Author Name" \
--metadata date="$(date +%Y-%m-%d)"
With Custom Template
pandoc file.md -o output.pdf \
--from markdown \
--to pdf \
--pdf-engine=xelatex \
--template=custom-template.tex
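When the build is driven from Python rather than a shell script, the variants above can be assembled from one helper. This is a sketch, assuming pandoc and xelatex are installed; the function names and parameters are hypothetical. It uses only flags that appear in the commands above and leaves out --to pdf, relying on the .pdf output extension to select PDF output.

```python
import subprocess


def build_pandoc_command(source, output, *, toc=False, title=None,
                         author=None, template=None):
    """Return the pandoc argument list for the requested options."""
    cmd = ["pandoc", source, "-o", output,
           "--from", "markdown", "--pdf-engine=xelatex"]
    if toc:
        cmd += ["--toc", "--toc-depth=3", "--number-sections"]
    if title:
        cmd += ["--metadata", f"title={title}"]
    if author:
        cmd += ["--metadata", f"author={author}"]
    if template:
        cmd.append(f"--template={template}")
    return cmd


def generate_pdf(source, output, **options):
    """Run pandoc; raises CalledProcessError if pandoc exits non-zero."""
    subprocess.run(build_pandoc_command(source, output, **options), check=True)


if __name__ == "__main__":
    # Only print the command here; call generate_pdf() to actually run it.
    print(" ".join(build_pandoc_command("file.md", "output.pdf", toc=True)))
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces.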
Testing Checklist Template
Copy this checklist for each PDF generation:
## PDF Generation Test - [DATE]
### Generation Phase
- [ ] Script runs without errors
- [ ] PDF file created
- [ ] File size reasonable (< 10MB for typical docs)
### Visual Inspection Phase
- [ ] Opened PDF and scrolled through ALL pages
- [ ] Cover page correct
- [ ] TOC complete and accurate
- [ ] All headers styled correctly (no literal `##`)
- [ ] All bullets formatted as lists (not inline)
- [ ] All numbered lists formatted correctly (not inline)
- [ ] Bold/plain labels before lists properly spaced
- [ ] Metadata fields on separate lines (not run together)
- [ ] All tables fit on pages
- [ ] No obviously bad page breaks
- [ ] No missing content
- [ ] Font rendering acceptable
### Specific Checks (from user feedback)
- [ ] [Specific section] renders correctly
- [ ] [Specific formatting] matches intent
- [ ] [Specific issue] is fixed
### Final Validation
- [ ] PDF matches markdown source intent
- [ ] All user-reported issues addressed
- [ ] Ready for commit
**Issues Found:** [List any issues]
**Next Steps:** [What needs fixing]
Automation: PDF Testing Script
Create: scripts/test-pdf.sh
#!/bin/bash
# Test PDF generation and basic quality checks
set -e

# Generate PDF
./scripts/generate-pdf.sh
PDF="output/Documentation.pdf"

# Check file exists
if [ ! -f "$PDF" ]; then
    echo "❌ PDF not generated"
    exit 1
fi

# Check file size (should be between 100KB and 10MB)
# stat -f%z is the BSD/macOS form; stat -c%s is the GNU form
SIZE=$(stat -f%z "$PDF" 2>/dev/null || stat -c%s "$PDF")
if [ "$SIZE" -lt 100000 ]; then
    echo "⚠️ WARNING: PDF seems too small ($SIZE bytes)"
elif [ "$SIZE" -gt 10000000 ]; then
    echo "⚠️ WARNING: PDF seems too large ($SIZE bytes)"
else
    # numfmt is not installed everywhere; fall back to raw bytes
    echo "✅ PDF size OK: $(numfmt --to=iec-i --suffix=B "$SIZE" 2>/dev/null || echo "$SIZE bytes")"
fi

# Check page count (using pdfinfo if available)
if command -v pdfinfo &> /dev/null; then
    PAGES=$(pdfinfo "$PDF" | grep "Pages:" | awk '{print $2}')
    echo "📄 Pages: $PAGES"
    # Thresholds are project-specific; adjust to the expected document length
    if [ "$PAGES" -lt 50 ]; then
        echo "⚠️ WARNING: Expected ~89 pages, got $PAGES"
    fi
fi

echo ""
echo "✅ Basic checks passed!"
echo "📋 Next: Open PDF and visually inspect"
echo " open $PDF"
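If the checks need to be unit-tested, the two decisions in the script (size range and page count) can be pulled into small pure functions. This is a sketch only: the function names are hypothetical, the thresholds are copied from the script above, and the caller is assumed to supply the file size and the text printed by pdfinfo.

```python
import re


def check_size(size_bytes, low=100_000, high=10_000_000):
    """Classify a PDF size using the same bounds as the shell script."""
    if size_bytes < low:
        return "too small"
    if size_bytes > high:
        return "too large"
    return "ok"


def parse_page_count(pdfinfo_output):
    """Extract the page count from pdfinfo text output, or None if absent."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "Title:          Documentation\nPages:          89\n"
    print(check_size(2_500_000), parse_page_count(sample))
```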
Key Takeaways
- Different renderers = different rules - Pandoc ≠ Python-Markdown
- Visual inspection required - Terminal success ≠ correct PDF
- Blank lines are critical - Pandoc needs blank lines between different markdown elements
- Test locally before committing - Generate, open, review
- Same workflow as MkDocs - Systematic testing, not assumptions
- Font limitations are real - Accept or configure around them
- Markdown intent matters - Source should express desired structure
- Create testing checklists - Catch issues systematically
- Automate fixes - Create scripts for common formatting issues
- HTML and markdown need separation - Always blank line after HTML elements
Common Pandoc Gotchas Summary
The "Blank Line Rule": Pandoc requires blank lines in these situations:
- After bold/plain text labels before lists
- After HTML tags before markdown headers
- Between consecutive paragraph-like elements
- Before and after headers
Quick Check Commands:
# Check for labels before lists (no blank line)
grep -B1 '^[-*] ' file.md | grep -E ':(\*\*)?$'
# Check for anchors before headers (no blank line)
grep -A1 '^<a name=' file.md | grep '^##'
# Check for consecutive bold labels
grep -n '^\*\*[^*]*:\*\* ' file.md | awk -F: 'NR > 1 && $1 == prev + 1 {print "Lines " prev "-" $1} {prev = $1}'
When in doubt: Add a blank line. Pandoc almost never complains about too many blank lines.
Real-World Example
Project: Large documentation set
Files: 15 markdown files
Issues Found: 469 formatting problems across 4 categories
Fixes Applied:
- 376 labels before lists (Issues 6 & 9)
- 45 anchors before headers (Issue 7)
- 62 consecutive metadata fields (Issue 8)
Time Investment:
- Discovery: ~2 hours (user feedback + testing)
- Script development: ~1 hour (3 scripts)
- Execution: ~5 minutes (automated)
- Verification: ~10 minutes (visual PDF review)
ROI: About 3 hours invested produced a reusable automated solution. Once the scripts existed, all issues were fixed in 5 minutes.
Status: Production-ready with automation scripts