| name | markdown-consolidator |
| description | Intelligent consolidation and synthesis of multiple markdown files with overlapping content and different update dates. Use when: (1) Multiple AI-generated markdown files need merging, (2) Knowledge bases have fragmented or duplicate content, (3) Documentation requires recency-aware synthesis, (4) Supporting documents need re-synthesis after AI task completion, (5) Project documentation has semantic overlap across files, (6) Periodic knowledge base maintenance and deduplication is needed. |
Markdown Consolidator
Consolidate and synthesize multiple markdown files with intelligent handling of overlapping content, different update dates, and semantic deduplication.
Core Problem
AI-assisted workflows generate fragmented documentation:
- Each AI session creates task-specific markdown files
- AI references supporting docs but doesn't update them post-task
- Knowledge becomes scattered across files with overlapping content
- Different timestamps make version reconciliation complex
Workflow Overview
1. ANALYZE → Inventory files, extract metadata, identify relationships
2. CLUSTER → Group semantically related files using content analysis
3. PLAN → Create merge strategy based on recency, overlap, authority
4. SYNTHESIZE → Merge content with intelligent conflict resolution
5. VALIDATE → Verify completeness and coherence of output
Analysis Phase
Step 1: File Inventory
Run the inventory script to analyze all markdown files:
python scripts/inventory.py <directory> --output inventory.json
The script extracts:
- File paths and sizes
- Modification timestamps (file system and YAML frontmatter)
- Section headers (H1-H6 structure)
- Word/token counts per section
- Internal links (
[[wikilinks]]and[markdown](links)) - YAML frontmatter metadata
- Content fingerprints for similarity detection
Step 2: Relationship Mapping
python scripts/analyze_relationships.py inventory.json --output relationships.json
Identifies:
- Semantic clusters: Files covering similar topics (via TF-IDF/embedding similarity)
- Temporal chains: Files that evolved from each other (via timestamp + similarity)
- Reference graphs: Which files reference which (via link analysis)
- Conflict zones: Sections with contradictory or overlapping content
Clustering Phase
Clustering Strategies
Choose based on your consolidation goal:
Topic-based clustering (default) Groups files by semantic similarity of content.
python scripts/cluster.py relationships.json --method topic --threshold 0.6
Temporal clustering Groups files by modification date ranges.
python scripts/cluster.py relationships.json --method temporal --window 7d
Hierarchical clustering Groups by directory structure + content similarity.
python scripts/cluster.py relationships.json --method hierarchical
Cluster Output
Creates clusters.json with structure:
{
"clusters": [
{
"id": "cluster_001",
"theme": "API Authentication",
"files": ["auth-design.md", "oauth-notes.md", "token-handling.md"],
"primary_file": "auth-design.md",
"overlap_score": 0.72,
"conflicts": ["token-handling.md:L45 vs oauth-notes.md:L23"]
}
]
}
Planning Phase
Merge Strategy Selection
Authority-based (recommended for documentation)
- Most recent file is authoritative for conflicts
- Older unique content is preserved with attribution
- Use when files represent evolving understanding
Comprehensive (for knowledge bases)
- Union of all unique information
- Conflicts flagged for manual review
- Use when completeness matters more than consistency
Canonical (for specifications)
- Designate one file as canonical
- Others provide supplementary/historical context
- Use when single source of truth is required
Create Merge Plan
python scripts/plan_merge.py clusters.json --strategy authority --output merge_plan.json
Generates actionable merge plan:
{
"cluster_id": "cluster_001",
"output_file": "consolidated/authentication.md",
"sections": [
{
"heading": "## Overview",
"sources": [{"file": "auth-design.md", "lines": "1-25", "action": "primary"}],
"conflicts": []
},
{
"heading": "## Token Handling",
"sources": [
{"file": "token-handling.md", "lines": "10-45", "action": "primary"},
{"file": "oauth-notes.md", "lines": "20-35", "action": "supplement"}
],
"conflicts": [
{
"description": "Token expiry differs: 24h vs 1h",
"resolution": "Use most recent (token-handling.md: 24h)"
}
]
}
]
}
Synthesis Phase
Execute Merge
python scripts/synthesize.py merge_plan.json --output consolidated/
The synthesizer:
- Creates section-by-section merged content
- Preserves original attribution via HTML comments
- Resolves conflicts per strategy
- Maintains internal link consistency
- Updates frontmatter with merge metadata
Synthesis Rules
Content Deduplication
- Exact duplicates: Remove, keep first occurrence
- Near duplicates (>80% similarity): Merge, note sources
- Partial overlap: Keep both with clear section breaks
Conflict Resolution
Authority strategy:
1. Prefer most recently modified source
2. Prefer explicitly dated content over undated
3. Prefer longer/more detailed explanations
4. Flag unresolvable conflicts for review
Comprehensive strategy:
1. Include all non-contradictory content
2. Present conflicts as "Version A / Version B" blocks
3. Add TODO markers for manual resolution
Link Handling
- Internal links updated to point to consolidated files
- Broken links flagged with
<!-- BROKEN: original-target.md --> - External links preserved as-is
Output Format
Consolidated files include:
---
title: Authentication System
consolidated_from:
- file: auth-design.md
modified: 2024-12-01T10:30:00
- file: oauth-notes.md
modified: 2024-11-28T15:45:00
- file: token-handling.md
modified: 2024-12-02T09:00:00
consolidated_at: 2024-12-03T14:00:00
strategy: authority
---
# Authentication System
<!-- SOURCE: auth-design.md:1-25 -->
## Overview
...
<!-- SOURCE: token-handling.md:10-45, SUPPLEMENTED: oauth-notes.md:20-35 -->
## Token Handling
...
<!-- CONFLICT RESOLVED: Used token-handling.md (most recent) -->
Token expiry is set to 24 hours...
Validation Phase
python scripts/validate.py consolidated/ --original <source_dir>
Validates:
- Completeness: All source content represented or explicitly excluded
- Link integrity: All internal links resolve
- Coherence: No contradictions in final output
- Metadata: Proper attribution and timestamps
Generates validation_report.md:
## Consolidation Validation Report
### Coverage
- 47/47 source files processed
- 3 files excluded (empty/invalid)
- 12 clusters created
- 8 consolidated files produced
### Content Coverage
- 98.3% of source content preserved
- 1.7% deduplicated (exact matches)
- 5 conflicts resolved automatically
- 2 conflicts flagged for review
### Issues
- [ ] REVIEW: consolidated/auth.md:L145 - conflicting token formats
- [ ] REVIEW: consolidated/api.md:L67 - unclear which version is correct
Quick Start
For immediate consolidation of a directory:
# Full pipeline
python scripts/consolidate.py <source_dir> <output_dir> --strategy authority
# This runs: inventory → analyze → cluster → plan → synthesize → validate
Advanced: Incremental Updates
For ongoing maintenance:
# Detect changes since last consolidation
python scripts/detect_changes.py <source_dir> --since "2024-12-01"
# Re-consolidate only affected clusters
python scripts/consolidate.py <source_dir> <output_dir> --incremental
Configuration
Create .consolidator.yaml in project root:
# Files/directories to exclude
exclude:
- "**/archive/**"
- "**/.obsidian/**"
- "**/templates/**"
# Similarity threshold for clustering (0-1)
similarity_threshold: 0.6
# Default merge strategy
default_strategy: authority
# Preserve original files
keep_originals: true
archive_path: .consolidated-archive/
# Frontmatter fields to preserve
preserve_frontmatter:
- tags
- aliases
- created
# Output format
output:
add_source_comments: true
add_merge_frontmatter: true
update_internal_links: true
Integration Patterns
With Claude Code Sessions
Add to your CLAUDE.md:
## Post-Task Consolidation
After completing any task that creates or modifies markdown files:
1. Run `/project:consolidate` to update knowledge base
2. Review flagged conflicts in validation report
3. Archive original files if consolidation successful
With Basic Memory MCP
The consolidator can output in Basic Memory format:
python scripts/synthesize.py merge_plan.json --format basic-memory
Outputs files with observation/relation syntax compatible with Basic Memory's knowledge graph.
Reference Documentation
- ALGORITHMS.md - Detailed similarity/clustering algorithms
- CONFLICT-RESOLUTION.md - Conflict handling patterns
- INTEGRATION.md - Integration with other tools