name

markdown-consolidator

description

Intelligent consolidation and synthesis of multiple markdown files with overlapping content and different update dates. Use when: (1) Multiple AI-generated markdown files need merging, (2) Knowledge bases have fragmented or duplicate content, (3) Documentation requires recency-aware synthesis, (4) Supporting documents need re-synthesis after AI task completion, (5) Project documentation has semantic overlap across files, (6) Periodic knowledge base maintenance and deduplication is needed.

Markdown Consolidator

Consolidate and synthesize multiple markdown files with intelligent handling of overlapping content, different update dates, and semantic deduplication.

Core Problem

AI-assisted workflows generate fragmented documentation:

Each AI session creates task-specific markdown files
AI references supporting docs but doesn't update them post-task
Knowledge becomes scattered across files with overlapping content
Different timestamps make version reconciliation complex

Workflow Overview

1. ANALYZE    → Inventory files, extract metadata, identify relationships
2. CLUSTER    → Group semantically related files using content analysis
3. PLAN       → Create merge strategy based on recency, overlap, authority
4. SYNTHESIZE → Merge content with intelligent conflict resolution
5. VALIDATE   → Verify completeness and coherence of output

Analysis Phase

Step 1: File Inventory

Run the inventory script to analyze all markdown files:

python scripts/inventory.py <directory> --output inventory.json

The script extracts:

File paths and sizes
Modification timestamps (file system and YAML frontmatter)
Section headers (H1-H6 structure)
Word/token counts per section
Internal links ([[wikilinks]] and [markdown](links))
YAML frontmatter metadata
Content fingerprints for similarity detection
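
As a rough illustration of the extraction listed above, the following sketch pulls headers, links, a word count, and a fingerprint out of a markdown string. It is not the code of scripts/inventory.py; the function name `inventory_text`, the regular expressions, and the choice of a truncated SHA-256 over whitespace-normalized text are all assumptions. A hash like this only detects identical content after normalization, so a real similarity fingerprint would need something more tolerant.

```python
import hashlib
import re

# Hypothetical patterns; they cover common cases, not every markdown edge case.
HEADER_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")
MDLINK_RE = re.compile(r"(?<!\!)\[[^\]]*\]\(([^)\s]+)")


def inventory_text(text: str) -> dict:
    """Extract headers, links, word count, and a fingerprint from markdown text."""
    # Each header becomes a (level, title) pair, e.g. (2, "Details").
    headers = [(len(m.group(1)), m.group(2)) for m in HEADER_RE.finditer(text)]
    links = WIKILINK_RE.findall(text) + MDLINK_RE.findall(text)
    # Normalize case and whitespace so trivial edits keep the same fingerprint.
    normalized = " ".join(text.lower().split())
    fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return {
        "headers": headers,
        "links": links,
        "word_count": len(text.split()),
        "fingerprint": fingerprint,
    }
```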

Step 2: Relationship Mapping

python scripts/analyze_relationships.py inventory.json --output relationships.json

Identifies:

Semantic clusters: Files covering similar topics (via TF-IDF/embedding similarity)
Temporal chains: Files that evolved from each other (via timestamp + similarity)
Reference graphs: Which files reference which (via link analysis)
Conflict zones: Sections with contradictory or overlapping content
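
To make the TF-IDF similarity mentioned above concrete, here is a minimal standard-library sketch. The function names, the tokenizer, and the smoothed IDF formula are assumptions chosen for illustration; the actual script may use a different weighting or an embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build one sparse TF-IDF vector per document, using a smoothed IDF."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts documents.
    doc_freq = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values()) or 1
        vectors[name] = {
            term: (freq / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        }
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
```

Pairwise scores from `cosine` could then feed the clustering step, with timestamps added to distinguish temporal chains from plain topical overlap.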

Clustering Phase

Clustering Strategies

Choose based on your consolidation goal:

Topic-based clustering (default): groups files by semantic similarity of content.

python scripts/cluster.py relationships.json --method topic --threshold 0.6

Temporal clustering: groups files by modification date ranges.

python scripts/cluster.py relationships.json --method temporal --window 7d

Hierarchical clustering: groups files by directory structure plus content similarity.

python scripts/cluster.py relationships.json --method hierarchical
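
One plausible reading of topic-based clustering with a threshold is single-linkage grouping: any two files whose similarity meets the threshold end up in the same cluster. The sketch below does this with a small union-find. The function name and the input shape (a dict keyed by file pairs) are assumptions, not the format of relationships.json.

```python
def cluster_by_threshold(files, similarities, threshold=0.6):
    """Group files connected by pairwise similarity >= threshold (single linkage)."""
    parent = {f: f for f in files}

    def find(f):
        # Follow parent pointers to the cluster root, halving the path as we go.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for (a, b), score in similarities.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())
```

Single linkage can chain loosely related files together through intermediates, so a stricter threshold or a different linkage may suit large collections better.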

Cluster Output

The script writes clusters.json with this structure:

{
  "clusters": [
    {
      "id": "cluster_001",
      "theme": "API Authentication",
      "files": ["auth-design.md", "oauth-notes.md", "token-handling.md"],
      "primary_file": "auth-design.md",
      "overlap_score": 0.72,
      "conflicts": ["token-handling.md:L45 vs oauth-notes.md:L23"]
    }
  ]
}

Planning Phase

Merge Strategy Selection

Authority-based (recommended for documentation)

Most recent file is authoritative for conflicts
Older unique content is preserved with attribution
Use when files represent evolving understanding

Comprehensive (for knowledge bases)

Union of all unique information
Conflicts flagged for manual review
Use when completeness matters more than consistency

Canonical (for specifications)

Designate one file as canonical
Others provide supplementary/historical context
Use when single source of truth is required

Create Merge Plan

python scripts/plan_merge.py clusters.json --strategy authority --output merge_plan.json

This generates an actionable merge plan:

{
  "cluster_id": "cluster_001",
  "output_file": "consolidated/authentication.md",
  "sections": [
    {
      "heading": "## Overview",
      "sources": [{"file": "auth-design.md", "lines": "1-25", "action": "primary"}],
      "conflicts": []
    },
    {
      "heading": "## Token Handling",
      "sources": [
        {"file": "token-handling.md", "lines": "10-45", "action": "primary"},
        {"file": "oauth-notes.md", "lines": "20-35", "action": "supplement"}
      ],
      "conflicts": [
        {
          "description": "Token expiry differs: 24h vs 1h",
          "resolution": "Use most recent (token-handling.md: 24h)"
        }
      ]
    }
  ]
}

Synthesis Phase

Execute Merge

python scripts/synthesize.py merge_plan.json --output consolidated/

The synthesizer:

Creates section-by-section merged content
Preserves original attribution via HTML comments
Resolves conflicts per strategy
Maintains internal link consistency
Updates frontmatter with merge metadata

Synthesis Rules

Content Deduplication

Exact duplicates: Remove, keep first occurrence
Near duplicates (>80% similarity): Merge, note sources
Partial overlap: Keep both with clear section breaks
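
The deduplication tiers above can be approximated with the standard library's `difflib`. This is a sketch under assumptions: the function name is hypothetical, similarity is measured as a character-level `SequenceMatcher` ratio (the source does not say which metric the 80% refers to), and anything below the threshold is reported as one bucket rather than split into partial overlap and unrelated text.

```python
from difflib import SequenceMatcher


def classify_overlap(a: str, b: str, near_threshold: float = 0.8) -> str:
    """Classify two passages as 'exact', 'near', or 'below-threshold'."""
    # Collapse whitespace so reflowed paragraphs still count as exact duplicates.
    norm_a, norm_b = " ".join(a.split()), " ".join(b.split())
    if norm_a == norm_b:
        return "exact"
    ratio = SequenceMatcher(None, norm_a, norm_b).ratio()
    return "near" if ratio > near_threshold else "below-threshold"
```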

Conflict Resolution

Authority strategy:
  1. Prefer most recently modified source
  2. Prefer explicitly dated content over undated
  3. Prefer longer/more detailed explanations
  4. Flag unresolvable conflicts for review

Comprehensive strategy:
  1. Include all non-contradictory content
  2. Present conflicts as "Version A / Version B" blocks
  3. Add TODO markers for manual resolution
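
The authority rules above can be read as an ordered set of tie-breakers. The sketch below takes that reading, which is an assumption: it ranks candidates by modification time, then by whether the content is explicitly dated, then by length, and returns `None` when the top two candidates are indistinguishable so the conflict can be flagged for review. The `Candidate` type and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    file: str
    modified: datetime
    explicitly_dated: bool
    text: str


def resolve_by_authority(candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick the winning candidate, or None if the conflict needs manual review."""

    def rank(c: Candidate):
        # Tuple comparison applies the rules in priority order.
        return (c.modified, c.explicitly_dated, len(c.text))

    ordered = sorted(candidates, key=rank, reverse=True)
    if len(ordered) > 1 and rank(ordered[0]) == rank(ordered[1]):
        return None  # rule 4: unresolvable, flag for review
    return ordered[0] if ordered else None
```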

Link Handling

Internal links updated to point to consolidated files
Broken links flagged for review
External links preserved as-is
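
A minimal sketch of these link rules for standard markdown links follows. It assumes a hypothetical `moved` mapping from original file paths to consolidated paths, treats any target with a URL scheme as external, and returns unresolved targets instead of writing a marker into the text. Wikilinks and image links are not handled specially here.

```python
import re

LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def rewrite_links(text: str, moved: dict[str, str]) -> tuple[str, list[str]]:
    """Retarget internal links via `moved`; return new text and unresolved targets."""
    unresolved = []

    def replace(match):
        label, target = match.group(1), match.group(2)
        if SCHEME_RE.match(target):
            return match.group(0)  # external link: preserve as-is
        path, sep, anchor = target.partition("#")
        if not path:
            return match.group(0)  # same-file anchor such as (#section)
        if path in moved:
            return f"[{label}]({moved[path]}{sep}{anchor})"
        unresolved.append(target)  # candidate broken link
        return match.group(0)

    return LINK_RE.sub(replace, text), unresolved
```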

Output Format

Consolidated files include:

---
title: Authentication System
consolidated_from:
  - file: auth-design.md
    modified: 2024-12-01T10:30:00
  - file: oauth-notes.md
    modified: 2024-11-28T15:45:00
  - file: token-handling.md
    modified: 2024-12-02T09:00:00
consolidated_at: 2024-12-03T14:00:00
strategy: authority
---

# Authentication System

<!-- SOURCE: auth-design.md:1-25 -->
## Overview
...

<!-- SOURCE: token-handling.md:10-45, SUPPLEMENTED: oauth-notes.md:20-35 -->
## Token Handling
...

<!-- CONFLICT RESOLVED: Used token-handling.md (most recent) -->
Token expiry is set to 24 hours...
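
For illustration, the merge frontmatter and source comments shown above could be produced by helpers along these lines. Both function names are hypothetical, and the sketch writes YAML by hand without quoting or escaping, so it assumes titles and file names contain no YAML-special characters.

```python
from datetime import datetime


def render_frontmatter(title, sources, consolidated_at, strategy):
    """Render merge frontmatter; `sources` is a list of (file, modified) pairs."""
    lines = ["---", f"title: {title}", "consolidated_from:"]
    for name, modified in sources:
        lines.append(f"  - file: {name}")
        lines.append(f"    modified: {modified.isoformat()}")
    lines += [
        f"consolidated_at: {consolidated_at.isoformat()}",
        f"strategy: {strategy}",
        "---",
    ]
    return "\n".join(lines)


def source_comment(file, start, end):
    """Render an attribution comment for a line range of a source file."""
    return f"<!-- SOURCE: {file}:{start}-{end} -->"
```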

Validation Phase

python scripts/validate.py consolidated/ --original <source_dir>

Validates:

Completeness: All source content represented or explicitly excluded
Link integrity: All internal links resolve
Coherence: No contradictions in final output
Metadata: Proper attribution and timestamps
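
The link integrity check, for example, might work roughly as follows. This sketch operates on an in-memory mapping of relative paths to file contents rather than on disk, which is an assumption made to keep it self-contained; the function name is hypothetical and only standard markdown links are checked.

```python
import posixpath
import re

# Capture the path part of a link target, stopping before any #anchor.
LINK_TARGET_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)")


def find_broken_links(files: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose relative link target is not in `files`."""
    broken = []
    for path, text in files.items():
        for target in LINK_TARGET_RE.findall(text):
            if "://" in target or target.startswith("mailto:"):
                continue  # external links are outside the scope of this check
            # Resolve the target relative to the directory of the linking file.
            resolved = posixpath.normpath(posixpath.join(posixpath.dirname(path), target))
            if resolved not in files:
                broken.append((path, target))
    return broken
```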

Generates validation_report.md:

## Consolidation Validation Report

### Coverage
- 47/47 source files processed
- 3 files excluded (empty/invalid)
- 12 clusters created
- 8 consolidated files produced

### Content Coverage
- 98.3% of source content preserved
- 1.7% deduplicated (exact matches)
- 5 conflicts resolved automatically
- 2 conflicts flagged for review

### Issues
- [ ] REVIEW: consolidated/auth.md:L145 - conflicting token formats
- [ ] REVIEW: consolidated/api.md:L67 - unclear which version is correct

Quick Start

For immediate consolidation of a directory:

# Full pipeline
python scripts/consolidate.py <source_dir> <output_dir> --strategy authority

# This runs: inventory → analyze → cluster → plan → synthesize → validate

Advanced: Incremental Updates

For ongoing maintenance:

# Detect changes since last consolidation
python scripts/detect_changes.py <source_dir> --since "2024-12-01"

# Re-consolidate only affected clusters
python scripts/consolidate.py <source_dir> <output_dir> --incremental
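
Change detection can be as simple as comparing file modification times against a cutoff. The sketch below assumes exactly that and nothing more; whether scripts/detect_changes.py also consults frontmatter dates or content hashes is not stated in this document. The function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def changed_since(source_dir: str, since: datetime) -> list[Path]:
    """List markdown files under source_dir modified after `since` (local time)."""
    cutoff = since.timestamp()
    return sorted(
        p for p in Path(source_dir).rglob("*.md") if p.stat().st_mtime > cutoff
    )
```

The changed files can then be mapped back to their clusters so that only those clusters are re-consolidated.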

Configuration

Create .consolidator.yaml in project root:

# Files/directories to exclude
exclude:
  - "**/archive/**"
  - "**/.obsidian/**"
  - "**/templates/**"

# Similarity threshold for clustering (0-1)
similarity_threshold: 0.6

# Default merge strategy
default_strategy: authority

# Preserve original files
keep_originals: true
archive_path: .consolidated-archive/

# Frontmatter fields to preserve
preserve_frontmatter:
  - tags
  - aliases
  - created

# Output format
output:
  add_source_comments: true
  add_merge_frontmatter: true
  update_internal_links: true
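
As a sketch of how the exclude patterns above might be applied, the helper below matches relative paths with `fnmatch.fnmatchcase`. This is an approximation and an assumption: in `fnmatch`, `*` already matches across `/`, so `**` behaves like a plain wildcard rather than a true recursive glob, and the real tool may use a different matcher. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(rel_path: str, patterns: list[str]) -> bool:
    """Return True if rel_path matches any exclude pattern."""
    # Prefix with "/" so "**/archive/**" also matches a top-level archive/ directory.
    candidate = "/" + rel_path.replace("\\", "/")
    return any(fnmatchcase(candidate, pattern) for pattern in patterns)
```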

Integration Patterns

With Claude Code Sessions

Add to your CLAUDE.md:

## Post-Task Consolidation

After completing any task that creates or modifies markdown files:
1. Run `/project:consolidate` to update knowledge base
2. Review flagged conflicts in validation report
3. Archive original files if consolidation succeeded

With Basic Memory MCP

The consolidator can output in Basic Memory format:

python scripts/synthesize.py merge_plan.json --format basic-memory

Outputs files with observation/relation syntax compatible with Basic Memory's knowledge graph.

Reference Documentation

ALGORITHMS.md - Detailed similarity/clustering algorithms
CONFLICT-RESOLUTION.md - Conflict handling patterns
INTEGRATION.md - Integration with other tools