| name | workspace-cleanup |
| description | Intelligent workspace cleanup using multi-signal detection (similarity, timestamps, references) to identify and archive clutter with two-stage safety review |
Workspace Cleanup
Overview
Automatically clean workspace directories by detecting and archiving clutter using intelligent multi-signal analysis. Reduces AI context pollution from temp files, sync conflicts, and superseded code versions while maintaining safety through archive-based two-stage deletion.
Core principle: Safe, intelligent cleanup that understands file drift from AI code generation and protects important files through multi-signal confidence scoring.
When to Use
- Workspace has accumulated clutter (temp files, old versions, sync conflicts)
- AI context windows are polluted with noise during file scanning
- User mentions cleanup needs ("this is a mess", "clean up experiments", etc.)
- Regular maintenance to prevent drift accumulation
Problem This Solves
AI assistants (Claude, Codex, Gemini) often create new files instead of updating existing ones:
- auth.ts → auth-new.ts → auth-fixed.ts
- Over time: multiple versions, unclear which is current
- Clutter from system files, temp files, and abandoned experiments
- Context window pollution during code analysis
Detection System
Three Core Signals
1. Similarity Detection
- Content hash comparison between files
- Filename similarity (Levenshtein distance)
- Flags files with >80% content match + similar names
2. Timestamp Analysis
- Last modified time (default: 90 days untouched)
- Last accessed time
- Configurable thresholds
3. Import/Reference Analysis
- Grep for imports/requires across codebase
- Search for file references in code/docs
- Flag files with zero references as "unused"
Tiered Confidence Scoring
Tier 1 (auto-archive): 100% safe to remove
- System files: .DS_Store, .sync-conflict-*
- Build artifacts: __pycache__/, *.pyc, .pytest_cache/
- Empty directories (except those containing .gitkeep)
- Version patterns: -old, -backup, -fixed, -new, -updated, .bak
- Temp/log files: *.log, *.tmp, temp-*, tmp-*
- Exact duplicates: files with identical SHA256 hashes (archive all but newest)
Tier 2 (archive): high confidence (2+ signals)
- Similar files (80%+ content match) + old timestamp → archive the older one
- Unused + old timestamp
- Similarity + unused
Tier 3 (suggest only): low confidence (1 signal) or override rules
- Just old, just unused, or just similar
- Large files (>100MB), even with multiple signals
- Recently modified similar files (<7 days)
- Report for manual review; never auto-archive
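A minimal sketch of this tiering logic, assuming a hypothetical score_file helper (the stat fallback covers GNU and BSD variants; thresholds follow the defaults in this document):

```bash
# Hypothetical score_file helper: maps signal count to a tier, with the
# large-file override from the rules above
score_file() {
  local file=$1; shift
  local signals=("$@")                      # e.g. (similarity old_timestamp)
  local bytes
  bytes=$(stat -c%s "$file" 2>/dev/null || stat -f%z "$file")
  if (( bytes > 100 * 1024 * 1024 )); then
    echo 3                                  # >100MB: always suggest-only
  elif (( ${#signals[@]} >= 2 )); then
    echo 2                                  # multi-signal: archive
  else
    echo 3                                  # single signal: suggest-only
  fi
}
```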
Archive Management
Central Archive Structure
/Users/braydon/projects/archive/cleanup/
├── 2025-11-21-143022/
│   ├── metadata.json
│   └── [preserved directory structure]
└── 2025-10-15-091234/
    ├── metadata.json
    └── [files...]
Two-Stage Safety
Stage 1: Archive
- Move files to central archive (never immediate deletion)
- Preserve original directory structure
- Store metadata explaining why each file was archived
Stage 2: Review (30+ days)
- Auto-prompt for archives >30 days old
- Show summary of archived contents
- Options: Keep archive, Delete permanently, Restore files, Skip
- Mark reviewed in metadata
Metadata Format
{
  "timestamp": "2025-11-21T14:30:22Z",
  "scope": "/Users/braydon/projects",
  "recursive": true,
  "files": [
    {
      "original_path": "/Users/braydon/projects/foo.txt",
      "tier": 1,
      "signals": ["pattern_match"],
      "score": 100
    },
    {
      "original_path": "/Users/braydon/projects/experiments/old-auth.ts",
      "tier": 2,
      "signals": ["similarity", "unused", "old_timestamp"],
      "score": 85,
      "similar_to": "experiments/auth.ts"
    }
  ],
  "reviewed": false,
  "review_date": null
}
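Standard jq queries work against this format for ad-hoc inspection; for example (the archive path is illustrative):

```bash
# List Tier 2 files in an archive along with the signals that flagged them
jq -r '.files[] | select(.tier == 2) | "\(.original_path)  [\(.signals | join(", "))]"' \
  /Users/braydon/projects/archive/cleanup/2025-11-21-143022/metadata.json
```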
Protected Patterns
Three-Layer Protection
Layer 1: Respect .gitignore
- If git ignores it, cleanup should too
- Check with: git check-ignore -q "$file"
- Prevents cleaning build artifacts, dependencies, etc.
Layer 2: .cleanupignore
An optional file for cleanup-specific exclusions:
# .cleanupignore
archive/ # Don't clean the archive itself
important-*.md # Keep files matching pattern
legacy-project/ # Preserve specific directories
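Entries like important-*.md are glob patterns, so a literal grep against .cleanupignore is not enough. A minimal glob-aware matcher sketch, assuming the is_cleanup_ignored name and the comment/whitespace handling shown here:

```bash
# Hypothetical is_cleanup_ignored helper: returns success if the path
# matches any non-comment pattern in .cleanupignore (glob semantics)
is_cleanup_ignored() {
  local file=$1 pattern
  [[ -f .cleanupignore ]] || return 1
  while IFS= read -r pattern; do
    pattern=${pattern%%#*}                  # strip trailing comments
    pattern=$(echo "$pattern" | xargs)      # trim surrounding whitespace
    [[ -z "$pattern" ]] && continue
    # Directory patterns (trailing /) match any path under them
    if [[ "$pattern" == */ ]]; then
      [[ "$file" == ${pattern}* || "$file" == */${pattern}* ]] && return 0
    elif [[ "$(basename "$file")" == $pattern ]]; then
      return 0
    fi
  done < .cleanupignore
  return 1
}
```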
Layer 3: Hard-coded System Patterns
Always protected regardless of ignore files:
- Directories: .git, .claude, node_modules, .venv, venv, dist, build
- Files: package.json, requirements.txt, *.lock, CLAUDE.md, README.md, .env*
Efficient Scanning with Prune
Use find's -prune to skip entire protected directory trees:
find . \( -name node_modules -o -name .git -o -name dist \) -prune \
-o -type f -name "*.tmp" -print
This never even traverses into protected directories, making scans much faster.
Usage
Context-Aware Invocation
From conversation:
User: "This directory is a mess, let's clean it up"
→ Runs recursive cleanup from CWD
User: "Let's clean up the experiments directory"
→ Runs cleanup scoped to ./experiments
Explicit commands:
/cleanup # Current directory
/cleanup --recursive # Current + subdirs
/cleanup /path/to/dir # Specific directory
/cleanup --review-archives # Review old archives
Execution Workflow
When invoked, follow these steps:
1. Parse Scope
- Determine target directory from user request or CWD
- Check if recursive or targeted cleanup
- Validate directory exists and is accessible
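A minimal argument-parsing sketch for this step (flag names follow the commands above; variable names are illustrative):

```bash
# Parse scope and mode from the invocation
target="$PWD"
recursive=false
review_archives=false
for arg in "$@"; do
  case "$arg" in
    --recursive)       recursive=true ;;
    --review-archives) review_archives=true ;;
    *)                 target="$arg" ;;     # explicit directory
  esac
done

# Validate the target before scanning
if [[ ! -d "$target" || ! -r "$target" ]]; then
  echo "Error: '$target' is not an accessible directory" >&2
  exit 1
fi
```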
2. Scan & Analyze
# Build file hash map for duplicate detection
declare -A file_hashes

# Scan with protection (prune protected dirs early); process substitution
# keeps the hash map in the current shell instead of a pipeline subshell
while IFS= read -r file; do
  # Layer 1: Check .gitignore
  if git check-ignore -q "$file" 2>/dev/null; then
    continue
  fi

  # Layer 2: Check .cleanupignore (if exists)
  # (crude literal match; see the glob-aware sketch in Protected Patterns)
  if [[ -f .cleanupignore ]] && grep -qF "$(basename "$file")" .cleanupignore; then
    continue
  fi

  # Layer 3: Hard-coded protections
  if [[ "$file" =~ (package\.json|CLAUDE\.md|README\.md|\.env) ]]; then
    continue
  fi

  # === TIER 1 CHECKS (auto-archive) ===

  # Check obvious patterns
  if [[ "$file" =~ (\.DS_Store|\.sync-conflict-|\.tmp$|\.log$) ]]; then
    archive "$file" tier:1 signal:pattern_match
    continue
  fi

  # Check exact duplicates
  hash=$(sha256sum "$file" | cut -d' ' -f1)
  if [[ -n "${file_hashes[$hash]}" ]]; then
    # Found duplicate - archive older file
    original="${file_hashes[$hash]}"
    if [[ "$file" -nt "$original" ]]; then
      archive "$original" tier:1 signal:exact_duplicate duplicate_of:"$file"
      file_hashes[$hash]="$file"
    else
      archive "$file" tier:1 signal:exact_duplicate duplicate_of:"$original"
    fi
    continue
  fi
  file_hashes[$hash]="$file"

  # Check version patterns (allow an extension after the suffix, e.g. auth-old.ts)
  if [[ "$file" =~ -(old|backup|fixed|new|updated)(\.[A-Za-z0-9]+)?$ || "$file" =~ \.bak$ ]]; then
    archive "$file" tier:1 signal:version_pattern
    continue
  fi

  # === TIER 2/3 CHECKS (multi-signal) ===
  signals=()

  # Run similarity detection (expensive, do after Tier 1)
  if hasSimilarFile "$file"; then
    signals+=("similarity")
  fi

  # Check timestamps (find prints the file only if untouched for 90+ days)
  if [[ -n $(find "$file" -mtime +90) ]]; then
    signals+=("old_timestamp")
  fi

  # Check references (exclude the file itself so its own name doesn't count)
  if ! grep -rq --exclude-dir=node_modules --exclude="$(basename "$file")" \
      "$(basename "$file")" .; then
    signals+=("unused")
  fi

  # Score and tier
  if [[ ${#signals[@]} -ge 2 ]]; then
    archive "$file" tier:2 signals:"${signals[*]}"
  elif [[ ${#signals[@]} -eq 1 ]]; then
    suggest "$file" tier:3 signals:"${signals[*]}"
  fi
done < <(find . \( -name node_modules -o -name .git -o -name dist -o -name build \) -prune \
  -o -type f -print)
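The loop above calls a hasSimilarFile helper it never defines. A minimal sketch, assuming a crude stem-based candidate search and a diff-ratio threshold (both illustrative; a full implementation would use the content-hash + Levenshtein approach from Implementation Notes):

```bash
# Hypothetical hasSimilarFile: succeeds if a sibling file shares the name
# stem and differs on fewer than ~20% of lines (threshold assumed)
hasSimilarFile() {
  local file=$1 dir base stem candidate total changed
  dir=$(dirname "$file"); base=$(basename "$file")
  stem=${base%%[.-]*}                       # crude stem: text before first . or -
  for candidate in "$dir/$stem"*; do
    [[ "$candidate" == "$file" || ! -f "$candidate" ]] && continue
    total=$(( $(wc -l < "$file") + $(wc -l < "$candidate") ))
    changed=$(diff "$file" "$candidate" | grep -c '^[<>]')
    if (( total > 0 && changed * 100 / total < 20 )); then
      return 0
    fi
  done
  return 1
}
```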
3. Archive Files
- Create timestamped archive directory
- Move Tier 1 + Tier 2 files preserving structure
- Generate metadata.json with analysis results
- Skip Tier 3 (just report)
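A minimal sketch of the archive helper the workflow calls, assuming ARCHIVE_ROOT, the entries.jsonl intermediate file, and the metadata-append approach are illustrative:

```bash
# Hypothetical archive helper: moves a file into the timestamped archive,
# preserving its path relative to the scan root, and records metadata
ARCHIVE_ROOT="/Users/braydon/projects/archive/cleanup"
RUN_DIR="$ARCHIVE_ROOT/$(date +%Y-%m-%d-%H%M%S)"

archive() {
  local file=$1 tier=$2; shift 2            # remaining args: signal notes
  local rel=${file#./}                      # path relative to scan root
  local dest="$RUN_DIR/$rel"
  mkdir -p "$(dirname "$dest")"
  mv "$file" "$dest"
  # Append one entry per file; jq builds the JSON safely
  jq -n --arg path "$rel" --arg tier "${tier#tier:}" --arg signals "$*" \
    '{original_path: $path, tier: ($tier | tonumber), signals: ($signals | split(" "))}' \
    >> "$RUN_DIR/entries.jsonl"
}
```

The entries.jsonl lines would then be assembled into the metadata.json format shown earlier at the end of the run.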
4. Report Results
🧹 Workspace Cleanup - /Users/braydon/projects
Scope: Recursive | Protected dirs: 8 | Scanning...
📊 Analysis Results:
• 156 files scanned (78 skipped via protection layers)
• 23 Tier 1 (auto-archive):
- System files: .DS_Store (8), sync conflicts (3)
- Exact duplicates: (6 files, kept newest)
- Version patterns: -old, -backup files (6)
• 12 Tier 2 (archive): similar + old or unused
• 8 Tier 3 (suggestions): review manually
📦 Archiving to: /Users/braydon/projects/archive/cleanup/2025-11-21-143022/
✓ Archived 35 files (2.3 MB saved)
💡 Tier 3 Suggestions (not archived):
• experiments/test-model.py (unused, 45 days old)
• personal/notes.txt (old, 120 days)
• work/large-dataset.csv (>100MB, unused - verify before archiving)
⏰ Archives ready for review: 2 archives >30 days old
Run '/cleanup --review-archives' to review
5. Check Archives
- Find archives >30 days old
- If found, prompt for review
- Show summary and offer actions
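A minimal sketch for finding review-ready archives, assuming the directory layout above and the reviewed flag in each metadata.json:

```bash
# Find archives older than 30 days that haven't been reviewed yet
ARCHIVE_ROOT="/Users/braydon/projects/archive/cleanup"
find "$ARCHIVE_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +30 |
while IFS= read -r dir; do
  if [[ $(jq -r '.reviewed' "$dir/metadata.json" 2>/dev/null) == "false" ]]; then
    echo "Review ready: $(basename "$dir")"
  fi
done
```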
Archive Review Workflow
📋 Archive Review - 2 archives ready
Archive: 2025-10-15-091234 (37 days old)
• Scope: /Users/braydon/projects (recursive)
• 18 files archived (1.2 MB)
• Breakdown:
- Tier 1: .DS_Store (8), sync conflicts (10)
- Tier 2: unused code (0)
Actions:
K - Keep archive (don't prompt again for 30 days)
D - Delete permanently (CANNOT BE UNDONE)
R - Restore files to original locations
S - Skip this review
Your choice [K/D/R/S]:
Configuration
Users can override defaults in .claude/workspace-cleanup-config.json:
{
  "timestamp_threshold_days": 90,
  "similarity_threshold": 0.80,
  "archive_review_days": 30,
  "custom_protected_patterns": [
    "important-*.md",
    "do-not-delete/*"
  ],
  "custom_tier1_patterns": [
    "*.tmp",
    "temp-*",
    ".scratch"
  ],
  "excluded_dirs": [
    "special-project"
  ]
}
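A minimal sketch for loading these settings with fallbacks to the documented defaults (jq's // operator supplies the default when a key is absent):

```bash
# Read config with defaults; a missing file or key falls back to defaults
CONFIG=".claude/workspace-cleanup-config.json"
timestamp_days=$(jq -r '.timestamp_threshold_days // 90' "$CONFIG" 2>/dev/null || echo 90)
similarity=$(jq -r '.similarity_threshold // 0.80' "$CONFIG" 2>/dev/null || echo 0.80)
review_days=$(jq -r '.archive_review_days // 30' "$CONFIG" 2>/dev/null || echo 30)
```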
Implementation Notes
Exact Duplicate Detection
# Build hash map of all files (bash 4+ associative array)
declare -A file_hashes

while IFS= read -r file; do
  # Generate SHA256 hash
  hash=$(sha256sum "$file" | cut -d' ' -f1)

  # Check if we've seen this hash before
  if [[ -n "${file_hashes[$hash]}" ]]; then
    original="${file_hashes[$hash]}"
    # Archive older file, keep newer
    if [[ "$file" -nt "$original" ]]; then
      echo "Duplicate found: $file is newer than $original"
      archive "$original" tier:1 signal:exact_duplicate
      file_hashes[$hash]="$file"   # Update to keep newer
    else
      echo "Duplicate found: $original is newer than $file"
      archive "$file" tier:1 signal:exact_duplicate
    fi
  else
    # First time seeing this content
    file_hashes[$hash]="$file"
  fi
done < <(find . -type f -print)   # process substitution keeps the map in this shell
Why SHA256: Strong collision resistance, fast computation, standard tool (sha256sum).
Edge case: If duplicates have same mtime, keep first found, archive rest.
Similarity Detection (for non-exact matches)
// Sketch (Node.js); assumes a levenshtein(a, b) helper returning edit
// distance, and a precomputed contentMatch ratio in [0, 1]
const crypto = require('crypto');
const fs = require('fs');

// Generate content hash for quick exact-match comparison
const hash = crypto.createHash('sha256')
  .update(fs.readFileSync(file))
  .digest('hex');

// Compare filenames (Levenshtein distance, normalized to [0, 1])
const nameDistance = levenshtein(file1, file2);
const similarity = 1 - nameDistance / Math.max(file1.length, file2.length);

// Flag if both content and name are similar (but not identical)
if (contentMatch > 0.80 && contentMatch < 1.0 && similarity > 0.70) {
  // Tier 2: archive the older file if it is also old/unused
}
Reference Detection
# Use grep to find references (quote variables; -l lists matching files)
grep -rl "import.*${filename}" "${scope}"
grep -rl "require.*${filename}" "${scope}"
grep -rl "${filename}" "${scope}"   # bare-name pass also catches docs/config
# If all passes return nothing: flag as unused
Version Pattern Detection
const VERSION_PATTERNS = [
  /-old$/, /-backup$/, /-fixed$/, /-new$/,
  /-updated$/, /-v\d+$/, /-copy$/,
  /^old-/, /^backup-/, /^new-/, /^temp-/,
  /\.bak$/, /\.backup$/
];
// Apply the suffix patterns against the basename with its extension
// stripped (so auth-old.ts matches /-old$/); .bak/.backup match as-is.
// When detected + similarity match:
// keep the file without the pattern, archive the file with the pattern
Common Mistakes
Cleaning without scanning first
- ❌ Don't skip analysis phase
- ✅ Always scan → analyze → report → archive
Ignoring Tier 3 suggestions
- ❌ Dismissing suggestions outright; Tier 3 files often become Tier 2 over time
- ✅ Review suggestions periodically
Deleting archives too quickly
- ❌ Don't delete archives <30 days old
- ✅ Wait for review prompt, verify you don't need files
Not checking protected patterns
- ❌ Assuming default patterns cover everything
- ✅ Review protected patterns for your workspace
Running on untracked important work
- ❌ Don't clean directory with active untracked experiments
- ✅ Commit or stash important work first
Edge Cases
Similar files, both recent
- If both files modified within last 7 days: Tier 3 (suggest only)
- Let user decide which to keep
Empty directories with .gitkeep
- Don't archive empty dirs containing .gitkeep
- These are intentionally empty
Large files (>100MB)
- Always Tier 3 (suggest only)
- User should explicitly confirm before archiving
Files in git staging area
- Skip files with uncommitted changes
- Report as "skipped: uncommitted changes"
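A minimal sketch for this check, assuming the scan runs inside a git work tree (git status --porcelain prints a line for any staged or unstaged change; untracked files show as ?? and are not skipped here):

```bash
# Skip tracked files with staged or unstaged modifications
status=$(git status --porcelain -- "$file" 2>/dev/null)
if [[ -n "$status" && "$status" != \?\?* ]]; then
  echo "skipped: uncommitted changes - $file"
  continue
fi
```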
Benefits
- AI Context Reduction - Less noise in context windows
- Safety First - Two-stage archive prevents accidental deletion
- Intelligent Detection - Finds actual clutter, not just patterns
- Context Aware - Adapts to user intent and scope
- Low Maintenance - Mostly automated with sensible defaults
- Recoverable - Everything archived, nothing immediately deleted
Related Skills
- learning-from-outcomes - Learn from cleanup patterns over time
- coordinating-sub-agents - Delegate cleanup to specialized agent
Future Enhancements
- Machine learning on user archive/restore decisions
- Cross-project similarity detection
- Automatic .gitignore updates based on archived patterns
- Integration with project task management