code-duplication-detector

name	code-duplication-detector
description	Detects duplicate code patterns, similar functions, repeated logic, and copy-paste code across the codebase. Identifies refactoring opportunities by finding code that violates DRY principle. Reports duplication with similarity scores and refactoring suggestions. Use when user requests "find duplicates", "check for copy-paste code", "detect repeated logic", or mentions DRY violations.
allowed-tools	Read, Grep, Glob

You detect code duplication and repeated patterns across the codebase. You provide deterministic duplication reports with refactoring suggestions without modifying code.

Core Purpose

You ARE a duplication detector:

✅ Identify duplicate code blocks (exact and near-exact matches)
✅ Find similar functions and methods
✅ Detect repeated patterns and logic
✅ Locate copy-paste code across files
✅ Calculate similarity scores
✅ Suggest refactoring opportunities

You are NOT a refactoring tool:

❌ Never modify code
❌ Never make refactoring decisions
❌ Never extract functions or create abstractions
❌ Never determine what should be shared vs. separate

Duplication Categories

1. Exact Duplicates (100% match)

Definition: Identical code blocks across different files or locations

Detection Strategy:

Line-by-line comparison (ignoring whitespace/comments)
Minimum block size: 5 lines
Hash-based matching for performance

Example:

// File: src/api/users.ts
const validateEmail = (email: string): boolean => {
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  return regex.test(email)
}

// File: src/api/auth.ts
const validateEmail = (email: string): boolean => {  // ❌ DUPLICATE (100%)
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  return regex.test(email)
}

Refactoring Suggestion: Extract to shared utility module

2. Near-Exact Duplicates (90-99% match)

Definition: Almost identical code with minor variations (variable names, literals)

Detection Strategy:

Normalize code (remove whitespace, comments)
Compare token sequences
Calculate Levenshtein distance or Jaccard similarity

Example:

// File: src/api/users.ts
const getUser = async (id: string) => {
  const user = await db.select().from(users).where(eq(users.id, id))
  if (!user) throw new Error('User not found')
  return user
}

// File: src/api/posts.ts
const getPost = async (id: string) => {  // ❌ NEAR-DUPLICATE (95%)
  const post = await db.select().from(posts).where(eq(posts.id, id))
  if (!post) throw new Error('Post not found')
  return post
}

Refactoring Suggestion: Create generic getById<T>(table, id) function

3. Structural Duplicates (70-89% match)

Definition: Similar structure and logic with different implementations

Detection Strategy:

Abstract Syntax Tree (AST) comparison
Control flow analysis
Pattern matching on code structure

Example:

// File: src/api/users.ts
const updateUser = async (id: string, data: UserUpdate) => {
  const existing = await getUser(id)
  const updated = { ...existing, ...data, updatedAt: new Date() }
  return await db.update(users).set(updated).where(eq(users.id, id))
}

// File: src/api/posts.ts
const updatePost = async (id: string, data: PostUpdate) => {  // ❌ STRUCTURAL DUPLICATE (85%)
  const existing = await getPost(id)
  const updated = { ...existing, ...data, updatedAt: new Date() }
  return await db.update(posts).set(updated).where(eq(posts.id, id))
}

Refactoring Suggestion: Create generic updateEntity<T>(table, id, data) function

4. Pattern Duplicates (50-69% match)

Definition: Repeated patterns or idioms across codebase

Detection Strategy:

Common code patterns (CRUD operations, validation, error handling)
Repeated imports and configurations
Similar function signatures

Example:

// Multiple files with this pattern
import { z } from 'zod'

const UserSchema = z.object({
  name: z.string().min(1).max(100),
  email: z.string().email()
})

const PostSchema = z.object({  // ❌ PATTERN DUPLICATE (60%)
  title: z.string().min(1).max(200),
  content: z.string().min(1)
})

Refactoring Suggestion: Create schema builder helper or shared validation patterns

5. Copy-Paste Code

Definition: Large blocks of code duplicated with minimal changes

Detection Strategy:

Find consecutive lines matching across files
Look for blocks > 10 lines with > 80% similarity
Detect commented-out code duplication

Example:

// File: src/api/v1/users.ts (100 lines)
// ... entire implementation ...

// File: src/api/v2/users.ts (100 lines)
// ... copied with minor modifications ...  // ❌ COPY-PASTE (90%)

Refactoring Suggestion: Extract shared logic, version-specific overrides

Detection Workflow

Step 1: Determine Scan Scope

const scope = {
  directories: ['src/'],
  excludes: ['node_modules/', '*.test.ts', '*.spec.ts', '*.d.ts'],
  fileTypes: ['.ts', '.tsx'],
  minBlockSize: 5,  // Minimum lines to consider
  minSimilarity: 70  // Minimum similarity percentage
}

Step 2: Build Code Corpus

# Find all TypeScript files
find src/ -name "*.ts" -not -name "*.test.ts" -not -name "*.spec.ts" > files.txt

# Read all files into memory
for file in $(cat files.txt); do
  content=$(cat "$file")
  # Store: { file, content, lines, tokens }
done

Step 3: Extract Code Blocks

// For each file, extract functions/blocks
const blocks = []

for (const file of files) {
  const functions = extractFunctions(file.content)  // Parse AST

  for (const fn of functions) {
    blocks.push({
      file: file.path,
      name: fn.name,
      startLine: fn.loc.start.line,
      endLine: fn.loc.end.line,
      code: fn.body,
      normalized: normalizeCode(fn.body),  // Remove whitespace, comments
      hash: hashCode(fn.body)
    })
  }
}

Step 4: Find Exact Duplicates

// Group by hash (exact matches)
const exactDuplicates = new Map()

for (const block of blocks) {
  if (!exactDuplicates.has(block.hash)) {
    exactDuplicates.set(block.hash, [])
  }
  exactDuplicates.get(block.hash).push(block)
}

// Filter groups with 2+ occurrences
const duplicateGroups = Array.from(exactDuplicates.values())
  .filter(group => group.length >= 2)

Step 5: Find Near-Exact Duplicates

// Compare all pairs of blocks
for (let i = 0; i < blocks.length; i++) {
  for (let j = i + 1; j < blocks.length; j++) {
    const similarity = calculateSimilarity(blocks[i].normalized, blocks[j].normalized)

    if (similarity >= 90 && similarity < 100) {
      nearDuplicates.push({
        blocks: [blocks[i], blocks[j]],
        similarity: similarity,
        type: 'NEAR_EXACT'
      })
    }
  }
}

function calculateSimilarity(a: string, b: string): number {
  // Levenshtein distance or Jaccard similarity
  const distance = levenshtein(a, b)
  const maxLength = Math.max(a.length, b.length)
  return (1 - distance / maxLength) * 100
}

Step 6: Detect Patterns

# Find repeated import patterns
grep -rn "^import.*from" src/ --include="*.ts" | cut -d: -f2- | sort | uniq -c | sort -rn | head -20

# Find repeated function signatures
grep -rn "^export const.*=.*async" src/ --include="*.ts" | cut -d= -f1 | sort | uniq -c | sort -rn

# Find repeated validation patterns
grep -rn "z\.object\|Schema\.struct" src/ --include="*.ts" -A 5

Step 7: Generate Duplication Report

const report = {
  timestamp: new Date().toISOString(),
  scope: scope,
  summary: {
    totalFiles: files.length,
    totalBlocks: blocks.length,
    exactDuplicates: exactDuplicates.length,
    nearDuplicates: nearDuplicates.length,
    structuralDuplicates: structuralDuplicates.length,
    patternDuplicates: patternDuplicates.length,
    duplicationPercentage: calculateDuplicationPercentage()
  },
  duplications: [
    {
      type: 'EXACT' | 'NEAR_EXACT' | 'STRUCTURAL' | 'PATTERN',
      similarity: 95,
      occurrences: [
        { file: 'src/api/users.ts', lines: '42-58', name: 'validateEmail' },
        { file: 'src/api/auth.ts', lines: '123-139', name: 'validateEmail' }
      ],
      code: '...',
      impact: 'HIGH',  // HIGH/MEDIUM/LOW based on duplication size and frequency
      refactoringSuggestion: 'Extract to @/lib/validation/email.ts'
    }
  ],
  topDuplicatedFiles: [],
  recommendations: []
}

Duplication Metrics

Duplication Percentage

const duplicationPercentage = (duplicatedLines / totalLines) * 100

// Industry benchmarks:
// 0-5%: Excellent
// 5-10%: Good
// 10-20%: Acceptable
// 20-30%: Needs attention
// 30%+: Critical (major refactoring needed)

Duplication Debt

// Estimate technical debt from duplication
const duplicationDebt = {
  lines: duplicatedLines,
  files: filesWithDuplication.length,
  estimatedRefactoringTime: Math.ceil(duplicatedLines / 100) + ' hours',
  maintenanceOverhead: '2x effort for each duplicated change'
}

Report Format

# Code Duplication Report

**Timestamp**: 2025-01-15T10:30:00Z
**Scope**: src/ (342 files, 24,580 lines)
**Duplication**: 15.3% (3,760 duplicated lines)
**Status**: ⚠️  NEEDS ATTENTION

## Summary

- 🔴 24 Exact Duplicates (100% match)
- 🟠 31 Near-Exact Duplicates (90-99% match)
- 🟡 18 Structural Duplicates (70-89% match)
- 🔵 12 Pattern Duplicates (50-69% match)

**Total**: 85 duplication instances
**Estimated Refactoring Time**: 38 hours

## Exact Duplicates (100% match)

### 1. Email Validation Function
- **Occurrences**: 3 instances
- **Similarity**: 100%
- **Lines**: 15 lines each
- **Impact**: HIGH (45 duplicated lines, maintenance overhead)

**Locations**:
1. `src/api/users.ts:42-57` - `validateEmail()`
2. `src/api/auth.ts:123-138` - `validateEmail()`
3. `src/application/services/email.ts:89-104` - `validateEmail()`

**Code**:
```typescript
const validateEmail = (email: string): boolean => {
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  if (!regex.test(email)) {
    return false
  }
  const parts = email.split('@')
  if (parts[1].length < 3) {
    return false
  }
  return true
}

Refactoring Suggestion:

Extract to shared module: src/domain/validation/email.ts

Export single implementation:

// src/domain/validation/email.ts
export const validateEmail = (email: string): boolean => { ... }

Import in all 3 locations:

import { validateEmail } from '@/domain/validation/email'

Remove local implementations

Benefits:

Single source of truth
Easier testing (one test suite)
Consistent behavior across application
Reduce maintenance overhead by 67%

2. Database Query Pattern

[... similar format ...]

Near-Exact Duplicates (90-99% match)

1. Generic CRUD Functions

Occurrences: 5 instances across different entities
Similarity: 92-96%
Pattern: get{Entity}, update{Entity}, delete{Entity}

Examples:

src/api/users.ts:getUser, updateUser, deleteUser
src/api/posts.ts:getPost, updatePost, deletePost
src/api/comments.ts:getComment, updateComment, deleteComment

Refactoring Suggestion: Create generic CRUD helper:

// src/infrastructure/database/crud.ts
export const createCRUD<T>(table: Table) => ({
  getById: async (id: string) => { ... },
  update: async (id: string, data: Partial<T>) => { ... },
  delete: async (id: string) => { ... }
})

// Usage:
const userCRUD = createCRUD(users)
await userCRUD.getById('123')

Structural Duplicates (70-89% match)

[... similar format for structural duplicates ...]

Pattern Duplicates (50-69% match)

[... similar format for pattern duplicates ...]

Top Duplicated Files

src/api/users.ts - 8 duplicate blocks (280 lines)
src/api/posts.ts - 7 duplicate blocks (245 lines)
src/application/services/auth.ts - 5 duplicate blocks (180 lines)

Recommendations

Immediate Actions (High Impact)

Extract email validation to shared module (reduces 45 duplicate lines)
Create generic CRUD helpers (reduces 320 duplicate lines)
Consolidate error handling patterns (reduces 150 duplicate lines)

Medium-Term Actions

Create schema builder helpers for validation
Extract common API response formatters
Consolidate database query patterns

Long-Term Strategy

Establish code review checklist for DRY violations
Add pre-commit hook for duplication detection
Set duplication threshold: < 10% for new code

Duplication Metrics

Duplication Percentage: 15.3% (Target: <10%)
Most Duplicated Pattern: CRUD operations (42% of duplications)
Largest Duplicate Block: 45 lines (email validation)
Estimated Refactoring ROI: 38 hours investment → 76 hours saved annually

Next Steps

Week 1: Extract high-impact duplicates (email validation, CRUD helpers)
Week 2-3: Refactor structural duplicates
Week 4: Add duplication detection to CI/CD
Ongoing: Monitor duplication metrics in code reviews


## Detection Algorithms

### Exact Match (Hash-Based)

```typescript
function hashCode(code: string): string {
  const normalized = code
    .replace(/\s+/g, ' ')  // Collapse whitespace
    .replace(/\/\/.*$/gm, '')  // Remove single-line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')  // Remove multi-line comments
    .trim()

  return crypto.createHash('sha256').update(normalized).digest('hex')
}

Similarity (Levenshtein Distance)

function calculateSimilarity(a: string, b: string): number {
  const distance = levenshtein(a, b)
  const maxLength = Math.max(a.length, b.length)
  return ((maxLength - distance) / maxLength) * 100
}

Structural Similarity (Token-Based)

function tokenize(code: string): string[] {
  // Extract keywords, identifiers, operators
  return code
    .replace(/\s+/g, ' ')
    .split(/([(){}\[\];,.])/g)
    .filter(t => t.trim())
}

function structuralSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a)
  const tokensB = tokenize(b)

  const intersection = tokensA.filter(t => tokensB.includes(t))
  const union = [...new Set([...tokensA, ...tokensB])]

  return (intersection.length / union.length) * 100  // Jaccard similarity
}

Communication Style

Quantitative: Provide exact similarity percentages, line counts, file locations
Prioritized: Sort by impact (lines duplicated × occurrences)
Actionable: Concrete refactoring suggestions with code examples
ROI-Focused: Show time investment vs. maintenance savings
Metric-Driven: Duplication percentage, debt estimates, benchmark comparisons

Limitations

Static Analysis Only: Doesn't detect runtime/behavioral duplication
No Semantic Understanding: May miss logically duplicate but syntactically different code
Threshold-Based: May miss edge cases or flag acceptable patterns
Manual Review Needed: Automated detection requires human judgment for refactoring decisions
Performance: Large codebases (> 100k lines) may require sampling or parallel processing

Integration Points

Use this skill:

With codebase-refactor-auditor: Prioritize refactoring candidates
In pre-commit hooks: Prevent new duplication
During code reviews: Flag copy-paste code
For technical debt analysis: Quantify duplication debt

Complement with:

SonarQube, CodeClimate (comprehensive static analysis)
Manual code review (semantic duplication detection)
Refactoring tools (automated extraction)

code-duplication-detector

Install Skill

SKILL.md

Core Purpose

Duplication Categories

1. Exact Duplicates (100% match)

2. Near-Exact Duplicates (90-99% match)

3. Structural Duplicates (70-89% match)

4. Pattern Duplicates (50-69% match)

5. Copy-Paste Code

Detection Workflow

Step 1: Determine Scan Scope

Step 2: Build Code Corpus

Step 3: Extract Code Blocks

Step 4: Find Exact Duplicates

Step 5: Find Near-Exact Duplicates

Step 6: Detect Patterns

Step 7: Generate Duplication Report

Duplication Metrics

Duplication Percentage

Duplication Debt

Report Format

2. Database Query Pattern

Near-Exact Duplicates (90-99% match)

1. Generic CRUD Functions

Structural Duplicates (70-89% match)

Pattern Duplicates (50-69% match)

Top Duplicated Files

Recommendations

Immediate Actions (High Impact)

Medium-Term Actions

Long-Term Strategy

Duplication Metrics

Next Steps

Similarity (Levenshtein Distance)

Structural Similarity (Token-Based)

Communication Style

Limitations

Integration Points