name	Code Archaeology
description	Systematic techniques for reading and understanding unfamiliar legacy code without documentation
when_to_use	when encountering unfamiliar legacy code and need to understand what it does, why it exists, and how it works
version	1.0.0
languages	all

Code Archaeology

Overview

Legacy code is code without context. Code archaeology is the systematic process of excavating that context from the code itself, its history, and its runtime behavior.

Core principle: Read code like a detective, not a compiler. Look for clues about intent, not just mechanics.

When to Use

Use code archaeology when:

Encountering unfamiliar codebase without documentation
Need to modify code you didn't write
Investigating why code was written this way
Before refactoring or adding features
Understanding legacy system architecture

The Archaeology Process

Phase 1: Survey the Landscape

Before diving into details, get the big picture:

Identify Entry Points
- Main functions, API endpoints
- Event handlers, controllers
- What triggers this code to run?
Map the Territory
- Directory structure
- Module organization
- Key abstractions and their relationships
Find the Documentation
- README files (even outdated ones have clues)
- Code comments (especially "why" comments)
- Tests (they show intended usage)
- Commit messages in git history

Quick mapping command:

# Find all entry points
find . -name "main.*" -o -name "*Controller*" -o -name "*Handler*" -o -name "Program.cs" -o -name "Startup.cs"

# See directory structure
tree -L 3 -I 'node_modules|vendor|__pycache__|bin|obj'

# Find tests (they document behavior)
find . -name "*test*" -o -name "*spec*" -o -name "*.Tests" -o -name "*.Test"

Phase 2: Follow the Data

Understanding flow is key to understanding purpose:

Pick a Concrete Example
- Don't start with abstractions
- Choose specific input/output case
- "What happens when user logs in?"

Trace Forward (from input)

Input → Where does it enter?
      → How is it transformed?
      → Where does it go?
      → What's the output?

Trace Backward (from output)

Output → Where is it produced?
       → What data creates it?
       → Where does that data come from?
       → Keep going to the source

Example: Understanding authentication:

Forward: POST /login → router → authController.login() → UserService.authenticate() → Database
Backward: JWT token ← TokenService ← User object ← Database query ← Credentials validation

Phase 3: Identify Core Abstractions

What are the domain concepts?

Look for Nouns
- Classes, types, tables: User, Order, Payment
- These are domain entities
Look for Verbs
- Functions, methods: authenticate(), processPayment()
- These are domain operations
Find the Boundaries
- What talks to what?
- What's isolated from what?
- Where are the layers?

Visualization helps:

Presentation Layer (HTTP, UI)
    ↓
Business Logic Layer (Services, Use Cases)
    ↓
Data Layer (Database, External APIs)

Phase 4: Understand Intent vs Implementation

Code shows HOW. History shows WHY.

Git Archaeology

# When was this added?
git log --follow --diff-filter=A -- path/to/file.py

# Why was it changed?
git log -p -- path/to/file.py

# Who knows about it?
git blame path/to/file.py

# Find related changes
git log --all --grep="auth"

Look for Patterns
- Repeated code → probably important concept
- Complex conditionals → business rules
- Try/catch blocks → known failure modes
- Comments saying "HACK" or "TODO" → technical debt
Run the Code
- Set breakpoints, observe values
- Add debug logging
- See what actually happens vs what code says

Reading Strategies

Top-Down Reading

Start from entry points, drill down:

Good for:

Understanding overall flow
Finding where to start modifying
Grasping system architecture

Process:

Find entry point (main, handler, controller)
Read function signature → what goes in, what comes out?
Scan function body → what are the major steps?
Drill into interesting steps
Recurse

Bottom-Up Reading

Start from utilities, build up:

Good for:

Understanding specific components
Learning domain abstractions
When top-down is overwhelming

Process:

Find leaf functions (no dependencies)
Understand what they do
Find what calls them
Build mental model upward
Eventually reach entry points

Breadth-First Reading

Survey everything shallowly first:

Good for:

Very large codebases
Finding what matters
Avoiding rabbit holes

Process:

List all files
Read first 20 lines of each
Note interesting/critical files
Deep dive only into those

Tools and Techniques

Static Analysis

# Find all uses of a function
grep -r "authenticate" --include="*.py" --include="*.js" --include="*.cs"

# Find all classes/types
grep -r "^class " --include="*.py" --include="*.js" --include="*.ts"
grep -r "^\s*public class " --include="*.cs"

# Find configuration
find . -name "*.config" -o -name ".env*" -o -name "settings*" -o -name "appsettings*.json" -o -name "web.config"

# Count lines by directory (complexity proxy)
find . \( -name "*.py" -o -name "*.js" -o -name "*.cs" \) -exec wc -l {} + | sort -n

Dynamic Analysis

Run code with instrumentation:

# Python: Add logging to understand flow
import logging
logging.basicConfig(level=logging.DEBUG)

def mystery_function(data):
    logging.debug(f"mystery_function called with: {data}")
    result = complex_operation(data)
    logging.debug(f"mystery_function returning: {result}")
    return result

// C#: Add logging to understand flow
using Microsoft.Extensions.Logging;

public class MyService
{
    private readonly ILogger<MyService> _logger;

    public string MysteryFunction(string data)
    {
        _logger.LogDebug("MysteryFunction called with: {Data}", data);
        var result = ComplexOperation(data);
        _logger.LogDebug("MysteryFunction returning: {Result}", result);
        return result;
    }
}

Use debugger:

Set breakpoint at entry
Step through execution
Inspect variables at each step
Note: "Ah, that's where X comes from!"

Pattern Recognition

Common legacy patterns:

Pattern	What it means
Singleton	Global state (often problematic)
Factory	Multiple implementations of same interface
Strategy	Pluggable behavior
Template Method	Framework with customization points
Null checks everywhere	Missing contracts, defensive programming
Try/catch around everything	Unstable dependencies or unknown errors

Checklist

Identified entry points (where does execution start?)
Mapped directory structure and module organization
Found any existing documentation (README, comments, tests)
Traced data flow for concrete example (input → processing → output)
Identified core domain concepts (entities, operations)
Understood system layers and boundaries
Researched git history (when added, why changed, who knows)
Recognized patterns and their purposes
Ran code with debugging/logging to verify understanding

Red Flags - You're Doing It Wrong

"I'll just start changing things" - You don't understand yet. Archaeology first.
"There's too much code to read" - You don't read all of it. Use strategies.
"The code is self-explanatory" - Code shows HOW, not WHY. History matters.
"I'll figure it out as I go" - Plan and map first, modify second.
Diving into details before understanding big picture - Breadth before depth.

Common Mistakes

Mistake	Reality
"Reading code is slow, I'll just fix it"	Understanding saves hours of debugging. Read first.
"I understand what this function does" (after 30 seconds)	Functions exist in context. Understand their role in the system.
"This is bad code, I'll rewrite it"	It's solving a problem you don't understand yet. Archaeology first.
"I'll read it all linearly"	100k LOC linearly = weeks. Use strategies.
"Comments and docs are outdated, I'll ignore them"	Even outdated docs have clues. They show original intent.

Examples

Example 1: Understanding Authentication (Node.js)

Goal: Understand how login works

Archaeology process:

1. Entry point: POST /api/login endpoint
2. Trace: router.js → authController.login() → AuthService.authenticate()
3. Data flow: {email, password} → validate → query DB → generate JWT
4. Git history: Added in commit abc123 "Implement JWT authentication"
5. Why: Comments mention "replaced session-based auth for API scalability"
6. Pattern: Token-based stateless authentication
7. Tests: test/auth.test.js shows expected behavior

Mental model: Stateless auth using JWT, validates credentials, returns token for subsequent requests.

Example 1b: Understanding Authentication (.NET)

Goal: Understand how authentication middleware works

Archaeology process:

1. Entry point: Program.cs or Startup.cs → app.UseAuthentication()
2. Trace: Middleware pipeline → [Authorize] attribute → AuthenticationHandler
3. Data flow: HTTP request → Extract token → Validate → Create ClaimsPrincipal → Controller
4. Git history: Added in commit def456 "Switch to JWT bearer authentication"
5. Why: Comments mention "migrated from cookie auth to support mobile clients"
6. Pattern: ASP.NET Core authentication middleware with JWT bearer
7. Tests: AuthenticationTests.cs shows token validation scenarios

Mental model: Middleware-based authentication, token validation happens early in pipeline, user identity flows through HttpContext.User to controllers.

Example 2: Mystery Business Logic

Code:

def calculate_price(item, user):
    base = item.price
    if user.is_premium:
        base *= 0.9
    if user.referral_count > 5:
        base *= 0.95
    if item.category == "seasonal":
        base *= 1.2
    return round(base, 2)

Archaeology:

Git log:
- Line 1-2: Original implementation
- Line 3-4: "Add premium discount per marketing request"
- Line 5-6: "Incentivize referrals" (commit by PM)
- Line 7-8: "Seasonal markup for inventory management"

Pattern: Accumulation of business rules over time
Why complex: Different stakeholders, different times

Mental model: Pricing is business-driven, not technical. Don't "simplify" without understanding business constraints.

Integration with Other Skills

skills/research/tracing-knowledge-lineages - Why does this code exist?
skills/refactoring/characterization-testing - After understanding, add safety net
skills/understanding/questioning-techniques - Systematic questions to ask
skills/debugging/root-cause-tracing - When code breaks, trace back to source

Remember

Read code like a detective: look for clues about intent
Use multiple strategies: top-down, bottom-up, breadth-first
Git history is treasure: when/why/who
Run code to verify understanding
Pattern recognition speeds comprehension
Understand before changing

Code Archaeology

Install Skill

SKILL.md

Code Archaeology

Overview

When to Use

The Archaeology Process

Phase 1: Survey the Landscape

Phase 2: Follow the Data

Phase 3: Identify Core Abstractions

Phase 4: Understand Intent vs Implementation

Reading Strategies

Top-Down Reading

Bottom-Up Reading

Breadth-First Reading

Tools and Techniques

Static Analysis

Dynamic Analysis

Pattern Recognition

Checklist

Red Flags - You're Doing It Wrong

Common Mistakes

Examples

Example 1: Understanding Authentication (Node.js)

Example 1b: Understanding Authentication (.NET)

Example 2: Mystery Business Logic

Integration with Other Skills

Remember