| name | Code Archaeology |
| description | Systematic techniques for reading and understanding unfamiliar legacy code without documentation |
| when_to_use | when encountering unfamiliar legacy code and need to understand what it does, why it exists, and how it works |
| version | 1.0.0 |
| languages | all |
Code Archaeology
Overview
Legacy code is code without context. Code archaeology is the systematic process of excavating that context from the code itself, its history, and its runtime behavior.
Core principle: Read code like a detective, not a compiler. Look for clues about intent, not just mechanics.
When to Use
Use code archaeology when:
- Encountering unfamiliar codebase without documentation
- Need to modify code you didn't write
- Investigating why code was written this way
- Before refactoring or adding features
- Understanding legacy system architecture
The Archaeology Process
Phase 1: Survey the Landscape
Before diving into details, get the big picture:
Identify Entry Points
- Main functions, API endpoints
- Event handlers, controllers
- What triggers this code to run?
Map the Territory
- Directory structure
- Module organization
- Key abstractions and their relationships
Find the Documentation
- README files (even outdated ones have clues)
- Code comments (especially "why" comments)
- Tests (they show intended usage)
- Commit messages in git history
Quick mapping command:
# Find all entry points
find . -name "main.*" -o -name "*Controller*" -o -name "*Handler*" -o -name "Program.cs" -o -name "Startup.cs"
# See directory structure
tree -L 3 -I 'node_modules|vendor|__pycache__|bin|obj'
# Find tests (they document behavior)
find . -name "*test*" -o -name "*spec*" -o -name "*.Tests" -o -name "*.Test"
Phase 2: Follow the Data
Understanding flow is key to understanding purpose:
Pick a Concrete Example
- Don't start with abstractions
- Choose specific input/output case
- "What happens when user logs in?"
Trace Forward (from input)
Input → Where does it enter? → How is it transformed? → Where does it go? → What's the output?Trace Backward (from output)
Output → Where is it produced? → What data creates it? → Where does that data come from? → Keep going to the source
Example: Understanding authentication:
Forward: POST /login → router → authController.login() → UserService.authenticate() → Database
Backward: JWT token ← TokenService ← User object ← Database query ← Credentials validation
Phase 3: Identify Core Abstractions
What are the domain concepts?
Look for Nouns
- Classes, types, tables: User, Order, Payment
- These are domain entities
Look for Verbs
- Functions, methods: authenticate(), processPayment()
- These are domain operations
Find the Boundaries
- What talks to what?
- What's isolated from what?
- Where are the layers?
Visualization helps:
Presentation Layer (HTTP, UI)
↓
Business Logic Layer (Services, Use Cases)
↓
Data Layer (Database, External APIs)
Phase 4: Understand Intent vs Implementation
Code shows HOW. History shows WHY.
Git Archaeology
# When was this added? git log --follow --diff-filter=A -- path/to/file.py # Why was it changed? git log -p -- path/to/file.py # Who knows about it? git blame path/to/file.py # Find related changes git log --all --grep="auth"Look for Patterns
- Repeated code → probably important concept
- Complex conditionals → business rules
- Try/catch blocks → known failure modes
- Comments saying "HACK" or "TODO" → technical debt
Run the Code
- Set breakpoints, observe values
- Add debug logging
- See what actually happens vs what code says
Reading Strategies
Top-Down Reading
Start from entry points, drill down:
Good for:
- Understanding overall flow
- Finding where to start modifying
- Grasping system architecture
Process:
- Find entry point (main, handler, controller)
- Read function signature → what goes in, what comes out?
- Scan function body → what are the major steps?
- Drill into interesting steps
- Recurse
Bottom-Up Reading
Start from utilities, build up:
Good for:
- Understanding specific components
- Learning domain abstractions
- When top-down is overwhelming
Process:
- Find leaf functions (no dependencies)
- Understand what they do
- Find what calls them
- Build mental model upward
- Eventually reach entry points
Breadth-First Reading
Survey everything shallowly first:
Good for:
- Very large codebases
- Finding what matters
- Avoiding rabbit holes
Process:
- List all files
- Read first 20 lines of each
- Note interesting/critical files
- Deep dive only into those
Tools and Techniques
Static Analysis
# Find all uses of a function
grep -r "authenticate" --include="*.py" --include="*.js" --include="*.cs"
# Find all classes/types
grep -r "^class " --include="*.py" --include="*.js" --include="*.ts"
grep -r "^\s*public class " --include="*.cs"
# Find configuration
find . -name "*.config" -o -name ".env*" -o -name "settings*" -o -name "appsettings*.json" -o -name "web.config"
# Count lines by directory (complexity proxy)
find . \( -name "*.py" -o -name "*.js" -o -name "*.cs" \) -exec wc -l {} + | sort -n
Dynamic Analysis
Run code with instrumentation:
# Python: Add logging to understand flow
import logging
logging.basicConfig(level=logging.DEBUG)
def mystery_function(data):
logging.debug(f"mystery_function called with: {data}")
result = complex_operation(data)
logging.debug(f"mystery_function returning: {result}")
return result
// C#: Add logging to understand flow
using Microsoft.Extensions.Logging;
public class MyService
{
private readonly ILogger<MyService> _logger;
public string MysteryFunction(string data)
{
_logger.LogDebug("MysteryFunction called with: {Data}", data);
var result = ComplexOperation(data);
_logger.LogDebug("MysteryFunction returning: {Result}", result);
return result;
}
}
Use debugger:
- Set breakpoint at entry
- Step through execution
- Inspect variables at each step
- Note: "Ah, that's where X comes from!"
Pattern Recognition
Common legacy patterns:
| Pattern | What it means |
|---|---|
| Singleton | Global state (often problematic) |
| Factory | Multiple implementations of same interface |
| Strategy | Pluggable behavior |
| Template Method | Framework with customization points |
| Null checks everywhere | Missing contracts, defensive programming |
| Try/catch around everything | Unstable dependencies or unknown errors |
Checklist
- Identified entry points (where does execution start?)
- Mapped directory structure and module organization
- Found any existing documentation (README, comments, tests)
- Traced data flow for concrete example (input → processing → output)
- Identified core domain concepts (entities, operations)
- Understood system layers and boundaries
- Researched git history (when added, why changed, who knows)
- Recognized patterns and their purposes
- Ran code with debugging/logging to verify understanding
Red Flags - You're Doing It Wrong
- "I'll just start changing things" - You don't understand yet. Archaeology first.
- "There's too much code to read" - You don't read all of it. Use strategies.
- "The code is self-explanatory" - Code shows HOW, not WHY. History matters.
- "I'll figure it out as I go" - Plan and map first, modify second.
- Diving into details before understanding big picture - Breadth before depth.
Common Mistakes
| Mistake | Reality |
|---|---|
| "Reading code is slow, I'll just fix it" | Understanding saves hours of debugging. Read first. |
| "I understand what this function does" (after 30 seconds) | Functions exist in context. Understand their role in the system. |
| "This is bad code, I'll rewrite it" | It's solving a problem you don't understand yet. Archaeology first. |
| "I'll read it all linearly" | 100k LOC linearly = weeks. Use strategies. |
| "Comments and docs are outdated, I'll ignore them" | Even outdated docs have clues. They show original intent. |
Examples
Example 1: Understanding Authentication (Node.js)
Goal: Understand how login works
Archaeology process:
1. Entry point: POST /api/login endpoint
2. Trace: router.js → authController.login() → AuthService.authenticate()
3. Data flow: {email, password} → validate → query DB → generate JWT
4. Git history: Added in commit abc123 "Implement JWT authentication"
5. Why: Comments mention "replaced session-based auth for API scalability"
6. Pattern: Token-based stateless authentication
7. Tests: test/auth.test.js shows expected behavior
Mental model: Stateless auth using JWT, validates credentials, returns token for subsequent requests.
Example 1b: Understanding Authentication (.NET)
Goal: Understand how authentication middleware works
Archaeology process:
1. Entry point: Program.cs or Startup.cs → app.UseAuthentication()
2. Trace: Middleware pipeline → [Authorize] attribute → AuthenticationHandler
3. Data flow: HTTP request → Extract token → Validate → Create ClaimsPrincipal → Controller
4. Git history: Added in commit def456 "Switch to JWT bearer authentication"
5. Why: Comments mention "migrated from cookie auth to support mobile clients"
6. Pattern: ASP.NET Core authentication middleware with JWT bearer
7. Tests: AuthenticationTests.cs shows token validation scenarios
Mental model: Middleware-based authentication, token validation happens early in pipeline, user identity flows through HttpContext.User to controllers.
Example 2: Mystery Business Logic
Code:
def calculate_price(item, user):
base = item.price
if user.is_premium:
base *= 0.9
if user.referral_count > 5:
base *= 0.95
if item.category == "seasonal":
base *= 1.2
return round(base, 2)
Archaeology:
Git log:
- Line 1-2: Original implementation
- Line 3-4: "Add premium discount per marketing request"
- Line 5-6: "Incentivize referrals" (commit by PM)
- Line 7-8: "Seasonal markup for inventory management"
Pattern: Accumulation of business rules over time
Why complex: Different stakeholders, different times
Mental model: Pricing is business-driven, not technical. Don't "simplify" without understanding business constraints.
Integration with Other Skills
- skills/research/tracing-knowledge-lineages - Why does this code exist?
- skills/refactoring/characterization-testing - After understanding, add safety net
- skills/understanding/questioning-techniques - Systematic questions to ask
- skills/debugging/root-cause-tracing - When code breaks, trace back to source
Remember
- Read code like a detective: look for clues about intent
- Use multiple strategies: top-down, bottom-up, breadth-first
- Git history is treasure: when/why/who
- Run code to verify understanding
- Pattern recognition speeds comprehension
- Understand before changing