| name | debug-phase |
| description | Standard Operating Procedure for /debug phase. Covers error investigation, root cause analysis, systematic debugging, and error documentation. |
| allowed-tools | Read, Write, Edit, Grep, Glob, Bash |
Debug Phase: Standard Operating Procedure
Training Guide: Step-by-step procedures for executing the
/debugcommand with systematic error investigation and documentation.
Supporting references:
- reference.md - Error classification matrix, debugging workflow, logging best practices
- examples.md - Good debugging (systematic, documented) vs bad (trial-and-error)
- scripts/error-classifier.sh - Classifies errors by type and severity
Phase Overview
Purpose: Systematically investigate errors, identify root causes, implement fixes, and document findings to prevent recurrence.
Inputs:
- Error reports (logs, stack traces, user reports)
error-log.md(if exists) - Historical error records
Outputs:
error-log.md- Updated with new error entry (ERR-XXXX)- Fix implementation (code changes)
- Test cases to prevent recurrence
- Updated
workflow-state.yaml
Expected duration: 30 minutes - 4 hours (depends on error complexity)
Prerequisites
Environment checks:
- Error can be reproduced
- Logs available (application logs, error logs, stack traces)
- Development environment set up
Knowledge requirements:
- Debugging techniques (binary search, logging, breakpoints)
- Error classification (syntax, runtime, logic, integration)
- Root cause analysis (5 Whys, fishbone diagram)
Execution Steps
Step 1: Search for Similar Errors
Actions:
Check if
error-log.mdexists and search for similar errors:# Search by error message grep -i "connection timeout" error-log.md # Search by component grep "Component: StudentProgressService" error-log.md # Search by error type grep "Type: Integration" error-log.mdIf similar error found:
- Review previous fix (was it a workaround or root cause fix?)
- Check if error recurred (implement permanent fix this time)
- Note patterns (same component? same conditions?)
Quality check: If recurring error (>2 occurrences), prioritize permanent fix over workaround.
Step 2: Reproduce Error Consistently
Actions:
Gather error context:
- Error message and stack trace
- Timestamp and frequency
- User actions that triggered error
- Environment (dev, staging, production)
- Browser/OS (if frontend error)
Create minimal reproduction:
# Example: API error curl -X POST http://localhost:3000/api/endpoint \ -H "Content-Type: application/json" \ -d '{"field": "value"}' # Example: Frontend error # Open browser, navigate to /page, click button XVerify reproduction:
- Can you trigger error reliably (100% of time)?
- If intermittent: identify conditions (timing, data, state)
Quality check: Error reproduces consistently before proceeding to investigation.
Step 3: Classify Error
Actions: Use error classification matrix (see reference.md):
By type:
- Syntax: Code doesn't compile/parse (typos, missing brackets)
- Runtime: Code runs but crashes (null pointer, type error)
- Logic: Code runs but produces wrong result
- Integration: External dependency fails (API, database, service)
- Performance: Code works but too slow (timeout, memory)
By severity:
- Critical: Data loss, security breach, total system failure
- High: Feature broken, blocks users
- Medium: Feature degraded, workaround exists
- Low: Minor UX issue, no functional impact
Example:
## Error Classification
**Type**: Integration (API call to external service fails)
**Severity**: High (dashboard doesn't load, blocks teachers)
**Component**: StudentProgressService.fetchExternalData()
**Frequency**: 30% of requests (intermittent)
Quality check: Error type and severity correctly identified.
Step 4: Isolate Root Cause
Actions: Use systematic debugging techniques:
1. Binary search (for large codebases):
# Add logging at midpoint
# If error before midpoint: investigate first half
# If error after midpoint: investigate second half
# Repeat until narrowed to specific function
2. Increase logging:
# Add debug logs around suspected area
logger.debug(f"Before API call: student_id={student_id}")
response = api.fetch_data(student_id)
logger.debug(f"After API call: response={response}")
3. Use breakpoints (if applicable):
# Python debugger
python -m pdb script.py
# Node.js debugger
node inspect server.js
# Browser debugger (Frontend)
# Set breakpoints in Chrome DevTools
4. Check assumptions:
- Is variable the value you expect?
- Is function being called when you expect?
- Are external dependencies available?
5. Apply 5 Whys:
1. Why did the dashboard fail to load?
→ API call to external service timed out
2. Why did the API call timeout?
→ Request took >30 seconds (timeout limit)
3. Why did the request take >30 seconds?
→ External service response time was 45 seconds
4. Why was external service so slow?
→ Requesting too much data (1000 records instead of 10)
5. Why requesting 1000 records?
→ Missing pagination parameter in API call
**Root cause**: Missing pagination parameter causes over-fetching
Quality check: Root cause identified (not just symptoms).
Step 5: Implement Fix
Actions:
Write failing test (if not exists):
def test_fetch_external_data_with_pagination(): """Test that API call includes pagination parameter""" service = StudentProgressService() result = service.fetchExternalData(student_id=123) # Verify pagination applied assert len(result) <= 10 # Max 10 records per page assert result.has_next_page # Pagination metadata presentImplement fix:
# Before (broken) def fetchExternalData(self, student_id): return api.get(f"/students/{student_id}/data") # After (fixed) def fetchExternalData(self, student_id, page_size=10): return api.get( f"/students/{student_id}/data", params={"page_size": page_size} )Verify fix:
# Run test pytest tests/test_student_progress_service.py::test_fetch_external_data_with_pagination # Should pass # Manually test reproduction steps # Error should no longer occur
Quality check: Fix addresses root cause (not just symptoms), tests pass.
Step 6: Prevent Recurrence
Actions:
Add tests for error scenario:
- Unit test for the specific bug
- Integration test for the user flow
- Regression test to catch if bug reintroduced
Add defensive code (if applicable):
# Validate inputs if page_size > 100: raise ValueError("page_size must be ≤100") # Handle errors gracefully try: result = api.get(url, params=params, timeout=5) except requests.Timeout: logger.error(f"API timeout: {url}") raise ServiceUnavailable("External service unavailable")Improve logging (if needed):
# Log API calls with parameters logger.info(f"Fetching external data: student_id={student_id}, page_size={page_size}") # Log errors with context logger.error( f"API call failed: {url}", extra={"student_id": student_id, "params": params, "status_code": response.status_code} )
Quality check: Tests added, defensive code in place, logging improved.
Step 7: Document in error-log.md
Actions:
Assign error ID (ERR-XXXX):
# Get next available ID LAST_ID=$(grep -oP "ERR-\K\d+" error-log.md | sort -n | tail -1) NEXT_ID=$((LAST_ID + 1)) # e.g., ERR-0042Add error entry to
error-log.md:## ERR-0042: Dashboard timeout due to missing pagination **Date**: 2025-10-21 **Reporter**: John Doe (teacher) **Component**: StudentProgressService.fetchExternalData() **Type**: Integration **Severity**: High **Error Message**:TimeoutError: Request exceeded 30 second timeout at api.get (/src/services/api.ts:45)
**Steps to Reproduce**: 1. Navigate to /dashboard as teacher 2. Click "View Progress" for student with 1000+ activities 3. Dashboard hangs for 45 seconds, then shows timeout error **Root Cause**: Missing pagination parameter in API call to external service caused over-fetching (1000 records instead of 10), resulting in >30s response time exceeding timeout. **Fix**: Added `page_size=10` parameter to API call in StudentProgressService.fetchExternalData() **Files Changed**: - `api/app/services/student_progress_service.py`: Added pagination parameter - `api/app/tests/services/test_student_progress_service.py`: Added regression test **Prevention**: - Added unit test to verify pagination parameter included - Added input validation (max page_size=100) - Improved API logging to track parameters **Related Errors**: None **Status**: Fixed (2025-10-21)
Quality check: Error documented with all required fields.
Step 8: Commit Fix
Actions:
git add api/app/services/student_progress_service.py \
api/app/tests/services/test_student_progress_service.py \
error-log.md
git commit -m "fix: add pagination to external data fetch (ERR-0042)
Root cause: Missing pagination parameter caused over-fetching
Impact: Dashboard timeout for students with 1000+ activities
Fix: Added page_size=10 parameter to API call
Changes:
- Added pagination parameter to fetchExternalData()
- Added regression test for pagination
- Updated error-log.md with ERR-0042
Prevents recurrence: Input validation + logging + tests
"
Quality check: Fix committed with detailed message referencing error ID.
Common Mistakes to Avoid
🚫 Recurring Errors Not Recognized
Impact: Wasted debugging time, user frustration, technical debt
Scenario:
Same "Connection timeout" error debugged 3 times
Each time: Added 5-second delay as workaround
Never investigated root cause (connection pool exhaustion)
Prevention:
- Search error-log.md BEFORE debugging
- If error seen before: Review previous fix
- If workaround was used: Implement permanent fix this time
- Document patterns for future reference
🚫 Insufficient Error Context
Impact: Hard to reproduce, longer debug time, incomplete fixes
Bad error log:
## Error: Something broke
**Date**: 2025-10-21
**Fix**: Restarted server
Good error log:
## ERR-0042: Dashboard timeout due to missing pagination
**Date**: 2025-10-21 14:30:00 UTC
**Component**: StudentProgressService.fetchExternalData()
**Error Message**: [full stack trace]
**Steps to Reproduce**: [detailed steps]
**Root Cause**: [5 Whys analysis]
**Fix**: [specific code changes]
**Prevention**: [tests + logging added]
Prevention: Use error-log template with required fields
🚫 Treating Symptoms Instead of Root Cause
Impact: Error recurs, user frustration, wasted time
Scenario:
Symptom: API timeout
Quick fix: Increase timeout from 30s to 60s
Result: Still times out, just takes longer
Root cause: Over-fetching 1000 records
Real fix: Add pagination (response <1s)
Prevention: Use 5 Whys to find root cause, don't stop at first symptom
🚫 No Tests for Bug Fix
Impact: Bug can be reintroduced, no regression detection
Prevention:
- Always add regression test for the bug
- Test should fail before fix, pass after fix
- Run test suite to ensure no other breakage
🚫 Trial-and-Error Debugging
Impact: Wasted time, incomplete understanding, fragile fixes
Bad approach:
1. Change timeout value → still broken
2. Add try/catch → hides error
3. Restart service → works temporarily
4. Don't know why it works, ship it
Good approach:
1. Reproduce error consistently
2. Add logging to isolate cause
3. Identify root cause with 5 Whys
4. Implement targeted fix
5. Add tests to prevent recurrence
6. Document in error-log.md
Prevention: Follow systematic debugging workflow
Best Practices
✅ Systematic Debugging Workflow
Follow consistent process:
1. Search error-log.md for similar errors
2. Reproduce error consistently
3. Classify error (type + severity)
4. Isolate root cause (5 Whys, logging, breakpoints)
5. Implement fix + tests
6. Prevent recurrence (defensive code + logging)
7. Document in error-log.md with ERR-XXXX
8. Commit with detailed message
Result: Faster debugging, knowledge accumulation, no recurrence
✅ Error Log Template
Use consistent format:
## ERR-XXXX: [Short description]
**Date**: [ISO 8601 timestamp]
**Reporter**: [Name/email]
**Component**: [File/function name]
**Type**: [Syntax/Runtime/Logic/Integration/Performance]
**Severity**: [Critical/High/Medium/Low]
**Error Message**:
[Full stack trace]
**Steps to Reproduce**:
1. [Step 1]
2. [Step 2]
3. [Expected vs actual]
**Root Cause**:
[5 Whys analysis]
**Fix**:
[Specific code changes]
**Files Changed**:
- [file1]: [what changed]
- [file2]: [what changed]
**Prevention**:
- [Tests added]
- [Logging improved]
- [Defensive code]
**Related Errors**: [ERR-YYYY, ERR-ZZZZ] or None
**Status**: Fixed ([date]) or In Progress
✅ 5 Whys Root Cause Analysis
Ask "why" 5 times:
1. Why did X fail? → Because Y
2. Why did Y happen? → Because Z
3. Why did Z happen? → Because A
4. Why did A happen? → Because B
5. Why did B happen? → Because [root cause]
Stop when: Reached actionable root cause (not symptoms)
Phase Checklist
Pre-phase checks:
- Error can be reproduced
- Logs/stack traces available
- Development environment ready
During phase:
- Searched error-log.md for similar errors
- Error reproduced consistently
- Error classified (type + severity)
- Root cause identified (not symptoms)
- Fix implemented and tested
- Regression tests added
- Error documented in error-log.md
Post-phase validation:
- Fix verified (error no longer occurs)
- Tests pass (including new regression tests)
- error-log.md updated with ERR-XXXX
- Fix committed with detailed message
- workflow-state.yaml updated
Quality Standards
Debugging quality targets:
- Average debug time: <2 hours
- Error recurrence rate: <10%
- Errors documented: 100%
What makes good debugging:
- Systematic approach (consistent workflow)
- Root cause identified (not symptoms)
- Fix tested (regression tests added)
- Well documented (error-log.md entry complete)
- Knowledge shared (prevents recurrence)
What makes bad debugging:
- Trial-and-error (no systematic approach)
- Treats symptoms (root cause unknown)
- No tests (bug can recur)
- Undocumented (knowledge lost)
- Quick workarounds (technical debt)
Completion Criteria
Phase is complete when:
- Error no longer reproduces
- Root cause identified and fixed
- Tests added (regression + prevention)
- error-log.md updated with ERR-XXXX
- Fix committed
- workflow-state.yaml shows
currentPhase: debugandstatus: completed
Ready to proceed:
- All tests pass
- Feature works as expected
- No known errors remaining
Troubleshooting
Issue: Cannot reproduce error Solution: Gather more context (logs, user steps, environment), check for timing/data dependencies
Issue: Error is intermittent Solution: Identify conditions (timing, load, data state), add logging to narrow down triggers
Issue: Root cause unclear Solution: Use 5 Whys, add more logging, use debugger/breakpoints, consult error-log.md patterns
Issue: Fix works but not sure why Solution: Insufficient understanding - continue investigation, don't ship unclear fixes
This SOP guides the debug phase. Refer to reference.md for debugging techniques and examples.md for debugging patterns.