| name | intermittent-issue-debugging |
| description | Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior. |
Intermittent Issue Debugging
Overview
Intermittent issues are among the hardest to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.
When to Use
- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues
Instructions
1. Capturing Intermittent Issues
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code
function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);
  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}
// Strategy 2: Correlation IDs
// Track a single request across systems by tagging every log line and outbound call
const orderId = 123;
const correlationId = generateId(); // any unique ID generator works here
logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});
chargeCard(orderId, { headers: { correlationId } });
logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});
// Later, grep logs by correlationId to see the full trace across services
// Strategy 3: Error Sampling
// Capture full error context when an error occurs
window.addEventListener('error', (event) => {
  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize, // Chrome-only, hence the optional chaining
    timestamp: new Date().toISOString()
  };
  sendToMonitoring(errorData); // Send to error tracking
});
2. Common Intermittent Issues
Issue: Race Condition
Symptom: Inconsistent behavior depending on timing
Example:
Thread 1: Read count (5)
Thread 2: Read count (5), increment to 6, write
Thread 1: Increment to 6, write (overwriting Thread 2's update)
Result: count should be 7, but is 6
Debug:
1. Add detailed timestamps
2. Log all operations
3. Look for overlapping operations
4. Check if order matters
Solution (see the sketch below):
- Use locks/mutexes
- Use atomic operations
- Use message queues
- Ensure single writer
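A minimal sketch, in Node-style JavaScript, of the single-writer approach: updates are chained through a promise queue so the read-modify-write never interleaves. The in-memory `store` and the simulated delay are hypothetical stand-ins for real shared state.
// Serialize updates through a promise queue so only one writer runs at a time
const store = { count: 5 }; // hypothetical shared state
let queue = Promise.resolve();
function withLock(fn) {
  const result = queue.then(fn);   // run fn only after all prior operations finish
  queue = result.catch(() => {});  // keep the chain alive even if fn throws
  return result;
}
async function increment() {
  return withLock(async () => {
    const current = store.count;                // read
    await new Promise(r => setTimeout(r, 10));  // simulate work between read and write
    store.count = current + 1;                  // write, never interleaved with another writer
    return store.count;
  });
}
// Two concurrent increments now yield 7 instead of losing an update
Promise.all([increment(), increment()]).then(() => console.log(store.count)); // 7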
---
Issue: Timing-Dependent Bug
Symptom: Test passes sometimes, fails others
Example:
test_user_creation:
1. Create user (sometimes slow)
2. Check user exists
3. Fails if create took too long
Debug:
- Add timeout logging
- Increase wait time
- Add explicit waits
- Mock slow operations
Solution (see the sketch below):
- Explicit wait for condition
- Remove time-dependent assertions
- Use proper test fixtures
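A minimal sketch of replacing a fixed sleep with an explicit wait for the condition; `createUser` and `getUser` are hypothetical test helpers, and the timeout values are illustrative.
// Poll until the condition holds (or time out) instead of asserting after a fixed delay
async function waitFor(condition, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise(r => setTimeout(r, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}
// In the flaky test: wait for the user to exist instead of assuming creation is instant
// await createUser('alice');
// await waitFor(async () => (await getUser('alice')) !== null);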
---
Issue: Resource Exhaustion
Symptom: Works fine at first, then fails after running for a while
Example:
- Memory grows over time
- Connection pool exhausted
- Disk space fills up
- Max open files reached
Debug:
- Monitor resources continuously (see the sketch below)
- Check for leaks (memory growth)
- Monitor connection count
- Check long-running processes
Solution:
- Fix memory leak
- Increase resource limits
- Implement cleanup
- Add monitoring/alerts
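A minimal Node.js sketch of continuous resource monitoring, as referenced in the Debug list above; the 500 MB threshold and one-minute interval are illustrative.
// Log heap usage on an interval so slow leaks show up as a trend in the logs
const MB = 1024 * 1024;
setInterval(() => {
  const { heapUsed, heapTotal, rss } = process.memoryUsage();
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    heap_used_mb: Math.round(heapUsed / MB),
    heap_total_mb: Math.round(heapTotal / MB),
    rss_mb: Math.round(rss / MB)
  }));
  if (heapUsed > 500 * MB) {
    console.warn('Heap above 500 MB, possible leak'); // threshold is illustrative
  }
}, 60_000);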
---
Issue: Intermittent Network Failure
Symptom: API calls occasionally fail
Debug:
- Check network logs
- Identify timeout patterns
- Check if time-of-day dependent
- Check if load dependent
Solution:
- Implement retries with exponential backoff (see the sketch below)
- Add circuit breaker
- Increase timeout
- Add redundancy
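A minimal sketch of retrying with exponential backoff plus jitter; the attempt count, base delay, and `callExternalApi` helper are illustrative.
// Retry a flaky operation with exponentially growing, jittered delays
async function withRetry(fn, { maxAttempts = 5, baseDelayMs = 200 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100; // backoff + jitter
      console.warn(`Attempt ${attempt} failed (${error.message}), retrying in ${Math.round(delay)}ms`);
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
// Usage: await withRetry(() => callExternalApi()); // callExternalApi is a hypothetical flaky call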
3. Systematic Investigation Process
Step 1: Understand the Pattern
Questions:
- How often does it occur? (1/100, 1/1000?)
- When does it occur? (time of day, load, specific user?)
- What are the conditions? (network, memory, load?)
- Is it reproducible? (deterministic or random?)
- Any recent changes?
Analysis:
- Review error logs
- Check error rate trends (see the sketch below)
- Identify patterns
- Correlate with changes
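A minimal sketch of the pattern analysis, assuming JSON-lines logs with `timestamp` and `level` fields (a hypothetical format); it tallies errors per hour to surface time-of-day or load-related patterns.
// Count errors per hour from a JSON-lines log file to spot patterns
const fs = require('fs');
const byHour = {};
for (const line of fs.readFileSync('app.log', 'utf8').split('\n')) {
  if (!line.trim()) continue;
  let entry;
  try { entry = JSON.parse(line); } catch { continue; } // skip non-JSON lines
  if (entry.level !== 'error' || !entry.timestamp) continue;
  const hour = entry.timestamp.slice(0, 13); // e.g. "2024-05-01T14"
  byHour[hour] = (byHour[hour] || 0) + 1;
}
console.table(byHour);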
Step 2: Reproduce Reliably
Methods:
- Increase test frequency (run 1000 times)
- Stress test (heavy load)
- Simulate poor conditions (network, memory)
- Run on different machines
- Run in production-like environment
Goal: Make the issue occur consistently enough to analyze (see the sketch below)
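A minimal sketch of brute-force reproduction: run the suspect operation many times and record how often, and how, it fails. `flakyOperation` is a hypothetical stand-in for the code under suspicion.
// Run the operation many times and collect every failure for later analysis
async function hammer(fn, runs = 1000) {
  const failures = [];
  for (let i = 0; i < runs; i++) {
    try {
      await fn();
    } catch (error) {
      failures.push({ run: i, message: error.message });
    }
  }
  console.log(`${failures.length}/${runs} runs failed`);
  return failures;
}
// hammer(() => flakyOperation());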
Step 3: Add Instrumentation
- Add detailed logging
- Add monitoring metrics
- Add trace IDs
- Capture errors fully
- Log system state (see the sketch below)
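A minimal Node.js sketch of capturing system state to attach to error logs; which fields to capture is a judgment call, and `logger` mirrors the earlier examples.
// Snapshot process and host state so it can be attached to error logs
const os = require('os');
function captureState() {
  return {
    ts: new Date().toISOString(),
    heap_used: process.memoryUsage().heapUsed,
    uptime_s: Math.round(process.uptime()),
    load_avg: os.loadavg(),
    free_mem: os.freemem()
  };
}
// logger.error({ correlationId, error: err.message, state: captureState() });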
Step 4: Capture the Issue
- Recreate scenario
- Capture full context
- Note system state
- Document conditions
- Get reproduction case
Step 5: Analyze Data
- Review logs
- Look for patterns
- Compare normal vs error cases
- Check timing correlations
- Identify root cause
Step 6: Implement Fix
- Based on root cause
- Verify with reproduction case
- Test extensively
- Add regression test
4. Monitoring & Prevention
Monitoring Strategy:
Real User Monitoring (RUM):
- Error rates by feature
- Latency percentiles
- User impact
- Trend analysis
Application Performance Monitoring (APM):
- Request traces
- Database query performance
- External service calls
- Resource usage
Synthetic Monitoring:
- Regular test execution (see the sketch below)
- Simulate user flows
- Alert on failures
- Trend tracking
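A minimal sketch of a synthetic check; the health endpoint, 2-second latency budget, and `sendAlert` helper are all hypothetical, and global `fetch` assumes Node 18+.
// Exercise a critical endpoint on a schedule and alert on failure or slowness
async function syntheticCheck() {
  const start = Date.now();
  try {
    const res = await fetch('https://example.com/api/health'); // hypothetical endpoint
    const latency = Date.now() - start;
    if (!res.ok || latency > 2000) {
      await sendAlert(`Health check degraded: status=${res.status}, latency=${latency}ms`); // sendAlert is hypothetical
    }
  } catch (error) {
    await sendAlert(`Health check failed: ${error.message}`);
  }
}
setInterval(syntheticCheck, 5 * 60 * 1000); // every 5 minutes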
---
Alerting:
Set up alerts for:
- Error rate spikes (see the sketch below)
- Response time above threshold
- Memory growth trend
- Failed transactions
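A minimal sketch of an error-rate-spike alert; the one-minute window, 20-request minimum, and 5% threshold are illustrative, and `recordRequest` would be called wherever requests complete.
// Track recent request outcomes in a sliding window and warn when the error rate spikes
const WINDOW_MS = 60_000;
const events = []; // { ts, ok }
function recordRequest(ok) {
  const now = Date.now();
  events.push({ ts: now, ok });
  while (events.length && events[0].ts < now - WINDOW_MS) events.shift();
  const errors = events.filter(e => !e.ok).length;
  const rate = errors / events.length;
  if (events.length >= 20 && rate > 0.05) {
    console.warn(`Error rate ${Math.round(rate * 100)}% over the last minute`); // alert hook goes here
  }
}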
---
Prevention Checklist:
[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked
---
Tools:
Monitoring:
- New Relic / DataDog
- Prometheus / Grafana
- Sentry / Rollbar
- Custom logging
Testing:
- Load testing (k6, JMeter)
- Chaos engineering (Gremlin)
- Property-based testing (Hypothesis)
- Fuzz testing
Debugging:
- Distributed tracing (Jaeger)
- Correlation IDs
- Detailed logging
- Debuggers
Key Points
- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting