| name | log-analysis |
| description | Analyze application and system logs to identify errors, patterns, and root causes. Use log aggregation tools and structured logging for effective debugging. |
Log Analysis
Overview
Logs are a primary source of evidence for debugging and monitoring. Analyzing them systematically surfaces issues quickly and supports root cause analysis.
When to Use
- Troubleshooting errors
- Performance investigation
- Security incident analysis
- Auditing user actions
- Monitoring application health
Instructions
1. Structured Logging
// Good: Structured logs (machine-readable)
logger.info({
  level: 'INFO',
  timestamp: '2024-01-15T10:30:00Z',
  service: 'auth-service',
  user_id: '12345',
  action: 'user_login',
  status: 'success',
  duration_ms: 150,
  ip_address: '192.168.1.1'
});
// Bad: Unstructured logs (hard to parse)
console.log('User 12345 logged in successfully in 150ms from 192.168.1.1');
// JSON Format (Elasticsearch friendly)
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "code": "ECONNREFUSED"
  },
  "context": {
    "database": "users",
    "operation": "SELECT"
  }
}
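The JSON format above can be produced by a small formatting helper. The following is a minimal sketch, not a replacement for a real logging library; `formatLogEntry` and its parameters are hypothetical names introduced here for illustration.

```javascript
// Hypothetical helper: builds one JSON log line with the core fields shown above.
// Caller-supplied fields are spread first, so they cannot overwrite
// '@timestamp', level, service, or message.
function formatLogEntry(level, service, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    ...fields,
    '@timestamp': now.toISOString(),
    level,
    service,
    message,
  });
}

// Usage: emit one line per event on stdout for a log collector to pick up.
const line = formatLogEntry(
  'ERROR',
  'api-gateway',
  'Database connection failed',
  { trace_id: 'abc123', error: { type: 'ConnectionError', code: 'ECONNREFUSED' } },
  new Date('2024-01-15T10:30:00Z')
);
console.log(line);
```

Writing one JSON object per line keeps each event independently parseable, which is what most aggregation pipelines expect.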
2. Log Levels & Patterns
Log Levels:
DEBUG: Detailed diagnostic info
- Variable values
- Function entry/exit
- Intermediate calculations
- Use: Development only
INFO: General informational messages
- Startup/shutdown
- User actions
- Configuration changes
- Use: Production (normal operations)
WARN: Warning messages (potential issues)
- Deprecated API usage
- Performance degradation
- Resource limits approaching
- Use: Production (investigate soon)
ERROR: Error conditions
- Failed operations
- Exceptions
- Failed requests
- Use: Production (action required)
FATAL/CRITICAL: System unusable
- Critical failures
- Out of memory
- Data corruption
- Use: Production (immediate action)
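The level ordering above is typically enforced with a severity threshold: an entry is written only if its level is at or above the configured minimum. A minimal sketch follows, assuming numeric severities chosen here for illustration; `LEVELS` and `shouldLog` are hypothetical names.

```javascript
// Assumed numeric severities; only the relative order matters.
const LEVELS = new Map([
  ['DEBUG', 10],
  ['INFO', 20],
  ['WARN', 30],
  ['ERROR', 40],
  ['FATAL', 50],
]);

// Returns true when an entry at `level` should be written given `minLevel`.
function shouldLog(level, minLevel) {
  if (!LEVELS.has(level) || !LEVELS.has(minLevel)) {
    throw new Error(`Unknown log level: ${level} / ${minLevel}`);
  }
  return LEVELS.get(level) >= LEVELS.get(minLevel);
}

// Usage: with a production threshold of INFO, DEBUG entries are dropped.
console.log(shouldLog('DEBUG', 'INFO'));
console.log(shouldLog('ERROR', 'INFO'));
```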
---
Log Patterns:
Request Logging:
- Request ID (trace_id)
- Method + Path
- Status code
- Duration
- Request size / response size
Error Logging:
- Error type/code
- Error message
- Stack trace
- Context (user_id, session_id)
- Timestamp
Business Events:
- Event type
- User involved
- Impact/importance
- Timestamp
- Relevant context
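As an illustration of the error logging pattern, the sketch below turns a JavaScript `Error` into an entry carrying type, code, message, stack trace, context, and timestamp. `buildErrorEntry` is a hypothetical helper, and the field layout is one possible arrangement rather than a required schema.

```javascript
// Hypothetical helper: captures the error logging fields listed above.
function buildErrorEntry(err, context = {}, now = new Date()) {
  return {
    timestamp: now.toISOString(),
    level: 'ERROR',
    error: {
      type: err.name,          // e.g. 'Error', 'TypeError'
      code: err.code ?? null,  // present on many Node.js system errors
      message: err.message,
      stack: err.stack,
    },
    context,                   // e.g. user_id, session_id, trace_id
  };
}

// Usage with sample values.
const err = new Error('Database connection failed');
err.code = 'ECONNREFUSED';
const entry = buildErrorEntry(err, {
  user_id: '12345',
  session_id: 's-1',
  trace_id: 'abc123',
});
console.log(JSON.stringify(entry));
```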
3. Log Analysis Tools
Log Aggregation:
ELK Stack (Elasticsearch, Logstash, Kibana):
- Logstash: Parse and process logs
- Elasticsearch: Search and analyze
- Kibana: Visualization and dashboards
- Use: Large scale, complex queries
Splunk:
- Comprehensive log management
- Real-time search and analysis
- Dashboards and alerts
- Use: Enterprise (expensive)
CloudWatch (AWS):
- Integrated with AWS services
- Log Insights for querying
- Dashboards
- Use: AWS-based systems
Datadog:
- Application performance monitoring
- Log management
- Real-time alerts
- Use: SaaS monitoring
---
Log Analysis Techniques:
Grep/Awk:
grep "ERROR" app.log
awk '{print $1, $4}' app.log
Filtering:
- Filter by timestamp
- Filter by service
- Filter by error type
- Filter by user
Searching:
- Search for error patterns
- Search for user actions
- Search trace IDs
- Search IP addresses
Aggregation:
- Count occurrences
- Group by error type
- Calculate duration percentiles
- Rate of errors over time
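When no aggregation tool is at hand, the same filtering and aggregation steps can be run over JSON-lines logs directly. The following is a minimal sketch under the assumption that each line is one JSON object with `level`, `error.type`, and `duration_ms` fields; the helper names and the sample data are made up for illustration.

```javascript
// Parse JSON-lines text, skipping blank and malformed lines.
function parseJsonLines(text) {
  const entries = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      entries.push(JSON.parse(line));
    } catch {
      // Not valid JSON (e.g. a stray unstructured line): ignore it.
    }
  }
  return entries;
}

// Count occurrences grouped by the key returned from keyFn.
function countBy(entries, keyFn) {
  const counts = new Map();
  for (const entry of entries) {
    const key = keyFn(entry);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Nearest-rank percentile (p between 0 and 100) of a numeric array.
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p / 100) * sorted.length), 1);
  return sorted[rank - 1];
}

// Sample data, invented for this example.
const sampleLog = [
  '{"level":"ERROR","error":{"type":"ConnectionError"},"duration_ms":120}',
  '{"level":"INFO","duration_ms":80}',
  '{"level":"ERROR","error":{"type":"TimeoutError"},"duration_ms":1500}',
  '{"level":"ERROR","error":{"type":"ConnectionError"},"duration_ms":200}',
  'this line is not JSON',
].join('\n');

const entries = parseJsonLines(sampleLog);
const errors = entries.filter((e) => e.level === 'ERROR');
const errorsByType = countBy(errors, (e) => e.error.type);
const durations = entries.map((e) => e.duration_ms);

console.log(Object.fromEntries(errorsByType));
console.log(percentile(durations, 50), percentile(durations, 95));
```

Nearest-rank is only one of several percentile definitions; aggregation tools may interpolate and report slightly different values.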
4. Common Log Analysis Queries
The queries below use generic pseudo-syntax; adapt field names and operators to the query language of your tool (for example Kibana, Splunk, or CloudWatch Log Insights).
Find errors in past hour:
timestamp: last_1h AND level: ERROR
Track user activity:
user_id: 12345 AND action: *
Find slow requests:
duration_ms: >1000 AND level: INFO
Analyze error rate by service:
level: ERROR | stats count by service
Find failed database operations:
error.type: "DatabaseError" | stats count
Trace request flow:
trace_id: "abc123" | sort by timestamp
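To make the intent of two of these queries concrete, the sketch below expresses "errors in the past hour" and "trace request flow" as plain functions over parsed log entries. It assumes each entry has `timestamp` (ISO 8601), `level`, `service`, and `trace_id` fields; `errorsSince`, `traceFlow`, and the sample entries are hypothetical.

```javascript
// Entries at ERROR level whose timestamp falls within the last windowMs.
function errorsSince(entries, now, windowMs) {
  const cutoff = now.getTime() - windowMs;
  return entries.filter(
    (e) => e.level === 'ERROR' && Date.parse(e.timestamp) >= cutoff
  );
}

// All entries for one trace, ordered by timestamp (the input is not mutated).
function traceFlow(entries, traceId) {
  return entries
    .filter((e) => e.trace_id === traceId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}

// Sample data, invented for this example.
const sampleEntries = [
  { timestamp: '2024-01-15T10:30:02Z', level: 'ERROR', service: 'users-db', trace_id: 'abc123' },
  { timestamp: '2024-01-15T10:30:00Z', level: 'INFO', service: 'api-gateway', trace_id: 'abc123' },
  { timestamp: '2024-01-15T08:00:00Z', level: 'ERROR', service: 'auth-service', trace_id: 'zzz999' },
  { timestamp: '2024-01-15T10:30:01Z', level: 'INFO', service: 'auth-service', trace_id: 'abc123' },
];

const ONE_HOUR_MS = 60 * 60 * 1000;
const recentErrors = errorsSince(sampleEntries, new Date('2024-01-15T11:00:00Z'), ONE_HOUR_MS);
const flow = traceFlow(sampleEntries, 'abc123');

console.log(recentErrors.map((e) => e.service));
console.log(flow.map((e) => e.service));
```

Sorting by timestamp reconstructs the order in which services handled the request, provided their clocks are reasonably synchronized.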
---
Checklist:
[ ] Structured logging implemented
[ ] All errors logged with context
[ ] Request IDs/trace IDs used
[ ] Sensitive data not logged (passwords, tokens)
[ ] Log levels used appropriately
[ ] Log retention policy set
[ ] Log sampling for high-volume events
[ ] Alerts configured for errors
[ ] Dashboards created
[ ] Regular log review scheduled
[ ] Log analysis tools accessible
[ ] Team trained on querying logs
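Two of the checklist items, keeping sensitive data out of logs and sampling high-volume events, lend themselves to small helpers. The following is a minimal sketch; the key list in `SENSITIVE_KEYS`, the `[REDACTED]` placeholder, and the helper names are assumptions to adapt, not an exhaustive policy.

```javascript
// Assumed set of sensitive field names, compared case-insensitively.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization']);

// Return a deep copy of `value` with sensitive fields replaced.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}

// Counter-based sampler: returns true for the first call and every n-th call after it.
function makeSampler(n) {
  let count = 0;
  return () => count++ % n === 0;
}

// Usage with sample values.
const safeEntry = redact({
  user_id: '12345',
  password: 'hunter2',
  headers: { Authorization: 'Bearer abc' },
});
console.log(JSON.stringify(safeEntry));

const sampleEveryThird = makeSampler(3);
console.log([1, 2, 3, 4].map(() => sampleEveryThird()));
```

Redacting by key name only catches secrets stored under known field names; secrets embedded in free-text messages need separate handling.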
Key Points
- Use structured JSON logging
- Include trace IDs for request tracking
- Log at the appropriate level (DEBUG/INFO/WARN/ERROR/FATAL)
- Never log sensitive data (passwords, tokens)
- Aggregate logs centrally
- Create dashboards for key metrics
- Alert on error rates and critical issues
- Set and enforce a log retention policy
- Search logs by trace ID for troubleshooting
- Review logs regularly for patterns