---
name: log-analyzer
description: Parse and analyze application logs to identify errors, patterns, and insights.
---
# Log Analyzer Skill
Parse and analyze application logs to identify errors, patterns, and insights.
## Instructions
You are a log analysis expert. When invoked:
1. **Parse Log Files**:
   - Identify log format (JSON, syslog, Apache, custom)
   - Extract structured data from logs
   - Handle multi-line stack traces
   - Parse timestamps and normalize formats
2. **Analyze Patterns**:
   - Identify error frequency and trends
   - Detect error spikes or anomalies
   - Find common error messages
   - Track error patterns over time
   - Identify correlations between events
3. **Generate Insights**:
   - Most frequent errors
   - Error rate trends
   - Performance metrics from logs
   - User activity patterns
   - System health indicators
4. **Provide Recommendations**:
   - Root cause analysis
   - Suggested fixes for common errors
   - Logging improvements
   - Monitoring suggestions
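The format-detection step above can be sketched as a small heuristic classifier. This is an illustrative sketch, not part of the skill itself; the function name and regexes are assumptions:

```javascript
// Illustrative sketch: classify a raw log line as JSON, Apache/Nginx combined,
// syslog-like, or a generic application format. Heuristics only.
function detectLogFormat(line) {
  const trimmed = line.trim();
  // JSON logs start with a brace and must parse cleanly.
  if (trimmed.startsWith('{')) {
    try { JSON.parse(trimmed); return 'json'; } catch { /* fall through */ }
  }
  // Apache/Nginx combined: IP, identd, user, [timestamp], "request" ...
  if (/^\d{1,3}(\.\d{1,3}){3} \S+ \S+ \[[^\]]+\] "/.test(trimmed)) {
    return 'combined';
  }
  // Syslog-like: "Jan 15 10:30:00 host process[pid]: message"
  if (/^[A-Z][a-z]{2} {1,2}\d{1,2} \d{2}:\d{2}:\d{2} /.test(trimmed)) {
    return 'syslog';
  }
  // Application logs: "2024-01-15 10:30:00 LEVEL [Component] message"
  if (/^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}/.test(trimmed)) {
    return 'application';
  }
  return 'unknown';
}
```

Real logs mix formats, so classifying a sample of lines and taking the majority is usually more robust than trusting a single line.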
## Log Format Detection
### JSON Logs

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "error",
  "message": "Database connection failed",
  "service": "api",
  "userId": "12345",
  "error": {
    "code": "ECONNREFUSED",
    "stack": "Error: connect ECONNREFUSED..."
  }
}
```
### Apache/Nginx Combined Format

```
192.168.1.1 - - [15/Jan/2024:10:30:00 +0000] "GET /api/users HTTP/1.1" 500 1234 "-" "Mozilla/5.0..."
```
### Application Logs

```
2024-01-15 10:30:00 ERROR [UserService] Failed to fetch user: User not found (ID: 12345)
    at UserService.getUser (user-service.js:45:10)
    at async API.handler (api.js:23:5)
```
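Multi-line stack traces like the one above can be grouped with their parent entry by treating any line that does not start with a timestamp as a continuation. A minimal sketch (function name is illustrative):

```javascript
// Illustrative sketch: merge continuation lines (e.g. "    at ..." stack
// frames) into the log entry that precedes them. Assumes each new entry
// starts with an ISO-like timestamp; anything else is a continuation.
function groupMultilineEntries(lines) {
  const entryStart = /^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}/;
  const entries = [];
  for (const line of lines) {
    if (entryStart.test(line) || entries.length === 0) {
      entries.push(line); // start of a new entry
    } else {
      entries[entries.length - 1] += '\n' + line; // continuation / stack frame
    }
  }
  return entries;
}
```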
## Analysis Patterns

### Error Frequency Analysis
```markdown
## Top 10 Errors (Last 24h)

1. **Database connection timeout** (1,234 occurrences)
   - First seen: 2024-01-15 08:00:00
   - Last seen: 2024-01-15 10:30:00
   - Peak: 2024-01-15 09:15:00 (234 errors in 1 min)
   - Affected services: api, worker
   - Impact: High

2. **User not found** (567 occurrences)
   - Pattern: Regular distribution
   - Likely cause: Normal user behavior
   - Impact: Low

3. **Rate limit exceeded** (345 occurrences)
   - Source IPs: 192.168.1.100, 10.0.0.50
   - Pattern: Burst traffic
   - Impact: Medium
```
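A frequency table like the example above can be built by normalizing messages (so "ID: 12345" and "ID: 99999" count as the same error) and tracking first/last seen. An illustrative sketch; the normalization rule is an assumption:

```javascript
// Illustrative sketch: count error occurrences, keyed by a normalized
// message, and track the first/last timestamp for each bucket.
function topErrors(entries, n = 10) {
  const buckets = new Map();
  for (const { timestamp, message } of entries) {
    const key = message.replace(/\b\d+\b/g, 'N'); // collapse IDs/counts
    const b = buckets.get(key) ??
      { count: 0, firstSeen: timestamp, lastSeen: timestamp };
    b.count += 1;
    if (timestamp < b.firstSeen) b.firstSeen = timestamp;
    if (timestamp > b.lastSeen) b.lastSeen = timestamp;
    buckets.set(key, b);
  }
  return [...buckets.entries()]
    .map(([message, stats]) => ({ message, ...stats }))
    .sort((a, b) => b.count - a.count)
    .slice(0, n);
}
```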
### Timeline Analysis
```markdown
## Error Timeline
08:00 - Normal operations (5-10 errors/min)
09:00 - Database connection errors spike (200+ errors/min)
09:15 - Peak error rate (234 errors/min)
09:30 - Database connection restored
10:00 - Return to normal (8-12 errors/min)

## Correlation
- Traffic increased 300% at 09:00
- Database CPU at 95% during incident
- Connection pool exhausted
```
### Performance Metrics
```markdown
## Response Times (from logs)
**Average**: 234ms
**P50**: 180ms
**P95**: 450ms
**P99**: 890ms

**Slow Requests** (>1s):
- /api/search: 2.3s avg (45 requests)
- /api/reports: 1.8s avg (23 requests)

**Fast Requests** (<100ms):
- /api/health: 5ms avg
- /api/status: 12ms avg
```
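Percentile figures like P50/P95/P99 above can be derived from per-request durations with the nearest-rank method; a minimal sketch:

```javascript
// Illustrative sketch: nearest-rank percentile over an array of response
// times (in ms). P50 is the median, P99 the slowest 1% boundary.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Other definitions (linear interpolation, as in NumPy's default) give slightly different values on small samples; for log analysis any consistent definition works.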
## Usage Examples

```
@log-analyzer
@log-analyzer app.log
@log-analyzer --errors-only
@log-analyzer --time-range "last 24h"
@log-analyzer --pattern "database"
@log-analyzer --format json
```
## Report Format
# Log Analysis Report
**Period**: 2024-01-15 00:00:00 to 2024-01-15 23:59:59
**Log File**: /var/log/app.log
**Total Entries**: 145,678
**Errors**: 2,345 (1.6%)
**Warnings**: 8,901 (6.1%)
---
## Executive Summary
- **Critical Issues**: 3
- **High Priority**: 8
- **Medium Priority**: 15
- **Overall Health**: ⚠️ Degraded (Database issues detected)
### Key Findings
1. Database connection pool exhaustion at 09:00-09:30
2. Rate limiting triggered for 2 IP addresses
3. Slow query performance on search endpoint
4. Memory leak warning in worker service
---
## Critical Issues
### 1. Database Connection Pool Exhaustion
**Severity**: Critical
**Occurrences**: 1,234
**Time Range**: 09:00:00 - 09:30:00
**Impact**: Service degradation, failed requests
**Error Pattern**:
```
Error: connect ETIMEDOUT
Error: Too many connections
Error: Connection pool timeout
```
**Root Cause Analysis**:
- Traffic spike (300% increase)
- Connection pool size: 10 (insufficient)
- Connections not being released properly
- No connection timeout configured
**Recommendations**:
1. Increase connection pool size to 50
2. Implement connection timeout (30s)
3. Review connection release logic
4. Add connection pool monitoring
5. Implement circuit breaker pattern
**Code Fix**:
```javascript
// Increase pool size
const pool = new Pool({
  max: 50, // was: 10
  min: 5,
  acquireTimeoutMillis: 30000,
  idleTimeoutMillis: 30000
});

// Ensure connections are released.
// Acquire outside try so `client` is in scope for the finally block.
const client = await pool.connect();
try {
  const result = await client.query('SELECT * FROM users');
  return result;
} finally {
  client.release(); // Always release!
}
```
### 2. Memory Leak in Worker Service

**Severity**: Critical
**First Detected**: 06:00:00
**Pattern**: Memory usage increasing 50MB/hour

**Evidence**:
```
06:00 - Memory: 512MB
09:00 - Memory: 662MB
12:00 - Memory: 812MB
15:00 - Memory: 962MB (WARNING threshold)
```

**Likely Causes**:
- Event listeners not cleaned up
- Cached data not being cleared
- Circular references

**Recommendations**:
- Add heap snapshot analysis
- Review event listener cleanup
- Implement cache eviction policy
- Monitor with heap profiler
## High Priority Issues

### 3. Slow Search Query Performance

**Severity**: High
**Endpoint**: /api/search
**Occurrences**: 45 requests
**Average Response**: 2.3s (target: <500ms)
**Slow Query Examples**:
```
2024-01-15 10:15:23 WARN [SearchService] Query took 2,345ms
SELECT * FROM products WHERE name LIKE '%keyword%'
Rows examined: 1,234,567
```
**Recommendations**:
- Add full-text search index
- Implement pagination (limit results)
- Use Elasticsearch for search
- Add query result caching
### 4. Rate Limit Violations

**Severity**: High
**Affected IPs**: 2
**Requests Blocked**: 345

**Details**:
- **192.168.1.100** (245 blocked requests)
  - Pattern: Automated scraping
  - Recommendation: Consider permanent block
- **10.0.0.50** (100 blocked requests)
  - Pattern: Burst traffic from legitimate user
  - Recommendation: Increase rate limit for authenticated users
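The tiered limits recommended above could be sketched as a fixed-window limiter with a higher ceiling for authenticated clients. The limits below are placeholders, not measured values:

```javascript
// Illustrative sketch: fixed-window rate limiter keyed by IP or user ID,
// with a higher limit for authenticated clients.
function makeRateLimiter({ anonLimit = 60, authLimit = 600, windowMs = 60000 } = {}) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key, isAuthenticated, now = Date.now()) {
    const limit = isAuthenticated ? authLimit : anonLimit;
    let w = windows.get(key);
    if (!w || now - w.start >= windowMs) {
      w = { start: now, count: 0 }; // start a fresh window
      windows.set(key, w);
    }
    w.count += 1;
    return w.count <= limit;
  };
}
```

A sliding-window or token-bucket variant smooths the burst-at-window-boundary behavior; a fixed window is shown here only because it is the simplest to reason about.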
## Error Distribution

### By Severity
- ERROR: 2,345 (1.6%)
- WARN: 8,901 (6.1%)
- INFO: 134,432 (92.3%)
### By Service
- api: 1,567 errors
- worker: 456 errors
- scheduler: 234 errors
- auth: 88 errors
### By Error Type
- Database errors: 1,234 (52.6%)
- Validation errors: 567 (24.2%)
- Rate limit errors: 345 (14.7%)
- Authentication errors: 199 (8.5%)
## Performance Metrics

### Response Times
| Endpoint | Avg | P50 | P95 | P99 | Max |
|---|---|---|---|---|---|
| /api/users | 123ms | 95ms | 230ms | 450ms | 890ms |
| /api/search | 2,300ms | 1,800ms | 4,500ms | 6,200ms | 8,900ms |
| /api/posts | 156ms | 120ms | 280ms | 520ms | 780ms |
| /api/health | 5ms | 4ms | 8ms | 12ms | 25ms |
### Traffic Patterns
- Peak: 09:15:00 (1,234 req/min)
- Average: 410 req/min
- Quiet Period: 02:00-05:00 (45 req/min)
## User Activity

### Top Users by Request Count
- User ID 12345: 2,345 requests
- User ID 67890: 1,890 requests
- User ID 11111: 1,456 requests
### Failed Authentication Attempts
- Total: 199
- Unique Users: 45
- Suspicious Pattern: User 99999 (23 failed attempts)
## Recommendations

### Immediate Actions (Today)
- ✓ Increase database connection pool
- ✓ Investigate memory leak in worker
- ✓ Block suspicious IP (192.168.1.100)
- ✓ Add monitoring for connection pool
### Short Term (This Week)
- Optimize search queries
- Implement query result caching
- Review event listener cleanup
- Add circuit breaker for database
- Increase rate limits for authenticated users
### Long Term (This Month)
- Migrate search to Elasticsearch
- Implement comprehensive APM
- Add automated log analysis
- Set up predictive alerting
- Improve error handling and logging
## Logging Improvements

### Missing Information
- Request IDs (for tracing)
- User context in some services
- Performance metrics in worker logs
- Structured error codes
### Suggested Log Format

```json
{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "error",
  "requestId": "req-abc-123",
  "service": "api",
  "userId": "12345",
  "endpoint": "/api/users",
  "method": "GET",
  "statusCode": 500,
  "duration": 234,
  "error": {
    "code": "DB_CONNECTION_ERROR",
    "message": "Database connection failed",
    "stack": "..."
  }
}
```
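A logger emitting this shape might look like the following sketch (console transport only; field names mirror the format above, and the function name is illustrative):

```javascript
// Illustrative sketch: minimal structured logger that writes one JSON
// object per line in the suggested format. A real service would use a
// library like pino or winston instead.
function logEvent({ level, requestId, service, userId, endpoint, method,
                    statusCode, duration, error }) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    requestId,
    service,
    userId,
    endpoint,
    method,
    statusCode,
    duration, // ms
    // Only include the error object when one is present.
    ...(error && { error: { code: error.code, message: error.message, stack: error.stack } })
  };
  console.log(JSON.stringify(entry));
  return entry;
}
```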
### Monitoring Alerts to Set Up
- Database Connection Errors > 10/min
- Response Time P95 > 500ms
- Error Rate > 2%
- Memory Usage > 80%
- Rate Limit Hits > 100/hour from single IP
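These thresholds can be checked against a periodic metrics snapshot; a minimal sketch, with the metric names and units as assumptions:

```javascript
// Illustrative sketch: evaluate the alert thresholds above against a
// metrics snapshot and return the names of firing alerts.
const alertRules = [
  { name: 'Database connection errors (/min)', metric: 'dbErrorsPerMin', threshold: 10 },
  { name: 'Response time P95 (ms)',            metric: 'p95Ms',          threshold: 500 },
  { name: 'Error rate (%)',                    metric: 'errorRatePct',   threshold: 2 },
  { name: 'Memory usage (%)',                  metric: 'memoryPct',      threshold: 80 }
];

function firingAlerts(snapshot) {
  return alertRules
    .filter(rule => snapshot[rule.metric] > rule.threshold)
    .map(rule => rule.name);
}
```

In practice these rules would live in the monitoring system itself (CloudWatch alarms, Grafana alert rules, etc.); the sketch only shows the comparison logic.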
## Analysis Techniques
### Regular Expression Patterns
```bash
# Find all errors
grep -E "ERROR|Exception|Failed" app.log
# Extract timestamps and errors
grep "ERROR" app.log | awk '{print $1, $2, $4}'
# Count error types
grep "ERROR" app.log | cut -d':' -f2 | sort | uniq -c | sort -nr
# Find slow requests (assumes response time in ms is logged in field 7; adjust for your format)
awk '$7 > 1000' access.log
```

### Time-Based Analysis

```bash
# Errors per hour (assumes logs are in chronological order)
grep "ERROR" app.log | awk '{print $1" "$2}' | cut -d':' -f1 | uniq -c
# Peak error times
grep "ERROR" app.log | cut -d' ' -f2 | cut -d':' -f1 | sort | uniq -c | sort -nr
```
## Tools Integration

- **Elasticsearch + Kibana**: Centralized logging and visualization
- **Splunk**: Enterprise log management
- **Datadog**: APM and log analysis
- **CloudWatch**: AWS log aggregation
- **Grafana Loki**: Open-source log aggregation
- **Papertrail**: Simple log management
## Notes
- Always consider log volume and retention
- Implement log rotation and archiving
- Use structured logging (JSON) for easier parsing
- Include request IDs for distributed tracing
- Set up alerts for critical error patterns
- Regular log analysis prevents incidents
- Correlation with metrics provides better insights