| name | qa-observability |
| description | Production observability and performance engineering with OpenTelemetry, distributed tracing, metrics, logging, SLO/SLI design, capacity planning, performance profiling, APM integration, and observability maturity progression for modern cloud-native systems. |
Observability & Performance Engineering — Production-Ready Systems
This skill provides execution-ready patterns for building observable, performant systems. Claude should apply these patterns when users need observability stack setup, performance optimization, capacity planning, or production monitoring.
Modern Best Practices (2025): OpenTelemetry standard, distributed tracing, unified telemetry (logs+metrics+traces), SLO-driven alerting, eBPF-based observability, AI-assisted anomaly detection, performance budgets, and observability maturity models.
When to Use This Skill
Claude should invoke this skill when a user requests:
- OpenTelemetry instrumentation and setup
- Distributed tracing implementation (Jaeger, Tempo, Zipkin)
- Metrics collection and dashboarding (Prometheus, Grafana)
- Structured logging setup (Pino, Winston, structlog)
- SLO/SLI definition and error budgets
- Performance profiling and optimization
- Capacity planning and resource forecasting
- APM integration (Datadog, New Relic, Dynatrace)
- Observability maturity assessment
- Alert design and on-call runbooks
- Performance budgeting (frontend and backend)
- Cost-performance optimization
- Production performance debugging
Quick Reference
| Task | Tool/Framework | Command/Setup | When to Use |
|---|---|---|---|
| Distributed tracing | OpenTelemetry + Jaeger | Auto-instrumentation, manual spans | Microservices, debugging request flow |
| Metrics collection | Prometheus + Grafana | Expose /metrics endpoint, scrape config | Track latency, error rate, throughput |
| Structured logging | Pino (Node.js), structlog (Python) | JSON logs with trace ID | Production debugging, log aggregation |
| SLO/SLI definition | Prometheus queries, SLO YAML | Availability, latency, error rate SLIs | Reliability targets, error budgets |
| Performance profiling | Clinic.js, Chrome DevTools, cProfile | CPU/memory flamegraphs | Slow application, high resource usage |
| Load testing | k6, Artillery | Ramp-up, spike, soak tests | Capacity planning, performance validation |
| APM integration | Datadog, New Relic, Dynatrace | Agent installation, instrumentation | Full observability stack |
| Web performance | Lighthouse CI, Web Vitals | Performance budgets in CI/CD | Frontend optimization |
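A minimal sketch of the "expose /metrics endpoint" row above, assuming Express and the prom-client package; the metric name, buckets, and port are illustrative placeholders.

```typescript
import express from "express";
import client from "prom-client";

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop lag, GC

// Request latency histogram, labelled by method, route, and status code
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

const app = express();

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    end({ method: req.method, route: req.route?.path ?? req.path, status_code: res.statusCode });
  });
  next();
});

// Prometheus scrapes this endpoint (add it as a scrape target in your scrape config)
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```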
Decision Tree: Observability Strategy
User needs: [Observability Task Type]
├─ Starting New Service?
│ ├─ Microservices? → OpenTelemetry auto-instrumentation + Jaeger
│ ├─ Monolith? → Structured logging (Pino/structlog) + Prometheus
│ ├─ Frontend? → Web Vitals + performance budgets
│ └─ All services? → Full stack (logs + metrics + traces)
│
├─ Debugging Issues?
│ ├─ Distributed system? → Distributed tracing (search by trace ID)
│ ├─ Single service? → Structured logs (search by request ID)
│ ├─ Performance problem? → CPU/memory profiling
│ └─ Database slow? → Query profiling (EXPLAIN ANALYZE)
│
├─ Reliability Targets?
│ ├─ Define SLOs? → Availability, latency, error rate SLIs
│ ├─ Error budgets? → Calculate allowed downtime per SLO
│ ├─ Alerting? → Burn rate alerts (fast: 1h, slow: 6h)
│ └─ Dashboard? → Grafana SLO dashboard with error budget
│
├─ Performance Optimization?
│ ├─ Find bottlenecks? → CPU/memory profiling
│ ├─ Database queries? → EXPLAIN ANALYZE, indexing
│ ├─ Frontend slow? → Lighthouse, Web Vitals analysis
│ └─ Load testing? → k6 scenarios (ramp, spike, soak)
│
└─ Capacity Planning?
├─ Baseline metrics? → Collect 30 days of traffic data
├─ Load testing? → Test at 2x expected peak
├─ Forecasting? → Time series analysis (Prophet, ARIMA)
└─ Cost optimization? → Right-size instances, spot instances
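For the load-testing branch above, a hedged k6 sketch covering ramp, spike, and soak in one script; the target URL, stage sizes, and thresholds are placeholders to adapt to your own expected peak.

```typescript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 100 },  // ramp-up to normal load
    { duration: "5m", target: 100 },  // steady state
    { duration: "1m", target: 400 },  // spike to ~2x expected peak
    { duration: "5m", target: 400 },  // hold the spike (soak at peak)
    { duration: "2m", target: 0 },    // ramp-down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500", "p(99)<1000"], // latency SLO-style gates
    http_req_failed: ["rate<0.01"],                 // <1% errors
  },
};

export default function () {
  const res = http.get("https://staging.example.com/api/health");
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```

Run with `k6 run load-test.ts`; recent k6 releases run TypeScript directly, and the sketch uses no type annotations, so it also works renamed to `.js`.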
Navigation: Core Implementation Patterns
See resources/core-observability-patterns.md for detailed implementation guides:
OpenTelemetry End-to-End Setup - Complete instrumentation with Node.js/Python examples
- Three pillars of observability (logs, metrics, traces)
- Auto-instrumentation and manual spans
- OTLP exporters and collectors
- Production checklist
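As a starting point, a hedged Node.js bootstrap sketch assuming the `@opentelemetry/sdk-node`, `auto-instrumentations-node`, and OTLP HTTP exporter packages; the service name and collector URLs are placeholders.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service", // illustrative name
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces", // OTLP/HTTP endpoint of your collector
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: "http://otel-collector:4318/v1/metrics" }),
  }),
  instrumentations: [getNodeAutoInstrumentations()], // HTTP, Express, pg, redis, etc.
});

sdk.start();

// Flush telemetry on shutdown so the last spans are not lost
process.on("SIGTERM", () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

Load this file before the application code (for example `node --require ./tracing.js server.js`) so auto-instrumentation can patch modules before they are imported.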
Distributed Tracing Strategy - Service-to-service trace propagation
- W3C Trace Context standard
- Sampling strategies (always-on, probabilistic, parent-based, adaptive)
- Cross-service correlation
- Trace backend configuration
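A sketch of parent-based probabilistic sampling plus W3C Trace Context propagation, assuming the OpenTelemetry JS SDK; the 10% ratio and the manual-injection helper are examples, not recommendations.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-node";
import { context, propagation } from "@opentelemetry/api";

const sdk = new NodeSDK({
  // Respect the caller's sampling decision; sample 10% of new (root) traces
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});
sdk.start();

// For outgoing calls not covered by auto-instrumentation, inject the W3C
// `traceparent` header manually so the downstream service continues the same trace.
async function callDownstream(url: string) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // adds traceparent / tracestate
  return fetch(url, { headers });
}
```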
SLO/SLI Design & Error Budgets - Reliability targets and alerting
- SLI definitions (availability, latency, error rate)
- Prometheus queries for SLIs
- Error budget calculation and policies
- Burn rate alerts (fast: 1h, slow: 6h)
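Back-of-envelope helpers for the budget and burn-rate numbers above, assuming a 30-day SLO window; the 99.9% target and alert windows are examples.

```typescript
const SLO_TARGET = 0.999;              // 99.9% availability
const WINDOW_DAYS = 30;
const WINDOW_MINUTES = WINDOW_DAYS * 24 * 60;

// Error budget: the fraction of requests (or minutes) allowed to fail
const errorBudget = 1 - SLO_TARGET;                          // 0.001
const allowedDowntimeMinutes = errorBudget * WINDOW_MINUTES; // ≈ 43.2 min per 30 days

// Burn rate = observed error rate / error budget. A burn rate of 1 exhausts the
// budget exactly at the end of the window; higher values exhaust it sooner.
function burnRate(observedErrorRate: number): number {
  return observedErrorRate / errorBudget;
}

// Multi-window thresholds: "consume X% of the monthly budget within the alert window".
function burnRateThreshold(budgetFractionConsumed: number, alertWindowHours: number): number {
  return budgetFractionConsumed * ((WINDOW_DAYS * 24) / alertWindowHours);
}

console.log(allowedDowntimeMinutes);     // 43.2
console.log(burnRateThreshold(0.02, 1)); // 14.4 — fast burn, page immediately
console.log(burnRateThreshold(0.05, 6)); // 6    — slow burn, open a ticket
```

The 14.4/6 thresholds match the widely used multi-window, multi-burn-rate defaults from the Google SRE Workbook; tune them to your own SLO window.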
Structured Logging - Production-ready JSON logs
- Log format with trace correlation
- Pino (Node.js) and structlog (Python) setup
- Log levels and what NOT to log
- Centralized aggregation (ELK, Loki, Datadog)
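A sketch of Pino JSON logging with trace correlation, assuming the pino package and an OpenTelemetry SDK already initialised elsewhere in the process; field names and redact paths are illustrative.

```typescript
import pino from "pino";
import { trace } from "@opentelemetry/api";

const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  // mixin() is merged into every log line: attach the active trace/span IDs so a
  // log search by trace_id lands on the same request as the trace backend.
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
  redact: ["req.headers.authorization", "password"], // never log secrets
});

logger.info({ order_id: "ord_123", amount_cents: 4999 }, "order placed");
// → {"level":30,"time":...,"trace_id":"4bf92f35...","span_id":"00f067aa...","order_id":"ord_123",...}
```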
Performance Profiling - CPU, memory, database, frontend optimization
- Node.js profiling (Chrome DevTools, Clinic.js)
- Memory leak detection (heap snapshots)
- Database query profiling (EXPLAIN ANALYZE)
- Web Vitals and performance budgets
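For the Web Vitals item, a browser-side sketch that checks metrics against a performance budget, assuming the web-vitals package; the budget numbers and the `/rum` endpoint are placeholders.

```typescript
import { onLCP, onINP, onCLS } from "web-vitals";

// Example budget, roughly aligned with the "good" Core Web Vitals thresholds
const BUDGET: Record<string, number> = { LCP: 2500, INP: 200, CLS: 0.1 };

function report(metric: { name: string; value: number; rating: string }) {
  const overBudget = metric.value > (BUDGET[metric.name] ?? Infinity);
  // Ship to your RUM/metrics backend; sendBeacon survives page unload
  navigator.sendBeacon(
    "/rum",
    JSON.stringify({ ...metric, overBudget, url: location.pathname })
  );
}

onLCP(report);
onINP(report);
onCLS(report);
```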
Capacity Planning - Scale planning and cost optimization
- Capacity formula and calculations
- Load testing with k6
- Resource forecasting (Prophet, ARIMA)
- Cost per request optimization
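A back-of-envelope capacity sketch; every number is a placeholder to replace with your own 30-day baseline and load-test results.

```typescript
const peakRps = 1200;            // observed peak from baseline metrics
const growthFactor = 1.5;        // expected growth over the planning horizon
const perInstanceRps = 300;      // sustainable RPS per instance from load tests
const headroom = 0.3;            // 30% buffer for spikes, deploys, AZ loss

const targetRps = peakRps * growthFactor;
const requiredInstances = Math.ceil((targetRps * (1 + headroom)) / perInstanceRps);

// Cost per request: useful for comparing instance types and right-sizing decisions
const monthlyInfraCost = 4200;                       // USD, illustrative
const monthlyRequests = peakRps * 0.4 * 86_400 * 30; // assume average ≈ 40% of peak
const costPerMillionRequests = (monthlyInfraCost / monthlyRequests) * 1_000_000;

console.log({ requiredInstances, costPerMillionRequests: costPerMillionRequests.toFixed(2) });
```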
Navigation: Observability Maturity
See resources/observability-maturity-model.md for maturity assessment:
Level 1: Reactive (Firefighting) - Manual log grepping, hours to resolve
- Basic logging to files
- No structured logging or distributed tracing
- Progression: Centralize logs, add metrics
Level 2: Proactive (Monitoring) - Centralized logs, application alerts
- Structured JSON logs with ELK/Splunk
- Prometheus metrics and Grafana dashboards
- Progression: Add distributed tracing, define SLOs
Level 3: Predictive (Observability) - Unified telemetry, SLO-driven
- Distributed tracing (Jaeger, Tempo)
- Automatic trace-log-metric correlation
- SLO/SLI-based alerting with error budgets
- Progression: AI anomaly detection, self-healing
Level 4: Autonomous (AIOps) - AI-powered, self-healing systems
- AI anomaly detection and root cause analysis
- Predictive capacity planning
- Auto-remediation and continuous optimization
- Chaos engineering for resilience
Maturity Assessment Tool - Rate your organization (0-5 per category):
- Logging, metrics, tracing, alerting, incident response
- MTTR/MTTD benchmarks by level
- Recommended next steps
Navigation: Anti-Patterns & Best Practices
See resources/anti-patterns-best-practices.md for common mistakes:
Critical Anti-Patterns:
- Logging Everything - Log bloat and high costs
- No Sampling - 100% trace collection is expensive
- Alert Fatigue - Too many noisy alerts
- Ignoring Tail Latency - P99 matters more than average
- No Error Budgets - Teams move too slowly (over-cautious) or too fast (burning reliability)
- Metrics Without Context - Dashboard mysteries
- No Cost Tracking - Observability can cost as much as 20% of infrastructure spend
- Point-in-Time Profiling - Missing intermittent issues
Best Practices Summary:
- Sample traces intelligently (100% errors, 1-10% success)
- Alert on SLO burn rate, not infrastructure metrics
- Track tail latency (P99, P99.9), not just averages
- Use error budgets to balance velocity vs reliability
- Add context to dashboards (baselines, annotations, SLOs)
- Track observability costs, optimize aggressively
- Continuous profiling for intermittent issues
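A tiny synthetic illustration of the tail-latency point above: a handful of slow requests barely moves the mean but dominates P99. The numbers are made up.

```typescript
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.min(sortedMs.length - 1, Math.ceil((p / 100) * sortedMs.length) - 1);
  return sortedMs[idx];
}

// 985 fast requests (50 ms) and 15 slow ones (4000 ms)
const latencies = [
  ...Array.from({ length: 985 }, () => 50),
  ...Array.from({ length: 15 }, () => 4000),
].sort((a, b) => a - b);

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log({
  mean: Math.round(mean),            // ≈ 109 ms — looks healthy
  p50: percentile(latencies, 50),    // 50 ms
  p99: percentile(latencies, 99),    // 4000 ms — the real user pain
  p999: percentile(latencies, 99.9), // 4000 ms
});
```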
Templates
See templates/ for copy-paste ready examples organized by domain and tech stack:
OpenTelemetry Instrumentation:
- Node.js/Express Setup - Auto-instrumentation, manual spans, Docker, K8s
- Python/Flask Setup - Flask instrumentation, SQLAlchemy, deployment
Monitoring & SLO:
- SLO YAML Template - Complete SLO definitions with error budgets
- Prometheus Alert Rules - Burn rate alerts, multi-window monitoring
- SLO Dashboard - SLI tracking, error budget visualization
- Unified Observability Dashboard - Logs, metrics, traces in one view
Load Testing:
- k6 Load Test Template - Ramp-up, spike, soak test scenarios
- Artillery Load Test - YAML configuration, multiple scenarios
Performance Optimization:
- Lighthouse CI Configuration - Performance budgets, CI/CD integration
- Node.js Profiling Config - CPU/memory profiling, leak detection
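To complement the Lighthouse CI template above, a hedged sketch of a performance-budget config; URLs and numbers are placeholders, and @lhci/cli expects this as a plain `lighthouserc.js` (CommonJS).

```typescript
module.exports = {
  ci: {
    collect: {
      url: ["http://localhost:3000/", "http://localhost:3000/checkout"],
      numberOfRuns: 3, // median of several runs smooths variance
    },
    assert: {
      assertions: {
        "categories:performance": ["error", { minScore: 0.9 }],
        "largest-contentful-paint": ["error", { maxNumericValue: 2500 }],
        "cumulative-layout-shift": ["error", { maxNumericValue: 0.1 }],
        "total-byte-weight": ["warn", { maxNumericValue: 1_500_000 }],
      },
    },
    upload: { target: "temporary-public-storage" },
  },
};
```

Wiring `lhci autorun` into CI makes regressions against the budget fail the build.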
Resources
See resources/ for deep-dive operational guides:
- Core Observability Patterns - 6 implementation patterns with code examples
- Observability Maturity Model - 4-level maturity framework with assessment
- Anti-Patterns & Best Practices - 8 critical anti-patterns with solutions
- OpenTelemetry Best Practices - Setup, sampling, attributes, context propagation
- Distributed Tracing Patterns - Trace propagation, span design, debugging workflows
- SLO Design Guide - SLI/SLO/SLA, error budgets, burn rate alerts
- Performance Profiling Guide - CPU/memory profiling, database optimization, frontend performance
External Resources
See data/sources.json for curated sources:
- OpenTelemetry documentation and specifications
- APM platforms (Datadog, New Relic, Dynatrace, Honeycomb)
- Observability tools (Prometheus, Grafana, Jaeger, Tempo, Loki)
- SRE books (Google SRE, Site Reliability Workbook)
- Performance tooling (Lighthouse, k6, Clinic.js)
- Web Vitals and Core Web Vitals
Related Skills
DevOps & Infrastructure:
- ../ops-devops-platform/SKILL.md - Kubernetes, Docker, CI/CD pipelines
- ../data-sql-optimization/SKILL.md - Database performance and optimization
Backend Development:
- ../software-backend/SKILL.md - Backend architecture and API design
- ../software-architecture-design/SKILL.md - System design patterns
Quality & Reliability:
- ../qa-resilience/SKILL.md - Circuit breakers, retries, graceful degradation
- ../qa-debugging/SKILL.md - Production debugging techniques
- ../qa-testing-strategy/SKILL.md - Testing strategies and automation
AI/ML Operations:
- ../ai-mlops/SKILL.md - ML model monitoring, deployment, security, and governance
Quick Decision Matrix
| Scenario | Recommendation |
|---|---|
| Starting new service | OpenTelemetry auto-instrumentation + structured logging |
| Debugging microservices | Distributed tracing with Jaeger/Tempo |
| Setting reliability targets | Define SLOs for availability, latency, error rate |
| Application is slow | CPU/memory profiling + database query analysis |
| Planning for scale | Load testing + capacity forecasting |
| High infrastructure costs | Cost per request analysis + right-sizing |
| Improving observability | Assess maturity level + 6-month roadmap |
| Frontend performance issues | Web Vitals monitoring + performance budgets |
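For the "application is slow" row, the database half often comes first; a minimal sketch assuming PostgreSQL and the node-postgres (pg) package, with a placeholder query and connection string.

```typescript
import { Client } from "pg";

async function explainSlowQuery(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // EXPLAIN (ANALYZE, BUFFERS) actually executes the query and reports real timings,
  // so run it against a replica or staging data, not a hot production primary.
  const result = await client.query(
    "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 20"
  );

  // Each row is one line of the plan; look for Seq Scans on large tables,
  // badly mis-estimated row counts, and sorts spilling to disk.
  for (const row of result.rows) {
    console.log(row["QUERY PLAN"]);
  }

  await client.end();
}

explainSlowQuery().catch(console.error);
```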
Usage Notes
When to Apply Patterns:
- New service setup → Start with OpenTelemetry + structured logging + basic metrics
- Microservices debugging → Use distributed tracing for full request visibility
- Reliability requirements → Define SLOs first, then implement monitoring
- Performance issues → Profile first, optimize second (measure before optimizing)
- Capacity planning → Collect baseline metrics for 30 days before load testing
- Observability maturity → Assess current level, plan 6-month progression
Common Workflows:
- New Service: OpenTelemetry → Prometheus → Grafana → Define SLOs
- Debugging: Search logs by trace ID → Click trace → See full request flow
- Performance: Profile CPU/memory → Identify bottleneck → Optimize → Validate
- Capacity Planning: Baseline metrics → Load test 2x peak → Forecast growth → Scale proactively
Optimization Priorities:
- Correctness (logs, metrics, and traces actually help you debug)
- Cost (optimize sampling, retention, and cardinality)
- Performance (instrumentation should add <5% latency overhead)
Success Criteria: Systems are fully observable with unified telemetry (logs+metrics+traces), SLOs drive alerting and feature velocity, performance is proactively optimized, and capacity is planned ahead of demand.