Claude Code Plugins


Production observability and performance engineering with OpenTelemetry, distributed tracing, metrics, logging, SLO/SLI design, capacity planning, performance profiling, APM integration, and observability maturity progression for modern cloud-native systems.

Install Skill

1. Download the skill ZIP file.

2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section.

3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file.

Note: Review the skill's instructions before using it.

SKILL.md

name: qa-observability
description: Production observability and performance engineering with OpenTelemetry, distributed tracing, metrics, logging, SLO/SLI design, capacity planning, performance profiling, APM integration, and observability maturity progression for modern cloud-native systems.

Observability & Performance Engineering — Production-Ready Systems

This skill provides execution-ready patterns for building observable, performant systems. Claude should apply these patterns when users need observability stack setup, performance optimization, capacity planning, or production monitoring.

Modern Best Practices (2025): OpenTelemetry standard, distributed tracing, unified telemetry (logs+metrics+traces), SLO-driven alerting, eBPF-based observability, AI-assisted anomaly detection, performance budgets, and observability maturity models.


When to Use This Skill

Claude should invoke this skill when a user requests:

  • OpenTelemetry instrumentation and setup
  • Distributed tracing implementation (Jaeger, Tempo, Zipkin)
  • Metrics collection and dashboarding (Prometheus, Grafana)
  • Structured logging setup (Pino, Winston, structlog)
  • SLO/SLI definition and error budgets
  • Performance profiling and optimization
  • Capacity planning and resource forecasting
  • APM integration (Datadog, New Relic, Dynatrace)
  • Observability maturity assessment
  • Alert design and on-call runbooks
  • Performance budgeting (frontend and backend)
  • Cost-performance optimization
  • Production performance debugging

Quick Reference

| Task | Tool/Framework | Command/Setup | When to Use |
| --- | --- | --- | --- |
| Distributed tracing | OpenTelemetry + Jaeger | Auto-instrumentation, manual spans | Microservices, debugging request flow |
| Metrics collection | Prometheus + Grafana | Expose /metrics endpoint, scrape config | Track latency, error rate, throughput |
| Structured logging | Pino (Node.js), structlog (Python) | JSON logs with trace ID | Production debugging, log aggregation |
| SLO/SLI definition | Prometheus queries, SLO YAML | Availability, latency, error rate SLIs | Reliability targets, error budgets |
| Performance profiling | Clinic.js, Chrome DevTools, cProfile | CPU/memory flamegraphs | Slow application, high resource usage |
| Load testing | k6, Artillery | Ramp-up, spike, soak tests | Capacity planning, performance validation |
| APM integration | Datadog, New Relic, Dynatrace | Agent installation, instrumentation | Full observability stack |
| Web performance | Lighthouse CI, Web Vitals | Performance budgets in CI/CD | Frontend optimization |
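
For orientation, a minimal Python sketch of the tracing setup the first two rows assume. The service name, collector endpoint, span name, and attribute are placeholder assumptions; in practice, auto-instrumentation packages create most spans, with manual spans reserved for business operations.

```python
# Minimal OpenTelemetry tracing setup (Python SDK). Assumes an OTLP-capable
# backend (e.g. the OTel Collector or Jaeger) listening on localhost:4317.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Manual span around a business operation; attributes become searchable in the backend.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.amount_cents", 1999)  # illustrative attribute
```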

Decision Tree: Observability Strategy

User needs: [Observability Task Type]
    ├─ Starting New Service?
    │   ├─ Microservices? → OpenTelemetry auto-instrumentation + Jaeger
    │   ├─ Monolith? → Structured logging (Pino/structlog) + Prometheus
    │   ├─ Frontend? → Web Vitals + performance budgets
    │   └─ All services? → Full stack (logs + metrics + traces)
    │
    ├─ Debugging Issues?
    │   ├─ Distributed system? → Distributed tracing (search by trace ID)
    │   ├─ Single service? → Structured logs (search by request ID)
    │   ├─ Performance problem? → CPU/memory profiling
    │   └─ Database slow? → Query profiling (EXPLAIN ANALYZE)
    │
    ├─ Reliability Targets?
    │   ├─ Define SLOs? → Availability, latency, error rate SLIs
    │   ├─ Error budgets? → Calculate allowed downtime per SLO
    │   ├─ Alerting? → Burn rate alerts (fast: 1h, slow: 6h)
    │   └─ Dashboard? → Grafana SLO dashboard with error budget
    │
    ├─ Performance Optimization?
    │   ├─ Find bottlenecks? → CPU/memory profiling
    │   ├─ Database queries? → EXPLAIN ANALYZE, indexing
    │   ├─ Frontend slow? → Lighthouse, Web Vitals analysis
    │   └─ Load testing? → k6 scenarios (ramp, spike, soak)
    │
    └─ Capacity Planning?
        ├─ Baseline metrics? → Collect 30 days of traffic data
        ├─ Load testing? → Test at 2x expected peak
        ├─ Forecasting? → Time series analysis (Prophet, ARIMA)
        └─ Cost optimization? → Right-size instances, spot instances
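
The error-budget and burn-rate branches above reduce to simple arithmetic. A sketch with illustrative numbers; the 99.9% target and 30-day window are assumptions, not recommendations.

```python
# Error budget arithmetic for an availability SLO (illustrative numbers).
SLO_TARGET = 0.999              # 99.9% availability
WINDOW_MINUTES = 30 * 24 * 60   # 30-day rolling window

error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)
print(f"Allowed downtime: {error_budget_minutes:.1f} min per 30 days")  # ~43.2 min

def burn_rate(observed_error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Budget consumption speed: 1.0 means exactly on budget; the common
    fast-burn page threshold is ~14.4 over 1h (2% of budget in one hour)."""
    return observed_error_rate / (1 - slo_target)

print(burn_rate(0.01))  # 1% failures against a 0.1% budget -> burning 10x too fast
```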

Navigation: Core Implementation Patterns

See resources/core-observability-patterns.md for detailed implementation guides:

  • OpenTelemetry End-to-End Setup - Complete instrumentation with Node.js/Python examples

    • Three pillars of observability (logs, metrics, traces)
    • Auto-instrumentation and manual spans
    • OTLP exporters and collectors
    • Production checklist
  • Distributed Tracing Strategy - Service-to-service trace propagation

    • W3C Trace Context standard
    • Sampling strategies (always-on, probabilistic, parent-based, adaptive)
    • Cross-service correlation
    • Trace backend configuration
  • SLO/SLI Design & Error Budgets - Reliability targets and alerting

    • SLI definitions (availability, latency, error rate)
    • Prometheus queries for SLIs
    • Error budget calculation and policies
    • Burn rate alerts (fast: 1h, slow: 6h)
  • Structured Logging - Production-ready JSON logs (a minimal structlog sketch follows this list)

    • Log format with trace correlation
    • Pino (Node.js) and structlog (Python) setup
    • Log levels and what NOT to log
    • Centralized aggregation (ELK, Loki, Datadog)
  • Performance Profiling - CPU, memory, database, frontend optimization

    • Node.js profiling (Chrome DevTools, Clinic.js)
    • Memory leak detection (heap snapshots)
    • Database query profiling (EXPLAIN ANALYZE)
    • Web Vitals and performance budgets
  • Capacity Planning - Scale planning and cost optimization

    • Capacity formula and calculations
    • Load testing with k6
    • Resource forecasting (Prophet, ARIMA)
    • Cost per request optimization
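
As a companion to the Structured Logging item above, a minimal structlog sketch that injects the active OpenTelemetry trace and span IDs into every JSON log line. The field names (trace_id, span_id) and the example event are conventional choices, not a fixed standard.

```python
# JSON logs correlated with the active OpenTelemetry span (Python, structlog).
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor: copy the current trace/span IDs into each log record."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()
log.info("payment_processed", order_id="A-1234", amount_cents=1999)
# -> {"event": "payment_processed", "order_id": "A-1234", ..., "trace_id": "..."}
```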

Navigation: Observability Maturity

See resources/observability-maturity-model.md for maturity assessment:

  • Level 1: Reactive (Firefighting) - Manual log grepping, hours to resolve

    • Basic logging to files
    • No structured logging or distributed tracing
    • Progression: Centralize logs, add metrics
  • Level 2: Proactive (Monitoring) - Centralized logs, application alerts

    • Structured JSON logs with ELK/Splunk
    • Prometheus metrics and Grafana dashboards
    • Progression: Add distributed tracing, define SLOs
  • Level 3: Predictive (Observability) - Unified telemetry, SLO-driven

    • Distributed tracing (Jaeger, Tempo)
    • Automatic trace-log-metric correlation
    • SLO/SLI-based alerting with error budgets
    • Progression: AI anomaly detection, self-healing
  • Level 4: Autonomous (AIOps) - AI-powered, self-healing systems

    • AI anomaly detection and root cause analysis
    • Predictive capacity planning
    • Auto-remediation and continuous optimization
    • Chaos engineering for resilience

Maturity Assessment Tool - Rate your organization (0-5 per category):

  • Logging, metrics, tracing, alerting, incident response
  • MTTR/MTTD benchmarks by level
  • Recommended next steps
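
If it helps to make the self-assessment concrete, a trivial scoring sketch; the mapping from average score to level is an illustrative assumption, not part of the referenced model.

```python
# Rough self-assessment: average the 0-5 category scores, map to a maturity level.
scores = {  # example ratings -- replace with your own
    "logging": 3, "metrics": 3, "tracing": 1, "alerting": 2, "incident_response": 2,
}
avg = sum(scores.values()) / len(scores)
# Assumed mapping: <1.5 -> Level 1, <2.5 -> Level 2, <4.0 -> Level 3, else Level 4.
level = 1 if avg < 1.5 else 2 if avg < 2.5 else 3 if avg < 4.0 else 4
print(f"Average score {avg:.1f} -> roughly Level {level}")  # 2.2 -> Level 2
```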

Navigation: Anti-Patterns & Best Practices

See resources/anti-patterns-best-practices.md for common mistakes:

Critical Anti-Patterns:

  1. Logging Everything - Log bloat and high costs
  2. No Sampling - 100% trace collection is expensive
  3. Alert Fatigue - Too many noisy alerts
  4. Ignoring Tail Latency - P99 matters more than average
  5. No Error Budgets - Teams either move too slowly (over-cautious) or too fast (unreliable)
  6. Metrics Without Context - Dashboard mysteries
  7. No Cost Tracking - Observability spend can reach 20% of infrastructure costs
  8. Point-in-Time Profiling - Missing intermittent issues

Best Practices Summary:

  • Sample traces intelligently (100% of errors, 1-10% of successes; see the sketch after this list)
  • Alert on SLO burn rate, not infrastructure metrics
  • Track tail latency (P99, P999), not just averages
  • Use error budgets to balance velocity vs reliability
  • Add context to dashboards (baselines, annotations, SLOs)
  • Track observability costs, optimize aggressively
  • Continuous profiling for intermittent issues
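
For the sampling guidance above, a hedged OpenTelemetry (Python) sketch of head-based probabilistic sampling. Keeping 100% of error traces generally requires tail-based sampling in the Collector, which is only noted in a comment here.

```python
# Head-based sampling: keep ~5% of traces, honoring the parent's decision so
# distributed traces stay complete across services.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.05)))

# Assumption about the rest of your pipeline: "100% of errors" cannot be decided
# at the head (the outcome isn't known yet) -- use the OpenTelemetry Collector's
# tail_sampling processor with a status_code policy to keep error traces while
# downsampling successes.
```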

Templates

See templates/ for copy-paste ready examples organized by domain and tech stack:

OpenTelemetry Instrumentation:

Monitoring & SLO:

Load Testing:

Performance Optimization:


Resources

See resources/ for deep-dive operational guides:


External Resources

See data/sources.json for curated sources:

  • OpenTelemetry documentation and specifications
  • APM platforms (Datadog, New Relic, Dynatrace, Honeycomb)
  • Observability tools (Prometheus, Grafana, Jaeger, Tempo, Loki)
  • SRE books (Google SRE, Site Reliability Workbook)
  • Performance tooling (Lighthouse, k6, Clinic.js)
  • Web Vitals and Core Web Vitals

Related Skills

DevOps & Infrastructure:

Backend Development:

Quality & Reliability:

AI/ML Operations:


Quick Decision Matrix

| Scenario | Recommendation |
| --- | --- |
| Starting new service | OpenTelemetry auto-instrumentation + structured logging |
| Debugging microservices | Distributed tracing with Jaeger/Tempo |
| Setting reliability targets | Define SLOs for availability, latency, error rate |
| Application is slow | CPU/memory profiling + database query analysis |
| Planning for scale | Load testing + capacity forecasting |
| High infrastructure costs | Cost per request analysis + right-sizing |
| Improving observability | Assess maturity level + 6-month roadmap |
| Frontend performance issues | Web Vitals monitoring + performance budgets |

Usage Notes

When to Apply Patterns:

  • New service setup → Start with OpenTelemetry + structured logging + basic metrics
  • Microservices debugging → Use distributed tracing for full request visibility
  • Reliability requirements → Define SLOs first, then implement monitoring
  • Performance issues → Profile first, optimize second (measure before optimizing)
  • Capacity planning → Collect baseline metrics for 30 days before load testing
  • Observability maturity → Assess current level, plan 6-month progression

Common Workflows:

  1. New Service: OpenTelemetry → Prometheus → Grafana → Define SLOs
  2. Debugging: Search logs by trace ID → Click trace → See full request flow
  3. Performance: Profile CPU/memory → Identify bottleneck → Optimize → Validate
  4. Capacity Planning: Baseline metrics → Load test 2x peak → Forecast growth → Scale proactively
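
The capacity-planning workflow above reduces to back-of-the-envelope arithmetic once baseline and load-test numbers exist. A sketch with illustrative inputs; per-instance throughput and headroom must be measured, they are not defaults.

```python
import math

# Back-of-the-envelope capacity estimate (all inputs are illustrative).
baseline_peak_rps = 1200   # observed peak from ~30 days of traffic data
growth_factor = 2.0        # plan (and load test) for 2x the observed peak
per_instance_rps = 250     # sustained throughput of one instance, from load testing
headroom = 0.70            # target utilization: keep 30% spare capacity

required_rps = baseline_peak_rps * growth_factor
instances = math.ceil(required_rps / (per_instance_rps * headroom))
print(f"Plan for {required_rps:.0f} RPS -> {instances} instances")  # 2400 RPS -> 14
```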

Optimization Priorities:

  1. Correctness (logs, traces, metrics actually help debug)
  2. Cost (optimize sampling, retention, cardinality)
  3. Performance (keep observability overhead below 5% added latency)

Success Criteria: Systems are fully observable with unified telemetry (logs+metrics+traces), SLOs drive alerting and feature velocity, performance is proactively optimized, and capacity is planned ahead of demand.