Claude Code Plugins

Community-maintained marketplace

Observability specialists hub for monitoring, logging, tracing, and alerting. Routes to specialists for metrics collection, log aggregation, distributed tracing, and incident response. Use for system observability, debugging production issues, and performance monitoring.

Install Skill

1. Download skill
2. Enable skills in Claude
   Open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude
   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: observability
version: 2.1.0
description: Observability specialists hub for monitoring, logging, tracing, and alerting. Routes to specialists for metrics collection, log aggregation, distributed tracing, and incident response. Use for system observability, debugging production issues, and performance monitoring.

Observability

Central hub for monitoring, logging, tracing, and system observability.

Phase 0: Expertise Loading

expertise_check:
  domain: observability
  file: .claude/expertise/observability.yaml

  if_exists:
    - Load monitoring patterns
    - Load alerting rules
    - Apply SLO definitions

  if_not_exists:
    - Flag discovery mode
    - Document patterns learned

When to Use This Skill

Use observability when:

  • Setting up monitoring infrastructure
  • Implementing logging strategies
  • Configuring distributed tracing
  • Creating dashboards and alerts
  • Debugging production issues

Observability Pillars

  • Metrics: Quantitative measurements
  • Logs: Event records
  • Traces: Request flow tracking
  • Alerts: Incident notification

Tool Ecosystem

Metrics

tools:
  - Prometheus
  - Grafana
  - Datadog
  - CloudWatch
metrics_types:
  - Counters
  - Gauges
  - Histograms
  - Summaries
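
A minimal sketch of the four metric types using the Python prometheus_client package (an assumption; the metric names, labels, and port are illustrative, not part of this skill):

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: monotonically increasing value, e.g. total requests served
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])

# Gauge: value that can go up and down, e.g. requests currently in flight
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being handled")

# Histogram: bucketed observations for server-side percentile queries, e.g. latency
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

# Summary: running count and sum of observations, e.g. payload sizes
PAYLOAD = Summary("http_request_size_bytes", "Request payload size in bytes")

def handle_request(method: str, size_bytes: int, duration_s: float, status: int) -> None:
    IN_FLIGHT.inc()
    try:
        REQUESTS.labels(method=method, status=str(status)).inc()
        LATENCY.observe(duration_s)
        PAYLOAD.observe(size_bytes)
    finally:
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus scrape job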

Logging

tools:
  - ELK Stack (Elasticsearch, Logstash, Kibana)
  - Loki
  - Splunk
  - CloudWatch Logs
patterns:
  - Structured logging (JSON)
  - Log levels
  - Correlation IDs
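
A stdlib-only sketch of structured JSON logging with a correlation ID; the logger name and field names below are illustrative choices:

import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so it can be parsed downstream."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # correlation_id ties log lines from one request together across services
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a real service the correlation ID would come from an incoming request header.
logger.info("order placed", extra={"correlation_id": str(uuid.uuid4())})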

Tracing

tools:
  - Jaeger
  - Zipkin
  - OpenTelemetry
  - X-Ray
patterns:
  - Span context propagation
  - Baggage items
  - Sampling strategies
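
A minimal tracing sketch, assuming the opentelemetry-api and opentelemetry-sdk Python packages; the service name, span names, and 10% sampling ratio are illustrative. Cross-process context propagation and baggage would additionally use the propagators API, omitted here:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Head-based sampling: keep roughly 10% of traces to bound overhead and storage.
provider = TracerProvider(sampler=TraceIdRatioBased(0.1))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def place_order(order_id: str) -> None:
    # Spans started inside this block become children of "place_order" automatically.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        charge_card(order_id)

def charge_card(order_id: str) -> None:
    with tracer.start_as_current_span("charge_card"):
        pass  # call the payment service here

place_order("ord-123")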

SLO/SLI/SLA

definitions:
  SLI: "Service Level Indicator - measurable metric"
  SLO: "Service Level Objective - target value"
  SLA: "Service Level Agreement - contractual commitment"

example:
  SLI: "Request latency p99"
  SLO: "99% of requests < 200ms"
  SLA: "99.9% availability per month"

MCP Requirements

  • claude-flow: For orchestration
  • Bash: For tool CLI commands

Recursive Improvement Integration (v2.1)

Eval Harness Integration

benchmark: observability-benchmark-v1
  tests:
    - obs-001: Monitoring coverage
    - obs-002: Alert quality
  minimum_scores:
    monitoring_coverage: 0.85
    alert_quality: 0.80

Memory Namespace

namespaces:
  - observability/configs/{id}: Monitoring configs
  - observability/dashboards: Dashboard templates
  - improvement/audits/observability: Skill audits

Uncertainty Handling

confidence_check:
  if confidence >= 0.8:
    - Proceed with implementation
  if 0.5 <= confidence < 0.8:
    - Confirm tool stack
  if confidence < 0.5:
    - Ask for infrastructure details

Cross-Skill Coordination

Works with: infrastructure, deployment-readiness, performance-analysis


!! SKILL COMPLETION VERIFICATION (MANDATORY) !!

  • Agent Spawning: Spawned agent via Task()
  • Agent Registry Validation: Agent from registry
  • TodoWrite Called: Called with 5+ todos
  • Work Delegation: Delegated to agents

Remember: Skill() -> Task() -> TodoWrite() - ALWAYS

Core Principles

1. Three Pillars Integration

Comprehensive observability requires unified collection and correlation of metrics, logs, and traces - no single pillar provides complete system visibility.

In practice:

  • Implement metrics collection for quantitative measurements (counters, gauges, histograms)
  • Deploy structured logging with correlation IDs for event tracking across services
  • Configure distributed tracing with span context propagation for request flow visualization
  • Correlate all three pillars using common identifiers (trace IDs, request IDs, user IDs), as sketched below
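
To make that correlation concrete, a minimal sketch that attaches the active OpenTelemetry trace and span IDs to the structured log records shown earlier (field names are illustrative):

import logging
from opentelemetry import trace

logger = logging.getLogger("checkout-service")

def log_with_trace_context(message: str) -> None:
    # Pull the active span's identifiers so logs and traces share one key.
    ctx = trace.get_current_span().get_span_context()
    logger.info(
        message,
        extra={
            "trace_id": format(ctx.trace_id, "032x"),  # 128-bit trace ID as hex
            "span_id": format(ctx.span_id, "016x"),    # 64-bit span ID as hex
        },
    )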

2. Proactive Alerting with SLO-Based Thresholds

Alerting must be driven by Service Level Objectives that reflect actual user impact, not arbitrary metric thresholds that generate noise.

In practice:

  • Define SLIs (Service Level Indicators) that measure user-facing behavior (p99 latency, error rate)
  • Set SLOs (Service Level Objectives) based on business requirements (99% requests < 200ms)
  • Configure alerts to fire when SLO burn rate exceeds acceptable thresholds
  • Implement multi-window alerting to distinguish temporary spikes from sustained degradation, as sketched below
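
A sketch of the multi-window burn-rate check, assuming a 99.9% SLO and the commonly used 14.4x burn-rate paging threshold; tune the windows and threshold to your own SLO period:

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    return error_ratio / budget if budget > 0 else float("inf")

def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    # Both the short (e.g. 5 min) and long (e.g. 1 h) windows must burn fast,
    # which filters out brief spikes while still catching sustained degradation.
    return (burn_rate(short_window_errors, slo_target) > threshold and
            burn_rate(long_window_errors, slo_target) > threshold)

# Example: 2% errors over the last 5 minutes and 1.8% over the last hour -> page.
print(should_page(short_window_errors=0.02, long_window_errors=0.018))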

3. Context-Aware Monitoring with Dynamic Baselines

Effective monitoring adapts to changing system behavior through machine learning baselines, not static thresholds that break during normal traffic variations.

In practice:

  • Use anomaly detection algorithms to learn normal behavior patterns
  • Implement seasonal baselines that adjust for daily/weekly traffic cycles, as sketched after this list
  • Correlate metrics across services to identify cascading failures
  • Apply intelligent noise reduction to focus on actionable signals
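
A toy sketch of a seasonal baseline with z-score anomaly detection; production systems would typically rely on a managed anomaly-detection feature or a more robust statistical model, so treat this purely as an illustration of the idea:

from collections import defaultdict
from statistics import mean, stdev

class SeasonalBaseline:
    """Learn per hour-of-week baselines and flag values far outside them."""

    def __init__(self, z_threshold: float = 3.0) -> None:
        self.samples = defaultdict(list)  # hour_of_week -> observed values
        self.z_threshold = z_threshold

    def observe(self, hour_of_week: int, value: float) -> None:
        self.samples[hour_of_week].append(value)

    def is_anomalous(self, hour_of_week: int, value: float) -> bool:
        history = self.samples[hour_of_week]
        if len(history) < 2:
            return False  # not enough history to judge yet
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.z_threshold

baseline = SeasonalBaseline()
for week in range(4):  # four weeks of Monday 09:00 request counts (hypothetical)
    baseline.observe(hour_of_week=9, value=1000 + 20 * week)
print(baseline.is_anomalous(hour_of_week=9, value=5000))  # True: far above baseline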

Anti-Patterns

  • Metric-Only Monitoring
    Problem: Collecting metrics without logs or traces misses critical context for debugging failures.
    Solution: Implement all three pillars with correlation; use traces to investigate metric anomalies.
  • Alert Fatigue from Static Thresholds
    Problem: Fixed thresholds generate false alarms during normal traffic variations, causing alert fatigue.
    Solution: Use SLO-based alerting with burn-rate calculations and dynamic baselines that adapt to traffic patterns.
  • Unstructured Logging
    Problem: Free-form log messages prevent automated analysis and correlation across services.
    Solution: Adopt structured logging with JSON format, include correlation IDs, and define standard log levels.
  • Missing Sampling Strategies
    Problem: Tracing 100% of requests creates performance overhead and storage costs.
    Solution: Implement adaptive sampling: high rates for errors and slow requests, low rates for fast successful requests.
  • Dashboard Proliferation
    Problem: Dozens of uncategorized dashboards make critical information undiscoverable.
    Solution: Organize dashboards by audience (SRE, developers, business), implement role-based access, and standardize layouts.

Conclusion

The Observability skill establishes comprehensive system visibility through unified metrics, logs, and traces coordinated with intelligent alerting and dynamic monitoring. By implementing all three pillars with proper correlation, organizations gain the ability to debug complex distributed systems, proactively detect degradation, and understand user impact. The integration with tools like Prometheus for metrics, ELK/Loki for logs, and Jaeger/OpenTelemetry for traces provides production-grade observability infrastructure.

The SLO-based alerting framework transforms monitoring from reactive firefighting into proactive quality management. By defining Service Level Objectives that reflect actual business requirements and configuring alerts based on SLO burn rates, teams receive actionable notifications about genuine user impact rather than noisy metric threshold violations. The recursive improvement integration through benchmark evaluation ensures observability implementations meet quality standards for monitoring coverage and alert quality.

Organizations implementing this skill benefit from faster incident detection and resolution, reduced mean time to recovery (MTTR), and deeper understanding of system behavior under load. The expertise-aware workflow enables teams to leverage documented monitoring patterns and alerting rules specific to their infrastructure, preventing common pitfalls and accelerating observability maturity. When coordinated with infrastructure, deployment-readiness, and performance-analysis skills, observability creates a complete operational excellence framework.