Claude Code Plugins

Community-maintained marketplace

Feedback
3
0

Plan instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budget

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name instrumentation-planning
description Plan instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budget
allowed-tools Read, Glob, Grep

Instrumentation Planning

Strategic planning for application instrumentation before implementation.

When to Use This Skill

  • Planning instrumentation for new services
  • Reviewing instrumentation strategy
  • Establishing naming conventions
  • Managing telemetry cardinality
  • Setting instrumentation budgets

Instrumentation Strategy Framework

What to Instrument

Instrumentation Layers:

┌─────────────────────────────────────────────────────────────────┐
│  Layer 1: Automatic/Library Instrumentation                     │
│  - HTTP clients/servers (auto-captured)                         │
│  - Database clients (auto-captured)                             │
│  - Message queue clients (auto-captured)                        │
│  - Framework-provided metrics                                   │
│  Effort: Low | Coverage: Broad | Customization: Limited        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 2: Business Transaction Instrumentation                  │
│  - Key user journeys                                            │
│  - Business operations (checkout, signup, etc.)                │
│  - Revenue-generating flows                                     │
│  - SLA-bound operations                                         │
│  Effort: Medium | Coverage: Targeted | Value: High             │
├─────────────────────────────────────────────────────────────────┤
│  Layer 3: Debug/Diagnostic Instrumentation                      │
│  - Algorithmic hot paths                                        │
│  - Cache behavior                                               │
│  - Circuit breaker states                                       │
│  - Retry/fallback paths                                         │
│  Effort: Medium | Coverage: Deep | Use: Troubleshooting        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 4: Business Metrics                                      │
│  - Domain-specific counters                                     │
│  - Conversion rates                                             │
│  - Feature usage                                                │
│  - Customer behavior                                            │
│  Effort: High | Coverage: Custom | Value: Business Insights    │
└─────────────────────────────────────────────────────────────────┘

Instrumentation Decision Matrix

instrumentation_decisions:
  always_instrument:
    - "Inbound HTTP/gRPC requests"
    - "Outbound HTTP/gRPC calls"
    - "Database queries"
    - "Message publish/consume"
    - "Authentication/authorization"
    - "External API calls"
    - "Cache operations"

  consider_instrumenting:
    - "Complex business logic"
    - "Feature flags evaluation"
    - "Background jobs"
    - "Scheduled tasks"
    - "File I/O operations"
    - "CPU-intensive operations"

  avoid_instrumenting:
    - "Every method call (too noisy)"
    - "Tight loops (performance impact)"
    - "Data transformation (low value)"
    - "Validation helpers"
    - "Utility functions"

  decision_criteria:
    business_value:
      weight: 0.3
      question: "Does this help understand business outcomes?"

    debugging_value:
      weight: 0.25
      question: "Does this help diagnose production issues?"

    slo_relevance:
      weight: 0.25
      question: "Does this contribute to SLI measurement?"

    cost_impact:
      weight: 0.2
      question: "Is the cardinality/volume acceptable?"

Naming Conventions

Metric Naming

metric_naming:
  format: "[namespace]_[subsystem]_[name]_[unit]"

  rules:
    case: "snake_case"
    unit_suffix: "Always include (_seconds, _bytes, _total)"
    base_units: "Use base units (seconds not milliseconds)"
    counter_suffix: "_total for counters"

  examples:
    good:
      - "http_server_requests_total"
      - "http_server_request_duration_seconds"
      - "http_server_response_size_bytes"
      - "db_connections_current"
      - "order_processing_duration_seconds"
      - "payment_transactions_total"

    bad:
      - "requests (no unit, no namespace)"
      - "HttpRequestDuration (wrong case)"
      - "order_latency_ms (use base units)"
      - "totalOrders (camelCase, no unit)"

  label_naming:
    case: "snake_case"
    avoid:
      - "Embedded values in name (path=/users)"
      - "High cardinality labels"
    good_labels:
      - "method, status_code, path"
      - "service, version, environment"
    bad_labels:
      - "user_id (high cardinality)"
      - "request_id (high cardinality)"
      - "timestamp (not a dimension)"

Span Naming

span_naming:
  format: "[operation] [resource]"

  rules:
    - "Use verb + noun pattern"
    - "Keep names low cardinality"
    - "Include operation type, not specific values"
    - "Be consistent across services"

  examples:
    http:
      pattern: "HTTP {METHOD} {route_template}"
      good: "HTTP GET /users/{id}"
      bad: "HTTP GET /users/12345"

    database:
      pattern: "{operation} {table}"
      good: "SELECT orders"
      bad: "SELECT * FROM orders WHERE id=123"

    messaging:
      pattern: "{operation} {queue/topic}"
      good: "PUBLISH order-events"
      bad: "publish message to order-events queue"

    rpc:
      pattern: "{service}/{method}"
      good: "OrderService/CreateOrder"
      bad: "grpc call to order service"

  attributes:
    required:
      - "service.name"
      - "service.version"
      - "deployment.environment"

    recommended:
      http:
        - "http.method"
        - "http.route"
        - "http.status_code"
        - "http.target"

      database:
        - "db.system"
        - "db.name"
        - "db.operation"
        - "db.statement (sanitized)"

      messaging:
        - "messaging.system"
        - "messaging.destination"
        - "messaging.operation"

Log Field Naming

log_naming:
  format: "snake_case for all fields"

  standard_fields:
    timestamp: "ISO 8601 format"
    level: "INFO, WARN, ERROR, etc."
    message: "Human-readable description"
    service: "Service name"
    trace_id: "Correlation ID"
    span_id: "Current span"

  domain_fields:
    pattern: "{domain}_{field}"
    examples:
      - "order_id"
      - "customer_id"
      - "payment_amount"
      - "product_sku"

  avoid:
    - "Nested objects (flatten for indexing)"
    - "Arrays of unknown length"
    - "Large text blobs"
    - "Sensitive data (PII, secrets)"

Cardinality Management

Understanding Cardinality

Cardinality = Number of unique time series

Example:
http_requests_total{method="GET", path="/api/users", status="200"}

Cardinality = methods × paths × statuses
            = 5 × 100 × 10
            = 5,000 time series

With user_id (1M users):
            = 5 × 100 × 10 × 1,000,000
            = 5,000,000,000 time series ← EXPLOSION!

Cardinality Budget

cardinality_budget:
  planning:
    total_budget: 100000  # Target max time series per service
    allocation:
      automatic_instrumentation: 30%  # 30,000
      business_transactions: 40%      # 40,000
      custom_metrics: 20%             # 20,000
      buffer: 10%                     # 10,000

  per_metric_limits:
    low_cardinality:
      max_series: 100
      example: "status codes, methods"

    medium_cardinality:
      max_series: 1000
      example: "endpoints, operations"

    high_cardinality:
      max_series: 10000
      example: "aggregated by hour"
      requires: "Justification and approval"

  monitoring:
    - "Alert on cardinality growth > 10% per day"
    - "Weekly cardinality reviews"
    - "Automatic label value limiting"

Cardinality Reduction Techniques

cardinality_reduction:
  bucketing:
    before: "path=/users/12345"
    after: "path=/users/{id}"
    technique: "Path template extraction"

  sampling:
    description: "Sample high-volume, low-value traces"
    strategies:
      head_sampling: "Decide at trace start"
      tail_sampling: "Decide after seeing full trace"
      adaptive: "Adjust rate based on volume"

  aggregation:
    description: "Pre-aggregate before export"
    example: "Count by status, not per request"

  value_limiting:
    description: "Cap unique values per label"
    example: "Max 100 unique paths, then 'other'"

  dropping:
    description: "Drop low-value dimensions"
    candidates:
      - "Instance ID (use service name)"
      - "Request ID (not for metrics)"
      - "Full URLs (use route templates)"

Instrumentation Budget

Performance Impact

performance_budget:
  cpu_overhead:
    target: "< 1% CPU increase"
    measurement: "Profile with/without instrumentation"

  memory_overhead:
    target: "< 50MB additional heap"
    components:
      - "Metric registries"
      - "Span buffers"
      - "Log buffers"

  latency_overhead:
    target: "< 1ms per request"
    hot_paths: "< 100μs"

  data_volume:
    metrics:
      target: "< 1GB/day per service"
      calculation: "series × scrape_interval × 8 bytes"

    traces:
      target: "< 10GB/day per service (with sampling)"
      sampling_rate: "1-10% for high-volume services"

    logs:
      target: "< 5GB/day per service"
      strategies: "Sampling, level gating"

Cost Planning

cost_planning:
  estimation_formula:
    metrics:
      monthly_cost: "time_series × $0.003 (typical cloud pricing)"
      example: "10,000 series × $0.003 = $30/month"

    traces:
      monthly_cost: "spans_per_month × $0.000005"
      example: "100M spans × $0.000005 = $500/month"

    logs:
      monthly_cost: "GB_per_month × $0.50"
      example: "500GB × $0.50 = $250/month"

  optimization_strategies:
    - "Increase scrape interval (15s → 60s)"
    - "Reduce trace sampling rate"
    - "Log level gating in production"
    - "Shorter retention for debug data"
    - "Downsample old metrics"

Instrumentation Plan Template

instrumentation_plan:
  service: "{Service Name}"
  version: "1.0"
  date: "{Date}"
  owner: "{Team}"

  objectives:
    - "Track SLIs for order processing"
    - "Enable distributed tracing for debugging"
    - "Monitor payment success rates"

  automatic_instrumentation:
    framework: "OpenTelemetry .NET"
    enabled:
      - "ASP.NET Core (HTTP server)"
      - "HttpClient (HTTP client)"
      - "Entity Framework Core (database)"
      - "Azure.Messaging.ServiceBus"
    configuration:
      sampling_rate: 0.1  # 10% of traces
      batch_export_interval: 5000  # ms

  custom_spans:
    - name: "ProcessOrder"
      purpose: "Track order processing duration"
      attributes:
        - "order.id"
        - "order.item_count"
        - "order.total_amount"
      events:
        - "inventory.reserved"
        - "payment.processed"

    - name: "ValidatePayment"
      purpose: "Track payment validation steps"
      attributes:
        - "payment.method"
        - "payment.provider"
      sensitive: false

  custom_metrics:
    counters:
      - name: "orders_total"
        labels: ["status", "payment_method"]
        purpose: "Count orders by outcome"
        cardinality_estimate: 20

      - name: "payment_failures_total"
        labels: ["reason", "provider"]
        purpose: "Track payment failure reasons"
        cardinality_estimate: 50

    histograms:
      - name: "order_processing_duration_seconds"
        labels: ["order_type"]
        purpose: "Track order processing latency"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        cardinality_estimate: 10

    gauges:
      - name: "pending_orders_current"
        labels: []
        purpose: "Current pending order count"
        cardinality_estimate: 1

  cardinality_summary:
    estimated_total: 81
    budget: 1000
    status: "Within budget"

  log_strategy:
    production_level: "INFO"
    structured_fields:
      standard:
        - "trace_id"
        - "span_id"
        - "service"
        - "environment"
      domain:
        - "order_id"
        - "customer_id (hashed)"
    sampling:
      debug_logs: "1% in production"

  cost_estimate:
    monthly:
      metrics: "$30"
      traces: "$200"
      logs: "$150"
      total: "$380"

  review_schedule:
    frequency: "Quarterly"
    metrics_to_review:
      - "Cardinality growth"
      - "Data volume"
      - "Cost vs budget"

Related Skills

  • observability-patterns - Three pillars overview
  • distributed-tracing - Trace implementation details
  • slo-sli-error-budget - What to measure for SLOs

Last Updated: 2025-12-26