name	slo-definition
description	Define Service Level Objectives (SLOs), Indicators (SLIs), and error budgets
allowed-tools	Read, Glob, Grep, Write, Edit

SLO Definition Skill

When to Use This Skill

Use this skill when:

SLO definition tasks - Defining Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets
Planning or design - Need guidance on SLO definition approaches
Best practices - Want to follow established patterns and standards

Overview

Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets following Google SRE principles.

MANDATORY: Documentation-First Approach

Before defining SLOs:

Invoke docs-management skill for SRE patterns
Verify Google SRE best practices via MCP servers (perplexity for latest)
Base all guidance on Google SRE book and industry standards

SLO/SLI/SLA Hierarchy

Service Level Hierarchy:

┌─────────────────────────────────────────────────────────────────────────────┐
│                     SLA (Service Level Agreement)                            │
│              External contract with customers/users                          │
│              Example: "99.9% availability per month"                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                     SLO (Service Level Objective)                            │
│              Internal target (typically stricter than SLA)                   │
│              Example: "99.95% availability per month"                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                     SLI (Service Level Indicator)                            │
│              Quantitative measure of service behavior                        │
│              Example: "Successful requests / Total requests"                 │
└─────────────────────────────────────────────────────────────────────────────┘

                    SLI measures → SLO targets → SLA contracts

SLI Categories

The Four Golden Signals

Google's Four Golden Signals:

┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐
│    Latency      │    Traffic      │     Errors      │   Saturation    │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ Time to serve   │ Demand on       │ Rate of failed  │ How "full" the  │
│ requests        │ the system      │ requests        │ service is      │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┤
│ P50, P95, P99   │ Requests/sec    │ 5xx/total       │ CPU, memory,    │
│ response time   │ QPS, bytes/sec  │ Error rate      │ queue depth     │
└─────────────────┴─────────────────┴─────────────────┴─────────────────┘

Common SLI Types

Category	SLI	Measurement
Availability	Success rate	Successful requests / Total requests
Latency	Response time	Requests < threshold / Total requests
Throughput	Processing rate	Operations / Time window
Correctness	Data quality	Correct responses / Total responses
Freshness	Data currency	Records updated within threshold / Total records
Coverage	Completeness	Items processed / Items expected
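
Most of the SLI types in the table reduce to a ratio of good events to valid events. A minimal Python sketch of that ratio, assuming the event counts have already been pulled from monitoring; the `ratio_sli` helper and the sample counts are hypothetical and only illustrate the arithmetic:

```python
def ratio_sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as the percentage of valid events that were good.

    Hypothetical helper: with no valid events there is nothing to measure,
    so the SLI is reported as 100% instead of dividing by zero.
    """
    if good_events < 0 or valid_events < 0:
        raise ValueError("event counts must be non-negative")
    if good_events > valid_events:
        raise ValueError("good events cannot exceed valid events")
    if valid_events == 0:
        return 100.0
    return 100.0 * good_events / valid_events


# Illustrative counts, made up for this example.
availability = ratio_sli(good_events=999_500, valid_events=1_000_000)
latency = ratio_sli(good_events=992_000, valid_events=1_000_000)
print(f"availability SLI: {availability:.2f}%")
print(f"latency SLI: {latency:.2f}%")
```

Treating an empty window as 100% is one possible choice; some teams prefer to report "no data" so that a silent service does not look healthy.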

SLO Definition Template

# SLO Definition: [Service Name]

## Service Overview

**Service:** [Service name and brief description]
**Owners:** [Team responsible]
**Stakeholders:** [Dependent teams/users]
**Tier:** [Critical / Standard / Development]

## SLIs

### SLI-001: Availability

**Definition:**
The proportion of successful HTTP requests, as measured from the load balancer.

**Formula:**
```promql
sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d])) * 100
```

**Good Event:** HTTP response with status code < 500
**Valid Event:** Any HTTP request to the service

### SLI-002: Latency

**Definition:**
The proportion of requests that complete within the latency threshold.

**Formula:**
```promql
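# Assumes a Prometheus histogram named http_request_duration_seconds with a 0.5s bucket boundary
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) / sum(rate(http_request_duration_seconds_count[30d])) * 100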
```

**Good Event:** Request completes in < 500ms
**Valid Event:** Any HTTP request to the service

## SLOs

### SLO-001: Availability

**Target:** 99.9% of requests succeed over a 30-day rolling window
**SLI:** SLI-001 (Availability)
**Window:** 30-day rolling

**Rationale:**
This target provides approximately 43 minutes of downtime budget per month,
balancing reliability with development velocity.

### SLO-002: Latency (P99)

**Target:** 99% of requests complete in < 500ms over a 30-day rolling window
**SLI:** SLI-002 (Latency)
**Window:** 30-day rolling

**Rationale:**
A 99% target is equivalent to keeping P99 latency under 500ms. It ensures a good
experience for nearly all users while allowing headroom for complex operations.

## Error Budget

### Availability Error Budget

**Monthly Budget:** (100% - 99.9%) × 30 days = 43.2 minutes
**Current Burn Rate:** [calculated from monitoring]
**Budget Remaining:** [calculated from monitoring]

### Latency Error Budget

**Monthly Budget:** (100% - 99%) × total_requests = 1% of requests may exceed 500ms
**Current Burn Rate:** [calculated from monitoring]
**Budget Remaining:** [calculated from monitoring]

## Alerting Policy

### Error Budget Burn Alerts

| Alert Level | Condition | Action |
|-------------|-----------|--------|
| Warning | >2% budget consumed in 1 hour | Investigate |
| Critical | >5% budget consumed in 1 hour | Page on-call |
| Emergency | >50% budget consumed | Freeze deployments |

## Review Schedule

- **Weekly:** Review burn rate and recent incidents
- **Monthly:** Review SLO achievement, adjust if needed
- **Quarterly:** Review SLO relevance and stakeholder satisfaction
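
The error budget figures in the template above follow from simple arithmetic on the target and the window. A short Python sketch of that arithmetic, assuming a 30-day window as in the template; the function names and the consumed-minutes figure are hypothetical:

```python
def time_budget_minutes(target_percent: float, window_days: float) -> float:
    """Minutes of full downtime the availability target allows in the window."""
    window_minutes = window_days * 24 * 60
    return (100.0 - target_percent) / 100.0 * window_minutes


def request_budget(target_percent: float, total_requests: int) -> float:
    """Number of requests allowed to miss the objective in the window."""
    return (100.0 - target_percent) / 100.0 * total_requests


def budget_remaining_percent(budget: float, consumed: float) -> float:
    """Share of the budget still unspent, floored at zero."""
    if budget <= 0:
        return 0.0
    return max(0.0, (budget - consumed) / budget * 100.0)


availability_budget = time_budget_minutes(99.9, 30)
print(f"availability budget: {availability_budget:.1f} minutes")

# Latency budget for an assumed 2,000,000 requests in the window.
print(f"slow requests allowed: {request_budget(99.0, 2_000_000):.0f}")

# Assumed 10.8 minutes of downtime so far in the window.
print(f"remaining: {budget_remaining_percent(availability_budget, 10.8):.0f}%")
```

For the 99.9% target this gives 43.2 minutes, matching the monthly budget stated in the template.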

SLO Calculation Patterns

Availability Calculation

public sealed class AvailabilitySloCalculator
{
    private readonly IMetricsClient _metrics;

    public AvailabilitySloCalculator(IMetricsClient metrics)
    {
        _metrics = metrics;
    }

    public async Task<SloStatus> CalculateAsync(
        string serviceName,
        TimeSpan window,
        double targetPercentage,
        CancellationToken ct = default)
    {
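        // Prometheus range selectors need a whole number of minutes.
        var windowMinutes = (long)window.TotalMinutes;

        // $$ makes {{...}} the interpolation hole, so single braces stay literal PromQL.
        var query = $$"""
            sum(rate(http_requests_total{
                service="{{serviceName}}",
                status!~"5.."
            }[{{windowMinutes}}m]))
            /
            sum(rate(http_requests_total{
                service="{{serviceName}}"
            }[{{windowMinutes}}m]))
            * 100
            """;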

        var currentSli = await _metrics.QueryAsync(query, ct);
        var errorBudget = CalculateErrorBudget(targetPercentage, currentSli, window);

        return new SloStatus
        {
            CurrentSli = currentSli,
            Target = targetPercentage,
            Window = window,
            ErrorBudget = errorBudget,
            IsHealthy = currentSli >= targetPercentage
        };
    }

    private static ErrorBudget CalculateErrorBudget(
        double target,
        double current,
        TimeSpan window)
    {
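        var budgetTotal = 100.0 - target; // e.g., 0.1% for 99.9% target
        // Every failed request spends budget, so consumption is the observed
        // error rate (100 - SLI), not just the shortfall below the target.
        var budgetConsumed = Math.Max(0, 100.0 - current);
        var budgetRemaining = Math.Max(0, budgetTotal - budgetConsumed);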

        return new ErrorBudget
        {
            TotalBudget = budgetTotal,
            ConsumedBudget = budgetConsumed,
            RemainingBudget = budgetRemaining,
            RemainingPercentage = (budgetRemaining / budgetTotal) * 100,
            BurnRate = budgetConsumed / window.TotalHours // per hour
        };
    }
}

public sealed record SloStatus
{
    public required double CurrentSli { get; init; }
    public required double Target { get; init; }
    public required TimeSpan Window { get; init; }
    public required ErrorBudget ErrorBudget { get; init; }
    public required bool IsHealthy { get; init; }
}

public sealed record ErrorBudget
{
    public required double TotalBudget { get; init; }
    public required double ConsumedBudget { get; init; }
    public required double RemainingBudget { get; init; }
    public required double RemainingPercentage { get; init; }
    public required double BurnRate { get; init; }  // per hour
}

Multi-Window SLO (Burn Rate)

public sealed class BurnRateCalculator
{
    /// <summary>
    /// Calculate burn rate using multiple windows for better alerting.
    /// Fast burn: 1-hour window with 14.4x burn rate (2% budget in 1 hour)
    /// Slow burn: 6-hour window with 6x burn rate (5% budget in 6 hours)
    /// </summary>
    public BurnRateStatus Calculate(
        double targetSlo,
        double shortWindowSli,    // 1-hour
        double longWindowSli,     // 6-hour
        TimeSpan budgetWindow)    // 30-day
    {
        var budgetTotalPercent = 100.0 - targetSlo;

        // Burn rate = observed error rate / error rate the SLO allows.
        // A rate of 1.0 spends exactly the whole budget over budgetWindow.
        // Fast burn: 14.4x for 1 hour consumes 14.4 / 720 = 2% of a 30-day budget.
        var fastBurnRate = (100.0 - shortWindowSli) / budgetTotalPercent;

        // Slow burn: 6x for 6 hours consumes 6 * 6 / 720 = 5% of a 30-day budget.
        var slowBurnRate = (100.0 - longWindowSli) / budgetTotalPercent;

        return new BurnRateStatus
        {
            FastBurnRate = fastBurnRate,
            SlowBurnRate = slowBurnRate,
            // Fast burn alerts on the 1-hour window alone.
            IsFastBurnAlert = fastBurnRate >= 14.4,
            // Slow burn needs the 6-hour rate over threshold, confirmed by the
            // 1-hour window so the alert clears once the burn stops.
            IsSlowBurnAlert = slowBurnRate >= 6.0 && fastBurnRate >= 6.0,
            ProjectedBudgetExhaustion = CalculateExhaustion(fastBurnRate, budgetWindow)
        };
    }

    private static TimeSpan? CalculateExhaustion(double burnRate, TimeSpan window)
    {
        if (burnRate <= 0) return null;
        return TimeSpan.FromHours(window.TotalHours / burnRate);
    }
}

public sealed record BurnRateStatus
{
    public required double FastBurnRate { get; init; }
    public required double SlowBurnRate { get; init; }
    public required bool IsFastBurnAlert { get; init; }
    public required bool IsSlowBurnAlert { get; init; }
    public TimeSpan? ProjectedBudgetExhaustion { get; init; }
}
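
Burn rate is the observed error rate divided by the error rate the SLO allows, so a sustained rate of 1 spends the whole budget exactly at the end of the window. The following Python sketch (Python chosen for brevity; the function names are hypothetical) checks the arithmetic behind the 14.4x and 6x thresholds, assuming a 30-day (720-hour) budget window:

```python
from typing import Optional


def burn_rate(sli_percent: float, target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 100.0 - target_percent
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return (100.0 - sli_percent) / allowed


def budget_consumed_percent(rate: float, hours: float,
                            window_hours: float = 720.0) -> float:
    """Percent of the window's budget spent by sustaining `rate` for `hours`."""
    return rate * hours / window_hours * 100.0


def hours_to_exhaustion(rate: float,
                        window_hours: float = 720.0) -> Optional[float]:
    """Hours until a full budget is gone at this rate; None if nothing burns."""
    if rate <= 0:
        return None
    return window_hours / rate


# Assumed readings: 98.56% success over the last hour against a 99.9% target.
fast = burn_rate(98.56, 99.9)
print(f"1-hour burn rate: {fast:.1f}x")
print(f"budget spent in that hour: {budget_consumed_percent(fast, 1):.1f}%")
print(f"6x for 6 hours spends: {budget_consumed_percent(6.0, 6):.1f}%")
print(f"hours to exhaustion at this rate: {hours_to_exhaustion(fast):.0f}")
```

Under these assumptions, 14.4x sustained for one hour spends 2% of the budget and 6x sustained for six hours spends 5%.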

Availability Targets Reference

Target	Annual Downtime	Monthly Downtime	Weekly Downtime
99% (two 9s)	3.65 days	7.31 hours	1.68 hours
99.5%	1.83 days	3.65 hours	50.4 minutes
99.9% (three 9s)	8.77 hours	43.8 minutes	10.1 minutes
99.95%	4.38 hours	21.9 minutes	5.04 minutes
99.99% (four 9s)	52.6 minutes	4.38 minutes	1.01 minutes
99.999% (five 9s)	5.26 minutes	26.3 seconds	6.05 seconds
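
The rows above can be reproduced from the target alone. A small Python sketch, assuming the table uses a 365.25-day year and an average month of one twelfth of a year (an inference from the figures, not something the table states); `PERIOD_HOURS` and `allowed_downtime_seconds` are hypothetical names:

```python
# Assumed period lengths in hours; the monthly value is an average month.
PERIOD_HOURS = {
    "annual": 365.25 * 24,
    "monthly": 365.25 * 24 / 12,
    "weekly": 7 * 24,
}


def allowed_downtime_seconds(target_percent: float, period: str) -> float:
    """Seconds of full downtime an availability target allows per period."""
    if period not in PERIOD_HOURS:
        raise ValueError(f"unknown period: {period}")
    unavailable_fraction = (100.0 - target_percent) / 100.0
    return unavailable_fraction * PERIOD_HOURS[period] * 3600


for target in (99.0, 99.9, 99.99):
    monthly_minutes = allowed_downtime_seconds(target, "monthly") / 60
    print(f"{target}% -> {monthly_minutes:.2f} minutes per month")
```

A fixed 30-day window gives slightly smaller monthly figures (43.2 minutes instead of 43.8 for 99.9%), so it helps to state which convention a given SLO uses.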

Error Budget Policies

# Error Budget Policy

## Philosophy

Error budgets represent a balance between reliability and velocity.
Spending error budget on innovation is encouraged; exhausting it
requires course correction.

## Budget States

### Green (>50% remaining)
- Normal development velocity
- New features proceed
- Experiments encouraged

### Yellow (25-50% remaining)
- Increased caution on risky changes
- Enhanced review for deployments
- Focus on reliability improvements

### Red (<25% remaining)
- Feature freeze until budget recovers
- Priority on reliability work only
- Post-mortem for budget consumption

### Exhausted (0% remaining)
- All non-emergency deployments halted
- Emergency reliability sprint
- Stakeholder communication required

## Recovery Actions

When budget is exhausted:
1. Immediate deployment freeze
2. Incident review for contributing factors
3. Prioritize top reliability issues
4. Daily burn rate monitoring
5. Resume deployments once 10% of the budget has recovered
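
The budget states in the policy above map directly to thresholds on the remaining budget. A minimal Python sketch of that mapping; the `budget_state` function and its return values are hypothetical, and the boundary handling (exactly 25% and 50% counted as yellow) is one reading of the ranges:

```python
def budget_state(remaining_percent: float) -> str:
    """Classify remaining error budget into the policy's four states."""
    if remaining_percent <= 0:
        return "exhausted"  # all non-emergency deployments halted
    if remaining_percent < 25:
        return "red"        # feature freeze until budget recovers
    if remaining_percent <= 50:
        return "yellow"     # increased caution on risky changes
    return "green"          # normal development velocity


for remaining in (80.0, 40.0, 10.0, 0.0):
    print(f"{remaining:.0f}% remaining -> {budget_state(remaining)}")
```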

SLO for Different Service Types

API Service SLO

service: order-api
tier: critical

slis:
  availability:
    type: request-based
    good_events: 'status < 500'
    valid_events: 'all requests'

  latency_p95:
    type: threshold
    good_events: 'duration < 200ms'
    valid_events: 'all requests'

  latency_p99:
    type: threshold
    good_events: 'duration < 1s'
    valid_events: 'all requests'

slos:
  - name: availability
    target: 99.9%
    window: 30d

  - name: latency_p95
    target: 95%
    window: 30d

  - name: latency_p99
    target: 99%
    window: 30d

Background Job SLO

service: order-processor
tier: standard

slis:
  success_rate:
    type: request-based
    good_events: 'job.status = completed'
    valid_events: 'job.status in (completed, failed)'

  freshness:
    type: threshold
    good_events: 'queue.age < 5m'
    valid_events: 'queue.depth > 0'

slos:
  - name: success_rate
    target: 99.5%
    window: 7d

  - name: freshness
    target: 95%
    window: 1d

Data Pipeline SLO

service: analytics-pipeline
tier: standard

slis:
  completeness:
    type: coverage
    good_events: 'records_processed'
    valid_events: 'records_expected'

  freshness:
    type: threshold
    good_events: 'last_update < 1h'
    valid_events: 'pipeline.active = true'

  correctness:
    type: quality
    good_events: 'validation.passed = true'
    valid_events: 'validation.executed = true'

slos:
  - name: completeness
    target: 99.9%
    window: 1d

  - name: freshness
    target: 99%
    window: 1d

  - name: correctness
    target: 99.99%
    window: 7d
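
Specs like the ones above can be checked mechanically once good and valid event counts are available. A hedged Python sketch, assuming each SLO entry has already been loaded into a dict with the same `name`, `target`, and `window` keys; the helper functions and the sample counts are hypothetical:

```python
def parse_target(target: str) -> float:
    """Convert a target such as '99.5%' to the number 99.5."""
    return float(target.rstrip("%"))


def parse_window_days(window: str) -> int:
    """Convert a day-based window such as '7d' to 7."""
    if not window.endswith("d"):
        raise ValueError(f"only day-based windows are handled here: {window}")
    return int(window[:-1])


def evaluate(slo: dict, good: int, valid: int) -> dict:
    """Compare measured event counts against one SLO entry."""
    target = parse_target(slo["target"])
    sli = 100.0 * good / valid if valid else 100.0
    return {
        "name": slo["name"],
        "window_days": parse_window_days(slo["window"]),
        "sli": sli,
        "met": sli >= target,
        "bad_events": valid - good,
        "allowed_bad_events": (100.0 - target) / 100.0 * valid,
    }


# Entry mirrored from the background job spec; the counts are made up.
slo = {"name": "success_rate", "target": "99.5%", "window": "7d"}
result = evaluate(slo, good=9_960, valid=10_000)
print(result["name"], "met" if result["met"] else "missed",
      f"({result['bad_events']} bad of {result['allowed_bad_events']:.0f} allowed)")
```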

Workflow

When defining SLOs:

Identify: User journeys and critical paths
Select: Choose appropriate SLI types for each journey
Measure: Implement instrumentation for SLIs
Target: Set realistic SLO targets based on user needs
Budget: Calculate error budgets and policies
Alert: Configure multi-window burn rate alerts
Review: Establish regular SLO review cadence

Last Updated: 2025-12-26
