---
name: slo-sli-design
description: Design Service Level Objectives, Indicators, and error budgets
allowed-tools: Read, Glob, Grep, Write, Edit
---
# SLO/SLI Design Skill
## When to Use This Skill

Use this skill when:

- Designing Service Level Objectives, Service Level Indicators, or error budgets for a service
- Planning reliability targets and needing guidance on SLO/SLI design approaches
- Following established SRE patterns and standards for reliability work
## Overview
Design Service Level Objectives, Indicators, and error budget policies.
## MANDATORY: Documentation-First Approach

Before designing SLOs:

- Invoke the `docs-management` skill for SLO/SLI patterns
- Verify SRE practices via MCP servers (perplexity)
- Base guidance on Google SRE and industry best practices
## SLO/SLI/SLA Hierarchy

```text
SLO/SLI/SLA RELATIONSHIP:

┌────────────────────────────────────────────────────────┐
│                                                        │
│ SLA (Service Level Agreement)                          │
│ ├── External promise to customers                      │
│ ├── Legal/contractual implications                     │
│ └── Example: "99.9% monthly uptime"                    │
│                                                        │
│       ▲                                                │
│       │ Buffer (SLO should be tighter)                 │
│       │                                                │
│ SLO (Service Level Objective)                          │
│ ├── Internal reliability target                        │
│ ├── Tighter than SLA (headroom)                        │
│ └── Example: "99.95% monthly availability"             │
│                                                        │
│       ▲                                                │
│       │ Measured by                                    │
│       │                                                │
│ SLI (Service Level Indicator)                          │
│ ├── Actual measurement                                 │
│ ├── Quantitative metric                                │
│ └── Example: "successful_requests / total_requests"    │
│                                                        │
│       ▲                                                │
│       │ Tracks consumption of                          │
│       │                                                │
│ Error Budget                                           │
│ ├── Allowable unreliability: 100% - SLO                │
│ ├── Example: 0.05% ≈ 21.9 minutes/month                │
│ └── Spent on: releases, incidents, maintenance         │
│                                                        │
└────────────────────────────────────────────────────────┘
```
## Common SLI Types
```text
SLI CATEGORIES:

AVAILABILITY SLI:
  "The proportion of requests that are served successfully"

  Formula: successful_requests / total_requests × 100%

  Good Events: HTTP 2xx, 3xx, 4xx (client errors, not server failures)
  Bad Events:  HTTP 5xx, timeouts, connection failures

  Example Prometheus query:
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

───────────────────────────────────────────────────────────────

LATENCY SLI:
  "The proportion of requests that are served within threshold"

  Formula: requests_below_threshold / total_requests × 100%

  Thresholds (example):
  - P50: 100ms  (median experience)
  - P95: 500ms  (95th percentile)
  - P99: 1000ms (tail latency)

  Example Prometheus query (proportion served within 0.5 s):
    sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
    /
    sum(rate(http_request_duration_count[5m]))

───────────────────────────────────────────────────────────────

QUALITY/CORRECTNESS SLI:
  "The proportion of requests that return correct results"

  Formula: correct_responses / total_responses × 100%

  Good Events: Valid data, expected format
  Bad Events:  Data corruption, stale data, wrong results

───────────────────────────────────────────────────────────────

FRESHNESS SLI:
  "The proportion of data that is updated within threshold"

  Formula: fresh_records / total_records × 100%

  Example: "95% of records updated within 5 minutes"

───────────────────────────────────────────────────────────────

THROUGHPUT SLI:
  "The proportion of time the system handles expected load"

  Formula: time_at_capacity / total_time × 100%

  Example: "System handles 1000 req/s 99% of the time"
```
## Error Budget Calculation
```text
ERROR BUDGET MATH:

Monthly error budget (average month ≈ 30.44 days ≈ 730.5 h):

SLO Target │ Error Budget │ Allowed Downtime
───────────┼──────────────┼──────────────────
  99%      │ 1%           │ 7h 18m
  99.5%    │ 0.5%         │ 3h 39m
  99.9%    │ 0.1%         │ 43m 50s
  99.95%   │ 0.05%        │ 21m 55s
  99.99%   │ 0.01%        │ 4m 23s
  99.999%  │ 0.001%       │ 26s
```
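The table falls out of a one-line calculation: allowed downtime = (1 − SLO) × minutes in the month. A minimal C# sketch of the arithmetic, using the average-month figure of roughly 730.5 hours; the incident numbers mirror the consumption example below:

```csharp
using System;

// Allowed downtime = (1 - SLO) × minutes in an average month.
double slo = 0.9995;                                 // 99.95%
double minutesPerMonth = 730.5 * 60;                 // ≈ 43,830 min
double budgetMinutes = (1 - slo) * minutesPerMonth;  // ≈ 21.9 min (21m 55s)

// Consumption: two incidents totalling 8 minutes.
double spentMinutes = 5 + 3;                                  // failover + rollback
double remainingFraction = 1 - spentMinutes / budgetMinutes;  // ≈ 0.635 (diagram rounds to 64%)

Console.WriteLine($"Budget: {budgetMinutes:F1} min, remaining: {remainingFraction:P1}");
```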
Error Budget Consumption:

```text
┌────────────────────────────────────────────────────────┐
│                                                        │
│ Monthly Budget: 21m 55s (99.95% SLO)                   │
│                                                        │
│ ████████████░░░░░░░░░░░░░░░░░░░░ Used: 8m (36%)        │
│                                                        │
│ Incidents:                                             │
│ - Jan 5: Database failover - 5m                        │
│ - Jan 12: Deployment rollback - 3m                     │
│                                                        │
│ Remaining: 13m 55s (64%)                               │
│                                                        │
│ Status: ✓ HEALTHY                                      │
│                                                        │
└────────────────────────────────────────────────────────┘
```
## SLO Design Process
```text
SLO DESIGN WORKFLOW:

Step 1: IDENTIFY USER JOURNEYS
┌────────────────────────────────────────────────────────┐
│ What do users care about?                              │
│                                                        │
│ Critical User Journeys (CUJs):                         │
│ - Login and authentication                             │
│ - Search and browse products                           │
│ - Add to cart and checkout                             │
│ - View order status                                    │
│                                                        │
│ For each journey:                                      │
│ - What constitutes success?                            │
│ - What latency is acceptable?                          │
│ - What's the business impact of failure?               │
└────────────────────────────────────────────────────────┘

Step 2: DEFINE SLIs
┌────────────────────────────────────────────────────────┐
│ What can we measure that represents user happiness?    │
│                                                        │
│ For "Checkout" journey:                                │
│ - Availability: checkout completes without error       │
│ - Latency: checkout completes within 3 seconds         │
│ - Correctness: order total matches cart                │
│                                                        │
│ SLI Specification:                                     │
│ - What events are we measuring?                        │
│ - What's a "good" event vs "bad" event?                │
│ - Where do we measure? (server, client, synthetic)     │
└────────────────────────────────────────────────────────┘
```
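To make Step 2 concrete, an SLI specification can be captured as a small data structure. A hedged sketch: the record shape, field names, and the `CheckoutSlis` examples are illustrative, not a standard schema:

```csharp
// Hypothetical shape for an SLI specification (Step 2).
public enum MeasurementPoint { Server, Client, Synthetic }

public record SliSpec(
    string Journey,        // e.g. "Checkout"
    string SliType,        // e.g. "Availability", "Latency", "Correctness"
    string GoodEvent,      // what counts as success
    string BadEvent,       // what counts as failure
    MeasurementPoint Point);

public static class CheckoutSlis
{
    // Two of the checkout SLIs from the box above.
    public static readonly SliSpec Availability = new(
        "Checkout", "Availability",
        "Checkout completes without error", "5xx, timeout, connection failure",
        MeasurementPoint.Server);

    public static readonly SliSpec Latency = new(
        "Checkout", "Latency",
        "Checkout completes within 3 seconds", "Slower than 3 seconds",
        MeasurementPoint.Server);
}
```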
```text
Step 3: SET SLO TARGETS
┌────────────────────────────────────────────────────────┐
│ What reliability level should we target?               │
│                                                        │
│ Consider:                                              │
│ - Current baseline (what are we achieving now?)        │
│ - User expectations (what do users tolerate?)          │
│ - Business requirements (any SLAs?)                    │
│ - Cost vs reliability trade-off                        │
│                                                        │
│ Start achievable, improve iteratively                  │
│ SLO = current baseline - small margin                  │
└────────────────────────────────────────────────────────┘
```
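The "baseline minus a small margin" rule from Step 3, in code form. A minimal sketch; all figures are illustrative:

```csharp
// Derive an initial SLO target from the measured baseline, leaving a small
// margin of headroom so the target is achievable from day one.
double measuredBaseline = 0.9997;  // 99.97% over the last quarter (illustrative)
double margin = 0.0002;            // headroom for normal variance
double sloTarget = measuredBaseline - margin;  // 0.9995 → 99.95%
```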
```text
Step 4: DEFINE ERROR BUDGET POLICY
┌────────────────────────────────────────────────────────┐
│ What happens when budget is exhausted?                 │
│                                                        │
│ Error Budget Policy:                                   │
│ - Budget > 50%: Normal operations                      │
│ - Budget 25-50%: Slow down risky changes               │
│ - Budget < 25%: Focus on reliability                   │
│ - Budget = 0%: Feature freeze, reliability only        │
│                                                        │
│ Escalation:                                            │
│ - Who gets notified at each threshold?                 │
│ - What actions are required?                           │
└────────────────────────────────────────────────────────┘
```
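The thresholds in Step 4 map naturally onto a small lookup. A minimal C# sketch; the `BudgetPolicy` type and its member names are illustrative:

```csharp
// Map remaining error budget (0.0-1.0) to the policy states defined above.
public enum BudgetPolicy { Normal, SlowDown, ReliabilityFocus, FeatureFreeze }

public static class ErrorBudgetPolicy
{
    public static BudgetPolicy Evaluate(double budgetRemaining) => budgetRemaining switch
    {
        > 0.50 => BudgetPolicy.Normal,           // normal operations
        > 0.25 => BudgetPolicy.SlowDown,         // slow down risky changes
        > 0.00 => BudgetPolicy.ReliabilityFocus, // reliability-focused work
        _      => BudgetPolicy.FeatureFreeze     // exhausted: feature freeze
    };
}
```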
## SLO Document Template
# SLO: {Service Name} - {Journey/Feature}
## Service Overview
| Attribute | Value |
|-----------|-------|
| Service | [Service name] |
| Owner | [Team name] |
| Criticality | [Critical/High/Medium/Low] |
| User Journey | [Journey name] |
## SLI Specification
### Availability SLI
**Definition:** The proportion of [event type] that [success criteria].
**Good Event:** [What counts as success]
**Bad Event:** [What counts as failure]
**Measurement:**
- Source: [Prometheus/Azure Monitor/etc.]
- Query:
```promql
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

### Latency SLI

**Definition:** The proportion of requests served within [threshold].

**Thresholds:**
| Percentile | Threshold |
|---|---|
| P50 | [X]ms |
| P95 | [X]ms |
| P99 | [X]ms |
**Measurement:**
```promql
histogram_quantile(0.95,
  rate(http_request_duration_bucket[5m]))
```
## SLO Targets

| SLI | Target | Window |
|---|---|---|
| Availability | [99.9%] | 30 days rolling |
| Latency | [99%] of requests below 500ms | 30 days rolling |
## Error Budget
| SLO | Error Budget | Allowed Downtime (30d) |
|---|---|---|
| 99.9% availability | 0.1% | 43m 50s |
| 99% latency | 1% | 7h 18m |
## Error Budget Policy

### Budget Thresholds
| Budget Remaining | Status | Actions |
|---|---|---|
| > 50% | 🟢 Healthy | Normal operations |
| 25-50% | 🟡 Caution | Review recent changes |
| 10-25% | 🟠 Warning | Slow deployments, reliability focus |
| < 10% | 🔴 Critical | Feature freeze |
| Exhausted | ⛔ Frozen | Reliability-only work |
### Escalation
| Threshold | Notify | Action Required |
|---|---|---|
| < 50% | Team lead | Awareness |
| < 25% | Engineering manager | Review deployment pace |
| < 10% | Director | Feature freeze decision |
| Exhausted | VP Engineering | Incident response mode |
## Alerting

### SLO Burn Rate Alerts
| Severity | Burn Rate | Time Window | Example |
|---|---|---|---|
| Critical | 14.4x | 1h | Budget exhausted in ~2 days |
| Warning | 6x | 6h | Budget exhausted in ~5 days |
| Info | 1x | 3d | Budget on track to exhaust |
### Alert Configuration

```yaml
- alert: SLOHighBurnRate
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4x burn rate against a 99.9% SLO
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error budget burn rate"
    description: "Error budget burning at 14.4x the sustainable rate"
```
## Review Schedule
- Weekly: SLO dashboard review
- Monthly: Error budget retrospective
- Quarterly: SLO target review
## Appendix: Historical Performance
[Include baseline measurements and trends]
## .NET SLO Implementation
```csharp
// SLO metric implementation in .NET
// Infrastructure/Telemetry/SloMetrics.cs
using System.Diagnostics;
using System.Diagnostics.Metrics;

public class SloMetrics
{
    private readonly Counter<long> _totalRequests;
    private readonly Counter<long> _successfulRequests;
    private readonly Counter<long> _failedRequests;
    private readonly Histogram<double> _requestDuration;

    public SloMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("OrdersApi.SLO");

        _totalRequests = meter.CreateCounter<long>(
            "slo.requests.total",
            "{request}",
            "Total requests for SLO calculation");

        _successfulRequests = meter.CreateCounter<long>(
            "slo.requests.successful",
            "{request}",
            "Successful requests (good events)");

        _failedRequests = meter.CreateCounter<long>(
            "slo.requests.failed",
            "{request}",
            "Failed requests (bad events)");

        _requestDuration = meter.CreateHistogram<double>(
            "slo.request.duration",
            "ms",
            "Request duration for latency SLI");
    }

    public void RecordRequest(
        string endpoint,
        int statusCode,
        double durationMs)
    {
        var tags = new TagList
        {
            { "endpoint", endpoint },
            { "status_code", statusCode.ToString() }
        };

        _totalRequests.Add(1, tags);

        // Availability SLI: 5xx = bad event, everything else (incl. 4xx) = good
        if (statusCode >= 500)
        {
            _failedRequests.Add(1, tags);
        }
        else
        {
            _successfulRequests.Add(1, tags);
        }

        // Latency SLI
        _requestDuration.Record(durationMs, tags);
    }
}

// Middleware to capture SLO metrics for every request
public class SloMetricsMiddleware
{
    private readonly RequestDelegate _next;
    private readonly SloMetrics _sloMetrics;

    public SloMetricsMiddleware(RequestDelegate next, SloMetrics sloMetrics)
    {
        _next = next;
        _sloMetrics = sloMetrics;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await _next(context);
        }
        finally
        {
            // Record in finally so unhandled exceptions still count against the SLI
            stopwatch.Stop();
            var endpoint = context.GetEndpoint()?.DisplayName ?? "unknown";
            var statusCode = context.Response.StatusCode;
            var durationMs = stopwatch.Elapsed.TotalMilliseconds;

            _sloMetrics.RecordRequest(endpoint, statusCode, durationMs);
        }
    }
}
```
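To wire this up, register the metrics class and middleware at startup. A minimal sketch assuming an ASP.NET Core 8 application, where the default host builder provides `IMeterFactory` through dependency injection; the `/orders/{id}` endpoint is illustrative:

```csharp
// Program.cs (sketch): register SloMetrics and the middleware
var builder = WebApplication.CreateBuilder(args);

// IMeterFactory is registered by the default host; SloMetrics resolves it via DI
builder.Services.AddSingleton<SloMetrics>();

var app = builder.Build();

// Capture SLO metrics early so every downstream request is measured
app.UseMiddleware<SloMetricsMiddleware>();

app.MapGet("/orders/{id}", (string id) => Results.Ok(new { id }));

app.Run();
```

Placing the middleware early in the pipeline matters: anything registered before it is invisible to the SLI.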
## Error Budget Dashboard Queries
```promql
# Availability SLI (30-day rolling)
1 - (
  sum(increase(slo_requests_failed_total[30d]))
  /
  sum(increase(slo_requests_total[30d]))
)

# Latency SLI (proportion of requests under 500 ms, 30-day)
sum(increase(slo_request_duration_bucket{le="500"}[30d]))
/
sum(increase(slo_request_duration_count[30d]))

# Error Budget Remaining (availability)
# remaining = (budget - error rate) / budget, where budget = 1 - SLO
(
  (1 - 0.999)
  -
  (
    sum(increase(slo_requests_failed_total[30d]))
    /
    sum(increase(slo_requests_total[30d]))
  )
) / (1 - 0.999)

# Error Budget Burn Rate (1h)
(
  sum(rate(slo_requests_failed_total[1h]))
  /
  sum(rate(slo_requests_total[1h]))
) / (1 - 0.999)  # divide by the error budget (0.1%)
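```

As a sanity check on the burn-rate thresholds used earlier, the arithmetic is short. A minimal C# sketch; the observed error rate is illustrative, and the 14.4x figure reproduces the critical alert from the template:

```csharp
using System;

// Burn rate = observed error rate / error budget.
// At 1x the budget lasts exactly one 30-day window; at 14.4x it is
// exhausted in 30 / 14.4 ≈ 2.08 days (the "critical" row above).
double slo = 0.999;
double errorBudget = 1 - slo;                        // 0.001
double observedErrorRate = 0.0144;                   // 1.44% of requests failing
double burnRate = observedErrorRate / errorBudget;   // 14.4
double daysToExhaustion = 30.0 / burnRate;           // ≈ 2.08

Console.WriteLine($"{burnRate:F1}x burn → budget gone in {daysToExhaustion:F1} days");
```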
## Workflow

When designing SLOs:

1. **Identify User Journeys**: What do users care about?
2. **Define SLIs**: What can we measure?
3. **Measure Baseline**: What are we achieving now?
4. **Set SLO Targets**: Achievable but aspirational
5. **Define Error Budget Policy**: What happens when budget is low?
6. **Implement Alerting**: Multi-window burn rate alerts
7. **Create Dashboards**: Visibility into SLO status
8. **Review Regularly**: Adjust based on learning
## References

For detailed guidance:

- Google SRE Book and SRE Workbook chapters on SLOs, SLIs, and error budgets
- The `docs-management` skill for current SLO/SLI documentation patterns

Last Updated: 2025-12-26