name: backend-reliability-enforcer description: Use when implementing backend APIs, data persistence, or external integrations. Enforces TodoWrite with 25+ items. Triggers: "backend API", "REST endpoint", "external integration", "data persistence". If thinking "just a quick endpoint" - use this.

Backend Reliability Enforcer

Payment Operations?

If implementing payment, billing, subscription, or financial transactions: → Use payment-operations-enforcer skill FIRST (NON-NEGOTIABLE) → Then return here for general reliability requirements

TodoWrite Requirements

CREATE TodoWrite with 5 sections (25+ items total):

Section	Min Items
Fault Tolerance	5+
Error Handling	5+
Data Integrity	5+
Security	5+
Observability	5+

Verification Checkpoint

Verify 3 random items have ALL THREE:

✓ Concrete numbers ("5 failures/10s", "15s timeout", "100 req/min")
✓ Specific tools ("Opossum", "pino", "Joi", "Knex", "Redis")
✓ Measurable outcome ("generates UUID v4 correlation ID")

❌ FAILS	✅ PASSES
"Add error handling"	"pino logging: UUID v4 correlation ID, format `{correlationId, level, timestamp, service, message}`"
"Validate input"	"Joi schema: `amount` (number, positive, max 999999), `currency` (ISO 4217). Return 400 with `{error, fields}`"
"Add circuit breaker"	"Opossum: 5 failures/10s threshold, 30s reset, fallback to cached data"

Section Requirements

Fault Tolerance (5+ items)

Circuit breaker for ALL external calls (Opossum: 5 failures/10s, 30s reset)
Retry with exponential backoff (3 retries: 100ms, 200ms, 400ms)
Timeouts for all I/O (15s API calls, 5s database)
Graceful degradation (cached data, disable non-critical features)
Health checks (/health endpoint with dependency status)

Error Handling (5+ items)

Comprehensive error handling (try/catch, error middleware)
Correlation IDs (UUID v4 on ingress, propagate to all logs/services)
Structured logging (pino/winston with correlation ID)
Error responses (400/401/403/404/422/500, never expose stack traces)
Error metrics/alerts (Prometheus counter, alert if error rate > 5%)

Data Integrity (5+ items)

Input validation at API boundaries (Joi/Zod schemas)
Database transactions for consistency (Knex/Prisma transactions)
Idempotency keys for retryable operations (Idempotency-Key header, Redis 24h)
Data retention policies (auto-delete soft-deleted after 90 days)
Audit trails for sensitive modifications (who, what, when, old/new)

Security (5+ items)

Authentication on all endpoints (JWT/OAuth/API keys)
Authorization checks (RBAC, verify permissions)
Input sanitization (parameterized SQL, XSS prevention)
Encryption (AES-256 at rest, TLS 1.3 in transit)
Rate limiting (100 req/min per IP, express-rate-limit)

Observability (5+ items)

Structured logging (INFO/WARN/ERROR levels)
Key metrics (latency P50/P95/P99, throughput, error rate)
Distributed tracing (OpenTelemetry with trace/span IDs)
Dashboards/alerts (Grafana, PagerDuty for SLO breaches)
Runbook documentation (investigation, rollback, escalation)

Red Flags - STOP When You Think:

Thought	Reality
"Add reliability later"	80% never added, retrofitting costs 10x
"That edge case is unlikely"	Unlikely × scale = frequent
"Circuit breaker is overkill"	One slow service = cascade failure
"Customer needs it in 3 days"	Broken code causes MORE delay

Response Templates

"Add Reliability Later"

❌ BLOCKED: Cannot defer reliability requirements.

80% of "later" items never get added
Retrofitting costs 10x more than building in from start
Production incidents cost $5K-50K per incident + customer trust

Required to override:

Specific retrofit date (not "later")
Budget allocated (engineer-weeks)
Risk acceptance signed by decision maker
Interim mitigation plan (reduced traffic limits? 24/7 on-call?)

Self-Grading Before Complete

[ ] 25+ items across 5 sections
[ ] 80%+ have concrete numbers
[ ] 80%+ name specific tools
[ ] 100% have measurable outcomes
[ ] 3 random items passed specificity test
[ ] Payment operations: delegated to payment-operations-enforcer

Grade 7+/8: Ready to proceed
Grade <7: Revise TodoWrite

backend-reliability-enforcer

Install Skill

SKILL.md