name: backend-reliability-enforcer
description: Use when implementing backend APIs, data persistence, or external integrations. Enforces TodoWrite with 25+ items. Triggers: "backend API", "REST endpoint", "external integration", "data persistence". If thinking "just a quick endpoint" - use this.
Backend Reliability Enforcer
Payment Operations?
If implementing payment, billing, subscription, or financial transactions:
→ Use payment-operations-enforcer skill FIRST (NON-NEGOTIABLE)
→ Then return here for general reliability requirements
TodoWrite Requirements
CREATE TodoWrite with 5 sections (25+ items total):
| Section | Min Items |
|---|---|
| Fault Tolerance | 5+ |
| Error Handling | 5+ |
| Data Integrity | 5+ |
| Security | 5+ |
| Observability | 5+ |
Verification Checkpoint
Verify 3 random items have ALL THREE:
- ✓ Concrete numbers ("5 failures/10s", "15s timeout", "100 req/min")
- ✓ Specific tools ("Opossum", "pino", "Joi", "Knex", "Redis")
- ✓ Measurable outcome ("generates UUID v4 correlation ID")
| ❌ FAILS | ✅ PASSES |
|---|---|
| "Add error handling" | "pino logging: UUID v4 correlation ID, format {correlationId, level, timestamp, service, message}" |
| "Validate input" | "Joi schema: amount (number, positive, max 999999), currency (ISO 4217). Return 400 with {error, fields}" |
| "Add circuit breaker" | "Opossum: 5 failures/10s threshold, 30s reset, fallback to cached data" |
Section Requirements
Fault Tolerance (5+ items)
- Circuit breaker for ALL external calls (Opossum: 5 failures/10s, 30s reset)
- Retry with exponential backoff (3 retries: 100ms, 200ms, 400ms)
- Timeouts for all I/O (15s API calls, 5s database)
- Graceful degradation (cached data, disable non-critical features)
- Health checks (`/health` endpoint with dependency status)
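The retry and circuit-breaker numbers above can be sketched without any library. This is an illustration only, not Opossum's API: `backoffDelays`, `CircuitBreaker`, and the injected `now` clock are hypothetical names, the calls are synchronous for brevity, and the breaker simply counts failures inside a sliding 10s window.

```typescript
// Hypothetical sketch: not Opossum's API, only the thresholds from the list above.

/** Exponential backoff delays: [100, 200, 400] ms for 3 retries. */
function backoffDelays(retries = 3, baseMs = 100): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseMs * 2 ** attempt);
}

type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures: number[] = []; // timestamps (ms) of recent failures
  private openedAt: number | null = null;
  private threshold: number; // failures that trip the breaker
  private windowMs: number; // failures are counted within this window
  private resetMs: number; // wait before allowing a trial call
  private now: () => number; // injectable clock, assumed for testability

  constructor(threshold = 5, windowMs = 10_000, resetMs = 30_000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.resetMs = resetMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.resetMs ? "half-open" : "open";
  }

  /** Runs `call` unless the breaker is open; uses `fallback` (e.g. cached data) otherwise. */
  exec<T>(call: () => T, fallback: () => T): T {
    if (this.state() === "open") return fallback();
    try {
      const result = call();
      // A success, including a half-open trial call, closes the breaker.
      this.failures = [];
      this.openedAt = null;
      return result;
    } catch {
      this.recordFailure();
      return fallback();
    }
  }

  private recordFailure(): void {
    const t = this.now();
    this.failures = this.failures.filter((f) => t - f < this.windowMs);
    this.failures.push(t);
    // Trip on reaching the threshold, or re-open after a failed half-open trial.
    if (this.openedAt !== null || this.failures.length >= this.threshold) {
      this.openedAt = t;
    }
  }
}
```

In a real service the wrapped call would be asynchronous and also subject to the 15s/5s timeouts listed above; a maintained library such as Opossum is likely a better choice than hand-rolling this.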
Error Handling (5+ items)
- Comprehensive error handling (try/catch, error middleware)
- Correlation IDs (UUID v4 on ingress, propagate to all logs/services)
- Structured logging (pino/winston with correlation ID)
- Error responses (400/401/403/404/422/500, never expose stack traces)
- Error metrics/alerts (Prometheus counter, alert if error rate > 5%)
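As a rough sketch of how correlation IDs, structured logs, and safe error responses fit together: the header name `x-correlation-id`, the service name `orders-api`, and all function names below are assumptions for illustration, and the log shape follows the `{correlationId, level, timestamp, service, message}` format used earlier. In practice pino or winston would do the writing.

```typescript
import { randomUUID } from "node:crypto";

interface LogEntry {
  correlationId: string;
  level: "info" | "warn" | "error";
  timestamp: string;
  service: string;
  message: string;
}

/** Reuse an inbound ID if the caller sent one, otherwise mint a UUID v4 on ingress. */
function correlationIdFor(headers: Record<string, string | undefined>): string {
  return headers["x-correlation-id"] ?? randomUUID(); // header name is an assumption
}

function logEntry(
  correlationId: string,
  level: LogEntry["level"],
  message: string,
  service = "orders-api", // hypothetical service name
): LogEntry {
  return { correlationId, level, timestamp: new Date().toISOString(), service, message };
}

/**
 * Logs the full error server-side and returns a client-safe 500 body.
 * The stack trace never leaves the service; the correlation ID lets
 * support match the response to the log line.
 */
function toErrorResponse(
  err: Error,
  correlationId: string,
  log: (entry: LogEntry) => void = (entry) => console.error(JSON.stringify(entry)),
): { status: number; body: { error: string; correlationId: string } } {
  log(logEntry(correlationId, "error", err.stack ?? err.message));
  return { status: 500, body: { error: "Internal Server Error", correlationId } };
}
```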
Data Integrity (5+ items)
- Input validation at API boundaries (Joi/Zod schemas)
- Database transactions for consistency (Knex/Prisma transactions)
- Idempotency keys for retryable operations (`Idempotency-Key` header, Redis 24h)
- Data retention policies (auto-delete soft-deleted after 90 days)
- Audit trails for sensitive modifications (who, what, when, old/new)
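The idempotency-key item can be illustrated with an in-memory stand-in for Redis. Everything here is hypothetical (`IdempotencyStore`, the injected `now` clock); the only values taken from the list are the `Idempotency-Key` concept and the 24h retention.

```typescript
// Hypothetical in-memory stand-in for a Redis key with a 24h TTL.
const IDEMPOTENCY_TTL_MS = 24 * 60 * 60 * 1000;

interface StoredResponse<T> {
  response: T;
  expiresAt: number; // epoch ms after which the key may be reused
}

class IdempotencyStore<T> {
  private entries = new Map<string, StoredResponse<T>>();
  private now: () => number; // injectable clock, assumed for testability

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  /** Runs `operation` once per key; a retry within 24h gets the stored response back. */
  run(key: string, operation: () => T): T {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.response;
    }
    const response = operation();
    this.entries.set(key, { response, expiresAt: this.now() + IDEMPOTENCY_TTL_MS });
    return response;
  }
}
```

This sketch ignores concurrency: two requests with the same key arriving at once would both run. With Redis, an atomic `SET` with the `NX` and `EX` options is one common way to claim the key before doing the work.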
Security (5+ items)
- Authentication on all endpoints (JWT/OAuth/API keys)
- Authorization checks (RBAC, verify permissions)
- Input sanitization (parameterized SQL, XSS prevention)
- Encryption (AES-256 at rest, TLS 1.3 in transit)
- Rate limiting (100 req/min per IP, express-rate-limit)
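To make the "100 req/min per IP" figure concrete, here is a minimal fixed-window limiter. It is a sketch under assumptions, not how express-rate-limit is implemented: `FixedWindowLimiter` and the injected `now` clock are hypothetical, and state is kept in process memory rather than a shared store.

```typescript
// Hypothetical fixed-window limiter: at most `limit` requests per `windowMs` per IP.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;
  private now: () => number; // injectable clock, assumed for testability

  constructor(limit = 100, windowMs = 60_000, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  /** Returns true if the request may proceed; false means respond with 429. */
  allow(ip: string): boolean {
    const t = this.now();
    const current = this.windows.get(ip);
    if (current === undefined || t - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has ended.
      this.windows.set(ip, { start: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A fixed window allows short bursts across a window boundary; behind multiple instances the counters would also need to live in a shared store such as Redis.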
Observability (5+ items)
- Structured logging (INFO/WARN/ERROR levels)
- Key metrics (latency P50/P95/P99, throughput, error rate)
- Distributed tracing (OpenTelemetry with trace/span IDs)
- Dashboards/alerts (Grafana, PagerDuty for SLO breaches)
- Runbook documentation (investigation, rollback, escalation)
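The key metrics above (latency P50/P95/P99 and error rate) can be computed from raw request samples as follows. This is a sketch using the nearest-rank percentile method; `RequestSample`, `percentile`, and `summarize` are hypothetical names, and a production system would more likely rely on histogram metrics from its monitoring stack.

```typescript
interface RequestSample {
  latencyMs: number;
  ok: boolean; // false for responses counted as errors
}

/** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of empty sample set");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function summarize(samples: RequestSample[]) {
  const latencies = samples.map((s) => s.latencyMs);
  const errors = samples.filter((s) => !s.ok).length;
  return {
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    p99: percentile(latencies, 99),
    errorRate: errors / samples.length, // compare against the 5% alert threshold
  };
}
```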
Red Flags - STOP When You Think:
| Thought | Reality |
|---|---|
| "Add reliability later" | 80% never added, retrofitting costs 10x |
| "That edge case is unlikely" | Unlikely × scale = frequent |
| "Circuit breaker is overkill" | One slow service = cascade failure |
| "Customer needs it in 3 days" | Broken code causes MORE delay |
Response Templates
"Add Reliability Later"
❌ BLOCKED: Cannot defer reliability requirements.
- 80% of "later" items never get added
- Retrofitting costs 10x more than building in from start
- Production incidents cost $5K-50K each, plus lost customer trust
Required to override:
- Specific retrofit date (not "later")
- Budget allocated (engineer-weeks)
- Risk acceptance signed by decision maker
- Interim mitigation plan (reduced traffic limits? 24/7 on-call?)
Self-Grading Before Complete
[ ] 25+ items across 5 sections
[ ] 80%+ have concrete numbers
[ ] 80%+ name specific tools
[ ] 100% have measurable outcomes
[ ] 3 random items passed specificity test
[ ] Payment operations: delegated to payment-operations-enforcer
Grade 5+/6: Ready to proceed
Grade <5: Revise TodoWrite