Claude Code Plugins

Community-maintained marketplace

Feedback

backend-reliability-enforcer

@majiayu000/claude-skill-registry
2
0

Use when implementing backend APIs, data persistence, or external integrations. Enforces TodoWrite with 25+ items. Triggers: "backend API", "REST endpoint", "external integration", "data persistence". If thinking "just a quick endpoint" - use this.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md


name: backend-reliability-enforcer description: Use when implementing backend APIs, data persistence, or external integrations. Enforces TodoWrite with 25+ items. Triggers: "backend API", "REST endpoint", "external integration", "data persistence". If thinking "just a quick endpoint" - use this.

Backend Reliability Enforcer

Payment Operations?

If implementing payment, billing, subscription, or financial transactions: → Use payment-operations-enforcer skill FIRST (NON-NEGOTIABLE) → Then return here for general reliability requirements


TodoWrite Requirements

CREATE TodoWrite with 5 sections (25+ items total):

Section Min Items
Fault Tolerance 5+
Error Handling 5+
Data Integrity 5+
Security 5+
Observability 5+

Verification Checkpoint

Verify 3 random items have ALL THREE:

  • ✓ Concrete numbers ("5 failures/10s", "15s timeout", "100 req/min")
  • ✓ Specific tools ("Opossum", "pino", "Joi", "Knex", "Redis")
  • ✓ Measurable outcome ("generates UUID v4 correlation ID")
❌ FAILS ✅ PASSES
"Add error handling" "pino logging: UUID v4 correlation ID, format {correlationId, level, timestamp, service, message}"
"Validate input" "Joi schema: amount (number, positive, max 999999), currency (ISO 4217). Return 400 with {error, fields}"
"Add circuit breaker" "Opossum: 5 failures/10s threshold, 30s reset, fallback to cached data"

Section Requirements

Fault Tolerance (5+ items)

  • Circuit breaker for ALL external calls (Opossum: 5 failures/10s, 30s reset)
  • Retry with exponential backoff (3 retries: 100ms, 200ms, 400ms)
  • Timeouts for all I/O (15s API calls, 5s database)
  • Graceful degradation (cached data, disable non-critical features)
  • Health checks (/health endpoint with dependency status)

Error Handling (5+ items)

  • Comprehensive error handling (try/catch, error middleware)
  • Correlation IDs (UUID v4 on ingress, propagate to all logs/services)
  • Structured logging (pino/winston with correlation ID)
  • Error responses (400/401/403/404/422/500, never expose stack traces)
  • Error metrics/alerts (Prometheus counter, alert if error rate > 5%)

Data Integrity (5+ items)

  • Input validation at API boundaries (Joi/Zod schemas)
  • Database transactions for consistency (Knex/Prisma transactions)
  • Idempotency keys for retryable operations (Idempotency-Key header, Redis 24h)
  • Data retention policies (auto-delete soft-deleted after 90 days)
  • Audit trails for sensitive modifications (who, what, when, old/new)

Security (5+ items)

  • Authentication on all endpoints (JWT/OAuth/API keys)
  • Authorization checks (RBAC, verify permissions)
  • Input sanitization (parameterized SQL, XSS prevention)
  • Encryption (AES-256 at rest, TLS 1.3 in transit)
  • Rate limiting (100 req/min per IP, express-rate-limit)

Observability (5+ items)

  • Structured logging (INFO/WARN/ERROR levels)
  • Key metrics (latency P50/P95/P99, throughput, error rate)
  • Distributed tracing (OpenTelemetry with trace/span IDs)
  • Dashboards/alerts (Grafana, PagerDuty for SLO breaches)
  • Runbook documentation (investigation, rollback, escalation)

Red Flags - STOP When You Think:

Thought Reality
"Add reliability later" 80% never added, retrofitting costs 10x
"That edge case is unlikely" Unlikely × scale = frequent
"Circuit breaker is overkill" One slow service = cascade failure
"Customer needs it in 3 days" Broken code causes MORE delay

Response Templates

"Add Reliability Later"

BLOCKED: Cannot defer reliability requirements.

  • 80% of "later" items never get added
  • Retrofitting costs 10x more than building in from start
  • Production incidents cost $5K-50K per incident + customer trust

Required to override:

  1. Specific retrofit date (not "later")
  2. Budget allocated (engineer-weeks)
  3. Risk acceptance signed by decision maker
  4. Interim mitigation plan (reduced traffic limits? 24/7 on-call?)

Self-Grading Before Complete

[ ] 25+ items across 5 sections
[ ] 80%+ have concrete numbers
[ ] 80%+ name specific tools
[ ] 100% have measurable outcomes
[ ] 3 random items passed specificity test
[ ] Payment operations: delegated to payment-operations-enforcer

Grade 7+/8: Ready to proceed
Grade <7: Revise TodoWrite