name	postmortem-writer
description	Creates comprehensive post-incident documents with timeline, root cause analysis, contributing factors, action items, and ownership. Follows SRE best practices for blameless postmortems. Use for "postmortem", "incident review", "RCA", or "post-incident".

Postmortem Writer

Name: postmortem-writer
Author: patricio0312rev

Document incidents for learning and improvement.

Postmortem Template

# Postmortem: API Outage - Database Connection Pool Exhausted

**Date:** 2024-01-15
**Authors:** Jane Doe (On-call), John Smith (DBA)
**Status:** Complete
**Severity:** P1 (Critical)

## Summary

On January 15, 2024, our API experienced a complete outage for 25 minutes (14:32 - 14:57 UTC) affecting 100% of users. The root cause was database connection pool exhaustion triggered by a connection leak introduced in deployment v2.3.4.

**Impact:**

- Duration: 25 minutes
- Users affected: ~50,000
- Requests failed: ~125,000
- Revenue impact: ~$15,000

## Timeline (All times UTC)

| Time  | Event                                            |
| ----- | ------------------------------------------------ |
| 14:15 | v2.3.4 deployed to production                    |
| 14:32 | First CloudWatch alarm: HighErrorRate            |
| 14:33 | PagerDuty alert sent to on-call (Jane)           |
| 14:35 | Jane acknowledges, begins investigation          |
| 14:38 | Identified: Database connection pool at 100%     |
| 14:40 | Attempted: Kill long-running queries (no effect) |
| 14:43 | Decision: Rollback to v2.3.3                     |
| 14:45 | Rollback initiated                               |
| 14:47 | Rollback complete, connections dropping          |
| 14:50 | Error rate returning to normal                   |
| 14:57 | All systems recovered, incident closed           |
| 15:30 | Postmortem meeting scheduled                     |

## Root Cause

A code change in v2.3.4 introduced a connection leak in the user profile endpoint. The new caching layer was not properly releasing database connections after queries completed.

**Code diff:**
\`\`\`diff

- await prisma.user.findUnique({ where: { id } });

* const client = await pool.connect();
* const user = await client.query('SELECT \* FROM users WHERE id = $1', [id]);
* // Missing: client.release() ❌
  \`\`\`

## Contributing Factors

1. **Insufficient testing:** Load tests didn't catch the leak

   - Tests only ran for 5 minutes
   - Not enough concurrent connections to exhaust pool

2. **Missing monitoring:** No alerts on connection pool metrics

   - Had alarms for query latency
   - No alarms for active connections count

3. **Inadequate code review:** Reviewer didn't spot missing release()

   - PR approved without running locally
   - No checklist for connection management

4. **Deployment process:** No gradual rollout
   - Deployed to 100% of production immediately
   - No canary deployment

## What Went Well

1. ✅ **Fast detection:** Alert fired within 3 minutes
2. ✅ **Clear runbook:** DBA runbook had exact steps to follow
3. ✅ **Quick decision:** Made rollback decision in 8 minutes
4. ✅ **Communication:** Status page updated every 5 minutes
5. ✅ **Rollback capability:** Automated rollback took <2 minutes

## What Went Wrong

1. ❌ **Code review missed bug:** Connection leak not caught
2. ❌ **Testing gaps:** Load tests insufficient duration
3. ❌ **No canary:** Deployed to all instances at once
4. ❌ **Late detection:** 17 minutes between deploy and alert

## Action Items

| Action                                        | Owner   | Due Date   | Priority | Status         |
| --------------------------------------------- | ------- | ---------- | -------- | -------------- |
| Add connection pool metrics to dashboards     | Jane    | 2024-01-20 | P0       | ✅ Done        |
| Create PR checklist for connection management | John    | 2024-01-22 | P0       | ✅ Done        |
| Extend load tests to 30 minutes minimum       | QA Team | 2024-01-25 | P1       | 🔄 In Progress |
| Implement canary deployment (10% → 100%)      | DevOps  | 2024-02-01 | P1       | 📋 Planned     |
| Add connection leak detection to tests        | Jane    | 2024-01-27 | P1       | 🔄 In Progress |
| Review all DB connection usage patterns       | John    | 2024-02-05 | P2       | 📋 Planned     |
| Improve alert routing (faster escalation)     | DevOps  | 2024-02-10 | P2       | 📋 Planned     |

## Lessons Learned

1. **Code review checklists work:** Need specific items for common issues
2. **Load tests need realistic duration:** 5min insufficient for leaks
3. **Gradual rollouts catch issues:** 10% canary would have limited impact
4. **Monitoring gaps are dangerous:** Add metrics before you need them
5. **Runbooks save time:** Clear procedures enabled fast response

## Related Incidents

- [2023-11-20] Database CPU spike (similar connection pool issue)
- [2023-08-15] Memory leak in cache layer

## Prevention

To prevent similar incidents:

1. ✅ Add connection management to code review checklist
2. ✅ Monitor connection pool utilization
3. ✅ Extend load test duration
4. ✅ Implement canary deployments
5. ✅ Add automated connection leak detection

## Appendix

### Monitoring Graphs

[Insert graphs of connection pool, error rates, latency during incident]

### Communication Log

[Insert status page updates and customer communication]

### Code Fix

PR #1235: Fix connection leak in user profile endpoint
\`\`\`typescript
const client = await pool.connect();
try {
const user = await client.query('SELECT \* FROM users WHERE id = $1', [id]);
return user;
} finally {
client.release(); // ✅ Always release
}
\`\`\`

Postmortem Best Practices

# Blameless Postmortem Guidelines

## Do ✅

- Focus on systems and processes, not people
- Use timeline with exact timestamps
- Identify contributing factors, not just root cause
- Create actionable items with owners and dates
- Document what went well (positive reinforcement)
- Share widely for organizational learning

## Don't ❌

- Blame individuals or teams
- Hide or minimize the incident
- Skip the postmortem (even for small incidents)
- Create action items without owners
- Forget to follow up on action items
- Make it a blame session

## Template Sections

1. **Summary** (2-3 sentences)
2. **Impact** (numbers: users, revenue, duration)
3. **Timeline** (chronological events)
4. **Root Cause** (technical explanation)
5. **Contributing Factors** (broader context)
6. **What Went Well** (positive reinforcement)
7. **What Went Wrong** (improvement areas)
8. **Action Items** (concrete, owned, dated)
9. **Lessons Learned** (key takeaways)

Output Checklist

Timeline created
Root cause identified
Contributing factors documented
Action items with owners
Lessons learned captured
Postmortem meeting held
Document shared widely
Follow-up scheduled ENDFILE

postmortem-writer

Install Skill

SKILL.md

Postmortem Writer

Postmortem Template

Postmortem Best Practices

Output Checklist