operational-acceptance-documentation

@tachyon-beep/skillpacks

Use when preparing systems for production deployment or operational handover - covers security authorization (SSP/SAR/POA&M/ATO), operational readiness, test & evaluation, go-live approval, transition planning, acceptance criteria, and residual risk documentation for multi-stakeholder acceptance gates

SKILL.md

name: operational-acceptance-documentation
description: Security authorization (SSP/SAR/ATO), operational readiness, go-live approval

Operational Acceptance Documentation

Overview

Prepare complete acceptance packages for production deployment. Core principle: Document readiness, risks, and acceptance criteria for informed go-live decisions.

Key insight: Acceptance documentation enables stakeholders to make informed risk decisions about production deployment.

When to Use

Load this skill when:

  • Preparing systems for production launch
  • Seeking executive go-live approval
  • Completing operational handover
  • Obtaining government/defense system authorization

Signs you need this skill:

  • "How do I get approval to launch?"
  • Preparing production readiness checklist
  • Creating go-live approval package
  • Operational handover to support team

Don't use for:

  • Development/staging deployments
  • Internal-only tools (unless high-risk)

Production Readiness Checklist

Infrastructure Readiness

## Infrastructure Readiness

### Compute Resources
- [ ] Production servers provisioned (6x API servers, 2x database servers)
- [ ] Auto-scaling configured (scale 2-20 instances based on CPU >70%)
- [ ] Load balancer configured with health checks
- [ ] SSL/TLS certificates installed and valid (expires 2025-12-01)

### Storage
- [ ] Database provisioned (PostgreSQL 14, 500GB storage)
- [ ] Database backups configured (automated hourly backups, 30-day retention)
- [ ] Backup restoration tested (RTO: 1 hour, RPO: 1 hour)

### Network
- [ ] VPC configured with public/private subnets
- [ ] Firewall rules implemented (allow HTTPS 443, deny all other inbound)
- [ ] DNS configured (api.example.com → load balancer)

### Monitoring and Logging
- [ ] Application metrics instrumented (Prometheus)
- [ ] Logs centralized (CloudWatch Logs, 90-day retention)
- [ ] Dashboards created ([Grafana dashboard link])
- [ ] Alerts configured (error rate, latency, uptime)

### Security
- [ ] Secrets stored in secrets manager (not environment variables)
- [ ] TLS 1.3 enforced
- [ ] Authentication implemented (MFA for admins)
- [ ] Security scan completed (no HIGH/CRITICAL findings)

**Infrastructure Status**: ✅ READY (all criteria met)
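
A status line such as the one above is only trustworthy if it is derived from the checklist itself. The following is a minimal sketch, assuming the checklist is kept as Markdown task-list items (`- [ ]` / `- [x]`); the function name `checklist_status` and the sample text are illustrative and not part of the skill.

```python
import re

# Matches task-list items such as "- [ ] text" or "- [x] text".
TASK_RE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def checklist_status(markdown: str) -> tuple[str, list[str]]:
    """Return ("READY" or "NOT READY", list of unchecked items)."""
    total = 0
    open_items = []
    for line in markdown.splitlines():
        match = TASK_RE.match(line)
        if not match:
            continue  # headings, blank lines, status lines
        total += 1
        if match.group(1) == " ":
            open_items.append(match.group(2).strip())
    # An empty checklist is treated as NOT READY: nothing was verified.
    status = "READY" if total > 0 and not open_items else "NOT READY"
    return status, open_items


sample = """\
### Security
- [x] Secrets stored in secrets manager (not environment variables)
- [x] TLS 1.3 enforced
- [ ] Authentication implemented (MFA for admins)
"""

status, remaining = checklist_status(sample)
print(status, remaining)
```

Reporting the unchecked items alongside the status gives approvers the specific gaps rather than a bare pass/fail.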

Operational Readiness Checklist

## Operational Readiness

### Monitoring Coverage
- [ ] **Availability**: Uptime monitoring ([UptimeRobot link])
- [ ] **Performance**: Latency tracking (p50, p95, p99)
- [ ] **Errors**: Error rate monitoring (<0.1% threshold)
- [ ] **Business metrics**: User signups, API calls, revenue

**Success criteria**: All critical metrics have dashboards + alerts

### Alerting Configuration
- [ ] **P1 alerts** (PagerDuty): Service down, error rate >1%, security incident
- [ ] **P2 alerts** (Slack #ops): Elevated errors >0.5%, latency >500ms
- [ ] **P3 alerts** (Email): Performance degradation, capacity warnings

**Success criteria**: Alerts tested and verified to fire correctly
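
The P1 and P2 thresholds above can be expressed as a small classification function, which also makes them easy to test before go-live. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the checklist does not say which latency percentile the 500ms threshold applies to, and P3 is omitted because the checklist gives no numeric threshold for it.

```python
from typing import Optional

# Notification channels as listed in the alerting checklist.
ROUTES = {"P1": "PagerDuty", "P2": "Slack #ops", "P3": "Email"}


def alert_priority(service_up: bool, error_rate_pct: float,
                   latency_ms: float,
                   security_incident: bool = False) -> Optional[str]:
    """Map current health signals to an alert priority, or None if healthy."""
    # P1: service down, error rate >1%, or a security incident.
    if not service_up or security_incident or error_rate_pct > 1.0:
        return "P1"
    # P2: elevated errors >0.5% or latency >500ms.
    if error_rate_pct > 0.5 or latency_ms > 500:
        return "P2"
    return None


priority = alert_priority(service_up=True, error_rate_pct=0.7, latency_ms=120)
print(priority, ROUTES.get(priority))
```

Checking P1 conditions first ensures that a signal which satisfies both tiers is always paged at the higher priority.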

### Backup and Recovery
- [ ] **Backup procedure**: Automated hourly PostgreSQL dumps to S3
- [ ] **Backup testing**: Restored from backup on 2024-03-10 (successful)
- [ ] **Recovery time**: 1-hour RTO verified
- [ ] **Recovery point**: 1-hour RPO (maximum acceptable data loss)

**Success criteria**: Restore from backup completes within RTO
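
To make the RTO and RPO checks concrete: RPO bounds the time between the last good backup and the failure (data lost), while RTO bounds the time between the failure and the completed restore (downtime). The sketch below assumes the 1-hour objectives from the checklist; the function name and the drill timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=1)  # recovery time objective from the checklist
RPO = timedelta(hours=1)  # recovery point objective from the checklist


def check_restore(last_backup_at: datetime, failure_at: datetime,
                  restored_at: datetime) -> dict:
    """Compare one restore drill against the RTO and RPO."""
    data_loss = failure_at - last_backup_at  # work lost since last backup
    downtime = restored_at - failure_at      # time taken to recover
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= RPO,
        "rto_met": downtime <= RTO,
    }


# Hypothetical drill timeline (illustrative timestamps only).
drill = check_restore(
    last_backup_at=datetime(2024, 3, 10, 9, 0),
    failure_at=datetime(2024, 3, 10, 9, 40),
    restored_at=datetime(2024, 3, 10, 10, 25),
)
print(drill["rpo_met"], drill["rto_met"])
```

With hourly backups the worst-case data loss is just under one hour, which is why the backup interval and the RPO need to be set together.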

### Runbooks and Documentation
- [ ] **Incident response runbooks**: Database outage, API errors, security incidents
- [ ] **Operational procedures**: Deployment, rollback, scaling
- [ ] **Architecture documentation**: System diagram, data flows, integrations
- [ ] **API documentation**: Endpoint reference, authentication guide

**Success criteria**: On-call engineer can respond to P1 incident using runbooks alone

**Operational Status**: ✅ READY (all criteria met)

Test and Evaluation Documentation

Test Summary Report

# Test Summary Report: Customer Portal Launch

## Test Objectives
1. Verify functional requirements (user registration, login, profile management)
2. Validate performance requirements (p95 latency <200ms, support 1000 concurrent users)
3. Confirm security requirements (authentication, authorization, data encryption)

## Test Methodology

### Functional Testing
- **Unit tests**: 487 tests, 100% pass rate
- **Integration tests**: 156 tests, 100% pass rate
- **End-to-end tests**: 45 scenarios, 44 passed, 1 defect (LOW severity, workaround available)

### Performance Testing
- **Load test**: 1000 concurrent users, 10,000 requests/min
  - p50 latency: 45ms ✅
  - p95 latency: 180ms ✅ (target: <200ms)
  - p99 latency: 350ms ✅ (target: <500ms)
  - Error rate: 0.02% ✅ (target: <0.1%)
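
For readers who need to reproduce figures like these from raw load-test samples, the following is a minimal sketch using the nearest-rank percentile definition. Load-testing tools may use a different interpolation method, so treat this as an assumption; the function names and sample data are illustrative, and only the p95 and p99 targets come from the report.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Targets from the report, in milliseconds (p50 has no stated target).
TARGETS_MS = {95: 200, 99: 500}


def evaluate_latency(samples: list[float]) -> dict:
    """Return {"p95": (value, met), "p99": (value, met)}."""
    results = {}
    for pct, target in TARGETS_MS.items():
        value = percentile(samples, pct)
        results[f"p{pct}"] = (value, value < target)
    return results


# Illustrative samples only: 90 fast, 8 moderate, 2 slow requests.
latencies = [45] * 90 + [180] * 8 + [350] * 2
print(evaluate_latency(latencies))
```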

### Security Testing
- **Vulnerability scan**: Nessus scan completed 2024-03-15
  - Critical: 0 ✅
  - High: 0 ✅
  - Medium: 3 (remediated)
  - Low: 8 (accepted risk)
- **Penetration test**: External pentest completed 2024-03-18
  - HIGH findings: 1 (SQL injection, fixed on 2024-03-19)
  - MEDIUM findings: 2 (remediated)

## Defect Summary

| Defect ID | Severity | Description | Status | Disposition |
|-----------|----------|-------------|--------|-------------|
| DEF-001 | LOW | Profile image upload fails for files >10MB | Open | Workaround: Resize before upload (documented) |
| DEF-002 | MEDIUM | Password reset email delayed 5-10 minutes | Fixed | Fixed on 2024-03-20 |
| DEF-003 | HIGH | SQL injection in /api/users | Fixed | Fixed on 2024-03-19, re-tested |

## Test Completion Criteria

- [ ] ✅ All HIGH/CRITICAL defects fixed
- [ ] ✅ All MEDIUM defects fixed or have workarounds
- [ ] ✅ LOW defects documented (1 open, workaround available)
- [ ] ✅ Performance requirements met
- [ ] ✅ Security requirements met (no open HIGH/CRITICAL findings)

**Test Status**: ✅ PASSED (all criteria met, 1 LOW defect acceptable)
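
The completion criteria above can be checked mechanically against the defect table. This is a sketch only: the `Defect` class and `completion_blockers` function are hypothetical names, and the rule encoded is the one stated in the criteria (no open HIGH/CRITICAL defects; open MEDIUM defects need a workaround; LOW defects are documented rather than blocking).

```python
from dataclasses import dataclass


@dataclass
class Defect:
    defect_id: str
    severity: str            # "LOW", "MEDIUM", "HIGH", or "CRITICAL"
    status: str              # "Open" or "Fixed"
    has_workaround: bool = False


def completion_blockers(defects: list[Defect]) -> list[str]:
    """Return the IDs of defects that block test completion."""
    blockers = []
    for d in defects:
        if d.status == "Fixed":
            continue
        if d.severity in ("HIGH", "CRITICAL"):
            blockers.append(d.defect_id)  # must be fixed before go-live
        elif d.severity == "MEDIUM" and not d.has_workaround:
            blockers.append(d.defect_id)  # needs a fix or a workaround
    return blockers


# The three defects from the Defect Summary table.
defects = [
    Defect("DEF-001", "LOW", "Open", has_workaround=True),
    Defect("DEF-002", "MEDIUM", "Fixed"),
    Defect("DEF-003", "HIGH", "Fixed"),
]
blockers = completion_blockers(defects)
print("PASSED" if not blockers else f"BLOCKED by {blockers}")
```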

Go-Live Approval Package

Executive Summary

# Go-Live Approval Request: Customer Portal

## System Overview
**System Name**: Customer Portal
**Purpose**: Enable customers to self-serve account management, reducing support tickets by 40%
**Business Value**: $2M annual revenue enabler (enterprise customers require self-service portal)
**Target Launch**: 2024-04-01

## Readiness Status

### Infrastructure: ✅ READY
- All production servers provisioned and tested
- Auto-scaling configured
- Backups automated and tested (1-hour RTO/RPO)

### Operations: ✅ READY
- Monitoring and alerting configured
- Runbooks complete
- On-call rotation staffed (3 SREs, 2 backend engineers)

### Testing: ✅ PASSED
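- Functional tests: unit and integration 100% pass; end-to-end 44 of 45 scenarios passed (1 LOW defect with workaround)
- Performance tests: p95 latency 180ms (target: <200ms)
- Security tests: 0 open HIGH/CRITICAL findings (1 HIGH pentest finding fixed and re-tested)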

### Security Authorization: ✅ AUTHORIZED
- ATO granted on 2024-03-25 (valid for 3 years)
- POA&M with 2 LOW-risk items (tracked, non-blocking)

## Residual Risks

### Risk 1: Performance Degradation Above 1000 Users (MEDIUM)
**Description**: Load testing validated 1000 concurrent users. Performance above 1000 users unknown.
**Mitigation**:
- Auto-scaling configured to add capacity at 70% CPU
- Gradual rollout plan (100 users week 1, 500 week 2, all users week 4)
- Performance monitoring with alerts at 800ms latency threshold
**Accepted by**: CTO on 2024-03-28

### Risk 2: Profile Image Upload Limitation (LOW)
**Description**: Images >10MB fail to upload (DEF-001)
**Mitigation**:
- Workaround documented in user help center
- Fix planned for v1.1 release (2024-05-01)
**Accepted by**: Product Manager on 2024-03-28

## Launch Criteria

### Success Metrics
**Immediate (Week 1)**:
- Uptime: >99% (steady-state target: 99.9%)
- Error rate: <0.5% (steady-state target: <0.1%)
- p95 latency: <300ms (steady-state target: <200ms)

**Medium-term (Month 1)**:
- User adoption: 30% of customers use portal
- Support ticket reduction: 20% decrease

### Abort Criteria
**Immediate rollback if**:
- Uptime drops below 95% for >1 hour
- Error rate exceeds 5%
- Data breach or security incident
- Critical functionality broken for >50% of users
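
Abort criteria are most useful when they can be evaluated without debate during an incident. The sketch below encodes the four criteria above as explicit checks; the `LaunchHealth` fields and the function name are hypothetical, and how each input is measured is left to the monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class LaunchHealth:
    minutes_uptime_below_95: int        # consecutive minutes below 95% uptime
    error_rate_pct: float               # current error rate, in percent
    security_incident: bool             # data breach or security incident
    users_critical_broken_pct: float    # % of users with broken critical functionality


def abort_reasons(h: LaunchHealth) -> list[str]:
    """Return every abort criterion currently met (empty list: continue)."""
    reasons = []
    if h.minutes_uptime_below_95 > 60:
        reasons.append("uptime below 95% for >1 hour")
    if h.error_rate_pct > 5:
        reasons.append("error rate exceeds 5%")
    if h.security_incident:
        reasons.append("data breach or security incident")
    if h.users_critical_broken_pct > 50:
        reasons.append("critical functionality broken for >50% of users")
    return reasons


health = LaunchHealth(minutes_uptime_below_95=0, error_rate_pct=6.2,
                      security_incident=False, users_critical_broken_pct=10)
print(abort_reasons(health))
```

Returning all matching reasons, rather than stopping at the first, gives the rollback communication a complete picture of what triggered it.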

### Monitoring Plan
- **Real-time**: Grafana dashboard monitored by on-call
- **Daily**: Morning standup reviews previous 24 hours
- **Weekly**: Executive summary report (metrics vs targets)

## Rollback Plan

**Trigger**: Any abort criterion met

**Rollback Procedure** (30 minutes):
1. Enable maintenance page
2. Scale production deployment to 0 replicas
3. Restore database from pre-launch backup (if data changes occurred)
4. Re-enable previous customer support workflow
5. Communicate to customers via email

**Testing**: Rollback procedure tested in staging on 2024-03-27 (successful, 25-minute duration)

## Recommendation

**Status**: ✅ RECOMMENDED FOR LAUNCH (pending executive approval)

All readiness criteria met. Residual risks identified and accepted by stakeholders. Launch criteria defined with clear success metrics and abort criteria. Rollback plan tested and ready.

**Requested Approval**: Executive Go-Live Approval

**Approvals Required**:
- [ ] VP Engineering (technical readiness)
- [ ] CTO (security and risk acceptance)
- [ ] VP Product (business value and user experience)
- [ ] CFO (budget and revenue impact)

Operational Handover Checklist

# Operational Handover: Customer Portal

## Knowledge Transfer

### Documentation Delivered
- [ ] ✅ Architecture documentation (`/docs/architecture.md`)
- [ ] ✅ API reference (`/docs/api-reference.md`)
- [ ] ✅ Runbooks (`/runbooks/` - 12 runbooks)
- [ ] ✅ Deployment procedures (`/docs/deployment.md`)
- [ ] ✅ Troubleshooting guide (`/docs/troubleshooting.md`)

### Training Completed
- [ ] ✅ On-call training (2024-03-20): 3 SREs, 2 backend engineers
- [ ] ✅ Runbook walkthrough (2024-03-22): All on-call staff
- [ ] ✅ Incident response drill (2024-03-25): Simulated database outage, responded successfully

### Handoff Meeting
- **Date**: 2024-03-28
- **Attendees**: Development team (6), Operations team (5), Product (2)
- **Agenda**:
  1. System overview and architecture
  2. Common issues and troubleshooting
  3. Escalation paths and contact information
  4. Q&A session
- **Outcome**: ✅ Operations team confident in supporting system

## Support Model

### On-Call Rotation
**Primary On-Call**: Rotating weekly schedule (3 SREs)
**Backup On-Call**: Backend engineer (2-person rotation)

**Schedule**: https://pagerduty.example.com/schedules/customer-portal

### Escalation Paths

**P1 (Critical)**:
1. Primary on-call (page immediately)
2. If no response in 5 min → Backup on-call
3. If no resolution in 30 min → Incident commander
4. If ongoing after 1 hour → VP Engineering

**P2 (High)**:
1. Primary on-call (page)
2. If no response in 15 min → Backup on-call
3. If no resolution in 4 hours → Team lead

**Contacts**:
- Primary on-call: [PagerDuty link]
- Incident commander: John Doe (+1-555-0100)
- Team lead: Jane Smith (+1-555-0200)
- VP Engineering: Bob Johnson (+1-555-0300)
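
The escalation paths above can be captured as data so that the same rules drive both documentation and tooling. This is a sketch under assumptions: the structure and function name are hypothetical, the incident is assumed to be unresolved when the function is called, and thresholds are treated as inclusive (for example, escalation to the backup happens at exactly 5 minutes without acknowledgement).

```python
# Timings in minutes, taken from the P1 and P2 escalation paths.
ESCALATION = {
    "P1": {
        "ack_timeout": 5,
        "resolution_steps": [(30, "Incident commander"),
                             (60, "VP Engineering")],
    },
    "P2": {
        "ack_timeout": 15,
        "resolution_steps": [(240, "Team lead")],
    },
}


def who_to_engage(priority: str, minutes_elapsed: int,
                  acknowledged: bool) -> list[str]:
    """List the roles that should be engaged for an unresolved incident."""
    policy = ESCALATION[priority]
    engaged = ["Primary on-call"]  # always paged first
    if not acknowledged and minutes_elapsed >= policy["ack_timeout"]:
        engaged.append("Backup on-call")
    for threshold, role in policy["resolution_steps"]:
        if minutes_elapsed >= threshold:
            engaged.append(role)
    return engaged


print(who_to_engage("P1", minutes_elapsed=45, acknowledged=True))
```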

### SLA Commitments
**Uptime**: 99.9% (measured monthly)
**Performance**: p95 latency <200ms
**Support Response**:
- P1: 15-minute response, 4-hour resolution target
- P2: 2-hour response, 1-day resolution target
- P3: Next business day response
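
It helps to translate the uptime percentage into a concrete downtime budget before committing to it. As a worked example, a 99.9% monthly SLA over a 30-day month allows 30 × 24 × 60 × 0.001 ≈ 43.2 minutes of downtime. The function names below are illustrative, and the 30-day month is an assumption since the SLA only says "measured monthly".

```python
def downtime_budget_minutes(sla_pct: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime SLA over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_pct) / 100.0


def sla_met(downtime_minutes: float, sla_pct: float = 99.9,
            days: int = 30) -> bool:
    """True if observed downtime stays within the SLA budget."""
    return downtime_minutes <= downtime_budget_minutes(sla_pct, days)


budget = downtime_budget_minutes(99.9)
print(f"{budget:.1f} minutes")  # roughly 43.2 for a 30-day month
print(sla_met(downtime_minutes=25))
```

Note that a single P1 incident resolved at the 4-hour resolution target would exceed this monthly budget, so the resolution target and the uptime commitment should be reviewed together.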

## Acceptance Criteria Met

- [ ] ✅ All documentation delivered and reviewed
- [ ] ✅ Operations team trained
- [ ] ✅ Incident response drill successful
- [ ] ✅ On-call rotation staffed
- [ ] ✅ Escalation paths defined
- [ ] ✅ SLA commitments documented

**Handover Status**: ✅ COMPLETE

**Signed Off**:
- Development Team Lead: John Doe (2024-03-28)
- Operations Team Lead: Jane Smith (2024-03-28)

Cross-References

Use WITH this skill:

  • ordis/security-architect/security-authorization-and-accreditation - For government/defense ATO requirements
  • muna/technical-writer/itil-and-governance-documentation - For RFC and service documentation

Real-World Impact

Systems using operational acceptance documentation:

  • Customer Portal Launch: The go-live approval package enabled same-day executive approval (vs. a 1-week review cycle). Clear risk acceptance and a tested rollback plan gave executives the confidence to approve.
  • Government System: A complete readiness checklist (infrastructure, operations, testing, security authorization) passed IRAP assessment on the first attempt. Assessor: "Most comprehensive readiness documentation in 5 years".
  • Operational Handover: Training, runbooks, and an incident drill enabled a junior SRE to respond to a P1 database outage successfully within 45 minutes (first week post-handover).

Key lesson: Comprehensive acceptance documentation (readiness, risks, criteria, handover) enables informed go-live decisions and smooth operational transitions.