| name | cto-risk-resilience |
| description | Expert methodology for identifying, assessing, and mitigating technical and operational risks including security, incidents, compliance, and disaster recovery. |
CTO Risk & Resilience Skill
Purpose
This skill provides a comprehensive framework for managing technical risk and building resilient systems. Use it to conduct risk assessments, plan incident response, achieve compliance certifications, and ensure business continuity.
When to Use
Trigger this skill when you need to:
- Conduct security risk assessments
- Plan for compliance certifications (SOC 2, ISO 27001, GDPR, HIPAA)
- Design incident response processes and runbooks
- Create disaster recovery and business continuity plans
- Assess vendor and third-party risks
- Define SLAs, SLOs, and SLIs
- Prepare for security audits or penetration testing
- Implement chaos engineering and resilience testing
- Respond to security incidents or breaches
Core Methodology
Follow this systematic approach to risk and resilience:
Phase 1: Risk Identification
Categorize Risks
Technical Risks:
- System failures and outages
- Data loss or corruption
- Performance degradation
- Scalability limitations
- Technical debt accumulation
Security Risks:
- Unauthorized access
- Data breaches
- DDoS attacks
- Malware and ransomware
- Supply chain vulnerabilities
Operational Risks:
- Human error
- Process failures
- Knowledge gaps (bus factor)
- Inadequate monitoring
- Poor change management
Compliance Risks:
- Regulatory violations (GDPR, CCPA, HIPAA)
- Certification failures (SOC 2, ISO 27001)
- Contract breaches (SLA violations)
- Audit findings
Business Risks:
- Vendor dependencies
- Technology obsolescence
- Team attrition
- Budget constraints
Conduct Risk Assessment
Use references/frameworks/risk-assessment-matrix.md to systematically identify and score risks. For each risk:
- Describe the risk scenario
- Assess probability (High/Medium/Low)
- Assess impact (Critical/High/Medium/Low)
- Calculate risk score
- Identify existing controls
- Define mitigation strategies
Phase 2: Risk Prioritization
Create Risk Matrix
| Impact \ Probability | Low (1) | Medium (3) | High (5) |
|---|---|---|---|
| High (5) | Medium Priority | High Priority | CRITICAL |
| Medium (3) | Low Priority | Medium Priority | High Priority |
| Low (1) | Low Priority | Low Priority | Low Priority |
Priority Levels
Critical (Priority 1): Address immediately
- High probability + High impact
- Example: Known security vulnerability in production
High (Priority 2): Address within 30 days
- Medium probability + High impact, OR High probability + Medium impact
- Example: Single point of failure in critical system
Medium (Priority 3): Address within 90 days
- Low probability + High impact, OR Medium probability + Medium impact
- Example: Lack of disaster recovery testing
Low (Priority 4): Monitor and address opportunistically
- Any low impact scenarios
- Example: Minor technical debt in non-critical systems
Use references/templates/risk-register.md to maintain an ongoing risk inventory; a minimal scoring sketch follows.
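As a rough illustration, here is a minimal Python sketch of how the matrix and priority bands above could be turned into an executable scoring helper. The 1/3/5 scale, field names, and thresholds are illustrative assumptions, not part of the bundled framework.

```python
from dataclasses import dataclass

SCALE = {"Low": 1, "Medium": 3, "High": 5}

@dataclass
class Risk:
    name: str
    probability: str  # "Low" | "Medium" | "High"
    impact: str       # "Low" | "Medium" | "High"

    @property
    def score(self) -> int:
        # Probability x impact on the 1/3/5 scale used in the matrix above.
        return SCALE[self.probability] * SCALE[self.impact]

def priority(risk: Risk) -> str:
    # Mirrors the priority bands defined above.
    if risk.probability == "High" and risk.impact == "High":
        return "P1 - Critical"
    if risk.score >= 15:   # Medium prob + High impact, or High prob + Medium impact
        return "P2 - High"
    if risk.impact != "Low" and risk.score >= 5:
        return "P3 - Medium"
    return "P4 - Low"      # any low-impact scenario

register = [
    Risk("Known vulnerability in production", "High", "High"),
    Risk("Single point of failure in billing", "Medium", "High"),
    Risk("DR procedures never tested", "Low", "High"),
    Risk("Minor tech debt in internal tool", "Medium", "Low"),
]
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{r.score:>2}  {priority(r):<13}  {r.name}")
```

Sorting by raw score keeps the register ordered even within a priority band, which helps when several risks share the same priority.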
Phase 3: Security & Compliance
Security Framework
Implement security controls across multiple layers:
Preventive Controls (Stop threats before they happen)
- Access control (MFA, RBAC, least privilege)
- Network security (firewalls, VPNs, segmentation)
- Encryption (at rest and in transit)
- Secure coding practices
- Security awareness training
Detective Controls (Identify threats quickly)
- Logging and monitoring
- Intrusion detection systems (IDS)
- Security information and event management (SIEM)
- Vulnerability scanning
- Penetration testing
Responsive Controls (React to incidents)
- Incident response plan
- Automated threat response
- Backup and recovery procedures
- Communication protocols
Use references/frameworks/security-controls-framework.md for a comprehensive checklist.
Compliance Roadmaps
SOC 2 Type II Certification
Timeline: 12-18 months (6-9 months of preparation + a 3-6 month evidence-collection/observation period + ~3 months for the audit and report)
Use references/templates/soc2-roadmap.md for a detailed plan:
Phases:
- Gap assessment (Month 1-2)
- Control implementation (Month 3-9)
- Evidence collection period (Month 10-15)
- Audit (Month 16-18)
Key Areas:
- Security: Access controls, encryption, monitoring
- Availability: Uptime, incident response, disaster recovery
- Confidentiality: Data protection, privacy controls
- Processing Integrity: Data accuracy, error handling
- Privacy: GDPR/CCPA compliance (if applicable)
ISO 27001 Certification
Timeline: 12-24 months
Use references/templates/iso27001-roadmap.md:
Phases:
- Gap analysis (Month 1-3)
- ISMS implementation (Month 4-15)
- Internal audit (Month 16-18)
- Certification audit (Month 19-24)
Key Requirements:
- 114 controls across 14 domains (Annex A of ISO 27001:2013; the 2022 revision consolidates these into 93 controls)
- Risk assessment methodology
- Information security policies
- Employee training and awareness
- Continuous improvement process
GDPR / CCPA Compliance
Timeline: 6-12 months
Use references/templates/data-privacy-compliance.md:
Key Areas:
- Data inventory and mapping
- Consent management
- Data subject rights (access, deletion, portability)
- Data processing agreements
- Privacy policy and notices
- Breach notification procedures
Phase 4: Incident Response
Create structured incident response capability:
Incident Response Framework
1. Preparation
- Define incident severity levels
- Create on-call rotation
- Develop runbooks for common scenarios
- Set up communication channels
- Train team on procedures
2. Detection
- Monitoring and alerting
- User reports
- Security scanning
- Automated anomaly detection
3. Triage
- Assess severity (P0/P1/P2/P3)
- Assign incident commander
- Assemble response team
- Begin communication
4. Investigation
- Gather data and logs
- Identify root cause
- Assess blast radius
- Document timeline
5. Containment
- Stop the bleeding
- Isolate affected systems
- Prevent further damage
- Implement temporary fixes
6. Resolution
- Deploy permanent fix
- Verify resolution
- Monitor for recurrence
- Update systems
7. Post-Mortem
- Blameless retrospective
- Document what happened
- Identify improvements
- Create action items
Use references/templates/incident-response-playbook.md for detailed procedures.
Incident Severity Levels
| Level | Definition | Response Time | Escalation |
|---|---|---|---|
| P0 - Critical | Complete service outage, data breach, security incident | Immediate | All-hands, exec team notified |
| P1 - High | Major feature broken, significant degradation | <15 minutes | On-call team, manager notified |
| P2 - Medium | Partial functionality impaired, workaround exists | <2 hours | On-call team |
| P3 - Low | Minor issue, minimal customer impact | Next business day | Normal ticket queue |
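If the severity table is going to drive tooling (paging, dashboards, overdue alerts), one option is to encode it as data. A minimal sketch, assuming hypothetical notification targets and the response times from the table above:

```python
from datetime import timedelta

# Hypothetical encoding of the severity table above; "notify" values are placeholders.
SEVERITY_POLICY = {
    "P0": {"respond_within": timedelta(seconds=0),  "notify": ["on-call", "manager", "exec-team"]},
    "P1": {"respond_within": timedelta(minutes=15), "notify": ["on-call", "manager"]},
    "P2": {"respond_within": timedelta(hours=2),    "notify": ["on-call"]},
    "P3": {"respond_within": timedelta(days=1),     "notify": ["ticket-queue"]},
}

def response_overdue(severity: str, minutes_since_detection: float) -> bool:
    """True if the incident has waited longer than its target response time."""
    return timedelta(minutes=minutes_since_detection) > SEVERITY_POLICY[severity]["respond_within"]

print(response_overdue("P1", 20))  # True: a P1 should be acknowledged within 15 minutes
```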
On-Call Best Practices
Structure:
- Primary and secondary on-call rotation
- 1-week rotations (avoid burnout)
- Compensate with time off or pay
- No more than 2-3 incidents per week on average
Tools:
- PagerDuty, Opsgenie, or similar
- Automated escalation
- Mobile app for notifications
- Integration with monitoring systems
Health Metrics:
- On-call incidents per week
- Time spent on-call
- Sleep disruption frequency
- On-call satisfaction score
Use references/frameworks/on-call-framework.md for detailed guidance.
Phase 5: Business Continuity
Ensure critical business functions can continue during disruptions:
Disaster Recovery Planning
Recovery Objectives:
RTO (Recovery Time Objective): How long can we be down?
- Critical systems: 1 hour
- Important systems: 4 hours
- Standard systems: 24 hours
RPO (Recovery Point Objective): How much data can we lose?
- Financial data: 0 (real-time replication)
- Customer data: 15 minutes (frequent backups)
- Analytics data: 24 hours (daily backups)
Disaster Scenarios:
- Single server failure
- Availability zone outage
- Regional outage
- Complete cloud provider outage
- Ransomware attack
- Accidental data deletion
- Key personnel unavailable
For each scenario:
- Detection method
- Recovery procedure
- Responsible team
- Expected RTO/RPO
- Testing frequency
Use references/templates/disaster-recovery-plan.md for comprehensive planning; a small RPO-check sketch follows.
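As a small worked example, the RPO targets listed above can be checked mechanically against backup timestamps. A sketch, with dataset names and targets assumed for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumed RPO targets mirroring the examples above; dataset names are illustrative.
RPO_TARGETS = {
    "financial": timedelta(seconds=0),   # real-time replication
    "customer":  timedelta(minutes=15),  # frequent backups
    "analytics": timedelta(hours=24),    # daily backups
}

def rpo_breaches(last_backup_at: dict) -> list:
    """Return datasets whose most recent backup is older than their RPO target."""
    now = datetime.now(timezone.utc)
    return [name for name, target in RPO_TARGETS.items()
            if now - last_backup_at[name] > target]

now = datetime.now(timezone.utc)
print(rpo_breaches({
    "financial": now,
    "customer":  now - timedelta(minutes=5),
    "analytics": now - timedelta(hours=30),   # stale: breaches the 24-hour RPO
}))  # ['analytics']
```

Running a check like this on a schedule turns RPO from a document into an alert.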
Business Continuity Testing
Regular Drills:
- Quarterly: Tabletop exercises (discuss scenarios)
- Semi-annually (twice a year): Simulated incidents (practice procedures)
- Annually: Full disaster recovery test (actual failover)
Game Days:
- Chaos engineering exercises
- Intentional failure injection
- Test recovery procedures
- Identify weaknesses
Documentation:
- Keep runbooks updated
- Document lessons learned
- Update procedures based on findings
- Share knowledge across team
Phase 6: Resilience Engineering
Build systems that gracefully handle failures:
Resilience Patterns
1. Circuit Breakers
- Detect failing services
- Prevent cascade failures
- Automatic recovery attempts
- Fallback behavior
2. Retry with Exponential Backoff
- Handle transient failures
- Avoid overwhelming systems
- Progressive delay between attempts
- Maximum retry limits
3. Timeout and Bulkheads
- Prevent resource exhaustion
- Isolate failures
- Limit blast radius
- Protect critical paths
4. Graceful Degradation
- Continue with reduced functionality
- Non-critical features can fail
- User experience maintained
- Clear communication to users
5. Rate Limiting and Load Shedding
- Prevent overload
- Protect system stability
- Prioritize critical requests
- Fair resource allocation
Use references/frameworks/resilience-patterns.md for implementation guidance; a minimal sketch of patterns 1 and 2 follows.
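For concreteness, here is an illustrative sketch of patterns 1 and 2 above (a minimal circuit breaker and retry with exponential backoff plus jitter). Names and thresholds are assumptions for this sketch, not a specific library's API.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Call fn(), retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds

class CircuitBreaker:
    """Fails fast after repeated failures; half-opens after a cooldown to probe recovery."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

In practice the two compose: wrap the remote call in the circuit breaker and put the retry around it, so retries against an open circuit fail fast instead of hammering a struggling dependency.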
Service Level Objectives (SLOs)
Define reliability targets:
SLI (Service Level Indicator): What we measure
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second handled
- Error rate: % of failed requests
SLO (Service Level Objective): Our target
- Availability: 99.9% (43 minutes downtime/month allowed)
- Latency: 95% of requests < 200ms
- Error rate: < 0.1%
SLA (Service Level Agreement): Promise to customers
- Usually more lenient than internal SLO
- Has financial consequences if missed
- Example: 99.5% uptime guarantee with credits
Error Budget:
- Amount of allowed unreliability
- 99.9% SLO = 0.1% error budget = 43 minutes/month
- Can "spend" budget on releases, changes, experiments
- When exhausted, focus shifts to reliability
Use references/templates/slo-definition.md for the framework; a short error-budget calculation follows.
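The error-budget arithmetic above is simple enough to wire into reporting. A short sketch, assuming a 30-day window and an availability SLO expressed as a fraction:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO (e.g. 0.999)."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(f"{error_budget_minutes(0.999):.1f} min/month")            # 43.2
print(f"{budget_remaining(0.999, 30):.0%} of budget remaining")   # 31% after a 30-minute outage
```

When the remaining fraction approaches zero, the policy above applies: pause risky releases and spend the time on reliability work.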
Key Principles
- Prevention Over Reaction: Build security and resilience from the start
- Defense in Depth: Multiple layers of security controls
- Assume Breach: Plan for when (not if) defenses fail
- Blameless Culture: Learn from incidents without blame
- Continuous Improvement: Regular testing and refinement
- Clear Communication: Transparent about risks and incidents
Bundled Resources
Frameworks (references/frameworks/):
- risk-assessment-matrix.md - Systematic risk identification and scoring
- security-controls-framework.md - Comprehensive security checklist
- on-call-framework.md - Sustainable on-call practices
- resilience-patterns.md - Architecture patterns for resilience
- chaos-engineering.md - Controlled failure testing
Templates (references/templates/):
- risk-register.md - Ongoing risk tracking
- incident-response-playbook.md - Step-by-step incident procedures
- soc2-roadmap.md - SOC 2 certification plan
- iso27001-roadmap.md - ISO 27001 certification plan
- data-privacy-compliance.md - GDPR/CCPA compliance guide
- disaster-recovery-plan.md - DR procedures and testing
- slo-definition.md - Service level objective framework
- security-audit-checklist.md - Pre-audit preparation
- post-mortem-template.md - Incident analysis format
Examples (references/examples/):
- Real post-mortems from major incidents (anonymized)
- Security audit results and remediation
- Compliance certification timelines
- DR testing scenarios and results
Usage Patterns
Example 1: User says "We need to get SOC 2 certified for enterprise sales"
→ Load references/templates/soc2-roadmap.md
→ Conduct gap assessment against SOC 2 requirements
→ Create 12-18 month roadmap with phases
→ Identify control implementations needed
→ Estimate costs (audit fees, tools, consulting)
→ Assign ownership and timeline
→ Provide monthly checklist for evidence collection
Example 2: User says "Create incident response process for my 30-person team"
→ Load references/templates/incident-response-playbook.md
→ Define severity levels (P0-P3) with examples
→ Design on-call rotation structure
→ Create runbooks for common scenarios
→ Set up communication channels (Slack, status page)
→ Define escalation paths
→ Schedule incident response training
Example 3: User says "Conduct security risk assessment for Series B due diligence"
→ Load references/frameworks/risk-assessment-matrix.md
→ Inventory all systems and data
→ Identify risks across security, compliance, operational
→ Score by probability and impact
→ Document existing controls
→ Create risk mitigation roadmap
→ Prepare executive summary for investors
Example 4: User says "We had a major outage, help with post-mortem"
→ Load references/templates/post-mortem-template.md
→ Document incident timeline
→ Identify root cause(s)
→ Analyze what went well and poorly
→ Create blameless narrative
→ Generate action items with owners
→ Share with team and stakeholders
→ Track action item completion
Risk Management by Company Stage
Early Stage Startup (Pre-PMF)
Focus: Security basics, avoid catastrophic risks
Priorities:
- Basic security (encryption, access control)
- Data backup and recovery
- Privacy compliance basics
- Simple incident response
Avoid: Over-investing in compliance certifications too early
Growth Stage (Post-PMF, Scaling)
Focus: Scalability, reliability, security hardening
Priorities:
- SOC 2 preparation (if selling B2B)
- Comprehensive monitoring and alerting
- Incident response process
- On-call rotation structure
- DR planning and testing
Investment: 10-15% of engineering time on resilience
Scale Stage (Enterprise)
Focus: Compliance, resilience, enterprise security
Priorities:
- Multiple compliance certifications (SOC 2, ISO 27001)
- Advanced security (SIEM, threat detection)
- Chaos engineering and resilience testing
- Comprehensive BC/DR
- Security team and CISO
Investment: 20-25% of engineering time on reliability/security
Warning Signs
| Indicator | Risk | Action |
|---|---|---|
| No monitoring on production | High - can't detect issues | Immediate: Implement basic monitoring |
| No backup/DR tested in 6+ months | High - recovery may fail | Test DR procedures this quarter |
| Single person knows critical system | High - bus factor = 1 | Document and cross-train immediately |
| Increasing incident frequency | Medium-High - system degrading | Root cause analysis, resilience improvements |
| Unremediated security scan findings | High - vulnerable to attack | Remediate critical/high findings within 30 days |
| Compliance deadline in <6 months | High - may not certify in time | Accelerate roadmap, consider consultant |
| On-call team burned out | Medium - quality and retention risk | Reduce incident load, improve tooling |
Communication Templates
For Board/Investors
Security & Risk Update
Status: 🟢 Secure and compliant
Key Metrics:
- System uptime: 99.87% (target: 99.9%)
- Security incidents: 0 critical, 2 low (resolved)
- Compliance: SOC 2 on track for Q3 certification
Top Risks & Mitigations:
- Single cloud provider dependency → Implementing multi-region DR
- Growing on-call burden → Hiring SRE, improving automation
- Compliance timeline tight → Weekly checkpoint, external consultant engaged
Investment Request: $150K for penetration testing and SOC 2 audit
For Engineering Team
Incident Review - Service Outage Feb 15
What Happened: Database connection pool exhaustion caused 45-minute outage
Timeline:
- 2:15 PM: Increased load from marketing campaign
- 2:22 PM: First alerts fired
- 2:25 PM: Team paged, investigation started
- 2:40 PM: Root cause identified
- 3:00 PM: Service restored
What Went Well:
- ✅ Alerts fired within 7 minutes
- ✅ Team assembled quickly
- ✅ Clear communication to customers
What We'll Improve:
- ⚠️ Auto-scaling for connection pool
- ⚠️ Load testing before campaigns
- ⚠️ Better runbook documentation
Action Items: [See detailed list]
No blame - systems fail, we learn and improve.
Writing Style
All outputs should be:
- Risk-Aware: Identify and communicate risks clearly
- Action-Oriented: Focus on concrete mitigation steps
- Balanced: Realistic about risk vs. cost/effort
- Empathetic: Blameless culture, learning mindset
- Transparent: Honest about gaps and limitations
Version: 1.0.0
Philosophy: Prevent where possible, detect quickly, respond effectively, learn continuously