| name | cto-risk-resilience |
| description | Expert methodology for identifying, assessing, and mitigating technical and operational risks including security, incidents, compliance, and disaster recovery. |
CTO Risk & Resilience Skill
Purpose
This skill provides a comprehensive framework for managing technical risk and building resilient systems. Use it to conduct risk assessments, plan incident response, achieve compliance certifications, and ensure business continuity.
When to Use
Trigger this skill when you need to:
- Conduct security risk assessments
- Plan for compliance certifications (SOC 2, ISO 27001, GDPR, HIPAA)
- Design incident response processes and runbooks
- Create disaster recovery and business continuity plans
- Assess vendor and third-party risks
- Define SLAs, SLOs, and SLIs
- Prepare for security audits or penetration testing
- Implement chaos engineering and resilience testing
- Respond to security incidents or breaches
Core Methodology
Follow this systematic approach to risk and resilience:
Phase 1: Risk Identification
Categorize Risks
Technical Risks:
- System failures and outages
- Data loss or corruption
- Performance degradation
- Scalability limitations
- Technical debt accumulation
Security Risks:
- Unauthorized access
- Data breaches
- DDoS attacks
- Malware and ransomware
- Supply chain vulnerabilities
Operational Risks:
- Human error
- Process failures
- Knowledge gaps (bus factor)
- Inadequate monitoring
- Poor change management
Compliance Risks:
- Regulatory violations (GDPR, CCPA, HIPAA)
- Certification failures (SOC 2, ISO 27001)
- Contract breaches (SLA violations)
- Audit findings
Business Risks:
- Vendor dependencies
- Technology obsolescence
- Team attrition
- Budget constraints
Conduct Risk Assessment
Use references/frameworks/risk-assessment-matrix.md to systematically identify and score risks. For each risk:
- Describe the risk scenario
- Assess probability (High/Medium/Low)
- Assess impact (Critical/High/Medium/Low)
- Calculate risk score
- Identify existing controls
- Define mitigation strategies
Phase 2: Risk Prioritization
Create Risk Matrix
| Impact \ Probability | Low (1) | Medium (3) | High (5) |
|---|---|---|---|
| High (5) | Medium Priority | High Priority | CRITICAL |
| Medium (3) | Low Priority | Medium Priority | High Priority |
| Low (1) | Low Priority | Low Priority | Low Priority |
Priority Levels
Critical (Priority 1): Address immediately
- High probability + High impact
- Example: Known security vulnerability in production
High (Priority 2): Address within 30 days
- Medium probability + High impact, OR High probability + Medium impact
- Example: Single point of failure in critical system
Medium (Priority 3): Address within 90 days
- Low probability + High impact, OR Medium probability + Medium impact
- Example: Lack of disaster recovery testing
Low (Priority 4): Monitor and address opportunistically
- Any low impact scenarios
- Example: Minor technical debt in non-critical systems
Use references/templates/risk-register.md to maintain an ongoing risk inventory; a minimal scoring sketch follows.
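As a rough illustration, here is a minimal Python sketch of how the matrix and priority bands above could be turned into an executable scoring helper. The 1/3/5 scale, field names, and thresholds are illustrative assumptions, not part of the bundled framework.

```python
from dataclasses import dataclass

SCALE = {"Low": 1, "Medium": 3, "High": 5}

@dataclass
class Risk:
    name: str
    probability: str  # "Low" | "Medium" | "High"
    impact: str       # "Low" | "Medium" | "High"

    @property
    def score(self) -> int:
        # Probability x impact on the 1/3/5 scale used in the matrix above.
        return SCALE[self.probability] * SCALE[self.impact]

def priority(risk: Risk) -> str:
    # Mirrors the priority bands defined above.
    if risk.probability == "High" and risk.impact == "High":
        return "P1 - Critical"
    if risk.score >= 15:   # Medium prob + High impact, or High prob + Medium impact
        return "P2 - High"
    if risk.impact != "Low" and risk.score >= 5:
        return "P3 - Medium"
    return "P4 - Low"      # any low-impact scenario

register = [
    Risk("Known vulnerability in production", "High", "High"),
    Risk("Single point of failure in billing", "Medium", "High"),
    Risk("DR procedures never tested", "Low", "High"),
    Risk("Minor tech debt in internal tool", "Medium", "Low"),
]
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{r.score:>2}  {priority(r):<13}  {r.name}")
```

Sorting by raw score keeps the register ordered even within a priority band, which helps when several risks share the same priority.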
Phase 3: Security & Compliance
Security Framework
Implement security controls across multiple layers:
Preventive Controls (Stop threats before they happen)
- Access control (MFA, RBAC, least privilege)
- Network security (firewalls, VPNs, segmentation)
- Encryption (at rest and in transit)
- Secure coding practices
- Security awareness training
Detective Controls (Identify threats quickly)
- Logging and monitoring
- Intrusion detection systems (IDS)
- Security information and event management (SIEM)
- Vulnerability scanning
- Penetration testing
Responsive Controls (React to incidents)
- Incident response plan
- Automated threat response
- Backup and recovery procedures
- Communication protocols
Use references/frameworks/security-controls-framework.md for a comprehensive checklist.
Compliance Roadmaps
SOC 2 Type II Certification
Timeline: 12-18 months (6-9 months of preparation + a 3-6 month evidence-collection/observation period + ~3 months for the audit and report)
Use references/templates/soc2-roadmap.md for a detailed plan:
Phases:
- Gap assessment (Month 1-2)
- Control implementation (Month 3-9)
- Evidence collection period (Month 10-15)
- Audit (Month 16-18)
Key Areas:
- Security: Access controls, encryption, monitoring
- Availability: Uptime, incident response, disaster recovery
- Confidentiality: Data protection, privacy controls
- Processing Integrity: Data accuracy, error handling
- Privacy: GDPR/CCPA compliance (if applicable)
ISO 27001 Certification
Timeline: 12-24 months
Use references/templates/iso27001-roadmap.md:
Phases:
- Gap analysis (Month 1-3)
- ISMS implementation (Month 4-15)
- Internal audit (Month 16-18)
- Certification audit (Month 19-24)
Key Requirements:
- 114 controls across 14 domains (Annex A of ISO 27001:2013; the 2022 revision consolidates these into 93 controls)
- Risk assessment methodology
- Information security policies
- Employee training and awareness
- Continuous improvement process
GDPR / CCPA Compliance
Timeline: 6-12 months
Use references/templates/data-privacy-compliance.md:
Key Areas:
- Data inventory and mapping
- Consent management
- Data subject rights (access, deletion, portability)
- Data processing agreements
- Privacy policy and notices
- Breach notification procedures
Phase 4: Incident Response
Create structured incident response capability:
Incident Response Framework
1. Preparation
- Define incident severity levels
- Create on-call rotation
- Develop runbooks for common scenarios
- Set up communication channels
- Train team on procedures
2. Detection
- Monitoring and alerting
- User reports
- Security scanning
- Automated anomaly detection
3. Triage
- Assess severity (P0/P1/P2/P3)
- Assign incident commander
- Assemble response team
- Begin communication
4. Investigation
- Gather data and logs
- Identify root cause
- Assess blast radius
- Document timeline
5. Containment
- Stop the bleeding
- Isolate affected systems
- Prevent further damage
- Implement temporary fixes
6. Resolution
- Deploy permanent fix
- Verify resolution
- Monitor for recurrence
- Update systems
7. Post-Mortem
- Blameless retrospective
- Document what happened
- Identify improvements
- Create action items
Use references/templates/incident-response-playbook.md for detailed procedures.
Incident Severity Levels
| Level | Definition | Response Time | Escalation |
|---|---|---|---|
| P0 - Critical | Complete service outage, data breach, security incident | Immediate | All-hands, exec team notified |
| P1 - High | Major feature broken, significant degradation | <15 minutes | On-call team, manager notified |
| P2 - Medium | Partial functionality impaired, workaround exists | <2 hours | On-call team |
| P3 - Low | Minor issue, minimal customer impact | Next business day | Normal ticket queue |
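If the severity table is going to drive tooling (paging, dashboards, overdue alerts), one option is to encode it as data. A minimal sketch, assuming hypothetical notification targets and the response times from the table above:

```python
from datetime import timedelta

# Hypothetical encoding of the severity table above; "notify" values are placeholders.
SEVERITY_POLICY = {
    "P0": {"respond_within": timedelta(seconds=0),  "notify": ["on-call", "manager", "exec-team"]},
    "P1": {"respond_within": timedelta(minutes=15), "notify": ["on-call", "manager"]},
    "P2": {"respond_within": timedelta(hours=2),    "notify": ["on-call"]},
    "P3": {"respond_within": timedelta(days=1),     "notify": ["ticket-queue"]},
}

def response_overdue(severity: str, minutes_since_detection: float) -> bool:
    """True if the incident has waited longer than its target response time."""
    return timedelta(minutes=minutes_since_detection) > SEVERITY_POLICY[severity]["respond_within"]

print(response_overdue("P1", 20))  # True: a P1 should be acknowledged within 15 minutes
```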
On-Call Best Practices
Structure:
- Primary and secondary on-call rotation
- 1-week rotations (avoid burnout)
- Compensate with time off or pay
- No more than 2-3 incidents per week on average
Tools:
- PagerDuty, Opsgenie, or similar
- Automated escalation
- Mobile app for notifications
- Integration with monitoring systems
Health Metrics:
- On-call incidents per week
- Time spent on-call
- Sleep disruption frequency
- On-call satisfaction score
Use references/frameworks/on-call-framework.md for detailed guidance.
Phase 5: Business Continuity
Ensure critical business functions can continue during disruptions:
Disaster Recovery Planning
Recovery Objectives:
RTO (Recovery Time Objective): How long can we be down?
- Critical systems: 1 hour
- Important systems: 4 hours
- Standard systems: 24 hours
RPO (Recovery Point Objective): How much data can we lose?
- Financial data: 0 (real-time replication)
- Customer data: 15 minutes (frequent backups)
- Analytics data: 24 hours (daily backups)
Disaster Scenarios:
- Single server failure
- Availability zone outage
- Regional outage
- Complete cloud provider outage
- Ransomware attack
- Accidental data deletion
- Key personnel unavailable
For each scenario:
- Detection method
- Recovery procedure
- Responsible team
- Expected RTO/RPO
- Testing frequency
Use references/templates/disaster-recovery-plan.md for comprehensive planning; a small RPO-check sketch follows.
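As a small worked example, the RPO targets listed above can be checked mechanically against backup timestamps. A sketch, with dataset names and targets assumed for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumed RPO targets mirroring the examples above; dataset names are illustrative.
RPO_TARGETS = {
    "financial": timedelta(seconds=0),   # real-time replication
    "customer":  timedelta(minutes=15),  # frequent backups
    "analytics": timedelta(hours=24),    # daily backups
}

def rpo_breaches(last_backup_at: dict) -> list:
    """Return datasets whose most recent backup is older than their RPO target."""
    now = datetime.now(timezone.utc)
    return [name for name, target in RPO_TARGETS.items()
            if now - last_backup_at[name] > target]

now = datetime.now(timezone.utc)
print(rpo_breaches({
    "financial": now,
    "customer":  now - timedelta(minutes=5),
    "analytics": now - timedelta(hours=30),   # stale: breaches the 24-hour RPO
}))  # ['analytics']
```

Running a check like this on a schedule turns RPO from a document into an alert.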
Business Continuity Testing
Regular Drills:
- Quarterly: Tabletop exercises (discuss scenarios)
- Semi-annually (twice a year): Simulated incidents (practice procedures)
- Annually: Full disaster recovery test (actual failover)
Game Days:
- Chaos engineering exercises
- Intentional failure injection
- Test recovery procedures
- Identify weaknesses
Documentation:
- Keep runbooks updated
- Document lessons learned
- Update procedures based on findings
- Share knowledge across team
Phase 6: Resilience Engineering
Build systems that gracefully handle failures:
Resilience Patterns
1. Circuit Breakers
- Detect failing services
- Prevent cascade failures
- Automatic recovery attempts
- Fallback behavior
2. Retry with Exponential Backoff
- Handle transient failures
- Avoid overwhelming systems
- Progressive delay between attempts
- Maximum retry limits
3. Timeout and Bulkheads
- Prevent resource exhaustion
- Isolate failures
- Limit blast radius
- Protect critical paths
4. Graceful Degradation
- Continue with reduced functionality
- Non-critical features can fail
- User experience maintained
- Clear communication to users
5. Rate Limiting and Load Shedding
- Prevent overload
- Protect system stability
- Prioritize critical requests
- Fair resource allocation
Use references/frameworks/resilience-patterns.md for implementation guidance; a minimal sketch of patterns 1 and 2 follows.
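For concreteness, here is an illustrative sketch of patterns 1 and 2 above (a minimal circuit breaker and retry with exponential backoff plus jitter). Names and thresholds are assumptions for this sketch, not a specific library's API.

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Call fn(), retrying transient failures with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds

class CircuitBreaker:
    """Fails fast after repeated failures; half-opens after a cooldown to probe recovery."""
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

In practice the two compose: wrap the remote call in the circuit breaker and put the retry around it, so retries against an open circuit fail fast instead of hammering a struggling dependency.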
Service Level Objectives (SLOs)
Define reliability targets:
SLI (Service Level Indicator): What we measure
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second handled
- Error rate: % of failed requests
SLO (Service Level Objective): Our target
- Availability: 99.9% (43 minutes downtime/month allowed)
- Latency: 95% of requests < 200ms
- Error rate: < 0.1%
SLA (Service Level Agreement): Promise to customers
- Usually more lenient than internal SLO
- Has financial consequences if missed
- Example: 99.5% uptime guarantee with credits
Error Budget:
- Amount of allowed unreliability
- 99.9% SLO = 0.1% error budget = 43 minutes/month
- Can "spend" budget on releases, changes, experiments
- When exhausted, focus shifts to reliability
Use references/templates/slo-definition.md for the framework; a short error-budget calculation follows.
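The error-budget arithmetic above is simple enough to wire into reporting. A short sketch, assuming a 30-day window and an availability SLO expressed as a fraction:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime per window for a given availability SLO (e.g. 0.999)."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

print(f"{error_budget_minutes(0.999):.1f} min/month")            # 43.2
print(f"{budget_remaining(0.999, 30):.0%} of budget remaining")   # 31% after a 30-minute outage
```

When the remaining fraction approaches zero, the policy above applies: pause risky releases and spend the time on reliability work.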
Key Principles
- Prevention Over Reaction: Build security and resilience from the start
- Defense in Depth: Multiple layers of security controls
- Assume Breach: Plan for when (not if) defenses fail
- Blameless Culture: Learn from incidents without blame
- Continuous Improvement: Regular testing and refinement
- Clear Communication: Transparent about risks and incidents
Bundled Resources
Frameworks (references/frameworks/):
- risk-assessment-matrix.md - Systematic risk identification and scoring
- security-controls-framework.md - Comprehensive security checklist
- on-call-framework.md - Sustainable on-call practices
- resilience-patterns.md - Architecture patterns for resilience
- chaos-engineering.md - Controlled failure testing
Templates (references/templates/):
- risk-register.md - Ongoing risk tracking
- incident-response-playbook.md - Step-by-step incident procedures
- soc2-roadmap.md - SOC 2 certification plan
- iso27001-roadmap.md - ISO 27001 certification plan
- data-privacy-compliance.md - GDPR/CCPA compliance guide
- disaster-recovery-plan.md - DR procedures and testing
- slo-definition.md - Service level objective framework
- security-audit-checklist.md - Pre-audit preparation
- post-mortem-template.md - Incident analysis format
Examples (references/examples/):
- Real post-mortems from major incidents (anonymized)
- Security audit results and remediation
- Compliance certification timelines
- DR testing scenarios and results
Usage Patterns
Example 1: User says "We need to get SOC 2 certified for enterprise sales"
→ Load references/templates/soc2-roadmap.md
→ Conduct gap assessment against SOC 2 requirements
→ Create 12-18 month roadmap with phases
→ Identify control implementations needed
→ Estimate costs (audit fees, tools, consulting)
→ Assign ownership and timeline
→ Provide monthly checklist for evidence collection
Example 2: User says "Create incident response process for my 30-person team"
→ Load references/templates/incident-response-playbook.md
→ Define severity levels (P0-P3) with examples
→ Design on-call rotation structure
→ Create runbooks for common scenarios
→ Set up communication channels (Slack, status page)
→ Define escalation paths
→ Schedule incident response training
Example 3: User says "Conduct security risk assessment for Series B due diligence"
→ Load references/frameworks/risk-assessment-matrix.md
→ Inventory all systems and data
→ Identify risks across security, compliance, operational
→ Score by probability and impact
→ Document existing controls
→ Create risk mitigation roadmap
→ Prepare executive summary for investors
Example 4: User says "We had a major outage, help with post-mortem"
→ Load references/templates/post-mortem-template.md
→ Document incident timeline
→ Identify root cause(s)
→ Analyze what went well and poorly
→ Create blameless narrative
→ Generate action items with owners
→ Share with team and stakeholders
→ Track action item completion
Risk Management by Company Stage
Early Stage Startup (Pre-PMF)
Focus: Security basics, avoid catastrophic risks
Priorities:
- Basic security (encryption, access control)
- Data backup and recovery
- Privacy compliance basics
- Simple incident response
Avoid: Over-investing in compliance certifications too early
Growth Stage (Post-PMF, Scaling)
Focus: Scalability, reliability, security hardening
Priorities:
- SOC 2 preparation (if selling B2B)
- Comprehensive monitoring and alerting
- Incident response process
- On-call rotation structure
- DR planning and testing
Investment: 10-15% of engineering time on resilience
Scale Stage (Enterprise)
Focus: Compliance, resilience, enterprise security
Priorities:
- Multiple compliance certifications (SOC 2, ISO 27001)
- Advanced security (SIEM, threat detection)
- Chaos engineering and resilience testing
- Comprehensive BC/DR
- Security team and CISO
Investment: 20-25% of engineering time on reliability/security
Warning Signs
| Indicator | Risk | Action |
|---|---|---|
| No monitoring on production | High - can't detect issues | Immediate: Implement basic monitoring |
| No backup/DR tested in 6+ months | High - recovery may fail | Test DR procedures this quarter |
| Single person knows critical system | High - bus factor = 1 | Document and cross-train immediately |
| Increasing incident frequency | Medium-High - system degrading | Root cause analysis, resilience improvements |
| Unremediated security scan findings | High - vulnerable to attack | Remediate critical/high findings within 30 days |
| Compliance deadline in <6 months | High - may not certify in time | Accelerate roadmap, consider consultant |
| On-call team burned out | Medium - quality and retention risk | Reduce incident load, improve tooling |
Communication Templates
For Board/Investors
Security & Risk Update
Status: 🟢 Secure and compliant
Key Metrics:
- System uptime: 99.87% (target: 99.9%)
- Security incidents: 0 critical, 2 low (resolved)
- Compliance: SOC 2 on track for Q3 certification
Top Risks & Mitigations:
- Single cloud provider dependency → Implementing multi-region DR
- Growing on-call burden → Hiring SRE, improving automation
- Compliance timeline tight → Weekly checkpoint, external consultant engaged
Investment Request: $150K for penetration testing and SOC 2 audit
For Engineering Team
Incident Review - Service Outage Feb 15
What Happened: Database connection pool exhaustion caused 45-minute outage
Timeline:
- 2:15 PM: Increased load from marketing campaign
- 2:22 PM: First alerts fired
- 2:25 PM: Team paged, investigation started
- 2:40 PM: Root cause identified
- 3:00 PM: Service restored
What Went Well:
- ✅ Alerts fired within 7 minutes
- ✅ Team assembled quickly
- ✅ Clear communication to customers
What We'll Improve:
- ⚠️ Auto-scaling for connection pool
- ⚠️ Load testing before campaigns
- ⚠️ Better runbook documentation
Action Items: [See detailed list]
No blame - systems fail, we learn and improve.
Writing Style
All outputs should be:
- Risk-Aware: Identify and communicate risks clearly
- Action-Oriented: Focus on concrete mitigation steps
- Balanced: Realistic about risk vs. cost/effort
- Empathetic: Blameless culture, learning mindset
- Transparent: Honest about gaps and limitations
Version: 1.0.0
Philosophy: Prevent where possible, detect quickly, respond effectively, learn continuously