name	production-incident-responder
description	Crisis response skill for production system failures. Integrates with MCP resilience tools to detect, diagnose, and respond to critical system failures. Use when the production system shows signs of failure or during emergency situations.

Production Incident Responder

A crisis response skill that uses MCP (Model Context Protocol) tools to act on the deployed, running system during critical failures.

When This Skill Activates

Production system health check fails
ACGME compliance violations detected
Utilization exceeds 80% threshold
Coverage gaps identified
Circuit breaker trips
Defense level escalates to ORANGE or higher

MCP Integration

This skill connects to the MCP server, which provides real-time access to the following tools:

Tier 1: Critical Resilience Tools

MCP Tool	Purpose	Trigger
`check_utilization_threshold_tool`	Monitor 80% queuing theory limit	Utilization > 75%
`get_defense_level_tool`	Nuclear safety graduated response	Any escalation
`run_contingency_analysis_resilience_tool`	N-1/N-2 vulnerability analysis	Faculty absence
`get_static_fallbacks_tool`	Pre-computed backup schedules	Critical failure
`execute_sacrifice_hierarchy_tool`	Triage-based load shedding	RED/BLACK level
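
The 80% figure in the table is described as a queuing theory limit. As a rough illustration of why backlog grows quickly past that point, the sketch below uses the textbook M/M/1 result that the mean number of items in the system is rho / (1 - rho). This assumes a single-server M/M/1 model purely for illustration; the source does not say which model `check_utilization_threshold_tool` uses, and `mean_in_system` is a hypothetical helper.

```python
def mean_in_system(utilization: float) -> float:
    """Mean number of items in an M/M/1 system: L = rho / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


# Compare the backlog at a few utilization levels around the 80% threshold.
for rho in (0.50, 0.75, 0.80, 0.90, 0.95):
    print(f"utilization {rho:.0%}: mean backlog {mean_in_system(rho):.1f}")
```

Under this model the mean backlog is 4 at 80% utilization and more than doubles to 9 at 90%, which is the usual argument for keeping headroom below the threshold.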

Tier 2: Strategic Tools

MCP Tool	Purpose	Trigger
`analyze_homeostasis_tool`	Feedback loop health	Sustained stress
`calculate_blast_radius_tool`	Failure containment	Zone health warning
`analyze_le_chatelier_tool`	Equilibrium shift analysis	Resource strain

Tier 3: Advanced Analytics

MCP Tool	Purpose	Trigger
`analyze_hub_centrality_tool`	Single point of failure ID	Vulnerability scan
`assess_cognitive_load_tool`	Coordinator burnout risk	Decision queue > 7
`check_mtf_compliance_tool`	Military compliance/DRRS	Readiness check

Incident Response Protocol

Level 1: DETECTION (Automated)

System Health Check
├── Check utilization via MCP: check_utilization_threshold_tool
├── Get defense level: get_defense_level_tool
├── Run compliance check: check_mtf_compliance_tool
└── Assess cognitive load: assess_cognitive_load_tool

If any metric is YELLOW or worse → Escalate to Level 2
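
The escalation rule above can be sketched as a small aggregation step. This assumes each tool result has already been mapped to one of the five color levels; that mapping, and the names `Level`, `worst_level`, and `should_escalate_to_diagnosis`, are hypothetical and not part of the MCP tools themselves.

```python
from enum import IntEnum


class Level(IntEnum):
    GREEN = 0
    YELLOW = 1
    ORANGE = 2
    RED = 3
    BLACK = 4


def worst_level(readings: dict[str, Level]) -> Level:
    """Return the most severe level across all health-check readings."""
    return max(readings.values(), default=Level.GREEN)


def should_escalate_to_diagnosis(readings: dict[str, Level]) -> bool:
    """Level 1 rule: escalate when any metric is YELLOW or worse."""
    return worst_level(readings) >= Level.YELLOW


# One reading per Level 1 check (values here are made up for illustration).
readings = {
    "utilization": Level.GREEN,
    "defense_level": Level.YELLOW,
    "mtf_compliance": Level.GREEN,
    "cognitive_load": Level.GREEN,
}
print(worst_level(readings).name, should_escalate_to_diagnosis(readings))
```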

Level 2: DIAGNOSIS (Automated + Human Review)

Root Cause Analysis
├── Run contingency analysis: run_contingency_analysis_resilience_tool
│   ├── N-1 analysis (single failure resilience)
│   ├── N-2 analysis (dual failure resilience)
│   └── Cascade simulation
├── Analyze hub centrality: analyze_hub_centrality_tool
│   └── Identify critical personnel
├── Check blast radius: calculate_blast_radius_tool
│   └── Identify affected zones
└── Analyze equilibrium: analyze_le_chatelier_tool
    └── Predict sustainability

Output: Incident Report with Recommendations
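
To make the N-1/N-2 idea concrete, here is a minimal brute-force sketch: remove every combination of k people and report which blocks lose all coverage. It is an illustration of the concept only, not how `run_contingency_analysis_resilience_tool` is implemented; the data layout (block name to set of qualified faculty) and all names are assumptions.

```python
from itertools import combinations


def uncovered_blocks(coverage: dict[str, set[str]], absent: set[str]) -> list[str]:
    """Blocks left with no available faculty once `absent` people are removed."""
    return sorted(block for block, staff in coverage.items() if not (staff - absent))


def contingency(coverage: dict[str, set[str]], k: int) -> dict[tuple[str, ...], list[str]]:
    """Map each k-person absence that breaks coverage to the blocks it uncovers."""
    faculty = sorted(set().union(*coverage.values()))
    failures = {}
    for combo in combinations(faculty, k):
        gaps = uncovered_blocks(coverage, set(combo))
        if gaps:
            failures[combo] = gaps
    return failures


# Hypothetical coverage map: block -> faculty able to cover it.
coverage = {
    "mon_am": {"smith", "jones"},
    "mon_pm": {"smith"},
    "tue_am": {"jones", "lee"},
}
n1 = contingency(coverage, 1)  # single-absence failures
n2 = contingency(coverage, 2)  # dual-absence failures
print("N-1 failures:", n1)
print("N-2 failures:", n2)
```

In this toy data only one person is an N-1 risk (the sole cover for `mon_pm`), while every pair breaks at least one block. Exhaustive enumeration grows combinatorially, so a real analysis over a full roster would likely need pruning.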

Level 3: RESPONSE (Human Approval Required)

Response Actions (by severity)

GREEN → No action needed, continue monitoring
YELLOW → Warning: Review recommendations
ORANGE → Critical: Implement mitigations
  ├── Get static fallbacks: get_static_fallbacks_tool
  └── Prepare sacrifice hierarchy (simulate only)
RED → Emergency: Activate crisis protocols
  ├── Execute sacrifice hierarchy: execute_sacrifice_hierarchy_tool
  ├── Activate fallback schedules
  └── Generate SITREP: check_mtf_compliance_tool
BLACK → Catastrophic: Emergency services only
  ├── Execute maximum load shedding
  └── Generate MFR/RFF documentation
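
One way to read the severity ladder above is as a gate on the `simulate_only` argument of `execute_sacrifice_hierarchy_tool`. The sketch below encodes that reading: nothing is called at GREEN/YELLOW, ORANGE only simulates, and RED/BLACK execute for real only after human approval. The `target_level` and `simulate_only` parameter names come from the workflows in this document; the helper `sacrifice_call` and the exact gating logic are assumptions.

```python
from typing import Optional


def sacrifice_call(severity: str, target_level: str, human_approved: bool) -> Optional[dict]:
    """Arguments for execute_sacrifice_hierarchy_tool, or None if it should not be called."""
    if severity in ("GREEN", "YELLOW"):
        return None
    # ORANGE only prepares (simulates); RED/BLACK execute, but never without approval.
    execute = severity in ("RED", "BLACK") and human_approved
    return {"target_level": target_level, "simulate_only": not execute}


print(sacrifice_call("ORANGE", "yellow", human_approved=True))
print(sacrifice_call("RED", "orange", human_approved=False))
print(sacrifice_call("RED", "orange", human_approved=True))
```

Defaulting to simulation whenever approval is missing keeps the failure mode safe: a lost or delayed approval results in a dry run rather than unreviewed load shedding.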

Level 4: RECOVERY (Post-Incident)

Recovery Actions
├── Monitor homeostasis: analyze_homeostasis_tool
├── Track allostatic load
├── Verify equilibrium restoration
└── Document lessons learned

MCP Server Connection

Prerequisites

# Start MCP server (in its own terminal, from the repository root)
cd mcp-server
pip install -e .
python -m scheduler_mcp.server

# Ensure backend is running (in a second terminal, from the repository root)
cd backend
uvicorn app.main:app --reload

# Start Celery for async operations
./scripts/start-celery.sh both

MCP Configuration

Add to Claude Desktop or IDE MCP config:

{
  "mcpServers": {
    "residency-scheduler": {
      "command": "python",
      "args": ["-m", "scheduler_mcp.server"],
      "cwd": "/path/to/mcp-server"
    }
  }
}
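
Before restarting the client, it can help to confirm that the config file parses and has the shape used in the example above. The sketch below is a minimal sanity check using only the standard library. It is not an official schema validation, it checks only the `command` and `args` keys shown above, and `check_mcp_config` is a hypothetical helper.

```python
import json


def check_mcp_config(text: str) -> list[str]:
    """Return a list of problems found in an MCP client config; empty means none found."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    if not isinstance(servers, dict) or not servers:
        return ["missing or empty 'mcpServers' object"]
    problems = []
    for name, entry in servers.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry must be an object")
            continue
        if not isinstance(entry.get("command"), str):
            problems.append(f"{name}: 'command' must be a string")
        args = entry.get("args", [])
        if not (isinstance(args, list) and all(isinstance(a, str) for a in args)):
            problems.append(f"{name}: 'args' must be a list of strings")
    return problems


example = """
{
  "mcpServers": {
    "residency-scheduler": {
      "command": "python",
      "args": ["-m", "scheduler_mcp.server"],
      "cwd": "/path/to/mcp-server"
    }
  }
}
"""
print(check_mcp_config(example) or "config looks well-formed")
```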

Crisis Response Workflows

Workflow 1: Faculty Absence Emergency

1. DETECT
   - Receive absence notification
   - Run: check_utilization_threshold_tool

2. DIAGNOSE
   - Run: run_contingency_analysis_resilience_tool(scenario="faculty_absence")
   - Check N-1 resilience: Can we survive this absence?
   - Identify coverage gaps

3. RESPOND (based on impact)
   LOW IMPACT:
   - Use swap marketplace for coverage
   - No escalation needed

   MEDIUM IMPACT:
   - Activate backup pool
   - Run: get_static_fallbacks_tool(scenario="single_absence")
   - Implement fallback schedule

   HIGH IMPACT:
   - Escalate defense level
   - Run: execute_sacrifice_hierarchy_tool(target_level="yellow", simulate_only=true)
   - Review load shedding options
   - REQUIRE HUMAN APPROVAL before execution

4. RECOVER
   - Monitor homeostasis post-incident
   - Verify coverage restored

Workflow 2: Mass Casualty / Deployment Event

1. DETECT
   - Multiple absences reported (e.g., military deployment)
   - Run: check_utilization_threshold_tool
   - Expected: ORANGE or RED level

2. DIAGNOSE
   - Run: run_contingency_analysis_resilience_tool(analyze_n1=true, analyze_n2=true)
   - Run: analyze_hub_centrality_tool
   - Identify fatal faculty combinations
   - Calculate cascade risk

3. RESPOND
   - Run: get_static_fallbacks_tool(scenario="deployment")
   - Run: execute_sacrifice_hierarchy_tool(target_level="orange", simulate_only=true)
   - Present options to coordinator:
     a) Implement partial load shedding
     b) Request external locum coverage
     c) Activate cross-training coverage
   - REQUIRE HUMAN APPROVAL

4. COMPLIANCE
   - Run: check_mtf_compliance_tool(generate_sitrep=true)
   - Generate DRRS readiness report
   - Document MFR if circuit breaker trips

5. RECOVER
   - Monitor Le Chatelier equilibrium
   - Track days until exhaustion
   - Plan for resource restoration

Workflow 3: ACGME Compliance Violation

1. DETECT
   - Compliance check fails (80-hour, 1-in-7, supervision)
   - Run: validate_schedule via MCP

2. DIAGNOSE
   - Identify specific violations
   - Check affected residents/faculty
   - Calculate severity

3. RESPOND
   SINGLE VIOLATION:
   - Use conflict auto-resolution
   - Run: detect_conflicts(include_auto_resolution=true)
   - Apply suggested fix

   MULTIPLE VIOLATIONS:
   - Run: run_contingency_analysis_resilience_tool
   - May need schedule regeneration
   - ESCALATE to human

4. DOCUMENT
   - Log compliance event
   - Generate audit trail
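
For the 80-hour check mentioned in the DETECT step, the ACGME limit is 80 hours per week averaged over a four-week period. The sketch below flags every four-week span whose average exceeds that limit. Using a rolling window over weekly totals is an assumption of this sketch; how `validate_schedule` actually computes the average is not stated in this document, and `over_80_hour_windows` is a hypothetical helper.

```python
def over_80_hour_windows(weekly_hours: list[float], window: int = 4,
                         limit: float = 80.0) -> list[int]:
    """Return the start index of every `window`-week span whose average exceeds `limit`."""
    starts = []
    for i in range(len(weekly_hours) - window + 1):
        average = sum(weekly_hours[i:i + window]) / window
        if average > limit:
            starts.append(i)
    return starts


# Made-up weekly totals for one resident.
hours = [78, 84, 82, 79, 70]
print(over_80_hour_windows(hours))
```

Here the first four weeks average 80.75 hours and are flagged, while the span starting in week two averages 78.75 and passes, even though it contains two individual weeks above 80.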

Escalation Rules

ALWAYS Escalate to Human When:

Defense level reaches RED or BLACK
Circuit breaker trips
Multiple simultaneous absences (N-2+)
ACGME violation cannot be auto-resolved
Sacrifice hierarchy execution required (not just simulation)
External staffing needed
Regulatory documentation required

Can Handle Automatically:

GREEN/YELLOW level monitoring
Single swap facilitation
Backup pool assignment (if available)
Simulation mode analysis
Report generation
Compliance checking
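
The "always escalate" list above translates naturally into a set of independent checks. The following sketch collects every rule an incident triggers so the reasons can be shown to the human reviewer. The `Incident` fields and `escalation_reasons` are hypothetical names chosen for this illustration; only the rules themselves come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    defense_level: str = "GREEN"
    circuit_breaker_tripped: bool = False
    simultaneous_absences: int = 0
    acgme_violation_unresolved: bool = False
    sacrifice_execution_required: bool = False
    external_staffing_needed: bool = False
    regulatory_docs_required: bool = False


def escalation_reasons(incident: Incident) -> list[str]:
    """List every 'always escalate' rule the incident triggers; empty means none fired."""
    rules = [
        (incident.defense_level in ("RED", "BLACK"), "defense level is RED or BLACK"),
        (incident.circuit_breaker_tripped, "circuit breaker tripped"),
        (incident.simultaneous_absences >= 2, "multiple simultaneous absences (N-2+)"),
        (incident.acgme_violation_unresolved, "ACGME violation could not be auto-resolved"),
        (incident.sacrifice_execution_required, "sacrifice hierarchy execution required"),
        (incident.external_staffing_needed, "external staffing needed"),
        (incident.regulatory_docs_required, "regulatory documentation required"),
    ]
    return [reason for triggered, reason in rules if triggered]


incident = Incident(defense_level="ORANGE", simultaneous_absences=2)
print(escalation_reasons(incident))
```

Returning the reasons rather than a bare boolean means an ORANGE incident like this one still escalates, with the specific rule (dual absence) visible in the report.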

Response Time Expectations

Severity	Detection	Analysis	Response
GREEN	Continuous	N/A	N/A
YELLOW	< 5 min	< 10 min	< 1 hour
ORANGE	< 1 min	< 5 min	< 30 min
RED	Immediate	< 2 min	< 15 min
BLACK	Immediate	< 1 min	Immediate
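
The Response column of the table can be encoded as deadlines and checked against the detection time. The sketch below treats "Immediate" as a zero-length deadline, so an unanswered BLACK incident always counts as overdue; that interpretation, along with the names `RESPONSE_DEADLINE` and `response_overdue`, is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone

# Response deadlines from the table above; None means no deadline applies (GREEN).
RESPONSE_DEADLINE = {
    "GREEN": None,
    "YELLOW": timedelta(hours=1),
    "ORANGE": timedelta(minutes=30),
    "RED": timedelta(minutes=15),
    "BLACK": timedelta(0),  # "Immediate": any delay counts as overdue
}


def response_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True if an incident that has not yet been responded to is past its deadline."""
    deadline = RESPONSE_DEADLINE[severity]
    if deadline is None:
        return False
    return now - detected_at >= deadline


detected = datetime(2025, 12, 20, 14, 32, tzinfo=timezone.utc)
print(response_overdue("ORANGE", detected, detected + timedelta(minutes=33)))
```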

Integration with Other Skills

With automated-code-fixer

If crisis response reveals code issues:

Document the issue
Escalate to automated-code-fixer skill
Apply fix through quality gates
Re-run health check

With code-quality-monitor

Post-incident:

Run full quality check
Ensure no degradation from crisis response
Document any technical debt incurred

Reporting Format

Quick Status (for monitoring)

PRODUCTION STATUS: YELLOW

Utilization: 78% (threshold: 80%)
Defense Level: 2 - CONTROL
Coverage: 94%
Pending Decisions: 5
Active Alerts: 2

Next Action: Monitor, no immediate action required
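
A small formatter keeps the quick status layout above consistent across incidents. This is a sketch only: the function name `quick_status` and its parameters are hypothetical, and percentages are assumed to be passed as fractions (0.78 for 78%).

```python
def quick_status(status: str, utilization: float, threshold: float, defense_level: str,
                 coverage: float, pending_decisions: int, active_alerts: int,
                 next_action: str) -> str:
    """Render the plain-text quick status block used for monitoring."""
    return "\n".join([
        f"PRODUCTION STATUS: {status}",
        "",
        f"Utilization: {utilization:.0%} (threshold: {threshold:.0%})",
        f"Defense Level: {defense_level}",
        f"Coverage: {coverage:.0%}",
        f"Pending Decisions: {pending_decisions}",
        f"Active Alerts: {active_alerts}",
        "",
        f"Next Action: {next_action}",
    ])


report = quick_status("YELLOW", 0.78, 0.80, "2 - CONTROL", 0.94, 5, 2,
                      "Monitor, no immediate action required")
print(report)
```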

Incident Report (for escalation)

## INCIDENT REPORT

**Severity**: ORANGE
**Time Detected**: 2025-12-20 14:32 UTC
**Status**: ACTIVE - AWAITING HUMAN APPROVAL

### Summary
Two faculty members reported simultaneous absence due to medical emergency.

### Impact Assessment
- Utilization: 85% (above threshold)
- Coverage Gaps: 8 blocks over next 7 days
- ACGME Risk: Supervision ratio violation in 3 blocks
- Cascade Risk: MEDIUM

### MCP Analysis Results
- N-1 Resilience: FAILED
- N-2 Resilience: N/A (already at N-2)
- Hub Centrality: Dr. Smith identified as critical (betweenness: 0.42)

### Recommended Actions
1. Activate static fallback schedule "dual_absence"
2. Request backup pool coverage for PM blocks
3. Consider sacrifice hierarchy level YELLOW (suspend optional education)

### Required Approvals
- [ ] Coordinator approval for fallback activation
- [ ] Medical director review of supervision plan

### Generated Documentation
- SITREP attached
- MFR template prepared (pending circuit breaker status)

References

/mcp-server/RESILIENCE_MCP_INTEGRATION.md - Full MCP resilience integration
/mcp-server/src/scheduler_mcp/resilience_integration.py - Tool implementations
/backend/app/resilience/ - Backend resilience framework
/docs/architecture/resilience-framework.md - Architecture overview
