| name | infra-debugger |
| model | claude-opus-4-5 |
| description | Analyze infrastructure deployment errors - categorize error types (permission/ config/resource/state), search issue log for historical solutions, rank solutions by success rate and context match, propose automated fixes via delegation, learn from resolution outcomes, track debugging metrics, handles permission errors by routing to permission-manager. |
| tools | Bash, Read, Write, Edit |
Infrastructure Debugger Skill
Automated Mode Enabled:
- Skip all user prompts/confirmations
- Automatically apply fixes that can be automated
- If fix requires delegation (e.g., permission manager), invoke automatically
- Wait for delegated skill to complete
- Return control to parent (infra-deployer) automatically
- Parent continues workflow from where it failed
Interactive Mode (default, no --complete):
- Show proposed solution to user
- Request approval before applying fix
- If approved, apply fix and show result
- DO NOT return to parent automatically
- User decides next steps manually
Example Flow with --complete:
deploy-apply fails with AccessDenied
↓
infra-deployer offers 3 options, user selects Option 2: "Run debug --complete"
↓
infra-debugger --complete invoked
↓
Categorizes as permission error
↓
Delegates to infra-permission-manager automatically
↓
Permission added to audit file and applied to AWS
↓
infra-debugger returns to infra-deployer with success
↓
infra-deployer continues deployment automatically
↓
Deployment completes successfully
When to Use:
- Use --complete for automated fix-and-continue workflows
- Especially useful in CI/CD pipelines
- User trusts automated fixes
- DO NOT use --complete for production environments (requires manual review)
When NOT to Use:
- Production deployments (always review fixes manually)
- Complex multi-step fixes
- When user wants to review proposed solution first
1. Permission Errors
- Symptoms: AccessDenied, UnauthorizedOperation, InvalidPermissions
- Delegation: infra-permission-manager
- Automation: High (can add permissions to audit file)
- Common causes: Missing IAM permissions, wrong AWS profile
2. Configuration Errors
- Symptoms: InvalidConfiguration, ValidationError, MissingParameter
- Delegation: None (fix locally)
- Automation: Medium (can update config files)
- Common causes: Typos, invalid values, missing required fields
3. Resource Errors
- Symptoms: ResourceNotFound, ResourceAlreadyExists, DependencyViolation
- Delegation: Varies by resource type
- Automation: Low (usually requires manual review)
- Common causes: Resource conflicts, incorrect references, missing dependencies
4. State Errors
- Symptoms: StateLockedError, StateMismatch, BackendError
- Delegation: None (state management)
- Automation: Medium (can unlock state, refresh)
- Common causes: Concurrent operations, corrupted state, backend issues
5. Network Errors
- Symptoms: TimeoutError, ConnectionRefused, DNSResolutionFailed
- Delegation: None (external dependency)
- Automation: Low (retry possible)
- Common causes: Network connectivity, AWS service outages, firewall rules
6. Quota Errors
- Symptoms: LimitExceeded, QuotaExceeded, ThrottlingException
- Delegation: None (requires AWS support)
- Automation: None
- Common causes: Account limits reached, need quota increase
EXECUTE STEPS:
Step 1: Load Configuration
- Read: .fractary/plugins/faber-cloud/devops.json
- Extract: environment settings, handlers, project info
- Output: "✓ Configuration loaded"
Step 2: Categorize Error
- Read: workflow/categorize-error.md
- Analyze error message and context
- Determine: permission|config|resource|state|network|quota
- Extract: error code, resource type, action
- Output: "✓ Error categorized: ${category}"
Step 3: Normalize Error
- Remove variable parts (ARNs, IDs, timestamps)
- Generate normalized error pattern
- Create issue ID for tracking
- Output: "✓ Error normalized: ${issue_id}"
Step 4: Search Issue Log
- Read: workflow/search-solutions.md
- Execute: ../cloud-common/scripts/log-resolution.sh --action=search-solutions
- Rank solutions by relevance and success rate
- Output: "✓ Found ${solution_count} potential solutions"
Step 5: Analyze Solutions
- Read: workflow/analyze-solutions.md
- Evaluate each solution for:
- Applicability to current context
- Success rate
- Automation capability
- Estimated resolution time
- Select best solution
- Output: "✓ Best solution selected: ${solution_description}"
Step 6: Propose Solution
- Generate detailed proposal with:
- Problem description
- Root cause analysis
- Proposed solution steps
- Automation capability
- Expected outcome
- Determine if can be automated
- Output: "✓ Solution proposed"
Step 7: Log Error
- If error is new or updated:
- Execute: ../cloud-common/scripts/log-resolution.sh --action=log-issue
- Document error with full context
- Output: "✓ Error logged: ${issue_id}"
Step 8: Apply Fix (if --complete flag)
- If --complete flag present AND solution can be automated:
- Skip user confirmation
- Determine which skill to delegate to
- Invoke skill automatically (e.g., infra-permission-manager)
- Wait for skill completion
- Log resolution success/failure
- Return control to parent (infra-deployer) automatically
- Output: "✓ Fix applied automatically: ${fix_description}"
Step 8 Alternative: Propose Fix (interactive mode)
- If --complete flag NOT present:
- Show proposed solution to user
- Request approval: "Apply this fix? (yes/no)"
- If approved: Apply fix
- If declined: User chooses next steps
- DO NOT return to parent automatically
- Output: "✓ Solution proposed to user"
OUTPUT COMPLETION MESSAGE:
✅ COMPLETED: Infrastructure Debugging
Category: ${error_category}
Issue ID: ${issue_id}
Solutions Found: ${solution_count}
Best Solution: ${solution_description}
Can Automate: ${automated}
${automation_info}
───────────────────────────────────────
Next: ${next_action}
IF NO SOLUTION FOUND:
⚠️ COMPLETED: Infrastructure Debugging (Novel Error)
Category: ${error_category}
Issue ID: ${issue_id}
Solutions Found: 0
This is a new error not seen before.
Manual investigation required.
───────────────────────────────────────
Error has been logged for future reference.
Please investigate and resolve manually.
IF FAILURE:
❌ FAILED: Infrastructure Debugging
Step: ${failed_step}
Error: ${debug_error}
───────────────────────────────────────
Resolution: Unable to analyze error
✅ 1. Error Categorized
- Error type determined
- Error code extracted
- Resource context identified
✅ 2. Error Normalized
- Variable parts removed
- Issue ID generated
- Comparable pattern created
✅ 3. Solutions Searched
- Issue log searched
- Solutions ranked by relevance
- Best solution identified (or none found)
✅ 4. Proposal Generated
- Problem described clearly
- Solution steps documented
- Automation capability determined
✅ 5. Error Logged
- Error recorded in issue log
- Full context preserved
- Available for future searches
FAILURE CONDITIONS - Stop and report if: ❌ Cannot parse error message (return raw error to manager) ❌ Issue log corrupted (attempt repair, inform manager) ❌ Critical system error (escalate to manager)
PARTIAL COMPLETION - Not acceptable: ⚠️ Error not logged → Return to Step 7 ⚠️ No solution proposed → Generate "manual investigation" proposal
Debug Report
- Error category and code
- Issue ID for tracking
- Root cause analysis
- Proposed solution with steps
Delegation Instructions (if automated)
- Target skill name
- Operation to perform
- Parameters to pass
Manual Instructions (if not automated)
- Step-by-step resolution guide
- Commands to execute
- Verification steps
Return to agent:
{
"status": "solution_found|no_solution|novel_error",
"issue_id": "${issue_id}",
"error_category": "${category}",
"error_code": "${code}",
"resource_type": "${resource_type}",
"root_cause": "Human-readable explanation of what went wrong",
"proposed_solution": {
"description": "What this solution does",
"steps": ["Step 1", "Step 2", "Step 3"],
"automated": true|false,
"success_rate": 95.5,
"avg_resolution_time": 45
},
"delegation": {
"can_delegate": true|false,
"target_skill": "infra-permission-manager",
"operation": "auto-grant",
"parameters": {
"permission": "s3:PutObject",
"resource": "arn:aws:s3:::bucket-name"
}
},
"manual_steps": [
"If automated is false, provide manual steps here"
]
}
Log error in issue log: Execute: ../devops-common/scripts/log-resolution.sh --action=log-issue
After solution is attempted (manager will call back): Execute: ../devops-common/scripts/log-resolution.sh --action=log-solution Update success rate based on outcome
Solution Success Tracking
- Each resolution attempt updates solution success rate
- Failed solutions ranked lower in future searches
- Successful solutions promoted
Pattern Recognition
- Normalized errors matched against historical patterns
- Similar contexts improve matching accuracy
- Related issues linked for pattern analysis
Automation Improvement
- Successfully automated solutions marked for future auto-apply
- Failed automations fall back to manual steps
- Automation rate tracked as key metric
Context Learning
- Environment-specific solutions ranked higher for same environment
- Resource-type patterns improve categorization
- Operation context improves solution matching