| name | autonomous-operations |
| description | Autonomous homelab operations using OODA loop (Observe, Orient, Decide, Act) - use when reviewing system state, planning autonomous actions, or investigating operational issues |
Autonomous Operations Engine
Overview
The autonomous operations engine implements an OODA (Observe, Orient, Decide, Act) loop that orchestrates all homelab components into a self-managing system. It leverages existing infrastructure:
- Context Framework - Historical patterns and preferences
- Auto-Remediation - Playbook-based fixes
- Predictive Analytics - Resource exhaustion forecasting
- Backup Integration - Snapshot-protected operations
- Homelab Intelligence - Health scoring and diagnostics
Philosophy: Proactive over reactive. Predict and prevent rather than detect and fix.
When to Use
Always use for:
- Reviewing current autonomous operations state
- Understanding what actions the system would take
- Investigating why an action was/wasn't taken
- Adjusting autonomy settings
- Querying the decision log
Triggers:
- User asks "what would the system do about..."
- User asks about autonomous actions taken
- User wants to review pending actions
- User asks to adjust autonomy level
Architecture: OODA Loop
┌─────────────────────────────────────────────────────────────────────┐
│ OBSERVE │
│ • Predictive analytics (resource exhaustion forecasts) │
│ • Health scoring (homelab-intel.sh) │
│ • Drift detection (check-drift.sh) │
│ • Backup health verification │
└────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ORIENT │
│ • Historical patterns (issue-history.json) │
│ • User preferences (preferences.yml) │
│ • Service overrides (traefik: no auto-restart) │
│ • Current state (autonomous-state.json) │
└────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ DECIDE │
│ Confidence = (prediction × 0.30) + (historical × 0.30) │
│ + (impact × 0.20) + (rollback × 0.20) │
│ │
│ ┌────────────────────┬──────────────────────────────────┐ │
│ │ Confidence + Risk │ Action │ │
│ ├────────────────────┼──────────────────────────────────┤ │
│ │ >90% + Low Risk │ AUTO-EXECUTE │ │
│ │ >80% + Med Risk │ NOTIFY + EXECUTE │ │
│ │ >70% + Any Risk │ QUEUE (for approval) │ │
│ │ <70% │ ALERT ONLY │ │
│ └────────────────────┴──────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ACT │
│ 1. Create pre-action snapshot (BTRFS) │
│ 2. Execute via appropriate skill/playbook │
│ 3. Validate outcome (health check) │
│ 4. Auto-rollback if validation fails │
│ 5. Log decision and outcome │
└─────────────────────────────────────────────────────────────────────┘
Core Scripts
autonomous-check.sh
Evaluates current system state and outputs recommended actions:
# Run assessment (outputs JSON)
~/containers/scripts/autonomous-check.sh
# Verbose output
~/containers/scripts/autonomous-check.sh --verbose
# Output to file
~/containers/scripts/autonomous-check.sh --output /tmp/assessment.json
Output structure:
{
"timestamp": "2025-11-29T22:00:00+01:00",
"health_score": 94,
"observations": [...],
"recommended_actions": [
{
"id": "action-001",
"type": "disk-cleanup",
"reason": "Disk predicted 90% full in 8 days",
"confidence": 0.92,
"risk": "low",
"decision": "auto-execute"
}
],
"pending_queue": []
}
autonomous-execute.sh
Executes approved actions from the queue:
# Execute pending low-risk actions
~/containers/scripts/autonomous-execute.sh
# Execute specific action
~/containers/scripts/autonomous-execute.sh --action-id action-001
# Dry run (simulate only)
~/containers/scripts/autonomous-execute.sh --dry-run
Emergency Controls
# Emergency stop all autonomous operations
~/containers/scripts/autonomous-execute.sh --stop
# Pause (stop acting, keep observing)
~/containers/scripts/autonomous-execute.sh --pause
# Resume operations
~/containers/scripts/autonomous-execute.sh --resume
# View current state
~/containers/scripts/autonomous-execute.sh --status
State Files
autonomous-state.json
Location: ~/.claude/context/autonomous-state.json
Tracks operational state:
{
"enabled": true,
"mode": "active",
"paused": false,
"circuit_breaker": {
"triggered": false,
"consecutive_failures": 0,
"last_failure": null
},
"last_check": "2025-11-29T22:00:00+01:00",
"pending_actions": [],
"cooldowns": {
"jellyfin.restart": "2025-11-29T20:00:00+01:00"
},
"statistics": {
"actions_24h": 2,
"actions_7d": 8,
"success_rate": 0.95
}
}
decision-log.json
Location: ~/.claude/context/decision-log.json
Audit trail of all decisions:
{
"decisions": [
{
"id": "decision-001",
"timestamp": "2025-11-29T03:00:00+01:00",
"action_type": "disk-cleanup",
"trigger": "prediction: disk 90% in 9 days",
"confidence": 0.92,
"risk": "low",
"decision": "auto-execute",
"outcome": "success",
"details": "Freed 15 GB, new prediction: 21 days",
"duration_seconds": 45
}
]
}
Integration Points
With Existing Playbooks
The engine routes actions to existing remediation playbooks:
| Action Type | Playbook | Risk |
|---|---|---|
| disk-cleanup | .claude/remediation/playbooks/disk-cleanup.yml |
Low |
| service-restart | .claude/remediation/playbooks/service-restart.yml |
Low |
| drift-reconciliation | .claude/remediation/playbooks/drift-reconciliation.yml |
Medium |
| resource-pressure | .claude/remediation/playbooks/resource-pressure.yml |
Medium |
With Preferences
Respects all settings in ~/.claude/context/preferences.yml:
risk_tolerance- Affects confidence thresholdsservice_overrides- Per-service restrictions (traefik: no auto-restart)safety.max_restarts_per_hour- Rate limitingsafety.restart_cooldown_seconds- Cooldown periods
With Backups
Every action is protected:
- Pre-action snapshot created via
btrfs-snapshot-backup.sh - Snapshot tagged:
autonomous-<action>-<timestamp> - Auto-rollback if validation fails
- Snapshot retained for 7 days minimum
Decision Matrix
Risk Classification
| Action | Data Risk | Availability Risk | Overall |
|---|---|---|---|
| Disk cleanup | None (snapshot) | None | LOW |
| Service restart | None (stateless) | 2-5s downtime | LOW |
| Config reconciliation | None (snapshot) | Potential failure | MEDIUM |
| Multi-service change | Medium | Cascading risk | HIGH |
Confidence Calculation
confidence = (
prediction_confidence × 0.30 + # How sure about the problem?
historical_success × 0.30 + # Has this fix worked before?
impact_certainty × 0.20 + # Are we sure about outcome?
rollback_feasibility × 0.20 # Can we undo if it fails?
)
Example: Disk cleanup
prediction_confidence = 0.89 # Linear trend, high confidence
historical_success = 1.00 # 12/12 successful cleanups
impact_certainty = 0.95 # Known outcome
rollback_feasibility = 1.00 # Instant snapshot restore
confidence = 0.89×0.3 + 1.0×0.3 + 0.95×0.2 + 1.0×0.2
= 0.267 + 0.300 + 0.190 + 0.200
= 0.957 (95.7%)
Decision: AUTO-EXECUTE (>90% + low risk)
Safety Controls
Circuit Breaker
Automatically pauses after 3 consecutive failures:
{
"circuit_breaker": {
"threshold": 3,
"triggered": true,
"consecutive_failures": 3,
"last_failure": "2025-11-29T04:00:00+01:00",
"reason": "service-restart failed 3 times"
}
}
Recovery: Manual reset via --resume or automatically after 24 hours.
Cooldowns
Prevent action storms:
- Service restart: 5 minute cooldown between restarts
- Disk cleanup: 1 hour cooldown
- Drift reconciliation: 15 minute cooldown
Service Overrides
Critical services have special handling:
# From preferences.yml
service_overrides:
traefik:
auto_restart: false # Never auto-restart
requires_confirmation: true
authelia:
auto_restart: false # SSO is critical
requires_confirmation: true
prometheus:
auto_restart: true # Can auto-restart
restart_timeout_seconds: 90
Systemd Integration
Timer: autonomous-check.timer
Runs daily assessment at 06:00 (after backups):
[Timer]
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=300
Persistent=true
Service: autonomous-execute.service
Executes approved actions after check completes.
Query Interface
Via Claude Code
"What autonomous actions were taken this week?"
"Show me pending actions"
"Why didn't the system restart Jellyfin?"
"What's the current autonomous operations status?"
Via Scripts
# Query decision log
~/containers/.claude/context/scripts/query-decisions.sh --last 7d
# Query pending actions
~/containers/.claude/context/scripts/query-decisions.sh --pending
# Query by action type
~/containers/.claude/context/scripts/query-decisions.sh --type disk-cleanup
Reporting
Weekly Intelligence Report Integration
The weekly report includes autonomous operations section:
## Autonomous Operations (Last 7 Days)
Actions Taken: 8
- disk-cleanup: 2 (100% success)
- service-restart: 5 (100% success)
- drift-reconciliation: 1 (100% success)
Success Rate: 100%
Average Confidence: 91%
Pending Actions: 0
Circuit Breaker: Not triggered
Discord Notifications
- Action execution: Notify on execute (configurable)
- Failures: Always notify
- Weekly summary: Included in intelligence report
Troubleshooting
Check Current State
# View state file
cat ~/.claude/context/autonomous-state.json | jq '.'
# View recent decisions
cat ~/.claude/context/decision-log.json | jq '.decisions[-5:]'
# Check if paused
cat ~/.claude/context/autonomous-state.json | jq '.paused'
Circuit Breaker Triggered
# Check why
cat ~/.claude/context/autonomous-state.json | jq '.circuit_breaker'
# View failing action
cat ~/.claude/context/decision-log.json | jq '.decisions | map(select(.outcome == "failure")) | .[-3:]'
# Reset manually
~/containers/scripts/autonomous-execute.sh --resume
Action Not Executing
Check in order:
- Is autonomous operations enabled? (
enabled: true) - Is it paused? (
paused: false) - Is circuit breaker triggered?
- Is the service in overrides with
auto_restart: false? - Is there a cooldown active?
- Is confidence below threshold?
Success Metrics
- Proactive ratio: 80%+ issues prevented before impact
- Success rate: >95% of autonomous actions succeed
- MTTR reduction: Minutes instead of hours
- Manual interventions: <10% of issues require human action
This skill orchestrates all homelab components into a self-managing system.