| name | supervisor-agent |
| description | LLM-powered intelligent supervisor that orchestrates Artemis pipeline stages with context-aware decision making, failure recovery, and continuous learning. |
Supervisor Agent
Purpose
The Supervisor Agent is the brain of the Artemis pipeline, using LLM intelligence to:
- Orchestrate Workflow - Coordinate multi-stage pipeline execution
- Make Intelligent Decisions - Context-aware stage transitions and parameter tuning
- Recover from Failures - Root cause analysis and retry strategy selection
- Learn from History - Adapt based on past execution patterns
- Monitor Execution - Real-time event tracking via Observer Pattern
When to Use This Skill
The Supervisor Agent is always active as the central coordinator:
- Pipeline Orchestration - Every Artemis pipeline execution
- Stage Transitions - Deciding next stage based on current state
- Failure Recovery - When stages fail and need intelligent retry
- Configuration Tuning - Adjusting parameters based on context
- Anomaly Detection - Identifying unusual execution patterns
Responsibilities
1. Workflow Orchestration
Coordinates the entire Artemis pipeline through multiple stages:
Requirements → Sprint Planning → Development → Code Review →
UI/UX Review → Arbitration → Integration → Retrospective
Capabilities:
- Dynamic Stage Ordering - Adjusts stage sequence based on dependencies
- Parallel Execution - Runs independent stages concurrently
- Stage Skipping - Intelligently bypasses unnecessary stages
- Checkpoint Integration - Resumes from interruptions
- State Machine Coordination - Manages state transitions
Example Decision:
Context: Sprint planning completed, but requirements were unclear
Decision: Return to the Requirements stage before starting Development
Reasoning: Insufficient context for developers to proceed
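The skip, resume, and revisit behavior described above can be sketched as a small planning function. This is a minimal illustration, not the Artemis implementation: the function name `plan_remaining_stages`, the `uiux_review` and `integration` identifiers, and the rule that revisiting a stage re-runs everything after it are all assumptions.

```python
from typing import Iterable, List, Optional

# Stage order taken from the pipeline description above.
# Identifier spellings are assumed for illustration.
PIPELINE_STAGES = [
    "requirements", "sprint_planning", "development", "code_review",
    "uiux_review", "arbitration", "integration", "retrospective",
]


def plan_remaining_stages(
    completed: Iterable[str],
    skip: Iterable[str] = (),
    revisit: Optional[str] = None,
) -> List[str]:
    """Return the stages still to run, in pipeline order.

    completed: stages recorded in the last checkpoint.
    skip:      stages the supervisor decided to bypass.
    revisit:   an earlier stage to re-run (for example "requirements").
    """
    done = set(completed)
    skipped = set(skip)
    if revisit is not None:
        # Assumption: re-running a stage invalidates it and every later stage.
        start = PIPELINE_STAGES.index(revisit)
        done -= set(PIPELINE_STAGES[start:])
    return [s for s in PIPELINE_STAGES if s not in done and s not in skipped]
```

For the example decision above, calling `plan_remaining_stages(["requirements", "sprint_planning"], revisit="requirements")` puts `requirements` back at the front of the plan.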
2. Intelligent Decision Making
Uses an LLM to make context-aware decisions:
Decision Types:
- Stage Selection - Which stage to execute next
- Parameter Tuning - Adjusting timeouts, retries, thresholds
- Resource Allocation - Number of agents, LLM model selection
- Quality Gates - Pass/fail decisions based on context
- Escalation - When to involve human intervention
Example LLM Query:
Input: "Code review found 3 high-severity security issues.
Developer A has 85% test coverage.
Developer B has 70% test coverage but cleaner architecture.
Historical data shows Developer A's code had 2 production bugs.
Should we proceed to arbitration or request fixes?"
LLM Response: "Request fixes from both developers. Security issues
are critical and must be addressed before arbitration.
Provide 2-hour deadline for fixes, then re-review."
3. Failure Recovery
Performs root cause analysis and intelligent retry:
Failure Handling:
- Root Cause Analysis - Uses an LLM to diagnose failures
- Retry Strategy Selection - Chooses optimal retry approach
- Fallback Orchestration - Alternative paths when primary fails
- Circuit Breaking - Stops retries when futile
- Human Escalation - Knows when to ask for help
Example Recovery:
Failure: LLM API rate limit exceeded during Planning Poker
Root Cause: Too many concurrent API calls
Retry Strategy:
1. Switch to exponential backoff
2. Reduce parallelism from 3 to 1
3. Use cheaper model (gpt-4o-mini) for retries
4. If still failing after 3 attempts, switch to Anthropic
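The retry strategy above (exponential backoff, then a provider switch once attempts are exhausted) could look roughly like the following. This is a hedged sketch: `retry_with_backoff` and its parameters are hypothetical names, and the real supervisor may structure recovery differently.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    fallback: Optional[Callable[[], T]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to max_attempts times, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # real code should catch only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits of 1x, 2x, 4x, ...
    if fallback is not None:
        # Primary path exhausted: hand over to the alternative provider.
        return fallback()
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays. Reducing parallelism and swapping models would happen inside `call` and `fallback`.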
4. Learning and Adaptation
Learns from execution history to improve future decisions:
Learning Mechanisms:
- Success Pattern Recognition - Identifies configurations that work
- Failure Pattern Recognition - Avoids known failure modes
- Performance Optimization - Tunes for faster execution
- Cost Optimization - Balances quality and token usage
- Anomaly Detection - Flags unusual behavior
Example Learning:
Observation: Last 5 sprints, Developer A's code had 0 critical issues
but Developer B's code failed security review 3 times.
Learned Pattern: Prioritize Developer A's implementation in arbitration
when security is critical. Allocate more review time
to Developer B.
Action: Adjust arbitration weights and review SLA accordingly.
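One simple way to turn the observation above into arbitration weights is to compute each developer's security failure rate and weight by pass rate. The record shape (`developer`, `security_passed`) and both function names are assumptions made for this sketch; the source does not specify how weights are derived.

```python
from collections import defaultdict
from typing import Dict, List


def security_failure_rates(history: List[dict]) -> Dict[str, float]:
    """Fraction of reviews each developer failed on security grounds."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for record in history:
        dev = record["developer"]
        totals[dev] += 1
        if not record["security_passed"]:
            failures[dev] += 1
    return {dev: failures[dev] / totals[dev] for dev in totals}


def arbitration_weights(rates: Dict[str, float]) -> Dict[str, float]:
    """Weights proportional to pass rate, normalized to sum to 1."""
    pass_rates = {dev: 1.0 - rate for dev, rate in rates.items()}
    total = sum(pass_rates.values())
    if total == 0:
        # Everyone failed every review: fall back to equal weights.
        return {dev: 1.0 / len(pass_rates) for dev in pass_rates}
    return {dev: p / total for dev, p in pass_rates.items()}
```

With five sprints in which Developer A never fails and Developer B fails three times, the failure rates are 0.0 and 0.6, so Developer A receives the larger weight.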
5. Observer Pattern Integration
Monitors real-time events from all stages:
Event Types:
- stage.started - Stage execution begins
- stage.completed - Stage finishes successfully
- stage.failed - Stage encounters error
- stage.progress - Incremental progress updates
- metric.recorded - Performance/cost metrics
Example Event Handling:
# Supervisor receives event
Event: stage.failed
Stage: code_review
Error: "Developer A: 5 critical security issues"
Context: {
"issues": ["SQL injection", "XSS", "hardcoded secret", ...],
"developer": "developer-a",
"score": 45
}
# Supervisor decision
Action:
1. Log failure to Knowledge Graph
2. Send detailed feedback to Developer A
3. Request fixes with 1-hour deadline
4. Schedule re-review automatically
5. Update Developer A's quality score
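The event handling above amounts to dispatching on the event type and queuing follow-up actions. The sketch below shows that shape with stand-in classes; `SketchObservable`, `SketchSupervisor`, `on_event`, and the action names are illustrative assumptions, not the actual Artemis classes.

```python
from typing import Callable, Dict, List, Tuple


class SketchSupervisor:
    """Collects the actions it would take instead of performing them."""

    def __init__(self) -> None:
        self.actions: List[Tuple[str, str]] = []
        self._handlers: Dict[str, Callable[[str, dict], None]] = {
            "stage.failed": self._on_failed,
            "stage.completed": self._on_completed,
        }

    def on_event(self, event_type: str, stage: str, data: dict) -> None:
        handler = self._handlers.get(event_type)
        if handler is not None:  # event types without a handler are ignored
            handler(stage, data)

    def _on_failed(self, stage: str, data: dict) -> None:
        developer = data.get("developer", "unknown")
        self.actions.append(("log_failure", stage))
        self.actions.append(("send_feedback", developer))
        self.actions.append(("schedule_rereview", stage))

    def _on_completed(self, stage: str, data: dict) -> None:
        self.actions.append(("advance_from", stage))


class SketchObservable:
    """Fans each event out to every registered observer."""

    def __init__(self) -> None:
        self._observers: List[SketchSupervisor] = []

    def register_observer(self, observer: SketchSupervisor) -> None:
        self._observers.append(observer)

    def notify_event(self, event_type: str, stage: str, data: dict) -> None:
        for observer in self._observers:
            observer.on_event(event_type, stage, data)
```

Keeping the handler table in a dictionary makes it easy to add `stage.progress` or `metric.recorded` handlers without touching the dispatch logic.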
Integration with Pipeline
Placement in Pipeline
The Supervisor Agent oversees all stages:
           ┌─────────────────────────┐
           │    Supervisor Agent     │
           │  (Orchestrator Brain)   │
           └──────────┬──────────────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
  Requirements Sprint Planning Development
        │             │             │
        └─────────────┼─────────────┘
                      │
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
   Code Review   UI/UX Review  Arbitration
        │             │             │
        └─────────────┼─────────────┘
                      │
                      ▼
                Retrospective
Communication
Receives:
- Stage completion events (success/failure)
- Checkpoint state (for resume)
- Configuration overrides
- Human intervention requests
Sends:
- Stage execution commands
- Configuration parameters
- Failure notifications
- Learning insights to Knowledge Graph
Usage Examples
Standalone Usage
# The supervisor is typically invoked by the orchestrator
python3 artemis_orchestrator.py \
--card-id card-12345 \
--full \
--supervisor-enabled
Programmatic Usage
import logging
from supervisor_agent import SupervisorAgent
from artemis_state_machine import ArtemisStateMachine
from checkpoint_manager import CheckpointManager
# Any standard logger works here; the logger name is a placeholder
logger = logging.getLogger("artemis.supervisor")
# Initialize supervisor
supervisor = SupervisorAgent(
    llm_provider="openai",
    llm_model="gpt-4o",
    logger=logger,
)
# Decision: Should we skip a stage?
decision = supervisor.decide_stage_transition(
    current_stage="sprint_planning",
    context={
        "requirements_clarity": 0.3,
        "team_velocity": 15,
        "sprint_capacity": 20,
    },
)
# Decision: {
# "next_stage": "requirements",
# "reason": "Requirements clarity too low (30%). Need clarification before development.",
# "confidence": 0.9
# }
Failure Recovery Example
# Stage failed, ask supervisor for recovery strategy
failure_context = {
"stage": "code_review",
"error": "LLM API timeout",
"attempt": 1,
"max_attempts": 3
}
strategy = supervisor.plan_recovery(failure_context)
# Strategy: {
# "action": "retry",
# "delay": 60, # seconds
# "modifications": {
# "timeout": 300, # increase from 120s
# "model": "gpt-4o-mini" # use faster model
# },
# "fallback": "switch_to_anthropic"
# }
Learning from History
# After successful sprint, update learning
supervisor.record_execution_outcome(
sprint_id="sprint-2025-01",
outcome={
"success": True,
"duration": 3600, # seconds
"cost": 2.50, # USD
"quality_score": 85,
"stages_executed": ["requirements", "sprint_planning", ...],
"developer_a_score": 90,
"developer_b_score": 80
}
)
# Supervisor learns:
# - This configuration works well
# - Developer A consistently scores higher
# - Requirements stage reduces rework
Configuration
Environment Variables
# Enable supervisor (default: true)
ARTEMIS_SUPERVISOR_ENABLED=true
# LLM Provider for supervisor decisions
ARTEMIS_SUPERVISOR_LLM_PROVIDER=openai
ARTEMIS_SUPERVISOR_LLM_MODEL=gpt-4o
# Learning and adaptation
ARTEMIS_SUPERVISOR_LEARNING_ENABLED=true
ARTEMIS_SUPERVISOR_HISTORY_SIZE=100
# Failure recovery
ARTEMIS_SUPERVISOR_MAX_RETRIES=3
ARTEMIS_SUPERVISOR_RETRY_DELAY=60
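A loader for these variables might look like the sketch below. The `SupervisorSettings` dataclass and `load_settings` function are hypothetical. Only `ARTEMIS_SUPERVISOR_ENABLED` has a documented default, so the other defaults simply reuse the values shown above as an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

PREFIX = "ARTEMIS_SUPERVISOR_"


@dataclass(frozen=True)
class SupervisorSettings:
    enabled: bool
    llm_provider: str
    llm_model: str
    learning_enabled: bool
    history_size: int
    max_retries: int
    retry_delay: int  # seconds


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> SupervisorSettings:
    """Read ARTEMIS_SUPERVISOR_* variables, falling back to assumed defaults."""

    def get(name: str, default: str) -> str:
        return env.get(PREFIX + name, default)

    return SupervisorSettings(
        enabled=_as_bool(get("ENABLED", "true")),
        llm_provider=get("LLM_PROVIDER", "openai"),
        llm_model=get("LLM_MODEL", "gpt-4o"),
        learning_enabled=_as_bool(get("LEARNING_ENABLED", "true")),
        history_size=int(get("HISTORY_SIZE", "100")),
        max_retries=int(get("MAX_RETRIES", "3")),
        retry_delay=int(get("RETRY_DELAY", "60")),
    )
```

Accepting any mapping rather than reading `os.environ` directly lets tests pass a plain dictionary.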
Hydra Configuration
# conf/supervisor/default.yaml
supervisor:
  enabled: true
  llm:
    provider: openai
    model: gpt-4o
  learning:
    enabled: true
    history_size: 100
  recovery:
    max_retries: 3
    retry_delay: 60
    exponential_backoff: true
  decision_thresholds:
    requirements_clarity: 0.5
    code_quality_pass: 75
    security_score_pass: 80
Decision Framework
Decision Types and Criteria
| Decision | Criteria | Example |
|---|---|---|
| Skip Stage | Requirements clarity < 50%, Stage unnecessary | Skip UI/UX for backend-only changes |
| Request Fixes | Critical issues > 0, Score < 60 | Security vulnerabilities found |
| Proceed | All quality gates pass | Code review passed, tests green |
| Retry | Transient failure, Attempts < max | LLM API timeout |
| Escalate | Repeated failures, Human needed | Ambiguous requirements |
| Switch Provider | Provider failing, Fallback available | OpenAI down, use Anthropic |
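Several rows of the table can be expressed as deterministic rules that run before any LLM call. The sketch below is one possible reading: the `StageResult` fields and the `decide` function are hypothetical, the "Request Fixes" criteria are treated as either/or, and the rule ordering is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    failed: bool = False
    transient_failure: bool = False
    attempt: int = 1
    max_attempts: int = 3
    provider_failing: bool = False
    fallback_available: bool = False
    critical_issues: int = 0
    score: int = 100


def decide(result: StageResult) -> str:
    """Map a stage result to one of the decision types in the table."""
    if result.provider_failing and result.fallback_available:
        return "switch_provider"
    if result.failed:
        if result.transient_failure and result.attempt < result.max_attempts:
            return "retry"
        return "escalate"  # repeated or non-transient failure
    if result.critical_issues > 0 or result.score < 60:
        return "request_fixes"
    return "proceed"  # all quality gates passed
```

Cases that need judgment, such as whether a stage is unnecessary, are left to the LLM-backed decision path.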
LLM Query Templates
# Template 1: Stage Transition Decision
STAGE_TRANSITION_PROMPT = """
You are the Artemis pipeline supervisor. Decide the next stage.
Current State:
- Stage: {current_stage}
- Status: {status}
- Context: {context}
Available Next Stages: {available_stages}
Historical Data:
- Similar sprints: {similar_sprints}
- Average success rate: {success_rate}
Decide:
1. Which stage to execute next
2. Should any stages be skipped? Why?
3. What parameters should be adjusted?
4. Confidence level (0-1)
Respond with JSON: {{"next_stage": "...", "skip": [], "params": {{}}, "reason": "...", "confidence": 0.9}}
"""
# Template 2: Failure Recovery
FAILURE_RECOVERY_PROMPT = """
A stage has failed. Analyze and recommend recovery.
Failure Details:
- Stage: {stage}
- Error: {error}
- Attempt: {attempt}/{max_attempts}
- Duration before failure: {duration}s
Context:
- Previous similar failures: {similar_failures}
- Current system load: {load}
- Budget remaining: ${budget}
Recommend:
1. Should we retry? With what modifications?
2. Should we skip this stage?
3. Should we escalate to human?
4. Alternative approaches?
Respond with JSON: {{"action": "retry|skip|escalate", "modifications": {{}}, "reason": "..."}}
"""
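Templates like these rely on `str.format`, where doubled braces produce literal braces in the rendered prompt. The sketch below renders a shortened stand-in template and validates the JSON that comes back. `SHORT_TEMPLATE`, `render_prompt`, and `parse_decision` are illustrative names, and the validation rules are assumptions rather than documented behavior.

```python
import json
from typing import Collection

# Shortened stand-in for STAGE_TRANSITION_PROMPT, same escaping convention.
SHORT_TEMPLATE = """Stage: {current_stage}
Status: {status}
Respond with JSON: {{"next_stage": "...", "confidence": 0.9}}
"""


def render_prompt(**fields: str) -> str:
    # {{ and }} become literal braces; {name} fields are substituted.
    return SHORT_TEMPLATE.format(**fields)


def parse_decision(raw: str, allowed_stages: Collection[str]) -> dict:
    """Parse the LLM reply and reject anything outside the expected shape."""
    decision = json.loads(raw)
    if not isinstance(decision, dict):
        raise ValueError("decision must be a JSON object")
    if decision.get("next_stage") not in allowed_stages:
        raise ValueError(f"unexpected stage: {decision.get('next_stage')!r}")
    confidence = float(decision.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return decision
```

Validating the reply before acting on it guards against malformed JSON or a stage name the pipeline does not know.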
Integration with Other Agents
With Knowledge Graph
# Supervisor stores learning in KG
supervisor.knowledge_graph.add_entity(
entity_type="execution_pattern",
name=f"sprint-{sprint_id}-pattern",
properties={
"success": True,
"configuration": config_hash,
"quality_score": 85
}
)
# Query KG for similar situations
similar_patterns = supervisor.knowledge_graph.query("""
MATCH (p:execution_pattern {success: true})
WHERE p.context_similarity > 0.8
RETURN p.configuration, p.quality_score
ORDER BY p.quality_score DESC
LIMIT 5
""")
With Observer Pattern
from pipeline_observer import PipelineObserver
# Supervisor observes all events
observable = PipelineObserver()
observable.register_observer(supervisor)
# Stage emits event
observable.notify_event(
event_type="stage.failed",
stage="code_review",
data={"error": "timeout", "duration": 120}
)
# Supervisor handles event immediately
With Checkpoint Manager
# Save supervisor state
checkpoint_manager.save_checkpoint({
"stage": "code_review",
"supervisor_state": supervisor.get_state(),
"learning_history": supervisor.get_learning_history()
})
# Resume from checkpoint
supervisor.restore_state(checkpoint["supervisor_state"])
Performance Considerations
LLM Usage Optimization
The supervisor makes strategic decisions to minimize costs:
Query Optimization:
- Caching - Identical decisions cached for 1 hour
- Batching - Multiple decisions in single LLM call
- Model Selection - Use gpt-4o-mini for simple decisions
- Prompt Compression - Remove redundant context
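Caching identical decisions can be sketched as a small time-to-live cache keyed on the decision type and its context. `DecisionCache` is a hypothetical class written for illustration, and the one-hour default mirrors the figure quoted above.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class DecisionCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def _key(decision_type: str, context: dict) -> str:
        # sort_keys makes the key independent of dictionary ordering.
        payload = json.dumps([decision_type, context], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, decision_type: str, context: dict) -> Optional[Any]:
        key = self._key(decision_type, context)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[key]  # expired: drop and report a miss
            return None
        return value

    def put(self, decision_type: str, context: dict, value: Any) -> None:
        key = self._key(decision_type, context)
        self._entries[key] = (self._clock(), value)
```

The context must be JSON-serializable for the key to be computed. A miss returns `None`, at which point the caller would query the LLM and store the result with `put`.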
Typical Costs:
- Stage transition decision: ~500 tokens ($0.005-0.01)
- Failure recovery plan: ~800 tokens ($0.008-0.015)
- Learning query: ~1000 tokens ($0.01-0.02)
- Per sprint total: ~5000 tokens ($0.05-0.10)
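The figures above are consistent with a blended price of roughly $0.01 to $0.02 per 1,000 tokens. That price range is inferred from the numbers in the list, not a quoted provider rate. A small helper, with the hypothetical name `estimate_cost_usd`, makes the arithmetic explicit:

```python
from typing import Tuple

# Inferred from the list above (500 tokens costing $0.005-0.01); an assumption.
IMPLIED_USD_PER_1K_TOKENS: Tuple[float, float] = (0.01, 0.02)


def estimate_cost_usd(tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of a call that uses `tokens` tokens at the given blended price."""
    return tokens / 1000 * usd_per_1k_tokens


def cost_range_usd(tokens: int) -> Tuple[float, float]:
    low_price, high_price = IMPLIED_USD_PER_1K_TOKENS
    return (
        estimate_cost_usd(tokens, low_price),
        estimate_cost_usd(tokens, high_price),
    )


if __name__ == "__main__":
    for label, tokens in [
        ("stage transition", 500),
        ("failure recovery", 800),
        ("learning query", 1000),
        ("per sprint total", 5000),
    ]:
        low, high = cost_range_usd(tokens)
        print(f"{label}: ~{tokens} tokens, ${low:.3f}-${high:.3f}")
```

At these rates the 800-token recovery plan comes out slightly above the listed upper bound, which suggests the listed ranges are rounded.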
Decision Latency
| Decision Type | Latency | Can Cache? |
|---|---|---|
| Stage transition | 1-2s | ✅ Yes (5 min) |
| Failure recovery | 2-3s | ❌ No (context-specific) |
| Parameter tuning | 1-2s | ✅ Yes (per config) |
| Skip decision | 0.5-1s | ✅ Yes (per stage type) |
Best Practices
- Trust the Supervisor - It learns from history, so respect its decisions unless there is a clear reason to override
- Monitor Learning - Review learning patterns weekly for quality
- Tune Thresholds - Adjust decision thresholds based on team needs
- Use Knowledge Graph - Store learning for long-term memory
- Enable Observability - Log all decisions for audit trail
- Test Failure Recovery - Simulate failures to validate recovery logic
- Human Override - Always allow manual intervention when needed
Limitations
- LLM Dependent - Quality depends on LLM reasoning capability
- Context Window - Limited by the LLM's context size (128K tokens for gpt-4o)
- Cost - Adds $0.05-0.15 per sprint in LLM costs
- Latency - Adds 2-5s per decision
- Learning Delay - Requires 10+ executions for good patterns
- No Guarantees - LLM decisions are probabilistic, not deterministic
Future Enhancements
- Reinforcement Learning - Learn from explicit feedback (good/bad decisions)
- Multi-Agent Coordination - Coordinate multiple supervisors in parallel pipelines
- Predictive Analytics - Predict failures before they happen
- Auto-Scaling - Dynamically adjust resources based on load
- A/B Testing - Compare different supervision strategies
- Explainability - Better explanations for decisions (SHAP, LIME)
Metrics and KPIs
Track supervisor effectiveness:
| Metric | Target | Measurement |
|---|---|---|
| Decision Accuracy | >85% | % of good decisions (human-validated) |
| Recovery Success Rate | >90% | % of failures successfully recovered |
| Pipeline Efficiency | +20% | Time saved vs. no supervisor |
| Cost Optimization | -15% | Cost reduction through smart routing |
| Quality Improvement | +10% | Higher quality scores with supervision |
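The first two KPIs are simple percentages that can be computed from logged outcomes. The sketch below assumes each decision and each recovery attempt is logged as a boolean; the function names and the input shape are hypothetical.

```python
from typing import Dict, Sequence, Union


def percentage(successes: int, total: int) -> float:
    """Percentage of successes, or 0.0 when there is nothing to measure."""
    return 0.0 if total == 0 else 100.0 * successes / total


def supervisor_kpis(
    decisions_validated: Sequence[bool],
    recoveries_succeeded: Sequence[bool],
) -> Dict[str, Union[float, bool]]:
    """Compute decision accuracy and recovery success against the table targets."""
    accuracy = percentage(sum(decisions_validated), len(decisions_validated))
    recovery = percentage(sum(recoveries_succeeded), len(recoveries_succeeded))
    return {
        "decision_accuracy_pct": accuracy,
        "recovery_success_pct": recovery,
        # Targets from the table: >85% accuracy and >90% recovery.
        "meets_targets": accuracy > 85.0 and recovery > 90.0,
    }
```

The remaining metrics compare runs with and without the supervisor, so they need a baseline that this sketch does not model.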
Version: 1.0.0
Maintained By: Artemis Pipeline Team
Last Updated: October 24, 2025