| name | orchestration-coordination-framework |
| description | Production-scale multi-agent coordination, task orchestration, and workflow automation. Use for distributed system orchestration, agent communication protocols, DAG workflows, state machines, error handling, resource allocation, load balancing, and observability. Covers Apache Airflow, Temporal, Prefect, Celery, Step Functions, and orchestration patterns. |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, WebFetch |
Orchestration Coordination Framework
Purpose
Production-scale AI development requires sophisticated orchestration and coordination. This Skill provides comprehensive orchestration capabilities for:
- Multi-Agent Coordination - Coordinate multiple AI agents working on complex tasks
- Task Decomposition - Break down complex objectives into manageable subtasks
- Workflow Orchestration - DAGs, state machines, event-driven patterns
- Communication Protocols - Agent-to-agent messaging, pub/sub, queues
- Error Handling - Retry logic, circuit breakers, fallback strategies
- Resource Management - Load balancing, rate limiting, concurrency control
- Observability - Monitoring, logging, tracing, metrics for distributed systems
- Framework Integration - Apache Airflow, Temporal, Prefect, Celery, Step Functions
When to Use This Skill
Use orchestration coordination for:
- Orchestrating multiple agents or services working together
- Complex multi-step workflows requiring coordination
- Distributed task execution with dependencies
- Event-driven architectures and reactive systems
- Building CI/CD pipelines with complex dependencies
- Microservices coordination and saga patterns
- Data pipeline orchestration (ETL/ELT)
- Long-running workflows with state management
- Fault-tolerant distributed systems
- Resource allocation across multiple workers
- Implementing retries and error recovery strategies
- Monitoring and observability for distributed systems
Quick Start
Basic Multi-Agent Orchestration
Here's a minimal example:
    import asyncio
    from typing import List, Dict, Any


    class SimpleOrchestrator:
        def __init__(self):
            self.agents = {}

        def register_agent(self, agent_id: str, agent_type: str):
            self.agents[agent_id] = {"type": agent_type, "busy": False}

        async def execute_task(self, agent_id: str, task: Dict[str, Any]) -> Any:
            print(f"Agent {agent_id} executing: {task['name']}")
            await asyncio.sleep(1)
            return {"status": "completed", "result": f"Result of {task['name']}"}

        async def orchestrate(self, tasks: List[Dict[str, Any]]) -> List[Any]:
            results = await asyncio.gather(*[
                self.execute_task(f"agent_{i}", task)
                for i, task in enumerate(tasks)
            ])
            return results


    # Usage
    async def main():
        orchestrator = SimpleOrchestrator()
        orchestrator.register_agent("agent_0", "analyzer")
        orchestrator.register_agent("agent_1", "security")
        tasks = [
            {"name": "analyze_code", "type": "analysis"},
            {"name": "scan_security", "type": "security"}
        ]
        results = await orchestrator.orchestrate(tasks)
        print(f"Results: {results}")


    asyncio.run(main())
Core Orchestration Concepts
1. DAG (Directed Acyclic Graph) Pattern
Tasks execute based on dependencies, enabling parallel execution where possible.
    Start
      │
      ├──► Task A ──┐
      │             │
      └──► Task B ──┼──► Task D ──► Task F ──► End
                    │                         │
                    └──► Task C ──► Task E ───┘
Use when: You have tasks with clear dependencies and want parallel execution.
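A minimal sketch of dependency-driven execution, assuming Python 3.9+ so that the standard-library `graphlib.TopologicalSorter` is available. The names `run_dag` and `run_task` are illustrative and not part of this skill's API. Each task starts as soon as everything it depends on has finished, so independent tasks overlap.

```python
import asyncio
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Awaitable, Callable, Dict, List, Set


async def run_dag(
    dependencies: Dict[str, Set[str]],
    run_task: Callable[[str], Awaitable[None]],
) -> List[str]:
    """Run every task once its dependencies have finished.

    `dependencies` maps a task name to the set of tasks it waits on.
    Returns task names in completion order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished: List[str] = []
    running: Dict[asyncio.Task, str] = {}

    while sorter.is_active():
        # Start every task whose dependencies are all done.
        for name in sorter.get_ready():
            running[asyncio.create_task(run_task(name))] = name
        # Wait for at least one running task, then unlock its dependents.
        done, _ = await asyncio.wait(
            list(running), return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            name = running.pop(task)
            task.result()  # re-raise any exception from the task
            sorter.done(name)
            finished.append(name)
    return finished
```

Under one reading of the diagram above, the input would be `{"C": {"A", "B"}, "D": {"A", "B"}, "E": {"C"}, "F": {"D"}}`. Tasks A and B have no entry of their own because the sorter treats any name that appears only as a dependency as a root.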
2. State Machine Pattern
Workflows transition through defined states with explicit transitions.
         ┌─────────┐
         │  IDLE   │
         └────┬────┘
              │ start()
         ┌────▼────┐
         │ RUNNING │◄──────┐
         └────┬────┘       │
              │            │ retry()
        ┌─────┴─────┐      │
        │           │      │
    ┌───▼───┐   ┌───▼───┐  │
    │SUCCESS│   │FAILURE├──┘
    └───────┘   └───┬───┘
                    │ max_retries
                ┌───▼───┐
                │ ERROR │
                └───────┘
Use when: You need explicit state tracking and complex transition logic.
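One way to encode the diagram above is a transition table plus a small class that rejects any move the table does not list. This is a sketch under assumptions: `start` and `retry` come from the diagram, while the event names `succeed`, `fail`, and `give_up` and the class `WorkflowStateMachine` are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


# Allowed transitions: event -> {from_state: to_state}
TRANSITIONS = {
    "start": {State.IDLE: State.RUNNING},
    "succeed": {State.RUNNING: State.SUCCESS},
    "fail": {State.RUNNING: State.FAILURE},
    "retry": {State.FAILURE: State.RUNNING},
    "give_up": {State.FAILURE: State.ERROR},
}


class WorkflowStateMachine:
    def __init__(self, max_retries: int = 3):
        self.state = State.IDLE
        self.max_retries = max_retries
        self.retries = 0

    def fire(self, event: str) -> State:
        """Apply an event, rejecting transitions the table does not allow."""
        target = TRANSITIONS.get(event, {}).get(self.state)
        if target is None:
            raise ValueError(f"Cannot '{event}' from {self.state.name}")
        self.state = target
        return self.state

    def on_failure(self) -> State:
        """Record a failed attempt: retry until max_retries, then ERROR."""
        self.fire("fail")
        if self.retries < self.max_retries:
            self.retries += 1
            return self.fire("retry")
        return self.fire("give_up")
```

Keeping the transitions in data rather than in nested conditionals makes illegal moves, such as restarting a workflow that already reached ERROR, fail loudly instead of silently corrupting state.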
3. Event-Driven Pattern
Agents communicate via events and messages (pub/sub, message queues).
    ┌──────────┐      ┌─────────────┐      ┌──────────┐
    │ Agent A  │─────►│  Event Bus  │─────►│ Agent B  │
    └──────────┘      └─────────────┘      └──────────┘
                             │
                             ▼
                        ┌──────────┐
                        │ Agent C  │
                        └──────────┘
Use when: Agents need loose coupling and async communication.
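The loose coupling described above can be illustrated with an in-process event bus. This is a sketch only: a production system would typically put a broker (Redis, RabbitMQ, Kafka) where this dictionary sits, and the `EventBus` class and the topic name used in the usage note are hypothetical.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """In-process pub/sub: publishers and subscribers share only topic names."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> int:
        """Deliver payload to every subscriber concurrently; return the count."""
        handlers = list(self._subscribers.get(topic, []))
        # return_exceptions=True keeps one failing subscriber from
        # cancelling delivery to the others.
        await asyncio.gather(
            *(handler(payload) for handler in handlers),
            return_exceptions=True,
        )
        return len(handlers)
```

Agent A would call `await bus.publish("task.completed", {...})` without knowing that Agents B and C exist; each of them registers its own handler with `bus.subscribe("task.completed", handler)`.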
4. Orchestrator Architecture
    ┌─────────────────────────────────────────────────────┐
    │                    Orchestrator                     │
    │  ┌─────────────┐  ┌──────────────┐  ┌───────────┐   │
    │  │ Task Queue  │  │ State Manager│  │ Scheduler │   │
    │  └─────────────┘  └──────────────┘  └───────────┘   │
    └──────────────┬──────────────────────────────────────┘
                   │
           ┌───────┴───────┐
           │               │
       ┌───▼───┐       ┌───▼───┐       ┌────────┐
       │Agent 1│       │Agent 2│       │Agent N │
       │       │       │       │       │        │
       │Task A │       │Task B │       │Task N  │
       └───┬───┘       └───┬───┘       └───┬────┘
           │               │               │
           └───────┬───────┴───────────────┘
                   │
             ┌─────▼─────┐
             │  Results  │
             │Aggregator │
             └───────────┘
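A minimal sketch of the queue-and-workers shape in the diagram above, assuming an in-process `asyncio.Queue` stands in for the task queue and a plain list for the results aggregator. The state manager and scheduler are omitted, and `run_pool` and `handler` are illustrative names rather than part of this skill's API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_pool(
    tasks: List[Dict[str, Any]],
    handler: Callable[[str, Dict[str, Any]], Awaitable[Any]],
    num_agents: int = 3,
) -> List[Any]:
    """Feed tasks through a shared queue to num_agents workers.

    Results are returned in the same order as the input tasks.
    """
    queue: asyncio.Queue = asyncio.Queue()
    for index, task in enumerate(tasks):
        queue.put_nowait((index, task))
    results: List[Any] = [None] * len(tasks)  # the "aggregator"

    async def agent(agent_id: str) -> None:
        while True:
            try:
                index, task = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left for this agent
            results[index] = await handler(agent_id, task)

    await asyncio.gather(*(agent(f"agent_{i}") for i in range(num_agents)))
    return results
```

Because agents pull from the queue instead of being assigned work up front, a slow task holds up only the agent running it while the others keep draining the queue.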
Critical Gotchas
1. Retry Storms
Problem: Failed tasks retry simultaneously, overwhelming systems.
Solution: Use exponential backoff with jitter, circuit breakers, and max retry limits.
See GOTCHAS.md for details.
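A sketch of exponential backoff with jitter and a retry limit, assuming the "full jitter" variant in which each delay is drawn uniformly between zero and the capped exponential bound. The function names and default values are illustrative and should be tuned per system.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
) -> T:
    """Call operation, retrying up to max_retries times on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted; surface the last error
            await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The randomness is what breaks up a retry storm: clients that failed at the same moment pick different delays, so their retries no longer arrive together.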
2. State Management Complexity
Problem: Losing track of workflow state across restarts.
Solution: Use durable execution platforms (Temporal), persist state externally, make tasks idempotent.
See GOTCHAS.md for details.
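To illustrate persisting state externally, here is a sketch that checkpoints step results to a JSON file so a restarted run skips completed steps. It assumes step results are JSON-serializable and that a local file is an acceptable stand-in for a database or a durable execution platform. `run_with_checkpoints` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoints(
    steps: List[str],
    run_step: Callable[[str], str],
    checkpoint: Path,
) -> Dict[str, str]:
    """Run steps in order, skipping any already recorded in the checkpoint."""
    state: Dict[str, str] = (
        json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    )
    for step in steps:
        if step in state:
            continue  # finished in a previous run
        state[step] = run_step(step)
        # Persist after every step so a crash loses at most the in-flight step.
        # Writing to a temp file and renaming avoids a half-written checkpoint.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(checkpoint)
    return state
```

A step that crashes after its side effect but before the checkpoint write will run again on restart, which is why the tasks themselves still need to be idempotent.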
3. Serialization Issues
Problem: Large data passed between tasks causes memory issues.
Solution: Pass references (S3 URLs) not data, use streaming, implement chunking.
See GOTCHAS.md for details.
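A sketch of passing a reference instead of the payload itself. A local directory stands in for object storage such as S3, and `put_payload` and `get_payload` are hypothetical helpers. The message that travels between tasks contains only a short key.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def put_payload(store_dir: Path, payload: Any) -> Dict[str, str]:
    """Write the payload to the store and return a small reference message."""
    data = json.dumps(payload, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()  # content-addressed key
    (store_dir / key).write_bytes(data)
    return {"ref": key}


def get_payload(store_dir: Path, message: Dict[str, str]) -> Any:
    """Resolve a reference message back into the original payload."""
    return json.loads((store_dir / message["ref"]).read_bytes())
```

The producing task calls `put_payload` and hands the returned dictionary to the orchestrator; only the consuming task loads the full data, so the orchestrator and the queue never hold it in memory.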
For all common gotchas and solutions, see GOTCHAS.md.
Orchestration Patterns
This framework provides 7 production-ready patterns:
- Multi-Agent Task Decomposition - Break complex objectives into subtasks
- Airflow DAG Coordination - Orchestrate agents with Apache Airflow
- Temporal Durable Execution - Long-running workflows with automatic retries
- Event-Driven Coordination - Pub/Sub messaging for loose coupling
- Circuit Breaker Pattern - Prevent cascading failures
- Resource Pool with Load Balancing - Manage agent resources efficiently
- Distributed Tracing - Monitor and debug distributed workflows
See PATTERNS.md for detailed pattern descriptions and when to use each.
Working Examples
All patterns include complete, runnable code examples:
- Multi-Agent Orchestrator - Coordinate specialized agents with dependency resolution
- Airflow DAG - Production DAG for agent coordination with XCom
- Temporal Workflow - Durable workflow with activities and retry policies
- Event-Driven System - Redis-based pub/sub for agent communication
- Circuit Breaker - Fault-tolerant agent calls with state management
- Resource Pool - Load-balanced agent pool with multiple strategies
- Distributed Tracing - OpenTelemetry-style tracing for workflows
See EXAMPLES.md for complete code examples you can copy and adapt.
Framework Selection
Choose the right orchestration framework for your needs:
| Framework | Best For | Key Strength | Learn More |
|---|---|---|---|
| Airflow | Batch ETL, scheduled jobs | Rich UI, Python DAGs | KNOWLEDGE.md |
| Temporal | Long-running workflows | Durable execution | KNOWLEDGE.md |
| Prefect | Data science, dynamic flows | Pythonic API | KNOWLEDGE.md |
| Celery | Distributed task queues | Real-time tasks | KNOWLEDGE.md |
| Step Functions | AWS serverless workflows | No infrastructure | KNOWLEDGE.md |
See KNOWLEDGE.md for detailed framework comparisons, concepts, and learning resources.
Best Practices
DO's
- Design for Idempotency - Tasks should be safely retryable: running one twice must have the same effect as running it once
- Use Correlation IDs - Track requests across distributed systems
- Implement Timeouts - Every operation should have a timeout
- Monitor Everything - Metrics, logs, traces for all components
- Implement Circuit Breakers - Prevent cascading failures
- Use Exponential Backoff - Space out retries to avoid thundering herd
- Validate DAGs - Check for cycles before execution
- Version Workflows - Track workflow changes over time
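The circuit breaker item in the list above can be sketched as follows. This is a simplified, single-threaded illustration. Only the method name `get_state` mirrors the Quick Reference in this file. The state strings, thresholds, the `clock` parameter, and `CircuitOpenError` are assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def get_state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # the next call is let through as a trial
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.get_state() == "open":
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            tripped_before = self._opened_at is not None
            if tripped_before or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of queueing behind a dependency that is already struggling, which is how the breaker stops one failure from cascading.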
DON'Ts
- Don't Pass Large Data - Use references (S3 URLs, database IDs) instead
- Don't Ignore Partial Failures - Handle them explicitly
- Don't Use Distributed Transactions - Use the saga pattern instead
- Don't Chain Synchronously - Use async/parallel execution where possible
- Don't Skip Health Checks - Monitor agent health continuously
- Don't Hardcode Timeouts - Make them configurable per task type
- Don't Trust Clocks - Use logical ordering, not wall clock time
- Don't Forget Cleanup - Clean up zombie workflows and stale state
See REFERENCE.md for the extended best practices guide.
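As an illustration of the saga pattern recommended in place of distributed transactions, the sketch below pairs each step with a compensating action and undoes completed steps in reverse order when a later step fails. `run_saga` and the step names in the usage note are hypothetical. Compensations are assumed to be idempotent and not to fail.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns a log of what happened so callers can see partial failures.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

For example, a saga with steps `reserve` and `charge` in which `charge` raises would yield `["done:reserve", "failed:charge", "compensated:reserve"]`. The failure is handled explicitly rather than left as a half-finished workflow.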
Production Deployment
Essential Checklist
Before deploying orchestrated workflows to production:
- Define clear task boundaries and responsibilities
- Implement comprehensive error handling and retries
- Set up distributed tracing (OpenTelemetry, Jaeger)
- Configure monitoring and alerting (Prometheus, Grafana)
- Implement circuit breakers for external dependencies
- Use message queues for async communication (RabbitMQ, Kafka)
- Set up health checks for all agents
- Implement graceful shutdown handling
- Configure resource limits and quotas
- Set up log aggregation (ELK, Loki)
See REFERENCE.md for the complete deployment guide.
Documentation Structure
This skill is organized for progressive disclosure:
- SKILL.md (this file) - Quick start, core concepts, overview
- KNOWLEDGE.md - Frameworks, theory, distributed systems concepts
- PATTERNS.md - Implementation patterns and when to use them
- EXAMPLES.md - Complete working code examples
- GOTCHAS.md - Common pitfalls and troubleshooting
- REFERENCE.md - API docs, configuration, production deployment
Related Skills
- multi-agent-coordination-framework - For advanced agent architectures
- mcp-integration-toolkit - For agent communication via MCP protocol
- git-mastery-suite - For orchestrating Git-based workflows
- security-scanning-suite - For security orchestration and SAST/DAST coordination
Learning Path
- Start Here: Read this SKILL.md for quick start and core concepts
- Choose Framework: Review KNOWLEDGE.md to select your orchestration framework
- Pick Pattern: Browse PATTERNS.md to find the pattern matching your use case
- Copy Code: Use EXAMPLES.md to get working code
- Avoid Pitfalls: Check GOTCHAS.md for common mistakes
- Go to Production: Follow the deployment guide in REFERENCE.md
Quick Reference
Common Operations
# Register agents
orchestrator.register_agent(agent_id, agent_type, capacity)
# Submit task
result = await orchestrator.execute_task(task)
# Execute workflow
results = await orchestrator.execute_workflow(tasks)
# Check circuit breaker status
status = circuit_breaker.get_state()
# Monitor resource pool
status = resource_pool.get_pool_status()
Key Metrics to Monitor
- Task queue depth
- Agent utilization (%)
- Task success rate (%)
- Average task duration
- Circuit breaker state
- Error rate by agent type
See REFERENCE.md for the complete monitoring guide.
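A sketch of how a few of the metrics above could be accumulated in process before being exported to a monitoring system. The `TaskMetrics` class and its field names are hypothetical. It tracks error counts per agent type rather than a rate, and it leaves out queue depth and circuit breaker state.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List


@dataclass
class TaskMetrics:
    """Accumulates per-task outcomes; all names here are illustrative."""

    durations: List[float] = field(default_factory=list)
    successes: int = 0
    errors_by_agent_type: DefaultDict[str, int] = field(
        default_factory=lambda: defaultdict(int)
    )

    def record(self, agent_type: str, duration_s: float, ok: bool) -> None:
        self.durations.append(duration_s)
        if ok:
            self.successes += 1
        else:
            self.errors_by_agent_type[agent_type] += 1

    def success_rate(self) -> float:
        """Percentage of recorded tasks that succeeded (0.0 if none recorded)."""
        total = len(self.durations)
        return 100.0 * self.successes / total if total else 0.0

    def average_duration(self) -> float:
        """Mean task duration in seconds (0.0 if none recorded)."""
        total = len(self.durations)
        return sum(self.durations) / total if total else 0.0
```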