name	orchestration-coordination-framework
description	Production-scale multi-agent coordination, task orchestration, and workflow automation. Use for distributed system orchestration, agent communication protocols, DAG workflows, state machines, error handling, resource allocation, load balancing, and observability. Covers Apache Airflow, Temporal, Prefect, Celery, Step Functions, and orchestration patterns.
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, WebFetch

Orchestration Coordination Framework

Purpose

Production-scale multi-agent systems need explicit orchestration: something has to decide which agent runs which task, in what order, and what happens when a step fails. This Skill covers:

Multi-Agent Coordination - Coordinate multiple AI agents working on complex tasks
Task Decomposition - Break down complex objectives into manageable subtasks
Workflow Orchestration - DAGs, state machines, event-driven patterns
Communication Protocols - Agent-to-agent messaging, pub/sub, queues
Error Handling - Retry logic, circuit breakers, fallback strategies
Resource Management - Load balancing, rate limiting, concurrency control
Observability - Monitoring, logging, tracing, metrics for distributed systems
Framework Integration - Apache Airflow, Temporal, Prefect, Celery, Step Functions

When to Use This Skill

Use orchestration coordination for:

Orchestrating multiple agents or services working together
Complex multi-step workflows requiring coordination
Distributed task execution with dependencies
Event-driven architectures and reactive systems
Building CI/CD pipelines with complex dependencies
Microservices coordination and saga patterns
Data pipeline orchestration (ETL/ELT)
Long-running workflows with state management
Fault-tolerant distributed systems
Resource allocation across multiple workers
Implementing retries and error recovery strategies
Monitoring and observability for distributed systems

Quick Start

Basic Multi-Agent Orchestration

A minimal example that runs one task per agent concurrently with asyncio:

import asyncio
from typing import Any, Dict, List


class SimpleOrchestrator:
    """Runs one task per registered agent, all concurrently."""

    def __init__(self) -> None:
        self.agents: Dict[str, Dict[str, Any]] = {}

    def register_agent(self, agent_id: str, agent_type: str) -> None:
        self.agents[agent_id] = {"type": agent_type, "busy": False}

    async def execute_task(self, agent_id: str, task: Dict[str, Any]) -> Dict[str, Any]:
        # Fail fast instead of silently running work on an unknown agent.
        if agent_id not in self.agents:
            raise KeyError(f"Agent {agent_id!r} is not registered")
        self.agents[agent_id]["busy"] = True
        try:
            print(f"Agent {agent_id} executing: {task['name']}")
            await asyncio.sleep(1)  # placeholder for real work
            return {"status": "completed", "result": f"Result of {task['name']}"}
        finally:
            self.agents[agent_id]["busy"] = False

    async def orchestrate(self, tasks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Task i goes to "agent_{i}", so agents must be registered under those IDs.
        return await asyncio.gather(*[
            self.execute_task(f"agent_{i}", task)
            for i, task in enumerate(tasks)
        ])


# Usage
async def main() -> None:
    orchestrator = SimpleOrchestrator()
    orchestrator.register_agent("agent_0", "analyzer")
    orchestrator.register_agent("agent_1", "security")

    tasks = [
        {"name": "analyze_code", "type": "analysis"},
        {"name": "scan_security", "type": "security"},
    ]

    results = await orchestrator.orchestrate(tasks)
    print(f"Results: {results}")


if __name__ == "__main__":
    asyncio.run(main())

Core Orchestration Concepts

1. DAG (Directed Acyclic Graph) Pattern

Tasks execute based on dependencies, enabling parallel execution where possible.

Start
  │
  ├──► Task A ──┐
  │            │
  └──► Task B ──┼──► Task D ──► Task F ──► End
       │        │
       └──► Task C ──► Task E ───┘

Use when: You have tasks with clear dependencies and want parallel execution.
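
As a rough illustration of dependency-driven execution, the sketch below uses the standard library's graphlib.TopologicalSorter (Python 3.9+) together with asyncio. The DEPENDENCIES map, run_task, and run_dag are hypothetical names chosen for this example, and the graph is an illustrative one rather than an exact transcription of the diagram above.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical graph: each key maps to the set of tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "A": set(),
    "B": set(),
    "C": {"B"},
    "D": {"A", "B"},
    "E": {"C"},
    "F": {"D", "E"},
}


async def run_task(name: str, log: List[str]) -> None:
    """Placeholder for real work; records the completion order."""
    await asyncio.sleep(0.01)
    log.append(name)


async def run_dag(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Run each task only after all of its dependencies have finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    log: List[str] = []
    while sorter.is_active():
        ready = sorter.get_ready()  # every task whose dependencies are done
        await asyncio.gather(*(run_task(name, log) for name in ready))
        sorter.done(*ready)  # unlocks the tasks that were waiting on this batch
    return log


if __name__ == "__main__":
    print(asyncio.run(run_dag(DEPENDENCIES)))
```

This version runs in waves: a slow task holds back the whole next batch. A scheduler that calls done() as each individual task finishes would extract more parallelism, at the cost of more bookkeeping.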

2. State Machine Pattern

Workflows transition through defined states with explicit transitions.

     ┌─────────┐
     │  IDLE   │
     └────┬────┘
          │ start()
     ┌────▼────┐
     │ RUNNING │◄──────┐
     └────┬────┘       │
          │            │ retry()
    ┌─────┴─────┐      │
    │           │      │
┌───▼───┐   ┌───▼───┐  │
│SUCCESS│   │FAILURE├──┘
└───────┘   └───┬───┘
                │ max_retries
            ┌───▼───┐
            │ ERROR │
            └───────┘

Use when: You need explicit state tracking and complex transition logic.
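
The diagram can be sketched as a small class with an explicit transition table. This is a minimal illustration, not the implementation from EXAMPLES.md; State, TRANSITIONS, and WorkflowStateMachine are names assumed for the example, and the action is assumed to signal success by returning True.

```python
from enum import Enum
from typing import Callable, Dict, Set


class State(Enum):
    IDLE = "idle"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


# Allowed transitions, mirroring the diagram above.
TRANSITIONS: Dict[State, Set[State]] = {
    State.IDLE: {State.RUNNING},
    State.RUNNING: {State.SUCCESS, State.FAILURE},
    State.FAILURE: {State.RUNNING, State.ERROR},
    State.SUCCESS: set(),
    State.ERROR: set(),
}


class WorkflowStateMachine:
    def __init__(self, max_retries: int = 3) -> None:
        self.state = State.IDLE
        self.max_retries = max_retries
        self.retries = 0

    def _transition(self, target: State) -> None:
        # Reject any move that the table does not explicitly allow.
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"Illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def run(self, action: Callable[[], bool]) -> State:
        """Call action until it returns True or the retry budget is spent."""
        self._transition(State.RUNNING)  # start()
        while True:
            self._transition(State.SUCCESS if action() else State.FAILURE)
            if self.state is State.SUCCESS:
                return self.state
            if self.retries >= self.max_retries:
                self._transition(State.ERROR)  # max_retries reached
                return self.state
            self.retries += 1
            self._transition(State.RUNNING)  # retry()
```

Keeping the allowed transitions in one table makes illegal moves (for example IDLE straight to SUCCESS) fail loudly instead of corrupting state.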

3. Event-Driven Pattern

Agents communicate via events and messages (pub/sub, message queues).

┌──────────┐      ┌─────────────┐      ┌──────────┐
│ Agent A  │─────►│  Event Bus  │─────►│ Agent B  │
└──────────┘      └─────────────┘      └──────────┘
                        │
                        ▼
                  ┌──────────┐
                  │ Agent C  │
                  └──────────┘

Use when: Agents need loose coupling and async communication.
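
To make the pattern concrete, here is a minimal in-process event bus built on asyncio. It is only a stand-in for a real broker such as Redis, RabbitMQ, or Kafka; EventBus, the topic name "code.analyzed", and the agent handlers are all assumptions made for this sketch.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """In-process pub/sub; publishers never reference subscribers directly."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> int:
        """Deliver payload to every subscriber of topic; return how many there were."""
        handlers = list(self._subscribers.get(topic, []))
        await asyncio.gather(*(handler(payload) for handler in handlers))
        return len(handlers)


async def demo() -> List[str]:
    bus = EventBus()
    received: List[str] = []

    async def agent_b(event: Any) -> None:
        received.append(f"B got {event['file']}")

    async def agent_c(event: Any) -> None:
        received.append(f"C got {event['file']}")

    # Agent A publishes without knowing who is listening.
    bus.subscribe("code.analyzed", agent_b)
    bus.subscribe("code.analyzed", agent_c)
    await bus.publish("code.analyzed", {"file": "main.py"})
    return received


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Unlike a real broker, this sketch offers no persistence or delivery guarantees, and an exception in one handler propagates to the publisher.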

4. Orchestrator Architecture

┌─────────────────────────────────────────────────────┐
│                  Orchestrator                       │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────┐ │
│  │ Task Queue  │  │ State Manager│  │ Scheduler │ │
│  └─────────────┘  └──────────────┘  └───────────┘ │
└──────────────┬──────────────────────────────────────┘
               │
       ┌───────┴───────┐
       │               │
   ┌───▼───┐       ┌───▼───┐       ┌────────┐
   │Agent 1│       │Agent 2│       │Agent N │
   │       │       │       │       │        │
   │Task A │       │Task B │       │Task N  │
   └───┬───┘       └───┬───┘       └───┬────┘
       │               │               │
       └───────┬───────┴───────────────┘
               │
         ┌─────▼─────┐
         │  Results  │
         │Aggregator │
         └───────────┘
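
The queue, agent, and aggregator boxes in the diagram can be approximated with an asyncio.Queue and a few worker coroutines. This is a simplified sketch: agent_worker and run_pool are hypothetical names, the state manager and scheduler are omitted, and a plain list stands in for the results aggregator.

```python
import asyncio
from typing import Any, Dict, List


async def agent_worker(name: str, queue: asyncio.Queue, results: List[Dict[str, Any]]) -> None:
    """Pull tasks from the shared queue until cancelled."""
    while True:
        task = await queue.get()
        try:
            await asyncio.sleep(0.01)  # placeholder for real work
            results.append({"agent": name, "task": task["name"], "status": "completed"})
        finally:
            queue.task_done()  # lets queue.join() know this item is finished


async def run_pool(tasks: List[Dict[str, Any]], num_agents: int = 2) -> List[Dict[str, Any]]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[Dict[str, Any]] = []  # stands in for the results aggregator
    for task in tasks:
        queue.put_nowait(task)

    workers = [
        asyncio.create_task(agent_worker(f"agent_{i}", queue, results))
        for i in range(num_agents)
    ]
    await queue.join()  # block until every queued task has been processed
    for worker in workers:
        worker.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pool([{"name": f"task_{i}"} for i in range(5)])))
```

Because agents pull from the queue rather than being assigned work up front, a slow agent simply takes fewer tasks.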

Critical Gotchas

1. Retry Storms

Problem: Failed tasks all retry at the same moment, overwhelming the service they depend on.
Solution: Use exponential backoff with jitter, circuit breakers, and max retry limits.
See GOTCHAS.md for details.
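
A minimal sketch of exponential backoff with jitter and a retry cap follows. It uses the "full jitter" variant (a uniformly random delay up to the exponential bound); backoff_delay and retry_with_backoff are hypothetical helpers, and the default base, cap, and attempt count are arbitrary values to tune per workload.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
) -> T:
    """Retry operation on any exception, sleeping a jittered delay between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry limit reached; surface the last error
            await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

The randomness is what prevents the storm: without jitter, clients that failed together would also retry together. In practice you would likely catch only errors known to be transient rather than every Exception.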

2. State Management Complexity

Problem: Workflow state is lost when a worker or the orchestrator restarts.
Solution: Use a durable execution platform (such as Temporal), persist state externally, and make tasks idempotent.
See GOTCHAS.md for details.
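
The "persist state externally" advice can be illustrated with a tiny checkpoint store. This sketch writes completed step names to a JSON file as a stand-in for a database or a durable execution platform; CheckpointStore and run_workflow are hypothetical names, not part of any framework.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class CheckpointStore:
    """Persists completed step names to a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> List[str]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, "r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, completed: List[str]) -> None:
        # Write to a temp file and rename, so a crash never leaves a half-written file.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(completed, f)
        os.replace(tmp_path, self.path)


def run_workflow(steps: Dict[str, Callable[[], None]], store: CheckpointStore) -> List[str]:
    """Run steps in order, skipping any already checkpointed; return the steps run now."""
    completed = store.load()
    executed: List[str] = []
    for name, step in steps.items():
        if name in completed:
            continue  # finished in an earlier run
        step()
        # A crash between step() and save() re-runs the step after restart,
        # which is why each step must be idempotent.
        completed.append(name)
        store.save(completed)
        executed.append(name)
    return executed


if __name__ == "__main__":
    store = CheckpointStore(os.path.join(tempfile.mkdtemp(), "workflow_state.json"))
    steps = {"extract": lambda: None, "transform": lambda: None}
    print(run_workflow(steps, store))  # first run executes both steps
    print(run_workflow(steps, store))  # second run skips both
```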

3. Serialization Issues

Problem: Large data passed between tasks causes memory issues.
Solution: Pass references (such as S3 URLs) instead of the data itself, stream where possible, and process in chunks.
See GOTCHAS.md for details.
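
As a sketch of passing references instead of payloads, the example below stores data once and hands tasks a small dictionary pointing at it. BlobStore is a hypothetical local-directory stand-in for object storage such as S3, and producer_task and consumer_task are illustrative names.

```python
import hashlib
import os
import tempfile
from typing import Dict, Iterator, Union

Reference = Dict[str, Union[str, int]]


class BlobStore:
    """Local-directory stand-in for object storage."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> Reference:
        """Store data and return a small, serializable reference to it."""
        key = hashlib.sha256(data).hexdigest()
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)
        return {"key": key, "size": len(data)}

    def stream(self, ref: Reference, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        """Yield the stored data in chunks so it is never fully loaded in memory."""
        with open(os.path.join(self.root, str(ref["key"])), "rb") as f:
            while chunk := f.read(chunk_size):
                yield chunk


def producer_task(store: BlobStore) -> Reference:
    # Only this small reference travels through the orchestrator.
    return store.put(b"x" * 1_000_000)


def consumer_task(store: BlobStore, ref: Reference) -> int:
    # Process chunk by chunk; here the "processing" just counts bytes.
    return sum(len(chunk) for chunk in store.stream(ref))


if __name__ == "__main__":
    store = BlobStore(tempfile.mkdtemp())
    ref = producer_task(store)
    print(ref["size"], consumer_task(store, ref))
```

The reference stays under a hundred bytes or so regardless of payload size, which keeps task queues, logs, and any framework-level result storage small.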

For all common gotchas and solutions, see GOTCHAS.md.

Orchestration Patterns

This framework provides 7 production-ready patterns:

Multi-Agent Task Decomposition - Break complex objectives into subtasks
Airflow DAG Coordination - Orchestrate agents with Apache Airflow
Temporal Durable Execution - Long-running workflows with automatic retries
Event-Driven Coordination - Pub/Sub messaging for loose coupling
Circuit Breaker Pattern - Prevent cascading failures
Resource Pool with Load Balancing - Manage agent resources efficiently
Distributed Tracing - Monitor and debug distributed workflows

See PATTERNS.md for detailed pattern descriptions and when to use each.
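
For orientation, a circuit breaker can be sketched in a few dozen lines. This is not the implementation from EXAMPLES.md: the class, its thresholds, and the injectable clock are assumptions made for illustration, and only the get_state() name is borrowed from the Quick Reference.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests need not wait in real time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def get_state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        state = self.get_state()
        if state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half_open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail immediately instead of queueing behind a dependency that is already struggling. The sketch is not thread-safe; shared use would need a lock around the state updates.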

Working Examples

All patterns include complete, runnable code examples:

Multi-Agent Orchestrator - Coordinate specialized agents with dependency resolution
Airflow DAG - Production DAG for agent coordination with XCom
Temporal Workflow - Durable workflow with activities and retry policies
Event-Driven System - Redis-based pub/sub for agent communication
Circuit Breaker - Fault-tolerant agent calls with state management
Resource Pool - Load-balanced agent pool with multiple strategies
Distributed Tracing - OpenTelemetry-style tracing for workflows

See EXAMPLES.md for complete code examples you can copy and adapt.
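
As a small taste of the resource pool idea, here is a least-utilized selection strategy. It is a sketch under assumptions, not the code from EXAMPLES.md: ResourcePool, acquire, and release are hypothetical, and get_pool_status only borrows its name from the Quick Reference.

```python
from typing import Dict


class ResourcePool:
    """Tracks active tasks per agent and picks the least-utilized one."""

    def __init__(self, capacities: Dict[str, int]) -> None:
        self._capacity = dict(capacities)
        self._active = {agent_id: 0 for agent_id in capacities}

    def acquire(self) -> str:
        """Reserve a slot on the agent with the lowest utilization."""
        candidates = [a for a in self._capacity if self._active[a] < self._capacity[a]]
        if not candidates:
            raise RuntimeError("no agent has free capacity")
        chosen = min(candidates, key=lambda a: self._active[a] / self._capacity[a])
        self._active[chosen] += 1
        return chosen

    def release(self, agent_id: str) -> None:
        if self._active[agent_id] == 0:
            raise ValueError(f"agent {agent_id!r} has no active tasks")
        self._active[agent_id] -= 1

    def get_pool_status(self) -> Dict[str, Dict[str, int]]:
        return {
            a: {"active": self._active[a], "capacity": self._capacity[a]}
            for a in self._capacity
        }


if __name__ == "__main__":
    pool = ResourcePool({"agent_0": 2, "agent_1": 1})
    print([pool.acquire() for _ in range(3)])
    print(pool.get_pool_status())
```

Selecting by utilization ratio rather than raw task count lets agents with different capacities share load proportionally.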

Framework Selection

Choose the right orchestration framework for your needs:

Framework	Best For	Key Strength	Learn More
Airflow	Batch ETL, scheduled jobs	Rich UI, Python DAGs	KNOWLEDGE.md
Temporal	Long-running workflows	Durable execution	KNOWLEDGE.md
Prefect	Data science, dynamic flows	Pythonic API	KNOWLEDGE.md
Celery	Distributed task queues	Real-time tasks	KNOWLEDGE.md
Step Functions	AWS serverless workflows	No infrastructure	KNOWLEDGE.md

See KNOWLEDGE.md for detailed framework comparisons, concepts, and learning resources.

Best Practices

DO's

Design for Idempotency - Tasks should be safe to retry without duplicating side effects
Use Correlation IDs - Track requests across distributed systems
Implement Timeouts - Every operation should have a timeout
Monitor Everything - Metrics, logs, traces for all components
Implement Circuit Breakers - Prevent cascading failures
Use Exponential Backoff - Space out retries, with jitter, to avoid a thundering herd
Validate DAGs - Check for cycles before execution
Version Workflows - Track workflow changes over time
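
Two of these practices, correlation IDs and timeouts, combine naturally in asyncio. The sketch below is one possible approach using contextvars and asyncio.wait_for; agent_step, run_step, and handle_request are hypothetical names, and the durations are arbitrary.

```python
import asyncio
import contextvars
import uuid
from typing import Dict, List

# Tasks started inside a request inherit this value automatically.
correlation_id: contextvars.ContextVar = contextvars.ContextVar("correlation_id", default="-")


async def agent_step(name: str, duration_s: float) -> Dict[str, str]:
    await asyncio.sleep(duration_s)  # placeholder for real work
    return {"step": name, "status": "completed", "correlation_id": correlation_id.get()}


async def run_step(name: str, duration_s: float, timeout_s: float) -> Dict[str, str]:
    """Run one step under a timeout and report a timeout instead of hanging."""
    try:
        return await asyncio.wait_for(agent_step(name, duration_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"step": name, "status": "timeout", "correlation_id": correlation_id.get()}


async def handle_request(timeout_s: float = 0.05) -> List[Dict[str, str]]:
    correlation_id.set(uuid.uuid4().hex)  # one ID per incoming request
    return await asyncio.gather(
        run_step("fast", 0.0, timeout_s),
        run_step("slow", 1.0, timeout_s),
    )


if __name__ == "__main__":
    for result in asyncio.run(handle_request()):
        print(result)
```

Every result carries the same correlation ID, so log lines from different agents can be joined back into one request. In a real system the timeout would come from per-task configuration rather than a default argument.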

DON'Ts

Don't Pass Large Data - Use references (S3 URLs, database IDs) instead
Don't Ignore Partial Failures - Handle them explicitly
Don't Use Distributed Transactions - Use saga pattern instead
Don't Chain Synchronously - Use async/parallel execution where possible
Don't Skip Health Checks - Monitor agent health continuously
Don't Hardcode Timeouts - Make them configurable per task type
Don't Trust Clocks - Use logical ordering, not wall-clock time
Don't Forget Cleanup - Clean up zombie workflows and stale state
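
The saga recommendation above can be illustrated briefly: each step is paired with a compensating action, and a failure triggers the compensations for completed steps in reverse order. run_saga and the step names in the usage are hypothetical, and this sketch assumes compensations themselves do not fail.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run actions in order; on failure, undo completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False, log
        completed.append(step)
        log.append(f"done:{name}")
    return True, log


if __name__ == "__main__":
    def ship() -> None:
        raise RuntimeError("carrier unavailable")

    ok, log = run_saga([
        ("reserve_stock", lambda: None, lambda: None),
        ("charge_card", lambda: None, lambda: None),
        ("ship_order", ship, lambda: None),
    ])
    print(ok, log)
```

Unlike a distributed transaction, a saga exposes intermediate states to other readers, so compensations need to be meaningful business operations (a refund, a stock release) rather than silent rollbacks.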

See REFERENCE.md for extended best practices guide.

Production Deployment

Essential Checklist

Before deploying orchestrated workflows to production:

Define clear task boundaries and responsibilities
Implement comprehensive error handling and retries
Set up distributed tracing (OpenTelemetry, Jaeger)
Configure monitoring and alerting (Prometheus, Grafana)
Implement circuit breakers for external dependencies
Use message queues for async communication (RabbitMQ, Kafka)
Set up health checks for all agents
Implement graceful shutdown handling
Configure resource limits and quotas
Set up log aggregation (ELK, Loki)

See REFERENCE.md for complete deployment guide.

Documentation Structure

This skill is organized for progressive disclosure:

SKILL.md (this file) - Quick start, core concepts, overview
KNOWLEDGE.md - Frameworks, theory, distributed systems concepts
PATTERNS.md - Implementation patterns and when to use them
EXAMPLES.md - Complete working code examples
GOTCHAS.md - Common pitfalls and troubleshooting
REFERENCE.md - API docs, configuration, production deployment

Related Skills

multi-agent-coordination-framework - For advanced agent architectures
mcp-integration-toolkit - For agent communication via MCP
git-mastery-suite - For orchestrating Git-based workflows
security-scanning-suite - For security orchestration and SAST/DAST coordination

Learning Path

Start Here: Read this SKILL.md for quick start and core concepts
Choose Framework: Review KNOWLEDGE.md to select your orchestration framework
Pick Pattern: Browse PATTERNS.md to find the pattern matching your use case
Copy Code: Use EXAMPLES.md to get working code
Avoid Pitfalls: Check GOTCHAS.md for common mistakes
Go to Production: Follow REFERENCE.md deployment guide

Quick Reference

Common Operations

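# Note: these calls differ from the Quick Start's SimpleOrchestrator, which has
# no capacity argument, takes agent_id in execute_task(), and exposes
# orchestrate() rather than execute_workflow().

# Register agents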
orchestrator.register_agent(agent_id, agent_type, capacity)

# Submit task
result = await orchestrator.execute_task(task)

# Execute workflow
results = await orchestrator.execute_workflow(tasks)

# Check circuit breaker status
status = circuit_breaker.get_state()

# Monitor resource pool
status = resource_pool.get_pool_status()

Key Metrics to Monitor

Task queue depth
Agent utilization (%)
Task success rate (%)
Average task duration
Circuit breaker state
Error rate by agent type

See REFERENCE.md for complete monitoring guide.
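
Three of the metrics above (task success rate, average task duration, and error rate by agent type) can be derived from simple per-task records. The MetricsCollector below is a hypothetical in-memory sketch; a production setup would more likely export counters and histograms to a system such as Prometheus.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Dict, List, Tuple


@dataclass
class MetricsCollector:
    # Each record is (agent_type, duration_seconds, succeeded).
    records: List[Tuple[str, float, bool]] = field(default_factory=list)

    def record(self, agent_type: str, duration_s: float, succeeded: bool) -> None:
        self.records.append((agent_type, duration_s, succeeded))

    def success_rate(self) -> float:
        """Percentage of recorded tasks that succeeded."""
        if not self.records:
            return 0.0
        successes = sum(1 for _, _, ok in self.records if ok)
        return 100.0 * successes / len(self.records)

    def average_duration(self) -> float:
        """Mean task duration in seconds."""
        if not self.records:
            return 0.0
        return sum(duration for _, duration, _ in self.records) / len(self.records)

    def error_rate_by_agent_type(self) -> Dict[str, float]:
        """Percentage of failed tasks for each agent type."""
        totals: DefaultDict[str, int] = defaultdict(int)
        errors: DefaultDict[str, int] = defaultdict(int)
        for agent_type, _, ok in self.records:
            totals[agent_type] += 1
            if not ok:
                errors[agent_type] += 1
        return {t: 100.0 * errors[t] / totals[t] for t in totals}


if __name__ == "__main__":
    metrics = MetricsCollector()
    metrics.record("analyzer", 1.2, True)
    metrics.record("security", 3.4, False)
    print(metrics.success_rate(), metrics.error_rate_by_agent_type())
```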
