| name | performance-monitoring |
| description | Tracks agent performance metrics, identifies bottlenecks, detects degradation patterns, and generates performance reports. Monitors completion rates, response times, success rates, and resource utilization for Journeyman Jobs development optimization. |
Performance Monitoring
Purpose
Track, analyze, and report on agent performance metrics to identify bottlenecks, detect degradation, optimize resource allocation, and maintain system health.
When To Use
- Tracking agent performance over time
- Identifying slow or failing agents
- Detecting system bottlenecks
- Optimizing resource allocation
- Generating performance reports
- Troubleshooting coordination issues
Key Performance Indicators
1. Task Completion Rate
Metric: Tasks completed per hour per agent
Completion Rate = Tasks Completed / Time Period
TARGETS:
EXCELLENT: >6 tasks/hour
GOOD: 4-6 tasks/hour
ACCEPTABLE: 2-4 tasks/hour
SLOW: <2 tasks/hour
Factors affecting rate:
- Task complexity
- Agent specialization
- System resources
- Coordination overhead
Tracking:
frontend-agent:
- Last hour: 5 tasks (GOOD)
- Last day: 42 tasks (GOOD average)
- Trend: Stable
backend-agent:
- Last hour: 1 task (SLOW)
- Last day: 18 tasks (ACCEPTABLE average)
- Trend: Declining ⚠️
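As a minimal sketch (assuming an 8-hour working day and hypothetical helper names), the rate and its band can be computed like this:

```python
def completion_rate(tasks_completed: int, hours: float) -> float:
    """Tasks completed per hour over the observation window."""
    if hours <= 0:
        raise ValueError("observation window must be positive")
    return tasks_completed / hours

def classify_completion_rate(rate: float) -> str:
    """Map a tasks/hour value to the target bands defined above."""
    if rate > 6:
        return "EXCELLENT"
    if rate >= 4:
        return "GOOD"
    if rate >= 2:
        return "ACCEPTABLE"
    return "SLOW"

if __name__ == "__main__":
    # Assumption: "last day" means one 8-hour working day
    rate = completion_rate(tasks_completed=42, hours=8)
    print(f"{rate:.1f} tasks/hour -> {classify_completion_rate(rate)}")
    # 5.2 tasks/hour -> GOOD
```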
2. Average Response Time
Metric: Time from task assignment to completion
Response Time = Completion Time - Assignment Time
TARGETS:
EXCELLENT: <15 minutes
GOOD: 15-30 minutes
ACCEPTABLE: 30-60 minutes
SLOW: >60 minutes
Components:
- Queue wait time
- Execution time
- Coordination delays
Analysis:
Agent Response Time Breakdown:
fast-agent:
- Queue wait: 2 min (5%)
- Execution: 10 min (25%)
- Coordination: 28 min (70%)
- Total: 40 min
→ BOTTLENECK: Coordination overhead
slow-agent:
- Queue wait: 25 min (50%)
- Execution: 20 min (40%)
- Coordination: 5 min (10%)
- Total: 50 min
→ BOTTLENECK: Queue depth
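A hypothetical breakdown helper, mirroring the components above, can flag the dominant contributor as the likely bottleneck:

```python
from dataclasses import dataclass

@dataclass
class ResponseBreakdown:
    queue_wait_min: float
    execution_min: float
    coordination_min: float

    @property
    def total_min(self) -> float:
        return self.queue_wait_min + self.execution_min + self.coordination_min

    def dominant_component(self) -> tuple[str, float]:
        """Return (component, share of total) for the largest contributor."""
        parts = {
            "queue wait": self.queue_wait_min,
            "execution": self.execution_min,
            "coordination": self.coordination_min,
        }
        name, minutes = max(parts.items(), key=lambda kv: kv[1])
        return name, minutes / self.total_min

fast_agent = ResponseBreakdown(queue_wait_min=2, execution_min=10, coordination_min=28)
component, share = fast_agent.dominant_component()
print(f"Total: {fast_agent.total_min:.0f} min, bottleneck: {component} ({share:.0%})")
# Total: 40 min, bottleneck: coordination (70%)
```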
3. Success Rate
Metric: Percentage of tasks completed successfully without errors
Success Rate = (Successful Tasks / Total Tasks) × 100%
TARGETS:
EXCELLENT: >95%
GOOD: 90-95%
ACCEPTABLE: 80-90%
POOR: <80%
Failure categories:
- Syntax errors
- Logic errors
- Integration failures
- Timeout failures
Tracking:
state-agent:
- Success: 47 tasks
- Failed: 3 tasks
- Success rate: 94% (GOOD)
- Common failures: Provider initialization errors
debug-agent:
- Success: 29 tasks
- Failed: 1 task
- Success rate: 97% (EXCELLENT)
- Failure: Test environment timeout
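The same banding can be applied to success rates; the failure categories below are illustrative, not a fixed taxonomy:

```python
from collections import Counter

def success_rate(successful: int, total: int) -> float:
    """Percentage of tasks completed successfully."""
    if total == 0:
        return 0.0
    return successful / total * 100

failures = Counter({"provider initialization": 2, "integration": 1})
total_tasks = 50
rate = success_rate(total_tasks - sum(failures.values()), total_tasks)
band = ("EXCELLENT" if rate > 95 else
        "GOOD" if rate >= 90 else
        "ACCEPTABLE" if rate >= 80 else "POOR")
print(f"Success rate: {rate:.0f}% ({band})")       # Success rate: 94% (GOOD)
print("Top failure:", failures.most_common(1)[0])  # ('provider initialization', 2)
```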
4. Resource Utilization
Metric: Percentage of time agent is actively working
Utilization = (Active Time / Total Time) × 100%
TARGETS:
EXCELLENT: 60-80% (sustainable)
GOOD: 40-60%
UNDERUTILIZED: <40%
OVERLOADED: >80% (burnout risk)
Idle time reasons:
- No tasks available
- Waiting for dependencies
- Coordination delays
- System maintenance
Dashboard:
═══════════════════════════════════════
AGENT UTILIZATION REPORT
═══════════════════════════════════════
frontend-orchestrator: [██████████] 100% ⚠️
state-orchestrator: [███████░░░] 70% ✓
backend-orchestrator: [████████░░] 80% ⚠️
debug-orchestrator: [███░░░░░░░] 30% ⓘ
ALERTS:
⚠️ frontend-orchestrator: OVERLOADED
⚠️ backend-orchestrator: HIGH UTILIZATION
ⓘ debug-orchestrator: UNDERUTILIZED
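A rough sketch of how the dashboard above could be rendered from raw active/total minutes; the HIGH UTILIZATION label mirrors the dashboard's warning at the 80% boundary, and the sample figures are illustrative:

```python
def utilization(active_min: float, total_min: float) -> float:
    """Percentage of time the agent was actively working."""
    return active_min / total_min * 100 if total_min else 0.0

def status(pct: float) -> str:
    if pct > 80:
        return "OVERLOADED"
    if pct >= 80:
        return "HIGH UTILIZATION"
    if pct >= 60:
        return "EXCELLENT"
    if pct >= 40:
        return "GOOD"
    return "UNDERUTILIZED"

def bar(pct: float, width: int = 10) -> str:
    filled = round(pct / 100 * width)
    return "[" + "█" * filled + "░" * (width - filled) + "]"

# Sample data: (active minutes, total minutes) per agent
agents = {
    "frontend-orchestrator": (480, 480),
    "state-orchestrator": (336, 480),
    "backend-orchestrator": (384, 480),
    "debug-orchestrator": (144, 480),
}

for name, (active, total) in agents.items():
    pct = utilization(active, total)
    print(f"{name:24s} {bar(pct)} {pct:>3.0f}% {status(pct)}")
```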
5. Queue Depth Trends
Metric: Number of queued tasks over time
Queue Trend Analysis:
HEALTHY PATTERN:
├─ Morning: 2-3 tasks (steady)
├─ Midday: 4-6 tasks (peak)
└─ Evening: 1-2 tasks (declining)
UNHEALTHY PATTERN:
├─ Morning: 3 tasks (starting)
├─ Midday: 8 tasks (growing) ⚠️
└─ Evening: 12 tasks (accelerating) 🚨
INTERPRETATION:
Healthy: Work flows in and out smoothly
Unhealthy: Queue growing = bottleneck forming
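One simple way to detect the unhealthy pattern is to flag a queue that grows across consecutive samples; the sampling cadence is an assumption:

```python
def queue_trend(samples: list[int]) -> str:
    """Classify a series of queue-depth samples taken over the day."""
    if len(samples) < 2:
        return "INSUFFICIENT DATA"
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if all(d > 0 for d in deltas):
        return "GROWING - bottleneck forming"
    if samples[-1] <= samples[0]:
        return "HEALTHY - work draining"
    return "MIXED - keep watching"

print(queue_trend([2, 5, 3, 1]))   # HEALTHY - work draining
print(queue_trend([3, 8, 12]))     # GROWING - bottleneck forming
```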
Performance Degradation Detection
Baseline Establishment
ESTABLISH BASELINE METRICS (first 30 days):
Example: frontend-agent baseline
- Avg completion time: 18 minutes
- Std deviation: ±5 minutes
- Success rate: 96%
- Tasks per hour: 3.2
THRESHOLD ALERTS:
Warning: >2 std deviations from baseline
Critical: >3 std deviations from baseline
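A minimal sketch of baseline construction, assuming roughly 30 days of completion-time samples; statistics.pstdev supplies the standard deviation:

```python
import statistics

def build_baseline(completion_minutes: list[float]) -> dict:
    """Derive mean, std deviation, and 2σ/3σ alert thresholds from samples."""
    mean = statistics.mean(completion_minutes)
    stdev = statistics.pstdev(completion_minutes)
    return {
        "mean": mean,
        "stdev": stdev,
        "warning_above": mean + 2 * stdev,   # >2 std deviations
        "critical_above": mean + 3 * stdev,  # >3 std deviations
    }

baseline = build_baseline([18, 16, 20, 22, 15, 17, 19, 18, 21, 14])
print({k: round(v, 1) for k, v in baseline.items()})
# {'mean': 18.0, 'stdev': 2.4, 'warning_above': 22.9, 'critical_above': 25.3}
```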
Anomaly Detection
DETECT PERFORMANCE ANOMALIES:
Statistical Method:
- Current: 45 min completion time
- Baseline: 18 ± 5 min
- Deviation: (45 - 18) / 5 = 5.4 std deviations
- Status: 🚨 CRITICAL ANOMALY
Possible causes:
1. Task complexity increased
2. System resource contention
3. Agent error/degradation
4. Network issues
5. Coordination delays
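Scoring a new observation against that baseline is a plain z-score check; the helper name is hypothetical:

```python
def anomaly_status(observed: float, baseline_mean: float, baseline_std: float) -> str:
    """Classify an observation by its distance from baseline in std deviations."""
    z = abs(observed - baseline_mean) / baseline_std
    if z > 3:
        return f"CRITICAL ANOMALY ({z:.1f} std deviations)"
    if z > 2:
        return f"WARNING ({z:.1f} std deviations)"
    return f"NORMAL ({z:.1f} std deviations)"

print(anomaly_status(observed=45, baseline_mean=18, baseline_std=5))
# CRITICAL ANOMALY (5.4 std deviations)
```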
Trend Analysis
TREND PATTERNS:
IMPROVING (✓):
Week 1: 25 min avg
Week 2: 22 min avg
Week 3: 20 min avg
Week 4: 18 min avg
→ Learning curve, optimization working
STABLE (✓):
Week 1-4: 18 ± 2 min avg
→ Consistent performance, no issues
DEGRADING (⚠️):
Week 1: 18 min avg
Week 2: 22 min avg
Week 3: 28 min avg
Week 4: 35 min avg
→ Investigation needed urgently
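A least-squares slope over the weekly averages is one simple way to separate these three patterns; the ±1 minute/week tolerance is an assumption:

```python
def trend_slope(weekly_avgs: list[float]) -> float:
    """Least-squares slope of weekly average completion time (min/week)."""
    n = len(weekly_avgs)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(weekly_avgs) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, weekly_avgs))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

def classify_trend(weekly_avgs: list[float], tolerance: float = 1.0) -> str:
    slope = trend_slope(weekly_avgs)
    if slope < -tolerance:
        return "IMPROVING"   # times shrinking week over week
    if slope > tolerance:
        return "DEGRADING"   # times growing - investigate
    return "STABLE"

print(classify_trend([25, 22, 20, 18]))  # IMPROVING
print(classify_trend([18, 22, 28, 35]))  # DEGRADING
```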
Bottleneck Identification
System Bottlenecks
COMMON BOTTLENECKS:
1. AGENT OVERLOAD
Symptom: One agent has 10+ tasks queued
Others: idle or lightly loaded
Solution: Redistribute work or spawn help
2. DEPENDENCY CHAIN
Symptom: Multiple agents waiting on one
Example: Frontend + State waiting on Backend
Solution: Prioritize blocking work
3. COORDINATION OVERHEAD
Symptom: High coordination time, low execution time
Example: 70% coordination, 25% execution
Solution: Reduce coordination points
4. RESOURCE CONTENTION
Symptom: All agents slow simultaneously
Example: All response times 2x baseline
Solution: Scale infrastructure
5. SPECIALIZATION GAP
Symptom: Tasks queued, but no specialized agent
Example: Firebase tasks, but only Flutter agents available
Solution: Spawn or train specialized agent
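A rule-of-thumb diagnoser covering a subset of these patterns might look like the sketch below; metric names and thresholds are assumptions for illustration:

```python
def diagnose_bottlenecks(agents: dict[str, dict]) -> list[str]:
    """Apply simple heuristics for overload, coordination, and contention."""
    findings = []
    times = [a["response_min"] for a in agents.values()]
    baselines = [a["baseline_min"] for a in agents.values()]

    # 4. RESOURCE CONTENTION: everyone is slow at once
    if all(t > 2 * b for t, b in zip(times, baselines)):
        findings.append("RESOURCE CONTENTION: all agents ~2x baseline - scale infrastructure")

    for name, a in agents.items():
        # 1. AGENT OVERLOAD: deep queue on one agent while others are light
        others_light = all(o["queue"] <= 3 for n, o in agents.items() if n != name)
        if a["queue"] >= 10 and others_light:
            findings.append(f"AGENT OVERLOAD: {name} has {a['queue']} queued tasks - redistribute")
        # 3. COORDINATION OVERHEAD: coordination dominates the response time
        if a.get("coordination_share", 0) > 0.5:
            findings.append(f"COORDINATION OVERHEAD: {name} spends "
                            f"{a['coordination_share']:.0%} of time coordinating")
    return findings

sample = {
    "backend-orchestrator": {"queue": 12, "response_min": 45, "baseline_min": 18,
                             "coordination_share": 0.2},
    "frontend-orchestrator": {"queue": 2, "response_min": 18, "baseline_min": 18,
                              "coordination_share": 0.1},
}
for finding in diagnose_bottlenecks(sample):
    print(finding)
# AGENT OVERLOAD: backend-orchestrator has 12 queued tasks - redistribute
```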
Bottleneck Analysis Example
PERFORMANCE REPORT: Backend Domain
Metrics:
- Average completion: 45 min (baseline: 18 min) 🚨
- Queue depth: 12 tasks (baseline: 2 tasks) 🚨
- Success rate: 85% (baseline: 96%) ⚠️
Root Cause Analysis:
1. Check agent logs → Firestore timeout errors
2. Check queue composition → 8 of 12 are complex queries
3. Check dependencies → No blockers from other domains
4. Check resources → Firestore quota near limit
DIAGNOSIS: Firestore performance degradation
BOTTLENECK: Database query optimization needed
SOLUTION:
1. Add composite indexes (immediate)
2. Optimize query patterns (short-term)
3. Implement query caching (long-term)
4. Increase Firestore quota (infrastructure)
Performance Reports
Daily Performance Summary
═══════════════════════════════════════════
DAILY PERFORMANCE REPORT
Date: 2025-10-31
═══════════════════════════════════════════
SYSTEM OVERVIEW:
- Total tasks completed: 127
- Average completion time: 22 minutes
- Success rate: 94%
- System utilization: 68%
AGENT PERFORMANCE:
[FRONTEND DOMAIN]
frontend-orchestrator:
Tasks: 45 (35% of total)
Avg time: 18 min ✓
Success: 96% ✓
Status: HEALTHY
[STATE DOMAIN]
state-orchestrator:
Tasks: 32 (25% of total)
Avg time: 20 min ✓
Success: 94% ✓
Status: HEALTHY
[BACKEND DOMAIN]
backend-orchestrator:
Tasks: 38 (30% of total)
Avg time: 31 min ⚠️
Success: 89% ⚠️
Status: DEGRADED
[DEBUG DOMAIN]
debug-orchestrator:
Tasks: 12 (10% of total)
Avg time: 25 min ✓
Success: 100% ✓
Status: UNDERUTILIZED
BOTTLENECKS:
⚠️ Backend domain showing degradation
Recommendation: Investigate Firestore performance
OPPORTUNITIES:
ⓘ Debug domain has excess capacity
Recommendation: Route additional testing work
═══════════════════════════════════════════
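The system overview and status labels in a report like this can be rolled up from per-domain figures; the field names, the baselines for state/debug, and the DEGRADED thresholds below are assumptions:

```python
domains = {
    "frontend": {"tasks": 45, "success": 0.96, "avg_min": 18, "baseline_min": 18},
    "state":    {"tasks": 32, "success": 0.94, "avg_min": 20, "baseline_min": 19},
    "backend":  {"tasks": 38, "success": 0.89, "avg_min": 31, "baseline_min": 18},
    "debug":    {"tasks": 12, "success": 1.00, "avg_min": 25, "baseline_min": 24},
}

total = sum(d["tasks"] for d in domains.values())
overall_success = sum(d["tasks"] * d["success"] for d in domains.values()) / total
print(f"Total tasks completed: {total}")       # 127
print(f"Success rate: {overall_success:.0%}")  # 94%

for name, d in domains.items():
    # Assumed rule: degraded if success dips below 90% or avg time exceeds 1.5x baseline
    degraded = d["success"] < 0.90 or d["avg_min"] > 1.5 * d["baseline_min"]
    share = d["tasks"] / total
    print(f"{name:9s} {share:4.0%} of tasks -> {'DEGRADED' if degraded else 'HEALTHY'}")
```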
Weekly Trend Report
═══════════════════════════════════════════
WEEKLY TREND REPORT
Week of: Oct 24-31, 2025
═══════════════════════════════════════════
COMPLETION TREND:
Mon: 118 tasks
Tue: 132 tasks
Wed: 145 tasks (+10% over Tuesday)
Thu: 128 tasks
Fri: 127 tasks
─────────────────
Avg: 130 tasks/day
Trend: Slightly improving ✓
RESPONSE TIME TREND:
Mon: 24 min
Tue: 22 min
Wed: 19 min (improving)
Thu: 23 min
Fri: 22 min
─────────────────
Avg: 22 min
Trend: Stable with a midweek dip ✓
SUCCESS RATE TREND:
Mon: 95%
Tue: 94%
Wed: 92% ⚠️
Thu: 96%
Fri: 94%
─────────────────
Avg: 94%
Trend: Wednesday dip investigated
INSIGHTS:
- Wednesday degradation due to complex feature push
- System recovered well by Thursday
- Overall performance stable and healthy
- No urgent interventions needed
═══════════════════════════════════════════
Journeyman Jobs Specific Monitoring
Mobile Performance Metrics
MOBILE-SPECIFIC KPIS:
1. Battery Impact
Target: <5% battery drain per hour
Monitor: Power consumption metrics
Alert if: >10% drain detected
2. Network Resilience
Target: 95% offline functionality
Monitor: Offline operation success rate
Alert if: <90% offline success
3. Render Performance
Target: 60 FPS UI rendering
Monitor: Frame drops, jank events
Alert if: Consistent drops <45 FPS
4. Memory Usage
Target: <200MB RAM footprint
Monitor: Memory allocation trends
Alert if: >300MB or leak detected
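These KPIs can be kept as a declarative threshold table with a single check helper; the metric keys and sample values are assumptions:

```python
MOBILE_THRESHOLDS = {
    # metric key: (target description, alert predicate)
    "battery_drain_pct_per_hour": ("<5%/hour", lambda v: v > 10),
    "offline_success_pct":        (">=95%",    lambda v: v < 90),
    "ui_fps":                     ("60 FPS",   lambda v: v < 45),
    "memory_mb":                  ("<200 MB",  lambda v: v > 300),
}

def check_mobile_metrics(samples: dict[str, float]) -> list[str]:
    """Return an alert line for every sampled metric that breaches its threshold."""
    alerts = []
    for metric, value in samples.items():
        target, should_alert = MOBILE_THRESHOLDS[metric]
        if should_alert(value):
            alerts.append(f"ALERT {metric}={value} (target {target})")
    return alerts

print(check_mobile_metrics({"battery_drain_pct_per_hour": 12, "ui_fps": 58}))
# ['ALERT battery_drain_pct_per_hour=12 (target <5%/hour)']
```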
Storm Work Responsiveness
STORM SURGE METRICS:
During storm events (job influx):
- Job scraping latency: <5 min target
- Notification delivery: <2 min target
- UI update propagation: <1 min target
Example Storm Event:
─────────────────────────────────────
Event Start: 14:35
Job Count: 0 → 150 jobs in 20 minutes
Performance:
- Scraping latency: 3.2 min ✓
- Notifications: 1.8 min ✓
- UI updates: 0.9 min ✓
Result: EXCELLENT RESPONSE
System handled surge without degradation
─────────────────────────────────────
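A surge event can be scored against the latency targets above with a simple check; the field names are assumptions:

```python
STORM_TARGETS_MIN = {"scraping": 5, "notification": 2, "ui_update": 1}

def evaluate_storm_event(latencies_min: dict[str, float]) -> str:
    """Compare observed surge latencies (minutes) with the storm targets."""
    misses = [k for k, v in latencies_min.items() if v > STORM_TARGETS_MIN[k]]
    if not misses:
        return "EXCELLENT RESPONSE - all surge targets met"
    return "DEGRADED - missed targets: " + ", ".join(misses)

print(evaluate_storm_event({"scraping": 3.2, "notification": 1.8, "ui_update": 0.9}))
# EXCELLENT RESPONSE - all surge targets met
```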
IBEW Integration Health
INTEGRATION MONITORING:
1. Local Job Board Scraping
Success rate: 98% ✓
Avg latency: 45 seconds
Failures: 2% (timeout-related)
2. Crew Coordination
Message delivery: 99.8% ✓
Avg latency: 1.2 seconds
Real-time sync: Operational
3. Offline Sync
Queue processing: 95% ✓
Conflict rate: 0.3%
Sync latency: 8 seconds avg
Best Practices
1. Establish Baselines Early
BASELINE COLLECTION:
Week 1-2: Initial baseline
- Collect all metrics
- No optimization yet
- Document normal patterns
Week 3-4: Refine baseline
- Filter anomalies
- Calculate statistics
- Set alert thresholds
Ongoing: Update baseline
- Quarterly reviews
- Adjust for system changes
- Account for growth
2. Alert Fatigue Prevention
ALERT TUNING:
❌ BAD: Alert on every deviation
✓ GOOD: Alert on sustained problems
Example Rules:
- Single spike: Log, don't alert
- 3 consecutive spikes: Warning alert
- 10 consecutive spikes: Critical alert
- Degrading trend over 3 days: Investigation alert
This reduces noise and focuses on real issues.
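A sketch of these rules as a trailing-spike counter (the three-day degrading-trend rule is omitted here); helper names are assumptions:

```python
def alert_level(samples: list[float], threshold: float) -> str:
    """Escalate only on sustained out-of-range samples, not single spikes."""
    consecutive = 0
    for value in samples:
        consecutive = consecutive + 1 if value > threshold else 0
    if consecutive >= 10:
        return "CRITICAL"
    if consecutive >= 3:
        return "WARNING"
    if consecutive == 1:
        return "LOG ONLY"
    return "OK"

print(alert_level([22, 19, 20, 41], threshold=30))  # LOG ONLY -> single spike logged
print(alert_level([22, 35, 38, 41], threshold=30))  # WARNING -> 3 consecutive spikes
```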
3. Actionable Metrics
GOOD METRIC: "Backend response time: 45 min (target: 20 min)"
→ Clear problem, clear target, actionable
BAD METRIC: "System health score: 73"
→ Unclear what's wrong, unclear action
Always pair metrics with:
- Baseline/target for comparison
- Trend (improving/stable/degrading)
- Recommended action if threshold exceeded
4. Root Cause Investigation
INVESTIGATION PROCESS:
1. DETECT: Metric exceeds threshold
Example: Response time 2x baseline
2. CORRELATE: Check related metrics
- Queue depth (is agent overloaded?)
- Success rate (are tasks failing?)
- Other agents (system-wide issue?)
3. ISOLATE: Narrow to specific cause
- Specific task types affected?
- Specific time period?
- Specific agent or domain?
4. VALIDATE: Reproduce if possible
- Can we trigger the issue?
- Does it occur predictably?
5. REMEDIATE: Apply targeted fix
- Address root cause, not symptom
- Monitor to confirm resolution
5. Continuous Optimization
OPTIMIZATION CYCLE:
1. MEASURE: Collect baseline metrics
2. ANALYZE: Identify slowest components
3. OPTIMIZE: Improve top bottlenecks
4. VALIDATE: Measure impact
5. REPEAT: Move to next bottleneck
Example:
Iteration 1: Reduced queue wait time 20 → 15 min
Iteration 2: Reduced execution time 25 → 20 min
Iteration 3: Reduced coordination 10 → 7 min
Result: Total time 55 → 42 min (24% improvement)
Integration with Resource Allocator
The performance monitoring skill is used by the Resource Allocator agent to:
- Track agent performance metrics continuously
- Detect degradation patterns and bottlenecks
- Analyze root causes of performance issues
- Generate performance reports for stakeholders
- Recommend optimizations based on metrics
- Validate that resource allocation strategies are effective
The Resource Allocator uses this skill to maintain visibility into system health and make data-driven decisions about resource allocation and optimization priorities.
Example: Performance Investigation
ALERT TRIGGERED:
backend-orchestrator response time: 58 minutes
Threshold: 30 minutes (exceeded by nearly 2x) 🚨
INVESTIGATION:
Step 1: Check Current Metrics
- Queue depth: 9 tasks (baseline: 3) ⚠️
- Success rate: 87% (baseline: 96%) ⚠️
- Utilization: 98% (baseline: 70%) ⚠️
Step 2: Correlate with Other Agents
- frontend-orchestrator: Normal performance ✓
- state-orchestrator: Normal performance ✓
- debug-orchestrator: Normal performance ✓
→ ISOLATED to backend domain
Step 3: Analyze Task Composition
- 6 Firestore complex queries
- 2 Cloud Function deployments
- 1 Auth integration
→ Query-heavy workload
Step 4: Check Logs
- Firestore timeout errors: 8 occurrences
- Error: "Deadline exceeded on read"
→ DATABASE PERFORMANCE ISSUE
ROOT CAUSE: Missing Firestore composite index
REMEDIATION:
1. Add required composite index (immediate)
2. Migrate 3 heavy queries to a different agent
3. Monitor for 30 minutes
VALIDATION:
After 30 minutes:
- Response time: 24 minutes ✓
- Success rate: 95% ✓
- Queue depth: 3 tasks ✓
→ ISSUE RESOLVED