Benchmark Suite Skill
Overview
This skill provides automated performance testing, covering benchmark execution, regression detection, performance validation, and quality assessment, to keep system performance optimal.
When to Use
- Running performance benchmark suites before deployment
- Detecting performance regressions between versions
- Validating SLA compliance through automated testing
- Load, stress, and endurance testing
- CI/CD pipeline performance quality gates
- Comparing performance across configurations
Quick Start
# Run comprehensive benchmark suite
npx claude-flow benchmark-run --suite comprehensive --duration 300
# Execute specific benchmark
npx claude-flow benchmark-run --suite throughput --iterations 10
# Compare with baseline
npx claude-flow benchmark-compare --current <results> --baseline <baseline>
# Quality assessment
npx claude-flow quality-assess --target swarm-performance --criteria throughput,latency
# Performance validation
npx claude-flow validate-performance --results <file> --criteria <file>
Architecture
+-----------------------------------------------------------+
|                      Benchmark Suite                      |
+-----------------------------------------------------------+
|  Benchmark Runner  |  Regression Detector  |  Validator   |
+--------------------+-----------------------+--------------+
         |                     |                  |
         v                     v                  v
+------------------+ +-------------------+ +--------------+
| Benchmark Types  | | Detection Methods | | Validation   |
| - Throughput     | | - Statistical     | | - SLA        |
| - Latency        | | - ML-based        | | - Regression |
| - Scalability    | | - Threshold       | | - Scalability|
| - Coordination   | | - Trend Analysis  | | - Reliability|
+------------------+ +-------------------+ +--------------+
         |                     |                  |
         v                     v                  v
+-----------------------------------------------------------+
|                   Reporter & Comparator                   |
+-----------------------------------------------------------+
Benchmark Types
Standard Benchmarks
| Benchmark | Metrics | Duration | Targets |
|-----------|---------|----------|---------|
| Throughput | requests/sec, tasks/sec, messages/sec | 5 min | min: 1000, optimal: 5000 |
| Latency | p50, p90, p95, p99, max | 5 min | p50 < 100ms, p99 < 1s |
| Scalability | linear coefficient, efficiency retention | variable | coefficient > 0.8 |
| Coordination | message latency, sync time | 5 min | < 50ms |
| Fault Tolerance | recovery time, failover success | 10 min | < 30s recovery |
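The latency row reports percentile statistics over raw samples. For intuition, a nearest-rank percentile can be computed as in the sketch below; this is an illustrative helper, not the skill's internal implementation.

// Nearest-rank percentile over raw latency samples (illustrative).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [42, 87, 95, 103, 110, 156, 420, 980]; // example data
percentile(latenciesMs, 50); // 103 (p50)
percentile(latenciesMs, 99); // 980 (p99)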
Test Campaign Types
- Load Testing: Gradual ramp-up to sustained load
- Stress Testing: Find breaking points
- Volume Testing: Large data set handling
- Endurance Testing: Long-duration stability
- Spike Testing: Sudden load changes (see the sketch after this list)
- Configuration Testing: Different settings comparison
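Load and stress campaigns have dedicated configuration examples later in this skill. The other types follow the same shape; below is a hedged sketch of a spike test, assuming the same config schema as those examples (field names beyond that schema, such as recovery_time_ms, are assumptions).

// Hypothetical spike-test config mirroring the load/stress schema below.
const spikeTest = {
  type: 'spike',
  phases: [
    { phase: 'baseline', duration: 120000, load: 100 },
    { phase: 'spike', duration: 30000, load: 1000 },   // sudden 10x jump
    { phase: 'recovery', duration: 120000, load: 100 }
  ],
  successCriteria: {
    error_rate: { max: 0.01 },        // errors stay bounded through the spike
    recovery_time_ms: { max: 30000 }  // assumed criterion: fast return to baseline
  }
};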
Core Capabilities
1. Comprehensive Benchmarking
// Run benchmark suite
const results = await benchmarkSuite.run({
  duration: 300000,      // 5 minutes
  iterations: 10,        // 10 iterations
  warmupTime: 30000,     // 30 seconds warmup
  cooldownTime: 10000,   // 10 seconds cooldown
  parallel: false,       // Sequential execution
  baseline: previousRun  // Compare with baseline
});
// Results include:
// - summary: Overall scores and status
// - detailed: Per-benchmark results
// - baseline_comparison: Delta from baseline
// - recommendations: Optimization suggestions
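The exact result shapes are not pinned down here; a hedged sketch of consuming the fields named above:

// Illustrative only: field shapes (e.g. summary.status) are assumed
// from the comments above, not a documented API.
const { summary, baseline_comparison, recommendations } = results;
if (summary.status !== 'pass') {
  console.error('Benchmark suite did not pass:', summary);
}
if (baseline_comparison) {
  console.log('Delta vs. baseline:', baseline_comparison);
}
for (const suggestion of recommendations) {
  console.log('Recommendation:', suggestion);
}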
2. Regression Detection
Multi-algorithm detection:
| Method | Description | Use Case |
|--------|-------------|----------|
| Statistical | CUSUM change-point detection | Detect gradual degradation |
| Machine Learning | Anomaly detection models | Identify unusual patterns |
| Threshold | Fixed limit comparisons | Hard performance limits |
| Trend | Time series regression | Long-term degradation |
# Detect performance regressions
npx claude-flow detect-regression --current <results> --historical <data>
# Set up automated regression monitoring
npx claude-flow regression-monitor --enable --sensitivity 0.95
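For intuition, the statistical method's CUSUM approach accumulates deviations above an expected level and flags a change point once the cumulative sum crosses a threshold. A minimal sketch over a per-run p99 latency series; the parameters are illustrative, not claude-flow options:

// One-sided CUSUM for upward shifts (illustrative implementation).
function cusumDetect(samples, target, slack, threshold) {
  let s = 0; // cumulative drift above target + slack
  for (let i = 0; i < samples.length; i++) {
    s = Math.max(0, s + (samples[i] - target - slack));
    if (s > threshold) {
      return { regression: true, index: i }; // sustained upward shift detected
    }
  }
  return { regression: false };
}

const p99Samples = [98, 101, 99, 104, 110, 118, 125, 131]; // ms, one per run
cusumDetect(p99Samples, 100, 5, 50); // { regression: true, index: 7 }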
3. Automated Performance Testing
// Execute test campaign
const campaign = await tester.runTestCampaign({
  tests: [
    { type: 'load', config: loadTestConfig },
    { type: 'stress', config: stressTestConfig },
    { type: 'endurance', config: enduranceConfig }
  ],
  constraints: {
    maxDuration: 3600000, // 1 hour max
    failFast: true        // Stop on first failure
  }
});
4. Performance Validation
Validation framework with multi-criteria assessment:
| Validation Type | Criteria |
|-----------------|----------|
| SLA Validation | Availability, response time, throughput, error rate |
| Regression Validation | Comparison with historical data |
| Scalability Validation | Linear scaling, efficiency retention |
| Reliability Validation | Error handling, recovery, consistency |
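A hypothetical criteria definition for `validate-performance --criteria` might look like the sketch below; the actual file schema is not specified here, and the availability floor is an assumption (the other bounds mirror the quality-gate table later in this skill).

// Hypothetical criteria shape; the real --criteria schema may differ.
const slaCriteria = {
  availability: { min: 0.999 },         // assumed SLA floor
  response_time: { p99_max_ms: 1000 },  // matches the p99 < 1s gate
  throughput: { min_rps: 1000 },        // matches the throughput minimum
  error_rate: { max: 0.005 }            // matches the < 0.5% gate
};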
MCP Integration
// Comprehensive benchmark integration
const benchmarkIntegration = {
  // Execute performance benchmarks
  async runBenchmarks(config = {}) {
    const [benchmark, metrics, trends, cost] = await Promise.all([
      mcp.benchmark_run({ suite: config.suite || 'comprehensive' }),
      mcp.metrics_collect({ components: ['system', 'agents', 'coordination'] }),
      mcp.trend_analysis({ metric: 'performance', period: '24h' }),
      mcp.cost_analysis({ timeframe: '24h' })
    ]);
    return { benchmark, metrics, trends, cost, timestamp: Date.now() };
  },

  // Quality assessment
  async assessQuality(criteria) {
    return await mcp.quality_assess({
      target: 'swarm-performance',
      criteria: criteria || ['throughput', 'latency', 'reliability', 'scalability']
    });
  }
};
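Usage is then a matter of awaiting the two helpers, for example:

// Collect a benchmark snapshot and a quality score (sketch; assumes the
// same `mcp` client as above is in scope).
const snapshot = await benchmarkIntegration.runBenchmarks({ suite: 'quick' });
const quality = await benchmarkIntegration.assessQuality(['throughput', 'latency']);
console.log(snapshot.timestamp, quality);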
Key Metrics
Benchmark Targets
const benchmarkTargets = {
  throughput: {
    requests_per_second: { min: 1000, optimal: 5000 },
    tasks_per_second: { min: 100, optimal: 500 },
    messages_per_second: { min: 10000, optimal: 50000 }
  },
  latency: {
    p50: { max: 100 },  // 100ms
    p90: { max: 200 },  // 200ms
    p95: { max: 500 },  // 500ms
    p99: { max: 1000 }, // 1s
    max: { max: 5000 }  // 5s
  },
  scalability: {
    linear_coefficient: { min: 0.8 },
    efficiency_retention: { min: 0.7 }
  }
};
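Each leaf target is a plain { min, max } bound, so checking a measured value is a simple range test. A minimal sketch; the helper name is illustrative, not part of the skill's API:

// Generic bound check against a { min, max } target (illustrative).
function meetsTarget(value, target) {
  if (target.min !== undefined && value < target.min) return false;
  if (target.max !== undefined && value > target.max) return false;
  return true;
}

meetsTarget(4200, benchmarkTargets.throughput.requests_per_second); // true
meetsTarget(1350, benchmarkTargets.latency.p99); // false (exceeds 1000ms)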
CI/CD Quality Gates
| Gate | Criteria | Action on Failure |
|------|----------|-------------------|
| Performance | < 10% degradation | Block deployment |
| Latency | p99 < 1s | Warning |
| Error Rate | < 0.5% | Block deployment |
| Scalability | > 80% linear | Warning |
Load Testing Example
// Load test with gradual ramp-up
const loadTest = {
  type: 'load',
  phases: [
    { phase: 'ramp-up', duration: 60000, startLoad: 10, endLoad: 100 },
    { phase: 'sustained', duration: 300000, load: 100 },
    { phase: 'ramp-down', duration: 30000, startLoad: 100, endLoad: 0 }
  ],
  successCriteria: {
    p99_latency: { max: 1000 },
    error_rate: { max: 0.01 },
    throughput: { min: 80 } // % of expected
  }
};
Stress Testing Example
// Stress test to find breaking point
const stressTest = {
  type: 'stress',
  startLoad: 100,
  maxLoad: 10000,
  loadIncrement: 100,
  duration: 60000, // Per load level
  breakingCriteria: {
    error_rate: { max: 0.05 },  // 5% errors
    latency_p99: { max: 5000 }, // 5s latency
    timeout_rate: { max: 0.10 } // 10% timeouts
  }
};
Integration Points
| Integration | Purpose |
|-------------|---------|
| Performance Monitor | Continuous monitoring data for benchmarking |
| Load Balancer | Validates load balancing effectiveness |
| Topology Optimizer | Tests topology configurations |
| CI/CD Pipeline | Automated quality gates |
Best Practices
- Consistent Environment: Run benchmarks in consistent, isolated environments
- Warmup Period: Always include warmup to eliminate cold-start effects
- Multiple Iterations: Run multiple iterations for statistical significance
- Baseline Maintenance: Keep baseline updated with expected performance
- Historical Tracking: Store all benchmark results for trend analysis
- Realistic Workloads: Use production-like workload patterns
Example: CI/CD Integration
#!/bin/bash
# ci-performance-gate.sh
# Run benchmark suite
RESULTS=$(npx claude-flow benchmark-run --suite quick --output json)
# Compare with baseline
COMPARISON=$(npx claude-flow benchmark-compare \
  --current "$RESULTS" \
  --baseline ./baseline.json)
# Check for regressions
if echo "$COMPARISON" | jq -e '.regression_detected == true' > /dev/null; then
  echo "Performance regression detected!"
  echo "$COMPARISON" | jq '.regressions'
  exit 1
fi
echo "Performance validation passed"
exit 0
Related Skills
- optimization-monitor - Real-time performance monitoring
- optimization-analyzer - Bottleneck analysis and reporting
- optimization-load-balancer - Load distribution optimization
- optimization-topology - Topology performance testing
Version History
- 1.0.0 (2026-01-02): Initial release - converted from benchmark-suite agent with comprehensive benchmarking, regression detection, automated testing, and performance validation