---
name: benchmark_design
description: Generates controlled benchmark setups with workload patterns, throughput/latency metrics, and reproducible seeds. Use for rigorous performance measurement design.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---
# Benchmark Design

## Purpose
Design rigorous, reproducible benchmarks that isolate performance characteristics and minimize measurement noise.
CRITICAL: Before designing benchmarks, use the test_data_design skill to create comprehensive test data covering edge cases, outliers, and realistic distributions. Top candidates design data-driven benchmarks that systematically explore the input space.
## Benchmark Principles (1999-2002 Era)
- Measure what matters: throughput (ops/sec) and latency (ns/op)
- Warm up the JIT: Java needs warmup before steady-state timing; C++ and Rust have no JIT, though caches and branch predictors still benefit from a few untimed passes
- Multiple trials: run 10-100 trials and report the median plus standard deviation (see the sketch after this list)
- Fixed seeds: seed all random data generation so every run sees identical inputs
- Isolate variables: change one thing at a time
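A minimal sketch of this trial discipline in plain Java; `runOnce` and the seed value are placeholders for the workload under test, not part of any framework:

```java
import java.util.Arrays;
import java.util.Random;

public class TrialRunner {
    static final long SEED = 42L; // fixed, documented seed (placeholder value)

    public static void main(String[] args) {
        int trials = 30;
        double[] elapsedNs = new double[trials];
        for (int t = 0; t < trials; t++) {
            Random rng = new Random(SEED);      // identical data every trial
            long start = System.nanoTime();
            runOnce(rng);                       // hypothetical workload
            elapsedNs[t] = System.nanoTime() - start;
        }
        Arrays.sort(elapsedNs);
        double median = elapsedNs[trials / 2];  // upper median for even trial counts
        double mean = Arrays.stream(elapsedNs).average().orElse(0);
        double variance = Arrays.stream(elapsedNs)
                .map(x -> (x - mean) * (x - mean)).sum() / (trials - 1);
        System.out.printf("median=%.0f ns  stddev=%.0f ns%n", median, Math.sqrt(variance));
    }

    static void runOnce(Random rng) { /* workload under test */ }
}
```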
## Benchmark Structure

### Setup
- Generate test data with a fixed seed (a hypothetical generator is sketched below)
- Warm up (Java: ~10K iterations so the JIT compiles hot paths; C++/Rust: optional)
- Clear caches between trials (if the OS allows)
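As a concrete example of fixed-seed setup, here is a hypothetical generator that builds k sorted runs deterministically; it mirrors the `generate_test_data(...)` calls in the framework examples below, which the project itself must supply:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class TestData {
    /** Builds k sorted runs of n elements each from one fixed seed. */
    static List<int[]> generate(long seed, int k, int n) {
        Random rng = new Random(seed);
        List<int[]> runs = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            int[] run = new int[n];
            int v = 0;
            for (int j = 0; j < n; j++) {
                v += rng.nextInt(10); // non-decreasing, so each run is sorted
                run[j] = v;
            }
            runs.add(run);
        }
        return runs;
    }
}
```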
### Measurement
- Use a high-resolution timer (nanosecond granularity)
- Measure a large batch (≥1000 ops) to amortize timer overhead
- Report throughput: ops/sec = batch_size / elapsed_seconds
- Report latency: ns/op = elapsed_ns / batch_size (worked sketch below)
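Both metrics fall out of a single timed batch. A toy sketch, where the loop body stands in for the real operation:

```java
public class BatchTimer {
    public static void main(String[] args) {
        final long batchSize = 1_000_000; // >=1000 ops amortizes timer overhead
        long sink = 0;
        long start = System.nanoTime();
        for (long i = 0; i < batchSize; i++) sink += i; // stand-in operation
        long elapsedNs = System.nanoTime() - start;
        double opsPerSec = batchSize / (elapsedNs / 1e9);
        double nsPerOp = (double) elapsedNs / batchSize;
        System.out.printf("sink=%d  throughput=%.0f ops/sec  latency=%.1f ns/op%n",
                sink, opsPerSec, nsPerOp);
    }
}
```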
### Workloads
- Small: k=10 iterators, 1K elements each (10K total)
- Medium: k=100 iterators, 100K elements each (10M total)
- Large: k=1000 iterators, 10M elements total (10K each)
## Framework Usage

### Java (JMH - Java Microbenchmark Harness)
```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(1)
public class MergeBenchmark {

    @Param({"10", "100", "1000"})
    private int k;

    @Benchmark
    public void testMerge(Blackhole bh) {
        // Merge k iterators here and feed each element to the Blackhole
        // so the JIT cannot dead-code-eliminate the work.
    }
}
```
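One common way to run JMH benchmarks is to package them as a self-contained JAR (e.g. with the Maven shade plugin) and execute `java -jar target/benchmarks.jar`.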
### C++ (Google Benchmark)
```cpp
#include <benchmark/benchmark.h>

static void BM_Merge(benchmark::State& state) {
    int k = state.range(0);
    auto iterators = generate_test_data(k, 10000);  // fixed-seed setup, outside the timed loop
    for (auto _ : state) {
        merge_iterator<int> merger(iterators);
        long count = 0;
        while (merger.has_next()) {
            benchmark::DoNotOptimize(merger.next());  // keep each result observable
            count++;
        }
        benchmark::DoNotOptimize(count);
    }
}
BENCHMARK(BM_Merge)->Arg(10)->Arg(100)->Arg(1000);
BENCHMARK_MAIN();
```
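Compile with optimizations and link against the library (e.g. `-lbenchmark -lpthread`); passing `--benchmark_repetitions=10` at runtime runs multiple trials and reports aggregate mean/median/stddev.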
### Rust (Criterion)
```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn merge_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("merge");
    for k in [10, 100, 1000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(k), k, |b, &k| {
            // Fixed-seed setup happens once, outside the timed closure.
            let iterators = generate_test_data(k, 10000);
            b.iter(|| {
                let merger = MergeIterator::new(iterators.clone());
                merger.count() // consume everything so the merge is not optimized away
            });
        });
    }
    group.finish();
}

criterion_group!(benches, merge_benchmark);
criterion_main!(benches);
```
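With Criterion, declare the bench target with `harness = false` in `Cargo.toml` and run `cargo bench`; Criterion handles warmup, iteration counts, and outlier detection itself.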
## Workload Patterns

### Pattern 1: Uniform Distribution
- Each iterator: 10K elements uniformly distributed in [0, 100K)
- Tests: typical case, good branch prediction

### Pattern 2: Skewed Distribution
- Iterator 0 holds 90% of the elements; the remaining 10% is split across the rest (see the sketch after this list)
- Tests: heap imbalance, worst-case behavior

### Pattern 3: Adversarial
- Elements arranged to maximize comparisons and cache misses
- Tests: worst-case complexity
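A hypothetical generator for Pattern 2; it assumes k ≥ 2 and that iterators yield sorted runs, as in the merge benchmarks above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class SkewedWorkload {
    /** Iterator 0 gets ~90% of all elements; the other k-1 split the rest. */
    static List<int[]> generate(long seed, int k, int total) {
        Random rng = new Random(seed);
        List<int[]> runs = new ArrayList<>(k);
        int big = (int) (total * 0.9);
        int small = (total - big) / (k - 1); // assumes k >= 2
        for (int i = 0; i < k; i++) {
            int[] run = new int[i == 0 ? big : small];
            for (int j = 0; j < run.length; j++) run[j] = rng.nextInt(100_000);
            Arrays.sort(run); // keep each run sorted for the merge
            runs.add(run);
        }
        return runs;
    }
}
```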
## Metrics to Collect
- Throughput: elements processed per second
- Latency: nanoseconds per element
- Memory: peak allocation (if measurable)
- Cache: miss rate (if hardware perf counters are available)
## Reproducibility Checklist
- Fixed random seed documented (see the manifest sketch after this list)
- Compiler version and flags documented (optimization level)
- JVM/runtime version documented
- CPU model and clock speed documented
- OS and kernel version documented
- Number of trials ≥ 10
- Warmup iterations documented
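A sketch of a manifest dump covering the JVM-visible items; the seed, trial, and warmup values are placeholders, and CPU model/clock speed must be recorded separately from the OS (e.g. `/proc/cpuinfo` on Linux), since the JVM does not expose them:

```java
import java.util.Locale;

public class EnvManifest {
    public static void main(String[] args) {
        System.out.printf(Locale.ROOT,
                "seed=%d%ntrials=%d%nwarmup_iters=%d%njvm=%s %s%nos=%s %s (%s)%n",
                42L, 30, 10_000, // placeholder values: document your real ones
                System.getProperty("java.vm.name"),
                System.getProperty("java.version"),
                System.getProperty("os.name"),
                System.getProperty("os.version"),
                System.getProperty("os.arch"));
    }
}
```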
## Cross-Skill Integration
- Requires: java_codegen (code to benchmark), test_data_design (comprehensive test inputs)
- Feeds into: performance_interpretation, reporting_visualization
- Related: test_data_design (use FIRST to design the test data catalog)