---
name: benchmark-kernel
description: Guide for benchmarking FlashInfer kernels with CUPTI timing
---
# Tutorial: Benchmarking FlashInfer Kernels
This tutorial shows you how to accurately benchmark FlashInfer kernels.
## Goal
Measure the performance of FlashInfer kernels:
- Get accurate GPU kernel execution time
- Compare multiple backends (FlashAttention2/3, cuDNN, CUTLASS, TensorRT-LLM)
- Generate reproducible benchmark results
- Save results to CSV for analysis
## Timing Methods
FlashInfer supports two timing methods:
**CUPTI (Preferred)**: Hardware-level profiling for the most accurate GPU kernel time
- Measures pure GPU compute time without host-device overhead
- Requires `cupti-python >= 13.0.0` (CUDA 13+)

**CUDA Events (Fallback)**: Standard CUDA event timing
- Used automatically if CUPTI is not available
- Good accuracy, with slight overhead from host synchronization
The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.
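
The selection logic is, in spirit, just an import probe. A minimal sketch (an illustration, not FlashInfer's actual internals; it assumes the `cupti-python` package exposes an importable `cupti` module):

```python
# Simplified sketch of the CUPTI-vs-events selection (illustration only;
# assumes cupti-python exposes an importable `cupti` module).
try:
    import cupti  # noqa: F401  # provided by cupti-python
    USE_CUPTI = True
except ImportError:
    print("[WARNING] CUPTI is not installed. Falling back to CUDA events.")
    USE_CUPTI = False
```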
## Installation
### Install CUPTI (Recommended)
For the most accurate benchmarking:
```bash
pip install -U cupti-python
```
**Requirements:** CUDA 13+ (CUPTI version 13+)
### Without CUPTI
If you don't install CUPTI, the framework will:
- Print a warning: `CUPTI is not installed. Falling back to CUDA events.`
- Automatically use CUDA events for timing
- Still provide good benchmark results
## Method 1: Using `flashinfer_benchmark.py` (Recommended)
### Step 1: Choose Your Test Routine
Available routines:
- Attention: `BatchDecodeWithPagedKVCacheWrapper`, `BatchPrefillWithPagedKVCacheWrapper`, `BatchPrefillWithRaggedKVCacheWrapper`, `BatchMLAPagedAttentionWrapper`
- GEMM: `bmm_fp8`, `gemm_fp8_nt_groupwise`, `group_gemm_fp8_nt_groupwise`, `mm_fp4`
- MOE: `trtllm_fp4_block_scale_moe`, `trtllm_fp8_block_scale_moe`, `trtllm_fp8_per_tensor_scale_moe`, `cutlass_fused_moe`
### Step 2: Run a Single Benchmark
Example: benchmark decode attention:

```bash
# CUPTI will be used automatically if installed
python benchmarks/flashinfer_benchmark.py \
--routine BatchDecodeWithPagedKVCacheWrapper \
--backends fa2 fa2_tc cudnn \
--page_size 16 \
--batch_size 32 \
--s_qo 1 \
--s_kv 2048 \
--num_qo_heads 32 \
--num_kv_heads 8 \
--head_dim_qk 128 \
--head_dim_vo 128 \
--q_dtype bfloat16 \
--kv_dtype bfloat16 \
--num_iters 30 \
--dry_run_iters 5 \
--refcheck \
-vv
```
Example: benchmark FP8 GEMM:

```bash
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 \
--backends cudnn cublas cutlass \
--batch_size 256 \
--m 1 \
--n 1024 \
--k 7168 \
--input_dtype fp8_e4m3 \
--mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--refcheck \
-vv \
--generate_repro_command
```
Timing behavior:
- ✅ If CUPTI is installed: uses CUPTI (most accurate)
- ⚠️ If CUPTI is not installed: automatically falls back to CUDA events with a warning
- 🔧 To force CUDA events: add the `--use_cuda_events` flag
### Step 3: Understand the Output
```
[INFO] FlashInfer version: 0.6.0
[VVERBOSE] gpu_name = 'NVIDIA_H100_PCIe'
[PERF] fa2 :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec
[PERF] fa2_tc :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec
[PERF] cudnn :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec
```
Key metrics:
- `median time`: Median kernel execution time (lower is better)
- `std`: Standard deviation (lower means more consistent)
- `achieved tflops`: Effective TFLOPS throughput
- `achieved tb_per_sec`: Memory bandwidth utilization (see the sanity check below)
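
For decode, a quick sanity check is possible because the kernel is dominated by reading the KV cache: `achieved tb_per_sec` should be close to KV-cache bytes divided by kernel time. A back-of-envelope sketch for the `fa2` row above (the exact byte accounting FlashInfer uses may differ):

```python
# Back-of-envelope bandwidth check for the fa2 row above. Assumption:
# bfloat16 KV cache (2 bytes/element); Q and output traffic ignored.
batch, num_kv_heads, s_kv = 32, 8, 2048
head_dim_qk, head_dim_vo, bytes_per_elt = 128, 128, 2
kv_bytes = batch * num_kv_heads * s_kv * (head_dim_qk + head_dim_vo) * bytes_per_elt
median_time_ms = 0.145  # fa2 median from the sample output
print(f"~{(kv_bytes / 1e12) / (median_time_ms / 1e3):.2f} TB/sec")  # ~1.85
```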
### Step 4: Run Batch Benchmarks
Create a test list file `my_benchmarks.txt`:

```
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 32 --s_kv 2048 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 64 --s_kv 4096 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine bmm_fp8 --backends cudnn cutlass --batch_size 256 --m 1 --n 1024 --k 7168 --input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 --out_dtype bfloat16
```
Run all tests:

```bash
python benchmarks/flashinfer_benchmark.py \
--testlist my_benchmarks.txt \
--output_path results.csv \
--generate_repro_command \
--refcheck
```
Results are saved to `results.csv` with all metrics and reproducer commands.
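
If you want to post-process the CSV, pandas makes the backend comparison straightforward. A sketch (the column names `routine`, `backend`, and `median_time` are assumptions; adjust them to your file's actual header row):

```python
import pandas as pd

# Load the benchmark CSV and pick the fastest backend per routine.
# Column names ("routine", "backend", "median_time") are assumed here;
# check the header row of your results.csv and adjust.
df = pd.read_csv("results.csv")
best = df.sort_values("median_time").groupby("routine", as_index=False).first()
print(best[["routine", "backend", "median_time"]])
```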
### Step 5: Common Flags
| Flag | Description | Default |
|---|---|---|
| `--num_iters` | Measurement iterations | 30 |
| `--dry_run_iters` | Warmup iterations | 5 |
| `--refcheck` | Verify output correctness | False |
| `--allow_output_mismatch` | Continue on mismatch | False |
| `--use_cuda_events` | Force CUDA events (skip CUPTI) | False |
| `--no_cuda_graph` | Disable CUDA graph | False |
| `-vv` | Very verbose output | - |
| `--generate_repro_command` | Print reproducer command | False |
| `--case_tag` | Tag for CSV output | None |
## Method 2: Using `bench_gpu_time()` in Python
For custom benchmarking in your own code:
### Step 1: Write Your Benchmark Script
```python
import torch
from flashinfer.testing import bench_gpu_time
# Setup your kernel
def my_kernel_wrapper(q, k, v):
    # Your kernel call here
    return output
# Create test inputs
device = torch.device("cuda")
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device=device)
k = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)
v = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)
# Benchmark - CUPTI preferred, CUDA events if CUPTI unavailable
median_time, std_time = bench_gpu_time(
    my_kernel_wrapper,
    args=(q, k, v),
    enable_cupti=True,  # Prefer CUPTI, fall back to CUDA events
    num_iters=30,       # Number of measurement iterations
    dry_run_iters=5,    # Warmup iterations
)
print(f"Kernel time: {median_time:.3f} ms ± {std_time:.3f} ms")
# Calculate FLOPS if you know the operation count
flops = ... # Your FLOP count
tflops = (flops / 1e12) / (median_time / 1000)
print(f"Achieved: {tflops:.2f} TFLOPS/sec")
**Note:** If CUPTI is not installed, you'll see a warning and the function will automatically use CUDA events instead.
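
The same pattern gives you the bandwidth metric, if you know how many bytes the kernel moves. Continuing the script above (a simplification: this counts only the input tensors and ignores output and intermediate traffic):

```python
# Continuing the script above: estimate achieved memory bandwidth.
# Simplification: only the q/k/v input bytes are counted.
bytes_moved = sum(t.numel() * t.element_size() for t in (q, k, v))
tb_per_sec = (bytes_moved / 1e12) / (median_time / 1000)
print(f"Achieved: {tb_per_sec:.2f} TB/sec")
```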
### Step 2: Run Your Benchmark
```bash
python my_benchmark.py
```
Output with CUPTI:
```
Kernel time: 0.145 ms ± 0.002 ms
Achieved: 125.3 TFLOPS/sec
```
Output without CUPTI (automatic fallback):
```
[WARNING] CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events.
Kernel time: 0.147 ms ± 0.003 ms
Achieved: 124.1 TFLOPS/sec
```
### Step 3: Advanced Options
```python
# Cold L2 cache benchmarking (optional)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=True,   # Will use CUDA events if CUPTI unavailable
    cold_l2_cache=True,  # Flush L2 or rotate buffers automatically
    num_iters=30,
)

# Force CUDA events (skip CUPTI even if installed)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=False,  # Explicitly use CUDA events
    num_iters=30,
)
```
## Troubleshooting
### CUPTI Warning Message
```
Warning: CUPTI is not installed. Falling back to CUDA events.
```
**What it means:** CUPTI is not available, so CUDA events are used instead.

**Impact:** Less accurate for very fast kernels (5-50 µs) due to host synchronization overhead; the difference becomes negligible for longer-running kernels.
**Solution (optional):** Install CUPTI for best accuracy:

```bash
pip install -U cupti-python
```
If installation fails, check:
- CUDA version >= 13
- Compatible `cupti-python` version
You can still run benchmarks without CUPTI - the framework handles this automatically.
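
To confirm what is actually installed, you can query the package metadata directly (a sketch; `importlib.metadata` raises `PackageNotFoundError` when the package is missing):

```python
from importlib import metadata

import torch

# Report the installed cupti-python version alongside the CUDA version
# PyTorch was built against; CUPTI 13.x pairs with CUDA 13 (see above).
try:
    print("cupti-python:", metadata.version("cupti-python"))
except metadata.PackageNotFoundError:
    print("cupti-python is not installed")
print("torch CUDA:", torch.version.cuda)
```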
### Inconsistent Results
**Problem:** Large standard deviation or varying results
**Solutions:**
- Increase warmup iterations: `--dry_run_iters 10`
- Increase measurement iterations: `--num_iters 50`
- Use a cold L2 cache (in Python): `bench_gpu_time(..., cold_l2_cache=True)` (see the sketch after this list)
- Disable GPU boost (advanced): `sudo nvidia-smi -lgc <base_clock>`
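
The first three knobs map directly onto `bench_gpu_time` arguments. A sketch reusing the names from the Method 2 script above:

```python
# Sketch: tighter statistics via more warmup and measurement iterations,
# plus a cold L2 cache between timed runs (reuses my_kernel_wrapper,
# q, k, v from the Method 2 example above).
median_time, std_time = bench_gpu_time(
    my_kernel_wrapper,
    args=(q, k, v),
    num_iters=50,        # more measurement iterations
    dry_run_iters=10,    # longer warmup to stabilize clocks and caches
    cold_l2_cache=True,  # avoid flattering warm-cache timings
)
```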
### Reference Check Failures
**Error:** `[ERROR] Output mismatch between backends`

**What it means:** Different backends produce different results.
**Solutions:**
- Allow mismatch and continue: `--allow_output_mismatch`
- Check numerical tolerance: some backends use different precisions (FP32 vs FP16)
- Investigate the difference with `-vv` (very verbose mode shows tensor statistics); see the sketch below for a manual comparison
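
To compare two backends' outputs by hand, `torch.testing.assert_close` with relaxed tolerances is a reasonable starting point. A sketch with synthetic tensors standing in for real backend outputs (tune `rtol`/`atol` to the dtypes involved):

```python
import torch

# Synthetic stand-ins for two backends' outputs; in practice these come
# from the kernels under test.
ref = torch.randn(32, 8, 128, dtype=torch.bfloat16, device="cuda")
out = ref + 1e-3 * torch.randn_like(ref)

# Tolerances loose enough for bfloat16 accumulation-order differences.
torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)
```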
### Backend Not Supported
**Error:** `[WARNING] fa3 for routine ... is not supported on compute capability X.X`

**Solution:** Check the backend support matrix in `benchmarks/README.md`, or remove that backend from the `--backends` list.
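
The compute capability is easy to query, so you can rule this out quickly (`fa3`, for example, targets Hopper-class SM90 GPUs):

```python
import torch

# Query the device's compute capability and compare it against the
# backend support matrix in benchmarks/README.md.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")  # e.g. 9.0 on an H100
```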
## Best Practices
- Install CUPTI for best accuracy (but not required): `pip install -U cupti-python`
- Use reference checking to verify correctness: `--refcheck`
- Use verbose mode to see input shapes and dtypes: `-vv`
- Generate reproducer commands for sharing results: `--generate_repro_command`
- Run multiple iterations for statistical significance: `--num_iters 30 --dry_run_iters 5`
- Save results to CSV for later analysis: `--output_path results.csv`
- Compare multiple backends to find the best: `--backends fa2 fa3 cudnn cutlass`
## Quick Examples
### Decode Attention (H100)
```bash
python benchmarks/flashinfer_benchmark.py \
--routine BatchDecodeWithPagedKVCacheWrapper \
--backends fa2 fa2_tc cudnn trtllm-gen \
--page_size 16 --batch_size 128 --s_kv 8192 \
--num_qo_heads 64 --num_kv_heads 8 \
--head_dim_qk 128 --head_dim_vo 128 \
--refcheck -vv --generate_repro_command
```
### Prefill Attention (Multi-head)
```bash
python benchmarks/flashinfer_benchmark.py \
--routine BatchPrefillWithRaggedKVCacheWrapper \
--backends fa2 fa3 cudnn cutlass \
--batch_size 16 --s_qo 1024 --s_kv 1024 \
--num_qo_heads 128 --num_kv_heads 128 \
--head_dim_qk 192 --head_dim_vo 128 \
--causal --random_actual_seq_len \
--q_dtype bfloat16 --kv_dtype bfloat16 \
--refcheck -vv
```
### FP8 GEMM (Batched)
```bash
python benchmarks/flashinfer_benchmark.py \
--routine bmm_fp8 \
--backends cudnn cublas cutlass \
--batch_size 256 --m 1 --n 1024 --k 7168 \
--input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 \
--out_dtype bfloat16 \
--refcheck -vv
```
### MOE (DeepSeek-style routing)
```bash
python benchmarks/flashinfer_benchmark.py \
--routine trtllm_fp8_block_scale_moe \
--backends trtllm \
--num_tokens 1024 --hidden_size 5120 \
--intermediate_size 13824 --num_experts 256 \
--top_k 8 --n_group 8 --topk_group 1 \
--routing_method deepseek_v3 \
--routed_scaling_factor 2.5 \
--use_routing_bias \
-vv
```
## Summary: CUPTI vs CUDA Events
| Aspect | CUPTI (Preferred) | CUDA Events (Fallback) |
|---|---|---|
| Accuracy | Highest (hardware-level) | Good (slight overhead) |
| Installation | `pip install cupti-python` | Built-in with CUDA |
| Requirements | CUDA 13+ | Any CUDA version |
| Fallback | N/A | Automatic if CUPTI unavailable |
| When to use | Always (if available) | When CUPTI can't be installed |
**Recommendation:** Install CUPTI for best results, but benchmarks work fine without it.
## Next Steps
- Profile kernels with `nsys` or `ncu` for detailed analysis
- Debug performance issues using `FLASHINFER_LOGLEVEL=3` (see the sketch after this list)
- Compare with baselines using reference implementations
- Optimize kernels based on profiling results
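
If you set the log level from a script instead of the shell, a reasonable pattern is to export it before importing flashinfer (an assumption worth verifying: environment-based log levels are typically read at import time):

```python
import os

# Assumption: FLASHINFER_LOGLEVEL is read when flashinfer is imported,
# so set it first (or export it in your shell instead).
os.environ["FLASHINFER_LOGLEVEL"] = "3"

import flashinfer  # noqa: E402

print(flashinfer.__version__)
```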
## Related Documentation
- See `benchmarks/README.md` for full flag documentation
- See `benchmarks/samples/sample_testlist.txt` for more examples
- See the "Benchmarking" section of `CLAUDE.md` for technical details