| name | performance-profiling |
| description | Systematic profiling with torch.profiler, memory analysis, bottleneck identification, GPU timing |
Performance Profiling and Bottleneck Analysis
Overview
Core Principle: Optimization without measurement is guessing. Profile systematically (whole → component → operation) using the right tools to identify actual bottlenecks before attempting fixes. 90% of runtime is usually in 10% of code - find that 10% with profiling, not intuition.
Performance issues stem from: data loading bottlenecks (CPU-bound), inefficient operations (GPU-bound), memory bandwidth limits (memory-bound), or I/O bottlenecks. Profiling reveals which category applies. Guessing leads to optimizing the wrong thing, wasting hours on marginal improvements while the real bottleneck remains.
When to Use
Use this skill when:
- Training or inference slower than expected
- Need to identify performance bottleneck in PyTorch code
- High GPU memory usage, need to understand what's using memory
- Evaluating whether optimization actually improved performance
- Debugging low GPU utilization issues
- Comparing performance of different implementations
- Need to profile specific operations or model components
Don't use when:
- Performance is already acceptable (no problem to solve)
- Architecture design questions (use module-design-patterns)
- Debugging correctness issues (use debugging-techniques)
- Memory leaks (use tensor-operations-and-memory)
Symptoms triggering this skill:
- "Training is slower than expected"
- "Low GPU utilization but training still slow"
- "Which part of my model is the bottleneck?"
- "Does this optimization actually help?"
- "Memory usage is high, what's using it?"
- "First iteration much slower than subsequent ones"
Systematic Profiling Methodology
The Four-Phase Framework
Phase 1: Establish Baseline
- Define metric (throughput, latency, memory)
- Measure end-to-end performance
- Set improvement target
- Document measurement conditions
Phase 2: Identify Bottleneck Type
- CPU-bound vs GPU-bound vs I/O-bound vs memory-bound
- Check GPU utilization (nvidia-smi)
- Profile data loading separately from computation
- Determine which component to investigate
Phase 3: Narrow to Component
- Profile at coarse granularity
- Identify which phase is slow (forward/backward/optimizer/data loading)
- Focus profiling on bottleneck component
- Use iterative narrowing
Phase 4: Identify Operation
- Profile bottleneck component in detail
- Examine both table view and trace view
- Find specific operation or pattern
- Measure improvement after fix
Critical Rule: ALWAYS work through phases in order. Don't jump to Phase 4 without Phases 1-3.
Phase 1: Establish Baseline
Step 1: Define Performance Metric
# Choose the right metric for your use case:
# Throughput (samples/second) - for training
def measure_throughput(model, dataloader, num_batches=100):
model.train()
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
total_samples = 0
start.record()
for i, (data, target) in enumerate(dataloader):
if i >= num_batches:
break
data, target = data.cuda(), target.cuda()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
total_samples += data.size(0)
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end)
throughput = total_samples / (elapsed_ms / 1000.0)
print(f"Throughput: {throughput:.2f} samples/sec")
print(f"Time per batch: {elapsed_ms / num_batches:.2f} ms")
return throughput
# Latency (time per sample) - for inference
def measure_latency(model, sample_input, num_iterations=100, warmup=10):
model.eval()
sample_input = sample_input.cuda()
# Warmup (CRITICAL - don't skip!)
with torch.no_grad():
for _ in range(warmup):
_ = model(sample_input)
# Measure
latencies = []
with torch.no_grad():
for _ in range(num_iterations):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(sample_input)
end.record()
torch.cuda.synchronize()
latencies.append(start.elapsed_time(end))
# Report statistics (not just average!)
import numpy as np
latencies = np.array(latencies)
print(f"Latency - Mean: {latencies.mean():.2f} ms, "
f"Std: {latencies.std():.2f} ms, "
f"Median: {np.median(latencies):.2f} ms, "
f"P95: {np.percentile(latencies, 95):.2f} ms, "
f"P99: {np.percentile(latencies, 99):.2f} ms")
return latencies
# Memory usage (peak GB)
def measure_memory(model, sample_batch, sample_target):
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
# Run one iteration
output = model(sample_batch)
    loss = criterion(output, sample_target)
loss.backward()
torch.cuda.synchronize()
peak_memory = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak memory: {peak_memory:.2f} GB")
return peak_memory
Why this matters:
- Without baseline, can't measure improvement
- Need statistics (mean, std, percentiles), not just average
- Must use CUDA Events for GPU timing (not time.time())
- Warmup is critical to exclude JIT compilation overhead
Step 2: Document Measurement Conditions
import json
# Record all relevant configuration
profiling_config = {
'model': model.__class__.__name__,
'batch_size': 32,
'input_shape': (3, 224, 224),
'device': 'cuda:0',
    'dtype': 'float16' if using_amp else 'float32',  # using_amp: your own AMP flag
    'mode': 'train' if model.training else 'eval',
'num_workers': dataloader.num_workers,
'cudnn_benchmark': torch.backends.cudnn.benchmark,
'gpu': torch.cuda.get_device_name(0),
}
print(json.dumps(profiling_config, indent=2))
Why this matters:
- Performance changes with configuration
- Need to reproduce results
- Comparing different runs requires same conditions
- Document before optimizing, re-measure after
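To make the before/after comparison concrete, the sketch below saves the configuration together with baseline metrics to a JSON file and reports relative changes on a later run. The helper names (save_baseline, compare_to_baseline) and the baseline.json path are our own conventions, not a PyTorch API.
import json
import time
def save_baseline(profiling_config, metrics, path="baseline.json"):
    """Persist configuration plus baseline metrics so post-optimization runs can be compared."""
    record = {
        'timestamp': time.strftime("%Y-%m-%d %H:%M:%S"),
        'config': profiling_config,
        'metrics': metrics,  # e.g. {'throughput': 512.3, 'latency_p95_ms': 18.2}
    }
    with open(path, 'w') as f:
        json.dump(record, f, indent=2)
def compare_to_baseline(metrics, path="baseline.json"):
    """Print the relative change of each metric against the saved baseline."""
    with open(path) as f:
        baseline = json.load(f)['metrics']
    for key, new_value in metrics.items():
        if key in baseline and baseline[key]:
            change = (new_value - baseline[key]) / baseline[key] * 100
            print(f"{key:20s}: {baseline[key]:.2f} -> {new_value:.2f} ({change:+.1f}%)")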
Phase 2: Identify Bottleneck Type
Step 1: Check GPU Utilization
# In terminal, monitor GPU utilization in real-time
nvidia-smi dmon -s u -i 0 -d 1
# Or within Python
import subprocess
result = subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
'--format=csv,noheader,nounits'],
capture_output=True, text=True)
gpu_util, mem_used = result.stdout.strip().split(',')
print(f"GPU Utilization: {gpu_util}%, Memory: {mem_used} MB")
Interpretation:
| GPU Utilization | Likely Bottleneck | Next Step |
|---|---|---|
| < 70% | CPU-bound (data loading, preprocessing) | Profile data loading |
| > 90% | GPU-bound (computation) | Profile model operations |
| 70-90% | Mixed or memory-bound | Check memory bandwidth |
Why this matters: GPU utilization tells you WHERE to look. If GPU isn't saturated, optimizing GPU operations won't help.
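A single nvidia-smi reading can be misleading because utilization fluctuates between batches. The sketch below samples utilization in a background thread while training runs and reports mean/min/max; it assumes nvidia-smi is on PATH, and the helper name sample_gpu_utilization is our own.
import subprocess
import threading
import time
def sample_gpu_utilization(duration_s=30, interval_s=1.0, device_index=0):
    """Poll nvidia-smi every interval_s seconds and return the utilization samples."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        result = subprocess.run(
            ['nvidia-smi', '-i', str(device_index),
             '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
            capture_output=True, text=True
        )
        samples.append(int(result.stdout.strip()))
        time.sleep(interval_s)
    return samples
# Sample in the background while the training loop runs, then summarize
util_samples = []
sampler = threading.Thread(
    target=lambda: util_samples.extend(sample_gpu_utilization(duration_s=30)),
    daemon=True
)
sampler.start()
# ... run training iterations here for at least 30 seconds ...
sampler.join()
if util_samples:
    print(f"GPU utilization over 30s - mean: {sum(util_samples)/len(util_samples):.0f}%, "
          f"min: {min(util_samples)}%, max: {max(util_samples)}%")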
Step 2: Profile Data Loading vs Computation
import time
def profile_dataloader_vs_model(model, dataloader, num_batches=50):
"""Separate data loading time from model computation time"""
model.train()
data_times = []
compute_times = []
batch_iterator = iter(dataloader)
for i in range(num_batches):
# Time data loading
data_start = time.time()
data, target = next(batch_iterator)
data_end = time.time()
data_times.append(data_end - data_start)
# Time computation
data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
torch.cuda.synchronize()
compute_start = torch.cuda.Event(enable_timing=True)
compute_end = torch.cuda.Event(enable_timing=True)
compute_start.record()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
compute_end.record()
torch.cuda.synchronize()
compute_times.append(compute_start.elapsed_time(compute_end))
import numpy as np
avg_data_time = np.mean(data_times) * 1000 # to ms
avg_compute_time = np.mean(compute_times)
print(f"Avg data loading time: {avg_data_time:.2f} ms")
print(f"Avg computation time: {avg_compute_time:.2f} ms")
print(f"Data loading is {avg_data_time/avg_compute_time:.1f}x "
f"{'slower' if avg_data_time > avg_compute_time else 'faster'} than compute")
if avg_data_time > avg_compute_time:
print("⚠️ BOTTLENECK: Data loading (CPU-bound)")
print(" Solutions: Increase num_workers, use pin_memory=True, "
"move preprocessing to GPU")
else:
print("✅ Data loading is fast enough. Bottleneck is in model computation.")
return avg_data_time, avg_compute_time
Why this matters:
- If data loading > computation time, GPU is starved (increase workers)
- If computation > data loading, GPU is bottleneck (optimize model)
- Common mistake: Optimizing model when data loading is the bottleneck
Step 3: Determine Bottleneck Category
def diagnose_bottleneck_type(model, dataloader):
"""Systematic bottleneck categorization"""
# 1. Check GPU utilization
print("=== GPU Utilization Check ===")
# Run training for a bit while monitoring GPU
# If GPU util < 70% → CPU-bound
# If GPU util > 90% → GPU-bound
# 2. Check memory bandwidth
print("\n=== Memory Bandwidth Check ===")
from torch.profiler import profile, ProfilerActivity
    with profile(activities=[ProfilerActivity.CUDA], profile_memory=True) as prof:
for i, (data, target) in enumerate(dataloader):
if i >= 5:
break
data, target = data.cuda(), target.cuda()
output = model(data)
loss = criterion(output, target)
loss.backward()
    # Flag ops that allocate a lot of memory relative to their compute time
    events = prof.key_averages()
    for evt in events:
        if evt.cuda_time_total > 0 and evt.self_cuda_memory_usage > 1e8:
            print(f"  {evt.key}: {evt.cuda_time_total / 1000:.1f} ms, "
                  f"{evt.self_cuda_memory_usage / 1e9:.2f} GB allocated")
# 3. Profile phases
print("\n=== Phase Profiling ===")
times = profile_training_phases(model, next(iter(dataloader)))
# Interpret results
print("\n=== Diagnosis ===")
if times['data_loading'] > times['forward'] + times['backward']:
print("BOTTLENECK: CPU-bound (data loading)")
print("Action: Increase num_workers, enable pin_memory, cache data")
elif times['forward'] > times['backward'] * 2:
print("BOTTLENECK: GPU-bound (forward pass)")
print("Action: Profile forward pass operations")
elif times['backward'] > times['forward'] * 2:
print("BOTTLENECK: GPU-bound (backward pass)")
print("Action: Profile backward pass, check gradient checkpointing")
else:
print("BOTTLENECK: Mixed or memory-bound")
print("Action: Deep profiling needed")
Phase 3: Narrow to Component
Step 1: Coarse-Grained Profiling
from torch.profiler import profile, ProfilerActivity, schedule
def profile_training_step(model, dataloader, num_steps=10):
"""Profile one training step to identify bottleneck phase"""
# Use schedule to reduce profiling overhead
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
for step, (data, target) in enumerate(dataloader):
if step >= num_steps:
break
data, target = data.cuda(), target.cuda()
optimizer.zero_grad()
# Forward
output = model(data)
loss = criterion(output, target)
# Backward
loss.backward()
# Optimizer
optimizer.step()
prof.step() # Notify profiler of step boundary
# Print summary
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=20,
max_src_column_width=80
))
# Export trace for visualization
print("\n✅ Trace exported to ./profiler_logs")
print(" View in Chrome: chrome://tracing (load trace.json)")
print(" Or TensorBoard: tensorboard --logdir=./profiler_logs")
return prof
Understanding the schedule:
- wait=1: skip the first iteration (cold start)
- warmup=2: run the next 2 iterations as warmup (profiler active, results discarded)
- active=5: record these 5 iterations
- repeat=1: run this wait/warmup/active cycle once
Why this matters:
- Profiling has overhead - don't profile every iteration
- Schedule controls when profiling is active
- Warmup prevents including JIT compilation in measurements
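As a quick sanity check, the snippet below computes how many prof.step() calls a given schedule needs before a full cycle is recorded; run fewer iterations and the exported trace comes out empty.
from torch.profiler import schedule
# With wait=1, warmup=2, active=5, repeat=1 nothing is recorded for the first
# 3 steps and steps 4-8 are captured, so the loop must call prof.step() at least
# (wait + warmup + active) * repeat times.
wait, warmup, active, repeat = 1, 2, 5, 1
min_steps = (wait + warmup + active) * repeat
print(f"Run at least {min_steps} iterations with prof.step() for a complete trace")
my_schedule = schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)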
Step 2: Phase-Level Timing
from torch.profiler import record_function
def profile_training_phases(model, batch, target):
"""Time each phase of training separately"""
data, target = batch.cuda(), target.cuda()
optimizer.zero_grad()
torch.cuda.synchronize()
# Profile each phase
phases = {}
# Forward pass
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with record_function("forward_pass"):
start.record()
output = model(data)
loss = criterion(output, target)
end.record()
torch.cuda.synchronize()
phases['forward'] = start.elapsed_time(end)
# Backward pass
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with record_function("backward_pass"):
start.record()
loss.backward()
end.record()
torch.cuda.synchronize()
phases['backward'] = start.elapsed_time(end)
# Optimizer step
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with record_function("optimizer_step"):
start.record()
optimizer.step()
end.record()
torch.cuda.synchronize()
phases['optimizer'] = start.elapsed_time(end)
# Print breakdown
total = sum(phases.values())
print("Phase Breakdown:")
for phase, time_ms in phases.items():
print(f" {phase:15s}: {time_ms:7.2f} ms ({time_ms/total*100:5.1f}%)")
return phases
Why this matters:
- Identifies which phase is slowest
- Focuses subsequent profiling on bottleneck phase
- Uses record_function to add custom markers in the trace view
Step 3: Module-Level Profiling
def profile_model_modules(model, sample_input):
"""Profile time spent in each model module"""
model.eval()
sample_input = sample_input.cuda()
    # Add hooks to time each leaf module: a pre-hook records the start event,
    # a forward hook records the end event; elapsed times are read after a sync
    module_times = {}
    start_events = {}
    def make_pre_hook(name):
        def pre_hook(module, input):
            evt = torch.cuda.Event(enable_timing=True)
            evt.record()
            start_events[name] = evt
        return pre_hook
    def make_hook(name):
        def hook(module, input, output):
            evt = torch.cuda.Event(enable_timing=True)
            evt.record()
            module_times.setdefault(name, []).append((start_events[name], evt))
        return hook
    # Register hooks on leaf modules only
    hooks = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:
            hooks.append(module.register_forward_pre_hook(make_pre_hook(name)))
            hooks.append(module.register_forward_hook(make_hook(name)))
    # After a forward pass plus torch.cuda.synchronize(), per-module time is
    # start.elapsed_time(end) for each (start, end) pair in module_times[name]
# Better approach: Use record_function
class ProfilingModule(torch.nn.Module):
def __init__(self, module, name):
super().__init__()
self.module = module
self.name = name
def forward(self, *args, **kwargs):
with record_function(f"module_{self.name}"):
return self.module(*args, **kwargs)
# Or just use torch.profiler with record_shapes=True
# It will automatically show module breakdown
with profile(activities=[ProfilerActivity.CUDA]) as prof:
with record_function("model_forward"):
output = model(sample_input)
print(prof.key_averages(group_by_input_shape=True).table(
sort_by="cuda_time_total", row_limit=20
))
# Clean up
for hook in hooks:
hook.remove()
Why this matters:
- Identifies which model component is slowest
- Guides optimization efforts to specific layers
- Reveals unexpected bottlenecks (e.g., LayerNorm taking 30% of time)
Phase 4: Identify Operation
Step 1: Detailed Operation Profiling
def profile_operations_detailed(model, sample_input):
"""Get detailed breakdown of all operations"""
model.eval()
sample_input = sample_input.cuda()
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True,
with_stack=True,
profile_memory=True
) as prof:
output = model(sample_input)
# Group by operation type
print("\n=== Top Operations by CUDA Time ===")
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=30,
max_src_column_width=100
))
print("\n=== Top Operations by Memory ===")
print(prof.key_averages().table(
sort_by="self_cuda_memory_usage",
row_limit=20,
max_src_column_width=100
))
print("\n=== Grouped by Input Shape ===")
print(prof.key_averages(group_by_input_shape=True).table(
sort_by="cuda_time_total",
row_limit=20
))
# Export for trace view
prof.export_chrome_trace("detailed_trace.json")
print("\n✅ Exported detailed_trace.json - view in chrome://tracing")
return prof
Step 2: Reading Profiler Output
# Example profiler output:
"""
--------------------------------- ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total
--------------------------------- ------------ ------------ ------------ ------------
aten::conv2d 0.5% 124ms 45.2% 11.234s
aten::convolution 1.2% 298ms 44.7% 11.110s
aten::_convolution 2.3% 571ms 43.5% 10.812s
aten::cudnn_convolution 40.1% 9.967s 41.2% 10.241s
aten::batch_norm 0.3% 74ms 25.8% 6.412s
aten::_batch_norm 1.1% 273ms 25.5% 6.338s
aten::cudnn_batch_norm 23.2% 5.765s 24.4% 6.065s
aten::relu 8.2% 2.038s 8.2% 2.038s
"""
How to interpret:
| Column | Meaning | When to Look |
|---|---|---|
| Name | Operation name | Identify what operation |
| Self CPU % | Time in this op only (no children) | Find leaf operations |
| CPU total % | Time in op + children | Find expensive subtrees |
| Self CUDA time | GPU execution time | Main metric for GPU ops |
| Call count | How many times called | High count = optimization target |
Common patterns:
# Pattern 1: High aten::copy_ time (40%+)
# → Device transfer issue (CPU ↔ GPU)
# Action: Check device placement, reduce transfers
# Pattern 2: High cudaLaunchKernel overhead
# → Too many small kernel launches
# Action: Increase batch size, fuse operations
# Pattern 3: High cudnn_convolution time
# → Convolutions are bottleneck (expected for CNNs)
# Action: Check input dimensions for Tensor Core alignment
# Pattern 4: High CPU time, low CUDA time
# → CPU bottleneck (data loading, preprocessing)
# Action: Increase num_workers, move ops to GPU
# Pattern 5: Many small operations
# → Operation fusion opportunity
# Action: Use torch.compile or fuse manually
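The patterns above can also be flagged programmatically from prof.key_averages(). The sketch below is a rough heuristic scan of a completed profiler run; the thresholds and the helper name flag_common_patterns are arbitrary choices, not established rules.
def flag_common_patterns(prof, copy_fraction_threshold=0.3, call_count_threshold=10000):
    """Heuristically flag copy-heavy and launch-heavy patterns in profiler averages."""
    events = prof.key_averages()
    total_cuda = sum(evt.cuda_time_total for evt in events) or 1
    copy_time = sum(evt.cuda_time_total for evt in events if 'copy' in evt.key.lower())
    if copy_time / total_cuda > copy_fraction_threshold:
        print(f"⚠️ Copy ops take {copy_time / total_cuda:.0%} of CUDA time - "
              "check CPU↔GPU transfers and non-contiguous tensors")
    for evt in events:
        if evt.count > call_count_threshold:
            print(f"⚠️ {evt.key} called {evt.count} times - candidate for fusion or batching")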
Step 3: Trace View Analysis
# After exporting trace: prof.export_chrome_trace("trace.json")
# Open in chrome://tracing
"""
Trace view shows:
1. Timeline of GPU kernels
2. CPU → GPU synchronization points
3. Parallel vs sequential execution
4. GPU idle time (gaps between kernels)
What to look for:
- Large gaps between GPU kernels → GPU underutilized
- Many thin bars → Too many small operations
- Thick bars → Few large operations (good for GPU)
- Yellow/red bars → CPU activity (should be minimal during GPU work)
- Overlapping bars → Concurrent execution (good)
"""
Reading trace view:
GPU Stream 0: ████░░░░████░░░░████ ← Gaps = idle GPU (bad)
GPU Stream 0: ███████████████████ ← Continuous = good utilization
CPU: ░░░░████░░░░████░░░░ ← CPU peaks = data loading
Timeline:
[Data Load]──→[GPU Forward]──→[Data Load]──→[GPU Forward]
↑ Gap here = GPU waiting for data
Memory Profiling
Memory Tracking Methodology
Step 1: Basic Memory Tracking
import torch
def track_memory(stage_name):
"""Print current memory usage"""
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"{stage_name:30s} - Allocated: {allocated:6.2f} GB, "
f"Reserved: {reserved:6.2f} GB")
# Track at each training phase
track_memory("Start")
data, target = next(iter(dataloader))
data, target = data.cuda(), target.cuda()
track_memory("After data to GPU")
output = model(data)
track_memory("After forward")
loss = criterion(output, target)
track_memory("After loss")
loss.backward()
track_memory("After backward")
optimizer.step()
track_memory("After optimizer step")
optimizer.zero_grad()
track_memory("After zero_grad")
Output interpretation:
Start - Allocated: 2.50 GB, Reserved: 2.75 GB
After data to GPU - Allocated: 2.62 GB, Reserved: 2.75 GB
After forward - Allocated: 4.80 GB, Reserved: 5.00 GB ← Activations
After loss - Allocated: 4.81 GB, Reserved: 5.00 GB
After backward - Allocated: 7.20 GB, Reserved: 7.50 GB ← Gradients
After optimizer step - Allocated: 7.20 GB, Reserved: 7.50 GB
After zero_grad - Allocated: 4.70 GB, Reserved: 7.50 GB ← Gradients freed
Key insights:
- Allocated = actual memory used
- Reserved = memory held by allocator (may be > allocated due to caching)
- Large jump after forward = activations (consider gradient checkpointing)
- Large jump after backward = gradients (same size as parameters)
- Reserved stays high = memory fragmentation or caching
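A small context manager makes this per-phase tracking reusable. The sketch below reports how allocated and peak memory change inside any with-block; note that it resets the global peak counter, so use it for ad-hoc inspection rather than inside another peak measurement.
from contextlib import contextmanager
@contextmanager
def memory_delta(label):
    """Report the change in allocated memory and the peak reached inside the block."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    yield
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    peak = torch.cuda.max_memory_allocated()
    print(f"{label:20s}: Δallocated {(after - before) / 1e9:+.2f} GB, "
          f"peak inside block {peak / 1e9:.2f} GB")
# Usage:
# with memory_delta("forward"):
#     output = model(data)
# with memory_delta("backward"):
#     loss.backward()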
Step 2: Peak Memory Analysis
def analyze_peak_memory(model, batch, target):
"""Find peak memory usage and what causes it"""
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
# Run one iteration
data, target = batch.cuda(), target.cuda()
output = model(data)
forward_peak = torch.cuda.max_memory_allocated() / 1e9
loss = criterion(output, target)
loss_peak = torch.cuda.max_memory_allocated() / 1e9
loss.backward()
backward_peak = torch.cuda.max_memory_allocated() / 1e9
optimizer.step()
optimizer_peak = torch.cuda.max_memory_allocated() / 1e9
optimizer.zero_grad()
final_peak = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak after forward: {forward_peak:.2f} GB")
print(f"Peak after loss: {loss_peak:.2f} GB")
print(f"Peak after backward: {backward_peak:.2f} GB")
print(f"Peak after optimizer: {optimizer_peak:.2f} GB")
print(f"Overall peak: {final_peak:.2f} GB")
# Identify bottleneck
if forward_peak > backward_peak * 0.8:
print("\n⚠️ Activations dominate memory usage")
print(" Consider: Gradient checkpointing, smaller batch size")
elif backward_peak > forward_peak * 1.5:
print("\n⚠️ Gradients dominate memory usage")
print(" Consider: Gradient accumulation, mixed precision")
else:
print("\n✅ Memory usage balanced across phases")
return {
'forward': forward_peak,
'backward': backward_peak,
'optimizer': optimizer_peak,
'peak': final_peak
}
Step 3: Detailed Memory Summary
def print_memory_summary():
"""Print detailed memory breakdown"""
print(torch.cuda.memory_summary())
"""
Example output:
|===========================================================================|
| PyTorch CUDA memory summary |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 |
| Metric | Cur Usage | Peak Usage | Alloc Retries | # Allocs |
|----------------|------------|------------|---------------|---------------|
| Allocated | 4.50 GB | 7.20 GB | 0 | 15234 |
| Reserved | 7.50 GB | 7.50 GB | 0 | 1523 |
| Active | 4.50 GB | 7.20 GB | | |
| Inactive | 3.00 GB | 0.30 GB | | |
|===========================================================================|
Allocated memory: 4.50 GB ← Actual tensors
Reserved memory: 7.50 GB ← Memory held by allocator
Active allocations: 4.50 GB ← Currently in use
Inactive allocations: 3.00 GB ← Cached for reuse (fragmentation)
"""
# If Inactive >> 0, memory fragmentation is occurring
# Periodic torch.cuda.empty_cache() may help
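If fragmentation keeps showing up, a small guard like the sketch below (the helper name and threshold are our own) checks the reserved/allocated ratio and releases cached blocks when it grows too large.
def maybe_release_cached_memory(threshold=1.5):
    """Return cached blocks to the driver when reserved memory far exceeds allocated memory."""
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    if allocated == 0:
        return
    ratio = reserved / allocated
    if ratio > threshold:
        print(f"Reserved/allocated ratio {ratio:.2f} > {threshold} - calling empty_cache()")
        torch.cuda.empty_cache()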
Step 4: Memory Snapshot (PyTorch 2.0+)
import pickle
import torch.cuda
def capture_memory_snapshot(filename="memory_snapshot.pickle"):
"""Capture detailed memory snapshot for analysis"""
# Enable memory history tracking
torch.cuda.memory._record_memory_history(max_entries=100000)
try:
# Run your training code here
for i, (data, target) in enumerate(dataloader):
if i >= 5:
break
data, target = data.cuda(), target.cuda()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# Capture snapshot
torch.cuda.memory._dump_snapshot(filename)
print(f"✅ Memory snapshot saved to {filename}")
finally:
# Disable tracking
torch.cuda.memory._record_memory_history(enabled=None)
print(f"\nAnalyze with:")
print(f" python -m torch.cuda._memory_viz trace_plot {filename}")
print(f" # Opens interactive visualization in browser")
Memory snapshot visualization shows:
- Allocation timeline
- Stack traces for each allocation
- Memory leaks (allocations never freed)
- Fragmentation patterns
- Peak memory events
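On recent PyTorch versions (roughly 2.1+), the profiler can also render an HTML memory timeline directly, which complements the pickle snapshot above. A minimal sketch, assuming model, data, target, and criterion from the surrounding examples:
from torch.profiler import profile, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True
) as prof:
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
prof.export_memory_timeline("memory_timeline.html", device="cuda:0")
print("Open memory_timeline.html in a browser to see memory by category over time")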
GPU Timing Best Practices
CUDA Synchronization and Events
❌ WRONG: Using time.time() for GPU operations
import time
# WRONG - measures CPU time, not GPU time!
start = time.time()
output = model(data) # Kernel launches, returns immediately
end = time.time()
print(f"Time: {end - start:.4f}s") # ❌ This is kernel launch overhead (~microseconds)
# Problem: CUDA operations are asynchronous
# time.time() measures CPU time (when kernel was launched)
# Not GPU time (when kernel actually executed)
Why this is wrong:
- CUDA kernel launches are asynchronous (return immediately to CPU)
- time.time() measures CPU wall-clock time
- Actual GPU execution happens later, in parallel with the CPU
- Measured time is kernel launch overhead (microseconds), not execution time
✅ CORRECT: Using CUDA Events
# CORRECT - measures actual GPU execution time
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
output = model(data)
end_event.record()
# Wait for GPU to finish
torch.cuda.synchronize()
# Get elapsed time in milliseconds
elapsed_time_ms = start_event.elapsed_time(end_event)
print(f"GPU Time: {elapsed_time_ms:.2f} ms")
Why this is correct:
- CUDA Events are GPU-native timing
- record() inserts timing markers into the GPU stream
- synchronize() waits for the GPU to complete
- elapsed_time() returns actual GPU execution time
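The event pattern is easy to wrap in a small reusable timer. The class below is our own helper, not a PyTorch API; elapsed_ms is valid only after the with-block exits.
class CudaTimer:
    """Context manager around the CUDA Event pattern above."""
    def __enter__(self):
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)
        self.start.record()
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.end.record()
        torch.cuda.synchronize()
        self.elapsed_ms = self.start.elapsed_time(self.end)
        return False
# Usage:
# with CudaTimer() as t:
#     output = model(data)
# print(f"Forward: {t.elapsed_ms:.2f} ms")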
Alternative: Using torch.profiler
# For comprehensive profiling, use torch.profiler instead of manual timing
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CUDA]) as prof:
output = model(data)
print(prof.key_averages().table(sort_by="cuda_time_total"))
# This automatically handles synchronization and provides detailed breakdown
Warmup Iterations
Why warmup is critical:
# First iteration includes:
# 1. CUDA kernel JIT compilation
# 2. cuDNN algorithm selection (benchmark mode)
# 3. Memory pool allocation
# 4. CPU→GPU transfer of model weights (first time)
# Example timing without warmup:
model.eval()
with torch.no_grad():
for i in range(10):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(sample_input)
end.record()
torch.cuda.synchronize()
print(f"Iteration {i}: {start.elapsed_time(end):.2f} ms")
"""
Output:
Iteration 0: 1234.56 ms ← JIT compilation, cuDNN benchmarking
Iteration 1: 987.43 ms ← Still some overhead
Iteration 2: 102.34 ms ← Stabilized
Iteration 3: 101.89 ms ← Stable
Iteration 4: 102.12 ms ← Stable
...
"""
✅ Correct warmup methodology:
def benchmark_with_warmup(model, sample_input, warmup=5, iterations=100):
"""Proper benchmarking with warmup"""
model.eval()
sample_input = sample_input.cuda()
# Warmup iterations (CRITICAL!)
with torch.no_grad():
for _ in range(warmup):
_ = model(sample_input)
# Ensure warmup completed
torch.cuda.synchronize()
# Actual measurement
times = []
with torch.no_grad():
for _ in range(iterations):
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(sample_input)
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
# Report statistics
import numpy as np
times = np.array(times)
print(f"Mean: {times.mean():.2f} ms")
print(f"Std: {times.std():.2f} ms")
print(f"Median: {np.median(times):.2f} ms")
print(f"Min: {times.min():.2f} ms")
print(f"Max: {times.max():.2f} ms")
print(f"P95: {np.percentile(times, 95):.2f} ms")
print(f"P99: {np.percentile(times, 99):.2f} ms")
return times
Warmup rules:
- Minimum 3 iterations, recommend 5-10
- More complex models need more warmup
- Dynamic control flow needs extra warmup
- Report statistics (mean, std, percentiles), not just average
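PyTorch also ships torch.utils.benchmark, which handles warmup, synchronization, and statistics automatically. A minimal sketch, assuming model and sample_input already live on the GPU:
from torch.utils.benchmark import Timer
def benchmark_with_torch_utils(model, sample_input):
    """Time the forward pass with torch.utils.benchmark (handles sync and warmup)."""
    timer = Timer(
        stmt="model(x)",
        globals={'model': model, 'x': sample_input}
    )
    measurement = timer.blocked_autorange(min_run_time=1.0)  # collect ~1s of samples
    print(measurement)  # prints mean/median and spread
    print(f"Median: {measurement.median * 1e3:.2f} ms")
    return measurement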
Bottleneck Identification Patterns
CPU-Bound Bottlenecks
Symptoms:
- Low GPU utilization (<70%)
- High CPU usage
- Data loading time > computation time
- nvidia-smi shows low GPU usage
Diagnostic code:
def diagnose_cpu_bottleneck(model, dataloader):
"""Check if training is CPU-bound"""
# Check GPU utilization
    import os
    import subprocess
result = subprocess.run(
['nvidia-smi', '--query-gpu=utilization.gpu', '--format=csv,noheader,nounits'],
capture_output=True, text=True
)
gpu_util = int(result.stdout.strip())
print(f"GPU Utilization: {gpu_util}%")
if gpu_util < 70:
print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")
# Profile data loading vs compute
data_time, compute_time = profile_dataloader_vs_model(model, dataloader)
if data_time > compute_time:
print("\n🎯 BOTTLENECK: Data loading")
print("Solutions:")
print(" 1. Increase num_workers in DataLoader")
print(f" Current: {dataloader.num_workers}, try: {os.cpu_count()}")
print(" 2. Enable pin_memory=True")
print(" 3. Move data augmentation to GPU (use kornia)")
print(" 4. Cache preprocessed data if dataset is small")
print(" 5. Use faster storage (SSD instead of HDD)")
else:
print("\n🎯 BOTTLENECK: CPU preprocessing")
print("Solutions:")
print(" 1. Move preprocessing to GPU")
print(" 2. Reduce preprocessing complexity")
print(" 3. Batch preprocessing operations")
else:
print("✅ GPU utilization is healthy")
print(" Bottleneck is likely in GPU computation")
return gpu_util
Common solutions:
# Solution 1: Increase num_workers
dataloader = DataLoader(
dataset,
batch_size=32,
num_workers=8, # Increase from default 0
pin_memory=True, # Enable for faster GPU transfer
persistent_workers=True # Keep workers alive between epochs
)
# Solution 2: Move augmentation to GPU
import kornia
class GPUAugmentation(nn.Module):
def __init__(self):
super().__init__()
self.augment = nn.Sequential(
kornia.augmentation.RandomHorizontalFlip(p=0.5),
kornia.augmentation.ColorJitter(0.2, 0.2, 0.2, 0.1),
kornia.augmentation.RandomResizedCrop((224, 224)),
)
def forward(self, x):
return self.augment(x)
# Apply on GPU
gpu_augment = GPUAugmentation().cuda()
for data, target in dataloader:
data = data.cuda()
data = gpu_augment(data) # Augment on GPU
output = model(data)
GPU-Bound Bottlenecks
Symptoms:
- High GPU utilization (>90%)
- Computation time > data loading time
- High CUDA time in profiler
Diagnostic code:
def diagnose_gpu_bottleneck(model, sample_input, target):
"""Identify GPU bottleneck operations"""
with profile(activities=[ProfilerActivity.CUDA]) as prof:
output = model(sample_input)
loss = criterion(output, target)
loss.backward()
# Find top GPU operations
events = prof.key_averages()
cuda_events = [(evt.key, evt.cuda_time_total) for evt in events
if evt.cuda_time_total > 0]
cuda_events.sort(key=lambda x: x[1], reverse=True)
print("Top 10 GPU operations:")
total_time = sum(time for _, time in cuda_events)
for i, (name, time) in enumerate(cuda_events[:10], 1):
percentage = (time / total_time) * 100
print(f"{i:2d}. {name:40s} {time/1000:8.2f} ms ({percentage:5.1f}%)")
# Check for optimization opportunities
top_op = cuda_events[0][0]
if 'conv' in top_op.lower():
print("\n🎯 Bottleneck: Convolution operations")
print("Solutions:")
print(" 1. Check input dimensions for Tensor Core alignment (multiples of 8)")
print(" 2. Use mixed precision (torch.cuda.amp)")
print(" 3. Consider depthwise separable convolutions")
print(" 4. Profile with different batch sizes")
elif 'mm' in top_op.lower() or 'matmul' in top_op.lower():
print("\n🎯 Bottleneck: Matrix multiplication")
print("Solutions:")
print(" 1. Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)")
print(" 2. Use mixed precision")
print(" 3. Check for unnecessary transposes")
elif 'copy' in top_op.lower():
print("\n🎯 Bottleneck: Memory copies")
print("Solutions:")
print(" 1. Check device placement (CPU ↔ GPU transfers)")
print(" 2. Ensure tensors are contiguous")
print(" 3. Reduce explicit .cuda() or .cpu() calls")
return cuda_events
Common solutions:
# Solution 1: Mixed precision
from torch.cuda.amp import autocast
with autocast():
output = model(data)
loss = criterion(output, target)
# 2-3x speedup for large models
# Solution 2: Tensor Core alignment
# Ensure dimensions are multiples of 8 (FP16) or 16 (BF16)
# BAD: (batch=31, seq_len=127, hidden=509)
# GOOD: (batch=32, seq_len=128, hidden=512)
# Solution 3: torch.compile (PyTorch 2.0+)
model = torch.compile(model)
# Automatic kernel fusion and optimization
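To verify that a change such as torch.compile actually helps, re-run the same measurement before and after. A minimal sketch using the benchmark_with_warmup helper from the Warmup Iterations section, assuming model and sample_input from earlier; the first compiled calls are slow because compilation happens lazily, so the warmup iterations matter even more here.
eager_times = benchmark_with_warmup(model, sample_input, warmup=10, iterations=50)
compiled_model = torch.compile(model)  # PyTorch 2.0+
compiled_times = benchmark_with_warmup(compiled_model, sample_input, warmup=10, iterations=50)
speedup = eager_times.mean() / compiled_times.mean()
print(f"torch.compile speedup: {speedup:.2f}x")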
Memory-Bound Bottlenecks
Symptoms:
- Low GPU utilization despite high memory usage
- Large tensor operations dominating time
- Memory bandwidth saturated
Diagnostic code:
def diagnose_memory_bottleneck(model, sample_input):
"""Check if operations are memory-bandwidth limited"""
# Profile memory and compute
with profile(
activities=[ProfilerActivity.CUDA],
profile_memory=True
) as prof:
output = model(sample_input)
# Analyze operations
for evt in prof.key_averages():
if evt.cuda_time_total > 0 and evt.self_cuda_memory_usage > 0:
# Rough FLOP/s estimate
# Memory-bound: low FLOP/s despite high memory usage
# Compute-bound: high FLOP/s
memory_gb = evt.self_cuda_memory_usage / 1e9
time_s = evt.cuda_time_total / 1e6 # µs to s
if memory_gb > 1.0 and time_s > 0.01:
bandwidth = memory_gb / time_s # GB/s
print(f"{evt.key:40s}: {bandwidth:.1f} GB/s")
print("\nIf bandwidth < 500 GB/s, likely memory-bound")
print("Solutions:")
print(" 1. Reduce intermediate tensor sizes")
print(" 2. Use in-place operations where safe")
print(" 3. Tile large operations")
print(" 4. Increase arithmetic intensity (more compute per byte)")
I/O-Bound Bottlenecks
Symptoms:
- Low CPU and GPU utilization
- Long pauses between batches
- Slow disk I/O
Solutions:
# Solution 1: Cache dataset in RAM
from torch.utils.data import Dataset, DataLoader
class CachedDataset(Dataset):
def __init__(self, dataset):
self.cache = [dataset[i] for i in range(len(dataset))]
def __getitem__(self, idx):
return self.cache[idx]
def __len__(self):
return len(self.cache)
# Solution 2: Use SSD storage or RAM disk
# Solution 3: Prefetch data
dataloader = DataLoader(
dataset,
num_workers=8,
prefetch_factor=4, # Prefetch 4 batches per worker
pin_memory=True
)
Common Profiling Mistakes
Mistake 1: Profiling Too Many Iterations
❌ WRONG:
# Profiling 100 epochs - output is massive, unusable
with profile(activities=[ProfilerActivity.CUDA]) as prof:
for epoch in range(100):
for batch in dataloader:
# ... training ...
pass
✅ CORRECT:
# Profile just a few iterations
with profile(
activities=[ProfilerActivity.CUDA],
schedule=schedule(wait=1, warmup=2, active=3, repeat=1)
) as prof:
for epoch in range(1):
for step, batch in enumerate(dataloader):
if step >= 10:
break
# ... training ...
prof.step()
Mistake 2: No Warmup Before Timing
❌ WRONG:
# Including JIT compilation in timing
start = time.time()
output = model(data) # First call - includes JIT overhead
end = time.time()
✅ CORRECT:
# Warmup first
for _ in range(5):
_ = model(data)
torch.cuda.synchronize()
# Now measure
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
end.record()
torch.cuda.synchronize()
Mistake 3: Synchronization Overhead
❌ WRONG:
# Synchronizing in training loop - kills performance!
for batch in dataloader:
output = model(batch)
torch.cuda.synchronize() # ❌ Breaks pipelining!
loss = criterion(output, target)
torch.cuda.synchronize() # ❌ Unnecessary!
loss.backward()
✅ CORRECT:
# Only synchronize for timing/profiling, not in production
for batch in dataloader:
output = model(batch)
loss = criterion(output, target)
loss.backward()
# No synchronization - let GPU pipeline work
When to synchronize:
- Profiling/timing measurements
- Before memory measurements
- Debugging CUDA errors
- NEVER in production training loop
Mistake 4: Wrong Profiling Granularity
❌ WRONG:
# Profiling entire model - too coarse, can't identify bottleneck
with profile() as prof:
output = model(data)
# "Model takes 100ms" - not actionable!
✅ CORRECT:
# Iterative narrowing:
# 1. Profile whole step
# 2. Identify slow phase (forward, backward, optimizer)
# 3. Profile that phase in detail
# 4. Identify specific operation
# Phase 1: Coarse
with record_function("forward"):
output = model(data)
with record_function("backward"):
loss.backward()
# Phase 2: Found forward is slow, profile in detail
with profile() as prof:
output = model(data)
# Now see which layer is slow
# Phase 3: Found layer X is slow, profile that layer
with profile() as prof:
output = model.layer_x(data)
# Now see which operation in layer X is slow
Mistake 5: Ignoring Memory While Profiling Compute
❌ WRONG:
# Only looking at time, ignoring memory
with profile(activities=[ProfilerActivity.CUDA]) as prof:
output = model(data)
✅ CORRECT:
# Profile both compute AND memory
with profile(
activities=[ProfilerActivity.CUDA],
profile_memory=True
) as prof:
output = model(data)
# Check both time and memory
print(prof.key_averages().table(sort_by="cuda_time_total"))
print(prof.key_averages().table(sort_by="self_cuda_memory_usage"))
Mistake 6: Profiling in Wrong Mode
❌ WRONG:
# Profiling in eval mode when you care about training speed
model.eval()
with torch.no_grad():
with profile() as prof:
output = model(data)
# ❌ This doesn't include backward pass!
✅ CORRECT:
# Profile in the mode you actually use
model.train()
with profile() as prof:
output = model(data)
loss = criterion(output, target)
loss.backward() # ✅ Include backward if profiling training
Red Flags - Stop and Profile Systematically
If you catch yourself thinking ANY of these, STOP and follow methodology:
| Red Flag Thought | Reality | What to Do Instead |
|---|---|---|
| "I can see the bottleneck" | 90% of the time your guess is wrong | Profile to confirm, don't guess |
| "User says X is slow, so X is the bottleneck" | User might be wrong about the cause | Verify with profiling |
| "This loop looks inefficient" | Intuition about performance often wrong | Measure it, don't assume |
| "Profiling takes too long" | Profiling saves hours of guessing | 10 minutes of profiling > hours of guessing |
| "Let me just try this optimization" | Premature optimization wastes time | Measure first, optimize second |
| "It's obviously a GPU problem" | Could be CPU, data loading, or I/O | Check GPU utilization first |
| "I'll reduce batch size" | Doesn't address root cause | Diagnose memory bottleneck first |
| "Skip warmup, it's just one iteration" | First iterations have 10-100x overhead | Always warmup, no exceptions |
Critical rules:
- NEVER optimize before profiling
- ALWAYS use warmup iterations
- ALWAYS check GPU utilization before assuming GPU bottleneck
- ALWAYS profile data loading separately from computation
- ALWAYS report statistics (mean, std, percentiles), not just average
- ALWAYS use CUDA Events for GPU timing, never time.time()
Common Rationalizations (Don't Do These)
| Excuse | What Really Happens | Correct Approach |
|---|---|---|
| "User seems rushed, skip profiling" | Guessing wastes MORE time than profiling | 10 min profiling saves hours |
| "I already profiled once" | Might have used wrong tool or granularity | Re-profile with systematic methodology |
| "Profiling overhead will skew results" | Use schedule to minimize overhead | schedule(wait=1, warmup=2, active=3) |
| "This worked on another model" | Different models have different bottlenecks | Profile THIS model, not assumptions |
| "Documentation says X is slow" | Depends on context, hardware, data | Verify with profiling on YOUR setup |
| "Just trust the profiler output" | Must interpret correctly | Understand what metrics mean |
| "The model is the bottleneck" | Often it's data loading | Always check data loading vs compute |
Complete Profiling Example
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity, schedule
from torch.utils.data import DataLoader
import numpy as np
def comprehensive_profiling(model, dataloader, device='cuda'):
"""
Complete profiling workflow following systematic methodology
"""
print("=" * 80)
print("PHASE 1: ESTABLISH BASELINE")
print("=" * 80)
# Step 1: Measure baseline performance
model = model.to(device)
model.train()
# Warmup
print("\nWarming up (5 iterations)...")
for i, (data, target) in enumerate(dataloader):
if i >= 5:
break
data, target = data.to(device), target.to(device)
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.synchronize()
# Measure baseline
print("\nMeasuring baseline (10 iterations)...")
times = []
for i, (data, target) in enumerate(dataloader):
if i >= 10:
break
data, target = data.to(device), target.to(device)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
end.record()
torch.cuda.synchronize()
times.append(start.elapsed_time(end))
times = np.array(times)
print(f"\nBaseline Performance:")
print(f" Mean: {times.mean():.2f} ms/iteration")
print(f" Std: {times.std():.2f} ms")
print(f" Median: {np.median(times):.2f} ms")
print(f" P95: {np.percentile(times, 95):.2f} ms")
# -------------------------------------------------------------------------
print("\n" + "=" * 80)
print("PHASE 2: IDENTIFY BOTTLENECK TYPE")
print("=" * 80)
# Check GPU utilization
import subprocess
result = subprocess.run(
['nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
'--format=csv,noheader,nounits'],
capture_output=True, text=True
)
gpu_util, mem_used = result.stdout.strip().split(',')
print(f"\nGPU Utilization: {gpu_util}%")
print(f"GPU Memory Used: {mem_used} MB")
if int(gpu_util) < 70:
print("⚠️ LOW GPU UTILIZATION - likely CPU-bound")
else:
print("✅ GPU utilization healthy - likely GPU-bound")
# Profile data loading vs computation
print("\nProfiling data loading vs computation...")
data_times = []
compute_times = []
batch_iter = iter(dataloader)
    import time
    for i in range(20):
# Data loading time
data_start = time.time()
data, target = next(batch_iter)
data_time = time.time() - data_start
data_times.append(data_time * 1000) # to ms
# Computation time
data, target = data.to(device), target.to(device)
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
end.record()
torch.cuda.synchronize()
compute_times.append(start.elapsed_time(end))
avg_data = np.mean(data_times)
avg_compute = np.mean(compute_times)
print(f"\nData loading: {avg_data:.2f} ms")
print(f"Computation: {avg_compute:.2f} ms")
if avg_data > avg_compute:
print("🎯 BOTTLENECK: Data loading (CPU-bound)")
else:
print("🎯 BOTTLENECK: Model computation (GPU-bound)")
# -------------------------------------------------------------------------
print("\n" + "=" * 80)
print("PHASE 3: NARROW TO COMPONENT")
print("=" * 80)
# Profile training phases
print("\nProfiling training phases...")
data, target = next(iter(dataloader))
data, target = data.to(device), target.to(device)
# Forward
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(data)
loss = criterion(output, target)
end.record()
torch.cuda.synchronize()
forward_time = start.elapsed_time(end)
# Backward
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
loss.backward()
end.record()
torch.cuda.synchronize()
backward_time = start.elapsed_time(end)
# Optimizer
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
optimizer.step()
optimizer.zero_grad()
end.record()
torch.cuda.synchronize()
optimizer_time = start.elapsed_time(end)
total = forward_time + backward_time + optimizer_time
print(f"\nPhase breakdown:")
print(f" Forward: {forward_time:7.2f} ms ({forward_time/total*100:5.1f}%)")
print(f" Backward: {backward_time:7.2f} ms ({backward_time/total*100:5.1f}%)")
print(f" Optimizer: {optimizer_time:7.2f} ms ({optimizer_time/total*100:5.1f}%)")
# -------------------------------------------------------------------------
print("\n" + "=" * 80)
print("PHASE 4: IDENTIFY OPERATION")
print("=" * 80)
# Detailed profiling
print("\nRunning detailed profiler...")
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
schedule=schedule(wait=1, warmup=2, active=3, repeat=1),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./profiler_logs'),
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
for step, (data, target) in enumerate(dataloader):
if step >= 10:
break
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = criterion(output, target)
loss.backward()
optimizer.step()
prof.step()
# Print summary
print("\nTop operations by CUDA time:")
print(prof.key_averages().table(
sort_by="cuda_time_total",
row_limit=15,
max_src_column_width=60
))
print("\nTop operations by memory:")
print(prof.key_averages().table(
sort_by="self_cuda_memory_usage",
row_limit=10,
max_src_column_width=60
))
prof.export_chrome_trace("detailed_trace.json")
print("\n" + "=" * 80)
print("PROFILING COMPLETE")
print("=" * 80)
print("\nNext steps:")
print(" 1. Review top operations in table above")
print(" 2. Open chrome://tracing and load detailed_trace.json")
print(" 3. Or view in TensorBoard: tensorboard --logdir=./profiler_logs")
print(" 4. Focus optimization on identified bottleneck")
print(" 5. Re-run this profiling after optimization to verify improvement")
return {
'baseline_ms': times.mean(),
'gpu_utilization': int(gpu_util),
'data_loading_ms': avg_data,
'computation_ms': avg_compute,
'forward_ms': forward_time,
'backward_ms': backward_time,
'optimizer_ms': optimizer_time,
}
Memory Profiling Complete Example
def profile_memory_usage(model, sample_batch, sample_target):
"""Comprehensive memory profiling"""
print("=" * 80)
print("MEMORY PROFILING")
print("=" * 80)
device = next(model.parameters()).device
# Reset memory stats
torch.cuda.reset_peak_memory_stats()
torch.cuda.empty_cache()
def print_memory(stage):
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
peak = torch.cuda.max_memory_allocated() / 1e9
print(f"{stage:30s} | Alloc: {allocated:5.2f} GB | "
f"Reserved: {reserved:5.2f} GB | Peak: {peak:5.2f} GB")
print("\nMemory tracking:")
print("-" * 80)
print_memory("Initial")
# Move data to GPU
data = sample_batch.to(device)
target = sample_target.to(device)
print_memory("After data to GPU")
# Forward pass
output = model(data)
print_memory("After forward")
# Loss
loss = criterion(output, target)
print_memory("After loss")
# Backward
loss.backward()
print_memory("After backward")
# Optimizer
optimizer.step()
print_memory("After optimizer step")
# Zero grad
optimizer.zero_grad()
print_memory("After zero_grad")
# Final peak
peak_memory = torch.cuda.max_memory_allocated() / 1e9
print("-" * 80)
print(f"\nPeak memory usage: {peak_memory:.2f} GB")
# Detailed summary
print("\n" + "=" * 80)
print("DETAILED MEMORY SUMMARY")
print("=" * 80)
print(torch.cuda.memory_summary())
# Memory breakdown
print("\n" + "=" * 80)
print("MEMORY OPTIMIZATION SUGGESTIONS")
print("=" * 80)
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
if reserved > allocated * 1.5:
print("⚠️ Memory fragmentation detected")
print(f" Reserved: {reserved:.2f} GB, Allocated: {allocated:.2f} GB")
print(" Suggestion: Call torch.cuda.empty_cache() periodically")
# Estimate memory components
param_memory = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e9
print(f"\nModel parameters: {param_memory:.2f} GB")
# Estimate gradients (same size as parameters)
print(f"Gradients (estimate): {param_memory:.2f} GB")
    # Optimizer states (Adam keeps exp_avg + exp_avg_sq, roughly 2x parameter memory)
    optimizer_memory = param_memory * 2 if isinstance(optimizer, torch.optim.Adam) else 0.0
    if optimizer_memory > 0:
        print(f"Optimizer states (Adam): {optimizer_memory:.2f} GB")
    # Activations (peak - parameters - gradients - optimizer states)
    activation_memory = peak_memory - 2 * param_memory - optimizer_memory
if activation_memory > 0:
print(f"Activations (estimate): {activation_memory:.2f} GB")
if activation_memory > peak_memory * 0.5:
print("\n🎯 Activations dominate memory usage")
print(" Suggestions:")
print(" 1. Use gradient checkpointing")
print(" 2. Reduce batch size")
print(" 3. Use mixed precision (FP16/BF16)")
return {
'peak_gb': peak_memory,
'parameters_gb': param_memory,
'fragmentation_ratio': reserved / allocated if allocated > 0 else 1.0
}
Profiling Checklist
Before claiming you've profiled the code, verify:
Baseline established
- Defined performance metric (throughput/latency/memory)
- Measured with CUDA Events (not time.time())
- Used 5+ warmup iterations
- Reported statistics (mean, std, percentiles)
- Documented measurement conditions
Bottleneck type identified
- Checked GPU utilization (nvidia-smi)
- Profiled data loading vs computation separately
- Categorized as CPU-bound, GPU-bound, memory-bound, or I/O-bound
- Verified category with profiling data (not guessing)
Component identified
- Profiled training phases (forward/backward/optimizer)
- Identified which phase is slowest
- Used iterative narrowing approach
- Examined both table and trace view
Operation identified
- Profiled bottleneck component in detail
- Found specific operation or pattern
- Understand WHY it's slow (not just WHAT is slow)
- Have actionable optimization target
Verification ready
- Saved baseline measurements
- Know how to re-run profiling after optimization
- Can verify if optimization actually helped
- Have profiling artifacts (traces, summaries)
References
PyTorch Profiling Documentation:
- torch.profiler: https://pytorch.org/docs/stable/profiler.html
- Profiling recipe: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- Performance tuning guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
Related Skills:
- tensor-operations-and-memory (memory leak debugging, operation optimization)
- mixed-precision-and-optimization (AMP profiling, Tensor Core utilization)
- distributed-training-strategies (multi-GPU profiling)
Tools:
- Chrome tracing: chrome://tracing
- TensorBoard profiler: tensorboard --logdir=
- NVIDIA Nsight Systems: nsys profile python train.py
- PyTorch Memory Visualizer: python -m torch.cuda._memory_viz