| name | loop-vectorizer |
| description | Convert Python loops to vectorized PyTorch tensor operations for performance. This skill should be used when optimizing computational bottlenecks in PyTorch code during Phase 4 performance optimization. |
Loop Vectorizer
Convert inefficient Python loops into fast vectorized PyTorch tensor operations, achieving significant performance improvements.
Purpose
Python loops process one element at a time in the interpreter, while PyTorch tensor operations run the whole computation in optimized C++/CUDA kernels (especially fast on GPU). This skill systematically identifies loops that can be vectorized and converts them into efficient tensor operations.
When to Use
Use this skill when:
- Profiling reveals loop bottlenecks
- Processing large batches or datasets
- Phase 4 (Performance Optimization) tasks
- Code has nested loops over arrays/tensors
- GPU utilization is low
Performance Impact
Typical speedups (rough orders of magnitude; actual gains depend on tensor size and hardware — the short benchmark after this list shows how to measure them yourself):
- Simple loops: 10-100x faster
- Nested loops: 100-1000x faster
- With GPU: 100-10000x faster
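A minimal sketch for checking these numbers on your own hardware, assuming a simple element-wise workload (the data size and expression here are illustrative placeholders, not taken from PRISM code):
import time
import torch

data = torch.randn(100_000)

# Python loop version
start = time.time()
loop_result = torch.tensor([x.item() ** 2 + 3 * x.item() + 1 for x in data])
loop_time = time.time() - start

# Vectorized version
start = time.time()
vec_result = data ** 2 + 3 * data + 1
vec_time = time.time() - start

print(f"Loop: {loop_time:.3f}s, vectorized: {vec_time*1000:.3f}ms, speedup: {loop_time / vec_time:.0f}x")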
Vectorization Patterns
Pattern 1: Element-wise Operations
# Before - Python loop (SLOW)
result = []
for x in data:
    result.append(x ** 2 + 3 * x + 1)
result = torch.tensor(result)
# After - Vectorized (FAST)
result = data ** 2 + 3 * data + 1
# Speedup: ~100x
Pattern 2: Batch Processing
# Before - Loop over batch (SLOW)
outputs = []
for i in range(len(images)):
    output = model(images[i].unsqueeze(0))
    outputs.append(output)
outputs = torch.cat(outputs)
# After - Batch processing (FAST)
outputs = model(images) # Process entire batch at once
# Speedup: ~50-100x (+ GPU optimization)
Pattern 3: Conditional Operations
# Before - Loop with condition (SLOW)
result = []
for x in data:
    if x > threshold:
        result.append(x * 2)
    else:
        result.append(x / 2)
result = torch.tensor(result)
# After - Vectorized with where (FAST)
result = torch.where(data > threshold, data * 2, data / 2)
# Speedup: ~50x
Pattern 4: Reductions
# Before - Loop for sum/mean (SLOW)
total = 0
for x in data:
    total += x ** 2
mean_square = total / len(data)
# After - Vectorized reduction (FAST)
mean_square = (data ** 2).mean()
# Or for multiple operations:
squared = data ** 2
total = squared.sum()
mean_square = squared.mean()
# Speedup: ~100x
Pattern 5: Matrix Operations
# Before - Nested loops for matrix multiply (VERY SLOW)
result = torch.zeros(m, n)
for i in range(m):
    for j in range(n):
        for k in range(p):
            result[i, j] += A[i, k] * B[k, j]
# After - Matrix multiplication (VERY FAST)
result = A @ B # or torch.matmul(A, B)
# Speedup: ~1000x (10000x on GPU)
Pattern 6: Broadcasting
# Before - Loop to apply operation (SLOW)
result = torch.zeros(B, C, H, W)
for b in range(B):
    for c in range(C):
        result[b, c] = image[b, c] * mask  # mask is [H, W]
# After - Broadcasting (FAST)
result = image * mask # Automatically broadcasts [H,W] to [B,C,H,W]
# Speedup: ~100x
PRISM-Specific Vectorizations
Telescope Sampling Loop
# Before - Loop over measurement positions (SLOW)
measurements = []
for center in centers: # 100 positions
    mask = telescope.create_mask(center, radius)  # [H, W]
    measurement = image * mask  # [1, 1, H, W] * [H, W]
    measurements.append(measurement)
measurements = torch.stack(measurements) # [100, 1, 1, H, W]
# After - Vectorized mask creation (FAST)
def create_all_masks(centers: list[tuple], radius: float, H: int, W: int) -> torch.Tensor:
    """Create all masks at once [N, H, W]."""
    # Centered pixel grid (coordinates relative to the image center)
    y, x = torch.meshgrid(
        torch.arange(H).float() - H // 2,
        torch.arange(W).float() - W // 2,
        indexing='ij'
    )  # each [H, W]
    # Stack center coordinates [N, 2]
    centers_t = torch.tensor(centers, dtype=torch.float32)  # [N, 2]
    # Broadcast to compute distances [N, H, W]
    dy = y[None, :, :] - centers_t[:, 0, None, None]  # [N, H, W]
    dx = x[None, :, :] - centers_t[:, 1, None, None]  # [N, H, W]
    dist = torch.sqrt(dy**2 + dx**2)  # [N, H, W]
    masks = (dist <= radius).float()  # [N, H, W]
    return masks
# Use vectorized masks
H, W = image.shape[-2:]
masks = create_all_masks(centers, radius, H, W)  # [N, H, W]
measurements = image * masks.unsqueeze(1)  # [1, 1, H, W] * [N, 1, H, W] -> [N, 1, H, W]
# Speedup: ~100x for 100 positions
FFT on Multiple Images
# Before - Loop over images (SLOW)
freq_images = []
for img in images: # 100 images
    freq = torch.fft.fft2(img)
    freq_images.append(freq)
freq_images = torch.stack(freq_images)
# After - Batch FFT (FAST)
freq_images = torch.fft.fft2(images) # Process all at once
# Speedup: ~50x (the FFT itself is already optimized; batching removes per-call Python overhead)
Loss Computation Over Measurements
# Before - Loop over measurements (SLOW)
total_loss = 0
for i, (mask, target) in enumerate(zip(masks, targets)):
    pred_masked = prediction * mask
    loss = criterion(pred_masked, target)
    total_loss += loss
# After - Vectorized loss (FAST)
# Broadcast prediction [C, H, W] with masks [N, H, W]
pred_masked = prediction.unsqueeze(0) * masks.unsqueeze(1)  # [1, C, H, W] * [N, 1, H, W] -> [N, C, H, W]
# Compute all losses at once (use reduction='none' on the criterion for per-element losses)
losses = criterion(pred_masked, targets)  # element-wise or scalar, depending on reduction
# Aggregate
total_loss = losses.sum() # or .mean()
# Speedup: ~50x
Advanced Vectorization
Einstein Summation (einsum)
For complex tensor operations:
# Before - Nested loops (VERY SLOW)
# Compute: output[b, o, h, w] = sum_i input[b, i, h, w] * weight[o, i]
output = torch.zeros(B, O, H, W)
for b in range(B):
    for o in range(O):
        for i in range(I):
            output[b, o] += input[b, i] * weight[o, i]
# After - einsum (VERY FAST)
output = torch.einsum('bihw,oi->bohw', input, weight)
# Speedup: ~1000x
Common einsum patterns:
# Matrix multiply: C = A @ B
C = torch.einsum('ik,kj->ij', A, B)
# Batch matrix multiply
C = torch.einsum('bik,bkj->bij', A, B)
# Trace
trace = torch.einsum('ii->', A)
# Diagonal
diag = torch.einsum('ii->i', A)
# Outer product
outer = torch.einsum('i,j->ij', a, b)
# Attention weights: softmax(Q @ K^T / sqrt(d)) @ V
attn = torch.einsum('bqd,bkd->bqk', Q, K) # Q @ K^T
scores = torch.softmax(attn / math.sqrt(d), dim=-1)
output = torch.einsum('bqk,bkd->bqd', scores, V) # scores @ V
Gather and Scatter Operations
For indexed operations:
# Before - Loop with indexing (SLOW)
output = torch.zeros(N, D)
for i, idx in enumerate(indices):
    output[i] = data[idx]
# After - Gather (FAST)
output = data[indices] # Advanced indexing
# Or with torch.gather (the index must have the same number of dims as data):
output = torch.gather(data, dim=0, index=indices.unsqueeze(1).expand(-1, D))
# Speedup: ~50x
Masked Operations
# Before - Loop with condition (SLOW)
valid_data = []
for i, (x, m) in enumerate(zip(data, mask)):
    if m:
        valid_data.append(x)
valid_data = torch.stack(valid_data)
# After - Boolean indexing (FAST)
valid_data = data[mask]
# Speedup: ~50x
Vectorization Workflow
Step 1: Identify Loop Bottlenecks
Profile to find slow loops:
import time
start = time.time()
for x in data:
    result = process(x)
elapsed = time.time() - start
print(f"Loop time: {elapsed:.3f}s")
Step 2: Analyze Loop Dependencies
Ask (see the sketch after this list):
Does iteration i depend on iteration i-1?
- No: Can vectorize! ✅
- Yes: May need a different approach
Are operations element-wise or reducible?
- Element-wise: Use broadcasting
- Reduction: Use .sum(), .mean(), etc.
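A minimal sketch of this dependency check, using placeholder data (not PRISM code): the first loop's iterations are independent and vectorize directly; the second carries a running total across iterations and needs a scan-style op such as torch.cumsum rather than plain broadcasting.
import torch

data = torch.randn(1000)

# Independent iterations: each output element uses only data[i] -> vectorize directly
result = data ** 2 + 1

# Dependent iterations: element i needs the result of element i-1 -> use a scan op
running = torch.cumsum(data, dim=0)  # running[i] = data[0] + ... + data[i]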
Step 3: Convert to Tensor Operations
Apply the appropriate pattern (a combined example follows this list):
- Element-wise → Broadcasting
- Nested loops → Matrix ops or einsum
- Conditionals → torch.where() or boolean indexing
- Reductions → .sum(), .mean(), .max(), etc.
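These mappings often combine in a single expression. A small illustrative sketch (placeholder names, not from the codebase) replacing a loop that mixes a conditional with a running sum:
# Loop version: sum of 2*x for all x above a threshold
total = 0.0
for x in data:
    if x > threshold:
        total += 2 * x

# Vectorized: boolean indexing handles the conditional, .sum() the reduction
total = (2 * data[data > threshold]).sum()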
Step 4: Benchmark
Compare before and after:
import time

def benchmark(func, *args, n_runs=100):
    """Return the average runtime per call, in seconds."""
    # Warmup
    for _ in range(10):
        func(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        func(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.time() - start
    return elapsed / n_runs
old_time = benchmark(loop_version, data)
new_time = benchmark(vectorized_version, data)
print(f"Speedup: {old_time / new_time:.1f}x")
print(f"Old: {old_time*1000:.2f}ms, New: {new_time*1000:.2f}ms")
Common Pitfalls
Pitfall 1: Creating Too Many Intermediate Tensors
# Bad - Unnecessary clone, and each line launches a separate kernel
result = data.clone()
result = result + 1
result = result * 2
result = result / 3
# Good - Single expression, no clone (each op still allocates an intermediate;
# use in-place ops or torch.compile when that matters)
result = (data + 1) * 2 / 3
Pitfall 2: Unnecessary CPU-GPU Transfers
# Bad - Transfer in loop
for x in data_cpu:
    x_gpu = x.cuda()
    result = model(x_gpu)
    result_cpu = result.cpu()
# Good - Transfer once
data_gpu = data_cpu.cuda()
results_gpu = model(data_gpu)
results_cpu = results_gpu.cpu()
Pitfall 3: Not Using In-Place Operations
# Bad - Creates new tensor
x = x + 1
# Good - In-place (when safe)
x += 1 # or x.add_(1)
Pitfall 4: Ignoring Memory Usage
# Bad - Creates huge tensor [N, M, H, W]
distances = torch.zeros(N, M, H, W)
for i in range(N):
    for j in range(M):
        distances[i, j] = compute_distance(points[i], points[j])
# Better - Compute in chunks
chunk_size = 100
for i in range(0, N, chunk_size):
    chunk = compute_distances(points[i:i+chunk_size], points)
    process(chunk)
Special Cases: Non-Vectorizable Loops
Some loops cannot be fully vectorized:
Sequential Dependencies
# Cannot vectorize - depends on previous iteration
for i in range(1, len(data)):
    data[i] = data[i] + data[i-1]  # Cumulative sum
# Solution: Use torch.cumsum
data = torch.cumsum(data, dim=0)
Dynamic Shapes
# Hard to vectorize - different sizes
results = []
for x in data:
    # Each x may produce different size output
    result = process_variable_size(x)
    results.append(result)
# Solution: Pad to max size or use packed sequences
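A hedged sketch of the padding approach, assuming each call to the hypothetical process_variable_size returns a 1-D tensor of varying length:
import torch
from torch.nn.utils.rnn import pad_sequence

# Collect the variable-length outputs, then pad them to a common length
outputs = [process_variable_size(x) for x in data]  # list of [L_i] tensors
padded = pad_sequence(outputs, batch_first=True)    # [N, max_L], zero-padded
lengths = torch.tensor([len(o) for o in outputs])   # keep lengths to mask out padding later
The collection loop itself remains, but every downstream operation on padded can then be fully vectorized, using lengths (or a boolean mask) to ignore padded positions.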
Complex Control Flow
# Hard to vectorize - complex branching
for x in data:
    if condition1(x):
        result = path1(x)
    elif condition2(x):
        result = path2(x)
    else:
        result = path3(x)
# Partial solution: Vectorize each path separately
mask1 = condition1(data)
mask2 = condition2(data)
mask3 = ~(mask1 | mask2)
result = torch.zeros_like(data)
result[mask1] = path1(data[mask1])
result[mask2] = path2(data[mask2])
result[mask3] = path3(data[mask3])
Validation Checklist
After vectorization:
- Benchmark shows speedup
- Results are identical (or within numerical precision)
- Memory usage is acceptable (a quick check follows this list)
- Code is more readable (or at least not worse)
- GPU utilization improved (if using GPU)
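For the memory item, a small hedged check using PyTorch's CUDA memory statistics (guarded so it is skipped on CPU-only machines; vectorized_version is the same placeholder used in the verification below):
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    result = vectorized_version(data.cuda())
    peak_mb = torch.cuda.max_memory_allocated() / 1e6
    print(f"Peak GPU memory: {peak_mb:.1f} MB")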
Verification
Always verify correctness:
# Verify vectorized version matches loop version
torch.manual_seed(42)
data = torch.randn(100, 256, 256)
result_loop = loop_version(data)
result_vectorized = vectorized_version(data)
assert torch.allclose(result_loop, result_vectorized, rtol=1e-5)
print("✓ Results match!")