| name | loop-vectorizer |
| description | Convert Python loops to vectorized PyTorch tensor operations for performance. This skill should be used when optimizing computational bottlenecks in PyTorch code during Phase 4 performance optimization. |
Loop Vectorizer
Convert inefficient Python loops into fast vectorized PyTorch tensor operations, achieving significant performance improvements.
Purpose
Python loops process one element at a time in the interpreter, while PyTorch tensor operations run the whole computation in optimized C++/CUDA kernels (especially fast on GPU). This skill systematically identifies loops that can be vectorized and converts them into efficient tensor operations.
When to Use
Use this skill when:
- Profiling reveals loop bottlenecks
- Processing large batches or datasets
- Phase 4 (Performance Optimization) tasks
- Code has nested loops over arrays/tensors
- GPU utilization is low
Performance Impact
Typical speedups (rough orders of magnitude; actual gains depend on tensor size and hardware — the short benchmark after this list shows how to measure them yourself):
- Simple loops: 10-100x faster
- Nested loops: 100-1000x faster
- With GPU: 100-10000x faster
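A minimal sketch for checking these numbers on your own hardware, assuming a simple element-wise workload (the data size and expression here are illustrative placeholders, not taken from PRISM code):
import time
import torch

data = torch.randn(100_000)

# Python loop version
start = time.time()
loop_result = torch.tensor([x.item() ** 2 + 3 * x.item() + 1 for x in data])
loop_time = time.time() - start

# Vectorized version
start = time.time()
vec_result = data ** 2 + 3 * data + 1
vec_time = time.time() - start

print(f"Loop: {loop_time:.3f}s, vectorized: {vec_time*1000:.3f}ms, speedup: {loop_time / vec_time:.0f}x")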
Vectorization Patterns
Pattern 1: Element-wise Operations
# Before - Python loop (SLOW)
result = []
for x in data:
    result.append(x ** 2 + 3 * x + 1)
result = torch.tensor(result)
# After - Vectorized (FAST)
result = data ** 2 + 3 * data + 1
# Speedup: ~100x
Pattern 2: Batch Processing
# Before - Loop over batch (SLOW)
outputs = []
for i in range(len(images)):
    output = model(images[i].unsqueeze(0))
    outputs.append(output)
outputs = torch.cat(outputs)
# After - Batch processing (FAST)
outputs = model(images) # Process entire batch at once
# Speedup: ~50-100x (+ GPU optimization)
Pattern 3: Conditional Operations
# Before - Loop with condition (SLOW)
result = []
for x in data:
    if x > threshold:
        result.append(x * 2)
    else:
        result.append(x / 2)
result = torch.tensor(result)
# After - Vectorized with where (FAST)
result = torch.where(data > threshold, data * 2, data / 2)
# Speedup: ~50x
Pattern 4: Reductions
# Before - Loop for sum/mean (SLOW)
total = 0
for x in data:
    total += x ** 2
mean_square = total / len(data)
# After - Vectorized reduction (FAST)
mean_square = (data ** 2).mean()
# Or for multiple operations:
squared = data ** 2
total = squared.sum()
mean_square = squared.mean()
# Speedup: ~100x
Pattern 5: Matrix Operations
# Before - Nested loops for matrix multiply (VERY SLOW)
result = torch.zeros(m, n)
for i in range(m):
    for j in range(n):
        for k in range(p):
            result[i, j] += A[i, k] * B[k, j]
# After - Matrix multiplication (VERY FAST)
result = A @ B # or torch.matmul(A, B)
# Speedup: ~1000x (10000x on GPU)
Pattern 6: Broadcasting
# Before - Loop to apply operation (SLOW)
result = torch.zeros(B, C, H, W)
for b in range(B):
    for c in range(C):
        result[b, c] = image[b, c] * mask  # mask is [H, W]
# After - Broadcasting (FAST)
result = image * mask # Automatically broadcasts [H,W] to [B,C,H,W]
# Speedup: ~100x
PRISM-Specific Vectorizations
Telescope Sampling Loop
# Before - Loop over measurement positions (SLOW)
measurements = []
for center in centers: # 100 positions
    mask = telescope.create_mask(center, radius)  # [H, W]
    measurement = image * mask  # [1, 1, H, W] * [H, W]
    measurements.append(measurement)
measurements = torch.stack(measurements) # [100, 1, 1, H, W]
# After - Vectorized mask creation (FAST)
def create_all_masks(centers: list[tuple], radius: float, H: int, W: int) -> torch.Tensor:
    """Create all masks at once [N, H, W]."""
    # Centered pixel grid (coordinates relative to the image center)
    y, x = torch.meshgrid(
        torch.arange(H).float() - H // 2,
        torch.arange(W).float() - W // 2,
        indexing='ij'
    )  # each [H, W]
    # Stack center coordinates [N, 2]
    centers_t = torch.tensor(centers, dtype=torch.float32)  # [N, 2]
    # Broadcast to compute distances [N, H, W]
    dy = y[None, :, :] - centers_t[:, 0, None, None]  # [N, H, W]
    dx = x[None, :, :] - centers_t[:, 1, None, None]  # [N, H, W]
    dist = torch.sqrt(dy**2 + dx**2)  # [N, H, W]
    masks = (dist <= radius).float()  # [N, H, W]
    return masks
# Use vectorized masks
H, W = image.shape[-2:]
masks = create_all_masks(centers, radius, H, W)  # [N, H, W]
measurements = image * masks.unsqueeze(1)  # [1, 1, H, W] * [N, 1, H, W] -> [N, 1, H, W]
# Speedup: ~100x for 100 positions
FFT on Multiple Images
# Before - Loop over images (SLOW)
freq_images = []
for img in images: # 100 images
    freq = torch.fft.fft2(img)
    freq_images.append(freq)
freq_images = torch.stack(freq_images)
# After - Batch FFT (FAST)
freq_images = torch.fft.fft2(images) # Process all at once
# Speedup: ~50x (the FFT itself is already optimized; batching removes per-call Python overhead)
Loss Computation Over Measurements
# Before - Loop over measurements (SLOW)
total_loss = 0
for i, (mask, target) in enumerate(zip(masks, targets)):
    pred_masked = prediction * mask
    loss = criterion(pred_masked, target)
    total_loss += loss
# After - Vectorized loss (FAST)
# Broadcast prediction [C, H, W] with masks [N, H, W]
pred_masked = prediction.unsqueeze(0) * masks.unsqueeze(1)  # [1, C, H, W] * [N, 1, H, W] -> [N, C, H, W]
# Compute all losses at once (use reduction='none' on the criterion for per-element losses)
losses = criterion(pred_masked, targets)  # element-wise or scalar, depending on reduction
# Aggregate
total_loss = losses.sum() # or .mean()
# Speedup: ~50x
Advanced Vectorization
Einstein Summation (einsum)
For complex tensor operations:
# Before - Nested loops (VERY SLOW)
# Compute: output[b, o, h, w] = sum_i input[b, i, h, w] * weight[o, i]
output = torch.zeros(B, O, H, W)
for b in range(B):
    for o in range(O):
        for i in range(I):
            output[b, o] += input[b, i] * weight[o, i]
# After - einsum (VERY FAST)
output = torch.einsum('bihw,oi->bohw', input, weight)
# Speedup: ~1000x
Common einsum patterns:
# Matrix multiply: C = A @ B
C = torch.einsum('ik,kj->ij', A, B)
# Batch matrix multiply
C = torch.einsum('bik,bkj->bij', A, B)
# Trace
trace = torch.einsum('ii->', A)
# Diagonal
diag = torch.einsum('ii->i', A)
# Outer product
outer = torch.einsum('i,j->ij', a, b)
# Attention weights: softmax(Q @ K^T / sqrt(d)) @ V
attn = torch.einsum('bqd,bkd->bqk', Q, K) # Q @ K^T
scores = torch.softmax(attn / math.sqrt(d), dim=-1)
output = torch.einsum('bqk,bkd->bqd', scores, V) # scores @ V
Gather and Scatter Operations
For indexed operations:
# Before - Loop with indexing (SLOW)
output = torch.zeros(N, D)
for i, idx in enumerate(indices):
    output[i] = data[idx]
# After - Gather (FAST)
output = data[indices] # Advanced indexing
# Or with torch.gather (the index must have the same number of dims as data):
output = torch.gather(data, dim=0, index=indices.unsqueeze(1).expand(-1, D))
# Speedup: ~50x
Masked Operations
# Before - Loop with condition (SLOW)
valid_data = []
for i, (x, m) in enumerate(zip(data, mask)):
    if m:
        valid_data.append(x)
valid_data = torch.stack(valid_data)
# After - Boolean indexing (FAST)
valid_data = data[mask]
# Speedup: ~50x
Vectorization Workflow
Step 1: Identify Loop Bottlenecks
Profile to find slow loops:
import time
start = time.time()
for x in data:
    result = process(x)
elapsed = time.time() - start
print(f"Loop time: {elapsed:.3f}s")
Step 2: Analyze Loop Dependencies
Ask (see the sketch after this list):
Does iteration i depend on iteration i-1?
- No: Can vectorize! ✅
- Yes: May need a different approach
Are operations element-wise or reducible?
- Element-wise: Use broadcasting
- Reduction: Use .sum(), .mean(), etc.
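A minimal sketch of this dependency check, using placeholder data (not PRISM code): the first loop's iterations are independent and vectorize directly; the second carries a running total across iterations and needs a scan-style op such as torch.cumsum rather than plain broadcasting.
import torch

data = torch.randn(1000)

# Independent iterations: each output element uses only data[i] -> vectorize directly
result = data ** 2 + 1

# Dependent iterations: element i needs the result of element i-1 -> use a scan op
running = torch.cumsum(data, dim=0)  # running[i] = data[0] + ... + data[i]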
Step 3: Convert to Tensor Operations
Apply the appropriate pattern (a combined example follows this list):
- Element-wise → Broadcasting
- Nested loops → Matrix ops or einsum
- Conditionals → torch.where() or boolean indexing
- Reductions → .sum(), .mean(), .max(), etc.
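These mappings often combine in a single expression. A small illustrative sketch (placeholder names, not from the codebase) replacing a loop that mixes a conditional with a running sum:
# Loop version: sum of 2*x for all x above a threshold
total = 0.0
for x in data:
    if x > threshold:
        total += 2 * x

# Vectorized: boolean indexing handles the conditional, .sum() the reduction
total = (2 * data[data > threshold]).sum()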
Step 4: Benchmark
Compare before and after:
import time

def benchmark(func, *args, n_runs=100):
    """Return the average runtime per call, in seconds."""
    # Warmup
    for _ in range(10):
        func(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        func(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.time() - start
    return elapsed / n_runs
old_time = benchmark(loop_version, data)
new_time = benchmark(vectorized_version, data)
print(f"Speedup: {old_time / new_time:.1f}x")
print(f"Old: {old_time*1000:.2f}ms, New: {new_time*1000:.2f}ms")
Common Pitfalls
Pitfall 1: Creating Too Many Intermediate Tensors
# Bad - Unnecessary clone, and each line launches a separate kernel
result = data.clone()
result = result + 1
result = result * 2
result = result / 3
# Good - Single expression, no clone (each op still allocates an intermediate;
# use in-place ops or torch.compile when that matters)
result = (data + 1) * 2 / 3
Pitfall 2: Unnecessary CPU-GPU Transfers
# Bad - Transfer in loop
for x in data_cpu:
    x_gpu = x.cuda()
    result = model(x_gpu)
    result_cpu = result.cpu()
# Good - Transfer once
data_gpu = data_cpu.cuda()
results_gpu = model(data_gpu)
results_cpu = results_gpu.cpu()
Pitfall 3: Not Using In-Place Operations
# Bad - Creates new tensor
x = x + 1
# Good - In-place (when safe)
x += 1 # or x.add_(1)
Pitfall 4: Ignoring Memory Usage
# Bad - Creates huge tensor [N, M, H, W]
distances = torch.zeros(N, M, H, W)
for i in range(N):
    for j in range(M):
        distances[i, j] = compute_distance(points[i], points[j])
# Better - Compute in chunks
chunk_size = 100
for i in range(0, N, chunk_size):
    chunk = compute_distances(points[i:i+chunk_size], points)
    process(chunk)
Special Cases: Non-Vectorizable Loops
Some loops cannot be fully vectorized:
Sequential Dependencies
# Cannot vectorize - depends on previous iteration
for i in range(1, len(data)):
    data[i] = data[i] + data[i-1]  # Cumulative sum
# Solution: Use torch.cumsum
data = torch.cumsum(data, dim=0)
Dynamic Shapes
# Hard to vectorize - different sizes
results = []
for x in data:
    # Each x may produce different size output
    result = process_variable_size(x)
    results.append(result)
# Solution: Pad to max size or use packed sequences
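A hedged sketch of the padding approach, assuming each call to the hypothetical process_variable_size returns a 1-D tensor of varying length:
import torch
from torch.nn.utils.rnn import pad_sequence

# Collect the variable-length outputs, then pad them to a common length
outputs = [process_variable_size(x) for x in data]  # list of [L_i] tensors
padded = pad_sequence(outputs, batch_first=True)    # [N, max_L], zero-padded
lengths = torch.tensor([len(o) for o in outputs])   # keep lengths to mask out padding later
The collection loop itself remains, but every downstream operation on padded can then be fully vectorized, using lengths (or a boolean mask) to ignore padded positions.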
Complex Control Flow
# Hard to vectorize - complex branching
for x in data:
    if condition1(x):
        result = path1(x)
    elif condition2(x):
        result = path2(x)
    else:
        result = path3(x)
# Partial solution: Vectorize each path separately
mask1 = condition1(data)
mask2 = condition2(data)
mask3 = ~(mask1 | mask2)
result = torch.zeros_like(data)
result[mask1] = path1(data[mask1])
result[mask2] = path2(data[mask2])
result[mask3] = path3(data[mask3])
Validation Checklist
After vectorization:
- Benchmark shows speedup
- Results are identical (or within numerical precision)
- Memory usage is acceptable (a quick check follows this list)
- Code is more readable (or at least not worse)
- GPU utilization improved (if using GPU)
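For the memory item, a small hedged check using PyTorch's CUDA memory statistics (guarded so it is skipped on CPU-only machines; vectorized_version is the same placeholder used in the verification below):
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    result = vectorized_version(data.cuda())
    peak_mb = torch.cuda.max_memory_allocated() / 1e6
    print(f"Peak GPU memory: {peak_mb:.1f} MB")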
Verification
Always verify correctness:
# Verify vectorized version matches loop version
torch.manual_seed(42)
data = torch.randn(100, 256, 256)
result_loop = loop_version(data)
result_vectorized = vectorized_version(data)
assert torch.allclose(result_loop, result_vectorized, rtol=1e-5)
print("✓ Results match!")