| name | hardware-optimization-strategies |
| description | Optimize inference for GPU/CPU/edge via profiling and hardware-aware acceleration. |
Hardware Optimization Strategies
Overview
This skill provides systematic methodology for optimizing ML model inference performance on specific hardware platforms. Covers GPU optimization (CUDA, TensorRT), CPU optimization (threading, SIMD), and edge optimization (ARM, quantization), with emphasis on profiling-driven optimization and hardware-appropriate technique selection.
Core Principle: Profile first to identify bottlenecks, then apply hardware-specific optimizations. Different hardware calls for different strategies: GPUs benefit from batch-size tuning and operator fusion, CPUs from threading and SIMD, and edge devices from quantization and smaller model architectures.
When to Use
Use this skill when:
- Model inference performance depends on hardware utilization (not just model architecture)
- Need to optimize for specific hardware: NVIDIA GPU, Intel/AMD CPU, ARM edge devices
- Profiling shows model inference is the serving bottleneck (rather than data loading or preprocessing)
- Want to maximize throughput or minimize latency on given hardware
- Deploying to resource-constrained edge devices
- User mentions: "optimize for GPU", "CPU inference slow", "edge deployment", "TensorRT", "ONNX Runtime", "batch size tuning"
Don't use for:
- Training optimization → use training-optimization pack
- Model architecture selection → use neural-architectures
- Model compression (pruning, distillation) → use model-compression-techniques
- Quantization specifically → use quantization-for-inference
- Serving infrastructure → use model-serving-patterns
Boundary with quantization-for-inference:
- This skill covers hardware-aware quantization deployment (INT8 on CPU vs GPU, ARM NEON)
- quantization-for-inference covers quantization techniques (PTQ, QAT, calibration)
- Use both when quantization is part of a hardware optimization strategy
Core Methodology
Step 1: Profile to Identify Bottlenecks
ALWAYS profile before optimizing. Don't guess where time is spent.
PyTorch Profiler (Comprehensive)
import torch
from torch.profiler import profile, ProfilerActivity, record_function
model = load_model().cuda().eval()
# Profile inference
with profile(
activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
with record_function("inference"):
with torch.no_grad():
output = model(input_tensor.cuda())
# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# Export for visualization
prof.export_chrome_trace("trace.json") # View in chrome://tracing
What to look for:
- CPU time high: Data preprocessing, Python overhead, CPU-bound ops
- CUDA time high: Model compute is bottleneck, optimize model inference
- Memory: Check for out-of-memory issues or unnecessary allocations
- Operator breakdown: Which layers/ops are slowest?
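To triage the profile in code rather than by scanning the tables, a short sketch using the prof object from above (attribute names come from the torch.profiler event API; adjust if your PyTorch version differs):
# Top-5 operators by total CUDA time
top_cuda_ops = sorted(prof.key_averages(), key=lambda e: e.cuda_time_total, reverse=True)[:5]
for evt in top_cuda_ops:
    print(f"{evt.key}: {evt.cuda_time_total / 1e3:.2f} ms total CUDA time across {evt.count} calls")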
NVIDIA Profiling Tools
# Nsight Systems - high-level timeline
nsys profile -o output python inference.py
# Nsight Compute - kernel-level profiling
ncu --set full -o kernel_profile python inference.py
# Simple nvidia-smi monitoring
nvidia-smi dmon -s u -i 0 # Monitor GPU utilization
Intel VTune (CPU profiling)
# Profile CPU bottlenecks
vtune -collect hotspots -r vtune_results -- python inference.py
# Analyze results
vtune-gui vtune_results
Simple Timing
import time
import torch
def profile_pipeline(model, input_data, device='cuda'):
"""Profile each stage of inference pipeline"""
# Warmup
for _ in range(10):
with torch.no_grad():
_ = model(input_data.to(device))
if device == 'cuda':
torch.cuda.synchronize() # Critical for accurate GPU timing
# Profile preprocessing
t0 = time.time()
preprocessed = preprocess(input_data)
t1 = time.time()
# Profile model inference
preprocessed = preprocessed.to(device)
if device == 'cuda':
torch.cuda.synchronize()
t2 = time.time()
with torch.no_grad():
output = model(preprocessed)
if device == 'cuda':
torch.cuda.synchronize()
t3 = time.time()
# Profile postprocessing
result = postprocess(output.cpu())
t4 = time.time()
print(f"Preprocessing: {(t1-t0)*1000:.2f}ms")
print(f"Model Inference: {(t3-t2)*1000:.2f}ms")
print(f"Postprocessing: {(t4-t3)*1000:.2f}ms")
print(f"Total: {(t4-t0)*1000:.2f}ms")
return {
'preprocess': (t1-t0)*1000,
'inference': (t3-t2)*1000,
'postprocess': (t4-t3)*1000,
}
Critical: Always use torch.cuda.synchronize() before timing GPU operations, otherwise you measure kernel launch time, not execution time.
Step 2: Select Hardware-Appropriate Optimizations
Based on profiling results and target hardware, select appropriate optimization strategies.
GPU Optimization (NVIDIA CUDA)
Strategy 1: TensorRT (2-5x Speedup for CNNs/Transformers)
When to use:
- NVIDIA GPU (T4, V100, A100, RTX series)
- Model architecture supported (CNN, Transformer, RNN)
- Inference-only workload (not training)
- Want automatic optimization (fusion, precision, kernels)
Best for: Production deployment on NVIDIA GPUs, predictable performance gains
import torch
import torch_tensorrt
# Load PyTorch model
model = load_model().eval().cuda()
# Compile to TensorRT
trt_model = torch_tensorrt.compile(
model,
inputs=[torch_tensorrt.Input(
min_shape=[1, 3, 224, 224], # Minimum batch size
opt_shape=[8, 3, 224, 224], # Optimal batch size
max_shape=[32, 3, 224, 224], # Maximum batch size
dtype=torch.float16
)],
enabled_precisions={torch.float16}, # Use FP16
workspace_size=1 << 30, # 1GB workspace for optimization
truncate_long_and_double=True
)
# Save compiled model
torch.jit.save(trt_model, "model_trt.ts")
# Inference (same API as PyTorch)
with torch.no_grad():
output = trt_model(input_tensor.cuda())
What TensorRT does:
- Operator fusion: Combines conv + bn + relu into single kernel
- Precision calibration: Automatic mixed precision (FP16/INT8)
- Kernel auto-tuning: Selects fastest CUDA kernel for each op
- Memory optimization: Reduces memory transfers
- Graph optimization: Removes unnecessary operations
Limitations:
- Only supports NVIDIA GPUs
- Some custom ops may not be supported
- Compilation time (minutes for large models)
- Fixed input shapes (or min/max range)
Troubleshooting:
# If compilation fails, try:
# 1. Enable verbose logging
import logging
logging.getLogger("torch_tensorrt").setLevel(logging.DEBUG)
# 2. Allow fallback to PyTorch for unsupported layers
trt_model = torch_tensorrt.compile(
    model,
    inputs=[...],
    enabled_precisions={torch.float16},
    require_full_compilation=False  # Unsupported ops fall back to PyTorch execution instead of failing
)
# 3. Check for unsupported ops
torch_tensorrt.logging.set_reportable_log_level(torch_tensorrt.logging.Level.Warning)
Strategy 2: torch.compile() (PyTorch 2.0+ - Easy 1.5-2x Speedup)
When to use:
- PyTorch 2.0+ available
- Want easy optimization without complexity
- Model has custom operations (TensorRT may not support)
- Rapid prototyping (faster than TensorRT compilation)
Best for: Quick wins, development iteration, custom models
import torch
model = load_model().eval().cuda()
# Compile with default backend (inductor)
compiled_model = torch.compile(model)
# Compile with specific mode
compiled_model = torch.compile(
model,
mode="reduce-overhead", # Options: "default", "reduce-overhead", "max-autotune"
fullgraph=True, # Compile entire graph (vs subgraphs)
)
# First run compiles (slow), subsequent runs are fast
with torch.no_grad():
output = compiled_model(input_tensor.cuda())
Modes:
- default: Balanced compilation time and runtime performance
- reduce-overhead: Minimize Python overhead (best for small models)
- max-autotune: Maximum optimization (long compilation, best runtime)
What torch.compile() does:
- Operator fusion: Similar to TensorRT
- Python overhead reduction: Removes Python interpreter overhead
- Memory optimization: Reduces allocations
- CUDA graph generation: For fixed-size models
Advantages over TensorRT:
- Easier to use (one line of code)
- Supports custom operations
- Faster compilation
- No fixed input shapes
Disadvantages vs TensorRT:
- Smaller speedup (1.5-2x vs 2-5x)
- Less mature (newer feature)
Strategy 3: Mixed Precision (FP16 - Easy 2x Speedup)
When to use:
- NVIDIA GPU with Tensor Cores (V100, A100, T4, RTX)
- Model doesn't require FP32 precision
- Want simple optimization (minimal code change)
- Memory-bound models (FP16 uses half the memory)
import torch
from torch.cuda.amp import autocast

model = load_model().eval().cuda().half()  # Convert model weights to FP16

# Inference with FP16 inputs (autocast is redundant once model and inputs are already FP16)
with torch.no_grad():
    output = model(input_tensor.cuda().half())

# Alternative: keep the model in FP32 and let autocast insert FP16 casts automatically
# with torch.no_grad(), autocast():
#     output = model(input_tensor.cuda())
Caution: Some models lose accuracy with FP16. Test accuracy before deploying.
# Validate FP16 accuracy
def validate_fp16_accuracy(model, test_loader, tolerance=0.01):
    import copy
    # .float()/.half() modify the module in place, so compare independent copies
    model_fp32 = copy.deepcopy(model).float()
    model_fp16 = copy.deepcopy(model).half()
diffs = []
for inputs, _ in test_loader:
with torch.no_grad():
output_fp32 = model_fp32(inputs.cuda().float())
output_fp16 = model_fp16(inputs.cuda().half())
diff = (output_fp32 - output_fp16.float()).abs().mean().item()
diffs.append(diff)
avg_diff = sum(diffs) / len(diffs)
print(f"Average FP32-FP16 difference: {avg_diff:.6f}")
if avg_diff > tolerance:
print(f"WARNING: FP16 accuracy loss exceeds tolerance ({tolerance})")
return False
return True
Strategy 4: Batch Size Tuning
When to use: Always! Batch size is the most important parameter for GPU throughput.
Trade-off:
- Larger batch = Higher throughput, higher latency, more memory
- Smaller batch = Lower latency, lower throughput, less memory
Find Optimal Batch Size
def find_optimal_batch_size(model, input_shape, device='cuda', max_memory_pct=0.9):
    """Double the batch size until hitting OOM or the memory budget"""
model = model.to(device).eval()
# Start with batch size 1, increase until OOM
batch_size = 1
max_batch = 1024 # Upper bound
while batch_size < max_batch:
try:
torch.cuda.empty_cache()
test_batch = torch.randn(batch_size, *input_shape).to(device)
with torch.no_grad():
_ = model(test_batch)
            # Check memory usage against total device memory
            total_mem = torch.cuda.get_device_properties(device).total_memory
            mem_fraction = torch.cuda.memory_allocated() / total_mem
            if mem_fraction > max_memory_pct:
                print(f"Batch size {batch_size}: {mem_fraction*100:.1f}% memory (near limit)")
                break
            print(f"Batch size {batch_size}: OK ({mem_fraction*100:.1f}% memory)")
batch_size *= 2
except RuntimeError as e:
if "out of memory" in str(e):
print(f"Batch size {batch_size}: OOM")
batch_size = batch_size // 2
break
else:
raise e
print(f"\nOptimal batch size: {batch_size}")
return batch_size
Measure Latency vs Throughput
def benchmark_batch_sizes(model, input_shape, batch_sizes=[1, 4, 8, 16, 32, 64], device='cuda', num_runs=100):
"""Measure latency and throughput for different batch sizes"""
model = model.to(device).eval()
results = []
for batch_size in batch_sizes:
try:
test_batch = torch.randn(batch_size, *input_shape).to(device)
# Warmup
for _ in range(10):
with torch.no_grad():
_ = model(test_batch)
torch.cuda.synchronize()
# Benchmark
start = time.time()
for _ in range(num_runs):
with torch.no_grad():
_ = model(test_batch)
torch.cuda.synchronize()
elapsed = time.time() - start
latency_per_batch = (elapsed / num_runs) * 1000 # ms
throughput = (batch_size * num_runs) / elapsed # samples/sec
latency_per_sample = latency_per_batch / batch_size # ms/sample
results.append({
'batch_size': batch_size,
'latency_per_batch_ms': latency_per_batch,
'latency_per_sample_ms': latency_per_sample,
'throughput_samples_per_sec': throughput,
})
print(f"Batch {batch_size:3d}: {latency_per_batch:6.2f}ms/batch, "
f"{latency_per_sample:6.2f}ms/sample, {throughput:8.1f} samples/sec")
except RuntimeError as e:
if "out of memory" in str(e):
print(f"Batch {batch_size:3d}: OOM")
break
return results
Decision criteria:
- Online serving (real-time API): Use small batch (1-8) for low latency
- Batch serving: Use large batch (32-128) for high throughput
- Dynamic batching: Let the serving framework accumulate requests (TorchServe, Triton); a selection sketch follows
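A minimal sketch for turning the benchmark above into a decision, assuming a hypothetical latency SLA (latency_sla_ms is an illustrative value, and select_batch_size is a helper introduced here, not part of the original):
def select_batch_size(results, latency_sla_ms=50.0):
    """Pick the highest-throughput batch size whose per-batch latency meets the SLA."""
    feasible = [r for r in results if r['latency_per_batch_ms'] <= latency_sla_ms]
    if not feasible:
        return 1  # Nothing meets the SLA; fall back to batch size 1 and optimize the model itself
    best = max(feasible, key=lambda r: r['throughput_samples_per_sec'])
    return best['batch_size']

results = benchmark_batch_sizes(model, input_shape=(3, 224, 224))
print(f"Selected batch size: {select_batch_size(results, latency_sla_ms=50.0)}")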
Strategy 5: CUDA Graphs (Fixed-Size Inputs - 20-30% Speedup)
When to use:
- Fixed input size (no dynamic shapes)
- Small models with many kernel launches
- Already optimized but want last 20% speedup
What CUDA graphs do: Record sequence of CUDA operations, replay without CPU overhead
import torch
model = load_model().eval().cuda()
# Static input (fixed size)
static_input = torch.randn(8, 3, 224, 224).cuda()
static_output = torch.randn(8, 1000).cuda()
# Warmup
for _ in range(10):
with torch.no_grad():
_ = model(static_input)
# Capture graph
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
with torch.no_grad():
static_output = model(static_input)
# Replay graph (very fast)
def inference_with_graph(input_tensor):
# Copy input to static buffer
static_input.copy_(input_tensor)
# Replay graph
graph.replay()
# Copy output from static buffer
return static_output.clone()
# Benchmark
input_tensor = torch.randn(8, 3, 224, 224).cuda()
output = inference_with_graph(input_tensor)
Limitations:
- Fixed input/output shapes (no dynamic batching)
- No control flow (if/else) in model
- Adds complexity (buffer management)
CPU Optimization (Intel/AMD)
Strategy 1: Threading Configuration (Critical for Multi-Core)
When to use: Always for CPU inference on multi-core machines
Problem: Default thread settings often don't match the deployment machine (container CPU limits, OMP_NUM_THREADS overrides), leaving cores idle or oversubscribed
import torch
# Check current configuration
print(f"Intra-op threads: {torch.get_num_threads()}")
print(f"Inter-op threads: {torch.get_num_interop_threads()}")
# Set to number of physical cores (not hyperthreads)
import os
num_cores = os.cpu_count() // 2 # Divide by 2 if hyperthreading enabled
torch.set_num_threads(num_cores) # Intra-op parallelism (within operations)
torch.set_num_interop_threads(1) # Inter-op parallelism (between operations, disable to avoid oversubscription)
# Verify
print(f"Set intra-op threads: {torch.get_num_threads()}")
Intra-op vs Inter-op:
- Intra-op: Parallelizes single operation (e.g., matrix multiply uses 32 cores)
- Inter-op: Parallelizes independent operations (e.g., run conv1 and conv2 simultaneously)
Best practice:
- Intra-op threads = number of physical cores (enables each op to use all cores)
- Inter-op threads = 1 (disable to avoid oversubscription and context switching)
Warning: If using DataLoader with workers, account for those threads:
num_cores = os.cpu_count() // 2
num_dataloader_workers = 4
torch.set_num_threads(num_cores - num_dataloader_workers) # Leave cores for DataLoader
Strategy 2: MKLDNN/OneDNN Backend (Intel-Optimized Operations)
When to use: Intel CPUs (Xeon, Core i7/i9)
What it does: Uses Intel's optimized math libraries (AVX, AVX-512)
import torch
# Enable MKLDNN
torch.backends.mkldnn.enabled = True
# Check if available
print(f"MKLDNN available: {torch.backends.mkldnn.is_available()}")
# Inference (automatically uses MKLDNN when beneficial)
model = load_model().eval()
with torch.no_grad():
output = model(input_tensor)
For maximum performance: Use channels-last memory format (better cache locality)
model = model.eval()
# Convert to channels-last format (NHWC instead of NCHW)
model = model.to(memory_format=torch.channels_last)
# Input also channels-last
input_tensor = input_tensor.to(memory_format=torch.channels_last)
with torch.no_grad():
output = model(input_tensor)
Speedup: 1.5-2x on Intel CPUs with AVX-512
Strategy 3: ONNX Runtime (Best CPU Performance)
When to use:
- Dedicated CPU inference deployment
- Want best possible CPU performance
- Model is fully supported by ONNX
Advantages:
- Optimized for CPU (MLAS, DNNL, OpenMP)
- Graph optimizations (fusion, constant folding)
- Quantization support (INT8)
import os
import torch
import onnx
import onnxruntime as ort
# Export PyTorch model to ONNX
model = load_model().eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
model,
dummy_input,
"model.onnx",
export_params=True,
opset_version=14,
input_names=['input'],
output_names=['output'],
dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
# Optional: transformer-specific graph optimizations (BERT/GPT-style models only;
# for CNNs, skip this step and rely on ORT_ENABLE_ALL below)
import onnxruntime.transformers.optimizer as optimizer
optimized_model = optimizer.optimize_model("model.onnx", model_type='bert', num_heads=8, hidden_size=512)
optimized_model.save_model_to_file("model_optimized.onnx")
# Create inference session with optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = os.cpu_count() // 2
sess_options.inter_op_num_threads = 1
session = ort.InferenceSession(
"model_optimized.onnx",
sess_options,
providers=['CPUExecutionProvider'] # Use CPU
)
# Inference
input_data = input_tensor.numpy()
output = session.run(None, {'input': input_data})[0]
Expected speedup: 2-3x over PyTorch CPU inference
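To confirm the gain on your own model, a small timing comparison reusing model, session, and input_tensor from the snippets above (time_fn is a helper defined here for illustration; the 2-3x figure is workload- and hardware-dependent):
import time
import torch

def time_fn(fn, runs=100, warmup=10):
    """Average wall-clock latency of fn in milliseconds."""
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(runs):
        fn()
    return (time.time() - start) / runs * 1000

input_np = input_tensor.numpy()
with torch.no_grad():
    pytorch_ms = time_fn(lambda: model(input_tensor))
ort_ms = time_fn(lambda: session.run(None, {'input': input_np}))
print(f"PyTorch CPU: {pytorch_ms:.2f} ms | ONNX Runtime: {ort_ms:.2f} ms | speedup: {pytorch_ms / ort_ms:.2f}x")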
Strategy 4: OpenVINO (Intel-Specific - Best Performance)
When to use: Intel CPUs (Xeon, Core), want absolute best CPU performance
Advantages:
- Intel-specific optimizations (AVX, AVX-512, VNNI)
- Best-in-class CPU inference performance
- Integrated optimization tools
# Convert PyTorch to OpenVINO IR
# First: Export to ONNX (as above)
# Then: Use Model Optimizer
# Command-line conversion
mo --input_model model.onnx --output_dir openvino_model --data_type FP16
# Python API
import numpy as np
from openvino.runtime import Core
# Load model
ie = Core()
model = ie.read_model(model="openvino_model/model.xml")
compiled_model = ie.compile_model(model=model, device_name="CPU")
# Inference
input_tensor = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = compiled_model([input_tensor])[0]
Expected speedup: 3-4x over PyTorch CPU inference on Intel CPUs
Strategy 5: Batch Size for CPU
Different from GPU: Smaller batches are often better on CPU
Why:
- CPU has smaller cache than GPU memory
- Large batches may not fit in cache → cache misses → slower
- Diminishing returns from batching on CPU
Recommendation:
- Start with batch size 1-4
- Profile to find optimal
- Don't assume large batches are better (unlike GPU)
# CPU batch size tuning
def find_optimal_cpu_batch(model, input_shape, max_batch=32):
model = model.eval()
results = []
for batch_size in [1, 2, 4, 8, 16, 32]:
if batch_size > max_batch:
break
test_input = torch.randn(batch_size, *input_shape)
# Warmup
for _ in range(10):
with torch.no_grad():
_ = model(test_input)
# Benchmark
start = time.time()
for _ in range(100):
with torch.no_grad():
_ = model(test_input)
elapsed = time.time() - start
throughput = (batch_size * 100) / elapsed
latency = (elapsed / 100) * 1000 # ms
results.append({
'batch_size': batch_size,
'throughput': throughput,
'latency_ms': latency,
})
print(f"Batch {batch_size}: {throughput:.1f} samples/sec, {latency:.2f}ms latency")
return results
Edge/ARM Optimization
Strategy 1: INT8 Quantization (2-4x Speedup on ARM)
When to use: ARM CPU deployment (Raspberry Pi, mobile, edge devices)
Why INT8 on ARM:
- ARM NEON instructions accelerate INT8 operations
- 2-4x faster than FP32 on ARM CPUs
- 4x smaller model size (critical for edge devices)
import torch
from torch.quantization import quantize_dynamic
# Dynamic quantization (easiest, no calibration)
model = load_model().eval()
quantized_model = quantize_dynamic(
model,
    {torch.nn.Linear, torch.nn.Conv2d},  # Note: dynamic quantization only covers Linear/RNN-style layers; Conv2d is skipped (use static quantization for convs)
dtype=torch.qint8
)
# Save quantized model
torch.save(quantized_model.state_dict(), 'model_int8.pth')
# Inference (same API)
with torch.no_grad():
output = quantized_model(input_tensor)
For better accuracy: Use static quantization with calibration (see quantization-for-inference skill)
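For reference, a minimal eager-mode static-quantization sketch (the full calibration workflow lives in quantization-for-inference; calibration_loader is an assumed source of representative data, 'qnnpack' is the ARM backend and 'fbgemm' the x86 one):
import torch
from torch.quantization import get_default_qconfig, prepare, convert

# Note: eager-mode static quantization requires QuantStub/DeQuantStub in the model's forward
model = load_model().eval()
torch.backends.quantized.engine = 'qnnpack'       # ARM backend ('fbgemm' on x86)
model.qconfig = get_default_qconfig('qnnpack')
prepared = prepare(model)                         # Insert observers
for inputs, _ in calibration_loader:              # Calibrate on representative batches (assumed loader)
    with torch.no_grad():
        prepared(inputs)
quantized_model = convert(prepared)               # Replace observed modules with INT8 versions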
Strategy 2: TensorFlow Lite (Best for ARM/Mobile)
When to use:
- ARM edge devices (Raspberry Pi, Coral, mobile)
- Need maximum ARM performance
- Can convert model to TensorFlow Lite
Advantages:
- XNNPACK backend (ARM NEON optimizations)
- Highly optimized for edge devices
- Delegate support (GPU, NPU on mobile)
import numpy as np
import torch
import tensorflow as tf
# Convert PyTorch to ONNX to TensorFlow to TFLite
# Step 1: PyTorch → ONNX
torch.onnx.export(model, dummy_input, "model.onnx")
# Step 2: ONNX → TensorFlow (use onnx-tf)
from onnx_tf.backend import prepare
import onnx
onnx_model = onnx.load("model.onnx")
tf_rep = prepare(onnx_model)
tf_rep.export_graph("model_tf")
# Step 3: TensorFlow → TFLite with optimizations
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Provide representative dataset for calibration
def representative_dataset():
for _ in range(100):
yield [np.random.randn(1, 3, 224, 224).astype(np.float32)]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
f.write(tflite_model)
Inference with TFLite:
import numpy as np
import tensorflow as tf
# Load TFLite model
interpreter = tf.lite.Interpreter(
model_path="model.tflite",
num_threads=4 # Use all 4 cores on Raspberry Pi
)
interpreter.allocate_tensors()
# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Inference (with full-INT8 conversion the input tensor must be quantized to int8 first)
scale, zero_point = input_details[0]['quantization']
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
input_data = (input_data / scale + zero_point).astype(np.int8)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
Expected speedup: 3-5x over PyTorch on Raspberry Pi
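A small on-device timing loop for the interpreter above (run it on the target device, not the development machine; 100 runs is an arbitrary choice):
import time

interpreter.set_tensor(input_details[0]['index'], input_data)
for _ in range(10):  # Warmup
    interpreter.invoke()

runs = 100
start = time.time()
for _ in range(runs):
    interpreter.invoke()
elapsed = time.time() - start
print(f"TFLite latency: {elapsed / runs * 1000:.2f} ms per inference")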
Strategy 3: ONNX Runtime for ARM
When to use: ARM Linux (Raspberry Pi, Jetson Nano), simpler than TFLite
import onnxruntime as ort
# Export to ONNX (as above)
torch.onnx.export(model, dummy_input, "model.onnx")
# Inference session with ARM optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 4 # Raspberry Pi has 4 cores
sess_options.inter_op_num_threads = 1
session = ort.InferenceSession(
"model.onnx",
sess_options,
providers=['CPUExecutionProvider']
)
# Inference
output = session.run(None, {'input': input_data.numpy()})[0]
Quantize ONNX for ARM:
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic(
"model.onnx",
"model_int8.onnx",
weight_type=QuantType.QInt8
)
Strategy 4: Model Architecture for Edge
When to use: Inference too slow even after quantization
Consider smaller architectures:
- MobileNetV3-Small instead of MobileNetV2
- EfficientNet-Lite instead of EfficientNet
- TinyBERT instead of BERT
Trade-off: Accuracy vs speed. Profile to find acceptable balance.
# Compare architectures on edge device
import time
import torch
from torchvision import models
from torch.quantization import quantize_dynamic

architectures = [
('MobileNetV2', models.mobilenet_v2(pretrained=True)),
('MobileNetV3-Small', models.mobilenet_v3_small(pretrained=True)),
('EfficientNet-B0', models.efficientnet_b0(pretrained=True)),
]
for name, model in architectures:
model = model.eval()
# Quantize
quantized_model = quantize_dynamic(model, {torch.nn.Linear, torch.nn.Conv2d}, dtype=torch.qint8)
# Benchmark
input_tensor = torch.randn(1, 3, 224, 224)
start = time.time()
for _ in range(100):
with torch.no_grad():
_ = quantized_model(input_tensor)
elapsed = time.time() - start
print(f"{name}: {elapsed/100*1000:.2f}ms per inference")
Hardware-Specific Decision Tree
GPU (NVIDIA)
1. Profile with PyTorch profiler or nvidia-smi
↓
2. Is GPU utilization low (<50%)?
YES → Problem:
- Batch size too small → Increase batch size
- CPU preprocessing bottleneck → Move preprocessing to GPU or parallelize
- CPU-GPU transfers → Minimize .cuda()/.cpu() calls
NO → GPU is bottleneck, optimize model
↓
3. Apply optimizations in order:
a. Increase batch size (measure latency/throughput trade-off)
b. Mixed precision FP16 (easy 2x speedup if Tensor Cores available)
c. torch.compile() (easy 1.5-2x speedup, PyTorch 2.0+)
d. TensorRT (2-5x speedup, more effort)
e. CUDA graphs (20-30% speedup for small models)
↓
4. Measure after each optimization
↓
5. If still not meeting requirements:
- Consider quantization (INT8) → see quantization-for-inference skill
- Consider model compression → see model-compression-techniques skill
- Scale horizontally → add more GPU instances
CPU (Intel/AMD)
1. Profile with PyTorch profiler or perf
↓
2. Check threading configuration
- torch.get_num_threads() == num physical cores?
- If not, set torch.set_num_threads(num_cores)
↓
3. Apply optimizations in order:
a. Set intra-op threads to num physical cores
b. Enable MKLDNN (Intel CPUs)
c. Use channels-last memory format
d. Try ONNX Runtime with graph optimizations
e. If Intel CPU: Try OpenVINO (best performance)
↓
4. Measure batch size trade-off (smaller may be better for CPU)
↓
5. If still not meeting requirements:
- Quantize to INT8 → 2-3x speedup on CPU
- Consider model compression
- Scale horizontally
Edge/ARM
1. Profile on target device (Raspberry Pi, etc.)
↓
2. Is inference >100ms per sample?
YES → Model too large for device
- Try smaller architecture (MobileNetV3-Small, EfficientNet-Lite)
- If accuracy allows, use smaller model
NO → Optimize current model
↓
3. Apply optimizations in order:
a. Quantize to INT8 (2-4x speedup on ARM, critical!)
b. Set num_threads to device's CPU cores
c. Convert to TensorFlow Lite with XNNPACK (best ARM performance)
OR use ONNX Runtime with INT8
↓
4. Measure memory usage
- Model fits in RAM?
- If not, must use smaller model or offload to storage
↓
5. If still not meeting requirements:
- Use smaller model architecture
- Consider model pruning
- Hardware accelerator (Coral TPU, Jetson GPU)
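For step 4 of this checklist, a minimal sketch for checking the process memory footprint on a Linux edge device (resource.getrusage reports ru_maxrss in kilobytes on Linux; units differ on macOS):
import resource

def report_peak_memory():
    """Peak resident set size of the current process (Linux: kilobytes)."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"Peak RSS: {peak_kb / 1024:.1f} MB")

# Call after loading the model and running a few inferences
report_peak_memory()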
Common Patterns
Pattern 1: Latency-Critical Online Serving
Requirements: <50ms latency, moderate throughput (100-500 req/s)
Strategy:
# 1. Small batch size for low latency
batch_size = 1 # or dynamic batching in serving framework
# 2. Use torch.compile() or TensorRT
model = torch.compile(model, mode="reduce-overhead")
# 3. FP16 for speed (if accuracy allows)
model = model.half()
# 4. Profile to ensure <50ms
# 5. If CPU: ensure threading configured correctly
Pattern 2: Throughput-Critical Batch Serving
Requirements: High throughput (>1000 samples/sec), latency flexible (100-500ms OK)
Strategy:
# 1. Large batch size for throughput
batch_size = 64 # or maximum that fits in memory
# 2. Use TensorRT for maximum optimization
trt_model = torch_tensorrt.compile(model, inputs=[...], enabled_precisions={torch.float16})
# 3. FP16 or INT8 for speed
# 4. Profile to maximize throughput
# 5. Consider CUDA graphs for fixed-size batches
Pattern 3: Edge Deployment (Raspberry Pi)
Requirements: <500ms latency, limited memory (1-2GB), ARM CPU
Strategy:
# 1. Quantize to INT8 (critical for ARM)
quantized_model = quantize_dynamic(model, {torch.nn.Linear, torch.nn.Conv2d}, dtype=torch.qint8)
# 2. Convert to TensorFlow Lite with XNNPACK
# (see TFLite section above)
# 3. Set threads to device cores (4 for Raspberry Pi 4)
# 4. Profile on device (not on development machine!)
# 5. If too slow, use smaller architecture (MobileNetV3-Small)
Pattern 4: Multi-GPU Inference
Requirements: Very high throughput, multiple GPUs available
Strategy:
# Option 1: DataParallel (simple, less efficient)
model = torch.nn.DataParallel(model)
# Option 2: Pipeline parallelism (large models)
# Split model across GPUs
model.layer1.to('cuda:0')
model.layer2.to('cuda:1')
# Option 3: Model replication with load balancer (best throughput)
# Run separate inference server per GPU
# Use NGINX or serving framework to distribute requests
Memory vs Compute Trade-offs
Memory-Constrained Scenarios
Symptoms: OOM errors, model barely fits in memory
Optimizations (trade compute for memory):
- Reduce precision: FP16 (2x memory reduction) or INT8 (4x reduction)
- Reduce batch size: Smaller batches use less memory
- Gradient checkpointing: (Training only) Recompute activations during backward
- Model pruning: Remove unnecessary parameters
- Offload to CPU: Store some layers/activations on CPU, transfer to GPU when needed
# Example: Reduce precision
model = model.half() # FP32 → FP16 (2x memory reduction)
# Example: Offload to CPU
import torch.nn as nn

class OffloadWrapper(nn.Module):
def __init__(self, model):
super().__init__()
self.model = model
self.model.layer1.to('cuda')
self.model.layer2.to('cpu') # Offload to CPU
self.model.layer3.to('cuda')
def forward(self, x):
x = self.model.layer1(x)
x = self.model.layer2(x.cpu()).cuda() # Transfer CPU → GPU
x = self.model.layer3(x)
return x
Compute-Constrained Scenarios
Symptoms: Low throughput, long latency, GPU/CPU underutilized
Optimizations (trade memory for compute):
- Increase batch size: Use available memory for larger batches (higher throughput)
- Operator fusion: Combine operations (TensorRT, torch.compile())
- Precision increase: If accuracy suffers from FP16/INT8, use FP32 (slower but accurate)
- Larger model: If accuracy requirements not met, use larger (slower) model
# Example: Increase batch size
# Find maximum batch size that fits in memory
optimal_batch_size = find_optimal_batch_size(model, input_shape)
# Example: Operator fusion (TensorRT)
trt_model = torch_tensorrt.compile(model, inputs=[...], enabled_precisions={torch.float16})
Profiling Checklist
Before optimizing, profile to answer:
GPU Profiling Questions
- What is GPU utilization? (nvidia-smi)
- What is memory utilization?
- What are the slowest operations? (PyTorch profiler)
- Is there CPU-GPU transfer overhead? (.cuda()/.cpu() calls)
- Is batch size optimal? (measure latency/throughput)
- Are Tensor Cores being used? (FP16/INT8 operations)
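A sketch for answering the first two questions from a script instead of watching nvidia-smi, assuming the nvidia-ml-py package (imported as pynvml) is installed:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu and .memory are percentages
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .total in bytes
print(f"GPU utilization: {util.gpu}% | memory: {mem.used / mem.total * 100:.1f}% used")
pynvml.nvmlShutdown()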
CPU Profiling Questions
- What is CPU utilization? (all cores used?)
- What is threading configuration? (torch.get_num_threads())
- What are the slowest operations? (PyTorch profiler)
- Is MKLDNN enabled? (Intel CPUs)
- Is batch size optimal? (may be smaller for CPU)
Edge Profiling Questions
- What is inference latency on target device? (not development machine!)
- What is memory usage? (fits in device RAM?)
- Is model quantized to INT8? (critical for ARM)
- Is threading configured for device cores?
- Is model architecture appropriate for device? (too large?)
Common Pitfalls
Pitfall 1: Optimizing Without Profiling
Mistake: Applying optimizations blindly without measuring bottleneck
Example:
# Wrong: Apply TensorRT without profiling
trt_model = torch_tensorrt.compile(model, ...)
# Right: Profile first
with torch.profiler.profile() as prof:
output = model(input)
print(prof.key_averages().table())
# Then optimize based on findings
Why wrong: May optimize wrong part of pipeline (e.g., model is fast, preprocessing is slow)
Pitfall 2: GPU Optimization for CPU Deployment
Mistake: Using GPU-specific optimizations for CPU deployment
Example:
# Wrong: TensorRT for CPU deployment
trt_model = torch_tensorrt.compile(model, ...) # TensorRT requires NVIDIA GPU!
# Right: Use CPU-optimized framework
session = ort.InferenceSession("model.onnx", providers=['CPUExecutionProvider'])
Pitfall 3: Ignoring torch.cuda.synchronize() in GPU Timing
Mistake: Measuring GPU time without synchronization (measures kernel launch, not execution)
Example:
# Wrong: Inaccurate timing
start = time.time()
output = model(input.cuda())
elapsed = time.time() - start # Only measures kernel launch!
# Right: Synchronize before measuring
torch.cuda.synchronize()
start = time.time()
output = model(input.cuda())
torch.cuda.synchronize() # Wait for GPU to finish
elapsed = time.time() - start # Accurate GPU execution time
Pitfall 4: Batch Size "Bigger is Better"
Mistake: Using largest possible batch size without considering latency
Example:
# Wrong: Maximum batch size without measuring latency
batch_size = 256 # May violate latency SLA!
# Right: Measure latency vs throughput trade-off
benchmark_batch_sizes(model, input_shape, batch_sizes=[1, 4, 8, 16, 32, 64])
# Select batch size that meets latency requirement
Why wrong: Large batches increase latency (queue time + compute time), may violate SLA
Pitfall 5: Not Validating Accuracy After Optimization
Mistake: Deploying FP16/INT8 model without checking accuracy
Example:
# Wrong: Deploy quantized model without validation
quantized_model = quantize_dynamic(model, ...)
# Deploy immediately
# Right: Validate accuracy first
if validate_fp16_accuracy(model, test_loader, tolerance=0.01):
    deploy(quantized_model)
Pitfall 6: Over-Optimizing When Requirements Already Met
Mistake: Spending effort optimizing when already meeting requirements
Example:
# Current: 20ms latency, requirement is <50ms
# Wrong: Spend days optimizing to 10ms (unnecessary)
# Right: Check if requirements met
if current_latency < required_latency:
print("Requirements met, skip optimization")
Pitfall 7: Wrong Threading Configuration (CPU)
Mistake: Not setting intra-op threads, or oversubscribing cores
Example:
# Wrong: Relying on default threading (may use only a few of the 32 cores, e.g. under container CPU limits)
# (no torch.set_num_threads() call)
# Wrong: Oversubscription
torch.set_num_threads(32) # Intra-op threads
torch.set_num_interop_threads(32) # Inter-op threads (total 64 threads on 32 cores!)
# Right: Set intra-op to num cores, disable inter-op
torch.set_num_threads(32)
torch.set_num_interop_threads(1)
When NOT to Optimize
Skip hardware optimization when:
- Requirements already met: Current performance satisfies latency/throughput SLA
- Model is not bottleneck: Profiling shows preprocessing or postprocessing is slow
- Development phase: Still iterating on model architecture (optimize after finalizing)
- Accuracy degradation: Optimization (FP16/INT8) causes unacceptable accuracy loss
- Rare inference: Model runs infrequently (e.g., 1x per hour), optimization effort not justified
Red flag: Spending days optimizing when requirements already met or infrastructure scaling is cheaper.
Integration with Other Skills
With quantization-for-inference
- This skill: Hardware-aware quantization deployment (INT8 on CPU vs GPU vs ARM)
- quantization-for-inference: Quantization techniques (PTQ, QAT, calibration)
- Use both: When quantization is part of hardware optimization strategy
With model-compression-techniques
- This skill: Hardware optimization (batching, frameworks, profiling)
- model-compression-techniques: Model size reduction (pruning, distillation)
- Use both: When both hardware optimization and model compression needed
With model-serving-patterns
- This skill: Optimize model inference on hardware
- model-serving-patterns: Serve optimized model via API/container
- Sequential: Optimize model first (this skill), then serve (model-serving-patterns)
With production-monitoring-and-alerting
- This skill: Optimize for target latency/throughput
- production-monitoring-and-alerting: Monitor actual latency/throughput in production
- Feedback loop: Monitor performance, optimize if degraded
Success Criteria
You've succeeded when:
- ✅ Profiled before optimizing (identified actual bottleneck)
- ✅ Selected hardware-appropriate optimizations (GPU vs CPU vs edge)
- ✅ Measured performance before/after each optimization
- ✅ Met latency/throughput/memory requirements
- ✅ Validated accuracy after optimization (if using FP16/INT8)
- ✅ Considered cost vs benefit (optimization effort vs infrastructure scaling)
- ✅ Documented optimization choices and trade-offs
- ✅ Avoided premature optimization (requirements already met)
References
Profiling:
- PyTorch Profiler: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- NVIDIA Nsight Systems: https://developer.nvidia.com/nsight-systems
- Intel VTune: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
GPU Optimization:
- TensorRT: https://developer.nvidia.com/tensorrt
- torch.compile(): https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
- CUDA Graphs: https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs
CPU Optimization:
- ONNX Runtime: https://onnxruntime.ai/docs/performance/tune-performance.html
- OpenVINO: https://docs.openvino.ai/latest/index.html
- MKLDNN: https://github.com/oneapi-src/oneDNN
Edge Optimization:
- TensorFlow Lite: https://www.tensorflow.org/lite/performance/best_practices
- ONNX Runtime Mobile: https://onnxruntime.ai/docs/tutorials/mobile/
Batch Size Tuning:
- Dynamic Batching: https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
- TorchServe Batching: https://pytorch.org/serve/batch_inference.html