---
name: analyze-simd-usage
description: Analyze SIMD usage opportunities in Mojo code. Use to find performance optimization opportunities.
category: mojo
mcp_fallback: none
---
# Analyze SIMD Usage Opportunities

Identify where SIMD (Single Instruction, Multiple Data) can improve performance.
## When to Use
- Performance-critical tensor operations
- Element-wise operations on large arrays
- Vectorization of loops processing multiple elements
- Optimizing matrix/vector operations
- Finding performance bottlenecks in ML code
## Quick Reference

```bash
# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"
```
## SIMD Optimization Opportunities

**Vectorizable patterns:**

- ✅ Element-wise addition: `a[i] + b[i]` for all i
- ✅ Scalar multiplication: `a[i] * scalar` for all i
- ✅ Unary operations: `sin(a[i])`, `exp(a[i])` for all i
- ✅ Reduction operations: sum, max, min over an array
- ❌ Dependent iterations: `a[i] = a[i-1] + value` (inherently sequential)
- ❌ Conditional branches: `if a[i] > threshold:` (hard to vectorize)
- ❌ Function calls: unpredictable latency (avoid in tight loops)
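A minimal sketch of the two loop shapes above (function names and parameters are illustrative, not part of any real API):

```mojo
# ✅ Vectorizable: each iteration is independent of the others
fn add_arrays(a: UnsafePointer[Float32], b: UnsafePointer[Float32],
              dst: UnsafePointer[Float32], n: Int):
    for i in range(n):
        dst[i] = a[i] + b[i]

# ❌ Loop-carried dependency: iteration i reads the result of i - 1,
# so iterations cannot execute in parallel SIMD lanes
fn running_add(a: UnsafePointer[Float32], n: Int, value: Float32):
    for i in range(1, n):
        a[i] = a[i - 1] + value
```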
**SIMD width selection:**

- `@parameter fn[simd_width: Int]` - generic over SIMD width
- `simd_width=4` - typically good for float32
- `simd_width=8` - optimal for many operations
- `simd_width=16+` - for int32 or specialized ops
- Match hardware capabilities (AVX2 = 4-8 lanes, AVX-512 = 8-16 lanes)
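Rather than hard-coding a width, the native lane count for an element type can be queried at compile time with the standard-library `simdwidthof` (sketch):

```mojo
from sys import simdwidthof

# Native number of float32 lanes on the build target
# (e.g. 8 with AVX2, 16 with AVX-512)
alias width = simdwidthof[DType.float32]()
```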
**Vectorization patterns:**

- ✅ `@vectorize` decorator for simple loops
- ✅ `@unroll` for small loops (2-4 iterations)
- ✅ Manual SIMD with `.load[]` and `.store[]`
- ✅ Tensor operations with SIMD dimensions
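The manual load/store pattern looks roughly like this, using the same `.load[width](i)` / `.store[width](i, ...)` style as the examples in this document; `n` is assumed to be a multiple of `width` (tail handling omitted for brevity):

```mojo
alias width = 4

# One SIMD load and one SIMD store per iteration,
# processing `width` float32 elements at a time
fn double_in_place(ptr: UnsafePointer[Float32], n: Int):
    for i in range(0, n, width):
        var v = ptr.load[width](i)
        ptr.store[width](i, v * 2.0)
```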
## Analysis Workflow

1. **Profile code**: identify bottlenecks using time/memory metrics
2. **Find loops**: locate loops processing large amounts of data
3. **Check vectorizability**: verify there are no loop-carried dependencies
4. **Estimate speedup**: SIMD can provide a 4-16x improvement
5. **Implement SIMD**: use `@vectorize`, `@unroll`, or manual SIMD
6. **Measure performance**: verify the improvement with benchmarks
7. **Document changes**: note what was optimized and why
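For the measurement step, a minimal timing harness can look like the sketch below. It assumes the `sum_scalar`/`sum_simd` pair from Example 2 and a tensor `t`; note that the timing API may differ across Mojo versions:

```mojo
from time import perf_counter_ns

fn bench(t: Tensor):
    var start = perf_counter_ns()
    _ = sum_scalar(t)                  # scalar baseline (Example 2)
    print("scalar:", perf_counter_ns() - start, "ns")

    start = perf_counter_ns()
    _ = sum_simd[8](t)                 # SIMD variant (Example 2)
    print("simd:  ", perf_counter_ns() - start, "ns")
```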
## Output Format

Report SIMD analysis with:

- **Hotspots** - functions/loops consuming the most CPU time
- **Vectorization Potential** - operations that could use SIMD
- **Estimated Speedup** - expected performance improvement
- **Implementation Priority** - high/medium/low impact
- **Technical Approach** - how to implement SIMD
- **Risks** - potential issues with vectorization
- **Recommendations** - which optimizations to pursue first
## Optimization Examples

### Example 1: Element-wise Addition

```mojo
# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result
```
```mojo
# After: vectorized (4-8x speedup typical)
from algorithm import vectorize

fn add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)

    @parameter
    fn add_simd[simd_width: Int](i: Int):
        result._data.store[simd_width](
            i, a._data.load[simd_width](i) + b._data.load[simd_width](i)
        )

    vectorize[add_simd, 8](a.num_elements())
    return result
```
### Example 2: Reduction (Sum)

```mojo
# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total
```
```mojo
# After: SIMD reduction - accumulate simd_width lanes, then collapse
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    var acc = SIMD[DType.float32, simd_width](0)
    var n = tensor.num_elements()
    for i in range(0, n - n % simd_width, simd_width):
        acc += tensor._data.load[simd_width](i)
    var total = acc.reduce_add()
    for i in range(n - n % simd_width, n):  # scalar tail
        total += tensor._data[i]
    return total
```
## Error Handling

| Problem | Solution |
|---|---|
| Vectorization causes wrong results | Check for loop-carried dependencies |
| Segmentation fault with SIMD | Verify alignment and bounds |
| Minimal speedup | The operation may not be vectorizable; profile to confirm |
| Complex logic | Break into simpler vectorizable operations |
| Type mismatches | Ensure the SIMD width is compatible with the element type |
## SIMD Decision Tree

1. Does the loop process large arrays? → YES → check vectorizability
2. Loop-carried dependencies? → YES → can't vectorize; optimize differently
3. Simple operations on many elements? → YES → use `@vectorize` or `@unroll`
4. On the critical path (hot loop)? → YES → worth optimizing
5. Implement → measure → iterate
## References

- See `mojo-simd-optimize` for implementation guidance
- See `CLAUDE.md` for SIMD code patterns
- See the performance section in the module documentation