Claude Code Plugins

Community-maintained marketplace

Feedback

analyze-simd-usage

@mvillmow/ProjectOdyssey
8
0

Analyze SIMD usage opportunities in Mojo code. Use to find performance optimization opportunities.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name analyze-simd-usage
description Analyze SIMD usage opportunities in Mojo code. Use to find performance optimization opportunities.
category mojo
mcp_fallback none

Analyze SIMD Usage Opportunities

Identify where SIMD (Single Instruction Multiple Data) can improve performance.

When to Use

  • Performance-critical tensor operations
  • Element-wise operations on large arrays
  • Vectorization of loops processing multiple elements
  • Optimizing matrix/vector operations
  • Finding performance bottlenecks in ML code

Quick Reference

# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"

SIMD Optimization Opportunities

Vectorizable Patterns:

  • ✅ Element-wise addition: a[i] + b[i] for all i
  • ✅ Scalar multiplication: a[i] * scalar for all i
  • ✅ Unary operations: sin(a[i]), exp(a[i]) for all i
  • ✅ Reduction operations: sum, max, min over array
  • ❌ Dependent iterations: a[i] = a[i-1] + value (sequential)
  • ❌ Conditional branches: if a[i] > threshold: (hard to vectorize)
  • ❌ Function calls: unpredictable latency (avoid in tight loops)

SIMD Width Selection:

  • @parameter fn[simd_width: Int] - Generic SIMD width
  • simd_width=4 - Typically good for float32
  • simd_width=8 - Optimal for many operations
  • simd_width=16+ - For int32 or specialized ops
  • Match hardware capabilities (AVX2=4-8, AVX512=8-16)

Vectorization Patterns:

  • @vectorize decorator for simple loops
  • @unroll for small loops (2-4 iterations)
  • ✅ Manual SIMD with .load[] and .store[]
  • ✅ Tensor operations with SIMD dimensions

Analysis Workflow

  1. Profile code: Identify bottlenecks using time/memory metrics
  2. Find loops: Locate loops processing large amounts of data
  3. Check vectorizability: Verify no loop-carried dependencies
  4. Estimate speedup: SIMD could provide 4-16x improvement
  5. Implement SIMD: Use @vectorize, @unroll, or manual SIMD
  6. Measure performance: Verify improvement with benchmarks
  7. Document changes: Note what was optimized and why

Output Format

Report SIMD analysis with:

  1. Hotspots - Functions/loops using most CPU time
  2. Vectorization Potential - Operations that could use SIMD
  3. Estimated Speedup - Expected performance improvement
  4. Implementation Priority - High/medium/low impact
  5. Technical Approach - How to implement SIMD
  6. Risks - Potential issues with vectorization
  7. Recommendations - Which optimizations to pursue first

Optimization Examples

Example 1: Element-wise Addition

# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

# After: vectorized
@vectorize
fn add_simd[simd_width: Int](i: Int):
    result._data.store[simd_width](i,
        a._data.load[simd_width](i) + b._data.load[simd_width](i))

def add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    # 4x-8x speedup typical
    return result

Example 2: Reduction (Sum)

# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

# After: SIMD reduction
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    # Process simd_width elements at a time
    # Then reduce results - can be much faster
    return total

Error Handling

Problem Solution
Vectorization causes wrong results Check for loop-carried dependencies
Segment fault with SIMD Verify alignment and bounds
Minimal speedup May not be vectorizable, profile to confirm
Complex logic Break into simpler vectorizable operations
Type mismatches Ensure SIMD width compatible with element type

SIMD Decision Tree

  • Does loop process large arrays? → YES → Check vectorizability
  • Loop-carried dependencies? → YES → Can't vectorize, optimize differently
  • Simple operations on many elements? → YES → Use @vectorize or @unroll
  • Critical path (hot loop)? → YES → Worth optimizing
  • Implement → Measure → Iterate

References

  • See mojo-simd-optimize for implementation guidance
  • See CLAUDE.md for SIMD code patterns
  • See performance section in module documentation