Claude Code Plugins

Community-maintained marketplace


mojo-simd-optimize

@mvillmow/ml-odyssey

Apply SIMD (Single Instruction Multiple Data) optimizations to Mojo code for parallel computation performance. Use when optimizing performance-critical tensor and array operations.

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: mojo-simd-optimize
description: Apply SIMD (Single Instruction Multiple Data) optimizations to Mojo code for parallel computation performance. Use when optimizing performance-critical tensor and array operations.

SIMD Optimization Skill

Apply SIMD optimizations to Mojo code for parallel performance.

When to Use

  • Optimizing tensor operations
  • Array/vector computations
  • Performance-critical loops
  • Code paths where benchmark results show optimization potential

SIMD Basics

SIMD processes multiple data elements in parallel:

from sys.info import simdwidthof
from tensor import Tensor

# Get the optimal SIMD width for a dtype on the current CPU
alias simd_width = simdwidthof[DType.float32]()  # Usually 8 or 16

# SIMD vector add, in place (assumes the element count is a multiple
# of simd_width; see "Handle Remainder" below)
fn simd_add(mut a: Tensor[DType.float32], b: Tensor[DType.float32]):
    for i in range(0, a.num_elements(), simd_width):
        var vec_a = a.load[simd_width](i)
        var vec_b = b.load[simd_width](i)
        var result = vec_a + vec_b
        a.store(i, result)
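
For intuition, the underlying SIMD type can also be used directly; a single arithmetic op touches every lane at once. A minimal illustration (the lane count of 4 is chosen here just for readability):

fn main():
    # A 4-lane float32 vector; one multiply applies to all lanes
    var v = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    var doubled = v * 2          # [2.0, 4.0, 6.0, 8.0]
    print(doubled.reduce_add())  # 20.0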

Optimization Patterns

1. Vectorize Loops

Before:

fn add_scalar(mut a: Tensor[DType.float32], b: Tensor[DType.float32]):
    for i in range(a.num_elements()):
        a[i] = a[i] + b[i]

After:

fn add_simd(mut a: Tensor[DType.float32], b: Tensor[DType.float32]):
    alias width = simdwidthof[DType.float32]()
    for i in range(0, a.num_elements(), width):
        a.store(i, a.load[width](i) + b.load[width](i))

2. Handle Remainder

fn process_with_remainder[dtype: DType](mut data: Tensor[dtype]):
    alias width = simdwidthof[dtype]()
    var vector_end = (data.num_elements() // width) * width

    # SIMD loop over the full vectors (doubling as the example op)
    for i in range(0, vector_end, width):
        data.store(i, data.load[width](i) * 2)

    # Scalar loop over the leftover tail
    for i in range(vector_end, data.num_elements()):
        data[i] = data[i] * 2
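
Mojo's standard library can also take care of the tail for you: algorithm.vectorize invokes a width-parameterized closure with the full SIMD width for the bulk of the data, then with smaller widths for the remainder. A minimal sketch (the exact vectorize signature may vary between Mojo releases):

from algorithm import vectorize
from sys.info import simdwidthof
from tensor import Tensor

fn double_all[dtype: DType](mut data: Tensor[dtype]):
    alias width = simdwidthof[dtype]()

    @parameter
    fn body[w: Int](i: Int):
        # Called with w == width for full vectors, smaller w for the tail
        data.store(i, data.load[w](i) * 2)

    vectorize[body, width](data.num_elements())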

3. Alignment

# SIMD loads and stores are fastest when the address is aligned to the
# vector width. Alignment hints are pointer-API details that vary
# between Mojo releases, so check your version's docs.
fn aligned_load[width: Int, dtype: DType](ptr: DTypePointer[dtype], offset: Int) -> SIMD[dtype, width]:
    # Plain (unaligned-safe) load; substitute an aligned variant if
    # your pointer type exposes one
    return ptr.load[width](offset)

Performance Guidelines

  • Use SIMD for operations on more than ~1,000 elements; below that, loop overhead can outweigh the gains
  • Typical speedup: 4x-8x for float32
  • Always benchmark; predicted wins do not always materialize (see the sketch below)
  • Handle the remainder: process leftover elements with a scalar loop
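
As one way to run such a comparison, here is a sketch using the standard-library benchmark module with the add_scalar and add_simd functions defined above (the exact benchmark.run signature and Report methods may differ across Mojo releases):

import benchmark
from tensor import Tensor

fn bench_add() raises:
    var a = Tensor[DType.float32](4096)
    var b = Tensor[DType.float32](4096)

    @parameter
    fn scalar_workload():
        add_scalar(a, b)

    @parameter
    fn simd_workload():
        add_simd(a, b)

    # Each run repeats the workload and reports timing statistics
    var scalar_report = benchmark.run[scalar_workload]()
    var simd_report = benchmark.run[simd_workload]()
    print("scalar mean (s):", scalar_report.mean())
    print("simd mean (s):  ", simd_report.mean())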

Examples

Vector addition:

fn add[dtype: DType](mut a: Tensor[dtype], b: Tensor[dtype]):
    # In-place SIMD add; assumes the size is a multiple of width
    alias width = simdwidthof[dtype]()
    for i in range(0, a.num_elements(), width):
        a.store(i, a.load[width](i) + b.load[width](i))

Matrix multiplication (tiled):

fn matmul_simd[dtype: DType](mut C: Matrix[dtype], A: Matrix[dtype], B: Matrix[dtype]):
    # Tile for cache locality, then vectorize inside each tile
    alias tile = 32
    alias width = simdwidthof[dtype]()
    # Blocked loops over tile-sized chunks of A, B, and the output C,
    # with SIMD loads/stores of `width` lanes in the innermost loop
    pass
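
A sketch of the vectorized inner loop, assuming a hypothetical row-major Matrix[dtype] type with .rows, .cols, element access A[i, j], and row-wise SIMD load/store (tiling would wrap these loops in tile-sized blocks):

from algorithm import vectorize
from sys.info import simdwidthof

fn matmul_inner[dtype: DType](mut C: Matrix[dtype], A: Matrix[dtype], B: Matrix[dtype]):
    alias width = simdwidthof[dtype]()
    for m in range(C.rows):
        for k in range(A.cols):
            # Broadcast A[m, k] across a vector and accumulate a row of C
            @parameter
            fn dot[w: Int](n: Int):
                C.store[w](m, n, C.load[w](m, n) + A[m, k] * B.load[w](k, n))

            vectorize[dot, width](C.cols)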

See the Mojo documentation for the complete SIMD API.