name	microarchitectural_modeling
description	Models cache lines, branch prediction, allocation patterns, and lock contention. Converts algorithmic reasoning into CPU-era reality spanning P4 to Xeon to M4. Use for performance modeling and bottleneck prediction.
allowed-tools	Read, Grep, Glob, WebFetch, WebSearch

Microarchitectural Modeling

Purpose

Translate algorithmic designs into concrete microarchitectural behavior, predicting performance on real hardware from P4 era through modern architectures.

Instructions

When activated:

Identify Target Architecture
- Era: P4 Northwood (2001-2003), Early Xeon, Athlon XP
- Modern: M1/M2/M3/M4, Zen 3/4, Intel Alder Lake
- Specify cache hierarchy, pipeline depth, execution units
Model Memory Hierarchy
- Cache line size (typically 64 bytes)
- Cache capacity (L1/L2/L3)
- Miss latency at each level
- Prefetch behavior (sequential, stride)
Model Execution Pipeline
- Branch misprediction penalty
- Instruction-level parallelism (ILP)
- Out-of-order window size
- Execution port contention
Model Memory Operations
- Allocation cost (malloc/new)
- GC overhead (if applicable)
- False sharing penalties
- Memory bandwidth limits
Predict Bottlenecks
- Memory-bound vs CPU-bound
- Cache miss rates
- Branch misprediction rates
- TLB thrashing (large datasets)

Architecture Profiles

P4 Northwood (2001-2003)

L1: 8 KB / 2 cycles, L2: 512 KB / 7 cycles
Branch mispredict: 20 cycles (deep pipeline)
Weak prefetcher, ~2 GB/s bandwidth

Apple M1/M2/M3/M4 (2020-2024)

L1: 128-192 KB / 3 cycles (huge)
L2: 12-16 MB / 15 cycles
Branch mispredict: ~18 cycles
100-400 GB/s unified memory bandwidth

Cross-Skill Integration

Requires: algorithmic_analysis, systems_design_patterns Feeds into: performance_interpretation, technical_exposition

microarchitectural_modeling

Install Skill

SKILL.md