SKILL.md

name: performance-optimization-for-sims
description: Profiling, LOD systems, time-slicing, spatial partitioning for sim scaling

Performance Optimization for Simulations

When to use this skill: When simulations run below target frame rate (typically 60 FPS for PC, 30 FPS for mobile), especially with large agent counts (100+ units), complex AI, physics calculations, or proximity queries. Critical for RTS games, crowd simulations, ecosystem models, traffic systems, and any scenario requiring 1000+ active entities.

What this skill provides: Systematic methodology for performance optimization using profiling-driven decisions, spatial partitioning patterns, level-of-detail (LOD) systems, time-slicing, caching strategies, data-oriented design, and selective multithreading. Focuses on achieving 60 FPS at scale while maintaining gameplay quality.


Core Concepts

The Optimization Hierarchy (Critical Order)

ALWAYS optimize in this order - the earlier levels deliver the biggest wins, and each level's typical improvement is noted below:

  1. PROFILE FIRST (0.5-1 hour investment)

    • Identify actual bottleneck with profiler
    • Measure baseline performance
    • Set target frame time budgets
    • Never guess - 80% of time is usually in 20% of code
  2. Algorithmic Optimizations (10-100× improvement)

    • Fix O(n²) → O(n) or O(n log n)
    • Spatial partitioning for proximity queries
    • Replace brute-force with smart algorithms
    • Biggest wins, do these FIRST
  3. Level of Detail (LOD) (2-10× improvement)

    • Reduce computation for distant/unimportant entities
    • Smooth transitions (no popping)
    • Priority-based update frequencies
    • Behavior LOD + visual LOD
  4. Time-Slicing (2-5× improvement)

    • Spread work across multiple frames
    • Frame time budgets per system
    • Priority queues for important work
    • Amortized expensive operations
  5. Caching (2-10× improvement)

    • Avoid redundant calculations
    • LRU eviction + TTL
    • Proper invalidation
    • Bounded memory usage
  6. Data-Oriented Design (1.5-3× improvement)

    • Cache-friendly memory layouts
    • Struct of Arrays (SoA) vs Array of Structs (AoS)
    • Minimize pointer chasing
    • Batch operations on contiguous data
  7. Multithreading (1.5-4× improvement)

    • ONLY if still needed after above
    • Job systems for data parallelism
    • Avoid locks and race conditions
    • Complexity cost is high

Example: RTS with 1000 units at 10 FPS → 60 FPS

  • Profile: Vision checks are 80% of frame time
  • Spatial partitioning: O(n²) → O(n) = 50× faster → 40 FPS
  • LOD: Distant units update less = 1.5× faster → 60 FPS
  • Done in 30 minutes vs 2 hours of trial-and-error

Profiling Methodology

Three-step profiling process:

  1. Capture Baseline (before optimization)

    • Total frame time
    • Time per major system (AI, physics, rendering, pathfinding)
    • CPU vs GPU bound
    • Memory allocations per frame
    • Cache misses (if profiler supports)
  2. Identify Bottleneck (80/20 rule)

    • Sort functions by time spent
    • Focus on top 3-5 functions (usually 80% of time)
    • Understand WHY they're slow (algorithm, data layout, cache misses)
  3. Validate Improvement (after each optimization)

    • Measure same metrics
    • Calculate speedup ratio
    • Check for regressions (new bottlenecks)
    • Iterate until target met

Profiling Tools:

  • Python: cProfile, line_profiler, memory_profiler, py-spy
  • C++: VTune, perf, Instruments (Mac), Very Sleepy
  • Unity: Unity Profiler, Deep Profile mode
  • Unreal: Unreal Insights, stat commands
  • Browser: Chrome DevTools Performance tab
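
For Python, a minimal capture with cProfile (a sketch - update_simulation stands in for your frame/tick function):

import cProfile
import pstats

def profile_frames(update_simulation, num_frames=300):
    """Profile a fixed number of frames and print the top offenders."""
    profiler = cProfile.Profile()
    profiler.enable()

    for _ in range(num_frames):
        update_simulation(1 / 60)  # Fixed timestep; replace with your real loop

    profiler.disable()

    # Sort by cumulative time and show the handful of functions that dominate the frame
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative').print_stats(10)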

Example Profiling Output:

Total frame time: 100ms (10 FPS)

Function                    Time    % of Frame
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
update_vision_checks()      80ms    80%      ← BOTTLENECK
update_ai()                 10ms    10%
update_pathfinding()         5ms     5%
update_physics()             3ms     3%
render()                     2ms     2%

Diagnosis: O(n²) vision checks (1000 units × 1000 = 1M checks/frame)
Solution: Spatial partitioning → O(n) checks

Spatial Partitioning

Problem: Proximity queries are O(n²) when checking every entity against every other

  • 100 entities = 10,000 checks
  • 1,000 entities = 1,000,000 checks (death)
  • 10,000 entities = 100,000,000 checks (impossible)

Solution: Divide space into regions, only check entities in nearby regions

Spatial Hash Grid (simplest, fastest for uniform distribution)

  • Divide world into fixed-size cells (e.g., 50×50 units)
  • Hash entity position to cell(s)
  • Query: Check only entities in neighboring cells
  • Complexity: O(n) to build, O(1) average query
  • Best for: Mostly uniform entity distribution

Quadtree (adaptive, good for clustered entities)

  • Recursively subdivide space into 4 quadrants
  • Split when cell exceeds threshold (e.g., 10 entities)
  • Query: Descend tree, check overlapping nodes
  • Complexity: O(n log n) to build, O(log n) average query
  • Best for: Entities clustered in areas

Octree (3D version of quadtree)

  • Recursively subdivide 3D space into 8 octants
  • Same benefits as quadtree for 3D worlds
  • Best for: 3D flight sims, space games, underwater

Decision Framework:

Spatial Partitioning Choice:

├─ 2D WORLD with UNIFORM DISTRIBUTION?
│  └─ Use Spatial Hash Grid (simplest, fastest)
│
├─ 2D WORLD with CLUSTERED ENTITIES?
│  └─ Use Quadtree (adapts to density)
│
├─ 3D WORLD?
│  └─ Use Octree (3D quadtree)
│
└─ VERY LARGE WORLD (multiple km²)?
   └─ Use Hierarchical Grid (multiple grids at different scales)

Performance Impact:

  • 1000 units: O(n²) = 1,000,000 pair checks → with a grid, each unit checks only nearby cells (~10-50 candidates) ≈ 10,000-50,000 checks
  • Typical speedup: 50-100× in practice (accounting for grid overhead)

Level of Detail (LOD)

Concept: Reduce computation for entities that don't need full precision

Distance-Based LOD Levels:

  • LOD 0 (0-50 units from camera): Full detail

    • Full AI decision-making (10 Hz)
    • Precise pathfinding
    • Detailed animations
    • All visual effects
  • LOD 1 (50-100 units): Reduced detail

    • Simplified AI (5 Hz)
    • Coarse pathfinding (waypoints only)
    • Simplified animations
    • Reduced effects
  • LOD 2 (100-200 units): Minimal detail

    • Basic AI (1 Hz)
    • Straight-line movement
    • Static pose or simple animation
    • No effects
  • LOD 3 (200+ units): Culled or dormant

    • State update only (0.2 Hz)
    • No pathfinding
    • Billboards or invisible
    • No physics

Importance-Based LOD (better than distance alone):

def calculate_lod_level(entity, camera, player):
    # Multiple factors determine importance
    distance = entity.distance_to(camera)
    is_player_unit = entity.team == player.team
    is_in_combat = entity.in_combat
    is_selected = entity in player.selection

    # Important entities always get high LOD
    if is_selected:
        return 0  # Always full detail
    if is_player_unit and is_in_combat:
        return 0  # Player's units in combat = critical

    # Distance-based for others
    if distance < 50:
        return 0
    elif distance < 100:
        return 1
    elif distance < 200:
        return 2
    else:
        return 3

Smooth LOD Transitions (avoid popping):

  • Hysteresis: Different thresholds for upgrading vs downgrading

    • Upgrade LOD at 90 units
    • Downgrade LOD at 110 units
    • 20-unit buffer prevents thrashing
  • Time delay: Wait N seconds before downgrading LOD

    • Prevents rapid flicker at boundary
  • Blend animations: Cross-fade between LOD levels

    • 0.5-1 second blend
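
A small sketch combining hysteresis with a downgrade delay (the 90/110 thresholds and 1.0 s delay are illustrative assumptions):

import time

def update_lod(entity, distance, upgrade_at=90, downgrade_at=110, delay=1.0):
    """Upgrade immediately when close; downgrade only after `delay` seconds past the far threshold."""
    timer = getattr(entity, 'downgrade_timer', None)

    if distance < upgrade_at:
        entity.lod = 0
        entity.downgrade_timer = None
    elif distance > downgrade_at:
        now = time.time()
        if timer is None:
            entity.downgrade_timer = now   # Start the grace period
        elif now - timer >= delay:
            entity.lod = 1                 # Held past threshold long enough: downgrade
    else:
        entity.downgrade_timer = None      # Inside the buffer: keep current LOD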

Behavior LOD Examples:

System        LOD 0 (Full)           LOD 1 (Reduced)      LOD 2 (Minimal)     LOD 3 (Dormant)
AI            Behavior tree 10 Hz    Simple FSM 5 Hz      Follow path 1 Hz    State only 0.2 Hz
Pathfinding   Full A*                Hierarchical         Straight line       None
Vision        360° scan 10 Hz        Forward cone 5 Hz    None                None
Physics       Full collision         Bounding box         None                None
Animation     Full skeleton          5 bones              Static pose         None
Audio         3D positioned          2D ambient           None                None

Performance Impact:

  • 1000 units: 100% at LOD 0 vs 20% at LOD 0 + 80% at LOD 1-3 = 3-5× faster

Time-Slicing

Concept: Spread expensive operations across multiple frames to stay within frame budget

Frame Time Budget (60 FPS = 16.67ms per frame):

Frame Budget (16.67ms total):
├─ Rendering: 6ms (36%)
├─ AI: 4ms (24%)
├─ Physics: 3ms (18%)
├─ Pathfinding: 2ms (12%)
└─ Other: 1.67ms (10%)

Time-Slicing Pattern 1: Fixed Budget Per Frame

import heapq
import time

class TimeSlicedSystem:
    def __init__(self, budget_ms=2.0):
        self.budget = budget_ms / 1000.0  # Convert budget to seconds
        self.pending_work = []
        self._counter = 0  # Tie-breaker so the heap never compares work items directly

    def add_work(self, work_item, priority=0):
        # Priority queue: higher priority = processed first
        heapq.heappush(self.pending_work, (-priority, self._counter, work_item))
        self._counter += 1

    def update(self, dt):
        start_time = time.perf_counter()
        processed = 0

        while self.pending_work and (time.perf_counter() - start_time) < self.budget:
            priority, _, work_item = heapq.heappop(self.pending_work)
            work_item.execute()
            processed += 1

        return processed

# Usage: Pathfinding
pathfinding_system = TimeSlicedSystem(budget_ms=2.0)

for unit in units_needing_paths:
    priority = calculate_priority(unit)  # Player units = high priority
    pathfinding_system.add_work(PathfindRequest(unit), priority)

# Each frame: process as many as fit in 2ms budget
paths_found = pathfinding_system.update(dt)

Time-Slicing Pattern 2: Amortized Updates

class AmortizedUpdateManager:
    def __init__(self, entities, updates_per_frame=200):
        self.entities = entities
        self.updates_per_frame = updates_per_frame
        self.current_index = 0

    def update(self, dt):
        if not self.entities:
            return

        # Update N entities per frame
        for i in range(self.updates_per_frame):
            entity = self.entities[self.current_index]
            entity.expensive_update(dt)

            self.current_index = (self.current_index + 1) % len(self.entities)

        # All entities updated every N frames
        # 1000 entities / 200 per frame = every 5 frames = 12 Hz at 60 FPS

# Priority-based amortization
def update_with_priority(entities, frame_count):
    for entity in entities:
        # Distance-based update frequency
        distance = entity.distance_to_camera()

        if distance < 50:
            entity.update()  # Every frame (60 Hz)
        elif distance < 100 and frame_count % 2 == 0:
            entity.update()  # Every 2 frames (30 Hz)
        elif distance < 200 and frame_count % 5 == 0:
            entity.update()  # Every 5 frames (12 Hz)
        elif frame_count % 30 == 0:
            entity.update()  # Every 30 frames (2 Hz)

Time-Slicing Pattern 3: Incremental Processing

import heapq

class IncrementalPathfinder:
    """Find path over multiple frames instead of blocking"""

    def __init__(self, max_nodes_per_frame=100):
        self.max_nodes = max_nodes_per_frame
        self.open_set = []
        self.closed_set = set()
        self.current_request = None

    def start_pathfind(self, start, goal):
        self.current_request = PathRequest(start, goal)
        heapq.heappush(self.open_set, (0, start))
        return self.current_request

    def step(self):
        """Process up to max_nodes this frame, return True if done"""
        if not self.current_request:
            return True

        nodes_processed = 0

        while self.open_set and nodes_processed < self.max_nodes:
            _, current = heapq.heappop(self.open_set)  # Pop (cost, node) tuple

            if current == self.current_request.goal:
                self.current_request.path = reconstruct_path(current)
                self.current_request.complete = True
                return True

            # Expand neighbors, pushing (cost, neighbor) onto self.open_set...
            nodes_processed += 1

        return False  # Not done yet, continue next frame

# Usage
pathfinder = IncrementalPathfinder(max_nodes_per_frame=100)
request = pathfinder.start_pathfind(unit.pos, target.pos)

# Each frame (call once per frame until the path is ready)
if not request.complete:
    pathfinder.step()  # Process up to 100 nodes this frame, resume next frame

Performance Impact:

  • 1000 expensive updates: 1000/frame → 200/frame = 5× faster
  • Pathfinding: Blocking 50ms → 2ms budget = stays at 60 FPS

Caching Strategies

When to Cache:

  • Expensive calculations used repeatedly (pathfinding, line-of-sight)
  • Results that change infrequently (static paths, terrain visibility)
  • Deterministic results (same input = same output)

Cache Design Pattern:

import time

class PerformanceCache:
    def __init__(self, max_size=10000, ttl_seconds=60.0):
        self.cache = {}  # key -> CacheEntry
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.access_times = {}  # LRU tracking
        self.insert_times = {}  # TTL tracking

    def get(self, key):
        current_time = time.time()

        if key not in self.cache:
            return None

        # Check TTL (time-to-live)
        if current_time - self.insert_times[key] > self.ttl:
            del self.cache[key]
            del self.access_times[key]
            del self.insert_times[key]
            return None

        # Update LRU
        self.access_times[key] = current_time
        return self.cache[key]

    def put(self, key, value):
        current_time = time.time()

        # Evict if full (LRU eviction)
        if len(self.cache) >= self.max_size:
            # Find least recently used
            lru_key = min(self.access_times, key=self.access_times.get)
            del self.cache[lru_key]
            del self.access_times[lru_key]
            del self.insert_times[lru_key]

        self.cache[key] = value
        self.access_times[key] = current_time
        self.insert_times[key] = current_time

    def invalidate(self, key):
        """Explicit invalidation when data changes"""
        if key in self.cache:
            del self.cache[key]
            del self.access_times[key]
            del self.insert_times[key]

    def invalidate_region(self, x, y, radius):
        """Invalidate all cache entries in region (e.g., terrain changed)"""
        keys_to_remove = []
        for key in self.cache:
            if self._key_in_region(key, x, y, radius):
                keys_to_remove.append(key)

        for key in keys_to_remove:
            self.invalidate(key)

    def _key_in_region(self, key, x, y, radius):
        """Assumes keys are (start_x, start_y, goal_x, goal_y) tuples, as in the usage below"""
        start_x, start_y, goal_x, goal_y = key
        radius_sq = radius * radius
        return ((start_x - x) ** 2 + (start_y - y) ** 2 <= radius_sq or
                (goal_x - x) ** 2 + (goal_y - y) ** 2 <= radius_sq)

# Usage: Path caching
path_cache = PerformanceCache(max_size=5000, ttl_seconds=30.0)

def get_or_calculate_path(start, goal):
    # Quantize to grid for cache key (allow slight position variance)
    key = (round(start.x), round(start.y), round(goal.x), round(goal.y))

    cached = path_cache.get(key)
    if cached:
        return cached  # Cache hit!

    # Cache miss - calculate
    path = expensive_pathfinding(start, goal)
    path_cache.put(key, path)
    return path

# Invalidate when terrain changes
def on_building_placed(x, y):
    path_cache.invalidate_region(x, y, radius=100)

Cache Invalidation Strategies:

  1. Time-To-Live (TTL): Expire after N seconds

    • Good for: Dynamic environments (traffic, weather)
    • Example: Path cache with 30 second TTL
  2. Event-Based: Invalidate on specific events

    • Good for: Known change triggers (building placed, obstacle moved)
    • Example: Invalidate paths when wall built
  3. Hybrid: TTL + event-based

    • Good for: Most scenarios
    • Example: 60 second TTL OR invalidate on terrain change

Performance Impact:

  • Pathfinding with 60% cache hit rate: 40% of requests calculate = 2.5× faster
  • Line-of-sight with 80% cache hit rate: 20% of requests calculate = 5× faster
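
As a rule of thumb, if a cache hit is effectively free, speedup ≈ 1 / (1 - hit_rate): a 60% hit rate gives 1 / 0.4 = 2.5×, and an 80% hit rate gives 1 / 0.2 = 5×.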

Data-Oriented Design (DOD)

Concept: Organize data for cache-friendly access patterns

Array of Structs (AoS) - Traditional OOP approach:

class Unit:
    def __init__(self):
        self.x = 0.0
        self.y = 0.0
        self.velocity_x = 0.0
        self.velocity_y = 0.0
        self.health = 100
        self.damage = 10
        # ... 20 more fields ...

units = [Unit() for _ in range(1000)]

# Update positions (cache-unfriendly)
for unit in units:
    unit.x += unit.velocity_x * dt  # Load entire Unit struct for each unit
    unit.y += unit.velocity_y * dt  # Only using 2 fields, wasting cache

Struct of Arrays (SoA) - DOD approach:

class UnitSystem:
    def __init__(self, count):
        # Separate arrays for each component
        self.positions_x = [0.0] * count
        self.positions_y = [0.0] * count
        self.velocities_x = [0.0] * count
        self.velocities_y = [0.0] * count
        self.health = [100] * count
        self.damage = [10] * count
        # ... more arrays ...

units = UnitSystem(1000)

# Update positions (cache-friendly)
for i in range(len(units.positions_x)):
    units.positions_x[i] += units.velocities_x[i] * dt  # Sequential memory access
    units.positions_y[i] += units.velocities_y[i] * dt  # Perfect for CPU cache

Why SoA is Faster:

  • CPU cache lines are 64 bytes
  • AoS: Load 1-2 units per cache line (if Unit is 32-64 bytes)
  • SoA: Load 8-16 floats per cache line (4 bytes each)
  • 4-8× better cache utilization = 1.5-3× faster in practice
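
Note that plain Python lists hold references to boxed objects rather than contiguous floats, so to get the cache-friendly layout in practice you would typically use NumPy arrays (a sketch mirroring the UnitSystem fields above):

import numpy as np

class UnitSystemNumpy:
    def __init__(self, count):
        # Contiguous float32 arrays: a true SoA layout in memory
        self.positions_x = np.zeros(count, dtype=np.float32)
        self.positions_y = np.zeros(count, dtype=np.float32)
        self.velocities_x = np.zeros(count, dtype=np.float32)
        self.velocities_y = np.zeros(count, dtype=np.float32)

    def update_positions(self, dt):
        # Vectorized batch update: one C-level loop instead of a Python loop per unit
        self.positions_x += self.velocities_x * dt
        self.positions_y += self.velocities_y * dt

units = UnitSystemNumpy(1000)
units.update_positions(dt=1 / 60)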

When to Use SoA:

  • Batch operations on many entities (position updates, damage calculations)
  • Systems that only need 1-2 fields from entity
  • Performance-critical inner loops

When AoS is Okay:

  • Small entity counts (< 100)
  • Operations needing many fields
  • Prototyping (DOD is optimization, not default)

ECS Architecture (combines SoA + component composition):

from dataclasses import dataclass

# Components (pure data)
@dataclass
class Position:
    x: float
    y: float

@dataclass
class Velocity:
    x: float
    y: float

@dataclass
class Health:
    current: int
    max: int

# Systems (pure logic)
class MovementSystem:
    def update(self, positions, velocities, dt):
        # Batch process all entities with Position + Velocity
        for i in range(len(positions)):
            positions[i].x += velocities[i].x * dt
            positions[i].y += velocities[i].y * dt

class CombatSystem:
    def update(self, positions, health, attacks):
        # Only process entities with Position + Health + Attack
        # ...
        pass

# Entity is just an ID
@dataclass
class Entity:
    id: int

entities = [Entity(id=i) for i in range(1000)]

Performance Impact:

  • Cache-friendly data layout: 1.5-3× faster for batch operations
  • ECS architecture: Enables efficient multithreading (no shared mutable state)

Multithreading (Use Sparingly)

When to Multithread:

  • ✅ After all other optimizations (if still needed)
  • ✅ Embarrassingly parallel work (no dependencies)
  • ✅ Long-running tasks (benefit outweighs overhead)
  • ✅ Native code (C++, Rust) - avoids GIL

When NOT to Multithread:

  • ❌ Python CPU-bound code (GIL limits to 1 core)
  • ❌ Before trying simpler optimizations
  • ❌ Lots of shared mutable state (locking overhead)
  • ❌ Small tasks (thread overhead > savings)

Job System Pattern (best practice):

from concurrent.futures import ThreadPoolExecutor
import threading

class JobSystem:
    def __init__(self, num_workers=4):
        self.executor = ThreadPoolExecutor(max_workers=num_workers)

    def submit_batch(self, jobs):
        """Submit list of independent jobs, return futures"""
        futures = [self.executor.submit(job.execute) for job in jobs]
        return futures

    def wait_all(self, futures):
        """Wait for all jobs to complete"""
        results = [future.result() for future in futures]
        return results

# Good: Parallel pathfinding (independent tasks)
job_system = JobSystem(num_workers=4)

path_jobs = [PathfindJob(unit.pos, unit.target) for unit in units_needing_paths]
futures = job_system.submit_batch(path_jobs)

# Do other work while pathfinding runs...

# Collect results
paths = job_system.wait_all(futures)

Data Parallelism Pattern (no shared mutable state):

from concurrent.futures import ThreadPoolExecutor

def update_positions_parallel(positions, velocities, dt, num_workers=4):
    """Update positions in parallel batches (pure-Python loops are still GIL-bound)"""

    def update_batch(start_idx, end_idx):
        # Each worker gets an exclusive slice (no locks needed)
        for i in range(start_idx, end_idx):
            positions[i].x += velocities[i].x * dt
            positions[i].y += velocities[i].y * dt

    # Split work into batches
    batch_size = len(positions) // num_workers
    futures = []

    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        for worker_id in range(num_workers):
            start = worker_id * batch_size
            end = start + batch_size if worker_id < num_workers - 1 else len(positions)
            futures.append(executor.submit(update_batch, start, end))

        # Wait for all workers
        for future in futures:
            future.result()

Common Multithreading Pitfalls:

  1. Race Conditions (shared mutable state)

    # BAD: Multiple threads modifying same list
    for unit in units:
        threading.Thread(target=unit.update, args=(all_units,)).start()
        # Each thread reads/writes all_units = data race!
    
    # GOOD: Read-only shared data
    for unit in units:
        # units is read-only for all threads
        # Each unit only modifies itself (exclusive ownership)
        threading.Thread(target=unit.update, args=(units,)).start()
    
  2. False Sharing (cache line contention)

    # BAD: Adjacent array elements on same cache line
    shared_counters = [0] * 8  # 8 threads updating 8 counters
    # Thread 0 updates counter[0], Thread 1 updates counter[1]
    # Both on same 64-byte cache line = cache thrashing!
    
    # GOOD (in native code): pad each counter to its own 64-byte cache line
    # Conceptual sketch - Python objects are separately heap-allocated, so this
    # pitfall mainly bites C/C++/Rust arrays of per-thread counters
    class PaddedCounter:
        def __init__(self):
            self.value = 0
            self.padding = [0] * 15  # Padding pushes neighbors onto other cache lines
    
    shared_counters = [PaddedCounter() for _ in range(8)]
    
  3. Excessive Locking (defeats parallelism)

    # BAD: Single lock for everything
    lock = threading.Lock()
    
    def update_unit(unit):
        with lock:  # Only 1 thread can work at a time!
            unit.update()
    
    # GOOD: Lock-free or fine-grained locking
    def update_unit(unit):
        unit.update()  # Each unit independent, no lock needed
    

Performance Impact:

  • 4 cores: Ideal speedup = 4×, realistic = 2-3× (overhead, Amdahl's law)
  • Python: Minimal (GIL), use multiprocessing or native extensions
  • C++/Rust: Good (2-3× on 4 cores for parallelizable work)
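
As a sanity check, Amdahl's law gives speedup = 1 / ((1 - P) + P / N), where P is the parallelizable fraction of the frame and N the core count. With P = 0.8 on 4 cores: 1 / (0.2 + 0.8 / 4) = 2.5× - which is why the realistic figure above is 2-3×, not 4×.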

Decision Frameworks

Framework 1: Systematic Optimization Process

Use this process EVERY time performance is inadequate:

Step 1: PROFILE (mandatory, do first)
├─ Capture baseline metrics
├─ Identify top 3-5 bottlenecks (80% of time)
└─ Understand WHY slow (algorithm, data, cache)

Step 2: ALGORITHMIC (10-100× gains)
├─ Is bottleneck O(n²) or worse?
│  ├─ Proximity queries? → Spatial partitioning
│  ├─ Pathfinding? → Hierarchical, flow fields, or caching
│  └─ Sorting? → Better algorithm or less frequent
├─ Is bottleneck doing redundant work?
│  └─ Add caching with LRU + TTL
└─ Measure improvement, re-profile

Step 3: LOD (2-10× gains)
├─ Can distant entities use less detail?
│  ├─ Distance-based LOD levels (4 levels)
│  ├─ Importance weighting (player units > NPC)
│  └─ Smooth transitions (hysteresis, blending)
└─ Measure improvement, re-profile

Step 4: TIME-SLICING (2-5× gains)
├─ Can work spread across multiple frames?
│  ├─ Set frame budget per system (2-4ms typical)
│  ├─ Priority queue (important work first)
│  └─ Amortized updates (N entities per frame)
└─ Measure improvement, re-profile

Step 5: DATA-ORIENTED DESIGN (1.5-3× gains)
├─ Is bottleneck cache-unfriendly?
│  ├─ Convert AoS → SoA for batch operations
│  ├─ Group hot data together
│  └─ Minimize pointer chasing
└─ Measure improvement, re-profile

Step 6: MULTITHREADING (1.5-4× gains, high complexity)
├─ Still below target after above?
│  ├─ Identify embarrassingly parallel work
│  ├─ Job system for independent tasks
│  ├─ Data parallelism (no shared mutable state)
│  └─ Avoid locks (lock-free or per-entity ownership)
└─ Measure improvement, re-profile

Step 7: VALIDATE
├─ Met target frame rate? → Done!
├─ Still slow? → Return to Step 1, find new bottleneck
└─ Regression? → Revert and try different approach

Example Application (1000-unit RTS at 10 FPS):

  1. Profile: Vision checks are 80% of the frame (80ms of a 100ms frame)
  2. Algorithmic: Add spatial hash grid → vision drops to ~5ms → 25ms frame (40 FPS)
  3. LOD: Distant units update at 5 Hz → AI and vision shrink further → ~18ms frame (55 FPS)
  4. Time-slicing: 2ms pathfinding budget → ~16.7ms frame (60 FPS) ✅ Done
  5. (Skip DOD and multithreading - already at target)

Framework 2: Choosing Spatial Partitioning

START: What's my proximity query scenario?

├─ 2D WORLD with UNIFORM ENTITY DISTRIBUTION?
│  └─ Use SPATIAL HASH GRID
│     - Cell size = 2× query radius (e.g., vision range 50 → cells 100×100)
│     - O(n) build, O(1) query
│     - Simplest to implement
│     - Example: RTS units on open battlefield
│
├─ 2D WORLD with CLUSTERED ENTITIES?
│  └─ Use QUADTREE
│     - Split threshold = 10-20 entities per node
│     - Max depth = 8-10 levels
│     - O(n log n) build, O(log n) query
│     - Example: City simulation (dense downtown, sparse suburbs)
│
├─ 3D WORLD?
│  └─ Use OCTREE
│     - Same as quadtree, but 8 children per node
│     - Example: Space game, underwater sim
│
├─ VERY LARGE WORLD (> 10 km²)?
│  └─ Use HIERARCHICAL GRID
│     - Coarse grid (1km cells) + fine grid (50m cells) per coarse cell
│     - Example: MMO world, open-world game
│
└─ ENTITIES MOSTLY STATIONARY?
   └─ Use STATIC QUADTREE/OCTREE
      - Build once, query many times
      - Example: Building placement, static obstacles

Implementation Complexity:

  • Spatial Hash Grid: 1-2 hours (simple)
  • Quadtree: 3-5 hours (moderate)
  • Octree: 4-6 hours (moderate)
  • Hierarchical Grid: 6-10 hours (complex)

Performance Characteristics:

Method        Build Time   Query Time     Memory   Best For
Hash Grid     O(n)         O(1) avg       Low      Uniform distribution
Quadtree      O(n log n)   O(log n) avg   Medium   Clustered entities
Octree        O(n log n)   O(log n) avg   Medium   3D worlds
Hierarchical  O(n)         O(1) avg       Higher   Massive worlds
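
The hierarchical grid has no dedicated pattern later in this document; a minimal sketch (assuming the SpatialHashGrid class from Pattern 1 below, with the 1000/50 cell sizes and 200-unit cutoff as illustrative choices):

class HierarchicalGrid:
    """Two-level grid: fine cells for short-range queries, coarse cells for long-range ones."""

    def __init__(self, coarse_cell=1000, fine_cell=50, fine_query_limit=200):
        self.coarse = SpatialHashGrid(cell_size=coarse_cell)
        self.fine = SpatialHashGrid(cell_size=fine_cell)
        self.fine_query_limit = fine_query_limit

    def clear(self):
        self.coarse.clear()
        self.fine.clear()

    def insert(self, entity):
        # Every entity lives in both grids; queries pick whichever is cheaper
        self.coarse.insert(entity)
        self.fine.insert(entity)

    def query_radius(self, x, y, radius):
        # Small radius -> fine grid (few candidates); large radius -> coarse grid (few cells)
        grid = self.fine if radius <= self.fine_query_limit else self.coarse
        return grid.query_radius(x, y, radius)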

Framework 3: LOD Level Assignment

For each entity, assign LOD level based on:

├─ IMPORTANCE (highest priority)
│  ├─ Player-controlled? → LOD 0 (always full detail)
│  ├─ Player's team AND in combat? → LOD 0
│  ├─ Selected units? → LOD 0
│  ├─ Quest-critical NPCs? → LOD 0
│  └─ Otherwise, use distance-based...
│
├─ DISTANCE FROM CAMERA (secondary)
│  ├─ 0-50 units → LOD 0 (full detail)
│  │  - Update: 60 Hz (every frame)
│  │  - AI: Full behavior tree
│  │  - Pathfinding: Precise A*
│  │  - Animation: Full skeleton
│  │
│  ├─ 50-100 units → LOD 1 (reduced)
│  │  - Update: 30 Hz (every 2 frames)
│  │  - AI: Simplified FSM
│  │  - Pathfinding: Hierarchical
│  │  - Animation: 10 bones
│  │
│  ├─ 100-200 units → LOD 2 (minimal)
│  │  - Update: 12 Hz (every 5 frames)
│  │  - AI: Basic scripted
│  │  - Pathfinding: Waypoints
│  │  - Animation: Static pose
│  │
│  └─ 200+ units → LOD 3 (culled)
│     - Update: 2 Hz (every 30 frames)
│     - AI: State only (no decisions)
│     - Pathfinding: None
│     - Animation: None (invisible or billboard)
│
└─ SCREEN SIZE (tertiary)
   ├─ Occluded or < 5 pixels? → LOD 3 (culled)
   └─ Small on screen? → Bump LOD down 1 level

Hysteresis to Prevent LOD Thrashing:

# Without hysteresis (bad - flickers)
lod = 0 if distance < 100 else 1
# Entity at 99-101 units: LOD flip-flops every frame!

# With hysteresis (good - stable)
if distance < 90:
    lod = 0  # Upgrade at 90
elif distance > 110:
    lod = 1  # Downgrade at 110
# else: keep current LOD
# 20-unit buffer prevents thrashing

Framework 4: When to Use Multithreading

Should I multithread this system?

├─ ALREADY optimized algorithmic/LOD/caching?
│  └─ NO → Do those FIRST (10-100× gains vs 2-4× for threading)
│
├─ WORK IS EMBARRASSINGLY PARALLEL?
│  ├─ Independent tasks (pathfinding requests)? → YES, good candidate
│  ├─ Lots of shared mutable state? → NO, locking kills performance
│  └─ Need results immediately? → NO, adds latency
│
├─ TASK DURATION > 1ms?
│  ├─ YES → Threading overhead is small % of work
│  └─ NO → Overhead dominates, not worth it
│
├─ PYTHON or NATIVE CODE?
│  ├─ Python → Use multiprocessing (avoid GIL) or native extensions
│  └─ C++/Rust → ThreadPool or job system works well
│
├─ COMPLEXITY COST JUSTIFIED?
│  ├─ Team can maintain and debug threaded code? → Consider it
│  └─ Team inexperienced with threading? → Avoid (bugs are costly)
│
└─ EXPECTED SPEEDUP > 1.5×?
   ├─ 4 cores: Realistic = 2-3× (not 4× due to overhead)
   ├─ Worth complexity? → Your call
   └─ Not worth it? → Try other optimizations first

Threading Decision Tree Example:

Scenario: Pathfinding for 100 units

├─ Already using caching? YES (60% hit rate)
├─ Work is parallel? YES (each path independent)
├─ Task duration? 5ms per path (good for threading)
├─ Language? Python (GIL problem)
│  └─ Solution: Use multiprocessing or native pathfinding library
├─ Complexity justified? 100 paths × 5ms = 500ms → 60ms with 8 workers
│  └─ YES, worth it (8× speedup)
│
Decision: Use multiprocessing.Pool with 8 workers
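
A minimal sketch of that decision using multiprocessing.Pool to sidestep the GIL (find_path and the request tuples are hypothetical placeholders for your actual pathfinder):

from multiprocessing import Pool

def find_path(request):
    """Worker: runs in a separate process, so the GIL does not serialize the work."""
    start, goal = request
    # ... expensive A* here (placeholder) ...
    return [start, goal]

def compute_paths(requests, num_workers=8):
    # Each request is independent -> embarrassingly parallel
    # (On spawn-based platforms, call this from under `if __name__ == "__main__":`)
    with Pool(processes=num_workers) as pool:
        return pool.map(find_path, requests)

# 100 requests × ~5ms each: ~500ms serial vs roughly 60-80ms with 8 workers,
# minus process startup and pickling overhead - measure before committing.
requests = [(u.pos, u.target) for u in units_needing_paths]  # positions must be picklable
paths = compute_paths(requests)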

Framework 5: Frame Time Budget Allocation

60 FPS = 16.67ms per frame, 30 FPS = 33.33ms per frame

Budget Template (adjust based on game type):

60 FPS Frame Budget (16.67ms total):

├─ Rendering: 6.0ms (36%)
│  ├─ Culling: 1.0ms
│  ├─ Draw calls: 4.0ms
│  └─ Post-processing: 1.0ms
│
├─ AI: 3.5ms (21%)
│  ├─ Behavior trees: 2.0ms
│  ├─ Sensors/perception: 1.0ms
│  └─ Decision-making: 0.5ms
│
├─ Physics: 3.0ms (18%)
│  ├─ Broad-phase: 0.5ms
│  ├─ Narrow-phase: 1.5ms
│  └─ Constraint solving: 1.0ms
│
├─ Pathfinding: 2.0ms (12%)
│  ├─ New paths: 1.5ms
│  └─ Path following: 0.5ms
│
├─ Gameplay: 1.0ms (6%)
│  ├─ Economy updates: 0.3ms
│  ├─ Event processing: 0.4ms
│  └─ UI updates: 0.3ms
│
└─ Buffer: 1.17ms (7%)
   └─ Unexpected spikes, GC, etc.

Budget by Game Type:

Game Type     Rendering   AI    Physics   Pathfinding   Gameplay
RTS           30%         30%   10%       20%           10%
FPS           50%         15%   20%       5%            10%
City Builder  35%         20%   5%        15%           25%
Physics Sim   30%         5%    50%       5%            10%
Turn-Based    60%         15%   5%        10%           10%

Enforcement Pattern:

import time

class FrameBudgetMonitor:
    def __init__(self):
        self.budgets = {
            'rendering': 6.0,
            'ai': 3.5,
            'physics': 3.0,
            'pathfinding': 2.0,
            'gameplay': 1.0
        }
        self.measurements = {key: [] for key in self.budgets}

    def measure(self, system_name, func):
        start = time.perf_counter()
        result = func()
        elapsed_ms = (time.perf_counter() - start) * 1000

        self.measurements[system_name].append(elapsed_ms)

        # Alert if over budget
        if elapsed_ms > self.budgets[system_name]:
            print(f"⚠️  {system_name} over budget: {elapsed_ms:.2f}ms / {self.budgets[system_name]:.2f}ms")

        return result

    def report(self):
        print("Frame Time Budget Report:")
        for system, budget in self.budgets.items():
            samples = self.measurements[system]
            avg = sum(samples) / len(samples) if samples else 0.0
            pct = (avg / budget) * 100
            print(f"  {system}: {avg:.2f}ms / {budget:.2f}ms ({pct:.0f}%)")

# Usage
monitor = FrameBudgetMonitor()

def game_loop():
    monitor.measure('ai', lambda: update_ai(units))
    monitor.measure('physics', lambda: update_physics(world))
    monitor.measure('pathfinding', lambda: update_pathfinding(units))
    monitor.measure('rendering', lambda: render_scene(camera))

    if frame_count % 300 == 0:  # Every 5 seconds
        monitor.report()

Implementation Patterns

Pattern 1: Spatial Hash Grid for Proximity Queries

Problem: Checking every unit against every other unit for vision/attack is O(n²)

  • 1000 units = 1,000,000 checks per frame = death

Solution: Spatial hash grid divides world into cells, only check nearby cells

import math
from collections import defaultdict
from random import random

class SpatialHashGrid:
    """
    Spatial partitioning using hash grid for O(1) average query time.

    Best for: Uniform entity distribution, 2D worlds
    Cell size rule: 2× maximum query radius
    """

    def __init__(self, cell_size=100):
        self.cell_size = cell_size
        self.grid = defaultdict(list)  # (cell_x, cell_y) -> [entities]

    def _hash(self, x, y):
        """Convert world position to cell coordinates"""
        cell_x = int(math.floor(x / self.cell_size))
        cell_y = int(math.floor(y / self.cell_size))
        return (cell_x, cell_y)

    def clear(self):
        """Clear all entities (call at start of frame)"""
        self.grid.clear()

    def insert(self, entity):
        """Insert entity into grid"""
        cell = self._hash(entity.x, entity.y)
        self.grid[cell].append(entity)

    def query_radius(self, x, y, radius):
        """
        Find all entities within radius of (x, y).

        Returns: List of entities in range
        Complexity: O(k) where k = entities in nearby cells (typically 10-50)
        """
        # Calculate which cells to check
        min_cell_x = int(math.floor((x - radius) / self.cell_size))
        max_cell_x = int(math.floor((x + radius) / self.cell_size))
        min_cell_y = int(math.floor((y - radius) / self.cell_size))
        max_cell_y = int(math.floor((y + radius) / self.cell_size))

        candidates = []

        # Check all cells in range
        for cell_x in range(min_cell_x, max_cell_x + 1):
            for cell_y in range(min_cell_y, max_cell_y + 1):
                cell = (cell_x, cell_y)
                candidates.extend(self.grid.get(cell, []))

        # Filter by exact distance (candidates may be outside radius)
        results = []
        radius_sq = radius * radius

        for entity in candidates:
            dx = entity.x - x
            dy = entity.y - y
            dist_sq = dx * dx + dy * dy

            if dist_sq <= radius_sq:
                results.append(entity)

        return results

    def query_rect(self, min_x, min_y, max_x, max_y):
        """Find all entities in rectangular region"""
        min_cell_x = int(math.floor(min_x / self.cell_size))
        max_cell_x = int(math.floor(max_x / self.cell_size))
        min_cell_y = int(math.floor(min_y / self.cell_size))
        max_cell_y = int(math.floor(max_y / self.cell_size))

        results = []

        for cell_x in range(min_cell_x, max_cell_x + 1):
            for cell_y in range(min_cell_y, max_cell_y + 1):
                cell = (cell_x, cell_y)
                results.extend(self.grid.get(cell, []))

        return results

# Usage Example
class Unit:
    def __init__(self, x, y, team):
        self.x = x
        self.y = y
        self.team = team
        self.health = 100
        self.damage = 10
        self.vision_range = 50
        self.attack_range = 20

def game_loop():
    units = [Unit(random() * 1000, random() * 1000, random_team())
             for _ in range(1000)]

    # Cell size = 2× max query radius (vision range)
    spatial_grid = SpatialHashGrid(cell_size=100)

    while running:
        # Rebuild grid each frame (units move)
        spatial_grid.clear()
        for unit in units:
            spatial_grid.insert(unit)

        # Update units
        for unit in units:
            # OLD (O(n²)): Check all 1000 units = 1,000,000 checks
            # enemies = [u for u in units if u.team != unit.team and distance(u, unit) < vision_range]

            # NEW (O(k)): Check ~10-50 units in nearby cells
            nearby = spatial_grid.query_radius(unit.x, unit.y, unit.vision_range)
            enemies = [u for u in nearby if u.team != unit.team]

            # Attack enemies in range
            for enemy in enemies:
                dist_sq = (unit.x - enemy.x)**2 + (unit.y - enemy.y)**2
                if dist_sq <= unit.attack_range**2:
                    enemy.health -= unit.damage

# Performance: O(n²) → O(n)
# 1000 units: 1,000,000 checks → ~30,000 checks (nearby cells only)
# Speedup: ~30-50× for vision/attack queries

Pattern 2: Quadtree for Clustered Entities

When to use: Entities cluster in specific areas (cities, battlefields) with sparse regions

class Quadtree:
    """
    Adaptive spatial partitioning for clustered entity distributions.

    Best for: Non-uniform distribution, entities cluster in areas
    Automatically subdivides dense regions
    """

    class Node:
        def __init__(self, x, y, width, height, max_entities=10, max_depth=8):
            self.x = x
            self.y = y
            self.width = width
            self.height = height
            self.max_entities = max_entities
            self.max_depth = max_depth
            self.entities = []
            self.children = None  # [NW, NE, SW, SE] when subdivided

        def is_leaf(self):
            return self.children is None

        def contains(self, entity):
            """Check if entity is within this node's bounds"""
            return (self.x <= entity.x < self.x + self.width and
                    self.y <= entity.y < self.y + self.height)

        def subdivide(self):
            """Split into 4 quadrants"""
            hw = self.width / 2  # half width
            hh = self.height / 2  # half height

            # Create 4 children: NW, NE, SW, SE
            self.children = [
                Quadtree.Node(self.x, self.y, hw, hh,
                             self.max_entities, self.max_depth - 1),  # NW
                Quadtree.Node(self.x + hw, self.y, hw, hh,
                             self.max_entities, self.max_depth - 1),  # NE
                Quadtree.Node(self.x, self.y + hh, hw, hh,
                             self.max_entities, self.max_depth - 1),  # SW
                Quadtree.Node(self.x + hw, self.y + hh, hw, hh,
                             self.max_entities, self.max_depth - 1),  # SE
            ]

            # Move entities to children
            for entity in self.entities:
                for child in self.children:
                    if child.contains(entity):
                        child.insert(entity)
                        break

            self.entities.clear()

        def insert(self, entity):
            """Insert entity into quadtree"""
            if not self.contains(entity):
                return False

            if self.is_leaf():
                self.entities.append(entity)

                # Subdivide if over capacity and can go deeper
                if len(self.entities) > self.max_entities and self.max_depth > 0:
                    self.subdivide()
            else:
                # Insert into appropriate child
                for child in self.children:
                    if child.insert(entity):
                        break

            return True

        def query_radius(self, x, y, radius, results):
            """Find entities within radius of (x, y)"""
            # Check if search circle intersects this node
            closest_x = max(self.x, min(x, self.x + self.width))
            closest_y = max(self.y, min(y, self.y + self.height))

            dx = x - closest_x
            dy = y - closest_y
            dist_sq = dx * dx + dy * dy

            if dist_sq > radius * radius:
                return  # No intersection

            if self.is_leaf():
                # Check entities in this leaf
                radius_sq = radius * radius
                for entity in self.entities:
                    dx = entity.x - x
                    dy = entity.y - y
                    if dx * dx + dy * dy <= radius_sq:
                        results.append(entity)
            else:
                # Recurse into children
                for child in self.children:
                    child.query_radius(x, y, radius, results)

    def __init__(self, world_width, world_height, max_entities=10, max_depth=8):
        self.root = Quadtree.Node(0, 0, world_width, world_height,
                                   max_entities, max_depth)

    def insert(self, entity):
        self.root.insert(entity)

    def query_radius(self, x, y, radius):
        results = []
        self.root.query_radius(x, y, radius, results)
        return results

# Usage
quadtree = Quadtree(world_width=1000, world_height=1000,
                    max_entities=10, max_depth=8)

# Insert entities
for unit in units:
    quadtree.insert(unit)

# Query
enemies_nearby = quadtree.query_radius(player.x, player.y, radius=50)  # radius = vision range

# Performance: O(log n) average query
# Adapts to entity distribution automatically

Pattern 3: Distance-Based LOD System

Problem: All entities update at full frequency, wasting CPU on distant entities

Solution: Update frequency based on distance from camera/player

import math

class LODSystem:
    """
    Level-of-detail system with smooth transitions and importance weighting.

    LOD 0: Full detail (near camera, important entities)
    LOD 1: Reduced detail (medium distance)
    LOD 2: Minimal detail (far distance)
    LOD 3: Dormant (very far, culled)
    """

    # LOD configuration
    LOD_LEVELS = [
        {
            'name': 'LOD_0_FULL',
            'distance_min': 0,
            'distance_max': 50,
            'update_hz': 60,        # Every frame
            'ai_enabled': True,
            'pathfinding': 'full',  # Precise A*
            'animation': 'full',    # Full skeleton
            'physics': 'full'       # Full collision
        },
        {
            'name': 'LOD_1_REDUCED',
            'distance_min': 50,
            'distance_max': 100,
            'update_hz': 30,        # Every 2 frames
            'ai_enabled': True,
            'pathfinding': 'hierarchical',
            'animation': 'reduced',  # 10 bones
            'physics': 'bbox'        # Bounding box only
        },
        {
            'name': 'LOD_2_MINIMAL',
            'distance_min': 100,
            'distance_max': 200,
            'update_hz': 12,        # Every 5 frames
            'ai_enabled': False,    # Scripted only
            'pathfinding': 'waypoints',
            'animation': 'static',   # Static pose
            'physics': 'none'
        },
        {
            'name': 'LOD_3_CULLED',
            'distance_min': 200,
            'distance_max': float('inf'),
            'update_hz': 2,         # Every 30 frames
            'ai_enabled': False,
            'pathfinding': 'none',
            'animation': 'none',
            'physics': 'none'
        }
    ]

    def __init__(self, camera, player):
        self.camera = camera
        self.player = player
        self.frame_count = 0

        # Hysteresis to prevent LOD thrashing
        self.hysteresis = 20  # Units of distance buffer

    def calculate_lod(self, entity):
        """
        Calculate LOD level for entity based on importance and distance.

        Priority:
        1. Importance (player-controlled, in combat, selected)
        2. Distance from camera
        3. Screen size
        """
        # Important entities always get highest LOD
        if self._is_important(entity):
            return 0

        # Distance-based LOD
        distance = self._distance_to_camera(entity)

        # Current LOD (for hysteresis)
        current_lod = getattr(entity, 'lod_level', 0)

        # Determine LOD level with hysteresis
        for i, lod in enumerate(self.LOD_LEVELS):
            if i < current_lod:
                # Upgrading (closer): Use min distance
                if distance <= lod['distance_max'] - self.hysteresis:
                    return i
            else:
                # Downgrading (farther): Use max distance
                if distance <= lod['distance_max'] + self.hysteresis:
                    return i

        return len(self.LOD_LEVELS) - 1

    def _is_important(self, entity):
        """Check if entity is important (always highest LOD)"""
        return (entity.player_controlled or
                entity.selected or
                (entity.team == self.player.team and entity.in_combat))

    def _distance_to_camera(self, entity):
        dx = entity.x - self.camera.x
        dy = entity.y - self.camera.y
        return math.sqrt(dx * dx + dy * dy)

    def should_update(self, entity):
        """Check if entity should update this frame"""
        lod_level = entity.lod_level
        lod_config = self.LOD_LEVELS[lod_level]
        update_hz = lod_config['update_hz']

        if update_hz >= 60:
            return True  # Every frame

        # Calculate frame interval
        frame_interval = 60 // update_hz  # 60 FPS baseline

        # Offset by entity ID to spread updates across frames
        return (self.frame_count + entity.id) % frame_interval == 0

    def update(self, entities):
        """Update LOD levels and entities"""
        self.frame_count += 1

        # Update LOD levels (cheap, do every frame)
        for entity in entities:
            entity.lod_level = self.calculate_lod(entity)

        # Update entities based on LOD (expensive, time-sliced)
        for entity in entities:
            if self.should_update(entity):
                lod_config = self.LOD_LEVELS[entity.lod_level]
                self._update_entity(entity, lod_config)

    def _update_entity(self, entity, lod_config):
        """Update entity according to LOD configuration"""
        if lod_config['ai_enabled']:
            entity.update_ai()

        if lod_config['pathfinding'] == 'full':
            entity.update_pathfinding_full()
        elif lod_config['pathfinding'] == 'hierarchical':
            entity.update_pathfinding_hierarchical()
        elif lod_config['pathfinding'] == 'waypoints':
            entity.update_pathfinding_waypoints()

        if lod_config['animation'] != 'none':
            entity.update_animation(lod_config['animation'])

        if lod_config['physics'] == 'full':
            entity.update_physics_full()
        elif lod_config['physics'] == 'bbox':
            entity.update_physics_bbox()

# Usage
lod_system = LODSystem(camera, player)

def game_loop():
    lod_system.update(units)
    # Only entities that should_update() this frame were updated

# Performance: 1000 units all at LOD 0 → mixed LOD levels
# Typical distribution: 100 LOD0 + 300 LOD1 + 400 LOD2 + 200 LOD3
# Effective updates: 100 + 150 + 80 + 7 = 337 updates/frame
# Speedup: 1000 → 337 = 3× faster

Pattern 4: Time-Sliced Pathfinding with Priority Queue

Problem: 100 path requests × 5ms each = 500ms frame time (2 FPS)

Solution: Process paths over multiple frames with priority (player units first)

import heapq
import time
from enum import Enum

class PathPriority(Enum):
    """Priority levels for pathfinding requests"""
    CRITICAL = 0    # Player-controlled, combat
    HIGH = 1        # Player's units
    NORMAL = 2      # Visible units
    LOW = 3         # Off-screen units

class PathRequest:
    def __init__(self, entity, start, goal, priority):
        self.entity = entity
        self.start = start
        self.goal = goal
        self.priority = priority
        self.path = None
        self.complete = False
        self.timestamp = time.time()

    def __lt__(self, other):
        # Tie-breaker so heapq can order requests with equal priority
        return self.timestamp < other.timestamp

class TimeSlicedPathfinder:
    """
    Pathfinding system with frame time budget and priority queue.

    Features:
    - 2ms frame budget (stays at 60 FPS)
    - Priority queue (important requests first)
    - Incremental pathfinding (spread work over frames)
    - Request timeout (abandon old requests)
    """

    def __init__(self, budget_ms=2.0, timeout_seconds=5.0):
        self.budget = budget_ms / 1000.0  # Convert to seconds
        self.timeout = timeout_seconds
        self.pending = []  # Priority queue: (priority, request)
        self.active_request = None
        self.pathfinder = AStarPathfinder()  # Your pathfinding implementation

        # Statistics
        self.stats = {
            'requests_submitted': 0,
            'requests_completed': 0,
            'requests_timeout': 0,
            'avg_time_to_completion': 0
        }

    def submit_request(self, entity, start, goal, priority=PathPriority.NORMAL):
        """Submit pathfinding request with priority"""
        request = PathRequest(entity, start, goal, priority)
        heapq.heappush(self.pending, (priority.value, request))
        self.stats['requests_submitted'] += 1
        return request

    def update(self, dt):
        """
        Process pathfinding requests within frame budget.

        Returns: Number of paths completed this frame
        """
        start_time = time.perf_counter()
        completed = 0

        while time.perf_counter() - start_time < self.budget:
            # Get next request
            if not self.active_request:
                if not self.pending:
                    break  # No more work

                priority, request = heapq.heappop(self.pending)

                # Check timeout
                if time.time() - request.timestamp > self.timeout:
                    self.stats['requests_timeout'] += 1
                    continue

                self.active_request = request
                self.pathfinder.start(request.start, request.goal)

            # Process active request incrementally
            # (process up to 100 nodes this frame)
            done = self.pathfinder.step(max_nodes=100)

            if done:
                # Request complete
                self.active_request.path = self.pathfinder.get_path()
                self.active_request.complete = True
                self.active_request.entity.path = self.active_request.path

                time_to_complete = time.time() - self.active_request.timestamp
                self._update_avg_time(time_to_complete)

                self.stats['requests_completed'] += 1
                self.active_request = None
                completed += 1

        return completed

    def _update_avg_time(self, time_to_complete):
        """Update moving average of completion time"""
        alpha = 0.1  # Smoothing factor
        current_avg = self.stats['avg_time_to_completion']
        self.stats['avg_time_to_completion'] = (
            alpha * time_to_complete + (1 - alpha) * current_avg
        )

    def get_stats(self):
        """Get performance statistics"""
        pending_count = len(self.pending) + (1 if self.active_request else 0)
        return {
            **self.stats,
            'pending_requests': pending_count,
            'completion_rate': (
                self.stats['requests_completed'] / max(1, self.stats['requests_submitted'])
            )
        }

# Usage
pathfinder = TimeSlicedPathfinder(budget_ms=2.0)

def game_loop():
    # Submit pathfinding requests
    for unit in units_needing_paths:
        # Determine priority
        if unit.player_controlled:
            priority = PathPriority.CRITICAL
        elif unit.team == player.team:
            priority = PathPriority.HIGH
        elif unit.visible:
            priority = PathPriority.NORMAL
        else:
            priority = PathPriority.LOW

        pathfinder.submit_request(unit, unit.pos, unit.target, priority)

    # Process paths (stays within 2ms budget)
    paths_completed = pathfinder.update(dt)

    # Every 5 seconds, print stats
    if frame_count % 300 == 0:
        stats = pathfinder.get_stats()
        print(f"Pathfinding: {stats['requests_completed']} complete, "
              f"{stats['pending_requests']} pending, "
              f"avg time: {stats['avg_time_to_completion']:.2f}s")

# Performance:
# Without time-slicing: 100 paths × 5ms = 500ms frame (2 FPS)
# With time-slicing: 2ms budget per frame = 60 FPS maintained
# Paths complete over multiple frames, but high-priority paths finish first

Pattern 5: LRU Cache with TTL for Pathfinding

Problem: Recalculating same paths repeatedly wastes CPU

Solution: Cache paths with LRU eviction and time-to-live

import time
from collections import OrderedDict

class PathCache:
    """
    LRU cache with TTL for pathfinding results.

    Features:
    - LRU eviction (least recently used)
    - TTL expiration (paths become stale)
    - Region invalidation (terrain changes)
    - Bounded memory (max size)
    """

    def __init__(self, max_size=5000, ttl_seconds=30.0):
        self.cache = OrderedDict()  # Maintains insertion order for LRU
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.insert_times = {}

        # Statistics
        self.stats = {
            'hits': 0,
            'misses': 0,
            'evictions': 0,
            'expirations': 0,
            'invalidations': 0
        }

    def _make_key(self, start, goal):
        """Create cache key from start/goal positions"""
        # Quantize to grid (allows position variance within cell)
        # Cell size = 5 units (units within 5 units share same path)
        return (
            round(start[0] / 5) * 5,
            round(start[1] / 5) * 5,
            round(goal[0] / 5) * 5,
            round(goal[1] / 5) * 5
        )

    def get(self, start, goal):
        """
        Get cached path if available and not expired.

        Returns: Path if cached and valid, None otherwise
        """
        key = self._make_key(start, goal)
        current_time = time.time()

        if key not in self.cache:
            self.stats['misses'] += 1
            return None

        # Check TTL
        if current_time - self.insert_times[key] > self.ttl:
            # Expired
            del self.cache[key]
            del self.insert_times[key]
            self.stats['expirations'] += 1
            self.stats['misses'] += 1
            return None

        # Cache hit - move to end (most recently used)
        self.cache.move_to_end(key)
        self.stats['hits'] += 1
        return self.cache[key]

    def put(self, start, goal, path):
        """Store path in cache"""
        key = self._make_key(start, goal)
        current_time = time.time()

        # Evict if at capacity (LRU)
        if len(self.cache) >= self.max_size and key not in self.cache:
            # Remove oldest (first item in OrderedDict)
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]
            del self.insert_times[oldest_key]
            self.stats['evictions'] += 1

        # Store path
        self.cache[key] = path
        self.insert_times[key] = current_time

        # Move to end (most recently used)
        self.cache.move_to_end(key)

    def invalidate_region(self, x, y, radius):
        """
        Invalidate all cached paths in region.

        Call when terrain changes (building placed, wall destroyed, etc.)
        """
        radius_sq = radius * radius
        keys_to_remove = []

        for key in self.cache:
            start_x, start_y, goal_x, goal_y = key

            # Check if start or goal in affected region
            dx_start = start_x - x
            dy_start = start_y - y
            dx_goal = goal_x - x
            dy_goal = goal_y - y

            if (dx_start * dx_start + dy_start * dy_start <= radius_sq or
                dx_goal * dx_goal + dy_goal * dy_goal <= radius_sq):
                keys_to_remove.append(key)

        for key in keys_to_remove:
            del self.cache[key]
            del self.insert_times[key]
            self.stats['invalidations'] += 1

    def get_hit_rate(self):
        """Calculate cache hit rate"""
        total = self.stats['hits'] + self.stats['misses']
        if total == 0:
            return 0.0
        return self.stats['hits'] / total

    def get_stats(self):
        """Get cache statistics"""
        return {
            **self.stats,
            'size': len(self.cache),
            'hit_rate': self.get_hit_rate()
        }

# Usage
path_cache = PathCache(max_size=5000, ttl_seconds=30.0)

def find_path(start, goal):
    # Try cache first
    cached_path = path_cache.get(start, goal)
    if cached_path:
        return cached_path  # Cache hit!

    # Cache miss - calculate path
    path = expensive_pathfinding(start, goal)
    path_cache.put(start, goal, path)
    return path

# Invalidate when terrain changes
def on_building_placed(building):
    # Invalidate paths near building
    path_cache.invalidate_region(building.x, building.y, radius=100)

# Print stats periodically
def print_cache_stats():
    stats = path_cache.get_stats()
    print(f"Path Cache: {stats['size']}/{path_cache.max_size} entries, "
          f"hit rate: {stats['hit_rate']:.1%}, "
          f"{stats['hits']} hits, {stats['misses']} misses")

# Performance:
# 60% hit rate: Only 40% of requests calculate = 2.5× faster
# 80% hit rate: Only 20% of requests calculate = 5× faster

Pattern 6: Job System for Parallel Work

When to use: Native code (C++/Rust) with embarrassingly parallel work

#include <vector>
#include <thread>
#include <queue>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <atomic>

/**
 * Job system for data-parallel work.
 *
 * Features:
 * - Worker thread pool
 * - Mutex-protected job queue (submission is cheap)
 * - Wait-for-completion
 * - No shared mutable state (data parallelism)
 */
class JobSystem {
public:
    using Job = std::function<void()>;

    JobSystem(int num_workers = std::thread::hardware_concurrency()) {
        workers.reserve(num_workers);

        for (int i = 0; i < num_workers; ++i) {
            workers.emplace_back([this]() { this->worker_loop(); });
        }
    }

    ~JobSystem() {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            shutdown = true;
        }
        queue_cv.notify_all();

        for (auto& worker : workers) {
            worker.join();
        }
    }

    // Submit single job
    void submit(Job job) {
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            job_queue.push(std::move(job));
        }
        queue_cv.notify_one();
    }

    // Submit batch of jobs and wait for all to complete
    void submit_batch_and_wait(const std::vector<Job>& jobs) {
        std::atomic<int> remaining{static_cast<int>(jobs.size())};
        std::mutex wait_mutex;
        std::condition_variable wait_cv;

        for (const auto& job : jobs) {
            submit([&, job]() {
                job();

                if (--remaining == 0) {
                    // Take the wait mutex before notifying so the notification
                    // cannot be lost between the waiter's predicate check and its wait.
                    std::lock_guard<std::mutex> lock(wait_mutex);
                    wait_cv.notify_one();
                }
            });
        }

        // Wait for all jobs to complete
        std::unique_lock<std::mutex> lock(wait_mutex);
        wait_cv.wait(lock, [&]() { return remaining == 0; });
    }

private:
    void worker_loop() {
        while (true) {
            Job job;

            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                queue_cv.wait(lock, [this]() {
                    return !job_queue.empty() || shutdown;
                });

                if (shutdown && job_queue.empty()) {
                    return;
                }

                job = std::move(job_queue.front());
                job_queue.pop();
            }

            job();
        }
    }

    std::vector<std::thread> workers;
    std::queue<Job> job_queue;
    std::mutex queue_mutex;
    std::condition_variable queue_cv;
    bool shutdown = false;
};

// Usage Example: Parallel position updates
struct Unit {
    float x, y;
    float vx, vy;

    void update(float dt) {
        x += vx * dt;
        y += vy * dt;
    }
};

void update_units_parallel(std::vector<Unit>& units, float dt, JobSystem& job_system) {
    const int num_workers = 8;
    const int batch_size = units.size() / num_workers;

    std::vector<JobSystem::Job> jobs;

    for (int worker_id = 0; worker_id < num_workers; ++worker_id) {
        int start = worker_id * batch_size;
        int end = (worker_id == num_workers - 1) ? units.size() : start + batch_size;

        jobs.push_back([&units, dt, start, end]() {
            // Each worker updates exclusive slice (no locks needed)
            for (int i = start; i < end; ++i) {
                units[i].update(dt);
            }
        });
    }

    job_system.submit_batch_and_wait(jobs);
}

// Performance: 4 cores = 2-3× speedup (accounting for overhead)

Common Pitfalls

Pitfall 1: Premature Optimization (Most Common!)

Symptoms:

  • Jumping to complex solutions (multithreading) before measuring bottleneck
  • Micro-optimizing (sqrt → squared distance) without profiling
  • Optimizing code that's 1% of frame time

Why it fails:

  • You optimize the wrong thing (80% of time elsewhere)
  • Complex solutions add bugs without benefit
  • Time wasted that could go to real bottleneck

Example:

# BAD: Premature micro-optimization
# Replaced sqrt with squared distance (saves 0.1ms)
# But vision checks are only 1% of frame time!
dist_sq = dx*dx + dy*dy
if dist_sq < range_sq:  # Micro-optimization
    # ...

# GOOD: Profile first, found pathfinding is 80% of frame time
# Added path caching (saves 40ms!)
cached_path = path_cache.get(start, goal)
if cached_path:
    return cached_path

Solution:

  1. Profile FIRST - measure where time is actually spent
  2. Focus on top bottleneck (80/20 rule)
  3. Measure improvement - validate optimization helped
  4. Repeat - find next bottleneck

Quote: "Premature optimization is the root of all evil" - Donald Knuth

Pitfall 2: LOD Popping (Visual Artifacts)

Symptoms:

  • Units suddenly appear/disappear at LOD boundaries
  • Animation quality jumps (smooth → jerky)
  • Players notice "fake" LOD transitions

Why it fails:

  • No hysteresis: Entity at 99-101 units flip-flops between LOD 0/1 every frame
  • Instant transitions: LOD 0 → LOD 3 in one frame (jarring)
  • Distance-only: Ignores importance (player's units should always be high detail)

Example:

# BAD: No hysteresis (causes popping)
if distance < 100:
    lod = 0
else:
    lod = 1
# Entity at 99.5 units: LOD 0
# Entity moves to 100.5 units: LOD 1
# Entity moves to 99.5 units: LOD 0 (flicker!)

# GOOD: Hysteresis + importance + blend
if is_important(entity):
    lod = 0  # Always full detail for player units
elif distance < 90:
    lod = 0  # Upgrade at 90
elif distance > 110:
    lod = 1  # Downgrade at 110
# else: keep current LOD
# 20-unit buffer prevents thrashing

# Blend between LOD levels over 0.5 seconds
blend_factor = min(1.0, (time.time() - lod_transition_start) / 0.5)

Solution:

  1. Hysteresis - different thresholds for upgrade (90) vs downgrade (110)
  2. Importance weighting - player units, selected units always high LOD
  3. Blend transitions - cross-fade over 0.5-1 second
  4. Time delay - wait N seconds before downgrading LOD
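
A minimal sketch combining points 1-3 above. The 90/110 thresholds and 0.5 s blend window are illustrative values, not fixed rules, and the importance flag is whatever your game already tracks (selected, player-owned, on screen):

import time

class LODSelector:
    """Distance-based LOD with hysteresis, importance override, and blending."""

    def __init__(self, upgrade_dist=90.0, downgrade_dist=110.0, blend_seconds=0.5):
        self.upgrade_dist = upgrade_dist      # below this: full detail (LOD 0)
        self.downgrade_dist = downgrade_dist  # above this: reduced detail (LOD 1)
        self.blend_seconds = blend_seconds
        self.current_lod = 0
        self.transition_start = 0.0

    def update(self, distance, important):
        previous = self.current_lod

        if important:
            self.current_lod = 0            # player/selected units stay full detail
        elif distance < self.upgrade_dist:
            self.current_lod = 0
        elif distance > self.downgrade_dist:
            self.current_lod = 1
        # else: inside the 90-110 buffer, keep the current LOD (hysteresis)

        if self.current_lod != previous:
            self.transition_start = time.time()
        return self.current_lod

    def blend_factor(self):
        """0.0 right after a transition, 1.0 once the cross-fade is complete."""
        elapsed = time.time() - self.transition_start
        return min(1.0, elapsed / self.blend_seconds)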

Pitfall 3: Thread Contention and Race Conditions

Symptoms:

  • Crashes with "list modified during iteration"
  • Nondeterministic behavior (works sometimes)
  • Slower with multithreading than without (due to locking)

Why it fails:

  • Multiple threads read/write shared mutable state (data race)
  • Excessive locking serializes code (defeats parallelism)
  • False sharing - threads write to adjacent data that shares a cache line, so the line ping-pongs between cores

Example:

# BAD: Race condition (shared mutable list)
def update_unit_threaded(unit, all_units):
    # Thread 1 reads all_units
    # Thread 2 modifies all_units (adds/removes unit)
    # Thread 1 crashes: "list changed during iteration"
    for other in all_units:
        if collides(unit, other):
            all_units.remove(other)  # RACE!

# BAD: Excessive locking (serialized)
lock = threading.Lock()

def update_unit(unit):
    with lock:  # Only 1 thread works at a time!
        unit.update()

# GOOD: Data parallelism (no shared mutable state)
from concurrent.futures import ThreadPoolExecutor

def update_units_parallel(units, num_workers=4):
    batch_size = len(units) // num_workers

    def update_batch(start, end):
        # Exclusive ownership - no locks needed
        for i in range(start, end):
            units[i].update()  # Only modifies units[i]

    # Note: under CPython's GIL this pattern is safe but won't speed up CPU-bound
    # work; the same data-ownership idea pays off in native code or multiprocessing.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = []
        for worker_id in range(num_workers):
            start = worker_id * batch_size
            end = start + batch_size if worker_id < num_workers - 1 else len(units)
            futures.append(executor.submit(update_batch, start, end))

        # Wait for all
        for future in futures:
            future.result()

Solution:

  1. Avoid shared mutable state - each thread owns exclusive data
  2. Read-only sharing - threads can read shared data if no writes
  3. Message passing - communicate via queues instead of shared memory
  4. Lock-free algorithms - atomic operations, compare-and-swap
  5. Test with thread sanitizer - detects data races
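
For point 3, a minimal sketch of message passing with the standard-library queue. The request/result shapes and apply_path are illustrative placeholders; pathfind is the same stand-in used throughout this document:

import queue
import threading

commands = queue.Queue()   # main thread -> worker
results = queue.Queue()    # worker -> main thread

def ai_worker():
    while True:
        request = commands.get()
        if request is None:              # sentinel: shut down
            break
        unit_id, start, goal = request
        path = pathfind(start, goal)     # worker owns this work entirely
        results.put((unit_id, path))     # hand back via queue, no shared state

worker = threading.Thread(target=ai_worker, daemon=True)
worker.start()

# Main thread: submit requests, then drain completed results each frame
commands.put((42, (0, 0), (100, 100)))
while not results.empty():
    unit_id, path = results.get_nowait()
    apply_path(unit_id, path)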

Pitfall 4: Cache Invalidation Bugs

Symptoms:

  • Units walk through walls (stale paths cached)
  • Memory leak (cache grows unbounded)
  • Crashes after long play sessions (out of memory)

Why it fails:

  • No invalidation: Cache never updates when terrain changes
  • No TTL: Old paths stay forever, become invalid
  • No eviction: Cache grows until memory exhausted

Example:

# BAD: No invalidation, no TTL, unbounded growth
cache = {}

def get_path(start, goal):
    key = (start, goal)
    if key in cache:
        return cache[key]  # May be stale!

    path = pathfind(start, goal)
    cache[key] = path  # Cache grows forever!
    return path

# Building placed, but cached paths not invalidated
def place_building(x, y):
    buildings.append(Building(x, y))
    # BUG: Paths through this area still cached!

# GOOD: LRU + TTL + invalidation
cache = PathCache(max_size=5000, ttl_seconds=30.0)

def get_path(start, goal):
    cached = cache.get(start, goal)
    if cached:
        return cached

    path = pathfind(start, goal)
    cache.put(start, goal, path)
    return path

def place_building(x, y):
    buildings.append(Building(x, y))
    cache.invalidate_region(x, y, radius=100)  # Clear affected paths

Solution:

  1. TTL (time-to-live) - expire entries after N seconds
  2. Event-based invalidation - clear cache when terrain changes
  3. LRU eviction - remove least recently used when full
  4. Bounded size - set max_size to prevent unbounded growth

Pitfall 5: Forgetting to Rebuild Spatial Grid

Symptoms:

  • Units see enemies that are no longer there
  • Collision detection misses fast-moving objects
  • Query results are stale (from previous frame)

Why it fails:

  • Entities move every frame, but grid not rebuilt
  • Grid contains stale positions

Example:

# BAD: Grid built once, never updated
spatial_grid = SpatialHashGrid(cell_size=100)
for unit in units:
    spatial_grid.insert(unit)

def game_loop():
    # Units move
    for unit in units:
        unit.x += unit.vx * dt
        unit.y += unit.vy * dt

    # Query stale grid (positions from frame 0!)
    enemies = spatial_grid.query_radius(player.x, player.y, 50)

# GOOD: Rebuild grid every frame
def game_loop():
    # Move units
    for unit in units:
        unit.x += unit.vx * dt
        unit.y += unit.vy * dt

    # Rebuild spatial grid (fast: O(n))
    spatial_grid.clear()
    for unit in units:
        spatial_grid.insert(unit)

    # Query with current positions
    enemies = spatial_grid.query_radius(player.x, player.y, 50)

Solution:

  1. Rebuild every frame - spatial_grid.clear() + insert all entities
  2. Or use dynamic structure - quadtree with update() method
  3. Profile rebuild cost - should be < 1ms for 1000 entities
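
For point 3, a quick way to confirm the rebuild stays cheap; the 1 ms threshold matches the guideline above, and spatial_grid is the clear()/insert() grid from the example:

import time

def rebuild_spatial_grid(spatial_grid, units):
    t0 = time.perf_counter()

    spatial_grid.clear()
    for unit in units:
        spatial_grid.insert(unit)

    rebuild_ms = (time.perf_counter() - t0) * 1000
    if rebuild_ms > 1.0:
        print(f"WARNING: grid rebuild took {rebuild_ms:.2f}ms for {len(units)} units")
    return rebuild_ms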

Pitfall 6: Optimization Without Validation

Symptoms:

  • "Optimized" code runs slower
  • New bottleneck created elsewhere
  • Unsure if optimization helped

Why it fails:

  • No before/after measurements
  • Optimization moved bottleneck to different system
  • Assumptions about cost were wrong

Example:

# BAD: No measurement
def optimize_pathfinding():
    # Made some changes...
    # Hope it's faster?
    pass

# GOOD: Measure before and after
def optimize_pathfinding(start, goal):
    # Measure baseline (timer variable named t0 so it doesn't shadow the path start)
    t0 = time.perf_counter()
    for _ in range(100):
        path = pathfind(start, goal)
    baseline_ms = (time.perf_counter() - t0) * 1000
    print(f"Baseline: {baseline_ms:.2f}ms for 100 paths")

    # Apply optimization...
    add_path_caching()

    # Measure improvement
    t0 = time.perf_counter()
    for _ in range(100):
        path = pathfind(start, goal)
    optimized_ms = (time.perf_counter() - t0) * 1000
    print(f"Optimized: {optimized_ms:.2f}ms for 100 paths")

    speedup = baseline_ms / optimized_ms
    print(f"Speedup: {speedup:.1f}×")

    # Baseline: 500ms for 100 paths
    # Optimized: 200ms for 100 paths
    # Speedup: 2.5×

Solution:

  1. Measure baseline before optimization
  2. Measure improvement after optimization
  3. Calculate speedup - validate it helped
  4. Re-profile - check for new bottlenecks
  5. Regression test - ensure gameplay still works
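
A small reusable helper makes the measure-before/measure-after habit cheap to keep. This is a sketch, not a substitute for a real profiler; pathfind, start, and goal are the same placeholders used in the example above:

import time
from contextlib import contextmanager

@contextmanager
def timed(label, iterations=1):
    """Print elapsed milliseconds for the wrapped block."""
    t0 = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - t0) * 1000
    print(f"{label}: {elapsed_ms:.2f}ms total, "
          f"{elapsed_ms / iterations:.3f}ms per iteration")

# Usage
with timed("pathfinding x100", iterations=100):
    for _ in range(100):
        pathfind(start, goal)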

Pitfall 7: Ignoring Amdahl's Law (Diminishing Returns)

Concept: Speedup limited by serial portion of code

Amdahl's Law: Speedup = 1 / ((1 - P) + P/N)

  • P = portion that can be parallelized (e.g., 0.75 = 75%)
  • N = number of cores (e.g., 4)

Example:

  • 75% of code parallelizable, 4 cores
  • Speedup = 1 / ((1 - 0.75) + 0.75/4) = 1 / (0.25 + 0.1875) = 2.29×
  • Not 4×! Serial portion limits speedup
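
As a quick sanity check, the formula as a one-line helper (the first call reproduces the worked example above):

def amdahl_speedup(parallel_fraction, num_cores):
    """Theoretical speedup limit for a given parallel fraction and core count."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_cores)

print(amdahl_speedup(0.75, 4))     # ~2.29×, matching the example above
print(amdahl_speedup(0.75, 1000))  # ~4.0×: the ceiling even with unlimited cores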

Why it matters:

  • Multithreading has diminishing returns
  • Focus on parallelizing largest portions first
  • Some tasks can't be parallelized (Amdahl's law ceiling)

Solution:

  1. Parallelize largest bottleneck first (maximize P)
  2. Set realistic expectations (2-3× on 4 cores, not 4×)
  3. Measure actual speedup - compare to theoretical maximum

Pitfall 8: Sorting Every Frame (Expensive!)

Symptoms:

  • 3-5ms spent sorting units by distance
  • Sorting is top function in profiler

Why it fails:

  • O(n log n) sort is expensive for large N
  • Entity distances change slowly (don't need exact sort every frame)

Example:

# BAD: Full sort every frame
def update():
    # O(n log n) = 1000 × log(1000) ≈ 10,000 operations
    units_sorted = sorted(units, key=lambda u: distance_to_camera(u))

    # Update closest units
    for unit in units_sorted[:100]:
        unit.update()

# GOOD: Sort every N frames, or use approximate sort
def update():
    # Re-sort every 10 frames only
    if frame_count % 10 == 0:
        global units_sorted
        units_sorted = sorted(units, key=lambda u: distance_to_camera(u))

    # Use slightly stale sort (good enough!)
    for unit in units_sorted[:100]:
        unit.update()

# BETTER: Use spatial partitioning (no sorting needed!)
def update():
    # Query only entities near the camera (results are unsorted,
    # but everything returned is already within the radius)
    nearby_units = spatial_grid.query_radius(camera.x, camera.y, radius=200)

    # Update nearby units
    for unit in nearby_units:
        unit.update()

Solution:

  1. Sort less frequently - every 5-10 frames is fine
  2. Approximate sort - bucketing instead of exact sort
  3. Spatial queries - avoid sorting entirely (use grid/quadtree)
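
For points 1-2, heapq.nsmallest selects the closest K units in O(n log K) without sorting the whole list. A sketch of the same update loop as above, reusing its units list and distance_to_camera helper:

import heapq

def update():
    # Select the 100 closest units without a full sort: O(n log 100)
    closest = heapq.nsmallest(100, units, key=lambda u: distance_to_camera(u))

    for unit in closest:
        unit.update()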

Real-World Examples

Example 1: Unity DOTS (Data-Oriented Technology Stack)

What it is: Unity's high-performance ECS (Entity Component System) architecture

Key optimizations:

  1. Struct of Arrays (SoA) - Components stored in contiguous arrays

    • Traditional: List<GameObject> with components scattered in memory
    • DOTS: NativeArray<Position>, NativeArray<Velocity> - cache-friendly
    • Result: 1.5-3× faster for batch operations
  2. Job System - Data parallelism across CPU cores

    • Each job processes exclusive slice of entities
    • No locks (data ownership model)
    • Result: 2-4× speedup on 4-8 core CPUs
  3. Burst Compiler - LLVM-based code generation

    • Generates SIMD instructions (AVX2, SSE)
    • Removes bounds checks, optimizes math
    • Result: 2-10× faster than standard C#

Performance: 10,000 entities at 60 FPS (vs 1,000 in traditional Unity)

When to use:

  • ✅ 1000+ entities needing updates
  • ✅ Batch operations (position updates, physics, AI)
  • ✅ Performance-critical simulations

When NOT to use:

  • ❌ Small entity counts (< 100)
  • ❌ Gameplay prototyping (ECS is complex)
  • ❌ Unique entities with lots of one-off logic

Example 2: Supreme Commander (RTS with 1000+ Units)

Challenge: Support 1000+ units in RTS battles at 30-60 FPS

Optimizations:

  1. Flow Fields for Pathfinding

    • Pre-compute direction field from goal
    • Each unit follows field (O(1) per unit)
    • Replaces a separate A* search per unit with one shared field
    • Result: 100× faster pathfinding for groups
  2. LOD for Unit AI

    • LOD 0 (< 50 units from camera): Full behavior tree
    • LOD 1 (50-100 units): Simplified FSM
    • LOD 2 (100+ units): Scripted behavior
    • Result: 3-5× fewer AI updates per frame
  3. Spatial Partitioning for Weapons

    • Grid-based broad-phase for weapon targeting
    • Only check units in weapon range cells
    • Result: O(n²) → O(n) for combat calculations
  4. Time-Sliced Sim

    • Economy updates: Every 10 frames
    • Unit production: Every 5 frames
    • Visual effects: Based on distance LOD
    • Result: Consistent frame rate under load

Performance: 1000 units at 30 FPS, 500 units at 60 FPS

Lessons:

  • Flow fields > A* for large unit groups
  • LOD critical for maintaining frame rate at scale
  • Spatial partitioning is non-negotiable for 1000+ units
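
To make the flow-field idea concrete: once a direction field has been pre-computed from the goal, each unit's per-frame pathfinding cost is a single array lookup. A minimal sketch, not the game's actual implementation; flow_field is assumed to be a 2D grid of unit-length direction vectors and CELL_SIZE an arbitrary cell width:

CELL_SIZE = 10.0

def follow_flow_field(unit, flow_field, dt, speed=5.0):
    # O(1) per unit: read the pre-computed direction for this cell
    cell_x = int(unit.x // CELL_SIZE)
    cell_y = int(unit.y // CELL_SIZE)
    dir_x, dir_y = flow_field[cell_y][cell_x]

    unit.x += dir_x * speed * dt
    unit.y += dir_y * speed * dt

def update_group(units, flow_field, dt):
    # 1000 units following one field = 1000 lookups,
    # instead of 1000 separate A* searches
    for unit in units:
        follow_flow_field(unit, flow_field, dt)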

Example 3: Total War (20,000+ Soldiers in Battles)

Challenge: Render and simulate 20,000 individual soldiers at 30-60 FPS

Optimizations:

  1. Hierarchical LOD

    • LOD 0 (< 20m): Full skeleton, detailed model
    • LOD 1 (20-50m): Reduced skeleton, simpler model
    • LOD 2 (50-100m): Impostor (textured quad)
    • LOD 3 (100m+): Single pixel or culled
    • Result: 10× fewer vertices rendered
  2. Formation-Based AI

    • Units in formation share single pathfinding result
    • Individual units offset from formation center
    • Result: 100× fewer pathfinding calculations
  3. Batched Rendering

    • Instanced rendering for identical soldiers
    • 1 draw call for 100 soldiers (vs 100 draw calls)
    • Result: 10× fewer draw calls
  4. Simplified Physics

    • Full physics for nearby units (< 20m)
    • Ragdolls for deaths near camera
    • Simplified collision for distant units
    • Result: 5× fewer physics calculations

Performance: 20,000 units at 30-60 FPS (depending on settings)

Lessons:

  • Visual LOD as important as simulation LOD
  • Formation-based AI avoids redundant pathfinding
  • Instanced rendering critical for large unit counts
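
The formation idea sketched in Python (illustrative, not Total War's actual code): one path is computed for the whole formation, and each soldier simply steers toward its fixed offset from the formation's current position. pathfind is the same placeholder used throughout, and soldiers are assumed to expose a move_towards() method:

class Formation:
    def __init__(self, soldiers, spacing=2.0, columns=10):
        self.soldiers = soldiers
        # Fixed offset per soldier relative to the formation centre
        self.offsets = [((i % columns) * spacing, (i // columns) * spacing)
                        for i in range(len(soldiers))]
        self.path = []          # single shared path: one pathfinding call
        self.x, self.y = 0.0, 0.0

    def set_destination(self, goal):
        self.path = pathfind((self.x, self.y), goal)   # 1 search, not len(soldiers)

    def update(self, dt, speed=3.0):
        # Advance the formation centre along the shared path (simplified waypoint following)
        if self.path:
            self.x, self.y = self.path.pop(0)

        # Each soldier heads for its offset slot - no per-soldier pathfinding
        for soldier, (ox, oy) in zip(self.soldiers, self.offsets):
            soldier.move_towards(self.x + ox, self.y + oy, speed * dt)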

Example 4: Cities Skylines (Traffic Simulation)

Challenge: Simulate 10,000+ vehicles with realistic traffic at 30 FPS

Optimizations:

  1. Hierarchical Pathfinding

    • Highway network → arterial roads → local streets
    • Pre-compute high-level paths, refine locally
    • Result: 20× faster pathfinding for long routes
  2. Path Caching

    • Common routes cached (home → work, work → home)
    • 60-80% cache hit rate
    • Result: 2.5-5× fewer pathfinding calculations
  3. Dynamic Cost Adjustment

    • Road segments track vehicle density
    • Congested roads have higher pathfinding cost
    • Vehicles reroute around congestion
    • Result: Emergent traffic patterns
  4. Despawn Distant Vehicles

    • Vehicles > 500m from camera despawned
    • Statistics tracked, respawn when relevant
    • Result: Effective vehicle count reduced 50%

Performance: 10,000 active vehicles at 30 FPS

Lessons:

  • Hierarchical pathfinding essential for city-scale maps
  • Path caching provides huge wins (60%+ hit rate common)
  • Despawning off-screen entities maintains performance

Example 5: Factorio (Mega-Factory Optimization)

Challenge: Simulate 100,000+ entities (belts, inserters, assemblers) at 60 FPS

Optimizations:

  1. Update Skipping

    • Idle machines don't update (no input/output)
    • Active set typically 10-20% of total entities
    • Result: 5-10× fewer updates per tick
  2. Chunk-Based Simulation

    • World divided into 32×32 tile chunks
    • Inactive chunks (no player nearby) update less often
    • Result: Effective world size reduced 80%
  3. Belt Optimization

    • Items on belts compressed into contiguous arrays
    • Lane-based updates (not per-item)
    • Result: 10× faster belt simulation
  4. Electrical Network Caching

    • Power grid solved once, cached until topology changes
    • Only recalculate when grid modified
    • Result: 100× fewer electrical calculations

Performance: 60 FPS with 100,000+ entities (in optimized factories)

Lessons:

  • Update skipping (sleeping entities) provides huge wins
  • Chunk-based simulation scales to massive worlds
  • Cache static calculations (power grid, fluid networks)
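
The update-skipping idea as a minimal sketch (illustrative, not Factorio's actual code): only entities on an active list tick, and an entity goes back to sleep when it reports it has nothing left to do. Entities are assumed to expose an update(dt) that returns whether they are still busy:

class UpdateScheduler:
    def __init__(self):
        self.active = set()     # entities that need ticking this frame
        self.sleeping = set()   # idle entities: no input, no output, no work

    def wake(self, entity):
        """Call when something changes for the entity (item arrives, power restored...)."""
        self.sleeping.discard(entity)
        self.active.add(entity)

    def tick(self, dt):
        finished = []
        for entity in self.active:
            still_busy = entity.update(dt)   # entity returns False when it goes idle
            if not still_busy:
                finished.append(entity)

        for entity in finished:
            self.active.discard(entity)
            self.sleeping.add(entity)

# Typically only 10-20% of entities are active, so 80-90% of update calls disappear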

Cross-References

Within Bravos/Simulation-Tactics

This skill applies to ALL other simulation skills:

  • traffic-and-pathfinding ← Optimize pathfinding with caching, time-slicing
  • ai-and-agent-simulation ← LOD for AI, time-sliced behavior trees
  • physics-simulation-patterns ← Spatial partitioning for collision, broad-phase
  • ecosystem-simulation ← LOD for distant populations, time-sliced updates
  • weather-and-time ← Particle budgets, LOD for effects
  • economic-simulation-patterns ← Time-slicing for economy updates

Related skills in this skillpack:

  • spatial-partitioning (planned) - Deep dive into quadtrees, octrees, grids
  • ecs-architecture (planned) - Data-oriented design, component systems

External Skillpacks

Yzmir/Performance-Optimization (if available):

  • Profiling tools and methodology
  • Memory optimization (pooling, allocators)
  • Cache optimization (data layouts)

Yzmir/Algorithms-and-Data-Structures (if available):

  • Spatial data structures (quadtree, k-d tree, BVH)
  • Priority queues (for time-slicing)
  • LRU cache implementation

Axiom/Game-Engine-Patterns (if available):

  • Update loop patterns
  • Frame time management
  • Object pooling

Testing Checklist

Use this checklist to verify optimization is complete and correct:

1. Profiling

  • Captured baseline performance (frame time, FPS)
  • Identified top 3-5 bottlenecks (80% of time)
  • Understood WHY each bottleneck is slow (algorithm, data, cache)
  • Documented baseline metrics for comparison

2. Algorithmic Optimization

  • Checked for O(n²) algorithms (proximity queries, collisions)
  • Applied spatial partitioning where appropriate (grid, quadtree)
  • Validated spatial queries return correct results
  • Measured improvement (should be 10-100×)

3. Level of Detail (LOD)

  • Defined LOD levels (typically 4: full, reduced, minimal, culled)
  • Implemented distance-based LOD assignment
  • Added importance weighting (player units, selected units)
  • Implemented hysteresis to prevent LOD thrashing
  • Verified no visual popping artifacts
  • Measured improvement (should be 2-10×)

4. Time-Slicing

  • Set frame time budget per system (e.g., 2ms for pathfinding)
  • Implemented priority queue (important work first)
  • Verified budget is respected (doesn't exceed limit)
  • Checked that high-priority work completes quickly
  • Measured improvement (should be 2-5×)

5. Caching

  • Identified redundant calculations to cache
  • Implemented cache with LRU eviction
  • Added TTL (time-to-live) expiration
  • Implemented invalidation triggers (terrain changes, etc.)
  • Verified cache hit rate (aim for 60-80%)
  • Checked no stale data bugs (units walking through walls)
  • Measured improvement (should be 2-10×)

6. Data-Oriented Design (if applicable)

  • Identified batch operations on many entities
  • Converted AoS → SoA for hot data
  • Verified memory layout is cache-friendly
  • Measured improvement (should be 1.5-3×)

7. Multithreading (if needed)

  • Verified all simpler optimizations done first
  • Identified embarrassingly parallel work
  • Implemented job system or data parallelism
  • Verified no race conditions (test with thread sanitizer)
  • Checked performance gain justifies complexity
  • Measured improvement (should be 1.5-4×)

8. Validation

  • Met target frame rate (60 FPS or 30 FPS)
  • Verified no gameplay regressions (units behave correctly)
  • Checked no visual artifacts (LOD popping, etc.)
  • Tested at target entity count (e.g., 1000 units)
  • Tested edge cases (10,000 units, worst-case scenarios)
  • Documented final performance metrics
  • Calculated total speedup (baseline → optimized)

9. Before/After Comparison

Metric                     | Before | After | Improvement
---------------------------|--------|-------|------------
Frame Time                 | ___ms  | ___ms | ___× faster
FPS                        | ___    | ___   | ___
Bottleneck System Time     | ___ms  | ___ms | ___× faster
Entity Count (target FPS)  | ___    | ___   | ___× more
Memory Usage               | ___MB  | ___MB | ___

10. Regression Tests

  • Units still path correctly (no walking through walls)
  • AI behavior unchanged (same decisions)
  • Combat calculations correct (same damage)
  • No crashes or exceptions
  • No memory leaks (long play session test)
  • Deterministic results (same input → same output)

Remember:

  1. Profile FIRST - measure before guessing
  2. Algorithmic optimization provides biggest wins (10-100×)
  3. LOD and time-slicing are essential for 1000+ entities
  4. Multithreading is LAST resort - complexity cost is high
  5. Validate improvement - measure before/after, check for regressions

Success criteria: Target frame rate achieved (60 FPS) with desired entity count (1000+) and no gameplay compromises.