| name | profiling-practices |
| description | Performance profiling best practices using py-spy and other Python profiling tools. Activated when profiling code, analyzing bottlenecks, or optimizing performance. |
Profiling practices
Purpose
Guide for performance profiling using py-spy, cProfile, and other Python profiling tools. Covers CPU profiling, memory profiling, and flame graph analysis.
When to use
This skill activates when:
- Identifying performance bottlenecks
- Profiling CPU usage
- Analyzing memory consumption
- Creating flame graphs
- Optimizing hot paths
Core principles
Profile before optimizing
- NEVER guess where bottlenecks are
- Always measure before and after changes
- Focus optimization on actual hot paths
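A minimal sketch of "measure before and after" using the standard-library `timeit` module. The two functions `baseline` and `candidate` are hypothetical stand-ins for the current code and a proposed rewrite; which one wins depends on the interpreter and the data, which is the reason to measure.

```python
import timeit


def baseline(values):
    """Hypothetical current version: repeated string concatenation."""
    out = ""
    for v in values:
        out += str(v) + ","
    return out


def candidate(values):
    """Hypothetical rewrite: a single join."""
    return "".join(f"{v}," for v in values)


def best_time(func, arg, repeat=5, number=20):
    """Return the best per-call time in seconds across several runs."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = list(range(10_000))
    # Confirm both versions agree before comparing their speed.
    assert baseline(data) == candidate(data)
    before = best_time(baseline, data)
    after = best_time(candidate, data)
    print(f"before: {before * 1e3:.3f} ms, after: {after * 1e3:.3f} ms")
```

Taking the minimum of several repeats reduces noise from other processes; a profiler is still needed to find where the time goes.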
Use the right tool
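- py-spy for sampling-based profiling (low overhead, can attach to a running process without code changes)
- cProfile for deterministic profiling (exact call counts, higher overhead)
- tracemalloc for memory profiling (tracks allocations made by Python code)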
CPU profiling with py-spy
Record profile to file
# Create flame graph SVG
uv run py-spy record -o profile.svg -- python script.py
# Create speedscope JSON
uv run py-spy record -o profile.json --format speedscope -- python script.py
Live process profiling
# Top-like view
uv run py-spy top -- python script.py
# Attach to running process (may need elevated privileges, e.g. sudo)
uv run py-spy top --pid 12345
Record options
# Increase sampling rate
uv run py-spy record --rate 250 -o profile.svg -- python script.py
# Include native frames
uv run py-spy record --native -o profile.svg -- python script.py
# Subprocesses too
uv run py-spy record --subprocesses -o profile.svg -- python script.py
CPU profiling with cProfile
Basic profiling
# Run with profiler
uv run python -m cProfile -o profile.prof script.py
# Sort by cumulative time
uv run python -m cProfile -s cumulative script.py
Analyze results
import pstats
# Load and analyze
p = pstats.Stats('profile.prof')
p.sort_stats('cumulative')
p.print_stats(20) # Top 20 functions
# Restrict output to entries matching a regex (file or function name)
p.print_stats('process')
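When reading `pstats` output, it can help to compare time spent inside a function itself (`tottime`) with time including its callees (`cumulative`). The sketch below profiles two hypothetical functions, `inner` and `outer`, in memory and returns the report as text; the function names and workload are illustrative only.

```python
import cProfile
import io
import pstats


def inner(n):
    """Hypothetical hot function: does the actual work."""
    return sum(i * i for i in range(n))


def outer(n):
    """Hypothetical caller: little own work, high cumulative time."""
    return [inner(n) for _ in range(20)]


def report(sort_key):
    """Profile outer() and return a text report sorted by sort_key."""
    profiler = cProfile.Profile()
    profiler.runcall(outer, 2_000)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer).strip_dirs()
    # Restrict the table to entries whose name matches the regex.
    stats.sort_stats(sort_key).print_stats("inner|outer")
    # Show which functions called inner().
    stats.print_callers("inner")
    return buffer.getvalue()


if __name__ == "__main__":
    print(report("tottime"))
    print(report("cumulative"))
```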
Profile specific code
import cProfile
import functools
import pstats

def profile_function(func):
    """Decorator to profile a function."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        result = profiler.runcall(func, *args, **kwargs)
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)
        return result
    return wrapper
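To profile a block of code rather than a whole function, `cProfile.Profile` can also be used as a context manager (Python 3.8 and later). A minimal sketch, where `busy_work` is a hypothetical workload and the report is captured into a string instead of being printed:

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Hypothetical workload: sum of squares below n."""
    return sum(i * i for i in range(n))


def profile_block(n):
    """Profile busy_work(n) and return (result, report text)."""
    with cProfile.Profile() as profiler:
        result = busy_work(n)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    value, text = profile_block(100_000)
    print(value)
    print(text)
```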
Memory profiling
With tracemalloc
import tracemalloc
# Start tracing
tracemalloc.start()
# Run code to profile
result = process_large_data()
# Get snapshot
snapshot = tracemalloc.take_snapshot()
# Print top memory consumers
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
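When only the overall footprint matters, `tracemalloc.get_traced_memory()` reports the current and peak traced size without taking a snapshot. A small sketch, with `build_table` as a hypothetical workload and `measure_peak` as an illustrative helper name:

```python
import tracemalloc


def build_table(rows):
    """Hypothetical workload: allocate a list of small strings."""
    return [f"row-{i}" for i in range(rows)]


def measure_peak(func, *args):
    """Return (result, current_bytes, peak_bytes) for a single call."""
    tracemalloc.start()
    try:
        result = func(*args)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # always stop tracing, even if func raises
    return result, current, peak


if __name__ == "__main__":
    table, current, peak = measure_peak(build_table, 100_000)
    print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")
```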
Comparing snapshots
import tracemalloc
tracemalloc.start()
# First snapshot
process_step1()
snapshot1 = tracemalloc.take_snapshot()
# Second snapshot
process_step2()
snapshot2 = tracemalloc.take_snapshot()
# Compare
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
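Snapshot comparisons often include noise from the import system and from tracemalloc itself. One way to reduce it is `Snapshot.filter_traces`. The sketch below wraps the comparison in a helper; the name `top_growth` and the `step` callable are illustrative assumptions, not part of any library.

```python
import tracemalloc


def top_growth(step, limit=5):
    """Run step() between two snapshots and return the largest differences."""
    filters = [
        # Exclude allocations made by the import machinery and by tracemalloc.
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, tracemalloc.__file__),
    ]
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot().filter_traces(filters)
        kept = step()  # keep a reference so the allocations stay alive
        after = tracemalloc.take_snapshot().filter_traces(filters)
    finally:
        tracemalloc.stop()
    return kept, after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    _, diffs = top_growth(lambda: [bytes(1000) for _ in range(1000)])
    for diff in diffs:
        print(diff)
```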
Flame graph analysis
Reading flame graphs
- Width: Time spent in function (wider = more time)
- Height: Call stack depth (taller = deeper calls)
- Colors: Usually arbitrary, can indicate different categories
What to look for
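- Wide leaf frames (no children; at the top of a classic flame graph, at the bottom of an inverted "icicle" layout): Direct time consumers
- Wide bars closer to the root: Functions with high cumulative time (their own work plus everything they call); width reflects samples, not call count
- Many thin bars: Time spread across many small calls, possibly inefficient iteration
- Deep stacks: Heavy recursion or many layers of indirection; check whether per-call overhead adds up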
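The difference between a frame's own width and its cumulative width can be made concrete with collapsed ("folded") stacks, the text format consumed by flamegraph.pl, where each line is `frame_a;frame_b;frame_c count`. The sketch below assumes input in that form (py-spy's `--format raw` is expected to produce something similar, but check your version); the sample data is invented for illustration.

```python
from collections import Counter


def summarize_folded(lines):
    """Return (self_samples, inclusive_samples) Counters from folded stacks."""
    self_samples = Counter()
    inclusive = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        frames = stack.split(";")
        # The last frame is the leaf: it was running when sampled.
        self_samples[frames[-1]] += count
        # Every frame on the stack gets inclusive credit, once per sample.
        for frame in set(frames):
            inclusive[frame] += count
    return self_samples, inclusive


# Invented sample data, root frame first.
SAMPLE = [
    "main;load;parse 30",
    "main;load;read 10",
    "main;compute 60",
]

if __name__ == "__main__":
    own, total = summarize_folded(SAMPLE)
    for frame, count in total.most_common():
        print(f"{frame}: inclusive={count}, self={own[frame]}")
```

Here `main` is the widest bar but has no samples of its own, while `compute` is a wide leaf and the first place to look.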
Optimization workflow
1. Establish baseline: Profile current state
2. Identify hot path: Find actual bottleneck
3. Hypothesize: Theory for improvement
4. Implement: Make targeted change
5. Verify: Profile again to confirm improvement
6. Repeat: If needed, go back to step 2
Common optimizations
Algorithm improvements
# O(n * m) - linear search inside a loop
for item in items:
    if item in other_items:  # O(m) list scan each time
        ...
# O(n + m) - build a set once for O(1) average lookups
other_set = set(other_items)
for item in items:
    if item in other_set:  # O(1) average lookup
        ...
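A runnable version of the comparison above, as a sketch. The function names `common_items_list` and `common_items_set` and the input sizes are illustrative; the set version assumes the items are hashable.

```python
import timeit


def common_items_list(items, other_items):
    """O(n * m): each `in` scans the list."""
    return [item for item in items if item in other_items]


def common_items_set(items, other_items):
    """O(n + m): build the set once, then O(1) average lookups."""
    other_set = set(other_items)
    return [item for item in items if item in other_set]


if __name__ == "__main__":
    items = list(range(0, 4_000, 2))
    other_items = list(range(0, 4_000, 3))
    # Both versions must return the same result before timing them.
    assert common_items_list(items, other_items) == common_items_set(items, other_items)
    for func in (common_items_list, common_items_set):
        seconds = timeit.timeit(lambda: func(items, other_items), number=5)
        print(f"{func.__name__}: {seconds / 5 * 1e3:.2f} ms per call")
```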
Caching
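from functools import lru_cache

# Result and compute() are placeholders for your own type and function
@lru_cache(maxsize=128)
def expensive_computation(key: str) -> "Result":
    """Cache expensive results; arguments must be hashable."""
    return compute(key)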
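Whether a cache pays off can be checked with `cache_info()`, which `lru_cache` attaches to the decorated function. A small sketch, where `slow_square` is a hypothetical stand-in for an expensive pure function:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def slow_square(n: int) -> int:
    """Hypothetical stand-in for an expensive, pure computation."""
    return n * n


def run_demo():
    """Call slow_square with repeated arguments and return the cache stats."""
    slow_square.cache_clear()  # start from an empty cache
    for n in [1, 2, 1, 1, 3, 2]:
        slow_square(n)
    return slow_square.cache_info()


if __name__ == "__main__":
    info = run_demo()
    print(info)  # three distinct arguments miss once each, three repeats hit
```

A low hit count relative to misses suggests the cache adds overhead without saving work.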
Generator expressions
# Memory-heavy: creates full list
data = [transform(x) for x in large_input]
result = sum(data)
# Memory-efficient: processes one at a time
data = (transform(x) for x in large_input)
result = sum(data)
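A quick way to see the difference is to compare the size of the two objects with `sys.getsizeof`. This is only a rough sketch: `getsizeof` reports the list's pointer array, not the elements it refers to, so the real gap is larger. `transform` here is a hypothetical stand-in.

```python
import sys


def transform(x):
    """Hypothetical per-item transformation."""
    return x * 2


def container_sizes(n=100_000):
    """Return (list_bytes, generator_bytes) for the same computation."""
    as_list = [transform(x) for x in range(n)]
    as_gen = (transform(x) for x in range(n))
    sizes = (sys.getsizeof(as_list), sys.getsizeof(as_gen))
    # Both forms produce the same total once consumed.
    assert sum(as_list) == sum(as_gen)
    return sizes


if __name__ == "__main__":
    list_bytes, gen_bytes = container_sizes()
    print(f"list: {list_bytes} bytes, generator: {gen_bytes} bytes")
```

A generator can be consumed only once, so keep the list when the data is needed more than one time.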
Checklist
- Baseline profile established
- Hot paths identified with data
- Changes targeted at actual bottlenecks
- Improvements verified with profiling
- No functionality broken