| name | debugging-and-profiling |
| description | Debugging fundamentals, debugpy/VS Code, pdb, CPU profiling, memory profiling, profiling async code, performance optimization, systematic diagnosis, common bottlenecks |
Debugging and Profiling
Overview
Core Principle: Profile before optimizing. Humans are terrible at guessing where code is slow. Always measure before making changes.
Python debugging and profiling enables systematic problem diagnosis and performance optimization. Use debugpy/pdb for step-through debugging, cProfile for CPU profiling, memory_profiler for memory analysis. The biggest mistake: optimizing code without profiling first—you'll likely optimize the wrong thing.
When to Use
Use this skill when:
- "Code is slow"
- "How to profile Python?"
- "Memory leak"
- "Debugging not working"
- "Find bottleneck"
- "Optimize performance"
- "Step through code"
- "Where is my code spending time?"
Don't use when:
- Setting up project (use project-structure-and-tooling)
- Already know what to optimize (but still profile to verify!)
- Algorithm selection (different skill domain)
Symptoms triggering this skill:
- Code runs slower than expected
- Memory usage growing over time
- Need to understand execution flow
- Performance degraded after changes
Debugging Fundamentals
Using debugpy with VS Code
# ✅ CORRECT: debugpy for remote debugging
import debugpy
# Allow VS Code to attach
debugpy.listen(5678)
print("Waiting for debugger to attach...")
debugpy.wait_for_client()
# Your code here
def process_data(data):
result = []
for item in data:
# Set breakpoint in VS Code on this line
transformed = transform(item)
result.append(transformed)
return result
# VS Code launch.json configuration:
"""
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: Attach",
"type": "python",
"request": "attach",
"connect": {
"host": "localhost",
"port": 5678
}
}
]
}
"""
Using pdb (Python Debugger)
# ✅ CORRECT: pdb for interactive debugging
import pdb
def buggy_function(data):
result = []
for i, item in enumerate(data):
# Drop into debugger
pdb.set_trace() # Or: breakpoint() in Python 3.7+
processed = item * 2
result.append(processed)
return result
# pdb commands:
# n (next): Execute next line
# s (step): Step into function
# c (continue): Continue execution
# p variable: Print variable
# pp variable: Pretty print variable
# l (list): Show current location in code
# w (where): Show stack trace
# q (quit): Quit debugger
Conditional Breakpoints
# ❌ WRONG: Breaking on every iteration
def process_items(items):
for item in items:
pdb.set_trace() # Breaks 10000 times!
process(item)
# ✅ CORRECT: Conditional breakpoint
def process_items(items):
for i, item in enumerate(items):
if i == 5000: # Only break on specific iteration
breakpoint()
process(item)
# ✅ BETTER: Break on a data-dependent condition
def process_items(items):
for item in items:
if item.value < 0: # Break only when problematic
breakpoint()
process(item)
Post-Mortem Debugging
# ✅ CORRECT: Debug after exception
import pdb
def main():
try:
# Code that might raise exception
result = risky_operation()
except Exception:
# Drop into debugger at exception point
pdb.post_mortem()
# ✅ CORRECT: Auto post-mortem for unhandled exceptions
import sys
def custom_excepthook(exc_type, exc_value, tb):  # Avoid shadowing built-in names
    pdb.post_mortem(tb)
sys.excepthook = custom_excepthook
# Now unhandled exceptions drop into pdb automatically
Why this matters: Breakpoints let you inspect state at the exact point of failure, conditional breakpoints avoid noise, and post-mortem debugging lets you examine a crash after the fact.
CPU Profiling
cProfile for Function-Level Profiling
import cProfile
import pstats
# ❌ WRONG: Guessing which function is slow
def slow_program():
# "I think this loop is the problem..."
for i in range(1000):
process_data(i)
# ✅ CORRECT: Profile to find actual bottleneck
def slow_program():
for i in range(1000):
process_data(i)
# Profile the function
cProfile.run('slow_program()', 'profile_stats')
# Analyze results
stats = pstats.Stats('profile_stats')
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions by cumulative time
# ✅ CORRECT: Profile with context manager
from contextlib import contextmanager
import cProfile
import pstats
@contextmanager
def profiled():
pr = cProfile.Profile()
pr.enable()
yield
pr.disable()
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats('cumulative')
stats.print_stats(20)
# Usage
with profiled():
slow_program()
Profiling Specific Code Blocks
# ✅ CORRECT: Profile specific section
import cProfile
pr = cProfile.Profile()
# Normal code
setup_data()
# Profile this section
pr.enable()
expensive_operation()
pr.disable()
# More normal code
cleanup()
# View results
pr.print_stats(sort='cumulative')
Line-Level Profiling with line_profiler
# Install: pip install line_profiler
# ✅ CORRECT: Line-by-line profiling
from line_profiler import LineProfiler
@profile  # Injected by kernprof at runtime; remove it (it raises NameError) when running normally
def slow_function():
    total = 0
    for i in range(10000):
        total += i ** 2
    return total
# Run with kernprof:
# kernprof -l -v script.py
# Or programmatically (the @profile decorator is not needed here):
lp = LineProfiler()
lp.add_function(slow_function)
lp.enable()
slow_function()
lp.disable()
lp.print_stats()
# Output shows time spent per line:
# Line # Hits Time Per Hit % Time Line Contents
# ==============================================================
# 1 def slow_function():
# 2 1 2.0 2.0 0.0 total = 0
# 3 10001 15234.0 1.5 20.0 for i in range(10000):
# 4 10000 60123.0 6.0 80.0 total += i ** 2
# 5 1 1.0 1.0 0.0 return total
Why this matters: cProfile shows which functions are slow; line_profiler shows which lines within those functions are slow. Both are essential for targeted optimization.
Visualizing Profiles with SnakeViz
# Install: pip install snakeviz
# Profile code
python -m cProfile -o program.prof script.py
# Visualize
snakeviz program.prof
# Opens browser with interactive visualization:
# - Sunburst chart showing call hierarchy
# - Icicle chart showing time distribution
# - Click functions to zoom in
Memory Profiling
Memory Usage with memory_profiler
# Install: pip install memory_profiler
from memory_profiler import profile
# ✅ CORRECT: Track memory usage per line
@profile
def memory_hungry_function():
# Line-by-line memory usage shown
    big_list = [i for i in range(1000000)]        # List of one million ints
    big_dict = {i: i**2 for i in range(1000000)}  # Dict of one million entries
return len(big_list), len(big_dict)
# Run with:
# python -m memory_profiler script.py
# Output:
# Line # Mem usage Increment Line Contents
# ================================================
# 3 38.3 MiB 38.3 MiB @profile
# 4 def memory_hungry_function():
# 5 45.2 MiB 6.9 MiB big_list = [i for i in range(1000000)]
# 6 83.1 MiB 37.9 MiB big_dict = {i: i**2 for i in range(1000000)}
# 7 83.1 MiB 0.0 MiB return len(big_list), len(big_dict)
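For memory over the whole run rather than per line, memory_profiler also ships the mprof command; a minimal sketch of that workflow, with script.py as a placeholder:
# Record and plot memory usage over time
mprof run script.py   # Writes samples to an mprofile_*.dat file
mprof plot            # Plots the recorded samples (requires matplotlib)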
Finding Memory Leaks
# ✅ CORRECT: Detect memory leaks with tracemalloc
import tracemalloc
# Start tracing
tracemalloc.start()
# Take snapshot before
snapshot1 = tracemalloc.take_snapshot()
# Run code that might leak
problematic_function()
# Take snapshot after
snapshot2 = tracemalloc.take_snapshot()
# Compare snapshots
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top 10 memory increases:")
for stat in top_stats[:10]:
print(stat)
tracemalloc.stop()
# ✅ CORRECT: Track specific objects
import gc
import sys
def find_memory_leak():
# Force garbage collection
gc.collect()
# Track objects before
before = len(gc.get_objects())
# Run potentially leaky code
for _ in range(100):
leaky_operation()
# Force GC again
gc.collect()
# Track objects after
after = len(gc.get_objects())
if after > before:
print(f"Potential leak: {after - before} objects not collected")
# Find what's keeping objects alive
for obj in gc.get_objects():
if isinstance(obj, MyClass): # Suspect class
print(f"Found {type(obj)}: {sys.getrefcount(obj)} references")
print(gc.get_referrers(obj))
Profiling Memory with objgraph
# Install: pip install objgraph
import objgraph
# ✅ CORRECT: Find most common objects
def analyze_memory():
objgraph.show_most_common_types()
# Output:
# dict 12453
# function 8234
# list 6789
# ...
# ✅ CORRECT: Track object growth
objgraph.show_growth()
potentially_leaky_function()
objgraph.show_growth() # Shows objects that increased
# ✅ CORRECT: Visualize object references
import objgraph
objgraph.show_refs([my_object], filename='refs.png')
# Creates graph showing what references my_object
Why this matters: Memory leaks cause gradual performance degradation. tracemalloc and memory_profiler help find exactly where memory is allocated.
Profiling Async Code
Profiling Async Functions
import asyncio
import cProfile
import pstats
# ❌ WRONG: cProfile doesn't work well with async
async def slow_async():
await asyncio.sleep(1)
await process_data()
cProfile.run('asyncio.run(slow_async())') # Misleading results
# ✅ CORRECT: Use yappi for async profiling
# Install: pip install yappi
import yappi
async def slow_async():
await asyncio.sleep(1)
await process_data()
yappi.set_clock_type("wall") # Use wall time, not CPU time
yappi.start()
asyncio.run(slow_async())
yappi.stop()
# Print stats
stats = yappi.get_func_stats()
stats.sort("totaltime", "desc")
stats.print_all()
# ✅ CORRECT: Profile coroutines specifically
stats = yappi.get_func_stats(filter_callback=lambda x: 'coroutine' in x.name)
stats.print_all()
Detecting Blocking Code in Async
# ✅ CORRECT: Detect event loop blocking
import asyncio
import time
class LoopMonitor:
def __init__(self, threshold: float = 0.1):
self.threshold = threshold
async def monitor(self):
while True:
start = time.monotonic()
await asyncio.sleep(0.01) # Very short sleep
elapsed = time.monotonic() - start
if elapsed > self.threshold:
print(f"WARNING: Event loop blocked for {elapsed:.3f}s")
async def main():
# Start monitor
monitor = LoopMonitor(threshold=0.1)
monitor_task = asyncio.create_task(monitor.monitor())
# Run your async code
await your_async_function()
monitor_task.cancel()
# ✅ CORRECT: Use asyncio debug mode
asyncio.run(main(), debug=True)
# Warns about slow callbacks (>100ms)
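Once the monitor or debug mode points at a blocking call, the usual fix is to push that work off the event loop. A minimal sketch, assuming blocking_io is the synchronous function you identified (asyncio.to_thread requires Python 3.9+):
# ✅ CORRECT: Move blocking work off the event loop
import asyncio
def blocking_io(path):  # Hypothetical synchronous, blocking function
    with open(path, "rb") as f:
        return f.read()
async def handler(path):
    # Runs in a worker thread, so the event loop keeps servicing other tasks
    data = await asyncio.to_thread(blocking_io, path)
    return len(data)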
Performance Optimization Strategies
Optimization Workflow
# ✅ CORRECT: Systematic optimization approach
# 1. Profile to find bottleneck
import cProfile
cProfile.run('main()', 'profile_stats')
# 2. Analyze results
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative')
stats.print_stats(10) # Focus on top 10
# 3. Identify specific slow function
def slow_function(data):
# Original implementation
result = []
for item in data:
if is_valid(item):
result.append(transform(item))
return result
# 4. Create benchmark
import timeit
data = create_test_data(10000)
def benchmark(func):
    time_taken = timeit.timeit(lambda: func(data), number=100)
    print(f"Average time: {time_taken / 100:.4f}s")
benchmark(slow_function)  # Baseline: 0.1234s
# 5. Optimize
def optimized_function(data):
# Use list comprehension (faster)
return [transform(item) for item in data if is_valid(item)]
# 6. Benchmark again
benchmark(optimized_function)  # 0.0789s - 36% faster!
# 7. Verify correctness
assert slow_function(data) == optimized_function(data)
# 8. Re-profile entire program to verify improvement
cProfile.run('main()', 'profile_stats_optimized')
Why this matters: Without profiling, you might optimize code that takes 1% of runtime, ignoring the 90% bottleneck. Always measure.
Common Optimizations
import re
# ❌ WRONG: Repeated expensive operations
def process_items(items):
for item in items:
# Regex compiled every iteration!
pattern = re.compile(r'\d+')
match = pattern.search(item)
# ✅ CORRECT: Move expensive operations outside loop
def process_items(items):
pattern = re.compile(r'\d+') # Compile once
for item in items:
match = pattern.search(item)
# ❌ WRONG: Growing list with repeated concatenation
def build_large_list():
result = []
for i in range(100000):
result = result + [i] # Creates new list each time! O(n²)
# ✅ CORRECT: Use append
def build_large_list():
result = []
for i in range(100000):
result.append(i) # O(n)
# ❌ WRONG: Checking membership in list
def filter_items(items, blacklist):
return [item for item in items if item not in blacklist]
# O(n * m) if blacklist is list
# ✅ CORRECT: Use set for membership checks
def filter_items(items, blacklist):
blacklist_set = set(blacklist) # O(m)
return [item for item in items if item not in blacklist_set]
# O(n) for iteration + O(1) per lookup = O(n)
Caching Results
from functools import lru_cache
# ❌ WRONG: Recomputing expensive results
def fibonacci(n):
if n < 2:
return n
return fibonacci(n-1) + fibonacci(n-2)
# O(2^n) - recalculates same values repeatedly
# ✅ CORRECT: Cache results
@lru_cache(maxsize=None)
def fibonacci(n):
if n < 2:
return n
return fibonacci(n-1) + fibonacci(n-2)
# O(n) - each value computed once
# ✅ CORRECT: Custom caching for unhashable arguments
import hashlib
from functools import wraps
def cache_dataframe_results(func):
cache = {}
@wraps(func)
def wrapper(df):
# Use hash of dataframe content as key
key = hashlib.md5(df.to_csv(index=False).encode()).hexdigest()
if key not in cache:
cache[key] = func(df)
return cache[key]
return wrapper
@cache_dataframe_results
def expensive_dataframe_operation(df):
# Complex computation
return df.groupby('category').agg({'value': 'sum'})
Systematic Diagnosis
Performance Degradation Diagnosis
# ✅ CORRECT: Diagnose performance regression
import cProfile
import pstats
def diagnose_slowdown():
"""Compare current vs baseline performance."""
# Profile current code
cProfile.run('main()', 'current_profile.prof')
# Load baseline profile (from git history or previous run)
# git show main:profile.prof > baseline_profile.prof
current = pstats.Stats('current_profile.prof')
baseline = pstats.Stats('baseline_profile.prof')
print("=== CURRENT ===")
current.sort_stats('cumulative')
current.print_stats(10)
print("\n=== BASELINE ===")
baseline.sort_stats('cumulative')
baseline.print_stats(10)
# Look for functions that got slower
# Compare cumulative times
Memory Leak Diagnosis
# ✅ CORRECT: Systematic memory leak detection
import tracemalloc
import gc
def diagnose_memory_leak():
"""Run function multiple times and check memory growth."""
gc.collect()
tracemalloc.start()
# Baseline
snapshot1 = tracemalloc.take_snapshot()
# Run 100 times
for _ in range(100):
potentially_leaky_function()
gc.collect()
# Check memory
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
print("Top 10 memory allocations:")
for stat in top_stats[:10]:
print(f"{stat.traceback}: +{stat.size_diff / 1024:.1f} KB")
tracemalloc.stop()
I/O vs CPU Bound Diagnosis
# ✅ CORRECT: Determine if I/O or CPU bound
import time
def diagnose_bottleneck():
    """Determine whether the program is I/O bound or CPU bound."""
    # Measure wall-clock time and CPU time over the same run
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    main()
    cpu_time = time.process_time() - start_cpu
    wall_time = time.perf_counter() - start_wall
    print(f"Wall time: {wall_time:.2f}s")
    print(f"CPU time: {cpu_time:.2f}s")
    if cpu_time / wall_time > 0.9:
        print("CPU bound - optimize computation")
        # Consider: Cython, NumPy, multiprocessing
    else:
        print("I/O bound - optimize I/O")
        # Consider: async/await, caching, batching
Common Bottlenecks and Solutions
String Concatenation
# ❌ WRONG: String concatenation in loop
def build_string(items):
result = ""
for item in items:
result += str(item) + "\n" # Creates new string each time
return result
# O(n²) time complexity
# ✅ CORRECT: Use join
def build_string(items):
return "\n".join(str(item) for item in items)
# O(n) time complexity
# Benchmark:
# 1000 items: 0.0015s (join) vs 0.0234s (concatenation) - 15x faster
# 10000 items: 0.015s (join) vs 2.341s (concatenation) - 156x faster
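Those figures are illustrative and vary by machine; a small timeit sketch for reproducing the comparison yourself, assuming the two versions above have been renamed build_string_concat and build_string_join so both can exist at once:
import timeit
items = list(range(10000))
concat_time = timeit.timeit(lambda: build_string_concat(items), number=100)
join_time = timeit.timeit(lambda: build_string_join(items), number=100)
print(f"concat: {concat_time:.3f}s  join: {join_time:.3f}s  speedup: {concat_time / join_time:.0f}x")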
List Comprehension vs Map/Filter
import timeit
# ✅ CORRECT: List comprehension (usually fastest)
def with_list_comp(data):
return [x * 2 for x in data if x > 0]
# ✅ CORRECT: Generator (memory efficient for large data)
def with_generator(data):
return (x * 2 for x in data if x > 0)
# Map/filter (sometimes faster for simple operations)
def with_map_filter(data):
return map(lambda x: x * 2, filter(lambda x: x > 0, data))
# Benchmark
data = list(range(1000000))
print(timeit.timeit(lambda: list(with_list_comp(data)), number=10))
print(timeit.timeit(lambda: list(with_generator(data)), number=10))
print(timeit.timeit(lambda: list(with_map_filter(data)), number=10))
# Results: List comprehension usually fastest for complex logic
# Generator best when you don't need all results at once
Dictionary Lookups vs List Searches
# ❌ WRONG: Searching in list
def find_users_list(user_ids, all_users_list):
results = []
for user_id in user_ids:
for user in all_users_list: # O(n) per lookup
if user['id'] == user_id:
results.append(user)
break
return results
# O(n * m) time complexity
# ✅ CORRECT: Use dictionary
def find_users_dict(user_ids, all_users_dict):
return [all_users_dict[uid] for uid in user_ids if uid in all_users_dict]
# O(n) time complexity
# Benchmark:
# 1000 lookups in 10000 items:
# List: 1.234s
# Dict: 0.001s - 1234x faster!
DataFrame Iteration Anti-Pattern
import pandas as pd
# ❌ WRONG: Iterating over DataFrame rows
def process_rows_iterrows(df):
results = []
for idx, row in df.iterrows(): # VERY SLOW
if row['value'] > 0:
results.append(row['value'] * 2)
return results
# ✅ CORRECT: Vectorized operations
def process_rows_vectorized(df):
mask = df['value'] > 0
return (df.loc[mask, 'value'] * 2).tolist()
# Benchmark with 100,000 rows:
# iterrows: 15.234s
# vectorized: 0.015s - 1000x faster!
Profiling Tools Comparison
When to Use Which Tool
| Tool | Use Case | Output |
|---|---|---|
| cProfile | Function-level CPU profiling | Which functions take the most time |
| line_profiler | Line-level CPU profiling | Which lines within a function are slow |
| memory_profiler | Line-level memory profiling | Memory usage per line |
| tracemalloc | Memory allocation tracking | Where memory allocated |
| yappi | Async/multithreaded profiling | Profile concurrent code |
| py-spy | Sampling profiler (no code changes) | Profile running processes |
| scalene | CPU+GPU+memory profiling | Comprehensive profiling |
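scalene, listed above, is driven from the command line much like py-spy and SnakeViz; the basic invocations below are standard, but check scalene --help for version-specific options.
# Install: pip install scalene
# Profile CPU, memory, and (where supported) GPU in one pass
scalene script.py
# Or as a module
python -m scalene script.py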
py-spy for Production Profiling
# Install: pip install py-spy
# Profile running process (no code changes needed!)
py-spy record -o profile.svg --pid 12345
# Profile for 60 seconds
py-spy record -o profile.svg --duration 60 -- python script.py
# Top-like view of running process
py-spy top --pid 12345
# Why use py-spy:
# - No code changes needed
# - Minimal overhead
# - Can attach to running process
# - Great for production debugging
Anti-Patterns
Premature Optimization
# ❌ WRONG: Optimizing before measuring
def process_data(data):
    # "Let me make this fast with complex caching..."
    # Hours spent optimizing a function that takes 0.1% of runtime
    ...
# ✅ CORRECT: Profile first
cProfile.run('main()', 'profile.prof')
# Oh, process_data only takes 0.1% of time
# The real bottleneck is database queries (90% of time)
# Optimize database queries instead!
Micro-Optimizations
# ❌ WRONG: Micro-optimizing at expense of readability
def calculate(x, y):
# "Using bit shift instead of multiply by 2 for speed!"
return (x << 1) + (y << 1)
# Saved: ~0.0000001 seconds per call
# Cost: Unreadable code
# ✅ CORRECT: Clear code first
def calculate(x, y):
    return 2 * x + 2 * y
# The savings are negligible next to ordinary interpreter overhead
# Only optimize if the profiler shows this is a bottleneck
Not Benchmarking Changes
# ❌ WRONG: Assuming optimization worked
def slow_function():
# Original code
pass
def optimized_function():
# "Optimized" code
pass
# Assume optimized_function is faster without measuring
# ✅ CORRECT: Benchmark before and after
import timeit
before = timeit.timeit(slow_function, number=1000)
after = timeit.timeit(optimized_function, number=1000)
print(f"Before: {before:.4f}s")
print(f"After: {after:.4f}s")
print(f"Speedup: {before/after:.2f}x")
# Verify correctness
assert slow_function() == optimized_function()
Decision Trees
What Tool to Use for Profiling?
What do I need to profile?
├─ CPU time
│ ├─ Function-level → cProfile
│ ├─ Line-level → line_profiler
│ └─ Async code → yappi
├─ Memory usage
│ ├─ Line-level → memory_profiler
│ ├─ Allocation tracking → tracemalloc
│ └─ Object types → objgraph
└─ Running process (no code changes) → py-spy
Optimization Strategy
Is code slow?
├─ Yes → Profile to find bottleneck
│ ├─ CPU bound → Profile with cProfile
│ │ └─ Optimize hot functions (vectorize, cache, algorithms)
│ └─ I/O bound → Profile with timing
│ └─ Use async/await, caching, batching
└─ No → Don't optimize (focus on features/correctness)
Memory Issue Diagnosis
Is memory usage high?
├─ Yes → Profile with memory_profiler
│ ├─ Growing over time → Memory leak
│ │ └─ Use tracemalloc to find leak
│ └─ High but stable → Large data structures
│ └─ Optimize data structures (generators, efficient types)
└─ No → Monitor but don't optimize yet
Integration with Other Skills
After using this skill:
- If I/O bound → See @async-patterns-and-concurrency for async optimization
- If data processing slow → See @scientific-computing-foundations for vectorization
- If need to track improvements → See @ml-engineering-workflows for metrics
Before using this skill:
- If unsure code is slow → Use this skill to profile and confirm!
- If setting up profiling → See @project-structure-and-tooling for dependencies
Quick Reference
Essential Profiling Commands
# CPU profiling
import cProfile
cProfile.run('main()', 'profile.prof')
# View results
import pstats
stats = pstats.Stats('profile.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)
# Memory profiling
import tracemalloc
tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
print(stat)
Debugging Commands
# Set breakpoint
breakpoint() # Python 3.7+
# or
import pdb; pdb.set_trace()
# pdb commands:
# n - next line
# s - step into
# c - continue
# p var - print variable
# l - list code
# w - where am I
# q - quit
Optimization Checklist
- Profile before optimizing (use cProfile)
- Identify bottleneck (top 20% of time)
- Create benchmark for bottleneck
- Optimize bottleneck
- Benchmark again to verify improvement
- Re-profile entire program
- Verify correctness (tests still pass)
Common Optimizations
| Problem | Solution | Speedup |
|---|---|---|
| String concatenation in loop | Use str.join() | 10-100x |
| List membership checks | Use set | 100-1000x |
| DataFrame iteration | Vectorize with NumPy/pandas | 100-1000x |
| Repeated expensive computation | Cache with @lru_cache | ∞ (depends on cache hits) |
| I/O bound | Use async/await | 10-100x |
| CPU bound with parallelizable work | Use multiprocessing | ~number of cores |
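The last table row recommends multiprocessing for CPU-bound, parallelizable work; a minimal sketch, where cpu_heavy is a hypothetical pure function over independent inputs:
# ✅ CORRECT: Spread CPU-bound work across cores
from multiprocessing import Pool
def cpu_heavy(n):  # Hypothetical pure, CPU-bound function
    return sum(i * i for i in range(n))
if __name__ == "__main__":
    with Pool() as pool:  # Defaults to one worker per CPU core
        results = pool.map(cpu_heavy, [2_000_000] * 8)
    print(sum(results))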
Red Flags
If you find yourself:
- Optimizing before profiling → STOP, profile first
- Spending hours on micro-optimizations → Check if it's bottleneck
- Making code unreadable for speed → Benchmark the benefit
- Assuming what's slow → Profile to verify
Always measure. Never assume.