| name | gpu-parallel-scheduling |
| description | GPU-safe parallel processing patterns for KINTSUGI to prevent OOM crashes and ensure Jupyter-compatible progress output |
| author | KINTSUGI Team |
| date | 2025-12-15 |
GPU Parallel Scheduling - Research Notes
Experiment Overview
| Item | Details |
|---|---|
| Date | 2025-12-15 |
| Goal | Fix kernel crashes during parallel GPU processing in Notebook 2 |
| Environment | KINTSUGI pipeline, CuPy, multi-GPU HPC, Jupyter notebooks |
| Status | Success |
Context
Notebook 2 (Cycle Processing) was crashing after Channel 1 completed when attempting to process remaining channels in parallel. The crash occurred immediately after the message "[PARALLEL] Processing channels [2, 3, 4] across GPUs..." with no further output.
Root Cause Analysis
Problem 1: Nested ThreadPoolExecutor Causing GPU Thread Explosion
The code structure was:
```python
# OUTER: parallel channel processing
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    for ch in channels:
        # INNER: each submitted channel spawns its own z-plane workers
        executor.submit(process_channel_zplanes_parallel, ch, ...)

# Inside process_channel_zplanes_parallel:
with ThreadPoolExecutor(max_workers=ZPLANES_PER_GPU) as inner_executor:
    # process z-planes in parallel on the channel's assigned GPU
    ...
```
With 2 GPUs, 3 channels, and ZPLANES_PER_GPU=4:
- Channel assignment: CH2→GPU0, CH3→GPU1, CH4→GPU0 (round-robin)
- GPU0 gets CH2 + CH4 simultaneously
- Each channel spawns 4 z-plane workers
- GPU0 runs 2 × 4 = 8 concurrent BaSiC corrections → OOM crash (worked through in the sketch below)
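A minimal sketch of this arithmetic, using the numbers above and assuming the notebook's `iter_cycle` helper behaves like `itertools.cycle`:

```python
# Sketch only: reproduces the round-robin assignment arithmetic from the list above.
from collections import Counter
from itertools import cycle

GPU_DEVICE_IDS = [0, 1]
ZPLANES_PER_GPU = 4
channels = [2, 3, 4]

# Static round-robin assignment (stand-in for the notebook's iter_cycle helper)
assignment = dict(zip(channels, cycle(GPU_DEVICE_IDS)))
print(assignment)                     # {2: 0, 3: 1, 4: 0} -> GPU0 gets CH2 and CH4

channels_per_gpu = Counter(assignment.values())
workers_on_gpu0 = channels_per_gpu[0] * ZPLANES_PER_GPU
print(workers_on_gpu0)                # 2 * 4 = 8 concurrent BaSiC corrections on GPU0
```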
Problem 2: Jupyter Thread Output Suppression
Print statements issued from worker threads may not appear in the Jupyter notebook output until the cell completes, or at all. This made debugging difficult: processing appeared to have stalled.
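A minimal illustration of the symptom and the workaround, assuming a Jupyter cell (the `worker` function is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker(i):
    print(f"worker {i} running")   # issued from a worker thread: may never show up
    return f"worker {i} done"      # return the message instead of printing it here

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(worker, i) for i in range(4)]
    for future in as_completed(futures):
        print(future.result())     # printed from the main thread: reliably visible
```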
Problem 3: Missing GPU Memory Cleanup Between Channels
GPU memory from Channel 1 wasn't being freed before Channel 2 started, causing cumulative memory pressure.
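A quick diagnostic sketch, assuming CuPy is importable: check the default memory pool between channels to see whether cached blocks are still being held.

```python
import cupy as cp

pool = cp.get_default_memory_pool()
print(pool.used_bytes(), pool.total_bytes())  # nonzero totals mean blocks are still cached
pool.free_all_blocks()                        # return cached blocks to the device
```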
Verified Workflow
Solution: Queue-Based GPU Allocation
Use a GPU queue to ensure exactly 1 channel per GPU at any time:
```python
import gc
import queue
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Create a pool of available GPUs
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

def process_channel_with_gpu(ch):
    """Process one channel, acquiring and releasing a GPU from the queue."""
    # ACQUIRE: block until a GPU is available
    dev_id = gpu_queue.get()
    ch_start = time.time()
    try:
        process_channel_zplanes_parallel(
            ...,                                 # other positional args elided
            device_id=dev_id,
            zplanes_per_gpu=ZPLANES_PER_GPU,
            # ... other keyword args elided
        )
        elapsed = time.time() - ch_start

        # GPU cleanup before releasing the device to the next channel
        try:
            import cupy as cp
            with cp.cuda.Device(dev_id):
                cp.get_default_memory_pool().free_all_blocks()
                cp.get_default_pinned_memory_pool().free_all_blocks()
        except Exception:
            pass
        gc.collect()

        return (ch, dev_id, elapsed, None)
    except Exception as e:
        return (ch, dev_id, time.time() - ch_start, str(e))
    finally:
        # RELEASE: return the GPU to the pool for the next channel
        gpu_queue.put(dev_id)

# Process with max_workers = number of GPUs
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    futures = {executor.submit(process_channel_with_gpu, ch): ch
               for ch in remaining_channels}

    # Main thread prints progress (Jupyter-compatible)
    for future in as_completed(futures):
        ch, dev_id, elapsed, error = future.result()
        if error:
            log(f"  [GPU{dev_id}] Channel {ch} ERROR: {error}")
        else:
            log(f"  [GPU{dev_id}] Channel {ch} COMPLETE ({elapsed:.1f}s)")
```
Key Principles
- max_workers = n_gpus: Never more concurrent channels than GPUs
- Queue acquisition: Each channel blocks until it gets a GPU
- Queue release in finally: GPU always returns to pool, even on error
- GPU cleanup before release: Free memory before another channel uses the GPU
- Main-thread progress: use `as_completed()` so progress is printed from the main thread (a context-manager sketch of the acquire/cleanup/release steps follows this list)
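These acquire/cleanup/release steps can also be bundled into a small context manager so a worker cannot forget to return its GPU. A minimal sketch, not part of the original notebook (the `acquire_gpu` helper name is hypothetical):

```python
import gc
from contextlib import contextmanager

@contextmanager
def acquire_gpu(gpu_queue):
    """Acquire a device id from the pool, clean up on exit, and always return it."""
    dev_id = gpu_queue.get()          # blocks until a GPU is free
    try:
        yield dev_id
    finally:
        # Free CuPy pools before handing the device to the next channel
        try:
            import cupy as cp
            with cp.cuda.Device(dev_id):
                cp.get_default_memory_pool().free_all_blocks()
                cp.get_default_pinned_memory_pool().free_all_blocks()
        except Exception:
            pass
        gc.collect()
        gpu_queue.put(dev_id)         # always return the GPU to the pool
```

A worker would then wrap its GPU work in `with acquire_gpu(gpu_queue) as dev_id:` and the queue handles the rest.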
Failed Attempts (Critical)
| Attempt | Why it Failed | Lesson Learned |
|---|---|---|
| Parallel channels with `iter_cycle(GPU_DEVICE_IDS)` | Pre-assigned GPUs didn't prevent multiple channels landing on the same GPU | Need dynamic GPU allocation, not static assignment |
| `max_workers=len(GPU_DEVICE_IDS)` without a queue | Channels could start on any GPU regardless of assignment | A queue enforces the 1:1 GPU:channel mapping |
| Printing progress from worker threads | Output lost in Jupyter (thread stdout not captured) | Always print from the main thread using `as_completed()` |
| GPU cleanup only at the end of the cell | Memory exhausted before the cell completed | Clean up after EACH channel, before releasing the GPU |
| Nested `ThreadPoolExecutor` | Thread explosion: outer × inner workers | Inner parallelism is fine; the outer pool must be limited to n_gpus |
| Adding a status-monitoring thread | Monitoring-thread output also lost in Jupyter | Only main-thread output is reliable in notebooks |
Final Parameters
GPU Queue Pattern
```python
# Initialize the GPU pool
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

# In each worker
dev_id = gpu_queue.get()   # blocks until a GPU is available
try:
    ...                    # GPU work
finally:
    gpu_queue.put(dev_id)  # always return the GPU to the pool
```
Progress Output Pattern (Jupyter-safe)
```python
with ThreadPoolExecutor(max_workers=n_gpus) as executor:
    futures = {executor.submit(work_fn, item): item for item in items}
    for future in as_completed(futures):
        result = future.result()
        print(f"Completed: {result}")  # main thread: visible in Jupyter
```
GPU Cleanup Pattern
```python
import gc

try:
    import cupy as cp
    with cp.cuda.Device(device_id):
        # CuPy caches allocations in its pools; free_all_blocks() returns them to the device
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
except Exception:
    pass
gc.collect()
```
Key Insights
- CuPy is not thread-safe for concurrent operations on the same GPU
- Jupyter suppresses thread output - always use main thread for progress
- Queue > static assignment for dynamic GPU allocation
- Cleanup before release prevents memory accumulation
- Inner parallelism is safe (z-planes on single GPU) when outer is controlled
- 1 channel per GPU rule prevents all OOM issues from parallel processing
When to Apply This Pattern
- Multi-GPU parallel processing in Jupyter notebooks
- Any CuPy-based batch processing with parallelism
- When kernel crashes after first item completes in parallel loop
- When progress output disappears in parallel processing
- When OOM occurs despite having enough total GPU memory
References
- CuPy Memory Management: https://docs.cupy.dev/en/stable/user_guide/memory.html
- Python concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html
- KINTSUGI Notebook 2: Cycle Processing
- Related skill: `gpu-memory-cleanup` (per-channel cleanup pattern)