Claude Code Plugins

Community-maintained marketplace

gpu-parallel-scheduling

@smith6jt-cop/Skills_Registry

GPU-safe parallel processing patterns for KINTSUGI to prevent OOM crashes and ensure Jupyter-compatible progress output

Install Skill

1. Download skill

2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name gpu-parallel-scheduling
description GPU-safe parallel processing patterns for KINTSUGI to prevent OOM crashes and ensure Jupyter-compatible progress output
author KINTSUGI Team
date 2025-12-15

GPU Parallel Scheduling - Research Notes

Experiment Overview

Date:        2025-12-15
Goal:        Fix kernel crashes during parallel GPU processing in Notebook 2
Environment: KINTSUGI pipeline, CuPy, multi-GPU HPC, Jupyter notebooks
Status:      Success

Context

Notebook 2 (Cycle Processing) was crashing after Channel 1 completed when attempting to process remaining channels in parallel. The crash occurred immediately after the message "[PARALLEL] Processing channels [2, 3, 4] across GPUs..." with no further output.

Root Cause Analysis

Problem 1: Nested ThreadPoolExecutor Causing GPU Thread Explosion

The code structure was:

# OUTER: parallel channel processing (one worker thread per GPU slot)
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    for ch in channels:
        # Each channel is submitted to a worker thread ...
        executor.submit(process_channel_zplanes_parallel, ch, ...)
        # ... and INSIDE process_channel_zplanes_parallel, each channel
        # spawns its own pool of parallel z-plane workers:
        #     with ThreadPoolExecutor(max_workers=ZPLANES_PER_GPU) as inner_executor:
        #         # process z-planes in parallel

With 2 GPUs, 3 channels, and ZPLANES_PER_GPU=4:

  • Channel assignment: CH2→GPU0, CH3→GPU1, CH4→GPU0 (round-robin)
  • GPU0 gets CH2 + CH4 simultaneously
  • Each channel spawns 4 z-plane workers
  • GPU0 runs 2 × 4 = 8 concurrent BaSiC corrections → OOM CRASH (reproduced in the sketch below)
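
The overload arithmetic is easy to reproduce. A minimal sketch, with variable values mirroring the run above; the itertools.cycle round-robin is an assumption for illustration, not the KINTSUGI code:

from collections import Counter
from itertools import cycle

GPU_DEVICE_IDS = [0, 1]      # 2 GPUs
channels = [2, 3, 4]         # remaining channels
ZPLANES_PER_GPU = 4          # inner z-plane workers per channel

# Static round-robin assignment: CH2 -> GPU0, CH3 -> GPU1, CH4 -> GPU0
assignment = dict(zip(channels, cycle(GPU_DEVICE_IDS)))

# Concurrent z-plane workers per GPU if all channels run at once
workers_per_gpu = Counter()
for ch, gpu in assignment.items():
    workers_per_gpu[gpu] += ZPLANES_PER_GPU

print(assignment)       # {2: 0, 3: 1, 4: 0}
print(workers_per_gpu)  # Counter({0: 8, 1: 4}) -> GPU0 oversubscribed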

Problem 2: Jupyter Thread Output Suppression

Print statements from worker threads don't reliably appear in Jupyter notebook output; they may be delayed until the cell completes or lost entirely. This made debugging difficult because processing appeared to stall.
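
If per-item progress is needed while the cell is still running, one Jupyter-compatible workaround is to have workers push messages onto a queue.Queue and let the main thread drain and print them. This is a sketch of the idea (not code from the pipeline), consistent with the main-thread-only rule applied below:

import queue
import time
from concurrent.futures import ThreadPoolExecutor

progress_q = queue.Queue()

def worker(item):
    progress_q.put(f"started {item}")   # no print() inside the worker thread
    time.sleep(0.1)                     # placeholder for GPU work
    progress_q.put(f"finished {item}")
    return item

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(worker, i) for i in range(4)]
    # The main thread polls the queue and prints, so messages appear live
    while any(not f.done() for f in futures) or not progress_q.empty():
        try:
            print(progress_q.get(timeout=0.2))
        except queue.Empty:
            pass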

Problem 3: Missing GPU Memory Cleanup Between Channels

GPU memory from Channel 1 wasn't being freed before Channel 2 started, causing cumulative memory pressure.

Verified Workflow

Solution: Queue-Based GPU Allocation

Use a GPU queue to ensure exactly 1 channel per GPU at any time:

import gc
import queue
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Create pool of available GPUs
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

def process_channel_with_gpu(ch):
    """Process channel, acquiring and releasing GPU from queue."""
    # ACQUIRE: Block until a GPU is available
    dev_id = gpu_queue.get()
    ch_start = time.time()

    try:
        process_channel_zplanes_parallel(
            ...,
            device_id=dev_id,
            zplanes_per_gpu=ZPLANES_PER_GPU,
            ...
        )
        elapsed = time.time() - ch_start

        # GPU cleanup before releasing
        try:
            import cupy as cp
            with cp.cuda.Device(dev_id):
                cp.get_default_memory_pool().free_all_blocks()
                cp.get_default_pinned_memory_pool().free_all_blocks()
        except Exception:
            pass
        gc.collect()

        return (ch, dev_id, elapsed, None)
    except Exception as e:
        return (ch, dev_id, time.time() - ch_start, str(e))
    finally:
        # RELEASE: Return GPU to pool for next channel
        gpu_queue.put(dev_id)

# Process with max_workers = number of GPUs
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    futures = {executor.submit(process_channel_with_gpu, ch): ch
               for ch in remaining_channels}

    # Main thread prints progress (Jupyter-compatible)
    for future in as_completed(futures):
        ch, dev_id, elapsed, error = future.result()
        if error:
            log(f"  [GPU{dev_id}] Channel {ch} ERROR: {error}")
        else:
            log(f"  [GPU{dev_id}] Channel {ch} COMPLETE ({elapsed:.1f}s)")

Key Principles

  1. max_workers = n_gpus: Never more concurrent channels than GPUs
  2. Queue acquisition: Each channel blocks until it gets a GPU
  3. Queue release in finally: GPU always returns to pool, even on error
  4. GPU cleanup before release: Free memory before another channel uses the GPU
  5. Main-thread progress: Use as_completed() to print from main thread

Failed Attempts (Critical)

  • Parallel channels with iter_cycle(GPU_DEVICE_IDS)
    Why it failed: pre-assigned GPUs didn't prevent multiple channels on the same GPU.
    Lesson: need dynamic GPU allocation, not static assignment.
  • max_workers=len(GPU_DEVICE_IDS) without a queue
    Why it failed: channels could start on any GPU regardless of assignment.
    Lesson: a queue ensures a 1:1 GPU-to-channel mapping.
  • Printing progress from worker threads
    Why it failed: output lost in Jupyter (thread stdout not captured).
    Lesson: always print from the main thread using as_completed().
  • GPU cleanup only at the end of the cell
    Why it failed: memory exhausted before the cell completes.
    Lesson: clean up after EACH channel, before releasing the GPU.
  • Nested ThreadPoolExecutor
    Why it failed: thread explosion (outer × inner workers).
    Lesson: inner parallelism is OK; the outer pool must be limited to n_gpus.
  • Adding a status monitoring thread
    Why it failed: the monitoring thread's output was also lost in Jupyter.
    Lesson: only main-thread output is reliable in notebooks.

Final Parameters

GPU Queue Pattern

# Initialize
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

# In worker
dev_id = gpu_queue.get()   # Blocks until a GPU is available
try:
    ...                    # GPU work for this channel
finally:
    gpu_queue.put(dev_id)  # Always return the GPU to the pool
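
An optional refinement (a sketch, not part of the KINTSUGI API): wrap the acquire/release pair in a context manager so callers cannot forget the release in finally.

import queue
from contextlib import contextmanager

@contextmanager
def acquire_gpu(gpu_queue: queue.Queue):
    dev_id = gpu_queue.get()     # blocks until a GPU is free
    try:
        yield dev_id
    finally:
        gpu_queue.put(dev_id)    # always returned, even on error

# Usage inside a channel worker (hypothetical call shape):
# with acquire_gpu(gpu_queue) as dev_id:
#     process_channel_zplanes_parallel(..., device_id=dev_id)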

Progress Output Pattern (Jupyter-safe)

with ThreadPoolExecutor(max_workers=n_gpus) as executor:
    futures = {executor.submit(work_fn, item): item for item in items}

    for future in as_completed(futures):
        result = future.result()
        print(f"Completed: {result}")  # Main thread - visible in Jupyter

GPU Cleanup Pattern

import gc

try:
    import cupy as cp
    with cp.cuda.Device(device_id):
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
except Exception:
    pass
gc.collect()
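
The import and the pool-freeing calls sit inside try/except so the cleanup degrades gracefully if CuPy is unavailable or the device context cannot be entered; gc.collect() still runs afterward to release host-side references to the channel's arrays.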

Key Insights

  • CuPy is not thread-safe for concurrent operations on the same GPU
  • Jupyter suppresses thread output - always use main thread for progress
  • Queue > static assignment for dynamic GPU allocation
  • Cleanup before release prevents memory accumulation
  • Inner parallelism is safe (z-planes on single GPU) when outer is controlled
  • The 1-channel-per-GPU rule prevents the OOM crashes caused by oversubscribing a single GPU during parallel processing

When to Apply This Pattern

  • Multi-GPU parallel processing in Jupyter notebooks
  • Any CuPy-based batch processing with parallelism
  • When kernel crashes after first item completes in parallel loop
  • When progress output disappears in parallel processing
  • When OOM occurs despite having enough total GPU memory
