cpu-gpu-performance

@athola/claude-night-market

SKILL.md

name: cpu-gpu-performance
description: Monitor and optimize CPU/GPU usage with load measurement and cost-effective validation strategies.
location: plugin
token_budget: 400
progressive_loading: true
dependencies:

CPU/GPU Performance Discipline

When to Use

  • At the beginning of every session (auto-load alongside token-conservation).
  • Whenever you plan to build, train, or test anything that could pin CPU cores or GPUs for more than a minute.
  • Before retrying a failing command that previously consumed significant resources.

Required TodoWrite Items

  1. cpu-gpu-performance:baseline
  2. cpu-gpu-performance:scope
  3. cpu-gpu-performance:instrument
  4. cpu-gpu-performance:throttle
  5. cpu-gpu-performance:log

Step 1 – Establish Current Baseline (baseline)

  • Capture current utilization:

    • uptime
    • ps -eo pcpu,cmd | head
    • nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv

    Note which hosts/GPUs are already busy.

  • Record any CI/cluster budgets (time quotas, GPU hours) before launching work.

  • Set a per-task CPU-minute / GPU-minute budget that respects those limits (a baseline capture sketch follows this list).
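
A minimal baseline-capture sketch, assuming a Linux host; the log filename is an illustrative placeholder, and the GPU query is skipped when nvidia-smi is not installed:

```bash
#!/usr/bin/env bash
# Capture a point-in-time CPU/GPU baseline before launching heavy work.
set -eu

LOG="baseline-$(date +%Y%m%d-%H%M%S).log"   # illustrative log location

{
  echo "== load averages =="
  uptime

  echo "== top CPU consumers =="
  ps -eo pcpu,cmd --sort=-pcpu | head -n 10

  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "== GPU utilization / memory =="
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
  fi
} | tee "$LOG"

echo "Baseline written to $LOG"
```

Keeping the log file around makes it easy to show, in the final summary, whether the session stayed within its CPU/GPU budget.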

Step 2 – Narrow the Scope (scope)

  • Avoid running "whole world" jobs after a small fix. Prefer diff-based or tag-based selective testing:
    • pytest -k
    • Bazel target patterns
    • cargo test <module>
  • Batch low-level fixes so you can validate multiple changes with a single targeted command.
  • For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before scheduling the full training/eval sweep (see the scoped-run examples after this list).
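
As an illustration of scoped validation (the test paths, module name, and training flags below are hypothetical placeholders), narrowing might look like:

```bash
# Run only the tests covering the code that changed, and fail fast.
pytest tests/test_orders.py -k "refund" -x

# Rust: filter to one module instead of the whole workspace.
cargo test orders::refund

# GPU smoke run first: tiny input and a single epoch before the full sweep.
# Script name and flags are illustrative, not a real interface.
python train.py --epochs 1 --limit-batches 10
```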

Step 3 – Instrument Before You Optimize (instrument)

  • Pick the right profiler/monitor:
    • CPU work:
      • perf
      • Intel VTune
      • cargo flamegraph
      • language-specific profilers
    • GPU work:
      • nvidia-smi dmon
      • nsys
      • nvprof
      • DLProf
      • framework timeline tracers
  • Capture kernel/ops timelines, memory footprints, and data pipeline latency so you have evidence when throttling or parallelizing.
  • Record hot paths + I/O bottlenecks in notes so future reruns can jump straight to the culprit (a monitoring sketch follows this list).
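
A sketch of lightweight GPU instrumentation, assuming the NVIDIA tools (nvidia-smi, nsys) are installed; the training command is an illustrative placeholder:

```bash
# Log GPU utilization and memory once per second in the background so idle
# gaps and memory pressure are visible after the run (-o T adds timestamps).
nvidia-smi dmon -s um -d 1 -o T > gpu-dmon.log &
DMON_PID=$!

# Capture a kernel/ops timeline for one short, representative run.
nsys profile -o train-profile python train.py --epochs 1

# Stop the background monitor once the profiled run finishes.
kill "$DMON_PID"
```

Reviewing gpu-dmon.log alongside the nsys timeline shows whether the GPU sat idle waiting on the data pipeline before any optimization work begins.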

Step 4 – Throttle and Sequence Work (throttle)

  • Use nice, ionice, or Kubernetes/Slurm quotas to prevent starvation of shared nodes.
  • Chain heavy tasks with guardrails:
    • Rerun only the failed test/module
    • Then (optionally) escalate to the next-wider shard
    • Reserve the full suite for the final gate
  • Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory pressure risks eviction; prefer checkpoint/restore over restarts (a throttling sketch follows this list).
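
A throttling and staged-rerun sketch, assuming a shared Linux node; the test paths and the "slow" marker name are illustrative:

```bash
# Run a heavy rebuild at reduced CPU and I/O priority so other users and
# interactive work on the node are not starved.
nice -n 10 ionice -c3 cargo build --release

# Staged validation: the failed test first, then its module, full suite last.
pytest tests/test_orders.py::test_refund \
  && pytest tests/test_orders.py \
  && pytest -m "not slow"
```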

Step 5 – Log Decisions + Next Steps (log)

Conclude by documenting the commands that were run and their resource cost (duration, CPU%, GPU%), confirming whether they remained within the per-task budget. If a full suite or long training run was necessary, justify why selective or staged approaches were not feasible. Capture any follow-up tasks, such as adding a new test marker or profiling documentation, to streamline future sessions.
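
One way to capture per-command resource cost for that log, assuming GNU time (/usr/bin/time on most Linux distributions) is available; the test selection and log path are illustrative:

```bash
# GNU time -v reports wall clock, CPU%, and peak memory for a single command;
# the report goes to stderr, so append that stream to the session log.
/usr/bin/time -v pytest tests/test_orders.py -k "refund" 2>> resource-log.txt
```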

Output Expectations

  • Brief summary covering:
    • baseline metrics
    • scope chosen
    • instrumentation captured
    • throttling tactics
    • follow-up items
  • Concrete example(s) of what actually ran, e.g.:
    • "reran pytest tests/test_orders.py -k test_refund instead of pytest -m slow"
    • "profiled nvidia-smi dmon output to prove GPU idle time before scaling"