| name | cpu-gpu-performance |
| description | Monitor and optimize CPU/GPU usage with load measurement and cost-effective validation strategies. |
| location | plugin |
| token_budget | 400 |
| progressive_loading | true |
| dependencies | token-conservation |
# CPU/GPU Performance Discipline
## When to Use
- At the beginning of every session (auto-load alongside `token-conservation`).
- Whenever you plan to build, train, or test anything that could pin CPU cores or GPUs for more than a minute.
- Before retrying a failing command that previously consumed significant resources.
## Required TodoWrite Items
- `cpu-gpu-performance:baseline`
- `cpu-gpu-performance:scope`
- `cpu-gpu-performance:instrument`
- `cpu-gpu-performance:throttle`
- `cpu-gpu-performance:log`
## Step 1 – Establish Current Baseline (baseline)
Capture current utilization:
- `uptime`
- `ps -eo pcpu,cmd | head`
- `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv`
- Note which hosts/GPUs are already busy.
- Record any CI/cluster budgets (time quotas, GPU hours) before launching work.
- Set a per-task CPU minute / GPU minute budget that respects those limits.
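A minimal baseline-capture sketch combining the checks above, assuming a Linux host; the GPU query is skipped automatically when `nvidia-smi` is not installed, and the log file name is only illustrative:

```bash
#!/usr/bin/env bash
# Capture a point-in-time utilization baseline before launching heavy work.
# The log name is illustrative; point it at your session notes instead.
BASELINE_LOG="baseline-$(date +%Y%m%dT%H%M%S).log"

{
  echo "== load average =="
  uptime

  echo "== top CPU consumers =="
  ps -eo pcpu,cmd --sort=-pcpu | head -n 10

  # Only query GPUs when nvidia-smi exists (CPU-only hosts skip this block).
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "== GPU utilization / memory =="
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
  fi
} | tee "$BASELINE_LOG"
```

Keeping the log makes the before/after comparison in Step 5 a diff rather than a reconstruction.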
## Step 2 – Narrow the Scope (scope)
- Avoid running "whole world" jobs after a small fix. Prefer diff-based or tag-based selective testing (see the sketch after this list):
  - `pytest -k`
  - Bazel target patterns
  - `cargo test <module>`
- Batch low-level fixes so you can validate multiple changes with a single targeted command.
- For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before scheduling the full training/eval sweep.
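A diff-based selection sketch, assuming a git checkout with an `origin/main` branch and pytest-style tests under `tests/`; both are assumptions about the project layout:

```bash
#!/usr/bin/env bash
# Run only the test files that changed relative to main instead of the full suite.
CHANGED_TESTS=$(git diff --name-only origin/main... -- tests/ | grep '\.py$' || true)

if [ -n "$CHANGED_TESTS" ]; then
  # shellcheck disable=SC2086  # intentional word splitting into file arguments
  pytest $CHANGED_TESTS
else
  echo "No changed test files; nothing to run."
fi
```

For GPU jobs, the analogous move is a smoke run (one epoch, truncated dataset) before committing to the full sweep.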
## Step 3 – Instrument Before You Optimize (instrument)
- Pick the right profiler/monitor:
  - CPU work: `perf`, Intel VTune, `cargo flamegraph`, language-specific profilers
  - GPU work: `nvidia-smi dmon`, `nsys`, `nvprof`, DLProf, framework timeline tracers
- Capture kernel/ops timelines, memory footprints, and data pipeline latency so you have evidence when throttling or parallelizing.
- Record hot paths + I/O bottlenecks in notes so future reruns can jump straight to the culprit.
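A sampling sketch for GPU jobs, assuming an NVIDIA GPU with `nvidia-smi dmon` available; `train.py` and its flags are hypothetical stand-ins for whatever workload you are profiling:

```bash
#!/usr/bin/env bash
# Record a per-second GPU utilization/memory timeline while the workload runs.
DMON_LOG="dmon-$(date +%Y%m%dT%H%M%S).log"

# -s um = utilization + memory columns, -d 1 = 1 s sampling, -o T = timestamp each row.
nvidia-smi dmon -s um -d 1 -o T > "$DMON_LOG" &
DMON_PID=$!

# Hypothetical workload; substitute the real training/eval command.
python train.py --epochs 1

kill "$DMON_PID"
echo "GPU timeline written to $DMON_LOG"

# For CPU-bound work, the analogous pattern is:
#   perf record -g -- python train.py --epochs 1 && perf report
```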
## Step 4 – Throttle and Sequence Work (throttle)
- Use `nice`, `ionice`, or Kubernetes/Slurm quotas to prevent starvation of shared nodes.
- Chain heavy tasks with guardrails (see the sketch after this list):
  - Rerun only the failed test/module
  - Then (optionally) escalate to the next-wider shard
  - Reserve the full suite for the final gate
- Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory pressure risks eviction; prefer checkpoint/restore over restarts.
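A staged-validation sketch using `nice`/`ionice` on a shared Linux node; the test paths reuse the example from Output Expectations, and the `slow` marker is an assumption:

```bash
#!/usr/bin/env bash
set -e  # stop escalating as soon as a stage fails

# Stage 1: rerun only the failed test, at reduced CPU and idle I/O priority.
nice -n 10 ionice -c 3 pytest tests/test_orders.py -k test_refund

# Stage 2: widen to the surrounding module only after stage 1 passes.
nice -n 10 pytest tests/test_orders.py

# Stage 3: reserve the full (or slow-marked) suite for the final gate.
# pytest -m slow
```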
## Step 5 – Log Decisions + Next Steps (log)
Conclude by documenting the commands that were run and their resource cost (duration, CPU%, GPU%), confirming whether they remained within the per-task budget. If a full suite or long training run was necessary, justify why selective or staged approaches were not feasible. Capture any follow-up tasks, such as adding a new test marker or profiling documentation, to streamline future sessions.
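A minimal logging sketch using GNU `time` (the BSD variant lacks `-f`) to append duration and average CPU% to a notes file; the file name and wrapped command are assumptions, and GPU% comes from the `dmon` log captured in Step 3:

```bash
#!/usr/bin/env bash
# Append the command, wall time, and average CPU% to the session notes,
# so the Step 5 budget check is a copy-paste rather than a reconstruction.
NOTES="session-notes.md"   # assumed notes location

/usr/bin/time -f "- ran: pytest -k test_refund | elapsed: %e s | cpu: %P" \
  -a -o "$NOTES" \
  pytest tests/test_orders.py -k test_refund
```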
## Output Expectations
- Brief summary covering:
  - baseline metrics
  - scope chosen
  - instrumentation captured
  - throttling tactics
  - follow-up items
- Concrete example(s) of what ran (e.g.):
- "reran
pytest tests/test_orders.py -k test_refundinstead ofpytest -m slow" - "profiled
nvidia-smi dmonoutput to prove GPU idle time before scaling"
- "reran