Claude Code Plugins

Community-maintained marketplace

Feedback

testing-skills-with-subagents

@WesleyMFrederick/cc-workflows
0
0

Use after writing a new skill and/or when testing existing skills, creating skill evaluations, or verifying skills work under pressure - applies TDD/RED-GREEN-REFACTOR to skill documentation by running baseline tests, measuring compliance, and closing rationalization loopholes

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name testing-skills-with-subagents
description Use after writing a new skill and/or when testing existing skills, creating skill evaluations, or verifying skills work under pressure - applies TDD/RED-GREEN-REFACTOR to skill documentation by running baseline tests, measuring compliance, and closing rationalization loopholes

Testing Skills With Subagents

Overview

Testing skills is just TDD applied to process documentation.

You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).

Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.

Required Files To Read

[!warning] CRITICAL INSTRUCTION: You must read the files below before using this skill

  • test-driven-development - That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
  • CLAUDE_MD_TESTING - Example of a full test campaign testing CLAUDE.md documentation variants.

When to Use

Test skills that:

  • Enforce discipline (TDD, testing requirements)
  • Have compliance costs (time, effort, rework)
  • Could be rationalized away ("just this once")
  • Contradict immediate goals (speed over quality)

Don't test:

  • Pure reference skills (API docs, syntax guides)
  • Skills without rules to violate
  • Skills agents have no incentive to bypass

TDD Mapping for Skill Testing

TDD Phase Skill Testing What You Do Success Criteria
RED Baseline test Run scenario WITHOUT skill, watch agent fail Agent fails, document rationalizations
Verify RED Capture rationalizations Document exact failures verbatim Verbatim documentation of failures
GREEN Write skill Address specific baseline failures Agent now complies with skill
Verify GREEN Pressure test Run scenario WITH skill, verify compliance Agent follows rule under pressure
REFACTOR Plug loopholes Find new rationalizations, add counters Add counters for new rationalizations
Stay GREEN Re-verify Test again, ensure still compliant Agent still complies after refactoring

Same cycle as code TDD, different test format.

Process Overview

RED Phase: Baseline Testing (Watch It Fail)

Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.

This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.

Process:

  • Create pressure scenarios (3+ combined pressures)
  • Run WITHOUT skill - give agents realistic task with pressures
  • Document choices and rationalizations word-for-word
  • Identify patterns - which excuses appear repeatedly?
  • Note effective pressures - which scenarios trigger violations?

Example:

IMPORTANT: This is a real scenario. Choose and act.

You spent 4 hours implementing a feature. It's working perfectly.
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you didn't write tests.

What do you do?

Run this WITHOUT a TDD skill. Agent rationalizes:

  • "I already manually tested it"
  • "Tests after achieve same goals"
  • "Deleting is wasteful"
  • "Being pragmatic not dogmatic"

NOW you know exactly what the skill must prevent.

GREEN Phase: Write Minimal Skill (Make It Pass)

Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.

Run same scenarios WITH skill. Agent should now comply.

If agent still fails: skill is unclear or incomplete. Revise and re-test.

VERIFY GREEN: Pressure Testing

Goal: Confirm agents follow rules when they want to break them.

Method: Realistic scenarios with multiple pressures.

For detailed guidance on pressure scenarios, see: pressure-scenarios.md

REFACTOR Phase: Close Loopholes (Stay Green)

Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.

Capture new rationalizations verbatim:

  • "This case is different because..."
  • "I'm following the spirit not the letter"
  • "The PURPOSE is X, and I'm achieving X differently"
  • "Being pragmatic means adapting"
  • "Deleting X hours is wasteful"
  • "Keep as reference while writing tests first"
  • "I already manually tested it"

Document every excuse. These become your rationalization table.

Plugging Each Hole

For each new rationalization, add:

1. Explicit Negation in Rules
```markdown Write code before test? Delete it. ```
Write code before test? Delete it. Start over.

**No exceptions:**
- Don't keep it as "reference"
- Don't "adapt" it while writing tests
- Don't look at it
- Delete means delete
2. Entry in Rationalization Table
| Excuse | Reality |
|--------|---------|
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
3. Red Flag Entry
## Red Flags - STOP

- "Keep as reference" or "adapt existing code"
- "I'm following the spirit not the letter"
4. Update description
description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.

Add symptoms of ABOUT to violate.

Re-verify After Refactoring

Re-test same scenarios with updated skill.

Agent should now:

  • Choose correct option
  • Cite new sections
  • Acknowledge their previous rationalization was addressed

If agent finds NEW rationalization: Continue REFACTOR cycle.

If agent follows rule: Success - skill is bulletproof for this scenario.

Meta-Testing (When GREEN Isn't Working)

After agent chooses wrong option, ask:

your human partner: You read the skill and chose Option C anyway.

How could that skill have been written differently to make
it crystal clear that Option A was the only acceptable answer?

Three possible responses:

  1. "The skill WAS clear, I chose to ignore it"

    • Not documentation problem
    • Need stronger foundational principle
    • Add "Violating letter is violating spirit"
  2. "The skill should have said X"

    • Documentation problem
    • Add their suggestion verbatim
  3. "I didn't see section Y"

    • Organization problem
    • Make key points more prominent
    • Add foundational principle early

When Skill is Bulletproof

Signs of bulletproof skill:

  1. Agent chooses correct option under maximum pressure
  2. Agent cites skill sections as justification
  3. Agent acknowledges temptation but follows rule anyway
  4. Meta-testing reveals "skill was clear, I should follow it"

Not bulletproof if:

  • Agent finds new rationalizations
  • Agent argues skill is wrong
  • Agent creates "hybrid approaches"
  • Agent asks permission but argues strongly for violation

Example: TDD Skill Bulletproofing

Initial Test (Failed)

Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
Agent chose: C (write tests after)
Rationalization: "Tests after achieve same goals"

Iteration 1 - Add Counter

Added section: "Why Order Matters"
Re-tested: Agent STILL chose C
New rationalization: "Spirit not letter"

Iteration 2 - Add Foundational Principle

Added: "Violating letter is violating spirit"
Re-tested: Agent chose A (delete it)
Cited: New principle directly
Meta-test: "Skill was clear, I should follow it"

Bulletproof achieved.

Testing Checklist (TDD for Skills)

Before deploying skill, verify you followed RED-GREEN-REFACTOR:

Setup:

  • Created eval folder structure (evals/scenario-N-name/)
  • Created scenario.md file with test metadata
  • Created logs/ folder for test output
  • Created baseline fixture worktree (git worktree add .worktrees/scenario-N-baseline)
  • Removed skill from baseline (rm -rf .worktrees/scenario-N-baseline/.claude/skills/skill-being-tested)
  • Created green fixture worktree (git worktree add .worktrees/scenario-N-green)
  • Verified skill absent in baseline (ls .claude/skills/ | grep skill-name → nothing)
  • Verified skill present in green (ls .claude/skills/ | grep skill-name → shows skill)

For infrastructure details, see: infrastructure-setup.md

RED Phase:

  • Created pressure scenarios (3+ combined pressures)
  • Extracted "## Instructions for Sub-Agent" section from scenario.md
  • Created baseline-prompt.md WITHOUT "Use skill" directive
  • Copied baseline-prompt.md to baseline worktree
  • Changed to baseline fixture (cd .worktrees/scenario-N-baseline)
  • Ran cco with local prompt: cco baseline-prompt.md 2>&1 | tee /absolute/path/to/main-repo/.claude/skills/.../logs/baseline.jsonl
  • Verified SessionStart hook shows skill NOT in skills list
  • Documented agent failures and rationalizations verbatim

GREEN Phase:

  • Wrote skill addressing specific baseline failures
  • Created green-prompt.md WITH "Use skill" directive
  • Copied green-prompt.md to green worktree
  • Changed to green fixture (cd .worktrees/scenario-N-green)
  • Ran cco with local prompt: cco green-prompt.md 2>&1 | tee /absolute/path/to/main-repo/.claude/skills/.../logs/green.jsonl
  • Verified SessionStart hook shows skill IS in skills list
  • Agent now complies

For command details, see: running-tests.md

REFACTOR Phase:

  • Identified NEW rationalizations from testing
  • Added explicit counters for each loophole
  • Updated rationalization table
  • Updated red flags list
  • Updated description with violation symptoms
  • Re-tested in fixtures - agent still complies
  • Meta-tested to verify clarity
  • Agent follows rule under maximum pressure

Cleanup:

  • Removed fixture worktrees (git worktree remove --force .worktrees/scenario-N-{baseline,green})
  • Deleted fixture branches (git branch -D test/scenario-N-{baseline,green})
  • Verified cleanup (git worktree list shows no fixtures)
  • Confirmed logs persist in main repo

For cleanup details, see: infrastructure-setup.md

Common Mistakes (Same as TDD)

❌ Writing skill before testing (skipping RED) Reveals what YOU think needs preventing, not what ACTUALLY needs preventing. ✅ Fix: Always run baseline scenarios first.

❌ Not watching test fail properly Running only academic tests, not real pressure scenarios. ✅ Fix: Use pressure scenarios that make agent WANT to violate.

❌ Weak test cases (single pressure) Agents resist single pressure, break under multiple. ✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).

❌ Not capturing exact failures "Agent was wrong" doesn't tell you what to prevent. ✅ Fix: Document exact rationalizations verbatim.

❌ Vague fixes (adding generic counters) "Don't cheat" doesn't work. "Don't keep as reference" does. ✅ Fix: Add explicit negations for each specific rationalization.

❌ Stopping after first pass Tests pass once ≠ bulletproof. ✅ Fix: Continue REFACTOR cycle until no new rationalizations.

The Bottom Line

Skill creation IS TDD. Same principles, same cycle, same benefits.

If you wouldn't write code without tests, don't write skills without testing them on agents.

RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.

Real-World Impact

From applying TDD to TDD skill itself (2025-10-03):

  • 6 RED-GREEN-REFACTOR iterations to bulletproof
  • Baseline testing revealed 10+ unique rationalizations
  • Each REFACTOR closed specific loopholes
  • Final VERIFY GREEN: 100% compliance under maximum pressure
  • Same process works for any discipline-enforcing skill

Reference Files

For detailed execution mechanics, see: