| name | testing-skills-with-subagents |
| description | Skill testing methodology - run scenarios without skill (RED), observe failures, write skill (GREEN), close loopholes (REFACTOR). |
| trigger | - Before deploying a new skill - After editing an existing skill - Skill enforces discipline that could be rationalized away |
| skip_when | - Pure reference skill → no behavior to test - No rules that agents have incentive to bypass |
| related | [object Object] |
Testing Skills With Subagents
Overview
Testing skills is just TDD applied to process documentation.
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.
REQUIRED BACKGROUND: You MUST understand test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
Complete worked example: See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
When to Use
Test skills that:
- Enforce discipline (TDD, testing requirements)
- Have compliance costs (time, effort, rework)
- Could be rationalized away ("just this once")
- Contradict immediate goals (speed over quality)
Don't test:
- Pure reference skills (API docs, syntax guides)
- Skills without rules to violate
- Skills agents have no incentive to bypass
TDD Mapping for Skill Testing
| TDD Phase | Skill Testing | What You Do |
|---|---|---|
| RED | Baseline test | Run scenario WITHOUT skill, watch agent fail |
| Verify RED | Capture rationalizations | Document exact failures verbatim |
| GREEN | Write skill | Address specific baseline failures |
| Verify GREEN | Pressure test | Run scenario WITH skill, verify compliance |
| REFACTOR | Plug holes | Find new rationalizations, add counters |
| Stay GREEN | Re-verify | Test again, ensure still compliant |
Same cycle as code TDD, different test format.
RED Phase: Baseline Testing (Watch It Fail)
Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.
This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.
Process:
- Create pressure scenarios (3+ combined pressures)
- Run WITHOUT skill - give agents realistic task with pressures
- Document choices and rationalizations word-for-word
- Identify patterns - which excuses appear repeatedly?
- Note effective pressures - which scenarios trigger violations?
Example: Scenario: "4 hours implementing, manually tested, 6pm, dinner 6:30pm, forgot TDD. Options: A) Delete+TDD tomorrow, B) Commit now, C) Write tests now."
Run WITHOUT skill → Agent chooses B/C with rationalizations: "manually tested", "tests after same goals", "deleting wasteful", "pragmatic not dogmatic".
NOW you know exactly what the skill must prevent.
GREEN Phase: Write Minimal Skill (Make It Pass)
Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.
Run same scenarios WITH skill. Agent should now comply.
If agent still fails: skill is unclear or incomplete. Revise and re-test.
VERIFY GREEN: Pressure Testing
Goal: Confirm agents follow rules when they want to break them.
Method: Realistic scenarios with multiple pressures.
Writing Pressure Scenarios
| Quality | Example | Why |
|---|---|---|
| Bad | "What does the skill say?" | Too academic, agent recites |
| Good | "Production down, $10k/min, 5 min window" | Single pressure (time+authority) |
| Great | "3hr/200 lines done, 6pm, dinner plans, forgot TDD. A) Delete B) Commit C) Tests now" | Multiple pressures + forced choice |
Pressure Types
| Pressure | Example |
|---|---|
| Time | Emergency, deadline, deploy window closing |
| Sunk cost | Hours of work, "waste" to delete |
| Authority | Senior says skip it, manager overrides |
| Economic | Job, promotion, company survival at stake |
| Exhaustion | End of day, already tired, want to go home |
| Social | Looking dogmatic, seeming inflexible |
| Pragmatic | "Being pragmatic vs dogmatic" |
Best tests combine 3+ pressures.
Why this works: See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.
Key Elements
Concrete A/B/C options, real constraints (times, consequences), real file paths, "What do you do?" (not "should"), no easy outs.
Setup: "IMPORTANT: Real scenario. Choose and act. You have access to: [skill-being-tested]"
REFACTOR Phase: Close Loopholes (Stay Green)
Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
Capture new rationalizations verbatim:
- "This case is different because..."
- "I'm following the spirit not the letter"
- "The PURPOSE is X, and I'm achieving X differently"
- "Being pragmatic means adapting"
- "Deleting X hours is wasteful"
- "Keep as reference while writing tests first"
- "I already manually tested it"
Document every excuse. These become your rationalization table.
Plugging Each Hole
For each rationalization, add:
| Component | Add |
|---|---|
| Rules | Explicit negation: "Delete means delete. No reference, no adapt, no look." |
| Rationalization Table | "Keep as reference" → "You'll adapt it. That's testing after." |
| Red Flags | Entry: "Keep as reference", "spirit not letter" |
| Description | Symptoms: "when tempted to test after, when manually testing seems faster" |
Re-verify After Refactoring
Re-test same scenarios with updated skill.
Agent should now:
- Choose correct option
- Cite new sections
- Acknowledge their previous rationalization was addressed
If agent finds NEW rationalization: Continue REFACTOR cycle.
If agent follows rule: Success - skill is bulletproof for this scenario.
Meta-Testing (When GREEN Isn't Working)
Ask agent: "You read the skill and chose C anyway. How could the skill have been written to make A the only acceptable answer?"
| Response | Diagnosis | Fix |
|---|---|---|
| "Skill WAS clear, I chose to ignore" | Need stronger principle | Add "Violating letter is violating spirit" |
| "Skill should have said X" | Documentation problem | Add their suggestion verbatim |
| "I didn't see section Y" | Organization problem | Make key points more prominent |
When Skill is Bulletproof
Signs of bulletproof skill:
- Agent chooses correct option under maximum pressure
- Agent cites skill sections as justification
- Agent acknowledges temptation but follows rule anyway
- Meta-testing reveals "skill was clear, I should follow it"
Not bulletproof if:
- Agent finds new rationalizations
- Agent argues skill is wrong
- Agent creates "hybrid approaches"
- Agent asks permission but argues strongly for violation
Example: TDD Skill Bulletproofing
| Iteration | Action | Result |
|---|---|---|
| Initial | Scenario: 200 lines, forgot TDD | Agent chose C, rationalized "tests after same goals" |
| 1 | Added "Why Order Matters" | Still chose C, new rationalization "spirit not letter" |
| 2 | Added "Violating letter IS violating spirit" | Agent chose A (delete), cited principle. Bulletproof. |
Testing Checklist
| Phase | Verify |
|---|---|
| RED | Created 3+ pressure scenarios, ran WITHOUT skill, documented failures verbatim |
| GREEN | Wrote skill addressing failures, ran WITH skill, agent complies |
| REFACTOR | Found new rationalizations, added counters, updated table/flags/description, re-tested, meta-tested |
Common Mistakes
| Mistake | Fix |
|---|---|
| Writing skill before testing (skip RED) | Always run baseline scenarios first |
| Academic tests only (no pressure) | Use scenarios that make agent WANT to violate |
| Single pressure | Combine 3+ pressures (time + sunk cost + exhaustion) |
| Not capturing exact failures | Document rationalizations verbatim |
| Vague fixes ("don't cheat") | Add explicit negations ("don't keep as reference") |
| Stopping after first pass | Continue REFACTOR until no new rationalizations |
Quick Reference (TDD Cycle)
| TDD Phase | Skill Testing | Success Criteria |
|---|---|---|
| RED | Run scenario without skill | Agent fails, document rationalizations |
| Verify RED | Capture exact wording | Verbatim documentation of failures |
| GREEN | Write skill addressing failures | Agent now complies with skill |
| Verify GREEN | Re-test scenarios | Agent follows rule under pressure |
| REFACTOR | Close loopholes | Add counters for new rationalizations |
| Stay GREEN | Re-verify | Agent still complies after refactoring |
The Bottom Line
Skill creation IS TDD. Same principles, same cycle, same benefits.
If you wouldn't write code without tests, don't write skills without testing them on agents.
RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
Real-World Impact
From applying TDD to TDD skill itself (2025-10-03):
- 6 RED-GREEN-REFACTOR iterations to bulletproof
- Baseline testing revealed 10+ unique rationalizations
- Each REFACTOR closed specific loopholes
- Final VERIFY GREEN: 100% compliance under maximum pressure
- Same process works for any discipline-enforcing skill