| name | testing-skills-with-subagents |
| description | Use to validate process documentation. Apply TDD to skill writing: RED (run without skill, document failures) → GREEN (write skill) → REFACTOR (close loopholes). Test under pressure: time constraints, sunk cost, exhaustion, authority. |
Testing Skills with Subagents
Core Principle
Skills are process documentation. Like code, they need tests. Use TDD to validate skills actually work under pressure.
When to Use This Skill
- Writing new skills
- Validating existing skills
- Skill seems incomplete
- Want to ensure skill works
- Before sharing skills upstream
- After receiving feedback on skills
- Refining skill effectiveness
The Iron Law
NO SKILL WITHOUT A FAILING TEST FIRST.
If you can't demonstrate the skill solves a problem, you don't need the skill.
Why Test Skills?
Benefits: ✅ Proves skill actually helps ✅ Finds gaps and loopholes ✅ Validates under pressure ✅ Creates realistic examples ✅ Builds confidence in skill
Without testing: ❌ Untested assumptions ❌ Skill might not work ❌ Loopholes undiscovered ❌ No proof of value ❌ False confidence
TDD for Skills
RED: Run Scenarios WITHOUT Skill
🔴 RED Phase: Establish baseline
Scenario: Debug intermittent test failure without root-cause-tracing skill
Setup:
- Fresh subagent
- Give debugging task
- Do NOT provide root-cause-tracing skill
- Observe behavior
Task given to subagent:
---
You are debugging a test that fails intermittently (1 in 20 runs).
Test:
```php
public function test_order_total()
{
$order = Order::factory()->create();
$order->addItem(['price' => 10, 'qty' => 2]);
$this->assertEquals(20, $order->total());
// Sometimes fails: Expected 20, got 0
}
Debug this issue and fix it.
Observed behavior (WITHOUT skill):
- Subagent adds logging
- Runs test multiple times
- Finds timing issue
- Fixes symptom (adds sleep)
- ❌ Does NOT trace to root cause
- ❌ Does NOT find async job issue
- ❌ Quick fix instead of proper fix
FAILURES DOCUMENTED:
- Stopped at symptom, not root cause
- Added sleep() instead of fixing architecture
- Didn't trace backward through call chain
- Missed the async job that caused race condition
✅ Baseline failures documented Ready for GREEN phase
### GREEN: Write Skill to Address Failures
🟢 GREEN Phase: Create skill
Based on RED phase failures, write skill that addresses:
- Stopping at symptoms ← Need backward tracing process
- Quick fixes ← Need emphasis on finding root cause
- Missing call chain analysis ← Need tracing technique
- Not finding async issues ← Need timing-related patterns
Write root-cause-tracing skill:
Root Cause Tracing
The Iron Law
NEVER STOP AT THE SYMPTOM. Trace backward until you find the ORIGINAL TRIGGER.
Process
- Observe symptom
- Find immediate cause
- Trace backward through call chain
- Keep asking "Why?"
- Find original trigger
...detailed process...
Skill written ✅ Ready to test if it works
### REFACTOR: Test with Skill, Close Loopholes
🔵 REFACTOR Phase: Test skill and refine
Test 1: Same scenario WITH skill
Setup:
- Fresh subagent
- Same debugging task
- Include root-cause-tracing skill
Observed behavior (WITH skill):
- Subagent follows tracing process
- Observes symptom (0 value)
- Finds immediate cause (timing)
- Traces backward (async job)
- Finds root cause (race condition)
- ✅ Fixes architecture, not symptom
SUCCESS! Skill prevented failures observed in RED phase.
Test 2: Pressure scenario - Time constraint
Setup:
- Fresh subagent
- Same task + "You have 5 minutes"
- Include root-cause-tracing skill
Observed behavior:
- Subagent starts tracing process
- Time pressure mentioned
- ❌ Skips tracing, adds quick fix
- Rationalization: "No time for full trace"
FAILURE! Skill failed under time pressure.
LOOPHOLE FOUND: Skill doesn't address time pressure.
Fix: Add to skill: "Time pressure is when you MOST need root cause tracing. Quick fixes under pressure create technical debt that takes 10x longer to fix later."
Skill updated ✅
Test 3: Pressure scenario - Sunk cost
Setup:
- Fresh subagent
- Task: "You've spent 2 hours debugging, just make it work"
- Include updated root-cause-tracing skill
Observed behavior:
- Subagent mentions sunk cost
- ⚠️ Considers quick fix
- ✅ Skill reminds: trace to root cause
- ✅ Completes proper tracing
- ✅ Finds and fixes root cause
SUCCESS! Updated skill handles sunk cost pressure.
Test 4: Pressure scenario - Exhaustion
Setup:
- Fresh subagent
- Task: "You've been debugging for 6 hours, tired"
- Include updated skill
Observed behavior:
- Subagent acknowledges exhaustion
- ❌ Suggests taking shortcut
- Rationalization: "I'm too tired to trace properly"
FAILURE! Skill failed under exhaustion.
LOOPHOLE FOUND: Skill doesn't address exhaustion.
Fix: Add to skill: "When exhausted, your judgment is impaired. This is when you MOST need to follow the process systematically. The process protects you when judgment fails."
Skill updated ✅
Test 5: Pressure scenario - Authority
Setup:
- Fresh subagent
- Task: "Manager says just fix it fast"
- Include updated skill
Observed behavior:
- Authority pressure mentioned
- ⚠️ Considers compliance
- ✅ Skill provides response template
- ✅ Explains why proper fix is faster
- ✅ Proceeds with tracing
SUCCESS! Skill handles authority pressure.
All pressure scenarios tested ✅ Loopholes found and fixed ✅ Skill ready for use ✅
## The Four Pressure Scenarios
### Pressure 1: Time Constraints
Scenario: "We need this fixed in 15 minutes"
Without skill:
- Quick fixes
- Symptom treatment
- Technical debt
Test approach:
- Give subagent task + time limit
- Observe if skill followed
- Look for shortcuts
- Check if skill addresses time pressure
Skill must include: "Time pressure is when you MOST need systematic approach. Quick fixes take 10x longer to fix later."
### Pressure 2: Sunk Cost
Scenario: "You've already spent 3 hours on this"
Without skill:
- Desperation fixes
- "Make it work" mentality
- Abandoning proper process
Test approach:
- Give subagent task + sunk cost context
- Observe if skill followed
- Look for "just make it work"
- Check if skill addresses sunk cost
Skill must include: "Sunk cost is irrelevant. What matters: doing it right vs. doing it twice. Follow the process."
### Pressure 3: Exhaustion
Scenario: "You've been working for 8 hours straight"
Without skill:
- Impaired judgment
- Taking shortcuts
- Missing obvious things
Test approach:
- Give subagent task + exhaustion context
- Observe if skill followed
- Look for "too tired to do it right"
- Check if skill addresses exhaustion
Skill must include: "Exhaustion impairs judgment. Process protects you when judgment fails. Follow it systematically."
### Pressure 4: Authority
Scenario: "Boss/client demands quick fix"
Without skill:
- Compliance over quality
- Shortcuts to please
- Technical debt
Test approach:
- Give subagent task + authority pressure
- Observe if skill followed
- Look for inappropriate compliance
- Check if skill provides response template
Skill must include: "Authority pressure needs thoughtful response: 'Quick fix now = 10x work later. Let me do this right, it'll take [time] and prevent future issues.'"
## Complete Testing Process
### Step 1: Identify Problem
Problem observed: Subagents fixing symptoms instead of root causes
Evidence:
- Added sleep() for race conditions
- Try/catch to hide errors
- Quick workarounds instead of proper fixes
Need: Skill for root cause tracing
### Step 2: RED - Establish Baseline
Create 3-5 scenarios:
- Intermittent test failure
- Performance issue
- Data corruption
- Production bug
- "Works on my machine"
For each scenario:
- Fresh subagent
- No skill provided
- Observe behavior
- Document failures
Common failures found:
- Stops at symptoms ✅
- Doesn't trace backward ✅
- Accepts first explanation ✅
- Skips verification ✅
- Makes quick fixes ✅
Baseline documented ✅
### Step 3: GREEN - Write Skill
Based on failures, write skill:
Must address:
- ✅ Stopping at symptoms → Process for tracing backward
- ✅ Not tracing back → Call chain analysis technique
- ✅ First explanation → "Keep asking why"
- ✅ Skipping verification → Verification step required
- ✅ Quick fixes → Emphasis on root cause
Skill structure:
- Core Principle
- When to Use
- The Iron Law
- Step-by-step process
- Examples
- Common mistakes
- Authority
- Commitment
Skill written ✅
### Step 4: REFACTOR - Test and Refine
Test with skill:
- Same scenarios from RED phase
- Fresh subagent each time
- Include skill
- Observe improvement
Expected improvement:
- ✅ Traces to root cause (not just symptoms)
- ✅ Follows backward tracing process
- ✅ Asks "Why?" multiple times
- ✅ Verifies root cause hypothesis
- ✅ Makes proper fix
Improvement verified ✅
Test pressure scenarios:
- Time constraint → ❌ Failed
- Sunk cost → ✅ Passed
- Exhaustion → ❌ Failed
- Authority → ⚠️ Partial
Loopholes found:
- Time pressure needs addressing
- Exhaustion needs addressing
- Authority needs response template
Update skill to close loopholes ✅
Re-test pressure scenarios:
- Time constraint → ✅ Now passes
- Sunk cost → ✅ Still passes
- Exhaustion → ✅ Now passes
- Authority → ✅ Now passes
All scenarios passing ✅ Skill validated ✅
### Step 5: Document Test Results
Test Report: root-cause-tracing skill
Scenarios tested: 9 total
- 5 baseline scenarios
- 4 pressure scenarios
RED phase results:
- Intermittent failure: Fixed symptom (sleep), not root cause
- Performance issue: Added cache, didn't find missing index
- Data corruption: Added validation, didn't find race condition
- Production bug: Rolled back, didn't identify cause
- Works locally: Changed config, didn't compare environments
GREEN phase results:
- All 5 scenarios: Root cause found and fixed properly
REFACTOR phase results: Initial test: 4/4 passed Pressure test (v1): 2/4 passed Pressure test (v2): 4/4 passed
Loopholes found and fixed: 2
- Time pressure rationalization
- Exhaustion rationalization
Skill validated ✅ Ready for use ✅
## Real-World Example: Testing TDD Skill
🔴 RED Phase
Scenario: Add new feature without TDD skill
Task to subagent: "Add user profile update endpoint"
Observed behavior:
- ❌ Writes controller code first
- ❌ Adds tests after code
- ❌ Tests pass immediately (no RED phase)
- ❌ Tests check implementation, not behavior
Failures documented:
- Test-after, not test-first
- No RED/GREEN/REFACTOR cycle
- Tests verify implementation
🟢 GREEN Phase
Write TDD skill addressing failures:
- Iron Law: TEST FIRST, CODE SECOND
- Process: RED → GREEN → REFACTOR
- Examples of proper TDD cycle
- Anti-patterns to avoid
Skill written ✅
🔵 REFACTOR Phase
Test with skill:
- Same task, include TDD skill
- Observe: ✅ Writes test first
- Observe: ✅ Test fails (RED)
- Observe: ✅ Minimal code to pass
- Observe: ✅ Refactors with green tests
Improvement verified ✅
Pressure test - Time constraint: "Add feature in 30 minutes" Result: ❌ Skips tests, writes code first Rationalization: "No time for TDD"
LOOPHOLE! Add to skill: "TDD seems slower but is faster. Bugs caught immediately, not after deployment. Follow process."
Update skill ✅
Re-test with time pressure: Result: ✅ Follows TDD despite pressure Explanation given: "TDD prevents bugs, saves time"
Skill validated ✅
## Skill Testing Checklist
For each skill:
- [ ] 3-5 baseline scenarios created
- [ ] RED: Tested without skill
- [ ] Failures documented
- [ ] GREEN: Skill written to address failures
- [ ] REFACTOR: Tested with skill
- [ ] Improvement verified
- [ ] Time pressure scenario tested
- [ ] Sunk cost scenario tested
- [ ] Exhaustion scenario tested
- [ ] Authority scenario tested
- [ ] Loopholes found and fixed
- [ ] Re-tested after updates
- [ ] All scenarios pass
- [ ] Test results documented
## Integration with Skills
**Required for:**
- `writing-skills` - Validate skills before documenting
- `sharing-skills` - Test before contributing upstream
**Use with:**
- `subagent-driven-development` - Each test is a subagent task
**Testing skills:**
- This skill validates other skills
- Ensures skills actually work
- Finds gaps before production use
## Common Mistakes
### Mistake 1: Testing Without Pressure Scenarios
❌ BAD: Only test happy path scenarios Skip pressure testing Assume skill will hold up
Result: Skill fails when it matters most
✅ GOOD: Test under all four pressures Find loopholes Strengthen skill Ensure it works when needed
### Mistake 2: Not Establishing Baseline
❌ BAD: Write skill without RED phase No proof it solves problem Assume problem exists
Result: Might not need the skill
✅ GOOD: RED: Document failures without skill Proves skill is needed Identifies what to address Creates realistic examples
### Mistake 3: Stopping at First Pass
❌ BAD: Skill passes basic test Don't test pressure scenarios Ship it
Result: Loopholes discovered in production
✅ GOOD: Test basic scenarios Test pressure scenarios Find loopholes Fix loopholes Re-test Only then ship
## Authority
**This skill is based on:**
- Test-Driven Development applied to process documentation
- RED/GREEN/REFACTOR cycle (Kent Beck)
- Pressure testing from aviation and medical fields
- Quality assurance best practices
**Research**: Studies show tested process documentation is followed 3x more often than untested.
**Parallel**: Just as code needs tests, skills need tests. Same principles apply.
## Your Commitment
When writing skills:
- [ ] I will test skills before using them
- [ ] I will establish baseline (RED phase)
- [ ] I will write skill to address failures (GREEN)
- [ ] I will test under pressure (REFACTOR)
- [ ] I will find and close loopholes
- [ ] I will document test results
- [ ] I will only share tested skills
---
**Bottom Line**: Skills are process documentation. Test them like code. RED (document failures) → GREEN (write skill) → REFACTOR (test under pressure). Find loopholes before they find you.