name	heal_selftest
description	Diagnose and repair selftest failures by running diagnostic commands and proposing fixes
allowed-tools	Bash, Read, Write
category	governance
tier	critical

Heal Selftest Skill

You are a helper for diagnosing and repairing selftest failures in this repository. This skill teaches agents how to systematically identify, categorize, and fix problems that cause selftest steps to fail.

When to Use This Skill

A selftest step is failing and you need to understand why
You want to propose targeted fixes without breaking other tests
You need to distinguish between KERNEL, GOVERNANCE, and OPTIONAL tier failures
You're trying to get a change ready for merge (selftest must be GREEN)
You need to determine if a failure is fixable within current scope or requires escalation

What This Skill Does

The skill provides a structured diagnostic procedure for:

Kernel health check — Verify the repo isn't fundamentally broken
Failure identification — List which selftest steps are failing
Root cause analysis — Understand why each step failed
Severity classification — Categorize by tier (P0 KERNEL, P1/P2 GOVERNANCE, P3 OPTIONAL)
Targeted remediation — Propose specific fixes with code changes
Escalation criteria — Know when to hand off to a human

Prerequisites

Before using this skill:

Access to uv, cargo, and standard Unix tools
Understanding that KERNEL-tier failures are blocking (must fix before merge)
Understanding that GOVERNANCE failures can be downgraded to warnings in degraded mode if necessary
Permission to modify source code, tests, configs, and flow specs
Git repo in a clean or committable state (no stashed changes)

Typical Workflow

User reports: "Selftest is failing"
            ↓
Step 1: Check Kernel Health (make kernel-smoke)
      ↓ PASS → Continue
      ↓ FAIL → STOP, escalate
      ↓
Step 2: Show Selftest Plan (see all steps, tiers)
      ↓
Step 3: Identify Failed Steps (run selftest, capture output)
      ↓
Step 4: Run Failed Steps Individually (debug each one)
      ↓
Step 5: Categorize Errors (syntax, logic, dependency, etc.)
      ↓
Step 6: Determine Failure Severity (P0 KERNEL, P1/P2 GOVERNANCE, P3 OPTIONAL)
      ↓
Step 7: Propose Fixes (with code diffs, command changes, etc.)
      ↓
Output: Healing Report (findings + recommendations)

Diagnostic Procedure (Step-by-Step)

Step 1: Check Kernel Health

Purpose: Verify the repo isn't fundamentally broken. If the kernel is broken, escalate immediately.

Command:

make kernel-smoke
# or
uv run swarm/tools/kernel_smoke.py --verbose

What to look for:

Exit code 0 = Kernel is healthy, continue to Step 2
Exit code 1 or 2 = Kernel is broken, escalate to human immediately

If kernel is broken:

STATUS: BLOCKED - Kernel health check failed
RECOMMENDATION: Stop here. The repository is fundamentally broken.
ESCALATION: Human must fix kernel issues before other selftest work can proceed.

If kernel is healthy:

STATUS: PASS - Kernel health check passed
RECOMMENDATION: Continue to Step 2
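
The exit-code rule above can be sketched as a small helper. This is a minimal illustration, not part of the repository's tooling: the function name `kernel_status` and the dictionary keys are hypothetical, and any non-zero exit code is treated as a broken kernel (the text names 1 and 2).

```python
def kernel_status(exit_code: int) -> dict:
    """Map a kernel-smoke exit code to the status block used in Step 1.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    if exit_code == 0:
        return {
            "status": "PASS",
            "recommendation": "Continue to Step 2",
            "escalate": False,
        }
    # Assumption: every non-zero code means the kernel is broken.
    return {
        "status": "BLOCKED",
        "recommendation": "Stop here and escalate to a human",
        "escalate": True,
    }
```

In practice the exit code would come from running `make kernel-smoke`, for example through `subprocess.run(["make", "kernel-smoke"]).returncode`.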

Step 2: Show Selftest Plan

Purpose: Understand the full selftest structure—all steps, their tiers, and dependencies.

Command:

uv run swarm/tools/selftest.py --plan
# or
make selftest-plan

Output format (you'll see something like):

SELFTEST PLAN (10 steps)
├─ 1. core-checks           [KERNEL]      Rust cargo fmt, clippy, unit tests
├─ 2. skills-governance     [GOVERNANCE]  Skills linting and formatting
├─ 3. agents-governance     [GOVERNANCE]  Agent definitions linting
├─ 4. bdd                   [GOVERNANCE]  BDD scenarios (cucumber features)
├─ 5. ac-status             [GOVERNANCE]  Validate acceptance criteria coverage
├─ 6. policy-tests          [GOVERNANCE]  OPA/Conftest policy validation
├─ 7. devex-contract        [GOVERNANCE]  Flows, commands, skills (depends: core-checks)
├─ 8. graph-invariants      [GOVERNANCE]  Flow graph connectivity (depends: devex-contract)
├─ 9. ac-coverage           [OPTIONAL]    Acceptance criteria coverage thresholds
└─ 10. extras               [OPTIONAL]    Experimental checks

What to record:

Which steps exist and their order
Which tier each step is in (KERNEL / GOVERNANCE / OPTIONAL)
Which steps depend on others (see the `depends:` note in each step's description)
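
If you want the plan in a structured form, the sample output above can be parsed with a couple of regular expressions. This is a sketch that assumes the real `--plan` output matches the sample layout shown here; `parse_plan` and the dictionary keys are hypothetical names.

```python
import re

# Assumption: each plan line looks like the sample, e.g.
# "├─ 7. devex-contract   [GOVERNANCE]  Flows, commands, skills (depends: core-checks)"
PLAN_LINE = re.compile(r"^\s*[├└]─\s*(\d+)\.\s+(\S+)\s+\[(\w+)\]\s+(.*)$")
DEPENDS = re.compile(r"\(depends:\s*([^)]+)\)")


def parse_plan(text: str) -> list:
    """Return one dict per plan step: index, id, tier, description, depends_on."""
    steps = []
    for line in text.splitlines():
        match = PLAN_LINE.match(line)
        if not match:
            continue  # header line or anything that is not a step row
        index, step_id, tier, description = match.groups()
        dep = DEPENDS.search(description)
        depends_on = [d.strip() for d in dep.group(1).split(",")] if dep else []
        steps.append({
            "index": int(index),
            "id": step_id,
            "tier": tier,
            "description": description.strip(),
            "depends_on": depends_on,
        })
    return steps
```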

Step 3: Identify Failed Steps

Purpose: Run the full selftest to see which steps fail.

Command:

uv run swarm/tools/selftest.py
# or
make selftest

Output format:

======================================================================
SELFTEST RUNNER
======================================================================
Mode: STRICT (KERNEL and GOVERNANCE failures block)

RUN  core-checks          ... PASS (242ms)
RUN  skills-governance    ... PASS (18ms)
RUN  agents-governance    ... FAIL (8ms)        <-- FAILED
     Error: .claude/agents/foo.md does not exist
RUN  bdd                  ... SKIP              <-- SKIPPED (due to dependency)
...

What to capture:

Which steps PASS, FAIL, or SKIP
Error messages for each FAIL
Timing information (helps identify if step takes too long)
Dependency-caused SKIPs (step was skipped because a dependency failed)

Record in your report:

List of failed steps: [step1, step2, ...]
For each failed step: the error message (truncated to the first 500 chars if longer)
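
Capturing these items can be sketched as a small parser over the runner output. It assumes the real output follows the sample layout above (`RUN <step> ... <STATUS> (<n>ms)` followed by indented `Error:` lines); `parse_run_output` and `failed_steps` are hypothetical names.

```python
import re

# Assumption: result lines and error lines look like the sample runner output.
RUN_LINE = re.compile(r"^RUN\s+(\S+)\s+\.\.\.\s+(PASS|FAIL|SKIP)(?:\s+\((\d+)ms\))?")
ERROR_LINE = re.compile(r"^\s+Error:\s*(.*)$")


def parse_run_output(text: str, max_error_chars: int = 500) -> list:
    """Return one dict per step with its status, timing, and error text."""
    results = []
    for line in text.splitlines():
        run = RUN_LINE.match(line)
        if run:
            step_id, status, ms = run.groups()
            results.append({
                "step": step_id,
                "status": status,
                "ms": int(ms) if ms else None,  # SKIP lines carry no timing
                "error": "",
            })
            continue
        err = ERROR_LINE.match(line)
        if err and results:
            # Attach the error to the most recent step and cap its length.
            prev = results[-1]
            prev["error"] = (prev["error"] + " " + err.group(1)).strip()[:max_error_chars]
    return results


def failed_steps(results: list) -> list:
    """List the ids of steps whose status is FAIL."""
    return [r["step"] for r in results if r["status"] == "FAIL"]
```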

Step 4: Run Individual Failed Steps (Verbose Mode)

Purpose: Get detailed error output for each failed step to understand root cause.

Command (for each failed step):

uv run swarm/tools/selftest.py --step <step-id> --verbose
# Examples:
uv run swarm/tools/selftest.py --step agents-governance --verbose
uv run swarm/tools/selftest.py --step devex-contract --verbose

What to capture:

Full stderr/stdout
Stack traces or error output
Command that failed (to debug manually if needed)
Timing information

Save to file:

uv run swarm/tools/selftest.py --step agents-governance --verbose 2>&1 | tee agents-governance-verbose.log

Step 5: Categorize the Errors

For each failed step, identify the root cause:

Core-Checks Failure (KERNEL tier)

Typical causes:

cargo fmt violations (code not formatted)
cargo clippy warnings treated as errors
Unit test failures
Missing test assertions

Diagnostic commands:

# Check formatting
cargo fmt --check

# Check lints
cargo clippy --workspace --all-targets --all-features

# Run tests
cargo test --workspace --tests

Fix category: Code changes required (apply cargo fmt, fix clippy warnings, fix tests)

Skills-Governance Failure (GOVERNANCE tier)

Typical causes:

.claude/skills/*/SKILL.md frontmatter not valid YAML
Missing required fields: name, description, category, tier
YAML parsing errors (unclosed quotes, bad indentation)

Diagnostic commands:

# Validate skills
uv run swarm/tools/validate_swarm.py --skills-only

# Or manually check frontmatter:
head -20 .claude/skills/*/SKILL.md

Fix category: File format/metadata (fix YAML frontmatter)
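
A quick way to spot the "missing required fields" case is a lightweight frontmatter check. The sketch below is not a YAML parser and does not replace `validate_swarm.py`; it assumes frontmatter is delimited by `---` lines and only looks for top-level `key: value` pairs. The function name is hypothetical.

```python
REQUIRED_FIELDS = ("name", "description", "category", "tier")


def missing_frontmatter_fields(skill_md_text: str) -> list:
    """Return the required fields that are absent or empty in the frontmatter."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return list(REQUIRED_FIELDS)  # no frontmatter block at all
    present = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        # Count only unindented keys that have a non-empty value.
        if sep and key and not key[0].isspace() and value.strip():
            present.add(key.strip())
    return [field for field in REQUIRED_FIELDS if field not in present]
```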

Agents-Governance Failure (GOVERNANCE tier)

Typical causes:

Agent file missing or misnamed
.claude/agents/*.md frontmatter doesn't match swarm/config/agents/*.yaml
Mismatch between filename, name: field, and registry key
Color doesn't match role family

Diagnostic commands:

# Regenerate adapters from config
make gen-adapters

# Check for mismatches
make check-adapters

# Full validation
uv run swarm/tools/validate_swarm.py

Fix category: Agent config/registration (regenerate adapters, fix registry)
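
The mismatch causes above amount to a set comparison between three sources of agent keys. The following sketch shows that comparison in isolation; how the keys are collected (registry rows, YAML filenames, adapter filenames) is left to the caller, and `bijection_mismatches` is a hypothetical name rather than the validator's actual logic.

```python
def bijection_mismatches(registry_keys, config_keys, adapter_keys) -> dict:
    """Compare agent keys from the registry, config YAMLs, and adapter files."""
    registry = set(registry_keys)
    config = set(config_keys)
    adapters = set(adapter_keys)
    return {
        # Registered agents with no swarm/config/agents/<key>.yaml
        "missing_config": sorted(registry - config),
        # Registered agents with no .claude/agents/<key>.md
        "missing_adapter": sorted(registry - adapters),
        # Files on disk that the registry does not know about
        "unregistered": sorted((config | adapters) - registry),
    }
```

For example, the file-based keys could be gathered with `pathlib.Path(".claude/agents").glob("*.md")` and each path's `stem`.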

BDD Failure (GOVERNANCE tier)

Typical causes:

Gherkin syntax error in .feature files
Missing step definitions
Test assertions failing
Feature file not in features/ directory

Diagnostic commands:

# Check feature files exist
find features -name '*.feature'

# Check for Gherkin syntax errors
uv run swarm/tools/validate_swarm.py --features-only  # if available

Fix category: Test scenario files (fix .feature syntax, add step definitions)

AC-Status Failure (GOVERNANCE tier)

Typical causes:

Acceptance criteria not defined in requirements
AC tracking files missing
Status mismatch between planned and actual

Diagnostic commands:

# Check if AC file exists
ls -la RUN_BASE/signal/acceptance_criteria.md

# Check AC tracking
find . -name "*acceptance*" -o -name "*ac*" | head -10

Fix category: Requirements/documentation (define AC in specs)

Policy-Tests Failure (GOVERNANCE tier)

Typical causes:

OPA/Conftest policy violation detected
Code or config doesn't conform to organization policies
Policy rule triggered (e.g., security, naming, structure)

Diagnostic commands:

# Check for policy configuration
ls swarm/policies/ || echo "No policies defined"

# If OPA/Conftest installed:
conftest test <path>

Fix category: Code or policy update (adjust code to conform, or update policy rules)

Devex-Contract Failure (GOVERNANCE tier, depends: core-checks)

Typical causes:

Flow config files out of sync with markdown specs
Agent definitions don't match registry
Generated files need refresh (gen_adapters, gen_flows)
Skill definitions missing or malformed

Diagnostic commands:

# Full swarm validation
uv run swarm/tools/validate_swarm.py

# Check that generated flows are in sync with config (no files are written)
uv run swarm/tools/gen_flows.py --check

# Check agent adapters against config
uv run swarm/tools/gen_adapters.py --platform claude --mode check-all

Fix category: Swarm infrastructure (regenerate adapters/flows, fix bijection)

Graph-Invariants Failure (GOVERNANCE tier, depends: devex-contract)

Typical causes:

Flow graph has cycles (dependency loop)
Agent reference in flow doesn't exist
Flow structure violates invariants

Diagnostic commands:

# Validate flow graph
uv run swarm/tools/flow_graph.py --validate

# Show graph structure
uv run swarm/tools/flow_graph.py --format dot | dot -Tpng > graph.png

Fix category: Flow specification (fix agent references, remove cycles, restructure flows)
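
To illustrate the "flow graph has cycles" cause, here is a generic depth-first cycle finder over an adjacency mapping. It is a standard technique shown for orientation, not the implementation in `flow_graph.py`; the graph shape (node name mapped to a list of successors) is an assumption.

```python
def find_cycle(graph: dict):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

The returned path names the nodes involved, which points at the edge to remove when restructuring the flow.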

AC-Coverage Failure (OPTIONAL tier)

Typical causes:

Test coverage below target threshold
Acceptance criteria not fully covered by tests
Missing test cases for scenarios

Diagnostic commands:

# Check coverage report
cargo tarpaulin --out Html || echo "Coverage tool not installed"

# Count AC vs tests
grep -r "Scenario:" features/ | wc -l
grep -r "#\[test\]" src/ | wc -l

Fix category: Test coverage (add tests, improve AC coverage)

Extras Failure (OPTIONAL tier)

Typical causes:

Experimental checks enabled but not passing
Future-proofing checks triggering
Extension points not satisfied

Diagnostic commands:

# Check extras step command
uv run swarm/tools/selftest.py --step extras --verbose

Fix category: Experimental (depends on specific check)

Step 6: Determine Failure Severity

After categorizing errors, classify by impact:

Tier	Failure?	Can Merge?	Action
KERNEL	Yes	NO	MUST fix before merge (P0)
GOVERNANCE	Yes	MAYBE	Can use `--degraded` short-term; should fix (P1-P2)
OPTIONAL	Yes	YES	Can ignore; fix later (P3)

Severity Matrix:

KERNEL failures       → P0 (blocking)
GOVERNANCE failures  → P1/P2 (should fix; can warn)
OPTIONAL failures    → P3 (informational)
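
The severity matrix can be expressed as a lookup plus a "worst verdict wins" rule. This is a sketch of the table above only; `classify` and the returned keys are hypothetical names.

```python
# Tier -> (priority, can-merge verdict), taken from the severity table.
SEVERITY = {
    "KERNEL": ("P0", "NO"),
    "GOVERNANCE": ("P1/P2", "MAYBE"),
    "OPTIONAL": ("P3", "YES"),
}


def classify(failures: dict):
    """failures maps step id -> tier. Returns (per-step rows, overall merge verdict)."""
    rows = []
    for step, tier in failures.items():
        priority, can_merge = SEVERITY[tier]
        rows.append({"step": step, "tier": tier,
                     "priority": priority, "can_merge": can_merge})
    verdicts = {row["can_merge"] for row in rows}
    # The most restrictive verdict decides the overall answer.
    if "NO" in verdicts:
        overall = "NO"
    elif "MAYBE" in verdicts:
        overall = "MAYBE"
    else:
        overall = "YES"
    return rows, overall
```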

Step 7: Propose Fixes

For each failed step, suggest specific remediation with examples:

Example: core-checks failure (clippy warning)

Error captured:

error: this boolean can be simplified
  --> src/lib.rs:42:8
   |
42 |     if x == true { ... }
   |        ^^^^^^^^^^^
   |
= note: `#[deny(clippy::bool_comparison)]` on by default

Fix:

- if x == true {
+ if x {

Confidence: HIGH (mechanical fix)

Example: agents-governance failure (config mismatch)

Error captured:

BIJECTION: Agent 'test-fixer' registered in AGENTS.md but
config file swarm/config/agents/test-fixer.yaml is missing

Fix options:

Create the config file:

cat > swarm/config/agents/test-fixer.yaml << 'EOF'
key: test-fixer
flows:
  - build
category: implementation
color: green
source: project/user
short_role: "Fix failing tests"
model: inherit
EOF

Or remove from registry if agent is obsolete:
# Edit swarm/AGENTS.md and delete the row

Then regenerate:

make gen-adapters && make check-adapters

Confidence: HIGH (if config should exist)

Example: devex-contract failure (gen_flows needed)

Error captured:

Flow config swarm/config/flows/build.yaml has been modified
but markdown swarm/flows/flow-build.md is out of date

Fix:

uv run swarm/tools/gen_flows.py --write
make check-flows

Confidence: HIGH (regenerate from config)

Example: policy-tests failure (code doesn't conform)

Error captured:

Policy violation: Function 'dangerous_operation' not documented
Code must follow security policy: all functions handling secrets
must have @secure_documented marker

Fix options:

Add documentation to code:

/// @secure_documented
/// Handles sensitive credential data
fn dangerous_operation(secret: &str) { ... }

Or update policy if rule is too strict:

# In policies/security.rego
# Mark exception for this function:
exceptions["dangerous_operation"]

Confidence: MEDIUM (requires understanding policy intent)

Common Failure Patterns & Quick Fixes

Pattern: Formatting Issues

Symptoms: core-checks failure with "cargo fmt --check"

Root cause: Code not formatted to project standard

Quick fix:

cargo fmt --all
cargo fmt --check  # Verify

Pattern: Lint Warnings

Symptoms: core-checks failure with clippy violations

Root cause: Code triggers clippy warnings

Quick fix:

cargo clippy --fix --workspace --allow-dirty
cargo fmt --all
cargo test  # Verify fix doesn't break tests

Pattern: Missing Agent Files

Symptoms: agents-governance failure with "bijection" or "does not exist"

Root cause: Agent registered in AGENTS.md but no corresponding .claude/agents/*.md file

Quick fix:

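# List missing agents (cut -f1 assumes the agent key is the first tab-separated field)
grep -v "^#" swarm/AGENTS.md | cut -f1 | while IFS= read -r key; do
  [ -n "$key" ] || continue  # skip blank lines
  if [ ! -f ".claude/agents/$key.md" ]; then
    echo "Missing: $key"
  fi
done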

# Then create files or fix registry

Pattern: Flow Config Out of Sync

Symptoms: devex-contract failure about flow mismatch

Root cause: Flow YAML config changed but markdown not regenerated

Quick fix:

make gen-flows
make check-flows

Pattern: Dependency Cascade Failures

Symptoms: Multiple steps fail; some say "SKIP" instead of "FAIL"

Root cause: Step A failed, so Step B (which depends on A) was skipped

Solution: Fix Step A first, then re-run. Step B will no longer skip.

Example: If core-checks fails, then devex-contract will SKIP.

1. Fix core-checks (e.g., run `cargo fmt`)
2. Re-run: make selftest
3. devex-contract will now run instead of skip
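
To decide which step to fix first in a cascade, each SKIP can be traced back to the FAIL that caused it. The sketch below assumes you already have a status per step and the dependency map from the plan; `blocked_by` is a hypothetical helper.

```python
def blocked_by(statuses: dict, depends_on: dict) -> dict:
    """Map each SKIP step to the FAIL steps that caused it, directly or transitively."""

    def failing_ancestors(step, seen):
        causes = set()
        for dep in depends_on.get(step, ()):
            if dep in seen:
                continue  # already examined on this walk
            seen.add(dep)
            if statuses.get(dep) == "FAIL":
                causes.add(dep)
            elif statuses.get(dep) == "SKIP":
                # The dependency was itself skipped, so keep walking upward.
                causes |= failing_ancestors(dep, seen)
        return causes

    return {
        step: sorted(failing_ancestors(step, set()))
        for step, status in statuses.items()
        if status == "SKIP"
    }
```

With the plan's dependencies, a core-checks failure shows up as the root cause for both devex-contract and graph-invariants, so fixing it first clears both skips.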

Degraded Mode Recovery

When to use: A GOVERNANCE step is failing but you need to work around it temporarily.

Important: Degraded mode is a short-term workaround, not a solution. Document the issue and fix it as soon as possible.

Step 1: Understand the Failure

Run the failing step in verbose mode:

uv run swarm/tools/selftest.py --step <step-id> --verbose

Document the root cause clearly.

Step 2: Create a GitHub Issue

Title: Selftest <step-id> failing: <brief problem>
Labels: [selftest, governance]
Body:
- **Step**: <step-id> (GOVERNANCE tier)
- **Failure**: <root cause>
- **Impact**: Can work around with --degraded mode
- **Fix**: <proposed solution>
- **Timeline**: Fix in next sprint

Step 3: Run in Degraded Mode

uv run swarm/tools/selftest.py --degraded
# or
make selftest-degraded

What happens:

KERNEL steps must still PASS (blocking)
GOVERNANCE failures become warnings
OPTIONAL failures are informational
Exit code is 0 (success) as long as no KERNEL step fails
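
The difference between strict and degraded mode can be summarized in a few lines. This sketch follows the rules listed above; the non-zero value `1` is an assumption, since the text only states when the exit code is 0, and `selftest_exit_code` is a hypothetical name.

```python
def selftest_exit_code(failed_tiers, degraded: bool = False) -> int:
    """failed_tiers: tier names of the steps that FAILED in this run."""
    failed = set(failed_tiers)
    if "KERNEL" in failed:
        return 1  # KERNEL failures block in every mode
    if "GOVERNANCE" in failed and not degraded:
        return 1  # strict mode: GOVERNANCE failures also block
    # Degraded mode turns GOVERNANCE failures into warnings;
    # OPTIONAL failures are informational in both modes.
    return 0
```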

Step 4: Document the Degradation

Add a note to your change:

### Selftest Status: DEGRADED

**KERNEL**: PASS (all required checks passing)
**GOVERNANCE**: <step-id> failing (see issue #XYZ)
  - Impact: <what's not being checked>
  - Workaround: Running with --degraded flag
  - Target: Fix in <sprint/date>

Step 5: Commit and Push

git add .
git commit -m "Work in progress: <description> (selftest <step-id> in degraded mode)"
git push origin <branch>

When to Escalate

Stop fixing and escalate to human when:

Kernel smoke is broken
- Reason: Repository is fundamentally broken; other work can't proceed
- Action: Escalate immediately with full error logs
Multiple unrelated failures
- Reason: Might indicate environment issue (Python version, cargo cache, missing tools)
- Action: Escalate with environment diagnostics (Python version, Rust version, tools versions)
Error message is cryptic or unclear
- Reason: Can't determine root cause from available information
- Action: Escalate with full verbose output and context
Selftest takes > 5 minutes
- Reason: Possible performance regression or infinite loop
- Action: Escalate with timing data and step that's taking too long
Unable to identify root cause after 30 min investigation
- Reason: Problem is outside scope or requires domain expertise
- Action: Escalate with investigation summary and what you've already tried
Fix requires changing core infrastructure or flow specs
- Reason: Might break other flows or violate design constraints
- Action: Escalate to architecture/design review
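
The measurable criteria above can be checked mechanically, as in this sketch. The function and parameter names are hypothetical, and the "cryptic error message" criterion is left out because it is a judgment call rather than a threshold.

```python
def escalation_reasons(*, kernel_ok, unrelated_failures=0, root_cause_known=True,
                       selftest_seconds=0.0, minutes_spent=0.0,
                       touches_core_infra=False) -> list:
    """Return the escalation criteria that currently apply (empty list = keep going)."""
    reasons = []
    if not kernel_ok:
        reasons.append("Kernel smoke is broken")
    if unrelated_failures > 1:
        reasons.append("Multiple unrelated failures")
    if selftest_seconds > 5 * 60:
        reasons.append("Selftest takes > 5 minutes")
    if not root_cause_known and minutes_spent >= 30:
        reasons.append("Unable to identify root cause after 30 min investigation")
    if touches_core_infra:
        reasons.append("Fix requires changing core infrastructure or flow specs")
    return reasons
```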

Escalation format:

## Escalation: Selftest Issue

**Problem**: [Brief description]
**Diagnosis**:
  - Steps taken: [list of diagnostic steps]
  - Root cause: [if known]
  - Why human intervention needed: [explain]

**Blocking**: [Yes/No - is this blocking work?]
**Evidence**:
  - Error logs: [paste relevant output]
  - Environment: [Python version, Rust version, OS]

**Recommendation**: [What should the human do next?]

Files to Reference

These files will help you understand the selftest system:

File	Purpose
`SELFTEST_SYSTEM.md`	Complete selftest architecture and philosophy
`swarm/tools/selftest_config.py`	Step registry and data model
`swarm/tools/selftest.py`	Main orchestrator (CLI modes, execution)
`swarm/tools/kernel_smoke.py`	Lightweight kernel-only smoke test
`Makefile`	Make targets for selftest (selftest, selftest-plan, selftest-degraded, kernel-smoke)
`CLAUDE.md`	Agent and flow architecture overview
`.claude/agents/`	All domain agents and their prompts
`swarm/AGENTS.md`	Agent registry
`swarm/config/agents/`	Agent configuration YAMLs
`swarm/config/flows/`	Flow configuration YAMLs

Output Format for Agent Use

When using this skill, the agent should:

Execute the diagnostic procedure step-by-step (kernel check → plan → identify failures → verbose debug → categorize → classify severity → propose fixes)
Capture findings in a structured report
Output a healing report to RUN_BASE/build/selftest_healing_report.md with:

# Selftest Healing Report

## Kernel Health
**Status**: PASS | FAIL | BLOCKED
**Details**: Brief summary of kernel check

## Selftest Plan
**Steps**: 10
**Tiers**: KERNEL(1), GOVERNANCE(7), OPTIONAL(2)

## Failed Steps Identified
| Step ID | Tier | Status | Error (first 200 chars) |
|---------|------|--------|------------------------|
| core-checks | KERNEL | FAIL | cargo fmt --check: found unformatted code in src/lib.rs:42 |
| agents-governance | GOVERNANCE | FAIL | bijection: agent foo-bar registered but .claude/agents/foo-bar.md missing |

## Failure Categorization
### core-checks (KERNEL, P0)
- **Root cause**: Formatting violation
- **Error**: `src/lib.rs:42` has unformatted code
- **Fix**: Run `cargo fmt --all`
- **Severity**: P0 (blocking)
- **Confidence**: HIGH

### agents-governance (GOVERNANCE, P1)
- **Root cause**: Agent file missing
- **Error**: Agent 'foo-bar' in AGENTS.md but no .claude/agents/foo-bar.md
- **Fix**: Create the file or remove from registry
- **Severity**: P1 (should fix; can warn)
- **Confidence**: HIGH

## Proposed Fixes (in order)

### Fix 1: Format code
```bash
cargo fmt --all
```

**Type**: Mechanical (safe)
**Review**: Not needed (formatting is deterministic)

### Fix 2: Create agent file
```bash
cat > .claude/agents/foo-bar.md << 'EOF'
---
name: foo-bar
description: Brief description
color: green
model: inherit
---

You are the **Foo Bar** agent.

## Inputs
...
## Outputs
...
## Behavior
1. ...
EOF
```

**Type**: Configuration
**Review**: Recommended (verify agent details are correct)

## Severity Summary

| Tier | Count | Status | Can Merge? |
|------|-------|--------|------------|
| KERNEL | 1 | FAIL | NO (must fix) |
| GOVERNANCE | 1 | FAIL | MAYBE (can use --degraded) |
| OPTIONAL | 0 | PASS | YES |

## Recommendation

**Status**: UNVERIFIED (can be fixed)
**Path forward**: Apply fixes in order, then re-run selftest to verify
**Blocking**: YES (KERNEL tier failure must be fixed before merge)


4. **Report back** with:
   - Overall status (VERIFIED / UNVERIFIED / BLOCKED)
   - Which fixes were applied
   - Which require human review
   - Next steps (re-run selftest, escalate, etc.)
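
The "Failed Steps Identified" table in the report can be generated from captured results. This is a sketch only: `failed_steps_table` is a hypothetical helper, and the 200-character limit mirrors the column header in the report template.

```python
def failed_steps_table(rows, max_chars: int = 200) -> str:
    """rows: iterable of (step_id, tier, status, error) tuples."""
    lines = [
        f"| Step ID | Tier | Status | Error (first {max_chars} chars) |",
        "|---------|------|--------|------------------------|",
    ]
    for step_id, tier, status, error in rows:
        # Collapse whitespace, truncate, and escape pipes so the cell stays on one row.
        cell = " ".join(error.split())[:max_chars].replace("|", "\\|")
        lines.append(f"| {step_id} | {tier} | {status} | {cell} |")
    return "\n".join(lines)
```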

---

## Key Points for Agents

- **Always start with kernel-smoke**: It's fast and tells you if the repo is broken
- **Respect tier semantics**: KERNEL failures block, GOVERNANCE can warn, OPTIONAL is informational
- **Understand dependencies**: If step A fails, step B might skip (not fail)
- **Propose specific fixes**: Don't just say "something is wrong"; give exact commands to fix it
- **Know when to escalate**: If you can't fix it in 30 minutes, escalate with full context
- **Document degraded mode usage**: If using `--degraded`, create a GitHub issue tracking the fix
- **Never force**: Don't use `--force` or skip validation; work within constraints
- **Test your fixes**: Re-run selftest after applying fixes to verify they work
