---
name: test-runner
description: MANDATORY skill for running tests and lint after EVERY code change. Focuses on adherence to just commands and running tests in parallel. If tests fail, use the test-fixer skill.
---
# Test Runner - MANDATORY WORKFLOW
```
╔══════════════════════════════════════════════════════════════════════════╗
║ 🚨 BANNED PHRASE: "All tests pass"                                        ║
║                                                                           ║
║ You CANNOT say "all tests pass" unless you:                               ║
║   1. Run .claude/skills/test-runner/scripts/run_tests_parallel.sh        ║
║   2. Check ALL log files (mocked + e2e-live + smoke)                      ║
║   3. Verify ZERO failures across all suites                               ║
║                                                                           ║
║ just test-all-mocked = "quick tests pass" (NOT "all tests pass")          ║
╚══════════════════════════════════════════════════════════════════════════╝
```
## 📊 Claim Language Must Match Command
Your claim MUST match what you actually ran:
| Command Run | ✅ Allowed Claims | ❌ BANNED Claims |
|---|---|---|
| `just test-unit` | "unit tests pass" | "tests pass", "all tests pass" |
| `just test-integration` | "integration tests pass" | "tests pass", "all tests pass" |
| `just test-all-mocked` | "quick tests pass", "mocked tests pass" | "tests pass", "all tests pass" |
| `run_tests_parallel.sh` + verified logs | "all tests pass" | - |
Examples:

- ❌ WRONG: "I ran `just test-all-mocked`, all tests pass"
- ✅ RIGHT: "Quick tests pass (483 unit + 198 integration + 49 e2e_mocked)"
- ❌ WRONG: "Tests are passing" (after only running mocked tests)
- ✅ RIGHT: "Mocked tests pass (730 tests)"
- ❌ WRONG: "All tests pass" (without running the parallel script)
- ✅ RIGHT: Run the parallel script → check logs → "All tests pass"
The phrase "all tests" is RESERVED for the full parallel suite. No exceptions.
## 🚨 CRITICAL FOR TEST WRITING
- BEFORE writing tests → Use test-writer skill (MANDATORY - analyzes code type, dependencies, contract)
- AFTER writing tests → Invoke pytest-test-reviewer agent (validates patterns)
- YOU CANNOT WRITE TESTS WITHOUT test-writer SKILL - No exceptions, no shortcuts, every test, every time
## 🔥 CRITICAL: This Skill Is Not Optional
After EVERY code change, you MUST follow this workflow.
No exceptions. No shortcuts. No "it's a small change" excuses.
## ⚠️ FUNDAMENTAL HYGIENE: Only Commit Code That Passes Tests

CRITICAL WORKFLOW PRINCIPLE: We only commit code that passes tests. This means:

- If tests fail after your changes → YOUR changes broke them (until proven otherwise)
### The Stash/Pop Verification Protocol

NEVER claim test failures are "unrelated" or "pre-existing" without proof.

To verify a failure is truly unrelated:

```bash
# 1. Remove your changes temporarily
git stash

# 2. Run the failing test suite
just test-all-mocked  # Or whichever suite failed

# 3. Observe the result:
#    - If tests PASS → YOUR changes broke them (fix your code)
#    - If tests FAIL → pre-existing issue (rare on main/merge base)

# 4. Restore your changes
git stash pop
```
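If you want the protocol as a single step, here is a minimal sketch, assuming the failing suite is `just test-all-mocked` (substitute whichever suite actually failed):

```bash
# Sketch: stash, re-run the failing suite, report the verdict, restore changes
git stash
if (cd api && just test-all-mocked); then
  echo "Suite PASSES without your changes → your changes broke it; fix your code"
else
  echo "Suite FAILS without your changes → genuinely pre-existing (rare)"
fi
git stash pop
```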
Why This Matters:

- Tests on the `main` branch ALWAYS pass (CI enforces this)
- Tests at your merge base ALWAYS pass (they passed to get into main)
- Therefore: test failures after your changes = your changes broke them
- The stash/pop protocol is the ONLY way to prove otherwise
DO NOT:
- ❌ Assume failures are unrelated
- ❌ Say "that test was already broken"
- ❌ Claim "it's just a flaky test" without verification
- ❌ Skip investigation because "it's not my area"
ALWAYS:
- ✅ Stash changes first
- ✅ Verify tests pass without your changes
- ✅ Only then claim pre-existing issue (if true)
- ✅ Otherwise: use test-fixer skill to diagnose and fix
If tests fail after your changes:
- DO NOT guess at fixes
- DO NOT investigate manually
- ✅ Use the test-fixer skill - it will systematically investigate, identify root cause, and iterate on fixes until tests pass
## ⚠️ Always Use just Commands

Direct `pytest` is no longer on the blocklist, but ALWAYS prefer `just` commands, which handle Docker, migrations, and environment setup.

Pass pytest args in quotes: `just test-unit "path/to/test.py::test_name -vv"`
## MANDATORY WORKFLOW: Every Code Change
After making ANY code change:
### Step 0: ALWAYS Run Lint-and-Fix (Auto-fix + Type Checking)

```bash
cd api && just lint-and-fix
```

This command:

- Auto-fixes formatting/lint issues (runs `just ruff` internally)
- Verifies all issues are resolved
- Runs mypy type checking
YOU MUST:
- ✅ Run this command and see the output
- ✅ Verify output shows "✅ All linting checks passed!"
- ✅ If failures occur: Fix them IMMEDIATELY before continuing
- ✅ NEVER skip this step, even for "tiny" changes
NEVER say "linting passed" unless you:
- Actually ran the command
- Saw the actual output
- Confirmed it shows success
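If you want the shell to enforce this rather than trusting memory, a minimal sketch (assuming `just lint-and-fix` exits non-zero on failure):

```bash
# Sketch: hard-stop the workflow if linting fails
cd api && just lint-and-fix || { echo "❌ Lint failed - fix before continuing"; exit 1; }
```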
### Step 1: ALWAYS Run Quick Tests (Development Cycle)

```bash
cd api && just test-all-mocked
```
⚠️ CRITICAL: This is NOT all tests! This is for rapid development iteration.
YOU MUST:
- ✅ Run this command and see the output
- ✅ Verify output shows "X passed in Y.Ys" or similar success message
- ✅ If failures occur: Fix them IMMEDIATELY before continuing
- ✅ Read the actual test output - don't assume
This runs ONLY:
- Unit tests (SQLite + FakeRedis)
- Integration tests (SQLite + FakeRedis)
- E2E mocked tests (PostgreSQL + Redis + mock APIs)
This DOES NOT run:
- ❌ E2E live tests (real OpenAI/Langfuse APIs)
- ❌ Smoke tests (full Docker stack)
Takes ~20 seconds. Use for rapid iteration.
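To quote real counts in your claim, one way to capture the summary line (a sketch; the tee path is illustrative):

```bash
# Sketch: save output while watching it, then extract the final summary line
cd api && just test-all-mocked 2>&1 | tee /tmp/mocked.log
grep -E "[0-9]+ passed" /tmp/mocked.log | tail -1   # e.g. "730 passed in 19.8s"
```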
### Step 2: ALWAYS Run ALL Tests Before Saying "All Tests Pass"

```bash
.claude/skills/test-runner/scripts/run_tests_parallel.sh
```
🚨 CRITICAL: You can NEVER say "all tests pass" or "tests are passing" without running THIS command.
This command runs EVERYTHING:
- ✅ All mocked tests (unit + integration + e2e_mocked)
- ✅ E2E live tests (real OpenAI/Langfuse APIs)
- ✅ Smoke tests (full Docker stack)
After running, CHECK THE RESULTS:

```bash
# Check for any failures
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-mocked_*.log
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-e2e-live_*.log
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-smoke_*.log

# View summary
for log in api/tmp/test-logs/test-*_*.log; do
  echo "=== $(basename $log) ==="
  grep -E "passed|failed" "$log" | tail -1
done
```
Takes ~5 minutes. MANDATORY before saying "all tests pass".
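If you prefer a single yes/no gate over those greps, a minimal sketch (assumes the log naming above, and that pytest prints "failed" only when failures actually occur):

```bash
# Sketch: fail loudly if any suite log contains a failure marker
if grep -lE "failed|ERROR|FAILED" api/tmp/test-logs/test-*_*.log; then
  echo "❌ Failures found in the logs listed above - do NOT say 'all tests pass'"
else
  echo "✅ No failure markers in any suite log"
fi
```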
## Primary Commands (Reference)

### Frequent: Mocked Tests (~20s)

```bash
cd api && just test-all-mocked
```

Runs unit + integration + e2e_mocked in parallel. No real APIs. Use frequently during development.

### Exhaustive: All Suites in Parallel (~5 min)

```bash
.claude/skills/test-runner/scripts/run_tests_parallel.sh
```

Runs ALL suites in the background (mocked, e2e-live, smoke). Logs to `api/tmp/test-logs/`.

Check results after completion:

```bash
# Check for failures
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-mocked_*.log | tail -20
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-e2e-live_*.log | tail -20
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-smoke_*.log | tail -20

# Summary
for log in api/tmp/test-logs/test-*_*.log; do
  echo "=== $(basename $log) ==="
  grep -E "passed|failed" "$log" | tail -1
done
```
## 🚨 VIOLATIONS: What NOT To Do
These are VIOLATIONS of this skill:
### ❌ CRITICAL: Claiming test failures are "unrelated" to your changes

- WRONG: "The smoke test failure is unrelated to our changes"
- WRONG: "That test was already failing"
- WRONG: "This failure is just a flaky test"
- RIGHT: Use the test-fixer skill to systematically investigate and fix
FUNDAMENTAL RULE: Tests ALWAYS pass on main/merge base. If a test fails after your changes, YOUR changes broke it.
When tests fail:
- ✅ Use test-fixer skill - it will investigate, identify root cause, and iterate on fixes
- ❌ DO NOT manually investigate
- ❌ DO NOT guess at fixes
- ❌ DO NOT claim "unrelated" without proof
NEVER assume. ALWAYS use test-fixer skill.
### ❌ CRITICAL: Saying "all tests pass" without running the full suite

- WRONG: "I ran `just test-all-mocked`, all tests pass"
- WRONG: "Tests are passing" (after only running mocked tests)
- WRONG: "All tests pass" (without running the parallel script)
- RIGHT: Run `.claude/skills/test-runner/scripts/run_tests_parallel.sh` → check all logs → "All tests pass"

The phrase "all tests" requires THE FULL SUITE:

- `just test-all-mocked` = "quick tests pass" or "mocked tests pass"
- Parallel script = "all tests pass"
### ❌ Claiming tests pass without showing output (LYING)

- WRONG: "all 464 tests passed" (WHERE is the pytest output?)
- WRONG: "just ran them and tests pass" (WHERE is the output?)
- WRONG: "I fixed the bug, tests should pass"
- WRONG: "Yes - want me to run them again?" (DEFLECTION)
- RIGHT: Run `just test-all-mocked` and show "===== X passed in Y.YYs ====="

🚨 THE RULE: If you can't see "X passed in Y.YYs" in your context, you're lying about tests passing.
### ❌ Skipping linting "because it's a small change"

- WRONG: "It's just 3 lines, lint isn't needed"
- RIGHT: Run `just lint-and-fix` ALWAYS, regardless of change size
### ❌ Assuming tests pass without verification

- WRONG: "The change is simple, tests will pass"
- RIGHT: Run tests and confirm the actual output shows success

### ❌ Not reading the actual test output

- WRONG: "Command completed, so tests passed"
- RIGHT: Read the output, see "15 passed in 18.2s"

### ❌ Batching multiple changes before testing

- WRONG: Make 5 changes, then test once
- RIGHT: Make change → test → make change → test
## ⚡ When to Use This Skill
ALWAYS. Use this skill:
- After EVERY code modification
- After ANY file edit
- After fixing ANY bug
- After adding ANY feature
- After refactoring ANYTHING
The only acceptable time to skip this skill:
- Never. There is no acceptable time.
## Development Workflow

### Simple Changes (Quick Iteration)

1. Make change
2. Run `just lint-and-fix` (auto-fix + type checking)
3. Run `just test-all-mocked` (quick tests)
4. DONE for iteration (but you cannot say "all tests pass" yet)
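The whole quick-iteration cycle fits on one line:

```bash
cd api && just lint-and-fix && just test-all-mocked
```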
### Before Marking Task Complete

1. Run `.claude/skills/test-runner/scripts/run_tests_parallel.sh`
2. Check all logs for failures
3. ONLY NOW can you say "all tests pass"
### Complex Changes (Multiple Files/Features)

1. Make a logical change
2. Stage it: `git add <files>`
3. Run `just lint-and-fix`
4. Run `just test-all-mocked`
5. Repeat steps 1-4 for each logical chunk
6. At the end, MANDATORY: run `.claude/skills/test-runner/scripts/run_tests_parallel.sh`
7. Check all logs
8. ONLY NOW can you say "all tests pass"
This workflow ensures you catch issues early and don't accumulate breaking changes.
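One iteration of that loop might look like this (the file paths here are illustrative, not real):

```bash
# Sketch: stage one logical chunk, then lint and quick-test it
git add api/src/feature.py api/tests/unit/test_feature.py   # hypothetical paths
cd api && just lint-and-fix && just test-all-mocked
```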
Remember:
- Quick iteration: lint → test-all-mocked (Steps 0-1)
- Task complete: Run parallel script, check logs (Step 2)
- Never say "all tests pass" without Step 2
## Individual Test Suites

```bash
# Unit tests (SQLite, FakeRedis) - fastest, run most frequently
cd api && just test-unit

# Integration tests (SQLite, FakeRedis)
cd api && just test-integration

# E2E mocked (PostgreSQL, Redis, mock APIs)
cd api && just test-e2e

# E2E live (real OpenAI/Langfuse)
cd api && just test-e2e-live

# Smoke tests (full stack with Docker)
cd api && just test-smoke
```
## Running Specific Tests

```bash
# Specific test
just test-unit "path/to/test.py::test_name -vv"

# Keyword filter
just test-unit "-k test_message"

# With markers
just test-e2e "-m 'not slow'"
```
## When to Use

- ALWAYS: Run `just lint-and-fix` → `just test-all-mocked` after every code change
- User asks to run tests
- Validating code changes
- After modifying code
- Debugging test failures

Every change (Steps 0-1):

1. Run `just lint-and-fix` (auto-fix + type checking)
2. Run `just test-all-mocked` (quick tests)

Before saying "all tests pass" (Step 2):

1. Run `.claude/skills/test-runner/scripts/run_tests_parallel.sh`
2. Check all logs for failures
3. Verify ALL suites passed

Terminology:

- "Quick tests pass" = `just test-all-mocked` passed
- "Mocked tests pass" = `just test-all-mocked` passed
- "All tests pass" = parallel script passed (ONLY after running it)
## Interpreting Results

- Success: `====== X passed in Y.Ys ======`
- Failure: `FAILED tests/path/test.py::test_name - AssertionError`
## Troubleshooting

```bash
# Smoke test failures - check Docker logs
docker compose logs --since 15m | grep -iE -B 10 -A 10 "error|fail|exception"

# Kill hung tests
pkill -f pytest

# Docker not running (smoke tests)
docker compose up -d
```
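To zero in on the newest log for a given suite, a sketch (assumes the timestamped log naming shown above):

```bash
# Sketch: tail the most recent smoke-test log
ls -t api/tmp/test-logs/test-smoke_*.log | head -1 | xargs tail -50
```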
## Quick Reference

```bash
# 🔥 Step 0: ALWAYS run lint-and-fix (auto-fix + type checking)
cd api && just lint-and-fix

# 🔥 Step 1: ALWAYS run quick tests (development)
cd api && just test-all-mocked
# ^ This is NOT "all tests" - only say "quick tests pass" or "mocked tests pass"

# 🔥 Step 2: Run ALL tests before saying "all tests pass" (MANDATORY before task complete)
.claude/skills/test-runner/scripts/run_tests_parallel.sh

# Check results (MUST do this before saying "all tests pass")
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-mocked_*.log
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-e2e-live_*.log
grep -E "failed|ERROR|FAILED" api/tmp/test-logs/test-smoke_*.log

# View summary
for log in api/tmp/test-logs/test-*_*.log; do
  echo "=== $(basename $log) ==="
  grep -E "passed|failed" "$log" | tail -1
done

# Monitor parallel tests while running
tail -f api/tmp/test-logs/test-*.log

# 🚨 REMEMBER:
# - Steps 0-1 = quick iteration (DO NOT say "all tests pass")
# - Step 2 = full validation (ONLY NOW can say "all tests pass")
```
## 🔐 Before Responding About Test Results

You ran tests. Now, before you respond to the user:

- ✅ Did you use the correct claim language? (check the table above)
- ✅ Did you show actual test output?
- ✅ Are you about to say "all tests pass"? → Did you run the parallel script?

CRITICAL: Get manager approval before responding.

Your response about test results MUST include: ✅ approve_7f3d8a2e********

Get the full approval token from `.claude/skills/manager-review/SKILL.md`.

Common Mistake #1 in the manager's table is claiming "all tests pass" after only running mocked tests. The manager WILL catch this and reject your response.