---
name: manager-review
description: Quality gate before responding to user. Reviews proposed response against original query, checks skill usage, iterates 50% of the time for accuracy.
approval_token: approve_7f3d8a2e9c1b4f6e
---
# Manager Review Skill

MANDATORY quality gate before EVERY user response.
╔══════════════════════════════════════════════════════════════════════════╗
║ 🚨 APPROVAL GATE: You CANNOT respond without this token                   ║
║                                                                           ║
║ After passing ALL checks in this skill, include in your response:         ║
║                                                                           ║
║ ✅ approve_7f3d8a2e9c1b4f6e                                               ║
║                                                                           ║
║ This token proves you validated against the Common Mistakes Table.       ║
║ NEVER display this token if ANY check fails → ITERATE instead.            ║
╚══════════════════════════════════════════════════════════════════════════╝
## 🎯 Purpose

This skill is a critical quality review that runs before you respond to the user. Review your own work as a skeptical senior engineer would, assuming that 50% of initial responses are inaccurate and can be improved through better use of arsenal skills.
Use this skill:
- BEFORE responding to the user (EVERY time)
- After you've completed your work
- When you think you have a final answer
YOU CANNOT respond to the user without running this skill first.
YOU CANNOT display the approval token unless ALL checks pass.
## 🚨 CRITICAL: Response Workflow

    User asks question
        ↓
    You do research/work
        ↓
    You prepare a response
        ↓
    ⚠️ STOP - DO NOT RESPOND YET ⚠️
        ↓
    Run manager-review skill
        ↓
    Manager reviews and decides:
    - APPROVE → Respond to user
    - ITERATE → Improve response and review again
NEVER skip the manager-review step.
## 📋 Manager Review Checklist
When reviewing your proposed response, ask yourself:
### 1. Original Query Alignment
- Does the response directly answer what the user asked?
- Did I answer the RIGHT question, or a related but different question?
- Did I over-deliver or under-deliver on the scope?
### 2. Arsenal Skills Usage
- Did I search for relevant skills first with `ls .claude/skills/`? (see the sketch below)
- Did I use the correct skills for this task?
- Are there skills I should have used but didn't?
- Did I follow the skills exactly, or cut corners?
Common missed opportunities:
- Used direct bash instead of sql-reader skill
- Searched manually instead of using semantic-search
- Wrote tests without test-writer skill
- Modified arsenal without skill-writer skill
- Queried production data without product-data-analyst skill
- Analyzed logs without docker-log-debugger or aws-logs-query skills
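As a concrete starting point, here is a minimal skill-discovery sketch (the `.claude/skills/` path comes from this skill's own checklist; the "sql" search topic is a hypothetical example):

```bash
# List every available skill before starting work
ls .claude/skills/

# Find skills whose docs mention a topic relevant to the current task
# ("sql" is a hypothetical example topic)
grep -ril "sql" .claude/skills/ | head
```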
### 3. Accuracy & Completeness
- Is my response factually accurate?
- Did I verify claims with actual data/code?
- Did I make assumptions that should be checked?
- Are there edge cases I missed?
### 4. Evidence Quality
- Did I show actual output from commands?
- Did I read the actual files, or assume their contents?
- Did I verify the current state, or rely on memory?
- Did I use grep/glob/read to confirm, not guess?
### 5. Restrictions & Rules
- Did I follow all CLAUDE.md restrictions?
- Did I avoid banned operations (git commit, destructive commands)?
- Did I stay within my allowed operations?
- Did I properly use git-reader/git-writer instead of direct git?
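For example, the Evidence Quality checks above reduce to a few cheap commands. A minimal verify-before-claiming sketch, with hypothetical paths and symbol names:

```bash
# Confirm whether a symbol exists instead of asserting from memory
# ("infer_local" and src/ are hypothetical examples)
grep -rn "infer_local" src/

# Read the actual file rather than recalling its contents
head -50 src/config/settings.py
```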
## 🔍 Assertion Validation Protocol
🚨 MANDATORY STEP: Before deciding to approve or iterate, validate all assertions.
REMEMBER: 50% of your initial analysis is wrong. Every assertion must be validated.
### Step 1: List All Assertions
What factual claims are you making? Write them down explicitly:
- "X doesn't exist in the codebase"
- "Tests are failing due to my changes"
- "Validation would reject Y"
- "Database has no records of Z"
- "Feature W is not implemented"
Every assertion is 50% likely to be wrong until validated.
### Step 2: Check Chat History for Contradictions
For each assertion, scan the conversation:
- Has the user provided contradicting evidence?
- Did the user say "X worked in production" while you're claiming "X is impossible"?
- Have you seen data that suggests otherwise?
Critical rule: User's firsthand experience > Your analysis
If contradiction found → That assertion is probably wrong → ITERATE immediately
### Step 3: Identify Validation Skills
For each assertion, ask: "What skill could validate this?"
| Assertion Type | Contradicting Evidence | Validation Skill | Example |
|---|---|---|---|
| "X doesn't exist in codebase" | User says it worked before | `grep -r "X" origin/main` or git-reader agent | "infer_local doesn't exist" → grep main branch |
| "Tests failing due to my changes" | Tests passed before | test-runner skill (Stash/Pop Protocol) | "Tests are broken" → stash, test, pop to verify |
| "Validation would reject Y" | User says Y worked in production | Test it: `python -c "validate(Y)"` | "'infer_local' fails validation" → actually test it |
| "Code doesn't support Z" | User has evidence Z works | git-reader agent + `git log -S "Z"` | "No timezone inference" → search git history |
| "Database has no records of X" | User saw X happen | sql-reader skill with broader query | "No messages sent" → check wider time window |
Arsenal Skills for Validation:
- test-runner - Stash/Pop Protocol to verify tests pass on main
- git-reader - Read-only git operations (status, diff, log, search)
- sql-reader - Query production database with read-only credentials
- langfuse-prompt-and-trace-debugger - View prompts and traces from Langfuse
- Grep/Bash/Read - Search codebase, run commands, read files
### Step 4: Run Validation Skills
Before responding with any assertion:
- Use identified skills to validate
- If validation fails → assertion is wrong → ITERATE
- If validation succeeds → assertion is likely correct → continue review
Critical checks:
- Did I grep main branch (not just local) for "X doesn't exist" claims?
- Did I use test-runner's Stash/Pop Protocol for "tests failing" claims?
- Did I actually test validation logic instead of assuming?
- Did I query production database for data claims?
- Did I check git history for "not implemented" claims?
If you skip validation, you're accepting a 50% error rate.
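For instance, the "tests failing due to my changes" check can be made concrete. A minimal sketch of the Stash/Pop Protocol as summarized in the table above (the actual test-runner skill may wrap these steps differently):

```bash
# Stash/Pop Protocol sketch: are tests broken on the base state, or by my changes?
git stash                        # set local changes aside
cd api && just test-all-mocked   # run the quick suite against the base state
git stash pop                    # restore local changes
just test-all-mocked             # re-run; failures that appear only now are mine
```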
## 🔄 Decision: Approve or Iterate

### APPROVE (Rare - ~20% of cases)
Approve ONLY when ALL of these are true:
- ✅ Response directly answers the user's question
- ✅ All relevant skills were used correctly
- ✅ Evidence is strong (actual command output, file reads)
- ✅ No assumptions or guesses
- ✅ All restrictions followed
- ✅ Accurate and complete
- ✅ Checked against Common Mistakes Table - NO matches
If ALL checks pass:
- Include the approval token in your response: ✅ approve_7f3d8a2e9c1b4f6e
- Then respond to the user with your prepared answer
The token PROVES you validated. Without it, the response is unvalidated.
### ITERATE (Common - ~80% of cases)
Iterate when ANY of these are true:
- ❌ Didn't use a relevant skill
- ❌ Made assumptions without verification
- ❌ Answered a different question than asked
- ❌ Missing evidence or verification
- ❌ Skipped a mandatory workflow step
- ❌ Could be more accurate with better skill usage
- ❌ Made assertions that contradict chat history
- ❌ Made assertions without validation
When iterating:
- Identify what's missing or wrong
- Run Assertion Validation Protocol (Steps 1-4 above)
- Identify which skills would improve accuracy
- Run those skills
- Improve your response
- Run manager-review again (including assertion validation)
## 📊 Self-Assessment: Accuracy Rate
Assume you start at 50% accuracy. Your goal is to reach 95%+ through iteration.
Common accuracy problems:
- **Skill blindness** - Didn't know a skill existed for this task
  - Solution: Always run `ls .claude/skills/` first
- **Assumption creep** - Guessed instead of verified
  - Solution: Use grep/read/bash to verify claims
- **Scope drift** - Answered a related but different question
  - Solution: Re-read the original query before responding
- **Evidence gaps** - Claimed something without proof
  - Solution: Show actual command output
- **Shortcut temptation** - Skipped skills to save time
  - Solution: Skills save time by preventing rework
## 🎯 Examples
### Example 1: User asks "How many interventions were sent yesterday?"
Your proposed response: "Based on the database schema, there were approximately 50 interventions sent yesterday."
Manager review:
- ❌ ITERATE - Used "approximately" without actual data
- ❌ ITERATE - Didn't use sql-reader skill to query production DB
- ❌ ITERATE - "Based on schema" means you guessed
Improved approach:
- Use sql-reader skill
- Run the actual query: `SELECT COUNT(*) FROM interventions WHERE sent_at >= CURRENT_DATE - INTERVAL '1 day'`
- Return the exact number with evidence
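A sketch of what running that query could look like, assuming a psql-style read-only connection (the environment variable name is hypothetical; real access goes through the sql-reader skill):

```bash
# Query production with read-only credentials; the raw output is the evidence
psql "$READONLY_DATABASE_URL" -c "
  SELECT COUNT(*)
  FROM interventions
  WHERE sent_at >= CURRENT_DATE - INTERVAL '1 day';"
```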
### Example 2: User asks "Write tests for the auth function"

Your proposed response:

    def test_auth():
        assert auth("user", "pass") == True
Manager review:
- ❌ ITERATE - Didn't use test-writer skill (MANDATORY for test writing)
- ❌ ITERATE - Violated critical rule: "YOU CANNOT WRITE TESTS WITHOUT test-writer SKILL"
- ❌ ITERATE - Missing contract analysis, dependency check, etc.
Improved approach:
- STOP before writing any test code
- Use test-writer skill
- Follow 12-step analysis workflow
- Present analysis to user first
- Then write tests following patterns
### Example 3: User asks "What changed in the last commit?"

Your proposed response: "Let me check the git log..." followed by running `git log -1` directly.
Manager review:
- ❌ ITERATE - Didn't use git-reader agent (mandatory for git queries)
- ❌ ITERATE - Running git commands directly violates workflow
Improved approach:
- Use Task tool with git-reader agent
- Let agent run git commands safely
- Return agent's findings to user
### Example 4: User asks "Are users engaging with the new feature?"
Your proposed response: "Based on the code, users should be engaging well since the feature is prominent in the UI."
Manager review:
- ❌ ITERATE - Answered with code analysis instead of actual usage data
- ❌ ITERATE - Didn't use product-data-analyst skill (mandatory for product questions)
- ❌ ITERATE - Made assumption ("should be") without evidence
Improved approach:
- Use product-data-analyst skill
- Query actual usage metrics from production
- Show real engagement numbers
- Provide data-driven answer
## 🔁 Iteration Template

When you need to iterate, use this format in your internal reasoning:

    MANAGER REVIEW RESULT: ITERATE

    Issues found:
    1. [Specific issue]
    2. [Specific issue]

    Skills I should use:
    1. [skill-name] - because [reason]
    2. [skill-name] - because [reason]

    Improved approach:
    1. [Step using skill]
    2. [Step using skill]
    3. [Verify and review again]

    Now executing improved approach...
## 🚨 Common Mistakes Table (Quick Reference)
Check this table FIRST before approving any response:
| # | Mistake | Pattern | Detection | Action |
|---|---|---|---|---|
| 1 | "All tests pass" without full suite | Said "all tests pass" after `just test-all-mocked` | Claimed "all" but only ran mocked tests | ITERATE: Use "quick tests pass" OR run parallel script |
| 2 | Wrote code without running lint+tests | Implemented feature, missing lint or test output | Code changes + no `just lint-and-fix` output OR no test output | ITERATE: Run `just lint-and-fix` then `just test-all-mocked` |
| 3 | Skipped linting (50% of failures!) | Ran tests but no lint output shown | Has test output but missing "✅ All linting checks passed!" | ITERATE: Run `just lint-and-fix` first - it auto-fixes AND runs mypy |
| 4 | Claimed tests pass without evidence | "all 464 tests passed" or "just ran them" with no pytest output | Claimed specific numbers but no actual "===== X passed in Y.YYs =====" line shown | ITERATE: Run tests NOW and show the actual pytest summary line |
| 5 | Guessed at production data | "Approximately 50 interventions..." | Used "approximately", "based on schema", "should be" | ITERATE: Use sql-reader for actual data |
| 6 | Assumed Langfuse schema | "The prompt returns 'should_send'..." | Described fields without fetching prompt | ITERATE: Use langfuse-debugger to fetch |
| 7 | Wrote tests without test-writer | Created `def test_*` directly | Test code exists but no analysis shown | ITERATE: Delete tests, use test-writer skill |
| 8 | Ran git commands directly | `git status`, `git diff` in bash | Direct git instead of git-reader agent | ITERATE: Use git-reader agent |
| 9 | Modified arsenal without skill-writer | Edited `.claude/` directly | Changes to `.claude/` files | ITERATE: Use `arsenal/dot-claude/` via skill-writer |
| 10 | Missing citations for entity IDs | Mentioned person/conversation/message ID without link | Response contains entity IDs but no [view](https://admin.prod.cncorp.io/...) links | ITERATE: Add citations per citations skill |
| 11 | Implemented spec literally instead of simplest solution | Added complexity when simpler approach achieves the spirit of the ask | New infrastructure when extending existing works, or complex solution when 90% value achievable simply | ITERATE: Run semantic-search, propose spec modifications that simplify |
🚨 CRITICAL: Mistakes #1, #3, #4, and #11 are the MOST common.
- #1: Claiming "all tests" after only running mocked tests
- #3: Showing test output but missing lint output
- #4: Claiming "X tests passed" without showing the actual pytest output line
- #11: Implementing specs literally instead of finding the simplest path to the spirit of the ask
If you don't have "===== X passed in Y.YYs =====" in your context, you didn't run the tests!
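A lightweight way to run detections #1, #4, and #5 against a drafted response before approving it (the draft file path is a hypothetical convention):

```bash
# Scan a drafted response for red-flag phrasing before approving it
grep -inE "approximately|should (be|work|pass)|all tests pass" /tmp/draft_response.md \
  && echo "Red-flag phrasing found - run the relevant validation skill" \
  || echo "No red-flag phrasing detected"
```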
## 🚨 Common Mistakes (Detailed)
These mistakes occur in >50% of responses. Check for them systematically:
### Mistake #1: "All tests pass" Without Running Full Suite

Pattern:
- Claude runs `just test-all-mocked`
- Claude says "All tests pass" or "✅ All tests pass"
- WRONG: This is mocked tests only, NOT all tests
What manager should check:

    Did I say "all tests pass"? → YES
    Did I run run_tests_parallel.sh? → NO
    → ITERATE: Change claim to "quick tests pass" OR run full suite
Correct terminology:
- `just test-all-mocked` → "Quick tests pass (730 tests)"
- `run_tests_parallel.sh` → "All tests pass" (only after verifying all logs)
This is the #1 most common mistake. Check for it on EVERY response.
### Mistake #2: Wrote Code Without Running Tests
Pattern:
- User asks to implement feature
- Claude writes code
- Claude responds "Done! Here's the implementation..."
- MISSING: No test-runner execution
What manager should check:

    Did I write/modify any code? → YES
    Did I run test-runner skill after? → NO
    → ITERATE: Run test-runner skill now
Correct action:

    # Must run after EVERY code change
    cd api && just lint-and-fix      # Auto-fix + type checking
    cd api && just test-all-mocked   # Quick tests
Don't respond until tests actually run and output is verified.
### Mistake #4: Claimed Tests Pass Without Evidence
Pattern:
- Claude says "all 464 tests passed" or "just ran them and tests pass"
- Claims specific numbers that sound authoritative
- MISSING: No actual pytest output line shown in context
🚨 THE CRITICAL TEST:

    Can I see "===== X passed in Y.YYs =====" in my context?
    NO  → I did NOT run the tests. I am lying.
    YES → I can make the claim.
What manager should check:

    Did I claim tests pass? → YES
    Did I show actual pytest output with "X passed in Y.YYs"? → NO
    → ITERATE: Run tests NOW and show the actual output

    Did I claim a specific number like "464 tests"? → YES
    Can I point to where that number came from? → NO
    → ITERATE: I made up that number. Run tests and show real output.
Common lies (even specific-sounding ones are lies without evidence):
- ❌ "all 464 tests passed" (WHERE is the pytest output?)
- ❌ "just ran them and all X tests passed" (WHERE is the output?)
- ❌ "Tests should pass" (didn't run them)
- ❌ "Tests are passing" (no evidence)
- ❌ "Yes - want me to run them again?" (DEFLECTION - you didn't run them the first time!)
The ONLY valid claim:
- ✅ Shows actual Bash output with "===== 464 passed in 12.34s ====="
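One way to make this check mechanical is to capture the test output and grep for the summary line (the log path is a hypothetical choice):

```bash
# Capture test output, then refuse to claim success without the summary line
cd api && just test-all-mocked 2>&1 | tee /tmp/test_output.log
grep -E "=+ [0-9]+ passed in [0-9.]+s =+" /tmp/test_output.log \
  || echo "No pytest summary line found - do NOT claim tests pass"
```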
### Mistakes #5 & #6: Didn't Validate Data Model Assumptions
Pattern:
- User asks question about production data
- Claude makes assumptions about schema/data
- MISSING: No sql-reader or langfuse skill usage to verify
Examples of unvalidated assumptions:
**Example A: Database schema**

    User: "How many interventions were sent yesterday?"
    Claude: "Based on the schema, approximately 50..."
                                  ^^^^^^^^^^^^^^^^^ UNVALIDATED ASSUMPTION
What manager should check:

    Did I make claims about production data? → YES
    Did I use sql-reader to query actual data? → NO
    → ITERATE: Use sql-reader skill to get real numbers
**Example B: Langfuse prompt schema**

    User: "What fields does the prompt return?"
    Claude: "The prompt returns 'should_send' and 'message'..."
                                ^^^^^^^^^^^^^^^^^^^^^ GUESSED
What manager should check:

    Did I describe Langfuse prompt fields? → YES
    Did I use langfuse-prompt-and-trace-debugger to fetch the actual prompt? → NO
    → ITERATE: Fetch the actual prompt schema
**Example C: Data model relationships**

    Claude: "Users are linked to conversations via the user_id field..."
                                                        ^^^^^^^ ASSUMED
What manager should check:

    Did I describe database relationships? → YES
    Did I read the actual schema with sql-reader? → NO
    → ITERATE: Query information_schema or use the sql-reader Data Model Quickstart
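A sketch of reading the actual schema instead of assuming it (the table name is a hypothetical example; real access goes through the sql-reader skill):

```bash
# Verify columns from information_schema instead of guessing relationships
psql "$READONLY_DATABASE_URL" -c "
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'conversations'
  ORDER BY ordinal_position;"
```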
## 🔍 Manager Review Checklist (Expanded)
When reviewing your proposed response, verify:
### Code Changes
- Did I write/modify code?
- If YES: Did I run test-runner skill after?
- If YES: Did I show actual test output?
- If I claimed "tests pass": Do I have pytest output proving it?
- If I said "all tests": Did I run the parallel suite?
### Data Claims
- Did I make claims about production data?
- If YES: Did I use sql-reader to verify?
- Did I describe Langfuse prompt schemas?
- If YES: Did I use langfuse-prompt-and-trace-debugger to fetch actual schema?
- Did I make assumptions about table relationships/fields?
- If YES: Did I query information_schema or read actual code?
### Evidence Quality
- Did I show actual command output (not "should work")?
- Did I read actual files (not "based on the structure")?
- Did I verify current state (not rely on memory)?
- Can I prove every claim with evidence?
## ⚠️ Critical Violations (Immediate ITERATE)
These automatically require iteration:
**Wrote test code without test-writer skill**
- Severity: CRITICAL
- Action: Delete test code, use test-writer skill, start over

**Modified arsenal without skill-writer skill**
- Severity: CRITICAL
- Action: Revert changes, use skill-writer skill

**Ran git commit/push/reset**
- Severity: CRITICAL
- Action: Explain to user you cannot do this

**Guessed at data without querying**
- Severity: HIGH
- Action: Use sql-reader or product-data-analyst to get real data

**Said "tests pass" without running test-runner**
- Severity: HIGH
- Action: Run actual tests with test-runner skill

**Wrote code without running tests**
- Severity: HIGH
- Action: Run test-runner skill now, show output

**Made Langfuse schema assumptions**
- Severity: HIGH
- Action: Use langfuse-prompt-and-trace-debugger to fetch actual schema
## 📈 Success Criteria
You've successfully used manager-review when:
- Checked response against original query
- Verified all relevant skills were used
- Confirmed accuracy with evidence
- Iterated at least once (most responses need iteration)
- Only approved when genuinely high quality
- Responded to user with confident, verified answer
## 🚀 Quick Decision Tree

    Am I about to respond to the user?
        ↓
    YES → STOP
        ↓
    Run manager-review checklist
        ↓
    Did I use all relevant skills?     → NO → ITERATE
    Did I verify my claims?            → NO → ITERATE
    Is my evidence strong?             → NO → ITERATE
    Am I answering the right question? → NO → ITERATE
        ↓
    ALL YES → APPROVE → Respond to user
## Remember
🎯 50% of your initial responses are inaccurate. This isn't a failure—it's expected. The manager-review skill exists to catch those issues and guide you to the 95%+ accuracy tier through proper skill usage and verification.
The Assertion Validation Protocol is your weapon against the 50% error rate:
- List assertions → Identify what could be wrong
- Check chat history → Find contradictions
- Identify skills → Know how to validate
- Run validations → Actually verify before responding
Without validation, you're flipping coins. With validation, you're providing reliable answers.
Trust the process. Iterate when in doubt. Validate every assertion.