| name | eval-harness |
| description | Frozen evaluation harness that gates self-improvement with benchmarks, regressions, and human approval loops. |
| allowed-tools | Read, Write, Edit, Bash, Glob, Grep, Task, TodoWrite |
| model | claude-3-5-sonnet |
| x-version | 3.2.0 |
| x-category | tooling |
| x-vcl-compliance | v3.1.1 |
| x-cognitive-frames | HON, MOR, COM, CLS, EVD, ASP, SPC |
L1 Improvement
- Rewrote the harness SOP in English-first Prompt Architect style with explicit confidence ceilings.
- Clarified freeze rules, routing, and validation gates to align with Skill Forge guardrails.
- Added memory tagging and baseline hashing to stop silent drift.
STANDARD OPERATING PROCEDURE
Purpose
Act as the immutable anchor for recursive improvement cycles. The harness does not self-improve; it gates changes via frozen benchmarks, regression suites, and documented decisions.
Trigger Conditions
- Positive: request to gate changes for Prompt Forge, Skill Forge, or other skills; regression checks; baseline comparisons.
- Negative: exploratory evaluation design (route to test-design skills) or ad-hoc scoring without baselines.
Guardrails
- Freeze benchmark definitions and scoring; record hash before execution.
- Structure-first docs maintained; changes to harness require manual review and explicit versioning.
- Confidence ceilings required for verdicts; cite evidence and observed metrics.
- Memory tagging for every run to preserve comparables.
Execution Phases
- Scope & Baseline – Identify target skill/build, select benchmark suites, and capture hashes of frozen assets.
- Data Load – Pull prior runs for comparables; confirm hardware/latency constraints if applicable.
- Run Benchmarks – Execute frozen suites; collect metrics, logs, and failures without mutation.
- Regression Analysis – Compare against baselines; flag degradations and confidence impacts.
- Decision – Recommend ACCEPT/REJECT/ROLLBACK with rationale, risks, and ceilings.
- Archive – Store reports, hashes, and decisions in memory; surface follow-ups.
Output Format
- Target, suites executed, baselines/hashes, and environment notes.
- Metrics and regressions with evidence.
- Decision (ACCEPT/REJECT/ROLLBACK) + risks and follow-ups.
- Confidence: X.XX (ceiling: TYPE Y.YY) and memory namespace used.
Validation Checklist
- Benchmarks frozen; hashes recorded and unchanged.
- Comparables loaded; regressions checked.
- Decision cites evidence and confidence ceiling.
- Memory tagged and artifacts archived.
Integration
- Memory MCP:
skills/tooling/eval-harness/{project}/{timestamp}for reports and hashes. - Hooks: follow Skill Forge latency bounds; refuse execution if harness assets are modified.
Confidence: 0.70 (ceiling: inference 0.70) – Harness SOP aligned to Prompt Architect clarity and Skill Forge safeguards.