Name: eval-harness
Author: majiayu000

name	eval-harness
description	Frozen evaluation harness that gates self-improvement with benchmarks, regressions, and human approval loops.
allowed-tools	Read, Write, Edit, Bash, Glob, Grep, Task, TodoWrite
model	claude-3-5-sonnet
x-version	3.2.0
x-category	tooling
x-vcl-compliance	v3.1.1
x-cognitive-frames	HON, MOR, COM, CLS, EVD, ASP, SPC

L1 Improvement

Rewrote the harness SOP in English-first Prompt Architect style with explicit confidence ceilings.
Clarified freeze rules, routing, and validation gates to align with Skill Forge guardrails.
Added memory tagging and baseline hashing to stop silent drift.

STANDARD OPERATING PROCEDURE

Purpose

Act as the immutable anchor for recursive improvement cycles. The harness does not self-improve; it gates changes via frozen benchmarks, regression suites, and documented decisions.

Trigger Conditions

Positive: request to gate changes for Prompt Forge, Skill Forge, or other skills; regression checks; baseline comparisons.
Negative: exploratory evaluation design (route to test-design skills) or ad-hoc scoring without baselines.

Guardrails

Freeze benchmark definitions and scoring; record hash before execution.
Structure-first docs maintained; changes to harness require manual review and explicit versioning.
Confidence ceilings required for verdicts; cite evidence and observed metrics.
Memory tagging for every run to preserve comparables.

Execution Phases

Scope & Baseline – Identify target skill/build, select benchmark suites, and capture hashes of frozen assets.
Data Load – Pull prior runs for comparables; confirm hardware/latency constraints if applicable.
Run Benchmarks – Execute frozen suites; collect metrics, logs, and failures without mutation.
Regression Analysis – Compare against baselines; flag degradations and confidence impacts.
Decision – Recommend ACCEPT/REJECT/ROLLBACK with rationale, risks, and ceilings.
Archive – Store reports, hashes, and decisions in memory; surface follow-ups.

Output Format

Target, suites executed, baselines/hashes, and environment notes.
Metrics and regressions with evidence.
Decision (ACCEPT/REJECT/ROLLBACK) + risks and follow-ups.
Confidence: X.XX (ceiling: TYPE Y.YY) and memory namespace used.

Validation Checklist

Benchmarks frozen; hashes recorded and unchanged.
Comparables loaded; regressions checked.
Decision cites evidence and confidence ceiling.
Memory tagged and artifacts archived.

Integration

Memory MCP: skills/tooling/eval-harness/{project}/{timestamp} for reports and hashes.
Hooks: follow Skill Forge latency bounds; refuse execution if harness assets are modified.

Confidence: 0.70 (ceiling: inference 0.70) – Harness SOP aligned to Prompt Architect clarity and Skill Forge safeguards.

eval-harness

Install Skill

SKILL.md