Claude Code Plugins

Community-maintained marketplace

Frozen evaluation harness that gates self-improvement with benchmarks, regressions, and human approval loops.

Install Skill

1. Download skill
2. Enable skills in Claude
   Open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude
   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by going through its instructions before using it.

SKILL.md

name: eval-harness
description: Frozen evaluation harness that gates self-improvement with benchmarks, regressions, and human approval loops.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Task, TodoWrite
model: claude-3-5-sonnet
x-version: 3.2.0
x-category: tooling
x-vcl-compliance: v3.1.1
x-cognitive-frames: HON, MOR, COM, CLS, EVD, ASP, SPC

L1 Improvement

  • Rewrote the harness SOP in English-first Prompt Architect style with explicit confidence ceilings.
  • Clarified freeze rules, routing, and validation gates to align with Skill Forge guardrails.
  • Added memory tagging and baseline hashing to stop silent drift.

STANDARD OPERATING PROCEDURE

Purpose

Act as the immutable anchor for recursive improvement cycles. The harness does not self-improve; it gates changes via frozen benchmarks, regression suites, and documented decisions.
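
A minimal sketch of the gating idea, in Python, assuming higher-is-better metric scores; the function name, metric names, and threshold are illustrative assumptions, not part of the frozen suite definitions.

```python
# Illustrative gate: compare a candidate run against a frozen baseline and
# return a verdict. Names and thresholds here are assumptions for this sketch.
def gate(candidate: dict[str, float], baseline: dict[str, float],
         tolerance: float = 0.01) -> str:
    """Return ACCEPT, REJECT, or ROLLBACK (assumes higher-is-better metrics)."""
    regressed = [
        name for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    ]
    if not regressed:
        return "ACCEPT"
    # If every tracked metric degraded, recommend rolling the change back.
    if len(regressed) == len(baseline):
        return "ROLLBACK"
    return "REJECT"

print(gate({"accuracy": 0.92, "pass_rate": 0.88},
           {"accuracy": 0.91, "pass_rate": 0.87}))  # ACCEPT
```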

Trigger Conditions

  • Positive: request to gate changes for Prompt Forge, Skill Forge, or other skills; regression checks; baseline comparisons.
  • Negative: exploratory evaluation design (route to test-design skills) or ad-hoc scoring without baselines.

Guardrails

  • Freeze benchmark definitions and scoring; record a hash of each asset before execution (see the hashing sketch after this list).
  • Maintain structure-first docs; any change to the harness itself requires manual review and explicit versioning.
  • Require confidence ceilings for every verdict; cite evidence and observed metrics.
  • Tag memory for every run to preserve comparables.
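
A minimal sketch of the freeze rule, assuming benchmark suites live as JSON files under a local directory; the manifest layout and function names are assumptions for illustration.

```python
# Illustrative freeze check: record SHA-256 hashes of frozen benchmark
# assets, then refuse to run if any hash has drifted. File layout and
# manifest format are assumptions for this sketch.
import hashlib
import json
from pathlib import Path

def hash_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_manifest(asset_dir: Path, manifest: Path) -> None:
    hashes = {str(p): hash_file(p) for p in sorted(asset_dir.rglob("*.json"))}
    manifest.write_text(json.dumps(hashes, indent=2))

def verify_frozen(manifest: Path) -> None:
    recorded = json.loads(manifest.read_text())
    for name, expected in recorded.items():
        if hash_file(Path(name)) != expected:
            raise RuntimeError(f"Frozen asset modified: {name}")
```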

Execution Phases

  1. Scope & Baseline – Identify target skill/build, select benchmark suites, and capture hashes of frozen assets.
  2. Data Load – Pull prior runs for comparables; confirm hardware/latency constraints if applicable.
  3. Run Benchmarks – Execute frozen suites; collect metrics, logs, and failures without mutation.
  4. Regression Analysis – Compare against baselines; flag degradations and confidence impacts (see the comparison sketch after this list).
  5. Decision – Recommend ACCEPT/REJECT/ROLLBACK with rationale, risks, and ceilings.
  6. Archive – Store reports, hashes, and decisions in memory; surface follow-ups.
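
A minimal sketch of the phase-4 comparison, assuming metrics are higher-is-better scores keyed by name; all identifiers are illustrative.

```python
# Illustrative phase-4 comparison: map each regressed metric to its
# (baseline, candidate) pair so the report can cite evidence.
def find_regressions(candidate: dict[str, float],
                     baseline: dict[str, float],
                     tolerance: float = 0.01) -> dict[str, tuple[float, float]]:
    """Assumes higher-is-better scores; invert the test for costs like latency."""
    return {
        name: (base, candidate.get(name, float("-inf")))
        for name, base in baseline.items()
        if candidate.get(name, float("-inf")) < base - tolerance
    }

print(find_regressions({"accuracy": 0.86}, {"accuracy": 0.91}))
# {'accuracy': (0.91, 0.86)}
```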

Output Format

  • Target, suites executed, baselines/hashes, and environment notes.
  • Metrics and regressions with evidence.
  • Decision (ACCEPT/REJECT/ROLLBACK) + risks and follow-ups.
  • Confidence: X.XX (ceiling: TYPE Y.YY) and the memory namespace used (one possible report shape is sketched below).
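
One possible shape for that report, as a Python dataclass; the field names simply mirror the bullets above and are not a mandated schema.

```python
# Illustrative report structure echoing the Output Format bullets.
from dataclasses import dataclass, field

@dataclass
class HarnessReport:
    target: str                                   # skill/build under evaluation
    suites: list[str]                             # frozen suites executed
    baseline_hashes: dict[str, str]               # asset path -> recorded hash
    environment: str                              # hardware/latency notes
    metrics: dict[str, float]                     # observed results
    regressions: dict[str, tuple[float, float]]   # metric -> (baseline, candidate)
    decision: str                                 # ACCEPT | REJECT | ROLLBACK
    risks: list[str] = field(default_factory=list)
    follow_ups: list[str] = field(default_factory=list)
    confidence: float = 0.0                       # e.g. 0.70
    ceiling: str = ""                             # e.g. "inference 0.70"
    memory_namespace: str = ""                    # where the run was archived
```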

Validation Checklist

  • Benchmarks frozen; hashes recorded and unchanged.
  • Comparables loaded; regressions checked.
  • Decision cites evidence and confidence ceiling.
  • Memory tagged and artifacts archived (an automated version of these checks is sketched below).
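
A hedged sketch of how the checklist could be enforced mechanically before archiving; the report keys are assumptions echoing the Output Format section.

```python
# Illustrative pre-archive validation mirroring the checklist above.
def validate_run(report: dict, hashes_unchanged: bool) -> list[str]:
    """Return checklist failures; an empty list means the run may be archived."""
    failures = []
    if not hashes_unchanged:
        failures.append("frozen benchmark hashes drifted")
    if not report.get("baseline_hashes"):
        failures.append("no baseline hashes recorded")
    if "regressions" not in report:
        failures.append("regression comparison missing")
    if report.get("decision") not in {"ACCEPT", "REJECT", "ROLLBACK"}:
        failures.append("decision missing or malformed")
    if not report.get("ceiling"):
        failures.append("confidence ceiling not cited")
    if not report.get("memory_namespace"):
        failures.append("memory not tagged")
    return failures
```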

Integration

  • Memory MCP: skills/tooling/eval-harness/{project}/{timestamp} for reports and hashes (see the namespace sketch below).
  • Hooks: follow Skill Forge latency bounds; refuse execution if harness assets are modified.
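
A minimal sketch of building the namespace and archiving a run; store_record is a hypothetical stand-in, since the actual Memory MCP write API is not specified here.

```python
# Illustrative archive step. The namespace pattern comes from the bullet
# above; store_record is a hypothetical placeholder for the Memory MCP call.
import json
from datetime import datetime, timezone

def store_record(namespace: str, payload: str) -> None:
    """Hypothetical stand-in; replace with the real Memory MCP client call."""
    print(f"[memory] {namespace}: {len(payload)} bytes")

def memory_namespace(project: str) -> str:
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"skills/tooling/eval-harness/{project}/{ts}"

def archive_report(project: str, report: dict) -> str:
    ns = memory_namespace(project)
    store_record(ns, json.dumps(report))
    return ns
```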

Confidence: 0.70 (ceiling: inference 0.70) – Harness SOP aligned to Prompt Architect clarity and Skill Forge safeguards.