
skill-evaluator

@tnez/dot-agents

Evaluate agent skill quality using rubric-based assessment. Use when assessing skill effectiveness, identifying improvement opportunities, or ensuring quality standards.

Install Skill

  1. Download the skill ZIP file
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by going through its instructions before using it.

SKILL.md

name: skill-evaluator
description: Evaluate agent skill quality using rubric-based assessment. Use when assessing skill effectiveness, identifying improvement opportunities, or ensuring quality standards.
license: MIT
allowed-tools: Read, Bash

Skill Evaluator

Assess agent skill quality through comprehensive rubric-based evaluation.

When to Use This Skill

Use skill-evaluator when you need to:

  • Evaluate a new skill's quality before deployment
  • Assess an existing skill's effectiveness
  • Identify specific improvement opportunities
  • Measure skill quality objectively
  • Compare skills against quality standards
  • Generate improvement recommendations

Evaluation Process

Phase 1: Preparation

Gather Context

  1. Identify the skill to evaluate
  2. Read the complete SKILL.md
  3. Review any bundled resources (scripts, templates)
  4. Understand the skill's intended purpose

Load Evaluation Rubric

cat templates/evaluation-rubric.md

Phase 2: Rubric-Based Evaluation

Evaluate the skill across four dimensions:

1. Clarity (1-5)

What to Assess:

  • Are instructions imperative and actionable?
  • Is the description clear about WHAT the skill does?
  • Is the description clear about WHEN to use it?
  • Are technical terms explained?
  • Is the language precise and unambiguous?

Scoring Guide:

  • 5: Exceptionally clear, every instruction actionable, perfect balance of detail
  • 4: Very clear, minor improvements possible
  • 3: Generally clear, some vague sections
  • 2: Unclear in multiple places, hard to follow
  • 1: Confusing, missing critical information

Questions to Ask:

  • Can someone unfamiliar with the domain understand this?
  • Are all steps clearly defined?
  • Is there any ambiguous language?

2. Completeness (1-5)

What to Assess:

  • Are all necessary steps documented?
  • Are dependencies clearly stated?
  • Are edge cases addressed?
  • Is error handling covered?
  • Are prerequisites listed?

Scoring Guide:

  • 5: Comprehensive, covers all scenarios, no gaps
  • 4: Very complete, minor edge cases missing
  • 3: Generally complete, some steps assumed
  • 2: Significant gaps, missing key information
  • 1: Incomplete, critical steps missing

Questions to Ask:

  • Could someone execute this skill without asking questions?
  • Are dependencies explicit?
  • What happens when things go wrong?

3. Examples (1-5)

What to Assess:

  • Are concrete examples provided?
  • Do examples show expected outcomes?
  • Are multiple scenarios covered?
  • Do examples help clarify instructions?
  • Are examples realistic and practical?

Scoring Guide:

  • 5: Excellent examples, multiple scenarios, clear outcomes
  • 4: Good examples, covers main use cases
  • 3: Basic examples present, could be more detailed
  • 2: Minimal examples, not very helpful
  • 1: No examples or examples are confusing

Questions to Ask:

  • Do examples demonstrate the skill in action?
  • Are edge cases illustrated?
  • Can users adapt examples to their needs?

4. Focus (1-5)

What to Assess:

  • Does the skill address a specific, well-defined task?
  • Is it too broad or monolithic?
  • Does it try to do too much?
  • Is the scope appropriate?
  • Are responsibilities clear?

Scoring Guide:

  • 5: Perfectly focused, single clear purpose
  • 4: Well-focused, minor scope creep
  • 3: Generally focused, some unnecessary breadth
  • 2: Too broad, trying to do too much
  • 1: Unfocused, unclear purpose

Questions to Ask:

  • Can you describe the skill's purpose in one sentence?
  • Should this be split into multiple skills?
  • Is anything out of scope included?

Phase 3: Scoring

Calculate Scores

For each dimension:

  1. Review the assessment criteria
  2. Assign a score (1-5)
  3. Document specific observations
  4. Note evidence supporting the score

Total Score

Sum all dimensions (max: 20)

Quality Thresholds

  • 18-20: Excellent - Production ready
  • 15-17: Strong - Minor improvements recommended
  • 12-14: Adequate - Needs refinement
  • 9-11: Weak - Significant improvements required
  • 1-8: Poor - Major rework needed
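
For illustration only (this helper is not part of the skill's bundled scripts), the total and quality level can be computed directly from the four dimension scores; the thresholds in the sketch mirror the list above:

```python
# Illustrative sketch: sum the four dimension scores and map the total to a quality level.
DIMENSIONS = ("clarity", "completeness", "examples", "focus")

def quality_level(total: int) -> str:
    """Map a total score (4-20) to the quality thresholds defined above."""
    if total >= 18:
        return "Excellent - Production ready"
    if total >= 15:
        return "Strong - Minor improvements recommended"
    if total >= 12:
        return "Adequate - Needs refinement"
    if total >= 9:
        return "Weak - Significant improvements required"
    return "Poor - Major rework needed"

scores = {"clarity": 4, "completeness": 5, "examples": 3, "focus": 5}
total = sum(scores[d] for d in DIMENSIONS)          # 17
print(f"{total}/20 - {quality_level(total)}")       # 17/20 - Strong - Minor improvements recommended
```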

Phase 4: Recommendations

Generate Actionable Feedback

For each dimension with score <5:

  1. Identify specific weaknesses
  2. Provide concrete improvement suggestions
  3. Prioritize recommendations (high/medium/low)

Example Recommendations:

Clarity (3/5):

  • High: Rewrite Phase 2 instructions in imperative form
  • Medium: Define technical term "API endpoint" in context
  • Low: Add section headings for better scanning

Completeness (4/5):

  • Medium: Add error handling section
  • Low: List Python version requirement

Examples (3/5):

  • High: Add 2 more concrete examples showing different scenarios
  • Medium: Include expected output for Example 1
  • Medium: Add edge case example (empty input)

Focus (5/5):

  • No recommendations - well-focused

Phase 5: Generate Report

Create Evaluation Report

Use the rubric template:

cat templates/evaluation-rubric.md

Fill in:

  • Skill name and date
  • Scores for each dimension
  • Total score and quality level
  • Detailed observations
  • Prioritized recommendations
  • Overall assessment

Report Structure

# Skill Evaluation Report

**Skill**: skill-name
**Date**: YYYY-MM-DD
**Evaluator**: [Agent/Human name]

## Scores

| Dimension    | Score     | Notes                          |
| ------------ | --------- | ------------------------------ |
| Clarity      | 4/5       | Very clear, minor improvements |
| Completeness | 5/5       | Comprehensive coverage         |
| Examples     | 3/5       | Needs more concrete examples   |
| Focus        | 5/5       | Well-scoped and focused        |
| **Total**    | **17/20** | **Strong**                     |

## Detailed Assessment

[Per-dimension analysis]

## Recommendations

[Prioritized improvement suggestions]

## Overall Assessment

[Summary and final thoughts]
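
The scores table in this report can be produced mechanically once the per-dimension scores and notes are recorded. A minimal sketch, not part of the bundled template, using the same example scores shown above:

```python
# Illustrative sketch: build the Markdown scores table from per-dimension results.
rows = [
    ("Clarity", 4, "Very clear, minor improvements"),
    ("Completeness", 5, "Comprehensive coverage"),
    ("Examples", 3, "Needs more concrete examples"),
    ("Focus", 5, "Well-scoped and focused"),
]
total = sum(score for _, score, _ in rows)   # 17
label = "Strong"                             # per the Phase 3 thresholds (15-17)

table = ["| Dimension | Score | Notes |", "| --- | --- | --- |"]
table += [f"| {name} | {score}/5 | {note} |" for name, score, note in rows]
table.append(f"| **Total** | **{total}/20** | **{label}** |")
print("\n".join(table))
```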

Phase 6: Iterate

Support Improvement Cycle

  1. Share evaluation report with skill author
  2. Implement high-priority recommendations
  3. Re-evaluate after changes
  4. Compare scores to measure improvement
  5. Iterate until quality threshold met
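
Step 4 (comparing scores) is simple to make concrete. A minimal sketch, assuming both evaluations recorded the same four dimensions:

```python
# Illustrative sketch: measure improvement between two evaluations of the same skill.
before = {"clarity": 2, "completeness": 3, "examples": 2, "focus": 5}   # initial: 12/20
after  = {"clarity": 4, "completeness": 4, "examples": 4, "focus": 5}   # re-evaluation: 17/20

for dim in before:
    print(f"{dim:<12} {before[dim]} -> {after[dim]} ({after[dim] - before[dim]:+d})")

improvement = sum(after.values()) - sum(before.values())
print(f"total        {sum(before.values())}/20 -> {sum(after.values())}/20 ({improvement:+d})")
```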

Evaluation Rubric

Use the bundled rubric template:

cat templates/evaluation-rubric.md

The template provides:

  • Detailed scoring criteria for each dimension
  • Example assessments
  • Report structure
  • Question prompts

Scripts

run_evaluation.py

Purpose: Semi-automated evaluation assistance

Usage:

python scripts/run_evaluation.py /path/to/skill-directory

Features:

  • Extracts skill metadata
  • Calculates word counts
  • Identifies missing sections
  • Generates report template
  • Assists with objective metrics

Note: Final scoring requires human/agent judgment
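
As an illustration of the kind of objective checks described above (word counts, missing sections), a simplified stand-in might look like the following. The expected section names and file layout here are assumptions for the sketch, not the actual implementation of run_evaluation.py:

```python
# Simplified illustration only; see scripts/run_evaluation.py for the real behavior.
import sys
from pathlib import Path

EXPECTED_SECTIONS = ["When to Use", "Examples"]  # assumed headings, for illustration

def inspect_skill(skill_dir: str) -> None:
    """Report simple objective metrics for a skill directory."""
    text = (Path(skill_dir) / "SKILL.md").read_text(encoding="utf-8")

    word_count = len(text.split())
    missing = [s for s in EXPECTED_SECTIONS if s.lower() not in text.lower()]

    print(f"Words: {word_count}")
    print(f"Missing sections: {missing or 'none'}")

if __name__ == "__main__":
    inspect_skill(sys.argv[1])
```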

Examples

Example 1: High-Quality Skill Evaluation

Skill: brand-guidelines

Assessment:

Clarity: 5/5

  • Description perfectly explains what and when
  • Instructions are imperative and actionable
  • Technical terms clear from context
  • Language is precise

Completeness: 5/5

  • All necessary information present
  • Color codes, typography, spacing all specified
  • No external dependencies needed
  • Self-contained and complete

Examples: 4/5

  • Color examples provided with hex codes
  • Typography examples with sizes and weights
  • Could benefit from visual mockup examples

Focus: 5/5

  • Single clear purpose: brand guidelines
  • Not trying to do design work, just provide standards
  • Perfect scope

Total: 19/20 (Excellent)

Recommendations:

  • Low: Add 1-2 visual mockup examples showing guidelines applied

Example 2: Skill Needing Improvement

Skill: code-helper

Assessment:

Clarity: 2/5

  • Description vague: "Helps with code"
  • Instructions use passive voice
  • Many undefined terms
  • Ambiguous steps

Completeness: 3/5

  • Missing dependency information
  • No error handling guidance
  • Prerequisites not stated
  • Some steps documented

Examples: 2/5

  • Only one example provided
  • Example doesn't show expected output
  • No edge cases illustrated

Focus: 2/5

  • Tries to do too much: formatting, analysis, documentation
  • Should be 3 separate skills
  • Unclear primary purpose

Total: 9/20 (Weak)

Recommendations:

High Priority:

  • Split into 3 focused skills: code-formatter, code-analyzer, code-documenter
  • Rewrite description to explain specific purpose and when to use
  • Rewrite all instructions in imperative form

Medium Priority:

  • Add 3-4 concrete examples with inputs and outputs
  • Document all dependencies (Python version, packages)
  • Add error handling section

Low Priority:

  • Define all technical terms in context
  • Add edge case examples

Example 3: Re-Evaluation After Improvements

Initial Evaluation: 12/20 (Adequate)
After Improvements: 17/20 (Strong)

Changes Made:

  • Added 3 concrete examples (+2 points)
  • Clarified description to explain when to use (+1 point)
  • Rewrote vague instructions in imperative form (+2 points)

Impact: Moved from "Needs refinement" to "Minor improvements recommended"

Best Practices

Be Objective

  • Use the rubric consistently
  • Base scores on evidence, not feelings
  • Compare against criteria, not other skills

Be Specific

  • Document exact issues
  • Provide concrete examples of problems
  • Suggest specific improvements

Be Constructive

  • Focus on helping improve the skill
  • Acknowledge what works well
  • Prioritize feedback (don't overwhelm)

Be Thorough

  • Read the entire skill before scoring
  • Check all bundled resources
  • Consider real-world usage

Common Evaluation Pitfalls

Leniency Bias

❌ Scoring too high because skill "seems okay"
✓ Compare against explicit rubric criteria

Recency Effect

❌ Letting recent sections heavily influence score
✓ Evaluate each dimension independently

Halo Effect

❌ One strong area influences scoring of others
✓ Score each dimension separately

Comparison Bias

❌ Scoring relative to other skills
✓ Score against absolute rubric criteria

Quality Gates

Minimum Thresholds:

  • For deployment: ≥15/20 total, no dimension <3
  • For production: ≥18/20 total, all dimensions ≥4
  • Best practice: All dimensions ≥4

Red Flags (immediate revision needed):

  • Any dimension scored 1
  • Total score <12
  • Missing examples entirely
  • Vague description
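
These gates are straightforward to encode. A minimal sketch, reusing the four-dimension scores from earlier; the gate values come directly from the threshold and red-flag lists above:

```python
# Illustrative sketch: apply the minimum thresholds and red-flag checks above.
def check_gates(scores: dict[str, int]) -> dict[str, bool]:
    total = sum(scores.values())
    return {
        "deployment_ready": total >= 15 and min(scores.values()) >= 3,
        "production_ready": total >= 18 and min(scores.values()) >= 4,
        "red_flag": 1 in scores.values() or total < 12,
    }

print(check_gates({"clarity": 4, "completeness": 5, "examples": 3, "focus": 5}))
# {'deployment_ready': True, 'production_ready': False, 'red_flag': False}
```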

Evaluation Checklist

Before finalizing evaluation:

  • Read the complete SKILL.md
  • Reviewed all bundled resources
  • Scored all four dimensions
  • Calculated the total score
  • Documented specific observations
  • Generated prioritized recommendations
  • Identified which quality threshold was met
  • Created the evaluation report

Templates

Evaluation Rubric Template

cat templates/evaluation-rubric.md

Provides:

  • Scoring criteria details
  • Report structure
  • Example assessments

Resources