---
name: skill-evaluator
description: Evaluate agent skill quality using rubric-based assessment. Use when assessing skill effectiveness, identifying improvement opportunities, or ensuring quality standards.
license: MIT
allowed-tools: Read, Bash
---
# Skill Evaluator
Assess agent skill quality through comprehensive rubric-based evaluation.
## When to Use This Skill
Use skill-evaluator when you need to:
- Evaluate a new skill's quality before deployment
- Assess an existing skill's effectiveness
- Identify specific improvement opportunities
- Measure skill quality objectively
- Compare skills against quality standards
- Generate improvement recommendations
## Evaluation Process

### Phase 1: Preparation

**Gather Context**
- Identify the skill to evaluate
- Read the complete SKILL.md
- Review any bundled resources (scripts, templates)
- Understand the skill's intended purpose
**Load Evaluation Rubric**

```bash
cat templates/evaluation-rubric.md
```
### Phase 2: Rubric-Based Evaluation
Evaluate the skill across four dimensions:
#### 1. Clarity (1-5)

**What to Assess:**
- Are instructions imperative and actionable?
- Is the description clear about WHAT the skill does?
- Is the description clear about WHEN to use it?
- Are technical terms explained?
- Is the language precise and unambiguous?
**Scoring Guide:**
- 5: Exceptionally clear, every instruction actionable, perfect balance of detail
- 4: Very clear, minor improvements possible
- 3: Generally clear, some vague sections
- 2: Unclear in multiple places, hard to follow
- 1: Confusing, missing critical information
**Questions to Ask:**
- Can someone unfamiliar with the domain understand this?
- Are all steps clearly defined?
- Is there any ambiguous language?
#### 2. Completeness (1-5)

**What to Assess:**
- Are all necessary steps documented?
- Are dependencies clearly stated?
- Are edge cases addressed?
- Is error handling covered?
- Are prerequisites listed?
**Scoring Guide:**
- 5: Comprehensive, covers all scenarios, no gaps
- 4: Very complete, minor edge cases missing
- 3: Generally complete, some steps assumed
- 2: Significant gaps, missing key information
- 1: Incomplete, critical steps missing
**Questions to Ask:**
- Could someone execute this skill without asking questions?
- Are dependencies explicit?
- What happens when things go wrong?
#### 3. Examples (1-5)

**What to Assess:**
- Are concrete examples provided?
- Do examples show expected outcomes?
- Are multiple scenarios covered?
- Do examples help clarify instructions?
- Are examples realistic and practical?
**Scoring Guide:**
- 5: Excellent examples, multiple scenarios, clear outcomes
- 4: Good examples, covers main use cases
- 3: Basic examples present, could be more detailed
- 2: Minimal examples, not very helpful
- 1: No examples or examples are confusing
**Questions to Ask:**
- Do examples demonstrate the skill in action?
- Are edge cases illustrated?
- Can users adapt examples to their needs?
#### 4. Focus (1-5)

**What to Assess:**
- Does the skill address a specific, well-defined task?
- Is it too broad or monolithic?
- Does it try to do too much?
- Is the scope appropriate?
- Are responsibilities clear?
**Scoring Guide:**
- 5: Perfectly focused, single clear purpose
- 4: Well-focused, minor scope creep
- 3: Generally focused, some unnecessary breadth
- 2: Too broad, trying to do too much
- 1: Unfocused, unclear purpose
**Questions to Ask:**
- Can you describe the skill's purpose in one sentence?
- Should this be split into multiple skills?
- Is anything out of scope included?
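Before scoring, it can help to capture each dimension in a consistent structure. Below is a minimal Python sketch; the class and field names are illustrative, not part of this skill's bundled tooling:

```python
from dataclasses import dataclass

# Illustrative only -- not part of the bundled run_evaluation.py script.
@dataclass
class DimensionScore:
    name: str        # "Clarity", "Completeness", "Examples", or "Focus"
    score: int       # 1-5, per the scoring guides above
    evidence: str    # specific observations supporting the score

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError(f"{self.name}: score must be 1-5, got {self.score}")
```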
### Phase 3: Scoring

**Calculate Scores**
For each dimension:
- Review the assessment criteria
- Assign a score (1-5)
- Document specific observations
- Note evidence supporting the score
**Total Score**

Sum all four dimensions (max: 20).

**Quality Thresholds**
- 18-20: Excellent - Production ready
- 15-17: Strong - Minor improvements recommended
- 12-14: Adequate - Needs refinement
- 9-11: Weak - Significant improvements required
- 4-8: Poor - Major rework needed
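The threshold mapping is mechanical once the total exists. A minimal sketch (function name is an assumption, not part of the bundled script):

```python
def quality_level(total: int) -> str:
    """Map a total score (4-20) to the quality thresholds listed above."""
    if total >= 18:
        return "Excellent - Production ready"
    if total >= 15:
        return "Strong - Minor improvements recommended"
    if total >= 12:
        return "Adequate - Needs refinement"
    if total >= 9:
        return "Weak - Significant improvements required"
    return "Poor - Major rework needed"
```

For example, `quality_level(17)` returns "Strong - Minor improvements recommended", matching the report example in Phase 5.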
### Phase 4: Recommendations

**Generate Actionable Feedback**
For each dimension with score <5:
- Identify specific weaknesses
- Provide concrete improvement suggestions
- Prioritize recommendations (high/medium/low)
**Example Recommendations:**

**Clarity (3/5):**
- High: Rewrite Phase 2 instructions in imperative form
- Medium: Define the technical term "API endpoint" in context
- Low: Add section headings for better scanning

**Completeness (4/5):**
- Medium: Add an error handling section
- Low: List the Python version requirement

**Examples (3/5):**
- High: Add 2 more concrete examples showing different scenarios
- Medium: Include expected output for Example 1
- Medium: Add an edge case example (empty input)

**Focus (5/5):**
- No recommendations - well-focused
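If you capture recommendations programmatically, a small structure keeps prioritization consistent. A minimal sketch, assuming nothing about the bundled tooling:

```python
from dataclasses import dataclass

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

# Illustrative shape for one recommendation; field names are assumptions.
@dataclass
class Recommendation:
    dimension: str   # e.g. "Clarity"
    priority: str    # "high", "medium", or "low"
    action: str      # one concrete, specific improvement

def prioritize(recs: list[Recommendation]) -> list[Recommendation]:
    """Order feedback so high-priority items are addressed first."""
    return sorted(recs, key=lambda r: PRIORITY_ORDER[r.priority])
```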
### Phase 5: Generate Report

**Create Evaluation Report**

Use the rubric template:

```bash
cat templates/evaluation-rubric.md
```
Fill in:
- Skill name and date
- Scores for each dimension
- Total score and quality level
- Detailed observations
- Prioritized recommendations
- Overall assessment
**Report Structure**

```markdown
# Skill Evaluation Report

**Skill**: skill-name
**Date**: YYYY-MM-DD
**Evaluator**: [Agent/Human name]

## Scores

| Dimension    | Score     | Notes                          |
| ------------ | --------- | ------------------------------ |
| Clarity      | 4/5       | Very clear, minor improvements |
| Completeness | 5/5       | Comprehensive coverage         |
| Examples     | 3/5       | Needs more concrete examples   |
| Focus        | 5/5       | Well-scoped and focused        |
| **Total**    | **17/20** | **Strong**                     |

## Detailed Assessment

[Per-dimension analysis]

## Recommendations

[Prioritized improvement suggestions]

## Overall Assessment

[Summary and final thoughts]
```
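If you generate the report programmatically, the skeleton above maps directly to a formatting helper. A minimal sketch (names are illustrative; the quality level string can come from the `quality_level` helper sketched in Phase 3):

```python
def render_report(skill: str, date: str, evaluator: str,
                  scores: dict[str, tuple[int, str]], level: str) -> str:
    """Fill the report skeleton above from per-dimension (score, notes) pairs."""
    total = sum(score for score, _ in scores.values())
    rows = "\n".join(f"| {dim} | {score}/5 | {notes} |"
                     for dim, (score, notes) in scores.items())
    return (
        f"# Skill Evaluation Report\n\n"
        f"**Skill**: {skill}\n**Date**: {date}\n**Evaluator**: {evaluator}\n\n"
        f"## Scores\n\n"
        f"| Dimension | Score | Notes |\n| --- | --- | --- |\n"
        f"{rows}\n"
        f"| **Total** | **{total}/20** | **{level}** |\n"
    )
```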
### Phase 6: Iterate

**Support Improvement Cycle**
- Share evaluation report with skill author
- Implement high-priority recommendations
- Re-evaluate after changes
- Compare scores to measure improvement
- Iterate until quality threshold met
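Measuring improvement between evaluations is simple arithmetic. A minimal sketch, assuming scores are kept as per-dimension dicts:

```python
def score_delta(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Per-dimension change between two evaluations of the same skill."""
    return {dim: after[dim] - before[dim] for dim in before}

# e.g. comparing the before/after scores in Example 3 below yields a +5 total.
```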
## Evaluation Rubric

Use the bundled rubric template:

```bash
cat templates/evaluation-rubric.md
```
The template provides:
- Detailed scoring criteria for each dimension
- Example assessments
- Report structure
- Question prompts
## Scripts

### `run_evaluation.py`

**Purpose**: Semi-automated evaluation assistance

**Usage:**

```bash
python scripts/run_evaluation.py /path/to/skill-directory
```

**Features:**
- Extracts skill metadata
- Calculates word counts
- Identifies missing sections
- Generates report template
- Assists with objective metrics
**Note**: Final scoring requires human/agent judgment.
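The script's internals are not reproduced here, but the objective checks it performs could look roughly like this sketch (the section names and return shape are assumptions, not the script's actual API):

```python
from pathlib import Path

EXPECTED_SECTIONS = ("When to Use", "Examples")  # assumed section names

def objective_metrics(skill_dir: str) -> dict:
    """Collect mechanical signals; rubric scoring stays with the evaluator."""
    text = Path(skill_dir, "SKILL.md").read_text(encoding="utf-8")
    return {
        "word_count": len(text.split()),
        "line_count": text.count("\n") + 1,
        "missing_sections": [s for s in EXPECTED_SECTIONS if s not in text],
    }
```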
## Examples

### Example 1: High-Quality Skill Evaluation

**Skill**: brand-guidelines

**Assessment:**

**Clarity: 5/5**
- Description perfectly explains what and when
- Instructions are imperative and actionable
- Technical terms clear from context
- Language is precise
**Completeness: 5/5**
- All necessary information present
- Color codes, typography, spacing all specified
- No external dependencies needed
- Self-contained and complete
**Examples: 4/5**
- Color examples provided with hex codes
- Typography examples with sizes and weights
- Could benefit from visual mockup examples
**Focus: 5/5**
- Single clear purpose: brand guidelines
- Not trying to do design work, just provide standards
- Perfect scope
**Total: 19/20 (Excellent)**

**Recommendations:**
- Low: Add 1-2 visual mockup examples showing guidelines applied
### Example 2: Skill Needing Improvement

**Skill**: code-helper

**Assessment:**

**Clarity: 2/5**
- Description vague: "Helps with code"
- Instructions use passive voice
- Many undefined terms
- Ambiguous steps
**Completeness: 3/5**
- Missing dependency information
- No error handling guidance
- Prerequisites not stated
- Some steps documented
**Examples: 2/5**
- Only one example provided
- Example doesn't show expected output
- No edge cases illustrated
**Focus: 2/5**
- Tries to do too much: formatting, analysis, documentation
- Should be 3 separate skills
- Unclear primary purpose
**Total: 9/20 (Weak)**

**Recommendations:**

**High Priority:**
- Split into 3 focused skills: code-formatter, code-analyzer, code-documenter
- Rewrite description to explain specific purpose and when to use
- Rewrite all instructions in imperative form
**Medium Priority:**
- Add 3-4 concrete examples with inputs and outputs
- Document all dependencies (Python version, packages)
- Add error handling section
**Low Priority:**
- Define all technical terms in context
- Add edge case examples
### Example 3: Re-Evaluation After Improvements

**Initial Evaluation**: 12/20 (Adequate)
**After Improvements**: 17/20 (Strong)

**Changes Made:**
- Added 3 concrete examples (+2 points)
- Clarified description to explain when to use (+1 point)
- Rewrote vague instructions in imperative form (+2 points)
**Impact**: Moved from "Needs refinement" to "Minor improvements recommended".
## Best Practices

### Be Objective
- Use the rubric consistently
- Base scores on evidence, not feelings
- Compare against criteria, not other skills
### Be Specific
- Document exact issues
- Provide concrete examples of problems
- Suggest specific improvements
### Be Constructive
- Focus on helping improve the skill
- Acknowledge what works well
- Prioritize feedback (don't overwhelm)
### Be Thorough
- Read the entire skill before scoring
- Check all bundled resources
- Consider real-world usage
## Common Evaluation Pitfalls

### Leniency Bias

❌ Scoring too high because the skill "seems okay"
✓ Compare against explicit rubric criteria

### Recency Effect

❌ Letting recent sections heavily influence the score
✓ Evaluate each dimension independently

### Halo Effect

❌ One strong area influencing the scoring of others
✓ Score each dimension separately

### Comparison Bias

❌ Scoring relative to other skills
✓ Score against absolute rubric criteria
## Quality Gates

**Minimum Thresholds:**
- For deployment: ≥15/20 total, no dimension <3
- For production: ≥18/20 total, all dimensions ≥4
- Best practice: All dimensions ≥4
**Red Flags** (immediate revision needed):
- Any dimension scored 1
- Total score <12
- Missing examples entirely
- Vague description
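These gates are mechanical once the four dimension scores exist. A minimal sketch of the threshold check (the function and gate names are assumptions):

```python
def passes_gate(scores: dict[str, int], gate: str = "deployment") -> bool:
    """Apply the minimum thresholds above to four dimension scores (1-5 each)."""
    total, lowest = sum(scores.values()), min(scores.values())
    if gate == "deployment":
        return total >= 15 and lowest >= 3   # >=15/20, no dimension below 3
    if gate == "production":
        return total >= 18 and lowest >= 4   # >=18/20, all dimensions at least 4
    raise ValueError(f"unknown gate: {gate}")
```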
## Evaluation Checklist
Before finalizing the evaluation:
- Read the complete SKILL.md
- Reviewed all bundled resources
- Scored all four dimensions
- Calculated the total score
- Documented specific observations
- Generated prioritized recommendations
- Identified which quality threshold was met
- Created the evaluation report
## Templates

### Evaluation Rubric Template

```bash
cat templates/evaluation-rubric.md
```
Provides:
- Scoring criteria details
- Report structure
- Example assessments
## Resources

- Evaluation script: `scripts/run_evaluation.py`
- Rubric template: `templates/evaluation-rubric.md`
- Agent Skills Specification: https://github.com/anthropics/skills/blob/main/agent_skills_spec.md