---
name: skill-description-evaluator
description: Automates multi-model skill description analysis using sequential-thinking. Evaluates invocation likelihood (Sonnet 4.5, Haiku), authority level against competing prompts, semantic clarity, and user request matching. Generates 0-100 ratings with actionable recommendations. Use when evaluating skill descriptions or saying 'evaluate skill', 'test skill description', 'analyze skill effectiveness'.
allowed-tools: AskUserQuestion, Read, Write, mcp__sequential-thinking__sequentialthinking
---
# Skill: Evaluating Skill Description Effectiveness
## When to Use This Skill
Use this skill when evaluating whether a skill description is likely to be invoked by Claude models (Sonnet 4.5 or Haiku) when competing against system prompt instructions, or when assessing skill description quality.
Use other skills for: creating new skills (manual process), modifying existing skills (Edit tool directly).
## Workflow Description
Evaluates skill descriptions through structured analysis using sequential-thinking. Assesses invocation likelihood for Sonnet 4.5 and Haiku, semantic clarity, authority level, and user request pattern matching. Outputs ratings with actionable improvement recommendations.
Inputs extracted from the user request: the skill description (inline text or file path), competing system instructions (optional), and the target model (default: both Sonnet 4.5 and Haiku).
## Phase 1: Gather Input
Objective: Collect skill description and optional competing instructions.
Steps:
1. Check the user request for skill description input:
   - Inline text: Description provided directly in the request
   - File path: Path to a SKILL.md or text file containing the description
   - Prompt: Ask the user if neither is provided
2. IF file path provided:
   - Use the Read tool to read the file at the specified path
   - Extract the description from YAML frontmatter if the file is in SKILL.md format (see the sketch after these steps)
   - Otherwise use the full file content as the description
3. IF inline text provided:
   - Use the provided text as the description
   - Store it for Phase 3
4. IF neither provided:
   - Use AskUserQuestion to request the description:
     - Question: "Please provide the skill description to evaluate. You can paste the description text or provide a file path."
     - Header: "Description"
     - Options:
       - Paste text: "I'll paste the description directly"
       - Provide path: "I'll provide a file path to read"
   - WAIT for the user response via the "Other" option
   - Parse the response and re-enter Step 2 or 3
5. Check for competing system instructions:
   - Look in the user request for phrases like: "compare with", "competing against", "vs system prompt"
   - IF found: Extract the text, or ask the user to provide it
   - IF not found: Set competing_instructions = null
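
To make Step 2 concrete, here is a minimal Python sketch of the frontmatter-extraction logic. It assumes a `---`-delimited frontmatter block with a single-line `description:` field; multi-line YAML scalars would need a real YAML parser, so treat this as illustrative only:

```python
import re

def extract_description(file_text: str) -> str:
    """Pull the description field out of SKILL.md YAML frontmatter;
    fall back to the whole file if no frontmatter is present."""
    match = re.match(r"^---\s*\n(.*?)\n---", file_text, re.DOTALL)
    if match:
        for line in match.group(1).splitlines():
            if line.startswith("description:"):
                return line.split(":", 1)[1].strip()
    # Not SKILL.md format: use the full file content as the description
    return file_text.strip()
```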
### Validation Gate: Description Retrieved

IF description is empty or invalid:
- STOP immediately
- EXPLAIN: "Cannot evaluate empty or invalid skill description"
- EXIT workflow

IF description successfully retrieved:
- Store: description, competing_instructions (optional)
- Continue to Phase 2
Phase 1 complete. Continue to Phase 2.
## Phase 2: Analyze Competing Instructions (Conditional)
Objective: Understand competing system prompt patterns if provided.
Skip: IF competing_instructions = null (no comparison requested)
Steps:
1. Parse competing instructions for:
   - Directive language: "MUST", "ALWAYS", "NEVER", "DO NOT"
   - Tool recommendations: "use Bash", "prefer grep", "invoke"
   - Workflow patterns: Step-by-step instructions, examples
   - Authority level: Imperative vs. suggestive tone
2. Categorize competing instruction strength (a sketch follows these steps):
   - High authority: Multiple imperatives, absolute language, explicit tool directives
   - Medium authority: Mix of suggestions and requirements, conditional directives
   - Low authority: Mostly suggestive, prefer/consider language, optional guidance
3. Store the analysis for Phase 3 comparison
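
As a rough illustration of the categorization in Step 2, a Python sketch follows. The hit-count thresholds are assumptions for illustration, not values fixed by this skill:

```python
IMPERATIVES = ("MUST", "ALWAYS", "NEVER", "DO NOT")
TOOL_DIRECTIVES = ("use bash", "prefer grep", "invoke")

def categorize_authority(instructions: str) -> str:
    """Bucket competing instructions as high/medium/low authority
    based on directive-language density (illustrative thresholds)."""
    imperatives = sum(instructions.count(word) for word in IMPERATIVES)
    directives = sum(p in instructions.lower() for p in TOOL_DIRECTIVES)
    if imperatives >= 3 or (imperatives >= 1 and directives >= 1):
        return "high"
    if imperatives >= 1 or directives >= 1:
        return "medium"
    return "low"
```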
Phase 2 complete. Continue to Phase 3.
## Phase 3: Structured Evaluation
Objective: Conduct comprehensive multi-dimensional analysis of skill description.
THINKING CHECKPOINT: Use mcp__sequential-thinking__sequentialthinking to work through each of the evaluation dimensions below, scoring and justifying each one.
### Evaluation Dimensions
Analyze the skill description across these dimensions, scoring each 0-100:
#### 1. User Request Pattern Matching (0-100)
- How well does the description match common user request patterns?
- Does it include example phrases users might say?
- Are the verbs and actions natural language triggers?
- Does it cover variations of the same intent?
Consider:
- Presence of example user phrases ("saying 'X', 'Y', 'Z'")
- Verb diversity (multiple ways to express same intent)
- Natural language patterns vs. technical jargon
- Breadth of triggering scenarios
Score interpretation:
- 90-100: Excellent coverage with multiple natural variants
- 70-89: Good coverage with some variants
- 50-69: Moderate coverage, missing common variants
- 30-49: Limited coverage, too narrow or technical
- 0-29: Poor coverage, unlikely to match user requests
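
The "Consider" signals above can be approximated mechanically. Below is a minimal Python heuristic; the quoted-phrase pattern and the verb list are illustrative assumptions rather than part of the rubric:

```python
import re

def dimension1_signals(description: str) -> dict:
    """Surface-level signals for user-request pattern matching
    (heuristic aid only; final scoring remains a judgment call)."""
    example_phrases = re.findall(r"'([^']+)'", description)  # "saying 'X', 'Y'"
    intent_verbs = [v for v in ("evaluate", "test", "analyze", "check", "review")
                    if v in description.lower()]  # illustrative verb set
    return {
        "example_phrases": example_phrases,
        "verb_variants": intent_verbs,
        "has_use_when": "use when" in description.lower(),
    }
```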
#### 2. Authority Level vs. System Prompt (0-100)
- How authoritative is the description compared to competing instructions?
- Which authority level does it use: Replacive (80-85%), Integrative (75-80%), or Collaborative (60-70%)?
- Will it overcome competing system instructions?
- Does it integrate system terminology to appear native?
Consider:
- IF competing_instructions provided: Compare authority levels directly
- IF competing_instructions = null: Assess absolute authority level
- Language patterns: Replacive ("Primary [X] replacing..."), Integrative ("Implements [protocol]..."), Collaborative ("Automates...")
- System integration: References to protocols, safety features
- Trust signals: Automation benefits, workflow clarity, positioning indicators ("Primary", "Standard")
Score interpretation:
- 90-100: Strong authority, likely to override competing instructions
- 70-89: Good authority, competitive with most instructions
- 50-69: Moderate authority, may lose to strong imperatives
- 30-49: Weak authority, likely overshadowed by system prompts
- 0-29: Very weak, will almost never compete successfully
#### 3. Sonnet 4.5 Invocation Likelihood (0-100)
- How likely is Sonnet 4.5 to invoke this skill vs. bash workflows?
- Does it emphasize automation and safety benefits?
- Is the value proposition clear for complex workflows?
- Does it integrate advanced features (sequential-thinking, MCP tools)?
Consider:
- Sonnet 4.5 characteristics: Strong reasoning, prefers structured workflows, values safety
- Competing with bash: Does description highlight advantages over manual commands?
- Complexity threshold: Is the task complex enough to warrant skill invocation?
- Safety emphasis: Does it reference protocols, validation gates, error handling?
Score interpretation:
- 90-100: Highly likely to invoke, clear complexity and safety benefits
- 70-89: Likely to invoke for appropriate tasks
- 50-69: May invoke, depends on task complexity
- 30-49: Unlikely unless explicitly guided
- 0-29: Very unlikely, would prefer bash/direct tools
#### 4. Haiku Invocation Likelihood (0-100)
- How likely is Haiku to invoke this skill vs. bash workflows?
- Is the description simple and direct enough for Haiku's faster reasoning?
- Does it have clear, unambiguous triggering conditions?
- Is the value proposition immediately obvious?
Consider:
- Haiku characteristics: Faster reasoning, prefers clear directives, less exploratory
- Simplicity: Is the trigger condition obvious and unambiguous?
- Length: Shorter descriptions may work better for Haiku
- Competing with bash: Haiku more likely to default to bash without strong cues
Score interpretation:
- 90-100: Highly likely to invoke, crystal clear triggers
- 70-89: Likely to invoke with clear user intent
- 50-69: May invoke, could go either way
- 30-49: Unlikely, probably defaults to bash
- 0-29: Very unlikely, too complex or ambiguous
#### 5. Semantic Clarity and Discoverability (0-100)
- Is the description self-contained and understandable?
- Does it clearly articulate when to use vs. not use the skill?
- Is the scope well-defined without ambiguity?
- Can Claude understand the skill purpose from description alone?
Consider:
- Self-documentation: Does description explain full purpose?
- Boundary clarity: When to use vs. when to use other skills?
- Technical specificity: Balance between detailed and concise
- Scannable structure: Can be quickly understood at a glance
Score interpretation:
- 90-100: Excellent clarity, completely self-documenting
- 70-89: Good clarity, minor ambiguities
- 50-69: Moderate clarity, some confusion possible
- 30-49: Poor clarity, significant ambiguity
- 0-29: Very poor, confusing or misleading
### Analysis Process
For each dimension:
- State the evaluation criterion
- Analyze the skill description against the criterion
- Compare with competing_instructions if provided (for dimension 2)
- Consider model-specific characteristics (for dimensions 3-4)
- Assign score (0-100) with justification
- Note specific strengths and weaknesses
### Output Structure
Generate structured analysis in JSON format:
```json
{
"description_analyzed": "<the skill description>",
"competing_instructions": "<competing instructions or null>",
"evaluations": {
"user_request_matching": {
"score": <0-100>,
"justification": "<detailed reasoning>",
"strengths": ["<strength 1>", "<strength 2>"],
"weaknesses": ["<weakness 1>", "<weakness 2>"]
},
"authority_level": {
"score": <0-100>,
"justification": "<detailed reasoning>",
"comparison": "<comparison with competing instructions if provided>",
"strengths": ["<strength 1>", "<strength 2>"],
"weaknesses": ["<weakness 1>", "<weakness 2>"]
},
"sonnet_invocation": {
"score": <0-100>,
"justification": "<detailed reasoning for Sonnet 4.5>",
"strengths": ["<strength 1>", "<strength 2>"],
"weaknesses": ["<weakness 1>", "<weakness 2>"]
},
"haiku_invocation": {
"score": <0-100>,
"justification": "<detailed reasoning for Haiku>",
"strengths": ["<strength 1>", "<strength 2>"],
"weaknesses": ["<weakness 1>", "<weakness 2>"]
},
"semantic_clarity": {
"score": <0-100>,
"justification": "<detailed reasoning>",
"strengths": ["<strength 1>", "<strength 2>"],
"weaknesses": ["<weakness 1>", "<weakness 2>"]
}
},
"overall_assessment": {
"average_score": <calculated average>,
"grade": "<A/B/C/D/F based on average>",
"summary": "<2-3 sentence overall assessment>"
}
}
```
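
The skill does not pin down the average-to-grade mapping used in `overall_assessment`; a minimal sketch, assuming conventional 90/80/70/60 letter cutoffs:

```python
def overall_assessment(scores: dict[str, int]) -> dict:
    """Average the five dimension scores and map to a letter grade.
    Cutoffs (90/80/70/60) are assumed, not fixed by this skill."""
    average = sum(scores.values()) / len(scores)
    grade = ("A" if average >= 90 else
             "B" if average >= 80 else
             "C" if average >= 70 else
             "D" if average >= 60 else "F")
    return {"average_score": round(average, 1), "grade": grade}

# Example:
# overall_assessment({"user_request_matching": 85, "authority_level": 70,
#                     "sonnet_invocation": 80, "haiku_invocation": 60,
#                     "semantic_clarity": 75})
# -> {"average_score": 74.0, "grade": "C"}
```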
Phase 3 complete. Continue to Phase 4.
## Phase 4: Generate Improvement Recommendations
Objective: Identify actionable improvements for the skill description.
THINKING CHECKPOINT: Use mcp__sequential-thinking__sequentialthinking to work through the analysis process below and prioritize improvements by impact.
### Analysis Process
- Review all weaknesses from Phase 3 evaluations
- Prioritize by impact: Which improvements would yield highest score gains?
- For each priority weakness, generate specific, actionable recommendation
- Draft improved description incorporating top recommendations
- Verify improved version addresses key weaknesses
### Recommendation Categories
High Priority (score impact: +10 to +20 points):
- Missing user request patterns → Add example phrases
- Weak authority language → Strengthen using authority spectrum (Replacive/Integrative/Collaborative)
- Unclear scope boundaries → Add "Use when" vs. "Use other skills for" clarity
- Missing value proposition → Add automation/safety benefits
Medium Priority (score impact: +5 to +10 points):
- Technical jargon → Simplify for better matching
- Ambiguous triggers → Make conditions more explicit
- Length issues → Optimize for target word count (45-52)
- Missing system integration → Add protocol/safety references
Low Priority (score impact: +1 to +5 points):
- Minor wording improvements
- Additional example phrases
- Formatting consistency
- Tone adjustments
### Output Structure
Generate recommendations in structured format:
```json
{
"recommendations": [
{
"priority": "<high|medium|low>",
"dimension": "<which evaluation dimension>",
"issue": "<specific weakness identified>",
"recommendation": "<actionable improvement>",
"example": "<concrete example or before/after>",
"estimated_impact": "<score point improvement>"
}
],
"improved_description": "<rewritten description incorporating high priority recommendations>",
"improvement_summary": "<explanation of changes made>"
}
```
Phase 4 complete. Continue to Phase 5.
## Phase 5: Present Results
Objective: Deliver comprehensive evaluation report with standardized formatting.
Steps:
- Combine outputs from Phases 3 and 4
- Format using standardized reporting template:
```
✓ Skill Description Evaluation Completed
OVERALL GRADE: <grade> (Average Score: <average>/100)
SCORES BY DIMENSION:
- User Request Matching: <score>/100
- Authority Level: <score>/100
- Sonnet 4.5 Invocation: <score>/100
- Haiku Invocation: <score>/100
- Semantic Clarity: <score>/100
KEY STRENGTHS:
<bullet list of top 3 strengths across all dimensions>
KEY WEAKNESSES:
<bullet list of top 3 weaknesses across all dimensions>
TOP RECOMMENDATIONS:
<numbered list of high priority recommendations with estimated impact>
IMPROVED DESCRIPTION:
---
<improved description text>
---
DETAILED ANALYSIS:
<full JSON output from Phase 3 for reference>
```
- Present the report to the user
- Optionally ask: "Would you like me to save this evaluation report to a file?"
- IF yes: Use Write tool to save as markdown file
- IF no: Complete workflow
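
As one way to assemble the SCORES BY DIMENSION block from the Phase 3 JSON, a minimal Python sketch (key names follow the JSON template above):

```python
def render_score_block(evaluations: dict) -> str:
    """Format the SCORES BY DIMENSION section of the report template."""
    labels = {
        "user_request_matching": "User Request Matching",
        "authority_level": "Authority Level",
        "sonnet_invocation": "Sonnet 4.5 Invocation",
        "haiku_invocation": "Haiku Invocation",
        "semantic_clarity": "Semantic Clarity",
    }
    lines = ["SCORES BY DIMENSION:"]
    for key, label in labels.items():
        lines.append(f"- {label}: {evaluations[key]['score']}/100")
    return "\n".join(lines)
```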
Phase 5 complete. Workflow complete.
## Implementation Notes
### Model-Specific Considerations
Sonnet 4.5:
- Strong reasoning capabilities favor complex, multi-phase workflows
- Values safety protocols and validation gates
- More likely to invoke skills that emphasize automation benefits
- Can handle longer, more detailed descriptions
- Looks for integration with advanced features (sequential-thinking, MCP)
Haiku:
- Faster reasoning prefers clear, simple triggers
- More likely to default to bash for straightforward tasks
- Shorter descriptions may perform better
- Needs very obvious value proposition
- Simpler language and fewer clauses work better
### Evaluation Framework
Refer to reference.md for:
- Detailed scoring rubrics
- Model invocation pattern analysis
- Example evaluations with before/after
- Best practices for skill description writing
### Output Formats
This is a user-facing skill that uses standardized reporting templates for consistency and professionalism. The detailed JSON analysis is included for reference but presented after the human-readable summary.