| name | skill-logger |
| description | Logs and scores skill usage quality, tracking output effectiveness, user satisfaction signals, and improvement opportunities. Expert in skill analytics, quality metrics, feedback loops, and continuous improvement. Activate on "skill logging", "skill quality", "skill analytics", "skill scoring", "skill performance", "skill metrics", "track skill usage", "skill improvement". NOT for creating skills (use agent-creator), skill documentation (use skill-coach), or runtime debugging (use debugger skills). |
| allowed-tools | Read,Write,Edit,Bash,Grep,Glob |
| category | Productivity & Meta |
| tags | logging, analytics, metrics, quality, improvement |
| pairs-with | skill-coach, agent-creator |
# Skill Logger

Track, measure, and improve skill quality through systematic logging and scoring.

## When to Use This Skill

Use for:
- Setting up skill usage logging
- Defining quality metrics for skill outputs
- Analyzing skill performance over time
- Identifying skills that need improvement
- Building feedback loops for skill enhancement
- A/B testing skill variations
NOT for:
- Creating new skills → use agent-creator
- Skill documentation → use skill-coach
- Runtime debugging → use appropriate debugger skills
- General logging/monitoring → use devops-automator
## Core Logging Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                     SKILL LOGGING PIPELINE                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  1. CAPTURE           2. ANALYZE           3. SCORE            │
│  ├─ Invocation        ├─ Output parse      ├─ Quality metrics  │
│  ├─ Input context     ├─ Token usage       ├─ User satisfaction│
│  ├─ Output            ├─ Tool calls        ├─ Goal completion  │
│  └─ Timing            └─ Error patterns    └─ Efficiency       │
│                                                                │
│  4. AGGREGATE         5. ALERT             6. IMPROVE          │
│  ├─ Per-skill stats   ├─ Quality drops     ├─ Identify patterns│
│  ├─ Trend analysis    ├─ Error spikes      ├─ Suggest changes  │
│  └─ Comparisons       └─ Underuse          └─ Track experiments│
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
## What to Log

### Invocation Data

```json
{
  "invocation_id": "uuid",
  "timestamp": "ISO8601",
  "skill_name": "wedding-immortalist",
  "skill_version": "1.2.0",
  "input": {
    "user_query": "Create a 3D model from my wedding photos",
    "context_tokens": 1500,
    "files_referenced": ["photos/", "config.json"]
  },
  "execution": {
    "duration_ms": 45000,
    "tool_calls": [
      {"tool": "Bash", "count": 5},
      {"tool": "Write", "count": 3}
    ],
    "tokens_used": {
      "input": 8500,
      "output": 3200
    },
    "errors": []
  },
  "output": {
    "type": "code_generation",
    "artifacts_created": ["pipeline.py", "config.yaml"],
    "response_length": 3200
  }
}
```
### Quality Signals

```python
QUALITY_SIGNALS = {
    # Implicit signals (automated)
    'completion': 'Did the skill complete without errors?',
    'token_efficiency': 'Output quality per token used',
    'tool_success_rate': 'Tool calls that succeeded',
    'retry_count': 'How many retries were needed?',

    # Explicit signals (user feedback)
    'user_edit_ratio': 'How much did the user modify the output?',
    'user_accepted': 'Did the user accept/use the output?',
    'follow_up_needed': 'Did the user need to ask for fixes?',
    'explicit_rating': 'Thumbs up/down if available',

    # Outcome signals (delayed)
    'code_ran_successfully': 'Did generated code work?',
    'tests_passed': 'Did it pass tests?',
    'reverted': 'Was the output later reverted?',
}
```
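Implicit signals can be computed without asking the user anything. As a minimal sketch, `user_edit_ratio` can be approximated by diffing the generated output against the version the user kept; `difflib` is in the standard library, though this particular similarity measure is an assumption, not a fixed definition:

```python
import difflib

def estimate_edit_ratio(generated: str, final: str) -> float:
    """Approximate user_edit_ratio as 1 - similarity between the
    generated output and the version the user ended up keeping."""
    similarity = difflib.SequenceMatcher(None, generated, final).ratio()
    return round(1.0 - similarity, 3)

# 0.0 means the user kept the output verbatim; 1.0 means a full rewrite.
```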
## Scoring Framework

### Multi-Dimensional Quality Score

```python
BASELINE_TOKENS = 5000  # per-skill baseline; establish from historical logs

def calculate_skill_score(invocation_log):
    """Score a skill invocation 0-100."""
    # tokens_used is a dict of input/output counts; guard against zero
    total_tokens = max(1, sum(invocation_log['tokens_used'].values()))
    scores = {
        # Completion (25%)
        'completion': (
            25 if invocation_log['errors'] == [] else
            15 if invocation_log['recovered'] else
            0
        ),
        # Efficiency (20%): baseline vs. actual tokens, capped at full marks
        'efficiency': min(20, 20 * (BASELINE_TOKENS / total_tokens)),
        # Output Quality (30%)
        'quality': (
            30 if invocation_log['user_accepted'] else
            20 if invocation_log['user_edit_ratio'] < 0.2 else
            10 if invocation_log['user_edit_ratio'] < 0.5 else
            0
        ),
        # User Satisfaction (25%)
        'satisfaction': (
            25 if invocation_log['explicit_rating'] == 'positive' else
            15 if invocation_log['no_follow_up'] else
            5 if invocation_log['follow_up_resolved'] else
            0
        ),
    }
    return sum(scores.values())
```
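As a quick sanity check, scoring a clean, accepted run with a positive rating (values illustrative):

```python
# Only the keys the scorer actually reads on this path are set.
log = {
    'errors': [],
    'tokens_used': {'input': 8500, 'output': 3200},
    'user_accepted': True,
    'explicit_rating': 'positive',
}
print(calculate_skill_score(log))
# completion 25 + efficiency min(20, 20 * 5000/11700) ≈ 8.5 + quality 30 + satisfaction 25 ≈ 88.5
```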
### Score Interpretation

| Score Range | Quality Level | Action |
|---|---|---|
| 90-100 | Excellent | Document as exemplar |
| 75-89 | Good | Monitor for consistency |
| 50-74 | Acceptable | Review for improvements |
| 25-49 | Poor | Prioritize fixes |
| 0-24 | Failing | Immediate intervention |
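The tiers map directly to code; a small helper, assuming exactly the thresholds in the table above:

```python
import bisect

TIER_BOUNDS = [25, 50, 75, 90]  # lower bound of each tier above Failing
TIER_ACTIONS = [
    'Failing: immediate intervention',
    'Poor: prioritize fixes',
    'Acceptable: review for improvements',
    'Good: monitor for consistency',
    'Excellent: document as exemplar',
]

def score_action(score: float) -> str:
    """Map a 0-100 quality score to the action column of the table."""
    return TIER_ACTIONS[bisect.bisect_right(TIER_BOUNDS, score)]
```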
## Log Storage Schema

### SQLite Schema (Local)

```sql
CREATE TABLE skill_invocations (
    id TEXT PRIMARY KEY,
    skill_name TEXT NOT NULL,
    skill_version TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
    -- Input
    user_query TEXT,
    context_tokens INTEGER,
    -- Execution
    duration_ms INTEGER,
    tokens_input INTEGER,
    tokens_output INTEGER,
    tool_calls_json TEXT,
    errors_json TEXT,
    -- Output
    output_type TEXT,
    artifacts_json TEXT,
    response_length INTEGER,
    -- Quality signals
    user_accepted BOOLEAN,
    user_edit_ratio REAL,
    follow_up_needed BOOLEAN,
    explicit_rating TEXT,
    -- Computed
    quality_score REAL
);

-- SQLite does not support inline INDEX clauses; create indexes separately
CREATE INDEX idx_skill_name ON skill_invocations (skill_name);
CREATE INDEX idx_timestamp ON skill_invocations (timestamp);
CREATE INDEX idx_quality ON skill_invocations (quality_score);

CREATE TABLE skill_aggregates (
    skill_name TEXT,
    period TEXT, -- 'daily', 'weekly', 'monthly'
    period_start DATE,
    invocation_count INTEGER,
    avg_quality_score REAL,
    error_rate REAL,
    avg_tokens_used INTEGER,
    avg_duration_ms INTEGER,
    PRIMARY KEY (skill_name, period, period_start)
);
```
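A minimal initialization sketch, assuming the schema above has been saved to a `schema.sql` file (and is run against a fresh database, since the statements lack `IF NOT EXISTS`):

```python
import sqlite3
from pathlib import Path

LOG_DB = Path.home() / '.claude' / 'skill_logs.db'

def init_db(schema_path: Path = Path('schema.sql')) -> None:
    """Create the logging tables and indexes from the schema file."""
    LOG_DB.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(LOG_DB) as conn:
        conn.executescript(schema_path.read_text())
```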
### JSON Log Format (Portable)

```json
{
  "logs_version": "1.0",
  "skill_name": "wedding-immortalist",
  "entries": [
    {
      "id": "uuid",
      "timestamp": "2025-01-15T14:30:00Z",
      "input": {...},
      "execution": {...},
      "output": {...},
      "quality": {
        "signals": {...},
        "score": 85,
        "computed_at": "2025-01-15T14:35:00Z"
      }
    }
  ]
}
```
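A sketch of exporting from the SQLite store into this portable format. The nested `input`/`execution`/`output` objects are elided here; in practice they would be reassembled from the `*_json` columns:

```python
import json
import sqlite3

def export_skill_logs(skill_name: str, db_path: str = 'skill_logs.db') -> str:
    """Serialize one skill's invocations into the portable JSON format."""
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(
            'SELECT id, timestamp, quality_score FROM skill_invocations '
            'WHERE skill_name = ?', (skill_name,)
        ).fetchall()
    entries = [
        {'id': r['id'], 'timestamp': r['timestamp'],
         'quality': {'score': r['quality_score']}}
        for r in rows
    ]
    return json.dumps(
        {'logs_version': '1.0', 'skill_name': skill_name, 'entries': entries},
        indent=2,
    )
```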
## Analytics Queries

### Skill Performance Dashboard

```sql
-- Overall skill rankings
SELECT
    skill_name,
    COUNT(*) AS uses,
    AVG(quality_score) AS avg_quality,
    AVG(tokens_output) AS avg_tokens,
    SUM(CASE WHEN errors_json != '[]' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS error_rate
FROM skill_invocations
WHERE timestamp > datetime('now', '-30 days')
GROUP BY skill_name
ORDER BY avg_quality DESC;

-- Quality trend (weekly)
SELECT
    skill_name,
    strftime('%Y-%W', timestamp) AS week,
    AVG(quality_score) AS avg_quality,
    COUNT(*) AS uses
FROM skill_invocations
GROUP BY skill_name, week
ORDER BY skill_name, week;

-- Problem detection
SELECT skill_name, COUNT(*) AS failures
FROM skill_invocations
WHERE quality_score < 50
  AND timestamp > datetime('now', '-7 days')
GROUP BY skill_name
HAVING failures >= 3
ORDER BY failures DESC;
```
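These queries are easy to drive from Python; a minimal sketch that reuses the 30-day ranking query:

```python
import sqlite3

def print_skill_rankings(db_path: str = 'skill_logs.db') -> None:
    """Print a plain-text version of the skill ranking dashboard."""
    query = """
        SELECT skill_name, COUNT(*) AS uses, AVG(quality_score) AS avg_quality
        FROM skill_invocations
        WHERE timestamp > datetime('now', '-30 days')
        GROUP BY skill_name
        ORDER BY avg_quality DESC
    """
    with sqlite3.connect(db_path) as conn:
        for name, uses, quality in conn.execute(query):
            print(f'{name:<32} {uses:>4} uses  {(quality or 0):6.1f} avg quality')
```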
### Improvement Opportunities

```python
def identify_improvement_opportunities(skill_name, logs):
    """Analyze logs to suggest skill improvements."""
    opportunities = []

    # Pattern 1: Common follow-up questions
    follow_ups = extract_follow_up_patterns(logs)
    if follow_ups:
        opportunities.append({
            'type': 'missing_capability',
            'description': f'Users frequently ask: {follow_ups[0]}',
            'suggestion': 'Add guidance for this common need'
        })

    # Pattern 2: High edit ratio in specific output types
    edit_patterns = analyze_edit_patterns(logs)
    if edit_patterns['code'] > 0.4:
        opportunities.append({
            'type': 'code_quality',
            'description': 'Users frequently edit generated code',
            'suggestion': 'Review code examples and templates'
        })

    # Pattern 3: Repeated errors
    error_patterns = cluster_errors(logs)
    for error_type, count in error_patterns:
        if count >= 3:
            opportunities.append({
                'type': 'recurring_error',
                'description': f'{error_type} occurred {count} times',
                'suggestion': 'Add error handling or documentation'
            })

    return opportunities
```
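The helpers (`extract_follow_up_patterns`, `analyze_edit_patterns`, `cluster_errors`) are left abstract above. As one plausible sketch, `cluster_errors` can be a simple frequency count over logged error types:

```python
from collections import Counter

def cluster_errors(logs):
    """Group logged errors by their 'type' field, most frequent first.
    Assumes each log entry carries a list of error dicts under 'errors'."""
    counts = Counter(
        error.get('type', 'unknown')
        for log in logs
        for error in log.get('errors', [])
    )
    return counts.most_common()  # [(error_type, count), ...]
```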
## Implementation Guide

### Basic Logger Hook

```python
# hooks/skill_logger.py
import json
import sqlite3
import uuid
from datetime import datetime
from pathlib import Path

LOG_DB = Path.home() / '.claude' / 'skill_logs.db'

def log_skill_invocation(
    skill_name: str,
    user_query: str,
    output: str,
    tool_calls: list,
    duration_ms: int,
    tokens: dict,
    errors: list = None,
):
    """Log a skill invocation to the database."""
    conn = sqlite3.connect(LOG_DB)
    cursor = conn.cursor()
    cursor.execute('''
        INSERT INTO skill_invocations
            (id, skill_name, timestamp, user_query, duration_ms,
             tokens_input, tokens_output, tool_calls_json, errors_json,
             response_length)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        str(uuid.uuid4()),
        skill_name,
        datetime.utcnow().isoformat(),
        user_query,
        duration_ms,
        tokens.get('input', 0),
        tokens.get('output', 0),
        json.dumps(tool_calls),
        json.dumps(errors or []),
        len(output),
    ))
    conn.commit()
    conn.close()
```
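A sketch of wiring the hook around a skill run. `run_skill` and the result attributes here are stand-ins for whatever actually executes the skill, not a real API:

```python
import time

def run_and_log(skill_name: str, user_query: str, run_skill):
    """Execute a skill, then record timing, tokens, and errors."""
    start = time.monotonic()
    errors = []
    result = None
    try:
        result = run_skill(user_query)  # hypothetical executor
    except Exception as exc:
        errors = [{'type': type(exc).__name__, 'message': str(exc)}]
    duration_ms = int((time.monotonic() - start) * 1000)
    log_skill_invocation(
        skill_name=skill_name,
        user_query=user_query,
        output=getattr(result, 'output', '') or '',
        tool_calls=getattr(result, 'tool_calls', []) or [],
        duration_ms=duration_ms,
        tokens=getattr(result, 'tokens', {}) or {},
        errors=errors,
    )
    return result
```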
### Quality Signal Collection

```python
def collect_quality_signals(invocation_id: str, signals: dict):
    """Update an invocation with quality signals."""
    conn = sqlite3.connect(LOG_DB)
    cursor = conn.cursor()
    # Update with user feedback
    cursor.execute('''
        UPDATE skill_invocations
        SET user_accepted = ?,
            user_edit_ratio = ?,
            follow_up_needed = ?,
            explicit_rating = ?,
            quality_score = ?
        WHERE id = ?
    ''', (
        signals.get('accepted'),
        signals.get('edit_ratio'),
        signals.get('follow_up'),
        signals.get('rating'),
        calculate_score(signals),  # adapter around calculate_skill_score above
        invocation_id,
    ))
    conn.commit()
    conn.close()
```
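Called once the user has reacted to the output; the id and values below are hypothetical:

```python
# The invocation id is whatever was stored at log time (made up here).
collect_quality_signals('3f2b9c1e-0000-0000-0000-000000000000', {
    'accepted': True,
    'edit_ratio': 0.1,
    'follow_up': False,
    'rating': 'positive',
})
```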
## Alerting & Notifications

### Alert Conditions

```python
ALERT_CONDITIONS = {
    'quality_drop': {
        'condition': 'avg_quality_7d < avg_quality_30d * 0.8',
        'message': 'Skill {skill} quality dropped 20%+ in past week',
        'severity': 'warning'
    },
    'error_spike': {
        'condition': 'error_rate_24h > error_rate_7d * 2',
        'message': 'Skill {skill} error rate doubled in past 24h',
        'severity': 'critical'
    },
    'underused': {
        'condition': 'uses_7d < uses_30d_avg * 0.5',
        'message': 'Skill {skill} usage down 50%+ this week',
        'severity': 'info'
    },
    'high_performer': {
        'condition': 'avg_quality_7d > 90 AND uses_7d > 10',
        'message': 'Skill {skill} performing excellently',
        'severity': 'positive'
    }
}
```
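One way to evaluate these is to compute the windowed stats a rule names, then check its expression against them. A sketch for the `quality_drop` rule (its arithmetic condition evaluates directly as Python; `eval` is tolerable here because the expressions are author-defined, not user input):

```python
import sqlite3

def check_quality_drops(db_path: str = 'skill_logs.db') -> list:
    """Evaluate the quality_drop rule for every skill in the log."""
    rule = ALERT_CONDITIONS['quality_drop']
    alerts = []
    with sqlite3.connect(db_path) as conn:
        skills = [r[0] for r in conn.execute(
            'SELECT DISTINCT skill_name FROM skill_invocations')]
        for skill in skills:
            stats = {}
            for key, days in (('avg_quality_7d', 7), ('avg_quality_30d', 30)):
                (avg,) = conn.execute(
                    'SELECT AVG(quality_score) FROM skill_invocations '
                    "WHERE skill_name = ? AND timestamp > datetime('now', ?)",
                    (skill, f'-{days} days')).fetchone()
                stats[key] = avg or 0.0
            if eval(rule['condition'], {}, stats):
                alerts.append((rule['severity'],
                               rule['message'].format(skill=skill)))
    return alerts
```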
## Anti-Patterns

### "Log Everything"

**Wrong:** Logging complete input/output for every invocation.
**Why:** Privacy concerns, storage explosion, noise.
**Right:** Log metadata and summaries, with detailed logging opt-in.

### "Score Once, Forget"

**Wrong:** Calculating the quality score immediately after completion and never revisiting it.
**Why:** Misses delayed signals (did the code work? was it reverted?).
**Right:** Collect signals over time and recalculate periodically (see the sketch at the end of this section).

### "Averages Only"

**Wrong:** Only tracking average quality scores.
**Why:** Hides the distribution and misses failure modes.
**Right:** Track percentiles, failure rates, and patterns.

### "No Baseline"

**Wrong:** Measuring quality without establishing baselines.
**Why:** You can't detect improvement or regression.
**Right:** Establish baselines per skill and compare trends.
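A sketch of the periodic re-scoring that the "Score Once, Forget" fix calls for. `load_signals` is a hypothetical helper that gathers the latest signals (including delayed ones like reverts or test results) for an invocation, and `calculate_score` is the adapter used in signal collection above:

```python
import sqlite3

def rescore_recent(db_path: str = 'skill_logs.db', days: int = 14) -> None:
    """Recompute quality scores for recent invocations as signals mature."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            'SELECT id FROM skill_invocations '
            "WHERE timestamp > datetime('now', ?)",
            (f'-{days} days',)).fetchall()
        for (invocation_id,) in rows:
            signals = load_signals(conn, invocation_id)  # hypothetical helper
            conn.execute(
                'UPDATE skill_invocations SET quality_score = ? WHERE id = ?',
                (calculate_score(signals), invocation_id))
```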
## Output Reports

### Weekly Skill Health Report

```markdown
# Skill Health Report - Week of 2025-01-13

## Overview
- Total invocations: 247
- Average quality: 78.3 (up 2.1 from last week)
- Error rate: 4.2% (down 1.8%)

## Top Performers
1. **wedding-immortalist** - 92.1 avg quality, 18 uses
2. **skill-coach** - 89.4 avg quality, 34 uses
3. **api-architect** - 87.2 avg quality, 22 uses

## Needs Attention
1. **legacy-code-converter** - 52.3 avg quality (down 15%)
   - Common issue: Missing dependency detection
   - Suggested fix: Add dependency scanning step

## Improvement Opportunities
- `partner-text-coach`: Users frequently ask for tone adjustment
- `yard-landscaper`: High edit ratio on plant recommendations
```
## Integration Points

- **skill-coach**: Feed quality data for skill improvements
- **agent-creator**: Use metrics when designing new skills
- **automatic-stateful-prompt-improver**: Quality signals for prompt optimization

**Core Philosophy:** What gets measured gets improved. Skill logging transforms intuition about skill quality into actionable data, enabling continuous improvement of the entire skill ecosystem.