| name | generalization-evaluator |
| description | Cross-domain evaluation to estimate generality and detect blind spots. Use when asked to assess broad capability, compare models across domains, or identify missing skills. |
Generalization Evaluator
Use this skill to measure generality across domains and identify weak coverage.
Workflow
- Load a task set (use references/task_set.example.json).
- Run the task set with a consistent runner.
- Score pass/fail per task and summarize by domain.
- Rank gaps by impact.
Scripts
- Run: python scripts/run_eval.py --tasks references/task_set.example.json --runner ollama --model qwen3:latest
Output Expectations
- Provide a domain score table and a short summary of weaknesses.
- List the top 3 skill gaps with suggested skill actions.