| name | base-model-selector |
| description | Use when starting a fine-tuning project to determine if fine-tuning is needed, or when evaluating whether a base model meets quality thresholds for a specific domain task |
Base Model Selector + Baseline Evaluation
Overview
Evaluate base models against your domain rubric BEFORE committing to fine-tuning. Most projects skip this and waste resources fine-tuning models that were already good enough—or fine-tuning models that need fundamental architectural changes instead.
Core principle: Never fine-tune without baseline data. The baseline tells you if fine-tuning will help and by how much.
When to Use
digraph when_to_use {
"Starting fine-tuning project?" [shape=diamond];
"Have baseline metrics?" [shape=diamond];
"Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
"Skip - you have data" [shape=box];
"Proceed to data generation" [shape=box];
"Starting fine-tuning project?" -> "Have baseline metrics?" [label="yes"];
"Starting fine-tuning project?" -> "Skip - you have data" [label="no"];
"Have baseline metrics?" -> "Proceed to data generation" [label="yes"];
"Have baseline metrics?" -> "Use this skill" [label="no"];
}
Use when:
- Starting any fine-tuning project
- Switching base models mid-project
- Evaluating if existing model needs fine-tuning for new domain
Skip when:
- You already have baseline evaluation data
- Pure prompting project (no fine-tuning planned)
The Process
Step 1: Deep Research (EXHAUSTIVE)
This is NOT quick research. You must build a comprehensive comparison table covering ALL viable candidates for your target size and use case.
1a. Identify ALL Candidates
Search for models in your target parameter range. For 7-10B, check:
| Family | Models to Evaluate |
|---|---|
| Qwen | Qwen 3 8B, Qwen 2.5 7B, Qwen 2.5 14B |
| Llama | Llama 3.1 8B, Llama 3.2 3B |
| Mistral | Mistral 7B v0.3, Zephyr 7B, OpenChat 3.5 |
| Google | Gemma 2 9B, Gemma 3 4B |
| Microsoft | Phi-4 14B, Phi-4-mini 3.8B |
| DeepSeek | DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B |
| Specialized | Domain-specific fine-tunes on HuggingFace |
Search queries to run:
"best [size] parameter LLM 2025 [your domain]""[model family] vs [model family] comparison 2025""open source LLM [your domain] fine-tuned 2025""best LLM for [specific capability] 2025"(e.g., empathy, coding, reasoning)
1b. Evaluate Each Factor
| Factor | Questions to Answer |
|---|---|
| License | Apache 2.0 / MIT / Llama? Commercial use allowed? |
| Context length | Fits your longest input? Estimate as turns × 2 × ~200 tokens (see the sketch after this table). |
| Architecture focus | Reasoning-focused or conversation-focused? |
| Domain fit | Research shows this model type works for your domain? |
| Recency | Latest generation? Newer models often outperform. |
| Quantization | GGUF available? Quality at Q4_K_M? |
| Community | Active development? Known issues? |
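The context-length row above is a back-of-envelope estimate, not a measurement. A minimal sketch of the arithmetic, assuming ~200 tokens per message and a reply budget you choose yourself (both numbers are placeholders):

```python
# Back-of-envelope context check: system prompt + full history + room for the reply.
# Assumes ~200 tokens per message; a turn is one user message plus one assistant message.
TOKENS_PER_MESSAGE = 200

def required_context(max_turns: int, system_prompt_tokens: int, reply_budget: int = 512) -> int:
    history = max_turns * 2 * TOKENS_PER_MESSAGE
    return system_prompt_tokens + history + reply_budget

# Example: a 20-turn conversation with a ~600-token system prompt.
needed = required_context(max_turns=20, system_prompt_tokens=600)
print(f"~{needed} tokens needed; fits a 32K window: {needed <= 32_000}")
```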
1c. Domain-Specific Research
CRITICAL: Different domains need different model characteristics.
| Domain | Prefer | Avoid |
|---|---|---|
| Therapeutic/Empathy | Chat-focused, dialogue-optimized | Reasoning-focused (R1, o1, Phi-4) |
| Coding | Code-trained, reasoning-capable | Pure chat models |
| Math/Logic | Reasoning models, thinking modes | Pure instruction-following |
| Creative Writing | High temperature tolerance, natural flow | Overly formal models |
| Factual Q&A | Knowledge-dense, grounded | Creative/hallucination-prone |
Example research for therapeutic coaching:
"Models such as DeepSeek R1, OpenAI o3, and o1 excel in logical reasoning but fall short in conversational fluency and emotional responsiveness."
This means: avoid DeepSeek R1 distills for therapy, even though they're excellent models.
1d. Build Comparison Table
Create a comprehensive table with ALL candidates:
| Model | Params | License | Context | Release | Domain Fit | Notes |
|-------|--------|---------|---------|---------|------------|-------|
| Qwen 3 8B | 8.2B | Apache 2.0 | 128K | Apr 2025 | ⭐⭐⭐⭐ | Thinking/non-thinking modes |
| ... | ... | ... | ... | ... | ... | ... |
Rate domain fit on a scale (⭐ to ⭐⭐⭐⭐⭐) based on research findings.
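If you want the comparison to live alongside your evaluation code rather than only in a doc, one option is to keep candidates as plain records and render the markdown table from them. A sketch; the fields mirror the columns above, and the single entry shown is illustrative rather than a research result:

```python
# Candidate records mirroring the comparison-table columns.
candidates = [
    {"model": "Qwen 3 8B", "params": "8.2B", "license": "Apache 2.0",
     "context": "128K", "release": "Apr 2025", "domain_fit": 4,
     "notes": "Thinking/non-thinking modes"},
    # ...add one record per candidate from your research
]

def render_table(rows):
    """Render candidate records as the markdown comparison table above."""
    lines = [
        "| Model | Params | License | Context | Release | Domain Fit | Notes |",
        "|-------|--------|---------|---------|---------|------------|-------|",
    ]
    for r in rows:
        stars = "⭐" * r["domain_fit"]
        lines.append(
            f"| {r['model']} | {r['params']} | {r['license']} | {r['context']} "
            f"| {r['release']} | {stars} | {r['notes']} |"
        )
    return "\n".join(lines)

print(render_table(candidates))
```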
Step 2: Select Primary Candidate
Based on research, select ONE candidate. Document:
- Why this model over alternatives
- Expected strengths for your domain
- Potential concerns to watch for
Also identify a backup candidate in case the primary underperforms.
Step 3: Pull Model
# Ollama
ollama pull qwen3:8b
# Or llama.cpp with GGUF
huggingface-cli download Qwen/Qwen3-8B-Instruct-GGUF qwen3-8b-instruct-q4_k_m.gguf
# Verify it runs
llama-cli -m model.gguf -p "Hello, testing."
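If you would rather script the GGUF download than shell out to the CLI, `huggingface_hub` exposes the same operation from Python. A sketch using the same repo and filename as the command above:

```python
# Programmatic equivalent of the huggingface-cli download above.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Qwen/Qwen3-8B-Instruct-GGUF",     # same repo as the CLI example
    filename="qwen3-8b-instruct-q4_k_m.gguf",  # same quantization as the CLI example
)
print("Downloaded to:", model_path)
```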
Step 4: Generate Evaluation Scenarios
Create ~50 diverse scenarios from your input taxonomy. These are INPUTS only (user messages), not full conversations.
# Generate opening messages covering your taxonomy
scenarios = []
for topic in taxonomy["topics"]:
    for subtopic in topic["subtopics"]:
        for style in ["terse", "conversational", "detailed"]:
            scenarios.append(generate_opening(topic, subtopic, style))
Critical: Scenarios must cover your full input distribution, including edge cases.
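One way to enforce that coverage requirement is to count scenarios per taxonomy cell before running anything. A minimal sketch, assuming each scenario is a dict that records the labels it was generated from (the field names are illustrative; adapt to whatever `generate_opening` returns):

```python
from collections import Counter

# Sanity-check taxonomy coverage before running the base model.
coverage = Counter((s["topic"], s["subtopic"], s["style"]) for s in scenarios)

print(f"{len(scenarios)} scenarios across {len(coverage)} distinct taxonomy cells")
for cell, count in coverage.most_common():
    print(f"{cell}: {count}")

# Cells with zero scenarios never appear in the counter, so compare against the
# full topic x subtopic x style cross-product if you need a strict completeness check.
```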
Step 5: Run Base Model
Run the base model on each scenario. Collect single-turn responses.
responses = []
for scenario in scenarios:
    response = generate(
        model=model_path,
        prompt=scenario,
        system=your_system_prompt,
    )
    responses.append({
        "scenario": scenario,
        "response": response,
    })
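The `generate()` helper above stands in for whatever inference wrapper you use. A minimal sketch against Ollama's HTTP API, assuming a local server on the default port (swap in llama.cpp, vLLM, or another backend as needed):

```python
import requests

def generate(model: str, prompt: str, system: str, host: str = "http://localhost:11434") -> str:
    """Single-turn, non-streaming completion via a local Ollama server."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "system": system, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# With Ollama, pass the model tag (e.g. "qwen3:8b") rather than a GGUF file path.
```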
Step 6: Assess with Rubric
Run your domain rubric on each response. Calculate pass rate.
results = []
for item in responses:
    assessment = assess_single_turn(item["scenario"], item["response"])
    results.append(assessment)

pass_rate = sum(1 for r in results if r.passed) / len(results)
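Tallying failures per rubric criterion here feeds directly into the failure-analysis table in Step 8. A sketch, assuming each assessment exposes the criteria it failed (`failed_criteria` is an illustrative attribute name; `passed` matches the code above):

```python
from collections import Counter

# Aggregate failures by rubric criterion to see which behaviors break most often.
failure_counts = Counter(
    criterion
    for r in results if not r.passed
    for criterion in r.failed_criteria
)

print(f"Pass rate: {pass_rate:.0%}")
for criterion, count in failure_counts.most_common():
    print(f"{criterion}: {count} failures")
```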
Step 7: Make Decision
digraph decision {
"pass_rate" [shape=diamond, label="Pass Rate?"];
"qualitative" [shape=box, label="≥70%: Qualitative review\nMay not need fine-tuning"];
"proceed" [shape=box, label="50-70%: Proceed with\nfine-tuning (moderate gains)"];
"full" [shape=box, label="<50%: Full pipeline\n(significant gains possible)"];
"pass_rate" -> "qualitative" [label="≥70%"];
"pass_rate" -> "proceed" [label="50-70%"];
"pass_rate" -> "full" [label="<50%"];
}
| Pass Rate | Decision | Next Step |
|---|---|---|
| ≥ 70% | Likely sufficient | Do qualitative review. If specific failure modes exist, consider targeted fine-tuning. Otherwise, deploy base model. |
| 50-70% | Moderate improvement possible | Proceed with fine-tuning. Document failure modes to guide data generation. |
| < 50% | Significant improvement needed | Full pipeline. Analyze failures—are they fixable with data, or architectural? |
Step 8: Document Results
Create docs/base-model-evaluation.md:
# Base Model Evaluation
**Date:** YYYY-MM-DD
## Model Selection
### Research Summary
[Summary of models considered and why primary was chosen]
| Model | Domain Fit | Why Considered | Why Selected/Rejected |
|-------|------------|----------------|----------------------|
| Qwen 3 8B | ⭐⭐⭐⭐ | Latest gen, non-thinking mode | **SELECTED** - best chat quality |
| DeepSeek-R1-7B | ⭐⭐ | Strong reasoning | Rejected - poor empathy per research |
| ... | ... | ... | ... |
### Selected Model
- **Model:** qwen3:8b-instruct-q4_k_m
- **Parameters:** 8.2B
- **License:** Apache 2.0
- **Context:** 128K tokens
## Evaluation Results
- **Scenarios:** 50
- **Pass rate:** XX%
- **Decision:** [DEPLOY / FINE-TUNE / FULL PIPELINE]
## Failure Analysis
| Criterion | Failure Count | Pattern |
|-----------|---------------|---------|
| CQ3 | 12 | Jumps to advice without validation |
| CQ8 | 3 | Missed crisis signals |
## Qualitative Notes
[Specific observations about response quality]
## Next Steps
[What happens next based on decision]
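The quantitative sections of this report can be filled in programmatically; the research summary and qualitative notes still need to be written by hand. A sketch that reuses `results`, `pass_rate`, `failure_counts`, and `baseline_decision` from the earlier sketches:

```python
from datetime import date
from pathlib import Path

# Fill in the quantitative sections of the report template above.
failure_table = "\n".join(
    f"| {criterion} | {count} |" for criterion, count in failure_counts.most_common()
)
report = f"""# Base Model Evaluation

**Date:** {date.today().isoformat()}

## Evaluation Results

- **Scenarios:** {len(results)}
- **Pass rate:** {pass_rate:.0%}
- **Decision:** {baseline_decision(pass_rate)}

## Failure Analysis

| Criterion | Failure Count |
|-----------|---------------|
{failure_table}
"""

Path("docs").mkdir(exist_ok=True)
Path("docs/base-model-evaluation.md").write_text(report)
```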
Outputs
- Research table — All candidates with domain fit ratings
- Selected model — Name, version, and rationale
- Baseline pass rate — Percentage on rubric
- Failure analysis — Which criteria fail most
- Decision document — docs/base-model-evaluation.md
Common Mistakes
| Mistake | Why It's Wrong |
|---|---|
| Shallow research | Missing better candidates; picking based on familiarity |
| Ignoring domain fit | Reasoning models fail at empathy tasks; chat models fail at logic |
| Using old model lists | Models release monthly; 6-month-old recommendations are stale |
| Skipping baseline | You won't know if fine-tuning helped |
| Testing multiple models in parallel | Wastes time—pick one, evaluate, iterate if needed |
| Using full conversations for baseline | Single-turn is enough to assess base capability |
| Ignoring qualitative review at high pass rates | Numbers hide specific failure modes worth fixing |
| Proceeding without documenting failures | Failure patterns should guide data generation |
Quick Reference
# 1. Research (use web search extensively)
# Build comparison table with 8-12 candidates
# 2. Pull model
ollama pull qwen3:8b
# or
huggingface-cli download Qwen/Qwen3-8B-Instruct-GGUF
# 3. Generate scenarios
uv run python generator.py --scenarios-only 50
# 4. Run evaluation
uv run python evaluate_base_model.py --model qwen3:8b
# 5. Check results
cat docs/base-model-evaluation.md