| name | model-evaluation-framework |
| description | Model evaluation metrics, testing protocols, and performance assessment for Somali dialect classification. Covers accuracy, F1-score, confusion matrix analysis, per-dialect performance, and evaluation best practices for multi-class classification tasks. |
| allowed-tools | Read, Grep |
# Model Evaluation Framework

## Evaluation Metrics

### Primary Metrics
**Accuracy:**
- Overall correctness across all dialects
- Target: >85% for production
- Formula: (Correct Predictions) / (Total Predictions)

**Macro F1-Score:**
- Average F1 across all dialect classes
- Treats all dialects equally (useful when classes are imbalanced)
- Target: >0.82

**Weighted F1-Score:**
- Per-class F1 weighted by class support (the number of true examples per dialect)
- Accounts for class imbalance
- Target: >0.85
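Both averaging modes are available directly from scikit-learn's `f1_score`; a minimal sketch with toy labels (the label arrays are illustrative):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for three dialects: 0 = Northern, 1 = Southern, 2 = Central
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

accuracy = accuracy_score(y_true, y_pred)                    # fraction correct
macro_f1 = f1_score(y_true, y_pred, average='macro')         # unweighted mean of per-class F1
weighted_f1 = f1_score(y_true, y_pred, average='weighted')   # mean weighted by class support

print(f"Accuracy: {accuracy:.3f}, Macro F1: {macro_f1:.3f}, Weighted F1: {weighted_f1:.3f}")
```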
### Per-Class Metrics

**Precision (per dialect):**
- Of the texts predicted as Northern, how many are actually Northern?
- Formula: True Positives / (True Positives + False Positives)

**Recall (per dialect):**
- Of the actual Northern texts, how many did we find?
- Formula: True Positives / (True Positives + False Negatives)

**F1-Score (per dialect):**
- Harmonic mean of precision and recall
- Formula: 2 × (Precision × Recall) / (Precision + Recall)
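For a concrete example, take Northern in the confusion matrix shown under Confusion Matrix Analysis below (450 true positives; 30 + 15 = 45 false positives; 20 + 10 = 30 false negatives):

- Precision = 450 / (450 + 45) ≈ 0.909
- Recall = 450 / (450 + 30) ≈ 0.938
- F1 = 2 × (0.909 × 0.938) / (0.909 + 0.938) ≈ 0.923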
## Evaluation Protocol

### Standard Evaluation
```python
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix
)

def evaluate_model(y_true, y_pred, dialect_names):
    """Comprehensive model evaluation."""
    # Overall metrics
    accuracy = accuracy_score(y_true, y_pred)

    # Per-class metrics
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, average=None, labels=list(range(len(dialect_names)))
    )

    # Macro averages
    macro_f1 = f1.mean()

    # Detailed report
    report = classification_report(
        y_true, y_pred,
        target_names=dialect_names,
        digits=4
    )

    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    return {
        'accuracy': accuracy,
        'macro_f1': macro_f1,
        'per_class': {
            dialect_names[i]: {
                'precision': precision[i],
                'recall': recall[i],
                'f1': f1[i],
                'support': support[i]
            }
            for i in range(len(dialect_names))
        },
        'report': report,
        'confusion_matrix': cm
    }
```
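A minimal usage sketch (the labels below are illustrative; in practice they come from the held-out test set and the trained classifier):

```python
dialect_names = ['Northern', 'Southern', 'Central']

# Illustrative integer labels indexing into dialect_names
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

results = evaluate_model(y_test, y_pred, dialect_names)
print(f"Accuracy: {results['accuracy']:.4f}")
print(f"Macro F1: {results['macro_f1']:.4f}")
print(results['report'])
```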
## Confusion Matrix Analysis

### Interpreting Results

Example Confusion Matrix:

```
              Predicted
               N    S    C
Actual  N  [[450   20   10]
        S   [ 30  380   15]
        C   [ 15   25  255]]
```
Analysis:
- Northern is classified most accurately (450/480 = 93.8% recall)
- Southern is confused with Northern in about 7% of cases (30/425)
- Central is the most challenging class (255/295 = 86.4% recall)
- Northern is rarely confused with Central (10/480 ≈ 2%)
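These per-dialect rates are simply the row-normalized confusion matrix; a small sketch to compute them:

```python
import numpy as np

cm = np.array([[450, 20, 10],
               [30, 380, 15],
               [15, 25, 255]])

# Normalize each row so entries become P(predicted dialect | actual dialect)
row_rates = cm / cm.sum(axis=1, keepdims=True)

for name, rates in zip(['Northern', 'Southern', 'Central'], row_rates):
    print(name, np.round(rates, 3))
# The diagonal entries are the per-dialect recalls: 0.938, 0.894, 0.864
```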
### Visualization

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_confusion_matrix(cm, dialect_names):
    """Visualize the confusion matrix as a heatmap."""
    plt.figure(figsize=(8, 6))
    sns.heatmap(
        cm,
        annot=True,    # write each cell's count in the heatmap
        fmt='d',       # format counts as integers
        cmap='Blues',
        xticklabels=dialect_names,
        yticklabels=dialect_names
    )
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Dialect Classification Confusion Matrix')
    plt.tight_layout()
    plt.savefig('confusion_matrix.png', dpi=300)
```
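For example, rendering the matrix from the analysis above (assumes `plot_confusion_matrix` from the previous block is in scope):

```python
import numpy as np

cm = np.array([[450, 20, 10],
               [30, 380, 15],
               [15, 25, 255]])
plot_confusion_matrix(cm, ['Northern', 'Southern', 'Central'])
```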
## Cross-Validation

### K-Fold Strategy
```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
import numpy as np

def cross_validate_model(model, X, y, n_folds=5):
    """Stratified k-fold cross-validation."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    fold_scores = []

    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        # Train on this fold's training split
        model.fit(X_train, y_train)

        # Evaluate on the held-out fold
        y_pred = model.predict(X_val)
        accuracy = accuracy_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred, average='macro')
        fold_scores.append({'accuracy': accuracy, 'f1': f1})
        print(f"Fold {fold + 1}: Accuracy={accuracy:.4f}, F1={f1:.4f}")

    # Report mean and standard deviation across folds
    acc_mean = np.mean([s['accuracy'] for s in fold_scores])
    acc_std = np.std([s['accuracy'] for s in fold_scores])
    f1_mean = np.mean([s['f1'] for s in fold_scores])
    f1_std = np.std([s['f1'] for s in fold_scores])

    print("\nCross-Validation Results:")
    print(f"Accuracy: {acc_mean:.4f} ± {acc_std:.4f}")
    print(f"Macro F1: {f1_mean:.4f} ± {f1_std:.4f}")
    return fold_scores
```
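A usage sketch with a simple scikit-learn pipeline; the TF-IDF + logistic regression model and the placeholder texts are illustrative stand-ins, not the project's actual classifier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# Placeholder corpus; in practice X holds Somali sentences and y their dialect labels
X = np.array([f"sample dialect text number {i}" for i in range(30)])
y = np.array([i % 3 for i in range(30)])  # three dialect classes: 0, 1, 2

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
fold_scores = cross_validate_model(model, X, y, n_folds=5)
```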
## Performance Thresholds

### Minimum Acceptable Performance

**For Production Deployment:**
- Overall Accuracy: ≥85%
- Macro F1-Score: ≥0.82
- Per-Class F1: ≥0.75 for all dialects
- No class with recall <70%

**For Research/Experimental:**
- Overall Accuracy: ≥75%
- Macro F1-Score: ≥0.70
- Document performance gaps
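These gates can be checked mechanically against the dictionary returned by `evaluate_model`; a sketch using the production thresholds above:

```python
def meets_production_thresholds(results):
    """Check evaluate_model() output against the production gates listed above."""
    if results['accuracy'] < 0.85:
        return False
    if results['macro_f1'] < 0.82:
        return False
    for dialect, metrics in results['per_class'].items():
        # Every dialect must clear the per-class F1 and recall floors
        if metrics['f1'] < 0.75 or metrics['recall'] < 0.70:
            return False
    return True
```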
## Error Analysis

### Qualitative Review
```python
import random

def analyze_errors(X, y_true, y_pred, dialect_names, n_examples=10):
    """Identify and display misclassified examples."""
    errors = []
    for i in range(len(y_true)):
        if y_true[i] != y_pred[i]:
            errors.append({
                'text': X[i],
                'true_label': dialect_names[y_true[i]],
                'pred_label': dialect_names[y_pred[i]]
            })

    # Sample random errors for manual review
    sample_errors = random.sample(errors, min(n_examples, len(errors)))

    print(f"\nError Analysis ({len(errors)} total errors):\n")
    for idx, error in enumerate(sample_errors):
        print(f"Example {idx + 1}:")
        print(f"  Text: {error['text'][:100]}...")
        print(f"  True: {error['true_label']}, Predicted: {error['pred_label']}\n")
    return errors
```
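Beyond spot-checking individual examples, it helps to see which dialect pairs are confused most often; a small sketch over the `errors` list returned above:

```python
from collections import Counter

def count_confusion_pairs(errors):
    """Tally (true, predicted) dialect pairs across all misclassifications."""
    pairs = Counter((e['true_label'], e['pred_label']) for e in errors)
    for (true_label, pred_label), count in pairs.most_common():
        print(f"{true_label} -> {pred_label}: {count}")
    return pairs
```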
## Baseline Comparison

### Establish Baselines
```python
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score

# Majority class baseline
def majority_baseline(y_train, y_test):
    """Predict the most frequent training class for every test example."""
    majority_class = Counter(y_train).most_common(1)[0][0]
    y_pred = [majority_class] * len(y_test)
    return accuracy_score(y_test, y_pred)

# Random baseline
def random_baseline(y_train, y_test):
    """Random predictions drawn from the training class distribution."""
    class_dist = Counter(y_train)
    classes, counts = zip(*class_dist.items())
    probs = [c / sum(counts) for c in counts]
    y_pred = np.random.choice(classes, size=len(y_test), p=probs)
    return accuracy_score(y_test, y_pred)
```
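Equivalently, scikit-learn's `DummyClassifier` provides both baselines out of the box; a quick sketch with illustrative, imbalanced labels:

```python
from sklearn.dummy import DummyClassifier

y_train = [0] * 60 + [1] * 25 + [2] * 15  # illustrative class distribution
y_test = [0] * 12 + [1] * 5 + [2] * 3
X_train = [[0]] * len(y_train)            # features are ignored by dummy strategies
X_test = [[0]] * len(y_test)

for strategy in ('most_frequent', 'stratified'):
    clf = DummyClassifier(strategy=strategy, random_state=42).fit(X_train, y_train)
    print(strategy, clf.score(X_test, y_test))  # .score() reports accuracy
```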
**Report Format:**

Baseline Results:
- Random: 33.5% accuracy
- Majority: 58.2% accuracy
- Our Model: 87.3% accuracy ✓ (significantly better)
## Evaluation Report Template

```markdown
# Model Evaluation Report

**Model:** XLM-R Fine-Tuned for Somali Dialect Classification
**Date:** 2025-11-06
**Test Set Size:** 1,500 examples

## Overall Performance

| Metric | Value |
|--------|-------|
| Accuracy | 87.3% |
| Macro F1 | 0.852 |
| Weighted F1 | 0.871 |

## Per-Dialect Performance

| Dialect | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| Northern | 0.91 | 0.94 | 0.92 | 750 |
| Southern | 0.85 | 0.82 | 0.83 | 450 |
| Central | 0.81 | 0.79 | 0.80 | 300 |

## Confusion Matrix

[Insert visualization]

## Error Analysis

- Northern-Southern confusion: 3.2% of cases
- Most errors occur with short texts (<50 words)
- Informal/social media text is more challenging

## Comparison to Baseline

- Majority class baseline: 50.0%
- Our model: 87.3%
- **Improvement: +37.3 percentage points**

## Recommendations

- Collect more Southern and Central dialect training data
- Investigate short-text performance
- Consider ensemble approaches for difficult cases
```
## When This Skill Activates
This skill auto-invokes when you mention:
- Model evaluation, model assessment, performance
- Accuracy, precision, recall, F1-score
- Confusion matrix, error analysis
- Cross-validation, k-fold
- Metrics, evaluation metrics
- Test set, validation set
- Per-class performance, per-dialect
- Baseline comparison
**Version:** 1.0.0 | **Last Updated:** 2025-11-06 | **Project:** Somali Dialect Classifier