SKILL.md

name: create-inspect-task
description: Create custom inspect-ai evaluation tasks through an interactive, guided workflow.

Create Inspect Task

You help users create custom inspect-ai evaluation tasks through an interactive, guided workflow. Create well-documented, reusable evaluation scripts that follow inspect-ai best practices.

Your Task

Guide the user through designing and implementing a custom inspect-ai evaluation task. Create a complete, runnable task file and comprehensive documentation that explains the design decisions and usage.

Operating Modes

This skill supports two modes:

Mode 1: Experiment-Guided (Recommended)

When an experiment_summary.md file exists (created by design-experiment skill), extract configuration to pre-populate:

  • Dataset path and format
  • Model information
  • Evaluation objectives
  • System prompts
  • Common parameters

Usage: Run skill from experiment directory or provide path to experiment_summary.md

Mode 2: Standalone

Create evaluation tasks from scratch without experiment context. User provides all configuration manually.

Usage: Run skill when no experiment exists or when creating general-purpose evaluation tasks

Workflow

Initial Setup (Both Modes)

  1. Check for experiment context
    • Look for experiment_summary.md in current directory
    • If found, ask user: "I found an experiment summary. Would you like me to use it to configure the evaluation task?"
    • If user says yes, proceed with Mode 1
    • If no or not found, proceed with Mode 2

Mode 1: Experiment-Guided Workflow

  1. Read experiment_summary.md - Extract configuration
  2. Confirm extracted info - Show user what was found (dataset, models, etc.)
  3. Understand evaluation objective - What specific aspect to evaluate?
  4. Configure task-specific details - Solver chain, scorers (guided by experiment context)
  5. Add task parameters - Make the task flexible and reusable
  6. Generate code - Create the complete task file with experiment integration
  7. Create documentation - Write design documentation with experiment context
  8. Create log - Document all decisions in create-inspect-task.log
  9. Provide usage guidance - Show user how to run the task with their models

Mode 2: Standalone Workflow

  1. Understand the objective - What does the user want to evaluate?
  2. Configure dataset - Guide dataset format selection and loading
  3. Design solver chain - Build the solver pipeline (prompts, generation, etc.)
  4. Select scorers - Choose appropriate scoring mechanisms
  5. Add task parameters - Make the task flexible and reusable
  6. Generate code - Create the complete task file
  7. Create documentation - Write design documentation with rationale
  8. Create log - Document all decisions in create-inspect-task.log
  9. Provide usage guidance - Show user how to run the task

Extracting Information from experiment_summary.md (Mode 1)

When operating in experiment-guided mode, extract the following information:

Required Sections to Parse

1. Overview Section

## Overview
- **Type:** {design_type}
- **Total Runs:** {count}
- **Scientific Question:** {research_question}
- **Created:** {timestamp}

Extract:

  • Research question/objective → Informs evaluation goal
  • Experiment type → Helps understand what's being compared

2. Resources Section

## Resources

### Models
- **Location:** `{models_dir}`
- **Models Used:**
  - {model1}: `{full_path}` ✓ verified

### Dataset
- **Path:** `{dataset_path}` ✓ verified
- **Size:** {file_size}
- **Splits:** train ({count}), validation ({count}), test ({count})

### Evaluation Tasks
| Task Name | Script | Dataset | Description |
|-----------|--------|---------|-------------|
| {task1} | `{path}` | `{dataset}` | {desc} |

Extract:

  • Dataset path → Use for evaluation
  • Dataset format → Infer from extension (.json, .parquet)
  • Dataset splits → Use "test" split for evaluation
  • Model paths → For showing usage examples
  • Existing evaluation tasks → Check if task already exists

3. Configuration Section

## Configuration
- **Recipe:** `{recipe_path}`
- **Epochs:** {count}
- **Batch sizes:** {details}
- **System prompt:** "{prompt}"
- **Validation during training:** {yes/no}

Extract:

  • System prompt → Use same prompt for evaluation consistency
  • Epochs → Know which epochs to evaluate
  • Training configuration → Context for evaluation design

4. All Runs Table

| Run Name | Model | LoRA Rank | Batch Size | Type | Est. Time |
|----------|-------|-----------|------------|------|-----------|
| {run1} | {model} | {rank} | {batch} | Fine-tuned | {time} |
| {run1_base} | {model} | - | - | Control | N/A |

Extract:

  • Run names → For documentation examples
  • Model names → For showing evaluation commands
  • Control runs → Know which runs need evaluation

Parsing Algorithm

# Extraction sketch for experiment_summary.md
import re
from pathlib import Path

def extract_from_experiment_summary(path):
    with open(path) as f:
        content = f.read()

    # Extract dataset path
    # Look for: "**Path:** `{path}` ✓ verified"
    dataset_match = re.search(r'\*\*Path:\*\* `([^`]+)` ✓', content)
    dataset_path = dataset_match.group(1) if dataset_match else None

    # Extract dataset format
    dataset_ext = Path(dataset_path).suffix if dataset_path else None

    # Extract system prompt
    # Look for: "**System prompt:** "{prompt}""
    prompt_match = re.search(r'\*\*System prompt:\*\* "([^"]*)"', content)
    system_prompt = prompt_match.group(1) if prompt_match else ""

    # Extract research question
    question_match = re.search(r'\*\*Scientific Question:\*\* (.+)', content)
    research_question = question_match.group(1) if question_match else None

    # Extract model paths (first model listed)
    model_match = re.search(r'- (.+): `([^`]+)` ✓', content)
    model_name = model_match.group(1) if model_match else None
    model_path = model_match.group(2) if model_match else None

    return {
        'dataset_path': dataset_path,
        'dataset_ext': dataset_ext,
        'system_prompt': system_prompt,
        'research_question': research_question,
        'model_name': model_name,
        'model_path': model_path
    }
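
For example, the helper above could be used like this (how the result is reported back to the user is up to you):

config = extract_from_experiment_summary("experiment_summary.md")
if config["dataset_path"] is None:
    print("No verified dataset path found; ask the user to provide one.")
else:
    print(f"Dataset: {config['dataset_path']} ({config['dataset_ext']})")
print(f"System prompt: {config['system_prompt'] or '(none)'}")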

Presenting Extracted Information

After extraction, show the user what was found:

## Configuration Extracted from Experiment

I found the following configuration in your experiment:

**Dataset:**
- Path: `/scratch/gpfs/.../data/green/capitalization/words_4L_80P_300.json`
- Format: JSON
- Splits: train (240), test (60)

**Models:**
- Llama-3.2-1B-Instruct
- Path: `/scratch/gpfs/.../pretrained-llms/Llama-3.2-1B-Instruct`

**System Prompt:**

{extracted_prompt or "(none)"}


**Research Question:**
{extracted_question}

I'll use this information to help configure your evaluation task. You can override any of these settings if needed.

Validation

Check the extracted information (see the sketch after these lists):

  • ✓ Dataset path exists (verify with ls)
  • ✓ Dataset format is supported (.json, .parquet, .jsonl)
  • ✓ Model path exists (verify with ls)
  • ✓ System prompt is properly formatted (string, not list)

If validation fails:

  • Warn user but continue
  • Ask user to provide correct information
  • Log validation failures
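
A minimal sketch of these checks, operating on the dict returned by the extraction helper above:

from pathlib import Path

SUPPORTED_EXTS = {".json", ".jsonl", ".parquet"}

def validate_config(config):
    """Return a list of warnings; warn the user and continue rather than aborting."""
    warnings = []
    dataset = config.get("dataset_path")
    if not dataset or not Path(dataset).exists():
        warnings.append(f"Dataset path not found: {dataset}")
    elif Path(dataset).suffix not in SUPPORTED_EXTS:
        warnings.append(f"Unsupported dataset format: {Path(dataset).suffix}")
    model = config.get("model_path")
    if model and not Path(model).exists():
        warnings.append(f"Model path not found: {model}")
    if not isinstance(config.get("system_prompt", ""), str):
        warnings.append("System prompt should be a plain string, not a list")
    return warnings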

Logging

IMPORTANT: Create a detailed log file at {task_directory}/create-inspect-task.log that records all questions, answers, and decisions made during task creation.

Log Format

[YYYY-MM-DD HH:MM:SS] ACTION: Description
Details: {specifics}
Result: {outcome}
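
If it helps to keep entries consistent, a small helper in this format might look like the following (purely illustrative; the log can also be written directly):

from datetime import datetime

def log_entry(log_path, action, description, details, result):
    """Append one entry to create-inspect-task.log in the format above."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as f:
        f.write(f"[{timestamp}] {action}: {description}\n")
        f.write(f"Details: {details}\n")
        f.write(f"Result: {result}\n\n")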

What to Log

  • User's evaluation objective
  • Dataset selection and configuration decisions
  • Solver chain composition choices
  • Scorer selection rationale
  • Task parameter decisions
  • File creation
  • Any validation performed

Example Log Entries

Mode 1: Experiment-Guided

[2025-10-24 14:30:00] MODE_SELECTION: Experiment-guided mode
Details: Found experiment_summary.md at /scratch/gpfs/MSALGANIK/mjs3/cap_4L_lora_lr_sweep/experiment_summary.md
Result: User confirmed to use experiment configuration

[2025-10-24 14:30:05] EXTRACT_CONFIG: Reading experiment_summary.md
Details: Parsing sections: Overview, Resources, Configuration
Result: Successfully extracted configuration

[2025-10-24 14:30:10] EXTRACTED_DATASET: Dataset configuration
Details: Path: /scratch/gpfs/MSALGANIK/niznik/GitHub/cruijff_kit/data/green/capitalization/words_4L_80P_300.json
Format: JSON, Splits: train (240), test (60)
Result: Verified dataset exists (43KB)

[2025-10-24 14:30:15] EXTRACTED_SYSTEM_PROMPT: System prompt from experiment
Details: Prompt: "" (empty - no system message)
Result: Will use empty system prompt for consistency with training

[2025-10-24 14:30:20] EXTRACTED_RESEARCH_QUESTION: Scientific objective
Details: Compare LoRA ranks and learning rates for capitalization task
Result: Will design evaluation to measure exact match accuracy

[2025-10-24 14:30:25] EVALUATION_OBJECTIVE: User wants to evaluate capitalization accuracy
Details: Exact match (case-sensitive), using experiment dataset
Result: Will use match(location="exact", ignore_case=False) scorer for strict evaluation

[2025-10-24 14:30:30] SOLVER_CONFIG: Designing solver chain
Details: system_message(""), prompt_template("{prompt}"), generate(temp=0.0)
Result: Matches training configuration for consistency

Mode 2: Standalone

[2025-10-24 14:30:00] MODE_SELECTION: Standalone mode
Details: No experiment_summary.md found
Result: User will provide all configuration manually

[2025-10-24 14:30:05] EVALUATION_OBJECTIVE: User wants to evaluate sentiment classification
Details: Binary classification (positive/negative), using custom dataset in JSON format
Result: Will use match() scorer for exact matching, temperature=0.0 for consistency

[2025-10-24 14:30:15] DATASET_CONFIG: Selected JSON dataset format
Details: Dataset path: /scratch/gpfs/MSALGANIK/niznik/data/sentiment_test.json
Field mapping: input="text", target="sentiment"
Result: Will use hf_dataset with json format and custom record_to_sample function

Questions to Ask

1. Evaluation Objective

What do you want to evaluate?

  • Classification task? (sentiment, topic, entity type, etc.)
  • Generation quality? (summarization, translation, etc.)
  • Factual accuracy? (question answering, fact checking)
  • Reasoning ability? (math, logic, chain-of-thought)
  • Task-specific capability? (code generation, instruction following)

What defines a correct answer?

  • Exact match with target?
  • Contains specific information?
  • Model-graded quality assessment?
  • Multiple acceptable answers?

2. Dataset Configuration

What dataset format do you have?

  • JSON file (.json or .jsonl)
  • Parquet files (.parquet)
  • HuggingFace dataset (specify dataset name)
  • CSV file
  • Custom format (will need conversion)

Where is the dataset located?

  • Get full path to dataset
  • Verify file exists if possible
  • Check file size for sanity

What are the field names?

  • Input field name (e.g., "question", "text", "prompt")
  • Target/answer field name (e.g., "answer", "label", "output")
  • Any metadata fields to preserve? (e.g., "category", "difficulty")

Dataset structure specifics (see the example shapes after the questions below):

  • For JSON: Is it a single JSON file with nested structure or JSONL?
  • For JSON with splits: Which field contains the test split?
  • For Parquet: Is it a directory of parquet files?
  • For HuggingFace: Dataset name and split to use?

Example questions:

  • "Does your JSON file have a structure like {'train': [...], 'test': [...]}?"
  • "Is each line a separate JSON object (JSONL format)?"
  • "Do you need to load from a specific split like 'test' or 'validation'?"
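
The two JSON shapes referred to above look roughly like this (field names are illustrative):

Single JSON file with nested splits:
{"train": [{"input": "...", "output": "..."}, ...],
 "test":  [{"input": "...", "output": "..."}, ...]}

JSONL (one JSON object per line):
{"question": "...", "answer": "..."}
{"question": "...", "answer": "..."}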

3. Solver Configuration

System message:

  • Do you want to provide instructions to the model via system message?
  • What role should the model play? (e.g., "You are a helpful assistant", "You are an expert classifier")
  • Default: empty string (no system message)

Prompt template:

  • Should we use the input directly or wrap it in a template?
  • Do you need chain-of-thought prompting?
  • Default: "{prompt}" (direct input)

Generation parameters:

  • Temperature:
    • 0.0 for deterministic, consistent answers (recommended for most evals)
    • Higher values (0.7-1.0) for creative tasks
  • Max tokens: Maximum length of model response (default: model's default)
  • Top-p: Nucleus sampling parameter (default: 1.0)

Common solver patterns:

  • Simple generation: [system_message(""), prompt_template("{prompt}"), generate()]
  • Chain-of-thought: [chain_of_thought(), generate()]
  • Multiple-choice: [multiple_choice()] (don't add separate generate())
  • Custom template: [prompt_template("Answer: {prompt}\n"), generate()]

4. Scorer Selection

Based on evaluation objective, suggest scorers:

For exact matching:

  • match() - Target appears at beginning/end; ignores case, whitespace, punctuation
    • Options: location="begin"/"end"/"any"/"exact", ignore_case=True/False
  • exact() - Precise matching after normalization
  • includes() - Target appears anywhere in output
    • Options: ignore_case=True/False

For multiple choice:

  • choice() - Works with multiple_choice() solver
  • Returns letter of selected answer (A, B, C, D, etc.)

For pattern extraction:

  • pattern() - Extract answer using regex
    • Requires regex pattern parameter

For model-graded evaluation:

  • model_graded_qa() - Another model assesses answer quality
    • Options: partial_credit=True/False, custom template
  • model_graded_fact() - Checks if specific facts appear
  • Note: Requires additional model, adds latency and cost

For numeric/F1 scoring:

  • f1() - F1 score for text overlap

Multiple scorers:

  • Can use a list: [match(), includes()] to get multiple scores
  • Helpful for comparing scoring methods

5. Task Parameters

Should the task accept parameters for flexibility?

Common parameters to expose:

  • system_prompt - Allow different system messages
  • temperature - Enable temperature tuning
  • dataset_path - Support different datasets
  • grader_model - For model-graded scoring
  • config_dir - For integration with fine-tuning runs (like existing cap_task)

Benefits of parameters:

  • Run variations without code changes
  • Easier experimentation
  • Better reusability

How to pass parameters:

inspect eval task.py -T param_name=value
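
A minimal sketch of how exposed parameters map to -T flags (the task name, placeholder sample, and defaults below are illustrative):

from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import match
from inspect_ai.solver import chain, generate, prompt_template, system_message

@task
def my_eval(system_prompt: str = "", temperature: float = 0.0) -> Task:
    """Each keyword argument can be overridden on the CLI with -T name=value."""
    dataset = MemoryDataset([Sample(input="hello", target="HELLO")])  # placeholder sample
    return Task(
        dataset=dataset,
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate(temperature=temperature),
        ),
        scorer=match(),
    )

# e.g. inspect eval my_eval_task.py -T system_prompt="Capitalize the word." -T temperature=0.5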

6. Model Specification

How will the model be specified?

Option 1: CLI specification (most flexible)

  • User provides model at runtime
  • inspect eval task.py --model hf/local -M model_path=/path/to/model
  • Recommended for most cases

Option 2: Integration with fine-tuning config

  • Like existing cap_task example
  • Reads from setup_finetune.yaml
  • Takes config_dir parameter pointing to epoch directory
  • Best for evaluating fine-tuned models from experiments

Option 3: Hard-coded in task

  • Less flexible but simpler
  • Can specify model inside task definition
  • Better for benchmarking specific models

Output Files

Create two files:

1. Task Script: {task_name}_task.py

The complete, runnable inspect-ai task following best practices.

File naming convention:

  • Descriptive name: sentiment_classification_task.py
  • Include domain: math_reasoning_task.py
  • Follow pattern: {domain}_{type}_task.py

Required components:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset, hf_dataset, FieldSpec
from inspect_ai.solver import chain, generate, prompt_template, system_message
from inspect_ai.scorer import match, includes

@task
def my_task(param1: str = "default"):
    """
    Brief description of what this task evaluates.

    Args:
        param1: Description of parameter

    Returns:
        Task: Configured inspect-ai task
    """

    # Dataset loading
    dataset = ...

    # Solver chain
    solver = chain(
        system_message("..."),
        prompt_template("{prompt}"),
        generate(temperature=0.0)  # generation options are passed as keyword arguments
    )

    # Return task
    return Task(
        dataset=dataset,
        solver=solver,
        scorer=...
    )

Best practices to follow:

  • Use type hints for parameters
  • Include docstring explaining purpose
  • Add comments explaining non-obvious choices
  • Handle errors gracefully (try/except for file operations; see the sketch after this list)
  • Validate required parameters
  • Use descriptive variable names
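
A hedged sketch of the graceful error handling mentioned above (the function name and messages are hypothetical; yaml appears because generated tasks may read setup_finetune.yaml):

import yaml

def load_config_or_warn(config_path: str) -> dict:
    """Read a YAML config, warning rather than crashing if it is missing."""
    try:
        with open(config_path) as f:
            return yaml.safe_load(f) or {}
    except FileNotFoundError:
        print(f"Warning: {config_path} not found; proceeding with defaults.")
        return {}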

2. Design Documentation: {task_name}_design.md

Comprehensive documentation of design decisions.

Required sections:

# {Task Name} Evaluation Task

**Created:** {timestamp}
**Inspect-AI Version:** {version if known}

## Evaluation Objective

{What this task evaluates and why}

## Dataset Configuration

**Format:** {JSON/Parquet/HuggingFace/etc.}
**Location:** `{full_path_to_dataset}`
**Size:** {number of samples if known}

**Field Mapping:**
- Input field: `{field_name}`
- Target field: `{field_name}`
- Metadata fields: `{field_names or "none"}`

**Loading Method:**
{Description of how dataset is loaded}

**Data Structure:**
{Explanation of JSON structure, splits, etc.}

## Solver Chain

**Components:**
1. {Solver 1}: {Purpose}
2. {Solver 2}: {Purpose}
3. ...

**System Message:**

{system message text or "none"}


**Prompt Template:**

{template or "direct input"}


**Generation Parameters:**
- Temperature: {value} - {rationale}
- Max tokens: {value or "default"} - {rationale}
- {Other parameters if any}

**Rationale:**
{Why this solver chain was chosen}

## Scorer Configuration

**Primary Scorer:** `{scorer_name}()`

**Options:**
- {option1}: {value} - {reason}
- {option2}: {value} - {reason}

**Additional Scorers:**
{List if multiple scorers used, or "none"}

**Rationale:**
{Why this scorer is appropriate for the task}

## Task Parameters

| Parameter | Type | Default | Purpose |
|-----------|------|---------|---------|
| {param1} | {type} | {default} | {description} |

**Parameter Usage:**
inspect eval {task_file}.py -T {param}={value}

## Model Specification

**Recommended usage:**

inspect eval {task_file}.py --model hf/local -M model_path=/path/to/model

{Any specific notes about model compatibility}

## Example Usage

**Basic evaluation:**

inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model

**With parameters:**

inspect eval {task_name}_task.py --model hf/local -M model_path=/path/to/model -T temperature=0.5

**Evaluating fine-tuned model:** {if applicable}

cd /path/to/experiment/run/epoch_0
inspect eval {task_name}_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD

## Output Files

Inspect-ai will create:

  • logs/{task_name}_{timestamp}.eval - Evaluation results log
  • Console output with accuracy and metrics

## Expected Performance

{If known, describe expected baseline performance or what good performance looks like}

## Notes

{Any additional considerations, limitations, or future improvements}

## References


Code Generation Guidelines

Dataset Loading Patterns

JSON with nested splits:

from inspect_ai.dataset import Sample, hf_dataset

def record_to_sample(record):
    return Sample(
        input=record["input"],
        target=record["output"]
    )

dataset = hf_dataset(
    path="json",
    data_files="/path/to/data.json",
    field="test",  # Access the "test" split
    split="train",  # The json loader exposes all data under a single "train" split; field= above selects the nested "test" list
    sample_fields=record_to_sample
)

JSONL (one JSON object per line):

from inspect_ai.dataset import Sample, json_dataset

def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer"]
    )

dataset = json_dataset(
    "/path/to/data.jsonl",
    record_to_sample
)

Parquet directory:

from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="parquet",
    data_dir="/path/to/parquet_dir",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)

HuggingFace dataset:

from inspect_ai.dataset import hf_dataset, FieldSpec

dataset = hf_dataset(
    path="username/dataset-name",
    split="test",
    sample_fields=FieldSpec(
        input="question",
        target="answer",
        metadata=["category", "difficulty"]  # Preserve metadata
    )
)
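
CSV file (mentioned under Dataset Configuration; a minimal sketch using inspect-ai's csv_dataset loader, with illustrative column names):

from inspect_ai.dataset import FieldSpec, csv_dataset

dataset = csv_dataset(
    "/path/to/data.csv",
    sample_fields=FieldSpec(
        input="question",
        target="answer"
    )
)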

Solver Chain Patterns

Simple generation:

from inspect_ai.solver import chain, generate, prompt_template, system_message

solver = chain(
    system_message(""),  # Empty if no system message needed
    prompt_template("{prompt}"),  # Direct input
    generate(temperature=0.0)
)

With system message and custom template:

solver = chain(
    system_message("You are an expert classifier. Respond with only the category label."),
    prompt_template("Text: {prompt}\n\nCategory:"),
    generate(temperature=0.0, max_tokens=50)
)

Chain-of-thought:

from inspect_ai.solver import chain, chain_of_thought, generate

solver = chain(
    chain_of_thought(),  # Adds "Let's think step by step" prompt
    generate(temperature=0.0)
)

Multiple choice:

from inspect_ai.solver import multiple_choice

solver = multiple_choice()  # Don't add generate() separately
# Or with chain-of-thought:
solver = multiple_choice(cot=True)

Scorer Patterns

Exact matching (case-insensitive):

from inspect_ai.scorer import match

scorer = match()  # Default: ignore case, whitespace, punctuation
# Or customize:
scorer = match(location="exact", ignore_case=False)

Substring matching:

from inspect_ai.scorer import includes

scorer = includes()  # Default: case-insensitive
# Or require matching case:
scorer = includes(ignore_case=False)

Multiple scorers:

scorer = [
    match("exact", ignore_case=False),
    includes(ignore_case=False)
]
# Results will show scores from both

Model-graded:

from inspect_ai.scorer import model_graded_qa

scorer = model_graded_qa(
    partial_credit=True,  # Allow 0.5 scores
    model="openai/gpt-4o"  # Specify grading model
)
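
Pattern extraction (the regex below is only an example; adapt it to your answer format):

from inspect_ai.scorer import pattern

# Extracts the text after "ANSWER:" and compares it to the target
scorer = pattern(r"ANSWER:\s*(.+)")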

Integration with Fine-Tuning Workflow

Experiment-Guided Task Creation (Recommended)

When creating tasks for an experiment:

  1. Run from experiment directory:

    cd /scratch/gpfs/MSALGANIK/mjs3/my_experiment/
    # Invoke create-inspect-task skill
    
  2. Skill automatically extracts from experiment_summary.md:

    • Dataset path and format
    • System prompt (ensures eval matches training)
    • Model information
    • Research objectives
  3. Task supports both modes:

    • config_dir mode: Reads from setup_finetune.yaml (for fine-tuned models)
    • dataset_path mode: Direct dataset path (for base models and flexibility)

Generated Task Pattern

For tasks integrated with experiments:

from pathlib import Path
from typing import Optional

import yaml

from inspect_ai import Task, task
from inspect_ai.solver import chain, generate, prompt_template, system_message

@task
def my_task(
    config_dir: Optional[str] = None,
    dataset_path: Optional[str] = None,
    system_prompt: str = "",
    temperature: float = 0.0,
    split: str = "test"
) -> Task:
    """
    Evaluate model using configuration from fine-tuning setup or direct paths.

    Args:
        config_dir: Path to epoch directory (contains ../setup_finetune.yaml).
                   If provided, reads dataset path and system prompt from config.
        dataset_path: Direct path to dataset JSON file. Used if config_dir not provided.
        system_prompt: System message for the model. Overrides config if both provided.
        temperature: Generation temperature (default: 0.0 for deterministic output).
        split: Which data split to use (default: "test").

    Returns:
        Task: Configured inspect-ai task
    """

    # Determine configuration source
    if config_dir:
        # Mode 1: Read from fine-tuning configuration
        config_path = Path(config_dir).parent / "setup_finetune.yaml"

        with open(config_path, 'r') as f:
            config = yaml.safe_load(f)

        # Extract settings from fine-tuning config
        dataset_path = config['input_dir_base'] + config['dataset_label'] + config['dataset_ext']

        # Use system prompt from config unless overridden
        if not system_prompt:
            system_prompt = config.get('system_prompt', '')

    elif dataset_path:
        # Mode 2: Direct dataset path
        # system_prompt and other params used as provided
        pass
    else:
        raise ValueError("Must provide either config_dir or dataset_path")

    # Load dataset
    dataset = ...  # Load using dataset_path

    return Task(
        dataset=dataset,
        solver=chain(
            system_message(system_prompt),
            prompt_template("{prompt}"),
            generate(temperature=temperature)
        ),
        scorer=...
    )

Usage Examples

Evaluating fine-tuned model from experiment:

cd /path/to/experiment/run_dir/epoch_0
inspect eval /path/to/my_task.py --model hf/local -M model_path=$PWD -T config_dir=$PWD

Evaluating base model (control run):

inspect eval my_task.py \
  --model hf/local \
  -M model_path=/scratch/gpfs/MSALGANIK/pretrained-llms/Llama-3.2-1B-Instruct \
  -T dataset_path=/path/to/dataset.json

Integration with setup_inspect.py (Future)

This task pattern enables integration with the setup_inspect.py tool (when implemented):

python tools/inspect/setup_inspect.py --finetune_epoch_dir /path/to/experiment/run/epoch_0

Validation Before Completion

Common Validation (Both Modes)

Before finishing, verify:

  • ✓ Task file is syntactically correct Python
  • ✓ All imports are present
  • ✓ Task decorated with @task
  • ✓ Dataset loading code matches format
  • ✓ Solver chain follows inspect-ai patterns
  • ✓ Scorer is appropriate for task
  • ✓ Design documentation includes all sections
  • ✓ Example usage commands are correct
  • ✓ Log file documents all decisions

Mode 1 Specific Validation

Additional checks for experiment-guided mode:

  • ✓ experiment_summary.md was successfully parsed
  • ✓ Extracted dataset path exists and format matches
  • ✓ System prompt matches training configuration
  • ✓ Task supports both config_dir and dataset_path parameters
  • ✓ Documentation includes experiment context (research question, runs)
  • ✓ Usage examples show both fine-tuned and base model evaluation
  • ✓ Log includes extraction details and validation results

Next Steps After Creation

After creating the task, guide user:

  1. Test the task:

    # Validate syntax
    python -m py_compile {task_file}.py
    
    # Test with small sample
    inspect eval {task_file}.py --model {model} --limit 5
    
  2. Run full evaluation:

    inspect eval {task_file}.py --model {model}
    
  3. View results:

    inspect view
    # Opens web UI to browse evaluation logs
    
  4. Iterate if needed:

    • Adjust scorer settings
    • Modify prompts
    • Change generation parameters
    • Use inspect score to re-score without re-running

Important Notes

General Best Practices

  • Follow inspect-ai best practices from https://inspect.aisi.org.uk/
  • Always include docstrings and comments
  • Make tasks parameterized for flexibility
  • Create comprehensive documentation for reproducibility
  • Use type hints for parameters
  • Handle errors gracefully
  • Validate dataset paths when possible
  • Keep generation temperature at 0.0 for consistency unless user needs creativity
  • Prefer simple scorers (match, includes) over model-graded when possible
  • Test with small samples first (--limit 5)

Experiment Integration

  • Prefer Mode 1 (experiment-guided) when working with designed experiments
  • Always check for experiment_summary.md before starting
  • Extract and validate all configuration before proceeding
  • System prompt consistency is critical - eval must match training
  • Generated tasks should work for both fine-tuned and base models
  • Include experiment context in documentation (research question, runs)
  • Use config_dir parameter pattern for experiment integration
  • Log all extraction and validation steps for reproducibility

Error Handling

If dataset file not found:

  • Warn user but proceed with code generation
  • Note in documentation that path should be verified
  • Include validation suggestion in next steps

If unsure about dataset format:

  • Ask for example record
  • Offer to help convert to supported format
  • Suggest the user examine the file structure (for example, with the snippet below)
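
A quick way to peek at a JSON file's structure (the path is hypothetical):

import json

with open("/path/to/data.json") as f:
    data = json.load(f)

if isinstance(data, dict):
    print("Top-level keys (possible splits):", list(data.keys()))
else:
    print("List of records; first record:", data[0])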

If scorer choice unclear:

  • Recommend starting with simple scorers
  • Suggest using multiple scorers for comparison
  • Note that scorers can be changed later without re-running generation