Claude Code Plugins

Community-maintained marketplace

Feedback

mechinterp-next-step-planner

@cesaregarza/SplatNLP
0
0

Propose next mechanistic interpretability experiments based on research state, hypotheses, and existing evidence

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name mechinterp-next-step-planner
description Propose next mechanistic interpretability experiments based on research state, hypotheses, and existing evidence

MechInterp Next-Step Planner

Analyze current research state and propose the next best experiments to run. Writes ExperimentSpec JSON files directly to the specs directory.

Purpose

The planner skill:

  • Analyzes current hypotheses and their evidence status
  • Identifies gaps in understanding
  • Proposes 2-3 experiments that would discriminate between hypotheses
  • Writes ExperimentSpec JSON files ready for the runner

When to Use

Use this skill when you need to:

  • Decide what experiment to run next
  • Find experiments that test uncertain hypotheses
  • Generate specs after updating research state
  • Plan a sequence of experiments

Decision Heuristics

The planner considers:

  1. Hypothesis uncertainty: Prioritize experiments for hypotheses with confidence near 0.5
  2. Evidence gaps: Look for hypotheses with few evidence items
  3. Validation needs: Suggest validation for high-confidence hypotheses (>0.7)
  4. ReLU floor avoidance: Skip weapon experiments if activation is low
  5. Constraint compliance: All specs enforce one-rung-per-family
  6. Suppressor detection: Suggest token_influence_sweep when single-family features may have hidden suppressors

Planning Workflow

1. Load Current State

from splatnlp.mechinterp.state import ResearchStateManager

manager = ResearchStateManager(feature_id=18712, model_type="ultra")
print(manager.get_summary())

2. Analyze Evidence Gaps

Check which hypotheses need more testing:

for h in manager.state.hypotheses:
    n_support = len(h.supporting_evidence)
    n_refute = len(h.refuting_evidence)
    print(f"{h.id}: conf={h.confidence:.0%}, support={n_support}, refute={n_refute}")

    if h.status == "proposed" and n_support == 0:
        print(f"  -> NEEDS TESTING")
    elif h.confidence > 0.7 and h.status == "supported":
        print(f"  -> NEEDS VALIDATION")

3. Generate Experiment Specs

Based on the analysis, create appropriate specs:

from splatnlp.mechinterp.schemas import ExperimentSpec, ExperimentType
from splatnlp.mechinterp.state.io import get_spec_path, SPECS_DIR
import json
from datetime import datetime

# Example: Family sweep for untested hypothesis about SCU
spec = ExperimentSpec(
    id=datetime.now().strftime("%Y%m%d_%H%M%S"),
    type=ExperimentType.FAMILY_1D_SWEEP,
    feature_id=18712,
    model_type="ultra",
    variables={
        "family": "special_charge_up",
        "rungs": [3, 12, 29, 41, 57],
        "include_absent": True
    },
    dataset_slice={
        "percentile_min": 10.0,
        "percentile_max": 90.0,
        "sample_size": 500
    },
    description="Test SCU response to identify threshold rung",
    parent_hypothesis="h001",
    rationale="Hypothesis h001 claims SCU >= 41 AP triggers feature. "
              "This sweep will identify the exact threshold."
)

# Write spec to file
spec_path = SPECS_DIR / spec.to_filename()
spec_path.parent.mkdir(parents=True, exist_ok=True)
spec_path.write_text(spec.model_dump_json(indent=2))
print(f"Spec written to: {spec_path}")

4. Common Experiment Templates

For Family-Specific Hypotheses

# "Feature responds to family X at high AP"
ExperimentSpec(
    type=ExperimentType.FAMILY_1D_SWEEP,
    feature_id=feature_id,
    model_type="ultra",
    variables={"family": "swim_speed_up"},
)

For Multi-Family Interactions

# "SCU and ISS have synergistic effect"
ExperimentSpec(
    type=ExperimentType.FAMILY_2D_HEATMAP,
    feature_id=feature_id,
    model_type="ultra",
    variables={
        "family_x": "special_charge_up",
        "family_y": "ink_saver_sub"
    },
)

For Pattern Discovery

# "Feature detects specific ability combinations"
ExperimentSpec(
    type=ExperimentType.FREQUENT_ITEMSETS,
    feature_id=feature_id,
    model_type="ultra",
    variables={
        "min_support": 0.05,
        "max_size": 4,
        "high_activation_pct": 10.0
    },
)

For Validation

# Validate stable high-confidence hypothesis
ExperimentSpec(
    type=ExperimentType.SPLIT_HALF,
    feature_id=feature_id,
    model_type="ultra",
    variables={"n_splits": 10},
    parent_hypothesis="h001",
)

For Token Influence Analysis (Enhancers + Suppressors)

# Identify both enhancing and suppressing tokens
# CRITICAL: Use this when you need to understand what a feature AVOIDS
ExperimentSpec(
    type=ExperimentType.TOKEN_INFLUENCE_SWEEP,
    feature_id=feature_id,
    model_type="ultra",
    variables={
        "min_samples": 50,
        "high_percentile": 0.995,  # Top 0.5%
        "collapse_families": True,
        "suppressor_threshold": 0.8,  # < 0.8 = suppressed
        "enhancer_threshold": 1.2,    # > 1.2 = enhanced
    },
    description="Identify enhancers and suppressors for complete interpretation",
)

Planner Decision Tree

┌─────────────────────────────────────────┐
│     Load Research State                 │
└────────────────┬────────────────────────┘
                 │
    ┌────────────▼────────────┐
    │  Any hypotheses?         │
    └────────────┬────────────┘
                 │
         No ◄────┴────► Yes
         │               │
         ▼               ▼
┌─────────────────┐ ┌─────────────────────────┐
│ Exploration     │ │ Any untested?           │
│ - PageRank      │ └────────────┬────────────┘
│ - Itemsets      │              │
│ - Family sweeps │      Yes ◄───┴───► No
└─────────────────┘       │              │
                          ▼              ▼
                 ┌─────────────┐ ┌─────────────────┐
                 │ Test most   │ │ Any uncertain?  │
                 │ uncertain   │ │ (conf ~0.5)     │
                 └─────────────┘ └────────┬────────┘
                                          │
                                  Yes ◄───┴───► No
                                   │             │
                                   ▼             ▼
                          ┌─────────────┐ ┌─────────────────┐
                          │ Discriminate│ │ Any high-conf   │
                          │ experiments │ │ need validation?│
                          └─────────────┘ └────────┬────────┘
                                                   │
                                           Yes ◄───┴───► No
                                            │             │
                                            ▼             ▼
                                   ┌─────────────┐ ┌───────────┐
                                   │ Validation  │ │ Research  │
                                   │ experiments │ │ complete! │
                                   └─────────────┘ └───────────┘

Guardrails

The planner enforces these rules:

  1. One-rung-per-family: All specs include this constraint
  2. ReLU floor check: If state has pitfall "relu_floor_at_low_activation", avoid weapon sweeps
  3. Evidence diversification: Prefer different experiment types than recently run
  4. Hypothesis coverage: Ensure all active hypotheses have at least one pending experiment
  5. Suppressor check: For single-family features, suggest token_influence_sweep to detect what the feature AVOIDS (critical for complete interpretation)
  6. Conditional sweep requirement: After confirming primary driver with 1D sweep, ALWAYS suggest 2D heatmaps for secondary abilities (see below)

⚠️ CRITICAL: Conditional 2D Sweep Logic

After a 1D sweep confirms a primary driver, 1D sweeps for other abilities are MISLEADING.

Why 1D Sweeps Fail for Secondary Abilities

When feature has strong primary driver (e.g., SCU):

  • Most contexts have LOW primary → activation ≈ 0
  • Secondary ability can't suppress zero
  • 1D sweep averages across ALL contexts → delta ≈ 0
  • False conclusion: "secondary has no effect"

Planner Logic

def suggest_next_experiments(state):
    # Check if primary driver is confirmed
    primary_confirmed = any(
        h.status == "supported" and h.confidence > 0.7
        for h in state.hypotheses
        if "primary" in h.tags or "monotonic" in h.tags
    )

    if primary_confirmed:
        primary_family = get_confirmed_primary(state)
        correlated_abilities = get_overview_top_tokens(state, top_k=10)

        # For EACH correlated ability, suggest 2D heatmap
        for ability in correlated_abilities:
            if ability != primary_family:
                suggest_2d_heatmap(
                    family_x=primary_family,
                    family_y=ability,
                    rationale=f"Test conditional effect of {ability} at each {primary_family} level"
                )

        # ALWAYS suggest death-mitigation battery
        for death_ability in ["quick_respawn", "special_saver", "comeback"]:
            if not already_tested(state, primary_family, death_ability):
                suggest_2d_heatmap(
                    family_x=primary_family,
                    family_y=death_ability,
                    rationale="Death-aversion test"
                )

Template: Post-Primary Conditional Sweeps

After confirming primary driver {PRIMARY}, automatically suggest:

// Batch 1: Death-mitigation (REQUIRED)
[
  {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "quick_respawn"}},
  {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "special_saver"}},
  {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "comeback"}}
]

// Batch 2: Top correlated from overview (if not already primary)
[
  {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_1}"}},
  {"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_2}"}}
]

// Batch 3: Weapon interaction (optional)
[
  {"type": "weapon_sweep", "variables": {"condition_family": "{PRIMARY}", "condition_rung": 29}}
]

Interpreting 2D Results

2D Result Meaning Label Update
Y suppresses at high X Anti-synergy Add "{Y}-averse"
Y enhances at high X Synergy Add "{Y}-synergistic"
Y flat across X levels Spurious correlation Remove from label
No data at high X + high Y Mutual exclusion Note in label

Output Location

Specs are written to:

/mnt/e/mechinterp_runs/specs/{timestamp}__f{feature_id}__{type}.json

Example filenames:

  • 20250607_142531__f18712__family-1d-sweep.json
  • 20250607_143022__f18712__frequent-itemsets.json
  • 20250607_143500__f18712__split-half.json

See Also

  • mechinterp-state: Manage research state
  • mechinterp-runner: Execute generated specs
  • mechinterp-glossary-and-constraints: Domain rules
  • mechinterp-summarizer: Process results
  • mechinterp-ability-semantics: Ability semantic groupings for interpretation
  • mechinterp-investigator: Full investigation workflow