| name | mechinterp-next-step-planner |
| description | Propose next mechanistic interpretability experiments based on research state, hypotheses, and existing evidence |
MechInterp Next-Step Planner
Analyze current research state and propose the next best experiments to run. Writes ExperimentSpec JSON files directly to the specs directory.
Purpose
The planner skill:
- Analyzes current hypotheses and their evidence status
- Identifies gaps in understanding
- Proposes 2-3 experiments that would discriminate between hypotheses
- Writes ExperimentSpec JSON files ready for the runner
When to Use
Use this skill when you need to:
- Decide what experiment to run next
- Find experiments that test uncertain hypotheses
- Generate specs after updating research state
- Plan a sequence of experiments
Decision Heuristics
The planner considers:
- Hypothesis uncertainty: Prioritize experiments for hypotheses with confidence near 0.5
- Evidence gaps: Look for hypotheses with few evidence items
- Validation needs: Suggest validation for high-confidence hypotheses (>0.7)
- ReLU floor avoidance: Skip weapon experiments if activation is low
- Constraint compliance: All specs enforce one-rung-per-family
- Suppressor detection: Suggest token_influence_sweep when single-family features may have hidden suppressors
Planning Workflow
1. Load Current State
from splatnlp.mechinterp.state import ResearchStateManager
manager = ResearchStateManager(feature_id=18712, model_type="ultra")
print(manager.get_summary())
2. Analyze Evidence Gaps
Check which hypotheses need more testing:
for h in manager.state.hypotheses:
n_support = len(h.supporting_evidence)
n_refute = len(h.refuting_evidence)
print(f"{h.id}: conf={h.confidence:.0%}, support={n_support}, refute={n_refute}")
if h.status == "proposed" and n_support == 0:
print(f" -> NEEDS TESTING")
elif h.confidence > 0.7 and h.status == "supported":
print(f" -> NEEDS VALIDATION")
3. Generate Experiment Specs
Based on the analysis, create appropriate specs:
from splatnlp.mechinterp.schemas import ExperimentSpec, ExperimentType
from splatnlp.mechinterp.state.io import get_spec_path, SPECS_DIR
import json
from datetime import datetime
# Example: Family sweep for untested hypothesis about SCU
spec = ExperimentSpec(
id=datetime.now().strftime("%Y%m%d_%H%M%S"),
type=ExperimentType.FAMILY_1D_SWEEP,
feature_id=18712,
model_type="ultra",
variables={
"family": "special_charge_up",
"rungs": [3, 12, 29, 41, 57],
"include_absent": True
},
dataset_slice={
"percentile_min": 10.0,
"percentile_max": 90.0,
"sample_size": 500
},
description="Test SCU response to identify threshold rung",
parent_hypothesis="h001",
rationale="Hypothesis h001 claims SCU >= 41 AP triggers feature. "
"This sweep will identify the exact threshold."
)
# Write spec to file
spec_path = SPECS_DIR / spec.to_filename()
spec_path.parent.mkdir(parents=True, exist_ok=True)
spec_path.write_text(spec.model_dump_json(indent=2))
print(f"Spec written to: {spec_path}")
4. Common Experiment Templates
For Family-Specific Hypotheses
# "Feature responds to family X at high AP"
ExperimentSpec(
type=ExperimentType.FAMILY_1D_SWEEP,
feature_id=feature_id,
model_type="ultra",
variables={"family": "swim_speed_up"},
)
For Multi-Family Interactions
# "SCU and ISS have synergistic effect"
ExperimentSpec(
type=ExperimentType.FAMILY_2D_HEATMAP,
feature_id=feature_id,
model_type="ultra",
variables={
"family_x": "special_charge_up",
"family_y": "ink_saver_sub"
},
)
For Pattern Discovery
# "Feature detects specific ability combinations"
ExperimentSpec(
type=ExperimentType.FREQUENT_ITEMSETS,
feature_id=feature_id,
model_type="ultra",
variables={
"min_support": 0.05,
"max_size": 4,
"high_activation_pct": 10.0
},
)
For Validation
# Validate stable high-confidence hypothesis
ExperimentSpec(
type=ExperimentType.SPLIT_HALF,
feature_id=feature_id,
model_type="ultra",
variables={"n_splits": 10},
parent_hypothesis="h001",
)
For Token Influence Analysis (Enhancers + Suppressors)
# Identify both enhancing and suppressing tokens
# CRITICAL: Use this when you need to understand what a feature AVOIDS
ExperimentSpec(
type=ExperimentType.TOKEN_INFLUENCE_SWEEP,
feature_id=feature_id,
model_type="ultra",
variables={
"min_samples": 50,
"high_percentile": 0.995, # Top 0.5%
"collapse_families": True,
"suppressor_threshold": 0.8, # < 0.8 = suppressed
"enhancer_threshold": 1.2, # > 1.2 = enhanced
},
description="Identify enhancers and suppressors for complete interpretation",
)
Planner Decision Tree
┌─────────────────────────────────────────┐
│ Load Research State │
└────────────────┬────────────────────────┘
│
┌────────────▼────────────┐
│ Any hypotheses? │
└────────────┬────────────┘
│
No ◄────┴────► Yes
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────────┐
│ Exploration │ │ Any untested? │
│ - PageRank │ └────────────┬────────────┘
│ - Itemsets │ │
│ - Family sweeps │ Yes ◄───┴───► No
└─────────────────┘ │ │
▼ ▼
┌─────────────┐ ┌─────────────────┐
│ Test most │ │ Any uncertain? │
│ uncertain │ │ (conf ~0.5) │
└─────────────┘ └────────┬────────┘
│
Yes ◄───┴───► No
│ │
▼ ▼
┌─────────────┐ ┌─────────────────┐
│ Discriminate│ │ Any high-conf │
│ experiments │ │ need validation?│
└─────────────┘ └────────┬────────┘
│
Yes ◄───┴───► No
│ │
▼ ▼
┌─────────────┐ ┌───────────┐
│ Validation │ │ Research │
│ experiments │ │ complete! │
└─────────────┘ └───────────┘
Guardrails
The planner enforces these rules:
- One-rung-per-family: All specs include this constraint
- ReLU floor check: If state has pitfall "relu_floor_at_low_activation", avoid weapon sweeps
- Evidence diversification: Prefer different experiment types than recently run
- Hypothesis coverage: Ensure all active hypotheses have at least one pending experiment
- Suppressor check: For single-family features, suggest token_influence_sweep to detect what the feature AVOIDS (critical for complete interpretation)
- Conditional sweep requirement: After confirming primary driver with 1D sweep, ALWAYS suggest 2D heatmaps for secondary abilities (see below)
⚠️ CRITICAL: Conditional 2D Sweep Logic
After a 1D sweep confirms a primary driver, 1D sweeps for other abilities are MISLEADING.
Why 1D Sweeps Fail for Secondary Abilities
When feature has strong primary driver (e.g., SCU):
- Most contexts have LOW primary → activation ≈ 0
- Secondary ability can't suppress zero
- 1D sweep averages across ALL contexts → delta ≈ 0
- False conclusion: "secondary has no effect"
Planner Logic
def suggest_next_experiments(state):
# Check if primary driver is confirmed
primary_confirmed = any(
h.status == "supported" and h.confidence > 0.7
for h in state.hypotheses
if "primary" in h.tags or "monotonic" in h.tags
)
if primary_confirmed:
primary_family = get_confirmed_primary(state)
correlated_abilities = get_overview_top_tokens(state, top_k=10)
# For EACH correlated ability, suggest 2D heatmap
for ability in correlated_abilities:
if ability != primary_family:
suggest_2d_heatmap(
family_x=primary_family,
family_y=ability,
rationale=f"Test conditional effect of {ability} at each {primary_family} level"
)
# ALWAYS suggest death-mitigation battery
for death_ability in ["quick_respawn", "special_saver", "comeback"]:
if not already_tested(state, primary_family, death_ability):
suggest_2d_heatmap(
family_x=primary_family,
family_y=death_ability,
rationale="Death-aversion test"
)
Template: Post-Primary Conditional Sweeps
After confirming primary driver {PRIMARY}, automatically suggest:
// Batch 1: Death-mitigation (REQUIRED)
[
{"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "quick_respawn"}},
{"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "special_saver"}},
{"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "comeback"}}
]
// Batch 2: Top correlated from overview (if not already primary)
[
{"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_1}"}},
{"type": "family_2d_heatmap", "variables": {"family_x": "{PRIMARY}", "family_y": "{SECONDARY_2}"}}
]
// Batch 3: Weapon interaction (optional)
[
{"type": "weapon_sweep", "variables": {"condition_family": "{PRIMARY}", "condition_rung": 29}}
]
Interpreting 2D Results
| 2D Result | Meaning | Label Update |
|---|---|---|
| Y suppresses at high X | Anti-synergy | Add "{Y}-averse" |
| Y enhances at high X | Synergy | Add "{Y}-synergistic" |
| Y flat across X levels | Spurious correlation | Remove from label |
| No data at high X + high Y | Mutual exclusion | Note in label |
Output Location
Specs are written to:
/mnt/e/mechinterp_runs/specs/{timestamp}__f{feature_id}__{type}.json
Example filenames:
20250607_142531__f18712__family-1d-sweep.json20250607_143022__f18712__frequent-itemsets.json20250607_143500__f18712__split-half.json
See Also
- mechinterp-state: Manage research state
- mechinterp-runner: Execute generated specs
- mechinterp-glossary-and-constraints: Domain rules
- mechinterp-summarizer: Process results
- mechinterp-ability-semantics: Ability semantic groupings for interpretation
- mechinterp-investigator: Full investigation workflow