Claude Code Plugins

Community-maintained marketplace


mechinterp-investigator

@cesaregarza/SplatNLP

Orchestrate a systematic research program to investigate and meaningfully label SAE features


SKILL.md

---
name: mechinterp-investigator
description: Orchestrate a systematic research program to investigate and meaningfully label SAE features
---

MechInterp Investigator

This skill guides a systematic investigation of SAE features to arrive at meaningful, non-trivial labels. It orchestrates the other mechinterp skills into a coherent research workflow.

Phase 0: Triage (ALWAYS START HERE)

Goal: Quickly filter out weak/auxiliary features that don't warrant deep investigation.

Time: 1-2 minutes

Many SAE features have minimal influence on model outputs. Triage identifies these early so you can skip expensive analysis.

Step 0.1: Check Decoder Weight Percentile

import torch

sae_path = '/mnt/e/dev_spillover/SplatNLP/sae_runs/run_20250704_191557/sae_model_final.pth'
sae_checkpoint = torch.load(sae_path, map_location='cpu', weights_only=True)
decoder_weight = sae_checkpoint['decoder.weight']  # [512, 24576]

# Get this feature's max absolute decoder weight
feature_decoder = decoder_weight[:, FEATURE_ID]
max_abs = torch.abs(feature_decoder).max().item()

# Compare to all features
all_max_abs = torch.abs(decoder_weight).max(dim=0).values
percentile = ((all_max_abs < max_abs).float().mean() * 100).item()

print(f"Feature {FEATURE_ID} decoder weight percentile: {percentile:.1f}%")
| Percentile | Action |
| --- | --- |
| < 10% | Likely weak; check overview structure |
| 10-25% | Borderline; overview decides |
| > 25% | Proceed to Phase 1 (Overview) |

Step 0.2: Quick Overview Check (if <10%)

If decoder percentile < 10%, run a quick overview:

poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {FEATURE_ID} --model ultra --top-k 10

Signs of clear structure (proceed to Phase 1):

  • One family dominates (>40% of breakdown)
  • Strong weapon concentration (>50% one weapon)
  • Clear binary ability pattern
  • Top PageRank token has score > 0.20

Signs of no structure (label as weak):

  • Family breakdown is flat (all <15%)
  • Weapons are diverse
  • Top PageRank score < 0.10
  • High sparsity (>99%) with no clear pattern

Triage Decision

Decoder percentile < 10% AND no clear structure in overview?
  │
  Yes → Label as "Weak/Aux Feature {ID}" and STOP
  │
  No → Proceed to Phase 1 (Overview)
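The decision flow above can be codified as a small helper. This is a sketch: the thresholds mirror the triage table and the structure signals in Step 0.2, and the overview dict keys (`top_family_share`, `top_weapon_share`, `top_pagerank`) are illustrative names, not fields of the actual overview CLI output.

```python
def triage(decoder_percentile, overview=None):
    """Phase 0 decision: 'proceed', 'run_overview', or 'label_weak'.

    decoder_percentile: this feature's max-abs decoder weight percentile (0-100).
    overview: optional dict with illustrative keys 'top_family_share',
              'top_weapon_share', 'top_pagerank' from a quick overview run.
    """
    if decoder_percentile >= 10:
        return "proceed"  # at or above the 10th percentile: go to Phase 1
    if overview is None:
        return "run_overview"  # below 10%: quick overview required first
    has_structure = (
        overview.get("top_family_share", 0) > 0.40   # one family dominates
        or overview.get("top_weapon_share", 0) > 0.50  # weapon concentration
        or overview.get("top_pagerank", 0) > 0.20      # strong top token
    )
    return "proceed" if has_structure else "label_weak"


print(triage(35.0))                         # strong decoder weight
print(triage(5.0))                          # weak: overview needed
print(triage(5.0, {"top_pagerank": 0.25}))  # weak but structured
print(triage(5.0, {"top_pagerank": 0.05}))  # weak, no structure
```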

Weak Feature Label Format

{
  "dashboard_name": "Weak/Aux Feature {ID}",
  "dashboard_category": "auxiliary",
  "dashboard_notes": "TRIAGE: Decoder weight {X}th percentile, no clear structure in overview. Skipped deep dive.",
  "hypothesis_confidence": 0.0,
  "source": "claude code (triage)"
}

When to Override Triage

Even with low decoder weights, proceed if:

  • The feature is part of a cluster you're investigating
  • You have external reason to believe it's important
  • You're doing exhaustive analysis of a subset

⚠️ Deep Dive Basics

A proper deep dive requires experiments, not just reading overview data. The overview shows correlations; experiments reveal causation.

Minimum Requirements for a Deep Dive

| Step | What to Do | Why |
| --- | --- | --- |
| 1. Overview | Run overview to see correlations | Generate hypotheses |
| 2. 1D Sweeps | Test top 3-5 families with 1D sweeps | Find causal drivers (scaling abilities) |
| 3. Binary Check | For binary abilities (Comeback, Stealth Jump, LDE, Haunt, etc.), check presence rate | Binary abilities show delta=0 in sweeps but may still be characteristic |
| 4. Bottom Tokens | Check suppressors from overview | What the feature AVOIDS is often more informative |
| 5. 2D Heatmaps | Test interactions between the primary driver and correlated tokens | Verify whether correlations are causal or spurious |
| 6. Kit Analysis | Check if core weapons share a sub/special/class pattern | Explains the "why" behind a build philosophy; determine causal vs spurious |

Binary Abilities Need Special Handling

Binary abilities (you have them or you don't) show delta=0 in 1D sweeps because there's no scaling. This does NOT mean they're unimportant.

Binary Abilities
Comeback, Stealth Jump, Last-Ditch Effort, Haunt, Ninja Squid, Respawn Punisher, Object Shredder, Drop Roller, Opening Gambit, Tenacity

To evaluate binary abilities:

  1. Check PageRank score (correlation strength)
  2. Check presence rate: What % of high-activation examples contain it?
  3. Compare mean activation WITH vs WITHOUT the binary token
  4. Run 2D heatmap: scaling_ability × binary_ability to see conditional effect

Binary Ability Analysis Protocol (CRITICAL)

Binary abilities can have strong conditional effects that ONLY show up in 2D analysis. Here's the exact methodology:

Step 1: Check presence rate enrichment

from splatnlp.mechinterp.skill_helpers import load_context
import polars as pl

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Find binary token ID
binary_id = None
for tok_id, tok_name in ctx.inv_vocab.items():
    if tok_name == 'comeback':  # or stealth_jump, etc.
        binary_id = tok_id
        break

# Calculate enrichment
threshold = df['activation'].quantile(0.90)  # Top 10%
high_df = df.filter(pl.col('activation') >= threshold)

with_binary_all = df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
with_binary_high = high_df.filter(pl.col('ability_input_tokens').list.contains(binary_id))

baseline_rate = len(with_binary_all) / len(df)
high_rate = len(with_binary_high) / len(high_df)
enrichment = high_rate / baseline_rate

print(f"Baseline presence: {baseline_rate:.1%}")
print(f"High-activation presence: {high_rate:.1%}")
print(f"Enrichment ratio: {enrichment:.2f}x")
# Enrichment > 1.5x suggests binary ability is characteristic

Step 2: Check mean activation WITH vs WITHOUT

with_binary = df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
without_binary = df.filter(~pl.col('ability_input_tokens').list.contains(binary_id))

mean_with = with_binary['activation'].mean()
mean_without = without_binary['activation'].mean()
delta = mean_with - mean_without

print(f"Mean WITH: {mean_with:.4f}")
print(f"Mean WITHOUT: {mean_without:.4f}")
print(f"Delta: {delta:+.4f}")
# Delta > 0.03 suggests meaningful effect

Step 3: Run 2D heatmap (MOST IMPORTANT)

Binary abilities can have conditional effects that vary by the scaling ability level:

# Manual 2D analysis for binary abilities
# (The built-in 2D heatmap may not handle binary tokens correctly)

scaling_ids = {3: 48, 6: 49, 12: 50, 21: 53, 29: 80}  # ISM example
binary_id = 27  # Comeback

print("Scaling | No Binary | With Binary | Delta")
print("-" * 50)

for level, tok_id in scaling_ids.items():
    level_df = df.filter(pl.col('ability_input_tokens').list.contains(tok_id))

    with_binary = level_df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
    without_binary = level_df.filter(~pl.col('ability_input_tokens').list.contains(binary_id))

    mean_with = with_binary['activation'].mean() if len(with_binary) > 0 else 0
    mean_without = without_binary['activation'].mean() if len(without_binary) > 0 else 0
    delta = mean_with - mean_without

    print(f"{level:>7} | {mean_without:>9.4f} | {mean_with:>11.4f} | {delta:>+.4f}")

Example (Feature 13352):

ISM × Comeback 2D Analysis:
ISM | No CB  | With CB | Delta
  0 | 0.066  | 0.117   | +0.051
  3 | 0.122  | 0.261   | +0.139
  6 | 0.147  | 0.352   | +0.205  ← PEAK INTERACTION
 12 | 0.094  | 0.163   | +0.069
 21 | 0.094  | 0.129   | +0.035

Interpretation: Comeback has STRONG conditional effect at ISM 3-6.
The +0.205 delta at ISM_6 means Comeback more than doubles the activation!
1D sweep showed delta=0 because most examples have ISM=0 (low baseline).

Step 4: Test combinations of binary abilities together

# Test multiple binary abilities together
binary_id_1 = 27  # e.g., comeback
binary_id_2 = 1   # e.g., stealth_jump

both = df.filter(
    pl.col('ability_input_tokens').list.contains(binary_id_1) &
    pl.col('ability_input_tokens').list.contains(binary_id_2)
)
neither = df.filter(
    ~pl.col('ability_input_tokens').list.contains(binary_id_1) &
    ~pl.col('ability_input_tokens').list.contains(binary_id_2)
)

print(f"Mean BOTH: {both['activation'].mean():.4f} (n={len(both)})")
print(f"Mean NEITHER: {neither['activation'].mean():.4f} (n={len(neither)})")

# Then repeat the 2D analysis at each scaling level -
# combinations can have stronger effects than individual abilities!

Key Insight: Binary abilities may have stronger effects when combined. Always test combinations, not just individual tokens.

Additional Learnings

  1. Conditional effects can be much stronger than marginal effects: A feature might show ISM with only 0.069 max_delta in 1D sweeps, but a binary ability combination at moderate ISM could produce +0.335 delta - the interaction effect can be 5x stronger than the marginal effect. 1D sweeps can dramatically underestimate a feature's true behavior.

  2. Depletion is informative: If a binary ability shows enrichment < 1.0 (e.g., 0.72x), the feature actively avoids that ability. This is meaningful for interpretation - it tells you what the feature excludes, not just what it includes.

  3. Manual 2D analysis required for binary tokens: The Family2DHeatmapRunner uses parse_token() which expects family_name_AP format, but binary abilities appear as just the token name (e.g., comeback not comeback_10). Use manual 2D analysis code for binary abilities (see protocol above).

  4. "Weak feature" needs decoder weight check: A feature with weak activation effects (max_delta < 0.03) might still have high influence on outputs. Remember: net influence = activation strength × decoder weight. Before labeling as "weak", check the feature's decoder weights to the output tokens it contributes to. A "weak activation" feature with high decoder weights may actually be important.

  5. Watch for error-correction features: If 1D sweeps show small deltas or effects only in unusual rung combinations, the feature may fire when prerequisites are MISSING (OOD detection). Test "explains-away" behavior by comparing activation when low-level evidence is present vs missing. Example: Does feature fire MORE when SCU_3 is absent from a high-SCU build?

  6. Beware of flanderization in top activations: The top 100 activations over-emphasize extreme cases. The TRUE concept often lives in the mid-activation range (25-75th percentile). Always compare mid vs top activation regions - if they show different weapon/ability patterns, label the mid-range concept and note the extremes as "super-stimuli".
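Learning 4's net-influence check (net influence = activation strength × decoder weight) can be sketched as follows. The decoder shape, random weights, and `mean_activation` value are placeholders for illustration, not values from the real checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
decoder_weight = rng.normal(size=(512, 100))  # stand-in for [512, 24576]
feature_id = 7

# A "weak activation" feature by the max_delta < 0.03 rule
mean_activation = 0.02

# Net influence on each output dimension = activation strength x decoder weight
net_influence = mean_activation * np.abs(decoder_weight[:, feature_id])
print(f"Max net influence: {net_influence.max():.4f}")

# A feature with a strong decoder column can matter despite weak activations,
# so compare its max-abs decoder weight against all features:
all_max = np.abs(decoder_weight).max(axis=0)
pct = float((all_max < all_max[feature_id]).mean() * 100)
print(f"Decoder weight percentile: {pct:.1f}%")
```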

What Counts as Evidence

| Evidence Type | Strength | Example |
| --- | --- | --- |
| 1D sweep max_delta > 0.05 | Strong causal | "ISM drives this feature" |
| 1D sweep max_delta 0.02-0.05 | Weak causal | "ISM has minor effect" |
| 1D sweep max_delta < 0.02 | Negligible | "ISM doesn't drive this" |
| Binary delta = 0 | Inconclusive | Needs a presence-rate check |
| High PageRank + low delta | Spurious correlation | Token co-occurs but doesn't cause |
| 2D heatmap shows conditional effect | Interaction confirmed | "X matters only when Y is high" |
| Bottom tokens (suppressors) | Avoidance pattern | "Feature avoids death-perks" |
| Higher activation when prerequisite MISSING | Error-correction | "Fires on OOD rung combos" |
| Mid-range (25-75%) differs from top | Flanderization | "Top is super-stimuli; label mid-range" |

Common Mistakes to Avoid

  1. Presenting overview as findings - Overview is hypotheses, not conclusions
  2. Ignoring binary abilities - Delta=0 doesn't mean unimportant
  3. Skipping bottom tokens - Suppressors reveal what feature avoids
  4. Only running 1D sweeps - 2D heatmaps needed for interaction effects
  5. Not checking weapon patterns - Feature may be weapon-specific, not ability-specific
  6. Using only top activations - Top activations (90%+ of max) may be "flanderized" extremes; check core region (25-75% of max)
  7. Missing error-correction features - Small deltas in weird rung combos may indicate OOD detection
  8. Confusing data sparsity with suppression - Zero examples at a condition ≠ "suppression to 0" (see below)
  9. Shallow validation - Just checking if numbers "look right" without running enrichment analysis
  10. Semantic contradictions in labels - e.g., "Zombie" (embraces death) + "high SSU" (avoids death) is contradictory
  11. Reporting weapon percentages from top-100 - Use top 20-30% instead; top-100 can be 5-10x off (e.g., 78% vs 10%)
  12. Not checking meta archetypes - Weapons may cluster by playstyle, not kit; use splatoon3-meta skill
  13. Assuming kit-based patterns - Check if weapons share sub/special BEFORE assuming it's kit-related
  14. Ignoring flanderization crossover - Note where a "super-stimulus" weapon overtakes the general pattern (usually 90%+ of max activation)

⚠️ CRITICAL: Data Sparsity vs Suppression

This is a common and dangerous mistake. When you see "activation = 0" or "no effect" at some condition, ask: Is this suppression or data sparsity?

Example of the mistake (Feature 1819):

Original claim: "QR is HARD SUPPRESSOR - SSU_57+QR_any=0.000"
Reality: There were ZERO examples with SSU_57 + any QR in the dataset!
         The "0.000" was missing data, not suppression.

How to detect data sparsity:

# ALWAYS check sample sizes when claiming suppression!
at_high_ssu = df.filter(pl.col('ability_input_tokens').list.contains(ssu_57_id))
with_qr = at_high_ssu.filter(pl.col('ability_input_tokens').list.set_intersection(qr_ids).list.len() > 0)

print(f"Examples at SSU_57 with QR: {len(with_qr)}")  # If 0, this is SPARSITY not suppression!

Rule: Never claim "suppression" unless you have ≥20 examples in the suppressed condition. Report sample sizes with all claims.
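The rule can be enforced mechanically before any "suppression" claim goes into notes. A minimal sketch; the `MIN_N` constant matches the rule above, and the report wording is an assumption:

```python
MIN_N = 20  # minimum sample size before "suppression" may be claimed

def describe_condition(mean_act, n):
    """Report a condition's activation without mistaking sparsity for suppression."""
    if n == 0:
        return "NO DATA: zero matching examples - sparsity, not suppression"
    if n < MIN_N:
        return f"INSUFFICIENT DATA (n={n} < {MIN_N}): cannot claim suppression"
    return f"mean={mean_act:.4f} (n={n}): enough data to interpret"

print(describe_condition(None, 0))    # the Feature 1819 trap
print(describe_condition(0.001, 8))   # too few examples
print(describe_condition(0.001, 450)) # safe to interpret
```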

Philosophy

A meaningful label should capture:

  • What concept the feature encodes (not just "detects token X")
  • Why the model might have learned this representation
  • How it relates to strategic/tactical gameplay

Avoid trivial labels like:

  • "SCU Detector" (just describes token presence)
  • "High activation feature" (describes statistics, not meaning)

Aim for interpretable labels like:

  • "Aggressive Slayer Build" (strategic concept)
  • "Special Spam Enabler" (functional role)
  • "Backline Support Kit" (playstyle archetype)

Investigation Workflow

Phase 0: Triage

See Phase 0: Triage above. Always start here.

If the feature passes triage (decoder weight ≥ 10th percentile OR clear structure in the overview), proceed to Phase 1.

Phase 1: Initial Assessment

Run the overview and classify the feature type:

poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 20

Classify based on family breakdown:

| Pattern | Type | Next Steps |
| --- | --- | --- |
| One family >40% | Single-family | Check for interference, weapon specificity |
| Top 2-3 families ~20% each | Multi-family | Check synergy/redundancy, build archetype |
| Many families <15% each | Distributed | Look for a meta-pattern or weapon class |
| Weapons concentrated | Weapon-specific | Weapon sweep, class analysis |

CRITICAL: Always check for non-monotonic effects! Higher AP doesn't always mean higher activation.
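A quick way to check: compute the mean activation per AP rung for the candidate family and look for an interior peak. The rung-to-activation numbers below are invented for illustration:

```python
# Mean activation per AP rung of one family (toy numbers, not real sweep output)
mean_by_ap = {0: 0.066, 3: 0.122, 6: 0.147, 12: 0.094, 21: 0.094, 29: 0.071}

levels = sorted(mean_by_ap)
values = [mean_by_ap[ap] for ap in levels]
deltas = [b - a for a, b in zip(values, values[1:])]

# Monotonic = activation only rises (or only falls) as AP increases
monotonic = all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas)
peak_ap = max(mean_by_ap, key=mean_by_ap.get)

print(f"Monotonic: {monotonic}, peak at AP {peak_ap}")
# A non-monotonic profile with an interior peak (here AP 6) means the feature
# encodes a *band* of investment, not "more is better".
```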

Phase 1.5: Activation Region Analysis (CRITICAL - Anti-Flanderization)

Don't only examine extreme activations! High activations may be "flanderized" - exaggerated, extreme versions of the true concept that over-emphasize niche cases.

Key insight: The TRUE concept often lives in the core region (25-75% of effective max), not the top examples. Top activations (90%+ of effective max) can mislead you into labeling a niche pattern instead of the general concept.

Why "effective max"? Activation distributions are heavy-tailed. Using effective_max = 99.5th percentile of nonzero activations prevents single outliers from making the core region nearly empty.

Run activation region analysis:

from splatnlp.mechinterp.skill_helpers import load_context
import numpy as np
from collections import Counter

ctx = load_context("{MODEL}")
df = ctx.db.get_all_feature_activations_for_pagerank({FEATURE_ID})

acts = df['activation'].to_numpy()
weapons = df['weapon_id'].to_list()

# Use EFFECTIVE MAX (99.5th percentile) to handle heavy-tailed distributions
# This prevents single outliers from making the core region nearly empty
nonzero_acts = acts[acts > 0]
effective_max = np.percentile(nonzero_acts, 99.5)
true_max = acts.max()
print(f"True max: {true_max:.4f}, Effective max (99.5%ile): {effective_max:.4f}")

# Define activation regions as % of EFFECTIVE max
regions = [
    ('Floor (≤1%)', lambda a: a <= 0.01 * effective_max),
    ('Low (1-10%)', lambda a: 0.01 * effective_max < a <= 0.10 * effective_max),
    ('Below Core (10-25%)', lambda a: 0.10 * effective_max < a <= 0.25 * effective_max),
    ('Core (25-75%) - TRUE CONCEPT', lambda a: 0.25 * effective_max < a <= 0.75 * effective_max),
    ('High (75-90%)', lambda a: 0.75 * effective_max < a <= 0.90 * effective_max),
    ('Flanderization Zone (90%+)', lambda a: a > 0.90 * effective_max),
]

for region_name, filter_fn in regions:
    indices = [i for i, a in enumerate(acts) if filter_fn(a)]
    weps = [weapons[i] for i in indices]
    print(f"\n{region_name} (n={len(indices)}):")
    for wep, count in Counter(weps).most_common(5):
        name = ctx.id_to_weapon_display_name(wep)
        print(f"  {name}: {count}")

Key signals to look for:

| Pattern | Interpretation |
| --- | --- |
| Same weapons in ALL regions | General concept (continuous feature) |
| Different weapons in core vs 90%+ | Super-stimuli detected |
| Diverse weapons in core, concentrated in 90%+ | True concept is in the core region |
| Niche weapons only in 90%+ | High activations are "flanderized" extremes |

Example (Feature 9971):

Core (25-75%): Splattershot (115), Wellstring (65), Sploosh (57)...
Flanderization (90%+): Bloblobber (44), Glooga Deco (39), Range Blaster (28)

Interpretation: Core region shows GENERAL offensive investment.
Flanderization zone shows EXTREME SCU on special-dependent weapons (super-stimuli).
Label the general concept, note the super-stimuli pattern.

CRITICAL: Always check the Bottom Tokens (Suppressors) section! Tokens that rarely appear in high-activation examples can reveal what the feature avoids:

| Suppressor Pattern | Interpretation |
| --- | --- |
| Death-mitigation (QR, SS, CB) suppressed | Feature avoids "death-accepting" builds |
| Defensive (IR, SR) suppressed | Feature prefers aggressive/ranged builds |
| Mobility suppressed | Feature prefers stationary/positional play |
| Special abilities suppressed | Feature encodes a non-special playstyle |
Example: If SCU is enhanced but quick_respawn, special_saver, and comeback are ALL suppressed, the feature doesn't just detect "SCU" - it detects "death-averse SCU builds" (players who stack SCU but don't plan to die).
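The suppressor check can be sketched by comparing each token's presence rate in high-activation examples against its baseline rate. The (activation, token-set) rows below are toy data standing in for the real activation table, and the 0.5x / 1.5x cutoffs are assumptions:

```python
# Toy rows: (activation, set of ability tokens present in the build)
rows = [
    (0.90, {"scu_20", "ism_10"}),
    (0.85, {"scu_20"}),
    (0.80, {"scu_20", "ism_10"}),
    (0.10, {"quick_respawn", "comeback"}),
    (0.05, {"quick_respawn"}),
    (0.02, {"comeback"}),
]

threshold = sorted(a for a, _ in rows)[len(rows) // 2]  # crude top-half cutoff
high = [toks for a, toks in rows if a >= threshold]

def presence_ratio(token):
    """Presence rate in high-activation examples relative to baseline."""
    base = sum(token in toks for _, toks in rows) / len(rows)
    top = sum(token in toks for toks in high) / len(high)
    return top / base if base else float("nan")

for tok in ["scu_20", "quick_respawn", "comeback"]:
    r = presence_ratio(tok)
    tag = "suppressed" if r < 0.5 else ("enriched" if r > 1.5 else "neutral")
    print(f"{tok}: {r:.2f}x ({tag})")
```

Here `scu_20` is enriched while `quick_respawn` and `comeback` are absent from high activations, matching the "death-averse SCU builds" reading above.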

Phase 1.6: Weapon Distribution Analysis (CRITICAL - Anti-Flanderization)

NEVER report weapon percentages from top-100 samples. Top-100 is severely flanderized and can give wildly misleading weapon distributions.

Example (Feature 14096 - Real Case):

Top 100:     Dark Tetra 78%, Stamper 20%  ← WRONG, flanderized
Top 10%:     Stamper 35%, Dark Tetra 21%  ← Better but still skewed
Top 30%:     Stamper 23%, Dark Tetra 10%  ← TRUE CONCEPT
Full dataset: Stamper 9%, Dark Tetra 3.5% ← Includes noise/floor

Use top 20-30% for weapon characterization:

import polars as pl
import numpy as np
from collections import Counter
from splatnlp.mechinterp.skill_helpers import load_context

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Get percentile thresholds
acts = df['activation'].to_numpy()
thresholds = {p: np.percentile(acts, p) for p in [0, 50, 70, 90, 99]}

# Analyze by region
regions = [
    ("Bottom 50% (noise)", 0, 50),
    ("50-70% (weak)", 50, 70),
    ("Top 30% (TRUE CONCEPT)", 70, 100),
    ("Top 10%", 90, 100),
    ("Top 1% (flanderized)", 99, 100),
]

print("Region | Top Weapons")
print("-" * 60)

for name, p_low, p_high in regions:
    t_low, t_high = thresholds[p_low], thresholds.get(p_high, float('inf'))
    if p_high == 100:
        region_df = df.filter(pl.col('activation') >= t_low)
    else:
        region_df = df.filter((pl.col('activation') >= t_low) & (pl.col('activation') < t_high))

    if len(region_df) == 0:
        continue

    weapon_counts = region_df.group_by('weapon_id').agg(
        pl.col('activation').count().alias('n')
    ).sort('n', descending=True)

    top3 = []
    for row in weapon_counts.head(3).iter_rows(named=True):
        wname = ctx.id_to_weapon_display_name(row['weapon_id'])
        pct = row['n'] / len(region_df) * 100
        top3.append(f"{wname[:12]}({pct:.0f}%)")

    print(f"{name:<25} | {', '.join(top3)}")

Interpretation Guide:

| Pattern | Meaning |
| --- | --- |
| Same weapons in top-30% and top-1% | Continuous feature, no flanderization |
| Different weapons in top-30% vs top-1% | Flanderization detected; label the top-30% concept |
| One weapon jumps from 10% to 70%+ | That weapon is a "super-stimulus" for the feature |
| Weapons consistent 50% → 30% → 10% → 1% | Stable feature, safe to use any region |

Rule: Report weapon percentages from top 20-30%, note if top-1% differs significantly.

Phase 1.6.5: Ability Flanderization Check (CRITICAL)

The same flanderization that applies to weapons applies to abilities. A binary ability with high tail enrichment but low core coverage is a super-stimulus, not the core concept.

The Rule: If a "dominant" driver has <30% core coverage, it's a tail marker, not the headline concept.

Use the core coverage experiment:

cd /root/dev/SplatNLP

# Direct subcommand (recommended)
poetry run python -m splatnlp.mechinterp.cli.runner_cli coverage \
    --feature-id {FEATURE_ID} --model ultra \
    --tokens respawn_punisher,comeback,stealth_jump \
    --threshold 0.30

Output tables:

  • token_coverage: Shows core_coverage_pct, tail_enrichment, is_tail_marker for each token
  • weapon_coverage: Shows core vs tail weapon distributions (catches weapon flanderization)

Coverage Interpretation:

Core Coverage Interpretation Label Implication
>50% Primary driver Safe to headline
30-50% Significant but not universal Mention in notes, not headline
<30% Tail marker / super-stimulus NOT the headline concept

Example (Feature 13934):

respawn_punisher: 8.57x tail enrichment, BUT only 12% core coverage
→ RP is a super-stimulus, NOT the core concept
→ Wrong label: "RP Backline Anchor"
→ Right approach: Split core by RP presence to reveal hidden modes

When you find a super-stimulus (<30% coverage):

  1. Split the core by presence/absence of the super-stimulus
  2. Analyze both modes separately
  3. Look for what they have in COMMON (the true concept)
  4. Label the commonality, note the super-stimulus as a tail marker
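Steps 1-3 can be sketched on toy core-region rows. The weapon names and tokens here are illustrative stand-ins for the real core-region data:

```python
from collections import Counter

# Toy core-region rows: (weapon, set of ability tokens)
core_rows = [
    ("Hydra Splatling", {"respawn_punisher", "ink_resist_3"}),
    ("E-liter 4K", {"ink_resist_3"}),
    ("Hydra Splatling", {"ink_resist_3"}),
    ("E-liter 4K", {"respawn_punisher", "ink_resist_3"}),
    ("Splat Charger", {"ink_resist_3"}),
]
super_stimulus = "respawn_punisher"  # token with <30% core coverage

# Split the core by presence/absence of the super-stimulus
with_ss = [w for w, toks in core_rows if super_stimulus in toks]
without_ss = [w for w, toks in core_rows if super_stimulus not in toks]

print(f"With {super_stimulus}: {Counter(with_ss).most_common()}")
print(f"Without: {Counter(without_ss).most_common()}")

# The commonality across both modes is the true concept to label
common = set(with_ss) & set(without_ss)
print(f"Shared weapons (label the commonality): {sorted(common)}")
```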

Phase 1.7: Meta-Informed Weapon Analysis (USE AFTER WEAPON SWEEP)

After identifying top weapons, always check if they match a known meta archetype using the splatoon3-meta skill.

Step 1: Look up weapon kits

Check references/weapons.md for each top weapon's sub and special:

# Top weapons from Feature 14096 (top 30%):
kits = {
    "Splatana Stamper": ("Burst Bomb", "Zipcaster"),
    "Dark Tetra Dualies": ("Autobomb", "Reefslider"),
    "Glooga Dualies": ("Splash Wall", "Booyah Bomb"),
    "Dapple Dualies Nouveau": ("Torpedo", "Reefslider"),
    "Splatana Wiper": ("Torpedo", "Ultra Stamp"),
}

# Check for shared subs/specials
from collections import Counter
subs = Counter(k[0] for k in kits.values())
specials = Counter(k[1] for k in kits.values())

# If one sub/special dominates → kit-based feature
# If diverse → playstyle-based feature

Step 2: Check archetype reference

Read references/archetypes.md to see if weapons match a known archetype:

| Archetype | Key Weapons | Signature Abilities |
| --- | --- | --- |
| Zombie Slayer | Tetra Dualies, Splatana Wiper | QR + Comeback + Stealth Jump |
| Stealth Slayer | Carbon Roller, Inkbrush | Ninja Squid + SSU + Stealth Jump |
| Anchor/Backline | E-liter, Hydra Splatling | Respawn Punisher + Object Shredder |
| Support/Beacon | Squid Beakon weapons | Sub Power Up + ISS + Comeback |

Step 3: Classification decision

Kit Analysis Result:
├─ Shared sub weapon? → Feature may encode SUB PLAYSTYLE
├─ Shared special? → Feature may encode SPECIAL FARMING
├─ No kit pattern + archetype match? → PLAYSTYLE FEATURE (label as archetype)
└─ No kit pattern + no archetype? → WEAPON CLASS feature (check if all dualies, all shooters, etc.)

Example (Feature 14096):

Top 30% weapons: Stamper, Dark Tetra, Glooga, Dapple, Wiper
Kit analysis: Diverse subs (Burst, Auto, Splash Wall, Torpedo), diverse specials
Archetype check: Dark Tetra + Splatana Wiper = "Zombie Slayer" archetype!
Conclusion: PLAYSTYLE feature encoding Zombie Slayer (death-accepting aggressive)
Label: "Zombie Slayer QR (Splatana/Dualies)" - tactical category
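The Step 3 decision tree can be codified as a small helper. The >50% concentration thresholds are assumptions for illustration, not project constants:

```python
from collections import Counter

def classify_kit_pattern(kits, archetype_match):
    """kits: {weapon: (sub, special)}; returns a feature-type guess."""
    n = len(kits)
    top_sub = Counter(sub for sub, _ in kits.values()).most_common(1)[0][1]
    top_special = Counter(sp for _, sp in kits.values()).most_common(1)[0][1]
    if top_sub / n > 0.5:
        return "sub_playstyle"      # shared sub weapon dominates
    if top_special / n > 0.5:
        return "special_farming"    # shared special dominates
    return "playstyle_archetype" if archetype_match else "weapon_class"

# Diverse subs/specials + Zombie archetype match, as in the example above:
kits = {
    "Splatana Stamper": ("Burst Bomb", "Zipcaster"),
    "Dark Tetra Dualies": ("Autobomb", "Reefslider"),
    "Splatana Wiper": ("Torpedo", "Ultra Stamp"),
}
print(classify_kit_pattern(kits, archetype_match=True))  # playstyle_archetype
```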

When to invoke splatoon3-meta skill:

  • After weapon_sweep shows concentrated weapon pattern
  • When top weapons seem unrelated by kit but share a playstyle
  • To validate that ability patterns match expected meta builds
  • To identify if weapons share archetype despite different kits

Phase 1.7.5: Kit Component Analysis (OPTIONAL but Recommended)

When to use: After weapon sweep, check if the core weapons share patterns in ANY kit component: sub weapon, special weapon, or main weapon class. This can reveal WHY certain build philosophies emerge.

Key insight: Weapons may cluster by:

  • Sub weapon (Burst Bomb users, Beakon users → explains SPU/ISS builds)
  • Special weapon (Aggressive push specials → explains survival builds)
  • Main weapon class (All dualies, all chargers → explains mobility/positioning builds)

The feature may be driven by ONE of these - identify which, then determine if it's causal or spurious.


Component 1: Sub Weapon Pattern Analysis

When relevant: If kit_sweep (Phase 1.7/3d) shows sub concentration, investigate further.

from collections import Counter

# Map top weapons to their subs (from weapons.md)
weapon_subs = {
    "Splattershot Jr.": "Splat Bomb",
    "Neo Splash-o-matic": "Suction Bomb",
    "Sploosh-o-matic 7": "Splat Bomb",
    # ... add more as needed
}

# Categorize subs
sub_categories = {
    # Lethal bombs
    "Splat Bomb": "lethal", "Suction Bomb": "lethal", "Burst Bomb": "lethal",
    "Curling Bomb": "lethal", "Autobomb": "lethal", "Torpedo": "lethal",
    "Fizzy Bomb": "lethal", "Ink Mine": "lethal",
    # Utility/Support
    "Squid Beakon": "utility", "Splash Wall": "utility", "Sprinkler": "utility",
    "Point Sensor": "utility", "Angle Shooter": "utility", "Toxic Mist": "utility",
}

# Count categories over the weapons from the sweep
top_weapons = list(weapon_subs)  # or your weapon-sweep results
sub_counts = Counter()
for weapon in top_weapons:
    sub = weapon_subs.get(weapon)
    if sub:
        sub_counts[sub_categories.get(sub, "other")] += 1

print("Sub Weapon Breakdown:")
for sub, count in Counter(weapon_subs[w] for w in top_weapons if w in weapon_subs).most_common():
    print(f"  {sub}: {count}")
print("Category Breakdown:")
for cat, count in sub_counts.most_common():
    print(f"  {cat}: {count}")

Sub pattern implications:

| Sub Pattern | Build Implication | Example |
| --- | --- | --- |
| Shared Beakons | SPU/ISS focus for sub spam | Beacon Support builds |
| Shared Burst Bomb | Mobility + burst damage | Aggressive flanker builds |
| Shared Splash Wall | Positional/defensive play | Lane-control builds |
| Diverse subs | Sub is NOT the clustering factor | Check special or main class |

Component 2: Special Weapon Pattern Analysis

When relevant: After weapon sweep, check if core weapons share a special weapon pattern.

from collections import Counter

# Map top weapons to their specials (from weapons.md)
weapon_specials = {
    "Splatana Stamper": "Zipcaster",
    "Sloshing Machine": "Booyah Bomb",
    "Squeezer": "Trizooka",
    # ... add more as needed
}

# Categorize specials
special_categories = {
    # Zoning/Area Denial
    "Ink Storm": "zoning", "Wave Breaker": "zoning", "Tenta Missiles": "zoning",
    "Killer Wail 5.1": "zoning", "Triple Inkstrike": "zoning",
    # Team Support
    "Tacticooler": "team_support", "Big Bubbler": "team_support",
    "Splattercolor Screen": "team_support",
    # Aggression/Push
    "Trizooka": "aggression", "Crab Tank": "aggression", "Inkjet": "aggression",
    "Ultra Stamp": "aggression", "Booyah Bomb": "aggression", "Reefslider": "aggression",
    "Kraken Royale": "aggression", "Zipcaster": "aggression",
    # Utility/Defense
    "Ink Vac": "utility", "Super Chump": "utility", "Triple Splashdown": "utility",
}

# Count categories
category_counts = Counter()
for weapon in top_weapons:
    special = weapon_specials.get(weapon)
    if special:
        category = special_categories.get(special, "other")
        category_counts[category] += 1

print("Special Category Breakdown:")
for cat, count in category_counts.most_common():
    print(f"  {cat}: {count/sum(category_counts.values())*100:.0f}%")

Special pattern implications:

| Special Pattern | Build Implication | Example |
| --- | --- | --- |
| >60% aggression | Players build for survival to deploy push specials | Feature 14964 |
| >60% zoning | Players may invest in SCU/SPU for area-denial uptime | Ink Storm spam |
| >50% team_support | Team-oriented builds; may see Tenacity/CB | Support kit |
| Diverse specials | Special is NOT the clustering factor | Check sub or main class |

Component 3: Main Weapon Class Pattern Analysis

When relevant: If weapons seem diverse but may share a class (all shooters, all dualies, all chargers).

# Weapon class mapping (from weapon-vibes.md)
weapon_classes = {
    "Splattershot": "shooter", "Splattershot Jr.": "shooter", "Splattershot Pro": "shooter",
    "Dark Tetra Dualies": "dualie", "Dapple Dualies": "dualie", "Splat Dualies": "dualie",
    "E-liter 4K": "charger", "Splat Charger": "charger", "Goo Tuber": "charger",
    "Luna Blaster": "blaster", "Range Blaster": "blaster", "Rapid Blaster": "blaster",
    "Hydra Splatling": "splatling", "Mini Splatling": "splatling",
    "Splatana Stamper": "splatana", "Splatana Wiper": "splatana",
    # ... add more as needed
}

# Count classes
class_counts = Counter(weapon_classes.get(w, "other") for w in top_weapons)

print("Weapon Class Breakdown:")
for cls, count in class_counts.most_common():
    pct = count / len(top_weapons) * 100
    print(f"  {cls}: {pct:.0f}%")

Class pattern implications:

| Class Pattern | Build Implication | Example |
| --- | --- | --- |
| >60% dualies | Mobility-focused, dodge-roll builds | SSU + QSJ synergy |
| >60% chargers | Positioning, low death tolerance | Anchor builds |
| >60% blasters | Burst damage, trade-happy | QR + Comeback synergy |
| >60% splatlings | Charge management, lane holding | ISM + positioning |
| Diverse classes | Class is NOT the clustering factor | Check sub or special |

Step 4: Determine if Pattern is CAUSAL or SPURIOUS

This is the critical step. A strong pattern in ANY component could be causal or spurious.

| Pattern Type | Evidence | Implication |
| --- | --- | --- |
| CAUSAL | Kit component explains the build philosophy | Include in the label rationale |
| SPURIOUS | Weapons share other traits that better explain the clustering | Don't emphasize that component |

Questions to determine causality:

  1. Does the kit component align with decoder output?

    • Decoder promotes SCU/SS/SPU + aggressive specials → Special farming is likely causal
    • Decoder promotes ISS/SPU + shared sub weapon → Sub spam is likely causal
    • Decoder promotes SSU/QSJ + all dualies → Weapon class mobility is likely causal
  2. Do weapons share OTHER traits that better explain the clustering?

    • All dualies with aggressive specials → Is it the CLASS or the SPECIAL?
    • Test: Do other dualies (without aggressive specials) also cluster here?
  3. Does the build philosophy make sense for this kit component?

    • Survival builds + aggressive specials → "Stay alive to use push special" (causal)
    • Mobility builds + all dualies → "Dualies need SSU for dodge-roll play" (causal)
    • Survival builds + diverse subs/specials + all chargers → "Chargers can't trade" (class is causal)

Example Analysis (Special-driven):

Feature 14964 special breakdown: 77% aggression (Zipcaster, Booyah Bomb, Trizooka)
Build philosophy: "Balanced utility spread for survival"

Analysis:
- Decoder suppresses death-trading (Comeback, RP) ✓
- Decoder promotes survival abilities (SS, ISM) ✓
- Weapons have LOW-MED death tolerance ✓
- Weapons have aggressive push specials ✓
- Sub weapons are DIVERSE (no pattern)
- Weapon classes are DIVERSE (shooters, slosher, splatana)

Conclusion: CAUSAL - Players build for survival BECAUSE they have aggressive specials
           that require staying alive to deploy effectively.

Note: "Core weapons have aggressive push specials (77%) requiring survival to deploy"

Example Analysis (Class-driven):

Feature shows: 80% dualies (Dark Tetra, Dapple, Dualie Squelchers)
Decoder promotes: SSU, QSJ, RSU (mobility family)

Analysis:
- Specials are DIVERSE (not the driver)
- Subs are DIVERSE (not the driver)
- All weapons are DUALIES with dodge-roll mechanics ✓
- Dualies benefit uniquely from SSU for roll distance/recovery

Conclusion: CAUSAL - Dualies cluster because dodge-roll playstyle needs mobility
           The feature encodes "dualie mobility optimization"

Counter-example (Spurious):

Feature has 70% aggression specials
But: All weapons are CLOSE-range SLAYER with HIGH death tolerance
And: Decoder promotes QR, Comeback (death-trading)

Conclusion: SPURIOUS - Weapons are aggressive slayers who happen to have aggressive specials
           The special type is incidental to the slayer playstyle.
           Primary driver is ROLE (slayer), not KIT.
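The three causality questions above can be distilled into a rough decision rule (a sketch — the inputs are judgment calls you make from the decoder and weapon evidence, not computed values):

```python
def classify_kit_pattern(aligns_with_decoder, better_shared_trait, philosophy_fits):
    """Rough decision rule for the kit-pattern causality check.

    aligns_with_decoder: the kit component matches what the decoder promotes
    better_shared_trait: weapons share ANOTHER trait that explains clustering better
    philosophy_fits:     the build philosophy makes sense for this kit component
    """
    if better_shared_trait:
        return "SPURIOUS"
    if aligns_with_decoder and philosophy_fits:
        return "CAUSAL"
    return "INCONCLUSIVE"
```

Note the ordering: a better-fitting shared trait (like the slayer counter-example above) overrides decoder alignment.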

Step 5: Record findings in notes

If pattern is CAUSAL, add to dashboard_notes:

KIT PATTERN: {component} - {X}% {category/type} ({list top examples}).
INTERPRETATION: [Why this explains the build philosophy]

If pattern is SPURIOUS, note briefly:

KIT PATTERN: Diverse/incidental. Weapons cluster by [range/role/playstyle], not kit.

When to skip this phase:

  • Feature is clearly mechanical (single ability stacker like "SCU_57 threshold")
  • Weapons are highly diverse with no concentration in any component
  • Earlier analysis already identified clear driver (e.g., single weapon dominance)

Phase 1.8: Weapon Range/Role Classification (REQUIRED for Labels)

Before proposing any label, you MUST classify the feature's weapons by range and role. This prevents incorrect role assumptions (e.g., calling Jr./Rapid Blasters "anchors" when they're midrange).

Step 1: Extract properties for top 5-10 core weapons from weapon-vibes.md

| Property | Values | Label Implication |
|---|---|---|
| RANGE | CLOSE, MID, LONG, SNIPER | Determines qualifier |
| LANE | FRONT, MID, BACK, FLEX | Confirms positioning |
| JOB | SLAYER, SUPPORT, ANCHOR, SKIRMISH, ASSASSIN | Determines role word |
| NS_FIT | CORE, GOOD, MEH, BAD, NO | Stealth vs visible |
| DEATH_TOL | HIGH, MED, LOW | Trading vs survival |

Step 2: Find the common pattern

If most weapons share:

  • LONG/SNIPER + BACK + ANCHOR → use "Anchor" or "Backline" qualifier
  • MID/LONG + MID + SKIRMISH/SUPPORT → use "Midrange" qualifier
  • CLOSE/MID + FRONT + SLAYER → use "Slayer" or "Frontline" qualifier
  • NO/BAD NS_FIT + LOW DEATH_TOL → "Visible" or "Positional" concept (not stealth, not trading)

Step 3: Record in notes

Always include weapon classification in dashboard_notes:

WEAPON ROLE: Midrange (MID-LONG range, SKIRMISH/SUPPORT jobs, NO/BAD NS fit, LOW death tolerance)
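The range/job-to-qualifier mapping in Step 2 can be written down directly (a minimal sketch using the property values from weapon-vibes.md; the real classification also weighs LANE and NS_FIT):

```python
def role_qualifier(range_, job):
    """Map a core weapon's RANGE/JOB pattern to a label qualifier."""
    if range_ in ("LONG", "SNIPER") and job == "ANCHOR":
        return "Anchor"
    if range_ in ("MID", "LONG") and job in ("SKIRMISH", "SUPPORT"):
        return "Midrange"
    if range_ in ("CLOSE", "MID") and job == "SLAYER":
        return "Slayer"
    return None  # no clean pattern - investigate further
```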

Phase 2: Hypothesis Generation

Based on Phase 1, generate hypotheses about what the feature might encode:

For single-family dominated features:

  • H1: Pure token detector (trivial - try to disprove)
  • H2: Threshold detector (activates only at high AP)
  • H3: Interaction detector (family + something else)
  • H4: Weapon-conditional (family matters only for certain weapons)

For multi-family features:

  • H1: Synergy detector (families work together)
  • H2: Build archetype (strategic loadout pattern)
  • H3: Playstyle indicator (aggressive, defensive, support)
  • H4: Shared NEED (different builds solving the same tactical problem)

Build NEED Framework (For Multi-Modal/Diffuse Features)

When a feature activates on seemingly different build types, ask: "What NEED do these builds share?"

Features can encode solutions to problems, not just correlations. Different builds may trigger the same feature because they're different answers to the same question.

Step 1: Identify the tactical constraint these builds solve

| Question | Example |
|---|---|
| What gameplay problem do these builds address? | "How to handle death for low-death-tolerance weapons" |
| What enemy behavior are they countering? | "Dealing with aggressive flankers" |
| What win condition are they enabling? | "Special pressure" or "Map control" |

Step 2: Check weapon properties (use splatoon3-meta)

Compare enriched weapons on these axes from weapon-vibes.md:

  • Ink feel: STARVING / HUNGRY / AVERAGE / EFFICIENT / PAINTER
  • Range: MELEE / CLOSE / MID / LONG / SNIPER
  • Ninja Squid affinity: CORE / GOOD / MEH / BAD / NO
  • Death tolerance: HIGH / MED / LOW
  • Role: SLAYER / SUPPORT / ANCHOR / SKIRMISH / ASSASSIN

If all enriched weapons share properties (e.g., all HUNGRY ink + NO ninja squid + LOW death tolerance), the feature may encode a need specific to that weapon class.
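The shared-property check is mechanical: collect the axis values for each enriched weapon and keep the axes on which all agree (a sketch; weapon dicts here are hypothetical stand-ins for the weapon-vibes.md rows):

```python
def shared_properties(weapons):
    """Return {axis: value} for every axis on which all weapons agree."""
    if not weapons:
        return {}
    shared = {}
    for axis, value in weapons[0].items():
        if all(w.get(axis) == value for w in weapons[1:]):
            shared[axis] = value
    return shared
```

Any axis surviving this intersection is a candidate for the feature's shared NEED.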

Step 3: Reframe the modes as "answers to the same question"

Example (Feature 13934):

Mode A (12%): RP anchor builds (E-liter) - "I won't die, make their deaths hurt"
Mode B (88%): Zombie utility builds (DS) - "I will die sometimes, optimize respawns"

Shared NEED: "Death management for non-stealth, low-death-tolerance, midrange+ weapons"
Both modes are VALID ANSWERS to the same tactical question.

Step 4: Label the NEED, not the modes

Instead of: "Mixed: Zombie + RP Anchor" (describes the modes)
Label as: "Balanced Utility Axis (Non-Stealth Midline+)" (describes the need)

Key Insight: The model learned that these seemingly different builds share a common requirement. The feature encodes that requirement, and the modes are just different implementations.

For weapon-specific features:

  • H1: Weapon class pattern (all shooters, all chargers, etc.)
  • H2: Meta build (optimal loadout for that weapon)
  • H3: Weapon-ability interaction

Phase 3: Targeted Experiments

Run experiments to test hypotheses. Available experiment types:

| Type | Purpose |
|---|---|
| family_1d_sweep | Activation across AP rungs for one family |
| family_2d_heatmap | Interaction between two families |
| within_family_interference | Detect error correction within a family |
| weapon_sweep | Activation by weapon (optionally conditioned on family) |
| weapon_group_analysis | Compare high vs low activation by weapon |
| pairwise_interactions | Synergy/redundancy between tokens |
| token_influence_sweep | Identify enhancers and suppressors across all tokens |

⚠️ CRITICAL: Iterative Conditional Testing Protocol

1D sweeps can be MISLEADING for secondary abilities. When a feature has a strong primary driver:

The Problem

1D sweep for secondary ability (e.g., QR) across ALL contexts might show delta ≈ 0

Why this happens:

  • Most contexts have LOW primary driver (e.g., low SCU) → activation already near zero
  • Secondary ability can't suppress what's already zero
  • The few high-primary contexts get drowned out in the average

Example (Feature 18712):

QR 1D sweep (all contexts): mean_delta = -0.0006 → "QR has no effect" ❌ WRONG!
SCU × QR 2D heatmap:
  - At SCU_15: QR_0=0.13, QR_12=0.04 → QR suppresses 70%! ✅
  - At SCU_29: QR_0=0.15, QR_12=0.04 → QR suppresses 74%! ✅
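The masking effect is easy to reproduce with toy numbers (hypothetical values, not real sweep output): when 98% of contexts have low primary, a 70% conditional suppressor still averages out to almost nothing in an unconditional 1D sweep.

```python
def mean_delta(contexts, apply_secondary):
    """Unconditional 1D-sweep delta: average effect of adding the secondary."""
    deltas = [apply_secondary(c) - c["activation"] for c in contexts]
    return sum(deltas) / len(deltas)

def with_secondary(ctx):
    # Toy model: the secondary suppresses 70% of activation,
    # but only when the primary driver is high.
    return ctx["activation"] * 0.3 if ctx["primary_high"] else ctx["activation"]

# 98% of contexts have low primary -> activation already near zero.
contexts = ([{"primary_high": False, "activation": 0.001}] * 98
            + [{"primary_high": True, "activation": 0.15}] * 2)

unconditional = mean_delta(contexts, with_secondary)  # looks like "no effect"
conditional = with_secondary({"primary_high": True, "activation": 0.15}) - 0.15
```

`unconditional` is ~-0.002 while the conditional effect at high primary is ~-0.105 — exactly the trap described above.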

The Solution: Iterative 2D Testing

Protocol for features with a strong primary driver:

```
1. Confirm primary driver with 1D sweep
   └─ If monotonic response confirmed → proceed to step 2

2. For EACH correlated ability in overview (top 5-10):
   └─ Run 2D heatmap: PRIMARY × SECONDARY
   └─ Check activation at EACH primary level
   └─ Look for:
      - Suppression: secondary reduces activation at high primary
      - Synergy: secondary boosts activation at high primary
      - Spurious: no conditional effect (correlation was coincidence)

3. Group findings by semantic category:
   └─ Death-mitigation (QR, SS, CB): all suppress? → "death-averse"
   └─ Mobility (SSU, RSU): all enhance? → "mobility-synergistic"
   └─ Efficiency (ISM, ISS): mixed? → test individually
```

2D Heatmap Interpretation Guide

| Pattern | Interpretation |
|---|---|
| Peak at (high_X, 0_Y) | Y is a suppressor |
| Peak at (high_X, high_Y) | Y is a synergy |
| Flat across Y at each X | Y has no conditional effect (spurious) |
| Non-monotonic in X at some Y | Interference pattern |

Heatmap Cell Validity Check

Before drawing conclusions from heatmap cells, check the cell metadata:

Each cell in heatmap output includes:

  • n: Number of valid samples in this cell
  • std: Standard deviation of activations
  • stderr: Standard error (std / sqrt(n)) - new field
| n (samples) | Interpretation |
|---|---|
| null/0 | Impossible combination (constraint violation) - don't interpret |
| 1-4 | Very weak evidence - note uncertainty in conclusions |
| 5-19 | Moderate evidence - interpret with caution |
| 20+ | Strong evidence - interpret confidently |

High stderr (>0.1) indicates high variance - the mean may not be reliable.
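The validity tiers above reduce to a few lines of code (a sketch; tier names and the 0.1 stderr cutoff follow the tables in this section):

```python
import math

def cell_validity(n, std):
    """Tier a heatmap cell's evidence strength from its sample count and std."""
    if not n:
        return None, "impossible - don't interpret"
    stderr = std / math.sqrt(n)
    if n < 5:
        tier = "very weak"
    elif n < 20:
        tier = "moderate"
    else:
        tier = "strong"
    if stderr > 0.1:
        tier += " (high variance)"
    return stderr, tier
```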

Anti-patterns to avoid:

  • Drawing conclusions from cells with n < 5
  • Claiming "peak at X=57, Y=29" when that cell has n=2
  • Ignoring null cells (they represent impossible ability combinations)

Example interpretation:

Cell (ISM=51, IRU=29): mean=0.35, n=3, stderr=0.08
→ "ISM=51 with IRU=29 shows high activation, but n=3 means this could be noise"

Cell (ISM=51, IRU=0): mean=0.35, n=45, stderr=0.02
→ "ISM=51 without IRU shows reliable high activation (n=45)"

When to Use 2D vs 1D

| Scenario | Use 1D | Use 2D |
|---|---|---|
| Testing primary driver | ✅ | - |
| Testing secondary abilities | ❌ MISLEADING | ✅ REQUIRED |
| Looking for interactions | - | ✅ |
| Confirming suppressor hypothesis | - | ✅ |
| Quick initial scan | ✅ (with caution) | - |

Template: Death-Aversion Test Battery

For single-family dominated features, always test death-mitigation:

```bash
# Test 1: Primary × Quick Respawn
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {PRIMARY} --family-y quick_respawn \
    --rungs-x 0,6,15,29,41,57 --rungs-y 0,6,12,21,29

# Test 2: Primary × Special Saver
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {PRIMARY} --family-y special_saver \
    --rungs-x 0,6,15,29,41,57 --rungs-y 0,3,6,12,21

# Test 3: Primary × Comeback (binary ability - use the binary subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli binary \
    --feature-id {ID} --model ultra
```

If ALL three show suppression at Y>0, label includes "death-averse"

Template: Error-Correction Detection

If 1D sweeps show small deltas or effects only in unusual rung combinations, test for error-correction behavior:

```python
import polars as pl
from splatnlp.mechinterp.skill_helpers import load_context

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Get token IDs for high and low rungs
# Example: SCU_57 (high) and SCU_3 (low)
high_rung_id = ctx.vocab['special_charge_up_57']
low_rung_id = ctx.vocab['special_charge_up_3']

# Compare activation when low rung is present vs missing (among high-rung builds)
high_with_low = df.filter(
    pl.col('ability_input_tokens').list.contains(high_rung_id) &
    pl.col('ability_input_tokens').list.contains(low_rung_id)
)
high_without_low = df.filter(
    pl.col('ability_input_tokens').list.contains(high_rung_id) &
    ~pl.col('ability_input_tokens').list.contains(low_rung_id)
)

mean_with = high_with_low['activation'].mean()
mean_without = high_without_low['activation'].mean()

print(f"High rung WITH low rung present: {mean_with:.4f} (n={len(high_with_low)})")
print(f"High rung WITHOUT low rung: {mean_without:.4f} (n={len(high_without_low)})")
print(f"Delta: {mean_without - mean_with:+.4f}")

# If WITHOUT > WITH, feature fires when prerequisite is MISSING = error correction!
```

Signs of error-correction:

| Pattern | Interpretation | Label Style |
|---|---|---|
| Higher activation when low rung MISSING | "Explains away" missing evidence | "Error-Correction: {FAMILY}" |
| Only fires on weird rung combos | OOD detector | "OOD Detector: {PATTERN}" |
| Negative interactions in 2D heatmaps | Within-family interference | "Interference Feature: {FAMILY}" |

Test for within-family interference (CRITICAL for single-family):

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli family-sweep \
    --feature-id {FEATURE_ID} --family {FAMILY} --model {MODEL}
# Check for non-monotonic response patterns in the output
```

Test for interactions (2D heatmap):

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {FEATURE_ID} --family-x {FAMILY_A} --family-y {FAMILY_B} --model {MODEL}
```

Test for weapon specificity:

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli weapon-sweep \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 20 --min-examples 10
```

CHECKPOINT: After weapon_sweep, check for dominant weapon pattern:

If weapon_sweep diagnostics show "DOMINANT WEAPON" warning (one weapon has >2x delta of second):

  1. Run kit_sweep to analyze by sub weapon and special weapon:

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli kit-sweep \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 10 --analyze-combinations
```

  2. Use the splatoon3-meta skill to look up the dominant weapon's kit:

    • Read .claude/skills/splatoon3-meta/references/weapons.md
    • Find the weapon's sub weapon and special weapon
  3. Cross-reference other high-activation weapons:

    • Do they share the same sub weapon?
    • Do they share the same special weapon?
    • If yes, the feature may encode kit behavior, not weapon behavior
  4. Update hypothesis based on findings:

    • If shared sub: feature may encode sub weapon playstyle
    • If shared special: feature may encode special spam/farming
    • If no kit pattern: feature is truly weapon-specific

Example: Feature 18712 shows Octobrush Nouveau dominant. Kit lookup reveals Squid Beakon + Ink Storm. Other high weapons (Rapid Blaster, Range Blaster) also have "special-dependent" characteristics per meta → Feature encodes "SCU for Ink Storm spam" not just "Octobrush".

Test for threshold effects:

  • Compare low-rung vs high-rung responses
  • Look for non-linear jumps in activation
  • Check if certain rungs REDUCE activation (interference)
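These three checks can be automated over a 1D sweep's output (a sketch; `rung_to_activation` is a hypothetical dict of mean activation per rung, and the 0.005 drop tolerance is an assumed noise floor):

```python
def analyze_sweep(rung_to_activation, drop_tol=0.005):
    """Flag threshold jumps and interference-style drops in a 1D sweep."""
    rungs = sorted(rung_to_activation)
    acts = [rung_to_activation[r] for r in rungs]
    steps = [(rungs[i + 1], acts[i + 1] - acts[i]) for i in range(len(acts) - 1)]
    drops = [(r, d) for r, d in steps if d < -drop_tol]  # candidate interference
    biggest_jump = max(steps, key=lambda s: s[1]) if steps else None
    return {"monotonic": not drops, "drops": drops, "biggest_jump": biggest_jump}
```

A large `biggest_jump` suggests a threshold; any entry in `drops` warrants the interference analysis above.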

Phase 4: Synthesis

Combine findings into a coherent interpretation:

  1. What triggers activation? (tokens, combinations, weapons)
  2. Is there structure beyond simple detection? (interactions, thresholds)
  3. What gameplay concept does this represent?
  4. Why would the model learn this? (predictive value for recommendations)

Phase 5: Label Proposal

Propose a label at the appropriate level:

| Complexity | Label Type | Example |
|---|---|---|
| Trivial | Token detector | "SCU Presence" (avoid if possible) |
| Simple | Threshold detector | "High SCU Investment (29+ AP)" |
| Moderate | Interaction | "SCU + Mobility Combo" |
| Tactical | Build archetype | "Special Spam Slayer Kit" |
| Strategic | Playstyle | "Aggressive Frontline Build" |

Label Specificity by Category

The label's specificity should match its concept level:

| Category | Specificity | Style | Examples |
|---|---|---|---|
| mechanical | Terse | Token-focused, technical | "SCU Threshold 29+", "ISM Stacker" |
| tactical | Mid-level | Ability combos, weapon synergies | "Zombie Slayer Dualies", "Beacon Support Kit" |
| strategic | High-concept | Playstyle, gameplay philosophy | "Positional Survival - Midrange", "Aggressive Reentry" |

Why this matters:

  • Mechanical features encode low-level patterns → label should be precise and technical
  • Tactical features encode build strategies → label should name the strategy
  • Strategic features encode gameplay philosophies → label should capture the "why"

Examples by level:

Feature encodes "SCU above 29 AP threshold"
→ Category: mechanical
→ Label: "SCU Threshold 29+" (terse, specific)

Feature encodes "QR + Comeback + Stealth Jump on dualies"
→ Category: tactical
→ Label: "Zombie Slayer Dualies" (names the combo + weapon)

Feature encodes "survive through positioning, not stealth or trading"
→ Category: strategic
→ Label: "Positional Survival - Midrange" (high-concept + role)

Strategic Label Quality Checklist

Before finalizing a label, verify:

  1. Concept over tokens: Does the label describe a GAMEPLAY CONCEPT, not just list abilities?

    • BAD: "SSU + ISM + SRU Kit", "Swim Efficiency Kit"
    • GOOD: "Positional Survival", "Aggressive Reentry"
  2. Positive framing: Does the label describe what the feature IS, not just what it avoids?

    • BAD: "Death-Averse Efficiency", "Anti-Stealth Build"
    • GOOD: "Positional Survival", "Visible Zone Control"
  3. The "why" test: Can you answer "why would a player build this?"

    • If answer is "to have SSU and ISM" → label is too mechanical
    • If answer is "to survive through positioning at midrange" → label captures concept
  4. Range/role qualifier: Have you verified weapon range (Phase 1.8) and added appropriate qualifier?

    • Backline (SNIPER/LONG + ANCHOR) → "- Anchor" or "- Backline"
    • Midrange (MID/LONG + SUPPORT/SKIRMISH) → "- Midrange"
    • Frontline (CLOSE/MID + SLAYER) → "- Slayer" or "- Frontline"

Strategic Label Format

Prefer: "[Concept] - [Qualifier]"

| Concept Examples | What it captures |
|---|---|
| Positional Survival | Stay alive through positioning, not stealth/trading |
| Aggressive Reentry | Pressure through fast respawn (zombie) |
| Stealth Approach | Win through concealment (NS builds) |
| Special Pressure | Win through special uptime |
| Lane Persistence | Hold lanes through sustain |

| Qualifier Examples | When to use |
|---|---|
| Midrange | MID-range weapons, SKIRMISH/SUPPORT jobs |
| Anchor | LONG/SNIPER range, ANCHOR job, chargers/splatlings |
| Slayer | CLOSE/MID range, SLAYER job, aggressive weapons |
| Support | SUPPORT job, team utility focus |
| (Weapon Class) | When specific to dualies, blasters, etc. |

Label Anti-Patterns to Avoid

| Anti-Pattern | Example | Why It's Bad | Better Label |
|---|---|---|---|
| Token listing | "SSU + ISM Kit" | Describes tokens, not purpose | "Positional Survival" |
| Negation-only | "Death-Averse" | Describes avoidance, not identity | "Positional Survival" |
| Wrong role | "Anchor" for Jr./Rapid | Anchor implies backline chargers | "- Midrange" |
| Too generic | "Utility Build" | Could mean anything | "Positional Survival - Midrange" |
| Flanderized | Based on top 100 only | Captures tail, not core concept | Check core region first |
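A cheap lint for the token-listing anti-pattern: flag any proposed label that is mostly ability abbreviations (a heuristic sketch; the abbreviation set mirrors the key later in this document):

```python
ABILITY_ABBREVS = {"SCU", "SSU", "RSU", "QSJ", "ISM", "ISS", "IRU", "QR", "SS",
                   "CB", "SJ", "NS", "RP", "IA", "MPU", "SPU", "BRU", "RES"}

def looks_like_token_listing(label):
    """Flag labels that are mostly ability abbreviations joined together."""
    words = label.replace("+", " ").replace("-", " ").split()
    hits = sum(1 for w in words if w.upper() in ABILITY_ABBREVS)
    return hits >= 2
```

One abbreviation is fine (mechanical labels like "SCU Threshold 29+" are legitimate); two or more suggests the label describes tokens, not a concept.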

Phase 6: Deeper Dive (For Thorny Features)

When to use: If the standard deep dive (Phases 1-5) didn't produce a clear interpretation:

  • All scaling effects weak (max_delta < 0.03)
  • No clear primary driver
  • Conflicting signals from different experiments
  • Feature seems important (high contribution to outputs) but unclear why

The Deeper Dive uses the hypothesis/state management system for systematic exploration:

Step 1: Initialize Research State

```python
from splatnlp.mechinterp.state import ResearchState, Hypothesis

state = ResearchState(feature_id=FEATURE_ID, model_type="ultra")

# Add competing hypotheses based on what you've observed
state.add_hypothesis(Hypothesis(
    id="h1",
    description="Feature encodes weapon-specific pattern for Dapple Nouveau",
    status="pending"
))
state.add_hypothesis(Hypothesis(
    id="h2",
    description="Feature encodes binary ability package (Stealth + Comeback)",
    status="pending"
))
state.add_hypothesis(Hypothesis(
    id="h3",
    description="Feature has high decoder weights despite weak activation effects",
    status="pending"
))
```

Step 2: Check Decoder Weights

For "weak activation" features, check if they have high influence via decoder weights:

```python
# Load SAE decoder weights
import torch
sae_path = '/mnt/e/dev_spillover/SplatNLP/sae_runs/run_20250704_191557/sae_model_final.pth'
sae_checkpoint = torch.load(sae_path, map_location='cpu', weights_only=True)
decoder_weight = sae_checkpoint['decoder.weight']  # [512, 24576]

# Get this feature's decoder weights to output space
feature_decoder = decoder_weight[:, FEATURE_ID]  # [512]

# Check magnitude
print(f"Decoder weight L2 norm: {torch.norm(feature_decoder):.4f}")
print(f"Max absolute weight: {torch.abs(feature_decoder).max():.4f}")

# Compare to other features
all_norms = torch.norm(decoder_weight, dim=0)
percentile = (all_norms < torch.norm(feature_decoder)).float().mean() * 100
print(f"Percentile among all features: {percentile:.1f}%")
```

If decoder weights are high (>75th percentile), the feature may be important despite weak activation effects.

Step 3: Decoder Output Analysis (CRITICAL for Diffuse Features)

When activation analysis doesn't yield a clean interpretation, analyze what the feature RECOMMENDS.

This technique asks: "What does this feature push the model to predict?" rather than "What activates this feature?"

Use the decoder CLI:

```bash
cd /root/dev/SplatNLP

# Quick output influence check
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id {FEATURE_ID} \
    --model ultra \
    --top-k 15

# Check decoder weight importance
poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
    --feature-id {FEATURE_ID} \
    --model ultra
```

See mechinterp-decoder skill for full documentation.

Interpretation Guide:

| Output Pattern | Interpretation |
|---|---|
| Promotes low-AP tokens (_3, _6) | "Recommend light investment" |
| Promotes high-AP tokens (_51, _57) | "Recommend heavy stacking" |
| Suppresses high-AP tokens | "Anti-stacking / balanced build" |
| Promotes death-mitigation (QR, CB, SS) | "Recommend zombie/respawn optimization" |
| Suppresses death-mitigation | "Death-averse / stay alive" |

Example (Feature 13934):

PROMOTES: respawn_punisher (+0.23), comeback (+0.16), QSJ_6 (+0.15), IA_3 (+0.14), ISM_6 (+0.13)
SUPPRESSES: RSU_57 (-0.30), QR_57 (-0.25), RSU_51 (-0.24)

Interpretation: Feature recommends "balanced utility spread with low-AP investments"
               and DISCOURAGES heavy stacking of any single ability.

When to use decoder output analysis:

  • Activation analysis shows multi-modal or diffuse patterns
  • No single signature covers >50% of core
  • Feature seems "confused" between different build types
  • You want to understand the feature's PURPOSE, not just what triggers it

Key Insight: A feature can activate on seemingly different builds because they share the same NEED. The output analysis reveals what the feature is recommending, which may unify apparently contradictory activation patterns.

Decoder Output Semantic Grouping (CRITICAL for Labels)

After running decoder output analysis, group promoted/suppressed tokens by MEANING, not just family:

| Semantic Group | Token Families | Gameplay Meaning |
|---|---|---|
| Mobility | SSU, RSU | How you reposition |
| Survival | BRU, IRU, RES, QR, SS, RP | How you stay alive |
| Efficiency | ISM, ISS, IRU | How you sustain pressure |
| Lethality | IA, MPU, BPU (bomb damage) | How you get kills |
| Special-Focus | SCU, SS, SPU, Tenacity | How you use specials |
| Stealth | NS, (high SSU) | How you approach unseen |
| Death-Trading | QR, CB, SJ, SS | How you weaponize respawn |

Abbreviation Key:

  • SSU = Swim Speed Up, RSU = Run Speed Up
  • BRU = Bomb (Sub) Resistance Up, RES = Ink Resistance Up
  • IRU = Ink Recovery Up, ISM = Ink Saver Main, ISS = Ink Saver Sub
  • BPU = Bomb (Sub) Power Up, SPU = Special Power Up
  • SCU = Special Charge Up, SS = Special Saver
  • QR = Quick Respawn, CB = Comeback, SJ = Stealth Jump
  • IA = Intensify Action, MPU = Main Power Up, NS = Ninja Squid, RP = Respawn Punisher

Then ask: "What COMBINATION of groups defines this feature?"

| Promoted Groups | Suppressed Groups | Strategic Concept |
|---|---|---|
| Mobility + Survival + Efficiency | Death-Trading, Stealth | Positional Survival |
| Death-Trading + Mobility | Survival | Zombie/Aggressive Reentry |
| Stealth + Mobility | - | Stealth Approach |
| Special-Focus + Efficiency | Mobility | Special Farming |
| Lethality + Mobility | Efficiency | Aggressive Slayer |

This semantic grouping directly informs the strategic label.
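The grouping step itself is mechanical (a sketch using the table above; overlaps like QR appearing under both survival and death-trading are intentional — which reading applies depends on what else is promoted):

```python
SEMANTIC_GROUPS = {
    "mobility": {"SSU", "RSU"},
    "survival": {"BRU", "IRU", "RES", "QR", "SS", "RP"},
    "efficiency": {"ISM", "ISS", "IRU"},
    "special_focus": {"SCU", "SS", "SPU"},
    "stealth": {"NS"},
    "death_trading": {"QR", "CB", "SJ", "SS"},
}

def group_tokens(abbrevs):
    """Map promoted/suppressed ability abbreviations onto semantic groups."""
    hits = {}
    for group, members in SEMANTIC_GROUPS.items():
        matched = sorted(set(abbrevs) & members)
        if matched:
            hits[group] = matched
    return hits
```

Run it once on the promoted list and once on the suppressed list, then read the strategic concept off the combination table.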

Post-Decoder Sweep Rule

After decoder output analysis, verify the top promoted/suppressed families with causal 1D sweeps.

The decoder tells you what the feature RECOMMENDS, but not whether it's causally driven by those tokens. To validate:

  1. Identify top 2 promoted families from decoder output (highest positive contributions)
  2. Identify top 2 suppressed families from decoder output (most negative contributions)
  3. Run 1D sweeps for any not yet tested in Phase 2
| Decoder Shows | Test With | Expected If Valid |
|---|---|---|
| BRU highly promoted | family_1d_sweep (BRU) | Positive delta with BRU levels |
| RSU suppressed | family_1d_sweep (RSU) | Negative delta or flat |

Example: Feature 10938 decoder showed BRU heavily promoted (+0.126, +0.120, +0.108 for different rungs), but initial sweeps only tested SSU/ISM. Should have run:

```bash
# Missing sweep that would validate decoder findings
poetry run python -m splatnlp.mechinterp.cli.runner_cli run-spec \
    --spec '{"type": "family_1d_sweep", "variables": {"family": "bomb_resistance_up"}}' \
    --feature-id 10938 --model ultra
```

Anti-pattern: Trusting decoder output without causal validation. Decoder weights show correlation to output tokens, not causal effect of input tokens.

Step 4: Run Targeted Experiments

Based on hypotheses, run specific tests:

```python
# Log experiments and findings to state
state.add_evidence(
    hypothesis_id="h1",
    experiment_type="weapon_sweep",
    finding="37% Dapple Nouveau, but also 10% .96 Gal Deco - not single-weapon",
    supports=False
)

state.add_evidence(
    hypothesis_id="h3",
    experiment_type="decoder_weight_check",
    finding="Decoder L2 norm: 0.89 (92nd percentile) - HIGH despite weak activation",
    supports=True
)
```

Step 5: Synthesize

```python
# Review all evidence
state.summarize()

# Update hypothesis statuses
state.update_hypothesis("h1", status="rejected")
state.update_hypothesis("h3", status="supported")

# Propose final interpretation
state.set_conclusion(
    "Feature has weak activation effects but high decoder weights. "
    "It acts as a 'fine-tuning' feature that makes small but important "
    "adjustments to output probabilities."
)
```

When Deeper Dive is Complete

The state object provides an audit trail of:

  • What hypotheses were considered
  • What experiments were run
  • What evidence was found
  • Why the final interpretation was chosen

This is useful for:

  • Revisiting the feature later
  • Explaining the interpretation to others
  • Identifying if new evidence should change the interpretation

Decision Trees

Single-Family Dominated Feature

```
1. Run within_family_interference to check for error correction
   └─ If interference found → "Error-Correcting {FAMILY} Detector"
   └─ If enhancement patterns → "{FAMILY} Stacker (synergistic)"
   └─ If neutral → continue

2. Check for non-monotonic 1D response
   └─ If drops at certain rungs → investigate interference
   └─ If monotonic with threshold → "High {FAMILY} Investment"
   └─ If monotonic with no threshold → probably trivial

3. Run weapon_sweep to check weapon specificity
   └─ If weapon-concentrated → run weapon_group_analysis
   └─ If weapon-specific patterns → "{WEAPON_CLASS} + {FAMILY}"

4. Run 2D sweep with second-ranked family
   └─ If interaction effect → "{FAMILY_A} + {FAMILY_B} Combo"
   └─ If no interaction → try third family

5. If all trivial → label as "{FAMILY} Stacker" with note "simple detector"
```

Multi-Family Feature

```
1. Check if families are related
   └─ All mobility (SSU, RSU, QSJ) → "Mobility Kit"
   └─ All ink efficiency (ISM, ISS, IRU) → "Efficiency Kit"
   └─ Mixed → continue

2. Run pairwise interaction analysis
   └─ Positive synergy → "Synergistic Build"
   └─ Redundancy → "Alternative Paths"

3. Check weapon breakdown
   └─ Weapon class pattern → "{CLASS} Optimal Build"

4. Consider strategic meaning
   └─ What playstyle does this combination enable?
```
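Step 1 of the multi-family tree can be checked with a subset test (a sketch; the family-name strings are assumed to match the CLI family names used elsewhere in this document, e.g. "quick_super_jump"):

```python
MOBILITY = {"swim_speed_up", "run_speed_up", "quick_super_jump"}
EFFICIENCY = {"ink_saver_main", "ink_saver_sub", "ink_recovery_up"}

def related_kit(families):
    """Do the top families form one coherent kit, or a mixed set?"""
    fams = set(families)
    if fams and fams <= MOBILITY:
        return "Mobility Kit"
    if fams and fams <= EFFICIENCY:
        return "Efficiency Kit"
    return None  # mixed - continue down the tree
```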

Example Investigation

Feature 18712 (Deep Analysis):

  1. Overview: SCU 31%, SSU 11%, ISS 10% → Single-family dominated
  2. Hypothesis: Could be SCU + something, or just trivial SCU detector
  3. 2D Heatmap (SCU × SSU): Peak at SCU=57, SSU=0. Non-monotonic drops visible!
    • SCU 6→12: DROP of 0.02 (unexpected)
    • SCU 15→21: DROP of 0.01
  4. Interference Analysis:
    • SCU_12 REDUCES SCU_51 signal by 0.10 (interference!)
    • SCU_15 ENHANCES SCU_51 signal by 0.12 (synergy!)
  5. Weapon Analysis: Effect varies by weapon
    • weapon_id_50: SCU_3 reduces SCU_15 (-0.08)
    • weapon_id_7020: SCU_3 enhances SCU_15 (+0.03)
  6. Interpretation: Feature detects "clean" high-SCU builds.
    • Low rungs (SCU_3, SCU_12) can contaminate the signal
    • Effect is weapon-dependent
  7. Label: "SCU Purity Detector (weapon-conditional)" - NOT trivial!

Key Insight: What looked like a simple "SCU detector" actually encodes complex error-correction behavior. Always check for interference!

Commands Summary

```bash
# Phase 1: Overview (with extended analyses)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {ID} --model ultra --top-k 20

# Phase 1 with extended analyses (enrichment, regions, binary, kit)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {ID} --model ultra --all

# Phase 3a: 1D sweep for dominant family (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli family-sweep \
    --feature-id {ID} --family {FAMILY} --model ultra

# Phase 3b: 2D heatmap for interactions (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {FAMILY_A} --family-y {FAMILY_B} --model ultra

# Phase 3c: Weapon sweep (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli weapon-sweep \
    --feature-id {ID} --model ultra --top-k 20

# Phase 3d: Kit sweep (if dominant weapon detected)
poetry run python -m splatnlp.mechinterp.cli.runner_cli kit-sweep \
    --feature-id {ID} --model ultra --analyze-combinations

# Phase 3e: Binary ability analysis
poetry run python -m splatnlp.mechinterp.cli.runner_cli binary \
    --feature-id {ID} --model ultra

# Phase 3f: Core coverage analysis
poetry run python -m splatnlp.mechinterp.cli.runner_cli coverage \
    --feature-id {ID} --tokens {TOKEN1},{TOKEN2}

# Phase 1.7.5: Kit Component Analysis (see skill for full code)
# After weapon sweep, check for patterns in: sub weapons, specials, or weapon class
# For any concentrated pattern, determine if CAUSAL (explains build) or SPURIOUS (incidental)

# Phase 5: Set label
poetry run python -m splatnlp.mechinterp.cli.labeler_cli label \
    --feature-id {ID} --name "{LABEL}" --category {tactical|strategic|mechanical}
```

Labeling Categories

  • mechanical: Low-level patterns (token presence, simple combinations)
  • tactical: Mid-level patterns (build synergies, weapon kits)
  • strategic: High-level patterns (playstyles, meta concepts)

See Also

  • mechinterp-overview: Initial feature assessment (now includes bottom tokens)
  • mechinterp-runner: Execute experiments (includes core_coverage_analysis and decoder_output_analysis)
  • mechinterp-decoder: Decoder weight analysis - what features recommend (USE for diffuse/heterogeneous features)
  • mechinterp-next-step-planner: Generate experiment specs
  • mechinterp-labeler: Save labels
  • mechinterp-glossary-and-constraints: Domain reference
  • mechinterp-ability-semantics: Ability semantic groupings (check AFTER hypotheses)
  • splatoon3-meta: Weapon archetypes, kit lookups, meta knowledge (USE for weapon pattern interpretation)