Claude Code Plugins

Community-maintained marketplace


mechinterp-investigator

@cesaregarza/SplatNLP

Orchestrate a systematic research program to investigate and meaningfully label SAE features


SKILL.md

---
name: mechinterp-investigator
description: Orchestrate a systematic research program to investigate and meaningfully label SAE features
---

MechInterp Investigator

This skill guides a systematic investigation of SAE features to arrive at meaningful, non-trivial labels. It orchestrates the other mechinterp skills into a coherent research workflow.

Phase 0: Triage (ALWAYS START HERE)

Goal: Quickly filter out weak/auxiliary features that don't warrant deep investigation.

Time: 1-2 minutes

Many SAE features have minimal influence on model outputs. Triage identifies these early so you can skip expensive analysis.

Step 0.1: Check Decoder Weight Percentile

import torch

sae_path = '/mnt/e/dev_spillover/SplatNLP/sae_runs/run_20250704_191557/sae_model_final.pth'
sae_checkpoint = torch.load(sae_path, map_location='cpu', weights_only=True)
decoder_weight = sae_checkpoint['decoder.weight']  # [512, 24576]

# Get this feature's max absolute decoder weight
feature_decoder = decoder_weight[:, FEATURE_ID]
max_abs = torch.abs(feature_decoder).max().item()

# Compare to all features
all_max_abs = torch.abs(decoder_weight).max(dim=0).values
percentile = ((all_max_abs < max_abs).float().mean() * 100).item()

print(f"Feature {FEATURE_ID} decoder weight percentile: {percentile:.1f}%")
| Percentile | Action |
| --- | --- |
| < 10% | Likely weak; check overview structure |
| 10-25% | Borderline; overview decides |
| > 25% | Proceed to Phase 1 (Overview) |

Step 0.2: Quick Overview Check (if <10%)

If decoder percentile < 10%, run a quick overview:

poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {FEATURE_ID} --model ultra --top-k 10

Signs of clear structure (proceed to Phase 1):

  • One family dominates (>40% of breakdown)
  • Strong weapon concentration (>50% one weapon)
  • Clear binary ability pattern
  • Top PageRank token has score > 0.20

Signs of no structure (label as weak):

  • Family breakdown is flat (all <15%)
  • Weapons are diverse
  • Top PageRank score < 0.10
  • High sparsity (>99%) with no clear pattern

Triage Decision

Decoder percentile < 10% AND no clear structure in overview?
  │
  Yes → Label as "Weak/Aux Feature {ID}" and STOP
  │
  No → Proceed to Phase 1 (Overview)
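The decision flow above can be codified as a small helper. This is a sketch: the thresholds mirror the triage table and the structure signals in Step 0.2, and the overview dict keys (`top_family_share`, `top_weapon_share`, `top_pagerank`) are illustrative names, not fields of the actual overview CLI output.

```python
def triage(decoder_percentile, overview=None):
    """Phase 0 decision: 'proceed', 'run_overview', or 'label_weak'.

    decoder_percentile: this feature's max-abs decoder weight percentile (0-100).
    overview: optional dict with illustrative keys 'top_family_share',
              'top_weapon_share', 'top_pagerank' from a quick overview run.
    """
    if decoder_percentile >= 10:
        return "proceed"  # at or above the 10th percentile: go to Phase 1
    if overview is None:
        return "run_overview"  # below 10%: quick overview required first
    has_structure = (
        overview.get("top_family_share", 0) > 0.40   # one family dominates
        or overview.get("top_weapon_share", 0) > 0.50  # weapon concentration
        or overview.get("top_pagerank", 0) > 0.20      # strong top token
    )
    return "proceed" if has_structure else "label_weak"


print(triage(35.0))                         # strong decoder weight
print(triage(5.0))                          # weak: overview needed
print(triage(5.0, {"top_pagerank": 0.25}))  # weak but structured
print(triage(5.0, {"top_pagerank": 0.05}))  # weak, no structure
```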

Weak Feature Label Format

{
  "dashboard_name": "Weak/Aux Feature {ID}",
  "dashboard_category": "auxiliary",
  "dashboard_notes": "TRIAGE: Decoder weight {X}th percentile, no clear structure in overview. Skipped deep dive.",
  "hypothesis_confidence": 0.0,
  "source": "claude code (triage)"
}

When to Override Triage

Even with low decoder weights, proceed if:

  • The feature is part of a cluster you're investigating
  • You have external reason to believe it's important
  • You're doing exhaustive analysis of a subset

⚠️ Deep Dive Basics

A proper deep dive requires experiments, not just reading overview data. The overview shows correlations; experiments reveal causation.

Minimum Requirements for a Deep Dive

| Step | What to Do | Why |
| --- | --- | --- |
| 1. Overview | Run overview to see correlations | Generate hypotheses |
| 2. 1D Sweeps | Test top 3-5 families with 1D sweeps | Find causal drivers (scaling abilities) |
| 3. Binary Check | For binary abilities (Comeback, Stealth Jump, LDE, Haunt, etc.), check presence rate | Binary abilities show delta=0 in sweeps but may still be characteristic |
| 4. Bottom Tokens | Check suppressors from overview | What the feature AVOIDS is often more informative |
| 5. 2D Heatmaps | Test interactions between the primary driver and correlated tokens | Verify whether correlations are causal or spurious |
| 6. Kit Analysis | Check if core weapons share a sub/special/class pattern | Explains the "why" behind a build philosophy; determine causal vs spurious |

Binary Abilities Need Special Handling

Binary abilities (you have them or you don't) show delta=0 in 1D sweeps because there's no scaling. This does NOT mean they're unimportant.

Binary Abilities
Comeback, Stealth Jump, Last-Ditch Effort, Haunt, Ninja Squid, Respawn Punisher, Object Shredder, Drop Roller, Opening Gambit, Tenacity

To evaluate binary abilities:

  1. Check PageRank score (correlation strength)
  2. Check presence rate: What % of high-activation examples contain it?
  3. Compare mean activation WITH vs WITHOUT the binary token
  4. Run 2D heatmap: scaling_ability × binary_ability to see conditional effect

Binary Ability Analysis Protocol (CRITICAL)

Binary abilities can have strong conditional effects that ONLY show up in 2D analysis. Here's the exact methodology:

Step 1: Check presence rate enrichment

from splatnlp.mechinterp.skill_helpers import load_context
import polars as pl

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Find binary token ID
binary_id = None
for tok_id, tok_name in ctx.inv_vocab.items():
    if tok_name == 'comeback':  # or stealth_jump, etc.
        binary_id = tok_id
        break

# Calculate enrichment
threshold = df['activation'].quantile(0.90)  # Top 10%
high_df = df.filter(pl.col('activation') >= threshold)

with_binary_all = df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
with_binary_high = high_df.filter(pl.col('ability_input_tokens').list.contains(binary_id))

baseline_rate = len(with_binary_all) / len(df)
high_rate = len(with_binary_high) / len(high_df)
enrichment = high_rate / baseline_rate

print(f"Baseline presence: {baseline_rate:.1%}")
print(f"High-activation presence: {high_rate:.1%}")
print(f"Enrichment ratio: {enrichment:.2f}x")
# Enrichment > 1.5x suggests binary ability is characteristic

Step 2: Check mean activation WITH vs WITHOUT

with_binary = df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
without_binary = df.filter(~pl.col('ability_input_tokens').list.contains(binary_id))

mean_with = with_binary['activation'].mean()
mean_without = without_binary['activation'].mean()
delta = mean_with - mean_without

print(f"Mean WITH: {mean_with:.4f}")
print(f"Mean WITHOUT: {mean_without:.4f}")
print(f"Delta: {delta:+.4f}")
# Delta > 0.03 suggests meaningful effect

Step 3: Run 2D heatmap (MOST IMPORTANT)

Binary abilities can have conditional effects that vary by the scaling ability level:

# Manual 2D analysis for binary abilities
# (The built-in 2D heatmap may not handle binary tokens correctly)

scaling_ids = {3: 48, 6: 49, 12: 50, 21: 53, 29: 80}  # ISM example
binary_id = 27  # Comeback

print("Scaling | No Binary | With Binary | Delta")
print("-" * 50)

for level, tok_id in scaling_ids.items():
    level_df = df.filter(pl.col('ability_input_tokens').list.contains(tok_id))

    with_binary = level_df.filter(pl.col('ability_input_tokens').list.contains(binary_id))
    without_binary = level_df.filter(~pl.col('ability_input_tokens').list.contains(binary_id))

    mean_with = with_binary['activation'].mean() if len(with_binary) > 0 else 0
    mean_without = without_binary['activation'].mean() if len(without_binary) > 0 else 0
    delta = mean_with - mean_without

    print(f"{level:>7} | {mean_without:>9.4f} | {mean_with:>11.4f} | {delta:>+.4f}")

Example (Feature 13352):

ISM × Comeback 2D Analysis:
ISM | No CB  | With CB | Delta
  0 | 0.066  | 0.117   | +0.051
  3 | 0.122  | 0.261   | +0.139
  6 | 0.147  | 0.352   | +0.205  ← PEAK INTERACTION
 12 | 0.094  | 0.163   | +0.069
 21 | 0.094  | 0.129   | +0.035

Interpretation: Comeback has STRONG conditional effect at ISM 3-6.
The +0.205 delta at ISM_6 means Comeback more than doubles the activation!
1D sweep showed delta=0 because most examples have ISM=0 (low baseline).

Step 4: Test combinations of binary abilities together

# Test multiple binary abilities together
binary_id_1 = 27  # e.g., comeback
binary_id_2 = 1   # e.g., stealth_jump

both = df.filter(
    pl.col('ability_input_tokens').list.contains(binary_id_1) &
    pl.col('ability_input_tokens').list.contains(binary_id_2)
)
neither = df.filter(
    ~pl.col('ability_input_tokens').list.contains(binary_id_1) &
    ~pl.col('ability_input_tokens').list.contains(binary_id_2)
)

print(f"Mean BOTH: {both['activation'].mean():.4f} (n={len(both)})")
print(f"Mean NEITHER: {neither['activation'].mean():.4f} (n={len(neither)})")

# Then repeat the 2D analysis at each scaling level -
# combinations can have stronger effects than individual abilities!

Key Insight: Binary abilities may have stronger effects when combined. Always test combinations, not just individual tokens.

Additional Learnings

  1. Conditional effects can be much stronger than marginal effects: A feature might show ISM with only 0.069 max_delta in 1D sweeps, but a binary ability combination at moderate ISM could produce +0.335 delta - the interaction effect can be 5x stronger than the marginal effect. 1D sweeps can dramatically underestimate a feature's true behavior.

  2. Depletion is informative: If a binary ability shows enrichment < 1.0 (e.g., 0.72x), the feature actively avoids that ability. This is meaningful for interpretation - it tells you what the feature excludes, not just what it includes.

  3. Manual 2D analysis required for binary tokens: The Family2DHeatmapRunner uses parse_token() which expects family_name_AP format, but binary abilities appear as just the token name (e.g., comeback not comeback_10). Use manual 2D analysis code for binary abilities (see protocol above).

  4. "Weak feature" needs decoder weight check: A feature with weak activation effects (max_delta < 0.03) might still have high influence on outputs. Remember: net influence = activation strength × decoder weight. Before labeling as "weak", check the feature's decoder weights to the output tokens it contributes to. A "weak activation" feature with high decoder weights may actually be important.

  5. Watch for error-correction features: If 1D sweeps show small deltas or effects only in unusual rung combinations, the feature may fire when prerequisites are MISSING (OOD detection). Test "explains-away" behavior by comparing activation when low-level evidence is present vs missing. Example: Does feature fire MORE when SCU_3 is absent from a high-SCU build?

  6. Beware of flanderization in top activations: The top 100 activations over-emphasize extreme cases. The TRUE concept often lives in the mid-activation range (25-75th percentile). Always compare mid vs top activation regions - if they show different weapon/ability patterns, label the mid-range concept and note the extremes as "super-stimuli".
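Learning 4's net-influence check (net influence = activation strength × decoder weight) can be sketched as follows. The decoder shape, random weights, and `mean_activation` value are placeholders for illustration, not values from the real checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)
decoder_weight = rng.normal(size=(512, 100))  # stand-in for [512, 24576]
feature_id = 7

# A "weak activation" feature by the max_delta < 0.03 rule
mean_activation = 0.02

# Net influence on each output dimension = activation strength x decoder weight
net_influence = mean_activation * np.abs(decoder_weight[:, feature_id])
print(f"Max net influence: {net_influence.max():.4f}")

# A feature with a strong decoder column can matter despite weak activations,
# so compare its max-abs decoder weight against all features:
all_max = np.abs(decoder_weight).max(axis=0)
pct = float((all_max < all_max[feature_id]).mean() * 100)
print(f"Decoder weight percentile: {pct:.1f}%")
```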

What Counts as Evidence

| Evidence Type | Strength | Example |
| --- | --- | --- |
| 1D sweep max_delta > 0.05 | Strong causal | "ISM drives this feature" |
| 1D sweep max_delta 0.02-0.05 | Weak causal | "ISM has minor effect" |
| 1D sweep max_delta < 0.02 | Negligible | "ISM doesn't drive this" |
| Binary delta = 0 | Inconclusive | Needs a presence-rate check |
| High PageRank + low delta | Spurious correlation | Token co-occurs but doesn't cause |
| 2D heatmap shows conditional effect | Interaction confirmed | "X matters only when Y is high" |
| Bottom tokens (suppressors) | Avoidance pattern | "Feature avoids death-perks" |
| Higher activation when prerequisite MISSING | Error-correction | "Fires on OOD rung combos" |
| Mid-range (25-75%) differs from top | Flanderization | "Top is super-stimuli; label mid-range" |

Common Mistakes to Avoid

  1. Presenting overview as findings - Overview is hypotheses, not conclusions
  2. Ignoring binary abilities - Delta=0 doesn't mean unimportant
  3. Skipping bottom tokens - Suppressors reveal what feature avoids
  4. Only running 1D sweeps - 2D heatmaps needed for interaction effects
  5. Not checking weapon patterns - Feature may be weapon-specific, not ability-specific
  6. Using only top activations - Top activations (90%+ of max) may be "flanderized" extremes; check core region (25-75% of max)
  7. Missing error-correction features - Small deltas in weird rung combos may indicate OOD detection
  8. Confusing data sparsity with suppression - Zero examples at a condition ≠ "suppression to 0" (see below)
  9. Shallow validation - Just checking if numbers "look right" without running enrichment analysis
  10. Semantic contradictions in labels - e.g., "Zombie" (embraces death) + "high SSU" (avoids death) is contradictory
  11. Reporting weapon percentages from top-100 - Use top 20-30% instead; top-100 can be 5-10x off (e.g., 78% vs 10%)
  12. Not checking meta archetypes - Weapons may cluster by playstyle, not kit; use splatoon3-meta skill
  13. Assuming kit-based patterns - Check if weapons share sub/special BEFORE assuming it's kit-related
  14. Ignoring flanderization crossover - Note where a "super-stimulus" weapon overtakes the general pattern (usually 90%+ of max activation)

⚠️ CRITICAL: Data Sparsity vs Suppression

This is a common and dangerous mistake. When you see "activation = 0" or "no effect" at some condition, ask: Is this suppression or data sparsity?

Example of the mistake (Feature 1819):

Original claim: "QR is HARD SUPPRESSOR - SSU_57+QR_any=0.000"
Reality: There were ZERO examples with SSU_57 + any QR in the dataset!
         The "0.000" was missing data, not suppression.

How to detect data sparsity:

# ALWAYS check sample sizes when claiming suppression!
at_high_ssu = df.filter(pl.col('ability_input_tokens').list.contains(ssu_57_id))
with_qr = at_high_ssu.filter(pl.col('ability_input_tokens').list.set_intersection(qr_ids).list.len() > 0)

print(f"Examples at SSU_57 with QR: {len(with_qr)}")  # If 0, this is SPARSITY not suppression!

Rule: Never claim "suppression" unless you have ≥20 examples in the suppressed condition. Report sample sizes with all claims.
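The rule can be enforced mechanically before any "suppression" claim goes into notes. A minimal sketch; the `MIN_N` constant matches the rule above, and the report wording is an assumption:

```python
MIN_N = 20  # minimum sample size before "suppression" may be claimed

def describe_condition(mean_act, n):
    """Report a condition's activation without mistaking sparsity for suppression."""
    if n == 0:
        return "NO DATA: zero matching examples - sparsity, not suppression"
    if n < MIN_N:
        return f"INSUFFICIENT DATA (n={n} < {MIN_N}): cannot claim suppression"
    return f"mean={mean_act:.4f} (n={n}): enough data to interpret"

print(describe_condition(None, 0))    # the Feature 1819 trap
print(describe_condition(0.001, 8))   # too few examples
print(describe_condition(0.001, 450)) # safe to interpret
```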

Philosophy

A meaningful label should capture:

  • What concept the feature encodes (not just "detects token X")
  • Why the model might have learned this representation
  • How it relates to strategic/tactical gameplay

Avoid trivial labels like:

  • "SCU Detector" (just describes token presence)
  • "High activation feature" (describes statistics, not meaning)

Aim for interpretable labels like:

  • "Aggressive Slayer Build" (strategic concept)
  • "Special Spam Enabler" (functional role)
  • "Backline Support Kit" (playstyle archetype)

Investigation Workflow

Phase 0: Triage

See Phase 0: Triage above. Always start here.

If the feature passes triage (decoder weight ≥ 10th percentile OR clear structure in the overview), proceed to Phase 1.

Phase 1: Initial Assessment

Run the overview and classify the feature type:

poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 20

Classify based on family breakdown:

| Pattern | Type | Next Steps |
| --- | --- | --- |
| One family >40% | Single-family | Check for interference, weapon specificity |
| Top 2-3 families ~20% each | Multi-family | Check synergy/redundancy, build archetype |
| Many families <15% each | Distributed | Look for a meta-pattern or weapon class |
| Weapons concentrated | Weapon-specific | Weapon sweep, class analysis |

CRITICAL: Always check for non-monotonic effects! Higher AP doesn't always mean higher activation.
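A quick way to check: compute the mean activation per AP rung for the candidate family and look for an interior peak. The rung-to-activation numbers below are invented for illustration:

```python
# Mean activation per AP rung of one family (toy numbers, not real sweep output)
mean_by_ap = {0: 0.066, 3: 0.122, 6: 0.147, 12: 0.094, 21: 0.094, 29: 0.071}

levels = sorted(mean_by_ap)
values = [mean_by_ap[ap] for ap in levels]
deltas = [b - a for a, b in zip(values, values[1:])]

# Monotonic = activation only rises (or only falls) as AP increases
monotonic = all(d >= 0 for d in deltas) or all(d <= 0 for d in deltas)
peak_ap = max(mean_by_ap, key=mean_by_ap.get)

print(f"Monotonic: {monotonic}, peak at AP {peak_ap}")
# A non-monotonic profile with an interior peak (here AP 6) means the feature
# encodes a *band* of investment, not "more is better".
```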

Phase 1.5: Activation Region Analysis (CRITICAL - Anti-Flanderization)

Don't only examine extreme activations! High activations may be "flanderized" - exaggerated, extreme versions of the true concept that over-emphasize niche cases.

Key insight: The TRUE concept often lives in the core region (25-75% of effective max), not the top examples. Top activations (90%+ of effective max) can mislead you into labeling a niche pattern instead of the general concept.

Why "effective max"? Activation distributions are heavy-tailed. Using effective_max = 99.5th percentile of nonzero activations prevents single outliers from making the core region nearly empty.

Run activation region analysis:

from splatnlp.mechinterp.skill_helpers import load_context
import numpy as np
from collections import Counter

ctx = load_context("{MODEL}")
df = ctx.db.get_all_feature_activations_for_pagerank({FEATURE_ID})

acts = df['activation'].to_numpy()
weapons = df['weapon_id'].to_list()

# Use EFFECTIVE MAX (99.5th percentile) to handle heavy-tailed distributions
# This prevents single outliers from making the core region nearly empty
nonzero_acts = acts[acts > 0]
effective_max = np.percentile(nonzero_acts, 99.5)
true_max = acts.max()
print(f"True max: {true_max:.4f}, Effective max (99.5%ile): {effective_max:.4f}")

# Define activation regions as % of EFFECTIVE max
regions = [
    ('Floor (≤1%)', lambda a: a <= 0.01 * effective_max),
    ('Low (1-10%)', lambda a: 0.01 * effective_max < a <= 0.10 * effective_max),
    ('Below Core (10-25%)', lambda a: 0.10 * effective_max < a <= 0.25 * effective_max),
    ('Core (25-75%) - TRUE CONCEPT', lambda a: 0.25 * effective_max < a <= 0.75 * effective_max),
    ('High (75-90%)', lambda a: 0.75 * effective_max < a <= 0.90 * effective_max),
    ('Flanderization Zone (90%+)', lambda a: a > 0.90 * effective_max),
]

for region_name, filter_fn in regions:
    indices = [i for i, a in enumerate(acts) if filter_fn(a)]
    weps = [weapons[i] for i in indices]
    print(f"\n{region_name} (n={len(indices)}):")
    for wep, count in Counter(weps).most_common(5):
        name = ctx.id_to_weapon_display_name(wep)
        print(f"  {name}: {count}")

Key signals to look for:

| Pattern | Interpretation |
| --- | --- |
| Same weapons in ALL regions | General concept (continuous feature) |
| Different weapons in core vs 90%+ | Super-stimuli detected |
| Diverse weapons in core, concentrated in 90%+ | True concept is in the core region |
| Niche weapons only in 90%+ | High activations are "flanderized" extremes |

Example (Feature 9971):

Core (25-75%): Splattershot (115), Wellstring (65), Sploosh (57)...
Flanderization (90%+): Bloblobber (44), Glooga Deco (39), Range Blaster (28)

Interpretation: Core region shows GENERAL offensive investment.
Flanderization zone shows EXTREME SCU on special-dependent weapons (super-stimuli).
Label the general concept, note the super-stimuli pattern.

CRITICAL: Always check the Bottom Tokens (Suppressors) section! Tokens that rarely appear in high-activation examples can reveal what the feature avoids:

| Suppressor Pattern | Interpretation |
| --- | --- |
| Death-mitigation (QR, SS, CB) suppressed | Feature avoids "death-accepting" builds |
| Defensive (IR, SR) suppressed | Feature prefers aggressive/ranged builds |
| Mobility suppressed | Feature prefers stationary/positional play |
| Special abilities suppressed | Feature encodes a non-special playstyle |
Example: If SCU is enhanced but quick_respawn, special_saver, and comeback are ALL suppressed, the feature doesn't just detect "SCU" - it detects "death-averse SCU builds" (players who stack SCU but don't plan to die).
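The suppressor check can be sketched by comparing each token's presence rate in high-activation examples against its baseline rate. The (activation, token-set) rows below are toy data standing in for the real activation table, and the 0.5x / 1.5x cutoffs are assumptions:

```python
# Toy rows: (activation, set of ability tokens present in the build)
rows = [
    (0.90, {"scu_20", "ism_10"}),
    (0.85, {"scu_20"}),
    (0.80, {"scu_20", "ism_10"}),
    (0.10, {"quick_respawn", "comeback"}),
    (0.05, {"quick_respawn"}),
    (0.02, {"comeback"}),
]

threshold = sorted(a for a, _ in rows)[len(rows) // 2]  # crude top-half cutoff
high = [toks for a, toks in rows if a >= threshold]

def presence_ratio(token):
    """Presence rate in high-activation examples relative to baseline."""
    base = sum(token in toks for _, toks in rows) / len(rows)
    top = sum(token in toks for toks in high) / len(high)
    return top / base if base else float("nan")

for tok in ["scu_20", "quick_respawn", "comeback"]:
    r = presence_ratio(tok)
    tag = "suppressed" if r < 0.5 else ("enriched" if r > 1.5 else "neutral")
    print(f"{tok}: {r:.2f}x ({tag})")
```

Here `scu_20` is enriched while `quick_respawn` and `comeback` are absent from high activations, matching the "death-averse SCU builds" reading above.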

Phase 1.6: Weapon Distribution Analysis (CRITICAL - Anti-Flanderization)

NEVER report weapon percentages from top-100 samples. Top-100 is severely flanderized and can give wildly misleading weapon distributions.

Example (Feature 14096 - Real Case):

Top 100:     Dark Tetra 78%, Stamper 20%  ← WRONG, flanderized
Top 10%:     Stamper 35%, Dark Tetra 21%  ← Better but still skewed
Top 30%:     Stamper 23%, Dark Tetra 10%  ← TRUE CONCEPT
Full dataset: Stamper 9%, Dark Tetra 3.5% ← Includes noise/floor

Use top 20-30% for weapon characterization:

import polars as pl
import numpy as np
from collections import Counter
from splatnlp.mechinterp.skill_helpers import load_context

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Get percentile thresholds
acts = df['activation'].to_numpy()
thresholds = {p: np.percentile(acts, p) for p in [0, 50, 70, 90, 99]}

# Analyze by region
regions = [
    ("Bottom 50% (noise)", 0, 50),
    ("50-70% (weak)", 50, 70),
    ("Top 30% (TRUE CONCEPT)", 70, 100),
    ("Top 10%", 90, 100),
    ("Top 1% (flanderized)", 99, 100),
]

print("Region | Top Weapons")
print("-" * 60)

for name, p_low, p_high in regions:
    t_low, t_high = thresholds[p_low], thresholds.get(p_high, float('inf'))
    if p_high == 100:
        region_df = df.filter(pl.col('activation') >= t_low)
    else:
        region_df = df.filter((pl.col('activation') >= t_low) & (pl.col('activation') < t_high))

    if len(region_df) == 0:
        continue

    weapon_counts = region_df.group_by('weapon_id').agg(
        pl.col('activation').count().alias('n')
    ).sort('n', descending=True)

    top3 = []
    for row in weapon_counts.head(3).iter_rows(named=True):
        wname = ctx.id_to_weapon_display_name(row['weapon_id'])
        pct = row['n'] / len(region_df) * 100
        top3.append(f"{wname[:12]}({pct:.0f}%)")

    print(f"{name:<25} | {', '.join(top3)}")

Interpretation Guide:

| Pattern | Meaning |
| --- | --- |
| Same weapons in top-30% and top-1% | Continuous feature, no flanderization |
| Different weapons in top-30% vs top-1% | Flanderization detected; label the top-30% concept |
| One weapon jumps from 10% to 70%+ | That weapon is a "super-stimulus" for the feature |
| Weapons consistent 50% → 30% → 10% → 1% | Stable feature, safe to use any region |

Rule: Report weapon percentages from top 20-30%, note if top-1% differs significantly.

Phase 1.6.5: Ability Flanderization Check (CRITICAL)

The same flanderization that applies to weapons applies to abilities. A binary ability with high tail enrichment but low core coverage is a super-stimulus, not the core concept.

The Rule: If a "dominant" driver has <30% core coverage, it's a tail marker, not the headline concept.

Use the core coverage experiment:

cd /root/dev/SplatNLP

# Direct subcommand (recommended)
poetry run python -m splatnlp.mechinterp.cli.runner_cli coverage \
    --feature-id {FEATURE_ID} --model ultra \
    --tokens respawn_punisher,comeback,stealth_jump \
    --threshold 0.30

Output tables:

  • token_coverage: Shows core_coverage_pct, tail_enrichment, is_tail_marker for each token
  • weapon_coverage: Shows core vs tail weapon distributions (catches weapon flanderization)

Coverage Interpretation:

Core Coverage Interpretation Label Implication
>50% Primary driver Safe to headline
30-50% Significant but not universal Mention in notes, not headline
<30% Tail marker / super-stimulus NOT the headline concept

Example (Feature 13934):

respawn_punisher: 8.57x tail enrichment, BUT only 12% core coverage
→ RP is a super-stimulus, NOT the core concept
→ Wrong label: "RP Backline Anchor"
→ Right approach: Split core by RP presence to reveal hidden modes

When you find a super-stimulus (<30% coverage):

  1. Split the core by presence/absence of the super-stimulus
  2. Analyze both modes separately
  3. Look for what they have in COMMON (the true concept)
  4. Label the commonality, note the super-stimulus as a tail marker
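Steps 1-3 can be sketched on toy core-region rows. The weapon names and tokens here are illustrative stand-ins for the real core-region data:

```python
from collections import Counter

# Toy core-region rows: (weapon, set of ability tokens)
core_rows = [
    ("Hydra Splatling", {"respawn_punisher", "ink_resist_3"}),
    ("E-liter 4K", {"ink_resist_3"}),
    ("Hydra Splatling", {"ink_resist_3"}),
    ("E-liter 4K", {"respawn_punisher", "ink_resist_3"}),
    ("Splat Charger", {"ink_resist_3"}),
]
super_stimulus = "respawn_punisher"  # token with <30% core coverage

# Split the core by presence/absence of the super-stimulus
with_ss = [w for w, toks in core_rows if super_stimulus in toks]
without_ss = [w for w, toks in core_rows if super_stimulus not in toks]

print(f"With {super_stimulus}: {Counter(with_ss).most_common()}")
print(f"Without: {Counter(without_ss).most_common()}")

# The commonality across both modes is the true concept to label
common = set(with_ss) & set(without_ss)
print(f"Shared weapons (label the commonality): {sorted(common)}")
```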

Phase 1.7: Meta-Informed Weapon Analysis (USE AFTER WEAPON SWEEP)

After identifying top weapons, always check if they match a known meta archetype using the splatoon3-meta skill.

Step 1: Look up weapon kits

Check references/weapons.md for each top weapon's sub and special:

# Top weapons from Feature 14096 (top 30%):
kits = {
    "Splatana Stamper": ("Burst Bomb", "Zipcaster"),
    "Dark Tetra Dualies": ("Autobomb", "Reefslider"),
    "Glooga Dualies": ("Splash Wall", "Booyah Bomb"),
    "Dapple Dualies Nouveau": ("Torpedo", "Reefslider"),
    "Splatana Wiper": ("Torpedo", "Ultra Stamp"),
}

# Check for shared subs/specials
from collections import Counter
subs = Counter(k[0] for k in kits.values())
specials = Counter(k[1] for k in kits.values())

# If one sub/special dominates → kit-based feature
# If diverse → playstyle-based feature

Step 2: Check archetype reference

Read references/archetypes.md to see if weapons match a known archetype:

| Archetype | Key Weapons | Signature Abilities |
| --- | --- | --- |
| Zombie Slayer | Tetra Dualies, Splatana Wiper | QR + Comeback + Stealth Jump |
| Stealth Slayer | Carbon Roller, Inkbrush | Ninja Squid + SSU + Stealth Jump |
| Anchor/Backline | E-liter, Hydra Splatling | Respawn Punisher + Object Shredder |
| Support/Beacon | Squid Beakon weapons | Sub Power Up + ISS + Comeback |

Step 3: Classification decision

Kit Analysis Result:
├─ Shared sub weapon? → Feature may encode SUB PLAYSTYLE
├─ Shared special? → Feature may encode SPECIAL FARMING
├─ No kit pattern + archetype match? → PLAYSTYLE FEATURE (label as archetype)
└─ No kit pattern + no archetype? → WEAPON CLASS feature (check if all dualies, all shooters, etc.)

Example (Feature 14096):

Top 30% weapons: Stamper, Dark Tetra, Glooga, Dapple, Wiper
Kit analysis: Diverse subs (Burst, Auto, Splash Wall, Torpedo), diverse specials
Archetype check: Dark Tetra + Splatana Wiper = "Zombie Slayer" archetype!
Conclusion: PLAYSTYLE feature encoding Zombie Slayer (death-accepting aggressive)
Label: "Zombie Slayer QR (Splatana/Dualies)" - tactical category
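The Step 3 decision tree can be codified as a small helper. The >50% concentration thresholds are assumptions for illustration, not project constants:

```python
from collections import Counter

def classify_kit_pattern(kits, archetype_match):
    """kits: {weapon: (sub, special)}; returns a feature-type guess."""
    n = len(kits)
    top_sub = Counter(sub for sub, _ in kits.values()).most_common(1)[0][1]
    top_special = Counter(sp for _, sp in kits.values()).most_common(1)[0][1]
    if top_sub / n > 0.5:
        return "sub_playstyle"      # shared sub weapon dominates
    if top_special / n > 0.5:
        return "special_farming"    # shared special dominates
    return "playstyle_archetype" if archetype_match else "weapon_class"

# Diverse subs/specials + Zombie archetype match, as in the example above:
kits = {
    "Splatana Stamper": ("Burst Bomb", "Zipcaster"),
    "Dark Tetra Dualies": ("Autobomb", "Reefslider"),
    "Splatana Wiper": ("Torpedo", "Ultra Stamp"),
}
print(classify_kit_pattern(kits, archetype_match=True))  # playstyle_archetype
```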

When to invoke splatoon3-meta skill:

  • After weapon_sweep shows concentrated weapon pattern
  • When top weapons seem unrelated by kit but share a playstyle
  • To validate that ability patterns match expected meta builds
  • To identify if weapons share archetype despite different kits

Phase 1.7.5: Kit Component Analysis (OPTIONAL but Recommended)

When to use: After weapon sweep, check if the core weapons share patterns in ANY kit component: sub weapon, special weapon, or main weapon class. This can reveal WHY certain build philosophies emerge.

Key insight: Weapons may cluster by:

  • Sub weapon (Burst Bomb users, Beakon users → explains SPU/ISS builds)
  • Special weapon (Aggressive push specials → explains survival builds)
  • Main weapon class (All dualies, all chargers → explains mobility/positioning builds)

The feature may be driven by ONE of these - identify which, then determine if it's causal or spurious.


Component 1: Sub Weapon Pattern Analysis

When relevant: If kit_sweep (Phase 1.7/3d) shows sub concentration, investigate further.

from collections import Counter

# Map top weapons to their subs (from weapons.md)
weapon_subs = {
    "Splattershot Jr.": "Splat Bomb",
    "Neo Splash-o-matic": "Suction Bomb",
    "Sploosh-o-matic 7": "Splat Bomb",
    # ... add more as needed
}

# Categorize subs
sub_categories = {
    # Lethal bombs
    "Splat Bomb": "lethal", "Suction Bomb": "lethal", "Burst Bomb": "lethal",
    "Curling Bomb": "lethal", "Autobomb": "lethal", "Torpedo": "lethal",
    "Fizzy Bomb": "lethal", "Ink Mine": "lethal",
    # Utility/Support
    "Squid Beakon": "utility", "Splash Wall": "utility", "Sprinkler": "utility",
    "Point Sensor": "utility", "Angle Shooter": "utility", "Toxic Mist": "utility",
}

# Count categories over the weapons from the sweep
top_weapons = list(weapon_subs)  # or your weapon-sweep results
sub_counts = Counter()
for weapon in top_weapons:
    sub = weapon_subs.get(weapon)
    if sub:
        sub_counts[sub_categories.get(sub, "other")] += 1

print("Sub Weapon Breakdown:")
for sub, count in Counter(weapon_subs[w] for w in top_weapons if w in weapon_subs).most_common():
    print(f"  {sub}: {count}")
print("Category Breakdown:")
for cat, count in sub_counts.most_common():
    print(f"  {cat}: {count}")

Sub pattern implications:

| Sub Pattern | Build Implication | Example |
| --- | --- | --- |
| Shared Beakons | SPU/ISS focus for sub spam | Beacon Support builds |
| Shared Burst Bomb | Mobility + burst damage | Aggressive flanker builds |
| Shared Splash Wall | Positional/defensive play | Lane-control builds |
| Diverse subs | Sub is NOT the clustering factor | Check special or main class |

Component 2: Special Weapon Pattern Analysis

When relevant: After weapon sweep, check if core weapons share a special weapon pattern.

from collections import Counter

# Map top weapons to their specials (from weapons.md)
weapon_specials = {
    "Splatana Stamper": "Zipcaster",
    "Sloshing Machine": "Booyah Bomb",
    "Squeezer": "Trizooka",
    # ... add more as needed
}

# Categorize specials
special_categories = {
    # Zoning/Area Denial
    "Ink Storm": "zoning", "Wave Breaker": "zoning", "Tenta Missiles": "zoning",
    "Killer Wail 5.1": "zoning", "Triple Inkstrike": "zoning",
    # Team Support
    "Tacticooler": "team_support", "Big Bubbler": "team_support",
    "Splattercolor Screen": "team_support",
    # Aggression/Push
    "Trizooka": "aggression", "Crab Tank": "aggression", "Inkjet": "aggression",
    "Ultra Stamp": "aggression", "Booyah Bomb": "aggression", "Reefslider": "aggression",
    "Kraken Royale": "aggression", "Zipcaster": "aggression",
    # Utility/Defense
    "Ink Vac": "utility", "Super Chump": "utility", "Triple Splashdown": "utility",
}

# Count categories
category_counts = Counter()
for weapon in top_weapons:
    special = weapon_specials.get(weapon)
    if special:
        category = special_categories.get(special, "other")
        category_counts[category] += 1

print("Special Category Breakdown:")
for cat, count in category_counts.most_common():
    print(f"  {cat}: {count/sum(category_counts.values())*100:.0f}%")

Special pattern implications:

| Special Pattern | Build Implication | Example |
| --- | --- | --- |
| >60% aggression | Players build for survival to deploy push specials | Feature 14964 |
| >60% zoning | Players may invest in SCU/SPU for area-denial uptime | Ink Storm spam |
| >50% team_support | Team-oriented builds; may see Tenacity/CB | Support kit |
| Diverse specials | Special is NOT the clustering factor | Check sub or main class |

Component 3: Main Weapon Class Pattern Analysis

When relevant: If weapons seem diverse but may share a class (all shooters, all dualies, all chargers).

# Weapon class mapping (from weapon-vibes.md)
weapon_classes = {
    "Splattershot": "shooter", "Splattershot Jr.": "shooter", "Splattershot Pro": "shooter",
    "Dark Tetra Dualies": "dualie", "Dapple Dualies": "dualie", "Splat Dualies": "dualie",
    "E-liter 4K": "charger", "Splat Charger": "charger", "Goo Tuber": "charger",
    "Luna Blaster": "blaster", "Range Blaster": "blaster", "Rapid Blaster": "blaster",
    "Hydra Splatling": "splatling", "Mini Splatling": "splatling",
    "Splatana Stamper": "splatana", "Splatana Wiper": "splatana",
    # ... add more as needed
}

# Count classes
class_counts = Counter(weapon_classes.get(w, "other") for w in top_weapons)

print("Weapon Class Breakdown:")
for cls, count in class_counts.most_common():
    pct = count / len(top_weapons) * 100
    print(f"  {cls}: {pct:.0f}%")

Class pattern implications:

| Class Pattern | Build Implication | Example |
| --- | --- | --- |
| >60% dualies | Mobility-focused, dodge-roll builds | SSU + QSJ synergy |
| >60% chargers | Positioning, low death tolerance | Anchor builds |
| >60% blasters | Burst damage, trade-happy | QR + Comeback synergy |
| >60% splatlings | Charge management, lane holding | ISM + positioning |
| Diverse classes | Class is NOT the clustering factor | Check sub or special |

Step 4: Determine if Pattern is CAUSAL or SPURIOUS

This is the critical step. A strong pattern in ANY component could be causal or spurious.

| Pattern Type | Evidence | Implication |
| --- | --- | --- |
| CAUSAL | Kit component explains the build philosophy | Include in the label rationale |
| SPURIOUS | Weapons share other traits that better explain the clustering | Don't emphasize that component |

Questions to determine causality:

  1. Does the kit component align with decoder output?

    • Decoder promotes SCU/SS/SPU + aggressive specials → Special farming is likely causal
    • Decoder promotes ISS/SPU + shared sub weapon → Sub spam is likely causal
    • Decoder promotes SSU/QSJ + all dualies → Weapon class mobility is likely causal
  2. Do weapons share OTHER traits that better explain the clustering?

    • All dualies with aggressive specials → Is it the CLASS or the SPECIAL?
    • Test: Do other dualies (without aggressive specials) also cluster here?
  3. Does the build philosophy make sense for this kit component?

    • Survival builds + aggressive specials → "Stay alive to use push special" (causal)
    • Mobility builds + all dualies → "Dualies need SSU for dodge-roll play" (causal)
    • Survival builds + diverse subs/specials + all chargers → "Chargers can't trade" (class is causal)

Example Analysis (Special-driven):

Feature 14964 special breakdown: 77% aggression (Zipcaster, Booyah Bomb, Trizooka)
Build philosophy: "Balanced utility spread for survival"

Analysis:
- Decoder suppresses death-trading (Comeback, RP) ✓
- Decoder promotes survival abilities (SS, ISM) ✓
- Weapons have LOW-MED death tolerance ✓
- Weapons have aggressive push specials ✓
- Sub weapons are DIVERSE (no pattern)
- Weapon classes are DIVERSE (shooters, slosher, splatana)

Conclusion: CAUSAL - Players build for survival BECAUSE they have aggressive specials
           that require staying alive to deploy effectively.

Note: "Core weapons have aggressive push specials (77%) requiring survival to deploy"

Example Analysis (Class-driven):

Feature shows: 80% dualies (Dark Tetra, Dapple, Dualie Squelchers)
Decoder promotes: SSU, QSJ, RSU (mobility family)

Analysis:
- Specials are DIVERSE (not the driver)
- Subs are DIVERSE (not the driver)
- All weapons are DUALIES with dodge-roll mechanics ✓
- Dualies benefit uniquely from SSU for roll distance/recovery

Conclusion: CAUSAL - Dualies cluster because dodge-roll playstyle needs mobility
           The feature encodes "dualie mobility optimization"

Counter-example (Spurious):

Feature has 70% aggression specials
But: All weapons are CLOSE-range SLAYER with HIGH death tolerance
And: Decoder promotes QR, Comeback (death-trading)

Conclusion: SPURIOUS - Weapons are aggressive slayers who happen to have aggressive specials
           The special type is incidental to the slayer playstyle.
           Primary driver is ROLE (slayer), not KIT.
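The three causality questions above can be distilled into a rough decision rule (a sketch — the inputs are judgment calls you make from the decoder and weapon evidence, not computed values):

```python
def classify_kit_pattern(aligns_with_decoder, better_shared_trait, philosophy_fits):
    """Rough decision rule for the kit-pattern causality check.

    aligns_with_decoder: the kit component matches what the decoder promotes
    better_shared_trait: weapons share ANOTHER trait that explains clustering better
    philosophy_fits:     the build philosophy makes sense for this kit component
    """
    if better_shared_trait:
        return "SPURIOUS"
    if aligns_with_decoder and philosophy_fits:
        return "CAUSAL"
    return "INCONCLUSIVE"
```

Note the ordering: a better-fitting shared trait (like the slayer counter-example above) overrides decoder alignment.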

Step 5: Record findings in notes

If pattern is CAUSAL, add to dashboard_notes:

KIT PATTERN: {component} - {X}% {category/type} ({list top examples}).
INTERPRETATION: [Why this explains the build philosophy]

If pattern is SPURIOUS, note briefly:

KIT PATTERN: Diverse/incidental. Weapons cluster by [range/role/playstyle], not kit.

When to skip this phase:

  • Feature is clearly mechanical (single ability stacker like "SCU_57 threshold")
  • Weapons are highly diverse with no concentration in any component
  • Earlier analysis already identified clear driver (e.g., single weapon dominance)

Phase 1.8: Weapon Range/Role Classification (REQUIRED for Labels)

Before proposing any label, you MUST classify the feature's weapons by range and role. This prevents incorrect role assumptions (e.g., calling Jr./Rapid Blasters "anchors" when they're midrange).

Step 1: Extract properties for top 5-10 core weapons from weapon-vibes.md

| Property | Values | Label Implication |
|---|---|---|
| RANGE | CLOSE, MID, LONG, SNIPER | Determines qualifier |
| LANE | FRONT, MID, BACK, FLEX | Confirms positioning |
| JOB | SLAYER, SUPPORT, ANCHOR, SKIRMISH, ASSASSIN | Determines role word |
| NS_FIT | CORE, GOOD, MEH, BAD, NO | Stealth vs visible |
| DEATH_TOL | HIGH, MED, LOW | Trading vs survival |

Step 2: Find the common pattern

If most weapons share:

  • LONG/SNIPER + BACK + ANCHOR → use "Anchor" or "Backline" qualifier
  • MID/LONG + MID + SKIRMISH/SUPPORT → use "Midrange" qualifier
  • CLOSE/MID + FRONT + SLAYER → use "Slayer" or "Frontline" qualifier
  • NO/BAD NS_FIT + LOW DEATH_TOL → "Visible" or "Positional" concept (not stealth, not trading)

Step 3: Record in notes

Always include weapon classification in dashboard_notes:

WEAPON ROLE: Midrange (MID-LONG range, SKIRMISH/SUPPORT jobs, NO/BAD NS fit, LOW death tolerance)
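The range/job-to-qualifier mapping in Step 2 can be written down directly (a minimal sketch using the property values from weapon-vibes.md; the real classification also weighs LANE and NS_FIT):

```python
def role_qualifier(range_, job):
    """Map a core weapon's RANGE/JOB pattern to a label qualifier."""
    if range_ in ("LONG", "SNIPER") and job == "ANCHOR":
        return "Anchor"
    if range_ in ("MID", "LONG") and job in ("SKIRMISH", "SUPPORT"):
        return "Midrange"
    if range_ in ("CLOSE", "MID") and job == "SLAYER":
        return "Slayer"
    return None  # no clean pattern - investigate further
```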

Phase 2: Hypothesis Generation

Based on Phase 1, generate hypotheses about what the feature might encode:

For single-family dominated features:

  • H1: Pure token detector (trivial - try to disprove)
  • H2: Threshold detector (activates only at high AP)
  • H3: Interaction detector (family + something else)
  • H4: Weapon-conditional (family matters only for certain weapons)

For multi-family features:

  • H1: Synergy detector (families work together)
  • H2: Build archetype (strategic loadout pattern)
  • H3: Playstyle indicator (aggressive, defensive, support)
  • H4: Shared NEED (different builds solving the same tactical problem)

Build NEED Framework (For Multi-Modal/Diffuse Features)

When a feature activates on seemingly different build types, ask: "What NEED do these builds share?"

Features can encode solutions to problems, not just correlations. Different builds may trigger the same feature because they're different answers to the same question.

Step 1: Identify the tactical constraint these builds solve

| Question | Example |
|---|---|
| What gameplay problem do these builds address? | "How to handle death for low-death-tolerance weapons" |
| What enemy behavior are they countering? | "Dealing with aggressive flankers" |
| What win condition are they enabling? | "Special pressure" or "Map control" |

Step 2: Check weapon properties (use splatoon3-meta)

Compare enriched weapons on these axes from weapon-vibes.md:

  • Ink feel: STARVING / HUNGRY / AVERAGE / EFFICIENT / PAINTER
  • Range: MELEE / CLOSE / MID / LONG / SNIPER
  • Ninja Squid affinity: CORE / GOOD / MEH / BAD / NO
  • Death tolerance: HIGH / MED / LOW
  • Role: SLAYER / SUPPORT / ANCHOR / SKIRMISH / ASSASSIN

If all enriched weapons share properties (e.g., all HUNGRY ink + NO ninja squid + LOW death tolerance), the feature may encode a need specific to that weapon class.
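The shared-property check is mechanical: collect the axis values for each enriched weapon and keep the axes on which all agree (a sketch; weapon dicts here are hypothetical stand-ins for the weapon-vibes.md rows):

```python
def shared_properties(weapons):
    """Return {axis: value} for every axis on which all weapons agree."""
    if not weapons:
        return {}
    shared = {}
    for axis, value in weapons[0].items():
        if all(w.get(axis) == value for w in weapons[1:]):
            shared[axis] = value
    return shared
```

Any axis surviving this intersection is a candidate for the feature's shared NEED.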

Step 3: Reframe the modes as "answers to the same question"

Example (Feature 13934):

Mode A (12%): RP anchor builds (E-liter) - "I won't die, make their deaths hurt"
Mode B (88%): Zombie utility builds (DS) - "I will die sometimes, optimize respawns"

Shared NEED: "Death management for non-stealth, low-death-tolerance, midrange+ weapons"
Both modes are VALID ANSWERS to the same tactical question.

Step 4: Label the NEED, not the modes

Instead of: "Mixed: Zombie + RP Anchor" (describes the modes)
Label as: "Balanced Utility Axis (Non-Stealth Midline+)" (describes the need)

Key Insight: The model learned that these seemingly different builds share a common requirement. The feature encodes that requirement, and the modes are just different implementations.

For weapon-specific features:

  • H1: Weapon class pattern (all shooters, all chargers, etc.)
  • H2: Meta build (optimal loadout for that weapon)
  • H3: Weapon-ability interaction

Phase 3: Targeted Experiments

Run experiments to test hypotheses. Available experiment types:

| Type | Purpose |
|---|---|
| family_1d_sweep | Activation across AP rungs for one family |
| family_2d_heatmap | Interaction between two families |
| within_family_interference | Detect error correction within a family |
| weapon_sweep | Activation by weapon (optionally conditioned on family) |
| weapon_group_analysis | Compare high vs low activation by weapon |
| pairwise_interactions | Synergy/redundancy between tokens |
| token_influence_sweep | Identify enhancers and suppressors across all tokens |

⚠️ CRITICAL: Iterative Conditional Testing Protocol

1D sweeps can be MISLEADING for secondary abilities. When a feature has a strong primary driver:

The Problem

1D sweep for secondary ability (e.g., QR) across ALL contexts might show delta ≈ 0

Why this happens:

  • Most contexts have LOW primary driver (e.g., low SCU) → activation already near zero
  • Secondary ability can't suppress what's already zero
  • The few high-primary contexts get drowned out in the average

Example (Feature 18712):

QR 1D sweep (all contexts): mean_delta = -0.0006 → "QR has no effect" ❌ WRONG!
SCU × QR 2D heatmap:
  - At SCU_15: QR_0=0.13, QR_12=0.04 → QR suppresses 70%! ✅
  - At SCU_29: QR_0=0.15, QR_12=0.04 → QR suppresses 74%! ✅
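The masking effect is easy to reproduce with toy numbers (hypothetical values, not real sweep output): when 98% of contexts have low primary, a 70% conditional suppressor still averages out to almost nothing in an unconditional 1D sweep.

```python
def mean_delta(contexts, apply_secondary):
    """Unconditional 1D-sweep delta: average effect of adding the secondary."""
    deltas = [apply_secondary(c) - c["activation"] for c in contexts]
    return sum(deltas) / len(deltas)

def with_secondary(ctx):
    # Toy model: the secondary suppresses 70% of activation,
    # but only when the primary driver is high.
    return ctx["activation"] * 0.3 if ctx["primary_high"] else ctx["activation"]

# 98% of contexts have low primary -> activation already near zero.
contexts = ([{"primary_high": False, "activation": 0.001}] * 98
            + [{"primary_high": True, "activation": 0.15}] * 2)

unconditional = mean_delta(contexts, with_secondary)  # looks like "no effect"
conditional = with_secondary({"primary_high": True, "activation": 0.15}) - 0.15
```

`unconditional` is ~-0.002 while the conditional effect at high primary is ~-0.105 — exactly the trap described above.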

The Solution: Iterative 2D Testing

Protocol for features with a strong primary driver:

```
1. Confirm primary driver with 1D sweep
   └─ If monotonic response confirmed → proceed to step 2

2. For EACH correlated ability in overview (top 5-10):
   └─ Run 2D heatmap: PRIMARY × SECONDARY
   └─ Check activation at EACH primary level
   └─ Look for:
      - Suppression: secondary reduces activation at high primary
      - Synergy: secondary boosts activation at high primary
      - Spurious: no conditional effect (correlation was coincidence)

3. Group findings by semantic category:
   └─ Death-mitigation (QR, SS, CB): all suppress? → "death-averse"
   └─ Mobility (SSU, RSU): all enhance? → "mobility-synergistic"
   └─ Efficiency (ISM, ISS): mixed? → test individually
```

2D Heatmap Interpretation Guide

| Pattern | Interpretation |
|---|---|
| Peak at (high_X, 0_Y) | Y is a suppressor |
| Peak at (high_X, high_Y) | Y is a synergy |
| Flat across Y at each X | Y has no conditional effect (spurious) |
| Non-monotonic in X at some Y | Interference pattern |

Heatmap Cell Validity Check

Before drawing conclusions from heatmap cells, check the cell metadata:

Each cell in heatmap output includes:

  • n: Number of valid samples in this cell
  • std: Standard deviation of activations
  • stderr: Standard error (std / sqrt(n)) - new field
| n (samples) | Interpretation |
|---|---|
| null/0 | Impossible combination (constraint violation) - don't interpret |
| 1-4 | Very weak evidence - note uncertainty in conclusions |
| 5-19 | Moderate evidence - interpret with caution |
| 20+ | Strong evidence - interpret confidently |

High stderr (>0.1) indicates high variance - the mean may not be reliable.
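The validity tiers above reduce to a few lines of code (a sketch; tier names and the 0.1 stderr cutoff follow the tables in this section):

```python
import math

def cell_validity(n, std):
    """Tier a heatmap cell's evidence strength from its sample count and std."""
    if not n:
        return None, "impossible - don't interpret"
    stderr = std / math.sqrt(n)
    if n < 5:
        tier = "very weak"
    elif n < 20:
        tier = "moderate"
    else:
        tier = "strong"
    if stderr > 0.1:
        tier += " (high variance)"
    return stderr, tier
```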

Anti-patterns to avoid:

  • Drawing conclusions from cells with n < 5
  • Claiming "peak at X=57, Y=29" when that cell has n=2
  • Ignoring null cells (they represent impossible ability combinations)

Example interpretation:

Cell (ISM=51, IRU=29): mean=0.35, n=3, stderr=0.08
→ "ISM=51 with IRU=29 shows high activation, but n=3 means this could be noise"

Cell (ISM=51, IRU=0): mean=0.35, n=45, stderr=0.02
→ "ISM=51 without IRU shows reliable high activation (n=45)"

When to Use 2D vs 1D

| Scenario | Use 1D | Use 2D |
|---|---|---|
| Testing primary driver | ✅ | - |
| Testing secondary abilities | ❌ MISLEADING | ✅ REQUIRED |
| Looking for interactions | - | ✅ |
| Confirming suppressor hypothesis | - | ✅ |
| Quick initial scan | ✅ (with caution) | - |

Template: Death-Aversion Test Battery

For single-family dominated features, always test death-mitigation:

```bash
# Test 1: Primary × Quick Respawn
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {PRIMARY} --family-y quick_respawn \
    --rungs-x 0,6,15,29,41,57 --rungs-y 0,6,12,21,29

# Test 2: Primary × Special Saver
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {PRIMARY} --family-y special_saver \
    --rungs-x 0,6,15,29,41,57 --rungs-y 0,3,6,12,21

# Test 3: Primary × Comeback (binary ability - use the binary subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli binary \
    --feature-id {ID} --model ultra
```

If ALL three show suppression at Y>0, label includes "death-averse"

Template: Error-Correction Detection

If 1D sweeps show small deltas or effects only in unusual rung combinations, test for error-correction behavior:

```python
import polars as pl
from splatnlp.mechinterp.skill_helpers import load_context

ctx = load_context('ultra')
df = ctx.db.get_all_feature_activations_for_pagerank(FEATURE_ID)

# Get token IDs for high and low rungs
# Example: SCU_57 (high) and SCU_3 (low)
high_rung_id = ctx.vocab['special_charge_up_57']
low_rung_id = ctx.vocab['special_charge_up_3']

# Compare activation when low rung is present vs missing (among high-rung builds)
high_with_low = df.filter(
    pl.col('ability_input_tokens').list.contains(high_rung_id) &
    pl.col('ability_input_tokens').list.contains(low_rung_id)
)
high_without_low = df.filter(
    pl.col('ability_input_tokens').list.contains(high_rung_id) &
    ~pl.col('ability_input_tokens').list.contains(low_rung_id)
)

mean_with = high_with_low['activation'].mean()
mean_without = high_without_low['activation'].mean()

print(f"High rung WITH low rung present: {mean_with:.4f} (n={len(high_with_low)})")
print(f"High rung WITHOUT low rung: {mean_without:.4f} (n={len(high_without_low)})")
print(f"Delta: {mean_without - mean_with:+.4f}")

# If WITHOUT > WITH, feature fires when prerequisite is MISSING = error correction!
```

Signs of error-correction:

| Pattern | Interpretation | Label Style |
|---|---|---|
| Higher activation when low rung MISSING | "Explains away" missing evidence | "Error-Correction: {FAMILY}" |
| Only fires on weird rung combos | OOD detector | "OOD Detector: {PATTERN}" |
| Negative interactions in 2D heatmaps | Within-family interference | "Interference Feature: {FAMILY}" |

Test for within-family interference (CRITICAL for single-family):

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli family-sweep \
    --feature-id {FEATURE_ID} --family {FAMILY} --model {MODEL}
# Check for non-monotonic response patterns in the output
```

Test for interactions (2D heatmap):

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {FEATURE_ID} --family-x {FAMILY_A} --family-y {FAMILY_B} --model {MODEL}
```

Test for weapon specificity:

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli weapon-sweep \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 20 --min-examples 10
```

CHECKPOINT: After weapon_sweep, check for dominant weapon pattern:

If weapon_sweep diagnostics show "DOMINANT WEAPON" warning (one weapon has >2x delta of second):

  1. Run kit_sweep to analyze by sub weapon and special weapon:

```bash
poetry run python -m splatnlp.mechinterp.cli.runner_cli kit-sweep \
    --feature-id {FEATURE_ID} --model {MODEL} --top-k 10 --analyze-combinations
```

  2. Use the splatoon3-meta skill to look up the dominant weapon's kit:

    • Read .claude/skills/splatoon3-meta/references/weapons.md
    • Find the weapon's sub weapon and special weapon
  3. Cross-reference other high-activation weapons:

    • Do they share the same sub weapon?
    • Do they share the same special weapon?
    • If yes, the feature may encode kit behavior, not weapon behavior
  4. Update hypothesis based on findings:

    • If shared sub: feature may encode sub weapon playstyle
    • If shared special: feature may encode special spam/farming
    • If no kit pattern: feature is truly weapon-specific

Example: Feature 18712 shows Octobrush Nouveau dominant. Kit lookup reveals Squid Beakon + Ink Storm. Other high weapons (Rapid Blaster, Range Blaster) also have "special-dependent" characteristics per meta → Feature encodes "SCU for Ink Storm spam" not just "Octobrush".

Test for threshold effects:

  • Compare low-rung vs high-rung responses
  • Look for non-linear jumps in activation
  • Check if certain rungs REDUCE activation (interference)
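These three checks can be automated over a 1D sweep's output (a sketch; `rung_to_activation` is a hypothetical dict of mean activation per rung, and the 0.005 drop tolerance is an assumed noise floor):

```python
def analyze_sweep(rung_to_activation, drop_tol=0.005):
    """Flag threshold jumps and interference-style drops in a 1D sweep."""
    rungs = sorted(rung_to_activation)
    acts = [rung_to_activation[r] for r in rungs]
    steps = [(rungs[i + 1], acts[i + 1] - acts[i]) for i in range(len(acts) - 1)]
    drops = [(r, d) for r, d in steps if d < -drop_tol]  # candidate interference
    biggest_jump = max(steps, key=lambda s: s[1]) if steps else None
    return {"monotonic": not drops, "drops": drops, "biggest_jump": biggest_jump}
```

A large `biggest_jump` suggests a threshold; any entry in `drops` warrants the interference analysis above.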

Phase 4: Synthesis

Combine findings into a coherent interpretation:

  1. What triggers activation? (tokens, combinations, weapons)
  2. Is there structure beyond simple detection? (interactions, thresholds)
  3. What gameplay concept does this represent?
  4. Why would the model learn this? (predictive value for recommendations)

Phase 5: Label Proposal

Propose a label at the appropriate level:

| Complexity | Label Type | Example |
|---|---|---|
| Trivial | Token detector | "SCU Presence" (avoid if possible) |
| Simple | Threshold detector | "High SCU Investment (29+ AP)" |
| Moderate | Interaction | "SCU + Mobility Combo" |
| Tactical | Build archetype | "Special Spam Slayer Kit" |
| Strategic | Playstyle | "Aggressive Frontline Build" |

Label Specificity by Category

The label's specificity should match its concept level:

| Category | Specificity | Style | Examples |
|---|---|---|---|
| mechanical | Terse | Token-focused, technical | "SCU Threshold 29+", "ISM Stacker" |
| tactical | Mid-level | Ability combos, weapon synergies | "Zombie Slayer Dualies", "Beacon Support Kit" |
| strategic | High-concept | Playstyle, gameplay philosophy | "Positional Survival - Midrange", "Aggressive Reentry" |

Why this matters:

  • Mechanical features encode low-level patterns → label should be precise and technical
  • Tactical features encode build strategies → label should name the strategy
  • Strategic features encode gameplay philosophies → label should capture the "why"

Examples by level:

Feature encodes "SCU above 29 AP threshold"
→ Category: mechanical
→ Label: "SCU Threshold 29+" (terse, specific)

Feature encodes "QR + Comeback + Stealth Jump on dualies"
→ Category: tactical
→ Label: "Zombie Slayer Dualies" (names the combo + weapon)

Feature encodes "survive through positioning, not stealth or trading"
→ Category: strategic
→ Label: "Positional Survival - Midrange" (high-concept + role)

Strategic Label Quality Checklist

Before finalizing a label, verify:

  1. Concept over tokens: Does the label describe a GAMEPLAY CONCEPT, not just list abilities?

    • BAD: "SSU + ISM + SRU Kit", "Swim Efficiency Kit"
    • GOOD: "Positional Survival", "Aggressive Reentry"
  2. Positive framing: Does the label describe what the feature IS, not just what it avoids?

    • BAD: "Death-Averse Efficiency", "Anti-Stealth Build"
    • GOOD: "Positional Survival", "Visible Zone Control"
  3. The "why" test: Can you answer "why would a player build this?"

    • If answer is "to have SSU and ISM" → label is too mechanical
    • If answer is "to survive through positioning at midrange" → label captures concept
  4. Range/role qualifier: Have you verified weapon range (Phase 1.8) and added appropriate qualifier?

    • Backline (SNIPER/LONG + ANCHOR) → "- Anchor" or "- Backline"
    • Midrange (MID/LONG + SUPPORT/SKIRMISH) → "- Midrange"
    • Frontline (CLOSE/MID + SLAYER) → "- Slayer" or "- Frontline"

Strategic Label Format

Prefer: "[Concept] - [Qualifier]"

| Concept Examples | What it captures |
|---|---|
| Positional Survival | Stay alive through positioning, not stealth/trading |
| Aggressive Reentry | Pressure through fast respawn (zombie) |
| Stealth Approach | Win through concealment (NS builds) |
| Special Pressure | Win through special uptime |
| Lane Persistence | Hold lanes through sustain |

| Qualifier Examples | When to use |
|---|---|
| Midrange | MID-range weapons, SKIRMISH/SUPPORT jobs |
| Anchor | LONG/SNIPER range, ANCHOR job, chargers/splatlings |
| Slayer | CLOSE/MID range, SLAYER job, aggressive weapons |
| Support | SUPPORT job, team utility focus |
| (Weapon Class) | When specific to dualies, blasters, etc. |

Label Anti-Patterns to Avoid

| Anti-Pattern | Example | Why It's Bad | Better Label |
|---|---|---|---|
| Token listing | "SSU + ISM Kit" | Describes tokens, not purpose | "Positional Survival" |
| Negation-only | "Death-Averse" | Describes avoidance, not identity | "Positional Survival" |
| Wrong role | "Anchor" for Jr./Rapid | Anchor implies backline chargers | "- Midrange" |
| Too generic | "Utility Build" | Could mean anything | "Positional Survival - Midrange" |
| Flanderized | Based on top 100 only | Captures tail, not core concept | Check core region first |
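A cheap lint for the token-listing anti-pattern: flag any proposed label that is mostly ability abbreviations (a heuristic sketch; the abbreviation set mirrors the key later in this document):

```python
ABILITY_ABBREVS = {"SCU", "SSU", "RSU", "QSJ", "ISM", "ISS", "IRU", "QR", "SS",
                   "CB", "SJ", "NS", "RP", "IA", "MPU", "SPU", "BRU", "RES"}

def looks_like_token_listing(label):
    """Flag labels that are mostly ability abbreviations joined together."""
    words = label.replace("+", " ").replace("-", " ").split()
    hits = sum(1 for w in words if w.upper() in ABILITY_ABBREVS)
    return hits >= 2
```

One abbreviation is fine (mechanical labels like "SCU Threshold 29+" are legitimate); two or more suggests the label describes tokens, not a concept.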

Phase 6: Deeper Dive (For Thorny Features)

When to use: If the standard deep dive (Phases 1-5) didn't produce a clear interpretation:

  • All scaling effects weak (max_delta < 0.03)
  • No clear primary driver
  • Conflicting signals from different experiments
  • Feature seems important (high contribution to outputs) but unclear why

The Deeper Dive uses the hypothesis/state management system for systematic exploration:

Step 1: Initialize Research State

```python
from splatnlp.mechinterp.state import ResearchState, Hypothesis

state = ResearchState(feature_id=FEATURE_ID, model_type="ultra")

# Add competing hypotheses based on what you've observed
state.add_hypothesis(Hypothesis(
    id="h1",
    description="Feature encodes weapon-specific pattern for Dapple Nouveau",
    status="pending"
))
state.add_hypothesis(Hypothesis(
    id="h2",
    description="Feature encodes binary ability package (Stealth + Comeback)",
    status="pending"
))
state.add_hypothesis(Hypothesis(
    id="h3",
    description="Feature has high decoder weights despite weak activation effects",
    status="pending"
))
```

Step 2: Check Decoder Weights

For "weak activation" features, check if they have high influence via decoder weights:

```python
# Load SAE decoder weights
import torch
sae_path = '/mnt/e/dev_spillover/SplatNLP/sae_runs/run_20250704_191557/sae_model_final.pth'
sae_checkpoint = torch.load(sae_path, map_location='cpu', weights_only=True)
decoder_weight = sae_checkpoint['decoder.weight']  # [512, 24576]

# Get this feature's decoder weights to output space
feature_decoder = decoder_weight[:, FEATURE_ID]  # [512]

# Check magnitude
print(f"Decoder weight L2 norm: {torch.norm(feature_decoder):.4f}")
print(f"Max absolute weight: {torch.abs(feature_decoder).max():.4f}")

# Compare to other features
all_norms = torch.norm(decoder_weight, dim=0)
percentile = (all_norms < torch.norm(feature_decoder)).float().mean() * 100
print(f"Percentile among all features: {percentile:.1f}%")
```

If decoder weights are high (>75th percentile), the feature may be important despite weak activation effects.

Step 3: Decoder Output Analysis (CRITICAL for Diffuse Features)

When activation analysis doesn't yield a clean interpretation, analyze what the feature RECOMMENDS.

This technique asks: "What does this feature push the model to predict?" rather than "What activates this feature?"

Use the decoder CLI:

```bash
cd /root/dev/SplatNLP

# Quick output influence check
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id {FEATURE_ID} \
    --model ultra \
    --top-k 15

# Check decoder weight importance
poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
    --feature-id {FEATURE_ID} \
    --model ultra
```

See mechinterp-decoder skill for full documentation.

Interpretation Guide:

| Output Pattern | Interpretation |
|---|---|
| Promotes low-AP tokens (_3, _6) | "Recommend light investment" |
| Promotes high-AP tokens (_51, _57) | "Recommend heavy stacking" |
| Suppresses high-AP tokens | "Anti-stacking / balanced build" |
| Promotes death-mitigation (QR, CB, SS) | "Recommend zombie/respawn optimization" |
| Suppresses death-mitigation | "Death-averse / stay alive" |

Example (Feature 13934):

PROMOTES: respawn_punisher (+0.23), comeback (+0.16), QSJ_6 (+0.15), IA_3 (+0.14), ISM_6 (+0.13)
SUPPRESSES: RSU_57 (-0.30), QR_57 (-0.25), RSU_51 (-0.24)

Interpretation: Feature recommends "balanced utility spread with low-AP investments"
               and DISCOURAGES heavy stacking of any single ability.

When to use decoder output analysis:

  • Activation analysis shows multi-modal or diffuse patterns
  • No single signature covers >50% of core
  • Feature seems "confused" between different build types
  • You want to understand the feature's PURPOSE, not just what triggers it

Key Insight: A feature can activate on seemingly different builds because they share the same NEED. The output analysis reveals what the feature is recommending, which may unify apparently contradictory activation patterns.

Decoder Output Semantic Grouping (CRITICAL for Labels)

After running decoder output analysis, group promoted/suppressed tokens by MEANING, not just family:

| Semantic Group | Token Families | Gameplay Meaning |
|---|---|---|
| Mobility | SSU, RSU | How you reposition |
| Survival | BRU, IRU, RES, QR, SS, RP | How you stay alive |
| Efficiency | ISM, ISS, IRU | How you sustain pressure |
| Lethality | IA, MPU, BPU (bomb damage) | How you get kills |
| Special-Focus | SCU, SS, SPU, Tenacity | How you use specials |
| Stealth | NS, (high SSU) | How you approach unseen |
| Death-Trading | QR, CB, SJ, SS | How you weaponize respawn |

Abbreviation Key:

  • SSU = Swim Speed Up, RSU = Run Speed Up
  • BRU = Bomb (Sub) Resistance Up, RES = Ink Resistance Up
  • IRU = Ink Recovery Up, ISM = Ink Saver Main, ISS = Ink Saver Sub
  • BPU = Bomb (Sub) Power Up, SPU = Special Power Up
  • SCU = Special Charge Up, SS = Special Saver
  • QR = Quick Respawn, CB = Comeback, SJ = Stealth Jump
  • IA = Intensify Action, MPU = Main Power Up, NS = Ninja Squid, RP = Respawn Punisher

Then ask: "What COMBINATION of groups defines this feature?"

| Promoted Groups | Suppressed Groups | Strategic Concept |
|---|---|---|
| Mobility + Survival + Efficiency | Death-Trading, Stealth | Positional Survival |
| Death-Trading + Mobility | Survival | Zombie/Aggressive Reentry |
| Stealth + Mobility | - | Stealth Approach |
| Special-Focus + Efficiency | Mobility | Special Farming |
| Lethality + Mobility | Efficiency | Aggressive Slayer |

This semantic grouping directly informs the strategic label.
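The grouping step itself is mechanical (a sketch using the table above; overlaps like QR appearing under both survival and death-trading are intentional — which reading applies depends on what else is promoted):

```python
SEMANTIC_GROUPS = {
    "mobility": {"SSU", "RSU"},
    "survival": {"BRU", "IRU", "RES", "QR", "SS", "RP"},
    "efficiency": {"ISM", "ISS", "IRU"},
    "special_focus": {"SCU", "SS", "SPU"},
    "stealth": {"NS"},
    "death_trading": {"QR", "CB", "SJ", "SS"},
}

def group_tokens(abbrevs):
    """Map promoted/suppressed ability abbreviations onto semantic groups."""
    hits = {}
    for group, members in SEMANTIC_GROUPS.items():
        matched = sorted(set(abbrevs) & members)
        if matched:
            hits[group] = matched
    return hits
```

Run it once on the promoted list and once on the suppressed list, then read the strategic concept off the combination table.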

Post-Decoder Sweep Rule

After decoder output analysis, verify the top promoted/suppressed families with causal 1D sweeps.

The decoder tells you what the feature RECOMMENDS, but not whether it's causally driven by those tokens. To validate:

  1. Identify top 2 promoted families from decoder output (highest positive contributions)
  2. Identify top 2 suppressed families from decoder output (most negative contributions)
  3. Run 1D sweeps for any not yet tested in Phase 2
| Decoder Shows | Test With | Expected If Valid |
|---|---|---|
| BRU highly promoted | family_1d_sweep (BRU) | Positive delta with BRU levels |
| RSU suppressed | family_1d_sweep (RSU) | Negative delta or flat |

Example: Feature 10938 decoder showed BRU heavily promoted (+0.126, +0.120, +0.108 for different rungs), but initial sweeps only tested SSU/ISM. Should have run:

```bash
# Missing sweep that would validate decoder findings
poetry run python -m splatnlp.mechinterp.cli.runner_cli run-spec \
    --spec '{"type": "family_1d_sweep", "variables": {"family": "bomb_resistance_up"}}' \
    --feature-id 10938 --model ultra
```

Anti-pattern: Trusting decoder output without causal validation. Decoder weights show correlation to output tokens, not causal effect of input tokens.

Step 4: Run Targeted Experiments

Based on hypotheses, run specific tests:

```python
# Log experiments and findings to state
state.add_evidence(
    hypothesis_id="h1",
    experiment_type="weapon_sweep",
    finding="37% Dapple Nouveau, but also 10% .96 Gal Deco - not single-weapon",
    supports=False
)

state.add_evidence(
    hypothesis_id="h3",
    experiment_type="decoder_weight_check",
    finding="Decoder L2 norm: 0.89 (92nd percentile) - HIGH despite weak activation",
    supports=True
)
```

Step 5: Synthesize

```python
# Review all evidence
state.summarize()

# Update hypothesis statuses
state.update_hypothesis("h1", status="rejected")
state.update_hypothesis("h3", status="supported")

# Propose final interpretation
state.set_conclusion(
    "Feature has weak activation effects but high decoder weights. "
    "It acts as a 'fine-tuning' feature that makes small but important "
    "adjustments to output probabilities."
)
```

When Deeper Dive is Complete

The state object provides an audit trail of:

  • What hypotheses were considered
  • What experiments were run
  • What evidence was found
  • Why the final interpretation was chosen

This is useful for:

  • Revisiting the feature later
  • Explaining the interpretation to others
  • Identifying if new evidence should change the interpretation

Decision Trees

Single-Family Dominated Feature

```
1. Run within_family_interference to check for error correction
   └─ If interference found → "Error-Correcting {FAMILY} Detector"
   └─ If enhancement patterns → "{FAMILY} Stacker (synergistic)"
   └─ If neutral → continue

2. Check for non-monotonic 1D response
   └─ If drops at certain rungs → investigate interference
   └─ If monotonic with threshold → "High {FAMILY} Investment"
   └─ If monotonic with no threshold → probably trivial

3. Run weapon_sweep to check weapon specificity
   └─ If weapon-concentrated → run weapon_group_analysis
   └─ If weapon-specific patterns → "{WEAPON_CLASS} + {FAMILY}"

4. Run 2D sweep with second-ranked family
   └─ If interaction effect → "{FAMILY_A} + {FAMILY_B} Combo"
   └─ If no interaction → try third family

5. If all trivial → label as "{FAMILY} Stacker" with note "simple detector"
```

Multi-Family Feature

```
1. Check if families are related
   └─ All mobility (SSU, RSU, QSJ) → "Mobility Kit"
   └─ All ink efficiency (ISM, ISS, IRU) → "Efficiency Kit"
   └─ Mixed → continue

2. Run pairwise interaction analysis
   └─ Positive synergy → "Synergistic Build"
   └─ Redundancy → "Alternative Paths"

3. Check weapon breakdown
   └─ Weapon class pattern → "{CLASS} Optimal Build"

4. Consider strategic meaning
   └─ What playstyle does this combination enable?
```
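Step 1 of the multi-family tree can be checked with a subset test (a sketch; the family-name strings are assumed to match the CLI family names used elsewhere in this document, e.g. "quick_super_jump"):

```python
MOBILITY = {"swim_speed_up", "run_speed_up", "quick_super_jump"}
EFFICIENCY = {"ink_saver_main", "ink_saver_sub", "ink_recovery_up"}

def related_kit(families):
    """Do the top families form one coherent kit, or a mixed set?"""
    fams = set(families)
    if fams and fams <= MOBILITY:
        return "Mobility Kit"
    if fams and fams <= EFFICIENCY:
        return "Efficiency Kit"
    return None  # mixed - continue down the tree
```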

Example Investigation

Feature 18712 (Deep Analysis):

  1. Overview: SCU 31%, SSU 11%, ISS 10% → Single-family dominated
  2. Hypothesis: Could be SCU + something, or just trivial SCU detector
  3. 2D Heatmap (SCU × SSU): Peak at SCU=57, SSU=0. Non-monotonic drops visible!
    • SCU 6→12: DROP of 0.02 (unexpected)
    • SCU 15→21: DROP of 0.01
  4. Interference Analysis:
    • SCU_12 REDUCES SCU_51 signal by 0.10 (interference!)
    • SCU_15 ENHANCES SCU_51 signal by 0.12 (synergy!)
  5. Weapon Analysis: Effect varies by weapon
    • weapon_id_50: SCU_3 reduces SCU_15 (-0.08)
    • weapon_id_7020: SCU_3 enhances SCU_15 (+0.03)
  6. Interpretation: Feature detects "clean" high-SCU builds.
    • Low rungs (SCU_3, SCU_12) can contaminate the signal
    • Effect is weapon-dependent
  7. Label: "SCU Purity Detector (weapon-conditional)" - NOT trivial!

Key Insight: What looked like a simple "SCU detector" actually encodes complex error-correction behavior. Always check for interference!

Commands Summary

```bash
# Phase 1: Overview (with extended analyses)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {ID} --model ultra --top-k 20

# Phase 1 with extended analyses (enrichment, regions, binary, kit)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id {ID} --model ultra --all

# Phase 3a: 1D sweep for dominant family (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli family-sweep \
    --feature-id {ID} --family {FAMILY} --model ultra

# Phase 3b: 2D heatmap for interactions (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli heatmap \
    --feature-id {ID} --family-x {FAMILY_A} --family-y {FAMILY_B} --model ultra

# Phase 3c: Weapon sweep (direct subcommand)
poetry run python -m splatnlp.mechinterp.cli.runner_cli weapon-sweep \
    --feature-id {ID} --model ultra --top-k 20

# Phase 3d: Kit sweep (if dominant weapon detected)
poetry run python -m splatnlp.mechinterp.cli.runner_cli kit-sweep \
    --feature-id {ID} --model ultra --analyze-combinations

# Phase 3e: Binary ability analysis
poetry run python -m splatnlp.mechinterp.cli.runner_cli binary \
    --feature-id {ID} --model ultra

# Phase 3f: Core coverage analysis
poetry run python -m splatnlp.mechinterp.cli.runner_cli coverage \
    --feature-id {ID} --tokens {TOKEN1},{TOKEN2}

# Phase 1.7.5: Kit Component Analysis (see skill for full code)
# After weapon sweep, check for patterns in: sub weapons, specials, or weapon class
# For any concentrated pattern, determine if CAUSAL (explains build) or SPURIOUS (incidental)

# Phase 5: Set label
poetry run python -m splatnlp.mechinterp.cli.labeler_cli label \
    --feature-id {ID} --name "{LABEL}" --category {tactical|strategic|mechanical}
```

Labeling Categories

  • mechanical: Low-level patterns (token presence, simple combinations)
  • tactical: Mid-level patterns (build synergies, weapon kits)
  • strategic: High-level patterns (playstyles, meta concepts)

See Also

  • mechinterp-overview: Initial feature assessment (now includes bottom tokens)
  • mechinterp-runner: Execute experiments (includes core_coverage_analysis and decoder_output_analysis)
  • mechinterp-decoder: Decoder weight analysis - what features recommend (USE for diffuse/heterogeneous features)
  • mechinterp-next-step-planner: Generate experiment specs
  • mechinterp-labeler: Save labels
  • mechinterp-glossary-and-constraints: Domain reference
  • mechinterp-ability-semantics: Ability semantic groupings (check AFTER hypotheses)
  • splatoon3-meta: Weapon archetypes, kit lookups, meta knowledge (USE for weapon pattern interpretation)