Claude Code Plugins

Community-maintained marketplace

Feedback

mechinterp-overview

@cesaregarza/SplatNLP
0
0

Quick "first look" overview of SAE features - top tokens, activation stats, weapons, families, sample contexts

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name mechinterp-overview
description Quick "first look" overview of SAE features - top tokens, activation stats, weapons, families, sample contexts

MechInterp Overview

Get a comprehensive first-look overview of an SAE feature before deep investigation. This skill provides a fast summary of key characteristics to help you decide what hypotheses to test.

⚠️ CRITICAL: Overview is NOT Findings

The overview shows CORRELATIONS, not CAUSATION. It is a starting point for generating hypotheses, NOT a source of conclusions.

Overview Shows What It Actually Means
Top tokens (PageRank) Tokens that CO-OCCUR with high activation (correlation)
Family breakdown Which ability families appear in high-activation examples
Top weapons Weapons present in high-activation examples

You CANNOT conclude from overview alone:

  • That a token "drives" or "causes" activation
  • That the feature "detects" a specific ability
  • That correlations are meaningful vs spurious

To make conclusions, you MUST run experiments (see mechinterp-investigator for deep dive basics).

Purpose

The overview skill:

  • Computes PageRank-weighted top tokens for a feature
  • Shows activation statistics (mean, std, median, sparsity)
  • Aggregates tokens by ability family
  • Lists top weapons associated with the feature
  • Provides sample high-activation contexts
  • Checks for existing labels and ReLU floor issues

When to Use

Use this skill when:

  1. Starting to investigate a new feature
  2. You want a quick summary before running experiments
  3. Deciding which feature to label next
  4. Checking if a feature has already been labeled

DO NOT use overview results as final findings. Always follow up with experiments.

Output Information

Section Description
Activation Stats Mean, std, median, sparsity percentage, example count
Top Tokens PageRank-weighted most important tokens (enhancers)
Bottom Tokens Tokens suppressed in high-activation examples
Family Breakdown Aggregated scores by ability family (SCU, SSU, etc.)
Top Weapons Weapons with most examples for this feature
Sample Contexts 3-5 high-activation example builds
Existing Label Current label if one exists
ReLU Floor Warning if feature is mostly zeros (>50%)

Sparsity Definition

Sparsity = % of examples where feature activation is ZERO

A high sparsity percentage means the feature fires RARELY (is selective):

Sparsity Meaning Interpretation
95%+ Very sparse Fires on only 5% of examples - very specific pattern
80-95% Moderately sparse Good discriminative feature (fires on 5-20% of examples)
50-80% Dense Fires often (20-50% of examples) - broad pattern
<50% Very dense Fires on majority of examples - may be baseline feature

Common confusion: "89% sparsity" means "fires on 11% of examples" NOT "fires often."

Think of it as: Sparsity = how empty/silent the feature usually is.

CRITICAL: Always check the Bottom Tokens section! Tokens that rarely appear in high-activation examples reveal what the feature avoids, which is often more informative than what it detects.

Usage

Command Line

cd /root/dev/SplatNLP

# Basic overview (markdown output)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra

# JSON output for programmatic use
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --format json

# More top tokens
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --top-k 25

# Full model
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 5432 \
    --model full

# Verbose logging
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 18712 \
    --model ultra \
    --verbose

Extended Analyses

Additional analysis flags provide deeper insights:

# Token enrichment (enhancers/suppressors)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment

# Activation region breakdown (anti-flanderization)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --regions

# Binary ability enrichment (main-only abilities)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --binary

# Sub/special weapon breakdown (kit analysis)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --kit

# All extended analyses at once
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --all

# Customize high-activation threshold (default: 0.90 = top 10%)
poetry run python -m splatnlp.mechinterp.cli.overview_cli \
    --feature-id 6235 --model ultra --enrichment --high-percentile 0.95

Extended Analysis Reference

Flag Purpose Output
--enrichment Token enrichment ratios Suppressors (<0.8x) and enhancers (>1.2x)
--regions Activation regions Floor/Low/Core/High/Flanderization breakdown
--binary Binary ability presence Enrichment for main-only abilities (Comeback, Stealth Jump, etc.)
--kit Sub/special breakdown Which subs/specials appear in core region
--all Enable all above Combined output
--kit-region Region for kit analysis core (default), high, or all
--high-percentile Threshold for "high" Default: 0.90 (top 10%)

Programmatic

from splatnlp.mechinterp.labeling import FeatureOverview, compute_overview
from splatnlp.mechinterp.skill_helpers import load_context

# Load context
ctx = load_context("ultra")

# Compute overview
overview = compute_overview(
    feature_id=18712,
    ctx=ctx,
    top_k_tokens=15,
    n_sample_contexts=5,
)

# Display markdown
print(overview.to_markdown())

# Access fields directly
print(f"Mean: {overview.activation_mean}")
print(f"Top token: {overview.top_tokens[0]}")
print(f"Main family: {max(overview.family_breakdown.items(), key=lambda x: x[1])}")

Sample Output

## Feature 18712 Overview (ultra)

### Activation Stats
- Mean: 0.5056
- Std: 0.5163
- Median: 0.3835
- Sparsity: 97.1%
- Examples: 108,163

### Top Tokens (PageRank)
1. `special_charge_up` (0.274)
2. `swim_speed_up` (0.099)
3. `ink_saver_sub` (0.084)
4. `stealth_jump` (0.049)
5. `run_speed_up` (0.048)

### Family Breakdown
- special_charge_up: 31.2%
- swim_speed_up: 11.2%
- ink_saver_sub: 9.6%

### Top Weapons
- weapon_id_5021: 28
- weapon_id_220: 28

### Bottom Tokens (Suppressors)
Tokens rarely present in high-activation examples:
1. `respawn_punisher` (high_rate_ratio=0.00) - Never in high activation
2. `special_saver` (high_rate_ratio=0.16) - 6x less common than baseline
3. `quick_respawn` (high_rate_ratio=0.47) - 2x less common than baseline

### Sample Contexts (High Activation)
1. [weapon_id_1111] special_charge_up_6, special_charge_up_57 (act=0.731)
2. [weapon_id_1111] special_charge_up_6, special_charge_up_51... (act=0.724)

FeatureOverview Dataclass

@dataclass
class FeatureOverview:
    feature_id: int
    model_type: str

    # Activation statistics
    activation_mean: float
    activation_std: float
    activation_median: float
    sparsity: float  # Percentage (0-100)
    n_examples: int

    # PageRank-weighted top tokens
    top_tokens: list[tuple[str, float]]

    # Bottom tokens (suppressors) - tokens excluded from high activation
    bottom_tokens: list[tuple[str, float]]  # (token, high_rate_ratio)

    # Detailed token influence statistics
    token_influences: list[TokenInfluence]

    # Aggregated by family
    family_breakdown: dict[str, float]

    # Weapon breakdown
    top_weapons: list[tuple[str, int]]

    # Sample high-activation contexts
    sample_contexts: list[SampleContext]

    # Diagnostic flags
    relu_floor_rate: float
    existing_label: str | None

Performance

  • Typical runtime: 30-60 seconds (dominated by PageRank computation)
  • Loads activation data lazily from efficient database
  • Caches context between calls in the same session

Interpretation Tips

  1. High sparsity (>90%): Most inputs don't activate this feature. Look at what's special about the ones that do.

  2. ReLU floor warning: If >50% of examples hit the ReLU floor, the feature may be hard to interpret or require special handling.

  3. Single dominant family: If one family has >50% of the breakdown, the feature likely responds to that ability family.

  4. Multiple families: If breakdown is spread across families, look for interactions or common contexts.

  5. Weapon concentration: If a few weapons dominate, the feature may be weapon-specific rather than ability-specific.

⚠️ CRITICAL: Super-Stimuli Detection

Don't only examine high activations - they may be "super-stimuli"!

High activation examples can be exaggerated, "flanderized" versions of the true concept. The core region (25-75% of effective max) often reveals the actual feature meaning better than the flanderization zone (90%+ of effective max).

Why "effective max"? Activation distributions are heavy-tailed. Use effective_max = 99.5th percentile of nonzero activations to prevent single outliers from making your core region nearly empty.

Warning Signs of Super-Stimuli

Pattern What It Means
90%+ activations only on 3-4 niche weapons Flanderization zone = super-stimuli
Core region (25-75%) has diverse mainstream weapons TRUE concept is in core region
One weapon spans ALL activation levels continuously Feature is general, not weapon-specific

Activation Region Bins

Use these standard bins (as % of effective max = 99.5th percentile) to analyze feature behavior:

Region Range (% of effective max) Typical Interpretation
Floor ≤1% Feature not activated
Low 1-10% Weak signal, early detection
Below Core 10-25% Emerging pattern
Core 25-75% TRUE CONCEPT (examine carefully!)
High 75-90% Strong expression
Flanderization Zone 90%+ Potential super-stimuli

Example: Feature 9971

Initial analysis (looking only at 90%+ activations):

  • Top weapons: Bloblobber, Glooga Deco, Range Blaster, Octobrush
  • Conclusion: "SCU stacker on special-dependent weapons"

After region analysis (examining core 25-75%):

  • Core region: Splattershot (115), Wellstring (65), Sploosh (57)
  • Splattershot appears in EVERY region (29→125→83→115→61→19)
  • True concept: "General offensive investment (death-averse)"
  • Flanderization zone (90%+): "Super-stimuli" version on niche special-dependent weapons

Key insight: Label the core-region concept, not the flanderized extreme!

Coverage Threshold Rule

When overview shows a dominant token or weapon, CHECK CORE-REGION COVERAGE before treating it as the concept.

A token can have high enrichment in the tail but be a tail marker, not the true concept.

Metric Interpretation
>50% core coverage Primary concept - safe to use in label
30-50% core coverage Significant but not universal - note in label, don't headline
<30% core coverage Tail marker / super-stimulus - NOT the concept

Example (Feature 13934):

Overview showed: respawn_punisher with 8.57x tail enrichment
BUT: RP only present in 12% of core-region examples

⚠️ Flag in overview: "respawn_punisher: high enrichment (8.57x) but <30% core coverage - may be tail marker, not core concept"

When to flag: If any token in top-10 has enrichment >3x but core coverage <30%, add a warning note.

  1. Weapon Outlier Detection: If a single weapon has >2x the examples of the second weapon, this is a weapon-dominated feature:

    • Use splatoon3-meta skill to look up the weapon's kit (sub + special)
    • Check if other high-activation weapons share the same sub OR special
    • If they share kit components, the feature may encode kit behavior, not weapon behavior
    • Run kit_sweep experiment to analyze activation by sub/special
  2. Check suppressors: Always examine bottom tokens! If death-mitigation abilities (QR, SS, CB) are suppressed, the feature encodes "death-averse" builds. See mechinterp-ability-semantics for semantic groupings.

  3. Enhancers + Suppressors together: The combination tells the full story. A feature with SCU enhanced AND death-perks suppressed isn't just "SCU detector" - it's "death-averse special builds".

  4. "Weak activation" ≠ "unimportant feature": If all scaling effects are weak (max_delta < 0.03), don't immediately label as "weak feature". Check the feature's decoder weights to output tokens. Net influence = activation × decoder weight. A feature with low activation effects but high decoder weights may still strongly influence predictions.

⚠️ WARNING: Correlation ≠ Causation

PageRank scores show correlation, NOT causation. Tokens appearing in the overview may be:

  • True drivers: Actually cause activation changes
  • Spurious correlations: Just happen to co-occur with the true driver

How to Distinguish

  1. Run 1D sweep for top token (likely primary driver)
  2. If confirmed, run 2D heatmaps for other tokens:
    • PRIMARY × SECONDARY reveals if secondary has conditional effect
    • If secondary shows effect only at high primary → true interaction
    • If secondary shows NO effect at any primary level → spurious

Example: Feature 18712

Overview showed: SCU (24%), Opening Gambit (17%), SSU (12%)

1D sweeps:
- SCU: strong effect (0.03→0.58) ✅ PRIMARY
- OG: delta ≈ 0 → appears to have no effect
- SSU: delta ≈ 0 → appears to have no effect

BUT WAIT! 1D sweeps for secondary abilities are MISLEADING.

2D heatmaps (SCU × OG, SCU × SSU):
- Both show NO conditional effect at any SCU level
- Conclusion: OG and SSU were SPURIOUS correlations

2D heatmaps (SCU × QR, SCU × SS):
- QR_12+ SUPPRESSES activation by 70-99% at high SCU!
- SS_12+ SUPPRESSES activation by 40-60%!
- Conclusion: Feature is DEATH-AVERSE (not visible in 1D)

Always verify top overview tokens with conditional 2D testing!

See mechinterp-investigator for the full Iterative Conditional Testing Protocol.

See Also

  • mechinterp-labeler: Manage labeling workflow and save labels
  • mechinterp-runner: Run experiments to test hypotheses
  • mechinterp-next-step-planner: Generate experiment specs based on overview
  • mechinterp-glossary-and-constraints: Reference for token families and AP rungs
  • mechinterp-ability-semantics: Ability semantic groupings (check AFTER forming hypotheses)
  • mechinterp-investigator: Full investigation workflow