mechinterp-decoder

@cesaregarza/SplatNLP

Analyze SAE decoder weights - output influence, feature importance, and decoder similarity


SKILL.md

name: mechinterp-decoder
description: Analyze SAE decoder weights - output influence, feature importance, and decoder similarity

MechInterp Decoder

Analyze SAE features through their decoder weights. This skill answers: "What does this feature RECOMMEND?" rather than "What activates this feature?"

Purpose

Decoder analysis provides a complementary perspective to activation analysis:

| Analysis Type | Question Answered |
|---------------|-------------------|
| Activation (overview, sweeps) | "What inputs activate this feature?" |
| Decoder (this skill) | "What outputs does this feature promote?" |

For diffuse or heterogeneous features where activation analysis shows multiple modes, decoder analysis often reveals the unifying concept.
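
To make the "outputs" view concrete, here is a minimal sketch of how a per-token contribution could be derived from a decoder vector. It assumes a toy setup in which the feature's decoder vector feeds a purely linear output head; the function name `output_influence`, the argument layout, and the toy numbers are illustrative assumptions, not SplatNLP's actual implementation.

```python
def output_influence(decoder_row, output_weights, vocab, top_k=3):
    """Rank tokens by the logit shift one unit of feature activation would add.

    decoder_row:    the feature's decoder vector (length = hidden dim).
    output_weights: hidden dim x vocab size matrix (assumed linear output head).
    vocab:          token names, one per output column.
    """
    # Contribution to token j = dot(decoder_row, column j of output_weights).
    contrib = [
        sum(d * row[j] for d, row in zip(decoder_row, output_weights))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, contrib), key=lambda pair: pair[1])
    # Largest positive contributions first; most negative first for suppression.
    promoted = [(t, c) for t, c in reversed(ranked[-top_k:]) if c > 0]
    suppressed = [(t, c) for t, c in ranked[:top_k] if c < 0]
    return promoted, suppressed


# Toy numbers only; they are not taken from any real model.
vocab = ["comeback", "run_speed_up_57", "ink_saver_main_6", "swim_speed_up_57"]
decoder_row = [1.0, -0.5]
output_weights = [
    [0.2, -0.3, 0.1, -0.1],
    [0.0, 0.2, -0.1, 0.3],
]
promoted, suppressed = output_influence(decoder_row, output_weights, vocab, top_k=2)
print("promoted:", promoted)
print("suppressed:", suppressed)
```

The sign of each contribution is what separates "promotes" from "suppresses"; a real model with nonlinearities after the decoder would only be approximated by this projection.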

When to Use

Use this skill when:

  1. Activation analysis is inconclusive - Multiple modes or no clear pattern
  2. Feature appears heterogeneous - Different builds activate it for different reasons
  3. Looking for "what does it recommend" - Shift from inputs to outputs
  4. Checking AP level preferences - Does the feature prefer low-AP (_3, _6) or high-AP (_57) tokens?
  5. Finding similar features - Cluster features by decoder similarity

Commands

Output Influence

Show what tokens a feature promotes (positive contribution) or suppresses (negative contribution):

cd /root/dev/SplatNLP

# Basic output influence
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra

# JSON output
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --format json

# More tokens
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
    --feature-id 13934 \
    --model ultra \
    --top-k 25

Sample Output:

## Feature 13934 Output Influence (ultra)

### Tokens This Feature PROMOTES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| respawn_punisher | +0.232 | respawn_punisher | binary |
| comeback | +0.159 | comeback | binary |
| quick_super_jump_6 | +0.155 | quick_super_jump | 6 |
| intensify_action_3 | +0.140 | intensify_action | 3 |
| ink_saver_main_6 | +0.128 | ink_saver_main | 6 |

### Tokens This Feature SUPPRESSES

| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| run_speed_up_57 | -0.301 | run_speed_up | 57 |
| quick_respawn_57 | -0.247 | quick_respawn | 57 |
| swim_speed_up_57 | -0.209 | swim_speed_up | 57 |

### Interpretation
- **Top promoted**: respawn_punisher (+0.232)
- **Top suppressed**: run_speed_up_57 (-0.301)
- **Pattern**: Promotes low-AP tokens, suppresses high-AP stacking
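
The Family and AP Level columns in the sample output follow from the token names: a trailing `_<number>` is the AP level, and tokens without one are binary. The sketch below shows one way to split a token and total contributions per AP level when post-processing results. The helper names `split_token` and `total_by_ap_level` are hypothetical, and the input dictionary is an assumed shape, not the CLI's documented JSON schema.

```python
import re
from collections import defaultdict

# Tokens ending in _<digits> carry an AP level; anything else is treated as binary.
_AP_SUFFIX = re.compile(r"^(?P<family>.+)_(?P<ap>\d+)$")


def split_token(token):
    """Return (family, ap_level), where ap_level is an int or the string 'binary'."""
    match = _AP_SUFFIX.match(token)
    if match:
        return match.group("family"), int(match.group("ap"))
    return token, "binary"


def total_by_ap_level(contributions):
    """Sum contributions per AP level; `contributions` maps token -> float."""
    totals = defaultdict(float)
    for token, value in contributions.items():
        _, ap_level = split_token(token)
        totals[ap_level] += value
    return dict(totals)


# Values copied from the sample output above.
sample = {
    "respawn_punisher": 0.232,
    "quick_super_jump_6": 0.155,
    "ink_saver_main_6": 0.128,
    "run_speed_up_57": -0.301,
}
print(split_token("quick_super_jump_6"))
print(total_by_ap_level(sample))
```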

Weight Percentile

Check how strong a feature's output influence is by ranking its decoder weight magnitude against all other features:

poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
    --feature-id 13934 \
    --model ultra

Sample Output:

## Feature 13934 Decoder Weight (ultra)

- **Magnitude**: 2.3456
- **Percentile**: 78.5%
- **Total features**: 24576

Interpretation:

  • High percentile (>90%): Feature has strong output influence
  • Low percentile (<10%): Feature has weak output influence
  • Note: Low-magnitude features may still be important for specific tokens
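
As a rough illustration of the numbers above, the sketch below assumes "magnitude" means the L2 norm of a feature's decoder vector and "percentile" means the share of features whose norm is at or below it. Both definitions are assumptions made for illustration; the CLI may compute them differently.

```python
import math


def weight_percentile(decoder_rows, feature_id):
    """Return (magnitude, percentile) for one feature.

    decoder_rows: one decoder vector per feature (list of lists of floats).
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in decoder_rows]
    magnitude = norms[feature_id]
    # Percentile rank: percentage of features with norm <= this feature's norm.
    at_or_below = sum(1 for n in norms if n <= magnitude)
    return magnitude, 100.0 * at_or_below / len(norms)


# Toy decoder with four features; not real model weights.
toy_decoder = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
magnitude, percentile = weight_percentile(toy_decoder, feature_id=0)
print(f"magnitude={magnitude:.4f} percentile={percentile:.1f}%")
```

Because the norm aggregates over all output directions, a feature with a low percentile can still carry a large weight toward one specific token, which is the caveat in the note above.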

Similar Features (by Decoder)

Find features with similar decoder patterns (what they recommend):

poetry run python -m splatnlp.mechinterp.cli.decoder_cli similar \
    --feature-id 13934 \
    --model ultra \
    --top-k 10

Sample Output:

## Features Similar to 13934 (ultra)

| Feature ID | Cosine Similarity |
|------------|-------------------|
| 13892 | 0.9234 |
| 14501 | 0.8876 |
| 12044 | 0.8521 |
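
For reference, cosine similarity between decoder vectors can be sketched as follows. The helpers `cosine` and `most_similar` are hypothetical and use plain Python lists; the only detail taken from the sample output is that the ranking metric is cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(decoder_rows, feature_id, top_k=10):
    """Return [(other_feature_id, similarity)] sorted from most to least similar."""
    target = decoder_rows[feature_id]
    scores = [
        (i, cosine(target, row))
        for i, row in enumerate(decoder_rows)
        if i != feature_id  # skip the feature itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy decoder with four features; not real model weights.
toy_decoder = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for other_id, sim in most_similar(toy_decoder, feature_id=0, top_k=2):
    print(f"{other_id}: {sim:.4f}")
```

Cosine similarity ignores vector length, so two features can score as near-identical in what they recommend while differing widely in weight percentile.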

Experiment Runner

For programmatic use or integration with runner_cli:

# Create spec file
cat > decoder_spec.json << 'EOF'
{
  "type": "decoder_output_analysis",
  "feature_id": 13934,
  "model_type": "ultra",
  "variables": {
    "top_k_promoted": 15,
    "top_k_suppressed": 15,
    "group_by_family": true,
    "include_ap_level": true
  }
}
EOF

# Run via runner CLI
poetry run python -m splatnlp.mechinterp.cli.runner_cli \
    --spec-path decoder_spec.json
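
When many features need the same analysis, the spec can be generated from Python instead of a heredoc. This sketch reproduces only the keys shown in the spec above; `build_decoder_spec` and `write_spec` are hypothetical helpers, not part of SplatNLP.

```python
import json


def build_decoder_spec(
    feature_id,
    model_type="ultra",
    top_k_promoted=15,
    top_k_suppressed=15,
    group_by_family=True,
    include_ap_level=True,
):
    """Build a spec dict with the same fields as decoder_spec.json above."""
    return {
        "type": "decoder_output_analysis",
        "feature_id": feature_id,
        "model_type": model_type,
        "variables": {
            "top_k_promoted": top_k_promoted,
            "top_k_suppressed": top_k_suppressed,
            "group_by_family": group_by_family,
            "include_ap_level": include_ap_level,
        },
    }


def write_spec(spec, path):
    """Write the spec as JSON so it can be passed to runner_cli via --spec-path."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(spec, fh, indent=2)


spec = build_decoder_spec(13934)
print(json.dumps(spec, indent=2))
```

Calling `write_spec(spec, "decoder_spec.json")` would produce a file equivalent to the one created by the heredoc.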

Interpretation Guide

AP Level Patterns

| Pattern | Meaning |
|---------|---------|
| Promotes _3, _6; Suppresses _51, _57 | "Use balanced spread, not stacking" |
| Promotes _57; Suppresses low AP | "Heavy stacking is the goal" |
| Promotes binary (RP, CB, OG) | "These specific abilities are key" |
| Mixed AP levels promoted | "Ability presence matters, not amount" |
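
The patterns above can be turned into a rough first-pass heuristic. The sketch below is an assumption-laden illustration: the thresholds (low AP at or below 6, high AP at or above 51) are inferred from the examples in this guide, the function names are hypothetical, and a manual read of the full output should take precedence.

```python
import re

LOW_AP_MAX = 6    # assumption: _3 and _6 count as low AP
HIGH_AP_MIN = 51  # assumption: _51 and _57 count as heavy stacking


def ap_level(token):
    """Return the trailing AP level as an int, or None for binary tokens."""
    match = re.search(r"_(\d+)$", token)
    return int(match.group(1)) if match else None


def ap_pattern(promoted, suppressed):
    """Map promoted/suppressed token lists to one of the meanings in the table."""
    pro = [ap_level(t) for t in promoted]
    sup = [ap_level(t) for t in suppressed]

    def has_low(levels):
        return any(l is not None and l <= LOW_AP_MAX for l in levels)

    def has_high(levels):
        return any(l is not None and l >= HIGH_AP_MIN for l in levels)

    if pro and all(l is None for l in pro):
        return "These specific abilities are key"
    if has_low(pro) and not has_high(pro) and has_high(sup):
        return "Use balanced spread, not stacking"
    if has_high(pro) and not has_low(pro) and has_low(sup):
        return "Heavy stacking is the goal"
    if has_low(pro) and has_high(pro):
        return "Ability presence matters, not amount"
    return "No clear AP pattern"


# Token lists copied from the Feature 13934 sample output.
promoted = ["respawn_punisher", "comeback", "quick_super_jump_6",
            "intensify_action_3", "ink_saver_main_6"]
suppressed = ["run_speed_up_57", "quick_respawn_57", "swim_speed_up_57"]
print(ap_pattern(promoted, suppressed))
```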

Common Feature Types

| Output Pattern | Feature Type |
|----------------|--------------|
| Single family promoted | Family detector (e.g., SCU detector) |
| Low-AP promoted, high-AP suppressed | "Balanced utility recommendation" |
| Binary abilities promoted | "Build style marker" (aggressive, defensive) |
| Death perks promoted (QR, SS, CB) | "Death-tolerant" archetype |
| Death perks suppressed | "Death-averse" archetype |

Integration with Investigation Workflow

Decoder analysis fits into the investigation workflow as follows:

1. Overview (mechinterp-overview)
   ↓
2. Hypothesis formation
   ↓
3. 1D Sweeps (mechinterp-runner)
   ↓
4. Core Coverage Check ← NEW: Catch tail markers
   ↓
5. If diffuse/heterogeneous:
   → Decoder Output Analysis ← THIS SKILL
   ↓
6. Label formulation

Example: Feature 13934 (from investigation log)

Problem: Activation analysis showed two opposite modes (RP anchor vs Zombie builds).

Solution: Decoder analysis revealed a unifying pattern:

PROMOTES: low-AP utility (_3, _6 tokens)
SUPPRESSES: heavy stacking (_51, _57 tokens)

→ Feature recommends "balanced utility spread" regardless of death strategy

Key Insight: Different builds (RP vs Zombie) activate the feature because they share a NEED (balanced utility), not a BUILD pattern.

See Also

  • mechinterp-overview: Initial feature assessment
  • mechinterp-runner: Run experiments (including core_coverage_analysis, decoder_output_analysis)
  • mechinterp-investigator: Full investigation workflow
  • mechinterp-labeler: Save labels after investigation