| name | mechinterp-decoder |
| description | Analyze SAE decoder weights - output influence, feature importance, and decoder similarity |
MechInterp Decoder
Analyze sparse autoencoder (SAE) features through their decoder weights. This skill answers "What does this feature RECOMMEND?" rather than "What activates this feature?"
Purpose
Decoder analysis provides a complementary perspective to activation analysis:
| Analysis Type | Question Answered |
|---|---|
| Activation (overview, sweeps) | "What inputs activate this feature?" |
| Decoder (this skill) | "What outputs does this feature promote?" |
For diffuse or heterogeneous features where activation analysis shows multiple modes, decoder analysis often reveals the unifying concept.
When to Use
Use this skill when:
- Activation analysis is inconclusive: multiple modes or no clear pattern
- The feature appears heterogeneous: different builds activate it for different reasons
- You want to know what the feature recommends: shift the question from inputs to outputs
- You are checking AP (ability point) level preferences: does the feature prefer low-AP tokens (_3, _6) or high-AP tokens (_57)?
- You want to find similar features: cluster features by decoder similarity
Commands
Output Influence
Show what tokens a feature promotes (positive contribution) or suppresses (negative contribution):
cd /root/dev/SplatNLP
# Basic output influence
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
--feature-id 13934 \
--model ultra
# JSON output
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
--feature-id 13934 \
--model ultra \
--format json
# More tokens
poetry run python -m splatnlp.mechinterp.cli.decoder_cli output-influence \
--feature-id 13934 \
--model ultra \
--top-k 25
Sample Output:
## Feature 13934 Output Influence (ultra)
### Tokens This Feature PROMOTES
| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| respawn_punisher | +0.232 | respawn_punisher | binary |
| comeback | +0.159 | comeback | binary |
| quick_super_jump_6 | +0.155 | quick_super_jump | 6 |
| intensify_action_3 | +0.140 | intensify_action | 3 |
| ink_saver_main_6 | +0.128 | ink_saver_main | 6 |
### Tokens This Feature SUPPRESSES
| Token | Contribution | Family | AP Level |
|-------|--------------|--------|----------|
| run_speed_up_57 | -0.301 | run_speed_up | 57 |
| quick_respawn_57 | -0.247 | quick_respawn | 57 |
| swim_speed_up_57 | -0.209 | swim_speed_up | 57 |
### Interpretation
- **Top promoted**: respawn_punisher (+0.232)
- **Top suppressed**: run_speed_up_57 (-0.301)
- **Pattern**: Promotes low-AP tokens, suppresses high-AP stacking
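The table layout above can be reproduced from a plain token-to-contribution mapping. The following is a minimal sketch, not the CLI's actual implementation: the dict layout and the helper names `split_token` and `rank_influence` are hypothetical, and the numbers are copied from the sample output. It assumes that a token with a numeric suffix encodes `family_AP` and that any other token is a binary ability.

```python
# Contributions copied from the sample output above.
# The dict layout is an assumption, not the CLI's internal format.
contributions = {
    "respawn_punisher": 0.232,
    "comeback": 0.159,
    "quick_super_jump_6": 0.155,
    "intensify_action_3": 0.140,
    "ink_saver_main_6": 0.128,
    "run_speed_up_57": -0.301,
    "quick_respawn_57": -0.247,
    "swim_speed_up_57": -0.209,
}


def split_token(token):
    """Split 'family_AP' into (family, AP level).

    Tokens without a numeric suffix are treated as binary abilities.
    """
    family, _, suffix = token.rpartition("_")
    if family and suffix.isdigit():
        return family, int(suffix)
    return token, "binary"


def rank_influence(contribs, top_k=5):
    """Return (promoted, suppressed) token lists, strongest first."""
    promoted = sorted(
        (t for t in contribs if contribs[t] > 0), key=contribs.get, reverse=True
    )
    suppressed = sorted((t for t in contribs if contribs[t] < 0), key=contribs.get)
    return promoted[:top_k], suppressed[:top_k]


promoted, suppressed = rank_influence(contributions)
for label, tokens in (("PROMOTES", promoted), ("SUPPRESSES", suppressed)):
    print(label)
    for token in tokens:
        family, ap = split_token(token)
        print(f"  {token:<22} {contributions[token]:+.3f}  {family}  {ap}")
```

Sorting suppressed tokens in ascending order puts the most negative contribution first, which matches the ordering of the SUPPRESSES table.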
Weight Percentile
Gauge a feature's overall output influence from the magnitude of its decoder weights, ranked as a percentile against all other features:
poetry run python -m splatnlp.mechinterp.cli.decoder_cli weight-percentile \
--feature-id 13934 \
--model ultra
Sample Output:
## Feature 13934 Decoder Weight (ultra)
- **Magnitude**: 2.3456
- **Percentile**: 78.5%
- **Total features**: 24576
Interpretation:
- High percentile (>90%): Feature has strong output influence
- Low percentile (<10%): Feature has weak output influence
- Note: Low-magnitude features may still be important for specific tokens
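One plausible way to arrive at such a magnitude and percentile is sketched below. This is an assumption about the computation, not a description of what `weight-percentile` does internally: it takes the magnitude to be the L2 norm of a feature's decoder vector and the percentile to be the share of features with a strictly smaller norm. The function names and the toy matrix are hypothetical.

```python
import math


def l2_norm(vector):
    """Euclidean length of a decoder vector."""
    return math.sqrt(sum(x * x for x in vector))


def weight_percentile(decoder_rows, feature_id):
    """Return (magnitude, percentile) for one feature.

    Assumption: percentile = share of features whose norm is strictly smaller.
    """
    norms = [l2_norm(row) for row in decoder_rows]
    target = norms[feature_id]
    below = sum(1 for n in norms if n < target)
    return target, 100.0 * below / len(norms)


# Toy decoder with four features and two output dimensions (made-up values).
toy_decoder = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [6.0, 8.0]]
magnitude, percentile = weight_percentile(toy_decoder, 0)
print(f"Magnitude: {magnitude:.4f}, Percentile: {percentile:.1f}%")
```

Because the norm aggregates over all output tokens, a feature with a small norm can still have one large entry for a single token, which is why low-magnitude features may matter for specific tokens.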
Similar Features (by Decoder)
Find features whose decoder vectors have high cosine similarity to this one, i.e. features that recommend similar outputs:
poetry run python -m splatnlp.mechinterp.cli.decoder_cli similar \
--feature-id 13934 \
--model ultra \
--top-k 10
Sample Output:
## Features Similar to 13934 (ultra)
| Feature ID | Cosine Similarity |
|------------|-------------------|
| 13892 | 0.9234 |
| 14501 | 0.8876 |
| 12044 | 0.8521 |
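The ranking above is by cosine similarity between decoder vectors. A minimal sketch of that ranking follows, assuming each feature's decoder weights are available as a plain list of floats; `cosine`, `most_similar`, and the toy matrix are hypothetical and not part of the CLI.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(decoder_rows, feature_id, top_k=10):
    """Rank all other features by cosine similarity to feature_id."""
    query = decoder_rows[feature_id]
    scores = [
        (other, cosine(query, row))
        for other, row in enumerate(decoder_rows)
        if other != feature_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy decoder: feature 1 is a scaled copy of feature 0, feature 3 its opposite.
toy_decoder = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
neighbours = most_similar(toy_decoder, 0, top_k=3)
for fid, sim in neighbours:
    print(f"| {fid} | {sim:.4f} |")
```

Cosine similarity ignores vector length, so two features can rank as near-identical here while sitting at very different weight percentiles.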
Experiment Runner
For programmatic use or integration with runner_cli:
# Create spec file
cat > decoder_spec.json << 'EOF'
{
  "type": "decoder_output_analysis",
  "feature_id": 13934,
  "model_type": "ultra",
  "variables": {
    "top_k_promoted": 15,
    "top_k_suppressed": 15,
    "group_by_family": true,
    "include_ap_level": true
  }
}
EOF
# Run via runner CLI
poetry run python -m splatnlp.mechinterp.cli.runner_cli \
--spec-path decoder_spec.json
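When specs are generated for many features, it may be easier to write the file from Python than from a heredoc. The sketch below uses only the fields and the runner command shown above; the helpers `write_decoder_spec` and `runner_command` are hypothetical and not part of splatnlp, and the command is only built, not executed.

```python
import json
import tempfile
from pathlib import Path


def write_decoder_spec(path, feature_id, model_type="ultra", top_k=15):
    """Write a decoder_output_analysis spec with the fields shown above."""
    spec = {
        "type": "decoder_output_analysis",
        "feature_id": feature_id,
        "model_type": model_type,
        "variables": {
            "top_k_promoted": top_k,
            "top_k_suppressed": top_k,
            "group_by_family": True,
            "include_ap_level": True,
        },
    }
    Path(path).write_text(json.dumps(spec, indent=2))
    return spec


def runner_command(spec_path):
    """Build the runner CLI invocation as an argument list."""
    return [
        "poetry", "run", "python", "-m",
        "splatnlp.mechinterp.cli.runner_cli",
        "--spec-path", str(spec_path),
    ]


# Write to a temporary directory so the example leaves the repo untouched.
spec_path = Path(tempfile.mkdtemp()) / "decoder_spec.json"
spec = write_decoder_spec(spec_path, 13934)
print(" ".join(runner_command(spec_path)))
```

The returned argument list can be passed to `subprocess.run(..., check=True)` from the repository root to launch the experiment.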
Interpretation Guide
AP Level Patterns
| Pattern | Meaning |
|---|---|
| Promotes _3, _6; Suppresses _51, _57 | "Use balanced spread, not stacking" |
| Promotes _57; Suppresses low AP | "Heavy stacking is the goal" |
| Promotes binary abilities (RP = Respawn Punisher, CB = Comeback, OG = Opening Gambit) | "These specific abilities are key" |
| Mixed AP levels promoted | "Ability presence matters, not amount" |
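The table can be turned into a rough first-pass classifier over the promoted and suppressed token lists. This is only a sketch under stated assumptions: the low and high AP sets are taken from the examples in this document (_3, _6 and _51, _57), the rule order is a judgment call, and `ap_level` and `describe_ap_pattern` are hypothetical helpers. Its output should be treated as a hint to check against the full table, not as a label.

```python
LOW_AP = {3, 6}     # assumed "low" levels, from the examples in this document
HIGH_AP = {51, 57}  # assumed "high" levels


def ap_level(token):
    """Return the numeric AP suffix of a token, or 'binary' if it has none."""
    suffix = token.rpartition("_")[2]
    return int(suffix) if suffix.isdigit() else "binary"


def describe_ap_pattern(promoted, suppressed):
    """Map promoted/suppressed token lists onto the patterns in the table."""
    up = {ap_level(t) for t in promoted}
    down = {ap_level(t) for t in suppressed}
    if up & LOW_AP and down & HIGH_AP and not up & HIGH_AP:
        return "balanced spread, not stacking"
    if up & HIGH_AP and down & LOW_AP and not up & LOW_AP:
        return "heavy stacking"
    if up == {"binary"}:
        return "specific binary abilities are key"
    if up & LOW_AP and up & HIGH_AP:
        return "ability presence matters, not amount"
    return "no clear AP pattern"


# Token lists taken from the Feature 13934 sample output.
promoted = [
    "respawn_punisher", "comeback", "quick_super_jump_6",
    "intensify_action_3", "ink_saver_main_6",
]
suppressed = ["run_speed_up_57", "quick_respawn_57", "swim_speed_up_57"]
print(describe_ap_pattern(promoted, suppressed))
```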
Common Feature Types
| Output Pattern | Feature Type |
|---|---|
| Single family promoted | Family detector (e.g., a Special Charge Up (SCU) detector) |
| Low-AP promoted, high-AP suppressed | "Balanced utility recommendation" |
| Binary abilities promoted | "Build style marker" (aggressive, defensive) |
| Death perks promoted (QR = Quick Respawn, SS = Special Saver, CB = Comeback) | "Death-tolerant" archetype |
| Death perks suppressed | "Death-averse" archetype |
Integration with Investigation Workflow
Decoder analysis fits into the investigation workflow as follows:
1. Overview (mechinterp-overview)
↓
2. Hypothesis formation
↓
3. 1D Sweeps (mechinterp-runner)
↓
4. Core Coverage Check ← NEW: Catch tail markers
↓
5. If diffuse/heterogeneous:
→ Decoder Output Analysis ← THIS SKILL
↓
6. Label formulation
Example: Feature 13934 (from investigation log)
Problem: Activation analysis showed two opposite modes (RP anchor vs Zombie builds).
Solution: Decoder analysis revealed a unifying pattern:
PROMOTES: low-AP utility (_3, _6 tokens)
SUPPRESSES: heavy stacking (_51, _57 tokens)
→ Feature recommends "balanced utility spread" regardless of death strategy
Key Insight: Different builds (RP vs Zombie) activate the feature because they share a NEED (balanced utility), not a BUILD pattern.
See Also
- mechinterp-overview: Initial feature assessment
- mechinterp-runner: Run experiments (including core_coverage_analysis, decoder_output_analysis)
- mechinterp-investigator: Full investigation workflow
- mechinterp-labeler: Save labels after investigation