Claude Code Plugins

Community-maintained marketplace

Feedback

mechinterp-crossmodel-matcher

@cesaregarza/SplatNLP
0
0

Match SAE features between Ultra (24K) and Full (2K) models based on activation patterns and token overlap

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name mechinterp-crossmodel-matcher
description Match SAE features between Ultra (24K) and Full (2K) models based on activation patterns and token overlap

MechInterp Cross-Model Matcher

Match features between the Ultra (24K features) and Full (2K features) SAE models to understand feature correspondence and discover monosemantic representations.

Purpose

The cross-model matcher skill:

  • Finds corresponding features across models
  • Computes similarity based on top token overlap
  • Identifies features unique to each model
  • Helps validate interpretations across model scales

When to Use

Use this skill when you:

  • Have interpreted a feature in one model and want to find its counterpart
  • Want to validate that a pattern exists across model scales
  • Need to understand what the Ultra model decomposes that Full doesn't

Usage

Programmatic

from splatnlp.mechinterp.analysis import FeatureMatcher
from splatnlp.mechinterp.skill_helpers import load_context

# Load source context (the model with your known feature)
source_ctx = load_context("ultra")

# Initialize matcher (automatically loads target model)
matcher = FeatureMatcher(source_ctx)

# Find matches for an Ultra feature in the Full model
report = matcher.find_matches(
    source_feature=18712,
    n_candidates=500,  # How many Full features to check
    n_top_matches=10   # How many matches to return
)

# View results
print(f"Searched {report.n_candidates_tested} candidates")
print(f"Best correlation: {report.best_correlation:.3f}")

for match in report.matches:
    print(f"\nFull feature {match.target_feature}:")
    print(f"  Token overlap: {match.top_token_overlap:.3f}")
    print(f"  Shared tokens: {match.shared_top_tokens[:5]}")
    print(f"  Notes: {match.notes}")

Detailed Comparison

# Compare two specific features in detail
comparison = matcher.compare_features(
    source_fid=18712,  # Ultra feature
    target_fid=1024,   # Full feature
)

print(f"Jaccard similarity: {comparison['jaccard_similarity']:.3f}")
print(f"Shared tokens: {comparison['shared_tokens'][:10]}")
print(f"Ultra-only tokens: {comparison['source_only_tokens'][:10]}")
print(f"Full-only tokens: {comparison['target_only_tokens'][:10]}")

Matching Metrics

Token Overlap (Jaccard Similarity)

Compares top tokens between features:

overlap = |source_top ∩ target_top| / |source_top ∪ target_top|
  • > 0.3: Strong match - likely same underlying concept
  • 0.1 - 0.3: Moderate match - related but not identical
  • < 0.1: Weak match - probably different concepts

Interpretation

High overlap suggests:

  • Features detect the same pattern
  • Ultra feature may be a "refinement" of Full feature
  • Good candidate for cross-model validation

Low overlap with similar activation patterns suggests:

  • Ultra model has decomposed the Full feature
  • Multiple Ultra features may combine to match one Full feature

Example: Finding Ultra Decomposition

# Example: A Full model feature that might be polysemantic
full_ctx = load_context("full")
matcher = FeatureMatcher(full_ctx)  # Source = Full

# Find what Ultra features correspond to Full feature 512
report = matcher.find_matches(source_feature=512)

# If multiple Ultra features match, the Full feature may be polysemantic
if len([m for m in report.matches if m.combined_score > 0.1]) > 3:
    print("Full feature 512 appears to be polysemantic")
    print("Ultra decomposition:")
    for m in report.matches[:5]:
        print(f"  Ultra {m.target_feature}: {m.shared_top_tokens[:3]}")

Workflow Integration

  1. Start with interpreted feature: Begin with a feature you understand
  2. Find matches: Use this skill to find counterparts
  3. Validate interpretation: Check if matches have similar behavior
  4. Document correspondence: Update research state with cross-model links
  5. Investigate decomposition: If Ultra splits a Full feature, analyze each part

Limitations

  • Token overlap is a proxy; true matching would require shared activation data
  • Different expansion factors mean different granularity
  • Some features may not have clear counterparts

See Also

  • mechinterp-cluster-mapper: Analyze groups of related features
  • mechinterp-state: Track cross-model research
  • mechinterp-runner: Validate matches with experiments