Claude Code Plugins

Community-maintained marketplace


Continually test and repair solutions to benchmark problems until they pass. Use when a solution has failing tests. Invoke with /fix-solution <snapshot_path> <problem_name> <checkpoint_index>.

Install Skill

  1. Download the skill
  2. Enable skills in Claude - Open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude - Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by going through its instructions before using it.

SKILL.md

name: fix-solution
description: Continually test and repair solutions to benchmark problems until they pass. Use when a solution has failing tests. Invoke with /fix-solution <snapshot_path> <problem_name> <checkpoint_index>.

Fix Solution

Iteratively test and repair a solution until all tests pass.

Usage: /fix-solution /path/to/snapshot problem_name checkpoint_N

Example: /fix-solution outputs/run_001/submissions/file_backup/checkpoint_2/snapshot file_backup checkpoint_2


Core Principle

Solutions are likely wrong - focus on fixing the solution code, not the tests.

Tests should generally be trusted. However, the MOMENT there is any doubt about whether the tests themselves are correct (rather than the solution), immediately raise it with the user for clarification. Do not silently assume the tests are wrong.


Step 1: Run Initial Evaluation

Run the eval-snapshot command to see current test status:

slop-code --quiet eval-snapshot {snapshot_path} \
  -p {problem_name} \
  -c {checkpoint_index} \
  -e configs/environments/docker-python3.12-uv.yaml \
  -o /tmp/eval-output \
  --json

Parse the JSON output to identify:

  • Total tests passed/failed
  • Which specific tests are failing
  • Error messages and stack traces
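
The exact location and layout of the JSON report depend on the slop-code version in use, so the following is only a rough parsing sketch; the report path and the tests/name/status/message field names are assumptions rather than the documented schema:

import json
from pathlib import Path

# Hypothetical report location; adjust to wherever --json actually writes.
report = json.loads(Path("/tmp/eval-output/report.json").read_text())

# Assumed layout: a top-level "tests" list with name/status/message fields.
tests = report.get("tests", [])
failed = [t for t in tests if t.get("status") != "passed"]

print(f"{len(tests) - len(failed)}/{len(tests)} tests passed")
for t in failed:
    # Surface the test name and the first line of its error message
    message = (t.get("message") or "").splitlines()
    print(f"FAIL {t.get('name')}: {message[0] if message else '(no message)'}")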

Step 2: Analyze Failures

For each failing test:

  1. Read the test code - Understand what behavior is expected
  2. Read the spec - Verify the test aligns with requirements (problems/{problem}/checkpoint_{N}.md)
  3. Read the solution code - Find the bug causing the failure
  4. Identify the fix - Determine what change will make the test pass

If a test seems incorrect:

  • Stop and ask the user before proceeding
  • Explain why you think the test might be wrong
  • Wait for user guidance before continuing

Step 3: Fix the Solution

Edit the solution files in the snapshot directory to fix the failing tests.

Keep fixes minimal:

  • Only change what's necessary to pass the test
  • Don't refactor or "improve" unrelated code
  • Preserve existing working functionality
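
For instance, if a test fails only because a size limit is enforced exclusively rather than inclusively, a minimal fix changes just that comparison and nothing else (a hypothetical illustration; MAX_SIZE and the check are not taken from any benchmark problem):

# Before: allows the boundary value that the test expects to be rejected
if size > MAX_SIZE:
    raise ValueError("file too large")

# After: only the comparison changes; surrounding code is untouched
if size >= MAX_SIZE:
    raise ValueError("file too large")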

Step 4: Re-run Evaluation

After making fixes, run the evaluation again:

slop-code --quiet eval-snapshot {snapshot_path} \
  -p {problem_name} \
  -c {checkpoint_index} \
  -e configs/environments/docker-python3.12-uv.yaml \
  -o /tmp/eval-output \
  --json

Loop until all tests pass or you determine a test is genuinely incorrect.
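
If it helps to script the re-run step rather than typing the command each time, here is a rough sketch of the check; the CLI arguments mirror the command above, while the report path and JSON fields are the same assumptions as in Step 1:

import json
import subprocess
from pathlib import Path

def run_eval(snapshot_path: str, problem: str, checkpoint: str) -> list[dict]:
    """Run eval-snapshot and return the failing test entries (report layout assumed)."""
    # Exit-status semantics aren't assumed here; inspect the report instead.
    subprocess.run(
        [
            "slop-code", "--quiet", "eval-snapshot", snapshot_path,
            "-p", problem,
            "-c", checkpoint,
            "-e", "configs/environments/docker-python3.12-uv.yaml",
            "-o", "/tmp/eval-output",
            "--json",
        ],
    )
    # Hypothetical report location and field names, as in Step 1.
    report = json.loads(Path("/tmp/eval-output/report.json").read_text())
    return [t for t in report.get("tests", []) if t.get("status") != "passed"]

# Paths from the usage example above; edit the solution between calls until this is empty.
failed = run_eval(
    "outputs/run_001/submissions/file_backup/checkpoint_2/snapshot",
    "file_backup",
    "checkpoint_2",
)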


Step 5: Propagate to Future Checkpoints

If the fix needs to be applied to later checkpoints (checkpoint_3, checkpoint_4, etc.):

  1. Identify which files were changed in the current checkpoint
  2. Check if those files exist in future checkpoint snapshots
  3. Apply the same fix to each future checkpoint
  4. Re-run evaluation on each updated checkpoint to verify
# Example: Fix was in main.py, propagate to checkpoint_3
# First, check if checkpoint_3 has the same issue
slop-code --quiet eval-snapshot {base_path}/checkpoint_3/snapshot \
  -p {problem_name} \
  -c checkpoint_3 \
  -e configs/environments/docker-python3.12-uv.yaml \
  -o /tmp/eval-output \
  --json

# If the same test fails, apply the fix there too

Only propagate fixes that are relevant - later checkpoints may have different implementations.
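
To see at a glance which later checkpoints even contain the changed file, and whether their copy differs from the fixed one, a small comparison sketch (the directory layout mirrors the example paths above; the file name main.py is hypothetical):

import difflib
from pathlib import Path

# Layout mirrors the example paths above; the changed file name is hypothetical.
base = Path("outputs/run_001/submissions/file_backup")
fixed_file = "main.py"
current_index = 2
fixed_copy = base / f"checkpoint_{current_index}" / "snapshot" / fixed_file

def checkpoint_index(path: Path) -> int:
    # "checkpoint_3" -> 3
    return int(path.name.rsplit("_", 1)[1])

for later in sorted(base.glob("checkpoint_*"), key=checkpoint_index):
    if checkpoint_index(later) <= current_index:
        continue
    candidate = later / "snapshot" / fixed_file
    if not candidate.exists():
        print(f"{later.name}: {fixed_file} not present, nothing to propagate")
        continue
    diff = list(difflib.unified_diff(
        candidate.read_text().splitlines(),
        fixed_copy.read_text().splitlines(),
        fromfile=str(candidate), tofile=str(fixed_copy), lineterm="",
    ))
    if diff:
        print(f"{later.name}: differs from the fixed file; review before applying the fix")
    else:
        print(f"{later.name}: already matches the fixed file")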


Iteration Flow

┌─────────────────────────┐
│   Run Evaluation        │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│   All Tests Pass?       │──── Yes ───► Done!
└───────────┬─────────────┘
            │ No
            ▼
┌─────────────────────────┐
│   Analyze Failures      │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│   Test Seems Wrong?     │──── Yes ───► Ask User
└───────────┬─────────────┘
            │ No
            ▼
┌─────────────────────────┐
│   Fix Solution Code     │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│   Propagate to Future   │
│   Checkpoints           │
└───────────┬─────────────┘
            │
            └──────────────────► (loop back to Run Evaluation)

Common Fix Patterns

Missing Error Handling

# Solution missing try/except that spec requires
try:
    result = process(data)
except ValueError as e:
    return {"error": str(e)}

Wrong Return Format

# Test expects list, solution returns dict
# Fix: Return the correct format
return [item for item in results]  # not {"items": results}

Off-by-One Errors

# Test expects 0-indexed, solution uses 1-indexed
return index  # not index + 1

Missing Edge Case Handling

# Test passes empty input, solution crashes
if not data:
    return []

Important Notes

  • Trust the tests first - Assume the bug is in the solution until proven otherwise
  • Ask about suspicious tests - Don't silently decide tests are wrong
  • Keep fixes minimal - Change only what's needed to pass
  • Always re-verify - Run evaluation after every fix
  • Propagate consistently - Apply fixes to all affected checkpoints
  • Track your changes - Use git diff or similar to review what changed