Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: reward
description: Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns.

Reward Model Training

Overview

Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.

Quick Reference

Component                            Purpose
RewardTrainer                        Trainer for reward models
RewardConfig                         Training hyperparameters
AutoModelForSequenceClassification   Base model loaded with num_labels=1
task_type="SEQ_CLS"                  LoRA task type for reward models
Preference pairs                     Training data format (chosen vs. rejected)
Token ID 151668                      </think> boundary for Qwen3-Thinking models

Critical Environment Setup

import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

Critical Import Order

# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch

Reward Model Concepts

How Reward Models Work

  1. Take prompt + response as input
  2. Output scalar reward score
  3. Trained on preference pairs (chosen > rejected)
  4. Used to guide RL policy optimization

Architecture

Input: [prompt + response]
  ↓
Base LLM (frozen or LoRA)
  ↓
Classification Head (Linear → Scalar)
  ↓
Output: Reward score (float)
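
Training on preference pairs optimizes a pairwise (Bradley-Terry) objective: the chosen response should receive a higher reward than the rejected one. A minimal sketch of that objective, for intuition rather than as the TRL internals:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen, rewards_rejected):
    # Bradley-Terry preference loss: push r(chosen) above r(rejected)
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy example: two (chosen, rejected) score pairs
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.5]))
print(loss)  # smaller when chosen scores exceed rejected scores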

Dataset Format

Required Fields

dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is a function calling itself with a base case.",
        "rejected": "Recursion is loops."
    },
    # ... more preference pairs
]

Preprocessing

def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False, add_generation_prompt=True
    )
    chosen = tokenizer(prompt + sample["chosen"])
    rejected = tokenizer(prompt + sample["rejected"])
    # RewardTrainer's collator expects input_ids and attention_mask
    # for both the chosen and the rejected sequence
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }
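
A minimal usage sketch (variable names are illustrative). Depending on your TRL version, RewardTrainer may also accept the raw "chosen"/"rejected" text columns and tokenize them itself, in which case this mapping step can be skipped:

raw_dataset = Dataset.from_list(dataset)  # the preference-pair list above
train_dataset = raw_dataset.map(format_for_reward, remove_columns=raw_dataset.column_names)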

Thinking Quality Preference Dataset

Train a reward model to score thinking quality:

# Chosen = Good thinking, Rejected = Poor/no thinking
thinking_preference_data = [
    {
        "prompt": "Explain recursion in programming.",
        "chosen": """<think>
What is recursion exactly? It's when a function calls itself.
Why would we use this? To break down problems into smaller pieces.
What's a good example? Factorial: 5! = 5 * 4!
</think>

Recursion is a technique where a function calls itself with a simpler version of the problem.""",
        "rejected": "Recursion is just loops."
    },
    {
        "prompt": "What is 15 + 27?",
        "chosen": """<think>
I need to add 15 and 27.
Let me break it down: 15 + 27 = 15 + 20 + 7 = 35 + 7 = 42.
</think>

15 + 27 = 42""",
        "rejected": "42"
    },
]

dataset = Dataset.from_list(thinking_preference_data)

Setup

Load Reward Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

# Load as sequence classification model
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",  # Non-quantized base
    num_labels=1,  # Single scalar reward output
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

# Setup pad token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
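
# Sequence-classification models need pad_token_id on the model config for
# batched inputs; note the classification head itself is newly initialized
# (the base checkpoint is a causal LM), so a warning about untrained weights
# at load time is expected.
model.config.pad_token_id = tokenizer.pad_token_id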

Apply LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="SEQ_CLS",
)

model = get_peft_model(model, lora_config)
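
# Optional sanity check: the LoRA adapters (plus the classification head,
# which PEFT's SEQ_CLS task type should keep trainable via modules_to_save)
# should be the only trainable parameters
model.print_trainable_parameters()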

RewardTrainer Configuration

Basic Configuration

from trl import RewardConfig

reward_config = RewardConfig(
    output_dir="./reward_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    optim="adamw_8bit",
    max_length=512,
)

Key Parameters

Parameter                    Typical Values   Effect
learning_rate                1e-5 to 5e-5     Step size; higher trains faster but less stably
max_length                   512-1024         Truncation length for prompt + response
center_rewards_coefficient   0.0-0.1          Auxiliary loss that keeps rewards centered near zero
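
For example, reward centering is a one-line addition to the config above (0.01 is a reasonable starting value; tune for your data):

reward_config = RewardConfig(
    output_dir="./reward_output",
    center_rewards_coefficient=0.01,  # auxiliary loss that discourages reward drift
    # ... other arguments as above
)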

Training

Basic Training

from trl import RewardTrainer

trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
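
After training, persist the adapter and tokenizer so the reward model can be reloaded later for RL (the output path is illustrative):

trainer.save_model("./reward_output/final")
tokenizer.save_pretrained("./reward_output/final")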

Using the Reward Model

Score Responses

def get_reward(prompt, response):
    # For scores consistent with training, format the prompt with the same
    # chat template used in format_for_reward before concatenating
    text = prompt + response
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        reward = outputs.logits[0, 0].item()

    return reward

# Example
score = get_reward("What is Python?", "A programming language.")
print(f"Reward: {score:.3f}")

Batch Scoring

def get_rewards_batch(prompts, responses):
    texts = [p + r for p, r in zip(prompts, responses)]
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        rewards = outputs.logits[:, 0].tolist()

    return rewards

In GRPO/RLOO

# GRPOTrainer passes prompts, completions, and any extra dataset columns
# to reward functions as keyword arguments
def reward_fn(completions, prompts, **kwargs):
    return get_rewards_batch(prompts, completions)

grpo_trainer = GRPOTrainer(
    model=policy_model,
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=reward_fn,
)

Reward Scaling

Normalize Rewards

def normalized_reward(completions, prompts, **kwargs):
    raw_rewards = get_rewards_batch(prompts, completions)
    mean = sum(raw_rewards) / len(raw_rewards)
    std = (sum((r - mean) ** 2 for r in raw_rewards) / len(raw_rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in raw_rewards]

Clip Rewards

def clipped_reward(completions, prompts, **kwargs):
    rewards = get_rewards_batch(prompts, completions)
    return [max(-1.0, min(1.0, r)) for r in rewards]

Troubleshooting

Poor Discrimination

Symptom: Similar scores for chosen and rejected

Fix:

  • More training steps
  • Higher learning rate
  • Check data quality (a quick preference-accuracy check is sketched below)
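
A simple way to quantify discrimination on held-out pairs, using the get_reward helper above (eval_pairs is an illustrative name for a held-out preference list):

def preference_accuracy(pairs):
    # Fraction of pairs where the chosen response outscores the rejected one
    correct = sum(
        get_reward(ex["prompt"], ex["chosen"]) > get_reward(ex["prompt"], ex["rejected"])
        for ex in pairs
    )
    return correct / len(pairs)

print(f"Preference accuracy: {preference_accuracy(eval_pairs):.2%}")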

Reward Hacking

Symptom: RL model exploits reward model

Fix:

  • Add diversity in training data
  • Ensemble multiple reward models
  • Regularization during RL

Overconfident Scores

Symptom: Extreme reward values

Fix:

  • Use center_rewards_coefficient
  • Normalize outputs
  • Clip reward range

Memory Issues

Symptom: OOM during training

Fix:

  • Use LoRA instead of full fine-tuning
  • Reduce max_length
  • Smaller batch size

Kernel Shutdown (Jupyter)

Reward model training uses significant GPU memory. Shut down the kernel at the end of the notebook to release it:

import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Important: Always run this at the end of training notebooks before switching to different models.

When to Use This Skill

Use when:

  • Building RLHF pipelines
  • Need explicit reward signal
  • Have preference data
  • Want interpretable scoring
  • Planning to use GRPO or RLOO

Cross-References

  • bazzite-ai-jupyter:grpo - Uses reward models for RL
  • bazzite-ai-jupyter:rloo - Uses reward models for RL
  • bazzite-ai-jupyter:dpo - Alternative that doesn't need reward model
  • bazzite-ai-jupyter:peft - LoRA for efficient reward training
  • bazzite-ai-jupyter:sft - Pre-training before reward modeling
  • bazzite-ai-jupyter:inference - Inference for reward scoring