| name | reward |
| description | Reward model training for RLHF pipelines. Covers RewardTrainer, preference dataset preparation, sequence classification heads, and reward scaling for stable reinforcement learning. Includes thinking quality scoring patterns. |
# Reward Model Training

## Overview

Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.
## Quick Reference

| Component | Purpose |
|---|---|
| `RewardTrainer` | Trainer for reward models |
| `RewardConfig` | Training hyperparameters |
| `AutoModelForSequenceClassification` | Model with `num_labels=1` |
| `task_type="SEQ_CLS"` | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | `</think>` boundary for Qwen3-Thinking models |
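The `</think>` boundary (token ID 151668 on Qwen3-Thinking models) lets you score the reasoning portion separately from the final answer. A minimal sketch, assuming the completion text contains a literal `</think>` tag; `split_thinking` is a hypothetical helper, not part of TRL:

```python
# Hypothetical helper: split a Qwen3-Thinking completion into its
# reasoning block and final answer at the </think> boundary.
def split_thinking(completion: str):
    if "</think>" in completion:
        thinking, answer = completion.split("</think>", 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", completion.strip()  # no thinking block present

thinking, answer = split_thinking(
    "<think>\n5! = 5 * 4! = 120\n</think>\nThe factorial of 5 is 120."
)
```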
## Critical Environment Setup

```python
import os
from dotenv import load_dotenv

load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
```
Critical Import Order
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
## Reward Model Concepts

### How Reward Models Work

- Take prompt + response as input
- Output a scalar reward score
- Trained on preference pairs (chosen > rejected); see the loss sketch below
- Used to guide RL policy optimization
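Training minimizes the standard Bradley-Terry pairwise objective: the chosen response should receive a higher score than the rejected one. A minimal sketch of this loss with illustrative reward values (not the trainer's internal code):

```python
import torch
import torch.nn.functional as F

# Scalar rewards for a batch of preference pairs (illustrative values)
r_chosen = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.5, -0.1])

# Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
print(loss.item())
```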
### Architecture

```
Input: [prompt + response]
        ↓
Base LLM (frozen or LoRA)
        ↓
Classification Head (Linear → Scalar)
        ↓
Output: Reward score (float)
```
## Dataset Format

### Required Fields

```python
dataset = [
    {
        "prompt": "What is recursion?",
        "chosen": "Recursion is a function calling itself with a base case.",
        "rejected": "Recursion is loops."
    },
    # ... more preference pairs
]
```
### Preprocessing

```python
def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False, add_generation_prompt=True
    )
    return {
        "input_ids_chosen": tokenizer(prompt + sample["chosen"])["input_ids"],
        "input_ids_rejected": tokenizer(prompt + sample["rejected"])["input_ids"],
    }
```
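A usage sketch, assuming your installed TRL version accepts pre-tokenized `input_ids_chosen`/`input_ids_rejected` columns (some versions also expect matching `attention_mask_*` columns, and recent releases can tokenize raw `chosen`/`rejected` text directly):

```python
# Apply the preprocessing to every preference pair and drop the raw text columns
tokenized_dataset = dataset.map(
    format_for_reward,
    remove_columns=dataset.column_names,
)
```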
## Thinking Quality Preference Dataset

Train the reward model to score thinking quality:

```python
# Chosen = Good thinking, Rejected = Poor/no thinking
thinking_preference_data = [
    {
        "prompt": "Explain recursion in programming.",
        "chosen": """<think>
What is recursion exactly? It's when a function calls itself.
Why would we use this? To break down problems into smaller pieces.
What's a good example? Factorial: 5! = 5 * 4!
</think>
Recursion is a technique where a function calls itself with a simpler version of the problem.""",
        "rejected": "Recursion is just loops."
    },
    {
        "prompt": "What is 15 + 27?",
        "chosen": """<think>
I need to add 15 and 27.
Let me break it down: 15 + 27 = 15 + 20 + 7 = 35 + 7 = 42.
</think>
15 + 27 = 42""",
        "rejected": "42"
    },
]

dataset = Dataset.from_list(thinking_preference_data)
```
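Optionally hold out a few pairs for evaluation. A sketch using `Dataset.train_test_split`; the 10% split is arbitrary and assumes you have more than the two illustrative pairs above:

```python
# Hold out a small evaluation split to check chosen-vs-rejected accuracy during training
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
```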
## Setup

### Load Reward Model

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

# Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

# Load as a sequence classification model
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",  # Non-quantized base
    num_labels=1,  # Single scalar reward output
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

# Set up the pad token (required for batched sequence classification)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```
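A quick optional sanity check: a single forward pass should produce a `[1, 1]` logits tensor, i.e. one scalar reward per input sequence:

```python
# Optional sanity check: the classification head should emit one scalar per input
check = tokenizer("test input", return_tensors="pt").to(model.device)
with torch.no_grad():
    print(model(**check).logits.shape)  # expected: torch.Size([1, 1])
```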
### Apply LoRA

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="SEQ_CLS",
)

model = get_peft_model(model, lora_config)
```
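It is worth confirming that both the LoRA adapters and the classification head are trainable. With `task_type="SEQ_CLS"`, PEFT normally keeps the score head trainable automatically; if it does not appear in the check below, add `modules_to_save=["score"]` to the `LoraConfig`. A minimal check (the `"score"`/`"classifier"` names are the usual head names for this model family):

```python
# Confirm LoRA adapters (and the scoring head) are trainable
model.print_trainable_parameters()

# List trainable parameter names to verify the classification head is included
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print([n for n in trainable if "score" in n or "classifier" in n])
```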
## RewardTrainer Configuration

### Basic Configuration

```python
from trl import RewardConfig

reward_config = RewardConfig(
    output_dir="./reward_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    optim="adamw_8bit",
    max_length=512,
)
```
### Key Parameters

| Parameter | Typical Values | Effect |
|---|---|---|
| `learning_rate` | 1e-5 to 5e-5 | Training speed |
| `max_length` | 512-1024 | Input truncation |
| `center_rewards_coefficient` | 0.0-0.1 | Reward centering |
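If scores drift toward extreme magnitudes (see Troubleshooting below), `center_rewards_coefficient` adds an auxiliary term that keeps chosen/rejected rewards centered around zero. A sketch; 0.01 is an illustrative value within the range above:

```python
# Variant of the config above with reward centering enabled (0.01 is illustrative)
reward_config = RewardConfig(
    output_dir="./reward_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=1e-5,
    max_length=512,
    center_rewards_coefficient=0.01,
)
```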
## Training

### Basic Training

```python
from trl import RewardTrainer

trainer = RewardTrainer(
    model=model,
    args=reward_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
```
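After training, save the adapter (or full model) and tokenizer so the reward model can be reloaded during RL; the output path is illustrative:

```python
# Persist the trained reward model (LoRA adapter + config) and tokenizer
trainer.save_model("./reward_output/final")
tokenizer.save_pretrained("./reward_output/final")
```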
## Using the Reward Model

### Score Responses

```python
def get_reward(prompt, response):
    text = prompt + response
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    reward = outputs.logits[0, 0].item()
    return reward

# Example
score = get_reward("What is Python?", "A programming language.")
print(f"Reward: {score:.3f}")
```
### Batch Scoring

```python
def get_rewards_batch(prompts, responses):
    texts = [p + r for p, r in zip(prompts, responses)]
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    rewards = outputs.logits[:, 0].tolist()
    return rewards
```
In GRPO/RLOO
def reward_fn(completions, prompts):
return get_rewards_batch(prompts, completions)
grpo_trainer = GRPOTrainer(
model=policy_model,
args=grpo_config,
train_dataset=dataset,
reward_funcs=reward_fn,
)
## Reward Scaling

### Normalize Rewards

```python
def normalized_reward(completions, prompts):
    raw_rewards = get_rewards_batch(prompts, completions)
    mean = sum(raw_rewards) / len(raw_rewards)
    std = (sum((r - mean) ** 2 for r in raw_rewards) / len(raw_rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in raw_rewards]
```

### Clip Rewards

```python
def clipped_reward(completions, prompts):
    rewards = get_rewards_batch(prompts, completions)
    return [max(-1.0, min(1.0, r)) for r in rewards]
```
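The two transforms compose naturally. A small illustrative wrapper that normalizes within the batch and then clips, suitable for passing as `reward_funcs`:

```python
def scaled_reward(completions, prompts, **kwargs):
    # Normalize within the batch, then clip to a bounded range for stable RL updates
    rewards = normalized_reward(completions, prompts)
    return [max(-1.0, min(1.0, r)) for r in rewards]
```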
## Troubleshooting

### Poor Discrimination

Symptom: Similar scores for chosen and rejected responses

Fix:
- More training steps
- Higher learning rate
- Check data quality

### Reward Hacking

Symptom: The RL policy exploits weaknesses in the reward model

Fix:
- Add diversity to the training data
- Ensemble multiple reward models
- Regularize during RL

### Overconfident Scores

Symptom: Extreme reward values

Fix:
- Use `center_rewards_coefficient`
- Normalize outputs
- Clip the reward range

### Memory Issues

Symptom: OOM during training

Fix:
- Use LoRA instead of full fine-tuning
- Reduce `max_length`
- Use a smaller batch size
### Kernel Shutdown (Jupyter)

Reward model training uses significant GPU memory. Shut down the kernel to release it:

```python
import IPython

print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
```

Important: Always run this at the end of training notebooks before switching to a different model.
## When to Use This Skill

Use when:

- Building RLHF pipelines
- You need an explicit reward signal
- You have preference data
- You want interpretable scoring
- Planning to use GRPO or RLOO
## Cross-References

- `bazzite-ai-jupyter:grpo` - Uses reward models for RL
- `bazzite-ai-jupyter:rloo` - Uses reward models for RL
- `bazzite-ai-jupyter:dpo` - Alternative that doesn't need a reward model
- `bazzite-ai-jupyter:peft` - LoRA for efficient reward training
- `bazzite-ai-jupyter:sft` - Pre-training before reward modeling
- `bazzite-ai-jupyter:inference` - Inference for reward scoring