Fine-Tuning with TRL
TRL: SFT, DPO, PPO, GRPO, and reward modeling for LLM RLHF.
Skill metadata
| Source | Bundled (installed by default) |
| Path | skills/mlops/training/trl-fine-tuning |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | trl, transformers, datasets, peft, accelerate, torch |
| Tags | Post-Training, TRL, Reinforcement Learning, Fine-Tuning, SFT, DPO, PPO, GRPO, RLHF, Preference Alignment, HuggingFace |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
TRL - Transformer Reinforcement Learning
Quick start
TRL provides post-training methods for aligning language models with human preferences.
Installation:
```bash
pip install trl transformers datasets peft accelerate
```

Supervised Fine-Tuning (instruction tuning):
```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
```

DPO (align with preferences):
```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer,
)
trainer.train()
```

Common workflows
Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
Complete pipeline from base model to human-aligned model.
Copy this checklist:
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model

Step 1: Supervised fine-tuning
Train base model on instruction-following data:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch",
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model()
```

Step 2: Train reward model
Train model to predict human preferences:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1,  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")

# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5,
)

# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model()
```

Step 3: PPO reinforcement learning
Optimize policy using reward model:
```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen2.5-0.5B-SFT \
    --reward_model_path Qwen2.5-0.5B-Reward \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir Qwen2.5-0.5B-PPO \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000
```

Step 4: Evaluate
```python
from transformers import pipeline

# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")

# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```

Workflow 2: Simple preference alignment with DPO
Align a model with preferences without training a reward model.
Copy this checklist:
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment

Step 1: Prepare preference dataset
Dataset format:
{ "prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}Load dataset:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# Or load your own
# dataset = load_dataset("json", data_files="preferences.json")
```

Step 2: Configure DPO
```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10,
)
```

Step 3: Train with DPOTrainer
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()
trainer.save_model()
```

CLI alternative:
```bash
trl dpo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-7 \
    --beta 0.1
```

Workflow 3: Memory-efficient online RL with GRPO
Train with reinforcement learning using minimal memory.
For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see references/grpo-training.md. A production-ready training script is in templates/basic_grpo_training.py.
Copy this checklist:
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer

Step 1: Define reward function
```python
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.

    Args:
        completions: List of generated texts

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(float(score))
    return rewards
```

Or use a reward model:
```python
from transformers import pipeline

reward_model = pipeline("text-classification", model="reward-model-path")

def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
```

Step 2: Configure GRPO
```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_completion_length=128,
)
```

Step 3: Train with GRPOTrainer
```python
from datasets import load_dataset
from trl import GRPOTrainer

# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset,
)

trainer.train()
```

CLI:
```bash
trl grpo \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/tldr \
    --output_dir Qwen2-GRPO \
    --num_generations 4
```

When to use vs alternatives
Use TRL when:
- Need to align model with human preferences
- Have preference data (chosen/rejected pairs)
- Want to use reinforcement learning (PPO, GRPO)
- Need reward model training
- Doing RLHF (full pipeline)
Method selection (the dataset shape each method expects is sketched after this list):
- SFT: Have prompt-completion pairs, want basic instruction following
- DPO: Have preferences, want simple alignment (no reward model needed)
- PPO: Have reward model, need maximum control over RL
- GRPO: Memory-constrained, want online RL
- Reward Model: Building RLHF pipeline, need to score generations
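To make the selection criteria concrete, here is a minimal sketch of the dataset record shape each method expects, following TRL's standard dataset formats; the field values are purely illustrative.

```python
# SFT: prompt-completion pairs (conversational "messages" datasets also work)
sft_example = {"prompt": "What is 2 + 2?", "completion": "2 + 2 = 4."}

# DPO / reward modeling: preference pairs with a chosen and a rejected response
preference_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "I don't know.",
}

# PPO / GRPO (online RL): prompt-only records; completions are generated during training
prompt_only_example = {"prompt": "Summarize the following post: ..."}
```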
Use alternatives instead:
- HuggingFace Trainer: Basic fine-tuning without RL
- Axolotl: YAML-based training configuration
- LitGPT: Educational, minimal fine-tuning
- Unsloth: Fast LoRA training
Common issues
Issue: OOM during DPO training
Reduce batch size and sequence length:
```python
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,  # Reduce from 1024
    gradient_accumulation_steps=8,  # Maintain effective batch
)
```

Or use gradient checkpointing:
```python
model.gradient_checkpointing_enable()
```

Issue: Poor alignment quality
Tune beta parameter:
```python
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5)  # Default 0.1

# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
```

Issue: Reward model not learning
Check loss type and learning rate:
```python
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3,  # Train longer
)
```

Ensure preference dataset has clear winners:
```python
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
```

Issue: PPO training unstable
Adjust KL coefficient:
```python
from trl import PPOConfig

config = PPOConfig(
    kl_coef=0.1,  # Increase from 0.05
    cliprange=0.1,  # Reduce from 0.2
)
```

Advanced topics
Section titled “Advanced topics”SFT training guide: See references/sft-training.md for dataset formats, chat templates, packing strategies, and multi-GPU training.
DPO variants: See references/dpo-variants.md for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters (a configuration sketch appears at the end of this section).
Reward modeling: See references/reward-modeling.md for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
Online RL methods: See references/online-rl.md for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
GRPO deep dive: See references/grpo-training.md for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in templates/basic_grpo_training.py.
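As a brief illustration of two options covered in those references, the sketch below enables SFT sequence packing and switches DPO to the IPO loss. It assumes a recent TRL release; exact option names and recommended hyperparameters are documented in the reference files above.

```python
from trl import SFTConfig, DPOConfig

# Sequence packing for SFT: concatenate short examples into full-length sequences
sft_args = SFTConfig(
    output_dir="sft-packed",
    packing=True,  # assumption: packing flag as exposed in recent TRL releases
)

# DPO loss variants are selected via loss_type (the default is the sigmoid/DPO loss)
dpo_args = DPOConfig(
    output_dir="dpo-ipo",
    loss_type="ipo",  # IPO variant; see references/dpo-variants.md for others
    beta=0.1,
)
```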
Hardware requirements
- GPU: NVIDIA (CUDA required)
- VRAM: Depends on model and method
  - SFT 7B: 16GB (with LoRA)
  - DPO 7B: 24GB (stores reference model)
  - PPO 7B: 40GB (policy + reward model)
  - GRPO 7B: 24GB (more memory efficient)
- Multi-GPU: Supported via accelerate
- Mixed precision: BF16 recommended (A100/H100)
Memory optimization (see the sketch after this list):
- Use LoRA/QLoRA for all methods
- Enable gradient checkpointing
- Use smaller batch sizes with gradient accumulation
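A minimal sketch combining these options for SFT, using TRL's peft integration; the hyperparameter values are illustrative starting points rather than tuned recommendations, and `dataset` is assumed to be loaded as in Workflow 1.

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA adapters instead of full fine-tuning
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # keep the effective batch size at 8
    gradient_checkpointing=True,    # trade compute for activation memory
    bf16=True,                      # mixed precision on BF16-capable GPUs
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,  # assumes the dataset loaded in Workflow 1
    peft_config=peft_config,
)
trainer.train()
```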
Resources
- Docs: https://huggingface.co/docs/trl/
- GitHub: https://github.com/huggingface/trl
- Papers:
- “Training language models to follow instructions with human feedback” (InstructGPT, 2022)
- “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (DPO, 2023)
- “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (introduces GRPO, 2024)
- Examples: https://github.com/huggingface/trl/tree/main/examples/scripts