
Slime RL Training

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

Source: Optional — install with hermes skills install official/mlops/slime
Path: optional-skills/mlops/slime
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: sglang-router>=0.2.3, ray, torch>=2.0.0, transformers>=4.40.0
Tags: Reinforcement Learning, Megatron-LM, SGLang, GRPO, Post-Training, GLM

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

slime: LLM Post-Training Framework for RL Scaling


slime is an LLM post-training framework from Tsinghua’s THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

Choose slime when you need:

  • Megatron-LM native training with SGLang inference
  • Custom data generation workflows with flexible data buffers
  • Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • Research-grade framework with production backing (Z.ai)

Consider alternatives when:

  • You need enterprise-grade stability features → use miles
  • You want flexible backend swapping → use verl
  • You need PyTorch-native abstractions → use torchforge

Core components:

  • Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
  • Rollout: SGLang-based high-throughput generation with router
  • Data Buffer: Flexible prompt management and sample storage
  • Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3
┌─────────────────────────────────────────────────────────┐
│                       Data Buffer                       │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │  Rollout (SGLang + Router)  │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘
Installation:

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps
Or install from source:

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .
Quick start:

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 4 \
  --rollout-num-gpus 4 \
  --advantage-estimator grpo \
  --use-kl-loss --kl-loss-coef 0.001 \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --prompt-data /path/to/data.jsonl \
  ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

Use this workflow for training reasoning models with group-relative advantages.
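For orientation, GRPO scores each response against the other samples drawn for the same prompt. A minimal sketch of the standard group-relative advantage (illustrative only, not necessarily slime's exact internals):

import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Z-score each reward within its group of n-samples-per-prompt rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for 8 samples of one prompt: correct answers get positive advantage.
print(grpo_advantages([1, 0, 0, 1, 1, 0, 1, 1]))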

Prerequisites:

  • Docker environment or Megatron-LM + SGLang installed
  • Model checkpoint (HuggingFace or Megatron format)
  • Training data in JSONL format
# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or with chat format:

{
  "prompt": [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is 15 + 27?"}
  ],
  "label": "42"
}
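If you generate this file programmatically, a minimal sketch (write_jsonl is our helper name, not part of slime):

import json

def write_jsonl(path, records):
    """Write one JSON object per line, the layout --prompt-data expects."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl("data.jsonl", [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": [{"role": "user", "content": "What is 15 + 27?"}], "label": "42"},
])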

Choose a pre-configured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...
# Source your model
source scripts/models/qwen3-4B.sh
Launch training:

python train.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --use-kl-loss \
  --kl-loss-coef 0.001 \
  --prompt-data /path/to/train.jsonl \
  --input-key prompt \
  --label-key label \
  --apply-chat-template \
  --rollout-batch-size 32 \
  --n-samples-per-prompt 8 \
  --global-batch-size 256 \
  --num-rollout 3000 \
  --save-interval 100 \
  --eval-interval 50 \
  ${MODEL_ARGS[@]}

Then monitor progress:
  • Check TensorBoard: tensorboard --logdir outputs/
  • Verify reward curves are increasing
  • Monitor GPU utilization across nodes

Use async mode for higher throughput by overlapping rollout generation with training. It is a good fit when you have:

  • Large models with long generation times
  • High GPU idle time in synchronous mode
  • Sufficient memory for buffering
Launch with train_async.py:

python train_async.py \
  --actor-num-nodes 1 \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --advantage-estimator grpo \
  --async-buffer-size 4 \
  --prompt-data /path/to/train.jsonl \
  ${MODEL_ARGS[@]}
Key async flags:

--async-buffer-size 4          # Number of rollouts to buffer
--update-weights-interval 2    # Sync weights every N rollouts

Use this workflow for training agents with tool use or multi-step reasoning.

You provide:

  • Custom generate function for multi-turn logic
  • Tool/environment interface
custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt
        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)
            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break
        sample.response = response
        sample.reward = compute_reward(sample)
    return samples
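The helpers above (generate_single, extract_tool_call, execute_tool, compute_reward) are yours to implement. As one sketch, extract_tool_call could parse a <tool_call> tag convention (an assumption, not a slime requirement):

import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_call(response):
    """Return a parsed tool-call dict, or None if the model answered directly."""
    match = TOOL_CALL_RE.search(response)
    if match is None:
        return None
    try:
        # e.g. {"name": "search", "arguments": {"query": "..."}}
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None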
Point training at it:

python train.py \
  --custom-generate-function-path custom_generate.py \
  --max-turns 5 \
  --prompt-data /path/to/agent_data.jsonl \
  ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.


slime uses three types of arguments:

1. Megatron Arguments (passed directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang Arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime Arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate # Share GPUs between training/inference
# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label
# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256
# Algorithm
--advantage-estimator grpo # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
The batch-size arguments must satisfy:

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1 (one optimizer step per rollout).
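A quick sanity check of a configuration before launching, using the example values (plain arithmetic, no slime APIs):

rollout_batch_size = 32
n_samples_per_prompt = 8
global_batch_size = 256

samples_per_rollout = rollout_batch_size * n_samples_per_prompt   # 256 samples
assert samples_per_rollout % global_batch_size == 0, "batch sizes don't divide evenly"
num_steps_per_rollout = samples_per_rollout // global_batch_size  # 1 optimizer step
print(num_steps_per_rollout)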


slime’s data buffer enables flexible data management:

class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass


class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
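As a concrete example, a buffer_filter that prioritizes high-reward samples might look like this (a sketch against the hook signature above; the .reward attribute matches the sample objects used in the workflows earlier):

def reward_prioritized_filter(args, buffer, num_samples):
    """Select the num_samples highest-reward samples and drop them from the buffer."""
    buffer.sort(key=lambda s: s.reward, reverse=True)
    selected = buffer[:num_samples]
    del buffer[:num_samples]
    return selected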

Troubleshooting

Symptoms: Inference engine dies mid-training

Solutions:

# Enable fault tolerance
--use-fault-tolerance
# Increase memory allocation
--sglang-mem-fraction-static 0.85
# Reduce batch size
--rollout-batch-size 16

Symptoms: Training hangs after rollout

Solutions:

# Increase sync interval
--update-weights-interval 5
# Use colocated mode (no network transfer)
--colocate

Symptoms: CUDA OOM in backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations
# Reduce micro-batch size
--micro-batch-size 1
# Enable sequence parallelism
--sequence-parallel

Symptoms: GPU idle during data fetch

Solutions:

# Increase data workers
--num-data-workers 4
# Use streaming dataset
--streaming-data

Model Family    Configurations
GLM             GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen            Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek        V3, V3.1, R1
Llama           Llama 3 (8B, 70B)
Others          Kimi K2, Moonlight-16B

Each model has pre-configured scripts in scripts/models/.


Share GPUs between training and inference to cut the total GPU count; note the lower --sglang-mem-fraction-static so both fit on each device:

python train.py \
  --colocate \
  --actor-num-gpus-per-node 8 \
  --sglang-mem-fraction-static 0.4 \
  ${MODEL_ARGS[@]}
Define a custom reward model in a file:

custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
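A more concrete variant of the same interface, sketched with standard HuggingFace transformers calls (HFRewardModel is our name; only the compute_reward contract comes from the skill text above):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class HFRewardModel:
    def __init__(self, model_path):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.eval()

    @torch.no_grad()
    def compute_reward(self, prompts, responses):
        texts = [p + r for p, r in zip(prompts, responses)]
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        return self.model(**inputs).logits.squeeze(-1).tolist()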
Register it:

--custom-rm-path custom_rm.py
Add evaluation datasets to the training command:

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16