# HuggingFace Accelerate — Simplest distributed training API

The simplest distributed training API: 4 lines of code add distributed support to any PyTorch script. One unified API covers DeepSpeed, FSDP, Megatron, and DDP, with automatic device placement, mixed precision (FP16/BF16/FP8), interactive configuration, and a single launch command. It is the HuggingFace ecosystem standard.
## Skill metadata

| Field | Value |
| --- | --- |
| Source | Optional — install with `hermes skills install official/mlops/accelerate` |
| Path | `optional-skills/mlops/accelerate` |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | accelerate, torch, transformers |
| Tags | Distributed Training, HuggingFace, Accelerate, DeepSpeed, FSDP, Mixed Precision, PyTorch, DDP, Unified API, Simple |
## Reference: full SKILL.md

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
## HuggingFace Accelerate - Unified Distributed Training

### Quick start
Accelerate simplifies distributed training to 4 lines of code.
Installation:

```bash
pip install accelerate
```

Convert a PyTorch script (4 lines):
```diff
  import torch
+ from accelerate import Accelerator

+ accelerator = Accelerator()

  model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())
  dataloader = torch.utils.data.DataLoader(dataset)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  for batch in dataloader:
      optimizer.zero_grad()
      loss = model(batch)
-     loss.backward()
+     accelerator.backward(loss)
      optimizer.step()
```
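For reference, here is a self-contained, runnable version of the converted script; the toy model, data, and loss are illustrative stand-ins, not part of the original skill:

```python
# Minimal runnable sketch of the converted script (toy data; names illustrative)
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 2))  # toy regression data
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)
    optimizer.step()
```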
Run (single command):

```bash
accelerate launch train.py
```

### Common workflows
#### Workflow 1: From single GPU to multi-GPU

Original script:
```python
import torch

model = torch.nn.Linear(10, 2).to('cuda')
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

for epoch in range(10):
    for batch in dataloader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
```

With Accelerate (4 lines added):
```python
import torch
from accelerate import Accelerator  # +1

accelerator = Accelerator()  # +2

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # +3

for epoch in range(10):
    for batch in dataloader:
        # No .to('cuda') needed - automatic!
        optimizer.zero_grad()
        loss = model(batch).mean()
        accelerator.backward(loss)  # +4
        optimizer.step()
```

Configure (interactive):
```bash
accelerate config
```

Questions (a non-interactive Python alternative is sketched after the list):
- Which machine? (single/multi GPU/TPU/CPU)
- How many machines? (1)
- Mixed precision? (no/fp16/bf16/fp8)
- DeepSpeed? (no/yes)
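Where an interactive prompt is inconvenient (notebooks, CI), Accelerate also provides `accelerate.utils.write_basic_config`, which writes a default config file for the current machine; a minimal sketch, assuming single-node defaults are acceptable:

```python
# Non-interactive alternative to `accelerate config` (e.g. in notebooks or CI).
# Writes a basic default config for the current machine.
from accelerate.utils import write_basic_config

write_basic_config(mixed_precision='bf16')
```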
Launch (works on any setup):

```bash
# Single GPU
accelerate launch train.py

# Multi-GPU (8 GPUs)
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node
accelerate launch --multi_gpu --num_processes 16 \
    --num_machines 2 --machine_rank 0 \
    --main_process_ip $MASTER_ADDR \
    train.py
```
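Inside a notebook, where a shell launcher is awkward, the same training code can be launched with `accelerate.notebook_launcher`; a minimal sketch (the `training_function` name is illustrative):

```python
# Launch a training function across multiple processes from a notebook
from accelerate import notebook_launcher

def training_function():
    # build the Accelerator, model, and dataloaders, then train as shown above
    ...

notebook_launcher(training_function, num_processes=8)
```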
#### Workflow 2: Mixed precision training

Enable FP16/BF16:
```python
from accelerate import Accelerator

# FP16 (with gradient scaling)
accelerator = Accelerator(mixed_precision='fp16')

# BF16 (no scaling, more stable)
accelerator = Accelerator(mixed_precision='bf16')

# FP8 (H100+)
accelerator = Accelerator(mixed_precision='fp8')

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Everything else is automatic!
for batch in dataloader:
    with accelerator.autocast():  # Optional, done automatically
        loss = model(batch)
    accelerator.backward(loss)
```
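With FP16 in particular, gradient clipping should go through the accelerator so gradients are unscaled before the norm is computed; a minimal sketch:

```python
# Clip gradients via the accelerator so FP16 grads are unscaled first
accelerator.backward(loss)
if accelerator.sync_gradients:
    accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```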
#### Workflow 3: DeepSpeed ZeRO integration

Enable DeepSpeed ZeRO-2:
```python
from accelerate import Accelerator, DeepSpeedPlugin

# Accelerator expects a DeepSpeedPlugin, not a plain dict
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # ZeRO-2
    offload_optimizer_device="none",  # no optimizer offload
    gradient_accumulation_steps=4,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    deepspeed_plugin=deepspeed_plugin,
)

# Same code as before!
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Or via config:
```bash
accelerate config
# Select: DeepSpeed → ZeRO-2
```

`deepspeed_config.json`:

```json
{
  "fp16": {"enabled": false},
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8
  }
}
```

Launch:
```bash
accelerate launch --config_file deepspeed_config.json train.py
```

#### Workflow 4: FSDP (Fully Sharded Data Parallel)
Enable FSDP:
```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy="FULL_SHARD",             # ZeRO-3 equivalent
    auto_wrap_policy="transformer_based_wrap",  # wrap per transformer block
    cpu_offload=False,
)

accelerator = Accelerator(
    mixed_precision='bf16',
    fsdp_plugin=fsdp_plugin,
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```

Or via config:
```bash
accelerate config
# Select: FSDP → Full Shard → No CPU Offload
```

#### Workflow 5: Gradient accumulation
Accumulate gradients:
```python
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):  # Handles accumulation and gradient sync
        loss = model(batch)
        accelerator.backward(loss)
        optimizer.step()       # no-op on accumulation steps
        optimizer.zero_grad()  # after step, not before backward, or accumulated grads are lost
```

Effective batch size: `batch_size * num_gpus * gradient_accumulation_steps`.
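To make the arithmetic concrete: with a per-device batch size of 32, 8 GPUs, and 4 accumulation steps, each optimizer step sees 32 × 8 × 4 = 1024 samples. A small sketch that reads the multipliers off the accelerator (the per-device batch size of 32 is illustrative):

```python
# Global batch size per optimizer step (example numbers are illustrative)
per_device_batch_size = 32
effective_batch_size = (per_device_batch_size
                        * accelerator.num_processes
                        * accelerator.gradient_accumulation_steps)
print(effective_batch_size)  # e.g. 32 * 8 GPUs * 4 steps = 1024
```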
### When to use vs alternatives

Use Accelerate when:
- You want the simplest distributed training setup
- You need one script that runs on any hardware
- You use the HuggingFace ecosystem
- You want backend flexibility (DDP/DeepSpeed/FSDP/Megatron)
- You need quick prototyping
Key advantages:
- 4 lines: Minimal code changes
- Unified API: Same code for DDP, DeepSpeed, FSDP, Megatron
- Automatic: Device placement, mixed precision, sharding
- Interactive config: No manual launcher setup
- Single launch: Works everywhere
Use an alternative instead when:

- PyTorch Lightning: you need callbacks and high-level abstractions
- Ray Train: you need multi-node orchestration and hyperparameter tuning
- DeepSpeed: you want direct API control and advanced features
- Raw DDP: you want maximum control with minimal abstraction
### Common issues

#### Issue: Wrong device placement
Don't move batches to the device manually:
```python
# WRONG
batch = batch.to('cuda')

# CORRECT: Accelerate places batches automatically after prepare()
```
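For tensors you create outside the prepared dataloader, use the accelerator's device rather than a hard-coded `'cuda'`; a minimal sketch:

```python
# Hand-made tensors should target accelerator.device,
# which is correct on CUDA, MPS, CPU, and in multi-process runs
mask = torch.ones(8, dtype=torch.bool, device=accelerator.device)
```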
#### Issue: Gradient accumulation not working

Use the `accelerator.accumulate` context manager:
```python
# CORRECT
with accelerator.accumulate(model):
    loss = model(batch)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()  # after step, not before backward
```

#### Issue: Checkpointing in distributed training
Use accelerator methods:
```python
# save_state coordinates across processes itself - call it on every process
# (guarding it with is_main_process can deadlock DeepSpeed/FSDP collectives)
accelerator.save_state('checkpoint/')

# Load on all processes
accelerator.load_state('checkpoint/')
```
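For a final model export (as opposed to a resumable training checkpoint), the usual pattern is to unwrap first and write from the main process only; a minimal sketch:

```python
# Export the final weights once, from the main process
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
if accelerator.is_main_process:
    torch.save(unwrapped_model.state_dict(), 'model.pt')
```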
#### Issue: Different results with FSDP

Ensure the same random seed on every process:
```python
from accelerate.utils import set_seed

set_seed(42)
```

### Advanced topics
Megatron integration: See references/megatron-integration.md for tensor parallelism, pipeline parallelism, and sequence parallelism setup.
Custom plugins: See references/custom-plugins.md for creating custom distributed plugins and advanced configuration.
Performance tuning: See references/performance.md for profiling, memory optimization, and best practices.
### Hardware requirements

- CPU: Works (slow)
- Single GPU: Works
- Multi-GPU: DDP (default), DeepSpeed, or FSDP
- Multi-node: DDP, DeepSpeed, FSDP, Megatron
- TPU: Supported
- Apple MPS: Supported
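Because the backend is chosen automatically, it can be useful to inspect what Accelerate picked at runtime; a small sketch:

```python
# Inspect the backend Accelerate selected for this run
from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)            # e.g. cuda:0, mps, or cpu
print(accelerator.distributed_type)  # e.g. DistributedType.MULTI_GPU
```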
Launcher requirements:

- DDP: `torch.distributed.run` (built-in)
- DeepSpeed: `deepspeed` (pip install deepspeed)
- FSDP: PyTorch 1.12+ (built-in)
- Megatron: Custom setup
### Resources

- Docs: https://huggingface.co/docs/accelerate
- GitHub: https://github.com/huggingface/accelerate
- Version: 1.11.0+
- Tutorial: “Accelerate your scripts”
- Examples: https://github.com/huggingface/accelerate/tree/main/examples
- Used by: HuggingFace Transformers, TRL, PEFT, all HF libraries