Transformers documentation

DDP

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v5.12.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

DDP

DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.

                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚  training data  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚ shard 0          β”‚ shard 1          β”‚ shard 2
               β–Ό                  β–Ό                  β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚   model     β”‚    β”‚   model     β”‚    β”‚   model     β”‚
        β”‚  (copy 0)   β”‚    β”‚  (copy 1)   β”‚    β”‚  (copy 2)   β”‚
        β”‚   GPU 0     β”‚    β”‚   GPU 1     β”‚    β”‚   GPU 2     β”‚
        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
               β”‚ grads            β”‚ grads            β”‚ grads
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               all-reduce
                          (average gradients)
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β–Ό                  β–Ό                  β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  optimizer  β”‚    β”‚  optimizer  β”‚    β”‚  optimizer  β”‚
        β”‚    step     β”‚    β”‚    step     β”‚    β”‚    step     β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          (identical)        (identical)        (identical)

DDP activates automatically when you launch with a multi-process launcher like Accelerate.

# 4 GPUs on one machine
accelerate launch --num_processes 4 train.py

Configure DDP

Pass these TrainingArguments to control DDP behavior.

  • gradient_accumulation_steps() determines when to perform the all-reduce. Trainer skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with gradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.
  • ~TrainingArguments.ddp_find_unused_parameters traverses the autograd graph at the end of the forward pass for parameters that won’t receive a gradient and marks them as ready so they don’t block the all-reduce. Don’t use with gradient_checkpointing() because gradient checkpointing discards intermediate activations and recomputes them on the fly.
  • ~TrainingArguments.ddp_bucket_cap_mb is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
  • ~TrainingArguments.ddp_broadcast_buffers synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don’t use with gradient_checkpointing().
  • ~TrainingArguments.ddp_backend sets the communication backend. Use "nccl" for NVIDIA GPUs (default and fastest), "gloo" for CPU training or debugging, and "xccl", "hccl", or "cncl" for other hardware.
  • ddp_timeout() sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.
from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    gradient_accumulation_steps=4,
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=25,
    ddp_broadcast_buffers=True,
    ddp_timeout=1800,
)

Next steps

  • See FSDP for training models too large to fit on a single GPU.
  • See DeepSpeed for ZeRO optimization and offloading.
  • Read the Data Parallelism chapter from The Ultra-Scale Playbook for more information about how DDP works.
Update on GitHub