Accelerate

Accelerate provides a unified interface for distributed training backends like FSDP or DeepSpeed. It detects your environment (number of GPUs, distributed backend, mixed precision, etc.) and automatically configures training, whether you’re on a single GPU or on 8 GPUs with FSDP.

Accelerate wraps the model in the appropriate distributed wrapper, moves it to the correct device, and creates a compatible optimizer. During training, Accelerate uses its own backward method to handle gradient scaling for mixed precision. Trainer calls the appropriate Accelerate APIs and delegates all distributed mechanics to Accelerate.

Configure Accelerate for Trainer with either an Accelerate config file or TrainingArguments.

Accelerate config file

Run the accelerate config command and answer the questions about your hardware and training setup. This creates a default_config.yaml file in the Accelerate cache directory (~/.cache/huggingface/accelerate/ by default). The example below is for FSDP.

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  fsdp_reshard_after_forward: true
  fsdp_cpu_offload: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_activation_checkpointing: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_machines: 1
num_processes: 4

Run accelerate launch with a Trainer-based script, and Accelerate reads the config file to set up training. In this case, the fsdp_config and deepspeed arguments of TrainingArguments are unnecessary because the Accelerate config file covers the same settings.

accelerate launch train.py
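
If the config file lives somewhere other than the default cache location, accelerate launch accepts a --config_file flag that points to it. The sketch below assembles such a command in Python, for example to start a run from another script. The build_launch_command helper and the fsdp_config.yaml filename are hypothetical names used only for illustration.

```python
import shlex


def build_launch_command(script, config_file=None, script_args=()):
    """Return the argument list for an `accelerate launch` invocation."""
    command = ["accelerate", "launch"]
    if config_file is not None:
        # Point Accelerate at a specific config instead of the default one.
        command += ["--config_file", str(config_file)]
    command.append(script)
    command += list(script_args)
    return command


# "fsdp_config.yaml" is a placeholder path for this sketch.
command = build_launch_command("train.py", config_file="fsdp_config.yaml")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Accelerate is installed.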

The accelerator_config argument accepts settings that don’t have dedicated top-level arguments. For example, set non_blocking=True together with dataloader_pin_memory=True to overlap data transfer with compute for higher GPU throughput.

from transformers import TrainingArguments

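TrainingArguments(
    ...,  # other arguments elided
    dataloader_pin_memory=True,  # keep batches in pinned (page-locked) host memory
    accelerator_config={
        "non_blocking": True,  # allow asynchronous host-to-device copies
    },
)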
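
To see why the two settings go together, the sketch below shows the equivalent plain PyTorch pattern: a DataLoader with pin_memory=True produces batches in pinned host memory, and .to(device, non_blocking=True) can then copy them to the GPU asynchronously. The random dataset and linear model are placeholders, and the code falls back to CPU (where the settings have no effect) when no GPU is present. It only illustrates the idea and is not Trainer's internal code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Placeholder data and model (assumptions for illustration).
dataset = TensorDataset(torch.randn(32, 8))
model = torch.nn.Linear(8, 2).to(device)

# Pinned host memory is what makes an asynchronous host-to-GPU copy possible,
# so pinning is only requested when a GPU is actually available.
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)

outputs = []
with torch.no_grad():
    for (batch,) in loader:
        # With a pinned source tensor, this copy can overlap with GPU compute.
        batch = batch.to(device, non_blocking=True)
        outputs.append(model(batch))
```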

TrainingArguments

Pass a backend-specific config to TrainingArguments. Trainer’s create_accelerator_and_postprocess() method reads these settings and configures training accordingly.

FSDP

DeepSpeed

DDP
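
A minimal sketch of what a backend-specific configuration might look like for each of the three backends is shown below. The fsdp, fsdp_config, deepspeed, bf16, and ddp_find_unused_parameters arguments exist on TrainingArguments, but the accepted values and fsdp_config keys can differ between Transformers versions, so treat the specific values as assumptions. BACKEND_KWARGS, make_training_args, and the ds_config.json path are hypothetical names for this sketch.

```python
# Keyword arguments for TrainingArguments, one set per backend.
BACKEND_KWARGS = {
    "fsdp": {
        # Shard parameters, gradients, and optimizer states; wrap per layer.
        "fsdp": "full_shard auto_wrap",
        "fsdp_config": {"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},
        "bf16": True,
    },
    "deepspeed": {
        # Placeholder path to a DeepSpeed JSON config; its precision
        # settings need to agree with bf16 here.
        "deepspeed": "ds_config.json",
        "bf16": True,
    },
    "ddp": {
        # DDP is selected by launching several processes; this flag only
        # tunes the wrapper.
        "ddp_find_unused_parameters": False,
    },
}


def make_training_args(backend, output_dir="out"):
    """Build TrainingArguments for one backend (call inside a launched job)."""
    # Imported lazily so the dictionary above can be inspected without
    # Transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **BACKEND_KWARGS[backend])
```

In a training script started with accelerate launch or torchrun, make_training_args("fsdp") would return the arguments to pass to Trainer.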

Next steps

See DDP for data-parallel training when your model fits on one GPU.
See FSDP for sharding parameters, gradients, and optimizer states across GPUs.
See DeepSpeed for ZeRO optimization and offloading.
