Single GPU Optimization

Resource: Hugging Face Doc

Method/tool                             | Improves training speed | Optimizes memory utilization
Batch size choice                       | Yes                     | Yes
Gradient accumulation                   | No                      | Yes
Gradient checkpointing                  | No                      | Yes
Mixed precision training                | Yes                     | (No)
Optimizer choice                        | Yes                     | Yes
Data preloading                         | Yes                     | No
DeepSpeed ZeRO                          | No                      | Yes
Parameter-Efficient Fine Tuning (PEFT)  | No                      | Yes


If your model doesn’t work well with mixed precision, for example if it wasn’t pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode.
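As a minimal sketch of what "explicitly disabling fp16" can look like, the fragment below builds a DeepSpeed config as a Python dict with the `fp16` section switched off. The variable name `ds_config` is an assumption made for illustration; only the `fp16.enabled` key comes from DeepSpeed's config format.

```python
import json

# Assumed variable name; only the "fp16" section matters for this point.
ds_config = {
    "fp16": {
        "enabled": False,  # train in full fp32 instead of fp16 mixed precision
    },
}

# Print the fragment as it would appear in a JSON config file.
print(json.dumps(ds_config, indent=2))
```

The dict (or the equivalent JSON file) would then be handed to the trainer through its DeepSpeed option.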


ZeRO stage 1: shards the optimizer state across GPUs

ZeRO stage 2: shards the optimizer state + gradients

ZeRO stage 3: shards the optimizer state + gradients + model parameters (weights)
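To get a feel for what each stage buys, here is a rough sketch of per-GPU memory for the model states only. It assumes the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for the fp32 optimizer state) and ignores activations and buffers. The function name and the 7.5B-parameter, 64-GPU example are illustrative assumptions, not measurements.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage):
    """Approximate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    Assumption: mixed-precision Adam, i.e. 2 bytes/param of fp16 weights,
    2 bytes/param of fp16 gradients and 12 bytes/param of optimizer state
    (fp32 weights, momentum, variance). Activations are not counted.
    """
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:   # stage 1 shards the optimizer state
        optim /= num_gpus
    if stage >= 2:   # stage 2 additionally shards the gradients
        grads /= num_gpus
    if stage >= 3:   # stage 3 additionally shards the parameters
        params /= num_gpus
    return (params + grads + optim) / 1e9


# Hypothetical example: 7.5B parameters on 64 GPUs.
for stage in range(4):
    print(stage, round(zero_memory_per_gpu_gb(7.5e9, 64, stage), 2))
```

Stage 0 here means plain data parallelism with nothing sharded, which is included only as a baseline for comparison.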

For inference, ZeRO-1 and ZeRO-2 do nothing; only ZeRO-3 is useful, because inference needs no optimizer and produces no gradients, so parameter sharding is the only part that still applies. Also, if Transformers<4.28, you need to set synced_gpus=True when generating.

Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the generate() method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven’t received the weight shard from the GPU that finished first.

For Transformers>=4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.
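The rule can be written down as a tiny helper. This is only a sketch: `needs_synced_gpus` is a hypothetical function, not part of Transformers, and the commented usage line assumes a `model`, `inputs` and `world_size` that already exist in your script. Only the `synced_gpus` argument of `generate()` comes from the library.

```python
def needs_synced_gpus(zero_stage: int, world_size: int) -> bool:
    """Hypothetical helper: True when all ranks must keep stepping together.

    With ZeRO-3 the weights are sharded, so every GPU has to keep taking part
    in the forward passes until the slowest one has finished generating.
    """
    return zero_stage == 3 and world_size > 1


# Usage sketch (not run here; `model`, `inputs` and `world_size` are assumed):
# outputs = model.generate(**inputs, synced_gpus=needs_synced_gpus(3, world_size))
print(needs_synced_gpus(3, 2))
```

On newer Transformers versions the explicit flag is redundant, but passing it does no harm and keeps older environments working.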

If optimizer offload is configured, you need to choose an optimizer that has both a CPU and a GPU implementation.

DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you don’t enable offload_optimizer. When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.
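A minimal sketch of a config with optimizer offload switched on, written as a Python dict. The ZeRO stage, the learning rate and the variable name `ds_config` are assumptions picked for illustration; the key names follow DeepSpeed's config format.

```python
import json

# Assumed values throughout; adjust stage and learning rate to your setup.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",      # keep the optimizer state in CPU memory
            "pin_memory": True,   # pinned memory speeds up CPU<->GPU copies
        },
    },
    # A DeepSpeed-side optimizer, so it is DeepSpeed that picks the
    # implementation matching the offload setting.
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-5},
    },
}

print(json.dumps(ds_config, indent=2))
```

Declaring the optimizer inside the DeepSpeed config sidesteps the question of whether a framework-side optimizer has the CPU implementation that offload needs.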

Multi-GPU Optimization

Resource: Hugging Face Doc

It mostly comes down to a few dimensions of parallelism: data, pipeline, and tensor.



Mixed precision training only speeds up training; it does not necessarily reduce the memory footprint, and can even increase the model's footprint on the GPU by a factor of 1.5.

While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU). From: Hugging Face Doc

Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a large model and a small batch size, the memory use will be larger.

Small models, large batch sizes

  • MEMORY SAVINGS: For small models, the weights take up relatively little memory. When training with large batch sizes, activations and their gradients become the dominant part of memory usage. Storing them in FP16 cuts that part roughly in half, since FP16 takes half the bytes of FP32. Even after accounting for the FP32 copies of the weights that must be kept, overall memory usage still goes down, because the savings on activations and gradients outweigh the overhead of the extra weight copies.
  • EFFICIENCY AND SAVINGS: Large batch sizes mean that more data is processed per iteration, which increases the amount of memory required to compute activations and gradients. Since this part now uses FP16, the overall memory requirement is reduced compared to using only FP32.

Large models, small batch sizes

  • INCREASED MEMORY USAGE: For large models, the weights themselves take up a lot of memory. In mixed-precision training, even if activations and gradients are stored as FP16, an FP32 copy of the weights still has to be kept so that updates stay accurate. That extra copy is a large share of the total footprint precisely because the model is large. When the batch size is small, activations and gradients take up relatively little memory, so the savings from FP16 are not enough to offset the memory added by the FP32 weight copies.
  • IMPACT OF WEIGHT COPIES: In this case, the memory used by the FP32 weight copies becomes the significant factor because the model itself is large. Even if FP16 is used for activations and gradients, total memory usage may still increase.
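The two cases above can be checked with back-of-the-envelope arithmetic. The sketch below counts only weights and activations: it assumes mixed precision keeps an FP32 copy plus an FP16 copy of the weights (6 bytes per parameter, the 1.5x figure) and halves activation memory. The function name and the example sizes are made up for illustration, and optimizer state is left out.

```python
def training_memory_gb(num_params, activation_bytes_fp32, mixed_precision):
    """Rough estimate in GB covering only weights and activations."""
    if mixed_precision:
        weight_bytes = num_params * (4 + 2)    # FP32 copy + FP16 working copy
        act_bytes = activation_bytes_fp32 / 2  # activations stored in FP16
    else:
        weight_bytes = num_params * 4          # FP32 weights only
        act_bytes = activation_bytes_fp32
    return (weight_bytes + act_bytes) / 1e9


# Hypothetical small model with a large batch: 100M params, 8 GB of FP32 activations.
small_fp32 = training_memory_gb(100e6, 8e9, mixed_precision=False)
small_mixed = training_memory_gb(100e6, 8e9, mixed_precision=True)

# Hypothetical large model with a small batch: 3B params, 2 GB of FP32 activations.
large_fp32 = training_memory_gb(3e9, 2e9, mixed_precision=False)
large_mixed = training_memory_gb(3e9, 2e9, mixed_precision=True)

print(small_fp32, small_mixed)  # mixed precision uses less memory here
print(large_fp32, large_mixed)  # mixed precision uses more memory here
```

Under these assumptions, mixed precision only saves memory once the FP32 activations take more than 4 bytes per parameter, which is why batch size decides the outcome.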

Gradient accumulation

Effective batch size = gradient_accumulation_steps * per_device_train_batch_size (per GPU).

training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)

In the above example, your effective batch size becomes 4.

import torch

# model, criterion, optimizer, dataloader and accumulation_steps are assumed to be defined
for epoch in range(2):  # assume 2 epochs of training
    for i, (inputs, labels) in enumerate(dataloader):
        inputs = torch.as_tensor(inputs, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.long)
        # Forward propagation
        outputs = model(inputs)
        # Divide by the number of accumulation steps so the summed gradient is an average
        loss = criterion(outputs, labels) / accumulation_steps
        # Backward propagation: gradients are added to the existing .grad buffers
        loss.backward()
        # Update model parameters every accumulation_steps batches (and on the last batch)
        if (i + 1) % accumulation_steps == 0 or (i + 1) == len(dataloader):
            optimizer.step()  # update parameters
            optimizer.zero_grad()  # clear the accumulated gradients

Backpropagation runs on every batch, but the optimizer updates the parameters only once the configured number of accumulation steps has been reached.
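Why dividing the loss by the number of accumulation steps works can be seen on a toy problem without any framework. The sketch below uses a one-parameter linear model with a mean squared error loss; the data, the starting weight and the function name are all made up for illustration. It compares the gradient of one full batch of 4 with the gradient accumulated over 4 micro-batches of size 1.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data and starting weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps

# Gradient computed on the whole batch at once.
full_grad = grad_mse(w, xs, ys)

# Gradient accumulated micro-batch by micro-batch, each scaled by 1/accumulation_steps.
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated)  # the two agree up to floating-point rounding
```

The match is exact only when every micro-batch has the same size; with a ragged last batch the accumulated gradient is a slightly different weighting of the samples.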

optimizer.step() vs optimizer.zero_grad()

Normal training flow: zero_grad, then step. Clearing the gradients first keeps the gradient computed on the previous batch from affecting the current update.

Gradient accumulation flow: step, then zero_grad. As the name suggests, the gradients from several batches are left to accumulate; the parameters are updated at the end, and only then are the gradients cleared.

First raise per_device_train_batch_size until the GPU memory limit is reached, and only then consider gradient accumulation.

While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let’s say, the per_device_train_batch_size=4 without gradient accumulation hits the GPU’s limit. If you would like to train with batches of size 64, do not set the per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources.
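The arithmetic in that example can be sketched as a small helper. `accumulation_plan` is a hypothetical function written for illustration, not a Transformers API; it takes the largest per-device batch size that fits on the GPU and derives the number of accumulation steps needed to reach the target batch size.

```python
def accumulation_plan(target_batch_size, max_per_device_batch_size, num_devices=1):
    """Return (per_device_train_batch_size, gradient_accumulation_steps).

    Hypothetical helper: fills the GPU first, then makes up the rest of the
    target batch size with gradient accumulation.
    """
    samples_per_step = max_per_device_batch_size * num_devices
    if target_batch_size % samples_per_step != 0:
        raise ValueError("target batch size must be a multiple of the per-step batch")
    return max_per_device_batch_size, target_batch_size // samples_per_step


per_device, accum_steps = accumulation_plan(target_batch_size=64,
                                            max_per_device_batch_size=4)
print(per_device, accum_steps)  # 4 16
```

The two returned values map directly onto per_device_train_batch_size and gradient_accumulation_steps.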

I am not sure whether gradient accumulation causes extra memory usage. I don't think it should: each gradient just changes from one value to another as more batches are accumulated, so the data format and the number of bytes it occupies stay the same. In practice it does seem to have an effect, but I can't find an explanation for it.


Hugging Face Doc