Single GPU Optimization
Resource: Hugging Face Doc
Method/tool | Improves training speed | Optimizes memory utilization |
---|---|---|
Batch size choice | Yes | Yes |
Gradient accumulation | No | Yes |
Gradient checkpointing | No | Yes |
Mixed precision training | Yes | (No) |
Optimizer choice | Yes | Yes |
Data preloading | Yes | No |
DeepSpeed Zero | No | Yes |
torch.compile | Yes | No |
Parameter-Efficient Fine Tuning (PEFT) | No | Yes |
FP16
If your model doesn’t work well with mixed precision, for example if it wasn’t pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode.
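A minimal sketch of toggling this, assuming the Hugging Face Trainer API (the output path is illustrative):

```python
from transformers import TrainingArguments

# fp16=True turns on mixed precision; keep it False to stay in full fp32
# if the model produces NaN losses under fp16. bf16=True is an alternative
# on hardware that supports bfloat16.
training_args = TrainingArguments(
    output_dir="out",  # illustrative path
    fp16=False,        # explicitly keep full fp32 precision
)
```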
ZeRO
- Stage 1: partitions the optimizer states
- Stage 2: partitions the optimizer states + gradients
- Stage 3: partitions the optimizer states + gradients + model parameters (weights)
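A minimal sketch of picking a stage through the Trainer, assuming the deepspeed argument of TrainingArguments accepts an inline config dict (the "auto" placeholders and output path are illustrative):

```python
from transformers import TrainingArguments

# Illustrative inline DeepSpeed config; "stage" selects ZeRO-1/2/3.
ds_config = {
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(output_dir="out", deepspeed=ds_config)
```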
For inference, ZeRO-1 and ZeRO-2 bring nothing useful; only ZeRO-3 is worth configuring (inference needs no optimizer states and produces no gradients). Also, if Transformers < 4.28, you need to set synced_gpus=True when generating.
Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the generate() method. Otherwise, if one GPU finishes generating before another, the whole system hangs because the remaining GPUs never receive the weight shard from the GPU that finished first. For Transformers >= 4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.
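A sketch of the corresponding generation call (model and inputs are assumed to be an already-loaded ZeRO-3 model and a tokenized batch; the token count is illustrative):

```python
# Only needed explicitly under ZeRO-3 with multiple GPUs and Transformers < 4.28.
outputs = model.generate(
    **inputs,
    max_new_tokens=64,   # illustrative value
    synced_gpus=True,    # keep all ranks in the generation loop until everyone is done
)
```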
If offload is configured, you need to choose an optimizer that has both a CPU and a GPU implementation, so the usual GPU-only (fused) Adam cannot be used.
DeepSpeed and Transformers optimizers and schedulers can be mixed and matched as long as you don't enable offload_optimizer. When offload_optimizer is enabled, you can use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.
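A sketch of what the offload portion of a DeepSpeed config might look like, assuming the standard zero_optimization / optimizer keys (all values shown are illustrative):

```python
# Illustrative DeepSpeed config fragment: ZeRO-2 with optimizer-state offload to CPU.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    # DeepSpeed's AdamW; with CPU offload, DeepSpeed typically swaps in its CPU Adam implementation.
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
}
```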
Multi-GPU Optimization
Resource: Hugging Face Doc
It mostly comes down to a few dimensions of parallelism: Data + Pipeline + Tensor.
Confusion
FP16
FP16 mainly speeds up training; it does not necessarily reduce the memory footprint (it can even increase it, to roughly 1.5x).
While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU). From: Hugging Face Doc
Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a large model and a small batch size, the memory use will be larger.
Small models, large batch sizes
- MEMORY SAVINGS: For small models, the parameters and weights take up relatively little memory, and with large batch sizes the activations and gradients become the dominant memory consumers. Casting activations and gradients to FP16 roughly halves that dominant portion, since FP16 uses half the bytes of FP32. Even after accounting for the FP32 master copy of the weights, overall memory usage still drops, because the activation/gradient savings outweigh the overhead of the extra weight copy.
- EFFICIENCY AND SAVINGS: Large batch sizes mean more data is processed per iteration, which inflates the memory needed for activations and gradients. Since this part is now stored in FP16, the overall memory requirement is lower than with pure FP32.
Large model, small batch size
- Increased memory usage: For large models, the parameters and weights themselves take up a lot of memory. In mixed-precision training, even though activations and gradients are stored in FP16, an FP32 copy of the weights must still be kept to ensure accurate updates. This extra copy is a relatively large share of the total footprint because of the model's size. When the batch size is small, activations and gradients occupy relatively little memory, so the FP16 savings are not enough to offset the extra memory added by the FP32 weight copy.
- Impact of the weight copy: In this regime, the additional memory used by the FP32 weight copy becomes a significant factor precisely because the model itself is large. Even with FP16 activations and gradients, total memory usage may still increase because of the FP32 weight copy.
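A back-of-the-envelope sketch of the weight-memory part only (ignoring optimizer states, activations, and framework overhead; the parameter count is illustrative):

```python
def weight_memory_gb(n_params: int, mixed_precision: bool) -> float:
    """Rough weight-only memory estimate in GB."""
    fp32_bytes = 4 * n_params                             # full-precision weights (or master copy)
    fp16_bytes = 2 * n_params if mixed_precision else 0   # extra half-precision working copy
    return (fp32_bytes + fp16_bytes) / 1e9

# A 7B-parameter model: weights alone go from ~28 GB (fp32) to ~42 GB (fp16 + fp32 copy),
# i.e. ~1.5x, before any savings on activations/gradients are counted.
print(weight_memory_gb(7_000_000_000, mixed_precision=False))  # ~28.0
print(weight_memory_gb(7_000_000_000, mixed_precision=True))   # ~42.0
```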
Gradient accumulation
Effective batch size = gradient accumulation steps * actual (per-device) batch size.
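A minimal sketch, assuming the Hugging Face TrainingArguments API (the specific values are illustrative, chosen to reproduce the effective batch size mentioned next):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                # illustrative path
    per_device_train_batch_size=1,   # actual batch that fits on the GPU
    gradient_accumulation_steps=4,   # accumulate 4 micro-batches before each optimizer step
)
```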
In the above example, your effective batch size becomes 4.
Backpropagation still runs for every micro-batch, but the optimizer only updates the parameters once the configured number of accumulation steps has been reached.
optimizer.step() vs optimizer.zero_grad()
Normal training flow: zero_grad, then step. This prevents gradients left over from the previous batch from leaking into the current update.
Gradient accumulation flow: step, then zero_grad. As the name suggests, the gradients are allowed to accumulate across several batches, the parameters are updated at the end, and only then are the accumulated gradients cleared. A manual version of this loop is sketched below.
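A minimal sketch of the manual loop (the toy model, data, and accumulation count are all illustrative):

```python
import torch
from torch import nn

# Toy setup: a tiny linear model and random micro-batches.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accumulation_steps = 4  # optimizer steps happen every 4 micro-batches

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so the accumulated sum matches one big batch
    loss.backward()                                   # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # update only after the accumulation target is reached
        optimizer.zero_grad()   # then clear the accumulated gradients
```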
Prioritize a per_device_train_batch_size that actually saturates the GPU before reaching for gradient accumulation.
While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example: say per_device_train_batch_size=4 without gradient accumulation already hits the GPU's limit. If you want to train with an effective batch size of 64, do not set per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources.
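A sketch of the recommended configuration from this example (assuming the TrainingArguments API; the output path is illustrative):

```python
from transformers import TrainingArguments

# Effective batch size 64 = 4 x 16, while keeping the GPU saturated.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
)
```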
It is unclear whether gradient accumulation itself adds memory overhead. My intuition is that it shouldn't: the accumulated gradient is still a single tensor whose values are the sum of several micro-batch gradients, so its dtype and byte size don't change. In practice it sometimes appears to make a difference, but I can't find a convincing explanation.