Single-GPU Optimization

Resource: Hugging Face Doc

Method/tool                              | Improves training speed | Optimizes memory utilization
Batch size choice                        | Yes                     | Yes
Gradient accumulation                    | No                      | Yes
Gradient checkpointing                   | No                      | Yes
Mixed precision training                 | Yes                     | (No)
Optimizer choice                         | Yes                     | Yes
Data preloading                          | Yes                     | No
DeepSpeed Zero                           | No                      | Yes
torch.compile                            | Yes                     | No
Parameter-Efficient Fine Tuning (PEFT)   | No                      | Yes

FP16

If your model doesn’t work well with mixed precision, for example if it wasn’t pretrained in mixed precision, you may encounter overflow or underflow issues which can cause NaN loss. For these cases, you should use full fp32 precision by explicitly disabling the default fp16 mode.
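
A minimal sketch with the Trainer API (output_dir is a placeholder; fp16 and bf16 are standard TrainingArguments flags): keeping both flags off leaves training in full fp32.

from transformers import TrainingArguments

# Mixed precision on: fp16=True (or bf16=True on hardware that supports bfloat16).
mixed_args = TrainingArguments(output_dir="out", fp16=True)

# Model unstable in half precision (NaN loss)? Stay in full fp32 by leaving
# both flags disabled.
fp32_args = TrainingArguments(output_dir="out", fp16=False, bf16=False)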

ZeRO

Stage 1: optimizer states

Stage 2: optimizer states + gradients

Stage 3: optimizer states + gradients + model parameters (weights)
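
A minimal sketch of how the stage is selected (illustrative values; many fields are usually set to "auto" so they follow the TrainingArguments): the DeepSpeed config is handed to the Trainer via the deepspeed argument.

from transformers import TrainingArguments

# Illustrative DeepSpeed config: zero_optimization.stage picks ZeRO-1/2/3.
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",        # placeholder
    deepspeed=ds_config,     # a dict or a path to a JSON file both work
)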

During inference, ZeRO-1 and ZeRO-2 are of little use; only ZeRO-3 makes sense (inference needs no optimizer and produces no gradients). Also, with Transformers < 4.28, you have to set synced_gpus=True when generating.

Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the generate() method. Otherwise, if one GPU is finished generating before another one, the whole system hangs because the remaining GPUs haven’t received the weight shard from the GPU that finished first.

For Transformers >= 4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.
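
A sketch of passing the flag explicitly (gpt2 is just a placeholder checkpoint; in a real run the script is launched with deepspeed across several GPUs and the model is loaded under a ZeRO-3 config, in which case torch.distributed is initialized and the flag evaluates to True):

import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    # Keep every rank inside generate() until all ranks finish; only meaningful
    # (and only valid) when torch.distributed is initialized, e.g. under a
    # multi-GPU deepspeed launch.
    synced_gpus=dist.is_initialized(),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))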

If offload is configured, you need an optimizer that has both a CPU and a GPU implementation; the commonly used Adam cannot be used as-is.

DeepSpeed and Transformers optimizer and scheduler can be mixed and matched as long as you don’t enable offload_optimizer. When offload_optimizer is enabled, you could use a non-DeepSpeed optimizer (except for LAMB) as long as it has both a CPU and GPU implementation.
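
A sketch of what enabling offload looks like in the config (illustrative values): with offload_optimizer pointing at the CPU, declaring the optimizer in the DeepSpeed config lets DeepSpeed substitute its own CPU-capable AdamW kernel rather than the plain torch implementation.

# Illustrative ZeRO-2 config with the optimizer state offloaded to CPU.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    # With offload enabled, the optimizer needs both CPU and GPU implementations.
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "weight_decay": "auto"},
    },
}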

Multi-GPU Optimization

Resource: Hugging Face Doc

It mainly comes down to parallelism along a few dimensions: Data + Pipeline + Tensor.

Common Points of Confusion

FP16

It only speeds up training; it does not necessarily reduce GPU memory usage (it may even use about 1.5x as much). A rough byte-count sketch follows the two cases below.

While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU). From: Hugging Face Doc

Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a large model and a small batch size, the memory use will be larger.

Small model, large batch size

  • Memory savings: for a small model, the parameters and weights take up relatively little memory. With a large batch size, activations and gradients dominate memory usage, and storing them in FP16 roughly halves that portion, since FP16 takes half the memory of FP32. Even after accounting for the extra FP32 copy of the weights, total memory still goes down, because the savings on activations and gradients outweigh the overhead of the extra weight copy.
  • Efficiency and savings: a large batch size means more data is processed per iteration, which increases the memory needed for activations and gradients. Since that portion is now in FP16, the total memory requirement is lower than with pure FP32.

Large model, small batch size

  • Increased memory usage: for a large model, the parameters and weights themselves already take a lot of memory. In mixed precision training, even though activations and gradients are stored in FP16, an FP32 copy of the weights must still be kept so that updates stay accurate. Relative to the model itself, this extra overhead (the FP32 weight copy) becomes a large share of total memory. With a small batch size, activations and gradients take relatively little memory, so the FP16 savings are not enough to offset the extra memory of the FP32 weight copy.
  • Impact of the weight copy: because the model is large, the extra memory used by the FP32 weight copy becomes a significant factor. Even with activations and gradients in FP16, total memory usage can still grow because of that FP32 copy.
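
A rough back-of-envelope sketch of the two cases (purely illustrative parameter and activation counts, optimizer states ignored): in mixed precision the weights cost about 4 + 2 = 6 bytes per parameter (fp32 master copy plus fp16 working copy, the 1.5x above), while activations drop from 4 to 2 bytes per element.

def weight_bytes(n_params, mixed):
    # fp32: 4 bytes/param; mixed precision keeps an fp32 master copy plus an
    # fp16 working copy: 4 + 2 = 6 bytes/param.
    return n_params * (6 if mixed else 4)

def activation_bytes(n_activations, mixed):
    # Activations (and gradients) are kept in fp16 instead of fp32.
    return n_activations * (2 if mixed else 4)

for name, n_params, n_acts in [
    ("small model, large batch", 10_000_000, 2_000_000_000),
    ("large model, small batch", 1_000_000_000, 50_000_000),
]:
    fp32 = weight_bytes(n_params, False) + activation_bytes(n_acts, False)
    mixed = weight_bytes(n_params, True) + activation_bytes(n_acts, True)
    print(f"{name}: fp32 {fp32 / 2**30:.2f} GiB vs mixed {mixed / 2**30:.2f} GiB")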

Gradient Accumulation

Effective batch size = gradient accumulation steps × actual (per-device) batch size.

training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)

In the above example, your effective batch size becomes 4.

# model, criterion, optimizer, dataloader and accumulation_steps are assumed
# to be defined elsewhere
for epoch in range(2):  # assume we train for 2 epochs
    for i, (inputs, labels) in enumerate(dataloader):
        inputs = inputs.float()
        labels = labels.long()
        # forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels) / accumulation_steps  # note: divided by the number of accumulation steps
        # backward pass
        loss.backward()
        # update the parameters once every accumulation_steps mini-batches
        if (i + 1) % accumulation_steps == 0 or (i + 1) == len(dataloader):
            optimizer.step()       # update parameters
            optimizer.zero_grad()  # clear accumulated gradients

Backprop still runs on every mini-batch, but the optimizer only updates the parameters once the accumulation step count is reached.

optimizer.step() vs optimizer.zero_grad()

Normal training loop: zero_grad before step, i.e. clear the gradients before each backward pass, so gradients computed from previous batches don't affect the current optimizer update.

Gradient accumulation loop: step before zero_grad; as the name suggests, the gradients are allowed to accumulate and the parameters are updated only at the end, after which the gradients are cleared.

First push per_device_train_batch_size to the GPU's limit, and only then consider accumulation.

While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let’s say, the per_device_train_batch_size=4 without gradient accumulation hits the GPU’s limit. If you would like to train with batches of size 64, do not set the per_device_train_batch_size to 1 and gradient_accumulation_steps to 64. Instead, keep per_device_train_batch_size=4 and set gradient_accumulation_steps=16. This results in the same effective batch size while making better use of the available GPU resources.

Whether gradient accumulation itself causes extra GPU memory usage is not clear to me. My intuition is that it doesn't: the gradient is just a number that, after several accumulations, becomes another number; its data format doesn't change, so the number of bytes it occupies shouldn't change either. In practice there does seem to be some effect, but I haven't found a reason that explains it convincingly.
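
One way to settle it for a given setup is to measure peak memory directly; a minimal sketch on a toy model (assumes a CUDA GPU is available, sizes are arbitrary):

import torch
import torch.nn as nn

def peak_mem_mib(accumulation_steps, micro_batch, steps=8):
    """Peak CUDA memory (MiB) after a few training steps with the given accumulation."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for i in range(steps):
        x = torch.randn(micro_batch, 1024, device="cuda")
        y = torch.randint(0, 10, (micro_batch,), device="cuda")
        loss = criterion(model(x), y) / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    return torch.cuda.max_memory_allocated() / 2**20

# Same effective batch size (32) with and without accumulation.
print("no accumulation    :", peak_mem_mib(accumulation_steps=1, micro_batch=32))
print("4-step accumulation:", peak_mem_mib(accumulation_steps=4, micro_batch=8))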

Model Inputs

For data processing, the main APIs Hugging Face provides are the various DataCollator classes. A data collator is basically just a preprocessing function, and you can also write a similar function yourself.

  • Token classification

    This includes common tasks such as NER. input_ids are the token ids after tokenization, and labels has the same length as input_ids, with special tokens marked as -100 (the label value that cross-entropy ignores).

    def align_labels_with_tokens(labels, word_ids):
        new_labels = []
        current_word = None
        for word_id in word_ids:
            if word_id != current_word:
                # Start of a new word!
                current_word = word_id
                label = -100 if word_id is None else labels[word_id]
                new_labels.append(label)
            elif word_id is None:
                # Special token
                new_labels.append(-100)
            else:
                # Same word as previous token
                label = labels[word_id]
                # If the label is B-XXX we change it to I-XXX
                if label % 2 == 1:
                    label += 1
                new_labels.append(label)
    
        return new_labels
    
    ###### OUTPUT ######
    [3, 0, 7, 0, 0, 0, 7, 0, 0]
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
    

    To make input_ids and labels the same length, you can either handle it manually or use DataCollatorForTokenClassification.

    from transformers import DataCollatorForTokenClassification
    
    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
    batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
    batch["labels"]
    
    ###### OUTPUT ######
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
    [-100, 1, 2, -100]
    ###### VS ######
    tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
            [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])
    
  • mask

    To continue pretraining a masked language model, input_ids are produced by randomly masking the original text and then tokenizing (DataCollatorForLanguageModeling can handle this step), and labels is -100 everywhere except at the positions covered by [MASK], which keep the original token ids.

    import numpy as np

    # feature (a dict with "input_ids" and "labels"), mask (a boolean array over
    # words), mapping (word index -> token indices) and tokenizer are assumed to
    # be defined; this snippet applies whole word masking.
    # result["labels"] = result["input_ids"].copy()
    input_ids = feature["input_ids"]
    labels = feature["labels"]
    new_labels = [-100] * len(labels)
    for word_id in np.where(mask)[0]:
        word_id = word_id.item()
        for idx in mapping[word_id]:
            new_labels[idx] = labels[idx]
            input_ids[idx] = tokenizer.mask_token_id
    feature["labels"] = new_labels
    
  • causal

    Seq2Seq-style tasks, including chat, translation, etc. Use DataCollatorForSeq2Seq.
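
    A minimal sketch of wiring it up (t5-small is just a placeholder checkpoint): the collator pads input_ids and labels per batch and fills the padded label positions with -100 so the loss ignores them.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq

    tokenizer = AutoTokenizer.from_pretrained("t5-small")        # placeholder
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    # Passing the model lets the collator also prepare decoder_input_ids from labels.
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)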

  • from scratch

    For pretraining from scratch or continued pretraining, the inputs are also used as the labels (just shifted by one position), and this shifted data is produced on the fly during training, so we don't need to copy input_ids ourselves; instead, use DataCollatorForLanguageModeling.

    from transformers import DataCollatorForLanguageModeling
    # mlm defaults to True (masked LM); mlm=False gives causal LM labels
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
    
  • other

References

Hugging Face Doc

Hugging Face NLP Course