Chinese Version
Understanding the Extra GPU Memory Required by the Adam Optimizer: An Analysis of Model Parameter and Optimizer-State Memory Costs
During deep learning training, GPU memory is a critical resource, especially for large language models and deep neural networks. Training memory is needed not only for the model parameters themselves but also for storing gradients and the optimizer state. When training with Adam or AdamW, the optimizer state adds a significant extra memory cost that is worth analyzing separately.
This post focuses on the memory needed for the model parameters and the additional memory needed by the optimizer state, and shows how to estimate each part separately so you can better understand the real memory footprint of training.
1. Memory Requirements: Training vs. Inference
1.1 Inference Memory
During inference, the memory requirement is relatively simple and mainly consists of:
- Model parameters: the values of every parameter in the network (for example weights and biases).
- Activations: the intermediate outputs of each layer stored during the forward pass.
For most models, inference memory is dominated by these two parts, with the parameter memory as the baseline. Roughly:
\text{Inference Memory} = \text{Model Parameters Memory} + \text{Activations Memory} \ (\text{the latter is usually negligible})
If inference processes only one short input at a time (for example a single sentence or token), the activation memory is very small.
1.2 Training Memory
Compared with inference, training memory is more complex. Besides the model parameters and activations, it must also cover:
- Gradient storage: every backward pass computes and stores one gradient per parameter. These gradients drive the weight updates, and storing them costs additional memory.
- Optimizer state: optimizers such as Adam keep, for every parameter, a momentum term (first moment) and a running average of squared gradients (second moment), which noticeably increases the memory requirement.
The training memory can therefore be estimated as:
\text{Training Memory} = \text{Model Parameters Memory} + \text{Gradients Memory} + \text{Optimizer States Memory}
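To make the formula concrete, here is a minimal sketch (an illustrative helper, not code from the original post) that estimates the three components from a parameter count and the number of bytes per value:

def estimate_training_memory_gb(num_params, bytes_per_value=4, optimizer_multiplier=2, gb=1000**3):
    """Rough estimate of training memory: parameters + gradients + optimizer states.

    optimizer_multiplier is about 2 for Adam/AdamW (first and second moments)
    and about 1 for SGD with momentum (a single momentum buffer).
    """
    params = num_params * bytes_per_value / gb
    grads = params                              # one gradient per parameter
    opt_states = optimizer_multiplier * params  # extra per-parameter state
    return params, grads, opt_states, params + grads + opt_states

# Example: 7 billion parameters in float32, 1000-based GB
print(estimate_training_memory_gb(7 * 10**9, bytes_per_value=4))  # (28.0, 28.0, 56.0, 112.0)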
2. The Extra Memory Cost of Training
Training needs more memory than inference mainly because of gradient storage and optimizer state. Specifically:
- Gradient storage: each parameter has one corresponding gradient of the same size, so the gradient memory is roughly equal to the parameter memory.
- Optimizer state storage: the Adam optimizer keeps a first moment and a second moment (momentum and a running average of squared gradients) for every parameter, so its state takes about 2 times the parameter memory; together with the parameter itself, three values are stored per parameter.
2.1 Estimating the Parts Separately
When estimating training memory, it helps to account for the model parameters, the gradients, and the optimizer state separately. Roughly:
\text{Training Memory} \approx 1\times(\text{Model Parameters Memory}) + 1\times(\text{Gradients Memory}) + 1\text{-}2\times(\text{Optimizer States Memory})
This means training typically needs about 4 times the inference memory, where:
- 1x is the memory for the model parameters themselves.
- 1x is the memory for the gradients (the same size as the parameters).
- 1-2x is the memory for the optimizer state. For Adam this extra cost is about 2 times the parameter memory, because Adam stores a first and a second moment (two state values per parameter).
2.2 Memory Consumption of the Adam Optimizer
Compared with a plain optimizer such as SGD, Adam's extra memory comes from two quantities:
- First moment (m): the momentum term, an exponentially weighted average of each parameter's gradient.
- Second moment (v): an exponentially weighted average of the squared gradients, similar to RMSProp.
Each parameter is therefore associated with three stored values: the parameter itself, the first moment m, and the second moment v. The optimizer state alone thus requires:
\text{Optimizer States Memory} = 2 \times \text{Model Parameters Memory}
AdamW differs from Adam only in how weight decay is applied; its memory footprint is essentially the same, so its extra memory cost is also about 2 times the parameter memory.
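The two moments can be inspected directly in PyTorch. The short sketch below (assuming PyTorch is installed; the layer size is arbitrary) prints the per-parameter state that torch.optim.Adam keeps after its first step:

import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 1024)).sum()
loss.backward()
optimizer.step()  # Adam allocates its state lazily, on the first step

param = next(model.parameters())
state = optimizer.state[param]
print(sorted(state.keys()))                                # ['exp_avg', 'exp_avg_sq', 'step']
print(state["exp_avg"].shape, state["exp_avg_sq"].shape)   # both match the parameter's shape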
2.3 Memory Requirements at Different Precisions
Training memory also depends on the numerical precision of the data type. The common choices are float32 and bfloat16, which differ as follows (see the quick check below):
- float32 (32-bit floating point): each parameter, gradient, and optimizer-state value takes 4 bytes, so the memory requirement is relatively high.
- bfloat16 (16-bit floating point): each parameter, gradient, and optimizer-state value takes only 2 bytes, so the memory requirement is lower.
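A quick way to confirm the per-value sizes (a small check, assuming PyTorch is available):

import torch

# Bytes per value for the two precisions discussed above.
for dtype in (torch.float32, torch.bfloat16):
    print(dtype, torch.tensor([], dtype=dtype).element_size(), "bytes per value")
# torch.float32 4 bytes per value
# torch.bfloat16 2 bytes per value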
3. A Concrete Example: LLaMA-2 7B
Assume the LLaMA-2 7B model has 7 billion parameters. We first compute the training memory in float32.
3.1 Model Parameter Memory
Each parameter takes 4 bytes, and there are 7 billion parameters, so:
\text{Model Parameters Memory} = 7 \times 10^9 \times 4 \, \text{bytes} = 28 \, \text{GB} \ (\text{using 1000-based units})
3.2 Gradient Memory
The gradient memory equals the parameter memory, because every parameter has one corresponding gradient:
\text{Gradients Memory} = 28 \, \text{GB}
3.3 Optimizer State Memory
The Adam optimizer stores a first and a second moment, so its state takes 2 times the parameter memory:
\text{Optimizer States Memory} = 2 \times 28 \, \text{GB} = 56 \, \text{GB}
3.4 Total Training Memory
Adding up the parameter, gradient, and optimizer-state memory, the training memory is:
\text{Training Memory} = 28 \, \text{GB} + 28 \, \text{GB} + 56 \, \text{GB} = 112 \, \text{GB}
The detailed calculation is as follows.
The Python code below computes the memory taken by the 7B model's parameters in float32. With 1024-based units the result is 26.08 GB; with 1000-based units it is 28.00 GB. The 1000-based convention is tried here because the 28 GB figure appears in other blog posts, and re-running the calculation shows that it comes from using 1000 as the base. Reference: https://cuiyuhao.com/posts/c87c0f5d/
# Define the parameters
num_parameters = 7 * 10**9  # 7 billion parameters
bytes_per_param = 4  # 4 bytes per parameter (32-bit float)
# Compute the memory requirement in bytes
memory_in_bytes = num_parameters * bytes_per_param
# Convert bytes to GB (use 1000 ** 3 for 1000-based units)
memory_in_GB = memory_in_bytes / (1024 ** 3)
print(f"Model parameter memory: {memory_in_GB:.2f} GB")
In bfloat16, each value takes half as much memory as float32, so the parameter memory is 26.08 GB / 2 = 13.04 GB with 1024-based units, or 28 GB / 2 = 14 GB with 1000-based units.
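To tie the two unit conventions and the two precisions together, the following sketch (an extension of the snippet above, not from the original post) prints the full parameters + gradients + optimizer-state breakdown for each combination:

num_parameters = 7 * 10**9

for precision, bytes_per_param in (("float32", 4), ("bfloat16", 2)):
    for base_name, base in (("1000-based GB", 1000**3), ("1024-based GB", 1024**3)):
        params = num_parameters * bytes_per_param / base
        grads = params              # one gradient per parameter
        opt_states = 2 * params     # Adam: first and second moments
        total = params + grads + opt_states
        print(f"{precision}, {base_name}: "
              f"{params:.2f} + {grads:.2f} + {opt_states:.2f} = {total:.2f}")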
4. Summary
When training a deep learning model, gradient storage and the optimizer state (for example with Adam) add substantially to the parameter memory. Training typically needs about 4 times the inference memory, where:
- 1x stores the model parameters.
- 1x stores the gradients.
- 1-2x stores the optimizer state.
Understanding these requirements is essential for choosing suitable hardware and planning the training run. If memory is limited, mixed-precision training (for example with bfloat16) or techniques such as gradient accumulation can help reduce the memory footprint.
5. Memory Consumption with bfloat16
Compared with float32, training large models in bfloat16 (16-bit floating point) significantly reduces memory consumption. The main advantage of bfloat16 is that it halves memory usage relative to float32 while retaining sufficient numerical representation, which makes it well suited to training deep neural networks.
5.1 Effect of bfloat16 on Model Parameters
With bfloat16, each parameter takes 2 bytes instead of float32's 4 bytes. For LLaMA-2 7B, which has 7 billion parameters:
- Model parameter memory (bfloat16):
\text{Model Parameters Memory} = 7 \times 10^9 \times 2 \, \text{bytes} = 14 \, \text{GB}
So the parameter memory drops to 14 GB.
5.2 Effect of bfloat16 on Gradients
Gradients are computed during backpropagation and used to update the parameters. Their memory footprint matches the parameters, so storing gradients in bfloat16 also halves their memory:
- Gradient memory (bfloat16):
\text{Gradients Memory} = 14 \, \text{GB}
5.3 Effect of bfloat16 on Optimizer State
The Adam optimizer state consists of two parts per parameter: the first moment (m) and the second moment (v), so it takes twice the parameter memory.
In bfloat16 these state tensors are also half the size of their float32 counterparts, so the optimizer state memory becomes:
- Optimizer state memory (bfloat16):
\text{Optimizer States Memory} = 2 \times 14 \, \text{GB} = 28 \, \text{GB}
5.4 Total Training Memory
With bfloat16 applied to the parameters, gradients, and optimizer state, the total memory consumption drops sharply. For LLaMA-2 7B, training in bfloat16 requires:
\text{Training Memory} = 14 \, \text{GB} \ (\text{parameters}) + 14 \, \text{GB} \ (\text{gradients}) + 28 \, \text{GB} \ (\text{optimizer states})
So training in bfloat16 requires 56 GB.
5.5 Summary
- float32: training LLaMA-2 7B needs about 112 GB of memory.
- bfloat16: training LLaMA-2 7B needs about 56 GB of memory.
Using bfloat16 cuts the training memory substantially, which for large models lowers hardware cost and can make it possible to train models that would not otherwise fit on a given GPU. When memory is limited, mixed-precision training (for example with bfloat16) is therefore an effective way to improve training efficiency.
In short, choosing an appropriate numerical precision not only reduces memory consumption but can also speed up training; for large-scale models, the choice of precision directly affects both the performance and the feasibility of training.
English Version
Understanding the Additional Memory Requirements of Adam Optimizer: Memory Consumption Breakdown for Model Parameters and Optimizer States
In deep learning model training, memory (or VRAM) is a critical resource, especially when dealing with large models like large language models or deep neural networks. The memory required for training is not only for storing the model parameters but also for the gradients and the additional memory consumed by the optimizer state. For optimizers like Adam or AdamW, the optimizer state introduces significant additional memory consumption, which is often overlooked when estimating memory requirements.
This blog post will focus on the memory overhead introduced by the optimizer and how to break down the memory requirements separately for model parameters and optimizer states. Understanding this is important for efficiently managing memory during training, especially for beginners who may not fully realize the impact of optimizers like Adam on memory consumption.
1. Memory Requirements for Inference vs. Training
1.1 Inference Memory Requirements
During inference (i.e., when you are just running the model and making predictions), the memory requirements are relatively straightforward and mainly include:
- Model Parameters: Storing the model’s weights and biases.
- Activations: The intermediate outputs produced during the forward pass that are required for computing the final result.
In most cases, the memory for inference is primarily used by these two components, and the memory formula can be simplified as:
\text{Inference Memory} = \text{Model Parameters Memory} + \text{Activations Memory}
1.2 Training Memory Requirements
In contrast to inference, training requires more memory due to the following additional components:
- Gradients: During backpropagation, the gradients for each parameter need to be computed and stored. The gradients are necessary for updating the model parameters, and they take up memory equivalent to the size of the model parameters.
- Optimizer State: Optimizers like Adam require additional memory to store per-parameter information such as the first moment (momentum) and the second moment (squared gradients). This memory overhead can be substantial, especially for large models.
Thus, the memory formula for training is:
\text{Training Memory} = \text{Model Parameters Memory} + \text{Gradients Memory} + \text{Optimizer States Memory}
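For a rough empirical sanity check of this formula, the sketch below (assuming PyTorch and a CUDA GPU are available; the layer size is arbitrary) compares the predicted footprint of parameters, gradients, and Adam states with the measured peak allocation. Activations add a small amount on top, so the two numbers will only roughly agree:

import torch

device = "cuda"
model = torch.nn.Linear(4096, 4096, bias=False).to(device)   # ~16.8M float32 parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

torch.cuda.reset_peak_memory_stats(device)
loss = model(torch.randn(8, 4096, device=device)).sum()
loss.backward()
optimizer.step()

n_params = sum(p.numel() for p in model.parameters())
predicted = 4 * n_params * 4 / 1e6   # params + grads + 2 Adam states, 4 bytes each, in MB
measured = torch.cuda.max_memory_allocated(device) / 1e6
print(f"predicted ~ {predicted:.1f} MB, measured peak ~ {measured:.1f} MB")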
2. The Additional Memory Overhead for Optimizer States
Training memory usage is higher than inference memory due to the extra storage required for gradients and optimizer states. Specifically:
- Gradients: The memory required for storing gradients is the same as the memory for storing model parameters, because every parameter has a corresponding gradient.
- Optimizer State: For the Adam optimizer, each parameter requires storage for both the first and second moments (running averages of the gradients and of the squared gradients), which adds roughly 2 times the model parameters memory.
2.1 Breaking Down the Memory Calculation
When estimating the total memory required for training, it is useful to break down the contributions from the model parameters, gradients, and optimizer state. The total memory for training is approximately:
\text{Training Memory} \approx 1\times(\text{Model Parameters Memory}) + 1\times(\text{Gradients Memory}) + 1\text{-}2\times(\text{Optimizer States Memory})
This means the memory for training is typically 4 times the memory required for inference:
- 1x for the model parameters.
- 1x for the gradients (same memory as model parameters).
- 1-2x for the optimizer states. For the Adam optimizer, the overhead is typically 2 times the model memory because it stores both the first and second moments.
2.2 Memory Consumption of Adam Optimizer
The Adam optimizer consumes extra memory because it needs to store both the first moment (m) and the second moment (v) for each parameter. Therefore, the memory consumed by the optimizer is:
\text{Optimizer States Memory} = 2 \times \text{Model Parameters Memory}
Similarly, the AdamW optimizer, which differs from Adam only in how weight decay is applied, has almost the same memory consumption as Adam, so its extra memory overhead is also about 2 times the model parameters memory.
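This can be checked directly in PyTorch: after one step, torch.optim.Adam and torch.optim.AdamW hold the same per-parameter state tensors (exp_avg and exp_avg_sq). A small sketch, assuming PyTorch is installed and using an arbitrary toy layer:

import torch

for opt_cls in (torch.optim.Adam, torch.optim.AdamW):
    model = torch.nn.Linear(256, 256)
    optimizer = opt_cls(model.parameters(), lr=1e-3)
    model(torch.randn(4, 256)).sum().backward()
    optimizer.step()

    param = next(model.parameters())
    state = optimizer.state[param]
    # Sum the bytes of the per-parameter state tensors (skip the scalar step counter).
    extra_bytes = sum(t.numel() * t.element_size()
                      for t in state.values() if torch.is_tensor(t) and t.dim() > 0)
    print(opt_cls.__name__, sorted(state.keys()), f"extra state: {extra_bytes} bytes")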
2.3 Impact of Precision on Memory Requirements
The memory usage for training is also affected by the data type (precision) used. Commonly, float32 and bfloat16 are used for training, and the precision affects memory consumption as follows:
- float32 (32-bit floating point): Each parameter, gradient, and optimizer state consumes 4 bytes, which results in higher memory usage.
- bfloat16 (16-bit floating point): Each parameter, gradient, and optimizer state consumes 2 bytes, leading to reduced memory consumption.
Thus, using bfloat16 instead of float32 can significantly reduce the overall memory requirements for both the model parameters and the optimizer states.
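Put differently, with Adam the training footprint works out to roughly 16 bytes per parameter in float32 (4 for the parameter, 4 for its gradient, 8 for the two moments) and roughly 8 bytes per parameter in bfloat16. A short sketch of that arithmetic (illustrative, not from the original post), applied to a 7-billion-parameter model and anticipating the totals derived in the next sections:

num_params = 7 * 10**9   # LLaMA-2 7B

for precision, bytes_per_value in (("float32", 4), ("bfloat16", 2)):
    # parameter + gradient + first moment + second moment, all in the same precision
    bytes_per_param_total = 4 * bytes_per_value
    total_gb = num_params * bytes_per_param_total / 1000**3
    print(f"{precision}: {bytes_per_param_total} bytes/param -> {total_gb:.0f} GB for training")
# float32: 16 bytes/param -> 112 GB for training
# bfloat16: 8 bytes/param -> 56 GB for training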
3. Practical Example: LLaMA-2 7B Model
Let’s walk through an example of estimating the memory for training a LLaMA-2 7B model, which contains 7 billion parameters. We will first calculate the memory requirements assuming float32 precision.
3.1 Model Parameters Memory
Each parameter in the model requires 4 bytes (for float32), and with 7 billion parameters, the model parameters memory is:
\text{Model Parameters Memory} = 7 \times 10^9 \times 4 \, \text{bytes} = 28 \, \text{GB} \ (\text{using 1000-based units})
3.2 Gradients Memory
The gradients memory is the same as the model parameters memory because we need to store one gradient per parameter:
\text{Gradients Memory} = 28 \, \text{GB}
3.3 Optimizer States Memory
Since the Adam optimizer needs to store both the first and second moments, the optimizer states memory will be 2 times the model parameters memory:
\text{Optimizer States Memory} = 2 \times 28 \, \text{GB} = 56 \, \text{GB}
3.4 Total Training Memory
The total memory required for training is the sum of model parameters, gradients, and optimizer states:
\text{Training Memory} = 28 \, \text{GB} + 28 \, \text{GB} + 56 \, \text{GB} = 112 \, \text{GB}
4. Summary
In summary, when estimating memory for training a deep learning model, especially using optimizers like Adam, you need to consider not only the model parameters but also the gradients and the additional memory consumed by the optimizer state. Typically, the memory required for training is about 4 times the memory required for inference, where:
- 1x is for the model parameters.
- 1x is for the gradients.
- 1-2x is for the optimizer states.
Optimizers like Adam consume significant additional memory because they store both first and second moment estimates for each parameter. For large models, this memory overhead can be substantial, so it’s important to account for it when selecting hardware and configuring training settings. If memory is limited, techniques like mixed-precision training (e.g., using bfloat16) can help reduce memory usage.
5. Memory Consumption with bfloat16 Precision
Using bfloat16 (16-bit floating point) instead of float32 (32-bit floating point) during training can significantly reduce memory consumption. The primary advantage of bfloat16 is that it reduces memory usage while maintaining sufficient numerical precision, making it particularly well-suited for training deep neural networks.
5.1 Impact of bfloat16 on Model Parameters
With bfloat16 precision, each parameter consumes 2 bytes, as opposed to 4 bytes with float32. For example, for the LLaMA-2 7B model, which has 7 billion parameters:
- Model Parameters Memory (using bfloat16):
\text{Model Parameters Memory} = 7 \times 10^9 \times 2 \, \text{bytes} = 14 \, \text{GB}
Therefore, the memory required for the model parameters is reduced to 14 GB when using bfloat16.
5.2 Impact of bfloat16 on Gradients
Gradients are computed during backpropagation and are used to update the model parameters. The memory required to store gradients is the same as for the model parameters. Thus, when using bfloat16 for gradients, memory usage is also halved:
- Gradients Memory (using bfloat16):
\text{Gradients Memory} = 14 \, \text{GB}
5.3 Impact of bfloat16 on Optimizer States
The Adam optimizer stores two components for each parameter: the first moment (m) and the second moment (v). This results in the optimizer state requiring 2 times the memory of the model parameters.
Since bfloat16 uses 2 bytes per value, the memory overhead for the optimizer state is also halved compared to float32. Thus, the optimizer states’ memory in bfloat16 will be:
- Optimizer States Memory (using bfloat16):
\text{Optimizer States Memory} = 2 \times 14 \, \text{GB} = 28 \, \text{GB}
5.4 Total Memory Requirement for Training
When applying bfloat16 to model parameters, gradients, and optimizer states, the overall memory consumption for training is significantly reduced. For the LLaMA-2 7B model, the total memory required for training with bfloat16 would be:
\text{Training Memory} = 14 \, \text{GB} \ (\text{Model Parameters}) + 14 \, \text{GB} \ (\text{Gradients}) + 28 \, \text{GB} \ (\text{Optimizer States})
Thus, the total memory requirement for training the LLaMA-2 7B model using bfloat16 precision would be 56 GB.
5.5 Summary
- float32 precision: The training memory requirement for LLaMA-2 7B is approximately 112 GB.
- bfloat16 precision: The training memory requirement for LLaMA-2 7B is approximately 56 GB.
By using bfloat16 precision, the memory consumption for training is reduced by half compared to float32, which can significantly lower hardware costs and make it feasible to train large models on GPUs with limited memory. Therefore, when memory is a constraint, using mixed-precision training (e.g., with bfloat16) is an effective strategy to improve training efficiency.
In summary, choosing the right precision for your training process not only reduces memory usage but also accelerates training. This decision has a direct impact on the performance and feasibility of training large-scale deep learning models, making precision selection critical, especially when working with very large models.
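To close the loop, here is a minimal sketch (assuming PyTorch; the model and batch are placeholders) of what pure-bfloat16 training looks like, with parameters, gradients, and Adam states all held in bfloat16 as assumed in the calculations above. Note that production setups often use mixed precision instead (bfloat16 compute with float32 master weights), which changes the accounting somewhat:

import torch

model = torch.nn.Linear(4096, 4096).to(torch.bfloat16)    # parameters stored in bfloat16
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam states follow the parameters' dtype

x = torch.randn(8, 4096, dtype=torch.bfloat16)             # placeholder batch
loss = model(x).float().sum()      # reduce in float32 for a more stable loss value
loss.backward()                    # bfloat16 gradients, same size as the parameters
optimizer.step()

param = next(model.parameters())
state = optimizer.state[param]
print(param.dtype, param.grad.dtype, state["exp_avg"].dtype)  # all torch.bfloat16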
Postscript
Completed in Shanghai at 17:14 on November 29, 2024, with the assistance of the GPT-4o large model.