Compute and Memory Scales of Large Models

1. Model parameter count (using LLaMA 13B as an example)

	{
	  "architectures": ["LLaMAForCausalLM"],
	  "bos_token_id": 0,
	  "eos_token_id": 1,
	  "hidden_act": "silu",
	  "hidden_size": 5120,
	  "intermediate_size": 13824,
	  "initializer_range": 0.02,
	  "max_sequence_length": 2048,
	  "model_type": "llama",
	  "num_attention_heads": 40,
	  "num_hidden_layers": 40,
	  "pad_token_id": -1,
	  "rms_norm_eps": 1e-06,
	  "torch_dtype": "float16",
	  "transformers_version": "4.27.0.dev0",
	  "use_cache": true,
	  "vocab_size": 32000
	}

Embedding

$vocab\_size * h = 32000 h$

TransformerBlock

Self-Attention
- Q, K, V, O
- Parameter count (no bias): $4 * h^2$

	# Four projections, each hidden_size x hidden_size, no bias
	self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
	self.k_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
	self.v_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
	self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

MLP
- 3 dense layers
- Parameter count (no bias): $3 * h * intermediate$

		self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
		self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
		self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)

LayerNorm
- input_layernorm, post_attention_layernorm

$2 * h$: each of the two norms has a single weight vector of size $h$

	variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
	hidden_states = self.weight * hidden_states * torch.rsqrt(variance + self.variance_epsilon)

$4*h^2 + 3*h*intermediate + 2*h\\=4*h^2 + 3*h*2.7h + 2*h\\ \approx 12h^2 + 2h$

Transformer Layer

$TransformerBlock * layer \\= (12h^2+2h)*layer$

Final norm

norm: $h$

Total

$total\ params\\=embedding + transformer\ layers + norm\\ = 6.25h^2 + 480*h^2 + 80*h + h \\= 486.25*h^2 + 81*h \\ \approx 12.7B$ (with $h = 5120$, the embedding term $32000h$ equals $6.25h^2$)
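The tally above can be checked numerically. What follows is a minimal sketch, assuming the config values listed earlier; the function name `llama_param_count` is hypothetical. Like the derivation, it leaves out the output projection (`lm_head`), which would add another `vocab_size * h` weights if it is not tied to the embedding.

```python
def llama_param_count(vocab_size, h, intermediate, layers):
    """Count weights following the breakdown above (no biases, no lm_head)."""
    embedding = vocab_size * h
    attention = 4 * h * h           # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * h * intermediate      # gate_proj, up_proj, down_proj
    norms = 2 * h                   # input_layernorm, post_attention_layernorm
    block = attention + mlp + norms
    final_norm = h
    return embedding + layers * block + final_norm


total = llama_param_count(vocab_size=32000, h=5120, intermediate=13824, layers=40)
print(f"{total / 1e9:.2f}B parameters")
```

With the exact `intermediate_size` of 13824 the count comes out at about 12.85B, slightly above the rounded estimate, because the hand derivation rounds the per-block coefficient down to $12h^2$.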

2. GPU memory usage

ZeRO papers: https://arxiv.org/pdf/2104.07857.pdf https://arxiv.org/pdf/1910.02054.pdf

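2.1 Training (mixed precision)

Model States = optimizer states, gradients, parameters

$B=batch\_size, N = num\_heads, S = sequence\_length, D = head\_dim, h=hidden\_dim$

$C$: the number of transformer blocks between two activation checkpoints

For one parameter $\theta$ with backward gradient $\nabla f(\theta)$, AdamW keeps two extra states, $m$ and $v$

$m, v$ are float32, 4 bytes each

$\theta$ and $\nabla f(\theta)$ use float16 during the forward and backward passes, 2 bytes each

Parameter update: $\theta$ and $\nabla f(\theta)$ use float32 copies, 4 bytes each

Total: each parameter contributes $2 + 2 + 4 + 4 + 4 + 4 = 20$ bytes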
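As a rough check on the per-parameter accounting, the items listed above (fp16 parameter and gradient, fp32 copies of both, fp32 $m$ and $v$) add up to 20 bytes, which is the figure behind the Model States(Train) row in the table further down. A minimal sketch, with hypothetical names and with 7B and 13B taken as round parameter counts:

```python
# Bytes held per parameter during mixed-precision AdamW training,
# following the items listed above.
BYTES_PER_PARAM = {
    "fp16_param": 2,
    "fp16_grad": 2,
    "fp32_param": 4,
    "fp32_grad": 4,
    "fp32_momentum": 4,  # AdamW m
    "fp32_variance": 4,  # AdamW v
}


def model_state_bytes(n_params):
    """Total model-state memory in bytes for n_params parameters."""
    return n_params * sum(BYTES_PER_PARAM.values())


for name, n in [("7B", 7 * 10**9), ("13B", 13 * 10**9)]:
    print(name, model_state_bytes(n) / 1e9, "GB")
```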

Residual States

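$hidden\_states * \frac{layer}{C}* 2 \\= B * S * h * \frac{layer}{C} * 2$

In a deep network, some residuals have to be kept and passed between blocks. $C$ is the number of blocks that share one stored residual, and it differs from model to model.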
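The residual-state formula can be evaluated directly. This is a small sketch under the assumption of fp16 activations (2 bytes per value); the function name is hypothetical, and the example numbers (B=32, S=1024, h=10000, 80 layers, C=1) are the ones used in the comparison later in this section.

```python
def residual_state_bytes(B, S, h, layers, C=1, bytes_per_value=2):
    """Memory for stored residuals: one [B, S, h] tensor per C blocks."""
    return B * S * h * layers * bytes_per_value // C


size = residual_state_bytes(B=32, S=1024, h=10000, layers=80, C=1)
print(f"{size / 1e12:.3f} TB")
```

Raising `C` stores fewer checkpoints and shrinks this term proportionally, at the cost of recomputing more blocks in the backward pass.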

Model State Working Memory

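Once all Model States are offloaded to the CPU, this is the minimum temporary GPU memory needed for the forward pass and the backward gradient update (2 bytes each for the fp16 parameter and its gradient)

The largest buffer that has to be allocated is for a linear layer in the MLP:

$h * intermediate * (2 + 2)$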
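To get a feel for the size of this term, here is a minimal sketch with a hypothetical helper. It evaluates the formula for $h = 5120$ twice: once assuming an intermediate size of $4h$ (an assumption, not the LLaMA config value) and once with the `intermediate_size` of 13824 from the config above.

```python
GiB = 2**30


def model_state_working_bytes(h, intermediate):
    """fp16 weights (2 bytes) plus fp16 gradients (2 bytes) of the largest linear layer."""
    return h * intermediate * (2 + 2)


h = 5120
print(model_state_working_bytes(h, 4 * h) / GiB)  # assuming intermediate = 4h
print(model_state_working_bytes(h, 13824) / GiB)  # intermediate_size from the config
```

Under the $4h$ assumption the result is about 0.39 GiB; with the actual LLaMA 13B intermediate size it is smaller, about 0.26 GiB.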

Activation Working Memory

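Unlike params, these are intermediate results of the computation: they exclude model parameters and optimizer states, but include the mask matrices needed by the dropout operations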

Per-TransformerBlock computation:

hidden_states and residual:

$2 * 2 * [B, N, S, D] = 4 * BNSD$

After computing Q, K, V, O:

$[B, N, S, D] * 4 * 2 = 8 * BNSD$

$Softmax(\frac{QK^T}{\sqrt{D}}) = [B, N, S, S]=2 * BNSS$

MLP output:

$[B, N, S, D] = 2 * BNSD$

Dropout Mask (Attention_Drop + Residual_Drop):

$[B, N, S, D] = 2 * 1 * BNSD$

Sum per block:

$16 * BNSD + 2 * BNSS$

Total

$(16 * BNSD + 2 * BNSS) * C = BNS(16D + 2S) * C$
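The closed form $BNS(16D + 2S) * C$ is easy to evaluate. The sketch below is an illustration only: the function name is hypothetical, $C = 1$ is assumed, and the head counts (N=32 for a 7B model, N=40 for 13B, both with D=128) are assumptions, of which only the 13B values follow from the config shown earlier.

```python
GiB = 2**30


def activation_working_bytes(B, N, S, D, C=1):
    """Activation working memory in bytes: BNS(16D + 2S) * C."""
    per_block = 16 * B * N * S * D + 2 * B * N * S * S
    return per_block * C


print(activation_working_bytes(B=4, N=32, S=1024, D=128) / GiB)  # 0.5
print(activation_working_bytes(B=4, N=40, S=1024, D=128) / GiB)  # 0.625
```

Because of the $2S$ term inside the parentheses, the result grows faster than linearly in sequence length, so long contexts dominate this budget quickly.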

Comparison

column 5: Model States = $0.1T * 20 = 2 * 10^{12}$ bytes $\approx 1.82$ TB

column 6: full set of activations = every matrix referenced during the intermediate computation has to be stored: $BNS(34D + 5S) * layer$

column 7: Memory for Residual States = $B * S * h * \frac{layer}{C} * 2=32 * 1024*10000*80*2 \approx 0.05TB$ (with $C = 1$)

column 8: Model State Working Memory = $h * intermediate * (2 + 2) = 10000 * 4 * 10000 * 4 = 1.6 GB$

column 9: Activations Working Memory: on an 8-GPU node the per-GPU batch is 32/8 = 4, so $BNS(16D + 2S) * C = 4 * 40 * 1024 * (16 * 128 + 2 * 1024) \approx 0.62 GB$ (LLaMA 13B values: $N = 40$, $D = 128$, $C = 1$)

Total memory

$B * S * h * \frac{layer}{C} * 2 \\+ BNS(S+2D)*layer * 2+h* intermediate * (2+2)$

KV Cache

The K and V results of all previous tokens have to be kept; none of them can be released

$(BSh + BSh) * layer * 2$
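The KV cache formula translates directly into code. A minimal sketch, assuming fp16 values (2 bytes each); the function name is hypothetical, and the 7B shape (h=4096, 32 layers) is an assumption, while the 13B shape comes from the config shown earlier.

```python
GiB = 2**30


def kv_cache_bytes(B, S, h, layers, bytes_per_value=2):
    """One K and one V tensor of shape [B, S, h] per layer."""
    return (B * S * h + B * S * h) * layers * bytes_per_value


print(kv_cache_bytes(B=1, S=2048, h=4096, layers=32) / GiB)  # 1.0
print(kv_cache_bytes(B=1, S=2048, h=5120, layers=40) / GiB)  # 1.5625
```

The cache grows linearly in both batch size and sequence length, so serving many long sequences at once is usually what exhausts it.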

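Inference

At inference time there are no gradients and no optimizer states, only the fp16 weights and the intermediate variables that cannot be released:

$B * S * h * \frac{layer}{C} * 2 \\+ BNS(S+2D)*layer * 2+h* intermediate * (2+2)$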

Examples

ZeRO-Offload partitions the data such that the fp16 parameters are stored in GPU while the fp16 gradients, and all the optimizer states such as fp32 momentum, variance and parameters are stored in CPU

ZeRO-1, ZeRO-2 and ZeRO-3 corresponding to the partitioning of the three different model states, optimizer states, gradients and parameters, respectively.

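- ZeRO-1 partitions the optimizer states only: 4 bytes per parameter stay replicated (fp16 parameter and gradient)
- ZeRO-2 partitions gradients in addition to optimizer states: 2 bytes per parameter stay replicated (fp16 parameter)
- ZeRO-3 partitions all model states

The fp32 parameters and the gradient updates all live on the CPU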

| Model | LLaMA 7B (GB) | LLaMA 13B (GB) |
|---|---|---|
| Model States (Train) | 140 | 260 |
| Model States (ZeRO-2) | 14 + required gradient-update buffers | 26 + required gradient-update buffers |
| Model States (Inference) | 14 | 26 |
| Memory for Residual States | 0 | 0 |
| Model State Working Memory | 0.25 | 0.39 |
| Activations Working Memory (B=4, S=1024) | 0.5 | 0.62 |
| Activations Working Memory (B=1, S=1024) | 0.125 | 0.155 |
| Activations Working Memory (B=1, S=2048) | 0.375 | 0.47 |
| Activations Working Memory (B=4, S=10240) | 27.5 | 34 |
| KV Cache (B=1, S=2048) | 1 | 1.56 |

3. Compute FLOPs (floating-point operations)

Matrix multiplication

$A\in R^{m,n}$ , $B\in R^{n,p}$

Each element of the product costs $2 * n$ operations (multiply + add), and the product has shape $R^{m,p}$

Total cost: $2 * n * m * p = 2mnp$
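The $2mnp$ rule can be confirmed by counting operations in a naive triple loop. This is an illustrative sketch with hypothetical function names; it follows the usual convention of charging one multiply and one add per inner-loop step.

```python
def matmul_flops(m, n, p):
    """Closed-form cost of multiplying an [m, n] matrix by an [n, p] matrix."""
    return 2 * m * n * p


def count_ops_naive(m, n, p):
    """Count multiply and add operations of a schoolbook matrix product."""
    ops = 0
    for _ in range(m):          # rows of A
        for _ in range(p):      # columns of B
            for _ in range(n):  # dot product over the shared dimension
                ops += 2        # one multiply, one add
    return ops


print(matmul_flops(3, 4, 5), count_ops_naive(3, 4, 5))
```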

Embedding

$[B,S] * [V,h]^T \to [B,S,h]$ : a table lookup, so no FLOPs

TransformerBlock

Self-Attention

$X \in R^{B,S,h}$ , $W\_Q,W\_K,W\_V \in R^{h,h}$ , cost: $3 * 2 * BShh = 6BSh^2$

$\frac{QK^T}{\sqrt{D}}$ , $[B,N,S,D] * [B,N,S,D]^T \to [B,N,S,S]$ , cost:

$(2 + 1) BNSDS = 3BS^2h$

$\text{Softmax}(x\_{i}) = \frac{\exp(x\_i)}{\sum\_j \exp(x\_j)}$ , which applies a multiply, an add, and a divide across the matrix, cost:

$3 * [B, N, S, S] = 3BNS^2$

$\text{Softmax}(W) \cdot V \cdot W\_O$ , cost:

$[B, N, S, S] * [B, N, S, D] * [h, h] \to 2BNS^2D + 2BSh^2 = 2BS^2h + 2BSh^2$

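Total: $BSh^2 (6 + 3\frac{S}{h} + 3\frac{S}{Dh} + 2 \frac{S}{h} + 2 )$ ; the Softmax term can be ignored
- $8BSh^2 +5BS^2h$
- S=2048, h=4096, D=32 : $(6 + \frac{3}{2} + \frac{3}{64} + 1 + 2 )$
- When $S$ is twice $h$, the three parts (the QKV projections, $QK^T$, and $\text{Softmax} \cdot V \cdot W\_O$) each cost about $6BSh^2$
- Cost grows quadratically whether you scale up S or h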
MLP
- L1 gate $[B, S, h] * [h, inter] \to [B, S, inter]$ , cost: $2 * BSh * inter$
- L2 up $[B, S, h] * [h, inter] \to [B, S, inter]$ , cost: $2 * BSh * inter$
- L3 down $[B, S, inter] * [inter, h] \to [B, S, h]$ , cost: $2 * BS * inter * h$
- Total: $6 * BSh * inter$
Logits
- $[B, S, h] * [h, V] \to [B, S, V]$ , cost: $2BShV$
Total:
- $forward = layer * (8BSh^2 +5BS^2h + 6BSh*inter) + 2BShV$
- $backward = 2 * forward$
- $total = 3 * step * forward$
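Putting the pieces together, the forward and training cost can be estimated in a few lines. This is a rough sketch with hypothetical function names; it uses the LLaMA 13B shape from the config in section 1 with B=1 and S=2048, and it ignores the Softmax and normalization terms as the derivation does.

```python
def forward_flops(B, S, h, inter, V, layers):
    """Forward-pass FLOPs: per-layer attention and MLP, plus the final logits."""
    attention = 8 * B * S * h * h + 5 * B * S * S * h
    mlp = 6 * B * S * h * inter
    logits = 2 * B * S * h * V
    return layers * (attention + mlp) + logits


def training_flops(steps, **shape):
    """Backward costs about twice the forward pass, so each step is 3x forward."""
    return 3 * steps * forward_flops(**shape)


cfg = dict(B=1, S=2048, h=5120, inter=13824, V=32000, layers=40)
fwd = forward_flops(**cfg)
print(f"forward:  {fwd / 1e12:.1f} TFLOPs")
print(f"one step: {training_flops(1, **cfg) / 1e12:.1f} TFLOPs")
```

Dividing the result by the sustained throughput of the hardware gives a first-order estimate of time per step.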