LLM Survey: How OPT-175B Was Made (Process, Details, Reference Links)

Ramblings

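It is mid-2023, and the World Artificial Intelligence Conference (WAIC) has just wrapped up.
WAIC gives the impression of “a reputation that is hard to live up to.” In that enormous venue, about 80% of the exhibitors were talking up large models and GPT, yet deploying one reportedly costs upwards of a hundred million yuan. China does not yet seem to have any truly solid academic results in this area, but a bubble is already taking shape.
I mentioned this to a manager at a tech company, who felt the same way. Back in 2019, a WAIC exhibition ticket cost 3,000 yuan; today admission is free, and yet the present moment feels just like that one.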
But the boss has his own considerations.
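Of a large model costing over a hundred million yuan, perhaps more than half of the spend flows back to the client or the relevant authorities.
Such is the broader environment.
Since the boss has made arrangements, enough chatter; let me organize the material on large-model training.
More updates to follow.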

Preface

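I put this document together to give everyone a basis for related decisions and discussion.
None of the publicly available papers covers how ChatGPT (GPT-3.5/GPT-4) was actually trained, and the so-called GPT-4 paper reads more like a product advertisement; no wonder many people call OpenAI “ClosedAI”. Recent rumors that ChatGPT relies on MoE (Mixture of Experts) make its training method even murkier.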

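The discussion of large-model training here is based on a Stanford seminar about the training of a large model, OPT-175B. Because OpenAI does not disclose ChatGPT's technical details, the seminar's account of the technical details, implementation process, resources invested, and problems encountered is an especially valuable reference.
In addition, on 2023/07/20 Meta released Llama 2, open and free for research and commercial use. That will be analyzed separately.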

Training Process

Setup (team of 5 engineers, set up in about September 2021)

Train a 175B LLM (dense, autoregressive, decoder-only Transformer) in about 3 months using 1024 80GB A100 GPUs (yes, 1,024 A100 cards)

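With the team's resources/efficiency, about 33 days of continuous training (assuming 0 failures/restarts) were needed to go through 300B tokens
Apart from the cloud provider's support staff, there was no dedicated infrastructure operations team
They used all the data the lab could find at the time; clearly the model was often undertrained (not enough data, insufficient training)
Because the hyperparameters used by the FAIR NLP groups differed from those published by others in the industry (Microsoft/NVIDIA/OpenAI), it was unclear which settings were best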
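As a rough sanity check on the 33-day figure, the sketch below computes the sustained per-GPU throughput that the schedule implies. The `6 * params * tokens` cost estimate is a common rule of thumb and an assumption here, not something stated in the talk; it ignores activation recomputation and other overheads, and the function name is illustrative.

```python
# Back-of-envelope check of the "about 33 days" figure.
# Assumption (not from the talk): training cost ~ 6 * params * tokens FLOPs.

PARAMS = 175e9      # model parameters
TOKENS = 300e9      # training tokens
NUM_GPUS = 1024     # 80GB A100s
DAYS = 33           # continuous training, zero failures/restarts

SECONDS_PER_DAY = 24 * 60 * 60


def implied_tflops_per_gpu(params, tokens, num_gpus, days):
    """Return the sustained TFLOP/s per GPU implied by the schedule."""
    total_flops = 6 * params * tokens
    gpu_seconds = num_gpus * days * SECONDS_PER_DAY
    return total_flops / gpu_seconds / 1e12


if __name__ == "__main__":
    tflops = implied_tflops_per_gpu(PARAMS, TOKENS, NUM_GPUS, DAYS)
    print(f"implied sustained throughput: {tflops:.0f} TFLOP/s per GPU")
```

Under these assumptions the result is on the order of 100 TFLOP/s per GPU, which is a plausible fraction of an A100's peak and suggests the 33-day estimate leaves little slack for restarts.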

October 2021

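Run 1: Used the settings from the FAIR NLP groups (published material/papers), since there was nothing else to go on. The results were not great
Run 2: Raised weight decay from 0.01 to 0.1; the training loss plateaued and the results were unsatisfactory
Run 3: Started changing more parameters toward the GPT-3/Megatron-LM settings
- Set global grad norm clipping to 1.0
- Changed Adam beta2 from 0.98 to 0.95
- Changed Adam epsilon from 1e-8 to 1e-6
- Found a bug in the code when running in parallel. Lesson: do a small-scale trial run first so that code bugs surface early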
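To make the "global grad norm clipping" setting from Run 3 concrete, here is a minimal pure-Python sketch. It is not the team's code: the function name, the dictionary keys, and the representation of gradients as lists of floats are all illustrative assumptions.

```python
import math

# Run 3 optimizer settings as listed above; the key names are illustrative.
RUN3_SETTINGS = {"clip_norm": 1.0, "adam_beta2": 0.95, "adam_eps": 1e-6}


def clip_global_grad_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of per-parameter gradient lists (plain floats here).
    Returns (clipped_grads, original_global_norm).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the threshold: return an unchanged copy.
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm


if __name__ == "__main__":
    clipped, norm = clip_global_grad_norm([[3.0], [4.0]], RUN3_SETTINGS["clip_norm"])
    print(norm, clipped)
```

The key point is that the norm is computed over all parameters together, so every gradient is shrunk by the same factor and the update direction is preserved.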

Runs 4-10: Assorted hyperparameter tuning

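Run 4: Rolled back the tensor-parallel code and reverted weight decay to 0.01
Run 5: Started clipping gradients, set weight decay to 0.1, increased the number of warmup steps
Run 6: Finally fixed the tensor-parallel bug; reran with Run 4's settings but kept clipping at 1.0
Run 7: Results still unsatisfactory; more settings changed back
- Weight decay set back to 0.1, more warmup steps, skipped the last (partial) batch
Runs 8, 9: More parameter adjustments to improve training stability
Run 10: Doubled the batch size (2M -> 4M)
- No real difference, so stayed at 2M (this allows more optimization steps and a gentler optimization process, a view that even the team members were somewhat unsure about)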

November 2021

Run 11: Let's go. After all the earlier churn, the team finally settled on a set of settings they thought might work reasonably well
- 2M batch size
- FP32 Adam
- Tensor parallel(8x MP)
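- New data, from experiment 29 (the earlier dataset had a problem: extra escape characters had been added, so the model lowered its loss by spotting escape characters rather than actually learning anything)
- Learned positional embeddings. Because the team was not confident the positional embeddings could be learned, they were given a sinusoidal init, consistent with the original Transformer paper.
- Weight decay: 0.05
- LR of 3e-4, end LR of 1e-5. The learning rate starts high and decreases later to keep training from becoming unstable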
- No dropout on embeddings
- Normformer (impact on grad norm is making earlier layers be more similar with later layers)
- Gradient pre-divide factor: 32 (Naman has been running with this), intended to enable local gradient accumulation
- Clip (L2 norm): 2.5 (later shown to be important, though its importance was not clear at the time)
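The Run 11 settings above give only a peak learning rate (3e-4) and an end learning rate (1e-5). The sketch below shows one way such a schedule could look: linear warmup to the peak followed by linear decay to the end value. The linear shapes and the `warmup_steps` and `total_steps` values are hypothetical choices for illustration, not figures from the talk.

```python
def lr_at_step(step, max_lr=3e-4, end_lr=1e-5,
               warmup_steps=2000, total_steps=150_000):
    """Learning rate at a given 0-indexed optimizer step.

    warmup_steps and total_steps are hypothetical values for illustration.
    """
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return end_lr
    # Linear decay from max_lr down to end_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr + (end_lr - max_lr) * progress


if __name__ == "__main__":
    for s in (0, 1999, 2000, 75_000, 150_000):
        print(s, f"{lr_at_step(s):.3e}")
```

With a schedule like this, lowering the peak learning rate for a restart only requires changing `max_lr`.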
Run 11.(1-5): Off to a rocky start…
- Saw instabilities immediately within a few hundred steps (the initial learning rate was set too high, which made training unstable)
  - GPT-3 trained with 6e-5 LR
- Run 11.1: Lowered LR from 3e-4–>7.5e-5
- Run 11.2: Lowered gradient clipping threshold from 2.5–>1.5
- Run 11.3: Relaunched after uncorrectable ECC error
- Run 11.4: Trying to speed things up by running our validation loop less frequently
- Run 11.5: Grad norm still spiking, lowered clip from 1.5–>1.0
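Run 11.(6-10): Going downhill
- Whenever training became unstable: roll back, adjust settings, then resume training
- Run 11.6: Skip batches when grad norm > 1.0 (to bypass unstable data); did not work well
- Run 11.7: New data shared, increase weight decay to 0.1 (from 0.05), beta2 changed from 0.98 to 0.95
- Run 11.8: Keep beta2 at 0.95, revert weight decay back to 0.05, no clipping
- Run 11.9: Turn clipping back on, LR lowered to 6e-5, otherwise the same as 11.7
- Run 11.10:
  - Borrowed the following identity and rearranged the computation accordingly to improve stability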
```
n * (A dot B) === (sqrt(n) * A) dot (sqrt(n) * B)
```
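  - Replaced GELU with ReLU, because the GELU formula contains an x^3 term that may cause instability
  - Training could now continue, but it was still unclear whether the model was learning anything, because of the following: “Training with FP16 (not mixed-precision/no copy of weights in FP32); Use loss-scaling to try and preserve small gradient values; Scale up loss when we haven’t overflowed in a while (no ‘inf’ grads); Scale down loss when we start overflowing”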
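The loss-scaling rule quoted above (scale up after a stretch without overflow, scale down on overflow) can be sketched as a small state machine. This is an illustrative sketch only: the class name, the default factors, and the `growth_interval` are hypothetical and are not taken from the team's codebase.

```python
import math


class DynamicLossScaler:
    """Minimal dynamic loss scaler; all defaults are hypothetical values."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000, min_scale=1.0):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale
        self._good_steps = 0  # consecutive steps without overflow

    def update(self, grads):
        """Inspect this step's gradients and adjust the scale.

        Returns True if the optimizer step should be applied, or False if
        the step should be skipped because a gradient overflowed.
        """
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Scale down when we start overflowing, and skip the step.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Scale up when we have not overflowed in a while.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True
```

In use, the loss is multiplied by `scaler.scale` before the backward pass and the gradients are divided by the same value before the optimizer step. A loss scale that keeps collapsing toward `min_scale` is one signal that training is diverging.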

Mid Nov; Run 12.00: Beginning of the “final” run

To recap: need at least 33 days to train a 175B model over 300B tokens using 992 80GB A100s
- Also need time/compute resources to benchmark the model before EOY
No strong evidence that settings from the Run 11.xx lineage would work
Match GPT-3/Megatron codebase as closely as possible
- Settings in both places seem consistent with one another
- Some evidence that these settings can be used on larger models “successfully”
Overall weight initialization updated:
- Removed extra layer norms from Normformer setup
- Removed embedding scaling
- Gaussian init for learned positional embeddings (instead of sinusoidal init); the thinking here was that a smaller standard deviation might help stability
- Weight decay = 0.1
- Clipping = 1.0
- Adam beta2 = 0.95
- Max LR of 1.2e-4
Run 12.(01-15): Mainly “systems” issues
- Lost GPU(12.01, 12.10)
- CUDA errors (12.02, 12.03, 12.04, 12.09, 12.15, 12.17)
- Job hanging (12.05, 12.06)
- NCCL error (12.08)
- Job slowdown (12.11)
Run 12.(17-34): All the instabilities…
For Run 12.16:
- Reduce clipping to 0.3
- Backup plan:reset Adam state and do fresh warmup
Runs 12.(17-34):
- Hardware issues (ECC errors, lost GPU, high # of DRAM correctables, etc.)
- Mysterious job hanging issues
- Blob storage issues
- Gradient overflow issues
  - Used loss scaling state “reset” with restarts to try and circumvent the same fate
Run 12.(35-40): “Fake SGDW” side-quest
- Given training instabilities, we got a bad idea to hot-swap in SGDW (without momentum) and get rid of Adam entirely (Runs 12.(35-40))
  - Tried to be clever by “approximating SGD” via setting beta1=0, epsilon to 100 (and increase LR by the same 100x factor)
  - Realized beta1 was not actually set to 0 in these runs, results from these runs were completely void
- Real SGD attempt with Run 12.41, but did not seem to improve anything
- But we actually wanted to do SGDW but had a weight decay bug, so we only had SGD
- LR was likely also too low
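The “approximating SGD” trick described above relies on the Adam update `lr * m_hat / (sqrt(v_hat) + eps)`: with beta1=0 the first moment is just the current gradient, and with epsilon=100 the denominator is dominated by epsilon, so multiplying the learning rate by 100 recovers roughly `lr * grad`. The sketch below checks this on scalars using the textbook Adam formula; weight decay is omitted, and the function names and gradient values are illustrative.

```python
import math


def adam_step(grad, m, v, t, lr, beta1, beta2, eps):
    """One textbook Adam update on a scalar; returns (update, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)   # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v


def sgd_step(grad, lr):
    """Plain SGD update without momentum."""
    return lr * grad


def compare(grads, base_lr, beta1, beta2=0.95, eps=100.0):
    """Return (adam_update, sgd_update) pairs, with Adam's LR scaled by eps."""
    m = v = 0.0
    rows = []
    for t, g in enumerate(grads, start=1):
        update, m, v = adam_step(g, m, v, t, lr=base_lr * eps,
                                 beta1=beta1, beta2=beta2, eps=eps)
        rows.append((update, sgd_step(g, base_lr)))
    return rows


if __name__ == "__main__":
    grads = [0.02, -0.01, 0.03]  # made-up gradient values
    print("beta1=0.0:", compare(grads, 1e-4, beta1=0.0))
    print("beta1=0.9:", compare(grads, 1e-4, beta1=0.9))
```

With beta1=0 the two columns agree closely. With a nonzero beta1 the Adam column still carries momentum and can even point the opposite way from the current gradient, which is consistent with the runs where beta1 was not actually 0 being declared void.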
Run 12.42: Back to AdamW with lower LR
- Beginning of the “restart and lower LR repeatedly” cycle
- Run 12.42: Lowered (max) LR from 9.0e-5 to 6.0e-5 to match GPT-3
- Run 12.44: Lowered LR to 5.4e-5
- Run 12.46: Lowered LR to 4.0e-5
- Run 12.51: Increased LR to 6.0e-5
- Run 12.51: Lowered LR to 4.5e-5
- Run 12.53: Lowered LR to 3.0e-5

OPT-175B survived 143k steps
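As a consistency check, the 143k-step figure lines up with the 300B-token target if the 2M batch size from Run 11 carried over to the final run and “2M” means 2^21 tokens. Both of those are assumptions made for this sketch, not statements from the talk.

```python
TOKENS = 300e9           # token budget from the recap above
BATCH_TOKENS = 2 ** 21   # assumption: "2M" batch means 2,097,152 tokens


def steps_for(tokens, batch_tokens):
    """Number of optimizer steps needed to consume the token budget."""
    return tokens / batch_tokens


if __name__ == "__main__":
    print(f"steps: {steps_for(TOKENS, BATCH_TOKENS):,.0f}")
```

Under the alternative reading of 2M as exactly 2,000,000 tokens, the same budget would take 150,000 steps.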

Q&A

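Q: If you could go back to before training started, what would you want to improve?

A: Much, much more data. Hyperparameters and the like are secondary; what really matters is the training data. Also, discovering that bfloat16 is the most suitable format was a milestone.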

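Q: Why did this project team have only 5 people? Was it because management thought 5 was enough, or because, as rumor has it, only about 200 people in the world can train these large models and you are among them?

A: Many other people helped along the way, but the core team really was only 5 people. Each training run takes several weeks and we only run one at a time, so more people would have been wasted.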

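Q: How much of your day did you spend anxiously staring at the loss curve?

A: The pressure to succeed was intense, and a lot of time really did go into watching the latest loss (more nervously than watching a stock chart, I would guess), trying to step in before a failure so as not to waste more time. A lot of time also went into fixing hardware problems and debugging code.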

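Q: I have personally trained a 2.7B model, and a model of that size is easy to handle. Past what parameter count do models become very unstable?

A: Not sure. Based on my limited experience and on papers from Google and others, the tipping point may be around 70B.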

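Q: How did you deal with hardware problems?

A: We worked through them together with the cloud engineers. NVIDIA's A100 is far less stable than the V100; hopefully that gets fixed in the future.

Follow-up from the questioner: rumor has it that the newest H100 has even more problems. Good luck.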

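Q: How did this project define success?

A: At the time we had no real notion of success; we simply wanted the loss to get low enough and the results to beat the benchmark. During the project we had no energy to compare against competing models and focused only on loss, loss, loss. Only now, looking back, do we have time to compare against competitors.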

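Q: You stressed data quality just now and said that with better data the model would have trained better. If data quality matters so much, how do you define good versus bad data?

A: Books, papers, and code are intuitively the highest-quality data. For domain-specific models, such as coding models, good code is obvious at a glance; but for these large general-purpose models we do not know how to define good or bad data, and we only find out whether the data was good once the final model has been trained.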

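Q: When something goes wrong, how do you know what caused it? It could be a bug, a parameter problem, or a hardware failure.

A: There is no good method; you just keep trying. We even ran into a transistor-level problem that we could not reproduce at all. There is no general approach; you can only build experience through failures, but with the next generation of machines, the H100, that accumulated experience no longer applies and the trial and error starts over.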

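Q: Why are newer machine models less stable?

A: You would have to ask Jensen Huang. If NVIDIA's drivers were open source we could help debug them, but unfortunately they are not, so there is nothing we can do.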

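Q: Many big players are training these large models now. What do you think is special about your project?

A: Personally, the most important thing is passing on the lessons learned from training large models so that others can avoid detours. (Truly a saint.) AI will change the world, but I do not want the power to change it held by only a very few; everyone advancing together is what matters. I hope our sponsor Facebook continues to support open-source projects like this one.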

Ref

Code repository (Gitee mirror), https://gitee.com/mirrors/OPT-175B/tree/main
Paper, https://arxiv.org/pdf/2205.01068.pdf
Training notes and records, https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles
Logbook, https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
How to build a private knowledge base with ChatGPT, https://www.zhihu.com/question/596838257