note
Table of Contents
- note
- 1. GRPO
- Reference
1. GRPO
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (https://arxiv.org/pdf/2402.03300). GRPO was later adopted in DeepSeek V2. Because GRPO does not require a separate value model during training, it also reduces the resource consumption of RL training.
The GRPO objective function is:
$$
\begin{aligned}
\mathcal{J}_{GRPO}(\theta) = \ & \mathbb{E}\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right] \\
& \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\left[\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}\hat{A}_{i,t},\ \operatorname{clip}\left(\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})},\ 1-\varepsilon,\ 1+\varepsilon\right)\hat{A}_{i,t}\right] - \beta\, \mathbb{D}_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right]\right\}
\end{aligned}
$$
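To make the objective concrete, below is a minimal PyTorch sketch of the per-token surrogate loss for a single sampled output, assuming the per-token log-probabilities under the current, old, and reference policies have already been gathered and the group-normalized advantage has been broadcast to every token. The function name `grpo_token_loss` and the default values of `eps` and `beta` are illustrative, not taken from the paper; the KL term uses the unbiased estimator $\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1$ described in DeepSeekMath.

```python
import torch

def grpo_token_loss(logp, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """GRPO surrogate loss for one sampled output o_i (negated for gradient descent).

    logp, logp_old, logp_ref: log pi(o_{i,t} | q, o_{i,<t}) under the current, old,
        and reference policies, each of shape (|o_i|,).
    advantages: the group-normalized advantage of o_i, broadcast to every token.
    eps, beta: clip range and KL weight (illustrative defaults, not from the paper).
    """
    ratio = torch.exp(logp - logp_old)                       # pi_theta / pi_theta_old
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_term = torch.minimum(surr_unclipped, surr_clipped)

    # Unbiased, non-negative KL estimator:
    # D_KL[pi_theta || pi_ref] ≈ pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = logp_ref - logp
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # (1/|o_i|) * sum_t {...}; averaging over the G outputs is done by the caller.
    return -(policy_term - beta * kl).mean()
```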
Procedure (a minimal sketch of the advantage computation follows this list):
1. For each question q, sample a group of G outputs {o_1, ..., o_G} from the old policy π_θold.
2. Score each output with the reward model to obtain rewards {r_1, ..., r_G}.
3. Compute each output's advantage by normalizing its reward within the group: Â_{i,t} = (r_i − mean(r)) / std(r), shared by all tokens of o_i.
4. Update π_θ by maximizing the GRPO objective above; the KL term against π_ref regularizes the update, and no value model is needed.
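As a sketch of step 3, assuming outcome supervision (one scalar reward per sampled output); the helper name `group_advantages` is my own, not from the paper:

```python
import torch

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for outcome supervision.

    rewards: tensor of shape (num_questions, G) with one scalar reward per
        sampled output. Every token of output o_i then shares
        A_hat_{i,t} = (r_i - mean(r)) / std(r), computed within its own group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Example: 2 questions, G = 4 sampled outputs each
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.2, 0.9, 0.4]])
print(group_advantages(rewards))  # shape (2, 4), zero mean within each row
```

Each resulting value is then broadcast to all tokens of the corresponding output before being plugged into the objective above as Â_{i,t}.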
Reference
https://github.com/huggingface/alignment-handbook
https://github.com/OpenRLHF/OpenRLHF
https://github.com/hiyouga/LLaMA-Factory
AI 大模型Paper Reading: DeepSeek Math 阅读笔记之GRPO算法
速读deepseek v2 (三)- 理解GRPO(deepseekmath 与 deepseek coder)
解读DeepSeekMath中的RL策略!GRPO:改进PPO增强推理能力