A Deep Dive into the DPO (Direct Preference Optimization) Algorithm

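1. What Is DPO?

Direct Preference Optimization (DPO) is an alignment algorithm that does not require reinforcement learning. By removing the complex reinforcement learning machinery, DPO aligns a model at a cost comparable to supervised fine-tuning (SFT): it no longer needs to sample from the large language model during training, and its hyperparameters are easier to choose.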

2. The Bradley-Terry Model

The Bradley-Terry model describes pairwise comparisons. Let $A$ have strength $\lambda_1$ and $B$ have strength $\lambda_2$. When $A$ plays $B$, the probability that $A$ beats $B$ is:

$P(A>B)=\frac{e^{\lambda_1}}{e^{\lambda_1}+e^{\lambda_2}}=\frac{\alpha_1}{\alpha_1+\alpha_2},\quad \alpha_1\triangleq e^{\lambda_1},\quad \alpha_2\triangleq e^{\lambda_2}$

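Because $\lambda_1,\lambda_2$ are not guaranteed to be positive, they are exponentiated first (a softmax over the two strengths), so that $\alpha_1,\alpha_2>0$.

As an example, suppose we have the following win/loss record:

Matchup	Wins for A	Losses for A
A vs B	8	4
A vs C	3	5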

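To find the probability that $B$ beats $C$, we need the values of $\alpha_2$ and $\alpha_3$ (where $\alpha_3$ is the exponentiated strength of $C$). We start from the likelihood function: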

$L=\left(\frac{\alpha_1}{\alpha_1+\alpha_2}\right)^8 \left(\frac{\alpha_2}{\alpha_1+\alpha_2}\right)^4 \left(\frac{\alpha_1}{\alpha_1+\alpha_3}\right)^3 \left(\frac{\alpha_3}{\alpha_1+\alpha_3}\right)^5$

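Setting the partial derivatives of the log-likelihood to zero, $\frac{\partial \log L}{\partial \alpha_2}=\frac{4}{\alpha_2}-\frac{12}{\alpha_1+\alpha_2}=0$ and $\frac{\partial \log L}{\partial \alpha_3}=\frac{5}{\alpha_3}-\frac{8}{\alpha_1+\alpha_3}=0$, gives $\alpha_2=\frac12\alpha_1,\,\alpha_3=\frac53\alpha_1$ . Therefore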

$P(B>C)=\frac{\alpha_2}{\alpha_2+\alpha_3}=\frac{\frac12}{\frac12+\frac53}=\frac{3}{13}$
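The same estimate can be reproduced numerically. The sketch below is one possible way to do it, not the method used above: it iterates the classic fixed-point update for Bradley-Terry strengths (each strength is set to the player's total wins divided by a sum over opponents), which follows from setting the partial derivatives of the log-likelihood to zero. The function name `fit_bradley_terry` and the matrix layout `wins[i][j]` are assumptions made for illustration.

```python
def fit_bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths alpha_i with a fixed-point update.

    wins[i][j] is the number of times player i beat player j.
    """
    n = len(wins)
    alpha = [1.0] * n
    for _ in range(n_iter):
        new_alpha = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents j of (games between i and j) / (alpha_i + alpha_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (alpha[i] + alpha[j])
                for j in range(n)
                if j != i
            )
            new_alpha.append(total_wins / denom)
        # Only ratios matter, so fix the scale by setting alpha of player 0 (A) to 1.
        alpha = [a / new_alpha[0] for a in new_alpha]
    return alpha


# Win counts from the table: A beat B 8 times and lost 4; A beat C 3 times and lost 5.
wins = [
    [0, 8, 3],  # A
    [4, 0, 0],  # B
    [5, 0, 0],  # C
]
alpha_a, alpha_b, alpha_c = fit_bradley_terry(wins)
p_b_beats_c = alpha_b / (alpha_b + alpha_c)
print(alpha_b, alpha_c, p_b_beats_c)
```

With $\alpha_1$ normalized to 1, the iterates should approach $\alpha_2=\frac12$ and $\alpha_3=\frac53$, so the printed probability should agree with $\frac{3}{13}$.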

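2.1 Training the Reward Model

Training a reward model involves positive examples $(x,y^+)$ and negative examples $(x,y^-)$ , where $x$ is the prompt and $y$ is the response. Because $r (x, y)$ may be negative, it is exponentiated (a softmax over the two rewards) before the Bradley-Terry model is applied: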

$\begin{aligned} P(y^+>y^-|x)&=\frac{\exp (r(x,y^+))}{\exp (r(x,y^+))+\exp (r(x,y^-))}=\frac{1}{1+\exp(r(x,y^-)- r(x,y^+))} \\ &=\sigma (r(x,y^+)-r(x,y^-)) \end{aligned}$

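where $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function. Training the reward model amounts to maximizing $P(y^+>y^-|x)$ , which is equivalent to minimizing $-\log P(y^+>y^-|x)$ . This gives the loss function for reward model training: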

$\mathcal{L}_{\text{RM}} =-\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}} [\,\log\sigma(r(x,y^+)-r(x,y^-))]$

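This is essentially contrastive learning: the reward model learns to raise the score of the positive example while lowering the score of the negative example, so as to maximize the gap between the two.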
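To make the loss concrete, here is a minimal sketch that evaluates $\mathcal{L}_{\text{RM}}$ on a handful of pairs using only the Python standard library. The reward scores are made-up numbers, and the helper names `log_sigmoid` and `reward_model_loss` are assumptions for illustration; a real reward model would produce the scores from a neural network.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(chosen_rewards, rejected_rewards):
    """Average of -log sigmoid(r(x, y+) - r(x, y-)) over a batch of pairs."""
    losses = [-log_sigmoid(rc - rr) for rc, rr in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


# Hypothetical reward scores for three (x, y+, y-) pairs.
well_separated = reward_model_loss([2.0, 1.5, 0.5], [-1.0, -0.5, -2.0])
tied = reward_model_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
wrong_order = reward_model_loss([-1.0, -0.5, -2.0], [2.0, 1.5, 0.5])
print(well_separated, tied, wrong_order)
```

When the two scores are equal the loss is $\log 2$; it falls below that as the positive example pulls ahead and rises above it when the ordering is wrong.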

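3. From PPO to DPO

The conventional RLHF pipeline first trains a reward model on human preference data, then uses this reward model together with a reinforcement learning algorithm (such as PPO) to guide further training of the LLM. This approach has the following drawbacks:

Reward modeling is fairly complex and incurs extra computational cost.
The reinforcement learning pipeline is complex, unstable in practice, and sensitive to hyperparameters.

DPO lets the policy model learn directly from human preference data, skipping both the construction of a reward model and the reinforcement learning step, hence the name Direct Preference Optimization.

We first look at PPO with a KL-divergence regularization term. To simplify the derivation, the optimization objective can be rewritten as: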

$\max_{\pi_{\theta}} \mathbb{E}_{x\sim D,y\sim \pi_{\theta}} [r(x,y)]-\beta \text{KL} [\pi_{\theta}(y|x) \,\|\, \pi_{\text{ref}}(y|x)]$

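where $r (x, y)$ is the reward function, $\pi_{\theta}$ is the policy model (the model being trained), and $\pi_{\text{ref}}$ is the reference model (frozen). Both are initialized from the SFT model. In the RLHF stage we want to maximize the reward while keeping the policy model from drifting too far from the reference model.

Note that $P(y^+>y^-|x)$ depends only on $r (x, y)$ . If we can find the relationship between $\pi_{\theta}$ and $r (x, y)$ , we can express $P(y^+>y^-|x)$ in terms of $\pi_{\theta}$ and thereby avoid reward modeling altogether. The LLM can then learn human values and preferences through a form that is equivalent to reinforcement learning.

Consider transforming the PPO objective: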

$\begin{aligned} &\max_{\pi_{\theta}} \mathbb{E}_{x\sim D,y\sim \pi_{\theta}} [r(x,y)]-\beta \text{KL} [\pi_{\theta}(y|x) \,\|\, \pi_{\text{ref}}(y|x)] \\ =&\max_{\pi_{\theta}} \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ r(x,y)-\beta\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}\right] \\ =&\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ \log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}-\frac{1}{\beta}r(x,y)\right] \\ =&\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ \log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}+\log\frac{1}{\exp(\frac{1}{\beta}r(x,y))}+\log\frac{1}{\frac{1}{Z(x)}}-\log Z(x)\right] \\ =&\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ \log\frac{\pi_{\theta}(y|x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp(\frac{1}{\beta}r(x,y))}-\log Z(x)\right] \\ \end{aligned}$

where $Z (x)$ is a partition function that we introduce, defined as

$Z(x)=\sum_y \pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$

Now define

$\pi^*(y|x)=\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$

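It is easy to check that $\pi^*$ satisfies the following two properties:

$\pi^*(y|x)\geq 0$ .
$\sum_y \pi^*(y|x)=1$ .

So $\pi^*$ is a probability distribution. We substitute it back into the expression and continue the derivation: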

$\begin{aligned} &\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ \log\frac{\pi_{\theta}(y|x)}{\pi^*(y|x)}-\log Z(x)\right] \\ =&\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \left[ \mathbb{E}_{y\sim \pi_{\theta}(y|x)}\left[ \log\frac{\pi_{\theta}(y|x)}{\pi^*(y|x)} \right]-\log Z(x) \right] \\ =&\min_{\pi_{\theta}} \mathbb{E}_{x\sim D} \left[ \text{KL}[\pi_{\theta}(y|x) \,\|\, \pi^*(y|x)]-\log Z(x) \right] \\ \end{aligned}$

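Note that the partition function $Z (x)$ does not depend on $\pi_{\theta}$ , so it can be treated as a constant and only the KL term needs to be minimized. By Gibbs' inequality, the KL divergence is non-negative and equals zero only when the two distributions coincide, so the optimal solution follows directly: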

$\pi_{\theta}(y|x)=\pi^*(y|x)=\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$
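The closed-form optimum can be sanity-checked on a toy problem. The sketch below assumes a single prompt with only three possible responses; the reference probabilities, rewards, and $\beta$ are arbitrary illustrative values. It builds $\pi^*$ from the formula, then compares the KL-regularized objective at $\pi^*$ against a few other candidate policies.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy over three responses
reward = [1.0, 0.0, 2.0]   # hypothetical rewards r(x, y)

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x).
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(weights)
pi_star = [w / Z for w in weights]


def objective(pi):
    """E_{y~pi}[r(x, y)] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


best = objective(pi_star)
candidates = [pi_ref, [1 / 3, 1 / 3, 1 / 3], [0.1, 0.1, 0.8], [0.05, 0.01, 0.94]]
print(pi_star, best, beta * math.log(Z))
print([objective(pi) for pi in candidates])
```

Every candidate should score strictly below $\pi^*$, and the value attained at $\pi^*$ should equal $\beta\log Z(x)$, which is what the derivation predicts once the KL term vanishes.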

Next we derive the relationship between $r (x, y)$ and $\pi_{\theta}$ . Rearranging the expression above gives:

$\begin{aligned} \exp\left(\frac{1}{\beta}r(x,y)\right)&=\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}\cdot Z(x)\\ r(x,y)&=\beta\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}+\beta \log Z(x) \end{aligned}$

Substituting this expression into the earlier formula for $P(y^+>y^-|x)$ gives:

$\begin{aligned} P(y^+>y^-|x)&=\sigma (r(x,y^+)-r(x,y^-)) \\ &=\sigma\left(\beta\log\frac{\pi_{\theta}(y^+|x)}{\pi_{\text{ref}}(y^+|x)}+\beta \log Z(x)-\beta\log\frac{\pi_{\theta}(y^-|x)}{\pi_{\text{ref}}(y^-|x)}-\beta \log Z(x) \right) \\ &=\sigma\left(\beta\log\frac{\pi_{\theta}(y^+|x)}{\pi_{\text{ref}}(y^+|x)}-\beta\log\frac{\pi_{\theta}(y^-|x)}{\pi_{\text{ref}}(y^-|x)} \right) \\ \end{aligned}$
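The cancellation of $\beta\log Z(x)$ is the key step, so it is worth checking numerically. The following sketch again assumes a toy prompt with three responses and made-up rewards. It computes the preference probability once from the true rewards and once from the policy log-ratios alone; the names `implicit_reward`, `y_pos`, and `y_neg` are illustrative.

```python
import math

beta = 0.5
pi_ref = {"y1": 0.5, "y2": 0.3, "y3": 0.2}   # hypothetical reference policy
reward = {"y1": 1.0, "y2": 0.0, "y3": 2.0}   # hypothetical true rewards

Z = sum(pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref)
pi_star = {y: pi_ref[y] * math.exp(reward[y] / beta) / Z for y in pi_ref}


def implicit_reward(y):
    """beta * log(pi*(y|x) / pi_ref(y|x)), which equals r(x, y) - beta * log Z(x)."""
    return beta * math.log(pi_star[y] / pi_ref[y])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


y_pos, y_neg = "y3", "y1"
p_from_reward = sigmoid(reward[y_pos] - reward[y_neg])
p_from_policy = sigmoid(implicit_reward(y_pos) - implicit_reward(y_neg))
print(p_from_reward, p_from_policy)
```

The two probabilities should match to floating-point precision, since each implicit reward differs from the true reward by the same constant $\beta\log Z(x)$.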

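Taking the negative log-likelihood over the preference data, as in $\mathcal{L}_{\text{RM}}$ , finally gives the DPO objective: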

$\mathcal{L}_{\text{DPO}}=-\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}} \left[ \log\sigma\left(\beta\log\frac{\pi_{\theta}(y^+|x)}{\pi_{\text{ref}}(y^+|x)}-\beta\log\frac{\pi_{\theta}(y^-|x)}{\pi_{\text{ref}}(y^-|x)} \right) \right]$

$\mathcal{L}_{\text{DPO}}$ has almost the same form as $\mathcal{L}_{\text{RM}}$ . In other words, DPO has an implicit reward function of the following form:

$r_{\theta}(x,y)=\beta\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}$

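This also explains the phrase in the title of the DPO paper: "Your Language Model is Secretly a Reward Model".

The DPO procedure can now be summarized:

Initialize $\pi_{\theta},\,\pi_{\text{ref}}$ from $\pi^{\text{SFT}}$ .
For each $x$ , sample a pair of answers $(y_1,y_2)$ from $\pi_{\text{ref}}$ and have human annotators label them, building the human preference dataset $\mathcal{D}=\{x_i,y_i^+,y_i^-\}_{i=1}^N$ offline.
Keep optimizing $\pi_{\theta}$ by minimizing $\mathcal{L}_{\text{DPO}}$ .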

4. A Simple Implementation of DPO

For ease of computation, we make a simple rearrangement of $\mathcal{L}_{\text{DPO}}$ :

$\mathcal{L}_{\text{DPO}}=-\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}} \left[ \log\sigma\left(\beta\log\frac{\pi_{\theta}(y^+|x)}{\pi_{\theta}(y^-|x)}-\beta\log\frac{\pi_{\text{ref}}(y^+|x)}{\pi_{\text{ref}}(y^-|x)} \right) \right]$

A simple implementation:

    import torch.nn.functional as F


    def dpo_loss(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta):
        """
        Compute the simplified DPO loss with sigmoid loss type.

        Args:
            policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
            policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
            ref_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
            ref_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
            beta: Temperature controlling strength of KL penalty

        Returns:
            losses: The DPO loss for each example in the batch.
            chosen_rewards: Rewards for the chosen responses.
            rejected_rewards: Rewards for the rejected responses.
        """
        # Calculate log-ratios
        policy_logratios = policy_chosen_logps - policy_rejected_logps
        ref_logratios = ref_chosen_logps - ref_rejected_logps

        # Compute logits for sigmoid loss
        logits = policy_logratios - ref_logratios

        # Sigmoid loss type
        losses = -F.logsigmoid(beta * logits)

        # Compute rewards (detached: used for logging only, not for gradients)
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()

        return losses, chosen_rewards, rejected_rewards
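The function above takes one log-probability per response, which leaves open how that number is obtained from a language model. A common choice, assumed here, is to sum the log-softmax scores of the response tokens. The sketch below shows this in plain Python on a made-up three-token vocabulary; `sequence_logp` and `dpo_loss_single` are hypothetical helpers that mirror the tensor code for a single pair, and all logits are invented for illustration.

```python
import math


def sequence_logp(token_logits, response_tokens):
    """Sum of log-softmax scores of the response tokens.

    token_logits[t] holds the logits over the vocabulary at response position t,
    and response_tokens[t] is the token id actually present at that position.
    """
    total = 0.0
    for logits, token in zip(token_logits, response_tokens):
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += logits[token] - log_norm
    return total


def dpo_loss_single(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """-log sigmoid(beta * (policy log-ratio - reference log-ratio)) for one pair."""
    z = beta * ((policy_chosen - policy_rejected) - (ref_chosen - ref_rejected))
    # Stable evaluation of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


beta = 0.1
chosen_tokens = [0, 2]
rejected_tokens = [1, 1]

# Reference model: uniform over the vocabulary at every position.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
ref_chosen = sequence_logp(uniform, chosen_tokens)
ref_rejected = sequence_logp(uniform, rejected_tokens)

# Policy identical to the reference: the margin is zero and the loss is log 2.
loss_same = dpo_loss_single(ref_chosen, ref_rejected, ref_chosen, ref_rejected, beta)

# Policy that puts more mass on the chosen tokens and less on the rejected ones.
policy_chosen = sequence_logp([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]], chosen_tokens)
policy_rejected = sequence_logp([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], rejected_tokens)
loss_better = dpo_loss_single(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta)
print(loss_same, loss_better)
```

In a real training loop the same computation would be vectorized over a batch, with prompt and padding positions masked out before summing.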

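5. Gradient Analysis

Differentiating the DPO objective shows how the algorithm updates the parameters of the LLM.

Let $u=\beta\log\frac{\pi_{\theta}(y^+|x)}{\pi_{\text{ref}}(y^+|x)}-\beta\log\frac{\pi_{\theta}(y^-|x)}{\pi_{\text{ref}}(y^-|x)}$ . Using the sigmoid identity $\sigma'(u)=\sigma(u)\sigma(-u)$ , we have: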

$\begin{aligned} \nabla \mathcal{L}_{\text{DPO}}&=-\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}}[\nabla\log\sigma(u)]= -\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}}\left[\frac{\sigma'(u)}{\sigma(u)}\nabla u\right] \\ &=-\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}}\left[ \sigma(-u)\nabla u \right] \\ &=-\beta\,\mathbb{E}_{(x,y^+,y^-)\sim \mathcal{D}}\left[ \sigma(r_{\theta}(x,y^-)-r_{\theta}(x,y^+)) \cdot (\nabla \log \pi_{\theta}(y^+|x) - \nabla \log \pi_{\theta}(y^-|x)) \right] \end{aligned}$

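where $r_{\theta}$ is the implicit reward function introduced above.

This gradient shows that optimization widens the gap between $\log \pi_\theta(y^+|x)$ and $\log \pi_\theta(y^-|x)$ . In other words, training pushes the model toward content that matches human preferences ( $y^+$ ) and away from content that does not ( $y^-$ ).

In addition, the leading factor $\sigma(r_\theta(x,y^-) - r_\theta(x,y^+))$ acts as a coefficient on the gradient that dynamically controls the step size of gradient descent. When the policy model leans toward generating the dispreferred content $y^-$ , the difference $r_\theta(x,y^-) - r_\theta(x,y^+)$ grows, so the step becomes larger and the parameters are updated more aggressively to avoid generating $y^-$ . Conversely, when the policy model leans toward generating the preferred content $y^+$ , its current parameters are already good. The coefficient then shrinks, which reduces the size of the update, prevents oscillation caused by overly large steps, and makes training more stable.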
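Both observations can be checked on a toy policy. The sketch below assumes a softmax policy over three candidate responses, parameterized directly by logits `theta`, with an arbitrary reference distribution. It compares the analytic gradient $-\beta\,\sigma(-u)\,(\nabla\log\pi_\theta(y^+|x)-\nabla\log\pi_\theta(y^-|x))$ , where the factor $\beta$ comes from $\nabla u$ , against a finite-difference estimate, and it prints the weight $\sigma(-u)$ for a policy that currently prefers $y^-$ and for one that prefers $y^+$ . All names and numbers are illustrative.

```python
import math

beta = 0.1
pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy over three responses
y_pos, y_neg = 0, 1        # indices of the preferred / dispreferred response


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def margin(theta):
    """u = r_theta(x, y+) - r_theta(x, y-), with r_theta = beta * log(pi_theta / pi_ref)."""
    pi = softmax(theta)
    return beta * (math.log(pi[y_pos] / pi_ref[y_pos]) - math.log(pi[y_neg] / pi_ref[y_neg]))


def loss(theta):
    return -math.log(sigmoid(margin(theta)))


def analytic_grad(theta):
    """-beta * sigma(-u) * (grad log pi(y+) - grad log pi(y-)) for a softmax policy."""
    pi = softmax(theta)
    weight = sigmoid(-margin(theta))
    grad = []
    for k in range(len(theta)):
        # For softmax logits, d log pi(y) / d theta_k = 1[k == y] - pi[k].
        g_pos = (1.0 if k == y_pos else 0.0) - pi[k]
        g_neg = (1.0 if k == y_neg else 0.0) - pi[k]
        grad.append(-beta * weight * (g_pos - g_neg))
    return grad


def numeric_grad(theta, eps=1e-6):
    """Central finite-difference estimate of the loss gradient."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad


theta_bad = [-2.0, 2.0, 0.0]    # policy currently prefers y-
theta_good = [2.0, -2.0, 0.0]   # policy currently prefers y+

max_err = max(abs(a - b) for a, b in zip(analytic_grad(theta_bad), numeric_grad(theta_bad)))
weight_bad = sigmoid(-margin(theta_bad))
weight_good = sigmoid(-margin(theta_good))
print(max_err, weight_bad, weight_good)
```

The analytic and numeric gradients should agree closely, and the weight should be larger for the policy that prefers $y^-$ , which matches the step-size argument above.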
