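RLHF consists of two key steps:

1. Train the Reward Model (RM).
2. Construct the reward function from the Reward Model and the SFT Model, then train the LLM with the PPO algorithm.

The PPO stage uses the following models:

- frozen RM
- frozen SFT Model
- Actor $\pi_{\Phi}^{RL}$, initialized from the SFT Model
- Critic $V_\eta$, initialized from the RM

Reference: RLHF理论篇 (RLHF Theory)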
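As a minimal sketch of how the reward function in the second step might be assembled from the frozen RM and the frozen SFT Model, the snippet below assumes the commonly used KL-penalized form $R(x, y) = r_\theta(x, y) - \beta \log \frac{\pi_{\Phi}^{RL}(y \mid x)}{\pi^{SFT}(y \mid x)}$. The function name `kl_penalized_rewards`, the default `beta`, and the choice to add the RM score on the last token are illustrative assumptions, not details taken from the source.

```python
from typing import List


def kl_penalized_rewards(
    rm_score: float,
    actor_logprobs: List[float],
    sft_logprobs: List[float],
    beta: float = 0.1,  # assumed KL coefficient, chosen only for illustration
) -> List[float]:
    """Per-token rewards for one sampled response.

    rm_score:        scalar score from the frozen RM for the whole response
    actor_logprobs:  log-probability of each response token under the Actor
    sft_logprobs:    log-probability of the same tokens under the frozen SFT Model
    """
    if len(actor_logprobs) != len(sft_logprobs):
        raise ValueError("actor and SFT log-probs must have the same length")
    if not actor_logprobs:
        raise ValueError("response must contain at least one token")

    # Penalty term: -beta * log(pi_RL / pi_SFT), computed token by token.
    rewards = [-beta * (a - s) for a, s in zip(actor_logprobs, sft_logprobs)]

    # Assumption: the sequence-level RM score is credited to the final token.
    rewards[-1] += rm_score
    return rewards


if __name__ == "__main__":
    r = kl_penalized_rewards(1.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1)
    print(r)
```

Summing the per-token rewards recovers the sequence-level quantity in the formula above. In a PPO loop, the Critic $V_\eta$ would then be used to turn these rewards into advantage estimates for updating the Actor.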