DeepSeek R1: Basic Principles and an Overview of the DeepSeek Paper Series

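Contents

Preface
I. History of DeepSeek R1
II. Introduction to DeepSeek R1
- 1. Overview of R1
- 2. Why R1 succeeded
- 3. The reasoning-model concept
- 4. Self-evolution and the "aha moment" in R1
- 5. Comparison of different approaches
- 6. Training pipeline
- 7. Training stages
- 8. MLA (Multi-head Latent Attention) in R1
- 9. MoE (Mixture-of-Experts) in R1
- 10. MTP (Multi-Token Prediction) in R1
- 11. GRPO (Group Relative Policy Optimization) in R1
III. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2024.1)
- 1. Abstract
- 2. Introduction
IV. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (2024.6)
- 1. Abstract
- 2. Introduction
V. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024.6)
- 1. Abstract
- 2. Introduction
VI. DeepSeek-V3 Technical Report (2024.12)
- 1. Abstract
- 2. Introduction
VII. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025.1)
- 1. Abstract
- 2. Introduction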

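Preface

DeepSeek is an AI company that develops open-source large language models. DeepSeek R1, its reasoning model released in January 2025, drew wide attention shortly after release because of its reasoning performance, which is obtained largely through reinforcement learning. R1 is not the first model in the family: it follows DeepSeek LLM, DeepSeek-Coder-V2, DeepSeek-V2, and DeepSeek-V3, whose papers are covered below. This article briefly walks through the basic principles of R1 and then the DeepSeek series of papers.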

I. History of DeepSeek R1

The timeline below summarizes DeepSeek's development history:

[Figure: DeepSeek development timeline]

For more background, see: https://www.huxiu.com/article/4009260.html

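II. Introduction to DeepSeek R1

1. Overview of R1

The figure below gives an overview:
[Figure: overview of R1]

2. Why R1 succeeded

Earlier attempts did not succeed where R1 did; the figure below summarizes the difference:

[Figure: why R1 succeeded]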

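3. The reasoning-model concept

[Figure: the reasoning-model concept]

4. Self-evolution and the "aha moment" in R1

[Figure: self-evolution and the "aha moment"]

5. Comparison of different approaches

[Figure: comparison of different approaches]

6. Training pipeline

[Figure: training pipeline]

7. Training stages

[Figure: training stages]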

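8. MLA (Multi-head Latent Attention) in R1

[Figure: MLA]
Training:

Inference:

[Figure: MLA]

9. MoE (Mixture-of-Experts) in R1

[Figure: MoE]

10. MTP (Multi-Token Prediction) in R1

[Figure: MTP]

11. GRPO (Group Relative Policy Optimization) in R1

[Figure: GRPO]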
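GRPO, the reinforcement learning algorithm named in this heading, scores each sampled answer relative to the other answers drawn for the same prompt, rather than relying on a separately learned value model. The sketch below shows only that group-normalized advantage, under stated assumptions: rewards are plain scalars, the population standard deviation is used, and the function name and the `eps` guard are hypothetical. The clipped policy-ratio objective and the KL penalty are left out.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled answers.

    rewards: scalar rewards for answers sampled from the same prompt.
    eps:     hypothetical guard against division by zero when every
             answer in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four answers scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above the group mean get a positive advantage and the rest get a negative one, so the policy is pushed toward the better answers within each group.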

III. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (2024.1)

[Figure: DeepSeek LLM paper]

1. Abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling laws described in previous literature present varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate the scaling of large scale models in two prevalently used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and direct preference optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B across a range of benchmarks, especially in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that our DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

In brief: prior work disagrees on scaling laws, so the authors revisit them and use the results to scale two common open-source configurations, 7B and 67B. DeepSeek LLM is pre-trained on a 2-trillion-token dataset that is still growing, then tuned with SFT and DPO to produce the DeepSeek Chat models. DeepSeek LLM 67B surpasses LLaMA-2 70B on a range of benchmarks, most clearly in code, mathematics, and reasoning, and the 67B Chat model outperforms GPT-3.5 in open-ended evaluations.
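The abstract names direct preference optimization (DPO) as the last tuning stage. As a rough illustration of what that objective computes, here is a minimal sketch of the DPO loss for a single preference pair, following the published formulation (Rafailov et al., 2023). The function name, argument names, and the default `beta` are assumptions for illustration; this is not DeepSeek's training code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    beta is a hypothetical default, not a value from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls as the policy raises the likelihood of the chosen response, relative to the reference model, by more than it raises the likelihood of the rejected one.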

2. Introduction

Over the past few years, Large Language Models (LLMs) based on decoder-only Transformers (Vaswani et al., 2017) have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). By predicting the next word in continuous text, LLMs undergo self-supervised pre-training on massive datasets, enabling them to achieve various purposes and possess many abilities, such as novel creation, text summarization, code completion, and more. Subsequent developments like supervised fine-tuning and reward modeling have enabled Large Language Models (LLMs) to better follow user intentions and instructions. This has endowed them with more versatile conversational capabilities and rapidly expanded their influence.
In brief: decoder-only Transformer LLMs, pre-trained in a self-supervised way to predict the next word over massive datasets, have become the main path toward AGI. Supervised fine-tuning and reward modeling then made them better at following user intent and instructions, which broadened their conversational abilities and their reach.

This wave is sparked with closed products, such as ChatGPT (OpenAI, 2022), Claude (Anthropic, 2023), and Bard (Google, 2023), which are developed with extensive computational resources and substantial annotation costs. These products have significantly raised the community’s expectations for the capabilities of open-source LLMs, consequently inspiring a series of work (Bai et al., 2023; Du et al., 2022; Jiang et al., 2023; Touvron et al., 2023a,b; Yang et al., 2023). Among these, the LLaMA series models (Touvron et al., 2023a,b) stand out. It consolidates a range of works to create an efficient and stable architecture, building well-performing models ranging from 7B to 70B parameters. Consequently, the LLaMA series has become the de facto benchmark for architecture and performance among open-source models.
In brief: closed products such as ChatGPT, Claude, and Bard, built with large compute and annotation budgets, raised expectations for open-source LLMs and prompted a series of open efforts. Among them, the LLaMA series (7B to 70B parameters) consolidated earlier work into an efficient, stable architecture and became the de facto reference for architecture and performance among open-source models.

Following LLaMA, the open-source community has primarily focused on training fixed-size (7B, 13B, 34B, and 70B), high-quality models, often neglecting research exploration into LLM scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020). Nonetheless, research on scaling laws is of utmost importance, considering that the current open-source models are merely at the initial stage of Artificial General Intelligence (AGI) development. In addition, early works (Hoffmann et al., 2022; Kaplan et al., 2020) reached varying conclusions on the scaling of model and data with increased compute budgets and inadequately addressed hyperparameter discussions. In this paper, we extensively investigate the scaling behavior of language models and apply our findings in two widely used large-scale model configurations, namely 7B and 67B. Our study aims to lay the groundwork for future scaling of open-source LLMs, paving the way for further advancements in this domain. Specifically, we first examined the scaling laws of batch size and learning rate, and found their trends with model size. Building on this, we conducted a comprehensive study of the scaling laws of the data and model scale, successfully revealing the optimal model/data scaling-up allocation strategy and predicting the expected performance of our large-scale models. Additionally, during development, we discovered that the scaling laws derived from different datasets show significant differences. This suggests that choice of dataset remarkably affects the scaling behavior, indicating that caution should be exercised when generalizing scaling laws across datasets.
In brief: after LLaMA, open-source work concentrated on training fixed-size models (7B, 13B, 34B, 70B) and largely neglected scaling laws, while the earlier studies (Kaplan et al., 2020; Hoffmann et al., 2022) disagreed on how to scale model and data with compute and said little about hyperparameters. The paper first studies how batch size and learning rate scale and how they trend with model size, then how to divide a growing budget between model and data, and uses the result to predict the performance of its 7B and 67B models. It also finds that scaling laws derived from different datasets differ significantly, so they should be generalized across datasets with caution.
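Scaling-law studies of this kind usually fit a power law to measurements taken at small scale and extrapolate it to larger scale. The sketch below shows that fitting step only, as an ordinary least-squares fit of y = a * x^b in log-log space. The data points are synthetic and generated inside the example; the function name, variable names, and constants are hypothetical and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope = exponent of the power law
    a = math.exp(mean_y - b * mean_x)  # intercept = log of the coefficient
    return a, b


# Synthetic example: a scale measure (for instance model size or compute)
# against a best-found hyperparameter, generated from y = 2 * x**0.5.
scale = [1e6, 1e7, 1e8, 1e9]
best_value = [2.0 * s ** 0.5 for s in scale]

a, b = fit_power_law(scale, best_value)
extrapolated = a * (1e10 ** b)  # prediction at a scale that was not measured
```

With real measurements the points do not lie exactly on a line, and, as the paragraph above warns, the fitted exponent can change with the dataset, so an extrapolation like this carries over only to comparable data.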

Under the guidance of our scaling laws, we build from scratch open-source large language models, and release as much information as possible for community reference. We collect 2 trillion tokens for pre-training, primarily in Chinese and English. At the model level, we generally followed the architecture of LLaMA, but replaced the cosine learning rate scheduler with a multi-step learning rate scheduler, maintaining performance while facilitating continual training. We collected over 1 million instances for supervised fine-tuning (SFT) (Ouyang et al., 2022) from diverse sources. This paper shares our experiences with different SFT strategies and findings in data ablation techniques. Additionally, we have utilized direct preference optimization (DPO) (Rafailov et al., 2023) to improve the conversational performance of the model.
In brief: the models are built from scratch under the guidance of these scaling laws, with as much detail released as possible. Pre-training uses 2 trillion tokens, mainly Chinese and English. The architecture largely follows LLaMA, but a multi-step learning rate scheduler replaces the cosine scheduler, which keeps performance while making continual training easier. Over 1 million SFT instances were collected from diverse sources, and DPO is applied to improve conversational performance.
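To make the scheduler swap concrete, the sketch below puts a multi-step schedule next to a cosine schedule. All function names, the warmup length, the milestone fractions, and the decay factors are illustrative assumptions; the text above does not state the values DeepSeek used.

```python
import math


def multi_step_lr(step, total_steps, max_lr, warmup_steps=100,
                  milestones=((0.8, 0.316), (0.9, 0.1))):
    """Piecewise-constant schedule: linear warmup, then step-wise drops.

    milestones holds (fraction_of_training, lr_multiplier) pairs in
    increasing order; the values are assumptions for illustration.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    scale = 1.0
    for start_fraction, factor in milestones:
        if progress >= start_fraction:
            scale = factor
    return max_lr * scale


def cosine_lr(step, total_steps, max_lr, warmup_steps=100, min_ratio=0.1):
    """Cosine decay from max_lr to min_ratio * max_lr after linear warmup."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * t)))
```

One plausible reading of "facilitating continual training" is that the multi-step rate stays at `max_lr` until the first milestone, so a checkpoint saved in that phase can be resumed under a longer token budget without having followed a curve tied to the original training length. Under the cosine schedule the rate at every step depends on `total_steps`.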

We conduct extensive evaluations using our base and chat models. The evaluation results demonstrate that DeepSeek LLM surpasses LLaMA-2 70B across various benchmarks, particularly in the fields of code, mathematics, and reasoning. Following SFT and DPO, the DeepSeek 67B chat model outperforms GPT-3.5 in both Chinese and English open-ended evaluations. This highlights the superior performance of DeepSeek 67B in generating high-quality responses and engaging in meaningful conversations in both languages. Furthermore, the safety evaluation indicates that DeepSeek 67B Chat can provide harmless responses in practice.
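In brief: DeepSeek LLM surpasses LLaMA-2 70B on various benchmarks, particularly in code, mathematics, and reasoning. After SFT and DPO, the DeepSeek 67B chat model outperforms GPT-3.5 in both Chinese and English open-ended evaluations, and the safety evaluation indicates that it gives harmless responses in practice.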

In the rest of this paper, we first introduce our pre-training basic concepts of DeepSeek LLM in Section 2, including the composition of data, model architecture, infrastructure, and hyperparameters. In Section 3, we provide a detailed explanation of the scaling laws we have discovered and its implications. Additionally, we discuss the rationale behind our selection of pre-training hyperparameters, taking into account the insights gained from the scaling laws analysis. In Section 4, we discuss our fine-tuning methodology, encompassing the composition of fine-tuning data and specific methods during the SFT and DPO stages. We then present the detailed evaluation results of DeepSeek LLM in Section 5, covering both the base and chat models, as well as their performance in open-ended evaluations and safety evaluations. Finally, we discuss the current limitations and future directions of DeepSeek LLM in Section 6.
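In brief: Section 2 covers pre-training (data composition, model architecture, infrastructure, and hyperparameters); Section 3 covers the scaling laws and how they inform the choice of pre-training hyperparameters; Section 4 covers the fine-tuning data and the SFT and DPO stages; Section 5 reports evaluation results for the base and chat models, including open-ended and safety evaluations; Section 6 discusses current limitations and future directions.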

IV. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (2024.6)

[Figure: DeepSeek-Coder-V2 paper]

1. Abstract

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general langua