[Learn YOLO with Me] (1) YOLO12: Attention-Centric Object Detection

Welcome to the "Learn YOLO with Me" series.

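- 0. Introduction to YOLOv12
  - 0.1 YOLO12 paper download
  - 0.2 Main improvements in YOLO12
  - 0.3 Tasks and performance supported by YOLO12
  - 0.4 Paper abstract
- 1. Background
- 2. Related work
- 3. Method
  - 3.1 Efficiency analysis
  - 3.2 Area attention
  - 3.3 Residual efficient layer aggregation networks
  - 3.4 Architectural improvements
- 4. Experiments
  - 4.1 Experimental setup
  - 4.2 Comparison with SOTA methods
  - 4.3 Ablation studies
  - 4.4 Speed comparison
  - 4.5 Diagnosis and visualization
- 5. Conclusion
- 6. Miscellaneous
  - 6.1 Limitations
  - 6.2 More details
  - 6.3 References
- 7. Implementation
  - 7.1 Implementing area attention
  - 7.2 Implementing the residual efficient layer aggregation network
  - 7.3 YOLO12 network architecture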

0. Introduction to YOLOv12

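0.1 YOLO12 paper download

On February 19, 2025, Yunjie Tian and colleagues from the University at Buffalo (USA) and the University of Chinese Academy of Sciences (China) posted the paper "YOLOv12: Attention-Centric Real-Time Object Detectors" on arXiv.

YOLO12 introduces an attention-centric architecture. It departs from the traditional CNN-based approach of earlier YOLO models while retaining the real-time inference speed that many applications require. Through novel methodological changes to the attention mechanism and the overall network architecture, the model achieves state-of-the-art detection accuracy while maintaining real-time performance.

YOLO12 supports a range of core computer vision tasks: object detection, instance segmentation, image classification, pose estimation, and oriented object detection (OBB). Compared with YOLOv10 and YOLO11, YOLO12 is more efficient and flexible to deploy.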

Paper: tian2025yolov12
Official documentation: ultralytics-YOLOv12
GitHub: Github-yolov12
Citation:

Tian Y, Ye Q, Doermann D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv preprint arXiv:2502.12524, 2025.

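0.2 Main improvements in YOLO12

Enhanced feature extraction:
- Area attention: handles large receptive fields efficiently and lowers computational cost.
- Optimized balance: improves the balance between attention and feed-forward network computation.
- R-ELAN: strengthens feature aggregation with the R-ELAN architecture.
Optimization innovations:
- Residual connections: adds residual connections with scaling to stabilize training, especially in larger models.
- Refined feature integration: applies an improved feature integration method inside R-ELAN.
- FlashAttention: adopts FlashAttention to reduce memory access overhead.
Architectural efficiency:
- Fewer parameters: reduces the parameter count relative to many earlier models while maintaining or improving accuracy.
- Streamlined attention: uses a simplified attention implementation that avoids positional encoding.
- Optimized MLP ratio: adjusts the MLP ratio to allocate compute more effectively.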

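0.3 Tasks and performance supported by YOLO12

Supported tasks and modes
YOLO12 supports a variety of computer vision tasks. The table below lists the supported tasks and the operating modes enabled for each (inference, validation, training, and export):

[Table: tasks and modes supported by YOLO12]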

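Performance

Compared with the fastest earlier YOLO models, YOLO12 trades off some speed but improves accuracy significantly at every model scale. The quantitative detection results on the COCO validation set (COCO val2017) are as follows:

[Table: YOLO12 detection results on COCO val2017]
Notes:
Inference speed is measured on an NVIDIA T4 GPU with TensorRT at FP16 precision.
The comparison shows the relative mAP improvement and the percentage change in speed (positive values mean faster, negative values mean slower). Comparisons are made against the published results of YOLOv10, YOLO11, and RT-DETR where available.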

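0.4 Paper abstract

Enhancing the network architecture of the YOLO framework has long been crucial. However, although attention mechanisms have proven superior in modeling capability, prior work has focused on CNN-based improvements, because attention-based models could not match the speed of CNN-based ones. This paper proposes an attention-centric YOLO framework, YOLOv12, that matches the speed of previous CNN-based models while harnessing the performance benefits of attention.

YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLOv11-N by 2.1% / 1.2% mAP at comparable speed. This advantage extends to the other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve on DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S runs 42% faster than RT-DETR-R18 / RT-DETRv2-R18 while using only 36% of the computation and 45% of the parameters.
Figure 1 shows further comparisons.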

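1. Background

Real-time object detection has consistently attracted attention because its low latency gives it substantial practical value [4, 17, 24, 28]. Within this area, the YOLO series [3, 24, 28, 29, 32, 45–47, 53, 57, 58] has established an optimal balance between latency and accuracy and dominates the field. Although YOLO improvements have addressed areas such as loss functions [8, 35, 43, 44, 48, 67, 68] and label assignment [22, 23, 34, 59, 69], network architecture design remains a key research direction [24, 28, 32, 57, 58].

Although attention-centric vision transformer (ViT) architectures have been shown to have stronger modeling capability, even in small models [20, 21, 25, 50], most architecture designs still focus on CNNs. The root cause is the inefficiency of attention, which stems from two factors: quadratic computational complexity and inefficient memory access operations (the latter is the main issue addressed by FlashAttention [13, 14]). As a result, under a similar computational budget, CNN-based architectures outperform attention-based ones by a factor of about 3 [38], which severely limits the adoption of attention in YOLO systems that demand high inference speed.

This paper aims to address these challenges and build an attention-centric YOLO framework, YOLOv12.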
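We propose three key improvements:

A simple and efficient area attention module (A2): it reduces the computational complexity of attention in a very simple way while keeping a large receptive field, which improves speed.
Residual efficient layer aggregation networks (R-ELAN): to address the optimization difficulties introduced by attention (mainly in large-scale models), R-ELAN improves on the original ELAN [57] with:
(1) a block-level residual design with a scaling technique;
(2) a redesigned feature aggregation method.
Architectural adaptations beyond vanilla attention: adopting FlashAttention to overcome the memory access problem, removing positional encoding to simplify the model, changing the MLP ratio from 4 to 1.2 to balance computation between attention and the feed-forward network, reducing the depth of stacked blocks to ease optimization, and using convolution operators wherever possible for their computational efficiency.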

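Based on these designs, we develop a family of real-time detectors at five scales: YOLOv12-N, S, M, L, and X.

Under the experimental settings of YOLOv11 [28] and without additional tricks, experiments on standard object detection benchmarks show that YOLOv12 significantly improves the latency-accuracy and FLOPs-accuracy trade-offs at every scale (see Figure 1).

For example:
(1) YOLOv12-N reaches 40.6% mAP, surpassing YOLOv10-N [53] by 2.1% mAP with faster inference and YOLOv11-N [28] by 1.2% mAP at comparable speed;
(2) compared with RT-DETR-R18 [66] / RT-DETRv2-R18 [40], YOLOv12-S achieves 1.5% / 0.1% higher mAP, runs 42% / 42% faster, and needs only 36% / 36% of the computation and 45% / 45% of the parameters.

Contributions of this paper:

We build a simple, efficient attention-centric YOLO framework that, through methodological innovation and architectural improvement, breaks the dominance of CNN models in the YOLO series;
Without relying on additional techniques such as pretraining, YOLOv12 achieves state-of-the-art results with fast inference and higher detection accuracy, demonstrating its potential.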

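2. Related work

Real-time object detectors

Real-time object detectors have long drawn attention from the community because of their practical value. The YOLO series [3, 9, 24, 28, 29, 32, 45–47, 53, 54, 57, 58] has emerged as the leading framework for real-time object detection. The early YOLO systems [45–47] laid the foundation of the series, mainly from a model design perspective. YOLOv4 [3] and YOLOv5 [29] added CSPNet [55], data augmentation, and multi-scale features to the framework. YOLOv6 [32] further improved the backbone and neck with the BiC and SimCSPSPPF modules and adopted anchor-aided training. YOLOv7 [57] introduced E-ELAN [56] (efficient layer aggregation networks) to improve gradient flow, together with various bag-of-freebies. YOLOv8 [24] integrated the efficient C2f block for better feature extraction. More recently, YOLOv9 [58] introduced GELAN for architecture optimization and PGI for training improvement, while YOLOv10 [53] improved efficiency with NMS-free training based on dual assignments. YOLOv11 [28] further reduced latency and raised accuracy by adopting the C3K2 module (a variant of GELAN [58]) and lightweight depthwise separable convolutions in the detection head.

Recently, the end-to-end detector RT-DETR [66] improved on traditional end-to-end detectors [7, 33, 37, 42, 71] to meet real-time requirements by designing an efficient encoder and an uncertainty-minimal query selection mechanism. RT-DETRv2 [40] further enhanced it with bag-of-freebies. Unlike previous YOLO models, this work aims to build an attention-centric YOLO framework that exploits the strengths of attention.

Efficient vision transformers
Reducing the computational cost of global self-attention is crucial for applying vision transformers effectively to downstream tasks. PVT [61] addresses this with multi-resolution stages and downsampled features. Swin Transformer [39] restricts self-attention to local windows and shifts the window partitioning to connect non-overlapping windows, balancing communication needs against memory and computation demands. Other methods, such as axial self-attention [26] and criss-cross attention [27], compute attention within horizontal and vertical windows. CSWin Transformer [16] builds on this with cross-shaped window self-attention, computing attention along horizontal and vertical stripes in parallel. In addition, some works [12, 64] establish local-global relations that reduce reliance on global self-attention and thereby improve efficiency. Fast-iTPN [50] improves downstream inference speed with token migration and token gathering mechanisms. Some approaches [31, 49, 60, 62] use linear attention to reduce the complexity of attention. Although Mamba-based vision models [38, 70] aim for linear complexity, they still fall short of real-time speed [38].

FlashAttention [13, 14] identifies the high-bandwidth memory bottleneck that makes attention computation inefficient and reduces memory access through I/O optimization, which improves computational efficiency. In this work, we discard complex designs and propose a simple area attention mechanism to reduce the complexity of attention. We also employ FlashAttention to overcome the inherent memory access problem of attention [13, 14].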

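3. Method

This section presents the innovations of the YOLOv12 framework from the perspectives of network architecture and the attention mechanism.

3.1 Efficiency analysis

Attention is highly effective at capturing global dependencies and supports tasks in natural language processing [5, 15] and computer vision [19, 39], but it is inherently slower than convolutional neural networks (CNNs). Two main factors cause this speed gap: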

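Complexity
First, the computational complexity of self-attention grows quadratically with the input sequence length $L$. Specifically, for an input sequence of length $L$ and feature dimension $d$, computing the attention matrix requires $O(L^2 d)$ operations, because every token interacts with every other token. In contrast, the complexity of a convolution in a CNN is linear in the spatial or temporal dimension, namely $O(kLd)$, where $k$ is the kernel size and is typically much smaller than $L$. Self-attention therefore becomes computationally expensive, especially for large inputs such as high-resolution images or long sequences.
Another important factor is that most attention-based vision transformers gradually accumulate speed overhead through their complex designs (for example, window partitioning and reversing in Swin Transformer [39]) and additional modules (for example, positional encoding), which makes them slower overall than CNN architectures [38].
The modules designed in this paper implement attention with simple, clean operations to keep efficiency as high as possible.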
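To make the gap between $O(L^2 d)$ and $O(kLd)$ concrete, here is a minimal sketch that counts operations for both terms. The function names and the sample values of `dim`, `kernel`, and the sequence lengths are illustrative assumptions, not figures from the paper, and constant factors are ignored.

```python
def attention_ops(seq_len: int, dim: int) -> int:
    """Order-of-magnitude cost of the attention matrix: L^2 * d."""
    return seq_len * seq_len * dim


def conv_ops(seq_len: int, dim: int, kernel: int) -> int:
    """Order-of-magnitude cost of a convolution: k * L * d."""
    return kernel * seq_len * dim


if __name__ == "__main__":
    dim, kernel = 64, 3  # illustrative values, not taken from the paper
    for seq_len in (400, 1600, 6400):
        ratio = attention_ops(seq_len, dim) / conv_ops(seq_len, dim, kernel)
        print(f"L={seq_len:5d}  attention/conv cost ratio = {ratio:.1f}")
```

The ratio simplifies to $L / k$, so under these assumptions the relative cost of attention grows linearly with the number of tokens.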
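Computation
Second, memory access patterns during attention computation are less efficient than in CNNs [13, 14]. Specifically, during self-attention, intermediate maps such as the attention map ($QK^T$) and the softmax map (of size $L \times L$) must be written from high-speed GPU SRAM (where the computation actually happens) to high-bandwidth GPU memory (HBM) and read back later. SRAM reads and writes are more than 10 times faster than HBM, so this traffic causes significant memory access overhead and longer wall-clock time. FlashAttention [13, 14] addresses this problem and is adopted directly in our model design. In addition, the irregular memory access patterns of attention add further latency, whereas CNNs use structured, localized memory access. With spatially constrained kernels, CNNs benefit from efficient memory caching and lower latency thanks to their fixed receptive fields and sliding-window operations.

Together, these two factors, quadratic computational complexity and inefficient memory access, make attention slower than CNNs, especially in real-time or resource-constrained scenarios. Addressing these limitations has become a key research area; sparse attention mechanisms and memory-efficient approximations (such as Linformer [60] or Performer [11]) aim to mitigate the quadratic scaling.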

3.2 Area attention

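A simple way to reduce the cost of vanilla attention is linear attention [49, 60], which lowers the complexity of attention from quadratic to linear. For a visual feature $f$ with dimensions $(n, h, d)$, where $n$ is the number of tokens, $h$ is the number of heads, and $d$ is the head size, linear attention reduces the complexity from $2n^2hd$ to $2nhd^2$, which lowers the computational cost because $n > d$. However, linear attention suffers from degraded global dependency modeling [30], instability [11], and distribution sensitivity [63]. Moreover, because of the low-rank bottleneck [2, 10], its speed advantage is limited when applied to YOLO with a 640×640 input resolution.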
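The two cost expressions can be compared numerically. The sketch below simply evaluates $2n^2hd$ and $2nhd^2$; the sample values (a 40×40 feature map, 4 heads, head size 32) are assumptions chosen for illustration, not settings reported in the paper.

```python
def softmax_attention_cost(n: int, h: int, d: int) -> int:
    """Cost of vanilla attention, 2 * n^2 * h * d, as given in the text."""
    return 2 * n * n * h * d


def linear_attention_cost(n: int, h: int, d: int) -> int:
    """Cost of linear attention, 2 * n * h * d^2, as given in the text."""
    return 2 * n * h * d * d


if __name__ == "__main__":
    n, h, d = 40 * 40, 4, 32  # hypothetical token count, heads, head size
    vanilla = softmax_attention_cost(n, h, d)
    linear = linear_attention_cost(n, h, d)
    print(f"vanilla: {vanilla:,}  linear: {linear:,}  ratio: {vanilla / linear:.0f}x")
```

The ratio of the two costs is $n / d$, which is why the saving only materializes when the token count exceeds the head size.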

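Another effective way to reduce complexity is local attention (for example, shifted window attention [39], criss-cross attention [27], and axial attention [16]), shown in Figure 2, which turns global attention into local attention and thereby lowers the computational cost. However, partitioning the feature map into windows can introduce overhead or shrink the receptive field, which affects both speed and accuracy.

In this work, we propose a simple yet efficient area attention module. As shown in Figure 2, a feature map of resolution $(H, W)$ is divided into $l$ segments of size $(H / l, W)$ or $(H, W / l)$. This removes explicit window partitioning and needs only a simple reshape operation, so it runs faster. We empirically set the default value of $l$ to 4, which reduces the receptive field to 1/4 of the original while still keeping it large.

With this approach, the computational cost of attention drops from $2n^2hd$ to $\frac{1}{2}n^2hd$. Although the complexity is still $n^2$, this is efficient enough to meet the real-time requirements of a YOLO system when $n$ is fixed by the 640×640 input resolution ($n$ grows if the input resolution increases). Interestingly, we find that this modification has only a slight impact on performance while significantly improving speed.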
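The partitioning and the resulting cost can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: tokens are stored row by row, areas are horizontal stripes of size $(H / l, W)$, and the helper names and sample sizes are hypothetical.

```python
def split_into_areas(tokens, num_areas):
    """Split a row-major token list into equal contiguous areas (horizontal stripes)."""
    if len(tokens) % num_areas:
        raise ValueError("token count must be divisible by the number of areas")
    size = len(tokens) // num_areas
    return [tokens[i * size:(i + 1) * size] for i in range(num_areas)]


def vanilla_cost(n, h, d):
    """Global attention cost from the text: 2 * n^2 * h * d."""
    return 2 * n * n * h * d


def area_attention_cost(n, h, d, num_areas):
    """Each of the l areas attends within n / l tokens: l * 2 * (n / l)^2 * h * d."""
    per_area = n // num_areas
    return num_areas * 2 * per_area * per_area * h * d


if __name__ == "__main__":
    height, width, l = 4, 4, 4
    # Token (r, c) sits at row r, column c of a 4x4 feature map.
    tokens = [(r, c) for r in range(height) for c in range(width)]
    for area in split_into_areas(tokens, l):
        print(area)  # with H = 4 and l = 4, each area is one row
    n, h, d = 1600, 4, 32  # hypothetical sizes
    print(area_attention_cost(n, h, d, l) / vanilla_cost(n, h, d))
```

With $l = 4$ the cost is one quarter of $2n^2hd$, which matches the $\frac{1}{2}n^2hd$ stated above. Splitting a row-major token list into contiguous chunks is exactly what a plain reshape does, which is why no explicit window partitioning is needed.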

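3.3 Residual efficient layer aggregation networks

Efficient layer aggregation networks (ELAN) [57] were designed to improve feature aggregation. As shown in Figure 3(b), ELAN splits the output of a transition layer (a 1×1 convolution) into two parts, processes one part through multiple modules, then concatenates all outputs and applies another transition layer (a 1×1 convolution) to align the dimensions. However, as analyzed in [57], this architecture can introduce instability. We argue that the design blocks gradients and lacks a residual connection from input to output. Building the network around attention adds further optimization challenges. Empirically, L- and X-scale models either fail to converge or remain unstable, even with the Adam or AdamW optimizers.

To solve this problem, we propose residual efficient layer aggregation networks (R-ELAN), shown in Figure 3(d). We add a residual shortcut from input to output across the whole block, together with a scaling factor (0.01 by default). This design resembles layer scaling [52], which was introduced to build deep vision transformers. However, applying layer scaling to every area attention does not solve the optimization problem and adds latency. This suggests that introducing attention is not the only cause of the convergence problem and that the ELAN architecture itself contributes, which supports the rationale behind R-ELAN.

We also design a new aggregation method, shown in Figure 3(d). The original ELAN layer passes the module input through a transition layer and splits the result into two parts; one part is processed further by subsequent blocks, and the two parts are finally concatenated to form the output. In contrast, our design applies a transition layer to adjust the channel dimension and produce a single feature map, which is then processed by the subsequent blocks and concatenated, forming a bottleneck structure. This approach preserves the original feature integration capability while reducing computational cost and parameter and memory usage.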
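The data flow described above can be sketched with plain Python lists standing in for feature maps. This is a toy model, not the actual module: the function names are hypothetical, and an element-wise sum stands in for the concatenation followed by a 1×1 convolution.

```python
def r_elan_forward(x, transition_in, blocks, transition_out, gamma=0.01):
    """Toy R-ELAN: aggregate block outputs, then add a scaled residual."""
    feats = [transition_in(x)]          # single feature map, no channel split
    for block in blocks:
        feats.append(block(feats[-1]))  # each block consumes the previous output
    fused = transition_out(feats)       # stand-in for concat + 1x1 convolution
    # Residual shortcut from input to output, scaled by gamma (0.01 by default).
    return [xi + gamma * fi for xi, fi in zip(x, fused)]


def identity(values):
    return list(values)


def double(values):
    return [2.0 * v for v in values]


def elementwise_sum(feats):
    return [sum(vals) for vals in zip(*feats)]


if __name__ == "__main__":
    out = r_elan_forward([1.0, 2.0], identity, [double, double], elementwise_sum)
    print(out)  # close to the input, because gamma keeps the new branch small
```

Because the scaling factor is small, the block starts out close to an identity mapping, which is one plausible reading of why it helps the larger models converge.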

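3.4 Architectural improvements

This section introduces the overall architecture and several improvements over the vanilla attention mechanism. Some of them were not first proposed by us.

Many attention-centric vision transformers use a plain-style architecture [1, 18, 19, 21, 25, 51], whereas we keep the hierarchical design of previous YOLO systems [3, 24, 28, 29, 32, 45–47, 53, 57, 58] and will show that it is necessary. We remove the three stacked blocks in the last stage of the backbone, a design found in recent versions [24, 28, 53, 58], and keep a single R-ELAN block, which reduces the total number of blocks and helps optimization. We inherit the first two backbone stages from YOLOv11 [28] and do not use R-ELAN there.

We also modify several default settings of the vanilla attention mechanism to better suit the YOLO system. The modifications are:

Changing the MLP ratio from 4 to 1.2 (or to 2 for the N/S/M-scale models) to allocate computation better and improve performance;
Using nn.Conv2d+BN instead of nn.Linear+LN to take full advantage of the efficiency of convolution operators;
Removing positional encoding and introducing a large separable convolution (7×7), called the position perceiver, to help area attention perceive positional information.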
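To see what the MLP ratio change means in practice, the sketch below computes the hidden width and the weight count of a two-layer MLP for several ratios. The rounding rule `int(dim * mlp_ratio)`, the channel width of 128, and the omission of biases and normalization parameters are assumptions made for illustration.

```python
def mlp_hidden_dim(dim: int, mlp_ratio: float) -> int:
    """Hidden width of the MLP; truncation toward zero is an assumed rounding rule."""
    return int(dim * mlp_ratio)


def mlp_weight_count(dim: int, mlp_ratio: float) -> int:
    """Weights of two 1x1 layers (dim -> hidden -> dim), ignoring bias and BN."""
    hidden = mlp_hidden_dim(dim, mlp_ratio)
    return dim * hidden + hidden * dim


if __name__ == "__main__":
    dim = 128  # hypothetical channel width
    for ratio in (4.0, 2.0, 1.2):
        print(f"ratio={ratio}: hidden={mlp_hidden_dim(dim, ratio)}, "
              f"weights={mlp_weight_count(dim, ratio):,}")
```

Lowering the ratio from 4 to 1.2 cuts the MLP weights to roughly 30% in this toy setting, which frees budget for the attention part of the block.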

The effectiveness of these modifications is validated in Section 4.5.

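4. Experiments

This section has four parts: the experimental setup, a systematic comparison with popular methods, ablation studies that validate our approach, and an analysis with visualizations that explores YOLOv12 further.

4.1 Experimental setup

We validate the proposed method on the MSCOCO 2017 dataset [36]. The YOLOv12 family includes five scales: YOLOv12-N, YOLOv12-S, YOLOv12-M, YOLOv12-L, and YOLOv12-X. All models are trained for 600 epochs with the SGD optimizer and an initial learning rate of 0.01, the same as YOLOv11 [28]. We use a linear learning rate decay schedule with a linear warm-up over the first 3 epochs. Following [53, 66], the latency of all models is measured on a T4 GPU with TensorRT FP16.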
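The schedule can be sketched as a per-epoch function. This is a simplified illustration rather than the training code: the final learning rate of 1×10⁻⁴ comes from the fine-tuning details in Section 6.2, while the warm-up starting point and the per-epoch (instead of per-iteration) granularity are assumptions.

```python
def learning_rate(epoch, total_epochs=600, lr0=1e-2, lr_final=1e-4, warmup_epochs=3):
    """Linear warm-up to lr0, then linear decay to lr_final at the last epoch."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        # Assumed warm-up shape: ramp linearly and reach lr0 at the last warm-up epoch.
        return lr0 * (epoch + 1) / warmup_epochs
    decay_steps = max(1, total_epochs - 1 - warmup_epochs)
    progress = (epoch - warmup_epochs) / decay_steps
    return lr0 + (lr_final - lr0) * progress


if __name__ == "__main__":
    for epoch in (0, 1, 2, 3, 300, 599):
        print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.6f}")
```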

Baseline

We choose YOLOv11 [28] as the baseline and follow its model scaling strategy. We use its C3K2 block (a special case of GELAN [58]) and no tricks beyond those of YOLOv11 [28].

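4.2 Comparison with SOTA methods

Table 1 compares YOLOv12 with other popular real-time detectors.

YOLOv12-N:
YOLOv12-N outperforms YOLOv6-3.0-N [32], YOLOv8-N [24], YOLOv10-N [53], and YOLOv11-N [28] by 3.6%, 3.3%, 2.1%, and 1.2% mAP, respectively, with similar or lower computation and parameter counts, and reaches a fast latency of 1.64 ms per image.
YOLOv12-S:
With 21.4G FLOPs and 9.3M parameters, YOLOv12-S reaches 48.0 mAP at a latency of 2.61 ms per image. It outperforms YOLOv8-S [24], YOLOv9-S [58], YOLOv10-S [53], and YOLOv11-S [28] by 3.0%, 1.2%, 1.7%, and 1.1% mAP, respectively, with similar or lower computation. Compared with the end-to-end detectors RT-DETR-R18 [66] / RT-DETRv2-R18 [41], YOLOv12-S is competitive in accuracy while running faster with lower computational cost and fewer parameters.
YOLOv12-M:
With 67.5G FLOPs and 20.2M parameters, YOLOv12-M reaches 52.5 mAP at 4.86 ms per image. It holds a clear advantage over Gold-YOLO-M [54], YOLOv8-M [24], YOLOv9-M [58], YOLOv10 [53], YOLOv11 [28], and RT-DETR-R34 [66] / RT-DETRv2-R34 [40].
YOLOv12-L:
YOLOv12-L surpasses YOLOv10-L [53] while using 31.4G fewer FLOPs. It outperforms YOLOv11 [28] by 0.4% mAP with comparable FLOPs and parameters. YOLOv12-L also surpasses RT-DETR-R50 [66] / RT-DETRv2-R50 [41] with faster speed, fewer FLOPs (34.6%), and fewer parameters (37.1%).
YOLOv12-X:
YOLOv12-X outperforms YOLOv10-X [53] and YOLOv11-X [28] by 0.8% and 0.6% mAP, respectively, with comparable speed, FLOPs, and parameters. YOLOv12-X again surpasses RT-DETR-R101 [66] / RT-DETRv2-R101 [40] with faster speed, fewer FLOPs (23.4%), and fewer parameters (22.2%).

Notably, if the L- and X-scale models are evaluated at FP32 precision (which requires saving the models separately in FP32 format), YOLOv12 gains about 0.2% mAP. YOLOv12-L/X would then reach 53.9% / 55.4% mAP.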

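4.3 Ablation studies

R-ELAN

Table 2 evaluates the proposed residual efficient layer aggregation networks (R-ELAN) on the YOLOv12-N/L/X models. The results reveal two key findings:

(1) For small models such as YOLOv12-N, the residual connection does not affect convergence but lowers performance. For larger models (YOLOv12-L/X), it is essential for stable training. In particular, YOLOv12-X needs a very small scaling factor (0.01) to converge.
(2) The proposed feature integration method reduces the model's FLOPs and parameter count while keeping performance comparable, with only a marginal drop.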

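Area attention

We run ablation experiments to validate area attention; Table 3 reports the results. The evaluation covers the YOLOv12-N/S/X models, with inference speed measured on both GPU (CUDA) and CPU. CUDA results are obtained on an RTX 3080 and an A5000, and CPU performance is measured on an Intel Core i7-10700K @ 3.80GHz. The results show a significant speedup from area attention. For example, with FP32 on the RTX 3080, the inference time of YOLOv12-N drops by 0.7 ms. The gain is consistent across models and hardware configurations. FlashAttention [13, 14] is not used in this experiment because it would significantly reduce the speed difference.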

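4.4 Speed comparison

Table 4 compares the inference speed of YOLOv9 [58], YOLOv10 [53], YOLOv11 [28], and our YOLOv12 on the RTX 3080, RTX A5000, and RTX A6000 GPUs at FP32 and FP16 precision. For consistency, all results are obtained on the same hardware, and YOLOv9 [58] and YOLOv10 [53] are evaluated with the integrated ultralytics codebase [28]. The results show that YOLOv12 is significantly faster than YOLOv9 [58] and on par with YOLOv10 [53] and YOLOv11 [28]. For example, on the RTX 3080, YOLOv9 takes 2.4 ms (FP32) and 1.5 ms (FP16), while YOLOv12-N takes 1.7 ms (FP32) and 1.1 ms (FP16). Similar trends hold in the other configurations.

Figure 4 presents further comparisons. The left panel shows the accuracy-parameter trade-off of YOLOv12 against popular methods. YOLOv12 establishes a clearly better boundary than its counterparts, even surpassing YOLOv10, a version known for having significantly fewer parameters, which demonstrates its efficiency. The right panel compares the CPU inference latency of YOLOv12 with previous YOLO versions (all measured on an Intel Core i7-10700K @ 3.80GHz). YOLOv12 again surpasses its competitors with a better boundary, which underlines its efficiency across hardware platforms.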

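4.5 Diagnosis and visualization

Tables 5a to 5h diagnose the design choices of YOLOv12. Unless otherwise stated, the diagnostics use YOLOv12-N, trained from scratch for 600 epochs by default.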

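Attention implementation: Table 5a.
We examine two ways to implement attention. The convolution-based approach is faster than the linear-based one, thanks to the computational efficiency of convolution. We also compare two normalization methods, layer normalization (LN) and batch normalization (BN). Although layer normalization is common in attention mechanisms, it performs worse than batch normalization when combined with convolution. This finding is consistent with the design of the PSA module [53].
Hierarchical design: Table 5b.
Plain vision transformer architectures can achieve strong results in other detection systems such as Mask R-CNN [1, 25], but YOLOv12 behaves differently. With a plain vision transformer (N/A), detector performance drops sharply to 38.3% mAP. Milder changes, such as omitting the first stage (S1) or the fourth stage (S4), reduce mAP by 0.5% and 0.8%, respectively, at similar FLOPs. As with earlier YOLO models, the hierarchical design remains the most effective and gives the best performance in YOLOv12.
Training epochs: Table 5c.
We examine how the number of training epochs affects performance (training from scratch). Some existing YOLO detectors reach their best results after about 500 epochs [24, 53, 58], but YOLOv12 needs a longer schedule (about 600 epochs) to reach peak performance under the same configuration as YOLOv11 [28].
Position perceiver: Table 5d.
In the attention mechanism, we apply a separable convolution with a large kernel to the attention values v and add its output to v@attn. We call this component the position perceiver, because the smoothing effect of convolution preserves the original positions of image pixels and thus helps attention perceive positional information. The PSA module [53] already uses this idea, but we enlarge the kernel and gain accuracy without affecting speed. As the table shows, a larger kernel improves performance but gradually reduces speed, and the slowdown becomes significant at 9×9. We therefore use 7×7 as the default kernel size.
Positional encoding: Table 5e.
We examine the effect of the positional encodings commonly used in attention-based models (RPE: relative positional encoding; APE: absolute positional encoding). Interestingly, the configuration without any positional encoding performs best, which gives a cleaner architecture and lower inference latency.
Area attention: Table 5f.
This table uses FlashAttention by default. As a result, although area attention increases computational complexity (with a performance gain), the slowdown stays small. See Table 3 for further validation of area attention.
MLP ratio: Table 5g.
In conventional vision transformers, the MLP ratio inside the attention module is usually set to 4.0. We observe different behavior in YOLOv12. Because changing the MLP ratio changes the model size, we adjust the feature dimension to keep the overall model size consistent. YOLOv12 performs better with an MLP ratio of 1.2, which departs from conventional practice. The change shifts more of the computation toward attention and highlights the importance of area attention.
FlashAttention: Table 5h.
This table validates the role of FlashAttention in YOLOv12. FlashAttention speeds up YOLOv12-N by about 0.3 ms and YOLOv12-S by about 0.4 ms at no other cost.

Visualization: heat map comparison. Figure 5 compares the heat maps of YOLOv12 with those of the state-of-the-art YOLOv10 [53] and YOLOv11 [28]. The heat maps are extracted from the third stage of the backbone of the X-scale models and highlight the regions the model activates, reflecting its object perception capability. Compared with YOLOv10 and YOLOv11, YOLOv12 produces clearer object contours and more precise foreground activation, which indicates improved perception. Our explanation is that the improvement comes from area attention, which has a larger receptive field than convolutional networks and is therefore considered better at capturing overall context, leading to more precise foreground activation. We believe this property gives YOLOv12 a performance advantage.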

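5. Conclusion

This work presents YOLOv12, which successfully brings attention-centric design, traditionally considered too inefficient for real-time requirements, into the YOLO framework and achieves a state-of-the-art latency-accuracy trade-off. For efficient inference, we propose a new network that uses area attention to reduce computational complexity and residual efficient layer aggregation networks (R-ELAN) to enhance feature aggregation. We also refine key components of the vanilla attention mechanism to fit YOLO's real-time constraints while maintaining high speed.

By combining area attention, R-ELAN, and architectural optimizations, YOLOv12 achieves state-of-the-art performance with significant gains in both accuracy and efficiency. Comprehensive ablation studies further validate these innovations. This work challenges the dominance of CNN-based designs in YOLO systems and advances the integration of attention into real-time object detection, paving the way for more efficient and powerful YOLO systems.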

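6. Miscellaneous

6.1 Limitations

YOLOv12 depends on FlashAttention [13, 14], which currently supports GPUs of the Turing, Ampere, Ada Lovelace, and Hopper architectures (for example, T4, Quadro RTX series, RTX 20 series, RTX 30 series, RTX 40 series, RTX A5000/6000, A30/40, A100, and H100).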
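A quick way to check a device against this list is to look up its CUDA compute capability. The sketch below is a hypothetical helper, not part of YOLOv12 or FlashAttention: the table maps the capabilities of the four listed architecture families to their names, and the device list in the example is made up for illustration.

```python
# CUDA compute capability -> architecture name, for the families listed above.
FLASH_ATTENTION_ARCHS = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
}


def flash_attention_arch(capability):
    """Return the architecture name if the capability is in the table, else None."""
    return FLASH_ATTENTION_ARCHS.get(tuple(capability))


if __name__ == "__main__":
    # Hypothetical device list; capabilities are given as (major, minor).
    for name, cap in [("T4", (7, 5)), ("RTX 3080", (8, 6)), ("V100", (7, 0))]:
        arch = flash_attention_arch(cap)
        status = f"supported ({arch})" if arch else "not in the supported list"
        print(f"{name}: {status}")
```

When PyTorch is installed, the capability of the current device can be read with `torch.cuda.get_device_capability()` and passed to the helper.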

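6.2 More details

Fine-tuning details
By default, all YOLOv12 models are trained for 600 epochs with the SGD optimizer. Following previous work [24, 53, 57, 58], SGD momentum and weight decay are set to 0.937 and 5×10⁻⁴, respectively. The initial learning rate is 1×10⁻² and decays linearly to 1×10⁻⁴ over training. Data augmentation includes Mosaic [3, 57], Mixup [71], and copy-paste augmentation [65] to improve training. Following YOLOv11 [28], we use the Albumentations library [6]. Table 7 lists the detailed hyperparameters. All models are trained on 8 NVIDIA A6000 GPUs. Following common practice [24, 28, 53, 58], we report the standard mean average precision (mAP) across object scales and IoU thresholds, as well as the average latency over all images.

For more details, see the official code repository: https://github.com/sunsmarterjie/yolov12.

Result details
Table 6 gives more detailed results, including $AP^{val}_{50:95}$, $AP^{val}_{50}$, $AP^{val}_{75}$, $AP^{val}_{small}$, $AP^{val}_{medium}$, and $AP^{val}_{large}$.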

6.3 References

[1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[2] Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. In International conference on machine learning, pages 864–873. PMLR, 2020.
[3] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[4] Daniel Bogdoll, Maximilian Nitsche, and J Marius Zöllner. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4488–4499, 2022.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[6] Alexander Buslaev, Vladimir I Iglovikov, Eugene Khvedchenya, Alex Parinov, Mikhail Druzhinin, and Alexandr A Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, 2020.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[8] Kean Chen, Weiyao Lin, Jianguo Li, John See, Ji Wang, and Junni Zou. Ap-loss for accurate one-stage object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3782–3798, 2020.
[9] Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, and Ming-Ming Cheng. Yolo-ms: rethinking multi-scale representation learning for real-time object detection. arXiv preprint arXiv:2308.05480, 2023.
[10] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[11] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[12] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34:9355–9366, 2021.
[13] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
[14] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.
[16] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022.
[17] Douglas Henke Dos Reis, Daniel Welfer, Marco Antonio De Souza Leite Cuadros, and Daniel Fernando Tello Gamarra. Mobile robot navigation using an object recognition software with rgbd images and the yolo algorithm. Applied Artificial Intelligence, 33(14):1290–1305, 2019.
[18] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[19] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[20] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171, 2024.
[21] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171, 2024.
[22] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R Scott, and Weilin Huang. Tood: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3490–3499. IEEE Computer Society, 2021.
[23] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 303–312, 2021.
[24] Jocher Glenn. Yolov8. https://github.com/ultralytics/ultralytics/tree/main, 2023.
[25] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[26] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[27] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 603–612, 2019.
[28] Glenn Jocher. yolov11. https://github.com/ultralytics, 2024.
[29] Glenn Jocher, K Nishimura, T Mineeva, and RJAM Vilariño. yolov5. https://github.com/ultralytics/yolov5/tree, 2, 2020.
[30] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
[31] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
[32] Chuyi Li, Lulu Li, Yifei Geng, Hongliang Jiang, Meng Cheng, Bo Zhang, Zaidan Ke, Xiaoming Xu, and Xiangxiang Chu. Yolov6 v3. 0: A full-scale reloading. arXiv preprint arXiv:2301.05586, 2023.
[33] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13619–13627, 2022.
[34] Shuai Li, Chenhang He, Ruihuang Li, and Lei Zhang. A dual weighting label assignment scheme for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9387–9396, 2022.
[35] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020.
[36] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[37] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329, 2022.
[38] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. Vmamba: Visual state space model. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[40] Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140, 2024.
[41] Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu. Rt-detrv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv preprint arXiv:2407.17140, 2024.
[42] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3651–3660, 2021.
[43] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. A ranking-based, balanced loss function unifying classification and localisation in object detection. Advances in Neural Information Processing Systems, 33:15534–15545, 2020.
[44] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Rank & sort loss for object detection and instance segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3009–3018, 2021.
[45] J Redmon. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
[46] Joseph Redmon. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[47] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[48] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019.
[49] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.
[50] Yunjie Tian, Lingxi Xie, Jihao Qiu, Jianbin Jiao, Yaowei Wang, Qi Tian, and Qixiang Ye. Fast-itpn: Integrally pretrained transformer pyramid network with token migration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[51] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
[52] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 32–42, 2021.
[53] Ao Wang, Hui Chen, Lihao Liu, Kai Chen, Zijia Lin, Jungong Han, and Guiguang Ding. Yolov10: Real-time end-to-end object detection. arXiv preprint arXiv:2405.14458, 2024.
[54] Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Yunhe Wang, and Kai Han. Gold-yolo: Efficient object detector via gather-and-distribute mechanism. Advances in Neural Information Processing Systems, 36, 2024.
[55] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 390–391, 2020.
[56] Chien-Yao Wang, Hong-Yuan Mark Liao, and I-Hau Yeh. Designing network design strategies through gradient path analysis. arXiv preprint arXiv:2211.04800, 2022.
[57] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7464–7475, 2023.
[58] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. Yolov9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2402.13616, 2024.
[59] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15849–15858, 2021.
[60] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[61] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.
[62] Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025.
[63] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14138–14148, 2021.
[64] Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan L Yuille, and Wei Shen. Glance-and-gaze vision transformer. Advances in Neural Information Processing Systems, 34:12992–13003, 2021.
[65] Hongyi Zhang. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[66] Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, and Jie Chen. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16965–16974, 2024.
[67] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-iou loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence, pages 12993–13000, 2020.
[68] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. In 2019 international conference on 3D vision (3DV), pages 85–94. IEEE, 2019.
[69] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
[70] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
[71] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.

7. Implementation

7.1 Implementing area attention

import torch
import torch.nn as nn
from torch.nn.functional import scaled_dot_product_attention as sdpa
from ultralytics.nn.modules.conv import Conv  # Conv2d + BatchNorm (+ activation)

# Assumed setup (not shown in the original listing): enable FlashAttention only
# when the flash_attn package can be imported.
try:
    from flash_attn import flash_attn_func
    USE_FLASH_ATTN = True
except ImportError:
    USE_FLASH_ATTN = False


class AAttn(nn.Module):
    """Area-attention module with the requirement of flash attention."""

    def __init__(self, dim, num_heads, area=1):
        """Initializes the area-attention module, a simple yet efficient attention module for YOLO."""
        super().__init__()
        self.area = area
        self.num_heads = num_heads
        self.head_dim = head_dim = dim // num_heads
        all_head_dim = head_dim * self.num_heads
        self.qkv = Conv(dim, all_head_dim * 3, 1, act=False)
        self.proj = Conv(all_head_dim, dim, 1, act=False)
        # Position perceiver: 7x7 depthwise convolution applied to v.
        self.pe = Conv(all_head_dim, dim, 7, 1, 3, g=dim, act=False)

    def forward(self, x):
        """Processes the input tensor 'x' through the area-attention"""
        B, C, H, W = x.shape
        N = H * W
        # (B, 3C, H, W) -> (B, N, 3C)
        qkv = self.qkv(x).flatten(2).transpose(1, 2)
        if self.area > 1:
            # Split the tokens into areas with a plain reshape.
            qkv = qkv.reshape(B * self.area, N // self.area, C * 3)
            B, N, _ = qkv.shape
        q, k, v = qkv.view(B, N, self.num_heads, self.head_dim * 3).split(
            [self.head_dim, self.head_dim, self.head_dim], dim=3
        )

        if x.is_cuda and USE_FLASH_ATTN:
            x = flash_attn_func(
                q.contiguous().half(), k.contiguous().half(), v.contiguous().half()
            ).to(q.dtype)
        elif x.is_cuda and not USE_FLASH_ATTN:
            x = sdpa(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3),
                     attn_mask=None, dropout_p=0.0, is_causal=False)
            x = x.permute(0, 2, 1, 3)
        else:
            # CPU fallback: explicit softmax attention.
            q = q.permute(0, 2, 3, 1)
            k = k.permute(0, 2, 3, 1)
            v = v.permute(0, 2, 3, 1)
            attn = (q.transpose(-2, -1) @ k) * (self.head_dim ** -0.5)
            max_attn = attn.max(dim=-1, keepdim=True).values
            exp_attn = torch.exp(attn - max_attn)
            attn = exp_attn / exp_attn.sum(dim=-1, keepdim=True)
            x = v @ attn.transpose(-2, -1)
            x = x.permute(0, 3, 1, 2)
            v = v.permute(0, 3, 1, 2)

        if self.area > 1:
            # Merge the areas back into one token sequence per image.
            x = x.reshape(B // self.area, N * self.area, C)
            v = v.reshape(B // self.area, N * self.area, C)
            B, N, _ = x.shape
        x = x.reshape(B, H, W, C).permute(0, 3, 1, 2)
        v = v.reshape(B, H, W, C).permute(0, 3, 1, 2)

        x = x + self.pe(v)
        x = self.proj(x)
        return x
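The tensor shapes in the forward pass are easy to lose track of, so the following sketch traces them without any deep learning dependency. The helper name and the sample sizes are assumptions for illustration; the shapes follow the reshape and split steps of the listing above, with `dim` equal to the number of input channels.

```python
def aattn_shapes(batch, channels, height, width, num_heads, area):
    """Trace tensor shapes through the area-attention forward pass."""
    tokens = height * width
    if channels % num_heads or tokens % area:
        raise ValueError("channels must divide by heads and tokens by area")
    head_dim = channels // num_heads
    # With area > 1 the areas are folded into the batch dimension.
    area_batch, area_tokens = batch * area, tokens // area
    return {
        "input": (batch, channels, height, width),
        "qkv_flat": (batch, tokens, 3 * channels),
        "qkv_area": (area_batch, area_tokens, 3 * channels),
        "q": (area_batch, area_tokens, num_heads, head_dim),
        "attn": (area_batch, num_heads, area_tokens, area_tokens),
        "output": (batch, channels, height, width),
    }


if __name__ == "__main__":
    # Hypothetical sizes: a 40x40 feature map, 128 channels, 4 heads, 4 areas.
    for name, shape in aattn_shapes(1, 128, 40, 40, 4, 4).items():
        print(f"{name:9s} {shape}")
```

With four areas, each attention map covers 400×400 tokens instead of 1600×1600, which is where the saving comes from.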

7.2 Implementing the residual efficient layer aggregation network

import torch
import torch.nn as nn
# Assumed import paths; ABlock and C3k are building blocks defined in the code base.
from ultralytics.nn.modules.conv import Conv
from ultralytics.nn.modules.block import ABlock, C3k


class A2C2f(nn.Module):
    """A2C2f module with residual enhanced feature extraction using ABlock blocks with area-attention. Also known as R-ELAN"""

    def __init__(self, c1, c2, n=1, a2=True, area=1, residual=False, mlp_ratio=2.0, e=0.5, g=1, shortcut=True):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        assert c_ % 32 == 0, "Dimension of ABlock be a multiple of 32."

        # num_heads = c_ // 64 if c_ // 64 >= 2 else c_ // 32
        num_heads = c_ // 32

        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv((1 + n) * c_, c2, 1)  # optional act=FReLU(c2)

        init_values = 0.01  # or smaller
        # Learnable per-channel scale for the residual branch (only with a2 and residual).
        self.gamma = nn.Parameter(init_values * torch.ones((c2)), requires_grad=True) if a2 and residual else None

        self.m = nn.ModuleList(
            nn.Sequential(*(ABlock(c_, num_heads, mlp_ratio, area) for _ in range(2))) if a2 else C3k(c_, c_, 2, shortcut, g)
            for _ in range(n)
        )

    def forward(self, x):
        """Forward pass through R-ELAN layer."""
        y = [self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        if self.gamma is not None:
            # Scaled residual shortcut; requires c1 == c2 so the shapes match.
            return x + self.gamma.view(1, -1, 1, 1) * self.cv2(torch.cat(y, 1))
        return self.cv2(torch.cat(y, 1))
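The channel arithmetic in the constructor can be checked in isolation. The helper below is a hypothetical sketch that mirrors the expressions in the listing (`int(c2 * e)`, `c_ // 32`, and `(1 + n) * c_`); the sample values are chosen only for illustration.

```python
def a2c2f_channels(c2: int, n: int, e: float = 0.5) -> dict:
    """Channel bookkeeping of the A2C2f constructor for output width c2 and n stages."""
    hidden = int(c2 * e)
    if hidden % 32:
        raise ValueError("hidden width must be a multiple of 32")
    return {
        "hidden": hidden,               # width after the first 1x1 transition
        "num_heads": hidden // 32,      # one head per 32 channels
        "concat_in": (1 + n) * hidden,  # input width of the final 1x1 transition
    }


if __name__ == "__main__":
    print(a2c2f_channels(c2=256, n=2))
```

The check on `hidden % 32` explains the assertion in the listing: with one head per 32 channels, any other width would leave channels that belong to no head.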

7.3 YOLO12 network architecture

# YOLO12n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
  - [-1, 2, C3k2, [256, False, 0.25]]
  - [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
  - [-1, 2, C3k2, [512, False, 0.25]]
  - [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
  - [-1, 4, A2C2f, [512, True, 4]]
  - [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
  - [-1, 4, A2C2f, [1024, True, 1]] # 8

# YOLO12n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 2, A2C2f, [512, False, -1]] # 11

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 2, A2C2f, [256, False, -1]] # 14

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 11], 1, Concat, [1]] # cat head P4
  - [-1, 2, A2C2f, [512, False, -1]] # 17

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 8], 1, Concat, [1]] # cat head P5
  - [-1, 2, C3k2, [1024, True]] # 20 (P5/32-large)

  - [[14, 17, 20], 1, Detect, [nc]] # Detect(P3, P4, P5)
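The P1/2 to P5/32 annotations in the backbone follow from multiplying the strides of the downsampling convolutions. The sketch below reproduces that bookkeeping; the `BACKBONE` list is a hand-written summary of the backbone entries (module name and stride only), not a parser for the configuration file.

```python
# (module, stride) per backbone layer; only Conv layers with stride 2 downsample.
BACKBONE = [
    ("Conv", 2), ("Conv", 2), ("C3k2", 1), ("Conv", 2), ("C3k2", 1),
    ("Conv", 2), ("A2C2f", 1), ("Conv", 2), ("A2C2f", 1),
]


def cumulative_strides(layers):
    """Return the total downsampling factor after each layer."""
    stride, result = 1, []
    for _, layer_stride in layers:
        stride *= layer_stride
        result.append(stride)
    return result


if __name__ == "__main__":
    image_size = 640
    for index, ((module, _), stride) in enumerate(zip(BACKBONE, cumulative_strides(BACKBONE))):
        side = image_size // stride
        print(f"layer {index}: {module:6s} stride {stride:2d} -> {side}x{side} feature map")
```

For a 640×640 input, the first A2C2f stage (layer 6) works on a 40×40 map, that is 1600 tokens, which its area setting of 4 splits into areas of 400 tokens each.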