【ICCV21】Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Table of Contents

  • 0. Abstract
  • 1. Introduction
  • 2. Related Work
  • 3. Method
    • 3.1 Overall Architecture
    • 3.2 Shifted Window based Self-Attention
    • 3.3 Architecture Variants
  • 4. Experiments
    • 4.1 Image Classification on ImageNet-1K
    • 4.2 Object Detection on COCO
    • 4.3 Semantic Segmentation on ADE20K
    • 4.4 Ablation Study
  • 5. Conclusion
  • 6. Acknowledgement
  • References
  • My Thoughts
  • Easter Egg


Paper link: https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf

Paper reading notes

0. Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows.

The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val).

Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.


Code repository: https://github.com/microsoft/Swin-Transformer

1. Introduction

This paper aims to extend the Transformer into a general-purpose backbone for computer vision, competing with CNNs and improving performance on image classification and vision-language modeling tasks.

Swin Transformer is well suited as a general-purpose backbone for a wide range of vision tasks, in sharp contrast to previous Transformer-based architectures.


Modeling in computer vision has long been dominated by convolutional neural networks (CNNs).

Figure 1. (a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers [19] produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.

the prevalent architecture today is instead the Transformer

Designed for sequence modeling and transduction tasks, the Transformer is notable for its use of attention to model long-range dependencies in the data.

demonstrated promising results on certain tasks

between the two modalities

can vary substantially in scale

this would be intractable for Transformer on high-resolution images, as the computational complexity of its self-attention is quadratic to image size

conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) [38] or U-Net [47]. The linear computational complexity is achieved by computing self-attention locally within non-overlapping windows that partition an image (outlined in red).
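The paper quantifies this difference (Sec. 3.2): global multi-head self-attention costs Ω(MSA) = 4hwC^2 + 2(hw)^2·C, while window-based self-attention costs Ω(W-MSA) = 4hwC^2 + 2M^2·hwC for a fixed window size M, which is linear in the number of tokens hw. A quick back-of-the-envelope comparison in Python (the token-grid sizes and channel count below are chosen only for illustration, not taken from a specific experiment):

```python
# FLOP estimates from the paper's complexity formulas (window size M = 7).
# The numbers printed are illustrative, not measurements from the paper.
def msa_flops(h, w, C):
    """Global self-attention: quadratic in the number of tokens h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    """Window-based self-attention: linear in the number of tokens h*w."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

for h, w in [(56, 56), (112, 112), (224, 224)]:   # token grids at growing resolution
    print((h, w), f"global: {msa_flops(h, w, 96):.2e}", f"windowed: {wmsa_flops(h, w, 96):.2e}")
```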

between consecutive self-attention layers, as illustrated in Figure 2.

strategy is also efficient in regards to real-world latency: all query patches within a window share the same key set, which facilitates memory access in hardware.

a unified architecture across computer vision and natural language processing could benefit both fields, since it would facilitate joint modeling of visual and textual signals and the modeling knowledge from both domains can be more deeply shared.

2. Related Work

  • CNN and variants

  • Self-attention based backbone architectures

  • Self-attention/Transformers to complement CNNs

  • Transformer based vision backbones

This work is closely related to the Vision Transformer (ViT).

Our approach is both efficient and effective, achieving state-of-the-art accuracy on both COCO object detection and ADE20K semantic segmentation.

3. Method

3.1 Overall Architecture

It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT.

Each patch is treated as a "token" and its feature is set as a concatenation of the raw pixel RGB values.

A linear embedding layer is applied to project it to an arbitrary dimension (denoted as C).

Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens.

The number of tokens is reduced by patch merging layers as the network gets deeper.

The first patch merging layer concatenates the features of each group of 2 × 2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features.

Swin Transformer blocks are applied afterwards for feature transformation.
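A minimal PyTorch-style sketch of the two steps above (an illustrative re-implementation for this note, not the official code): patch splitting turns the image into 4×4 patch tokens embedded to dimension C, and patch merging concatenates each 2×2 group of neighboring tokens and projects the 4C-dimensional features to 2C, halving the spatial resolution.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and linearly embed them."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4):
        super().__init__()
        # a strided conv is equivalent to "concatenate raw pixel values + linear projection"
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # (B, H/4 * W/4, C) patch tokens

class PatchMerging(nn.Module):
    """Concatenate each group of 2x2 neighboring patches and project 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                 # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))     # (B, H/2 * W/2, 2C)
```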

[Figure: two successive Swin Transformer blocks, with regular and shifted windowing configurations (W-MSA and SW-MSA), respectively.]

  • Swin Transformer block

Swin Transformer is built by replacing the standard multi-head self attention (MSA) module in a Transformer block by a module based on shifted windows.
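Written as pseudo-code, the computation of two successive blocks (Eq. (3) in the paper) alternates the regular and shifted windowing configurations; `w_msa`, `sw_msa`, `mlp` and `ln` below are placeholders for the actual modules (in the real model each block has its own LayerNorm and MLP weights).

```python
def two_successive_blocks(z, w_msa, sw_msa, mlp, ln):
    # block l: window-based multi-head self-attention (W-MSA), then an MLP,
    # each preceded by LayerNorm and followed by a residual connection
    z = w_msa(ln(z)) + z
    z = mlp(ln(z)) + z
    # block l+1: same structure, but with shifted windows (SW-MSA)
    z = sw_msa(ln(z)) + z
    z = mlp(ln(z)) + z
    return z
```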

3.2 Shifted Window based Self-Attention

  • Self-attention in non-overlapped windows

  • Shifted window partitioning in successive blocks (see the sketch after this list)

  • Efficient batch computation for shifted configuration

  • Relative position bias
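A minimal sketch of window partitioning and of the cyclic shift used for the shifted configuration, assuming PyTorch tensors of shape (B, H, W, C) and window size M; the official implementation additionally applies an attention mask so that wrapped-around windows do not attend across the original image border, which is omitted here.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)   # (num_windows*B, M*M, C)

def shifted_windows(x, M, shift):
    """Cyclically shift the feature map before partitioning (SW-MSA).

    Rolling by (-shift, -shift) and rolling back afterwards lets the shifted
    configuration reuse the same batched window attention; the attention mask
    for windows that wrap around the border is omitted in this sketch.
    """
    shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(shifted, M)

# toy usage: stage-1 feature map of a 224x224 input, C = 96, M = 7, shift = M // 2
feat = torch.randn(1, 56, 56, 96)
print(window_partition(feat, 7).shape)     # torch.Size([64, 49, 96])
print(shifted_windows(feat, 7, 3).shape)   # torch.Size([64, 49, 96])
```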

3.3 Architecture Variants

Besides the base model Swin-B, the paper also introduces Swin-T, Swin-S, and Swin-L.
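For reference, the stage configurations reported in the paper; all variants use window size M = 7, 32 channels per attention head, and an MLP expansion ratio of 4, and differ in the base channel number C and the number of blocks per stage.

```python
# Architecture hyper-parameters of the Swin variants (from the paper, Sec. 3.3).
swin_variants = {
    "Swin-T": {"embed_dim": 96,  "depths": (2, 2, 6, 2)},   # ~0.25x Swin-B complexity
    "Swin-S": {"embed_dim": 96,  "depths": (2, 2, 18, 2)},  # ~0.5x  Swin-B complexity
    "Swin-B": {"embed_dim": 128, "depths": (2, 2, 18, 2)},  # base model
    "Swin-L": {"embed_dim": 192, "depths": (2, 2, 18, 2)},  # ~2x    Swin-B complexity
}
```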


4. Experiments

We conduct experiments on ImageNet-1K image classification [19], COCO object detection [43], and ADE20K semantic segmentation [83].

In the following, we first compare the proposed Swin Transformer architecture with the previous state-of-the-arts on the three tasks.

Then, we ablate the important design elements of Swin Transformer.

4.1 Image Classification on ImageNet-1K

[Table: image classification results on ImageNet-1K, from the paper.]

4.2 Object Detection on COCO

[Table: object detection results on COCO, from the paper.]

4.3 Semantic Segmentation on ADE20K

[Table: semantic segmentation results on ADE20K, from the paper.]

Note the capitalization: FLOPS (all upper-case) is short for floating-point operations per second and measures hardware performance.
FLOPs is short for floating-point operations, an operation count used to measure the complexity of an algorithm or model.
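A small worked example of the distinction (all numbers are hypothetical, chosen only to illustrate the units):

```python
# FLOPs = how many floating-point operations an algorithm performs (model complexity);
# FLOPS = how many floating-point operations per second the hardware can execute.
m, k, n = 1024, 1024, 1024
matmul_flops = 2 * m * k * n          # one multiply and one add per output element
device_flops = 10e12                  # hypothetical accelerator: 10 TFLOPS
ideal_time_ms = matmul_flops / device_flops * 1e3
print(f"{matmul_flops:.2e} FLOPs, ideal runtime {ideal_time_ms:.3f} ms")
```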

4.4 Ablation Study

[Ablation tables from the paper.]

5. Conclusion

Swin Transformer produces hierarchical feature representations and has linear computational complexity with respect to the input image size, achieving state-of-the-art results on COCO and ADE20K.

The shifted-window-based self-attention proposed in this paper is shown to be effective and efficient on vision problems.

6. Acknowledgement

We thank many colleagues at Microsoft for their help, in particular, Li Dong and Furu Wei for useful discussions; Bin Xiao, Lu Yuan and Lei Zhang for help on datasets.

The people acknowledged here are not among the paper's authors.

References

https://github.com/microsoft/Swin-Transformer

https://gitcode.com/microsoft/Swin-Transformer/overview?utm_source=csdn_github_accelerator&isLogin=1

https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf

My Thoughts

Swin Transformer puts more emphasis on generality across vision and language tasks, while this paper puts more emphasis on its capability as a backbone for different vision tasks.

Easter Egg

A light moment

Feel free to discuss this article in the comments.

