【Paper Archaeology】Attention Is All You Need

Transformer

Table of Contents

  • Transformer
  • 1. What
  • 2. Why
  • 3. How
    • 3.1 Encoder
    • 3.2 Decoder
    • 3.3 Attention
    • 3.4 Application
    • 3.5 Position-wise Feed-Forward Networks (The second sublayer)
    • 3.6 Embeddings and Softmax
    • 3.7 Positional Encoding
    • 3.8 Why Self-Attention

1. What

The Transformer is a new, simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

2. Why

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.

Recurrent neural networks have been the state of the art in most sequence modeling tasks, but their inherently sequential computation prevents parallelization within a sequence, and memory constraints limit batching across examples.

The goal of reducing sequential computation also motivates the use of convolutional neural networks. However, a convolution only sees a small local window, so relating two distant positions requires stacking several convolutional layers. In the Transformer this is reduced to a constant number of operations, because attention can look over the whole sequence at once.

Meanwhile, similar to the idea of using many convolution kernels (output channels) in a CNN, Multi-Head Attention is introduced to make up for the reduced effective resolution that comes from averaging over attention-weighted positions.

3. How

(Figure: the Transformer model architecture)

3.1 Encoder

The encoder is the block on the left of the figure, with 6 identical layers. Each layer has two sub-layers. Combined with the residual connection, each sub-layer can be represented as:

$$\text{LayerNorm}(x+\text{Sublayer}(x))$$

Each sub-layer is followed by layer normalization, which we will now introduce in detail.
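As a rough sketch of this sub-layer connection (illustrative NumPy code of mine, with `sublayer` standing for either self-attention or the feed-forward network; the learnable gain/bias of LayerNorm and the paper's dropout are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension (last axis);
    # the learnable gain and bias of the real LayerNorm are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then layer normalization.
    return layer_norm(x + sublayer(x))
```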

Firstly, we will introduce batch normalization and layer normalization, which are shown below as the blue and yellow squares.

(Figure: batch normalization (blue) vs. layer normalization (yellow) on 2D data)

In the 2D case, the data can be represented as feature $\times$ batch. Batch normalization normalizes one feature across the different samples in the batch. Layer normalization is equivalent to the transpose of batch normalization: it normalizes the different features of one sample.

In the 3D case, every sentence is a sequence and each word is a vector, so we can visualize it as below:

(Figure: batch normalization vs. layer normalization on 3D data of shape batch × sequence × feature)

The blue and yellow slices represent batch normalization and layer normalization on 3D data. If the sequence length differs among sentences, the two normalizations behave differently:

(Figure: the two normalizations when sequence lengths vary within a batch)

Batch normalization computes its statistics across the whole batch (and relies on global running statistics at inference time), so new data with an extreme length can make those statistics inaccurate. The Transformer therefore uses layer normalization, which is computed within each sequence on its own and is not affected by the rest of the data.
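A minimal NumPy sketch of the difference (the batch shape `(batch, seq_len, feature)` and the omission of the learnable scale and bias are simplifications of mine):

```python
import numpy as np

x = np.random.randn(32, 10, 512)              # (batch, seq_len, feature)

# Batch normalization: statistics per feature, computed across the batch
# (and here also across positions), so they depend on the other samples.
bn_mean = x.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 512)
bn_std = x.std(axis=(0, 1), keepdims=True)
x_bn = (x - bn_mean) / (bn_std + 1e-6)

# Layer normalization: statistics per sample and per position, computed
# across the feature dimension only, so other sequences have no effect.
ln_mean = x.mean(axis=-1, keepdims=True)      # shape (32, 10, 1)
ln_std = x.std(axis=-1, keepdims=True)
x_ln = (x - ln_mean) / (ln_std + 1e-6)
```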

3.2 Decoder

The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. In its self-attention sub-layer, a mask is added to prevent positions from attending to subsequent positions.
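A small sketch of how such a mask can be built; the function name `subsequent_mask` is illustrative, and the convention here is that `True` marks positions that may be attended to:

```python
import numpy as np

def subsequent_mask(size):
    # mask[i, j] is True where position i is allowed to attend to position j,
    # i.e. only to positions j <= i (the lower triangle).
    return np.tril(np.ones((size, size), dtype=bool))

print(subsequent_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```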

3.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. That is:
(Figure: the output as a weighted sum of the values, with weights given by query-key compatibility)

The key and value are paired. The weight for each value depends on the compatibility between the query and key.

Mathematically,

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where the queries, keys, and values are packed into matrices $Q$, $K$, $V$, and the softmax turns the scores into relative weights. The scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large with the dimension, which would otherwise push the softmax into regions with very small gradients.

The matrix multiplication can be represented as:

(Figure: scaled dot-product attention written as matrix multiplications)

A mask is also used in this block: the scores for keys at positions after $t$ are set to a large negative number, so their weights become nearly zero after the softmax.

(Figure: the mask applied to the attention scores before the softmax)
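Putting the formula and the mask together, here is a minimal NumPy sketch of scaled dot-product attention (my own illustrative code, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (m, d_k), K: (n, d_k), V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (m, n) compatibility scores
    if mask is not None:
        # Forbidden positions get a large negative score,
        # so their softmax weight is effectively zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (m, d_v) weighted sum of values
```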

As for multi-head attention, $h$ different learned linear projections are applied to $Q$, $K$, and $V$ to reduce the dimension from $d_{model}$ to $d_k$, $d_k$, and $d_v$ respectively. It is shown below:

(Figure: multi-head attention with $h$ parallel scaled dot-product attention heads)

And mathematically,

$$\begin{aligned}\mathrm{MultiHead}(Q,K,V)&=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O\\\text{where }\mathrm{head}_i&=\mathrm{Attention}(QW_i^Q,\,KW_i^K,\,VW_i^V)\end{aligned}$$

where the projections are parameter matrices $W_i^Q\in\mathbb{R}^{d_{model}\times d_k}$, $W_i^K\in\mathbb{R}^{d_{model}\times d_k}$, $W_i^V\in\mathbb{R}^{d_{model}\times d_v}$, and $W^O\in\mathbb{R}^{hd_v\times d_{model}}$.

In practice, $h=8$ and $d_k=d_v=d_{model}/h=64$.

In this way, there are also more parameters in the linear projection layers to learn compared with a single attention head.
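A sketch of the multi-head computation, built on the `scaled_dot_product_attention` sketch above; the random matrices merely stand in for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$, and the dimensions follow the paper ($d_{model}=512$, $h=8$, $d_k=d_v=64$):

```python
import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h                      # 64

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(h, d_model, d_k))      # per-head query projections
W_K = rng.normal(size=(h, d_model, d_k))      # per-head key projections
W_V = rng.normal(size=(h, d_model, d_v))      # per-head value projections
W_O = rng.normal(size=(h * d_v, d_model))     # output projection W^O

def multi_head_attention(Q, K, V, mask=None):
    heads = []
    for i in range(h):
        # Project into the lower-dimensional space of head i, then attend.
        head_i = scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i], mask)
        heads.append(head_i)
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1..head_h) W^O

x = rng.normal(size=(10, d_model))            # a sequence of 10 positions
out = multi_head_attention(x, x, x)           # self-attention: (10, 512)
```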

3.4 Application

There are three ways multi-head attention is used in the model. The first two are shown below:

(Figure: self-attention in the encoder and masked self-attention in the decoder)

All of the keys, values, and queries come from the same place and have the same size, so the output size is $n \times d$.

As for the third one, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder.

(Figure: encoder-decoder attention, with queries from the decoder and keys/values from the encoder)

$K$ and $V$ have size $n \times d$ and $Q$ has size $m \times d$, so the final output has size $m \times d$. Semantically, each position of the output sequence picks out the parts of the input sequence whose meaning is most relevant to it.

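As a quick shape check for the encoder-decoder attention, reusing the `multi_head_attention` sketch above (the sizes $n=12$, $m=7$ are arbitrary examples of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d_model = 12, 7, 512
enc_out = rng.normal(size=(n, d_model))       # encoder output: keys and values
dec_q = rng.normal(size=(m, d_model))         # queries from the decoder layer

out = multi_head_attention(dec_q, enc_out, enc_out)
print(out.shape)                              # (7, 512): one vector per decoder position
```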

3.5 Position-wise Feed-Forward Networks (The second sublayer)

Actually, it is just an MLP:

$$\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$$

The input $x$ has dimension $d_{model}=512$, $W_1\in\mathbb{R}^{512\times 2048}$ and $W_2\in\mathbb{R}^{2048\times 512}$.

Position-wise means the mapping is applied to every word (position) in the sequence separately, and all positions share the same MLP.

(Figure: the same feed-forward network applied independently at each position)

This is also the difference between it and an RNN: the RNN needs the output of the previous step as part of the input to the current step.
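A minimal sketch of the position-wise FFN; because the input has shape `(seq_len, d_model)`, a single matrix multiplication already applies the same MLP to every position independently:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 512, 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # x: (seq_len, d_model); the same W1, W2 are used at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU in between

out = ffn(rng.normal(size=(10, d_model)))         # (10, 512)
```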

3.6 Embeddings and Softmax

Embeddings map word tokens to vectors of dimension $d_{model}$. A linear transformation and a softmax convert the decoder output to predicted next-token probabilities. The two embedding layers and this pre-softmax linear transformation share the same weight matrix (in the embedding layers, the weights are multiplied by $\sqrt{d_{model}}$).
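A minimal illustration of this weight sharing (an illustrative sketch, not the paper's code): one matrix serves as the token embedding and, transposed, as the pre-softmax linear transformation.

```python
import numpy as np

vocab, d_model = 1000, 512
rng = np.random.default_rng(3)
W_emb = rng.normal(size=(vocab, d_model))          # the shared weight matrix

def embed(token_ids):
    # Embedding lookup, scaled by sqrt(d_model).
    return W_emb[token_ids] * np.sqrt(d_model)

def output_logits(decoder_out):
    # Pre-softmax linear transformation reuses the same weights, transposed.
    return decoder_out @ W_emb.T                   # (seq_len, vocab)
```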

3.7 Positional Encoding

In order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the
tokens in the sequence.

Every positional encoding is a vector of dimension $d_{model}$ and is added to the input embedding. The formula is:

$$PE_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right)$$
$$PE_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right)$$

where $pos$ is the position and $i$ is the dimension.
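A direct NumPy implementation of these two formulas (assuming an even $d_{model}$):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)    # added to the input embeddings
```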

3.8 Why Self-Attention

(Table: complexity per layer, minimum number of sequential operations, and maximum path length for self-attention, recurrent, convolutional, and restricted self-attention layers)

The table compares different layer types using three metrics: complexity per layer, the minimum number of sequential operations, and the maximum path length between any two positions in the sequence.

For self-attention, the complexity per layer $O(n^2 \cdot d)$ comes from the multiplication of the matrices $Q$ and $K^T$. Self-Attention (restricted) means each query only attends to a neighborhood of size $r$ around its position, which lowers the per-layer cost at the price of a longer maximum path length.

Ref:

Transformer论文逐段精读【论文精读】_哔哩哔哩_bilibili

Transformer常见问题与回答总结
