Transformer
Table of Contents
- Transformer
- 1. What
- 2. Why
- 3. How
- 3.1 Encoder
- 3.2 Decoder
- 3.3 Attention
- 3.4 Application
- 3.5 Position-wise Feed-Forward Networks (The second sublayer)
- 3.6 Embeddings and Softmax
- 3.7 Positional Encoding
- 3.8 Why Self-Attention
1. What
The Transformer is a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
2. Why
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
Recurrent neural networks have been state of the art in most sequence modeling tasks, but their inherently sequential computation precludes parallelization within a sequence, and memory constraints limit batching across examples.
The goal of reducing sequential computation also motivates the use of convolutional neural networks. However, it is hard for them to handle dependencies between distant positions, because a convolution only sees a small local window; relating two far-apart positions requires stacking several convolutional layers. In the Transformer this is reduced to a constant number of operations, because self-attention can look at the whole sequence at once.
Meanwhile, similar to the idea of using many convolution kernels in a CNN, Multi-Head Attention is introduced to make up for the reduced effective resolution caused by averaging attention-weighted positions.
3. How
3.1 Encoder
The encoder is the block on the left of the figure, with 6 identical layers. Each layer has two sub-layers. Combined with the residual connection, the output of each sub-layer can be written as:
$$\text{LayerNorm}(x+\text{Sublayer}(x))$$
Each sub-layer is followed by layer normalization. We will introduce it in detail.
Firstly, we will introduce batch normalization and layer normalization, which are shown below as the blue and yellow squares.
In the 2D case, the data can be represented as a batch $\times$ feature matrix. Batch normalization normalizes one feature across all samples in the batch; layer normalization is the transpose of this, normalizing all features within a single sample.
In the 3D case, every sentence is a sequence and each word is a vector, so we can visualize it as below:
The blue and yellow cubes represent batch normalization and layer normalization on 3D data. If the sequence length differs among sentences, the two normalizations behave differently: batch normalization computes its statistics over all sequences in the batch, so when a new sequence has an extreme length, the mean and variance estimated during training become inaccurate. The Transformer therefore uses layer normalization, which computes statistics within each sequence itself and is not affected by the other samples.
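To make the difference concrete, here is a minimal numpy sketch, with the learnable scale and shift parameters omitted (the shapes and the 1e-5 epsilon are my own choices for illustration):

```python
import numpy as np

x = np.random.randn(4, 10, 512)   # 4 sentences, 10 tokens each, d_model = 512

# Batch norm: one mean/std per feature, computed over all sentences and positions.
bn_mean = x.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, 512)
bn_std  = x.std(axis=(0, 1), keepdims=True)
x_bn = (x - bn_mean) / (bn_std + 1e-5)

# Layer norm: one mean/std per token, computed over its own 512 features only.
ln_mean = x.mean(axis=-1, keepdims=True)       # shape (4, 10, 1)
ln_std  = x.std(axis=-1, keepdims=True)
x_ln = (x - ln_mean) / (ln_std + 1e-5)

# The sub-layer pattern from above: LayerNorm(x + Sublayer(x)).
def layer_norm(z, eps=1e-5):
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + eps)

sublayer_out = np.random.randn(*x.shape)       # stand-in for Sublayer(x)
y = layer_norm(x + sublayer_out)               # residual connection, then layer norm
```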
3.2 Decoder
The decoder also consists of 6 identical layers and inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. A mask is added to the self-attention sub-layer to prevent positions from attending to subsequent positions.
3.3 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. That is:
The key and value are paired. The weight for each value depends on the compatibility between the query and key.
Mathematically,
$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$
where the queries are packed together into a matrix $Q$ and the $\text{softmax}$ turns the scores into relative weights. The scaling factor $\sqrt{d_k}$ keeps the dot products from growing too large when $d_k$ is large, which would push the softmax into regions with extremely small gradients.
The matrix multiplication can be represented as:
A mask is also used in this block: before the softmax, the scores of the keys after position $t$ are set to a very large negative number, so their weights become almost zero after the softmax.
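A minimal numpy sketch of this formula, with the decoder-style causal mask applied as a large negative score before the softmax (the function and variable names are mine, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compatibility of each query with each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked-out positions get a large negative score
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of the values

# Causal mask for the decoder: position t may only attend to positions <= t.
n, d = 5, 64
mask = np.tril(np.ones((n, n), dtype=bool))
x = np.random.randn(n, d)
out = scaled_dot_product_attention(x, x, x, mask=mask)   # shape (5, 64)
```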
As for multi-head attention, $h$ different learned linear projections are applied to $Q$, $K$, and $V$ to project them from $d_{model}$ down to $d_k$, $d_k$, and $d_v$ dimensions, respectively. It is shown below:
And mathematically,
$$\begin{aligned}\mathrm{MultiHead}(Q,K,V)&=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^O\\\text{where head}_i&=\mathrm{Attention}(QW_i^Q,KW_i^K,VW_i^V)\end{aligned}$$
where the projections are parameter matrices $W_i^Q\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}$, $W_i^K\in\mathbb{R}^{d_{\mathrm{model}}\times d_k}$, $W_i^V\in\mathbb{R}^{d_{\mathrm{model}}\times d_v}$, and $W^O\in\mathbb{R}^{hd_v\times d_{\mathrm{model}}}$.
In practice, $h=8$ and $d_k=d_v=d_{\mathrm{model}}/h=64$.
In this way, we also have more parameters to learn in the linear projection layers, compared with single-head attention.
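A minimal sketch of multi-head attention, reusing the scaled_dot_product_attention function from the sketch above (the small random weights are only placeholders for learned parameters):

```python
import numpy as np

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """WQ[i], WK[i]: (d_model, d_k); WV[i]: (d_model, d_v); WO: (h*d_v, d_model)."""
    heads = [scaled_dot_product_attention(Q @ WQ[i], K @ WK[i], V @ WV[i])
             for i in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO   # concatenate the heads, then project back

d_model, h = 512, 8
d_k = d_v = d_model // h                          # 64, as in the paper
WQ = [np.random.randn(d_model, d_k) * 0.01 for _ in range(h)]
WK = [np.random.randn(d_model, d_k) * 0.01 for _ in range(h)]
WV = [np.random.randn(d_model, d_v) * 0.01 for _ in range(h)]
WO = np.random.randn(h * d_v, d_model) * 0.01

x = np.random.randn(10, d_model)                  # self-attention over 10 tokens
out = multi_head_attention(x, x, x, WQ, WK, WV, WO)   # shape (10, 512)
```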
3.4 Application
There are three types of multi-head attention in the model. For the first two, as shown below:
All of the keys, values, and queries come from the same place and have the same size. The output size is $n \times d$.
As for the third one, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. $K$ and $V$ have size $n \times d$ and $Q$ has size $m \times d$, so the final output has size $m \times d$. From a semantic point of view, this picks out, for each word of the output sequence, the words of the input sequence that are most similar to it in meaning.
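As a quick shape check on top of the sketches above (n = 10 encoder tokens and m = 6 decoder tokens are arbitrary choices of mine):

```python
# Encoder-decoder attention, reusing multi_head_attention and the weights defined above.
enc_out = np.random.randn(10, d_model)   # n = 10 encoder tokens: keys and values
dec_in  = np.random.randn(6, d_model)    # m = 6 decoder tokens: queries
cross = multi_head_attention(dec_in, enc_out, enc_out, WQ, WK, WV, WO)
print(cross.shape)                       # (6, 512): one output row per decoder position
```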
3.5 Position-wise Feed-Forward Networks (The second sublayer)
Actually, it is an MLP:
$$\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)\,W_2+b_2.$$
The input $x$ has dimension $d_{model}=512$, $W_1\in\mathbb{R}^{512\times2048}$, and $W_2\in\mathbb{R}^{2048\times512}$.
Position-wise means that the same MLP is applied to every word (position) in the sequence independently. This is also the difference from an RNN, which needs the output of the previous step as part of its current input.
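A minimal sketch of the position-wise feed-forward network with the dimensions above (the random weights are placeholders for learned parameters):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """x: (seq_len, d_model). Every position goes through the same two linear layers."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
y = position_wise_ffn(np.random.randn(10, d_model), W1, b1, W2, b2)   # shape (10, 512)
```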
3.6 Embeddings and Softmax
Embeddings map word tokens to vectors of dimension $d_{model}$. A linear transformation followed by a softmax converts the decoder output into predicted next-token probabilities. The two embedding layers and this pre-softmax linear transformation share the same weight matrix.
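A minimal sketch of the weight sharing (the vocabulary size and token ids are made up for illustration; the $\sqrt{d_{model}}$ scaling of the embeddings is from the paper):

```python
import numpy as np

vocab_size, d_model = 1000, 512
E = np.random.randn(vocab_size, d_model) * 0.01   # the shared weight matrix

# Embedding lookup, multiplied by sqrt(d_model).
tokens = np.array([3, 17, 42])
emb = E[tokens] * np.sqrt(d_model)                # shape (3, 512)

# Pre-softmax linear transformation: the same matrix, transposed.
decoder_out = np.random.randn(3, d_model)
logits = decoder_out @ E.T                        # shape (3, vocab_size)
logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
```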
3.7 Positional Encoding
In order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
Each positional encoding has dimension $d_{model}$ and is added to the input embedding. The formula is:
$$\begin{aligned}PE_{(pos,2i)}&=\sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right)\\PE_{(pos,2i+1)}&=\cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right),\end{aligned}$$
where $pos$ is the position and $i$ is the dimension.
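A minimal numpy sketch of the sinusoidal encoding (the function name and the max_len of 50 are my own choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Returns a (max_len, d_model) matrix of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(50, 512)   # added element-wise to the (50, 512) input embeddings
```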
3.8 Why Self-Attention
The table in the paper compares different layer types using three metrics: complexity per layer, the minimum number of sequential operations, and the maximum path length between any two positions.
For self-attention, the complexity per layer $O(n^2\cdot d)$ comes from the multiplication of the matrices $Q$ and $K^T$. Self-Attention (restricted) means each query only attends to a neighborhood of $r$ nearby positions, reducing the complexity to $O(r\cdot n\cdot d)$ at the cost of a longer maximum path length.
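A back-of-the-envelope comparison of the per-layer costs (n = 50, d = 512, k = 3 are illustrative numbers of mine, not from the paper):

```python
# Rough per-layer operation counts for the three layer types.
n, d, k = 50, 512, 3             # sequence length, representation dimension, kernel size
self_attention = n * n * d       # O(n^2 * d): the Q @ K^T product dominates
recurrent      = n * d * d       # O(n * d^2): one d x d matrix multiply per step
convolution    = k * n * d * d   # O(k * n * d^2)
print(self_attention, recurrent, convolution)
# When n < d (typical sentence lengths), self-attention is the cheapest of the three.
```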
Ref:
Transformer论文逐段精读【论文精读】 (paragraph-by-paragraph reading of the Transformer paper), bilibili
Transformer常见问题与回答总结 (summary of common Transformer questions and answers)