Sequence to Sequence Architecture
Paper link
- Sequence to Sequence Learning with Neural Networks
Bilibili course @ShusenWang
Core idea
Key improvement
In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs.
[Note]: $s_1, s_2, \dots, s_t$ are the outputs of the decoder RNN at each time step; $P_1, P_2, \dots, P_t$ are the outputs of the decoder's fully connected layer.
The seq2seq architecture consists of three parts: an encoder, a decoder, and a fixed-length context vector (Context Vector).
- Encoder: encodes the input sequence into a fixed-length context vector (Context Vector); typically an LSTM or GRU.
- **Decoder:** generates the output sequence from that context vector. At each time step it produces the next token from its previous output and its current hidden state, until it emits a termination token (e.g. `<EOS>`). A minimal sketch of both modules follows this list.
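A minimal PyTorch sketch of these two components. The module structure, dimensions, and the use of `nn.Embedding`/`nn.LSTM` here are illustrative assumptions, not a reproduction of the paper's exact setup:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source token ids and returns the final LSTM state
    (hidden and cell) as the fixed-length context vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                       # src_ids: (batch, src_len)
        embedded = self.embedding(src_ids)             # (batch, src_len, emb_dim)
        _, (h_n, c_n) = self.lstm(embedded)            # h_n, c_n: (1, batch, hidden_dim)
        return h_n, c_n                                # the context vector

class Decoder(nn.Module):
    """Generates target tokens one step at a time, conditioned on the context."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)    # maps state s_t to vocab scores P_t

    def forward(self, input_ids, hidden):              # input_ids: (batch, 1)
        embedded = self.embedding(input_ids)           # (batch, 1, emb_dim)
        output, hidden = self.lstm(embedded, hidden)   # output: (batch, 1, hidden_dim)
        logits = self.fc(output)                       # (batch, 1, vocab_size)
        return logits, hidden
```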
Workflow of the network (machine translation as an example; a code sketch of these steps follows the list):
Suppose we have a translation pair ["欢迎 来 北京", "welcome to Beijing"].
- Build the source and target vocabularies (e.g. token-to-id index tables {'欢迎': 0, '来': 1, '北京': 2, …} and {'to': 0, …}).
- Encode the input sequence [0, 1, 2] with the encoder to produce the Context Vector (representing the source sentence '欢迎来北京').
- Use the Context Vector as the decoder's hidden state at the first time step, and feed the special [Start] token as the input at that step.
- Feed the decoder's output at each time step as the input of the next time step.
- Repeat until the decoder generates the `<EOS>` token.
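Continuing the sketch above, a greedy decoding loop for this toy example could look as follows. The vocabularies, special tokens, and the cap of 20 steps are illustrative assumptions, and the `Encoder`/`Decoder` classes are the ones defined in the earlier sketch; a trained model would of course be needed for a meaningful translation:

```python
import torch

# Toy vocabularies for the pair ["欢迎 来 北京", "welcome to Beijing"]
src_vocab = {"欢迎": 0, "来": 1, "北京": 2}
tgt_vocab = {"<START>": 0, "<EOS>": 1, "welcome": 2, "to": 3, "Beijing": 4}
id2tgt = {i: w for w, i in tgt_vocab.items()}

encoder = Encoder(vocab_size=len(src_vocab))    # Encoder/Decoder from the sketch above
decoder = Decoder(vocab_size=len(tgt_vocab))

src_ids = torch.tensor([[0, 1, 2]])             # "欢迎 来 北京" -> [0, 1, 2]
hidden = encoder(src_ids)                       # Context Vector -> decoder's initial state

token = torch.tensor([[tgt_vocab["<START>"]]])  # first decoder input: the [Start] token
translation = []
for _ in range(20):                             # hard cap so an untrained model cannot loop forever
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)               # greedy choice of the next token
    if token.item() == tgt_vocab["<EOS>"]:      # stop once <EOS> is generated
        break
    translation.append(id2tgt[token.item()])
print(translation)
```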
Role of the `<EOS>` token
- Marks the end of the sentence.
- Prevents the decoder from looping indefinitely.
- Lets the model learn the probability distribution over where sentences end, so it can handle variable-length sequences.
The core task of Seq2Seq: compute the conditional probability of the target sequence given the input sequence.
The goal of the LSTM is to estimate the conditional probability $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$ where $(x_1, \dots, x_T)$ is an input sequence and $y_1, \dots, y_{T'}$ is its corresponding output sequence whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \dots, x_T)$ given by the last hidden state of the LSTM, and then computing the probability of $y_1, \dots, y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to the representation $v$ of $(x_1, \dots, x_T)$.
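In the paper this probability is factored autoregressively, with each factor given by a softmax over all words in the vocabulary:

$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$$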
Areas for improvement
- Use a bidirectional LSTM encoder (a minimal sketch follows this list).
- Multi-task learning (e.g. English-Chinese + English-French + English-German).
- Add an attention mechanism.
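For the first point, a rough sketch of how the encoder from the earlier example could be made bidirectional; summing the final states of the two directions is just one possible choice, assumed here only for illustration:

```python
import torch.nn as nn

class BiEncoder(nn.Module):
    """Bidirectional variant: reads the source left-to-right and right-to-left."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        embedded = self.embedding(src_ids)
        _, (h_n, c_n) = self.lstm(embedded)   # h_n, c_n: (2, batch, hidden_dim)
        # Sum the forward and backward final states so a single-direction
        # decoder can still consume the context vector.
        return h_n.sum(dim=0, keepdim=True), c_n.sum(dim=0, keepdim=True)
```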
Learning case: an English-to-Chinese translation example
GitHub link