Sequence to Sequence Architecture
Paper link
- Sequence to Sequence Learning with Neural Networks
Bilibili course @ShusenWang
Core idea
Key improvement
In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs.
[Note]: $s_1, s_2, \dots, s_t$ are the outputs of the decoder RNN at each time step; $P_1, P_2, \dots, P_t$ are the outputs of the decoder's fully connected layer.
The seq2seq architecture consists of three parts: an encoder, a decoder, and a fixed-length context vector (Context Vector).
- Encoder: encodes the input sequence into a fixed-length context vector (Context Vector); typically an LSTM or GRU.
- **Decoder:** generates the output sequence from that context vector. At each time step it produces the next token from its previous output and its current hidden state, until it emits a termination token (e.g. `<EOS>`). A minimal sketch of both modules follows this list.
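A minimal PyTorch sketch of these two components. The module structure, dimensions, and the use of `nn.Embedding`/`nn.LSTM` here are illustrative assumptions, not a reproduction of the paper's exact setup:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source token ids and returns the final LSTM state
    (hidden and cell) as the fixed-length context vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):                       # src_ids: (batch, src_len)
        embedded = self.embedding(src_ids)             # (batch, src_len, emb_dim)
        _, (h_n, c_n) = self.lstm(embedded)            # h_n, c_n: (1, batch, hidden_dim)
        return h_n, c_n                                # the context vector

class Decoder(nn.Module):
    """Generates target tokens one step at a time, conditioned on the context."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)    # maps state s_t to vocab scores P_t

    def forward(self, input_ids, hidden):              # input_ids: (batch, 1)
        embedded = self.embedding(input_ids)           # (batch, 1, emb_dim)
        output, hidden = self.lstm(embedded, hidden)   # output: (batch, 1, hidden_dim)
        logits = self.fc(output)                       # (batch, 1, vocab_size)
        return logits, hidden
```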
Workflow of the network (machine translation as an example; a code sketch of these steps follows the list):
Suppose we have a translation pair ["欢迎 来 北京", "welcome to Beijing"].
- Build the source and target vocabularies (e.g. token-to-id index tables {'欢迎': 0, '来': 1, '北京': 2, …} and {'to': 0, …}).
- Encode the input sequence [0, 1, 2] with the encoder to produce the Context Vector (representing the source sentence '欢迎来北京').
- Use the Context Vector as the decoder's hidden state at the first time step, and feed the special [Start] token as the input at that step.
- Feed the decoder's output at each time step as the input of the next time step.
- Repeat until the decoder generates the `<EOS>` token.
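Continuing the sketch above, a greedy decoding loop for this toy example could look as follows. The vocabularies, special tokens, and the cap of 20 steps are illustrative assumptions, and the `Encoder`/`Decoder` classes are the ones defined in the earlier sketch; a trained model would of course be needed for a meaningful translation:

```python
import torch

# Toy vocabularies for the pair ["欢迎 来 北京", "welcome to Beijing"]
src_vocab = {"欢迎": 0, "来": 1, "北京": 2}
tgt_vocab = {"<START>": 0, "<EOS>": 1, "welcome": 2, "to": 3, "Beijing": 4}
id2tgt = {i: w for w, i in tgt_vocab.items()}

encoder = Encoder(vocab_size=len(src_vocab))    # Encoder/Decoder from the sketch above
decoder = Decoder(vocab_size=len(tgt_vocab))

src_ids = torch.tensor([[0, 1, 2]])             # "欢迎 来 北京" -> [0, 1, 2]
hidden = encoder(src_ids)                       # Context Vector -> decoder's initial state

token = torch.tensor([[tgt_vocab["<START>"]]])  # first decoder input: the [Start] token
translation = []
for _ in range(20):                             # hard cap so an untrained model cannot loop forever
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)               # greedy choice of the next token
    if token.item() == tgt_vocab["<EOS>"]:      # stop once <EOS> is generated
        break
    translation.append(id2tgt[token.item()])
print(translation)
```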
Role of the `<EOS>` token
- Marks the end of the sentence.
- Prevents the decoder from looping indefinitely.
- Lets the model learn the probability distribution over where sentences end, so it can handle variable-length sequences.
The core task of Seq2Seq: compute the conditional probability of the target sequence given the input sequence.
The goal of the LSTM is to estimate the conditional probability $p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T)$ where $(x_1, \dots, x_T)$ is an input sequence and $y_1, \dots, y_{T'}$ is its corresponding output sequence whose length $T'$ may differ from $T$. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation $v$ of the input sequence $(x_1, \dots, x_T)$ given by the last hidden state of the LSTM, and then computing the probability of $y_1, \dots, y_{T'}$ with a standard LSTM-LM formulation whose initial hidden state is set to the representation $v$ of $(x_1, \dots, x_T)$.
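In the paper this probability is factored autoregressively, with each factor given by a softmax over all words in the vocabulary:

$$p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \dots, y_{t-1})$$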
Areas for improvement
- Use a bidirectional LSTM encoder (a minimal sketch follows this list).
- Multi-task learning (e.g. English-Chinese + English-French + English-German).
- Add an attention mechanism.
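For the first point, a rough sketch of how the encoder from the earlier example could be made bidirectional; summing the final states of the two directions is just one possible choice, assumed here only for illustration:

```python
import torch.nn as nn

class BiEncoder(nn.Module):
    """Bidirectional variant: reads the source left-to-right and right-to-left."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        embedded = self.embedding(src_ids)
        _, (h_n, c_n) = self.lstm(embedded)   # h_n, c_n: (2, batch, hidden_dim)
        # Sum the forward and backward final states so a single-direction
        # decoder can still consume the context vector.
        return h_n.sum(dim=0, keepdim=True), c_n.sum(dim=0, keepdim=True)
```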
Learning case: an English-to-Chinese translation example
GitHub link