Deep Learning: Attention Mechanisms

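Psychology
- Nonvolitional cues
- Volitional cues
Attention mechanism
Nonparametric attention pooling layer
Nadaraya-Watson kernel regression
- Parametric attention mechanism
Summary
Attention pooling: Nadaraya-Watson kernel regression code
- Generating the dataset
- Kernel regression
- Nonparametric attention pooling
- Attention weights
- - Full code for this part
- Parametric attention pooling
- - Converting the training dataset into keys and values
- - Training the parametric attention pooling model
- - Plotting the predictions
- - Full code for this part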

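Psychology

Animals need to focus effectively on the points that deserve attention in a complex environment.
Psychological framework: humans select where to direct attention based on volitional cues and nonvolitional cues.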

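Nonvolitional cues

A nonvolitional cue draws attention without conscious intent, for example because an object is visually salient.

Volitional cues

A volitional cue reflects a deliberate choice, for example looking for a specific object because of the task at hand.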

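Attention mechanism

Convolutional, fully connected, and pooling layers only take nonvolitional cues into account.
The attention mechanism, in contrast, explicitly takes volitional cues into account:
    A volitional cue is called a query.
    Each input is a pair of a value and a nonvolitional cue (the key).
    An attention pooling layer then selects certain inputs in a biased way, guided by the query.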

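Nonparametric attention pooling layer

Given data ($x_i$, $y_i$), $i = 1, \dots, n$, where each $x_i$ is a key and each $y_i$ is a value.

The simplest pooling is average pooling, $f(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$.
Here $x$ is the query; average pooling ignores it and returns the same output for every query.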

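A better approach is Nadaraya-Watson kernel regression, which weights each $y_i$ according to how close its key $x_i$ is to the query $x$.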
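For reference, the Nadaraya-Watson estimator is usually written in the form below. The kernel symbol $K$ is notation assumed here for illustration; it does not appear elsewhere in this post.

$$
f(x) = \sum_{i=1}^{n} \frac{K(x - x_i)}{\sum_{j=1}^{n} K(x - x_j)} \, y_i
$$

The fraction is the attention weight assigned to $y_i$: it is non-negative and the weights sum to 1 over $i$.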

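Nadaraya-Watson kernel regression

With a Gaussian kernel, the weight given to each $y_i$ is a softmax over $-\frac{1}{2}(x - x_i)^2$; this is what the code later in this post computes.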
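Written out with the standard Gaussian kernel, the Nadaraya-Watson estimator takes roughly the following form. The symbol $K$ is notation assumed here; the final softmax expression is the one that matches the `softmax(-(X_repeat - x_train)**2 / 2, dim=1)` call in the code part of this post.

$$
K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)
$$

$$
f(x) = \sum_{i=1}^{n} \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^{n} \exp\left(-\frac{1}{2}(x - x_j)^2\right)} \, y_i
     = \sum_{i=1}^{n} \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i
$$

The constant $\frac{1}{\sqrt{2\pi}}$ cancels between numerator and denominator, which is why it does not appear in the code.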

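Parametric attention mechanism

Building on the above, introduce a learnable parameter $w$ that scales the distance between the query and each key.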
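As a sketch, the parametric version can be written as below. This form is read off from the forward pass of the `NWKernelRegression` class in the code part of this post, so it should be taken as a transcription of that code rather than a separate derivation.

$$
f(x) = \sum_{i=1}^{n} \mathrm{softmax}\left(-\frac{1}{2}\big((x - x_i)\, w\big)^2\right) y_i
$$

Setting $w = 1$ gives back the nonparametric Gaussian-kernel case.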

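Summary

Psychology holds that people select where to direct attention through volitional and nonvolitional cues.

In the attention mechanism, the query (volitional cue) and the keys (nonvolitional cues) are used to select inputs in a biased way.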

Attention pooling: Nadaraya-Watson kernel regression code

import torch
from torch import nn
from d2l import torch as d2l

Generating the dataset

import torch

# Number of training samples
n_train = 50
# Training features x_train, sampled from [0, 5) and sorted
x_train, _ = torch.sort(torch.rand(n_train) * 5)

# The true function f used to generate the labels
def f(x):
    return 2 * torch.sin(x) + x ** 0.8

# Training labels: f(x_train) plus Gaussian noise (mean 0, std 0.5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
# Test features on [0, 5) with step 0.1
x_test = torch.arange(0, 5, 0.1)
# Ground-truth labels for the test features
y_truth = f(x_test)
# Number of test samples
n_test = len(x_test)
print(n_test)

Kernel regression

import torch
from torch import nn
from d2l import torch as d2l

# The true function f used to generate the labels
def f(x):
    return 2 * torch.sin(x) + x ** 0.8

# Plot the kernel regression result
def plot_kernel_reg(y_hat):
    # Plot the ground truth y_truth and the prediction y_hat against x_test
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    # Scatter the training samples as circles
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5)

# Number of training samples
n_train = 50
# Training features x_train, sampled from [0, 5) and sorted
x_train, _ = torch.sort(torch.rand(n_train) * 5)
# Training labels: f(x_train) plus Gaussian noise (mean 0, std 0.5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
# Test features on [0, 5) with step 0.1
x_test = torch.arange(0, 5, 0.1)
# Ground-truth labels for the test features
y_truth = f(x_test)
# Number of test samples
n_test = len(x_test)
# Use the mean of y_train, repeated n_test times, as the prediction y_hat
# Simplest pooling, average pooling: f(x) = 1/n * sum(y_i)
y_hat = torch.repeat_interleave(y_train.mean(), n_test)
# Plot the result
plot_kernel_reg(y_hat)
d2l.plt.show()

Nonparametric attention pooling

import torch
from torch import nn
from d2l import torch as d2l

# The true function f used to generate the labels
def f(x):
    return 2 * torch.sin(x) + x ** 0.8

# Plot the kernel regression result
def plot_kernel_reg(y_hat):
    # Plot the ground truth y_truth and the prediction y_hat against x_test
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    # Scatter the training samples as circles
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5)

# Number of training samples
n_train = 50
# Training features x_train, sampled from [0, 5) and sorted
x_train, _ = torch.sort(torch.rand(n_train) * 5)
# Training labels: f(x_train) plus Gaussian noise (mean 0, std 0.5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
# Test features on [0, 5) with step 0.1
x_test = torch.arange(0, 5, 0.1)
# Ground-truth labels for the test features
y_truth = f(x_test)

# Repeat each test input n_train times and reshape into a 2-D matrix
# X_repeat has shape torch.Size([50, 50]), i.e. (n_test, n_train)
# x_test is
# tensor([0.0000, 0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000, 0.8000,
#         0.9000, 1.0000, 1.1000, 1.2000, 1.3000, 1.4000, 1.5000, 1.6000, 1.7000,
#         1.8000, 1.9000, 2.0000, 2.1000, 2.2000, 2.3000, 2.4000, 2.5000, 2.6000,
#         2.7000, 2.8000, 2.9000, 3.0000, 3.1000, 3.2000, 3.3000, 3.4000, 3.5000,
#         3.6000, 3.7000, 3.8000, 3.9000, 4.0000, 4.1000, 4.2000, 4.3000, 4.4000,
#         4.5000, 4.6000, 4.7000, 4.8000, 4.9000])
# X_repeat repeats each test input 50 times along the columns
# tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
#         [0.1000, 0.1000, 0.1000,  ..., 0.1000, 0.1000, 0.1000],
#         [0.2000, 0.2000, 0.2000,  ..., 0.2000, 0.2000, 0.2000],
#         ...,
#         [4.7000, 4.7000, 4.7000,  ..., 4.7000, 4.7000, 4.7000],
#         [4.8000, 4.8000, 4.8000,  ..., 4.8000, 4.8000, 4.8000],
#         [4.9000, 4.9000, 4.9000,  ..., 4.9000, 4.9000, 4.9000]])
X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
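# Attention weights: negate the squared query-key difference, divide by 2, then normalize with softmax
# softmax with dim=0 would normalize down the rows, so each column sums to 1
# softmax with dim=1 normalizes across the columns, so each row sums to 1
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Multiply the attention weights by the training labels y_train to get the predictions y_hat
y_hat = torch.matmul(attention_weights, y_train)
# Plot the result of nonparametric attention pooling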
plot_kernel_reg(y_hat)
d2l.plt.show()
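The `repeat_interleave` plus `reshape` step can also be expressed with broadcasting. The following is a minimal sketch, not the post's own code: the fixed seed and the names `weights_repeat` and `weights_bcast` are assumptions introduced for illustration. It also checks that `dim=1` makes every row of the weight matrix sum to 1.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed (an assumption) so the sketch is reproducible
x_train, _ = torch.sort(torch.rand(50) * 5)  # keys
x_test = torch.arange(0, 5, 0.1)             # queries

# Version used in the post: repeat each query n_train times, then reshape
X_repeat = x_test.repeat_interleave(len(x_train)).reshape((-1, len(x_train)))
weights_repeat = nn.functional.softmax(-(X_repeat - x_train) ** 2 / 2, dim=1)

# Broadcasting version: (n_test, 1) - (n_train,) -> (n_test, n_train)
weights_bcast = nn.functional.softmax(
    -(x_test.unsqueeze(1) - x_train) ** 2 / 2, dim=1)

print(weights_bcast.shape)           # one row per query, one column per key
print(weights_bcast.sum(dim=1)[:3])  # row sums of the attention weights
```

Both versions should produce the same matrix; broadcasting simply avoids materializing the repeated queries by hand.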

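The advantage of the nonparametric approach is that there are no parameters to learn. There are also theoretical results showing that, given enough data, it can recover the underlying function.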

Attention weights

# Visualize the attention weights
d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs', ylabel='Sorted test inputs')

Full code for this part

import torch
from torch import nn
from d2l import torch as d2l

# The true function f used to generate the labels
def f(x):
    return 2 * torch.sin(x) + x ** 0.8

# Number of training samples
n_train = 50
# Training features x_train, sampled from [0, 5) and sorted
x_train, _ = torch.sort(torch.rand(n_train) * 5)
# Training labels: f(x_train) plus Gaussian noise (mean 0, std 0.5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
# Test features on [0, 5) with step 0.1
x_test = torch.arange(0, 5, 0.1)
# Ground-truth labels for the test features
y_truth = f(x_test)

# Repeat each test input n_train times and reshape into a 2-D matrix
# X_repeat has shape torch.Size([50, 50]), i.e. (n_test, n_train)
# x_test is
# tensor([0.0000, 0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000, 0.8000,
#         0.9000, 1.0000, 1.1000, 1.2000, 1.3000, 1.4000, 1.5000, 1.6000, 1.7000,
#         1.8000, 1.9000, 2.0000, 2.1000, 2.2000, 2.3000, 2.4000, 2.5000, 2.6000,
#         2.7000, 2.8000, 2.9000, 3.0000, 3.1000, 3.2000, 3.3000, 3.4000, 3.5000,
#         3.6000, 3.7000, 3.8000, 3.9000, 4.0000, 4.1000, 4.2000, 4.3000, 4.4000,
#         4.5000, 4.6000, 4.7000, 4.8000, 4.9000])
# X_repeat repeats each test input 50 times along the columns
# tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
#         [0.1000, 0.1000, 0.1000,  ..., 0.1000, 0.1000, 0.1000],
#         [0.2000, 0.2000, 0.2000,  ..., 0.2000, 0.2000, 0.2000],
#         ...,
#         [4.7000, 4.7000, 4.7000,  ..., 4.7000, 4.7000, 4.7000],
#         [4.8000, 4.8000, 4.8000,  ..., 4.8000, 4.8000, 4.8000],
#         [4.9000, 4.9000, 4.9000,  ..., 4.9000, 4.9000, 4.9000]])
X_repeat = x_test.repeat_interleave(n_train).reshape((-1, n_train))
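# Attention weights: negate the squared query-key difference, divide by 2, then normalize with softmax
# softmax with dim=0 would normalize down the rows, so each column sums to 1
# softmax with dim=1 normalizes across the columns, so each row sums to 1
attention_weights = nn.functional.softmax(-(X_repeat - x_train)**2 / 2, dim=1)
# Visualize the attention weights
d2l.show_heatmaps(attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs', ylabel='Sorted test inputs')
d2l.plt.show()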


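Parametric attention pooling relies on batch matrix multiplication. Suppose two tensors have shapes (n, a, b) and (n, b, c); their batch matrix multiplication then produces an output of shape (n, a, c).

import torch

X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
print(torch.bmm(X, Y).shape)  # torch.Size([2, 1, 6])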

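Batch matrix multiplication can be used to compute the weighted averages of the values in a minibatch.

import torch

# weights has shape [2, 10]
weights = torch.ones((2, 10)) * 0.1
values = torch.arange(20.0).reshape((2, 10))
# Batch matrix multiplication computes the weighted averages
# weights.unsqueeze(1) has shape [2, 1, 10]: a new dimension is inserted at index 1
# values.unsqueeze(-1) has shape [2, 10, 1]: a new dimension is appended at the end
torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1))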
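To see why this batch matrix multiplication is a weighted average, it may help to compare it with an elementwise product followed by a row sum. This is a small sketch; the names `via_bmm` and `via_sum` are introduced here for illustration only.

```python
import torch

weights = torch.ones((2, 10)) * 0.1
values = torch.arange(20.0).reshape((2, 10))

# Batch matrix multiplication: (2, 1, 10) x (2, 10, 1) -> (2, 1, 1), then flatten
via_bmm = torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1)).reshape(-1)
# The same weighted average as an elementwise product and a sum over each row
via_sum = (weights * values).sum(dim=1)

print(via_bmm)
print(via_sum)
```

With uniform weights of 0.1, each result is simply the mean of the corresponding row of `values`, about 4.5 for the first row and 14.5 for the second.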

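Parametric attention pooling

# Parametric attention pooling
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Learnable parameter w of shape (1,) that scales the query-key distance
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        # Repeat each query keys.shape[1] times so that queries has the
        # same shape as keys: (number of queries, number of key-value pairs)
        queries = queries.repeat_interleave(keys.shape[1]).reshape(-1, keys.shape[1])
        # Attention weights: softmax over the keys of -((query - key) * w)^2 / 2
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w) ** 2 / 2, dim=1)
        # Weighted average of the values via batch matrix multiplication,
        # flattened to shape (number of queries,)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)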
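The role of `w` is easier to see on a toy case. The sketch below applies the same scoring rule as the `forward` method to a single query; the key positions, the query value, the helper name `attention_weights`, and the sample values of `w` are all assumptions chosen for illustration.

```python
import torch
from torch import nn

keys = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])  # hypothetical key positions
query = torch.tensor(2.0)                        # hypothetical query


def attention_weights(query, keys, w):
    # Same scoring rule as NWKernelRegression.forward, for a single query
    return nn.functional.softmax(-((query - keys) * w) ** 2 / 2, dim=0)


for w in (0.5, 1.0, 3.0):
    print(w, attention_weights(query, keys, w))
```

A larger `w` narrows the kernel, so more of the weight concentrates on the keys closest to the query; a smaller `w` spreads the weight out and moves the result toward plain averaging.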

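Converting the training dataset into keys and values

import torch
from torch import nn

# Parametric attention pooling
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Learnable parameter w of shape (1,) that scales the query-key distance
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        # Repeat each query keys.shape[1] times so that queries has the
        # same shape as keys: (number of queries, number of key-value pairs)
        queries = queries.repeat_interleave(keys.shape[1]).reshape(-1, keys.shape[1])
        # Attention weights: softmax over the keys of -((query - key) * w)^2 / 2
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w) ** 2 / 2, dim=1)
        # Weighted average of the values via batch matrix multiplication,
        # flattened to shape (number of queries,)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)

def f(x):
    return 2 * torch.sin(x) + x ** 0.8

n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
# X_tile has shape (50, 50): each row is a copy of x_train
X_tile = x_train.repeat((n_train, 1))
# Repeat y_train n_train times along the row dimension to form Y_tile, of shape (n_train, n_train)
Y_tile = y_train.repeat((n_train, 1))
# Use a mask to drop the diagonal entries of X_tile, which gives the key matrix keys
# torch.eye(n_train) is 1 on the diagonal and 0 elsewhere: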
# tensor([[1., 0., 0.,  ..., 0., 0., 0.],
#         [0., 1., 0.,  ..., 0., 0., 0.],
#         [0., 0., 1.,  ..., 0., 0., 0.],
#         ...,
#         [0., 0., 0.,  ..., 1., 0., 0.],
#         [0., 0., 0.,  ..., 0., 1., 0.],
#         [0., 0., 0.,  ..., 0., 0., 1.]])
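# 1 - torch.eye(n_train) flips this to 0 on the diagonal and 1 elsewhere; the result is then cast to bool
# keys has shape [50, 49]
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
# Use the same mask to drop the diagonal entries of Y_tile, which gives the value matrix values
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape(n_train, -1)

Removing the diagonal entry shifts the rest of each row forward by one column, so keys and values have 49 columns: every training input attends to all training samples except itself.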
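The diagonal mask is easier to follow on a tiny input. The sketch below uses a hypothetical size of 3 and made-up values, with the same indexing pattern as the code in this section.

```python
import torch

n = 3  # small hypothetical size so the result is easy to read
x = torch.tensor([10.0, 20.0, 30.0])

# Each row is a copy of x, shape (3, 3)
X_tile = x.repeat((n, 1))
# Boolean mask that is False on the diagonal and True elsewhere
mask = (1 - torch.eye(n)).type(torch.bool)
# Boolean indexing flattens the kept entries; reshape back to (3, 2)
keys = X_tile[mask].reshape((n, -1))

print(keys)
```

Row `i` of `keys` contains every element of `x` except `x[i]`, so sample `i` never sees itself as a key.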

Training the parametric attention pooling model

import torch
from torch import nn
from d2l import torch as d2l

# Parametric attention pooling
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        queries = queries.repeat_interleave(keys.shape[1]).reshape(-1, keys.shape[1])
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w) ** 2 / 2, dim=1)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)

def f(x):
    return 2 * torch.sin(x) + x ** 0.8

n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
X_tile = x_train.repeat((n_train, 1))
Y_tile = y_train.repeat((n_train, 1))
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape(n_train, -1)
# Create the parametric attention pooling model
net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
# Stochastic gradient descent optimizer for the parameter update
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
# Animator for plotting the loss curve
animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[1, 5])

# Train for 5 epochs
for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train) / 2
    # Backpropagate to compute the gradients
    l.sum().backward()
    # Update the parameter
    trainer.step()
    # Print the current loss
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')
    # Add the loss to the curve
    animator.add(epoch + 1, float(l.sum()))
d2l.plt.show()

Plotting the predictions

import torch
from torch import nn
from d2l import torch as d2l

# Parametric attention pooling
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        queries = queries.repeat_interleave(keys.shape[1]).reshape(-1, keys.shape[1])
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w) ** 2 / 2, dim=1)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)

# Plot the kernel regression result
def plot_kernel_reg(y_hat):
    # Plot the ground truth y_truth and the prediction y_hat against x_test
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    # Scatter the training samples as circles
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5)

def f(x):
    return 2 * torch.sin(x) + x ** 0.8

n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
x_test = torch.arange(0, 5, 0.1)
y_truth = f(x_test)
n_test = len(x_test)

X_tile = x_train.repeat((n_train, 1))
Y_tile = y_train.repeat((n_train, 1))
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape(n_train, -1)

# Create the parametric attention pooling model
net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
# Stochastic gradient descent optimizer for the parameter update
trainer = torch.optim.SGD(net.parameters(), lr=0.5)

# Train for 5 epochs
for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train) / 2
    # Backpropagate to compute the gradients
    l.sum().backward()
    # Update the parameter
    trainer.step()
    # Print the current loss
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')

# At test time every query attends to all training samples
keys = x_train.repeat((n_test, 1))
values = y_train.repeat((n_test, 1))
# Predict with the trained model to get y_hat
y_hat = net(x_test, keys, values).unsqueeze(1).detach()
# Plot the predictions
plot_kernel_reg(y_hat)
d2l.plt.show()

# The curve becomes less smooth in regions with large attention weights
d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs', ylabel='Sorted testing inputs')

Full code for this part

import torch
from torch import nn
from d2l import torch as d2l

# Parametric attention pooling
class NWKernelRegression(nn.Module):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.w = nn.Parameter(torch.rand((1,), requires_grad=True))

    def forward(self, queries, keys, values):
        queries = queries.repeat_interleave(keys.shape[1]).reshape(-1, keys.shape[1])
        self.attention_weights = nn.functional.softmax(
            -((queries - keys) * self.w) ** 2 / 2, dim=1)
        return torch.bmm(self.attention_weights.unsqueeze(1),
                         values.unsqueeze(-1)).reshape(-1)

# Plot the kernel regression result
def plot_kernel_reg(y_hat):
    # Plot the ground truth y_truth and the prediction y_hat against x_test
    d2l.plot(x_test, [y_truth, y_hat], 'x', 'y', legend=['Truth', 'Pred'],
             xlim=[0, 5], ylim=[-1, 5])
    # Scatter the training samples as circles
    d2l.plt.plot(x_train, y_train, 'o', alpha=0.5)

def f(x):
    return 2 * torch.sin(x) + x ** 0.8

n_train = 50
x_train, _ = torch.sort(torch.rand(n_train) * 5)
y_train = f(x_train) + torch.normal(0.0, 0.5, (n_train,))
x_test = torch.arange(0, 5, 0.1)
y_truth = f(x_test)
n_test = len(x_test)

X_tile = x_train.repeat((n_train, 1))
Y_tile = y_train.repeat((n_train, 1))
keys = X_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape((n_train, -1))
values = Y_tile[(1 - torch.eye(n_train)).type(torch.bool)].reshape(n_train, -1)

# Create the parametric attention pooling model
net = NWKernelRegression()
loss = nn.MSELoss(reduction='none')
# Stochastic gradient descent optimizer for the parameter update
trainer = torch.optim.SGD(net.parameters(), lr=0.5)

# Train for 5 epochs
for epoch in range(5):
    trainer.zero_grad()
    l = loss(net(x_train, keys, values), y_train) / 2
    # Backpropagate to compute the gradients
    l.sum().backward()
    # Update the parameter
    trainer.step()
    # Print the current loss
    print(f'epoch {epoch + 1}, loss {float(l.sum()):.6f}')

# At test time every query attends to all training samples
keys = x_train.repeat((n_test, 1))
values = y_train.repeat((n_test, 1))
# Predict with the trained model to get y_hat
y_hat = net(x_test, keys, values).unsqueeze(1).detach()
# Plot the predictions
plot_kernel_reg(y_hat)
# The curve becomes less smooth in regions with large attention weights
d2l.show_heatmaps(net.attention_weights.unsqueeze(0).unsqueeze(0),
                  xlabel='Sorted training inputs', ylabel='Sorted testing inputs')
d2l.plt.show()