ResNet论文精读，代码实现与拓展知识

文章目录

ResNet 论文精读
代码实现
- 网络可视化代码
拓展知识 ResNets
- 残差的调参
- 残差链接的渊源
- 残差链接有效性的解释
- ResNet 深度
- ResNeXt
- - 代码实现
- 能够提点的技巧
- - 「Warmup」
  - 「Label-smoothing」
  - 「Random image cropping and patching」
  - 「Knowledge Distiallation」
  - 「Cutout」
  - 「Random erasing」
  - 「Cosine learning rate decay」
  - 「Mixup training」
  - 「AutoAugment」
  - 其他经典的Tricks

ResNet 论文精读

Deep Residual Learning for Image Recognition
深度学习引用量最多的论文
写作：问题一摆，简单分析一下，然后举一些文献，再说自己是怎么解决的
相关工作：列出来不会影响你的贡献，要说出你的工作和之前工作的区别。
这篇论文重点在于残差结构。有助于解决梯度弥散。这里的加号是相加操作不是拼接操作。
4 个Stage，每个 Stage 之内没有 feature map 变化，卷积核通道数量也不变，之外卷积核通道数量加倍。把空间维度上的信息转移到通道维度上。下采样利用步长为 2 的卷积核。Stage 间利用了 1 * 1 卷积调节通道数。

ResNet

k=3,s=1,p=1 输入输出特征大小一样
k=3,s=2,p=1 2 倍下采样
k=3,s=4,p=1 4 倍下采样

代码实现

看懂不是目标，自己手敲后才有感觉
模型实现很巧妙，GoogLeNet 后开始利用 Stage 的思想，能够容易的对模型进行放缩。

import torch.nn as nn
import torchclass BasicBlock(nn.Module):expansion = 1def __init__(self, in_channel, out_channel, stride=1, downsample=None, **kwargs):super(BasicBlock, self).__init__()self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=3, stride=stride, padding=1, bias=False)self.bn1 = nn.BatchNorm2d(out_channel)self.relu = nn.ReLU()self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel, kernel_size=3, stride=1, padding=1, bias=False)self.bn2 = nn.BatchNorm2d(out_channel)self.downsample = downsampledef forward(self, x):identity = xif self.downsample is not None:identity = self.downsample(x)out = self.conv1(x)out = self.bn1(out)out = self.relu(out)out = self.conv2(out)out = self.bn2(out)out += identityout = self.relu(out)return outclass Bottleneck(nn.Module):"""注意:原论文中,在虚线残差结构的主分支上,第一个1x1卷积层的步距是2,第二个3x3卷积层步距是1。但在pytorch官方实现过程中是第一个1x1卷积层的步距是1,第二个3x3卷积层步距是2,这么做的好处是能够在top1上提升大概0.5%的准确率。可参考Resnet v1.5 https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch"""expansion = 4def __init__(self, in_channel, out_channel, stride=1, downsample=None,groups=1, width_per_group=64):super(Bottleneck, self).__init__()width = int(out_channel * (width_per_group / 64.)) * groupsself.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=width,  kernel_size=1, stride=1, bias=False)  # squeeze channelsself.bn1 = nn.BatchNorm2d(width)# -----------------------------------------self.conv2 = nn.Conv2d(in_channels=width, out_channels=width, groups=groups, kernel_size=3, stride=stride, bias=False, padding=1)self.bn2 = nn.BatchNorm2d(width)# -----------------------------------------self.conv3 = nn.Conv2d(in_channels=width, out_channels=out_channel*self.expansion, kernel_size=1, stride=1, bias=False)  # unsqueeze channelsself.bn3 = nn.BatchNorm2d(out_channel*self.expansion)self.relu = nn.ReLU(inplace=True)self.downsample = downsampledef forward(self, x):identity = xif self.downsample is not None:identity = self.downsample(x)out = self.conv1(x)out = self.bn1(out)out = self.relu(out)out = self.conv2(out)out = self.bn2(out)out = self.relu(out)out = self.conv3(out)out = self.bn3(out)out += identityout = self.relu(out)return outclass ResNet(nn.Module):def __init__(self,block,blocks_num,num_classes=1000,include_top=True,groups=1,width_per_group=64):super(ResNet, self).__init__()self.include_top = include_topself.in_channel = 64self.groups = groupsself.width_per_group = width_per_groupself.conv1 = nn.Conv2d(3, self.in_channel, kernel_size=7, stride=2,  padding=3, bias=False)self.bn1 = nn.BatchNorm2d(self.in_channel)self.relu = nn.ReLU(inplace=True)self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)self.layer1 = self._make_layer(block, 64, blocks_num[0])self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)if self.include_top:self.avgpool = nn.AdaptiveAvgPool2d((1, 1))  # output size = (1, 1)self.fc = nn.Linear(512 * block.expansion, num_classes)for m in self.modules():if isinstance(m, nn.Conv2d):nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')def _make_layer(self, block, channel, block_num, stride=1):downsample = Noneif stride != 1 or self.in_channel != channel * block.expansion:downsample = nn.Sequential(nn.Conv2d(self.in_channel, channel * block.expansion, kernel_size=1, stride=stride, bias=False),nn.BatchNorm2d(channel * block.expansion))layers = []layers.append(block(self.in_channel,channel,downsample=downsample,stride=stride,groups=self.groups,width_per_group=self.width_per_group))self.in_channel = channel * block.expansionfor _ in range(1, block_num):layers.append(block(self.in_channel,channel,groups=self.groups,width_per_group=self.width_per_group))return nn.Sequential(*layers)def forward(self, x):x = self.conv1(x)x = self.bn1(x)x = self.relu(x)x = self.maxpool(x)x = self.layer1(x)x = self.layer2(x)x = self.layer3(x)x = self.layer4(x)if self.include_top:x = self.avgpool(x)x = torch.flatten(x, 1)x = self.fc(x)return x# # resnet34  pre-train parameters https://download.pytorch.org/models/resnet34-333f7ec4.pth
# def resnet_samll(num_classes=1000, include_top=True):#     return ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)# # resnet50  pre-train parameters https://download.pytorch.org/models/resnet50-19c8e357.pth
# def resnet(num_classes=1000, include_top=True): 
#     return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)# # resnet101 pre-train parameters https://download.pytorch.org/models/resnet101-5d3b4d8f.pth
# def resnet_big(num_classes=1000, include_top=True):
#     return ResNet(Bottleneck, [3, 4, 23, 3], num_classes=num_classes, include_top=include_top)# # resneXt pre-train parameters https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth
# def resnext(num_classes=1000, include_top=True): 
#     groups = 32
#     width_per_group = 4
#     return ResNet(Bottleneck, [3, 4, 6, 3],
#                   num_classes=num_classes,
#                   include_top=include_top,
#                   groups=groups,
#                   width_per_group=width_per_group)# # resneXt_big pre-train parameters https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth
# def resnext_big(num_classes=1000, include_top=True): 
#     groups = 32
#     width_per_group = 8
#     return ResNet(Bottleneck, [3, 4, 23, 3],
#                   num_classes=num_classes,
#                   include_top=include_top,
#                   groups=groups,
#                   width_per_group=width_per_group)def resnet34(num_classes=1000, include_top=True):# https://download.pytorch.org/models/resnet34-333f7ec4.pthreturn ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)def resnet50(num_classes=1000, include_top=True):# https://download.pytorch.org/models/resnet50-19c8e357.pthreturn ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)def resnet101(num_classes=1000, include_top=True):# https://download.pytorch.org/models/resnet101-5d3b4d8f.pthreturn ResNet(Bottleneck, [3, 4, 23, 3], num_classes=num_classes, include_top=include_top)def resnext50_32x4d(num_classes=1000, include_top=True):# https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pthgroups = 32width_per_group = 4return ResNet(Bottleneck, [3, 4, 6, 3],num_classes=num_classes,include_top=include_top,groups=groups,width_per_group=width_per_group)def resnext101_32x8d(num_classes=1000, include_top=True):# https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pthgroups = 32width_per_group = 8return ResNet(Bottleneck, [3, 4, 23, 3],num_classes=num_classes,include_top=include_top,groups=groups,width_per_group=width_per_group)

网络可视化代码

import torch.nn as nn
import torch 
import torch 
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from torchvision import transformsclass AlexNet(nn.Module):def __init__(self, num_classes=5, init_weights=False):super().__init__()self.features = nn.Sequential(nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),  # input[3, 224, 224]  output[96, 55, 55]nn.ReLU(inplace=True),nn.MaxPool2d(kernel_size=3, stride=2),                  # output[96, 27, 27]nn.Conv2d(96, 256, kernel_size=5, padding=2),           # output[256, 27, 27]nn.ReLU(inplace=True),nn.MaxPool2d(kernel_size=3, stride=2),                  # output[256, 13, 13]nn.Conv2d(256, 384, kernel_size=3, padding=1),          # output[384, 13, 13]nn.ReLU(inplace=True),nn.Conv2d(384, 384, kernel_size=3, padding=1),          # output[384, 13, 13]nn.ReLU(inplace=True),nn.Conv2d(384, 256, kernel_size=3, padding=1),          # output[256, 13, 13]nn.ReLU(inplace=True),nn.MaxPool2d(kernel_size=3, stride=2),                  # output[256, 6, 6])self.classifier = nn.Sequential(nn.Dropout(p=0.5),nn.Linear(256 * 6 * 6, 4096),nn.ReLU(inplace=True),nn.Dropout(p=0.5),nn.Linear(4096, 4096),nn.ReLU(inplace=True),nn.Linear(4096, num_classes),)if init_weights:self._initialize_weights()def forward(self, x):outputs = []for name, module in self.features.named_children():x = module(x)if name in ["0"]:outputs.append(x)return outputsdef _initialize_weights(self):for m in self.modules():if isinstance(m, nn.Conv2d):nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')if m.bias is not None:nn.init.constant_(m.bias, 0)elif isinstance(m, nn.Linear):nn.init.normal_(m.weight, 0, 0.01)nn.init.constant_(m.bias, 0)data_transform = transforms.Compose([transforms.Resize((224, 224)),transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])# create model
model = AlexNet(num_classes=5) 
# load model weights
model_weight_path = "/Users/jiangxiyu/根目录/深度学习/Deep-Learning-Image-Classification-Models-Based-CNN-or-Attention/results/weights/alexnet/AlexNet.pth"   
model.load_state_dict(torch.load(model_weight_path))
print(model)# load image
img = Image.open("/Users/jiangxiyu/根目录/深度学习/flower/train/daisy/5547758_eea9edfd54_n.jpg")
# [N, C, H, W]
img = data_transform(img)
# expand batch dimension
img = torch.unsqueeze(img, dim=0)# forward
out_put = model(img)
for feature_map in out_put:# [N, C, H, W] -> [C, H, W]im = np.squeeze(feature_map.detach().numpy())# [C, H, W] -> [H, W, C]im = np.transpose(im, [1, 2, 0])# show top 12 feature mapsplt.figure()for i in range(12):ax = plt.subplot(3, 4, i+1)# [H, W, C]plt.imshow(im[:, :, i], cmap='gray')plt.show()plt.savefig("./AelxNet_vis.jpg")

拓展知识 ResNets

残差的调参

把 BN 和激活函数调前精度更高

残差链接的渊源

其实残差连接可以看成一种特殊的跳跃连接，早在15年的时候，还有一篇Highway Networks使用了类似的结构，在旁路中设置了可训练参数。

残差链接有效性的解释

把残差网络展开，相当与集成（ensemble）了多个神经网络，这是一种典型的集成学习的思想（ensemble learning），如下图示：

ResNet 深度

GoogLeNet V3 准则：要尽可能在网络的浅层提取更多的信息
$\cdot H(x) + x$

ResNeXt

基本不改变模型复杂度或降低复杂度，同时减少了超参数，计算量减半，使用分组卷积。
g==c：xception 深度可分离卷积
g==1：普通卷积
论文中 cardinality 就是 group
三种分组卷积效果差不多，选择最简单的第三种
好处：即有残差结构（便于训练），又对特征层进行了 concat（对特征多角度理解）

代码实现

起码要敲 7，8 遍，不要眼高手低
同上文的 ResNet 代码，实现在里面

能够提点的技巧

trick（技巧）：以结果为导向

「Warmup」

* 在刚开始训练的时候先使用一个较小的学习率，训练一些epoches或iterations，等模型稳定时再修改为预先设置的学习率进行训练。 提 1 个点

「Linear scaling learning rate」
- 在论文中针对比较大的batch size而提出的一种方法。使用相同的epoch时，大batch size训练的模型与小batch size训练的模型相比，验证准确率会减小。大batch size训练的模型可以增大学习率来加快收敛。具体做法很简单，比如ResNet原论文［1］中，batch size为256时选择的学习率是0.1，当我们把batchsize变为一个较大的数b时，学习率应该变为0.1 × b/256。

「Label-smoothing」

*  在分类问题中，我们的最后一层一般是全连接层，然后对应标签的one-hot编码，即把对应类别的值编码为1，其他为0。这种编码方式和通过降低交叉熵损失来调整参数的方式结合起来，会有一些问题。这种方式会鼓励模型对不同类别的输出分数差异非常大，或者说，模型过分相信它的判断。但是，对于一个由多人标注的数据集，不同人标注的准则可能不同，每个人的标注也可能会有一些错误。模型对标签的过分相信会导致过拟合。标签平滑（Label-smoothing regularization，LSR）是应对该问题的有效方法之一，它的具体思想是降低我们对于标签的信任，例如我们可以将损失的目标值从1稍微降到0.9，或者将从0稍微升到0.1。标签平滑最早在inception-v2［4］中被提出。

「Random image cropping and patching」

* （RICAP）方法随剪四个图片的中部分，然后把它们拼接为一个图片，同时混合这四个

「Knowledge Distiallation」

* 知识蒸馏（Knowledge Distilling）是模型压缩的一种方法，是指利用已经训练完成的一个较复杂的Teacher模型，指导一个较轻量的Student模型训练，从而在减小模型大小和参数量的同时，尽量保持Teacher模型的准确率。模型复杂度如下图所示，很明显左边的老师模型的参数量远大于右边的学生模型，那么在部署上线时，只需要部署学生模型则可，这样就需要更少的显存，同时计算更快，一举两得。
* 给定输入，假定 p 是真正的概率分布，z 和 r 分别是学生模型和教师模型最后一个全连接层的输出。之前我们会用交叉熵损失l(p.softmax(z))来度量p和z之间的差异，这里的蒸馏损失同样用交叉熵。所以，使用知识蒸馏方法总的损失函数是
*  T越高，softmax的output probability distribution越趋于平滑，其分布的熵越大，负标签携带的信息会被相对地放大，模型训练将更加关注负标签。

$\text{softmax}(z)) + T^2 \cdot l(\text{softmax}(r/T),\text{softmax}(z/T))$
在这里插入图片描述

「Cutout」

* Cutout是一种新的正则化方法。原理是在训练时随机把图片的一部分减掉，这样能提高模型的鲁棒性。它的来源是计算机视觉任务中经常遇到的物体遮挡问题。通过cutout生成一些类似被遮挡的物体，不仅可以让模型在遇到遮挡问题时表现更好，还能让模型在做决定时更多地考虑环境（context）。

「Random erasing」

* Random erasing其实和cutout非常类似，也是一种模拟物体遮挡情况的数据增强方法。区别在于，cutout是把图片中随机抽中的矩形区域的像素值置为0，相当于裁剪掉，random erasing是用随机数或者数据集中像素的平均值替换原来的像素值。而且，cutout每次裁剪掉的区域大小是固定的，Random erasing替换掉的区域大小是随机的。

「Cosine learning rate decay」

* 在warmup之后的训练过程中，学习率不断衰减是一个提高精度的好方法。其中有step decay和cosine decay等，前者是随着epoch增大学习率不断减去一个小的数，后者是让学习率随着训练过程曲线下降。

在这里插入图片描述

「Mixup training」

* Mixup是一种新的数据增强的方法。Mixup training，就是每次取出2张图片，然后将它们线性组合，得到新的图片，以此来作为新的训练样本，进行网络的训练，如下公式，其中x代表图像数据，y代表标签，则得到的新的xhat，yhat。

$\lambda \cdot x_{\text{i}} + (1 - \lambda) \cdot x_j$ $\lambda \cdot y_{\text{i}} + (1 - \lambda) \cdot y_j$
* Mixup方法主要增强了训练样本之间的线性表达，增强网络的泛化能力，不过mixup方法需要较长的时间才能收敛得比较好。

「AutoAugment」

* 数据增强在图像分类问题上有很重要的作用，但是增强的方法有很多，并非一股脑地用上所有的方法就是最好的。那么，如何选择最佳的数据增强方法呢？AutoAugment就是一种搜索适合当前问题的数据增强方法的方法。该方法创建一个数据增强策略的搜索空间，利用搜索算法选取适合特定数据集的数据增强策略。此外，从一个数据集中学到的策略能够很好地迁移到其它相似的数据集上。

其他经典的Tricks

* Dropout
* L1/L2正则
* Batch Normalization
* Early stopping
* Random cropping
* Mirroring
* Rotation
* Color shifting
* Xavier init

acc = 1 data + 3 model + 2 training(超参数）

ResNet论文精读，代码实现与拓展知识

文章目录

ResNet 论文精读

代码实现

网络可视化代码

拓展知识 ResNets

残差的调参

残差链接的渊源

残差链接有效性的解释

ResNet 深度

ResNeXt

代码实现

能够提点的技巧

「Warmup」

「Label-smoothing」

「Random image cropping and patching」

「Knowledge Distiallation」

「Cutout」

「Random erasing」

「Cosine learning rate decay」

「Mixup training」

「AutoAugment」

其他经典的Tricks

相关文章