0. Introduction
For deep learning, accelerating a model so it can be embedded in C++ is well worth the effort, because the .pt file produced by training is fairly inefficient to run as-is. Having covered BEVDET, we will now use CUDA-FastBEV as an example to show readers how to run the CUDA version of Fast-BEV. The original project has quite a few problems, so the author adapted his own version. I recently received an invitation to try Compshare, the GPU compute cloud platform under UCloud, which happens to cover my GPU needs for LLM and autonomous-driving work. They offer cost-effective 4090 GPUs billed at 2.08 yuan per card per hour, or only 1.36 yuan per hour with a monthly pass, along with 200 GB of free disk space. That meets my needs for now, and features such as accelerated access and dedicated IPs make it faster to get a project up and running.
The environment setup has already been covered in "How to Set Up a LLaMA3 Environment on a Shared GPU Platform (LLaMA-Factory)" and in "Learning from BEVDET How to Generate a TensorRT Engine and Write the C++ Side". Custom builds, whether LibTorch or CUDA, were covered in "Installing LibTorch on Ubuntu 20.04 and Setting Up a Gaussian Splatting Environment". In this chapter we look at how to run the TensorRT-based CUDA-FastBEV project on the platform.
1. Model Inference
1.1 Download the model and data into the CUDA-FastBEV directory
- Download model.zip
- Download nuScenes-example-data.zip
# Download the model and data into CUDA-FastBEV
cd CUDA-FastBEV

# Unzip the model and data
unzip model.zip
unzip nuScenes-example-data.zip

# The directory structure after unzipping looks like this
CUDA-FastBEV
|-- example-data
|   |-- 0-FRONT.jpg
|   |-- 1-FRONT_RIGHT.jpg
|   |-- ...
|   |-- example-data.pth
|   |-- x.tensor
|   |-- y.tensor
|   `-- valid_c_idx.tensor
|-- src
|-- ptq
|-- model
|   |-- resnet18int8
|   |   |-- fastbev_pre_trt.onnx
|   |   |-- fastbev_post_trt_decode.onnx
|   |   `-- fastbev_ptq.pth
|   |-- resnet18
|   `-- resnet18int8head
`-- tool
1.2 Configure environment.sh
- Install the Python dependencies
sudo apt install libprotobuf-dev
pip install onnx
- Modify the TensorRT/CUDA/CUDNN/fastbev variable values in tool/environment.sh. You can follow the TensorRT installation steps from "Learning from BEVDET How to Generate a TensorRT Engine and Write the C++ Side". Here we installed a cuDNN 8.x version and used locate to find where it was installed. If the cuDNN version does not match the TensorRT version, follow "Ubuntu: Uninstalling and Installing cuDNN" to update cuDNN, since the TensorRT conversion depends on it:

sudo dpkg -r cudnn9-cuda-12
sudo apt-get remove --purge libcudnn9-cuda-12
sudo apt-get purge cudnn-local-repo-ubuntu2004-8.9.5.29
# Change these to the paths on your machine
export TensorRT_Lib=/home/ubuntu/TensorRT-8.6.1.6/lib
export TensorRT_Inc=/home/ubuntu/TensorRT-8.6.1.6/include
export TensorRT_Bin=/home/ubuntu/TensorRT-8.6.1.6/bin

export CUDA_Lib=/usr/local/cuda-12.1/targets/x86_64-linux/lib
export CUDA_Inc=/usr/local/cuda-12.1/targets/x86_64-linux/include
export CUDA_Bin=/usr/local/cuda-12.1/bin
export CUDA_HOME=/usr/local/cuda-12.1

#export CUDNN_Lib=/path/to/cudnn/lib

# resnet18/resnet18int8/resnet18int8head
export DEBUG_MODEL=resnet18int8
# fp16/int8
export DEBUG_PRECISION=int8
export DEBUG_DATA=example-data
export USE_Python=OFF
- Apply the environment to the current terminal.
. tool/environment.sh
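Before sourcing the file, it is worth sanity-checking the variables. The sketch below is illustrative only (the parse_exports helper and the sample env_text are mine, not part of the project): it parses the export lines and flags any configured path that does not exist on disk:

```python
import os
import re

# Sample contents in the style of tool/environment.sh (abbreviated)
env_text = """\
export TensorRT_Lib=/home/ubuntu/TensorRT-8.6.1.6/lib
export CUDA_HOME=/usr/local/cuda-12.1
export DEBUG_MODEL=resnet18int8
export DEBUG_PRECISION=int8
"""

def parse_exports(text):
    """Collect NAME=VALUE pairs from `export NAME=VALUE` lines."""
    return {m.group(1): m.group(2)
            for m in re.finditer(r"^export\s+(\w+)=(\S+)", text, re.M)}

env = parse_exports(env_text)
# Variables whose value looks like a path but does not exist on this machine
missing = [k for k, v in env.items()
           if v.startswith("/") and not os.path.isdir(v)]
print(env["DEBUG_PRECISION"])  # int8
```

If `missing` is non-empty, fix those entries before running `. tool/environment.sh`.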
1.3 Build and run
- Build the TensorRT engines
bash tool/build_trt_engine.sh
- Build and run the program
cd ~
git clone -b v1.2.1 https://github.com/traveller59/spconv.git --recurse-submodules
cd spconv
Open the setup.py script, find the build_extension function, locate the cuda_flags parameter, and add the following below it:

cuda_flags += ["-gencode", "arch=compute_52,code=sm_52",
               "-gencode", "arch=compute_60,code=sm_60",
               "-gencode", "arch=compute_61,code=sm_61",
               "-gencode", "arch=compute_70,code=sm_70",
               "-gencode", "arch=compute_75,code=sm_75",
               "-gencode", "arch=compute_80,code=sm_80",
               "-gencode", "arch=compute_86,code=sm_86",
               "-gencode", "arch=compute_86,code=compute_86"]
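The long -gencode list above covers many architectures you may not need; each entry follows the same arch=compute_XY,code=sm_XY pattern. A small sketch (gencode_flags is my own helper name, not part of spconv) shows how to derive the entry for the GPU you actually have:

```python
def gencode_flags(major, minor):
    """Build the nvcc -gencode pair for one compute capability."""
    cc = f"{major}{minor}"
    return ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]

# An RTX 4090 is compute capability 8.9:
print(gencode_flags(8, 9))  # ['-gencode', 'arch=compute_89,code=sm_89']

# With a CUDA build of PyTorch available, the capability can be queried:
try:
    import torch
    if torch.cuda.is_available():
        print(gencode_flags(*torch.cuda.get_device_capability(0)))
except ImportError:
    pass
```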
Open CMakeLists.txt and add the following line:
set(CMAKE_VERBOSE_MAKEFILE ON)
Run the build command:
python setup.py bdist_wheel
# If you hit "THC/THCNumerics.cuh: No such file or directory", see: https://blog.csdn.net/zjsdbkb88/article/details/136454689
After the build finishes, the generated shared libraries are under the build folder, inside the subdirectories whose names start with lib. The two files used during development are libcuhash.so and libspconv.so; copy them to /usr/local/lib. The headers are in the include folder at the project root; copy all of its contents to /usr/local/include.
Then we go back to CUDA-FastBEV and build again:
cd CUDA-FastBEV
bash tool/run.sh
We find that it still fails.
Here we need to download the corresponding files from Lidar_AI_Solution and place them into the project.
Finally it turns out the stb_image library is missing; since it is header-only, we can simply patch it into the source.
The final output:
2. Walking Through the PTQ and ONNX-Export Code
2.1 PTQ
import argparse
import os
import random

import numpy as np
import torch
import torch.nn as nn
from copy import deepcopy

import lean.quantize as quantize
import lean.funcs as funcs

import mmcv
from mmcv import Config, DictAction
from mmcv.runner import get_dist_info, load_checkpoint
from mmdet.datasets import replace_ImageToTensor
from mmdet3d.datasets import build_dataset, build_dataloader
from mmdet3d.models import build_model
from mmdet3d.apis import single_gpu_test

# Additions
from mmcv.parallel import MMDataParallel
from mmcv.cnn.utils.fuse_conv_bn import _fuse_conv_bn
from pytorch_quantization.nn.modules.quant_conv import QuantConv2d

'''
Fuse each convolution layer with the batch-norm layer that immediately follows
it, reducing computation and speeding up inference.
'''
def fuse_conv_bn(module):
    last_conv = None       # the most recently seen conv layer
    last_conv_name = None  # its name
    for name, child in module.named_children():  # walk the child modules with their names
        if isinstance(child, (nn.modules.batchnorm._BatchNorm, nn.SyncBatchNorm)):
            # this child is a batch-norm layer
            if last_conv is None:  # only fuse BN that is after Conv
                continue
            fused_conv = _fuse_conv_bn(last_conv, child)
            # Replace the original conv with the fused conv, and replace the BN
            # with an identity mapping to avoid the pitfalls of deleting modules.
            module._modules[last_conv_name] = fused_conv
            # To reduce changes, set BN as Identity instead of deleting it.
            module._modules[name] = nn.Identity()
            last_conv = None
        elif isinstance(child, QuantConv2d) or isinstance(child, nn.Conv2d):
            # A conv layer (quantized QuantConv2d or standard nn.Conv2d):
            # remember it so the following BN layer can be fused into it.
            last_conv = child
            last_conv_name = name
        else:
            fuse_conv_bn(child)
    return module


def load_model(cfg, checkpoint_path=None):
    model = build_model(cfg.model, test_cfg=cfg.get('test_cfg'))
    checkpoint = None
    if checkpoint_path is not None:
        checkpoint = load_checkpoint(model, checkpoint_path, map_location="cpu")
    return model, checkpoint


'''
Quantize the model.
model: the model to quantize
'''
def quantize_net(model):
    quantize.quantize_backbone(model.backbone)
    quantize.quantize_neck(model.neck)
    quantize.quantize_neck_fuse(model.neck_fuse_0)
    quantize.quantize_neck_3d(model.neck_3d)
    quantize.quantize_head(model.bbox_head)
    # print(model)
    return model


def test_model(cfg, args, model, checkpoint, data_loader, dataset):
    samples_per_gpu = 1
    if isinstance(cfg.data.test, dict):
        cfg.data.test.test_mode = True
        samples_per_gpu = cfg.data.test.pop('samples_per_gpu', 1)
        if samples_per_gpu > 1:
            # Replace 'ImageToTensor' to 'DefaultFormatBundle'
            cfg.data.test.pipeline = replace_ImageToTensor(cfg.data.test.pipeline)
    elif isinstance(cfg.data.test, list):
        for ds_cfg in cfg.data.test:
            ds_cfg.test_mode = True
        samples_per_gpu = max([ds_cfg.pop('samples_per_gpu', 1) for ds_cfg in cfg.data.test])
        if samples_per_gpu > 1:
            for ds_cfg in cfg.data.test:
                ds_cfg.pipeline = replace_ImageToTensor(ds_cfg.pipeline)

    if 'CLASSES' in checkpoint.get('meta', {}):
        model.CLASSES = checkpoint['meta']['CLASSES']
    else:
        model.CLASSES = dataset.CLASSES
    # palette for visualization in segmentation tasks
    if 'PALETTE' in checkpoint.get('meta', {}):
        model.PALETTE = checkpoint['meta']['PALETTE']
    elif hasattr(dataset, 'PALETTE'):
        # segmentation dataset has `PALETTE` attribute
        model.PALETTE = dataset.PALETTE

    model = MMDataParallel(model, device_ids=[0])
    outputs = single_gpu_test(model, data_loader, args.show, args.show_dir)
    rank, _ = get_dist_info()
    if rank == 0:
        if args.out:
            print(f'\nwriting results to {args.out}')
            mmcv.dump(outputs, args.out)
        kwargs = {} if args.eval_options is None else args.eval_options
        if args.format_only:
            dataset.format_results(outputs, **kwargs)
        if args.eval:
            eval_kwargs = cfg.get('evaluation', {}).copy()
            # hard-code way to remove EvalHook args
            for key in ['interval', 'tmpdir', 'start', 'gpu_collect', 'save_best', 'rule']:
                eval_kwargs.pop(key, None)
            eval_kwargs.update(dict(metric=args.eval, **kwargs))
            print(dataset.evaluate(outputs, **eval_kwargs))


def main():
    quantize.initialize()
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", metavar="FILE", default="configs/fastbev/exp/paper/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f1.py", help="config file")
    parser.add_argument("--ckpt", default='tools/ptq/pth/epoch_20.pth', help="the checkpoint file to resume from")
    parser.add_argument("--calibrate_batch", type=int, default=200, help="calibrate batch")
    parser.add_argument("--seed", type=int, default=666, help="seed")
    parser.add_argument("--deterministic", type=bool, default=True, help="deterministic")
    parser.add_argument('--show', action='store_true', help='show results')
    parser.add_argument('--show-dir', help='directory where results will be saved')
    parser.add_argument('--test_int8_and_fp32', default=True, help='test int8 and fp32 or not')
    parser.add_argument('--out', help='output result file in pickle format')
    parser.add_argument('--format-only', action='store_true',
                        help='Format the output results without perform evaluation. It is'
                             'useful when you want to format the result to a specific format and '
                             'submit it to the test server')
    parser.add_argument('--eval', type=str, default='bbox',
                        help='evaluation metrics, which depends on the dataset, e.g., "mAP",'
                             ' "segm", "proposal" for COCO, and "mAP", "recall" for PASCAL VOC')
    parser.add_argument('--eval-options', nargs='+', action=DictAction,
                        help='custom options for evaluation, the key-value pair in xxx=yyy '
                             'format will be kwargs for dataset.evaluate() function')
    args = parser.parse_args()
    args.ptq_only = True

    cfg = Config.fromfile(args.config)
    cfg.seed = args.seed
    cfg.deterministic = args.deterministic
    cfg.test_int8_and_fp32 = args.test_int8_and_fp32
    save_path = 'tools/ptq/pth/bev_ptq_head.pth'
    os.makedirs(os.path.dirname(save_path), exist_ok=True)

    # set random seeds
    if cfg.seed is not None:
        print(f"Set random seed to {cfg.seed}, "
              f"deterministic mode: {cfg.deterministic}")
        random.seed(cfg.seed)
        np.random.seed(cfg.seed)
        torch.manual_seed(cfg.seed)
        if cfg.deterministic:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False

    dataset_train = build_dataset(cfg.data.train)
    dataset_test = build_dataset(cfg.data.test)
    print('train nums:{} val nums:{}'.format(len(dataset_train), len(dataset_test)))
    distributed = False
    data_loader_test = build_dataloader(
        dataset_test,
        samples_per_gpu=1,
        workers_per_gpu=1,
        dist=distributed,
        shuffle=False,
    )
    print('Test DataLoader Info:', data_loader_test.batch_size, data_loader_test.num_workers)

    # Create Model
    model_fp32, checkpoint = load_model(cfg, checkpoint_path=args.ckpt)
    model_int8 = deepcopy(model_fp32)
    if cfg.test_int8_and_fp32:
        model_fp32 = fuse_conv_bn(model_fp32)
        model_fp32 = MMDataParallel(model_fp32, device_ids=[0])
        model_fp32.eval()
        print('############################## fp32 ##############################')
        test_model(cfg, args, model_fp32, checkpoint, data_loader_test, dataset_test)

    model_int8 = quantize_net(model_int8)
    model_int8 = fuse_conv_bn(model_int8)
    model_int8 = MMDataParallel(model_int8, device_ids=[0])
    model_int8.eval()

    ## Calibrate
    print("Start calibrate 🌹🌹🌹🌹🌹🌹 ")
    quantize.set_quantizer_fast(model_int8)
    quantize.calibrate_model(model_int8, data_loader_test, 0, None, args.calibrate_batch)
    torch.save(model_int8, save_path)

    if cfg.test_int8_and_fp32:
        print('############################## int8 ##############################')
        test_model(cfg, args, model_int8, checkpoint, data_loader_test, dataset_test)
    return


if __name__ == "__main__":
    main()
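The arithmetic behind fuse_conv_bn is easy to check numerically. Treating a 1x1 convolution as a plain matrix multiply, folding an inference-mode BN into the conv means scaling the weights by gamma/sqrt(var+eps) and absorbing the running mean and shift into a bias. The sketch below (all numbers invented) verifies that the fused form matches conv followed by BN:

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout = 4, 6
w = rng.normal(size=(cout, cin))          # 1x1 conv == per-pixel matmul
gamma = rng.uniform(0.5, 1.5, size=cout)  # BN scale
beta = rng.normal(size=cout)              # BN shift
mean = rng.normal(size=cout)              # BN running mean
var = rng.uniform(0.5, 2.0, size=cout)    # BN running variance
eps = 1e-5

x = rng.normal(size=(cin,))               # one pixel's feature vector

# reference: conv followed by batch-norm (inference form)
y_ref = gamma * (w @ x - mean) / np.sqrt(var + eps) + beta

# fused: w' = diag(gamma/std) @ w, b' = beta - gamma*mean/std
std = np.sqrt(var + eps)
w_fused = (gamma / std)[:, None] * w
b_fused = beta - gamma * mean / std
y_fused = w_fused @ x + b_fused

print(np.allclose(y_ref, y_fused))  # True
```

The same folding applies per output channel of a KxK convolution, which is exactly what _fuse_conv_bn does internally.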
2.2 Exporting ONNX
import argparse
from argparse import ArgumentParser
import math
import copy
import os

import numpy as np
import torch
import torch.nn as nn

import onnx
import onnxsim
from onnxsim import simplify

from mmseg.ops import resize
import mmcv
from mmcv import Config, DictAction
from mmcv.runner import load_checkpoint
# import warnings
from mmdet3d.datasets import build_dataloader, build_dataset
from mmdet3d.apis import init_model

import lean.quantize as quantize
from ptq_bev import quantize_net, fuse_conv_bn
from lean import tensor

box_code_size = 9
cfg_n_voxels = [[200, 200, 4]]
cfg_voxel_size = [[0.5, 0.5, 1.5]]
nv = 6
'''
Simplify an ONNX model.
onnx_path: path to the ONNX model file
'''
def simplify_onnx(onnx_path):
    onnx_model = onnx.load(onnx_path)          # load the ONNX model
    model_simp, check = simplify(onnx_model)   # run onnxsim's simplify
    assert check, "simplify onnx model fail!"  # make sure the simplified model is valid
    onnx.save(model_simp, onnx_path)
    print("finish simplify onnx!")
'''
Generate the point-cloud coordinates.
n_voxels: number of voxels along each dimension.
voxel_size: size of each voxel.
origin: position of the origin.
'''
@torch.no_grad()
def get_points(n_voxels, voxel_size, origin):
    points = torch.stack(torch.meshgrid([
        torch.arange(n_voxels[0]),
        torch.arange(n_voxels[1]),
        torch.arange(n_voxels[2]),
    ]))  # generate the grid points with torch.meshgrid
    new_origin = origin - n_voxels / 2.0 * voxel_size
    # turn voxel indices into real 3D coordinates using the voxel size and origin
    points = points * voxel_size.view(3, 1, 1, 1) + new_origin.view(3, 1, 1, 1)
    return points
'''
Compute the projection matrices from the LiDAR frame to the image plane.
img_meta: image metadata, including the intrinsics and extrinsics.
stride: downsampling stride.
noise: noise value (optional).
'''
def compute_projection(img_meta, stride, noise=0):
    projection = []
    intrinsic = torch.tensor(img_meta["lidar2img"]["intrinsic"][:3, :3])  # camera intrinsics
    intrinsic[:2] /= stride
    extrinsics = map(torch.tensor, img_meta["lidar2img"]["extrinsic"])  # extrinsics
    for extrinsic in extrinsics:
        if noise > 0:
            # combine the intrinsics and extrinsics into a projection matrix
            projection.append(intrinsic @ extrinsic[:3] + noise)
        else:
            projection.append(intrinsic @ extrinsic[:3])
    return torch.stack(projection)  # stack the per-camera projection matrices
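To make compute_projection concrete, here is a toy numeric example (all values invented): the intrinsics are scaled by the feature stride, combined with a [3,4] extrinsic, and the result projects one homogeneous LiDAR point onto the downsampled feature map:

```python
import numpy as np

# Toy pinhole intrinsics; the extrinsic is an identity lidar->camera transform.
intrinsic = np.array([[800.0,   0.0, 352.0],
                      [  0.0, 800.0, 128.0],
                      [  0.0,   0.0,   1.0]])
stride = 4
intrinsic[:2] /= stride  # account for the feature-map downsampling

extrinsic = np.hstack([np.eye(3), np.zeros((3, 1))])  # [3, 4]
projection = intrinsic @ extrinsic                    # [3, 3] @ [3, 4] -> [3, 4]

point = np.array([2.0, 1.0, 10.0, 1.0])  # homogeneous LiDAR point
u, v, depth = projection @ point
print(round(u / depth, 2), round(v / depth, 2), depth)  # 128.0 52.0 10.0
```

Dividing by `depth` is the same perspective division that backproject_inplace performs, and points with `depth <= 0` or pixel coordinates outside the feature map are the ones masked out by its `valid` check.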
'''
Get the projected output from the multi-level features. Corresponds to the
extract_feat function in fastbev.py, lines 178-228.
img: input image.
img_metas: list of image metadata.
mlvl_feats: multi-level features.
'''
def get_project_output(img, img_metas, mlvl_feats):
    # downsampling stride, computed from the image width and the feature-map width
    stride_i = math.ceil(img.shape[-1] / mlvl_feats.shape[-1])
    mlvl_feat_split = torch.split(mlvl_feats, nv, dim=1)  # split the features into groups of nv views
    volume_list = []
    for seq_id in range(len(mlvl_feat_split)):  # iterate over the sequences
        volumes = []
        for batch_id, seq_img_meta in enumerate(img_metas):  # iterate over each batch and sequence
            feat_i = mlvl_feat_split[seq_id][batch_id]  # features of this sequence
            img_meta = copy.deepcopy(seq_img_meta)  # image metadata holding the intrinsics/extrinsics
            img_meta["lidar2img"]["extrinsic"] = img_meta["lidar2img"]["extrinsic"][seq_id * 6:(seq_id + 1) * 6]
            if isinstance(img_meta["img_shape"], list):
                img_meta["img_shape"] = img_meta["img_shape"][seq_id * 6:(seq_id + 1) * 6]
                img_meta["img_shape"] = img_meta["img_shape"][0]
            # downsample the image height and width by the stride
            height = math.ceil(img_meta["img_shape"][0] / stride_i)
            width = math.ceil(img_meta["img_shape"][1] / stride_i)
            # compute the projection matrices
            projection = compute_projection(img_meta, stride_i, noise=0).to(feat_i.device)
            n_voxels, voxel_size = cfg_n_voxels[0], cfg_voxel_size[0]
            points = get_points(
                n_voxels=torch.tensor(n_voxels),
                voxel_size=torch.tensor(voxel_size),
                origin=torch.tensor(img_meta["lidar2img"]["origin"]),
            ).to(feat_i.device)  # point-cloud coordinates, from the voxel sizes and origin
            # project the 2D features and points into the 3D voxel grid
            volume = backproject_inplace(feat_i[:, :, :height, :width], points, projection)
            volumes.append(volume)  # collect the volume of each sequence
        volume_list.append(torch.stack(volumes))
    mlvl_volumes = torch.cat(volume_list, dim=1)
    return mlvl_volumes


'''
Back-project 2D features into a 3D volume.
features: 2D feature maps
points: point-cloud coordinates
projection: projection matrices
'''
def backproject_inplace(features, points, projection):
    '''
    function: 2d feature + predefined point cloud -> 3d volume
    input:
        features: [6, 64, 225, 400]
        points: [3, 200, 200, 12]
        projection: [6, 3, 4]
    output:
        volume: [64, 200, 200, 12]
    '''
    n_images, n_channels, height, width = features.shape    # feature dimensions
    n_x_voxels, n_y_voxels, n_z_voxels = points.shape[-3:]  # voxel-grid dimensions
    # [3, 200, 200, 12] -> [1, 3, 480000] -> [6, 3, 480000]
    points = points.view(1, 3, -1).expand(n_images, 3, -1)
    # [6, 3, 480000] -> [6, 4, 480000]: append a row of ones to get homogeneous coordinates
    points = torch.cat((points, torch.ones_like(points[:, :1])), dim=1)
    # ego_to_cam
    # [6, 3, 4] * [6, 4, 480000] -> [6, 3, 480000]
    points_2d_3 = torch.bmm(projection, points)  # lidar2img; torch.bmm is a batched matrix multiply
    x = (points_2d_3[:, 0] / points_2d_3[:, 2]).round().long()  # [6, 480000], image-plane x coordinates
    y = (points_2d_3[:, 1] / points_2d_3[:, 2]).round().long()  # [6, 480000]
    z = points_2d_3[:, 2]  # [6, 480000]
    valid = (x >= 0) & (y >= 0) & (x < width) & (y < height) & (z > 0)  # [6, 480000]
    # method2: fill only the valid features; duplicated features simply overwrite
    volume = torch.zeros((n_channels, points.shape[-1]), device=features.device).type_as(features)
    for i in range(n_images):
        # write the features into the volume, indexed by image, channel and point coordinates
        volume[:, valid[i]] = features[i, :, y[i, valid[i]], x[i, valid[i]]]
    volume = volume.view(n_channels, n_x_voxels, n_y_voxels, n_z_voxels)
    return volume
'''
Return the decoded bounding boxes.
anchors: anchor parameters.
deltas: encoded box deltas.
'''
def decode(anchors, deltas):
    """Apply transformation `deltas` (dx, dy, dz, dw, dh, dl, dr, dv*) to `boxes`.

    Args:
        anchors: a tensor of shape (N, 7), where N is the number of anchors.
            Each anchor has 7 parameters [x, y, z, w, l, h, r]: the box center
            (x, y, z), width w, length l, height h and rotation angle r.
        deltas: a tensor of shape (N, 7+n), where n is the number of extra
            velocity parameters (if present). The encoded deltas for each
            anchor, in the format [dx, dy, dz, dw, dh, dl, dr, velo*].

    Returns:
        torch.Tensor: Decoded boxes.
    """
    cas, cts = [], []
    # split the anchors along the last dim into center, size and angle parameters
    xa, ya, za, wa, la, ha, ra, *cas = torch.split(anchors, 1, dim=-1)
    xt, yt, zt, wt, lt, ht, rt, *cts = torch.split(deltas, 1, dim=-1)
    za = za + ha / 2  # anchor bottom -> center height
    diagonal = torch.sqrt(la**2 + wa**2)  # anchor diagonal length
    xg = xt * diagonal + xa  # decoded center coordinates
    yg = yt * diagonal + ya
    zg = zt * ha + za
    lg = torch.exp(lt) * la  # decoded length
    wg = torch.exp(wt) * wa  # decoded width
    hg = torch.exp(ht) * ha  # decoded height
    rg = rt + ra  # decoded rotation angle
    zg = zg - hg / 2  # decoded center -> bottom height
    cgs = [t + a for t, a in zip(cts, cas)]  # decoded velocity parameters
    return torch.cat([xg, yg, zg, wg, lg, hg, rg, *cgs], dim=-1)
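A quick sanity check on the decode logic: with all-zero deltas the decoded box must equal the anchor itself. The decode_one helper below is a scalar re-implementation of the same arithmetic, written here only for illustration:

```python
import numpy as np

def decode_one(anchor, delta):
    """Decode one (x, y, z, w, l, h, r) anchor with (dx..dr) deltas,
    mirroring the tensor decode() above."""
    xa, ya, za, wa, la, ha, ra = anchor
    xt, yt, zt, wt, lt, ht, rt = delta
    za = za + ha / 2                    # anchor bottom -> center height
    diag = np.sqrt(la**2 + wa**2)       # anchor diagonal length
    xg = xt * diag + xa                 # decoded center
    yg = yt * diag + ya
    zg = zt * ha + za
    lg = np.exp(lt) * la                # decoded size
    wg = np.exp(wt) * wa
    hg = np.exp(ht) * ha
    rg = rt + ra                        # decoded rotation
    zg = zg - hg / 2                    # decoded center -> bottom height
    return [xg, yg, zg, wg, lg, hg, rg]

# zero deltas must return the anchor unchanged
anchor = [1.0, 2.0, -1.0, 1.6, 3.9, 1.5, 0.0]
print(decode_one(anchor, [0.0] * 7))
```

The diagonal scaling of dx/dy and the exp on the size deltas are the standard SECOND-style box encoding, which is why zero deltas round-trip exactly.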
'''
Pre-stage model class: takes the input images and extracts features.
forward(img): runs the images through the backbone and neck to extract features.
'''
class TRTModel_pre(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.seq = 1         # number of sequences
        self.nv = 6          # number of views per sequence
        self.batch_size = 1  # batch size

    def forward(self, img):
        # Corresponds to the extract_feat function in fastbev.py, lines 119-176.
        # Merge the batch and view dimensions: an input of shape (B, N, C, H, W)
        # becomes (B*N, C, H, W), so every view goes through the backbone.
        img = img.reshape([-1] + list(img.shape)[2:])
        x = self.model.backbone(img)     # backbone: extract features
        mlvl_feats = self.model.neck(x)  # neck: multi-level features
        mlvl_feats = list(mlvl_feats)
        if self.model.multi_scale_id is not None:
            mlvl_feats_ = []
            for msid in self.model.multi_scale_id:  # iterate over the multi-scale ids
                # fpn output fusion
                if getattr(self.model, f'neck_fuse_{msid}', None) is not None:
                    # the model has a neck_fuse_{msid} attribute
                    fuse_feats = [mlvl_feats[msid]]
                    for i in range(msid + 1, len(mlvl_feats)):
                        # bilinearly resize the coarser levels to this scale
                        resized_feat = resize(
                            mlvl_feats[i], size=mlvl_feats[msid].size()[2:],
                            mode="bilinear", align_corners=False)
                        fuse_feats.append(resized_feat)
                    if len(fuse_feats) > 1:
                        fuse_feats = torch.cat(fuse_feats, dim=1)  # concatenate the levels
                    else:
                        fuse_feats = fuse_feats[0]
                    fuse_feats = getattr(self.model, f'neck_fuse_{msid}')(fuse_feats)
                    mlvl_feats_.append(fuse_feats)
                else:
                    mlvl_feats_.append(mlvl_feats[msid])
            mlvl_feats = mlvl_feats_
        # v3 bev ms
        # if self.model.n_voxels is a list and there are fewer feature levels
        # than voxel configurations, pad by repeating the first level
        if isinstance(self.model.n_voxels, list) and len(mlvl_feats) < len(self.model.n_voxels):
            pad_feats = len(self.model.n_voxels) - len(mlvl_feats)
            for _ in range(pad_feats):
                mlvl_feats.append(mlvl_feats[0])
        # only support one layer feature
        assert len(mlvl_feats) == 1, "only support one layer feature !"
        mlvl_feat = mlvl_feats[0]
        return mlvl_feat
'''
Post-stage model class: takes the 3D volume features and runs classification
and bounding-box prediction.
forward(mlvl_volumes): runs the multi-level volume features through the 3D neck
and the bbox head.
'''
class TRTModel_post(nn.Module):
    def __init__(self, model, device):
        super().__init__()
        self.model = model    # the wrapped model
        self.device = device  # target device
        self.num_levels = 1
        # load the predefined anchors used later for box decoding
        self.anchors = tensor.load("example-data/anchors.tensor", return_torch=True)
        self.anchors = self.anchors.to(device)
        self.nms_pre = 1000

    def forward(self, mlvl_volumes):
        # run the multi-level volume data through the 3D neck
        neck_3d_feature = self.model.neck_3d.forward(mlvl_volumes.to(self.device))
        # bbox head: classification scores, box regression and direction classification
        cls_scores, bbox_preds, dir_cls_preds = self.model.bbox_head(neck_3d_feature)
        cls_score = cls_scores[0][0]  # take the first level of each prediction
        bbox_pred = bbox_preds[0][0]
        dir_cls_pred = dir_cls_preds[0][0]
        # permute and reshape the predictions into a flat (N, C) layout
        dir_cls_pred = dir_cls_pred.permute(1, 2, 0).reshape(-1, 2)
        dir_cls_scores = torch.max(dir_cls_pred, dim=-1)[1]
        cls_score = cls_score.permute(1, 2, 0).reshape(-1, self.model.bbox_head.num_classes)
        cls_score = cls_score.sigmoid()
        bbox_pred = bbox_pred.permute(1, 2, 0).reshape(-1, self.model.bbox_head.box_code_size)
        # pre-NMS selection
        max_scores, _ = cls_score.max(dim=1)  # best class score of each prediction
        _, topk_inds = max_scores.topk(self.nms_pre)  # keep the nms_pre highest-scoring predictions
        # gather the matching anchors, box predictions, scores and direction scores
        anchors = self.anchors[topk_inds, :]
        bbox_pred_ = bbox_pred[topk_inds, :]
        scores = cls_score[topk_inds, :]
        dir_cls_score = dir_cls_scores[topk_inds]
        bboxes = decode(anchors, bbox_pred_)  # decode anchors + deltas into actual box coordinates
        return scores, bboxes, dir_cls_score
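The pre-NMS selection in TRTModel_post.forward boils down to ranking predictions by their best class score and keeping the top nms_pre of them. A tiny numeric illustration with toy scores:

```python
import numpy as np

# three predictions, two classes (toy numbers)
cls_score = np.array([[0.1, 0.7],
                      [0.9, 0.2],
                      [0.3, 0.4]])
nms_pre = 2

max_scores = cls_score.max(axis=1)             # best class score per prediction
topk_inds = np.argsort(-max_scores)[:nms_pre]  # indices of the top-k predictions
print(topk_inds.tolist())  # [1, 0]
```

These indices are then used to gather the matching anchors and box deltas before decoding, so only the most confident candidates reach the NMS stage.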
'''
Main entry point: parses the command-line arguments, loads the model, builds
the dataset and data loader, and exports the ONNX models.
'''
def main():
    parser = ArgumentParser()
    parser.add_argument('--config', default="configs/fastbev_m0_r18_s256x704_v200x200x4_c192_d2_f1.py", help='Config file')
    parser.add_argument('--checkpoint', default="ptq/pth/bev_ptq_head.pth", help='Checkpoint file')  # the checkpoint to load
    parser.add_argument('--device', default='cuda:0', help='Device used for inference')
    parser.add_argument('--outfile', type=str, default='model/resnet18int8head/', help='dir to save results')  # where the ONNX models are written
    parser.add_argument('--ptq', default=True, help='ptq or qat')
    args = parser.parse_args()
    cfg = Config.fromfile(args.config)

    # build the model from a config file and a checkpoint file
    model = init_model(args.config, device=args.device)  # corresponds to load_model above
    model_int8 = quantize_net(model)       # quantize the model
    model_int8 = fuse_conv_bn(model_int8)  # fuse conv + BN
    # The steps below mirror the single_gpu_test inference path. The model itself
    # is built by build_model, which reads the fastbev config via DETECTORS.build
    # and instantiates the FastBEV class.
    if args.ptq:
        ckpt = torch.load(args.checkpoint, map_location=args.device)
        model_int8.load_state_dict(ckpt.module.state_dict(), strict=True)
    else:
        from mmcv.runner import load_checkpoint
        load_checkpoint(model_int8, args.checkpoint, map_location=args.device)

    dataset = build_dataset(cfg.data.test)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False)  # wrap the dataset in a data loader

    def get_input_meta(data_loader):
        data = None
        for i, data in enumerate(data_loader):
            if i >= 1:
                break
        image = data['img'].data[0]
        img_metas = data["img_metas"].data[0]
        return image, img_metas

    device = next(model_int8.parameters()).device
    image, img_metas = get_input_meta(data_loader)  # fetch one input image and its metadata
    image_input = torch.tensor(image).to(device)    # move the input tensor onto the device

    trtModel_pre = TRTModel_pre(model_int8)  # build the pre (feature-extraction) model
    trtModel_pre.eval()
    output_names_pre = ['mlvl_feat']
    pre_onnx_path = os.path.join(args.outfile, 'fastbev_pre_trt_ptq.onnx')
    quantize.quant_nn.TensorQuantizer.use_fb_fake_quant = True
    torch.onnx.export(
        trtModel_pre,
        (image_input,),
        pre_onnx_path,
        input_names=['image'],
        output_names=output_names_pre,
        opset_version=13,
        enable_onnx_checker=False,
        training=torch.onnx.TrainingMode.EVAL,
        do_constant_folding=True,
    )  # export the pre model to ONNX

    mlvl_feat = trtModel_pre.forward(image_input)  # run the pre model to extract features
    _, c_, h_, w_ = mlvl_feat.shape
    # reshape the features into the layout the projection step expects
    # (corresponds to extract_feat in fastbev.py, line 182)
    mlvl_feat = mlvl_feat.reshape(trtModel_pre.batch_size, -1, c_, h_, w_)
    mlvl_volumes = get_project_output(image_input, img_metas, mlvl_feat)
    mlvl_volume = mlvl_volumes.to(device)

    trtModel_post = TRTModel_post(model_int8, device)  # build the post (head) model
    output_names_post = ["cls_score", "bbox_pred", "dir_cls_preds"]
    post_onnx_path = os.path.join(args.outfile, 'fastbev_post_trt_ptq.onnx')
    torch.onnx.export(
        trtModel_post,
        (mlvl_volume,),
        post_onnx_path,
        input_names=['mlvl_volume'],
        output_names=output_names_post,
        opset_version=13,
        enable_onnx_checker=False,
    )  # export the post model to ONNX

    simplify_onnx(pre_onnx_path)
    simplify_onnx(post_onnx_path)


if __name__ == '__main__':
    main()
2.3 The Corresponding TensorRT Implementation
std::vector<post::transbbox::BoundingBox> forward_only(const void* camera_images, void* stream, bool do_normalization) {
  nvtype::half* normed_images = (nvtype::half*)camera_images;
  if (do_normalization) {
    normed_images = (nvtype::half*)this->normalizer_->forward((const unsigned char**)(camera_images), stream);
  }
  // pre-processing: run the camera backbone
  this->camera_backbone_->forward(normed_images, stream);
  // project the 2D features into the 3D voxel grid
  nvtype::half* camera_bev = this->vtransform_->forward(this->camera_backbone_->feature(), stream);
  auto fusion_feature = this->fuse_head_->forward(camera_bev, stream);
  // post-processing: output the bounding boxes
  return this->transbbox_->forward(fusion_feature, stream, param_.transbbox.sorted_bboxes);
}
3. Demo
4. References
https://blog.csdn.net/weixin_42108183/article/details/129190315