LLM Inference Framework Triton Inference Server Study Notes (2): The Triton Model Deployment Workflow (Step by Step)

Official documentation: Triton Inference Server documentation

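1. Introduction

The previous article gave an overall introduction to Triton Inference Server and answered the three classic questions: what, why, and how. This article turns to practice and walks through the full Triton model deployment workflow. Given a trained model, how do you deploy it to Triton Server and expose it as a service, and once a client sends a request, how is the data run through inference to produce a result? This article answers these questions.

The outline is as follows:

Overview of Triton model deployment
Preparing the model repository
Writing the model configuration file
Starting Triton Server
Accessing Triton Server from a client

OK, let's go!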

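2. Overview of Triton Model Deployment

Deploying a model with Triton takes roughly three steps:

Prepare the model repository, which contains every model to be served. Triton loads these models and runs each one on the server's CPU or GPU using the backend specified in that model's configuration.
The client sends an inference request.
The Triton scheduler dispatches the request to the appropriate model instance, which runs inference and returns the result to the client.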


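With the overall flow in mind, the rest of the article is hands-on. You can either build Triton yourself or use the image provided on NGC, which ships Triton prebuilt, and derive your own image from the official one.

Before the hands-on steps, prepare the environment by building an image that can run Triton. I chose the official prebuilt image and installed some extra packages on top of it, such as pytorch and torchvision.

The official image version used here is nvcr.io/nvidia/tritonserver:24.01-py3. Pay attention to the version: it must match CUDA, TensorRT, and so on. If they are incompatible, the models may fail to load later.

The Dockerfile is as follows: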

FROM nvcr.io/nvidia/tritonserver:24.01-py3
LABEL maintainer="wuzhongqiang <wuzhongqiang@163.com>"
COPY bigmodellearning /home/work/bigmodellearning
# WORKDIR persists across instructions; a bare "RUN cd ..." only affects that single RUN layer
WORKDIR /home/work/bigmodellearning
# Append "-i <index-url>" to any pip command below to use a specific package index
RUN pip install huggingface-hub
RUN pip install pandas
RUN pip install torch
RUN pip install torchvision
RUN pip install transformers
# Install the [all] extra, otherwise the gRPC client cannot be used for inference
RUN pip install "tritonclient[all]"
RUN pip install pillow
RUN pip install onnx
RUN pip install onnxruntime-gpu

Then run:

docker build -t wuzhongqiang/triton_img:v1 -f /home/wuzhongqiang/PycharmHome/big_model/bigmodellearning/Dockerfile .
docker push wuzhongqiang/triton_img:v1

Once the image is ready, the hands-on work below can begin.

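3. Preparing the Model Repository

First, prepare the model repository: place all trained models in a single directory, laid out in the format Triton requires. When Triton Server starts, you point it at this directory and it loads the models from there.

The model directory format is as follows:
Model repository directory

Model name
- config.pbtxt: the model configuration parameters, which determine how the model behaves when served
- output-labels-file (densenet_labels.txt): an auxiliary file specific to classification models, used to map the model's output probabilities to class labels
- version: the version directory
  - model-definition-file: the actual model file, whose name depends on the model format:
    - TensorRT: model.plan
    - ONNX: model.onnx
    - TorchScript: model.pt
    - TensorFlow: model.graphdef or model.savedmodel
    - Python: model.py
    - DALI: model.dali
    - OpenVINO: model.xml and model.bin
    - Custom: model.so
  - The directory name is the version number and is used for version control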
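
The layout above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_model_skeleton` and the `DEFAULT_MODEL_FILES` table are hypothetical, and the file names are simply the ones listed above. It creates the `<model>/<version>/` directories and reports where `config.pbtxt` and the model file are expected to live.

```python
import tempfile
from pathlib import Path

# Default model file name per format, taken from the list above (subset).
DEFAULT_MODEL_FILES = {
    "tensorrt": "model.plan",
    "onnx": "model.onnx",
    "torchscript": "model.pt",
    "python": "model.py",
}


def make_model_skeleton(repo_root, model_name, model_format, version=1):
    """Create <repo_root>/<model_name>/<version>/ and return the expected file paths."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    return {
        "config": model_dir / "config.pbtxt",
        "model_file": version_dir / DEFAULT_MODEL_FILES[model_format],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_model_skeleton(tmp, "resnet50_onnx", "onnx")
        print(paths["config"].relative_to(tmp).as_posix())      # resnet50_onnx/config.pbtxt
        print(paths["model_file"].relative_to(tmp).as_posix())  # resnet50_onnx/1/model.onnx
```

The helper only creates directories; the model file and config.pbtxt still have to be written separately.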

With this in mind, we can prepare the model repository. I start with the same model in three formats: resnet50_torch, resnet50_trt, and resnet50_onnx.

import os
import torch
from torchvision import models


class PrepareModel(object):

    @staticmethod
    def torch_model(save_dir, model):
        model.eval().cuda()
        input_ = torch.randn(1, 3, 224, 224).cuda()
        resnet50_traced = torch.jit.trace(model, input_)        # or resnet50_traced = torch.jit.script(model)
        print(model(input_).shape)      # [1, 1000]  ResNet-50 is a 1000-class classifier
        out_dir = f'{save_dir}/model_repo/resnet50_torch/1'
        os.makedirs(out_dir, exist_ok=True)     # the version directory must exist before saving
        resnet50_traced.save(f'{out_dir}/model.pt')

    @staticmethod
    def onnx_model(save_dir, model):
        model.eval().cuda()
        input_ = torch.randn(1, 3, 224, 224).cuda()
        # Keep these names in sync with the input/output names in config.pbtxt;
        # if they differ, loading the model later fails
        input_names = ["actual_input_1"]
        output_names = ["output_1"]
        # Requires onnx:
        # pip install onnx  -i https://pkgs.d.xiaomi.net/artifactory/api/pypi/pypi-virtual/simple (or another index)
        # The netron.app tool can visualize the model graph online: https://netron.app/
        # opset_version is set to 15 here; the default would be 17, and Triton Server
        # then reports a "version too high" error when loading
        out_dir = f'{save_dir}/model_repo/resnet50_onnx/1'
        os.makedirs(out_dir, exist_ok=True)
        torch.onnx.export(model, input_, f'{out_dir}/model.onnx',
                          input_names=input_names, output_names=output_names,
                          dynamic_axes={'actual_input_1': {0: 'batch_size'}, 'output_1': {0: 'batch_size'}},
                          opset_version=15)


if __name__ == "__main__":
    parent_dir = os.getcwd()
    model = models.resnet50(pretrained=True)

    # Generate the torch model
    print("torch model save...")
    PrepareModel.torch_model(parent_dir, model)

    # Generate the onnx model
    print("onnx model save...")
    PrepareModel.onnx_model(parent_dir, model)

    # Generate the TensorRT model
    # This is done from the command line

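TensorRT is built for inference. Developed by NVIDIA, it optimizes deep learning model computation on GPUs and significantly improves inference performance there. Its advantages:

Reduced Precision: quantizes the model to INT8 or FP16 (with accuracy unchanged or only slightly reduced) to speed up inference.
Layer and Tensor Fusion: fuses multiple layers (both horizontally and vertically) to optimize GPU memory and bandwidth usage.
Kernel Auto-Tuning: selects the best data layers and algorithms for the GPU platform in use.
Dynamic Tensor Memory: minimizes the memory footprint and reuses tensor memory efficiently.
Multi-Stream Execution: uses a scalable design to process multiple input streams in parallel.
Time Fusion: uses dynamically generated kernels to optimize RNNs across time steps.

Building the TensorRT version of the model is a bit more involved. The steps are: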

# Check whether CUDA is installed
sudo apt install nvidia-cuda-toolkit
nvcc -V

# Build the TensorRT model
# First install TensorRT
# Download page: https://developer.nvidia.com/nvidia-tensorrt-8x-download  (pick the build that matches your OS and CUDA version; I downloaded the deb package)
# sudo dpkg -i nv-tensorrt-local-repo-ubuntu2004-8.6.1-cuda-12.0_1.0-1_amd64.deb
# sudo apt-get install -y aptitude

# The above fails because many dependencies are missing, so install TensorRT the following way instead
# Install the CUDA toolkit package: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local
# Install cuDNN
sudo apt-get -y install cudnn9-cuda-12
# Install TensorRT (TensorRT supports GPUs only)
# The installed version must match the Triton Server version, otherwise Triton Server cannot load the model later
# Compatibility matrix of the components: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html
sudo apt-get install tensorrt
# Add it to the PATH environment variable
vim ~/.bashrc
export PATH=$PATH:/usr/src/tensorrt/bin

# With the preparation done, convert the ONNX model above to TensorRT with a single command:
trtexec --onnx=/home/zhongqiang/bigmodellearning/triton_learning/model_repo/resnet50_onnx/1/model.onnx --explicitBatch --optShapes=actual_input_1:16x3x224x224 --maxShapes=actual_input_1:32x3x224x224 --minShapes=actual_input_1:1x3x224x224 --best --saveEngine=/home/zhongqiang/bigmodellearning/triton_learning/model_repo/resnet50_trt/1/model.plan

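The end result: all three models have been created. The next step is to write a configuration file for each of them.

For now each model gets the simplest possible configuration file, just to check that all of them load correctly. The details are covered in the next section.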

cd resnet50_onnx
vim config.pbtxt

platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "actual_input_1"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

cd resnet50_torch
vim config.pbtxt

platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
  name: "output__0"
  data_type: TYPE_FP32
  dims: [ 1000 ]
}

cd resnet50_trt
vim config.pbtxt

name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "actual_input_1"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output {
  name: "output_1"
  data_type: TYPE_FP32
  dims: [ 1000 ]
  label_filename: "imagenet-simple-labels.json"
}
dynamic_batching {
  preferred_batch_size: [ 2, 4 ]
}

Now start the server and load the models:

# Start the container. The --gpus option is required so the container can use the host GPUs; otherwise the TensorRT model cannot be used
# Using the official image:
sudo docker run -ti --rm --network=host --gpus all -v ~/bigmodellearning:/mnt/bigmodellearning --name triton-server nvcr.io/nvidia/tritonserver:24.01-py3
# Or using the custom image built above:
sudo docker run -ti --rm --network=host --gpus all -v ~/bigmodellearning:/mnt/bigmodellearning --name triton-server-v1 micr.cloud.mioffice.cn/wuzhongqiang/triton_img:v1

# Start Triton Server
tritonserver --model-repository=./model_repo/

In the end, all three models load successfully.

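4. Model Configuration File

This is the config.pbtxt file under each model directory. Above I listed only the simplest parameters; this section looks at the other parameters in detail.

Note: if strict-model-config=false is specified when starting the Triton service, TensorRT, TensorFlow SavedModel, and ONNX models do not need a config.pbtxt, because Triton Server can read the required batch size, input, and output information directly from these three kinds of model files. We will look at this in detail later.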

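4.1 Required parameters

These are the parameters a config.pbtxt file must contain:

backend or platform: which backend the model uses. For some models either of the two may be specified; for others a specific one is required.

In the accompanying figure, green cells mean either of the two may be used, red means required, and yellow means optional.
max_batch_size: the largest batch the model may run inference on. It is usually used to keep inference from exceeding GPU memory, and can be determined beforehand through performance testing.

The following example shows the effect of the batch size; the code is given in the client section below.

Ten images were placed in the test_data directory beforehand. Specifying different batch sizes (batch_size=8 versus batch_size=1) groups the same images into different numbers of inference requests.

The TensorRT model's config.pbtxt sets max_batch_size to 8. If a request exceeds this value, the server returns an error.
input and output: the names of the input and output tensors, which determine where data is fed in and where the inference results are read from.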

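The following examples show how max_batch_size, input, and output are set. The three are related:
- First panel from the left (the most common case): max_batch_size greater than 0
  - The model is a TensorRT model
  - The maximum batch size is 8
  - Two inputs, input0 and input1, each of size [3, 224, 224]. Note that the batch dimension is not included here; this is only the image shape [channels, height, width]
  - When max_batch_size is greater than 0, the inputs do not specify the batch dimension; Triton treats the batch dimension as implicit and variable
  - The output output0 is a 16-dimensional vector. If the model file requires the output to be, say, a 4-dimensional tensor, a reshape in the output settings can convert it to the size the model requires
  The input and output names are specified by the client when it calls the inference service, so the names must match.
- Second panel: max_batch_size equal to 0
  - The model is a TensorRT model
  - The maximum batch size is 0, which means the model's inputs and outputs do not carry an implicit batch dimension, so every dimension of the inputs and outputs must be written out
  - Two inputs, input0 and input1, each of size [3, 224, 224]. Here the model input really is [3, 224, 224], so it accepts a single image, not a batch of images
  - With max_batch_size set to 0, supporting batched inference requires adding the batch dimension explicitly to the input and output dims, and Triton no longer treats it as an implicit, variable batch dimension
  - The output is a 16-dimensional vector, a single vector with no batch support
- Third panel: the PyTorch special case
  - The model is a PyTorch model
  - The maximum batch size is 8
  - Two inputs, INPUT__0 and INPUT__1, each of size [3, -1, -1]. A TorchScript model file does not contain rich information about its inputs and outputs, so their names must follow a fixed pattern (string__num). If the model supports variable dimensions, those dimensions can be set to -1; setting h and w to -1 here allows inference on images of any size.
  - The output is a 16-dimensional vector
- Fourth panel: what reshape is for
  - The model is a TensorRT model
  - The maximum batch size is 0
  - Two inputs, input0 and input1, each of size [3, 224, 224]. Note that the batch dimension is not included here; this is only the image shape [channels, height, width]
  - A reshape can be added to reshape the input tensor to the desired dimensions. With max_batch_size set to 0, requests must follow the dimensions given in input; if the model file actually expects a four-dimensional input, reshape adds the batch dimension to the input.
  - The output is a 16-dimensional vector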
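
To make the relationship between max_batch_size and dims concrete, here is a minimal sketch. The function `expected_request_shape` is a hypothetical helper, not part of Triton; it only mirrors the rule described above: with max_batch_size greater than 0 the client prepends a batch dimension to dims, and with max_batch_size equal to 0 the dims already describe the whole tensor.

```python
def expected_request_shape(max_batch_size, dims, batch=1):
    """Return the tensor shape a client should send for one input.

    max_batch_size > 0: an implicit batch dimension is prepended to dims.
    max_batch_size == 0: dims already describes the full tensor shape.
    A dim of -1 (variable size) is passed through unchanged.
    """
    if max_batch_size > 0:
        if not 1 <= batch <= max_batch_size:
            raise ValueError(
                f"batch must be between 1 and max_batch_size={max_batch_size}, got {batch}"
            )
        return [batch, *dims]
    return list(dims)


print(expected_request_shape(8, [3, 224, 224], batch=4))  # [4, 3, 224, 224]
print(expected_request_shape(0, [3, 224, 224]))           # [3, 224, 224]
```

Requesting a batch of 9 with max_batch_size 8 raises an error in this sketch, which corresponds to the server-side rejection described earlier for the TensorRT model.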

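Here is an experiment with max_batch_size=0. First, change max_batch_size to 0 in the config.pbtxt of the resnet50_onnx model. Restarting tritonserver then reports an error.
I then tried adding a reshape to [1, 3, 224, 224], which did not work either. The reason is that when the ONNX model was exported from torch, its first dimension was declared as a dynamic batch dimension that can change.

So in my attempts, once max_batch_size was set to 0 for this ONNX model, the input and output dims I configured no longer described a dynamic dimension, which conflicted with the model definition and always produced an error. I stopped there: setting it to 0 is a fairly extreme case and is rarely used.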

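4.2 Important parameters

This section covers a few important parameters. Used well, they can improve inference efficiency.

The important parameter instance_group

It corresponds to an important Triton feature, concurrent model instances: the same model can have multiple execution instances, and several instances can run in parallel on a GPU.
This can raise GPU utilization and increase model throughput.
Multiple instances can be opened on one GPU or CPU, similar to choosing a number of threads.
If no specific GPU or CPU is given, the default is to open count instances on every GPU.

We run an experiment here as well. I only have one GPU, so I cannot assign instances to specific GPUs; instead I open multiple instances on a single GPU and run a performance test. The procedure:

First, add this parameter to the config.pbtxt of all three models:

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

Then add an option when starting Triton Server: tritonserver --model-repository=./model_repo/ --model-control-mode explicit. This means the server loads no models at startup; the client decides which models to load.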

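On the client side, load resnet50_torch with the following command:

curl -X POST http://localhost:8000/v2/repository/models/resnet50_torch/load

The server log shows that the resnet50_torch model has been loaded.

Once it has loaded, use the perf_analyzer tool to profile resnet50_torch and measure the serving throughput, latency, and so on.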

perf_analyzer -m resnet50_torch -b 1 --concurrency-range 64 --max-threads 32 -u localhost:8001 -i gRPC

# -m resnet50_torch: the model name is resnet50_torch
# -b 1: batch size 1
# --concurrency-range 64: 64 concurrent client requests
# --max-threads 32: the client uses at most 32 threads
# -u localhost:8001: the service URL is port 8001 on localhost
# -i gRPC: test over the gRPC interface

# Result
*** Measurement Settings ***
  Batch size: 1   # batch size is 1
  Service Kind: Triton   # the service is Triton
  Using "time_windows" mode for stabilization  # "time_windows" stabilization mode
  Measurement window: 5000 msec     # the measurement window is 5000 ms
  Using synchronous calls for inference   # synchronous inference calls
  Stabilizing using average latency   # stabilization is based on average latency

# Client and server performance at a request concurrency of 64
Request concurrency: 64
  Client:
    Request count: 11210   # the client sent 11210 requests, about 622.538 inferences per second (infer/sec)
    Throughput: 622.538 infer/sec
    Avg latency: 102503 usec (standard deviation 14891 usec)  # average client latency is about 102503 microseconds (usec), with a standard deviation of 14891 usec showing the latency spread
    p50 latency: 101190 usec
    p90 latency: 102145 usec
    p95 latency: 102523 usec
    p99 latency: 111301 usec
    Avg gRPC time: 102489 usec ((un)marshal request/response 65 usec + response wait 102424 usec)
  Server:
    Inference count: 11210
    Execution count: 11210
    Successful request count: 11210
    Avg request latency: 101966 usec (overhead 19 usec + queue 100365 usec + compute input 91 usec + compute infer 1477 usec + compute output 12 usec)

# With 64 concurrent requests, the instance on the GPU reaches a throughput of about 622 inferences per second
# with an average latency of about 102 ms (roughly 0.1 s). These numbers quantify the model's performance at this load level
Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 622.538 infer/sec, latency 102503 usec   # one instance on the GPU: about 622 infer/sec, latency about 102 ms

# Unload the model
curl -X POST http://localhost:8000/v2/repository/models/resnet50_torch/unload

# Change the instance count to 2, restart the server, reload the model, and run the load test again
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client:
    Request count: 16344
    Throughput: 907.455 infer/sec
    Avg latency: 70382 usec (standard deviation 13000 usec)
    p50 latency: 69536 usec
    p90 latency: 70124 usec
    p95 latency: 70329 usec
    p99 latency: 70893 usec
    Avg gRPC time: 70368 usec ((un)marshal request/response 61 usec + response wait 70307 usec)
  Server:
    Inference count: 16344
    Execution count: 16344
    Successful request count: 16344
    Avg request latency: 69894 usec (overhead 22 usec + queue 67697 usec + compute input 89 usec + compute infer 2072 usec + compute output 13 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 907.455 infer/sec, latency 70382 usec    # two instances on the GPU: about 907 infer/sec, latency about 70 ms; performance improves, but not proportionally
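
As a sanity check on these numbers, throughput and latency at a fixed concurrency are tied together by Little's law: with 64 requests always in flight, average latency is roughly concurrency divided by throughput. The sketch below applies that to the two runs above; the function name is hypothetical and the inputs are copied from the perf_analyzer output.

```python
def expected_latency_ms(concurrency, throughput_per_sec):
    """Little's law for a closed loop: latency ~= concurrency / throughput."""
    return concurrency / throughput_per_sec * 1000.0


# (throughput in infer/sec, measured average latency in usec) from the runs above
runs = {
    "1 instance": (622.538, 102503),
    "2 instances": (907.455, 70382),
}

for name, (throughput, measured_usec) in runs.items():
    predicted = expected_latency_ms(64, throughput)
    print(f"{name}: predicted {predicted:.1f} ms, measured {measured_usec / 1000:.1f} ms")
```

The predicted values (about 102.8 ms and 70.5 ms) agree with the measured averages, which confirms the latencies are on the order of 0.1 s rather than 1 s.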

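A quick note on terminology:

Metrics such as p50 and p90 are latency percentiles. Specifically:

p50 is the 50th percentile, that is, the median: 50% of requests have a latency lower than or equal to this value.
p90 is the 90th percentile: 90% of requests have a latency lower than or equal to this value.

For example, a p90 latency of 102145 microseconds means that 90% of all measured requests had a latency lower than or equal to 102145 microseconds.

Percentiles are important indicators of service quality because they describe the latency distribution more completely than the average does. They matter a great deal for user experience, especially p95 and p99, which capture latency in the extreme cases, that is, the worst user experience.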
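
The definition can be illustrated with a small nearest-rank percentile function. This is a sketch for intuition only: the sample latencies are made up, and perf_analyzer may compute its percentiles with a different interpolation rule.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latency samples in microseconds, with one slow outlier
latencies_usec = [98, 99, 100, 101, 101, 102, 102, 103, 111, 250]

print(sum(latencies_usec) / len(latencies_usec))  # 116.7 (the mean is pulled up by the outlier)
print(percentile(latencies_usec, 50))             # 101
print(percentile(latencies_usec, 90))             # 111
print(percentile(latencies_usec, 99))             # 250
```

The single slow request barely moves p50 but dominates p99, which is why tail percentiles are reported alongside the average.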

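I also compared GPU utilization with one instance on the GPU and with two instances on the same GPU.

With two instances it reached 100% in my test, so opening multiple instances on a single GPU does raise GPU utilization.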

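Scheduling policy: the next important parameters are Scheduler and Batching

They tell Triton which scheduling policy to use for incoming inference requests. Different policies can also improve GPU performance.
1. Default scheduler: no batching. Requests are passed to the model for inference exactly as they arrive. This is the default if nothing is specified.
2. Dynamic batcher: instead of running a request immediately, the server waits a short time and concatenates all requests that arrive within that time into one batch. This makes better use of the hardware and improves parallelism. The drawback is that individual users wait longer, so it is not suited to low-frequency request scenarios. The two commonly used parameters are the batch sizes the server should prefer to form (preferred_batch_size) and the time limit for forming a batch (max_queue_delay_microseconds, for example 100 microseconds). The latter can be seen as the batching time window: only requests arriving within this interval are combined into one batch. The larger the value, the longer the server is willing to wait to batch more requests together, which increases throughput but may also increase latency.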
  
  As an experiment, add the following to the resnet50_torch configuration file:
```
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8, 16 ]
}

# Restart Triton Server and load the resnet50_torch model manually
tritonserver --model-repository=./model_repo/ --model-control-mode explicit

# Client
curl -X POST http://localhost:8000/v2/repository/models/resnet50_torch/load
perf_analyzer -m resnet50_torch -b 1 --concurrency-range 64 --max-threads 32 -u localhost:8001 -i gRPC

# Load test result
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client:
    Request count: 20563
    Throughput: 1141.31 infer/sec
    Avg latency: 55969 usec (standard deviation 16177 usec)
    p50 latency: 55030 usec
    p90 latency: 55425 usec
    p95 latency: 55568 usec
    p99 latency: 57847 usec
    Avg gRPC time: 55947 usec ((un)marshal request/response 144 usec + response wait 55803 usec)
  Server:
    Inference count: 20563
    Execution count: 1287
    Successful request count: 20563
    Avg request latency: 54301 usec (overhead 167 usec + queue 40529 usec + compute input 2294 usec + compute infer 11277 usec + compute output 33 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 1141.31 infer/sec, latency 55969 usec
# With dynamic_batching the throughput is about 1141 infer/sec and the latency about 56 ms, better than two instances on the GPU
```
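  There are three more advanced options:
  - preserve_ordering: when set, results are returned in exactly the same order in which the requests arrived, even with dynamic batching
  - priority_levels: defines the order in which requests of different priorities are handled, so that higher-priority requests can be batched and sent to the backend first
  - queue policy: controls the behavior of the request queue, for example how long the queue may be, or a timeout after which a request is rejected without being run
3. Sequence batcher: a scheduler dedicated to stateful models. It guarantees that all requests of the same streaming sequence are sent to the same instance, so that the model instance's state is preserved.
4. Ensemble scheduler: this scheduler combines different modules into a pipeline. It will be covered later in the part on ensemble models.
The schedulers and scheduling policies above directly affect server performance and deserve close attention.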
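
One way to see dynamic batching at work in the perf_analyzer output is to compare the server's inference count with its execution count: their ratio is the average number of requests executed together. A minimal sketch, using the counts reported in the runs in this section (the function name is hypothetical):

```python
def average_batch_size(inference_count, execution_count):
    """Average number of requests the server executed per model execution."""
    return inference_count / execution_count


# Default scheduler run: every request is executed on its own
print(round(average_batch_size(11210, 11210), 2))  # 1.0

# Dynamic batching run with preferred_batch_size [2, 4, 8, 16]
print(round(average_batch_size(20563, 1287), 2))   # 15.98
```

With dynamic batching the server executed almost 16 requests per model execution, which matches the largest preferred batch size configured in the experiment.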

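Optimization policy (optimization): currently supported for two kinds of models, with acceleration through the TensorRT accelerator.
As an experiment, I added the accelerator parameters to the configuration file of the ResNet-50 ONNX model. In my test the results were about the same, with no significant improvement.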

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters: { key: "precision_mode" value: "FP16" }
        parameters: { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}

# Server
tritonserver --model-repository=./model_repo/ --model-control-mode explicit
# Client
curl -X POST http://localhost:8000/v2/repository/models/resnet50_onnx/load
perf_analyzer -m resnet50_onnx -b 1 --concurrency-range 64 --max-threads 32 -u localhost:8001 -i gRPC

# Load test result for the resnet50_onnx model without the optimization parameters:
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client:
    Request count: 26016
    Throughput: 1443.47 infer/sec
    Avg latency: 44340 usec (standard deviation 278 usec)
    p50 latency: 44334 usec
    p90 latency: 44680 usec
    p95 latency: 44794 usec
    p99 latency: 45074 usec
    Avg gRPC time: 44319 usec ((un)marshal request/response 143 usec + response wait 44176 usec)
  Server:
    Inference count: 26016
    Execution count: 1626
    Successful request count: 26016
    Avg request latency: 42865 usec (overhead 170 usec + queue 31815 usec + compute input 2283 usec + compute infer 8560 usec + compute output 35 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 1443.47 infer/sec, latency 44340 usec

# Load test result with the optimization parameters:
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client:
    Request count: 25888
    Throughput: 1436.23 infer/sec
    Avg latency: 44554 usec (standard deviation 286 usec)
    p50 latency: 44543 usec
    p90 latency: 44911 usec
    p95 latency: 45043 usec
    p99 latency: 45319 usec
    Avg gRPC time: 44532 usec ((un)marshal request/response 139 usec + response wait 44393 usec)
  Server:
    Inference count: 25888
    Execution count: 1618
    Successful request count: 25888
    Avg request latency: 43080 usec (overhead 167 usec + queue 31976 usec + compute input 2302 usec + compute infer 8601 usec + compute output 34 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 1436.23 infer/sec, latency 44554 usec
# The ONNX model reaches about 1436 infer/sec with a latency of about 44.5 ms; the ONNX model is indeed faster than the PyTorch one

# I also measured the TensorRT model while I was at it; it far outperforms both PyTorch and ONNX
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client:
    Request count: 87816
    Throughput: 4839.96 infer/sec
    Avg latency: 13213 usec (standard deviation 1759 usec)
    p50 latency: 13161 usec
    p90 latency: 15368 usec
    p95 latency: 16039 usec
    p99 latency: 17489 usec
    Avg gRPC time: 13183 usec ((un)marshal request/response 176 usec + response wait 13007 usec)
  Server:
    Inference count: 87823
    Execution count: 6824
    Successful request count: 87823
    Avg request latency: 11408 usec (overhead 474 usec + queue 7328 usec + compute input 1844 usec + compute infer 1066 usec + compute output 694 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 4839.96 infer/sec, latency 13213 usec  # about 4839 infer/sec, latency about 13 ms

Putting the results of the experiments above side by side shows how effective each parameter is.
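
The comparison can be reproduced from the perf_analyzer summaries above. The sketch below tabulates throughput, average latency, and speedup relative to the first run. The numbers are copied from the outputs in this section; the row labels are my own shorthand, and which instance count was active in the dynamic batching run is an assumption the source does not state.

```python
# (label, throughput in infer/sec, average latency in usec), copied from the runs above
results = [
    ("torch, 1 instance", 622.538, 102503),
    ("torch, 2 instances", 907.455, 70382),
    ("torch, dynamic batching", 1141.31, 55969),
    ("onnx", 1443.47, 44340),
    ("onnx + TensorRT accelerator", 1436.23, 44554),
    ("tensorrt", 4839.96, 13213),
]


def speedup_table(rows):
    """Return (label, throughput, latency in ms, speedup vs. the first row)."""
    baseline = rows[0][1]
    return [(label, thr, lat_usec / 1000, thr / baseline) for label, thr, lat_usec in rows]


for label, thr, lat_ms, speedup in speedup_table(results):
    print(f"{label:<28} {thr:>9.2f} infer/s {lat_ms:>7.1f} ms {speedup:>6.2f}x")
```

Read this way, a second instance gives roughly 1.5x, dynamic batching roughly 1.8x, and the TensorRT engine close to 7.8x the throughput of the single-instance PyTorch baseline.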

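4.3 Other parameters

Two more parameters are introduced here. They are less important but occasionally useful.

version_policy: which model versions to serve
A model directory can contain many version directories. When a model has several versions, which one or which ones should be served? This is specified with version_policy.

all: serve every version of the model
latest: serve the most recent versions; num_versions: 1 serves only the single latest version
specific: serve exactly the versions listed

Below, I create a version 2 under the resnet50_onnx model and try this parameter in config.pbtxt (use only one of the three lines at a time):

# config.pbtxt
version_policy: { all { } }   # serve all versions
version_policy: { latest { num_versions: 1 } }   # serve only the latest version
version_policy: { specific { versions: 1 } }   # serve specific versions, here version 1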
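
The effect of the three policies can be sketched as a small selection function. This is an illustration of the behavior described above, not Triton's implementation; `served_versions` and its parameters are hypothetical names, and "latest" is taken to mean the numerically highest version directories.

```python
def served_versions(available, policy, num_versions=1, versions=()):
    """Pick which version directories would be served under a given policy."""
    available = sorted(available)
    if policy == "all":
        return available
    if policy == "latest":
        # The highest-numbered versions count as the latest ones
        return available[-num_versions:]
    if policy == "specific":
        wanted = set(versions)
        return [v for v in available if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


available = [1, 2]  # resnet50_onnx with version directories 1 and 2
print(served_versions(available, "all"))                     # [1, 2]
print(served_versions(available, "latest", num_versions=1))  # [2]
print(served_versions(available, "specific", versions=[1]))  # [1]
```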

After a restart, both versions are served, and a specific version can be selected at inference time.

root@zouyilin:/mnt/bigmodellearning/triton_learning/triton_cli# python3 img_cli.py --model_name resnet50_onnx --model_version 2 --img_dir /mnt/bigmodellearning/triton_learning/test_data/pic --batch_size 5 --cli_type http
pics nums: 10
model_name: resnet50_onnx, model_version: 2, cli_type: http, cur_batch: 0_5, batch size: 5
    elephant.png: 12.28624439239502 (101) = tusker
    horse.png: 15.936531066894531 (339) = common sorrel
    koala.png: 17.28411102294922 (105) = koala
    tree.png: 12.409147262573242 (975) = lakeshore
    apple.png: 6.7075605392456055 (923) = plate
model_name: resnet50_onnx, model_version: 2, cli_type: http, cur_batch: 5_10, batch size: 5
    cat.png: 13.10097885131836 (282) = tiger cat
    airliner.png: 17.840190887451172 (404) = airliner
    banana.png: 15.781821250915527 (954) = banana
    dog.png: 15.80997371673584 (263) = Pembroke Welsh Corgi
    pandas.png: 13.727736473083496 (388) = giant panda

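model_warmup
- Right after initialization, some models show unstable inference performance and may be slow on the first requests, so they need a warmup phase before inference settles down
- The model_warmup field defines this warmup phase
- Once it is specified, Triton sends warmup requests to the model while loading it, so the model is warmed up before serving
- During warmup the model cannot serve requests, so this parameter increases model loading time and delays readiness
The following experiment shows the process:
```
# Add the model_warmup parameter to the resnet50_onnx model
model_warmup [
  {
    batch_size: 32
    name: "warmup_requests"
    inputs {
      key: "actual_input_1"
      value: {
        random_data: true
        dims: [ 3, 224, 224 ]
        data_type: TYPE_FP32
      }
    }
  }
]

# Start Triton Server
tritonserver --model-repository=./model_repo --log-verbose 1
```
The log shows that after tritonserver starts, it runs a warmup on the ONNX model, and only after the warmup completes is the model reported as loaded.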

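5. Starting Triton Server

Before starting, either build Triton yourself or use the Docker image from NGC and run the commands inside a container.

Starting the container (docker run options):

--gpus: which GPUs the container can see; all exposes every GPU, and a device= value (for example --gpus device=0) restricts which GPUs are used
-it: interact with the container
--rm: automatically remove the container once its task finishes
--shm-size: the size of the shared memory the container can access. This is quite useful, for example for communication between backends and for data transfer between the modules of an ensemble model; the size can affect how the service behaves at runtime
-p: the ports to listen on, in the form host_port:container_port.
- 8000: HTTP requests
- 8001: gRPC requests
- 8002: metrics (monitoring)
-v: directory mapping; usually the model repository on the host is mapped into the container
The last argument is the name of the NGC container image

On startup, the server prints the models that can be loaded, some of the parameter settings, and the ports each network protocol listens on.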

Once Triton Server is running, the following command checks the server's health without having to write a client. If it reports ready, the server can be used.

curl -v localhost:8000/v2/health/ready

root@zouyilin:/mnt/bigmodellearning/triton_learning# curl -v localhost:8000/v2/health/ready
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK   # the server is ready
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact
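
The same readiness check can be scripted, for example to wait for the server before sending traffic. This is a minimal sketch using only the Python standard library; the function name and the default URL are assumptions, while the /v2/health/ready path is the one used in the curl command above.

```python
import urllib.request


def is_server_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    url = f"{base_url}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # connection refused, timeouts, and non-2xx responses
        return False


if __name__ == "__main__":
    print("ready" if is_server_ready() else "not ready")
```

The Python tritonclient package also offers readiness checks on its client objects; the standard-library version here just avoids the extra dependency.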

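Some commonly used options:

log-verbose: the verbosity level of the log. It was used in the warmup example to see when warmup ran; information such as which step the model is executing and which functions are called is shown when the level is set to 1 or higher
strict-model-config: when false, the TensorRT, TensorFlow SavedModel, and ONNX models mentioned earlier do not need a config.pbtxt; they can become ready without that file. The default is false
As a test, suppose the config.pbtxt files under the resnet50_onnx and resnet50_trt directories are deleted. With this parameter set to true, resnet50_onnx and resnet50_trt can no longer be loaded:

tritonserver --model-repository=./model_repo/ --strict-model-config=true

With the parameter set to false, they start normally. TorchScript models, however, must have a config.pbtxt; if it is removed, an error is reported even when this parameter is false.
strict-readiness: defines when the readiness check above reports ready. If true, it returns ready only when all models in the repository are ready; if false, it returns ready as soon as some model is ready
exit-on-error: if false, the server still starts even when some models in the repository fail to load; if true, the server starts only when every model is ready (this one is quite useful)

http-port, grpc-port, metrics-port: the ports the service listens on at startup. The defaults are 8000, 8001, and 8002; any other ports must be specified explicitly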

# These options help when ports 8000, 8001, and 8002 are already taken by another Triton Server container

# When starting the container
sudo docker run --gpus all -it --rm --net host --shm-size 1g -p8003:8003 -p8004:8004 -p8005:8005 -v ~/bigmodellearning:/mnt/bigmodellearning --name triton-server-v1 micr.cloud.mioffice.cn/wuzhongqiang/triton_img:v1

# Start Triton Server
tritonserver --model-repository model_repo/ --http-port 8003 --grpc-port 8004 --metrics-port 8005

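model-control-mode: how the model repository is managed; this one is also used often
1. none: the server loads all models at startup, and once serving begins, models cannot be unloaded or updated dynamically
2. explicit: the server loads no models at startup; the client then uses the model control API to load or unload models dynamically
3. poll: served models are updated dynamically. For example, after the server has started and loaded its models, a new model added to the repository is loaded automatically, and a change to a model's config is picked up as well. This is similar to starting a backend app with uvicorn and --reload, where code changes trigger a reload. Note that in poll mode, models cannot be loaded or unloaded through the control API.
repository-poll-secs: the interval at which the repository is checked for updates; only effective when the model control mode is poll

load-model: loads specific models at server startup; effective when the model control mode is explicit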

# The server can also load specific models at startup with this option
tritonserver --model-repository=./model_repo --strict-model-config=false --model-control-mode explicit --load-model resnet50_onnx

# Client
curl -X POST http://localhost:8000/v2/repository/models/resnet50_onnx/load
curl -X POST http://localhost:8000/v2/repository/models/resnet50_onnx/unload
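
The load and unload calls are plain HTTP POST requests, so they are easy to script. The sketch below only builds the request object with the Python standard library; `model_control_request` is a hypothetical helper, and the URL pattern is the one used in the curl commands above.

```python
import urllib.request


def model_control_request(model_name, action, base_url="http://localhost:8000"):
    """Build the POST request for the repository API (action is 'load' or 'unload')."""
    if action not in ("load", "unload"):
        raise ValueError(f"action must be 'load' or 'unload', got {action!r}")
    url = f"{base_url}/v2/repository/models/{model_name}/{action}"
    return urllib.request.Request(url, data=b"", method="POST")


req = model_control_request("resnet50_onnx", "unload")
print(req.get_method(), req.full_url)
# POST http://localhost:8000/v2/repository/models/resnet50_onnx/unload

# To actually send it against a server running in explicit mode:
# urllib.request.urlopen(req)
```

Validating the action name up front catches typos such as "upload" before any request is sent.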

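pinned-memory-pool-byte-size: the total amount of pinned CPU memory Triton Server may allocate. Pinned memory makes data transfer between CPU and GPU more efficient during inference
cuda-memory-pool-byte-size: the maximum amount of CUDA memory that may be allocated; the default is 64 MB
backend-directory: where the backends are stored. If you implement your own backends later, this tells Triton Server where to find them
repoagent-directory: for programs that preprocess the model repository. For example, to apply encryption when a model is loaded, package the encryption program as a repoagent, place it in this directory, and set this option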

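6. Sending Requests to Triton Server

How do you write a client that sends inference requests? There are three main ways: HTTP requests, gRPC requests, or calling the API directly.

I implemented the HTTP and gRPC variants. The code is given below; the experiments above also used this img_cli.py file.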

import argparse
import json
import os
import queue
import time
from functools import partial

import numpy as np
import tritonclient.grpc as tritongrpcclient
import tritonclient.http as tritonhttpclient
from PIL import Image
from torchvision import transforms


def completion_callback(user_data, result, error):
    user_data._completed_requests.put((result, error))


class UserData(object):

    def __init__(self):
        self._completed_requests = queue.Queue()


class ImageClient(object):
    VERBOSE = False
    INPUT_DTYPE = 'FP32'
    HTTP_URL = 'localhost:8000'
    GRPC_URL = 'localhost:8001'

    def __init__(self,
                 model_name: str = 'resnet50_trt',
                 model_version: str = '1',
                 batch_size: int = 1,
                 cli_type: str = 'http'):
        assert model_name in ['resnet50_torch', 'resnet50_onnx', 'resnet50_trt'], "model name is invalid!"
        self._model_name = model_name
        self._model_version = model_version
        self._http_client = tritonhttpclient.InferenceServerClient(url=ImageClient.HTTP_URL,
                                                                   verbose=ImageClient.VERBOSE)
        self._grpc_client = tritongrpcclient.InferenceServerClient(url=ImageClient.GRPC_URL,
                                                                   verbose=ImageClient.VERBOSE)
        self._cli_type = cli_type
        self._batch_size = batch_size
        # Tensor names must match the names in each model's config.pbtxt
        if model_name == 'resnet50_torch':
            self._input_name, self._output_name = 'input__0', 'output__0'
        elif model_name == 'resnet50_onnx':
            self._input_name, self._output_name = 'actual_input_1', 'output_1'
        elif model_name == 'resnet50_trt':
            self._input_name, self._output_name = 'actual_input_1', 'output_1'
        self._user_data = UserData()

    def _preprocess(self, img):
        imagenet_mean = [0.485, 0.456, 0.406]
        imagenet_std = [0.229, 0.224, 0.225]    # standard ImageNet per-channel std
        resize = transforms.Resize((256, 256))
        center_crop = transforms.CenterCrop(224)
        to_tensor = transforms.ToTensor()
        normalize = transforms.Normalize(mean=imagenet_mean, std=imagenet_std)
        transform = transforms.Compose([resize, center_crop, to_tensor, normalize])
        image_tensor = transform(img).unsqueeze(0).cuda()
        return image_tensor

    def _img_process(self, img):
        _, file_extension = os.path.splitext(img)
        if file_extension[1:] == 'png':
            # PNG images can have 4 channels (RGBA, A is alpha), while the transform below
            # expects a 3-channel color image, so convert first
            image = Image.open(img).convert('RGB')
        else:
            image = Image.open(img)
        image_tensor = self._preprocess(image)
        image_numpy = image_tensor.cpu().numpy()
        return image_numpy

    def _check_model(self, triton_client):
        model_metadata = triton_client.get_model_metadata(model_name=self._model_name,
                                                          model_version=self._model_version)
        model_config = triton_client.get_model_config(model_name=self._model_name,
                                                      model_version=self._model_version)
        return model_metadata, model_config

    def _gen_triton_input_output(self, image_numpy, input_shape):
        # InferRequestedOutput also accepts a class_count argument, which tells Triton the model is a
        # classifier so the response contains class labels instead of raw scores. That requires the
        # model configuration to provide a label file. Without it, only the raw inference result is
        # returned, and the client can derive the class from it, for example:
        # output = tritonhttpclient.InferRequestedOutput(self._output_name, class_count=1)
        if self._cli_type == 'http':
            input_0 = tritonhttpclient.InferInput(self._input_name, input_shape, ImageClient.INPUT_DTYPE)
            input_0.set_data_from_numpy(image_numpy)
            output = tritonhttpclient.InferRequestedOutput(self._output_name)
        else:
            input_0 = tritongrpcclient.InferInput(self._input_name, input_shape, ImageClient.INPUT_DTYPE)
            input_0.set_data_from_numpy(image_numpy)
            output = tritongrpcclient.InferRequestedOutput(self._output_name)
        return input_0, output

    def _httpcli_infer(self, input_0, output):
        # model_metadata, model_config = self._check_model(self._http_client)
        # print(f"model_metadata: {model_metadata}")
        # print(f"model_config: {model_config}")
        response = self._http_client.infer(self._model_name,
                                           model_version=self._model_version,
                                           inputs=[input_0],
                                           outputs=[output])
        return response

    def _grpccli_infer(self, input_0, output):
        # model_metadata, model_config = self._check_model(self._grpc_client)
        # print(f"model_metadata: {model_metadata}")
        # print(f"model_config: {model_config}")
        # https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton_inference_server_220/user-guide/docs/python_api.html#tritongrpcclient.InferenceServerClient.infer
        response = self._grpc_client.infer(self._model_name,
                                           model_version=self._model_version,
                                           inputs=[input_0],
                                           outputs=[output])
        return response

    def _grpccli_infer_async(self, input_0, output):
        # https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton_inference_server_220/user-guide/docs/python_api.html#tritongrpcclient.InferenceServerClient.async_infer
        # The async call needs a callback. When Triton Server finishes inference, the callback
        # stores the result in user_data, where it can be picked up later
        self._grpc_client.async_infer(self._model_name,
                                      callback=partial(completion_callback, self._user_data),
                                      model_version=self._model_version,
                                      inputs=[input_0],
                                      outputs=[output])
        print("Async inference leaves time for other work; fetch the result a little later")
        time.sleep(5)
        (results, error) = self._user_data._completed_requests.get()
        return results

    def _infer_process(self, images_list):
        images_numpy = np.concatenate(images_list)
        input_0, output = self._gen_triton_input_output(images_numpy, input_shape=list(images_numpy.shape))
        if self._cli_type == 'http':
            response = self._httpcli_infer(input_0, output)
        else:
            response = self._grpccli_infer(input_0, output)
            # gRPC async variant
            # response = self._grpccli_infer_async(input_0, output)
        return response.as_numpy(self._output_name)

    def _postprocess(self, logits):
        with open('/mnt/bigmodellearning/triton_learning/model_repo/resnet50_torch/imagenet-simple-labels.json') as file:
            labels = json.load(file)
        result = []
        for logit in logits:
            logit = np.asarray(logit, dtype=np.float32)
            class_name = labels[np.argmax(logit)]
            score = np.max(logit)
            loc = np.argmax(logit)
            result.append([class_name, score, loc])
        return result

    def run_sync(self, img_dir: str):
        pics = [os.path.join(img_dir, pic) for pic in os.listdir(img_dir)]
        print(f"pics nums: {len(pics)}")
        for i in range(0, len(pics), self._batch_size):
            cur_batch_imgs = pics[i:i + self._batch_size]
            cur_batch_img_list = []
            for cur_batch_img in cur_batch_imgs:
                cur_img_numpy = self._img_process(cur_batch_img)
                cur_batch_img_list.append(cur_img_numpy)
            # infer batch
            if not cur_batch_img_list:
                break
            batch_response = self._infer_process(cur_batch_img_list)
            batch_results = self._postprocess(batch_response)
            print(f"model_name: {self._model_name}, "
                  f"model_version: {self._model_version}, "
                  f"cli_type: {self._cli_type}, "
                  f"cur_batch: {i}_{i+self._batch_size}, "
                  f"batch size: {self._batch_size}")
            for j, res in enumerate(batch_results):
                file_name = cur_batch_imgs[j].split('/')[-1]
                print(f"    {file_name}: {res[1]} ({res[2]}) = {res[0]}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', dest='model_name', type=str, help='model name')
    parser.add_argument('--model_version', dest='model_version', type=str, help='model version')
    parser.add_argument('--img_dir', dest='img_dir', type=str, help='img dir')
    parser.add_argument('--batch_size', dest='batch_size', type=str, help='batch size')
    parser.add_argument('--cli_type', dest='cli_type', type=str, default='http', help='cli type')
    args = parser.parse_args()
    img_client = ImageClient(model_name=args.model_name,
                             model_version=args.model_version,
                             batch_size=int(args.batch_size),
                             cli_type=args.cli_type)
    img_client.run_sync(args.img_dir)

# python3 img_cli.py
# --model_name resnet50_torch or resnet50_onnx or resnet50_trt
# --model_version 1
# --img_dir /mnt/bigmodellearning/triton_learning/test_data/pic
# --batch_size 2
# --cli_type http or grpc
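
The client above prints the raw model output (the logit of the top class) as its score. If probabilities are preferred, the logits can be passed through a softmax on the client side. The sketch below is a dependency-free illustration; `softmax` and `top_k` are hypothetical helpers that are not part of img_cli.py, and the three-element logit vector is made up.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most likely (class_index, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


logits = [1.0, 3.0, 2.0]  # made-up logits for a 3-class example
for index, prob in top_k(logits, k=2):
    print(f"class {index}: {prob:.3f}")
```

The class index returned here plays the same role as the value in parentheses in the client's output, and it can be mapped to a label with the same JSON label file.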

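7. Summary

That concludes the second article in this Triton Server study series. In short, it covered from a practical angle how to deploy and serve a model with Triton Server, walking through the full workflow in four steps: creating the model repository, writing the configuration file, starting the service, and sending requests from a client. With this article, you can take a trained model and deploy it with Triton.

The next article builds on this with two additions: first, how to deploy a model with the Python backend, which amounts to implementing a custom model; second, how to build an ensemble model. Both will again be worked through with hands-on examples.

With these basics in place, building more complex models becomes easy, because the principles are essentially the same and the workflow is general. The models are already packaged, so it mostly comes down to adjusting the configuration files to the form Triton requires. I also plan to write an article about backends, taking a brief look at what happens behind the scenes: using the key parts of the Triton source code to see how it works once it is given a configuration file. That should help a lot in understanding Triton's overall design.

So, let's keep going 😉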

References:

This course on Bilibili
https://github.com/leejinho610/TRT_Triton_HandsOn/blob/main/1. Model preperation (starting Point).ipynb
Introduction to Triton Inference Server
Triton Inference Server tutorials
Nvidia Triton tutorial: from bronze to king