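TensorRT is NVIDIA's own high-performance inference library. Its Getting Started page lists the entry points to the available resources.
This article is based on the current TensorRT 8.2 release and walks step by step from installation through to accelerating inference for your own ONNX model.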
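Installation
Go to the TensorRT download page and pick a version to download; registration and login are required.
This article uses TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz. Note that the package must match your CUDA and cuDNN versions. Alternatively, set up NVIDIA Docker, pull the matching nvidia/cuda image, and ADD TensorRT to it.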
# Extract into $HOME (so the samples can be built as the current user, without sudo)
tar -xzvf TensorRT-*.tar.gz -C $HOME/
# Symlink to /usr/local/TensorRT (to get a fixed path)
sudo ln -s $HOME/TensorRT-8.2.2.1 /usr/local/TensorRT
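The version matching mentioned above can be read straight off the package name. The following is a minimal sketch, assuming the naming pattern of the one file used in this article; `parse_package_name` and the regular expression are hypothetical helpers, not part of TensorRT.

```python
import re

# Pattern inferred from the single package name used in this article,
# not from an official naming specification.
PACKAGE_RE = re.compile(
    r"TensorRT-(?P<trt>[\d.]+)\.Linux\.x86_64-gnu"
    r"\.cuda-(?P<cuda>[\d.]+)\.cudnn(?P<cudnn>[\d.]+)\.tar\.gz"
)


def parse_package_name(filename: str) -> dict:
    """Split a TensorRT tarball name into TensorRT, CUDA and cuDNN versions."""
    match = PACKAGE_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized package name: {filename}")
    return match.groupdict()


info = parse_package_name(
    "TensorRT-8.2.2.1.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz"
)
print(info)
```

The extracted CUDA and cuDNN versions can then be compared with what is installed on the machine before unpacking.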
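Next, build and run the samples to make sure TensorRT is installed correctly.
Building the samples
The samples are in TensorRT/samples; for descriptions, see the Sample Support Guide or the README.md in each sample directory.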
cd /usr/local/TensorRT/samples/
# Set environment variables; see Makefile.config
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export ENABLE_DLA=
export TRT_LIB_DIR=../lib
export PROTOBUF_INSTALL_DIR=
# Build
make -j`nproc`
# Run
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd /usr/local/TensorRT/
./bin/trtexec -h
./bin/sample_mnist -d data/mnist/ --fp16
Reference output:
$ ./bin/sample_mnist -d data/mnist/ --fp16
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
[12/23/2021-20:20:16] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 333, GPU 600 (MiB)
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 333 MiB, GPU 600 MiB
[12/23/2021-20:20:16] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 468 MiB, GPU 634 MiB
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +518, GPU +224, now: CPU 988, GPU 858 (MiB)
[12/23/2021-20:20:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +114, GPU +52, now: CPU 1102, GPU 910 (MiB)
[12/23/2021-20:20:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[12/23/2021-20:20:33] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[12/23/2021-20:20:34] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/23/2021-20:20:34] [I] [TRT] Total Host Persistent Memory: 8448
[12/23/2021-20:20:34] [I] [TRT] Total Device Persistent Memory: 1626624
[12/23/2021-20:20:34] [I] [TRT] Total Scratch Memory: 0
[12/23/2021-20:20:34] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 13 MiB
[12/23/2021-20:20:34] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.01595ms to assign 3 blocks to 8 nodes requiring 57857 bytes.
[12/23/2021-20:20:34] [I] [TRT] Total Activation Memory: 57857
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1621, GPU 1116 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1621, GPU 1124 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 1086 (MiB)
[12/23/2021-20:20:34] [I] [TRT] Loaded engine size: 1 MiB
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1622, GPU 1096 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1623, GPU 1104 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1, now: CPU 0, GPU 1 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1485, GPU 1080 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1485, GPU 1088 (MiB)
[12/23/2021-20:20:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 3 (MiB)
[12/23/2021-20:20:34] [I] Input:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@%+-: =@@@@@@@@@@@@
@@@@@@@%= -@@@**@@@@@@@
@@@@@@@ :%#@-#@@@. #@@@@@@
@@@@@@* +@@@@:*@@@ *@@@@@@
@@@@@@# +@@@@ @@@% @@@@@@@
@@@@@@@. :%@@.@@@. *@@@@@@@
@@@@@@@@- =@@@@. -@@@@@@@@
@@@@@@@@@%: +@- :@@@@@@@@@
@@@@@@@@@@@%. : -@@@@@@@@@@
@@@@@@@@@@@@@+ #@@@@@@@@@@
@@@@@@@@@@@@@@+ :@@@@@@@@@@
@@@@@@@@@@@@@@+ *@@@@@@@@@
@@@@@@@@@@@@@@: = @@@@@@@@@
@@@@@@@@@@@@@@ :@ @@@@@@@@@
@@@@@@@@@@@@@@ -@ @@@@@@@@@
@@@@@@@@@@@@@# +@ @@@@@@@@@
@@@@@@@@@@@@@* ++ @@@@@@@@@
@@@@@@@@@@@@@* *@@@@@@@@@
@@@@@@@@@@@@@# =@@@@@@@@@@
@@@@@@@@@@@@@@. +@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[12/23/2021-20:20:34] [I] Output:
0:
1:
2:
3:
4:
5:
6:
7:
8: **********
9:
&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
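When running several samples, it can be handy to check the `&&&&` marker lines automatically instead of reading the whole log. This is a small sketch based only on the marker format visible in the log above; the `FAILED` status and the `sample_status` helper are assumptions for illustration.

```python
import re

# Marker format taken from the log above; FAILED is assumed to follow the same shape.
STATUS_RE = re.compile(r"^&&&& (RUNNING|PASSED|FAILED) (\S+) \[TensorRT v(\d+)\]")


def sample_status(log_text):
    """Return (status, sample name, version tag) of the last marker line, or None."""
    result = None
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            result = (match.group(1), match.group(2), match.group(3))
    return result


log = """\
&&&& RUNNING TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
[12/23/2021-20:20:16] [I] Building and running a GPU inference engine for MNIST
&&&& PASSED TensorRT.sample_mnist [TensorRT v8202] # ./bin/sample_mnist -d data/mnist/ --fp16
"""
status = sample_status(log)
print(status)
```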
Quick start
Quick Start Guide / Using The TensorRT Runtime API
Get the tutorial code and build it:
git clone --depth 1 https://github.com/NVIDIA/TensorRT.git
export CUDA_INSTALL_DIR=/usr/local/cuda
export CUDNN_INSTALL_DIR=/usr/local/cuda
export TRT_LIB_DIR=/usr/local/TensorRT/lib
# Build quickstart
cd TensorRT/quickstart
# Makefile.config
# INCPATHS += -I"/usr/local/TensorRT/include"
# common/logging.h
# void log(Severity severity, const char* msg) noexcept override
make
# Runtime environment
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
cd SemanticSegmentation
Get the pretrained FCN-ResNet-101 model and convert it to ONNX:
# Create a local environment
# conda create -n torch python=3.9 -y
# conda activate torch
# conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Otherwise, use a container environment
# docker run --rm -it --gpus all -p 8888:8888 -v `pwd`:/workspace/SemanticSegmentation -w /workspace nvcr.io/nvidia/pytorch:20.12-py3 bash
$ python export.py
Exporting ppm image input.ppm
Downloading: "https://github.com/pytorch/vision/archive/v0.6.0.zip" to /home/john/.cache/torch/hub/v0.6.0.zip
Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /home/john/.cache/torch/hub/checkpoints/resnet101-5d3b4d8f.pth
100%|████████████████████████████████████████| 170M/170M [00:27<00:00, 6.57MB/s]
Downloading: "https://download.pytorch.org/models/fcn_resnet101_coco-7ecb50ca.pth" to /home/john/.cache/torch/hub/checkpoints/fcn_resnet101_coco-7ecb50ca.pth
100%|████████████████████████████████████████| 208M/208M [02:26<00:00, 1.49MB/s]
Exporting ONNX model fcn-resnet101.onnx
Then use trtexec to convert the ONNX model into a TensorRT engine:
$ trtexec --onnx=fcn-resnet101.onnx --fp16 --workspace=64 --minShapes=input:1x3x256x256 --optShapes=input:1x3x1026x1282 --maxShapes=input:1x3x1440x2560 --buildOnly --saveEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] Engine built in 406.011 sec.
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...
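The command above builds the engine with dynamic input shapes, described by a minimum, an optimum and a maximum shape. As a rough sketch, the three `--*Shapes` arguments can be generated and sanity-checked (each dimension should satisfy min <= opt <= max) before launching a build that takes several minutes. `shape_spec` and `check_profile` are hypothetical helper names.

```python
def shape_spec(name, dims):
    """Format a shape the way the command above writes it, e.g. input:1x3x256x256."""
    return f"{name}:" + "x".join(str(d) for d in dims)


def check_profile(min_dims, opt_dims, max_dims):
    """True when the ranks agree and every dimension satisfies min <= opt <= max."""
    if not (len(min_dims) == len(opt_dims) == len(max_dims)):
        return False
    return all(lo <= mid <= hi for lo, mid, hi in zip(min_dims, opt_dims, max_dims))


# Shapes copied from the trtexec command above.
min_dims = (1, 3, 256, 256)
opt_dims = (1, 3, 1026, 1282)
max_dims = (1, 3, 1440, 2560)

if not check_profile(min_dims, opt_dims, max_dims):
    raise ValueError("inconsistent shape profile")

args = [
    f"--minShapes={shape_spec('input', min_dims)}",
    f"--optShapes={shape_spec('input', opt_dims)}",
    f"--maxShapes={shape_spec('input', max_dims)}",
]
print(" ".join(args))
```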
Test the engine with random input:
$ trtexec --shapes=input:1x3x1026x1282 --loadEngine=fcn-resnet101.engine
...
[01/07/2022-20:20:00] [I] === Performance summary ===
[01/07/2022-20:20:00] [I] Throughput: 12.4749 qps
[01/07/2022-20:20:00] [I] Latency: min = 76.9746 ms, max = 98.8354 ms, mean = 79.5844 ms, median = 78.0542 ms, percentile(99%) = 98.8354 ms
[01/07/2022-20:20:00] [I] End-to-End Host Latency: min = 150.942 ms, max = 188.431 ms, mean = 155.834 ms, median = 152.444 ms, percentile(99%) = 188.431 ms
[01/07/2022-20:20:00] [I] Enqueue Time: min = 0.390625 ms, max = 1.61279 ms, mean = 1.41182 ms, median = 1.46136 ms, percentile(99%) = 1.61279 ms
[01/07/2022-20:20:00] [I] H2D Latency: min = 1.25977 ms, max = 1.53467 ms, mean = 1.27415 ms, median = 1.26514 ms, percentile(99%) = 1.53467 ms
[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, percentile(99%) = 97.1318 ms
[01/07/2022-20:20:00] [I] D2H Latency: min = 0.408447 ms, max = 0.454346 ms, mean = 0.425577 ms, median = 0.423004 ms, percentile(99%) = 0.454346 ms
[01/07/2022-20:20:00] [I] Total Host Walltime: 3.2866 s
[01/07/2022-20:20:00] [I] Total GPU Compute Time: 3.19327 s
[01/07/2022-20:20:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/07/2022-20:20:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8202] ...
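To compare several engines (for example fp32 against fp16), the summary lines can be pulled out of the log programmatically. The sketch below relies only on the line format shown in the output above; `parse_summary` is a hypothetical helper and other trtexec versions may format these lines differently.

```python
import re


def parse_summary(log_text):
    """Extract throughput (qps) and per-metric mean values (ms) from summary lines."""
    metrics = {"throughput_qps": None, "mean_ms": {}}
    match = re.search(r"Throughput: ([\d.]+) qps", log_text)
    if match:
        metrics["throughput_qps"] = float(match.group(1))
    # Lines look like: "[I] <Name>: min = ... ms, max = ... ms, mean = ... ms, ..."
    pattern = r"\[I\] ([\w\- ]+?): min = .*?mean = ([\d.]+) ms"
    for name, mean in re.findall(pattern, log_text):
        metrics["mean_ms"][name] = float(mean)
    return metrics


# Three lines copied from the performance summary above.
log = """\
[01/07/2022-20:20:00] [I] Throughput: 12.4749 qps
[01/07/2022-20:20:00] [I] Latency: min = 76.9746 ms, max = 98.8354 ms, mean = 79.5844 ms, median = 78.0542 ms, percentile(99%) = 98.8354 ms
[01/07/2022-20:20:00] [I] GPU Compute Time: min = 75.2869 ms, max = 97.1318 ms, mean = 77.8847 ms, median = 76.3599 ms, percentile(99%) = 97.1318 ms
"""
summary = parse_summary(log)
print(summary)
```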
Run the tutorial, which uses the engine:
$ ./bin/segmentation_tutorial
[01/07/2022-20:20:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +322, GPU +0, now: CPU 463, GPU 707 (MiB)
[01/07/2022-20:20:34] [I] [TRT] Loaded engine size: 132 MiB
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +520, GPU +224, now: CPU 984, GPU 1065 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 1099, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +131, now: CPU 0, GPU 131 (MiB)
[01/07/2022-20:20:35] [I] Running TensorRT inference for FCN-ResNet101
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 966, GPU 1109 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 966, GPU 1117 (MiB)
[01/07/2022-20:20:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +722, now: CPU 0, GPU 853 (MiB)
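Hands-on practice
The sections above covered building and using the official samples and tutorial. Here we pick a separate model, RVM, and go through the whole process from scratch.
Preparing the model
Robust Video Matting (RVM) is a robust video matting model that can do real-time, high-resolution matting on arbitrary video. A Webcam Demo is available to try in the browser.
Prepare the ONNX model rvm_mobilenetv3_fp32.onnx. Its inference documentation describes the model inputs and outputs: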
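Inputs: [src, r1i, r2i, r3i, r4i, downsample_ratio]
- src: the input frame, RGB channels, shape [B, C, H, W], range 0~1
- rXi: the recurrent state inputs; the initial value is a zero tensor of shape [1, 1, 1, 1]
- downsample_ratio: the downsample ratio, a tensor of shape [1]
- Only downsample_ratio must be FP32; the other inputs must use the same dtype as the loaded model
Outputs: [fgr, pha, r1o, r2o, r3o, r4o]
- fgr, pha: the foreground and alpha channel outputs, range 0~1
- rXo: the recurrent state outputs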
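The input rules listed above can be written down as a small NumPy sketch for the first frame. `make_initial_inputs` is a hypothetical helper; the all-zero `src` is only a placeholder for a real frame, and the default ratio of 0.25 is just an example value.

```python
import numpy as np


def make_initial_inputs(height, width, dtype=np.float32, downsample_ratio=0.25):
    """Build first-frame inputs: placeholder src, zero recurrent states, FP32 ratio."""
    inputs = {"src": np.zeros((1, 3, height, width), dtype=dtype)}
    # Recurrent states start as zero tensors of shape [1, 1, 1, 1].
    for name in ("r1i", "r2i", "r3i", "r4i"):
        inputs[name] = np.zeros((1, 1, 1, 1), dtype=dtype)
    # downsample_ratio is always FP32, whatever dtype the model uses.
    inputs["downsample_ratio"] = np.asarray([downsample_ratio], dtype=np.float32)
    return inputs


inputs = make_initial_inputs(1080, 1920, dtype=np.float16)
for name, value in inputs.items():
    print(name, value.shape, value.dtype)
```

For later frames, the `rXo` outputs of one call would be fed back as the `rXi` inputs of the next.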
Prepare an input image, input.jpg. An image is used instead of a video to keep the code simple.
Preparing the environment
conda create -n torch python=3.9 -y
conda activate torch
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
# Requirements
# https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
pip install onnx onnxruntime-gpu==1.10
Running the ONNX model
import onnxruntime as ort
import numpy as np
from PIL import Image

# Read the image
with Image.open('input.jpg') as img:
    img.load()

# HWC [0,255] > BCHW [0,1]
src = np.array(img)
src = np.moveaxis(src, -1, 0).astype(np.float32)
src = src[np.newaxis, :] / 255.

# Load the model
sess = ort.InferenceSession('rvm_mobilenetv3_fp32.onnx', providers=['CUDAExecutionProvider'])

# Create the io binding
io = sess.io_binding()

# Create tensors on CUDA
rec = [ ort.OrtValue.ortvalue_from_numpy(np.zeros([1, 1, 1, 1], dtype=np.float32), 'cuda') ] * 4
downsample_ratio = ort.OrtValue.ortvalue_from_numpy(np.asarray([0.25], dtype=np.float32), 'cuda')

# Bind the outputs
for name in ['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o']:
    io.bind_output(name, 'cuda')

# Inference
io.bind_cpu_input('src', src)
io.bind_ortvalue_input('r1i', rec[0])
io.bind_ortvalue_input('r2i', rec[1])
io.bind_ortvalue_input('r3i', rec[2])
io.bind_ortvalue_input('r4i', rec[3])
io.bind_ortvalue_input('downsample_ratio', downsample_ratio)
sess.run_with_iobinding(io)
fgr, pha, *rec = io.get_outputs()

# Copy only `fgr` and `pha` back to the CPU
fgr = fgr.numpy()
pha = pha.numpy()

# Compose RGBA: zero out the foreground where alpha is 0, then append alpha
com = np.where(pha > 0, fgr, pha)
com = np.concatenate([com, pha], axis=1) # + alpha

# BCHW [0,1] > HWC [0,255]
com = np.squeeze(com, axis=0)
com = np.moveaxis(com, 0, -1) * 255
img = Image.fromarray(com.astype(np.uint8))
img.show()
Run:
python rvm_onnx_infer.py --model "rvm_mobilenetv3_fp32.onnx" --input-image "input.jpg" --precision float32 --show
Result (transparent background):
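To make the compositing step of the script concrete, here is the same sequence of NumPy operations on a toy 1x2 "image" (one fully transparent pixel, one fully opaque pixel). The values are made up for illustration.

```python
import numpy as np

# Toy outputs in BCHW layout: a 1x2 image with a gray foreground.
fgr = np.full((1, 3, 1, 2), 0.5, dtype=np.float32)
pha = np.array([[[[0.0, 1.0]]]], dtype=np.float32)  # left pixel transparent

# Same steps as in the script: mask the foreground, append alpha, go to HWC.
com = np.where(pha > 0, fgr, pha)          # foreground becomes 0 where alpha is 0
com = np.concatenate([com, pha], axis=1)   # shape (1, 4, 1, 2)
com = np.moveaxis(np.squeeze(com, axis=0), 0, -1) * 255
rgba = com.astype(np.uint8)

print(rgba.shape)     # (1, 2, 4)
print(rgba.tolist())  # [[[0, 0, 0, 0], [127, 127, 127, 255]]]
```

Note that `astype(np.uint8)` truncates, so 0.5 * 255 = 127.5 becomes 127.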
Converting the ONNX model to a TRT model
Use trtexec to convert the ONNX model into a TensorRT engine:
export PATH=/usr/local/TensorRT/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
trtexec --onnx=rvm_mobilenetv3_fp32.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32.engine --verbose
This fails with an error:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:773: While parsing node number 3 [Resize -> "389"]:
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:774: --- Begin node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:775: input: "src"
input: "386"
input: "388"
output: "389"
name: "Resize_3"
op_type: "Resize"
attribute {
name: "coordinate_transformation_mode"
s: "pytorch_half_pixel"
type: STRING
}
attribute {
name: "cubic_coeff_a"
f: -0.75
type: FLOAT
}
attribute {
name: "mode"
s: "linear"
type: STRING
}
attribute {
name: "nearest_mode"
s: "floor"
type: STRING
}
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:776: --- End node ---
[01/08/2022-20:20:36] [E] [TRT] ModelImporter.cpp:779: ERROR: builtin_op_importers.cpp:3608 In function importResize:
[8] Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"
At this point, the model itself has to be modified.
First, install the necessary tools:
snap install netron
pip install onnx-simplifier
pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com
Next, inspect the model's Resize_3 node in Netron:
Its scales input turns out to be derived from downsample_ratio, namely [1,1,downsample_ratio,downsample_ratio], so it can be replaced with a constant using ONNX GraphSurgeon.
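To see what the constant scales do, the output shape of a Resize node with `scales` can be worked out by hand: in the ONNX operator definition each output dimension is floor(input_dim * scale). The sketch below uses 0.25, the ratio passed in the ONNX Runtime script earlier; `resized_shape` is a hypothetical helper.

```python
import math


def resized_shape(shape, scales):
    """Output shape of a Resize with scales: floor(input_dim * scale) per axis."""
    return [int(math.floor(dim * scale)) for dim, scale in zip(shape, scales)]


downsample_ratio = 0.25  # example value, as used in the ONNX Runtime script
scales = [1, 1, downsample_ratio, downsample_ratio]

out_shape = resized_shape([1, 3, 1080, 1920], scales)
print(out_shape)  # [1, 3, 270, 480]
```

Because the scales no longer depend on a runtime input, the parser can treat them as an initializer, which is what the assertion in the error message asks for.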
Finally, the model modification steps are as follows:
# Simplify the ONNX model and switch to static input shapes
python -m onnxsim rvm_mobilenetv3_fp32.onnx rvm_mobilenetv3_fp32_sim.onnx \
--input-shape src:1,3,1080,1920 r1i:1,1,1,1 r2i:1,1,1,1 r3i:1,1,1,1 r4i:1,1,1,1
# Modify the model with ONNX GraphSurgeon
python rvm_onnx_modify.py -i rvm_mobilenetv3_fp32_sim.onnx --input-size 1920 1280
# Convert the ONNX model into a TensorRT engine with trtexec
trtexec --onnx=rvm_mobilenetv3_fp32_sim_modified.onnx --workspace=64 --saveEngine=rvm_mobilenetv3_fp32_sim_modified.engine
import numpy as np
import onnx
import onnx_graphsurgeon as gs

# Note: `_print_graph` is a helper defined elsewhere in the full script (not shown here).
def modify(input: str, output: str, downsample_ratio: float = 0.25) -> None:
    print(f'\nonnx load: {input}')
    graph = gs.import_onnx(onnx.load(input))
    _print_graph(graph)

    # update node Resize_3: replace the computed `scales` input with a constant
    resize_3 = [n for n in graph.nodes if n.name == 'Resize_3'][0]
    print()
    print(resize_3)
    scales = gs.Constant('388',
        np.asarray([1, 1, downsample_ratio, downsample_ratio], dtype=np.float32))
    resize_3.inputs = [i if i.name != '388' else scales for i in resize_3.inputs]
    print()
    print(resize_3)

    # remove input downsample_ratio
    graph.inputs = [i for i in graph.inputs if i.name != 'downsample_ratio']

    # remove node Concat_2
    concat_2 = [n for n in graph.nodes if n.name == 'Concat_2'][0]
    concat_2.outputs.clear()

    # remove unused nodes/tensors
    graph.cleanup()
    onnx.save(gs.export_onnx(graph), output)
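Output differences between the ONNX and TRT models
Polygraphy can be used to inspect the output differences between the ONNX and TRT models.
First, install it:
# Install the TensorRT Python API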
cd /usr/local/TensorRT/python/
pip install tensorrt-8.2.2.1-cp39-none-linux_x86_64.whl
export LD_LIBRARY_PATH=/usr/local/TensorRT/lib:$LD_LIBRARY_PATH
python -c "import tensorrt; print(tensorrt.__version__)"
# Install Polygraphy, or install it from source in TensorRT/tools/Polygraphy
python -m pip install colored polygraphy --extra-index-url https://pypi.ngc.nvidia.com
Run the ONNX and TRT models and compare the output error:
# Run the ONNX model and save its inputs and outputs
polygraphy run rvm_mobilenetv3_fp32_sim_modified.onnx --onnxrt --val-range [0,1] --save-inputs onnx_inputs.json --save-outputs onnx_outputs.json
# Run the TRT model, load the ONNX inputs and outputs, and compare the relative and absolute output error
polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
With fp32, the error stays within 1e-3 and the comparison PASSED:
[I] PASSED | All outputs matched | Outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[I] PASSED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp32_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
I also tried fp16; its precision loss is considerably larger, and the comparison FAILED:
[E] FAILED | Mismatched outputs: ['r4o', 'r3o', 'r2o', 'r1o', 'fgr', 'pha']
[!] FAILED | Command: /home/john/anaconda3/envs/torch/bin/polygraphy run rvm_mobilenetv3_fp16_sim_modified.engine --model-type engine --trt --load-inputs onnx_inputs.json --load-outputs onnx_outputs.json --rtol 1e-3 --atol 1e-3
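For intuition about `--rtol` and `--atol`, here is a minimal sketch of an element-wise tolerance check. It assumes the NumPy `isclose`-style rule |actual - expected| <= atol + rtol * |expected|; Polygraphy's exact criterion may differ. `within_tolerance` and `to_fp16` are hypothetical helpers, and `to_fp16` rounds a value to half precision with the standard `struct` module.

```python
import struct


def within_tolerance(actual, expected, rtol=1e-3, atol=1e-3):
    """isclose-style check (an assumption, not necessarily Polygraphy's rule)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


print(within_tolerance(1.0005, 1.0))  # True
print(within_tolerance(1.01, 1.0))    # False

# A single fp16 rounding step is small compared with these tolerances ...
rounded = to_fp16(0.1)
print(rounded, within_tolerance(rounded, 0.1))
```

A single rounding therefore passes easily; the fp16 mismatches reported above presumably come from rounding errors accumulating across many layers, although the log alone does not show where they arise.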
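Running the TRT model
Here the TensorRT C++ runtime APIs are used as an example to run the converted RVM TRT model. The complete code is in rvm_infer.cc.
1. Load the model: create the runtime and deserialize the data read from the TRT model file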
static Logger logger{Logger::Severity::kINFO};
auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
auto engine = runtime->deserializeCudaEngine(engine_data.data(), fsize, nullptr);
Iterate over all input and output bindings:
auto nb = engine->getNbBindings();
for (int32_t i = 0; i < nb; i++) {
  auto is_input = engine->bindingIsInput(i);
  auto name = engine->getBindingName(i);
  auto dims = engine->getBindingDimensions(i);
  auto datatype = engine->getBindingDataType(i);
  // ...
}
Engine
Name=Unnamed Network 0
DeviceMemorySize=148 MiB
MaxBatchSize=1
Bindings
Input[0] name=src dims=[1,3,1080,1920] datatype=FLOAT
Input[1] name=r1i dims=[1,1,1,1] datatype=FLOAT
Input[2] name=r2i dims=[1,1,1,1] datatype=FLOAT
Input[3] name=r3i dims=[1,1,1,1] datatype=FLOAT
Input[4] name=r4i dims=[1,1,1,1] datatype=FLOAT
Output[5] name=r4o dims=[1,64,18,32] datatype=FLOAT
Output[6] name=r3o dims=[1,40,36,64] datatype=FLOAT
Output[7] name=r2o dims=[1,20,72,128] datatype=FLOAT
Output[8] name=r1o dims=[1,16,144,256] datatype=FLOAT
Output[9] name=fgr dims=[1,3,1080,1920] datatype=FLOAT
Output[10] name=pha dims=[1,1,1080,1920] datatype=FLOAT
Next, allocate device memory for all bindings:
auto nb = engine->getNbBindings();
std::vector<void *> bindings(nb, nullptr);
std::vector<int32_t> bindings_size(nb, 0);
for (int32_t i = 0; i < nb; i++) {
  auto dims = engine->getBindingDimensions(i);
  // All bindings of this engine are FLOAT, so sizeof(float) per element
  auto size = GetMemorySize(dims, sizeof(float));
  if (cudaMalloc(&bindings[i], size) != cudaSuccess) {
    std::cerr << "ERROR: cuda memory allocation failed, size = " << size
        << " bytes" << std::endl;
    return false;
  }
  bindings_size[i] = size;
}
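The buffer sizes requested by this loop can be cross-checked by hand. `GetMemorySize` is the author's helper and is not shown here; the Python sketch below assumes it multiplies the element count by the element size, and applies that to the binding shapes from the engine listing above.

```python
from math import prod

# Binding shapes copied from the engine listing above (all FLOAT, 4 bytes each).
BINDING_DIMS = {
    "src": (1, 3, 1080, 1920),
    "r1i": (1, 1, 1, 1),
    "r2i": (1, 1, 1, 1),
    "r3i": (1, 1, 1, 1),
    "r4i": (1, 1, 1, 1),
    "r4o": (1, 64, 18, 32),
    "r3o": (1, 40, 36, 64),
    "r2o": (1, 20, 72, 128),
    "r1o": (1, 16, 144, 256),
    "fgr": (1, 3, 1080, 1920),
    "pha": (1, 1, 1080, 1920),
}


def memory_size(dims, elem_size=4):
    """Bytes for a dense tensor: element count times element size (assumed rule)."""
    return prod(dims) * elem_size


sizes = {name: memory_size(dims) for name, dims in BINDING_DIMS.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
print("total:", sum(sizes.values()), "bytes")
```

The `src` and `fgr` buffers dominate at roughly 24.9 MB each, which fits comfortably in the `int32_t` used for `bindings_size`.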
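With that, the preparation is done.
2. Preprocessing: convert the input data into the input format and store it in the input bindings
Read the image with OpenCV and resize it to the input size of src. Then convert the data from BGR [0,255] to RGB [0,1]. Since batch=1, the batch dimension can be ignored during processing.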
// img: HWC BGR [0,255] u8
auto img = cv::imread(input_filename, cv::IMREAD_COLOR);
if (src_h != img.rows || src_w != img.cols) {
  cv::resize(img, img, cv::Size(src_w, src_h));
}
// src: BCHW RGB [0,1] fp32
auto src = cv::Mat(img.rows, img.cols, CV_32FC3);
{
  auto src_data = (float*)(src.data);
  for (int y = 0; y < src_h; ++y) {
    for (int x = 0; x < src_w; ++x) {
      auto &&bgr = img.at<cv::Vec3b>(y, x);
      /*r*/ *(src_data + y*src_w + x) = bgr[2] / 255.;
      /*g*/ *(src_data + src_n + y*src_w + x) = bgr[1] / 255.;
      /*b*/ *(src_data + src_n*2 + y*src_w + x) = bgr[0] / 255.;
    }
  }
}
if (cudaMemcpyAsync(bindings[0], src.data, bindings_size[0],
                    cudaMemcpyHostToDevice, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of src failed, size = "
      << bindings_size[0] << " bytes" << std::endl;
  return false;
}
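The index arithmetic in this loop writes an interleaved HWC BGR image into a planar CHW RGB buffer. The NumPy sketch below reproduces it on a tiny 2x3 image and compares it with a vectorized version. It assumes `src_n` is the size of one channel plane (`src_h * src_w`), since its definition is not part of the excerpt.

```python
import numpy as np

h, w = 2, 3
# Fake HWC BGR uint8 image with distinct values per element.
img_bgr = np.arange(h * w * 3, dtype=np.uint8).reshape(h, w, 3)

src_n = h * w  # assumed: number of elements in one channel plane
flat = np.empty(3 * src_n, dtype=np.float32)
for y in range(h):
    for x in range(w):
        b, g, r = img_bgr[y, x]
        flat[y * w + x] = r / 255.              # R plane
        flat[src_n + y * w + x] = g / 255.      # G plane
        flat[src_n * 2 + y * w + x] = b / 255.  # B plane

# Vectorized equivalent: reverse channels (BGR > RGB), then HWC > CHW.
chw = (img_bgr[..., ::-1].transpose(2, 0, 1) / 255.).astype(np.float32)

print(np.allclose(flat, chw.ravel()))  # True
```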
3. Inference: pass the bindings to the engine's execution context to run inference
auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
    engine->createExecutionContext());
if (!context) {
  return false;
}
bool status = context->enqueueV2(bindings.data(), stream, nullptr);
if (!status) {
  std::cout << "ERROR: TensorRT inference failed" << std::endl;
  return false;
}
4. Postprocessing: read the data from the output bindings and process it according to the output format
Use cv::Mat to receive the output foreground fgr and alpha channel pha:
auto fgr = cv::Mat(src_h, src_w, CV_32FC3);  // BCHW RGB [0,1] fp32
if (cudaMemcpyAsync(fgr.data, bindings[9], bindings_size[9],
                    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[9] << " bytes" << std::endl;
  return false;
}
auto pha = cv::Mat(src_h, src_w, CV_32FC1);  // BCHW A [0,1] fp32
if (cudaMemcpyAsync(pha.data, bindings[10], bindings_size[10],
                    cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
  std::cerr << "ERROR: CUDA memory copy of output failed, size = "
      << bindings_size[10] << " bytes" << std::endl;
  return false;
}
// Wait for the asynchronous copies to finish before reading the host buffers
cudaStreamSynchronize(stream);
Then compose fgr and pha into RGBA data and restore the original size:
// Compose `fgr` and `pha`
auto com = cv::Mat(src_h, src_w, CV_8UC4);  // HWC BGRA [0,255] u8
{
  auto fgr_data = (float*)(fgr.data);
  auto pha_data = (float*)(pha.data);
  for (int y = 0; y < com.rows; ++y) {
    for (int x = 0; x < com.cols; ++x) {
      auto &&elem = com.at<cv::Vec4b>(y, x);
      auto alpha = *(pha_data + y*src_w + x);
      if (alpha > 0) {
        /*r*/ elem[2] = *(fgr_data + y*src_w + x) * 255;
        /*g*/ elem[1] = *(fgr_data + src_n + y*src_w + x) * 255;
        /*b*/ elem[0] = *(fgr_data + src_n*2 + y*src_w + x) * 255;
      } else {
        /*r*/ elem[2] = 0;
        /*g*/ elem[1] = 0;
        /*b*/ elem[0] = 0;
      }
      /*a*/ elem[3] = alpha * 255;
    }
  }
}
if (dst_h != com.rows || dst_w != com.cols) {
  cv::resize(com, com, cv::Size(dst_w, dst_h));
}
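5. The matting result obtained from the run (transparent background):
Finally
If you want to get started with TensorRT, try it hands-on!
GoCoding shares experience from personal hands-on practice. Follow the official account for more!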