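Deploying Your Neural Network with TensorRT (1)

Posted by 樹莓派派酒 on 2020-12-29

Original URL: https://blog.csdn.net/qq_35054151/article/details/111878679

Author: 阿鬆
Link: https://zhuanlan.zhihu.com/p/259539097
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.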

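TensorRT is a very good tool for deploying neural networks and the first choice on NVIDIA devices. It natively supports importing models exported from the mainstream training frameworks through parsers such as UFFParser, CaffeParser and ONNXParser. Anyone who has used these parsers knows, however, that model conversion tends to run into problems: unsupported layers, unfriendly error messages, and so on. The tensorrtx library takes a different route and builds the network directly with the TensorRT API. It is simple and blunt, and it works well.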

wang-xinyu/tensorrtx (github.com)

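Let's take a brief look at this very nice code base.

tensorrtx/tutorials provides several tutorial files. One of them, getting_start.md, uses lenet as an example to explain how to use tensorrtx. Let's try it.

1. Building the network and exporting the weights

The first step is to obtain the network and its weights: build the network with torch and export a pth model, then use inference.py to generate a wts file. This wts file is the star of the next step. The lenet structure is shown in the figure; for network visualization, see this article.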

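阿鬆: Pytorch network visualization (zhuanlan.zhihu.com)

[Figure: lenet5 structure]

2. Building the TensorRT engine and running inference

To run inference with TensorRT, the network must first be converted into a TensorRT engine. Start by compiling the code in tensorrtx/lenet: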

cd tensorrtx/lenet
cp path_to_lenet5.wts .
mkdir build && cd build
cmake ..
make

If all goes well, you get an executable. Next, generate the TensorRT engine.

./lenet -s

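This command serializes the generated engine and saves it to a file. Building an engine in TensorRT is usually time-consuming, especially on embedded devices, so a serialized engine that can be loaded directly on the next run greatly shortens program initialization. Then use the -d option to deserialize the engine and run inference.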

./lenet -d

The result of running with TensorRT is

Output:

0.0949623, 0.0998472, 0.110072, 0.0975036, 0.0965564, 0.109736, 0.0947979, 0.105618, 0.099228, 0.0916792,

while the earlier PyTorch run printed

lenet out: tensor([[0.0950, 0.0998, 0.1101, 0.0975, 0.0966, 0.1097, 0.0948, 0.1056, 0.0992,
         0.0917]], device='cuda:0', grad_fn=<SoftmaxBackward>)

The TensorRT result agrees with the PyTorch result to the four decimal places that PyTorch prints. Nice!
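To put a number on "very close", the two printed outputs can be compared element by element. The following is only a small sketch: the values are copied from the outputs above, and max_abs_diff is a hypothetical helper name, not something from tensorrtx.

```python
# Values copied from the TensorRT and PyTorch outputs printed above.
trt_out = [0.0949623, 0.0998472, 0.110072, 0.0975036, 0.0965564,
           0.109736, 0.0947979, 0.105618, 0.099228, 0.0916792]
torch_out = [0.0950, 0.0998, 0.1101, 0.0975, 0.0966,
             0.1097, 0.0948, 0.1056, 0.0992, 0.0917]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    assert len(a) == len(b)
    return max(abs(x - y) for x, y in zip(a, b))


diff = max_abs_diff(trt_out, torch_out)
# PyTorch printed only four decimals, so any difference below 5e-5
# cannot be told apart from print rounding.
print("max abs diff:", diff)
print("within print rounding:", diff < 5e-5)
print("softmax sum:", sum(trt_out))
```

Because the PyTorch tensor was printed with four decimals, this check can only confirm agreement up to that precision. A stricter comparison would need the raw PyTorch values.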

3. A brief look at the code

First, the PyTorch code that builds the network and exports the model, pasted as-is:

import torch
from torch import nn
from torch.nn import functional as F

class Lenet5(nn.Module):
    """
    for cifar10 dataset.
    """
    def __init__(self):
        super(Lenet5, self).__init__()

        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0)
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        print('input: ', x.shape)
        x = F.relu(self.conv1(x))
        print('conv1',x.shape)
        x = self.pool1(x)
        print('pool1: ', x.shape)
        x = F.relu(self.conv2(x))
        print('conv2',x.shape)
        x = self.pool1(x)
        print('pool2',x.shape)
        x = x.view(x.size(0), -1)
        print('view: ', x.shape)
        x = F.relu(self.fc1(x))
        print('fc1: ', x.shape)
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)
        return x

def main():
    print('cuda device count: ', torch.cuda.device_count())
    torch.manual_seed(1234)
    net = Lenet5()
    net = net.to('cuda:0')
    net.eval()
    tmp = torch.ones(1, 1, 32, 32).to('cuda:0')
    out = net(tmp)
    print('lenet out shape:', out.shape)
    print('lenet out:', out)  # print the network output
    torch.save(net, "lenet5.pth")  # save the model in pth format; it can be visualized with netron

if __name__ == '__main__':
    main()
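
The print calls in forward() trace how the tensor shape changes from layer to layer. The arithmetic behind those shapes, and behind the 16*5*5 input size of fc1, can be sketched without running the model. conv_out and lenet5_shapes are hypothetical helper names used only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_shapes(side=32):
    """Return (stage name, shape without the batch dimension) for the Lenet5 above."""
    shapes = [("input", (1, side, side))]
    s = conv_out(side, kernel=5)           # conv1: 5x5 kernel, stride 1, no padding
    shapes.append(("conv1", (6, s, s)))
    s = conv_out(s, kernel=2, stride=2)    # pool1: 2x2 average pooling, stride 2
    shapes.append(("pool1", (6, s, s)))
    s = conv_out(s, kernel=5)              # conv2: 5x5 kernel, stride 1, no padding
    shapes.append(("conv2", (16, s, s)))
    s = conv_out(s, kernel=2, stride=2)    # the same pool1 module is applied again
    shapes.append(("pool2", (16, s, s)))
    shapes.append(("view", (16 * s * s,)))  # flattened input of fc1
    return shapes


for name, shape in lenet5_shapes():
    print(name, shape)
```

With a 32x32 input the side length goes 32, 28, 14, 10, 5, which is where the 16*5*5 in nn.Linear comes from.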

The network is very simple, so I won't go through it. Below is the inference.py code:

import torch
from torch import nn
from lenet5 import Lenet5
import os
import struct

def main():
    print('cuda device count: ', torch.cuda.device_count())
    net = torch.load('lenet5.pth')
    net = net.to('cuda:0')
    net.eval()
    #print('model: ', net)
    #print('state dict: ', net.state_dict()['conv1.weight'])
    tmp = torch.ones(1, 1, 32, 32).to('cuda:0')
    #print('input: ', tmp)
    out = net(tmp)
    print('lenet out:', out)  # print the inference result again for later comparison

    with open("lenet5.wts", 'w') as f:  # the with block closes the file when done
        f.write("{}\n".format(len(net.state_dict().keys())))  # number of entries in the state dict
        for k, v in net.state_dict().items():
            #print('key: ', k)
            #print('value: ', v.shape)
            vr = v.reshape(-1).cpu().numpy()
            f.write("{} {}".format(k, len(vr)))  # entry name and number of values
            for vv in vr:
                f.write(" ")
                f.write(struct.pack(">f", float(vv)).hex())  # pack each weight as a big-endian float32, written as hex
            f.write("\n")

if __name__ == '__main__':
    main()

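As you can see, inference.py saves the number of entries in the state dict of the pth model and, for each entry (a layer's weight or bias), its name, its number of values and all of the values themselves. The result looks like this:

[Figure: contents of the wts file]

Next, let's see how the C++ code loads and converts the weights.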
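
Each weight in the wts file is written by struct.pack(">f", ...).hex(), that is, as the 8 hex digits of a big-endian 32-bit float. A short round trip shows what those digits look like and how to read them back. float_to_hex and hex_to_float are hypothetical helper names for this sketch.

```python
import struct


def float_to_hex(value):
    """Encode a number as 8 hex digits: big-endian IEEE 754 single precision."""
    return struct.pack(">f", float(value)).hex()


def hex_to_float(text):
    """Decode 8 hex digits produced by float_to_hex back into a float."""
    return struct.unpack(">f", bytes.fromhex(text))[0]


print(float_to_hex(1.0))         # 3f800000
print(float_to_hex(-2.0))        # c0000000
print(hex_to_float("3f000000"))  # 0.5
```

Values that are not exactly representable in 32 bits are rounded to float32 on the way in, so the round trip returns the float32 value rather than the original Python float.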

int main(int argc, char** argv)
{
    if (argc != 2) {
        std::cerr << "arguments not right!" << std::endl;
        std::cerr << "./lenet -s   // serialize model to plan file" << std::endl;
        std::cerr << "./lenet -d   // deserialize plan file and run inference" << std::endl;
        return -1;
    }

    // create a model using the API directly and serialize it to a stream
    char *trtModelStream{nullptr};
    size_t size{0};

    if (std::string(argv[1]) == "-s") {  // serialize the model
        IHostMemory* modelStream{nullptr};
        APIToModel(1, &modelStream);  // the key call
        assert(modelStream != nullptr);

        std::ofstream p("lenet5.engine", std::ios::binary);  // binary mode, matching the reader below
        if (!p)
        {
            std::cerr << "could not open plan output file" << std::endl;
            return -1;
        }
        p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
        modelStream->destroy();
        return 1;
    } else if (std::string(argv[1]) == "-d") {  // deserialize the model
        std::ifstream file("lenet5.engine", std::ios::binary);
        if (file.good()) {
            file.seekg(0, file.end);
            size = file.tellg();
            file.seekg(0, file.beg);
            trtModelStream = new char[size];
            assert(trtModelStream);
            file.read(trtModelStream, size);
            file.close();
        }
    } else {
        return -1;
    }

The core is the call to APIToModel(). Apart from setting up the builder that TensorRT needs to build an engine, that function calls createLenetEngine(), which turns the wts file into an engine.

// Create the engine using only the API and not any parser.
ICudaEngine* createLenetEngine(unsigned int maxBatchSize, IBuilder* builder, DataType dt)
{
    INetworkDefinition* network = builder->createNetwork();

    // Create input tensor of shape { 1, 1, 32, 32 } with name INPUT_BLOB_NAME
    ITensor* data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
    assert(data);

    // Add convolution layer with 6 outputs and a 5x5 filter.
    std::map<std::string, Weights> weightMap = loadWeights("../lenet5.wts");
    IConvolutionLayer* conv1 = network->addConvolution(*data, 6, DimsHW{5, 5}, weightMap["conv1.weight"], weightMap["conv1.bias"]);
    assert(conv1);
    conv1->setStride(DimsHW{1, 1});

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu1 = network->addActivation(*conv1->getOutput(0), ActivationType::kRELU);
    assert(relu1);

    // Add average pooling layer with stride of 2x2 and kernel size of 2x2.
    IPoolingLayer* pool1 = network->addPooling(*relu1->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool1);
    pool1->setStride(DimsHW{2, 2});

    // Add second convolution layer with 16 outputs and a 5x5 filter.
    IConvolutionLayer* conv2 = network->addConvolution(*pool1->getOutput(0), 16, DimsHW{5, 5}, weightMap["conv2.weight"], weightMap["conv2.bias"]);
    assert(conv2);
    conv2->setStride(DimsHW{1, 1});

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu2 = network->addActivation(*conv2->getOutput(0), ActivationType::kRELU);
    assert(relu2);

    // Add second average pooling layer with stride of 2x2 and kernel size of 2x2.
    // kAVERAGE matches the PyTorch model, which applies its AvgPool2d a second time here.
    IPoolingLayer* pool2 = network->addPooling(*relu2->getOutput(0), PoolingType::kAVERAGE, DimsHW{2, 2});
    assert(pool2);
    pool2->setStride(DimsHW{2, 2});

    // Add fully connected layer with 120 outputs.
    IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool2->getOutput(0), 120, weightMap["fc1.weight"], weightMap["fc1.bias"]);
    assert(fc1);

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu3 = network->addActivation(*fc1->getOutput(0), ActivationType::kRELU);
    assert(relu3);

    // Add second fully connected layer with 84 outputs.
    IFullyConnectedLayer* fc2 = network->addFullyConnected(*relu3->getOutput(0), 84, weightMap["fc2.weight"], weightMap["fc2.bias"]);
    assert(fc2);

    // Add activation layer using the ReLU algorithm.
    IActivationLayer* relu4 = network->addActivation(*fc2->getOutput(0), ActivationType::kRELU);
    assert(relu4);

    // Add third fully connected layer with OUTPUT_SIZE outputs.
    IFullyConnectedLayer* fc3 = network->addFullyConnected(*relu4->getOutput(0), OUTPUT_SIZE, weightMap["fc3.weight"], weightMap["fc3.bias"]);
    assert(fc3);

    // Add softmax layer to determine the probability.
    ISoftMaxLayer* prob = network->addSoftMax(*fc3->getOutput(0));
    assert(prob);
    prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
    network->markOutput(*prob->getOutput(0));

    // Build engine
    builder->setMaxBatchSize(maxBatchSize);
    builder->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // Don't need the network any more
    network->destroy();

    // Release host memory
    for (auto& mem : weightMap)
    {
        free((void*) (mem.second.values));
    }

    return engine;
}

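This process consists mainly of the following steps: create an INetworkDefinition, which holds TensorRT's representation of the network; add ILayer objects to the INetworkDefinition one by one using the TensorRT API (for particular layers, set a name with ITensor->setName() to make later operations easier); mark the network's output tensors, which TensorRT requires because otherwise they may be optimized away; set the builder parameters, such as the max batch size and the max workspace size; and call builder->buildCudaEngine(*network) to build the ICudaEngine, which is the TensorRT engine we want.

In addition, the author implemented a custom loader for wts weights, loadWeights(), which makes it easy to supply the weights when adding weighted layers to the INetworkDefinition. Serialization and deserialization are not covered further here.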
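
To make the loading side concrete, here is a minimal Python sketch of a reader for the wts format that inference.py writes: a count line, then one line per entry with its name, its length and the hex-encoded values. It is not the author's C++ loadWeights(); parse_wts and the sample entry names are made up for illustration.

```python
import struct


def parse_wts(text):
    """Parse wts-formatted text into a dict of {entry name: list of floats}."""
    lines = text.strip().splitlines()
    count = int(lines[0])  # first line: number of entries that follow
    weights = {}
    for line in lines[1:1 + count]:
        fields = line.split()
        name, size = fields[0], int(fields[1])
        hex_values = fields[2:]
        if len(hex_values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(hex_values)}")
        # Each value is 8 hex digits of a big-endian float32.
        weights[name] = [struct.unpack(">f", bytes.fromhex(h))[0] for h in hex_values]
    if len(weights) != count:
        raise ValueError(f"expected {count} entries, got {len(weights)}")
    return weights


# Hypothetical two-entry file, not taken from a real lenet5.wts.
sample = "2\nfc.weight 2 3f800000 bf000000\nfc.bias 1 40000000\n"
print(parse_wts(sample))
```

The length stored on each line lets the reader allocate the right amount of memory up front and detect a truncated file. The sketch uses it only for that check.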

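The tensorrtx repository already supports a large number of networks.

The author has also run many speed tests.

4. A more advanced network

lenet is too simple, so let's look at how a yolo network is implemented. The final yolo layer of a yolo network is certainly not supported by TensorRT, and yolov4 also introduces the mish activation function. How are these handled here?

The author wrote CUDA implementations of mish and the yolo layer, which are later inserted into the network through the plugin mechanism. Open yolov4.cpp and you will find that its basic structure is similar to lenet.cpp. The important additions are the pre- and post-processing for the detection network, in preProcess() and in iou(), cmp() and nms(), which I won't go through here. The author also implemented helpers such as addBatchNorm2d(), convBnMish() and convBnLeaky(), which convert a whole CBR or CBM block at once and simplify the code. Now for the key part: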

// yolov4.cpp line 485
    auto creator = getPluginRegistry()->getPluginCreator("YoloLayer_TRT", "1");  // look up the YoloLayer_TRT plugin creator in the TensorRT plugin registry
    const PluginFieldCollection* pluginData = creator->getFieldNames();
    IPluginV2 *pluginObj = creator->createPlugin("yololayer", pluginData);  // create the IPluginV2 object
    ITensor* inputTensors_yolo[] = {conv138->getOutput(0), conv149->getOutput(0), conv160->getOutput(0)};
    auto yolo = network->addPluginV2(inputTensors_yolo, 3, *pluginObj);  // insert the plugin into the network

    yolo->getOutput(0)->setName(OUTPUT_BLOB_NAME);  // name the yolo layer's output tensor
    std::cout << "set name out" << std::endl;
    network->markOutput(*yolo->getOutput(0));  // mark the yolo output as a network output so it is not optimized away

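As the yolov4 structure shows, layers 139, 150 and 161 are the yolo layers. Here addPluginV2 is used to add the plugin layer; see the link below for help understanding it. The Mish layer is added in the same way.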

[TensorRT] How to write code to using PluginV2 (www.codenong.com)
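
For reference, the function that the Mish plugin has to compute is mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x). The scalar sketch below only illustrates the math; it is not the author's CUDA kernel, and the numerically stable form of softplus is my own choice.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for v in (-2.0, 0.0, 1.0, 20.0):
    print(v, mish(v))
```

For large positive inputs mish(x) approaches x, and for large negative inputs it approaches 0 from below. A plugin implementation applies the same formula to every element of the tensor.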

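In addition, the author implements the upsample in yolo as a grouped deconvolution, added to the network with network->addDeconvolutionNd(), which works around upsample not being supported.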
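
Why a grouped deconvolution can stand in for upsample is easiest to see on a tiny example. The sketch below assumes nearest-neighbour 2x upsampling, a 2x2 all-ones kernel, stride 2 and one group per channel. These settings are my assumption about how the equivalence works, since the text only says that a grouped deconvolution is used. All function names are hypothetical.

```python
def upsample_nearest_2x(channel):
    """Nearest-neighbour 2x upsampling of one 2D channel (a list of rows)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def deconv_ones_2x(channel):
    """Transposed convolution of one channel with a 2x2 all-ones kernel, stride 2."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for ki in range(2):
                for kj in range(2):
                    # Each input value is scattered into its own 2x2 output block.
                    out[2 * i + ki][2 * j + kj] += channel[i][j] * 1.0
    return out


def grouped_deconv_2x(feature_map):
    """Apply the deconvolution to each channel separately (groups == channels)."""
    return [deconv_ones_2x(ch) for ch in feature_map]


feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
same = grouped_deconv_2x(feature_map) == [upsample_nearest_2x(ch) for ch in feature_map]
print("deconvolution matches upsample:", same)
```

Because the stride equals the kernel size, the 2x2 output blocks never overlap, so every output pixel is a plain copy of one input pixel. One group per channel keeps the channels from mixing.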

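5. Summary

Overall, the TensorRTx code base is very concise and easy to use. It sidesteps the pitfalls of going through onnx and makes network conversion more controllable. The author is still maintaining it very actively, and the GitHub repository already has 900+ stars. I highly recommend trying it.

However, because every network is rebuilt separately with the TensorRT API, the overall process is still fairly laborious. If a pruning algorithm has modified the original network structure, the network basically has to be built all over again, which is a considerable amount of work. Also, common structures such as network backbones could be abstracted out and implemented separately, which would make it easier to assemble networks and greatly reduce debugging effort. These are things that can be improved later.

Original post: https://zhuanlan.zhihu.com/p/259539097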
