Accelerating Models with TVM: Optimizing Inference

Posted by GoCodingInMyWay on 2022-05-22

Original URL: https://www.cnblogs.com/gocodinginmyway/p/16297204.html


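TVM is an open-source deep learning compiler that targets CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Whereas deep learning frameworks focus on productivity in building models, TVM focuses on a model's performance and efficiency on hardware.

This post only briefly introduces TVM's compilation flow and shows how to auto-tune your own model. For a deeper look, see the official TVM resources:

Documentation: https://tvm.apache.org/docs/
Source code: https://github.com/apache/tvm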

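Compilation Flow

TVM's Design and Architecture documentation covers an example compilation flow, the logical architecture components, and device/target implementations. It illustrates the flow with a diagram.

At a high level, the flow consists of the following steps:

Import: the frontend component ingests a model into an IRModule, which is a collection of functions that internally represent the model (IR).
Transformation: the compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (for example, in the case of quantization). Most transformations are independent of the target (backend). TVM also allows the target to influence the configuration of the transformation pipeline.
Target Translation: the compiler translates (code-generates) the IRModule into an executable format for the target. The result is encapsulated as a runtime.Module that can be exported, loaded, and executed in the target runtime environment.
Runtime Execution: the user loads a runtime.Module and runs the compiled functions in a supported runtime environment.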
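
The last two steps mention that a runtime.Module can be exported and loaded again. A minimal sketch of that round trip follows, assuming `lib` is the object returned by relay.build (as built later in this post). The helper name `export_and_reload` and the `so_path` argument are hypothetical, and exporting a shared library assumes a working C/C++ toolchain on the machine.

```python
def export_and_reload(lib, so_path, target="llvm"):
    """Export a compiled library, load it back, and wrap it for execution.

    `lib` is assumed to be the result of relay.build(...); `so_path` is a
    hypothetical output path such as "resnet50.so".
    """
    # Imported lazily so that defining this helper does not require TVM.
    import tvm
    from tvm.contrib import graph_executor

    # Target Translation result -> shared library on disk
    lib.export_library(so_path)

    # Runtime Execution: load the runtime.Module back and bind it to a device
    loaded = tvm.runtime.load_module(so_path)
    dev = tvm.device(target, 0)
    return graph_executor.GraphModule(loaded["default"](dev))
```

The returned object would be used like the `module` in the rest of this post (set_input, run, get_output).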

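Tuning a Model

The TVM User Tutorial starts with how to compile and optimize a model, then works down to lower-level components such as TE, TensorIR, and Relay.

Here we only cover auto-tuning a model with AutoTVM, to get hands-on with how TVM compiles, tunes, and runs a model. The original tutorial is Compiling and Optimizing a Model with the Python Interface (AutoTVM).

Preparing TVM

First, install TVM. See the Installing TVM documentation, or the author's note "TVM Installation".

After that, the model can be tuned through the TVM Python API. We start by importing the following dependencies: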

import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor

Preparing and Loading the Model

Fetch a pretrained ResNet-50 v2 ONNX model and load it:

model_url = "".join(
    [
        "https://github.com/onnx/models/raw/",
        "main/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)

model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)
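
The compile step later needs the name of the graph's input tensor. As an alternative to opening the model in a viewer, the name can be read from the loaded model itself. The sketch below assumes the usual ONNX layout (graph.input and graph.initializer); the helper name `graph_input_names` is hypothetical, and a stand-in object is used here so the example runs without the model file.

```python
from types import SimpleNamespace


def graph_input_names(onnx_model):
    """Return graph inputs that are not backed by an initializer (the real inputs)."""
    # Some ONNX files also list weights under graph.input; filter those out.
    initializers = {init.name for init in onnx_model.graph.initializer}
    return [inp.name for inp in onnx_model.graph.input if inp.name not in initializers]


# Stand-in with the same attribute layout as a loaded ONNX model (illustration only)
fake_model = SimpleNamespace(
    graph=SimpleNamespace(
        input=[SimpleNamespace(name="data"), SimpleNamespace(name="conv1_weight")],
        initializer=[SimpleNamespace(name="conv1_weight")],
    )
)
print(graph_input_names(fake_model))
```

With the real model, the call would be graph_input_names(onnx_model).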

Preparing and Preprocessing the Image

Fetch a test image and preprocess it into a 224x224 NCHW tensor:

img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
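
The same preprocessing can be wrapped in a reusable function. One detail worth noticing: in the code above the mean and standard deviation arrays are float64, so NumPy promotes the normalized result to float64. The sketch below keeps everything in float32 explicitly. The function name `preprocess_hwc` is hypothetical, and a synthetic all-zero image stands in for the downloaded one.

```python
import numpy as np


def preprocess_hwc(img_hwc):
    """HWC uint8 image (already resized to 224x224) -> normalized NCHW float32 batch of one."""
    mean = np.array([0.485, 0.456, 0.406], dtype="float32").reshape((3, 1, 1))
    std = np.array([0.229, 0.224, 0.225], dtype="float32").reshape((3, 1, 1))
    # HWC -> CHW, then scale to [0, 1] and normalize per channel
    chw = np.transpose(img_hwc.astype("float32"), (2, 0, 1))
    normalized = (chw / 255 - mean) / std
    # Add the batch dimension: CHW -> NCHW
    return np.expand_dims(normalized, axis=0)


fake_img = np.zeros((224, 224, 3), dtype="uint8")  # synthetic stand-in image
batch = preprocess_hwc(fake_img)
print(batch.shape, batch.dtype)
```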

Compiling the Model with TVM Relay

Import the ONNX model into Relay with TVM, then build the TVM graph module:

target = input("target [llvm]: ")
if not target:
    target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

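Here target is the target hardware platform. llvm means the CPU; specifying the CPU architecture (and hence its instruction set) is recommended, since it lets TVM generate better-optimized code. The following commands show the CPU: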

$ llc --version | grep CPU
  Host CPU: skylake
$ lscpu

Alternatively, look up the product specifications on the vendor's website (for example, Intel® Products).
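
On Linux, the choice between the target strings mentioned above can also be roughed out from the CPU feature flags. This is only a heuristic sketch: the function names are hypothetical, the mapping from flags to -mcpu values is an assumption that covers just the two variants named in this post, and the Host CPU reported by llc remains the more precise answer.

```python
def pick_llvm_target(cpu_flags):
    """Map a set of x86 CPU feature flags to an LLVM target string (coarse heuristic)."""
    if "avx512f" in cpu_flags:
        return "llvm -mcpu=skylake-avx512"
    if "avx2" in cpu_flags:
        return "llvm -mcpu=core-avx2"
    return "llvm"


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU on Linux; empty set if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


print(pick_llvm_target(read_cpu_flags()))
```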

Running the Model with the TVM Runtime

Run the model with the TVM Runtime to make a prediction:

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

Collecting Performance Data Before Tuning

Measure the untuned model's run time, in milliseconds per run:

import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)

This baseline is used later for comparison against the tuned model.
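
Since the same measurement is repeated after tuning, it can be factored into a small helper. The sketch below reproduces the arithmetic above (timeit reports seconds for `number` calls, so the result is scaled to milliseconds per call). The name `time_ms` is hypothetical, and a trivial function stands in for module.run.

```python
import timeit

import numpy as np


def time_ms(fn, number=10, repeat=10):
    """Time fn and return mean/median/std in milliseconds per call."""
    samples = np.array(timeit.Timer(fn).repeat(repeat=repeat, number=number))
    per_call_ms = samples * 1000 / number
    return {
        "mean": float(np.mean(per_call_ms)),
        "median": float(np.median(per_call_ms)),
        "std": float(np.std(per_call_ms)),
    }


# Stand-in workload; with TVM this would be time_ms(module.run)
stats = time_ms(lambda: sum(range(1000)))
print(stats)
```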

Postprocessing the Output to Get the Prediction

Postprocess the raw output into human-readable classification results:

from scipy.special import softmax

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
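
The postprocessing step boils down to a softmax followed by a top-k selection. A minimal NumPy-only sketch of the same idea is shown below for illustration; the function name `top_k` and the three-class toy input are hypothetical.

```python
import numpy as np


def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs from raw model output."""
    flat = np.squeeze(np.asarray(logits, dtype="float64"))
    exp = np.exp(flat - flat.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(probs)[::-1][:k]
    return [(labels[i], float(probs[i])) for i in order]


demo = top_k([[1.0, 3.0, 2.0]], ["a", "b", "c"], k=2)
print(demo)
```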

Tuning the Model and Collecting Tuning Records

Auto-tune on the target hardware with AutoTVM to collect tuning records:

import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )

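The tuning_option above names the XGBoost-based tuner ("xgb") to guide the search; note that the loop instantiates XGBTuner directly rather than reading the "tuner" key. The measurements are logged to the file named by tuning_records.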
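
To get a feel for what was recorded, the tuning records file can be inspected directly. The sketch below assumes the AutoTVM JSON log layout of one JSON object per line, with "input" holding the target, task name, and arguments, and "result" holding the measured costs followed by an error number. Treat that layout as an assumption and check it against your own file. The helper name and the sample lines are hypothetical.

```python
import json


def best_cost_per_task(lines):
    """Return the lowest mean measured cost (seconds) per workload, skipping failed runs."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        costs, error_no = rec["result"][0], rec["result"][1]
        if error_no != 0:  # non-zero means the measurement failed
            continue
        key = json.dumps(rec["input"][1:3])  # task name + args identify the workload
        mean_cost = sum(costs) / len(costs)
        if key not in best or mean_cost < best[key]:
            best[key] = mean_cost
    return best


# Synthetic sample lines in the assumed layout (not real tuning output)
sample = [
    '{"input": ["llvm", "conv", [1], {}], "config": {}, "result": [[0.004], 0, 1.0, 0.0]}',
    '{"input": ["llvm", "conv", [1], {}], "config": {}, "result": [[0.001, 0.002], 0, 1.0, 0.0]}',
    '{"input": ["llvm", "conv", [1], {}], "config": {}, "result": [[1e9], 1, 1.0, 0.0]}',
]
best = best_cost_per_task(sample)
print(best)
```

With a real run, the lines would come from open("resnet-50-v2-autotuning.json").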

Recompiling the Model with the Tuning Records

Recompile an optimized model using the tuning records:

with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))


# Verify that the optimized model runs and produces the same results

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
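
The code above verifies the tuned model by printing its top predictions for visual comparison. A programmatic check is also possible if the untuned output is saved before recompiling. The sketch below is one way to do that; the function name, the tolerances, and the toy arrays are assumptions for illustration.

```python
import numpy as np


def outputs_match(baseline, tuned, k=5, rtol=1e-4, atol=1e-5):
    """True if both outputs rank the same top-k classes and agree within tolerance."""
    baseline = np.squeeze(np.asarray(baseline))
    tuned = np.squeeze(np.asarray(tuned))
    same_top_k = np.array_equal(
        np.argsort(baseline)[::-1][:k], np.argsort(tuned)[::-1][:k]
    )
    return bool(same_top_k and np.allclose(baseline, tuned, rtol=rtol, atol=atol))


a = np.array([[0.1, 2.0, 1.0]])
print(outputs_match(a, a + 1e-7, k=2))     # tiny numerical drift
print(outputs_match(a, a[:, ::-1], k=2))   # different ranking
```

In this post's flow, `baseline` would be a copy of tvm_output taken before the model is rebuilt, since the variable is overwritten afterwards.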

Comparing the Tuned and Untuned Models

Collect performance data for the tuned model and compare it with the untuned baseline:

import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))

The output of the whole tuning run is as follows:

$ time python autotvm_tune.py
# Compile and run the model with TVM
## Downloading and Loading the ONNX Model
## Downloading, Preprocessing, and Loading the Test Image
## Compile the Model With Relay
target [llvm]: llvm -mcpu=core-avx2
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
## Execute on the TVM Runtime
## Collect Basic Performance Data
{'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
## Postprocess the output
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
# Tune the model with AutoTVM [Y/n]
## Tune the model
[Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
[Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
[Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
[Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
[Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
[Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
[Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
[Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
[Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
[Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
[Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
[Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
[Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
[Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
[Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
[Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
[Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
[Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
[Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
## Compiling an Optimized Model with Tuning Data
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
## Comparing the Tuned and Untuned Models
optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

real    3m23.904s
user    5m2.900s
sys     5m37.099s

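The performance data shows that the tuned model runs faster and more consistently: the mean latency drops from about 45.0 ms to 34.7 ms, and the standard deviation from about 6.9 ms to 0.5 ms.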
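
The comparison can be made concrete by computing the speedup and the relative spread from the two dictionaries printed in the run above. The helper names below are hypothetical; the numbers are copied from that output.

```python
# Values copied from the run output above (milliseconds per inference)
unoptimized = {"mean": 44.97057118016528, "median": 42.52320024970686, "std": 6.870915251002107}
optimized = {"mean": 34.736288779822644, "median": 34.547542000655085, "std": 0.5144378649382363}


def speedup(before, after, key="mean"):
    """Ratio of untuned to tuned latency; values above 1 mean the tuned model is faster."""
    return before[key] / after[key]


def coeff_of_variation(stats):
    """Standard deviation relative to the mean, as a measure of run-to-run stability."""
    return stats["std"] / stats["mean"]


print("mean speedup:   %.2fx" % speedup(unoptimized, optimized))
print("median speedup: %.2fx" % speedup(unoptimized, optimized, key="median"))
print("relative spread: %.3f -> %.3f" % (coeff_of_variation(unoptimized), coeff_of_variation(optimized)))
```

For this run that works out to roughly a 1.29x speedup on the mean. Keep in mind that the run used only 10 trials per task, so a larger trial budget may change the picture.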

