Same Network Structure, Different Inference Speed? -- Notes on a Strange Pitfall

Posted by haoliuhust on 2021-11-28

Background

This is a problem I ran into during deployment work a while back and never got around to writing up. The situation: we had a network (essentially ResNet50) that had already been ported to MNN. Running on a PC, single-threaded inference took about 600 ms. Later the model improved (the structure was unchanged; we only trained with more data), so we moved to upgrade it. Strangely, the new model took over 2 s per inference, more than three times the original. As I understood things at the time, with the structure unchanged, the parameter counts of every convolution and FC layer are identical, so the amount of computation should be identical too. Why such a huge gap? I was baffled. What follows is a record of the diagnosis and the fix.

Problem Diagnosis

First, the model behaved normally under mxnet, so at first I assumed the problem was in MNN and filed an issue: https://github.com/alibaba/MNN/issues/786. I also used timeProfile under MNN/tools/ to profile the model layer by layer (with 4 threads), summarized by layer type:
Original model:

Sort by time cost !
Node Type Avg(ms) % Called times Flops Rate
Reshape 0.072720 0.040518 2.000000 0.000386
Pooling 0.365080 0.203415 4.000000 0.017411
Eltwise 0.559910 0.311970 24.000000 0.026874
PReLU 1.148671 0.640014 25.000000 0.056019
Scale 1.955981 1.089830 52.000000 0.077988
Convolution 175.379929 97.717857 54.000000 99.814148
total time : 179.475815 ms, total mflops : 6321.113770
main, 112, cost time: 17968.628906 ms

New model:

Sort by time cost !
Node Type Avg(ms) % Called times Flops Rate
Reshape 0.072500 0.016967 2.000000 0.000386
Pooling 0.370410 0.086687 4.000000 0.017411
Eltwise 0.571718 0.133798 24.000000 0.026874
PReLU 1.229384 0.287711 25.000000 0.056019
Scale 1.930405 0.451770 52.000000 0.077988
Convolution 423.116730 99.021439 54.000000 99.814148
total time : 427.298096 ms, total mflops : 6321.113770
main, 112, cost time: 42751.425781 ms

The gap is clearly in the convolution layers: the new model's convolutions are far slower. Since replies to the MNN issue were slow to arrive, I went on to test other frameworks (opencv_dnn), where the problem persisted, so I filed an OpenCV issue as well: https://github.com/opencv/opencv/issues/17259. The OpenCV team replied quickly (kudos) and reproduced the slowdown (the OpenVINO backend did not reproduce it); they were puzzled too. Around then MNN also replied, suggesting the /fp:fast compiler option, but it made no difference in my tests, possibly because it did not take effect on Windows. The OpenCV team found that building with -DENABLE_FAST_MATH=ON brought the two models to roughly the same speed. Since fast math can affect accuracy it is not a final solution, but it did point to something numerical. Then one contributor observed:

I zeroed all weights that were smaller than 1e-15 and both give the same efficiency. I suspect that the fusion process is leading to a lot of denormals by multiplying small numbers with small numbers. I have some doubts on my claim though because it's a bit unusual to have models filled with so many tiny weights to cause serious performance degradation.

Denormals have leading zeros in the mantissa which is not-so-normal representation. Normally, you would have leading zeros counted in the exponent to make room for having as many significant digits as possible in the mantissa. When the number becomes so small that you cannot make the exponent any smaller without an overflow, you will use leading zeros in the mantissa to store the value. Most hardware are optimized for handling normal numbers efficiently and often have alternate slow paths for dealing with denormals. When you enable fast math, you are also enabling flush-to-zero (treat denormals as zero). With FTZ, the hardware deals with them efficiently by simply ignoring them.

The CUDA backend didn't face this issue probably because the convolutions are largely implemented using fused multiply-add ops and the FMA pipeline can handle denormals. Only multi-instruction sequences like sqrt require special handling of denormals.

In other words, the slowdown was caused by an abundance of denormals (denormalized, or subnormal, floating-point numbers): extremely small values that hardware processes far more slowly than normal floats. In our case the network weights are all small fractions to begin with; some may simply be too small, and multiplying small numbers by small numbers (e.g., during layer fusion) produces many denormal numbers, which drags down the computation. I counted the weights with magnitude below 1e-15 in the two models, and the slow model indeed had far more of them.
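
To make the effect concrete, here is a minimal, self-contained C++ sketch (my illustration, not code from the issues) that uses std::fpclassify to confirm a value is subnormal and then times the same multiply loop on a normal versus a subnormal starting value:

    #include <chrono>
    #include <cmath>
    #include <cstdio>

    // Alternately halve and double x. The factors are powers of two, so the
    // value cycles exactly between two states: a normal input stays normal
    // and a subnormal input stays subnormal for the whole loop.
    static double benchMs(volatile float x, int iters) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) {
            x = x * 0.5f;
            x = x * 2.0f;
        }
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        float normal = 1.0f;    // ordinary normal float
        float tiny   = 1e-42f;  // below the smallest normal float (~1.18e-38)

        std::printf("1e-42f is subnormal: %d\n",
                    std::fpclassify(tiny) == FP_SUBNORMAL);

        const int iters = 10000000;
        std::printf("normal loop:    %.1f ms\n", benchMs(normal, iters));
        std::printf("subnormal loop: %.1f ms\n", benchMs(tiny, iters));
        return 0;
    }

Build it without fast-math options (those enable flush-to-zero and hide the effect); on typical x86 CPUs the subnormal loop runs many times slower.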

Solution

The root cause is that the weights contain values that are too small. In the OpenCV issue, zeroing the overly small weights to signed zero brought the two models to about the same speed, with no effect on accuracy:

@@ -414,6 +414,10 @@ public:
                 cv::multiply(originWeights.row(i), weightsMultipliers[i], weightsMat.row(i));
                 biasvec[i] *= wi;
             }
+            Mat mask = (abs(weightsMat) <= 1e-15f) & (weightsMat > 0);
+            weightsMat.setTo(0.0f, mask);   // flush positive denormal weights to +0
+            mask = (abs(weightsMat) <= 1e-15f) & (weightsMat < 0);
+            weightsMat.setTo(-0.0f, mask);  // flush negative denormal weights to -0 (a bare -0 is an int literal and would lose the sign)
         }
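
A note on the signed zero: flushing to +0 or -0 according to the original sign keeps the weight's sign bit intact, so anything downstream that depends on the sign of a product behaves exactly as before (IEEE 754 propagates the sign of zero through multiplication); since the values were at most 1e-15 to begin with, accuracy is unaffected.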


Following the same idea, we can set any weight with magnitude below 1e-15 to zero. This can be done on the original model or during model conversion; I chose to modify MNN's conversion code, specifically tools/converter/source/optimizer/PostConverter.cpp, appending the following at the end of optimizeNet:

    // Flush near-zero convolution weights to signed zero so they cannot
    // seed denormals during the multiplies at inference time.
    auto& op_list = newNet->oplists;
    size_t cnt = 0;
    for (auto& op : op_list)
    {
        if (op->type == MNN::OpType::OpType_Convolution ||
            op->type == MNN::OpType::OpType_ConvolutionDepthwise)
        {
            auto conv2D = op->main.AsConvolution2D();
            for (auto& w : conv2D->weight)
            {
                // Stricter alternative: if (std::fpclassify(w) == FP_SUBNORMAL)
                if (std::abs(w) < 1e-15f)
                {
                    cnt += 1;
                    if (w > 0.0f)
                    {
                        w = 0.0f;
                    }
                    else if (w < 0.0f)
                    {
                        w = -0.0f;   // keep the sign of zero
                    }
                }
            }
        }
    }

    std::cout << "weights too small cnt: " << cnt << std::endl;

Then rebuild the conversion tool. Strictly speaking, this fix is not completely general: such tiny values can also be produced during inference itself, so the proper fix is to modify the inference code to flush subnormal floats before and after the convolution computation. That is noticeably more work, and since the change above already solved my problem, I did not pursue it. If you are interested, see how OpenCV handled it: https://github.com/opencv/opencv/pull/17295
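
For reference, the standard runtime approach on x86 is to set the FTZ (flush-to-zero) and DAZ (denormals-are-zero) bits in the thread's MXCSR register so the hardware itself treats subnormals as zero around the heavy compute. A minimal sketch of that general technique (not the actual OpenCV patch) using the SSE intrinsics:

    #include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
    #include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)

    // Call once on each worker thread before running inference. FTZ flushes
    // subnormal results to zero; DAZ treats subnormal inputs as zero. MXCSR
    // is per-thread state, hence the per-thread call.
    void enableFlushDenormalsToZero() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }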

Summary

With an identical model structure, a model whose weights contain too many extremely small values can suffer severely degraded inference speed (the CUDA and OpenVINO backends are unaffected). Such weights can be zeroed during training or when converting the model, with no loss of accuracy.
