SSE Image Algorithm Optimization Series, Part 18: Further SSE Optimization of Bicubic Convolution Interpolation

Posted by weixin_34104341 on 2020-04-07

Original URL: https://blog.csdn.net/weixin_34104341/article/details/93846897

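This article reorganizes my understanding of the algorithm after studying https://blog.csdn.net/housisong/article/details/1452249. Many thanks again to the original author for the in-depth analysis and for sharing it.

Like other interpolation methods, bicubic convolution interpolation computes a weighted sum of the pixels in the neighborhood of the sampling point, with the weights following a particular distribution. Where bilinear interpolation uses 4 neighboring pixels, bicubic convolution uses 16. As a consequence the neighborhood is not placed symmetrically around the floored sampling position, and the cost is considerably higher than for bilinear interpolation.

As shown in the left figure, P00 is the sampling coordinate rounded down to an integer, and the 16 neighboring pixels as a whole lie toward its lower right. The weight of each position is not a fixed value; it depends on the fractional part of the floating-point sampling coordinate. The weights are given by the function Sin(x * pi) / (x * pi), plotted as the blue curve in the right figure. If the fractional part is U, the x values of the 4 weights along the horizontal or vertical direction are 1+U, U, 1-U and 2-U.

In practice this curve is usually approximated by a fitted expression, such as the following code provided by the original author: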

float SinXDivX(float X)
{
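    const float a = -1;                    //    a may also be -2, -1, -0.75, -0.5, etc.; it tunes the amount of sharpening or blurring
    X = fabs(X);                           //    fabs from <math.h>; plain abs may resolve to the integer version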
    float X2 = X * X, X3 = X2 * X;
    if (X <= 1)
        return (a + 2) * X3 - (a + 3) * X2 + 1;
    else if (X <= 2)
        return a * X3 - (5 * a) * X2 + (8 * a) * X - (4 * a);
    else
        return 0;
}

The exact (standard) function is:

float SinXDivX_Standard(float X)
{
    if (fabs(X) < 0.000001f)
        return 1;
    else
        return sin(X * 3.1415926f) / (X * 3.1415926f);
}

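Note one thing. Take X = 0.3 as an example. With the standard function,

    SinXDivX_Standard(1 + X) + SinXDivX_Standard(X) + SinXDivX_Standard(1 - X) + SinXDivX_Standard(2 - X) = 0.8767

whereas with the fitted expression,

    SinXDivX(1 + X) + SinXDivX(X) + SinXDivX(1 - X) + SinXDivX(2 - X) equals 1.

So one advantage of the fitted expression is that the weights need no further normalization.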
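The sum-to-one property is easy to check numerically. The sketch below restates the two functions above in self-contained C++ and sums the four weights for a given fractional offset. The names CubicKernel, SincKernel and WeightSum are my own and do not appear in the original code.

```cpp
#include <cmath>

// Cubic convolution kernel: the same polynomial as SinXDivX above, with a = -1.
inline float CubicKernel(float x)
{
    const float a = -1.0f;
    x = std::fabs(x);
    float x2 = x * x, x3 = x2 * x;
    if (x <= 1.0f) return (a + 2) * x3 - (a + 3) * x2 + 1;
    if (x <= 2.0f) return a * x3 - (5 * a) * x2 + (8 * a) * x - (4 * a);
    return 0.0f;
}

// Exact sin(pi * x) / (pi * x), as in SinXDivX_Standard above.
inline float SincKernel(float x)
{
    const float pi = 3.1415926f;
    if (std::fabs(x) < 0.000001f) return 1.0f;
    return std::sin(x * pi) / (x * pi);
}

// Sum of the four weights used along one axis for a fractional offset u in [0, 1).
inline float WeightSum(float (*kernel)(float), float u)
{
    return kernel(1 + u) + kernel(u) + kernel(1 - u) + kernel(2 - u);
}
```

Calling WeightSum(CubicKernel, 0.3f) gives 1 up to floating-point rounding, while WeightSum(SincKernel, 0.3f) gives about 0.8767, matching the figures quoted above.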

For a floating-point mapped coordinate, a straightforward implementation of bicubic convolution interpolation looks like this:

void Bicubic_Original(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, float X, float Y)
{
    int Channel = Stride / Width;
    int PosX = floor(X), PosY = floor(Y);
    float PartXX = X - PosX, PartYY = Y - PosY;

    unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);
    unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);
    unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);
    unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);
    unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);
    unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);
    unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);
    unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);
    unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);
    unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);
    unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);
    unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);
    unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);
    unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);
    unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);
    unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);

    float U0 = SinXDivX(1 + PartXX), U1 = SinXDivX(PartXX);
    float U2 = SinXDivX(1 - PartXX), U3 = SinXDivX(2 - PartXX);
    float V0 = SinXDivX(1 + PartYY), V1 = SinXDivX(PartYY);
    float V2 = SinXDivX(1 - PartYY), V3 = SinXDivX(2 - PartYY);

    for (int I = 0; I < Channel; I++)
    {
        float Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        float Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        float Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        float Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        Pixel[I] = IM_ClampToByte(Sum1 + Sum2 + Sum3 + Sum4 + 0.5f);
    }
}

Here GetCheckedPixel is a simple function that fetches a pixel after clamping the coordinates to the image.

inline unsigned char *GetCheckedPixel(unsigned char *Src, int Width, int Height, int Stride, int Channel, int PosX, int PosY)
{
    return Src + IM_ClampI(PosY, 0, Height - 1) * Stride + IM_ClampI(PosX, 0, Width - 1) * Channel;
}
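The listings use IM_ClampI and IM_ClampToByte without showing them. A minimal sketch of what they presumably look like follows; these definitions are my assumption, not the author's code.

```cpp
// Hypothetical definitions: the article names these helpers but does not list them.

// Clamp Value to the inclusive range [Min, Max].
inline int IM_ClampI(int Value, int Min, int Max)
{
    return Value < Min ? Min : (Value > Max ? Max : Value);
}

// Clamp Value to [0, 255] and narrow it to a byte.
inline unsigned char IM_ClampToByte(int Value)
{
    return (unsigned char)(Value < 0 ? 0 : (Value > 255 ? 255 : Value));
}
```

With definitions of this kind, a coordinate that falls outside the image is simply mapped to the nearest edge pixel, which is the usual edge-replication policy.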

Following the original author's code, the most direct bicubic resampling function is:

int IM_Resample_Original(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD, int InterpolationMode)
{
    if ((Src == NULL) || (Dest == NULL))                                return IM_STATUS_NULLREFRENCE;
    if ((SrcW <= 0) || (SrcH <= 0) || (DstW <= 0) || (DstH <= 0))        return IM_STATUS_INVALIDPARAMETER;
    int Channel = StrideS / SrcW;                                        //    computed only after SrcW is known to be positive
    if ((Channel != 1) && (Channel != 3) && (Channel != 4))                return IM_STATUS_INVALIDPARAMETER;

    if ((SrcW == DstW) && (SrcH == DstH))
    {
        for (int Y = 0; Y < SrcH; Y++)                                    //    copy row by row so that padded strides are handled
            memcpy(Dest + Y * StrideD, Src + Y * StrideS, SrcW * Channel * sizeof(unsigned char));
        return IM_STATUS_OK;
    }

    //    Verified: this is not worth optimizing with SSE, the speed barely changes. 2018.3.28
    if (InterpolationMode == 0)                            //    nearest-neighbor interpolation
    {

    }
    else if (InterpolationMode == 1)                    //    bilinear interpolation
    {

    }
    else if (InterpolationMode == 2)                    //    bicubic interpolation
    {
        for (int Y = 0; Y < DstH; Y++)
        {
            unsigned char *LinePD = Dest + Y * StrideD;
            float SrcY = (Y + 0.4999999f) * SrcH / DstH - 0.5f;
            for (int X = 0; X < DstW; X++)
            {
                float SrcX = (X + 0.4999999f) * SrcW / DstW - 0.5f;
                Bicubic_Original(Src, SrcW, SrcH, StrideS, LinePD, SrcX, SrcY);
                LinePD += Channel;
            }    
        }
    }
    return IM_STATUS_OK;
}

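This version is very slow because of the large amount of floating-point arithmetic and coordinate computation.

To speed it up, the original author applied a number of optimizations, mainly: (1) fixed-point arithmetic in the scaling function; (2) separate handling of the border and the interior; (3) a lookup table for SinXDivX; (4) a lookup table for border_color. I reorganized the code along my own lines into a style I am more familiar with. The main fragments are shown below: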
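To make the fixed-point step concrete, here is a small sketch of how one destination index maps to a source position in 16.16 fixed point, using the same expressions for the step and the initial offset as the listings in this article. The struct FixedCoord and the function MapToSource are illustrative names of my own.

```cpp
// 16.16 fixed-point source coordinate of one destination pixel (illustrative type).
struct FixedCoord
{
    int Pos;     // integer part, i.e. the floor of the source coordinate
    int Frac;    // top 8 bits of the fraction, 0..255, used as the weight table index
};

inline FixedCoord MapToSource(int X, int SrcSize, int DstSize)
{
    int Add = (SrcSize << 16) / DstSize;      // source step per destination pixel
    int Error = -(1 << 15) + (Add >> 1);      // pixel-center alignment: 0.5 * step - 0.5
    int Src = Error + X * Add;
    FixedCoord C;
    C.Pos = Src >> 16;                        // assumes an arithmetic right shift for negative values
    C.Frac = (unsigned char)(Src >> 8);       // keeps only the top 8 fraction bits
    return C;
}
```

For an 800 to 1024 enlargement, destination pixel 0 maps to the floating-point coordinate 0.5 * 800 / 1024 - 0.5 = -0.109375, and the sketch returns Pos = -1 and Frac = 228, which is the floor and 0.890625 * 256 respectively. Note that SrcSize << 16 overflows a 32-bit int once the source size reaches 32768.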

//    Bicubic interpolation at the image border
__forceinline void Bicubic_Border(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY)
{
    int Channel = Stride / Width;
    int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);

    int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
    int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
    int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
    int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
    int PosX = SrcX >> 16, PosY = SrcY >> 16;

    unsigned char *Pixel00 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY - 1);
    unsigned char *Pixel01 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY - 1);
    unsigned char *Pixel02 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY - 1);
    unsigned char *Pixel03 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY - 1);
    unsigned char *Pixel10 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 0);
    unsigned char *Pixel11 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 0);
    unsigned char *Pixel12 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 0);
    unsigned char *Pixel13 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 0);
    unsigned char *Pixel20 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 1);
    unsigned char *Pixel21 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 1);
    unsigned char *Pixel22 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 1);
    unsigned char *Pixel23 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 1);
    unsigned char *Pixel30 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX - 1, PosY + 2);
    unsigned char *Pixel31 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 0, PosY + 2);
    unsigned char *Pixel32 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 1, PosY + 2);
    unsigned char *Pixel33 = GetCheckedPixel(Src, Width, Height, Stride, Channel, PosX + 2, PosY + 2);

    for (int I = 0; I < Channel; I++)
    {
        int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        Pixel[I] = IM_ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
    }
}

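//    Forcing inlining with __forceinline does give a small speedup, since this function takes many parameters
//    If the channel count is known, replace Channel with a constant; this is considerably faster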
__forceinline void Bicubic_Center(unsigned char *Src, int Width, int Height, int Stride, unsigned char *Pixel, short *SinXDivX_Table, int SrcX, int SrcY)
{
    int Channel = Stride / Width;
    int U = (unsigned char)(SrcX >> 8), V = (unsigned char)(SrcY >> 8);

    int U0 = SinXDivX_Table[256 + U], U1 = SinXDivX_Table[U];
    int U2 = SinXDivX_Table[256 - U], U3 = SinXDivX_Table[512 - U];
    int V0 = SinXDivX_Table[256 + V], V1 = SinXDivX_Table[V];
    int V2 = SinXDivX_Table[256 - V], V3 = SinXDivX_Table[512 - V];
    int PosX = SrcX >> 16, PosY = SrcY >> 16;

    unsigned char *Pixel00 = Src + (PosY - 1) * Stride + (PosX - 1) * Channel;
    unsigned char *Pixel01 = Pixel00 + Channel;
    unsigned char *Pixel02 = Pixel01 + Channel;
    unsigned char *Pixel03 = Pixel02 + Channel;
    unsigned char *Pixel10 = Pixel00 + Stride;
    unsigned char *Pixel11 = Pixel10 + Channel;
    unsigned char *Pixel12 = Pixel11 + Channel;
    unsigned char *Pixel13 = Pixel12 + Channel;
    unsigned char *Pixel20 = Pixel10 + Stride;
    unsigned char *Pixel21 = Pixel20 + Channel;
    unsigned char *Pixel22 = Pixel21 + Channel;
    unsigned char *Pixel23 = Pixel22 + Channel;
    unsigned char *Pixel30 = Pixel20 + Stride;
    unsigned char *Pixel31 = Pixel30 + Channel;
    unsigned char *Pixel32 = Pixel31 + Channel;
    unsigned char *Pixel33 = Pixel32 + Channel;
    for (int I = 0; I < Channel; I++)
    {
        int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;
        int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;
        int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;
        int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;
        Pixel[I] = IM_ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
    }
}

int IM_Resample_PureC(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD, int InterpolationMode)
{
    if ((Src == NULL) || (Dest == NULL))                                return IM_STATUS_NULLREFRENCE;
    if ((SrcW <= 0) || (SrcH <= 0) || (DstW <= 0) || (DstH <= 0))        return IM_STATUS_INVALIDPARAMETER;
    int Channel = StrideS / SrcW;                                        //    computed only after SrcW is known to be positive
    if ((Channel != 1) && (Channel != 3) && (Channel != 4))                return IM_STATUS_INVALIDPARAMETER;

    if ((SrcW == DstW) && (SrcH == DstH))
    {
        for (int Y = 0; Y < SrcH; Y++)                                    //    copy row by row so that padded strides are handled
            memcpy(Dest + Y * StrideD, Src + Y * StrideS, SrcW * Channel * sizeof(unsigned char));
        return IM_STATUS_OK;
    }

    //    Verified: this is not worth optimizing with SSE, the speed barely changes. 2018.3.28
    if (InterpolationMode == 0)                            //    nearest-neighbor interpolation
    {

    }
    else if (InterpolationMode == 1)                    //    bilinear interpolation
    {

    }
    else if (InterpolationMode == 2)            //    bicubic interpolation
    {
        short *SinXDivX_Table = (short *)malloc(513 * sizeof(short));
        if (SinXDivX_Table == NULL)        return IM_STATUS_NULLREFRENCE;
        for (int I = 0; I < 513; I++)
            SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f));            //    build the lookup table in fixed point (scaled by 256)

        int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;
        int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);

        int StartX = ((1 << 16) - ErrorX) / AddX + 1;            //    compute the borders that need special handling
        int StartY = ((1 << 16) - ErrorY) / AddY + 1;            //    y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr
        int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;
        int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;    //    y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr
        if (StartY >= DstH)            StartY = DstH;
        if (StartX >= DstW)            StartX = DstW;
        if (EndX < StartX)            EndX = StartX;
        if (EndY < StartY)            EndY = StartY;

        int SrcY = ErrorY;
        for (int Y = 0; Y < StartY; Y++, SrcY += AddY)            //    leading rows, where not every sample position lies inside the image
        {
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        for (int Y = StartY; Y < EndY; Y++, SrcY += AddY)
        {
            int SrcX = ErrorX;
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
            for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Center(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
            for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        for (int Y = EndY; Y < DstH; Y++, SrcY += AddY)
        {
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        free(SinXDivX_Table);
    }
    return IM_STATUS_OK;
}
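The StartX/EndX arithmetic is the part of this listing that is easiest to get wrong, so it is worth checking in isolation. The sketch below recomputes the interior range for one axis with the same formulas and then verifies by brute force that every index inside it reads only valid source positions. ComputeCenterRange and CenterRangeIsSafe are hypothetical helper names, not part of the original code.

```cpp
// Destination indices [Start, End) whose 4 taps all fall inside the source (illustrative type).
struct CenterRange { int Start, End; };

inline CenterRange ComputeCenterRange(int SrcSize, int DstSize)
{
    int Add = (SrcSize << 16) / DstSize;
    int Error = -(1 << 15) + (Add >> 1);
    CenterRange R;
    R.Start = ((1 << 16) - Error) / Add + 1;                  // first index with Pos - 1 >= 0
    R.End = (((SrcSize - 3) << 16) - Error) / Add + 1;        // one past the last index with Pos + 2 <= SrcSize - 1
    if (R.Start >= DstSize) R.Start = DstSize;
    if (R.End < R.Start) R.End = R.Start;
    return R;
}

// Brute-force check: every index in the range touches only source positions 0 .. SrcSize - 1.
inline bool CenterRangeIsSafe(int SrcSize, int DstSize)
{
    int Add = (SrcSize << 16) / DstSize;
    int Error = -(1 << 15) + (Add >> 1);
    CenterRange R = ComputeCenterRange(SrcSize, DstSize);
    for (int X = R.Start; X < R.End; X++)
    {
        int Pos = (Error + X * Add) >> 16;
        if (Pos - 1 < 0 || Pos + 2 > SrcSize - 1) return false;
    }
    return true;
}
```

For the 800 to 1024 case used in the timings of this article, the range comes out as [2, 1021), so only 5 of the 1024 columns in each interior row go through the slower border path.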

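Because Bicubic_Border and Bicubic_Center are called an enormous number of times, the call overhead is not negligible. In Visual Studio, __forceinline can be used to force inlining, which gives roughly a 10% speedup in this example.

Bicubic_Border and Bicubic_Center use a for loop over the channels so that they work for any channel count. In practice they should be unrolled per channel count for efficiency; the unrolled versions are about 30% faster.

With the plain C code above, enlarging a 32-bit 800*600 image to 1024*768 takes about 40 ms (about 30 ms when the channels are unrolled).

To go further, let us consider an SSE optimization of this algorithm. HouSisong's column already contains SSE-optimized code, but it is written directly in inline assembly, which is hard to follow, and 64-bit builds with Visual C++ no longer support inline assembly. Intrinsics are still available, so that is what I use here (to be honest, I did not fully understand HouSisong's code either).

The border accounts for little of the computation, so it simply uses the C version, Bicubic_Border. The focus is on Bicubic_Center.

The first part of Bicubic_Center computes the weights and the memory addresses of the samples. Leave that aside for now and look at the core computation: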

    for (int I = 0; I < Channel; I++)
    {
        int Sum1 = (Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3) * V0;        //    row 1
        int Sum2 = (Pixel10[I] * U0 + Pixel11[I] * U1 + Pixel12[I] * U2 + Pixel13[I] * U3) * V1;        //    row 2
        int Sum3 = (Pixel20[I] * U0 + Pixel21[I] * U1 + Pixel22[I] * U2 + Pixel23[I] * U3) * V2;        //    row 3
        int Sum4 = (Pixel30[I] * U0 + Pixel31[I] * U1 + Pixel32[I] * U2 + Pixel33[I] * U3) * V3;        //    row 4
        Pixel[I] = IM_ClampToByte((Sum1 + Sum2 + Sum3 + Sum4) >> 16);
    }

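Consider Channel == 1 first and look at the expression Pixel00[I] * U0 + Pixel01[I] * U1 + Pixel02[I] * U2 + Pixel03[I] * U3. In this case Pixel00/Pixel01/Pixel02/Pixel03 are contiguous in memory and their values lie in [0, 255]. By construction of the lookup table, U0/U1/U2/U3 are small integers as well (for a = -1 roughly in [-38, 256], because the kernel is negative for 1 < x < 2), so all of these values fit in a short. The expression is a sequence of multiplications followed by an accumulation, which suggests a special SSE instruction, _mm_madd_epi16. MSDN describes it as follows: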

Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.

  __m128i _mm_madd_epi16 (__m128i a, __m128i b);

Return Value

　 Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.

　　r0 := (a0 * b0) + (a1 * b1)
　　r1 := (a2 * b2) + (a3 * b3)
　　r2 := (a4 * b4) + (a5 * b5)
　　r3 := (a6 * b6) + (a7 * b7)

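In other words, a and b each hold 8 signed 16-bit values. Corresponding elements are multiplied, adjacent products are added in pairs, and the results are stored as 4 signed 32-bit integers.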
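A small experiment makes the pairing explicit. The following sketch uses only SSE2 intrinsics to compute the dot products of two 4-pixel rows with the same 4 weights through a single _mm_madd_epi16. The function name TwoRowDot is my own, and the final pairwise additions are done in scalar code here for clarity.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics

// Dot products of two rows of 4 pixels with the same 4 weights, using one _mm_madd_epi16.
// Out[0] receives the sum for Row0, Out[1] the sum for Row1.
inline void TwoRowDot(const short Row0[4], const short Row1[4], const short U[4], int Out[2])
{
    __m128i P = _mm_setr_epi16(Row0[0], Row0[1], Row0[2], Row0[3], Row1[0], Row1[1], Row1[2], Row1[3]);
    __m128i W = _mm_setr_epi16(U[0], U[1], U[2], U[3], U[0], U[1], U[2], U[3]);
    __m128i M = _mm_madd_epi16(P, W);        // four 32-bit sums of adjacent products
    int R[4];
    _mm_storeu_si128((__m128i *)R, M);
    Out[0] = R[0] + R[1];                    // pairwise add, done here in scalar code
    Out[1] = R[2] + R[3];
}
```

With Row0 = {10, 20, 30, 40}, Row1 = {1, 2, 3, 4} and U = {1, 2, 3, 4}, the two outputs are 300 and 30. Each 32-bit lane of the madd result holds only half of a row sum, which is why one more pairwise addition is required.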

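In our case, each of rows 1 to 4 involves only 4 multiplications and 3 additions, so the instruction cannot be applied to a single row directly. We can, however, pack two rows together and compute them at once. That takes 2 calls to _mm_madd_epi16, and passing the two results to the horizontal add _mm_hadd_epi32 produces the four row sums. It feels almost magical. The core code is as follows: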

    if (Channel == 1)
    {
        __m128i P01 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel0)), _mm_cvtsi32_si128(*((int *)Pixel1))));            //    P00 P01 P02 P03 P10 P11 P12 P13
        __m128i P23 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel2)), _mm_cvtsi32_si128(*((int *)Pixel3))));            //    P20 P21 P22 P23 P30 P31 P32 P33
        __m128i Sum01 = _mm_madd_epi16(P01, PartX);                            // P00 * U0 + P01 * U1        P02 * U2 + P03 * U3         P10 * U0 + P11 * U1        P12 * U2 + P13 * U3
        __m128i Sum23 = _mm_madd_epi16(P23, PartX);                            // P20 * U0 + P21 * U1        P22 * U2 + P23 * U3         P30 * U0 + P31 * U1        P32 * U2 + P33 * U3
        __m128i Sum = _mm_hadd_epi32(Sum01, Sum23);                            // P00 * U0 + P01 * U1 + P02 * U2 + P03 * U3     P10 * U0 + P11 * U1 + P12 * U2 + P13 * U3    P20 * U0 + P21 * U1    + P22 * U2 + P23 * U3    P30 * U0 + P31 * U1 + P32 * U2 + P33 * U3
        LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(Sum, PartY)) >> 16);
    }

Here _mm_hsum_epi32 is a user-defined helper function.

//    Sum of 4 signed 32-bit integers.
inline int _mm_hsum_epi32(__m128i V)                        //    V3 V2 V1 V0
{
    //    In my tests this variant is the faster one; _mm_extract_epi32 is the slowest.
    __m128i T = _mm_add_epi32(V, _mm_srli_si128(V, 8));        //    V3            V2            V3+V1        V2+V0
    T = _mm_add_epi32(T, _mm_srli_si128(T, 4));                //    V3            V3+V2        V3+V2+V1    V3+V2+V1+V0
    return _mm_cvtsi128_si32(T);                            //    extract the low 32 bits
}

Sometimes these things simply cannot be expressed clearly and effectively in words, while the code conveys them with half the effort.
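When experimenting with the SIMD path, a scalar reference for the same data flow is useful for comparing results. The sketch below mirrors the single-channel computation: four row dot products (the madd and hadd steps), a weighted sum over the rows (the mullo and hsum steps), then the 16-bit shift and clamp. BicubicGrayReference is a hypothetical name, and the weights are assumed to be scaled by 256 as in the lookup table.

```cpp
// Scalar reference for the single-channel kernel.
// Rows[r] points at the first of 4 horizontally adjacent source pixels in row r;
// U and V hold the horizontal and vertical weights, each scaled by 256.
inline int BicubicGrayReference(const unsigned char *const Rows[4], const short U[4], const short V[4])
{
    int Sum = 0;
    for (int R = 0; R < 4; R++)
    {
        int RowSum = 0;
        for (int C = 0; C < 4; C++)
            RowSum += Rows[R][C] * U[C];        // horizontal pass
        Sum += RowSum * V[R];                   // vertical pass
    }
    int Value = Sum >> 16;                      // remove the 256 * 256 scale
    return Value < 0 ? 0 : (Value > 255 ? 255 : Value);
}
```

On a flat area the function returns the input value unchanged as long as each weight set sums to 256. For example, a neighborhood of all 100s with weights {-32, 160, 160, -32} in both directions gives 100.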

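Earlier we verified that the weights from the fitted formula sum to exactly one, so no normalization is needed. One might therefore expect the value after the final shift to lie in [0, 255] already, making the clamp unnecessary. In practice, without the clamp some pixels in the result overflow. One reason is that the fixed-point conversion slightly breaks the sum-to-one property. Another is that the kernel has negative lobes, so the weighted sum can overshoot the range of its inputs near sharp edges even when the weights sum to exactly one.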
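Apart from fixed-point rounding, there is a reason for the clamp that shows up even in exact floating-point arithmetic: the cubic kernel takes negative values for 1 < x < 2, so the interpolated value can leave the range of the samples. The one-dimensional sketch below demonstrates this with a = -1. Cubic1D is an illustrative name of my own.

```cpp
#include <cmath>

// One-dimensional cubic convolution of 4 samples at fractional offset U, with a = -1.
inline float Cubic1D(const float P[4], float U)
{
    auto W = [](float X) -> float {
        const float a = -1.0f;
        X = std::fabs(X);
        float X2 = X * X, X3 = X2 * X;
        if (X <= 1.0f) return (a + 2) * X3 - (a + 3) * X2 + 1;
        if (X <= 2.0f) return a * X3 - (5 * a) * X2 + (8 * a) * X - (4 * a);
        return 0.0f;
    };
    return P[0] * W(1 + U) + P[1] * W(U) + P[2] * W(1 - U) + P[3] * W(2 - U);
}
```

For the samples {0, 255, 255, 0} at U = 0.5 the weights are {-0.125, 0.625, 0.625, -0.125} and the result is 318.75, well above 255. For {255, 0, 0, 255} the result is -63.75. Both cases have to be clamped before being stored in a byte.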

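Note that in the Bicubic_Center loop the V weights are constant within a row, so they can be set once with _mm_setr_epi32 at the start of each row. The U weights change from pixel to pixel within a row but are identical for every row. We could call _mm_setr_epi32 for every pixel, but using this intrinsic repeatedly is relatively expensive. It is better to store the U weights of one row in a temporary buffer and load them from the appropriate offset each time.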
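The per-row buffer of U weights can be sketched on its own. The code below builds the 513-entry fixed-point kernel table with the same rounding expression as the listings in this article, then fills four weights per destination column in the order U0 U1 U2 U3, so that one 64-bit load fetches all four. The function names and the use of std::vector are my own choices. For simplicity the sketch fills every column, whereas the article's listing fills only the interior ones.

```cpp
#include <cmath>
#include <vector>

// Cubic kernel with a = -1, the same polynomial as SinXDivX in the article.
inline float CubicWeight(float X)
{
    const float a = -1.0f;
    X = std::fabs(X);
    float X2 = X * X, X3 = X2 * X;
    if (X <= 1.0f) return (a + 2) * X3 - (a + 3) * X2 + 1;
    if (X <= 2.0f) return a * X3 - (5 * a) * X2 + (8 * a) * X - (4 * a);
    return 0.0f;
}

// 513-entry kernel table in fixed point, scaled by 256.
inline std::vector<short> BuildKernelTable()
{
    std::vector<short> T(513);
    for (int I = 0; I < 513; I++)
        T[I] = (short)int(0.5 + 256 * CubicWeight(I / 256.0f));
    return T;
}

// Four horizontal weights per destination column, laid out U0 U1 U2 U3.
inline std::vector<short> BuildColumnWeights(const std::vector<short> &T, int SrcW, int DstW)
{
    std::vector<short> W(DstW * 4);
    int AddX = (SrcW << 16) / DstW;
    int SrcX = -(1 << 15) + (AddX >> 1);
    for (int X = 0; X < DstW; X++, SrcX += AddX)
    {
        int U = (unsigned char)(SrcX >> 8);      // top 8 fraction bits
        W[X * 4 + 0] = T[256 + U];
        W[X * 4 + 1] = T[U];
        W[X * 4 + 2] = T[256 - U];
        W[X * 4 + 3] = T[512 - U];
    }
    return W;
}
```

The four weights of a column add up to a value close to 256 but not always exactly 256, because each entry is rounded separately.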

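For images with more than one channel, 4 channels for example, the idea is the same, but more splitting and recombining is needed to arrange the source data in the layout the SIMD instructions expect. This calls for flexible combinations of _mm_shuffle_epi8, _mm_unpacklo_epi32, _mm_unpackhi_epi32, _mm_unpacklo_epi8, _mm_unpackhi_epi8 and similar instructions, all of which are fast. For 32-bit images, 4 bytes are produced at once, so the final IM_ClampToByte can be replaced by the SIMD saturating pack instructions (such as _mm_packus_epi32), which helps a little, and the result can be written straight to memory with the cache-bypassing _mm_stream_si32, so the overall efficiency improves considerably. For the complete shuffling code, see the attachment of this article.

Finally, here is the SSE-optimized code: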
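The byte shuffle used for 4-channel data is easier to follow with a scalar model. The sketch below emulates _mm_shuffle_epi8 for mask bytes in the range 0..15 (the real instruction additionally writes zero when the top bit of a mask byte is set) and applies the deinterleaving mask from the SSE listing in this article. DeinterleaveBGRA is a hypothetical name.

```cpp
// Scalar model of the BGRA deinterleave done with _mm_shuffle_epi8 in the SSE listing.
// Src holds 4 pixels as B0 G0 R0 A0 B1 G1 R1 A1 ...; Dst receives B0 B1 B2 B3 G0 G1 G2 G3 R0 .. A3.
inline void DeinterleaveBGRA(const unsigned char Src[16], unsigned char Dst[16])
{
    static const unsigned char Mask[16] = { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 };
    for (int I = 0; I < 16; I++)
        Dst[I] = Src[Mask[I]];        // for mask bytes 0..15: Dst[i] = Src[Mask[i]]
}
```

After this step every 32-bit lane holds the 4 horizontally adjacent samples of one channel, which is the layout the single-channel madd code expects once the bytes are widened to 16 bits.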

int IM_Resample_SSE(unsigned char *Src, unsigned char *Dest, int SrcW, int SrcH, int StrideS, int DstW, int DstH, int StrideD, int InterpolationMode)
{
    if ((Src == NULL) || (Dest == NULL))                                return IM_STATUS_NULLREFRENCE;
    if ((SrcW <= 0) || (SrcH <= 0) || (DstW <= 0) || (DstH <= 0))        return IM_STATUS_INVALIDPARAMETER;
    int Channel = StrideS / SrcW;                                        //    computed only after SrcW is known to be positive
    if ((Channel != 1) && (Channel != 3) && (Channel != 4))                return IM_STATUS_INVALIDPARAMETER;

    if ((SrcW == DstW) && (SrcH == DstH))
    {
        for (int Y = 0; Y < SrcH; Y++)                                    //    copy row by row so that padded strides are handled
            memcpy(Dest + Y * StrideD, Src + Y * StrideS, SrcW * Channel * sizeof(unsigned char));
        return IM_STATUS_OK;
    }

    //    Verified: this is not worth optimizing with SSE, the speed barely changes. 2018.3.28
    if (InterpolationMode == 0)                            //    nearest-neighbor interpolation
    {

    }
    else if (InterpolationMode == 1)                    //    bilinear interpolation
    {

    }
    else if (InterpolationMode == 2)            //    bicubic interpolation
    {
        short *SinXDivX_Table = (short *)malloc(513 * sizeof(short));
        short *Table = (short *)malloc(DstW * 4 * sizeof(short));
        if ((SinXDivX_Table == NULL) || (Table == NULL))
        {
            if (SinXDivX_Table != NULL)        free(SinXDivX_Table);
            if (Table != NULL)                free(Table);
            return IM_STATUS_NULLREFRENCE;
        }
        for (int I = 0; I < 513; I++)
            SinXDivX_Table[I] = int(0.5 + 256 * SinXDivX(I / 256.0f));            //    build the lookup table in fixed point (scaled by 256)

        int AddX = (SrcW << 16) / DstW, AddY = (SrcH << 16) / DstH;
        int ErrorX = -(1 << 15) + (AddX >> 1), ErrorY = -(1 << 15) + (AddY >> 1);

        int StartX = ((1 << 16) - ErrorX) / AddX + 1;            //    compute the borders that need special handling
        int StartY = ((1 << 16) - ErrorY) / AddY + 1;            //    y0+y*yr>=1; y0=ErrorY => y>=(1-ErrorY)/yr
        int EndX = (((SrcW - 3) << 16) - ErrorX) / AddX + 1;
        int EndY = (((SrcH - 3) << 16) - ErrorY) / AddY + 1;    //    y0+y*yr<=(height-3) => y<=(height-3-ErrorY)/yr
        if (StartY >= DstH)            StartY = DstH;
        if (StartX >= DstW)            StartX = DstW;
        if (EndX < StartX)            EndX = StartX;
        if (EndY < StartY)            EndY = StartY;

        for (int X = StartX, SrcX = ErrorX + StartX * AddX; X < EndX; X++, SrcX += AddX)
        {
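            int U = (unsigned char)(SrcX >> 8);                    //    entries before StartX and after EndX are never used, but memory is allocated for them for convenience
            Table[X * 4 + 0] = SinXDivX_Table[256 + U];            //    build this table up front so that the SSE code can load the weights directly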
            Table[X * 4 + 1] = SinXDivX_Table[U];
            Table[X * 4 + 2] = SinXDivX_Table[256 - U];
            Table[X * 4 + 3] = SinXDivX_Table[512 - U];
        }
        int SrcY = ErrorY;
        for (int Y = 0; Y < StartY; Y++, SrcY += AddY)            //    leading rows, where not every sample position lies inside the image
        {
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        for (int Y = StartY; Y < EndY; Y++, SrcY += AddY)
        {
            int SrcX = ErrorX;
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0; X < StartX; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
            int V = (unsigned char)(SrcY >> 8);
            unsigned char *LineY = Src + ((SrcY >> 16) - 1) * StrideS;
            __m128i PartY = _mm_setr_epi32(SinXDivX_Table[256 + V], SinXDivX_Table[V], SinXDivX_Table[256 - V], SinXDivX_Table[512 - V]);
            for (int X = StartX; X < EndX; X++, SrcX += AddX, LinePD += Channel)
            {
                __m128i PartX = _mm_loadl_epi64((__m128i *)(Table + X * 4));
                PartX = _mm_unpacklo_epi64(PartX, PartX);                                //    U0 U1 U2 U3 U0 U1 U2 U3
                unsigned char *Pixel0 = LineY + ((SrcX >> 16) - 1) * Channel;
                unsigned char *Pixel1 = Pixel0 + StrideS;
                unsigned char *Pixel2 = Pixel1 + StrideS;
                unsigned char *Pixel3 = Pixel2 + StrideS;
                if (Channel == 1)
                {
                    __m128i P01 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel0)), _mm_cvtsi32_si128(*((int *)Pixel1))));            //    P00 P01 P02 P03 P10 P11 P12 P13
                    __m128i P23 = _mm_cvtepu8_epi16(_mm_unpacklo_epi32(_mm_cvtsi32_si128(*((int *)Pixel2)), _mm_cvtsi32_si128(*((int *)Pixel3))));            //    P20 P21 P22 P23 P30 P31 P32 P33
                    __m128i Sum01 = _mm_madd_epi16(P01, PartX);                            // P00 * U0 + P01 * U1        P02 * U2 + P03 * U3         P10 * U0 + P11 * U1        P12 * U2 + P13 * U3
                    __m128i Sum23 = _mm_madd_epi16(P23, PartX);                            // P20 * U0 + P21 * U1        P22 * U2 + P23 * U3         P30 * U0 + P31 * U1        P32 * U2 + P33 * U3
                    __m128i Sum = _mm_hadd_epi32(Sum01, Sum23);                            // P00 * U0 + P01 * U1 + P02 * U2 + P03 * U3     P10 * U0 + P11 * U1 + P12 * U2 + P13 * U3    P20 * U0 + P21 * U1    + P22 * U2 + P23 * U3    P30 * U0 + P31 * U1 + P32 * U2 + P33 * U3
                    LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(Sum, PartY)) >> 16);
                }
                else if (Channel == 3)
                {
                
                }
                else if (Channel == 4)
                {
                    __m128i P0 = _mm_loadu_si128((__m128i *)Pixel0), P1 = _mm_loadu_si128((__m128i *)Pixel1);
                    __m128i P2 = _mm_loadu_si128((__m128i *)Pixel2), P3 = _mm_loadu_si128((__m128i *)Pixel3);

                    //    This arrangement needs 8 fewer instructions than combining _mm_shuffle_epi8 with _mm_or_si128
                    P0 = _mm_shuffle_epi8(P0, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));        //    B0 G0 R0 A0
                    P1 = _mm_shuffle_epi8(P1, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));        //    B1 G1 R1 A1
                    P2 = _mm_shuffle_epi8(P2, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));        //  B2 G2 R2 A2
                    P3 = _mm_shuffle_epi8(P3, _mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));        //    B3 G3 R3 A3

                    __m128i BG01 = _mm_unpacklo_epi32(P0, P1);        //    B0 B1 G0 G1
                    __m128i RA01 = _mm_unpackhi_epi32(P0, P1);        //    R0 R1 A0 A1
                    __m128i BG23 = _mm_unpacklo_epi32(P2, P3);        //    B2 B3 G2 G3
                    __m128i RA23 = _mm_unpackhi_epi32(P2, P3);        //    R2 R3 A2 A3

                    __m128i B01 = _mm_unpacklo_epi8(BG01, _mm_setzero_si128());
                    __m128i B23 = _mm_unpacklo_epi8(BG23, _mm_setzero_si128());
                    __m128i SumB = _mm_hadd_epi32(_mm_madd_epi16(B01, PartX), _mm_madd_epi16(B23, PartX));

                    __m128i G01 = _mm_unpackhi_epi8(BG01, _mm_setzero_si128());
                    __m128i G23 = _mm_unpackhi_epi8(BG23, _mm_setzero_si128());
                    __m128i SumG = _mm_hadd_epi32(_mm_madd_epi16(G01, PartX), _mm_madd_epi16(G23, PartX));

                    __m128i R01 = _mm_unpacklo_epi8(RA01, _mm_setzero_si128());
                    __m128i R23 = _mm_unpacklo_epi8(RA23, _mm_setzero_si128());
                    __m128i SumR = _mm_hadd_epi32(_mm_madd_epi16(R01, PartX), _mm_madd_epi16(R23, PartX));

                    __m128i A01 = _mm_unpackhi_epi8(RA01, _mm_setzero_si128());
                    __m128i A23 = _mm_unpackhi_epi8(RA23, _mm_setzero_si128());
                    __m128i SumA = _mm_hadd_epi32(_mm_madd_epi16(A01, PartX), _mm_madd_epi16(A23, PartX));

                    //    Surprisingly, this is slightly faster than the commented-out code below
                    __m128i Result = _mm_setr_epi32(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)), _mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)));
                    Result = _mm_srai_epi32(Result, 16);
                    //    *((int *)LinePD) = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result));
                    _mm_stream_si32((int *)LinePD, _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(Result, Result), Result)));

                    //LinePD[0] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumB, PartY)) >> 16);            //    some results really do fall outside the unsigned char range, because of the fixed-point conversion
                    //LinePD[1] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumG, PartY)) >> 16);
                    //LinePD[2] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumR, PartY)) >> 16);
                    //LinePD[3] = IM_ClampToByte(_mm_hsum_epi32(_mm_mullo_epi32(SumA, PartY)) >> 16);
                }
            }
            for (int X = EndX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        for (int Y = EndY; Y < DstH; Y++, SrcY += AddY)
        {
            unsigned char *LinePD = Dest + Y * StrideD;
            for (int X = 0, SrcX = ErrorX; X < DstW; X++, SrcX += AddX, LinePD += Channel)
            {
                Bicubic_Border(Src, SrcW, SrcH, StrideS, LinePD, SinXDivX_Table, SrcX, SrcY);
            }
        }
        free(Table);
        free(SinXDivX_Table);
    }
    return IM_STATUS_OK;
}

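On the same machine and in the same test environment, enlarging a 32-bit 800*600 image to 1024*768 takes about 13 ms, roughly 2.5 times as fast as the plain C version.

In the same environment, housisong's code takes about 16 ms on the same image, so the code in this article is somewhat faster. Of course, his code was written many years ago.

Download link for the code of this article: https://files.cnblogs.com/files/Imageshop/BicubicZoom.rar (it may be deleted after 3 months, because the blog's storage space is almost full)

You can also download my complete SSE optimization demo to compare the results of the various interpolation methods: https://files.cnblogs.com/files/Imageshop/SSE_Optimization_Demo.rar

Reposted from: https://www.cnblogs.com/Imageshop/p/9069650.html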
