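[xlswriter Optimization in Practice] Using the CPU SSE2 Instruction Set to Speed Up Character Lookup

Posted by viest on 2020-07-02

Original URL: https://learnku.com/articles/46670

This article is reposted from the author's own WeChat official account: 粑粑是程式設計師

Using php-ext-xlswriter as the reference project, the test code exports an xlsx file of 500,000 rows × 20 columns. Every cell holds the same fixed string (the 26 uppercase letters), and memory optimization mode (constant memory) is enabled.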

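function getMemoryUsage()
{
    $pid = getmypid();

    // Ask ps for %mem, rss (in KB) and pid, then keep the lines that mention this PID.
    exec("ps -e -o%mem,rss,pid | grep $pid", $output);

    // Caution: ps pads its columns with spaces, so splitting on a single space
    // can leave empty fields, and index 2 may not be the RSS column.
    $outputArray = explode(' ', $output[0]);

    // Convert the value from KB to MB.
    return (doubleval($outputArray[2] ?? 0) / 1024) . 'MB';
}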

$startTime = microtime(true);

$config = ['path' => __DIR__ . '/tests'];
$excel = new \Vtiful\Kernel\Excel($config);

$chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';

$filePath = $excel->constMemory('tutorial.xlsx')
    ->header([
        'test1', 'test2', 'test3', 'test4', 'test5', 'test6', 'test7', 'test8', 'test9', 'test10',
        'test11', 'test12', 'test13', 'test14', 'test15', 'test16', 'test17', 'test18', 'test19', 'test20',
    ]);

$sheetIndex = 1;

for ($index = 0; $index < 500000; $index++) {
    $rowIndex = $index % 1000000;

    if ($index > 0 && $rowIndex === 0) {
        $sheetIndex++;
        $filePath->addSheet('sheet' . $sheetIndex);
    }

    $filePath->insertText($rowIndex + 1, 0, $chars);
    $filePath->insertText($rowIndex + 1, 1, $chars);
    $filePath->insertText($rowIndex + 1, 2, $chars);
    $filePath->insertText($rowIndex + 1, 3, $chars);
    $filePath->insertText($rowIndex + 1, 4, $chars);
    $filePath->insertText($rowIndex + 1, 5, $chars);
    $filePath->insertText($rowIndex + 1, 6, $chars);
    $filePath->insertText($rowIndex + 1, 7, $chars);
    $filePath->insertText($rowIndex + 1, 8, $chars);
    $filePath->insertText($rowIndex + 1, 9, $chars);
    $filePath->insertText($rowIndex + 1, 10, $chars);
    $filePath->insertText($rowIndex + 1, 11, $chars);
    $filePath->insertText($rowIndex + 1, 12, $chars);
    $filePath->insertText($rowIndex + 1, 13, $chars);
    $filePath->insertText($rowIndex + 1, 14, $chars);
    $filePath->insertText($rowIndex + 1, 15, $chars);
    $filePath->insertText($rowIndex + 1, 16, $chars);
    $filePath->insertText($rowIndex + 1, 17, $chars);
    $filePath->insertText($rowIndex + 1, 18, $chars);
    $filePath->insertText($rowIndex + 1, 19, $chars);

    if ($index % 100000 === 0) {
        $endTime = microtime(true);
        echo ($endTime - $startTime) . 'S, line:' . $index . ', 記憶體:' . getMemoryUsage() . PHP_EOL;
    }
}

$endTime = microtime(true);
echo ($endTime - $startTime) . 'S, line:' . $index . ', 記憶體:' . getMemoryUsage() . PHP_EOL;

$filePath->output();

$endTime = microtime(true);
echo ($endTime - $startTime) . 'S, line:' . $index . ', 記憶體:' . getMemoryUsage() . PHP_EOL;

Sample code output

0.002471923828125S, line:0, 記憶體:0MB
2.8797290325165S, line:100000, 記憶體:0MB
5.7618429660797S, line:200000, 記憶體:0MB
8.5462019443512S, line:300000, 記憶體:0MB
11.41543006897S, line:400000, 記憶體:0MB
13.46573890989S, line:500000, 記憶體:0MB
22.752922058105S, line:500000, 記憶體:0MB

Flame graph of the sample code

[Figure: flame graph of the sample run]

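The flame graph shows directly that the strpbrk function and zip compression consume too much CPU time. Zip compression is a world-class hard problem that I can do nothing about, but strpbrk is a C standard library function and, I thought, should not be this slow, so I reviewed the calling logic: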

if (strpbrk(string, "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C"
                "\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16"
                "\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F")) {
    //......
}

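In memory optimization mode, this check performs one character search and test for every cell written. There were three options:

1. Reduce the number of calls to this function: hash the string (ignoring hash collisions for now) and store the result in a Map or HashTable, so that identical strings only need to be scanned once.
2. Look for a better character search function in the standard library.
3. Hair is expendable anyway: write it by hand.

If a replacement could easily be found in the standard library, this post would not exist, so the second option ends here. As for the first option: a single xlsx worksheet can hold 1048576 * 16384 cells, so a Map or HashTable would waste a great deal of memory, even if a bitmap were used for marking. That leaves the third option: hand-written code using SSE2.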

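Quoting Wikipedia: SSE2, short for Streaming SIMD Extensions 2, is a SIMD (single instruction, multiple data) instruction set for the IA-32 architecture. It was introduced in 2001 together with Intel's first-generation Pentium 4 processor. It extends the earlier SSE instruction set and can fully replace the MMX instruction set. In 2004 Intel extended SSE2 again into the SSE3 instruction set. Compared with the 70 instructions of SSE, SSE2 added 144 new instructions. In 2003 AMD also adopted SSE2 when it released its AMD64 64-bit processors.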

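Reviewing the calling logic shows that the condition in the if statement only filters a handful of special control characters and does not need the generality of the standard library, so it can be implemented with the following code: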

unsigned char
lxw_exists_control_chars(const char *string)
{
    size_t str_len = strlen(string);

#ifdef __SSE2__
    /* If the CPU supports the SSE2 instruction set, use the SSE2 instruction set to quickly filter. */
    /* Filtering 16 characters at a time. */
    if (str_len >= 16) {
        const __m128i _char_nul = _mm_set1_epi8('\x00');
        const __m128i _char_ht = _mm_set1_epi8('\x09');
        const __m128i _char_lf = _mm_set1_epi8('\x0A');
        const __m128i _char_space = _mm_set1_epi8('\x20');

        while (str_len >= 16) {
            __m128i _tm, _eq;
            __m128i _value = _mm_loadu_si128((__m128i *)string);

            /* There are no control characters in the current string */
            _tm = _mm_max_epu8(_value, _char_space);
            _eq = _mm_cmpeq_epi8(_value, _tm);
            if (_eq[0] == -1 && _eq[1] == -1)
                goto next;

            /* There are control characters in the current string */
            /* \x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F */
            _tm = _mm_min_epu8(_value, _char_lf);
            _eq = _mm_cmpeq_epi8(_char_lf, _tm);
            if (_eq[0] == -1 && _eq[1] == -1)
                return LXW_TRUE;

            /* Continue \x09 */
            _tm = _mm_min_epu8(_value, _char_ht);
            _eq = _mm_cmpeq_epi8(_char_ht, _tm);
            if (_eq[0] && _eq[1])
                goto next;

            /* There are control character in the current string */
            /* \x01\x02\x03\x04\x05\x06\x07\x08 */
            _tm = _mm_min_epu8(_value, _char_nul);
            _eq = _mm_cmpeq_epi8(_char_nul, _tm);
            if (_eq[0] == -1 && _eq[1] == -1)
                return LXW_TRUE;

            next:

            string += 16;
            str_len -= 16;
        }
    }
#endif

    /* Filter the remaining characters. */
    /* If the SSE2 instruction set is not supported, please use the conventional way to filter. */
    /* But currently all x86 architecture CPUs on the market support the SSE2 instruction set. */
    while (str_len > 0) {
        unsigned char _string = *string;

        if (_string < '\x20' && ((_string > '\x00' && _string < '\x09') || _string > '\x0A')) {
                return LXW_TRUE;
        }

        ++string;
        --str_len;
    }

    return LXW_FALSE;
}

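If the string is 16 characters or longer, SSE2 is used for fast processing; otherwise the conventional loop handles it. The core of the code is just these few lines: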

__m128i _value = _mm_loadu_si128((__m128i *)string);

_tm = _mm_max_epu8(_value, _char_space);
_eq = _mm_cmpeq_epi8(_value, _tm);
if (_eq[0] == -1 && _eq[1] == -1)
    goto next;

_tm = _mm_min_epu8(_value, _char_lf);
_eq = _mm_cmpeq_epi8(_char_lf, _tm);
if (_eq[0] == -1 && _eq[1] == -1)
    return LXW_TRUE;

_tm = _mm_min_epu8(_value, _char_ht);
_eq = _mm_cmpeq_epi8(_char_ht, _tm);
if (_eq[0] && _eq[1])
    goto next;

_tm = _mm_min_epu8(_value, _char_nul);
_eq = _mm_cmpeq_epi8(_char_nul, _tm);
if (_eq[0] == -1 && _eq[1] == -1)
    return LXW_TRUE;

First block

__m128i _value = _mm_loadu_si128((__m128i *)string);

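Loads 16 characters at once into a 128-bit SSE register (the load is unaligned, so the string may start at any address).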

Second block

_tm = _mm_max_epu8(_value, _char_space);
_eq = _mm_cmpeq_epi8(_value, _tm);
if (_eq[0] == -1 && _eq[1] == -1)
    goto next;

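This performs a packed unsigned 8-bit maximum against 0x20 (the space character, one above the largest character we are looking for, 0x1F) and then compares the result with the 16 characters that were just loaded. A lane that is equal yields all ones, so a 64-bit half in which every lane is equal reads as -1. If both the first and the last 8 characters are equal, every byte is at least 0x20, and the 16 characters loaded contain none of the control characters we are looking for.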

res = _mm_max_epu8(a, b);
a   = 116  230  136  145  101    9  115  116   49  102  106  107  100  108  115   97
b   = 32   32   32   32   32   32   32   32   32   32   32   32   32   32   32   32
res = 116  230  136  145  101   32  115  116   49  102  106  107  100  108  115   97

The remaining three blocks work like the second one; only the range being checked differs.
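The range tests described above can also be written so that each byte is judged on its own. The sketch below is not the code from php-ext-xlswriter. It is a minimal variant that builds a per-byte "acceptable" mask and reads it with `_mm_movemask_epi8`, instead of indexing the `__m128i` as two 64-bit halves (which relies on a GCC/Clang vector extension). The function name `sketch_has_control_chars` is hypothetical, and on compilers that do not define `__SSE2__` only the scalar loop runs.

```c
#include <stddef.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper, not part of libxlsxwriter or php-ext-xlswriter.
 * Returns 1 if s contains any byte in 0x01-0x08 or 0x0B-0x1F, else 0. */
int sketch_has_control_chars(const char *s)
{
    size_t len = strlen(s);

#ifdef __SSE2__
    const __m128i space = _mm_set1_epi8(0x20);
    const __m128i tab   = _mm_set1_epi8(0x09);
    const __m128i lf    = _mm_set1_epi8(0x0A);

    while (len >= 16) {
        /* Unaligned load of the next 16 bytes. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);

        /* 0xFF in every lane whose byte is >= 0x20 (unsigned compare via max). */
        __m128i ge_space = _mm_cmpeq_epi8(_mm_max_epu8(v, space), v);

        /* 0xFF in every lane that holds a tab or a line feed. */
        __m128i allowed = _mm_or_si128(_mm_cmpeq_epi8(v, tab),
                                       _mm_cmpeq_epi8(v, lf));

        /* A lane is acceptable if it is printable/high or tab/LF. */
        __m128i ok = _mm_or_si128(ge_space, allowed);

        /* One bit per byte: 0xFFFF means all 16 bytes are acceptable. */
        if (_mm_movemask_epi8(ok) != 0xFFFF)
            return 1;

        s += 16;
        len -= 16;
    }
#endif

    /* Scalar tail (and the whole string when SSE2 is unavailable). */
    for (; len > 0; ++s, --len) {
        unsigned char c = (unsigned char)*s;
        if (c < 0x20 && c != 0x09 && c != 0x0A)
            return 1;
    }

    return 0;
}
```

With this formulation a block that mixes a tab with `\x01` is reported, and a block whose only low byte is `\n` is not, which matches the character set passed to `strpbrk` earlier.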

ASCII: 

strpbrk                  , loop: 1000, str len: 9，time:0.000122
lxw_exists_control_chars , loop: 1000, str len: 9，time:0.000020
strpbrk                  , loop: 10000, str len: 9，time:0.001174
lxw_exists_control_chars , loop: 10000, str len: 9，time:0.000201
strpbrk                  , loop: 100000, str len: 9，time:0.011563
lxw_exists_control_chars , loop: 100000, str len: 9，time:0.002018
strpbrk                  , loop: 1000, str len: 26，time:0.000296
lxw_exists_control_chars , loop: 1000, str len: 26，time:0.000059
strpbrk                  , loop: 1000, str len: 52，time:0.000564
lxw_exists_control_chars , loop: 1000, str len: 52，time:0.000057
strpbrk                  , loop: 1000, str len: 78，time:0.000854
lxw_exists_control_chars , loop: 1000, str len: 78，time:0.000081
strpbrk                  , loop: 1000000, str len: 26，time:0.246461
lxw_exists_control_chars , loop: 1000000, str len: 26，time:0.048152
strpbrk                  , loop: 1000000, str len: 52，time:0.455256
lxw_exists_control_chars , loop: 1000000, str len: 52，time:0.046717
strpbrk                  , loop: 1000000, str len: 78，time:0.721552
lxw_exists_control_chars , loop: 1000000, str len: 78，time:0.067716

NON ASCII: 

strpbrk                  , loop: 1000, str len: 162，time:0.001447
lxw_exists_control_chars , loop: 1000, str len: 162，time:0.000072
strpbrk                  , loop: 100000, str len: 162，time:0.156455
lxw_exists_control_chars , loop: 100000, str len: 162，time:0.007992

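In our particular scenario, when the string is shorter than 16 characters, performance is roughly 5 times better than the standard library's strpbrk. As the string gets longer, the gain reaches up to about 10 times when the string contains only ASCII. When the characters are not ASCII, or not all ASCII, the gain reaches up to about 20 times.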
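The benchmark program behind these numbers is not shown in the article. A rough harness along the following lines, built on the standard `clock()` function, is one way to produce comparable measurements. The names `time_calls`, `check_with_strpbrk` and `check_scalar` are hypothetical, and the timing method is an assumption rather than the author's setup.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any checker with this signature can be timed. */
typedef int (*check_fn)(const char *);

/* The same character set that is passed to strpbrk in the article. */
static const char CONTROL_SET[] =
    "\x01\x02\x03\x04\x05\x06\x07\x08\x0B\x0C"
    "\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16"
    "\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F";

/* Baseline: the standard library search. */
int check_with_strpbrk(const char *s)
{
    return strpbrk(s, CONTROL_SET) != NULL;
}

/* Candidate: a plain scalar range check (stand-in for any replacement). */
int check_scalar(const char *s)
{
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        if (c < 0x20 && c != 0x09 && c != 0x0A)
            return 1;
    }
    return 0;
}

/* Calls fn(s) `loops` times and returns the elapsed CPU time in seconds.
 * The number of positive results is written to *hits (if not NULL) so the
 * compiler cannot discard the calls. */
double time_calls(check_fn fn, const char *s, long loops, long *hits)
{
    volatile long found = 0;
    clock_t start = clock();

    for (long i = 0; i < loops; i++)
        found += fn(s);

    clock_t end = clock();

    if (hits != NULL)
        *hits = found;

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_calls(check_with_strpbrk, text, 1000000, NULL)` and then the same with the candidate function, for strings of several lengths, gives one pair of timings per row in the style of the listing above.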

Running the test again in the same environment gives the latest flame graph:

[Figure: flame graph after the optimization]

With the flame graph drawn at the same scale, the hot-spot function is no longer visible.

GitHub: github.com/viest/php-ext-excel-exp…

Gitee: gitee.com/viest/php-ext-xlswriter

PECL: pecl.php.net/package/xlswriter

xlswriter-docs.viest.me

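If this article helped you, you can also support the author's project by giving it a Star.

This work is released under a CC license; reposts must credit the author and include a link to this article.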
