前言
一直以來,對於搜尋時模糊匹配的優化一直是個讓人頭疼的問題,好在強大pgsql提供了優化方案,下面就來簡單談一談如何通過索引來優化模糊匹配
案例
我們有一張千萬級資料的檢查報告表,需要通過檢查報告來模糊搜尋某個條件,我們先建立如下索引:
CREATE INDEX lab_report_report_name_index ON lab_report USING btree (report_name);
然後搜個簡單的模糊匹配條件如 LIKE "血常規%",可以發現查詢計劃生成如下,索引並沒有被使用上,這是因為傳統的btree索引並不支援模糊匹配
查閱文件後發現,pgsql可以在Btree索引上指定操作符:text_pattern_ops、varchar_pattern_ops和 bpchar_pattern_ops,它們分別對應欄位型別text、varchar和 char,官方解釋為“它們與預設操作符類的區別是值的比較是嚴格按照字元進行而不是根據區域相關的排序規則。這使得這些操作符類適合於當一個資料庫沒有使用標準“C”區域時被使用在涉及模式匹配表示式(LIKE或POSIX正規表示式)的查詢中。”, 有些抽象,我們先試試看。建立如下索引並查詢剛才的條件 LIKE"血常規%":(參考pgsql的文件 https://www.postgresql.org/docs/10/indexes-opclass.html)
CREATE INDEX lab_report_report_name_index ON lab.lab_report (report_name varchar_pattern_ops);
發現確實可以走索引掃描 ,執行時間也從213ms優化125ms,但是,如果搜尋LIKE "%血常規%"就又會走全表掃描了! 這裡我們引入本篇部落格的主角"pg_trgm"和"pg_bigm"。
建立這兩個索引前分別需要引入如下兩個擴充套件包 :
CREATE EXTENSION pg_trgm; CREATE EXTENSION pg_bigm;
這兩個索引的區別是:“pg_tigm”為pgsql官方提供的索引,"pg_tigm"為日本開發者提供。下面是詳細的對比:(參考pg_bigm的文件 http://pgbigm.osdn.jp/pg_bigm_en-1-2.html)
Comparison with pg_trgm
The pg_trgm contrib module which provides full text search capability using 3-gram (trigram) model is included in PostgreSQL. The pg_bigm was developed based on the pg_trgm. They have the following differences:
Functionalities and Features | pg_trgm | pg_bigm |
---|---|---|
Phrase matching method for full text search | 3-gram | 2-gram |
Available index | GIN and GiST | GIN only |
Available text search operators | LIKE (~~), ILIKE (~~*), ~, ~* | LIKE only |
Full text search for non-alphabetic language (e.g., Japanese) |
Not supported (*1) | Supported |
Full text search with 1-2 characters keyword | Slow (*2) | Fast |
Similarity search | Supported | Supported (version 1.1 or later) |
Maximum indexed column size | 238,609,291 Bytes (~228MB) | 107,374,180 Bytes (~102MB) |
- (*1) You can use full text search for non-alphabetic language by commenting out KEEPONLYALNUM macro variable in contrib/pg_trgm/pg_trgm.h and rebuilding pg_trgm module. But pg_bigm provides faster non-alphabetic search than such a modified pg_trgm.
- (*2) Because, in this search, only sequential scan or index full scan (not normal index scan) can run.
pg_bigm 1.1 or later can coexist with pg_trgm in the same database, but pg_bigm 1.0 cannot.
如無特殊要求推薦使用"pg_bigm",我們測試一下效果:
CREATE INDEX lab_report_report_name_index ON lab_report USING gin (report_name public.gin_bigm_ops);
可以使用點陣圖索引掃描,對於本次案例,使用pg_trgm效果同pg_bigm。
以上
本文只是簡單的介紹許多細節並未做深入的分析,歡迎留言指教或者討論