PostgreSQL十億級模糊查詢最佳實踐

德哥發表於2017-04-26

標籤

PostgreSQL , 模糊查詢 , 正則查詢 , pg_trgm , bytea , gin , 函式索引


背景

前模糊(有字首的模糊),後模糊(有字尾的模糊),前後模糊(無前字尾的模糊),正則匹配都屬於文字搜尋領域常見的需求。

PostgreSQL擁有很強的文字搜尋能力,除了支援全文檢索,還支援模糊查詢、正則查詢。內建的pg_trgm外掛是一般資料庫沒有的,可能很多人沒有聽說過。同時還內建了表示式索引、GIN索引的功能。

不同的模糊查詢需求,有不同的優化方法。

對於前模糊和後模糊,PostgreSQL則與其他資料庫一樣,可以使用btree來加速。後模糊可以使用反轉函式的函式索引來加速。

對於前後模糊和正則匹配,一種方法是使用pg_trgm外掛,利用GIN索引加速模糊和正則查詢(輸入3個或3個以上字元的模糊查詢效果很好)。另一種方法是自定義GIN表示式索引的方法,適合於定製的模糊查詢。

一、前模糊與後模糊的優化

1. 前模糊(有字首的模糊)優化方法

使用b-tree可以支援前模糊的查詢。

1.1 當使用型別預設的index ops class時,僅適合於collate=”C”的查詢(當資料庫預設的lc_collate<>C時,索引和查詢都需要明確指定collate “C”)。

索引、查詢條件的collate必須一致才能使用索引。

例子

test=# create table test(id int, info text);      
CREATE TABLE      
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000      
test=# create index idx on test(info collate "C");      
CREATE INDEX      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like `abcd%` collate "C";      
                                                      QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------      
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)      
   Output: id, info      
   Index Cond: ((test.info >= `abcd`::text) AND (test.info < `abce`::text))      
   Filter: (test.info ~~ `abcd%`::text COLLATE "C")      
   Buffers: shared hit=18 read=3      
 Planning time: 0.424 ms      
 Execution time: 0.124 ms      
(7 rows)      

1.2 當資料庫預設的lc_collate<>C時,還有一種方法讓b-tree索引支援模糊查詢。使用對應型別的pattern ops,使用pattern ops將使用字元的查詢方式而非binary的搜尋方式。

文件中有如下解釋

https://www.postgresql.org/docs/9.6/static/indexes-opclass.html

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops   
support B-tree indexes on the types text, varchar, and char respectively.   
  
The difference from the default operator classes is that the values are compared strictly   
character by character rather than according to the locale-specific collation rules.   
  
This makes these operator classes suitable for use by queries involving pattern   
matching expressions (LIKE or POSIX regular expressions) when the database   
does not use the standard "C" locale.  

例子

test=# drop table test;  
DROP TABLE  
test=# create table test(id int, info text);      
CREATE TABLE  
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000  
test=# create index idx on test(info text_pattern_ops);   
CREATE INDEX  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like `abcd%` collate "zh_CN";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ `abcd`::text) AND (test.info ~<~ `abce`::text))  
   Filter: (test.info ~~ `abcd%`::text COLLATE "zh_CN")  
   Buffers: shared hit=12 read=3  
 Planning time: 0.253 ms  
 Execution time: 0.081 ms  
(7 rows)  
  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like `abcd%` collate "C";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ `abcd`::text) AND (test.info ~<~ `abce`::text))  
   Filter: (test.info ~~ `abcd%`::text COLLATE "C")  
   Buffers: shared hit=15  
 Planning time: 0.141 ms  
 Execution time: 0.072 ms  
(7 rows)  

使用型別對應的pattern ops,索引搜尋不僅支援LIKE的寫法,還支援規則表示式的寫法,如下:

test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ `^abcd`;  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ `abcd`::text) AND (test.info ~<~ `abce`::text))  
   Filter: (test.info ~ `^abcd`::text)  
   Buffers: shared hit=15  
 Planning time: 0.213 ms  
 Execution time: 0.083 ms  
(7 rows)  

2. 後模糊(有字尾的模糊)的優化方法

使用反轉函式(reverse)索引,可以支援後模糊的查詢。

2.1 當使用型別預設的index ops class時,僅適合於collate=”C”的查詢(當資料庫預設的lc_collate<>C時,索引和查詢都需要明確指定collate “C”)。

索引、查詢條件的collate必須一致才能使用索引。

例子

test=# create index idx1 on test(reverse(info) collate "C");      
CREATE INDEX      
test=# select * from test limit 1;      
 id |               info                     
----+----------------------------------      
  1 | b3275976cdd437a033d4329775a52514      
(1 row)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like `4152%` collate "C";      
                                                        QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------      
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1)      
   Output: id, info      
   Index Cond: ((reverse(test.info) >= `4152`::text) AND (reverse(test.info) < `4153`::text))      
   Filter: (reverse(test.info) ~~ `4152%`::text COLLATE "C")      
   Buffers: shared hit=18 read=3      
 Planning time: 0.128 ms      
 Execution time: 0.122 ms      
(7 rows)      
      
test=# select * from test where reverse(info) like `4152%` collate "C";      
   id   |               info                     
--------+----------------------------------      
 847904 | abe2ecd90393b5275df8e34a39702514      
 414702 | 97f66d26545329321164042657d02514      
 191232 | 7820972c6220c2b01d46c11ebb532514      
 752742 | 93232ac39c6632e2540df44627c42514      
 217302 | 39e518893a1a7b1e691619bd1fc42514      
      1 | b3275976cdd437a033d4329775a52514      
 615718 | 4948f94c484c13dc6c4fae8a3db52514      
 308815 | fc2918ceff7c7a4dafd2e04031062514      
 149521 | 546d963842ea5ca593e622c810262514      
 811093 | 4b6eca2eb6b665af67b2813e91a62514      
 209000 | 1dfd0d4e326715c1739f031cca992514      
 937616 | 8827fd81f5b673fb5afecbe3e11b2514      
 419553 | bd6e01ce360af16137e8b6abc8ab2514      
 998324 | 7dff51c19dc5e5d9979163e7d14c2514      
 771518 | 8a54e30003a48539fff0aedc73ac2514      
 691566 | f90368348e3b6bf983fcbe10db2d2514      
 652274 | 8bf4a97b5f122a5540a21fa85ead2514      
 233437 | 739ed715fc203d47e37e79b5bcbe2514      
(18 rows)      

2.2 當資料庫預設的lc_collate<>C時,還有一種方法讓b-tree索引支援模糊查詢。使用對應型別的pattern ops,使用pattern ops將使用字元的查詢方式而非binary的搜尋方式。

使用型別對應的pattern ops,索引搜尋不僅支援LIKE的寫法,還支援規則表示式的寫法。

例子

test=# create index idx1 on test(reverse(info) text_pattern_ops);      
CREATE INDEX  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like `4152%`;  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((reverse(test.info) ~>=~ `4152`::text) AND (reverse(test.info) ~<~ `4153`::text))  
   Filter: (reverse(test.info) ~~ `4152%`::text)  
   Buffers: shared hit=15  
 Planning time: 0.102 ms  
 Execution time: 0.072 ms  
(7 rows)  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ `^4152`;  
                                                        QUERY PLAN                                                          
--------------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx1 on public.test  (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((reverse(test.info) ~>=~ `4152`::text) AND (reverse(test.info) ~<~ `4153`::text))  
   Filter: (reverse(test.info) ~ `^4152`::text)  
   Buffers: shared hit=15  
 Planning time: 0.148 ms  
 Execution time: 0.087 ms  
(7 rows)  

3. 前、後模糊的合體優化方法

使用pg_trgm索引,可以支援前、後模糊的查詢。

注意:

(有字首的模糊)至少輸入1個字元,(有字尾的模糊)至少輸入2個字元,才有好的索引過濾效果。

如果要高效支援多位元組字元(例如中文),資料庫lc_ctype不能為”C”,只有TOKEN分割正確效果才是OK的。(因為lc_ctype決定了多位元組字元中什麼是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。

索引、查詢條件的collate必須一致才能使用索引。

test=# l+ test      
                                              List of databases      
 Name |  Owner   | Encoding |  Collate   |   Ctype    | Access privileges |  Size  | Tablespace | Description       
------+----------+----------+------------+------------+-------------------+--------+------------+-------------      
 test | postgres | UTF8     | zh_CN.utf8 | zh_CN.utf8 |                   | 245 MB | pg_default |       
(1 row)      
      
test=# create extension pg_trgm;      
      
test=# create table test001(c1 text);      
CREATE TABLE      

生成隨機中文字串的函式

test=# create or replace function gen_hanzi(int) returns text as $$                  
declare        
  res text;        
begin        
  if $1 >=1 then        
    select string_agg(chr(19968+(random()*20901)::int), ``) into res from generate_series(1,$1);        
    return res;        
  end if;        
  return null;        
end;        
$$ language plpgsql strict;        
CREATE FUNCTION      

生成隨機資料

test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);      
INSERT 0 100000      
      
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);      
CREATE INDEX      
      
test=# select * from test001 limit 5;      
                    c1                          
------------------------------------------      
 埳噪辦甾訥昃碇玾隉箖燋邢賀浮媊踮菵暔讅橅      
 秌橑籛鴎擬倶敤麁黿醠轇坙騉鏦纗蘛婃坹嫻儅      
 蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓      
 馳泅薬鐗愅撞竅溮滲蛁灎厀攚摐瞪拡㩵詜隝縕      
 襳鋪煃匶瀌懲荼黹樆惺篋搔羾憯墆鋃硍蔓恧顤      
(5 rows)      

模糊查詢

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `你%`;      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ `你%`::text)      
   Heap Blocks: exact=3      
   Buffers: shared hit=7      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1)      
         Index Cond: (test001.c1 ~~ `你%`::text)      
         Buffers: shared hit=4      
 Planning time: 0.119 ms      
 Execution time: 0.063 ms      
(10 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `%恧顤`;      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ `%恧顤`::text)      
   Rows Removed by Index Recheck: 1      
   Heap Blocks: exact=2      
   Buffers: shared hit=6      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1)      
         Index Cond: (test001.c1 ~~ `%恧顤`::text)      
         Buffers: shared hit=4      
 Planning time: 0.136 ms      
 Execution time: 0.062 ms      
(11 rows)      

二、前後均模糊的優化

使用pg_trgm外掛,支援前後模糊的查詢。

注意:

如果要讓pg_trgm高效支援多位元組字元(例如中文),資料庫lc_ctype不能為”C”,只有TOKEN分割正確效果才是OK的。(因為lc_ctype決定了多位元組字元中什麼是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。

建議輸入3個或3個以上字元,否則效果不佳(後面會分析原因)。

例子

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `%燋邢賀%`;      
                                                      QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ `%燋邢賀%`::text)      
   Heap Blocks: exact=1      
   Buffers: shared hit=5      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1)      
         Index Cond: (test001.c1 ~~ `%燋邢賀%`::text)      
         Buffers: shared hit=4      
 Planning time: 0.170 ms      
 Execution time: 0.076 ms      
(10 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `%燋邢%`;      
                                                              QUERY PLAN                                                                    
--------------------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~~ `%燋邢%`::text)      
   Rows Removed by Index Recheck: 99999      
   Heap Blocks: exact=1137      
   Buffers: shared hit=14429      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1)      
         Index Cond: (test001.c1 ~~ `%燋邢%`::text)      
         Buffers: shared hit=13292      
 Planning time: 0.133 ms      
 Execution time: 178.265 ms      
(11 rows)      

三、正則匹配的優化

PostgreSQL 正則匹配的語法為 字串 ~ `pattern` 或 字串 ~* `pattern`

https://www.postgresql.org/docs/9.6/static/functions-matching.html

例子

test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ `12[0-9]{3,9}`;      
                                                       QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1)      
   Output: c1      
   Recheck Cond: (test001.c1 ~ `12[0-9]{3,9}`::text)      
   Rows Removed by Index Recheck: 1      
   Heap Blocks: exact=1      
   Buffers: shared hit=50      
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1)      
         Index Cond: (test001.c1 ~ `12[0-9]{3,9}`::text)      
         Buffers: shared hit=49      
 Planning time: 0.452 ms      
 Execution time: 0.221 ms      
(11 rows)      
      
test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ `宸朾啣` collate "zh_CN";   
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test001  (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1)
   Output: c1
   Recheck Cond: (test001.c1 ~ `宸朾啣`::text COLLATE "zh_CN")
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1)
         Index Cond: (test001.c1 ~ `宸朾啣`::text COLLATE "zh_CN")
         Buffers: shared hit=4
 Planning time: 0.238 ms
 Execution time: 0.082 ms
(10 rows)

正則匹配索引原理參考contrib/pg_trgm/trgm_regexp.c

pg_trgm模糊查詢的原理

首先,pg_trgm將字串的前端新增2個空格,末尾新增1個空格。

然後,每連續的3個字元為一個TOKEN,拆開。

最後,對TOKEN建立GIN倒排索引。

檢視字串的TOKEN,可以使用如下方法。

test=# select show_trgm(`123`);      
        show_trgm              
-------------------------      
 {"  1"," 12",123,"23 "}      
(1 row)      

pg_trgm前後模糊字元個數要求的原因

使用pg_trgm時,如果要獲得最好的效果,最好滿足這些條件。

1. 有字首的模糊查詢,例如a%,至少需要提供1個字元。( 搜尋的是token=` a` )

2. 有字尾的模糊查詢,例如%ab,至少需要提供2個字元。( 搜尋的是token=`ab ` )

3. 前後模糊查詢,例如%abcd%,至少需要提供3個字元。( 這個使用陣列搜尋,搜尋的是token(s) 包含 {” a”,” ab”,abc,bcd,”cd “} )

原因是什麼呢?

因為pg_trgm生成的TOKEN是三個字元,只有在以上三個條件下,才能匹配到對應的TOKEN。

test=# select show_trgm(`123`);      
        show_trgm              
-------------------------      
 {"  1"," 12",123,"23 "}      
(1 row)      

四、小於3個輸入字元的模糊查詢的優化

當需要前後模糊搜尋1個或者2個字元時,pg_trgm無法滿足需求,但是我們可以使用表示式GIN索引。

使用表示式,將字串拆成1個單字,兩個連續的字元的陣列,對陣列建立GIN索引即可。

例子

test=# create or replace function split001(text) returns text[] as $$      
declare      
  res text[];      
begin      
  select regexp_split_to_array($1,``) into res;      
  for i in 1..length($1)-1 loop      
    res := array_append(res, substring($1,i,2));      
  end loop;      
  return res;      
end;      
$$ language plpgsql strict immutable;      
CREATE FUNCTION      
      
test=# create index idx_test001_2 on test001 using gin (split001(c1));      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array[`你好`];      
                                                       QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=8.87..550.12 rows=500 width=61) (actual time=0.041..0.041 rows=0 loops=1)      
   Output: c1      
   Recheck Cond: (split001(test001.c1) @> `{你好}`::text[])      
   Buffers: shared hit=4      
   ->  Bitmap Index Scan on idx_test001_2  (cost=0.00..8.75 rows=500 width=0) (actual time=0.039..0.039 rows=0 loops=1)      
         Index Cond: (split001(test001.c1) @> `{你好}`::text[])      
         Buffers: shared hit=4      
 Planning time: 0.104 ms      
 Execution time: 0.068 ms      
(9 rows)      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where split001(c1) @> array[`你`];      
                                                       QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------      
 Bitmap Heap Scan on public.test001  (cost=8.87..550.12 rows=500 width=61) (actual time=0.063..0.183 rows=86 loops=1)      
   Output: c1      
   Recheck Cond: (split001(test001.c1) @> `{你}`::text[])      
   Heap Blocks: exact=80      
   Buffers: shared hit=84      
   ->  Bitmap Index Scan on idx_test001_2  (cost=0.00..8.75 rows=500 width=0) (actual time=0.048..0.048 rows=86 loops=1)      
         Index Cond: (split001(test001.c1) @> `{你}`::text[])      
         Buffers: shared hit=4      
 Planning time: 0.101 ms      
 Execution time: 0.217 ms      
(10 rows)      
      
test=# select * from test001 where split001(c1) @> array[`你`];      
                    c1                          
------------------------------------------      
 殐踨洪冨垓丩賢閚偉垢胸鍘崩你萭隡劭芛雫袰      
 靅慨熱臉罆淓寘鰻總襎戍謸棖陪丼倫柆套你仮      
......      

五、相似查詢優化

模糊查詢和正則匹配都是找出完全符合條件的記錄,還有一種需求是相似查詢。

例如postgresql字串,輸入 p0stgresgl 也能根據相似度匹配到。

這裡同樣用到了pg_trgm外掛,如果要支援中文,同樣有這樣的要求:

如果需要讓pg_trgm支援中文相似查詢,資料庫lc_ctype不能為”C”,只有TOKEN分割正確效果才是OK的。(因為lc_ctype決定了多位元組字元中什麼是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。

建議輸入3個或3個以上字元,否則效果不佳(後面會分析原因)。

例子

test=# create index idx_test001_3 on test001 using gist (c1 gist_trgm_ops);      
CREATE INDEX      
        
test=# explain (analyze,verbose,timing,costs,buffers) SELECT t, c1 <-> `癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡` AS dist        
  FROM test001 t        
  ORDER BY dist LIMIT 5;        
                                                                   QUERY PLAN                                                                          
-------------------------------------------------------------------------------------------------------------------------------------------------      
 Limit  (cost=0.28..0.52 rows=5 width=89) (actual time=37.462..37.639 rows=5 loops=1)      
   Output: t.*, ((c1 <-> `癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡`::text))      
   Buffers: shared hit=1631      
   ->  Index Scan using idx_test001_3 on public.test001 t  (cost=0.28..4763.28 rows=100000 width=89) (actual time=37.461..37.636 rows=5 loops=1)      
         Output: t.*, (c1 <-> `癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡`::text)      
         Order By: (t.c1 <-> `癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡`::text)      
         Buffers: shared hit=1631      
 Planning time: 0.089 ms      
 Execution time: 37.668 ms      
(9 rows)      
      
test=# SELECT t, c1 <-> `癷磛鶿蠌鰓蠲123鶡埀婎鯿苿奶垨惸溴蔻筴熝憡` AS dist        
  FROM test001 t        
  ORDER BY dist LIMIT 5;        
                     t                      |   dist         
--------------------------------------------+----------      
 (癷磛鶿蠌鰓蠲你鶡埀婎鯿苿奶垨惸溴蔻筴熝憡) | 0.307692      
 (坆桻悁斾耾瑚豌腏炁悿隖軻盃掗稐睟礓蜮鉛湆) | 0.976744      
 (癷鉜餯祂鼃恫蝅瓟顡廕梍蛸歡僷贊敔欓侑韌鐹) | 0.976744      
 (癷嚯鳬戚蹪熼胘檙佌欔韜挹樷覄惶蹝頊鑜鞖媗) | 0.976744      
 (癷饎瞲餿堒歃峽盾豼擔禞嵪豦咢脈餛竨濟隘緘) | 0.976744      
(5 rows)      

六、小結

1. 如果只有前模糊查詢需求(字串 like `xx%`),使用collate “C”的b-tree索引;當collate不為”C”時,可以使用型別對應的pattern ops(例如text_pattern_ops)建立b-tree索引。

2. 如果只有後模糊的查詢需求(字串 like `%xx` 等價於 reverse(字串) like `xx%`),使用collate “C”的reverse()表示式的b-tree索引;當collate不為”C”時,可以使用型別對應的pattern ops(例如text_pattern_ops)建立b-tree索引。

3. 如果有前後模糊查詢需求,並且包含中文,請使用lc_ctype <> “C”的資料庫,同時使用pg_trgm外掛的gin索引。(只有TOKEN分割正確效果才是OK的。(因為lc_ctype決定了多位元組字元中什麼是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。)

4. 如果有前後模糊查詢需求,並且不包含中文,請使用pg_trgm外掛的gin索引。

5. 如果有正規表示式查詢需求,請使用pg_trgm外掛的gin索引。

6. 如果有輸入條件少於3個字元的模糊查詢需求,可以使用GIN表示式索引,通過陣列包含的方式進行搜尋,效能一樣非常好。

七、效能

1億條記錄,每條記錄15個隨機中文。測試前後模糊查詢效能。

1. 生成測試資料

vi test.sql      
insert into test001 select gen_hanzi(15) from generate_series(1,2500000);      
      
pgbench -n -r -P 1 -f ./test.sql -c 40 -j 40 -t 1 test      
    
    
    
test=# select count(*) from test001;      
   count         
-----------      
 100000000      
(1 row)      
test=# select * from test001 limit 10;    
               c1                   
--------------------------------    
 釾笉皜鰈確艄騚馺腃彊釲忰採汦擇    
 槮搮圮墔婂蹾飄孡鶒鎮貲聵線麯櫕    
 孨鄈韞萅赫炧暤蟠檼駧餪崉媧譌筯    
 烸喖醝稦怩鷟棾奜妛曫仾飛饡繪韋    
 撐豁襉峊炠眏罱襄彊鰮莆壏妒辷闤    
 蜁愊鶱磹貳帵眲嚉榑蒼潵簷簄椰魨    
 瑄翁蠃巨躋壾蛸湗鑂顂櫟砣八癱栵    
 餷巍笿鞽裝棊嘢恓煓熴錩鋈蹃煿屓    
 訆韄踔牤嘇糺絢軿鵑燿螛梋鰢謇郼    
 撲蓨傷釱糕觩嬖蓷鰼繩圓醷熌靉掑    
(10 rows)    

2. 建立索引

test=# set maintenance_work_mem =`32GB`;      
test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);      

表和索引大小

test=# di+  
                             List of relations  
 Schema |     Name      | Type  |  Owner   |  Table  | Size  | Description   
--------+---------------+-------+----------+---------+-------+-------------  
 public | idx_test001_1 | index | postgres | test001 | 12 GB |   
(1 row)  
  
test=# dt+  
                      List of relations  
 Schema |  Name   | Type  |  Owner   |  Size   | Description   
--------+---------+-------+----------+---------+-------------  
 public | test001 | table | postgres | 7303 MB |   
(1 row)  

3. 模糊查詢效能測試

3.1 前模糊

響應時間:9毫秒

返回4701行

select * from test001 where c1 like `你%`;    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `你%`;    
                                                          QUERY PLAN                                                              
------------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=1.546..8.868 rows=4701 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ `你%`::text)    
   Rows Removed by Index Recheck: 85    
   Heap Blocks: exact=4776    
   Buffers: shared hit=4784    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.879..0.879 rows=4786 loops=1)    
         Index Cond: (test001.c1 ~~ `你%`::text)    
         Buffers: shared hit=8    
 Planning time: 0.099 ms    
 Execution time: 9.166 ms    
(11 rows)    

3.2 後模糊

響應時間:0.25毫秒

返回2行

select * from test001 where c1 like `%靉掑`;    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `%靉掑`;    
                                                         QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.049..0.223 rows=2 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ `%靉掑`::text)    
   Rows Removed by Index Recheck: 87    
   Heap Blocks: exact=89    
   Buffers: shared hit=94    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.031..0.031 rows=89 loops=1)    
         Index Cond: (test001.c1 ~~ `%靉掑`::text)    
         Buffers: shared hit=5    
 Planning time: 0.113 ms    
 Execution time: 0.249 ms    
(11 rows)    

3.3 前後模糊

響應時間:0.2毫秒

返回1行

select * from test001 where c1 like `%螛梋鰢%`;    
    
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like `%螛梋鰢%`;    
                                                         QUERY PLAN                                                             
----------------------------------------------------------------------------------------------------------------------------    
 Bitmap Heap Scan on public.test001  (cost=89.50..10161.50 rows=10000 width=46) (actual time=0.044..0.175 rows=1 loops=1)    
   Output: c1    
   Recheck Cond: (test001.c1 ~~ `%螛梋鰢%`::text)    
   Rows Removed by Index Recheck: 81    
   Heap Blocks: exact=82    
   Buffers: shared hit=87    
   ->  Bitmap Index Scan on idx_test001_1  (cost=0.00..87.00 rows=10000 width=0) (actual time=0.027..0.027 rows=82 loops=1)    
         Index Cond: (test001.c1 ~~ `%螛梋鰢%`::text)    
         Buffers: shared hit=5    
 Planning time: 0.112 ms    
 Execution time: 0.201 ms    
(11 rows)    


相關文章