PostgreSQLUDF實現tsvector(全文檢索),array(陣列)多值欄位與scalar(單值欄位)型別的整合索引(類分割槽索引)-單值與多值型別複合查詢效能提速100倍+案例(含，單值+多值列合成)

德哥發表於2018-02-28

背景

在應用程式中，特別是使用PostgreSQL多值列（陣列、全文檢索、JSON）型別的業務中，除了多值列有查詢需求，單值列也有查詢需求。

那麼當兩種查詢並行存在時，資料庫會根據COST選擇單列或多列組合索引。但是效能並不能達到極致，只是簡化使用者的索引設計而已。

例如

create table tbl(gid int, c1 int[]);  
  
  
insert into tbl select random()*99, gen_randarr(999, 10) from generate_series(1,10000000);

gid有100個值，c1有10個值（取值範圍1000以內）。使用者可能按gid查詢，也可能按c1查詢，還可能按兩個欄位組合查詢。

當按兩個欄位組合查詢時，現有達到方法並不高效。包括btree_gin在內。

如何提速？

demo

1、建立一個產生隨機數的函式

CREATE OR REPLACE FUNCTION public.gen_randarr(integer, integer)  
 RETURNS integer[]  
 LANGUAGE sql  
 STRICT  
AS $function$  
  select array(select (random()*$1)::int from generate_series(1,$2));  
$function$;

2、建立測試表，包含一個單值列，和一個多值列

create table tbl(gid int, c1 int[]);

3、寫入1000萬資料

insert into tbl select random()*99, gen_randarr(999, 10) from generate_series(1,10000000);

傳統加速方法1

建立單值列+多值列的複合索引

create extension btree_gin;

set maintenance_work_mem =`8GB`;  
create index idx_tbl_1 on tbl using gin (gid, c1);

複合查詢效能如何？

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
                                                     QUERY PLAN                                                        
---------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=96.00..97.02 rows=1 width=65) (actual time=12.810..12.810 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 @> `{1,2,3}`::integer[]))  
   Buffers: shared hit=184  
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..96.00 rows=1 width=0) (actual time=12.807..12.807 rows=0 loops=1)  
         Index Cond: ((tbl.gid = 1) AND (tbl.c1 @> `{1,2,3}`::integer[]))  
         Buffers: shared hit=184  
 Planning time: 0.154 ms  
 Execution time: 12.838 ms  
(9 rows)

postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
                                                         QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=129.80..3433.25 rows=3297 width=65) (actual time=17.453..22.486 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 && `{1,2,3}`::integer[]))  
   Heap Blocks: exact=2906  
   Buffers: shared hit=3089  
   ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..128.97 rows=3297 width=0) (actual time=17.121..17.121 rows=2932 loops=1)  
         Index Cond: ((tbl.gid = 1) AND (tbl.c1 && `{1,2,3}`::integer[]))  
         Buffers: shared hit=183  
 Planning time: 0.223 ms  
 Execution time: 22.761 ms  
(10 rows)

如果你沒有繼續優化的動力，你會發現這個效能好像也蠻OK的。索引條件也用了多個。

實際上這個索引是內部將兩個索引合併後做的BITMAP SCAN。

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

傳統加速方法2

實際上就是將單值，多值列索引分開建。

postgres=# drop index idx_tbl_1;  
DROP INDEX  
  
postgres=# create index idx_tbl_1 on tbl (gid);  
CREATE INDEX  
  
postgres=# create index idx_tbl_2 on tbl using gin (c1);

實際效果比GIN的複合索引差一些

explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
  
explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where gid=1 and c1 @> array[1,2,3];  
                                                      QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=72.09..83.25 rows=1 width=65) (actual time=12.848..12.848 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: (tbl.c1 @> `{1,2,3}`::integer[])  
   Filter: (tbl.gid = 1)  
   Rows Removed by Filter: 13  
   Heap Blocks: exact=13  
   Buffers: shared hit=131  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..72.09 rows=11 width=0) (actual time=12.810..12.810 rows=13 loops=1)  
         Index Cond: (tbl.c1 @> `{1,2,3}`::integer[])  
         Buffers: shared hit=118  
 Planning time: 0.254 ms  
 Execution time: 12.874 ms  
(12 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gid=1 and c1 && array[1,2,3];  
                                                              QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=3534.41..6837.86 rows=3297 width=65) (actual time=69.636..74.613 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: ((tbl.gid = 1) AND (tbl.c1 && `{1,2,3}`::integer[]))  
   Heap Blocks: exact=2906  
   Buffers: shared hit=2982 read=279  
   ->  BitmapAnd  (cost=3534.41..3534.41 rows=3297 width=0) (actual time=69.002..69.002 rows=0 loops=1)  
         Buffers: shared hit=76 read=279  
         ->  Bitmap Index Scan on idx_tbl_1  (cost=0.00..1089.93 rows=106333 width=0) (actual time=13.538..13.538 rows=100704 loops=1)  
               Index Cond: (tbl.gid = 1)  
               Buffers: shared read=279  
         ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..2442.58 rows=310077 width=0) (actual time=50.878..50.878 rows=296887 loops=1)  
               Index Cond: (tbl.c1 && `{1,2,3}`::integer[])  
               Buffers: shared hit=76  
 Planning time: 0.147 ms  
 Execution time: 74.886 ms  
(15 rows)

本文100倍效能提升加速方法

本文提到的方法，適合於單值列與多值列混合查詢的場景，實際上我們用到了PostgreSQL的UDF和表示式索引的功能。

UDF將單值列和多值列合併，變成一個新的多值列。

表示式索引，針對這個UDF來構建。

目的是將內部的BITMAP合併這個部分化解掉，僅使用一顆倒排樹，並且這顆倒排樹包含了單值列和多值列的值。

1、建立一個UDF，將本例的gid與c1的值合併。

create or replace function gen_newarr(int, anyarray) returns text[] as $$  
declare  
  res text[] := `{}`;  
  x int;  
begin  
  foreach x in array $2 loop  
    res := array_append(res, $1||`_`||x);  
  end loop;  
  return res;  
end;  
$$ language plpgsql strict immutable;  
  
  
  
  
postgres=# select gen_newarr(123,array[1,2,3,4]);  
-[ RECORD 1 ]-------------------------  
gen_newarr | {123_1,123_2,123_3,123_4}

2、建立表示式索引

set maintenance_work_mem =`8GB`;  
create index idx_tbl_2 on tbl using gin (gen_newarr(gid, c1));

3、查詢時，使用表示式查詢，所以需要更改SQL語句

postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gen_newarr(gid, c1) @> array[`1_1`,`1_2`,`1_3`];  
                                                    QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=12.01..13.27 rows=1 width=65) (actual time=0.146..0.146 rows=0 loops=1)  
   Output: gid, c1  
   Recheck Cond: (gen_newarr(tbl.gid, tbl.c1) @> `{1_1,1_2,1_3}`::text[])  
   Buffers: shared hit=14  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..12.01 rows=1 width=0) (actual time=0.144..0.144 rows=0 loops=1)  
         Index Cond: (gen_newarr(tbl.gid, tbl.c1) @> `{1_1,1_2,1_3}`::text[])  
         Buffers: shared hit=14  
 Planning time: 0.092 ms  
 Execution time: 0.174 ms  
(9 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl where gen_newarr(gid, c1) && array[`1_1`,`1_2`,`1_3`];  
                                                         QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl  (cost=1220.70..133422.08 rows=149251 width=65) (actual time=1.020..6.034 rows=2932 loops=1)  
   Output: gid, c1  
   Recheck Cond: (gen_newarr(tbl.gid, tbl.c1) && `{1_1,1_2,1_3}`::text[])  
   Heap Blocks: exact=2906  
   Buffers: shared hit=2919  
   ->  Bitmap Index Scan on idx_tbl_2  (cost=0.00..1183.38 rows=149251 width=0) (actual time=0.640..0.640 rows=2932 loops=1)  
         Index Cond: (gen_newarr(tbl.gid, tbl.c1) && `{1_1,1_2,1_3}`::text[])  
         Buffers: shared hit=13  
 Planning time: 0.102 ms  
 Execution time: 6.348 ms  
(10 rows)

查詢如下

select * from tbl where gen_newarr(gid, c1) && array[`1_1`,`1_2`,`1_3`];  
  
 gid |                   c1                      
-----+-----------------------------------------  
   1 | {62,904,204,618,917,227,388,352,167,1}  
   1 | {825,126,174,409,340,285,231,942,3,136}  
   1 | {222,418,799,881,728,582,558,2,368,196}  
   1 | {847,197,690,1,288,468,179,521,799,196}  
   1 | {867,316,447,747,953,998,370,360,558,3}  
   1 | {249,963,669,929,534,945,388,816,1,601}  
   1 | {925,609,108,981,712,681,906,832,3,275}  
   1 | {3,354,253,947,588,598,401,89,246,968}  
   1 | {323,121,22,3,7,714,80,619,178,439}  
   1 | {866,1,185,704,932,882,496,324,264,882}  
......

效能顯著提升。

單值列+全文檢索複合查詢加速

原理類似，將單值列與全文檢索列合併

create table tbl123(gid int, ts tsvector);  
  
insert into tbl123 select random()*99, array_to_tsvector(gen_randarr(999, 10)::text[]) from generate_series(1,10000000);

create index idx_tbl123_1 on tbl123 using gin ( array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) );

explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery `1_1 & 1_2 & 1_3`;  
  
  
explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery `1_1 | 1_2 | 1_3`;  
  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery `1_1 & 1_2 & 1_3`;  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl123  (cost=12.01..13.28 rows=1 width=77) (actual time=0.207..0.207 rows=0 loops=1)  
   Output: gid, ts  
   Recheck Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ ```1_1`` & ``1_2`` & ``1_3```::tsquery)  
   Buffers: shared hit=14  
   ->  Bitmap Index Scan on idx_tbl123_1  (cost=0.00..12.01 rows=1 width=0) (actual time=0.204..0.204 rows=0 loops=1)  
         Index Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ ```1_1`` & ``1_2`` & ``1_3```::tsquery)  
         Buffers: shared hit=14  
 Planning time: 0.080 ms  
 Execution time: 0.238 ms  
(9 rows)  
  
postgres=# explain (analyze,verbose,timing,costs,buffers)  select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery `1_1 | 1_2 | 1_3`;  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Bitmap Heap Scan on public.tbl123  (cost=1220.70..136709.34 rows=149251 width=77) (actual time=0.971..5.988 rows=2970 loops=1)  
   Output: gid, ts  
   Recheck Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ ```1_1`` | ``1_2`` | ``1_3```::tsquery)  
   Heap Blocks: exact=2937  
   Buffers: shared hit=2950  
   ->  Bitmap Index Scan on idx_tbl123_1  (cost=0.00..1183.38 rows=149251 width=0) (actual time=0.612..0.612 rows=2970 loops=1)  
         Index Cond: (array_to_tsvector(gen_newarr(tbl123.gid, tsvector_to_array(tbl123.ts))) @@ ```1_1`` | ``1_2`` | ``1_3```::tsquery)  
         Buffers: shared hit=13  
 Planning time: 0.029 ms  
 Execution time: 6.284 ms  
(10 rows)  
  
postgres=# select * from tbl123 where array_to_tsvector(gen_newarr(gid, tsvector_to_array(ts))) @@ tsquery `1_1 | 1_2 | 1_3`;  
 gid |                            ts                               
-----+-----------------------------------------------------------  
   1 | `180` `219` `253` `262` `282` `3` `633` `657` `807` `809`  
   1 | `1` `166` `261` `670` `807` `860` `897` `922` `93` `964`  
   1 | `1` `174` `211` `319` `322` `532` `84` `849` `869` `993`  
......

效能提升也非常明顯。

小結

1、分割槽表，每個分割槽定義對應的索引。但是當單值型別本身包含的值非常多時，分割槽就要很多很多，可能也不是很好。

2、分割槽索引，目前PG還不支援對單個表建立多顆樹的複合索引（一顆樹以單值列構建，VALUE指向另一個顆樹。另一顆數以多值列構建的GIN倒排樹。）

3、也就是本例提到的方法，使用UDF，將單值列與多值列合併，類似變異，變成另一個多值列，在另一個多值列中包含這個單值列的屬性，從而達到分割槽表或分割槽索引同樣的效果。

效能提升非常明顯。

當多值列本身就攜帶單值列屬性時，我們就沒有必要建立單值列+多值列的複合索引，僅僅構建多值列索引即可。

當多值列中沒有攜帶單值列屬性時，那麼我們如果有單值列+多值列的複合查詢條件，可以採用UDF，將單值列+多值列合併成一個新的多值列，對多值列構建GIN倒排索引，提速非常明顯。本例提速超過100倍

參考

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

聊聊mysql的單列多值儲存
2022-04-10
MySql
SQL查詢列（欄位）重複值及操作--整理
2017-08-17
SQL
前臺呼叫下拉選單框值（列舉欄位）
2019-05-11
mysqldump 欄位值帶單引號
2018-03-22
MySql
用Elasticsearch做大規模資料的多欄位、多型別索引檢索
2019-04-16
Elasticsearch多型型別索引
Js實現Object按照值的某個欄位（數值型別）的大小進行排序
2020-11-18
JSObject型別排序
mysql 字串型別的數值欄位按照數值的大小進行排序
2018-12-19
MySql字串型別排序
查詢oracle欄位預設值
2024-08-21
Oracle
查詢某個欄位的不同值
2007-07-30
如何實現簡單的位陣列(bit array)
2013-05-27
陣列
MySQL單列索引和組合索引的區別
2017-05-27
MySql索引
MySQL null值欄位是否使用索引的總結
2018-11-30
MySqlNull索引
值型別與引用型別
2015-09-13
型別
oracle複合索引介紹(多欄位索引)
2019-05-10
Oracle索引
sqlserver查詢一個庫所有表的欄位名及欄位型別
2016-09-12
SQLServer型別
索引與null（一）：單列索引
2019-03-04
索引Null
Laravel 對於 Mysql 欄位string型別查詢，當使用數字對這個欄位進行查詢，PHP弱型別語言導致索引失效
2021-07-22
LaravelMySql型別PHP索引
MySQL簡單實現多欄位模糊查詢
2019-03-15
MySql
值型別與引用型別的區別
2020-12-13
型別
欄位型別檢測指令碼
2011-02-05
型別指令碼
Sqlserver的欄位datetime型別預設值設為getdate()時，設值毫秒為000
2024-07-11
SQLServer型別
保留兩位小數：資料庫欄位型別NUMBER，Java欄位型別Double型別
2017-03-20
資料庫型別Java
OCM實驗-建立含特殊欄位型別的表
2014-01-28
型別
oracle的欄位型別
2011-04-06
Oracle型別
GORM 如何讀取別名欄位的值
2021-05-17
GoORM
MySQL中TEXT與BLOB欄位型別的區別
2021-09-09
MySql型別
MySQL單列索引和組合索引的區別介紹
2015-12-10
MySql索引
SQLT 最佳化SQL 用複合索引代替單列索引的案例
2014-06-06
SQL索引
C#的型別——值型別與引用型別
2009-04-13
C#型別
Elasticsearch 單字串多欄位查詢
2021-03-15
Elasticsearch字串
列舉型別與位運算
2015-04-20
型別
ORACLE MYSQL中join 欄位型別不同索引失效的情況
2016-12-29
OracleMySql型別索引
JavaScript - 基本型別與引用型別值
2019-12-17
JavaScript型別
含LONG型別欄位的表無法MOVE
2012-07-06
型別
filter的pk進行多值查詢操作
2021-04-27
Filter
Field的getModifiers()方法返回int型別值表示該欄位的修飾符
2015-01-28
型別
MongoDB更改欄位型別
2019-01-25
MongoDB型別
MySQL-修改欄位型別、設定預設值，以及新增註釋
2017-12-11
MySql型別

PostgreSQLUDF實現tsvector(全文檢索),array(陣列)多值欄位與scalar(單值欄位)型別的整合索引(類分割槽索引)-單值與多值型別複合查詢效能提速100倍+案例(含，單值+多值列合成)

標籤

背景

demo

傳統加速方法1

傳統加速方法2

本文100倍效能提升加速方法

單值列+全文檢索複合查詢加速

小結

參考

相關文章

PostgreSQLUDF實現tsvector(全文檢索),array(陣列)多值欄位與scalar(單值欄位)型別的整合索引(類分割槽索引)-單值與多值型別複合查詢效能提速100倍+案例(含，單值+多值列合成)

標籤

背景

demo

傳統加速方法1

傳統加速方法2

本文100倍效能提升加速方法

單值列+全文檢索 複合查詢加速

小結

參考

相關文章

單值列+全文檢索複合查詢加速