本節簡單介紹了PostgreSQL中的Bloom索引,包括Bloom索引的基礎知識和結構等.

簡介
Bloom Index源於Bloom filter(布隆過濾器),布隆過濾器用於在使用少量的空間的情況下可以很快速的判定某個值是否在集合中,其缺點是存在假陽性 False Positives ,因此需要Recheck來判斷該值是否在集合中,但布隆過濾器不存在假陰性,也就是說,對於某個值如果過濾器返回不存在,那就是不存在.
在PostgreSQL中,Bloom Index用於:

Bloom indexes are very helpful when we have a table that stores huge amounts of data and a lot of columns, where we find it difficult to create a large number of indexes, especially in OLAP environments where data is loaded from several sources and maintained for reporting. You could consider testing a single bloom index to see if you can avoid implementing a huge number of individual or composite indexes that could take additional disk space without much performance gain.

結構
其結構如下圖所示:

第一個page為metadata,然後每一行都會有一個bit array(signature)和TID與其對應.

示例
建立資料表,插入資料


testdb=# drop table if exists t_bloom;
DROP TABLE
testdb=# CREATE TABLE t_bloom (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);
CREATE TABLE
testdb=# 
testdb=# INSERT INTO t_bloom 
testdb-# SELECT (random() * 1000000)::int, (random() * 1000000)::int,
testdb-# (random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int, 
testdb-# (random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999) 
testdb-# from generate_series(1,16*1024*1024) g;
INSERT 0 16777216
testdb=# 
testdb=# analyze t_bloom;
ANALYZE
testdb=# 
testdb=# select pg_size_pretty(pg_table_size('t_bloom'));
 pg_size_pretty 
----------------
 1619 MB
(1 row)

建立Btree索引


testdb=# 
testdb=# create index idx_t_bloom_btree on t_bloom using btree(id,dept,id2,id3,id4,id5,id6,id7,zipcode);
CREATE INDEX
testdb=# \di+ idx_t_bloom_btree
                              List of relations
 Schema |       Name        | Type  | Owner |  Table  |  Size  | Description 
--------+-------------------+-------+-------+---------+--------+-------------
 public | idx_t_bloom_btree | index | pg12  | t_bloom | 940 MB | 
(1 row)

執行查詢


testdb=# EXPLAIN ANALYZE select * from t_bloom where id4 = 305294 and zipcode = 13266;
                                                              QUERY PLAN                                                     
---------------------------------------------------------------------------------------------------------
 Index Scan using idx_t_bloom_btree on t_bloom  (cost=0.56..648832.73 rows=1 width=69) (actual time=2648.215..2648.215 rows=0
 loops=1)
   Index Cond: ((id4 = 305294) AND (zipcode = 13266))
 Planning Time: 3.244 ms
 Execution Time: 2659.804 ms
(4 rows)
testdb=# EXPLAIN ANALYZE select * from t_bloom where id5 = 241326 and id6 = 354198;
                                                              QUERY PLAN                                                     
---------------------------------------------------------------------------------------------------------
 Index Scan using idx_t_bloom_btree on t_bloom  (cost=0.56..648832.73 rows=1 width=69) (actual time=2365.533..2365.533 rows=0
 loops=1)
   Index Cond: ((id5 = 241326) AND (id6 = 354198))
 Planning Time: 1.918 ms
 Execution Time: 2365.629 ms
(4 rows)

建立Bloom索引


testdb=# create extension bloom;
CREATE EXTENSION
testdb=# CREATE INDEX idx_t_bloom_bloom ON t_bloom USING bloom(id, dept, id2, id3, id4, id5, id6, id7, zipcode) 
testdb-# WITH (length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4, col9=4);
CREATE INDEX
testdb=# \di+ idx_t_bloom_bloom
                              List of relations
 Schema |       Name        | Type  | Owner |  Table  |  Size  | Description 
--------+-------------------+-------+-------+---------+--------+-------------
 public | idx_t_bloom_bloom | index | pg12  | t_bloom | 225 MB | 
(1 row)

執行查詢


testdb=# EXPLAIN ANALYZE select * from t_bloom where id4 = 305294 and zipcode = 13266;
                                                              QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t_bloom  (cost=283084.16..283088.18 rows=1 width=69) (actual time=998.727..998.727 rows=0 loops=1)
   Recheck Cond: ((id4 = 305294) AND (zipcode = 13266))
   Rows Removed by Index Recheck: 12597
   Heap Blocks: exact=12235
   ->  Bitmap Index Scan on idx_t_bloom_bloom  (cost=0.00..283084.16 rows=1 width=0) (actual time=234.893..234.893 rows=12597
 loops=1)
         Index Cond: ((id4 = 305294) AND (zipcode = 13266))
 Planning Time: 31.482 ms
 Execution Time: 998.975 ms
(8 rows)
testdb=# EXPLAIN ANALYZE select * from t_bloom where id5 = 241326 and id6 = 354198;
                                                              QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t_bloom  (cost=283084.16..283088.18 rows=1 width=69) (actual time=1019.621..1019.621 rows=0 loops=1)
   Recheck Cond: ((id5 = 241326) AND (id6 = 354198))
   Rows Removed by Index Recheck: 13033
   Heap Blocks: exact=12633
   ->  Bitmap Index Scan on idx_t_bloom_bloom  (cost=0.00..283084.16 rows=1 width=0) (actual time=204.873..204.873 rows=13033
 loops=1)
         Index Cond: ((id5 = 241326) AND (id6 = 354198))
 Planning Time: 0.441 ms
 Execution Time: 1019.811 ms
(8 rows)

從執行結果來看,在查詢條件中沒有非前導列(上例中為id1)的情況下多列任意組合查詢,bloom index會優於btree index.

參考資料
Bloom Indexes in PostgreSQL
Indexes in PostgreSQL — 10 (Bloom)

PostgreSQL DBA(59) - Index(Bloom)

相關文章