PostgreSQL Real-Time, High-Performance Search: Full-Text, Fuzzy, Regex, Similarity, and Ad Hoc Queries

By digoal (德哥), published 2017-12-05

Tags

PostgreSQL , search engine , GIN , ranking , highlight , full-text search , fuzzy query , regex query , similarity query , ad hoc query


Background

String search is an extremely common business requirement. It includes:

1. Prefix fuzzy search (can use a b-tree index):

select * from tbl where col like 'ab%';  
  
or  
  
select * from tbl where col ~ '^ab';  
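In a non-C locale, a plain b-tree index cannot serve left-anchored LIKE patterns; the text_pattern_ops operator class handles that. A minimal sketch, assuming a hypothetical table tbl(col text):

```sql
-- b-tree with text_pattern_ops supports left-anchored LIKE / regex
-- regardless of collation (table/column names are assumed)
create index idx_tbl_col_prefix on tbl (col text_pattern_ops);

select * from tbl where col like 'ab%';  -- can use idx_tbl_col_prefix
```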

2. Suffix fuzzy search (can use a b-tree index on the reverse(col) expression):

select * from tbl where col like '%ab';  
  
or  
  
select * from tbl where col ~ 'ab$';  
  
rewritten as  
  
select * from tbl where reverse(col) like 'ba%';  
  
or  
  
select * from tbl where reverse(col) ~ '^ba';  
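The rewrite relies on an expression index over the reversed string; a sketch with the same assumed table and column names:

```sql
-- expression b-tree over reverse(col); the suffix pattern becomes a prefix pattern
create index idx_tbl_col_rev on tbl (reverse(col) text_pattern_ops);

-- reverse('%ab') evaluates to 'ba%', so the rewrite can even be mechanical
select * from tbl where reverse(col) like reverse('%ab');
```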

3. Infix fuzzy search, with wildcards on both sides (can use pg_trgm with a GIN index):

https://www.postgresql.org/docs/10/static/pgtrgm.html

select * from tbl where col like '%ab%';  
  
or  
  
select * from tbl where col ~ 'ab';  
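A trigram GIN index is what makes this class of query indexable; a minimal sketch, with assumed names:

```sql
-- pg_trgm's gin_trgm_ops accelerates LIKE '%ab%' and regex matches
create extension if not exists pg_trgm;
create index idx_tbl_col_trgm on tbl using gin (col gin_trgm_ops);
```

Trigram matching works best when the search string is at least three characters long; shorter patterns degrade toward a recheck-heavy scan.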

4. Full-text search (can use the tsvector type with a GIN or RUM index):

select * from tbl where tsvector_col @@ 'postgres & china | digoal:A' order by ts_rank(tsvector_col, 'postgres & china | digoal:A') limit xx;  
  
The detailed syntax is covered below.  
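A minimal end-to-end sketch, assuming a hypothetical docs table (on PostgreSQL 10 the tsvector column is maintained by the application or a trigger):

```sql
create table docs (
  id      serial primary key,
  content text,
  ts      tsvector
);
create index idx_docs_ts on docs using gin (ts);

insert into docs (content, ts)
values ('postgres is great', to_tsvector('english', 'postgres is great'));

-- match, then rank the matches
select id, content
from docs
where ts @@ to_tsquery('english', 'postgres')
order by ts_rank(ts, to_tsquery('english', 'postgres')) desc
limit 10;
```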

5. Regular-expression search (can use pg_trgm with a GIN index):

select * from tbl where col ~ '^a[0-9]{1,5} +digoal$';  

6. Similarity search (can use pg_trgm with a GIN index):

select * from tbl order by similarity(col, 'postgre') desc limit 10;  

7. Ad hoc search, i.e. filtering on arbitrary column combinations (achievable with a bloom index, multi-index bitmap scans, or a GIN-index bitmap scan):

select * from tbl where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);  

Generally speaking, databases cannot accelerate anything beyond case 3. PostgreSQL, however, supports all of these query classes efficiently, with queries that do not block writes and indexes that are maintained in real time.

There is no need to sync data out to a search engine before querying, and a search engine only covers full-text search anyway: it cannot handle the regex, similarity, or infix-fuzzy cases.

Using PostgreSQL greatly simplifies the application architecture, reduces development cost, and keeps query results strictly real-time.

Part 1: Full-Text Search

The core pieces of full-text search are:

dictionaries, parsing, query syntax, ranking algorithms, performance, and highlighting of matched terms.

PostgreSQL implements all of these, and each is extensible: you can plug in custom dictionaries, custom ranking algorithms, and so on.


Four document regions are supported (title, author, abstract, body); the region is specified when generating the tsvector, and a single tsvector may carry lexemes from several regions.

Ranking algorithms use the region to compute weights; for example, a term hit in the title can be given a larger weight.


A normalization mask is supported, mainly to temper the ranking output of very long documents.


Ranking output can also be tuned by assigning different weights to the document regions.


Dictionaries

PostgreSQL ships with no Chinese word segmentation by default, but the text search framework is pluggable, and open-source Chinese segmentation extensions such as zhparser and pg_jieba fill the gap.

https://github.com/jaiminpan/pg_jieba

https://github.com/jaiminpan/pg_scws

You can even implement segmentation through pljava, plpython, and other procedural languages, which pulls the segmentation libraries of those language ecosystems into PostgreSQL.

《Parameters you must know when using zhparser Chinese word segmentation on Alibaba Cloud PostgreSQL》

《How to speed up loading of the jieba segmenter in PostgreSQL》

《PostgreSQL / Greenplum jieba segmentation (by plpython)》

Parsing

1. The parser turns the input string into tokens (custom parsers are allowed).

The default parser recognizes token types such as asciiword, word, numword, email, url, host, and url_path.

Example

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  
  alias   |  description  |            token               
----------+---------------+------------------------------  
 protocol | Protocol head | http://  
 url      | URL           | example.com/stuff/index.html  
 host     | Host          | example.com  
 url_path | URL path      | /stuff/index.html  

The syntax for CREATE TEXT SEARCH PARSER:

https://www.postgresql.org/docs/10/static/sql-createtsparser.html

2. Together with a text search configuration and dictionaries, tokens are converted into lexemes.

For example, create a synonym dictionary file:

postgres        pgsql  
postgresql      pgsql  
postgre pgsql  
gogle   googl  
indices index*  

Then use this dictionary to map tokens to lexemes. Note that a tsvector stores lexemes, not the original tokens:

mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');  
mydb=# SELECT ts_lexize('syn','indices');  
 ts_lexize  
-----------  
 {index}  
(1 row)  
  
mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);  
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;  
mydb=# SELECT to_tsvector('tst','indices');  
 to_tsvector  
-------------  
 'index':1  
(1 row)  
  
mydb=# SELECT to_tsquery('tst','indices');  
 to_tsquery  
------------  
 'index':*  
(1 row)  
  
mydb=# SELECT 'indexes are very useful'::tsvector;  
            tsvector               
---------------------------------  
 'are' 'indexes' 'useful' 'very'  
(1 row)  
  
mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');  
 ?column?  
----------  
 t  
(1 row)  

The syntax for CREATE TEXT SEARCH DICTIONARY:

https://www.postgresql.org/docs/10/static/sql-createtsdictionary.html

3. The lexemes are stored as a tsvector.

The text search configuration decides what gets stored.

During conversion, each token produced by the parser is matched against the dictionaries configured for its token type, in order, and the lexeme returned by the matching dictionary is stored.

ALTER TEXT SEARCH CONFIGURATION config_name  
    ADD MAPPING FOR token_type_1 WITH dict1, dict2, dict3;  
  
When this configuration converts text into a tsvector, a token of token_type_1 is first looked up in dict1; on a match, the lexeme from dict1 is stored; otherwise the lookup falls through to dict2, and so on.  

The syntax for CREATE TEXT SEARCH CONFIGURATION:

https://www.postgresql.org/docs/10/static/sql-createtsconfig.html

The syntax for CREATE TEXT SEARCH TEMPLATE:

https://www.postgresql.org/docs/10/static/sql-createtstemplate.html

4. Control parameters

A parser usually exposes control parameters, for example whether to emit single-character or two-character tokens. For zhparser's parameters, see:

《Parameters you must know when using zhparser Chinese word segmentation on Alibaba Cloud PostgreSQL》

5. Document regions

Title, author, abstract, body,

denoted by the labels A, B, C, D.
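The labels are attached with setweight while building the tsvector; a minimal sketch, assuming hypothetical title and body text values:

```sql
-- label title lexemes as A and body lexemes as D, then concatenate
select setweight(to_tsvector('english', 'PostgreSQL search'), 'A') ||
       setweight(to_tsvector('english', 'full text search inside the database'), 'D');
```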

Query syntax

1. A tsquery is the query input; it supports AND, OR, NOT, and distance operators, as follows.

to_tsquery creates a tsquery value from querytext, which must consist of
single tokens separated by the tsquery operators

& (AND), | (OR), ! (NOT), <-> (FOLLOWED BY), and <N> (FOLLOWED BY at a distance of N),    

possibly grouped using parentheses.
In other words, the input to to_tsquery must already follow the general
rules for tsquery input, as described in Section 8.11.2.

The difference is that while basic tsquery input takes the tokens at face value,
to_tsquery normalizes each token into a lexeme using the specified or default configuration,
and discards any tokens that are stop words according to the configuration.
For example:

c has two positions; either one can satisfy a distance match.
postgres=# select to_tsvector('a b c c');
     to_tsvector     
---------------------
 'a':1 'b':2 'c':3,4
(1 row)

Adjacent:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <-> b');
 ?column? 
----------
 t
(1 row)

Adjacent, i.e. the position difference is 1:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <1> b');
 ?column? 
----------
 t
(1 row)

Distance 2, i.e. the position difference is 2:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> c');
 ?column? 
----------
 t
(1 row)

Distance 3, i.e. the position difference is 3:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <3> c');
 ?column? 
----------
 t
(1 row)

Distance 2 does not match a and b, whose position difference is 1:
postgres=# select to_tsvector('a b c c') @@ to_tsquery('a <2> b');
 ?column? 
----------
 f
(1 row)

2. Document regions are supported in queries:

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');  
    to_tsquery      
------------------  
 'fat' | 'rat':AB  

3. Prefix matching is supported:

Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');  
        to_tsquery          
--------------------------  
 'supern':*A & 'star':*AB  

4. Thesaurus dictionaries can rewrite phrases automatically. For example:

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');  
  to_tsquery  
---------------  
 'sn' & !'crab'  

5. The match operator is @@:

select * from tbl where $tsvector_col @@ $tsquery;  
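Besides to_tsquery, PostgreSQL provides helper constructors that accept free-form text (phraseto_tsquery is available from 9.6 on); a quick comparison:

```sql
select plainto_tsquery('english', 'dark matter');   -- 'dark' & 'matter'
select phraseto_tsquery('english', 'dark matter');  -- 'dark' <-> 'matter'
```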

Ranking

Full-text search usually has two layers: matching, and relevance ranking.

1. Several ranking algorithms are built in; the code lives at:

src/backend/utils/adt/tsrank.c

2. About ts_rank and ts_rank_cd

PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.

You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are:

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4  

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4  

ts_rank_cd implements cover density ranking,

described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

2.1 Both ranking functions accept document-region weights, passed as an optional array argument for fine-tuning; defaults are used when it is omitted.

For both these functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}  

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}  
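For instance, to make title (A) hits dominate even more strongly, pass a custom array in {D, C, B, A} order; a sketch with hypothetical table and column names:

```sql
-- boost A-labeled lexemes, mute D-labeled ones (assumed names)
select title,
       ts_rank('{0.05, 0.2, 0.4, 2.0}', tsvector_col, to_tsquery('postgres')) as rank
from tbl
order by rank desc
limit 10;
```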

2.2 A normalization bitmask is supported for dealing with long documents.

Since a longer document has a greater chance of containing a query term it is reasonable to take into account document size,

e.g.,

a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances.

Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank.

The integer option controls several behaviors, so it is a bit mask:

you can specify one or more behaviors using | (for example, 2|4).

The mask bits are:

0 (the default) ignores the document length  
  
1 divides the rank by 1 + the logarithm of the document length  
  
2 divides the rank by the document length  
  
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)  
  
8 divides the rank by the number of unique words in document  
  
16 divides the rank by 1 + the logarithm of the number of unique words in document  
  
32 divides the rank by itself + 1  

If more than one flag bit is specified, the transformations are applied in the order listed.

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query  
WHERE query @@ textsearch  
ORDER BY rank DESC  
LIMIT 10;  
                     title                     |   rank  
-----------------------------------------------+----------  
 Neutrinos in the Sun                          |      3.1  
 The Sudbury Neutrino Detector                 |      2.4  
 A MACHO View of Galactic Dark Matter          |  2.01317  
 Hot Gas and Dark Matter                       |  1.91171  
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953  
 Rafting for Solar Neutrinos                   |      1.9  
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774  
 Hot Gas and Dark Matter                       |   1.6123  
 Ice Fishing for Cosmic Neutrinos              |      1.6  
 Weak Lensing Distorts the Universe            | 0.818218  

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank  
FROM apod, to_tsquery('neutrino|(dark & matter)') query  
WHERE  query @@ textsearch  
ORDER BY rank DESC  
LIMIT 10;  
                     title                     |        rank  
-----------------------------------------------+-------------------  
 Neutrinos in the Sun                          | 0.756097569485493  
 The Sudbury Neutrino Detector                 | 0.705882361190954  
 A MACHO View of Galactic Dark Matter          | 0.668123210574724  
 Hot Gas and Dark Matter                       |  0.65655958650282  
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973  
 Rafting for Solar Neutrinos                   | 0.655172410958162  
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637  
 Hot Gas and Dark Matter                       | 0.617195790024749  
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517  
 Weak Lensing Distorts the Universe            | 0.450010798361481  

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

Performance

1. Write throughput

From a dictionary of 5 million words, 64 are drawn at random to form each tokenized document; the table carries a full-text GIN index.

With 56 concurrent writers: 93,955 rows/s ingested, 0.6 ms average latency.

《HTAP database PostgreSQL benchmark 7 - (OLTP) full-text search - real-time writes with index》

2. Query throughput

Concurrent full-text queries against 100 million text records.

With 56 concurrent clients: 51,369 queries/s, 1.1 ms average latency.

《HTAP database PostgreSQL benchmark 14 - (OLTP) string search - full-text search》

Special feature: HTML highlighting

Matched text can be returned with highlight markup.

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text  

Example

SELECT ts_headline('english',  
  'The most common type of search  
is to find all documents containing given query terms  
and return them in order of their similarity to the  
query.',  
  to_tsquery('query & similarity'));  
  
                        ts_headline                           
------------------------------------------------------------  
 containing given <b>query</b> terms  
 and return them in order of their <b>similarity</b> to the  
 <b>query</b>.  
  
SELECT ts_headline('english',  
  'The most common type of search  
is to find all documents containing given query terms  
and return them in order of their similarity to the  
query.',  
  to_tsquery('query & similarity'),  
  'StartSel = <, StopSel = >');  
                      ts_headline                        
-------------------------------------------------------  
 containing given <query> terms  
 and return them in order of their <similarity> to the  
 <query>.  
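ts_headline accepts further options, among them MaxWords and MinWords to bound the length of the generated excerpt; a sketch (check the options list for your server version):

```sql
SELECT ts_headline('english',
  'The most common type of search is to find all documents containing given query terms',
  to_tsquery('query & search'),
  'MaxWords=6, MinWords=3, StartSel=**, StopSel=**');
```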

Special feature: document statistics

ts_stat takes a query returning a single tsvector column and reports, for each lexeme in that column, how many documents it appears in and how many times it occurs in total.

ts_stat(sqlquery text, [ weights text, ]  
        OUT word text, OUT ndoc integer,  
        OUT nentry integer) returns setof record  

Return values

word text — the value of a lexeme  
  
ndoc integer — number of documents (tsvectors) the word occurred in  
  
nentry integer — total number of occurrences of the word  

Examples

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')  
ORDER BY nentry DESC, ndoc DESC, word  
LIMIT 10;  

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')  
ORDER BY nentry DESC, ndoc DESC, word  
LIMIT 10;  

Special feature: assigning document regions

setweight labels a tsvector (or selected lexemes within it) with a document region (title, author, abstract, body).

1.  
setweight(vector tsvector, weight "char")	  
  
assigns the weight to every element of vector	  
  
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')	  
'cat':3A 'fat':2A,4A 'rat':5A  
  
2.  
setweight(vector tsvector, weight "char", lexemes text[])  
  
assigns the weight only to those elements of vector that are listed in lexemes	  
  
setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')	  
'cat':3A 'fat':2,4 'rat':5A  

Special feature: debugging text search

1. Debugging tokens

https://www.postgresql.org/docs/10/static/textsearch-debugging.html

Example

ts_debug([ config regconfig, ] document text,  
         OUT alias text,  
         OUT description text,  
         OUT token text,  
         OUT dictionaries regdictionary[],  
         OUT dictionary regdictionary,  
         OUT lexemes text[])  
         returns setof record  
alias text — short name of the token type  
  
description text — description of the token type  
  
token text — text of the token  
  
dictionaries regdictionary[] — the dictionaries selected by the configuration for this token type  
  
dictionary regdictionary — the dictionary that recognized the token, or NULL if none did  
  
lexemes text[] — the lexeme(s) produced by the dictionary that recognized the token,   
or NULL if none did; an empty array ({}) means it was recognized as a stop word  

2. Debugging lexemes: ts_lexize reports whether a dictionary has an entry (lexeme) matching a given token.

ts_lexize(dict regdictionary, token text) returns text[]  

Examples

SELECT ts_lexize('english_stem', 'stars');  
 ts_lexize  
-----------  
 {star}  
  
SELECT ts_lexize('english_stem', 'a');  
 ts_lexize  
-----------  
 {}  

Limits

The length of each lexeme must be less than 2 kB.

The length of a tsvector (lexemes + positions) must be less than 1 MB:

postgres=# select length(to_tsvector(string_agg(md5(random()::text), ' '))) from generate_series(1,100000);  
ERROR:  54000: string is too long for tsvector (3624424 bytes, max 1048575 bytes)  
LOCATION:  make_tsvector, to_tsany.c:185  

The number of lexemes must be less than 2^64.

Position values in a tsvector must be greater than 0 and no more than 16,383.

The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384.

No more than 256 positions per lexeme.

The number of nodes (lexemes + operators) in a tsquery must be less than 32,768.

If a document exceeds these limits, split it across multiple columns.
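The multiple-column workaround can be sketched as follows, with hypothetical table and column names:

```sql
-- split an oversized document across two tsvector columns
create table big_docs (id int, ts1 tsvector, ts2 tsvector);
create index idx_big_docs_ts1 on big_docs using gin (ts1);
create index idx_big_docs_ts2 on big_docs using gin (ts2);

-- query both halves; either index can satisfy its branch
select * from big_docs
where ts1 @@ to_tsquery('postgres')
   or ts2 @@ to_tsquery('postgres');
```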

Part 2: Fuzzy, Regex, and Similarity Queries

Fuzzy, similarity, and regular-expression queries all belong to the text-matching family; PostgreSQL accelerates all three with GIN indexes through the pg_trgm extension.

Performance numbers are covered in these real-world case studies:

《HTAP database PostgreSQL benchmark 13 - (OLTP) string search - similarity query》

《HTAP database PostgreSQL benchmark 12 - (OLTP) string search - infix fuzzy query》

《HTAP database PostgreSQL benchmark 11 - (OLTP) string search - suffix query》

《HTAP database PostgreSQL benchmark 10 - (OLTP) string search - prefix query》

《HTAP database PostgreSQL benchmark 9 - (OLTP) real-time index build for fuzzy and similarity search - real-time writes with index》
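For similarity search specifically, pg_trgm's % operator filters by a global threshold and can use the trigram index; a sketch assuming a table tbl(col text) that already has a gin_trgm_ops index:

```sql
select set_limit(0.3);                 -- similarity threshold used by %
select * from tbl
where col % 'postgre'                  -- index-assisted candidate filter
order by similarity(col, 'postgre') desc
limit 10;
```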

Part 3: Ad Hoc Search

Ad hoc search means filtering on arbitrary combinations of columns. A table with N searchable columns admits on the order of 2^N filter combinations; PostgreSQL offers several ways to make them all fast.

1. Table partitioning, which prunes the search space for some columns.

2. A bloom-filter index, which supports equality searches on any combination of columns (lossy filtering, rechecked against the heap).

3. A multi-column GIN composite index with a bitmap scan: supports arbitrary combinations and filters down to block level.

4. Multiple single-column indexes combined by a bitmap scan: the planner chooses index scan vs. bitmap scan by cost, supports arbitrary combinations, and filters down to block level.
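Approaches 2 and 3 can be sketched as follows, with hypothetical table and column names (the multi-column GIN over scalar columns assumes the btree_gin extension):

```sql
-- bloom index: any-combination equality filters, lossy
create extension if not exists bloom;
create index idx_tbl_bloom on tbl using bloom (a, b, c, d);

-- multi-column GIN over scalar columns via btree_gin
create extension if not exists btree_gin;
create index idx_tbl_gin on tbl using gin (a, b, c, d);

-- the planner can bitmap-AND/OR these for arbitrary predicates
select * from tbl where a = 1 and c = 2;
```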

Performance numbers are covered in these real-world case studies:

《HTAP database PostgreSQL benchmark 20 - (OLAP) user-profiling audience selection - filtering and pivoting on arbitrary column combinations》

《Performance of arbitrary multi-column combination search in PostgreSQL》

《Efficient any-dimension search over massive time, space, and object attribute data - Alibaba Cloud RDS PostgreSQL best practice》

《Arbitrary multi-column filter queries without modeling - millisecond real-time audience selection best practice》

《A sword for the hero - arbitrary-column equality search with PostgreSQL's multi-column expanded b-tree (GIN)》

《How PostgreSQL efficiently solves tokenized search on arbitrary columns - case 1》

《PostgreSQL 9.6 bloom index: one index serving arbitrary column-combination queries》

《PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan》

References

1. Full-text search

https://www.postgresql.org/docs/10/static/textsearch.html

https://www.postgresql.org/docs/10/static/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

2. Fuzzy, regex, similarity queries

https://www.postgresql.org/docs/10/static/pgtrgm.html

3. Ad hoc queries

https://www.postgresql.org/docs/10/static/indexes-bitmap-scans.html

https://www.postgresql.org/docs/10/static/bloom.html

https://www.postgresql.org/docs/10/static/gin.html

4. Related posts

《Starting from thorny fuzzy queries - PostgreSQL's GIN, GiST, SP-GiST, and RUM index internals and background》

《PostgreSQL full-text search acceleration with the RUM index access method (Pandora's box)》

《Millisecond fuzzy search across all columns of a whole table in PostgreSQL - search engines trembled》

