Building a Scalable Event-Driven Search Architecture with Postgres Full-Text Search

Posted by banq on 2022-11-24

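Requirements:
Search documents that have three searchable fields: title and description (using full-text search) and document ID, with the ability to match strings that contain a document ID.
A search over more than 1 million documents should complete in no more than 200 milliseconds.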

PostgreSQL supports full-text search. A full-text index allows documents to be preprocessed and an index to be saved for fast searching later.

-- Create the Documents index table
CREATE TABLE IF NOT EXISTS index."documents_index" 
                    (
                     "id" SERIAL,
                     "created_on" bigint NOT NULL,
                     "updated_on" bigint NOT NULL,
                     "customer_id" character varying(150) NOT NULL,
                     "document_id" character varying(255) NOT NULL,
                     "document_type" character varying(50) NOT NULL,
                     "document_title" text,
                     "document_description" text,
                     "words" text,
                     -- coalesce() keeps ts non-NULL when the title or description is NULL
                     "ts" tsvector GENERATED ALWAYS AS (to_tsvector('english', document_id || ' ' || coalesce(document_title, '') || ' ' || coalesce(document_description, ''))) STORED,
                     "metadata" jsonb,
                     CONSTRAINT "documents_index.primary_key" PRIMARY KEY ("customer_id", "document_id", "document_type"));

-- Create GIN index on the ts (tsvector) column to improve search.
CREATE INDEX IF NOT EXISTS documents_index_ts_idx ON index."documents_index" USING GIN (ts);

-- Create Trigram Index on words column. (Requires installing pg_trgm Postgres extension)
CREATE INDEX IF NOT EXISTS documents_index_trgm_idx on index."documents_index" USING GIN ("words" gin_trgm_ops);

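We added a new column, ts, to the table to store the preprocessed search document (that is, the list of lexemes). ts is a generated column (a feature new in Postgres 12) that is kept in sync with the source data automatically. We then created a GIN index on the ts column, which has type tsvector.
To enable fuzzy search, we used the pg_trgm Postgres extension and added a words column to the table to hold the searchable text. This column stores the concatenated string of the searchable fields.
Finally, the pg_trgm extension provides GiST and GIN index operator classes. These let us create an index on the words text column for fast similarity searches.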

import {BeforeInsert, BeforeUpdate, Column, Entity, Generated, PrimaryColumn} from 'typeorm';

// The table is "documents_index" in the "index" schema, matching the DDL above.
@Entity({name: 'documents_index', schema: 'index'})
export class Document {

    @PrimaryColumn({name: 'id', nullable: false})
    @Generated('increment')
    id: number;

    @Column({name: 'created_on', nullable: false})
    createdOn: number;

    @Column({name: 'updated_on', nullable: false})
    updatedOn: number;
    
    @Column({name: 'customer_id', nullable: false})
    zoneId: string;

    @Column({name: 'document_id', nullable: false})
    documentId: string;

    @Column({name: 'document_type', nullable: false})
    documentType: string;

    @Column({name: 'document_title', nullable: false})
    documentTitle: string;

    @Column({name: 'document_description', nullable: false})
    documentDescription: string;

    @Column({name: 'words', type: 'text'})
    words: string;

    @BeforeInsert()
    @BeforeUpdate()
    async calculateWords() {
        const fullText = this.documentId + ' ' + this.documentTitle + ' ' + this.documentDescription;
        const unique = Array.from(new Set(fullText.split(' ')));
        this.words = unique.join(' ');
    }

    @Column({name: 'metadata', type: 'jsonb'})
    metadata: any;

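    // ts is a generated column: Postgres computes it, so it must never be written.
    @Column({name: 'ts', type: 'tsvector', insert: false, update: false})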
    tsVector: any;

}
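
The words computation in the entity can also be written as a standalone function, which makes it easier to exercise outside TypeORM. The following is a minimal sketch, not the original implementation: buildWords is a hypothetical name, and it assumes that missing fields should be skipped and that any run of whitespace separates words.

```typescript
// Hypothetical helper: builds the value stored in the "words" column.
// Assumption: null/undefined fields are skipped rather than stringified.
function buildWords(
    documentId: string,
    documentTitle?: string | null,
    documentDescription?: string | null,
): string {
    const fullText = [documentId, documentTitle, documentDescription]
        .filter((part): part is string => typeof part === 'string')
        .join(' ');
    // Split on any whitespace run and drop empty tokens.
    const tokens = fullText.split(/\s+/).filter((token) => token.length > 0);
    // A Set keeps the first occurrence of each token, preserving order.
    return Array.from(new Set(tokens)).join(' ');
}

const words = buildWords('DOC-42', 'Management system', 'A system for document management');
// words === 'DOC-42 Management system A for document management'
```

As in the entity, deduplication is case-sensitive, so "Management" and "management" are both kept.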

We tested the query below, which returns 436 rows, using the pgbench tool:

SELECT id, document_id, document_title, document_description, 
COALESCE(similarity(words, 'management system'),0) + COALESCE(ts_rank_cd(ts, 'management & system'),0) as relevancy 
FROM "index"."documents_index" WHERE 
customer_id = '1' AND (
    ts @@ to_tsquery('english', 'management & system') 
--    OR words ILIKE '%management system%'
    ) 
ORDER BY relevancy DESC, id ASC

We were able to achieve about 170 transactions per second.
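
The benchmarked query hard-codes its search terms. An application would more likely build the statement from user input. The sketch below shows one way to do that, under several assumptions: toTsQuery and buildSearchQuery are hypothetical names, the $1..$3 placeholders follow the positional style used by common Postgres drivers, and only ASCII letters and digits are kept as search terms. It also passes the ranking argument through to_tsquery so that it is stemmed the same way as the match condition, which is a variation on the query shown above rather than the statement that was measured.

```typescript
// Hypothetical helper: turns free text into an AND-ed to_tsquery expression.
function toTsQuery(input: string): string {
    return input
        .toLowerCase()
        .split(/[^a-z0-9]+/) // assumption: only ASCII letters and digits are terms
        .filter((term) => term.length > 0)
        .join(' & ');
}

interface SearchQuery {
    text: string;
    values: string[];
}

// Builds a parameterized statement; user input never reaches the SQL text.
function buildSearchQuery(customerId: string, input: string): SearchQuery {
    const text = `
SELECT id, document_id, document_title, document_description,
       COALESCE(similarity(words, $2), 0)
     + COALESCE(ts_rank_cd(ts, to_tsquery('english', $3)), 0) AS relevancy
FROM "index"."documents_index"
WHERE customer_id = $1
  AND ts @@ to_tsquery('english', $3)
ORDER BY relevancy DESC, id ASC`;
    return { text, values: [customerId, input, toTsQuery(input)] };
}

const query = buildSearchQuery('1', 'management system');
// query.values is ['1', 'management system', 'management & system']
```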

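When large volumes of data are stored in a database, performance and scalability suffer as the data grows. Partitioning addresses this by splitting a large table into smaller tables, which reduces memory swapping and table scans and improves performance.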

-- Create the Documents index table
CREATE TABLE IF NOT EXISTS index."documents_index" 
                    (
                     "id" SERIAL,
                     "created_on" bigint NOT NULL,
                     "updated_on" bigint NOT NULL,
                     "customer_id" character varying(150) NOT NULL,
                     "document_id" character varying(255) NOT NULL,
                     "document_type" character varying(50) NOT NULL,
                     "document_title" text,
                     "document_description" text,
                     "words" text,
                     -- coalesce() keeps ts non-NULL when the title or description is NULL
                     "ts" tsvector GENERATED ALWAYS AS (to_tsvector('english', document_id || ' ' || coalesce(document_title, '') || ' ' || coalesce(document_description, ''))) STORED,
                     "metadata" jsonb,
                     CONSTRAINT "documents_index.primary_key" PRIMARY KEY ("customer_id", "document_id", "document_type"))  PARTITION BY HASH ("customer_id");

-- Create GIN index on the ts (tsvector) column to improve search.
CREATE INDEX IF NOT EXISTS documents_index_ts_idx ON index."documents_index" USING GIN (ts);

-- Create Trigram Index on words column. (Requires installing pg_trgm Postgres extension)
CREATE INDEX IF NOT EXISTS documents_index_trgm_idx on index."documents_index" USING GIN ("words" gin_trgm_ops);

-- Create the ten hash partitions; each must name the partitioned parent table, index."documents_index".
CREATE TABLE IF NOT EXISTS index."documents_index_part_1" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 0);
CREATE TABLE IF NOT EXISTS index."documents_index_part_2" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 1);
CREATE TABLE IF NOT EXISTS index."documents_index_part_3" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 2);
CREATE TABLE IF NOT EXISTS index."documents_index_part_4" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 3);
CREATE TABLE IF NOT EXISTS index."documents_index_part_5" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 4);
CREATE TABLE IF NOT EXISTS index."documents_index_part_6" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 5);
CREATE TABLE IF NOT EXISTS index."documents_index_part_7" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 6);
CREATE TABLE IF NOT EXISTS index."documents_index_part_8" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 7);
CREATE TABLE IF NOT EXISTS index."documents_index_part_9" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 8);
CREATE TABLE IF NOT EXISTS index."documents_index_part_10" partition of index."documents_index" for values with (MODULUS 10, REMAINDER 9);

After partitioning the Index Storage table, we saw a nearly 60% improvement in query performance.
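
Because the table is hash-partitioned on customer_id and the search query filters on a single customer_id, every row for a given customer lives in one partition, which is likely where much of the gain comes from. The sketch below illustrates the idea in two parts. partitionDdl generates the ten partition statements instead of writing them by hand, and partitionFor shows how a key maps to a remainder. All function names are hypothetical, and the FNV-1a hash is only a stand-in: PostgreSQL uses its own internal hash function, so the actual partition chosen for a key will differ.

```typescript
// Generates the CREATE TABLE ... PARTITION OF statements for a hash-partitioned parent.
function partitionDdl(schema: string, parent: string, modulus: number): string[] {
    const statements: string[] = [];
    for (let remainder = 0; remainder < modulus; remainder++) {
        // Partition names are 1-based, remainders are 0-based, as in the article.
        const child = `${parent}_part_${remainder + 1}`;
        statements.push(
            `CREATE TABLE IF NOT EXISTS ${schema}."${child}" PARTITION OF ${schema}."${parent}" ` +
            `FOR VALUES WITH (MODULUS ${modulus}, REMAINDER ${remainder});`,
        );
    }
    return statements;
}

// Stand-in hash (FNV-1a, 32-bit). NOT the hash function PostgreSQL uses internally;
// it only illustrates that equal keys always map to the same remainder.
function fnv1a(key: string): number {
    let hash = 0x811c9dc5;
    for (let i = 0; i < key.length; i++) {
        hash ^= key.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash;
}

// Maps a customer ID to a remainder in the range 0..modulus-1.
function partitionFor(customerId: string, modulus: number): number {
    return fnv1a(customerId) % modulus;
}

const ddl = partitionDdl('index', 'documents_index', 10);
// ddl.length === 10
```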

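Terminology
1. Stemming
Stemming reduces a word to its stem, the base form to which prefixes and suffixes attach (also called the root), so that variants of the word match in a search. For example, management, manager, and managing all reduce to the stem manag, so a search for manag returns results containing any of these variants.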
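
To make the idea concrete, here is a deliberately tiny suffix stripper. It is a toy for illustration only: the 'english' configuration in PostgreSQL uses a Snowball stemmer with far more rules, and the suffix list and the toyStem name below are assumptions chosen just to cover the example words.

```typescript
// Toy suffix stripper, for illustration only. PostgreSQL's 'english'
// configuration uses a Snowball stemmer with far more rules than this.
const SUFFIXES = ['ement', 'ing', 'er', 'e', 's'];

function toyStem(word: string): string {
    const lower = word.toLowerCase();
    for (const suffix of SUFFIXES) {
        // Keep at least three characters so short words are left alone.
        if (lower.length - suffix.length >= 3 && lower.endsWith(suffix)) {
            return lower.slice(0, lower.length - suffix.length);
        }
    }
    return lower;
}

const stems = ['management', 'manager', 'managing', 'manage'].map(toyStem);
// every entry of stems is 'manag'
```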

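2. NGram
An n-gram is like a sliding window that moves across a word: a contiguous sequence of characters up to a specified length. For example, the word "word" becomes {'w', 'wo', 'wor', 'ord', 'rd'}. N-grams make it possible to search for parts of a word, even from the middle. The most commonly used type of n-gram is the trigram.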
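
The sliding window can be sketched in a few lines. The ngrams function below is a plain character window. paddedTrigrams adds the padding that pg_trgm applies to each word (two leading spaces and one trailing space), which is why trigrams at the edges of a word contain blanks. Both function names are hypothetical.

```typescript
// Slides a window of `size` characters over the word.
function ngrams(word: string, size: number): string[] {
    const result: string[] = [];
    for (let start = 0; start + size <= word.length; start++) {
        result.push(word.slice(start, start + size));
    }
    return result;
}

// pg_trgm-style padding: two leading spaces and one trailing space.
function paddedTrigrams(word: string): string[] {
    return ngrams('  ' + word.toLowerCase() + ' ', 3);
}

const plain = ngrams('word', 3);       // ['wor', 'ord']
const padded = paddedTrigrams('word'); // ['  w', ' wo', 'wor', 'ord', 'rd ']
```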

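3. Fuzziness
Fuzziness means that, when comparing two strings, the solution does not look for a perfect, position-by-position match. Instead, it tolerates some mismatches (the "fuzziness"). For example, a search for the misspelled word succesful also returns results containing success. Common applications include spell checking and spam filtering.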

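4. Similarity
The similarity of two words can be measured by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.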
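
A minimal sketch of that measurement follows: it counts the trigrams two strings share and divides by the number of distinct trigrams across both, which approximates how pg_trgm's similarity() behaves by default. The function names, the ASCII-only tokenization, and the isFuzzyMatch helper are assumptions. The 0.3 cut-off mirrors pg_trgm's default similarity threshold.

```typescript
// Collects the padded trigrams of every word in the text.
function trigramSet(text: string): Set<string> {
    const set = new Set<string>();
    const parts = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
    for (const part of parts) {
        const padded = '  ' + part + ' ';
        for (let i = 0; i + 3 <= padded.length; i++) {
            set.add(padded.slice(i, i + 3));
        }
    }
    return set;
}

// Shared trigrams divided by distinct trigrams in both strings (0..1).
function similarity(a: string, b: string): number {
    const left = trigramSet(a);
    const right = trigramSet(b);
    let shared = 0;
    left.forEach((trigram) => {
        if (right.has(trigram)) shared++;
    });
    const union = left.size + right.size - shared;
    return union === 0 ? 0 : shared / union;
}

// Assumed threshold, taken from pg_trgm's default of 0.3.
function isFuzzyMatch(a: string, b: string, threshold = 0.3): boolean {
    return similarity(a, b) >= threshold;
}

const score = similarity('word', 'words'); // 4 shared of 7 distinct trigrams
```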

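5. Ranking
Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. Postgres supports ranking and weighted ranking. Weights are typically used to mark words from special areas of the document, such as the title or an initial abstract, so they can be treated as more or less important than words in the document body.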
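
The effect of weighting can be illustrated with a toy scorer in which a term found in the title counts for more than one found in the description. This is not the formula that ts_rank or ts_rank_cd uses. The Doc shape, the function names, and the choice of 1.0 and 0.1 (borrowed from PostgreSQL's default values for weight labels A and D) are assumptions. The final sort mirrors the ORDER BY relevancy DESC, id ASC clause of the search query.

```typescript
interface Doc {
    id: number;
    title: string;
    description: string;
}

// Assumed weights, borrowed from PostgreSQL's defaults for labels A and D.
const TITLE_WEIGHT = 1.0;
const DESCRIPTION_WEIGHT = 0.1;

function tokens(text: string): string[] {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Toy score: adds a field's weight once per query term found in that field.
// This is NOT the formula ts_rank or ts_rank_cd uses.
function toyRank(doc: Doc, query: string): number {
    const title = tokens(doc.title);
    const description = tokens(doc.description);
    let score = 0;
    for (const term of tokens(query)) {
        if (title.indexOf(term) !== -1) score += TITLE_WEIGHT;
        if (description.indexOf(term) !== -1) score += DESCRIPTION_WEIGHT;
    }
    return score;
}

// Drops non-matching documents, then sorts by score DESC and id ASC.
function rankDocs(docs: Doc[], query: string): Doc[] {
    return docs
        .map((doc) => ({ doc, score: toyRank(doc, query) }))
        .filter((entry) => entry.score > 0)
        .sort((a, b) => b.score - a.score || a.doc.id - b.doc.id)
        .map((entry) => entry.doc);
}
```

With this scorer, a document titled "Management system" ranks above one that only mentions both terms in its description.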
