PostgreSQL相似文字檢索與去重-(銀屑病怎麼治？銀屑病怎麼治療？銀屑病怎麼治療好？銀屑病怎麼能治療好？)

德哥發表於2018-04-18

原文網址 : https://flycode.co/archives/185824

背景

在雲棲社群的問答區，有一位網友提到有一個問題：

表裡相似資料太多，想刪除相似度高的資料，有什麼辦法能實現嗎？  
例如：  
銀屑病怎麼治？  
銀屑病怎麼治療？  
銀屑病怎麼治療好？  
銀屑病怎麼能治療好？  
等等

解這個問題的思路

1. 首先如何判斷內容的相似度，PostgreSQL中提供了中文分詞，pg_trgm(將字串切成多個不重複的token,計算兩個字串的相似度) .

對於本題，我建議採取中文分詞的方式，首先將內容拆分成片語。

2. 在拆分成片語後，首先分組聚合，去除完全重複的資料。

3. 然後自關聯生成笛卡爾(矩陣)，計算出每條記錄和其他記錄的相似度。相似度的演算法很簡單，重疊的token數量除以集合的token去重後的數量。

4. 根據相似度，去除不需要的資料。

這裡如果資料量非常龐大，使用專業的分析程式語言會更好例如 PL/R。

實操的例子

首先要安裝PostgreSQL 中文分詞外掛

(阿里雲AliCloudDB PostgreSQL已包含這個外掛，用法參考官方手冊)

git clone https://github.com/jaiminpan/pg_jieba.git  
mv pg_jieba $PGSRC/contrib/  
export PATH=/home/digoal/pgsql9.5/bin:$PATH  
cd $PGSRC/contrib/pg_jieba  
make clean;make;make install  
  
git clone https://github.com/jaiminpan/pg_scws.git  
mv pg_jieba $PGSRC/contrib/  
export PATH=/home/digoal/pgsql9.5/bin:$PATH  
cd $PGSRC/contrib/pg_scws  
make clean;make;make install

建立外掛

psql  
# create extension pg_jieba;  
# create extension pg_scws;

建立測試CASE

create table tdup1 (id int primary key, info text);  
create extension pg_trgm;  
insert into tdup1 values (1, `銀屑病怎麼治？`);  
insert into tdup1 values (2, `銀屑病怎麼治療？`);  
insert into tdup1 values (3, `銀屑病怎麼治療好？`);  
insert into tdup1 values (4, `銀屑病怎麼能治療好？`);

這兩種分詞外掛，可以任選一種。

postgres=# select to_tsvector(`jiebacfg`, info),* from tdup1 ;  
     to_tsvector     | id |         info           
---------------------+----+----------------------  
 `治`:3 `銀屑病`:1   |  1 | 銀屑病怎麼治？  
 `治療`:3 `銀屑病`:1 |  2 | 銀屑病怎麼治療？  
 `治療`:3 `銀屑病`:1 |  3 | 銀屑病怎麼治療好？  
 `治療`:4 `銀屑病`:1 |  4 | 銀屑病怎麼能治療好？  
(4 rows)  
  
postgres=# select to_tsvector(`scwscfg`, info),* from tdup1 ;  
            to_tsvector            | id |         info           
-----------------------------------+----+----------------------  
 `治`:2 `銀屑病`:1                 |  1 | 銀屑病怎麼治？  
 `治療`:2 `銀屑病`:1               |  2 | 銀屑病怎麼治療？  
 `好`:3 `治療`:2 `銀屑病`:1        |  3 | 銀屑病怎麼治療好？  
 `好`:4 `治療`:3 `能`:2 `銀屑病`:1 |  4 | 銀屑病怎麼能治療好？  
(4 rows)

建立三個函式，

計算2個陣列的集合（去重後的集合）

postgres=# create or replace function array_union(text[], text[]) returns text[] as $$  
  select array_agg(c1) from (select c1 from unnest($1||$2) t(c1) group by c1) t;  
$$ language sql strict;  
CREATE FUNCTION

陣列去重

postgres=# create or replace function array_dist(text[]) returns text[] as $$           
  select array_agg(c1) from (select c1 from unnest($1) t(c1) group by c1) t;      
$$ language sql strict;  
CREATE FUNCTION

計算兩個陣列的重疊部分（去重後的重疊部分）

postgres=# create or replace function array_share(text[], text[]) returns text[] as $$  
  select array_agg(unnest) from (select unnest($1) intersect select unnest($2) group by 1) t;  
$$ language sql strict;  
CREATE FUNCTION

笛卡爾結果是這樣的：

regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `) 用於將info轉換成陣列。

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2) t;  
 t1c1 | t2c1 |         t1c2         |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+----------------------+----------------------+-------------------+-------------------+----------  
    1 |    1 | 銀屑病怎麼治？       | 銀屑病怎麼治？       | {`銀屑病`,`治`}   | {`銀屑病`,`治`}   |     1.00  
    1 |    2 | 銀屑病怎麼治？       | 銀屑病怎麼治療？     | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    1 |    3 | 銀屑病怎麼治？       | 銀屑病怎麼治療好？   | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    1 |    4 | 銀屑病怎麼治？       | 銀屑病怎麼能治療好？ | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    2 |    1 | 銀屑病怎麼治療？     | 銀屑病怎麼治？       | {`銀屑病`,`治療`} | {`銀屑病`,`治`}   |     0.33  
    2 |    2 | 銀屑病怎麼治療？     | 銀屑病怎麼治療？     | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    2 |    3 | 銀屑病怎麼治療？     | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    2 |    4 | 銀屑病怎麼治療？     | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    1 | 銀屑病怎麼治療好？   | 銀屑病怎麼治？       | {`銀屑病`,`治療`} | {`銀屑病`,`治`}   |     0.33  
    3 |    2 | 銀屑病怎麼治療好？   | 銀屑病怎麼治療？     | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    3 | 銀屑病怎麼治療好？   | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    4 | 銀屑病怎麼治療好？   | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    4 |    1 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治？       | {`銀屑病`,`治療`} | {`銀屑病`,`治`}   |     0.33  
    4 |    2 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治療？     | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    4 |    3 | 銀屑病怎麼能治療好？ | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    4 |    4 | 銀屑病怎麼能治療好？ | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
(16 rows)

以上生成的實際上是一個矩陣,simulate就是矩陣中我們需要計算的相似度：

我們在去重計算時不需要所有的笛卡爾積，只需要這個矩陣對角線的上部分或下部分資料即可。

所以加個條件就能完成。

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    1 |    2 | 銀屑病怎麼治？     | 銀屑病怎麼治療？     | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    1 |    3 | 銀屑病怎麼治？     | 銀屑病怎麼治療好？   | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    1 |    4 | 銀屑病怎麼治？     | 銀屑病怎麼能治療好？ | {`銀屑病`,`治`}   | {`銀屑病`,`治療`} |     0.33  
    2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
(6 rows)

開始對這些資料去重，去重的第一步，明確simulate, 例如相似度大於0.5的，需要去重。

postgres=# with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
(3 rows)

去重第二步，將t2c1列的ID對應的記錄刪掉即可。

delete from tdup1 where id in (with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
  
例如 :   
postgres=# insert into tdup1 values (11, `白血病怎麼治？`);  
INSERT 0 1  
postgres=# insert into tdup1 values (22, `白血病怎麼治療？`);  
INSERT 0 1  
postgres=# insert into tdup1 values (13, `白血病怎麼治療好？`);  
INSERT 0 1  
postgres=# insert into tdup1 values (24, `白血病怎麼能治療好？`);  
INSERT 0 1  
postgres=#   
postgres=# with t(c1,c2,c3) as                               
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;  
 t1c1 | t2c1 |        t1c2        |         t2c2         |       t1c3        |       t2c3        | simulate   
------+------+--------------------+----------------------+-------------------+-------------------+----------  
    2 |    3 | 銀屑病怎麼治療？   | 銀屑病怎麼治療好？   | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    2 |    4 | 銀屑病怎麼治療？   | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
    3 |    4 | 銀屑病怎麼治療好？ | 銀屑病怎麼能治療好？ | {`銀屑病`,`治療`} | {`銀屑病`,`治療`} |     1.00  
   22 |   24 | 白血病怎麼治療？   | 白血病怎麼能治療好？ | {`治療`,`白血病`} | {`治療`,`白血病`} |     1.00  
   13 |   22 | 白血病怎麼治療好？ | 白血病怎麼治療？     | {`治療`,`白血病`} | {`治療`,`白血病`} |     1.00  
   13 |   24 | 白血病怎麼治療好？ | 白血病怎麼能治療好？ | {`治療`,`白血病`} | {`治療`,`白血病`} |     1.00  
(6 rows)  
  
postgres=# begin;  
BEGIN  
postgres=# delete from tdup1 where id in (with t(c1,c2,c3) as   
postgres(# (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
postgres(# select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
postgres(# simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
DELETE 4  
postgres=# select * from tdup1 ;  
 id |        info          
----+--------------------  
  1 | 銀屑病怎麼治？  
  2 | 銀屑病怎麼治療？  
 11 | 白血病怎麼治？  
 13 | 白血病怎麼治療好？  
(4 rows)

用資料庫解會遇到的問題, 因為我們的JOIN filter是<>和<，用不上hashjoin。

資料量比較大的情況下，耗時會非常的長。

postgres=# explain delete from tdup1 where id in (with t(c1,c2,c3) as   
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector(`jiebacfg`,info)::text,`(:d+)`, ``, `g`)),` `)) from tdup1)   
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)   
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Delete on tdup1  (cost=10005260133.58..10005260215.84 rows=2555 width=34)  
   ->  Hash Join  (cost=10005260133.58..10005260215.84 rows=2555 width=34)  
         Hash Cond: (tdup1.id = "ANY_subquery".t2c1)  
         ->  Seq Scan on tdup1  (cost=0.00..61.10 rows=5110 width=10)  
         ->  Hash  (cost=10005260131.08..10005260131.08 rows=200 width=32)  
               ->  HashAggregate  (cost=10005260129.08..10005260131.08 rows=200 width=32)  
                     Group Key: "ANY_subquery".t2c1  
                     ->  Subquery Scan on "ANY_subquery"  (cost=10000002667.20..10005252911.99 rows=2886838 width=32)  
                           ->  Subquery Scan on t  (cost=10000002667.20..10005224043.61 rows=2886838 width=4)  
                                 Filter: (t.simulate > 0.5)  
                                 CTE t  
                                   ->  Seq Scan on tdup1 tdup1_1  (cost=0.00..2667.20 rows=5110 width=36)  
                                 ->  Nested Loop  (cost=10000000000.00..10005113119.99 rows=8660513 width=68)  
                                       Join Filter: ((t1.c1 <> t2.c1) AND (t1.c1 < t2.c1))  
                                       ->  CTE Scan on t t1  (cost=0.00..102.20 rows=5110 width=36)  
                                       ->  CTE Scan on t t2  (cost=0.00..102.20 rows=5110 width=36)  
(16 rows)

其他更優雅的方法，使用PLR或者R進行矩陣運算，得出結果後再進行篩選。

PLR

或者使用MPP資料庫例如Greenplum加上R和madlib可以對非常龐大的資料進行處理。

MADLIB

MPP