PostgreSQL全文檢索-詞頻統計
標籤
PostgreSQL , 全文檢索 , 詞頻統計 , ts_stat , madlib
背景
TF(Term Frequency 詞頻)/IDF(Inverse Document Frequency 逆向文字頻率)是文字分析中常見的術語。
PostgreSQL支援全文檢索,支援tsvector文字向量型別。
如何在一堆文字中,找到熱詞,或者對詞頻進行分析呢?
方法1,ts_stat
第一種方法來自PostgreSQL的內建函式,ts_stat,用於生成lexeme的統計資訊,例如我想知道某個問答知識庫中,出現最多的詞是哪個,出現在了多少篇文字中。
ts_stat介紹如下
https://www.postgresql.org/docs/devel/static/functions-textsearch.html
https://www.postgresql.org/docs/devel/static/textsearch.html
https://www.postgresql.org/docs/devel/static/textsearch-features.html
12.4.4. Gathering Document Statistics
The function ts_stat is useful for checking your configuration and for finding stop-word candidates.
ts_stat(sqlquery text, [ weights text, ]
OUT word text, OUT ndoc integer,
OUT nentry integer) returns setof record
sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are
-
word text — the value of a lexeme
-
ndoc integer — number of documents (tsvectors) the word occurred in
-
nentry integer — total number of occurrences of the word
If weights is supplied, only occurrences having one of those weights are counted.
For example, to find the ten most frequent words in a document collection:
SELECT * FROM ts_stat(`SELECT vector FROM apod`)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
The same, but counting only word occurrences with weight A or B:
SELECT * FROM ts_stat(`SELECT vector FROM apod`, `ab`)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
測試
1、建立生成隨機字串的函式
create or replace function gen_rand_str(int) returns text as $$
select substring(md5(random()::text), 4, $1);
$$ language sql strict stable;
2、建立生成若干個隨機詞的函式
create or replace function gen_rand_tsvector(int,int) returns tsvector as $$
select array_to_tsvector(array_agg(gen_rand_str($1))) from generate_series(1,$2);
$$ language sql strict;
postgres=# select gen_rand_tsvector(4,10);
gen_rand_tsvector
-----------------------------------------------------------------------
`21eb` `2c9c` `4406` `5d9c` `9ac4` `a27b` `ab13` `ba77` `e3f2` `f198`
(1 row)
3、建立測試表,並寫入測試資料
postgres=# create table ts_test(id int, info tsvector);
CREATE TABLE
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(4,10);
INSERT 0 100000
4、檢視詞頻,總共出現了多少次,在多少篇文字(多少條記錄中出現過)
postgres=# SELECT * FROM ts_stat(`SELECT info FROM ts_test`)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
word | ndoc | nentry
------+------+--------
e4e6 | 39 | 39
9596 | 36 | 36
a84c | 35 | 35
2b44 | 32 | 32
5146 | 32 | 32
92f6 | 32 | 32
cd56 | 32 | 32
fd00 | 32 | 32
4258 | 31 | 31
5f18 | 31 | 31
(10 rows)
5、再寫入一批測試資料,檢視詞頻,總共出現了多少次,在多少篇文字(多少條記錄中出現過)
postgres=# insert into ts_test select generate_series(1,100000), gen_rand_tsvector(2,10);
INSERT 0 100000
postgres=# SELECT * FROM ts_stat(`SELECT info FROM ts_test`)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
word | ndoc | nentry
------+------+--------
30 | 4020 | 4020
a7 | 4005 | 4005
20 | 3985 | 3985
c5 | 3980 | 3980
e6 | 3970 | 3970
f1 | 3965 | 3965
70 | 3948 | 3948
5e | 3943 | 3943
e4 | 3937 | 3937
2b | 3934 | 3934
(10 rows)
方法2,madlib
實際上MADlib也提供了詞頻統計的訓練函式
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
Term frequency
Term frequency tf(t,d) is to the raw frequency of a word/term in a document, i.e. the number of times that word/term t occurs in document d. For this function, `word` and `term` are used interchangeably. Note: the term frequency is not normalized by the document length.
term_frequency(input_table,
doc_id_col,
word_col,
output_table,
compute_vocab)
Arguments:
input_table
TEXT. The name of the table storing the documents. Each row is in the form <doc_id, word_vector> where doc_id is an id, unique to each document, and word_vector is a text array containing the words in the document. The word_vector should contain multiple entries of a word if the document contains multiple occurrence of that word.
id_col
TEXT. The name of the column containing the document id.
word_col
TEXT. The name of the column containing the vector of words/terms in the document. This column should of type that can be cast to TEXT[].
output_table
TEXT. The name of the table to store the term frequency output. The output table contains the following columns:
-
id_col: This the document id column (same as the one provided as input).
-
word: A word/term present in a document. This is either the original word present in word_col or an id representing the word (depending on the value of compute_vocab below).
-
count: The number of times this word is found in the document.
compute_vocab
BOOLEAN. (Optional, Default=FALSE) Flag to indicate if a vocabulary is to be created. If TRUE, an additional output table is created containing the vocabulary of all words, with an id assigned to each word. The table is called output_table_vocabulary (suffix added to the output_table name) and contains the following columns:
-
wordid: An id assignment for each word
-
word: The word/term
參考
http://madlib.apache.org/docs/latest/group__grp__text__utilities.html
https://www.postgresql.org/docs/devel/static/functions-textsearch.html
https://www.postgresql.org/docs/devel/static/textsearch.html
https://www.postgresql.org/docs/devel/static/textsearch-features.html
http://madlib.incubator.apache.org/
相關文章
- ElasticSearch 實現分詞全文檢索 - 概述Elasticsearch分詞
- ElasticSearch 實現分詞全文檢索 - delete-by-queryElasticsearch分詞delete
- 配置全文檢索
- Mysql 如何實現全文檢索,關鍵詞跑分MySql
- 全文檢索庫 bluge
- Kibana 全文檢索操作
- 全文檢索的轉義
- solr全文檢索學習Solr
- 詞語詞頻統計
- 詞頻統計
- 基於ElasticSearch實現商品的全文檢索檢索Elasticsearch
- openGauss每日一練(全文檢索)
- oracle全文索引之配置全文檢索環境Oracle索引
- 詞頻統計mapreduce
- php + MongoDB + Sphinx 實現全文檢索PHPMongoDB
- Oracle的全文檢索技術(轉)Oracle
- 全文檢索技術lucene的demo
- 【搜尋引擎】Solr Suggester 實現全文檢索功能-分詞和和自動提示Solr分詞
- python如何統計詞頻Python
- springboot ElasticSearch 簡單的全文檢索高亮Spring BootElasticsearch
- 【IT老齊072】全文檢索執行原理
- 基於Lucene的全文檢索實踐
- 關鍵詞感知檢索
- IM全文檢索技術專題(四):微信iOS端的最新全文檢索技術優化實踐iOS優化
- python實現詞頻統計Python
- 【搜尋引擎】 PostgreSQL 10 實時全文檢索和分詞、相似搜尋、模糊匹配實現類似Google搜尋自動提示SQL分詞Go
- 文字挖掘之語料庫、分詞、詞頻統計分詞
- 個人部落格分享(Laravel + Vue 元件,支援全文檢索)LaravelVue元件
- 詞頻統計任務程式設計實踐程式設計
- PostgreSQL一複合查詢SQL優化例子-(多個exists,範圍檢索,IN檢索,模糊檢索組合)SQL優化
- lnmp+coreseek實現站內全文檢索(安裝篇)LNMP
- 讀書筆記:從Lucene到Elasticsearch:全文檢索實戰筆記Elasticsearch
- 沒辦法了,用 MySQL 做全文檢索也挺好的MySql
- 一個線上全文索引BUG的排查:關於類阿拉件數字的分詞與檢索索引分詞
- Python統計四六級考試的詞頻Python
- 【搜尋引擎】Solr全文檢索近實時查詢優化Solr優化
- Java、Scala、Python ☞ 本地WordCount詞頻統計對比JavaPython
- Homestead 環境下安裝 Elasticsearch 並使用 scout 進行全文檢索Elasticsearch