如何解決資料庫分詞的拼寫糾正問題-PostgreSQLHunspell字典複數形容詞動詞等變異還原
標籤
PostgreSQL , Hunspell , 分詞 , 複數還原 , 字典
背景
在英語中,名詞通常都有複數,表示多個;形容詞,過去式,動詞等。 有large, larger, largest, stories, eating, did, doing, hacked這樣的。
這可能會給分詞帶來一定的困擾,例如我們來看看PG預設的ts config怎麼處理它的。
比如english tsconfig是這麼處理的
postgres=# SELECT * FROM ts_debug(`english`, `larger`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+----------------+--------------+----------
asciiword | Word, all ASCII | larger | {english_stem} | english_stem | {larger}
(1 row)
postgres=# SELECT * FROM ts_debug(`english`, `large`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | large | {english_stem} | english_stem | {larg}
(1 row)
postgres=# SELECT * FROM ts_debug(`english`, `largest`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------+----------------+--------------+-----------
asciiword | Word, all ASCII | largest | {english_stem} | english_stem | {largest}
(1 row)
postgres=# SELECT * FROM ts_debug(`english`, `stories`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------+----------------+--------------+---------
asciiword | Word, all ASCII | stories | {english_stem} | english_stem | {stori}
(1 row)
很顯然,它沒有很好的處理這幾個詞, large, larger, largest, stories。
預設的parser支援的token型別
postgres=# select * from ts_token_type(`default`);
tokid | alias | description
-------+-----------------+------------------------------------------
1 | asciiword | Word, all ASCII
2 | word | Word, all letters
3 | numword | Word, letters and digits
4 | email | Email address
5 | url | URL
6 | host | Host
7 | sfloat | Scientific notation
8 | version | Version number
9 | hword_numpart | Hyphenated word part, letters and digits
10 | hword_part | Hyphenated word part, all letters
11 | hword_asciipart | Hyphenated word part, all ASCII
12 | blank | Space symbols
13 | tag | XML tag
14 | protocol | Protocol head
15 | numhword | Hyphenated word, letters and digits
16 | asciihword | Hyphenated word, all ASCII
17 | hword | Hyphenated word, all letters
18 | url_path | URL path
19 | file | File or path name
20 | float | Decimal notation
21 | int | Signed integer
22 | uint | Unsigned integer
23 | entity | XML entity
(23 rows)
實際上從PostgreSQL 9.6開始,就支援了拼寫的糾正字典,參考
https://www.postgresql.org/docs/9.6/static/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
通過affix, dict檔案進行糾正。
例子
The .affix file of Ispell has the following structure:
prefixes
flag *A:
. > RE # As in enter > reenter
suffixes
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
And the .dict file has the following structure:
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
postgrespro開源了一個外掛,實現了一些國家語言的fix , 可以用來處理這類拼寫糾正。
Hunspell Dictionaries
https://github.com/postgrespro/hunspell_dicts
git clone https://github.com/postgrespro/hunspell_dicts
cd hunspell_dicts
ll
total 28K
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_de_de
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_en_us
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_fr
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_nl_nl
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_nn_no
drwxr-xr-x 5 digoal users 4.0K Dec 6 19:53 hunspell_ru_ru
-rw-r--r-- 1 digoal users 1.3K Dec 6 19:53 README.md
cd hunspell_en_us
ll
total 560K
-rw-r--r-- 1 dege.zzz users 3.1K Dec 6 19:53 en_us.affix -- 糾正拼寫的語法
-rw-r--r-- 1 dege.zzz users 531K Dec 6 19:53 en_us.dict -- 糾正字典
drwxr-xr-x 2 dege.zzz users 4.0K Dec 6 19:53 expected
-rw-r--r-- 1 dege.zzz users 804 Dec 6 19:53 hunspell_en_us--1.0.sql
-rw-r--r-- 1 dege.zzz users 150 Dec 6 19:53 hunspell_en_us.control
drwxr-xr-x 2 dege.zzz users 4.0K Dec 6 19:53 license
-rw-r--r-- 1 dege.zzz users 370 Dec 6 19:53 Makefile
drwxr-xr-x 2 dege.zzz users 4.0K Dec 6 19:53 sql
make USE_PGXS=1 install
目前支援的幾個字典如下
Module | Dictionary | Configuration |
---|---|---|
hunspell_de_de | german_hunspell | german_hunspell |
hunspell_en_us | english_hunspell | english_hunspell |
hunspell_fr | french_hunspell | french_hunspell |
hunspell_nl_nl | dutch_hunspell | dutch_hunspell |
hunspell_nn_no | norwegian_hunspell | norwegian_hunspell |
hunspell_ru_ru | russian_hunspell | russian_hunspell |
通過模組安裝這些字典
psql
CREATE EXTENSION hunspell_en_us;
postgres=# select * from pg_ts_config;
cfgname | cfgnamespace | cfgowner | cfgparser
------------------+--------------+----------+-----------
simple | 11 | 10 | 3722
danish | 11 | 10 | 3722
dutch | 11 | 10 | 3722
english | 11 | 10 | 3722
finnish | 11 | 10 | 3722
french | 11 | 10 | 3722
german | 11 | 10 | 3722
hungarian | 11 | 10 | 3722
italian | 11 | 10 | 3722
norwegian | 11 | 10 | 3722
portuguese | 11 | 10 | 3722
romanian | 11 | 10 | 3722
russian | 11 | 10 | 3722
spanish | 11 | 10 | 3722
swedish | 11 | 10 | 3722
turkish | 11 | 10 | 3722
english_hunspell | 2200 | 10 | 3722 -- 新增
(17 rows)
解決複數,形容詞問題
postgres=# SELECT * FROM ts_debug(`english_hunspell`, `stories`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------+---------------------------------+------------------+---------
asciiword | Word, all ASCII | stories | {english_hunspell,english_stem} | english_hunspell | {story}
(1 row)
postgres=# SELECT * FROM ts_debug(`english_hunspell`, `large`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------------------------+------------------+---------
asciiword | Word, all ASCII | large | {english_hunspell,english_stem} | english_hunspell | {large}
(1 row)
postgres=# SELECT * FROM ts_debug(`english_hunspell`, `larger`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+--------+---------------------------------+------------------+---------
asciiword | Word, all ASCII | larger | {english_hunspell,english_stem} | english_hunspell | {large}
(1 row)
postgres=# SELECT * FROM ts_debug(`english_hunspell`, `largest`);
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+---------+---------------------------------+------------------+---------
asciiword | Word, all ASCII | largest | {english_hunspell,english_stem} | english_hunspell | {large}
(1 row)
一個小的外掛,反映的是PostgreSQL社群生態,以及PG社群圈子熱衷貢獻的精神。還有很多很多這樣的例子,在程式實現要花不少時間的問題,可能在PG圈就能找到外掛幫你解決。快來用PG吧。
相關文章
- java 英文單詞拼寫糾正框架(Word Checker)Java框架
- 04、形容詞
- 英語語法(4) 形容詞和副詞之一
- 英語語法(5) 形容詞和副詞之二
- 文字挖掘之語料庫、分詞、詞頻統計分詞
- 基於海量詞庫的單詞拼寫檢查、推薦到底是咋做的?
- 技術類書籍中的形容詞
- 資料庫物件、表空間、表、檢視、索引、同義詞序列等的字典資料庫物件索引
- HanLP分詞工具中的ViterbiSegment分詞流程HanLP分詞Viterbi
- 分詞分詞
- 金山詞霸取詞pdf檔案問題
- js劃詞翻譯、螢幕取詞,取詞位置獲取(真正解決定位問題)JS
- PHPstrom 取消單詞拼寫錯誤的提示PHP
- 【前端詞典】滾動穿透問題的解決方案前端穿透
- 用 Python 做單詞拼寫檢查Python
- 中文分詞原理及常用Python中文分詞庫介紹中文分詞Python
- SQLServer訪問Oracle(通過同義詞-檢視-資料字典)出現的問題SQLServerOracle
- 大資料語義分析:靈玖中文分詞的分詞處理大資料中文分詞
- 已知詞頻生成詞雲圖(資料庫到生成詞雲)--generate_from_frequencies(WordCloud)資料庫Cloud
- 分詞問題;及最大乘積分析分詞
- 最少插入字元;及分詞問題分析字元分詞
- [教程三] 寫一個搜尋:自定義詞庫;使用 Laravel Scout,Elasticsearch,ik 分詞LaravelElasticsearch分詞
- 古詩詞相關詞庫
- Elasticsearch 近義詞詞庫配置Elasticsearch
- #Elasticsearch中文分詞器 #IK分詞器 @FDDLCElasticsearch中文分詞
- IKAnalyzer 中文分詞的不同版本切詞方式中文分詞
- day88-ElasticSearch-分詞- 自定義擴充套件詞庫Elasticsearch分詞套件
- elasticsearch之ik分詞器和自定義詞庫實現Elasticsearch分詞
- 分詞-1分詞
- 物件導向與函數語言程式設計的區別: 動詞-名詞與名詞-動詞的區別 - simblob物件函數程式設計
- 將字串中的每個單詞順序進行顛倒,單詞還是原來的單詞,字母順序不發生變化字串
- Python分詞模組推薦:jieba中文分詞PythonJieba中文分詞
- HanLP-實詞分詞器詳解HanLP分詞
- python分詞和生成詞雲圖Python分詞
- jieba 詞性標註 & 並行分詞Jieba詞性標註並行分詞
- 使用cjieba(結巴分詞庫)實現php擴充套件中文分詞JiebaPHP套件中文分詞
- 分詞工具Hanlp基於感知機的中文分詞框架HanLP中文分詞框架
- 二叉搜尋樹應用-判斷一個單詞是否拼寫正確,實現簡單字典