Reposted from: https://my.oschina.net/apdplat/blog/228619#OSC_h4_8

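word: a distributed Chinese word segmentation component for Java

word is a distributed Chinese word segmentation component implemented in Java. It provides several dictionary-based segmentation algorithms and uses an ngram model to resolve ambiguity. It accurately recognizes English words, numbers, and quantity expressions such as dates and times, and it can recognize out-of-vocabulary words such as person names, place names, and organization names. Its behavior can be changed through a custom configuration file. It supports custom user dictionaries, automatic detection of dictionary changes, and large-scale distributed environments. You can choose among the segmentation algorithms, use the refine feature to fine-tune segmentation results, and apply part-of-speech tagging, synonym tagging, antonym tagging, and pinyin tagging. It also integrates seamlessly with Lucene, Solr, ElasticSearch, and Luke. Note: word 1.3 requires JDK 1.8.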

Maven dependency:

Specify the dependency in pom.xml. Available versions are 1.0, 1.1, and 1.2:

<dependencies>
    <dependency>
        <groupId>org.apdplat</groupId>
        <artifactId>word</artifactId>
        <version>1.2</version>
    </dependency>
</dependencies>

Usage:

1. Quick start

Run the script demo-word.bat in the project root directory to try out segmentation quickly.

Usage: command [text] [input] [output]

The command can be one of: demo, text, file

demo

text 楊尚川是APDPlat應用級產品開發平臺的作者

file d:/text.txt d:/word.txt

exit

2. Segmenting text

Remove stop words: List<Word> words = WordSegmenter.seg("楊尚川是APDPlat應用級產品開發平臺的作者");

Keep stop words: List<Word> words = WordSegmenter.segWithStopWords("楊尚川是APDPlat應用級產品開發平臺的作者");

System.out.println(words);

Output:

Stop words removed: [楊尚川, apdplat, 應用級, 產品, 開發平臺, 作者]

Stop words kept: [楊尚川, 是, apdplat, 應用級, 產品, 開發平臺, 的, 作者]

3. Segmenting a file

String input = "d:/text.txt";
String output = "d:/word.txt";

Remove stop words: WordSegmenter.seg(new File(input), new File(output));

Keep stop words: WordSegmenter.segWithStopWords(new File(input), new File(output));

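4. Custom configuration file

The default configuration file is word.conf on the classpath, packaged inside word-x.x.jar.

The custom configuration file is word.local.conf on the classpath, which the user must provide.

If a custom setting has the same key as a default setting, the custom setting overrides the default.

Configuration files are encoded in UTF-8.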
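The override rule can be pictured as loading two key=value files in order, so that later keys replace earlier ones. The following is a minimal sketch using java.util.Properties, not the library's own loader. The class name ConfOverlaySketch and the inline file contents are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class ConfOverlaySketch {
    // Load the defaults first, then the local file: a key that appears in
    // both ends up with the value from the local file.
    static Properties merge(String defaultConf, String localConf) {
        Properties merged = new Properties();
        try {
            merged.load(new StringReader(defaultConf));
            merged.load(new StringReader(localConf));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical contents standing in for word.conf and word.local.conf.
        String defaults = "dic.path=classpath:dic.txt\n"
                + "stopwords.path=classpath:stopwords.txt\n";
        String local = "dic.path=classpath:dic.txt,d:/custom_dic\n";

        Properties conf = merge(defaults, local);
        System.out.println("dic.path = " + conf.getProperty("dic.path"));
        System.out.println("stopwords.path = " + conf.getProperty("stopwords.path"));
    }
}
```

Settings that are not mentioned in the local file keep their default values, which is why word.local.conf only needs to list the keys you want to change.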

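5. Custom user dictionaries

A custom user dictionary is one or more directories or files, given as absolute or relative paths.

A user dictionary consists of multiple dictionary files encoded in UTF-8.

Each dictionary file is a plain text file with one word per line.

The paths can be specified through a system property or a configuration file. Multiple paths are separated by commas.

Dictionary files on the classpath need the prefix classpath: in front of their relative path.

There are three ways to specify the paths:

Option 1, programmatically (high priority):

WordConfTools.set("dic.path", "classpath:dic.txt,d:/custom_dic");
DictionaryFactory.reload(); // reload the dictionary after changing the dictionary path

Option 2, JVM startup argument (medium priority):

java -Ddic.path=classpath:dic.txt,d:/custom_dic

Option 3, configuration file (low priority):

Put the setting in the file word.local.conf on the classpath:

dic.path=classpath:dic.txt,d:/custom_dic

If nothing is specified, the dictionary file dic.txt on the classpath is used by default.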
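The priority order and the comma-separated path format can be sketched as follows. This only illustrates the rules described above, with hypothetical names (DicPathSketch, resolve, split). It is not how the library itself resolves dic.path.

```java
import java.util.ArrayList;
import java.util.List;

class DicPathSketch {
    // The highest-priority source that is present wins; if none is present,
    // fall back to the documented default, classpath:dic.txt.
    static String resolve(String programmatic, String jvmProperty, String confFile) {
        if (programmatic != null) return programmatic;
        if (jvmProperty != null) return jvmProperty;
        if (confFile != null) return confFile;
        return "classpath:dic.txt";
    }

    // Split a comma-separated path list, ignoring blanks around each entry.
    static List<String> split(String dicPath) {
        List<String> paths = new ArrayList<>();
        for (String part : dicPath.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) paths.add(trimmed);
        }
        return paths;
    }

    public static void main(String[] args) {
        // No programmatic value; the JVM property is read from -Ddic.path if given.
        String chosen = resolve(null, System.getProperty("dic.path"),
                "classpath:dic.txt,d:/custom_dic");
        for (String path : split(chosen)) {
            boolean onClasspath = path.startsWith("classpath:");
            System.out.println((onClasspath ? "classpath resource: " : "file system path: ") + path);
        }
    }
}
```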

6. Custom stop word dictionaries

These work the same way as custom user dictionaries. The setting is:

stopwords.path=classpath:stopwords.txt,d:/custom_stopwords_dic

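7. Automatic detection of dictionary changes

Changes to custom user dictionaries and custom stop word dictionaries are detected automatically.

This covers files and directories on the classpath as well as absolute and relative paths outside the classpath.

For example:

classpath:dic.txt,classpath:custom_dic_dir,
d:/dic_more.txt,d:/DIC_DIR,D:/DIC2_DIR,my_dic_dir,my_dic_file.txt

classpath:stopwords.txt,classpath:custom_stopwords_dic_dir,
d:/stopwords_more.txt,d:/STOPWORDS_DIR,d:/STOPWORDS2_DIR,stopwords_dir,remove.txt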

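8. Choosing a segmentation algorithm explicitly

When segmenting text, you can name a specific segmentation algorithm, for example:

WordSegmenter.seg("APDPlat應用級產品開發平臺", SegmentationAlgorithm.BidirectionalMaximumMatching);

The possible values of SegmentationAlgorithm are:

Forward maximum matching: MaximumMatching

Reverse maximum matching: ReverseMaximumMatching

Forward minimum matching: MinimumMatching

Reverse minimum matching: ReverseMinimumMatching

Bidirectional maximum matching: BidirectionalMaximumMatching

Bidirectional minimum matching: BidirectionalMinimumMatching

Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

Full segmentation: FullSegmentation

Minimal word count: MinimalWordCount

Maximum ngram score: MaxNgramScore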
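To give a feel for what a dictionary-based algorithm does, here is a minimal textbook sketch of forward maximum matching. It is not the implementation used by word, and the class name, the tiny dictionary, and the maxWordLength parameter are all assumptions made for illustration. To keep the source ASCII-only, romanized pinyin stands in for the Chinese characters: "yingyongjichanpinkaifapingtai" represents 應用級產品開發平臺.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMaxMatchSketch {
    // Scan left to right. At each position take the longest dictionary word
    // (up to maxWordLength characters); if nothing matches, emit one character.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the candidate from the right until it is a dictionary
            // word or only a single character is left.
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: yingyongji, chanpin, kaifa, kaifapingtai, pingtai.
        Set<String> dictionary = Set.of("yingyongji", "chanpin", "kaifa", "kaifapingtai", "pingtai");
        List<String> words = segment("yingyongjichanpinkaifapingtai", dictionary, 12);
        System.out.println(words); // [yingyongji, chanpin, kaifapingtai]
    }
}
```

Because the longest match is preferred, "kaifapingtai" wins over the shorter entries "kaifa" and "pingtai". Reverse matching scans from the end of the text instead, and the minimum-matching variants prefer the shortest dictionary word.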

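9. Evaluating segmentation quality

Run the script evaluation.bat in the project root directory to evaluate segmentation quality.

The test text used for the evaluation has 2,533,709 lines and 28,374,490 characters in total.

The evaluation results are written to the target/evaluation directory:

corpus-text.txt is the manually annotated, segmented text, with words separated by spaces.

test-text.txt is the test text, produced by splitting corpus-text.txt into lines at punctuation marks.

standard-text.txt is the manually annotated text that corresponds to the test text. It is the standard for judging whether a segmentation is correct.

result-text-***.txt, where *** is the name of a segmentation algorithm, is the segmentation output of word.

perfect-result-***.txt, where *** is the name of a segmentation algorithm, contains the text whose segmentation matches the manual annotation exactly.

wrong-result-***.txt, where *** is the name of a segmentation algorithm, contains the text whose segmentation differs from the manual annotation.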
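The split into perfect and wrong results amounts to a line-by-line comparison between a result file and the standard file. The sketch below shows that comparison on in-memory lines. It is an assumption about the general idea, not the code behind evaluation.bat, and the sample lines (in romanized pinyin) are made up.

```java
import java.util.List;

class EvaluationSketch {
    // Count the lines whose segmentation is identical to the standard line.
    static int countPerfectLines(List<String> result, List<String> standard) {
        if (result.size() != standard.size()) {
            throw new IllegalArgumentException("result and standard must have the same number of lines");
        }
        int perfect = 0;
        for (int i = 0; i < result.size(); i++) {
            if (result.get(i).trim().equals(standard.get(i).trim())) {
                perfect++;
            }
        }
        return perfect;
    }

    public static void main(String[] args) {
        // Words are separated by spaces, as in corpus-text.txt.
        List<String> standard = List.of("woguo gongren jieji", "tuanjie zai dangzhongyang zhouwei");
        List<String> result = List.of("woguo gongrenjieji", "tuanjie zai dangzhongyang zhouwei");

        int perfect = countPerfectLines(result, standard);
        int wrong = result.size() - perfect;
        System.out.println(perfect + " perfect line(s), " + wrong + " wrong line(s)");
    }
}
```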

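10. Distributed Chinese word segmenter

1. In the custom configuration file word.conf or word.local.conf, point every *.path setting to an HTTP resource, and also set the redis.* settings.

2. Configure and start a web server that serves the HTTP resources: deploy the project https://github.com/ysc/word_web to Tomcat.

3. Configure and start a Redis server.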

11. Part-of-speech tagging (available from version 1.3 only)

Pass the segmentation result to the process method of the PartOfSpeechTagging class. The part of speech is stored in the partOfSpeech field of the Word class.

For example:

List<Word> words = WordSegmenter.segWithStopWords("我愛中國");
System.out.println("Before tagging: " + words);
// part-of-speech tagging
PartOfSpeechTagging.process(words);
System.out.println("After tagging: " + words);

Output:

Before tagging: [我, 愛, 中國]

After tagging: [我/r, 愛/v, 中國/ns]

12. refine

Consider this segmentation example:

List<Word> words = WordSegmenter.segWithStopWords("我國工人階級和廣大勞動群眾要更加緊密地團結在黨中央周圍");
System.out.println(words);

The result is:

[我國, 工人階級, 和, 廣大, 勞動群眾, 要, 更加, 緊密, 地, 團結, 在, 黨中央, 周圍]

Suppose the segmentation we want is:

[我國, 工人, 階級, 和, 廣大, 勞動, 群眾, 要, 更加, 緊密, 地, 團結, 在, 黨中央, 周圍]

That is, "工人階級" should be split into "工人 階級" and "勞動群眾" into "勞動 群眾". How do we do that?

Add the following lines to the file named by the word.refine.path setting, classpath:word_refine.txt:

工人階級=工人 階級
勞動群眾=勞動 群眾

Then refine the segmentation result:

words = WordRefiner.refine(words);
System.out.println(words);

This gives the result we want:

[我國, 工人, 階級, 和, 廣大, 勞動, 群眾, 要, 更加, 緊密, 地, 團結, 在, 黨中央, 周圍]

Here is another segmentation example:

List<Word> words = WordSegmenter.segWithStopWords("在實現“兩個一百年”奮鬥目標的偉大征程上再創新的業績");
System.out.println(words);

The result is:

[在, 實現, 兩個, 一百年, 奮鬥目標, 的, 偉大, 征程, 上, 再創, 新的, 業績]

Suppose the segmentation we want is:

[在, 實現, 兩個一百年, 奮鬥目標, 的, 偉大征程, 上, 再創, 新的, 業績]

That is, "兩個 一百年" should be merged into "兩個一百年" and "偉大 征程" into "偉大征程". How do we do that?

Add the following lines to the file named by the word.refine.path setting, classpath:word_refine.txt:

兩個 一百年=兩個一百年
偉大 征程=偉大征程

Then refine the segmentation result:

words = WordRefiner.refine(words);
System.out.println(words);

This gives the result we want:

[在, 實現, 兩個一百年, 奮鬥目標, 的, 偉大征程, 上, 再創, 新的, 業績]
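Both kinds of rule, splitting one word and merging several, can be read as "replace the space-joined tokens on the left with the space-separated tokens on the right". The sketch below applies rules under that reading. It is an illustration only and does not reproduce WordRefiner. The class name, the maxSpan parameter, and the pinyin stand-ins (gongrenjieji for 工人階級, weida zhengcheng for 偉大 征程) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class RefineSketch {
    // At each position, try the longest run of tokens (up to maxSpan) whose
    // space-joined form is a rule key; replace it with the rule's tokens.
    static List<String> refine(List<String> tokens, Map<String, String> rules, int maxSpan) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            for (int span = Math.min(maxSpan, tokens.size() - i); span >= 1 && !matched; span--) {
                String key = String.join(" ", tokens.subList(i, i + span));
                String value = rules.get(key);
                if (value != null) {
                    out.addAll(Arrays.asList(value.split(" ")));
                    i += span;
                    matched = true;
                }
            }
            if (!matched) {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One split rule and one merge rule, in the same key=value shape
        // as the lines in word_refine.txt.
        Map<String, String> rules = Map.of(
                "gongrenjieji", "gongren jieji",
                "weida zhengcheng", "weidazhengcheng");
        List<String> tokens = List.of("woguo", "gongrenjieji", "weida", "zhengcheng");
        System.out.println(refine(tokens, rules, 2)); // [woguo, gongren, jieji, weidazhengcheng]
    }
}
```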

13. Synonym tagging

List<Word> words = WordSegmenter.segWithStopWords("楚離陌千方百計為無情找回記憶");
System.out.println(words);

The result is:

[楚離陌, 千方百計, 為, 無情, 找回, 記憶]

Apply synonym tagging:

SynonymTagging.process(words);
System.out.println(words);

The result is:

[楚離陌, 千方百計[久有存心, 化盡心血, 想方設法, 費盡心機], 為, 無情, 找回, 記憶[影像]]

With indirect synonyms enabled:

SynonymTagging.process(words, false);
System.out.println(words);

The result is:

[楚離陌, 千方百計[久有存心, 化盡心血, 想方設法, 費盡心機], 為, 無情, 找回, 記憶[影像, 影像]]

List<Word> words = WordSegmenter.segWithStopWords("手勁大的老人往往更長壽");
System.out.println(words);

The result is:

[手勁, 大, 的, 老人, 往往, 更, 長壽]

Apply synonym tagging:

SynonymTagging.process(words);
System.out.println(words);

The result is:

[手勁, 大, 的, 老人[白叟], 往往[常常, 每每, 經常], 更, 長壽[長命, 龜齡]]

With indirect synonyms enabled:

SynonymTagging.process(words, false);
System.out.println(words);

The result is:

[手勁, 大, 的, 老人[白叟], 往往[一樣平常, 一般, 凡是, 尋常, 常常, 常日, 平凡, 平居, 平常, 平日, 平時, 往常, 日常, 日常平凡, 時常, 普通, 每每, 泛泛, 素日, 經常, 通俗, 通常], 更, 長壽[長命, 龜齡]]

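Taking the word "千方百計" as an example:

The synonyms can be obtained through the getSynonym() method of Word:

System.out.println(word.getSynonym());

The result is:

[久有存心, 化盡心血, 想方設法, 費盡心機]

Note: if a word has no synonyms, getSynonym() returns an empty collection: Collections.emptyList()

The difference between indirect and direct synonyms is as follows.

Suppose:

A and B are synonyms, A and C are synonyms, B and D are synonyms, and C and E are synonyms.

Then:

For A, the direct synonyms are A B C

For B, the direct synonyms are A B D

For C, the direct synonyms are A C E

For A, B, and C, the indirect synonyms are A B C D E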
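One way to read this definition is as a graph: each synonym pair is an edge, direct synonyms are a word plus its immediate neighbours, and indirect synonyms are everything reachable through a chain of pairs. The sketch below works through the A to E example under that reading. The transitive-closure interpretation and the class name SynonymSketch are assumptions. This is not the code behind SynonymTagging.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SynonymSketch {
    private final Map<String, Set<String>> graph = new HashMap<>();

    // Register a symmetric synonym pair.
    void addPair(String a, String b) {
        graph.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // The word itself plus the words it is paired with directly.
    Set<String> direct(String word) {
        Set<String> result = new TreeSet<>(graph.getOrDefault(word, Set.of()));
        result.add(word);
        return result;
    }

    // Everything reachable through a chain of pairs (breadth-first search).
    Set<String> indirect(String word) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(word);
        queue.add(word);
        while (!queue.isEmpty()) {
            for (String next : graph.getOrDefault(queue.poll(), Set.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        SynonymSketch synonyms = new SynonymSketch();
        synonyms.addPair("A", "B");
        synonyms.addPair("A", "C");
        synonyms.addPair("B", "D");
        synonyms.addPair("C", "E");
        System.out.println(synonyms.direct("A"));   // [A, B, C]
        System.out.println(synonyms.direct("B"));   // [A, B, D]
        System.out.println(synonyms.indirect("A")); // [A, B, C, D, E]
    }
}
```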

14. Antonym tagging

List<Word> words = WordSegmenter.segWithStopWords("5月初有哪些電影值得觀看");
System.out.println(words);

The result is:

[5, 月初, 有, 哪些, 電影, 值得, 觀看]

Apply antonym tagging:

AntonymTagging.process(words);
System.out.println(words);

The result is:

[5, 月初[月底, 月末, 月終], 有, 哪些, 電影, 值得, 觀看]

List<Word> words = WordSegmenter.segWithStopWords("由於工作不到位、服務不完善導致顧客在用餐時發生不愉快的事情,餐廳方面應該向顧客作出真誠的道歉,而不是敷衍了事。");
System.out.println(words);

The result is:

[由於, 工作, 不到位, 服務, 不完善, 導致, 顧客, 在, 用餐, 時, 發生, 不愉快, 的, 事情, 餐廳, 方面, 應該, 向, 顧客, 作出, 真誠, 的, 道歉, 而不是, 敷衍了事]

Apply antonym tagging:

AntonymTagging.process(words);
System.out.println(words);

The result is:

[由於, 工作, 不到位, 服務, 不完善, 導致, 顧客, 在, 用餐, 時, 發生, 不愉快, 的, 事情, 餐廳, 方面, 應該, 向, 顧客, 作出, 真誠[糊弄, 虛偽, 虛假, 險詐], 的, 道歉, 而不是, 敷衍了事[一絲不苟, 兢兢業業, 盡心竭力, 竭盡全力, 精益求精, 誠心誠意]]

Taking the word "月初" as an example:

The antonyms can be obtained through the getAntonym() method of Word:

System.out.println(word.getAntonym());

The result is:

[月底, 月末, 月終]

Note: if a word has no antonyms, getAntonym() returns an empty collection: Collections.emptyList()

15. Pinyin tagging

List<Word> words = WordSegmenter.segWithStopWords("《速度與激情7》的中國內地票房自4月12日上映以來，在短短兩週內突破20億人民幣");
System.out.println(words);

The result is:

[速度, 與, 激情, 7, 的, 中國, 內地, 票房, 自, 4月, 12日, 上映, 以來, 在, 短短, 兩週, 內, 突破, 20億, 人民幣]

Apply pinyin tagging:

PinyinTagging.process(words);
System.out.println(words);

The result is:

[速度 sd sudu, 與 y yu, 激情 jq jiqing, 7, 的 d de, 中國 zg zhongguo, 內地 nd neidi, 票房 pf piaofang, 自 z zi, 4月, 12日, 上映 sy shangying, 以來 yl yilai, 在 z zai, 短短 dd duanduan, 兩週 lz liangzhou, 內 n nei, 突破 tp tupo, 20億, 人民幣 rmb renminbi]

Taking the word "速度" as an example:

The full pinyin can be obtained through the getFullPinYin() method of Word, for example: sudu

The acronym pinyin (first letters only) can be obtained through the getAcronymPinYin() method of Word, for example: sd

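16. Lucene plugin:

1. Construct a word analyzer, ChineseWordAnalyzer

Analyzer analyzer = new ChineseWordAnalyzer();

To use a specific segmentation algorithm, pass it to the constructor:

Analyzer analyzer = new ChineseWordAnalyzer(SegmentationAlgorithm.FullSegmentation);

If none is specified, bidirectional maximum matching is used by default: SegmentationAlgorithm.BidirectionalMaximumMatching

For the available segmentation algorithms, see the enum class SegmentationAlgorithm.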

2. Segment text with the word analyzer

TokenStream tokenStream = analyzer.tokenStream("text", "楊尚川是APDPlat應用級產品開發平臺的作者");
// prepare to consume
tokenStream.reset();
// start consuming
while (tokenStream.incrementToken()) {
    // the term
    CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
    // start and end offsets of the term in the text
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    // position of the term
    PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // part of speech
    PartOfSpeechAttribute partOfSpeechAttribute = tokenStream.getAttribute(PartOfSpeechAttribute.class);
    // acronym pinyin (first letters)
    AcronymPinyinAttribute acronymPinyinAttribute = tokenStream.getAttribute(AcronymPinyinAttribute.class);
    // full pinyin
    FullPinyinAttribute fullPinyinAttribute = tokenStream.getAttribute(FullPinyinAttribute.class);
    // synonyms
    SynonymAttribute synonymAttribute = tokenStream.getAttribute(SynonymAttribute.class);
    // antonyms
    AntonymAttribute antonymAttribute = tokenStream.getAttribute(AntonymAttribute.class);

    LOGGER.info(charTermAttribute.toString() + " (" + offsetAttribute.startOffset() + " - " + offsetAttribute.endOffset() + ") " + positionIncrementAttribute.getPositionIncrement());
    LOGGER.info("PartOfSpeech:" + partOfSpeechAttribute.toString());
    LOGGER.info("AcronymPinyin:" + acronymPinyinAttribute.toString());
    LOGGER.info("FullPinyin:" + fullPinyinAttribute.toString());
    LOGGER.info("Synonym:" + synonymAttribute.toString());
    LOGGER.info("Antonym:" + antonymAttribute.toString());
}
// done consuming
tokenStream.close();

3. Build a Lucene index with the word analyzer

Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter indexWriter = new IndexWriter(directory, config);

4. Query a Lucene index with the word analyzer

QueryParser queryParser = new QueryParser("text", analyzer);
Query query = queryParser.parse("text:楊尚川");
TopDocs docs = indexSearcher.search(query, Integer.MAX_VALUE);

17. Solr plugin:

1. Download word-1.3.jar

Download URL: http://search.maven.org/remotecontent?filepath=org/apdplat/word/1.3/word-1.3.jar

2. Create the directory solr-5.1.0/example/solr/lib and copy word-1.3.jar into it

3. Configure the schema to use the tokenizer

In the file solr-5.1.0/example/solr/collection1/conf/schema.xml, replace every

<tokenizer class="solr.WhitespaceTokenizerFactory"/> and

<tokenizer class="solr.StandardTokenizerFactory"/> with

<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>

and remove all filter tags

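4. To use a specific segmentation algorithm:

<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"/>

The possible values of segAlgorithm are:

Forward maximum matching: MaximumMatching

Reverse maximum matching: ReverseMaximumMatching

Forward minimum matching: MinimumMatching

Reverse minimum matching: ReverseMinimumMatching

Bidirectional maximum matching: BidirectionalMaximumMatching

Bidirectional minimum matching: BidirectionalMinimumMatching

Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

Full segmentation: FullSegmentation

Minimal word count: MinimalWordCount

Maximum ngram score: MaxNgramScore

If none is specified, bidirectional maximum matching is used by default: BidirectionalMaximumMatching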

5. To use a specific configuration file:

<tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" segAlgorithm="ReverseMinimumMatching"
           conf="solr-5.1.0/example/solr/nutch/conf/word.local.conf"/>

For the settings that can go into word.local.conf, see the word.conf file inside word-1.3.jar.

If none is specified, the default configuration file is used, which is the word.conf file inside word-1.3.jar.

18. ElasticSearch plugin:

1. Open a command line and change to the bin directory of elasticsearch

cd elasticsearch-1.5.1/bin

2. Run the plugin script to install the word segmentation plugin:

./plugin -u http://apdplat.org/word/archive/v1.2.zip -i word

3. Edit the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following settings:

index.analysis.analyzer.default.type : "word"
index.analysis.tokenizer.default.type : "word"

4. Start ElasticSearch and test the result by opening this URL in Chrome:

http://localhost:9200/_analyze?analyzer=word&text=楊尚川是APDPlat應用級產品開發平臺的作者

5. Custom configuration

Edit the configuration file elasticsearch-1.5.1/plugins/word/word.local.conf

6. Choosing a segmentation algorithm

Edit the file elasticsearch-1.5.1/config/elasticsearch.yml and add the following settings:

index.analysis.analyzer.default.segAlgorithm : "ReverseMinimumMatching"
index.analysis.tokenizer.default.segAlgorithm : "ReverseMinimumMatching"

The possible values of segAlgorithm here are:

Forward maximum matching: MaximumMatching

Reverse maximum matching: ReverseMaximumMatching

Forward minimum matching: MinimumMatching

Reverse minimum matching: ReverseMinimumMatching

Bidirectional maximum matching: BidirectionalMaximumMatching

Bidirectional minimum matching: BidirectionalMinimumMatching

Bidirectional maximum-minimum matching: BidirectionalMaximumMinimumMatching

Full segmentation: FullSegmentation

Minimal word count: MinimalWordCount

Maximum ngram score: MaxNgramScore

If none is specified, bidirectional maximum matching is used by default: BidirectionalMaximumMatching

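19. Luke plugin:

1. Download Luke (the download site is not accessible from mainland China)

2. Download and unzip the Java Chinese word segmentation component word-1.0-bin.zip:

3. Extract the 4 jar files in the word-1.0-bin/word-1.0 folder of the unzipped component into the current folder. Then open lukeall-4.0.0-ALPHA.jar with an archive tool such as WinRAR, and drag every file in the current folder into lukeall-4.0.0-ALPHA.jar, except the META-INF folder and the .jar, .bat, .html, and word.local.conf files.

4. Run the command java -jar lukeall-4.0.0-ALPHA.jar to start Luke. In the Analysis section of the Search tab you can now select the org.apdplat.word.lucene.ChineseWordAnalyzer analyzer.

5. The org.apdplat.word.lucene.ChineseWordAnalyzer analyzer can also be selected under "Available analyzers found on the current classpath" in the Plugins tab.

Note: to integrate a different version of the word segmenter yourself, run mvn install in the project root directory to build the project, then run mvn dependency:copy-dependencies to copy the dependency jars. All dependency jars will then be in the target/dependency/ directory. Among them, target/dependency/slf4j-api-1.6.4.jar is the logging framework used by the word segmenter, and target/dependency/logback-classic-0.9.28.jar and target/dependency/logback-core-0.9.28.jar are the recommended logging implementation. The configuration file of the logging implementation is target/classes/logback.xml, and target/word-1.3.jar is the main jar of the word segmenter. To use a custom dictionary, edit the segmenter configuration file target/classes/word.conf.

Pre-integrated Luke plugin download (for Lucene 4.0.0):

Pre-integrated Luke plugin download (for Lucene 4.10.3):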

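20. Word vectors:

For each word, the words that appear in its context are counted over a large corpus, and the vector formed by these context words is used to represent the word.

The similarity between two words is then obtained by computing the similarity of their word vectors.

This rests on the assumption that the more similar the context words of two words are, the more similar the two words are.

Run the script demo-word-vector-corpus.bat in the project root directory to try this out on the corpus bundled with the word project.

If you have your own text, use the script demo-word-vector-file.bat to segment it, build word vectors, and compute similarities.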
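The idea can be sketched in a few lines: count the context words around each occurrence of a target word, then compare two such count vectors. The sketch below uses a fixed context window and cosine similarity. Both are assumptions chosen for illustration, since the text above does not say which window or similarity measure the demo scripts use. The toy corpus is already segmented and written in romanized pinyin.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordVectorSketch {
    // Count the words that appear within `window` positions of the target.
    static Map<String, Integer> contextVector(List<List<String>> sentences, String target, int window) {
        Map<String, Integer> vector = new HashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                if (!sentence.get(i).equals(target)) continue;
                int from = Math.max(0, i - window);
                int to = Math.min(sentence.size() - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) vector.merge(sentence.get(j), 1, Integer::sum);
                }
            }
        }
        return vector;
    }

    // Cosine similarity of two sparse count vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            normA += entry.getValue() * entry.getValue();
            Integer other = b.get(entry.getKey());
            if (other != null) dot += entry.getValue() * other;
        }
        for (int value : b.values()) normB += value * value;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("wo", "ai", "beijing"),
                List.of("wo", "ai", "shanghai"),
                List.of("ta", "qu", "beijing"));
        Map<String, Integer> beijing = contextVector(corpus, "beijing", 2);
        Map<String, Integer> shanghai = contextVector(corpus, "shanghai", 2);
        System.out.println("beijing context: " + beijing);
        System.out.println("shanghai context: " + shanghai);
        System.out.println("similarity: " + cosine(beijing, shanghai));
    }
}
```

Words that share many context words get a similarity close to 1, and words with no context words in common get 0.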
