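Until now I had been using the portable version of HanLP in my Spark environment. Its vocabulary is rather small, and I also wanted to add the jieba and scws dictionaries. Dictionaries from other segmenters such as ik and ansj_seg carry no part-of-speech tags, so they were left out. The main change this time is to package the dictionary directory data together with hanlp.properties into a single data.jar file.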

1. Configuring resource filtering in pom.xml

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>${maven-jar-plugin.version}</version>
    <configuration>
        <excludes>
            <exclude>**/*.properties</exclude>
        </excludes>
    </configuration>
</plugin>

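This removes properties files from the jar, so the resulting jar contains none. Decide according to your own needs whether properties files should go into the jar. Since I plan to ship hanlp.properties together with the dictionary directory, hanlp.properties has to be filtered out here.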

2. Modifying hanlp.properties

root=

# Set the root directory to empty, or comment out root

CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; scws.txt; jieba.txt; 現代漢語補充詞庫.txt; 全國地名大全.txt ns; 人名詞典.txt; 機構名詞典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;

# Add more dictionary files; here the jieba and scws dictionaries are added

#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter

IOAdapter=com.hankcs.hanlp.corpus.io.JarIOAdapter

# Switch the IOAdapter so that dictionaries are loaded from a jar
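To make the CustomDictionaryPath value easier to read, here is a rough sketch of how such a value might be expanded into individual resource paths. It assumes that an entry beginning with a space lives in the same directory as the previous entry, and that a trailing token such as ns or nrf is a default part-of-speech tag. The class CustomDictionaryPaths and its resolve method are hypothetical names used only for illustration; this is a simplified model, not HanLP's actual loading code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HanLP: a simplified model of how a
// CustomDictionaryPath value could be expanded into resource paths.
class CustomDictionaryPaths {

    /** Splits the property value on ';' and resolves each entry against root. */
    static List<String> resolve(String root, String value) {
        List<String> resolved = new ArrayList<>();
        String previousDir = root;
        for (String entry : value.split(";")) {
            if (entry.trim().isEmpty()) {
                continue; // skip empty segments
            }
            if (entry.startsWith(" ")) {
                // Assumption: a leading space means "same directory as the previous entry".
                resolved.add(previousDir + entry.trim());
            } else {
                String path = root + entry;
                int lastSlash = path.lastIndexOf('/');
                if (lastSlash != -1) {
                    // Remember the directory for following entries that start with a space.
                    previousDir = path.substring(0, lastSlash + 1);
                }
                resolved.add(path);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        String value = "data/dictionary/custom/CustomDictionary.txt; scws.txt; jieba.txt;"
                + "data/dictionary/person/nrf.txt nrf;";
        for (String path : resolve("", value)) {
            System.out.println(path);
        }
    }
}
```

With an empty root, every resolved path stays relative (data/dictionary/...), which is the form a class loader expects for a resource inside data.jar.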

3. Modifying HanLP.java

if ( root.length() != 0 && !root.endsWith("/")) root += "/";

When root is empty (length 0), no '/' is appended to it.
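The effect of this check can be seen in isolation with a small sketch. The class RootPath and the method normalize are hypothetical names; only the condition itself comes from the line above. The point is that an empty root must stay empty: if it became "/", every dictionary path would start with a slash, and ClassLoader.getResourceAsStream does not find resource names that begin with '/'.

```java
// Hypothetical helper for illustration; only the if-condition mirrors the change above.
class RootPath {

    /** Appends a trailing '/' to a non-empty root; leaves an empty root untouched. */
    static String normalize(String root) {
        if (root.length() != 0 && !root.endsWith("/")) {
            root += "/";
        }
        return root;
    }

    public static void main(String[] args) {
        // Empty root: the dictionary path stays relative, suitable for classpath lookup.
        System.out.println(normalize("") + "data/dictionary/custom/CustomDictionary.txt");
        // Non-empty root (an example directory): a separator is added when missing.
        System.out.println(normalize("/opt/hanlp") + "data/dictionary/custom/CustomDictionary.txt");
    }
}
```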

4. Adding the source file that handles the dictionary jar: JarIOAdapter.java

package com.hankcs.hanlp.corpus.io;

import java.io.*;

/**
 * IO adapter that loads resources from jars on the classpath
 *
 * @author hankcs
 */
public class JarIOAdapter implements IIOAdapter
{
    @Override
    public InputStream open(String path) throws FileNotFoundException
    {
        /*
         Loading through the system class loader (the commented-out line) fails in a
         distributed environment, so the class loader of this class is used instead.
         */
        //return ClassLoader.getSystemClassLoader().getResourceAsStream(path);
        return JarIOAdapter.class.getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws FileNotFoundException
    {
        return new FileOutputStream(path);
    }
}
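One thing worth knowing about getResourceAsStream is that it returns null rather than throwing when the resource is missing, so a wrong path only shows up later as a NullPointerException. A possible variant, sketched here with the hypothetical class ClasspathResources (not part of HanLP), turns that null into the FileNotFoundException that the open signature already declares.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of HanLP.
class ClasspathResources {

    /** Opens a classpath resource, failing early if it cannot be found. */
    static InputStream open(String path) throws FileNotFoundException {
        InputStream in = ClasspathResources.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            // getResourceAsStream signals a missing resource with null, not an exception.
            throw new FileNotFoundException("Resource not found on classpath: " + path);
        }
        return in;
    }

    public static void main(String[] args) {
        // Succeeds only when a jar or directory containing hanlp.properties is on the classpath.
        try (InputStream in = open("hanlp.properties")) {
            System.out.println("hanlp.properties found on the classpath");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```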

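Running DemoStopWord with

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoStopWord

raised an error. The cause is an interface mismatch: MDAG passes the file's absolute path to the IOAdapter, which a jar-based adapter cannot resolve as a classpath resource. The fix is to modify MDAG.java as follows:

public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));

That is the only change needed.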
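The difference between the two File methods is easy to see with a relative path. In the sketch below, the class ResourcePathDemo, the method toResourcePath and the example path are illustrative assumptions; getPath and getAbsolutePath are standard java.io.File methods.

```java
import java.io.File;

// Hypothetical demo class; the dictionary path is only an example.
class ResourcePathDemo {

    /** Returns the path as it was written, with '/' separators, usable as a resource name. */
    static String toResourcePath(File file) {
        return file.getPath().replace(File.separatorChar, '/');
    }

    public static void main(String[] args) {
        File f = new File("data/dictionary/stopwords.txt");
        // Stays relative, so a class loader can look it up inside a jar.
        System.out.println("getPath():         " + toResourcePath(f));
        // Prefixed with the current working directory, which does not exist inside a jar.
        System.out.println("getAbsolutePath(): " + f.getAbsolutePath());
    }
}
```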

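5. Packaging the dictionaries and the configuration file into one jar

It is best to convert the txt dictionaries into bin or dat files first and then build the jar; once everything is packaged, the bin or dat files can no longer be written at run time.

The simple way is to run the demos, which generates the corresponding bin or dat files.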

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoAtFirstSight

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoChineseNameRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoJapaneseNameRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoPinyin

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoPlaceRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoOrganizationRecognition

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoTokenizerConfig # named-entity recognition, covering the person names, place names, etc. above

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoTraditionalChinese2SimplifiedChinese

java -cp .:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoStopWord

Alternatively, the following shell script does the same:

:>a;while read cl; do echo $cl; echo "=========="$cl"=======" >>a;java -cp .:test.jar:hanlp-1.3.2.jar $cl 1>> a 2>&1;done < <(jar tvf test.jar | awk '$(NF)~"Demo"{print $(NF)}' | sed 's/.class$//;s/\//./g')

Put the data directory and the hanlp.properties file into one directory, say xxx:

cd xxx

jar cvf data.jar .

This produces data.jar.
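Because the jar is built from inside xxx, hanlp.properties and data/ should sit at the root of the archive rather than under an xxx/ prefix. A quick way to confirm the layout is a small check like the following. DataJarCheck and hasEntry are hypothetical names; the sketch relies only on the standard java.util.jar.JarFile class and assumes the jar is called data.jar unless another path is passed as an argument.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper for verifying the layout of the dictionary jar.
class DataJarCheck {

    /** Returns true if the jar contains an entry with exactly this name. */
    static boolean hasEntry(String jarPath, String entryName) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "data.jar";
        System.out.println("hanlp.properties at root: " + hasEntry(jarPath, "hanlp.properties"));
        System.out.println("data/ at root:            " + hasEntry(jarPath, "data/"));
    }
}
```

If either line prints false, the jar was probably created from the parent directory, and the class loader will not find the files under the names used in hanlp.properties.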

6. How to run

[dxp@Flyme-SearchTag-32-220 makeNewDict]$ ls

data.jar hanlp-1.3.2.jar README.md test test.jar

[dxp@Flyme-SearchTag-32-220 makeNewDict]$ java -cp data.jar:hanlp-1.3.2.jar:test.jar com.hankcs.demo.DemoAtFirstSight

7. Using it in Spark

In a Maven project in an IDE (such as IntelliJ IDEA), add the following dependency:

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>1.3.2</version>
    <scope>system</scope>
    <systemPath>${LocalPath}/hanlp-1.3.2.jar</systemPath>
</dependency>

When submitting the job with spark-submit, add

--jars hanlp-1.3.2.jar,data.jar
