ElasticSearch IK Hot-Word Dictionary: How Automatic Hot Reloading Works and a Golang Implementation

Published by Golang研究所 on 2021-10-15

Original URL: https://www.cnblogs.com/wishFreedom/p/15411847.html


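Hot reload overview

The ik analyzer can load its extension dictionary from a configuration file, or from a remote HTTP server.

Loading from a local file requires restarting ES before changes take effect, which is disruptive. So the dictionary is usually placed on a remote server. There are two main ways to do this:

1. Use Nginx: place a dic.txt file in one of the directories it serves. Updating that file is enough to hot-reload the dictionary without restarting ES. This is simple and needs no development, but it is not very flexible.
2. Write your own HTTP endpoint that returns the dictionary. Note: each line holds one word, so append a \n line break after every word in the HTTP body.

This article covers the second approach, the custom endpoint.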

How hot reloading works

Look at the ik analyzer source (org.wltea.analyzer.dic.Monitor):

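/**
	 * Monitoring flow:
	 *  ① Send a HEAD request to the dictionary server
	 *  ② Read the Last-Modified and ETag values from the response and check whether they changed
	 *  ③ If unchanged, sleep 1 min and go back to step ①
	 *  ④ If changed, reload the dictionary
	 *  ⑤ Sleep 1 min and go back to step ①
	 */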

	public void runUnprivileged() {
		// timeout settings
		RequestConfig rc = RequestConfig.custom().setConnectionRequestTimeout(10*1000)
				.setConnectTimeout(10*1000).setSocketTimeout(15*1000).build();

		HttpHead head = new HttpHead(location);
		head.setConfig(rc);

		// set the conditional request headers
		if (last_modified != null) {
			head.setHeader("If-Modified-Since", last_modified);
		}
		if (eTags != null) {
			head.setHeader("If-None-Match", eTags);
		}

		CloseableHttpResponse response = null;
		try {
			response = httpclient.execute(head);

			// only act on a 200 response
			if(response.getStatusLine().getStatusCode()==200){

				if (((response.getLastHeader("Last-Modified")!=null) && !response.getLastHeader("Last-Modified").getValue().equalsIgnoreCase(last_modified))
						||((response.getLastHeader("ETag")!=null) && !response.getLastHeader("ETag").getValue().equalsIgnoreCase(eTags))) {

					// the remote dictionary changed: reload it and update last_modified and eTags
					Dictionary.getSingleton().reLoadMainDict();
					last_modified = response.getLastHeader("Last-Modified")==null?null:response.getLastHeader("Last-Modified").getValue();
					eTags = response.getLastHeader("ETag")==null?null:response.getLastHeader("ETag").getValue();
				}
			}else if (response.getStatusLine().getStatusCode()==304) {
				// not modified, nothing to do
				//noop
			}else{
				logger.info("remote_ext_dict {} return bad code {}" , location , response.getStatusLine().getStatusCode() );
			}
		} catch (Exception e) {
			logger.error("remote_ext_dict {} error!",e , location);
		}finally{
			try {
				if (response != null) {
					response.close();
				}
			} catch (IOException e) {
				logger.error(e.getMessage(), e);
			}
		}
    }

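As we can see, once a minute the monitor:

1. Sends an HTTP HEAD request and reads Last-Modified and ETag (both are plain strings).
2. If either value has changed, goes on to send a GET request to fetch the dictionary contents.

So in Golang, the same URL has to handle both HEAD and GET requests.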
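
To make the change check concrete, here is a minimal Go sketch of the same decision the Java monitor makes. It is a port written for illustration, not part of ik: the names validators and needsReload are hypothetical, and an empty string stands in for both "header absent" and Java's null, which is a small simplification.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// validators holds the header values remembered from the previous poll.
// An empty string means nothing has been remembered yet.
type validators struct {
	lastModified string
	etag         string
}

// needsReload reports whether a HEAD response should trigger a dictionary
// reload. Only a 200 response counts, and a header that is present and
// differs (ignoring case) from the remembered value means "changed".
func needsReload(prev validators, status int, lastModified, etag string) bool {
	if status != http.StatusOK {
		return false // 304 and error codes never trigger a reload
	}
	lmChanged := lastModified != "" && !strings.EqualFold(lastModified, prev.lastModified)
	etagChanged := etag != "" && !strings.EqualFold(etag, prev.etag)
	return lmChanged || etagChanged
}

func main() {
	// First poll: nothing remembered, so any validator triggers a load.
	fmt.Println(needsReload(validators{}, 200, "2021-10-15 14:49:35", "DefaultTags")) // true

	prev := validators{lastModified: "2021-10-15 14:49:35", etag: "DefaultTags"}
	fmt.Println(needsReload(prev, 200, "2021-10-15 14:49:35", "DefaultTags")) // false
	fmt.Println(needsReload(prev, 200, "2021-10-15 15:10:00", "DefaultTags")) // true
	fmt.Println(needsReload(prev, 304, "", ""))                               // false
}
```

Because the comparison is a plain string comparison, the server is free to choose any format for the two values, as long as at least one of them changes whenever the dictionary changes.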

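HEAD format

The HEAD method is identical to GET except that the server does not return a message body in the response. The meta-information in the HTTP headers of a HEAD response should be the same as what a GET request would return. This lets a client obtain meta-information about the entity implied by the request without transferring the entity itself. It is also commonly used to test hyperlinks for validity, accessibility, and recent modification.

The response to a HEAD request may be cached, meaning the information in it may be used to update a previously cached entity. If the header values show that the current entity differs from the cached one (as indicated by a change in Content-Length, Content-MD5, ETag, or Last-Modified), the cache entry is treated as stale.

For the ik analyzer, an example server response looks like this: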

$ curl --head http://127.0.0.1:9800/es/steelDict
HTTP/1.1 200 OK
Etag: DefaultTags
Last-Modified: 2021-10-15 14:49:35
Date: Fri, 15 Oct 2021 07:23:15 GMT
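
The example response uses a constant ETag, so in practice only Last-Modified signals a change. As an optional alternative that the article does not use, the ETag could be derived from the dictionary contents so that it changes whenever the words change. The sketch below assumes a hypothetical helper named dictETag and hashes the words with SHA-256; the truncation to 8 bytes is an arbitrary choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// dictETag derives a quoted ETag value from the dictionary contents.
// Blank entries are ignored so they do not affect the result.
func dictETag(words []string) string {
	h := sha256.New()
	for _, w := range words {
		w = strings.TrimSpace(w)
		if w == "" {
			continue
		}
		h.Write([]byte(w))
		h.Write([]byte{'\n'})
	}
	sum := h.Sum(nil)
	// 8 bytes (16 hex characters) is plenty to detect a changed word list.
	return `"` + hex.EncodeToString(sum[:8]) + `"`
}

func main() {
	a := dictETag([]string{"圓鋼", "無縫管"})
	b := dictETag([]string{"圓鋼", "無縫管"})
	c := dictETag([]string{"圓鋼", "無縫管", "熱軋平板"})
	fmt.Println(a == b) // true: same words, same ETag
	fmt.Println(a == c) // false: a new word changes the ETag
}
```

With this approach the HEAD handler has to read the word list (or a cached hash of it) on every poll, which costs more than returning a stored timestamp.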

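GET format

When returning the dictionary, Content-Length and charset=UTF-8 must be present.
Only one of Last-Modified and ETag needs to change. Note that ik sends the GET request to fetch the contents only after a HEAD response shows that one of these two values has changed.
Each line holds one word; append the \n line break yourself.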

$ curl -i http://127.0.0.1:9800/es/steelDict
HTTP/1.1 200 OK
Content-Length: 130
Content-Type: text/html;charset=UTF-8
Etag: DefaultTags
Last-Modified: 2021-10-15 14:49:35
Date: Fri, 15 Oct 2021 07:37:47 GMT

裝飾管
裝飾板
圓鋼
無縫管
無縫方管
衛生級無縫管
衛生級焊管
熱軋中厚板
熱軋平板
熱軋卷平板
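
The body format and the Content-Length value above can be reproduced with a few lines of Go. This is a minimal sketch; buildDictBody is a hypothetical helper name, and the word list is the one shown in the response above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// buildDictBody joins the words into the response body: one word per line,
// each terminated by "\n". Empty entries are skipped.
func buildDictBody(words []string) string {
	var sb strings.Builder
	for _, w := range words {
		if w == "" {
			continue
		}
		sb.WriteString(w)
		sb.WriteByte('\n')
	}
	return sb.String()
}

func main() {
	words := []string{
		"裝飾管", "裝飾板", "圓鋼", "無縫管", "無縫方管",
		"衛生級無縫管", "衛生級焊管", "熱軋中厚板", "熱軋平板", "熱軋卷平板",
	}
	body := buildDictBody(words)
	// Content-Length counts bytes, not characters: each of these Chinese
	// characters takes 3 bytes in UTF-8, and each "\n" takes 1 byte.
	fmt.Println(len(body))                    // 130
	fmt.Println(utf8.RuneCountInString(body)) // 50
}
```

The byte count of 130 matches the Content-Length header in the response, which is why the header should be computed from the encoded body rather than from the number of characters.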

Implementation

Configuring the ES IK analyzer

# This example uses CentOS 7, with ES installed via rpm
$ vim /usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
# Change this line to point at our endpoint
<entry key="remote_ext_dict">http://10.16.52.52:9800/es/steelDict</entry>
$ systemctl restart elasticsearch # restart ES

# The log can be followed in real time here, which is convenient
$ tail -f /var/log/elasticsearch/my-application.log
[2021-10-15T15:02:31,448][INFO ][o.w.a.d.Monitor          ] [node-1] 獲取遠端詞典成功，總數為：0
[2021-10-15T15:02:31,952][INFO ][o.e.l.LicenseService     ] [node-1] license [3ca1dc7b-3722-40e5-916e-3b2093980b75] mode [basic] - valid
[2021-10-15T15:02:31,962][INFO ][o.e.g.GatewayService     ] [node-1] recovered [1] indices into cluster_state
[2021-10-15T15:02:32,812][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[steel-category-mapping][2]] ...]).
[2021-10-15T15:02:41,630][INFO ][o.w.a.d.Monitor          ] [node-1] 重新載入詞典...
[2021-10-15T15:02:41,631][INFO ][o.w.a.d.Monitor          ] [node-1] try load config from /etc/elasticsearch/analysis-ik/IKAnalyzer.cfg.xml
[2021-10-15T15:02:41,631][INFO ][o.w.a.d.Monitor          ] [node-1] try load config from /usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml
[2021-10-15T15:02:41,886][INFO ][o.w.a.d.Monitor          ] [node-1] [Dict Loading] http://10.16.52.52:9800/es/steelDict
[2021-10-15T15:02:43,958][INFO ][o.w.a.d.Monitor          ] [node-1] 獲取遠端詞典成功，總數為：0
[2021-10-15T15:02:43,959][INFO ][o.w.a.d.Monitor          ] [node-1] 重新載入詞典完畢...

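Golang endpoint

Assuming the gin framework, set up the routes:

const (
	kUrlSyncESIndex     = "/syncESIndex" // sync steel product names, materials, specs, origins, and warehouses into the ES index
	kUrlGetSteelHotDict = "/steelDict"   // fetch the steel dictionary (product, material, spec, origin, warehouse)
)

func InitRouter(router *gin.Engine) {
	// ...

	esRouter := router.Group("es")
	// Same URL for both methods: HEAD returns headers only and GET returns the body, which avoids wasting bandwidth
	esRouter.HEAD(kUrlGetSteelHotDict, onHttpGetSteelHotDictHead)
	esRouter.GET(kUrlGetSteelHotDict, onHttpGetSteelHotDict)

	// ...
}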

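Handling the HEAD request:

// onHttpGetSteelHotDictHead handles the HEAD request. ik triggers a GET to fetch the word list
// only when Last-Modified or ETag has changed. kDefaultTags is a constant ETag value defined
// elsewhere in the project.
func onHttpGetSteelHotDictHead(ctx *gin.Context) {
	t, err := biz.QueryEsLastSyncTime()
	if err != nil {
		// Reply with 500 rather than 200 so ik sees a bad status code instead of a valid response
		ctx.JSON(http.StatusInternalServerError, gin.H{
			"code": biz.StatusError,
			"msg":  "server internal error",
		})
		logger.Warn(err)
		return
	}
	ctx.Header("Last-Modified", t)
	ctx.Header("ETag", kDefaultTags)
}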

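Handling the GET request:

// onHttpGetSteelHotDict handles the GET request and returns the actual dictionary, one word per line
func onHttpGetSteelHotDict(ctx *gin.Context) {
	// Query the dictionary from MySQL; dic is a []string slice
	dic, err := biz.QuerySteelHotDic()
	if err != nil {
		// Reply with 500 rather than 200 so ik does not treat the JSON error body as dictionary content
		ctx.JSON(http.StatusInternalServerError, gin.H{
			"code": biz.StatusError,
			"msg":  "server internal error",
		})
		logger.Warn(err)
		return
	}

	// Query the last update time, which serves as the signal that the dictionary needs reloading
	t, err := biz.QueryEsLastSyncTime()
	if err != nil {
		ctx.JSON(http.StatusInternalServerError, gin.H{
			"code": biz.StatusError,
			"msg":  "server internal error",
		})
		logger.Warn(err)
		return
	}

	ctx.Header("Last-Modified", t)
	ctx.Header("ETag", kDefaultTags)

	// Build the body with strings.Builder (package "strings"), appending "\n" after every non-empty word
	var body strings.Builder
	for _, v := range dic {
		if v != "" {
			body.WriteString(v)
			body.WriteByte('\n')
		}
	}
	logger.Infof("%s query steel dict success, count = %d", ctx.Request.URL, len(dic))

	// Content-Length is the byte length of the UTF-8 encoded body
	buffer := []byte(body.String())
	ctx.Header("Content-Length", strconv.Itoa(len(buffer)))
	ctx.Data(http.StatusOK, "text/html;charset=UTF-8", buffer)
}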
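
For readers who want to try the idea without gin or a database, the following is a minimal sketch using only the Go standard library. It is not the article's code: dictHandler and probe are hypothetical names, the timestamp and words are hard-coded stand-ins for the database lookups, and httptest is used so the example runs in memory without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// dictHandler serves the dictionary on one URL for both HEAD and GET.
// lastModified and words stand in for the values the article reads from MySQL.
func dictHandler(lastModified string, words []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodHead && r.Method != http.MethodGet {
			w.Header().Set("Allow", "GET, HEAD")
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		body := strings.Join(words, "\n") + "\n"
		h := w.Header()
		h.Set("Last-Modified", lastModified)
		h.Set("ETag", "DefaultTags")
		h.Set("Content-Type", "text/html;charset=UTF-8")
		h.Set("Content-Length", strconv.Itoa(len(body)))
		w.WriteHeader(http.StatusOK)
		// HEAD gets the same headers as GET, but no body.
		if r.Method == http.MethodGet {
			w.Write([]byte(body))
		}
	}
}

// probe sends one in-memory request and returns the status code,
// the Last-Modified header, and the number of body bytes received.
func probe(method string) (int, string, int) {
	h := dictHandler("2021-10-15 14:49:35", []string{"圓鋼", "無縫管"})
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/es/steelDict", nil))
	return rec.Code, rec.Header().Get("Last-Modified"), rec.Body.Len()
}

func main() {
	fmt.Println(probe(http.MethodHead)) // 200 2021-10-15 14:49:35 0
	fmt.Println(probe(http.MethodGet))  // 200 2021-10-15 14:49:35 17
}
```

Both methods return identical headers, and only GET carries the 17-byte body, which is the behavior the ik monitor relies on.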

Results

Tokenization result:

POST http://10.0.56.153:9200/_analyze
{
  "analyzer": "ik_smart",
  "text": "武鋼 Q235B 3*1500*3000 6780 佰隆庫 在途整件出"
}

{
    "tokens": [
        {
            "token": "武鋼",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "q235b",
            "start_offset": 3,
            "end_offset": 8,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "3*1500*3000",
            "start_offset": 9,
            "end_offset": 20,
            "type": "ARABIC",
            "position": 2
        },
        {
            "token": "6780",
            "start_offset": 21,
            "end_offset": 25,
            "type": "ARABIC",
            "position": 3
        },
        {
            "token": "佰隆庫",
            "start_offset": 26,
            "end_offset": 29,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "在途",
            "start_offset": 30,
            "end_offset": 32,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "整件",
            "start_offset": 32,
            "end_offset": 34,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "出",
            "start_offset": 34,
            "end_offset": 35,
            "type": "CN_CHAR",
            "position": 7
        }
    ]
}
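
When checking from Go whether a hot word is now emitted as a single token, the response above can be decoded with encoding/json. This is a minimal sketch under the assumption that only the fields shown in the response are needed; the type and function names (analyzeToken, analyzeResponse, tokenTexts) are hypothetical, and the sample is a shortened copy of the response above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// analyzeToken mirrors one entry of the "tokens" array returned by _analyze.
type analyzeToken struct {
	Token       string `json:"token"`
	StartOffset int    `json:"start_offset"`
	EndOffset   int    `json:"end_offset"`
	Type        string `json:"type"`
	Position    int    `json:"position"`
}

// analyzeResponse is the top-level _analyze response object.
type analyzeResponse struct {
	Tokens []analyzeToken `json:"tokens"`
}

// tokenTexts decodes an _analyze response body and returns the token
// strings in the order they were produced.
func tokenTexts(body []byte) ([]string, error) {
	var resp analyzeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(resp.Tokens))
	for _, t := range resp.Tokens {
		out = append(out, t.Token)
	}
	return out, nil
}

func main() {
	sample := []byte(`{"tokens":[
		{"token":"武鋼","start_offset":0,"end_offset":2,"type":"CN_WORD","position":0},
		{"token":"3*1500*3000","start_offset":9,"end_offset":20,"type":"ARABIC","position":2}
	]}`)
	texts, err := tokenTexts(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(texts) // [武鋼 3*1500*3000]
}
```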

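After each reload, every word is written to the log. If that is too noisy, comment out the logging line:

/**
   * Load the remote extension dictionaries into the main dictionary
   */
  private void loadRemoteExtDict() {
      // ...
      for (String theWord : lists) {
        if (theWord != null && !"".equals(theWord.trim())) {
          // load the extension dictionary word into the in-memory main dictionary
          // comment out this line:
          // logger.info(theWord);
          _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
        }
      }
      // ...
  }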

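Then run:

mvn package

This produces the zip package. Copy it into the ES directory, or simply replace elasticsearch-analysis-ik-6.8.4.jar.

PS: if you modify the ik source, some plugins cannot be found when Maven syncs. Just delete them; only one needs to be kept (the original post showed it in an image).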

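Afterword

Debugging: the endpoint has no effect

Since we had to modify the ik analyzer source anyway, when hot reloading first appeared to have no effect I added a log line to its code:

/**
   * Load the remote extension dictionaries into the main dictionary
   */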
  private void loadRemoteExtDict() {
    List<String> remoteExtDictFiles = getRemoteExtDictionarys();
    for (String location : remoteExtDictFiles) {
      logger.info("[Dict Loading] " + location);
      List<String> lists = getRemoteWords(location);
      // if the extension dictionary cannot be found, skip it
      if (lists == null) {
        logger.error("[Dict Loading] " + location + "載入失敗");
        continue;
      } else {
        logger.info("獲取遠端詞典成功，總數為：" + lists.size());
      }
      for (String theWord : lists) {
        if (theWord != null && !"".equals(theWord.trim())) {
          // load the extension dictionary word into the in-memory main dictionary
          logger.info(theWord);
          _MainDict.fillSegment(theWord.trim().toLowerCase().toCharArray());
        }
      }
    }
  }

The logged word count was 0:

[2021-10-15T15:02:41,886][INFO ][o.w.a.d.Monitor] [node-1] [Dict Loading] http://10.16.52.52:9800/es/steelDict
[2021-10-15T15:02:43,958][INFO ][o.w.a.d.Monitor] [node-1] 獲取遠端詞典成功，總數為：0
[2021-10-15T15:02:43,959][INFO ][o.w.a.d.Monitor] [node-1] 重新載入詞典完畢...

I then ran the following (in Dictionary.java):

  public static void main(String[] args) {
    List<String> words = getRemoteWordsUnprivileged("http://127.0.0.1:9800/es/steelDict");
    System.out.println(words.size());
  }

Stepping through it in the debugger showed that the response headers did not set Content-Length, which caused parsing to fail.
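
A quick way to catch this class of problem before involving ES is to check the endpoint's headers from Go. The sketch below is a hypothetical helper, checkDictHeaders, that takes the values a Go client exposes as http.Response.ContentLength (which is -1 when the server sent no Content-Length) and the Content-Type header, and lists what is missing according to the requirements stated in this article.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// checkDictHeaders returns a description of each problem found in the
// response headers of the dictionary endpoint. An empty result means the
// headers satisfy the requirements described in the article.
func checkDictHeaders(contentLength int64, contentType string) []string {
	var problems []string
	if contentLength <= 0 {
		problems = append(problems, "Content-Length is missing or zero")
	}
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil || !strings.EqualFold(params["charset"], "UTF-8") {
		problems = append(problems, "Content-Type does not declare charset=UTF-8")
	}
	return problems
}

func main() {
	// Headers as in the working GET response shown earlier.
	fmt.Println(len(checkDictHeaders(130, "text/html;charset=UTF-8"))) // 0

	// A response without Content-Length, as in the failing case.
	for _, p := range checkDictHeaders(-1, "text/html;charset=UTF-8") {
		fmt.Println(p) // Content-Length is missing or zero
	}
}
```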

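Keeping the * character when tokenizing numbers

The stock analyzer splits 3*1500*3000 into 3, 1500, and 3000. What if you need it kept as a single token? (In the steel trade this is a specification, hence the requirement.)

Modify the code so that the number-recognition logic also accepts "*".

/**
 * Sub-segmenter for English letters and Arabic numerals
 */
class LetterSegmenter implements ISegmenter {
  // ...
  
  // connector characters (* appended here)
  private static final char[] Letter_Connector = new char[]{'#', '&', '+', '-', '.', '@', '_', '*'};
  // numeric connector characters (* appended here)
  private static final char[] Num_Connector = new char[]{',', '.', '*'};
  
  // ...
}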

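About the author

A plug for my own open-source IM, written entirely in Golang:

CoffeeChat: https://github.com/xmcy0011/CoffeeChat
opensource im with server(go) and client(flutter+swift)

It draws on well-known projects such as TeamTalk and 瓜子IM, and includes a server (Go) and clients (Flutter + Swift). One-to-one chat and chatbot chat (小微, 圖靈, 思知) are complete, and group chat is under development. Anyone interested in Golang is welcome to star and follow it.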