The ICU Analyzer in Elasticsearch

Posted by Moshow鄭鍇 on 2020-04-07

Original URL: https://blog.csdn.net/moshowgame/article/details/99448661


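Analyzers

A tokenizer takes a string as input, splits it into individual words or tokens (possibly discarding some characters such as punctuation), and outputs a token stream.

An analyzer consists of three parts:

character filter: preprocessing before tokenization, such as stripping HTML tags or converting special characters.
tokenizer: splits the text into tokens.
token filter: normalizes the tokens.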
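
The three stages above can be illustrated with a toy pipeline in Python. This is only a sketch of the idea, not how Elasticsearch or Lucene implement it; the function names (char_filter, tokenize, token_filter, analyze_text) are made up for this example.

```python
import html
import re


def char_filter(text):
    """Character filter stage: strip HTML tags and decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)


def tokenize(text):
    """Tokenizer stage: split into runs of word characters."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filter stage: normalize each token (here: lowercase only)."""
    return [token.lower() for token in tokens]


def analyze_text(text):
    """Run the three stages in the order an analyzer applies them."""
    return token_filter(tokenize(char_filter(text)))


print(analyze_text("<p>Hello &amp; WORLD!</p>"))
```

Each stage only sees the output of the previous one, which is why the order matters: the HTML is gone before the tokenizer runs, and lowercasing is applied to finished tokens.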

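Built-in ES analyzers

Elasticsearch ships with several built-in analyzers:

Standard analyzer: the default analyzer. It lowercases tokens and strips punctuation; stop-word removal is supported but disabled by default. Chinese text is split into single characters.
Simple analyzer: splits the text on non-letter characters, then lowercases the tokens. As a consequence, digits are dropped.
Whitespace analyzer: splits on whitespace only; it does not lowercase or otherwise normalize the tokens. Since Chinese is written without spaces between words, it is of little use for Chinese.
Stop analyzer: the Simple analyzer plus stop-word removal. Stop words are function words such as "the", "an", 的, 這, and so on.
Keyword analyzer: does not tokenize; the whole input is emitted as a single token.
Pattern analyzer: uses a regular expression as the separator. The default is \W+, meaning that non-word characters act as separators.
Language analyzers: analyzers for specific languages, such as english, french, and spanish. None of them performs dictionary-based Chinese word segmentation (the cjk analyzer only emits character bigrams).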
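
To make the differences in the list above concrete, here are rough Python approximations of four of these analyzers. They are deliberately simplified and are not the Lucene implementations; the function names are invented for this sketch, and using re.ASCII to keep \W limited to ASCII is an assumption about how closely the Python regex should mirror the Java default.

```python
import re


def whitespace_tokens(text):
    """Whitespace analyzer: split on whitespace, no other normalization."""
    return text.split()


def simple_tokens(text):
    """Simple analyzer: keep runs of letters, lowercased; digits are lost."""
    return re.findall(r"[^\W\d_]+", text.lower())


def pattern_tokens(text, pattern=r"\W+"):
    """Pattern analyzer: split on a regex (default \\W+), lowercased."""
    # re.ASCII restricts \w to [a-zA-Z0-9_] (assumption, see the lead-in).
    parts = re.split(pattern, text.lower(), flags=re.ASCII)
    return [part for part in parts if part]


def keyword_tokens(text):
    """Keyword analyzer: the whole input is one token."""
    return [text]


sample = "Quick-Brown fox 42"
for analyzer in (whitespace_tokens, simple_tokens, pattern_tokens, keyword_tokens):
    print(analyzer.__name__, analyzer(sample))
```

Note how "42" survives the whitespace and pattern variants but disappears in the simple one, because digits are not letters.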

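The standard analyzer is a reasonable starting point for most Western languages. In fact, it forms the basis of most language-specific analyzers, such as english, french, and spanish. It also handles Asian languages, but with limitations (to be honest, any Chinese you feed it is split into individual characters, which is about as bad as it gets). For Chinese, a more sensible option is the icu_analyzer provided by the ICU plugin.

The appendix has a more detailed list of analyzers.

The Analyze API

As a RESTful search engine, ES lets you inspect how a piece of text is analyzed by sending a POST request to http://localhost:9200/_analyze.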

{
    "text": "I'm going to Guangzhou museum",
    "analyzer": "standard"
}

curl -X POST \
  http://localhost:9200/_analyze \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "I'\''m going to Guangzhou museum",
    "analyzer": "standard"
}'

Either curl or Postman works.
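
The same request can also be sent from a script. The sketch below uses only the Python standard library and assumes the same local node at http://localhost:9200 as the curl example; build_analyze_request and run_analyze are helper names chosen for this example, not part of any Elasticsearch client.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: the local node used above


def build_analyze_request(text, analyzer="standard", base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_analyze(text, analyzer="standard"):
    """Send the request and return the decoded JSON response."""
    request = build_analyze_request(text, analyzer)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage, with a node running locally:
# result = run_analyze("I'm going to Guangzhou museum")
# print([token["token"] for token in result["tokens"]])
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body without a running cluster.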

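The ICU analysis plugin

The ICU analysis plugin for Elasticsearch uses the International Components for Unicode (ICU) libraries to provide a rich set of tools for handling Unicode. These include the icu_tokenizer, which is particularly useful for Asian languages, and a number of token filters that are needed for correct matching and sorting in languages other than English.

The ICU plugin is an essential tool for working with languages other than English, and installing it is highly recommended. Unfortunately, because it depends on the external ICU libraries, different versions of the ICU plugin may not be compatible with earlier ones, so you may need to reindex your data when you upgrade the plugin. (Replace the version number below with your own ES version: I am on 6.8.1, so I use 6.8.1; if you are on 7.3.0, use 7.3.0, and so on.)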

# Automatic install
sudo bin/elasticsearch-plugin install analysis-icu
# Manual install
(download it yourself) https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-6.8.1.zip
(Linux) sudo bin/elasticsearch-plugin install file:///path/to/plugin.zip
(Windows) bin\elasticsearch-plugin.bat install file:///C:/Users/Administrator/Downloads/analysis-icu-6.8.1.zip
# Successful install
[=================================================] 100%
-> Installed analysis-icu
# Restart the node afterwards so that the plugin is loaded
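
Once the plugin is installed, its components can also be combined into a custom analyzer instead of using icu_analyzer directly. The sketch below only builds the index settings body; icu_normalizer, icu_tokenizer, and icu_folding are components provided by the plugin, while the analyzer name my_icu_analyzer and the helper function are hypothetical names chosen for this example.

```python
import json


def icu_index_settings(analyzer_name="my_icu_analyzer"):
    """Return an index-creation body that defines a custom ICU analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {  # hypothetical analyzer name
                        "type": "custom",
                        # Normalize Unicode before tokenizing.
                        "char_filter": ["icu_normalizer"],
                        # Dictionary-aware tokenizer from the ICU plugin.
                        "tokenizer": "icu_tokenizer",
                        # Fold case and diacritics on the resulting tokens.
                        "filter": ["icu_folding"],
                    }
                }
            }
        }
    }


print(json.dumps(icu_index_settings(), indent=2))
```

The printed JSON could be sent as the body of an index-creation request (PUT to an index name of your choice); whether this combination of filters suits your data is something to verify with the _analyze API.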

ICU tokenization

{
    "text": "基於ELK打造強大的日誌收集分析系統（springboot2+logback+logstash+elasticsearch+kibana）",
    "analyzer": "icu_analyzer"
}

{
    "tokens": [
        {
            "token": "基於",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "elk",
            "start_offset": 2,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "打造",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "強大",
            "start_offset": 7,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "的",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "日誌",
            "start_offset": 10,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "收集",
            "start_offset": 12,
            "end_offset": 14,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "分析",
            "start_offset": 14,
            "end_offset": 16,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "系統",
            "start_offset": 16,
            "end_offset": 18,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "springboot2",
            "start_offset": 19,
            "end_offset": 30,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "logback",
            "start_offset": 31,
            "end_offset": 38,
            "type": "<ALPHANUM>",
            "position": 10
        },
        {
            "token": "logstash",
            "start_offset": 39,
            "end_offset": 47,
            "type": "<ALPHANUM>",
            "position": 11
        },
        {
            "token": "elasticsearch",
            "start_offset": 48,
            "end_offset": 61,
            "type": "<ALPHANUM>",
            "position": 12
        },
        {
            "token": "kibana",
            "start_offset": 62,
            "end_offset": 68,
            "type": "<ALPHANUM>",
            "position": 13
        }
    ]
}

Testing an assumption

{
    "text": "北京大學與解放軍總醫院第一附屬醫院婦產科",
    "analyzer": "icu_analyzer"
}

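My assumption beforehand: for 北京大學 (Peking University), the analyzer would emit 北京, 大學, and also the full term 北京大學.
What I found: that was naive. There is no 北京大學 token at all.
See the result below: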

{
    "tokens": [
        {
            "token": "北京",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "大學",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "與",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "解放",
            "start_offset": 5,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "軍",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "總",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "醫院",
            "start_offset": 9,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "第一",
            "start_offset": 11,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "附屬",
            "start_offset": 13,
            "end_offset": 15,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "醫院",
            "start_offset": 15,
            "end_offset": 17,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "婦產科",
            "start_offset": 17,
            "end_offset": 20,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        }
    ]
}
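
This kind of check is easy to automate. The sketch below parses an _analyze response and tests whether a given term came out as a single token; the sample is just the first two tokens of the response above, and token_texts and has_token are helper names made up for this example.

```python
import json

# First two tokens of the response shown above, kept short for the example.
SAMPLE_RESPONSE = """
{
    "tokens": [
        {"token": "北京", "start_offset": 0, "end_offset": 2,
         "type": "<IDEOGRAPHIC>", "position": 0},
        {"token": "大學", "start_offset": 2, "end_offset": 4,
         "type": "<IDEOGRAPHIC>", "position": 1}
    ]
}
"""


def token_texts(response):
    """Return the token strings of an _analyze response, in order."""
    return [entry["token"] for entry in response["tokens"]]


def has_token(response, term):
    """True if the analyzer emitted `term` as one token."""
    return term in token_texts(response)


response = json.loads(SAMPLE_RESPONSE)
print(token_texts(response))
print(has_token(response, "北京大學"))
```

A search for the exact term 北京大學 therefore has to match the two adjacent tokens 北京 and 大學, since no combined token is indexed.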

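Smart Chinese Analysis plugin

This is a Chinese analyzer. I have heard that it will ship with Elastic Stack 8.0, but that has not been released yet, so we will have to wait and see.

The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch.

It provides an analyzer for Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge to find the best word segmentation for Simplified Chinese text. The text is first broken into sentences, and each sentence is then segmented into words.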

sudo bin/elasticsearch-plugin install analysis-smartcn
