Elasticsearch Analyzers

Posted by 碼農充電站 on 2021-02-08

Original URL: https://www.cnblogs.com/codeshell/p/14389403.html

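Every analyzer, whether built-in or custom, is made up of three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.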

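Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in Servlet; picture a filter chain.)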

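Tokenizer

A tokenizer receives a stream of characters, breaks it into individual tokens (usually individual words), and outputs a stream of tokens. For example, the whitespace tokenizer splits text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer is responsible for splitting text into individual tokens, where a token is simply a single word. A piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, along with the start and end character offsets of the original word that the term represents. (PS: the output of tokenizing a text is an array of terms.)

An analyzer must have exactly one tokenizer.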

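Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts every token to lowercase, a stop token filter removes common words such as "the", and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.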

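Summary and review

An analyzer is a package made up of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it takes in a character stream and outputs a character stream.

A tokenizer splits text: it takes in a character stream and outputs a token stream (the text is split into individual words, which are called tokens).

A token filter filters tokens: it takes in a token stream and outputs a token stream.

So the whole job of an analyzer is to break text into individual words: text ----> characters ----> tokens

This works much like a chain of interceptors.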
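The three stages summarized above can be sketched in a few lines of Python. This is only a local illustration of the data flow, not how Elasticsearch is implemented; the function names (arabic_digit_filter, whitespace_tokenizer, lowercase_filter, analyze) are made up for this example.

```python
from typing import Callable, List

# Map Arabic-Indic digits (U+0660..U+0669) to their Latin equivalents.
ARABIC_DIGITS = {0x0660 + i: str(i) for i in range(10)}


def arabic_digit_filter(text: str) -> str:
    """Character filter: character stream in, character stream out."""
    return text.translate(ARABIC_DIGITS)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: character stream in, token stream out."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: token stream in, token stream out."""
    return [token.lower() for token in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "Quick brown fox \u0661\u0662\u0663",
    [arabic_digit_filter],
    whitespace_tokenizer,
    [lowercase_filter],
)
print(tokens)  # ['quick', 'brown', 'fox', '123']
```

With empty filter lists, analyze reduces to the bare tokenizer, which mirrors the rule that only the tokenizer is mandatory.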

1. Testing analyzers

The analyze API is a tool that lets us inspect how a piece of text is analyzed. (PS: similar to an execution plan.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and offsets are recorded for each term.
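To make the offsets and positions concrete, here is a small Python sketch that reproduces the same fields for the whitespace case. It only imitates the shape of the response shown above; whitespace_tokens is a hypothetical helper, and the fixed "word" type is copied from that response rather than derived.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and record offsets and position for each token."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "word",
            "position": position,           # order of the token in the stream
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


result = whitespace_tokens("The quick brown fox.")
for token in result:
    print(token)
```

Note that end_offset is exclusive, which is why "fox." ends at 20 in a 20-character string.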

2. Analyzer

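2.1. Configuring built-in analyzers

Built-in analyzers can be used directly without any configuration. Their default settings can be changed, though. For example, the standard analyzer can be configured with a list of stop words: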

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { 
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", 
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" 
            }
          }
        }
      }
    }
  }
}
'

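In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer and my_text.english uses the std_english analyzer. The two requests below therefore analyze the text as follows: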

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text", 
  "text": "The old brown cow"
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english", 
  "text": "The old brown cow"
}
'

The first uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second uses std_english, so the result is: [ old, brown, cow ]

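2.2. Standard Analyzer (default)

The standard analyzer is the default and is used when no analyzer is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example: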

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

In the example above, the text produces the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

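2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a longer token is split at max_token_length intervals. Defaults to 255.
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.2.2. Example configuration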

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The above produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
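The effect of max_token_length and stopwords on that result can be traced by hand with a short Python sketch. It starts from the word pieces the standard tokenizer is expected to produce for the sample sentence (typed in here, not computed), and STOP_WORDS_SUBSET is a small illustrative subset, not the full _english_ list. The helper names are hypothetical.

```python
# Illustrative subset only; the real _english_ list contains more words.
STOP_WORDS_SUBSET = {"a", "an", "and", "is", "of", "the", "this", "to"}


def split_long(token, max_len):
    """Cut a token into pieces of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def apply_settings(tokens, max_token_length=255, stopwords=frozenset()):
    """Split over-long tokens, lowercase every piece, then drop stop words."""
    result = []
    for token in tokens:
        for piece in split_long(token, max_token_length):
            piece = piece.lower()
            if piece not in stopwords:
                result.append(piece)
    return result


# Hand-typed tokenizer output for the sample sentence (an assumption).
raw = ["The", "2", "QUICK", "Brown", "Foxes", "jumped",
       "over", "the", "lazy", "dog's", "bone"]

terms = apply_settings(raw, max_token_length=5, stopwords=STOP_WORDS_SUBSET)
print(terms)
# ['2', 'quick', 'brown', 'foxes', 'jumpe', 'd', 'over', 'lazy', "dog's", 'bone']
```

"jumped" is the only token longer than five characters, so it is the only one that is split, and both occurrences of "the" disappear.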

2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer:

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and all terms are lowercased. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
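The same behavior can be approximated locally with one regular expression. This is a rough sketch, not the analyzer itself: the character class below is a common Python idiom for "letters only" and may differ from Elasticsearch on unusual Unicode characters; simple_analyze is a made-up name.

```python
import re

# Word characters minus digits and underscore, i.e. roughly "letters".
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text):
    """Keep maximal runs of letters and lowercase each one."""
    return [run.lower() for run in LETTER_RUN.findall(text)]


terms = simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(terms)
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

The digit "2" vanishes entirely and the apostrophe splits "dog's" into "dog" and "s", because neither is a letter.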

2.3.1. Custom definition

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The output is as follows:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

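The stop analyzer is much like the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the _english_ stop words.

(PS: suppose we have the sentence "this is an apple", and suppose "this" and "is" are stop words. With simple the output would be [ this, is, an, apple ], while with stop it would be [ an, apple ]. This is the difference between the two: stop does not output stop words; that is, it does not treat a stop word as a term.)

(PS: stop words are common words that are removed from the token stream after the text has been split; they are not separators.)

2.5.1. Example output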

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

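2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (such as _english_) or an array containing a list of stop words. Defaults to _english_.
stopwords_path: the path to a file containing stop words, relative to the Elasticsearch config directory.

2.5.3. Example configuration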

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The above configures a stop analyzer with two stop words: the and over.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With this configuration, the request produces the following output:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
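As a rough local check of that result, the stop analyzer can be imitated as "simple analyzer plus a stop word filter". The sketch below assumes the letters-only regular expression is a fair stand-in for the real tokenizer; stop_analyze is a hypothetical name.

```python
import re


def stop_analyze(text, stopwords):
    """Split on non-letters, lowercase, then drop tokens that are stop words."""
    tokens = [run.lower() for run in re.findall(r"[^\W\d_]+", text)]
    return [token for token in tokens if token not in stopwords]


terms = stop_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    stopwords={"the", "over"},
)
print(terms)
# ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dog', 's', 'bone']

# Stop words are matched against whole tokens; they never split a token.
print(stop_analyze("the theory", stopwords={"the"}))  # ['theory']
```

The second call shows that a stop word embedded in a longer word is left alone, since filtering happens after tokenization.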

2.6. Pattern Analyzer

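The pattern analyzer uses a Java regular expression to split text into terms. The default regular expression is \W+ (all non-word characters).

2.6.1. Example output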

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because the default splits on non-word characters, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

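2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+.
flags: Java regular expression flags, such as CASE_INSENSITIVE or COMMENTS.
lowercase: whether terms should be lowercased. Defaults to true.
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_.
stopwords_path: the path to a file containing stop words.

2.6.3. Example configuration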

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", 
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores and lowercases the resulting terms.

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example produces the following output:

[ john, smith, foo, bar, com ]
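The split can be reproduced with Python's re module as a quick sanity check of the pattern. This is a sketch under one assumption: re.ASCII is passed because Java's \W is taken to be ASCII-only by default, whereas Python's is Unicode-aware. pattern_analyze is a made-up helper.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text on the separator pattern and drop empty pieces."""
    pieces = re.split(pattern, text, flags=re.ASCII)
    tokens = [piece for piece in pieces if piece]
    return [token.lower() for token in tokens] if lowercase else tokens


email_terms = pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_")
print(email_terms)  # ['john', 'smith', 'foo', 'bar', 'com']

default_terms = pattern_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
)
print(default_terms)
```

The underscore has to be listed explicitly because it counts as a word character, so \W alone would leave "John_Smith" in one piece.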

2.7. Language Analyzers

These support analysis of text in specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzer

As mentioned earlier, an analyzer consists of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
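To get a feel for what the combination above does to a piece of text, here is a rough Python imitation of the same chain: strip HTML, split into words, lowercase, fold accents to ASCII. Every step is an approximation chosen for this sketch (a naive tag regex instead of html_strip, \w+ instead of the standard tokenizer, Unicode decomposition instead of asciifolding), and the function names are hypothetical.

```python
import html
import re
import unicodedata


def html_strip(text):
    """Character filter: remove tags naively, then decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def ascii_fold(token):
    """Token filter: decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def custom_analyze(text):
    """char filter -> tokenizer -> lowercase -> accent folding."""
    text = html_strip(text)
    tokens = re.findall(r"\w+", text)
    return [ascii_fold(token.lower()) for token in tokens]


terms = custom_analyze("Is this <b>d\u00e9j\u00e0 vu</b>?")
print(terms)  # ['is', 'this', 'deja', 'vu']
```

The order matters: the tags must be removed before tokenizing, otherwise "b" would show up as a term.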

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

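4. Chinese analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text.

This plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and neither needs any configuration.

# Install
bin/elasticsearch-plugin install analysis-smartcn
# Uninstall
bin/elasticsearch-plugin remove analysis-smartcn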

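Let's test it.

With the smartcn analyzer, "今天天氣真好" ("The weather is really nice today") is analyzed as:

[ 今天, 天氣, 真, 好 ]

With the standard analyzer, which emits each Chinese character as its own token, the result would be:

[ 今, 天, 天, 氣, 真, 好 ]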
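The request behind this comparison is not shown in the text, so the following Python sketch only illustrates how such an _analyze call could be built with the standard library. The base URL http://localhost:9200 is an assumption borrowed from the earlier curl examples, and build_analyze_request is a hypothetical helper; nothing is sent until urlopen is called.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("smartcn", "今天天氣真好")
print(request.get_method(), request.full_url)

# To actually run it against a node that has the plugin installed:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Swapping the analyzer argument for "standard" would give the second result for comparison.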

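4.2. IK analyzer

Download the release that matches your Elasticsearch version; here I downloaded 6.5.3.

Then create an ik directory under the Elasticsearch plugins directory and extract the downloaded file into it.

Finally, restart Elasticsearch.

Next, test it with the same sentence as before.

The output is as follows: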

{
    "tokens": [
        {
            "token": "今天天氣",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天氣",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

This is clearly somewhat better than smartcn.

5. References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik
