[toc]
I. Anatomy of an Elasticsearch analyzer
1. The three building blocks
1.1 Character Filter
Pre-processes the raw text; for example, when the input is HTML the tags need to be stripped: html_strip.
1.2 Tokenizer
Splits the input (the text produced by the character filters) into tokens according to some rule (for example, on whitespace).
1.3 Token Filter
Post-processes the candidate terms produced by the tokenizer, e.g. uppercase -> lowercase, stop-word filtering (removing in, the, and so on).
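All three stages can be exercised together in a single _analyze call. A minimal sketch (html_strip character filter, standard tokenizer, lowercase token filter, made-up sample text):
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The QUICK Brown Foxes!</p>"
}
The tags are stripped first, the remaining text is split into words, and each word is lowercased.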
II. Testing tokenization with analyzers
2.1 Testing with a named analyzer
2.1.1 standard analyzer
Tokenizer: Standard Tokenizer
Splits on Unicode text segmentation rules; suitable for most languages.
Token Filters: Lower Case Token Filter / Stop Token Filter (disabled by default)
- Lower Case Token Filter: lowercases every token --> so after default standard analysis, search matching is done on lowercase terms
- Stop Token Filter (disabled by default) --> stop words: tokens that are dropped from the index after analysis
GET _analyze
{
"analyzer": "standard",
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
2.1.2 What the standard result shows
- everything is lowercased
- the number is still there
- no stop words removed (the stop filter is disabled by default)
{
"tokens" : [
{
"token" : "for",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "in",
"start_offset" : 44,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
2.2 Other built-in analyzers
- standard
- stop: removes stop words (a sketch follows after this list)
- simple: splits on non-letter characters and lowercases
- whitespace: splits on whitespace only, removes nothing
- keyword: keeps the full text as a single token, no splitting
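For comparison, running the stop analyzer on the same sample text (a minimal sketch; the stop analyzer lowercases, splits on non-letter characters, and removes stop words, so 27 disappears entirely):
GET _analyze
{
  "analyzer": "stop",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}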
2.3 Testing with an explicit Tokenizer and Token Filter
2.3.1 Using the same Tokenizer and Filter as standard
The previous section noted that the standard analyzer uses the standard tokenizer and the lowercase filter, so let's replace the analyzer parameter with an explicit tokenizer and filter:
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase"],
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
The result is identical to the one above:
{
"tokens" : [
{
"token" : "for",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "in",
"start_offset" : 44,
"end_offset" : 46,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
2.3.2 Adding a stop filter
GET _analyze
{
"tokenizer": "standard",
"filter": ["lowercase","stop"],
"text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}
Observation: in is gone, so the stop filter's word list evidently contains in.
The filter parameter now has two entries, i.e. two token filters are applied (like most such ES parameters, it accepts an array of values). If lowercase is removed from the filter list, uppercase letters are no longer folded to lowercase; that output is not pasted here (a sketch of the request follows after the result below).
{
"tokens" : [
{
"token" : "example",
"start_offset" : 7,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "uuu",
"start_offset" : 16,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "you",
"start_offset" : 20,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "can",
"start_offset" : 24,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "see",
"start_offset" : 28,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "27",
"start_offset" : 32,
"end_offset" : 34,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "accounts",
"start_offset" : 35,
"end_offset" : 43,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "id",
"start_offset" : 47,
"end_offset" : 49,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "idaho",
"start_offset" : 51,
"end_offset" : 56,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
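For reference, the variant without lowercase mentioned above looks like this (a sketch; its output, in which the tokens keep their original case, is omitted as noted):
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}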
III. Elasticsearch's built-in analyzer components
3.1 Built-in character filters
3.1.1 What is a character filter?
It processes the text before the tokenizer runs, e.g. adding, removing, or replacing characters; several character filters can be chained. Because it changes the text, it affects the position and offset values the tokenizer reports.
3.1.2 Some built-in character filters
- html_strip: strips HTML tags
- mapping: string replacement
- pattern_replace: regex-based replacement
3.2 Built-in tokenizers
3.2.1 What is a tokenizer?
Splits the text (after character-filter processing) into terms/tokens according to a set of rules.
3.2.2 Built-in tokenizers
- whitespace: splits on whitespace
- standard
- uax_url_email: keeps URLs and e-mail addresses intact (see the sketch after this list)
- pattern
- keyword: no splitting
- path_hierarchy: splits path names
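As an illustration of uax_url_email, URLs and e-mail addresses stay single tokens instead of being broken on the punctuation inside them (a sketch with a made-up address):
GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "mail me at someone@example.com or visit https://www.elastic.co"
}
By contrast, the standard tokenizer would split the e-mail address apart.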
3.2.3 Custom tokenizers can also be implemented as Java plugins.
3.3 Built-in token filters
3.3.1 What is a token filter?
Post-processes the terms emitted by the tokenizer.
3.3.2 Built-in token filters
- lowercase: lowercasing
- stop: removes stop words (in, the, etc.)
- synonym: adds synonyms (see the sketch after this list)
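The synonym filter can be tried inline in _analyze as well. A minimal sketch that declares quick and fast as equivalent terms (a hypothetical synonym list):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "The quick fox"
}
quick and fast are then emitted at the same position, so either word matches at query time.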
IV. Demos
4.1 html_strip / mapping + keyword
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "html_strip"
},
{
"type": "mapping",
"mappings": [
"- => _", ":) => _happy_", ":( => _sad_"
]
}
],
"text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}
The tokenizer is keyword, i.e. the text is kept whole and never split.
Two char_filter entries are used: html_strip (strips the HTML tags) and mapping (replaces the given patterns with the specified content).
Result of the request above: the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_:
{
"tokens" : [
{
"token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
"start_offset" : 3,
"end_offset" : 52,
"type" : "word",
"position" : 0
}
]
}
4.2 char_filter with a regex replacement
GET _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "http://(.*)",
"replacement": "$1"
}
],
"text": "http://www.elastic.co"
}
The regex replacement is configured through type, pattern, and replacement.
Result:
{
"tokens" : [
{
"token" : "www.elastic.co",
"start_offset" : 0,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
4.3 Tokenizer: splitting directory paths
GET _analyze
{
"tokenizer": "path_hierarchy",
"text": "/user/niewj/a/b/c"
}
Tokenization result:
{
"tokens" : [
{
"token" : "/user",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a/b",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
},
{
"token" : "/user/niewj/a/b/c",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
}
]
}
4.4 Token filters: whitespace tokenizer with the stop filter
GET _analyze
{
"tokenizer": "whitespace",
"filter": ["stop"], // ["lowercase", "stop"]
"text": "The girls in China are playing this game !"
}
Result: in and this were removed (stop words), but the remaining terms keep their original case, because no lowercase filter is applied here (the standard analyzer lowercases via its lowercase token filter, not via its tokenizer). Note that The survives as well: the default stop-word list only matches the lowercase form the. A variant with both lowercase and stop is sketched after the result below.
{
"tokens" : [
{
"token" : "The",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "girls",
"start_offset" : 4,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "China",
"start_offset" : 13,
"end_offset" : 18,
"type" : "word",
"position" : 3
},
{
"token" : "playing",
"start_offset" : 23,
"end_offset" : 30,
"type" : "word",
"position" : 5
},
{
"token" : "game",
"start_offset" : 36,
"end_offset" : 40,
"type" : "word",
"position" : 7
},
{
"token" : "!",
"start_offset" : 41,
"end_offset" : 42,
"type" : "word",
"position" : 8
}
]
}
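The variant hinted at in the request comment, with both lowercase and stop, would first fold The to the and then drop it as a stop word (a sketch; output not shown):
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The girls in China are playing this game !"
}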
4.5 Custom analyzers
4.5.1 Defining a custom analyzer in the index settings
PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {          // 1. name of the custom analyzer
          "type": "custom",
          "char_filter": ["my_emoticons"],
          "tokenizer": "my_punctuation",
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": {       // 3. name of the custom tokenizer
          "type": "pattern", "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": {         // 2. name of the custom char_filter
          "type": "mapping", "mappings": [":) => _hapy_", ":( => _sad_"]
        }
      },
      "filter": {
        "my_english_stop": {      // 4. name of the custom token filter
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}
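Once the index exists, the custom analyzer can be referenced from a field mapping. A sketch assuming a hypothetical text field named content:
PUT my_new_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}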
4.5.2 Testing the custom analyzer:
POST my_new_index/_analyze
{
"analyzer": "my_analyzer",
"text": "I'm a :) person in the earth, :( And You? "
}
Output:
{
"tokens" : [
{
"token" : "i'm",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "_hapy_",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "person",
"start_offset" : 9,
"end_offset" : 15,
"type" : "word",
"position" : 3
},
{
"token" : "earth",
"start_offset" : 23,
"end_offset" : 28,
"type" : "word",
"position" : 6
},
{
"token" : "_sad_",
"start_offset" : 30,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "you",
"start_offset" : 37,
"end_offset" : 40,
"type" : "word",
"position" : 9
}
]
}