- Multi-field feature (see the mapping sketch after this list)
  - Exact matching on the vendor name
    - Add a keyword sub-field
  - Use different analyzers
    - Different languages
    - Searching on a pinyin field
    - A different analyzer can also be specified for searching and for indexing
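  As a sketch of what such a multi-field mapping might look like (the index name my_products and the sub-field names are made up for illustration; the pinyin analyzer assumes the analysis-pinyin plugin is installed):

  ```
  PUT my_products
  {
    "mappings": {
      "properties": {
        "company": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword": {
              "type": "keyword"              // exact match on the vendor name
            },
            "english": {
              "type": "text",
              "analyzer": "english",         // a different analyzer per sub-field
              "search_analyzer": "english"   // search_analyzer may differ from the index-time analyzer
            },
            "pinyin": {
              "type": "text",
              "analyzer": "pinyin"           // assumes the analysis-pinyin plugin is installed
            }
          }
        }
      }
    }
  }
  ```

  A term query on company.keyword then matches the whole vendor name exactly, while match queries on company or company.english go through their respective analyzers.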
- Exact matching on the vendor name
  - Exact Values: include numbers / dates / a specific string (e.g. "Apple Store")
    - keyword in Elasticsearch
  - Full text: unstructured text data
    - text in Elasticsearch
  - Elasticsearch builds an inverted index for each field
    - Exact Values need no special analysis (tokenization) at index time, as the _analyze comparison below shows
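  A quick way to see the difference is to run the same exact value through the keyword analyzer and the standard analyzer (a rough sketch; the token lists are abbreviated):

  ```
  GET _analyze
  {
    "analyzer": "keyword",
    "text": "Apple Store"
  }
  // one token: "Apple Store" (the exact value is kept as a whole)

  GET _analyze
  {
    "analyzer": "standard",
    "text": "Apple Store"
  }
  // two tokens: "apple", "store" (full text is split and lowercased)
  ```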
- When the analyzers that ship with Elasticsearch cannot meet your needs, you can define a custom analyzer by combining different components yourself:
  - Character Filter
  - Tokenizer
  - Token Filter
- Character Filter: processes the text before it reaches the Tokenizer, e.g. adding, deleting, or replacing characters. Multiple Character Filters can be configured; they affect the position and offset information seen by the Tokenizer.
  - Some built-in Character Filters:
    - HTML strip - removes HTML tags
    - Mapping - string replacement
    - Pattern replace - regex-based replacement
- Tokenizer: splits the original text into terms (tokens) according to certain rules
  - Tokenizers built into Elasticsearch: whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy (a uax_url_email sketch follows this list)
  - You can also develop a plugin in Java to implement your own Tokenizer
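  As a small sketch of one of the built-in tokenizers, uax_url_email keeps URLs and e-mail addresses as single tokens instead of splitting them (the sample text and the expected tokens are illustrative):

  ```
  POST _analyze
  {
    "tokenizer": "uax_url_email",
    "text": "Send a mail to admin@example.com or visit https://www.elastic.co"
  }
  // expected tokens (roughly):
  // "Send", "a", "mail", "to", "admin@example.com", "or", "visit", "https://www.elastic.co"
  ```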
- Token Filter: adds, modifies, or deletes the terms output by the Tokenizer
  - Built-in Token Filters: lowercase | stop | synonym (adds synonyms; a synonym sketch follows below)
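  Since the synonym filter has no example further down, here is a minimal sketch of defining it inline in the _analyze API (the synonym pair laptop/notebook and the sample text are made up; the exact token output may vary):

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      {
        "type": "synonym",
        "synonyms": ["laptop, notebook"]
      }
    ],
    "text": "I bought a new laptop"
  }
  // "laptop" and "notebook" are expected to be emitted at the same position
  ```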
- char_filter: html_strip

  ```
  POST _analyze
  {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<b>hello world</b>"
  }

  // Result
  {
    "tokens" : [
      {
        "token" : "hello world",
        "start_offset" : 3,
        "end_offset" : 18,
        "type" : "word",
        "position" : 0
      }
    ]
  }
  ```
- Using a char filter for string replacement

  ```
  POST _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "mapping",
        "mappings": ["- => _"]
      }
    ],
    "text": "123-456, I-test! test-990 650-555-1234"
  }

  // Result
  {
    "tokens" : [
      { "token" : "123_456", "start_offset" : 0, "end_offset" : 7, "type" : "<NUM>", "position" : 0 },
      { "token" : "I_test", "start_offset" : 9, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 1 },
      { "token" : "test_990", "start_offset" : 17, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 2 },
      { "token" : "650_555_1234", "start_offset" : 26, "end_offset" : 38, "type" : "<NUM>", "position" : 3 }
    ]
  }
  ```
- Using a char filter to replace emoticons

  ```
  POST _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "mapping",
        "mappings": [":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "<ALPHANUM>", "position" : 0 },
      { "token" : "am", "start_offset" : 2, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 1 },
      { "token" : "felling", "start_offset" : 5, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 2 },
      { "token" : "happy", "start_offset" : 13, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 },
      { "token" : "Feeling", "start_offset" : 16, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 104 },
      { "token" : "sad", "start_offset" : 24, "end_offset" : 26, "type" : "<ALPHANUM>", "position" : 105 },
      { "token" : "today", "start_offset" : 27, "end_offset" : 32, "type" : "<ALPHANUM>", "position" : 106 }
    ]
  }
  ```
- Regular expression replacement (pattern_replace)

  ```
  GET _analyze
  {
    "tokenizer": "standard",
    "char_filter": [
      {
        "type": "pattern_replace",
        "pattern": "http://(.*)",
        "replacement": "$1"
      }
    ],
    "text": "http://www.elastic.co"
  }

  // Result
  {
    "tokens" : [
      { "token" : "www.elastic.co", "start_offset" : 0, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 0 }
    ]
  }
  ```
- Splitting by path (path_hierarchy)

  ```
  POST _analyze
  {
    "tokenizer": "path_hierarchy",
    "text": "/user/ymruan/a"
  }

  // Result
  {
    "tokens" : [
      { "token" : "/user", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 },
      { "token" : "/user/ymruan", "start_offset" : 0, "end_offset" : 12, "type" : "word", "position" : 0 },
      { "token" : "/user/ymruan/a", "start_offset" : 0, "end_offset" : 14, "type" : "word", "position" : 0 }
    ]
  }
  ```
- token_filters: stop and snowball

  ```
  GET _analyze
  {
    "tokenizer": "whitespace",
    "filter": ["stop", "snowball"],   // stop removes words such as "on", "the", "a"
    "text": ["The gilrs in China are playing this game!"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "The", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 },   // the capitalized "The" is not filtered out
      { "token" : "gilr", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
      { "token" : "China", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
      { "token" : "play", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
      { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
    ]
  }
  ```
- After adding lowercase, "The" is treated as a stopword and removed

  ```
  GET _analyze
  {
    "tokenizer": "whitespace",
    "filter": ["lowercase", "stop", "snowball"],
    "text": ["The gilrs in China are playing this game!"]
  }

  // Result
  {
    "tokens" : [
      { "token" : "gilr", "start_offset" : 4, "end_offset" : 9, "type" : "word", "position" : 1 },
      { "token" : "china", "start_offset" : 13, "end_offset" : 18, "type" : "word", "position" : 3 },
      { "token" : "play", "start_offset" : 23, "end_offset" : 30, "type" : "word", "position" : 5 },
      { "token" : "game!", "start_offset" : 36, "end_offset" : 41, "type" : "word", "position" : 7 }
    ]
  }
  ```
- The standard format for a custom analyzer in the official docs

  According to the official documentation, the standard layout of a custom analyzer is:

  ```
  PUT /my_index
  {
    "settings": {
      "analysis": {
        "char_filter": { ... custom character filters ... },   // character filters
        "tokenizer":   { ... custom tokenizers ... },          // tokenizers
        "filter":      { ... custom token filters ... },       // token filters
        "analyzer":    { ... custom analyzers ... }
      }
    }
  }
  ```
- Defining a custom analyzer

  ```
  # Define your own analyzer
  PUT my_index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_custom_analyzer": {
            "type": "custom",
            "char_filter": ["emoticons"],
            "tokenizer": "punctuation",
            "filter": ["lowercase", "english_stop"]
          }
        },
        "tokenizer": {
          "punctuation": {
            "type": "pattern",
            "pattern": "[ .,!?]"
          }
        },
        "char_filter": {
          "emoticons": {
            "type": "mapping",
            "mappings": [":) => happy", ":( => sad"]
          }
        },
        "filter": {
          "english_stop": {
            "type": "stop",
            "stopwords": "_english_"
          }
        }
      }
    }
  }
  ```
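  Once the index is created, the custom analyzer can be tried out with the _analyze API; the sample text below and the expected output are only a sketch:

  ```
  POST my_index/_analyze
  {
    "analyzer": "my_custom_analyzer",
    "text": "I'm a :) person, and you?"
  }
  // roughly: "i'm", "happy", "person", "you"
  // ":)" is mapped to "happy" by the emoticons char_filter, the punctuation tokenizer
  // splits on " .,!?", lowercase lowercases the terms, and english_stop drops "a" and "and"
  ```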