Practice 003 - Elasticsearch analyzers

Posted by niewj on 2022-05-03

[toc]


I. Components of an Elasticsearch analyzer

1. The three building blocks

1.1 Character Filter

Filters the raw text. For example, if the original text is HTML, the HTML tags need to be stripped: html_strip.

1.2 Tokenizer

Splits the input (the text produced by the character filters) according to some rule, for example on whitespace.

1.3 Token Filter

Post-processes the candidate terms produced by the tokenizer, for example lowercasing uppercase letters and stop-word filtering (removing words such as in, the, etc.).
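
All three stages can be exercised at once through the _analyze API: character filters run first, then the tokenizer, then the token filters. A minimal sketch using only built-in components (the sample text is made up):

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>The QUICK Brown Fox</b>"
}

The <b> tags are stripped before tokenization, and the output is the lowercase tokens the, quick, brown, and fox.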

II. Testing tokenization with an analyzer

2.1 Testing tokenization with a specified analyzer

2.1.1 standard analyzer

  • Tokenizer: Standard Tokenizer

    Splits on Unicode text boundaries; works well for most languages.
  • Token Filters: Lower Case Token Filter / Stop Token Filter (disabled by default)

    • Lower Case Token Filter: tokens are lowercased, which is why searches against fields analyzed with standard match in lowercase.
    • Stop Token Filter (disabled by default): stop words, i.e. terms dropped from the index after tokenization.
GET _analyze
{
  "analyzer": "standard",
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

2.1.2 What the standard output shows

  • Everything is lowercased
  • The number is still there
  • No stop words are removed (the stop filter is disabled by default)
{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.2 Other analyzers

  • standard: the default analyzer
  • stop: removes stop words
  • simple: splits on non-letter characters and lowercases
  • whitespace: splits on whitespace only, removes nothing (compared with stop below)
  • keyword: keeps the entire text as a single token, no tokenization
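
To see the difference, the same sentence can be run through two of them (a minimal sketch; the sample text is made up):

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes."
}

GET _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes."
}

whitespace returns The, 2, QUICK and Brown-Foxes. exactly as written, while stop lowercases, splits on non-letters (dropping the digit) and removes the stop word the, leaving quick, brown, foxes.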

2.3 Testing tokenization with a specified tokenizer and token filters

2.3.1 Using the same tokenizer and filter as standard

The previous section said that the standard analyzer uses the standard tokenizer and the lowercase filter. Let's replace the analyzer parameter with explicit tokenizer and filter parameters and try it:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

The result is identical to the one above:

{
  "tokens" : [
    {
      "token" : "for",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 44,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

2.3.2 Adding a stop filter and trying again

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase","stop"], 
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Looking at the output: for and in are gone, so the stop filter's stop-word list evidently contains them.

The filter array here holds two entries, i.e. two token filters (fields like this in ES take multiple values as an array). If lowercase is removed from the array, uppercase letters are no longer folded to lowercase; that output is not pasted here, but the request variant is sketched after the result below.

{
  "tokens" : [
    {
      "token" : "example",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "uuu",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "you",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "can",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "see",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "27",
      "start_offset" : 32,
      "end_offset" : 34,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "accounts",
      "start_offset" : 35,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "id",
      "start_offset" : 47,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "idaho",
      "start_offset" : 51,
      "end_offset" : 56,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
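
For reference, the variant mentioned above with the lowercase filter removed is simply (a minimal sketch; result omitted, as in the note above):

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["stop"],
  "text": "#!#For example, UUU you can see 27 accounts in ID (Idaho)."
}

Without lowercase, terms keep their original case, and a capitalized stop word such as For is no longer matched by the lowercase stop-word list.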

III. Elasticsearch's built-in analyzer components

3.1 Built-in character filters

3.1.1 What is a character filter?

It processes the text before the tokenizer runs, for example adding, deleting, or replacing characters; multiple character filters can be configured.

It affects the position and offset information produced by the tokenizer.

3.1.2 Some built-in character filters

  • html_strip: removes HTML tags
  • mapping: string replacement
  • pattern_replace: regex-based replacement

3.2 Built-in tokenizers

3.2.1 What is a tokenizer?

Splits the text (after the character filters have processed it) into terms/tokens according to certain rules.

3.2.2 Built-in tokenizers

  • whitespace: splits on whitespace
  • standard
  • uax_url_email: keeps URLs and email addresses as single tokens (see the sketch after this list)
  • pattern: splits on a regular expression
  • keyword: no tokenization, the whole text becomes one token
  • path_hierarchy: splits path names into hierarchy levels
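
Since uax_url_email is not demonstrated in the later demos, here is a minimal sketch (the sample text is made up):

GET _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://www.elastic.co/guide"
}

The email address and the URL each come back as a single token, where the standard tokenizer would split them into several pieces.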

3.2.3 Custom tokenizers can also be implemented as Java plugins

3.3 Built-in token filters

3.3.1 What is a token filter?

Post-processes the terms output by the tokenizer.

3.3.2 Built-in token filters

  • lowercase: lowercases tokens
  • stop: removes stop words (in, the, etc.)
  • synonym: adds synonym terms (see the sketch after this list)
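
The synonym filter is not covered by the demos below, so here is a minimal sketch that defines one inline in the _analyze request (the synonym list is made up):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": ["quick, fast, speedy"]
    }
  ],
  "text": "a QUICK lookup"
}

In the output, fast and speedy are emitted at the same position as quick, which is what lets a search for fast match text that only contains quick.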

IV. Demo examples

4.1 html_strip / mapping + keyword

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "html_strip"
    },
    {
      "type": "mapping",
      "mappings": [
        "- => _", ":) => _happy_", ":( => _sad_"
      ]
    }
  ],
  "text": "<b>Hello :) this-is-my-book,that-is-not :( World</b>"
}

The keyword tokenizer is used, so the text is kept intact as a single token rather than split.

Two char_filters are used: html_strip (strips the HTML tags) and mapping (replaces the original content with the specified replacements).

In the result, the HTML tags are gone, the hyphens have been replaced with underscores, and the emoticons with _happy_ / _sad_:

{
  "tokens" : [
    {
      "token" : "Hello _happy_ this_is_my_book,that_is_not _sad_ World",
      "start_offset" : 3,
      "end_offset" : 52,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.2 Regex replacement with a char_filter

GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}

The regex replacement is configured with type / pattern / replacement; here it strips the http:// prefix.

Result:

{
  "tokens" : [
    {
      "token" : "www.elastic.co",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

4.3 Splitting directory paths with the path_hierarchy tokenizer

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/niewj/a/b/c"
}

Tokenization result:

{
  "tokens" : [
    {
      "token" : "/user",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/user/niewj/a/b/c",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

4.4 The whitespace tokenizer with the stop token filter

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"], // ["lowercase", "stop"]
  "text": "The girls in China are playing this game !"
}

Result: in, are, and this are removed (stop words), but uppercase terms are kept as-is, because only the stop filter is applied and the whitespace tokenizer does not lowercase (unlike the standard analyzer, which includes a lowercase filter). The also survives because the default stop-word list is lowercase and the filter matches case-sensitively. A variant with lowercase added is sketched after the result.

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "girls",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "China",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "playing",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "game",
      "start_offset" : 36,
      "end_offset" : 40,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "!",
      "start_offset" : 41,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    }
  ]
}
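
As the commented-out alternative in the request hints, putting lowercase in front of stop changes the outcome: The is lowercased to the first and is then removed by the stop filter as well. A minimal sketch of that variant (result omitted):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The girls in China are playing this game !"
}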

4.5 Custom analyzers

4.5.1 Defining a custom analyzer in the index settings

PUT my_new_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{ // 1.自定義analyzer的名稱
          "type": "custom",
          "char_filter": ["my_emoticons"], 
          "tokenizer": "my_punctuation", 
          "filter": ["lowercase", "my_english_stop"]
        }
      },
      "tokenizer": {
        "my_punctuation": { // 3.自定義tokenizer的名稱
          "type": "pattern", "pattern":"[ .,!?]"
        }
      },
      "char_filter": {
        "my_emoticons": { // 2.自定義char_filter的名稱
          "type": "mapping", "mappings":[":) => _hapy_", ":( => _sad_"]
        }
      },
      "filter": {
        "my_english_stop": { // 4.自定義token filter的名稱
          "type": "stop", "stopwords": "_english_"
        }
      }
    }
  }
}

4.5.2 Testing the custom analyzer:

POST my_new_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm a :) person in the earth, :( And You? "
}

Output:

{
  "tokens" : [
    {
      "token" : "i'm",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "_hapy_",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "person",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "earth",
      "start_offset" : 23,
      "end_offset" : 28,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_sad_",
      "start_offset" : 30,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "you",
      "start_offset" : 37,
      "end_offset" : 40,
      "type" : "word",
      "position" : 9
    }
  ]
}
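
To put the custom analyzer to use, it would typically be referenced from a text field in the same index's mapping; a minimal sketch (the field name content is an assumption):

PUT my_new_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}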
