ElasticSearch 7.3 Learning: Custom Analyzers

舊市拾荒 | Published 2022-03-22

1. The Default Analyzer

Analyzers were introduced in an earlier post: ElasticSearch7.3 學習之倒排索引揭祕及初識分詞器(Analyzer). Here we only cover the default analyzer, the standard analyzer.
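The standard analyzer is essentially the standard tokenizer followed by a lowercase token filter (its stop-word filter is disabled by default). As a minimal sketch, runnable in the Kibana Dev Tools console without creating any index:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes"
}

This should return the tokens the, quick, brown, foxes: hyphens act as word boundaries and everything is lowercased.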

2. Modifying Analyzer Settings

First, define a custom analyzer named es_std that enables the built-in _english_ stop-words token filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Elasticsearch acknowledges the request and the index is created.
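To verify that the analyzer was registered, you can read the settings back (a quick sanity check):

GET /my_index/_settings

The response should contain the analysis block with es_std under the index settings.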

Next, let's test the two different analyzers, starting with the default one:

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

Response:

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

As you can see, the text is simply split word by word, with nothing filtered out. Next, test the custom es_std analyzer defined above:

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}

Response:

{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

This time only two tokens remain; the stop words have all been removed.
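Note that the surviving tokens keep their original positions (dog is still position 1, house position 5), so position-aware queries such as match_phrase can still account for the gaps. To see exactly which stage removed the other tokens, the _analyze API accepts an explain flag; a sketch (the output lists one token stream per stage and is fairly verbose):

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "explain": true,
  "text": "a dog is in the house"
}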

3. Building Your Own Custom Analyzer

First, delete the index created above:

DELETE my_index

Then run the statement below. In brief, the pipeline is: first strip HTML tags and map & to and (char filters), then split the text with the standard tokenizer, and finally lowercase the tokens and remove the stop words a and the (token filters). It is worth studying carefully; we will test this analyzer below.

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

Response:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

As usual, let's test this analyzer:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

The result:

{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
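When building a chain like this, it helps to test components in isolation. The _analyze API can assemble char filters, a tokenizer, and token filters ad hoc; here is a sketch exercising only the two char filters from the definition above (the sample text is illustrative):

GET /my_index/_analyze
{
  "char_filter": ["html_strip", "&_to_and"],
  "tokenizer": "standard",
  "text": "tom&jerry <b>are</b> friends"
}

Expected tokens: tomandjerry, are, friends. Since no token filters were supplied, no lowercasing or stop-word removal takes place.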

Finally, in practice we can configure a specific field to use the custom analyzer. The syntax is as follows:

PUT /my_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
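With this mapping in place, my_analyzer is applied to the content field at both index time and search time. A short end-to-end sketch (the document ID and sample text are made up for illustration; refresh=true just makes the document immediately searchable):

PUT /my_index/_doc/1?refresh=true
{
  "content": "tom&jerry are a friend in the house"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "content": "tom&jerry"
    }
  }
}

The match query analyzes its input with the field's analyzer, so tom&jerry becomes tomandjerry on both sides and the document is returned as a hit.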

 
