Pinyin Search in Elasticsearch

Posted by 無風聽海 on 2022-01-14

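Pinyin search is a commonly used feature in Chinese-language search: users type the full pinyin of a keyword, or just the first letters of its pinyin, and the search engine returns the relevant results. In China, almost all Chinese input methods are based on Hanyu Pinyin, so a feature that fits users' typing habits while cutting down on keystrokes is very popular.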

1. Installing the pinyin search plugin

Download the release of the elasticsearch-analysis-pinyin plugin that matches your Elasticsearch version (7.9.2 in this article):

https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.9.2/elasticsearch-analysis-pinyin-7.9.2.zip

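Create an analysis-pinyin directory under the plugins directory of the Elasticsearch installation and unzip the downloaded package into it.

Restart Elasticsearch; the startup log shows that the pinyin plugin has been loaded: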

[2022-01-13T20:37:25,368][INFO ][o.e.p.PluginsService     ] [mango] loaded plugin [analysis-pinyin]

2. Using the pinyin plugin

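First try out the analyzer. The output contains the full pinyin of each character, plus one token (wanzg) made up of the first letters of all the characters: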

POST _analyze
{
  "analyzer": "pinyin",
  "text": "我愛你,中國"
}

{
  "tokens" : [
    {
      "token" : "wo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wanzg",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ni",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "zhong",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "guo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    }
  ]
}
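To make the token list above easier to read, here is a toy Python sketch that reproduces its shape: one full-pinyin token per character, plus a first-letter abbreviation that shares position 0. It is not the plugin's implementation. The PINYIN table is hand-written for this one sentence, and pinyin_tokens is a hypothetical helper introduced only for illustration.

```python
# Toy pinyin table covering only the characters of the example sentence.
# ASSUMPTION: hand-written for illustration; the real plugin ships a full dictionary.
PINYIN = {"我": "wo", "愛": "ai", "你": "ni", "中": "zhong", "國": "guo"}


def pinyin_tokens(text, limit_first_letter_length=16):
    """Return (token, position) pairs shaped like the _analyze output above."""
    # Characters without an entry (here: the comma) are skipped.
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]
    tokens = []
    for position, syllable in enumerate(syllables):
        tokens.append((syllable, position))
        if position == 0:
            # The first-letter abbreviation of the whole text sits at position 0.
            abbreviation = "".join(s[0] for s in syllables)
            tokens.append((abbreviation[:limit_first_letter_length], 0))
    return tokens


if __name__ == "__main__":
    for token, position in pinyin_tokens("我愛你,中國"):
        print(position, token)
```

The abbreviation token is what lets a user type only initials and still hit the document.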

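Define a custom pinyin filter and create the mapping. The analyzer uses the ik_max_word tokenizer, so the IK analysis plugin (elasticsearch-analysis-ik) must be installed as well: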

PUT /milk
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter":{
          "type":"pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      },
      "analyzer": {
        "ik_pinyin_analyzer":{
          "tokenizer":"ik_max_word",
          "filter":["pinyin_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "brand":{
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      },
      "series":{
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      },
      "price":{
        "type": "float"
      }
    }
  }
}
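If the index is created from application code rather than from the console, the same body can be assembled as a plain dictionary. The sketch below is one possible way to do that. build_index_body is a hypothetical helper, and the comments paraphrase the option descriptions in the plugin README, so they should be checked against the installed version.

```python
import json


def build_index_body(text_fields):
    """Build the create-index body, applying ik_pinyin_analyzer to each text field."""
    pinyin_filter = {
        "type": "pinyin",
        "keep_separate_first_letter": False,  # no separate token per first letter
        "keep_full_pinyin": True,             # full pinyin of every character
        "keep_original": True,                # also keep the original token
        "limit_first_letter_length": 16,      # cap the first-letter abbreviation
        "lowercase": True,                    # lowercase non-Chinese letters
        "remove_duplicated_term": True,       # drop duplicate terms
    }
    properties = {
        name: {"type": "text", "analyzer": "ik_pinyin_analyzer"}
        for name in text_fields
    }
    properties["price"] = {"type": "float"}
    return {
        "settings": {
            "analysis": {
                "filter": {"pinyin_filter": pinyin_filter},
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "tokenizer": "ik_max_word",  # needs the IK plugin
                        "filter": ["pinyin_filter"],
                    }
                },
            }
        },
        "mappings": {"properties": properties},
    }


if __name__ == "__main__":
    # The printed JSON can be sent as the body of PUT /milk.
    print(json.dumps(build_index_body(["brand", "series"]), indent=2))
```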

Bulk-index a few documents:

POST _bulk
{"index":{"_index":"milk", "_id":1}}
{"brand":"蒙牛", "series":"特侖蘇", "price":60}
{"index":{"_index":"milk", "_id":2}}
{"brand":"蒙牛", "series":"真果粒", "price":40}
{"index":{"_index":"milk", "_id":3}}
{"brand":"華山牧", "series":"華山牧", "price":49.90}
{"index":{"_index":"milk", "_id":4}}
{"brand":"伊利", "series":"安慕希", "price":49.90}
{"index":{"_index":"milk", "_id":5}}
{"brand":"伊利", "series":"金典", "price":49.90}
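Every line of a bulk body has to be a complete JSON object on its own, and the body has to end with a newline, so it is easy to get wrong by hand. A minimal sketch of generating the body in Python follows. DOCS and build_bulk_body are names introduced here for illustration, and the sketch only builds the text; sending it to the _bulk endpoint is left out.

```python
import json

# The five documents from the article.
DOCS = [
    {"brand": "蒙牛", "series": "特侖蘇", "price": 60},
    {"brand": "蒙牛", "series": "真果粒", "price": 40},
    {"brand": "華山牧", "series": "華山牧", "price": 49.90},
    {"brand": "伊利", "series": "安慕希", "price": 49.90},
    {"brand": "伊利", "series": "金典", "price": 49.90},
]


def build_bulk_body(index, docs):
    """Serialize docs as NDJSON: an action line, then a source line, per document."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the final line to be terminated by a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body("milk", DOCS), end="")
```

Because each line goes through json.dumps, unbalanced braces cannot slip into the request.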

Search for tl; the matching record is returned:

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.691126,
    "hits" : [
      {
        "_index" : "milk",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.691126,
        "_source" : {
          "brand" : "蒙牛",
          "series" : "特侖蘇",
          "price" : 60
        }
      }
    ]
  }
}


The profile output shows that the query is executed directly as a MultiPhraseQuery, which matches the terms at two consecutive positions:

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  },
  "profile": true
}

{
  "profile" : {
    "shards" : [
      {
        "id" : "[OoNXoregTmKQAFotUgOeaA][milk][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "MultiPhraseQuery",
                "description" : "series:\"(t tl) (li l lun)\"",
                "time_in_nanos" : 177400,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 1,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 5400,
                  "match" : 12800,
                  "next_doc_count" : 1,
                  "score_count" : 1,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 9200,
                  "advance_count" : 1,
                  "score" : 3900,
                  "build_scorer_count" : 2,
                  "create_weight" : 40800,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 105300
                }
              }
            ],
            "rewrite_time" : 45700,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 14500
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}
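The description series:"(t tl) (li l lun)" can be read as two consecutive positions, each accepting any one of several terms; match_phrase_prefix expands the last query term into indexed terms that start with it, which is where li, l and lun come from. The Python sketch below imitates that matching rule on a toy scale. It is not Lucene's implementation, and INDEXED_POSITIONS is an assumption about how 特侖蘇 was indexed, inferred from the per-character highlighting rather than read from the index.

```python
# Terms accepted at each query position, copied from the profile description
# series:"(t tl) (li l lun)".
QUERY_POSITIONS = [{"t", "tl"}, {"li", "l", "lun"}]

# ASSUMPTION: terms stored per position for "特侖蘇", supposing ik_max_word emits
# single characters and the pinyin filter adds full pinyin and the first letter.
INDEXED_POSITIONS = [{"特", "te", "t"}, {"侖", "lun", "l"}, {"蘇", "su", "s"}]


def multi_phrase_match(query_positions, indexed_positions):
    """Return the start of the first run of consecutive document positions where
    each query position shares at least one term with the document, else -1."""
    span = len(query_positions)
    for start in range(len(indexed_positions) - span + 1):
        if all(query_positions[i] & indexed_positions[start + i] for i in range(span)):
            return start
    return -1


if __name__ == "__main__":
    print(multi_phrase_match(QUERY_POSITIONS, INDEXED_POSITIONS))
```

Under these assumptions the query matches starting at the first character, while a query such as t followed by s would not match, because those terms are not at adjacent positions.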

The query can also return highlighting information:

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  },
  "highlight": {
    "fields": {
      "series": {}
    }
  }
}



{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.691126,
    "hits" : [
      {
        "_index" : "milk",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.691126,
        "_source" : {
          "brand" : "蒙牛",
          "series" : "特侖蘇",
          "price" : 60
        },
        "highlight" : {
          "series" : [
            "<em>特</em><em>侖</em>蘇"
          ]
        }
      }
    ]
  }
}
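On the client side, the highlighted pieces can be pulled out of a fragment with a small regular expression. The sketch below assumes the default <em> and </em> tags seen in the response above; highlighted_terms is a hypothetical helper name.

```python
import re


def highlighted_terms(fragment, pre_tag="<em>", post_tag="</em>"):
    """Return the text of every highlighted span in a highlight fragment."""
    pattern = re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag)
    return re.findall(pattern, fragment)


if __name__ == "__main__":
    print(highlighted_terms("<em>特</em><em>侖</em>蘇"))
```

For the fragment in this response, the helper returns the two characters that matched the two query positions.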

