Elasticsearch: How Using Different Analyzers Skews Search Ranking

Posted by seth-shi on 2020-11-30

Original article: https://learnku.com/articles/52070

Many of us who build Chinese-language search end up installing the IK Chinese analysis plugin found on GitHub.

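Then, when creating the mapping, it feels natural to use settings like these (following the example in the plugin's official documentation):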

{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}

Now let's look at all the data: two documents, titled 打火車 and 火車.

curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      }
    ]
  }
}

Now we search for 打火車:

curl '127.0.0.1:9200/test/_search?q=打火車' | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.21110919,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.160443,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      }
    ]
  }
}

Surprisingly, 火車 scores 0.21110919, which is higher than 打火車 at 0.160443.

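Tracking this down took some digging. Thanks first to the github.com/mobz/elasticsearch-head plugin, which saved a lot of steps while inspecting the data.
In the end, looking at how the document had been tokenized gave the answer: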

curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打火": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "火車": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

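Surprisingly, with ik_max_word the title 打火車 is indexed as two terms, 打火 and 火車, so something is clearly off for our purposes (although from the search engine's point of view nothing is wrong).
At search time, ik_smart splits the query 打火車 into 打 and 火車, so only 火車 matches. Both documents match on that term, but the extra 打火 term makes the 打火車 title longer, which lowers its score under length normalization, and the 火車 document ranks first.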
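
To make the effect concrete, here is a minimal sketch of BM25 scoring in Python. It is not Elasticsearch's actual code: it assumes the default BM25 parameters (k1 = 1.2, b = 0.75), a single shard, and that only the title field contributes to the score. The function name bm25_score and the docs dictionary are made up for illustration; the indexed terms come from the term vectors above, and the query terms assume ik_smart splits 打火車 into 打 and 火車.

```python
import math

K1 = 1.2   # assumed default BM25 term-frequency saturation
B = 0.75   # assumed default BM25 length-normalization strength


def bm25_score(query_terms, doc_terms, corpus):
    """Score one document (a list of terms) against the query terms."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue  # a query term absent from the document adds nothing
        doc_freq = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        # Longer fields get a larger norm, which shrinks the term's score.
        norm = K1 * (1 - B + B * len(doc_terms) / avg_len)
        score += idf * freq * (K1 + 1) / (freq + norm)
    return score


# Title -> terms as indexed by ik_max_word (from the term vectors).
docs = {"打火車": ["打火", "火車"], "火車": ["火車"]}
corpus = list(docs.values())
# Assumption: ik_smart splits the query into these two terms.
query = ["打", "火車"]

for title, terms in docs.items():
    print(title, round(bm25_score(query, terms, corpus), 4))
```

Under these assumptions the sketch yields roughly 0.1604 for 打火車 and 0.2111 for 火車, in line with the scores Elasticsearch returned: the query term 打 matches nothing, and the two-term title is penalized for its length.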
So I decided to set both analyzers to the same value:

{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_smart",
      "search_analyzer": "ik_smart"
    }
  }
}

Then look at the term vectors again (this time the terms are what we expected):

curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "火車": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

Searching once more, the scores and the ranking are now what we wanted.

curl '127.0.0.1:9200/test/_search?q=打火車' | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.77041256,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.77041256,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      }
    ]
  }
}
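
A mismatch like this can also be spotted before indexing any documents by asking Elasticsearch how each analyzer tokenizes the same text through the _analyze API. The following is only a sketch: it assumes a node at 127.0.0.1:9200 with the IK plugin installed and an index named test, as in the curl examples, and the helper names build_analyze_request, analyze and compare are invented for illustration.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # assumption: same local node as the curl examples


def build_analyze_request(index, analyzer, text):
    """Build a POST request for the _analyze API of the given index."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(index, analyzer, text):
    """Return the tokens the analyzer produces (needs a running node)."""
    request = build_analyze_request(index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


def compare(index, text, analyzers=("ik_max_word", "ik_smart")):
    """Print the tokens from each analyzer so mismatches are easy to spot."""
    for name in analyzers:
        print(name, analyze(index, name, text))
```

Calling compare("test", "打火車") against a running node prints one token list per analyzer. If the two lists differ for text that users are likely to search, the index-time and search-time analyzers may produce the same kind of ranking surprise described here.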

This work is licensed under a CC license. Reposts must credit the author and link to this article.

