Search ranking problems in Elasticsearch caused by using different analyzers

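Posted by seth-shi on 2020-11-30
  • Many of us, when building Chinese-language search, pick up the IK Chinese analysis plugin from GitHub.
  • Then, when creating the mapping, we naturally use settings like these (following the example in the plugin's official documentation):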
    {
          "properties": {
              "title": {
                  "type": "text",
                  "analyzer": "ik_max_word",
                  "search_analyzer": "ik_smart"
              }
          }
    }
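  • Before indexing anything, it can help to see how each analyzer tokenizes the same text. The following is a minimal sketch that calls the _analyze API, assuming Elasticsearch is reachable at 127.0.0.1:9200 (as in the curl commands in this post) and the IK plugin is installed. The helper names build_analyze_request and analyze are hypothetical, introduced only for illustration.

```python
import json
import urllib.request

# Assumed endpoint, taken from the curl commands in this post.
ES_URL = "http://127.0.0.1:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build a POST request for the _analyze API (hypothetical helper)."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Return the tokens that the given analyzer produces for the text."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

  • Calling analyze("ik_max_word", "打火車") and analyze("ik_smart", "打火車") against a running node shows the two token lists side by side, which makes a mismatch between index-time and search-time analysis easy to spot.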

  • Now let's look at all the data (two documents, 打火車 and 火車):
curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      }
    ]
  }
}
  • Now we search for 打火車:
curl '127.0.0.1:9200/test/_search?q=打火車' | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.21110919,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.160443,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      }
    ]
  }
}
  • Surprisingly, 火車 scores 0.21110919, which is higher than 打火車 at 0.160443.

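  • Thanks go to the github.com/mobz/elasticsearch-head plugin, which saved a lot of work while inspecting the data during troubleshooting.
  • Looking at the document's term vectors then revealed the answer:
curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq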
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打火": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "火車": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}
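  • Surprisingly, 打火車 was indexed as two terms, 打火 and 火車. This is where the problem comes from (although from the search engine's point of view nothing is wrong).
  • The 火車 term in the 打火車 document does earn a score, but 打火 matches nothing in the query and only makes the field longer, which lowers that document's score, so the 火車 document ranks first.
  • So I decided to use the same analyzer for indexing and searching (the analyzer of an existing field cannot be changed in place, so the index has to be recreated and the documents reindexed):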
{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
}
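  • The mismatch under the original mapping can be sketched with plain sets. The indexed terms for Video_1 are copied from the term vectors above; the single term for Video_2 is inferred from sum_ttf = 3. The query terms are an assumption: that ik_smart splits the query 打火車 into 打 and 火車. The function name matched_terms is hypothetical.

```python
# Terms indexed under ik_max_word. Video_1 comes from the _termvectors output;
# Video_2 is inferred from the field statistics (sum_ttf = 3).
indexed = {
    "Video_1": {"打火", "火車"},
    "Video_2": {"火車"},
}

# Assumption: ik_smart tokenizes the query 打火車 into these two terms.
query_terms = {"打", "火車"}


def matched_terms(doc_terms, query):
    """Return the query terms that are present in a document's term set."""
    return doc_terms & query


for doc_id, terms in indexed.items():
    matches = sorted(matched_terms(terms, query_terms))
    print(doc_id, "matches:", matches, "field length:", len(terms))
```

  • Both documents match only 火車, so the ranking is decided by field length alone, and the shorter 火車 document wins.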
  • Then look at the term vectors again (this time the tokens are what we expected):
curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "火車": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}
  • Searching once more, the scores and the ranking are now what we wanted.
curl '127.0.0.1:9200/test/_search?q=打火車' | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.77041256,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.77041256,
        "_source": {
          "id": 1,
          "title": "打火車"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火車"
        }
      }
    ]
  }
}
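  • Both sets of scores in this post can be reproduced by hand. The sketch below assumes the default BM25 parameters (k1 = 1.2, b = 0.75) and the formula variant that keeps the (k1 + 1) factor in the numerator; neither is stated in the post, but the results are consistent with the _score values shown. Document counts and lengths are taken from the field_statistics and terms in the _termvectors output.

```python
import math

# Assumed defaults for Elasticsearch's BM25 similarity.
K1 = 1.2
B = 0.75


def idf(doc_count, doc_freq):
    """Inverse document frequency of a term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(term_freq, doc_len, avg_len, doc_count, doc_freq):
    """Score contribution of one query term for one document."""
    norm = K1 * (1 - B + B * doc_len / avg_len)
    return idf(doc_count, doc_freq) * term_freq * (K1 + 1) / (term_freq + norm)


DOC_COUNT = 2
AVG_LEN = 3 / 2  # sum_ttf / doc_count from field_statistics

# Under both mappings, 火車 occurs in both documents (doc_freq = 2).
train_in_video_1 = bm25(1, doc_len=2, avg_len=AVG_LEN, doc_count=DOC_COUNT, doc_freq=2)
train_in_video_2 = bm25(1, doc_len=1, avg_len=AVG_LEN, doc_count=DOC_COUNT, doc_freq=2)

# With ik_smart at index time, the query term 打 matches Video_1 only (doc_freq = 1).
hit_in_video_1 = bm25(1, doc_len=2, avg_len=AVG_LEN, doc_count=DOC_COUNT, doc_freq=1)

print(f"before: Video_2 = {train_in_video_2:.4f}, Video_1 = {train_in_video_1:.4f}")
print(f"after:  Video_1 = {train_in_video_1 + hit_in_video_1:.4f}, Video_2 = {train_in_video_2:.4f}")
# before: Video_2 = 0.2111, Video_1 = 0.1604
# after:  Video_1 = 0.7704, Video_2 = 0.2111
```

  • Before the change, only 火車 contributes and the longer field of Video_1 pulls its score down. After the change, the rare term 打 adds roughly 0.61 to Video_1, which puts it clearly ahead.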
This work is licensed under the CC license; reprints must credit the author and link to this article.