Elasticsearch queries: an analysis of three approaches to fetching ids

Posted by 無風聽海 on 2022-02-19

1. Use case

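Beyond ordinary full-text search, elasticsearch is used in many business scenarios: each business module sets query conditions according to its own needs, runs them through elasticsearch, and gets back the ids of all matching records. When the number of hits reaches the tens of thousands, query performance drops noticeably, especially when very large documents are hit.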

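There are currently three ways to obtain the record ids:

Filter the source with _source:["id"];

Set _source:false and extract the device id from the metadata _id that es returns;

Store the device id separately with store=true and query it with stored_fields=['id'];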
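As a rough sketch, the three approaches correspond to search request bodies along the following lines. The helper name `build_fetch_id_body` and its mode labels are hypothetical, introduced only for illustration, and the `match_all` query stands in for whatever business query is actually used.

```python
def build_fetch_id_body(mode, size=10):
    """Return a search request body for one of the three id-fetching modes.

    The mode labels are hypothetical names used only in this sketch.
    """
    body = {"query": {"match_all": {}}, "size": size}
    if mode == "source_filter":
        # Approach 1: keep _source but filter it down to the id field.
        body["_source"] = ["id"]
    elif mode == "metadata_id":
        # Approach 2: skip _source; only hit metadata such as _id is returned.
        body["_source"] = False
    elif mode == "stored_field":
        # Approach 3: read the separately stored id field and skip _source.
        body["stored_fields"] = ["id"]
        body["_source"] = False
    else:
        raise ValueError(f"unknown mode: {mode}")
    return body


print(build_fetch_id_body("stored_field"))
```

Only the third body depends on the mapping: the id field has to be declared with store set to true before stored_fields can return it.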

2. The store mapping parameter

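By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but its original value cannot be retrieved.

Usually this does not matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or a few fields rather than the whole _source, you can do so with _source filtering.

In some situations it makes sense to store a field. For example, if a document has a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract them from a large _source field:

Set the store parameter of the relevant fields to true and create the mapping: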

PUT my_store_test
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}



{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_store_test"
}

Index a document with PUT:

PUT my_store_test/_doc/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

{
  "_index" : "my_store_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Set stored_fields in the query to choose which fields to return; the fields element of the elasticsearch response then contains the corresponding values:

GET my_store_test/_search
{
  "stored_fields": [ "title", "date" ] 
}


{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_store_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "2015-01-01T00:00:00.000Z"
          ],
          "title" : [
            "Some short title"
          ]
        }
      }
    ]
  }
}
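Stored field values come back as arrays even when a field holds a single value, as the response above shows. A minimal sketch of unpacking them on the client side follows; the helper name `first_stored_values` is hypothetical, and the sample hit is copied from the response above.

```python
def first_stored_values(hit, names):
    """Collect the first value of each requested stored field from one hit."""
    fields = hit.get("fields", {})
    # Each stored field is returned as a list, so take element 0.
    # Fields that were not stored (or not returned) are simply skipped.
    return {name: fields[name][0] for name in names if fields.get(name)}


sample_hit = {
    "_index": "my_store_test",
    "_type": "_doc",
    "_id": "1",
    "_score": 1.0,
    "fields": {
        "date": ["2015-01-01T00:00:00.000Z"],
        "title": ["Some short title"],
    },
}

# "content" is not a stored field, so it does not appear in the result.
print(first_stored_values(sample_hit, ["title", "date", "content"]))
```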

3. Test setup

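The test uses the index my_store_index, which contains 500,000 documents, some of them very large;

We test with the fetch_ids_query function:

By default, the record id is read from the _source field returned by the elasticsearch query;

The take_from__id flag switches to parsing the record id out of the metadata _id returned by the query;

The task_stored_fields flag switches to reading the record id from the fields returned by the query;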

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
import time


def fetch_ids_query(client, take_from__id=False, task_stored_fields=False):
    start = time.time()
    s = Search(using=client, index="my_store_index")
    s = s.params(http_auth=["test", "test"], request_timeout=50)
    # Match every document whose name does not start with the phrase "us".
    q = Q('bool',
          must_not=[Q('match_phrase_prefix', name='us')]
          )
    s = s.query(q)

    # Default: return only the id field from _source; otherwise disable _source.
    s = s.source(False) if take_from__id else s.source(['id'])
    if task_stored_fields:
        # Read the separately stored id field instead of _source.
        s = s.extra(stored_fields=['id'])
        s = s.source(False)

    # Fetching 40000 hits in one page requires index.max_result_window
    # to be at least 40000 (the default is 10000).
    s = s[0:40000]
    response = s.execute()

    print(f'hit total {response.hits.total}')
    print(f'fetch total {len(response.hits.hits)}')

    ids = []
    if take_from__id:
        for hit in response.hits.hits:
            # Drop the first 37 characters of _id to get the record id.
            ids.append(hit['_id'][37:])
    elif task_stored_fields:
        for hit in response.hits.hits:
            # Stored field values are returned as lists.
            ids.append(hit.fields['id'][0])
    else:
        for hit in response.hits.hits:
            ids.append(hit._source['id'])

    end = time.time()
    print(f"all execute time {end - start}s")


client = Elasticsearch(hosts=['http://127.0.0.1:9200'], http_auth=["test", "test"])

print('fetch id from source')
fetch_ids_query(client)
print()
print('fetch id from _id and set source = false')
fetch_ids_query(client, True)
print()
print('fetch id from stored id and set source = false')
fetch_ids_query(client, False, True)
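The slice [37:] in the script implies that every _id begins with a fixed 37-character prefix in front of the record id. The article does not describe the actual _id layout, so the sketch below uses a purely hypothetical one (a 36-character UUID, an underscore, then the record id), and the helper name `record_id_from_doc_id` is likewise an assumption. Naming the prefix length and checking it makes a malformed _id fail loudly instead of silently yielding an empty id.

```python
ID_PREFIX_LENGTH = 37  # taken from the slice used in the test script


def record_id_from_doc_id(doc_id, prefix_length=ID_PREFIX_LENGTH):
    """Strip a fixed-length prefix from an Elasticsearch _id."""
    if len(doc_id) <= prefix_length:
        raise ValueError(f"_id too short to contain a record id: {doc_id!r}")
    return doc_id[prefix_length:]


# Hypothetical _id layout: 36-character UUID + "_" + record id.
example = "123e4567-e89b-12d3-a456-426614174000_device-42"
print(record_id_from_doc_id(example))
```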

4. Result analysis

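In a test that hit 484,970 documents and fetched 40,000 of them, the last two approaches ran markedly faster than _source filtering (about 11.3 s and 13.9 s versus 28.7 s). Of the two, parsing the id out of the metadata _id is the friendlier option: it needs no extra storage, and it avoids memory and CPU fluctuations during queries.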

fetch id from source
hit total 484970
fetch total 40000
all execute time 28.691869497299194s

fetch id from _id and set source = false
hit total 484970
fetch total 40000
all execute time 11.315539121627808s

fetch id from stored id and set source = false
hit total 484970
fetch total 40000
all execute time 13.930094957351685s
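To put the three timings side by side, the short sketch below computes each approach's speedup relative to _source filtering. The timings are copied from the output above; the dictionary labels and the `speedup` helper are names chosen here for illustration. These are single-run measurements, so the ratios are indicative only.

```python
timings = {
    "_source filter": 28.691869497299194,
    "metadata _id": 11.315539121627808,
    "stored field": 13.930094957351685,
}
baseline = timings["_source filter"]


def speedup(seconds, reference=baseline):
    """How many times faster a run is than the reference run."""
    return reference / seconds


for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f}s, {speedup(seconds):.2f}x vs _source filter")
```

On these numbers, parsing _id is roughly 2.5 times as fast as _source filtering, and the stored field roughly twice as fast.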
