1. Use Cases
Beyond ordinary full-text search, Elasticsearch is used in many business scenarios: each business module builds query conditions to fit its own needs, executes them through Elasticsearch, and collects the ids of all matching records. Once the number of hits reaches the tens of thousands, query performance degrades noticeably, especially when very large documents are matched.
There are currently three ways to fetch the record ids:
1. request only the id via "_source": ["id"];
2. set "_source": false and split the device id out of the _id metadata returned by Elasticsearch;
3. store the device id separately with store=true in the mapping and query it with stored_fields=['id'].
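As a sketch, the three approaches correspond to the following search bodies (my_index and the id field are illustrative names, not from the test setup; the third variant additionally requires "store": true on the id field in the mapping):

```
# 1) _source filtering: return only the id field
GET my_index/_search
{ "_source": ["id"] }

# 2) no _source at all: derive the id from each hit's _id metadata
GET my_index/_search
{ "_source": false }

# 3) stored field: return the separately stored id
GET my_index/_search
{ "_source": false, "stored_fields": ["id"] }
```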
2. The store mapping parameter
By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but the original field value cannot be retrieved.
Usually this does not matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or a few fields, rather than the whole _source, this can be achieved with _source filtering.
In certain situations it can make sense to store a field. For example, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field:
Set the store parameter of the relevant fields to true and create the mapping:
PUT my_store_test
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "store": true
        },
        "date": {
          "type": "date",
          "store": true
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_store_test"
}
Index a document with a PUT request:
PUT my_store_test/_doc/1
{
  "title": "Some short title",
  "date": "2015-01-01",
  "content": "A very long content field..."
}
{
  "_index" : "my_store_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
Select the fields to return by setting stored_fields in the query; the fields section of the Elasticsearch response contains the corresponding values:
GET my_store_test/_search
{
  "stored_fields": [ "title", "date" ]
}
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_store_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "2015-01-01T00:00:00.000Z"
          ],
          "title" : [
            "Some short title"
          ]
        }
      }
    ]
  }
}
3. Test Setup
For the test we use my_store_index, which holds 500,000 documents, including some particularly large ones.
We test with the fetch_ids_query function below:
By default, the record id is taken from the _source field of the Elasticsearch response;
take_from__id switches to parsing the record id out of the _id metadata returned by Elasticsearch;
task_stored_fields switches to taking the record id from the fields section of the Elasticsearch response.
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
import time

def fetch_ids_query(client, take_from__id=False, task_stored_fields=False):
    start = time.time()
    s = Search(using=client, index="my_store_index")
    s = s.params(http_auth=["test", "test"], request_timeout=50)
    q = Q('bool',
          must_not=[Q('match_phrase_prefix', name='us')])
    s = s.query(q)
    # Either skip _source entirely or request only the id field.
    s = s.source(False) if take_from__id else s.source(['id'])
    if task_stored_fields:
        s = s.extra(stored_fields=['id'])
        s = s.source(False)
    s = s[0:40000]
    response = s.execute()
    print(f'hit total {response.hits.total}')
    print(f'fetch total {len(response.hits.hits)}')
    ids = []
    if take_from__id:
        # Parse the record id out of the _id metadata.
        for hit in response.hits.hits:
            id = hit['_id'][37:]
            ids.append(id)
    elif task_stored_fields:
        # Take the id from the stored fields in the response.
        for hit in response.hits.hits:
            id = hit.fields['id'][0]
            ids.append(id)
    else:
        # Take the id from the returned _source.
        for hit in response.hits.hits:
            id = hit._source['id']
            ids.append(id)
    end = time.time()
    print(f"all execute time {end - start}s")

client = Elasticsearch(hosts=['http://127.0.0.1:9200'], http_auth=["test", "test"])

print('fetch id from source')
fetch_ids_query(client)
print()
print('fetch id from _id and set source = false')
fetch_ids_query(client, True)
print()
print('fetch id from stored id and set source = false')
fetch_ids_query(client, False, True)
4. Results
In a test with 484,970 hits and 40,000 records fetched, the latter two approaches ran noticeably faster. Parsing the id out of the _id metadata is the friendlier choice: it not only saves storage space, but also avoids memory and CPU churn at query time.
fetch id from source
hit total 484970
fetch total 40000
all execute time 28.691869497299194s
fetch id from _id and set source = false
hit total 484970
fetch total 40000
all execute time 11.315539121627808s
fetch id from stored id and set source = false
hit total 484970
fetch total 40000
all execute time 13.930094957351685s
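The _id parsing in the test script presumably relies on the document _id embedding a 36-character UUID prefix plus a one-character separator in front of the record id, which is why hit['_id'][37:] strips the first 37 characters. A minimal sketch of that slice (the sample _id value below is made up for illustration):

```python
def record_id_from_es_id(es_id: str) -> str:
    # Drop the assumed 36-character UUID prefix plus the 1-character separator.
    return es_id[37:]

# Hypothetical _id layout: "<36-char uuid>-<record id>"
sample = "123e4567-e89b-12d3-a456-426614174000" + "-" + "device-42"
print(record_id_from_es_id(sample))  # -> device-42
```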