Getting Started with ES: Term-Based Queries

Posted by 以終為始 on 2020-10-01

Original URL: https://www.cnblogs.com/michael9/p/13759208.html

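Preparation

First, a note: the ES version I use here is 5.2.0.

To make things easier to follow, every example in this article is based on an index whose documents look like the one below. The index holds netflow traffic data captured by PMACCT.

This article covers introductory ES material, mainly filtering on terms. Treat it as the basics.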

{
    "_index": "shflows_agg_1600358400",
    "_type": "shflows_agg",
    "_id": "node1_1600359600_0_172698718_shflows_agg_0",
    "_score": 1.0,
    "_source": {
        "collector": "node1",
        "src_port": "443",
        "timestamp": 1600359600,
        "device_ip": "1.1.1.1",
        "flows": "40",
        "dst_host": "2.2.2.2",
        "TAG": 10001,
        "router_ip": 172698718,
        "dst_port": "16384",
        "pkts": 40000,
        "bits": 320000000000,
        "src_host": "3.3.3.3"
    }
}

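Before getting into search proper, one concept needs to be made clear. Many people learning ES queries confuse term queries with full-text queries.

First, a term is the smallest unit that carries meaning. Both search and statistical language modeling operate on terms.

In ES, the input of a term query is not analyzed at all. It is treated as a single unit and matched against the terms in the inverted index, and the results are returned with a relevance score computed by the scoring formula. Wrapping the query in Constant Score turns it into a filter, which skips scoring and can use the cache, improving performance.

Although the query input is not analyzed, the content of a text field is analyzed when it is indexed. This sometimes leads to searches that return nothing. For example, given a doc whose name is ‘Jack’, a term query for Jack finds nothing, because the value was lowercased to jack at index time. The query has to use the lowercase jack, or target the keyword version of the field instead.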
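
The Jack example can be sketched with a toy inverted index. This is only an illustration of the idea, not how ES is implemented: the analyze function is a made-up stand-in that lowercases and splits on whitespace, and term_lookup and match_lookup are hypothetical names for "use the input as-is" and "analyze the input first".

```python
from collections import defaultdict


def analyze(text):
    """Toy stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Map each analyzed term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_lookup(index, value):
    """Term-style lookup: the input is used as-is, with no analysis."""
    return sorted(index.get(value, set()))


def match_lookup(index, text):
    """Full-text-style lookup: the input is analyzed before matching."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {1: "Jack", 2: "Rose Dawson"}
index = build_index(docs)

print(term_lookup(index, "Jack"))   # [] because only "jack" was indexed
print(term_lookup(index, "jack"))   # [1]
print(match_lookup(index, "Jack"))  # [1]
```

The term-style lookup misses the capitalized input because the index only contains the lowercased term, which is the same mismatch described above.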

Term queries include:

Term Query
Range Query
Exists Query
Prefix Query
Wildcard Query

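Full-text queries, by contrast, work on the full text of a field.

In ES, analysis happens both at index time (input) and at search time. The query string is first passed to the appropriate analyzer, which produces the list of terms to search for.

Full-text queries include: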

Match Query
Match Phrase Query
Query String Query

All of the examples below are term-based queries.

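ES Search Overview

The ES search API falls into two broad categories:

URL parameter search, suited to simple searches.
Request Body search (DSL), suited to more complex searches.

Choosing the indices to search:

/_search: all indices in the cluster

/index1/_search: the index1 index

/index1,index2/_search: the index1 and index2 indices

/index*/_search: all indices whose names start with index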
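
The four path forms above follow one pattern, which a small helper can make explicit. This is a minimal sketch; search_path is a hypothetical name of my own, not part of any ES client.

```python
def search_path(indices=None):
    """Build the _search path for zero or more index names or patterns."""
    if not indices:
        return "/_search"  # no index given: search every index in the cluster
    if isinstance(indices, str):
        indices = [indices]
    return "/" + ",".join(indices) + "/_search"


print(search_path())                      # /_search
print(search_path("index1"))              # /index1/_search
print(search_path(["index1", "index2"]))  # /index1,index2/_search
print(search_path("index*"))              # /index*/_search
```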

URL Queries

Querying a specific field:

Pass the query in the q parameter as a key:value pair.

Example 1: find documents whose device IP is 1.1.1.1:

/shflows_agg_*/_search?q=device_ip:1.1.1.1 

{
    "profile": "true"
}

profile shows how the query is executed.

Result: the type is TermQuery, and the search targets the specified field: "device_ip:1.1.1.1"

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "device_ip:1.1.1.1",
                                "time": "445.8407320ms",
............

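Field-less query

When only a value is given and no key, the value is matched against all fields of the document.

Example 2: find documents in which any field contains 1.1.1.1. For instance, documents where src_host or dst_host is 1.1.1.1 are also returned.

/shflows_agg_*/_search?q=1.1.1.1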

{
    "profile": "true"
}

Result: the description now refers to _all

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "_all:1.1.1.1",
 ......
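
Both URL forms (with and without a field name) can be produced by one small function. This is a sketch under my own naming; uri_search is hypothetical. It uses urlencode from the Python standard library, which percent-encodes the colon as %3A, so the output looks slightly different from the raw URLs shown above while carrying the same q value.

```python
from urllib.parse import urlencode


def uri_search(index_pattern, value, field=None):
    """Build a URL search: q=field:value, or q=value when no field is given."""
    q = f"{field}:{value}" if field else value
    return f"/{index_pattern}/_search?" + urlencode({"q": q})


print(uri_search("shflows_agg_*", "1.1.1.1", field="device_ip"))
# /shflows_agg_*/_search?q=device_ip%3A1.1.1.1
print(uri_search("shflows_agg_*", "1.1.1.1"))
# /shflows_agg_*/_search?q=1.1.1.1
```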

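DSL Queries

Approach: write JSON in the request body to express more complex queries.

Querying all documents

Example 1: query all documents in the current index:

/shflows_agg_index1/_search

{
    "query": {
        "match_all": {} # returns all docs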
    }
}

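Sorting and paginating documents

Example 2: query all documents in the current index, sorted by timestamp in descending order, skipping the first 10 hits and returning the next 20

/shflows_agg_index1/_search

{
    "from": 10,
    "size": 20,
    "sort": [{"timestamp": "desc"}],
    "query": {
        "match_all": {} # returns all docs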
    }
}
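
When paging through results, from is usually derived from a page number. The helper below is a minimal sketch of that arithmetic; page_body and its parameters are hypothetical names, and the sort field defaults to the timestamp field used in this article.

```python
def page_body(page, size, sort_field="timestamp", order="desc"):
    """Build a match_all request body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {
        "from": (page - 1) * size,  # number of hits to skip
        "size": size,               # number of hits to return
        "sort": [{sort_field: order}],
        "query": {"match_all": {}},
    }


body = page_body(page=2, size=20)
print(body["from"], body["size"])  # 20 20
```

Keep in mind that from + size is capped by the index.max_result_window setting (10000 by default), so this style of paging only suits the first few thousand hits.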

Selecting which fields to return

Example: return only the specified fields of each document

/shflows_agg_index1/_search

{
    "_source": ["timestamp", "device_ip"],
    "query": {
        "match_all": {} # returns all docs
    }
}
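
To see what this does to a hit, here is a rough local imitation applied to part of the sample document from the top of the article. filter_source is a hypothetical helper that only handles exact field names; it does not model everything the real _source option accepts.

```python
def filter_source(source, fields):
    """Keep only the listed top-level fields of a _source dict."""
    return {key: value for key, value in source.items() if key in fields}


sample_source = {
    "collector": "node1",
    "src_port": "443",
    "timestamp": 1600359600,
    "device_ip": "1.1.1.1",
    "dst_port": "16384",
}

print(filter_source(sample_source, ["timestamp", "device_ip"]))
# {'timestamp': 1600359600, 'device_ip': '1.1.1.1'}
```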

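Using script fields to compute a value from several fields of a document

Example: concatenate the device IP and the destination port of each document, and name the result ip_address: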

/shflows_agg_index1/_search

{
    "script_fields": {
        "ip_address":{
            "script": {
                "lang": "painless",
                "inline": "params.comment + doc['device_ip'].value + ':' + doc['dst_port'].value",
                "params" : {
                    "comment" : "ip address is: " 
                }
            }
        }
    },
    "query": {
        "match_all": {} 
    }
}

Result: a new field holding the concatenated string appears under fields

{
    "took": 84,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 36248845,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_0",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:16384"
                    ]
                }
            },
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:443"
                    ]
                }
            },
.......
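
For readers who do not know painless, the script is a plain string concatenation. The Python below mirrors it on the device_ip and dst_port values of the sample document from the top of the article. ip_address_field is a hypothetical helper for illustration only, and nothing here runs inside ES.

```python
def ip_address_field(source, comment="ip address is: "):
    """Mirror the painless expression: comment + device_ip + ':' + dst_port."""
    return comment + str(source["device_ip"]) + ":" + str(source["dst_port"])


sample_source = {"device_ip": "1.1.1.1", "dst_port": "16384"}

# script_fields values come back as lists under "fields", as in the result above
fields = {"ip_address": [ip_address_field(sample_source)]}
print(fields)  # {'ip_address': ['ip address is: 1.1.1.1:16384']}
```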

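Query Context vs. Filter Context

ES runs a search in one of two contexts, Query or Filter:

Query context: relevance scores are computed during the search
Filter context: no scoring is needed, so results can be cached for better performance

Both contexts support:

Equality queries (term)
Range queries (range)

Example: find docs whose dst_port is 443, with scoring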

/shflows_agg_index1/_search

{
    "profile": "true",
    "explain": true,
    "query": {
        "term": {"dst_port": 443}
    }
}

Result:

{
    "took": 191,
    "timed_out": false,
    "_shards": {
        "total": 11,
        "successful": 11,
        "failed": 0
    },
    "hits": {
        "total": 3871488,
        "max_score": 2.2973032,
        "hits": [
            {
                "_shard": "[shflows_agg_1600358400][0]",
                "_node": "RWTixYPtTieZaRgAH0NOkQ",
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 2.2973032,  # a computed score appears here
                "_source": {
                    "collector": "node1",
                    "src_port": "16384",
                    "timestamp": 1600359600,

Using a filter query:

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
   # use constant_score to skip scoring
    "constant_score": {
      "filter": {
        "term": {
          "dst_port": 443
        }
      }
    }
  }
}

Result:

"hits": {
        "total": 3872768,
        "max_score": 1.0, # 1.0 is a fixed value
.....

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "ConstantScoreQuery", # constant-score query
                                "description": "ConstantScore(dst_port:443)",

Example: terms query, finding docs whose dst_port is 443 or 22

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
   # use constant_score to skip scoring
    "constant_score": {
      "filter": {
        "terms": {
          "dst_port": [443,22]
        }
      }
    }
  }
}
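
The term and terms examples above share the same constant_score wrapper, so it can be factored out when building request bodies in code. A minimal sketch, where as_filter is a hypothetical helper name:

```python
import json


def as_filter(query):
    """Wrap a query clause in constant_score so it runs without scoring."""
    return {"query": {"constant_score": {"filter": query}}}


single_port = as_filter({"term": {"dst_port": 443}})
two_ports = as_filter({"terms": {"dst_port": [443, 22]}})

print(json.dumps(two_ports, indent=2))
```

The same wrapper works for any clause that does not need a relevance score, including the range clause.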

Example: range query

{
    "profile": "true",
    "explain": true,
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "timestamp": {
                        # greater than or equal to
                        "gte": 1601049600,
                        # less than or equal to
                        "lte": 1601308800
                    }
                }
            }
        }
    }
}
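
The timestamp field in this index holds epoch seconds, so range bounds usually have to be converted from calendar times. The sketch below does that conversion with the standard library. epoch_range is a hypothetical helper, and the UTC datetimes are my own reading of the two numbers above (they are the instants those values correspond to).

```python
from datetime import datetime, timezone


def epoch_range(field, start, end):
    """Build a range clause on an epoch-seconds field from aware datetimes."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("use timezone-aware datetimes to avoid local-time surprises")
    return {
        "range": {
            field: {
                "gte": int(start.timestamp()),  # greater than or equal to
                "lte": int(end.timestamp()),    # less than or equal to
            }
        }
    }


start = datetime(2020, 9, 25, 16, 0, tzinfo=timezone.utc)
end = datetime(2020, 9, 28, 16, 0, tzinfo=timezone.utc)
clause = epoch_range("timestamp", start, end)
print(clause)
# {'range': {'timestamp': {'gte': 1601049600, 'lte': 1601308800}}}
```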

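Bool Compound Query: Filtering on Multiple Conditions

In ES, a bool query combines or nests one or more query clauses to express more complex searches.

A bool query has four kinds of clauses:

must: results must match; contributes to the score
should: optional match, similar to OR; matching any one clause is enough; contributes to the score
must_not: results must not match; runs in Filter context and does not contribute to the score
filter: results must match; runs in Filter context and does not contribute to the score

must_not and filter perform better because they skip scoring.

Example: find docs whose timestamp is between 1601171628 and 1601175228, whose destination port is 80, and whose source IP (src_host) is any one of [1.1.1.1, 1.1.1.2, 1.1.1.3].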

{
    "profile": "true",
    "explain": true,
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "timestamp": {
                            "gte": 1601171628,
                            "lte": 1601175228
                        }
                    }
                },
                {
                    "term": {
                        "dst_port": 80
                    }
                },
                {
                    "bool": {
                        # Note: should sits inside the must array here. At the same level as must, it would not affect which docs must returns.
                        "should": [
                            {
                                "term": {
                                    "src_host": "1.1.1.1"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.2"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.3"
                                }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
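
Writing the nested should clauses by hand is repetitive and easy to get wrong, so it can help to generate the body from a list of hosts. This is a sketch only; flows_query is a hypothetical helper that reproduces the structure of the request above, with the OR group nested inside must.

```python
import json


def flows_query(start_ts, end_ts, dst_port, src_hosts):
    """Build a bool query: time range AND dst_port AND (any of src_hosts)."""
    any_host = {
        "bool": {
            # nested inside must, so at least one should clause has to match
            "should": [{"term": {"src_host": host}} for host in src_hosts]
        }
    }
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"timestamp": {"gte": start_ts, "lte": end_ts}}},
                    {"term": {"dst_port": dst_port}},
                    any_host,
                ]
            }
        }
    }


body = flows_query(1601171628, 1601175228, 80, ["1.1.1.1", "1.1.1.2", "1.1.1.3"])
print(json.dumps(body, indent=2))
```

Since all three should clauses test the same field, a single terms clause on src_host, like the dst_port terms example earlier, would express the same OR more compactly.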
