Elasticsearch from Getting Started to Giving Up: Rambling About Mapping

Posted by Jackeyzhe on 2020-08-04

Original URL: https://www.cnblogs.com/Jackeyzhe/p/13436928.html

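Earlier in this series we covered Elasticsearch indexing, search, and analyzers. Today we look at another fundamental topic: Mapping.

Mapping plays the role in Elasticsearch that a schema plays in a relational database. It defines the names of the fields in an index, their data types, and additional per-field settings. Starting with Elasticsearch 7.0, a mapping no longer needs to specify a type; the official documentation explains why.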

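Field data types

As mentioned, a mapping defines the data type of each field. This is probably its most commonly used feature, so let's start with the data types Elasticsearch supports.

Simple types: text, keyword, date, long, double, boolean, ip
Complex types: object, nested
Special types: geo_point and geo_shape, which describe geographic locations

Elasticsearch supports many more data types than these; there isn't room to list them all, so I'll introduce a few that come up often at work.

First, strings. Elasticsearch has two string types, text and keyword. A text field supports full-text search, and its value is run through an analyzer before it is indexed. For example: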

PUT my_index
{
  "mappings": {
    "properties": {
      "full_name": {
        "type":  "text"
      }
    }
  }
}

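When a field is mapped as text, a number of parameters let you customize it further.

index: whether the field is searchable; defaults to true

search_analyzer: the analyzer applied at search time; defaults to the field's analyzer setting, which in turn falls back to the index's default analyzer

fielddata: whether the field may use in-memory fielddata for sorting and aggregations; defaults to false

meta: metadata about the field

Fields such as IDs, email addresses, and domain names call for the keyword type. A keyword value is indexed as a single unanalyzed term, so it supports sorting and aggregations and is matched only by its exact value.

Some people map IDs as a numeric type, which is also fine. Each choice has its advantages: numeric types are suited to range queries, while keyword is faster for exact-match (term) queries. Which one to use depends on the use case.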
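The difference between text and keyword can be sketched with a toy inverted index. This is only an illustration in Python, not how Elasticsearch is implemented: the analyze function is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, and build_index and the sample documents are made up for the example.

```python
import re
from collections import defaultdict


def analyze(value):
    """Toy analyzer (assumption): lowercase, then split on anything that is not a-z or 0-9."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


def build_index(docs, field_type):
    """Map each indexed term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # text: one term per token; keyword: the whole value as a single term
        terms = analyze(value) if field_type == "text" else [value]
        for term in terms:
            index[term].add(doc_id)
    return index


docs = {1: "New York", 2: "York", 3: "new-york@example.com"}
text_index = build_index(docs, "text")
keyword_index = build_index(docs, "keyword")

print(sorted(text_index.get("york", set())))     # ids whose tokens include "york"
print(sorted(keyword_index.get("York", set())))  # ids whose exact value is "York"
```

In the text index, a lookup for york finds all three documents because each was broken into tokens. In the keyword index only the exact original string is a term, which is why keyword suits IDs and email addresses.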

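A date can be represented in Elasticsearch in three ways:

A string containing a formatted date, such as "2020-07-26" or "2015/01/01 12:10:30"
A long number of milliseconds since the epoch
An integer number of seconds since the epoch

Internally, Elasticsearch converts dates to UTC and stores them as a long number of milliseconds since the epoch.

The date format is customizable; the default is strict_date_optional_time||epoch_millis.

strict_date_optional_time is a generic ISO date-time parser. The date must include at least the year, and the time, if present, is separated from the date by T, for example yyyy-MM-dd'T'HH:mm:ss.SSSZ or yyyy-MM-dd.

To accept several date formats at once, list them in the format parameter, separated by ||: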

PUT my_index
{
  "mappings": {
    "properties": {
      "date": {
        "type":   "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
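To make the || fallthrough concrete, here is a rough Python sketch that tries each format in order and returns milliseconds since the epoch in UTC, which is how the value ends up being stored. The STRPTIME table and the to_epoch_millis function are hypothetical helpers for this illustration; they cover only the three formats from the mapping above and are not Elasticsearch's own parser.

```python
import calendar
from datetime import datetime

# Assumption: hand-written translation of the two Java-style patterns above
# into Python strptime patterns. Other patterns are not supported here.
STRPTIME = {
    "yyyy-MM-dd HH:mm:ss": "%Y-%m-%d %H:%M:%S",
    "yyyy-MM-dd": "%Y-%m-%d",
}


def to_epoch_millis(value, fmt="yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"):
    """Try each ||-separated format in turn; the first one that parses wins."""
    for pattern in fmt.split("||"):
        if pattern == "epoch_millis":
            try:
                return int(value)
            except (TypeError, ValueError):
                continue
        try:
            parsed = datetime.strptime(str(value), STRPTIME[pattern])
        except ValueError:
            continue
        # The input carries no zone, so it is interpreted as UTC.
        return calendar.timegm(parsed.timetuple()) * 1000
    raise ValueError(f"failed to parse date [{value}] with format [{fmt}]")


print(to_epoch_millis("2020-07-26"))
print(to_epoch_millis("2020-07-26 00:00:01"))
print(to_epoch_millis(1595721600000))
```

A value such as "2020/07/26" matches none of the three formats, so the sketch raises an error, much as a document with an unparseable date would be rejected.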

Mapping parameters

We just used the format parameter to configure date formats. Mappings offer many other parameters as well.

analyzer
boost
coerce
copy_to
doc_values
dynamic
eager_global_ordinals
enabled
fielddata
fields
format
ignore_above
ignore_malformed
index_options
index_phrases
index_prefixes
index
meta
normalizer
norms
null_value
position_increment_gap
properties
search_analyzer
similarity
store
term_vector

Let's look at a few commonly used parameters.

fields

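First is fields, which lets the same field be indexed in different ways for different purposes.

For example, a string field can be mapped as text for full-text search and, through fields, also as keyword for sorting and aggregations.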

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": {
            "type":  "keyword"
          }
        }
      }
    }
  }
}

At query time we can use city for full-text search and city.raw for sorting and aggregations.

GET my-index-000001/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}
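What this query does can be mimicked in a few lines of plain Python. This is only a mental model under simplifying assumptions: the analyzed city field is approximated by lowercasing and splitting on whitespace, the city names are invented sample data, and a Counter stands in for the terms aggregation.

```python
from collections import Counter

cities = ["New York", "York", "new york", "Boston"]

# Each document is indexed twice: "city" as analyzed tokens,
# "city.raw" as the untouched original string.
docs = [{"city": c.lower().split(), "city.raw": c} for c in cities]

# match on city: keep documents whose tokens contain "york"
hits = [d for d in docs if "york" in d["city"]]

# sort on city.raw ascending (plain string order, uppercase before lowercase)
hits.sort(key=lambda d: d["city.raw"])

# terms aggregation on city.raw: count documents per exact value
buckets = Counter(d["city.raw"] for d in hits)

print([d["city.raw"] for d in hits])
print(dict(buckets))
```

The full-text match ignores case because it works on tokens, while sorting and bucketing see "New York" and "new york" as different values because they work on the raw string.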

enabled

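Sometimes we only want to store a field and never search on it. In that case we can disable it by setting enabled to false. Elasticsearch then skips parsing and indexing the field entirely, so its contents are no longer constrained by the type declared in the mapping. The enabled setting applies only to object fields and to the top-level mapping definition.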

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "user_id": {
        "type":  "keyword"
      },
      "last_updated": {
        "type": "date"
      },
      "session_data": { 
        "type": "object",
        "enabled": false
      }
    }
  }
}

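In the example above, session_data is disabled. You can now put arbitrary JSON objects into session_data, and values that are not JSON objects are accepted as well.

Besides disabling a single field, we can disable the entire mapping. Let's create a new index: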

PUT my-index-000002
{
  "mappings": {
    "enabled": false 
  }
}

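Now none of the document's fields are indexed; the document is only stored.

Note that the enabled setting cannot be changed on an existing field or on the top-level mapping. Once it is false, Elasticsearch neither indexes the field nor validates its contents, so switching it back to true after invalid data has accumulated would cause errors.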
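The effect of enabled: false on a single field can be pictured with a small Python sketch that walks a document and lists the field paths that would be indexed. The indexed_fields function and the sample documents are hypothetical and ignore everything else a real mapping does; the properties dictionary mirrors the session_data mapping above.

```python
def indexed_fields(doc, properties, prefix=""):
    """Return the field paths that would be indexed; disabled fields are skipped whole."""
    paths = []
    for name, value in doc.items():
        spec = properties.get(name, {})
        path = prefix + name
        if spec.get("enabled") is False:
            continue  # the value is kept with the stored document but never parsed
        if isinstance(value, dict):
            paths += indexed_fields(value, spec.get("properties", {}), path + ".")
        else:
            paths.append(path)
    return paths


properties = {
    "user_id": {"type": "keyword"},
    "last_updated": {"type": "date"},
    "session_data": {"type": "object", "enabled": False},
}

doc_with_object = {
    "user_id": "kimchy",
    "session_data": {"arbitrary_object": {"some_array": ["foo", "bar"]}},
    "last_updated": "2015-12-06T18:20:22",
}
doc_with_string = {
    "user_id": "jpountz",
    "session_data": "none",  # not a JSON object, still accepted because nothing is parsed
    "last_updated": "2015-12-06T18:22:13",
}

print(indexed_fields(doc_with_object, properties))
print(indexed_fields(doc_with_string, properties))
```

Both documents yield the same indexed fields, because whatever sits under session_data is never inspected.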

null_value

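A null value cannot be indexed or searched in Elasticsearch. Here null does not mean the null of a particular programming language but any field that has no value, for example an empty array or an array containing only nulls.

What if the business needs to search for empty values? Elasticsearch provides the null_value parameter: explicit null values are replaced with the specified value at index time, so they can be indexed and searched. The stored document itself is not modified.

An example: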

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "status_code": {
        "type":       "keyword",
        "null_value": "NULL" 
      }
    }
  }
}

Here status_code has null_value set to "NULL". Note that null_value must have the same data type as the field: if status_code were a long, "NULL" would not be a valid null_value.
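The substitution can be sketched as a small Python function that returns the terms indexed for a field value. The function name indexed_terms is made up for this illustration, and the sketch covers only the null handling, not the rest of indexing.

```python
def indexed_terms(value, null_value=None):
    """Return the terms indexed for a field value, replacing explicit nulls with null_value."""
    values = value if isinstance(value, list) else [value]
    terms = []
    for v in values:
        if v is None:
            if null_value is not None:
                terms.append(null_value)  # explicit null becomes the substitute term
            # without a null_value, an explicit null contributes nothing
        else:
            terms.append(v)
    return terms


print(indexed_terms(None, null_value="NULL"))           # explicit null
print(indexed_terms([], null_value="NULL"))             # empty array: no explicit null inside
print(indexed_terms([None, "200"], null_value="NULL"))  # mixed array
print(indexed_terms(None))                              # no null_value configured
```

Note the empty array case: it contains no explicit null, so nothing is substituted and a search for "NULL" would not find that document.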

dynamic

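For newly added fields:

dynamic set to true: when a document containing a new field is written, the mapping is updated as well
dynamic set to false: the mapping is not updated and the new field is not indexed, but its value still appears in _source
dynamic set to strict: the document is rejected

For existing fields, the field definition can no longer be changed once data has been written.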
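The three settings can be compared side by side with a toy Python model. The apply_dynamic function, its return shape, and the use of ValueError for the strict case are all assumptions made for this sketch; the detected type of a new field is left as a placeholder string.

```python
def apply_dynamic(properties, doc, dynamic=True):
    """Return (updated properties, names of new fields that were ignored)."""
    mode = str(dynamic).lower()  # accept True/False as well as "true"/"false"/"strict"
    updated = dict(properties)
    ignored = []
    for name in doc:
        if name in updated:
            continue  # already mapped, nothing to decide
        if mode == "strict":
            raise ValueError(f"field [{name}] is not in the mapping and dynamic is strict")
        if mode == "false":
            ignored.append(name)  # stays in _source, is not indexed
        else:
            updated[name] = {"type": "(detected type)"}  # mapping grows
    return updated, ignored


mapping = {"title": {"type": "text"}}
doc = {"title": "hello", "views": 3}

print(apply_dynamic(mapping, doc, True))
print(apply_dynamic(mapping, doc, False))
try:
    apply_dynamic(mapping, doc, "strict")
except ValueError as err:
    print("rejected:", err)
```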

Dynamic Mapping

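When creating an index we don't have to write the mappings by hand; Elasticsearch can detect field types automatically. This is called Dynamic Mapping. The inferred type is not always what we want, though.

Type detection is based on the JSON data type of the value. The correspondence is as follows (table from the Elastic website):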

JSON data type	Elasticsearch data type
`null`	No field is added.
`true` or `false`	`boolean` field
floating point number	`float` field
integer	`long` field
object	`object` field
array	Depends on the first non-`null` value in the array.
string	Either a `date` field (if the value passes date detection), a `double` or `long` field (if the value passes numeric detection) or a `text` field, with a `keyword` sub-field.

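These are the only data types that dynamic mapping detects. Every other type must be mapped explicitly.

Date detection is enabled by default, but Elasticsearch only recognizes strings that match the formats in dynamic_date_formats, which defaults to [ "strict_date_optional_time", "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z" ]. If date_detection is turned off, such values are mapped as strings.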

PUT my-index-000001
{
  "mappings": {
    "date_detection": false
  }
}

You can also specify the date formats to detect yourself with the dynamic_date_formats parameter.

PUT my-index-000001
{
  "mappings": {
    "dynamic_date_formats": ["MM/dd/yyyy"]
  }
}

Elasticsearch can also detect numbers that arrive as strings. This is controlled by the numeric_detection switch, which is off by default.

PUT my-index-000005
{
  "mappings": {
    "numeric_detection": true
  }
}

PUT my-index-000005/_doc/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}

In this example, my_float is detected as a float field and my_integer as a long field.
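The detection rules from the table, together with the date_detection and numeric_detection switches, can be summarized in a rough Python sketch. The detect_type function is hypothetical, and DATE_FORMATS is a loose stand-in for the real date formats, so treat the date check in particular as an approximation.

```python
import re
from datetime import datetime

# Assumption: a few strptime patterns roughly covering the default detected formats.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y/%m/%d", "%Y/%m/%d %H:%M:%S")


def _is_date(text):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            pass
    return False


def detect_type(value, date_detection=True, numeric_detection=False):
    """Return the field type dynamic mapping would pick, or None if no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # must come before int: bool is an int subclass in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):  # decided by the first non-null element
        first = next((v for v in value if v is not None), None)
        return detect_type(first, date_detection, numeric_detection)
    if isinstance(value, str):
        if date_detection and _is_date(value):
            return "date"
        if numeric_detection:
            if re.fullmatch(r"-?\d+", value):
                return "long"
            if re.fullmatch(r"-?\d+\.\d+", value):
                return "float"  # following the my_float example above
        return "text"  # created with a keyword sub-field
    raise TypeError(f"not a JSON value: {value!r}")


print(detect_type("1.0", numeric_detection=True), detect_type("1", numeric_detection=True))
print(detect_type("2015/09/02"), detect_type("2015/09/02", date_detection=False))
```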

Dynamic template

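Dynamic templates let us define custom mappings that are applied to dynamically added fields when they match certain conditions. A dynamic template is generally defined like this: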

  "dynamic_templates": [
    {
      "my_template_name": { 
        ...  match conditions ... 
        "mapping": { ... } 
      }
    },
    ...
  ]

my_template_name can be any string.

The match conditions are match_mapping_type, match, match_pattern, unmatch, path_match, and path_unmatch.

mapping is the mapping that a matched field should use. Let's look at a few of the match conditions.

match_mapping_type

Let's start with a simple example:

PUT my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"
          }
        }
      },
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "raw": {
                "type":  "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    ]
  }
}

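There are two templates here. The first maps fields detected as long to integer instead. The second maps strings to text with a keyword sub-field named raw.

match and unmatch

These two are straightforward: match is a pattern the field name must match, and unmatch is a pattern that excludes field names.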

PUT my-index-000001
{
  "mappings": {
    "dynamic_templates": [
      {
        "longs_as_strings": {
          "match_mapping_type": "string",
          "match":   "long_*",
          "unmatch": "*_text",
          "mapping": {
            "type": "long"
          }
        }
      }
    ]
  }
}

In this example, the template applies to string fields whose names start with long_ and do not end with _text, and maps them as long.
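How the three conditions combine can be sketched in Python with the standard fnmatch module. The template_applies function is a hypothetical helper, and fnmatch accepts a few wildcard forms beyond the plain * used here, so this is an approximation of the matching rather than a copy of it.

```python
from fnmatch import fnmatchcase


def template_applies(template, field_name, detected_type):
    """Check match_mapping_type, match and unmatch; all given conditions must hold."""
    wanted = template.get("match_mapping_type")
    if wanted not in (None, "*", detected_type):
        return False
    if "match" in template and not fnmatchcase(field_name, template["match"]):
        return False
    if "unmatch" in template and fnmatchcase(field_name, template["unmatch"]):
        return False
    return True


longs_as_strings = {
    "match_mapping_type": "string",
    "match": "long_*",
    "unmatch": "*_text",
    "mapping": {"type": "long"},
}

for name in ("long_num", "long_text", "num"):
    print(name, template_applies(longs_as_strings, name, "string"))
```

With a document such as { "long_num": "5", "long_text": "foo" }, only long_num would pick up the long mapping; long_text is excluded by unmatch and falls back to the default string handling.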

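Of the remaining conditions, match_pattern switches match from simple wildcards to regular expressions, while path_match and path_unmatch work like match and unmatch but test the full dotted path of the field.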

Dynamic templates also support two placeholders in the mapping, {name} and {dynamic_type}: name is the field name and dynamic_type is the detected type.
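The placeholder substitution amounts to a recursive string replacement over the template's mapping. The render function below is a hypothetical Python sketch of that idea, and the field names in the usage lines are invented.

```python
def render(mapping, name, dynamic_type):
    """Replace {name} and {dynamic_type} in every string value of a template mapping."""
    if isinstance(mapping, dict):
        return {key: render(val, name, dynamic_type) for key, val in mapping.items()}
    if isinstance(mapping, list):
        return [render(item, name, dynamic_type) for item in mapping]
    if isinstance(mapping, str):
        return mapping.replace("{name}", name).replace("{dynamic_type}", dynamic_type)
    return mapping  # numbers, booleans and null pass through unchanged


# A string field called "english" gets the analyzer of the same name.
print(render({"type": "text", "analyzer": "{name}"}, "english", "string"))

# A non-string field keeps its detected type but turns doc_values off.
print(render({"type": "{dynamic_type}", "doc_values": False}, "count", "long"))
```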

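Summary

That's all on Elasticsearch mappings for now. I think configuring mappings takes experience: the more cases you handle, the easier it becomes to tell what a good mapping looks like. Many mapping fields and parameters were not covered here; I look most of them up in the documentation when I need them. Still, I recommend skimming the documentation so that when a problem comes up you at least know roughly where to look. That alone puts you ahead of many people around you.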
