CDC in Practice: How to Handle Array Fields When Syncing MySQL to Elasticsearch in Real Time [CDC Series, Part 12]

Posted by 下午喝什么茶 on 2024-03-26

Background:

A field stored in MySQL needs to be synced to Elasticsearch and stored as an array, so that it can be queried conveniently.

The example below shows the expected query behavior.

PUT /t_test_1/_doc/1
{
  "name":"蘋果",
  "code":[1,2,3]
}

PUT /t_test_1/_doc/2
{
  "name":"香蕉",
  "code":[1,2]
}

PUT /t_test_1/_doc/3
{
  "name":"橙子",
  "code":[1,3,5]
}

# Query for documents whose code contains 1 or 3 (any one of them, or all)
POST /t_test_1/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "code": [
              "1",
              "3"
            ]
          }
        }
      ]
    }
  }
}

Data synchronization setup:

1. The sync pipeline uses Kafka + Confluent.
2. Component versions:

MySQL 5.7.x
Elasticsearch 7.15.1
Kafka 3.2.0
Debezium MySQL connector plug-in 1.9.4.Final
confluentinc-kafka-connect-elasticsearch 13.1.0

For details, see the earlier posts in this series.

After some research, I found two ways to implement this:

1. The first approach uses Elasticsearch's built-in ingest pipeline. Without further ado, straight to the code.

(1) In MySQL, design the num_array field as varchar, stored in the format 'a,b,c,d,e,f';

(2) After syncing, the field becomes an array in ES; map the corresponding num_array field as keyword for exact matching and better query performance.

# Create a pipeline
PUT _ingest/pipeline/string_to_array_pipeline
{
  "description": "Transfer the string which is concat with a separtor  to array.",
  "processors": [
    {
      "split": {
        "field": "num_array",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "update_user",
        "value": "system"
      }
    },
    {
      "set": {
        "field": "name",
        "value": "華山"
      }
    }
  ]
}
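
Before binding the pipeline to an index, it can be sanity-checked with the _simulate API; a quick test (the sample document below is made up for illustration):

POST _ingest/pipeline/string_to_array_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "num_array": "a,b,c"
      }
    }
  ]
}

The response should show num_array split into ["a", "b", "c"], with update_user and name overwritten by the two set processors.
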
## t_mountain: map the num_array field as keyword and set default_pipeline = string_to_array_pipeline
PUT /t_mountain
{
  "settings": {
    "default_pipeline": "string_to_array_pipeline"
  }, 
   "mappings" : {
      "date_detection" : false,
      "properties" : {
        "altitude" : {
          "type" : "float"
        },
        "create_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss || strict_date_optional_time || epoch_millis"
        },
        "create_user" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "desc" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "id" : {
          "type" : "long"
        },
        "latitude" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "location" : {
          "type" : "geo_point"
        },
        "logtitude" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "num_array" : {
          "type" : "keyword"
        },
        "ticket" : {
          "type" : "float"
        },
        "update_time" : {
          "type" : "date",
          "format" : "yyyy-MM-dd HH:mm:ss || strict_date_optional_time || epoch_millis"
        },
        "update_user" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
}
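
With the index created, any document indexed with a comma-separated num_array string should be stored as an array; a quick check (sample values are made up):

PUT t_mountain/_doc/1
{
  "name": "test",
  "num_array": "a,b"
}

GET t_mountain/_doc/1

The stored _source should contain num_array as ["a", "b"], and, per the pipeline's set processors, name forced to 華山 and update_user to system.
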
# Query
POST t_mountain/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "num_array": [
              "a",
              "b"
            ]
          }
        }
      ]
    }
  }
}

2. Use a Debezium custom converter to turn the JSON string into an array

For details, see gitee: debezium-custom-converter

or github: debezium-custom-converter

In short: MySQL stores a varchar such as 'a,b,c,d,e' or '["a","b","c","d"]'; the custom converter turns it into a list, so the field arrives in Elasticsearch as an array. All the Elasticsearch settings are the same as in approach 1. A sketch of how such a converter is wired into the source connector follows below.
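
Debezium registers custom converters through the converters property of the source connector config. A minimal sketch, assuming the converter class is named com.example.converter.StringToArrayConverter (the real class name lives in the linked repository); it adds two properties to the source config shown later in this post:

"converters": "arrayConverter",
"arrayConverter.type": "com.example.converter.StringToArrayConverter"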

------------------------ A puzzling phenomenon during data synchronization ---------------------

1. During synchronization, a very puzzling phenomenon appeared: MySQL insert statements synced to ES normally, the pipeline was applied, and the field came out as an array. But when the freshly inserted row was updated in MySQL, the data still synced to ES, yet the index's default_pipeline did not take effect.

2. Investigation showed the cause: the sink config sets "write.method": "upsert", and with that setting, when MySQL executes an update statement, the pipeline configured on the target index is bypassed.

The behavior is baffling!

3. Checking the official documentation, I could not find any explanation of how this setting ends up affecting the index's pipeline:

Method used for writing data to Elasticsearch, and one of INSERT or UPSERT. The default method is INSERT in which the connector constructs a document from the record value and inserts that document into Elasticsearch, completely replacing any existing document with the same ID; this matches previous behavior. The UPSERT method will create a new document if one with the specified ID does not yet exist, or will update an existing document with the same ID by adding or replacing only those fields present in the record value. The UPSERT method may require additional time and resources of Elasticsearch, so consider increasing the read.timeout.ms and decreasing the batch.size configuration properties.


4. So I reproduced the sync locally and confirmed the behavior, then enabled detailed trace logging in ES:

 bin/elasticsearch -E logger.org.elasticsearch.action=trace
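
If restarting the node is inconvenient, the same logger can also be raised at runtime through the cluster settings API:

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.action": "trace"
  }
}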

First, the sync configs:

The source config:

{
    "name": "goods-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "127.0.0.1",
        "database.port": "3306",
        "database.user": "debezium_user",
        "database.password": "@debezium2022",
        "database.server.id": "12358",
        "snapshot.mode": "when_needed",
        "database.server.name": "goods",
        "database.include.list": "goods",
        "table.include.list": "goods.t_mountain,goods.t_sku,goods.t_spu",
        "database.history.kafka.bootstrap.servers": "127.0.0.1:9092",
        "database.history.kafka.topic": "dbhistory.goods",
        "include.schema.changes": "true"
    }
}

The sink config:

{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "goods.goods.t_sku,goods.goods.t_spu,goods.goods.t_mountain,goods001.goods.t_format_date",
    "key.ignore": "false",
    "connection.url": "http://127.0.0.1:9200",
    "name": "elasticsearch-sink",
    "type.name": "_doc",
    "decimal.handling.mode": "string",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "true",
    "transforms.unwrap.delete.handling.mode": "drop",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "id"
  }
}

Note that this sink config does not set "write.method": "upsert", so the setting takes its default value, insert.

The logs:

A MySQL insert statement syncs to ES as an index operation (marked in the log below); the document already reflects the processing done by the index's pipeline:

[2024-03-25T01:22:16,232][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] send action [indices:data/write/bulk[s][p]] to local primary [[goods.goods.t_mountain][0]] for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]] with cluster state version [2312] to [NYz8ptioSBGMQoSB94VGew] 
[2024-03-25T01:22:16,233][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] [[goods.goods.t_mountain][0]] op [indices:data/write/bulk[s]] completed on primary for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]
[2024-03-25T01:22:16,235][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] operation succeeded. action [indices:data/write/bulk[s]],request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]

Looking at the Elasticsearch source code, the toString method of IndexRequest.java shows exactly what these log lines print.

From the log above, the bulk DSL that ES executed can be reconstructed; the document has already been processed by the index's pipeline:

POST _bulk
{"index":{"_id":"154","_index":"t_mountain"}}
{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}

A MySQL update statement also syncs to ES as an index operation (marked in the log below); again the document reflects the processing done by the index's pipeline:

[2024-03-25T01:23:18,419][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] send action [indices:data/write/bulk[s][p]] to local primary [[goods.goods.t_mountain][0]] for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711329798000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]] with cluster state version [2312] to [NYz8ptioSBGMQoSB94VGew] 
[2024-03-25T01:23:18,421][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] [[goods.goods.t_mountain][0]] op [indices:data/write/bulk[s]] completed on primary for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711329798000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]
[2024-03-25T01:23:18,422][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] operation succeeded. action [indices:data/write/bulk[s]],request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][154], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711329798000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]

The bulk DSL reconstructed from this log is essentially the same as above; both are plain index operations:

POST _bulk
{"index":{"_id":"154","_index":"t_mountain"}}
{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":154,"create_user":"111111","desc":"少華山在陝西渭南華州區"}

5. Now modify the sink config by adding one line, "write.method": "upsert", restart kafka-connect, and continue syncing.

The logs:

5.1. A MySQL insert statement now syncs to ES as an update operation (marked in the log below). The data appears in two places: the doc object (raw data, not processed by the index's pipeline) and the upsert object (the data after the index's pipeline has processed it):

[2024-03-22T05:44:42,500][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] send action [indices:data/write/bulk[s][p]] to local primary [[goods.goods.t_mountain][0]] for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [update {[goods.goods.t_mountain][_doc][151], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"id":151,"name":"少華山","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200.0,"create_user":"111111","create_time":1710419563000,"update_user":null,"update_time":1710420074000,"ticket":0.0,"desc":"少華山在陝西渭南華州區","num_array":"aaaaaa,b,ccccccc"}]}], upsert[index {[null][_doc][null], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}], scripted_upsert[false], detect_noop[true]}]] with cluster state version [1858] to [NYz8ptioSBGMQoSB94VGew] 
[2024-03-22T05:44:42,498][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] [[goods.goods.t_mountain][0]] op [indices:data/write/bulk[s]] completed on primary for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][151], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]
[2024-03-22T05:44:42,500][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] operation succeeded. action [indices:data/write/bulk[s]],request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][151], source[{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]

The toString method of UpdateRequest.java in the Elasticsearch source shows what these log lines print.

From the log, the DSL that ES executed can be reconstructed. The data synced correctly: what the index's pipeline produced was written to ES.

---- The content of the upsert object below was stored in ES, as expected.

POST _bulk
{"update":{"_id":"151","_index":"t_mountain"}}
{"doc":{"id":151,"name":"少華山","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200.0,"create_user":"111111","create_time":1710419563000,"update_user":null,"update_time":1710420074000,"ticket":0.0,"desc":"少華山在陝西渭南華州區","num_array":"aaaaaa,b,ccccccc"},"upsert":{"altitude":1200.0,"num_array":["aaaaaa","b","ccccccc"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1710420074000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"},"doc_as_upsert":false,"scripted_upsert":false, "detect_noop":true}

5.2. A MySQL update statement also syncs to ES as an update operation (marked in the log below). Again the data appears both in the doc object (raw, not processed by the index's pipeline) and in the upsert object (processed by the index's pipeline):

[2024-03-22T05:49:56,606][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] send action [indices:data/write/bulk[s][p]] to local primary [[goods.goods.t_mountain][0]] for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [update {[goods.goods.t_mountain][_doc][151], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"id":151,"name":"泰山","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200.0,"create_user":"111111","create_time":1710419563000,"update_user":"666666","update_time":1711086596000,"ticket":0.0,"desc":"少華山在陝西渭南華州區","num_array":"a,b,c,d,e,f,g,h"}]}], upsert[index {[null][_doc][null], source[{"altitude":1200.0,"num_array":["a","b","c","d","e","f","g","h"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711086596000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}], scripted_upsert[false], detect_noop[true]}]] with cluster state version [1927] to [NYz8ptioSBGMQoSB94VGew] 
[2024-03-22T05:49:56,610][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] [[goods.goods.t_mountain][0]] op [indices:data/write/bulk[s]] completed on primary for request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][151], source[{"altitude":1200.0,"num_array":"a,b,c,d,e,f,g,h","create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711086596000,"update_user":"666666","name":"泰山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]
[2024-03-22T05:49:56,613][TRACE][o.e.a.b.TransportShardBulkAction] [esserver001-9200] operation succeeded. action [indices:data/write/bulk[s]],request [BulkShardRequest [[goods.goods.t_mountain][0]] containing [index {[goods.goods.t_mountain][_doc][151], source[{"altitude":1200.0,"num_array":"a,b,c,d,e,f,g,h","create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711086596000,"update_user":"666666","name":"泰山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"}]}]]

The DSL reconstructed from this log:

---- This time the content of the doc object was stored in ES, while the content of upsert was never applied. This is exactly where the problem lies!

POST _bulk
{"update":{"_id":"151","_index":"t_mountain"}}
{"doc":{"id":151,"name":"泰山","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200.0,"create_user":"111111","create_time":1710419563000,"update_user":"666666","update_time":1711086596000,"ticket":0.0,"desc":"少華山在陝西渭南華州區","num_array":"a,b,c,d,e,f,g,h"},"upsert":{"altitude":1200.0,"num_array":["a","b","c","d","e","f","g","h"],"create_time":1710419563000,"ticket":0.0,"latitude":"34.497647","logtitude":"110.073028","update_time":1711086596000,"update_user":"system","name":"華山","location":"34.497647,110.073028","id":151,"create_user":"111111","desc":"少華山在陝西渭南華州區"},"doc_as_upsert":false,"scripted_upsert":false, "detect_noop":true}

In an Elasticsearch update, the doc content is a partial update: only the fields present in doc are updated, and fields not set in doc are left untouched. When both doc and upsert are present and the target document does not exist, then, per the documentation, "If the document does not already exist, the contents of the upsert element are inserted as a new document." That is why the sync in 5.1 behaved as expected. When the target document does exist, however, the doc content is applied and the upsert content is ignored; that is why the sync in 5.2 went wrong: the pipeline-processed data in upsert was never written to ES. This is the root cause of the bug.
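
This is easy to confirm by fetching the stored document after the update in 5.2 (index name as in the reconstructed DSL above); the _source still holds the raw comma-separated string instead of an array:

GET t_mountain/_doc/151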

Summary:

1. Set "write.method" ("upsert" vs. "insert") deliberately.

With "write.method": "upsert", a MySQL update statement reaches ES as a bulk partial update, which bypasses the pipeline configured on the index. The official documentation never calls this out explicitly; the description is vague, although on close reading the behavior can be inferred.

According to the discussion on GitHub, Confluent introduced this setting to prevent data from being overwritten, so use upsert when source records carry only partial fields. If every record the source reads contains the complete set of fields, for inserts and updates alike, nothing can be lost by overwriting, and insert is safe.

Because the data flows through debezium + kafka connect + confluent, there is no way to pass per-request parameters to Elasticsearch's bulk updates, so choose this setting carefully. For a typical MySQL-to-ES sync, insert is usually sufficient, and old data will not be clobbered on update.

If the sink config cannot be changed, an alternative is a separate sink that handles only the one index backed by the dedicated pipeline, optionally combined with an index alias; that also solves the problem cleanly, as sketched below.
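
A minimal sketch of such a dedicated sink, reusing the settings from the sink config above but limited to the one pipeline-backed topic and forced to insert (the connector name is illustrative):

{
  "name": "elasticsearch-sink-t-mountain",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "goods.goods.t_mountain",
    "key.ignore": "false",
    "connection.url": "http://127.0.0.1:9200",
    "type.name": "_doc",
    "write.method": "insert",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "true",
    "transforms.unwrap.delete.handling.mode": "drop",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "id"
  }
}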

2. When updating Elasticsearch directly, you can set the doc or upsert objects yourself; combined with the "doc_as_upsert" parameter, data can be updated correctly. The following walkthrough demonstrates this.

(The setup below reuses the string_to_array_pipeline and the t_mountain index exactly as created earlier in this post.)

# #A: partial bulk update of t_mountain
POST _bulk
{"update":{"_id":"155","_index":"t_mountain"}}
{"doc":{"id":155,"name":"泰山4","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200,"create_user":"111111","create_time":1710419563000,"update_user":"777777","update_time":1711086596000,"ticket":0,"desc":"少華山在陝西渭南華州區","num_array":"a,b,c,d"}}
# #B: bulk insert of a new document into t_mountain
POST _bulk
{"index":{"_id":"155","_index":"t_mountain"}}
{"id":155,"name":"泰山3","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200,"create_user":"111111","create_time":1710419563000,"update_user":"777777","update_time":1711086596000,"ticket":0,"desc":"少華山在陝西渭南華州區","num_array":"a,b,c,d"}

GET t_mountain/_search?version=true
{
  "query": {
    "term": {
      "id": {
        "value": "155"
      }
    }
  }
}

-- What this example shows:

1. Run #B first: the write succeeds and the index's pipeline takes effect. Then run #A: the update succeeds, but the index's pipeline does not take effect. (This reproduces the problem above perfectly.)

2. Run #A first: it fails with "reason" : "[_doc][155]: document missing", because "doc_as_upsert" defaults to false. Set it to true and the write succeeds, and the index's pipeline runs normally.

3. Run #B first: the write succeeds and the pipeline takes effect. Then run #A with "doc_as_upsert": true: the update succeeds and the index's pipeline runs normally.

---- In these requests, "doc_as_upsert": true tells Elasticsearch that if the specified document does not exist, the contents of doc should be inserted as a new document. Instead of a "document missing" error, a new document is created.

In short, the "doc_as_upsert": true parameter gives us the flexibility to handle Elasticsearch update scenarios correctly.
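
For reference, request #A with the flag enabled is the same body plus one field:

POST _bulk
{"update":{"_id":"155","_index":"t_mountain"}}
{"doc":{"id":155,"name":"泰山4","location":"34.497647,110.073028","latitude":"34.497647","logtitude":"110.073028","altitude":1200,"create_user":"111111","create_time":1710419563000,"update_user":"777777","update_time":1711086596000,"ticket":0,"desc":"少華山在陝西渭南華州區","num_array":"a,b,c,d"},"doc_as_upsert":true}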

References:
Elasticsearch official documentation: Update API
Elasticsearch official documentation: Bulk API
kafka-connect-elasticsearch
Confluent documentation
