Middleware --- Binlog Transfer and Synchronization --- Maxwell

Posted by FeelTouch on 2019-02-17

Original URL: https://blog.csdn.net/feeltouch/article/details/87534686

1. Introduction

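Maxwell is a tool written in Java that reads and parses the MySQL binlog and sends row updates as JSON to Kafka, RabbitMQ, AWS Kinesis, Google Cloud Pub/Sub, or a file. Once an incremental data stream is available, the possible applications are numerous: ETL, cache maintenance, collecting table-level DML metrics, incremental feeds into a search engine, data partition migration, binlog-based rollback schemes when switching databases, and so on.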

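It also provides other features:

Supports full initial data loading in the style of SELECT * FROM table
Automatically recovers the binlog position after a failover on the primary (GTID)
Flexible data partitioning to address data skew. For Kafka, data can be partitioned at the database, table, or column level

It works by posing as a replica of a MySQL server: it receives binlog events and assembles them using schema information, and it supports DDL, xid, row, and other event types.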

Maxwell is open-sourced by Zendesk: https://github.com/zendesk/maxwell , and its maintainers are quite active.

Others have already compared Alibaba Canal, Zendesk Maxwell, and Yelp mysql_streamer; see the referenced article 實時抓取MySQL的更新資料到Hadoop ("Capturing MySQL updates into Hadoop in real time").

A tool with similar functionality: http://debezium.io/docs/connectors/mysql/

Installation

Using Maxwell is very simple; it only needs a JDK environment.

yum install -y java-1.8.0-openjdk-1.8.0.121-1.b13.el6.x86_64

curl -sLo - https://github.com/zendesk/maxwell/releases/download/v1.12.0/maxwell-1.12.0.tar.gz \
       | tar zxvf -
cd maxwell-1.12.0

# By default, Maxwell looks for a config.properties file in the current directory

The MySQL server's binlog format must be ROW, and row_image must be FULL. Here is a taste of the output:

mysql> update test.e set m = 5.444, c = now(3) where id = 1;
{
   "database":"test",
   "table":"e",
   "type":"update",
   "ts":1477053234,
   "commit": true,
   ...
   "data":{
      "id":1,
      "m":5.444,
      "c":"2016-10-21 05:33:54.631000",
      "comment":"I am a creature of light."
   },
   "old":{
      "m":4.2341,
      "c":"2016-10-21 05:33:37.523000"
   }
}

mysql> create table test.e ( ... )
{
   "type":"table-create",
   "database":"test",
   "table":"e",
   "def":{
      "database":"test",
      "charset":"utf8mb4",
      "table":"e",
      "columns":[
         {"type":"int", "name":"id", "signed":true},
         {"type":"double", "name":"m"},
         {"type":"timestamp", "name":"c", "column-length":6},
         {"type":"varchar", "name":"comment", "charset":"latin1"}
      ],
      "primary-key":[
         "id"
      ]
   },
   "ts":1477053126000,
   "sql":"create table test.e ( id int(10) not null primary key auto_increment, m double, c timestamp(6), comment varchar(255) charset 'latin1' )",
   "position":"master.000006:800050"
}

data is the after image and old is the before image. An insert has only the after image; a delete has only the before image, which is carried in data.
type is the statement type: insert, update, delete, database-create, database-alter, database-drop, table-create, table-alter, table-drop.
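The relationship between data and old can be sketched in a few lines of Python. This is only an illustration for a consumer, assuming (as the update example above suggests) that old lists just the columns whose values changed; the function name before_image is made up for this sketch and is not part of Maxwell.

```python
import json

def before_image(event):
    """Rebuild the full pre-change row from a Maxwell row event."""
    kind = event["type"]
    if kind == "insert":
        return None                      # nothing existed before an insert
    if kind == "delete":
        return dict(event["data"])       # a delete carries the before image in data
    if kind == "update":
        # old lists only the changed columns; unchanged ones come from data
        return {**event["data"], **event.get("old", {})}
    raise ValueError(f"not a row event: {kind}")

# The update event shown above, with the elided fields left out.
sample = json.loads("""
{
  "database": "test", "table": "e", "type": "update", "ts": 1477053234,
  "data": {"id": 1, "m": 5.444, "c": "2016-10-21 05:33:54.631000",
           "comment": "I am a creature of light."},
  "old": {"m": 4.2341, "c": "2016-10-21 05:33:37.523000"}
}
""")

row_before = before_image(sample)
print(row_before)
```

For the sample event this restores m and c to their earlier values while keeping id and comment unchanged.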

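Basic configuration

Every option in the config.properties file can also be given on the command line when starting Maxwell with ./bin/maxwell, which overrides the value in the config file. Only the commonly used options are covered here.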

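mysql options

host
The address of the MySQL server to read the binlog from.
replication_host
If replication_host is set, it is the address of the MySQL server that actually supplies the binlog, and host above is used only to store Maxwell's table schemas and binlog positions.
Separating the two keeps replication_user from writing data to the production database.
schema_host
The host to read table schemas from. The binlog carries no column information, so Maxwell has to query the schema from the database and store it.
schema_host is rarely needed, but it is very useful in a binlog-proxy scenario. For example, to turn binlogs that are already offline into a JSON stream with Maxwell, you can set up your own MySQL server that holds no table structures and only serves the binlog; the table structures can then be fetched automatically from schema_host.
gtid_mode
If the MySQL server has GTID enabled, Maxwell can also fetch events based on GTID. If the MySQL server fails over, Maxwell does not need a new file:position to be specified manually.

Normally neither replication_host nor schema_host needs to be set; a single --host is enough.

schema_database
The database used to store the tables Maxwell needs, such as the replicated databases, tables, columns, positions, and heartbeats.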

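filtering

include_dbs
Send only the changes in the binlog that belong to these databases. Separate names with commas and do not put spaces between them.
Java-style regular expressions are also supported, e.g. include_dbs=db1,/db\\d+/ matches db1, db2, db3, and so on. (All of the filters below support this regex form.)
Note: dbs here means the database that was actually modified. For example, the binlog may contain use db1 followed by update db2.ttt; the database in the JSON that Maxwell produces is then db2.
exclude_dbs
Exclude the listed databases.
include_tables
Send only the data changes of these tables. The database does not need to be specified.
exclude_tables
Exclude the listed tables.
exclude_columns
Do not output these columns. If a column name does not exist in the row, this filter is ignored.
include_column_values
A filter introduced in 1.12.0. Only rows that satisfy column=value are output. For example, with include_column_values=bar=x,foo=y, a row that has a bar column is output only when its value is x, and a row that has a foo column is output only when its value is y.
A condition whose column is missing from the row also counts as satisfied, e.g. a row that has bar=x but no foo column still passes. (So the conditions are combined as neither a plain OR nor a plain AND.)
blacklist_dbs
Rarely used. By name alone, blacklist_dbs is hard to tell apart from exclude_dbs, and the official documentation is ambiguous as well.
Judging from the code, it blocks the schema changes of the listed dbs and tables and has nothing to do with row-change filtering. The scenario it targets is a table that sees frequent DDL, such as truncate.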
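The include_column_values rule described above is easy to misread, so here is a small Python sketch of it as stated in this post: a condition is checked only when the row actually has that column. This is an illustration, not Maxwell's code; the function name and the string comparison of values are assumptions.

```python
def passes_column_values(row, wanted):
    """Return True if the row satisfies every condition whose column it has."""
    for column, value in wanted.items():
        # A condition on a column the row does not have is simply skipped.
        if column in row and str(row[column]) != value:
            return False
    return True

# Corresponds to include_column_values=bar=x,foo=y
wanted = {"bar": "x", "foo": "y"}

print(passes_column_values({"bar": "x"}, wanted))              # bar matches, foo absent
print(passes_column_values({"bar": "z"}, wanted))              # bar present but wrong
print(passes_column_values({"bar": "x", "foo": "n"}, wanted))  # foo present but wrong
print(passes_column_values({"id": 1}, wanted))                 # neither column present
```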

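We usually only need to watch changes on some of the tables, so pay attention to how include and exclude interact. Remember three points:

If include has a value, everything not in include is excluded
Anything in exclude is excluded
Everything else is output normally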

A somewhat extreme example:

# database: db1,db2,db3,mydb
① include_dbs=db1,/db\\d+/
② exclude_dbs=db2
③ include_tables=t1,t2,t3
④ exclude_tables=t3

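include_dbs is set and mydb is not in it, so mydb is excluded;
exclude_dbs is set, so db2 is excluded. That leaves db1 and db3.
The same applies to tables, leaving t1 and t2.
So db1.t1, db1.t2, db3.t1, and db3.t2 are what remains for output after filtering. If include_dbs were not set, mydb.t1 could be output as well.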
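The three rules and the example above can be checked with a short Python sketch. It is a model of the rules as this post describes them, not Maxwell's implementation; the helper names are made up, and treating a /.../ entry as a regex that must match the whole name is an assumption.

```python
import re

def matches(name, patterns):
    """True if name equals a plain entry or fully matches a /regex/ entry."""
    for pattern in patterns:
        if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
            if re.fullmatch(pattern[1:-1], name):
                return True
        elif pattern == name:
            return True
    return False

def allowed(name, include, exclude):
    # Rule 1: a non-empty include list excludes everything not in it.
    if include and not matches(name, include):
        return False
    # Rule 2: anything in exclude is dropped. Rule 3: the rest passes.
    return not matches(name, exclude)

def row_is_output(db, table, cfg):
    return (allowed(db, cfg["include_dbs"], cfg["exclude_dbs"])
            and allowed(table, cfg["include_tables"], cfg["exclude_tables"]))

cfg = {
    "include_dbs": ["db1", r"/db\d+/"],
    "exclude_dbs": ["db2"],
    "include_tables": ["t1", "t2", "t3"],
    "exclude_tables": ["t3"],
}

survivors = [
    f"{db}.{table}"
    for db in ["db1", "db2", "db3", "mydb"]
    for table in ["t1", "t2", "t3", "t4"]
    if row_is_output(db, table, cfg)
]
print(survivors)
```

Clearing include_dbs in cfg lets mydb.t1 through as well, matching the last sentence of the example.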

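formatting

output_ddl
Whether to include DDL statements in the output JSON stream. Default: false
output_binlog_position
Whether to include the binlog filename:position in the output JSON stream. Default: false
output_commit_info
Whether to include commit and xid information in the output JSON stream. Default: true
For example, when one transaction changes several tables, or several rows of one table, all of those rows share the same xid, and the last row event carries the commit:true field. This helps consumers replay whole transactions rather than just individual rows.
output_thread_id
Likewise, the binlog also contains the thread_id, which can be included in the output. Default: false
Consumers can use it to replay transactions at a coarser granularity. Another use is user auditing: if each login records the login IP, login time, user name, and thread_id in a table, a record in the binlog can easily be traced through thread_id to the user who made the change.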
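As a rough illustration of how a consumer might use xid and commit for transaction-level replay, the Python sketch below buffers row events per xid and releases a batch when the event marked commit:true arrives. The event stream is invented for the example, and the function name is hypothetical.

```python
def group_transactions(events):
    """Yield lists of row events, one list per committed transaction."""
    pending = {}
    for event in events:
        xid = event["xid"]
        pending.setdefault(xid, []).append(event)
        if event.get("commit"):
            # The last row event of a transaction carries commit: true.
            yield pending.pop(xid)

# Invented stream: transaction 101 touches two tables, 102 touches one row.
stream = [
    {"database": "test", "table": "a", "type": "insert", "xid": 101, "data": {"id": 1}},
    {"database": "test", "table": "b", "type": "update", "xid": 101, "commit": True,
     "data": {"id": 9}, "old": {"n": 0}},
    {"database": "test", "table": "a", "type": "delete", "xid": 102, "commit": True,
     "data": {"id": 1}},
]

transactions = list(group_transactions(stream))
for txn in transactions:
    print(txn[0]["xid"], [f'{e["table"]}:{e["type"]}' for e in txn])
```

Rows of a transaction whose commit marker has not arrived yet stay buffered and are not yielded.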

monitoring

For a long-running Maxwell, add the monitoring configuration; Maxwell provides an HTTP API that returns monitoring data.

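Other

init_position
Manually specifies the binlog file and position Maxwell should start from, in the format FILE:POSITION:HEARTBEAT. It can only be given on the command line when starting Maxwell, e.g. --init_position=mysql-bin.0000456:4:0.
By default Maxwell starts parsing from the server's current position at the time it connects. If init_position is given, make sure the file really exists; if the binlog has already been purged, another approach may be needed. See the article Binlog視覺化搜尋：實現類似阿里RDS資料追蹤功能 ("Visual binlog search: implementing something like Alibaba RDS data tracking").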

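2. Choosing a suitable producer

Maxwell parses the binlog into JSON, a fairly general format, and you choose where that output goes, such as Kafka, RabbitMQ, or a file; in most cases it is sent to a message queue. Each producer has its own set of options.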

2.1 file

producer=file
output_file=/tmp/mysql_binlog_data.log

This one is simple: output_file names the file to write to. Any log collection system you have can pick the data up from there.
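To give an idea of what picking the data up can look like, here is a minimal Python sketch that reads such a file and counts DML events per table. It assumes the file holds one JSON object per line; the sample rows are invented, and an in-memory buffer stands in for the real output_file.

```python
import io
import json
from collections import Counter

def read_events(lines):
    """Yield one parsed event per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def dml_counts(events):
    """Count row events per (database, table, type)."""
    counts = Counter()
    for event in events:
        if event.get("type") in ("insert", "update", "delete"):
            counts[(event["database"], event["table"], event["type"])] += 1
    return counts

# Stand-in for open("/tmp/mysql_binlog_data.log"); the rows are made up.
sample_file = io.StringIO(
    '{"database":"test","table":"e","type":"insert","data":{"id":1}}\n'
    '{"database":"test","table":"e","type":"update","data":{"id":1},"old":{"m":4.2}}\n'
    '{"database":"test","table":"f","type":"insert","data":{"id":7}}\n'
)

counts = dml_counts(read_events(sample_file))
for key, n in sorted(counts.items()):
    print(key, n)
```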

2.2 rabbitmq

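RabbitMQ is a very popular message queue service that implements the AMQP protocol. For an introduction, see rabbitmq入門 ("Getting started with RabbitMQ").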

producer=rabbitmq

rabbitmq_host=10.81.xx.xxx
rabbitmq_user=admin
rabbitmq_pass=admin
rabbitmq_virtual_host=/some0
rabbitmq_exchange=maxwell.some
rabbitmq_exchange_type=topic
rabbitmq_exchange_durable=true
rabbitmq_exchange_autodelete=false
rabbitmq_routing_key_template=%db%.%table%

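The parameters above are all easy to understand. Version 1.12.0 adds rabbitmq_message_persistent, which controls whether published messages are persistent.
rabbitmq_routing_key_template sets the routing_key in db.tbl form. When creating queues, different tables can be routed into different queues, which improves the ability to consume in parallel without reordering.

Because RabbitMQ is very easy to set up, it is what I usually use.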
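To make the routing key concrete, the following Python sketch renders the %db%.%table% template for an event the way the description above implies. It only models the string substitution for illustration; the function name is made up, and the actual rendering happens inside Maxwell.

```python
def render_routing_key(template, event):
    """Fill the %db% and %table% placeholders from a Maxwell event."""
    return (template
            .replace("%db%", event["database"])
            .replace("%table%", event["table"]))

event = {"database": "test", "table": "e", "type": "update"}
key = render_routing_key("%db%.%table%", event)
print(key)
```

A queue bound to the topic exchange with the binding key test.e would then receive only the changes of that one table.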

2.3 kafka

Kafka is the producer Maxwell supports most completely, and several Kafka client versions are built in (0.8.2.2, 0.9.0.1, 0.10.0.1, 0.10.2.1, or 0.11.0.1). The default is kafka_version=0.11.0.1

producer=kafka

# Addresses of the Kafka brokers
kafka.bootstrap.servers=hosta:9092,hostb:9092

# The Kafka topic can be fixed, or a dynamic per-table topic such as `maxwell_%{database}_%{table}` that is created automatically
kafka_topic=maxwell

# Separate topic for DDL
ddl_kafka_topic=maxwell_ddl

# Both Kafka and Kinesis support partitioning; you can partition by database, table, primary_key, or column values
# Maxwell uses database by default. On startup it checks whether the topic has enough partitions, so create the topic in advance:
#  bin/kafka-topics.sh --zookeeper ZK_HOST:2181 --create \
#                      --topic maxwell --partitions 20 --replication-factor 2
producer_partition_by=database

# If producer_partition_by=column is set, the next two parameters are required
# Partition by the values of the user_id and create_date columns; the partition key looks like 1178532016-10-10 18:29:04
producer_partition_columns=user_id,create_date
# If user_id or create_date does not exist, partition by database:
producer_partition_by_fallback=database

Maxwell reads every parameter that starts with kafka. and passes it to the Kafka connection settings, e.g. kafka.acks=1, kafka.retries=3.
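The column-based partitioning in the config comments above can be sketched in Python as follows. The key construction (concatenating the column values, falling back to the database name) follows those comments; everything else is an assumption made for illustration, in particular the function names and the use of CRC32 as a stand-in for whatever hash Maxwell and Kafka really apply.

```python
import zlib

def partition_key(event, by, columns=(), fallback="database"):
    """Build the string that decides which partition an event goes to."""
    if by == "column":
        data = event.get("data", {})
        if all(column in data for column in columns):
            # Concatenate the column values in the configured order.
            return "".join(str(data[column]) for column in columns)
        by = fallback                    # a configured column is missing
    if by == "database":
        return event["database"]
    if by == "table":
        return event["table"]
    raise ValueError(f"unsupported partition mode: {by}")

def partition_for(key, num_partitions):
    # CRC32 is only a placeholder hash for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

order = {"database": "shop", "table": "orders",
         "data": {"user_id": 117853, "create_date": "2016-10-10 18:29:04"}}
other = {"database": "shop", "table": "items", "data": {"id": 3}}

key_order = partition_key(order, "column", ["user_id", "create_date"])
key_other = partition_key(other, "column", ["user_id", "create_date"])
print(key_order, partition_for(key_order, 20))
print(key_other, partition_for(key_other, 20))
```

Rows with the same column values always map to the same partition, which keeps their relative order while spreading a busy database across partitions.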

2.4 redis

Redis also has a simple publish/subscribe (pub/sub) feature.

producer=redis

redis_host=10.47.xx.xxx
redis_port=6379
# redis_auth=redis_auth
redis_database=0
redis_pub_channel=maxwell

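After trying it out, however, I found that if no subscriber is connected, every published message is lost. So it is better to implement this with push/pop.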

3. Caveats

Below is a summary of some small problems encountered in use.

timestamp column

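Maxwell treats all time types (datetime, timestamp, date) as strings. This is partly to keep the data consistent: a value such as 0000-00-00 00:00:00 is illegal for a timestamp, yet MySQL accepts it, and parsing it into a Java or Python type would yield null/None.

A timestamp column in a MySQL table is timezone-aware: what is parsed out of the binlog is UTC time, while users see local time. For example, if f_create_time timestamp was created at 2018-01-05 21:01:01 Beijing time, MySQL actually stores 2018-01-05 13:01:01, and the binlog contains that same time string. If the consumer does no timezone conversion, the value is 8 hours behind. I was bitten badly by this.

Rather than making every client deal with this, I think a more reasonable approach would be for Maxwell to offer a timezone parameter and handle the conversion automatically. Otherwise the client either has to know in advance which columns are of timestamp type, or has to connect to the source database and cache those types.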
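Until such a parameter exists, the conversion has to happen on the consumer side. A minimal Python sketch, assuming the consumer already knows which columns are timestamp columns and that the local zone is a fixed UTC+8; the function name is made up for this example.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))   # fixed UTC+8 offset

def utc_string_to_local(value, tz=BEIJING):
    """Convert a UTC timestamp string from the JSON stream to local time."""
    if value is None or value.startswith("0000-00-00"):
        return None                      # zero dates have no valid datetime form
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in value else "%Y-%m-%d %H:%M:%S"
    utc = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return utc.astimezone(tz)

local = utc_string_to_local("2018-01-05 13:01:01")
print(local.strftime("%Y-%m-%d %H:%M:%S"))
```

For the f_create_time example above this yields 2018-01-05 21:01:01 again; datetime columns should not go through this function, since only timestamp values are stored as UTC.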

binary column

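Maxwell can handle binary columns such as blob and varbinary. It base64-encodes the binary column and outputs it to JSON as a string. After receiving the column data, the consumer cannot use it directly and must base64-decode it first.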
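On the consumer side the decoding step might look like the Python sketch below. The column name payload and the helper name are invented for the example, and the consumer is assumed to know from the table schema which columns are binary.

```python
import base64

def decode_binary_columns(row, binary_columns):
    """Return a copy of the row with the listed columns base64-decoded to bytes."""
    decoded = dict(row)
    for name in binary_columns:
        if decoded.get(name) is not None:    # leave NULL values untouched
            decoded[name] = base64.b64decode(decoded[name])
    return decoded

# Invented row: payload stands for a blob column as it appears in the JSON.
row = {"id": 1, "payload": base64.b64encode(b"\x00\x01binary").decode("ascii")}
restored = decode_binary_columns(row, ["payload"])
print(restored["payload"])
```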

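Table schema out of sync

If you take fairly old binlogs and put them on a new MySQL server for Maxwell to pull, the table structure may have changed in the meantime, e.g. the binlog has one more column than the schema on schema_host. So far I have seen no errors in this situation. For example, Alibaba RDS by default adds a __##alibaba_rds_rowid##__ column to tables that have no primary key and no unique index. This hidden primary key is visible neither in show create table nor in the schema, but it is present in the binlog and is replicated to replicas.

We also manage schema versions through git, so if this scenario does come up, we can deal with it.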

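Large-transaction binlog

When a single transaction generates a very large amount of binlog, for example when migrating the data of a daily table, Maxwell controls its memory use by automatically spilling the binlog events it cannot process in time to the file system.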

Using kafka version: 0.11.0.1
21:16:07,109 WARN  MaxwellMetrics - Metrics will not be exposed: metricsReportingType not configured.
21:16:07,380 INFO  SchemaStoreSchema - Creating maxwell database
21:16:07,540 INFO  Maxwell - Maxwell v?? is booting (RabbitmqProducer), starting at Position[BinlogPosition[mysql-bin.006235:24980714], lastHeartbeat=0]
21:16:07,649 INFO  AbstractSchemaStore - Maxwell is capturing initial schema
21:16:08,267 INFO  BinlogConnectorReplicator - Setting initial binlog pos to: mysql-bin.006235:24980714
21:16:08,324 INFO  BinaryLogClient - Connected to rm-xxxxxxxxxxx.mysql.rds.aliyuncs.com:3306 at mysql-bin.006235/24980714 (sid:6379, cid:9182598)
21:16:08,325 INFO  BinlogConnectorLifecycleListener - Binlog connected.
03:15:36,104 INFO  ListWithDiskBuffer - Overflowed in-memory buffer, spilling over into /tmp/maxwell7935334910787514257events
03:17:14,880 INFO  ListWithDiskBuffer - Overflowed in-memory buffer, spilling over into /tmp/maxwell3143086481692829045events

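I ran into another problem, though: shortly after the overflow, the exception EventDataDeserializationException: Failed to deserialize data of EventHeaderV4 appeared. When I started another Maxwell and pointed it at the earlier binlog position, no exception was thrown, and the data afterwards also showed that nothing had been lost.

The cause is still unclear. Caused by: java.net.SocketException: Connection reset suggests that the connection was closed abnormally before a complete event had been read from the binlog stream. The problem is stubborn, and similar issues on GitHub have not reached a clear resolution. (This also indirectly tells us that large-table data migrations should be done in batches rather than with a single insert into .. select.)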

03:18:20,586 INFO  ListWithDiskBuffer - Overflowed in-memory buffer, spilling over into /tmp/maxwell5229190074667071141events
03:19:31,289 WARN  BinlogConnectorLifecycleListener - Communication failure.
com.github.shyiko.mysql.binlog.event.deserialization.EventDataDeserializationException: Failed to deserialize data of EventHeaderV4{timestamp=1514920657000, eventType=WRITE_ROWS, serverId=2115082720, headerLength=19, dataLength=8155, nextPosition=520539918, flags=0}
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:216) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.nextEvent(EventDeserializer.java:184) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:890) [mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:559) [mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:793) [mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_121]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
        at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:51) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:202) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:184) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readInteger(ByteArrayInputStream.java:46) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeLong(AbstractRowsEventDataDeserializer.java:212) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeCell(AbstractRowsEventDataDeserializer.java:150) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.AbstractRowsEventDataDeserializer.deserializeRow(AbstractRowsEventDataDeserializer.java:132) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserializeRows(WriteRowsEventDataDeserializer.java:64) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:56) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.WriteRowsEventDataDeserializer.deserialize(WriteRowsEventDataDeserializer.java:32) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        at com.github.shyiko.mysql.binlog.event.deserialization.EventDeserializer.deserializeEventData(EventDeserializer.java:210) ~[mysql-binlog-connector-java-0.13.0.jar:0.13.0]
        ... 5 more
03:19:31,514 INFO  BinlogConnectorLifecycleListener - Binlog disconnected.
03:19:31,590 WARN  BinlogConnectorReplicator - replicator stopped at position: mysql-bin.006236:520531744 -- restarting
03:19:31,595 INFO  BinaryLogClient - Connected to rm-xxxxxx.mysql.rds.aliyuncs.com:3306 at mysql-bin.006236/520531744 (sid:6379, cid:9220521)

tableMapCache

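As mentioned earlier, to capture binlog changes for only a few tables, use include_tables to filter. But suppose a table t1 has now been dropped on the MySQL server while I am reading the binlog starting from yesterday: Maxwell cannot fetch the structure of the dropped table t1 at startup. Yesterday's binlog then contains changes to t1, and because no table structure can be found to assemble the JSON, an exception is thrown.

Manually inserting records into maxwell.tables/columns works. The root of the problem, however, is that Maxwell applies its binlog filter only when handling a row_event, while tableMapCache requires every table in the binlog to be present.

I made a commit of my own so that tableMapCache is only required to cache the tables selected by include_dbs/tables: https://github.com/seanlook/maxwell/commit/2618b70303078bf910a1981b69943cca75ee04fb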

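Improving consumer performance

With RabbitMQ the routing_key is %db%.%table%, but some tables generate a very large volume of binlog changes, which leaves the message counts very uneven across queues. Since concurrent replay at the transaction xid or thread_id level has not been implemented yet, the smallest queue granularity is still a table: give each high-volume table its own queue where possible, and put the low-volume tables together in one.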
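The queue layout described above can be planned with a few lines of Python. This is a hypothetical sketch: the queue names, the table list, and the helper are invented, and it only computes which binding keys each queue would get, without talking to RabbitMQ.

```python
def plan_queues(tables, hot_tables, shared_queue="maxwell.shared"):
    """Map each queue name to the routing keys (db.table) bound to it."""
    plan = {}
    for table in tables:
        # High-volume tables get a dedicated queue; the rest share one.
        queue = f"maxwell.{table}" if table in hot_tables else shared_queue
        plan.setdefault(queue, []).append(table)
    return plan

tables = ["shop.orders", "shop.order_log", "shop.users", "shop.coupons"]
hot = {"shop.orders", "shop.order_log"}

plan = plan_queues(tables, hot)
for queue, keys in plan.items():
    print(queue, keys)
```

Each queue can then be consumed by a single worker, which keeps per-table ordering while the hot tables no longer hold up the small ones.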
