Oracle GoldenGate is Oracle's real-time data replication software, supporting relational databases as well as a variety of big data platforms. Starting with GoldenGate 12.2, data can be delivered directly to Kafka and other platforms without custom Java development. During replication, GoldenGate acts as a Kafka producer: it parses incremental change data from the relational database and writes it to the Kafka platform in real time. The latest GoldenGate release at the time of writing is 12.3.1.1.1.
As the figure below shows, GoldenGate supports delivery not only to Kafka but also to other big data platforms.
This article focuses on how to deliver incremental data to the Kafka platform.
Environment preparation
Installation media
GoldenGate media download
http://www.oracle.com/technetwork/middleware/goldengate/downloads/index.html
The Kafka media can be downloaded from the official site, kafka.apache.org.
Software installation
A GoldenGate replication topology is generally divided into a source side and a target side. In the GoldenGate for Kafka scenario, the source is usually a relational database, while the target consists of the GoldenGate for Kafka node plus the Kafka cluster.
Running Kafka requires Zookeeper to be installed first. Installation steps for Zookeeper and Kafka are widely documented online and are not repeated here.
This article focuses on the GoldenGate for Kafka functionality; installing and configuring GoldenGate for DB is out of scope. Installing GoldenGate for Big Data on the target side requires a JDK, version 1.7 or later. After installing the JDK, set the JAVA_HOME environment variable accordingly and add $JAVA_HOME/bin to the PATH environment variable.
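A minimal sketch of these environment settings, assuming the JDK lives under /usr/lib/jvm/jdk1.8.0 (a hypothetical path; point it at your actual installation):
# JDK path below is an assumption; replace it with the actual install location
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
export PATH=$JAVA_HOME/bin:$PATH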
The node where GoldenGate is installed must be able to reach the Kafka cluster, so the Kafka client libraries (kafka libs) must also be present on that node; the corresponding path is set later in the kafka.props file.
The GoldenGate installation media is a ZIP archive. Unzip it, then extract the tar file inside, and the installation is complete. The installed directory contains examples for reference:
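A minimal installation sketch, assuming hypothetical archive names and the /opt/GoldenGate12.2.1.1 directory used later in this article; substitute the actual file names from the download:
# Archive names below are placeholders for the actual downloaded media
mkdir -p /opt/GoldenGate12.2.1.1 && cd /opt/GoldenGate12.2.1.1
unzip /tmp/OGG_BigData_Linux_x64.zip
tar -xf OGG_BigData_Linux_x64.tar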
GoldenGate for Kafka configuration
GoldenGate delivery process (Replicat) parameters
REPLICAT myka
-- add replicat myka, exttrail ./dirdat/ea
TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP gg_src.*, TARGET gg_src.*;
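Before the process can be started, the Replicat group itself must be registered in GGSCI and attached to the trail, as the comment in the parameter file indicates; a minimal sketch:
GGSCI> add replicat myka, exttrail ./dirdat/ea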
Kafka-related properties
hadoop@ubuntu2:/opt/GoldenGate12.2.1.1/dirprm$ more kafka.props
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.type = kafka
gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
gg.handler.kafkahandler.TopicName =mykaf4
gg.handler.kafkahandler.format =avro_op
gg.handler.kafkahandler.SchemaTopicName=mySchemaTopic
gg.handler.kafkahandler.BlockingSend =false
gg.handler.kafkahandler.includeTokens=false
gg.handler.kafkahandler.mode =tx
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
gg.classpath=dirprm/:/opt/kafka/libs/*:
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

hadoop@ubuntu2:/opt/GoldenGate12.2.1.1/dirprm$ more custom_kafka_producer.properties
bootstrap.servers=localhost:9092
acks=1
compression.type=gzip
reconnect.backoff.ms=1000
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# 100KB per partition
batch.size=102400
linger.ms=10000
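If automatic topic creation is disabled on the broker, the target topic named above (mykaf4) can be created in advance; a sketch using the Kafka command-line tools of this version, with assumed partition and replication settings:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mykaf4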
Testing
Make sure the Zookeeper and Kafka processes are running.
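If they are not running yet, they can be started from the Kafka installation directory; a minimal sketch using the bundled scripts and default configuration files:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties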
Start the GoldenGate delivery process:
GGSCI (ubuntu2) 12> start myka
Sending START request to MANAGER ...
REPLICAT MYKA starting
Check the status:
GGSCI (ubuntu2) 21> info myka
REPLICAT MYKA Last Started 2017-12-18 12:59 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:00:01 ago)
Process ID 42206
Log Read Checkpoint File ./dirdat/ea000000038
2016-08-28 21:18:20.980481 RBA 2478
Check the statistics for the incremental data: three records have been written.
GGSCI (ubuntu2) 22> stats myka, total
Sending STATS request to REPLICAT MYKA ...

Start of Statistics at 2017-12-18 13:05:09.

Replicating from GG_SRC.TB_HIVE to gg_src.TB_HIVE:

*** Total statistics since 2017-12-18 12:59:05 ***
        Total inserts                              3.00
        Total updates                              0.00
        Total deletes                              0.00
        Total discards                             0.00
        Total operations                           3.00

End of Statistics.
Check the Kafka cluster using the console consumer:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic mykaf4
The three records below are output. Besides the column data, they carry auxiliary information such as the source table structure, the commit time on the source side, and the time of the current insert; what gets included in the output can be controlled in the kafka.props file.
GG_SRC.TB_HIVEI42016-08-28 12:43:21.97963642017-12-18T12:59:05.368000(00000000380000001916ID1cd&2016-02-08:00:00:00:2016-12-11:11:00:02.000000000
GG_SRC.TB_HIVEI42016-08-28 12:47:24.98154442017-12-18T12:59:05.462000(00000000380000002103IDcd22&2016-02-08:00:00:00:2016-12-11:11:00:02.000000000
GG_SRC.TB_HIVEI42016-08-28 13:18:20.98048142017-12-18T12:59:05.462001(00000000380000002292IDcd22&2016-02-08:00:00:00:2016-12-11:11:00:02.000000000
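Because the handler is configured with SchemaTopicName=mySchemaTopic, the Avro schemas it publishes can be inspected with the same console consumer (a sketch, assuming the schema topic has already been written to):
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic mySchemaTopic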
To change the output format to XML, modify the kafka.props file and rerun the delivery process:
gg.handler.kafkahandler.format =xml
GGSCI>stop myka
GGSCI>alter myka, extrba 0
GGSCI>start myka, NOFILTERDUPTRANSACTIONS
NOFILTERDUPTRANSACTIONS is used here to stop OGG from skipping transactions it has already processed.
Check the kafka-console-consumer output again:
As shown below, the data format has changed to XML, and the details of every operation on the source side are recorded.
<operation table='GG_SRC.TB_HIVE' type='I' ts='2016-08-28 12:43:21.979636' current_ts='2017-12-18T16:49:00.995000' pos='00000000380000001916' numCols='4'>
  <col name='ID' index='0'>
    <before missing='true'/>
    <after><![CDATA[1]]></after>
  </col>
  <col name='NAME' index='1'>
    <before missing='true'/>
    <after><![CDATA[cd]]></after>
  </col>
  <col name='BIRTH_DT' index='2'>
    <before missing='true'/>
    <after><![CDATA[2016-02-08:00:00:00]]></after>
  </col>
  <col name='CR_TM' index='3'>
    <before missing='true'/>
    <after><![CDATA[2016-12-11:11:00:02.000000000]]></after>
  </col>
</operation>
<operation table='GG_SRC.TB_HIVE' type='I' ts='2016-08-28 12:47:24.981544' current_ts='2017-12-18T16:49:00.996000' pos='00000000380000002103' numCols='4'>
  <col name='ID' index='0'>
    <before missing='true'/>
    <after><![CDATA[2]]></after>
  </col>
  <col name='NAME' index='1'>
    <before missing='true'/>
    <after><![CDATA[cd22]]></after>
  </col>
  <col name='BIRTH_DT' index='2'>
    <before missing='true'/>
    <after><![CDATA[2016-02-08:00:00:00]]></after>
  </col>
  <col name='CR_TM' index='3'>
    <before missing='true'/>
    <after><![CDATA[2016-12-11:11:00:02.000000000]]></after>
  </col>
</operation>
<operation table='GG_SRC.TB_HIVE' type='I' ts='2016-08-28 13:18:20.980481' current_ts='2017-12-18T16:49:00.996001' pos='00000000380000002292' numCols='4'>
  <col name='ID' index='0'>
    <before missing='true'/>
    <after><![CDATA[3]]></after>
  </col>
  <col name='NAME' index='1'>
    <before missing='true'/>
    <after><![CDATA[cd22]]></after>
  </col>
  <col name='BIRTH_DT' index='2'>
    <before missing='true'/>
    <after><![CDATA[2016-02-08:00:00:00]]></after>
  </col>
  <col name='CR_TM' index='3'>
    <before missing='true'/>
    <after><![CDATA[2016-12-11:11:00:02.000000000]]></after>
  </col>
</operation>
Finally, change the output format to JSON.
gg.handler.kafkahandler.format =json
GGSCI>stop myka
GGSCI>alter myka, extrba 0
GGSCI>start myka, NOFILTERDUPTRANSACTIONS
Check the Kafka output:
{"table":"GG_SRC.TB_HIVE","op_type":"I","op_ts":"2016-08-28 12:43:21.979636","current_ts":"2017-12-18T16:46:23.860000","pos":"00000000380000001916","after":{"ID":"1","NAME":"cd","BIRTH_DT":"2016-02-08:00:00:00","CR_TM":"2016-12-11:11:00:02.000000000"}} {"table":"GG_SRC.TB_HIVE","op_type":"I","op_ts":"2016-08-28 12:47:24.981544","current_ts":"2017-12-18T16:46:23.914000","pos":"00000000380000002103","after":{"ID":"2","NAME":"cd22","BIRTH_DT":"2016-02-08:00:00:00","CR_TM":"2016-12-11:11:00:02.000000000"}} {"table":"GG_SRC.TB_HIVE","op_type":"I","op_ts":"2016-08-28 13:18:20.980481","current_ts":"2017-12-18T16:46:23.914001","pos":"00000000380000002292","after":{"ID":"3","NAME":"cd22","BIRTH_DT":"2016-02-08:00:00:00","CR_TM":"2016-12-11:11:00:02.000000000"}} |
As you can see, the data in Kafka is now in JSON format and includes the relevant timestamps and other auxiliary information.
This completes the test.