kafka生產者程式:
kafka-console-producer.sh --broker-list 192.168.12.115:7776 --topic gpss_test
kafka消費者程式:
kafka-console-consumer.sh --bootstrap-server 192.168.12.115:7776 --topic gpss_test --from-beginning
從kafka同步資料到greenplum有兩種方式:
- 用gpss啟動服務,用gpsscli向gpss註冊kafka載入作業(重點介紹)
- 用gpkafka元件來快速完成上面的步驟,因為gpkafka封裝了gpss和gpsscli的功能
準備一個配置檔案用於配置gpss服務的host和port
gpss4ic.json
{ "ListenAddress": { "Host": "", "Port": 50007 }, "Gpfdist": { "Host": "", "Port": 8319, "ReuseTables": false } }
用於載入kafka資料到greenplum的配置檔案
- 載入以”|”分割的流資料的配置檔案 kafka_testdata_delimited.yaml
DATABASE: yloms USER: gpss_usr PASSWORD: gpss_usr HOST: mdw PORT: 5432 VERSION: 2 KAFKA: INPUT: SOURCE: BROKERS: 192.168.12.115:7776 TOPIC: gpss_test VALUE: COLUMNS: - NAME: tid TYPE: integer - NAME: tcode TYPE: varchar - NAME: tname TYPE: varchar FORMAT: delimited DELIMITED_OPTION: DELIMITER: "|" ERROR_LIMIT: 25 OUTPUT: SCHEMA: ylorder TABLE: test_heap METADATA: SCHEMA: ylorder COMMIT: MINIMAL_INTERVAL: 2000 POLL: BATCHSIZE: 100 TIMEOUT: 3000
- 載入JSON格式流資料的配置檔案kafka_testdata_json.yaml
DATABASE: yloms USER: gpss_usr PASSWORD: gpss_usr HOST: mdw PORT: 5432 VERSION: 2 KAFKA: INPUT: SOURCE: BROKERS: 192.168.12.115:7776 TOPIC: gpss_test VALUE: COLUMNS: - NAME: jdata TYPE: json FORMAT: json ERROR_LIMIT: 25 OUTPUT: SCHEMA: ylorder TABLE: test_heap MAPPING: - NAME: tid EXPRESSION: (jdata->>'tid')::int - NAME: tcode EXPRESSION: (jdata->>'tcode')::varchar - NAME: tname EXPRESSION: (jdata->>'tname')::varchar METADATA: SCHEMA: ylorder COMMIT: MINIMAL_INTERVAL: 2000 POLL: BATCHSIZE: 100 TIMEOUT: 3000
用gpss做etl載入:
啟動gpss服務:
命令格式
gpss gpss4ic.json
日誌輸出
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-using config file: gpss4ic.json 20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-config file content: { "ListenAddress": { "Host": "mdw", "Port": 50007, "Certificate": { "CertFile": "", "KeyFile": "", "CAFile": "" } }, "Gpfdist": { "Host": "mdw", "Port": 8319, "ReuseTables": false, "Certificate": { "CertFile": "", "KeyFile": "", "CAFile": "" }, "BindAddress": "0.0.0.0" } } 20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss-listen-address-prefix: mdw:50007 20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss will use random external table name, external table won't get reused 20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpfdist listening on 0.0.0.0:8319 20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss listening on mdw:50007
提交一個作業:
命令格式
gpsscli submit --name kafkajson2gp --gpss-port 50007 --gpss-host mdw ./kafka_testdata_json.yaml
輸出如下
20200225:22:09:16 gpsscli:gpadmin:greenplum-001:010722-[INFO]:-JobID: kafkajson2gp
檢視作業列表:
命令格式
gpsscli list --all --gpss-port 50007 --gpss-host mdw
輸出如下
JobID GPHost GPPort DataBase Schema Table Topic Status kafkajson2gp mdw 5432 yloms ylorder test_heap gpss_test JOB_STOPPED
啟動作業:
命令格式
gpsscli start kafkajson2gp --gpss-port 50007 --gpss-host mdw
輸出如下
20200225:22:10:24 gpsscli:gpadmin:greenplum-001:010756-[INFO]:-JobID: kafkajson2gp is started
再次檢視作業:
命令格式
gpsscli list --all --gpss-port 50007 --gpss-host mdw
輸出如下
JobID GPHost GPPort DataBase Schema Table Topic Status kafkajson2gp mdw 5432 yloms ylorder test_heap gpss_test JOB_RUNNING
停掉作業:
命令格式
gpsscli stop kafkajson2gp --gpss-port 50007 --gpss-host mdw
輸出如下
20200225:22:11:04 gpsscli:gpadmin:greenplum-001:010801-[INFO]:-Stop a job: kafkajson2gp, status JOB_STOPPED
用gpkafka load啟動服務:
注意:gpkafka load 可以理解為代替了gpsscli上的提交作業,啟動作業等命令。
命令格式gpkafka --config gpss4ic.json load kafka_testdata_json.yaml
參考文件:
Loading Kafka Data into Greenplum
yanivbhemo / greenplum-gpss
本作品採用《CC 協議》,轉載必須註明作者和本文連結