Consuming data from Kafka and loading it into Greenplum with gpss

Posted by xueyubingsen on 2020-02-26

Kafka console producer:

kafka-console-producer.sh --broker-list 192.168.12.115:7776 --topic gpss_test

Kafka console consumer:

kafka-console-consumer.sh --bootstrap-server 192.168.12.115:7776 --topic gpss_test --from-beginning
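To have something to type or pipe into the console producer, the following is a minimal sketch that builds test messages in the two formats loaded later in this post ("|"-delimited and JSON), using the tid, tcode and tname columns from the load configurations. The function names and the sample values are assumptions made for illustration only.

```python
import json


def delimited_message(tid, tcode, tname, delimiter="|"):
    """Build one delimited record in the column order tid, tcode, tname."""
    return delimiter.join([str(tid), tcode, tname])


def json_message(tid, tcode, tname):
    """Build one JSON record with the keys tid, tcode, tname."""
    return json.dumps({"tid": tid, "tcode": tcode, "tname": tname})


if __name__ == "__main__":
    # Print three JSON records, one per line (sample values are made up).
    for i in range(1, 4):
        print(json_message(i, f"c{i:03d}", f"name{i}"))
```

If the script is saved under a hypothetical name such as make_messages.py, its output can be piped into the producer command shown above, since the console producer reads one message per line from standard input.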

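There are two ways to sync data from Kafka into Greenplum:

  1. Start the gpss service, then use gpsscli to register a Kafka load job with gpss (the focus of this post)
  2. Use the gpkafka utility to do the steps above in one go, since gpkafka wraps the functionality of gpss and gpsscli

Consuming data from Kafka and loading it into Greenplum with gpss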

Prepare a configuration file that sets the host and port of the gpss service

gpss4ic.json

{
    "ListenAddress": {
        "Host": "",
        "Port": 50007
    },
    "Gpfdist": {
        "Host": "",
        "Port": 8319,
        "ReuseTables": false
    }
}
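When the same setup is repeated on several hosts, the file above can also be generated rather than edited by hand. The sketch below only reproduces the keys shown in gpss4ic.json; the helper names and the port checks are assumptions added for illustration, not part of gpss.

```python
import json


def build_gpss_config(listen_port=50007, gpfdist_port=8319, host=""):
    """Return a dict with the same keys as the gpss4ic.json shown above."""
    for port in (listen_port, gpfdist_port):
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
    if listen_port == gpfdist_port:
        # Both listeners run on the same host, so they cannot share a port.
        raise ValueError("gpss and gpfdist need different ports")
    return {
        "ListenAddress": {"Host": host, "Port": listen_port},
        "Gpfdist": {"Host": host, "Port": gpfdist_port, "ReuseTables": False},
    }


def write_gpss_config(path="gpss4ic.json", **kwargs):
    """Write the configuration to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_gpss_config(**kwargs), fh, indent=4)
        fh.write("\n")
```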

Configuration files for loading Kafka data into Greenplum

  1. Configuration file for loading "|"-delimited stream data: kafka_testdata_delimited.yaml
    DATABASE: yloms
    USER: gpss_usr
    PASSWORD: gpss_usr
    HOST: mdw
    PORT: 5432
    VERSION: 2
    KAFKA:
       INPUT:
          SOURCE:
             BROKERS: 192.168.12.115:7776
             TOPIC: gpss_test
          VALUE:
             COLUMNS:
               - NAME: tid
                 TYPE: integer
               - NAME: tcode
                 TYPE: varchar
               - NAME: tname
                 TYPE: varchar
             FORMAT: delimited
             DELIMITED_OPTION:
               DELIMITER: "|"
          ERROR_LIMIT: 25
       OUTPUT:
          SCHEMA: ylorder
          TABLE: test_heap
       METADATA:
          SCHEMA: ylorder
       COMMIT:
          MINIMAL_INTERVAL: 2000
       POLL:
          BATCHSIZE: 100
          TIMEOUT: 3000
  2. Configuration file for loading JSON-format stream data: kafka_testdata_json.yaml
    DATABASE: yloms
    USER: gpss_usr
    PASSWORD: gpss_usr
    HOST: mdw
    PORT: 5432
    VERSION: 2
    KAFKA:
       INPUT:
          SOURCE:
             BROKERS: 192.168.12.115:7776
             TOPIC: gpss_test
          VALUE:
             COLUMNS:
               - NAME: jdata
                 TYPE: json
             FORMAT: json
          ERROR_LIMIT: 25
       OUTPUT:
          SCHEMA: ylorder
          TABLE: test_heap
          MAPPING:
            - NAME: tid
              EXPRESSION: (jdata->>'tid')::int
            - NAME: tcode
              EXPRESSION: (jdata->>'tcode')::varchar
            - NAME: tname
              EXPRESSION: (jdata->>'tname')::varchar
       METADATA:
          SCHEMA: ylorder
       COMMIT:
          MINIMAL_INTERVAL: 2000
       POLL:
          BATCHSIZE: 100
          TIMEOUT: 3000
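To make the two configurations easier to compare, here is a rough Python analogy of what each one asks for: the delimited file splits every Kafka message value on "|" into tid, tcode and tname, while the JSON file reads the whole value into a single jdata column and uses the MAPPING expressions to pull out each field and cast it. This is only an illustration of the intended result. gpss itself does the work in the database, and the function names here are assumptions.

```python
import json

COLUMNS = ("tid", "tcode", "tname")


def row_from_delimited(value, delimiter="|"):
    """Split one delimited message into the three target columns."""
    parts = value.split(delimiter)
    if len(parts) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(parts)}")
    tid, tcode, tname = parts
    return {"tid": int(tid), "tcode": tcode, "tname": tname}


def row_from_json(value):
    """Approximate (jdata->>'key')::type for each mapped column."""
    jdata = json.loads(value)

    def text(key):
        # Roughly like ->>: a missing key or JSON null becomes None (SQL NULL).
        item = jdata.get(key)
        return None if item is None else str(item)

    tid = text("tid")
    return {
        "tid": None if tid is None else int(tid),
        "tcode": text("tcode"),
        "tname": text("tname"),
    }
```

Both functions return the same row for equivalent input, which mirrors the fact that both configuration files write to the same table, ylorder.test_heap.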

ETL loading with gpss:

Start the gpss service:

Command

gpss gpss4ic.json

Log output

20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-using config file: gpss4ic.json
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-config file content: {
        "ListenAddress": {
                "Host": "mdw",
                "Port": 50007,
                "Certificate": {
                        "CertFile": "",
                        "KeyFile": "",
                        "CAFile": ""
                }
        },
        "Gpfdist": {
                "Host": "mdw",
                "Port": 8319,
                "ReuseTables": false,
                "Certificate": {
                        "CertFile": "",
                        "KeyFile": "",
                        "CAFile": ""
                },
                "BindAddress": "0.0.0.0"
        }
}
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss-listen-address-prefix: mdw:50007
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss will use random external table name, external table won't get reused
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpfdist listening on 0.0.0.0:8319
20200225:22:08:21 gpss:gpadmin:greenplum-001:010656-[INFO]:-gpss listening on mdw:50007
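Before submitting jobs from a script, it can help to confirm that both listeners reported in the log are reachable. The following is a small sketch using only the Python standard library; the function name is an assumption, and the host and ports in the usage part are the ones from the log above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    for name, port in (("gpss", 50007), ("gpfdist", 8319)):
        state = "reachable" if port_is_open("mdw", port) else "not reachable"
        print(f"{name} on mdw:{port} is {state}")
```

A successful TCP connection only shows that something is listening on the port, not that the service behind it is healthy.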

Submit a job:

Command

gpsscli submit --name kafkajson2gp --gpss-port 50007 --gpss-host mdw ./kafka_testdata_json.yaml

Output

20200225:22:09:16 gpsscli:gpadmin:greenplum-001:010722-[INFO]:-JobID: kafkajson2gp

List the jobs:

Command

gpsscli list --all --gpss-port 50007 --gpss-host mdw

Output

JobID                               GPHost          GPPort  DataBase        Schema          Table                           Topic           Status
kafkajson2gp                        mdw             5432    yloms           ylorder         test_heap                       gpss_test       JOB_STOPPED

Start the job:

Command

gpsscli start kafkajson2gp --gpss-port 50007 --gpss-host mdw

Output

20200225:22:10:24 gpsscli:gpadmin:greenplum-001:010756-[INFO]:-JobID: kafkajson2gp is started

List the jobs again:

Command

gpsscli list --all --gpss-port 50007 --gpss-host mdw

Output

JobID                               GPHost          GPPort  DataBase        Schema          Table                           Topic           Status
kafkajson2gp                        mdw             5432    yloms           ylorder         test_heap                       gpss_test       JOB_RUNNING
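For scripting, the tabular output of gpsscli list can be turned into dictionaries so that a job's status is easy to check. This is a minimal sketch that assumes the layout shown above: one header line, whitespace-separated columns, and no spaces inside any field. The function names are assumptions.

```python
SAMPLE = """\
JobID          GPHost  GPPort  DataBase  Schema   Table      Topic      Status
kafkajson2gp   mdw     5432    yloms     ylorder  test_heap  gpss_test  JOB_RUNNING
"""


def parse_job_list(text):
    """Parse `gpsscli list` output into one dict per job, keyed by header name."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def job_status(text, job_id):
    """Return the Status column for job_id, or None if the job is not listed."""
    for job in parse_job_list(text):
        if job.get("JobID") == job_id:
            return job.get("Status")
    return None


if __name__ == "__main__":
    print(job_status(SAMPLE, "kafkajson2gp"))
```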

Stop the job:

Command

gpsscli stop kafkajson2gp --gpss-port 50007 --gpss-host mdw

Output

20200225:22:11:04 gpsscli:gpadmin:greenplum-001:010801-[INFO]:-Stop a job: kafkajson2gp, status JOB_STOPPED
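The submit, list, start and stop steps above can be strung together in a small wrapper. The sketch below only builds the same command lines used in this post and runs them with subprocess. The helper names are assumptions, and it presumes gpsscli is on the PATH and the gpss service is already running on mdw:50007.

```python
import subprocess

GPSS_HOST = "mdw"
GPSS_PORT = 50007


def _connection_args(host=GPSS_HOST, port=GPSS_PORT):
    """Connection options shared by every gpsscli call in this post."""
    return ["--gpss-port", str(port), "--gpss-host", host]


def submit_cmd(name, yaml_path):
    """gpsscli submit --name NAME --gpss-port P --gpss-host H FILE"""
    return ["gpsscli", "submit", "--name", name, *_connection_args(), yaml_path]


def job_cmd(action, name):
    """gpsscli start|stop NAME --gpss-port P --gpss-host H"""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["gpsscli", action, name, *_connection_args()]


def list_cmd():
    """gpsscli list --all --gpss-port P --gpss-host H"""
    return ["gpsscli", "list", "--all", *_connection_args()]


def run(cmd):
    """Run one command and return its standard output; raises on failure."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    run(submit_cmd("kafkajson2gp", "./kafka_testdata_json.yaml"))
    run(job_cmd("start", "kafkajson2gp"))
    print(run(list_cmd()))
    run(job_cmd("stop", "kafkajson2gp"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a job name or file path ever contains spaces.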

Start the load with gpkafka load:

Note: gpkafka load can be thought of as a replacement for the gpsscli commands that submit and start a job.

Command

gpkafka --config gpss4ic.json load kafka_testdata_json.yaml

References:
Loading Kafka Data into Greenplum
yanivbhemo / greenplum-gpss

This work is released under a CC license; reposts must credit the author and link to this article.
