Exporting HDFS data to ClickHouse with Waterdrop (text, csv, json)

七年 · Published on 2020-10-20

First, create a table with Hive (this is just a convenient way to generate the HDFS files; a real Hive table export would integrate Spark and export by writing SQL directly):

 CREATE TABLE test.hdfs2ch2(
           id int,
           name string,
           create_time timestamp);
 insert into hdfs2ch2 values(1,'zhangsan','2020-01-01 01:01:01.000001');
 insert into hdfs2ch2 values(2,'lisi','2020-01-01 01:01:01.000002');

The reason for using values like '2020-01-01 01:01:01.000002' is simply to also demonstrate this somewhat unusual type.
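For reference, the text files Hive writes under the table's HDFS directory should look roughly like this, assuming Hive's default \001 field delimiter (shown here as ^A); this is a sketch, not captured output:

1^Azhangsan^A2020-01-01 01:01:01.000001
2^Alisi^A2020-01-01 01:01:01.000002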
ClickHouse table creation statement:

CREATE TABLE mydatabase.hdfs2ch2
(
    `id` Int64,
    `name` String,
    `create_time` DateTime
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192

Here is the Waterdrop config:

spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}

input {
 hdfs {
    result_table_name = "test_source"
    # HDFS path where the Hive table stores its files
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch2"
    format="text"
 }
}

filter {
  split {
    # column names to assign to each field after splitting on the delimiter
    fields = ["id", "name", "create_time"]
    # Hive's default field delimiter; without it the rows cannot be split
    delimiter = "\\001"
  }
  convert {
    # right after the split every field is of type string; without this conversion the load will fail
    # supported target types: string, integer, long, float, double and boolean
    source_field = "id"
    new_type = "long"
  }
  date {
    # name of the source field to convert
    source_field = "create_time"
    # name of the field after conversion (must be specified)
    target_field = "create_time"
    # uppercase S stands for fractional seconds; if the pattern is wrong the conversion fails and the current time is written instead
    source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
    target_time_format = "yyyy-MM-dd HH:mm:ss"
  }
}
output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch2"
    fields = ["id", "name", "create_time"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}

Run it:
./bin/start-waterdrop.sh --master yarn --deploy-mode client --config ./config/hdfs-clickhouse2.conf

Check the data:

node01.hadoop.com :) select * from hdfs2ch2;

SELECT *
FROM hdfs2ch2

┌─id─┬─name─────┬─────────create_time─┐
│  1 │ zhangsan │ 2020-01-01 01:01:01 │
└────┴──────────┴─────────────────────┘
┌─id─┬─name─┬─────────create_time─┐
│  2 │ lisi │ 2020-01-01 01:01:01 │
└────┴──────┴─────────────────────┘

2 rows in set. Elapsed: 0.009 sec.

CSV
If the file is in CSV format and the header row is not the field names, import it the same way as above, just using delimiter = "," in the split filter; everything else stays the same. But if the header row does contain the field names:

input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch3"
    format="csv"
    # indicates that the header row contains the field names
    options.header = "true"
 }
}
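For reference, a source file with a header row might look like this (hypothetical sample data; the actual contents of hdfs2ch3 are not shown in this post):

id,name,age
1,zhangsan,20
2,lisi,30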

In that case you no longer need the split block in the filter to split the rows and assign field names, but every field is still read as a string, so any field whose type differs from its ClickHouse column still has to be converted.
A complete CSV example:


spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}
input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch3"
    format="csv"
    options.header = "true"
 }
}

filter {
  convert {
    source_field = "id"
    new_type = "integer"
  }
  convert {
    source_field = "age"
    new_type = "integer"
  }
  # date {
  #   source_field = "create_time"
  #   target_field = "create_time"
  #   source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
  #   target_time_format = "yyyy-MM-dd HH:mm:ss"
  # }
}

output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch3"
    fields = ["id", "name", "age"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}
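The target table mydatabase.hdfs2ch3 is not shown above; a matching definition for the id/name/age fields could look like the following sketch (assuming Int32 for the integer columns):

CREATE TABLE mydatabase.hdfs2ch3
(
    `id` Int32,
    `name` String,
    `age` Int32
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192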

JSON
If the data is in JSON format, it works the same way as CSV, except that JSON already carries its own structure and numeric fields come in as type long.
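With format="json" the hdfs input typically expects one JSON object per line; a hypothetical sample matching the id/name/age fields used below could look like this:

{"id": 1, "name": "zhangsan", "age": 20}
{"id": 2, "name": "lisi", "age": 30}

Since the numbers are read as long, the convert filters below cast id and age to integer.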

spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}

input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch4"
    format="json"
 }
}

filter {
  convert {
    source_field = "id"
    new_type = "integer"
  }
  convert {
    source_field = "age"
    new_type = "integer"
  }
  # date {
  #   source_field = "create_time"
  #   target_field = "create_time"
  #   source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
  #   target_time_format = "yyyy-MM-dd HH:mm:ss"
  # }
}
output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch3"
    fields = ["id", "name", "age"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}
