Exporting HDFS data to ClickHouse with Waterdrop (text, csv, json)

七年 · Published on 2020-10-20

First, create a table with Hive (this is just a convenient way to generate the HDFS files; a real Hive table export would integrate Spark and export by writing SQL directly):

 CREATE TABLE test.hdfs2ch2(
           id int,
           name string,
           create_time timestamp);
 insert into hdfs2ch2 values(1,'zhangsan','2020-01-01 01:01:01.000001');
 insert into hdfs2ch2 values(2,'lisi','2020-01-01 01:01:01.000002');

The reason for using values like '2020-01-01 01:01:01.000002' is simply to also demonstrate this somewhat unusual type.
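For reference, the text files Hive writes under the table's HDFS directory should look roughly like this, assuming Hive's default \001 field delimiter (shown here as ^A); this is a sketch, not captured output:

1^Azhangsan^A2020-01-01 01:01:01.000001
2^Alisi^A2020-01-01 01:01:01.000002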
ClickHouse table creation statement:

CREATE TABLE mydatabase.hdfs2ch2
(
    `id` Int64,
    `name` String,
    `create_time` DateTime
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192

Here is the Waterdrop config:

spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}

input {
 hdfs {
    result_table_name = "test_source"
    # HDFS path where the Hive table stores its files
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch2"
    format="text"
 }
}

filter {
  split {
    # column names to assign to each field after splitting on the delimiter
    fields = ["id", "name", "create_time"]
    # Hive's default field delimiter; without it the rows cannot be split
    delimiter = "\\001"
  }
  convert {
    # right after the split every field is of type string; without this conversion the load will fail
    # supported target types: string, integer, long, float, double and boolean
    source_field = "id"
    new_type = "long"
  }
  date {
    # name of the source field to convert
    source_field = "create_time"
    # name of the field after conversion (must be specified)
    target_field = "create_time"
    # uppercase S stands for fractional seconds; if the pattern is wrong the conversion fails and the current time is written instead
    source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
    target_time_format = "yyyy-MM-dd HH:mm:ss"
  }
}
output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch2"
    fields = ["id", "name", "create_time"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}

Run it:
./bin/start-waterdrop.sh --master yarn --deploy-mode client --config ./config/hdfs-clickhouse2.conf

Check the data:

node01.hadoop.com :) select * from hdfs2ch2;

SELECT *
FROM hdfs2ch2

┌─id─┬─name─────┬─────────create_time─┐
│  1 │ zhangsan │ 2020-01-01 01:01:01 │
└────┴──────────┴─────────────────────┘
┌─id─┬─name─┬─────────create_time─┐
│  2 │ lisi │ 2020-01-01 01:01:01 │
└────┴──────┴─────────────────────┘

2 rows in set. Elapsed: 0.009 sec.

CSV
If the file is in CSV format and the header row is not the field names, import it the same way as above, just using delimiter = "," in the split filter; everything else stays the same. But if the header row does contain the field names:

input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch3"
    format="csv"
    # indicates that the header row contains the field names
    options.header = "true"
 }
}
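For reference, a source file with a header row might look like this (hypothetical sample data; the actual contents of hdfs2ch3 are not shown in this post):

id,name,age
1,zhangsan,20
2,lisi,30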

In that case you no longer need the split block in the filter to split the rows and assign field names, but every field is still read as a string, so any field whose type differs from its ClickHouse column still has to be converted.
A complete CSV example:


spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}
input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch3"
    format="csv"
    options.header = "true"
 }
}

filter {
  convert {
    source_field = "id"
    new_type = "integer"
  }
  convert {
    source_field = "age"
    new_type = "integer"
  }
  # date {
  #   source_field = "create_time"
  #   target_field = "create_time"
  #   source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
  #   target_time_format = "yyyy-MM-dd HH:mm:ss"
  # }
}

output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch3"
    fields = ["id", "name", "age"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}
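The target table mydatabase.hdfs2ch3 is not shown above; a matching definition for the id/name/age fields could look like the following sketch (assuming Int32 for the integer columns):

CREATE TABLE mydatabase.hdfs2ch3
(
    `id` Int32,
    `name` String,
    `age` Int32
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS index_granularity = 8192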

JSON
If the data is in JSON format, it works the same way as CSV, except that JSON already carries its own structure and numeric fields come in as type long.
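With format="json" the hdfs input typically expects one JSON object per line; a hypothetical sample matching the id/name/age fields used below could look like this:

{"id": 1, "name": "zhangsan", "age": 20}
{"id": 2, "name": "lisi", "age": 30}

Since the numbers are read as long, the convert filters below cast id and age to integer.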

spark {
  # application name
  spark.app.name = "Waterdrop"
  # number of executors (increase for larger data volumes)
  spark.executor.instances = 1
  # cores per executor (parallelism; for large data volumes this can be raised up to about half the server's cores, but try not to impact ClickHouse)
  spark.executor.cores = 1
  # memory per executor (must not be less than 512m)
  spark.executor.memory = "1g"
}

input {
 hdfs {
    result_table_name = "test_source"
    path = "hdfs://node01:8020/user/hive/warehouse/test.db/hdfs2ch4"
    format="json"
 }
}

filter {
  convert {
    source_field = "id"
    new_type = "integer"
  }
  convert {
    source_field = "age"
    new_type = "integer"
  }
  # date {
  #   source_field = "create_time"
  #   target_field = "create_time"
  #   source_time_format = "yyyy-MM-dd HH:mm:ss.SSSSSS"
  #   target_time_format = "yyyy-MM-dd HH:mm:ss"
  # }
}
output {
  stdout {
    limit = 2
  }
  clickhouse {
    host = "node01:8123"
    clickhouse.socket_timeout = 50000
    database = "mydatabase"
    table = "hdfs2ch3"
    fields = ["id", "name", "age"]
    username = ""
    password = ""
    bulk_size = 20000
  }
}
