Integrating with Flink
Programming Example
This section walks through a simple programming example that writes to a Hudi table from Flink; you can extend it later for your own business needs. First create a Maven project; this time we will write the Flink program in Java.
Since Maven Central does not host the Scala 2.12 build of the bundle, and the jar was already compiled in an earlier article, install hudi-flink1.15-bundle-0.12.1.jar into the local Maven repository manually:
mvn install:install-file -DgroupId=org.apache.hudi -DartifactId=hudi-flink_2.12 -Dversion=0.12.1 -Dpackaging=jar -Dfile=./hudi-flink1.15-bundle-0.12.1.jar
Add the following to the pom file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.itxs</groupId>
<artifactId>hudi-flink-demo</artifactId>
<version>1.0</version>
<name>hudi-flink-demo</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<hoodie.version>0.12.1</hoodie.version>
<hadoop.version>3.3.4</hadoop.version>
<flink.version>1.15.1</flink.version>
<slf4j.version>2.0.5</slf4j.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hudi</groupId>
<artifactId>hudi-flink_${scala.binary.version}</artifactId>
<version>${hoodie.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.10.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>${project.build.sourceEncoding}</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.4.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Create a Java class named HudiDemo that implements a simple flow for writing into a Hudi table:
package cn.itxs;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import java.util.concurrent.TimeUnit;
public class HudiDemo {
    public static void main(String[] args) {
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create a local environment that also starts the Flink Web UI
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        // Use RocksDB as the state backend, with incremental checkpoints enabled
        EmbeddedRocksDBStateBackend embeddedRocksDBStateBackend = new EmbeddedRocksDBStateBackend(true);
        embeddedRocksDBStateBackend.setDbStoragePath("file:///D:/rocksdb");
        embeddedRocksDBStateBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend(embeddedRocksDBStateBackend);

        // Checkpoint every 5 seconds with exactly-once semantics; store checkpoints on HDFS
        env.enableCheckpointing(TimeUnit.SECONDS.toMillis(5), CheckpointingMode.EXACTLY_ONCE);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointStorage("hdfs://hadoop1:9000/checkpoints/flink");
        checkpointConfig.setMinPauseBetweenCheckpoints(TimeUnit.SECONDS.toMillis(2));
        checkpointConfig.setTolerableCheckpointFailureNumber(5);
        checkpointConfig.setCheckpointTimeout(TimeUnit.MINUTES.toMillis(1));
        checkpointConfig.setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Source table backed by the datagen connector, generating one row per second
        tableEnv.executeSql("CREATE TABLE source_a2 (\n" +
                "  uuid varchar(20),\n" +
                "  name varchar(10),\n" +
                "  age int,\n" +
                "  ts timestamp(3),\n" +
                "  `partition` varchar(20),\n" +
                "  PRIMARY KEY(uuid) NOT ENFORCED\n" +
                ") WITH (\n" +
                "  'connector' = 'datagen',\n" +
                "  'rows-per-second' = '1'\n" +
                ")"
        );

        // Hudi MERGE_ON_READ table stored on HDFS
        tableEnv.executeSql("CREATE TABLE a2 (\n" +
                "  uuid varchar(20),\n" +
                "  name varchar(10),\n" +
                "  age int,\n" +
                "  ts timestamp(3),\n" +
                "  `partition` varchar(20),\n" +
                "  PRIMARY KEY(uuid) NOT ENFORCED\n" +
                ") WITH (\n" +
                "  'connector' = 'hudi',\n" +
                "  'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/a2',\n" +
                "  'table.type' = 'MERGE_ON_READ'\n" +
                ")"
        );

        // Continuously write the generated data into the Hudi table
        tableEnv.executeSql("insert into a2 select * from source_a2");
    }
}
Using createLocalEnvironmentWithWebUI starts the Flink Web UI locally, so the Flink web page can be viewed on the local machine.
The local RocksDB state backend also stores its data under the configured path.
The newly created Hudi table can likewise be seen on HDFS.
Package and Run
Make a small change to the code above: in the environment setup at the top, comment out createLocalEnvironmentWithWebUI and setDbStoragePath and uncomment getExecutionEnvironment, then change the table name to a3. Run mvn package to build, and upload the packaged jar:
flink run -t yarn-per-job -c cn.itxs.HudiDemo /home/commons/flink-1.15.1/otherjars/hudi-flink-demo-1.0.jar
The run log is as follows.
Check the YARN application application_1669357770610_0019.
On HDFS you can again see the newly created Hudi table.
CDC Ingestion
Overview
CDC (Change Data Capture) captures changes from a source table (Insert, Update, and Delete) and sends them downstream as a data stream. Each captured operation carries a flag identifying whether data was added, modified, or deleted (illustrated by the sketch after this list):
- +I: newly inserted data.
- -U: an update to a row produces two U-flagged records; -U is the row image before the update.
- +U: the row image after the update.
- -D: deleted data.
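To see these flags in practice, run an updating query in the SQL client's tableau mode; the op column is printed with exactly these values. This is a minimal illustrative sketch and not part of the CDC pipeline built below; the table name and datagen settings are made up for the demonstration.
set 'sql-client.execution.result-mode' = 'tableau';
-- a throwaway random source; one-character names make repeated keys likely
CREATE TABLE flag_demo_src (name STRING) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1',
  'fields.name.length' = '1'
);
-- a running count is an updating result: a name's first occurrence arrives as +I,
-- and every repeated occurrence retracts the old count (-U) and emits the new one (+U)
select name, count(*) as cnt from flag_demo_src group by name;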
CDC data preserves the complete set of database changes, and can be ingested into Hudi in either of the following ways:
- Consume CDC-format data from Kafka and write it into Hudi at the same time. The debezium-json, canal-json, and maxwell-json formats are supported. The advantage of this approach is its flexibility; the drawback is that it depends on Kafka and on a change-capture tool such as Debezium.
- Use the Flink CDC connector to read the database binlog directly and write the data into Hudi. The advantage is that it is lightweight, with few component dependencies (a minimal sketch of this approach follows the notes below).
Notes
- If the ordering of the upstream data cannot be guaranteed, the write.precombine.field option must be specified.
- In CDC scenarios, changelog mode must be enabled, i.e. set changelog.enabled to true (both options appear in the sketch below).
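The following is a minimal sketch of the second (direct) approach, which also shows where the two options above go. It reuses the MySQL connection settings from the demo later in this section; the table names student_cdc and student_hudi and the choice of class as the precombine field are illustrative only.
-- read the MySQL binlog directly with the mysql-cdc connector
CREATE TABLE student_cdc (
  id INT NOT NULL,
  name STRING,
  age INT,
  class STRING,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'mysqlserver',
  'port' = '3308',
  'username' = 'root',
  'password' = '123456',
  'database-name' = 'test',
  'table-name' = 'student'
);
-- Hudi sink: changelog.enabled keeps the intermediate -U/+U changes,
-- and write.precombine.field decides which version of a key wins when ordering is not guaranteed
CREATE TABLE student_hudi (
  id INT NOT NULL,
  name STRING,
  age INT,
  class STRING,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/student_hudi',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true',
  'write.precombine.field' = 'class'
);
insert into student_hudi select * from student_cdc;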
The following demonstrates the first approach.
Enable binlog in MySQL
The steps below use MySQL 5.7 as an example. Edit the my.cnf file and add:
server_id=1
log_bin=mysql-bin
binlog_format=ROW
expire_logs_days=30
Initialize the MySQL source table
First create a demo database test and a student table:
create database test;
use test;
CREATE TABLE `student` (
`id` INT NOT NULL AUTO_INCREMENT,
`name` varchar(10) NOT NULL,
`age` int NOT NULL,
`class` varchar(10) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE = InnoDB CHARSET = utf8;
Prepare jar dependencies
Upload flink-sql-connector-mysql-cdc-2.3.0.jar and flink-sql-connector-kafka-1.15.1.jar to Flink's lib directory.
flink-sql-connector-mysql-cdc-2.3.0.jar can be downloaded from GitHub: https://github.com/ververica/flink-cdc-connectors
flink-sql-connector-kafka-1.15.1.jar can be downloaded directly from the Maven repository.
Flink reads the MySQL binlog and writes it to Kafka
- Create the MySQL CDC source table
CREATE TABLE student_binlog (
id INT NOT NULL,
name STRING,
age INT,
class STRING,
PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
'connector' = 'mysql-cdc',
'hostname' = 'mysqlserver',
'port' = '3308',
'username' = 'root',
'password' = '123456',
'database-name' = 'test',
'table-name' = 'student'
);
- Create the Kafka sink table
create table student_binlog_sink_kafka(
id INT NOT NULL,
name STRING,
age INT,
class STRING,
PRIMARY KEY (`id`) NOT ENFORCED
) with (
'connector'='upsert-kafka',
'topic'='data_test',
'properties.bootstrap.servers' = 'kafka1:9092',
'properties.group.id' = 'testGroup',
'key.format'='json',
'value.format'='json'
);
- Write the MySQL binlog into Kafka
insert into student_binlog_sink_kafka select * from student_binlog;
In Flink's Web UI you can see the job that was just submitted.
Switch on tableau result mode and query the table:
set 'sql-client.execution.result-mode' = 'tableau';
select * from student_binlog_sink_kafka;
Insert and update some rows in the MySQL student table to test:
INSERT INTO student VALUES(1,'張三',16,'高一3班');
COMMIT;
INSERT INTO student VALUES(2,'李四',18,'高三3班');
COMMIT;
UPDATE student SET NAME='李四四' WHERE id = 2;
COMMIT;
Flink reads the Kafka data and writes it into the Hudi data lake
- Create the Kafka source table
CREATE TABLE student_binlog_source_kafka (
id INT NOT NULL,
name STRING,
age INT,
class STRING
)
WITH(
'connector' = 'kafka',
'topic'='data_test',
'properties.bootstrap.servers' = 'kafka1:9092',
'properties.group.id' = 'testGroup',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
);
- Create the Hudi target table
CREATE TABLE student_binlog_sink_hudi (
id INT NOT NULL,
name STRING,
age INT,
class STRING,
PRIMARY KEY (`id`) NOT ENFORCED
)
PARTITIONED BY (`class`)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/student_binlog_sink_hudi',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'insert',
'write.precombine.field' = 'class'
);
- Write the Kafka data into the Hudi table
insert into student_binlog_sink_hudi select * from student_binlog_source_kafka;
Insert two more rows into the student table in MySQL:
INSERT INTO student VALUES(3,'韓梅梅',16,'高二2班');
INSERT INTO student VALUES(4,'李雷',16,'高二2班');
COMMIT;
On HDFS you can see that the corresponding partitions and data files are now present.
Tuning
Memory
Parameter | Description | Default | Notes |
---|---|---|---|
write.task.max.size | Maximum memory available to each write task; data is flushed once this is exceeded | 1024MB | Memory available to the write buffer = write.task.max.size - compaction.max_memory; when the total memory used by write buffers exceeds the limit, the largest buffer is flushed |
write.batch.size | Batch size for writing data | 64MB | The default is recommended |
write.log_block.size | The Hudi log writer buffers data and only flushes it to disk as a log block once this limit is reached | 128MB | The default is recommended |
write.merge.max_memory | For COPY_ON_WRITE tables, the maximum heap size available for merging incremental data with the data files | 100MB | The default is recommended |
compaction.max_memory | Maximum heap size each write task may use for compaction | 100MB | For online compaction, this can be increased (e.g. to 1024MB) if resources allow |
Parallelism
Parameter | Description | Default | Notes |
---|---|---|---|
write.tasks | Parallelism of the write tasks; each write task writes 1~N sequential buckets | 4 | Increasing this value does not affect the number of small files |
write.bucket_assign.tasks | Parallelism of the bucket assigner operators | Flink's parallelism.default | Increasing this value increases the number of buckets, and therefore also the number of small files |
write.index_bootstrap.tasks | Parallelism of the index bootstrap operator | Flink's parallelism.default | |
read.tasks | Parallelism of the read operators | 4 | |
compaction.tasks | Parallelism of online compaction | 4 | Offline compaction is recommended instead |
Compaction
The following options apply only to online compaction.
Parameter | Description | Default | Notes |
---|---|---|---|
compaction.schedule.enabled | Whether to periodically generate compaction plans | true | Recommended to keep this enabled even when compaction.async.enabled = false |
compaction.async.enabled | Whether async compaction is enabled (on by default for MOR tables) | true | false turns off online compaction |
compaction.trigger.strategy | Strategy for triggering compaction | num_commits | Possible values: 1. num_commits: a given number of delta commits has accumulated; 2. time_elapsed: a given number of seconds has passed since the last compaction; 3. num_and_time: both num_commits and time_elapsed are satisfied; 4. num_or_time: either num_commits or time_elapsed is satisfied |
compaction.delta_commits | Number of delta commits needed to trigger compaction | 5 | |
compaction.delta_seconds | Number of seconds needed to trigger compaction | 3600 | |
compaction.target_io | Target total read-and-write IO per compaction (500GB by default) | 512000 | |
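As a sketch of how such tuning options are applied, they all go into the WITH clause of the Hudi table definition, next to the connector options used earlier. The table name below is illustrative and the values simply echo a few entries from the tables above; they are not a tuning recommendation.
CREATE TABLE hudi_tuned_tbl (
  uuid VARCHAR(20),
  name VARCHAR(10),
  ts TIMESTAMP(3),
  PRIMARY KEY (uuid) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/hudi_tuned_tbl',
  'table.type' = 'MERGE_ON_READ',
  'write.task.max.size' = '1024',
  'compaction.max_memory' = '1024',
  'write.tasks' = '4',
  'write.bucket_assign.tasks' = '4',
  'compaction.trigger.strategy' = 'num_commits',
  'compaction.delta_commits' = '5'
);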
Integrating with Hive
A Hudi source table corresponds to a copy of data on HDFS. Using Spark, Flink, or the Hudi client, a Hudi table can be mapped to a Hive external table, and through that external table Hive can conveniently run real-time view, read-optimized view, and incremental view queries (the first two are sketched below).
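For a MERGE_ON_READ table, Hudi's Hive sync typically registers two Hive tables, one per view, using _ro and _rt suffixes. The table name hudi_mor_tbl below is hypothetical, and incremental queries need additional Hudi-specific Hive settings that are not shown here.
-- read-optimized view: reads only the compacted base (parquet) files
SELECT * FROM hudi_mor_tbl_ro;
-- real-time view: merges base files with the latest log files at query time
SELECT * FROM hudi_mor_tbl_rt;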
Integration Steps
The example uses Hive 3.1.3 (see the earlier article for details on Hive) and Hudi 0.12.1; other versions are similar.
Put hudi-hadoop-mr-bundle-0.12.1.jar and hudi-hive-sync-bundle-0.12.1.jar into the lib directory of the HiveServer node:
cd /home/commons/apache-hive-3.1.3-bin
cp -rf /home/commons/hudi-release-0.12.1/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.12.1.jar lib/
cp -rf /home/commons/hudi-release-0.12.1/packaging/hudi-hive-sync-bundle/target/hudi-hive-sync-bundle-0.12.1.jar lib/
Choose the appropriate startup mode for your setup and restart Hive:
nohup hive --service metastore &
nohup hive --service hiveserver2 &
Connect over JDBC (hive2) to test, and show all databases.
Flink Sync to Hive
Flink Hive sync currently supports two sync modes: hms and jdbc. The hms mode only requires the metastore URIs to be configured, while the jdbc mode requires both the JDBC properties and the metastore URIs. A concrete example:
CREATE TABLE t7(
id int,
num int,
ts int,
primary key (id) not enforced
)
PARTITIONED BY (num)
with(
'connector'='hudi',
'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/t7',
'table.type'='COPY_ON_WRITE',
'hive_sync.enable'='true',
'hive_sync.table'='h7',
'hive_sync.db'='default',
'hive_sync.mode' = 'hms',
'hive_sync.metastore.uris' = 'thrift://hadoop2:9083'
);
insert into t7 values(1,1,1);
Hive Catalog
On the Flink website, open the documentation for the matching version and find the Hive connector, then download flink-sql-connector-hive-3.1.2_2.12-1.15.1.jar and upload it to Flink's lib directory. Table creation example:
CREATE CATALOG hive_catalog WITH (
'type' = 'hive',
'default-database' = 'default',
'hive-conf-dir' = '/home/commons/apache-hive-3.1.3-bin/conf/'
);
use catalog hive_catalog;
CREATE TABLE t8(
id int,
num int,
ts int,
primary key (id) not enforced
)
PARTITIONED BY (num)
with(
'connector'='hudi',
'path' = 'hdfs://hadoop1:9000/tmp/hudi_flink/t8',
'table.type'='COPY_ON_WRITE',
'hive_sync.enable'='true',
'hive_sync.table'='h8',
'hive_sync.db'='default',
'hive_sync.mode' = 'hms',
'hive_sync.metastore.uris' = 'thrift://hadoop2:9083'
);
My blog: IT小神 (www.itxiaoshen.com)