Lecture 10: Splitting Streams with Flink Side Outputs

Published by 大資料技術派 on 2022-02-12

Original URL: https://www.cnblogs.com/data-magnifier/p/15885543.html

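This lesson introduces an important feature that Flink provides: side outputs, which route the records of one stream into several streams.

Splitting Scenarios

In production we often need to split an input source according to some criterion, for example splitting an order stream by order amount, or splitting user access logs by the visitor's geographic location. How should such a requirement be handled?

Ways to Split a Stream

Depending on the scenario, there are generally three ways to split a stream.

Splitting with Filter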

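We covered the filter method in Lesson 04 (Flink's Commonly Used DataSet and DataStream APIs). This operator filters elements by a user-supplied condition: every element is passed to the filter() function and is kept if the function returns true, otherwise it is discarded. To split a stream, we can apply filter several times, each call producing a separate stream that holds the data we need.

Consider the following example: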

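import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

// The class name is illustrative; the original shows only the main method.
public class FilterSplitExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the data source
        List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
        data.add(new Tuple3<>(0, 1, 0));
        data.add(new Tuple3<>(0, 1, 1));
        data.add(new Tuple3<>(0, 2, 2));
        data.add(new Tuple3<>(0, 1, 3));
        data.add(new Tuple3<>(1, 2, 5));
        data.add(new Tuple3<>(1, 2, 9));
        data.add(new Tuple3<>(1, 2, 11));
        data.add(new Tuple3<>(1, 2, 13));

        DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);

        // Each filter keeps only the tuples whose first field (f0) matches
        SingleOutputStreamOperator<Tuple3<Integer, Integer, Integer>> zeroStream =
                items.filter((FilterFunction<Tuple3<Integer, Integer, Integer>>) value -> value.f0 == 0);
        SingleOutputStreamOperator<Tuple3<Integer, Integer, Integer>> oneStream =
                items.filter((FilterFunction<Tuple3<Integer, Integer, Integer>>) value -> value.f0 == 1);

        // Print the results: zeroStream to stdout, oneStream to stderr
        zeroStream.print();
        oneStream.printToErr();

        String jobName = "user defined streaming source";
        env.execute(jobName);
    }
}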

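In the example above we use the filter operator to split the original stream. Records whose first element is 0 go to zeroStream, records whose first element is 1 go to oneStream, and both streams are then printed.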

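Both zeroStream and oneStream are printed, zeroStream to standard output and oneStream to standard error.

The drawback of filter is obvious: to obtain the streams we need, the original stream has to be traversed once per filter, so every record is tested against every condition. This quietly wastes cluster resources.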
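To make this cost concrete, the following sketch counts record visits in plain Java, without Flink. It is only a model under stated assumptions: the names FilterCostSketch, SplitResult, splitWithFilters and splitInOnePass are made up for illustration, and the input is just the first field of the eight sample tuples used in this lesson.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FilterCostSketch {
    /** Result of a split: the branches plus the number of record visits it took. */
    public static final class SplitResult {
        public final List<List<Integer>> branches = new ArrayList<>();
        public int visits = 0;
    }

    /** Filter-style split: one full scan of the input per branch. */
    public static SplitResult splitWithFilters(List<Integer> keys, int branchCount) {
        SplitResult result = new SplitResult();
        for (int branch = 0; branch < branchCount; branch++) {
            List<Integer> kept = new ArrayList<>();
            for (int key : keys) {
                result.visits++;            // every record is tested by every filter
                if (key == branch) {
                    kept.add(key);
                }
            }
            result.branches.add(kept);
        }
        return result;
    }

    /** Routing-style split: each record is visited once and sent to its branch. */
    public static SplitResult splitInOnePass(List<Integer> keys, int branchCount) {
        SplitResult result = new SplitResult();
        for (int i = 0; i < branchCount; i++) {
            result.branches.add(new ArrayList<>());
        }
        for (int key : keys) {
            result.visits++;
            if (key >= 0 && key < branchCount) {
                result.branches.get(key).add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First fields (f0) of the eight sample tuples.
        List<Integer> keys = Arrays.asList(0, 0, 0, 0, 1, 1, 1, 1);
        SplitResult filtered = splitWithFilters(keys, 2);
        SplitResult routed = splitInOnePass(keys, 2);
        System.out.println("filter visits: " + filtered.visits);      // 16
        System.out.println("one-pass visits: " + routed.visits);      // 8
        System.out.println("same branches: " + filtered.branches.equals(routed.branches)); // true
    }
}
```

With two branches the filter style visits 16 records where a single routing pass visits 8, and the gap grows linearly with the number of branches.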

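Splitting with Split

Split is another method Flink offers for cutting a stream apart. You define an OutputSelector in the split operator and override its select method to tag each record with one or more output names, then call select on the returned SplitStream to pick out the records carrying a given name.

Consider the following example: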

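import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

// The class name is illustrative; the original shows only the main method.
// split() is deprecated and was removed in Flink 1.12, so this compiles only on older versions.
public class SplitExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the data source
        List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
        data.add(new Tuple3<>(0, 1, 0));
        data.add(new Tuple3<>(0, 1, 1));
        data.add(new Tuple3<>(0, 2, 2));
        data.add(new Tuple3<>(0, 1, 3));
        data.add(new Tuple3<>(1, 2, 5));
        data.add(new Tuple3<>(1, 2, 9));
        data.add(new Tuple3<>(1, 2, 11));
        data.add(new Tuple3<>(1, 2, 13));

        DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);

        // Tag every record with the name of the output it belongs to
        SplitStream<Tuple3<Integer, Integer, Integer>> splitStream =
                items.split(new OutputSelector<Tuple3<Integer, Integer, Integer>>() {
            @Override
            public Iterable<String> select(Tuple3<Integer, Integer, Integer> value) {
                List<String> tags = new ArrayList<>();
                if (value.f0 == 0) {
                    tags.add("zeroStream");
                } else if (value.f0 == 1) {
                    tags.add("oneStream");
                }
                return tags;
            }
        });

        // Select each tagged stream and print it (stdout and stderr respectively)
        splitStream.select("zeroStream").print();
        splitStream.select("oneStream").printToErr();

        String jobName = "user defined streaming source";
        env.execute(jobName);
    }
}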

Again we split the source data, this time with the split operator, and print the result.

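Note, however, that a stream produced by the split operator cannot be split a second time. If we call split again on the zeroStream or oneStream obtained above, the following exception is thrown: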

Exception in thread "main" java.lang.IllegalStateException: Consecutive multiple splits are not supported. Splits are deprecated. Please use side-outputs.

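Why is that? A comment in the source code shows that split has been deprecated and recommends side outputs for splitting instead. The operator was later removed entirely in Flink 1.12.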

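Splitting with Side Outputs

Side outputs are the newest and most strongly recommended way to split a stream in Flink. Using them takes the following steps:

1. Define an OutputTag for each side stream.
2. Emit records to those tags from one of the following functions:
- ProcessFunction
- KeyedProcessFunction
- CoProcessFunction
- KeyedCoProcessFunction
- ProcessWindowFunction
- ProcessAllWindowFunction
3. Retrieve each side stream by calling getSideOutput with its OutputTag on the operator's result.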
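The steps above can be modeled in plain Java to make the data flow explicit before any Flink API is involved. This is a sketch, not Flink code: Tag, Outputs and route are hypothetical stand-ins for OutputTag, the function's context and the processing function, and the threshold of 100 is an arbitrary value chosen for the order-amount scenario mentioned earlier. It also illustrates one property of typed tags: each side output can carry its own element type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SideOutputModel {
    /** Hypothetical stand-in for an output tag: an id plus a compile-time element type. */
    public static final class Tag<T> {
        public final String id;
        public Tag(String id) { this.id = id; }
    }

    /** Hypothetical stand-in for the place where tagged records are collected and read back. */
    public static final class Outputs {
        private final Map<String, List<Object>> byTag = new HashMap<>();

        /** Step 2: emit a record to the side output identified by the tag. */
        public <T> void output(Tag<T> tag, T value) {
            byTag.computeIfAbsent(tag.id, k -> new ArrayList<>()).add(value);
        }

        /** Step 3: read back everything that was emitted to the tag. */
        @SuppressWarnings("unchecked")
        public <T> List<T> getSideOutput(Tag<T> tag) {
            return (List<T>) (List<?>) byTag.getOrDefault(tag.id, new ArrayList<>());
        }
    }

    /** Routes each order amount in a single pass; 100 is an assumed threshold. */
    public static Outputs route(List<Integer> amounts, Tag<Integer> small, Tag<String> large) {
        Outputs outputs = new Outputs();
        for (int amount : amounts) {
            if (amount < 100) {
                outputs.output(small, amount);
            } else {
                outputs.output(large, "large order: " + amount);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Step 1: define one tag per side stream; the two tags have different types.
        Tag<Integer> small = new Tag<>("small");
        Tag<String> large = new Tag<>("large");

        Outputs outputs = route(Arrays.asList(20, 150, 99), small, large);
        System.out.println(outputs.getSideOutput(small)); // [20, 99]
        System.out.println(outputs.getSideOutput(large)); // [large order: 150]
    }
}
```

Because the tag carries the element type, the compiler rejects an attempt to emit a String to the Integer tag, which string-named outputs cannot offer.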

Here we use ProcessFunction to show how side outputs work:

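import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.util.ArrayList;
import java.util.List;

// The class name is illustrative; the original shows only the main method.
public class SideOutputExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the data source
        List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
        data.add(new Tuple3<>(0, 1, 0));
        data.add(new Tuple3<>(0, 1, 1));
        data.add(new Tuple3<>(0, 2, 2));
        data.add(new Tuple3<>(0, 1, 3));
        data.add(new Tuple3<>(1, 2, 5));
        data.add(new Tuple3<>(1, 2, 9));
        data.add(new Tuple3<>(1, 2, 11));
        data.add(new Tuple3<>(1, 2, 13));

        DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);

        // Step 1: define one OutputTag per side stream (note the trailing {})
        OutputTag<Tuple3<Integer, Integer, Integer>> zeroStream =
                new OutputTag<Tuple3<Integer, Integer, Integer>>("zeroStream") {};
        OutputTag<Tuple3<Integer, Integer, Integer>> oneStream =
                new OutputTag<Tuple3<Integer, Integer, Integer>>("oneStream") {};

        // Step 2: route each record to a side output; nothing is emitted to the main output here
        SingleOutputStreamOperator<Tuple3<Integer, Integer, Integer>> processStream = items.process(
                new ProcessFunction<Tuple3<Integer, Integer, Integer>, Tuple3<Integer, Integer, Integer>>() {
            @Override
            public void processElement(Tuple3<Integer, Integer, Integer> value, Context ctx,
                                       Collector<Tuple3<Integer, Integer, Integer>> out) throws Exception {
                if (value.f0 == 0) {
                    ctx.output(zeroStream, value);
                } else if (value.f0 == 1) {
                    ctx.output(oneStream, value);
                }
            }
        });

        // Step 3: retrieve the side streams by tag
        DataStream<Tuple3<Integer, Integer, Integer>> zeroSideOutput = processStream.getSideOutput(zeroStream);
        DataStream<Tuple3<Integer, Integer, Integer>> oneSideOutput = processStream.getSideOutput(oneStream);

        // Print the results: zeroSideOutput to stdout, oneSideOutput to stderr
        zeroSideOutput.print();
        oneSideOutput.printToErr();

        String jobName = "user defined streaming source";
        env.execute(jobName);
    }
}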

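As you can see, the stream is split and the results are printed successfully. Note that a stream split with side outputs can be split again as many times as needed, without triggering the exception that split raises.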
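One detail in the example above deserves a note: each OutputTag is created with a trailing {}, which makes it an anonymous subclass. Java erases generic type arguments on ordinary instances, but a subclass records them in its class metadata, and that is what lets Flink work out the element type of the side output. The sketch below shows the underlying Java mechanism only. TypedTag is a hypothetical class written for this illustration, not Flink's OutputTag, and Flink's actual type extraction is more involved.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

public class TypeCaptureSketch {
    /** Hypothetical tag class; it only mimics how a subclass exposes its type argument. */
    public static class TypedTag<T> {
        public final String id;

        public TypedTag(String id) {
            this.id = id;
        }

        /** Returns the captured type argument, or null when the tag was not subclassed. */
        public Type capturedType() {
            Type superType = getClass().getGenericSuperclass();
            if (superType instanceof ParameterizedType) {
                return ((ParameterizedType) superType).getActualTypeArguments()[0];
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // Plain instance: the type argument is erased, nothing can be recovered.
        TypedTag<List<Integer>> plain = new TypedTag<>("plain");
        // Anonymous subclass (trailing {}): the type argument is kept in class metadata.
        TypedTag<List<Integer>> anonymous = new TypedTag<List<Integer>>("anonymous") {};

        System.out.println(plain.capturedType());     // null
        System.out.println(anonymous.capturedType()); // java.util.List<java.lang.Integer>
    }
}
```

If the braces are left off a real OutputTag, Flink cannot determine the type from the tag alone; the alternative is the constructor that takes an explicit TypeInformation argument.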

Summary

This lesson covered a small piece of Flink that comes up often in production work. Recent Flink versions also recommend side outputs as the way to split a stream.
