Computing a real-time TopN hot list with Flink

Posted by 柯廣 on 2021-03-11

Original URL: https://www.cnblogs.com/data-magnifier/p/14520843.html

Common TopN use cases include the best-selling products by purchase count, the most popular authors by read count, and so on.

1. Key concepts used

Creating a Kafka data source in Flink;
Processing based on EventTime, and how to specify a Watermark;
Windows in Flink: tumbling windows and sliding windows;
Using State;
Implementing TopN with a ProcessFunction;

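2. Case overview

From user access logs, compute the top N most active users on the platform over a recent period of time.

Create a Kafka producer and send test data to Kafka;
Consume the Kafka data with a sliding window and refresh the ranking at a fixed interval;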

3. Data source

Here the Kafka API is used to send test data to Kafka. The code is as follows:

@Data
@NoArgsConstructor
@AllArgsConstructor
@ToString
public class User {

    private long id;
    private String username;
    private String password;
    private long timestamp;
}

Map<String, String> config = Configuration.initConfig("commons.xml");

@Test
public void sendData() throws InterruptedException {
    int cnt = 0;

    while (cnt < 200){
        User user = new User();
        user.setId(cnt);
        user.setUsername("username" + new Random().nextInt((cnt % 5) + 2));
        user.setPassword("password" + cnt);
        user.setTimestamp(System.currentTimeMillis());
        Future<RecordMetadata> future = KafkaUtil.sendDataToKafka(config.get("kafka-topic"), String.valueOf(cnt), JSON.toJSONString(user));
        while (!future.isDone()){
            Thread.sleep(100);
        }
        try {
            RecordMetadata recordMetadata = future.get();
            System.out.println(recordMetadata.offset());
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
        System.out.println("Sent message: " + cnt + "******" + user.toString());
        cnt = cnt + 1;
    }
}

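A random number is used here to perturb the username, so that different usernames receive different numbers of records and the result is easier to see. KafkaUtil is a small Kafka utility class I wrote myself; the code is simple and mainly makes everyday testing more convenient.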
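To see how uneven the usernames become, the skew can be worked out without running Kafka. The following is a minimal sketch, assuming the generator above is used exactly as written, that is `nextInt((cnt % 5) + 2)` for `cnt` from 0 to 199. The class and method names are hypothetical. It computes the expected number of messages per username rather than sampling random numbers.

```java
class UsernameSkewSketch {

    /**
     * Expected number of messages for username0..username5 when the suffix
     * is drawn uniformly from [0, (cnt % 5) + 2) for cnt = 0..messages-1.
     */
    static double[] expectedCounts(int messages) {
        double[] expected = new double[6]; // the bound never exceeds 6
        for (int cnt = 0; cnt < messages; cnt++) {
            int bound = (cnt % 5) + 2;     // cycles through 2, 3, 4, 5, 6
            for (int k = 0; k < bound; k++) {
                expected[k] += 1.0 / bound; // each suffix below the bound is equally likely
            }
        }
        return expected;
    }

    public static void main(String[] args) {
        double[] expected = expectedCounts(200);
        for (int k = 0; k < expected.length; k++) {
            System.out.printf("username%d: %.1f%n", k, expected[k]);
        }
    }
}
```

Under these assumptions, username0 and username1 each expect about 58 of the 200 messages, followed by roughly 38, 25, 15 and 7 for username2 through username5, which is why the lower-numbered usernames tend to lead the ranking.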

4. Main program

Create a main program and start writing the code.

Create the Flink environment and connect the Kafka data source.

Map<String, String> config = Configuration.initConfig("commons.xml");

Properties kafkaProps = new Properties();
kafkaProps.setProperty("zookeeper.connect", config.get("kafka-zookeeper"));
kafkaProps.setProperty("bootstrap.servers", config.get("kafka-ipport"));
kafkaProps.setProperty("group.id", config.get("kafka-groupid"));

StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();

EventTime and Watermark

senv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Setting senv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) tells Flink to process records according to the time field carried in the data. The default is TimeCharacteristic.ProcessingTime:

/** The time characteristic that is used if none other is set. */
	private static final TimeCharacteristic DEFAULT_TIME_CHARACTERISTIC = TimeCharacteristic.ProcessingTime;

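This property must be set for this job; otherwise the event-time windows later in the pipeline may never fire, and no result is output. It can take three values:

ProcessingTime: the time at which the event is processed, determined by the system clock of the Flink cluster machine.
EventTime: the time at which the event occurred, usually a timestamp carried by the data itself.
IngestionTime: the time at which the record enters the Flink stream. It is assigned once on entry, which is what distinguishes it from ProcessingTime;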

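Having chosen to process by the data's actual time, we next need to tell the Flink program how to obtain the time field from each record. This is done by calling the DataStream's assignTimestampsAndWatermarks method, which extracts the timestamp and sets the watermark.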

senv.addSource(
        new FlinkKafkaConsumer010<>(
                config.get("kafka-topic"),
                new SimpleStringSchema(),
                kafkaProps
        )
).map(x ->{
    return JSON.parseObject(x, User.class);
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<User>(Time.milliseconds(1000)) {
    @Override
    public long extractTimestamp(User element) {
        return element.getTimestamp();
    }
})

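As the earlier code shows, the User object was converted to a JSON string (with fastjson) when it was sent to Kafka. On the receiving side it could be handled as a JSONObject, but here I convert it back to a User object with JSON.parseObject(x, User.class), which is easier to work with.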

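Because the data may arrive out of order, we use the abstract class BoundedOutOfOrdernessTimestampExtractor, which can handle out-of-order data, and implement its only abstract method, extractTimestamp. Out-of-order data means that some records arrive late; passing Time.milliseconds(1000) to the constructor declares that records may be up to one second late. For example, with a window length of 10s, the data of the 0~10s window is computed once an event with a timestamp of about 11s arrives: only then does the watermark reach 10s and trigger the computation. In other words, with a watermark handling out-of-order data, records belonging to the window 0~t are tolerated until time t+1 at the latest.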
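The timing described above can be reproduced in plain Java, without Flink. The following is a minimal sketch, not Flink's implementation: the class and method names are hypothetical, and the firing rule (a window fires once the watermark reaches `windowEnd - 1`) reflects my understanding of Flink's event-time trigger rather than anything stated in this article.

```java
class WatermarkSketch {

    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen;

    WatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Start low enough that the first watermark is Long.MIN_VALUE without underflow
        this.maxTimestampSeen = Long.MIN_VALUE + maxOutOfOrdernessMs;
    }

    /** Records an event; a late event never moves the maximum backwards. */
    void onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
    }

    /** The watermark trails the largest timestamp seen by the allowed lateness. */
    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }

    /** Assumed firing rule: the watermark has reached the last millisecond of the window. */
    boolean windowCanFire(long windowEndMs) {
        return currentWatermark() >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(1000);

        wm.onEvent(9_500);
        System.out.println(wm.currentWatermark() + " " + wm.windowCanFire(10_000)); // 8500 false

        wm.onEvent(9_200);  // out of order, but still inside the 0~10s window
        wm.onEvent(11_000); // pushes the watermark to 10000
        System.out.println(wm.currentWatermark() + " " + wm.windowCanFire(10_000)); // 10000 true
    }
}
```

The event at 9200 arrives after the one at 9500 and is still counted, because the 0~10s window does not fire until an event at roughly 11s has been seen.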

For a detailed explanation of watermarks, see this article:

https://blog.csdn.net/qq_39657909/article/details/106081543

Window statistics

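In real business requirements the window is usually something like the past hour or the past 15 minutes, with the ranking refreshed every 5 minutes. For demonstration purposes, the window length here is 10s with a slide of 5s, that is, every 5 seconds the ranking over the past 10s is refreshed.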
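With a 10s window sliding every 5s, every record belongs to two overlapping windows. The following is a minimal plain-Java sketch of that assignment, assuming windows are aligned to multiples of the slide with no offset. The class and method names are hypothetical and this is not Flink's own code.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowSketch {

    /** Returns the {start, end} pair of every sliding window [start, end) containing the timestamp. */
    static List<long[]> assignWindows(long timestampMs, long sizeMs, long slideMs) {
        List<long[]> windows = new ArrayList<>();
        // Start of the most recent window that begins at or before the timestamp
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            windows.add(new long[] {start, start + sizeMs});
        }
        return windows;
    }

    public static void main(String[] args) {
        // A record at 12s with a 10s window and a 5s slide
        for (long[] w : assignWindows(12_000, 10_000, 5_000)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
        // prints:
        // [10000, 20000)
        // [5000, 15000)
    }
}
```

Each record is therefore counted once per window it falls into, so the same user can appear in two consecutive rankings with different counts.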

.keyBy("username")
.timeWindow(Time.seconds(10), Time.seconds(5))
.aggregate(new CountAgg(), new WindowResultFunction())

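We use .keyBy("username") to group the records by user, and .timeWindow(Time size, Time slide) to apply a sliding window to each user (a 10s window that slides every 5s). We then use .aggregate(AggregateFunction af, WindowFunction wf) to aggregate incrementally: the AggregateFunction pre-aggregates records as they arrive, which reduces the storage pressure on state. This is far more efficient than .apply(WindowFunction wf), which stores every record in the window and computes over all of them at the end. The first argument of aggregate() is the function that performs this incremental aggregation.

The CountAgg used here implements the AggregateFunction interface. It counts the records in the window, adding one for every record it sees.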

public class CountAgg implements AggregateFunction<User, Long, Long>{
    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(User value, Long accumulator) {
        return accumulator + 1;
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
}
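The accumulator lifecycle of CountAgg can be tried outside Flink. The following is a minimal sketch with hypothetical static methods that mirror the four methods above, using a String as a stand-in for the User record. It is meant to show that only one long is kept per key and window, and that partial counts can be merged.

```java
import java.util.Arrays;
import java.util.List;

class IncrementalCountSketch {

    // Stand-ins for the four AggregateFunction methods of CountAgg
    static long createAccumulator() { return 0L; }
    static long add(String event, long accumulator) { return accumulator + 1; }
    static long getResult(long accumulator) { return accumulator; }
    static long merge(long a, long b) { return a + b; }

    /** Folds events into a single running count instead of buffering them. */
    static long count(List<String> events) {
        long accumulator = createAccumulator();
        for (String event : events) {
            accumulator = add(event, accumulator); // the event itself is not retained
        }
        return getResult(accumulator);
    }

    public static void main(String[] args) {
        List<String> firstPart = Arrays.asList("username0", "username0", "username0");
        List<String> secondPart = Arrays.asList("username0");

        long a = count(firstPart);
        long b = count(secondPart);
        System.out.println(merge(a, b)); // 4
    }
}
```

However many records arrive, the state for a given user and window stays a single number, which is the saving the incremental approach is after.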

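The second argument of .aggregate(AggregateFunction af, WindowFunction wf), a WindowFunction, emits the aggregated result of each key and each window together with additional information. The WindowResultFunction implemented here wraps the username, the window and the view count into a UserViewCount for output.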

private static class WindowResultFunction implements WindowFunction<Long, UserViewCount, Tuple, TimeWindow> {


    @Override
    public void apply(Tuple key, TimeWindow window, Iterable<Long> input, Collector<UserViewCount> out) throws Exception {
        Long count = input.iterator().next();
        out.collect(new UserViewCount(((Tuple1<String>)key).f0, window.getEnd(), count));
    }
}

@Data
@NoArgsConstructor
@AllArgsConstructor
@ToString
public static class UserViewCount {
    private String userName;
    private long windowEnd;
    private long viewCount;

}

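Computing the TopN most active users

To find the active users of each window, we need to group again, this time by window, so we call keyBy() on the windowEnd field of UserViewCount. We then use a ProcessFunction (here a KeyedProcessFunction) to implement a custom TopN function, TopNHotUsers, which computes the top 3 users by view count and formats the ranking as a string for later output.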

.keyBy("windowEnd")
.process(new TopNHotUsers(3))
.print();

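ProcessFunction is a low-level API provided by Flink for implementing more advanced functionality. Its main feature is timers, which support both EventTime and ProcessingTime. In this case we use a timer to decide when all user counts of a given window have been received. Because watermark progress is global, each time a record (UserViewCount) arrives in processElement we register an event-time timer shortly after windowEnd; the code below uses windowEnd + 1000, although windowEnd + 1 would be enough. Flink automatically ignores repeated registrations for the same timestamp. When that timer fires, the watermark has passed windowEnd, which means all per-user window counts for that windowEnd have been received. In onTimer() we sort the collected users by view count, pick the TopN, and format the ranking as a string for output.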
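The timer behaviour described above can be modelled in a few lines of plain Java. The following is a simplified sketch for a single key, with hypothetical class and method names. It is not Flink's timer service, only an illustration of two properties: duplicate registrations collapse into one timer, and a timer fires once the watermark reaches its timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class EventTimeTimerSketch {

    // A sorted set keeps one entry per timestamp, so duplicates are ignored
    private final TreeSet<Long> timers = new TreeSet<>();

    void registerEventTimeTimer(long timestampMs) {
        timers.add(timestampMs);
    }

    int pendingTimers() {
        return timers.size();
    }

    /** Fires, in order, every timer whose timestamp the watermark has reached. */
    List<Long> advanceWatermark(long watermarkMs) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.first() <= watermarkMs) {
            fired.add(timers.pollFirst());
        }
        return fired;
    }

    public static void main(String[] args) {
        EventTimeTimerSketch service = new EventTimeTimerSketch();
        long windowEnd = 10_000;

        // Three users report a count for the same window; each registers the same timer
        for (int user = 0; user < 3; user++) {
            service.registerEventTimeTimer(windowEnd + 1000);
        }
        System.out.println(service.pendingTimers());          // 1
        System.out.println(service.advanceWatermark(10_500)); // []
        System.out.println(service.advanceWatermark(11_000)); // [11000]
    }
}
```

Registering the timer on every record is therefore harmless: the ranking for a window is still produced exactly once, after the last count for that window could have arrived.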

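We also use a ListState<UserViewCount> to store every UserViewCount message received, so that state is neither lost nor left inconsistent when a failure occurs. ListState is a State API provided by Flink that resembles the Java List interface. It is integrated with the framework's checkpoint mechanism and therefore gets exactly-once semantics automatically.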

private static class TopNHotUsers extends KeyedProcessFunction<Tuple, UserViewCount, String> {

    private int topSize;
    private ListState<UserViewCount> userViewCountListState;

    public TopNHotUsers(int topSize) {
        this.topSize = topSize;
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        super.onTimer(timestamp, ctx, out);
        // Copy the buffered counts of this window out of state
        List<UserViewCount> userViewCounts = new ArrayList<>();
        for(UserViewCount userViewCount : userViewCountListState.get()) {
            userViewCounts.add(userViewCount);
        }

        // Clear the state early; it is no longer needed for this window
        userViewCountListState.clear();

        // Sort by view count, descending. Long.compare avoids the overflow
        // that casting a long difference to int can cause.
        userViewCounts.sort(new Comparator<UserViewCount>() {
            @Override
            public int compare(UserViewCount o1, UserViewCount o2) {
                return Long.compare(o2.viewCount, o1.viewCount);
            }
        });

        // Format the ranking as a String for printing
        StringBuilder result = new StringBuilder();
        result.append("====================================\n");
        // The timer was registered at windowEnd + 1000, so subtract 1000 to recover windowEnd
        result.append("Time: ").append(new Timestamp(timestamp - 1000)).append("\n");
        // A window may contain fewer than topSize users
        for (int i = 0; i < Math.min(topSize, userViewCounts.size()); i++) {
            UserViewCount currentItem = userViewCounts.get(i);
            // e.g. No1:  username=username1  views=2413
            result.append("No").append(i + 1).append(":")
                    .append("  username=").append(currentItem.userName)
                    .append("  views=").append(currentItem.viewCount)
                    .append("\n");
        }
        result.append("====================================\n\n");

        // Demo only: slows the output down so that it is easier to read
        Thread.sleep(1000);

        out.collect(result.toString());

    }

    @Override
    public void open(org.apache.flink.configuration.Configuration parameters) throws Exception {
        super.open(parameters);
        ListStateDescriptor<UserViewCount> userViewCountListStateDescriptor = new ListStateDescriptor<>(
                "user-state",
                UserViewCount.class
        );
        userViewCountListState = getRuntimeContext().getListState(userViewCountListStateDescriptor);

    }

    @Override
    public void processElement(UserViewCount value, Context ctx, Collector<String> out) throws Exception {
        userViewCountListState.add(value);
        ctx.timerService().registerEventTimeTimer(value.windowEnd + 1000);
    }
}
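The ranking step inside onTimer() can be tried in isolation. The following is a minimal plain-Java sketch with hypothetical names, using a map of username to view count in place of the UserViewCount list. It sorts in descending order and takes at most N entries, so a window with fewer than N users does not cause an error.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopNSketch {

    /** Returns up to n entries with the largest counts, highest first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Descending by count; Long.compare is safe for large values
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        // Never ask for more entries than the window actually contains
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("username0", 7L);
        counts.put("username1", 12L);
        counts.put("username2", 3L);
        counts.put("username3", 9L);

        List<Map.Entry<String, Long>> top = topN(counts, 3);
        for (int i = 0; i < top.size(); i++) {
            System.out.println("No" + (i + 1) + ": " + top.get(i).getKey() + " = " + top.get(i).getValue());
        }
        // prints:
        // No1: username1 = 12
        // No2: username3 = 9
        // No3: username0 = 7
    }
}
```

Sorting the whole list is fine for a handful of users per window. With very many keys per window, a bounded min-heap of size N would be a different technique worth considering, but that is outside what this article implements.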

Result output

As the output shows, the ranking is refreshed and printed every 5 seconds.

Reference
http://wuchong.me/blog/2018/11/07/use-flink-calculate-hot-items/
