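Several ways to compute pv and uv in real time with Flink

Posted by 小柯同學 on 2021-06-10

Original URL: https://www.cnblogs.com/data-magnifier/p/14873263.html

This article was first published on Java大資料與資料倉儲 under the same title.

Counting pv and uv in real time is one of the most common requirements in big data statistics. An earlier post covered the same task with Spark Streaming; this one does it with Flink.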

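We need to count the daily pv and uv for each data type, subject to the following requirements:

The latest result must be output every second;
The job runs indefinitely, so stale data must be purged from memory periodically;
The time field in incoming messages is not strictly increasing, so some tolerance for out-of-order data is required;
The uv does not necessarily change every second, and repeated output wastes a lot of IO, so a result should be emitted within one second of a change and nothing should be emitted otherwise.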
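
The last requirement (emit only on change) can be illustrated without Flink. The following is a minimal sketch in plain Scala; the object name EmitOnChange and the helper dedup are hypothetical and only model the idea of remembering the last emitted (pv, uv) pair.

```scala
// Emit a (pv, uv) snapshot only when it differs from the last emitted one.
object EmitOnChange {
  /** Hypothetical helper: returns the snapshots that would actually be written out. */
  def dedup(snapshots: Seq[(Int, Int)]): Seq[(Int, Int)] = {
    val start = (Vector.empty[(Int, Int)], Option.empty[(Int, Int)])
    val (emitted, _) = snapshots.foldLeft(start) {
      // Same as the last emitted pair: emit nothing, keep the remembered state.
      case ((out, Some(last)), cur) if last == cur => (out, Some(last))
      // First snapshot or a changed pair: emit it and remember it.
      case ((out, _), cur) => (out :+ cur, Some(cur))
    }
    emitted
  }

  def main(args: Array[String]): Unit = {
    // One snapshot per second; only three of the five differ from their predecessor.
    val perSecond = Seq((1, 1), (1, 1), (2, 1), (2, 1), (3, 2))
    println(dedup(perSecond)) // Vector((1,1), (2,1), (3,2))
  }
}
```

In the Flink job further down, the same role is played by per-key state, so the comparison happens separately for each key.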

Types and operations on Flink data streams

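DataStream is the core data structure of Flink stream processing. Every other kind of stream can be converted to or from a DataStream, directly or indirectly. The conversion relationships between the commonly used stream types are as follows.

A DataStream and a KeyedStream can be converted into each other, a KeyedStream can be converted into a WindowedStream, a DataStream cannot be converted into a WindowedStream directly, and a WindowedStream can be converted straight back into a DataStream. Not every pair of stream types has a direct conversion, but any stream can reach any other by going through DataStream first.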
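
The conversions described above can be made visible by annotating each step with its type. This is a small sketch against the Flink 1.7 Scala DataStream API used later in this article; the object name, the sample elements, and the five-second window are assumptions chosen only for illustration.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object StreamTypesSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // DataStream: the starting point.
    val events: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 1), ("a", 1))

    // DataStream -> KeyedStream via keyBy.
    val keyed: KeyedStream[(String, Int), String] = events.keyBy(_._1)

    // KeyedStream -> DataStream: in the Scala API a KeyedStream is a DataStream.
    val plain: DataStream[(String, Int)] = keyed

    // KeyedStream -> WindowedStream via a window assigner.
    val windowed: WindowedStream[(String, Int), String, TimeWindow] =
      keyed.timeWindow(Time.seconds(5))

    // WindowedStream -> DataStream via an aggregation.
    val counts: DataStream[(String, Int)] = windowed.sum(1)

    // A bounded source like fromElements may finish before the window fires,
    // so this job is about the types rather than the printed output.
    counts.print()
    env.execute("stream-types-sketch")
  }
}
```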

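This pv/uv job mainly uses the DataStream, KeyedStream, and WindowedStream types.

It also needs windows and watermarks. Windows split the data by day. The watermark tells Flink how far event time has advanced, so that the state of windows that have passed can be purged and data arriving later than the allowed delay is dropped, which keeps memory usage bounded.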
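
The arithmetic behind day windows and a bounded-out-of-orderness watermark can be sketched in plain Scala. The object and method names below are hypothetical. The sketch assumes non-negative epoch-millisecond timestamps and windows aligned to the epoch (that is, to UTC days), and it ignores allowed lateness.

```scala
object WindowMath {
  val DayMs: Long = 24L * 60 * 60 * 1000

  /** Start of the tumbling window that contains `ts` (assumes ts >= 0, epoch-aligned windows). */
  def windowStart(ts: Long, size: Long = DayMs): Long = ts - (ts % size)

  /** Bounded-out-of-orderness watermark: largest timestamp seen so far minus the allowed delay. */
  def watermark(maxSeenTs: Long, maxOutOfOrderness: Long): Long = maxSeenTs - maxOutOfOrderness

  /** A window covers [start, start + size); it is complete once the watermark reaches its last millisecond. */
  def windowComplete(start: Long, wm: Long, size: Long = DayMs): Boolean = wm >= start + size - 1

  def main(args: Array[String]): Unit = {
    val ts = 90000000L                       // one hour into the second epoch day
    val start = windowStart(ts)              // 86400000
    val wm = watermark(ts, DayMs)            // one-day delay, as configured in this article
    println((start, wm, windowComplete(start, wm)))
  }
}
```

With a one-day out-of-orderness bound, as configured in the job below, a day window is only considered complete about one day after it ends, so its state stays in memory for roughly two days.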

Business code

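Our data is JSON and contains the three fields date, helperversion, and guid. For real-time pv/uv counting the other fields can be discarded, although in the offline data warehouse every meaningful business field is kept in Hive.
Related concepts will be covered in a separate post; here we go straight to the code, starting with the Maven pom.xml.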

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ddxygq</groupId>
    <artifactId>bigdata</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version>2.11.8</scala.version>
        <flink.version>1.7.0</flink.version>
        <pkg.name>bigdata</pkg.name>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.8 -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- fastjson is imported by the Scala code below but was not listed in the original pom; the version here is an assumption -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.83</version>
        </dependency>
    </dependencies>

    <build>
        <!-- test sources and files -->
        <!--<testSourceDirectory>${basedir}/src/test</testSourceDirectory>-->
        <finalName>${pkg.name}</finalName>
        <sourceDirectory>src/main/java</sourceDirectory>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>*.properties</include>
                    <include>*.xml</include>
                </includes>
                <filtering>false</filtering>
            </resource>
        </resources>
        <plugins>
            <!-- plugin configuration that skips tests -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
            <!-- plugin that compiles Scala sources -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

The main code, written mostly in Scala:

package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._

/**
  * @ Author: keguang
  * @ Date: 2019/3/18 17:34
  * @ version: v1.0.0
  * @ description: 
  */
object PvUvCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/flink/tagApp"))

    // Kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // watermark: maximum delay allowed for out-of-order data (one day, in milliseconds)
    val MaxOutOfOrderness = 86400 * 1000L
    
    // consume data from Kafka
    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })
    // Everything above assigns watermarks and parses the JSON
    // Aggregate the data in each window; the applyWith method and the OnWindowedStream class are worth a closer look
    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", List.empty[Int], Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, list, set, _, _), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            (date + "_" + helperversion, guid.hashCode +: list, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, list, set, _, _) => {
                (leixing, list.size, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long),(Int, Int)]{
      case ((key, numpv, numuv, begin, end), curr) =>

        curr match {
          case Some(numCurr) if numCurr == (numuv, numpv) =>
            (Seq.empty, Some((numuv, numpv))) // same values as last time: emit nothing
          case _ =>
            (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
        }
    }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })

    resultedStream.print()
    env.execute("PvUvCount")

  }
}

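The size of the List holds the pv and the size of the Set holds the uv, which gives real-time pv and uv counts.
Several key functions are used here:
applyWith: takes an initial state value, a foldFunction, and a windowFunction;
flatMapWithState: remembers the last emitted (uv, pv) pair for each key and emits a record only when the pair changes.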
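
To see what the fold and window functions compute, the same accumulation can be run over an ordinary collection. This is a plain-Scala sketch, not Flink code; the object name PvUvFold and the sample events are hypothetical, and the accumulator is trimmed to the three fields that matter.

```scala
object PvUvFold {
  type Event = (String, String, String)      // (date, helperversion, guid)
  type Acc   = (String, List[Int], Set[Int]) // (key, one entry per view, distinct guid hashes)

  val zero: Acc = ("", List.empty[Int], Set.empty[Int])

  /** Counterpart of foldFunction: add one event to the accumulator. */
  def fold(acc: Acc, e: Event): Acc = {
    val (_, list, set) = acc
    val (date, version, guid) = e
    (date + "_" + version, guid.hashCode +: list, set + guid.hashCode)
  }

  /** Counterpart of windowFunction: turn the accumulator into (key, pv, uv). */
  def result(acc: Acc): (String, Int, Int) = (acc._1, acc._2.size, acc._3.size)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      ("2021-06-10", "v1", "u1"),
      ("2021-06-10", "v1", "u2"),
      ("2021-06-10", "v1", "u1")) // repeat visit: counts toward pv only
    println(result(events.foldLeft(zero)(fold))) // (2021-06-10_v1,3,2)
  }
}
```

Because the Set stores guid.hashCode rather than the guid itself, two different guids with the same hash would be counted once, so the uv is an approximation in that rare case.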

Remaining problems

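Clearly, when the data volume is large, the List and the Set become very large. The pv, at least, does not need a List: it can be kept in a single state variable that is incremented for each record, which amounts to a state update.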
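
The counter idea can be sketched in plain Scala as follows. The names are hypothetical and a plain Long is used as the counter, which is an assumption made for simplicity. The pv then costs constant space, while the Set of guid hashes still grows with the number of distinct visitors.

```scala
object PvCounterFold {
  type Event = (String, String, String) // (date, helperversion, guid)
  type Acc   = (String, Long, Set[Int]) // (key, pv counter, distinct guid hashes)

  val zero: Acc = ("", 0L, Set.empty[Int])

  /** Add one event: bump the counter instead of growing a list. */
  def fold(acc: Acc, e: Event): Acc = {
    val (_, pv, set) = acc
    val (date, version, guid) = e
    (date + "_" + version, pv + 1, set + guid.hashCode)
  }

  /** Returns (key, pv, uv) for a batch of events. */
  def count(events: Seq[Event]): (String, Long, Int) = {
    val (key, pv, set) = events.foldLeft(zero)(fold)
    (key, pv, set.size)
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      ("2021-06-10", "v1", "u1"),
      ("2021-06-10", "v1", "u2"),
      ("2021-06-10", "v1", "u1"))
    println(count(events)) // (2021-06-10_v1,3,2)
  }
}
```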

Improved version

This version uses a counter to hold the pv value.

package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem

object PvUv2 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/streaming/counter"))

    // Kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // watermark: maximum delay allowed for out-of-order data (one day, in milliseconds)
    val MaxOutOfOrderness = 86400 * 1000L

    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })

    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", new IntCounter(), Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, cou, set, _, _), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            cou.add(1)
            (date + "_" + helperversion, cou, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, cou, set, _, _) => {
                (leixing, cou.getLocalValue, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long),(Int, Int)]{
      case ((key, numpv, numuv, begin, end), curr) =>

        curr match {
          case Some(numCurr) if numCurr == (numuv, numpv) =>
            (Seq.empty, Some((numuv, numpv))) // same values as last time: emit nothing
          case _ =>
            (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
        }
    }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })

    val resultPath = "D:\\space\\IJ\\bigdata\\src\\main\\scala\\com\\ddxygq\\bigdata\\flink\\streaming\\pvuv\\result"
    resultedStream.writeAsText(resultPath, FileSystem.WriteMode.OVERWRITE)
    env.execute("PvUvCount")

  }
}

