Window Operations in Spark Streaming

A window operation works on a DStream: a window of configurable length slides forward over the stream at a configurable rate, and the operator named by the specific window function is applied to the data that falls inside the current window.

Note that both the window length and the slide interval must be integer multiples of the batch interval.
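A minimal sketch of that constraint (the socket source and port are placeholders; only the durations matter here): with a 2-second batch interval, an 8-second window with a 4-second slide is valid, while a 5-second window is not a multiple of the batch interval and Spark rejects it when the windowed DStream is created.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowConfigSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(2))      // batch interval: 2 seconds

    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder input DStream

    // Valid: 8s window, 4s slide -- both are multiples of the 2s batch interval
    lines.window(Seconds(8), Seconds(4)).print()

    // Invalid: a 5s window is not a multiple of the 2s batch interval;
    // Spark throws an exception when this windowed DStream is constructed.
    // lines.window(Seconds(5), Seconds(4)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}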
Add the pom dependencies

    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-streams</artifactId>
      <version>2.0.0</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.5</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.4.5</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.4.5</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.5</version>
    </dependency>

    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.6.6</version>
    </dependency>

1. window: output the elements in the window

window(windowLength, slideInterval)
Called on a DStream, this operation takes a window length and a slide interval and returns a new DStream built from the elements that fall inside the current window.

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWindowDemo {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval set to 2 seconds

    streamingContext.checkpoint("checkpoint")

    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams) //SparkKafkaDemo is a Kafka topic
    )

    val numStream: DStream[(String, Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      .map((_, 1))
      //8-second window sliding every 6 seconds; the slide interval must be a multiple of the batch interval
      .window(Seconds(8),Seconds(6))
    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
java
scala
# output:
(java,1)
(java,1)
(scala,1)

2. countByWindow: count the elements in the window

countByWindow(windowLength, slideInterval)

Returns the number of elements in a window of the given length.

It only counts; it carries no per-element logic, and the output is a plain number.

Note: a checkpoint directory must be set.

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWindowDemo2 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval (collection interval) set to 2 seconds
    //set the checkpoint directory
    streamingContext.checkpoint("file:\\D:\\test\\checkpoint")

    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
    )

    val numStream: DStream[Long] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      .map((_, 1))
      //8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
      //number of words seen within the 8-second window
      //countByWindow returns the number of elements in the window
      .countByWindow(Seconds(8),Seconds(4))
    numStream.print()   //prints a plain number

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
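The checkpoint is required because countByWindow maintains its count incrementally across batches. Conceptually (a sketch of the idea rather than the exact Spark internals; it reuses the kafkaStream defined in the example above), it behaves as if every element were mapped to 1L and the window total were kept up to date by adding the batches that enter the window and subtracting the ones that leave it:

// Sketch: roughly what countByWindow(Seconds(8), Seconds(4)) computes.
// The inverse function (_ - _) makes the computation stateful across batches,
// and stateful window operations need a checkpoint directory.
val countSketch: DStream[Long] = kafkaStream
  .flatMap(line => line.value().split("\\s+"))
  .map(_ => 1L)
  .reduceByWindow(_ + _, _ - _, Seconds(8), Seconds(4))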

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
java
scala
# output:
3

3. countByValueAndWindow: count occurrences of each distinct element

countByValueAndWindow(windowLength, slideInterval, [numTasks])

Counts, within the current window, how many times each distinct element value occurs.

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}


object SparkWindowDemo3 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval (collection interval) set to 2 seconds
    //set the checkpoint directory
    streamingContext.checkpoint("file:\\D:\\test\\checkpoint")


    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
    )

    val numStream: DStream[(String, Long)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      //8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
      //count of each word seen within the 8-second window
      //countByValueAndWindow counts how many times each distinct value occurs in the current window
      .countByValueAndWindow(Seconds(8),Seconds(4))
    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
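Like countByWindow, this operation keeps its per-key counts incrementally, which is why the checkpoint directory is set here as well. Conceptually (again a sketch of the idea rather than the exact Spark internals, reusing the kafkaStream defined above), it pairs every element with 1L and maintains the counts with an add and a subtract function:

// Sketch: roughly what countByValueAndWindow(Seconds(8), Seconds(4)) computes per key.
val valueCountSketch: DStream[(String, Long)] = kafkaStream
  .flatMap(line => line.value().split("\\s+"))
  .map((_, 1L))
  .reduceByKeyAndWindow((x: Long, y: Long) => x + y, (x: Long, y: Long) => x - y, Seconds(8), Seconds(4))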

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
java
scala
# output:
(java,2)
(scala,1)

4. reduceByWindow: reduce the elements in the window

reduceByWindow(func, windowLength, slideInterval)
On the calling DStream, first gathers the elements of the current window into a new DStream, then reduces those elements with func.

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWindowDemo4 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval (collection interval) set to 2 seconds

    //streamingContext.checkpoint("file:\\D:\\test\\checkpoint")


    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
    )

    val numStream: DStream[String] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      //8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
      //within the 8-second window, concatenate all incoming data: _+":"+_
      .reduceByWindow(_+":"+_,Seconds(8),Seconds(4))
    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
spark
scala
java
# output:
java:spark:scala:java

5. reduceByKeyAndWindow: reduceByKey over the window

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

reduceByKeyAndWindow computes over all the data that falls inside the window of the calling DStream. The operation takes an optional parallelism argument.
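The optional [numTasks] argument corresponds to an overload that takes the number of partitions for the shuffle. A minimal sketch (pairs is assumed to be a DStream[(String, Int)] such as the one built in the example below; the partition count 4 is arbitrary):

// Same windowed word count, but shuffled into 4 partitions.
val counted: DStream[(String, Int)] =
  pairs.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(8), Seconds(2), 4)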

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWindowDemo5 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval (collection interval) set to 2 seconds

    //streamingContext.checkpoint("file:\\D:\\test\\checkpoint")

    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
    )

    val numStream: DStream[(String,Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      .map((_,1))
      //8-second window sliding every 2 seconds; the slide interval must be a multiple of the batch interval
      //word count (reduce) over the 8-second window
      .reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},Seconds(8),Seconds(2))
    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
java
scala
scala
spark
# output:
(spark,1)
(scala,2)
(java,2)

6. reduceByKeyAndWindow: incremental reduceByKey over the data entering and leaving the window

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) differs from the previous operation in the extra function invFunc. func plays the same role as in the previous reduceByKeyAndWindow, while invFunc handles the RDDs that slide out of the window.
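Concretely, with a 2-second batch and an 8-second window sliding every 2 seconds, each new window result is derived incrementally from the previous one rather than being recomputed from all eight seconds of data:

new count per key = previous count + (counts from the batch entering the window, combined with func) - (counts from the batch leaving the window, removed with invFunc)

So if "java" occurred twice in a batch that has just aged out of the window and does not occur in the entering batch, its count drops from 2 to 0 instead of being recomputed from scratch. Because this incremental computation carries state from one batch to the next, this variant requires a checkpoint directory.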

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkWindowDemo6 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")

    val streamingContext = new StreamingContext(conf,Seconds(2))   //batch interval (collection interval) set to 2 seconds

    //the invFunc variant of reduceByKeyAndWindow keeps state across batches, so a checkpoint directory is required
    streamingContext.checkpoint("file:\\D:\\test\\checkpoint")

    val kafkaParams: Map[String, String] = Map(
      (ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
      (ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
      (ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
    )

    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      streamingContext,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
    )

    val numStream: DStream[(String,Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
      .map((_,1))
      //8-second window sliding every 2 seconds; the slide interval must be a multiple of the batch interval
      //word count (reduce) over the 8-second window
      //.reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},Seconds(8),Seconds(2))
      .reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},(x:Int,y:Int)=>{x-y},Seconds(8),Seconds(2))

    numStream.print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

Start a console producer to test:

kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# input:
java
java
scala

# output:
-------------------------------------------
Time: 1608800912000 ms
-------------------------------------------
(java,1)

-------------------------------------------
Time: 1608800914000 ms
-------------------------------------------
(java,2)

-------------------------------------------
Time: 1608800916000 ms
-------------------------------------------
(scala,1)
(java,2)

-------------------------------------------
Time: 1608800918000 ms
-------------------------------------------
(scala,1)
(java,2)

-------------------------------------------
Time: 1608800920000 ms
-------------------------------------------
(scala,1)
(java,1)

-------------------------------------------
Time: 1608800922000 ms
-------------------------------------------
(scala,1)
(java,0)

-------------------------------------------
Time: 1608800924000 ms
-------------------------------------------
(scala,0)
(java,0)
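The (scala,0) and (java,0) entries linger because invFunc only subtracts counts; the keys themselves are not removed once their count reaches zero. reduceByKeyAndWindow with an inverse function also accepts an optional filter function that drops such keys from the result. A minimal sketch of the change to the example above (only the filterFunc argument is new; it keeps a key only while its count stays positive):

      .reduceByKeyAndWindow(
        (x: Int, y: Int) => x + y,          // add counts from batches entering the window
        (x: Int, y: Int) => x - y,          // subtract counts from batches leaving the window
        Seconds(8),
        Seconds(2),
        filterFunc = (kv: (String, Int)) => kv._2 > 0  // drop keys whose count fell to 0
      )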
