Window Operations in Spark Streaming
A window operation slides a window of configurable length over a DStream at a configurable rate, and applies the chosen operator to the batch of data that currently falls inside the window.
Note that both the window length and the slide interval must be integer multiples of the batch interval.
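For example, the sketch below (assuming a local StreamingContext and a socket source, both chosen purely for illustration) uses a 2-second batch interval, so an 8-second window sliding every 6 seconds is valid, while durations that are not multiples of 2 seconds are rejected by Spark:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(2))       // batch interval = 2 seconds
    val lines = ssc.socketTextStream("localhost", 9999)    // placeholder input DStream
    val windowed = lines.window(Seconds(8), Seconds(6))    // both multiples of the 2-second batch interval
    // lines.window(Seconds(7), Seconds(3)) would throw: not multiples of the batch interval
    windowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}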
Add the pom dependencies
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.6.6</version>
</dependency>
1. window: output the contents of the window
window(windowLength, slideInterval)
Called on a DStream with a window length and a slide interval, this operation returns a new DStream made up of the elements that fall inside the current window.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval set to 2 seconds
streamingContext.checkpoint("checkpoint")
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams) // SparkKafkaDemo is a Kafka topic
)
val numStream: DStream[(String, Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
.map((_, 1))
// 8-second window sliding every 6 seconds; the slide interval (step) must be a multiple of the batch interval
.window(Seconds(8),Seconds(6))
numStream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
java
scala
# Output:
(java,1)
(java,1)
(scala,1)
2. countByWindow: count the elements in the window
countByWindow(windowLength, slideInterval)
Returns the number of elements in a window of the given length.
It only counts, with no further logic; the output is a plain number.
Note: a checkpoint directory must be set.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo2 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval (collection interval) set to 2 seconds
// set checkpoint (required by countByWindow)
streamingContext.checkpoint("file:\\D:\\test\\checkpoint")
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
)
val numStream: DStream[Long] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
.map((_, 1))
// 8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
// countByWindow returns the number of elements that appeared in the window
.countByWindow(Seconds(8),Seconds(4))
numStream.print() // prints a plain number
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
java
scala
# Output:
3
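The checkpoint requirement follows from how countByWindow works: in effect it maps every element to 1 and then performs an incremental reduceByWindow with an inverse function. The sketch below is an assumed equivalent, with words standing in for the DStream[String] produced by flatMap above:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Assumed equivalent of countByWindow(Seconds(8), Seconds(4)); the inverse function
// (subtracting the batch that leaves the window) is what makes checkpointing mandatory.
def windowCount(words: DStream[String]): DStream[Long] =
  words.map(_ => 1L)
    .reduceByWindow((a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b, Seconds(8), Seconds(4))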
3. countByValueAndWindow: count occurrences of each distinct value
countByValueAndWindow(windowLength, slideInterval, [numTasks])
Counts, for the current window, how many times each distinct value occurs.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo3 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval (collection interval) set to 2 seconds
// set checkpoint (required by countByValueAndWindow)
streamingContext.checkpoint("file:\\D:\\test\\checkpoint")
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
)
val numStream: DStream[(String, Long)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
// 8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
// countByValueAndWindow counts how many times each distinct value occurs in the current window
.countByValueAndWindow(Seconds(8),Seconds(4))
numStream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
java
scala
# Output:
(java,2)
(scala,1)
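countByValueAndWindow is closely related: conceptually it pairs each value with 1, performs an incremental windowed reduce per key, and keeps only non-zero counts, which is again why the checkpoint directory is needed. A sketch under that assumption, with words standing in for the DStream of split words:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Assumed equivalent of countByValueAndWindow(Seconds(8), Seconds(4)) for a DStream[String].
def valueCounts(words: DStream[String]): DStream[(String, Long)] =
  words.map((_, 1L))
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b, Seconds(8), Seconds(4))
    .filter(_._2 != 0L)   // drop keys whose windowed count has fallen to zero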
4. reduceByWindow: reduce the elements in the window
reduceByWindow(func, windowLength, slideInterval)
First applies the window to the calling DStream, then runs reduce over the windowed elements to produce a new DStream.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo4 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval (collection interval) set to 2 seconds
//streamingContext.checkpoint("file:\\D:\\test\\checkpoint")
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
)
val numStream: DStream[String] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
// 8-second window sliding every 4 seconds; the slide interval must be a multiple of the batch interval
// concatenate everything that appears within the 8-second window: _+":"+_
.reduceByWindow(_+":"+_,Seconds(8),Seconds(4))
numStream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
spark
scala
java
# Output:
java:spark:scala:java
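The reduce function passed to reduceByWindow should be associative and commutative so the window can be computed in parallel. There is also an incremental variant that additionally takes an inverse function; since string concatenation has no useful inverse, the sketch below sums word lengths instead (words is a stand-in name) and, like the other inverse-function operations, it needs checkpointing:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Incremental reduceByWindow sketch: total number of characters seen in the 8-second window.
def totalChars(words: DStream[String]): DStream[Int] =
  words.map(_.length)
    .reduceByWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, Seconds(8), Seconds(4))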
5. reduceByKeyAndWindow: reduceByKey over the window
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
The computation is applied per key to all of the data that falls within the DStream's window. The operation takes an optional argument that sets the number of parallel tasks.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo5 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval (collection interval) set to 2 seconds
//streamingContext.checkpoint("file:\\D:\\test\\checkpoint")
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
)
val numStream: DStream[(String,Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
.map((_,1))
// 8-second window sliding every 2 seconds; the slide interval must be a multiple of the batch interval
// word count (reduceByKey) over the 8-second window
.reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},Seconds(8),Seconds(2))
numStream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
java
scala
scala
spark
# Output:
(spark,1)
(scala,2)
(java,2)
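The optional numTasks argument in the signature controls how many tasks are used for the shuffle behind the windowed reduce. A sketch with an arbitrary value of 4, where pairs stands in for the (word, 1) DStream built above:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Same windowed word count, explicitly requesting 4 reduce tasks (value chosen for illustration).
def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(8), Seconds(2), 4)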
6. reduceByKeyAndWindow: incremental reduceByKey over data entering and leaving the window
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
The difference from the previous form is the extra invFunc argument. func plays the same role as before and is applied to the data entering the window, while invFunc is the inverse function applied to the RDDs that leave the window.
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkWindowDemo6 {
def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("SparkWindowDemo").setMaster("local[*]")
val streamingContext = new StreamingContext(conf,Seconds(2)) // batch interval (collection interval) set to 2 seconds
streamingContext.checkpoint("file:\\D:\\test\\checkpoint") // checkpointing is required when an inverse reduce function is used
val kafkaParams: Map[String, String] = Map(
(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.136.20:9092"),
(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"),
(ConsumerConfig.GROUP_ID_CONFIG -> "kafkaGroup2")
)
val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe(Set("SparkKafkaDemo"), kafkaParams)
)
val numStream: DStream[(String,Int)] = kafkaStream.flatMap(line => line.value().toString.split("\\s+"))
.map((_,1))
// 8-second window sliding every 2 seconds; the slide interval must be a multiple of the batch interval
// incremental word count over the 8-second window; the non-incremental version from section 5 was:
//.reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},Seconds(8),Seconds(2))
// the first function adds the batch entering the window, the second (inverse) function subtracts the batch leaving it
.reduceByKeyAndWindow((x:Int,y:Int)=>{x+y},(x:Int,y:Int)=>{x-y},Seconds(8),Seconds(2))
numStream.print()
streamingContext.start()
streamingContext.awaitTermination()
}
}
Start a console producer and send some test messages:
kafka-console-producer.sh --topic SparkKafkaDemo --broker-list 192.168.136.20:9092
# Input:
java
java
scala
# Output:
-------------------------------------------
Time: 1608800912000 ms
-------------------------------------------
(java,1)
-------------------------------------------
Time: 1608800914000 ms
-------------------------------------------
(java,2)
-------------------------------------------
Time: 1608800916000 ms
-------------------------------------------
(scala,1)
(java,2)
-------------------------------------------
Time: 1608800918000 ms
-------------------------------------------
(scala,1)
(java,2)
-------------------------------------------
Time: 1608800920000 ms
-------------------------------------------
(scala,1)
(java,1)
-------------------------------------------
Time: 1608800922000 ms
-------------------------------------------
(scala,1)
(java,0)
-------------------------------------------
Time: 1608800924000 ms
-------------------------------------------
(scala,0)
(java,0)
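The (java,0) and (scala,0) entries at the end appear because the inverse function only subtracts counts; it never removes a key from the window state. The full overload of reduceByKeyAndWindow also accepts a filter function that drops such entries. A sketch under that assumption, with pairs standing in for the (word, 1) DStream and an arbitrary partition count of 2:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Incremental windowed word count that filters out keys whose count has dropped to zero.
def windowedCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,          // add the batch entering the window
    (x: Int, y: Int) => x - y,          // subtract the batch leaving the window
    Seconds(8), Seconds(2),
    2,                                  // numPartitions (arbitrary)
    (kv: (String, Int)) => kv._2 > 0    // keep only keys still present in the window
  )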