Spark Streaming: Kafka sources
1. Receiver mode
This mode only exists in the old 0.8 Kafka integration; it was removed in the later versions of the integration.
package day10

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @author yangkun
 * @date 2020/11/6 16:55
 * @version 1.0
 */
object spark01_receive {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    val conf: SparkConf = new SparkConf().setAppName("SparkStreaming02_RDDQueue").setMaster("local[*]")
    // Create the StreamingContext with a 3-second batch interval
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // Connect to Kafka (ZooKeeper quorum, consumer group, topic -> receiver thread count) and create the DStream
    val kafkaDstream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
      ssc,
      "hadoop100:2181",
      "yk_test",
      Map("a" -> 1)
    )
    // Each Kafka record is a (key, value) pair; we only need the value
    val lineDS: DStream[String] = kafkaDstream.map(_._2)
    // Flatten each line into words
    val flatMapDS: DStream[String] = lineDS.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the counts per word
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the result
    reduceDS.print()
    // Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
2. Direct mode
2.1 Automatically maintained offsets
Offsets are maintained automatically and stored in the checkpoint. In the version below we only set the checkpoint directory: offsets are written into the checkpoint but never read back from it on restart, so messages can be lost.
package day10

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @author yangkun
 * @date 2020/11/6 20:50
 * @version 1.0
 * Offsets are maintained automatically and stored in the checkpoint.
 * This version only sets the checkpoint directory: offsets are written into the checkpoint
 * but never read back from it, so messages can be lost.
 */
object spark02_direct_auto1 {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    val conf: SparkConf = new SparkConf().setAppName("SparkStreaming02_RDDQueue").setMaster("local[*]")
    // Create the StreamingContext with a 3-second batch interval
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // Set the checkpoint directory
    ssc.checkpoint("cp")
    // Kafka parameters
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hdp4.buptnsrc.com:6667",
      ConsumerConfig.GROUP_ID_CONFIG -> "yk_test"
    )
    // Create the direct stream on topic "a"; each record is a (key, value) pair
    val kafkaDStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("a"))
    // We only need the value part
    val lineDS: DStream[String] = kafkaDStream.map(_._2)
    // Flatten each line into words
    val flatMapDS: DStream[String] = lineDS.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the counts per word
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the result
    reduceDS.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
2.2 Automatically maintained offsets, recovering from the checkpoint
- Offsets are maintained automatically and stored in the checkpoint.
- The way the StreamingContext is obtained changes: it is first recovered from the checkpoint, and only created by a function if no checkpoint exists. This guarantees that no data is lost.
- Drawbacks:
  - 1. The checkpoint directory accumulates many small files.
  - 2. The checkpoint only records the timestamp of the last batch's offsets; when the program is restarted, every batch interval between that time and the current time is executed once.
package day10

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @author yangkun
 * @date 2020/11/6 20:50
 * @version 1.0
 * Desc: Consume a Kafka source through the Direct API.
 * Offsets are maintained automatically and stored in the checkpoint.
 * The StreamingContext is first recovered from the checkpoint, and only created by a function
 * if no checkpoint exists, so no data is lost.
 * Drawbacks:
 * 1. The checkpoint directory accumulates many small files.
 * 2. The checkpoint only records the timestamp of the last batch's offsets; on restart,
 *    every batch interval between that time and the current time is executed once.
 */
object spark02_direct_auto2 {
  def main(args: Array[String]): Unit = {
    // Recover the StreamingContext from the checkpoint, or create it if no checkpoint exists
    val ssc: StreamingContext = StreamingContext.getActiveOrCreate("cp", () => getStreamingContext())
    ssc.start()
    ssc.awaitTermination()
  }

  def getStreamingContext(): StreamingContext = {
    val conf: SparkConf = new SparkConf().setAppName("direct2").setMaster("local[*]")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // Set the checkpoint directory on the newly created context
    ssc.checkpoint("cp")
    // Kafka parameters
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hdp4.buptnsrc.com:6667",
      ConsumerConfig.GROUP_ID_CONFIG -> "yk_test"
    )
    // Create the direct stream on topic "a"; each record is a (key, value) pair
    val kafkaDStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("a"))
    // We only need the value part
    val lineDS: DStream[String] = kafkaDStream.map(_._2)
    // Flatten each line into words
    val flatMapDS: DStream[String] = lineDS.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the counts per word
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the result
    reduceDS.print()
    ssc
  }
}
2.3 Manually maintained offsets
package day10

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

/**
 * @author yangkun
 * @date 2020/11/6 21:20
 * @version 1.0
 */
object spark03_direct_handle {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    val conf: SparkConf = new SparkConf().setAppName("direct_handle").setMaster("local[*]")
    // Create the StreamingContext with a 3-second batch interval
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // Kafka parameters
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hdp4.buptnsrc.com:6667",
      ConsumerConfig.GROUP_ID_CONFIG -> "yk_test"
    )
    // Position (offset) where the previous run stopped consuming.
    // In a real project, to guarantee exactly-once semantics, the offsets are saved to a
    // transactional store such as MySQL after each batch has been processed, and read back here.
    val offsets = Map(TopicAndPartition("a", 0) -> 10L)
    // Create the direct stream starting from the given offsets; the message handler keeps only the value
    val kafkaDStream: InputDStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
      ssc,
      kafkaParams,
      offsets,
      (m: MessageAndMetadata[String, String]) => m.message()
    )
    // After each batch has been consumed, capture the offsets so they can be updated
    var offsetRanges = Array.empty[OffsetRange]
    kafkaDStream.transform(
      // The underlying RDD is a KafkaRDD; the class is private but implements the HasOffsetRanges
      // trait, so cast the RDD to HasOffsetRanges to obtain its offset ranges
      rdd => {
        offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd
      }
    ).foreachRDD {
      rdd => {
        for (o <- offsetRanges) {
          println(o.topic, o.partition, o.fromOffset, o.untilOffset)
        }
      }
    }
    // Flatten each message into words
    val flatMapDS: DStream[String] = kafkaDStream.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate the counts per word
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the result
    reduceDS.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
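The comment in the code above says that in a real project the offsets are kept in a transactional store such as MySQL rather than hard-coded. Below is a minimal sketch of that idea, assuming a hypothetical table kafka_offsets(group_id, topic, partition_id, untilOffset) with a primary key on (group_id, topic, partition_id) and plain JDBC; the table name, connection URL and credentials are illustrative and not part of the original code. The mysql-connector-java dependency in the pom below is sufficient to run it.

import java.sql.DriverManager

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.OffsetRange

object MysqlOffsetStore {
  // Illustrative connection settings -- replace with your own environment
  private val url = "jdbc:mysql://localhost:3306/streaming"
  private val user = "root"
  private val password = "123456"

  // Read the offsets previously saved for a consumer group, to build the
  // fromOffsets map passed to KafkaUtils.createDirectStream
  def readOffsets(group: String): Map[TopicAndPartition, Long] = {
    val conn = DriverManager.getConnection(url, user, password)
    try {
      val stmt = conn.prepareStatement(
        "SELECT topic, partition_id, untilOffset FROM kafka_offsets WHERE group_id = ?")
      stmt.setString(1, group)
      val rs = stmt.executeQuery()
      val offsets = scala.collection.mutable.Map[TopicAndPartition, Long]()
      while (rs.next()) {
        offsets.put(TopicAndPartition(rs.getString(1), rs.getInt(2)), rs.getLong(3))
      }
      offsets.toMap
    } finally conn.close()
  }

  // Save the offset ranges of a processed batch in a single transaction
  // (REPLACE INTO acts as an upsert thanks to the primary key)
  def saveOffsets(group: String, ranges: Array[OffsetRange]): Unit = {
    val conn = DriverManager.getConnection(url, user, password)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement(
        "REPLACE INTO kafka_offsets(group_id, topic, partition_id, untilOffset) VALUES (?, ?, ?, ?)")
      for (o <- ranges) {
        stmt.setString(1, group)
        stmt.setString(2, o.topic)
        stmt.setInt(3, o.partition)
        stmt.setLong(4, o.untilOffset)
        stmt.addBatch()
      }
      stmt.executeBatch()
      conn.commit()
    } finally conn.close()
  }
}

With a helper like this, the hard-coded Map(TopicAndPartition("a", 0) -> 10L) would be replaced by MysqlOffsetStore.readOffsets("yk_test"), and saveOffsets would be called inside foreachRDD after the records of the batch have been processed; for true exactly-once semantics the business writes would share that same transaction.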
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>blibliSpark</artifactId>
        <groupId>com.bupt</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.bupt</groupId>
    <artifactId>sparkStreaming</artifactId>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.26</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>
</project>