SparkStreaming+Kafka

Spark Streaming can be integrated with Kafka in two ways: a Receiver-based approach and a Direct (receiver-less) approach.

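Receiver approach: a Receiver running inside the Spark executors receives the data from Kafka.
Direct approach: instead of using a receiver, this approach periodically queries Kafka for the latest offset of each topic+partition and uses it to define the offset range to process in each batch. When the job that processes the data is launched, the Kafka consumer API reads the defined offset ranges from Kafka (similar to reading files from a file system).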
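
To make the offset-range idea concrete, here is a minimal sketch in plain Scala with no Spark or Kafka dependency. The names TopicPartition, OffsetRange and DirectBatchPlanner are hypothetical and defined only for this illustration; they are not Spark's own classes. Starting an unseen partition at offset 0 is also a simplifying assumption.

```scala
// Hypothetical model of how one batch could be planned in the Direct approach.
// These case classes are local to this sketch, not Spark's or Kafka's classes.
final case class TopicPartition(topic: String, partition: Int)

final case class OffsetRange(tp: TopicPartition, fromOffset: Long, untilOffset: Long) {
  // Number of messages the batch would read from this partition
  def count: Long = untilOffset - fromOffset
}

object DirectBatchPlanner {
  /** For each partition, read from the last processed offset up to the latest offset (exclusive). */
  def planBatch(
      processed: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Seq[OffsetRange] =
    latest.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, until) =>
        // Assumption: a partition never seen before starts at offset 0
        val from = processed.getOrElse(tp, 0L)
        OffsetRange(tp, from, math.max(from, until))
      }

  def main(args: Array[String]): Unit = {
    val p0 = TopicPartition("test", 0)
    val p1 = TopicPartition("test", 1)
    // p0 has 15 new messages, p1 has none
    planBatch(Map(p0 -> 10L, p1 -> 5L), Map(p0 -> 25L, p1 -> 5L)).foreach(println)
  }
}
```

Each batch is then just a list of (partition, from, until) triples, which is why reading it resembles reading fixed byte ranges of a file.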

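Because the Direct approach has several advantages over the Receiver approach, such as simplified parallelism and higher efficiency, we use the Direct approach here.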

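For reference, this is the signature of the Receiver-based KafkaUtils.createStream. The Direct example below calls KafkaUtils.createDirectStream instead, which takes the same four type parameters but a Set[String] of topics and no storage level.

def createStream[K, V, U <: Decoder[_], T <: Decoder[_]](ssc: StreamingContext, kafkaParams: Map[String, String], topics: Map[String, Int], storageLevel: StorageLevel)(implicit arg0: ClassTag[K], arg1: ClassTag[V], arg2: ClassTag[U], arg3: ClassTag[T]): ReceiverInputDStream[(K, V)]
K: type of the Kafka message key, e.g. String
V: type of the Kafka message value
U: decoder for the Kafka message key, e.g. StringDecoder
T: decoder for the Kafka message value
ssc: the StreamingContext object
kafkaParams: Map of Kafka configuration parameters
topics: Map of the Kafka topics to consume (topic name -> number of partitions, each consumed in its own thread)
storageLevel: storage level, e.g. memory only or disk only
returns: DStream of (Kafka message key, Kafka message value)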

Writing the code:

Maven configuration (pom.xml)

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

DirectKafkaWordCount.scala

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    // Expected arguments: <brokers> <topics>, e.g. master:9092 test
    if (args.length < 2) {
      System.err.println("Usage: DirectKafkaWordCount <brokers> <topics>")
      System.exit(1)
    }
    val Array(brokers, topics) = args.take(2)

    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Set of Kafka topics to subscribe to; pass several topics as a comma-separated list
    val topicsSet = topics.split(",").toSet
    // Kafka parameters: the list of brokers
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    // Debug output: prints the DStream object itself (not its records), once at startup
    println("---------:" + messages)

    // Each record is a (key, value) pair: keep the value, split it into words, count each word
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
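
The counting steps in DirectKafkaWordCount (split each line on spaces, pair each word with 1, sum per word) can be tried on an ordinary collection without a cluster. The sketch below is only an illustration of what one batch computes; BatchWordCount and countWords are hypothetical names, and the filter that drops empty tokens is an addition that the streaming job above does not have.

```scala
object BatchWordCount {
  /** Counts words in one batch of lines: split on spaces, pair each word with 1, sum per word. */
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty) // addition: ignore empty tokens produced by repeated spaces
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    countWords(Seq("hello spark", "hello kafka")).toSeq.sortBy(_._1).foreach(println)
}
```

In the streaming job the same logic runs once per 5-second batch, with reduceByKey playing the role of groupBy plus sum.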

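Running the program:

Run the Spark Streaming program in IDEA, passing the program arguments master:9092 test (Kafka must already be running, so start it first if it is not)
Start the Kafka service
bin/kafka-server-start.sh config/server.properties
Have a producer generate data in real time
bin/kafka-console-producer.sh --broker-list master:9092 --topic test
In the IDEA console you can see Spark processing the data from Kafka in real time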

[Figure: kafka + sparkstreaming.gif]

SparkStreaming+Flume

Flume-style Push-based Approach

Push-based receiver: Spark Streaming sets up a receiver that acts as an Avro agent for Flume and works as an Avro sink, so Flume can push data to it.

Configure Flume: send the data to an Avro sink

cd /usr/local/flume-1.6.0-cdh5.11.1/conf/
vim flume-spark.conf

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/flume-1.6.0-cdh5.11.1/spool-test/spooldir
a1.sources.r1.channels = c1

a1.channels.c1.type = file
a1.channels.c1.useDualCheckpoints = true
a1.channels.c1.backupCheckpointDir = /usr/local/flume-1.6.0-cdh5.11.1/spool-test/channel_check_back
a1.channels.c1.checkpointDir = /usr/local/flume-1.6.0-cdh5.11.1/spool-test/channel_check
a1.channels.c1.dataDirs = /usr/local/flume-1.6.0-cdh5.11.1/spool-test/channel_data

a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 9998

Writing the Spark code

def createStream(ssc: StreamingContext, hostname: String, port: Int, storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2): ReceiverInputDStream[SparkFlumeEvent]
Create an input stream from a Flume source.

SparkFlume.scala

package com.lyl.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume.FlumeUtils

object SparkFlume {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Flume-Spark")
    val ssc = new StreamingContext(conf, Seconds(5))
    // FlumeUtils binds the receiver to a specific hostname and port,
    // which must match the Avro sink configured in Flume (a1.sinks.k1.hostname / port)
    val lines = FlumeUtils.createStream(ssc, "master", 9998)
    // Print the number of events received in each 5-second batch
    lines.count().map(x => "Received: " + x + " events").print()
    ssc.start()
    ssc.awaitTermination()
  }
}
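
The job above only counts events. To look at their contents, the body of each event has to be decoded first. Below is a minimal, dependency-free sketch of such a decoder; FlumeBody and bodyToString are hypothetical names, and the sketch assumes the bodies are UTF-8 text, which the Flume configuration above does not guarantee.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

object FlumeBody {
  /** Decodes an event body as UTF-8 text without moving the original buffer's position. */
  def bodyToString(body: ByteBuffer): String =
    StandardCharsets.UTF_8.decode(body.duplicate()).toString

  def main(args: Array[String]): Unit = {
    val body = ByteBuffer.wrap("hello flume".getBytes(StandardCharsets.UTF_8))
    println(bodyToString(body))
  }
}
```

Inside SparkFlume this could probably be applied as lines.map(e => FlumeBody.bodyToString(e.event.getBody)), assuming each SparkFlumeEvent exposes its Avro event body as a ByteBuffer; check this against the Spark version in use.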

pom.xml

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

Running the program

(1) Package the jar, put it in the Spark directory, and submit it:
bin/spark-submit --master local[2] --class com.lyl.spark.SparkFlume SparkStreamStudy.jar
(2) Start Flume
bin/flume-ng agent --conf conf --name a1 --conf-file conf/flume-spark.conf
(3) Copy files in real time into the directory defined earlier in the Flume configuration (a1.sources.r1.spoolDir = /usr/local/flume-1.6.0-cdh5.11.1/spool-test/spooldir).

[root@master spooldir]# ll
total 4
-rw-r--r-- 1 root root 7 Feb 12 02:16 test.txt
[root@master spooldir]# cp test.txt test2.txt
[root@master spooldir]# cp test.txt test3.txt
[root@master spooldir]# cp test.txt test4.txt

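The Spark console prints the number of events in real time.

[Although this approach is simple, it has no transaction support, which increases the chance of losing data when the worker node running the receiver fails. Moreover, if that node fails, the system tries to start the receiver at another location, and Flume then has to be reconfigured to send data to the new node, which is cumbersome. This is why we introduce the pull-based receiver.]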

Pull-based Approach using a Custom Sink

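Pull-based receiver: this receiver actively pulls data from a custom intermediate sink, while other processes use Flume to push data into that sink. Spark Streaming uses a reliable Flume receiver and transactions to pull and replicate data from the sink; the data stays in the sink until notification arrives that the transaction has completed successfully.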
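
The "data stays in the sink until the transaction succeeds" behaviour can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions (single consumer, one open transaction at a time); BufferingSink and PullDemo are hypothetical names and do not correspond to Flume's or Spark's actual sink classes.

```scala
/** Toy stand-in for the intermediate sink: events leave it only on commit. */
final class BufferingSink[A] {
  private var pending = Vector.empty[A]  // events not yet handed out
  private var inFlight = Vector.empty[A] // events handed out but not yet confirmed

  /** Flume side: push an event into the sink. */
  def push(event: A): Unit = pending = pending :+ event

  /** Receiver side: start a transaction and pull up to `max` events. */
  def take(max: Int): Vector[A] = {
    require(inFlight.isEmpty, "previous transaction still open")
    val (batch, rest) = pending.splitAt(max)
    inFlight = batch
    pending = rest
    batch
  }

  /** Receiver confirmed the data is stored and replicated: drop it from the sink. */
  def commit(): Unit = inFlight = Vector.empty

  /** Receiver failed: put the events back at the front, in their original order. */
  def rollback(): Unit = {
    pending = inFlight ++ pending
    inFlight = Vector.empty
  }

  /** Events still held by the sink, confirmed or not. */
  def buffered: Int = pending.size + inFlight.size
}

object PullDemo {
  def main(args: Array[String]): Unit = {
    val sink = new BufferingSink[String]
    Seq("e1", "e2", "e3").foreach(sink.push)
    sink.take(2); sink.rollback() // simulated receiver failure: nothing is lost
    println(sink.take(2))         // the same two events are delivered again
    sink.commit()
    println(sink.buffered)
  }
}
```

Because a failed pull is rolled back rather than dropped, a restarted receiver sees the same events again, which is the property the push-based approach lacks.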

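Note:

Reliable receiver: sends an acknowledgement to the source once the data has been received and confirmed as stored in Spark. It provides strong fault-tolerance guarantees and can ensure zero data loss.

Unreliable receiver: does not send acknowledgements to the source. Its advantage is that it is simple to implement.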

Author: Seven_Ki