Spark Streaming source code analysis 4: DStream-related APIs

Posted by 五柳-先生 on 2016-01-29

Original URL: https://blog.csdn.net/wuliusir/article/details/50606900

I. Operations that create an InputDStream (StreamingContext.scala)

1. Given a Receiver as the argument, create a ReceiverInputDStream; T is the type of the data the receiver receives

def receiverStream[T: ClassTag](receiver: Receiver[T]): ReceiverInputDStream[T] = {
    withNamedScope("receiver stream") {
      new PluggableInputDStream[T](this, receiver)
    }
  }

2. Create an Akka actor stream from the given arguments to receive data

def actorStream[T: ClassTag](  
      props: Props,  
      name: String,  
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2,  
      supervisorStrategy: SupervisorStrategy = ActorSupervisorStrategy.defaultStrategy  
    ): ReceiverInputDStream[T] = withNamedScope("actor stream") {  
    receiverStream(new ActorReceiver[T](props, name, storageLevel, supervisorStrategy))  
  }  

3. TCP socket

socketStream: converter is a function that turns the socket's InputStream into an iterator over elements of type T

def socketStream[T: ClassTag](  
      hostname: String,  
      port: Int,  
      converter: (InputStream) => Iterator[T],  
      storageLevel: StorageLevel  
    ): ReceiverInputDStream[T] = {  
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)  
  }  

socketTextStream: storageLevel defaults to MEMORY_AND_DISK_SER_2, and converter is fixed to SocketReceiver.bytesToLines, which reads the InputStream line by line and returns an iterator

def socketTextStream(  
      hostname: String,  
      port: Int,  
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2  
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {  
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)  
  }  
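To make the converter argument concrete, here is a minimal sketch of a function with the shape (InputStream) => Iterator[String]. The names ConverterSketch, linesConverter and demo are made up for illustration, and this is not the actual body of SocketReceiver.bytesToLines; it only shows what a line-based converter of that type could look like.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object ConverterSketch {
  // Hypothetical converter: wraps the stream in a reader and yields one
  // line per element, lazily, until readLine() signals end of stream.
  def linesConverter(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }

  // Feeds the converter from an in-memory stream instead of a real socket.
  def demo(): List[String] = {
    val in = new ByteArrayInputStream("a\nb\nc\n".getBytes(StandardCharsets.UTF_8))
    linesConverter(in).toList
  }
}
```

A function of this shape could be passed as the converter argument of socketStream; because the iterator is lazy, lines are pulled only as they are consumed.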

4. fileStream: filter is the file filter; newFilesOnly means that only new files are read. There are also several overloads that use default arguments.

def fileStream[  
    K: ClassTag,  
    V: ClassTag,  
    F <: NewInputFormat[K, V]: ClassTag  
  ] (directory: String,  
     filter: Path => Boolean,  
     newFilesOnly: Boolean,  
     conf: Configuration): InputDStream[(K, V)] = {  
    new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly, Option(conf))  
  }  

textFileStream: an interface that reads files in a fixed format (text files) as input

def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {  
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)  
  }  

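Unlike ReceiverInputDStream, it takes files as input, so no receiver is needed to read the data. Instead, it builds a Hadoop RDD directly from each path and then unions all of those RDDs. In other words, within each batchDuration interval, the new files that appeared in that interval are combined into a single RDD.

5. Union multiple DStreams and return a UnionDStream. Its compute method unions the RDDs of the individual DStreams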

/** 
   * Create a unified DStream from multiple DStreams of the same type and same slide duration. 
   */  
  def union[T: ClassTag](streams: Seq[DStream[T]]): DStream[T] = withScope {  
    new UnionDStream[T](streams.toArray)  
  }  

6. transform: for each batch, converts the RDDs obtained from all of the dstreams into a single RDD

/** 
   * Create a new DStream in which each RDD is generated by applying a function on RDDs of 
   * the DStreams. 
   */  
  def transform[T: ClassTag](  
      dstreams: Seq[DStream[_]],  
      transformFunc: (Seq[RDD[_]], Time) => RDD[T]  
    ): DStream[T] = withScope {  
    new TransformedDStream[T](dstreams, sparkContext.clean(transformFunc))  
  }  

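II. DStream operations (DStream.scala)

Unlike an RDD, where an action triggers a job, a DStream produces one job per outputStream.

So how is an outputStream created? When foreachRDD is called, the resulting ForEachDStream registers itself, which marks it as an outputStream in the DStreamGraph.

Which APIs register an outputStream?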

foreachRDD/print

saveAsNewAPIHadoopFiles/saveAsTextFiles
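The registration idea above can be sketched with a toy model in plain Scala. ToyGraph, ToyDStream and demo are hypothetical names, and this is not Spark's DStreamGraph; it only illustrates that transformations stay lazy and that a job exists only for streams registered as outputs.

```scala
import scala.collection.mutable.ArrayBuffer

object OutputStreamModel {
  // Toy stand-in for DStreamGraph: it only remembers registered outputs.
  class ToyGraph {
    private val outputs = ArrayBuffer.empty[Long => Unit]
    def register(job: Long => Unit): Unit = outputs += job
    // One job per registered output stream for the given batch time.
    def generateJobs(time: Long): Seq[() => Unit] =
      outputs.toList.map(job => () => job(time))
  }

  // Toy stand-in for a DStream: compute(time) yields that batch's data.
  class ToyDStream[T](graph: ToyGraph, compute: Long => Seq[T]) {
    // A transformation only builds a new stream; nothing is registered.
    def map[U](f: T => U): ToyDStream[U] =
      new ToyDStream[U](graph, t => compute(t).map(f))
    // An output operation registers work with the graph.
    def foreachRDD(f: Seq[T] => Unit): Unit =
      graph.register(t => f(compute(t)))
  }

  // Returns (jobs before foreachRDD, jobs after foreachRDD, sums seen).
  def demo(): (Int, Int, List[Long]) = {
    val graph = new ToyGraph
    val source = new ToyDStream[Long](graph, t => Seq(t, t + 1))
    val doubled = source.map(_ * 2)
    val before = graph.generateJobs(0L).size
    val seen = ArrayBuffer.empty[Long]
    doubled.foreachRDD(batch => seen += batch.sum)
    val jobs = graph.generateJobs(1L)
    jobs.foreach(job => job())
    (before, jobs.size, seen.toList)
  }
}
```

In this model, calling map alone produces no job; only after foreachRDD does generateJobs return something to run.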

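1. map/flatMap/filter/mapPartitions

As with RDDs, these create a MappedDStream/FlatMappedDStream/FilteredDStream and so on. At computation time, the compute method of ReceiverInputDStream produces a BlockRDD, and the function passed to map is then applied to that RDD.

2. Repartitioning

The method ultimately repartitions the BlockRDD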

/** 
   * Return a new DStream with an increased or decreased level of parallelism. Each RDD in the 
   * returned DStream has exactly numPartitions partitions. 
   */  
  def repartition(numPartitions: Int): DStream[T] = ssc.withScope {  
    this.transform(_.repartition(numPartitions))  
  }  

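3. reduce: applies reduceFunc to every RDD of the DStream, so that each resulting RDD ends up with a single partition; the return value is still a DStream[T]

Difference: the reduce method in RDD.scala submits a job through runJob and returns a concrete value.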

/** 
   * Return a new DStream in which each RDD has a single element generated by reducing each RDD 
   * of this DStream. 
   */  
  def reduce(reduceFunc: (T, T) => T): DStream[T] = ssc.withScope {  
    this.map(x => (null, x)).reduceByKey(reduceFunc, 1).map(_._2)  
  }  
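The map / reduceByKey / map chain can be modelled on plain Scala collections. In this sketch each batch (one RDD) is a Seq, and ReduceModel with its helper reduceByKey is a hypothetical stand-in rather than Spark code. One assumption to note: the Unit value () is used as the dummy key where the Spark source uses null.

```scala
object ReduceModel {
  // Stand-in for reduceByKey applied to a single batch (one RDD).
  def reduceByKey[K, V](batch: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] =
    batch.groupBy(_._1).toSeq.map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // Mirrors map(x => (null, x)).reduceByKey(f, 1).map(_._2), batch by batch.
  // The result is still a sequence of batches, not a single value.
  def reduce[T](batches: Seq[Seq[T]])(f: (T, T) => T): Seq[Seq[T]] =
    batches.map { batch =>
      val keyed = batch.map(x => ((), x)) // every element gets the same dummy key
      reduceByKey(keyed)(f).map(_._2)     // at most one element remains per batch
    }
}
```

For example, ReduceModel.reduce(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(10)))(_ + _) gives one element for each non-empty batch and nothing for the empty one, which is the stream-shaped counterpart of the single value returned by RDD.reduce.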

4. count: counts the elements of each RDD in the DStream and returns a DStream containing the counts

/** 
   * Return a new DStream in which each RDD has a single element generated by counting each RDD 
   * of this DStream. 
   */  
  def count(): DStream[Long] = ssc.withScope {  
    this.map(_ => (null, 1L))  
        .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))  
        .reduceByKey(_ + _)  
        .map(_._2)  
  }  
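The union with a single (null, 0L) record is what lets an empty batch still report a count of 0 instead of producing an empty RDD. A small collection-based model shows the effect; CountModel and the seedZero flag are hypothetical names used only for this sketch, and () again stands in for the null key.

```scala
object CountModel {
  // The extra record that the Spark code unions into every batch.
  private val zero: (Unit, Long) = ((), 0L)

  // Models count() for one batch; seedZero = false drops the union step
  // so the difference can be observed.
  def count[T](batch: Seq[T], seedZero: Boolean = true): Seq[Long] = {
    val ones: Seq[(Unit, Long)] = batch.map(_ => ((), 1L))
    val records = if (seedZero) ones :+ zero else ones
    // reduceByKey(_ + _) followed by map(_._2)
    records.groupBy(_._1).toSeq.map { case (_, kvs) => kvs.map(_._2).sum }
  }
}
```

With the seed record, an empty batch yields a single 0; without it, the reduce step has nothing to emit and the batch stays empty.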

5. countByValue: similar to count, except that it counts the occurrences of each distinct value

def countByValue(numPartitions: Int = ssc.sc.defaultParallelism)(implicit ord: Ordering[T] = null)  
      : DStream[(T, Long)] = ssc.withScope {  
    this.map(x => (x, 1L)).reduceByKey((x: Long, y: Long) => x + y, numPartitions)  
  }  

6. foreachRDD: foreachFunc is any user-defined operation to run on each RDD

def foreachRDD(foreachFunc: RDD[T] => Unit): Unit = ssc.withScope {  
    val cleanedF = context.sparkContext.clean(foreachFunc, false)  
    this.foreachRDD((r: RDD[T], t: Time) => cleanedF(r))  
  }  

def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit = ssc.withScope {  
    // because the DStream is reachable from the outer object here, and because  
    // DStreams can't be serialized with closures, we can't proactively check  
    // it for serializability and so we pass the optional false to SparkContext.clean  
    new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()  
  }  

7. transform: applies the transformation defined by transformFunc to the RDD that is finally generated

def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U]  

def transform[U: ClassTag](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U]  

8. transformWith: passes the RDD generated by this DStream, together with the RDD generated by other, to transformFunc.

def transformWith[U: ClassTag, V: ClassTag](  
      other: DStream[U], transformFunc: (RDD[T], RDD[U]) => RDD[V]  
    ): DStream[V]  

def transformWith[U: ClassTag, V: ClassTag](  
      other: DStream[U], transformFunc: (RDD[T], RDD[U], Time) => RDD[V]  
    ): DStream[V]  

9. union

def union(that: DStream[T]): DStream[T] = ssc.withScope {  
    new UnionDStream[T](Array(this, that))  
  }  

10. saveAsObjectFiles/saveAsTextFiles

Save the stream to files

III. Transformations on K/V DStreams (PairDStreamFunctions.scala)

1. groupByKey

def groupByKey(): DStream[(K, Iterable[V])] = ssc.withScope {  
    groupByKey(defaultPartitioner())  
  }  

def groupByKey(numPartitions: Int): DStream[(K, Iterable[V])] = ssc.withScope {  
    groupByKey(defaultPartitioner(numPartitions))  
  }  

def groupByKey(partitioner: Partitioner): DStream[(K, Iterable[V])] = ssc.withScope {  
    val createCombiner = (v: V) => ArrayBuffer[V](v)  
    val mergeValue = (c: ArrayBuffer[V], v: V) => (c += v)  
    val mergeCombiner = (c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => (c1 ++ c2)  
    combineByKey(createCombiner, mergeValue, mergeCombiner, partitioner)  
      .asInstanceOf[DStream[(K, Iterable[V])]]  
  }  
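How the three functions passed to combineByKey cooperate can be sketched on plain collections. The model below is an assumption-laden illustration, not Spark's shuffle: a batch is a Seq of partitions, and CombineByKeyModel is a hypothetical name. createCombiner starts a combiner for the first value of a key in a partition, mergeValue folds further values in, and mergeCombiner merges combiners across partitions.

```scala
import scala.collection.mutable.ArrayBuffer

object CombineByKeyModel {
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiner: (C, C) => C): Map[K, C] = {
    // Within each partition: one combiner per key.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k).map(c => mergeValue(c, v)).getOrElse(createCombiner(v))
        acc.updated(k, combined)
      }
    }
    // Across partitions: merge the combiners that share a key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).map(prev => mergeCombiner(prev, c)).getOrElse(c))
      }
    }
  }

  // groupByKey expressed with the same three functions as the source above.
  def groupByKey[K, V](partitions: Seq[Seq[(K, V)]]): Map[K, Seq[V]] =
    combineByKey[K, V, ArrayBuffer[V]](
      partitions,
      v => ArrayBuffer(v),
      (c, v) => c += v,
      (c1, c2) => c1 ++ c2
    ).map { case (k, buf) => (k, buf.toList) }
}
```

For instance, grouping the two partitions Seq("a" -> 1, "b" -> 2) and Seq("a" -> 3) collects 1 and 3 under "a" and 2 under "b".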

2. reduceByKey

def reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)]  

def reduceByKey(  
      reduceFunc: (V, V) => V,  
      numPartitions: Int): DStream[(K, V)]  

def reduceByKey(  
      reduceFunc: (V, V) => V,  
      partitioner: Partitioner): DStream[(K, V)]  

3. combineByKey

def combineByKey[C: ClassTag](  
      createCombiner: V => C,  
      mergeValue: (C, V) => C,  
      mergeCombiner: (C, C) => C,  
      partitioner: Partitioner,  
      mapSideCombine: Boolean = true): DStream[(K, C)] = ssc.withScope {  
    val cleanedCreateCombiner = sparkContext.clean(createCombiner)  
    val cleanedMergeValue = sparkContext.clean(mergeValue)  
    val cleanedMergeCombiner = sparkContext.clean(mergeCombiner)  
    new ShuffledDStream[K, V, C](  
      self,  
      cleanedCreateCombiner,  
      cleanedMergeValue,  
      cleanedMergeCombiner,  
      partitioner,  
      mapSideCombine)  
  }  

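4. mapValues/flatMapValues work like their RDD counterparts, so they are not explained here

5. join

Internally it calls transformWith; the function passed to transformWith simply joins the two RDDs it receives.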

def join[W: ClassTag](  
      other: DStream[(K, W)],  
      partitioner: Partitioner  
    ): DStream[(K, (V, W))] = ssc.withScope {  
    self.transformWith(  
      other,  
      (rdd1: RDD[(K, V)], rdd2: RDD[(K, W)]) => rdd1.join(rdd2, partitioner)  
    )  
  }  
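Because join is built on transformWith, records are matched only within the same batch. The following toy model makes that visible; ToyStream, JoinModel and joinBatch are hypothetical names, a stream is modelled as a function from batch time to that batch's records, and joinBatch is a naive nested-loop inner join standing in for RDD.join.

```scala
object JoinModel {
  // A toy stream: batch time -> records of that batch.
  type ToyStream[T] = Long => Seq[T]

  // Pairs up the two batches that belong to the same time and applies f.
  def transformWith[T, U, R](self: ToyStream[T], other: ToyStream[U])(
      f: (Seq[T], Seq[U]) => Seq[R]): ToyStream[R] =
    time => f(self(time), other(time))

  // Naive inner join of two batches on the key.
  def joinBatch[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  def join[K, V, W](self: ToyStream[(K, V)], other: ToyStream[(K, W)]): ToyStream[(K, (V, W))] =
    transformWith(self, other)((l, r) => joinBatch(l, r))

  // "b" arrives at time 1 on the left but at time 0 on the right,
  // so it never meets its counterpart.
  def demo(): (Seq[(String, (Int, String))], Seq[(String, (Int, String))]) = {
    val left: ToyStream[(String, Int)] = {
      case 0L => Seq("a" -> 1)
      case 1L => Seq("b" -> 2)
      case _  => Seq.empty
    }
    val right: ToyStream[(String, String)] = {
      case 0L => Seq("a" -> "x", "b" -> "y")
      case _  => Seq.empty
    }
    val joined = join(left, right)
    (joined(0L), joined(1L))
  }
}
```

In the demo only key "a" is joined, since both sides carry it in the batch at time 0.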

6. saveAsNewAPIHadoopFiles

Save the stream to files

Reposted from: http://blog.csdn.net/yueqian_zhu/article/details/49121489
