一、Introduction
This post walks through the commonly used Spark Streaming operators transform and updateStateByKey, along with the related window operations.
- The transform operator lets you perform arbitrary RDD-to-RDD operations on a DStream; in effect it is a type conversion of the DStream. Inside transform, the code that sits outside the inner RDD operators runs on the Driver side and is executed once per batchInterval, which makes it possible to change a broadcast variable dynamically (see the sketch after the Java example below).
- updateStateByKey maintains a piece of state for every key in Spark Streaming and keeps updating each key's state through an update function.
二、Details
1、transform is a transformation-type operator
package com.spark.sparkstreaming;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import com.google.common.base.Optional;

import scala.Tuple2;

/**
 * Blacklist filtering with the transform operation.
 * transform lets a DStream perform arbitrary RDD-to-RDD operations.
 * @author root
 */
public class TransformOperator {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local[2]").setAppName("transform");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Simulate a blacklist
        List<Tuple2<String, Boolean>> blackList = new ArrayList<Tuple2<String, Boolean>>();
        blackList.add(new Tuple2<String, Boolean>("zhangsan", true));
        // Turn the blacklist into an RDD
        final JavaPairRDD<String, Boolean> blackNameRDD = jsc.sparkContext().parallelizePairs(blackList);

        // Receive the socket data source
        JavaReceiverInputDStream<String> nameList = jsc.socketTextStream("node5", 9999);
        JavaPairDStream<String, String> pairNameList =
                nameList.mapToPair(new PairFunction<String, String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, String> call(String s) throws Exception {
                return new Tuple2<String, String>(s.split(" ")[1], s);
            }
        });

        JavaDStream<String> transFormResult =
                pairNameList.transform(new Function<JavaPairRDD<String, String>, JavaRDD<String>>() {

            private static final long serialVersionUID = 1L;

            @Override
            public JavaRDD<String> call(JavaPairRDD<String, String> nameRDD) throws Exception {
                // Code here that is not inside an RDD operator runs on the Driver, once per batch.
                /**
                 * nameRDD:
                 *   ("zhangsan","1 zhangsan")
                 *   ("lisi","2 lisi")
                 *   ("wangwu","3 wangwu")
                 * blackNameRDD:
                 *   ("zhangsan",true)
                 *
                 * leftOuterJoin result:
                 *   ("zhangsan",("1 zhangsan",[true]))
                 */
                JavaPairRDD<String, Tuple2<String, Optional<Boolean>>> leftOuterJoin =
                        nameRDD.leftOuterJoin(blackNameRDD);

                // Print the leftOuterJoin result if needed
                /*leftOuterJoin.foreach(new VoidFunction<Tuple2<String, Tuple2<String, Optional<Boolean>>>>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public void call(Tuple2<String, Tuple2<String, Optional<Boolean>>> t) throws Exception {
                        System.out.println(t);
                    }
                });*/

                // Filter: drop records flagged true in the blacklist, keep the rest
                // ("zhangsan",("1 zhangsan",[true]))
                JavaPairRDD<String, Tuple2<String, Optional<Boolean>>> filter =
                        leftOuterJoin.filter(new Function<Tuple2<String, Tuple2<String, Optional<Boolean>>>, Boolean>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Boolean call(Tuple2<String, Tuple2<String, Optional<Boolean>>> tuple) throws Exception {
                        if (tuple._2._2.isPresent()) {
                            return !tuple._2._2.get();
                        }
                        return true;
                    }
                });

                JavaRDD<String> resultJavaRDD =
                        filter.map(new Function<Tuple2<String, Tuple2<String, Optional<Boolean>>>, String>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public String call(Tuple2<String, Tuple2<String, Optional<Boolean>>> tuple) throws Exception {
                        return tuple._2._1;
                    }
                });

                // Return the filtered result
                return resultJavaRDD;
            }
        });

        transFormResult.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.stop();
    }
}
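Because the body of transform (outside the inner RDD operators) runs on the Driver once per batch, it can also be used to refresh side data such as a blacklist on every batchInterval. Below is a minimal Scala sketch of that idea; the loadBlacklist helper is a hypothetical placeholder (not part of the example above), and the node5:9999 socket source is only reused for illustration.

package com.spark.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}

object TransformDynamicBlacklist {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("transform-dynamic")
    val ssc = new StreamingContext(conf, Durations.seconds(5))
    ssc.sparkContext.setLogLevel("WARN")

    // Hypothetical helper: re-read the current blacklist (e.g. from a file or a database).
    def loadBlacklist(): Set[String] = Set("zhangsan")

    val lines = ssc.socketTextStream("node5", 9999)
    val pairs = lines.map(line => (line.split(" ")(1), line))

    val filtered = pairs.transform { nameRDD =>
      // This part of the closure runs on the Driver once per batchInterval,
      // so the blacklist can change between batches.
      val blacklist = loadBlacklist()
      val bc = nameRDD.sparkContext.broadcast(blacklist)
      // The filter itself runs on the executors against the freshly broadcast value.
      nameRDD.filter { case (name, _) => !bc.value.contains(name) }
             .map(_._2)
    }

    filtered.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The join-based Java version above and this broadcast-based sketch achieve the same filtering; the broadcast approach simply avoids a shuffle when the blacklist is small.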
2、The updateStateByKey operator (a running accumulation and update across batches)
Main points of updateStateByKey:
* 1. It maintains a piece of state for every key in Spark Streaming. The state can be of any type, including a custom object, and the update function can likewise be user-defined.
* 2. The state of each key is updated continuously through the update function; for every new batch, Spark Streaming uses updateStateByKey to update the state of keys that already exist.
* Using updateStateByKey requires enabling the checkpoint mechanism.
* How often is the in-memory state written to disk?
If the batchInterval is less than 10 seconds, the state is written to disk every 10 seconds; if the batchInterval is greater than 10 seconds, it is written once per batchInterval.
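If that default write frequency is not what you want, the checkpoint interval of the stateful DStream can also be set explicitly with DStream.checkpoint(interval). A minimal Scala sketch, assuming a local ./checkpoint directory and the same socket source as the examples below:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("checkpointInterval")
    val ssc = new StreamingContext(conf, Durations.seconds(5))
    // Required for updateStateByKey: where the per-key state is persisted.
    ssc.checkpoint("./checkpoint")

    val counts = ssc.socketTextStream("node5", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .updateStateByKey((values: Seq[Int], state: Option[Int]) =>
        Option(state.getOrElse(0) + values.sum))

    // Override the default: checkpoint the state DStream every 30 seconds
    // (a multiple of the 5-second batch interval).
    counts.checkpoint(Durations.seconds(30))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}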
Java code:
package com.spark.sparkstreaming;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import com.google.common.base.Optional;

import scala.Tuple2;

/**
 * What updateStateByKey does:
 * 1. It maintains a piece of state for every key in Spark Streaming. The state can be of any
 *    type, including a custom object, and the update function can also be user-defined.
 * 2. The state of each key is continuously updated through the update function; for every new
 *    batch, Spark Streaming applies updateStateByKey to update the state of existing keys.
 *
 * Example output:
 *   hello,3
 *   spark,2
 *
 * Continuously updating per-key state implies state storage and fault tolerance, so the
 * checkpoint mechanism must be enabled.
 *
 * Typical uses: ad-click analytics, counting a day's traffic, click counts, and similar
 * running aggregations.
 * @author root
 */
public class UpdateStateByKeyOperator {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("UpdateStateByKeyDemo");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));

        /**
         * Set the checkpoint directory.
         *
         * How often is the in-memory data (the state of each key) written to disk?
         *  - If the batch interval is less than 10s, it is written every 10 seconds.
         *  - If the batch interval is greater than 10s, the batch interval is used instead.
         *
         * This avoids writing to HDFS too frequently.
         */
        JavaSparkContext sparkContext = jsc.sparkContext();
        sparkContext.setCheckpointDir("./checkpoint");
//      jsc.checkpoint("hdfs://node1:9000/spark/checkpoint");
//      jsc.checkpoint("./checkpoint");

        JavaReceiverInputDStream<String> lines = jsc.socketTextStream("node5", 9999);

        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });

        JavaPairDStream<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        JavaPairDStream<String, Integer> counts =
                ones.updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Optional<Integer> call(List<Integer> values, Optional<Integer> state) throws Exception {
                /**
                 * values: the values for this key in the current batch after grouping, e.g. [1,1,1,1,1]
                 * state:  the state of this key before the current batch
                 */
                Integer updateValue = 0;
                if (state.isPresent()) {
                    updateValue = state.get();
                }
                for (Integer value : values) {
                    updateValue += value;
                }
                return Optional.of(updateValue);
            }
        });

        // output operator
        counts.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}
Scala code:
package com.bjsxt.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Durations
import org.apache.spark.streaming.StreamingContext

object Operator_UpdateStateByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[2]").setAppName("updateStateByKey")
    val jsc = new StreamingContext(conf, Durations.seconds(5))
    // Set the log level
    jsc.sparkContext.setLogLevel("WARN")
    // Set the checkpoint path
    jsc.checkpoint("hdfs://node1:9000/spark/checkpoint")

    val lineDStream = jsc.socketTextStream("node5", 9999)
    val wordDStream = lineDStream.flatMap { _.split(" ") }
    val pairDStream = wordDStream.map { (_, 1) }

    val result = pairDStream.updateStateByKey((seq: Seq[Int], option: Option[Int]) => {
      var value = 0
      value += option.getOrElse(0)
      for (elem <- seq) {
        value += elem
      }
      Option(value)
    })
    result.print()

    jsc.start()
    jsc.awaitTermination()
    jsc.stop()
  }
}
Result:
As the output shows, the accumulated state is maintained continuously from the moment the application starts!
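Because the state is kept from startup onward, it grows without bound unless keys are dropped. With updateStateByKey, returning None from the update function removes that key's state. The fragment below is a drop-in replacement for the update function in the Scala example above; the 1000 threshold is purely illustrative.

// Returning None removes this key's state instead of keeping it forever.
val boundedUpdate = (values: Seq[Int], state: Option[Int]) => {
  val newCount = state.getOrElse(0) + values.sum
  if (newCount > 1000) None else Option(newCount)
}

// Used exactly like the update function above:
// val result = pairDStream.updateStateByKey(boundedUpdate)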
3、Window functions (accumulation within a window of time, rather than since the application started)
Suppose there is one batch every 5 seconds, with a window length of 15 seconds and a window slide interval of 10 seconds.
The window length and the slide interval must both be integer multiples of the batchInterval; if they are not, the check fails and an error is reported.
The optimized window operation keeps state, so a checkpoint path must be set; the unoptimized window operation works without one.
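The difference between the two variants is the signature of reduceByKeyAndWindow. The plain form recomputes the whole window on every slide; the optimized form also takes an inverse reduce function, so it only adds the batch that entered the window and subtracts the batch that left it, which is why it must checkpoint intermediate state. A minimal Scala sketch of both, assuming pairs is the (word, 1) DStream built in the full examples below, with a 5-second batch interval and checkpointing already enabled on the context:

// Plain window: recomputes the full 15-second window every 5 seconds;
// no checkpoint directory is required for this form.
val fullRecompute = pairs.reduceByKeyAndWindow(
  (v1: Int, v2: Int) => v1 + v2,
  Durations.seconds(15), Durations.seconds(5))

// Incremental window: adds the new batch and subtracts the batch that slid out,
// so intermediate per-key state must be checkpointed.
val incremental = pairs.reduceByKeyAndWindow(
  (v1: Int, v2: Int) => v1 + v2,   // reduce
  (v1: Int, v2: Int) => v1 - v2,   // inverse reduce
  Durations.seconds(15), Durations.seconds(5))

Note that with the incremental form, keys whose window count falls back to zero stay in the output unless the overload that also takes a filter function is used to drop them.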
package com.spark.sparkstreaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

/**
 * Real-time hot search-word statistics over a sliding window.
 * @author root
 */
public class WindowOperator {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("WindowHotWord");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Set the log level to WARN
        jssc.sparkContext().setLogLevel("WARN");

        /**
         * Note:
         *  - The unoptimized window function does not need a checkpoint directory.
         *  - The optimized window function must have a checkpoint directory.
         */
//      jssc.checkpoint("hdfs://node1:9000/spark/checkpoint");
        jssc.checkpoint("./checkpoint");

        JavaReceiverInputDStream<String> searchLogsDStream = jssc.socketTextStream("node04", 9999);

        // word 1
        JavaDStream<String> searchWordsDStream = searchLogsDStream.flatMap(new FlatMapFunction<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" "));
            }
        });

        // Map each search word to a (searchWord, 1) tuple
        JavaPairDStream<String, Integer> searchWordPairDStream =
                searchWordsDStream.mapToPair(new PairFunction<String, String, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String searchWord) throws Exception {
                return new Tuple2<String, Integer>(searchWord, 1);
            }
        });

        /**
         * For example, computing the last 60 seconds of data every 10 seconds gives a 60-second
         * window containing 12 RDDs (with a 5-second batch interval). Those RDDs are not computed
         * until the window fires; at that point they are aggregated together and
         * reduceByKeyAndWindow is executed. reduceByKeyAndWindow is applied per window,
         * not to the whole DStream. (The code below uses a 15-second window sliding every 5 seconds.)
         */
        JavaPairDStream<String, Integer> searchWordCountsDStream =
                searchWordPairDStream.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        }, Durations.seconds(15), Durations.seconds(5)); // window length, slide interval

        /**
         * Optimized (incremental) window operation; this variant requires the checkpoint directory.
         */
//      JavaPairDStream<String, Integer> searchWordCountsDStream =
//              searchWordPairDStream.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Integer call(Integer v1, Integer v2) throws Exception {
//              return v1 + v2;
//          }
//
//      }, new Function2<Integer, Integer, Integer>() {
//
//          private static final long serialVersionUID = 1L;
//
//          @Override
//          public Integer call(Integer v1, Integer v2) throws Exception {
//              return v1 - v2;
//          }
//
//      }, Durations.seconds(15), Durations.seconds(5));

        searchWordCountsDStream.print();

        jssc.start();
        jssc.awaitTermination();
        jssc.close();
    }
}
Scala code:
package com.bjsxt.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Durations
import org.apache.spark.streaming.StreamingContext

object Operator_Window {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local[2]").setAppName("window")
    val jsc = new StreamingContext(conf, Durations.seconds(5))
    // Set the log level
    jsc.sparkContext.setLogLevel("WARN")
    // Set the checkpoint path
    jsc.checkpoint("hdfs://node1:9000/spark/checkpoint")

    val lineDStream = jsc.socketTextStream("node04", 9999)
    val wordDStream = lineDStream.flatMap { _.split(" ") }
    val mapDStream = wordDStream.map { (_, 1) }

    // Unoptimized window
    val result = mapDStream.reduceByKeyAndWindow((v1: Int, v2: Int) => {
      v1 + v2
    }, Durations.seconds(60), Durations.seconds(10))

    // Optimized (incremental) window
//    val result = mapDStream.reduceByKeyAndWindow((v1: Int, v2: Int) => {
//      v1 + v2
//    }, (v1: Int, v2: Int) => {
//      v1 - v2
//    }, Durations.seconds(60), Durations.seconds(10))

    result.print()

    jsc.start()
    jsc.awaitTermination()
    jsc.stop()
  }
}
Result: