Some Commonly Used Spark Operators
Some frequently used RDD operators

map: applies a function to every element of the RDD and returns an RDD of whatever type the function produces. In the example below the input elements are strings and the collected output is an array of tuples.

scala> val rdd1 = sc.parallelize(List("a","b","c","d"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd1.map((_,1)).collect
res1: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (d,1))

----------------------------------------------------------------------------------------------------

mapPartitionsWithIndex: passes each partition's index together with an iterator over that partition's values, so the output can show which partition each value belongs to.

scala> val rdd1 = sc.parallelize(List("a","b","c","d"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val func = (xpar:Int,y:Iterator[String])=>{
     |   y.toList.map(x=>"partition:"+xpar+" value:"+x).iterator
     | }
func: (Int, Iterator[String]) => Iterator[String] = <function2>

scala> rdd1.mapPartitionsWithIndex(func).collect
res2: Array[String] = Array(partition:0 value:a, partition:0 value:b, partition:1 value:c, partition:1 value:d)

----------------------------------------------------------------------------------------------------

aggregate(zeroValue)(seqOp, combOp): first aggregates the data inside each partition with seqOp, then merges the per-partition results with combOp. Both steps are iterative: the running result is combined with the next value in turn. zeroValue is the starting value inside every partition and again for the final merge. Partition results are merged as their tasks finish, so with a non-commutative combOp such as string concatenation the result below may also come out as cdab.

scala> val rdd1 = sc.parallelize(List("a","b","c","d"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd1.aggregate("")(_+_,_+_)
res3: String = abcd

----------------------------------------------------------------------------------------------------

aggregateByKey: for key-value RDDs. It works like aggregate, but the values are aggregated separately for each key.

scala> val rdd = sc.parallelize(List((1,1),(1,2),(2,2),(2,3)), 2)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:24

scala> rdd.aggregateByKey(0)((x,y)=>x+y, (x,y)=>(x+y)).collect
res36: Array[(Int, Int)] = Array((2,5), (1,3))

----------------------------------------------------------------------------------------------------

coalesce, repartition: both change the number of partitions. repartition is implemented by calling coalesce. coalesce takes one more parameter than repartition, shuffle, which decides whether the data is shuffled and defaults to false; repartition(n) always passes shuffle = true.

scala> val rdd = sc.parallelize(List(1,2,3,4,5), 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24

scala> rdd.repartition(3)
res42: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at repartition at <console>:27

scala> res42.partitions.length
res43: Int = 3

----------------------------------------------------------------------------------------------------

collectAsMap: returns the key-value pairs of the RDD to the driver as a Map.

scala> val rdd = sc.parallelize(List(("a",2),("b",10),("x",22)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[29] at parallelize at <console>:24

scala> rdd.collectAsMap
res44: scala.collection.Map[String,Int] = Map(b -> 10, a -> 2, x -> 22)

----------------------------------------------------------------------------------------------------

combineByKey: in the example below it gives the same result as reduceByKey(_+_), although it is the more general operator. It takes three functions. The first (createCombiner) turns the first value seen for a key into the initial accumulator. The second (mergeValue) is the local step: it merges further values for that key into the accumulator within a partition. The third (mergeCombiners) is the global step: it merges the accumulators of the same key across partitions.

scala> val rdd = sc.parallelize(List(("a",2),("b",10),("x",22),("a",200),("x",89)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[30] at parallelize at <console>:24

scala> rdd.combineByKey(x=>x, (a:Int,b:Int)=>a+b, (a:Int,b:Int)=>a+b)
res45: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[31] at combineByKey at <console>:27

scala> res45.collect
res46: Array[(String, Int)] = Array((x,111), (b,10), (a,202))

----------------------------------------------------------------------------------------------------

countByKey: counts the number of records for each key and returns the counts to the driver as a Map.

scala> val rdd = sc.parallelize(List(("a",2),("b",10),("x",22),("a",200),("x",89)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24

scala> rdd.countByKey
res49: scala.collection.Map[String,Long] = Map(x -> 2, b -> 1, a -> 2)
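
The two-phase pattern behind aggregate, aggregateByKey and combineByKey (merge inside each partition first, then merge the partial results) can be sketched without Spark. The block below is only a plain-Scala model under stated assumptions: an RDD is stood in for by a Seq of partitions, the object name TwoPhaseAggregation and its helpers are made up for illustration, and partitions are merged in index order, whereas Spark merges them as tasks complete. The sample data assumes the 5-element list is split into partitions of 2 and 3 elements.

```scala
// Plain-Scala model of two-phase aggregation; this is NOT Spark code.
// A Seq of partitions is a hypothetical stand-in for an RDD.
object TwoPhaseAggregation {

  // Model of aggregate(zeroValue)(seqOp, combOp):
  // fold every partition with seqOp, then merge the partial results with combOp.
  // Note that zero is used once per partition and once more for the final merge.
  def aggregate[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    val partials = partitions.map(p => p.foldLeft(zero)(seqOp))
    partials.foldLeft(zero)(combOp)
  }

  // Model of combineByKey(createCombiner, mergeValue, mergeCombiners).
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Local step: build one accumulator per key inside each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { p =>
      p.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case Some(c) => acc.updated(k, mergeValue(c, v)) // key already seen here
          case None    => acc.updated(k, createCombiner(v)) // first value for this key
        }
      }
    }
    // Global step: merge the accumulators of the same key across partitions.
    perPartition
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, cs) => k -> cs.map(_._2).reduce(mergeCombiners) }
  }

  def main(args: Array[String]): Unit = {
    val letters = Seq(Seq("a", "b"), Seq("c", "d"))
    println(aggregate(letters, "")(_ + _, _ + _))  // abcd
    println(aggregate(letters, "|")(_ + _, _ + _)) // ||ab|cd

    val pairs = Seq(Seq("a" -> 2, "b" -> 10), Seq("x" -> 22, "a" -> 200, "x" -> 89))
    // Sum per key, as in the combineByKey transcript above.
    println(combineByKey[String, Int, Int](pairs)(v => v, _ + _, _ + _))
    // (sum, count) per key: something reduceByKey cannot express directly,
    // because the accumulator type differs from the value type.
    println(combineByKey[String, Int, (Int, Int)](pairs)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)))
  }
}
```

The second aggregate call shows why the zero value should be neutral for both functions: a non-neutral value such as "|" shows up once per partition and once more in the final merge.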
----------------------------------------------------------------------------------------------------

filterByRange: returns only the elements whose keys fall within the given range; both bounds are inclusive.

scala> val rdd = sc.parallelize(List(("a",2),("b",10),("x",22),("a",200),("x",89)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[36] at parallelize at <console>:24

scala> rdd.filterByRange("a","b")
res51: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[37] at filterByRange at <console>:27

scala> res51.collect
res52: Array[(String, Int)] = Array((a,2), (b,10), (a,200))

----------------------------------------------------------------------------------------------------

flatMapValues: applies a function to each value and flattens the result into one pair per produced element, keeping the original key.

scala> val rdd = sc.parallelize(List(("a"->"1 2 3 "),("b"->"1 2 3 "),("x"->"1 2 3 "),("a"->"1 2 3 "),("x"->"1 2 3 ")), 2)
rdd: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[39] at parallelize at <console>:24

scala> rdd.flatMapValues(x=>x.split(" ")).collect
res53: Array[(String, String)] = Array((a,1), (a,2), (a,3), (b,1), (b,2), (b,3), (x,1), (x,2), (x,3), (a,1), (a,2), (a,3), (x,1), (x,2), (x,3))

----------------------------------------------------------------------------------------------------

foldByKey: gathers the values of each key and folds them with the given function, starting from the zero value.

scala> val rdd = sc.parallelize(List(("a",2),("b",10),("x",22),("a",200),("x",89)), 2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> rdd.foldByKey(0)(_+_).collect
res55: Array[(String, Int)] = Array((x,111), (b,10), (a,202))

----------------------------------------------------------------------------------------------------

keyBy: applies the given function to each element and uses the result as the key; the element itself becomes the value. Because collect is called in the same statement here, rdd2 is a local Array rather than an RDD.

scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[43] at parallelize at <console>:24

scala> val rdd2 = rdd1.keyBy(_.length).collect
rdd2: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
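
As a rough way to see what filterByRange, flatMapValues, foldByKey and keyBy compute, each can be written in terms of ordinary Scala collection operations. This is a sketch, not Spark code: the object PairOpsSketch and its method signatures are invented for illustration, a plain Seq of pairs stands in for a pair RDD, and partitioning is ignored. In Spark, foldByKey applies the zero value per key in each partition, which makes no difference for a neutral zero such as 0 with addition.

```scala
// Plain-Scala analogues of four pair operators; this is NOT Spark code.
// A Seq of pairs is a hypothetical stand-in for a pair RDD.
object PairOpsSketch {

  // keyBy(f): pair every element with the key computed by f.
  def keyBy[T, K](xs: Seq[T])(f: T => K): Seq[(K, T)] =
    xs.map(x => (f(x), x))

  // flatMapValues(f): expand each value into several values, repeating the key.
  def flatMapValues[K, V, U](xs: Seq[(K, V)])(f: V => Iterable[U]): Seq[(K, U)] =
    xs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

  // filterByRange(lower, upper): keep pairs whose key lies in the inclusive range.
  def filterByRange[K, V](xs: Seq[(K, V)])(lower: K, upper: K)(
      implicit ord: Ordering[K]): Seq[(K, V)] =
    xs.filter { case (k, _) => ord.gteq(k, lower) && ord.lteq(k, upper) }

  // foldByKey(zero)(op): fold the values of each key, starting from zero.
  def foldByKey[K, V](xs: Seq[(K, V)])(zero: V)(op: (V, V) => V): Map[K, V] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).foldLeft(zero)(op) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("a" -> 2, "b" -> 10, "x" -> 22, "a" -> 200, "x" -> 89)
    println(filterByRange(pairs)("a", "b"))
    println(foldByKey(pairs)(0)(_ + _))
    println(flatMapValues(Seq("a" -> "1 2 3 "))(v => v.split(" ").toSeq))
    println(keyBy(Seq("dog", "salmon", "rat"))(_.length))
  }
}
```

The flatMapValues call also shows why the trailing space in "1 2 3 " is harmless: String.split drops trailing empty strings, so each value yields exactly three elements.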
----------------------------------------------------------------------------------------------------

keys, values: return an RDD containing only the keys, or only the values, of a key-value RDD.

scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[45] at parallelize at <console>:24

scala> val rdd2 = rdd1.map(x=>(x.length,x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[47] at map at <console>:26

scala> rdd2.keys
res63: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[48] at keys at <console>:29

scala> rdd2.keys.collect
res64: Array[Int] = Array(3, 6, 6, 3, 8)

scala> rdd2.values.collect
res65: Array[String] = Array(dog, salmon, salmon, rat, elephant)
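
keys and values are simply projections of each pair onto its first or second component, equivalent to map(_._1) and map(_._2). The snippet below illustrates this with plain Scala collections. It is a sketch rather than Spark code, and the object name KeysValuesSketch is made up for illustration.

```scala
// Plain-Scala analogue of keys and values; this is NOT Spark code.
// A Seq of pairs is a hypothetical stand-in for a pair RDD.
object KeysValuesSketch {

  // keys: keep only the first component of every pair.
  def keys[K, V](pairs: Seq[(K, V)]): Seq[K] = pairs.map(_._1)

  // values: keep only the second component of every pair.
  def values[K, V](pairs: Seq[(K, V)]): Seq[V] = pairs.map(_._2)

  def main(args: Array[String]): Unit = {
    val words = Seq("dog", "salmon", "salmon", "rat", "elephant")
    val pairs = words.map(w => (w.length, w)) // same shape as rdd2 above
    println(keys(pairs))
    println(values(pairs))
  }
}
```

Duplicate keys are kept, just as 6 and 3 each appear twice in the collected keys above.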
From the ITPUB blog, link: http://blog.itpub.net/31506529/viewspace-2215929/. If you repost this article, please credit the source; otherwise legal action may be pursued.