Spark: optimizing the RDD[(K, Iterable[V])] produced by groupByKey
How RDD execution is triggered
In Spark, RDD actions are triggered by the SparkContext, while the transformations themselves are implemented as lazy Scala Iterator pipelines, as the following excerpts from Spark's RDD implementation show.
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
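Because each transformation above just wraps the parent partition's iterator in a MapPartitionsRDD, a chain of such transformations collapses into a single lazy iterator pipeline per partition, and nothing executes until an action pulls elements. A minimal plain-Scala sketch of that composition (no Spark involved, names invented for illustration):

object IteratorPipeline {
  def main(args: Array[String]): Unit = {
    // Stand-in for one partition's data.
    val source = Iterator(1, 2, 3, 4)

    // Roughly what rdd.map(_ * 2).filter(_ > 4) builds per partition:
    // each step only wraps the previous iterator.
    val pipeline = source
      .map { x => println(s"map sees $x"); x * 2 }
      .filter { x => println(s"filter sees $x"); x > 4 }

    println("nothing has executed yet")
    // Only when the iterator is consumed (as an action does) do the
    // map/filter functions run, one element at a time.
    println(pipeline.toList)
  }
}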
Analyzing groupByKey
groupByKey is a very expensive operation: after the shuffle, all of the values grouped under each key are buffered in memory as an Iterable[V] (a CompactBuffer).
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
  groupByKey(defaultPartitioner(self))
}

def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
If you then run flatMap over this RDD[(K, Iterable[V])] and need batch-level control downstream, for example computing one aggregated result for every 10 records, a problem appears.
Example:
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[1]")
val sparkContext = new SparkContext(conf)

val rdd = sparkContext.makeRDD(Seq(
    ("wang", 25), ("wang", 26), ("wang", 18), ("wang", 15), ("wang", 7), ("wang", 1)
  ))
  .groupByKey()
  .flatMap { kv =>
    var i = 0
    // kv._2 is an Iterable[Int] (a CompactBuffer); map on it is strict, so every
    // value is produced (and printed) before the downstream iterator sees any of them.
    kv._2.map { r =>
      i = i + 1
      println(r)
      r
    }
  }

sparkContext.runJob(rdd, add _)

// Pull values from the partition iterator and print them in small batches.
def add(list: Iterator[Int]): Unit = {
  var i = 0
  val items = new mutable.MutableList[Int]()
  while (list.hasNext) {
    items += list.next()
    if (i >= 2) {
      println(items.mkString(","))
      items.clear()
      i = 0
    } else if (!list.hasNext) {
      println(items.mkString(","))
    }
    i = i + 1
  }
}
Result (all six values are printed by the flatMap before the first batch is emitted):
25
26
18
15
7
1
25,26,18
15,7
1
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[1]")
val sparkContext = new SparkContext(conf)

val rdd = sparkContext.makeRDD(Seq(
    ("wang", 25), ("wang", 26), ("wang", 18), ("wang", 15), ("wang", 7), ("wang", 1)
  ))
  .groupByKey()
  .flatMap { kv =>
    var i = 0
    // Converting the Iterable to an Iterator makes the map lazy: each value is
    // produced only when the downstream batching logic actually pulls it.
    kv._2.toIterator.map { r =>
      i = i + 1
      println(r)
      r
    }
  }

sparkContext.runJob(rdd, add _)

// Same batching consumer as above.
def add(list: Iterator[Int]): Unit = {
  var i = 0
  val items = new mutable.MutableList[Int]()
  while (list.hasNext) {
    items += list.next()
    if (i >= 2) {
      println(items.mkString(","))
      items.clear()
      i = 0
    } else if (!list.hasNext) {
      println(items.mkString(","))
    }
    i = i + 1
  }
}
Result (value production and batch consumption are interleaved):
25
26
18
25,26,18
15
7
15,7
1
1
Conclusion
If flatMap works directly on the Iterable inside RDD[(K, Iterable[V])], the action has no way to control the flow: the downstream logic only runs after all of the data inside the flatMap has been produced. Converting the grouped values to an Iterator (kv._2.toIterator) keeps the pipeline lazy, so the downstream batching is interleaved with the flatMap and can process the data incrementally.
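The difference comes down to strict versus lazy map. A minimal plain-Scala sketch (no Spark, names invented for illustration) that reproduces the two behaviours:

object EagerVsLazy {
  def main(args: Array[String]): Unit = {
    val values = Seq(25, 26, 18)

    // map on a strict collection (like the CompactBuffer behind Iterable[V])
    // produces every element before the consumer sees any of them.
    println("-- map on a strict collection: eager --")
    val eager = values.map { v => println(s"produce $v"); v }
    eager.foreach(v => println(s"consume $v"))

    // map on an Iterator is lazy: an element is produced only when it is pulled,
    // so produce/consume lines interleave, mirroring the toIterator example above.
    println("-- map on an Iterator: lazy --")
    val lazyIt = values.iterator.map { v => println(s"produce $v"); v }
    lazyIt.foreach(v => println(s"consume $v"))
  }
}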