Summary of aggregation functions in Spark
Functions in PairRDDFunctions:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
Aggregate the values of each key, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U’s, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
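The seqOp/combOp split described above can be sketched in plain Scala without a cluster. This is an illustrative model of the contract (two hypothetical in-memory "partitions", U = (sum, count) while V = Int), not Spark's actual implementation; with a live SparkContext `sc` the equivalent call would be `sc.parallelize(pairs).aggregateByKey((0, 0))(seqOp, combOp)`.

```scala
// Two hypothetical partitions of (key, value) pairs.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq(("a", 1), ("b", 2), ("a", 3)),
  Seq(("a", 4), ("b", 5))
)

// U = (sum, count): a different result type than the value type V = Int.
val zeroValue = (0, 0)
// seqOp: merge one V into a U (used within a partition).
def seqOp(u: (Int, Int), v: Int): (Int, Int) = (u._1 + v, u._2 + 1)
// combOp: merge two U's (used between partitions).
def combOp(u1: (Int, Int), u2: (Int, Int)): (Int, Int) =
  (u1._1 + u2._1, u1._2 + u2._2)

// Phase 1: fold each partition's values into a per-key U with seqOp.
val perPartition: Seq[Map[String, (Int, Int)]] = partitions.map { part =>
  part.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).foldLeft(zeroValue)(seqOp)
  }
}

// Phase 2: merge the per-partition accumulators with combOp.
val result: Map[String, (Int, Int)] =
  perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).reduce(combOp)
  }

println(result) // per-key (sum, count): a -> (8,3), b -> (7,2)
```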
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.
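Unlike aggregateByKey, reduceByKey keeps the value type V unchanged. A plain-Scala model of its semantics (the function must be associative and commutative because Spark applies it both map-side and after the shuffle); with a live SparkContext `sc`, `sc.parallelize(pairs).reduceByKey(_ + _)` would yield the same pairs:

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4))

// Plain-Scala equivalent of pairsRDD.reduceByKey(_ + _):
// group the values per key, then reduce them with the given function.
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }

println(reduced) // a -> 4, b -> 6
```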
def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]
Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a Map. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence. Allows controlling the partitioning of the resulting key-value pair RDD by passing a Partitioner. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
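A plain-Scala model of what groupByKey produces (note that on a real RDD the group ordering shown here is not guaranteed, as the documentation above states):

```scala
val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

// Plain-Scala equivalent of pairsRDD.groupByKey():
// every value for a key is materialized into a single sequence,
// which is why groupByKey can be memory-hungry for skewed keys.
val grouped: Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

println(grouped) // a -> Seq(1, 3), b -> Seq(2)
```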
Note
As currently implemented, groupByKey must be able to hold all the key-value pairs for any key in memory. If a key has too many values, it can result in an OutOfMemoryError. This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will provide much better performance.
def combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level. This method is here for backward compatibility. It does not provide combiner classtag information to the shuffle.
- See also: combineByKeyWithClassTag
def combineByKeyWithClassTag[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)(implicit ct: ClassTag[C]): RDD[(K, C)]
Simplified version of combineByKeyWithClassTag that hash-partitions the resulting RDD using the existing partitioner/parallelism level.
Annotations
@Experimental()
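combineByKey generalizes aggregateByKey: instead of a fixed zeroValue, createCombiner builds the initial C from the first V seen for a key. A plain-Scala sketch computing a per-key average with C = (sum, count); this models a single partition, so mergeCombiners (which merges per-partition combiners after the shuffle) is defined but not exercised here:

```scala
val data = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0), ("a", 5.0))

// createCombiner: build the initial C from the first V for a key.
def createCombiner(v: Double): (Double, Int) = (v, 1)
// mergeValue: fold a subsequent V into an existing C (within a partition).
def mergeValue(c: (Double, Int), v: Double): (Double, Int) = (c._1 + v, c._2 + 1)
// mergeCombiners: merge two C's from different partitions.
def mergeCombiners(c1: (Double, Int), c2: (Double, Int)): (Double, Int) =
  (c1._1 + c2._1, c1._2 + c2._2)

// Single-partition model of rdd.combineByKey(createCombiner, mergeValue, mergeCombiners):
val combined: Map[String, (Double, Int)] =
  data.foldLeft(Map.empty[String, (Double, Int)]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
  }

val averages = combined.map { case (k, (sum, n)) => k -> sum / n }
println(averages) // a -> 3.0, b -> 3.0
```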