【Spark Java API】Action(4)—sortBy、takeOrdered、takeSample
sortBy
Official documentation:
Return this RDD sorted by the given key function.
Function prototype:
def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T]
**sortBy sorts the elements of the RDD according to the given function f.**
Source code analysis:
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}

/**
 * Creates tuples of the elements in this RDD by applying `f`.
 */
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
  val cleanedF = sc.clean(f)
  map(x => (cleanedF(x), x))
}
**As the source shows, sortBy is implemented on top of keyBy and sortByKey. It takes three parameters. The first is a function that takes an element of the RDD (type T) and returns the sort key; keyBy applies it in a map to turn each element into a (key, element) tuple. The second, ascending, is optional and controls the sort order; it defaults to true, i.e. ascending. The third, numPartitions, is also optional and sets the number of partitions of the sorted RDD; it defaults to the number of partitions before sorting, this.partitions.length.**
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
final Random random = new Random(100);
// Transform the RDD: each element becomes a string made of two parts
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return v1.toString() + "_" + random.nextInt(100);
    }
});
System.out.println(javaRDD1.collect());
// Sort the RDD by the second part of each element
JavaRDD<String> resultRDD = javaRDD1.sortBy(new Function<String, Object>() {
    @Override
    public Object call(String v1) throws Exception {
        return v1.split("_")[1];
    }
}, false, 3);
System.out.println("result--------------" + resultRDD.collect());
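For reference, here is a minimal sketch (not part of the original example) showing that the same ordering can be reproduced by composing keyBy, sortByKey and values directly, mirroring the implementation shown in the source analysis above; it reuses javaRDD1 from the example, and the variable name manualSortRDD is illustrative:

// Equivalent to javaRDD1.sortBy(f, false, 3), built from the same pieces the Scala source uses
JavaRDD<String> manualSortRDD = javaRDD1
        .keyBy(new Function<String, String>() {
            // Build (key, element) tuples, just like keyBy in the Scala source
            @Override
            public String call(String v1) throws Exception {
                return v1.split("_")[1];
            }
        })
        .sortByKey(false, 3)   // sort by key, descending, into 3 partitions
        .values();             // drop the keys again
System.out.println("manual--------------" + manualSortRDD.collect());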
takeOrdered
Official documentation:
Returns the first k (smallest) elements from this RDD using the natural ordering for T while maintaining the order.
Function prototype:
def takeOrdered(num: Int): JList[T]

def takeOrdered(num: Int, comp: Comparator[T]): JList[T]
**takeOrdered returns the first num elements of the RDD, either in the default (ascending) order or according to a supplied Comparator.**
Source code analysis:
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}
**As the source shows, mapPartitions sorts locally within each partition, and each partition contributes at most num elements; note that the elements of the resulting mapRDDs are BoundedPriorityQueue instances (bounded priority queues). A reduce over mapRDDs then merges these queues, and the merged queue is converted to an array and sorted globally.**
Example:
// Note: the Comparator must be serializable
public static class TakeOrderedComparator implements Serializable, Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
        return -o1.compareTo(o2);
    }
}

List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeOrdered-----1-------------" + javaRDD.takeOrdered(2));
List<Integer> list = javaRDD.takeOrdered(2, new TakeOrderedComparator());
System.out.println("takeOrdered----2--------------" + list);
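To make the per-partition logic concrete, here is a small local sketch in plain Java (no Spark; the class and method names are illustrative only): each "partition" keeps its k smallest elements in a size-bounded max-heap, the heaps are merged, and the survivors are sorted globally, which is essentially what the BoundedPriorityQueue plus reduce combination above does.

import java.util.*;

public class TopKLocalSketch {
    // Keep at most k elements, always evicting the current largest,
    // i.e. a size-bounded "smallest k" container (the role BoundedPriorityQueue plays).
    static PriorityQueue<Integer> smallestK(List<Integer> items, int k) {
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());
        for (int x : items) {
            maxHeap.offer(x);
            if (maxHeap.size() > k) {
                maxHeap.poll(); // drop the largest element seen so far
            }
        }
        return maxHeap;
    }

    public static void main(String[] args) {
        int k = 2;
        // Pretend these are the three partitions of the RDD from the example above.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(5, 1), Arrays.asList(0, 4), Arrays.asList(4, 2, 2));

        // "Map" side: a bounded queue per partition; "reduce" side: merge the queues.
        PriorityQueue<Integer> merged = new PriorityQueue<>(Comparator.reverseOrder());
        for (List<Integer> partition : partitions) {
            for (int x : smallestK(partition, k)) {
                merged.offer(x);
                if (merged.size() > k) {
                    merged.poll();
                }
            }
        }

        // Final global sort of the k survivors.
        List<Integer> result = new ArrayList<>(merged);
        Collections.sort(result);
        System.out.println(result); // [0, 1], matching javaRDD.takeOrdered(2)
    }
}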
takeSample
Official documentation:
Return a fixed-size sampled subset of this RDD in an array
Function prototype:
def takeSample(withReplacement: Boolean, num: Int): JList[T]

def takeSample(withReplacement: Boolean, num: Int, seed: Long): JList[T]
**takeSample returns a list of num elements sampled at random from the dataset.**
Source code analysis:
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = {
  val numStDev = 10.0

  if (num < 0) {
    throw new IllegalArgumentException("Negative number of elements requested")
  } else if (num == 0) {
    return new Array[T](0)
  }

  val initialCount = this.count()
  if (initialCount == 0) {
    return new Array[T](0)
  }

  val maxSampleSize = Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt
  if (num > maxSampleSize) {
    throw new IllegalArgumentException("Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")
  }

  val rand = new Random(seed)
  if (!withReplacement && num >= initialCount) {
    return Utils.randomizeInPlace(this.collect(), rand)
  }

  val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount, withReplacement)

  var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

  // If the first sample didn't turn out large enough, keep trying to take samples;
  // this shouldn't happen often because we use a big multiplier for the initial size
  var numIters = 0
  while (samples.length < num) {
    logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    numIters += 1
  }

  Utils.randomizeInPlace(samples, rand).take(num)
}
**As the source shows, takeSample is closely related to sample. It takes three parameters. The first, withReplacement, indicates whether to sample with replacement (true) or without (false). The second, num, is the exact number of sampled elements to return; this is the key difference between takeSample and sample. The third, seed, is the seed for the random number generator. Internally, takeSample first computes fraction (the sampling ratio), then calls sample with that fraction and collects the result, re-sampling while the result is still too small, and finally calls take to return num elements. Note that if the requested sample size is larger than the number of elements in the RDD and sampling is without replacement, only the RDD's own elements (all of them, in random order) are returned.**
Example:
List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeSample-----1-------------" + javaRDD.takeSample(true, 2));
System.out.println("takeSample-----2-------------" + javaRDD.takeSample(true, 2, 100));
// With replacement: returns 20 elements
System.out.println("takeSample-----3-------------" + javaRDD.takeSample(true, 20, 100));
// Without replacement: returns only 7 elements (the whole RDD)
System.out.println("takeSample-----4-------------" + javaRDD.takeSample(false, 20, 100));
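To make the difference from sample concrete, the sketch below (an illustration, not part of the original article) reuses javaRDD from the example: sample takes a fraction and returns a result of only approximate size, while takeSample re-samples as needed so that exactly num elements come back. The fraction computed here is just a rough stand-in for what SamplingUtils.computeFractionForSampleSize would produce.

long total = javaRDD.count();                              // 7 elements
double fraction = Math.min(1.0, 2.0 / (double) total);     // rough fraction aiming at ~2 elements
// sample: the result size is only approximate (could be 1, 2, 3, ... elements)
List<Integer> approx = javaRDD.sample(false, fraction, 100).collect();
System.out.println("sample--------------------" + approx);
// takeSample: exactly 2 elements are returned
List<Integer> exact = javaRDD.takeSample(false, 2, 100);
System.out.println("takeSample----------------" + exact);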
Author: 小飛_俠_kobe
Source: ITPUB blog, http://blog.itpub.net/3137/viewspace-2818912/ (please credit the source when reposting).