[Spark Java API] Action (4): sortBy, takeOrdered, takeSample
sortBy
Official documentation:
Return this RDD sorted by the given key function.
Function prototype:

```scala
def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T]
```
sortBy sorts the elements of the RDD according to the given function f.
Source code analysis:
```scala
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}

/**
 * Creates tuples of the elements in this RDD by applying `f`.
 */
def keyBy[K](f: T => K): RDD[(K, T)] = withScope {
  val cleanedF = sc.clean(f)
  map(x => (cleanedF(x), x))
}
```
From the source we can see that sortBy is implemented in terms of sortByKey. It takes three arguments. The first is a key function over the RDD's element type T: keyBy applies it via map, turning every element into a (key, element) tuple. The second, ascending, is optional and selects the sort direction; it defaults to true, i.e. ascending. The third, numPartitions, is also optional and sets the number of partitions of the sorted RDD; by default the partition count stays the same as before the sort, this.partitions.length.
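For intuition, the same pipeline can be spelled out by hand on the Java side. A minimal sketch, in the fragment style of the examples in this post, assuming a JavaRDD<String> named rdd whose elements look like "5_23":

```java
// Hand-rolled equivalent of rdd.sortBy(keyFunc, true, numPartitions):
// keyBy builds (key, element) pairs, sortByKey orders them by key,
// and values() drops the keys again.
JavaRDD<String> sorted = rdd
        .keyBy(new Function<String, String>() {
            @Override
            public String call(String v1) throws Exception {
                return v1.split("_")[1]; // the sort key
            }
        })
        .sortByKey(true) // ascending
        .values();
```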
Example:
```java
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
final Random random = new Random(100);
// Transform the RDD: each element becomes two parts joined by "_"
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
    @Override
    public String call(Integer v1) throws Exception {
        return v1.toString() + "_" + random.nextInt(100);
    }
});
System.out.println(javaRDD1.collect());
// Sort by the second part of each element
JavaRDD<String> resultRDD = javaRDD1.sortBy(new Function<String, Object>() {
    @Override
    public Object call(String v1) throws Exception {
        return v1.split("_")[1];
    }
}, false, 3);
System.out.println("result--------------" + resultRDD.collect());
```
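Two details worth noting in this example: the key function returns the suffix as a String, so the sort is lexicographic rather than numeric ("9" sorts after "89", for instance); parsing the suffix with Integer.parseInt would give a numeric order. Also, since the Random instance is created on the driver and captured by the closure, each task deserializes its own copy with the same internal state, so the generated suffixes can repeat across partitions.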
takeOrdered
Official documentation:
Returns the first k (smallest) elements from this RDD using the natural ordering for T while maintaining the order.
Function prototype:

```scala
def takeOrdered(num: Int): JList[T]
def takeOrdered(num: Int, comp: Comparator[T]): JList[T]
```
takeOrdered returns the first num elements of the RDD, ordered either by the default (ascending) ordering or by a user-supplied comparator.
Source code analysis:
```scala
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
      Iterator.single(queue)
    }
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1
      }.toArray.sorted(ord)
    }
  }
}
```
From the source we can see that mapPartitions performs a local sort inside each partition, with each partition returning only num elements. Note that each element of the resulting mapRDDs is a BoundedPriorityQueue; a reduce over mapRDDs then merges the queues, and the merged result is converted to an array for the final global sort.
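To make the per-partition step concrete, here is a minimal local sketch in plain Java of what the bounded queue accomplishes. This illustrates the idea only; it is not Spark's BoundedPriorityQueue. A max-heap capped at num entries retains the num smallest values seen so far, because every time the cap is exceeded the current largest entry is evicted:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSketch {
    // Keep the k smallest elements of a stream with a bounded max-heap,
    // mirroring what BoundedPriorityQueue(num)(ord.reverse) does per partition.
    static List<Integer> smallestK(Iterable<Integer> items, int k) {
        // Reversed comparator => max-heap: the head is the largest retained element.
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (Integer item : items) {
            heap.offer(item);
            if (heap.size() > k) {
                heap.poll(); // evict the current largest; only the k smallest survive
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result); // final sort, like .toArray.sorted(ord) in the source
        return result;
    }

    public static void main(String[] args) {
        System.out.println(smallestK(Arrays.asList(5, 1, 0, 4, 4, 2, 2), 2)); // [0, 1]
    }
}
```

In the real implementation each partition produces one such bounded queue, and reduce merges the queues pairwise before the final sort.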
Example:
```java
// Note: the comparator must be serializable
public static class TakeOrderedComparator implements Serializable, Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
        return -o1.compareTo(o2);
    }
}

List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeOrdered-----1-------------" + javaRDD.takeOrdered(2));
List<Integer> list = javaRDD.takeOrdered(2, new TakeOrderedComparator());
System.out.println("takeOrdered----2--------------" + list);
```
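Because TakeOrderedComparator negates the natural comparison, the second call treats larger numbers as "smaller": with the data above, the first call prints [0, 1] (the two smallest values), while the second prints [5, 4] (the two largest).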
takeSample
Official documentation:
Return a fixed-size sampled subset of this RDD in an array
Function prototype:

```scala
def takeSample(withReplacement: Boolean, num: Int): JList[T]
def takeSample(withReplacement: Boolean, num: Int, seed: Long): JList[T]
```
takeSample returns an array made up of num elements sampled at random from the dataset.
Source code analysis:
```scala
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = {
  val numStDev = 10.0
  if (num < 0) {
    throw new IllegalArgumentException("Negative number of elements requested")
  } else if (num == 0) {
    return new Array[T](0)
  }
  val initialCount = this.count()
  if (initialCount == 0) {
    return new Array[T](0)
  }
  val maxSampleSize = Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt
  if (num > maxSampleSize) {
    throw new IllegalArgumentException("Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")
  }
  val rand = new Random(seed)
  if (!withReplacement && num >= initialCount) {
    return Utils.randomizeInPlace(this.collect(), rand)
  }
  val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
    withReplacement)
  var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  // If the first sample didn't turn out large enough, keep trying to take samples;
  // this shouldn't happen often because we use a big multiplier for the initial size
  var numIters = 0
  while (samples.length < num) {
    logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
    samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
    numIters += 1
  }
  Utils.randomizeInPlace(samples, rand).take(num)
}
```
From the source we can see that takeSample is similar to the sample function. It takes three arguments. The first, withReplacement, says whether sampling is done with replacement: true means with replacement, false means without. The second, num, is the number of sampled elements to return; this is the key difference between takeSample and sample. The third, seed, specifies the seed for the random number generator. Internally, takeSample first computes fraction, the sampling ratio, then calls sample with it, collect()s the sampled data, and finally calls take to return num elements. Note that if the requested sample size is at least the number of elements in the RDD and sampling is without replacement, all of the RDD's elements are returned, in random order.
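The sampling ratio comes from Spark's internal SamplingUtils, but the surrounding control flow is easy to mimic locally. The following is a simplified, hypothetical Java sketch of the without-replacement path; the 2.0 oversampling multiplier below is a stand-in for Spark's statistically derived bound, not the real formula:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TakeSampleSketch {
    // Simplified illustration of takeSample's control flow, not Spark's real code:
    // oversample by a fraction, retry if the draw came up short, then shuffle and trim.
    static List<Integer> takeSample(List<Integer> data, int num, long seed) {
        Random rand = new Random(seed);
        if (num >= data.size()) { // without replacement: return everything, shuffled
            List<Integer> all = new ArrayList<>(data);
            Collections.shuffle(all, rand);
            return all;
        }
        // Stand-in for SamplingUtils.computeFractionForSampleSize: oversample
        // so that a short draw (and thus a retry) is unlikely.
        double fraction = Math.min(1.0, 2.0 * num / (double) data.size());
        List<Integer> samples = bernoulliSample(data, fraction, rand.nextInt());
        while (samples.size() < num) { // re-sample on an insufficient draw
            samples = bernoulliSample(data, fraction, rand.nextInt());
        }
        Collections.shuffle(samples, rand);
        return samples.subList(0, num);
    }

    // Local stand-in for RDD.sample without replacement: keep each element
    // independently with the given probability.
    static List<Integer> bernoulliSample(List<Integer> data, double fraction, int seed) {
        Random rand = new Random(seed);
        List<Integer> out = new ArrayList<>();
        for (Integer x : data) {
            if (rand.nextDouble() < fraction) {
                out.add(x);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(takeSample(Arrays.asList(5, 1, 0, 4, 4, 2, 2), 2, 100L));
    }
}
```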
Example:
```java
List<Integer> data = Arrays.asList(5, 1, 0, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
System.out.println("takeSample-----1-------------" + javaRDD.takeSample(true, 2));
System.out.println("takeSample-----2-------------" + javaRDD.takeSample(true, 2, 100));
// Returns 20 elements
System.out.println("takeSample-----3-------------" + javaRDD.takeSample(true, 20, 100));
// Returns 7 elements
System.out.println("takeSample-----4-------------" + javaRDD.takeSample(false, 20, 100));
```
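The last two calls show what withReplacement changes: sampling 20 elements with replacement from a 7-element RDD succeeds because duplicates may be drawn, while sampling 20 without replacement falls into the num >= initialCount branch of the source above and simply returns all 7 elements in random order.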
Author: 小飛_俠_kobe
Source: ITPUB blog, http://blog.itpub.net/3137/viewspace-2818912/. Please credit the source when reposting.