[Spark Java API] Transformation (8): fullOuterJoin, leftOuterJoin, rightOuterJoin
fullOuterJoin
Official documentation:
Perform a full outer join of `this` and `other`. For each element (k, v) in `this`,
the resulting RDD will either contain all pairs (k, (Some(v), Some(w))) for w in `other`,
or the pair (k, (Some(v), None)) if no elements in `other` have key k. Similarly,
for each element (k, w) in `other`, the resulting RDD will either contain all pairs (k, (Some(v), Some(w)))
for v in `this`, or the pair (k, (None, Some(w))) if no elements in `this` have key k.
Uses the given Partitioner to partition the output RDD.
Method signatures:
def fullOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (Optional[V], Optional[W])]
def fullOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
: JavaPairRDD[K, (Optional[V], Optional[W])]
def fullOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner)
: JavaPairRDD[K, (Optional[V], Optional[W])]
Source code analysis:
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues {
    case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))
    case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))
    case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))
  }
}
**From the source we can see that fullOuterJoin() is similar to join(): it first calls cogroup() to produce a MappedValuesRDD of type <K, (Iterable[V1], Iterable[V2])>, then takes the Cartesian product of Iterable[V1] and Iterable[V2], substituting None on whichever side is empty for a given key, and flattens the result.**
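The cogroup-then-Cartesian logic above can be sketched in plain Java. This is a local simulation over `Map`s, not Spark's implementation; the class and method names are made up for illustration, and `Optional.empty()` stands in for Scala's `None`:

```java
import java.util.*;

// Local sketch of fullOuterJoin's cogroup + flatMapValues logic:
// for each key in either input, take the Cartesian product of both
// value lists, substituting an empty Optional for the missing side.
public class FullOuterJoinSketch {
    static <K, V, W> Map<K, List<Map.Entry<Optional<V>, Optional<W>>>> fullOuterJoin(
            Map<K, List<V>> left, Map<K, List<W>> right) {
        Map<K, List<Map.Entry<Optional<V>, Optional<W>>>> result = new HashMap<>();
        Set<K> keys = new HashSet<>(left.keySet());
        keys.addAll(right.keySet());                         // cogroup: union of keys
        for (K k : keys) {
            List<V> vs = left.getOrDefault(k, Collections.emptyList());
            List<W> ws = right.getOrDefault(k, Collections.emptyList());
            List<Map.Entry<Optional<V>, Optional<W>>> pairs = new ArrayList<>();
            if (ws.isEmpty()) {
                for (V v : vs) pairs.add(Map.entry(Optional.of(v), Optional.<W>empty()));
            } else if (vs.isEmpty()) {
                for (W w : ws) pairs.add(Map.entry(Optional.<V>empty(), Optional.of(w)));
            } else {
                for (V v : vs)
                    for (W w : ws)                           // Cartesian product
                        pairs.add(Map.entry(Optional.of(v), Optional.of(w)));
            }
            result.put(k, pairs);
        }
        return result;
    }

    public static void main(String[] args) {
        // key 1 only on the left, key 3 only on the right, key 2 on both
        System.out.println(fullOuterJoin(
                Map.of(1, List.of("a"), 2, List.of("b")),
                Map.of(2, List.of("x"), 3, List.of("y"))));
    }
}
```

Note that, unlike the example below (which joins an RDD with itself and so never produces an empty side), joining two maps with disjoint keys makes the `Optional.empty()` cases visible.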
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
final Random random = new Random();
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
    }
});
// full outer join
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Optional<Integer>>> fullJoinRDD = javaPairRDD.fullOuterJoin(javaPairRDD);
System.out.println(fullJoinRDD.collect());
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Optional<Integer>>> fullJoinRDD1 = javaPairRDD.fullOuterJoin(javaPairRDD, 2);
System.out.println(fullJoinRDD1.collect());
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Optional<Integer>>> fullJoinRDD2 = javaPairRDD.fullOuterJoin(javaPairRDD, new Partitioner() {
    @Override
    public int numPartitions() { return 2; }
    @Override
    public int getPartition(Object key) {
        // floorMod keeps the partition id non-negative even for negative hashCodes
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
});
System.out.println(fullJoinRDD2.collect());
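A note on the custom Partitioner above: `getPartition` must return a non-negative partition id, but Java's `%` operator yields a negative remainder for a negative `hashCode`. The snippet below shows the pitfall and a non-negative mod helper (the class name is made up; Spark's own HashPartitioner performs an equivalent normalization):

```java
// Java's % can return a negative remainder, which would be an
// invalid partition id; normalize it into [0, mod).
public class NonNegativeMod {
    static int nonNegativeMod(int x, int mod) {
        int raw = x % mod;
        return raw < 0 ? raw + mod : raw;
    }

    public static void main(String[] args) {
        System.out.println(-7 % 3);                  // -1: invalid as a partition id
        System.out.println(nonNegativeMod(-7, 3));   // 2: valid partition id
    }
}
```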
leftOuterJoin
Official documentation:
Perform a left outer join of `this` and `other`. For each element (k, v) in `this`,
the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`,
or the pair (k, (v, None)) if no elements in `other` have key k.
Uses the given Partitioner to partition the output RDD.
Method signatures:
def leftOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (V, Optional[W])]
def leftOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
: JavaPairRDD[K, (V, Optional[W])]
def leftOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, Optional[W])]
Source code analysis:
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}
**From the source we can see that leftOuterJoin() works like fullOuterJoin(): it first calls cogroup() to get a MappedValuesRDD of type <K, (Iterable[V1], Iterable[V2])>, then takes the Cartesian product of Iterable[V1] and Iterable[V2] and flattens the result; the difference is that None is substituted only on the V2 side, when a key has no matching elements in `other`.**
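The left-outer variant can be sketched the same way in plain Java (a local simulation, not Spark's implementation; names are made up for illustration). Every left value is always kept, and only the right side is wrapped in an `Optional`:

```java
import java.util.*;

// Local sketch of leftOuterJoin's cogroup logic: every left value survives;
// the right side is Optional.empty() when the key has no match in `right`.
public class LeftOuterJoinSketch {
    static <K, V, W> Map<K, List<Map.Entry<V, Optional<W>>>> leftOuterJoin(
            Map<K, List<V>> left, Map<K, List<W>> right) {
        Map<K, List<Map.Entry<V, Optional<W>>>> result = new HashMap<>();
        for (Map.Entry<K, List<V>> e : left.entrySet()) {    // only left keys survive
            List<W> ws = right.getOrDefault(e.getKey(), Collections.emptyList());
            List<Map.Entry<V, Optional<W>>> pairs = new ArrayList<>();
            for (V v : e.getValue()) {
                if (ws.isEmpty()) {
                    pairs.add(Map.entry(v, Optional.<W>empty()));
                } else {
                    for (W w : ws) pairs.add(Map.entry(v, Optional.of(w)));  // Cartesian product
                }
            }
            result.put(e.getKey(), pairs);
        }
        return result;
    }

    public static void main(String[] args) {
        // key 2 has no match on the right, so it pairs with Optional.empty()
        System.out.println(leftOuterJoin(
                Map.of(1, List.of("a"), 2, List.of("b")),
                Map.of(1, List.of("x"))));
    }
}
```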
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
final Random random = new Random();
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
    }
});
// left outer join
JavaPairRDD<Integer, Tuple2<Integer, Optional<Integer>>> leftJoinRDD = javaPairRDD.leftOuterJoin(javaPairRDD);
System.out.println(leftJoinRDD.collect());
JavaPairRDD<Integer, Tuple2<Integer, Optional<Integer>>> leftJoinRDD1 = javaPairRDD.leftOuterJoin(javaPairRDD, 2);
System.out.println(leftJoinRDD1.collect());
JavaPairRDD<Integer, Tuple2<Integer, Optional<Integer>>> leftJoinRDD2 = javaPairRDD.leftOuterJoin(javaPairRDD, new Partitioner() {
    @Override
    public int numPartitions() { return 2; }
    @Override
    public int getPartition(Object key) {
        // floorMod keeps the partition id non-negative even for negative hashCodes
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
});
System.out.println(leftJoinRDD2.collect());
rightOuterJoin
Official documentation:
Perform a right outer join of `this` and `other`. For each element (k, w) in `other`,
the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`,
or the pair (k, (None, w)) if no elements in `this` have key k.
Uses the given Partitioner to partition the output RDD.
Method signatures:
def rightOuterJoin[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (Optional[V], W)]
def rightOuterJoin[W](other: JavaPairRDD[K, W], numPartitions: Int)
: JavaPairRDD[K, (Optional[V], W)]
def rightOuterJoin[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (Optional[V], W)]
Source code analysis:
def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._1.isEmpty) {
      pair._2.iterator.map(w => (None, w))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
    }
  }
}
**From the source we can see that rightOuterJoin() mirrors leftOuterJoin(): it first calls cogroup() to get a MappedValuesRDD of type <K, (Iterable[V1], Iterable[V2])>, then takes the Cartesian product of Iterable[V1] and Iterable[V2] and flattens the result; here None is substituted only on the V1 side, when a key has no matching elements in `this`.**
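For completeness, the right-outer variant in the same plain-Java style (again a local simulation with made-up names, not Spark's code). It is the mirror image of the left-outer sketch: every right value survives, and the left side becomes `Optional.empty()` for unmatched keys:

```java
import java.util.*;

// Local sketch of rightOuterJoin's cogroup logic: every right value survives;
// the left side is Optional.empty() when the key has no match in `left`.
public class RightOuterJoinSketch {
    static <K, V, W> Map<K, List<Map.Entry<Optional<V>, W>>> rightOuterJoin(
            Map<K, List<V>> left, Map<K, List<W>> right) {
        Map<K, List<Map.Entry<Optional<V>, W>>> result = new HashMap<>();
        for (Map.Entry<K, List<W>> e : right.entrySet()) {   // only right keys survive
            List<V> vs = left.getOrDefault(e.getKey(), Collections.emptyList());
            List<Map.Entry<Optional<V>, W>> pairs = new ArrayList<>();
            for (W w : e.getValue()) {
                if (vs.isEmpty()) {
                    pairs.add(Map.entry(Optional.<V>empty(), w));
                } else {
                    for (V v : vs) pairs.add(Map.entry(Optional.of(v), w));  // Cartesian product
                }
            }
            result.put(e.getKey(), pairs);
        }
        return result;
    }

    public static void main(String[] args) {
        // key 2 has no match on the left, so it pairs with Optional.empty()
        System.out.println(rightOuterJoin(
                Map.of(1, List.of("a")),
                Map.of(1, List.of("x"), 2, List.of("y"))));
    }
}
```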
Example:
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
final Random random = new Random();
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, random.nextInt(10));
    }
});
// right outer join
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Integer>> rightJoinRDD = javaPairRDD.rightOuterJoin(javaPairRDD);
System.out.println(rightJoinRDD.collect());
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Integer>> rightJoinRDD1 = javaPairRDD.rightOuterJoin(javaPairRDD, 2);
System.out.println(rightJoinRDD1.collect());
JavaPairRDD<Integer, Tuple2<Optional<Integer>, Integer>> rightJoinRDD2 = javaPairRDD.rightOuterJoin(javaPairRDD, new Partitioner() {
    @Override
    public int numPartitions() { return 2; }
    @Override
    public int getPartition(Object key) {
        // floorMod keeps the partition id non-negative even for negative hashCodes
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
});
System.out.println(rightJoinRDD2.collect());