I. Overview
Spark operators fall into two broad categories: Transformations, which are lazily evaluated, and Actions, which trigger execution immediately; each action operator produces one job.
Roughly speaking, an operator that turns one RDD into another RDD is a Transformation, while an operator that turns an RDD into something other than an RDD (a local value, output to the driver, and so on) is an Action.
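To make the distinction concrete, here is a minimal Scala sketch (the object name and the toUpperCase step are only illustrative): the textFile and map calls merely record the lineage, and nothing runs until the count action submits a job.

import org.apache.spark.{SparkConf, SparkContext}

object LazyVsAction {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("lazy-vs-action")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("./words.txt")   // transformation: nothing is read yet
    val upper = lines.map(_.toUpperCase)     // transformation: still lazy, only the lineage is recorded
    val n = upper.count()                    // action: triggers one job that actually reads and processes the file
    println(n)

    sc.stop()
  }
}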
II. Common Transformation Operators
Assume the sample dataset (words.txt) looks like this:
1. filter
Keeps the records that satisfy a predicate: elements for which the function returns true are kept, those returning false are filtered out.
Java version:
package com.spark.spark.transformations;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

/**
 * filter
 * Keeps the records that satisfy the predicate: true is kept, false is filtered out.
 */
public class Operator_filter {
    public static void main(String[] args) {
        /*
         * SparkConf mainly configures how the application runs:
         * 1. the run mode
         * 2. the application name
         * 3. the resources the application requests
         */
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("filter");
        /*
         * JavaSparkContext is the execution context of the Spark application
         * and the only gateway to the cluster.
         */
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("./words.txt");
        JavaRDD<String> resultRDD = lines.filter(new Function<String, Boolean>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Boolean call(String line) throws Exception {
                return !line.contains("hadoop"); // keep only the lines that do NOT contain "hadoop"
            }
        });
        resultRDD.foreach(new VoidFunction<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(String line) throws Exception {
                System.out.println(line);
            }
        });
        jsc.stop();
    }
}
Scala version:
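A minimal Scala sketch of the same filter logic (the object name and setup are illustrative, not the original listing):

import org.apache.spark.{SparkConf, SparkContext}

object OperatorFilter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("filter"))

    // keep only the lines that do NOT contain "hadoop", as in the Java version
    sc.textFile("./words.txt").filter(line => !line.contains("hadoop")).foreach(println)

    sc.stop()
  }
}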
Function explanation:
Takes a String in, returns a Boolean.
Result:
2. map
Applies the function passed to map to every element of the RDD and produces a new element for each one.
Characteristic: one record in, one record out.
package com.spark.spark.transformations;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

/**
 * map
 * Processes every element with the supplied function and returns a new dataset.
 * Characteristic: one record in, one record out.
 *
 * @author root
 */
public class Operator_map {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("map");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> line = jsc.textFile("./words.txt");
        JavaRDD<String> mapResult = line.map(new Function<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String call(String s) throws Exception {
                return s + "~"; // append "~" to every line
            }
        });
        mapResult.foreach(new VoidFunction<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(String t) throws Exception {
                System.out.println(t);
            }
        });
        jsc.stop();
    }
}
Function explanation:
Takes a String in, returns a String.
Result:
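For comparison, the same map logic as a minimal Scala sketch (object name illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object OperatorMap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("map"))

    // append "~" to every line: one record in, one record out
    sc.textFile("./words.txt").map(_ + "~").foreach(println)

    sc.stop()
  }
}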
3. flatMap (flattened output: one record in, zero or more records out)
First map, then flatten. Similar to map, except that each input element can be mapped to zero or more output elements.
package com.spark.spark.transformations;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;

/**
 * flatMap
 * One record in, zero or more records out.
 *
 * @author root
 */
public class Operator_flatMap {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("flatMap");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("./words.txt");
        JavaRDD<String> flatMapResult = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")); // split each line into words
            }
        });
        flatMapResult.foreach(new VoidFunction<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(String t) throws Exception {
                System.out.println(t);
            }
        });
        jsc.stop();
    }
}
Function explanation:
Takes a String in, returns a collection.
Iterable: the returned collection
iterator: traverses the elements of that collection
Result:
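The flatMap example condenses to the following Scala sketch (object name illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object OperatorFlatMap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("flatMap"))

    // split every line on spaces: one line in, zero or more words out
    sc.textFile("./words.txt").flatMap(_.split(" ")).foreach(println)

    sc.stop()
  }
}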
4. sample (random sampling)
Random sampling operator: draws a sample from the RDD, with or without replacement, according to the given fraction. Signature: sample(withReplacement, fraction, seed).
withReplacement: true samples with replacement, false without.
fraction: the expected sampling ratio (a double); it is only approximate, and the larger the dataset, the closer the actual result is to this ratio.
seed: the random seed; the same seed always draws the same sample, which is convenient for testing.
package com.spark.spark.transformations;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class Operator_sample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("sample");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("./words.txt");
        JavaPairRDD<String, Integer> flatMapToPair = lines.flatMapToPair(new PairFlatMapFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Tuple2<String, Integer>> call(String t) throws Exception {
                List<Tuple2<String, Integer>> tupleList = new ArrayList<Tuple2<String, Integer>>();
                tupleList.add(new Tuple2<String, Integer>(t, 1)); // (line, 1)
                return tupleList;
            }
        });
        // the sample file has 7 records, so roughly 1-2 of them are expected to be drawn
        JavaPairRDD<String, Integer> sampleResult = flatMapToPair.sample(true, 0.3, 4);
        sampleResult.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println(t);
            }
        });
        jsc.stop();
    }
}
Result:
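A corresponding Scala sketch of the sampling step (object name illustrative; the same withReplacement/fraction/seed arguments as the Java version):

import org.apache.spark.{SparkConf, SparkContext}

object OperatorSample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sample"))

    val pairs = sc.textFile("./words.txt").map(line => (line, 1))
    // with replacement, roughly 30% of the records, fixed seed 4 so the draw is repeatable
    pairs.sample(withReplacement = true, fraction = 0.3, seed = 4).foreach(println)

    sc.stop()
  }
}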
5. reduceByKey
Works on a K,V-format RDD: values that share the same key are combined according to the supplied function.
package com.spark.spark.transformations;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class Operator_reduceByKey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local").setAppName("reduceByKey");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("./words.txt");
        JavaRDD<String> flatMap = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" ")); // split each line into words
            }
        });
        JavaPairRDD<String, Integer> mapToPair = flatMap.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String t) throws Exception {
                return new Tuple2<String, Integer>(t, 1); // (word, 1)
            }
        });
        JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2; // sum the counts of identical words
            }
        }, 10); // 10 result partitions
        reduceByKey.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println(t);
            }
        });
        jsc.stop();
    }
}
Function explanation:
Result:
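The same word count as a minimal Scala sketch (object name illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark releases

object OperatorReduceByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("reduceByKey"))

    sc.textFile("./words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _, 10) // sum the counts per word, into 10 result partitions
      .foreach(println)

    sc.stop()
  }
}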
6. sortByKey / sortBy
Works on a K,V-format RDD and sorts by key in ascending or descending order.
sortBy is not available in the Java API used here, so the Java example below sorts with sortByKey instead.
package com.spark.spark.transformations;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class Operator_sortByKey {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("sortByKey");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("./words.txt");
        JavaRDD<String> flatMap = lines.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(String t) throws Exception {
                return Arrays.asList(t.split(" "));
            }
        });
        JavaPairRDD<String, Integer> mapToPair = flatMap.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });
        JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        // swap key and value first, sort by key, then swap back; sortByKey(false) is descending, true is ascending
        reduceByKey.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> t) throws Exception {
                return new Tuple2<Integer, String>(t._2, t._1); // (word, count) -> (count, word)
            }
        }).sortByKey(false).mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> t) throws Exception {
                return new Tuple2<String, Integer>(t._2, t._1); // (count, word) -> (word, count)
            }
        }).foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Tuple2<String, Integer> t) throws Exception {
                System.out.println(t);
            }
        });
        jsc.stop();
    }
}
Code explanation: swap key and value, sort by key, then swap back.
Result:
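In Scala, sortBy expresses the same ordering directly, without the key/value swap; a minimal sketch (object name illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark releases

object OperatorSortBy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sortBy"))

    val counts = sc.textFile("./words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // sort by the count (the second tuple field), descending, with no swapping required
    counts.sortBy(_._2, ascending = false).foreach(println)

    sc.stop()
  }
}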