Useful Features of Apache Spark and Hive

Posted by banq on 2019-01-08

Let's try out a few handy methods and features of Spark and Apache Hive.

1. Spark and the countByValue function
Consider the following RDD of values:

var rddVal = sc.parallelize(Array(1,2,2,3,4,4,4,5,5,5,6));

Our task is to create a new RDD whose keys are the distinct values in rddVal and whose values are the number of times each item occurs in rddVal.

countByValue is a great tool for this:

%spark

var rddVal = sc.parallelize(Array(1,2,2,3,4,4,4,5,5,5,6));

val countedRDD = sc.parallelize(rddVal.countByValue().toSeq);

countedRDD.collect();


Zeppelin output:

rddVal: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:29
countedRDD: org.apache.spark.rdd.RDD[(Int, Long)] = ParallelCollectionRDD[7] at parallelize at <console>:31
res2: Array[(Int, Long)] = Array((5,3), (1,1), (6,1), (2,2), (3,1), (4,3))
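
Note that countByValue() is an action that returns a Map to the driver, which is why it is wrapped in sc.parallelize above. If the number of distinct values is very large, a minimal alternative sketch that keeps the counting distributed (same result, produced directly as an RDD) could look like this:

%spark

// assumes the same rddVal as above; map each value to (value, 1) and sum the ones per key
val countedRDD2 = rddVal.map(x => (x, 1L)).reduceByKey(_ + _);

countedRDD2.collect();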



2. Spark and the countByKey function
Sometimes we have a (key, value) RDD and want to count how many times each key occurs. The countByKey action is a great tool for this!

%spark
var rddKeyValues = sc.parallelize(Array(("A", 99), ("A",88), ("B",22), ("C",33)));

val countedKeys = sc.parallelize(rddKeyValues.countByKey().toSeq);

countedKeys.collect();

Zeppelin output:

rddKeyValues: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:29
countedKeys: org.apache.spark.rdd.RDD[(String, Long)] = ParallelCollectionRDD[13] at parallelize at <console>:31
res4: Array[(String, Long)] = Array((B,1), (A,2), (C,1))
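
Like countByValue, countByKey() is an action that collects a Map to the driver. A minimal sketch of a distributed alternative, assuming the same rddKeyValues as above:

%spark

// map every value to 1 and sum per key; the counts stay in an RDD
val countedKeys2 = rddKeyValues.mapValues(_ => 1L).reduceByKey(_ + _);

countedKeys2.collect();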


3. Spark and the keyBy function
If you have an RDD of values and want to apply a function to every element so that the function's result becomes the key of the corresponding item in a new RDD, keyBy is your friend.

%spark

def multiply(num: Int):Int={
    return num*num;
}

val inputRDD = sc.parallelize(Array(1,2,3,4,5,6));

val resRDD = inputRDD.keyBy(multiply);

resRDD.collect();


Zeppelin output:

multiply: (num: Int)Int
inputRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:29
resRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[15] at keyBy at <console>:33
res5: Array[(Int, Int)] = Array((1,1), (4,2), (9,3), (16,4), (25,5), (36,6))
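
Under the hood, keyBy(f) is just a shorthand for mapping each element x to the pair (f(x), x). A minimal equivalent sketch, assuming the same inputRDD and multiply as above:

%spark

// equivalent to inputRDD.keyBy(multiply): the function result becomes the key,
// the original element is kept as the value
val resRDD2 = inputRDD.map(x => (multiply(x), x));

resRDD2.collect();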


4. Apache Hive and switching to another database

%spark
import org.apache.spark.sql.hive.HiveContext;

val hc = new HiveContext(sc);

hc.sql("USE xademo");
hc.sql("SHOW TABLES").show();


Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@235d1d35
res11: org.apache.spark.sql.DataFrame = [result: string]
+-------------------+-----------+
|          tableName|isTemporary|
+-------------------+-----------+
|call_detail_records|      false|
|   customer_details|      false|
|             genres|      false|
|            justone|      false|
|            mytable|      false|
|      mytablexademo|      false|
|   recharge_details|      false|
|          workersxa|      false|
+-------------------+-----------+
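
The example above uses the Spark 1.x HiveContext API. In Spark 2.x and later HiveContext is deprecated, and the same thing is done through a Hive-enabled SparkSession; a minimal sketch (in Zeppelin a session named spark is usually already provided):

%spark
import org.apache.spark.sql.SparkSession

// Hive-enabled session replacing HiveContext
val spark = SparkSession.builder().enableHiveSupport().getOrCreate();

spark.sql("USE xademo");
spark.sql("SHOW TABLES").show();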


5. DataFrame partitioning
Sometimes you are asked to store a DataFrame into HDFS in a specific format and you are forced to use dynamic partitioning. Let's show how to do that.
Input file workers.txt:

[root@sandbox ~]# cat workers.txt 
1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Garry,man,GBR
7,Adam,man,GER
8,John,man,GBR
9,Jerremy,man,AUS
10,Angela,female,AUS
11,Ivanka,female,USA
12,Melania,female,USA


Spark code:

%spark

// Dynamic partitioning when saving from Dataframe to HDFS

case class worker(id: Int, name: String, sex: String, country: String);

val fileRDD = sc.textFile("/tests/workers.txt");

val workerDF = fileRDD.map(line=>new worker(line.split(",")(0).toInt, 
                                        line.split(",")(1), 
                                        line.split(",")(2), 
                                        line.split(",")(3))).toDF();

// save dataframe also into Hive for further use                                        
workerDF.saveAsTable("tableWorkers");

workerDF.write
        .mode("overwrite")
        .partitionBy("country")
        .json("/tests/partition/result");


Zeppelin output:

defined class worker
fileRDD: org.apache.spark.rdd.RDD[String] = /tests/workers.txt MapPartitionsRDD[289] at textFile at <console>:114
workerDF: org.apache.spark.sql.DataFrame = [id: int, name: string, sex: string, country: string]
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
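
The deprecation warning in the output refers to DataFrame.saveAsTable, which was deprecated in favor of the DataFrameWriter API. A minimal sketch of the non-deprecated equivalent:

%spark

// non-deprecated equivalent of workerDF.saveAsTable("tableWorkers")
workerDF.write.mode("overwrite").saveAsTable("tableWorkers");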


More interesting is what ended up in HDFS:

[root@sandbox ~]# hdfs dfs -ls /tests/partition/result
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/_SUCCESS
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=AUS
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=GBR
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=GER
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=USA
[root@sandbox ~]#


Spark created one folder per partition, each containing the data grouped for that partition. Note that the partition column country is not written inside the files themselves; it is encoded in the directory name (for example):

[root@sandbox ~]# hdfs dfs -cat /tests/partition/result/country=USA/part-r-00000-9adc651a-1260-466d-ba37-720a0395d450
{"id":1,"name":"Jerry","sex":"man"}
{"id":4,"name":"Rut","sex":"female"}


6. Reading partitioned HDFS data back into a DataFrame

%spark
val backDFJson = sqlContext.read.json("/tests/partition/result");

backDFJson.show();


Zeppelin output:

backDFJson: org.apache.spark.sql.DataFrame = [id: bigint, name: string, sex: string, country: string]
+---+-------+------+-------+
| id|   name|   sex|country|
+---+-------+------+-------+
|  1|  Jerry|   man|    USA|
|  4|    Rut|female|    USA|
| 11| Ivanka|female|    USA|
| 12|Melania|female|    USA|
|  5| Roasie|female|    AUS|
|  9|Jerremy|   man|    AUS|
| 10| Angela|female|    AUS|
|  7|   Adam|   man|    GER|
|  2|  Cathy|female|    GBR|
|  3| Teresa|female|    GBR|
|  6|  Garry|   man|    GBR|
|  8|   John|   man|    GBR|
+---+-------+------+-------+
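
Because the country value is encoded in the directory names, a filter on the partition column only has to scan the matching directories (partition pruning). A minimal sketch, assuming the same result path as above:

%spark

// only the country=USA directory needs to be read
val usaDF = sqlContext.read.json("/tests/partition/result").filter("country = 'USA'");

usaDF.show();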


7. Apache Hive and dynamic partitioning of ORC tables
Apache Hive supports two kinds of partitioning:
  • static partitioning
  • dynamic partitioning

For more information I can recommend further reading on the topic. The basic difference is that when you save data into a statically partitioned table, you must name the partition column with the exact value that identifies the partition. With dynamic partitioning, partitions are created if they do not exist; you do not supply any value, only the partition column.
Test task: let's create a workers ORC table dynamically partitioned by country, and save data into it from the previously created table "tableWorkers".

%spark

// dynamic partitioning on Hive Table...
import org.apache.spark.sql.hive.HiveContext;

var hc = new HiveContext(sc);

hc.sql(" DROP TABLE IF EXISTS WorkersPartitioned ");
hc.sql(" CREATE TABLE WorkersPartitioned(id INT, name String, sex String) "+
   " PARTITIONED BY (country STRING) "+
   " STORED AS ORC "
);

hc.sql(" SET set hive.exec.dynamic.partition=true ");
hc.sql(" SET hive.exec.dynamic.partition.mode=nonstric ");

hc.sql(" INSERT OVERWRITE TABLE WorkersPartitioned PARTITION(country) SELECT id, name, sex, country FROM tableWorkers ");

hc.sql(" SELECT * FROM WorkersPartitioned ").show();


Note the "PARTITION(country)" part of the code: we do not have to supply an exact country value, which is exactly what dynamic partitioning means.
Zeppelin output:

res165: org.apache.spark.sql.DataFrame = [key: string, value: string]
res166: org.apache.spark.sql.DataFrame = []
+---+-------+------+-------+
| id|   name|   sex|country|
+---+-------+------+-------+
|  5| Roasie|female|    AUS|
|  9|Jerremy|   man|    AUS|
| 10| Angela|female|    AUS|
|  2|  Cathy|female|    GBR|
|  3| Teresa|female|    GBR|
|  6|  Garry|   man|    GBR|
|  8|   John|   man|    GBR|
|  7|   Adam|   man|    GER|
|  1|  Jerry|   man|    USA|
|  4|    Rut|female|    USA|
| 11| Ivanka|female|    USA|
| 12|Melania|female|    USA|
+---+-------+------+-------+
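
For contrast, a static partition insert would name the partition value explicitly, and the SELECT would no longer provide the country column. A minimal sketch against the same tables:

%spark

// static partitioning: the partition value is fixed in the statement
hc.sql(" INSERT OVERWRITE TABLE WorkersPartitioned PARTITION(country='USA') " +
       " SELECT id, name, sex FROM tableWorkers WHERE country='USA' ");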


8. Describing the partitions of a table in Apache Hive

import org.apache.spark.sql.hive.HiveContext;

val hc = new HiveContext(sc);

hc.sql("show partitions WorkersPartitioned").show();

Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2a117cf4
+-----------+
|     result|
+-----------+
|country=AUS|
|country=GBR|
|country=GER|
|country=USA|
+-----------+
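
To inspect a single partition in more detail (location, storage format, and so on), Hive's DESCRIBE FORMATTED also accepts a partition spec; a minimal sketch (the exact output depends on the Hive version):

%spark

// detailed metadata for one partition of the table
hc.sql("DESCRIBE FORMATTED WorkersPartitioned PARTITION (country='USA')").show(100, false);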

