Useful features of Apache Spark and Hive
Let's try out some methods and features of Spark and Apache Hive.
1. Spark and the countByValue function
Let's take the following RDD of values:
var rddVal = sc.parallelize(Array(1,2,2,3,4,4,4,5,5,5,6));
Our task is to create a new RDD where the key is each unique item value from rddVal and the value is the number of times that item occurs in rddVal.
countByValue is the right tool for this:
%spark
var rddVal = sc.parallelize(Array(1,2,2,3,4,4,4,5,5,5,6));
val countedRDD = sc.parallelize(rddVal.countByValue().toSeq);
countedRDD.collect();
Zeppelin output:
rddVal: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:29
countedRDD: org.apache.spark.rdd.RDD[(Int, Long)] = ParallelCollectionRDD[7] at parallelize at <console>:31
res2: Array[(Int, Long)] = Array((5,3), (1,1), (6,1), (2,2), (3,1), (4,3))
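For comparison, the same counts can be produced without collecting anything to the driver by pairing each item with 1 and summing per key. This is a minimal sketch assuming the rddVal defined above; countedRDD2 is just an illustrative name:

%spark
// Hedged alternative: map each value to (value, 1) and sum the ones per key.
// Unlike countByValue, the result stays distributed as an RDD[(Int, Long)].
val countedRDD2 = rddVal.map(x => (x, 1L)).reduceByKey(_ + _);
countedRDD2.collect();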
2. Spark and the countByKey function
Sometimes we have a (key, value) RDD and want to count the occurrences of all keys. The countByKey action is a great tool for that!
%spark
var rddKeyValues = sc.parallelize(Array(("A", 99), ("A",88), ("B",22), ("C",33)));
val countedKeys = sc.parallelize(rddKeyValues.countByKey().toSeq);
countedKeys.collect();
Zeppelin output:
rddKeyValues: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:29
countedKeys: org.apache.spark.rdd.RDD[(String, Long)] = ParallelCollectionRDD[13] at parallelize at <console>:31
res4: Array[(String, Long)] = Array((B,1), (A,2), (C,1))
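Note that countByKey returns its result as a local Map on the driver, so it is best suited to a modest number of distinct keys. A hedged sketch of a distributed alternative, assuming the rddKeyValues above (countedKeysRDD is an illustrative name):

%spark
// Keep the per-key counts distributed instead of collecting them to the driver.
val countedKeysRDD = rddKeyValues.mapValues(_ => 1L).reduceByKey(_ + _);
countedKeysRDD.collect();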
3. Spark and the keyBy function
If you have an RDD of values and want to apply a function to every element so that the function's result becomes the key of each item in the new RDD, keyBy is your friend.
%spark
def multiply(num: Int): Int = {
  return num * num;
}
val inputRDD = sc.parallelize(Array(1,2,3,4,5,6));
val resRDD = inputRDD.keyBy(multiply);
resRDD.collect();
Zeppelin output:
multiply: (num: Int)Int
inputRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:29
resRDD: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[15] at keyBy at <console>:33
res5: Array[(Int, Int)] = Array((1,1), (4,2), (9,3), (16,4), (25,5), (36,6))
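keyBy also accepts an anonymous function directly, so the helper method is not required. A minimal sketch using the inputRDD above (resRDD2 is an illustrative name):

%spark
// Same result as keyBy(multiply), written inline.
val resRDD2 = inputRDD.keyBy(n => n * n);
resRDD2.collect();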
4. Apache Hive and switching to another database
%spark
import org.apache.spark.sql.hive.HiveContext;
val hc = new HiveContext(sc);
hc.sql("USE xademo");
hc.sql("SHOW TABLES").show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@235d1d35
res11: org.apache.spark.sql.DataFrame = [result: string]
+-------------------+-----------+
|          tableName|isTemporary|
+-------------------+-----------+
|call_detail_records|      false|
|   customer_details|      false|
|             genres|      false|
|            justone|      false|
|            mytable|      false|
|      mytablexademo|      false|
|   recharge_details|      false|
|          workersxa|      false|
+-------------------+-----------+
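On Spark 2.x and later, HiveContext is deprecated in favour of a Hive-enabled SparkSession, and the switch looks essentially the same. A hedged sketch, assuming the spark session variable that Zeppelin provides on those versions:

%spark
// Same database switch through SparkSession (Spark 2.x+), assuming Hive support is enabled.
spark.sql("USE xademo");
spark.sql("SHOW TABLES").show();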
5. DataFrame partitioning
Sometimes you are asked to store a DataFrame into HDFS in a specific format, and you are forced to use dynamic partitioning. Let's show how to do that.
Input file workers.txt:
[root@sandbox ~]# cat workers.txt
1,Jerry,man,USA
2,Cathy,female,GBR
3,Teresa,female,GBR
4,Rut,female,USA
5,Roasie,female,AUS
6,Garry,man,GBR
7,Adam,man,GER
8,John,man,GBR
9,Jerremy,man,AUS
10,Angela,female,AUS
11,Ivanka,female,USA
12,Melania,female,USA
Spark code:
%spark
// Dynamic partitioning when saving from Dataframe to HDFS
case class worker(id: Int, name: String, sex: String, country: String);
val fileRDD = sc.textFile("/tests/workers.txt");
val workerDF = fileRDD.map(line => new worker(line.split(",")(0).toInt,
                                              line.split(",")(1),
                                              line.split(",")(2),
                                              line.split(",")(3))).toDF();
// save dataframe also into Hive for further use
workerDF.saveAsTable("tableWorkers");
workerDF.write
  .mode("overwrite")
  .partitionBy("country")
  .json("/tests/partition/result");
Zeppelin output:
defined class worker
fileRDD: org.apache.spark.rdd.RDD[String] = /tests/workers.txt MapPartitionsRDD[289] at textFile at <console>:114
workerDF: org.apache.spark.sql.DataFrame = [id: int, name: string, sex: string, country: string]
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
The more interesting result is in HDFS:
[root@sandbox ~]# hdfs dfs -ls /tests/partition/result
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/_SUCCESS
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=AUS
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=GBR
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=GER
drwxr-xr-x   - zeppelin hdfs          0 2018-08-13 23:35 /tests/partition/result/country=USA
[root@sandbox ~]#
Spark created a folder for each partition, and each folder contains the data grouped into that partition, for example:
[root@sandbox ~]# hdfs dfs -cat /tests/partition/result/country=USA/part-r-00000-9adc651a-1260-466d-ba37-720a0395d450
{"id":1,"name":"Jerry","sex":"man"}
{"id":4,"name":"Rut","sex":"female"}
6. Reading partitioned HDFS data back into a DataFrame
%spark
val backDFJson = sqlContext.read.json("/tests/partition/result");
backDFJson.show();
Zeppelin output:
backDFJson: org.apache.spark.sql.DataFrame = [id: bigint, name: string, sex: string, country: string]
+---+-------+------+-------+
| id|   name|   sex|country|
+---+-------+------+-------+
|  1|  Jerry|   man|    USA|
|  4|    Rut|female|    USA|
| 11| Ivanka|female|    USA|
| 12|Melania|female|    USA|
|  5| Roasie|female|    AUS|
|  9|Jerremy|   man|    AUS|
| 10| Angela|female|    AUS|
|  7|   Adam|   man|    GER|
|  2|  Cathy|female|    GBR|
|  3| Teresa|female|    GBR|
|  6|  Garry|   man|    GBR|
|  8|   John|   man|    GBR|
+---+-------+------+-------+
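Because country is encoded in the directory names, filtering on it lets Spark scan only the matching partition directories. A minimal sketch, assuming the same output path as above (usaDF is an illustrative name):

%spark
// Partition pruning: only the country=USA directory should be scanned.
val usaDF = sqlContext.read.json("/tests/partition/result").filter("country = 'USA'");
usaDF.show();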
7. Apache Hive and dynamic partitioning of an ORC table
Apache Hive supports two kinds of partitioning:
- static partitioning
- dynamic partitioning
For more information I can recommend the following blog. The basic difference is that when you store data into a statically partitioned table, you have to name the partition column together with the concrete value that identifies the partition. With dynamic partitioning, partitions are created automatically if they do not exist; no value is required, only the partition column.
Test task: let's create a workers ORC table dynamically partitioned by country and store into it the data from the previously created table "tableWorkers".
%spark
// dynamic partitioning on Hive Table...
import org.apache.spark.sql.hive.HiveContext;
var hc = new HiveContext(sc);
hc.sql(" DROP TABLE IF EXISTS WorkersPartitioned ");
hc.sql(" CREATE TABLE WorkersPartitioned(id INT, name String, sex String) " +
       " PARTITIONED BY (country STRING) " +
       " STORED AS ORC ");
hc.sql(" SET hive.exec.dynamic.partition=true ");
hc.sql(" SET hive.exec.dynamic.partition.mode=nonstrict ");
hc.sql(" INSERT OVERWRITE TABLE WorkersPartitioned PARTITION(country) " +
       " SELECT id, name, sex, country FROM tableWorkers ");
hc.sql(" SELECT * FROM WorkersPartitioned ").show();
Notice the "PARTITION(country)" part of the code: we do not have to supply a concrete country value, which is exactly what dynamic partitioning means.
Zeppelin output:
res165: org.apache.spark.sql.DataFrame = [key: string, value: string]
res166: org.apache.spark.sql.DataFrame = []
+---+-------+------+-------+
| id|   name|   sex|country|
+---+-------+------+-------+
|  5| Roasie|female|    AUS|
|  9|Jerremy|   man|    AUS|
| 10| Angela|female|    AUS|
|  2|  Cathy|female|    GBR|
|  3| Teresa|female|    GBR|
|  6|  Garry|   man|    GBR|
|  8|   John|   man|    GBR|
|  7|   Adam|   man|    GER|
|  1|  Jerry|   man|    USA|
|  4|    Rut|female|    USA|
| 11| Ivanka|female|    USA|
| 12|Melania|female|    USA|
+---+-------+------+-------+
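For contrast, with static partitioning each partition value would have to be spelled out in the INSERT and the partition column dropped from the SELECT. A hedged sketch for a single partition, assuming the hc HiveContext and the tables created above:

%spark
// Static partitioning: the value country='USA' is fixed in the PARTITION clause.
hc.sql(" INSERT OVERWRITE TABLE WorkersPartitioned PARTITION(country='USA') " +
       " SELECT id, name, sex FROM tableWorkers WHERE country = 'USA' ");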
8. Apache Hive and describing the partitions of a table
import org.apache.spark.sql.hive.HiveContext;
val hc = new HiveContext(sc);
hc.sql("show partitions WorkersPartitioned").show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@2a117cf4
+-----------+
|     result|
+-----------+
|country=AUS|
|country=GBR|
|country=GER|
|country=USA|
+-----------+
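Since country is a partition column, a query that filters on it only touches the matching partition. A minimal sketch, assuming the hc HiveContext from above:

%spark
// Reads only the country=GBR partition of the ORC table.
hc.sql("SELECT * FROM WorkersPartitioned WHERE country = 'GBR'").show();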