Some Apache Spark 2 Use Cases
1. registerTempTable vs. createOrReplaceTempView
Take the following JSON data, people.json, as an example:
{ "name": "John", "age": "28", "country": "UK" }
{ "name": "Cathy", "age": "30", "country": "AUS" }
{ "name": "Mark", "age": "50", "country": "USA" }
Input:
%spark2
val peopleJsonDF = sqlContext.read.format("json").load("/tests/people.json");
peopleJsonDF.createOrReplaceTempView("jsonPeople");
sqlContext.sql("select * from jsonPeople where age > 30").show();
Zeppelin output:
peopleJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
+---+-------+----+
|age|country|name|
+---+-------+----+
| 50|    USA|Mark|
+---+-------+----+
Result: registerTempTable is the Spark 1.x syntax and createOrReplaceTempView is the Spark 2.x syntax. Both do the same thing.
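In Spark 2.x the old call is kept only as a deprecated alias, so the two lines below are interchangeable. A minimal sketch, assuming a Spark 2.x shell where a SparkSession is available as `spark`:

```scala
// Minimal sketch, assuming Spark 2.x with a SparkSession bound to `spark`.
val df = spark.read.json("/tests/people.json")

// Spark 1.x style: deprecated in 2.x, still works as an alias.
df.registerTempTable("jsonPeople")

// Spark 2.x style: preferred; replaces the view if it already exists.
df.createOrReplaceTempView("jsonPeople")

spark.sql("select * from jsonPeople where age > 30").show()
```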
2. spark-csv in Apache Spark 2
Task: read data from a JSON file, sort it, and save the result as a CSV file. Apache Spark 2 makes this very convenient!
%spark2
val peopleJsonDF = sqlContext.read.option("multiline", true).format("json").load("/tests/people.json");
peopleJsonDF.createOrReplaceTempView("jsonPeople");
val orderedJsonDF = sqlContext.sql(" select * from jsonPeople ORDER BY age DESC ");
orderedJsonDF.show();
orderedJsonDF.write.mode("overwrite").option("sep", "|").option("header", true).format("csv").save("/tests/csvresult");
Zeppelin output:
peopleJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
orderedJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
| 28|     UK| John|
+---+-------+-----+
HDFS output:
[root@sandbox ~]# hdfs dfs -ls /tests/csvresult
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-09-04 17:14 /tests/csvresult/_SUCCESS
-rw-r--r--   1 zeppelin hdfs         29 2018-09-04 17:14 /tests/csvresult/part-00000-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs         30 2018-09-04 17:14 /tests/csvresult/part-00001-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs         28 2018-09-04 17:14 /tests/csvresult/part-00002-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs          0 2018-09-04 17:14 /tests/csvresult/part-00003-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
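Spark writes one part file per partition, which is why the listing above shows several part-* files. If a single CSV file is wanted (for example, for download), the DataFrame can be collapsed to one partition before writing. A sketch, assuming the same `orderedJsonDF` as above and a hypothetical output path `/tests/csvresult-single`:

```scala
// Sketch: coalesce(1) funnels all data through a single task, producing one
// part file. Only sensible for small results, since it removes parallelism.
orderedJsonDF
  .coalesce(1)
  .write.mode("overwrite")
  .option("sep", "|")
  .option("header", true)
  .format("csv")
  .save("/tests/csvresult-single") // hypothetical path, for illustration
```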
3. Inferring a schema from CSV, saving the result as ORC, and reading it back through Hive
Let's use the previous result, '/tests/csvresult', as the input for this task.
%spark2
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types._;

var hiveContext = new HiveContext(sc);
var csvDF = sqlContext.read
  .option("header", true)
  .option("sep", "|")
  .option("inferSchema", true)
  .format("csv").load("/tests/csvresult");
csvDF.createOrReplaceTempView("csvTable");
var sortedDF = sqlContext.sql("select age,country,name from csvTable order by age desc");
sortedDF.write.mode("overwrite").format("orc").save("/tests/orcresult");

hiveContext.sql(" DROP TABLE IF EXISTS people ");
hiveContext.sql("CREATE EXTERNAL TABLE people (age INT, country String, name String) " +
  " STORED AS ORC " +
  " LOCATION '/tests/orcresult'");
hiveContext.sql("select * from people").show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4fc938b8
csvDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
sortedDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
res135: org.apache.spark.sql.DataFrame = []
res136: org.apache.spark.sql.DataFrame = []
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
| 28|     UK| John|
+---+-------+-----+
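The ORC result can also be read back directly with Spark, without defining a Hive table at all. A minimal sketch, assuming the `/tests/orcresult` directory written above:

```scala
// Sketch: reading the ORC files back directly; Spark recovers the schema
// from the ORC metadata, so no CREATE TABLE statement is needed.
val orcDF = sqlContext.read.format("orc").load("/tests/orcresult")
orcDF.show()
```

The Hive external table is still useful when other tools (Hive CLI, Beeline, BI clients) need SQL access to the same files.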
4. Dropping malformed records
Suppose we have a file malformed.csv:
50|USA|Mark
30|AUS|Cathy
28|UK
and the task is to drop the malformed records and build a new DataFrame from the rest:
%spark2
import org.apache.spark.sql.types._;

val schema = StructType(Array(
  StructField("age", IntegerType, false),
  StructField("country", StringType, false),
  StructField("name", StringType, false)
));
val malformedDF = sqlContext.read.format("csv")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .option("header", false)
  .option("sep", "|")
  .load("/tests/malformed.csv");
malformedDF.show();
Zeppelin output:
schema: org.apache.spark.sql.types.StructType = StructType(StructField(age,IntegerType,false), StructField(country,StringType,false), StructField(name,StringType,false))
malformedDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
+---+-------+-----+
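DROPMALFORMED silently discards bad rows. When you want to inspect them instead, the CSV reader's PERMISSIVE mode can capture the raw text of each malformed line in a dedicated column. A sketch (the column name `_corrupt_record` is configurable; exact behavior for rows with too few tokens varies across Spark 2.x versions, so treat this as a pattern rather than a guaranteed result):

```scala
// Sketch: keep malformed rows for inspection instead of dropping them.
// The corrupt-record column must be declared in the schema as a nullable
// StringType field matching the columnNameOfCorruptRecord option.
import org.apache.spark.sql.types._

val schemaWithCorrupt = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("country", StringType, true),
  StructField("name", StringType, true),
  StructField("_corrupt_record", StringType, true)
))
val inspectDF = sqlContext.read.format("csv")
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("sep", "|")
  .load("/tests/malformed.csv")

// Rows where parsing failed carry the original line in _corrupt_record.
inspectDF.filter("_corrupt_record is not null").show(false)
```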