Some Apache Spark 2 Use Cases
1. registerTempTable vs. createOrReplaceTempView
Take the following JSON data, people.json, as an example:
{ "name": "John", "age": "28", "country": "UK" }
{ "name": "Cathy", "age": "30", "country": "AUS" }
{ "name": "Mark", "age": "50", "country": "USA" }
Input:
%spark2
val peopleJsonDF = sqlContext.read.format("json").load("/tests/people.json");
peopleJsonDF.createOrReplaceTempView("jsonPeople");
sqlContext.sql("select * from jsonPeople where age > 30").show();
Zeppelin output:
peopleJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
+---+-------+----+
|age|country|name|
+---+-------+----+
| 50|    USA|Mark|
+---+-------+----+
Result: registerTempTable is the Spark 1.x syntax and createOrReplaceTempView is the Spark 2.x syntax. Both do the same thing.
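In Spark 2.x the old call is kept only as a deprecated alias, so the two lines below are interchangeable. A minimal sketch, assuming a Spark 2.x shell where a SparkSession is available as `spark`:

```scala
// Minimal sketch, assuming Spark 2.x with a SparkSession bound to `spark`.
val df = spark.read.json("/tests/people.json")

// Spark 1.x style: deprecated in 2.x, still works as an alias.
df.registerTempTable("jsonPeople")

// Spark 2.x style: preferred; replaces the view if it already exists.
df.createOrReplaceTempView("jsonPeople")

spark.sql("select * from jsonPeople where age > 30").show()
```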
2. spark-csv in Apache Spark 2
Task: read data from a JSON file, sort it, and save the result as a CSV file. Apache Spark 2 makes this very convenient!
%spark2
val peopleJsonDF = sqlContext.read.option("multiline", true).format("json").load("/tests/people.json");
peopleJsonDF.createOrReplaceTempView("jsonPeople");
val orderedJsonDF = sqlContext.sql(" select * from jsonPeople ORDER BY age DESC ");
orderedJsonDF.show();
orderedJsonDF.write.mode("overwrite").option("sep", "|").option("header", true).format("csv").save("/tests/csvresult");
Zeppelin output:
peopleJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
orderedJsonDF: org.apache.spark.sql.DataFrame = [age: string, country: string ... 1 more field]
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
| 28|     UK| John|
+---+-------+-----+
HDFS output:
[root@sandbox ~]# hdfs dfs -ls /tests/csvresult
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-09-04 17:14 /tests/csvresult/_SUCCESS
-rw-r--r--   1 zeppelin hdfs         29 2018-09-04 17:14 /tests/csvresult/part-00000-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs         30 2018-09-04 17:14 /tests/csvresult/part-00001-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs         28 2018-09-04 17:14 /tests/csvresult/part-00002-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
-rw-r--r--   1 zeppelin hdfs          0 2018-09-04 17:14 /tests/csvresult/part-00003-4f4204f7-636b-4f7c-a8b7-a9a8a4c395cf.csv
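Spark writes one part file per partition, which is why the listing above shows several part-* files. If a single CSV file is wanted (for example, for download), the DataFrame can be collapsed to one partition before writing. A sketch, assuming the same `orderedJsonDF` as above and a hypothetical output path `/tests/csvresult-single`:

```scala
// Sketch: coalesce(1) funnels all data through a single task, producing one
// part file. Only sensible for small results, since it removes parallelism.
orderedJsonDF
  .coalesce(1)
  .write.mode("overwrite")
  .option("sep", "|")
  .option("header", true)
  .format("csv")
  .save("/tests/csvresult-single") // hypothetical path, for illustration
```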
3. Inferring a schema from CSV, saving the result as ORC, and reading it back through Hive
Let's use the previous result, '/tests/csvresult', as the input for this task.
%spark2
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types._;

var hiveContext = new HiveContext(sc);
var csvDF = sqlContext.read
  .option("header", true)
  .option("sep", "|")
  .option("inferSchema", true)
  .format("csv").load("/tests/csvresult");
csvDF.createOrReplaceTempView("csvTable");
var sortedDF = sqlContext.sql("select age,country,name from csvTable order by age desc");
sortedDF.write.mode("overwrite").format("orc").save("/tests/orcresult");

hiveContext.sql(" DROP TABLE IF EXISTS people ");
hiveContext.sql("CREATE EXTERNAL TABLE people (age INT, country String, name String) " +
  " STORED AS ORC " +
  " LOCATION '/tests/orcresult'");
hiveContext.sql("select * from people").show();
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4fc938b8
csvDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
sortedDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
res135: org.apache.spark.sql.DataFrame = []
res136: org.apache.spark.sql.DataFrame = []
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
| 28|     UK| John|
+---+-------+-----+
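The ORC result can also be read back directly with Spark, without defining a Hive table at all. A minimal sketch, assuming the `/tests/orcresult` directory written above:

```scala
// Sketch: reading the ORC files back directly; Spark recovers the schema
// from the ORC metadata, so no CREATE TABLE statement is needed.
val orcDF = sqlContext.read.format("orc").load("/tests/orcresult")
orcDF.show()
```

The Hive external table is still useful when other tools (Hive CLI, Beeline, BI clients) need SQL access to the same files.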
4. Dropping malformed records
Suppose we have a file malformed.csv:
50|USA|Mark
30|AUS|Cathy
28|UK
and the task is to drop the malformed records and build a new DataFrame from the rest:
%spark2
import org.apache.spark.sql.types._;

val schema = StructType(Array(
  StructField("age", IntegerType, false),
  StructField("country", StringType, false),
  StructField("name", StringType, false)
));
val malformedDF = sqlContext.read.format("csv")
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .option("header", false)
  .option("sep", "|")
  .load("/tests/malformed.csv");
malformedDF.show();
Zeppelin output:
schema: org.apache.spark.sql.types.StructType = StructType(StructField(age,IntegerType,false), StructField(country,StringType,false), StructField(name,StringType,false))
malformedDF: org.apache.spark.sql.DataFrame = [age: int, country: string ... 1 more field]
+---+-------+-----+
|age|country| name|
+---+-------+-----+
| 50|    USA| Mark|
| 30|    AUS|Cathy|
+---+-------+-----+
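DROPMALFORMED silently discards bad rows. When you want to inspect them instead, the CSV reader's PERMISSIVE mode can capture the raw text of each malformed line in a dedicated column. A sketch (the column name `_corrupt_record` is configurable; exact behavior for rows with too few tokens varies across Spark 2.x versions, so treat this as a pattern rather than a guaranteed result):

```scala
// Sketch: keep malformed rows for inspection instead of dropping them.
// The corrupt-record column must be declared in the schema as a nullable
// StringType field matching the columnNameOfCorruptRecord option.
import org.apache.spark.sql.types._

val schemaWithCorrupt = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("country", StringType, true),
  StructField("name", StringType, true),
  StructField("_corrupt_record", StringType, true)
))
val inspectDF = sqlContext.read.format("csv")
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("sep", "|")
  .load("/tests/malformed.csv")

// Rows where parsing failed carry the original line in _corrupt_record.
inspectDF.filter("_corrupt_record is not null").show(false)
```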