[Spark Series] Configuring and Using SparkSQL on Hive

Posted by LHBlog on 2018-02-08

Original URL: https://www.cnblogs.com/LHWorldBlog/p/8431750.html


I. Overview

Spark on Hive: Hive acts only as the storage layer, while Spark handles SQL parsing, optimization, and execution.

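II. Configuration

1. Configure Spark on Hive on the Spark client

In the conf directory of the Spark client installation (spark-1.6.0/conf), create a file named hive-site.xml:

Point it at the Hive metastore: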

<configuration>
   <property>
        <name>hive.metastore.uris</name>
        <value>thrift://node1:9083</value>
   </property>
</configuration>

2. Start the Hive metastore service

hive --service metastore
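
One quick way to confirm that the metastore started in this step is listening is to open a TCP connection to its Thrift port. The following is a minimal sketch and not part of Spark or Hive: the class name MetastorePortCheck and the method isReachable are hypothetical helpers, and the defaults node1 and 9083 are simply the values from hive.metastore.uris in the hive-site.xml above. A successful connection only shows that something is listening on that port, not that the metastore is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP port accepts connections.
class MetastorePortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, unknown host, or timeout: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults assume the thrift://node1:9083 URI from hive-site.xml.
        String host = args.length > 0 ? args[0] : "node1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9083;
        System.out.println(host + ":" + port + " reachable: " + isReachable(host, port, 2000));
    }
}
```

Run it from the Spark client machine, for example with java MetastorePortCheck node1 9083. If it reports the port as unreachable, check that the metastore process is still running before starting spark-shell.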

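3. Start the ZooKeeper cluster and the HDFS cluster.

4. Start spark-shell, count the rows of a Hive table, and compare the elapsed time with the same count query run in Hive.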

./spark-shell \
  --master spark://node1:7077,node2:7077 \
  --executor-cores 1 \
  --executor-memory 1g \
  --total-executor-cores 1

At the scala> prompt:

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
hc.sql("show databases").show
hc.sql("use default").show
hc.sql("select count(*) from jizhan").show

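In this test, the count query returned noticeably faster through Spark than through Hive.

Note:

If a query through Spark on Hive fails with an error saying the HDFS cluster path cannot be found, set the Hadoop configuration directory in conf/spark-env.sh on the client machine: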

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
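
Spark reads the HDFS settings from core-site.xml and hdfs-site.xml in that directory, so it is worth checking that both files are present where HADOOP_CONF_DIR points. The sketch below is one possible check; the class name HadoopConfDirCheck and the method missingFiles are hypothetical helpers, not part of Spark or Hadoop. It only tests that the files exist and does not validate their contents.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which Hadoop client config files are absent.
class HadoopConfDirCheck {

    // The two files Spark needs in order to resolve HDFS paths.
    static final String[] EXPECTED = {"core-site.xml", "hdfs-site.xml"};

    /** Returns the expected file names that are not regular files under confDir. */
    static List<String> missingFiles(Path confDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isRegularFile(confDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String dir = System.getenv("HADOOP_CONF_DIR");
        if (dir == null || dir.isEmpty()) {
            System.out.println("HADOOP_CONF_DIR is not set");
            return;
        }
        List<String> missing = missingFiles(Paths.get(dir));
        System.out.println(missing.isEmpty()
                ? "Both config files found in " + dir
                : "Missing in " + dir + ": " + missing);
    }
}
```

Run it in the same shell environment that launches spark-shell, so that it sees the same HADOOP_CONF_DIR value.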

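III. Loading Hive Data into a DataFrame

1. HiveContext is a subclass of SQLContext; use HiveContext when connecting to Hive.

2. There is no Hive environment on the local machine, so the job has to be submitted to the cluster. Submit command: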

./spark-submit \
  --master spark://node1:7077,node2:7077 \
  --executor-cores 1 \
  --executor-memory 2G \
  --total-executor-cores 1 \
  --class com.bjsxt.sparksql.dataframe.CreateDFFromHive \
  /root/test/HiveTest.jar

Java code:

package com.bjsxt.sparksql.dataframe;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

public class CreateDFFromHive {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("hive");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // HiveContext is a subclass of SQLContext.
        HiveContext hiveContext = new HiveContext(sc);
        hiveContext.sql("USE spark");

        // Create the student_infos table in Hive and load the tab-delimited local file.
        hiveContext.sql("DROP TABLE IF EXISTS student_infos");
        hiveContext.sql("CREATE TABLE IF NOT EXISTS student_infos (name STRING, age INT) "
                + "row format delimited fields terminated by '\t'");
        hiveContext.sql("load data local inpath '/root/test/student_infos' into table student_infos");

        // Same for student_scores.
        hiveContext.sql("DROP TABLE IF EXISTS student_scores");
        hiveContext.sql("CREATE TABLE IF NOT EXISTS student_scores (name STRING, score INT) "
                + "row format delimited fields terminated by '\t'");
        // Note the trailing space before INTO: the original concatenation omitted it.
        hiveContext.sql("LOAD DATA "
                + "LOCAL INPATH '/root/test/student_scores' "
                + "INTO TABLE student_scores");

        // Query the tables to produce a DataFrame of students scoring 80 or above.
        DataFrame goodStudentsDF = hiveContext.sql("SELECT si.name, si.age, ss.score "
                + "FROM student_infos si "
                + "JOIN student_scores ss "
                + "ON si.name=ss.name "
                + "WHERE ss.score>=80");

        hiveContext.sql("DROP TABLE IF EXISTS good_student_infos");

        // Register the DataFrame as a temporary table and query it.
        goodStudentsDF.registerTempTable("goodstudent");
        DataFrame result = hiveContext.sql("select * from goodstudent");
        result.show();

        // Save the result to the Hive table good_student_infos.
        goodStudentsDF.write().mode(SaveMode.Overwrite).saveAsTable("good_student_infos");

        // Read the saved table back and print each row on the driver.
        Row[] goodStudentRows = hiveContext.table("good_student_infos").collect();
        for (Row goodStudentRow : goodStudentRows) {
            System.out.println(goodStudentRow);
        }
        sc.stop();
    }
}
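
To make clear what the join and filter in the query compute, here is a plain Java sketch that performs the same steps on in-memory data, without Spark. Everything in it is illustrative: the class GoodStudentsSketch, its methods, and the sample rows are made up for this example and are not the contents of /root/test/student_infos or /root/test/student_scores. It also assumes each name appears at most once per file, whereas the SQL join would return one row per matching pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mirrors "JOIN ... ON si.name=ss.name WHERE ss.score>=80" in memory.
class GoodStudentsSketch {

    /** Parses lines of the form name<TAB>integer into an insertion-ordered map. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> byName = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            byName.put(fields[0], Integer.parseInt(fields[1].trim()));
        }
        return byName;
    }

    /** Inner join on name, keeping rows whose score is at least minScore. */
    static List<String> goodStudents(Map<String, Integer> ages,
                                     Map<String, Integer> scores,
                                     int minScore) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> info : ages.entrySet()) {
            Integer score = scores.get(info.getKey());
            // A name missing from scores drops out, as in an inner join.
            if (score != null && score >= minScore) {
                rows.add("[" + info.getKey() + "," + info.getValue() + "," + score + "]");
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the tab-delimited layout the tables expect.
        List<String> infos = Arrays.asList("zhangsan\t18", "lisi\t19", "wangwu\t20");
        List<String> scores = Arrays.asList("zhangsan\t90", "lisi\t75", "zhaoliu\t88");
        System.out.println(goodStudents(parse(infos), parse(scores), 80));
        // prints [[zhangsan,18,90]]
    }
}
```

With this sample data, lisi is dropped by the score filter, and wangwu and zhaoliu are dropped because each appears in only one of the two inputs.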

Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// The object name follows the app name; adjust it to match your --class setting.
object HiveSource {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("HiveSource")
    val sc = new SparkContext(conf)
    // HiveContext is a subclass of SQLContext.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("use spark")

    // Create both tables in Hive and load the tab-delimited local files.
    hiveContext.sql("drop table if exists student_infos")
    hiveContext.sql("create table if not exists student_infos (name string,age int) row format delimited fields terminated by '\t'")
    hiveContext.sql("load data local inpath '/root/test/student_infos' into table student_infos")

    hiveContext.sql("drop table if exists student_scores")
    hiveContext.sql("create table if not exists student_scores (name string,score int) row format delimited fields terminated by '\t'")
    hiveContext.sql("load data local inpath '/root/test/student_scores' into table student_scores")

    // Join on name; the score filter matches the Java version (the original Scala query omitted it).
    val df = hiveContext.sql("select si.name,si.age,ss.score from student_infos si,student_scores ss where si.name = ss.name and ss.score >= 80")
    hiveContext.sql("drop table if exists good_student_infos")
    // Write the result to the Hive table good_student_infos.
    df.write.mode(SaveMode.Overwrite).saveAsTable("good_student_infos")

    sc.stop()
  }
}

Result:

The rows are ordered within each group, but the groups themselves are not ordered relative to one another.
