Hive Data Format Conversion

Posted by banq on 2019-01-08

Apache Hive supports many file formats for storing table data; see the Apache Hive documentation for details. Let's test a few of them and try some conversions between them.
Again, let's assume our movies.txt file is in the /tests/dir HDFS directory.

[root@sandbox ~]# cat movies.txt
1,scifi,Matrix,USA
2,scifi,Matrix-Reloaded,USA
3,comedy,TheBigBangTheory,USA
4,comedy,MrBean,GBR
5,comedy,CrocodileDundee,AUS
6,comedy,Pelisky,CZE
7,crime,CSI-NewYork,USA

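If the file is not yet in HDFS, it can be uploaded with the standard HDFS shell. A minimal sketch, using the directory name from the example above:

[root@sandbox ~]# hdfs dfs -mkdir -p /tests/dir
[root@sandbox ~]# hdfs dfs -put movies.txt /tests/dir/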

1. TextFile CSV -> Hive ORC

// External table over the CSV files in /tests/dir
hiveContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS CSVMovies (id String, genre String, name String, country String) "+
                " row format delimited "+
                " fields terminated by ','"+
                " STORED AS TEXTFILE "+
                " location '/tests/dir'"
);

// Managed table stored as ORC
hiveContext.sql("CREATE TABLE IF NOT EXISTS ORCExpMovies(id String, genre String, name String, country String) "+
                " ROW FORMAT DELIMITED "+
                " FIELDS TERMINATED BY ',' "+
                " STORED AS ORC"
);

// CSV -> ORC: a single INSERT ... SELECT statement does the conversion
hiveContext.sql("INSERT OVERWRITE TABLE ORCExpMovies SELECT * FROM CSVMovies");
hiveContext.sql("SELECT * FROM CSVMovies").show();


Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@d8123e4
res3: org.apache.spark.sql.DataFrame = [result: string]
res4: org.apache.spark.sql.DataFrame = [result: string]
res5: org.apache.spark.sql.DataFrame = []
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  4|comedy|          MrBean|    GBR|
|  5|comedy| CrocodileDundee|    AUS|
|  6|comedy|         Pelisky|    CZE|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
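The listing above queries the source CSVMovies table; to verify that the INSERT OVERWRITE actually populated the ORC table, a query against ORCExpMovies (not part of the original run) would look like this:

// Verify the converted data in the ORC table
hiveContext.sql("SELECT * FROM ORCExpMovies").show();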


2. Transformation query -> new DataFrame -> import into a new ORC table...
Now let's add the following Scala test:

// Transformation query -> new DataFrame -> import into a new ORC table...
val moviesDF = hiveContext.sql("SELECT * FROM ORCExpMovies WHERE country = 'USA'");
moviesDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable");

// The UsaMovies target table must already exist; its creation is not shown
// in the original listing, so a statement like this is assumed:
hiveContext.sql("CREATE TABLE IF NOT EXISTS UsaMovies (id String, genre String, name String, country String) STORED AS ORC");

// ORC -> ORC
hiveContext.sql("INSERT INTO TABLE UsaMovies SELECT * FROM AuxTable");
// named parquetDF because it is written out as Parquet in the next section
val parquetDF = hiveContext.sql("SELECT * FROM UsaMovies");

parquetDF.show();


Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4d83e5d4
res15: org.apache.spark.sql.DataFrame = [result: string]
res16: org.apache.spark.sql.DataFrame = [result: string]
res17: org.apache.spark.sql.DataFrame = []
res18: org.apache.spark.sql.DataFrame = []
res19: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res21: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
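As a side note, the intermediate AuxTable is not strictly necessary: the filtered DataFrame can also be written straight to an HDFS path as ORC and read back, without registering a second Hive table. A minimal sketch (the /tests/orcMovies path is hypothetical):

// Write the filtered DataFrame directly to HDFS as ORC files
moviesDF.write.mode("overwrite").format("orc").save("/tests/orcMovies");

// Read the ORC files back into a DataFrame
val orcDF = hiveContext.read.format("orc").load("/tests/orcMovies");
orcDF.show();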

3. DataFrame with ORC data -> HDFS Parquet -> back to a DataFrame

// ORC-backed DataFrame -> Parquet files on HDFS
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule");

// Parquet files -> DataFrame
val parDF = hiveContext.read.format("parquet").load("/tests/parquetTabule");
parDF.show();


Zeppelin output:

import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@38e3304a
res23: org.apache.spark.sql.DataFrame = [result: string]
res24: org.apache.spark.sql.DataFrame = [result: string]
res25: org.apache.spark.sql.DataFrame = []
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res29: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
parDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+

It is interesting to look at the contents of the /tests/parquetTabule HDFS directory after saving the DataFrame in Parquet format:

[root@sandbox ~]# hdfs dfs -ls /tests/parquetTabule
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-07 21:07 /tests/parquetTabule/_SUCCESS
-rw-r--r--   1 zeppelin hdfs        445 2018-08-07 21:07 /tests/parquetTabule/_common_metadata
-rw-r--r--   1 zeppelin hdfs       1356 2018-08-07 21:07 /tests/parquetTabule/_metadata
-rw-r--r--   1 zeppelin hdfs       1043 2018-08-07 21:07 /tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
-rw-r--r--   1 zeppelin hdfs        922 2018-08-07 21:07 /tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet


Note that the data files are gzip-compressed (the .gz.parquet suffix): this Spark version compresses Parquet output by default.
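If a different codec is preferred, it can be set on the context before writing. A minimal sketch, assuming the Spark 1.x property name, with snappy as an example:

// Switch the Parquet compression codec for subsequent writes
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy");
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule");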

The complete test code can be found here.
