Hive Data Format Conversion
Apache Hive supports many file formats for storing table data; for more information, see the links below. Let's test a few of them and try some conversions between them.

Again, let's assume our movies.txt file is in the /tests/dir HDFS directory.
```
[root@sandbox ~]# cat movies.txt
1,scifi,Matrix,USA
2,scifi,Matrix-Reloaded,USA
3,comedy,TheBigBangTheory,USA
4,comedy,MrBean,GBR
5,comedy,CrocodileDundee,AUS
6,comedy,Pelisky,CZE
7,crime,CSI-NewYork,USA
```
1. TextFile CSV -> Hive ORC
```scala
// CSV Table
hiveContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS CSVMovies (id String, genre String, name String, country String) " +
  " row format delimited " +
  " fields terminated by ','" +
  " STORED AS TEXTFILE " +
  " location '/tests/dir'");

// ORC Table
hiveContext.sql("CREATE TABLE IF NOT EXISTS ORCExpMovies(id String, genre String, name String, country String) " +
  " ROW FORMAT DELIMITED " +
  " FIELDS TERMINATED BY ',' " +
  " STORED AS ORC");

// CSV -> ORC (most efficient way!)
hiveContext.sql("INSERT OVERWRITE TABLE ORCExpMovies SELECT * from CSVMovies");
hiveContext.sql("SELECT * FROM CSVMovies").show();
```
Zeppelin output:
```
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@d8123e4
res3: org.apache.spark.sql.DataFrame = [result: string]
res4: org.apache.spark.sql.DataFrame = [result: string]
res5: org.apache.spark.sql.DataFrame = []
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  4|comedy|          MrBean|    GBR|
|  5|comedy| CrocodileDundee|    AUS|
|  6|comedy|         Pelisky|    CZE|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
```
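The hand-concatenated SQL strings above are easy to break (a missing space between fragments silently produces invalid SQL). As a minimal sketch, the DDL can be assembled by a small plain-Scala helper instead; `csvTableDdl` is a hypothetical name introduced here, and only the final string would be passed to `hiveContext.sql(...)` as in the text.

```scala
// Hypothetical helper: builds the CREATE EXTERNAL TABLE ... STORED AS TEXTFILE
// statement from column definitions, avoiding fragile string concatenation.
def csvTableDdl(table: String, columns: Seq[(String, String)],
                delimiter: String, location: String): String = {
  val cols = columns.map { case (n, t) => s"$n $t" }.mkString(", ")
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS $table ($cols)
     |ROW FORMAT DELIMITED
     |FIELDS TERMINATED BY '$delimiter'
     |STORED AS TEXTFILE
     |LOCATION '$location'""".stripMargin
}

// Same table as in the example above
val ddl = csvTableDdl("CSVMovies",
  Seq("id" -> "String", "genre" -> "String", "name" -> "String", "country" -> "String"),
  ",", "/tests/dir")
// hiveContext.sql(ddl)  // would register the table exactly as in the text
println(ddl)
```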
2. Transformation query -> new DataFrame -> import into a new ORC table...

Now let's add the following Scala test:
```scala
// Transformation query -> New Dataframe -> Import to new orc table...
val moviesDF = hiveContext.sql("select * from ORCExpMovies where country ='USA'");
moviesDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable");

// ORC -> ORC
hiveContext.sql(" INSERT INTO TABLE UsaMovies SELECT * FROM AuxTable ");

val parquetDF = hiveContext.sql(" SELECT * from UsaMovies ");
parquetDF.show();
```
Zeppelin output:
```
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4d83e5d4
res15: org.apache.spark.sql.DataFrame = [result: string]
res16: org.apache.spark.sql.DataFrame = [result: string]
res17: org.apache.spark.sql.DataFrame = []
res18: org.apache.spark.sql.DataFrame = []
res19: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res21: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
```
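To make the transformation step concrete, here is a plain-Scala sketch of what the `country = 'USA'` query selects, replayed against the sample rows from movies.txt (no Spark required; the `Movie` case class and `parseLine` helper are introduced here for illustration only).

```scala
// Each CSV line maps to one row of the table in the text.
case class Movie(id: String, genre: String, name: String, country: String)

def parseLine(line: String): Movie = {
  val Array(id, genre, name, country) = line.split(",")
  Movie(id, genre, name, country)
}

// The sample data from movies.txt
val lines = Seq(
  "1,scifi,Matrix,USA", "2,scifi,Matrix-Reloaded,USA",
  "3,comedy,TheBigBangTheory,USA", "4,comedy,MrBean,GBR",
  "5,comedy,CrocodileDundee,AUS", "6,comedy,Pelisky,CZE",
  "7,crime,CSI-NewYork,USA")

// Equivalent of: select * from ORCExpMovies where country = 'USA'
val usaMovies = lines.map(parseLine).filter(_.country == "USA")
// Matches the four rows in the Zeppelin output: ids 1, 2, 3, 7
```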
3. DataFrame with ORC data -> HDFS Parquet -> HDFS Parquet back to DataFrame
```scala
// ORC DF -> parquet
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule");

// parquet -> Dataframe
var parDF = hiveContext.read.format("parquet").load("/tests/parquetTabule");
parDF.show();
```
Zeppelin output:
```
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@38e3304a
res23: org.apache.spark.sql.DataFrame = [result: string]
res24: org.apache.spark.sql.DataFrame = [result: string]
res25: org.apache.spark.sql.DataFrame = []
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res29: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
parDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
```
It is interesting to look at the contents of the '/tests/parquetTabule' HDFS directory after saving the DataFrame in Parquet format:
```
[root@sandbox ~]# hdfs dfs -ls /tests/parquetTabule
Found 5 items
-rw-r--r--   1 zeppelin hdfs       0 2018-08-07 21:07 /tests/parquetTabule/_SUCCESS
-rw-r--r--   1 zeppelin hdfs     445 2018-08-07 21:07 /tests/parquetTabule/_common_metadata
-rw-r--r--   1 zeppelin hdfs    1356 2018-08-07 21:07 /tests/parquetTabule/_metadata
-rw-r--r--   1 zeppelin hdfs    1043 2018-08-07 21:07 /tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
-rw-r--r--   1 zeppelin hdfs     922 2018-08-07 21:07 /tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
```
Note the compressed data: the part files carry a .gz.parquet suffix.
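A small sketch of how to pick the actual data files out of that listing: files ending in .gz.parquet are the compressed Parquet parts, while _SUCCESS, _metadata and _common_metadata are bookkeeping files Spark writes alongside them (the `listing` value below just replays the paths from the output above).

```scala
// Paths copied from the hdfs dfs -ls output in the text
val listing = Seq(
  "/tests/parquetTabule/_SUCCESS",
  "/tests/parquetTabule/_common_metadata",
  "/tests/parquetTabule/_metadata",
  "/tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet",
  "/tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet")

// Keep only the compressed Parquet data files, skipping Spark's marker/metadata files
val dataFiles = listing.filter(_.endsWith(".gz.parquet"))
```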
The complete test code can be found here.