Hive Data Format Conversion
Apache Hive supports many file formats for storing table data; see the links below for more information. Let's test a few of them and try some conversions between them.
Again, let's assume our movies.txt file sits in the /tests/dir HDFS directory.
[root@sandbox ~]# cat movies.txt
1,scifi,Matrix,USA
2,scifi,Matrix-Reloaded,USA
3,comedy,TheBigBangTheory,USA
4,comedy,MrBean,GBR
5,comedy,CrocodileDundee,AUS
6,comedy,Pelisky,CZE
7,crime,CSI-NewYork,USA
1. TextFile CSV -> Hive ORC
import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by the Zeppelin Spark interpreter
val hiveContext = new HiveContext(sc)

// CSV table (external, pointing at the directory that holds movies.txt)
hiveContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS CSVMovies (id String, genre String, name String, country String) " +
  " ROW FORMAT DELIMITED " +
  " FIELDS TERMINATED BY ',' " +
  " STORED AS TEXTFILE " +
  " LOCATION '/tests/dir'")

// ORC table
hiveContext.sql("CREATE TABLE IF NOT EXISTS ORCExpMovies (id String, genre String, name String, country String) " +
  " ROW FORMAT DELIMITED " +
  " FIELDS TERMINATED BY ',' " +
  " STORED AS ORC")

// CSV -> ORC (the most efficient way!)
hiveContext.sql("INSERT OVERWRITE TABLE ORCExpMovies SELECT * FROM CSVMovies")
hiveContext.sql("SELECT * FROM CSVMovies").show()
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@d8123e4
res3: org.apache.spark.sql.DataFrame = [result: string]
res4: org.apache.spark.sql.DataFrame = [result: string]
res5: org.apache.spark.sql.DataFrame = []
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  4|comedy|          MrBean|    GBR|
|  5|comedy| CrocodileDundee|    AUS|
|  6|comedy|         Pelisky|    CZE|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
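The final SELECT above reads back from the source CSV table; it is also worth confirming that the rows really landed in the ORC table and that Hive stored it in the ORC format. A minimal sketch, assuming the tables and hiveContext defined above:

// confirm the conversion by querying the ORC table itself
hiveContext.sql("SELECT * FROM ORCExpMovies").show()

// inspect the table's storage layout; the output should list
// OrcInputFormat/OrcOutputFormat among the storage information
hiveContext.sql("DESCRIBE FORMATTED ORCExpMovies").collect().foreach(println)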
2. Transformation query -> new DataFrame -> import into a new ORC table...
Now let's add the following Scala test:
// Transformation query -> new DataFrame -> import into a new ORC table...
val moviesDF = hiveContext.sql("SELECT * FROM ORCExpMovies WHERE country = 'USA'")
moviesDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable")

// UsaMovies must exist as an ORC table; its DDL is not shown in the
// original test, so we create it here to keep the block self-contained
hiveContext.sql("CREATE TABLE IF NOT EXISTS UsaMovies (id String, genre String, name String, country String) STORED AS ORC")

// ORC -> ORC
hiveContext.sql(" INSERT INTO TABLE UsaMovies SELECT * FROM AuxTable ")
val parquetDF = hiveContext.sql(" SELECT * FROM UsaMovies ")
parquetDF.show()
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@4d83e5d4
res15: org.apache.spark.sql.DataFrame = [result: string]
res16: org.apache.spark.sql.DataFrame = [result: string]
res17: org.apache.spark.sql.DataFrame = []
res18: org.apache.spark.sql.DataFrame = []
res19: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res21: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
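The same transformation can also be expressed with the DataFrame API instead of an embedded SQL string, which is often easier to compose programmatically. A minimal sketch over the tables created above; the AuxTable2 name is only illustrative:

import org.apache.spark.sql.functions.col

// filter ORCExpMovies with the DataFrame API rather than a WHERE clause
val usaDF = hiveContext.table("ORCExpMovies").filter(col("country") === "USA")

// persist the result as a new ORC-backed Hive table (illustrative name)
usaDF.write.mode("overwrite").format("orc").saveAsTable("AuxTable2")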
3. Dataframe with ORC data -> HDFS Parquet -> HDFS Parquet back to Dataframe
// ORC DF -> parquet
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetTabule")

// parquet -> DataFrame
val parDF = hiveContext.read.format("parquet").load("/tests/parquetTabule")
parDF.show()
Zeppelin output:
import org.apache.spark.sql.hive.HiveContext
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@38e3304a
res23: org.apache.spark.sql.DataFrame = [result: string]
res24: org.apache.spark.sql.DataFrame = [result: string]
res25: org.apache.spark.sql.DataFrame = []
res26: org.apache.spark.sql.DataFrame = []
res27: org.apache.spark.sql.DataFrame = [result: string]
moviesDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
res29: org.apache.spark.sql.DataFrame = []
parquetDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
parDF: org.apache.spark.sql.DataFrame = [id: string, genre: string, name: string, country: string]
+---+------+----------------+-------+
| id| genre|            name|country|
+---+------+----------------+-------+
|  1| scifi|          Matrix|    USA|
|  2| scifi| Matrix-Reloaded|    USA|
|  3|comedy|TheBigBangTheory|    USA|
|  7| crime|     CSI-NewYork|    USA|
+---+------+----------------+-------+
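Once the parquet files are loaded back into a DataFrame, they can also be queried with plain SQL by registering a temporary table. A short sketch; the ParquetMovies name is just an example:

// expose the parquet-backed DataFrame to SQL
parDF.registerTempTable("ParquetMovies")

// run an ordinary SQL aggregate over the parquet data
hiveContext.sql("SELECT genre, count(*) AS cnt FROM ParquetMovies GROUP BY genre").show()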
What is interesting is the content of the /tests/parquetTabule HDFS directory after the DataFrame has been saved in parquet format:
[root@sandbox ~]# hdfs dfs -ls /tests/parquetTabule
Found 5 items
-rw-r--r--   1 zeppelin hdfs          0 2018-08-07 21:07 /tests/parquetTabule/_SUCCESS
-rw-r--r--   1 zeppelin hdfs        445 2018-08-07 21:07 /tests/parquetTabule/_common_metadata
-rw-r--r--   1 zeppelin hdfs       1356 2018-08-07 21:07 /tests/parquetTabule/_metadata
-rw-r--r--   1 zeppelin hdfs       1043 2018-08-07 21:07 /tests/parquetTabule/part-r-00000-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
-rw-r--r--   1 zeppelin hdfs        922 2018-08-07 21:07 /tests/parquetTabule/part-r-00001-aac00b4b-17f3-44b1-9581-dbded6a3ad5e.gz.parquet
Note the compressed data: the part files end in .gz.parquet, i.e. gzip-compressed parquet.
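The .gz suffix appears because this Spark version's parquet writer defaults to the gzip codec; it can be changed through the spark.sql.parquet.compression.codec property. A minimal sketch, with an illustrative output path:

// switch the parquet compression codec from the default gzip to snappy
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// rewriting the same DataFrame now produces .snappy.parquet part files (path is illustrative)
parquetDF.write.mode("overwrite").format("parquet").save("/tests/parquetSnappyTabule")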
The complete test code can be found here.