Appending to ORC Files with Hive Streaming

Posted by 哥不是小蘿莉 on 2016-12-07

1. Overview

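When storing business data, the data that Hive tables keep on HDFS grows over time as the business grows, and storing it on HDFS in Text format consumes a great deal of capacity. We therefore need a way to cut the storage cost. Hive offers the ORC file format, which can greatly reduce the storage footprint. In this post, I share how to append streaming data to a Hive ORC table.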

2. Content

2.1 ORC

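First, we need to know what Hive's ORC is. Before ORC, Hive had the RC file format; ORC introduced many optimizations over RC. It provides an efficient way to store Hive data, and using ORC files improves Hive's performance when reading, writing, and processing data. Its advantages include: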

Reduces the load on the NameNode
Supports complex data types (such as list, map, and struct)
Includes indexes in the file
Block compression
...

The structure diagram (from the Apache ORC website) is as follows:

I won't list everything here. For more details, see the introduction on the official website: [link]

2.2 Usage

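Now that we know the structure of ORC files and what they are for, how do we use an ORC table? As an example, let's create a table that handles Stream records. The sample SQL is as follows: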

create table alerts ( id int , msg string )
     partitioned by (continent string, country string)
     clustered by (id) into 5 buckets
     stored as orc tblproperties("transactional"="true"); -- currently ORC is required for streaming

Note that a table used with Streaming must be bucketed and stored as ORC; it is usually partitioned as well, as in the DDL above.
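
Because the table is declared with "transactional"="true", the Hive side also has to have transactions switched on. The sketch below is a hypothetical pre-flight check, not part of the Hive API: it takes the settings as a plain map (how they are loaded, for example from hive-site.xml, is left open) and reports the ones that do not match the values the Hive transaction documentation calls for. The class and method names are my own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pre-flight helper; it is not part of the Hive API. */
class StreamingPreflight {

    /** Returns one message per setting that does not have the expected value. */
    static List<String> check(Map<String, String> conf) {
        List<String> problems = new ArrayList<>();
        expect(conf, "hive.txn.manager",
                "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager", problems);
        expect(conf, "hive.support.concurrency", "true", problems);
        expect(conf, "hive.compactor.initiator.on", "true", problems);

        // At least one compactor worker thread is needed to merge delta files.
        String raw = conf.get("hive.compactor.worker.threads");
        int workers;
        try {
            workers = (raw == null) ? 0 : Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            workers = 0;
        }
        if (workers < 1) {
            problems.add("hive.compactor.worker.threads should be greater than 0");
        }
        return problems;
    }

    private static void expect(Map<String, String> conf, String key,
                               String wanted, List<String> problems) {
        String actual = conf.get(key);
        if (actual == null || !wanted.equalsIgnoreCase(actual.trim())) {
            problems.add(key + " should be " + wanted + " but is " + actual);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("hive.support.concurrency", "true");
        // Prints the three settings that are still missing from this sample map.
        check(conf).forEach(System.out::println);
    }
}
```

The compactor settings take effect on the metastore side, so a check like this is most useful against the server configuration rather than the client's.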

Next, let's try inserting some data to simulate the Streaming flow. The code is as follows:

String dbName = "testing";
String tblName = "alerts";
ArrayList<String> partitionVals = new ArrayList<String>(2);
partitionVals.add("Asia");
partitionVals.add("India");
String serdeClass = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";
  
HiveEndPoint hiveEP = new HiveEndPoint("thrift://x.y.com:9083", dbName, tblName, partitionVals);

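If the table has multiple partition columns, their values can be placed in the partition list and passed in, as shown here. The metastore service must be running so that Hive's Thrift service is available.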
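
In the snippet above, the partition values are matched to the partition columns by position ("Asia" for continent, "India" for country), so a swapped order would target the wrong partition. A minimal sketch of one way to guard against that, assuming you know the column order from the DDL; PartitionValues and inColumnOrder are hypothetical names, not Hive classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; builds the positional list that HiveEndPoint expects. */
class PartitionValues {

    /** Orders the values to match the partition columns and fails fast on a gap. */
    static ArrayList<String> inColumnOrder(List<String> partitionColumns,
                                           Map<String, String> spec) {
        ArrayList<String> values = new ArrayList<>(partitionColumns.size());
        for (String column : partitionColumns) {
            String value = spec.get(column);
            if (value == null) {
                throw new IllegalArgumentException(
                        "Missing value for partition column: " + column);
            }
            values.add(value);
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new HashMap<>();
        spec.put("country", "India");
        spec.put("continent", "Asia");
        // Column order taken from the alerts DDL: (continent, country).
        System.out.println(inColumnOrder(Arrays.asList("continent", "country"), spec));
        // prints [Asia, India]
    }
}
```

The returned list can be passed directly as the partitionVals argument of the HiveEndPoint constructor.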

//-------   Thread 1  -------//
StreamingConnection connection = hiveEP.newConnection(true);
String[] fieldNames = new String[] { "id", "msg" }; // column names of the alerts table
DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
///// Batch 1 - First TXN
txnBatch.beginNextTransaction();
txnBatch.write("1,Hello streaming".getBytes());
txnBatch.write("2,Welcome to streaming".getBytes());
txnBatch.commit();
if(txnBatch.remainingTransactions() > 0) {
///// Batch 1 - Second TXN
txnBatch.beginNextTransaction();
txnBatch.write("3,Roshan Naik".getBytes());
txnBatch.write("4,Alan Gates".getBytes());
txnBatch.write("5,Owen O’Malley".getBytes());
txnBatch.commit();
}
txnBatch.close(); // close Batch 1 but keep the connection open for Batch 2
txnBatch = connection.fetchTransactionBatch(10, writer);
///// Batch 2 - First TXN
txnBatch.beginNextTransaction();
txnBatch.write("6,David Schorow".getBytes());
txnBatch.write("7,Sushant Sowmyan".getBytes());
txnBatch.commit();
if(txnBatch.remainingTransactions() > 0) {
///// Batch 2 - Second TXN
txnBatch.beginNextTransaction();
txnBatch.write("8,Ashutosh Chauhan".getBytes());
txnBatch.write("9,Thejas Nair".getBytes());
txnBatch.commit();
}
txnBatch.close();
connection.close();
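
The example above hard-codes a few rows per transaction. With real input you would normally decide how many rows go into each transaction and then walk through the batch. Below is a minimal sketch of just the grouping step, using only the standard library; TxnChunker and the group size of 2 are illustrative assumptions, and the Hive calls appear only in comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; groups rows so that each group becomes one transaction. */
class TxnChunker {

    /** Splits records into consecutive groups of at most recordsPerTxn items. */
    static <T> List<List<T>> chunk(List<T> records, int recordsPerTxn) {
        if (recordsPerTxn <= 0) {
            throw new IllegalArgumentException("recordsPerTxn must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < records.size(); from += recordsPerTxn) {
            int to = Math.min(from + recordsPerTxn, records.size());
            chunks.add(new ArrayList<>(records.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1,a", "2,b", "3,c", "4,d", "5,e");
        List<List<String>> txns = chunk(rows, 2);
        // With Hive Streaming, each group would map to:
        //   txnBatch.beginNextTransaction(); write each row; txnBatch.commit();
        // and a new batch would be fetched once remainingTransactions() reaches 0.
        for (int i = 0; i < txns.size(); i++) {
            System.out.println("txn " + (i + 1) + ": " + txns.get(i));
        }
    }
}
```

Smaller groups make rows visible sooner; larger groups mean fewer commits. The right size depends on the workload.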

Next, we write the Streaming data into the ORC table for storage. The result is shown in the figure below:

3. Case Study

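Now let's work through a complete case. Consider this scenario: every day a large volume of business data is reported to a designated server; a relay service splits the data by business line and forwards each part to its own log node; an ETL service then loads the data into Hive tables. Here we cover only the loading step: take the data, process it, and store it in a Hive ORC table. The implementation is as follows: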

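import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.hcatalog.streaming.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Assumed from the JSON.parseObject(...) call below: Alibaba fastjson.
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;

// SystemConfigUtils, FileUtils and CalendarUtils are the author's own utility classes (not shown).

/**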
 * @Date Nov 24, 2016
 *
 * @Author smartloli
 *
 * @Email smartdengjie@gmail.com
 *
 * @Note TODO
 */
public class IPLoginStreaming extends Thread {
    private static final Logger LOG = LoggerFactory.getLogger(IPLoginStreaming.class);
    private String path = "";

    public static void main(String[] args) throws Exception {
        String[] paths = SystemConfigUtils.getPropertyArray("hive.orc.path", ",");
        for (String str : paths) {
            IPLoginStreaming ipLogin = new IPLoginStreaming();
            ipLogin.path = str;
            ipLogin.start();
        }
    }

    @Override
    public void run() {
        List<String> list = FileUtils.read(this.path);
        long start = System.currentTimeMillis();
        try {
            write(list);
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged as well.
            LOG.error("Write PATH[" + this.path + "] ORC has error, msg is " + e.getMessage(), e);
        }
        System.out.println("Path[" + this.path + "] spent [" + (System.currentTimeMillis() - start) / 1000.0 + "s]");
    }

    public static void write(List<String> list)
            throws ConnectionError, InvalidPartition, InvalidTable, PartitionCreationFailed, ImpersonationFailed, InterruptedException, ClassNotFoundException, SerializationError, InvalidColumn, StreamingException {
        String dbName = "default";
        String tblName = "ip_login_orc";
        ArrayList<String> partitionVals = new ArrayList<String>(1);
        partitionVals.add(CalendarUtils.getDay());
        String[] fieldNames = new String[] { "_bpid", "_gid", "_plat", "_tm", "_uid", "ip", "latitude", "longitude", "reg", "tname" };

        StreamingConnection connection = null;
        TransactionBatch txnBatch = null;

        try {

            HiveEndPoint hiveEP = new HiveEndPoint("thrift://master:9083", dbName, tblName, partitionVals);
            HiveConf hiveConf = new HiveConf();
            hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_HADOOP_SUPPORTS_SUBDIRECTORIES, true);
            hiveConf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
            connection = hiveEP.newConnection(true, hiveConf);
            DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", hiveEP);
            txnBatch = connection.fetchTransactionBatch(10, writer);

            // Batch 1
            txnBatch.beginNextTransaction();
            for (String json : list) {
                String ret = "";
                JSONObject object = JSON.parseObject(json);
                for (int i = 0; i < fieldNames.length; i++) {
                    if (i == (fieldNames.length - 1)) {
                        ret += object.getString(fieldNames[i]);
                    } else {
                        ret += object.getString(fieldNames[i]) + ",";
                    }
                }
                txnBatch.write(ret.getBytes());
            }
            txnBatch.commit();

        } finally {
            if (txnBatch != null) {
                txnBatch.close();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }
}
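
The loop in write() above builds each row by string concatenation, writes the literal text "null" when a field is missing, and would shift columns if a value itself contained a comma. A minimal sketch of a stricter row builder follows, under some assumptions of mine: a plain Map stands in for the JSONObject, missing fields become empty columns, and values containing the delimiter are rejected rather than escaped. DelimitedRecord is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; turns one parsed record into a delimited row. */
class DelimitedRecord {

    /** Joins the values in fieldNames order; a missing field becomes an empty column. */
    static String join(String[] fieldNames, Map<String, String> values, char delimiter) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fieldNames.length; i++) {
            if (i > 0) {
                row.append(delimiter);
            }
            String value = values.get(fieldNames[i]);
            if (value == null) {
                continue; // leave the column empty
            }
            if (value.indexOf(delimiter) >= 0 || value.indexOf('\n') >= 0) {
                throw new IllegalArgumentException(
                        "Field '" + fieldNames[i] + "' contains the delimiter or a newline");
            }
            row.append(value);
        }
        return row.toString();
    }

    /** Encodes the row explicitly as UTF-8 instead of the platform default charset. */
    static byte[] toBytes(String row) {
        return row.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] fieldNames = { "_uid", "ip", "reg" };
        Map<String, String> record = new HashMap<>();
        record.put("_uid", "1001");
        record.put("ip", "10.0.0.1");
        System.out.println(join(fieldNames, record, ',')); // prints 1001,10.0.0.1,
    }
}
```

The byte array from toBytes is what would be handed to txnBatch.write(...). Whether an empty column or some other marker should represent a missing value depends on how the table's columns are typed.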

PS: Using multiple threads to process the data is recommended.
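
The class above starts one Thread per input path. When there are many paths, a bounded thread pool is a common alternative. A minimal sketch, assuming the per-path work is passed in as a callback; PathWorkers, the pool size, and the one-hour timeout are illustrative choices and not taken from the original code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper; runs one task per path on a fixed-size pool. */
class PathWorkers {

    /** Returns how many paths were processed without throwing. */
    static int processAll(List<String> paths, int threads, Consumer<String> task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger succeeded = new AtomicInteger();
        for (String path : paths) {
            pool.submit(() -> {
                try {
                    task.accept(path);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    System.err.println("Path[" + path + "] failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return succeeded.get();
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/tmp/a.log", "/tmp/b.log", "/tmp/c.log");
        // In the real job the task would read the file and stream it to Hive.
        int ok = processAll(paths, 2, path -> System.out.println("processing " + path));
        System.out.println(ok + " of " + paths.size() + " paths succeeded");
    }
}
```

Each task should open its own StreamingConnection and TransactionBatch, as the write() method above does, rather than sharing them across threads.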

4. Preview

The results are shown below:

Partition details

Record count in the partition

5. Summary

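When using Hive Streaming to append to ORC, the table itself has to be bucketed (and is usually partitioned), and the project's dependencies are also complex, pulling in packages from Hadoop, Hive, and other projects. I recommend building it as a Maven project and letting Maven resolve the dependencies among the JARs.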

6. Closing Remarks

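That's all for this post. If you run into any problems while studying this, you can join the group to discuss them or send me an email, and I will do my best to answer. Let's keep learning together!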
