Bulk Loading Data into HBase

Posted by baiyunl on 2011-09-13

Original URL: https://blog.csdn.net/baiyunl/article/details/84016461

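I. Overview

HBase offers several ways to load data into a table. The most straightforward are to use TableOutputFormat from a MapReduce job or to write through the client API. Neither, however, is the most efficient approach.

This document describes how to bulk load data into HBase. A MapReduce job writes the data out as files in HBase's internal storage format, and those files are then loaded directly into a running cluster. (Note: in other words, generate HFiles and then load them into HBase.)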

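II. Steps of a Bulk Load

A bulk load consists of two steps:

1. Preparing the data with a MapReduce job

First, a MapReduce job uses HFileOutputFormat to generate HBase data files. Because this output is already in HBase's internal storage format, it can later be loaded into the cluster very cheaply.

For this to work efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, the job uses Hadoop's TotalOrderPartitioner class to partition the map output into key ranges that correspond to the regions of the table.

HFileOutputFormat also provides a convenience method, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.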
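
To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not HBase or Hadoop code: the function names and the sample start keys are made up for illustration, and it only models the idea of routing each row key to the region whose key range contains it, so that every output bucket (one future HFile) stays inside a single region.

```python
from bisect import bisect_right

def region_index(row_key, region_start_keys):
    """Return the index of the region whose key range contains row_key.

    region_start_keys is a sorted list of region start keys (bytes).
    The first region starts at the empty key b"", and each region
    covers [its start key, the next region's start key).
    """
    return bisect_right(region_start_keys, row_key) - 1

def partition_rows(rows, region_start_keys):
    """Group (row_key, value) pairs into one sorted bucket per region."""
    buckets = [[] for _ in region_start_keys]
    for row_key, value in rows:
        buckets[region_index(row_key, region_start_keys)].append((row_key, value))
    for bucket in buckets:
        bucket.sort()  # data inside an HFile is ordered by key
    return buckets

# Hypothetical table with three regions: [b"", b"g"), [b"g", b"p"), [b"p", end)
START_KEYS = [b"", b"g", b"p"]
ROWS = [(b"zebra", 1), (b"apple", 2), (b"g", 3), (b"kiwi", 4)]
BUCKETS = partition_rows(ROWS, START_KEYS)
```

Because the buckets follow the region boundaries, each one could be handed to a single region without further splitting, which is the property the real partitioner is configured to provide.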

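2. Loading the data

Once the data has been prepared with HFileOutputFormat, a command-line tool loads it into the cluster. The tool iterates over the prepared data files and determines which region each file belongs to. It then contacts the corresponding Region Server, which moves the HFile into its storage directory and makes the data available to clients.

If region boundaries change while the data is being prepared or loaded, the data files are automatically split into pieces that match the new boundaries. This splitting is inefficient, especially when other clients are writing data at the same time, so keep the delay between generating the data files and loading them into the cluster as short as possible.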
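
The effect of a boundary change can be sketched in a few lines of plain Python. This is a simplified model, not the actual load tool: the helper names are hypothetical, and a "file" is represented only by its sorted list of row keys. It shows why a file that fit one region when it was written has to be cut into pieces once that region has split.

```python
from bisect import bisect_right

def regions_spanned(first_key, last_key, region_start_keys):
    """Return the indexes of all regions touched by the range [first_key, last_key]."""
    lo = bisect_right(region_start_keys, first_key) - 1
    hi = bisect_right(region_start_keys, last_key) - 1
    return list(range(lo, hi + 1))

def split_by_region(sorted_keys, region_start_keys):
    """Cut a file's sorted keys into one piece per region (simplified model)."""
    pieces = {}
    for key in sorted_keys:
        index = bisect_right(region_start_keys, key) - 1
        pieces.setdefault(index, []).append(key)
    return pieces

FILE_KEYS = [b"a", b"c", b"h", b"k"]

# Boundaries when the file was prepared: the file fits region 0.
OLD_STARTS = [b"", b"m"]
# Boundaries at load time, after region 0 split at b"f" (hypothetical).
NEW_STARTS = [b"", b"f", b"m"]

BEFORE = regions_spanned(FILE_KEYS[0], FILE_KEYS[-1], OLD_STARTS)
AFTER = regions_spanned(FILE_KEYS[0], FILE_KEYS[-1], NEW_STARTS)
PIECES = split_by_region(FILE_KEYS, NEW_STARTS)
```

With the old boundaries the file can be moved into place as a whole. With the new ones it spans two regions and must be rewritten as two pieces, which is the extra work the text recommends avoiding.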

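3. Preparing data for a bulk load with importtsv

HBase ships with a command-line tool called importtsv. It is invoked with hadoop jar /path/to/hbase-VERSION.jar importtsv. Running it without arguments prints the following help text: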

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns option.

This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier.

The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record.

You must specify exactly one column to be the row key.

In order to prepare data for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

Other options that may be specified with -D include:

-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
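
The column mapping described in the help text can be illustrated with a short plain-Python sketch. It does not reproduce the ImportTsv implementation: the function names, the sample column names, and the treatment of a bare column family as an empty qualifier are assumptions made for illustration. It only shows how a -Dimporttsv.columns value pairs up with the fields of one TSV line.

```python
ROW_KEY = "HBASE_ROW_KEY"

def parse_columns(spec):
    """Split a comma-separated column spec and check the row-key rule."""
    columns = spec.split(",")
    if columns.count(ROW_KEY) != 1:
        raise ValueError("exactly one column must be " + ROW_KEY)
    return columns

def parse_line(line, columns):
    """Map one TSV line to (row_key, {(family, qualifier): value})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        # Corresponds to an "invalid line" in the help text above.
        raise ValueError("expected %d fields, got %d" % (len(columns), len(fields)))
    row_key = fields[columns.index(ROW_KEY)]
    cells = {}
    for name, value in zip(columns, fields):
        if name == ROW_KEY:
            continue
        # "family:qualifier" or a bare "family" (modeled as an empty qualifier).
        family, _, qualifier = name.partition(":")
        cells[(family, qualifier)] = value
    return row_key, cells

# Hypothetical spec: first field is the row key, then two columns in family "d".
COLUMNS = parse_columns("HBASE_ROW_KEY,d:name,d:age")
RECORD = parse_line("row1\tAlice\t30\n", COLUMNS)
```

A line with the wrong number of fields raises an error here; in the real tool, the skip.bad.lines option listed above controls whether such a line fails the job.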

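4. Loading the data with completebulkload

After importtsv has prepared the data files, the completebulkload tool loads them into the running cluster.

completebulkload is run with the same output path that importtsv wrote to, together with the table name. For example: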

$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable

The command runs quickly, and once it finishes the new data is visible in the cluster.
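
Putting the two tools together, the following Python sketch only assembles the two command lines as argument lists; it does not run anything. The column names and the input directory are hypothetical, the jar name is the same placeholder used above, and it assumes importtsv takes the table name and input directory as trailing arguments, as its help text implies.

```python
import shlex

def importtsv_command(jar, columns, bulk_output, table, input_dir):
    """Build the importtsv command that writes HFiles to bulk_output."""
    return [
        "hadoop", "jar", jar, "importtsv",
        "-Dimporttsv.columns=" + ",".join(columns),
        "-Dimporttsv.bulk.output=" + bulk_output,
        table, input_dir,
    ]

def completebulkload_command(jar, bulk_output, table):
    """Build the completebulkload command for the same output path and table."""
    return ["hadoop", "jar", jar, "completebulkload", bulk_output, table]

JAR = "hbase-VERSION.jar"          # placeholder, as in the text
OUTPUT = "/user/todd/myoutput"     # must be the same path in both commands
PREPARE = importtsv_command(
    JAR, ["HBASE_ROW_KEY", "d:c1"], OUTPUT, "mytable", "/user/todd/input")
LOAD = completebulkload_command(JAR, OUTPUT, "mytable")

for command in (PREPARE, LOAD):
    print(shlex.join(command))
```

Passing the same output path and table name to both builders mirrors the requirement stated above. The lists could be handed to subprocess.run on a machine where hadoop is installed.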

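5. Advanced usage

Although importtsv is useful, in many cases users will want to generate the data programmatically or import it from other formats.

To get started, read the ImportTsv.java source and the Javadoc for HFileOutputFormat.

To perform the load step programmatically, see the LoadIncrementalHFiles class.

Original link: http://blog.csdn.net/dajuezhao/article/details/6210459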
