HBase and MapR-DB: Designed for Distribution, Scale, and Speed (The HBase Data Model)
Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.
The MapR Converged Data Platform supports HBase, but also supports MapR-DB, a high-performance, enterprise-grade NoSQL DBMS that includes the HBase API to run HBase applications. For this blog, I’ll specifically refer to HBase, but understand that many of the advantages of using HBase in your data architecture apply to MapR-DB. MapR built MapR-DB to take HBase applications to the next level, so if the thought of higher-powered, more reliable HBase deployments sounds appealing to you, take a look at some of the MapR-DB content here.
HBase allows you to build big data applications for scaling, but with this comes some different ways of implementing applications compared to developing with traditional relational databases. In this blog post, I will provide an overview of HBase, touch on the limitations of relational databases, and dive into the specifics of the HBase data model.
Relational Databases vs. HBase – Data Storage Model
Why do we need NoSQL/HBase? First, let’s look at the pros of relational databases before we discuss their limitations:
- Relational databases have provided a standard persistence model
- SQL has become a de facto standard for data manipulation
- Relational databases manage concurrency for transactions
- Relational databases have lots of tools
Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.
Relational Databases vs. HBase - Scaling
What changed to bring on NoSQL?
An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard an RDBMS, data is distributed on the basis of rows, with some rows residing on one machine and the other rows residing on other machines. However, it’s complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.
Limitations of a Relational Model
Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:
Relational databases vs. HBase - data storage model
HBase Designed for Distribution, Scale, and Speed
HBase was designed to scale because data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning, or sharding, the key range determines the shard, which distributes different data across multiple servers. Each server is the source for a subset of the data, and because related data sits together on one server, access stays fast as the cluster grows. HBase is an implementation of the Bigtable storage architecture, a distributed storage system that Google developed to manage structured data at very large scale.
HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, look up a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.
HBase is a column family-oriented database
HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.
HBase is a distributed database
HBase Data Model
Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.
HBase data model – row keys
Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.
Tables are split into regions = contiguous keys
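To make the key-range idea concrete, here is a minimal Python sketch. It is not how HBase itself is implemented; the region boundaries and server names are made up for illustration. Because regions hold contiguous, sorted key ranges, finding the region for a row key is a binary search over the region start keys.

```python
import bisect

# Hypothetical layout: each region starts at a key and runs up to the next start key.
REGION_START_KEYS = ["", "g", "n", "t"]        # "" marks the start of the table
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # server hosting each region


def locate_region(rowkey):
    """Return (region index, region server) for the region whose range holds rowkey."""
    # The owning region is the last one whose start key is <= rowkey.
    idx = bisect.bisect_right(REGION_START_KEYS, rowkey) - 1
    return idx, REGION_SERVERS[idx]
```

Note that one server ("rs1" here) can host more than one region, which is how capacity is spread across the cluster.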
The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.
The data is stored in HBase table cells. A cell together with its structural information is called a KeyValue. The row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value. The key consists of the row key, column family name, column name, and timestamp.
Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.
In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files.
Logical data model vs. physical data storage
As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn’t exist at a column, it’s not stored. Table cells are versioned, uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every row:family:column coordinate, there can be multiple versions of the value.
Sparse data with cell versions
Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. A delete writes a tombstone marker, which prevents the data from being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep, and this is done per column family. Older HBase releases keep up to three versions by default; since HBase 0.96 the default is one. When the maximum number of versions is exceeded, the extra records are eventually removed.
Versioned data
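The versioning rules can be sketched with a toy in-memory table. This is an illustration in Python, not the HBase client API: the class name, the logical clock, and the way a tombstone counts toward the version limit are all simplifying assumptions.

```python
import itertools

TOMBSTONE = object()  # marker written by delete()


class VersionedTable:
    """Toy model of versioned cells keyed by (row, family, qualifier)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # assumed per-table here; per column family in HBase
        self.cells = {}                   # coordinates -> [(timestamp, value)], newest first
        self._clock = itertools.count(1)  # logical clock standing in for timestamps

    def put(self, row, family, qualifier, value):
        versions = self.cells.setdefault((row, family, qualifier), [])
        versions.insert(0, (next(self._clock), value))  # every put is a new version
        del versions[self.max_versions:]                # drop versions beyond the limit

    def delete(self, row, family, qualifier):
        self.put(row, family, qualifier, TOMBSTONE)     # a delete is just a marker

    def get(self, row, family, qualifier, versions=1):
        result = []
        for ts, value in self.cells.get((row, family, qualifier), []):
            if value is TOMBSTONE:  # the tombstone hides everything older than it
                break
            result.append((ts, value))
        return result[:versions]
```

Calling get() with no version count returns only the newest value, and after a delete() the older versions are still physically present but no longer returned.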
In this blog post, you got an overview of HBase (and implicitly MapR-DB) and learned about the HBase/MapR-DB data model. Stay tuned for the next blog post, where I’ll take a deep dive into the details of the HBase architecture. In the third and final blog post in this series, we’ll take a look at schema design guidelines.
Want to learn more?
- Installing HBase on MapR
- Getting Started with HBase on MapR
- Release notes for HBase on MapR
- Apache HBase docs
- shop.oreilly.com/product/063…)
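An In-Depth Look at the HBase Architecture
HBase Architecture
Physically, an HBase system is made up of three types of server processes: region servers, the master server, and ZooKeeper.
Region servers handle the actual reads and writes of data. When accessing data, clients communicate directly with the HBase region servers.
The master server manages region assignment and DDL operations (creating and deleting tables).
ZooKeeper maintains and records the live state of the whole HBase cluster.
All HBase data is stored in HDFS, and each region server stores its data there. If a server is both a region server and an HDFS DataNode, one replica of that region server's data is stored on the local DataNode, which speeds up access.
However, when regions have just been moved to a region server, that server has no local replica of their data. Only after HBase runs a compaction is one replica moved to the local DataNode.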
HBase Region server
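HBase tables are divided by row key range into multiple regions, and a region contains all of the data in its key range. A region server manages multiple regions and serves all reads and writes for the regions it hosts. A single region server can manage up to about 1,000 regions.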
HBase Master server
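The HBase master is mainly responsible for assigning regions and handling DDL operations (such as creating and deleting tables).
Responsibilities of the HBase master:
- Coordinating the region servers
- Assigning regions to a region server during recovery or load balancing
- Managing the cluster and monitoring the state of all RegionServers
- Providing the DDL API for creating, deleting, and updating table schemas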
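ZooKeeper: The Cluster Coordinator
ZooKeeper is a distributed coordination service that stores cluster metadata. It detects and records the state of the servers in the HBase cluster, and if it finds that a server has gone down, it notifies the HBase master. A production deployment needs at least three ZooKeeper servers, because ZooKeeper's consensus protocol (ZAB, which like Paxos depends on a majority quorum) must be able to reach a majority even when one server is lost.
How ZooKeeper, the Master, and Region Servers Work Together
ZooKeeper maintains the cluster membership list: it tracks which servers are online and which are down. Region servers and the active and standby masters each connect to ZooKeeper and keep a session open. The session requires periodic heartbeats, which tell ZooKeeper that the server is still online.
ZooKeeper has the concept of an ephemeral node. A session creates an ephemeral node in ZooKeeper, and if the session ends, the ephemeral node is deleted automatically.
Every region server connects to ZooKeeper and creates an ephemeral node within its session. The HBase master watches these ephemeral nodes to discover newly joined region servers and to detect region servers that have failed.
For high availability, HBase runs more than one master, and each master also tries to register an ephemeral node in ZooKeeper. The first master to register successfully becomes the active master, and the others remain inactive.
If ZooKeeper does not receive a heartbeat from the active master within the session timeout, the session expires and the corresponding ephemeral node is deleted. The inactive masters are notified, and one of them becomes active and starts serving.
Likewise, if ZooKeeper does not receive a region server's heartbeat in time, the session expires and the ephemeral node is deleted. The HBase master learns that the region server is down and starts recovery.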
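The heartbeat and failover behavior described above can be modeled in a few lines of Python. This is a toy simulation, not the ZooKeeper API: the class, its method names, and the explicit "now" parameter are assumptions chosen to keep the example deterministic.

```python
class ToyZooKeeper:
    """Toy model of sessions, ephemeral nodes, and active-master election."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # ephemeral node name -> time of last heartbeat
        self.masters = []         # master names in registration order

    def register(self, name, now, is_master=False):
        self.last_heartbeat[name] = now  # opening a session creates the ephemeral node
        if is_master:
            self.masters.append(name)

    def heartbeat(self, name, now):
        if name in self.last_heartbeat:  # an expired session cannot be revived
            self.last_heartbeat[name] = now

    def expire_sessions(self, now):
        """Delete the ephemeral nodes of sessions that missed their heartbeat."""
        dead = [n for n, t in self.last_heartbeat.items()
                if now - t > self.session_timeout]
        for name in dead:
            del self.last_heartbeat[name]
            if name in self.masters:
                self.masters.remove(name)
        return dead

    def active_master(self):
        # The earliest registered master that is still alive is the active one.
        return self.masters[0] if self.masters else None
```

If the active master stops sending heartbeats, the next call to expire_sessions() removes its node and active_master() returns the standby, which mirrors the failover described in the text.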
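HBase First Read or Write
HBase stores the location of every region in a special table called the Meta table.
ZooKeeper stores the location of the Meta table.
This is what happens the first time a client reads from or writes to HBase:
- The client asks ZooKeeper for the location of the Meta table
- The client queries the Meta table to find the region server that hosts the row key it wants
- The client contacts that region server to read or write the row
The client caches the Meta table location and the row key location, so it does not need to contact ZooKeeper on every request.
If a region moves to another server, for example because a region server fails, the client's request fails and its cached entry is treated as stale. The client then goes back to ZooKeeper, gets the current Meta table location, and refreshes its cache.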
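The lookup-and-cache flow can be sketched as follows. This is a simplified Python model rather than the HBase client: the dictionaries standing in for ZooKeeper and the Meta table, the class name, and the zk_lookups counter are all hypothetical.

```python
import bisect


class ToyClient:
    """Toy client that resolves row keys via ZooKeeper -> Meta table -> region server."""

    def __init__(self, zookeeper, meta_tables):
        self.zookeeper = zookeeper      # dict standing in for ZooKeeper
        self.meta_tables = meta_tables  # server name -> [(region start key, region server)]
        self.cached_regions = None      # client-side cache of region locations
        self.zk_lookups = 0             # counts trips to ZooKeeper

    def _load_regions(self):
        self.zk_lookups += 1
        meta_server = self.zookeeper["meta_location"]                # step 1: ask ZooKeeper
        self.cached_regions = sorted(self.meta_tables[meta_server])  # step 2: read Meta

    def locate(self, rowkey):
        if self.cached_regions is None:  # only the first request pays for the lookups
            self._load_regions()
        start_keys = [start for start, _ in self.cached_regions]
        idx = bisect.bisect_right(start_keys, rowkey) - 1
        return self.cached_regions[idx][1]  # step 3: the server to contact

    def invalidate(self):
        """Called when a request fails because a region has moved."""
        self.cached_regions = None
```

Repeated locate() calls reuse the cache, and only invalidate() forces another trip to ZooKeeper.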
HBase Meta Table
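The Meta table keeps a list of all regions in the system
The Meta table is organized like a B-tree
The Meta table structure is as follows:
- Key: region start row key, region ID
- Values: region server
Translator's note: In Google's Bigtable paper, Bigtable uses a multi-level meta table, whereas the HBase Meta table has only two levels.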
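Region Server Components
A region server runs on an HDFS DataNode and has the following four components:
- WAL: The write-ahead log is a file on HDFS. If a region server crashes, the log is used to recover newly written data that had not yet been persisted to disk.
- BlockCache: The read cache. It keeps frequently read data in memory. When the BlockCache is full, the least recently used (LRU) data is evicted.
- MemStore: The write cache. It holds new data in memory before it is written to disk. There is one MemStore per column family per region. The data in a MemStore is sorted by key before it is written to disk.
- HFiles: The data files on HDFS, which store KeyValue pairs.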
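HBase Write Steps (1)
When a client issues a Put request, the first step is to write the data to the write-ahead log (WAL):
- Edits are appended to the end of the WAL
- The WAL is used to recover the data in the MemStore if the region server crashes
HBase Write Steps (2)
Once the data has been written to the WAL and stored in the MemStore, the Put is acknowledged to the client.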
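The ordering of the write steps is the important part, so here is a minimal Python sketch of it. The class and its fields are illustrative stand-ins: a list plays the role of the WAL file on HDFS and a dict plays the role of the MemStore.

```python
class ToyRegionServer:
    """Toy write path: WAL first, then MemStore, then acknowledge."""

    def __init__(self):
        self.wal = []       # append-only list standing in for the WAL file on HDFS
        self.memstore = {}  # in-memory write cache: key -> value

    def put(self, key, value):
        self.wal.append(("put", key, value))  # 1. append the edit to the end of the WAL
        self.memstore[key] = value            # 2. apply the edit to the MemStore
        return True                           # 3. acknowledge the write to the client
```

Because the edit reaches the log before the MemStore, anything that was acknowledged can be rebuilt from the log after a crash.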
HBase MemStore
The MemStore stores updates in memory as KeyValue pairs sorted by key, with one MemStore per column family. The KeyValue pairs in an HFile are stored in the same sorted order.
HBase Region Flush
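Translator's note: "Flush" in the original means moving buffered data from memory to disk. It is like flushing a toilet: think of the data as water, and the accumulated water is sent down the drain all at once, just as cached data is written to disk. A similar English word is "unplug": when you pull the plug in a full bathtub, the water starts draining away, which is another analogy for writing cached data to disk.
When a MemStore has accumulated enough data, the region server writes the MemStore contents to HDFS as a new HFile. Each column family has multiple HFiles, and the HFiles hold the actual stored data.
These HFiles are the files created each time a MemStore fills up and is flushed to HDFS. Note that this is one reason HBase limits the number of column families: each column family has its own MemStore. [Translator's note: too many MemStores use too much memory.]
When a MemStore is flushed to disk, the last written sequence number is saved as well, so HBase knows how much data has been persisted. Each HFile records this sequence number, which shows how far that HFile's data extends and where to continue from.
When a region server starts up, it reads the sequence numbers from all HFiles, and new edits continue numbering upward from the highest one.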
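A short Python sketch of flushing and sequence numbers follows. It is a toy model under stated assumptions: the flush threshold counts entries rather than bytes, and an HFile is represented as a plain dict with a "max_seq" field, neither of which reflects the real file format.

```python
class ToyStore:
    """Toy MemStore that flushes to sorted, immutable 'HFiles' with sequence numbers."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed: number of entries, not bytes
        self.memstore = {}
        self.memstore_max_seq = 0
        self.hfiles = []   # each: {"max_seq": int, "cells": [(key, value), ...]}
        self.next_seq = 1

    def put(self, key, value):
        self.memstore[key] = value
        self.memstore_max_seq = self.next_seq
        self.next_seq += 1
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # Cells are sorted by key, and the last written sequence number is saved.
        self.hfiles.append({"max_seq": self.memstore_max_seq,
                            "cells": sorted(self.memstore.items())})
        self.memstore = {}

    def restart(self):
        """On startup, resume numbering after the highest sequence number in any HFile."""
        self.memstore = {}
        self.next_seq = max((f["max_seq"] for f in self.hfiles), default=0) + 1
```

After a restart, edits numbered above the highest flushed sequence number are exactly the ones that were only in memory.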
HBase HFile
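An HFile stores sorted KeyValue pairs. When a MemStore fills up, its entire contents are written to HDFS as a new HFile. This is a sequential write of one large file, which is very fast because it avoids moving the disk drive head.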
HBase HFile Structure
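An HFile contains a multi-layered index, so a lookup does not have to scan the whole file. The index works much like a B+ tree:
- KeyValue pairs are stored in ascending order
- KeyValue pairs are stored in 64 KB blocks
- Each block has a leaf index that records where the block is
- The last key of each block (translator's note: the last key is also the largest key) goes into the intermediate index
- The root index points to the intermediate index
The trailer sits at the very end of the HFile and points to the meta blocks, which include the bloom filter and time range information. Checking the bloom filter lets HBase quickly skip HFiles that cannot contain a given row key, and the time range information lets a read skip files that fall outside the requested time range.
Translator's note: Bloom filters are widely used in search and file storage. For the algorithm, see https://china.googleblog.com/2007/07/bloom-filter_7469.html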
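The role of the block index can be shown with a small Python sketch. It uses a single index level instead of the root, intermediate, and leaf levels described above, and it sizes blocks by entry count rather than 64 KB; both are simplifications made for illustration.

```python
import bisect


def build_hfile(sorted_cells, block_size):
    """Split sorted (key, value) cells into blocks and index each block by its last key."""
    blocks = [sorted_cells[i:i + block_size]
              for i in range(0, len(sorted_cells), block_size)]
    index = [block[-1][0] for block in blocks]  # last key = largest key in the block
    return blocks, index


def lookup(blocks, index, key):
    """Find key by binary-searching the index, then scanning only one block."""
    i = bisect.bisect_left(index, key)  # first block whose last key is >= key
    if i == len(blocks):
        return None                     # key is larger than everything in the file
    for k, v in blocks[i]:
        if k == key:
            return v
    return None
```

Only one block is scanned per lookup, regardless of how many blocks the file contains.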
HFile Index
When an HFile is opened, its index is loaded into the BlockCache automatically, so later lookups need only a single disk seek.
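HBase Read Merge
The cells of a single row can live in several places: cells that have already been persisted are in HFiles, recently written cells are in the MemStore, and recently read cells are in the BlockCache. Reading a row therefore merges reads from the BlockCache, the MemStore, and the HFiles:
- First, the read looks for the cells in the BlockCache (the read cache), where recently read data is kept. If they are found, they are returned; otherwise it moves on to the next step.
- Next, the read looks in the MemStore (the write cache), which holds the most recent writes. If the cells are found, they are returned; otherwise it moves on to the next step.
- Finally, if the cells are in neither cache, HBase uses the HFile indexes and bloom filters held in the BlockCache to find the HFiles that may contain the cells, and reads the data from those HFiles.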
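Here is a hedged Python sketch of a read merge. It departs from the step list in one respect, which is my own assumption: instead of returning at the first hit, it compares timestamps across the BlockCache and the MemStore so that a newer write always wins over an older cached cell. Each store is modeled as a dict mapping a key to a (timestamp, value) pair.

```python
def read_cell(key, block_cache, memstore, hfiles):
    """Return the newest value for key across BlockCache, MemStore, and HFiles."""
    candidates = []
    if key in block_cache:  # 1. read cache
        candidates.append(block_cache[key])
    if key in memstore:     # 2. write cache
        candidates.append(memstore[key])
    if not candidates:      # 3. fall back to the HFiles
        for hfile in hfiles:
            if key in hfile:
                candidates.append(hfile[key])
        if candidates:
            # Keep the cell in the read cache for later reads.
            block_cache[key] = max(candidates, key=lambda cell: cell[0])
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])[1]  # highest timestamp wins
```

Cells found in an HFile are added to the read cache, so a second read of the same key does not touch the files again.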
HBase Minor Compaction
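HBase automatically picks some of the smaller HFiles and rewrites them into fewer, larger HFiles. This process is called minor compaction. It reduces the number of HFiles by merging the small ones.
HFiles are merged with a merge sort.
Translator's note: Fewer HFiles improve HBase read performance.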
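Because every HFile is already sorted, compaction is the merge step of a merge sort. The Python sketch below uses the standard library's heapq.merge; the cell layout of (key, timestamp, value) and the rule of ordering newer versions first within a key are assumptions made for the example.

```python
import heapq


def compact(hfiles):
    """Merge sorted HFiles (lists of (key, timestamp, value)) into one sorted HFile.

    Each input must already be sorted by key ascending, then timestamp descending.
    """
    return list(heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1])))
```

The merge reads each input sequentially and never needs to re-sort, which is why it is practical on large files.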
HBase Major Compaction
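A major compaction merge-sorts all of the HFiles in a region into one large HFile per column family, which improves read performance. However, because a major compaction rewrites all of the HFiles, it uses a lot of disk I/O and network bandwidth. This is known as write amplification.
Major compactions can be scheduled to run automatically, but because of write amplification they are usually run once a week or only in the early morning hours. In addition, if a major compaction finds that data served by the region server is not on the local HDFS DataNode, it writes one replica to the local DataNode as it merges the files.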
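Region = Contiguous Keys
A quick review of regions:
- A table is divided horizontally into one or more regions. A region contains a contiguous, sorted range of row keys, and each row key identifies one row of data.
- Each region is at most 1 GB in size (default)
- Regions are served by region servers
- A region server can serve many regions, up to about 1,000 (which may belong to the same table or to different tables)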
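Region Split
Initially, each table has a single region. When a region grows too large, it splits into two child regions. Each child holds half of the original region's data and is at first served by the same region server. The region server reports the completed split to the HBase master.
If the cluster has other region servers, the master may, for load balancing, move the new regions to them so that other region servers serve the newly split regions.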
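A region split can be sketched in Python as cutting a sorted list of rows in half. This is only an illustration: real regions split on size in bytes, whereas the max_rows threshold here is a hypothetical stand-in.

```python
def split_region(region):
    """Split a sorted list of (rowkey, value) into two daughter regions at the middle key."""
    mid = len(region) // 2
    return region[:mid], region[mid:]


def maybe_split(regions, max_rows):
    """Split every region that has grown beyond max_rows; leave the others alone."""
    result = []
    for region in regions:
        if len(region) > max_rows:
            result.extend(split_region(region))
        else:
            result.append(region)
    return result
```

Both daughters remain sorted and contiguous, so the key-range lookup described earlier keeps working after a split.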
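Load Balancing
A region initially splits in two on the same region server, but for load balancing the master may move a new region to another server. The region server that receives the region has none of its data on its local DataNode, so every operation goes to remote HDFS data. Only after that region server runs a major compaction does a replica land on the local DataNode.
Translator's note: HFiles and the WAL are both stored in HDFS. "Storing a replica locally" here refers to HDFS replica placement: if the client writing a file is itself an HDFS DataNode, HDFS places one of the three replicas on that node first. The data served by a region server still has three replicas, but one of them is on the local server, which shortens the access path.
For details, see http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html, which explains how HDFS achieves this locality. In other words, if a region server is not deployed on the same server as an HDFS DataNode, the local storage described above is not possible.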
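HDFS Data Replication (1)
All reads and writes go to the primary node. HDFS automatically replicates the WAL and HFile blocks to other nodes, and HBase relies on HDFS for data safety. When a file is written to HDFS, one copy is stored on the local node and two more copies are stored on other nodes.
HDFS Data Replication (2)
The WAL and the HFiles are stored in HDFS, which keeps them reliable, but the data in the MemStore exists only in memory. If the system crashes and restarts, how does HBase recover the data that was in the MemStore?
Translator's note: As the figure shows, MemStore data lives in memory and is not replicated.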
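HBase Crash Recovery
When a region server fails, the regions it served become unavailable. Once HBase detects the failure, it starts the recovery process to bring those regions back.
ZooKeeper determines that a region server has failed when its heartbeats stop, and it notifies the master. When the HBase master learns that the region server is down, it reassigns the failed server's regions to other region servers. HBase recovers the MemStore data from the write-ahead log (WAL).
The HBase master knows which region servers the old regions were reassigned to. It splits the crashed region server's WAL into separate files, and each region server taking part in the recovery replays its portion of the WAL to rebuild the lost MemStore.
Data Recovery
The WAL records every HBase edit, and each edit represents one Put or Delete. Edits are written in chronological order, so the oldest edits are at the start of the file and the newest are at the end.
How is data recovered that was in the MemStore but not yet written to an HFile? By replaying the WAL. The edits in the WAL are applied in order from first to last to rebuild the MemStore. Finally, the MemStore is flushed to an HFile, which completes the recovery.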
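WAL replay can be sketched as a single pass over the log. In this Python illustration, the tuple layout (sequence number, operation, key, value), the use of None as a delete marker, and the last_flushed_seq parameter are assumptions; the idea of skipping edits that already reached an HFile follows from the flush sequence numbers described earlier.

```python
def replay_wal(wal, last_flushed_seq):
    """Rebuild a MemStore by replaying WAL edits newer than the last flushed sequence number.

    wal: list of (seq, op, key, value) in the order the edits were written.
    """
    memstore = {}
    for seq, op, key, value in wal:
        if seq <= last_flushed_seq:
            continue                # this edit is already persisted in an HFile
        if op == "put":
            memstore[key] = value
        elif op == "delete":
            memstore[key] = None    # keep a marker so older persisted data stays hidden
    return memstore
```

Replaying in log order guarantees that the last edit for each key wins, just as it did before the crash.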
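Apache HBase Architecture Benefits
Strong consistency model
- When a write returns, all readers will see the same value
Automatic scaling
- Regions split automatically when the data grows too large
- HDFS is used to spread and replicate the data
Built-in recovery
- Uses the write-ahead log (WAL)
Integration with the Hadoop ecosystem
- MapReduce runs on HBase
Apache HBase Has Problems Too...
- Business continuity and reliability:
- WAL replay is slow
- Crash recovery is slow and complex
- Major compactions can cause I/O storms (write amplification)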
Guidelines for HBase Schema Design
In this blog post, I’ll discuss how HBase schema is different from traditional relational schema modeling, and I’ll also provide you with some guidelines for proper HBase schema design.
Relational vs. HBase Schemas
There is no one-to-one mapping from relational databases to HBase. In relational design, the focus and effort is around describing the entity and its interaction with other entities; the queries and indexes are designed later.
With HBase, you have a “query-first” schema design; all possible queries should be identified first, and the schema model designed accordingly. You should design your HBase schema to take advantage of the strengths of HBase. Think about your access patterns, and design your schema so that the data that is read together is stored together. Remember that HBase is designed for clustering.
- Distributed data is stored and accessed together
- It is query-centric, so focus on how the data is read
- Design for the questions
Normalization
In a relational database, you normalize the schema to eliminate redundancy by putting repeating information into a table of its own. This has the following benefits:
- You don’t have to update multiple copies when an update happens, which makes writes faster.
- You reduce the storage size by having a single copy instead of multiple copies.
However, this causes joins. Since data has to be retrieved from more tables, queries can take more time to complete.
In the example below, we have an order table which has a one-to-many relationship with an order items table. The order items table has a foreign key with the ID of the corresponding order.
De-normalization
In a de-normalized datastore, you store in one table what would be multiple indexes in a relational world. De-normalization can be thought of as a replacement for joins. Often with HBase, you de-normalize or duplicate data so that data is accessed and stored together.
Parent-Child Relationship–Nested Entity
Here is an example of denormalization in HBase. If your tables exist in a one-to-many relationship, it’s possible to model it in HBase as a single row. In the example below, the order and related line items are stored together and can be read together with a get on the row key. This makes the reads a lot faster than joining tables together.
The rowkey corresponds to the parent entity ID, the OrderId. There is one column family for the order data, and one column family for the order items. The order items are nested: the order item IDs are put into the column names, and any non-identifying attributes are put into the value.
This kind of schema design is appropriate when the only way you get at the child entities is via the parent entity.
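As a rough illustration of the nested-entity layout, the Python sketch below turns an order and its items into one row. The family names "data" and "items", the field names, and the use of a dict as the cell value are all hypothetical; a real schema would serialize values to bytes.

```python
def order_to_row(order, items):
    """Map a parent order and its child items onto a single row.

    Returns (rowkey, cells), where cells maps "family:qualifier" to a value.
    """
    # Order attributes go into the "data" column family.
    cells = {f"data:{name}": value
             for name, value in order.items() if name != "order_id"}
    # Each item ID becomes a column name in "items"; the other attributes are the value.
    for item in items:
        attributes = {k: v for k, v in item.items() if k != "item_id"}
        cells[f"items:{item['item_id']}"] = attributes
    return order["order_id"], cells
```

A single get on the row key then returns the order and every line item together, with no join.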
Many-to-Many Relationship in an RDBMS
Here is an example of a many-to-many relationship in a relational database. These are the query requirements:
- Get name for user x
- Get title for book x
- Get books and corresponding ratings for userID x
- Get all userIDs and corresponding ratings for book y
Many-to-Many Relationship in HBase
The queries that we are interested in are:
- Get books and corresponding ratings for userID x
- Get all userIDs and corresponding ratings for book y
For an entity table, it is pretty common to have one column family storing all the entity attributes, and column families to store the links to other entities.
The entity tables are as shown below:
Generic Data, Event Data, and Entity-Attribute-Value
Generic data that is schemaless is often expressed as name value or entity attribute value. In a relational database, this is complicated to represent. A conventional relational table consists of attribute columns that are relevant for every row in the table, because every row represents an instance of a similar object. A different set of attributes represents a different type of object, and thus belongs in a different table. The advantage of HBase is that you can define columns on the fly, put attribute names in column qualifiers, and group data by column families.
Here is an example of clinical patient event data. The Row Key is the patient ID plus a time stamp. The variable event type is put in the column qualifier, and the event measurement is put in the column value. OpenTSDB is an example of variable system monitoring data.
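One detail worth sketching is how the timestamp is encoded in the row key. Because row keys sort as bytes, a common approach (assumed here, not stated in the original) is to zero-pad the timestamp so that lexicographic order matches chronological order. The separator and the ten-digit width are illustrative choices.

```python
def event_rowkey(patient_id, epoch_seconds):
    """Build a row key of patient ID plus a zero-padded timestamp."""
    return f"{patient_id}:{epoch_seconds:010d}"


def event_cell(event_type, measurement):
    """The variable event type is the column qualifier; the measurement is the value."""
    return {event_type: measurement}
```

With this encoding, all events for one patient are adjacent and ordered by time, so a scan over the patient's key prefix reads them in sequence.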
Self-Join Relationship – HBase
A self-join is a relationship in which both match fields are defined in the same table.
Consider a schema for Twitter relationships, where the queries are: which users does userX follow, and which users follow userX? Here’s a possible solution: the user IDs are put in a composite row key with the relationship type as a separator. For example, Carol follows Steve Jobs and Carol is followed by BillyBob. This allows row key scans over the prefix carol:follows or carol:followedby to find everyone in each relationship.
Below is the example Twitter table:
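A prefix scan over composite row keys can be sketched in Python as follows. The key format "user:relationship:otheruser" and the sample users are assumptions for illustration; the scan relies only on the keys being stored in sorted order.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, using the sorted order to stop early."""
    start = bisect.bisect_left(sorted_keys, prefix)  # first key >= prefix
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(key)
    return matches


ROW_KEYS = sorted([
    "carol:followedby:billybob",
    "carol:follows:stevejobs",
    "carol:follows:alice",
    "dave:follows:carol",
])
```

Scanning with the prefix "carol:follows:" answers "whom does Carol follow", and "carol:followedby:" answers "who follows Carol", each with one contiguous read.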
Tree, Graph Data
Here is an example of an adjacency list or graph, using a separate column for each parent and child:
Each row shows a node, and the row key is equal to the node ID. There is a column family for parents, p, and a column family for children, c. The column qualifiers are equal to the parent or child node IDs, and the value is equal to the type of node. This allows you to quickly find the parent or child nodes from the row key.
There are multiple ways to represent trees, and the best way depends on your queries.
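The adjacency-list layout can be mocked up with nested dicts in Python. The node IDs and node types below are invented for the example; only the shape (row key, then family "p" or "c", then qualifier, then value) follows the description above.

```python
# Row key = node id; family "p" = parents, family "c" = children.
# Column qualifier = related node id; value = type of that node.
GRAPH_ROWS = {
    "n1": {"p": {}, "c": {"n2": "folder", "n3": "file"}},
    "n2": {"p": {"n1": "folder"}, "c": {"n4": "file"}},
    "n3": {"p": {"n1": "folder"}, "c": {}},
    "n4": {"p": {"n2": "folder"}, "c": {}},
}


def children(rows, node_id):
    """One row read returns every child of the node."""
    return sorted(rows[node_id]["c"])


def parents(rows, node_id):
    """The same row read returns every parent of the node."""
    return sorted(rows[node_id]["p"])
```

Both directions of the relationship are answered from a single row, at the cost of storing each edge twice.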
Inheritance Mapping
In this online store example, the type of product is a prefix in the row key. Some of the columns are different, and may be empty depending on the type of product. This allows you to model different product types in the same table and to scan easily by product type.
Data Access Patterns
Use Cases: Large-scale offline ETL analytics and generating derived data
In analytics, data is written multiple orders of magnitude more frequently than it is read. Offline analysis can also be used to provide a snapshot for online viewing. Offline systems don’t have a low-latency requirement; that is, a response isn’t expected immediately. Offline HBase ETL data access patterns, such as Map Reduce or Hive, are characterized by high latency reads and high throughput writes.
Use Cases: Materialized view, pre-calculated summaries
To provide fast reads for online web sites, or an online view of data from data analysis, MapReduce jobs can reorganize the data into different groups for different readers or materialized views. Batch offline analysis could also be used to provide a snapshot for online views. This pattern gives high throughput for the batch offline writes and fast reads once the views are served online.
Examples include:
- Generating derived data, duplicating data for reads in HBase schemas, and delayed secondary indexes
Schema Design Exploration:
- Raw data from HDFS or HBase
- MapReduce for data transformation and ETL from raw data.
- Use bulk import from MapReduce to HBase
- Serve data for online reads from HBase
Designing for reads means aggressively de-normalizing data so that the data that is read together is stored together.
Lambda Architecture
The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer.
MapReduce jobs are used to create artifacts useful to consumers at scale. Incremental updates are handled in real time by processing updates to HBase in a Storm cluster, and are applied to the artifacts produced by MapReduce jobs.
The batch layer precomputes the batch views. In the batch view, you read the results from a precomputed view. The precomputed view is indexed so that it can be accessed quickly with random reads.
The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view. A serving layer database only requires batch updates and random reads. The serving layer updates whenever the batch layer finishes precomputing a batch view.
You can do stream-based processing with Storm and batch processing with Hadoop. The speed layer only produces views on recent data, and is for functions computed on the last few hours of data that the batch layer has not yet covered. To achieve the lowest latency possible, the speed layer doesn’t look at all the new data at once. Instead, it updates the real-time views as it receives new data, rather than recomputing them as the batch layer does. In the speed layer, HBase provides the ability for Storm to continuously increment the real-time views.
How does Storm know to process new data in HBase? A “needs work” flag is set. Processing components scan for notifications and process them as they enter the system.
MapReduce Execution and Data Flow
The flow of data in a MapReduce execution is as follows:
- Data is loaded from the Hadoop file system
- Next, the job defines the input format of the data
- Data is then split between different map() methods running on all the nodes
- Then record readers parse out the data into key-value pairs that serve as input into the map() methods
- The map() method produces key-value pairs that are sent to the partitioner
- When there are multiple reducers, the mapper creates one partition for each reduce task
- The key-value pairs are sorted by key in each partition
- The reduce() method takes the intermediate key-value pairs and reduces them to a final list of key-value pairs
- The job defines the output format of the data
- Data is written to the Hadoop file system
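The data flow above can be traced with a tiny single-process word count in Python. This is a teaching sketch, not Hadoop: the function names are made up, input lines stand in for file splits, and zlib.crc32 is used as an arbitrary but deterministic partitioner.

```python
import zlib
from collections import defaultdict


def map_fn(line):
    """map(): emit a (word, 1) pair for every word in the input record."""
    for word in line.split():
        yield word, 1


def partition(key, num_reducers):
    """Partitioner: choose the reduce task for a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """reduce(): collapse all values for one key into a single pair."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]  # one per reduce task
    for line in lines:                                             # records from the input
        for key, value in map_fn(line):
            partitions[partition(key, num_reducers)][key].append(value)
    output = []
    for part in partitions:
        for key in sorted(part):                                   # sort by key per partition
            output.append(reduce_fn(key, part[key]))
    return output
```

Each key lands in exactly one partition, so every reduce call sees all of the values for its key.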
In this blog post, you learned how HBase schema is different from traditional relational schema modeling, and you also got some guidelines for proper HBase schema design. If you have any questions about this blog post, please ask them in the comments section below.