An Overview of the Hadoop Ecosystem

Posted by 親吻昨日的陽光 on 2015-04-28

Based on the introductions on the official Hadoop site and on the software commonly used in practice, this article briefly introduces the main tools in the Hadoop ecosystem, to broaden your understanding of the ecosystem as a whole.

The Hadoop ecosystem traces its development back to Google's three foundational papers; it has since grown into a full ecosystem and is still developing rapidly.


This is the Hadoop ecosystem diagram from the official site, covering most of the commonly used Hadoop-related tools.




This diagram shows the Hadoop ecosystem laid out bottom-up, indicating where each tool sits in the overall architecture.



This diagram shows the dependencies between Hadoop's core components and the systems built on top of them.



Below is a brief introduction to some of the tools in the Hadoop ecosystem.

Hadoop

From the official site:
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.


The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


The project includes these modules:


Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:


Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
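
To make the MapReduce programming model above concrete, here is a minimal word-count sketch against the standard Hadoop Java API. It is not from the official page; the input and output paths come from the command line, and everything else is the classic textbook example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```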






/****************************************************************************/


Ambari

Ambari monitoring page:

From the official site:

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.


Ambari enables System Administrators to:


Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
Monitor a Hadoop Cluster
Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
Ambari leverages Ambari Metrics System for metrics collection.
Ambari leverages Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Ambari enables Application Developers and System Integrators to:


Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.
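
As a sketch of what that REST integration can look like, the snippet below lists the clusters through Ambari's v1 API. The host name and the admin:admin credentials are placeholder assumptions, not values from the original article.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClusters {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and admin credentials; adjust for a real cluster.
        URL url = new URL("http://ambari.example.com:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        // Ambari requires this header on modifying requests; harmless on a GET.
        conn.setRequestProperty("X-Requested-By", "ambari");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing of clusters
            }
        }
    }
}
```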





Current latest version: Ambari 2.0.0



/****************************************************************************/

Avro

From the official site:
Apache Avro™ is a data serialization system.


Avro provides:


Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Remote procedure call (RPC).
Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.








From the official site:
Schemas
Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.


When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.


When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.


Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.
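
A minimal sketch of this in the Avro Java API: the JSON below defines a made-up "User" record schema, and a record is serialized to compact binary with no per-value type tags.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // The schema is ordinary JSON; "User" and its fields are illustrative.
        String schemaJson = "{"
            + "\"type\": \"record\", \"name\": \"User\","
            + "\"fields\": ["
            + "  {\"name\": \"name\", \"type\": \"string\"},"
            + "  {\"name\": \"age\",  \"type\": \"int\"}"
            + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // Binary encoding carries no per-value type information;
        // the schema travels separately (in the file header, or via handshake in RPC).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();
        System.out.println("Serialized " + out.size() + " bytes");
    }
}
```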






From the official site:
Comparison with other systems
Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.


Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.





Current latest version: Avro 1.7.7 (released 23 July 2014)


/****************************************************************************/



Cassandra

Getting started with Cassandra installation and configuration



From the official site:
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.


Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik ( Facebook Engineer ). In a lot of ways you can think of Cassandra as Dynamo 2.0 or a marriage of Dynamo and BigTable. Cassandra is in production use at Facebook but is still under heavy development.
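
To give a feel for the ColumnFamily data model from a client's point of view, here is a sketch using the DataStax Java driver (2.x style); the contact point, keyspace, and table are illustrative assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSketch {
    public static void main(String[] args) {
        // Placeholder contact point; the keyspace and table below are made up.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id int PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'alice')");

            // Rows are addressed by key, with named columns inside each row.
            ResultSet rs = session.execute("SELECT id, name FROM demo.users");
            for (Row row : rs) {
                System.out.println(row.getInt("id") + " -> " + row.getString("name"));
            }
        }
    }
}
```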





Current latest version: Apache Cassandra 2.1.4 (released 2015-04-01)



/****************************************************************************/



Chukwa

Chukwa in practice at Baidu

Chukwa architecture diagram:



From the official site:
Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.





Current latest version: 0.6.0 (last published 2015-03-24)



/****************************************************************************/


HBase

HBase pseudo-distributed installation

HBase cluster installation

HBase basics and shell operations

Getting started with HBase

HBase architecture diagram



From the official site:
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.


When Would I Use Apache HBase?


Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.


Features


Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
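
Since client access goes through the Java API mentioned in the feature list, here is a minimal put/get sketch against the HBase 1.0 client API; the table and column names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```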





Current latest version: 1.0.0



/****************************************************************************/


Hive

Hive architecture diagram



From the official site:
The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
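
For a feel of what HiveQL looks like in practice, here is a sketch that runs a query through the HiveServer2 JDBC driver; the host, table, and credentials are placeholder assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive.example.com:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL but compiles to distributed jobs underneath.
            ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) FROM page_views GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```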




Current latest version: Hive 1.1.0 (released 8 March 2015)



/****************************************************************************/


Mahout


Mahout algorithms and their use cases


From the official site:

The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications.


The three major components of Mahout are an environment for building scalable algorithms, many new Scala + Spark (H2O in progress) algorithms, and Mahout's mature Hadoop MapReduce algorithms.






From the 0.10 release announcement (2015-04-11):
Apache Mahout introduces a new math environment we call Samsara, for its theme of universal renewal. It reflects a fundamental rethinking of how scalable machine learning algorithms are built and customized. Mahout-Samsara is here to help people create their own math while providing some off-the-shelf algorithm implementations. At its core are general linear algebra and statistical operations along with the data structures to support them. You can use it as a library or customize it in Scala with Mahout-specific extensions that look something like R. Mahout-Samsara comes with an interactive shell that runs distributed operations on a Spark cluster. This make prototyping or task submission much easier and allows users to customize algorithms with a whole new degree of freedom.


Mahout Algorithms include many new implementations built for speed on Mahout-Samsara. They run on Spark and some on H2O, which means as much as a 10x speed increase. You’ll find robust matrix decomposition algorithms as well as a Naive Bayes classifier and collaborative filtering. The new spark-itemsimilarity enables the next generation of cooccurrence recommenders that can use entire user click streams and context in making recommendations.




Current latest version: 0.10.0



/****************************************************************************/


Pig

Processing data with Pig


From the official site:
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.


At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:


Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose processing.
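
As a sketch of Pig Latin's data-flow style, the snippet below drives a three-statement script through PigServer in local mode; the input file and its (user, url) layout are assumptions for illustration.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would target a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Placeholder input: tab-separated (user, url) records.
        pig.registerQuery("visits = LOAD 'visits.txt' AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP visits BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;");
        // The statements above form one dataflow; store() triggers execution.
        pig.store("counts", "visit_counts");
        pig.shutdown();
    }
}
```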




Current latest version: Apache Pig 0.14.0, including Pig on Tez and OrcStorage



/****************************************************************************/



Spark

From the official site:
Apache Spark™ is a fast and general engine for large-scale data processing.
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Ease of Use: Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
Generality: Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3. You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
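
A minimal sketch of that programming model in the Java API (Spark 1.x era, matching the versions noted below); the input path is a placeholder, and local[*] keeps the example self-contained.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder input path; any text file works.
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")))  // Spark 1.x: flatMap returns an Iterable
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        sc.stop();
    }
}
```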








Current latest versions: Spark 1.2.2 and 1.3.1





/****************************************************************************/



Tez:



The flow shown above consists of multiple MR jobs, each saving its intermediate results to HDFS; the reducer of one step feeds the mapper of the next.



This diagram shows the same flow when using Tez: the whole process completes in a single job, and the tasks never need to touch HDFS in between.



From the official site:
The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.


The 2 main design themes for Tez are:


Empowering end users by:
Expressive dataflow definition APIs
Flexible Input-Processor-Output runtime model
Data type agnostic
Simplifying deployment
Execution Performance
Performance gains over Map Reduce
Optimal resource management
Plan reconfiguration at runtime
Dynamic physical data flow decisions
By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, that earlier took multiple MR jobs, now in a single Tez job as shown below.





The Tez API includes the following components (a code sketch follows the list):

  • DAG (directed acyclic graph) — defines the overall job. One DAG object corresponds to one job.
  • Vertex — defines the user logic plus the resources and environment needed to execute it. One vertex corresponds to one step in the job.
  • Edge — defines the connection between a producer vertex and a consumer vertex.

    Edges must be assigned properties; Tez requires them in order to expand the logical graph at runtime into the physical set of tasks that run in parallel on the cluster. Some of these properties are:

    • Data movement property: defines how data moves from a producer to a consumer.
    • Scheduling property (sequential or concurrent): defines when producer and consumer tasks should be scheduled relative to each other.
    • Data source property (persisted, reliable, or ephemeral): defines the lifetime and durability of a task's output, which lets the framework decide when it can be discarded.
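
Here is the promised sketch of assembling such a DAG with the Tez Java API (0.5+ style factory methods). The processor and input/output class names are hypothetical placeholders, not real Tez classes.

```java
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class TezDagSketch {
    public static void main(String[] args) {
        // Two vertices: each wraps user logic via a (hypothetical) processor class.
        Vertex tokenizer = Vertex.create("Tokenizer",
            ProcessorDescriptor.create("example.TokenProcessor"), 4);
        Vertex summer = Vertex.create("Summer",
            ProcessorDescriptor.create("example.SumProcessor"), 2);

        // Edge properties: a shuffle-like, hash-partitioned data movement.
        EdgeProperty shuffle = EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,   // data movement property
            DataSourceType.PERSISTED,          // data source property
            SchedulingType.SEQUENTIAL,         // scheduling property
            OutputDescriptor.create("example.PartitionedOutput"),
            InputDescriptor.create("example.ShuffledInput"));

        // One DAG object corresponds to one job.
        DAG dag = DAG.create("word-count")
            .addVertex(tokenizer)
            .addVertex(summer)
            .addEdge(Edge.create(tokenizer, summer, shuffle));
        System.out.println("DAG defined: " + dag.getName());
    }
}
```
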
Current latest version: Apache Tez 0.6.0

/****************************************************************************/


ZooKeeper:

ZooKeeper fundamentals and cluster installation and configuration




From the official site:
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.


What is ZooKeeper?
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
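
A minimal sketch of the client API for the configuration use case above: connect, create a znode, and read it back. The ensemble address and znode path are placeholder assumptions.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address; wait until the session is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration at a znode, then read it back.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes("UTF-8"),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println("config = " + new String(data, "UTF-8"));
        zk.close();
    }
}
```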






Current latest version: ZooKeeper 3.4.6 (released 10 March 2014)


/****************************************************************************/

Sqoop:

Sqoop workflow:



From the official site:
Apache Sqoop
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level Apache project: More information


Latest stable release is 1.4.5 (download, documentation). Latest cut of Sqoop2 is 1.99.5 (download, documentation). Note that 1.99.5 is not compatible with 1.4.5 and not feature complete, it is not intended for production deployment.





Sqoop is a tool for transferring data between Hadoop and relational databases in both directions: it can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and export data from HDFS back into a relational database.
It also provides connectors for some NoSQL databases. Like other ETL tools, Sqoop uses a metadata model to infer data types and to ensure type-safe data handling as data moves from the source into Hadoop. Sqoop is designed for bulk transfer of big data: it can partition a data set and create Hadoop tasks to process each slice.




/****************************************************************************/
Flume:

Flume workflow:



From the official site:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.





Log collection
Flume was originally Cloudera's log collection system and is now an Apache project. Flume supports plugging custom data senders into a logging pipeline for collecting data.
Data processing
Flume can perform simple processing on the data and write it out to various (customizable) data sinks. It can collect data from sources such as console, RPC (Thrift-RPC), text (files), tail (UNIX tail), syslog (the syslog facility, supporting both TCP and UDP modes), and exec (command execution).


Current latest version: Apache Flume 1.5.2 (released November 18, 2014)



/****************************************************************************/

Impala

Impala architecture diagram


Impala architecture analysis

Impala is a new query system whose development is led by Cloudera. It provides SQL semantics and can query PB-scale big data stored in Hadoop's HDFS and HBase. The existing Hive system also offers SQL semantics, but because Hive executes on the MapReduce engine underneath, it remains a batch process and struggles to deliver interactive queries. By contrast, Impala's biggest feature, and biggest selling point, is its speed. So how does Impala achieve fast queries over big data? Before answering that, we need to introduce Google's Dremel system, since Impala's design was originally modeled on Dremel.

Dremel is Google's interactive data analysis system. It is built on top of systems such as Google's GFS (Google File System) and underpins services such as Google's BigQuery data analysis offering. Dremel has two main technical highlights. First, it implements columnar storage for nested data. Second, it uses a multi-level query tree, so that a job can run in parallel across thousands of nodes and aggregate the results. Columnar storage is nothing new in relational databases; it reduces the amount of data touched per query and effectively improves query efficiency. What distinguishes Dremel's columnar storage is that it targets nested data structures rather than traditional relational data. Dremel converts nested records into columnar form; at query time it reads only the columns the query needs, applies the filter conditions, and on output reassembles the columns into nested records. Both the forward and reverse conversions are implemented with efficient state machines. In addition, Dremel's multi-level query tree borrows from the design of distributed search engines: the root node of the tree receives a query and fans it out to the next level, the leaf nodes perform the actual data reads and query execution, and the results flow back up the tree.

In Cloudera's tests, Impala's query efficiency is an order of magnitude better than Hive's. Technically, Impala's strong performance comes mainly from the following:

    • Impala does not write intermediate results to disk, avoiding a great deal of I/O overhead.
    • It avoids the overhead of starting MapReduce jobs. MapReduce launches tasks slowly (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.
    • Impala abandons MapReduce, a paradigm not well suited to SQL queries, and instead, like Dremel, borrows the ideas of MPP parallel databases and starts from scratch. This enables more query optimization and avoids unnecessary shuffle, sort, and similar overhead.
    • It uses LLVM to compile runtime code, avoiding the overhead that comes with supporting generic compilation.
    • It is implemented in C++ with many targeted hardware optimizations, such as SSE instructions.
    • It uses an I/O scheduling mechanism that supports data locality, placing data and computation on the same machine whenever possible to reduce network overhead.


Translated from the documentation on the official Apache sites. Since my English and domain expertise are limited, corrections and feedback on any mistranslations are welcome.




