What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, and so delivers a highly available service on top of a cluster of computers, each of which may be prone to failure.
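The idea of handling failures at the application layer, rather than trusting the hardware, can be illustrated with a toy scheduler. This is a minimal sketch in plain Python and not Hadoop's actual mechanism: the names `NodeFailure`, `run_with_failover`, and the node labels are hypothetical, and a "failure" is simulated by listing a node as broken.

```python
class NodeFailure(Exception):
    """Raised when a simulated worker node cannot finish a task."""


def run_with_failover(task, nodes, failed_nodes):
    """Try the task on each node in turn, moving on when a node fails.

    `nodes` is a list of node names (hypothetical labels); `failed_nodes`
    is the set of names treated as broken in this simulation.
    """
    for node in nodes:
        try:
            if node in failed_nodes:
                raise NodeFailure(f"{node} is down")
            return node, task()
        except NodeFailure:
            continue  # failure detected in software: reschedule on the next node
    raise RuntimeError("no healthy node could run the task")


node, result = run_with_failover(
    lambda: sum(range(10)),
    ["node-1", "node-2", "node-3"],
    failed_nodes={"node-1"},
)
print(node, result)  # prints "node-2 45"
```

The caller never sees the failure of `node-1`; the work is simply rerun elsewhere, which is the property the paragraph above describes at cluster scale.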
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
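To make the "simple programming model" behind MapReduce concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions, and a real job would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The user writes only the map and reduce functions; grouping by key and distributing the work are left to the framework, which is what keeps the model simple.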
Other Hadoop-related projects at Apache include:
- Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL (extract, transform, and load), machine learning, stream processing, and graph computation.
- Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG (directed acyclic graph) of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
- ZooKeeper™: A high-performance coordination service for distributed applications.
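The Tez entry above mentions executing "an arbitrary DAG of tasks". As a rough illustration of what that means, the sketch below runs a small task graph in dependency order using Python's standard `graphlib` module (Python 3.9+). It is not the Tez API: `run_dag` and the task names `load`, `sort`, `total`, and `report` are hypothetical, and everything runs sequentially in one process.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task only after all of the tasks it depends on.

    `tasks` maps a task name to a function that receives the dict of
    results so far; `dependencies` maps a task name to the set of task
    names that must finish first.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = tasks[name](results)
    return results


# Hypothetical four-step job: two independent steps share one input.
tasks = {
    "load": lambda r: [3, 1, 2],
    "sort": lambda r: sorted(r["load"]),
    "total": lambda r: sum(r["load"]),
    "report": lambda r: (r["sort"], r["total"]),
}
dependencies = {
    "load": set(),
    "sort": {"load"},
    "total": {"load"},
    "report": {"sort", "total"},
}

results = run_dag(tasks, dependencies)
print(results["report"])  # ([1, 2, 3], 6)
```

Unlike a fixed map-then-reduce pipeline, the graph here can have any shape, as long as it contains no cycles; `sort` and `total` could run in parallel on a real engine because neither depends on the other.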
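Further notes:
HDFS: comparable systems include GFS (Google File System); Amazon, Alibaba, and Tencent each have their own, separately named distributed file systems.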