What Is Apache Hadoop?

Posted by jichuanlau07 on 2014-02-13

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows large data sets to be processed in a distributed fashion across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may fail.

The project includes the following modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
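The MapReduce model mentioned above can be illustrated without any Hadoop installation. The following is a minimal, single-process sketch of the map/shuffle/reduce phases using a word-count example; the function names and input data are assumptions for illustration only, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Map phase: each input record is turned into (key, value) pairs.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word, 1)

# Shuffle phase: group all values by key
# (Hadoop performs this step between map and reduce).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster, the map and reduce phases run in parallel across many machines and the shuffle moves data over the network; the programming model, however, is exactly this simple, which is the point of the paragraph above.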


Other Hadoop-related projects at Apache include:
  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks for both batch and interactive use cases.
  • ZooKeeper™: A high-performance coordination service for distributed applications.


The original English text follows:

What Is Apache Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Other Hadoop-related projects at Apache include:

  • Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
  • ZooKeeper™: A high-performance coordination service for distributed applications.

From the ITPUB blog; link: http://blog.itpub.net/23628945/viewspace-1081101/. If reposting, please credit the source; otherwise legal liability may be pursued.
