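I need to get up to speed on HDInsight for work, so I am taking notes here. The official site is naturally the most authoritative source, so everything below is carried over from it: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/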
Hadoop on HDInsight
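Anyone doing big data knows Hadoop, so how does HDInsight relate to it? HDInsight is M$'s Azure-based software stack, used mainly for big data analysis and management, and it is built on the Hadoop distribution from HDP (Hortonworks Data Platform). One thing to note: when we say Hadoop we usually mean the Hadoop ecosystem, including Storm, HBase and so on, not just the little elephant itself.
HDInsight can be thought of as an implementation of Apache Hadoop on Microsoft Azure. It includes the corresponding Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari and so on, and of course it also ties in with Microsoft's own Excel, SSAS, and SSRS.
HDInsight supports two operating systems, Linux and M$'s own Windows. The main differences are: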
CATEGORY | HADOOP ON LINUX | HADOOP ON WINDOWS |
Cluster OS | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2 |
Cluster Type | Hadoop | Hadoop, HBase, Storm |
Deployment | Azure Management Portal, Azure CLI, Azure PowerShell | Azure Management Portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK |
Cluster UI | Ambari | Cluster Dashboard |
Remote Access | Secure Shell (SSH) | Remote Desktop Protocol (RDP) |
Some basic concepts and definitions
- Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
- HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.
- Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
- Ambari: Cluster provisioning, management, and monitoring.
- Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
- Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
- Mahout: Machine learning.
- MapReduce and YARN: Distributed processing and resource management.
- Oozie: Workflow management.
- Phoenix: Relational database layer over HBase.
- Pig: Simpler scripting for MapReduce transformations.
- Sqoop: Data import and export.
- Tez: Allows data-intensive processes to run efficiently at scale.
- ZooKeeper: Coordination of processes in distributed systems.
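To make the "simple MapReduce programming model" from the Hadoop entry a bit more concrete, here is a minimal single-process sketch of the classic word count. It only imitates the map, shuffle, and reduce steps in plain Python; the function names are my own and this is not how a job is actually submitted to a Hadoop cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop on HDInsight", "Storm on HDInsight"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; the structure of the computation stays the same.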
HBase
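This thing comes in two flavors. One is Apache HBase: an open-source NoSQL database built on Hadoop and modeled after Google's BigTable, which handles access to massive amounts of unstructured and semi-structured data well. The other is HDInsight HBase, Microsoft's own managed offering, where the data is stored directly in Azure Blob storage.
HBase data can be managed with the create/get/put/scan commands in the hbase shell; get reads a single row, while scan reads data from multiple rows. There is also a REST-based C# API you can call.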
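As a rough mental model of what create/put/get/scan do, here is a toy in-memory table in Python. `ToyTable` is a hypothetical class of my own, not the HBase client API; it only mirrors the idea that each row key maps to `family:qualifier` columns, and that a scan walks rows in sorted key order with an inclusive start and an exclusive stop.

```python
class ToyTable:
    """In-memory stand-in for an HBase table: row key -> {"family:qualifier": value}."""

    def __init__(self, name, families):
        # Roughly what `create 'Contacts', 'Personal', 'Office'` sets up in the shell.
        self.name = name
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        """Write one cell; the column family must have been declared at creation."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Read all cells of a single row (empty dict if the row is absent)."""
        return dict(self.rows.get(row_key, {}))

    def scan(self, start_row=None, stop_row=None):
        """Yield (row_key, cells) in sorted key order, start inclusive, stop exclusive."""
        for key in sorted(self.rows):
            if start_row is not None and key < start_row:
                continue
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self.rows[key])


if __name__ == "__main__":
    contacts = ToyTable("Contacts", ["Personal", "Office"])
    contacts.put("1000", "Personal:Name", "Alice")
    contacts.put("1000", "Office:Phone", "555-0100")
    contacts.put("1001", "Personal:Name", "Bob")
    print(contacts.get("1000"))
    print(list(contacts.scan(start_row="1001")))
```

The table name, row keys, and values here are made up for illustration.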
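HBase use cases
The original motivation was Google's own web search (that is what BigTable was built for): when you search for 三體, it returns every page that contains 三體. Beyond that, the use cases include:
- Key-value store: a good fit for message management, as at Facebook.
- Sensor data: including but not limited to social data, time-series data, audit logs, and so on.
- Real-time query: for example, Phoenix is a SQL query engine for Apache HBase.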
Storm
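According to the official site, it is a distributed, fault-tolerant, open-source computation system that lets you process data in real time with Hadoop.
Storm on HDInsight has the following features:
- The SLA commitment is 99.9% uptime
- Storm components can be written in Java/C#/Python
- Built-in mechanism for scaling the cluster up and down
- Integrates with EventHub/Virtual Network/SQL/Blob/DocumentDB
Real-time processing scenarios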
- Internet of Things (IoT)
- Fraud detection
- Social analytics
- Extract, Transform, Load (ETL)
- Network monitoring
- Search
- Mobile engagement
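To give a feel for what "real-time" processing means in a scenario like fraud detection, here is a small Python sketch that handles events one at a time and flags an account that produces too many events inside a sliding time window. This is only an illustration of the idea, not Storm code: `detect_bursts`, its parameters, and the thresholds are all assumptions of mine.

```python
from collections import defaultdict, deque


def detect_bursts(events, window_seconds=60, max_events=3):
    """Yield (timestamp, account) whenever an account exceeds max_events in the window.

    events: iterable of (timestamp_seconds, account) in non-decreasing time order.
    """
    recent = defaultdict(deque)  # account -> timestamps still inside the window
    for timestamp, account in events:
        window = recent[account]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield timestamp, account


if __name__ == "__main__":
    stream = [(0, "a"), (10, "a"), (20, "b"), (30, "a"), (40, "a"), (200, "a")]
    for alert in detect_bursts(stream):
        print("suspicious burst:", alert)
```

The point is that each event is handled as it arrives and only a small amount of state is kept per key, instead of storing everything first and running a batch job later.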
Spark
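Apache Spark is an open-source parallel processing framework that supports in-memory big data analytics.
Use cases:
- Interactive data analysis and BI
- Iterative machine learning (what is this?)
- Streaming and real-time data processing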
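On the "iterative machine learning" question: as far as I understand it, this refers to algorithms that pass over the same dataset many times, improving a model a little on each pass, which is where keeping the data in memory pays off. Below is a minimal plain-Python sketch using batch gradient descent to fit a line. It does not use the Spark API; `fit_line` and its default parameters are my own illustrative choices.

```python
def fit_line(points, learning_rate=0.05, iterations=500):
    """Fit y = w * x + b by batch gradient descent over a list of (x, y) pairs."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Every iteration makes a full pass over the same in-memory dataset.
        grad_w = sum((w * x + b - y) * x for x, y in points) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in points) * 2 / n
        # Step against the gradient of the mean squared error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # points on y = 2x + 1
    w, b = fit_line(data)
    print(f"w={w:.3f}, b={b:.3f}")
```

If every one of those passes had to reread the dataset from disk, the loop would be dominated by I/O; holding the data in memory across iterations avoids that cost.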