This article lowers the barrier to learning big data all the way to the ground

Posted by 入門到入墳 on 2020-10-29

An introduction to Hadoop

Hadoop: the Adam and Eve of the open-source big-data world.
Its core is the HDFS storage system and the MapReduce distributed computing framework.

HDFS

The principle: chop large data into blocks, make three copies of each block, and spread the copies across three cheap machines, so that three usable replicas of every block back each other up at all times. On read, the block is simply fetched from any one of the replicas.

The nodes that store the data are called DataNodes (the storage cubicles); the node that manages the DataNodes is called the NameNode (the supervisor).
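
Once the single-node container described below is running, HDFS itself can show this block-and-replica layout. A minimal sketch (run from /usr/local/hadoop inside the container; /user/root/input is the example input directory used later in this article, and on a single node each block has 1 replica rather than 3):

# report, per file, how it is split into blocks, how many replicas each block
# has, and which DataNodes hold them
bin/hdfs fsck /user/root/input -files -blocks -locations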

MapReduce

The principle: a big task is first processed in partitioned chunks (Map), and the partial results are then merged (Reduce). Both the splitting and the merging run in parallel across many servers, which is where the power of the cluster shows. The hard part is decomposing a task into the split-and-merge shape the MapReduce model expects, and deciding what the intermediate <k,v> inputs and outputs should be.
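
As a rough analogy only, not Hadoop itself, the classic word-count job maps neatly onto an ordinary shell pipeline: the "map" step emits one word per line, the "shuffle" (sort) groups identical words together, and the "reduce" step counts each group. Here input.txt is any hypothetical local text file:

# map: turn every non-letter into a newline, so each word sits on its own line
# shuffle: sort brings identical words together
# reduce: uniq -c counts each group; the final sort -rn orders by frequency
tr -cs 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn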

Single-node Hadoop

For anyone learning how Hadoop works or doing Hadoop development, setting up a Hadoop system of their own is a must. But

  • Configuring such a system is a real headache, and many people give up partway through the configuration.
  • You may simply have no servers to use.

This article introduces a configuration-free way to install and use single-node Hadoop, so you can quickly run the Hadoop examples to support learning, development, and testing.
The only requirement is a Linux virtual machine on your laptop with Docker installed inside it.
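
A quick sanity check inside the VM before going any further (assuming Docker is already installed there):

# confirm the Docker client can reach the daemon
docker version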

Installation

Use Docker to pull the sequenceiq/hadoop-docker:2.7.0 image and run it.

[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0  
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer

Output on a successful download:

Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
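
You can confirm the image is available locally before starting it:

# list the image that was just pulled
docker images sequenceiq/hadoop-docker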

Start

[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd:                                             [  OK  ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out

Once it has started, the shell drops you straight into the Hadoop container environment; there is no need to run docker exec. Inside the container, change into /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows:

bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh

Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.

localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.

starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.  

bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out

Hadoop is now fully started. That simple.
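
To double-check that all of the daemons actually came up, you can list the Java processes inside the container (assuming jps from the bundled JDK is on the PATH):

# every Hadoop daemon is its own JVM; expect NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager and JobHistoryServer in the list
jps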

If you wonder how much trouble a distributed deployment is, just count how many configuration files there are. I once watched a Hadoop veteran spend a whole morning on a freshly provisioned server whose hostname contained a hyphen "-"; the environment simply refused to come up.

Run the bundled example

Go back to the Hadoop home directory and run the example program:

bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+' 
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job:  map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job:  map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job:  map 12% reduce 0%

When the MapReduce job completes, output like the following appears:

20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=291
                FILE: Number of bytes written=230541
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=569
                HDFS: Number of bytes written=197
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5929
                Total time spent by all reduces in occupied slots (ms)=8545
                Total time spent by all map tasks (ms)=5929
                Total time spent by all reduce tasks (ms)=8545
                Total vcore-seconds taken by all map tasks=5929
                Total vcore-seconds taken by all reduce tasks=8545
                Total megabyte-seconds taken by all map tasks=6071296
                Total megabyte-seconds taken by all reduce tasks=8750080
        Map-Reduce Framework
                Map input records=11
                Map output records=11
                Map output bytes=263
                Map output materialized bytes=291
                Input split bytes=132
                Combine input records=0
                Combine output records=0
                Reduce input groups=5
                Reduce shuffle bytes=291
                Reduce input records=11
                Reduce output records=11
                Spilled Records=22
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=159
                CPU time spent (ms)=1280
                Physical memory (bytes) snapshot=303452160
                Virtual memory (bytes) snapshot=1291390976
                Total committed heap usage (bytes)=136450048
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=437
        File Output Format Counters 
                Bytes Written=197

View the results with the hdfs command:

bash-4.1# bin/hdfs dfs -cat output/*
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file
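
Besides -cat, the result directory can be listed or copied out of HDFS with other standard hdfs commands (a sketch; /tmp/grep-output is an arbitrary local path):

# list the result files (a _SUCCESS marker plus the part-r-* files)
bin/hdfs dfs -ls output
# copy the whole directory from HDFS to the container's local filesystem
bin/hdfs dfs -get output /tmp/grep-output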

The example explained

grep here is a MapReduce program that counts regular-expression matches in the input: it extracts the strings that match the pattern along with the number of times each occurs.

Unlike the shell grep, which prints the whole matching line, this program outputs only the matched substring within the line.

grep input output 'dfs[a-z.]+'   

The regular expression dfs[a-z.]+ means the string must start with dfs, followed by one or more characters that are each either a lowercase letter or a literal dot '.'.
The input is all the files in the input directory:

bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root  690 May 16  2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16  2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16  2015 kms-acls.xml
-rw-r--r--. 1 root root  620 May 16  2015 httpfs-site.xml
-rw-r--r--. 1 root root  775 May 16  2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16  2015 hadoop-policy.xml
-rw-r--r--. 1 root root  774 May 16  2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16  2015 capacity-scheduler.xml

The results are written to output.
The computation flow is as follows.

The slight twist is that there are two reduce passes here: the second one just sorts the results by occurrence count. Developers can combine map and reduce stages however they like, as long as the output of each stage lines up with the input of the next.
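
For intuition only, roughly the same counts can be reproduced outside MapReduce entirely, using ordinary shell tools against the local copies of those config files (run from the directory listed above; the counts may differ slightly because the HDFS input contains a few more files than the .xml ones shown):

# "map": -o prints only the matched substrings, -h omits file names
# "reduce": sort + uniq -c counts identical matches; sort -rn orders by count
grep -ohE 'dfs[a-z.]+' *.xml | sort | uniq -c | sort -rn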

The web management UIs

Hadoop provides web-based management UIs:

Port   Purpose
50070  Hadoop NameNode UI port
50075  Hadoop DataNode UI port
50090  Hadoop SecondaryNameNode port
50030  JobTracker monitoring port
50060  TaskTracker port
8088   YARN job monitoring port
60010  HBase HMaster monitoring UI port
60030  HBase HRegionServer port
8080   Spark monitoring UI port
4040   Spark job UI port

Adding command parameters

The docker run command needs extra parameters before the management UIs can be reached:

docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

After running this command you can browse the system from the host machine; of course, if your Linux VM has a browser you can use that too. My Linux has no graphical interface, so I view it from the host.

50070: Hadoop NameNode UI

50075: Hadoop DataNode UI

8088: YARN job monitoring UI

Both completed and running MapReduce jobs can be viewed through the 8088 UI; the screenshot above shows two jobs, grep and wordcount.
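
If neither machine has a browser handy, you can at least confirm the UIs respond by curling the mapped ports from the host (assuming the container was started with the -p mappings above):

# the NameNode UI listens on the mapped port 50070
curl -s http://localhost:50070/ | head
# the YARN ResourceManager UI listens on 8088
curl -s http://localhost:8088/cluster | head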

A few pitfalls

1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run; otherwise jobs will fail during execution with:

20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

2. ./start-all.sh must be run; otherwise you will get errors like
Unknown Job job_1592960164748_0001

3. The docker run command must include --privileged=true (place it before the image name, where Docker parses options); otherwise jobs will fail with java.io.IOException: Job status not available

4. Note that Hadoop does not overwrite result files by default, so running the example above a second time will report an error; delete the output directory in HDFS first, or try a different name such as output01.
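
A sketch of that cleanup with the standard hdfs command, run from the Hadoop home directory:

# remove the previous job output from HDFS so the example can be run again
bin/hdfs dfs -rm -r output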

Summary

The approach in this article gets Hadoop installed and configured at very low cost, and it helps with learning and understanding as well as with development and testing. To develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and run

bin/hadoop jar share/hadoop/mapreduce/yourtest.jar

to execute it and observe the result.
