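1. Concepts
Hadoop is an open-source framework written in Java that stores massive data sets on a cluster of distributed servers and runs distributed analysis applications on them. Its core components are HDFS and MapReduce.
HDFS is a distributed file system, similar to MogileFS but with important differences. HDFS consists of a namenode, which holds file metadata, and datanodes, which hold the data itself. HDFS keeps its metadata in memory, whereas MogileFS keeps its metadata in a database. The namenode persists that metadata as a namespace image plus an edit log, and the secondary namenode (SNN) supports this persistence. The SNN does not actually act as a namenode; its main job is to periodically merge the edit log into the namespace image so that the edit log does not grow too large. It can run on a separate physical host, and it needs as much memory as the namenode to perform the merge. It also keeps a copy of the namespace image in case the namenode fails. Because of how it works, however, the SNN always lags behind the primary namenode, so some data loss is unavoidable when the primary fails; the purpose of the SNN's image copy is to keep that loss as small as possible.
MapReduce is a computing framework with two main phases: map and reduce. The map phase turns input records into key/value pairs, and pairs that share a key are always sent to the same reducer. The reduce phase folds the mappers' output (think of it as merging) into a final result. That is the architecture of Hadoop v1. In v2 the MapReduce framework was split into YARN and MapReduce, and computing tasks run on top of YARN. So the core of Hadoop v1 is two clusters, HDFS + MapReduce, while the v2 architecture is HDFS + YARN + MapReduce.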
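HDFS architecture
Note: As the architecture figure above shows, a client that wants to read a file on HDFS first asks the namenode (nn) for the file's metadata. The nn tells the client where the file's blocks are located on the datanodes, and the client then requests the data from each datanode in turn and assembles the pieces into the complete file. One concept matters here: a datanode (dn) stores file data split according to file size and block size. For example, with a 100 MB file and a block size of 10 MB, the file is cut into ten 10 MB blocks, and those ten blocks are stored on different dns. Each block is also replicated on other dns, so the blocks of one file are stored redundantly across many dns. The nn mainly tracks which nodes hold the data of each file and which file blocks each dn holds (each dn reports this to the nn periodically). You can picture the nn as keeping two tables: one maps each file to the dns that hold its blocks (file-centric), and the other maps each dn to the blocks it holds (node-centric). It follows that if the nn goes down, no file on HDFS can be located any more, so in production ZooKeeper (zk) is used to make the nn highly available. HDFS is not a kernel file system; it runs on top of the local Linux file system.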
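As a rough illustration of the splitting arithmetic in this note, the sketch below computes how many blocks a file occupies and how many block copies the cluster stores in total. The function names block_count and replica_count are made up for this example, sizes are in MB, the 10 MB block size is the hypothetical figure used in the text (Hadoop 2.x defaults to 128 MB), and the replication factor of 3 is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers illustrating HDFS block arithmetic (not part of Hadoop).

# block_count FILE_MB BLOCK_MB: number of blocks, rounded up,
# because the last block may be only partially filled.
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# replica_count FILE_MB BLOCK_MB REPLICATION: total block copies on the datanodes.
replica_count() {
    echo $(( $(block_count "$1" "$2") * $3 ))
}

echo "100 MB file, 10 MB blocks: $(block_count 100 10) blocks"
echo "with 3 replicas: $(replica_count 100 10 3) block copies"
```

Rounding up matters: a 101 MB file with the same block size would need an eleventh, mostly empty block.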
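MapReduce computation process
Note: As the figure above shows, MapReduce first splits the given data into multiple pieces (the programmer writes the code that splits the data and extracts key/value pairs from it) and then starts multiple mappers to run the map computation on them. The mappers' output can then be merged by a combiner (also written by the programmer, mainly to define the merge rule), which merges values that share a key according to some rule. The result is then passed through a partitioner (written by the programmer, mainly to associate map output with the corresponding reducer) and sent to the different reducers, and each reducer finally produces its own final result. In short: a mapper reads key/value pairs and emits new key/value pairs, so new pairs are produced. A combiner merges the pairs with the same key that the current mapper produced; how they are merged is defined by the programmer. In essence a combiner reads key/value pairs and emits key/value pairs without introducing new keys. A partitioner dispatches the combined pairs to reducers; which reducer a pair goes to, and how many reducers there are, is defined by the programmer. Finally the reducers fold their input and generate new key/value pairs.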
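The map, shuffle and reduce stages described in this note can be imitated locally with standard Unix tools. This is only a sketch under stated assumptions: the function names map_words, shuffle and reduce_counts are invented here, everything runs in one process on one machine, and there is no combiner or partitioner. It shows the data flow, not how Hadoop actually executes a job.

```shell
#!/bin/sh
# A local imitation of map -> shuffle -> reduce using standard Unix tools.
# Function names are hypothetical; this is not how Hadoop runs jobs.

# map: emit one "word<TAB>1" pair per whitespace-separated token.
map_words() {
    tr -s ' \t' '\n\n' | awk 'NF { printf "%s\t1\n", $0 }'
}

# shuffle: sorting brings identical keys together, which is what the
# framework does before handing data to a reducer.
shuffle() {
    LC_ALL=C sort
}

# reduce: fold all values of the same key into a single sum.
reduce_counts() {
    awk -F '\t' '
        NR > 1 && $1 "" != key { printf "%s\t%d\n", key, sum; sum = 0 }
        { key = $1 ""; sum += $2 }
        END { if (NR > 0) printf "%s\t%d\n", key, sum }'
}

printf 'a b a\nc a b\n' | map_words | shuffle | reduce_counts
```

The reducer only ever compares a key with the previous one, which works because the shuffle step has already placed equal keys next to each other.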
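Hadoop v1 and v2 architecture
Note: In the Hadoop v1 architecture all computing tasks run on MapReduce, which plays two roles: cluster resource manager and data processor. In Hadoop v2 the architecture becomes HDFS + YARN + a set of computing frameworks. That set can be thought of as v1's MapReduce, except that in v2 MapReduce is responsible only for computation and no longer for cluster resource management, which YARN now handles. In v2 all computing tasks run on top of YARN. HDFS plays the same role in v1 and v2: storing files.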
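Hadoop v2 resource scheduling for computing tasks
Note: The rm (resource manager) receives a job request from a client. Based on the status that the nm (node manager) on each dn reports periodically, the rm decides which nm should run the job. Once the rm has picked an nm, it sends the job there, and that nm starts a container for the appmaster (am), which acts as the controller for this job. The am needs containers to run the tasks, so it asks the rm for them, and the rm has one or more containers started on the appropriate nms according to the am's request. When the containers finish, their results are sent to the am, which reports back to the rm, and the rm reports back to the client. In this flow the rm receives node status from each nm, schedules resources, and collects the results of each am's job to pass on to clients; the nm manages the resources on its node and reports status to the rm; the am requests resources for its job's tasks and returns the job's results to the rm.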
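Hadoop ecosystem
Note: The figure above shows the Hadoop v2 ecosystem. HDFS and YARN are Hadoop's core components; every kind of job that runs on top of them depends on Hadoop and must support the MapReduce interface.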
2. Hadoop cluster deployment
Environment
Name | Role | IP |
node01 | nn,snn,rm | 192.168.0.41 |
node02 | dn,nm | 192.168.0.42 |
node03 | dn,nm | 192.168.0.43 |
node04 | dn,nm | 192.168.0.44 |
Synchronize the time on all nodes
Configure /etc/hosts so that every node's hostname resolves
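The original listing for this step is not reproduced here, so the following is only a sketch of what the entries could look like, built from the environment table. The helper name hosts_entries is made up, and only short hostnames are used; the later transcripts show node01.test.org, so the real file probably lists fully qualified names as well.

```shell
#!/bin/sh
# Print /etc/hosts entries for the four nodes in the environment table.
# hosts_entries is a hypothetical helper; on every node, as root, its
# output could be appended with:  hosts_entries >> /etc/hosts
hosts_entries() {
    i=1
    while [ "$i" -le 4 ]; do
        printf '192.168.0.4%d node0%d\n' "$i" "$i"
        i=$((i + 1))
    done
}

hosts_entries
```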
Install the JDK on every node
yum install -y java-1.8.0-openjdk-devel
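Note: the jps command is only available once the devel package is installed
Verify that the JDK is installed, check its version, and locate the java command
Add the JAVA_HOME environment variable
Verify that JAVA_HOME is set correctly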
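The value used for JAVA_HOME is not shown in the text, so here is one possible way to derive it, as a sketch only. It assumes the java command is a chain of symlinks ending in .../bin/java (possibly under a jre directory, as the OpenJDK 8 packages lay it out); the helper name java_home_from_bin, the sample fallback path and the file name /etc/profile.d/java.sh are all assumptions made for this example.

```shell
#!/bin/sh
# java_home_from_bin PATH_TO_JAVA: strip the trailing /bin/java (and a
# trailing /jre, if present) to get a JAVA_HOME candidate.
# Hypothetical helper; the path layout is an assumption about the package.
java_home_from_bin() {
    home=${1%/bin/java}
    home=${home%/jre}
    printf '%s\n' "$home"
}

# On a node with the JDK installed, resolve the real binary behind the
# symlinks; fall back to a sample path so the sketch still runs elsewhere.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
else
    java_bin=/usr/lib/jvm/java-1.8.0-openjdk/jre/bin/java
fi

# The line to place in a profile script such as /etc/profile.d/java.sh.
echo "export JAVA_HOME=$(java_home_from_bin "$java_bin")"
```

After sourcing the profile script, running "$JAVA_HOME/bin/java" -version is a quick way to confirm the variable points at a working JDK.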
Create a directory to hold the Hadoop installation
mkdir /bigdata
The base environment is now ready. Next, download the Hadoop binary package
[root@node01 ~]# wget https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
--2020-09-27 22:50:16--  https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
Resolving mirror.bit.edu.cn (mirror.bit.edu.cn)... 202.204.80.77, 219.143.204.117, 2001:da8:204:1205::22
Connecting to mirror.bit.edu.cn (mirror.bit.edu.cn)|202.204.80.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 366447449 (349M) [application/octet-stream]
Saving to: ‘hadoop-2.9.2.tar.gz’

100%[============================================================================>] 366,447,449 1.44MB/s   in 2m 19s

2020-09-27 22:52:35 (2.51 MB/s) - ‘hadoop-2.9.2.tar.gz’ saved [366447449/366447449]

[root@node01 ~]# ls
hadoop-2.9.2.tar.gz
[root@node01 ~]#
Extract hadoop-2.9.2.tar.gz into /bigdata/ and symlink the extracted directory to hadoop
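The commands for this step are not shown, so the sketch below is one way to do it rather than a record of what was run. The helper name install_release is made up, and it assumes the tarball's top-level directory is the tarball name without ".tar.gz", which holds for hadoop-2.9.2.tar.gz.

```shell
#!/bin/sh
# install_release TARBALL DEST LINK_NAME
# Unpack TARBALL under DEST and point DEST/LINK_NAME at the unpacked directory.
# Hypothetical helper; assumes the archive's top-level directory is named
# after the tarball without its ".tar.gz" suffix.
install_release() {
    tarball=$1 dest=$2 link_name=$3
    release=$(basename "$tarball" .tar.gz)
    mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" &&
    ln -sfn "$release" "$dest/$link_name"
}

# Usage on a node (as root):
#   install_release hadoop-2.9.2.tar.gz /bigdata hadoop
```

Going through a version-neutral link means HADOOP_HOME can stay /bigdata/hadoop across upgrades; only the link target changes.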
Export the Hadoop environment variables
[root@node01 ~]# cat /etc/profile.d/hadoop.sh
export HADOOP_HOME=/bigdata/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
[root@node01 ~]#
Create the hadoop user and set its password to admin
[root@node01 ~]# useradd hadoop
[root@node01 ~]# echo "admin" |passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
[root@node01 ~]#
Set up passwordless SSH login for the hadoop user between all nodes
[hadoop@node01 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:6CNhqdagySJXc4iRBVSoLENddO7JLZMCsdjQzqSFnmw hadoop@node01.test.org
The key's randomart image is:
+---[RSA 2048]----+
| o*==o . |
| o=Bo o |
|=oX+ . |
|+E =.oo.+ |
|o.o B.oBS. |
|.o * =. o |
|=.+ o o |
|oo . . |
| |
+----[SHA256]-----+
[hadoop@node01 ~]$ ssh-copy-id node01
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'node01 (192.168.0.41)' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@node01's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'node01'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@node01 ~]$ scp -r ./.ssh node02:/home/hadoop/
The authenticity of host 'node02 (192.168.0.42)' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node02,192.168.0.42' (ECDSA) to the list of known hosts.
hadoop@node02's password:
id_rsa                 100% 1679   636.9KB/s   00:00
id_rsa.pub             100%  404   186.3KB/s   00:00
known_hosts            100%  362   153.4KB/s   00:00
authorized_keys        100%  404   203.9KB/s   00:00
[hadoop@node01 ~]$ scp -r ./.ssh node03:/home/hadoop/
The authenticity of host 'node03 (192.168.0.43)' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node03,192.168.0.43' (ECDSA) to the list of known hosts.
hadoop@node03's password:
id_rsa                 100% 1679   755.1KB/s   00:00
id_rsa.pub             100%  404   165.7KB/s   00:00
known_hosts            100%  543   350.9KB/s   00:00
authorized_keys        100%  404   330.0KB/s   00:00
[hadoop@node01 ~]$ scp -r ./.ssh node04:/home/hadoop/
The authenticity of host 'node04 (192.168.0.44)' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node04,192.168.0.44' (ECDSA) to the list of known hosts.
hadoop@node04's password:
id_rsa                 100% 1679   707.0KB/s   00:00
id_rsa.pub             100%  404   172.8KB/s   00:00
known_hosts            100%  724   437.7KB/s   00:00
authorized_keys        100%  404   165.2KB/s   00:00
[hadoop@node01 ~]$
Verify: connect from node01 to node02, node03 and node04 and check that no password is requested
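One way to script this check is sketched below; it is not the check the text originally showed. It relies on ssh's BatchMode option, which makes ssh fail instead of prompting, so a node that still wants a password is reported as FAILED. The function name check_passwordless and the SSH override variable are invented for this example (the override only exists so the loop can be dry-run).

```shell
#!/bin/sh
# check_passwordless NODE...: try a non-interactive login to each node.
# BatchMode=yes makes ssh fail rather than ask for a password.
# SSH is a hypothetical override so the loop can be exercised without ssh.
SSH=${SSH:-ssh}

check_passwordless() {
    for node in "$@"; do
        if $SSH -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}

# Usage as the hadoop user on node01:
#   check_passwordless node02 node03 node04
```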
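Create the data directories /data/hadoop/hdfs/{nn,snn,dn} and change their owner and group to hadoop
Go to the Hadoop installation directory, create its logs directory, and change the owner and group of the installation directory to hadoop
Note: all of the steps above must be carried out on every node in turn.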
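The two directory steps can be sketched as follows. The helper name make_hdfs_dirs is made up, and the ownership commands are shown only as comments because they need root and an existing hadoop user; the paths are the ones named in the text, with /bigdata/hadoop-2.9.2 assumed to be the real directory behind the hadoop symlink.

```shell
#!/bin/sh
# make_hdfs_dirs ROOT: create the nn, snn and dn directories under ROOT.
# Hypothetical helper; on the nodes ROOT would be /data/hadoop/hdfs.
make_hdfs_dirs() {
    for d in nn snn dn; do
        mkdir -p "$1/$d" || return 1
    done
}

# Usage on each node (as root), followed by the ownership changes:
#   make_hdfs_dirs /data/hadoop/hdfs
#   chown -R hadoop:hadoop /data/hadoop/hdfs
#   mkdir -p /bigdata/hadoop/logs
#   chown -R hadoop:hadoop /bigdata/hadoop-2.9.2
```

The last chown targets the real directory rather than the symlink, since a recursive chown does not follow a symlink given on the command line by default.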
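Configure Hadoop's core-site.xml
Note: Hadoop's configuration files are XML. <property> and </property> form a pair of tags; inside them the name tag gives the key of the option being set and the value tag gives that key's value. The configuration above sets the default file system address: hdfs://node01:8020 is the address at which the HDFS file system is reached.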
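To make the property layout described in this note concrete, here is a small sketch that prints one property element from a key and a value. The function render_property is purely illustrative and not part of Hadoop; it does no XML escaping, so the name and value must not contain <, > or &.

```shell
#!/bin/sh
# render_property NAME VALUE: print one <property> element in the layout
# used by Hadoop's *-site.xml files. Hypothetical helper, no XML escaping.
render_property() {
    printf '  <property>\n'
    printf '    <name>%s</name>\n' "$1"
    printf '    <value>%s</value>\n' "$2"
    printf '  </property>\n'
}

echo '<configuration>'
render_property fs.defaultFS hdfs://node01:8020
echo '</configuration>'
```

On a node with Hadoop installed, hdfs getconf -confKey fs.defaultFS prints the value that is actually in effect, which is a handy cross-check after editing the file.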
Complete configuration
[root@node01 hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://node01:8020</value>
                <final>true</final>
        </property>
</configuration>
[root@node01 hadoop]#
Configure hdfs-site.xml
Note: the configuration above mainly sets the HDFS directories, the web UI addresses and ports, and the replica count.
Complete configuration
[root@node01 hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///data/hadoop/hdfs/nn</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>node01:50090</value>
        </property>
        <property>
                <name>dfs.namenode.http-address</name>
                <value>node01:50070</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///data/hadoop/hdfs/dn</value>
        </property>
        <property>
                <name>fs.checkpoint.dir</name>
                <value>file:///data/hadoop/hdfs/snn</value>
        </property>
        <property>
                <name>fs.checkpoint.edits.dir</name>
                <value>file:///data/hadoop/hdfs/snn</value>
        </property>
</configuration>
[root@node01 hadoop]#
Configure mapred-site.xml
Note: the configuration above sets yarn as the MapReduce framework. There is no mapred-site.xml by default; it has to be created from mapred-site.xml.template. I created it above by copying the template as root, so the copy is owned by root; do not forget to change its owner and group to hadoop.
Complete configuration
[root@node01 hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>
[root@node01 hadoop]#
Configure yarn-site.xml
Note: the configuration above mainly sets the addresses of the YARN rm and nm services and names the related classes.
Complete configuration
[root@node01 hadoop]# cat yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
        <property>
                <name>yarn.resourcemanager.address</name>
                <value>node01:8032</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.address</name>
                <value>node01:8030</value>
        </property>
        <property>
                <name>yarn.resourcemanager.resource-tracker.address</name>
                <value>node01:8031</value>
        </property>
        <property>
                <name>yarn.resourcemanager.admin.address</name>
                <value>node01:8033</value>
        </property>
        <property>
                <name>yarn.resourcemanager.webapp.address</name>
                <value>node01:8088</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
                <name>yarn.resourcemanager.scheduler.class</name>
                <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        </property>
</configuration>
[root@node01 hadoop]#
Configure the slaves file
[root@node01 hadoop]# cat slaves
node02
node03
node04
[root@node01 hadoop]#
Copy the configuration files to the other nodes
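The copy commands are not shown in the text, so the loop below is only a sketch of one way to do it. The function name push_config and the SCP override variable are made up (setting SCP=echo prints the commands instead of running them), and /bigdata/hadoop/etc/hadoop is assumed as the configuration directory, which follows from HADOOP_HOME=/bigdata/hadoop.

```shell
#!/bin/sh
# push_config CONF_DIR NODE...: copy the edited configuration files to the
# same directory on each node. SCP is a hypothetical override so the
# commands can be printed instead of executed (SCP=echo).
SCP=${SCP:-scp}

push_config() {
    conf_dir=$1
    shift
    for node in "$@"; do
        $SCP "$conf_dir/core-site.xml" "$conf_dir/hdfs-site.xml" \
             "$conf_dir/mapred-site.xml" "$conf_dir/yarn-site.xml" \
             "$conf_dir/slaves" "$node:$conf_dir/" || return 1
    done
}

# Usage as the hadoop user on node01:
#   push_config /bigdata/hadoop/etc/hadoop node02 node03 node04
```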
The Hadoop configuration is now complete.
Next, switch to the hadoop user and initialize HDFS
hdfs namenode -format
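Note: if hdfs namenode -format prints the message highlighted in the red box, HDFS was formatted successfully.
Start the HDFS cluster
Note: HDFS consists mainly of the namenode, the secondarynamenode and the datanodes; as long as the corresponding processes are running on the corresponding nodes, things are basically fine.
The HDFS cluster is now up and running
Verify: upload /etc/passwd to the /test directory on HDFS and check that the upload works
Note: /etc/passwd has been uploaded to the /test directory on HDFS.
Verify: view the passwd file under /test on HDFS and check that its content matches /etc/passwd
Note: the content of /test/passwd on HDFS matches /etc/passwd.
Verify: on a dn node, look at the file content in the data directory and check that it matches /etc/passwd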
[root@node02 ~]# tree /data
/data
└── hadoop
    └── hdfs
        ├── dn
        │   ├── current
        │   │   ├── BP-157891879-192.168.0.41-1601224158145
        │   │   │   ├── current
        │   │   │   │   ├── finalized
        │   │   │   │   │   └── subdir0
        │   │   │   │   │       └── subdir0
        │   │   │   │   │           ├── blk_1073741825
        │   │   │   │   │           └── blk_1073741825_1001.meta
        │   │   │   │   ├── rbw
        │   │   │   │   └── VERSION
        │   │   │   ├── scanner.cursor
        │   │   │   └── tmp
        │   │   └── VERSION
        │   └── in_use.lock
        ├── nn
        └── snn

13 directories, 6 files
[root@node02 ~]# cat /data/hadoop/hdfs/dn/current/BP-157891879-192.168.0.41-1601224158145/
current/        scanner.cursor  tmp/
[root@node02 ~]# cat /data/hadoop/hdfs/dn/current/BP-157891879-192.168.0.41-1601224158145/current/finalized/subdir0/subdir0/blk_1073741825
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
polkitd:x:999:997:User for polkitd:/:/sbin/nologin
postfix:x:89:89::/var/spool/postfix:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
tcpdump:x:72:72::/:/sbin/nologin
chrony:x:998:996::/var/lib/chrony:/sbin/nologin
hadoop:x:1000:1000::/home/hadoop:/bin/bash
[root@node02 ~]#
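Note: the uploaded passwd file can be found under the dn directory on the dn node.
Verify: do the other nodes hold the same file, and are there as many replicas as configured?
Note: node03 and node04 have the same directories and files, which shows that the replica count of 3 has taken effect.
Start the YARN cluster
Note: the nm has started on each worker node, and the rm has started on the master node.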
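With both clusters started, each node should be running the daemons listed in the environment table. The sketch below checks jps output against that list; the function name missing_daemons and the role names "master" (node01) and "worker" (node02 to node04) are invented for this example, and it assumes jps prints one "pid ClassName" pair per line.

```shell
#!/bin/sh
# missing_daemons ROLE: read `jps` output on stdin and print every daemon
# expected for ROLE that is not listed. ROLE is "master" (node01) or
# "worker" (node02-node04); both names are made up for this sketch.
missing_daemons() {
    case $1 in
        master) expected="NameNode SecondaryNameNode ResourceManager" ;;
        worker) expected="DataNode NodeManager" ;;
        *) echo "unknown role: $1" >&2; return 2 ;;
    esac
    listing=$(cat)
    for daemon in $expected; do
        # jps prints "<pid> <class name>", so match the second column exactly.
        printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}

# Usage on a node:
#   jps | missing_daemons worker
```

Empty output means everything expected for that role is running; any name printed is a daemon to look for in the logs directory.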
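Open ports 50070 and 8088 on the nn host and check that the web pages load
Note: port 50070 is the HDFS web address. This page shows the storage state of HDFS and lets you operate on the files stored there.
Note: port 8088 is the management address of the YARN cluster. This page shows the status of running jobs, the cluster configuration, logs, and so on.
Verify: run a job on YARN that counts the words in /test/passwd and check that it runs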
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount
Usage: wordcount <in> [<in>...] <out>
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /test/passwd /test/passwd-word-count
20/09/28 00:58:01 INFO client.RMProxy: Connecting to ResourceManager at node01/192.168.0.41:8032
20/09/28 00:58:01 INFO input.FileInputFormat: Total input files to process : 1
20/09/28 00:58:01 INFO mapreduce.JobSubmitter: number of splits:1
20/09/28 00:58:01 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/09/28 00:58:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1601224871685_0001
20/09/28 00:58:02 INFO impl.YarnClientImpl: Submitted application application_1601224871685_0001
20/09/28 00:58:02 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1601224871685_0001/
20/09/28 00:58:02 INFO mapreduce.Job: Running job: job_1601224871685_0001
20/09/28 00:58:08 INFO mapreduce.Job: Job job_1601224871685_0001 running in uber mode : false
20/09/28 00:58:08 INFO mapreduce.Job:  map 0% reduce 0%
20/09/28 00:58:14 INFO mapreduce.Job:  map 100% reduce 0%
20/09/28 00:58:20 INFO mapreduce.Job:  map 100% reduce 100%
20/09/28 00:58:20 INFO mapreduce.Job: Job job_1601224871685_0001 completed successfully
20/09/28 00:58:20 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=1144
                FILE: Number of bytes written=399079
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1053
                HDFS: Number of bytes written=1018
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2753
                Total time spent by all reduces in occupied slots (ms)=2779
                Total time spent by all map tasks (ms)=2753
                Total time spent by all reduce tasks (ms)=2779
                Total vcore-milliseconds taken by all map tasks=2753
                Total vcore-milliseconds taken by all reduce tasks=2779
                Total megabyte-milliseconds taken by all map tasks=2819072
                Total megabyte-milliseconds taken by all reduce tasks=2845696
        Map-Reduce Framework
                Map input records=22
                Map output records=30
                Map output bytes=1078
                Map output materialized bytes=1144
                Input split bytes=95
                Combine input records=30
                Combine output records=30
                Reduce input groups=30
                Reduce shuffle bytes=1144
                Reduce input records=30
                Reduce output records=30
                Spilled Records=60
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=87
                CPU time spent (ms)=620
                Physical memory (bytes) snapshot=444997632
                Virtual memory (bytes) snapshot=4242403328
                Total committed heap usage (bytes)=285212672
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=958
        File Output Format Counters
                Bytes Written=1018
[hadoop@node01 hadoop]$
View the report generated by the job
[hadoop@node01 hadoop]$ hdfs dfs -ls -R /test
-rw-r--r--   3 hadoop supergroup        958 2020-09-28 00:32 /test/passwd
drwxr-xr-x   - hadoop supergroup          0 2020-09-28 00:58 /test/passwd-word-count
-rw-r--r--   3 hadoop supergroup          0 2020-09-28 00:58 /test/passwd-word-count/_SUCCESS
-rw-r--r--   3 hadoop supergroup       1018 2020-09-28 00:58 /test/passwd-word-count/part-r-00000
[hadoop@node01 hadoop]$ hdfs dfs -cat /test/passwd-word-count/part-r-00000
Management:/:/sbin/nologin      1
Network 1
SSH:/var/empty/sshd:/sbin/nologin       1
User:/var/ftp:/sbin/nologin     1
adm:x:3:4:adm:/var/adm:/sbin/nologin    1
bin:x:1:1:bin:/bin:/sbin/nologin        1
bus:/:/sbin/nologin     1
chrony:x:998:996::/var/lib/chrony:/sbin/nologin 1
daemon:x:2:2:daemon:/sbin:/sbin/nologin 1
dbus:x:81:81:System     1
for     1
ftp:x:14:50:FTP 1
games:x:12:100:games:/usr/games:/sbin/nologin   1
hadoop:x:1000:1000::/home/hadoop:/bin/bash      1
halt:x:7:0:halt:/sbin:/sbin/halt        1
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin        1
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin  1
message 1
nobody:x:99:99:Nobody:/:/sbin/nologin   1
ntp:x:38:38::/etc/ntp:/sbin/nologin     1
operator:x:11:0:operator:/root:/sbin/nologin    1
polkitd:/:/sbin/nologin 1
polkitd:x:999:997:User  1
postfix:x:89:89::/var/spool/postfix:/sbin/nologin       1
root:x:0:0:root:/root:/bin/bash 1
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown    1
sshd:x:74:74:Privilege-separated        1
sync:x:5:0:sync:/sbin:/bin/sync 1
systemd-network:x:192:192:systemd       1
tcpdump:x:72:72::/:/sbin/nologin        1
[hadoop@node01 hadoop]$
Check the job's status on the 8088 page
The Hadoop v2 cluster is now fully set up.