Hadoop Enterprise Development Scenario Case: Virtual Machine Server Tuning

Posted by ¥小猿¥ on 2021-03-16

Hadoop Enterprise Development Scenario Case

1 Case Requirements

(1) Requirement: count the number of occurrences of each word in 1 GB of data, on 3 servers, each with 4 GB of memory and a 4-core, 4-thread CPU.

(2) Requirement analysis:

1 GB / 128 MB = 8 MapTasks; 1 ReduceTask; 1 mrAppMaster; 10 tasks in total.

On average, 10 tasks / 3 nodes ≈ 3 tasks per node (distributed as 4, 3, 3).
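The sizing arithmetic above can be checked with a quick shell sketch (assuming the default 128 MB HDFS block size):

```shell
# Sizing math for the case above: 1 GB of input, 128 MB block size.
input_mb=1024
block_mb=128
map_tasks=$(( input_mb / block_mb ))      # 1024 / 128 = 8 MapTasks
total_tasks=$(( map_tasks + 1 + 1 ))      # plus 1 ReduceTask and 1 mrAppMaster
echo "$map_tasks $total_tasks"            # prints "8 10"
```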

2 HDFS Parameter Tuning

(1) Modify hadoop-env.sh

export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS -Xmx1024m"
export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS -Xmx1024m"

(2) Modify hdfs-site.xml

<!-- The NameNode has a pool of worker threads; the default is 10 -->
<property>
	<name>dfs.namenode.handler.count</name>
	<value>21</value>
</property>
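The value 21 follows a common rule of thumb of setting `dfs.namenode.handler.count` to roughly 20 × ln(cluster size); for this 3-node cluster:

```shell
# Rule of thumb: dfs.namenode.handler.count ≈ 20 * ln(cluster size).
# For 3 nodes: 20 * ln(3) ≈ 21.97, truncated to 21.
handler_count=$(awk 'BEGIN { printf "%d", 20 * log(3) }')
echo "$handler_count"   # prints "21"
```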

(3) Modify core-site.xml

<!-- Set the trash retention time to 60 minutes -->
<property>
	<name>fs.trash.interval</name>
	<value>60</value>
</property>

(4) Distribute the configuration to all three servers

rsync -av <file-to-distribute> <user>@<hostname>:<destination-directory>
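As a concrete sketch, the distribution step could look like the loop below. The host names hadoop103/hadoop104, the user `hadoop`, and the Hadoop path are illustrative assumptions, not values from this article; substitute your own. The `echo` makes it a dry run:

```shell
# Dry-run sketch of distributing one config file to the other nodes.
# Hosts, user name, and target path are assumptions; remove "echo" to
# perform the actual copy.
distribute() {
  local file="$1"
  local host
  for host in hadoop103 hadoop104; do
    echo rsync -av "$file" "hadoop@${host}:/opt/module/hadoop-3.1.3/etc/hadoop/"
  done
}

distribute hdfs-site.xml
```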

3 MapReduce Parameter Tuning

(1) Modify mapred-site.xml

<!-- Size of the circular sort buffer; default 100 MB -->
<property>
	<name>mapreduce.task.io.sort.mb</name>
	<value>100</value>
</property>

<!-- Spill threshold of the circular buffer; default 0.8 -->
<property>
	<name>mapreduce.map.sort.spill.percent</name>
	<value>0.80</value>
</property>

<!-- Number of streams merged at once; default 10 -->
<property>
	<name>mapreduce.task.io.sort.factor</name>
	<value>10</value>
</property>

<!-- MapTask memory; default 1 GB. The MapTask heap size (mapreduce.map.java.opts) defaults to match this value -->
<property>
	<name>mapreduce.map.memory.mb</name>
	<value>-1</value>
	<description>
	The amount of memory to request from the scheduler for each map task. If this is not specified or is non-positive, it is inferred from mapreduce.map.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
	</description>
</property>

<!-- Number of CPU cores per MapTask; default 1 -->
<property>
	<name>mapreduce.map.cpu.vcores</name>
	<value>1</value>
</property>

<!-- Maximum retry attempts for a failed MapTask; default 4 -->
<property>
	<name>mapreduce.map.maxattempts</name>
	<value>4</value>
</property>

<!-- Number of parallel copies each Reduce uses to fetch data from the Maps; default 5 -->
<property>
	<name>mapreduce.reduce.shuffle.parallelcopies</name>
	<value>5</value>
</property>

<!-- Fraction of Reduce memory allocated to the shuffle buffer; default 0.7 -->
<property>
	<name>mapreduce.reduce.shuffle.input.buffer.percent</name>
	<value>0.70</value>
</property>

<!-- Fraction of the buffer at which data starts spilling to disk; default 0.66 -->
<property>
	<name>mapreduce.reduce.shuffle.merge.percent</name>
	<value>0.66</value>
</property>

<!-- ReduceTask memory; default 1 GB. The ReduceTask heap size (mapreduce.reduce.java.opts) defaults to match this value -->
<property>
	<name>mapreduce.reduce.memory.mb</name>
	<value>-1</value>
	<description>The amount of memory to request from the scheduler for each reduce task. If this is not specified or is non-positive, it is inferred from mapreduce.reduce.java.opts and mapreduce.job.heap.memory-mb.ratio. If java-opts are also not specified, we set it to 1024.
	</description>
</property>

<!-- Number of CPU cores per ReduceTask; default 1, raised to 2 here -->
<property>
	<name>mapreduce.reduce.cpu.vcores</name>
	<value>2</value>
</property>

<!-- Maximum retry attempts for a failed ReduceTask; default 4 -->
<property>
	<name>mapreduce.reduce.maxattempts</name>
	<value>4</value>
</property>

<!-- Fraction of MapTasks that must complete before resources are requested for ReduceTasks; default 0.05 -->
<property>
	<name>mapreduce.job.reduce.slowstart.completedmaps</name>
	<value>0.05</value>
</property>

<!-- If a task reads no data within this timeout (default 10 minutes), it is forcibly terminated -->
<property>
	<name>mapreduce.task.timeout</name>
	<value>600000</value>
</property>
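As a sanity check, the timeout is given in milliseconds, so 600000 ms is the 10 minutes the comment refers to:

```shell
# mapreduce.task.timeout is in milliseconds.
timeout_ms=600000
echo $(( timeout_ms / 1000 / 60 ))   # prints "10" (minutes)
```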

(2) Distribute the configuration file to the servers

rsync -av <file-to-distribute> <user>@<hostname>:<destination-directory>

4 YARN Parameter Tuning

(1) Modify yarn-site.xml

<!-- Scheduler choice; the Capacity Scheduler is the default -->
<property>
	<description>The class to use as the resource scheduler.</description>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. If more than 50 jobs are submitted this can be increased, but not beyond 3 nodes * 4 threads = 12 (in practice no more than 8, leaving threads for other processes) -->
<property>
	<description>Number of threads to handle scheduler interface.</description>
	<name>yarn.resourcemanager.scheduler.client.thread-count</name>
	<value>8</value>
</property>

<!-- Whether YARN auto-detects the hardware for its configuration; default false. If the node runs many other applications, manual configuration is recommended; if not, auto-detection can be used -->
<property>
	<description>Enable auto-detection of node capabilities such as memory and CPU.</description>
	<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
	<value>false</value>
</property>

<!-- Whether logical processors (hyperthreads) count as CPU cores; default false, i.e. physical cores are used -->
<property>
	<description>Flag to determine if logical processors(such as hyperthreads) should be counted as cores. Only applicable on Linux when yarn.nodemanager.resource.cpu-vcores is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true.
	</description>
	<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
	<value>false</value>
</property>

<!-- Multiplier for converting physical cores to vcores; default 1.0 -->
<property>
	<description>Multiplier to determine how to convert physical cores to vcores. This value is used if yarn.nodemanager.resource.cpu-vcores is set to -1 (which implies auto-calculate vcores) and yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.
	</description>
	<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
	<value>1.0</value>
</property>

<!-- Memory available to the NodeManager; default 8 GB, reduced to 4 GB here -->
<property>
	<description>Amount of physical memory, in MB, that can be allocated for containers. If set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is automatically calculated(in case of Windows and Linux). In other cases, the default is 8192MB.
	</description>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>4096</value>
</property>

<!-- NodeManager CPU cores; defaults to 8 when not auto-detected from the hardware, reduced to 4 here -->
<property>
	<description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of CPUs used by YARN containers. If it is set to -1 and yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically determined from the hardware in case of Windows and Linux. In other cases, number of vcores is 8 by default.
	</description>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>4</value>
</property>

<!-- Minimum container memory; default 1 GB -->
<property>
	<description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have
less memory than this value will be shut down by the resource manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-mb</name>
	<value>1024</value>
</property>

<!-- Maximum container memory; default 8 GB, reduced to 2 GB here -->
<property>
	<description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
	</description>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>2048</value>
</property>

<!-- Minimum container CPU cores; default 1 -->
<property>
	<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the
resource manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-vcores</name>
	<value>1</value>
</property>

<!-- Maximum container CPU cores; default 4, reduced to 2 here -->
<property>
	<description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.
	</description>
	<name>yarn.scheduler.maximum-allocation-vcores</name>
	<value>2</value>
</property>

<!-- Virtual-memory check; enabled by default, disabled here -->
<property>
	<description>Whether virtual memory limits will be enforced for containers.</description>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>

<!-- Ratio of virtual to physical memory; default 2.1 -->
<property>
	<description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.
	</description>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
	<value>2.1</value>
</property>
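A quick consistency check of the numbers above: with 4 GB per NodeManager and container sizes between 1 GB and 2 GB, each node can run 2 to 4 containers, which matches the 4/3/3 task distribution estimated in the case analysis.

```shell
# Containers per node implied by the YARN settings above.
nm_mem=4096       # yarn.nodemanager.resource.memory-mb
min_alloc=1024    # yarn.scheduler.minimum-allocation-mb
max_alloc=2048    # yarn.scheduler.maximum-allocation-mb
echo "$(( nm_mem / min_alloc )) containers at minimum size"   # 4
echo "$(( nm_mem / max_alloc )) containers at maximum size"   # 2
```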

(2) Distribute the configuration file to the servers

rsync -av <file-to-distribute> <user>@<hostname>:<destination-directory>

5 Running the Program

(1) Restart the cluster

sbin/stop-yarn.sh
sbin/start-yarn.sh

(2) Run the WordCount program

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wcinput /wcoutput
	Note: run this command from the Hadoop installation directory. /wcinput is the directory holding the 1 GB of data to count, and /wcoutput is the directory where the results are written.
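The wordcount example writes one `word<TAB>count` pair per line, sorted by key. As an illustration of that format, here is a local simulation with standard shell tools (not a Hadoop command):

```shell
# Simulate wordcount's output format on a tiny input.
printf 'hadoop spark hadoop\nspark hadoop\n' \
  | tr ' ' '\n' | sort | uniq -c \
  | awk '{ printf "%s\t%s\n", $2, $1 }'
# hadoop	3
# spark	2
```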

(3) Watch the job on the YARN web UI

URL: hadoop103:8088

(4) Results

Original contents of /wcinput/work.txt:

Result: the output directory /wcoutput is generated.
