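HDFS Balancer Strategy Explained
[Preface] In large Hadoop clusters that have been running in production for a long time, disk usage across datanodes often becomes unevenly distributed, especially after nodes are added or decommissioned, or after replica counts are changed manually. Uneven usage forces compute engines to copy data across nodes frequently (the data needed by a task running on node A sits on other nodes), which costs unnecessary time and bandwidth. In addition, when some nodes are heavily used but not yet full (around 90%), tasks scheduled on those nodes risk failing. A balancing strategy that evens out space usage across the nodes of the cluster is therefore essential.
I. The balancer command in detail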
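hdfs --config /hadoop-client/conf balancer \
    -threshold 10 \
    -policy datanode \
    -exclude -f /tmp/ip1.txt \
    -include -f /tmp/ip2.txt \
    -idleiterations 5
Options:
  -threshold 10: the balancing condition, that is, the allowed deviation of a datanode's disk utilization from the cluster average, as a percentage between 0 and 100
  -policy datanode: defaults to datanode, which balances at the datanode level
  -exclude -f /tmp/ip1.txt: empty by default; the listed IPs do not take part in balancing (-f means the input is a file)
  -include -f /tmp/ip2.txt: empty by default; only the listed IPs take part in balancing (-f means the input is a file)
  -idleiterations 5: the number of consecutive iterations without any block movement after which the balancer exits; defaults to 5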
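Bandwidth limit for data migration between datanodes during balancing (set in /hadoop-client/conf/hdfs-site.xml; changing it requires an HDFS restart):
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>6250000</value>
</property>
Note: 6250000 / (1024 * 1024) ≈ 6 MB/s
Raising the bandwidth dynamically (no restart needed; run it as the hdfs user, and do not set it too high, or it will take bandwidth away from MapReduce jobs):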
hdfs dfsadmin -fs hdfs://${active-namenode-hostname}:8020 -setBalancerBandwidth 104857600
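To sanity-check the two bandwidth values above, a minimal Java sketch can convert bytes per second into MiB per second. The class and method names are made up for illustration and are not part of Hadoop.

```java
// Illustrative helper, not part of Hadoop: converts a balancer bandwidth
// value given in bytes per second into MiB per second.
final class BalancerBandwidth {
    static double toMiBPerSec(long bytesPerSec) {
        return bytesPerSec / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Value from hdfs-site.xml above: roughly 5.96 MiB/s.
        System.out.println(toMiBPerSec(6_250_000L));
        // Value passed to -setBalancerBandwidth above: exactly 100 MiB/s.
        System.out.println(toMiBPerSec(104_857_600L));
    }
}
```

So the dynamic setting in the example raises the limit from about 6 MiB/s to 100 MiB/s.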
The balancer script exits automatically as soon as any one of the following conditions is met:
* The cluster is balanced;
* No block can be moved;
* No block has been moved for specified consecutive iterations (5 by default);
* An IOException occurs while communicating with the namenode;
* Another balancer is running.
II. Source code walkthrough
Source package: org.apache.hadoop.hdfs.server.balancer
Selecting the datanodes that take part in balancing:
private boolean shouldIgnore(DatanodeInfo dn) {
  // ignore decommissioned nodes
  final boolean decommissioned = dn.isDecommissioned();
  // ignore decommissioning nodes (decommission still in progress)
  final boolean decommissioning = dn.isDecommissionInProgress();
  // ignore nodes in exclude list (datanodes passed with -exclude)
  final boolean excluded = Util.isExcluded(excludedNodes, dn);
  // ignore nodes not in the include list (if include list is not empty,
  // that is, when -include was given)
  final boolean notIncluded = !Util.isIncluded(includedNodes, dn);

  if (decommissioned || decommissioning || excluded || notIncluded) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Excluding datanode " + dn
          + ": decommissioned=" + decommissioned
          + ", decommissioning=" + decommissioning
          + ", excluded=" + excluded
          + ", notIncluded=" + notIncluded);
    }
    return true;
  }
  return false;
}
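As a rough illustration of the filter above, here is a minimal, self-contained sketch. It assumes hosts are plain strings and that an empty include list means every node is allowed, as the comment in the source suggests. NodeFilterSketch and its parameters are made-up names, not Hadoop APIs.

```java
import java.util.Set;

// Simplified stand-in for the balancer's node filter. Hosts are plain strings
// here; the real code works on DatanodeInfo objects.
final class NodeFilterSketch {
    static boolean shouldIgnore(String host, boolean decommissioned, boolean decommissioning,
                                Set<String> excluded, Set<String> included) {
        boolean isExcluded = excluded.contains(host);
        // Assumption: an empty include list places no restriction on the node.
        boolean notIncluded = !included.isEmpty() && !included.contains(host);
        return decommissioned || decommissioning || isExcluded || notIncluded;
    }

    public static void main(String[] args) {
        Set<String> excluded = Set.of("10.0.0.1"); // hypothetical -exclude list
        Set<String> included = Set.of();           // -include not given
        System.out.println(shouldIgnore("10.0.0.1", false, false, excluded, included)); // prints true
        System.out.println(shouldIgnore("10.0.0.2", false, false, excluded, included)); // prints false
    }
}
```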
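Cluster average utilization (formula): average = totalUsedSpaces * 100 / totalCapacities
totalUsedSpaces: the sum of the space used on every datanode (dfsUsed, which excludes non-DFS used space);
totalCapacities: the sum of every datanode's total capacity (the server disk directories configured for the DataNode);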
void initAvgUtilization() {
  for(StorageType t : StorageType.asList()) {
    final long capacity = totalCapacities.get(t);
    if (capacity > 0L) {
      final double avg = totalUsedSpaces.get(t)*100.0/capacity;
      avgUtilizations.set(t, avg);
    }
  }
}
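Utilization of a single datanode: utilization = dfsUsed * 100.0 / capacity
dfsUsed: the DFS space used on this datanode (dfsUsed, which excludes non-DFS used space);
capacity: the total capacity of this datanode (the server disk directories configured for the DataNode);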
Double getUtilization(DatanodeStorageReport r, final StorageType t) {
  long capacity = 0L;
  long dfsUsed = 0L;
  for(StorageReport s : r.getStorageReports()) {
    if (s.getStorage().getStorageType() == t) {
      capacity += s.getCapacity();
      dfsUsed += s.getDfsUsed();
    }
  }
  return capacity == 0L? null: dfsUsed*100.0/capacity;
}
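A small worked example may make the two formulas concrete. The sketch below uses made-up capacities and usage figures for three datanodes and a single storage type. UtilizationSketch is an illustrative class, not Hadoop code.

```java
// Worked example with made-up numbers; a single storage type is assumed.
final class UtilizationSketch {
    /** Returns dfsUsed as a percentage of capacity, or null if there is no capacity. */
    static Double utilization(long dfsUsed, long capacity) {
        return capacity == 0L ? null : dfsUsed * 100.0 / capacity;
    }

    /** Cluster average: total dfsUsed over total capacity, as a percentage.
     *  Assumes the total capacity is greater than zero. */
    static double average(long[] dfsUsed, long[] capacity) {
        long totalUsed = 0L;
        long totalCapacity = 0L;
        for (int i = 0; i < capacity.length; i++) {
            totalUsed += dfsUsed[i];
            totalCapacity += capacity[i];
        }
        return totalUsed * 100.0 / totalCapacity;
    }

    public static void main(String[] args) {
        long[] capacity = {1000L, 1000L, 2000L}; // hypothetical capacities (GB)
        long[] dfsUsed  = {900L, 500L, 600L};    // hypothetical DFS usage (GB)
        // 2000 * 100 / 4000 = 50.0
        System.out.println("average = " + average(dfsUsed, capacity));
        for (int i = 0; i < capacity.length; i++) {
            // Per-node utilizations are 90.0, 50.0 and 30.0.
            System.out.println("node " + i + " = " + utilization(dfsUsed[i], capacity[i]));
        }
    }
}
```

Note that the cluster average (50%) is weighted by capacity, so it differs from the plain mean of the three node utilizations (about 56.7%).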
Difference between a single datanode's utilization and the cluster average: utilizationDiff = utilization - average
Difference between a single datanode's utilizationDiff and the threshold: thresholdDiff = |utilizationDiff| - threshold
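Space that needs to be moved out or can be moved in: maxSize2Move = |utilizationDiff| * capacity / 100 (utilizationDiff is a percentage)
Space that can be moved in: Math.min(remaining, maxSize2Move)
Space that needs to be moved out: Math.min(max, maxSize2Move)
remaining: the free space left on the datanode
max: the upper limit on how much a single datanode may move in one balancer iteration (10 GB by default)
With the default of 5 iterations, the author estimates that a single run of the balancer script moves at most 5 * 10 GB = 50 GB per datanode. Note that, per the exit conditions in section I, the value 5 counts consecutive iterations in which no block was moved, so a run that keeps making progress can continue beyond five iterations.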
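The size computation can be sketched as follows. This is reconstructed from the formulas above rather than copied from the Hadoop source, and it assumes that the per-iteration cap applies to targets as well as sources. MaxSizeSketch and the sample numbers are illustrative.

```java
// Sketch of the size computation described above. The 10 GB cap is the default
// stated in the text; capacities and free space are made up.
final class MaxSizeSketch {
    static final long GB = 1024L * 1024L * 1024L;

    /** Converts a percentage of capacity into bytes. */
    static long percentageToBytes(double percentage, long capacity) {
        return (long) (percentage * capacity / 100.0);
    }

    static long maxSize2Move(long capacity, long remaining, double utilizationDiff, long max) {
        long size = percentageToBytes(Math.abs(utilizationDiff), capacity);
        if (utilizationDiff < 0) {
            // A target cannot receive more than its free space.
            size = Math.min(remaining, size);
        }
        // Assumption: sources and targets are both capped by the per-iteration limit.
        return Math.min(max, size);
    }

    public static void main(String[] args) {
        long capacity = 1000L * GB;
        // Source 40 points above average: 400 GB over, capped at 10 GB. Prints 10.
        System.out.println(maxSize2Move(capacity, 100L * GB, 40.0, 10L * GB) / GB);
        // Target 0.5 points below average: 5 GB, below the cap. Prints 5.
        System.out.println(maxSize2Move(capacity, 700L * GB, -0.5, 10L * GB) / GB);
    }
}
```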
for(DatanodeStorageReport r : reports) {
  final DDatanode dn = dispatcher.newDatanode(r.getDatanodeInfo());
  final boolean isSource = Util.isIncluded(sourceNodes, dn.getDatanodeInfo());
  for(StorageType t : StorageType.getMovableTypes()) {
    final Double utilization = policy.getUtilization(r, t);
    if (utilization == null) { // datanode does not have such storage type
      continue;
    }

    final double average = policy.getAvgUtilization(t);
    if (utilization >= average && !isSource) {
      LOG.info(dn + "[" + t + "] has utilization=" + utilization
          + " >= average=" + average
          + " but it is not specified as a source; skipping it.");
      continue;
    }

    final double utilizationDiff = utilization - average;
    final long capacity = getCapacity(r, t);
    final double thresholdDiff = Math.abs(utilizationDiff) - threshold;
    final long maxSize2Move = computeMaxSize2Move(capacity,
        getRemaining(r, t), utilizationDiff, maxSizeToMove);

    final StorageGroup g;
    if (utilizationDiff > 0) {
      final Source s = dn.addSource(t, maxSize2Move, dispatcher);
      if (thresholdDiff <= 0) { // within threshold
        aboveAvgUtilized.add(s);
      } else {
        overLoadedBytes += percentage2bytes(thresholdDiff, capacity);
        overUtilized.add(s);
      }
      g = s;
    } else {
      g = dn.addTarget(t, maxSize2Move);
      if (thresholdDiff <= 0) { // within threshold
        belowAvgUtilized.add(g);
      } else {
        underLoadedBytes += percentage2bytes(thresholdDiff, capacity);
        underUtilized.add(g);
      }
    }
    dispatcher.getStorageGroupMap().put(g);
  }
}
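Queues a datanode is placed in after the comparison:
overUtilized: utilizationDiff > 0 && thresholdDiff > 0 (utilization is above the average and the gap is greater than the threshold)
aboveAvgUtilized: utilizationDiff > 0 && thresholdDiff <= 0 (utilization is above the average and the gap is less than or equal to the threshold)
belowAvgUtilized: utilizationDiff <= 0 && thresholdDiff <= 0 (utilization is not above the average and the gap is less than or equal to the threshold)
underUtilized: utilizationDiff <= 0 && thresholdDiff > 0 (utilization is below the average and the gap is greater than the threshold)
Nodes that need to move data out (Source type): overUtilized, aboveAvgUtilized
Nodes that can take data in (Target type): underUtilized, belowAvgUtilized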
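The four-way classification can be tried out with a short sketch that works on plain percentages instead of the balancer's storage-group objects. It leaves out the check that skips nodes not specified as sources. ClassifySketch and the sample values are illustrative.

```java
// Sketch of the four-way classification, using plain doubles instead of
// the balancer's storage-group objects.
final class ClassifySketch {
    enum Group { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    static Group classify(double utilization, double average, double threshold) {
        double utilizationDiff = utilization - average;
        double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        if (utilizationDiff > 0) {
            // Above average: a source, over-utilized only if outside the threshold.
            return thresholdDiff <= 0 ? Group.ABOVE_AVG_UTILIZED : Group.OVER_UTILIZED;
        }
        // At or below average: a target, under-utilized only if outside the threshold.
        return thresholdDiff <= 0 ? Group.BELOW_AVG_UTILIZED : Group.UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        double average = 50.0;   // hypothetical cluster average (%)
        double threshold = 10.0; // value passed to -threshold
        for (double u : new double[] {90.0, 55.0, 45.0, 30.0}) {
            System.out.println(u + " -> " + classify(u, average, threshold));
        }
    }
}
```

With an average of 50% and a threshold of 10, the four sample nodes land in overUtilized, aboveAvgUtilized, belowAvgUtilized and underUtilized respectively.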
Pairing for data migration (principles: 1. prefer the same rack, then other racks; 2. one-to-many pairing):
Step 1 [Source -> Target]: each overUtilized datanode => one or more underUtilized datanodes
Step 2 [Source -> Target]: each remaining overUtilized datanode => one or more belowAvgUtilized datanodes
Step 3 [Target -> Source]: each remaining underUtilized datanode (one whose capacity to receive was not used up by overUtilized datanodes in step 1) => one or more aboveAvgUtilized datanodes
/** Decide all <source, target> pairs according to the matcher. */
private void chooseStorageGroups(final Matcher matcher) {
  /* first step: match each overUtilized datanode (source) to
   * one or more underUtilized datanodes (targets).
   */
  LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => underUtilized");
  chooseStorageGroups(overUtilized, underUtilized, matcher);

  /* match each remaining overutilized datanode (source) to
   * below average utilized datanodes (targets).
   * Note only overutilized datanodes that haven't had that max bytes to move
   * satisfied in step 1 are selected
   */
  LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => belowAvgUtilized");
  chooseStorageGroups(overUtilized, belowAvgUtilized, matcher);

  /* match each remaining underutilized datanode (target) to
   * above average utilized datanodes (source).
   * Note only underutilized datanodes that have not had that max bytes to
   * move satisfied in step 1 are selected.
   */
  LOG.info("chooseStorageGroups for " + matcher + ": underUtilized => aboveAvgUtilized");
  chooseStorageGroups(underUtilized, aboveAvgUtilized, matcher);
}
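When each <source, target> pair is built, the amount of space that can currently be moved out of the source or into the target has to be computed.
The dispatcher then creates the dispatchExecutor thread pool to schedule and carry out the data moves.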
private void matchSourceWithTargetToMove(Source source, StorageGroup target) {
  long size = Math.min(source.availableSizeToMove(), target.availableSizeToMove());
  final Task task = new Task(target, size);
  source.addTask(task);
  target.incScheduledSize(task.getSize());
  dispatcher.add(source, target);
  LOG.info("Decided to move "+StringUtils.byteDesc(size)+" bytes from "
      + source.getDisplayName() + " to " + target.getDisplayName());
}
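The three pairing steps and the size computed for each pair can be sketched together in a simplified model. The sketch below ignores rack awareness (the Matcher) and tracks only how much each group may still send or receive. MatchSketch, Group and the numbers are illustrative and not the balancer's own classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the three matching steps. Rack awareness is left out,
// and sizes are plain numbers (think GB).
final class MatchSketch {
    static final class Group {
        final String name;
        long available; // amount this group may still send or receive
        Group(String name, long available) { this.name = name; this.available = available; }
    }

    /** Pairs each group in 'first' with groups in 'second' until either side is used up. */
    static void choose(List<Group> first, List<Group> second, List<String> plan, boolean firstIsSource) {
        for (Group a : first) {
            for (Group b : second) {
                // Each pair moves the smaller of the two available sizes.
                long size = Math.min(a.available, b.available);
                if (size <= 0) continue;
                a.available -= size;
                b.available -= size;
                Group src = firstIsSource ? a : b;
                Group dst = firstIsSource ? b : a;
                plan.add(src.name + "->" + dst.name + ":" + size);
            }
        }
    }

    static List<String> plan(List<Group> over, List<Group> above, List<Group> below, List<Group> under) {
        List<String> plan = new ArrayList<>();
        choose(over, under, plan, true);   // step 1: overUtilized => underUtilized
        choose(over, below, plan, true);   // step 2: remaining overUtilized => belowAvgUtilized
        choose(under, above, plan, false); // step 3: remaining underUtilized filled from aboveAvgUtilized
        return plan;
    }

    public static void main(String[] args) {
        List<Group> over  = List.of(new Group("dn1", 10));
        List<Group> above = List.of(new Group("dn2", 5));
        List<Group> below = List.of(new Group("dn3", 5));
        List<Group> under = List.of(new Group("dn4", 6), new Group("dn5", 8));
        // prints [dn1->dn4:6, dn1->dn5:4, dn2->dn5:4]
        System.out.println(plan(over, above, below, under));
    }
}
```

In this run dn1 is exhausted in step 1, so step 2 schedules nothing, and step 3 fills the remaining room on dn5 from the above-average node dn2.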
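[Conclusion]
1. For large HDFS clusters, where servers may be added or decommissioned at any time, the balancer script should run as a resident background process;
2. Following the official recommendation, the script should be deployed on a relatively idle server;
3. The script is stopped by killing its process (preferably do not kill it: a background run stops by itself when it finishes, and if it is started several times only one instance runs at a time while the others fail automatically);
For datanode storage maintenance, the following directions are worth optimizing:
* Increase the number of iterations via the -idleiterations parameter, so that each datanode is allowed to move more data;
* Use the -exclude and -include parameters to choose sensibly which servers take part in balancing, for example only the 20% of servers with the lowest utilization and the 20% with the highest;
* Choose a sensible threshold via the -threshold parameter.
Note: ideally a program would discover and adjust these parameters automatically, with no manual intervention.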
Author: 伍柒大人_HQQ
Source: ITPUB blog, http://blog.itpub.net/818/viewspace-2815974/. Please credit the source when reprinting; otherwise legal liability will be pursued.