HDFS Balancer Strategy Explained
I. The balancer command in detail
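hdfs --config /hadoop-client/conf balancer \
    -threshold 10 \
    -policy datanode \
    -exclude -f /tmp/ip1.txt \
    -include -f /tmp/ip2.txt \
    -idleiterations 5

Options:
  -threshold 10: the balance condition. The largest allowed gap, in percentage points, between a datanode's disk utilization and the cluster average; valid range 0 to 100.
  -policy datanode: the default is datanode, i.e. balancing is done at the datanode level.
  -exclude -f /tmp/ip1.txt: empty by default. The listed IPs do not take part in balancing; -f means the list is read from a file.
  -include -f /tmp/ip2.txt: empty by default. Only the listed IPs take part in balancing; -f means the list is read from a file.
  -idleiterations 5: the number of consecutive iterations in which no block is moved before the balancer exits; the default is 5.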
Bandwidth that each datanode may use for moving data during balancing (set in /hadoop-client/conf/hdfs-site.xml; changing it requires an HDFS restart):
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>6250000</value>
</property>
Note: 6250000 / (1024 * 1024) ≈ 6 MB/s
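The bandwidth can also be changed at runtime through dfsadmin, without restarting HDFS (104857600 bytes/s = 100 MB/s):
hdfs dfsadmin -fs hdfs://${active-namenode-hostname}:8020 -setBalancerBandwidth 104857600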
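Both bandwidth values above are plain byte counts per second, which makes them easy to misread. The following is a minimal sketch of the unit conversions; the class and method names are made up for this illustration and are not part of Hadoop.

```java
/** Hypothetical helper for reading balancer bandwidth values; not a Hadoop class. */
class BandwidthSketch {
    /** Bytes per second to MiB per second (1 MiB = 1024 * 1024 bytes). */
    static double toMiBPerSec(long bytesPerSec) {
        return bytesPerSec / (1024.0 * 1024.0);
    }

    /** Bytes per second to megabits per second (1 Mbit = 1,000,000 bits). */
    static double toMbitPerSec(long bytesPerSec) {
        return bytesPerSec * 8 / 1_000_000.0;
    }

    /** MiB per second to the byte count expected by the configuration value. */
    static long fromMiBPerSec(long mibPerSec) {
        return mibPerSec * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(toMiBPerSec(6250000L));   // roughly 5.96
        System.out.println(toMbitPerSec(6250000L));  // prints 50.0
        System.out.println(fromMiBPerSec(100L));     // prints 104857600
    }
}
```

Seen this way, 6250000 is about 6 MB/s of disk-style throughput, or exactly 50 Mbit/s of network bandwidth.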
The balancer exits automatically when any of the following conditions holds:
* The cluster is balanced;
* No block can be moved;
* No block has been moved for the specified number of consecutive iterations (5 by default);
* An IOException occurs while communicating with the namenode;
* Another balancer is running.
II. Source code walkthrough
private boolean shouldIgnore(DatanodeInfo dn) {
  // ignore decommissioned nodes
  final boolean decommissioned = dn.isDecommissioned();
  // ignore decommissioning nodes
  final boolean decommissioning = dn.isDecommissionInProgress();
  // ignore nodes in exclude list (the datanodes passed with -exclude)
  final boolean excluded = Util.isExcluded(excludedNodes, dn);
  // ignore nodes not in the include list (if include list is not empty),
  // i.e. when -include is given, skip every datanode that is not listed
  final boolean notIncluded = !Util.isIncluded(includedNodes, dn);

  if (decommissioned || decommissioning || excluded || notIncluded) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Excluding datanode " + dn
          + ": decommissioned=" + decommissioned
          + ", decommissioning=" + decommissioning
          + ", excluded=" + excluded
          + ", notIncluded=" + notIncluded);
    }
    return true;
  }
  return false;
}
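To make the filtering rule concrete, here is a minimal standalone sketch of the same decision. It does not use the Hadoop classes: NodeFilterSketch, AdminState and the string-based host sets are stand-ins invented for this example, and it assumes, as the comment in the source suggests, that an empty include list means every node is included.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the balancer's node filter; not the Hadoop classes. */
class NodeFilterSketch {
    enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

    /** Returns true when the node should be left out of balancing. */
    static boolean shouldIgnore(String host, AdminState state,
                                Set<String> excluded, Set<String> included) {
        boolean decommissioned = state == AdminState.DECOMMISSIONED;
        boolean decommissioning = state == AdminState.DECOMMISSION_INPROGRESS;
        boolean isExcluded = excluded.contains(host);
        // Assumption: an empty include list places no restriction on the node.
        boolean notIncluded = !included.isEmpty() && !included.contains(host);
        return decommissioned || decommissioning || isExcluded || notIncluded;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("10.0.0.1"));
        Set<String> included = new HashSet<>(); // empty: no include restriction
        // The excluded host is ignored, any other healthy host is kept.
        System.out.println(shouldIgnore("10.0.0.1", AdminState.NORMAL, excluded, included)); // true
        System.out.println(shouldIgnore("10.0.0.2", AdminState.NORMAL, excluded, included)); // false
    }
}
```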
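Cluster average utilization (formula): average = totalUsedSpaces * 100 / totalCapacities
totalUsedSpaces: the sum of the space used on every datanode (dfsUsed, not counting non-DFS used);
totalCapacities: the sum of the capacity of every datanode.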
void initAvgUtilization() {
  for(StorageType t : StorageType.asList()) {
    final long capacity = totalCapacities.get(t);
    if (capacity > 0L) {
      final double avg = totalUsedSpaces.get(t)*100.0/capacity;
      avgUtilizations.set(t, avg);
    }
  }
}
Utilization of a single datanode: utilization = dfsUsed * 100.0 / capacity
dfsUsed: the DFS space used on this datanode (not counting non-DFS used);
Double getUtilization(DatanodeStorageReport r, final StorageType t) {
  long capacity = 0L;
  long dfsUsed = 0L;
  for(StorageReport s : r.getStorageReports()) {
    if (s.getStorage().getStorageType() == t) {
      capacity += s.getCapacity();
      dfsUsed += s.getDfsUsed();
    }
  }
  return capacity == 0L? null: dfsUsed*100.0/capacity;
}
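A small worked example may help here, because the cluster average is weighted by capacity and is not the mean of the per-node percentages. The sketch below follows the two formulas above with plain arrays; the class name and the sample numbers are invented for illustration.

```java
/** Hypothetical worked example of the two utilization formulas; not Hadoop code. */
class UtilizationSketch {
    /** Per-node utilization in percent, or null when the node has no capacity. */
    static Double utilization(long dfsUsed, long capacity) {
        return capacity == 0L ? null : dfsUsed * 100.0 / capacity;
    }

    /** Cluster average in percent: total dfsUsed divided by total capacity. */
    static double average(long[] dfsUsed, long[] capacity) {
        long totalUsed = 0L;
        long totalCapacity = 0L;
        for (int i = 0; i < dfsUsed.length; i++) {
            totalUsed += dfsUsed[i];
            totalCapacity += capacity[i];
        }
        return totalCapacity > 0L ? totalUsed * 100.0 / totalCapacity : 0.0;
    }

    public static void main(String[] args) {
        // Two nodes, sizes in GB: a small, nearly full node and a large, nearly empty one.
        long[] used = {90L, 10L};
        long[] capacity = {100L, 300L};
        System.out.println(utilization(used[0], capacity[0])); // prints 90.0
        System.out.println(average(used, capacity));           // prints 25.0
    }
}
```

With these numbers the mean of the two percentages would be about 46.7, while the capacity-weighted average that the balancer works with is 25.0.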
Difference between a datanode's utilization and the cluster average: utilizationDiff = utilization - average
Difference between a datanode's utilizationDiff and the threshold: thresholdDiff = |utilizationDiff| - threshold
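Space that needs to be moved out or can be moved in: maxSize2Move = |utilizationDiff| * capacity / 100 (utilizationDiff is a percentage)
Space that can be moved in: Math.min(remaining, maxSize2Move)
Space that needs to be moved out: Math.min(max, maxSize2Move)
remaining: the free space left on the datanode; max: the upper bound on the bytes one datanode may move in a single iteration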
Each iteration moves at most max = 10 GB per datanode by default, so over 5 iterations a single datanode moves at most 5 * 10 GB = 50 GB.
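The size formulas above can be put together in a few lines. This is a sketch that follows the formulas in this section, not a copy of the Hadoop implementation; the class name is invented, and the 10 GB cap is the per-iteration figure quoted above.

```java
/** Hypothetical sketch of the move-size formulas; not the Hadoop implementation. */
class MaxSizeSketch {
    /** Converts a percentage of a capacity into bytes. */
    static long percentage2bytes(double percentage, long capacity) {
        return (long) (percentage * capacity / 100.0);
    }

    /** Bytes a datanode may move out (diff > 0) or take in (diff < 0) in one iteration. */
    static long computeMaxSize2Move(long capacity, long remaining,
                                    double utilizationDiff, long max) {
        long size = percentage2bytes(Math.abs(utilizationDiff), capacity);
        if (utilizationDiff < 0) {
            // A target cannot take in more than its free space.
            size = Math.min(remaining, size);
        }
        // Per-iteration cap.
        return Math.min(max, size);
    }

    public static void main(String[] args) {
        final long GB = 1L << 30;
        final long max = 10 * GB; // assumed cap of 10 GB per iteration
        // Source, 20 points above average on 1000 GB: 200 GB wanted, capped at 10 GB.
        System.out.println(computeMaxSize2Move(1000 * GB, 500 * GB, 20.0, max) / GB); // prints 10
        // Source, 5 points above average on 100 GB: 5 GB.
        System.out.println(computeMaxSize2Move(100 * GB, 50 * GB, 5.0, max) / GB);    // prints 5
        // Target, 8 points below average on 100 GB but only 3 GB free: 3 GB.
        System.out.println(computeMaxSize2Move(100 * GB, 3 * GB, -8.0, max) / GB);    // prints 3
    }
}
```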
for(DatanodeStorageReport r : reports) {
  final DDatanode dn = dispatcher.newDatanode(r.getDatanodeInfo());
  final boolean isSource = Util.isIncluded(sourceNodes, dn.getDatanodeInfo());
  for(StorageType t : StorageType.getMovableTypes()) {
    final Double utilization = policy.getUtilization(r, t);
    if (utilization == null) { // datanode does not have such storage type
      continue;
    }

    final double average = policy.getAvgUtilization(t);
    if (utilization >= average && !isSource) {
      LOG.info(dn + "[" + t + "] has utilization=" + utilization
          + " >= average=" + average
          + " but it is not specified as a source; skipping it.");
      continue;
    }

    final double utilizationDiff = utilization - average;
    final long capacity = getCapacity(r, t);
    final double thresholdDiff = Math.abs(utilizationDiff) - threshold;
    final long maxSize2Move = computeMaxSize2Move(capacity,
        getRemaining(r, t), utilizationDiff, maxSizeToMove);

    final StorageGroup g;
    if (utilizationDiff > 0) {
      final Source s = dn.addSource(t, maxSize2Move, dispatcher);
      if (thresholdDiff <= 0) { // within threshold
        aboveAvgUtilized.add(s);
      } else {
        overLoadedBytes += percentage2bytes(thresholdDiff, capacity);
        overUtilized.add(s);
      }
      g = s;
    } else {
      g = dn.addTarget(t, maxSize2Move);
      if (thresholdDiff <= 0) { // within threshold
        belowAvgUtilized.add(g);
      } else {
        underLoadedBytes += percentage2bytes(thresholdDiff, capacity);
        underUtilized.add(g);
      }
    }
    dispatcher.getStorageGroupMap().put(g);
  }
}
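overUtilized: utilizationDiff > 0 && thresholdDiff > 0 (utilization is above the average, and the gap is larger than the threshold)
aboveAvgUtilized: utilizationDiff > 0 && thresholdDiff <= 0 (utilization is above the average, and the gap is within the threshold)
belowAvgUtilized: utilizationDiff <= 0 && thresholdDiff <= 0 (utilization is not above the average, and the gap is within the threshold)
underUtilized: utilizationDiff < 0 && thresholdDiff > 0 (utilization is below the average, and the gap is larger than the threshold)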
Nodes that need to move data out (Source type): overUtilized, aboveAvgUtilized
Nodes that can take data in (Target type): underUtilized, belowAvgUtilized
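The four categories can be reproduced with a few lines of arithmetic. The sketch below mirrors the branch structure of the loop above on bare numbers; the class name, the enum and the sample values are made up for this illustration.

```java
/** Hypothetical classifier for the four balancer categories; not Hadoop code. */
class ClassifySketch {
    enum Bucket { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    /** All three arguments are percentages in the range 0 to 100. */
    static Bucket classify(double utilization, double average, double threshold) {
        double utilizationDiff = utilization - average;
        double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        if (utilizationDiff > 0) {
            // Source side: the node holds more than the average.
            return thresholdDiff <= 0 ? Bucket.ABOVE_AVG_UTILIZED : Bucket.OVER_UTILIZED;
        }
        // Target side: the node holds the average or less.
        return thresholdDiff <= 0 ? Bucket.BELOW_AVG_UTILIZED : Bucket.UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        double average = 50.0;
        double threshold = 10.0;
        System.out.println(classify(75.0, average, threshold)); // OVER_UTILIZED
        System.out.println(classify(55.0, average, threshold)); // ABOVE_AVG_UTILIZED
        System.out.println(classify(45.0, average, threshold)); // BELOW_AVG_UTILIZED
        System.out.println(classify(30.0, average, threshold)); // UNDER_UTILIZED
    }
}
```

A node exactly at the threshold boundary, for example 60% against an average of 50% and a threshold of 10, has thresholdDiff = 0 and therefore counts as aboveAvgUtilized.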
Pairing sources with targets (principles: 1. prefer the same rack, then other racks; 2. one-to-many pairing):
Step 1 [Source -> Target]: each overUtilized datanode => one or more underUtilized datanodes
Step 2 [Source -> Target]: each remaining overUtilized datanode => one or more belowAvgUtilized datanodes
Step 3 [Target -> Source]: each remaining underUtilized datanode (one whose space was not fully matched with overUtilized nodes in step 1) => one or more aboveAvgUtilized datanodes
/** Decide all <source, target> pairs according to the matcher. */
private void chooseStorageGroups(final Matcher matcher) {
  /* first step: match each overUtilized datanode (source) to
   * one or more underUtilized datanodes (targets).
   */
  LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => underUtilized");
  chooseStorageGroups(overUtilized, underUtilized, matcher);

  /* match each remaining overutilized datanode (source) to
   * below average utilized datanodes (targets).
   * Note only overutilized datanodes that haven't had that max bytes to move
   * satisfied in step 1 are selected
   */
  LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => belowAvgUtilized");
  chooseStorageGroups(overUtilized, belowAvgUtilized, matcher);

  /* match each remaining underutilized datanode (target) to
   * above average utilized datanodes (source).
   * Note only underutilized datanodes that have not had that max bytes to
   * move satisfied in step 1 are selected.
   */
  LOG.info("chooseStorageGroups for " + matcher + ": underUtilized => aboveAvgUtilized");
  chooseStorageGroups(underUtilized, aboveAvgUtilized, matcher);
}
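When each <source, target> pair is built, the amount of space that can currently be moved out of the source or into the target has to be computed.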
private void matchSourceWithTargetToMove(Source source, StorageGroup target) {
  long size = Math.min(source.availableSizeToMove(), target.availableSizeToMove());
  final Task task = new Task(target, size);
  source.addTask(task);
  target.incScheduledSize(task.getSize());
  dispatcher.add(source, target);
  LOG.info("Decided to move "+StringUtils.byteDesc(size)+" bytes from "
      + source.getDisplayName() + " to " + target.getDisplayName());
}
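The three pairing steps and the min rule can be traced on a toy cluster. The sketch below is a deliberate simplification: it ignores racks and the matcher entirely, pairs nodes greedily in list order, and uses an invented Group class with sizes in GB, so it only illustrates the order of the steps and how the remaining budgets shrink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, rack-unaware sketch of the three pairing steps; not Hadoop code. */
class PairingSketch {
    /** A source or target together with the space it may still move this iteration. */
    static class Group {
        final String name;
        long available;
        Group(String name, long available) { this.name = name; this.available = available; }
    }

    /** Pairs every group with candidates until one side of each pair runs out of budget. */
    static void choose(List<Group> groups, List<Group> candidates,
                       boolean groupsAreSources, List<String> plan) {
        for (Group g : groups) {
            for (Group c : candidates) {
                // Same rule as matchSourceWithTargetToMove: the smaller budget wins.
                long size = Math.min(g.available, c.available);
                if (size <= 0) {
                    continue;
                }
                g.available -= size;
                c.available -= size;
                Group source = groupsAreSources ? g : c;
                Group target = groupsAreSources ? c : g;
                plan.add(source.name + "->" + target.name + ":" + size);
            }
        }
    }

    static List<String> plan(List<Group> over, List<Group> above,
                             List<Group> below, List<Group> under) {
        List<String> plan = new ArrayList<>();
        choose(over, under, true, plan);   // step 1: overUtilized => underUtilized
        choose(over, below, true, plan);   // step 2: remaining overUtilized => belowAvgUtilized
        choose(under, above, false, plan); // step 3: remaining underUtilized => aboveAvgUtilized
        return plan;
    }

    public static void main(String[] args) {
        List<String> result = plan(
            Arrays.asList(new Group("A", 10)),  // overUtilized, may move out 10 GB
            Arrays.asList(new Group("B", 4)),   // aboveAvgUtilized
            Arrays.asList(new Group("C", 3)),   // belowAvgUtilized
            Arrays.asList(new Group("D", 6)));  // underUtilized, may take in 6 GB
        System.out.println(result); // prints [A->D:6, A->C:3]
    }
}
```

In this run A first fills D (step 1), then spends part of its remaining budget on C (step 2); step 3 finds nothing to do because D has no budget left.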
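1. On large HDFS clusters, where servers may be added or decommissioned at any time, the balance script needs to run as a resident background process;
2. Following the official recommendation, the script should be deployed on a relatively idle server;
3. The script is stopped by killing its process (killing is not recommended: the background run stops by itself when it finishes, and if the script is started several times only one instance runs at a time while the others fail automatically);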
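Datanode storage maintenance can be tuned in the following directions:
* Use the iteration parameter (idleiterations) to raise the iteration count, so that each datanode is allowed to move more data;
* Use the exclude and include parameters to choose sensibly which servers take part in balancing, for example only the 20% least utilized and the 20% most utilized servers;
* Use the threshold parameter to set a sensible threshold.
Note: ideally a program would discover the situation and adjust these parameters automatically, with no manual intervention.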
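As one possible starting point for that kind of automation, the sketch below picks the least and most utilized fraction of hosts from a map of utilization figures. Everything in it is hypothetical: the class name, the host names and the numbers are invented, and how the utilization figures are collected (for example by parsing the output of hdfs dfsadmin -report) is left out. The resulting host list could be written to a file and passed to the balancer with -include -f.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that selects hosts for an include file; not part of Hadoop. */
class IncludeListSketch {
    /** Returns the hosts in the lowest and in the highest `fraction` of utilization. */
    static List<String> pickExtremes(Map<String, Double> utilizationByHost, double fraction) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(utilizationByHost.entrySet());
        sorted.sort(Map.Entry.comparingByValue()); // ascending by utilization
        int n = sorted.size();
        int k = (int) Math.floor(n * fraction);
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            picked.add(sorted.get(i).getKey()); // least utilized hosts
        }
        // Math.max keeps a host from being picked twice when fraction > 0.5.
        for (int i = Math.max(k, n - k); i < n; i++) {
            picked.add(sorted.get(i).getKey()); // most utilized hosts
        }
        return picked;
    }

    public static void main(String[] args) {
        Map<String, Double> utilization = new HashMap<>();
        utilization.put("dn1", 35.0);
        utilization.put("dn2", 80.0);
        utilization.put("dn3", 12.0);
        utilization.put("dn4", 55.0);
        utilization.put("dn5", 60.0);
        // With 5 hosts and a fraction of 0.2, one host is taken from each end.
        System.out.println(pickExtremes(utilization, 0.2)); // prints [dn3, dn2]
    }
}
```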
