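[Preface] On large Hadoop clusters that have been running in production for a long time, disk usage tends to become unevenly distributed across datanodes, especially after nodes are added or decommissioned, or after replica counts are changed by hand. Uneven usage forces the compute engine to copy data across nodes more often (the data needed by a task running on node A lives on other nodes), which costs unnecessary time and bandwidth. In addition, when some nodes are very full but not yet at capacity (around 90%), tasks scheduled on those nodes are at risk of failing. A balance strategy that evens out space usage across the cluster's nodes is therefore essential.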

I. The balancer command in detail

hdfs --config /hadoop-client/conf balancer
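-threshold  10                    \Balance condition: the allowed difference (in percent) between a datanode's disk utilization and the cluster average; range: 0~100
-policy datanode                  \Default: datanode, i.e. balance at the datanode level
-exclude  -f  /tmp/ip1.txt        \Default: empty; the listed IPs do not take part in balancing; -f: read the list from a file
-include  -f  /tmp/ip2.txt        \Default: empty; only the listed IPs take part in balancing; -f: read the list from a file
 -idleiterations  5               \Maximum number of consecutive idle iterations (no block moved) before exiting; default 5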

Bandwidth for data migration between datanodes during an HDFS balance (set in /hadoop-client/conf/hdfs-site.xml; changing it requires restarting HDFS):

<property>
    <name>dfs.datanode.balance.bandwidthPerSec</name>
    <value>6250000</value>
</property>
(Note: 6250000 / (1024 * 1024) ≈ 6 MB/s)

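Raising the bandwidth at runtime (no restart needed; run it as the hdfs user; do not set it too high, or it will eat into the bandwidth available to MapReduce jobs):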

hdfs dfsadmin -fs hdfs://${active-namenode-hostname}:8020 -setBalancerBandwidth 104857600
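
Both settings above take a value in bytes per second, which is easy to misread. The following is a minimal sketch of the conversion; the class BalancerBandwidth and its methods are hypothetical helpers for illustration and are not part of Hadoop.

```java
// Hypothetical helper (not a Hadoop class): converts between MB/s (1024 * 1024 bytes)
// and the bytes-per-second values used by the balancer bandwidth settings.
final class BalancerBandwidth {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** MB/s to the raw bytes-per-second value. */
    static long toBytesPerSec(long mbPerSec) {
        return mbPerSec * BYTES_PER_MB;
    }

    /** Raw bytes-per-second value to MB/s. */
    static double toMbPerSec(long bytesPerSec) {
        return (double) bytesPerSec / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        System.out.println(toBytesPerSec(100));      // 104857600, the value used with -setBalancerBandwidth
        System.out.println(toMbPerSec(6_250_000L));  // roughly 5.96, the hdfs-site.xml value above
    }
}
```

Under this reading, the dfsadmin command above raises the limit from about 6 MB/s to 100 MB/s.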

The balance script exits automatically when any one of the following conditions is met:

 * The cluster is balanced;
 * No block can be moved;
 * No block has been moved for specified consecutive iterations (5 by default);
 * An IOException occurs while communicating with the namenode;
 * Another balancer is running.
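
The third condition is the one controlled by -idleiterations. Below is a simplified model of that rule, written as an assumption about how the counter behaves rather than a copy of the Hadoop code; the class name IdleIterationCounter is hypothetical.

```java
// Simplified, hypothetical model of the "no block moved for N consecutive
// iterations" exit rule. It is a sketch, not the Hadoop implementation.
final class IdleIterationCounter {
    private final int maxIdleIterations;
    private int idleIterations = 0;

    IdleIterationCounter(int maxIdleIterations) {
        this.maxIdleIterations = maxIdleIterations;
    }

    /** Records one iteration and returns true if balancing should continue. */
    boolean shouldContinue(long bytesMovedThisIteration) {
        if (bytesMovedThisIteration > 0) {
            idleIterations = 0;          // progress resets the counter
            return true;
        }
        idleIterations++;                // another idle iteration in a row
        return idleIterations < maxIdleIterations;
    }
}
```

In this model an iteration that moves any data resets the counter, so the limit bounds idle iterations in a row and not the total number of iterations.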

II. Source code walkthrough

Source package: org.apache.hadoop.hdfs.server.balancer

Selecting the datanodes that take part in balancing:

  private boolean shouldIgnore(DatanodeInfo dn) {
    // ignore decommissioned nodes
    final boolean decommissioned = dn.isDecommissioned();
    // ignore decommissioning nodes
    final boolean decommissioning = dn.isDecommissionInProgress();
    // ignore nodes in exclude list (set with -exclude)
    final boolean excluded = Util.isExcluded(excludedNodes, dn);
    // ignore nodes not in the include list (if include list is not empty;
    // set with -include)
    final boolean notIncluded = !Util.isIncluded(includedNodes, dn);

    if (decommissioned || decommissioning || excluded || notIncluded) {
      if (LOG.isTraceEnabled()) {
        LOG.trace("Excluding datanode " + dn
            + ": decommissioned=" + decommissioned
            + ", decommissioning=" + decommissioning
            + ", excluded=" + excluded
            + ", notIncluded=" + notIncluded);
      }
      return true;
    }
    return false;
  }

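Cluster average utilization (formula): average = totalUsedSpaces * 100 / totalCapacities
totalUsedSpaces: the sum of used space across all datanodes (dfsUsed, excluding non dfsUsed);
totalCapacities: the sum of total space across all datanodes (the disk directories configured for each DataNode);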

  void initAvgUtilization() {
    for(StorageType t : StorageType.asList()) {
      final long capacity = totalCapacities.get(t);
      if (capacity > 0L) {
        final double avg  = totalUsedSpaces.get(t)*100.0/capacity;
        avgUtilizations.set(t, avg);
      }
    }
  }
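
Because the average is total used space over total capacity, it is weighted by capacity and can differ from the plain mean of per-node utilization when disks differ in size. A minimal sketch with made-up numbers follows; the class AvgUtilizationSketch is hypothetical and works on bare arrays instead of Hadoop's storage reports.

```java
// Hypothetical illustration: capacity-weighted cluster average versus the
// plain mean of per-node utilization. Units are arbitrary (for example TB).
final class AvgUtilizationSketch {
    /** Cluster average in percent: sum(dfsUsed) * 100 / sum(capacity). */
    static double clusterAverage(long[] dfsUsed, long[] capacity) {
        long totalUsed = 0L;
        long totalCapacity = 0L;
        for (int i = 0; i < capacity.length; i++) {
            totalUsed += dfsUsed[i];
            totalCapacity += capacity[i];
        }
        return totalCapacity > 0L ? totalUsed * 100.0 / totalCapacity : 0.0;
    }

    /** Plain mean of per-node utilization, shown only for contrast. */
    static double meanOfNodeUtilizations(long[] dfsUsed, long[] capacity) {
        double sum = 0.0;
        for (int i = 0; i < capacity.length; i++) {
            sum += dfsUsed[i] * 100.0 / capacity[i];
        }
        return sum / capacity.length;
    }

    public static void main(String[] args) {
        long[] capacity = {10, 30};   // a small node and a large node
        long[] dfsUsed  = {8, 6};     // 80% and 20% utilized
        System.out.println(clusterAverage(dfsUsed, capacity));         // 35.0
        System.out.println(meanOfNodeUtilizations(dfsUsed, capacity)); // 50.0
    }
}
```

With these numbers the small node sits 45 points above the cluster average of 35%, not 30 points above a naive mean of 50%.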

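Utilization of a single datanode: utilization = dfsUsed * 100.0 / capacity
dfsUsed: the space currently used by dfs on this datanode (dfsUsed, excluding non dfsUsed);
capacity: the total space of this datanode (the disk directories configured for the DataNode);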

    Double getUtilization(DatanodeStorageReport r, final StorageType t) {
      long capacity = 0L;
      long dfsUsed = 0L;
      for(StorageReport s : r.getStorageReports()) {
        if (s.getStorage().getStorageType() == t) {
          capacity += s.getCapacity();
          dfsUsed += s.getDfsUsed();
        }
      }
      return capacity == 0L? null: dfsUsed*100.0/capacity;
    }

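Difference between a datanode's utilization and the cluster average: utilizationDiff = utilization - average
Difference between a datanode's |utilizationDiff| and the threshold: thresholdDiff = |utilizationDiff| - threshold

Space that needs to be moved out or can be moved in: maxSize2Move = |utilizationDiff| * capacity / 100 (utilizationDiff is a percentage)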

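Space that can be moved in: Math.min(remaining, maxSizeToMove), then capped by max as well
Space that needs to be moved out: Math.min(max, maxSizeToMove)
maxSizeToMove: here, the value |utilizationDiff| * capacity / 100 computed above
remaining: the remaining space on the datanode
max: the cap on how much a single datanode can move in one balance iteration (10G by default)
The cap applies per iteration. -idleiterations (default 5) limits only the number of consecutive iterations in which no block is moved, not the total number of iterations, so the amount a single datanode can move in one run of the balance script is not capped at 5*10G = 50G.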
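
The calculation above can be sketched as follows. This is a simplified reconstruction under the assumption that the per-iteration cap is applied to sources and targets alike; the class MaxSizeToMoveSketch is hypothetical and the numbers are made up.

```java
// Hypothetical sketch of the per-iteration size calculation described above.
final class MaxSizeToMoveSketch {
    static final long GB = 1024L * 1024L * 1024L;

    /** Converts a percentage of a capacity into bytes. */
    static long percentageToBytes(double percentage, long capacity) {
        return (long) (percentage * capacity / 100.0);
    }

    /**
     * Bytes a storage may move out (utilizationDiff > 0) or take in
     * (utilizationDiff < 0) in one iteration; max is the per-iteration cap.
     */
    static long maxSize2Move(long capacity, long remaining,
                             double utilizationDiff, long max) {
        long size = percentageToBytes(Math.abs(utilizationDiff), capacity);
        if (utilizationDiff < 0) {
            size = Math.min(remaining, size);   // a target cannot take more than it has free
        }
        return Math.min(max, size);
    }

    public static void main(String[] args) {
        // Source: 1000 GB disk, 25 points above average -> 250 GB wanted, capped at 10 GB.
        System.out.println(maxSize2Move(1000 * GB, 150 * GB, 25.0, 10 * GB) / GB);  // 10
        // Target: 0.5 points below average -> 5 GB wanted, but only 3 GB free.
        System.out.println(maxSize2Move(1000 * GB, 3 * GB, -0.5, 10 * GB) / GB);    // 3
    }
}
```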

    for(DatanodeStorageReport r : reports) {
      final DDatanode dn = dispatcher.newDatanode(r.getDatanodeInfo());
      final boolean isSource = Util.isIncluded(sourceNodes, dn.getDatanodeInfo());
      for(StorageType t : StorageType.getMovableTypes()) {
        final Double utilization = policy.getUtilization(r, t);
        if (utilization == null) { // datanode does not have such storage type
          continue;
        }

        final double average = policy.getAvgUtilization(t);
        if (utilization >= average && !isSource) {
          LOG.info(dn + "[" + t + "] has utilization=" + utilization
              + " >= average=" + average
              + " but it is not specified as a source; skipping it.");
          continue;
        }

        final double utilizationDiff = utilization - average;
        final long capacity = getCapacity(r, t);
        final double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        final long maxSize2Move = computeMaxSize2Move(capacity,
            getRemaining(r, t), utilizationDiff, maxSizeToMove);

        final StorageGroup g;
        if (utilizationDiff > 0) {
          final Source s = dn.addSource(t, maxSize2Move, dispatcher);
          if (thresholdDiff <= 0) { // within threshold
            aboveAvgUtilized.add(s);
          } else {
            overLoadedBytes += percentage2bytes(thresholdDiff, capacity);
            overUtilized.add(s);
          }
          g = s;
        } else {
          g = dn.addTarget(t, maxSize2Move);
          if (thresholdDiff <= 0) { // within threshold
            belowAvgUtilized.add(g);
          } else {
            underLoadedBytes += percentage2bytes(thresholdDiff, capacity);
            underUtilized.add(g);
          }
        }
        dispatcher.getStorageGroupMap().put(g);
      }
    }

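Queues that datanodes are placed in after the comparison:

overUtilized: utilizationDiff > 0 && thresholdDiff > 0        <utilization above the average, and the difference is greater than the threshold>
aboveAvgUtilized: utilizationDiff > 0 && thresholdDiff <= 0   <utilization above the average, and the difference is less than or equal to the threshold>
belowAvgUtilized: utilizationDiff <= 0 && thresholdDiff <= 0  <utilization not above the average, and the difference is less than or equal to the threshold>
underUtilized: utilizationDiff <= 0 && thresholdDiff > 0      <utilization below the average, and the difference is greater than the threshold>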

Nodes that need to move data out (Source type): overUtilized, aboveAvgUtilized
Nodes that can take data in (Target type): underUtilized, belowAvgUtilized
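
The four queues follow directly from the two differences. A minimal sketch of that decision, mirroring the branch structure of the loop above, is shown here; the class UtilizationBuckets and its enum are hypothetical names, not Hadoop types.

```java
// Hypothetical sketch: classifies one storage into the four queues using
// utilizationDiff and thresholdDiff. All inputs are percentages.
final class UtilizationBuckets {
    enum Bucket { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    static Bucket classify(double utilization, double average, double threshold) {
        final double utilizationDiff = utilization - average;
        final double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        if (utilizationDiff > 0) {
            // above the average: a source
            return thresholdDiff <= 0 ? Bucket.ABOVE_AVG_UTILIZED : Bucket.OVER_UTILIZED;
        }
        // at or below the average: a target
        return thresholdDiff <= 0 ? Bucket.BELOW_AVG_UTILIZED : Bucket.UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        // Cluster average 60%, threshold 10.
        System.out.println(classify(75, 60, 10)); // OVER_UTILIZED
        System.out.println(classify(65, 60, 10)); // ABOVE_AVG_UTILIZED
        System.out.println(classify(55, 60, 10)); // BELOW_AVG_UTILIZED
        System.out.println(classify(40, 60, 10)); // UNDER_UTILIZED
    }
}
```

A node exactly at the threshold boundary (for example 70% here) counts as within the threshold and lands in aboveAvgUtilized.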

Pairing sources with targets (principles: 1. prefer the same rack, then other racks; 2. one-to-many pairing):
Step 1 [Source -> Target]: each overUtilized datanode => one or more underUtilized datanodes
Step 2 [Source -> Target]: each remaining overUtilized datanode => one or more belowAvgUtilized datanodes
Step 3 [Target -> Source]: each remaining underUtilized datanode (not fully matched with an overUtilized node in step 1) => one or more aboveAvgUtilized datanodes

  /** Decide all <source, target> pairs according to the matcher. */
  private void chooseStorageGroups(final Matcher matcher) {
    /* first step: match each overUtilized datanode (source) to
     * one or more underUtilized datanodes (targets).
     */
    LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => underUtilized");
    chooseStorageGroups(overUtilized, underUtilized, matcher);

    /* match each remaining overutilized datanode (source) to 
     * below average utilized datanodes (targets).
     * Note only overutilized datanodes that haven't had that max bytes to move
     * satisfied in step 1 are selected
     */
    LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => belowAvgUtilized");
    chooseStorageGroups(overUtilized, belowAvgUtilized, matcher);

    /* match each remaining underutilized datanode (target) to 
     * above average utilized datanodes (source).
     * Note only underutilized datanodes that have not had that max bytes to
     * move satisfied in step 1 are selected.
     */
    LOG.info("chooseStorageGroups for " + matcher + ": underUtilized => aboveAvgUtilized");
    chooseStorageGroups(underUtilized, aboveAvgUtilized, matcher);
  }

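When each <source, target> pair is built, the amount of space that can currently be moved out or moved in has to be computed.
The dispatcher creates the dispatchExecutor thread pool to schedule the data moves.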

  private void matchSourceWithTargetToMove(Source source, StorageGroup target) {
    long size = Math.min(source.availableSizeToMove(), target.availableSizeToMove());
    final Task task = new Task(target, size);
    source.addTask(task);
    target.incScheduledSize(task.getSize());
    dispatcher.add(source, target);
    LOG.info("Decided to move "+StringUtils.byteDesc(size)+" bytes from "
        + source.getDisplayName() + " to " + target.getDisplayName());
  }
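
To make the one-to-many pairing concrete, here is a simplified sketch of a single pairing step. It assumes a plain greedy pass and ignores the rack-aware matcher entirely; the class PairingSketch, the Group type, and the node names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one pairing step: each source is matched with one or
// more targets, and each pair moves min(source available, target available).
final class PairingSketch {
    static final class Group {
        final String name;
        long available;   // bytes still allowed to move out (source) or in (target)

        Group(String name, long available) {
            this.name = name;
            this.available = available;
        }
    }

    /** Returns the planned moves as "source->target:bytes" strings. */
    static List<String> pair(List<Group> sources, List<Group> targets) {
        List<String> plan = new ArrayList<>();
        for (Group s : sources) {
            for (Group t : targets) {
                if (s.available == 0) {
                    break;        // this source is fully scheduled
                }
                if (t.available == 0) {
                    continue;     // this target is already full
                }
                long size = Math.min(s.available, t.available);
                s.available -= size;
                t.available -= size;
                plan.add(s.name + "->" + t.name + ":" + size);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Group> over = List.of(new Group("A", 10), new Group("B", 4));
        List<Group> under = List.of(new Group("X", 6), new Group("Y", 20));
        System.out.println(pair(over, under));   // [A->X:6, A->Y:4, B->Y:4]
    }
}
```

Any availability left over after one call carries into the next, which is how sources or targets not satisfied in step 1 remain eligible for steps 2 and 3.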

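[Conclusion]
1. On large HDFS clusters (where servers may be added or decommissioned at any time), the balance script needs to run as a resident background process;
2. Following the official recommendation, the script should be deployed on a relatively idle server;
3. The script is stopped by killing its process (killing it is not recommended: a background run stops by itself when it finishes, and if the script is launched several times only one instance runs at a time while the others fail automatically);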

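Datanode storage maintenance can be tuned in the following directions:
* Use the parameter (idleiterations) to raise the number of idle iterations tolerated, so that the balancer keeps running longer and datanodes can move more data;
* Use the parameters (exclude, include) to choose a sensible set of servers to balance, for example the 20% of servers with the lowest utilization and the 20% with the highest;
* Use the parameter (threshold) to set a sensible threshold;
<Note: ideally a program would discover and adjust these parameters automatically, with no manual intervention>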

Author: 伍柒大人_HQQ