Hadoop official documentation: HDFS High Availability Using the Quorum Journal Manager

Posted by xiaoliuyiting on 2019-01-02

Original source: https://blog.csdn.net/duheaven/article/details/17038679

HDFS High Availability Using the Quorum Journal Manager

Purpose

This guide provides an overview of the HDFS High Availability (HA) feature and how to configure and manage an HA HDFS cluster, using the Quorum Journal Manager (QJM) feature.

This document assumes that the reader has a general understanding of the components and node types in an HDFS cluster. Please refer to the HDFS Architecture guide for details.

Note: Using the Quorum Journal Manager or Conventional Shared Storage

This guide discusses how to configure and use HDFS HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes. For information on how to configure HDFS HA using NFS for shared storage instead of the QJM, please see this alternative guide.

Background

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

This impacted the total availability of the HDFS cluster in two major ways:

  • In the case of an unplanned event such as a machine crash, the cluster would be unavailable until an operator restarted the NameNode.

  • Planned maintenance events such as software or hardware upgrades on the NameNode machine would result in windows of cluster downtime.

The HDFS High Availability feature addresses the above problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.

Architecture

In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node is capable of reading the edits from the JNs, and is constantly watching them for changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.

Hardware resources

In order to deploy an HA cluster, you should prepare the following:

  • NameNode machines - the machines on which you run the Active and Standby NameNodes should have equivalent hardware to each other, and equivalent hardware to what would be used in a non-HA cluster.

  • JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager. Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine. You may also run more than 3 JournalNodes, but in order to actually increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally.

Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.

Deployment

Configuration overview

Similar to Federation configuration, HA configuration is backward compatible and allows existing single NameNode configurations to work without change. The new configuration is designed such that all the nodes in the cluster may have the same configuration without the need for deploying different configuration files to different machines based on the type of the node.

Like HDFS Federation, HA clusters reuse the nameservice ID to identify a single HDFS instance that may in fact consist of multiple HA NameNodes. In addition, a new abstraction called NameNode ID is added with HA. Each distinct NameNode in the cluster has a different NameNode ID to distinguish it. To support a single configuration file for all of the NameNodes, the relevant configuration parameters are suffixed with the nameservice ID as well as the NameNode ID.

Configuration details

To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.

The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.

  • dfs.nameservices - the logical name for this new nameservice

    Choose a logical name for this nameservice, for example “mycluster”, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

  • Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
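
For the federated case mentioned in the note above, the value simply becomes a comma-separated list of all nameservices. A sketch, where "mycluster2" is a hypothetical second nameservice ID and not something defined elsewhere in this guide:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster,mycluster2</value>
</property>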

  • dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice

    Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “mycluster” as the nameservice ID previously, and you wanted to use “nn1” and “nn2” as the individual IDs of the NameNodes, you would configure this as such: 

    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    

    Note: Currently, only a maximum of two NameNodes may be configured per nameservice.

  • dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC address for each NameNode to listen on

    For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode processs. Note that this results in two separate configuration options. For example:

    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>machine1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>machine2.example.com:8020</value>
    </property>
    

    Note: You may similarly configure the “servicerpc-address” setting if you so desire.
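
    If you do choose to set it, the key follows the same nameservice and NameNode ID suffix pattern as rpc-address. A sketch only; the service RPC port 8021 shown here is purely illustrative, not a Hadoop default:

    <property>
      <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
      <value>machine1.example.com:8021</value>
    </property>
    <property>
      <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
      <value>machine2.example.com:8021</value>
    </property>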

  • dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on

    Similarly to rpc-address above, set the addresses for both NameNodes’ HTTP servers to listen on. For example:

    <property>
      <name>dfs.namenode.http-address.mycluster.nn1</name>
      <value>machine1.example.com:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.mycluster.nn2</name>
      <value>machine2.example.com:50070</value>
    </property>
    

    Note: If you have Hadoop’s security features enabled, you should also set the https-address similarly for each NameNode.
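
    In that case the key again takes the nameservice and NameNode ID suffixes. A sketch, assuming the usual HTTPS port for this release line; treat the values as illustrative and adjust to your environment:

    <property>
      <name>dfs.namenode.https-address.mycluster.nn1</name>
      <value>machine1.example.com:50470</value>
    </property>
    <property>
      <name>dfs.namenode.https-address.mycluster.nn2</name>
      <value>machine2.example.com:50470</value>
    </property>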

  • dfs.namenode.shared.edits.dir - the URI which identifies the group of JNs where the NameNodes will write/read edits

    This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active nameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form: qjournal://*host1:port1*;*host2:port2*;*host3:port3*/*journalId*. The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it’s a good idea to reuse the nameservice ID for the journal identifier.

    For example, if the JournalNodes for this cluster were running on the machines “node1.example.com”, “node2.example.com”, and “node3.example.com” and the nameservice ID were “mycluster”, you would use the following as the value for this setting (the default port for the JournalNode is 8485):

    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
    </property>
    
  • dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode

    Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider. For example:

    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    
  • dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

    It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example “shell(/bin/true)”.

    The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.


    sshfence - SSH to the Active NameNode and kill the process

    The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:

        <property>
          <name>dfs.ha.fencing.methods</name>
          <value>sshfence</value>
        </property>
    
        <property>
          <name>dfs.ha.fencing.ssh.private-key-files</name>
          <value>/home/exampleuser/.ssh/id_rsa</value>
        </property>
    

    Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:

        <property>
          <name>dfs.ha.fencing.methods</name>
          <value>sshfence([[username][:port]])</value>
        </property>
        <property>
          <name>dfs.ha.fencing.ssh.connect-timeout</name>
          <value>30000</value>
        </property>
    

    shell - run an arbitrary shell command to fence the Active NameNode

    The shell fencing method runs an arbitrary shell command. It may be configured like so:

        <property>
          <name>dfs.ha.fencing.methods</name>
          <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
        </property>
    

    The string between ‘(’ and ‘)’ is passed directly to a bash shell and may not include any closing parentheses.

    The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the ‘_’ character replacing any ‘.’ characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms – for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

    Additionally, the following variables referring to the target node to be fenced are also available:

    $target_host            hostname of the node to be fenced
    $target_port            IPC port of the node to be fenced
    $target_address         the above two, combined as host:port
    $target_nameserviceid   the nameservice ID of the NN to be fenced
    $target_namenodeid      the namenode ID of the NN to be fenced

    These environment variables may also be used as substitutions in the shell command itself. For example:

        <property>
          <name>dfs.ha.fencing.methods</name>
          <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
        </property>
    

    If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.

    Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (eg by forking a subshell to kill its parent in some number of seconds).
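
    One minimal way to add a timeout, assuming the GNU coreutils timeout(1) utility is available and that /path/to/real-fence.sh stands in for your actual fencing logic (both names are placeholders, not part of Hadoop), is a small wrapper script that you then configure via shell(...):

        #!/usr/bin/env bash
        # fence-with-timeout.sh (hypothetical wrapper)
        # Give the real fencing logic at most 30 seconds; if it is killed or
        # fails, the non-zero exit code lets the next method configured in
        # dfs.ha.fencing.methods be attempted.
        exec timeout 30 /path/to/real-fence.sh "$@"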


  • fs.defaultFS - the default path prefix used by the Hadoop FS client when none is given

    Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used “mycluster” as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://mycluster</value>
    </property>
    
  • dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state

    This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array. For example:

    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/path/to/journal/node/local/data</value>
    </property>
    

Deployment details

After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command “hadoop-daemon.sh start journalnode” and waiting for the daemon to start on each of the relevant machines.
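
On each JournalNode machine this typically amounts to something like the following; the $HADOOP_PREFIX-relative path matches the style of the other commands in this document, so adjust it to your installation:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start journalnode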

Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes’ on-disk metadata.

  • If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of NameNodes.

  • If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command “hdfs namenode -bootstrapStandby” on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.

  • If you are converting a non-HA NameNode to be HA, you should run the command “hdfs namenode -initializeSharedEdits”, which will initialize the JournalNodes with the edits data from the local NameNode edits directories.
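
Depending on which of the cases above applies, the commands referenced in that list might be run roughly as follows (an illustrative sketch; run each command on the host indicated in the comment):

# Fresh cluster: format exactly one of the NameNodes
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format
# Already-formatted or non-HA-to-HA conversion: on the other, unformatted NameNode
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
# Non-HA-to-HA conversion only: initialize the JournalNodes from the local edits
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -initializeSharedEdits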

At this point you may start both of your HA NameNodes as you normally would start a NameNode.
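
For example, a conventional per-host daemon start such as the following would do; the exact script name and location depend on your installation and Hadoop version:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode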

You can visit each of the NameNodes’ web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either “standby” or “active”.) Whenever an HA NameNode starts, it is initially in the Standby state.

Administrative commands

Now that your HA NameNodes are configured and started, you will have access to some additional commands to administer your HA HDFS cluster. Specifically, you should familiarize yourself with all of the subcommands of the “hdfs haadmin” command. Running this command without any additional arguments will display the following usage information:

Usage: haadmin
    [-transitionToActive <serviceId>]
    [-transitionToStandby <serviceId>]
    [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>]
    [-getServiceState <serviceId>]
    [-getAllServiceState]
    [-checkHealth <serviceId>]
    [-help <command>]

This guide describes high-level uses of each of these subcommands. For specific usage information of each subcommand, you should run “hdfs haadmin -help <command>”.

  • transitionToActive and transitionToStandby - transition the state of the given NameNode to Active or Standby

    These subcommands cause a given NameNode to transition to the Active or Standby state, respectively. These commands do not attempt to perform any fencing, and thus should rarely be used. Instead, one should almost always prefer to use the “hdfs haadmin -failover” subcommand.

  • failover - initiate a failover between two NameNodes

    This subcommand causes a failover from the first provided NameNode to the second. If the first NameNode is in the Standby state, this command simply transitions the second to the Active state without error. If the first NameNode is in the Active state, an attempt will be made to gracefully transition it to the Standby state. If this fails, the fencing methods (as configured by dfs.ha.fencing.methods) will be attempted in order until one succeeds. Only after this process will the second NameNode be transitioned to the Active state. If no fencing method succeeds, the second NameNode will not be transitioned to the Active state, and an error will be returned.

  • getServiceState - determine whether the given NameNode is Active or Standby

    Connect to the provided NameNode to determine its current state, printing either “standby” or “active” to STDOUT appropriately. This subcommand might be used by cron jobs or monitoring scripts which need to behave differently based on whether the NameNode is currently Active or Standby.

  • getAllServiceState - returns the state of all the NameNodes

    Connect to the configured NameNodes to determine the current state, print either “standby” or “active” to STDOUT appropriately.

  • checkHealth - check the health of the given NameNode

    Connect to the provided NameNode to check its health. The NameNode is capable of performing some diagnostics on itself, including checking if internal services are running as expected. This command will return 0 if the NameNode is healthy, non-zero otherwise. One might use this command for monitoring purposes.

    Note: This is not yet implemented, and at present will always return success, unless the given NameNode is completely down.
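
As a concrete illustration, with the NameNode IDs nn1 and nn2 configured earlier, typical invocations might look like the following (the output line is only indicative):

[hdfs]$ hdfs haadmin -getServiceState nn1
active
[hdfs]$ hdfs haadmin -failover nn1 nn2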

Automatic Failover

Introduction

The above sections describe how to configure manual failover. In that mode, the system will not automatically trigger a failover from the active to the standby NameNode, even if the active node has failed. This section describes how to configure and deploy automatic failover.

Components

Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).

Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:

  • Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.

  • Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.

The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

  • Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.

  • ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.

  • ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.

For more details on the design of automatic failover, refer to the design document attached to HDFS-2185 on the Apache HDFS JIRA.

Deploying ZooKeeper

In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on the same hardware as the HDFS NameNode and Standby Node. Many operators choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager. It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata for best performance and isolation.

The setup of ZooKeeper is out of scope for this document. We will assume that you have set up a ZooKeeper cluster running on three or more nodes, and have verified its correct operation by connecting using the ZK CLI.

Before you begin

Before you begin configuring automatic failover, you should shut down your cluster. It is not currently possible to transition from a manual failover setup to an automatic failover setup while the cluster is running.

Configuring automatic failover

The configuration of automatic failover requires the addition of two new parameters to your configuration. In your hdfs-site.xml file, add:

 <property>
   <name>dfs.ha.automatic-failover.enabled</name>
   <value>true</value>
 </property>

This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:

 <property>
   <name>ha.zookeeper.quorum</name>
   <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
 </property>

This lists the host-port pairs running the ZooKeeper service.

As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.
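
For example, to enable automatic failover only for the mycluster nameservice used throughout this guide, a per-nameservice key would look like this (a sketch; in a federated setup the other nameservices would simply omit it):

 <property>
   <name>dfs.ha.automatic-failover.enabled.mycluster</name>
   <value>true</value>
 </property>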

There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.

Initializing HA state in ZooKeeper

After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.

[hdfs]$ $HADOOP_PREFIX/bin/hdfs zkfc -formatZK

This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.

Starting the cluster with start-dfs.sh

Since automatic failover has been enabled in the configuration, the start-dfs.sh script will now automatically start a ZKFC daemon on any machine that runs a NameNode. When the ZKFCs start, they will automatically select one of the NameNodes to become active.
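
In other words, once the configuration above is in place, a normal cluster start such as the following should also bring up a ZKFC on each NameNode host (the path is illustrative; use your installation's sbin directory):

[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh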

Starting the cluster manually

If you manually manage the services on your cluster, you will need to manually start the zkfc daemon on each of the machines that runs a NameNode. You can start the daemon by running:

[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc

Securing access to ZooKeeper

If you are running a secure cluster, you will likely want to ensure that the information stored in ZooKeeper is also secured. This prevents malicious clients from modifying the metadata in ZooKeeper or potentially triggering a false failover.

In order to secure the information in ZooKeeper, first add the following to your core-site.xml file:

 <property>
   <name>ha.zookeeper.auth</name>
   <value>@/path/to/zk-auth.txt</value>
 </property>
 <property>
   <name>ha.zookeeper.acl</name>
   <value>@/path/to/zk-acl.txt</value>
 </property>

Please note the ‘@’ character in these values – this specifies that the configurations are not inline, but rather point to a file on disk.

The first configured file specifies a list of ZooKeeper authentications, in the same format as used by the ZK CLI. For example, you may specify something like:

digest:hdfs-zkfcs:mypassword

…where hdfs-zkfcs is a unique username for ZooKeeper, and mypassword is some unique string used as a password.

Next, generate a ZooKeeper ACL that corresponds to this authentication, using a command like the following:

[hdfs]$ java -cp $ZK_HOME/lib/*:$ZK_HOME/zookeeper-3.4.2.jar org.apache.zookeeper.server.auth.DigestAuthenticationProvider hdfs-zkfcs:mypassword
output: hdfs-zkfcs:mypassword->hdfs-zkfcs:P/OQvnYyU/nF/mGYvB/xurX8dYs=

Copy and paste the section of this output after the ‘->’ string into the file zk-acls.txt, prefixed by the string “digest:”. For example:

digest:hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=:rwcda

In order for these ACLs to take effect, you should then rerun the zkfc -formatZK command as described above.

After doing so, you may verify the ACLs from the ZK CLI as follows:

[zk: localhost:2181(CONNECTED) 1] getAcl /hadoop-ha
'digest,'hdfs-zkfcs:vlUvLnd8MlacsE80rDuu6ONESbM=
: cdrwa

Verifying automatic failover

Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces – each node reports its HA state at the top of the page.

Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 <pid of NN> to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.

If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
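
A rough walk-through of such a test, assuming nn1 is currently active and using only commands already described in this guide (the states in the comments are what you would expect to see, not captured output):

[hdfs]$ hdfs haadmin -getServiceState nn1      # expect "active"
[hdfs]$ hdfs haadmin -getServiceState nn2      # expect "standby"
# on nn1's machine: kill -9 <pid of NN> to simulate a JVM crash
[hdfs]$ hdfs haadmin -getServiceState nn2      # should report "active" within a few seconds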

Automatic Failover FAQ

  • Is it important that I start the ZKFC and NameNode daemons in any particular order?

    No. On any given node you may start the ZKFC before or after its corresponding NameNode.

  • What additional monitoring should I put in place?

    You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted to ensure that the system is ready for automatic failover.

    Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes, then automatic failover will not function.

  • What happens if ZooKeeper goes down?

    If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.

  • Can I designate one of my NameNodes as primary/preferred?

    No. Currently, this is not supported. Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.

  • How can I initiate a manual failover when automatic failover is configured?

    Even if automatic failover is configured, you may initiate a manual failover using the same hdfs haadmin command. It will perform a coordinated failover.

HDFS Upgrade/Finalization/Rollback with HA Enabled

When moving between versions of HDFS, sometimes the newer software can simply be installed and the cluster restarted. Sometimes, however, upgrading the version of HDFS you’re running may require changing on-disk data. In this case, one must use the HDFS Upgrade/Finalize/Rollback facility after installing the new software. This process is made more complex in an HA environment, since the on-disk metadata that the NN relies upon is by definition distributed, both on the two HA NNs in the pair, and on the JournalNodes in the case that QJM is being used for the shared edits storage. This documentation section describes the procedure to use the HDFS Upgrade/Finalize/Rollback facility in an HA setup.

To perform an HA upgrade, the operator must do the following:

  1. Shut down all of the NNs as normal, and install the newer software.

  2. Start up all of the JNs. Note that it is critical that all the JNs be running when performing the upgrade, rollback, or finalization operations. If any of the JNs are down at the time of running any of these operations, the operation will fail.

  3. Start one of the NNs with the '-upgrade' flag.

  4. On start, this NN will not enter the standby state as usual in an HA setup. Rather, this NN will immediately enter the active state, perform an upgrade of its local storage dirs, and also perform an upgrade of the shared edit log.

  5. At this point the other NN in the HA pair will be out of sync with the upgraded NN. In order to bring it back in sync and once again have a highly available setup, you should re-bootstrap this NameNode by running the NN with the '-bootstrapStandby' flag. It is an error to start this second NN with the '-upgrade' flag.

Note that if at any time you want to restart the NameNodes before finalizing or rolling back the upgrade, you should start the NNs as normal, i.e. without any special startup flag.

To finalize an HA upgrade, the operator will use the `hdfs dfsadmin -finalizeUpgrade' command while the NNs are running and one of them is active. The active NN at the time this happens will perform the finalization of the shared log, and the NN whose local storage directories contain the previous FS state will delete its local state.

To perform a rollback of an upgrade, both NNs should first be shut down. The operator should run the roll back command on the NN where they initiated the upgrade procedure, which will perform the rollback on the local dirs there, as well as on the shared log, either NFS or on the JNs. Afterward, this NN should be started and the operator should run `-bootstrapStandby' on the other NN to bring the two NNs in sync with this rolled-back file system state.
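
Sketching the upgrade-related commands above in one place (the paths follow the $HADOOP_PREFIX convention used earlier in this document, and the exact way you pass the '-upgrade' flag to the NameNode startup may differ in your environment):

# Step 3: start the NN chosen to drive the upgrade with the '-upgrade' flag
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode -upgrade
# Step 5: re-bootstrap the other NN once the upgraded NN is active
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -bootstrapStandby
# Finalize later, while both NNs are running and one of them is active
[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -finalizeUpgrade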
