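Once Hadoop has started the NameNode and the DataNodes, how do the two sides interact? How does a DataNode register with the NameNode, and how does it report its data? How does the NameNode in turn send commands to a DataNode?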
Heartbeat basics
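A heartbeat is the periodic report that each HDFS worker node, a DataNode, sends to the master node, the NameNode. The DataNode reports its health and load, and picks up commands from the NameNode to run locally. This keeps the NameNode, as the coordinator of HDFS, informed about the state of the whole cluster and lets it direct the DataNodes, whether to serve external read and write requests or to carry out internal tasks such as load balancing.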
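In addition, when the cluster starts, the NameNode calls loadNamesystem(conf); from NameNode#initialize. That call loads the fsimage and edits files from disk and initializes FSNamesystem, FSDirectory, LeaseManager and so on. Information about the DataNodes, however, is not persisted in the NameNode's local file system; it is rebuilt dynamically on every start.
That information is exactly what each DataNode reports to the NameNode through heartbeats once it has joined the cluster.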
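To make the idea of rebuilding DataNode state purely from heartbeats concrete, here is a minimal toy model. It is not Hadoop code: the class `HeartbeatRegistry`, its methods and the 30-second expiry window are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a master that learns about workers only through heartbeats. */
final class HeartbeatRegistry {
    // Nothing here is persisted: the map starts empty on every "restart".
    private final Map<String, Long> lastSeenMs = new HashMap<>();
    private final long expiryMs;

    HeartbeatRegistry(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    /** Record a heartbeat from the given worker at the given time. */
    void onHeartbeat(String workerId, long nowMs) {
        lastSeenMs.put(workerId, nowMs);
    }

    /** A worker counts as live if it reported within the expiry window. */
    boolean isLive(String workerId, long nowMs) {
        Long seen = lastSeenMs.get(workerId);
        return seen != null && nowMs - seen <= expiryMs;
    }

    public static void main(String[] args) {
        HeartbeatRegistry registry = new HeartbeatRegistry(30_000);
        registry.onHeartbeat("dn-1", 0);
        System.out.println(registry.isLive("dn-1", 3_000));
        System.out.println(registry.isLive("dn-1", 60_000));
        System.out.println(registry.isLive("dn-2", 3_000));
    }
}
```

A worker that never sent a heartbeat is simply unknown, which mirrors why the NameNode has no DataNode information until the DataNodes report in.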
BlockPoolManager
Looking at the DataNode startup code, DataNode has a private member, private BlockPoolManager blockPoolManager; which is initialized in DataNode#startDataNode:
// Instantiate the BlockPoolManager
blockPoolManager = new BlockPoolManager(this);
blockPoolManager.refreshNamenodes(getConf());
Next, step into the BlockPoolManager class and look at its class comment and the BlockPoolManager(DataNode dn) constructor:
/**
* Manages the BPOfferService objects for the data node.
* Creation, removal, starting, stopping, shutdown on BPOfferService
* objects must be done via APIs in this class.
*/
@InterfaceAudience.Private
class BlockPoolManager {
private static final Logger LOG = DataNode.LOG;
// Map from nameserviceId to BPOfferService
private final Map<String, BPOfferService> bpByNameserviceId =
Maps.newHashMap();
// Map from blockPoolId to BPOfferService
private final Map<String, BPOfferService> bpByBlockPoolId =
Maps.newHashMap();
// All BPOfferService instances
private final List<BPOfferService> offerServices =
new CopyOnWriteArrayList<>();
private final DataNode dn;
//This lock is used only to ensure exclusion of refreshNamenodes
private final Object refreshNamenodesLock = new Object();
BlockPoolManager(DataNode dn) {
this.dn = dn;
}
// ... remaining code omitted
}
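As shown, the constructor merely stores the DataNode reference in a private field. The class comment tells us that BlockPoolManager manages every BPOfferService in the DataNode: their whole lifecycle and all operations on them go through BlockPoolManager.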
The BPOfferService class
Next, the class definition and member variables of BPOfferService:
/**
* One instance per block-pool/namespace on the DN, which handles the
* heartbeats to the active and standby NNs for that namespace.
* This class manages an instance of {@link BPServiceActor} for each NN,
* and delegates calls to both NNs.
* It also maintains the state about which of the NNs is considered active.
*/
@InterfaceAudience.Private
class BPOfferService {
static final Logger LOG = DataNode.LOG;
/**
* Information about the namespace that this service
* is registering with. This is assigned after
* the first phase of the handshake.
*/
NamespaceInfo bpNSInfo;
/**
* The registration information for this block pool.
* This is assigned after the second phase of the
* handshake.
*/
volatile DatanodeRegistration bpRegistration;
private final String nameserviceId;
private volatile String bpId;
private final DataNode dn;
/**
* A reference to the BPServiceActor associated with the currently
* ACTIVE NN. In the case that all NameNodes are in STANDBY mode,
* this can be null. If non-null, this must always refer to a member
* of the {@link #bpServices} list.
*/
private BPServiceActor bpServiceToActive = null;
/**
* The list of all actors for namenodes in this nameservice, regardless
* of their active or standby states.
*/
private final List<BPServiceActor> bpServices =
new CopyOnWriteArrayList<BPServiceActor>();
/**
* Each time we receive a heartbeat from a NN claiming to be ACTIVE,
* we record that NN's most recent transaction ID here, so long as it
* is more recent than the previous value. This allows us to detect
* split-brain scenarios in which a prior NN is still asserting its
* ACTIVE state but with a too-low transaction ID. See HDFS-2627
* for details.
*/
private long lastActiveClaimTxId = -1;
// Read-write lock and its read/write views
private final ReentrantReadWriteLock mReadWriteLock =
new ReentrantReadWriteLock();
private final Lock mReadLock = mReadWriteLock.readLock();
private final Lock mWriteLock = mReadWriteLock.writeLock();
// utility methods to acquire and release read lock and write lock
void readLock() {
mReadLock.lock();
}
void readUnlock() {
mReadLock.unlock();
}
void writeLock() {
mWriteLock.lock();
}
void writeUnlock() {
mWriteLock.unlock();
}
BPOfferService(
final String nameserviceId, List<String> nnIds,
List<InetSocketAddress> nnAddrs,
List<InetSocketAddress> lifelineNnAddrs,
DataNode dn) {
Preconditions.checkArgument(!nnAddrs.isEmpty(),
"Must pass at least one NN.");
Preconditions.checkArgument(nnAddrs.size() == lifelineNnAddrs.size(),
"Must pass same number of NN addresses and lifeline addresses.");
this.nameserviceId = nameserviceId;
this.dn = dn;
// One BPServiceActor per NameNode
for (int i = 0; i < nnAddrs.size(); ++i) {
this.bpServices.add(new BPServiceActor(nameserviceId, nnIds.get(i),
nnAddrs.get(i), lifelineNnAddrs.get(i), this));
}
}
// ... remaining code omitted
}
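As the code shows, there is one BPOfferService instance per block pool (namespace) on the DataNode. It handles the heartbeats to the active and standby NameNodes of that namespace, manages one BPServiceActor per NameNode, and tracks which NameNode is currently active.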
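The lastActiveClaimTxId field guards against split-brain: a NameNode that claims to be ACTIVE but presents a transaction ID no newer than the last accepted claim is ignored. The sketch below illustrates only that comparison. It is a simplified stand-in, not the actual BPOfferService logic; the class `ActiveClaimTracker` and its methods are hypothetical.

```java
/** Simplified illustration of rejecting a stale ACTIVE claim by transaction ID. */
final class ActiveClaimTracker {
    private long lastActiveClaimTxId = -1;
    private String activeNn = null;

    /** Returns true if the ACTIVE claim from nnId is accepted. */
    boolean onActiveClaim(String nnId, long txId) {
        if (nnId.equals(activeNn)) {
            // The NN we already consider active: just advance the recorded txid.
            lastActiveClaimTxId = Math.max(lastActiveClaimTxId, txId);
            return true;
        }
        if (txId <= lastActiveClaimTxId) {
            // A different NN claims ACTIVE with an older view: likely a former active.
            return false;
        }
        activeNn = nnId;
        lastActiveClaimTxId = txId;
        return true;
    }

    String activeNn() {
        return activeNn;
    }

    public static void main(String[] args) {
        ActiveClaimTracker tracker = new ActiveClaimTracker();
        System.out.println(tracker.onActiveClaim("nn1", 100));
        System.out.println(tracker.onActiveClaim("nn2", 150));
        System.out.println(tracker.onActiveClaim("nn1", 120));
        System.out.println(tracker.activeNn());
    }
}
```

In the example, nn1 keeps asserting ACTIVE after nn2 has taken over with a higher transaction ID, so its later claim is rejected.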
The BPServiceActor class
Each BPOfferService (one per block pool/namespace) holds one BPServiceActor per NameNode. Here is that class:
/**
* A thread per active or standby namenode to perform:
* <ul>
* <li> Pre-registration handshake with namenode</li>
* <li> Registration with namenode</li>
* <li> Send periodic heartbeats to the namenode</li>
* <li> Handle commands received from the namenode</li>
* </ul>
*/
@InterfaceAudience.Private
class BPServiceActor implements Runnable {
// ... remaining code omitted
BPServiceActor(String serviceId, String nnId, InetSocketAddress nnAddr,
InetSocketAddress lifelineNnAddr, BPOfferService bpos) {
this.bpos = bpos;
this.dn = bpos.getDataNode();
this.nnAddr = nnAddr;
this.lifelineSender = lifelineNnAddr != null ?
new LifelineSender(lifelineNnAddr) : null;
this.initialRegistrationComplete = lifelineNnAddr != null ?
new CountDownLatch(1) : null;
this.dnConf = dn.getDnConf();
this.ibrManager = new IncrementalBlockReportManager(
dnConf.ibrInterval,
dn.getMetrics());
prevBlockReportId = ThreadLocalRandom.current().nextLong();
fullBlockReportLeaseId = 0;
scheduler = new Scheduler(dnConf.heartBeatInterval,
dnConf.getLifelineIntervalMs(), dnConf.blockReportInterval,
dnConf.outliersReportIntervalMs);
// get the value of maxDataLength.
this.maxDataLength = dnConf.getMaxDataLength();
if (serviceId != null) {
this.serviceId = serviceId;
}
if (nnId != null) {
this.nnId = nnId;
}
commandProcessingThread = new CommandProcessingThread(this);
commandProcessingThread.start();
}
// ... remaining code omitted
}
So BPServiceActor is the worker thread that talks to one specific NameNode; the class comment lists its responsibilities explicitly.
DataNode#createDataNode
Finally, back to the DataNode#createDataNode method:
/** Instantiate & Start a single datanode daemon and wait for it to
* finish.
* If this thread is specifically interrupted, it will stop waiting.
*/
@VisibleForTesting
@InterfaceAudience.Private
public static DataNode createDataNode(String args[], Configuration conf,
SecureResources resources) throws IOException {
// Initialize the DataNode
DataNode dn = instantiateDataNode(args, conf, resources);
if (dn != null) {
// Start the DataNode daemon threads
dn.runDatanodeDaemon();
}
return dn;
}
public void runDatanodeDaemon() throws IOException {
blockPoolManager.startAll();
// start dataXceiveServer
dataXceiverServer.start();
if (localDataXceiverServer != null) {
localDataXceiverServer.start();
}
ipcServer.setTracer(tracer);
ipcServer.start();
startPlugins(getConf());
}
blockPoolManager.startAll(); is called here, followed by a chain of start() calls:
// BlockPoolManager#startAll()
synchronized void startAll() throws IOException {
try {
UserGroupInformation.getLoginUser().doAs(
new PrivilegedExceptionAction<Object>() {
@Override
public Object run() throws Exception {
for (BPOfferService bpos : offerServices) {
bpos.start();
}
return null;
}
});
} catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
}
}
// BPOfferService#start()
//This must be called only by blockPoolManager
void start() {
for (BPServiceActor actor : bpServices) {
actor.start();
}
}
// BPServiceActor#start()
//This must be called only by BPOfferService
void start() {
if ((bpThread != null) && (bpThread.isAlive())) {
//Thread is started already
return;
}
bpThread = new Thread(this);
bpThread.setDaemon(true); // needed for JUnit testing
if (lifelineSender != null) {
lifelineSender.start();
}
bpThread.start();
}
The chain ends in BPServiceActor#start(), which starts the actor's own thread and the lifeline sender thread. After that, DataNode#secureMain calls datanode.join(); to wait for these child threads to finish.
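So the overall structure of the heartbeat mechanism is:
- Each DataNode has one BlockPoolManager instance.
- Each BlockPoolManager manages the BPOfferService instances for all nameservices.
- Each BPOfferService manages the BPServiceActor worker threads from its namespace to all of its NameNodes: one active NN and any number of standby NNs.
- A BPServiceActor is the worker thread that talks to one specific NameNode, sending heartbeats and receiving the commands returned in response.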
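The fan-out described above (one manager, one service per nameservice, one actor per NameNode) can be sketched with plain collections. This is a toy illustration only: `HeartbeatTopologySketch`, `actorsFor` and the label format are made up, and the addresses are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the manager -> per-nameservice service -> per-NameNode actor fan-out. */
final class HeartbeatTopologySketch {
    /**
     * Returns one label per (nameservice, NameNode) pair, i.e. one per actor
     * that would be created. The labels are purely illustrative.
     */
    static List<String> actorsFor(Map<String, List<String>> nnAddrsByNameservice) {
        List<String> actors = new ArrayList<>();
        for (Map.Entry<String, List<String>> ns : nnAddrsByNameservice.entrySet()) {
            for (String nnAddr : ns.getValue()) {
                actors.add(ns.getKey() + " -> " + nnAddr);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        Map<String, List<String>> conf = new LinkedHashMap<>();
        conf.put("ns1", Arrays.asList("nn1:8020", "nn2:8020"));
        conf.put("ns2", Arrays.asList("nn3:8020"));
        for (String actor : actorsFor(conf)) {
            System.out.println(actor);
        }
    }
}
```

With two nameservices and three NameNodes in total, a single DataNode would run three actor threads, each heartbeating to its own NameNode.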
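The overall flow of the heartbeat mechanism is:
- DataNode#startDataNode instantiates the BlockPoolManager.
- DataNode#startDataNode calls BlockPoolManager#refreshNamenodes to refresh the NameNodes of each nameservice and create the corresponding BPOfferService and BPServiceActor instances, which then connect to the NameNodes.
- DataNode#createDataNode calls BlockPoolManager#startAll to start all heartbeat-related threads.
- DataNode#secureMain calls datanode.join() to wait until the heartbeat threads are terminated.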
Heartbeat code in detail
Now for the concrete implementation of the heartbeat mechanism.
DataNode#startDataNode
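First, how the DataNode startup path invokes the heartbeat machinery:
// Starts the DataNode with the given conf. If the conf's CONFIG_PROPERTY_SIMULATED property is set, a simulated storage-based DataNode is created.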
void startDataNode(List<StorageLocation> dataDirectories,
SecureResources resources
) throws IOException {
// ... see the previous post for the full method
// Initialize following the namespace (nameservice) / NameNode structure
blockPoolManager = new BlockPoolManager(this);
// Heartbeat management
blockPoolManager.refreshNamenodes(getConf());
// ......
}
The BlockPoolManager constructor is just this.dn = dn;
BlockPoolManager#refreshNamenodes
The heartbeat management call is the part to focus on:
void refreshNamenodes(Configuration conf)
throws IOException {
// DFSConfigKeys.DFS_NAMESERVICES: the dfs.nameservices configuration key; defaults to null
LOG.info("Refresh request received for nameservices: " +
conf.get(DFSConfigKeys.DFS_NAMESERVICES));
Map<String, Map<String, InetSocketAddress>> newAddressMap = null;
Map<String, Map<String, InetSocketAddress>> newLifelineAddressMap = null;
try {
// Get the InetSocketAddresses of the NameNodes that manage this cluster, from dfs.namenode.servicerpc-address
// (falling back to dfs.namenode.rpc-address when no service RPC address is set)
// Returned as Map<nameserviceId, Map<namenodeId, InetSocketAddress>>
newAddressMap =
DFSUtil.getNNServiceRpcAddressesForCluster(conf);
// Get the InetSocketAddresses of the lifeline RPC servers on the NameNodes, from dfs.namenode.lifeline.rpc-address
newLifelineAddressMap =
DFSUtil.getNNLifelineRpcAddressesForCluster(conf);
} catch (IOException ioe) {
LOG.warn("Unable to get NameNode addresses.", ioe);
}
if (newAddressMap == null || newAddressMap.isEmpty()) {
throw new IOException("No services to connect, missing NameNode " +
"address.");
}
synchronized (refreshNamenodesLock) {
doRefreshNamenodes(newAddressMap, newLifelineAddressMap);
}
}
refreshNamenodes builds the cluster's Map<nameserviceId, Map<namenodeId, InetSocketAddress>> from the configuration, plus a map of the same shape for the lifeline addresses. It then calls doRefreshNamenodes to do the actual refresh.
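The shape of that nested map is easier to see with a small stand-alone example. The sketch below is not DFSUtil: it is a hypothetical parser over java.util.Properties that assumes an HA-style layout using the keys dfs.nameservices, dfs.ha.namenodes.<ns> and dfs.namenode.rpc-address.<ns>.<nn>, and assumes every address is written as host:port.

```java
import java.net.InetSocketAddress;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical, simplified builder for Map<nameserviceId, Map<namenodeId, address>>. */
final class NnAddressMapSketch {
    static Map<String, Map<String, InetSocketAddress>> build(Properties conf) {
        Map<String, Map<String, InetSocketAddress>> result = new LinkedHashMap<>();
        for (String rawNs : conf.getProperty("dfs.nameservices", "").split(",")) {
            String ns = rawNs.trim();
            if (ns.isEmpty()) {
                continue;
            }
            Map<String, InetSocketAddress> perNn = new LinkedHashMap<>();
            for (String rawNn : conf.getProperty("dfs.ha.namenodes." + ns, "").split(",")) {
                String nn = rawNn.trim();
                String hostPort = conf.getProperty("dfs.namenode.rpc-address." + ns + "." + nn);
                if (nn.isEmpty() || hostPort == null) {
                    continue;
                }
                // Assumption of this sketch: the value is always "host:port".
                int colon = hostPort.lastIndexOf(':');
                perNn.put(nn, InetSocketAddress.createUnresolved(
                        hostPort.substring(0, colon),
                        Integer.parseInt(hostPort.substring(colon + 1))));
            }
            result.put(ns, perNn);
        }
        return result;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("dfs.nameservices", "ns1");
        conf.setProperty("dfs.ha.namenodes.ns1", "nn1,nn2");
        conf.setProperty("dfs.namenode.rpc-address.ns1.nn1", "host1:8020");
        conf.setProperty("dfs.namenode.rpc-address.ns1.nn2", "host2:8020");
        System.out.println(build(conf));
    }
}
```

An empty result from such a builder corresponds to the "No services to connect, missing NameNode address." error path in refreshNamenodes.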
BlockPoolManager#doRefreshNamenodes
private void doRefreshNamenodes(
Map<String, Map<String, InetSocketAddress>> addrMap,
Map<String, Map<String, InetSocketAddress>> lifelineAddrMap)
throws IOException {
assert Thread.holdsLock(refreshNamenodesLock);
Set<String> toRefresh = Sets.newLinkedHashSet();
Set<String> toAdd = Sets.newLinkedHashSet();
Set<String> toRemove;
synchronized (this) {
// Step 1. For each of the new nameservices, figure out whether
// it's an update of the set of NNs for an existing NS,
// or an entirely new nameservice.
for (String nameserviceId : addrMap.keySet()) {
if (bpByNameserviceId.containsKey(nameserviceId)) {
toRefresh.add(nameserviceId);
} else {
toAdd.add(nameserviceId);
}
}
// Step 2. Any nameservices we currently have but are no longer present need to be removed.
// That is, anything present in bpByNameserviceId but missing from the configured addrMap.
toRemove = Sets.newHashSet(Sets.difference(
bpByNameserviceId.keySet(), addrMap.keySet()));
assert toRefresh.size() + toAdd.size() ==
addrMap.size() :
"toAdd: " + Joiner.on(",").useForNull("<default>").join(toAdd) +
" toRemove: " + Joiner.on(",").useForNull("<default>").join(toRemove) +
" toRefresh: " + Joiner.on(",").useForNull("<default>").join(toRefresh);
// Step 3. Start new nameservices
if (!toAdd.isEmpty()) {
LOG.info("Starting BPOfferServices for nameservices: " +
Joiner.on(",").useForNull("<default>").join(toAdd));
for (String nsToAdd : toAdd) {
Map<String, InetSocketAddress> nnIdToAddr = addrMap.get(nsToAdd);
Map<String, InetSocketAddress> nnIdToLifelineAddr =
lifelineAddrMap.get(nsToAdd);
ArrayList<InetSocketAddress> addrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<String> nnIds =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<InetSocketAddress> lifelineAddrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
for (String nnId : nnIdToAddr.keySet()) {
addrs.add(nnIdToAddr.get(nnId));
nnIds.add(nnId);
lifelineAddrs.add(nnIdToLifelineAddr != null ?
nnIdToLifelineAddr.get(nnId) : null);
}
// Create the new BPOfferService
BPOfferService bpos = createBPOS(nsToAdd, nnIds, addrs,
lifelineAddrs);
// Add the new bpos to the collections
bpByNameserviceId.put(nsToAdd, bpos);
offerServices.add(bpos);
}
}
// Start them all
startAll();
}
// Step 4. Shut down old nameservices. This happens outside
// of the synchronized(this) lock since they need to call
// back to .remove() from another thread
if (!toRemove.isEmpty()) {
LOG.info("Stopping BPOfferServices for nameservices: " +
Joiner.on(",").useForNull("<default>").join(toRemove));
for (String nsToRemove : toRemove) {
BPOfferService bpos = bpByNameserviceId.get(nsToRemove);
bpos.stop();
bpos.join();
// they will call remove on their own
// The call chain is roughly:
// bpos.stop() -> actor.stop(); -> shouldServiceRun = false;
// bpos.join() -> actor.join(); -> bpThread.join();
// -> in BPServiceActor#run, shouldRun() returns false and the finally block runs BPServiceActor#cleanUp
// -> BPOfferService#shutdownActor -> DataNode#shutdownBlockPool -> BlockPoolManager#remove
}
}
// Step 5. Update nameservices whose NN list has changed
if (!toRefresh.isEmpty()) {
LOG.info("Refreshing list of NNs for nameservices: " +
Joiner.on(",").useForNull("<default>").join(toRefresh));
for (String nsToRefresh : toRefresh) {
BPOfferService bpos = bpByNameserviceId.get(nsToRefresh);
Map<String, InetSocketAddress> nnIdToAddr = addrMap.get(nsToRefresh);
Map<String, InetSocketAddress> nnIdToLifelineAddr =
lifelineAddrMap.get(nsToRefresh);
ArrayList<InetSocketAddress> addrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<InetSocketAddress> lifelineAddrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<String> nnIds = Lists.newArrayListWithCapacity(
nnIdToAddr.size());
for (String nnId : nnIdToAddr.keySet()) {
addrs.add(nnIdToAddr.get(nnId));
lifelineAddrs.add(nnIdToLifelineAddr != null ?
nnIdToLifelineAddr.get(nnId) : null);
nnIds.add(nnId);
}
try {
UserGroupInformation.getLoginUser()
.doAs(new PrivilegedExceptionAction<Object>() {
@Override
public Object run() throws Exception {
bpos.refreshNNList(nsToRefresh, nnIds, addrs, lifelineAddrs);
return null;
}
});
} catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
}
}
}
}
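Per the upstream comments there are five steps. Steps 1 and 2 compare the nameservices that refreshNamenodes derived from the configuration with the ones already tracked in bpByNameserviceId, and sort the differences into three sets: toRefresh, toAdd and toRemove.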
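The classification in steps 1 and 2 is plain set arithmetic. Here is a minimal sketch using only java.util; the class name `NameserviceDiff` is hypothetical, and the real method uses Guava's Sets helpers instead.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Splits configured vs. currently tracked nameservices into add/refresh/remove sets. */
final class NameserviceDiff {
    final Set<String> toAdd = new LinkedHashSet<>();
    final Set<String> toRefresh = new LinkedHashSet<>();
    final Set<String> toRemove = new LinkedHashSet<>();

    NameserviceDiff(Set<String> current, Set<String> configured) {
        // Step 1: configured nameservices are either already known or new.
        for (String ns : configured) {
            if (current.contains(ns)) {
                toRefresh.add(ns);
            } else {
                toAdd.add(ns);
            }
        }
        // Step 2: anything tracked but no longer configured must go.
        toRemove.addAll(current);
        toRemove.removeAll(configured);
    }

    public static void main(String[] args) {
        Set<String> current = new LinkedHashSet<>(Arrays.asList("ns1", "ns2"));
        Set<String> configured = new LinkedHashSet<>(Arrays.asList("ns2", "ns3"));
        NameserviceDiff diff = new NameserviceDiff(current, configured);
        System.out.println("toAdd=" + diff.toAdd
                + " toRefresh=" + diff.toRefresh
                + " toRemove=" + diff.toRemove);
    }
}
```

By construction toAdd and toRefresh together cover exactly the configured set, which is the invariant the assert in doRefreshNamenodes checks.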
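Step 3 starts all the new nameservices. Its code falls into three parts: collecting the arguments, creating each new BPOfferService and storing the new bpos in the member collections, and finally starting all the bpos that were created.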
Creating a new BPOfferService
First, what the second part, BPOfferService bpos = createBPOS(nsToAdd, nnIds, addrs, lifelineAddrs); actually does:
protected BPOfferService createBPOS(
final String nameserviceId,
List<String> nnIds,
List<InetSocketAddress> nnAddrs,
List<InetSocketAddress> lifelineNnAddrs) {
return new BPOfferService(nameserviceId, nnIds, nnAddrs, lifelineNnAddrs,
dn);
}
This exists as a separate method only to make testing easier; it simply calls the BPOfferService constructor.
The BPOfferService constructor
BPOfferService(
final String nameserviceId, List<String> nnIds,
List<InetSocketAddress> nnAddrs,
List<InetSocketAddress> lifelineNnAddrs,
DataNode dn) {
// At least one NameNode address must be given
Preconditions.checkArgument(!nnAddrs.isEmpty(),
"Must pass at least one NN.");
// The number of NameNode addresses and lifeline addresses must match
Preconditions.checkArgument(nnAddrs.size() == lifelineNnAddrs.size(),
"Must pass same number of NN addresses and lifeline addresses.");
this.nameserviceId = nameserviceId;
this.dn = dn;
// Create one BPServiceActor per NameNode and store it in bpServices.
for (int i = 0; i < nnAddrs.size(); ++i) {
this.bpServices.add(new BPServiceActor(nameserviceId, nnIds.get(i),
nnAddrs.get(i), lifelineNnAddrs.get(i), this));
}
}
Apart from the checks and assignments, it just calls the BPServiceActor constructor once per NameNode. On to that constructor.
The BPServiceActor constructor
BPServiceActor(String serviceId, String nnId, InetSocketAddress nnAddr,
InetSocketAddress lifelineNnAddr, BPOfferService bpos) {
this.bpos = bpos;
this.dn = bpos.getDataNode();
this.nnAddr = nnAddr;
this.lifelineSender = lifelineNnAddr != null ?
new LifelineSender(lifelineNnAddr) : null;
this.initialRegistrationComplete = lifelineNnAddr != null ?
new CountDownLatch(1) : null;
this.dnConf = dn.getDnConf();
// Initialize the manager for incremental block reports (IBRs)
this.ibrManager = new IncrementalBlockReportManager(
dnConf.ibrInterval,
dn.getMetrics());
prevBlockReportId = ThreadLocalRandom.current().nextLong();
fullBlockReportLeaseId = 0;
// Instantiate the Scheduler: a utility class wrapping the timestamp computations used to schedule heartbeats and block reports
scheduler = new Scheduler(dnConf.heartBeatInterval,
dnConf.getLifelineIntervalMs(), dnConf.blockReportInterval,
dnConf.outliersReportIntervalMs);
// get the value of maxDataLength.
// Reads ipc.maximum.data.length, the largest request the server will accept. Default: 128 * 1024 * 1024 (128 MB)
this.maxDataLength = dnConf.getMaxDataLength();
if (serviceId != null) {
this.serviceId = serviceId;
}
if (nnId != null) {
this.nnId = nnId;
}
// Instantiate CommandProcessingThread, which processes commands asynchronously and runs as a daemon thread.
commandProcessingThread = new CommandProcessingThread(this);
commandProcessingThread.start();
}
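So the second part of step 3 instantiates all the required BPOfferService and BPServiceActor objects, and along the way also sets up the incremental block report manager, the Scheduler helper class for timestamp calculations, and a few other daemon threads.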
BlockPoolManager#startAll
Now for the most important method, startAll:
synchronized void startAll() throws IOException {
try {
UserGroupInformation.getLoginUser().doAs(
new PrivilegedExceptionAction<Object>() {
@Override
public Object run() throws Exception {
for (BPOfferService bpos : offerServices) {
bpos.start();
}
return null;
}
});
} catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
}
}
Its core job is to start every bpos that has been instantiated. Following the call into BPOfferService#start:
void start() {
for (BPServiceActor actor : bpServices) {
actor.start();
}
}
The bpos in turn starts every BPServiceActor it holds. On to BPServiceActor#start:
//This must be called only by BPOfferService
void start() {
if ((bpThread != null) && (bpThread.isAlive())) {
//Thread is started already
return;
}
bpThread = new Thread(this);
bpThread.setDaemon(true); // needed for JUnit testing
if (lifelineSender != null) {
lifelineSender.start();
}
bpThread.start();
}
This starts bpThread and lifelineSender; we look at them in the order the code runs.
lifelineSender.start()
LifelineSender is an inner class of BPServiceActor and implements Runnable and Closeable.
First, its start() method:
public void start() {
// Create a thread, passing this LifelineSender as the target
lifelineThread = new Thread(this,
formatThreadName("lifeline", lifelineNnAddr));
// Mark it as a daemon thread
lifelineThread.setDaemon(true);
lifelineThread.setUncaughtExceptionHandler(
new Thread.UncaughtExceptionHandler() {
@Override
public void uncaughtException(Thread thread, Throwable t) {
LOG.error(thread + " terminating on unexpected exception", t);
}
});
// Ends up invoking LifelineSender#run()
lifelineThread.start();
}
This creates a daemon thread with the LifelineSender itself as the target and calls the thread's start() method, which in turn runs the target's run() method. Here is LifelineSender#run:
@Override
public void run() {
// The lifeline RPC depends on registration with the NameNode, so wait for initial registration to complete.
while (shouldRun()) {
try {
initialRegistrationComplete.await();
break;
} catch (InterruptedException e) {
// The only way thread interruption can happen while waiting on this
// latch is if the state of the actor has been updated to signal
// shutdown. The next loop's call to shouldRun() will return false,
// and the thread will finish.
Thread.currentThread().interrupt();
}
}
// After initial NameNode registration has completed, execute the main
// loop for sending periodic lifeline RPCs if needed. This is done in a
// second loop to avoid a pointless wait on the above latch in every
// iteration of the main loop.
while (shouldRun()) {
try {
if (lifelineNamenode == null) {
lifelineNamenode = dn.connectToLifelineNN(lifelineNnAddr);
}
// Send a lifeline message if one is due according to the schedule
sendLifelineIfDue();
Thread.sleep(scheduler.getLifelineWaitTime());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
} catch (IOException e) {
LOG.warn("IOException in LifelineSender for " + BPServiceActor.this,
e);
}
}
LOG.info("LifelineSender for " + BPServiceActor.this + " exiting.");
}
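The method first blocks until the initial registration has completed (the handshake logic in bpThread), and only then starts sending lifeline messages to the NameNode.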
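The "wait for registration, then start sending" hand-off between the two threads is a standard CountDownLatch pattern. A minimal sketch follows; the class `LatchGateSketch` and the event strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Shows a second thread that may only proceed after the first one has "registered". */
final class LatchGateSketch {
    static List<String> runOnce() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch registered = new CountDownLatch(1);

        Thread lifeline = new Thread(() -> {
            try {
                registered.await();          // block until registration completes
                events.add("lifeline");      // only now start the periodic work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "lifeline");
        lifeline.setDaemon(true);
        lifeline.start();

        events.add("registered");            // stand-in for the handshake finishing
        registered.countDown();              // release the waiting thread

        try {
            lifeline.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Because the waiting thread can only pass await() after countDown(), "registered" is always recorded before "lifeline", regardless of thread scheduling.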
LifelineSender#sendLifelineIfDue
The logic for sending a lifeline message:
private void sendLifelineIfDue() throws IOException {
// Current monotonic time, used as the start time of this send
long startTime = scheduler.monotonicNow();
if (!scheduler.isLifelineDue(startTime)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Skipping sending lifeline for " + BPServiceActor.this
+ ", because it is not due.");
}
return;
}
if (dn.areHeartbeatsDisabledForTests()) {
if (LOG.isDebugEnabled()) {
LOG.debug("Skipping sending lifeline for " + BPServiceActor.this
+ ", because heartbeats are disabled for tests.");
}
return;
}
// Send the lifeline
sendLifeline();
// Record metrics for the lifeline message
dn.getMetrics().addLifeline(scheduler.monotonicNow() - startTime,
getRpcMetricSuffix());
// Schedule the next send
scheduler.scheduleNextLifeline(scheduler.monotonicNow());
}
private void sendLifeline() throws IOException {
// Get the DataNode's storage utilization reports
StorageReport[] reports =
dn.getFSDataset().getStorageReports(bpos.getBlockPoolId());
if (LOG.isDebugEnabled()) {
LOG.debug("Sending lifeline with " + reports.length + " storage " +
" reports from service actor: " + BPServiceActor.this);
}
// Summarize the DataNode's volume failures
VolumeFailureSummary volumeFailureSummary = dn.getFSDataset()
.getVolumeFailureSummary();
int numFailedVolumes = volumeFailureSummary != null ?
volumeFailureSummary.getFailedStorageLocations().length : 0;
// Send the lifeline
// For background on lifelines see this post: https://blog.csdn.net/Androidlushangderen/article/details/53783641
// NameNode side: see NameNodeRpcServer#sendLifeline
lifelineNamenode.sendLifeline(bpRegistration,
reports,
dn.getFSDataset().getCacheCapacity(),
dn.getFSDataset().getCacheUsed(),
dn.getXmitsInProgress(),
dn.getXceiverCount(),
numFailedVolumes,
volumeFailureSummary);
}
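The due-check and rescheduling used by sendLifelineIfDue can be modelled with a few lines of arithmetic. The sketch below is a simplified, hypothetical `LifelineSchedule` class that takes the current time as a parameter so it can be reasoned about deterministically; it is not the Scheduler class from BPServiceActor.

```java
/** Minimal model of "is it due / schedule next / how long to sleep" bookkeeping. */
final class LifelineSchedule {
    private final long intervalMs;
    private long nextLifelineMs;

    LifelineSchedule(long intervalMs, long nowMs) {
        this.intervalMs = intervalMs;
        this.nextLifelineMs = nowMs;   // assumption: the first message is due immediately
    }

    /** Due when the scheduled time is not in the future. */
    boolean isDue(long nowMs) {
        return nextLifelineMs - nowMs <= 0;
    }

    /** Push the next send one interval past the given base time. */
    void scheduleNext(long baseMs) {
        nextLifelineMs = baseMs + intervalMs;
    }

    /** How long the sender thread may sleep before checking again. */
    long waitTime(long nowMs) {
        return Math.max(0, nextLifelineMs - nowMs);
    }

    public static void main(String[] args) {
        LifelineSchedule schedule = new LifelineSchedule(1_000, 0);
        System.out.println(schedule.isDue(0));
        schedule.scheduleNext(0);
        System.out.println(schedule.isDue(500));
        System.out.println(schedule.waitTime(500));
    }
}
```

Comparing via subtraction (`next - now <= 0`) rather than `next <= now` is the usual idiom for monotonic clocks, since it stays correct if the counter wraps.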
bpThread#start
bpThread is created with new Thread(this); so the target is the BPServiceActor itself. That leads to BPServiceActor's run() method:
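// Keeps trying offerService() no matter which exception occurs. This is the loop that connects to the NameNode and provides basic DataNode functionality.
// It only stops when "shouldRun" or "shouldServiceRun" is turned off, which can happen on shutdown or because of refreshNamenodes.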
@Override
public void run() {
LOG.info(this + " starting to offer service");
try {
while (true) {
// init stuff
try {
// setup storage
// Connect to the NameNode and perform the handshake
connectToNNAndHandshake();
break;
} catch (IOException ioe) {
// Initial handshake, storage recovery or registration failed
runningState = RunningState.INIT_FAILED;
if (shouldRetryInit()) {
// Retry until all namenode's of BPOS failed initialization
LOG.error("Initialization failed for " + this + " "
+ ioe.getLocalizedMessage());
sleepAndLogInterrupts(5000, "initializing");
} else {
runningState = RunningState.FAILED;
LOG.error("Initialization failed for " + this + ". Exiting. ", ioe);
return;
}
}
}
runningState = RunningState.RUNNING;
// Handshake done; the lifeline sender may start now
if (initialRegistrationComplete != null) {
initialRegistrationComplete.countDown();
}
while (shouldRun()) {
try {
// Main loop of each BP thread: runs until shutdown, calling the remote NameNode over and over.
offerService();
} catch (Exception ex) {
LOG.error("Exception in BPOfferService for " + this, ex);
sleepAndLogInterrupts(5000, "offering service");
}
}
runningState = RunningState.EXITED;
} catch (Throwable ex) {
LOG.warn("Unexpected exception in block pool " + this, ex);
runningState = RunningState.FAILED;
} finally {
LOG.warn("Ending block pool service for: " + this);
// Once the loop ends, clean up this actor's connections; this eventually reaches BlockPoolManager#remove
cleanUp();
}
}
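The method does two main things: it connects to the NameNode and performs the handshake, and it then runs offerService in a loop, calling the NameNode until the DataNode shuts down or this actor is stopped.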
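The first half of run(), retrying the handshake until it succeeds or retrying is no longer worthwhile, can be reduced to the following sketch. `InitRetrySketch`, the `Handshake` interface and the `shouldRetry` supplier are hypothetical stand-ins for connectToNNAndHandshake() and shouldRetryInit().

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

/** Retry an initial handshake until it succeeds or the caller says to give up. */
final class InitRetrySketch {
    /** Hypothetical stand-in for the connect-and-handshake step. */
    interface Handshake {
        void run() throws IOException;
    }

    /** Returns the number of attempts on success, or -1 if retrying was abandoned. */
    static int connect(Handshake handshake, BooleanSupplier shouldRetry, long sleepMs) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                handshake.run();
                return attempts;                 // success: leave the init loop
            } catch (IOException e) {
                if (!shouldRetry.getAsBoolean()) {
                    return -1;                   // give up: the actor would exit here
                }
                try {
                    Thread.sleep(sleepMs);       // back off before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int attempts = connect(() -> {
            if (++calls[0] < 3) {
                throw new IOException("NameNode not ready");
            }
        }, () -> true, 10);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

Only after this loop returns successfully does the real code count down the registration latch and enter the offerService loop.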
BPServiceActor#connectToNNAndHandshake
The handshake in outline:
private void connectToNNAndHandshake() throws IOException {
// get NN proxy
// DatanodeProtocolClientSideTranslatorPB is the client-side translator that
// converts calls made on the DatanodeProtocol interface into requests to the RPC server implementing DatanodeProtocolPB.
bpNamenode = dn.connectToNN(nnAddr);
// First phase of the handshake with NN - get the namespace info.
NamespaceInfo nsInfo = retrieveNamespaceInfo();
// Verify that this matches the other NN in this HA pair.
// This also initializes our block pool in the DN if we are
// the first NN connection for this BP.
bpos.verifyAndSetNamespaceInfo(this, nsInfo);
/* set thread name again to include NamespaceInfo when it's available. */
this.bpThread.setName(formatThreadName("heartbeating", nnAddr));
// Second phase of the handshake with the NN.
register(nsInfo);
}
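Phase one:
// Performs the first part of the handshake with the NameNode. This calls versionRequest to determine the NN's namespace and version information.
// It retries automatically until the NN responds or the DN is shutting down.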
@VisibleForTesting
NamespaceInfo retrieveNamespaceInfo() throws IOException {
NamespaceInfo nsInfo = null;
while (shouldRun()) {
try {
// Fetch the NamespaceInfo that the NameNode returns in response to the DataNode handshake
nsInfo = bpNamenode.versionRequest();
LOG.debug(this + " received versionRequest response: " + nsInfo);
break;
} catch(SocketTimeoutException e) { // namenode is busy
LOG.warn("Problem connecting to server: " + nnAddr);
} catch(IOException e ) { // namenode is not available
LOG.warn("Problem connecting to server: " + nnAddr);
}
// try again in a second
// Actually retries after five seconds; the upstream comment above appears to be out of date
sleepAndLogInterrupts(5000, "requesting version info from NN");
}
if (nsInfo != null) {
checkNNVersion(nsInfo);
} else {
throw new IOException("DN shut down before block pool connected");
}
return nsInfo;
}
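Phase two:
// Register one block pool with the corresponding NameNode.
// The bpDatanode needs to register with the NameNode on startup in order to
// 1) report which storage it is now serving;
// 2) receive the registration ID issued by the NameNode, which identifies registered DataNodes.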
void register(NamespaceInfo nsInfo) throws IOException {
// The handshake() phase loaded the block pool storage
// off disk - so update the bpRegistration object from that info
DatanodeRegistration newBpRegistration = bpos.createRegistration();
LOG.info(this + " beginning handshake with NN");
while (shouldRun()) {
try {
// Use returned registration from namenode with updated fields
newBpRegistration = bpNamenode.registerDatanode(newBpRegistration);
newBpRegistration.setNamespaceInfo(nsInfo);
bpRegistration = newBpRegistration;
break;
} catch(EOFException e) { // namenode might have just restarted
LOG.info("Problem connecting to server: " + nnAddr + " :"
+ e.getLocalizedMessage());
} catch(SocketTimeoutException e) { // namenode is busy
LOG.info("Problem connecting to server: " + nnAddr);
} catch(RemoteException e) {
LOG.warn("RemoteException in register", e);
throw e;
} catch(IOException e) {
LOG.warn("Problem connecting to server: " + nnAddr);
}
// Try again in a second
sleepAndLogInterrupts(1000, "connecting to server");
}
if (bpRegistration == null) {
throw new IOException("DN shut down before block pool registered");
}
LOG.info(this + " successfully registered with NN");
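// After one of the BPServiceActors registers successfully with its NN, it calls this function to verify that the NN it connected to is consistent with the other NNs serving the block pool.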
bpos.registrationSucceeded(this, bpRegistration);
// reset lease id whenever registered to NN.
// ask for a new lease id at the next heartbeat.
fullBlockReportLeaseId = 0;
// random short delay - helps scatter the BR from all DNs
scheduler.scheduleBlockReport(dnConf.initialBlockReportDelayMs, true);
}
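The "random short delay" at the end of register exists so that many DataNodes that register at about the same moment do not all send their first full block report at once. A minimal sketch of that jitter follows; `BlockReportJitter` and `firstReportTime` are hypothetical names, and the exact formula in Hadoop's Scheduler may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Picks a randomized time for the first block report to spread load on the master. */
final class BlockReportJitter {
    /**
     * Returns a time in [nowMs, nowMs + maxDelayMs). With no configured delay
     * the report is due immediately.
     */
    static long firstReportTime(long nowMs, long maxDelayMs) {
        if (maxDelayMs <= 0) {
            return nowMs;
        }
        return nowMs + ThreadLocalRandom.current().nextLong(maxDelayMs);
    }

    public static void main(String[] args) {
        long now = 1_000L;
        // Three simulated DataNodes registering at the same instant.
        for (int dn = 1; dn <= 3; dn++) {
            System.out.println("dn-" + dn + " first block report at t="
                    + firstReportTime(now, 600L));
        }
    }
}
```

Each run prints different times, which is the point: the reports land at scattered instants inside the configured window.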
BPServiceActor#offerService
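This method keeps sending heartbeats and block reports to the NameNode.
It also sends a full block report (FBR) after startup. Once everything that is due has been sent, it sleeps until the next heartbeat is due and then repeats.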
private void offerService() throws Exception {
LOG.info("For namenode " + nnAddr + " using"
+ " BLOCKREPORT_INTERVAL of " + dnConf.blockReportInterval + "msecs"
+ " CACHEREPORT_INTERVAL of " + dnConf.cacheReportInterval + "msecs"
+ " Initial delay: " + dnConf.initialBlockReportDelayMs + "msecs"
+ "; heartBeatInterval=" + dnConf.heartBeatInterval
+ (lifelineSender != null ?
"; lifelineIntervalMs=" + dnConf.getLifelineIntervalMs() : ""));
//
// Now loop for a long time....
//
while (shouldRun()) {
try {
DataNodeFaultInjector.get().startOfferService();
final long startTime = scheduler.monotonicNow();
//
// Every so often, send heartbeat or block-report
final boolean sendHeartbeat = scheduler.isHeartbeatDue(startTime);
HeartbeatResponse resp = null;
if (sendHeartbeat) {
//
// All heartbeat messages include following info:
// -- Datanode name
// -- data transfer port
// -- Total capacity
// -- Bytes remaining
boolean requestBlockReportLease = (fullBlockReportLeaseId == 0) &&
scheduler.isBlockReportDue(startTime);
if (!dn.areHeartbeatsDisabledForTests()) {
// Send the heartbeat
resp = sendHeartBeat(requestBlockReportLease);
assert resp != null;
if (resp.getFullBlockReportLeaseId() != 0) {
if (fullBlockReportLeaseId != 0) {
LOG.warn(nnAddr + " sent back a full block report lease " +
"ID of 0x" +
Long.toHexString(resp.getFullBlockReportLeaseId()) +
", but we already have a lease ID of 0x" +
Long.toHexString(fullBlockReportLeaseId) + ". " +
"Overwriting old lease ID.");
}
fullBlockReportLeaseId = resp.getFullBlockReportLeaseId();
}
dn.getMetrics().addHeartbeat(scheduler.monotonicNow() - startTime,
getRpcMetricSuffix());
// If the state of this NN has changed (eg STANDBY->ACTIVE) then let the BPOfferService update itself.
//
// Important that this happens before processCommand below,
// since the first heartbeat to a new active might have commands that we should actually process.
bpos.updateActorStatesFromHeartbeat(
this, resp.getNameNodeHaState());
state = resp.getNameNodeHaState().getState();
if (state == HAServiceState.ACTIVE) {
handleRollingUpgradeStatus(resp);
}
commandProcessingThread.enqueue(resp.getCommands());
}
}
if (!dn.areIBRDisabledForTests() &&
(ibrManager.sendImmediately()|| sendHeartbeat)) {
// Send IBRs to the NameNode
ibrManager.sendIBRs(bpNamenode, bpRegistration,
bpos.getBlockPoolId(), getRpcMetricSuffix());
}
// DatanodeCommand: base class for DataNode commands, issued by the NameNode to tell the DataNode what to do.
List<DatanodeCommand> cmds = null;
boolean forceFullBr =
scheduler.forceFullBlockReport.getAndSet(false);
if (forceFullBr) {
LOG.info("Forcing a full block report to " + nnAddr);
}
if ((fullBlockReportLeaseId != 0) || forceFullBr) {
// Send a full block report to the NameNode
cmds = blockReport(fullBlockReportLeaseId);
fullBlockReportLeaseId = 0;
}
commandProcessingThread.enqueue(cmds);
if (!dn.areCacheReportsDisabledForTests()) {
// Send the cache report
DatanodeCommand cmd = cacheReport();
commandProcessingThread.enqueue(cmd);
}
if (sendHeartbeat) {
dn.getMetrics().addHeartbeatTotal(
scheduler.monotonicNow() - startTime, getRpcMetricSuffix());
}
// There is no work to do; sleep until hearbeat timer elapses, or work arrives, and then iterate again.
ibrManager.waitTillNextIBR(scheduler.getHeartbeatWaitTime());
} catch(RemoteException re) {
String reClass = re.getClassName();
if (UnregisteredNodeException.class.getName().equals(reClass) ||
DisallowedDatanodeException.class.getName().equals(reClass) ||
IncorrectVersionException.class.getName().equals(reClass)) {
LOG.warn(this + " is shutting down", re);
shouldServiceRun = false;
return;
}
LOG.warn("RemoteException in offerService", re);
sleepAfterException();
} catch (IOException e) {
LOG.warn("IOException in offerService", e);
sleepAfterException();
} finally {
DataNodeFaultInjector.get().endOfferService();
}
processQueueMessages();
} // while (shouldRun())
} // offerService
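Note that offerService never executes NameNode commands itself: it hands them to commandProcessingThread.enqueue(...) and goes straight back to heartbeating. That hand-off is a producer/consumer queue. The sketch below shows the pattern with a BlockingQueue; `CommandQueueSketch`, its string commands and the stop sentinel are assumptions of this example, not the real CommandProcessingThread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Heartbeat loop enqueues commands; a separate daemon thread processes them in order. */
final class CommandQueueSketch {
    private static final String STOP = "__stop__";   // sentinel used only by this sketch
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> processed = Collections.synchronizedList(new ArrayList<>());
    private final Thread worker = new Thread(this::drain, "command-processor");

    void start() {
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from the heartbeat loop: hand off the command and return immediately. */
    void enqueue(String command) {
        queue.add(command);
    }

    /** Ask the worker to finish, wait for it, and return what it processed. */
    List<String> shutdown() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(processed);
    }

    private void drain() {
        try {
            while (true) {
                String command = queue.take();   // blocks until a command arrives
                if (STOP.equals(command)) {
                    return;
                }
                processed.add(command);          // stand-in for executing the command
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CommandQueueSketch commands = new CommandQueueSketch();
        commands.start();
        commands.enqueue("replicate block 1");
        commands.enqueue("invalidate block 2");
        System.out.println(commands.shutdown());
    }
}
```

The design benefit is that a slow command, such as deleting many blocks, cannot delay the next heartbeat.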
doRefreshNamenodes, step 4
Step 4 shuts down the nameservices that are no longer needed. Note how remove() ends up being called automatically:
// Step 4. Shut down old nameservices. This happens outside
// of the synchronized(this) lock since they need to call
// back to .remove() from another thread
if (!toRemove.isEmpty()) {
LOG.info("Stopping BPOfferServices for nameservices: " +
Joiner.on(",").useForNull("<default>").join(toRemove));
for (String nsToRemove : toRemove) {
BPOfferService bpos = bpByNameserviceId.get(nsToRemove);
bpos.stop();
bpos.join();
// they will call remove on their own
// The call chain is roughly:
// bpos.stop() -> actor.stop(); -> shouldServiceRun = false;
// bpos.join() -> actor.join(); -> bpThread.join();
// -> in BPServiceActor#run, shouldRun() returns false and the finally block runs BPServiceActor#cleanUp
// -> BPOfferService#shutdownActor -> DataNode#shutdownBlockPool -> BlockPoolManager#remove
}
}
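The stop/join sequence in step 4 relies on a common shutdown idiom: flip a volatile flag, interrupt the thread so it notices promptly, and let the thread's own finally block do the cleanup. A minimal sketch, with the hypothetical class `StoppableActorSketch` standing in for BPServiceActor:

```java
/** Worker that exits when its flag is cleared and cleans up after itself. */
final class StoppableActorSketch implements Runnable {
    private volatile boolean shouldRun = true;
    private volatile boolean cleanedUp = false;
    private final Thread thread = new Thread(this, "actor");

    void start() {
        thread.setDaemon(true);
        thread.start();
    }

    /** Request shutdown; the interrupt wakes the worker if it is sleeping. */
    void stop() {
        shouldRun = false;
        thread.interrupt();
    }

    /** Wait for the worker thread to finish. */
    void join() {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isCleanedUp() {
        return cleanedUp;
    }

    @Override
    public void run() {
        try {
            while (shouldRun) {
                try {
                    Thread.sleep(50);            // stand-in for one round of service work
                } catch (InterruptedException e) {
                    // Fall through: the loop condition re-checks shouldRun.
                }
            }
        } finally {
            cleanedUp = true;                    // stand-in for cleanUp() deregistering itself
        }
    }

    public static void main(String[] args) {
        StoppableActorSketch actor = new StoppableActorSketch();
        actor.start();
        actor.stop();
        actor.join();
        System.out.println("cleaned up: " + actor.isCleanedUp());
    }
}
```

This is why the caller in step 4 never calls remove() directly: by the time join() returns, the worker has already run its cleanup on its own thread.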
doRefreshNamenodes, step 5
// Step 5. Update nameservices whose NN list has changed
if (!toRefresh.isEmpty()) {
LOG.info("Refreshing list of NNs for nameservices: " +
Joiner.on(",").useForNull("<default>").join(toRefresh));
for (String nsToRefresh : toRefresh) {
BPOfferService bpos = bpByNameserviceId.get(nsToRefresh);
Map<String, InetSocketAddress> nnIdToAddr = addrMap.get(nsToRefresh);
Map<String, InetSocketAddress> nnIdToLifelineAddr =
lifelineAddrMap.get(nsToRefresh);
ArrayList<InetSocketAddress> addrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<InetSocketAddress> lifelineAddrs =
Lists.newArrayListWithCapacity(nnIdToAddr.size());
ArrayList<String> nnIds = Lists.newArrayListWithCapacity(
nnIdToAddr.size());
for (String nnId : nnIdToAddr.keySet()) {
addrs.add(nnIdToAddr.get(nnId));
lifelineAddrs.add(nnIdToLifelineAddr != null ?
nnIdToLifelineAddr.get(nnId) : null);
nnIds.add(nnId);
}
try {
UserGroupInformation.getLoginUser()
.doAs(new PrivilegedExceptionAction<Object>() {
@Override
public Object run() throws Exception {
bpos.refreshNNList(nsToRefresh, nnIds, addrs, lifelineAddrs);
return null;
}
});
} catch (InterruptedException ex) {
IOException ioe = new IOException();
ioe.initCause(ex.getCause());
throw ioe;
}
}
}
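Besides assembling the arguments, the key call here is bpos#refreshNNList, which updates the actor list by adding new NameNodes first and removing stale ones afterwards.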
void refreshNNList(String serviceId, List<String> nnIds,
ArrayList<InetSocketAddress> addrs,
ArrayList<InetSocketAddress> lifelineAddrs) throws IOException {
Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
for (BPServiceActor actor : bpServices) {
oldAddrs.add(actor.getNNSocketAddress());
}
Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
// Process added NNs
Set<InetSocketAddress> addedNNs = Sets.difference(newAddrs, oldAddrs);
for (InetSocketAddress addedNN : addedNNs) {
BPServiceActor actor = new BPServiceActor(serviceId,
nnIds.get(addrs.indexOf(addedNN)), addedNN,
lifelineAddrs.get(addrs.indexOf(addedNN)), this);
actor.start();
bpServices.add(actor);
}
// Process removed NNs
Set<InetSocketAddress> removedNNs = Sets.difference(oldAddrs, newAddrs);
for (InetSocketAddress removedNN : removedNNs) {
for (BPServiceActor actor : bpServices) {
if (actor.getNNSocketAddress().equals(removedNN)) {
actor.stop();
shutdownActor(actor);
break;
}
}
}
}
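That is the heartbeat source code in outline. Still to be covered are the details of the DataNode-NameNode exchange, and how the NameNode sends commands to a DataNode and how the DataNode executes them; I will add those when time allows.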