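Hadoop 3.2.1 [HDFS] Source Code Analysis: Standby Namenode
1. Introduction
An HA HDFS cluster runs two Namenode instances at the same time. One is the Active Namenode, which serves all client requests in real time. The other is the Standby Namenode, whose namespace is kept in sync with that of the Active Namenode, so that when the Active Namenode fails, the Standby Namenode can switch to the Active state immediately.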
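2. The checkpoint operation
To keep the Standby Namenode's namespace in sync with the Active Namenode's, both of them communicate with the JournalNode daemons.
Whenever the Active Namenode performs an operation that modifies the namespace, it must persist the resulting editlog to at least N-(N-1)/2 JournalNodes (integer division, i.e. a majority) before the modification is considered safe. In other words, if N JournalNode processes are started under the HA setup, the JournalNode cluster tolerates at most (N-1)/2 failed processes while still guaranteeing that the editlog is written completely. With 3 JournalNodes, at most 1 may fail; with 5 JournalNodes, at most 2 may fail. The Standby Namenode watches for changes to the editlog: it reads the editlog from the JournalNodes and applies it to its own namespace. When the Active Namenode fails, the Standby Namenode makes sure it has read all of the editlog from the JournalNodes and then switches to the Active state.
Reading the entire editlog guarantees that, before failover happens, the Standby Namenode's namespace is fully in sync with the Active Namenode's.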
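The quorum arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class `JournalQuorum` and its method names are made up for this article and are not part of Hadoop.

```java
/** Quorum arithmetic for a JournalNode ensemble (illustrative only, not Hadoop code). */
final class JournalQuorum {
  private JournalQuorum() {}

  /** Minimum number of JournalNodes that must acknowledge a write: a strict majority. */
  static int requiredAcks(int journalNodes) {
    if (journalNodes < 1) {
      throw new IllegalArgumentException("need at least one JournalNode");
    }
    // Same value as N - (N - 1) / 2 with integer division.
    return journalNodes / 2 + 1;
  }

  /** Maximum number of JournalNodes that may be down while writes still succeed. */
  static int toleratedFailures(int journalNodes) {
    return journalNodes - requiredAcks(journalNodes);
  }

  public static void main(String[] args) {
    for (int n : new int[] {3, 4, 5}) {
      System.out.printf("N=%d requiredAcks=%d toleratedFailures=%d%n",
          n, requiredAcks(n), toleratedFailures(n));
    }
  }
}
```

Note that going from 3 to 4 JournalNodes does not raise the number of tolerated failures, which is why odd ensemble sizes are the usual choice.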
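The Standby Namenode always holds an up-to-date version of the namespace, because it continuously merges the editlog it reads into its current namespace.
The Standby Namenode therefore only has to check whether either of the two checkpoint trigger conditions is met. If so, it writes its namespace to a new fsimage file and sends that fsimage file back to the Active Namenode over HTTP.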
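■ The Standby Namenode checks whether either of the two checkpoint trigger conditions is met.
■ The Standby Namenode saves its current namespace to a file named fsimage.ckpt_txid, where txid is the transaction id recorded in the most recent editlog. It then writes the MD5 checksum of the fsimage file and finally renames fsimage.ckpt_txid to fsimage_txid. While this runs, other Standby Namenode operations are blocked, for example a failover triggered by an Active Namenode failure, or access to the Standby Namenode's web UI. Operations on the Active Namenode, such as listing, reading, and writing files, are not affected.
■ The Standby Namenode sends an HTTP GET request, /getimage?putimage=1, to the Active Namenode's ImageServlet. The request URL carries the transaction id of the new fsimage file, plus the IP address and port from which the image can be downloaded from the Standby Namenode.
■ Using this information, the Active Namenode issues an HTTP GET request to the Standby Namenode's ImageServlet to download the fsimage file. The Namenode first names the downloaded file fsimage.ckpt_*, then creates the MD5 checksum, and finally renames fsimage.ckpt_* to fsimage_*. Note that this GET-then-download exchange is the older protocol; the 3.2.1 doCheckpoint() code later in this article instead calls TransferFsImage.uploadImageFromStorage, which uploads the image to the Active Namenode with an HTTP PUT.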
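The "write to a temporary name, record the checksum, then rename" sequence from the steps above can be sketched with plain JDK file APIs. This is a minimal sketch, not the Hadoop implementation: the class `CheckpointFileSketch`, the unpadded file names, and the single-line `.md5` file format are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative save sequence: fsimage.ckpt_<txid> -> checksum -> fsimage_<txid>. */
final class CheckpointFileSketch {
  private CheckpointFileSketch() {}

  /** Writes the image under a temporary name, stores its MD5, then renames it. */
  static Path saveImage(Path dir, long txid, byte[] image) {
    Path tmp = dir.resolve("fsimage.ckpt_" + txid);
    Path fin = dir.resolve("fsimage_" + txid);
    try {
      Files.write(tmp, image);
      // Hypothetical checksum file layout: one line holding the hex digest.
      Files.write(dir.resolve("fsimage_" + txid + ".md5"),
          (md5Hex(image) + "\n").getBytes(StandardCharsets.UTF_8));
      // The rename is the last step, so readers never see a half-written fsimage_<txid>.
      Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
      return fin;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the lowercase hex MD5 digest of the given bytes. */
  static String md5Hex(byte[] data) {
    try {
      StringBuilder sb = new StringBuilder();
      for (byte b : MessageDigest.getInstance("MD5").digest(data)) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Runs the sequence once in a fresh temporary directory. */
  static Path demo() {
    try {
      Path dir = Files.createTempDirectory("ckpt-sketch");
      return saveImage(dir, 42L, "abc".getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("saved " + demo());
  }
}
```

Renaming only after the data and checksum are on disk is what makes a crash in the middle of a checkpoint harmless: at worst a stray fsimage.ckpt_* file is left behind.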
3. FSNamesystem#startStandbyServices
When the Namenode starts, it loads FSNamesystem. In FSNamesystem, startStandbyServices creates and starts a StandbyCheckpointer, which handles the checkpoint operation.
/**
 * Start services required in standby or observer state
 *
 * @throws IOException
 */
void startStandbyServices(final Configuration conf, boolean isObserver)
    throws IOException {
  LOG.info("Starting services required for " +
      (isObserver ? "observer" : "standby") + " state");
  if (!getFSImage().editLog.isOpenForRead()) {
    // During startup, we're already open for read.
    getFSImage().editLog.initSharedJournalsForRead();
  }
  blockManager.setPostponeBlocksFromFuture(true);
  // Disable quota checks while in standby.
  dir.disableQuotaChecks();
  editLogTailer = new EditLogTailer(this, conf);
  editLogTailer.start();
  if (!isObserver && standbyShouldCheckpoint) {
    standbyCheckpointer = new StandbyCheckpointer(conf, this);
    standbyCheckpointer.start();
  }
}
4. CheckpointerThread
StandbyCheckpointer's start method launches the CheckpointerThread. The thread's run method in turn calls doWork.
private void doWork() {
  // Polling interval in milliseconds. getCheckPeriod() is driven by
  // dfs.namenode.checkpoint.check.period (default 60 seconds), not by the
  // one-hour checkpoint period.
  final long checkPeriod = 1000 * checkpointConf.getCheckPeriod();
  System.out.println("StandbyCheckpointer#doWork=>checkPeriod : "+ checkPeriod);
  // Reset checkpoint time so that we don't always checkpoint
  // on startup.
  lastCheckpointTime = monotonicNow();
  lastUploadTime = monotonicNow();
  while (shouldRun) {
    boolean needRollbackCheckpoint = namesystem.isNeedRollbackFsImage();
    if (!needRollbackCheckpoint) {
      try {
        Thread.sleep(checkPeriod);
      } catch (InterruptedException ie) {
      }
      if (!shouldRun) {
        break;
      }
    }
    try {
      // We may have lost our ticket since last checkpoint, log in again, just in case
      if (UserGroupInformation.isSecurityEnabled()) {
        UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
      }
      final long now = monotonicNow();
      // Difference between the last txid written to the JournalNodes and
      // the txid of the most recent checkpoint
      final long uncheckpointed = countUncheckpointedTxns();
      // Seconds elapsed since the last checkpoint
      final long secsSinceLast = (now - lastCheckpointTime) / 1000;
      // if we need a rollback checkpoint, always attempt to checkpoint
      boolean needCheckpoint = needRollbackCheckpoint;
      if (needCheckpoint) {
        LOG.info("Triggering a rollback fsimage for rolling upgrade.");
      } else if (uncheckpointed >= checkpointConf.getTxnCount()) {
        // First trigger condition:
        // the number of uncheckpointed transactions is greater than or equal
        // to dfs.namenode.checkpoint.txns (default: 1,000,000)
        LOG.info("Triggering checkpoint because there have been {} txns " +
            "since the last checkpoint, " +
            "which exceeds the configured threshold {}",
            uncheckpointed, checkpointConf.getTxnCount());
        needCheckpoint = true;
      } else if (secsSinceLast >= checkpointConf.getPeriod()) {
        LOG.info("Triggering checkpoint because it has been {} seconds " +
            "since the last checkpoint, which exceeds the configured " +
            "interval {}", secsSinceLast, checkpointConf.getPeriod());
        // Second trigger condition:
        // the time since the last checkpoint is greater than or equal to
        // dfs.namenode.checkpoint.period (default: one hour)
        needCheckpoint = true;
      }
      // A trigger condition is met: call doCheckpoint() to run the checkpoint
      if (needCheckpoint) {
        synchronized (cancelLock) {
          if (now < preventCheckpointsUntil) {
            LOG.info("But skipping this checkpoint since we are about to failover!");
            canceledCount++;
            continue;
          }
          assert canceler == null;
          canceler = new Canceler();
        }
        // on all nodes, we build the checkpoint. However, we only ship the checkpoint if have a
        // rollback request, are the checkpointer, are outside the quiet period.
        final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
        boolean sendRequest = isPrimaryCheckPointer || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
        doCheckpoint(sendRequest);
        // reset needRollbackCheckpoint to false only when we finish a ckpt
        // for rollback image
        if (needRollbackCheckpoint
            && namesystem.getFSImage().hasRollbackFSImage()) {
          namesystem.setCreatedRollbackImages(true);
          namesystem.setNeedRollbackFsImage(false);
        }
        lastCheckpointTime = now;
        LOG.info("Checkpoint finished successfully.");
      }
    } catch (SaveNamespaceCancelledException ce) {
      LOG.info("Checkpoint was cancelled: {}", ce.getMessage());
      canceledCount++;
    } catch (InterruptedException ie) {
      LOG.info("Interrupted during checkpointing", ie);
      // Probably requested shutdown.
      continue;
    } catch (Throwable t) {
      LOG.error("Exception in doCheckpoint", t);
    } finally {
      synchronized (cancelLock) {
        canceler = null;
      }
    }
  }
}
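Stripped of locking and logging, the decision logic in doWork() boils down to two small predicates. The sketch below restates them as pure functions so they can be tried in isolation. The class `CheckpointTrigger`, its method names, and the quiet-period value used in main are illustrative assumptions, not Hadoop APIs or documented defaults.

```java
/** Pure-function restatement of the checkpoint trigger checks (illustrative only). */
final class CheckpointTrigger {
  private CheckpointTrigger() {}

  enum Reason { ROLLBACK, TXN_COUNT, PERIOD, NONE }

  /** Mirrors the if / else-if chain in doWork(): the first matching condition wins. */
  static Reason decide(boolean needRollback, long uncheckpointedTxns, long txnThreshold,
      long secsSinceLast, long periodSecs) {
    if (needRollback) {
      return Reason.ROLLBACK;
    }
    if (uncheckpointedTxns >= txnThreshold) {
      return Reason.TXN_COUNT;
    }
    if (secsSinceLast >= periodSecs) {
      return Reason.PERIOD;
    }
    return Reason.NONE;
  }

  /** Mirrors the sendRequest expression: ship the image if primary or past the quiet period. */
  static boolean shouldUpload(boolean isPrimaryCheckpointer, long secsSinceLastUpload,
      long quietPeriodSecs) {
    return isPrimaryCheckpointer || secsSinceLastUpload >= quietPeriodSecs;
  }

  public static void main(String[] args) {
    long txns = 1_000_000L; // default of dfs.namenode.checkpoint.txns, per the comments above
    long period = 3600L;    // default of dfs.namenode.checkpoint.period, in seconds
    long quiet = 5400L;     // hypothetical quiet period, chosen only for this demo
    System.out.println(decide(false, 1_200_000L, txns, 10L, period));
    System.out.println(decide(false, 5L, txns, 4000L, period));
    System.out.println(decide(false, 5L, txns, 10L, period));
    System.out.println(shouldUpload(false, 100L, quiet));
  }
}
```

Because the checks are ordered, a pending rollback image always wins, and the time-based condition only matters when the transaction count is still below its threshold.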
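5. doCheckpoint(sendRequest)
All of the checkpoint logic lives in doCheckpoint(). The method first reads prevCheckpointTxId, the txid of the currently saved fsimage, and then thisCheckpointTxId, the txid of the most recently applied editlog. A checkpoint is only necessary when thisCheckpointTxId is greater than prevCheckpointTxId, that is, when the namespace has changed but the changes have not yet been saved to a new fsimage file. After this check, doCheckpoint() calls saveNamespace() to write the latest namespace to an fsimage file. It then hands the upload to a separate thread, which sends the new fsimage file to the Active Namenode over HTTP.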
/**
 * All of the checkpoint logic lives in doCheckpoint().
 *
 * The method first reads prevCheckpointTxId, the txid of the currently
 * saved fsimage, and then thisCheckpointTxId, the txid of the most
 * recently applied editlog. A checkpoint is only necessary when
 * thisCheckpointTxId is greater than prevCheckpointTxId, that is, when
 * the namespace has changed but the changes have not yet been saved to
 * a new fsimage file.
 *
 * After this check, doCheckpoint() calls saveNamespace() to write the
 * latest namespace to an fsimage file.
 *
 * It then hands the upload to a separate thread, which sends the new
 * fsimage file to the Active Namenode over HTTP.
 *
 * @param sendCheckpoint
 * @throws InterruptedException
 * @throws IOException
 */
private void doCheckpoint(boolean sendCheckpoint) throws InterruptedException, IOException {
  assert canceler != null;
  final long txid;
  final NameNodeFile imageType;
  // Acquire cpLock to make sure no one is modifying the name system.
  // It does not need the full namesystem write lock, since the only thing
  // that modifies namesystem on standby node is edit log replaying.
  namesystem.cpLockInterruptibly();
  try {
    assert namesystem.getEditLog().isOpenForRead() :
        "Standby Checkpointer should only attempt a checkpoint when " +
        "NN is in standby mode, but the edit logs are in an unexpected state";
    // Get the FSImage object currently held by the Standby Namenode
    FSImage img = namesystem.getFSImage();
    // txid stored in the fsimage, i.e. the txid saved by the previous checkpoint
    long prevCheckpointTxId = img.getStorage().getMostRecentCheckpointTxId();
    // Latest txid of the current namespace, i.e. the latest txid received from the editlog
    long thisCheckpointTxId = img.getCorrectLastAppliedOrWrittenTxId();
    assert thisCheckpointTxId >= prevCheckpointTxId;
    // If they are equal the fsimage is already up to date and no checkpoint is needed
    if (thisCheckpointTxId == prevCheckpointTxId) {
      LOG.info("A checkpoint was triggered but the Standby Node has not " +
          "received any transactions since the last checkpoint at txid {}. " +
          "Skipping...", thisCheckpointTxId);
      return;
    }
    if (namesystem.isRollingUpgrade()
        && !namesystem.getFSImage().hasRollbackFSImage()) {
      // The Namenode is in a rolling upgrade: create an fsimage_rollback file
      // if we will do rolling upgrade but have not created the rollback image
      // yet, name this checkpoint as fsimage_rollback
      imageType = NameNodeFile.IMAGE_ROLLBACK;
    } else {
      // Normal case: create a regular fsimage file
      imageType = NameNodeFile.IMAGE;
    }
    // Call saveNamespace() to save the current namespace to a new file
    img.saveNamespace(namesystem, imageType, canceler);
    txid = img.getStorage().getMostRecentCheckpointTxId();
    assert txid == thisCheckpointTxId : "expected to save checkpoint at txid=" +
        thisCheckpointTxId + " but instead saved at txid=" + txid;
    // Save the legacy OIV image, if the output dir is defined.
    String outputDir = checkpointConf.getLegacyOivImageDir();
    if (outputDir != null && !outputDir.isEmpty()) {
      try {
        img.saveLegacyOIVImage(namesystem, outputDir, canceler);
      } catch (IOException ioe) {
        LOG.warn("Exception encountered while saving legacy OIV image; "
            + "continuing with other checkpointing steps", ioe);
      }
    }
  } finally {
    namesystem.cpUnlock();
  }
  //early exit if we shouldn't actually send the checkpoint to the ANN
  if(!sendCheckpoint){
    return;
  }
  // Build an executor whose threads upload the fsimage to the Active Namenode over HTTP
  // Upload the saved checkpoint back to the active
  // Do this in a separate thread to avoid blocking transition to active, but don't allow more
  // than the expected number of tasks to run or queue up
  // See HDFS-4816
  ExecutorService executor = new ThreadPoolExecutor(0, activeNNAddresses.size(), 100,
      TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(activeNNAddresses.size()),
      uploadThreadFactory);
  // for right now, just match the upload to the nn address by convention. There is no need to
  // directly tie them together by adding a pair class.
  List<Future<TransferFsImage.TransferResult>> uploads =
      new ArrayList<Future<TransferFsImage.TransferResult>>();
  for (final URL activeNNAddress : activeNNAddresses) {
    Future<TransferFsImage.TransferResult> upload =
        executor.submit(new Callable<TransferFsImage.TransferResult>() {
          @Override
          public TransferFsImage.TransferResult call()
              throws IOException, InterruptedException {
            CheckpointFaultInjector.getInstance().duringUploadInProgess();
            return TransferFsImage.uploadImageFromStorage(activeNNAddress, conf, namesystem
                .getFSImage().getStorage(), imageType, txid, canceler);
          }
        });
    uploads.add(upload);
  }
  InterruptedException ie = null;
  IOException ioe= null;
  int i = 0;
  boolean success = false;
  for (; i < uploads.size(); i++) {
    Future<TransferFsImage.TransferResult> upload = uploads.get(i);
    try {
      // TODO should there be some smarts here about retries nodes that are not the active NN?
      if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
        success = true;
        //avoid getting the rest of the results - we don't care since we had a successful upload
        break;
      }
    } catch (ExecutionException e) {
      ioe = new IOException("Exception during image upload", e);
      break;
    } catch (InterruptedException e) {
      ie = e;
      break;
    }
  }
  if (ie == null && ioe == null) {
    //Update only when response from remote about success or
    lastUploadTime = monotonicNow();
    // we are primary if we successfully updated the ANN
    this.isPrimaryCheckPointer = success;
  }
  // cleaner than copying code for multiple catch statements and better than catching all
  // exceptions, so we just handle the ones we expect.
  if (ie != null || ioe != null) {
    // cancel the rest of the tasks, and close the pool
    for (; i < uploads.size(); i++) {
      Future<TransferFsImage.TransferResult> upload = uploads.get(i);
      // The background thread may be blocked waiting in the throttler, so
      // interrupt it.
      upload.cancel(true);
    }
    // shutdown so we interrupt anything running and don't start anything new
    executor.shutdownNow();
    // this is a good bit longer than the thread timeout, just to make sure all the threads
    // that are not doing any work also stop
    executor.awaitTermination(500, TimeUnit.MILLISECONDS);
    // re-throw the exception we got, since one of these two must be non-null
    if (ie != null) {
      throw ie;
    } else if (ioe != null) {
      throw ioe;
    }
  }
}
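The upload half of doCheckpoint() follows a "submit one task per target to a bounded pool, stop at the first success" pattern. The sketch below isolates that pattern using only java.util.concurrent. It is a simplification under stated assumptions: the class `FirstSuccessUpload`, its `Result` enum, and the `demo` helper are invented for illustration, checked exceptions are wrapped in unchecked ones, and unlike the Hadoop method it always shuts the pool down when it returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Bounded-pool upload that stops waiting at the first success (illustrative only). */
final class FirstSuccessUpload {
  private FirstSuccessUpload() {}

  enum Result { SUCCESS, NOT_ACTIVE }

  /** Runs every task on a bounded pool and reports whether any returned SUCCESS. */
  static boolean anyUploadSucceeded(List<Callable<Result>> tasks) {
    if (tasks.isEmpty()) {
      return false; // ThreadPoolExecutor rejects a maximum pool size of 0
    }
    // Same sizing idea as the source: at most one thread and one queue slot per target.
    ExecutorService pool = new ThreadPoolExecutor(0, tasks.size(), 100,
        TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(tasks.size()));
    List<Future<Result>> futures = new ArrayList<>();
    for (Callable<Result> task : tasks) {
      futures.add(pool.submit(task));
    }
    try {
      for (Future<Result> future : futures) {
        if (future.get() == Result.SUCCESS) {
          return true; // the remaining results no longer matter
        }
      }
      return false;
    } catch (ExecutionException e) {
      throw new IllegalStateException("Exception during upload", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for uploads", e);
    } finally {
      pool.shutdownNow(); // simplification: the sketch always tears the pool down
    }
  }

  /** Builds one task per canned outcome and runs them through the pool. */
  static boolean demo(Result... outcomes) {
    List<Callable<Result>> tasks = new ArrayList<>();
    for (Result outcome : outcomes) {
      tasks.add(() -> outcome);
    }
    return anyUploadSucceeded(tasks);
  }

  public static void main(String[] args) {
    System.out.println(demo(Result.NOT_ACTIVE, Result.SUCCESS));
    System.out.println(demo(Result.NOT_ACTIVE, Result.NOT_ACTIVE));
  }
}
```

In the real method the boolean outcome feeds isPrimaryCheckPointer, so a Standby that managed to deliver an image keeps uploading on later checkpoints without waiting for the quiet period.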