HDFS Source Code Analysis (2): The Metadata Backup Mechanism

Posted by weixin_34088583 on 2017-05-09

Preface

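In Hadoop, all metadata is stored on the NameNode. Every time the cluster restarts, Hadoop has to restore that data into memory from the persisted files, and afterwards the image and edit log files are periodically scanned and merged. Anyone with a passing knowledge of Hadoop knows this much: it is the job of the SecondaryNameNode. Many people, however, only know the surface of the mechanism and have never dug into how it is implemented. Would you have guessed that edit log writes go through a double buffer to allow more concurrent writers? That the authors perform several layers of checks on each write to keep operations consistent? Have you thought about how interrupted operations are handled? If you cannot answer these questions well, no matter; below I share what I learned.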


Classes Involved

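I recommend reading this article alongside the code; the Hadoop source can be downloaded freely, and following along will probably work better. The classes involved are listed below. There are a few more than in the previous analysis, and some of the main classes run to over a thousand lines.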

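1. FSImage: the namespace image class. All operations on the namespace image are contained in this class.

2. FSEditLog: the edit log class. It is responsible for writing the record of every operation.

3. EditLogInputStream and EditLogOutputStream: the edit log input and output classes. Much of the edit log's file reading and writing is built on these two classes. They extend the ordinary stream classes and expose a few methods specific to reading and writing the log.

4. Storage and StorageInfo: the storage information classes. StorageInfo is the parent class and Storage extends it. These two classes deal directly with the storage directories, and many of the directory operations in metadata backup go through them.

5. SecondaryNameNode: I did not originally intend to include this class, but to present the backup mechanism as a whole, this part of the code needs to be understood too.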

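OK, with the classes introduced, we can move on to the metadata backup mechanism itself. Before that, though, we need to understand how each object's operations are implemented, because there are many clever designs inside. To avoid getting lost in the methods that come later (there really are a lot of them), I will pick out a few representative ones.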


Namespace Image

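The namespace image is referred to here as FsImage. I first heard the word "image" in the context of restoring virtual machine images, which is a powerful feature. The image here is more modest in scale and is used only to restore the directory tree. The image storage path is controlled by the following configuration property: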

${dfs.name.dir}

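You can of course leave it unset, since it has a default value. Within the namespace image, FSImage plays the leading role: it manages the lifecycle of the storage space. Below are the class's basic field definitions: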

/**
 * FSImage handles checkpointing and logging of the namespace edits.
 * The fsImage (namespace image) class.
 */
public class FSImage extends Storage {
  //standard date format
  private static final SimpleDateFormat DATE_FORM =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

  //
  // The filenames used for storing the images
  // the file names that may appear in a namespace image directory
  //
  enum NameNodeFile {
    IMAGE     ("fsimage"),
    TIME      ("fstime"),
    EDITS     ("edits"),
    IMAGE_NEW ("fsimage.ckpt"),
    EDITS_NEW ("edits.new");
    
    private String fileName = null;
    private NameNodeFile(String name) {this.fileName = name;}
    String getName() {return fileName;}
  }

  // checkpoint states
  // the possible states of a checkpoint
  enum CheckpointStates{START, ROLLED_EDITS, UPLOAD_START, UPLOAD_DONE; }
  /**
   * Implementation of StorageDirType specific to namenode storage
   * A Storage directory could be of type IMAGE which stores only fsimage,
   * or of type EDITS which stores edits or of type IMAGE_AND_EDITS which 
   * stores both fsimage and edits.
   * The storage directory types of the NameNode.
   */
  static enum NameNodeDirType implements StorageDirType {
    //the NameNode defines the following four storage directory types
    UNDEFINED,
    IMAGE,
    EDITS,
    IMAGE_AND_EDITS;
    
    public StorageDirType getStorageDirType() {
      return this;
    }
    
    //checks whether this directory type matches the given type
    public boolean isOfType(StorageDirType type) {
      if ((this == IMAGE_AND_EDITS) && (type == IMAGE || type == EDITS))
        return true;
      return this == type;
    }
  }
  
  protected long checkpointTime = -1L;
  //holds the edit log object, which works together with the image class
  protected FSEditLog editLog = null;
  private boolean isUpgradeFinalized = false;
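The first thing you see is a handful of file names: edits, edits.new, fsimage.ckpt. These are all explained in detail in the metadata backup section later; for now, think of them as the names of temporarily stored files. Traversing and looking up the storage directories is done by the following class: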

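//storage directory iterator
  private class DirIterator implements Iterator<StorageDirectory> {
    //storage directory type to filter on
    StorageDirType dirType;
    //index of the previously returned element, used by remove()
    int prevIndex; // for remove()
    //index of the next element to return
    int nextIndex; // for next()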
    
    DirIterator(StorageDirType dirType) {
      this.dirType = dirType;
      this.nextIndex = 0;
      this.prevIndex = 0;
    }
    
    public boolean hasNext() {
      ....
    }
    
    public StorageDirectory next() {
      StorageDirectory sd = getStorageDir(nextIndex);
      prevIndex = nextIndex;
      nextIndex++;
      if (dirType != null) {
        while (nextIndex < storageDirs.size()) {
          if (getStorageDir(nextIndex).getStorageDirType().isOfType(dirType))
            break;
          nextIndex++;
        }
      }
      return sd;
    }
    
    public void remove() {
      ...
    }
  }
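
The type filter that drives this iterator can be sketched in isolation. The following is a minimal sketch, not the Hadoop code: `DirType` and `DirFilterSketch` are hypothetical stand-ins, and only the `isOfType` rule is copied from the excerpt above. It shows that a directory of type IMAGE_AND_EDITS is returned when iterating over either IMAGE or EDITS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FSImage.NameNodeDirType; only the isOfType rule
// is taken from the excerpt.
enum DirType {
  UNDEFINED, IMAGE, EDITS, IMAGE_AND_EDITS;

  boolean isOfType(DirType type) {
    if (this == IMAGE_AND_EDITS && (type == IMAGE || type == EDITS)) {
      return true;
    }
    return this == type;
  }
}

final class DirFilterSketch {
  // Returns the names of the directories whose type matches the filter.
  // A null filter means "all directories", mirroring the null check in next().
  static List<String> select(List<String> names, List<DirType> types, DirType filter) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < names.size(); i++) {
      if (filter == null || types.get(i).isOfType(filter)) {
        out.add(names.get(i));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("/data/1", "/data/2", "/data/3");
    List<DirType> types =
        Arrays.asList(DirType.IMAGE, DirType.EDITS, DirType.IMAGE_AND_EDITS);
    // Selecting EDITS picks the EDITS directory and the combined one.
    System.out.println(select(names, types, DirType.EDITS));
  }
}
```

Note that the check is asymmetric: a combined directory counts as IMAGE or EDITS, but a plain IMAGE directory does not count as IMAGE_AND_EDITS.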

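Depending on the directory type passed in, different directories are returned. These storage directories are the ones that hold the editlog and fsimage files, and they share some common information, as follows: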

/**
 * Common class for storage information.
 * The shared class for storage information.
 * TODO namespaceID should be long and computed as hash(address + port)
 * (the namespace ID should be of type long, computed by hashing IP address + port)
 */
public class StorageInfo {
  //layout version of the storage
  public int   layoutVersion;  // Version read from the stored file.
  //namespace ID
  public int   namespaceID;    // namespace id of the storage
  //creation time of the storage
  public long  cTime;          // creation timestamp
  
  public StorageInfo () {
  	//default constructor: all fields zero
    this(0, 0, 0L);
  }

Let's use a method that saves the image as the way in:

/**
   * Save the contents of the FS image to the file.
   * Saves the image file.
   */
  void saveFSImage(File newFile) throws IOException {
    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
    FSDirectory fsDir = fsNamesys.dir;
    long startTime = FSNamesystem.now();
    //
    // Write out data
    //
    DataOutputStream out = new DataOutputStream(
                                                new BufferedOutputStream(
                                                                         new FileOutputStream(newFile)));
    try {
      //write the layout version
      out.writeInt(FSConstants.LAYOUT_VERSION);
      //write the namespace ID
      out.writeInt(namespaceID);
      //write the total number of items in the directory tree
      out.writeLong(fsDir.rootDir.numItemsInTree());
      //write the generation stamp
      out.writeLong(fsNamesys.getGenerationStamp());
      byte[] byteStore = new byte[4*FSConstants.MAX_PATH_LENGTH];
      ByteBuffer strbuf = ByteBuffer.wrap(byteStore);
      // save the root
      saveINode2Image(strbuf, fsDir.rootDir, out);
      // save the rest of the nodes
      saveImage(strbuf, 0, fsDir.rootDir, out);
      fsNamesys.saveFilesUnderConstruction(out);
      fsNamesys.saveSecretManagerState(out);
      strbuf = null;
    } finally {
      out.close();
    }

    LOG.info("Image file of size " + newFile.length() + " saved in " 
        + (FSNamesystem.now() - startTime)/1000 + " seconds.");
  }
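From the lines above you can see that a complete image file header contains the layout version, the namespace ID, the number of items in the tree, and the generation stamp, followed by the per-file information. The method also saves the files under construction and the secret manager state. When saving file and directory information, saveINode2Image() first writes the root directory, and then saveImage() writes the children, because saveImage() itself calls saveINode2Image().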
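
The header layout described above can be sketched as a round trip through `DataOutputStream` and `DataInputStream`. This is a minimal sketch under stated assumptions: `ImageHeaderSketch` is a hypothetical class, and only the field order (int, int, long, long) is taken from saveFSImage(); the real image continues with inode records after these 24 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical model of the four header fields written by saveFSImage().
final class ImageHeaderSketch {
  final int layoutVersion;
  final int namespaceId;
  final long numItems;
  final long generationStamp;

  ImageHeaderSketch(int layoutVersion, int namespaceId, long numItems, long generationStamp) {
    this.layoutVersion = layoutVersion;
    this.namespaceId = namespaceId;
    this.numItems = numItems;
    this.generationStamp = generationStamp;
  }

  // Serializes the header in the same order as the excerpt: int, int, long, long.
  byte[] toBytes() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(layoutVersion);
      out.writeInt(namespaceId);
      out.writeLong(numItems);
      out.writeLong(generationStamp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads the fields back in the order they were written.
  static ImageHeaderSketch fromBytes(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      int nsId = in.readInt();
      long items = in.readLong();
      long genStamp = in.readLong();
      return new ImageHeaderSketch(version, nsId, items, genStamp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because the format carries no field names, the reader has to consume the fields in exactly the order the writer produced them; this is why the layout version comes first, so that a loader can decide how to interpret the rest.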

/*
   * Save one inode's attributes to the image.
   * Saves the attributes of a single inode into the image.
   */
  private static void saveINode2Image(ByteBuffer name,
                                      INode node,
                                      DataOutputStream out) throws IOException {
    int nameLen = name.position();
    out.writeShort(nameLen);
    out.write(name.array(), name.arrayOffset(), nameLen);
    if (!node.isDirectory()) {  // write file inode
      INodeFile fileINode = (INodeFile)node;
      //attributes written: replication factor, modification time, access time, preferred block size
      out.writeShort(fileINode.getReplication());
      out.writeLong(fileINode.getModificationTime());
      out.writeLong(fileINode.getAccessTime());
      out.writeLong(fileINode.getPreferredBlockSize());
      Block[] blocks = fileINode.getBlocks();
      out.writeInt(blocks.length);
      for (Block blk : blocks)
        //write the block information as well
        blk.write(out);
      FILE_PERM.fromShort(fileINode.getFsPermissionShort());
      PermissionStatus.write(out, fileINode.getUserName(),
                             fileINode.getGroupName(),
                             FILE_PERM);
    } else {   // write directory inode
      //for a directory, the namespace and disk space quotas are written as well
      out.writeShort(0);  // replication
      out.writeLong(node.getModificationTime());
      out.writeLong(0);   // access time
      out.writeLong(0);   // preferred block size
      out.writeInt(-1);    // # of blocks
      out.writeLong(node.getNsQuota());
      out.writeLong(node.getDsQuota());
      FILE_PERM.fromShort(node.getFsPermissionShort());
      PermissionStatus.write(out, node.getUserName(),
                             node.getGroupName(),
                             FILE_PERM);
    }
  }

This writes further information about files and directories. The method is also called from the recursive saveImage():

/**
   * Save file tree image starting from the given root.
   * This is a recursive procedure, which first saves all children of
   * a current directory and then moves inside the sub-directories.
   * Saves the image starting from the given node; each directory is traversed recursively.
   */
  private static void saveImage(ByteBuffer parentPrefix,
                                int prefixLength,
                                INodeDirectory current,
                                DataOutputStream out) throws IOException {
    int newPrefixLength = prefixLength;
    if (current.getChildrenRaw() == null)
      return;
    for(INode child : current.getChildren()) {
      // print all children first
      parentPrefix.position(prefixLength);
      parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes());
      saveINode2Image(parentPrefix, child, out);
      ....
    }
The method that writes a file under construction is static and is called from outside this class:

// Helper function that writes an INodeUnderConstruction
  // into the output stream
  // (writes the information of a file that is currently being written)
  //
  static void writeINodeUnderConstruction(DataOutputStream out,
                                           INodeFileUnderConstruction cons,
                                           String path) 
                                           throws IOException {
    writeString(path, out);
    out.writeShort(cons.getReplication());
    out.writeLong(cons.getModificationTime());
    out.writeLong(cons.getPreferredBlockSize());
    int nrBlocks = cons.getBlocks().length;
    out.writeInt(nrBlocks);
    for (int i = 0; i < nrBlocks; i++) {
      cons.getBlocks()[i].write(out);
    }
    cons.getPermissionStatus().write(out);
    writeString(cons.getClientName(), out);
    writeString(cons.getClientMachine(), out);

    out.writeInt(0); //  do not store locations of last block
  }
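While we are here, let's look at the format-related methods. Formatting is done before HDFS is first put into use, and it produces a new layout version and namespace ID. How is this implemented in code?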

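public void format() throws IOException {
    this.layoutVersion = FSConstants.LAYOUT_VERSION;
    // ... (remaining initialization, including the new namespace ID, elided in this excerpt)
    for (Iterator<StorageDirectory> it = dirIterator(); it.hasNext();) {
      StorageDirectory sd = it.next();
      //format each storage directory
      format(sd);
    }
  }
  
  /** Create new dfs name directory.  Caution: this destroys all files
   * in this filesystem.
   * The format operation; it creates the dfs/name directory.
   */
  void format(StorageDirectory sd) throws IOException {
    sd.clearDirectory(); // create current dir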
    sd.lock();
    try {
      saveCurrent(sd);
    } finally {
      sd.unlock();
    }
    LOG.info("Storage directory " + sd.getRoot()
             + " has been successfully formatted.");
  }
The operation is simple: clear the existing directory and create a new one.



Edit Log

Now for the other major class, the edit log (EditLog), which contains a number of notable designs. Open the class and the first thing you see is some twenty opcodes:

/**
 * FSEditLog maintains a log of the namespace modifications.
 * The edit log class records every modification made to the namespace.
 */
public class FSEditLog {
  //the opcodes
  private static final byte OP_INVALID = -1;
  // file operations
  private static final byte OP_ADD = 0;
  private static final byte OP_RENAME = 1;  // rename
  private static final byte OP_DELETE = 2;  // delete
  private static final byte OP_MKDIR = 3;   // create directory
  private static final byte OP_SET_REPLICATION = 4; // set replication
  //the following two are used only for backward compatibility :
  @Deprecated private static final byte OP_DATANODE_ADD = 5;
  @Deprecated private static final byte OP_DATANODE_REMOVE = 6;
  //the next two relate to permission settings
  private static final byte OP_SET_PERMISSIONS = 7;
  private static final byte OP_SET_OWNER = 8;
  private static final byte OP_CLOSE = 9;    // close after write
  private static final byte OP_SET_GENSTAMP = 10;    // store genstamp
  /* The following two are not used any more. Should be removed once
   * LAST_UPGRADABLE_LAYOUT_VERSION is -17 or newer. */
  //quota settings
  private static final byte OP_SET_NS_QUOTA = 11; // set namespace quota
  private static final byte OP_CLEAR_NS_QUOTA = 12; // clear namespace quota
  private static final byte OP_TIMES = 13; // sets mod & access time on a file
  private static final byte OP_SET_QUOTA = 14; // sets name and disk quotas.
  //delegation token authentication
  private static final byte OP_GET_DELEGATION_TOKEN = 18; //new delegation token
  private static final byte OP_RENEW_DELEGATION_TOKEN = 19; //renew delegation token
  private static final byte OP_CANCEL_DELEGATION_TOKEN = 20; //cancel delegation token
  private static final byte OP_UPDATE_MASTER_KEY = 21; //update master key
That is a lot of operations. Only after them come the basic field definitions:

//flush buffer size for the log: 512 KB
  private static int sizeFlushBuffer = 512*1024;
  
  //the edit log can have several output streams at the same time
  private ArrayList<EditLogOutputStream> editStreams = null;
  //holds a reference to the image class and interacts with it
  private FSImage fsimage = null;

  // a monotonically increasing counter that represents transactionIds.
  //ID of the most recently logged transaction
  private long txid = 0;

  // stores the last synced transactionId.
  //ID of the most recently synced transaction
  private long synctxid = 0;

  // the time of printing the statistics to the log file.
  private long lastPrintTime;

  // is a sync currently running?
  //whether a log sync is currently in progress
  private boolean isSyncRunning;

  // these are statistics counters.
  //transaction statistics
  //total number of transactions
  private long numTransactions;        // number of transactions
  //number of transactions that were batched into another thread's sync
  private long numTransactionsBatchedInSync;
  //total time spent on all transactions
  private long totalTimeTransactions;  // total time for all transactions
  private NameNodeInstrumentation metrics;

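Fields such as txid and synctxid appear frequently in the sync operations later on. To keep the transaction IDs of different threads from interfering with each other, the author uses a ThreadLocal so that each thread maintains its own transaction ID: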

//transaction ID holder class; wraps a txid value of type long
  private static class TransactionId {
    //the transaction ID
    public long txid;

    TransactionId(long value) {
      this.txid = value;
    }
  }

  // stores the most current transactionId of this thread.
  //ThreadLocal stores state that is private to each thread
  private static final ThreadLocal<TransactionId> myTransactionId = new ThreadLocal<TransactionId>() {
    protected synchronized TransactionId initialValue() {
      return new TransactionId(Long.MAX_VALUE);
    }
  };
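
The effect of the thread-local transaction ID can be seen in a small standalone sketch. `TxIdSketch` is a hypothetical class, not the Hadoop one; it assumes only what the excerpt shows: a shared counter plus a per-thread record of the last ID that thread produced, initialized to `Long.MAX_VALUE`.

```java
// Hypothetical, simplified model of txid plus the per-thread myTransactionId.
final class TxIdSketch {
  private long txid = 0; // shared counter, guarded by "this"

  // Each thread remembers the ID of the last edit it logged itself.
  private final ThreadLocal<Long> myTxid = ThreadLocal.withInitial(() -> Long.MAX_VALUE);

  synchronized void logEdit() {
    txid++;
    myTxid.set(txid); // visible only to the calling thread
  }

  long lastTxidOfCallingThread() {
    return myTxid.get();
  }

  // Logs one edit on the calling thread and two on another thread, then
  // returns what each thread sees as "its" last transaction ID.
  static long[] demo() {
    TxIdSketch log = new TxIdSketch();
    long[] seen = new long[2];
    log.logEdit(); // calling thread produces txid 1
    Thread other = new Thread(() -> {
      log.logEdit(); // txid 2
      log.logEdit(); // txid 3
      seen[1] = log.lastTxidOfCallingThread();
    });
    other.start();
    try {
      other.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    seen[0] = log.lastTxidOfCallingThread();
    return seen;
  }
}
```

The calling thread still sees 1 after the other thread has advanced the shared counter to 3, which is what later lets each thread ask "has my own last edit been synced yet?" without knowing about anyone else's edits.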
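In the edit log, all file operations go through dedicated EditLog input and output streams, which are defined as parent classes. Take EditLogOutputStream as an example (I have simplified the code a little):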

/**
 * A generic abstract class to support journaling of edits logs into 
 * a persistent storage.
 */
abstract class EditLogOutputStream extends OutputStream {
  // these are statistics counters
  //two statistics counters follow
  //number of syncs, i.e. the number of times the buffer has been flushed
  private long numSync;        // number of sync(s) to disk
  //accumulated time spent syncing
  private long totalTimeSync;  // total time to sync

  EditLogOutputStream() throws IOException {
    numSync = totalTimeSync = 0;
  }
  
  abstract String getName();
  abstract public void write(int b) throws IOException;
  abstract void write(byte op, Writable ... writables) throws IOException;
  abstract void create() throws IOException;
  abstract public void close() throws IOException;
  abstract void setReadyToFlush() throws IOException;
  abstract protected void flushAndSync() throws IOException;

  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   * The flush method.
   */
  public void flush() throws IOException {
    //increment the sync count
    numSync++;
    long start = FSNamesystem.now();
    //flushAndSync() is abstract; the concrete subclass implements it
    flushAndSync();
    long end = FSNamesystem.now();
    //accumulate the elapsed time as well
    totalTimeSync += (end - start);
  }

  abstract long length() throws IOException;
  
  long getTotalSyncTime() {
    return totalTimeSync;
  }
  
  long getNumSync() {
    return numSync;
  }
}
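The authors built some sync-related design in here, including a few statistics counters.

The input stream is similar, so I will not go into it. EditLog does not use this class directly, however; inside FSEditLog there is a richer subclass, EditLogFileOutputStream: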

  /**
   * An implementation of the abstract class {@link EditLogOutputStream},
   * which stores edits in a local file.
   * All writes to the log file go through this output stream.
   */
  static private class EditLogFileOutputStream extends EditLogOutputStream {
    private File file;
    //holds a file output stream
    private FileOutputStream fp;    // file stream for storing edit logs 
    private FileChannel fc;         // channel of the file stream for sync
    //double-buffer design to increase concurrency: bufCurrent receives incoming writes
    private DataOutputBuffer bufCurrent;  // current buffer for writing
    //bufReady holds the data being flushed to the file
    private DataOutputBuffer bufReady;    // buffer ready for flushing
    static ByteBuffer fill = ByteBuffer.allocateDirect(512); // preallocation
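Note the double-buffer design here; double buffering is used in many other well-designed systems too. Now let's start from how the edit log opens its files for writing: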

/**
   * Create empty edit log files.
   * Initialize the output stream for logging.
   * 
   * @throws IOException
   */
  public synchronized void open() throws IOException {
    //all counters are reset to 0 when the files are opened
    numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;
    if (editStreams == null) {
      editStreams = new ArrayList<EditLogOutputStream>();
    }
    //get an iterator over the directories of the given type
    Iterator<StorageDirectory> it = fsimage.dirIterator(NameNodeDirType.EDITS); 
    while (it.hasNext()) {
      StorageDirectory sd = it.next();
      File eFile = getEditFile(sd);
      try {
        //open the file in the storage directory and obtain an output stream
        EditLogOutputStream eStream = new EditLogFileOutputStream(eFile);
        editStreams.add(eStream);
      } catch (IOException ioe) {
        fsimage.updateRemovedDirs(sd, ioe);
        it.remove();
      }
    }
    exitIfNoStreams();
  }
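This adds a new output stream to the editStreams field for each edits directory.

So what does a standard write look like? Take the close method as an example, because closing triggers one final write of the remaining data: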

  /**
   * Shutdown the file store.
   * The close operation.
   */
  public synchronized void close() throws IOException {
    while (isSyncRunning) {
      //if a sync is in progress, wait for 1 s
      try {
        wait(1000);
      } catch (InterruptedException ie) { 
      }
    }
    if (editStreams == null) {
      return;
    }
    printStatistics(true);
    //reset the counters when the files are closed
    numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;

    for (int idx = 0; idx < editStreams.size(); idx++) {
      EditLogOutputStream eStream = editStreams.get(idx);
      try {
        //on close, flush whatever data remains in the buffer
        eStream.setReadyToFlush();
        eStream.flush();
        eStream.close();
      } catch (IOException ioe) {
        removeEditsAndStorageDir(idx);
        idx--;
      }
    }
    editStreams.clear();
  }
The key part is two lines of code. setReadyToFlush() swaps the buffers:

/**
     * All data that has been written to the stream so far will be flushed.
     * New data can be still written to the stream while flushing is performed.
     */
    @Override
    void setReadyToFlush() throws IOException {
      assert bufReady.size() == 0 : "previous data is not flushed yet";
      write(OP_INVALID);           // insert end-of-file marker
      //swap the two buffers
      DataOutputBuffer tmp = bufReady;
      bufReady = bufCurrent;
      bufCurrent = tmp;
    }
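bufCurrent buffers data coming in from external writers, while bufReady holds the data about to be written to the file. The real work is done by flush(), a method of the parent class: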

  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   * The flush method.
   */
  public void flush() throws IOException {
    //increment the sync count
    numSync++;
    long start = FSNamesystem.now();
    //flushAndSync() is abstract; the concrete subclass implements it
    flushAndSync();
    long end = FSNamesystem.now();
    //accumulate the elapsed time as well
    totalTimeSync += (end - start);
  }
which in turn calls the sync method:

/**
     * Flush ready buffer to persistent store.
     * currentBuffer is not flushed as it accumulates new log records
     * while readyBuffer will be flushed and synced.
     */
    @Override
    protected void flushAndSync() throws IOException {
      preallocate();            // preallocate file if necessary
      //write the data in the ready buffer to the file
      bufReady.writeTo(fp);     // write data to file
      bufReady.reset();         // erase all data in the buffer
      fc.force(false);          // metadata updates not needed because of preallocation
      //step back over the end-of-file marker (OP_INVALID), which is written on every flush, so the next write overwrites it
      fc.position(fc.position()-1); // skip back the end-of-file marker
    }
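
The interplay of setReadyToFlush() and flushAndSync() can be reduced to a small standalone model. This is a minimal sketch, not the Hadoop class: `DoubleBufferSketch` is hypothetical, uses `ByteArrayOutputStream` in place of `DataOutputBuffer`, and writes into an in-memory sink instead of a file. It assumes a single flusher at a time.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of the two-buffer scheme.
final class DoubleBufferSketch {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream(); // receives new writes
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();   // waiting to be flushed
  private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the file

  // Writers only ever touch bufCurrent.
  synchronized void write(String record) {
    byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
    bufCurrent.write(bytes, 0, bytes.length);
  }

  // Swap the buffers so that everything written so far becomes "ready".
  synchronized void setReadyToFlush() {
    if (bufReady.size() != 0) {
      throw new IllegalStateException("previous data is not flushed yet");
    }
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

  // Drain the ready buffer into the sink. In the real class this is the slow
  // part (disk write plus force), which is why it avoids blocking writers.
  void flush() {
    byte[] data = bufReady.toByteArray();
    sink.write(data, 0, data.length);
    bufReady.reset();
  }

  String flushed() {
    return new String(sink.toByteArray(), StandardCharsets.UTF_8);
  }

  synchronized int pendingBytes() {
    return bufCurrent.size();
  }
}
```

The swap itself is only three reference assignments, so the lock is held briefly; a record written between the swap and the flush lands in the new bufCurrent and is picked up by the next swap.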
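You may think that, for a simple file write, this is a rather elaborate design. Now recall the opcodes at the top of the class, each standing for a different operation. How are they used? The obvious approach is for the caller to pass in the opcode and its arguments, and for the log to dispatch on it. EditLog takes exactly this approach.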

/** 
   * Add set replication record to edit log
   */
  void logSetReplication(String src, short replication) {
    logEdit(OP_SET_REPLICATION, 
            new UTF8(src), 
            FSEditLog.toLogReplication(replication));
  }
  
  /** Add set namespace quota record to edit log
   * 
   * @param src the string representation of the path to a directory
   * @param quota the directory size limit
   */
  void logSetQuota(String src, long nsQuota, long dsQuota) {
    logEdit(OP_SET_QUOTA, new UTF8(src), 
            new LongWritable(nsQuota), new LongWritable(dsQuota));
  }

  /**  Add set permissions record to edit log */
  void logSetPermissions(String src, FsPermission permissions) {
    logEdit(OP_SET_PERMISSIONS, new UTF8(src), permissions);
  }
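There are many more methods in the logSet* family. Each passes an opcode, the target path, and any extra arguments to the lower-level logEdit method, which is what finally writes the operation record.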

/**
   * Write an operation to the edit log. Do not sync to persistent
   * store yet.
   * Writes one operation into the edit log.
   */
  synchronized void logEdit(byte op, Writable ... writables) {
    if (getNumEditStreams() < 1) {
      throw new AssertionError("No edit streams to log to");
    }
    long start = FSNamesystem.now();
    for (int idx = 0; idx < editStreams.size(); idx++) {
      EditLogOutputStream eStream = editStreams.get(idx);
      try {
        // write the operation to every output stream
        eStream.write(op, writables);
      } catch (IOException ioe) {
        removeEditsAndStorageDir(idx);
        idx--; 
      }
    }
    exitIfNoStreams();
    // get a new transactionId
    //obtain a new transaction ID
    txid++;

    //
    // record the transactionId when new data was written to the edits log
    //
    TransactionId id = myTransactionId.get();
    id.txid = txid;

    // update statistics
    long end = FSNamesystem.now();
    //every logEdit call adds to the transaction count and the accumulated time
    numTransactions++;
    totalTimeTransactions += (end-start);
    if (metrics != null) // Metrics is non-null only when used inside name node
      metrics.addTransaction(end-start);
  }
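Each new operation gets a new transaction ID here, and the time spent writing into the buffer is recorded. At this point, though, the record has only been written to the output stream's buffer, not to the file. The reason is that multiple threads may be logging concurrently; the actual flush is done by logSync():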

//
  // Sync all modifications done by this thread.
  //
  public void logSync() throws IOException {
    ArrayList<EditLogOutputStream> errorStreams = null;
    long syncStart = 0;

    // Fetch the transactionId of this thread. 
    long mytxid = myTransactionId.get().txid;

    ArrayList<EditLogOutputStream> streams = new ArrayList<EditLogOutputStream>();
    boolean sync = false;
    try {
      synchronized (this) {
        printStatistics(false);

        // if somebody is already syncing, then wait
        while (mytxid > synctxid && isSyncRunning) {
          try {
            wait(1000);
          } catch (InterruptedException ie) { 
          }
        }

        //
        // If this transaction was already flushed, then nothing to do
        //
        if (mytxid <= synctxid) {
          //the transaction was already covered by an earlier sync, so just count it as batched
          numTransactionsBatchedInSync++;
          if (metrics != null) // Metrics is non-null only when used inside name node
            metrics.incrTransactionsBatchedInSync();
          return;
        }

        // now, this thread will do the sync
        syncStart = txid;
        isSyncRunning = true;
        sync = true;

        // swap buffers
        exitIfNoStreams();
        for(EditLogOutputStream eStream : editStreams) {
          try {
          	//swap the buffers
            eStream.setReadyToFlush();
            streams.add(eStream);
          } catch (IOException ie) {
            FSNamesystem.LOG.error("Unable to get ready to flush.", ie);
            //
            // remember the streams that encountered an error.
            //
            if (errorStreams == null) {
              errorStreams = new ArrayList<EditLogOutputStream>(1);
            }
            errorStreams.add(eStream);
          }
        }
      }

      // do the sync
      long start = FSNamesystem.now();
      for (EditLogOutputStream eStream : streams) {
        try {
          //the buffers are swapped; now flush the data to disk, outside the lock
          eStream.flush();
       ....
  }
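
The batching logic in logSync() can be modeled without any I/O. This is a minimal sketch, not the Hadoop implementation: `GroupCommitSketch` is a hypothetical class, and because the tail of logSync() is elided in the excerpt, the final block (recording `synctxid`, clearing `isSyncRunning`, waking waiters) is an assumption about how such a scheme is usually completed.

```java
// Hypothetical model of the txid / synctxid / isSyncRunning protocol.
final class GroupCommitSketch {
  private long txid = 0;                 // last transaction logged by anyone
  private long synctxid = 0;             // last transaction known to be on disk
  private boolean isSyncRunning = false; // is some thread flushing right now?
  private int physicalSyncs = 0;         // how many real flushes happened
  private final ThreadLocal<Long> myTxid = ThreadLocal.withInitial(() -> Long.MAX_VALUE);

  synchronized void logEdit() {
    txid++;
    myTxid.set(txid);
  }

  void logSync() {
    long mytxid = myTxid.get();
    long syncStart;
    synchronized (this) {
      // Wait while another thread's flush is running and may cover our edit.
      while (mytxid > synctxid && isSyncRunning) {
        try {
          wait(1000);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      // Someone else's flush already covered this edit: nothing to do.
      if (mytxid <= synctxid) {
        return;
      }
      // This thread becomes the flusher for everything logged so far.
      syncStart = txid;
      isSyncRunning = true;
      // (the buffer swap would happen here, still under the lock)
    }

    // (the slow disk flush would happen here, outside the lock)

    synchronized (this) {
      // Assumed completion step; this part is elided in the excerpt.
      physicalSyncs++;
      synctxid = syncStart;
      isSyncRunning = false;
      notifyAll();
    }
  }

  synchronized int physicalSyncs() { return physicalSyncs; }
  synchronized long syncedUpTo() { return synctxid; }
}
```

Recording `txid` rather than `mytxid` as the sync point is what makes this a group commit: one flush marks every edit logged before the swap as durable, so threads that arrive later with a smaller ID return immediately.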
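OK, that finally sorts out the whole write path. Once the log has been written, how does the edit log class read an edit log file back and restore the in-memory metadata? The whole process is essentially one of decoding: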

/**
   * Load an edit log, and apply the changes to the in-memory structure
   * This is where we apply edits that we've been writing to disk all
   * along.
   * Loads an edit log file and rebuilds the corresponding state in memory.
   */
  static int loadFSEdits(EditLogInputStream edits) throws IOException {
    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
    //FSDirectory acts as a facade: operations are delegated to the subsystems it contains
    FSDirectory fsDir = fsNamesys.dir;
    int numEdits = 0;
    int logVersion = 0;
    String clientName = null;
    String clientMachine = null;
    String path = null;
    int numOpAdd = 0, numOpClose = 0, numOpDelete = 0,
        numOpRename = 0, numOpSetRepl = 0, numOpMkDir = 0,
        numOpSetPerm = 0, numOpSetOwner = 0, numOpSetGenStamp = 0,
        numOpTimes = 0, numOpGetDelegationToken = 0,
        numOpRenewDelegationToken = 0, numOpCancelDelegationToken = 0,
        numOpUpdateMasterKey = 0, numOpOther = 0;

    long startTime = FSNamesystem.now();

    DataInputStream in = new DataInputStream(new BufferedInputStream(edits));
    try {
      // Read log file version. Could be missing. 
      in.mark(4);
      // If edits log is greater than 2G, available method will return negative
      // numbers, so we avoid having to call available
      boolean available = true;
      try {
        // first try to read the log version (one byte, as an availability probe)
        logVersion = in.readByte();
      } catch (EOFException e) {
        available = false;
      }
      if (available) {
        in.reset();
        logVersion = in.readInt();
        if (logVersion < FSConstants.LAYOUT_VERSION) // future version
          throw new IOException(
                          "Unexpected version of the file system log file: "
                          + logVersion + ". Current version = " 
                          + FSConstants.LAYOUT_VERSION + ".");
      }
      assert logVersion <= Storage.LAST_UPGRADABLE_LAYOUT_VERSION :
                            "Unsupported version " + logVersion;

      while (true) {
        ....
        
        //apply the change according to the opcode
        switch (opcode) {
        case OP_ADD:
        case OP_CLOSE: {
          ...
          break;
        } 
        case OP_SET_REPLICATION: {
          numOpSetRepl++;
          path = FSImage.readString(in);
          short replication = adjustReplication(readShort(in));
          fsDir.unprotectedSetReplication(path, replication, null);
          break;
        } 
        case OP_RENAME: {
          numOpRename++;
          int length = in.readInt();
          if (length != 3) {
            throw new IOException("Incorrect data format. " 
                                  + "Mkdir operation.");
          }
          String s = FSImage.readString(in);
          String d = FSImage.readString(in);
          timestamp = readLong(in);
          HdfsFileStatus dinfo = fsDir.getFileInfo(d);
          fsDir.unprotectedRenameTo(s, d, timestamp);
          fsNamesys.changeLease(s, d, dinfo);
          break;
        }
        ...

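The full function is very long; it is enough to grasp the idea.

Many of the operations are implemented in FSDirectory. You can think of that class as a facade: all the related subsystems are contained in it.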
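
The write-then-replay idea (an opcode byte followed by its arguments, terminated by OP_INVALID) can be shown end to end in a few lines. This is a minimal sketch: `EditRecordSketch` is hypothetical, it uses `writeUTF`/`readUTF` rather than Hadoop's own string encoding, and it only "applies" an operation by describing it in a list. The three opcode values are the ones listed earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical encoder/decoder for two kinds of edit records.
final class EditRecordSketch {
  static final byte OP_INVALID = -1; // end-of-log marker
  static final byte OP_MKDIR = 3;
  static final byte OP_SET_REPLICATION = 4;

  // Encodes "mkdir <dir>" followed by "set replication <file> <n>".
  static byte[] encode(String dir, String file, short replication) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeByte(OP_MKDIR);
      out.writeUTF(dir);
      out.writeByte(OP_SET_REPLICATION);
      out.writeUTF(file);
      out.writeShort(replication);
      out.writeByte(OP_INVALID);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Replays the log: read an opcode, then decode the arguments it implies.
  static List<String> replay(byte[] log) {
    List<String> applied = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
      while (true) {
        byte op = in.readByte();
        if (op == OP_INVALID) {
          break; // reached the end-of-log marker
        }
        switch (op) {
          case OP_MKDIR:
            applied.add("mkdir " + in.readUTF());
            break;
          case OP_SET_REPLICATION: {
            String path = in.readUTF();
            short replication = in.readShort();
            applied.add("setrep " + path + " " + replication);
            break;
          }
          default:
            throw new IOException("Unknown opcode " + op);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return applied;
  }
}
```

Since each opcode fixes the number and type of the arguments that follow, the reader must have a case for every opcode the writer can emit; an unknown opcode leaves the stream position undefined, so the sketch fails fast instead of guessing.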


The NameNode Metadata Backup Mechanism

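With the methods above as groundwork, the metadata backup mechanism can be implemented flexibly: it comes down to calling those basic methods to copy and rename the various files. Overall, the following file state changes take place: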

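1. The original current image directory --> lastcheckpoint.tmp

2. After the SecondaryNameNode uploads the new image file, fsimage.ckpt --> fsimage, and a new current directory is created

3. lastcheckpoint.tmp becomes previous.checkpoint

4. The log file edits.new --> edits

Those are roughly the four steps.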
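
The four state changes listed above can be acted out on a scratch directory with `java.nio.file`. This is only an illustration of the renames in the order the list gives them: `CheckpointRenameSketch` is a hypothetical class, the file contents are dummy bytes, and the real NameNode code may interleave or order these steps differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical walk-through of the directory and file renames.
final class CheckpointRenameSketch {
  // Performs the four steps inside a fresh temporary directory and returns it.
  static Path run() {
    try {
      Path root = Files.createTempDirectory("name-dir-sketch");
      Path current = Files.createDirectory(root.resolve("current"));
      Files.write(current.resolve("fsimage"), new byte[] {1}); // the old image

      // Step 1: current --> lastcheckpoint.tmp
      Path tmp = Files.move(current, root.resolve("lastcheckpoint.tmp"));

      // Step 2: new current directory; uploaded fsimage.ckpt --> fsimage
      current = Files.createDirectory(root.resolve("current"));
      Path ckpt = Files.write(current.resolve("fsimage.ckpt"), new byte[] {2});
      Files.move(ckpt, current.resolve("fsimage"));

      // Step 3: lastcheckpoint.tmp --> previous.checkpoint
      Files.move(tmp, root.resolve("previous.checkpoint"));

      // Step 4: edits.new --> edits
      Path editsNew = Files.write(current.resolve("edits.new"), new byte[0]);
      Files.move(editsNew, current.resolve("edits"));

      return root;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Going through a `.tmp` name before the final name means that a crash at any point leaves a directory whose name says which stage was reached, which is presumably what makes recovery after an interrupted checkpoint possible.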

Image backup always starts from the SecondaryNameNode's periodic checkpoint check:

//
  // The main work loop
  //
  public void doWork() {
    long period = 5 * 60;              // 5 minutes
    long lastCheckpointTime = 0;
    if (checkpointPeriod < period) {
      period = checkpointPeriod;
    }
    
    //the main loop
    while (shouldRun) {
      try {
        Thread.sleep(1000 * period);
      } catch (InterruptedException ie) {
        // do nothing
      }
      if (!shouldRun) {
        break;
      }
      try {
        // We may have lost our ticket since last checkpoint, log in again, just in case
        if(UserGroupInformation.isSecurityEnabled())
          UserGroupInformation.getCurrentUser().reloginFromKeytab();
        
        long now = System.currentTimeMillis();

        long size = namenode.getEditLogSize();
        if (size >= checkpointSize || 
            now >= lastCheckpointTime + 1000 * checkpointPeriod) {
          //run the checkpoint when the size or time threshold is reached
          doCheckpoint();
          ...
    }
  }
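
The trigger condition in doWork() is easy to isolate. The sketch below restates it as two pure functions; `CheckpointTriggerSketch` and its parameter names are hypothetical, while the arithmetic (a five-minute cap on the polling period, a size threshold, and a period given in seconds) follows the excerpt.

```java
// Hypothetical restatement of the polling and trigger logic in doWork().
final class CheckpointTriggerSketch {
  // The loop sleeps for min(5 minutes, checkpointPeriod), in seconds.
  static long pollPeriodSeconds(long checkpointPeriodSeconds) {
    return Math.min(5 * 60, checkpointPeriodSeconds);
  }

  // A checkpoint is due when the edit log is large enough or enough time has passed.
  static boolean shouldCheckpoint(long editLogSize, long checkpointSize,
                                  long nowMillis, long lastCheckpointMillis,
                                  long checkpointPeriodSeconds) {
    return editLogSize >= checkpointSize
        || nowMillis >= lastCheckpointMillis + 1000 * checkpointPeriodSeconds;
  }
}
```

Because `lastCheckpointTime` starts at 0 in the excerpt, the time condition holds on the first pass through the loop, so the first checkpoint happens after a single sleep regardless of the edit log size.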
Next, let's look at the doCheckpoint() method:

/**
   * Create a new checkpoint
   */
  void doCheckpoint() throws IOException {

    // Do the required initialization of the merge work area.
    //initialize the work area for the image merge
    startCheckpoint();

    // Tell the namenode to start logging transactions in a new edit file
    // Retuns a token that would be used to upload the merged image.
    CheckpointSignature sig = (CheckpointSignature)namenode.rollEditLog();

    // error simulation code for junit test
    if (ErrorSimulator.getErrorSimulation(0)) {
      throw new IOException("Simulating error0 " +
                            "after creating edits.new");
    }

    //download the current image and edit log from the NameNode
    downloadCheckpointFiles(sig);   // Fetch fsimage and edits
    //merge the edit log into the image
    doMerge(sig);                   // Do the merge
  
    //
    // Upload the new image into the NameNode. Then tell the Namenode
    // to make this new uploaded image as the most current image.
    //upload the merged image back to the NameNode
    putFSImage(sig);

    // error simulation code for junit test
    if (ErrorSimulator.getErrorSimulation(1)) {
      throw new IOException("Simulating error1 " +
                            "after uploading new image to NameNode");
    }
    
    //ask the NameNode to swap in the new files: rename edits.new to edits and fsimage.ckpt to fsimage
    namenode.rollFsImage();
    checkpointImage.endCheckpoint();

    LOG.info("Checkpoint done. New Image Size: "
              + checkpointImage.getFsImageName().length());
  }
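This method lays out the backup mechanism very clearly.

Let's now look at the file replacement method, namenode.rollFsImage(), which eventually calls the corresponding rollFSImage() method in FSImage: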

/**
   * Moves fsimage.ckpt to fsImage and edits.new to edits
   * Reopens the new edits file.
   * Completes the renaming of the two files.
   */
  void rollFSImage() throws IOException {
    if (ckptState != CheckpointStates.UPLOAD_DONE) {
      throw new IOException("Cannot roll fsImage before rolling edits log.");
    }
    //
    // First, verify that edits.new and fsimage.ckpt exists in all
    // checkpoint directories.
    //
    if (!editLog.existsNew()) {
      throw new IOException("New Edits file does not exist");
    }
    Iterator<StorageDirectory> it = dirIterator(NameNodeDirType.IMAGE);
    while (it.hasNext()) {
      StorageDirectory sd = it.next();
      File ckpt = getImageFile(sd, NameNodeFile.IMAGE_NEW);
      if (!ckpt.exists()) {
        throw new IOException("Checkpoint file " + ckpt +
                              " does not exist");
      }
    }
    editLog.purgeEditLog(); // renamed edits.new to edits
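The first half of the method makes the intent clear: it verifies that edits.new and fsimage.ckpt exist and then renames edits.new to edits. The image rename follows: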

//
    // Renames new image
    // rename the new image
    //
    it = dirIterator(NameNodeDirType.IMAGE);
    while (it.hasNext()) {
      StorageDirectory sd = it.next();
      File ckpt = getImageFile(sd, NameNodeFile.IMAGE_NEW);
      File curFile = getImageFile(sd, NameNodeFile.IMAGE);
      // renameTo fails on Windows if the destination file 
      // already exists.
      if (!ckpt.renameTo(curFile)) {
        curFile.delete();
        if (!ckpt.renameTo(curFile)) {
          editLog.removeEditsForStorageDir(sd);
          updateRemovedDirs(sd);
          it.remove();
        }
      }
    }
    editLog.exitIfNoStreams();
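The middle part of the code renames the new image fsimage.ckpt to the current name fsimage. Finally, stale files are deleted from the directories that should not hold them: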

//
    // Updates the fstime file on all directories (fsimage and edits)
    // and write version file
    //
    this.layoutVersion = FSConstants.LAYOUT_VERSION;
    this.checkpointTime = FSNamesystem.now();
    it = dirIterator();
    while (it.hasNext()) {
      StorageDirectory sd = it.next();
      // delete old edits if sd is the image only the directory
      if (!sd.getStorageDirType().isOfType(NameNodeDirType.EDITS)) {
        File editsFile = getImageFile(sd, NameNodeFile.EDITS);
        editsFile.delete();
      }
      // delete old fsimage if sd is the edits only the directory
      if (!sd.getStorageDirType().isOfType(NameNodeDirType.IMAGE)) {
        File imageFile = getImageFile(sd, NameNodeFile.IMAGE);
        imageFile.delete();
      }

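The process itself is fairly complex; the original post illustrated it with a (rather large) figure taken from the book listed in the references.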



Summary

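That concludes this article. I have still skipped many details and mainly traced the thread step by step from the basic operations. I suggest that while Hadoop is running you look inside the name and edits directories and watch how they change, to verify this mechanism for yourself. Running an upgrade test directly would also work.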

For the full code analysis, see https://github.com/linyiqun/hadoop-hdfs. I will continue to add analyses of other parts of HDFS.


References

Cai Bin et al., 《Hadoop技術內部–HDFS結構設計與實現原理》 (Hadoop Internals: HDFS Architecture Design and Implementation Principles).

