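HDFS Read and Write Process Analysis (R1)

Posted by thamsyangsw on 2014-02-20

1. Opening a File

1.1 Client

To open a file in HDFS, the client calls DistributedFileSystem.open(Path f, int bufferSize), which is implemented as follows: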

public FSDataInputStream open(Path f, int bufferSize) throws IOException {

  return new DFSClient.DFSDataInputStream(

        dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));

}

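Here dfs is the DFSClient member of DistributedFileSystem. Its open function creates a DFSInputStream(src, buffersize, verifyChecksum) and returns it.

The DFSInputStream constructor calls openInfo, which fetches from the NameNode the information about the blocks that make up the file being opened: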

 

synchronized void openInfo() throws IOException {

  LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);

  this.locatedBlocks = newInfo;

  this.currentNode = null;

}

private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,

    String src, long start, long length) throws IOException {

    return namenode.getBlockLocations(src, start, length);

}

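LocatedBlocks mainly holds a list, List<LocatedBlock> blocks, in which each LocatedBlock contains the following:

  • Block b: the block's information
  • long offset: the block's offset within the file
  • DatanodeInfo[] locs: the DataNodes that hold this block

The namenode.getBlockLocations call above is an RPC that ends up in the getBlockLocations function of the NameNode class.

1.2 NameNode

NameNode.getBlockLocations is implemented as follows: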

public LocatedBlocks   getBlockLocations(String src,

                                        long offset,

                                        long length) throws IOException {

  return namesystem.getBlockLocations(getClientMachine(),

                                      src, offset, length);

}

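namesystem is a member of NameNode of type FSNamesystem. It holds the NameNode's namespace tree, and one of its important members is FSDirectory dir.

This FSDirectory has nothing to do with the FSDirectory in Lucene. It mainly contains FSImage fsImage, which reads and writes the fsimage file on disk. The FSImage class in turn has a member FSEditLog editLog, which reads and writes the edits file on disk. The relationship between these two files was explained in the previous article.

FSDirectory has another important member, INodeDirectoryWithQuota rootDir. The parent class of INodeDirectoryWithQuota is INodeDirectory, implemented as follows: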

public class INodeDirectory extends INode {

  ……

  private List<INode> children;

  ……

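So INodeDirectory is itself an INode that contains a list of INodes. An entry in that list is of type INodeDirectory if it is a directory and of type INodeFile if it is a file. INodeFile has a member BlockInfo blocks[] that holds the information about the blocks making up the file. Together these clearly form a tree.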
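The tree structure described here can be illustrated with a minimal sketch. The classes Node, Dir and FileNode and the helper getFileNode are hypothetical stand-ins invented for this example; they are not the Hadoop INode classes, and block information is reduced to an array of block ids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for INode / INodeDirectory / INodeFile.
class NamespaceTreeSketch {
    static abstract class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    static final class Dir extends Node {
        final List<Node> children = new ArrayList<Node>();
        Dir(String name) { super(name); }

        <T extends Node> T add(T child) { children.add(child); return child; }

        Node child(String childName) {
            for (Node n : children) {
                if (n.name.equals(childName)) return n;
            }
            return null;
        }
    }

    static final class FileNode extends Node {
        final long[] blockIds;   // stands in for BlockInfo blocks[]
        FileNode(String name, long... blockIds) { super(name); this.blockIds = blockIds; }
    }

    // Walk the path one component at a time; return the file node, or null if
    // the path does not exist or does not name a file.
    static FileNode getFileNode(Dir root, String path) {
        Node cur = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;            // skip the empty part before the leading "/"
            if (!(cur instanceof Dir)) return null;  // a file cannot have children
            cur = ((Dir) cur).child(part);
            if (cur == null) return null;
        }
        return (cur instanceof FileNode) ? (FileNode) cur : null;
    }

    public static void main(String[] args) {
        Dir root = new Dir("");
        Dir user = root.add(new Dir("user"));
        user.add(new FileNode("data.txt", 1001L, 1002L));
        FileNode f = getFileNode(root, "/user/data.txt");
        System.out.println(f.name + " has " + f.blockIds.length + " blocks");
    }
}
```

A path lookup in this sketch is a walk from the root directory through one child list per path component, which is the same shape of operation as finding an INodeFile by path name.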

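The FSNamesystem.getBlockLocations function that performs the lookup (the call above reaches it through an overload that also takes the client machine) is as follows: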

public LocatedBlocks getBlockLocations(String src, long offset, long length,

    boolean doAccessTime) throws IOException {

  final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src),

      offset, length, Integer.MAX_VALUE, doAccessTime); 

  return ret;

}

dir.getFileINode(src) looks up the INodeFile in the file system tree by path name; it holds the INode information of the file being opened.

getBlockLocationsInternal is implemented as follows:

 

private synchronized LocatedBlocks getBlockLocationsInternal(String src,

                                                     INodeFile inode,

                                                     long offset,

                                                     long length,

                                                     int nrBlocksToReturn,

                                                     boolean doAccessTime)

                                                     throws IOException {

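  // Get this file's block information

  Block[] blocks = inode.getBlocks();

  List<LocatedBlock> results = new ArrayList<LocatedBlock>(blocks.length);

  // Find the blocks covered by the range that starts at offset and spans length bytes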

  int curBlk = 0;

  long curPos = 0, blkSize = 0;

  int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length;

  for (curBlk = 0; curBlk < nrBlocks; curBlk++) {

    blkSize = blocks[curBlk].getNumBytes();

    if (curPos + blkSize > offset) {

      // When offset lies between curPos and curPos + blkSize, curBlk points to the block containing offset

      break;

    }

    curPos += blkSize;

  }

  long endOff = offset + length;

  // Loop over each block starting from curBlk until the current position curPos passes endOff

  do {

    int numNodes = blocksMap.numNodes(blocks[curBlk]);

    int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas();

    int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]);

    boolean blockCorrupt = (numCorruptNodes == numNodes);

    int numMachineSet = blockCorrupt ? numNodes :

                          (numNodes - numCorruptNodes);

    // Find the DataNodes that hold this block and put the ones that are not corrupt into machineSet

    DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet];

    if (numMachineSet > 0) {

      numNodes = 0;

      for(Iterator<DatanodeDescriptor> it =

          blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) {

        DatanodeDescriptor dn = it.next();

        boolean replicaCorrupt = corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn);

        if (blockCorrupt || (!blockCorrupt && !replicaCorrupt))

          machineSet[numNodes++] = dn;

      }

    }

    // Build a LocatedBlock from this machineSet and the current block

    results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos,

                blockCorrupt));

    curPos += blocks[curBlk].getNumBytes();

    curBlk++;

  } while (curPos < endOff

        && curBlk < blocks.length

        && results.size() < nrBlocksToReturn);

  // Build a LocatedBlocks object from this list of LocatedBlock and return it

  return inode.createLocatedBlocks(results);

}
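The block-selection arithmetic in getBlockLocationsInternal can be isolated in a small sketch. BlockSelectionSketch and blocksFor are hypothetical names, blocks are reduced to an array of sizes, and corrupt-replica handling is left out. Unlike the do-while above, this version uses plain while loops, so it returns an empty list when offset lies past the end of the file or when length is zero.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSelectionSketch {
    // Returns the 0-based indices of the blocks that overlap [offset, offset + length).
    static List<Integer> blocksFor(long[] blockSizes, long offset, long length) {
        List<Integer> result = new ArrayList<Integer>();
        int cur = 0;
        long curPos = 0;
        // Skip every block that ends at or before offset.
        while (cur < blockSizes.length && curPos + blockSizes[cur] <= offset) {
            curPos += blockSizes[cur];
            cur++;
        }
        long endOff = offset + length;
        // Collect blocks until the current position passes the end of the range.
        while (cur < blockSizes.length && curPos < endOff) {
            result.add(cur);
            curPos += blockSizes[cur];
            cur++;
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long[] sizes = { 64 * mb, 64 * mb, 64 * mb, 64 * mb };
        // Reading 128 MB starting at 100 MB touches the second, third and fourth blocks.
        System.out.println(blocksFor(sizes, 100 * mb, 128 * mb));
    }
}
```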

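1.3 Client

The LocatedBlocks object obtained from the NameNode through the RPC call is stored as a member of the DFSInputStream object being constructed, which is finally wrapped in an FSDataInputStream and returned to the user.

 

2. Reading a File

2.1 Client

To read the file, the client calls the read(long position, byte[] buffer, int offset, int length) function of the FSDataInputStream it obtained when opening the file.

FSDataInputStream calls the read(long position, byte[] buffer, int offset, int length) function of the DFSInputStream it wraps, implemented as follows: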

 

public int read(long position, byte[] buffer, int offset, int length)

  throws IOException {

  long filelen = getFileLength();

  int realLen = length;

  if ((position + length) > filelen) {

    realLen = (int)(filelen - position);

  }

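  // First get the list of blocks that cover the range from position to position + realLen

  // For example, with 64 MB blocks, reading 128 MB starting at 100 MB involves blocks 2, 3 and 4

  List<LocatedBlock> blockRange = getBlockRange(position, realLen);

  int remaining = realLen;

  // Read the content from each block

  // In the example above: from block 2, read 28 MB starting at offset 36 MB; from block 3, read the whole 64 MB; from block 4, read 36 MB starting at offset 0; 128 MB in total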

  for (LocatedBlock blk : blockRange) {

    long targetStart = position - blk.getStartOffset();

    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);

    fetchBlockByteRange(blk, targetStart,

                        targetStart + bytesToRead - 1, buffer, offset);

    remaining -= bytesToRead;

    position += bytesToRead;

    offset += bytesToRead;

  }

  assert remaining == 0 : "Wrong number of bytes read.";

  if (stats != null) {

    stats.incrementBytesRead(realLen);

  }

  return realLen;

}
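The worked example in the comments above (64 MB blocks, 128 MB read starting at 100 MB) can be checked with a short sketch. PositionalReadSketch and plan are hypothetical names, and the sketch assumes every block has the same size, which the real code does not require.

```java
import java.util.ArrayList;
import java.util.List;

class PositionalReadSketch {
    // Returns one {blockIndex, startInBlock, bytesToRead} triple per block touched.
    // Assumes equal-size blocks; the read is truncated at the end of the file.
    static List<long[]> plan(long blockSize, long fileLen, long position, long length) {
        List<long[]> plan = new ArrayList<long[]>();
        long remaining = Math.max(0, Math.min(length, fileLen - position));
        while (remaining > 0) {
            long blockIndex = position / blockSize;       // 0-based block number
            long startInBlock = position % blockSize;     // same role as targetStart
            long bytesToRead = Math.min(remaining, blockSize - startInBlock);
            plan.add(new long[] { blockIndex, startInBlock, bytesToRead });
            remaining -= bytesToRead;
            position += bytesToRead;
        }
        return plan;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        for (long[] step : plan(64 * mb, 1024 * mb, 100 * mb, 128 * mb)) {
            System.out.println("block " + (step[0] + 1) + ": start " + step[1] / mb
                    + " MB, read " + step[2] / mb + " MB");
        }
    }
}
```

With these inputs the plan has three entries: 28 MB from offset 36 MB of the second block, the whole third block, and the first 36 MB of the fourth block.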

The getBlockRange function is as follows:

 

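private synchronized List<LocatedBlock> getBlockRange(long offset,

                                                      long length)

                                                    throws IOException {

  List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();

  // First look up, in the cached locatedBlocks, the position in the cached list of the block containing offset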

  int blockIdx = locatedBlocks.findBlock(offset);

  if (blockIdx < 0) { // block is not cached

    blockIdx = LocatedBlocks.getInsertIndex(blockIdx);

  }

  long remaining = length;

  long curOff = offset;

  while(remaining > 0) {

    LocatedBlock blk = null;

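    // Look up the block at position blockIdx

    if(blockIdx < locatedBlocks.locatedBlockCount())

      blk = locatedBlocks.get(blockIdx);

    // If blk is null or starts after curOff, the cache does not cover this range: fetch these blocks from the NameNode and add them to the cache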

    if (blk == null || curOff < blk.getStartOffset()) {

      LocatedBlocks newBlocks;

      newBlocks = callGetBlockLocations(namenode, src, curOff, remaining);

      locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());

      continue;

    }

    // If the block was found, add it to the result

    blockRange.add(blk);

    long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;

    remaining -= bytesRead;

    curOff += bytesRead;

    // Move on to the next block

    blockIdx++;

  }

  return blockRange;

}
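The cache lookup at the top of getBlockRange relies on a search that returns a negative value when no cached block covers the offset, which is then converted into an insertion index. The following is a minimal sketch of that convention, assuming the usual binary-search encoding of -(insertionPoint) - 1. The class BlockCacheLookupSketch and its array-based cache are hypothetical simplifications, not the LocatedBlocks implementation.

```java
class BlockCacheLookupSketch {
    // starts[i] and sizes[i] describe cached block i; starts is sorted ascending.
    // Returns the index of the cached block containing offset, or
    // -(insertionPoint) - 1 when no cached block covers it.
    static int findBlock(long[] starts, long[] sizes, long offset) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offset < starts[mid]) {
                hi = mid - 1;
            } else if (offset >= starts[mid] + sizes[mid]) {
                lo = mid + 1;
            } else {
                return mid;              // offset falls inside block mid
            }
        }
        return -lo - 1;                  // not cached; lo is where it would be inserted
    }

    // Converts a negative search result back into the insertion index.
    static int getInsertIndex(int searchResult) {
        return searchResult >= 0 ? searchResult : -(searchResult + 1);
    }

    public static void main(String[] args) {
        long[] starts = { 0, 64, 192 };  // the block starting at 128 is not cached
        long[] sizes = { 64, 64, 64 };
        int r = findBlock(starts, sizes, 130);
        System.out.println("result " + r + ", insert at " + getInsertIndex(r));
    }
}
```

Blocks fetched from the NameNode for the missing range would be inserted at the returned index, which keeps the cached list sorted by start offset.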

fetchBlockByteRange is implemented as follows:

 

private void fetchBlockByteRange(LocatedBlock block, long start,

                                 long end, byte[] buf, int offset) throws IOException {

  Socket dn = null;

  int numAttempts = block.getLocations().length;

  // The while loop retries after a failed read, at most once per replica location

  while (dn == null && numAttempts-- > 0 ) {

    // Choose a DataNode to read from

    DNAddrPair retval = chooseDataNode(block);

    DatanodeInfo chosenNode = retval.info;

    InetSocketAddress targetAddr = retval.addr;

    BlockReader reader = null;

    try {

      // Open a socket connection to the DataNode

      dn = socketFactory.createSocket();

      dn.connect(targetAddr, socketTimeout);

      dn.setSoTimeout(socketTimeout);

      int len = (int) (end - start + 1);

      // Use the socket connection to create a reader responsible for reading data from the DataNode

      reader = BlockReader.newBlockReader(dn, src,

                                          block.getBlock().getBlockId(),

                                          block.getBlock().getGenerationStamp(),

                                          start, len, buffersize,

                                          verifyChecksum, clientName);

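      // Read the data

      int nread = reader.readAll(buf, offset, len);

      return;

    } catch (IOException e) {

      // The read from this DataNode failed (detailed handling omitted in this excerpt); fall through and retry

    } finally {

      IOUtils.closeStream(reader);

      IOUtils.closeSocket(dn);

      dn = null;

    }

    // If the read failed, mark this DataNode as a dead node

    addToDeadNodes(chosenNode);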

  }

}
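The retry pattern in fetchBlockByteRange (try one replica, mark it dead on failure, move on to the next) can be sketched without any networking. ReplicaRetrySketch, NodeReader and readFromAnyReplica are hypothetical names for this illustration; DataNodes are plain strings and the actual read is supplied by the caller.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ReplicaRetrySketch {
    // Hypothetical reader abstraction: throws IOException when the node cannot serve the read.
    interface NodeReader {
        byte[] read(String node) throws IOException;
    }

    // Nodes that failed earlier; later reads skip them.
    final Set<String> deadNodes = new HashSet<String>();

    byte[] readFromAnyReplica(List<String> replicas, NodeReader reader) throws IOException {
        for (String node : replicas) {
            if (deadNodes.contains(node)) continue;   // already known to be bad
            try {
                return reader.read(node);             // success ends the loop
            } catch (IOException e) {
                deadNodes.add(node);                  // remember the failure, try the next replica
            }
        }
        throw new IOException("no live replica among " + replicas);
    }

    public static void main(String[] args) throws IOException {
        ReplicaRetrySketch client = new ReplicaRetrySketch();
        byte[] data = client.readFromAnyReplica(Arrays.asList("dn1", "dn2"), new NodeReader() {
            public byte[] read(String node) throws IOException {
                if (node.equals("dn1")) throw new IOException("dn1 is down");
                return new byte[] { 1, 2, 3 };
            }
        });
        System.out.println(data.length + " bytes read, dead nodes: " + client.deadNodes);
    }
}
```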

BlockReader.newBlockReader is implemented as follows:

 

public static BlockReader newBlockReader( Socket sock, String file,

                                   long blockId,

                                   long genStamp,

                                   long startOffset, long len,

                                   int bufferSize, boolean verifyChecksum,

                                   String clientName)

                                   throws IOException {

  // Create an output stream on the socket and send the read instruction to the DataNode

  DataOutputStream out = new DataOutputStream(

    new BufferedOutputStream(NetUtils.getOutputStream(sock,HdfsConstants.WRITE_TIMEOUT)));

  out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

  out.write( DataTransferProtocol.OP_READ_BLOCK );

  out.writeLong( blockId );

  out.writeLong( genStamp );

  out.writeLong( startOffset );

  out.writeLong( len );

  Text.writeString(out, clientName);

  out.flush();

  // Create an input stream on the socket for reading data from the DataNode

  DataInputStream in = new DataInputStream(

      new BufferedInputStream(NetUtils.getInputStream(sock),

                              bufferSize));

  DataChecksum checksum = DataChecksum.newDataChecksum( in );

  long firstChunkOffset = in.readLong();

  // Create a reader, which mainly wraps the input stream used to read the data

  return new BlockReader( file, blockId, in, checksum, verifyChecksum,

                          startOffset, firstChunkOffset, sock );

}

BlockReader's readAll function reads the data through the DataInputStream created above.
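The read request that newBlockReader writes is a fixed sequence of fields, so it can be sketched as an encode/decode round trip over an in-memory buffer. In this sketch the constants VERSION and OP_READ are placeholders, not the real DataTransferProtocol values, and writeUTF/readUTF stand in for Text.writeString/Text.readString, whose byte layout differs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class ReadRequestSketch {
    // Placeholder constants, NOT the real DataTransferProtocol values.
    static final short VERSION = 1;
    static final byte OP_READ = 1;

    // Writes the fields in the same order as newBlockReader.
    static byte[] encode(long blockId, long genStamp, long startOffset, long len,
                         String clientName) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(VERSION);
        out.write(OP_READ);
        out.writeLong(blockId);
        out.writeLong(genStamp);
        out.writeLong(startOffset);
        out.writeLong(len);
        out.writeUTF(clientName);   // stand-in for Text.writeString
        out.flush();
        return bytes.toByteArray();
    }

    // Reads the fields back in the order the receiving side consumes them.
    static String decode(byte[] request) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(request));
        short version = in.readShort();
        byte op = in.readByte();
        long blockId = in.readLong();
        long genStamp = in.readLong();
        long startOffset = in.readLong();
        long len = in.readLong();
        String clientName = in.readUTF();   // stand-in for Text.readString
        return "version=" + version + " op=" + op + " block=" + blockId + " gen=" + genStamp
                + " start=" + startOffset + " len=" + len + " client=" + clientName;
    }

    public static void main(String[] args) throws IOException {
        byte[] request = encode(7L, 3L, 1024L, 4096L, "client-1");
        System.out.println(request.length + " bytes: " + decode(request));
    }
}
```

Because the format carries no field names, both sides have to agree on the field order; the DataNode reads the version and op code first and then the remaining fields.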

2.2 DataNode

When a DataNode starts, it calls startDataNode; the logic related to reading data is as follows:

 

void startDataNode(Configuration conf,

                   AbstractList<File> dataDirs

                   ) throws IOException {

  ……

  // Create a ServerSocket and a DataXceiverServer that listens for client connections

  ServerSocket ss = (socketWriteTimeout > 0) ?

        ServerSocketChannel.open().socket() : new ServerSocket();

  Server.bind(ss, socAddr, 0);

  ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);

  // adjust machine name with the actual port

  tmpPort = ss.getLocalPort();

  selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(),

                                   tmpPort);

  this.dnRegistration.setName(machineName + ":" + tmpPort);

  this.threadGroup = new ThreadGroup("dataXceiverServer");

  this.dataXceiverServer = new Daemon(threadGroup,

      new DataXceiverServer(ss, conf, this));

  this.threadGroup.setDaemon(true); // auto destroy when empty

  ……

}

DataXceiverServer.run() is as follows:

 

public void run() {

  while (datanode.shouldRun) {

      // Accept a client connection

      Socket s = ss.accept();

      s.setTcpNoDelay(true);

      // Create a DataXceiver thread to serve the accepted connection

      new Daemon(datanode.threadGroup,

          new DataXceiver(s, datanode, this)).start();

  }

  try {

    ss.close();

  } catch (IOException ie) {

    LOG.warn(datanode.dnRegistration + ":DataXceiveServer: "

                            + StringUtils.stringifyException(ie));

  }

}

DataXceiver.run() is as follows:

 

public void run() {

  DataInputStream in=null;

  try {

    // Create an input stream and read the instruction sent by the client

    in = new DataInputStream(

        new BufferedInputStream(NetUtils.getInputStream(s),

                                SMALL_BUFFER_SIZE));

    short version = in.readShort();

    boolean local = s.getInetAddress().equals(s.getLocalAddress());

    byte op = in.readByte();

    // Make sure the xciver count is not exceeded

    int curXceiverCount = datanode.getXceiverCount();

    long startTime = DataNode.now();

    switch ( op ) {

    // Read

    case DataTransferProtocol.OP_READ_BLOCK:

      // Do the actual read

      readBlock( in );

      datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);

      if (local)

        datanode.myMetrics.readsFromLocalClient.inc();

      else

        datanode.myMetrics.readsFromRemoteClient.inc();

      break;

    // Write

    case DataTransferProtocol.OP_WRITE_BLOCK:

      // Do the actual write

      writeBlock( in );

      datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);

      if (local)

        datanode.myMetrics.writesFromLocalClient.inc();

      else

        datanode.myMetrics.writesFromRemoteClient.inc();

      break;

    // Other instructions

    ……

    }

  } catch (Throwable t) {

    LOG.error(datanode.dnRegistration + ":DataXceiver",t);

  } finally {

    IOUtils.closeStream(in);

    IOUtils.closeSocket(s);

    dataXceiverServer.childSockets.remove(s);

  }

}

 

private void readBlock(DataInputStream in) throws IOException {

  // Read the request fields

  long blockId = in.readLong();         

  Block block = new Block( blockId, 0 , in.readLong());

  long startOffset = in.readLong();

  long length = in.readLong();

  String clientName = Text.readString(in);

  // Create an output stream for writing data to the client

  OutputStream baseStream = NetUtils.getOutputStream(s,

      datanode.socketWriteTimeout);

  DataOutputStream out = new DataOutputStream(

               new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));

  // Create a BlockSender that reads the local block data and sends it to the client

  // BlockSender has a member InputStream blockIn for reading the local block data

  BlockSender blockSender = null;

  try {

   blockSender = new BlockSender(block, startOffset, length,

          true, true, false, datanode, clientTraceFmt);

   out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status

   // Write the data to the client

   long read = blockSender.sendBlock(out, baseStream, null);

   ……

  } finally {

    IOUtils.closeStream(out);

    IOUtils.closeStream(blockSender);

  }

}

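3. Writing a File

The following walks through uploading a file to HDFS.

3.1 Client

Uploading a file to HDFS usually starts with a call to DistributedFileSystem.create, implemented as follows: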

 

  public FSDataOutputStream create(Path f, FsPermission permission,

    boolean overwrite,

    int bufferSize, short replication, long blockSize,

    Progressable progress) throws IOException {

    return new FSDataOutputStream

       (dfs.create(getPathName(f), permission,

                   overwrite, replication, blockSize, progress, bufferSize),

        statistics);

  }

It ultimately produces an FSDataOutputStream for writing data to the newly created file. The member dfs is of type DFSClient, whose create function is as follows:

  public OutputStream create(String src,

                             FsPermission permission,

                             boolean overwrite,

                             short replication,

                             long blockSize,

                             Progressable progress,

                             int buffersize

                             ) throws IOException {

    checkOpen();

    if (permission == null) {

      permission = FsPermission.getDefault();

    }

    FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));

    OutputStream result = new DFSOutputStream(src, masked,

        overwrite, replication, blockSize, progress, buffersize,

        conf.getInt("io.bytes.per.checksum", 512));

    leasechecker.put(src, result);

    return result;

  }

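It constructs a DFSOutputStream, whose constructor creates the file by calling the NameNode's create through RPC.
The constructor also does one more important thing: it calls streamer.start(), which starts the DataStreamer thread that drives the write pipeline. We analyze it in detail when we get to writing data.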

  DFSOutputStream(String src, FsPermission masked, boolean overwrite,

      short replication, long blockSize, Progressable progress,

      int buffersize, int bytesPerChecksum) throws IOException {

    this(src, blockSize, progress, bytesPerChecksum);

    computePacketChunkSize(writePacketSize, bytesPerChecksum);

    try {

      namenode.create(

          src, masked, clientName, overwrite, replication, blockSize);

    } catch(RemoteException re) {

      throw re.unwrapRemoteException(AccessControlException.class,

                                     QuotaExceededException.class);

    }

    streamer.start();

  }

 

3.2 NameNode

NameNode's create function calls namesystem.startFile, which in turn calls startFileInternal, implemented as follows:

  private synchronized void startFileInternal(String src,

                                              PermissionStatus permissions,

                                              String holder,

                                              String clientMachine,

                                              boolean overwrite,

                                              boolean append,

                                              short replication,

                                              long blockSize

                                              ) throws IOException {

    ......

   // Create a new file in the under construction state, with no data blocks associated with it yet

   long genstamp = nextGenerationStamp();

   INodeFileUnderConstruction newNode = dir.addFile(src, permissions,

      replication, blockSize, holder, clientMachine, clientNode, genstamp);

   ......

  }

 

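3.3 Client

Now the client writes data to the newly created file, usually through the write function of FSDataOutputStream, which eventually calls the writeChunk function of DFSOutputStream (listed after the pipeline overview below).

By design, HDFS writes block data in a pipeline: the data is split into packets, and if three replicas are required, written to DataNode 1, 2 and 3, the process is as follows:

  • First, packet 1 is written to DataNode 1
  • Then DataNode 1 writes packet 1 to DataNode 2, while the client can write packet 2 to DataNode 1
  • Then DataNode 2 writes packet 1 to DataNode 3, while the client can write packet 3 to DataNode 1 and DataNode 1 writes packet 2 to DataNode 2
  • The packets are passed down the line one after another in this way until all the data has been written and replicated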
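The schedule in the list above can be reproduced with a small sketch. It assumes an idealised lock-step model in which every hop takes exactly one step, which real transfers do not guarantee. PipelineScheduleSketch, packetAt and totalSteps are hypothetical names.

```java
class PipelineScheduleSketch {
    // In the lock-step model, packet p (0-based) reaches DataNode d (0-based) at step p + d.
    // Returns the packet arriving at the given DataNode in the given step, or -1 if it is idle.
    static int packetAt(int step, int dataNode, int numPackets) {
        int p = step - dataNode;
        return (p >= 0 && p < numPackets) ? p : -1;
    }

    // The last packet enters the pipeline at step numPackets - 1 and needs
    // numDataNodes - 1 further steps to reach the last DataNode.
    static int totalSteps(int numPackets, int numDataNodes) {
        return numPackets + numDataNodes - 1;
    }

    public static void main(String[] args) {
        int packets = 4;
        int nodes = 3;
        for (int step = 0; step < totalSteps(packets, nodes); step++) {
            StringBuilder line = new StringBuilder("step " + (step + 1) + ":");
            for (int d = 0; d < nodes; d++) {
                int p = packetAt(step, d, packets);
                line.append(" DN").append(d + 1).append('=').append(p < 0 ? "-" : "P" + (p + 1));
            }
            System.out.println(line);
        }
    }
}
```

In this model, sending the packets one at a time to all three DataNodes would take packets × nodes steps, while the pipeline needs only packets + nodes - 1.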

  protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum)

                                                        throws IOException {

      // Create a packet and write the data into it

      currentPacket = new Packet(packetSize, chunksPerPacket,

                                   bytesCurBlock);

      currentPacket.writeChecksum(checksum, 0, cklen);

      currentPacket.writeData(b, offset, len);

      currentPacket.numChunks++;

      bytesCurBlock += len;

      // If this packet is full, put it into the queue to be sent

      if (currentPacket.numChunks == currentPacket.maxChunks ||

          bytesCurBlock == blockSize) {

          ......

          dataQueue.addLast(currentPacket);

          // Wake up the transfer thread waiting on dataQueue, that is, the DataStreamer

          dataQueue.notifyAll();

          currentPacket = null;

          ......

      }

  }


DataStreamer's run function is as follows:

  public void run() {

    while (!closed && clientRunning) {

      Packet one = null;

      synchronized (dataQueue) {

        // Wait while the queue holds no packets

        while ((!closed && !hasError && clientRunning

               && dataQueue.size() == 0) || doSleep) {

          try {

            dataQueue.wait(1000);

          } catch (InterruptedException  e) {

          }

          doSleep = false;

        }

        try {

          // Get the first packet in the queue

          one = dataQueue.getFirst();

          long offsetInBlock = one.offsetInBlock;

          // Have the NameNode allocate a block, and create an output stream to that block

          if (blockStream == null) {

            nodes = nextBlockOutputStream(src);

            response = new ResponseProcessor(nodes);

            response.start();

          }

          ByteBuffer buf = one.getBuffer();

          // Move the packet from dataQueue to ackQueue to await acknowledgement

          dataQueue.removeFirst();

          dataQueue.notifyAll();

          synchronized (ackQueue) {

            ackQueue.addLast(one);

            ackQueue.notifyAll();

          }

          // Use the output stream to write the data to the block on the DataNode

          blockStream.write(buf.array(), buf.position(), buf.remaining());

          if (one.lastPacketInBlock) {

            blockStream.writeInt(0); // marks the end of this block

          }

          blockStream.flush();

        } catch (Throwable e) {

        }

      }

      ......

  }
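The movement of packets between dataQueue and ackQueue in the run function above can be sketched in a single thread. PacketQueuesSketch is a hypothetical class: packets are reduced to their sequence numbers, the wait/notify synchronisation is left out, and the check that acknowledgements arrive in send order is an assumption of this sketch rather than something shown in the excerpt.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PacketQueuesSketch {
    final Deque<Long> dataQueue = new ArrayDeque<Long>();  // seqnos waiting to be sent
    final Deque<Long> ackQueue = new ArrayDeque<Long>();   // seqnos sent, awaiting acknowledgement

    // Producer side: a full packet is appended to dataQueue.
    void enqueue(long seqno) {
        dataQueue.addLast(seqno);
    }

    // Streamer side: move the head packet to ackQueue, then "send" it.
    long sendNext() {
        long seqno = dataQueue.removeFirst();
        ackQueue.addLast(seqno);
        return seqno;
    }

    // Response side: an acknowledgement must match the oldest unacknowledged packet
    // (assumption of this sketch).
    void ack(long seqno) {
        Long head = ackQueue.peekFirst();
        if (head == null || head.longValue() != seqno) {
            throw new IllegalStateException("unexpected ack " + seqno);
        }
        ackQueue.removeFirst();
    }

    public static void main(String[] args) {
        PacketQueuesSketch q = new PacketQueuesSketch();
        q.enqueue(0);
        q.enqueue(1);
        q.sendNext();
        System.out.println("waiting=" + q.dataQueue + " unacked=" + q.ackQueue);
        q.ack(0);
        System.out.println("waiting=" + q.dataQueue + " unacked=" + q.ackQueue);
    }
}
```

Keeping sent packets in a second queue means the client still holds every packet that has not yet been confirmed by the pipeline.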

 

One important function here is nextBlockOutputStream, implemented as follows:

  private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {

    LocatedBlock lb = null;

    boolean retry = false;

    DatanodeInfo[] nodes;

    int count = conf.getInt("dfs.client.block.write.retries", 3);

    boolean success;

    do {

      ......

      // Have the NameNode allocate DataNodes and a block for the file

      lb = locateFollowingBlock(startTime);

      block = lb.getBlock();

      nodes = lb.getLocations();

      // Create the output stream to the DataNodes

      success = createBlockOutputStream(nodes, clientName, false);

      ......

    } while (retry && --count >= 0);

    return nodes;

  }

 

locateFollowingBlock calls the namenode.addBlock(src, clientName) function through RPC.

 

3.4 NameNode

NameNode's addBlock function is implemented as follows:

  public LocatedBlock addBlock(String src,

                               String clientName) throws IOException {

    LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);

    return locatedBlock;

  }

FSNamesystem's getAdditionalBlock is implemented as follows:

  public LocatedBlock getAdditionalBlock(String src,

                                         String clientName

                                         ) throws IOException {

    long fileLength, blockSize;

    int replication;

    DatanodeDescriptor clientNode = null;

    Block newBlock = null;

    ......

    // Choose DataNodes for the new block

    DatanodeDescriptor targets[] = replicator.chooseTarget(replication,

                                                           clientNode,

                                                           null,

                                                           blockSize);

    ......

    // Get the INodes of every component of the file path; the last one is the INode of the newly added file, in the under construction state

    INode[] pathINodes = dir.getExistingPathINodes(src);

    int inodesLen = pathINodes.length;

    INodeFileUnderConstruction pendingFile  = (INodeFileUnderConstruction)

                                                pathINodes[inodesLen - 1];

    // Allocate a block for the file and record which DataNodes it will be written to

    newBlock = allocateBlock(src, pathINodes);

    pendingFile.setTargets(targets);

    ......

    return new LocatedBlock(newBlock, targets, fileLength);

  }

 

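3.5 Client

Once the DataNodes and the block have been allocated, createBlockOutputStream connects to the first DataNode and sends the write request header, after which data can be written: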

  private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,

                  boolean recoveryFlag) {

      // Create a socket and connect to the DataNode

      InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());

      s = socketFactory.createSocket();

      int timeoutValue = 3000 * nodes.length + socketTimeout;

      s.connect(target, timeoutValue);

      s.setSoTimeout(timeoutValue);

      s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +

                          datanodeWriteTimeout;

      DataOutputStream out = new DataOutputStream(

          new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),

                                   DataNode.SMALL_BUFFER_SIZE));

      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

      // Write the instruction

      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

      out.write( DataTransferProtocol.OP_WRITE_BLOCK );

      out.writeLong( block.getBlockId() );

      out.writeLong( block.getGenerationStamp() );

      out.writeInt( nodes.length );

      out.writeBoolean( recoveryFlag );

      Text.writeString( out, client );

      out.writeBoolean(false);

      out.writeInt( nodes.length - 1 );

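      // Note that this loop starts at 1, not 0. It sends the information of the DataNodes other than the first one (the other two, with three replicas) to the first DataNode, which uses it to forward the data to them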

      for (int i = 1; i < nodes.length; i++) {

        nodes[i].write(out);

      }

      checksum.writeHeader( out );

      out.flush();

      firstBadLink = Text.readString(blockReplyStream);

      if (firstBadLink.length() != 0) {

        throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);

      }

      blockStream = out;

  }

 

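After the output stream has been created, the client calls blockStream.write in DataStreamer's run function to write the data to the DataNode.

 

3.6 DataNode

In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK instruction triggers the writeBlock function: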

  private void writeBlock(DataInputStream in) throws IOException {

    DatanodeInfo srcDataNode = null;

    // Read the header

    Block block = new Block(in.readLong(),

        dataXceiverServer.estimateBlockSize, in.readLong());

    int pipelineSize = in.readInt(); // num of datanodes in entire pipeline

    boolean isRecovery = in.readBoolean(); // is this part of recovery?

    String client = Text.readString(in); // working on behalf of this client

    boolean hasSrcDataNode = in.readBoolean(); // is src node info present

    if (hasSrcDataNode) {

      srcDataNode = new DatanodeInfo();

      srcDataNode.readFields(in);

    }

    int numTargets = in.readInt();

    if (numTargets < 0) {

      throw new IOException("Mislabelled incoming datastream.");

    }

    // Read the remaining DataNode list. If this is the first DataNode, the list holds the second and third DataNodes; if this is the second DataNode, it holds the third DataNode

    DatanodeInfo targets[] = new DatanodeInfo[numTargets];

    for (int i = 0; i < targets.length; i++) {

      DatanodeInfo tmp = new DatanodeInfo();

      tmp.readFields(in);

      targets[i] = tmp;

    }

    DataOutputStream mirrorOut = null;  // stream to next target

    DataInputStream mirrorIn = null;    // reply from next target

    DataOutputStream replyOut = null;   // stream to prev target

    Socket mirrorSock = null;           // socket to next target

    BlockReceiver blockReceiver = null; // responsible for data handling

    String mirrorNode = null;           // the name:port of next target

    String firstBadLink = "";           // first datanode that failed in connection setup

    try {

      // Create a BlockReceiver. Its member DataInputStream in reads data from the client or the previous DataNode, its member DataOutputStream mirrorOut writes data to the next DataNode, and its member OutputStream out writes data to the local disk.

      blockReceiver = new BlockReceiver(block, in,

          s.getRemoteSocketAddress().toString(),

          s.getLocalSocketAddress().toString(),

          isRecovery, client, srcDataNode, datanode);

      // get a connection back to the previous target

      replyOut = new DataOutputStream(

                     NetUtils.getOutputStream(s, datanode.socketWriteTimeout));

      // If this is not the last DataNode, open a socket connection to the next DataNode

      if (targets.length > 0) {

        InetSocketAddress mirrorTarget = null;

        // Connect to backup machine

        mirrorNode = targets[0].getName();

        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);

        mirrorSock = datanode.newSocket();

        int timeoutValue = numTargets * datanode.socketTimeout;

        int writeTimeout = datanode.socketWriteTimeout +

                             (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);

        mirrorSock.connect(mirrorTarget, timeoutValue);

        mirrorSock.setSoTimeout(timeoutValue);

        mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

        // Create the stream for writing data to the next DataNode

        mirrorOut = new DataOutputStream(

             new BufferedOutputStream(

                         NetUtils.getOutputStream(mirrorSock, writeTimeout),

                         SMALL_BUFFER_SIZE));

        mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

        mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

        mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );

        mirrorOut.writeLong( block.getBlockId() );

        mirrorOut.writeLong( block.getGenerationStamp() );

        mirrorOut.writeInt( pipelineSize );

        mirrorOut.writeBoolean( isRecovery );

        Text.writeString( mirrorOut, client );

        mirrorOut.writeBoolean(hasSrcDataNode);

        if (hasSrcDataNode) { // pass src node information

          srcDataNode.write(mirrorOut);

        }

        mirrorOut.writeInt( targets.length - 1 );

        // This loop also starts at 1: it sends to the next DataNode the information of the DataNodes that come after it

        for ( int i = 1; i < targets.length; i++ ) {

          targets[i].write( mirrorOut );

        }

        blockReceiver.writeChecksumHeader(mirrorOut);

        mirrorOut.flush();

      }

      // Use the BlockReceiver to receive the block

      String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;

      blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,

                                 mirrorAddr, null, targets.length);

      ......

    } finally {

      // close all opened streams

      IOUtils.closeStream(mirrorOut);

      IOUtils.closeStream(mirrorIn);

      IOUtils.closeStream(replyOut);

      IOUtils.closeSocket(mirrorSock);

      IOUtils.closeStream(blockReceiver);

    }

  }
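Both the client and each DataNode forward the target list starting from index 1, so the list shrinks by one entry at every hop. A minimal sketch of that behaviour, with DataNodes reduced to names and hypothetical identifiers PipelineTargetsSketch and targetsPerHop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PipelineTargetsSketch {
    // For each DataNode in the pipeline, returns the downstream targets it receives in
    // its request header: everything after itself in the pipeline.
    static List<List<String>> targetsPerHop(List<String> pipeline) {
        List<List<String>> perHop = new ArrayList<List<String>>();
        for (int hop = 0; hop < pipeline.size(); hop++) {
            // The sender skips the node it connects to, which is why the loops start at 1.
            perHop.add(new ArrayList<String>(pipeline.subList(hop + 1, pipeline.size())));
        }
        return perHop;
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("dn1", "dn2", "dn3");
        List<List<String>> perHop = targetsPerHop(pipeline);
        for (int i = 0; i < pipeline.size(); i++) {
            System.out.println(pipeline.get(i) + " receives targets " + perHop.get(i));
        }
    }
}
```

The last DataNode receives an empty list, which corresponds to the targets.length > 0 check in writeBlock failing, so it opens no mirror connection.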

 

An important piece of logic in BlockReceiver's receiveBlock function is as follows:

  void receiveBlock(

      DataOutputStream mirrOut, // output to next datanode

      DataInputStream mirrIn,   // input from next datanode

      DataOutputStream replyOut,  // output to previous datanode

      String mirrAddr, BlockTransferThrottler throttlerArg,

      int numTargets) throws IOException {

      ......

      // Keep receiving packets until the end

      while (receivePacket() > 0) {}

      if (mirrorOut != null) {

        try {

          mirrorOut.writeInt(0); // mark the end of the block

          mirrorOut.flush();

        } catch (IOException e) {

          handleMirrorOutError(e);

        }

      }

      ......

  }

 

BlockReceiver's receivePacket function is as follows:

  private int receivePacket() throws IOException {

    // Receive a packet from the client or the previous node

    int payloadLen = readNextPacket();

    buf.mark();

    //read the header

    buf.getInt(); // packet length

    offsetInBlock = buf.getLong(); // get offset of packet in block

    long seqno = buf.getLong();    // get seqno

    boolean lastPacketInBlock = (buf.get() != 0);

    int endOfHeader = buf.position();

    buf.reset();

    setBlockPosition(offsetInBlock);

    // Write the packet to the next DataNode

    if (mirrorOut != null) {

      try {

        mirrorOut.write(buf.array(), buf.position(), buf.remaining());

        mirrorOut.flush();

      } catch (IOException e) {

        handleMirrorOutError(e);

      }

    }

    buf.position(endOfHeader);       

    int len = buf.getInt();

    offsetInBlock += len;

    int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*

                                                            checksumSize;

    int checksumOff = buf.position();

    int dataOff = checksumOff + checksumLen;

    byte pktBuf[] = buf.array();

    buf.position(buf.limit()); // move to the end of the data.

    ......

    // Write the data to the local block

    out.write(pktBuf, dataOff, len);

    /// flush entire packet before sending ack

    flush();

    // put in queue for pending acks

    if (responder != null) {

      ((PacketResponder)responder.getRunnable()).enqueue(seqno,

                                      lastPacketInBlock);

    }

    return payloadLen;

  }
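The header parsing and the checksum length formula in receivePacket can be tried on an in-memory buffer. The sketch below builds a packet using the field order that receivePacket reads (packet length, offset in block, seqno, last-packet flag, data length, then checksums and data). PacketLayoutSketch and its methods are hypothetical, and the value written into the packet length field is an assumption of this sketch.

```java
import java.nio.ByteBuffer;

class PacketLayoutSketch {
    // Checksum bytes needed for len data bytes: one checksum per chunk, where the
    // last chunk may be partial (integer ceiling of len / bytesPerChecksum).
    static int checksumLen(int len, int bytesPerChecksum, int checksumSize) {
        return ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
    }

    static ByteBuffer build(long offsetInBlock, long seqno, boolean last,
                            byte[] checksums, byte[] data) {
        // Assumption: the length field covers the data-length field, checksums and data.
        int packetLen = 4 + checksums.length + data.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 1 + 4 + checksums.length + data.length);
        buf.putInt(packetLen).putLong(offsetInBlock).putLong(seqno).put((byte) (last ? 1 : 0));
        buf.putInt(data.length).put(checksums).put(data);
        buf.flip();
        return buf;
    }

    // Parses the header and returns the position of the first data byte in the buffer.
    static int dataOffset(ByteBuffer buf, int bytesPerChecksum, int checksumSize) {
        buf.getInt();              // packet length
        buf.getLong();             // offset of the packet in the block
        buf.getLong();             // sequence number
        buf.get();                 // last-packet-in-block flag
        int len = buf.getInt();    // number of data bytes
        int checksumOff = buf.position();
        return checksumOff + checksumLen(len, bytesPerChecksum, checksumSize);
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        byte[] checksums = new byte[checksumLen(data.length, 512, 4)];
        ByteBuffer packet = build(0L, 1L, false, checksums, data);
        System.out.println("checksum bytes: " + checksums.length
                + ", data starts at byte " + dataOffset(packet, 512, 4));
    }
}
```

For 1000 data bytes with 512 bytes per checksum and 4-byte checksums, two chunks are needed, so the checksum section is 8 bytes and the data starts right after the 25-byte header plus those 8 bytes.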

Reposted from: http://www.cnblogs.com/forfuture1978/archive/2010/11/10/1874222.html

From the ITPUB blog, link: http://blog.itpub.net/26613085/viewspace-1084781/. If you repost this article, please credit the source; otherwise legal action may be taken.
