HDFS Read/Write Process Analysis (R1)
1. Opening a File
1.1 The Client
To open a file in HDFS, the client calls DistributedFileSystem.open(Path f, int bufferSize), implemented as follows:
```java
public FSDataInputStream open(Path f, int bufferSize) throws IOException {
  return new DFSClient.DFSDataInputStream(
      dfs.open(getPathName(f), bufferSize, verifyChecksum, statistics));
}
```
Here dfs is the DistributedFileSystem member variable of type DFSClient. Its open function is called, which creates a DFSInputStream(src, buffersize, verifyChecksum) and returns it.
In the DFSInputStream constructor, the openInfo function is called; it obtains from the NameNode the information about the blocks of the file being opened, implemented as follows:
```java
synchronized void openInfo() throws IOException {
  LocatedBlocks newInfo = callGetBlockLocations(namenode, src, 0, prefetchSize);
  this.locatedBlocks = newInfo;
  this.currentNode = null;
}
```
```java
private static LocatedBlocks callGetBlockLocations(ClientProtocol namenode,
    String src, long start, long length) throws IOException {
  return namenode.getBlockLocations(src, start, length);
}
```
LocatedBlocks mainly contains a list, List&lt;LocatedBlock&gt;, where each LocatedBlock holds the following information (a simplified sketch of this structure follows the list):
- Block b: the information about this block
- long offset: the offset of this block within the file
- DatanodeInfo[] locs: the DataNodes on which this block is stored
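To make the shape of this data concrete, here is a minimal, simplified sketch of what the client receives. It only illustrates the fields listed above; SimpleLocatedBlock and SimpleLocatedBlocks are made-up names, not the real org.apache.hadoop.hdfs classes.

```java
import java.util.List;

// Simplified sketch of the per-block location info returned by getBlockLocations().
// Field names mirror the list above; these are NOT the real Hadoop classes.
class SimpleLocatedBlock {
    long blockId;            // identifies the block (the real class wraps a Block object)
    long offsetInFile;       // byte offset of this block within the file
    String[] datanodeAddrs;  // host:port of the DataNodes holding a replica
}

class SimpleLocatedBlocks {
    long fileLength;                  // total length of the file
    List<SimpleLocatedBlock> blocks;  // ordered by offsetInFile
}
```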
The call namenode.getBlockLocations above is an RPC call, which ultimately invokes the getBlockLocations function of the NameNode class.
1.2 NameNode
NameNode.getBlockLocations is implemented as follows:
```java
public LocatedBlocks getBlockLocations(String src, long offset, long length)
    throws IOException {
  return namesystem.getBlockLocations(getClientMachine(), src, offset, length);
}
```
namesystem is a member variable of NameNode of type FSNamesystem; it holds the NameNode's namespace tree, and one of its important members is FSDirectory dir.
FSDirectory has nothing to do with the FSDirectory in Lucene. It mainly contains an FSImage fsImage, used to read and write the fsimage file on disk; the FSImage class in turn has a member FSEditLog editLog, used to read and write the edits file on disk. The relationship between these two files was explained in the previous article.
FSDirectory also has an important member variable INodeDirectoryWithQuota rootDir. The parent class of INodeDirectoryWithQuota is INodeDirectory, implemented as follows:
```java
public class INodeDirectory extends INode {
  ......
  private List<INode> children;
  ......
}
```
As we can see, INodeDirectory is itself an INode, and it contains a list of INodes. In this list, a child that is a directory is of type INodeDirectory, and a child that is a file is of type INodeFile; INodeFile has a member variable BlockInfo blocks[], which holds the block information of that file. Clearly this forms a tree structure (a toy sketch of such a tree follows).
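The toy sketch below illustrates this namespace tree and how a path such as /a/b/c is resolved component by component, in the spirit of dir.getFileINode(src). The class names are invented for this illustration; they are not the real INode classes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the namespace tree described above (not the real INode classes).
abstract class Node { }

class DirNode extends Node {
    Map<String, Node> children = new HashMap<>();  // name -> child, like INodeDirectory
}

class FileNode extends Node {
    long[] blockIds;  // stands in for BlockInfo blocks[]
}

class NamespaceSketch {
    DirNode root = new DirNode();

    // Walks "/a/b/c" from the root one component at a time, the way
    // dir.getFileINode(src) resolves a path to an INodeFile.
    Node resolve(String path) {
        Node cur = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            if (!(cur instanceof DirNode)) return null;  // hit a file too early
            cur = ((DirNode) cur).children.get(part);
            if (cur == null) return null;                // component not found
        }
        return cur;  // a FileNode for a file path, a DirNode for a directory
    }
}
```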
FSNamesystem.getBlockLocations is as follows:
```java
public LocatedBlocks getBlockLocations(String src, long offset, long length,
    boolean doAccessTime) throws IOException {
  final LocatedBlocks ret = getBlockLocationsInternal(src, dir.getFileINode(src),
      offset, length, Integer.MAX_VALUE, doAccessTime);
  return ret;
}
```
dir.getFileINode(src) looks up the INodeFile in the filesystem tree by path name; it holds the INode information of the file being opened.
getBlockLocationsInternal is implemented as follows:
```java
private synchronized LocatedBlocks getBlockLocationsInternal(String src,
    INodeFile inode, long offset, long length, int nrBlocksToReturn,
    boolean doAccessTime) throws IOException {
  // Get the block information of this file
  Block[] blocks = inode.getBlocks();
  List<LocatedBlock> results = new ArrayList<LocatedBlock>(blocks.length);
  // Work out which blocks are covered by [offset, offset + length)
  int curBlk = 0;
  long curPos = 0, blkSize = 0;
  int nrBlocks = (blocks[0].getNumBytes() == 0) ? 0 : blocks.length;
  for (curBlk = 0; curBlk < nrBlocks; curBlk++) {
    blkSize = blocks[curBlk].getNumBytes();
    if (curPos + blkSize > offset) {
      // When offset lies between curPos and curPos + blkSize,
      // curBlk points at the block containing offset
      break;
    }
    curPos += blkSize;
  }
  long endOff = offset + length;
  // Loop over the blocks starting at curBlk until curPos passes endOff
  do {
    int numNodes = blocksMap.numNodes(blocks[curBlk]);
    int numCorruptNodes = countNodes(blocks[curBlk]).corruptReplicas();
    int numCorruptReplicas = corruptReplicas.numCorruptReplicas(blocks[curBlk]);
    boolean blockCorrupt = (numCorruptNodes == numNodes);
    int numMachineSet = blockCorrupt ? numNodes : (numNodes - numCorruptNodes);
    // Find the DataNodes holding this block and put the uncorrupted replicas
    // into machineSet
    DatanodeDescriptor[] machineSet = new DatanodeDescriptor[numMachineSet];
    if (numMachineSet > 0) {
      numNodes = 0;
      for (Iterator<DatanodeDescriptor> it =
               blocksMap.nodeIterator(blocks[curBlk]); it.hasNext();) {
        DatanodeDescriptor dn = it.next();
        boolean replicaCorrupt =
            corruptReplicas.isReplicaCorrupt(blocks[curBlk], dn);
        if (blockCorrupt || (!blockCorrupt && !replicaCorrupt))
          machineSet[numNodes++] = dn;
      }
    }
    // Build a LocatedBlock from this machineSet and the current block
    results.add(new LocatedBlock(blocks[curBlk], machineSet, curPos, blockCorrupt));
    curPos += blocks[curBlk].getNumBytes();
    curBlk++;
  } while (curPos < endOff
           && curBlk < blocks.length
           && results.size() < nrBlocksToReturn);
  // Wrap the list of LocatedBlocks into a LocatedBlocks object and return it
  return inode.createLocatedBlocks(results);
}
```
1.3 The Client
The LocatedBlocks object obtained from the NameNode through this RPC call is used as a member variable to construct a DFSInputStream, which is finally wrapped in an FSDataInputStream and returned to the user.
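From the application's point of view, all of this is hidden behind the public FileSystem API. Below is a minimal usage sketch of opening a file and doing a positioned read; the cluster URI and path are placeholders and error handling is omitted.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder cluster URI and file path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
        try {
            byte[] buf = new byte[4096];
            // positioned read -- the code path analyzed in section 2
            int n = in.read(0L, buf, 0, buf.length);
            System.out.println("read " + n + " bytes");
        } finally {
            in.close();
            fs.close();
        }
    }
}
```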
2. Reading a File
2.1 The Client
When reading the file, the client uses the FSDataInputStream obtained when the file was opened and calls its read(long position, byte[] buffer, int offset, int length) function.
FSDataInputStream delegates to the read(long position, byte[] buffer, int offset, int length) function of the DFSInputStream it wraps, implemented as follows:
```java
public int read(long position, byte[] buffer, int offset, int length)
    throws IOException {
  long filelen = getFileLength();
  int realLen = length;
  if ((position + length) > filelen) {
    realLen = (int) (filelen - position);
  }
  // First get the list of blocks covering [position, position + length).
  // For example, with 64MB blocks, reading 128MB starting at offset 100MB
  // touches blocks 2, 3 and 4 of the file.
  List<LocatedBlock> blockRange = getBlockRange(position, realLen);
  int remaining = realLen;
  // Read the required range from each block in turn.
  // In the example above: 28MB of block 2 starting 36MB into it,
  // all 64MB of block 3, and the first 36MB of block 4 -- 128MB in total.
  for (LocatedBlock blk : blockRange) {
    long targetStart = position - blk.getStartOffset();
    long bytesToRead = Math.min(remaining, blk.getBlockSize() - targetStart);
    fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
                        buffer, offset);
    remaining -= bytesToRead;
    position += bytesToRead;
    offset += bytesToRead;
  }
  assert remaining == 0 : "Wrong number of bytes read.";
  if (stats != null) {
    stats.incrementBytesRead(realLen);
  }
  return realLen;
}
```
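To make the 100MB/128MB example from the comments concrete, here is a tiny stand-alone sketch of the same arithmetic. It assumes 64MB blocks, matching the example; it is an illustration only, not DFSInputStream code.

```java
public class BlockRangeSketch {
    public static void main(String[] args) {
        final long BLOCK = 64L * 1024 * 1024;  // 64 MB blocks, as in the example
        long position  = 100L * 1024 * 1024;   // start reading at 100 MB
        long remaining = 128L * 1024 * 1024;   // read 128 MB in total

        while (remaining > 0) {
            long blockIndex  = position / BLOCK;  // 0-based block number
            long offsetInBlk = position % BLOCK;  // where the read starts inside it
            long bytesToRead = Math.min(remaining, BLOCK - offsetInBlk);
            System.out.printf("block %d: read %d MB starting %d MB into the block%n",
                    blockIndex + 1, bytesToRead / (1024 * 1024),
                    offsetInBlk / (1024 * 1024));
            position  += bytesToRead;
            remaining -= bytesToRead;
        }
        // Prints: block 2 gets 28 MB from 36 MB in, block 3 all 64 MB, block 4 36 MB.
    }
}
```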
The getBlockRange function is as follows:
```java
private synchronized List<LocatedBlock> getBlockRange(long offset, long length)
    throws IOException {
  List<LocatedBlock> blockRange = new ArrayList<LocatedBlock>();
  // First look in the cached locatedBlocks for the position of the block
  // containing offset
  int blockIdx = locatedBlocks.findBlock(offset);
  if (blockIdx < 0) { // block is not cached
    blockIdx = LocatedBlocks.getInsertIndex(blockIdx);
  }
  long remaining = length;
  long curOff = offset;
  while (remaining > 0) {
    LocatedBlock blk = null;
    // Find the block at position blockIdx
    if (blockIdx < locatedBlocks.locatedBlockCount())
      blk = locatedBlocks.get(blockIdx);
    // If the block is not in the cache, fetch it from the NameNode
    // and insert it into the cache
    if (blk == null || curOff < blk.getStartOffset()) {
      LocatedBlocks newBlocks;
      newBlocks = callGetBlockLocations(namenode, src, curOff, remaining);
      locatedBlocks.insertRange(blockIdx, newBlocks.getLocatedBlocks());
      continue;
    }
    // If the block was found, add it to the result set
    blockRange.add(blk);
    long bytesRead = blk.getStartOffset() + blk.getBlockSize() - curOff;
    remaining -= bytesRead;
    curOff += bytesRead;
    // Move on to the next block
    blockIdx++;
  }
  return blockRange;
}
```
fetchBlockByteRange is implemented as follows:
```java
private void fetchBlockByteRange(LocatedBlock block, long start, long end,
                                 byte[] buf, int offset) throws IOException {
  Socket dn = null;
  int numAttempts = block.getLocations().length;
  // This while loop bounds the number of retries after a failed read
  while (dn == null && numAttempts-- > 0) {
    // Choose a DataNode to read from
    DNAddrPair retval = chooseDataNode(block);
    DatanodeInfo chosenNode = retval.info;
    InetSocketAddress targetAddr = retval.addr;
    BlockReader reader = null;
    try {
      // Open a socket connection to the DataNode
      dn = socketFactory.createSocket();
      dn.connect(targetAddr, socketTimeout);
      dn.setSoTimeout(socketTimeout);
      int len = (int) (end - start + 1);
      // Using this socket, create a reader responsible for pulling
      // the data from the DataNode
      reader = BlockReader.newBlockReader(dn, src,
                                          block.getBlock().getBlockId(),
                                          block.getBlock().getGenerationStamp(),
                                          start, len, buffersize,
                                          verifyChecksum, clientName);
      // Read the data
      int nread = reader.readAll(buf, offset, len);
      return;
    } finally {
      IOUtils.closeStream(reader);
      IOUtils.closeSocket(dn);
      dn = null;
    }
    // If the read failed, mark this DataNode as a dead node
    addToDeadNodes(chosenNode);
  }
}
```
BlockReader.newBlockReader is implemented as follows:
```java
public static BlockReader newBlockReader(Socket sock, String file,
    long blockId, long genStamp, long startOffset, long len,
    int bufferSize, boolean verifyChecksum, String clientName)
    throws IOException {
  // Create an output stream over the socket and send the read request
  // to the DataNode
  DataOutputStream out = new DataOutputStream(
      new BufferedOutputStream(
          NetUtils.getOutputStream(sock, HdfsConstants.WRITE_TIMEOUT)));
  out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
  out.write(DataTransferProtocol.OP_READ_BLOCK);
  out.writeLong(blockId);
  out.writeLong(genStamp);
  out.writeLong(startOffset);
  out.writeLong(len);
  Text.writeString(out, clientName);
  out.flush();
  // Create an input stream over the socket, used to read data from the DataNode
  DataInputStream in = new DataInputStream(
      new BufferedInputStream(NetUtils.getInputStream(sock), bufferSize));
  DataChecksum checksum = DataChecksum.newDataChecksum(in);
  long firstChunkOffset = in.readLong();
  // Build a reader that wraps the input stream and reads the data
  return new BlockReader(file, blockId, in, checksum, verifyChecksum,
                         startOffset, firstChunkOffset, sock);
}
```
BlockReader's readAll function then reads the data through the DataInputStream created above.
2.2 DataNode
When a DataNode starts, the startDataNode function is called; the part of it related to serving reads is as follows:
```java
void startDataNode(Configuration conf,
                   AbstractList<File> dataDirs) throws IOException {
  ......
  // Create a ServerSocket and a DataXceiverServer to listen for client connections
  ServerSocket ss = (socketWriteTimeout > 0) ?
      ServerSocketChannel.open().socket() : new ServerSocket();
  Server.bind(ss, socAddr, 0);
  ss.setReceiveBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  // adjust machine name with the actual port
  tmpPort = ss.getLocalPort();
  selfAddr = new InetSocketAddress(ss.getInetAddress().getHostAddress(), tmpPort);
  this.dnRegistration.setName(machineName + ":" + tmpPort);
  this.threadGroup = new ThreadGroup("dataXceiverServer");
  this.dataXceiverServer = new Daemon(threadGroup,
      new DataXceiverServer(ss, conf, this));
  this.threadGroup.setDaemon(true); // auto destroy when empty
  ......
}
```
DataXceiverServer.run() is as follows:
```java
public void run() {
  while (datanode.shouldRun) {
    // Accept a client connection
    Socket s = ss.accept();
    s.setTcpNoDelay(true);
    // Spawn a DataXceiver thread to serve this connection
    new Daemon(datanode.threadGroup,
        new DataXceiver(s, datanode, this)).start();
  }
  try {
    ss.close();
  } catch (IOException ie) {
    LOG.warn(datanode.dnRegistration + ":DataXceiveServer: "
             + StringUtils.stringifyException(ie));
  }
}
```
DataXceiver.run() is as follows:
```java
public void run() {
  DataInputStream in = null;
  try {
    // Create an input stream and read the op code sent by the client
    in = new DataInputStream(
        new BufferedInputStream(NetUtils.getInputStream(s), SMALL_BUFFER_SIZE));
    short version = in.readShort();
    boolean local = s.getInetAddress().equals(s.getLocalAddress());
    byte op = in.readByte();
    // Make sure the xceiver count is not exceeded
    int curXceiverCount = datanode.getXceiverCount();
    long startTime = DataNode.now();
    switch (op) {
    // Read
    case DataTransferProtocol.OP_READ_BLOCK:
      // The actual read
      readBlock(in);
      datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.readsFromLocalClient.inc();
      else
        datanode.myMetrics.readsFromRemoteClient.inc();
      break;
    // Write
    case DataTransferProtocol.OP_WRITE_BLOCK:
      // The actual write
      writeBlock(in);
      datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
      if (local)
        datanode.myMetrics.writesFromLocalClient.inc();
      else
        datanode.myMetrics.writesFromRemoteClient.inc();
      break;
    // Other ops
    ......
    }
  } catch (Throwable t) {
    LOG.error(datanode.dnRegistration + ":DataXceiver", t);
  } finally {
    IOUtils.closeStream(in);
    IOUtils.closeSocket(s);
    dataXceiverServer.childSockets.remove(s);
  }
}
```
readBlock is implemented as follows:
```java
private void readBlock(DataInputStream in) throws IOException {
  // Read the request parameters
  long blockId = in.readLong();
  Block block = new Block(blockId, 0, in.readLong());
  long startOffset = in.readLong();
  long length = in.readLong();
  String clientName = Text.readString(in);
  // Create an output stream used to send data back to the client
  OutputStream baseStream = NetUtils.getOutputStream(s,
      datanode.socketWriteTimeout);
  DataOutputStream out = new DataOutputStream(
      new BufferedOutputStream(baseStream, SMALL_BUFFER_SIZE));
  // Build a BlockSender that reads the local block data and sends it to the
  // client; BlockSender has a member InputStream blockIn used to read the
  // local block
  BlockSender blockSender = new BlockSender(block, startOffset, length,
      true, true, false, datanode, clientTraceFmt);
  out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
  // Send the data to the client
  long read = blockSender.sendBlock(out, baseStream, null);
  ......
  } finally {
    IOUtils.closeStream(out);
    IOUtils.closeStream(blockSender);
  }
}
```
3. Writing a File
The following analyzes the process of uploading a file to HDFS.
3.1 The Client
Uploading a file to HDFS normally goes through DistributedFileSystem.create, implemented as follows:
```java
public FSDataOutputStream create(Path f, FsPermission permission,
    boolean overwrite, int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
  return new FSDataOutputStream(
      dfs.create(getPathName(f), permission, overwrite, replication,
                 blockSize, progress, bufferSize),
      statistics);
}
```
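Again, an application drives this through the public FileSystem API. A minimal usage sketch follows; the cluster URI and output path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder cluster URI and output path
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"));
        try {
            // the bytes are buffered into packets by the machinery analyzed below
            out.write("hello hdfs".getBytes("UTF-8"));
        } finally {
            out.close();  // flushes the remaining packets and completes the file
            fs.close();
        }
    }
}
```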
This ultimately produces an FSDataOutputStream, which is used to write data into the newly created file. Its member variable dfs is of type DFSClient, whose create function is as follows:
```java
public OutputStream create(String src, FsPermission permission,
    boolean overwrite, short replication, long blockSize,
    Progressable progress, int buffersize) throws IOException {
  checkOpen();
  if (permission == null) {
    permission = FsPermission.getDefault();
  }
  FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
  OutputStream result = new DFSOutputStream(src, masked, overwrite, replication,
      blockSize, progress, buffersize,
      conf.getInt("io.bytes.per.checksum", 512));
  leasechecker.put(src, result);
  return result;
}
```
This constructs a DFSOutputStream; in its constructor, the NameNode's create is invoked via RPC to create the file.
The constructor also does one other important thing: it calls streamer.start(), which starts the pipeline used for writing data. We will look at this closely when analyzing how the data is written.
```java
DFSOutputStream(String src, FsPermission masked, boolean overwrite,
    short replication, long blockSize, Progressable progress,
    int buffersize, int bytesPerChecksum) throws IOException {
  this(src, blockSize, progress, bytesPerChecksum);
  computePacketChunkSize(writePacketSize, bytesPerChecksum);
  try {
    namenode.create(src, masked, clientName, overwrite, replication, blockSize);
  } catch (RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   QuotaExceededException.class);
  }
  streamer.start();
}
```
3.2 NameNode
The NameNode's create function calls namesystem.startFile, which in turn calls startFileInternal, implemented as follows:
```java
private synchronized void startFileInternal(String src,
    PermissionStatus permissions, String holder, String clientMachine,
    boolean overwrite, boolean append, short replication, long blockSize)
    throws IOException {
  ......
  // Create a new file in the under-construction state, with no data blocks
  // associated with it yet
  long genstamp = nextGenerationStamp();
  INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
      replication, blockSize, holder, clientMachine, clientNode, genstamp);
  ......
}
```
3.3 The Client
Now the client writes data into the newly created file, normally through FSDataOutputStream's write function, which eventually calls DFSOutputStream's writeChunk function.
By HDFS's design, block data is written using a pipeline: the data is split into packets, and if it needs to be replicated three times, onto DataNode 1, 2 and 3, the following happens (a toy sketch of this schedule follows the list):
- First, packet 1 is written to DataNode 1
- Then DataNode 1 forwards packet 1 to DataNode 2; at the same time the client can already write packet 2 to DataNode 1
- Then DataNode 2 forwards packet 1 to DataNode 3; at the same time the client can write packet 3 to DataNode 1, and DataNode 1 forwards packet 2 to DataNode 2
- Packets are handed down the line one after another in this way until all of the data has been written and replicated
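The toy program below is only an illustration of the schedule described above, not HDFS code; it prints which packet is in flight to which DataNode at each step.

```java
// Toy illustration of the write pipeline described above (not HDFS code).
// With P packets and a pipeline of D DataNodes, at step t DataNode d is
// receiving packet (t - d + 1), so several packets are in flight at once.
public class PipelineSketch {
    public static void main(String[] args) {
        int packets = 4;    // number of packets to write
        int datanodes = 3;  // replication factor / pipeline length
        for (int step = 1; step <= packets + datanodes - 1; step++) {
            StringBuilder line = new StringBuilder("step " + step + ":");
            for (int dn = 1; dn <= datanodes; dn++) {
                int pkt = step - dn + 1;
                if (pkt >= 1 && pkt <= packets) {
                    String from = (dn == 1) ? "client" : ("DataNode " + (dn - 1));
                    line.append("  [").append(from)
                        .append(" -> DataNode ").append(dn)
                        .append(": packet ").append(pkt).append("]");
                }
            }
            System.out.println(line);
        }
    }
}
```

With this picture in mind, writeChunk itself is implemented as follows: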
```java
protected synchronized void writeChunk(byte[] b, int offset, int len,
                                       byte[] checksum) throws IOException {
  // Create a packet and write the data into it
  currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);
  currentPacket.writeChecksum(checksum, 0, cklen);
  currentPacket.writeData(b, offset, len);
  currentPacket.numChunks++;
  bytesCurBlock += len;
  // If the packet is full, put it on the queue, ready to be sent
  if (currentPacket.numChunks == currentPacket.maxChunks ||
      bytesCurBlock == blockSize) {
    ......
    dataQueue.addLast(currentPacket);
    // Wake up the thread waiting on dataQueue, i.e. the DataStreamer
    dataQueue.notifyAll();
    currentPacket = null;
    ......
  }
}
```
The DataStreamer's run function is as follows:
```java
public void run() {
  while (!closed && clientRunning) {
    Packet one = null;
    synchronized (dataQueue) {
      // If there is no packet in the queue, wait
      while ((!closed && !hasError && clientRunning
              && dataQueue.size() == 0) || doSleep) {
        try {
          dataQueue.wait(1000);
        } catch (InterruptedException e) {
        }
        doSleep = false;
      }
      try {
        // Take the first packet from the queue
        one = dataQueue.getFirst();
        long offsetInBlock = one.offsetInBlock;
        // Ask the NameNode to allocate a block and open an output stream to it
        if (blockStream == null) {
          nodes = nextBlockOutputStream(src);
          response = new ResponseProcessor(nodes);
          response.start();
        }
        ByteBuffer buf = one.getBuffer();
        // Move the packet from dataQueue to ackQueue, where it waits for acks
        dataQueue.removeFirst();
        dataQueue.notifyAll();
        synchronized (ackQueue) {
          ackQueue.addLast(one);
          ackQueue.notifyAll();
        }
        // Write the data into the block on the DataNode through the stream
        blockStream.write(buf.array(), buf.position(), buf.remaining());
        if (one.lastPacketInBlock) {
          blockStream.writeInt(0); // marks the end of this block
        }
        blockStream.flush();
      } catch (Throwable e) {
      }
    }
    ......
  }
}
```
An important function here is nextBlockOutputStream, implemented as follows:
```java
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
  LocatedBlock lb = null;
  boolean retry = false;
  DatanodeInfo[] nodes;
  int count = conf.getInt("dfs.client.block.write.retries", 3);
  boolean success;
  do {
    ......
    // Ask the NameNode to allocate DataNodes and a block for the file
    lb = locateFollowingBlock(startTime);
    block = lb.getBlock();
    nodes = lb.getLocations();
    // Open the output stream to the DataNodes
    success = createBlockOutputStream(nodes, clientName, false);
    ......
  } while (retry && --count >= 0);
  return nodes;
}
```
locateFollowingBlock calls the namenode.addBlock(src, clientName) function via RPC.
3.4 NameNode
The NameNode's addBlock function is implemented as follows:
```java
public LocatedBlock addBlock(String src, String clientName) throws IOException {
  LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
  return locatedBlock;
}
```
FSNamesystem's getAdditionalBlock is implemented as follows:
```java
public LocatedBlock getAdditionalBlock(String src, String clientName)
    throws IOException {
  long fileLength, blockSize;
  int replication;
  DatanodeDescriptor clientNode = null;
  Block newBlock = null;
  ......
  // Choose DataNodes for the new block
  DatanodeDescriptor targets[] = replicator.chooseTarget(replication,
      clientNode, null, blockSize);
  ......
  // Get the INodes of every path component; the last one is the INode of the
  // newly added file, whose state is under construction
  INode[] pathINodes = dir.getExistingPathINodes(src);
  int inodesLen = pathINodes.length;
  INodeFileUnderConstruction pendingFile =
      (INodeFileUnderConstruction) pathINodes[inodesLen - 1];
  // Allocate a block for the file and record which DataNodes it will be
  // written to
  newBlock = allocateBlock(src, pathINodes);
  pendingFile.setTargets(targets);
  ......
  return new LocatedBlock(newBlock, targets, fileLength);
}
```
3.5 The Client
Once the DataNodes and the block have been allocated, createBlockOutputStream sets up the stream used to write the data.
```java
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client,
                                        boolean recoveryFlag) {
  // Create a socket and connect to the first DataNode
  InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
  s = socketFactory.createSocket();
  int timeoutValue = 3000 * nodes.length + socketTimeout;
  s.connect(target, timeoutValue);
  s.setSoTimeout(timeoutValue);
  s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
  long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                      datanodeWriteTimeout;
  DataOutputStream out = new DataOutputStream(
      new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                               DataNode.SMALL_BUFFER_SIZE));
  blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
  // Send the write request
  out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
  out.write(DataTransferProtocol.OP_WRITE_BLOCK);
  out.writeLong(block.getBlockId());
  out.writeLong(block.getGenerationStamp());
  out.writeInt(nodes.length);
  out.writeBoolean(recoveryFlag);
  Text.writeString(out, client);
  out.writeBoolean(false);
  out.writeInt(nodes.length - 1);
  // Note that this loop starts at 1, not 0: the information about the
  // DataNodes other than the first one is sent to the first DataNode,
  // which uses it to forward the data to the other two
  for (int i = 1; i < nodes.length; i++) {
    nodes[i].write(out);
  }
  checksum.writeHeader(out);
  out.flush();
  firstBadLink = Text.readString(blockReplyStream);
  if (firstBadLink.length() != 0) {
    throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
  }
  blockStream = out;
}
```
After the write stream has been created in DataStreamer's run function, the client calls blockStream.write to send the data to the DataNode.
3.6 DataNode
In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK op triggers a call to writeBlock:
```java
private void writeBlock(DataInputStream in) throws IOException {
  DatanodeInfo srcDataNode = null;
  // Read the header
  Block block = new Block(in.readLong(),
      dataXceiverServer.estimateBlockSize, in.readLong());
  int pipelineSize = in.readInt();       // num of datanodes in entire pipeline
  boolean isRecovery = in.readBoolean(); // is this part of recovery?
  String client = Text.readString(in);   // working on behalf of this client
  boolean hasSrcDataNode = in.readBoolean(); // is src node info present
  if (hasSrcDataNode) {
    srcDataNode = new DatanodeInfo();
    srcDataNode.readFields(in);
  }
  int numTargets = in.readInt();
  if (numTargets < 0) {
    throw new IOException("Mislabelled incoming datastream.");
  }
  // Read the list of remaining DataNodes: if this is the first DataNode, the
  // list contains the second and third DataNodes; if this is the second, it
  // contains only the third
  DatanodeInfo targets[] = new DatanodeInfo[numTargets];
  for (int i = 0; i < targets.length; i++) {
    DatanodeInfo tmp = new DatanodeInfo();
    tmp.readFields(in);
    targets[i] = tmp;
  }
  DataOutputStream mirrorOut = null;  // stream to next target
  DataInputStream mirrorIn = null;    // reply from next target
  DataOutputStream replyOut = null;   // stream to prev target
  Socket mirrorSock = null;           // socket to next target
  BlockReceiver blockReceiver = null; // responsible for data handling
  String mirrorNode = null;           // the name:port of next target
  String firstBadLink = "";           // first datanode that failed in connection setup
  try {
    // Build a BlockReceiver. Its member DataInputStream in reads data from the
    // client or the previous DataNode; DataOutputStream mirrorOut writes data
    // to the next DataNode; OutputStream out writes the data to local storage.
    blockReceiver = new BlockReceiver(block, in,
        s.getRemoteSocketAddress().toString(),
        s.getLocalSocketAddress().toString(),
        isRecovery, client, srcDataNode, datanode);
    // get a connection back to the previous target
    replyOut = new DataOutputStream(
        NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
    // If this is not the last DataNode, open a socket to the next DataNode
    if (targets.length > 0) {
      InetSocketAddress mirrorTarget = null;
      // Connect to backup machine
      mirrorNode = targets[0].getName();
      mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
      mirrorSock = datanode.newSocket();
      int timeoutValue = numTargets * datanode.socketTimeout;
      int writeTimeout = datanode.socketWriteTimeout +
          (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
      mirrorSock.connect(mirrorTarget, timeoutValue);
      mirrorSock.setSoTimeout(timeoutValue);
      mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
      // Create the stream that writes data to the next DataNode
      mirrorOut = new DataOutputStream(
          new BufferedOutputStream(
              NetUtils.getOutputStream(mirrorSock, writeTimeout),
              SMALL_BUFFER_SIZE));
      mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
      mirrorOut.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
      mirrorOut.write(DataTransferProtocol.OP_WRITE_BLOCK);
      mirrorOut.writeLong(block.getBlockId());
      mirrorOut.writeLong(block.getGenerationStamp());
      mirrorOut.writeInt(pipelineSize);
      mirrorOut.writeBoolean(isRecovery);
      Text.writeString(mirrorOut, client);
      mirrorOut.writeBoolean(hasSrcDataNode);
      if (hasSrcDataNode) { // pass src node information
        srcDataNode.write(mirrorOut);
      }
      mirrorOut.writeInt(targets.length - 1);
      // Here too the loop starts at 1: the DataNodes after the next one are
      // passed along to the next DataNode
      for (int i = 1; i < targets.length; i++) {
        targets[i].write(mirrorOut);
      }
      blockReceiver.writeChecksumHeader(mirrorOut);
      mirrorOut.flush();
    }
    // Receive the block using the BlockReceiver
    String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
    blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr,
                               null, targets.length);
    ......
  } finally {
    // close all opened streams
    IOUtils.closeStream(mirrorOut);
    IOUtils.closeStream(mirrorIn);
    IOUtils.closeStream(replyOut);
    IOUtils.closeSocket(mirrorSock);
    IOUtils.closeStream(blockReceiver);
  }
}
```
An important piece of logic in BlockReceiver's receiveBlock function is the following:
```java
void receiveBlock(
    DataOutputStream mirrOut,  // output to next datanode
    DataInputStream mirrIn,    // input from next datanode
    DataOutputStream replyOut, // output to previous datanode
    String mirrAddr, BlockTransferThrottler throttlerArg,
    int numTargets) throws IOException {
  ......
  // Keep receiving packets until the end of the block
  while (receivePacket() > 0) {}
  if (mirrorOut != null) {
    try {
      mirrorOut.writeInt(0); // mark the end of the block
      mirrorOut.flush();
    } catch (IOException e) {
      handleMirrorOutError(e);
    }
  }
  ......
}
```
BlockReceiver's receivePacket function is as follows:
```java
private int receivePacket() throws IOException {
  // Receive one packet from the client or the previous DataNode
  int payloadLen = readNextPacket();
  buf.mark();
  // read the header
  buf.getInt();                  // packet length
  offsetInBlock = buf.getLong(); // get offset of packet in block
  long seqno = buf.getLong();    // get seqno
  boolean lastPacketInBlock = (buf.get() != 0);
  int endOfHeader = buf.position();
  buf.reset();
  setBlockPosition(offsetInBlock);
  // Forward the packet to the next DataNode
  if (mirrorOut != null) {
    try {
      mirrorOut.write(buf.array(), buf.position(), buf.remaining());
      mirrorOut.flush();
    } catch (IOException e) {
      handleMirrorOutError(e);
    }
  }
  buf.position(endOfHeader);
  int len = buf.getInt();
  offsetInBlock += len;
  int checksumLen = ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
  int checksumOff = buf.position();
  int dataOff = checksumOff + checksumLen;
  byte pktBuf[] = buf.array();
  buf.position(buf.limit()); // move to the end of the data.
  ......
  // Write the data into the local block
  out.write(pktBuf, dataOff, len);
  // flush entire packet before sending ack
  flush();
  // put in queue for pending acks
  if (responder != null) {
    ((PacketResponder) responder.getRunnable()).enqueue(seqno, lastPacketInBlock);
  }
  return payloadLen;
}
```
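Putting the header reads above together, the packet layout this version expects looks roughly like the sketch below. This is only a reading of the buf.getInt()/getLong()/get() calls shown above, not an authoritative protocol specification; the class and method names are invented for the illustration.

```java
import java.nio.ByteBuffer;

// Rough sketch of the packet header fields read by receivePacket() above.
// It only mirrors the reads shown in that code; it is not a protocol spec.
public class PacketHeaderSketch {
    static void parse(ByteBuffer buf, int bytesPerChecksum, int checksumSize) {
        int packetLen       = buf.getInt();     // whole packet length
        long offsetInBlock  = buf.getLong();    // where this data lands in the block
        long seqno          = buf.getLong();    // sequence number, echoed back in acks
        boolean lastInBlock = (buf.get() != 0); // marks the block's final packet
        int dataLen         = buf.getInt();     // length of the data portion
        // checksums for dataLen bytes come next in the buffer, then the data itself
        int checksumLen = ((dataLen + bytesPerChecksum - 1) / bytesPerChecksum)
                          * checksumSize;
        System.out.printf("packetLen=%d offset=%d seqno=%d last=%b data=%d checksums=%d%n",
                packetLen, offsetInBlock, seqno, lastInBlock, dataLen, checksumLen);
    }
}
```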
Source: ITPUB blog, http://blog.itpub.net/26613085/viewspace-1084781/. Please credit the original source when reposting.