Abstract: Flink is a big-data processing engine that runs on the JVM.
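Because Flink runs on the JVM, it inherits the JVM's weaknesses: Java objects have low storage density, a full GC is expensive and stops the world (STW), and an OOM error hurts stability. To cut the cost of frequent serialization and deserialization, Flink uses on-heap and off-heap memory in a way that lets it operate directly on binary data in some scenarios. Because of the characteristics of big-data stream processing, Flink also ships its own serialization framework. Finally, Flink takes the CPU's L1/L2/L3 caches and the principle of locality into account and uses cache-friendly data structures. The motivation behind Flink's memory management is very similar to that of Spark's Tungsten.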
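Memory model
Flink can use both on-heap and off-heap memory. The memory model is shown in the figure:
Flink divides the memory it uses into on-heap and off-heap memory. By purpose, it can be divided into task memory, network memory, managed memory, and framework memory. Task, network, and managed memory are counted as slot memory; framework memory is shared across the TaskManager.
On-heap memory covers the memory used by user code, by the heap state backend (HeapStateBackend), and by the framework at run time.
Off-heap memory is memory outside the JVM-managed heap that maps directly to operating-system memory addresses. It covers framework off-heap memory, JVM off-heap memory, direct memory, native memory, and so on.
Direct memory can be used for network transfer buffers. Network memory falls under direct memory; Flink can use it for zero copy, which reduces the number of copies between kernel space and user space and makes I/O more efficient.
JVM metaspace holds the metadata of the classes the JVM loads; the more classes are loaded, the more space is needed. JVM overhead covers the JVM's other costs, such as native memory, the code cache, and thread stacks.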
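Managed memory is mainly used by RocksDBStateBackend and by batch operators, and it also counts as native memory. RocksDBStateBackend is backed by RocksDB, which is built on the LSM-tree data structure. Each state maps to one column family with its own write buffers, so the native memory RocksDB occupies is roughly blockCache + writeBufferNum * writeBuffer + index.
Off-heap memory can also be shared between processes, and because it takes the JVM a long time to set up a very large heap, using off-heap memory avoids that step. Off-heap memory has drawbacks too: monitoring and debugging are more involved, and for short-lived segments on-heap memory is cheaper. In some cases Flink operates directly on binary data and so avoids part of the deserialization cost. If the data to process exceeds the memory limit, part of it is spilled to disk.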
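The RocksDB estimate above can be turned into a small back-of-the-envelope calculation. The following is only a sketch: the class and parameter names are hypothetical, the sizes are made-up inputs, and multiplying the write-buffer term by the number of column families is my reading of "each state has its own write buffers", not a formula taken from RocksDB or Flink.

```java
/** Rough estimate of RocksDB native memory (hypothetical helper, not a Flink or RocksDB API). */
final class RocksDbMemoryEstimate {
    private RocksDbMemoryEstimate() {}

    /**
     * blockCache + writeBufferNum * writeBuffer + index.
     * Assumption: the write-buffer term is counted once per column family (one per state).
     */
    static long estimateBytes(long blockCacheBytes, int columnFamilies,
                              int writeBufferNum, long writeBufferBytes, long indexBytes) {
        long writeBuffers = (long) columnFamilies * writeBufferNum * writeBufferBytes;
        return blockCacheBytes + writeBuffers + indexBytes;
    }
}
```

For example, a 256 MiB block cache, two states with two 64 MiB write buffers each, and 16 MiB of index data add up to 528 MiB under these assumptions.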
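Memory management
Flink borrows the page mechanism of an operating system and manages memory in pages. The corresponding data structures are the data views and MemorySegment. A MemorySegment is the smallest unit of memory allocation in Flink (32 KB by default) and can live on the heap or off the heap; Flink accesses both kinds of memory through it. Together with Flink's serialization mechanism (covered in a later section), MemorySegment provides methods for reading and writing binary data, and Flink reads and writes the binary content of segments through DataInputView and DataOutputView. HeapMemorySegment manages on-heap memory, while HybridMemorySegment manages both on-heap and off-heap memory. When a MemorySegment manages JVM heap memory, it holds a reference to the byte array that backs the segment, and HeapMemorySegment works through that internal reference.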
public abstract class MemorySegment {

    /**
     * The heap byte array object relative to which we access the memory.
     * (Note: for heap memory this references the memory being accessed; for off-heap memory it is null.)
     *
     * <p>Is non-<tt>null</tt> if the memory is on the heap, and is <tt>null</tt>, if the memory is
     * off the heap. If we have this buffer, we must never void this reference, or the memory
     * segment will point to undefined addresses outside the heap and may in out-of-order execution
     * cases cause segmentation faults.
     */
    protected final byte[] heapMemory;

    /**
     * The address to the data, relative to the heap memory byte array. If the heap memory byte
     * array is <tt>null</tt>, this becomes an absolute memory address outside the heap.
     * (Note: for heap memory this is the relative address within the byte array.)
     */
    protected long address;

    // ... (constructors and other members omitted)
}
HeapMemorySegment is used to allocate on-heap memory.
public final class HeapMemorySegment extends MemorySegment {

    /**
     * An extra reference to the heap memory, so we can let byte array checks fail by the built-in
     * checks automatically without extra checks.
     * (Note: this byte-array reference points to the memory of this segment.)
     */
    private byte[] memory;

    // ... (constructors and other members omitted)

    @Override
    public void free() {
        super.free();
        this.memory = null;
    }

    @Override
    public final void get(DataOutput out, int offset, int length) throws IOException {
        out.write(this.memory, offset, length);
    }
}
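HybridMemorySegment supports both on-heap and off-heap memory. Flink accesses the memory through the JVM's Unsafe operations, which take an object o and an address. If o is not null, this is the on-heap case: the address is a relative position, and the operation acts on that position within the object (for example a byte array). If o is null, the memory is outside the JVM heap (the off-heap case): the address is the absolute address of a memory block, and the call operates on that block.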
public final class HybridMemorySegment extends MemorySegment {

    // ... (fields, constructors and other members omitted)

    @Override
    public ByteBuffer wrap(int offset, int length) {
        if (address <= addressLimit) {
            if (heapMemory != null) {
                return ByteBuffer.wrap(heapMemory, offset, length);
            } else {
                try {
                    ByteBuffer wrapper = offHeapBuffer.duplicate();
                    wrapper.limit(offset + length);
                    wrapper.position(offset);
                    return wrapper;
                } catch (IllegalArgumentException e) {
                    throw new IndexOutOfBoundsException();
                }
            }
        } else {
            throw new IllegalStateException("segment has been freed");
        }
    }
}
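To make the "one access API over heap or off-heap memory" idea concrete, here is a minimal sketch built on java.nio.ByteBuffer. It is not Flink's MemorySegment: SketchSegment and all of its methods are hypothetical names, and it uses ByteBuffer's absolute get/put calls where Flink uses Unsafe.

```java
import java.nio.ByteBuffer;

/** Minimal stand-in for a memory segment (hypothetical; not Flink's MemorySegment). */
final class SketchSegment {
    private ByteBuffer buffer; // heap-backed or direct; null once the segment is freed

    private SketchSegment(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    /** Segment backed by a byte array on the JVM heap. */
    static SketchSegment onHeap(int size) {
        return new SketchSegment(ByteBuffer.wrap(new byte[size]));
    }

    /** Segment backed by direct (off-heap) memory. */
    static SketchSegment offHeap(int size) {
        return new SketchSegment(ByteBuffer.allocateDirect(size));
    }

    boolean isOffHeap() {
        return checked().isDirect();
    }

    int size() {
        return checked().capacity();
    }

    /** Writes a long at an absolute offset inside the segment. */
    void putLong(int offset, long value) {
        checked().putLong(offset, value);
    }

    /** Reads a long from an absolute offset inside the segment. */
    long getLong(int offset) {
        return checked().getLong(offset);
    }

    /** After this call every access fails, mirroring the "segment has been freed" check above. */
    void free() {
        buffer = null;
    }

    private ByteBuffer checked() {
        if (buffer == null) {
            throw new IllegalStateException("segment has been freed");
        }
        return buffer;
    }
}
```

Callers read and write through the same methods regardless of where the bytes live; only the factory method decides between heap and off-heap.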
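Flink creates MemorySegments through MemorySegmentFactory; a MemorySegment is the smallest unit of memory allocation in Flink. For data access that spans several MemorySegments, Flink abstracts an access view: DataInputView for reading and DataOutputView for writing.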
/**
 * This interface defines a view over some memory that can be used to sequentially read the contents of the memory.
 * The view is typically backed by one or more {@link org.apache.flink.core.memory.MemorySegment}.
 */
@Public
public interface DataInputView extends DataInput {

    // Note: the interface itself declares no fields. Implementations hold references to the
    // MemorySegments behind the view; that group of segments can be treated as memory pages
    // whose data Flink reads sequentially.

    /**
     * Reads up to {@code len} bytes of memory and stores it into {@code b} starting at offset {@code off}.
     * It returns the number of read bytes or -1 if there is no more data left.
     *
     * @param b byte array to store the data to
     * @param off offset into byte array
     * @param len byte length to read
     * @return the number of actually read bytes or -1 if there is no more data left
     */
    int read(byte[] b, int off, int len) throws IOException;
}
DataOutputView is the view for writing data. An output view holds references to several MemorySegments, and Flink writes to the segments sequentially.
/**
 * This interface defines a view over some memory that can be used to sequentially write contents to the memory.
 * The view is typically backed by one or more {@link org.apache.flink.core.memory.MemorySegment}.
 */
@Public
public interface DataOutputView extends DataOutput {

    // Note: the interface itself declares no fields. Implementations hold the references
    // to the MemorySegments that the view writes to.

    /**
     * Copies {@code numBytes} bytes from the source to this view.
     *
     * @param source The source to copy the bytes from.
     * @param numBytes The number of bytes to copy.
     */
    void write(DataInputView source, int numBytes) throws IOException;
}
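The view idea, sequential reads and writes that silently cross segment boundaries, can be sketched without any Flink classes. In the sketch below plain byte arrays stand in for MemorySegments, and PagedViewSketch with its methods is a hypothetical name; only the read contract (number of bytes read, or -1 when nothing is left) follows the Javadoc quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Writes bytes sequentially across fixed-size pages and reads them back in order (hypothetical sketch). */
final class PagedViewSketch {
    private final int pageSize;
    private final List<byte[]> pages = new ArrayList<>(); // stand-ins for memory segments
    private int writePos; // total number of bytes written so far
    private int readPos;  // total number of bytes read so far

    PagedViewSketch(int pageSize) {
        this.pageSize = pageSize;
    }

    /** Appends all bytes of src, moving on to a fresh page whenever the current one is full. */
    void write(byte[] src) {
        for (byte b : src) {
            if (writePos / pageSize == pages.size()) {
                pages.add(new byte[pageSize]);
            }
            pages.get(writePos / pageSize)[writePos % pageSize] = b;
            writePos++;
        }
    }

    /** Reads up to len bytes into dst starting at off; returns the count, or -1 if no data is left. */
    int read(byte[] dst, int off, int len) {
        if (readPos >= writePos) {
            return -1;
        }
        int n = Math.min(len, writePos - readPos);
        for (int i = 0; i < n; i++, readPos++) {
            dst[off + i] = pages.get(readPos / pageSize)[readPos % pageSize];
        }
        return n;
    }

    int pageCount() {
        return pages.size();
    }
}
```

The caller never sees page boundaries: a 10-byte write into 4-byte pages simply ends up spread over three pages.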
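Flink manages the managed memory described in the memory model section with MemoryManager. Managed memory uses off-heap memory only. In batch processing it is mainly used for sorting, hashing, and caching (according to the community, stream processing will also use it in the future); in stream processing it serves as part of the memory of RocksDBStateBackend. MemoryManager manages MemorySegments through a memory pool.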
/**
 * The memory manager governs the memory that Flink uses for sorting, hashing, caching or off-heap state backends
 * (e.g. RocksDB). Memory is represented either in {@link MemorySegment}s of equal size or in reserved chunks of certain
 * size. Operators allocate the memory either by requesting a number of memory segments or by reserving chunks.
 * Any allocated memory has to be released to be reused later.
 *
 * <p>The memory segments are represented as off-heap unsafe memory regions (both via {@link HybridMemorySegment}).
 * Releasing a memory segment will make it re-claimable by the garbage collector, but does not necessarily immediately
 * releases the underlying memory.
 */
public class MemoryManager {

    /**
     * Allocates a set of memory segments from this memory manager.
     *
     * <p>The total allocated memory will not exceed its size limit, announced in the constructor.
     *
     * @param owner The owner to associate with the memory segment, for the fallback release.
     * @param target The list into which to put the allocated memory pages.
     * @param numberOfPages The number of pages to allocate.
     * @throws MemoryAllocationException Thrown, if this memory manager does not have the requested amount
     *     of memory pages any more.
     */
    public void allocatePages(
            Object owner,
            Collection<MemorySegment> target,
            int numberOfPages) throws MemoryAllocationException {
        // ... (body omitted)
    }

    private static void freeSegment(MemorySegment segment, @Nullable Collection<MemorySegment> segments) {
        segment.free();
        if (segments != null) {
            segments.remove(segment);
        }
    }
}

// The free() method called above is defined in MemorySegment, not in MemoryManager:
public abstract class MemorySegment {

    /**
     * Frees this memory segment.
     *
     * <p>After this operation has been called, no further operations are possible on the memory
     * segment and will fail. The actual memory (heap or off-heap) will only be released after this
     * memory segment object has become garbage collected.
     */
    public void free() {
        // this ensures we can place no more data and trigger
        // the checks for the freed segment
        address = addressLimit + 1;
    }
}
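The allocate-and-release contract described in the Javadoc above (a fixed budget of equal-sized pages, a failure when the budget is exhausted, and reuse after release) can be sketched as follows. This is an illustration under simplifying assumptions, not Flink's MemoryManager: PagePoolSketch is a hypothetical class, it hands out heap byte arrays instead of off-heap segments, and it throws IllegalStateException where Flink throws MemoryAllocationException.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;

/** Fixed budget of equal-sized pages (hypothetical sketch, not Flink's MemoryManager). */
final class PagePoolSketch {
    private final Deque<byte[]> freePages = new ArrayDeque<>();

    PagePoolSketch(int totalPages, int pageSize) {
        for (int i = 0; i < totalPages; i++) {
            freePages.add(new byte[pageSize]);
        }
    }

    /** Moves numberOfPages pages into target, or fails without allocating anything. */
    synchronized void allocatePages(Collection<byte[]> target, int numberOfPages) {
        if (numberOfPages > freePages.size()) {
            throw new IllegalStateException(
                    "requested " + numberOfPages + " pages, only " + freePages.size() + " left");
        }
        for (int i = 0; i < numberOfPages; i++) {
            target.add(freePages.poll());
        }
    }

    /** Returns the pages to the pool so that later requests can reuse them. */
    synchronized void release(Collection<byte[]> pages) {
        for (byte[] page : pages) {
            freePages.push(page);
        }
        pages.clear();
    }

    synchronized int availablePages() {
        return freePages.size();
    }
}
```

Checking the budget before handing out any page keeps a failed request from leaving the caller with a partial allocation.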
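For the network memory mentioned in the memory model section, Flink wraps the memory in a buffer layer (NetworkBuffer). A buffer is also backed by a MemorySegment, and Flink manages buffers through buffer pools. Each TaskManager has one NetworkBufferPool, shared by all tasks on that TaskManager; the memory for each task's LocalBufferPool is requested from the NetworkBufferPool. All of it is off-heap memory allocated by Flink.
When an upstream operator writes data to a ResultPartition, it requests a buffer and writes the data into the MemorySegment with a BufferBuilder. When a downstream operator consumes data from a ResultSubpartition, it reads from the MemorySegment with a BufferConsumer. BufferBuilder and BufferConsumer correspond one to one. This flow is also tied to Flink's back-pressure mechanism, as shown in the figure.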
/**
 * A buffer pool used to manage a number of {@link Buffer} instances from the
 * {@link NetworkBufferPool}.
 *
 * <p>Buffer requests are mediated to the network buffer pool to ensure dead-lock
 * free operation of the network stack by limiting the number of buffers per
 * local buffer pool. It also implements the default mechanism for buffer
 * recycling, which ensures that every buffer is ultimately returned to the
 * network buffer pool.
 *
 * <p>The size of this pool can be dynamically changed at runtime ({@link #setNumBuffers(int)}. It
 * will then lazily return the required number of buffers to the {@link NetworkBufferPool} to
 * match its new size.
 */
class LocalBufferPool implements BufferPool {

    // ... (fields and other members omitted)

    @Nullable
    private MemorySegment requestMemorySegment(int targetChannel) throws IOException {
        MemorySegment segment = null;
        synchronized (availableMemorySegments) {
            returnExcessMemorySegments();

            if (availableMemorySegments.isEmpty()) {
                segment = requestMemorySegmentFromGlobal();
            }
            // segment may have been released by buffer pool owner
            if (segment == null) {
                segment = availableMemorySegments.poll();
            }
            if (segment == null) {
                availabilityHelper.resetUnavailable();
            }

            if (segment != null && targetChannel != UNKNOWN_CHANNEL) {
                if (subpartitionBuffersCount[targetChannel]++ == maxBuffersPerChannel) {
                    unavailableSubpartitionsCount++;
                    availabilityHelper.resetUnavailable();
                }
            }
        }
        return segment;
    }
}

/**
 * A result partition for data produced by a single task.
 *
 * <p>This class is the runtime part of a logical {@link IntermediateResultPartition}. Essentially,
 * a result partition is a collection of {@link Buffer} instances. The buffers are organized in one
 * or more {@link ResultSubpartition} instances, which further partition the data depending on the
 * number of consuming tasks and the data {@link DistributionPattern}.
 *
 * <p>Tasks, which consume a result partition have to request one of its subpartitions. The request
 * happens either remotely (see {@link RemoteInputChannel}) or locally (see {@link LocalInputChannel})
 *
 * The life-cycle of each result partition has three (possibly overlapping) phases:
 *   - Produce
 *   - Consume
 *   - Release
 *
 * Buffer management
 *
 * State management
 */
public abstract class ResultPartition implements ResultPartitionWriter, BufferPoolOwner {

    // ... (fields and other members omitted)

    @Override
    public BufferBuilder getBufferBuilder(int targetChannel) throws IOException, InterruptedException {
        checkInProduceState();
        return bufferPool.requestBufferBuilderBlocking(targetChannel);
    }
}
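The relationship between the shared pool and the per-task pools, and why an exhausted pool slows a producer down, can be illustrated with a small sketch. It is not Flink's NetworkBufferPool or LocalBufferPool: the class names are hypothetical, buffers are plain byte arrays, and a request returns null instead of blocking, so the caller can see the moment at which back pressure would set in.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Two-level buffer pool (hypothetical sketch of the global/local pool relationship). */
final class BufferPoolSketch {
    private BufferPoolSketch() {}

    /** Shared by all tasks of one process; owns a fixed number of buffers. */
    static final class GlobalPool {
        private final Deque<byte[]> free = new ArrayDeque<>();

        GlobalPool(int buffers, int bufferSize) {
            for (int i = 0; i < buffers; i++) {
                free.add(new byte[bufferSize]);
            }
        }

        /** Returns a buffer, or null if the global pool is exhausted. */
        synchronized byte[] request() {
            return free.poll();
        }

        synchronized void recycle(byte[] buffer) {
            free.add(buffer);
        }
    }

    /** Per-task pool that borrows from the global pool, capped at maxBuffers. */
    static final class LocalPool {
        private final GlobalPool global;
        private final int maxBuffers;
        private int inUse;

        LocalPool(GlobalPool global, int maxBuffers) {
            this.global = global;
            this.maxBuffers = maxBuffers;
        }

        /** Returns a buffer, or null when the producer has to wait (back pressure). */
        synchronized byte[] request() {
            if (inUse == maxBuffers) {
                return null; // local cap reached
            }
            byte[] buffer = global.request();
            if (buffer != null) {
                inUse++;
            }
            return buffer;
        }

        /** Hands a consumed buffer back so that any local pool can use it again. */
        synchronized void recycle(byte[] buffer) {
            inUse--;
            global.recycle(buffer);
        }
    }
}
```

The local cap keeps one task from draining the shared pool, and a null result is the point at which a real producer would block until a consumer recycles a buffer.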
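Custom serialization framework
Flink implements a custom serialization mechanism for the basic data types it supports. Because the objects in a Flink data set have a relatively fixed shape, a single copy of the schema information is enough, which saves storage space. Serialization is the conversion between Java objects and binary data. Flink uses the createSerializer method of TypeInformation to create the serializer for each type, which then serializes and deserializes the data. The type information is extracted by TypeExtractor from method signatures, class information, and the like when the StreamTransformation is built, and is stored in StreamConfig.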
// In TypeInformation<T>:

/**
 * Creates a serializer for the type. The serializer may use the ExecutionConfig
 * for parameterization.
 * (Note: this creates the serializer for the corresponding type.)
 *
 * @param config The config used to parameterize the serializer.
 * @return A serializer for this type.
 */
@PublicEvolving
public abstract TypeSerializer<T> createSerializer(ExecutionConfig config);

/**
 * A utility for reflection analysis on classes, to determine the return type of implementations of transformation
 * functions.
 */
@Public
public class TypeExtractor {

    /**
     * Creates a {@link TypeInformation} from the given parameters.
     *
     * If the given {@code instance} implements {@link ResultTypeQueryable}, its information
     * is used to determine the type information. Otherwise, the type information is derived
     * based on the given class information.
     *
     * @param instance instance to determine type information for
     * @param baseClass base class of {@code instance}
     * @param clazz class of {@code instance}
     * @param returnParamPos index of the return type in the type arguments of {@code clazz}
     * @param <OUT> output type
     * @return type information
     */
    @SuppressWarnings("unchecked")
    @PublicEvolving
    public static <OUT> TypeInformation<OUT> createTypeInfo(
            Object instance, Class<?> baseClass, Class<?> clazz, int returnParamPos) {
        if (instance instanceof ResultTypeQueryable) {
            return ((ResultTypeQueryable<OUT>) instance).getProducedType();
        } else {
            return createTypeInfo(baseClass, clazz, returnParamPos, null, null);
        }
    }
}
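For nested data types, Flink starts serializing from the innermost fields; the inner results make up the outer result. During deserialization, the binary data is read sequentially from memory and turned back into Java objects according to the offsets. Flink's built-in serialization has a high storage density because only the values of each type are serialized.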
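The composition of nested serializers can be sketched with the JDK's DataOutputStream and DataInputStream. This is not Flink's TypeSerializer API: Person, Address, and the helper methods are hypothetical, and the point is only that the outer type delegates to the inner one and that no schema or field names are written, just the values in a fixed order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Value-only serialization of a nested type (hypothetical sketch, not Flink's TypeSerializer). */
final class NestedSerializationSketch {
    private NestedSerializationSketch() {}

    static final class Address {
        final String city;
        final int zip;
        Address(String city, int zip) { this.city = city; this.zip = zip; }
    }

    static final class Person {
        final int id;
        final Address address;
        Person(int id, Address address) { this.id = id; this.address = address; }
    }

    /** Inner serializer: writes only the field values of the nested type. */
    static void writeAddress(Address a, DataOutputStream out) throws IOException {
        out.writeUTF(a.city);
        out.writeInt(a.zip);
    }

    static Address readAddress(DataInputStream in) throws IOException {
        String city = in.readUTF();
        int zip = in.readInt();
        return new Address(city, zip);
    }

    /** Outer serializer: its output embeds the bytes produced by the inner serializer. */
    static byte[] serialize(Person p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.id);
            writeAddress(p.address, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back sequentially, in the order in which they were written. */
    static Person deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int id = in.readInt();
            return new Person(id, readAddress(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the reader knows the layout from the type, the bytes carry no class names or field names, which is where the higher storage density comes from.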
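Flink's table module builds the BinaryRow data structure on top of MemorySegment to cut deserialization cost further: when deserialization is needed, only the required fields are deserialized rather than the whole object.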
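Field-level access on binary data can be illustrated with a fixed-width row. The layout below (one 8-byte slot per field, nothing else) is a deliberate simplification and an assumption of this sketch, not BinaryRow's actual format; FixedWidthRowSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

/** Row with one 8-byte slot per field (hypothetical, simplified layout; not Flink's BinaryRow). */
final class FixedWidthRowSketch {
    private static final int SLOT_BYTES = 8;

    private final ByteBuffer data;
    private final int fieldCount;

    FixedWidthRowSketch(int fieldCount) {
        this.fieldCount = fieldCount;
        this.data = ByteBuffer.allocate(fieldCount * SLOT_BYTES);
    }

    void setLong(int field, long value) {
        data.putLong(offset(field), value);
    }

    /** Decodes a single field straight from its slot; the other fields are not touched. */
    long getLong(int field) {
        return data.getLong(offset(field));
    }

    void setDouble(int field, double value) {
        data.putDouble(offset(field), value);
    }

    double getDouble(int field) {
        return data.getDouble(offset(field));
    }

    /** The offset of a field is computed from its index, so no scan over earlier fields is needed. */
    private int offset(int field) {
        if (field < 0 || field >= fieldCount) {
            throw new IndexOutOfBoundsException("field " + field);
        }
        return field * SLOT_BYTES;
    }
}
```

Because every field sits at a computable offset, reading field 2 costs one 8-byte read regardless of how many fields the row has.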
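You can also register subtypes and custom serializers. Types that Flink cannot serialize itself are handed to Kryo, and where Kryo is not suitable either, Avro serialization can be forced instead. Kryo is slower than Flink's built-in serialization, so during development you can call env.getConfig().disableGenericTypes() to disable Kryo and stick to data types covered by Flink's own serializers.
Cache-friendly data structures
Reading from the CPU's L1, L2, and L3 caches is much faster than reading from main memory. Another important program property is the principle of locality: programs tend to reuse the data and instructions they used recently. There are two kinds of locality. Temporal locality means that recently accessed content is likely to be accessed again soon; spatial locality means that items at nearby addresses are likely to be accessed within a short time.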
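Designing cache-friendly data structures around these two properties raises the cache hit rate and improves locality. Flink applies this mainly to sorting. Normally a pointer refers to a <key, value> object, so sorting has to follow the pointer to the actual data before it can compare. That means random memory access and poor cache locality. Flink instead stores a serialized fixed-length key together with the pointer, so the keys lie contiguously in memory; this avoids random access and raises the CPU cache hit rate. To order two records, the keys are compared first. If they differ, the result is known immediately, and only the pointers are swapped, not the actual data. If they are equal, the data the pointers refer to is compared.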
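The "fixed-length key + pointer" layout can be sketched with a long[] index in which each entry packs a 4-byte key prefix and a record index. This is an illustration of the idea only, not Flink's sorter: the class name, the 4-byte prefix length, the packing into a long, and the insertion sort are all assumptions chosen to keep the example short.

```java
import java.nio.charset.StandardCharsets;

/** Sorts records through a contiguous array of (key prefix, pointer) entries (hypothetical sketch). */
final class NormalizedKeySortSketch {
    private NormalizedKeySortSketch() {}

    /** Packs the first 4 key bytes (zero padded) into the high half, the record index into the low half. */
    static long entry(byte[] key, int pointer) {
        long prefix = 0;
        for (int i = 0; i < 4; i++) {
            prefix = (prefix << 8) | (i < key.length ? (key[i] & 0xFF) : 0);
        }
        return (prefix << 32) | (pointer & 0xFFFFFFFFL);
    }

    /** Compares prefixes first; only on a tie are the pointers followed to the full records. */
    static int compare(long a, long b, byte[][] records) {
        int byPrefix = Long.compare(a >>> 32, b >>> 32);
        if (byPrefix != 0) {
            return byPrefix; // decided without touching the record data
        }
        return compareBytes(records[(int) a], records[(int) b]);
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.length, y.length);
    }

    /** Builds the index and sorts it; only 8-byte entries move, the records stay where they are. */
    static long[] sortedIndex(byte[][] records) {
        long[] index = new long[records.length];
        for (int i = 0; i < records.length; i++) {
            index[i] = entry(records[i], i);
        }
        for (int i = 1; i < index.length; i++) { // insertion sort keeps the sketch short
            long current = index[i];
            int j = i - 1;
            while (j >= 0 && compare(index[j], current, records) > 0) {
                index[j + 1] = index[j];
                j--;
            }
            index[j + 1] = current;
        }
        return index;
    }

    /** Convenience helper that turns ASCII strings into records. */
    static byte[][] ascii(String... values) {
        byte[][] out = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i].getBytes(StandardCharsets.US_ASCII);
        }
        return out;
    }
}
```

After sorting, casting an entry to int yields the index of the record it points to. In this sketch the whole record serves as the key; records that share the first four bytes, such as "apple" and "apply", are the only ones for which the comparison has to leave the index array.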
Postscript
The Flink community is moving toward unified stream and batch processing, and future work will focus more on the unified engine and on its integration with the storage layer.