The Lookup Store is mainly used in Paimon for lookup compaction and lookup join scenarios. It converts remote columnar files into a local format suited to key-value lookups. There are two implementations:
- Hash: https://github.com/linkedin/PalDB
- Sort: https://github.com/dain/leveldb, https://github.com/apache/paimon/pull/3770
Overall file structure:
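A rough sketch of the layout, reconstructed from the writer code below (every block, data and index alike, is followed by a 5-byte trailer holding its compression type and CRC32 checksum):

    +------------------------------+
    | data block 0 (+ trailer)     |
    | data block 1 (+ trailer)     |
    | ...                          |
    | bloom filter (optional)      |
    | index block (+ trailer)      |
    | footer: bloom filter handle  |
    |         + index block handle |
    +------------------------------+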
Advantages over the hash format:
- One-pass write: no file merging is needed.
- Sequential write that preserves the original key order; if later lookups are issued in key order, cache efficiency improves.
SortLookupStoreWriter
put
@Override
public void put(byte[] key, byte[] value) throws IOException {
    dataBlockWriter.add(key, value);
    if (bloomFilter != null) {
        bloomFilter.addHash(MurmurHashUtils.hashBytes(key));
    }
    lastKey = key;
    // Flush once the block writer reaches the threshold,
    // by default cache-page-size = 64 KB.
    if (dataBlockWriter.memory() > blockSize) {
        flush();
    }
    recordCount++;
}
flush
private void flush() throws IOException {
    if (dataBlockWriter.size() == 0) {
        return;
    }
    // Write the data block to the file and record its position and length.
    BlockHandle blockHandle = writeBlock(dataBlockWriter);
    MemorySlice handleEncoding = writeBlockHandle(blockHandle);
    // Record the BlockHandle in the index writer (itself also a BlockWriter):
    // the index block maps each data block's last key to its handle.
    indexBlockWriter.add(lastKey, handleEncoding.copyBytes());
}
writeBlock
private BlockHandle writeBlock(BlockWriter blockWriter) throws IOException {
    // close the block
    // Get the finished block as one slice; the underlying array of
    // blockWriter is not released here, it is reused for the next block.
    MemorySlice block = blockWriter.finish();
    totalUncompressedSize += block.length();
    // attempt to compress the block
    BlockCompressionType blockCompressionType = BlockCompressionType.NONE;
    if (blockCompressor != null) {
        int maxCompressedSize = blockCompressor.getMaxCompressedSize(block.length());
        byte[] compressed = allocateReuseBytes(maxCompressedSize + 5);
        int offset = encodeInt(compressed, 0, block.length());
        int compressedSize =
                offset
                        + blockCompressor.compress(
                                block.getHeapMemory(),
                                block.offset(),
                                block.length(),
                                compressed,
                                offset);
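        // e.g. for a 65536-byte block, the compressed output (including its
        // length prefix) must be smaller than 65536 - 65536 / 8 = 57344 bytes,
        // i.e. compression must save at least 12.5%, otherwise the
        // uncompressed block is kept.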
        // Don't use the compressed data if compressed less than 12.5%.
        if (compressedSize < block.length() - (block.length() / 8)) {
            block = new MemorySlice(MemorySegment.wrap(compressed), 0, compressedSize);
            blockCompressionType = this.compressionType;
        }
    }
    totalCompressedSize += block.length();
    // create block trailer
    // Every block ends with a trailer recording the compression type
    // and a CRC32 checksum.
    BlockTrailer blockTrailer =
            new BlockTrailer(blockCompressionType, crc32c(block, blockCompressionType));
    MemorySlice trailer = BlockTrailer.writeBlockTrailer(blockTrailer);
    // create a handle to this block
    // The BlockHandle records the starting position and length of the block.
    BlockHandle blockHandle = new BlockHandle(position, block.length());
    // write data
    // Append the block to the data file.
    writeSlice(block);
    // write trailer: 5 bytes
    writeSlice(trailer);
    // clean up state
    blockWriter.reset();
    return blockHandle;
}
close
public LookupStoreFactory.Context close() throws IOException {
    // flush current data block
    flush();
    LOG.info("Number of record: {}", recordCount);
    // write bloom filter
    @Nullable BloomFilterHandle bloomFilterHandle = null;
    if (bloomFilter != null) {
        MemorySegment buffer = bloomFilter.getBuffer();
        bloomFilterHandle =
                new BloomFilterHandle(position, buffer.size(), bloomFilter.expectedEntries());
        writeSlice(MemorySlice.wrap(buffer));
        LOG.info("Bloom filter size: {} bytes", bloomFilter.getBuffer().size());
    }
    // write index block
    // Write the index data out to the file.
    BlockHandle indexBlockHandle = writeBlock(indexBlockWriter);
    // write footer
    // The footer records the bloom filter handle and the index block handle.
    Footer footer = new Footer(bloomFilterHandle, indexBlockHandle);
    MemorySlice footerEncoding = Footer.writeFooter(footer);
    writeSlice(footerEncoding);
    // close file
    fileOutputStream.close();
    LOG.info("totalUncompressedSize: {}", MemorySize.ofBytes(totalUncompressedSize));
    LOG.info("totalCompressedSize: {}", MemorySize.ofBytes(totalCompressedSize));
    return new SortContext(position);
}
BlockWriter
add
public void add(byte[] key, byte[] value) {
    int startPosition = block.size();
    // key length
    block.writeVarLenInt(key.length);
    // key bytes
    block.writeBytes(key);
    // value length
    block.writeVarLenInt(value.length);
    // value bytes
    block.writeBytes(value);
    int endPosition = block.size();
    // Record the start position of each KV pair in an int array,
    // which serves as the in-block index.
    positions.add(startPosition);
    // The block stays "aligned" only as long as every KV pair
    // has the same encoded length.
    if (aligned) {
        int currentSize = endPosition - startPosition;
        if (alignedSize == 0) {
            alignedSize = currentSize;
        } else {
            aligned = alignedSize == currentSize;
        }
    }
}
- The block here is a growable MemorySegment, i.e. a byte[]: when a write would exceed the current array length, the array is expanded. A sketch of the resulting record encoding follows.
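As a minimal, self-contained sketch of this record encoding (plain java.io rather than Paimon's MemorySliceOutput, and assuming a standard 7-bit varint; names here are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Sketch of the record layout produced by BlockWriter#add:
// [varint key length][key bytes][varint value length][value bytes].
public class RecordEncodingSketch {

    // Standard 7-bit varint; assumed to match writeVarLenInt's encoding.
    static void writeVarLenInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static byte[] encode(byte[] key, byte[] value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLenInt(out, key.length);
        out.write(key);
        writeVarLenInt(out, value.length);
        out.write(value);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 1 (varint) + 3 (key) + 1 (varint) + 5 (value) = 10 bytes.
        // Records of identical encoded size keep the block "aligned",
        // so no per-record position index is needed (see finish below).
        byte[] record = encode("abc".getBytes(), "hello".getBytes());
        System.out.println(record.length); // prints 10
    }
}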
finish
public MemorySlice finish() throws IOException {
    if (positions.isEmpty()) {
        throw new IllegalStateException();
    }
    // When all records written through this BlockWriter have the same
    // length, there is no need to store the per-record position index:
    // a single aligned size suffices, and readers compute offsets themselves.
    if (aligned) {
        block.writeInt(alignedSize);
    } else {
        for (int i = 0; i < positions.size(); i++) {
            block.writeInt(positions.get(i));
        }
        block.writeInt(positions.size());
    }
    block.writeByte(aligned ? ALIGNED.toByte() : UNALIGNED.toByte());
    return block.toSlice();
}
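Read together with BlockReader#iterator (shown below), a finished block is therefore laid out as:

    +------------------------------------------+
    | key/value records                        |
    | position index, 4 bytes per record       |  (unaligned blocks only)
    | alignedSize, or the number of positions  |  (4 bytes)
    | ALIGNED / UNALIGNED flag                 |  (1 byte)
    +------------------------------------------+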
Summary
The whole write path is very simple: data is written out block by block, and the position of each block is recorded as the index.
SortLookupStoreReader
The read path mainly answers whether a key exists and, if it does, returns the corresponding value (or the corresponding row position).
public byte[] lookup(byte[] key) throws IOException {
    // Check the bloom filter first to rule out absent keys early.
    if (bloomFilter != null && !bloomFilter.testHash(MurmurHashUtils.hashBytes(key))) {
        return null;
    }
    MemorySlice keySlice = MemorySlice.wrap(key);
    // seek the index to the block containing the key
    indexBlockIterator.seekTo(keySlice);
    // if indexIterator does not have a next, it means the key does not exist in this iterator
    if (indexBlockIterator.hasNext()) {
        // seek the current iterator to the key
        // Use the BlockHandle read from the index block to load the
        // corresponding data block.
        BlockIterator current = getNextBlock();
        // Binary-search the data block for a matching key; if one exists,
        // return the corresponding value.
        if (current.seekTo(keySlice)) {
            return current.next().getValue().copyBytes();
        }
    }
    return null;
}
- A single key lookup therefore performs two binary searches: one over the index block and one inside the data block.
BlockReader
// Create an iterator over a block.
public BlockIterator iterator() {
    BlockAlignedType alignedType =
            BlockAlignedType.fromByte(block.readByte(block.length() - 1));
    int intValue = block.readInt(block.length() - 5);
    if (alignedType == ALIGNED) {
        return new AlignedIterator(block.slice(0, block.length() - 5), intValue, comparator);
    } else {
        int indexLength = intValue * 4;
        int indexOffset = block.length() - 5 - indexLength;
        MemorySlice data = block.slice(0, indexOffset);
        MemorySlice index = block.slice(indexOffset, indexLength);
        return new UnalignedIterator(data, index, comparator);
    }
}
SliceComparator
A keyComparator is passed in here to compare keys during the binary search over the index. The comparison is not performed on the original row data directly; instead, ordering is defined over the serialized MemorySlice.
During a comparison, each key field is deserialized from the MemorySegment and cast to Comparable.
public SliceComparator(RowType rowType) {
    int bitSetInBytes = calculateBitSetInBytes(rowType.getFieldCount());
    this.reader1 = new RowReader(bitSetInBytes);
    this.reader2 = new RowReader(bitSetInBytes);
    this.fieldReaders = new FieldReader[rowType.getFieldCount()];
    for (int i = 0; i < rowType.getFieldCount(); i++) {
        fieldReaders[i] = createFieldReader(rowType.getTypeAt(i));
    }
}

@Override
public int compare(MemorySlice slice1, MemorySlice slice2) {
    reader1.pointTo(slice1.segment(), slice1.offset());
    reader2.pointTo(slice2.segment(), slice2.offset());
    for (int i = 0; i < fieldReaders.length; i++) {
        boolean isNull1 = reader1.isNullAt(i);
        boolean isNull2 = reader2.isNullAt(i);
        if (!isNull1 || !isNull2) {
            if (isNull1) {
                return -1;
            } else if (isNull2) {
                return 1;
            } else {
                FieldReader fieldReader = fieldReaders[i];
                Object o1 = fieldReader.readField(reader1, i);
                Object o2 = fieldReader.readField(reader2, i);
                @SuppressWarnings({"unchecked", "rawtypes"})
                int comp = ((Comparable) o1).compareTo(o2);
                if (comp != 0) {
                    return comp;
                }
            }
        }
    }
    return 0;
}
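Note that a null field sorts lower than any non-null value. The same field-wise logic in a minimal, plain-Java form (a hypothetical Object[]-based stand-in for illustration, not Paimon's API):

import java.util.Comparator;

// Field-wise, null-first comparison mirroring SliceComparator#compare,
// but operating on already-deserialized rows (Object[]) instead of slices.
public class RowComparatorSketch {
    @SuppressWarnings({"unchecked", "rawtypes"})
    static final Comparator<Object[]> FIELD_WISE =
            (row1, row2) -> {
                for (int i = 0; i < row1.length; i++) {
                    Object o1 = row1[i];
                    Object o2 = row2[i];
                    if (o1 == null && o2 == null) {
                        continue; // both null: fields compare equal
                    }
                    if (o1 == null) {
                        return -1; // null sorts lowest
                    }
                    if (o2 == null) {
                        return 1;
                    }
                    int comp = ((Comparable) o1).compareTo(o2);
                    if (comp != 0) {
                        return comp;
                    }
                }
                return 0;
            };

    public static void main(String[] args) {
        // (null, 5) < (1, 2) because a null field compares lowest.
        System.out.println(
                FIELD_WISE.compare(new Object[] {null, 5}, new Object[] {1, 2})); // -1
    }
}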
Because keys are written in sorted order, seeking is implemented as a binary search.
public boolean seekTo(MemorySlice targetKey) {
    int left = 0;
    int right = recordCount - 1;
    while (left <= right) {
        int mid = left + (right - left) / 2;
        // For an aligned iterator, seek directly to record * recordSize;
        // for an unaligned iterator, jump via the position index written
        // by the writer.
        seekTo(mid);
        // Read one key-value pair.
        BlockEntry midEntry = readEntry();
        int compare = comparator.compare(midEntry.getKey(), targetKey);
        if (compare == 0) {
            polled = midEntry;
            return true;
        } else if (compare > 0) {
            polled = midEntry;
            right = mid - 1;
        } else {
            left = mid + 1;
        }
    }
    return false;
}
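Note that seekTo returns true only on an exact match, but every time it probes an entry greater than the target it stashes that entry in polled, so after the search the iterator is positioned at the first entry whose key is greater than or equal to the target. The index lookup in lookup() relies on exactly this: since the index stores each data block's last key, that first greater-or-equal index entry identifies the only block that could contain the key.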
Summary
The lookup path:
- First run the key through the bloom filter to rule out most absent keys.
- Binary-search the index block for the BlockHandle of the data block that may contain the key.
- Using the handle from step 2, read the corresponding data block and search it for the matching key and value.