RocksDB: LRUCache Source Code Analysis

Posted by weixin_34120274 on 2017-04-13

Block Cache

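RocksDB uses the block cache as its read cache. Users can choose LRUCache as the block cache implementation and specify its size.
There are two block caches: one caches uncompressed block data, the other caches compressed block data.
When serving a read, RocksDB looks in the uncompressed block cache first, then in the compressed block cache. The compressed block cache can act as a replacement for the operating system's page cache.
Usage: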

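std::shared_ptr<Cache> cache = NewLRUCache(capacity);
BlockBasedTableOptions table_options;
table_options.block_cache = cache;
// Optional compressed block cache. Set it before creating the table factory,
// because BlockBasedTableFactory takes a copy of table_options.
table_options.block_cache_compressed = another_cache;
Options options;
options.table_factory.reset(new BlockBasedTableFactory(table_options));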

LRUCache

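By default, RocksDB creates an LRUCache with a capacity of 8 MB.
Internally, LRUCache partitions its capacity into shards (referred to below, out of habit, as buckets). Each shard maintains its own LRU list and its own hash table for lookups.
Users can specify the following parameters: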

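capacity: the total capacity of the cache.
num_shard_bits: the number of shards is 2^num_shard_bits. If it is not specified, a shard count is computed automatically, with a maximum of 64 shards.

How the default shard count is computed:
capacity / min_shard_size, where min_shard_size is the minimum size of each shard, 512 KB.

strict_capacity_limit: in some situations the block cache can grow beyond the configured capacity, for example when every block in the cache is pinned by an in-flight read or an Iterator and cannot be evicted, and the pinned blocks together exceed the capacity. In that case, if strict_capacity_limit is false, later reads can still insert data into the cache, which may cause the application to run out of memory (OOM). If it is true, such inserts fail instead.

Note that this limit is enforced per shard, not on the cache's total usage.

high_pri_pool_ratio: LRUCache supports priorities. This option sets the fraction of each shard that high-priority blocks may occupy.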
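
The default shard-count rule described under num_shard_bits can be sketched as follows. This is a minimal standalone sketch that assumes only what is stated above (a 512 KB minimum shard size and a cap of 64 shards); the name DefaultShardBits is made up for illustration, and the actual RocksDB helper may differ between versions.

```cpp
#include <cstddef>

// Hypothetical helper: derive num_shard_bits from the capacity, assuming each
// shard should hold at least 512 KB and there are at most 2^6 = 64 shards.
int DefaultShardBits(std::size_t capacity) {
  const std::size_t kMinShardSize = 512 * 1024;
  std::size_t num_shards = capacity / kMinShardSize;
  int bits = 0;
  // Each halving that leaves a non-zero value adds one bit, so the loop
  // computes floor(log2(num_shards)), stopping early at 6 bits.
  while (num_shards >>= 1) {
    if (++bits >= 6) {
      return bits;  // cap at 64 shards
    }
  }
  return bits;
}
```

Under these assumptions an 8 MB capacity yields 4 shard bits (16 shards of 512 KB each), and any capacity of 32 MB or more reaches the maximum of 64 shards.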

Class diagram

[Figure: LRUCache class diagram]

The class diagram shows the main members and some of the operations of each class.

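Cache
Defines the cache interface, including operations such as Insert, Lookup and Release.
ShardedCache
Adds sharding to Cache. There are 2^num_shard_bits shards, each with the same capacity.
The shard is chosen from the top num_shard_bits bits of the key's hash value:

(num_shard_bits_ > 0) ? (hash >> (32 - num_shard_bits_)) : 0;

LRUCache
Maintains an array of shards. Each shard (bucket) is an LRUCacheShard that caches the key-value pairs routed to it.
CacheShard
Defines the interface of a single shard, including Insert, Lookup and Release. After a Cache call has been routed to a shard, it invokes the corresponding operation on that shard.
LRUCacheShard
Maintains an LRU list and a hash table to implement the LRU policy. The elements of both are of type LRUHandle.
LRUHandleTable
The hash table implementation. It groups entries again by key hash into buckets and tries to keep each bucket down to a single element. Elements are of type LRUHandle. It provides Lookup, Insert and Remove.
LRUHandle
The unit that stores a key and its value. It contains prev and next pointers, so handles can be linked into a circular doubly linked list that serves as the LRU list.
It also stores a reference count and a flag indicating whether the handle is in the cache. The details are as follows: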

LRUHandle can be in these states:

1. Referenced externally AND in hash table. In that case the entry is not in the LRU. (refs > 1 && in_cache == true)
2. Not referenced externally and in hash table. In that case the entry is in the LRU and can be freed. (refs == 1 && in_cache == true)
3. Referenced externally and not in hash table. In that case the entry is not on the LRU and not in the table. (refs >= 1 && in_cache == false)

All newly created LRUHandles are in state 1. If you call LRUCacheShard::Release on an entry in state 1, it will go into state 2.
To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key.
To move from state 2 to state 1, use LRUCacheShard::Lookup.
Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert must have a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (for state 3).
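
The three states can be read directly off the refs and in_cache fields. The sketch below only restates the conditions from the comment above; the enum and the function Classify are hypothetical names for illustration and do not exist in RocksDB.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.
enum class HandleState {
  kInUse,     // state 1: referenced externally and in the hash table
  kOnLRU,     // state 2: only the cache's own reference is left; evictable
  kDetached,  // state 3: referenced externally, no longer in the hash table
  kInvalid    // refs == 0: the handle should already have been freed
};

// Map (refs, in_cache) to one of the states listed in the comment above.
HandleState Classify(uint32_t refs, bool in_cache) {
  if (in_cache) {
    if (refs > 1) return HandleState::kInUse;
    if (refs == 1) return HandleState::kOnLRU;
    return HandleState::kInvalid;
  }
  return refs >= 1 ? HandleState::kDetached : HandleState::kInvalid;
}
```

Reading the conditions this way, the cache itself accounts for one reference while in_cache is true, which is why refs == 1 means nobody outside the cache holds the handle.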

The structure of LRUCache is as follows:

[Figure: LRUCache structure]

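Implementation of Insert

This section walks through the LRUCache Insert implementation layer by layer, from top to bottom.
In RocksDB, the cache size is specified through an option that is passed to the DB when it is opened.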

  auto cache = NewLRUCache(1 * 1024 * 1024 * 1024);
  BlockBasedTableOptions bbt_opts;
  bbt_opts.block_cache = cache;

An example of a call to Insert:

// Get the block cache pointer from the table options
Cache* block_cache = rep->table_options.block_cache.get();
// ...
  if (block_cache != nullptr && block->value->cachable()) {
    s = block_cache->Insert(
        block_cache_key,
        block->value,
        block->value->usable_size(),
        &DeleteCachedEntry<Block>,
        &(block->cache_handle),
        priority);
...

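LRUCache's Insert is implemented in its base class, ShardedCache. It first hashes the key (the hash function is similar to MurmurHash) and then uses the hash value to decide which shard the entry goes into. Each shard can be thought of as an independent LRU cache; its type is LRUCacheShard.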

// util/sharded_cache.cc
Status ShardedCache::Insert(const Slice& key, void* value, size_t charge,
                            void (*deleter)(const Slice& key, void* value),
                            Handle** handle, Priority priority) {
  uint32_t hash = HashSlice(key);
  return GetShard(Shard(hash))
      ->Insert(key, hash, value, charge, deleter, handle, priority);
}
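
The Shard(hash) step above uses the expression shown earlier for ShardedCache. As a minimal sketch, here is that expression as a free function; the name ShardIndex is hypothetical, and num_shard_bits is passed as a parameter instead of being read from a member.

```cpp
#include <cstdint>

// Pick a shard index from the top num_shard_bits bits of a 32-bit hash.
// The explicit zero check avoids shifting a 32-bit value by 32 bits,
// which is undefined behaviour in C++.
uint32_t ShardIndex(uint32_t hash, int num_shard_bits) {
  return (num_shard_bits > 0) ? (hash >> (32 - num_shard_bits)) : 0;
}
```

With 4 shard bits, for example, a hash of 0xFFFFFFFF maps to shard 15 and a hash of 0x0FFFFFFF maps to shard 0. Sharding on the high bits while the per-shard hash table (shown later) indexes on the low bits, hash & (length_ - 1), means the two levels draw on different parts of the same hash.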

Once the shard has been determined from the key, the Insert method of the corresponding LRUCacheShard is called.

Status LRUCacheShard::Insert(const Slice& key, uint32_t hash, void* value,
                             size_t charge,
                             void (*deleter)(const Slice& key, void* value),
                             Cache::Handle** handle, Cache::Priority priority);

LRUCacheShard has three main member variables, and Insert mostly operates on these three.

class LRUCacheShard : public CacheShard {
...
private:
  LRUHandle lru_;  // head node of the linked list
  LRUHandle* lru_low_pri_;  // points to the head of the low-priority part
  LRUHandleTable table_;  // a custom hash table implementation
};

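Underneath, an LRU cache relies on a linked list and a hash table. The linked list maintains the eviction order and the high/low priority split, and the hash table provides fast lookups. The elements of both are of type LRUHandle*. Their roles are covered in detail later; first, the Insert implementation, which breaks down into the following steps:

Construct an LRUHandle from the key, value and other arguments.
Take the shard-level lock and start modifying the LRUCacheShard.
Evict entries according to the LRU policy and the current space usage.
Decide whether to insert, based on capacity.

If, after eviction, the usage would still exceed the cache capacity, and either strict_capacity_limit is set or the caller passed a null handle, the entry is not inserted. One subtle point: the handle parameter is an output parameter that, on a successful insert, is set to point to the LRUHandle in the cache. If the caller passed a null handle, it does not need a pointer to the entry back, so no error is reported even though nothing was inserted (it is as if the entry had been inserted and evicted immediately). Otherwise, *handle is set to null and an error is returned.
If enough space was freed, or the capacity does not need to be strictly enforced and the caller passed a handle, the insert proceeds. The entry is inserted into the hash table first; if an entry for the same key already exists, the old entry is swapped out and released.

After the insert, if handle is null, nobody holds a reference to the new entry, so it is put on the LRU list to await eviction. Otherwise the entry is not put on the LRU list yet; it is assigned to *handle, and it goes onto the LRU list once the caller releases its reference.

Free the evicted entries and any entry that failed to insert.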

Status LRUCacheShard::Insert(const Slice& key, uint32_t hash, void* value,
                             size_t charge,
                             void (*deleter)(const Slice& key, void* value),
                             Cache::Handle** handle, Cache::Priority priority) {
  LRUHandle* e = reinterpret_cast<LRUHandle*>(
      new char[sizeof(LRUHandle) - 1 + key.size()]);
  Status s;
  ... // fill in the fields of e
  autovector<LRUHandle*> last_reference_list;  // holds the evicted entries
  {
    MutexLock l(&mutex_);
    // Evict entries from the LRU list
    EvictFromLRU(charge, &last_reference_list);

    if (usage_ - lru_usage_ + charge > capacity_ &&
        (strict_capacity_limit_ || handle == nullptr)) {
      if (handle == nullptr) {
        // Don't insert the entry but still return ok, as if the entry inserted
        // into cache and get evicted immediately.
        last_reference_list.push_back(e);
      } else {
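        // Not sure why e is deleted right away here instead of being added to
        // last_reference_list and freed later, as above; possibly because
        // Free() would also run the deleter on value, which the caller still
        // owns when Insert fails.
        delete[] reinterpret_cast<char*>(e);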
        *handle = nullptr;
        s = Status::Incomplete("Insert failed due to LRU cache being full.");
      }
    } else {
      LRUHandle* old = table_.Insert(e);
      usage_ += e->charge;
      if (old != nullptr) {
        old->SetInCache(false);
        if (Unref(old)) {
          usage_ -= old->charge;
          // old is on LRU because it's in cache and its reference count
          // was just 1 (Unref returned 0)
          LRU_Remove(old);
          last_reference_list.push_back(old);
        }
      }
      if (handle == nullptr) {
        LRU_Insert(e);
      } else {
        // When the caller later calls Release, LRU_Insert is called and the
        // entry is put on the LRU list
        *handle = reinterpret_cast<Cache::Handle*>(e);
      }
      s = Status::OK();
    }
  }
  // Free the evicted entries
  for (auto entry : last_reference_list) {
    entry->Free();
  }

  return s;
}

The flow above contains two insertions: one into the hash table and one into the linked list.
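
To see the evict-then-decide logic in isolation, here is a deliberately simplified sketch. It is not RocksDB code: ToyShard and all of its members are hypothetical, a pin flag stands in for a non-null handle, reference counting is reduced to pinned/unpinned, there is no locking, and keys are assumed to be distinct. Only the two conditions (the eviction loop and the capacity check) mirror the Insert code above.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Toy single-shard cache. A "pinned" entry models one that is referenced
// externally and is therefore not on the LRU list.
class ToyShard {
 public:
  ToyShard(std::size_t capacity, bool strict_capacity_limit)
      : capacity_(capacity), strict_(strict_capacity_limit) {}

  // Returns false only when a pinned insert is rejected by the strict limit.
  bool Insert(const std::string& key, std::size_t charge, bool pin) {
    EvictFromLRU(charge);
    if (usage_ - lru_usage_ + charge > capacity_ && (strict_ || !pin)) {
      // Unpinned: report success, as if inserted and evicted immediately.
      // Pinned under a strict limit: report failure.
      return !pin;
    }
    assert(table_.count(key) == 0);  // toy assumption: keys are distinct
    table_.emplace(key, Entry{charge, pin});
    usage_ += charge;
    if (!pin) {
      lru_.push_back(key);
      lru_usage_ += charge;
    }
    return true;
  }

  // Drop the external reference: the entry becomes evictable.
  void Release(const std::string& key) {
    auto it = table_.find(key);
    if (it == table_.end() || !it->second.pinned) return;
    it->second.pinned = false;
    lru_.push_back(key);
    lru_usage_ += it->second.charge;
  }

  bool Contains(const std::string& key) const { return table_.count(key) > 0; }
  std::size_t usage() const { return usage_; }

 private:
  struct Entry {
    std::size_t charge;
    bool pinned;
  };

  // Evict oldest-first until the new charge fits or nothing is evictable.
  void EvictFromLRU(std::size_t charge) {
    while (usage_ + charge > capacity_ && !lru_.empty()) {
      auto it = table_.find(lru_.front());
      usage_ -= it->second.charge;
      lru_usage_ -= it->second.charge;
      table_.erase(it);
      lru_.pop_front();
    }
  }

  std::size_t capacity_;
  bool strict_;
  std::size_t usage_ = 0;      // charge of all entries
  std::size_t lru_usage_ = 0;  // charge of entries on the LRU list
  std::list<std::string> lru_;  // front is the oldest entry
  std::unordered_map<std::string, Entry> table_;
};
```

With a capacity of 100 and the strict limit on, inserting a pinned entry of charge 60 succeeds, a second pinned entry of charge 60 is rejected, and an unpinned one is silently dropped while still reporting success. With the strict limit off, the second pinned insert goes through and usage rises to 120, which is the over-capacity situation described under strict_capacity_limit.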

LRUHandleTable

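Internally, LRUHandleTable assigns LRUHandle elements to buckets by their hash value.
Each bucket is a linked list.
New elements are inserted at the tail of the list.
If the number of elements exceeds the number of buckets, the bucket count is doubled, so that the length of each bucket's list stays at or below 1 as far as possible.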

class LRUHandleTable {
 public:
  ...
  LRUHandle* Insert(LRUHandle* h);
  ...
 private:
  ...
  uint32_t length_;  // number of buckets, 16 by default
  uint32_t elems_;  // total number of elements
  LRUHandle** list_;  // array whose elements are the head pointers of the bucket lists
};

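Implementation of the hash table's Insert:

Use the hash value to find the bucket, then walk the bucket's list looking for the node that matches both the key and the hash value.
If a matching node already exists, replace the old node with the new one and return the old node.
If no matching node exists, append the new node to the end of the list and check whether the hash table needs to grow.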

LRUHandle* LRUHandleTable::Insert(LRUHandle* h) {
  LRUHandle** ptr = FindPointer(h->key(), h->hash);
  LRUHandle* old = *ptr;
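  // If the key already exists, replace the old element
  h->next_hash = (old == nullptr ? nullptr : old->next_hash);
  *ptr = h;
  // If the key did not exist, compare the element count with the bucket count to decide whether to resize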
  if (old == nullptr) {
    ++elems_;
    if (elems_ > length_) {
      Resize();
    }
  }
  return old;
}

LRUHandle** LRUHandleTable::FindPointer(const Slice& key, uint32_t hash) {
  LRUHandle** ptr = &list_[hash & (length_ - 1)];  // pick the bucket from the hash value
  // Walk the list comparing hash and key, until a match or the end of the list
  while (*ptr != nullptr && ((*ptr)->hash != hash || key != (*ptr)->key())) {
    ptr = &(*ptr)->next_hash;
  }
  return ptr;
}
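
FindPointer returns a pointer to the slot that points at the node, not the node itself, which is why Insert can replace or append with a single assignment through *ptr. The standalone sketch below shows the same technique on a toy table; Node, ToyTable and the fixed bucket count of 4 are hypothetical, and resizing is left out.

```cpp
#include <cstdint>
#include <string>
#include <utility>

struct Node {
  std::string key;
  uint32_t hash;
  Node* next_hash = nullptr;
  Node(std::string k, uint32_t h) : key(std::move(k)), hash(h) {}
};

// Fixed-size chained hash table. kBuckets must be a power of two so that
// hash & (kBuckets - 1) selects a bucket.
struct ToyTable {
  static constexpr uint32_t kBuckets = 4;
  Node* buckets[kBuckets] = {};

  // Returns the address of the pointer that points at the matching node,
  // or of the trailing null pointer of the chain if there is no match.
  Node** FindPointer(const std::string& key, uint32_t hash) {
    Node** ptr = &buckets[hash & (kBuckets - 1)];
    while (*ptr != nullptr && ((*ptr)->hash != hash || (*ptr)->key != key)) {
      ptr = &(*ptr)->next_hash;
    }
    return ptr;
  }

  // Inserts h and returns the node it replaced, or nullptr if none.
  Node* Insert(Node* h) {
    Node** ptr = FindPointer(h->key, h->hash);
    Node* old = *ptr;
    h->next_hash = (old == nullptr) ? nullptr : old->next_hash;
    *ptr = h;  // overwrites either the old node's slot or the tail's null
    return old;
  }

  // Unlinks and returns the matching node, or nullptr if none.
  Node* Remove(const std::string& key, uint32_t hash) {
    Node** ptr = FindPointer(key, hash);
    Node* result = *ptr;
    if (result != nullptr) {
      *ptr = result->next_hash;
    }
    return result;
  }
};
```

Because the slot is addressed uniformly, the head of a bucket and the middle of a chain need no separate cases: replacing a node keeps its position in the chain, and a new key lands at the tail.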

LRU list

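The LRU list is split by priority into two regions: a high-priority pool and a low-priority pool.

lru_ is the head of the list and also the head of the high-priority pool.
lru_low_pri_ is the head of the low-priority pool.
When high_pri_pool_ratio_ is 0, priorities are not distinguished, and both heads are the head of the LRU list.

On insertion, an entry goes to the head of its pool, whether it is high-priority or low-priority.
LRU_Insert is implemented as follows: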

void LRUCacheShard::LRU_Insert(LRUHandle* e) {
  assert(e->next == nullptr);
  assert(e->prev == nullptr);
  if (high_pri_pool_ratio_ > 0 && e->IsHighPri()) {
    // Insert "e" to head of LRU list.
    e->next = &lru_;
    e->prev = lru_.prev;
    e->prev->next = e;
    e->next->prev = e;
    e->SetInHighPriPool(true);
    high_pri_pool_usage_ += e->charge;
    MaintainPoolSize();
  } else {
    // Insert "e" to the head of low-pri pool. Note that when
    // high_pri_pool_ratio is 0, head of low-pri pool is also head of LRU list.
    e->next = lru_low_pri_->next;
    e->prev = lru_low_pri_;
    e->prev->next = e;
    e->next->prev = e;
    e->SetInHighPriPool(false);
    lru_low_pri_ = e;
  }
  lru_usage_ += e->charge;
}
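
The pointer manipulation in LRU_Insert is easier to follow on a small standalone list. The sketch below keeps only the linking logic: usage accounting and MaintainPoolSize are left out, ToyEntry and ToyLruList are hypothetical names, and it assumes that lru.next is the oldest entry and lru.prev the newest.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct ToyEntry {
  std::string name;
  bool high_pri;
  ToyEntry* next = nullptr;
  ToyEntry* prev = nullptr;
  ToyEntry(std::string n, bool h) : name(std::move(n)), high_pri(h) {}
};

struct ToyLruList {
  ToyEntry lru{"<head>", false};  // dummy head of the circular list
  ToyEntry* lru_low_pri = &lru;   // head (newest entry) of the low-pri pool

  ToyLruList() { lru.next = &lru; lru.prev = &lru; }
  ToyLruList(const ToyLruList&) = delete;  // nodes point back into this object

  void Insert(ToyEntry* e) {
    assert(e->next == nullptr && e->prev == nullptr);
    if (e->high_pri) {
      // Link e just before the dummy head, i.e. at the newest end.
      e->next = &lru;
      e->prev = lru.prev;
    } else {
      // Link e right after the current low-pri head, then make e the head.
      e->next = lru_low_pri->next;
      e->prev = lru_low_pri;
    }
    e->prev->next = e;
    e->next->prev = e;
    if (!e->high_pri) {
      lru_low_pri = e;
    }
  }

  // Names from the oldest entry (first eviction candidate) to the newest.
  std::vector<std::string> OldestFirst() const {
    std::vector<std::string> out;
    for (const ToyEntry* p = lru.next; p != &lru; p = p->next) {
      out.push_back(p->name);
    }
    return out;
  }
};
```

Inserting low-priority A, high-priority B, low-priority C and high-priority D in that order gives the oldest-first order A, C, B, D: every low-priority entry sits on the eviction side of every high-priority entry, so under the assumption above low-priority entries are evicted first.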

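That is the insertion flow of LRUCache.

This was written in a hurry and leaves out many details. If anything is wrong or unclear, corrections are welcome. Thanks!

References: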
https://github.com/facebook/rocksdb/wiki/Block-Cache
