boltdb
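There are many articles about boltdb online, especially on WeChat official accounts, for example:
boltdb source code analysis series: transactions, Tencent Cloud Developer Community (tencent.com)
These articles are well written, but they do not necessarily cover the points I care about, so I record those points below.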
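Relationship between node, page, bucket, tx, and db
- page: disk data mmap'ed into memory; it can be treated as the on-disk data itself
  - a page needs one contiguous region of memory
- node: the in-memory data structure that wraps a B+ tree node
- bucket: one B+ tree; it can be thought of as a table
- tx: a read-only or read-write transaction
  - a bucket is an in-memory structure, and each tx creates its own
  - the nodes a tx touches while writing (the leaf being modified and the branch nodes read on the path to it) are recorded in that bucket
  - when a read-write tx finally writes to disk it allocates new pages; it never modifies the existing ones
- db: the whole database file
  - the freelist in db records the free pages of the db file (pages that can already be released)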
tx.commit
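boltdb runs the B+ tree rebalance only in Commit, and writes to disk after that. In other words, the writes made in one transaction stay in memory and are all flushed together by the spill step at commit time.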
func (tx *Tx) Commit() error {
	_assert(!tx.managed, "managed tx commit not allowed")
	if tx.db == nil {
		return ErrTxClosed
	} else if !tx.writable {
		return ErrTxNotWritable
	}
	// TODO(benbjohnson): Use vectorized I/O to write out dirty pages.
	// Rebalance nodes which have had deletions.
	var startTime = time.Now()
	tx.root.rebalance()
	if tx.stats.Rebalance > 0 {
		tx.stats.RebalanceTime += time.Since(startTime)
	}
	// spill data onto dirty pages.
	startTime = time.Now()
	if err := tx.root.spill(); err != nil {
		tx.rollback()
		return err
	}
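Because a single tx may insert many keys, one node may have to be split more than once, so split keeps calling splitTwo until nothing is left to split off: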
func (n *node) split(pageSize int) []*node {
	var nodes []*node
	node := n
	for {
		// Split node into two.
		a, b := node.splitTwo(pageSize)
		nodes = append(nodes, a)
		// If we can't split then exit the loop.
		if b == nil {
			break
		}
		// Set node to b so it gets split on the next iteration.
		node = b
	}
	return nodes
}
node.go
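The loop shape above can be tried out on plain numbers. In this sketch each element of `items` stands for the byte size of one key/value entry, and `splitTwo` is my own greedy simplification: the real splitTwo also takes a fill percentage and a minimum number of keys per page into account, which I leave out here.

```go
package main

import "fmt"

// splitTwo cuts items into a head that fits in pageSize and the remaining
// tail. The tail is nil when everything fits. The head always keeps at least
// one item, so the caller's loop makes progress even for an oversized item.
func splitTwo(items []int, pageSize int) (head, tail []int) {
	size := 0
	for i, sz := range items {
		if i > 0 && size+sz > pageSize {
			return items[:i], items[i:]
		}
		size += sz
	}
	return items, nil
}

// split repeats splitTwo on the tail until there is nothing left to split.
func split(items []int, pageSize int) [][]int {
	var chunks [][]int
	rest := items
	for {
		a, b := splitTwo(rest, pageSize)
		chunks = append(chunks, a)
		if b == nil {
			break
		}
		rest = b
	}
	return chunks
}

func main() {
	fmt.Println(split([]int{40, 40, 40, 40, 40}, 100)) // [[40 40] [40 40] [40]]
}
```

Five entries of 40 bytes with a 100-byte page end up in three chunks, which is the "split several times in one commit" case.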
When data is written to disk, nodes are written bottom-up: child nodes first, then their parents.
// spill writes the nodes to dirty pages and splits nodes as it goes.
// Returns an error if dirty pages cannot be allocated.
func (n *node) spill() error {
	var tx = n.bucket.tx
	if n.spilled {
		return nil
	}
	// Spill child nodes first. Child nodes can materialize sibling nodes in
	// the case of split-merge so we cannot use a range loop. We have to check
	// the children size on every loop iteration.
	sort.Sort(n.children)
	for i := 0; i < len(n.children); i++ {
		if err := n.children[i].spill(); err != nil {
			return err
		}
	}
	// We no longer need the child list because it's only used for spill tracking.
	n.children = nil
	// Split nodes into appropriate sizes. The first node will always be n.
	var nodes = n.split(tx.db.pageSize)
node.go
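The child-first order is a post-order traversal. The sketch below uses a hypothetical `treeNode` type and only records the visiting order. My understanding of why the order matters is that every spilled child lands on a new page, so the parent has to be written afterwards to store the child's new page id.

```go
package main

import "fmt"

// treeNode is a hypothetical, stripped-down tree node.
type treeNode struct {
	name     string
	children []*treeNode
}

// spill visits all children first and the node itself last, appending each
// visited name to order.
func spill(n *treeNode, order *[]string) {
	for _, c := range n.children {
		spill(c, order)
	}
	*order = append(*order, n.name)
}

func main() {
	root := &treeNode{name: "root", children: []*treeNode{
		{name: "branch", children: []*treeNode{{name: "leaf1"}, {name: "leaf2"}}},
		{name: "leaf3"},
	}}
	var order []string
	spill(root, &order)
	fmt.Println(order) // [leaf1 leaf2 branch leaf3 root]
}
```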
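How is large data handled? A single run of contiguous pages, large enough for the whole node, is allocated and the data is stored there. At the same time, the page previously associated with the node can be freed. Because the design is copy-on-write (existing pages are never modified in place), the old page can be reclaimed once the new transaction has committed and no read transaction still accesses it.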
	for _, node := range nodes {
		// Add node's page to the freelist if it's not new.
		if node.pgid > 0 {
			tx.db.freelist.free(tx.meta.txid, tx.page(node.pgid))
			node.pgid = 0
		}
		// Allocate contiguous space for the node.
		p, err := tx.allocate((node.size() / tx.db.pageSize) + 1)
		if err != nil {
			return err
		}
node.go
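The page count passed to tx.allocate is plain integer arithmetic, so it is easy to check by hand. The helper name `pagesFor` is mine; the expression is copied from the excerpt, and the 4096-byte page size is just an assumed example value.

```go
package main

import "fmt"

// pagesFor mirrors the expression from the excerpt: (size / pageSize) + 1.
func pagesFor(size, pageSize int) int {
	return size/pageSize + 1
}

func main() {
	for _, size := range []int{100, 4096, 10000} {
		fmt.Println(size, pagesFor(size, 4096))
	}
	// Prints:
	// 100 1
	// 4096 2
	// 10000 3
}
```

Note that with this formula a node whose size is an exact multiple of the page size gets one page more than strictly needed.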
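Which nodes need a rebalance? Only nodes flagged as unbalanced (those that had deletions) are considered, and such a node is rebalanced when its size is at most 25% of the page size, or when its key count is at most minKeys (2 for a branch node, 1 for a leaf node).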
func (n *node) rebalance() {
	if !n.unbalanced {
		return
	}
	n.unbalanced = false
	// Update statistics.
	n.bucket.tx.stats.Rebalance++
	// Ignore if node is above threshold (25%) and has enough keys.
	var threshold = n.bucket.tx.db.pageSize / 4
	if n.size() > threshold && len(n.inodes) > n.minKeys() {
		return
	}
node.go
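The early-return check can be turned into a small predicate to see which nodes fall through to the actual rebalance. `needsRebalance` is a hypothetical helper of mine that negates the condition in the excerpt; `minKeys` uses the values mentioned above (1 for a leaf, 2 for a branch), and the 4096-byte page size is an assumed example.

```go
package main

import "fmt"

// minKeys returns the minimum key count: 1 for leaves, 2 for branches.
func minKeys(isLeaf bool) int {
	if isLeaf {
		return 1
	}
	return 2
}

// needsRebalance is the negation of the early-return check in the excerpt:
// a node is skipped only if it is both large enough and has enough keys.
func needsRebalance(size, keys, pageSize int, isLeaf bool) bool {
	threshold := pageSize / 4
	return !(size > threshold && keys > minKeys(isLeaf))
}

func main() {
	fmt.Println(needsRebalance(2000, 10, 4096, true)) // false: large enough, enough keys
	fmt.Println(needsRebalance(500, 10, 4096, true))  // true: size is below 1024
	fmt.Println(needsRebalance(2000, 2, 4096, false)) // true: branch with only 2 keys
}
```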
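When the bucket needs the node under the cursor, it materializes the node from its page and caches it in the bucket, because a node that has been loaded may be about to change. This happens in Cursor.node(), which walks the cursor stack from the root down to the leaf. In the bolt source it is called on write paths such as Bucket.Put and Bucket.Delete once the cursor has been positioned.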
func (c *Cursor) node() *node {
	_assert(len(c.stack) > 0, "accessing a node with a zero-length cursor stack")
	// If the top of the stack is a leaf node then just return it.
	if ref := &c.stack[len(c.stack)-1]; ref.node != nil && ref.isLeaf() {
		return ref.node
	}
	// Start from root and traverse down the hierarchy.
	var n = c.stack[0].node
	if n == nil {
		n = c.bucket.node(c.stack[0].page.id, nil)
	}
	for _, ref := range c.stack[:len(c.stack)-1] {
		_assert(!n.isLeaf, "expected branch node")
		n = n.childAt(int(ref.index))
	}
	_assert(n.isLeaf, "expected leaf node")
	return n
}
// node creates a node from a page and associates it with a given parent.
func (b *Bucket) node(pgid pgid, parent *node) *node {
	_assert(b.nodes != nil, "nodes map expected")
	// Retrieve node if it's already been created.
	if n := b.nodes[pgid]; n != nil {
		return n
	}
	// Otherwise create a node and cache it.
	n := &node{bucket: b, parent: parent}
	if parent == nil {
		b.rootNode = n
	} else {
		parent.children = append(parent.children, n)
	}
	// Use the inline page if this is an inline bucket.
	var p = b.page
	if p == nil {
		p = b.tx.page(pgid)
	}
	// Read the page into the node and cache it.
	n.read(p)
	b.nodes[pgid] = n
	// Update statistics.
	b.tx.stats.NodeCount++
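The caching behaviour of Bucket.node can be reduced to a map lookup with a fallback. In the sketch below, `bucket`, `memNode`, and the `reads` counter are hypothetical names of mine; decoding a page is replaced by just incrementing the counter.

```go
package main

import "fmt"

type pgid uint64

// memNode is a hypothetical in-memory node keyed by the page it came from.
type memNode struct {
	id       pgid
	parent   *memNode
	children []*memNode
}

// bucket caches materialized nodes per page id.
type bucket struct {
	nodes map[pgid]*memNode
	reads int // how many times a page had to be turned into a node
}

// node returns the cached node for id, or creates it, links it to its
// parent, and caches it.
func (b *bucket) node(id pgid, parent *memNode) *memNode {
	if n, ok := b.nodes[id]; ok {
		return n
	}
	n := &memNode{id: id, parent: parent}
	if parent != nil {
		parent.children = append(parent.children, n)
	}
	b.reads++
	b.nodes[id] = n
	return n
}

func main() {
	b := &bucket{nodes: map[pgid]*memNode{}}
	root := b.node(3, nil)
	leaf := b.node(7, root)
	again := b.node(7, root)
	fmt.Println(leaf == again, b.reads, len(root.children)) // true 2 1
}
```

The second request for page 7 returns the same pointer, so changes made through one cursor are visible to every later access within the tx.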
freelist
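It represents the pages on disk that have been freed.
Structure
- ids: all free pages
- pending: {txid, pageids[]}, the page ids that are about to be released, grouped by the txid that freed them
- cache: a map index over the free and pending page ids
-> pending: when pages enter pending
- at tx.commit, the pages of the old nodes touched by the transaction are put into pending
- node.spill puts the old page of each node (a node corresponds to a page) into the freelist's pending
pending -> release: when pages are released
During the commit phase of a tx, the old pages touched by the transaction are put into the freelist's pending: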
func (f *freelist) free(txid txid, p *page) {
	if p.id <= 1 {
		panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.id))
	}
	// Free page and all its overflow pages.
	var ids = f.pending[txid]
	for id := p.id; id <= p.id+pgid(p.overflow); id++ {
		// Verify that page is not already free.
		if f.cache[id] {
			panic(fmt.Sprintf("page %d already freed", id))
		}
		// Add to the freelist and cache.
		ids = append(ids, id)
		f.cache[id] = true
	}
	f.pending[txid] = ids
}
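db.beginRWTx: when a read-write transaction is started, it tries to release the expired pages, that is, the pending pages freed by transactions up to a given txid: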
func (f *freelist) release(txid txid) {
	m := make(pgids, 0)
	for tid, ids := range f.pending {
		if tid <= txid {
			// Move transaction's pending pages to the available freelist.
			// Don't remove from the cache since the page is still free.
			m = append(m, ids...)
			delete(f.pending, tid)
		}
	}
	sort.Sort(m)
	f.ids = pgids(f.ids).merge(m)
}
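The two-stage lifecycle (free into pending, then release into ids) can be replayed with a reduced freelist. The struct below is a simplified sketch of mine: it uses plain ints, drops the cache map and overflow handling, and appends and sorts instead of merging.

```go
package main

import (
	"fmt"
	"sort"
)

// freelist is a simplified, hypothetical version of the structure above.
type freelist struct {
	ids     []int         // pages that can be reused right now
	pending map[int][]int // txid -> pages freed by that transaction
}

// free parks a page under the transaction that stopped using it.
func (f *freelist) free(txid, page int) {
	f.pending[txid] = append(f.pending[txid], page)
}

// release moves the pages freed by transactions up to and including txid
// into the reusable list and keeps that list sorted.
func (f *freelist) release(txid int) {
	for tid, pages := range f.pending {
		if tid <= txid {
			f.ids = append(f.ids, pages...)
			delete(f.pending, tid)
		}
	}
	sort.Ints(f.ids)
}

func main() {
	f := &freelist{pending: map[int][]int{}}
	f.free(5, 12)
	f.free(5, 9)
	f.free(6, 20)
	f.release(5)
	fmt.Println(f.ids, f.pending) // [9 12] map[6:[20]]
}
```

Pages freed by tx 6 stay in pending because a reader that started before tx 6 committed may still need them. As far as I can tell from the bolt source, beginRWTx picks the txid to release from the oldest read transaction that is still open.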