PostgreSQL Page頁結構解析(6)- B-Tree索引儲存結構#2

husthxd發表於2018-07-25

本文簡單介紹了在PG資料庫B-Tree索引的物理儲存結構中Special space部分,包括根節點、左右兄弟節點等相關索引結構資訊,以及初步探討了PG在物理儲存上如何保證索引的有序性。

一、測試資料

測試資料同上一節,索引檔案raw data:

[xdb@localhost utf8db]$ hexdump -C base/16477/26637
00000000  01 00 00 00 20 5d 0e db  00 00 00 00 40 00 f0 1f  |.... ]......@...|
00000010  f0 1f 04 20 00 00 00 00  62 31 05 00 03 00 00 00  |... ....b1......|
00000020  01 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 f0 bf  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001ff0  00 00 00 00 00 00 00 00  00 00 00 00 08 00 00 00  |................|
00002000  01 00 00 00 98 5c 0e db  00 00 00 00 28 00 b0 1f  |.....\......(...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  b0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

二、索引結構資訊

索引物理儲存結構在上一節已大體介紹,這裡主要介紹索引的結構資訊。通過pageinspect外掛的bt_page_stats函式可以獲得索引結構資訊,包括root/leaf page,next & previous page:

testdb=# select * from bt_page_stats('pk_t_index',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3
(1 row)

相關的資料結構如下:

//---------------------- src/include/access/nbtree.h
/*
 *  BTPageOpaqueData -- At the end of every page, we store a pointer
 *  to both siblings in the tree.  This is used to do forward/backward
 *  index scans.  The next-page link is also critical for recovery when
 *  a search has navigated to the wrong page due to concurrent page splits
 *  or deletions; see src/backend/access/nbtree/README for more info.
 *
 *  In addition, we store the page's btree level (counting upwards from
 *  zero at a leaf page) as well as some flag bits indicating the page type
 *  and status.  If the page is deleted, we replace the level with the
 *  next-transaction-ID value indicating when it is safe to reclaim the page.
 *
 *  We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *  processing the index, a nonzero value associated with the VACUUM run is
 *  stored into both halves of the split page.  (If VACUUM is not running,
 *  both pages receive zero cycleids.)  This allows VACUUM to detect whether
 *  a page was split since it started, with a small probability of false match
 *  if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *  ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *  (original) page, and set in the right page, but only if the next page
 *  to its right has a different cycleid.
 *
 *  NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *  instead.
 */

typedef struct BTPageOpaqueData
{
    BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
    BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
    union
    {
        uint32      level;      /* tree level --- zero for leaf pages */
        TransactionId xact;     /* next transaction ID, if deleted */
    }           btpo;
    uint16      btpo_flags;     /* flag bits, see below */
    BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef BTPageOpaqueData *BTPageOpaque;

/* Bits defined in btpo_flags */
#define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
#define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
#define BTP_META        (1 << 3)    /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

查詢結果中,type=l(字母l),表示這個block(page)是leaf block,在這個block中有4個item(live_items=4),沒有廢棄的items(dead_items=0),沒有left sibling(btpo_prev =0)也沒有right sibling(btpo_next =0),也就是左右兩邊都沒有同級節點。btpo是一個union,值為0,表示該page為葉子page,btpo_flags值為3即BTP_LEAF | BTP_ROOT,既是葉子page也是根page。
這些資訊物理儲存在先前介紹過的PageHeader中的special space中,共佔用16個位元組:

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# select 8192+8176;
 ?column? 
----------
    16368
(1 row)

[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 16368 -n 16
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

三、索引有序性

我們都知道,B-Tree索引是有序的,下面我們看看在物理儲存結構上如何保證有序性。
插入資料,id=18

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# -- 插入資料,id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6498 |        0 |     0 |    44 |  8096 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000  01 00 00 00 98 64 0e db  00 00 00 00 2c 00 a0 1f  |.....d......,...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  a0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

插入資料,id=17

testdb=# -- 插入資料,id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 檢視索引資料頁頭資料
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6808 |        0 |     0 |    48 |  8080 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump  -C base/16477/26637 -s 8192
00002000  01 00 00 00 08 68 0e db  00 00 00 00 30 00 90 1f  |.....h......0...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  90 9f 20 00 a0 9f 20 00  |.. ... ... ... .|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003f90  00 00 00 00 06 00 10 00  11 00 00 00 00 00 00 00  |................|
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

索引的資料區,並沒有按照大小順序排序,\x11(17)在\x12(18)的後面(從尾部開始往前),但在索引頁的頭部ItemId區域是有序的,第5個ItemId(\x00209f90)指向的是17,而第6個ItemId(\x00209fa0)指向的是18,通過索引資料指標的有序保證索引有序性。

四、小結

小結一下,知識要點:
1、Special Space:介紹了索引PageHeaderData的Special Space的儲存結構,通過Special Space可以獲得B-Tree的root、左右sibling等資訊;
2、有序性:索引Page通過索引項指標保證有序性。

最後:

Don’t Be Afraid To Explore Beneath The Surface

PostgreSQL Page頁結構解析(6)- B-Tree索引儲存結構#2
Captain Nemo at the helm

Professor Aronnax risked his life and career to find the elusive Nautilus and to join Captain Nemo on a long series of amazing underwater adventures. We should do the same: Don’t be afraid to dive underwater – inside and underneath the tools, languages and technologies that you use every day. You may know all about how to use Postgres, but do you really know how Postgres itself works internally? Take a look inside; before you know it, you’ll be on an underwater adventure of your own.
Studying the Computer Science at work behind the scenes of our applications isn’t just a matter of having fun, it’s part of being a good developer. As software development tools improve year after year, and as building web sites and mobile apps becomes easier and easier, we shouldn’t loose sight of the Computer Science we depend on. We’re all standing on the shoulders of giants – people like Lehman and Yao, and the open source developers who used their theories to build Postgres. Don’t take the tools you use everyday for granted – take a look inside them! You’ll become a wiser developer and you’ll find insights and knowledge you could never have imagined before.

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/6906/viewspace-2374918/,如需轉載,請註明出處,否則將追究法律責任。

相關文章