PostgreSQL Page頁結構解析(6)- B-Tree索引儲存結構#2
本文簡單介紹了在PG資料庫B-Tree索引的物理儲存結構中Special space部分,包括根節點、左右兄弟節點等相關索引結構資訊,以及初步探討了PG在物理儲存上如何保證索引的有序性。
一、測試資料
測試資料同上一節,索引檔案raw data:
[xdb@localhost utf8db]$ hexdump -C base/16477/26637
00000000 01 00 00 00 20 5d 0e db 00 00 00 00 40 00 f0 1f |.... ]......@...|
00000010 f0 1f 04 20 00 00 00 00 62 31 05 00 03 00 00 00 |... ....b1......|
00000020 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0 bf |................|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001ff0 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 00 |................|
00002000 01 00 00 00 98 5c 0e db 00 00 00 00 28 00 b0 1f |.....\......(...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 b0 9f 20 00 00 00 00 00 |.. ... ... .....|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
二、索引結構資訊
索引物理儲存結構在上一節已大體介紹,這裡主要介紹索引的結構資訊。透過pageinspect外掛的bt_page_stats函式可以獲得索引結構資訊,包括root/leaf page,next & previous page:
testdb=# select * from bt_page_stats('pk_t_index',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 4 | 0 | 16 | 8192 | 8068 | 0 | 0 | 0 | 3
(1 row)
相關的資料結構如下:
//---------------------- src/include/access/nbtree.h
/*
* BTPageOpaqueData -- At the end of every page, we store a pointer
* to both siblings in the tree. This is used to do forward/backward
* index scans. The next-page link is also critical for recovery when
* a search has navigated to the wrong page due to concurrent page splits
* or deletions; see src/backend/access/nbtree/README for more info.
*
* In addition, we store the page's btree level (counting upwards from
* zero at a leaf page) as well as some flag bits indicating the page type
* and status. If the page is deleted, we replace the level with the
* next-transaction-ID value indicating when it is safe to reclaim the page.
*
* We also store a "vacuum cycle ID". When a page is split while VACUUM is
* processing the index, a nonzero value associated with the VACUUM run is
* stored into both halves of the split page. (If VACUUM is not running,
* both pages receive zero cycleids.) This allows VACUUM to detect whether
* a page was split since it started, with a small probability of false match
* if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
* ago. Also, during a split, the BTP_SPLIT_END flag is cleared in the left
* (original) page, and set in the right page, but only if the next page
* to its right has a different cycleid.
*
* NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
* instead.
*/
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
union
{
uint32 level; /* tree level --- zero for leaf pages */
TransactionId xact; /* next transaction ID, if deleted */
} btpo;
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;
/* Bits defined in btpo_flags */
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */
#define BTP_ROOT (1 << 1) /* root page (has no parent) */
#define BTP_DELETED (1 << 2) /* page has been deleted from tree */
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
查詢結果中,type=l(字母l),表示這個block(page)是leaf block,在這個block中有4個item(live_items=4),沒有廢棄的items(dead_items=0),沒有left sibling(btpo_prev =0)也沒有right sibling(btpo_next =0),也就是左右兩邊都沒有同級節點。btpo是一個union,值為0,表示該page為葉子page,btpo_flags值為3即BTP_LEAF | BTP_ROOT,既是葉子page也是根page。
這些資訊物理儲存在先前介紹過的PageHeader中的special space中,共佔用16個位元組:
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E5C98 | 0 | 0 | 40 | 8112 | 8176 | 8192 | 4 | 0
(1 row)
testdb=# select 8192+8176;
?column?
----------
16368
(1 row)
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 16368 -n 16
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
三、索引有序性
我們都知道,B-Tree索引是有序的,下面我們看看在物理儲存結構上如何保證有序性。
插入資料,id=18
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E5C98 | 0 | 0 | 40 | 8112 | 8176 | 8192 | 4 | 0
(1 row)
testdb=# -- 插入資料,id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E6498 | 0 | 0 | 44 | 8096 | 8176 | 8192 | 4 | 0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000 01 00 00 00 98 64 0e db 00 00 00 00 2c 00 a0 1f |.....d......,...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00 |.. ... ... .....|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003fa0 00 00 00 00 05 00 10 00 12 00 00 00 00 00 00 00 |................|
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
插入資料,id=17
testdb=# -- 插入資料,id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 檢視索引資料頁頭資料
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E6808 | 0 | 0 | 48 | 8080 | 8176 | 8192 | 4 | 0
(1 row)
-- dump索引頁
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000 01 00 00 00 08 68 0e db 00 00 00 00 30 00 90 1f |.....h......0...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00 |.. ... ... ... .|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003f90 00 00 00 00 06 00 10 00 11 00 00 00 00 00 00 00 |................|
00003fa0 00 00 00 00 05 00 10 00 12 00 00 00 00 00 00 00 |................|
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
索引的資料區,並沒有按照大小順序排序,\x11(17)在\x12(18)的後面(從尾部開始往前),但在索引頁的頭部ItemId區域是有序的,第5個ItemId(\x00209f90)指向的是17,而第6個ItemId(\x00209fa0)指向的是18,透過索引資料指標的有序保證索引有序性。
四、小結
小結一下,知識要點:
1、Special Space:介紹了索引PageHeaderData的Special Space的儲存結構,透過Special Space可以獲得B-Tree的root、左右sibling等資訊;
2、有序性:索引Page透過索引項指標保證有序性。
最後:
Don’t Be Afraid To Explore Beneath The Surface
Professor Aronnax risked his life and career to find the elusive Nautilus and to join Captain Nemo on a long series of amazing underwater adventures. We should do the same: Don’t be afraid to dive underwater – inside and underneath the tools, languages and technologies that you use every day. You may know all about how to use Postgres, but do you really know how Postgres itself works internally? Take a look inside; before you know it, you’ll be on an underwater adventure of your own.
Studying the Computer Science at work behind the scenes of our applications isn’t just a matter of having fun, it’s part of being a good developer. As software development tools improve year after year, and as building web sites and mobile apps becomes easier and easier, we shouldn’t loose sight of the Computer Science we depend on. We’re all standing on the shoulders of giants – people like Lehman and Yao, and the open source developers who used their theories to build Postgres. Don’t take the tools you use everyday for granted – take a look inside them! You’ll become a wiser developer and you’ll find insights and knowledge you could never have imagined before.
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/6906/viewspace-2374918/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- PostgreSQL Page頁結構解析(5)- B-Tree索引儲存結構#1SQL索引
- PostgreSQL Page頁結構解析(7)- B-Tree索引儲存結構#3SQL索引
- PostgreSQL儲存引擎之page結構SQL儲存引擎
- PostgreSQL Page頁結構解析(1)-基礎SQL
- PostgreSQL中page頁結構SQL
- PostgreSQL Page頁結構解析(3)- 行資料SQL
- PostgreSQL Page頁結構解析(2)- 頁頭和行資料指標SQL指標
- PostgreSQL Page頁結構解析(4)- 執行DML時表佔用空間解析SQL
- oracle的B-tree索引結構分析Oracle索引
- PostgreSQL儲存引擎之heap tuple結構SQL儲存引擎
- MySQL Innodb 儲存結構 & 儲存Null值 解析MySqlNull
- redis 儲存結構原理 2Redis
- 儲存結構
- 如何從零學習PostgreSQL Page結構SQL
- MyRocks儲存引擎資料結構解析儲存引擎資料結構
- JanusGraph -- 儲存結構
- CentOS 儲存結構CentOS
- 《MySQL 基礎篇》十一:索引的儲存結構MySql索引
- VSAN儲存結構解析+儲存資料恢復案例資料恢復
- 資料的儲存結構淺析LSM-Tree和B-tree
- Redis儲存結構以及儲存格式Redis
- 圖的儲存結構
- 三種儲存結構
- linux6-儲存結構與硬碟管理Linux硬碟
- 資料結構知識點--儲存結構與邏輯結構資料結構
- PostgreSQL 資料庫學習 - 1.資料庫體系結構之儲存結構SQL資料庫
- 【資料結構——圖和圖的儲存結構】資料結構
- MySQL InnoDB的儲存結構總結MySql
- 2 Day DBA-管理Oracle例項-管理資料庫儲存結構-關於其它儲存結構Oracle資料庫
- InnoDB記錄儲存結構
- php圖的儲存結構PHP
- HBase 資料儲存結構
- oracle物理儲存結構理解Oracle
- Oracle資料儲存結構Oracle
- SAP儲存地點結構
- 【PHP資料結構】圖的概念和儲存結構PHP資料結構
- db2_使用db2dart分析儲存結構DB2Dart
- 2 Day DBA-管理Oracle例項-管理資料庫儲存結構-關於資料庫儲存結構Oracle資料庫