使用索引是我們面對海量資料搜尋是一種常用的手段。透過有效的索引訪問，可以使我們更快的訪問到需要的資料，減少物理、邏輯IO，從而提高系統效能。在CBO時代，Oracle對於提交SQL的執行路徑是有所選擇的。一個select是走Index還是走Full Table Scan，或者別的路徑，要根據Oracle對錶列的統計資訊收集結果加以計算出的執行計劃成本而確定。在計算索引成本的公式中，索引的clustering factor是一個重要參考資訊。

簡單的說，clustering factor就是反映資料錶行儲存有序程度與索引有序程度對照的指標。如果這個值越大，說明在進行索引搜尋是，需要獲取的資料塊數量越多，從而進行邏輯物理讀的次數也就越多，相應消耗的成本Cost越高。反之，對應的成本也就越低。

原理上，可以這麼理解。當我們使用索引進行查詢的時候，獲取到葉節點最後是符合條件的一系列ROWID，代表對應資料結果所在的實體地址。在根據ROWID所對應的位置，讀取對應的資料塊。如果指定的資料塊（Date Block）已經存在於SGA中的Buffer Cache，就直接讀取（進行邏輯讀）。如果不存在於Buffer Cache，就需要先從物理儲存上把資料塊讀取到記憶體SGA裡，之後再進行讀取。（物理讀+邏輯讀）。在這個過程中，我們通常傾向於減少物理讀和邏輯讀的次數。如果我們要獲取的資料都是在同樣塊或者儘可能少的資料塊裡，那麼我們索引的執行效率較高。如果有一個查詢，雖然執行路徑中包括了索引，但是將全表所有的資料塊都讀取在SGA裡，其成本也是值得商榷的。

Clustering Factor就是衡量索引執行效率成本的一個重要指標。我們通常會希望資料表中排列的順序與索引列排序的順序一致。但是，在使用一些儲存結構（隨機儲存結構），索引和資料分開儲存，資料行一般為隨機儲存。這樣，Clustering Factor是一個不斷退化的過程。下面，我們透過實驗，觀察clustering factor對於資料索引的影響，以及最佳化的方法。

1、構建實驗環境

我們選擇Oracle版本為10gR2。

//獲取當前版本

SQL> select * from v$version;

BANNER

----------------------------------------------------------------

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod

PL/SQL Release 10.2.0.1.0 - Production

CORE 10.2.0.1.0 Production

TNS for 32-bit Windows: Version 10.2.0.1.0 - Production

NLSRTL Version 10.2.0.1.0 – Production

Executed in 0.04 seconds

dba_object構建一張百萬級資料表，構造指令碼如下：

//插入指令碼

declare

i number;

begin

for i in 1..10 loop

insert /*+ append */ into t

select * from dba_objects order by i;

commit;

end loop;

end;

插入大約一百萬條資料。

SQL> select count(*) from t;

COUNT(*)

----------

1008000

Executed in 4.987 seconds

2、加入資料索引，收集統計資訊

我們構建的資料表基本是無序的，資料字典分析情況如下：

SQL> select owner, segment_name, blocks, extents, bytes from dba_segments where wner='SYS' and segment_name='T';

OWNER SEGMENT_NA BLOCKS EXTENTS BYTES

------- ---------- ---------- ---------- ----------

SYS T 14208 85 116391936

Executed in 0.221 seconds

構建t上的object_id索引列。

SQL> create index ind_t_id on t(object_id);

Index created

Executed in 17.515 seconds

SQL> exec dbms_stats.gather_table_stats(user,'T',cascade => true);

PL/SQL procedure successfully completed

Executed in 21.29 seconds

其索引的字典資訊。

SQL> select owner, segment_name, segment_type,blocks, extents, bytes from dba_segments where wner='SYS' and segment_name=upper('ind_t_id');

OWNER SEGMENT_NA SEGMENT_TYPE BLOCKS EXTENTS BYTES

------- ---------- ------------------ ---------- ---------- ----------

SYS IND_T_ID INDEX 2304 33 18874368

Executed in 0.17 seconds

發現，資料表T本身佔據空間超過116兆，對應的索引為18.9兆左右。

3、未進行資料表重構前，執行效率分析

在未進行重構之前，資料表T對應索引ind_t_id的clustering_factor資訊如下：

SQL> select owner, index_name, clustering_factor, num_rows from dba_indexes where wner='SYS' and index_name='IND_T_ID';

OWNER INDEX_NAME CLUSTERING_FACTOR NUM_ROWS

------- --------------- ----------------- --------------------

SYS IND_T_ID 1008000 1008000

Executed in 0.14 seconds

對索引列查詢，分析精確查詢和範圍查詢兩種分析。

精確查詢：

SQL> select * from t where object_id=1500;

已選擇20行。

已用時間: 00: 00: 00.04

執行計劃

----------------------------------------------------------

Plan hash value: 4182247035

----------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 20 | 1860 | 23 (0)| 00:00:01 |

| 1 | TABLE ACCESS BY INDEX ROWID| T | 20 | 1860 | 23 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | IND_T_ID | 20 | | 3 (0)| 00:00:01 |

----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("OBJECT_ID"=1500)

統計資訊

----------------------------------------------------------

1 recursive calls

0 db block gets

25 consistent gets

0 physical reads

0 redo size

3126 bytes sent via SQL*Net to client

396 bytes received via SQL*Net from client

3 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

20 rows processed

範圍查詢：

當選取範圍略大的時候，Oracle執行最佳化器，會主動拒絕執行索引。

SQL> select * from t where object_id>1000 and object_id<2000

已選擇19980行。

已用時間: 00: 00: 04.85

執行計劃

Plan hash value: 1601196873

--------------------------------------------------------------------------

--------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 18947 | 1720K| 3092 (3)| 00:00:38 |

|* 1 | TABLE ACCESS FULL| T | 18947 | 1720K| 3092 (3)| 00:00:38 |

--------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("OBJECT_ID"<2000 AND "OBJECT_ID">1000)

統計資訊

----------------------------------------------------------

1 recursive calls

0 db block gets

15179 consistent gets

13567 physical reads

1484 redo size

993274 bytes sent via SQL*Net to client

15026 bytes received via SQL*Net from client

1333 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

19980 rows processed

雖然在資料列object_id上加了索引，而且查詢返回的資料量不到2萬行，遠遠少於資料表資料總量（100萬）。Oracle最佳化器在選取路徑的時候，放棄了按照索引進行搜尋的方案。

同樣的條件，使用Hint提示方法，強制走索引也可以檢視效果。

SQL> select /*+ index(t ind_t_id) */ * from t where object_id>1000 and object_id<2000;

已選擇19980行。

已用時間: 00: 00: 01.65

執行計劃

----------------------------------------------------------

Plan hash value: 4182247035

----------------------------------------------------------------------------------------

------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 18947 | 1720K| 19089 (1)| 00:03:50 |

| 1 | TABLE ACCESS BY INDEX ROWID| T | 18947 | 1720K| 19089 (1)| 00:03:50 |

|* 2 | INDEX RANGE SCAN | IND_T_ID | 19020 | | 45 (3)| 00:00:01 |

-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("OBJECT_ID">1000 AND "OBJECT_ID"<2000)

統計資訊

----------------------------------------------------------

1 recursive calls

0 db block gets

21586 consistent gets

244 physical reads

16892 redo size

1938406 bytes sent via SQL*Net to client

15026 bytes received via SQL*Net from client

1333 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

19980 rows processed

對比兩個結果，可以看出在這個返回結果比較小（約2%）的語句中，採用索引與全表掃描差別很大。

對比專案	全表掃描	強制索引
cpu cost	3092	19083
time	00:00:38	00:03:50
consistent get	15179	21586
physical reads	13567	244
redo size	1484	16892

結論：我們在object_id上加入的索引ind_t_id，其clustering factor取值為100萬，與資料行相同。索引健康程度比較差。在唯一查詢object_id的時候，走索引。但是進行範圍查詢的時候，即使結果資料量僅為2%，oracle還是放棄了索引的執行計劃，選擇了full table scan，說明索引健康程度有時是比加索引的技巧左右大。

4、重構資料表，最佳化索引質量

從索引、Clustering Factor的原理看，解決clustering Factor的核心在於資料行重排。讓資料行按照索引列排序的順序儲存，效果比較好。

//重構表

SQL> create table tt as select * from t where 1=0;

Table created

Executed in 0.14 seconds

SQL> insert /*+ append */ into tt select * from t order by object_id;

1008000 rows inserted

Executed in 80.225 seconds

SQL> commit;

Commit complete

Executed in 0 seconds

SQL> truncate table t;

Table truncated

Executed in 0.421 seconds

SQL> insert /*+ append */ into t select * from tt;

1008000 rows inserted

Executed in 29.182 seconds

SQL> commit;

Commit complete

Executed in 0.03 seconds

SQL> alter index ind_t_id rebuild;

Index altered

Executed in 10.665 seconds

獲取統計資料，並且檢視索引的健康程度。

//統計資訊

SQL> exec dbms_stats.gather_table_stats(user,'T',cascade => true);

PL/SQL procedure successfully completed

Executed in 19.157 seconds

SQL> select owner, index_name, clustering_factor, num_rows from dba_indexes where wner='SYS' and index_name='IND_T_ID';

OWNER INDEX_NAME CLUSTERING_FACTOR NUM_ROWS

------- --------------- ----------------- --------------------

SYS IND_T_ID 13831 1008000

Executed in 0.09 seconds

可以看到，我們按照object_id進行表重構之後，索引的clustering factor縮小為13831，只有num rows的約2%。健康程度最佳化。

5、重新執行搜尋，檢視執行計劃

進行精確查詢：

SQL> select * from t where object_id=1500;

已選擇20行。

已用時間: 00: 00: 00.08

執行計劃

----------------------------------------------------------

Plan hash value: 4182247035

-----------------------------------------------------------------

----------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 20 | 1860 | 4 (0)| 00:00:01 |

| 1 | TABLE ACCESS BY INDEX ROWID| T | 20 | 1860 | 4 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | IND_T_ID | 20 | | 3 (0)| 00:00:01 |

--------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("OBJECT_ID"=1500)

統計資訊

----------------------------------------------------------

381 recursive calls

0 db block gets

62 consistent gets

1 physical reads

116 redo size

3126 bytes sent via SQL*Net to client

396 bytes received via SQL*Net from client

3 SQL*Net roundtrips to/from client

3 sorts (memory)

0 sorts (disk)

20 rows processed

對比重構前後情況，發現CPU成本上有很明顯的變化，其中Index Range Scan的成本沒有發生變化，一直為3。而Table Access By Index ROWID有較大的變化，從原有的23下降到4。這一現象說明搜尋索引的成本沒有變化，但是根據索引返回ROWID，查詢資料表塊的成本有所小，和預計情況相同。

範圍查詢：

SQL> select * from t where object_id>1000 and object_id<2000;

已選擇19980行。

已用時間: 00: 00: 01.22

執行計劃

----------------------------------------------------------

Plan hash value: 4182247035

------------------------------------------------------------

----------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 18980 | 1723K| 308 (1)| 00:00:04 |

| 1 | TABLE ACCESS BY INDEX ROWID| T | 18980 | 1723K| 308 (1)| 00:00:04 |

|* 2 | INDEX RANGE SCAN | IND_T_ID | 19031 | | 45 (3)| 00:00:01 |

-----------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("OBJECT_ID">1000 AND "OBJECT_ID"<2000)

統計資訊

----------------------------------------------------------

1 recursive calls

0 db block gets

2931 consistent gets

249 physical reads

0 redo size

1938406 bytes sent via SQL*Net to client

15026 bytes received via SQL*Net from client

1333 SQL*Net roundtrips to/from client

0 sorts (memory)

0 sorts (disk)

19980 rows processed

從返回的結果看，進行資料表重構後，我們的查詢效率大大提升，原有Oracle選擇全表掃描的情況已經不再出現，直接選擇了Index搜尋。而Index搜尋的效果也有相當提升，CPU成本只有308，消耗時間00:00:04。資料邏輯讀塊只有2931，物理讀只有249。兩次試驗的詳細對比如下：

對比專案	未重構表			重構後
對比專案	全表掃描（範圍）	強制索引（範圍）	準確查詢	索引搜尋（範圍）	準確查詢
cpu cost	3092	19083	23	308	4
time	00:00:38	00:03:50	00:00:01	00:00:04	00:00:01
consistent get	15179	21586	25	2931	62
physical reads	13567	244	0	249	1
redo size	1484	16892	0	0	116

6、結論

透過上面的實驗，我們可以看到：clustering factor是索引健康程度的一個重要指標。我們有的時候，雖然在資料表的指定列中加入了索引，但是因為索引的健康程度不高，可能效率很差。我們經常可以聽說，如果查詢比例在15%或者XXX%以下的時候，我們要加索引，但是在我們的例子中，資料只有2%，但是還是成本高於FTS（Full Table Scan），不會選擇索引路徑。所以，經常性的維護我們的索引是很重要的。

但是，本文書寫的目的絕不是讓將重構表成為一個常態任務，重構表只是一個手段。在實際中，是有一些問題的。

首先，重構表的行順序是需要選擇的。一個需要我們規劃的表，其上常常不止一個索引，一個索引的clustering factor高效，可能就意味著另一個索引的惡化。所以一定要慎用。

其次，在一些物理效能不是很好的環境下做，重構資料表可能是一個比較繁重的工作。

最後，clustering factor總的趨勢一般都是不斷惡化（選擇隨機存取），合理的規劃，使用其他如分割槽等技術也是解決我們最終問題的方案。這部分可能在海量資料庫中應用比較多。

所以，在關注有沒有索引的同時，也要關注我們索引的健康程度。

Clustering Factor——索引的成本指標

相關文章