Oracle直方圖詳解

yantaicuiwei發表於2010-11-20
Oracle直方圖詳解

http://blog.csdn.net/randyamor/archive/2008/12/21/3574181.aspx

當系統中的某些表存在高度不均勻的資料分佈時,使用柱狀圖能夠產生更好的選擇性評估,從而產生更加最佳化的執行計劃。柱狀圖提供一種有效和簡捷的方法來呈現資料的分佈情況。

下面透過一個具體的例子解釋柱狀圖的使用。

SQL> create table tab (a number, b number);

Table created.

SQL> begin
       for i in 1..10000 loop
         insert into tab values (i, i);
       end loop;
       commit;
     end;
     /

PL/SQL procedure successfully completed.

SQL> update tab set b=5 where b between 6 and 9995;

9990 rows updated.

SQL> commit;

Commit complete.

這樣在tab表中,b列有10個不同的值,其中等於的值有9991個。在建立索引之前,無論是查詢b=3或者是b=5,都只能是走全表掃描(FULL TABLE SCAN),因為沒有別的可以使用的訪問路徑。

下面我們在b列上建立一個索引。

SQL> create index ix_tab_b on tab(b);

Index created.

SQL> select index_name, table_name, column_name, column_position, column_length
     from user_ind_columns
     where table_name='TAB';

INDEX_NAME                     TABLE_NAME                     COLUMN_NAME          COLUMN_POSITION COLUMN_LENGTH
------------------------------ ------------------------------ -------------------- --------------- -------------
IX_TAB_B                       TAB                            B                                  1            22

現在我們分別來看看下面的查詢。

SQL> select * from tab where b=3;

1 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 439197569

------------------------------------------------
| Id | Operation                   | Name     |
------------------------------------------------
|   0 | SELECT STATEMENT            |          |
|   1 | TABLE ACCESS BY INDEX ROWID| TAB      |
|* 2 |   INDEX RANGE SCAN          | IX_TAB_B |
------------------------------------------------

Statistics
----------------------------------------------------------
        178 recursive calls
          0 db block gets
         30 consistent gets
          5 physical reads
        116 redo size
        462 bytes sent via SQL*Net to client
        385 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          5 sorts (memory)
          0 sorts (disk)
          1 rows processed

SQL> select * from tab where b=5;

9991 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 439197569

------------------------------------------------
| Id | Operation                   | Name     |
------------------------------------------------
|   0 | SELECT STATEMENT            |          |
|   1 | TABLE ACCESS BY INDEX ROWID| TAB      |
|* 2 |   INDEX RANGE SCAN          | IX_TAB_B |
------------------------------------------------

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
       1370 consistent gets
         16 physical reads
          0 redo size
     206729 bytes sent via SQL*Net to client
       7711 bytes received via SQL*Net from client
        668 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
       9991 rows processed

可以看出這裡走的都是基於RBO的INDEX RANGE SCAN。

接下來,我們使用計算統計對錶進行分析。

SQL> analyze table tab compute statistics;

Table analyzed.

SQL> select num_rows, blocks, empty_blocks, avg_space, chain_cnt, avg_row_len 
2 from dba_tables
3 where table_name = 'TAB';

NUM_ROWS     BLOCKS EMPTY_BLOCKS AVG_SPACE CHAIN_CNT AVG_ROW_LEN
---------- ---------- ------------ ---------- ---------- -----------
     10000         20            4       2080          0          10

SQL> select num_distinct, low_value, high_value, density, num_buckets, last_analyzed, sample_size
     from dba_tab_columns
     where table_name = 'TAB';

NUM_DISTINCT LOW_VALUE            HIGH_VALUE              DENSITY NUM_BUCKETS LAST_ANAL SAMPLE_SIZE
------------ -------------------- -------------------- ---------- ----------- --------- -----------
       10000 C102                 C302                      .0001           1 21-DEC-08       10000
          10 C102                 C302                         .1           1 21-DEC-08       10000

SQL> select table_name, column_name, endpoint_number, endpoint_value 
     from dba_tab_histograms
     where table_name = 'TAB';

TABLE_NAME                     COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ -------------------- --------------- --------------
TAB                            A                                  0              1
TAB                            A                                  1          10000
TAB                            B                                  0              1
TAB                            B                                  1          10000

再來執行上面的兩個查詢,觀察其執行計劃,發現兩個查詢仍然走的都是INDEX RANGE SCAN,只不過這時的執行計劃是基於CBO的。

現在我們建立tab表b列的柱狀圖統計資訊,使得最佳化器能夠知道該列每個值的具體分佈情況。

SQL> analyze table tab compute statistics for columns b size 10;

Table analyzed.

SQL> select table_name, column_name, endpoint_number, endpoint_value 
     from dba_histograms
     where table_name = 'TAB';

TABLE_NAME                     COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ -------------------- --------------- --------------
TAB                            B                                  1              1
TAB                            B                                  2              2
TAB                            B                                  3              3
TAB                            B                                  4              4
TAB                            B                               9995              5
TAB                            B                               9996           9996
TAB                            B                               9997           9997
TAB                            B                               9998           9998
TAB                            B                               9999           9999
TAB                            B                              10000          10000

直方圖中的ENDPOINT_VALUE表示列值,ENDPOINT_NUMBER表示累積的行數。比如ENDPOINT_VALUE=2,ENDPOINT_NUMBER=2,因為ENDPOINT_NUMBER是個累積值,實際上2的ENDPOINT_NUMBER應該是2減去上一個值的ENDPOINT_NUMBER,也即是2-1=1。同理,5的ENDPOINT_NUMBER=9995-4=9991。

SQL> select * from tab where b=3;

1 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 439197569

----------------------------------------------------------------------------------------
| Id | Operation                   | Name     | Rows | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |     1 |     6 |     2   (0)| 00:00:01 |
|   1 | TABLE ACCESS BY INDEX ROWID| TAB      |     1 |     6 |     2   (0)| 00:00:01 |
|* 2 |   INDEX RANGE SCAN          | IX_TAB_B |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
        178 recursive calls
          0 db block gets
         28 consistent gets
          0 physical reads
          0 redo size
        462 bytes sent via SQL*Net to client
        385 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          5 sorts (memory)
          0 sorts (disk)
          1 rows processed

SQL> select * from tab where b=5;

9991 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1995730731

--------------------------------------------------------------------------
| Id | Operation         | Name | Rows | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      | 9991 | 59946 |     6   (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 9991 | 59946 |     6   (0)| 00:00:01 |
--------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
        689 consistent gets
          0 physical reads
          0 redo size
     174757 bytes sent via SQL*Net to client
       7711 bytes received via SQL*Net from client
        668 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
       9991 rows processed

這時可以看出,不同值的分佈導致了Oracle最佳化器選擇了不同執行計劃。對於b=5的查詢來說,全表掃描的一致性讀比之前的索引範圍掃描要降低很多。可以看出此時的全表掃描比之索引範圍掃描更加的合理,最佳化器正是根據直方圖的統計資訊做出的正確的判斷。

上述的例子描述了一種理想的狀況,因為我們為每一個不同的值建立了bucket。在實際的生產系統中,一張表可能包含很多的唯一值,我們不可能為每一個唯一值建立bucket,這樣開銷將是巨大的。

下面的例子描述了唯一值大於buckets的情況。

SQL> analyze table tab compute statistics for columns b size 8;

Table analyzed.

SQL> select table_name, column_name, endpoint_number, endpoint_value 
     from dba_histograms
     where table_name = 'TAB';

TABLE_NAME                     COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ -------------------- --------------- --------------
TAB                            B                                  0              1
TAB                            B                                  7              5
TAB                            B                                  8          10000

ENDPOINT_NUMBER是實際的bucket編號,ENDPOINT_VALUE是根據列值決定的該bucket的endpoint值。上面的輸出中,bucket 0存放著b列的低值,為了節省空間沒有顯示出1-6號的bucket。但是我們能夠理解,bucket[1-7]裡存放著的endpoint=5,而bucket8裡存放endpoint=10000。因此,實際上bucket0裡包含了1-5之間的所有值,而bucket8裡包含了5-10000之間的所有值,在本例中也就是9996-10000這5個數值。

綜上所述,假如資料是均衡的,沒有必要使用直方圖。如果使用唯一值數量來建立直方圖,Oracle為每個值建立一個bucket;但是假如實際的生產系統中,不能夠為每一個唯一值分配一個bucket時,Oracle採用合適的演算法儘可能將值平均分佈到每個bucket中,剩餘的值放入到最後的bucket。

在對資料不均勻的情況使用直方圖時,如果系統未使用繫結變數,每次sql語句執行都需要硬解析,重新生成執行計劃,這樣得到的總是正確的執行計劃。
如果使用繫結變數,那麼就不需要硬解析,Oracle將總是使用第一次呼叫該語句時生成的執行計劃(bind peeking),不幸的是在11g之前的版本都存在這個問題。
對於使用了繫結變數,蒐集了直方圖資訊,又錯走了執行計劃的系統,真是一個噩夢! 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/21601207/viewspace-678569/,如需轉載,請註明出處,否則將追究法律責任。

相關文章