How Does Using Expression Statistics Differ from Using a Function-Based Index? (Doc ID 1489505.1)

其實不管是表示式統計資訊還是函式索引，都會建立一個虛擬列。表示式統計資訊還有合併列統計資訊都是11g之後的特性，如果一張表中兩個列關聯性比較強，可以建立合併列統計資訊，並收集，提高效率。

In this Document

APPLIES TO:

Oracle Database Backup Service - Version N/A and later
Oracle Database Cloud Exadata Service - Version N/A and later
Oracle Database Cloud Service - Version N/A and later
Oracle Database - Enterprise Edition - Version 11.1.0.6 and later
Oracle Database Cloud Schema Service - Version N/A and later
Information in this document applies to any platform.

GOAL

NOTE: In the images and/or the document content below, the user information and data used represents fictitious data from the Oracle sample schema(s) or Public Documentation delivered with an Oracle database product. Any similarity to actual persons, living or dead, is purely coincidental and not intended in any manner.

Function-based indexes have existed for several releases of Oracle Database. They're very useful when a column appears frequently as either an argument to a function or within an arithmetic expression within the predicate (or WHERE clause). However, it was not until Release 11g that Oracle introduced extended statistics, of which there are two types: multicolumn statistics and expression statistics. Expression statistics allow you to collect statistics on a column involved in a function call or arithmetic expression, even in the absence of a function-based index. You may be asking yourself, "If I already have a function-based index, then do I really need expression statistics, too? Isn't that overkill?" The answer is "Yes, provided that you know how to collect the appropriate statistics on the function-based index."

Walking through an example will pull all of this information into focus. (Note: for purposes of these examples, we'll run with dynamic sampling disabled.)

SOLUTION

Let's create a table T with sample data.
It will have slightly over 75,000 rows.

In a copy of the DBA_OBJECTS view, you'll typically find many more objects of type TABLE than of type DIRECTORY. Therefore, type TABLE could be considered a popular entry for column OBJECT_TYPE. We'll create a function-based index on the function LOWER(object_type). Next, we'll collect statistics on the table.

SQL> CREATE TABLE t
  2  AS SELECT owner, object_name, object_type
  3  FROM dba_objects;
Table created.
SQL> SELECT object_type, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE object_type IN ('TABLE','DIRECTORY')
  4  GROUP BY object_type
  5  ORDER BY 1;
OBJECT_TYPE           HOW_MANY
------------------- ----------
DIRECTORY                    9
TABLE                     74827 
SQL> CREATE INDEX i_obj_type_lower ON t (lower(object_type));
Index created.
SQL> EXEC dbms_stats.gather_table_stats(USER,'T');
PL/SQL procedure successfully completed.

When you search the table for objects of type "directory" (in lowercase), you would expect that the optimizer would choose an index range scan on index i_obj_type_lower, and you're correct. However, why do you see the same execution plan when you search on the lowercase object type of "table"? Wouldn't a full table scan be a better choice? Furthermore, why does the optimizer make the same cardinality estimate (2,519 rows) in each case?

SQL> set autotrace traceonly explain
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'directory'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 1885952798  
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |    36 |   468 |    29 (4)  | 00:00:01 |
|   1 |  HASH GROUP BY               |                  |    36 |   468 |    29 (4)  | 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T                |  2519 | 32747 |    28 (0)  | 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | I_OBJ_TYPE_LOWER |  2519 |       |     6 (0)  | 00:00:01 |
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(LOWER("OBJECT_TYPE")='directory')
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'table'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 1885952798
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |    36 |   468 |    29 (4)  | 00:00:01 |
|   1 |  HASH GROUP BY               |                  |    36 |   468 |    29 (4)  | 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T                |  2519 | 32747 |    28 (0)  | 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | I_OBJ_TYPE_LOWER |  2519 |       |     6 (0)  | 00:00:01 |
Predicate Information (identified by operation id):
--------------------------------------------------- 
   3 - access(LOWER("OBJECT_TYPE")='table')

The problem is that you still haven't given the cost-based optimizer (CBO) all the information that it needs. As it turns out, when you created the function-based index, Oracle Database added a virtual column named "SYS_NC00004$" to the table, but given that the column contains skewed data, you have to provide a histogram on the data distribution. Oracle Database will "learn," after observing a variety of queries against the table, that the histogram is necessary. However, let's create it right away, so that all of our execution plans are as accurate as possible.

SQL> SELECT column_name, num_distinct, histogram, virtual_column,
  2  data_default
  3  FROM user_tab_cols
  4  WHERE table_name='T'
  5  ORDER BY column_id;
COLUMN_NAME     NUM_DISTINCT HISTOGRAM       VIR DATA_DEFAULT
--------------- ------------ --------------- --- --------------------
OWNER                     36 NONE            NO
OBJECT_NAME            46424 NONE            NO
OBJECT_TYPE               30 FREQUENCY       NO
SYS_NC00004$              30 NONE            YES LOWER("OBJECT_TYPE")
SQL> EXEC dbms_stats.gather_table_stats(USER,'T', -
method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY FOR COLUMNS SYS_NC00004$ SIZE 254');
PL/SQL procedure successfully completed.

Now, the CBO has all the information it needs to choose an index range scan for type "directory" and a full table scan for type "table".

SQL> set autotrace traceonly explain
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'directory'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 1885952798
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |    11 |   143 |     3  (34)| 00:00:01 |
|   1 |  HASH GROUP BY               |                  |    11 |   143 |     3  (34)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T                |    14 |   182 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | I_OBJ_TYPE_LOWER |     7 |       |     1   (0)| 00:00:01 |
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(LOWER("OBJECT_TYPE")='directory')
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'table'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    36 |   468 |   136   (5)| 00:00:02 |
|   1 |  HASH GROUP BY     |      |    36 |   468 |   136   (5)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| T    | 74979 |   951K|   132   (2)| 00:00:02 |
   
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LOWER("OBJECT_TYPE")='table')

Note that the cardinality estimates are much more accurate now, too (fourteen for type "directory and 74,979 for type "table").

So far, these steps are nothing new. You have been able to collect statistics on a function-based index since Release 9i. Now, let's discuss expression statistics. When are they necessary? Well, there may be occasions in which you don't need a function-based index as an access path (such as when the table contains an alternative access path -- perhaps an index on another column), but you still would like the CBO to understand the distribution of the column data. Let's assume that we create our table "T" again, but this time we won't create the function-based index.

SQL> CREATE TABLE t
  2  AS SELECT owner, object_name, object_type
  3  FROM dba_objects;
Table created.
SQL> EXEC dbms_stats.gather_table_stats(USER,'T', -
method_opt=>'for all columns size skewonly');
PL/SQL procedure successfully completed.
SQL> set autotrace traceonly explain
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'directory'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    36 |   468 |   133   (3)| 00:00:02 |
|   1 |  HASH GROUP BY     |      |    36 |   468 |   133   (3)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| T    |   756 |  9828 |   132   (2)| 00:00:02 |
Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access(LOWER("OBJECT_TYPE")='directory')
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE LOWER(object_type) = 'table'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    36 |   468 |   133   (3)| 00:00:02 |
|   1 |  HASH GROUP BY     |      |    36 |   468 |   133   (3)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| T    |   756 |  9828 |   132   (2)| 00:00:02 |
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(LOWER("OBJECT_TYPE")='table')

Now, you're in a situation similar to that you encountered before having gathered statistics on the virtual column, which represented the function-based index in the previous example. That is, both cardinality estimates are the same -- this time, the estimate is 756 rows, or approximately one percent of the total row count. Naturally, in this case the execution plan relies upon a full table scan of "T", since there is no function-based index.

The same, inaccurate guess of one percent occurs, even when you involve table "T" in a join -- for example, with the view DBA_OBJECTS.

SQL> set autotrace traceonly explain
SQL> SELECT T1.owner, COUNT(*) HOW_MANY
  2  FROM t T1, dba_objects T2
  3  WHERE T1.owner = T2.owner
  4  AND T1.object_name = T2.object_name
  5  AND T1.object_type = T2. object_type
  6  AND lower(T1.object_type) = 'directory'
  7  GROUP BY T1.owner;
Execution Plan
----------------------------------------------------------        
Plan hash value: 1537169927
 
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |             |     2 |   170 |   405   (4)| 00:00:05 |
|   1 |  HASH GROUP BY                  |             |     2 |   170 |   405   (4)| 00:00:05 |
|*  2 |   HASH JOIN                     |             | 75913 |  6301K|   401   (3)| 00:00:05 |
|*  3 |    TABLE ACCESS FULL            | T           |   756 | 27216 |   132   (2)| 00:00:02 |
|   4 |    VIEW                         | DBA_OBJECTS | 75940 |  3633K|   268   (3)| 00:00:04 |
|   5 |     UNION-ALL                   |             |       |       |            |          |
|*  6 |      TABLE ACCESS BY INDEX ROWID| SUM$        |     1 |     9 |     1   (0)| 00:00:01 |
|*  7 |       INDEX UNIQUE SCAN         | I_SUM$_1    |     1 |       |     0   (0)| 00:00:01 |
|*  8 |      FILTER                     |             |       |       |            |          |
|*  9 |       HASH JOIN                 |             | 75939 |  6229K|   266   (3)| 00:00:04 |
|  10 |        TABLE ACCESS FULL        | USER$       |   104 |  1768 |     3   (0)| 00:00:01 |
|* 11 |        HASH JOIN                |             | 75939 |  4968K|   262   (2)| 00:00:04 |
|  12 |         INDEX FULL SCAN         | I_USER2     |   104 |  2288 |     1   (0)| 00:00:01 |
|* 13 |         TABLE ACCESS FULL       | OBJ$        | 75939 |  3337K|   260   (2)| 00:00:04 |
|  14 |       NESTED LOOPS              |             |     1 |    29 |     2   (0)| 00:00:01 |
|* 15 |        INDEX SKIP SCAN          | I_USER2     |     1 |    20 |     1   (0)| 00:00:01 |
|* 16 |        INDEX RANGE SCAN         | I_OBJ4      |     1 |     9 |     1   (0)| 00:00:01 |
|  17 |      NESTED LOOPS               |             |     1 |    43 |     2   (0)| 00:00:01 |
|  18 |       INDEX FULL SCAN           | I_LINK1     |     1 |    26 |     1   (0)| 00:00:01 |
|  19 |       TABLE ACCESS CLUSTER      | USER$       |     1 |    17 |     1   (0)| 00:00:01 |
|* 20 |        INDEX UNIQUE SCAN        | I_USER#     |     1 |       |     0   (0)| 00:00:01 |

Now, let's create expression statistics on the OBJECT_TYPE column, using the new syntax introduced in 11g.
First, you provide the definition of the expression statistics, and then you collect them.

SQL> SELECT dbms_stats.create_extended_stats(null,'T',  '(LOWER(OBJECT_TYPE))') NEW_VIRT_COL 
  2  FROM dual;
NEW_VIRT_COL
----------------------------------------
SYS_STUVJ5LMUFWDLVURW#_ROXEK1U
SQL> EXEC dbms_stats.gather_table_stats(USER,'T', method_opt=>'for all columns size skewonly');
PL/SQL procedure successfully completed.

Oracle Database also creates a virtual column in this case, but it begins with "SYS_STU" instead of "SYS_NC", as was the case with the function-based index.
Note that Oracle Database already has identified that a histogram is in order in this case.

SQL> SELECT column_name, num_distinct, histogram, virtual_column
  2  FROM user_tab_cols
  3  WHERE table_name='T'
  4  ORDER BY column_id;
COLUMN_NAME                    NUM_DISTINCT HISTOGRAM       VIR
------------------------------ ------------ --------------- ---
OWNER                                    36 FREQUENCY       NO
OBJECT_NAME                           46420 HEIGHT BALANCED NO
OBJECT_TYPE                              30 FREQUENCY       NO
SYS_STUVJ5LMUFWDLVURW#_ROXEK1U           30 FREQUENCY       YES

The new, virtual column name has also been entered into USER_STAT_EXTENSIONS.

SQL> SELECT extension_name, extension
  2  FROM user_stat_extensions
  3  WHERE table_name='T';
EXTENSION_NAME                 EXTENSION
------------------------------ ----------------------
SYS_STUVJ5LMUFWDLVURW#_ROXEK1U (LOWER("OBJECT_TYPE"))

In this simplified example, even though the CBO still has to choose a full table scan on table "T", you'll find that the cardinality estimates are much more accurate: an estimate of fourteen rows of type "directory", and 74,734 of type "table".

SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE lower(object_type) = 'directory'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    11 |   143 |   133   (3)| 00:00:02 |
|   1 |  HASH GROUP BY     |      |    11 |   143 |   133   (3)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| T    |    14 |   182 |   132   (2)| 00:00:02 |
SQL> SELECT owner, COUNT(*) HOW_MANY
  2  FROM t
  3  WHERE lower(object_type) = 'table'
  4  GROUP BY owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 47235625
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    36 |   468 |   136   (5)| 00:00:02 |
|   1 |  HASH GROUP BY     |      |    36 |   468 |   136   (5)| 00:00:02 |
|*  2 |   TABLE ACCESS FULL| T    | 74734 |   948K|   132   (2)| 00:00:02 |

Finally, you'll notice that the CBO devised a different execution plan for the table join with DBA_OBJECTS than the one seen earlier, using the same estimate of fourteen rows of type "directory":

SQL> SELECT T1.owner, COUNT(*) HOW_MANY
  2  FROM t T1, dba_objects T2
  3  WHERE T1.owner = T2.owner
  4  AND T1.object_name = T2.object_name
  5  AND T1.object_type = T2. object_type
  6  AND lower(T1.object_type) = 'directory'
  7  GROUP BY T1.owner;
Execution Plan
----------------------------------------------------------
Plan hash value: 4292311508
| Id  | Operation                           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |             |     2 |   182 |   220   (3)| 00:00:03 |
|   1 |  HASH GROUP BY                      |             |     2 |   182 |   220   (3)| 00:00:03 |
|   2 |   NESTED LOOPS                      |             | 73585 |  6539K|   216   (1)| 00:00:03 |
|*  3 |    TABLE ACCESS FULL                | T           |    14 |   588 |   132   (2)| 00:00:02 |
|*  4 |    VIEW                             | DBA_OBJECTS |     1 |    49 |     6   (0)| 00:00:01 |
|   5 |     UNION ALL PUSHED PREDICATE      |             |       |       |            |          |
|*  6 |      TABLE ACCESS BY INDEX ROWID    | SUM$        |     1 |     9 |     1   (0)| 00:00:01 |
|*  7 |       INDEX UNIQUE SCAN             | I_SUM$_1    |     1 |       |     0   (0)| 00:00:01 |
|*  8 |      FILTER                         |             |       |       |            |          |
|*  9 |       FILTER                        |             |       |       |            |          |
|  10 |        NESTED LOOPS                 |             |     1 |    84 |     5   (0)| 00:00:01 |
|  11 |         NESTED LOOPS                |             |     1 |    62 |     4   (0)| 00:00:01 |
|  12 |          TABLE ACCESS BY INDEX ROWID| USER$       |     1 |    17 |     1   (0)| 00:00:01 |
|* 13 |           INDEX UNIQUE SCAN         | I_USER1     |     1 |       |     0   (0)| 00:00:01 |
|* 14 |          TABLE ACCESS BY INDEX ROWID| OBJ$        |     1 |    45 |     3   (0)| 00:00:01 |
|* 15 |           INDEX RANGE SCAN          | I_OBJ5      |     1 |       |     2   (0)| 00:00:01 |
|* 16 |         INDEX RANGE SCAN            | I_USER2     |     1 |    22 |     1   (0)| 00:00:01 |
|  17 |       NESTED LOOPS                  |             |     1 |    29 |     2   (0)| 00:00:01 |
|* 18 |        INDEX SKIP SCAN              | I_USER2     |     1 |    20 |     1   (0)| 00:00:01 |
|* 19 |        INDEX RANGE SCAN             | I_OBJ4      |     1 |     9 |     1   (0)| 00:00:01 |
|  20 |      NESTED LOOPS                   |             |     1 |    43 |     1   (0)| 00:00:01 |
|  21 |       TABLE ACCESS BY INDEX ROWID   | USER$       |     1 |    17 |     1   (0)| 00:00:01 |
|* 22 |        INDEX UNIQUE SCAN            | I_USER1     |     1 |       |     0   (0)| 00:00:01 |
|* 23 |       INDEX RANGE SCAN              | I_LINK1     |     1 |    26 |     0   (0)| 00:00:01 |

In summary you have seen that if a function-based index exists, and if (in the case of skewed data) you collect statistics on the virtual column representing this index, then expression statistics are unnecessary. However, in the situation in which a function-based index is not necessary, but you still need to provide the CBO with a clear picture of the column's data distribution in the context of a join (or, perhaps, when a choice of other access paths exists on a table -- e.g., indexes on other columns), then expression statistics play a vital role.

【STAT】函式索引和使用表示式統計資訊有什麼不同

APPLIES TO:

GOAL

SOLUTION

相關文章