Oracle Histograms

ysjxjf發表於2008-10-07

From Oracle Documents

Oracle9i Database Performance Tuning Guide and Reference
Release 2 (9.2)
Part Number A96533-02

[@more@]

Using Histograms

The cost-based optimizer can use data value histograms to get accurate estimates of the distribution of column data. A histogram partitions the values in the column into bands, so that all column values in a band fall within the same range. Histograms provide improved selectivity estimates in the presence of data skew, resulting in optimal execution plans with nonuniform data distributions.

One of the fundamental tasks of the cost-based optimizer is determining the selectivity of predicates that appear in queries. Selectivity estimates are used to decide when to use an index and the order in which to join tables. Some attribute domains (a table's columns) are not uniformly distributed.

The cost-based optimizer uses height-based histograms on specified attributes to describe the distributions of nonuniform domains. In a height-based histogram, the column values are divided into bands so that each band contains approximately the same number of values. The useful information that the histogram provides, then, is where in the range of values the endpoints fall.

Consider a column C with values between 1 and 100 and a histogram with 10 buckets. If the data in C is uniformly distributed, then the histogram looks similar to Figure 3-1, where the numbers are the endpoint values.

Figure 3-1 Histogram with Uniform Distribution

Text description of pfgrf208.gif follows
Text description of the illustration pfgrf208.gif


The number of rows in each bucket is one tenth the total number of rows in the table. Four-tenths of the rows have values between 60 and 100 in this example of uniform distribution.

If the data is not uniformly distributed, then the histogram might look similar to Figure 3-2.

Figure 3-2 Histogram with Non-Uniform Distribution

Text description of pfgrf209.gif follows
Text description of the illustration pfgrf209.gif


In this case, most of the rows have the value 5 for the column; only 1/10 of the rows have values between 60 and 100.

When to Use Histograms

Histograms can affect performance and should be used only when they substantially improve query plans. Histogram statistics data is persistent, so the space required to save the data depends on the sample size. In general, create histograms on columns that are used frequently in WHERE clauses of queries and have a highly skewed data distribution. For uniformly distributed data, the cost-based optimizer can make fairly accurate guesses about the cost of executing a particular statement without the use of histograms.

Histograms, like all other optimizer statistics, are static. They are useful only when they reflect the current data distribution of a given column. (The data in the column can change as long as the distribution remains constant.) If the data distribution of a column changes frequently, you must recompute its histogram frequently.

Histograms are not useful for columns with the following characteristics:

  • All predicates on the column use bind variables.
  • The column data is uniformly distributed.
  • The column is unique and is used only with equality predicates.

Creating Histograms

You generate histograms by using the DBMS_STATS package. You can generate histograms for columns of a table or partition. For example, to create a 10-bucket histogram on the SAL column of the emp table, issue the following statement:

EXECUTE DBMS_STATS.GATHER_TABLE_STATS 
('scott','emp', METHOD_OPT => 'FOR COLUMNS SIZE 10 sal');

The SIZE keyword declares the maximum number of buckets for the histogram. You would create a histogram on the SAL column if there were an unusually high number of employees with the same salary and few employees with other salaries. You can also collect histograms for a single partition of a table.

Oracle Corporation recommends using the DBMS_STATS package to have the database automatically decide which columns need histograms. This is done by specifying SIZE AUTO.

See Also:

Oracle9i Supplied PL/SQL Packages and Types Reference for more information on the DBMS_STATS package

Choosing the Number of Buckets for a Histogram

If the number of frequently occurring distinct values in a column is relatively small, then set the number of buckets to be greater than that number. The default number of buckets for a histogram is 75. This value provides an appropriate level of detail for most data distributions. However, because the number of buckets and the data distribution both affect a histogram's usefulness, you might need to experiment with different numbers of buckets to obtain optimal results.

Types of Histograms

There are two types of histograms:

Understanding Height-Based Histograms

Height-based histograms place approximately the same number of values into each range, so that the endpoints of the range are determined by how many values are in that range. Only the last (largest) values in each bucket appear as bucket (end point) values.

Consider that a table's query results in the following four sample values: 4, 18, 30, and 35.

For a height-based histogram, each of these values occupies a portion of one bucket, in proportion to their size. The resulting selectivity is computed with the following formula:

S = Height(35) / Height(4 + 18 + 30 + 35)

Understanding Value-Based Histograms

Value-based histograms are created when the number of distinct values is less than or equal to the number of histogram buckets specified. In value-based histograms, all the values in the column have corresponding buckets, and the bucket number reflects the repetition count of each value. These can also be known as frequency histograms.

Consider the same four sample values in the previous example. In a value-based histogram, a bucket is used to represent each of the four distinct values. In other words, one bucket represents 4, one bucket represents 18, another represents 30, and another represents 35. The resulting selectivity is computed with the following formula:

S = [#rows(35)/(#rows(4) + #rows(18) + #rows(30) + #rows(35))] / #buckets

If there are many different values anticipated for a particular column of table, it is preferable to use the value-based histogram rather than the height-based histogram. This is because if there is much data skew in the height, then the skew can offset the selectivity calculation and give a nonrepresentative selectivity value.

Using Histograms

Example 3-6 illustrates the use of a histogram in order to improve the execution plan and demonstrate the skewed behavior of the s6 indexed column.

Example 3-6 Using a Histogram to Improve an Execution Plan

UPDATE so_lines l 
SET open_flag=null, 
     s6=10, 
     s6_date=sysdate, 
WHERE l.line_type_code in ('REGULAR','DETAIL','RETURN') AND 
      l.open_flag = 'Y' AND NVL(l.shipped_quantity, 0)=0 OR 
      NVL(l.shipped_quantity, 0) != 0 AND 
      l.shipped_quantity +NVL(l.cancelled_quantity, 0)= l.ordered_quantity)) AND 
      l.s6=18

This query shows the skewed distribution of data values for s6. In this case, there are two distinct non-null values: 10 and 18. The majority of the rows consists of s6 = 10 (1,589,464), while a small number of rows consist of s6 = 18 (13,091).

S6:      COUNT(*)
======================
10        1,589,464
18           13,091
NULL         21,889 

The selectivity of column s6, where s6 = 18:

S = 13,091 / (13,091 + 1,589,464) = 0.008

If No Histogram is Used: The selectivity of column s6 is assumed to be 50%, uniformly distributed across 10 and 18. This is not selective; therefore, s6 is not an ideal choice for use as an index.

If a Histogram is Used: The data distribution information is stored in the dictionary. This allows the optimizer to use this information and compute the correct selectivity based on the data distribution. In Example 3-6, the selectivity, based on the histogram data, is 0.008. This is a relatively high, or good, selectivity, which leads the optimizer to use an index on column s6 in the execution plan.

Viewing Histograms

To view histogram information, query the appropriate data dictionary view (USER_, ALL_, or DBA_). The following list shows the DBA_ views:

  • DBA_HISTOGRAMS
  • DBA_PART_HISTOGRAMS
  • DBA_SUBPART_HISTOGRAMS
  • DBA_TAB_COL_STATISTICS
Number of Rows

View the DBA_HISTOGRAMS dictionary table for the number of buckets (in other words, the number of rows) for each column:

  • ENDPOINT_NUMBER
  • ENDPOINT_VALUE

    See Also:

    Oracle9i Database Reference for column descriptions of data dictionary views, as well as histogram use and restrictions

Verifying Histogram Statistics

To verify that histogram statistics are available, query the data dictionary's DBA_HISTOGRAMS table, using a statement similar to Example 3-7.

Example 3-7 Verifying Histogram Statistics

SQL> SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE 
     FROM DBA_HISTOGRAMS 
     WHERE TABLE_NAME ="SO_LINES_ALL" AND COLUMN_NAME="S2" 
     ORDER BY ENDPOINT_NUMBER; 
 

This query returns the following typical data:

ENDPOINT_NUMBER   ENDPOINT_VALUE
---------------  ---------------
          1365                 4
          1370                 5
          2124                 8
          2228                18

One row corresponds to one bucket in the histogram. Consider the differences between ENDPOINT_NUMBER values in Example 3-7 listed in Table 3-5.

Table 3-5 ENDPOINT_NUMBER Differences
Bucket (values)ENDPOINT_NUMBER DifferenceNumber of Values in Bucket

1 (0 to 4)

N/A

N/A

2 (4 to 5)

1370 - 1365

5

3 (5 to 8)

2124 - 1370

754

4 (8 to 18)

2228 - 2124

104

Table 3-5 shows that the buckets hold very different numbers of values. The data is skewed: 754 values are between 5 and 8, but only 104 are between 8 and 18. More buckets should be used.

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/66634/viewspace-1011565/,如需轉載,請註明出處,否則將追究法律責任。

相關文章