18 DISTINCT Optimization
DISTINCT combined with ORDER BY needs a temporary table in many cases.
Because DISTINCT may use GROUP BY, learn how MySQL works with columns in ORDER BY or HAVING clauses that are not part of the selected columns.
In most cases, a DISTINCT clause can be considered as a special case of GROUP BY. For example, the following two queries are equivalent:
SELECT DISTINCT c1, c2, c3 FROM t1
WHERE c1 > const;

SELECT c1, c2, c3 FROM t1
WHERE c1 > const GROUP BY c1, c2, c3;
Due to this equivalence, the optimizations applicable to GROUP BY queries can also be applied to queries with a DISTINCT clause.
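The equivalence can be checked on a small data set. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a MySQL server; it assumes that SQLite and MySQL agree on the result of these two simple queries. The table contents and the constant 1 are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INT, c2 INT, c3 INT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 5, 7), (2, 5, 7), (3, 1, 4), (3, 1, 4), (3, 2, 4)],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT c1, c2, c3 FROM t1 WHERE c1 > 1"
).fetchall()
grouped_rows = conn.execute(
    "SELECT c1, c2, c3 FROM t1 WHERE c1 > 1 GROUP BY c1, c2, c3"
).fetchall()

# Row order is unspecified without ORDER BY, so compare sorted lists.
same_result = sorted(distinct_rows) == sorted(grouped_rows)
print(same_result)
```

Both statements remove the duplicate rows and leave the same three unique combinations.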
When combining LIMIT row_count with DISTINCT, MySQL stops as soon as it finds row_count unique rows.
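The early stop can be pictured with a short Python sketch. It is a conceptual model only, not the server's implementation; the function name distinct_limit and the sample rows are hypothetical.

```python
def distinct_limit(rows, row_count):
    """Return up to row_count unique rows and the number of rows scanned."""
    seen = set()
    result = []
    scanned = 0
    if row_count <= 0:
        return result, scanned
    for row in rows:
        scanned += 1
        if row in seen:
            continue  # duplicate, does not count toward row_count
        seen.add(row)
        result.append(row)
        if len(result) == row_count:
            break  # enough unique rows found, stop reading
    return result, scanned


rows = [(1,), (1,), (2,), (2,), (3,), (4,), (5,)]
unique_rows, scanned = distinct_limit(rows, 3)
print(unique_rows, scanned)
```

In this sample, the last two input rows are never read because the third unique row has already been found.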
If you do not use columns from all tables named in a query, MySQL stops scanning any unused tables as soon as it finds the first match. In the following case, assuming that t1 is used before t2 (which you can check with EXPLAIN), MySQL stops reading from t2 (for any particular row in t1) when it finds the first row in t2:
SELECT DISTINCT t1.a FROM t1, t2 WHERE t1.a=t2.a;
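The first-match behavior can be modeled as a nested loop that leaves the inner loop at the first matching row. This is only a sketch under the assumption that t1 is the outer table and that t2 is read sequentially (a real server may use an index instead); the function name and sample values are hypothetical.

```python
def distinct_join_first_match(t1_values, t2_values):
    """Model of SELECT DISTINCT t1.a FROM t1, t2 WHERE t1.a = t2.a."""
    result = []
    seen = set()
    t2_reads = 0
    for a in t1_values:          # outer table t1
        for b in t2_values:      # inner table t2
            t2_reads += 1
            if a == b:
                if a not in seen:
                    seen.add(a)
                    result.append(a)
                break            # first match found, stop reading t2 for this t1 row
    return result, t2_reads


t1_values = [1, 2, 3]
t2_values = [1, 1, 1, 2, 2, 3]
values, t2_reads = distinct_join_first_match(t1_values, t2_values)
print(values, t2_reads)
```

Without the break, the model would read all 6 rows of t2 for each of the 3 rows of t1, 18 reads in total.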
19 LIMIT Query Optimization
If you need only a specified number of rows from a result set, use a LIMIT clause in the query, rather than fetching the whole result set and throwing away the extra data.
MySQL sometimes optimizes a query that has a LIMIT row_count clause and no HAVING clause:
• If you select only a few rows with LIMIT, MySQL uses indexes in some cases when normally it would prefer to do a full table scan.
• If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
One manifestation of this behavior is that an ORDER BY query with and without LIMIT may return rows in different order, as described later in this section.
• If you combine LIMIT row_count with DISTINCT, MySQL stops as soon as it finds row_count unique rows.
• In some cases, a GROUP BY can be resolved by reading the index in order (or doing a sort on the index), then calculating summaries until the index value changes. In this case, LIMIT row_count does not calculate any unnecessary GROUP BY values.
• As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS. In that case, the number of rows can be retrieved with SELECT FOUND_ROWS().
• LIMIT 0 quickly returns an empty set. This can be useful for checking the validity of a query. It can also be employed to obtain the types of the result columns within applications that use a MySQL API that makes result set metadata available. With the mysql client program, you can use the --column-type-info option to display result column types.
• If the server uses temporary tables to resolve a query, it uses the LIMIT row_count clause to calculate how much space is required.
• If an index is not used for ORDER BY but a LIMIT clause is also present, the optimizer may be able to avoid using a merge file and sort the rows in memory using an in-memory filesort operation.
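The idea of sorting in memory when only the first row_count rows are needed can be pictured as a bounded selection rather than a full sort. The sketch below uses Python's heapq.nsmallest for that purpose; it illustrates the general technique and is not a description of the server's filesort code. The sample rows (id, category, rating) and the function name are illustrative assumptions.

```python
import heapq

# Illustrative rows: (id, category, rating)
ratings = [
    (1, 1, 4.5), (2, 3, 5.0), (3, 2, 3.7), (4, 2, 3.5),
    (5, 1, 3.2), (6, 2, 3.5), (7, 3, 2.7),
]


def order_by_limit(rows, key, row_count):
    """Return the first row_count rows in key order without sorting all rows."""
    return heapq.nsmallest(row_count, rows, key=key)


# Equivalent in spirit to ORDER BY category, id LIMIT 3
top3 = order_by_limit(ratings, key=lambda r: (r[1], r[0]), row_count=3)
print(top3)
```

The selection keeps at most row_count candidates at a time, which is why the space needed depends on the LIMIT value rather than on the size of the full result.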
If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns.
One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders. Consider this query, which is sorted by the category column but nondeterministic with respect to the id and rating columns:
mysql> SELECT * FROM ratings ORDER BY category;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
|  1 |        1 |    4.5 |
|  5 |        1 |    3.2 |
|  3 |        2 |    3.7 |
|  4 |        2 |    3.5 |
|  6 |        2 |    3.5 |
|  2 |        3 |    5.0 |
|  7 |        3 |    2.7 |
+----+----------+--------+
Including LIMIT may affect order of rows within each category value. For example, this is a valid query result:
mysql> SELECT * FROM ratings ORDER BY category LIMIT 5;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
|  1 |        1 |    4.5 |
|  5 |        1 |    3.2 |
|  4 |        2 |    3.5 |
|  3 |        2 |    3.7 |
|  6 |        2 |    3.5 |
+----+----------+--------+
In each case, the rows are sorted by the ORDER BY column, which is all that is required by the SQL standard.
If it is important to ensure the same row order with and without LIMIT, include additional columns in the ORDER BY clause to make the order deterministic. For example, if id values are unique, you can make rows for a given category value appear in id order by sorting like this:
mysql> SELECT * FROM ratings ORDER BY category, id;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
|  1 |        1 |    4.5 |
|  5 |        1 |    3.2 |
|  3 |        2 |    3.7 |
|  4 |        2 |    3.5 |
|  6 |        2 |    3.5 |
|  2 |        3 |    5.0 |
|  7 |        3 |    2.7 |
+----+----------+--------+

mysql> SELECT * FROM ratings ORDER BY category, id LIMIT 5;
+----+----------+--------+
| id | category | rating |
+----+----------+--------+
|  1 |        1 |    4.5 |
|  5 |        1 |    3.2 |
|  3 |        2 |    3.7 |
|  4 |        2 |    3.5 |
|  6 |        2 |    3.5 |
+----+----------+--------+
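The effect of the tie-breaking column can be reproduced in a small script. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so that the example is self-contained); the rows are the ones shown in the ratings table above.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (id INT PRIMARY KEY, category INT, rating REAL)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        (1, 1, 4.5), (2, 3, 5.0), (3, 2, 3.7), (4, 2, 3.5),
        (5, 1, 3.2), (6, 2, 3.5), (7, 3, 2.7),
    ],
)

full = conn.execute(
    "SELECT * FROM ratings ORDER BY category, id"
).fetchall()
limited = conn.execute(
    "SELECT * FROM ratings ORDER BY category, id LIMIT 5"
).fetchall()

# With a unique tie-breaker, the LIMIT result must be a prefix of the full result.
is_prefix = limited == full[:5]
print(is_prefix)
```

Because (category, id) is unique for every row, only one ordering is valid, so the limited result cannot differ from the start of the unlimited one.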
For a query with an ORDER BY or GROUP BY and a LIMIT clause, the optimizer tries to choose an ordered index by default when it appears doing so would speed up query execution. Prior to MySQL 8.0.21, there was no way to override this behavior, even in cases where using some other optimization might be faster. Beginning with MySQL 8.0.21, it is possible to turn off this optimization by setting the optimizer_switch system variable's prefer_ordering_index flag to off.
Example: First we create and populate a table t as shown here:
# Create and populate a table t:

mysql> CREATE TABLE t (
    ->     id1 BIGINT NOT NULL,
    ->     id2 BIGINT NOT NULL,
    ->     c1 VARCHAR(50) NOT NULL,
    ->     c2 VARCHAR(50) NOT NULL,
    ->     PRIMARY KEY (id1),
    ->     INDEX i (id2, c1)
    -> );

# [Insert some rows into table t - not shown]
Verify that the prefer_ordering_index flag is enabled:
mysql> SELECT @@optimizer_switch LIKE '%prefer_ordering_index=on%';
+------------------------------------------------------+
| @@optimizer_switch LIKE '%prefer_ordering_index=on%' |
+------------------------------------------------------+
|                                                    1 |
+------------------------------------------------------+
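An application that has already fetched the optimizer_switch value can make the same check on the client side. The helper below is a hypothetical sketch that relies only on the value being a comma-separated list of flag=setting pairs; the sample string is shortened and made up, and a real server reports many more flags.

```python
def parse_optimizer_switch(value):
    """Split an optimizer_switch string into a {flag: bool} dictionary."""
    flags = {}
    for item in value.split(","):
        name, _, setting = item.strip().partition("=")
        if name:
            flags[name] = (setting == "on")
    return flags


# Shortened, made-up sample value for illustration only.
sample = "index_merge=on,derived_merge=on,prefer_ordering_index=off"
flags = parse_optimizer_switch(sample)
print(flags["prefer_ordering_index"])
```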
Since the following query has a LIMIT clause, we expect it to use an ordered index, if possible. In this case, as we can see from the EXPLAIN output, it uses the table's primary key.
mysql> EXPLAIN SELECT c2 FROM t
    ->     WHERE id2 > 3
    ->     ORDER BY id1 ASC LIMIT 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: index
possible_keys: i
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 2
     filtered: 70.00
        Extra: Using where
Now we disable the prefer_ordering_index flag, and re-run the same query; this time it uses the index i (which includes the id2 column used in the WHERE clause), and a filesort:
mysql> SET optimizer_switch = "prefer_ordering_index=off";

mysql> EXPLAIN SELECT c2 FROM t
    ->     WHERE id2 > 3
    ->     ORDER BY id1 ASC LIMIT 2\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: range
possible_keys: i
          key: i
      key_len: 8
          ref: NULL
         rows: 14
     filtered: 100.00
        Extra: Using index condition; Using filesort
20 Function Call Optimization
MySQL functions are tagged internally as deterministic or nondeterministic. A function is nondeterministic if, given fixed values for its arguments, it can return different results for different invocations. Examples of nondeterministic functions: RAND(), UUID().
If a function is tagged nondeterministic, a reference to it in a WHERE clause is evaluated for every row (when selecting from one table) or combination of rows (when selecting from a multiple-table join).
MySQL also determines when to evaluate functions based on types of arguments, whether the arguments are table columns or constant values. A deterministic function that takes a table column as argument must be evaluated whenever that column changes value.
Nondeterministic functions may affect query performance. For example, some optimizations may not be available, or more locking might be required. The following discussion uses RAND() but applies to other nondeterministic functions as well.
Suppose that a table t has this definition:
CREATE TABLE t (id INT NOT NULL PRIMARY KEY, col_a VARCHAR(100));
Consider these two queries:
SELECT * FROM t WHERE id = POW(1,2);
SELECT * FROM t WHERE id = FLOOR(1 + RAND() * 49);
Both queries appear to use a primary key lookup because of the equality comparison against the primary key, but that is true only for the first of them:
• The first query always produces a maximum of one row because POW() with constant arguments is a constant value and is used for index lookup.
• The second query contains an expression that uses the nondeterministic function RAND(), which is not constant in the query but in fact has a new value for every row of table t. Consequently, the query reads every row of the table, evaluates the predicate for each row, and outputs all rows for which the primary key matches the random value. This might be zero, one, or multiple rows, depending on the id column values and the values in the RAND() sequence.
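The difference between the two cases can be simulated outside the database. The following Python sketch is a model only: it assumes a hypothetical table whose id values are 1 through 49 and uses Python's random module in place of RAND(). One function re-evaluates the expression for every row, the other evaluates it once and treats the result as a constant.

```python
import math
import random

ids = list(range(1, 50))  # hypothetical table with id values 1..49


def select_per_row(ids, rng):
    """WHERE id = FLOOR(1 + RAND() * 49), with RAND() re-evaluated for each row."""
    return [i for i in ids if i == math.floor(1 + rng.random() * 49)]


def select_constant(ids, rng):
    """Same predicate, but the expression is evaluated once, like a constant."""
    keyval = math.floor(1 + rng.random() * 49)
    return [i for i in ids if i == keyval]


# Run both variants with 50 different seeds and record how many rows match.
per_row_counts = [
    len(select_per_row(ids, random.Random(seed))) for seed in range(50)
]
constant_counts = [
    len(select_constant(ids, random.Random(seed))) for seed in range(50)
]
print(sorted(set(per_row_counts)), sorted(set(constant_counts)))
```

In this model, the constant variant always matches exactly one row, whereas the per-row variant yields a varying number of matches across runs.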
The effects of nondeterminism are not limited to SELECT statements. This UPDATE statement uses a nondeterministic function to select rows to be modified:
UPDATE t SET col_a = some_expr WHERE id = FLOOR(1 + RAND() * 49);
Presumably the intent is to update at most a single row for which the primary key matches the expression. However, it might update zero, one, or multiple rows, depending on the id column values and the values in the RAND() sequence.
The behavior just described has implications for performance and replication:
• Because a nondeterministic function does not produce a constant value, the optimizer cannot use strategies that might otherwise be applicable, such as index lookups. The result may be a table scan.
• InnoDB might escalate to a range-key lock rather than taking a single row lock for one matching row.
• Updates that do not execute deterministically are unsafe for replication.
The difficulties stem from the fact that the RAND() function is evaluated once for every row of the table. To avoid multiple function evaluations, use one of these techniques:
• Move the expression containing the nondeterministic function to a separate statement, saving the value in a variable. In the original statement, replace the expression with a reference to the variable, which the optimizer can treat as a constant value:
SET @keyval = FLOOR(1 + RAND() * 49);
UPDATE t SET col_a = some_expr WHERE id = @keyval;
• Assign the random value to a variable in a derived table. This technique causes the variable to be assigned a value, once, prior to its use in the comparison in the WHERE clause:
UPDATE /*+ NO_MERGE(dt) */ t, (SELECT FLOOR(1 + RAND() * 49) AS r) AS dt
SET col_a = some_expr WHERE id = dt.r;
As mentioned previously, a nondeterministic expression in the WHERE clause might prevent optimizations and result in a table scan. However, it may be possible to partially optimize the WHERE clause if other expressions are deterministic. For example:
SELECT * FROM t WHERE partial_key=5 AND some_column=RAND();
If the optimizer can use partial_key to reduce the set of rows selected, RAND() is executed fewer times, which diminishes the effect of nondeterminism on optimization.
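The reduction in RAND() calls can be counted in a small model. The sketch below assumes a hypothetical table of 1,000 rows in which one row in ten has partial_key = 5, and it assumes that a full scan evaluates RAND() for every row read; the list comprehension stands in for an index lookup on partial_key.

```python
import random

# Hypothetical rows: (partial_key, some_column)
rows = [(k % 10, float(k)) for k in range(1000)]


def rand_evaluations(rows, use_partial_key_index):
    """Count RAND() calls for WHERE partial_key = 5 AND some_column = RAND()."""
    rng = random.Random(0)
    if use_partial_key_index:
        # Stand-in for index access: only rows with partial_key = 5 are read.
        candidates = [r for r in rows if r[0] == 5]
    else:
        # Full scan: every row is read (assumed to evaluate RAND() each time).
        candidates = rows
    calls = 0
    for partial_key, some_column in candidates:
        calls += 1
        value = rng.random()
        _ = (partial_key == 5 and some_column == value)
    return calls


full_scan_calls = rand_evaluations(rows, use_partial_key_index=False)
indexed_calls = rand_evaluations(rows, use_partial_key_index=True)
print(full_scan_calls, indexed_calls)
```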
21 Window Function Optimization
Window functions affect the strategies the optimizer considers:
• Derived table merging for a subquery is disabled if the subquery has window functions. The subquery is always materialized.
• Semijoins are not applicable to window function optimization because semijoins apply to subqueries in WHERE and JOIN ... ON, which cannot contain window functions.
• The optimizer processes multiple windows that have the same ordering requirements in sequence, so sorting can be skipped for windows following the first one.
• The optimizer makes no attempt to merge windows that could be evaluated in a single step (for example, when multiple OVER clauses contain identical window definitions). The workaround is to define the window in a WINDOW clause and refer to the window name in the OVER clauses.
An aggregate function not used as a window function is aggregated in the outermost possible query. For example, in this query, MySQL sees that COUNT(t1.b) is something that cannot exist in the outer query because of its placement in the WHERE clause:
SELECT * FROM t1 WHERE t1.a = (SELECT COUNT(t1.b) FROM t2);
Consequently, MySQL aggregates inside the subquery, treating t1.b as a constant and returning the count of rows of t2.
Replacing WHERE with HAVING results in an error:
mysql> SELECT * FROM t1 HAVING t1.a = (SELECT COUNT(t1.b) FROM t2);
ERROR 1140 (42000): In aggregated query without GROUP BY, expression #1
of SELECT list contains nonaggregated column 'test.t1.a'; this is
incompatible with sql_mode=only_full_group_by
The error occurs because COUNT(t1.b) can exist in the HAVING, and so makes the outer query aggregated.
Window functions (including aggregate functions used as window functions) do not have the preceding complexity. They always aggregate in the subquery where they are written, never in the outer query.
Window function evaluation may be affected by the value of the windowing_use_high_precision system variable, which determines whether to compute window operations without loss of precision. By default, windowing_use_high_precision is enabled.
For some moving frame aggregates, the inverse aggregate function can be applied to remove values from the aggregate. This can improve performance but possibly with a loss of precision. For example, adding a very small floating-point value to a very large value causes the very small value to be “hidden” by the large value. When inverting the large value later, the effect of the small value is lost.
Loss of precision due to inverse aggregation is a factor only for operations on floating-point (approximate-value) data types. For other types, inverse aggregation is safe; this includes DECIMAL, which permits a fractional part but is an exact-value type.
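The precision loss is easy to reproduce outside the database. The following Python sketch computes a moving SUM over a two-row frame in two ways: by inverse aggregation (add the entering value, subtract the leaving value) and by recomputing each frame from scratch. Python's float stands in for a floating-point column and decimal.Decimal for an exact-value type such as DECIMAL; the function names and sample values are illustrative assumptions, not the server's implementation.

```python
from decimal import Decimal


def moving_sum_inverse(values, width):
    """Moving SUM via inverse aggregation: add the entering value, subtract the leaving one."""
    running = type(values[0])(0)
    out = []
    for i, v in enumerate(values):
        running += v
        if i >= width:
            running -= values[i - width]  # "invert" the value leaving the frame
        out.append(running)
    return out


def moving_sum_recompute(values, width):
    """Moving SUM recomputed from scratch for every frame."""
    out = []
    for i in range(len(values)):
        total = type(values[0])(0)
        for v in values[max(0, i - width + 1): i + 1]:
            total += v
        out.append(total)
    return out


floats = [1e16, 1.0, 1.0, 1.0]
decimals = [Decimal("1e16"), Decimal(1), Decimal(1), Decimal(1)]

float_inverse = moving_sum_inverse(floats, 2)
float_recomputed = moving_sum_recompute(floats, 2)
decimal_inverse = moving_sum_inverse(decimals, 2)
decimal_recomputed = moving_sum_recompute(decimals, 2)

print(float_inverse[2:], float_recomputed[2:])
print(decimal_inverse == decimal_recomputed)
```

For the float input, the small value 1.0 is absorbed when it is added to 1e16, so subtracting 1e16 later leaves 0.0 where the recomputed frame gives 2.0. The Decimal input produces identical results with both methods.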
For faster execution, MySQL always uses inverse aggregation when it is safe:
• For floating-point values, inverse aggregation is not always safe and might result in loss of precision. The default is to avoid inverse aggregation, which is slower but preserves precision. If it is permissible to sacrifice safety for speed, windowing_use_high_precision can be disabled to permit inverse aggregation.
• For nonfloating-point data types, inverse aggregation is always safe and is used regardless of the windowing_use_high_precision value.
• windowing_use_high_precision has no effect on MIN() and MAX(), which do not use inverse aggregation in any case.
For evaluation of the variance functions STDDEV_POP(), STDDEV_SAMP(), VAR_POP(), VAR_SAMP(), and their synonyms, evaluation can occur in optimized mode or default mode. Optimized mode may produce slightly different results in the last significant digits. If such differences are permissible, windowing_use_high_precision can be disabled to permit optimized mode.
For EXPLAIN, windowing execution plan information is too extensive to display in traditional output format. To see windowing information, use EXPLAIN FORMAT=JSON and look for the windowing element.
22 Row Constructor Expression Optimization
Row constructors permit simultaneous comparisons of multiple values. For example, these two statements are semantically equivalent:
SELECT * FROM t1 WHERE (column1,column2) = (1,1);
SELECT * FROM t1 WHERE column1 = 1 AND column2 = 1;
In addition, the optimizer handles both expressions the same way.
The optimizer is less likely to use available indexes if the row constructor columns do not cover the prefix of an index. Consider the following table, which has a primary key on (c1, c2, c3):
CREATE TABLE t1 (
  c1 INT, c2 INT, c3 INT, c4 CHAR(100),
  PRIMARY KEY(c1,c2,c3)
);
In this query, the WHERE clause uses all columns in the index. However, the row constructor itself does not cover an index prefix, with the result that the optimizer uses only c1 (key_len=4, the size of c1):
mysql> EXPLAIN SELECT * FROM t1
       WHERE c1=1 AND (c2,c3) > (1,1)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 3
     filtered: 100.00
        Extra: Using where
In such cases, rewriting the row constructor expression using an equivalent nonconstructor expression may result in more complete index use. For the given query, the row constructor and equivalent nonconstructor expressions are:
(c2,c3) > (1,1)
c2 > 1 OR ((c2 = 1) AND (c3 > 1))
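The equivalence of the two expressions can be verified by brute force. The sketch below relies on the fact that Python tuples, like row constructors, compare element by element from left to right; it checks all integer pairs in a small range and does not model NULL values, which SQL treats specially.

```python
from itertools import product


def row_constructor(c2, c3):
    # (c2, c3) > (1, 1): tuples compare lexicographically
    return (c2, c3) > (1, 1)


def expanded(c2, c3):
    # c2 > 1 OR ((c2 = 1) AND (c3 > 1))
    return c2 > 1 or (c2 == 1 and c3 > 1)


# Collect every pair for which the two forms disagree.
mismatches = [
    (c2, c3)
    for c2, c3 in product(range(-3, 4), repeat=2)
    if row_constructor(c2, c3) != expanded(c2, c3)
]
print(mismatches)
```

An empty list means that the two forms agree for every pair that was tested.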
Rewriting the query to use the nonconstructor expression results in the optimizer using all three columns in the index (key_len=12):
mysql> EXPLAIN SELECT * FROM t1
       WHERE c1 = 1 AND (c2 > 1 OR ((c2 = 1) AND (c3 > 1)))\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 12
          ref: NULL
         rows: 3
     filtered: 100.00
        Extra: Using where
Thus, for better results, avoid mixing row constructors with AND/OR expressions. Use one or the other.
Under certain conditions, the optimizer can apply the range access method to IN() expressions that have row constructor arguments.
23 Avoiding Full Table Scans
The output from EXPLAIN shows ALL in the type column when MySQL uses a full table scan to resolve a query. This usually happens under the following conditions:
• The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
• There are no usable restrictions in the ON or WHERE clause for indexed columns.
• You are comparing indexed columns with constant values and MySQL has calculated (based on the index tree) that the constants cover too large a part of the table and that a table scan would be faster.
• You are using a key with low cardinality (many rows match the key value) through another column. In this case, MySQL assumes that using the key probably requires many key lookups and that a table scan would be faster.
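When EXPLAIN output is collected by a script, the check for ALL in the type column can be automated. The helper below is a hypothetical sketch that assumes each EXPLAIN row has already been fetched as a dictionary keyed by column name; the sample rows are made up and reduced to the columns used here.

```python
def full_scan_tables(explain_rows):
    """Return the names of tables whose EXPLAIN row reports type ALL."""
    return [row["table"] for row in explain_rows if row.get("type") == "ALL"]


# Hypothetical EXPLAIN output, reduced to three columns.
explain_rows = [
    {"table": "t1", "type": "ALL", "key": None},
    {"table": "t2", "type": "ref", "key": "index_for_column"},
]
scanned_tables = full_scan_tables(explain_rows)
print(scanned_tables)
```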
For small tables, a table scan often is appropriate and the performance impact is negligible. For large tables, try the following techniques to avoid having the optimizer incorrectly choose a table scan:
• Use ANALYZE TABLE tbl_name to update the key distributions for the scanned table.
• Use FORCE INDEX for the scanned table to tell MySQL that table scans are very expensive compared to using the given index:
SELECT * FROM t1, t2 FORCE INDEX (index_for_column) WHERE t1.col_name=t2.col_name;
• Start mysqld with the --max-seeks-for-key=1000 option or use SET max_seeks_for_key=1000 to tell the optimizer to assume that no key scan causes more than 1,000 key seeks.