PostgreSQL10.0preview功能增強-自由定義統計資訊維度

德哥發表於2017-03-30

標籤

PostgreSQL , 10.0 , 統計資訊 , 自由定義 , 自由分組統計資訊


背景

統計資訊是資料庫優化器用來計算SQL plan tree中每個NODE成本的依據,目前PostgreSQL的統計資訊是開源資料庫中最完備的。

包括繼承表、以及全域性的統計資訊,每個欄位的NULL值個數或百分比,平均欄位寬度,唯一值個數或百分比,高頻詞,高頻詞的百分比,柱狀圖,線性相關性,陣列欄位的高頻元素,高頻元素的百分比,陣列元素的柱狀圖。

          View "pg_catalog.pg_stats"  
         Column         |   Type   | Modifiers   
------------------------+----------+-----------  
 schemaname             | name     |   
 tablename              | name     |   
 attname                | name     |   
 inherited              | boolean  |   
 null_frac              | real     |   
 avg_width              | integer  |   
 n_distinct             | real     |   
 most_common_vals       | anyarray |   
 most_common_freqs      | real[]   |   
 histogram_bounds       | anyarray |   
 correlation            | real     |   
 most_common_elems      | anyarray |   
 most_common_elem_freqs | real[]   |   
 elem_count_histogram   | real[]   |   

實際上統計資訊已經非常完備了,10.0提出了一個更大膽的統計資訊,自定義統計資訊。

比如我們將多個欄位作為一個分組,生成分組的統計資訊。有什麼用呢?

例如select count(distinct a,b) from tbl, select count(*),a,b from tbl group by a,b;

以上SQL,如果有a,b分組的統計資訊,有多少唯一值,線性相關性如何。優化器就可以評估group by需要多少記憶體,結果集大概多大。

是的,現在PostgreSQL 10.0允許你定義欄位分組的統計資訊了。是不是很開森呢?

你甚至可以用它來做多個欄位組合的TOP N的評估呢。

Implement multivariate n-distinct coefficients  
  
Add support for explicitly declared statistic objects (CREATE  
STATISTICS), allowing collection of statistics on more complex  
combinations that individual table columns.  Companion commands DROP  
STATISTICS and ALTER STATISTICS ... OWNER TO / SET SCHEMA / RENAME are  
added too.  All this DDL has been designed so that more statistic types  
can be added later on, such as multivariate most-common-values and  
multivariate histograms between columns of a single table, leaving room  
for permitting columns on multiple tables, too, as well as expressions.  
  
This commit only adds support for collection of n-distinct coefficient  
on user-specified sets of columns in a single table.  This is useful to  
estimate number of distinct groups in GROUP BY and DISTINCT clauses;  
estimation errors there can cause over-allocation of memory in hashed  
aggregates, for instance, so it`s a worthwhile problem to solve.  A new  
special pseudo-type pg_ndistinct is used.  
  
(num-distinct estimation was deemed sufficiently useful by itself that  
this is worthwhile even if no further statistic types are added  
immediately; so much so that another version of essentially the same  
functionality was submitted by Kyotaro Horiguchi:  
https://postgr.es/m/20150828.173334.114731693.horiguchi.kyotaro@lab.ntt.co.jp  
though this commit does not use that code.)  
  
Author: Tomas Vondra.  Some code rework by Álvaro.  
Reviewed-by: Dean Rasheed, David Rowley, Kyotaro Horiguchi, Jeff Janes,  
    Ideriha Takeshi  
Discussion: https://postgr.es/m/543AFA15.4080608@fuzzy.cz  
    https://postgr.es/m/20170320190220.ixlaueanxegqd5gr@alvherre.pgsql  

新增的系統表

pg_statistic_ext  

新增的語法

CREATE STATISTICS  

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=doc/src/sgml/ref/create_statistics.sgml;h=60184a347bf72c4e4d075976f2959bd2b9c8e99b;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=doc/src/sgml/catalogs.sgml;h=ac39c639edcbbc05b3b18c31cc5ab6b99b93efeb;hp=c531c73aac9f661cad9e3c03e0f16197932edfd4;hb=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b;hpb=f120b614e070aed39586d1443193738a149a90d4

這個patch的討論,詳見郵件組,本文末尾URL。

PostgreSQL社群的作風非常嚴謹,一個patch可能在郵件組中討論幾個月甚至幾年,根據大家的意見反覆的修正,patch合併到master已經非常成熟,所以PostgreSQL的穩定性也是遠近聞名的。

參考

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7b504eb282ca2f5104b5c00b4f05a3ef6bb1385b


相關文章