Greenplum自動統計資訊收集-暨統計資訊不準引入的broadcastmotion一例
標籤
PostgreSQL , Greenplum , 統計資訊 , 自動統計資訊 , broadcast motion , 執行計劃
背景
資料庫執行計劃的好壞,與資料庫的SQL優化器息息相關。Greenplum有兩套優化器,legacy query optimizer 與 ORCA。
這兩個優化器都是CBO優化器,都需要依賴統計資訊,如果統計資訊不準確,可能生成的執行計劃就不準確。
例如我們有一個這樣的QUERY,發現怎麼跑都跑不出來。
觀察執行計劃,發現有一個節點用到了broadcast motion,也就是廣播。
-> Broadcast Motion 512:512 (slice1; segments: 512) (cost=0.00..6.13 rows=1 width=16)
-> Append-only Columnar Scan on xxxx (cost=0.00..1.00 rows=1 width=16)
當JOIN欄位並非分佈鍵時,Greenplum會根據表的大小,選擇重分佈或廣播。(小表廣播,大表的話多階段JOIN)。
《HybridDB PostgreSQL “Sort、Group、distinct 聚合、JOIN” 不懼怕資料傾斜的黑科技和原理 – 多階段聚合》
而這個執行計劃跑偏的SQL,恰恰是在一個大表上觸發了廣播(broadcast motion),這不正常。
select count(*) xxxx;
返回有31億資料。
查詢pg_class,值有1條記錄,佔用0個BLOCK
select * from pg_class where relname=`xxxx`;
relpages | 1
reltuples | 0
執行analyze,收集統計資訊,執行計劃恢復。
(Greenplum對於超級大表,收集統計資訊時,會構建臨時表來進行取樣分析)
digoal=> analyze verbose xxxxxxxx;
INFO: Executing SQL: select sum(gp_statistics_estimate_reltuples_relpages_oid(c.oid))::float4[] from gp_dist_random(`pg_class`) c where c.oid=112293
INFO: ANALYZE estimated reltuples=3091824896.000000, relpages=47509120.000000 for table xxxxxxxx
INFO: ANALYZE building sample table of size 173762 on table xxxxxxxx because it has too many rows.
INFO: Executing SQL: create table pg_temp.pg_analyze_112293_59 as ( select Ta.xx,....Ta.xxx from public.xxxxxxxx as Ta where random() < 0.00005620053343591280281543731689453125 limit 173762 ) distributed randomly
INFO: Created sample table pg_temp.pg_analyze_112293_59 with nrows=173762
INFO: ANALYZE computing statistics on attribute xx
INFO: Executing SQL: select count(*)::float4 from pg_temp_440803.pg_analyze_112293_59 as Ta where Ta.xx is null
INFO: nullfrac = 0.178474
INFO: Executing SQL: select avg(pg_column_size(Ta.xx))::float4 from pg_temp_440803.pg_analyze_112293_59 as Ta where Ta.xx is not null
INFO: avgwidth = 21.418087
INFO: Executing SQL: select count(*)::float4 from (select Ta.xx from pg_temp_440803.pg_analyze_112293_59 as Ta group by Ta.xx) as Tb
INFO: count(ndistinct()) gives 142751.000000 values.
INFO: Executing SQL: select count(v)::float4 from (select Ta.xx as v, count(Ta.xx) as f from pg_temp_440803.pg_analyze_112293_59 as Ta group by Ta.xx) as foo where f > 1
INFO: ndistinct = -1.000000
..........
收集統計資訊後,執行計劃恢復,沒有broadcast了,執行也秒級返回了。
讓Greenplum自動收集統計資訊
對於在函式內 或 函式外執行DML時,核心會跟蹤表的記錄數變更影響的資料量,我們可以設定什麼時候收集統計資訊:
none:不收集
on_no_stats:沒有統計資訊時,收集
on_change:當寫入、更新量超過閾值(gp_autostats_on_change_threshold引數設定的行數,預設為20億)後,自動收集統計資訊。
Automatic Statistics Collection
Greenplum Database can be set to automatically run ANALYZE on a table that either has no statistics or has
changed significantly when certain operations are performed on the table. For partitioned tables, automatic
statistics collection is only triggered when the operation is run directly on a leaf table, and then only the leaf
table is analyzed.
Automatic statistics collection has three modes:
• none disables automatic statistics collection.
• on_no_stats triggers an analyze operation for a table with no existing statistics when any of the
commands CREATE TABLE AS SELECT, INSERT, or COPY are executed on the table.
• on_change triggers an analyze operation when any of the commands CREATE TABLE AS SELECT,
UPDATE, DELETE, INSERT, or COPY are executed on the table and the number of rows affected exceeds
the threshold defined by the gp_autostats_on_change_threshold configuration parameter.
The automatic statistics collection mode is set separately for commands that occur within a procedural
language function and commands that execute outside of a function:
• The gp_autostats_mode configuration parameter controls automatic statistics collection behavior
outside of functions and is set to on_no_stats by default.
• The gp_autostats_mode_in_functions parameter controls the behavior when table operations are
performed within a procedural language function and is set to none by default.
With the on_change mode, ANALYZE is triggered only if the number of rows affected exceeds the threshold
defined by the gp_autostats_on_change_threshold configuration parameter. The default value for this
parameter is a very high value, 2147483647, which effectively disables automatic statistics collection;
you must set the threshold to a lower number to enable it. The on_change mode could trigger large,
unexpected analyze operations that could disrupt the system, so it is not recommended to set it globally. It
could be useful in a session, for example to automatically analyze a table following a load.
To disable automatic statistics collection outside of functions, set the gp_autostats_mode parameter to
none:
gpconfigure -c gp_autostats_mode -v none
To enable automatic statistics collection in functions for tables that have no statistics, change
gp_autostats_mode_in_functions to on_no_stats:
gpconfigure -c gp_autostats_mode_in_functions -v on_no_stats
Set the log_autostats system configuration parameter to on if you want to log automatic statistics
collection operations.
為了讓資料庫產生準確的執行計劃,建議要麼使用者自己排程analyZE收集統計資訊,要麼自動收集。
相關文章
- Oracle統計資訊自動收集Oracle
- Oracle11g 統計資訊-----統計資訊自動收集任務Oracle
- Oracle11g 統計資訊——統計資訊自動收集任務Oracle
- [統計資訊系列7] Oracle 11g的自動統計資訊收集Oracle
- 關於oracle自動收集統計資訊Oracle
- Oracle的自動統計資訊不收集直方圖的資訊Oracle直方圖
- Oracle10g 統計資訊的自動收集Oracle
- ORACLE 11g 自動收集統計資訊Oracle
- 【統計資訊】Oracle常用的收集統計資訊方式Oracle
- 【Oracle】11g Oracle自動收集統計資訊Oracle
- Oracle 11g 之自動收集統計資訊Oracle
- 關閉特定物件統計資訊自動收集物件
- ORACLE10g自動收集CBO統計資訊Oracle
- Oracle收集統計資訊Oracle
- Oracle 統計資訊收集Oracle
- 收集oracle統計資訊Oracle
- ORACLE 收集統計資訊Oracle
- Oracle 11g無法自動收集統計資訊Oracle
- oracle 11g自動收集統計資訊介紹Oracle
- oracle 統計資訊過期判斷和自動收集Oracle
- [zt] Oracle10g 自動收集收集CBO統計資訊設定Oracle
- 收集全庫統計資訊
- mysql如收集統計資訊MySql
- MySQL 5.5 統計資訊收集MySql
- 開啟oracle10g統計資訊自動收集功能Oracle
- 微課sql最佳化(3)、統計資訊收集(2)-如何收集統計資訊SQL
- MySQL系統如何收集統計資訊MySql
- 關於ORACLE自動統計CBO統計資訊Oracle
- 收集統計資訊的簡單操作
- ORACLE 統計資訊的收集與管理Oracle
- 統計資訊收集不完的解決
- oracle收集統計資訊job停止Oracle
- 雞肋 -- ORACLE10g自動收集CBO統計資訊(ZT)Oracle
- oracle10g如何啟用禁用自動統計資訊收集Oracle
- 收集資料庫統計資訊需要收集直方圖資訊.資料庫直方圖
- Oracle 11g系統自動收集統計資訊的一些知識Oracle
- 效能優化——統計資訊——SQLServer自動更新和自動建立統計資訊選項 (轉載)優化SQLServer
- Oracle 11g 修改表級別的自動收集統計資訊比率Oracle