Greenplum自動統計資訊收集-暨統計資訊不準引入的broadcastmotion一例

德哥發表於2017-12-14

背景

資料庫執行計劃的好壞，與資料庫的SQL優化器息息相關。Greenplum有兩套優化器，legacy query optimizer 與 ORCA。

這兩個優化器都是CBO優化器，都需要依賴統計資訊，如果統計資訊不準確，可能生成的執行計劃就不準確。

例如我們有一個這樣的QUERY，發現怎麼跑都跑不出來。

觀察執行計劃，發現有一個節點用到了broadcast motion，也就是廣播。

->  Broadcast Motion 512:512  (slice1; segments: 512)  (cost=0.00..6.13 rows=1 width=16)  
  ->  Append-only Columnar Scan on xxxx  (cost=0.00..1.00 rows=1 width=16)

當JOIN欄位並非分佈鍵時，Greenplum會根據表的大小，選擇重分佈或廣播。（小表廣播，大表的話多階段JOIN）。

《HybridDB PostgreSQL “Sort、Group、distinct 聚合、JOIN” 不懼怕資料傾斜的黑科技和原理 – 多階段聚合》

而這個執行計劃跑偏的SQL，恰恰是在一個大表上觸發了廣播(broadcast motion)，這不正常。

select count(*) xxxx;  
  
返回有31億資料。

查詢pg_class，值有1條記錄，佔用0個BLOCK

select * from pg_class where relname=`xxxx`;  
  
relpages       | 1  
reltuples      | 0

執行analyze，收集統計資訊，執行計劃恢復。

(Greenplum對於超級大表，收集統計資訊時，會構建臨時表來進行取樣分析)

digoal=> analyze verbose xxxxxxxx;  
  
INFO:  Executing SQL: select sum(gp_statistics_estimate_reltuples_relpages_oid(c.oid))::float4[] from gp_dist_random(`pg_class`) c where c.oid=112293  
INFO:  ANALYZE estimated reltuples=3091824896.000000, relpages=47509120.000000 for table xxxxxxxx  
INFO:  ANALYZE building sample table of size 173762 on table xxxxxxxx because it has too many rows.  
INFO:  Executing SQL: create table pg_temp.pg_analyze_112293_59 as (  select Ta.xx,....Ta.xxx  from public.xxxxxxxx as Ta where random() < 0.00005620053343591280281543731689453125 limit 173762  ) distributed randomly  
INFO:  Created sample table pg_temp.pg_analyze_112293_59 with nrows=173762  
INFO:  ANALYZE computing statistics on attribute xx  
INFO:  Executing SQL: select count(*)::float4 from pg_temp_440803.pg_analyze_112293_59 as Ta where Ta.xx is null  
INFO:  nullfrac = 0.178474  
INFO:  Executing SQL: select avg(pg_column_size(Ta.xx))::float4 from pg_temp_440803.pg_analyze_112293_59 as Ta where Ta.xx is not null  
INFO:  avgwidth = 21.418087  
INFO:  Executing SQL: select count(*)::float4 from (select Ta.xx from pg_temp_440803.pg_analyze_112293_59 as Ta group by Ta.xx) as Tb  
INFO:  count(ndistinct()) gives 142751.000000 values.  
INFO:  Executing SQL: select count(v)::float4 from (select Ta.xx as v, count(Ta.xx) as f from pg_temp_440803.pg_analyze_112293_59 as Ta group by Ta.xx) as foo where f > 1  
INFO:  ndistinct = -1.000000  
..........

收集統計資訊後，執行計劃恢復，沒有broadcast了，執行也秒級返回了。

讓Greenplum自動收集統計資訊

對於在函式內或函式外執行DML時，核心會跟蹤表的記錄數變更影響的資料量，我們可以設定什麼時候收集統計資訊：

none：不收集

on_no_stats：沒有統計資訊時，收集

on_change：當寫入、更新量超過閾值（gp_autostats_on_change_threshold引數設定的行數，預設為20億）後，自動收集統計資訊。

Automatic Statistics Collection  
  
Greenplum Database can be set to automatically run ANALYZE on a table that either has no statistics or has  
changed significantly when certain operations are performed on the table. For partitioned tables, automatic  
statistics collection is only triggered when the operation is run directly on a leaf table, and then only the leaf  
table is analyzed.  
  
Automatic statistics collection has three modes:  
• none disables automatic statistics collection.  
• on_no_stats triggers an analyze operation for a table with no existing statistics when any of the  
commands CREATE TABLE AS SELECT, INSERT, or COPY are executed on the table.  
• on_change triggers an analyze operation when any of the commands CREATE TABLE AS SELECT,  
UPDATE, DELETE, INSERT, or COPY are executed on the table and the number of rows affected exceeds  
the threshold defined by the gp_autostats_on_change_threshold configuration parameter.  
  
The automatic statistics collection mode is set separately for commands that occur within a procedural  
language function and commands that execute outside of a function:  
• The gp_autostats_mode configuration parameter controls automatic statistics collection behavior  
outside of functions and is set to on_no_stats by default.  
• The gp_autostats_mode_in_functions parameter controls the behavior when table operations are  
performed within a procedural language function and is set to none by default.  
  
With the on_change mode, ANALYZE is triggered only if the number of rows affected exceeds the threshold  
defined by the gp_autostats_on_change_threshold configuration parameter. The default value for this  
parameter is a very high value, 2147483647, which effectively disables automatic statistics collection;  
you must set the threshold to a lower number to enable it. The on_change mode could trigger large,  
unexpected analyze operations that could disrupt the system, so it is not recommended to set it globally. It  
could be useful in a session, for example to automatically analyze a table following a load.  
  
To disable automatic statistics collection outside of functions, set the gp_autostats_mode parameter to  
none:  
  
gpconfigure -c gp_autostats_mode -v none  
  
To enable automatic statistics collection in functions for tables that have no statistics, change  
gp_autostats_mode_in_functions to on_no_stats:  
  
gpconfigure -c gp_autostats_mode_in_functions -v on_no_stats  
  
Set the log_autostats system configuration parameter to on if you want to log automatic statistics  
collection operations.

為了讓資料庫產生準確的執行計劃，建議要麼使用者自己排程analyZE收集統計資訊，要麼自動收集。

[統計資訊系列7] Oracle 11g的自動統計資訊收集
2020-06-03
Oracle
手動收集——收集統計資訊
2018-06-27
【統計資訊】Oracle常用的收集統計資訊方式
2020-11-30
Oracle
啟用與禁用統計資訊自動收集
2018-04-03
收集統計資訊方案
2019-06-08
Oracle收集統計資訊
2018-06-27
Oracle
收集全庫統計資訊
2020-09-16
微課sql最佳化(3)、統計資訊收集(2)-如何收集統計資訊
2020-03-03
SQL
6 收集資料庫統計資訊
2020-04-27
資料庫
最佳實踐：解讀GaussDB(DWS) 統計資訊自動收集方案
2024-08-07
【統計資訊】Oracle統計資訊
2021-03-31
Oracle
Oracle統計資訊的收集和維護
2024-03-06
Oracle
Oracle 11g系統自動收集統計資訊的一些知識
2018-05-30
Oracle
oracle 統計資訊檢視與收集
2020-03-18
Oracle
ORACLE表統計資訊與列統計資訊、索引統計資訊
2018-03-25
Oracle索引
微課sql最佳化(8)、統計資訊收集(6)-統計資訊查詢
2020-03-03
SQL
Oracle運維指令碼-收集統計資訊
2020-06-11
Oracle運維指令碼
ORACLE analyse table方式收集表統計資訊導致SQL執行計劃不準確而效能下降
2018-10-31
OracleSQL
【統計資訊】全面檢視錶所有統計資訊SQL
2020-11-30
SQL
【統計資訊】如何備份和還原統計資訊
2020-11-30
Sqlserver關於統計資訊自動建立自動更新的知識點
2020-06-22
SQLServer
Nebula Graph 特性講解——RocksDB 統計資訊的收集和展示
2020-08-13
修改oracle 的統計資訊
2022-11-15
Oracle
ORACLE19c新特性-實時統計資訊收集
2022-04-15
Oracle
MySQL統計資訊系列
2019-05-14
MySql
【TUNE_ORACLE】定製化收集統計資訊SQL參考
2021-08-02
OracleSQL
SQL Server 更新統計資訊
2024-05-24
SQLServer
Oracle 統計資訊介紹
2019-07-29
Oracle
PostgreSQL DBA(10) - 統計資訊
2018-10-06
SQL
[20200819]12c Global Temporary table 統計資訊的收集的疑問.txt
2020-08-29
MySQL對所有表收集統計資訊(Python 2指令碼)
2018-04-11
MySqlPython指令碼
達夢資料庫手動管理統計資訊方法
2021-06-24
資料庫
微課sql最佳化(6)、統計資訊收集(4)-關於動態取樣
2020-03-03
SQL
收集 Kubernetes 資源統計資料的新工具
2018-11-22
微課sql最佳化(2)-為什麼需要收集統計資訊
2020-03-03
SQL
ansible 統計 ssh 登入資訊
2024-04-12
Win10系統關閉“小娜自動收集個人資訊”功能的方法
2018-04-06
Win10
MySQL中Innodb如何計算索引的統計資訊？
2018-07-23
MySql索引
職工資訊管理系統的設計
2020-12-22

Greenplum自動統計資訊收集-暨統計資訊不準引入的broadcastmotion一例

標籤

背景

讓Greenplum自動收集統計資訊

相關文章