Hadoop-impala十大最佳化之(3)—impala表和列資訊統計操作最佳實踐
1.1 Hadoop-impala十大最佳化之(3)—impala表和列資訊統計操作
1.1.1 表和列的資訊統計
show table stats parquet_snappy;
compute stats parquet_snappy;
n 如果是hive的話,統計資訊命令如下
u ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS
show table stats parquet_snappy;
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
| -1 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
+-------+--------+---------+--------------+-------------------+---------+-------------------+...
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show table stats parquet_snappy;
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
| 1000000000 | 96 | 23.35GB | NOT CACHED | NOT CACHED | PARQUET | false |...
+------------+--------+---------+--------------+-------------------+---------+-------------------+...
1.1.2 關於列的資訊統計
show column stats parquet_snappy;
compute stats parquet_snappy;
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+----------+
| id | BIGINT | -1 | -1 | 8 | 8 |
| val | INT | -1 | -1 | 4 | 4 |
| zerofill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| location_id | SMALLINT | -1 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+----------+
compute stats parquet_snappy;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 6 column(s). |
+-----------------------------------------+
show column stats parquet_snappy;
+-------------+----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-------------+----------+------------------+--------+----------+-------------------+
| id | BIGINT | 183861280 | -1 | 8 | 8 |
| val | INT | 139017 | -1 | 4 | 4 |
| zerofill | STRING | 101761 | -1 | 6 | 6 |
| name | STRING | 145636240 | -1 | 22 | 13.00020027160645 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| location_id | SMALLINT | 339 | -1 | 2 | 2 |
+-------------+----------+------------------+--------+----------+-------------------+
1.1.3 有關分割槽表的資訊統計
show partitions year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | -1 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show table stats year_month_day;
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | -1 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | -1 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | -1 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | -1 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+-------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+----------+
| id | INT | -1 | -1 | 4 | 4 |
| val | INT | -1 | -1 | 4 | 4 |
| zfill | STRING | -1 | -1 | -1 | -1 |
| name | STRING | -1 | -1 | -1 | -1 |
| assertion | BOOLEAN | -1 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+----------+
compute stats year_month_day;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 5 partition(s) and 5 column(s). |
+-----------------------------------------+
show table stats year_month_day;
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| year | month | day | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
| 2013 | 12 | 1 | 93606 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 2 | 94158 | 1 | 2.53MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 3 | 94122 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 4 | 93559 | 1 | 2.51MB | NOT CACHED | NOT CACHED | PARQUET |...
| 2013 | 12 | 5 | 93845 | 1 | 2.52MB | NOT CACHED | NOT CACHED | PARQUET |...
| Total | | | 469290 | 5 | 12.58MB | 0B | | |...
+-------+-------+-----+--------+--------+---------+--------------+-------------------+---------+...
show column stats year_month_day;
+-----------+---------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------+---------+------------------+--------+----------+-------------------+
| id | INT | 511129 | -1 | 4 | 4 |
| val | INT | 364853 | -1 | 4 | 4 |
| zfill | STRING | 311430 | -1 | 6 | 6 |
| name | STRING | 471975 | -1 | 22 | 13.00160026550293 |
| assertion | BOOLEAN | 2 | -1 | 1 | 1 |
| year | INT | 1 | 0 | 4 | 4 |
| month | INT | 1 | 0 | 4 | 4 |
| day | INT | 5 | 0 | 4 | 4 |
+-----------+---------+------------------+--------+----------+-------------------+
1.1.4 增量資訊統計
在2.1.0高,你可以使用語法計算增量資料和增量資料下降。增量子句與增量統計工作,是一個專門的功能分割槽表,是大或經常更新新的分割槽。
INCREMENTAL STATS or DROP INCREMENTAL STATS statement.
1.1.5 什麼時候需要重新統計資訊
當一個內容表或分割槽明顯變化,重新計算統計相關資料表或分割槽。
變化是“重大”的程度不同,根據表的絕對和相對大小。
通常,如果你新增了30%多個資料表,這是值得重新計算統計,因為行和不同值的數量差異可能導致impala選擇不同的連線順序時,表中使用的查詢。這個指南對於最大的表是最重要的。
例如,增加30%個新的資料表中含有1 TB的具有更大的影響比加入順序加30%表只包含幾兆位元組,而較大的表對查詢效能有較大的影響如果impala選擇一個次優順序連線由於過時的統計資料。
如果你載入一個完整的資料表中的資料的新的集合,而每一列的行數和不同值的數量是相對不變的,你不需要重新計算統計表。
如果一個表的資料是過時的,和表的大小使它重新計算新的資料立即不切實際,你可以使用下拉資料宣告刪除過時的資料,使其更容易識別,需要一個新的計算統計操作表。
對於一個大的分割槽表,請考慮使用增量統計特徵在Impala 2.1.0高可用,在增量統計概要說明。如果你新增一個新的分割槽表,這是值得重新計算增量資料,因為操作只需掃描資料,一個新的分割槽。
1.1.6 檢視錶和列的統計資訊樣例
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| -1 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
|Returned 1 row(s) in 0.03s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+----------+
| s_store_sk | INT | -1 | -1 | 4 | 4 |
| s_store_id | STRING | -1 | -1 | -1 | -1 |
| s_rec_start_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | -1 | -1 | 16 | 16 |
| s_closed_date_sk | INT | -1 | -1 | 4 | 4 |
| s_store_name | STRING | -1 | -1 | -1 | -1 |
| s_number_employees | INT | -1 | -1 | 4 | 4 |
| s_floor_space | INT | -1 | -1 | 4 | 4 |
| s_hours | STRING | -1 | -1 | -1 | -1 |
| s_manager | STRING | -1 | -1 | -1 | -1 |
| s_market_id | INT | -1 | -1 | 4 | 4 |
| s_geography_class | STRING | -1 | -1 | -1 | -1 |
| s_market_desc | STRING | -1 | -1 | -1 | -1 |
| s_market_manager | STRING | -1 | -1 | -1 | -1 |
| s_division_id | INT | -1 | -1 | 4 | 4 |
| s_division_name | STRING | -1 | -1 | -1 | -1 |
| s_company_id | INT | -1 | -1 | 4 | 4 |
| s_company_name | STRING | -1 | -1 | -1 | -1 |
| s_street_number | STRING | -1 | -1 | -1 | -1 |
| s_street_name | STRING | -1 | -1 | -1 | -1 |
| s_street_type | STRING | -1 | -1 | -1 | -1 |
| s_suite_number | STRING | -1 | -1 | -1 | -1 |
| s_city | STRING | -1 | -1 | -1 | -1 |
| s_county | STRING | -1 | -1 | -1 | -1 |
| s_state | STRING | -1 | -1 | -1 | -1 |
| s_zip | STRING | -1 | -1 | -1 | -1 |
| s_country | STRING | -1 | -1 | -1 | -1 |
| s_gmt_offset | FLOAT | -1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | -1 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+----------+
Returned 29 row(s) in 0.04s
在分析列統計表的表報表時,您必須指定用於收集統計資料的每個列。impala計算統計報表自動彙總所有列的統計,因為它讀取整個表比較快,可以有效地計算所有列的值。這個例子顯示了執行後的計算統計報表,統計資料被填充在表和它的所有列中:
[localhost:21000] > compute stats store;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 29 column(s). |
+------------------------------------------+
|Returned 1 row(s) in 1.88s
[localhost:21000] > show table stats store;
+-------+--------+--------+--------+
| #Rows | #Files | Size | Format |
+-------+--------+--------+--------+
| 12 | 1 | 3.08KB | TEXT |
+-------+--------+--------+--------+
|Returned 1 row(s) in 0.02s
[localhost:21000] > show column stats store;
+--------------------+-----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------------------+-----------+------------------+--------+----------+-------------------+
| s_store_sk | INT | 12 | -1 | 4 | 4 |
| s_store_id | STRING | 6 | -1 | 16 | 16 |
| s_rec_start_date | TIMESTAMP | 4 | -1 | 16 | 16 |
| s_rec_end_date | TIMESTAMP | 3 | -1 | 16 | 16 |
| s_closed_date_sk | INT | 3 | -1 | 4 | 4 |
| s_store_name | STRING | 8 | -1 | 5 | 4.25 |
| s_number_employees | INT | 9 | -1 | 4 | 4 |
| s_floor_space | INT | 10 | -1 | 4 | 4 |
| s_hours | STRING | 2 | -1 | 8 | 7.083300113677979 |
| s_manager | STRING | 7 | -1 | 15 | 12 |
| s_market_id | INT | 7 | -1 | 4 | 4 |
| s_geography_class | STRING | 1 | -1 | 7 | 7 |
| s_market_desc | STRING | 10 | -1 | 94 | 55.5 |
| s_market_manager | STRING | 7 | -1 | 16 | 14 |
| s_division_id | INT | 1 | -1 | 4 | 4 |
| s_division_name | STRING | 1 | -1 | 7 | 7 |
| s_company_id | INT | 1 | -1 | 4 | 4 |
| s_company_name | STRING | 1 | -1 | 7 | 7 |
| s_street_number | STRING | 9 | -1 | 3 | 2.833300113677979 |
| s_street_name | STRING | 12 | -1 | 11 | 6.583300113677979 |
| s_street_type | STRING | 8 | -1 | 9 | 4.833300113677979 |
| s_suite_number | STRING | 11 | -1 | 9 | 8.25 |
| s_city | STRING | 2 | -1 | 8 | 6.5 |
| s_county | STRING | 1 | -1 | 17 | 17 |
| s_state | STRING | 1 | -1 | 2 | 2 |
| s_zip | STRING | 2 | -1 | 5 | 5 |
| s_country | STRING | 1 | -1 | 13 | 13 |
| s_gmt_offset | FLOAT | 1 | -1 | 4 | 4 |
| s_tax_percentage | FLOAT | 5 | -1 | 4 | 4 |
+--------------------+-----------+------------------+--------+----------+-------------------+
Returned 29 row(s) in 0.04s
下面的示例顯示瞭如何將統計資料表示為一個分割槽表。在這種情況下,我們已經建立了一個表,以保持世界上最瑣碎的人口普查資料,一個單一的字串欄位,按一年的列分割槽。表統計包括每個分割槽的單獨的條目,以及數字欄位的最終總計。列資料包括分割槽列一些容易推斷出來的事實,如不同值的數量(劃分子目錄的數量)。
localhost:21000] > describe census;
+------+----------+---------+
| name | type | comment |
+------+----------+---------+
| name | string | |
| year | smallint | |
+------+----------+---------+
Returned 2 row(s) in 0.02s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 0 | 1 | 22B | TEXT |
| 2012 | -1 | 1 | 22B | TEXT |
| 2013 | -1 | 1 | 231B | PARQUET |
| Total | 0 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | -1 | -1 | -1 | -1 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
The following example shows how the statistics are filled in by a COMPUTE STATS statement in Impala.
[localhost:21000] > compute stats census;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 3 partition(s) and 1 column(s). |
+-----------------------------------------+
|Returned 1 row(s) in 2.16s
[localhost:21000] > show table stats census;
+-------+-------+--------+------+---------+
| year | #Rows | #Files | Size | Format |
+-------+-------+--------+------+---------+
| 2000 | -1 | 0 | 0B | TEXT |
| 2004 | -1 | 0 | 0B | TEXT |
| 2008 | -1 | 0 | 0B | TEXT |
| 2010 | -1 | 0 | 0B | TEXT |
| 2011 | 4 | 1 | 22B | TEXT |
| 2012 | 4 | 1 | 22B | TEXT |
| 2013 | 1 | 1 | 231B | PARQUET |
| Total | 9 | 3 | 275B | |
+-------+-------+--------+------+---------+
Returned 8 row(s) in 0.02s
[localhost:21000] > show column stats census;
+--------+----------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+----------+------------------+--------+----------+----------+
| name | STRING | 4 | -1 | 5 | 4.5 |
| year | SMALLINT | 7 | -1 | 2 | 2 |
+--------+----------+------------------+--------+----------+----------+
Returned 2 row(s) in 0.02s
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/24179204/viewspace-2129933/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- Hadoop-impala十大最佳化之(2)—impala連線查詢效能最佳化及最佳實踐Hadoop
- Hadoop-Impala十大最佳化系列之(1)—分割槽表最佳化-8個方法讓分割槽最最佳化Hadoop
- Hadoop-Impala效能最佳化系列開幕Hadoop
- Apache Kafka最佳化部署的十大最佳實踐ApacheKafka
- 微課sql最佳化(3)、統計資訊收集(2)-如何收集統計資訊SQL
- Oracle 12c資料庫最佳化器統計資訊收集的最佳實踐(二)Oracle資料庫
- Impala 5.7效能最佳化系列-10大最佳化思路
- ORACLE表統計資訊與列統計資訊Oracle
- Oracle 12c資料庫最佳化器統計資訊收集的最佳實踐(三)|何時不需要收集統計資訊Oracle資料庫
- 收集最佳化統計資料(Optimizer Statistics)的最佳實踐方法
- ORACLE表統計資訊與列統計資訊、索引統計資訊Oracle索引
- Github 十大最佳實踐Github
- 微課sql最佳化(8)、統計資訊收集(6)-統計資訊查詢SQL
- 網易公開課iOS資訊流卡頓最佳化實踐iOS
- MaxCompute表設計最佳實踐
- Oracle最佳化之單表分頁最佳化Oracle
- Golang效能最佳化實踐Golang
- Sql最佳化之回表SQL
- 最佳實踐:解讀GaussDB(DWS) 統計資訊自動收集方案
- HBase最佳實踐-列族設計優化優化
- 定期最佳化和分析表
- MySQL最佳化之系統變數最佳化MySql變數
- 頁面CLS 最佳化實踐
- SAP ABAP 效能最佳化實踐
- 8個雲成本最佳化的最佳實踐
- Mysql事務原理與最佳化最佳實踐MySql
- 最佳化您的部署:Docker 映象最佳實踐Docker
- 微課sql最佳化(5)、統計資訊收集(3)-關於預設取樣率SQL
- 效能最佳化之報表資料預先計算
- SQL SERVER十大最佳儲存實踐SQLServer
- 前端最佳實踐(一)——DOM操作前端
- MapReduce實現之Reduce端重分割槽Join操作最佳化!
- MySQL8.0效能最佳化(實踐)MySql
- HarmonyOS:應用效能最佳化實踐
- 無線網路規劃設計和部署維護之誤區與最佳實踐(3)
- 京東雲TiDB SQL最佳化的最佳實踐TiDBSQL
- PHP 最佳實踐之異常和錯誤PHP
- Mysql資料庫大表最佳化方案和Mysql大表最佳化步驟MySql資料庫