Sql優化（二）快速計算Distinct Count

郭俊JasonGuo發表於2015-03-28

原創文章，始發自本人個人部落格站點，轉載請務必註明出自http://www.jasongj.com

個人部落格上本文連結http://www.jasongj.com/2015/03/15/count_distinct/

UV vs. PV

　　在網際網路中，經常需要計算UV和PV。所謂PV即Page View，網頁被開啟多少次（YouTube等視訊網站非常重視視訊的點選率，即被播放多少次，也即PV）。而UV即Unique Visitor（微信朋友圈或者微信公眾號中的文章則統計有多少人看過該文章，也即UV。雖然微信上顯示是指明該值是PV，但經筆者測試，實為UV）。這兩個概念非常重要，比如淘寶賣家在做活動時，他往往需要統計寶貝被看了多少次，有多少個不同的人看過該活動介紹。至於如何在網際網路上唯一標識一個自然人，也是一個難點，目前還沒有一個非常準確的方法，常用的方法是使用者名稱加cookie，這裡不作深究。

count distinct vs. count group by

　　很多情景下，尤其對於文字型別的欄位，直接使用count distinct的查詢效率是非常低的，而先做group by更count往往能提升查詢效率。但實驗表明，對於不同的欄位，count distinct與count group by的效能並不一樣，而且其效率也與目標資料集的資料重複度相關。

　　本節通過幾組實驗說明了不同場景下不同query的不同效率，同時分析效能差異的原因。（本文所有實驗皆基於PostgreSQL 9.3.5平臺）
分別使用count distinct 和 count group by對 bigint, macaddr, text三種型別的欄位做查詢。
首先建立如下結構的表

Column	Type	Modifiers
mac_bigint	bigint
mac_macaddr	macaddr
mac_text	text

並插入1000萬條記錄，並保證mac_bigint為mac_macaddr去掉冒號後的16進位制轉換而成的10進位制bigint，而mac_text為mac_macaddr的文字形式，從而保證在這三個欄位上查詢的結果，也及複雜度相同。

　　count distinct SQL如下

select 
    count(distinct mac_macaddr) 
from 
    testmac

　　count group by SQL如下

select
    count(*)
from
    (select
        mac_macaddr
    from
        testmac
    group by
        1) foo

　　對於不同記錄數較大的情景（1000萬條記錄中，有300多萬條不同記錄），查詢時間（單位毫秒）如下表所示。

query/欄位型別	macaddr	bigint	text
count distinct	24668.023	13890.051	149048.911
count group by	32152.808	25929.555	159212.700

　　對於不同記錄數較小的情景（1000萬條記錄中，只有1萬條不同記錄），查詢時間（單位毫秒）如下表所示。

query/欄位型別	macaddr	bigint	text
count distinct	20006.681	9984.763	225208.133
count group by	2529.420	2554.720	3701.869

　　從上面兩組實驗可看出，在不同記錄數較小時，count group by效能普遍高於count distinct，尤其對於text型別表現的更明顯。而對於不同記錄數較大的場景，count group by效能反而低於直接count distinct。為什麼會造成這種差異呢，我們以macaddr型別為例來對比不同結果集下count group by的query plan。
　　當結果集較小時，planner會使用HashAggregation。

explain analyze select count(*) from (select mac_macaddr from testmac_small group by 1) foo;
                                        QUERY PLAN
 Aggregate  (cost=668465.04..668465.05 rows=1 width=0) (actual time=9166.486..9166.486 rows=1 loops=1)
   ->  HashAggregate  (cost=668296.74..668371.54 rows=7480 width=6) (actual time=9161.796..9164.393 rows=10001 loops=1)
         ->  Seq Scan on testmac_small  (cost=0.00..572898.79 rows=38159179 width=6) (actual time=323.338..5091.112 rows=10000000 l
oops=1)

　　而當結果集較大時，無法通過在記憶體中維護Hash表的方式使用HashAggregation，planner會使用GroupAggregation，並會用到排序，而且因為目標資料集太大，無法在記憶體中使用Quick Sort，而要在外存中使用Merge Sort，而這就極大的增加了I/O開銷。

explain analyze select count(*) from (select mac_macaddr from testmac group by 1) foo;
                                        QUERY PLAN
 Aggregate  (cost=1881542.62..1881542.63 rows=1 width=0) (actual time=34288.232..34288.232 rows=1 loops=1)
   ->  Group  (cost=1794262.09..1844329.41 rows=2977057 width=6) (actual time=25291.372..33481.228 rows=3671797 loops=1)
         ->  Sort  (cost=1794262.09..1819295.75 rows=10013464 width=6) (actual time=25291.366..29907.351 rows=10000000 loops=1)
               Sort Key: testmac.mac_macaddr
               Sort Method: external merge  Disk: 156440kB
               ->  Seq Scan on testmac  (cost=0.00..219206.64 rows=10013464 width=6) (actual time=0.082..4312.053 rows=10000000 loo
ps=1)

dinstinct count高效近似演算法

　　由於distinct count的需求非常普遍（如網際網路中計算UV），而該計算的代價又相比較高，很難適應實時性要求較高的場景，如流計算，因此有很多相關研究試圖解決該問題。比較著名的演算法有daptive sampling Algorithm，Distinct Counting with a Self-Learning Bitmap，HyperLogLog，LogLog，Probabilistic Counting Algorithms。這些演算法都不能精確計算distinct count，都是在保證誤差較小的情況下高效計算出結果。本文分別就這幾種演算法做了兩組實驗。

資料集100萬條，每條記錄均不相同，幾種演算法耗時及記憶體使用如下。

algorithm	result	error	time(ms)	memory (B)
count(distinct)	1000000	0%	14026	？
Adaptive Sampling	1008128	0.8%	8653	57627
Self-learning Bitmap	991651	0.9%	1151	65571
Bloom filter	788052	22%	2400	1198164
Probalilistic Counting	1139925	14%	3613	95
PCSA	841735	16%	842	495

資料集100萬條，只有100條不同記錄，幾種近似演算法耗時及記憶體使用如下。

algorithm	result	error	time(ms)	memory (B)
count(distinct)	100	0%	75306	？
Adaptive Sampling	100	0%	1491	57627
Self-learning Bitmap	101	1%	1031	65571
Bloom filter	100	0%	1675	1198164
Probalilistic Counting	95	5%	3613	95
PCSA	98	2%	852	495

　　
　　從上面兩組實驗可看出，大部分的近似演算法工作得都很好，其速度都比簡單的count distinct要快很多，而且它們對記憶體的使用並不多而結果去非常好，尤其是Adaptive Sampling和Self-learning Bitmap，誤差一般不超過1%，效能卻比簡單的count distinct高十幾倍乃至幾十倍。　　

distinct count結果合併

　　如上幾種近似演算法可以極大提高distinct count的效率，但對於data warehouse來說，資料量非常大，可能儲存了幾年的資料，為了提高查詢速度，對於sum及avg這些aggregation一般會建立一些aggregation table。比如如果要算過去三年的總營業額，那可以建立一張daily/monthly aggregation table，基於daily/monthly表去計算三年的營業額。但對於distinct count，即使建立了daily/monthly aggregation table，也沒辦法通過其計算三年的數值。這裡有種新的資料型別hll，這是一種HyperLogLog資料結構。一個1280位元組的hll能計算幾百億的不同數值並且保證只有很小的誤差。
　　首先建立一張表(fact)，結構如下

Column	Type	Modifiers
day	date
user_id	integer
sales	numeric

　插入三年的資料，並保證總共有10萬個不同的user_id，總資料量為1億條（一天10萬條左右）。

insert into fact
select
    current_date - (random()*1095)::integer * '1 day'::interval,
    (random()*100000)::integer + 1,
    random() * 10000 + 500
from
    generate_series(1, 100000000, 1);

　　直接從fact表中查詢不同使用者的總數，耗時115143.217 ms。
利用hll，建立daily_unique_user_hll表，將每天的不同使用者資訊存於hll型別的欄位中。

create table daily_unique_user_hll 
as select
    day, 
    hll_add_agg(hll_hash_integer(user_id))
from 
    fact
group by 1;

　　通過上面的daily aggregation table可計算任意日期範圍內的unique user count。如計算整個三年的不同使用者數，耗時17.485 ms，查詢結果為101044，誤差為(101044-100000)/100000=1.044%。

explain analyze select hll_cardinality(hll_union_agg(hll_add_agg)) from daily_unique_user_hll;
                                   QUERY PLAN
 Aggregate  (cost=196.70..196.72 rows=1 width=32) (actual time=16.772..16.772 rows=1 loops=1)
   ->  Seq Scan on daily_unique_user_hll  (cost=0.00..193.96 rows=1096 width=32) (actual time=0.298..3.251 rows=
1096 loops=1)
 Planning time: 0.081 ms
 Execution time: 16.851 ms
 Time: 17.485 ms

　　而如果直接使用count distinct基於fact表計算該值，則耗時長達 127807.105 ms。
　　
　　從上面的實驗中可以看到，hll型別實現了distinct count的合併，並可以通過hll儲存各個部分資料集上的distinct count值，並可通過合併這些hll值來快速計算整個資料集上的distinct count值，耗時只有直接使用count distinct在原始資料上計算的1/7308，並且誤差非常小，1%左右。　　

總結

　　如果必須要計算精確的distinct count，可以針對不同的情況使用count distinct或者count group by來實現較好的效率，同時對於資料的儲存型別，能使用macaddr/intger/bigint的，儘量不要使用text。
　　
　　另外不必要精確計算，只需要保證誤差在可接受的範圍之內，或者計算效率更重要時，可以採用本文所介紹的daptive sampling Algorithm，Distinct Counting with a Self-Learning Bitmap，HyperLogLog，LogLog，Probabilistic Counting Algorithms等近似演算法。另外，對於data warehouse這種儲存資料量隨著時間不斷超增加且最終資料總量非常巨大的應用場景，可以使用hll這種支援合併dintinct count結果的資料型別，並週期性的（比如daily/weekly/monthly）計算部分資料的distinct值，然後通過合併部分結果的方式得到總結果的方式來快速響應查詢請求。

對含distinct操作的SQL的優化
2019-02-15
SQL優化
大資料下的Distinct Count（二）：Bitmap篇
2020-03-03
大資料
7.14 APPROX_COUNT_DISTINCT
2020-03-20
APP
count(*) 優化
2021-01-04
優化
7.15 APPROX_COUNT_DISTINCT_AGG
2020-03-20
APP
7.16 APPROX_COUNT_DISTINCT_DETAIL
2020-03-20
APPAI
Oracle效能優化-SQL優化(案例二)
2020-02-17
Oracle優化SQL
【Leetcode】1180. Count Substrings with Only One Distinct Letter
2021-01-04
LeetCode
MySQL優化COUNT()查詢
2019-01-14
MySql優化
【PostgreSQL 】PostgreSQL 15對distinct的優化
2022-07-17
SQL優化
oracle之優化一用group by或exists優化distinct
2018-12-30
Oracle優化
MySQL之SQL優化詳解（二）
2019-06-30
MySql優化
C++ 效能優化篇二《影響優化的計算機行為》
2020-11-26
C++優化計算機
百萬資料 mysql count（*）優化
2022-05-15
MySql優化
Oracle某行系統SQL優化案例(二)
2021-06-07
OracleSQL優化
SQL優化案例-正確的使用索引（二）
2018-06-15
SQL優化索引
sql - distinct 去重複的用法
2019-03-04
SQL
java Count如何計算流中的元素
2021-09-11
Java
SQL Server中count(*)和Count(1)的區別
2021-09-09
SQLServer
sql優化之邏輯優化
2018-12-12
SQL優化
資料庫優化 - SQL優化
2020-01-11
資料庫優化SQL
SQL SERVER優化
2018-10-30
SQLServer優化
SQL優化指南
2018-12-01
SQL優化
計算節點會對收到的SQL 語句做解析、優化、執行並
2022-01-17
SQL優化
SQL -去重Group by 和Distinct的效率
2021-01-22
SQL
SQL 有序計算
2020-11-26
SQL
SQL優化之統計資訊和索引
2018-12-17
SQL優化索引
Oracle效能優化-SQL優化(案例一)
2020-02-17
Oracle優化SQL
Oracle效能優化-SQL優化(案例三)
2020-02-18
Oracle優化SQL
Oracle效能優化-SQL優化(案例四)
2020-02-18
Oracle優化SQL
Microsoft SQL Server 2005中查詢優化器使用的統計資訊二(zt)
2019-03-02
ROSSQLServer優化
SQL優化案例-從執行計劃定位SQL問題（三）
2018-11-28
SQL優化
HOME: Count Digits —— 計算字串中數字個數
2020-12-18
Git字串
SQL優化參考
2018-09-15
SQL優化
sql優化專題
2020-09-23
SQL優化
SQL語句優化
2019-05-08
SQL優化
SQL效能優化技巧
2022-02-14
SQL優化
MySQL-SQL優化
2021-02-06
MySql優化
慢Sql優化思路
2020-12-18
SQL優化

Sql優化（二） 快速計算Distinct Count

UV vs. PV

count distinct vs. count group by

dinstinct count高效近似演算法

distinct count結果合併

總結

相關文章

Sql優化（二）快速計算Distinct Count