MySQL百分位數計算(優化版)

壹頁書發表於2018-07-02
接前文

http://blog.itpub.net/29254281/viewspace-2157111/

這是上文的優化版本

建立試驗資料,5天每天一百萬隨機資料,總共500w資料

create table nums(id int not null primary key);

delimiter $$
create procedure pFastCreateNums(cnt int)
begin
    declare s int default 1;
    truncate table nums;
    insert into nums select s;
    while s*2<=cnt do
        insert into nums select id+s from nums;
        set s=s*2;
    end while;
end $$
delimiter ;


call pFastCreateNums(2000000);

drop table if exists t ;
create table t(
    query_time date,
    ts float,
    key(query_time,ts)
);


insert into t select '2018-07-01',round(100000*rand(),2) from nums where id<=1000000;
insert into t select '2018-07-02',round(100000*rand(),2) from nums where id<=1000000;
insert into t select '2018-07-03',round(100000*rand(),2) from nums where id<=1000000;
insert into t select '2018-07-04',round(100000*rand(),2) from nums where id<=1000000;
insert into t select '2018-07-05',round(100000*rand(),2) from nums where id<=1000000;

首先,修正上文的SQL,增加精度,因為在大資料量下,會有顯著的誤差。


  1. select query_time,v,ts
  2. from (
  3.     select t6.query_time,t6.ts,v,seq,
  4.     case when @gid=concat(seq,'#',query_time) then @rn:=@rn+1 when @gid:=concat(seq,'#',query_time) then @rn:=1 end s
  5.     from (
  6.         select query_time,ts,rn,percent,v,v-percent d,seq from (
  7.             select t2.query_time,ts,rn,round(rn/total,10) percent from (
  8.                 select query_time,ts,
  9.                 case when @gid=query_time then @rn:=@rn+1 when @gid:=query_time then @rn:=1 end rn
  10.                 from (
  11.                     select * from t ,(select @gid:='',@rn:=0) vars order by query_time,ts
  12.                 ) t1
  13.             ) t2 inner join (
  14.                 select query_time,count(*) total from t group by query_time
  15.             ) t3 on(t2.query_time=t3.query_time)
  16.         ) t4 ,
  17.         (select 0.71 v,1 seq union all select 0.81,2 union all select 0.91,3) t5
  18.     ) t6 where d>=0 order by query_time,v,d
  19. ) t7 where s=1 order by query_time,seq ;

在ssd環境下,上文的SQL執行時長和結果如下.

148.813 s 


前文這個SQL的計算結果是非常精確的
但是計算時間和 取樣點數量 有巨大關係. 假如原始資料是100w,三個百分位數的取樣,則資料擴張到300w;4個百分位數的取樣,則資料擴張到400w.這是因為使用笛卡爾積擴張了資料的緣故.

優化版本:

  1. select query_time,d,max(ts) ts from (
  2.     select t2.query_time,ts,rn,round(rn/total,10) percent,
  3.     case
  4.     when 0.71>=round(rn/total,10) then 0.71
  5.     when 0.81>=round(rn/total,10) then 0.81
  6.     when 0.91>=round(rn/total,10) then 0.91
  7.     end d
  8.     from (
  9.         select query_time,ts,
  10.         case when @gid=query_time then @rn:=@rn+1 when @gid:=query_time then @rn:=1 end rn
  11.         from (
  12.             select * from t ,(select @gid:='',@rn:=0) vars order by query_time,ts
  13.         ) t1
  14.     ) t2 inner join (
  15.         select query_time,count(*) total from t group by query_time
  16.     ) t3 on(t2.query_time=t3.query_time)
  17. ) t6
  18. where d is not null
  19. group by query_time,d

結果:

用時:
33.922 秒

這個版本的優點是增加百分位數取樣點,不會增加開銷.
缺點是在非常小的樣本環境中,會有誤差.
不過後續也有手段消除誤差.

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29254281/viewspace-2157209/,如需轉載,請註明出處,否則將追究法律責任。

相關文章