Hive之分析函式

落花桂發表於2021-08-05

一、sum() over(partition by)

  • 資料準備
cookie1,2015-04-10,1
cookie1,2015-04-11,5
cookie1,2015-04-12,7
cookie1,2015-04-13,3
cookie1,2015-04-14,2
cookie1,2015-04-15,4
cookie1,2015-04-16,4

  • 查詢語句
select 
   cookieid, 
   createtime, 
   pv, 
   sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, 
   sum(pv) over (partition by cookieid order by createtime) as pv2, 
   sum(pv) over (partition by cookieid) as pv3, 
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, 
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, 
   sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 
from cookie1;
  • 查詢結果

  • 查詢結果說明

    • pv1: 分組內從起點到當前行的pv累積,如,11號的pv1=2015-04-10號的pv + 2015-04-11號的pv, 2015-04-12號=10號+11號+12號
    • pv2: 同pv1
    • pv3: 分組內(cookie1)所有的pv累加
    • pv4: 分組內當前行+往前3行,如,11號=10號+11號, 12號=10號+11號+12號, 13號=10號+11號+12號+13號, 14號=11號+12號+13號+14號
    • pv5: 分組內當前行+往前3行+往後1行,如,14號=11號+12號+13號+14號+15號=5+7+3+2+4=21
    • pv6: 分組內當前行+往後所有行,如,13號=13號+14號+15號+16號=3+2+4+4=13,14號=14號+15號+16號=2+4+4=10
  • partition by 的引數說明

如果不指定ROWS BETWEEN,預設為從起點到當前行;
如果不指定ORDER BY,則將分組內所有值累加;
關鍵是理解ROWS BETWEEN含義,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往後
CURRENT ROW:當前行
UNBOUNDED:起點,
UNBOUNDED PRECEDING 表示從前面的起點,
UNBOUNDED FOLLOWING:表示到後面的終點
–其他AVG,MIN,MAX,和SUM用法一樣。

二、avg()、min()、max() over(partition)

  avg()、min()、max() over(partition) 與 sum() over(partition) 類似,都是對視窗做操作

  • min() over(partition) 的查詢語句
select 
   cookieid, 
   createtime, 
   pv, 
   min(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- 預設為從起點到當前行
   min(pv) over (partition by cookieid order by createtime) as pv2, --從起點到當前行,結果同pv1
   min(pv) over (partition by cookieid) as pv3, --分組內所有行
   min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, --當前行+往前3行
   min(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, --當前行+往前3行+往後1行
   min(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6  --當前行+往後所有行
from cookie1;
  • 結果展示

三、row_number() over(partition by)

  row_number()從1開始,為每一條分組記錄返回一個數字

  row_number() OVER (ORDER BY id DESC) 是先把id列降序,再為降序以後的每條id記錄返回一個序號。

  row_number() OVER (PARTITION BY COL1 ORDER BY COL2) 表示根據COL1分組,在分組內部根據 COL2排序,而此函式計算的值就表示每組內部排序後的順序編號(組內連續的唯一的)

  • 資料準備

  • 查詢:需根據部門分組,顯示每個部門的工資等級
SELECT *, Row_Number() OVER (partition by deptid ORDER BY salary desc) rank FROM employee

四、用over(partition by) 還是 group by

  • 總結區別:over(partition by) 和 group by的區別

    • group by:單純分組,要查詢非group by欄位時需要用 collect_set()[0]處理,或者子查詢處理
    • over(partition by):不僅能分組,還能同時查詢非分割槽欄位,不僅可以使用sum()、avg()、min()、max()等功能,還可以使用row_number() 對資料進行排名功能
  • group by

  在hive中使用group by時,是不能select 非group by 欄位的。

select name,sex from people group by sex;
---------------------------------------------------
會報錯:
FAILED: SemanticException [Error 10025]: Line 1:15 Expression not in GROUP BY key 'name'

  解決這個問題的方式有很多:在子查詢中做group by然後用left join 連線,在外層selec。還有就是用collect_set()包圍這個非group by欄位

select collect_set(name)[0],sex from people group by sex;
  • over(partition by)

  當然,用over(partition by)也能解決分組問題,在分組的同時會對相同key的進行回填處理

  資料準備

  查詢語句

select v1,v2,sum(v2) over(partition by v1) as sum from wmg_test;

  結果展示

  所以要做到取一條分組資料,就在外層去重

select distinct v1,sum_01                          
from (
select v1,sum(v2) over(partition by v1) as sum_01
from wmg_test
) a;

  結果展示

相關文章