Hive視窗函式

孫晨c發表於2020-10-07

@

官方文件地址

Hive官網,點我就進
oracle,sqlserver都提供了視窗函式,但是在mysql5.5和5.6都沒有提供視窗函式!

視窗函式: 視窗+函式

  • 視窗: 函式執行時計算的資料集的範圍
  • 函式: 執行的函式!
    僅僅支援以下函式:

Windowing functions

  • LEAD (scalar_expression [,offset] [,default]): 返回當前行以下N行的指定列的列值!如果找不到,就採用預設值
  • LAG (scalar_expression [,offset] [,default]): 返回當前行以上N行的指定列的列值!如果找不到,就採用預設值
  • FIRST_VALUE(列名,[false(預設)]):返回當前視窗指定列的第一個值,第二個引數如果為true,代表加入第一個值為null,跳過空值,繼續尋找!
  • LAST_VALUE(列名,[false(預設)]):返回當前視窗指定列的最後一個值,第二個引數如果為true,代表加入第一個值為null,跳過空值,繼續尋找!

統計類的函式(一般都需要結合over使用):min,max,avg,sum,count

排名分析:

  • RANK
  • ROW_NUMBER
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE

注意:不是所有的函式在執行都是可以通過改變視窗的大小,來控制計算的資料集的範圍!所有的排名函式和LAG,LEAD,支援使用over(),但是在over()中不能定義 window_clause

格式: 函式 over( partition by 欄位 ,order by 欄位 window_clause )

視窗的大小可以通過windows_clause來指定:

(rows | range) between (unbounded | [num]) preceding and ([num] preceding | current row | (unbounded | [num]) following)
(rows | range) between current row and (current row | (unbounded | [num]) following)
(rows | range) between [num] following and (unbounded | [num]) following

特殊情況:

  • ①在over()中既沒有出現windows_clause,也沒有出現order by,視窗預設為rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
  • ②在over()中(沒有出現windows_clause),指定了order by,視窗預設為rows between UNBOUNDED PRECEDING and CURRENT ROW

視窗函式和分組有什麼區別?

  • ①如果是分組操作,select後只能寫分組後的欄位
  • ②如果是視窗函式,視窗函式是在指定的視窗內,對每條記錄都執行一次函式
  • ③如果是分組操作,有去重效果,而partition不去重!

練習

(9) 查詢前20%時間的訂單資訊
精確演算法:

 select *
 from
 (select name,orderdate,cost,cume_dist() over(order by orderdate ) cdnum
 from  business) tmp
 where cdnum<=0.2

不精確計算:

 select *
 from
 (select name,orderdate,cost,ntile(5) over(order by orderdate ) cdnum
 from  business) tmp
 where cdnum=1

(8)查詢顧客的購買明細及顧客最近三次cost花費

最近三次: 當前和之前兩次當前+前一次+後一次

當前和之前兩次:

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 2 PRECEDING and CURRENT  row) 
 from business 

當前+前一次+後一次:

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1  FOLLOWING) 
 from business

select name,orderdate,cost,cost+
 lag(cost,1,0) over(partition by name order by orderdate )+
 lead(cost,1,0) over(partition by name order by orderdate )
 from business

(7) 查詢顧客的購買明細及顧客本月最後一次購買的時間

select name,orderdate,cost,LAST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate rows between CURRENT  row and UNBOUNDED  FOLLOWING) 
 from business 

(6) 查詢顧客的購買明細及顧客本月第一次購買的時間

select name,orderdate,cost,FIRST_VALUE(orderdate,true) over(partition by name,substring(orderdate,1,7) order by orderdate ) 
 from business

(5) 查詢顧客的購買明細及顧客下次的購買時間

 select name,orderdate,cost,lead(orderdate,1,'無資料') over(partition by name order by orderdate ) 
 from business

(4)查詢顧客的購買明細及顧客上次的購買時間

  select name,orderdate,cost,lag(orderdate,1,'無資料') over(partition by name order by orderdate ) 
 from business

(3)查詢顧客的購買明細要將cost按照日期進行累加

select name,orderdate,cost,sum(cost) over(partition by name order by orderdate ) 
 from business

(2)查詢顧客的購買明細及月購買總額

 select name,orderdate,cost,sum(cost) over(partition by name,substring(orderdate,1,7) ) 
 from business

(1)查詢在2017年4月份購買過的顧客及總人數

select name,count(*) over(rows between UNBOUNDED  PRECEDING and UNBOUNDED  FOLLOWING)
from business
where substring(orderdate,1,7)='2017-04'
group by name

等價於

select name,count(*) over()
from business
where substring(orderdate,1,7)='2017-04'
group by name

相關文章