SQL入門-進階教程

小張過河V發表於2020-09-27

前言

接上一次課程《SQL入門-基礎教程》,本次課程是對上一次課程的深入,主要包括分組、聚合、連線、分析&視窗函式等方面,目的是通過講解常用的語法和函式,爭取用最短的時間對SQL有個整體的認識,每一塊內容的詳細講解會在後面出一個系列的文章進行詳細的詳解,敬請關注…

好了,我們開始吧

00、預備知識

先講幾個資料倉儲中的概念,目的是將口徑拉齊,大家有個初步的瞭解即可。

1、事實表

發生在現實世界中的操作型事件,其所產生的可度量數值,比如,門店交易資料、倉庫出庫量資料、供應鏈進銷存資料等

2、維度表

維度表就是你觀察該事物的角度(維度),比如商品資訊表、門店資訊表、物流商品表等

3、大寬表

指業務主題相關的指標、維度、屬性關聯在一起的一張表,一般是由N張事件表和N張維度表關聯生成

一、資料準備

1、執行環境

本次課程的執行環境為開源的Hue系統HUE連結
使用者名稱:demo 密碼:demo
PS:如果有公司的Hive SQL環境也可以,本次課程中的SQL都可以在Hive下面執行

2、資料表準備

本次使用with函式來代替資料表,在學習過程中with函式中的資料不需要改變,只需要在下面寫自己的邏輯即可。

with dw_wms_outbound_info_v100 as(
  select '10700001' as order_no, '0011' as shop_no, '5501' as product_code, '2020-07-31' as order_date, 10 as payment_amount, '20200901' as dt union all
  select '10700001' as order_no, '0011' as shop_no, '5502' as product_code, '2020-08-01' as order_date, 90 as payment_amount, '20200901' as dt union all
  select '10700002' as order_no, '0011' as shop_no, '5501' as product_code, '2020-08-02' as order_date, 20 as payment_amount, '20200901' as dt union all
  select '10700002' as order_no, '0011' as shop_no, '5503' as product_code, '2020-08-02' as order_date, 30 as payment_amount, '20200901' as dt union all
  select '10700003' as order_no, '0011' as shop_no, '5501' as product_code, '2020-08-02' as order_date, 20 as payment_amount, '20200901' as dt union all
  select '10700004' as order_no, '0012' as shop_no, '6601' as product_code, '2020-07-31' as order_date, 40 as payment_amount, '20200901' as dt union all
  select '10700004' as order_no, '0012' as shop_no, '6601' as product_code, '2020-08-01' as order_date, 50 as payment_amount, '20200901' as dt
)
  
, dim_store_info_v100 as (
  select '0011' as shop_no, '北京一店' as shop_name, '20200901' as dt  union all
  select '0012' as shop_no, '北京二店' as shop_name, '20200901' as dt
)
  
, dim_sku_info_v100 as (
  select '5501' as product_code, '可樂350ml' as product_name, '20200901' as dt  union all
  select '5502' as product_code, '雪碧500ml' as product_name, '20200901' as dt  union all
  select '5504' as product_code, '芬達243ml' as product_name, '20200901' as dt  union all
  select '6601' as product_code, '巧克力'    as product_name, '20200901' as dt
)
 
---------------------上面的SQL不要動-----在下面寫自己的SQL--------------------
 
select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t2.shop_name,
  t1.product_code,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
left join dim_store_info_v100 t2 on t2.dt = '20200901' and t1.shop_no = t2.shop_no
left join dim_sku_info_v100 t3 on t3.dt = '20200901' and t1.product_code = t3.product_code
where t1.dt = '20200901'

如下圖所示:
在這裡插入圖片描述

3、資料表的關係

如下圖所示:dw_wms_outbound_info_v100表為事實表,dim_store_info_v100和dim_sku_info_v100表為維度表,分別通過shop_noproduct_code關聯起來。
在這裡插入圖片描述

二、進階語法

1、 with … as 語法

當我們在寫一些結構相對複雜的SQL語句時,可能某個子查詢在多個層級多個地方存在重複使用的情況,這個時候我們可以使用 with as 語句將其獨立出來,極大提高SQL可讀性,簡化SQL~

樣例:參考本次課程的[資料表準備SQL]

2、分組&聚合

2.0 語法結構

基本語法如下:
在這裡插入圖片描述

2.1 分組(group by)

分組顧名思義,即將原來的一整塊資料按某種維度分成幾小塊,語法關鍵字為 group by,對比excel的透視表功能。
語法:group by col1, col2, …

例如:按門店和商品對資料進行分組

select
    shop_no,
    product_code
from dw_wms_outbound_info_v100
where dt = '20200901'
group by
    shop_no,
    product_code

查詢結果:

shop_noproduct_code
00115501
00115502
00126601
00115503

2.2 聚合

聚合是在每個分組內進行一些統計,如在分組內的最大值(max)、最小值(min)、平均值(avg)、資料條數(count)等聚合函式

例:統計每個分組內的資料條數?(使用count函式來統計)

select
    shop_no,
    product_code,
    count(*) as record_cnt
from dw_wms_outbound_info_v100
where dt = '20200901'
group by
    shop_no,
    product_code

查詢結果:

shop_noproduct_coderecord_cnt
001155013
001155021
001266012
001155031

注:count(*)代表的是統計組內所有的行,詳細使用參考下面的聚合函式使用

2.2 分組過濾(having)

當在gropu by 子句中使用having 子句時,查詢結果中只返回滿足having條件的組

例:統計每筆訂單支付的金額大於50的訂單號?(使用count函式來統計)

select
  order_no,
  sum(payment_amount) as payment_amount
from dw_wms_outbound_info_v100
group by order_no
having payment_amount > 50

查詢結果:

order_nopayment_amount
1070000490
10700001100

注:sum(payment_amount)代表的是統計組內所有的行的支付金額加總

3、表連線

3.0 語法結構

在這裡插入圖片描述

需要從多個資料表中讀取資料,這時我們就可以使用SQL語句中的連線(JOIN),基本格式為 join ... on...and,如果不顯示指定join型別,預設為inner,
在兩個或多個資料表中查詢資料。大家可以想象excel中的vlook使用過程。
</br>
PS:
1. 如下圖中的A和B為兩個表名
2. 下面圖來源於網上部落格

3.1 內連線(Inner join)

產生的結果是A和B的交集(相同列裡面的相同值)
語法:[inner] join … on 條件1 and 條件2 and …

如下圖所示:
在這裡插入圖片描述

例:查詢商品表dim_sku_info_v100中所有商品在門店有銷售的記錄
SQL:

select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
inner join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
 
 
結果:
 
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|dt      |order_no|shop_no|product_code|product_code2|product_name|order_date|payment_amount|
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|20200901|10700001|0011   |5501        |5501         |可樂350ml   |2020-07-31|10            |
|20200901|10700001|0011   |5502        |5502         |雪碧500ml   |2020-08-01|90            |
|20200901|10700002|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700003|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-07-31|40            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-08-01|50            |
+--------+--------+-------+------------+-------------+------------+----------+--------------+
 
說明:連線條件為商品編碼,商品55035504沒有顯示出來

3.2 外連線(Full Outer Join)

產生的結果是A和B的並集(如果沒有相同的值會用null作為值)
語法:full join … on 條件1 and 條件2 and …

在這裡插入圖片描述

例:查詢商品表dim_sku_info_v100中所有商品在門店的記錄,如果銷售記錄中的商品編碼不在dim_sku_info_v100表中也要顯示
SQL:
 
select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t3.product_code as product_code2,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
full join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
 
結果:
 
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|dt      |order_no|shop_no|product_code|product_code2|product_name|order_date|payment_amount|
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|20200901|10700002|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700003|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700001|0011   |5501        |5501         |可樂350ml   |2020-07-31|10            |
|20200901|10700001|0011   |5502        |5502         |雪碧500ml   |2020-08-01|90            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-07-31|40            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-08-01|50            |
|NULL    |NULL    |NULL   |NULL        |5504         |芬達243ml   |NULL      |NULL          |
|20200901|10700002|0011   |5503        |NULL         |NULL        |2020-08-02|30            |
+--------+--------+-------+------------+-------------+------------+----------+--------------+

3.3 左連線(Left outer join)

產生表A的完全集,根據on條件,B表中匹配的則有值(沒有匹配的則以null值取代)
語法:left join … on 條件1 and 條件2 and …

在這裡插入圖片描述

例:查詢門店所有銷售記錄的商品名稱
SQL:
 
select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t3.product_code as product_code2,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
left join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
 
結果:
 
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|dt      |order_no|shop_no|product_code|product_code2|product_name|order_date|payment_amount|
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|20200901|10700001|0011   |5501        |5501         |可樂350ml   |2020-07-31|10            |
|20200901|10700001|0011   |5502        |5502         |雪碧500ml   |2020-08-01|90            |
|20200901|10700002|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700002|0011   |5503        |NULL         |NULL        |2020-08-02|30            |
|20200901|10700003|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-07-31|40            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-08-01|50            |
+--------+--------+-------+------------+-------------+------------+----------+--------------+

3.4 右連線(Right Outer Join)

產生表B的完全集,根據on條件,A表中匹配的則有值(沒有匹配的則以null值取代),與Left Join相反
語法:right join … on 條件1 and 條件2 and …

在這裡插入圖片描述

例:查詢在商品表中所有商品在門店的銷售記錄
SQL:
 
select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t3.product_code as product_code2,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
right join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
 
結果:
 
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|dt      |order_no|shop_no|product_code|product_code2|product_name|order_date|payment_amount|
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|20200901|10700003|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700002|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700001|0011   |5501        |5501         |可樂350ml   |2020-07-31|10            |
|20200901|10700001|0011   |5502        |5502         |雪碧500ml   |2020-08-01|90            |
|NULL    |NULL    |NULL   |NULL        |5504         |芬達243ml   |NULL      |NULL          |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-08-01|50            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-07-31|40            |
+--------+--------+-------+------------+-------------+------------+----------+--------------+

3.5 笛卡爾積(cross join)

cross對兩個表執行笛卡爾乘積(可以理解為沒有連線條件)。它為左錶行和右錶行的每種可能的組合返回一行,一般出現笛卡爾積時大概率現出了邏輯錯誤,比如連線條件寫錯表名,導致兩個連線時沒有連線條件。除非特殊情況下需要使用笛卡爾積

例:查詢門店所有銷售記錄的商品名稱

ps:注意下圖中的連線條件

SQL:
 
select
  t1.dt,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t3.product_code as product_code2,
  t3.product_name,
  t1.order_date,
  t1.payment_amount
from dw_wms_outbound_info_v100 t1
left join dim_sku_info_v100 t3 on t1.product_code = t1.product_code
 
結果:如下
 
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|dt      |order_no|shop_no|product_code|product_code2|product_name|order_date|payment_amount|
+--------+--------+-------+------------+-------------+------------+----------+--------------+
|20200901|10700001|0011   |5501        |5501         |可樂350ml   |2020-07-31|10            |
|20200901|10700001|0011   |5501        |5502         |雪碧500ml   |2020-07-31|10            |
|20200901|10700001|0011   |5501        |5504         |芬達243ml   |2020-07-31|10            |
|20200901|10700001|0011   |5501        |6601         |巧克力      |2020-07-31|10            |
|20200901|10700001|0011   |5502        |5501         |可樂350ml   |2020-08-01|90            |
|20200901|10700001|0011   |5502        |5502         |雪碧500ml   |2020-08-01|90            |
|20200901|10700001|0011   |5502        |5504         |芬達243ml   |2020-08-01|90            |
|20200901|10700001|0011   |5502        |6601         |巧克力      |2020-08-01|90            |
|20200901|10700002|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700002|0011   |5501        |5502         |雪碧500ml   |2020-08-02|20            |
|20200901|10700002|0011   |5501        |5504         |芬達243ml   |2020-08-02|20            |
|20200901|10700002|0011   |5501        |6601         |巧克力      |2020-08-02|20            |
|20200901|10700002|0011   |5503        |5501         |可樂350ml   |2020-08-02|30            |
|20200901|10700002|0011   |5503        |5502         |雪碧500ml   |2020-08-02|30            |
|20200901|10700002|0011   |5503        |5504         |芬達243ml   |2020-08-02|30            |
|20200901|10700002|0011   |5503        |6601         |巧克力      |2020-08-02|30            |
|20200901|10700003|0011   |5501        |5501         |可樂350ml   |2020-08-02|20            |
|20200901|10700003|0011   |5501        |5502         |雪碧500ml   |2020-08-02|20            |
|20200901|10700003|0011   |5501        |5504         |芬達243ml   |2020-08-02|20            |
|20200901|10700003|0011   |5501        |6601         |巧克力      |2020-08-02|20            |
|20200901|10700004|0012   |6601        |5501         |可樂350ml   |2020-07-31|40            |
|20200901|10700004|0012   |6601        |5502         |雪碧500ml   |2020-07-31|40            |
|20200901|10700004|0012   |6601        |5504         |芬達243ml   |2020-07-31|40            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-07-31|40            |
|20200901|10700004|0012   |6601        |5501         |可樂350ml   |2020-08-01|50            |
|20200901|10700004|0012   |6601        |5502         |雪碧500ml   |2020-08-01|50            |
|20200901|10700004|0012   |6601        |5504         |芬達243ml   |2020-08-01|50            |
|20200901|10700004|0012   |6601        |6601         |巧克力      |2020-08-01|50            |
+--------+--------+-------+------------+-------------+------------+----------+--------------+
 
 
Time taken: 0.815 seconds, Fetched 28 row(s)

可見,由於出現了笛卡爾積,由於寫錯將條件寫為(t1.product_code = t1.product_code),正確為t1.product_code = t2.product_code,最終資料最從7條變為了28條,資料量增長了4倍,

問題:如果A表有10萬條資料,B表有10萬條資料,如果發生邏輯錯誤,產生了笛卡爾積的操作,結果會是多少條資料?
10萬? 100萬? 10億? 100億?

4、連線後聚合

實際操作中常會對多表連線後的結果執行聚合操作,語法格式為:
在這裡插入圖片描述
SQL的執行順序為 from -> join -> group -> having -> select -> order -> limit 在後面的課程中會專門對SQL的執行順序進行詳細講解

例:查詢門店每天商品的銷售額,按日期+商品名稱聚合
SQL:

select
  t1.order_date,
  t3.product_name,
  sum(t1.payment_amount) as payment_amount
from dw_wms_outbound_info_v100 t1
left join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
group by
  t1.order_date,
  t3.product_name
 
結果:如下
 
+----------+------------+--------------+
|order_date|product_name|payment_amount|
+----------+------------+--------------+
|2020-07-31|可樂350ml   |10            |
|2020-08-02|NULL        |30            |
|2020-08-01|雪碧500ml   |90            |
|2020-08-02|可樂350ml   |40            |
|2020-08-01|巧克力      |50            |
|2020-07-31|巧克力      |40            |
+----------+------------+--------------+

5、union[distinct | all]

UNION將多個SELECT語句的結果集合併為一個獨立的結果集,distinct會對結果進行去重 (預設為 distinct,去重操作會影響效能),all保留所有

語法:
select_statement UNION [ALL | DISTINCT] 
select_statement UNION [ALL | DISTINCT] 
...
select_statement
例:使用union 來對0011門店和0012門店的有銷售商品進行彙總,
SQL:
 
select
  shop_no,product_code
from dw_wms_outbound_info_v100
where shop_no = '0011'
 
 
union
 
select
  shop_no,product_code
from dw_wms_outbound_info_v100
where shop_no = '0012'
 
結果:去重
 
+-------+------------+
|shop_no|product_code|
+-------+------------+
|0011   |5501        |
|0011   |5502        |
|0012   |6601        |
|0011   |5503        |
+-------+------------+

四、進階函式

1、聚合函式

函式名用途
count([DISTINCT] col)統計行數
avg(col)平均數
min(col)最小值
max(col)最大值
sum(col)求和
percentile_approx(col, p)返回組中列的精確第p個百分點 p必須介於0和1之間
stddev(col)標準差
collect_set(col)欄位去重後的集合
collect_list(col)欄位不去重的集合
corr(col1, col2)返回兩列的皮爾遜相關係數

解釋:

  1. count(統計行數):對分組內的資料條數進行統計,有如下幾種方式:
  2. count(1):該種方式會對分組所有的記錄進行統計,沒有進行任何的過濾(1可以換成任何整數,習慣使用1
  3. count(*):該方式同count(1)
  4. count(col):根據某列進行統計,如果某一條記錄該欄位為null,那麼此記錄不會統計
  5. count(distinct col):與count(col)相比,該方法會對欄位的值進行去重後統計

例如,首先查詢門店所有銷售記錄的商品名稱,並根據商品名稱欄位進行統計資料條數

SQL:
 
select
  order_date as od,
  count(1) as count1,
  count(*) as count2,
  count(product_code) as count3,
  count(product_name) as count4,
  round(avg(payment_amount),1) as avg,
  min(payment_amount) as min,
  max(payment_amount) as max,
  sum(payment_amount) as sum,
  round(percentile_approx(payment_amount, 0.5),1) as percentile,
  round(stddev(payment_amount),1) as stddev,
  collect_set(shop_no) as set,
  collect_list(shop_no) as list
from (
  select
    t1.dt,
    t1.order_no,
    t1.shop_no,
    t1.product_code,
    t3.product_code as product_code2,
    t3.product_name,
    t1.order_date,
    t1.payment_amount
  from dw_wms_outbound_info_v100 t1
  left join dim_sku_info_v100 t3 on t1.product_code = t3.product_code
) t1
group by order_date
 
查詢結果:
 
+----------+------+------+------+------+----+---+---+---+----------+------+---------------+----------------------+
|od        |count1|count2|count3|count4|avg |min|max|sum|percentile|stddev|set            |list                  |
+----------+------+------+------+------+----+---+---+---+----------+------+---------------+----------------------+
|2020-08-02|3     |3     |3     |2     |23.3|20 |30 |70 |20        |5.8   |["0011"]       |["0011","0011","0011"]|
|2020-08-01|2     |2     |2     |2     |70.0|50 |90 |140|50        |28.3  |["0011","0012"]|["0011","0012"]       |
|2020-07-31|2     |2     |2     |2     |25.0|10 |40 |50 |10        |21.2  |["0011","0012"]|["0012","0011"]       |
+----------+------+------+------+------+----+---+---+---+----------+------+---------------+----------------------+

2、分析&視窗函式

常用的函式:

函式名用途
row_number() OVER([partition_by_clause] order_by_clause)分組排序,同一組內相同值會給不同序號
rank() OVER([partition_by_clause] order_by_clause)組內從1開始按order by排序後,將相同的值輸出為相同的值,其它值不變(好亂,看下面樣例吧)
dense_rank() OVER([partition_by_clause] order_by_clause)返回從1開始的遞增整數序列。輸出序列為ORDER BY表示式的重複值生成重複整數
lag(expr [, offset] [, default]) OVER ([partition_by_clause] order_by_clause)該函式使用前一行的列值返回表示式的值。您可以指定一個整數偏移量,它指定一個行位置,位於當前行之前的某些行數
lead與lag對應,記住一個就可以

2.1 row_number | rank | dense_rank

例如 :根據支付金額排序
SQL:
 
select
    order_date,
    order_no,
    shop_no,
    payment_amount,
    row_number() over(order by payment_amount) as row_number,
    rank() over(order by payment_amount) as rank,
    dense_rank() over(order by payment_amount) as dense_rank
from dw_wms_outbound_info_v100
 
 
結果:
+----------+--------+-------+--------------+----------+----+----------+
|order_date|order_no|shop_no|payment_amount|row_number|rank|dense_rank|
+----------+--------+-------+--------------+----------+----+----------+
|2020-07-31|10700001|0011   |10            |1         |1   |1         |
|2020-08-02|10700002|0011   |20            |2         |2   |2         |
|2020-08-02|10700003|0011   |20            |3         |2   |2         |
|2020-08-02|10700002|0011   |30            |4         |4   |3         |
|2020-07-31|10700004|0012   |40            |5         |5   |4         |
|2020-08-01|10700004|0012   |50            |6         |6   |5         |
|2020-08-01|10700001|0011   |90            |7         |7   |6         |
+----------+--------+-------+--------------+----------+----+----------+

2.2 lag | lead

例如 :計算每家門店每天相對於昨天的銷售額的變化?

SQL:
 
select
  t1.order_date,
  t1.shop_no,
  t1.payment_amount,
  -- 以下為昨天的銷售額
  lag(payment_amount) over(partition by shop_no order by t1.order_date) as yest_payment_amount,
  lag(payment_amount,1, 0) over(partition by shop_no order by t1.order_date) as yest_payment_amount2,
  lag(payment_amount,2, 0) over(partition by shop_no order by t1.order_date) as yest_payment_amount3,
  lead(payment_amount,1, 0) over(partition by shop_no order by t1.order_date) as yest_payment_amount4,
  lag(payment_amount,1, 0) over(partition by shop_no order by t1.order_date desc) as yest_payment_amount5
from (
  select
    order_date,
    shop_no,
    sum(payment_amount) as payment_amount --當天銷售額
  from dw_wms_outbound_info_v100
  group by
    order_date,
    shop_no
) t1
 
 
結果:
+----------+-------+--------------+-------------------+--------------------+--------------------+--------------------+--------------------+
|order_date|shop_no|payment_amount|yest_payment_amount|yest_payment_amount2|yest_payment_amount3|yest_payment_amount4|yest_payment_amount5|
+----------+-------+--------------+-------------------+--------------------+--------------------+--------------------+--------------------+
|2020-07-31|0012   |40            |NULL               |0                   |0                   |50                  |50                  |
|2020-08-01|0012   |50            |40                 |40                  |0                   |0                   |0                   |
|2020-07-31|0011   |10            |NULL               |0                   |0                   |90                  |90                  |
|2020-08-01|0011   |90            |10                 |10                  |0                   |70                  |70                  |
|2020-08-02|0011   |70            |90                 |90                  |10                  |0                   |0                   |
+----------+-------+--------------+-------------------+--------------------+--------------------+--------------------+--------------------+

2.3 over

與over 配置使用的聚合函式:

  1. COUNT
  2. SUM
  3. MIN
  4. MAX
  5. AVG

使用樣例:

SQL:
 
select
  t1.order_date,
  t1.order_no,
  t1.shop_no,
  t1.product_code,
  t1.payment_amount,
  --統計每天門店每天的單量
  count(order_no) over(partition by order_date, shop_no) as shop_order_cnt,
  --統計每天門店每天的銷售額
  sum(payment_amount) over(partition by order_date, shop_no) as payment_amount_sum,
  --統計每天所有門店單品最高支付金額
  max(payment_amount) over(partition by order_date) as payment_amount_max
from dw_wms_outbound_info_v100 t1
 
結果:
 
+----------+--------+-------+------------+--------------+--------------+------------------+------------------+
|order_date|order_no|shop_no|product_code|payment_amount|shop_order_cnt|payment_amount_sum|payment_amount_max|
+----------+--------+-------+------------+--------------+--------------+------------------+------------------+
|2020-08-02|10700002|0011   |5503        |30            |3             |70                |30                |
|2020-08-02|10700002|0011   |5501        |20            |3             |70                |30                |
|2020-08-02|10700003|0011   |5501        |20            |3             |70                |30                |
|2020-08-01|10700004|0012   |6601        |50            |1             |50                |90                |
|2020-08-01|10700001|0011   |5502        |90            |1             |90                |90                |
|2020-07-31|10700004|0012   |6601        |40            |1             |40                |40                |
|2020-07-31|10700001|0011   |5501        |10            |1             |10                |40                |
+----------+--------+-------+------------+--------------+--------------+------------------+------------------+
 

3、複雜型別

函式名說明demo結果
map(key1, value1, …)根據給定的key-value生成mapselect map(‘a’, ‘001’, ‘b’,‘002’) as res
select map(‘a’, ‘001’, ‘b’,‘002’, ‘a’, ‘003’)[‘a’] as res
{“a”:“001”,“b”:“002”}
003
named_struct(name1, val1, name2, val2, …)建立struct結構select named_struct(‘expression’,‘1*1’,‘unit’,10).expression1*1
array(val1, val2, …)根據給定的元素生成array生成陣列:select array(‘a’, ‘b’) as res
查詢元素:select array(‘a’, ‘b’)[0] as res
[“a”,“b”]
a

4、集合函式

函式名說明demo結果
array_containsReturns TRUE if the array contains value.select array_contains(array(‘a’, ‘b’), ‘b’) as res
select array_contains(array(‘a’, ‘b’), ‘c’) as res
true
false
size返回陣列或map元素的個數select size(array(‘a’, ‘b’)) as res
select size(map(‘a’, ‘b’)) as res
2
1
sort_array對陣列進行排序select sort_array( array(‘b’, ‘c’,‘a’ ) ) as res[“a”,“b”,“c”]

參考文件

1、Windowing and Analytics Functions
2、HUE–hive函式大全

相關文章