歡迎訪問我的GitHub
https://github.com/zq2599/blog_demos
內容:所有原創文章分類彙總及配套原始碼,涉及Java、Docker、Kubernetes、DevOPS等;
《hive學習筆記》系列導航
本篇概覽
- 本文是《hive學習筆記》系列的第六篇,前面的文章我們們對資料型別、表結構有了基本瞭解,接下來對常用的查詢語句做一次集中式的學習;
- HiveQL與SQL類似, 在語法上與大部分SQL相容, 但是並非完全相容,例如更新、事務等都不支援,子查詢和join操作也有限, 這和底層依賴Hadoop有關;
準備資料
- 本次實戰要準備兩個表:學生表和住址表,欄位都很簡單,如下圖所示,學生表有個住址ID欄位,是住址表裡的記錄的唯一ID:
2. 先建立住址表:
create table address (addressid int, province string, city string)
row format delimited
fields terminated by ',';
- 建立address.txt檔案,內容如下:
1,guangdong,guangzhou
2,guangdong,shenzhen
3,shanxi,xian
4,shanxi,hanzhong
6,jiangshu,nanjing
- 載入資料到address表:
load data
local inpath '/home/hadoop/temp/202010/25/address.txt'
into table address;
- 建立學生表,其addressid欄位關聯了address表的addressid欄位:
create table student (name string, age int, addressid int)
row format delimited
fields terminated by ',';
- 建立student.txt檔案,內容如下:
tom,11,1
jerry,12,2
mike,13,3
john,14,4
mary,15,5
- 載入資料到student表:
load data
local inpath '/home/hadoop/temp/202010/25/student.txt'
into table student;
- 至此,本次操作所需資料已準備完畢,如下所示:
hive> select * from address;
OK
1 guangdong guangzhou
2 guangdong shenzhen
3 shanxi xian
4 shanxi hanzhong
6 jiangshu nanjing
Time taken: 0.043 seconds, Fetched: 5 row(s)
hive> select * from student;
OK
tom 11 1
jerry 12 2
mike 13 3
john 14 4
mary 15 5
Time taken: 0.068 seconds, Fetched: 5 row(s)
- 開始體驗HiveQL
select和where
最普通的帶條件查詢:
hive> select * from address where city like '%a%';
OK
1 guangdong guangzhou
3 shanxi xian
4 shanxi hanzhong
6 jiangshu nanjing
Time taken: 0.128 seconds, Fetched: 4 row(s)
group by
- 按province欄位分組:
select province, count(*) from address group by province;
該查詢會觸發MR計算,結果如下:
...
Total MapReduce CPU Time Spent: 1 seconds 910 msec
OK
guangdong 2
jiangshu 1
shanxi 2
Time taken: 17.847 seconds, Fetched: 3 row(s)
- 試試巢狀查詢,內部是查出city欄位帶有a字母的記錄,然後將這些記錄按照province欄位分組:
select t.province, count(*) from (
select * from address where city like '%a%'
) t
group by t.province;
結果如下:
Total MapReduce CPU Time Spent: 1 seconds 760 msec
OK
guangdong 1
jiangshu 1
shanxi 2
Time taken: 18.036 seconds, Fetched: 3 row(s)
having
- 前面的巢狀查詢,結果有兩個省:guangdong和shanxi,如果再加個條件:只顯示城市數量大於1的省,首先想到的是再加一層巢狀:
select t1.* from (
select t.province, count(*) as cnt from (
select * from address where city like '%a%'
) t
group by t.province) t1
where t1.cnt>1;
結果如下,可見只有shanxi被顯示了:
Total MapReduce CPU Time Spent: 2 seconds 250 msec
OK
shanxi 2
Time taken: 20.067 seconds, Fetched: 1 row(s)
- 對於上述SQL,可以用having語法進行分組篩選,得到同樣資料:
select t.province, count(*) as cnt from (
select * from address where city like '%a%'
) t
group by t.province having cnt>1;
order by
- 對分組結果做排序:
select t.province, count(*) as cnt from (
select * from address where city like '%a%'
) t
group by t.province order by cnt;
會觸發MR,結果如下:
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
jiangshu 1
guangdong 1
shanxi 2
Time taken: 40.315 seconds, Fetched: 3 row(s)
- order by對於的實現,是在最後通過一個reducer進行全部排序,該過程可能耗時較長,針對這種情況,hive提供了sort by,功能與order by一樣,但是會在每個reducer中進行排序,這樣最終做排序的時候效率就會提升;
- 要注意的是:sort by解決的問題是最終結果排序的效率,因此資料量不大時,排序不是瓶頸,此時使用sort by也不會加快整體速度;
內連線(inner join)
- 內連線用join簡寫,與連線標準匹配的資料在兩張表中都存在,才會保留:
select
s.name, s.age,
a.province, a.city
from
student s
inner join
address a
on
s.addressid=a.addressid;
結果如下:
Total MapReduce CPU Time Spent: 1 seconds 20 msec
OK
tom 11 guangdong guangzhou
jerry 12 guangdong shenzhen
mike 13 shanxi xian
john 14 shanxi hanzhong
Time taken: 17.294 seconds, Fetched: 4 row(s)
自然連線(natural join)
- 自然連線是在兩張表中尋找資料型別和列明都相同的欄位,並自動連線起來:
select name, age, province, city from student natural join address;
結果如下,可見不會根據student表的addressid欄位值去address查詢記錄,而是將addrerss的記錄全部連線一次:
Total MapReduce CPU Time Spent: 940 msec
OK
tom 11 guangdong guangzhou
jerry 12 guangdong guangzhou
mike 13 guangdong guangzhou
john 14 guangdong guangzhou
mary 15 guangdong guangzhou
tom 11 guangdong shenzhen
jerry 12 guangdong shenzhen
mike 13 guangdong shenzhen
john 14 guangdong shenzhen
mary 15 guangdong shenzhen
tom 11 shanxi xian
jerry 12 shanxi xian
mike 13 shanxi xian
john 14 shanxi xian
mary 15 shanxi xian
tom 11 shanxi hanzhong
jerry 12 shanxi hanzhong
mike 13 shanxi hanzhong
john 14 shanxi hanzhong
mary 15 shanxi hanzhong
tom 11 jiangshu nanjing
jerry 12 jiangshu nanjing
mike 13 jiangshu nanjing
john 14 jiangshu nanjing
mary 15 jiangshu nanjing
Time taken: 18.525 seconds, Fetched: 25 row(s)
左外連線(left outer join)
- 以連線中的左表為主:
select
s.name, s.age, s.addressid,
a.province, a.city
from
student s
left outer join
address a
on
s.addressid=a.addressid;
結果如下,可見name=mary的記錄,addressid等於5,在address中不存在addressid等於5的記錄,因此province和city欄位都展示了NULL,而在前面使用inner join時,結果中沒有這條記錄:
Total MapReduce CPU Time Spent: 950 msec
OK
tom 11 1 guangdong guangzhou
jerry 12 2 guangdong shenzhen
mike 13 3 shanxi xian
john 14 4 shanxi hanzhong
mary 15 5 NULL NULL
Time taken: 18.442 seconds, Fetched: 5 row(s)
右外連線(right outer join)
和左連線類似,只不過是以右表為主,語法是right outer join:
select
s.name, s.age, s.addressid,
a.province, a.city
from
student s
right outer join
address a
on
s.addressid=a.addressid;
結果如下,可見city=nanjing的記錄,在student表中沒有一條記錄與之關聯,因此結果中展示了address的欄位,而student的欄位為NULL:
Total MapReduce CPU Time Spent: 970 msec
OK
tom 11 1 guangdong guangzhou
jerry 12 2 guangdong shenzhen
mike 13 3 shanxi xian
john 14 4 shanxi hanzhong
NULL NULL NULL jiangshu nanjing
Time taken: 18.294 seconds, Fetched: 5 row(s)
全外連線(full outer join)
查詢結果等於左外連線和右外連線之和,語法是full outer join:
select
s.name, s.age, s.addressid,
a.province, a.city
from
student s
full outer join
address a
on
s.addressid=a.addressid;
結果如下:
Total MapReduce CPU Time Spent: 2 seconds 630 msec
OK
tom 11 1 guangdong guangzhou
jerry 12 2 guangdong shenzhen
mike 13 3 shanxi xian
john 14 4 shanxi hanzhong
mary 15 5 NULL NULL
NULL NULL NULL jiangshu nanjing
Time taken: 22.189 seconds, Fetched: 6 row(s)
- 至此,常用HiveQL體驗完畢,希望能給您一些參考,接下來的章節會進一步學習HiveQL的特性;
你不孤單,欣宸原創一路相伴
歡迎關注公眾號:程式設計師欣宸
微信搜尋「程式設計師欣宸」,我是欣宸,期待與您一同暢遊Java世界...
https://github.com/zq2599/blog_demos