1.join 查詢

　　1、永遠是小結果集驅動大結果集(小表驅動大表，小表放在左表)。 2、儘量不要使用join，但是join是難以避免的。

　　left join 、 left outer join 、 left semi join(左半開連線，只顯示左表資訊)

　　hive在0.8版本以後開始支援left join

　　left join 和 left outer join 效果差不多

　　hive的join中的on只能跟等值連線 "=",不能跟< >= <= !=

　　join:不加where過濾，叫笛卡爾積

　　inner join ：內連線

　　outer join ：外連結

　　full outer join ：全外連線，尋找表中所有滿足連線(包括where過濾)。

　　##建立學生測試表
CREATE TABLE student (
id INT,
name string,
sex string,
birth string,
department string,
address string
)
row format delimited fields terminated by ','
;
##從本地載入資料
load data local inpath '/root/Desktop/student.txt' into table student;
##建立學生成績測試表
CREATE TABLE score (
stu_id INT,
c_name string,
grade string
)
row format delimited fields terminated by ','
;
##從hdfs載入資料
load data inpath '/hive.data/score.txt' into table score;
##建立學生基本資訊測試表
create table stuinfo(
id int,
height double,
weight double
)
row format delimited fields terminated by ','
;
##從本地載入資料
load data local inpath '/root/Desktop/stuinfo.txt' into table stuinfo;

　　 1.1 左連線

　　如果左邊有資料，右邊沒有資料，則左邊有資料的記錄的對應列返回為空。

　　##使用左連線查詢：student表為驅動表，透過id連線
select
student.name,
score.c_name,
score.grade
from student
left join score
on student.id = score.stu_id
;

　　 1.2 左外連線

　　如果左邊有資料，右邊沒有資料，則左邊有資料的記錄的對應列返回為空。

　　##使用左外連線查詢：student表為驅動表，透過id連線
select
student.name,
score.c_name,
score.grade
from student
left outer join score
on student.id = score.stu_id
;

　　 1.3 左半連線

　　left semi join是left join 的一種最佳化，並且通常用於解決exists in，寫left semi join 查詢時必須遵循一個限制:右表(sales) 只能在 on子句中出現，且不能在select 表示式中引用右表。

　　##左半連線
select
student.*
from student
left semi join score on
student.id=score.id
;
##查詢結果等價於
select * from student where student.id not in (select stu_id from score);

　　 1.4 右外連線

　　right outer join和right join差不多，不常用

　　如果左邊沒有資料，右邊有資料，則右邊有資料的記錄對應列返回為空。

　　##使用右外連線查詢：score表為驅動表，透過id連線
select
student.name,
score.c_name,
score.grade
from student
right outer join score
on student.id = score.stu_id
;

　　 1.5 全外連線

　　##全外連線
select
student.name,
score.c_name,
score.grade
from student
full outer join score
on student.id = score.stu_id
;

　　##不使用join，from 後面跟多個表名使用","分割、 inner join 、join ：三種效果一樣
select
student.name,
score.c_name,
score.grade
from student，score
where
student.id = score.stu_id
;

　　 1.6 內連線

　　##使用內連線查詢所有有考試成績的學生的學生姓名，學科名，學科成績，及身高
select
student.name,
score.c_name,
score.grade,
stuinfo.height
from student
inner join score
on student.id = score.stu_id
join stuinfo
on student.id = stuinfo.id
;

　　 1.7 hive 提供一個小表標識，是 hive 提供的一種最佳化機制

　　##小表標識：/*+STREAMTABLE(表名)*/
select
/*+STREAMTABLE(score)*/
student.name,
score.c_name,
score.grade,
stuinfo.height
from student
inner join score
on student.id = score.stu_id
join stuinfo
on student.id = stuinfo.id
;

　　 1.8 map-side join ：

　　如果有一個連線表小到足以放入記憶體， Hive就可以把較小的表放入每個mapper的記憶體來執行連線操作。這就叫做map連線。當有一大一小表的時候，適合用map-join。會將小表檔案快取，放到記憶體中，在map端和記憶體中的資料一一進行匹配，連線查詢關係。hive-1.2.1 預設已經開啟map-side join：hive.auto.convert.join=true

　　select
student.name,
score.c_name,
score.grade,
stuinfo.height
from student
inner join score
on student.id = score.stu_id
join stuinfo
on student.id = stuinfo.id
;

　　hive 0.7版本以前，需要hive提供的mapjoin()標識。來標識該join為map-side join。標識已經過時，但是寫上仍然識別

　　select
/*+MAPJOIN(student)*/
student.name,
score.c_name,
score.grade,
stuinfo.height
from student
inner join score
on student.id = score.stu_id
join stuinfo
on student.id = stuinfo.id
;

　　hive怎麼知道將多大檔案快取,配置檔案中配置，下面為預設配置

　　 <property>
     <name> hive.mapjoin.smalltable.filesize </name>
     <value> 25000000 </value>
   </property>

　　 2.group by ：

　　GROUP BY 語句通常會和聚合函式一起使用，按照一個或者多個列對結果進行分組，然後對每個組執行聚合操作。使用group by後，查詢的欄位要麼出現在聚合函式中，要麼出現在group by 後面。

　　##查詢學生的考試門數，及平均成績
select
count(*),
avg(grade)
from student join score
on student.id=score.stu_id
group by student.id;

　　 3.where

　　SELECT語句用於選取欄位，WHERE語句用於過濾條件，兩者結合使用可以查詢到符合過濾條件的記錄。後面不能跟聚合函式或者聚合函式的結果，能跟普通的查詢值或者是方法

　　##查詢學生的考試平均成績大於90分
select
count(*),
avg(grade) avg_score
from student join score
on student.id=score.stu_id
where student.id<106
group by student.id
having avg_score>90;

　　 4.having:

　　對查詢出來的結果進行過濾，通常和 group by搭配使用。

　　##查詢學生的考試平均成績大於90分的學生id及平均成績
select
count(*),
avg(grade) avg_score
from student join score
on student.id=score.stu_id
group by student.id
having avg_score>90;

　　 5. 排序

　　sort by ：排序，區域性排序，只能保證單個reducer的結果排序。 order by: 排序，全域性排序。保證整個job的結果排序。當reducer只有1個的時候，sort by 和 order by 效果一樣。建議使用sort by 通常和： desc asc .(預設升序)

　　##查詢學生平均成績按照降序排序
select
avg(grade) avg_score
from
student join score
on student.id=score.stu_id
group by student.id
order by avg_score desc;

　　設定 reducer個數(等於1 或者 2)：

　　set mapreduce.job.reduces=2

　　##使用order by
select
avg(grade) avg_score
from
student join score
on student.id=score.stu_id
group by student.id
order by avg_score desc;
##使用sort by
select
avg(grade) avg_score
from
student join score
on student.id=score.stu_id
group by student.id
sort by avg_score desc;

　　 6.distribute by ：

　　控制 map中如何輸出到reduce。整個hive語句轉換成job預設都有該過程，如果不寫，預設使用第一列的hash值來分。當只有一個reducer的時候不能體現出來。如果distribute by和sort by 一起出現的時候注意順序問題？？distribute by在前面

　　clusterd by ：它等價於distribute by和sort by(升序)。後面跟的欄位名需要一樣 clusterd by它既兼有distribute by，還兼有sort by (只能是升序)

　　select
id
from
student
distribute by id
sort by id;

　　select
id
from
student
clusterd by id;

　　 7.limit : 限制結果集的。

　　select
id,name
from student
limit 3;

　　 8.union all ：

　　將兩個或者多個查詢的結果集合併到一起，不去重每一個結果集排序。欄位數必須一致，欄位型別儘量相同

　　##將id<108的和id>103的使用union all合併
select
id sid,
name snames
from student
where id<108
union all
select
id sid,
name sname
from student
where id>103;

　　 9.union ：

　　將兩個或者多個查詢結果集合併到一起，去重 ,合併後的資料排序

　　##將id<108的和id>103的使用union合併
select
id sid,
name sname
from student
where id<108
union
select
id sid,
name sname
from student
where id>103
order by sname;

　　 10. 子查詢

　　子查詢是內嵌在另一個 SQL 語句中的SELECT 語句。Hive 對子查詢的支援很有限。它只允許子查詢出現在SELECT 語句的FROM 子句中。Hive支援非相關子查詢，這個子查詢透過IN或EXISTS語法在WHERE子句中進行查詢。Hive目前暫不支援相關子查詢，相關子查詢的執行依賴於外部查詢的資料。

　　##非相關子查詢
select
id,name
from
student
where id
in
(select stu_id
from
score);

　　##相關子查詢的執行依賴於外部查詢的資料
select sid,uname
from
(select
id sid,
name uname
from student
) s
order by sid
##外層查詢像訪問表那樣訪問子查詢的結果，這是為什麼必須為子查詢賦予一個別名(s)的原因。子查詢中的列必須有唯一的名稱，以便外層查詢可以引用這些列。

好程式設計師大資料學習路線之hive表的查詢