Today I ran into a problem: after importing a data file into Hive, I ran some checks on the data. The details are below:
step1:
select count(1),count(distinct sys_code,order_no,hotel_seq) from dw_source.oc_hotel_order where dt='20150530';
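Result: 57736688 57656110

Note: the two counts differ, so the combination sys_code,order_no,hotel_seq seems to contain duplicates, i.e. the grouping key is not unique!
Inference: select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1 should therefore return some rows!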

step2:
select count(distinct concat(sys_code,order_no,hotel_seq)) from dw_source.oc_hotel_order where dt='20150530';
Result: 57656110

Note: a second check that confirms the number from step1.

step3:
select count(1) from
(
select sys_code,order_no,hotel_seq,count(1) from dw_source.oc_hotel_order where dt='20150530' group by sys_code,order_no,hotel_seq having count(1) > 1
) t;
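Result: 0

Note: this shows that the inference from step1 does not hold.
But if the inference does not hold, then sys_code,order_no,hotel_seq should have no duplicates at all, which contradicts the conclusion of step1.

Why is that???

I spent a long time guessing at the cause and tried many different ways of testing it.
In the end it turned out that NULL values were causing the problem, so I ran a quick experiment: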

create table temp.test_distinct
as
select sys_code,order_no,hotel_seq
from dw_source.oc_hotel_order
where dt='20150530'
and hotel_seq is null
limit 10;

hive> select * from temp.test_distinct;
OK
1001 1337418073974 NULL
1001 1322452118 NULL
1001 1321665107 NULL
1001 1320940345 NULL
1001 1320654060 NULL
1001 1320091070 NULL
1001 100152004170 NULL
1001 100148616269 NULL
1001 100141108024 NULL
1001 100136889859 NULL

select count(1),count(distinct sys_code,order_no,hotel_seq) from temp.test_distinct;

Total MapReduce CPU Time Spent: 2 seconds 620 msec
OK
10 0

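And there it is!!! count(distinct sys_code,order_no,hotel_seq) comes back as 0!
In Hive, count(distinct ...) over several columns only counts combinations in which every column is non-NULL, so any row with a NULL hotel_seq is skipped entirely. concat() in step2 likewise returns NULL as soon as one argument is NULL, which is why it gave the same number. group by, on the other hand, keeps NULL as a key value of its own, so step3 found no duplicates.

One last piece of supporting evidence: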
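The experiment above can be replayed without a Hive cluster. The following is only a minimal sketch using Python's built-in sqlite3 as a stand-in, on the assumption that SQLite treats NULL the same way as Hive in these three cases (NULL-propagating concatenation, count(distinct) ignoring NULL, group by keeping NULL keys). SQLite does not accept count(distinct a,b,c) with several columns, so the sketch uses the concatenation form from step2 instead. The in-memory table is a hypothetical copy of temp.test_distinct.

```python
import sqlite3

# Hypothetical in-memory copy of temp.test_distinct: ten rows, hotel_seq always NULL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_distinct (sys_code TEXT, order_no TEXT, hotel_seq TEXT)"
)
order_nos = [
    "1337418073974", "1322452118", "1321665107", "1320940345", "1320654060",
    "1320091070", "100152004170", "100148616269", "100141108024", "100136889859",
]
conn.executemany(
    "INSERT INTO test_distinct VALUES ('1001', ?, NULL)",
    [(n,) for n in order_nos],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Every row is counted, NULL or not.
total = scalar("SELECT count(1) FROM test_distinct")

# The concatenation is NULL for every row, and count(DISTINCT ...) ignores NULL.
concat_distinct = scalar(
    "SELECT count(DISTINCT sys_code || order_no || hotel_seq) FROM test_distinct"
)

# GROUP BY keeps NULL as a key value, so no group has more than one row.
dup_groups = scalar(
    "SELECT count(1) FROM ("
    " SELECT sys_code, order_no, hotel_seq, count(1) AS c FROM test_distinct"
    " GROUP BY sys_code, order_no, hotel_seq HAVING count(1) > 1) t"
)

# Counting the groups themselves gives the number of distinct keys, NULL included.
all_groups = scalar(
    "SELECT count(1) FROM ("
    " SELECT 1 FROM test_distinct"
    " GROUP BY sys_code, order_no, hotel_seq) t"
)

print(total, concat_distinct, dup_groups, all_groups)  # 10 0 0 10
```

The same apparent contradiction shows up: the distinct count is 0 and the duplicate check is empty, while grouping still sees ten distinct keys.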
select count(1) from dw_source.oc_hotel_order where dt='20150530' and hotel_seq is null;
Total MapReduce CPU Time Spent: 4 minutes 14 seconds 350 msec
OK
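57656110 + 80578 = 57736688, an exact match: the 80578 rows with a NULL hotel_seq account for the entire gap between the two counts in step1.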
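The reconciliation above can be checked on a small scale. This is again only a sketch in Python's sqlite3 under the same assumption that its NULL handling matches Hive's here; the orders table and its five rows are made up for illustration. It also shows one possible way to get a distinct count that does not silently drop NULL keys, namely counting the groups produced by group by.

```python
import sqlite3

# Hypothetical sample data: three complete keys and two rows with a NULL hotel_seq.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (sys_code TEXT, order_no TEXT, hotel_seq TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("1001", "A1", "h1"),
        ("1001", "A2", "h2"),
        ("1001", "A3", None),
        ("1001", "A4", None),
        ("1002", "A1", "h1"),
    ],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


total = scalar("SELECT count(1) FROM orders")

# Rows with a NULL column vanish from this count (the separator avoids
# accidental collisions such as 'ab' + 'c' versus 'a' + 'bc').
distinct_non_null = scalar(
    "SELECT count(DISTINCT sys_code || '|' || order_no || '|' || hotel_seq)"
    " FROM orders"
)

null_rows = scalar("SELECT count(1) FROM orders WHERE hotel_seq IS NULL")

# NULL-safe alternative: count the groups, since GROUP BY keeps NULL keys.
null_safe_distinct = scalar(
    "SELECT count(1) FROM ("
    " SELECT 1 FROM orders GROUP BY sys_code, order_no, hotel_seq) t"
)

print(total, distinct_non_null, null_rows, null_safe_distinct)  # 5 3 2 5
```

Here 3 + 2 = 5 plays the role of 57656110 + 80578 = 57736688. The identity only holds when the keys are otherwise unique, which is exactly what step3 established for the real table.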
