Using Hive dynamic partitions

Posted by lcz393537415 on 2016-09-29
Reference link: http://www.cnblogs.com/xd502djj/archive/2013/12/11/3470074.html


Example statements:
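-- Enable dynamic partition inserts.
set hive.exec.dynamic.partition=true;
-- nonstrict mode allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
-- Upper limit on dynamic partitions created by each mapper/reducer node.
set hive.exec.max.dynamic.partitions.pernode=100000;
-- Upper limit on the total number of files created by one job.
set hive.exec.max.created.files=500000;
-- Number of reduce tasks for the job.
set mapred.reduce.tasks=3000;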
INSERT OVERWRITE TABLE sany_online_hive_wj.ecc_wj partition(st_year,st_month,st_day)
select
a.st_pid,
a.st_loginid,
a.st_ma_serialno,
a.st_checkmark,
a.st_state,
a.st_logintime,
a.st_connecttime,
a.st_updatetime,
a.st_totalwktime,
a.st_rmntime,
a.st_berrorcode,
a.st_werrorcode,
a.st_balmcode,
a.st_walmcode,
a.st_longitude,
a.st_latitude,
a.st_saticunt,
a.st_steppos,
a.st_engv,
a.st_oillev,
a.st_batteryvol,
a.st_floatreserv33,
a.st_floatreserv34,
a.re_en_pid,
a.st_wktime,
a.st_gpssta,
a.st_velocity,
a.st_orientation,
a.st_sgnlq,
a.st_errdealsta,
a.st_cmmctsch,
a.st_altitude,
a.st_uintreserv10,
a.st_uintreserv11,
a.st_uintreserv12,
a.st_uintreserv13,
a.st_uintreserv14,
a.st_uintreserv15,
a.st_uintreserv16,
a.st_uintreserv17,
a.st_uintreserv18,
a.st_uintreserv19,
a.st_uintreserv20,
a.st_uintreserv21,
a.st_uintreserv22,
a.st_uintreserv23,
a.st_uintreserv24,
a.st_uintreserv25,
a.st_uintreserv26,
a.st_uintreserv27,
a.st_uintreserv28,
a.st_uintreserv29,
a.st_uintreserv30,
a.st_uintreserv31,
a.st_uintreserv32,
a.st_floatreserv13,
a.st_floatreserv14,
a.st_floatreserv15,
a.st_floatreserv16,
a.st_floatreserv17,
a.st_floatreserv18,
a.st_floatreserv19,
a.st_floatreserv20,
a.st_floatreserv21,
a.st_floatreserv22,
a.st_floatreserv23,
a.st_floatreserv24,
a.st_floatreserv25,
a.st_floatreserv26,
a.st_floatreserv27,
a.st_floatreserv28,
a.st_floatreserv29,
a.st_floatreserv30,
a.st_floatreserv31,
a.st_floatreserv32,
substring(trim(a.st_updatetime),1,4) st_year,
substring(trim(a.st_updatetime),6,2) st_month,
substring(trim(a.st_updatetime),9,2) st_day
from sany_online_hive_wj.ecc_wj_distinct as a
where substring(trim(a.st_updatetime),1,7)='${YYYY-MM}'
distribute by st_year,st_month,st_day;


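Before distribute by was added, moving the data from the Hive job's temporary result path into the table's partition paths was extremely slow, taking roughly 30 to 40 minutes. With distribute by added, the job takes about 3 minutes.
For the specific reason, see: http://blog.csdn.net/xiaolang85/article/details/11767297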
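
One commonly cited explanation (stated here as an assumption, since the linked article is not reproduced in this post) is that without distribute by, every task that sees rows for a partition writes its own file into that partition, so the number of small files to move can approach tasks x partitions. With distribute by on the partition columns, all rows of one partition are routed to a single reducer, which writes one file for it. A minimal sketch of the two forms follows. The tables demo_src and demo_part are hypothetical, not part of the original job, and updatetime is assumed to look like 'YYYY-MM-DD ...'.

```sql
-- Hypothetical tables, for illustration only.
create table if not exists demo_src (id bigint, updatetime string);
create table if not exists demo_part (id bigint, updatetime string)
partitioned by (st_year string, st_month string, st_day string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Form 1: no distribute by. Each task may write a file into every
-- partition it has rows for, which tends to produce many small files.
insert overwrite table demo_part partition(st_year, st_month, st_day)
select id,
       updatetime,
       substring(updatetime, 1, 4) as st_year,
       substring(updatetime, 6, 2) as st_month,
       substring(updatetime, 9, 2) as st_day
from demo_src;

-- Form 2: distribute by the partition columns. Rows with the same
-- (st_year, st_month, st_day) go to the same reducer, so each
-- partition is written by a single task.
insert overwrite table demo_part partition(st_year, st_month, st_day)
select id,
       updatetime,
       substring(updatetime, 1, 4) as st_year,
       substring(updatetime, 6, 2) as st_month,
       substring(updatetime, 9, 2) as st_day
from demo_src
distribute by st_year, st_month, st_day;
```

Comparing the file counts under the partition directories after each form (for example with hadoop fs -count) should show whether this explanation applies to a given job.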

Also, with data skew in mind, it is best to spread the data evenly across the files of a Hive table.
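
One way to approach this, sketched here as an assumption rather than something the original job did, is to add a deterministic slot to the distribute by keys so that a large partition is split across several reducers instead of one. The tables demo_src and demo_part and the slot count 8 are hypothetical. hash and pmod are Hive built-in functions, and the sketch assumes the Hive version in use accepts expressions over selected columns in distribute by.

```sql
-- Hypothetical tables, for illustration only.
create table if not exists demo_src (id bigint, updatetime string);
create table if not exists demo_part (id bigint, updatetime string)
partitioned by (st_year string, st_month string, st_day string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- pmod(hash(id), 8) maps every row to one of 8 slots (0..7).
-- Rows of one partition are spread over up to 8 reducers, so a
-- heavily loaded day ends up as up to 8 files of similar size
-- instead of a single large file written by one slow reducer.
insert overwrite table demo_part partition(st_year, st_month, st_day)
select id,
       updatetime,
       substring(updatetime, 1, 4) as st_year,
       substring(updatetime, 6, 2) as st_month,
       substring(updatetime, 9, 2) as st_day
from demo_src
distribute by st_year, st_month, st_day, pmod(hash(id), 8);
```

The slot count trades file size against parallelism: a larger value gives more, smaller files per partition. A hash of a row column is used instead of rand() so that a retried task sends each row to the same reducer.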

From the ITPUB blog, link: http://blog.itpub.net/31347383/viewspace-2125732/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
