Formatting Hive Data with Awk and Sed

Posted by 壹頁書 on 2016-01-13
Our logs have to be filtered every day, uploaded to Hadoop, and then analyzed through Hive.
The logs look roughly like this:
  2016-01-11 10:00:03,114.112.124.236,57280,10,1,-24576,vvlive,1.0.0.1026,6941c526245a6ec26ed5a4b32e1b2d72,avcore,3,401,21502,roomid=941404583    svr=122.226.185.23  port=8009   devicetype=1    channelid=2034  time=30750  totalsend=849   lostsend=0  uniqsend=849    uniqlost=0  reqlost=0   dropreqlost=0   totalsendbytes=423261   lostsendbytes=0 uniqsendbytes=423261    peerrtt=93  droprate=0.0000 rate=400    originalrate=400
  2016-01-11 10:00:09,114.112.124.236,52999,10,1,11776,vvlive,1.0.0.1026,9b09c9011d102407670b4a3b8b0694f6,vvlive_user,3,5,705497093,705497093+0%+2%+2012/07/29 06:39:25
  2016-01-11 10:00:11,114.112.124.236,52999,10,1,12032,vvlive,1.0.0.1026,9b09c9011d102407670b4a3b8b0694f6,avcore,3,405,705497093,roomid=909578532 svr=122.226.185.23  port=8009   devicetype=1    channel=2042    time=60000  recvcucpackets=0    detectlostpackets=0 reqlostpacketsucccount=0    duppackets=0    reqlostpackets=0    lostnotify=0    lostrate=-1.0000    reqlostrate=-1.0000

  2016-01-11 10:00:11,114.112.124.236,52999,10,1,12032,vvlive,1.0.0.1026,9b09c9011d102407670b4a3b8b0694f6,avcore,3,405,705497093,roomid=909578532 svr=122.226.185.23  port=8009   devicetype=0    channel=2041    time=60000  recvcucpackets=0    detectlostpackets=0 reqlostpacketsucccount=0    duppackets=0    reqlostpackets=0    lostnotify=0    lostrate=-1.0000    reqlostrate=-1.0000

The leading fields are fixed, but the last field is schemaless.
The logs need to be filtered first, and then a handful of fields extracted, including that schemaless column.

Go to the log directory.


Format the logs with the following command:
grep 'avcore,3,' *.dat | sed 's/^.*dat://g' | sed 's/=/\o003/g' | awk -F',' '{printf("%s\001%s\001%s\001%s\001%s\001%s\001%s\n",$1,$8,$9,$10,$12,$13,$14)}' | sed 's/\t/\o002/g' > /data/statsvr/tmp/vvlive_$(date +%y%m%d).txt

What each stage does:
sed 's/^.*dat://g'
Because grep runs over all the .dat files, each matching line comes out prefixed with its file name and a colon; this sed strips that prefix.

sed 's/=/\o003/g'
The last column is schemaless, so it is mapped to Hive's MAP type. This sed replaces each equals sign in the log with the separator Hive uses between a MAP key and its value (octal \003 by default).

awk -F',' '{printf("%s\001%s\001%s\001%s\001%s\001%s\001%s\n",$1,$8,$9,$10,$12,$13,$14)}'
AWK extracts the fields we need, joined by Hive's field separator (octal \001 by default). Against the table defined below, $1 becomes ts, $8 product, $9 pcid, $10 module, $12 mid, $13 userid, and $14 the schemaless tail that becomes the info map.

sed 's/\t/\o002/g'
The key=value pairs in the schemaless tail are separated by tabs; this sed turns those tabs into \002, Hive's default separator between MAP entries, as the sketch below demonstrates.
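To make the stages concrete, here is a minimal sketch that pushes one sample line through the same pipeline. It assumes GNU sed (the \o003 octal escape is a GNU extension) and an awk that expands \001 inside strings (gawk and mawk both do); cat -v renders the control characters as ^A, ^B, ^C. The x.dat: prefix is made up to simulate grep's file-name output, and printf expands the \t between the key=value pairs.

printf 'x.dat:2016-01-11 10:00:03,114.112.124.236,57280,10,1,-24576,vvlive,1.0.0.1026,6941c526245a6ec26ed5a4b32e1b2d72,avcore,3,401,21502,roomid=941404583\tsvr=122.226.185.23\n' \
  | sed 's/^.*dat://g' \
  | sed 's/=/\o003/g' \
  | awk -F',' '{printf("%s\001%s\001%s\001%s\001%s\001%s\001%s\n",$1,$8,$9,$10,$12,$13,$14)}' \
  | sed 's/\t/\o002/g' \
  | cat -v

Expected output:
2016-01-11 10:00:03^A1.0.0.1026^A6941c526245a6ec26ed5a4b32e1b2d72^Aavcore^A401^A21502^Aroomid^C941404583^Bsvr^C122.226.185.23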


The normalized logs (cat -v notation: ^A is \001, ^B is \002, ^C is \003):
  2016-01-11 00:00:06^A1.0.0.1023^A0073612d925a7d89af49dd1a06f98ca2^Aavcore^A405^A0^Aroomid^C0^Bsvr^C122.226.185.23^Bport^C8009^Bdevicetype^C0^Bchannel^C1881^Btime^C60000^Brecvcucpackets^C0^Bdetectlostpackets^C0^Breqlostpacketsucccount^C0^Bduppackets^C0^Breqlostpackets^C0^Blostnotify^C0^Blostrate^C-1.00^Breqlostrate^C-1.00
  2016-01-11 00:00:06^A1.0.0.1023^A0073612d925a7d89af49dd1a06f98ca2^Aavcore^A405^A0^Aroomid^C0^Bsvr^C122.226.185.23^Bport^C8009^Bdevicetype^C1^Bchannel^C1882^Btime^C60000^Brecvcucpackets^C0^Bdetectlostpackets^C0^Breqlostpacketsucccount^C0^Bduppackets^C0^Breqlostpackets^C0^Blostnotify^C0^Blostrate^C-1.00^Breqlostrate^C-1.00
  2016-01-11 00:00:07^A1.0.0.1024^A6c0ea7a728f19068663c997d9b2d5833^Aavcore^A405^A696037^Aroomid^C913570831^Bsvr^C122.226.185.23^Bport^C8009^Bdevicetype^C0^Bchannel^C1881^Btime^C60000^Brecvcucpackets^C0^Bdetectlostpackets^C0^Breqlostpacketsucccount^C0^Bduppackets^C0^Breqlostpackets^C0^Blostnotify^C0^Blostrate^C-1.00^Breqlostrate^C-1.00

Create an external table in Hive:
create external table vvlive
(
    ts TIMESTAMP,
    product STRING,
    pcid STRING,
    module STRING,
    mid INT,
    userid INT,
    info MAP<STRING,STRING>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
location '/avcore_live/vvlive';
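The three terminators mirror the pipeline exactly: fields terminated by '\001' matches awk's output separator, collection items terminated by '\002' matches the tab replacement, and map keys terminated by '\003' matches the equals-sign replacement. Because the table is external with an explicit location, any file dropped under /avcore_live/vvlive becomes queryable immediately.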

Finally, upload the normalized logs to HDFS:
hadoop fs -put vvlive_160113.txt /avcore_live/vvlive/
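To confirm the file landed (a routine check, not shown in the original post):

hadoop fs -ls /avcore_live/vvlive/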

Check the data in Hive.
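The original post showed the result only as a screenshot. As a sketch of the kind of query the MAP column enables (column and key names taken from the table and logs above):

select ts, module, userid, info['roomid'] as roomid, info['svr'] as svr
from vvlive
limit 10;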


From the “ITPUB blog”; link: http://blog.itpub.net/29254281/viewspace-1976121/. Please credit the source when reposting.
