大資料5.1 - hive離線分析

項羽齊發表於2018-04-08

原文網址 : https://www.cnblogs.com/xiangyuqi/p/8743014.html

1.1 分桶表

1.1.1 分桶表概念

　　分割槽和分桶可以同時，分桶是更細粒度的分配方式。分割槽是追求效率，分桶又解決什麼問題呢？海量資料的分開儲存。

　　對於每一個表（table）或者分割槽， Hive可以進一步組織成桶，也就是說桶是更為細粒度的資料範圍劃分。Hive也是針對某一列進行桶的組織。Hive採用對列值雜湊，然後除以桶的個數求餘的方式決定該條記錄存放在哪個桶當中。

　　把表（或者分割槽）組織成桶（Bucket）有兩個理由：

　　獲得更高的查詢處理效率。桶為表加上了額外的結構，Hive 在處理有些查詢時能利用這個結構。具體而言，連線兩個在（包含連線列的）相同列上劃分了桶的表，可以使用 Map 端連線（Map-side join）高效的實現。比如JOIN操作。對於JOIN操作兩個表有一個相同的列，如果對這兩個表都進行了桶操作。那麼將儲存相同列值的桶進行JOIN操作就可以大大減少JOIN的資料量。

　　使取樣（sampling）更高效。在處理大規模資料集時，在開發和修改查詢的階段，如果能在資料集的一小部分資料上試執行查詢，會帶來很多方便。

1.1.2 建立分桶表

建立帶桶的表：

use jtdb;

create table tb_teacher(id int,name string) clustered by (id) into 4 buckets row format delimited fields terminated by ',';

預設桶功能是關閉的

Hive的底層是MapReduce，預設Reduce就一個，就只能輸出一個位置，強制多個 reduce 進行輸出。

hive> set hive.enforce.bucketing=true;

teacher.txt

1,王海濤

2,花倩

3,張慎正

4,齊雷

5,劉昱江

6,陳子樞

分桶表不允許直接從外部匯入資料，先建立臨時表，通過臨時表把資料插過去

hive> create table tb_teacher_tmp(id int,name string) row format delimited fields terminated by ',';

hive> load data local inpath '/usr/local/src/teacher.txt' into table tb_teacher_tmp;

檢視資料

hive> select * from tb_teacher_tmp;

1 王海濤

2 花倩

3 張慎正

4 齊雷

5 劉昱江

6 陳子樞

Time taken: 0.728 seconds, Fetched: 6 row(s)

hive>

假設我們現在的tb_teacher_tmp表的資料太多了，這時就可以用分桶。

匯入資料

hive> insert overwrite table tb_teacher select * from tb_teacher_tmp;

注意，分桶時會產生多個reduce，時間會比較慢，稍等片刻

1.1.3 存放結構

1.1.4 資料塊取樣

所謂桶其實就是產生了不同的檔案，tablesample為樣本。

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

在colname上分桶的行隨機進入1到y個桶中，返回屬於桶x的行。

例如：

hive> select * from tb_teacher tablesample(bucket 1 out of 4 on id);

hive> select * from tb_teacher tablesample(bucket 1 out of 2 on id);

按百分比返回資料，返回一半資料

hive> select * from tb_teacher tablesample(50 percent);

1.2 利用hive進行資料的離線分析

1.2.1 準備環境

如果重啟了伺服器，需要重新啟動各服務

程式	路徑	啟動命令
ZooKeeper	/usr/local/src/zk /zookeeper-3.4.8/bin	./zkServer.sh start ./zkServer.sh status
Hadoop+HDFS	/usr/local/src/hadoop /hadoop-2.7.1/sbin	Start-all.sh
Flume	/usr/local/src/flume /apache-flume-1.6.0-bin/conf	下面有，暫時不用啟動
啟動專案	必須在Flume後面啟動	啟動jtlogserver專案
Hive	/usr/local/src/hive /apache-hive-1.2.0-bin/bin	./hive

檢查防火牆狀態

[root@hadoop01 ~]# firewall-cmd --state #檢查防火牆狀態

not running #返回值，未執行

關閉防火牆

systemctl stop firewalld.service #關閉防火牆服務

systemctl disable firewalld.service #禁止防火牆開啟啟動

檢查程式

[root@hadoop01 bin]# jps

3034 ResourceManager #yarn

2882 SecondaryNameNode #Hadoop

4591 Application #Flume

2713 DataNode #Hadoop

5772 QuorumPeerMain #ZooKeeper

2586 NameNode #Hadoop

1.2.2 資料清洗

由於資料的來源和格式的不同和欄位的缺失等等問題，在處理資料之前，對資料要進行加工處理，這個過程被稱為資料的清洗。保障資料的格式統一，欄位不會有明顯的缺失。

先要跟業務人員溝通，基於業務處理這些資料。

資料量大就用MR進行清洗，資料量小可以用SQL，Hive進行清洗，也可以用shell指令碼進行清洗。方式很多，哪種都可以，根據實際情況進行處理。

並不是所有資料都需要，可以先去除，只保留url+urlname+uvid+ssid+sscount+sstime+cip。

1.2.3 建立外部分割槽表

1.2.3.1 業務系統處理過程

利用外部表的方式，由瀏覽器訪問頁面從而觸發埋點js，從而呼叫logServlet，從而寫log4j日誌資訊到flume中，flume又直接寫入到HDFS中，然後hive對指定的目錄建立外部表。從而hive就獲取到使用者的網站流量日誌資訊。

1.2.3.2 修改配置檔案

利用flume的攔截器在heads中新增時間戳，這樣就可以獲取並格式化這個時間戳%Y-%m-%d，還要注意hive指定的外部表路徑一致（建立動態的路徑）。

flume-jt.properties

a1.sources.r1.interceptors = t1

a1.sources.r1.interceptors.t1.type = timestamp

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flux/reportTime=%Y-%m-%d

注意：如果是配置的Hadoop HA，則對應flume要做調整

先要將hadoop叢集的core-site.xml和hdfs-site.xml放在flume的conf目錄下，然後訪問時要寫成hdfs://ns，ns為在hdfs-site.xml中定義的

<name>dfs.nameservices</name>

</property>

修改地址為：

a1.sinks.k1.hdfs.path = hdfs://ns/flux/reportTime=%Y-%m-%d

1.2.3.3 啟動flume的agent

cd /usr/local/src/flume/apache-flume-1.6.0-bin/conf

[root@hadoop01 conf]# ../bin/flume-ng agent -c ./ -f ./flume-jt.properties -n a1 -Dflume.root.logger=INFO,console &

引數說明：

-c --conf配置目錄 ./當前路徑

-f --config-file 配置檔案

-n –name 指定agent的名稱

1.2.3.4 建立外部分割槽表

cd /usr/local/src/hive/apache-hive-1.2.0-bin/bin

./hive #啟動hive

hive> create database jtlogdb;

hive> use jtlogdb;

hive> create external table flux (url string,urlname string,title string,chset string,src string,col string,lg string, je string,ec string,fv string,cn string,ref string,uagent string,stat_uv string,stat_ss string,cip string) partitioned by (reportTime string) row format delimited fields terminated by '|' location '/flux';

1.1.1.1 訪問頁面a.jsp

1.2.3.5 查詢資料

hive> select * from flux;

查詢不到資料，因為分割槽資訊還不能識別。

1.2.3.6 增加分割槽資訊

hive> alter table flux add partition (reportTime='2018-03-01') location '/flux/reportTime=2018-03-01';

hive> select * from flux;

注意：Xshell會自動在最下面加資訊，執行後面命令時刪除即可。

配置錯誤可以刪除重新配置：

alter table flux drop partition(reportTime='2018-03-01');

1.2.3.7 刪除HDFS重來

前面測試，會導致資料沒有按照我們設定的標準，不方便檢視，刪除重來。

刪除HDFS目錄

./hdfs dfs -rm -r /fluxx/reportTime=2018-03-01

刪除分割槽：

alter table flux drop partition (reportTime='2018-03-01');

1.2.3.8 訪問頁面

資料有點少

先清除IE瀏覽器快取

注意：訪問地址不能用localhost，前面已經介紹過，IE瀏覽器處理特殊localhost再次請求時不會傳送引數，所以必須用IP地址或者域名。

3次a

2次b

不同的瀏覽器多訪問幾次，訪問2次a.jsp，再訪問1次b.jsp，把谷歌瀏覽器關掉（當瀏覽器關閉，設定ss_id會話就關閉，cookie臨時檔案就被刪除，如果再次開啟就是一個新會話），開啟訪問1次b.jsp，共計9條訪問。

注意：Flume不是一條就立即形成一個hdfs檔案，而是有一定的緩衝或者時間。所以造成FlumeData HDFS檔案中資料的條數不一樣。

1.1.1.1 查詢資料

hive> select * from flux;

1.1.1.1 注意事項

一定要等所有資料寫完，再建立外部分割槽表！！

這裡是一個大大的坑，flume寫hfds過程中，我們不能建立外部表，一旦建立flume後面寫入的資料就不會出現在外部表中。就會造成hdfs的記錄數和外部表中的記錄數不一致。

怎麼解決呢？一般每天凌晨兩點時，無人訪問時，我們再建立外部表；或者發現記錄數不一致時，重新建立外部表。

drop table fluxx;

drop table dataclear;

1.2.4 建立資料庫倉庫

1.1.1.2 建立新表

hive> create table dataclear(reportTime string,url string,urlname string,uvid string,ssid string,sscount string,sstime string,cip string) row format delimited fields terminated by '|';

表名：dataclear

欄位：型別就都設定為string，sscount也設定為string

row formate delimited fileds terminated by ‘|’用下劃線分割資料

hive> show tables;

dataclear

flux

Time taken: 0.023 seconds, Fetched: 2 row(s)

hive>

把一個欄位拆分成3個欄位，利用hive的內建函式split進行拆分

hive> select stat_ss from flux limit 1; #方便廁所只取一條

7447647975_5_1519085238365

Time taken: 0.106 seconds, Fetched: 1 row(s)

hive> select stat_ss,split(stat_ss,"_") from flux limit 1;

7447647975_5_1519085238365 ["7447647975","5","1519085238365"]

Time taken: 0.123 seconds, Fetched: 1 row(s)

hive> select stat_ss,split(stat_ss,"_"),split(stat_ss,"_")[0] from flux limit 1;

7447647975_5_1519085238365 ["7447647975","5","1519085238365"] 7447647975

Time taken: 0.111 seconds, Fetched: 1 row(s)

1.1.1.3 獲取陣列中的資料

hive> insert overwrite table dataclear select reportTime,url,urlname,

stat_uv,split(stat_ss,"_")[0],split(stat_ss,"_")[1],split(stat_ss,"_")[2],cip from flux;

執行很慢

1.2.5 業務處理資料分析

1.2.5.1 PV

select count(*) as pv from dataclear

where reportTime='2018-03-01';

實際就是記錄個數

1.2.5.2 UV

獨立訪客數量，一天之內所有的訪問的數量，一天之內uvid去重後的總數

select count(distinct uvid) as uv from dataclear

where reportTime='2018-03-01';

不同瀏覽器，結果2條

1.2.5.3 VV

獨立會話數，一天之內所有的會話的數量，一天之內ssid去重後的總數

select count(distinct ssid) as vv from dataclear

where reportTime='2018-03-01';

一個IE瀏覽器，一個chrome瀏覽器，結果2。如果中間可以把瀏覽器快取情況，這樣又是一個新的會話。MR計算時間比較長。

1.2.5.4 BR

跳出率，一天之內跳出的會話總數/會話總數，會話總數就是vv

跳出的會話總數，就是訪問頁面的個數，怎麼計算呢？

hive中所有的子表都必須有別名。

select br_taba.a/br_tabb.b as br from

(

select count(*) as a from

(

select ssid from dataclear

where reportTime='2018-03-01'

group by ssid having count(ssid)=1

) as br_tab

) as br_taba,

(

select count(distinct ssid) as b from dataclear

where reportTime='2018-03-01'

) as br_tabb;

select br_taba.a/br_tabb.b as br from (select count(*) as a from (select ssid from dataclear where reportTime='2018-03-01' group by ssid having count(ssid)=1) as br_tab) as br_taba,(select count(distinct ssid) as b from dataclear where reportTime='2018-03-01') as br_tabb;

轉成6個MR過程

保留4位小數

select round(br_taba.a/br_tabb.b,4) as br from (select count(*) as a from (select ssid from dataclear where reportTime='2018-03-01' group by ssid having count(ssid)=1) as br_tab) as br_taba,(select count(distinct ssid) as b from dataclear where reportTime='2018-03-01') as br_tabb;

分解為6個job，在不同伺服器中平行計算，所以非常耗時：

1.2.5.5 NewIP

新增ip總數，一天內所有ip去重後在歷史資料中從未出現過的數量

select count(distinct dataclear.cip) from dataclear

where dataclear.reportTime='2018-03-01'

and cip not in

(select dc2.cip from dataclear as dc2

where dc2.reportTime<'2018-03-01');

select count(distinct dataclear.cip) from dataclear where dataclear.reportTime='2018-03-01' and cip not in (select dc2.cip from dataclear as dc2 where dc2.reportTime<'2018-03-01');

就一個IP來訪問，所以結果是1

注意：子查詢中不能和主查詢中有相同欄位名稱，這樣SQL無法區分。所以對子查詢表重新命名dc2，這樣欄位就能判斷其出處。

1.2.5.6 NewCust

新增客戶數，一天內所有的uvid去重在歷史資料中從未出現的總數

select count(distinct dataclear.uvid) from dataclear

where dataclear.reportTime='2018-03-01'

and uvid not in

(select dc2.uvid from dataclear as dc2 where

dc2.reportTime < '2018-03-01');

select count(distinct dataclear.uvid) from dataclear where dataclear.reportTime='2018-03-01' and uvid not in (select dc2.uvid from dataclear as dc2 where dc2.reportTime < '2018-03-01');

一個IE一個chrome，結果為2

1.2.5.7 AvgTime

平均訪問時長，一天內所有會話的訪問時長的平均值，一個會話的時長=會話中所有訪問的時間的最大值-會話中所有訪問時間的最小值

select round(avg(atTab.usetime),4) as avgtime from

(

select max(sstime) - min(sstime) as usetime from dataclear

where reportTime='2018-03-01'

group by ssid

) as atTab;

select round(avg(atTab.usetime),4) as avgtime from (select max(sstime) -min(sstime) as usetime from dataclear where reportTime='2018-03-01' group by ssid) as atTab;

1.2.5.8 AvgDeep

平均訪問深度，一天內所有會話訪問深度的平均值，一個會話的訪問深度

select round(avg(deep),2) as viewdeep from

(

select count(distinct urlname) as deep from flux

where reportTime='2018-03-01'

group by split(stat_uv,'_')[0]

) as tviewdeep;

select round(avg(deep),2) as viewdeep from (select count(distinct urlname) as deep from flux where reportTime='2018-03-01' group by split(stat_uv,'_')[0]) as tviewdeep;

Flume+Spark+Hive+Spark SQL離線分析系統
2018-09-18
SparkHiveSQL
基於Hive的大資料分析系統
2024-07-28
Hive大資料
大資料4.2 -- hive資料庫
2018-04-03
大資料Hive資料庫
基於EMR離線資料分析（阿里雲）
2021-05-20
阿里
大資料技術 - Hive
2023-02-24
大資料Hive
【大資料開發】Hive——Hive函式大全
2020-11-06
大資料Hive函式
資料分析師如何應對資料庫取數後的離線分析
2020-07-05
資料庫
基於EMR離線資料分析-反饋有禮
2022-01-18
雲端計算大資料學習路線課程大綱資料：hive入門操作
2019-08-14
大資料Hive
大資料4.1 - Flume整合案例+Hive資料倉
2018-04-08
大資料Hive
大資料開發之路：hive篇
2019-05-10
大資料Hive
雲端計算大資料學習路線課程大綱資料：hive內部函式
2019-08-14
大資料Hive函式
【Hive】hive資料遷移
2018-08-21
Hive
23 大資料之hive(第四部 hive基礎)
2020-11-22
大資料Hive
有贊大資料離線叢集遷移實戰
2020-07-30
大資料
OPPO大資料離線任務排程系統OFLOW
2021-12-28
大資料
離線資料同步變遷
2024-12-05
磁碟離線資料恢復
2024-06-27
資料恢復
好程式設計師大資料學習路線hive內部函式
2019-08-12
程式設計師大資料Hive函式
好程式設計師大資料學習路線之hive儲存格式
2019-07-30
程式設計師大資料Hive
Hadoop大資料實戰系列文章之Hive
2020-11-16
Hadoop大資料Hive
大資料技術Hbase和Hive詳解
2020-12-02
大資料Hive
OPPO大資料離線計算平臺架構演進
2021-12-23
大資料架構
大資料分析筆記 (4.1) - 線性迴歸分析(Linear Regression)
2020-11-19
大資料筆記
Apache NIFI離線同步MySQL資料
2024-11-23
ApacheNifiMySql
好程式設計師大資料學習路線分享hive的執行方式
2019-08-07
程式設計師大資料Hive
好程式設計師大資料學習路線之hive表的查詢
2019-08-05
程式設計師大資料Hive
【大資料】離線批處理計算MapReduce | 複習筆記
2020-12-11
大資料筆記
資料恢復經典案例分析-raid兩塊硬碟離線恢復
2018-12-03
資料恢復AI硬碟
大資料基礎學習-7.Hive-1.1.0
2018-04-27
大資料Hive
滴滴海量離線資料的線上化 — FastLoad
2022-12-06
AST
通過DataWorks資料整合歸檔日誌服務資料至MaxCompute進行離線分析
2019-04-09
[hive]hive資料模型中四種表
2018-08-14
Hive模型
大資料安全分析
2021-04-23
大資料
好程式設計師大資料學習路線分享hive分割槽和分桶
2019-08-20
程式設計師大資料Hive
《離線和實時大資料開發實戰》（二）大資料平臺架構 & 技術概覽
2020-09-27
大資料架構
Datax離線資料抽取(MySQL--MySQL)
2019-01-29
MySql
Datax離線資料抽取(Oracle--MySQL)
2019-01-29
OracleMySql

大資料5.1 - hive離線分析

1.1 分桶表

1.1.1 分桶表概念

1.1.2 建立分桶表

1.1.3 存放結構

1.1.4 資料塊取樣

1.2 利用hive進行資料的離線分析

1.2.1 準備環境

1.2.2 資料清洗

1.2.3 建立外部分割槽表

1.2.3.1 業務系統處理過程

1.2.3.2 修改配置檔案

1.2.3.3 啟動flume的agent

1.2.3.4 建立外部分割槽表

1.1.1.1 訪問頁面a.jsp

1.2.3.5 查詢資料

1.2.3.6 增加分割槽資訊

1.2.3.7 刪除HDFS重來

1.2.3.8 訪問頁面

1.1.1.1 查詢資料

1.1.1.1 注意事項

1.2.4 建立資料庫倉庫

1.1.1.2 建立新表

1.1.1.3 獲取陣列中的資料

1.2.5 業務處理資料分析

1.2.5.1 PV

1.2.5.2 UV

1.2.5.3 VV

1.2.5.4 BR

1.2.5.5 NewIP

1.2.5.6 NewCust

1.2.5.7 AvgTime

1.2.5.8 AvgDeep

相關文章