Flume and Hive Integration: the Hive Sink
This article summarizes how to use the Hive sink with Flume, along with some of the pitfalls hit (and fixed) while debugging.
1. Hive Table Requirements
1.1 Hive table creation requirements
When using Hive as a Flume sink, the target Hive table must meet the following requirements:
a: the table must be a transactional table
b: the table must be bucketed (clustered); it may additionally be partitioned, as in the example in section 3, but the unpartitioned t_user table below works as well
c: the table must be stored as orc
That is: clustered (bucketed), transactional, and ORC storage format.
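For reference, a minimal DDL sketch that satisfies all three requirements (the table and column names here are only placeholders):
create table example_tbl (id int, msg string)
clustered by (id) into 2 buckets
stored as orc
tblproperties ("transactional" = "true");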
1.2 Column name requirements in the Flume configuration
The Hive column names configured in Flume must all be lowercase; otherwise the following error is reported:
Failed connecting to EndPoint
Caused by: org.apache.hive.hcatalog.streaming.InvalidColumn: Column 'URL' not found in table for input field 81
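In practice this means that if the Hive column is declared as url, the Flume side must also reference it in lowercase. An illustrative (hypothetical) fragment:
# wrong: 'URL' does not match the lowercase column name and triggers InvalidColumn
a1.sinks.k1.serializer.fieldnames = id,URL,msg
# correct
a1.sinks.k1.serializer.fieldnames = id,url,msg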
2. Procedure
1. Copy the Hive jars into Flume's lib directory to avoid failures when starting the hive sink.
# pwd
/usr/local/src/hive-1.2.2/hcatalog/share/hcatalog
# ll
total 416
-rw-r--r-- 1 root root 257215 Apr 3 2017 hive-hcatalog-core-1.2.2.jar
-rw-r--r-- 1 root root 50192 Apr 3 2017 hive-hcatalog-pig-adapter-1.2.2.jar
-rw-r--r-- 1 root root 53562 Apr 3 2017 hive-hcatalog-server-extensions-1.2.2.jar
-rw-r--r-- 1 root root 56999 Apr 3 2017 hive-hcatalog-streaming-1.2.2.jar
#cp /usr/local/src/hive-1.2.2/hcatalog/share/hcatalog/*.jar /usr/local/src/flume-1.6.0/lib
#cp /usr/local/src/hive-1.2.2/lib/*.jar /usr/local/src/flume-1.6.0/lib
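A quick sanity check that the streaming jars are now on Flume's classpath:
# ls /usr/local/src/flume-1.6.0/lib | grep hcatalog
hive-hcatalog-core-1.2.2.jar
hive-hcatalog-pig-adapter-1.2.2.jar
hive-hcatalog-server-extensions-1.2.2.jar
hive-hcatalog-streaming-1.2.2.jar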
2. Modify the Hive configuration to enable transaction support.
# cat hive-site.xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
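Once the metastore is restarted (step 4 below), the settings can be verified from the Hive CLI, for example:
hive> set hive.txn.manager;
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive> set hive.support.concurrency;
hive.support.concurrency=true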
3. Initialize the Hive metastore database.
# cd /usr/local/src/hive-1.2.2/bin
# ./schematool -dbType mysql -initSchema
4. Restart MySQL, Hadoop, and the Hive Metastore.
# restart MySQL
# cd /usr/bin
# ./mysqladmin -u root -p shutdown
Enter password: ******
# cd /usr/bin
# ./mysqld_safe &
# restart Hadoop
[root@master sbin]# pwd
/usr/local/src/hadoop-2.6.5/sbin
[root@master sbin]# ./stop-all.sh
[root@master sbin]# ./start-all.sh
# start the Hive Metastore
[root@master sbin]# cd /usr/local/src/hive-1.2.2/
[root@master hive-1.2.2]# cd bin
[root@master bin]# ./hive --service metastore &
ls: cannot access /usr/local/src/spark-2.0.2-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory
Starting Hive Metastore Server
[root@master ~]# ps -ef|grep HiveMetaStore
root 20755 14146 23 22:39 pts/0 00:00:11 /usr/local/src/jdk1.8.0_172/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/usr/local/src/hadoop-2.6.5/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/src/hadoop-2.6.5 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/local/src/hadoop-2.6.5/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/local/src/hive-1.2.2/lib/hive-service-1.2.2.jar org.apache.hadoop.hive.metastore.HiveMetaStore
root 20922 16217 0 22:40 pts/1 00:00:00 grep --color=auto HiveMetaStore
5. Create the table in Hive
hive>create database flume;
hive>create table t_user(user_id int,user_name string, age int) clustered by(user_id) into 2 buckets stored as orc tblproperties("transactional"='true');
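The bucketing and transactional settings can be double-checked with desc formatted; the relevant lines of its output should look roughly like this:
hive> desc formatted t_user;
...
Num Buckets:            2
Bucket Columns:         [user_id]
...
Table Parameters:
        transactional           true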
6. Configure the Flume hive sink.
[root@master ~]# cd /usr/local/src/flume-1.6.0/
[root@master flume-1.6.0]# ls
bin CHANGELOG conf DEVNOTES docs lib LICENSE logs NOTICE README RELEASE-NOTES tools
[root@master flume-1.6.0]# cat ./conf/flume-hive-2.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 50000
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = time
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://192.168.56.100:9083
a1.sinks.k1.hive.database = flume
a1.sinks.k1.hive.table = t_user
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = user_id,user_name,age
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
7. Start the Flume agent; data sent to master (192.168.56.100:50000) is written into the Hive table flume.t_user.
[root@master flume-1.6.0]# pwd
/usr/local/src/flume-1.6.0
[root@master flume-1.6.0]# ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hive-2.conf --name a1 -Dflume.root.logger=INFO,console
Send messages to master:50000 from another machine, slave1 (192.168.56.101).
[root@slave1 ~]# telnet master 50000
Trying 192.168.56.100...
Connected to master.
Escape character is '^]'.
1 abc 10 # note: there is another tab after the 10
OK
2 def 20 # note: there is another tab after the 20
OK
The Flume log shows the following:
2020-12-18 22:51:01,308 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-18 22:51:03,510 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.nextTxnBatch(HiveWriter.java:335)] Acquired Txn Batch TxnIds=[202...301] on endPoint = {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }. Switching to first txn
2020-12-18 22:51:06,602 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.commitTxn(HiveWriter.java:275)] Committing Txn 202 on EndPoint: {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }
2020-12-18 22:51:10,610 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.commitTxn(HiveWriter.java:275)] Committing Txn 203 on EndPoint: {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }
Querying in Hive shows that the data has made it into the table:
hive> use flume;
OK
Time taken: 0.838 seconds
hive> select * from t_user;
OK
1 abc 10
2 def 20
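On HDFS, the streamed rows end up as ACID delta directories under the table's warehouse path; assuming the default warehouse location used elsewhere in this setup, they can be inspected with:
# hdfs dfs -ls /user/hive/warehouse/flume.db/t_user
Each committed transaction batch appears as a delta_XXXXXXX_XXXXXXX subdirectory holding the bucket files, similar to the listing shown for the webserver table in section 3.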
3. Flume Hive Sink Example
Following the hive sink example from the official Flume documentation, set it up as follows:
Create the Hive table:
hive> create table webserver(id int, continent string, country string, msg string) partitioned by (time string) clustered by (id) into 5 buckets
row format delimited fields terminated by '\t' stored as orc tblproperties("transactional"='true');
The flume-hive-3.conf configuration:
[root@master conf]# cat flume-hive-3.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 50000
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = time
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://master:9083
a1.sinks.k1.hive.database = flume
a1.sinks.k1.hive.table = webserver
a1.sinks.k1.hive.partition = %y-%m-%d
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 5
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = id,continent,country,msg
Start Flume:
[root@master flume-1.6.0]# ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hive-3.conf --name a1 -Dflume.root.logger=INFO,console
Send some test data to the source:
[root@slave1 ~]# telnet master 50000
Trying 192.168.56.100...
Connected to master.
Escape character is '^]'.
1 asia China Hello world!
OK
2 America USA Hello AI!
OK
3 America USD hello python!
OK
Contents of the Hive table flume.webserver:
hive> select * from webserver;
OK
1 asia China Hello world! 20-12-19
2 America USA Hello AI! 20-12-19
3 America USD hello python! 20-12-19
Time taken: 0.09 seconds, Fetched: 3 row(s)
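The partition created by the sink can also be listed directly; with the data above it should contain a single entry:
hive> show partitions webserver;
OK
time=20-12-19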
Hive log:
2020-12-19 22:11:20,584 INFO [main]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(531)) - Reading ORC rows from hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/delta_0000502_0000601/bucket_00002 with {include: null, offset: 0, length: 9223372036854775807}
2020-12-19 22:11:20,603 INFO [main]: orc.RecordReaderFactory (RecordReaderFactory.java:getMatchingSchema(97)) - Schema on read column count does not match file schema's column count. Falling back to using file schema.
2020-12-19 22:11:20,626 INFO [main]: exec.TableScanOperator (Operator.java:close(613)) - 0 finished. closing...
2020-12-19 22:11:20,626 INFO [main]: exec.SelectOperator (Operator.java:close(613)) - 1 finished. closing...
2020-12-19 22:11:20,626 INFO [main]: exec.ListSinkOperator (Operator.java:close(613)) - 3 finished. closing...
2020-12-19 22:11:20,626 INFO [main]: exec.ListSinkOperator (Operator.java:close(635)) - 3 Close done
2020-12-19 22:11:20,626 INFO [main]: exec.SelectOperator (Operator.java:close(635)) - 1 Close done
2020-12-19 22:11:20,626 INFO [main]: exec.TableScanOperator (Operator.java:close(635)) - 0 Close done
2020-12-19 22:11:20,637 INFO [main]: CliDriver (SessionState.java:printInfo(951)) - Time taken: 0.09 seconds, Fetched: 3 row(s)
2020-12-19 22:11:20,637 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2020-12-19 22:11:20,637 INFO [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=releaseLocks start=1608387080637 end=1608387080637 duration=0 from=org.apache.hadoop.hive.ql.Driver>
2020-12-19 22:20:20,285 INFO [pool-3-thread-6]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(746)) - 6: Shutting down the object store...
2020-12-19 22:20:20,286 INFO [pool-3-thread-6]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=root ip=192.168.56.100 cmd=Shutting down the object store...
2020-12-19 22:20:20,286 INFO [pool-3-thread-6]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(746)) - 6: Metastore shutdown complete.
2020-12-19 22:20:20,286 INFO [pool-3-thread-6]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=root ip=192.168.56.100 cmd=Metastore shutdown complete.
HDFS directory of the new partition:
[root@master flume-1.6.0]# hdfs dfs -ls hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19
Found 2 items
-rw-r--r-- 3 root supergroup 4 2020-12-19 22:06 hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/_orc_acid_version
drwxr-xr-x - root supergroup 0 2020-12-19 22:06 hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/delta_0000502_0000601
[root@master flume-1.6.0]#
4. Problems Encountered and Solutions
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
2020-12-19 18:58:37,607 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:342)] k1: Creating Writer to Hive end point : {metaStoreUri='thrift://master:9083', database='flume', table='weblogs', partitionVals=[asia, China, 20-12-19-18-55] }
2020-12-19 18:58:38,680 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:38,733 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:39,051 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:39,075 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:40,039 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:185)] Parsing command: use flume
2020-12-19 18:58:40,743 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:209)] Parse Completed
OK
2020-12-19 18:58:40,955 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:185)] Parsing command: alter table weblogs add if not exists partition ( continent='asia',country='China',time='20-12-19-18-55' )
2020-12-19 18:58:40,975 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:209)] Parse Completed
OK
2020-12-19 18:58:41,416 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:41,428 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:41,515 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:268)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.<init>(AbstractRecordWriter.java:75)
at org.apache.hive.hcatalog.streaming.DelimitedInputWriter.<init>(DelimitedInputWriter.java:115)
at org.apache.flume.sink.hive.HiveDelimitedTextSerializer.createRecordWriter(HiveDelimitedTextSerializer.java:67)
at org.apache.flume.sink.hive.HiveWriter.<init>(HiveWriter.java:89)
at org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:343)
at org.apache.flume.sink.hive.HiveSink.drainOneBatch(HiveSink.java:296)
at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:254)
... 3 more
Solution: when creating the Hive table, specify row format delimited fields terminated by '\t', and make sure the field delimiter defined in the Hive table matches the one in the Flume configuration file.
DDL from the official Flume documentation:
create table weblogs ( id int , msg string )
partitioned by (continent string, country string, time string)
clustered by (id) into 5 buckets
stored as orc;
Modified DDL:
create table weblogs(id int, msg string) partitioned by (continent string, country string, time string) clustered by (id) into 5 buckets
row format delimited fields terminated by '\t' stored as orc tblproperties("transactional"='true');
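The sink configuration used for this weblogs table is not shown above; judging from partitionVals=[asia, China, 20-12-19-18-55] in the error log, it presumably looked roughly like the following sketch (the static partition values and the field list are assumptions):
a1.sinks.k1.hive.database = flume
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,China,%y-%m-%d-%H-%M
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = id,msg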