Flume and Hive Integration: the Hive Sink

Posted by fyk_lovelife on 2020-12-19

This article summarizes how to set up Flume's hive sink, along with the pitfalls hit (and fixed) while debugging it.

1. Hive Table Requirements

1.1 Requirements on the Hive table DDL

When Hive is the Flume sink, the target table must:

a: be a transactional table

b: be a bucketed (clustered) table (partitioning is optional; the first example below writes to an unpartitioned table)

c: be stored as orc

In short: clustered into buckets, transactional, and stored in ORC format.

1.2 Column names in the Flume configuration

The Hive column names referenced in the Flume configuration must be all lowercase. Otherwise, the sink fails with:

Failed connecting to EndPoint

Caused by: org.apache.hive.hcatalog.streaming.InvalidColumn: Column 'URL' not found in table for input field 81
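For example, with a table column named url, a fieldnames mapping like the first line below (a hypothetical fragment) triggers this error, because the metastore stores column names in lowercase; the second line works:

# fails: the metastore knows the column as 'url', not 'URL'
a1.sinks.k1.serializer.fieldnames = id,URL,msg
# works
a1.sinks.k1.serializer.fieldnames = id,url,msg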

2. Setup Steps

1. Copy the Hive jars into Flume's lib directory, so that the hive sink does not fail at startup.

# pwd
/usr/local/src/hive-1.2.2/hcatalog/share/hcatalog
# ll
total 416
-rw-r--r-- 1 root root 257215 Apr  3  2017 hive-hcatalog-core-1.2.2.jar
-rw-r--r-- 1 root root  50192 Apr  3  2017 hive-hcatalog-pig-adapter-1.2.2.jar
-rw-r--r-- 1 root root  53562 Apr  3  2017 hive-hcatalog-server-extensions-1.2.2.jar
-rw-r--r-- 1 root root  56999 Apr  3  2017 hive-hcatalog-streaming-1.2.2.jar

# cp /usr/local/src/hive-1.2.2/hcatalog/share/hcatalog/*.jar /usr/local/src/flume-1.6.0/lib
# cp /usr/local/src/hive-1.2.2/lib/*.jar /usr/local/src/flume-1.6.0/lib
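A quick sanity check that the streaming jar the sink depends on actually landed in Flume's lib:

# ls /usr/local/src/flume-1.6.0/lib | grep hcatalog-streaming
hive-hcatalog-streaming-1.2.2.jar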

2. Edit the Hive configuration to enable transaction support:

# cat hive-site.xml
    <property>
        <name>hive.support.concurrency</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
    </property>
    <property>
        <name>hive.txn.manager</name>
        <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>
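These three properties are the minimum needed for the sink to write. On a longer-lived setup, the Hive documentation also recommends enabling the compactor on the metastore, since the delta files created by streaming writes are otherwise never compacted (optional; property names as of Hive 1.x):

    <property>
        <name>hive.compactor.initiator.on</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.compactor.worker.threads</name>
        <value>1</value>
    </property>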

3. Initialize the Hive metastore database:

# cd /usr/local/src/hive-1.2.2/bin
# ./schematool -dbType mysql -initSchema
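schematool can also report the schema version it finds, which is a quick way to verify the initialization:

# ./schematool -dbType mysql -info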

4. Restart MySQL, Hadoop, and the Hive metastore.

# restart mysql
# cd /usr/bin
# ./mysqladmin -u root -p shutdown
Enter password: ******
# ./mysqld_safe &

# restart hadoop
[root@master sbin]# pwd
/usr/local/src/hadoop-2.6.5/sbin
[root@master sbin]# ./stop-all.sh
[root@master sbin]# ./start-all.sh

# start the Hive metastore
[root@master sbin]# cd /usr/local/src/hive-1.2.2/
[root@master hive-1.2.2]# cd bin
[root@master bin]# ./hive --service metastore &
ls: cannot access /usr/local/src/spark-2.0.2-bin-hadoop2.6/lib/spark-assembly-*.jar: No such file or directory
Starting Hive Metastore Server
[root@master ~]# ps -ef|grep HiveMetaStore
root      20755  14146 23 22:39 pts/0    00:00:11 /usr/local/src/jdk1.8.0_172/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/usr/local/src/hadoop-2.6.5/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/src/hadoop-2.6.5 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/local/src/hadoop-2.6.5/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/local/src/hive-1.2.2/lib/hive-service-1.2.2.jar org.apache.hadoop.hive.metastore.HiveMetaStore
root      20922  16217  0 22:40 pts/1    00:00:00 grep --color=auto HiveMetaStore
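Optionally, confirm that the metastore's thrift port is listening; 9083 is the default, and it is what the sink configuration below points at:

# netstat -tlnp | grep 9083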


5. Create the table in Hive

hive> create database flume;
hive> use flume;
hive> create table t_user(user_id int, user_name string, age int) clustered by(user_id) into 2 buckets stored as orc tblproperties("transactional"='true');
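To double-check the requirements from section 1.1, describe formatted shows the bucket count, the ORC serde, and the transactional table property:

hive> describe formatted t_user;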

6. Configure the Flume hive sink.

[root@master ~]# cd /usr/local/src/flume-1.6.0/
[root@master flume-1.6.0]# ls
bin  CHANGELOG  conf  DEVNOTES  docs  lib  LICENSE  logs  NOTICE  README  RELEASE-NOTES  tools
[root@master flume-1.6.0]# cat ./conf/flume-hive-2.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 50000
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = time
a1.sources.r1.interceptors.i1.preserveExisting = false

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://192.168.56.100:9083
a1.sinks.k1.hive.database = flume
a1.sinks.k1.hive.table = t_user

a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = user_id,user_name,age

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
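This example leaves the sink's tuning parameters at their defaults. For reference, the main ones per the Flume 1.6 documentation are listed below with their documented default values (verify them against your own version):

a1.sinks.k1.batchSize = 15000
a1.sinks.k1.hive.txnsPerBatchAsk = 100
a1.sinks.k1.callTimeout = 10000
a1.sinks.k1.heartBeatInterval = 240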

7. Start the Flume agent; messages sent to master (192.168.56.100:50000) are written to the Hive table flume.t_user.

[root@master flume-1.6.0]# pwd
/usr/local/src/flume-1.6.0
[root@master flume-1.6.0]# ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hive-2.conf --name a1 -Dflume.root.logger=INFO,console

Send messages to master:50000 from another machine, slave1 (192.168.56.101).


[root@slave1 ~]# telnet master 50000
Trying 192.168.56.100...
Connected to master.
Escape character is '^]'.
1       abc     10    # note: there is one more tab after the 10
OK
2       def     20    # note: there is one more tab after the 20
OK
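The same events can also be sent non-interactively, for example with nc (a hypothetical alternative, assuming netcat is installed; note the trailing \t before the newline, matching the comments above):

[root@slave1 ~]# printf '3\tghi\t30\t\n' | nc master 50000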

The Flume log shows the sink acquiring a batch of 100 transactions (TxnIds=[202...301], matching the hive.txnsPerBatchAsk default) and committing them one at a time:


2020-12-18 22:51:01,308 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-18 22:51:03,510 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.nextTxnBatch(HiveWriter.java:335)] Acquired Txn Batch TxnIds=[202...301] on endPoint = {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }. Switching to first txn
2020-12-18 22:51:06,602 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.commitTxn(HiveWriter.java:275)] Committing Txn 202 on EndPoint: {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }
2020-12-18 22:51:10,610 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveWriter.commitTxn(HiveWriter.java:275)] Committing Txn 203 on EndPoint: {metaStoreUri='thrift://192.168.56.100:9083', database='flume', table='t_user', partitionVals=[] }

Querying in Hive confirms the data has landed in the table:

hive> use flume;
OK
Time taken: 0.838 seconds
hive> select * from t_user;
OK
1       abc     10
2       def     20

3. Flume Hive Sink Example (Partitioned Table)

Modeled on the hive sink example from the official Flume documentation:

Create the Hive table:

hive> use flume;
hive> create table webserver(id int, continent string, country string, msg string) partitioned by (time string) clustered by (id) into 5 buckets
    row format delimited fields terminated by '\t' stored as orc tblproperties("transactional"='true');

flume-hive-3.conf:

[root@master conf]# cat flume-hive-3.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 50000
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.headerName = time
a1.sources.r1.interceptors.i1.preserveExisting = false

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://master:9083
a1.sinks.k1.hive.database = flume
a1.sinks.k1.hive.table = webserver
a1.sinks.k1.hive.partition = %y-%m-%d
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 5
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames = id,continent,country,msg

Start Flume:

[root@master flume-1.6.0]# ./bin/flume-ng agent --conf conf --conf-file ./conf/flume-hive-3.conf --name a1 -Dflume.root.logger=INFO,console

Send test data to the source:

[root@slave1 ~]# telnet master 50000
Trying 192.168.56.100...
Connected to master.
Escape character is '^]'.
1       asia    China   Hello world!
OK
2       America USA     Hello AI!
OK
3       America USD     hello python!
OK

Flume log: (screenshot omitted)

Contents of the Hive table flume.webserver (the time partition value 20-12-19 comes from the %y-%m-%d pattern in the sink configuration):

hive> select * from  webserver;
OK
1       asia    China   Hello world!    20-12-19
2       America USA     Hello AI!       20-12-19
3       America USD     hello python!   20-12-19
Time taken: 0.09 seconds, Fetched: 3 row(s)
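The partition created by the sink can also be listed directly; it matches the time=20-12-19 directory in the HDFS listing further below:

hive> show partitions webserver;
time=20-12-19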

Hive log:
2020-12-19 22:11:20,584 INFO  [main]: orc.ReaderImpl (ReaderImpl.java:rowsOptions(531)) - Reading ORC rows from hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/delta_0000502_0000601/bucket_00002 with {include: null, offset: 0, length: 9223372036854775807}
2020-12-19 22:11:20,603 INFO  [main]: orc.RecordReaderFactory (RecordReaderFactory.java:getMatchingSchema(97)) - Schema on read column count does not match file schema's column count. Falling back to using file schema.
2020-12-19 22:11:20,626 INFO  [main]: exec.TableScanOperator (Operator.java:close(613)) - 0 finished. closing...
2020-12-19 22:11:20,626 INFO  [main]: exec.SelectOperator (Operator.java:close(613)) - 1 finished. closing...
2020-12-19 22:11:20,626 INFO  [main]: exec.ListSinkOperator (Operator.java:close(613)) - 3 finished. closing...
2020-12-19 22:11:20,626 INFO  [main]: exec.ListSinkOperator (Operator.java:close(635)) - 3 Close done
2020-12-19 22:11:20,626 INFO  [main]: exec.SelectOperator (Operator.java:close(635)) - 1 Close done
2020-12-19 22:11:20,626 INFO  [main]: exec.TableScanOperator (Operator.java:close(635)) - 0 Close done
2020-12-19 22:11:20,637 INFO  [main]: CliDriver (SessionState.java:printInfo(951)) - Time taken: 0.09 seconds, Fetched: 3 row(s)
2020-12-19 22:11:20,637 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogBegin(121)) - <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
2020-12-19 22:11:20,637 INFO  [main]: log.PerfLogger (PerfLogger.java:PerfLogEnd(148)) - </PERFLOG method=releaseLocks start=1608387080637 end=1608387080637 duration=0 from=org.apache.hadoop.hive.ql.Driver>
2020-12-19 22:20:20,285 INFO  [pool-3-thread-6]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(746)) - 6: Shutting down the object store...
2020-12-19 22:20:20,286 INFO  [pool-3-thread-6]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=root ip=192.168.56.100       cmd=Shutting down the object store...
2020-12-19 22:20:20,286 INFO  [pool-3-thread-6]: metastore.HiveMetaStore (HiveMetaStore.java:logInfo(746)) - 6: Metastore shutdown complete.
2020-12-19 22:20:20,286 INFO  [pool-3-thread-6]: HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(371)) - ugi=root ip=192.168.56.100       cmd=Metastore shutdown complete.


The HDFS layout of the partition: each committed transaction batch lands in a delta directory named for its transaction id range (here 0000502 to 0000601, one batch of 100 transactions):

[root@master flume-1.6.0]# hdfs dfs -ls hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19
Found 2 items
-rw-r--r--   3 root supergroup          4 2020-12-19 22:06 hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/_orc_acid_version
drwxr-xr-x   - root supergroup          0 2020-12-19 22:06 hdfs://192.168.56.100:9000/user/hive/warehouse/flume.db/webserver/time=20-12-19/delta_0000502_0000601
[root@master flume-1.6.0]#

4. Problems Encountered and Solutions

org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat

2020-12-19 18:58:37,607 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:342)] k1: Creating Writer to Hive end point : {metaStoreUri='thrift://master:9083', database='flume', table='weblogs', partitionVals=[asia, China, 20-12-19-18-55] }
2020-12-19 18:58:38,680 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:38,733 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:39,051 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:39,075 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:40,039 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:185)] Parsing command: use flume
2020-12-19 18:58:40,743 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:209)] Parse Completed
OK
2020-12-19 18:58:40,955 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:185)] Parsing command: alter table weblogs add if not exists partition  ( continent='asia',country='China',time='20-12-19-18-55' )
2020-12-19 18:58:40,975 (hive-k1-call-runner-0) [INFO - org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:209)] Parse Completed
OK
2020-12-19 18:58:41,416 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:377)] Trying to connect to metastore with URI thrift://master:9083
2020-12-19 18:58:41,428 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:473)] Connected to metastore.
2020-12-19 18:58:41,515 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
        at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:268)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.AcidOutputFormat
        at org.apache.hive.hcatalog.streaming.AbstractRecordWriter.<init>(AbstractRecordWriter.java:75)
        at org.apache.hive.hcatalog.streaming.DelimitedInputWriter.<init>(DelimitedInputWriter.java:115)
        at org.apache.flume.sink.hive.HiveDelimitedTextSerializer.createRecordWriter(HiveDelimitedTextSerializer.java:67)
        at org.apache.flume.sink.hive.HiveWriter.<init>(HiveWriter.java:89)
        at org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:343)
        at org.apache.flume.sink.hive.HiveSink.drainOneBatch(HiveSink.java:296)
        at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:254)
        ... 3 more

Solution: recreate the Hive table with row format delimited fields terminated by '\t', so that the field delimiter in the table definition matches the one in the Flume configuration, and add tblproperties("transactional"='true'). The HiveIgnoreKeyTextOutputFormat in the stack trace shows that the table the sink found was using the plain-text output format rather than an ACID-capable one, which is why the cast to AcidOutputFormat fails.

DDL from the official documentation (note: no transactional property):
create table weblogs ( id int , msg string )
    partitioned by (continent string, country string, time string)
    clustered by (id) into 5 buckets
    stored as orc;

Corrected DDL:
create table weblogs(id int, msg string) partitioned by (continent string, country string, time string) clustered by (id) into 5 buckets 
row format delimited fields terminated by '\t' stored as orc tblproperties("transactional"='true');
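With three partition columns, hive.partition must supply three comma-separated values, one per column. The partitionVals=[asia, China, 20-12-19-18-55] in the log above is consistent with a setting like the one below (a reconstruction, not the author's actual config; the official Flume 1.6 example uses a %{country} header escape for the middle value):

a1.sinks.k1.hive.partition = asia,China,%y-%m-%d-%H-%M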

 
