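Big data work starts with getting the data in. This post covers the log collection step, using Flume, Apache's open-source log collection tool, and walks through the official documentation as I understand it.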

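1. Theory

Flume is a log collection component built from three parts: a source, a channel, and a sink (the destination). Together they carry logs or other messages from where they are produced, through the channel, to another location, as shown in the figure below.

Figure: official data flow model, single agent

The sink of one agent can also feed the source of the next agent, as shown below.

Figure: official data flow model, multiple agents

A single pipeline can also take input from several sources, as shown below.

Figure: official data flow model, multiple sources

When agents are chained like this, the hop between them is usually an Avro sink on the upstream agent paired with an Avro source on the downstream agent.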
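The source, channel, and sink roles described above can be pictured with a small Python sketch. This is a toy model for intuition only, not Flume's API: the names `Event`, `MemoryChannel`, `run_source`, and `run_sink` are hypothetical, and a real agent runs these parts concurrently and with transactions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume event: a body plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Hypothetical in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity: int = 1000):
        self._events = deque()
        self._capacity = capacity

    def put(self, event: Event) -> None:
        # Refuse new events once the buffer holds `capacity` of them.
        if len(self._events) >= self._capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel: MemoryChannel) -> None:
    """Source role: turn each incoming line into an event and buffer it."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))


def run_sink(channel: MemoryChannel) -> list:
    """Sink role: drain the channel and deliver each event body."""
    delivered = []
    while (event := channel.take()) is not None:
        delivered.append(event.body.decode("utf-8"))
    return delivered
```

Calling `run_source(["line one", "line two"], channel)` followed by `run_sink(channel)` hands back the same lines in order, which is the whole job of a single-agent flow.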

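2. Setting up the environment

Setup is simple. Flume runs on Java, so a JDK is required. This post uses Flume 1.8.0, the latest release at the time of writing, which according to the official documentation requires Java 1.8 or later.
Installing the JDK is not covered here. Once it is installed, download the latest Flume package.

For convenience, set up the environment variables after unpacking the package. The rest of this post assumes that FLUME_HOME points at the unpacked directory and that flume-ng is on the PATH.

Figure: environment variables

Then open a new terminal window and run

flume-ng help

Figure: test command output

This prints basic usage information, including the commands and options. The commands are:

help          display this help text
agent         run a Flume agent
avro-client   run an avro Flume client
version       show Flume version info

There are many options; the most commonly used are:

--conf,-c <conf>        use configs in <conf> directory
--name,-n <name>        the name of this agent (required)
--conf-file,-f <file>   specify a config file (required if -z missing)

These are fairly self-explanatory. Next, let's run one of the official examples.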

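3. Collecting data from a network port to the console

The official documentation includes a sample configuration file. Most of what Flume does needs nothing more than configuration; special cases need extra third-party packages or custom Source, Channel, and Sink components. A simple configuration is enough to get started here.

For convenience, keep all configuration files under $FLUME_HOME/conf/. Create a file named example.conf in that directory with the content below; the comments explain each part.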

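# Using Flume mostly comes down to writing the configuration file:
#   A) configure the Source
#   B) configure the Channel
#   C) configure the Sink
#   D) wire the three components together

# Name the components on this agent
# a1 is the agent name
# r1 is the source name; an agent may have several sources, hence the plural key
# k1 is the sink name
# c1 is the channel name
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source named r1 on agent a1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Configure the sink named k1 on agent a1
a1.sinks.k1.type = logger

# Use a channel that buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Source side: note the plural "channels"; one source can write to several channels
# Sink side: note the singular "channel"; each sink reads from exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1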
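Step D, wiring the components together, is where typos tend to hide, so a quick sanity check can help before starting the agent. The sketch below is a hypothetical helper, not part of Flume: `parse_properties` and `check_wiring` are made-up names, and the parser handles only plain `key = value` lines and `#` comments.

```python
SAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""


def parse_properties(text: str) -> dict:
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props: dict, agent: str) -> list:
    """Return a list of wiring problems; an empty list means none were found."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    # Each source lists one or more channels under the plural key "channels".
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channels")
        for channel in used:
            if channel not in declared:
                problems.append(f"source {source} uses undeclared channel {channel}")
    # Each sink names exactly one channel under the singular key "channel".
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel")
        if channel is None:
            problems.append(f"sink {sink} has no channel")
        elif channel not in declared:
            problems.append(f"sink {sink} uses undeclared channel {channel}")
    return problems
```

Running `check_wiring(parse_properties(SAMPLE), "a1")` on a correctly wired file yields an empty list; pointing the sink at a channel that was never declared produces a message naming the sink and the channel.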

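With the configuration file in place, the next step is to run it. Following the official instructions:

# Run the agent
#   agent         run a Flume agent
#   --conf        the configuration directory
#   --conf-file   the configuration file to load
#   --name        the name of the agent to run
$ bin/flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/example.conf \
  --name a1 \
  -Dflume.root.logger=DEBUG,console \
  -Dorg.apache.flume.log.printconfig=true \
  -Dorg.apache.flume.log.rawdata=true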
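When the same launch has to be repeated for several configuration files, it can be convenient to assemble the arguments in a script. The following is a minimal sketch under the assumption that Flume is unpacked at a path you pass in; `build_agent_command` is a hypothetical helper name, and it uses only the flags that appear in this post.

```python
import os


def build_agent_command(flume_home: str, conf_file: str, agent_name: str) -> list:
    """Assemble the flume-ng arguments used in this post as an argument list."""
    conf_dir = os.path.join(flume_home, "conf")
    return [
        os.path.join(flume_home, "bin", "flume-ng"),
        "agent",
        "--conf", conf_dir,
        "--conf-file", os.path.join(conf_dir, conf_file),
        "--name", agent_name,  # must match the agent prefix used in the file
        "-Dflume.root.logger=DEBUG,console",
        "-Dorg.apache.flume.log.printconfig=true",
        "-Dorg.apache.flume.log.rawdata=true",
    ]
```

The resulting list can be handed to `subprocess.run(command, check=True)`, for example with `build_agent_command(os.environ["FLUME_HOME"], "example.conf", "a1")`.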

Then telnet to port 44444 on the local machine, send a message, and watch what happens on the console.
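If telnet is not installed, a few lines of Python can stand in for it. This is a sketch rather than an official client: `send_line` is a hypothetical helper, the host and port are the ones from example.conf, and the reply handling assumes the netcat source may answer each line with a short acknowledgement.

```python
import socket


def send_line(host: str, port: int, text: str, timeout: float = 2.0) -> str:
    """Send one newline-terminated line over TCP and return any reply text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        try:
            reply = conn.recv(1024)
        except socket.timeout:
            reply = b""  # no acknowledgement arrived within the timeout
    return reply.decode("utf-8", errors="replace").strip()
```

With the agent running, `send_line("localhost", 44444, "hello flume")` should make a new event appear on the agent's console. Unlike telnet, this sends a bare line feed, so the event body will not end with a carriage return.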

Data received on the console:

Event: { headers:{} body: 68 61 68 61 61 68 0D                            hahaah. }

An Event is the smallest unit of data in Flume. It consists of optional headers and a body.
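The logger output shows the body as hex bytes followed by a printable preview, which can be decoded back by hand or with a short script. The sketch below assumes the line layout shown above; `decode_logger_body` is a hypothetical helper name.

```python
import re

# Matches the run of two-digit hex bytes that follows "body:" in the logger line.
_BODY_RE = re.compile(r"body:\s*((?:[0-9A-Fa-f]{2}\s)+)")


def decode_logger_body(line: str) -> bytes:
    """Extract the hex dump from a logger sink line and return the raw bytes."""
    match = _BODY_RE.search(line)
    if match is None:
        raise ValueError("no hex body found in line")
    return bytes.fromhex(match.group(1))


line = "Event: { headers:{} body: 68 61 68 61 61 68 0D                            hahaah. }"
print(decode_logger_body(line))  # prints b'hahaah\r'
```

The trailing `0D` is the carriage return that telnet sends at the end of the line; it is not printable, which is why the preview column shows it as a dot.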

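4. Tailing a file and streaming it to the console in real time

Projects typically write their logs to files through logback, and those files then have to be moved somewhere else for analysis. Here Flume does the moving.

First, the components. The Channel and Sink are the same as in the previous example.
For the Source we need an Exec Source, which runs a command such as tail ....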
# Name the components on this agent
# a1 is the agent name
# r1 is the source name; an agent may have several sources, hence the plural key
# k1 is the sink name
# c1 is the channel name
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source named r1 on agent a1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/log/log.log

# Configure the sink named k1 on agent a1
a1.sinks.k1.type = logger

# Use a channel that buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# Source side: note the plural "channels"; one source can write to several channels
# Sink side: note the singular "channel"; each sink reads from exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

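With the configuration done, we can test it.
First start the agent with the configuration file example01.conf:

#   agent         run a Flume agent
#   --conf        the configuration directory
#   --conf-file   the configuration file to load
#   --name        the name of the agent to run
$ bin/flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/example01.conf \
  --name a1 \
  -Dflume.root.logger=DEBUG,console \
  -Dorg.apache.flume.log.printconfig=true \
  -Dorg.apache.flume.log.rawdata=true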

Once the agent is running, append log lines to /tmp/log/log.log and they show up on the console.
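To generate some test traffic for the tailed file, a small script can append lines to it. This is a sketch with made-up line contents; `append_log_lines` is a hypothetical helper, and the path you pass should be the one named in the Exec Source command.

```python
import time
from pathlib import Path


def append_log_lines(path: str, count: int, delay: float = 0.0) -> None:
    """Append `count` timestamped lines to `path`, creating parent directories."""
    log_file = Path(path)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as handle:
        for index in range(count):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            handle.write(f"{stamp} demo line {index}\n")
            handle.flush()  # push each line to disk so tail -F sees it promptly
            time.sleep(delay)
```

For example, `append_log_lines("/tmp/log/log.log", 5, delay=1.0)` writes one line per second, and each should appear on the agent's console shortly after it is written.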

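5. Moving content from one server to another

This needs two machines to simulate.

Figure: log transfer model between machines

Figure: technology choices

Based on this diagram, we can write the configuration files.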

############  exec-mem-avro.conf  ############
# Base
exec-mem-avro.sources = exec-source
exec-mem-avro.sinks = avro-sink
exec-mem-avro.channels = mem-channel

# Source
exec-mem-avro.sources.exec-source.type = exec
exec-mem-avro.sources.exec-source.command = tail -F /tmp/log/log.log

# Sink: changed to an Avro sink
# localhost only works when both agents run on the same host; with two real
# machines, set hostname to the second machine's hostname or IP address
exec-mem-avro.sinks.avro-sink.type = avro
exec-mem-avro.sinks.avro-sink.hostname = localhost
exec-mem-avro.sinks.avro-sink.port = 44444

# Channel
exec-mem-avro.channels.mem-channel.type = memory

# Link
exec-mem-avro.sources.exec-source.channels = mem-channel
exec-mem-avro.sinks.avro-sink.channel = mem-channel

############  avro-mem-log.conf  ############
# Base
avro-mem-log.sources = avro-source
avro-mem-log.sinks = log-sink
avro-mem-log.channels = mem-channel

# Source
# binding to localhost accepts local connections only; to receive from another
# machine, bind to 0.0.0.0 or to this machine's own address
avro-mem-log.sources.avro-source.type = avro
avro-mem-log.sources.avro-source.bind = localhost
avro-mem-log.sources.avro-source.port = 44444

# Sink
avro-mem-log.sinks.log-sink.type = logger

# Channel
avro-mem-log.channels.mem-channel.type = memory

# Link
avro-mem-log.sources.avro-source.channels = mem-channel
avro-mem-log.sinks.log-sink.channel = mem-channel

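Start avro-mem-log.conf on the second machine first, so that the Avro source is already listening when the Avro sink tries to connect.
Then start exec-mem-avro.conf on the first machine.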
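Because the start order matters, it can help to confirm from the first machine that the second machine's port is reachable before launching the sending agent. The sketch below is a hypothetical helper (`wait_for_port` is not part of Flume) and only checks that a TCP connection can be opened, not that an Avro source is behind it.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # listener not up yet; retry until the deadline
    return False
```

For the configuration above, `wait_for_port("localhost", 44444)` returning `True` suggests the receiving agent is up; with two real machines, pass the second machine's address instead of localhost.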

# Run on the second machine
$ bin/flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/avro-mem-log.conf \
  --name avro-mem-log \
  -Dflume.root.logger=DEBUG,console \
  -Dorg.apache.flume.log.printconfig=true \
  -Dorg.apache.flume.log.rawdata=true

# Run on the first machine
$ bin/flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/exec-mem-avro.conf \
  --name exec-mem-avro \
  -Dflume.root.logger=DEBUG,console \
  -Dorg.apache.flume.log.printconfig=true \
  -Dorg.apache.flume.log.rawdata=true

Log lines written to /tmp/log/log.log on the first machine are now collected on the second machine. That wraps up this basic introduction to Flume.

Author: breezedancer