flume
Log collection system
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume lets you plug custom data senders into a logging pipeline to collect data, and it can perform simple processing on the data and write it to a variety of (customizable) data receivers.
Basic concepts
Event
Each log record that Flume reads is wrapped into an object; this object is called a Flume Event.
Essentially it is a JSON string, e.g. {head:info, body:info}.
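For illustration, an event in the JSON form accepted by Flume's HTTP source looks roughly like this (the header values here are made-up examples):
[{"headers":{"host":"web01","timestamp":"1510273266537"},"body":"hello flume"}]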
Agent
An agent is a Java process (JVM) that hosts the components through which events flow from an external source to the next destination.
It consists of three main parts: Source, Channel, and Sink.
Source
The Source component collects data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
Channel
After the Source collects data, it is staged temporarily in a channel; the channel component is the agent's temporary data store. It provides simple buffering of the collected data and can be backed by memory, jdbc, file, and so on.
Sink
The Sink component delivers data to its destination, which can be hdfs, logger, avro, thrift, ipc, file, null, HBase, solr, or a custom sink.
How the components work together
For safety, data is wrapped into an Event before being transferred. The Source wraps the data collected from the server into Events and stores them in the Channel buffer, whose structure resembles a queue (first in, first out). The Sink then fetches data from the Channel, deletes the corresponding data from the Channel once it has been fetched, and writes it to a destination such as HDFS or passes it on to the next Source. For reliability, data is removed from the Channel only after the transfer has succeeded, so that it can be re-sent if the receiver crashes.
2. Reliability
When a node fails, logs can be delivered to other nodes without being lost.
Flume provides three levels of reliability guarantee, from strongest to weakest:
end-to-end (on receiving data, the agent first writes the event to disk and deletes it only after the data has been sent successfully; if sending fails, the data can be re-sent)
Store on failure (the strategy also used by Scribe, Facebook's open-source log collection system: when the receiver crashes, data is written locally and sending resumes after recovery)
Best effort (data sent to the receiver is not acknowledged)
3. JDK installation is required
4. Installing Flume
5. Directory structure
Source components
Focus on the Avro Source and the Spooling Directory Source.
#Single-node Flume configuration
#Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = netcat #built-in type: receives data from the network
a1.sources.r1.bind = 0.0.0.0 #listen on all network interfaces
a1.sources.r1.port = 22222 #service port
#Describe the sink
a1.sinks.k1.type = logger #built-in type
#Describe the memory channel
a1.channels.c1.type = memory #store data in memory
a1.channels.c1.capacity = 1000 #hold at most 1000 log entries
a1.channels.c1.transactionCapacity = 100 #100 entries per transaction batch
#Bind the source and sink to the channel
a1.sources.r1.channels = c1 #a source can be bound to multiple channels
a1.sinks.k1.channel = c1 #a sink can only be bound to one channel
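This configuration can be tested as follows (a sketch; it assumes the configuration above is saved as conf/flume.properties and that the nc utility is installed):
../bin/flume-ng agent -c ./ -f ./flume.properties -n a1 -Dflume.root.logger=INFO,console
# from another terminal, send a test line to the netcat source
echo "hello netcat flume" | nc 127.0.0.1 22222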
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/src/flume/data
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
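To try out the spooling directory configuration above (a sketch; the file name is only an example), drop a file into the monitored directory and the logger sink prints its contents:
mkdir -p /usr/local/src/flume/data
echo "hello spooldir flume" > /tmp/a.txt
# Flume renames the file to a.txt.COMPLETED once it has been processed
mv /tmp/a.txt /usr/local/src/flume/data/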
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.220.137
a1.sinks.k1.port = 22222
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
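The HTTP source in this configuration can be exercised with curl using the JSON format expected by the default JSONHandler (the downstream Avro agent at 192.168.220.137 must be running for the sink to deliver the event):
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]' http://0.0.0.0:22222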
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.220.137
a1.sinks.k1.port = 22222
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
channels –
type – Component type name, must be "avro"
bind – Hostname or IP address to listen on
port – Port to listen on
threads – Maximum number of worker threads
selector.type
selector.*
interceptors – Space-separated list of interceptors
interceptors.*
compression-type none Compression type; can be "none" or "deflate". This value must match the compression format used by the sending Avro client or sink
ssl false Whether to enable SSL encryption; if enabled, a "keystore" and a "keystore-password" must also be configured.
keystore – Path to the Java keystore file used for SSL
keystore-password – Password of the Java keystore file used for SSL
keystore-type JKS Keystore type; can be "JKS" or "PKCS12".
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.
ipFilter false Set to true to enable IP filtering for Netty
ipFilterRules – IP filtering rule expressions for Netty
channels –
type – Component type, must be "spooldir"
spoolDir – Path of the directory to read files from, i.e. the "spooling directory"
fileSuffix .COMPLETED Suffix appended to files once they have been fully processed
deletePolicy never Whether to delete files after processing; must be "never" or "immediate"
fileHeader false Whether to add a header storing the absolute path filename.
fileHeaderKey file Header key to use when appending absolute path filename to event header.
basenameHeader false Whether to add a header storing the basename of the file.
basenameHeaderKey basename Header Key to use when appending basename of file to event header.
ignorePattern ^$ Regular expression specifying which files to ignore
trackerDir .flumespool Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder oldest Order in which to consume files: oldest, youngest, or random.
maxBackoff 4000 The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, upto the value specified by this parameter.
batchSize 100 Granularity at which to batch transfer to the channel
inputCharset UTF-8 Character set used when reading files.
decodeErrorPolicy FAIL What to do when an undecodable character is found in the input file. FAIL: throw an exception and fail to parse the file. REPLACE: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. IGNORE: drop the unparseable character sequence.
deserializer LINE The deserializer used to parse the file into events; by default, one line becomes one event. The specified class must implement the EventDeserializer.Builder interface.
deserializer.* Varies per event deserializer.
bufferMaxLines – (Obsolete) This option is now ignored.
bufferMaxLineLength 5000 (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors – Space-separated list of interceptors
interceptors.*
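For example, some of these options might be combined as follows (a sketch; the path and pattern are illustrative only):
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/src/flume/data
# add a "file" header carrying the absolute path of each file
a1.sources.r1.fileHeader = true
# skip files that are still being written
a1.sources.r1.ignorePattern = ^.*\.tmp$
# delete files after processing instead of renaming them to .COMPLETED
a1.sources.r1.deletePolicy = immediate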
type – Component type, must be "http"
port – Port to listen on
bind 0.0.0.0 Hostname or IP address to listen on
handler org.apache.flume.source.http.JSONHandler Handler class; must implement the HTTPSourceHandler interface
handler.* – Configuration parameters for the handler
selector.type
selector.*
interceptors –
interceptors.*
enableSSL false Whether to enable SSL; set to true if needed. Note that HTTP does not support SSLv3.
excludeProtocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore Location of the keystore file.
keystorePassword Keystore password
3.3.6.1 Error at startup, agent does not continue
[root@localhost conf]# ../bin/flume-ng agent --conf conf --conf-file flume.properties --name a1 -Dflume.root.logger=INFO,console
Info: Including Hive libraries found via () for Hive access
+ exec /usr/local/src/java/jdk1.7.0_51/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp 'conf:/usr/local/src/flume/apache-flume-1.6.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file flume.properties --name a1
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Cause of the error:
The path to the log4j properties file is incorrect.
Solution:
[root@localhost bin]# ./flume-ng agent -c /usr/local/src/flume/apache-flume-1.6.0-bin/conf -f /usr/local/src/flume/apache-flume-1.6.0-bin/conf/flume.properties -n a1 -Dflume.root.logger=INFO,console
or
[root@localhost bin]# ./flume-ng agent -c ../conf -f ../conf/flume.properties -n a1 -Dflume.root.logger=INFO,console
3.3.6.2 Re-used file name in the monitored directory
If a file name has already been processed, re-using it triggers the exception below, even if the completed file has since been deleted.
java.lang.IllegalStateException: File name has been re-used with different files. Spooling assumptions violated for /usr/local/src/flume/data/g.txt.COMPLETED
Solution:
Do not place files with duplicate names into the monitored directory.
Since the file was deleted, how does Flume still know it has been processed?
Flume creates a hidden directory .flumespool under the monitored directory containing a hidden file .flumespool-main.meta, which records what has already been processed. Deleting this hidden directory allows files with re-used names to be processed again.
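For example (a sketch, assuming the monitored directory from the earlier configuration):
cd /usr/local/src/flume/data
ls -a                 # the hidden .flumespool directory is listed here
rm -rf .flumespool    # remove the tracker metadata so re-used file names can be processed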
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-avro.properties -n a1 -Dflume.root.logger=INFO,console
Startup output:
2017-11-07 19:58:03,708 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.source.AvroSource.start(AvroSource.java:253)] Avro source r1 started.
cd /usr/local/src/flume #enter the directory
vi log.txt #create a data file with the following content
hi flume.
you are good tools.
Send log data to the specified host and port with the avro client that ships with Flume:
./flume-ng -h #the help output shows the command format and argument usage
./flume-ng avro-client -c ../conf -H 0.0.0.0 -p 22222 -F ../../log.txt
The console receives the message:
Note: the printed event body is truncated; the console only displays a short portion of the content.
Channel components
Memory Channel: events are stored in an in-memory queue of a configurable maximum size. It is well suited to flows that need high throughput and can tolerate losing data on failure.
Parameters:
type – Component type, must be "memory"
capacity 100 Maximum number of events stored in the channel
transactionCapacity 100 Maximum number of events per transaction
keep-alive 3 Timeout (in seconds) for add or remove operations
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
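A memory channel using these options might be configured as follows (a sketch; the numbers are illustrative):
a1.channels.c1.type = memory
# hold at most 10000 events
a1.channels.c1.capacity = 10000
# up to 1000 events per transaction
a1.channels.c1.transactionCapacity = 1000
# cap the total size of event bodies at roughly 100 MB
a1.channels.c1.byteCapacity = 104857600
a1.channels.c1.byteCapacityBufferPercentage = 20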
JDBC Channel: events are persisted in a reliable database; currently the embedded Derby database is supported. Use this channel when recoverability is very important.
File Channel: performance is comparatively low, but data is not lost even if the process crashes.
Parameters:
type – Component type, must be "file"
checkpointDir ~/.flume/file-channel/checkpoint Directory where checkpoint files are stored
useDualCheckpoints false Backup the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir – The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory
dataDirs ~/.flume/file-channel/data Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance.
transactionCapacity 10000 The maximum size of transaction supported by the channel
checkpointInterval 30000 Amount of time (in millis) between checkpoints
maxFileSize 2146435071 Maximum size of a single log file
minimumRequiredSpace 524288000 Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
capacity 1000000 Maximum capacity of the channel
keep-alive 3 Amount of time (in sec) to wait for a put operation
use-log-replay-v1 false Expert: Use old replay logic
use-fast-replay false Expert: Replay without using queue
checkpointOnClose true Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay.
encryption.activeKey – Key name used to encrypt new data
encryption.cipherProvider – Cipher provider type, supported types: AESCTRNOPADDING
encryption.keyProvider – Key provider type, supported types: JCEKSFILE
encryption.keyProvider.keyStoreFile – Path to the keystore file
encrpytion.keyProvider.keyStorePasswordFile – Path to the keystore password file
encryption.keyProvider.keys – List of all keys (e.g. history of the activeKey setting)
encyption.keyProvider.keys.*.passwordFile – Path to the optional key password file
Spillable Memory Channel: events are stored both in an in-memory queue and on disk. The memory queue is the primary store, and the disk holds the overflow; when the memory queue is full, subsequent events are stored in the file channel. This channel is intended for flows that use the memory channel during normal operation for high throughput and fall back to the file channel during load peaks for higher tolerance, trading throughput for resilience. If the agent crashes, only the events stored on the file system can be recovered; data held in memory is lost. This channel is experimental and not recommended for production use.
Parameters:
type – Component type, must be "SPILLABLEMEMORY"
memoryCapacity 10000 Maximum number of events stored in memory; set to 0 to disable the in-memory buffer.
overflowCapacity 100000000 Maximum number of events that can be stored on disk; set to 0 to disable disk storage.
overflowTimeout 3 The number of seconds to wait before enabling disk overflow when memory fills up.
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
avgEventSize 500 Estimated average size of events, in bytes, going into the channel
<file channel properties> see file channel Any file channel property with the exception of 'keep-alive' and 'capacity' can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use 'overflowCapacity' to set the File channel's capacity.
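Based on the parameters above, a spillable memory channel might be configured roughly as follows (a sketch; the values and directories are illustrative):
a1.channels.c1.type = SPILLABLEMEMORY
# events kept in the in-memory queue
a1.channels.c1.memoryCapacity = 10000
# events allowed to spill to disk once memory is full
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.checkpointDir = /usr/local/src/flume/checkpoint
a1.channels.c1.dataDirs = /usr/local/src/flume/filedata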
Sink components
Logger Sink: logs events at the INFO level, typically used for debugging.
Parameters:
channel –
type – The component type name, needs to be logger
maxBytesToLog 16 Maximum number of bytes of the Event body to log
A log4j configuration file log4j.properties must exist in the directory given by the --conf argument.
Alternatively, log4j settings can be passed manually at startup with -Dflume.root.logger=INFO,console.
File Roll Sink: stores events on the local file system.
At a fixed interval it rolls to a new file that holds the log data collected during that period.
Parameters:
channel –
type – Component type, must be "file_roll"
sink.directory – Directory where the files are stored
sink.rollInterval 30 Roll the file every 30 seconds (i.e., the data collected in each 30-second window is cut into its own file). Setting this to 0 disables rolling, so all data is written to a single file.
sink.serializer TEXT Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface.
batchSize 100
Changes:
a1.sources.r1.type = http #built-in type
a1.sources.r1.port = 22222 #listening port
a1.sinks.k1.type = file_roll #write to local files
a1.sinks.k1.sink.directory = /usr/local/src/flume/data #output directory
Configure the first line and comment out the second to enable console output; by default the first line is commented out and the second is enabled.
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]' http://0.0.0.0:22222
Result:
[root@localhost data]# pwd
/usr/local/src/flume/data #directory where the data is stored
[root@localhost data]# ll
total 4
-rw-r--r--. 1 root root 0 Nov 9 16:21 1510273266537-1
-rw-r--r--. 1 root root 22 Nov 9 16:21 1510273266537-2
-rw-r--r--. 1 root root 0 Nov 9 16:22 1510273266537-3
-rw-r--r--. 1 root root 0 Nov 9 16:22 1510273266537-4
-rw-r--r--. 1 root root 0 Nov 9 16:23 1510273266537-5
-rw-r--r--. 1 root root 0 Nov 9 16:23 1510273266537-6
[root@localhost data]# tail 1510273266537-2 #the data has been written
hello file-roll flume
[root@localhost data]# tail 1510273266537-6 #a file is created even when there is no data
Note: by default a new log file is produced every 30 seconds, but the timing is not precise.
Avro Sink
The Avro Sink is the basis for multi-hop flows, fan-out flows (one to many), and fan-in flows (many to one).
channel –
type – avro.
hostname – The hostname or IP address to bind to.
port – The port # to listen on.
batch-size 100 number of event to batch together for send.
connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request.
request-timeout 20000 Amount of time (ms) to allow for requests after the first.
To demonstrate multi-hop flow we need multiple sources, so we set up two more servers.
Clone two new virtual machines: flume02 and flume03.
Multi-hop deployment layout
Changes on flume01:
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
#Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.163.130
a1.sinks.k1.port = 22222
Copy the file to the other hosts
[root@localhost conf]# pwd
/usr/local/src/flume/apache-flume-1.6.0-bin/conf
[root@localhost conf]# scp flume-avro-sink.properties root@192.168.163.130:/usr/local/src/flume/apache-flume-1.6.0-bin/conf/flume-avro-sink.properties
The authenticity of host '192.168.163.130 (192.168.163.130)' can't be established.
RSA key fingerprint is 40:d6:4e:bd:3e:d0:90:3b:86:41:72:90:ec:dd:95:f9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.163.130' (RSA) to the list of known hosts.
flume-avro-sink.properties 100% 477 0.5KB/s 00:00
[root@localhost conf]# scp flume-avro-sink.properties root@192.168.163.131:/usr/local/src/flume/apache-flume-1.6.0-bin/conf/flume-avro-sink.properties
The authenticity of host '192.168.163.131 (192.168.163.131)' can't be established.
RSA key fingerprint is 40:d6:4e:bd:3e:d0:90:3b:86:41:72:90:ec:dd:95:f9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.163.131' (RSA) to the list of known hosts.
flume-avro-sink.properties 100% 477 0.5KB/s 00:00
[root@localhost conf]#
Changes on flume02:
a1.sources.r1.type = avro #built-in type
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
#Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.163.131
a1.sinks.k1.port = 22222
Changes on flume03:
a1.sources.r1.type = avro #built-in type
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 22222
#Describe the sink
a1.sinks.k1.type = logger
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-avro-sink.properties -n a1 -Dflume.root.logger=INFO,console
Send a message from the flume01 node:
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]' http://0.0.0.0:22222
Result:
2017-11-09 18:58:33,863 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xdd9a2bfc, /192.168.163.130:34945 => /192.168.163.131:22222] BOUND: /192.168.163.131:22222
2017-11-09 18:58:33,863 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xdd9a2bfc, /192.168.163.130:34945 => /192.168.163.131:22222] CONNECTED: /192.168.163.130:34945
2017-11-09 19:00:28,463 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{tester=tony} body: 68 65 6C 6C 6F 20 6D 6F 72 65 20 61 76 72 6F 20 hello more avro }
The nodes must be started in order: flume01 connects to 192.168.163.130:22222, so if flume02 is not yet running, the error below is reported.
Solution: start the nodes in reverse order, first flume03, then flume02, then flume01. The following message indicates that the bind succeeded.
Server flume01 takes HTTP data as its source and outputs Avro data; flume02 takes Avro data as its source and outputs Avro data; flume03 takes Avro data as its source and outputs via log4j, printing the result to the console.
HDFS Sink
HDFS provides distributed storage and backup for massive amounts of data.
The HDFS Sink writes events to HDFS; it supports creating text files and sequence files, and supports compression.
These files can be partitioned by a specified time interval, data size, or number of events (for example, how many records go into one file: if every log record went to its own file, HDFS would suffer from the small-files problem and later processing would be very inefficient, so rules can be set for when a file rolls over to a new one). The sink can also bucket/partition data by timestamp or machine attributes. The HDFS directory path may contain escape sequences that are substituted to generate the directory/file names used to store events. Using this sink requires that Hadoop is installed, so that Flume can use the Hadoop jars to communicate with HDFS. Note that the Hadoop version used must support the sync() call so that data can be appended.
#Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#Describe/configure the source
a1.sources.r1.type = http
a1.sources.r1.port = 22222
#Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flume/data
#Describe the memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
channel –
type – Component type name, must be "hdfs"
hdfs.path – HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.fileType SequenceFile File format: currently SequenceFile, DataStream or CompressedStream (1)DataStream will not compress output file and please don’t set codeC (2)CompressedStream requires set hdfs.codeC with an available codeC
The default is a sequence file; the options are SequenceFile (serialized), DataStream (plain text), and CompressedStream (compressed).
hdfs.filePrefix FlumeData Prefix for the file names Flume creates in the directory
hdfs.fileSuffix – Suffix to append to the file name (e.g. .avro – note: the date/time is not added automatically)
hdfs.inUsePrefix – Prefix applied to files Flume is actively writing
hdfs.inUseSuffix .tmp Suffix applied to files Flume is actively writing
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)
hdfs.idleTimeout 0 Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize 100 number of events written to file before it is flushed to HDFS
hdfs.codeC – Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy
hdfs.maxOpenFiles 5000 Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas – Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat – Format for sequence file records. One of “Text” or “Writable” (the default).
hdfs.callTimeout 10000 Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize 10 Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize 1 Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
hdfs.proxyUser
hdfs.round false Whether the timestamp should be rounded down (if true, affects all time-based escape sequences except %t)
hdfs.roundValue 1 The value to which the time is rounded down (in units configured by hdfs.roundUnit)
hdfs.roundUnit second Unit used for rounding down – second, minute, or hour
hdfs.timeZone Local Time Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries 0 Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart.
hdfs.retryInterval 180 Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension.
serializer TEXT Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
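For example, the rolling and rounding options can be combined so that each hour of data lands in its own directory (a sketch; the path and values are illustrative):
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flume/data/%Y-%m-%d/%H
# plain-text output, no compression codec
a1.sinks.k1.hdfs.fileType = DataStream
# roll at most once a minute, or when a file reaches 128 MB; never roll by event count
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# round the timestamp down to the hour so each hour gets its own directory
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
# use the agent's local time if events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true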
Copy the dependency jar files
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common/lib – copy all jars from this directory
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common – the 3 jar files in this directory
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/hdfs – hadoop-hdfs-2.7.1.jar from this directory
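The copy can be done with commands along these lines (a sketch; the paths assume the Hadoop and Flume installation directories used above):
cp /usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common/lib/*.jar /usr/local/src/flume/apache-flume-1.6.0-bin/lib/
cp /usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common/*.jar /usr/local/src/flume/apache-flume-1.6.0-bin/lib/
cp /usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/hdfs/hadoop-hdfs-2.7.1.jar /usr/local/src/flume/apache-flume-1.6.0-bin/lib/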
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-hdfs.properties -n a1 -Dflume.root.logger=INFO,console
The agent starts and output scrolls past rapidly.
Simulate an HTTP request
Send a message from the flume01 node:
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]' http://0.0.0.0:22222
Result:
org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://hadoop01:9000/flume/data/FlumeData.1510560200492.tmp
hadoop fs -put '/usr/local/src/hive/data/english.txt' /user/hive/warehouse/test.db/tb_book