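Event
Each log record that is read is wrapped in an object, and this object is called a Flume Event.
Agent
Source
The Source component is dedicated to collecting data. It can handle log data of many types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, and custom sources.
Channel
Sink
Store on failure (this is also the strategy used by Scribe, Facebook's open-source log collection system: when the data receiver crashes, the data is written locally, and sending resumes after the receiver recovers)
Focus on mastering the Avro Source and the Spooling Directory Source.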
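To make the roles above concrete, here is a minimal Python sketch of a source wrapping log lines into events, a bounded channel buffering them, and a sink draining them. The class and function names (`Event`, `MemoryChannel`, `run_source`, `run_sink`) are illustrative assumptions, not Flume's Java API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Illustrative stand-in for a Flume Event: headers plus a byte body."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Hypothetical bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        # Refuse new events once the configured capacity is reached.
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Source side: wrap each log line in an Event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))


def run_sink(channel):
    """Sink side: drain the channel and return the decoded bodies."""
    delivered = []
    event = channel.take()
    while event is not None:
        delivered.append(event.body.decode("utf-8"))
        event = channel.take()
    return delivered


channel = MemoryChannel(capacity=1000)
run_source(["hi flume.", "you are good tools."], channel)
result = run_sink(channel)
print(result)
```

The sketch leaves out transactions and batching entirely; it only shows the direction of data flow.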
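# Single-node Flume configuration
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# netcat is a built-in type that receives data from the network
a1.sources.r1.type = netcat
# Equivalent to the network address 127.0.0.1
a1.sources.r1.bind =
# Port number of the service
a1.sources.r1.port = 22222
# Describe the sink (logger is a built-in type)
a1.sinks.k1.type = logger
# Describe the memory channel, which stores data in memory
a1.channels.c1.type = memory
# The channel holds at most 1000 log events
a1.channels.c1.capacity = 1000
# One transaction handles a batch of 100 events
a1.channels.c1.transactionCapacity = 100
# Bind the source and the sink to the channel
# A source can be bound to multiple channels
a1.sources.r1.channels = c1
# A sink can be bound to only one channel
a1.sinks.k1.channel = c1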
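The binding rules in this configuration (a source may list several channels, a sink exactly one, and every referenced channel must be declared) can be checked before starting the agent. The following is a rough Python sketch; `parse_properties` and `check_wiring` are hypothetical helpers, and the parser handles only simple `key = value` lines and whole-line `#` comments. As with Java properties files, a `#` that follows a value is treated as part of the value, not as a comment.

```python
def parse_properties(text):
    """Parse simple key = value lines; whole-line comments start with '#'."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Anything after '=' is kept verbatim, including a trailing '#...'.
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems with source/sink-to-channel bindings."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} has no channel")
        for name in bound:
            if name not in channels:
                problems.append(f"source {source} uses undeclared channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        bound = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(bound) != 1:
            problems.append(f"sink {sink} must bind to exactly one channel")
        elif bound[0] not in channels:
            problems.append(f"sink {sink} uses undeclared channel {bound[0]}")
    return problems


config = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.port = 22222
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""
props = parse_properties(config)
problems = check_wiring(props, "a1")
print(problems)
```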
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /usr/local/src/flume/data
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname =
a1.sinks.k1.port = 22222
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname =
a1.sinks.k1.port = 22222
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
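channels –
type – Component type name, "AVRO"
bind – Hostname or IP address to listen on
port – Port to listen on
threads – Maximum number of worker threads
interceptors – Space-separated list of interceptors
compression-type none Compression type, either "none" or "deflate". This value must match the compression type of the matching AvroSource
ssl false Whether to enable SSL encryption. If enabled, a "keystore" and a "keystore-password" must also be configured.
keystore – Path to the Java keystore file used for SSL
keystore-password – Password of the Java keystore file used for SSL
keystore-type JKS Keystore type, either "JKS" or "PKCS12".
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.
ipFilter false Set this to true to enable IP filtering for netty
ipFilterRules – Expression rules that configure IP filtering for netty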
channels –
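type – Component type, must be "spooldir"
spoolDir – Path of the directory to read files from, that is, the "spooling directory"
fileSuffix .COMPLETED Suffix appended to files that have been fully processed
deletePolicy never Whether to delete files after processing, must be "never" or "immediate"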
fileHeader false Whether to add a header storing the absolute path filename.
fileHeaderKey file Header key to use when appending absolute path filename to event header.
basenameHeader false Whether to add a header storing the basename of the file.
basenameHeaderKey basename Header Key to use when appending basename of file to event header.
ignorePattern ^$ Regular expression specifying which files to ignore
trackerDir .flumespool Directory to store metadata related to processing of files. If this path is not an absolute path, then it is interpreted as relative to the spoolDir.
consumeOrder oldest Order in which files are processed: oldest, youngest or random.
maxBackoff 4000 The maximum time (in millis) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source will start at a low backoff and increase it exponentially each time the channel throws a ChannelException, up to the value specified by this parameter.
batchSize 100 Granularity at which to batch transfer to the channel
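inputCharset UTF-8 Character set used when reading files.
decodeErrorPolicy FAIL What to do when a character that cannot be decoded is found in the input file. FAIL: throw an exception and fail to parse the file. REPLACE: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. IGNORE: drop the unparseable character sequence.
deserializer LINE Specifies the deserializer used to parse the file into events. Defaults to one event per line. The class must implement the EventDeserializer.Builder interface.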
deserializer.* Varies per event deserializer.
bufferMaxLines – (Obsolete) This option is now ignored.
bufferMaxLineLength 5000 (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead.
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors – Space-separated list of interceptors
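The interaction of `fileSuffix`, `ignorePattern` and `consumeOrder` can be pictured with a small Python sketch. `pick_files` is a hypothetical helper, not Flume code. It assumes that completed files and hidden files are skipped, that `ignorePattern` must match the whole file name, and that "oldest" means smallest modification time.

```python
import os
import random
import re
import tempfile


def pick_files(spool_dir, file_suffix=".COMPLETED", ignore_pattern="^$",
               consume_order="oldest"):
    """Return candidate file names in the order they would be consumed."""
    ignore = re.compile(ignore_pattern)
    candidates = []
    for name in os.listdir(spool_dir):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        # Assumption: already-completed and hidden files are never re-read.
        if name.endswith(file_suffix) or name.startswith("."):
            continue
        # Assumption: the ignore pattern must match the entire file name.
        if ignore.fullmatch(name):
            continue
        candidates.append((os.path.getmtime(path), name))
    if consume_order == "random":
        random.shuffle(candidates)
    else:
        candidates.sort(reverse=(consume_order == "youngest"))
    return [name for _, name in candidates]


with tempfile.TemporaryDirectory() as spool:
    names = ["b.log", "a.log", "done.log.COMPLETED", "skip.tmp"]
    for i, name in enumerate(names):
        path = os.path.join(spool, name)
        with open(path, "w") as fh:
            fh.write("x\n")
        # Give each file a distinct, increasing modification time.
        os.utime(path, (1000 + i, 1000 + i))
    order = pick_files(spool, ignore_pattern=r".*\.tmp")
    newest_first = pick_files(spool, ignore_pattern=r".*\.tmp",
                              consume_order="youngest")
print(order)
```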
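type Component type, must be "HTTP"
port – Port to listen on
bind Hostname or IP address to listen on
handler org.apache.flume.source.http.JSONHandler Handler class, must implement the HTTPSourceHandler interface
handler.* – Configuration parameters for the handler
interceptors –
enableSSL false Whether to enable SSL; set to true to enable it. Note that the HTTP source does not support SSLv3.
excludeProtocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded.
keystore Location of the keystore file.
keystorePassword Keystore password
Startup reports an error and does not continue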
[root@localhost conf]# ../bin/flume-ng agent --conf conf --conf-file flume.properties --name a1 -Dflume.root.logger=INFO,console
Info: Including Hive libraries found via () for Hive access
+ exec /usr/local/src/java/jdk1.7.0_51/bin/java -Xmx20m -Dflume.root.logger=INFO,console -cp 'conf:/usr/local/src/flume/apache-flume-1.6.0-bin/lib/*:/lib/*' -Djava.library.path= org.apache.flume.node.Application --conf-file flume.properties --name a1
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[root@localhost bin]# ./flume-ng agent -c /usr/local/src/flume/apache-flume-1.6.0-bin/conf -f /usr/local/src/flume/apache-flume-1.6.0-bin/conf/flume.properties -n a1 -Dflume.root.logger=INFO,console
[root@localhost bin]# ./flume-ng agent -c ../conf -f ../conf/flume.properties -n a1 -Dflume.root.logger=INFO,console
Duplicate file name exception in the monitored directory
java.lang.IllegalStateException: File name has been re-used with different files. Spooling assumptions violated for /usr/local/src/flume/data/g.txt.COMPLETED
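The exception above is raised when a file name that was already used in the spooling directory shows up again. One way to avoid it is to give each file a name that was never used there before moving it in. The sketch below is only an illustration: `move_into_spool` and its `-N` naming scheme are assumptions, not part of Flume.

```python
import os
import shutil
import tempfile


def move_into_spool(src_path, spool_dir, file_suffix=".COMPLETED"):
    """Move src_path into spool_dir under a name never used there before."""
    base = os.path.basename(src_path)
    stem, ext = os.path.splitext(base)
    candidate = base
    counter = 0
    # A name is taken if it exists as-is or in its completed form.
    while (os.path.exists(os.path.join(spool_dir, candidate))
           or os.path.exists(os.path.join(spool_dir, candidate + file_suffix))):
        counter += 1
        candidate = f"{stem}-{counter}{ext}"
    shutil.move(src_path, os.path.join(spool_dir, candidate))
    return candidate


with tempfile.TemporaryDirectory() as work:
    spool = os.path.join(work, "spool")
    os.mkdir(spool)
    # Pretend g.txt was already processed once.
    open(os.path.join(spool, "g.txt.COMPLETED"), "w").close()
    src = os.path.join(work, "g.txt")
    with open(src, "w") as fh:
        fh.write("new data\n")
    new_name = move_into_spool(src, spool)
    moved_exists = os.path.exists(os.path.join(spool, new_name))
print(new_name)
```

Writing the file elsewhere first and then moving it in also keeps the source from seeing a half-written file, provided both paths are on the same file system.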
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-avro.properties -n a1 -Dflume.root.logger=INFO,console
2017-11-07 19:58:03,708 (lifecycleSupervisor-1-2) [INFO - org.apache.flume.source.AvroSource.start(AvroSource.java:253)] Avro source r1 started.
cd /usr/local/src/flume #enter the directory
vi log.txt #create the data file with the following content
hi flume.
you are good tools.
./flume-ng -h #help shows the command format and parameter usage
./flume-ng avro-client -c ../conf -H -p 22222 -F ../../log.txt
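type – Component type, must be "memory"
capacity 100 Maximum number of events stored in the channel
transactionCapacity 100 Maximum number of events per transaction
keep-alive 3 Timeout in seconds for adding or removing an event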
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
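As a worked example of how `byteCapacity` and `byteCapacityBufferPercentage` relate, the Python sketch below estimates how many bytes of event bodies fit in the channel. It assumes the buffer percentage is simply subtracted from `byteCapacity`, which is one reading of the descriptions above, and that the JVM's maximum memory equals the `-Xmx` value. The function name `usable_body_bytes` is hypothetical.

```python
def usable_body_bytes(max_heap_bytes, byte_capacity=None, buffer_percentage=20):
    """Estimate the bytes available for event bodies in a memory channel."""
    if byte_capacity is None:
        # Default described above: 80% of the maximum JVM memory.
        byte_capacity = max_heap_bytes * 80 // 100
    # Assumption: the buffer percentage is reserved for header data.
    return byte_capacity * (100 - buffer_percentage) // 100


# The startup log earlier in this document shows -Xmx20m.
heap = 20 * 1024 * 1024
default_capacity = heap * 80 // 100
usable = usable_body_bytes(heap)
print(default_capacity, usable)
```

Under these assumptions a 20 MB heap gives a default `byteCapacity` of 16 MB, of which roughly 12.8 MB is counted against event bodies.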
type – Component type, must be "file"
checkpointDir ~/.flume/file-channel/checkpoint Directory where the checkpoint file is stored
useDualCheckpoints false Backup the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir – The directory where the checkpoint is backed up to. This directory must not be the same as the data directories or the checkpoint directory
dataDirs ~/.flume/file-channel/data Comma-separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel performance.
transactionCapacity 10000 The maximum size of transaction supported by the channel
checkpointInterval 30000 Amount of time (in millis) between checkpoints
maxFileSize 2146435071 Maximum size (in bytes) of a single log file
minimumRequiredSpace 524288000 Minimum Required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value
capacity 1000000 Maximum capacity of the channel
keep-alive 3 Amount of time (in sec) to wait for a put operation
use-log-replay-v1 false Expert: Use old replay logic
use-fast-replay false Expert: Replay without using queue
checkpointOnClose true Controls if a checkpoint is created when the channel is closed. Creating a checkpoint on close speeds up subsequent startup of the file channel by avoiding replay.
encryption.activeKey – Key name used to encrypt new data
encryption.cipherProvider – Cipher provider type, supported types: AESCTRNOPADDING
encryption.keyProvider – Key provider type, supported types: JCEKSFILE
encryption.keyProvider.keyStoreFile – Path to the keystore file
encryption.keyProvider.keyStorePasswordFile – Path to the keystore password file
encryption.keyProvider.keys – List of all keys (e.g. history of the activeKey setting)
encryption.keyProvider.keys.*.passwordFile – Path to the optional key password file
Spillable Memory Channel. Events are stored in an in-memory queue and on disk. The in-memory queue is the primary store, and the disk holds the overflow. When the in-memory queue is full, subsequent events are stored in the file channel. This channel suits flows that use the memory channel for high throughput during normal operation and the file channel for higher tolerance during peak periods; it trades some throughput for resilience. If the agent crashes, only the events stored on the file system can be recovered; data in memory is lost. This channel is experimental and is not recommended for production use.
Parameters:
type – Component type, must be "SPILLABLEMEMORY"
memoryCapacity 10000 Maximum number of events stored in memory. Set this to 0 to disable the in-memory buffer.
overflowCapacity 100000000 Maximum number of events that can be stored on disk. Set this to 0 to disable disk storage.
overflowTimeout 3 The number of seconds to wait before enabling disk overflow when memory fills up.
byteCapacityBufferPercentage 20 Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below.
byteCapacity see description Maximum bytes of memory allowed as a sum of all events in the memory queue. The implementation only counts the Event body, which is the reason for providing the byteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB.
avgEventSize 500 Estimated average size of events, in bytes, going into the channel
<file channel properties> see file channel Any file channel property with the exception of 'keep-alive' and 'capacity' can be used. The keep-alive of file channel is managed by Spillable Memory Channel. Use 'overflowCapacity' to set the File channel's capacity.
channel –
type – The component type name, needs to be logger
maxBytesToLog 16 Maximum number of bytes of the Event body to log
The directory specified by the --conf parameter must contain the log4j configuration file log4j.properties
channel –
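type – Component type, must be "file_roll"
sink.directory – Directory where the files are stored
sink.rollInterval 30 Roll the file every 30 seconds (that is, every 30 seconds the data is cut off into a separate file). If set to 0, rolling is disabled and all data is written to a single file.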
sink.serializer TEXT Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface.
batchSize 100
# http is a built-in type
a1.sources.r1.type = http
# Listening port
a1.sources.r1.port = 22222
# file_roll writes events to local files
a1.sinks.k1.type = file_roll
# Directory where the files are stored
a1.sinks.k1.sink.directory = /usr/local/src/flume/data
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]'
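The request body in the curl command is a JSON array of events, each with `headers` and `body`. The same payload can be built programmatically, as in this Python sketch. `build_payload` and `post_events` are hypothetical helper names, and the target URL must be supplied by the caller because it is not shown in the command above.

```python
import json
import urllib.request


def build_payload(events):
    """Serialize (headers, body) pairs into a JSON array of events."""
    return json.dumps(
        [{"headers": dict(headers), "body": body} for headers, body in events]
    )


def post_events(url, payload):
    """POST the payload to an HTTP source; url is supplied by the caller."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_payload([({"tester": "tony"}, "hello http flume")])
print(payload)
```

Only `build_payload` is exercised here; `post_events` needs a running agent to be useful.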
[root@localhost data]# pwd
/usr/local/src/flume/data #directory where the data is stored
[root@localhost data]# ll
total 4
-rw-r--r--. 1 root root 0 Nov 9 16:21 1510273266537-1
-rw-r--r--. 1 root root 22 Nov 9 16:21 1510273266537-2
-rw-r--r--. 1 root root 0 Nov 9 16:22 1510273266537-3
-rw-r--r--. 1 root root 0 Nov 9 16:22 1510273266537-4
-rw-r--r--. 1 root root 0 Nov 9 16:23 1510273266537-5
-rw-r--r--. 1 root root 0 Nov 9 16:23 1510273266537-6
[root@localhost data]# tail 1510273266537-2 #the data has been written
hello file-roll flume
[root@localhost data]# tail 1510273266537-6 #a file is created even when there is no data
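Because the sink rolls a new file at every interval even when no events arrive, most of the files in the listing above are empty. A short Python sketch for picking out only the files that contain data follows; `non_empty_files` is a hypothetical helper, and the demo directory recreates three of the files from the listing.

```python
import os
import tempfile


def non_empty_files(directory):
    """Return (name, size) for every non-empty regular file, sorted by name."""
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            result.append((name, os.path.getsize(path)))
    return result


with tempfile.TemporaryDirectory() as data_dir:
    # Recreate part of the listing: one file with data, two empty ones.
    contents = {
        "1510273266537-1": b"",
        "1510273266537-2": b"hello file-roll flume\n",
        "1510273266537-3": b"",
    }
    for name, content in contents.items():
        with open(os.path.join(data_dir, name), "wb") as fh:
            fh.write(content)
    found = non_empty_files(data_dir)
print(found)
```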
Avro Sink
It is the basis for multi-hop flows, fan-out flows (one to many), and fan-in flows (many to one).
channel –
type – avro.
hostname – The hostname or IP address to bind to.
port – The port # to listen on.
batch-size 100 number of event to batch together for send.
connect-timeout 20000 Amount of time (ms) to allow for the first (handshake) request.
request-timeout 20000 Amount of time (ms) to allow for requests after the first.
Clone two new virtual machines, flume02 and flume03
a1.sources.r1.type = http
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname =
a1.sinks.k1.port = 22222
[root@localhost conf]# pwd
[root@localhost conf]# scp flume-avro-sink.properties root@
The authenticity of host ' (' can't be established.
RSA key fingerprint is 40:d6:4e:bd:3e:d0:90:3b:86:41:72:90:ec:dd:95:f9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.
flume-avro-sink.properties 100% 477 0.5KB/s 00:00
[root@localhost conf]# scp flume-avro-sink.properties root@
The authenticity of host ' (' can't be established.
RSA key fingerprint is 40:d6:4e:bd:3e:d0:90:3b:86:41:72:90:ec:dd:95:f9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.
flume-avro-sink.properties 100% 477 0.5KB/s 00:00
[root@localhost conf]#
# avro is a built-in type
a1.sources.r1.type = avro
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = avro
a1.sinks.k1.hostname =
a1.sinks.k1.port = 22222
# avro is a built-in type
a1.sources.r1.type = avro
a1.sources.r1.bind =
a1.sources.r1.port = 22222
a1.sinks.k1.type = logger
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-avro-sink.properties -n a1 -Dflume.root.logger=INFO,console
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]'
2017-11-09 18:58:33,863 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xdd9a2bfc, / => /] BOUND: /
2017-11-09 18:58:33,863 (New I/O worker #1) [INFO - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:171)] [id: 0xdd9a2bfc, / => /] CONNECTED: /
2017-11-09 19:00:28,463 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:94)] Event: { headers:{tester=tony} body: 68 65 6C 6C 6F 20 6D 6F 72 65 20 61 76 72 6F 20 hello more avro }
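The logger sink prints the event body as hexadecimal bytes followed by a text preview. The hex column can be decoded back to text with a few lines of Python; `decode_logged_body` is a hypothetical helper, and it assumes the body is UTF-8 text.

```python
def decode_logged_body(hex_dump):
    """Turn a space-separated hex dump into text (assumes UTF-8 content)."""
    return bytes.fromhex(hex_dump).decode("utf-8", errors="replace")


text = decode_logged_body("68 65 6C 6C 6F 20 6D 6F 72 65 20 61 76 72 6F 20")
print(repr(text), len(text))
```

The dump holds exactly 16 bytes, which matches the logger sink's `maxBytesToLog` default of 16, so a longer body would appear truncated in the log.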
The HDFS Sink writes events to HDFS. It supports creating text files and sequence files, and it supports compression.
# Name the components of agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 22222
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flume/data
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
channel –
type – Component type name, must be "HDFS"
hdfs.path – HDFS directory path (eg hdfs://namenode/flume/webdata/)
hdfs.fileType SequenceFile File format: currently SequenceFile, DataStream or CompressedStream (1)DataStream will not compress output file and please don’t set codeC (2)CompressedStream requires set hdfs.codeC with an available codeC
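The default is a sequence file. Options: SequenceFile (sequence file), DataStream (text file), CompressedStream (compressed file)
hdfs.filePrefix FlumeData Name prefix for files that Flume creates in the directory
hdfs.fileSuffix – Suffix appended to the file name (eg .avro - note: the date and time are not added automatically)
hdfs.inUsePrefix – Prefix added to files that Flume is currently writing
hdfs.inUseSuffix .tmp Suffix added to files that Flume is currently writing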
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)
hdfs.idleTimeout 0 Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
hdfs.batchSize 100 number of events written to file before it is flushed to HDFS
hdfs.codeC – Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy
hdfs.maxOpenFiles 5000 Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas – Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat – Format for sequence file records. One of “Text” or “Writable” (the default).
hdfs.callTimeout 10000 Number of milliseconds allowed for HDFS operations, such as open, write, flush, close. This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize 10 Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize 1 Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
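hdfs.round false Whether the timestamp should be rounded down (if true, this affects all time-based escape sequences except %t)
hdfs.roundValue 1 The value to round down to, as a multiple of the unit set by hdfs.roundUnit
hdfs.roundUnit second The unit of the round-down value - second, minute or hour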
hdfs.timeZone Local Time Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp false Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.closeTries 0 Number of times the sink must try renaming a file, after initiating a close attempt. If set to 1, this sink will not re-try a failed rename (due to, for example, NameNode or DataNode failure), and may leave the file in an open state with a .tmp extension. If set to 0, the sink will try to rename the file until the file is eventually renamed (there is no limit on the number of times it would try). The file may still remain open if the close call fails but the data will be intact and in this case, the file will be closed only after a Flume restart.
hdfs.retryInterval 180 Time in seconds between consecutive attempts to close a file. Each close call costs multiple RPC round-trips to the Namenode, so setting this too low can cause a lot of load on the name node. If set to 0 or less, the sink will not attempt to close the file if the first attempt fails, and may leave the file open or with a ”.tmp” extension.
serializer TEXT Other possible options include avro_event or the fully-qualified class name of an implementation of the EventSerializer.Builder interface.
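The three roll settings `hdfs.rollInterval`, `hdfs.rollSize` and `hdfs.rollCount` act as independent triggers, and a value of 0 disables the corresponding one. The decision can be sketched in Python as below. `should_roll` is a hypothetical function, and the use of "greater than or equal" as the comparison is an assumption rather than a statement about the sink's exact behavior.

```python
def should_roll(open_seconds, bytes_written, events_written,
                roll_interval=30, roll_size=1024, roll_count=10):
    """Return True when any enabled roll threshold has been reached."""
    if roll_interval > 0 and open_seconds >= roll_interval:
        return True
    if roll_size > 0 and bytes_written >= roll_size:
        return True
    if roll_count > 0 and events_written >= roll_count:
        return True
    return False


# With the defaults, ten small events trigger a roll well before 30 seconds.
by_count = should_roll(open_seconds=5, bytes_written=100, events_written=10)
not_yet = should_roll(open_seconds=5, bytes_written=100, events_written=3)
# Setting all three to 0 disables rolling on time, size and count.
never = should_roll(3600, 10**9, 10**6,
                    roll_interval=0, roll_size=0, roll_count=0)
print(by_count, not_yet, never)
```

With the default values a file is closed after only 10 events or 1024 bytes, so low-volume tests tend to produce many small files unless these settings are raised or set to 0.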
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common/lib: copy all the jars over
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/common: the 3 jar files
/usr/local/src/hadoop/hadoop-2.7.1/share/hadoop/hdfs directory: hadoop-hdfs-2.7.1.jar
[root@localhost conf]# ../bin/flume-ng agent -c ./ -f ./flume-hdfs.properties -n a1 -Dflume.root.logger=INFO,console
curl -X POST -d '[{"headers":{"tester":"tony"},"body":"hello http flume"}]'
org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:234)] Creating hdfs://hadoop01:9000/flume/data/FlumeData.1510560200492.tmp
hadoop fs -put '/usr/local/src/hive/data/english.txt' /user/hive/warehouse/test.db/tb_book