Big Data Technologies: Flume
One, Flume Overview
1) Flume provides a distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data. Flume runs only in Unix-like environments.
2) Flume has a flow-based architecture and is fault-tolerant, flexible, and very simple.
3) Flume and Kafka are typically used for real-time data collection, Spark and Flink for real-time data processing, and Impala for real-time queries.
Two, Flume Roles
2.1、Source
Used to collect data. The Source is where the data stream is produced, and it passes the data stream it generates on to the Channel, somewhat like the Channel in Java IO.
2.2、Channel
Bridges Sources and Sinks; similar to a queue.
2.3、Sink
Collects data from the Channel and writes it to the destination (which may be the Source of the next agent, or HDFS, or HBase).
2.4、Event
The transmission unit: the basic unit of data transfer in Flume. Data travels from source to destination in the form of events.
Three, Flume transmission
After the source starts monitoring a file or data stream, it fetches any new data the data source produces, wraps the data in an Event, and puts it into the channel, committing the transaction. The channel is a FIFO queue; the sink takes data from the channel queue and then writes it to HDFS.
Four, Flume deployment and use
4.1, Configuration file
Check JAVA_HOME:
echo $JAVA_HOME
The output should be /opt/module/jdk1.8.0_144
Install Flume:
[itstar@bigdata113 software]$ tar -zxvf apache-flume1.8.0-bin.tar.gz -C /opt/module/
Rename the template:
[itstar@bigdata113 conf]$ mv flume-env.sh.template flume-env.sh
Entry to modify in flume-env.sh:
export JAVA_HOME=/opt/module/jdk1.8.0_144
4.2, Cases
4.2.1, Case I: Monitoring a data port
Goal: Flume monitors a port on one console; messages sent to that port from another console are displayed in real time on the monitoring console.
Step-by-step implementation:
1) Install the telnet tool
[requires network access] yum -y install telnet
[Installation complete]
2) Create the Flume agent configuration file flume-telnet.conf
# 1. Define the agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# 2. Define the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = bigdata112
a1.sources.r1.port = 44445

# 3. Define the sink
a1.sinks.k1.type = logger

# 4. Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 5. Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3) Check whether port 44445 is occupied
$ netstat -tunlp | grep 44445
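On newer systems where netstat is no longer installed, ss (from iproute2) gives an equivalent check:
$ ss -tunlp | grep 44445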
4) Start the Flume agent with the configuration file
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a1 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-telnet.conf \
-Dflume.root.logger=INFO,console
5) Use the telnet tool to send content to port 44445 on this machine
$ telnet bigdata112 44445
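If everything is wired up, each line typed into the telnet session should appear in the Flume console as a logged event, roughly of the following form (the exact formatting depends on the Flume version; the body is printed as hex plus text):
Event: { headers:{} body: 68 65 6C 6C 6F    hello }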
4.2.2, Case II: Collecting a local file to HDFS in real time
1) Create the flume-hdfs.conf file
# 1. Define the agent name a2
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# 2. Define the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/Andy
a2.sources.r2.shell = /bin/bash -c

# 3. Define the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://bigdata111:9000/flume/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = Andy-
# Whether to roll folders by time
a2.sinks.k2.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# The time unit
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How long before rolling to a new file (seconds)
a2.sinks.k2.hdfs.rollInterval = 600
# Roll size of each file (bytes, just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1

# 4. Define the channel
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# 5. Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
2) Run the monitoring configuration
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a2 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-hdfs.conf
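To verify the sink is writing, append a line to the tailed file and list the hour-partitioned HDFS directory (paths taken from the config above):
echo "hello hdfs sink" >> /opt/Andy
hdfs dfs -ls -R /flume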
4.2.3, Case III: Reading directory files to HDFS in real time
Goal: use Flume to monitor the files of an entire directory
Step-by-step implementation:
1) Create the configuration file flume-dir.conf
# 1. Define the agent a3
a3.sources = r3
a3.sinks = k3
a3.channels = c3

# 2. Define the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume1.8.0/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore (do not upload) all files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# 3. Define the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://bigdata111:9000/flume/%H
# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to roll folders by time
a3.sinks.k3.hdfs.round = true
# How many time units before creating a new folder
a3.sinks.k3.hdfs.roundValue = 1
# The time unit
a3.sinks.k3.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a3.sinks.k3.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k3.hdfs.fileType = DataStream
# How long before rolling to a new file (seconds)
a3.sinks.k3.hdfs.rollInterval = 600
# Roll size of each file (about 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217728
# Rolling is independent of the number of Events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# 4. Define the channel
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# 5. Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
2) Run the test: after starting the script below, try adding files to the upload folder (see the example after the command)
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a3 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-dir.conf
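For example, write a test file elsewhere first and then move it into the spooled directory (path taken from the config above); moving is safer than writing in place, since the source must not see half-written files:
echo "hello spooldir" > /tmp/test.log
mv /tmp/test.log /opt/module/flume1.8.0/upload/
# after ingestion the file is renamed to test.log.COMPLETED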
Tip: when using the Spooling Directory Source,
1) do not create and continuously modify files inside the monitored directory;
2) files that have finished uploading are renamed with the .COMPLETED suffix;
3) the monitored folder is scanned for file changes every 500 milliseconds.
4.2.4, Case IV: Passing data between Flume agents: one Flume with multiple Channels and Sinks
Goal: flume1 monitors file changes and passes each change to flume2, which stores it in HDFS; at the same time flume1 passes the change to flume3, which writes it to the local filesystem.
Step-by-step implementation:
1) Create flume1.conf, which monitors changes to a file and feeds two channels and two sinks that deliver to flume2 and flume3 respectively:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to multiple channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c

# Describe the sinks
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = bigdata111
a1.sinks.k2.port = 4142

# Describe the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sinks to the channels
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
2) Create flume2.conf, which receives events from flume1 and uses one channel and one sink to deliver the data to HDFS:
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume2/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# The time unit
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How long before rolling to a new file (seconds)
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size of each file (bytes, just under 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0
# Minimum number of replicas
a2.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
3) Create flume3.conf, which receives events from flume1 and uses one channel and one sink to deliver the data to a local directory:
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
# Note: this directory must be created in advance
a3.sinks.k1.sink.directory = /opt/flume3

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Tip: the local output directory must already exist; if it does not, file_roll will not create it.
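So create it before starting flume3 (path taken from the config above):
mkdir -p /opt/flume3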
4) Run the test: start the corresponding flume jobs (start the downstream agents flume2 and flume3 first, then flume1), generate file changes, and observe the results:
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume2.conf
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume3.conf
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume1.conf
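To verify both branches, append to the tailed file and check the two destinations (paths taken from the configs above):
echo "replicate me" >> /opt/Andy
hdfs dfs -ls -R /flume2
ls /opt/flume3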
4.2.5, Case V: Passing data between Flume agents: aggregating multiple Flumes into one Flume
Goal: flume11 monitors a local file (/opt/Andy in this example) and flume22 monitors the data stream on a port; flume11 and flume22 both send their data to flume33, which writes the merged data to HDFS.
Step-by-step implementation:
1) Create flume11.conf, which tails the file and sinks the data to flume33:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2) Create flume22.conf, which monitors the data stream on port 44444 and sinks the data to flume33:
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = bigdata111
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
3) Create flume33.conf, which receives the data streams sent by flume11 and flume22 and sinks the merged data to HDFS:
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = hdfs
a3.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume3/%H
# Prefix for uploaded files
a3.sinks.k1.hdfs.filePrefix = flume3-
# Whether to roll folders by time
a3.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a3.sinks.k1.hdfs.roundValue = 1
# The time unit
a3.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a3.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k1.hdfs.fileType = DataStream
# How long before rolling to a new file (seconds)
a3.sinks.k1.hdfs.rollInterval = 600
# Roll size of each file (bytes, just under 128 MB)
a3.sinks.k1.hdfs.rollSize = 134217700
# Rolling is independent of the number of Events
a3.sinks.k1.hdfs.rollCount = 0
# Minimum number of replicas
a3.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
4) Run the test: start the corresponding flume jobs (start flume33, then flume22, then flume11), generate file changes, and observe the results:
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume33.conf
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume22.conf
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume11.conf
Sending data:
a) Open telnet bigdata111 44444 and send 5555555.
b) Append 666666 to /opt/Andy.
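Both inputs should end up merged under the same HDFS path (from the flume33 config above):
hdfs dfs -ls -R /flume3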
4.2.6, Case VI: Flume interceptors
Timestamp interceptor
Timestamp.conf
# Define the agent name and the source, channel, and sink names
a4.sources = r1
a4.channels = c1
a4.sinks = k1

# Define the source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /opt/module/flume-1.8.0/upload

# Define the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100

# Define the interceptor, which adds a timestamp header to each event
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Define the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream

# Do not roll files by event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a file on HDFS when it reaches 128M
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a file on HDFS every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60

# Wire up the source, channel, and sink
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
Start command:
/opt/module/flume-1.8.0/bin/flume-ng agent -n a4 \
-f /opt/module/flume-1.8.0/jobconf/Timestamp.conf \
-c /opt/module/flume-1.8.0/conf \
-Dflume.root.logger=INFO,console
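Why this interceptor matters here: the %H escape in hdfs.path is resolved from the event's timestamp header, which is exactly what TimestampInterceptor supplies (the earlier cases relied on useLocalTimeStamp = true instead). After dropping a file into the spooled directory, the hour-partitioned output can be checked with:
hdfs dfs -ls -R /flume-interceptors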
Hostname interceptor
Host.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

# When true, use the IP (e.g. 192.168.1.111); when false, use the hostname. Default is true
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = agentHost

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flumehost/%H
a1.sinks.k1.hdfs.filePrefix = Andy_%{agentHost}
# Append the .log suffix to generated files
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start command:
bin/flume-ng agent -c conf/ -f jobconf/host.conf -n a1 -Dflume.root.logger=INFO,console
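With useIP = false, the %{agentHost} escape in filePrefix expands to the agent's hostname, so the generated files should be named like Andy_bigdata111...log. A quick check (path taken from the config above):
hdfs dfs -ls -R /flumehost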
UUID interceptor
uuid.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
# The type cannot be written simply as "uuid"; the full class must be given, otherwise the class is not found
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# If a UUID header already exists, keep it
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = UUID_

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/uuid.conf -n a1 -Dflume.root.logger=INFO,console
Search-and-replace interceptor
search.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = itstar
a1.sources.r1.interceptors.i1.charset = UTF-8

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/search.conf -n a1 -Dflume.root.logger=INFO,console
Regex filter interceptor
filter.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^A.*
# If excludeEvents is false, events NOT starting with A are filtered out (matching events are kept).
# If excludeEvents is true, events starting with A are filtered out.
a1.sources.r1.interceptors.i1.excludeEvents = true

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/filter.conf -n a1 -Dflume.root.logger=INFO,console
Regex extractor interceptor
extractor.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = cookieid
a1.sources.r1.interceptors.i1.serializers.s2.name = ip

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/extractor.conf -n a1 -Dflume.root.logger=INFO,console