1. Introduction

1.1 Overview

Flume is a system for collecting, aggregating, and moving log data. It was originally developed by Cloudera and later contributed to Apache.

Logs are a cornerstone of big data: by the author's estimate, more than 70% of the data used in real projects comes from logs.

Versions
a Flume 0.X: Flume-og. Its support for distributed deployment and for thread concurrency was poor
b Flume 1.X: Flume-ng. Flume 1.X is not compatible with Flume 0.X

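1.2 Basic Concepts

Event:
a Flume wraps every log record it collects in an Event object
b An Event is usually pictured as a JSON string, for example {"headers": {...}, "body": "..."}. Strictly speaking it is an object rather than a string: headers is a map of string key-value pairs and body is a byte array
c An Event always has exactly two parts: headers and body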

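Agent: the basic building block of a Flume flow. It always consists of 3 parts
a Source: pulls data in from the data source - collecting
b Channel: stores data temporarily - aggregating
c Sink: sends data on to its destination - moving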
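
The three parts above can be sketched as one agent properties file. This is a minimal illustration, not the author's configuration: the agent name a1 and the component names s1, c1, k1 are arbitrary labels (s1 matches the interceptor examples later in this article), and the netcat source, memory channel, and logger sink are chosen only because they need no external services.

```properties
# Name the components of agent a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port
a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: log each event
a1.sinks.k1.type = logger

# Wiring: a source can feed several channels, a sink drains exactly one
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
```

Assuming the file is saved as conf/example.conf (a hypothetical name), an agent like this is typically started with `flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console`.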

1.3 Flow Models

1.3.1 Single-Agent Flow

[Figure: a single Agent made up of Source, Channel, and Sink]

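1.3.2 Multi-hop Flow

[Figure: multi-hop flow, several Agents chained one after another]

1.3.3 Fan-in Flow

[Figure: fan-in flow, several Agents sending to one Agent]

1.3.4 Fan-out Flow

[Figure: fan-out flow, one Agent sending to several destinations]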

1.3.5 Complex Flow

Combining the flow models above as requirements dictate produces what is called a complex flow

1.4 Transactions

[Figure: Flume transaction diagram (image not available)]

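2. Basic Components

2.1 Source

Avro Source: receives data that has been serialized with Avro. Paired with the Avro Sink, it makes multi-hop, fan-in, and fan-out flows possible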
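
As a rough sketch of one hop in a multi-hop flow, the fragments below pair an Avro Sink on one agent with an Avro Source on the next. The host name host2, port 8070, and all agent and component names are placeholders, and the two fragments belong to two separate configuration files, one per host.

```properties
# Agent a1 on the first host: forward events to the next hop
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = host2
a1.sinks.k1.port = 8070
a1.sinks.k1.channel = c1

# Agent a2 on host2: receive what a1 sends
a2.sources.s1.type = avro
a2.sources.s1.bind = 0.0.0.0
a2.sources.s1.port = 8070
a2.sources.s1.channels = c1
```

Fan-in follows the same pattern with several upstream sinks pointing at one Avro Source, and fan-out with one agent running several Avro Sinks.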

Exec Source: runs the given command and collects its output as log data
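
A minimal sketch of an Exec Source that follows a log file. The path /var/log/app.log is a made-up example, and the names a1, s1, c1 are placeholders.

```properties
a1.sources.s1.type = exec
# Hypothetical log path; tail -F keeps following the file across rotation
a1.sources.s1.command = tail -F /var/log/app.log
a1.sources.s1.channels = c1
```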

Spooling Directory Source: watches the given directory. When a new file appears there, its content is collected automatically, line by line
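
A minimal sketch of a Spooling Directory Source. The directory /var/log/flume-spool is a hypothetical path, and a1, s1, c1 are placeholder names.

```properties
a1.sources.s1.type = spooldir
# Hypothetical directory to watch for new files
a1.sources.s1.spoolDir = /var/log/flume-spool
a1.sources.s1.channels = c1
```

Files dropped into the directory should not be modified afterwards. By default a fully ingested file is renamed with a .COMPLETED suffix.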

Netcat Source: listens on a TCP port and collects the content of incoming TCP requests as log data

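Sequence Generator Source: counts up from 0 until it reaches the configured total. If no total is configured, it counts up to Long.MAX_VALUE. This Source is used to test whether a flow has been set up correctly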
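
A minimal sketch of a bounded Sequence Generator Source for a smoke test. The limit of 100 is an arbitrary choice, and a1, s1, c1 are placeholder names.

```properties
a1.sources.s1.type = seq
# Stop after 100 events instead of counting to Long.MAX_VALUE
a1.sources.s1.totalEvents = 100
a1.sources.s1.channels = c1
```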

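HTTP Source: listens for HTTP requests and collects the request content as log data. This Source accepts only POST and GET requests, and GET is meant for experimentation only, so in practice it is used just for POST requests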
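
A minimal sketch of an HTTP Source. The port and the names a1, s1, c1 are placeholders, and the sample payload in the comment is illustrative.

```properties
a1.sources.s1.type = http
a1.sources.s1.port = 8090
a1.sources.s1.channels = c1
# With the default JSON handler, the POST body is a JSON array of events, e.g.
# [{"headers": {"kind": "test"}, "body": "hello flume"}]
```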

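Extension: custom Source
a Write a class that implements one of the sub-interfaces of Source: PollableSource or EventDrivenSource
b Package the code as a jar
c Put the jar in the lib directory of the Flume installation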
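
Once the jar is in place, a custom Source is referenced by its fully qualified class name. In this sketch com.example.flume.MySource is a hypothetical class, and a1, s1, c1 are placeholder names.

```properties
# com.example.flume.MySource is a hypothetical class name
a1.sources.s1.type = com.example.flume.MySource
a1.sources.s1.channels = c1
```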

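2.2 Channel

Memory Channel: keeps events in memory. Reads and writes are relatively fast, but it is not reliable

File Channel: stores events temporarily on disk. Reads and writes are slower, but it is reliable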
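
The trade-off between the two channels can be sketched side by side. The capacities and the directories under /var/flume are illustrative values, and a1, c1, c2 are placeholder names.

```properties
# Memory channel: fast, but buffered events are lost if the agent process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# File channel: slower, but events survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```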

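JDBC Channel: stores events temporarily in a database.

Spillable Memory Channel: a memory channel that spills over when memory fills up. It is currently experimental and not recommended for production use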

2.3 Sink

HDFS Sink: writes data to HDFS
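
A minimal sketch of an HDFS Sink. The NameNode address, the target directory, and the 60-second roll interval are assumptions chosen for illustration, and a1, k1, c1 are placeholder names.

```properties
a1.sinks.k1.type = hdfs
# Hypothetical NameNode address and target directory
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
# Write plain event bodies rather than the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Roll to a new file every 60 seconds; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.channel = c1
```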

Logger Sink: prints the data Flume collects to the console

File Roll Sink: writes data to the local file system
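
A minimal sketch of a File Roll Sink. The output directory is a hypothetical path, and a1, k1, c1 are placeholder names.

```properties
a1.sinks.k1.type = file_roll
# Hypothetical output directory
a1.sinks.k1.sink.directory = /var/log/flume-out
# Start a new file every 60 seconds (the default is 30; 0 disables rolling)
a1.sinks.k1.sink.rollInterval = 60
a1.sinks.k1.channel = c1
```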

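Null Sink: discards all data it receives from the Channel

AVRO Sink: serializes data with Avro and writes it to the next node

Extension: custom Sink
a. When writing a custom Sink, pay attention to transaction handling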

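3. Other Components

3.1 Selector

A Selector is a sub-component of a Source, that is, it is configured on the Source

A Selector offers 2 modes
a. replicating: replication mode. After the receiving node gets the data, it copies the data to every fan-out node, so every fan-out node receives the same data
b. multiplexing: routing mode. After the receiving node gets the data, it distributes the data according to a designated field in headers, so the fan-out nodes receive different data
c. If no mode is specified, the Selector uses replicating mode by default
d. In production, use multiplexing when data needs to be classified; use replicating when data needs to be backed up or handed to different clusters for separate processing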
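
Multiplexing mode can be sketched as follows. The header name kind and the value test match the Static Interceptor example later in this article; the value prod and the channel names c1 and c2 are made up for illustration.

```properties
a1.sources.s1.channels = c1 c2
a1.sources.s1.selector.type = multiplexing
# Route on the value of the "kind" header
a1.sources.s1.selector.header = kind
a1.sources.s1.selector.mapping.test = c1
a1.sources.s1.selector.mapping.prod = c2
# Events whose header matches no mapping go here
a1.sources.s1.selector.default = c1
```

Replicating mode needs no selector lines at all: listing several channels on the source is enough, since replicating is the default.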

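3.2 Processor

A Processor is essentially a Sink Group

A Sink Group binds one or more Sinks into a group so that they serve the same function

A Processor offers 3 modes
a. default: the default mode, used by Flume when nothing else is specified. In this mode every Sink is its own Sink Group
b. failover: failover mode. Several Sinks are bound into one group, and priorities decide which node receives data first. As long as a higher-priority node is alive, no data is sent to lower-priority nodes
c. load balance: load-balancing mode. Two strategies are supported: round_robin and random. In the author's experience, however, the load balancing Flume provides has no noticeable effect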
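
A minimal sketch of a Sink Group in failover mode, with the load-balancing variant shown as comments. The group name g1, the sink names k1 and k2, and the priority values are placeholders; a higher number means higher priority.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover: k1 is used while it is alive, k2 only takes over when k1 fails
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5

# Load-balancing alternative (use instead of the three failover lines above):
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
```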

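3.3 Interceptor

An Interceptor is a sub-component of a Source, that is, it is also configured on the Source

Unlike a Selector, several Interceptors can be configured on one Source, forming an interceptor chain

3.3.1 Common Interceptors

1. Timestamp Interceptor:

Adds a timestamp field to headers to record when the data was collected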

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = timestamp

2. Host Interceptor:

Adds a host field to headers to record which server the data was collected from

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1 i2
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host

3. Static Interceptor:

Adds a fixed field to headers. In practice it is used to tag data with a category

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1 i2 i3
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = kind
a1.sources.s1.interceptors.i3.value = test

4. UUID Interceptor:

Adds an id field to headers to identify each event uniquely

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1 i2 i3 i4
a1.sources.s1.interceptors.i1.type = timestamp
a1.sources.s1.interceptors.i2.type = host
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = kind
a1.sources.s1.interceptors.i3.value = test
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

5. Search And Replace Interceptor:

Requires a regular expression. Any part of the data that matches the regular expression is replaced with the given string

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = search_replace
a1.sources.s1.interceptors.i1.searchPattern = [0-9]
a1.sources.s1.interceptors.i1.replaceString = *

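6. Regex Filtering Interceptor:

Requires a regular expression and a value for the excludeEvents property. If excludeEvents is true, events that match the regular expression are dropped; if excludeEvents is false, events that do not match the regular expression are dropped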

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = .*[0-9].*
a1.sources.s1.interceptors.i1.excludeEvents = false
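
For contrast, here is a sketch of the opposite setting, reusing the same regular expression and the same placeholder names.

```properties
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = .*[0-9].*
# true: drop events whose body contains a digit, keep the rest
a1.sources.s1.interceptors.i1.excludeEvents = true
```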

4. Flume Download

Baidu Cloud link: https://pan.baidu.com/s/1nraU4KWkYVSGOZlv2dxatQ (extraction code: 4q3u)
(If the link has expired, leave a comment and it will be updated)

• Written by ChiKong_Tam on January 25, 2021
