Learning Flume

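I. Introduction

1. Flume was originally developed by Cloudera and later donated to Apache. It is a distributed, reliable mechanism for collecting, aggregating, and moving log data
2. In big-data work, more than 70% of the data handled in practice comes from logs, which makes logs the foundation of big data
3. Flume provides a simple and flexible streaming model for moving log data
4. Versions
a. Flume 0.x, also known as Flume-og: depends on ZooKeeper and is relatively complex to configure. It is no longer in use
b. Flume 1.x, also known as Flume-ng: does not depend on ZooKeeper and is relatively simple to configure. This is the version in common use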

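II. Basic concepts
1. Event
a. Flume wraps every log record it collects in an Event object, so one Event corresponds to one log record
b. An Event always has two parts: headers and body. Written as JSON (the form the HTTP Source accepts in the examples later on), an Event looks like {"headers":{},"body":""}
2. Agent: the basic building block of a Flume data flow. It always contains three parts:
a. Source: collects data from the data source (collecting)
b. Channel: buffers data temporarily (aggregating)
c. Sink: writes data to the destination (moving)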
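
To make the Event structure concrete, here is a minimal Java sketch. `EventSketch` and `toJson` are hypothetical names used for illustration, not Flume's own `Event` class. The sketch only assumes that an Event is a header map plus a byte body, and it does no JSON escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Flume Event: headers plus a body of raw bytes.
class EventSketch {
    final Map<String, String> headers = new LinkedHashMap<>();
    final byte[] body;

    EventSketch(String line) {
        this.body = line.getBytes(StandardCharsets.UTF_8);
    }

    // Renders the Event in the {"headers":{...},"body":"..."} shape (no JSON escaping).
    String toJson() {
        StringBuilder sb = new StringBuilder("{\"headers\":{");
        boolean first = true;
        for (Map.Entry<String, String> h : headers.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(h.getKey()).append("\":\"").append(h.getValue()).append('"');
            first = false;
        }
        sb.append("},\"body\":\"").append(new String(body, StandardCharsets.UTF_8)).append("\"}");
        return sb.toString();
    }

    public static void main(String[] args) {
        EventSketch e = new EventSketch("hello flume");
        e.headers.put("kind", "log");
        System.out.println(e.toJson());
    }
}
```

The headers carry routing and metadata (the Selector and Interceptor sections rely on them), while the body carries the log line itself.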

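III. Flow models / topologies
1. Single-hop flow: one Agent moves data from the source to the destination
2. Multi-hop flow: Agents are chained, and the Sink of one Agent feeds the Source of the next
3. Fan-in flow: several Agents send data to one downstream Agent
4. Fan-out flow: one Agent sends data to several downstream Agents
5. Complex flow: in practice, the models above are combined as needed to form more complex topologies

IV. Flume execution flow

1. The Source collects data and hands it to the ChannelProcessor
2. The ChannelProcessor passes the data to the Interceptors. Flume allows several Interceptors to be chained together
3. After the Interceptors, the data goes to the Selector, which has two modes: replicating and multiplexing. According to its mode, the Selector decides which Channel receives the data
4. From the Channel the data goes to the SinkProcessor. A SinkProcessor works on a Sinkgroup and comes in three kinds: Default, Failover, and Load Balancing. According to its kind, it decides which Sink handles the data
5. The Sink writes the data to the configured destination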

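Installing Flume:

I. Installation
1. The virtual machine or cloud host must have JDK 1.8 installed; installing Hadoop as well is recommended
2. Go to /home/software and download the Flume binary package (apache-flume-1.9.0-bin.tar.gz) there
cd /home/software
3. Unpack it
tar -xvf apache-flume-1.9.0-bin.tar.gz
4. Make Flume compatible with Hadoop (skip this step if Hadoop is not installed)
cd /home/software/apache-flume-1.9.0-bin/lib
rm -rf guava-11.0.2.jar
5. Create a directory for the Flume configuration files
cd ..
mkdir data
cd data
6. Edit a configuration file
vim basic.conf
7. Add the following content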

# The Agent is named a1
# Name the Source
a1.sources = s1
# Name the Channel
a1.channels = c1
# Name the Sink
a1.sinks = k1

# Configure the Source
a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Configure the Channel
a1.channels.c1.type = memory

# Configure the Sink
a1.sinks.k1.type = logger

# Bind the Source to the Channel
a1.sources.s1.channels = c1
# Bind the Sink to the Channel
a1.sinks.k1.channel = c1

8. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f basic.conf -Dflume.root.logger=INFO,console
9. Test:
Open another terminal and connect: nc hadoop01 8090
Each line typed into nc is printed as an Event in the console of the machine running Flume

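II. Command-line options

Option	Description
-n, --name	Name of the Agent to run
-c, --conf	Directory that holds Flume's own configuration files
-f, --conf-file	Configuration file to run
-Dflume.root.logger	Log level and output target for Flume's own log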

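Source components

AVRO Source

I. Overview
1. AVRO Source listens on a given port and receives Avro-serialized data sent by other nodes
2. Combined with AVRO Sink, AVRO Source makes further flow models possible, including multi-hop, fan-in, and fan-out flows
II. Configuration properties

Property	Description
type	Must be avro
bind	Host name or IP address to listen on
port	Port to listen on

III. Example
1. Edit a configuration file and add the following content
vim avrosource.conf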

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the AVRO Source
# Must be avro
a1.sources.s1.type = avro
# Host to listen on
a1.sources.s1.bind = hadoop01
# Port to listen on
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f avrosource.conf -Dflume.root.logger=INFO,console
3. In another terminal, go to the data directory and create a file
cd /home/software/apache-flume-1.9.0-bin/data
vim a.txt
with the content
hello world
hello flume
4. Run the Avro client
../bin/flume-ng avro-client -H hadoop01 -p 8090 -F a.txt

Flume receives the Avro data and logs each line of a.txt as an Event

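Exec Source

I. Overview
1. Exec Source runs a given command and collects the command's output as log data
2. This Source can be used to watch a file, or the output of any other command, in real time
II. Configuration properties

Property	Description
type	Must be exec
command	Command to run and monitor
shell	Shell used to run the command; setting it is recommended

III. Example
1. Goal: watch /home/a.txt for changes in real time
2. Edit a configuration file and add the following content
vim execsource.conf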

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
# Must be exec
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

3. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f execsource.conf -Dflume.root.logger=INFO,console
4. Test:
Open another terminal and append to the watched file
cd /home
touch a.txt
echo "hello" >> a.txt
Watch the console of the running Flume agent

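Spooling Directory Source
I. Overview
1. Spooling Directory Source watches a given directory and automatically collects the content of new files that appear in it
2. By default, a file that has been collected is renamed with the suffix .COMPLETED; the suffix can be changed with the fileSuffix property

II. Configuration properties

Property	Description
type	Must be spooldir
spoolDir	Directory to watch
fileSuffix	Suffix added after a file is collected; default .COMPLETED

III. Example
1. Edit a configuration file and add the following content
vim spoolingdirsource.conf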

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Spooling Directory Source
# Must be spooldir
a1.sources.s1.type = spooldir
# Directory to watch
a1.sources.s1.spoolDir = /home/flumedata

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f spoolingdirsource.conf -Dflume.root.logger=INFO,console

Other Sources include:
Netcat Source
Sequence Generator Source
HTTP Source
Custom Source
The official documentation lists all the Sources that Flume provides
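Custom Source
I. Overview
1. Custom Source: define a class that implements one of the sub-interfaces of Source, either EventDrivenSource or PollableSource
a. EventDrivenSource: event-driven Source (the passive type). You create your own thread to fetch and process the data
b. PollableSource: polling Source (the active type). Flume provides the thread that fetches the data, so you only decide how to process it
2. Besides one of these two interfaces, the class usually also implements Configurable so that it can read its configured properties through that interface's method
II. Steps
1. Create a Maven project and add the following POM dependencies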
    <!-- Flume core -->
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-core</artifactId>
      <version>1.9.0</version>
    </dependency>
    <!-- Flume SDK -->
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-sdk</artifactId>
      <version>1.9.0</version>
    </dependency>
    <!-- Flume configuration -->
    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-configuration</artifactId>
      <version>1.9.0</version>
    </dependency>

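2. Define a class that extends AbstractSource and implements EventDrivenSource and Configurable
3. Override the configure, start, and stop methods
4. Package the class as a jar and put it in the lib directory of the Flume installation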

package sc.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Imitates the Sequence Generator Source
public class AuthSource extends AbstractSource implements EventDrivenSource, Configurable {

    private long end;
    private long step;
    private ExecutorService es = null;

    // Reads the configured property values
    @Override
    public void configure(Context context) {
        // Upper bound of the sequence; defaults to Long.MAX_VALUE
        end = context.getLong("end", Long.MAX_VALUE);
        // Increment between values; defaults to 1
        step = context.getLong("step", 1L);
    }

    // Starts the Source
    @Override
    public void start() {
        // Build the thread pool
        es = Executors.newFixedThreadPool(5);
        // Get the Channel processor
        ChannelProcessor cp = this.getChannelProcessor();
        // Submit the task that generates the Events
        es.submit(new Add(end, step, cp));
    }

    // Stops the Source
    @Override
    public void stop() {
        if (es != null) {
            es.shutdown();
        }
    }
}

class Add implements Runnable {

    private final long end;
    private final long step;
    private final ChannelProcessor cp;

    public Add(long end, long step, ChannelProcessor cp) {
        this.end = end;
        this.step = step;
        this.cp = cp;
    }

    @Override
    public void run() {
        for (long i = 0; i < end; i += step) {
            // In Flume, all data travels as Events
            // Build the body
            byte[] body = String.valueOf(i).getBytes(StandardCharsets.UTF_8);
            // Build the headers
            Map<String, String> headers = new HashMap<>();
            headers.put("time", String.valueOf(System.currentTimeMillis()));
            // Build the Event and hand it to the Channel processor
            Event e = EventBuilder.withBody(body, headers);
            cp.processEvent(e);
        }
    }
}

5. Write a configuration file, for example
cd /home/software/apache-flume-1.9.0-bin/data
vim authsource.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the custom Source
# type must be the fully qualified class name
a1.sources.s1.type = sc.flume.AuthSource
# Upper bound of the sequence
a1.sources.s1.end = 100
# Step of the increment
a1.sources.s1.step = 5

a1.channels.c1.type = memory
a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

6. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f authsource.conf -Dflume.root.logger=INFO,console
7. Result: the logger Sink prints Events whose bodies are 0, 5, 10 and so on up to 95

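Sink components

HDFS Sink
I. Overview
1. HDFS Sink writes the collected data to HDFS
2. It supports three file types when writing to HDFS: text, sequence file, and compressed. The default is sequence file
3. While writing to HDFS, the output file is rolled periodically. By default it rolls every 30 seconds and starts a new file, which produces a large number of small files
II. Configuration properties

Property	Description
type	Must be hdfs
hdfs.path	Path on HDFS where the data is stored
hdfs.rollInterval	Interval between file rolls, in seconds
hdfs.fileType	File type: DataStream (text), SequenceFile (sequence file), CompressedStream (compressed)

III. Example
1. Edit a configuration file (hdfssink.conf) and add the following content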

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
# Must be exec
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

# Configure the HDFS Sink
a1.sinks.k1.type = hdfs
# Path on HDFS where the data is stored
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flumedata
# File type
a1.sinks.k1.hdfs.fileType = DataStream
# Interval between file rolls, in seconds
a1.sinks.k1.hdfs.rollInterval = 3600

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f hdfssink.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal and append to the watched file
cd /home
touch a.txt
echo "hello" >> a.txt
Check the files on HDFS
Flume provides other Sink types as well; see the official documentation

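Custom Sink
I. Overview
1. Define a class that implements the Sink interface. Because it needs to read configuration properties, it should implement Configurable as well
2. Unlike a custom Source, a custom Sink has to deal with transactions
II. Transactions
1. After the Source collects data, a doPut operation places the data in the PutList (essentially a blocking queue)
2. The PutList then tries to push the data into the Channel. If that succeeds, doCommit is executed; otherwise doRollback is executed
3. Once the Channel holds data, a doTake operation moves the data into the TakeList
4. The TakeList hands the data to the Sink. If the Sink writes it out successfully, doCommit is executed; otherwise doRollback is executed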
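
The take side of this transaction can be sketched with two queues. This is a simplified model under stated assumptions, not Flume's implementation: `TakeTransactionSketch` and its methods are hypothetical names, Events are plain strings, and only one transaction is open at a time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a Channel with a take-side transaction (not Flume's classes).
class TakeTransactionSketch {
    private final Deque<String> channel = new ArrayDeque<>();
    private final Deque<String> takeList = new ArrayDeque<>();

    void put(String event) {
        channel.addLast(event);
    }

    // doTake: move one event from the Channel into the TakeList.
    String take() {
        String e = channel.pollFirst();
        if (e != null) {
            takeList.addLast(e);
        }
        return e;
    }

    // doCommit: the Sink wrote everything, so the taken events are dropped for good.
    void commit() {
        takeList.clear();
    }

    // doRollback: return the taken events to the head of the Channel in their original order.
    void rollback() {
        while (!takeList.isEmpty()) {
            channel.addFirst(takeList.pollLast());
        }
    }

    List<String> contents() {
        return new ArrayList<>(channel);
    }

    public static void main(String[] args) {
        TakeTransactionSketch c = new TakeTransactionSketch();
        c.put("a");
        c.put("b");
        c.take();
        c.take();
        c.rollback();                      // the write failed: both events go back
        System.out.println(c.contents());
        c.take();
        c.commit();                        // the write succeeded: "a" is gone
        System.out.println(c.contents());
    }
}
```

The point of the TakeList is that a failed write loses nothing: the events are still held there and can be put back before the next attempt.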
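III. Steps for a custom Sink
1. Create a Maven project and add the POM dependencies
2. Define a class that extends AbstractSink and implements Sink and Configurable, overriding the configure, start, process, and stop methods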

package sc.flume;

import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

import java.io.FileNotFoundException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Imitates the File Roll Sink: writes data to the local disk
public class AuthSink extends AbstractSink implements Sink, Configurable {

    private String path;
    private PrintStream printStream;

    @Override
    public void configure(Context context) {
        // Read the configured output directory
        path = context.getString("path");
        // Fail early if the property was not set
        if (path == null) {
            throw new IllegalArgumentException("The path property must be set");
        }
    }

    // Starts the Sink
    @Override
    public synchronized void start() {
        // Open the stream used to write data to disk
        try {
            printStream = new PrintStream(path + "/" + System.currentTimeMillis());
        } catch (FileNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    // The processing logic goes in this method
    @Override
    public Status process() throws EventDeliveryException {
        // Get the Channel bound to this Sink
        Channel c = this.getChannel();
        // Get a transaction
        Transaction t = c.getTransaction();
        // Begin the transaction
        t.begin();
        Event e;
        try {
            // Take Events until the Channel is empty
            while ((e = c.take()) != null) {
                // Write the headers
                Map<String, String> headers = e.getHeaders();
                printStream.println("headers");
                for (Map.Entry<String, String> h : headers.entrySet()) {
                    printStream.println("\t" + h.getKey() + ":" + h.getValue());
                }
                // Write the body
                byte[] body = e.getBody();
                printStream.println("body");
                printStream.println("\t" + new String(body, StandardCharsets.UTF_8));
            }
            // The loop finished normally, so the data was written: commit
            t.commit();
            return Status.READY;
        } catch (Exception ex) {
            // Something failed inside the loop: roll back
            t.rollback();
            return Status.BACKOFF;
        } finally {
            // The transaction must be closed whether or not it succeeded
            t.close();
        }
    }

    @Override
    public synchronized void stop() {
        if (printStream != null) {
            printStream.close();
        }
    }
}

3. Package the class as a jar and put it in the lib directory of the Flume installation
4. Write a configuration file (authsink.conf)

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
# Must be exec
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

# Configure the custom Sink
# type must be the fully qualified class name
a1.sinks.k1.type = sc.flume.AuthSink
# Directory where the output file is written
a1.sinks.k1.path = /home/flumedata

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

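5. Start Flume (the directory /home/flumedata must already exist, otherwise the Sink cannot open its output file)
../bin/flume-ng agent -n a1 -c ../conf -f authsink.conf -Dflume.root.logger=INFO,console
6. Test:
Open another terminal and append to the watched file
cd /home
touch a.txt
echo "hello" >> a.txt
cd /home/flumedata
vim 1651458810852
The output file is named after the Sink's start time in milliseconds (1651458810852 in this run) and contains the headers and body of each Event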

Multi-hop, fan-in, and fan-out flows

I. Multi-hop flow
1. AVRO Sink combined with AVRO Source implements multi-hop, fan-in, and fan-out flows
2. Example:
① Copy Flume to the other two hosts, hadoop02 and hadoop03
scp -r /home/software/apache-flume-1.9.0-bin root@hadoop02:/home/software/
scp -r /home/software/apache-flume-1.9.0-bin root@hadoop03:/home/software/
② Edit a configuration file on each of the three hosts with the content below:
cd /home/software/apache-flume-1.9.0-bin/data
vim duoji.conf

duoji.conf on hadoop01

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
# Must be exec
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

# Configure the next hop
# type must be avro
a1.sinks.k1.type = avro
# Host name or IP address of the next hop
a1.sinks.k1.hostname = hadoop02
# Port of the next hop
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

duoji.conf on hadoop02

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

# Configure the next hop
# type must be avro
a1.sinks.k1.type = avro
# Host name or IP address of the next hop
a1.sinks.k1.hostname = hadoop03
# Port of the next hop
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

duoji.conf on hadoop03

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

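③ Start Flume. Whichever Agent receives data must be started first,
so start hadoop03 first, then hadoop02, then hadoop01
../bin/flume-ng agent -n a1 -c ../conf -f duoji.conf -Dflume.root.logger=INFO,console
④ Test:
On hadoop01, open another terminal and append to the watched file
cd /home
touch a.txt
echo "hello" >> a.txt
Watch the console of the Flume agent on hadoop03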

II. Fan-in flow
① Edit a configuration file on each of the three hosts with the content below:
cd /home/software/apache-flume-1.9.0-bin/data
vim shanru.conf

shanru.conf on hadoop01

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

# type must be avro
a1.sinks.k1.type = avro
# Host name or IP address of the receiving Agent
a1.sinks.k1.hostname = hadoop03
# Port of the receiving Agent
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

shanru.conf on hadoop02

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

# type must be avro
a1.sinks.k1.type = avro
# Host name or IP address of the receiving Agent
a1.sinks.k1.hostname = hadoop03
# Port of the receiving Agent
a1.sinks.k1.port = 8090

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

shanru.conf on hadoop03

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

② Start Flume. Whichever Agent receives data must be started first:
start hadoop03 first, then the other two in any order
../bin/flume-ng agent -n a1 -c ../conf -f shanru.conf -Dflume.root.logger=INFO,console

III. Fan-out flow
1. One Source can feed several Channels,
but each Sink reads from exactly one Channel
2. Test
① shanchu.conf on hadoop01

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

# Configure the Exec Source
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

shanchu.conf on hadoop02 and hadoop03 (identical on both hosts)

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

② Start Flume. Whichever Agent receives data must be started first:
start hadoop02 and hadoop03 first, and hadoop01 last
../bin/flume-ng agent -n a1 -c ../conf -f shanchu.conf -Dflume.root.logger=INFO,console

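Channel components

Memory Channel
I. Overview
1. Memory Channel buffers data in an in-memory queue
2. The default queue capacity is 100, meaning the queue holds at most 100 Events at once. When the queue is full, further data is blocked. In practice this value is usually raised to 100,000-300,000, and to around 500,000 for larger data volumes
3. A Channel can receive data from the Source in batches and send data to the Sink in batches. The default batch size is 100 Events. In practice it is usually set to 100-3,000; with a Channel capacity of 500,000, a batch size of 5,000 is typical
4. Memory Channel keeps its data in memory, so it is not reliable, but reads and writes are fast. It suits scenarios that need speed and can tolerate data loss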
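
The capacity limit can be illustrated with a bounded queue from the JDK. This is only a sketch: `BoundedChannelSketch` is a hypothetical class, and unlike a real Memory Channel it rejects a put immediately instead of waiting when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical in-memory buffer with a fixed capacity, standing in for a Memory Channel.
class BoundedChannelSketch {
    private final BlockingQueue<String> queue;

    BoundedChannelSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the queue is already full (no waiting in this sketch).
    boolean tryPut(String event) {
        return queue.offer(event);
    }

    // Returns the oldest event, or null when the queue is empty.
    String take() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedChannelSketch channel = new BoundedChannelSketch(2);
        System.out.println(channel.tryPut("a"));   // accepted
        System.out.println(channel.tryPut("b"));   // accepted, queue is now full
        System.out.println(channel.tryPut("c"));   // rejected until something is taken
        channel.take();
        System.out.println(channel.tryPut("c"));   // accepted again
    }
}
```

A small capacity makes a slow Sink stall the Source quickly, which is why the capacity is usually raised well above the default for real workloads.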
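II. Configuration properties

Property	Description
type	memory
capacity	Capacity of the queue
transactionCapacity	Batch size (number of Events per batch)

III. Example
1. Create a configuration file (memorychannel.conf) with the following content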

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Source
a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

# Configure the Memory Channel
# type must be memory
a1.channels.c1.type = memory
# Capacity of the Channel
a1.channels.c1.capacity = 100000
# Batch size of the Channel
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f memorychannel.conf -Dflume.root.logger=INFO,console

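File Channel
I. Overview
1. File Channel buffers data on the local disk
2. File Channel does not lose data, but reads and writes are slow. It suits scenarios that need reliability more than speed
3. By default, File Channel stores its data in ~/.flume/file-channel/data
4. To keep File Channel from using too much disk space, by default at most 1,000,000 Events may be stored on disk
II. Configuration properties

Property	Description
type	file
dataDirs	Location on disk where data is buffered

III. Example
1. Create a configuration file (filechannel.conf) with the following content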

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Source
a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Configure the File Channel
# type must be file
a1.channels.c1.type = file
# Location on disk where data is buffered
a1.channels.c1.dataDirs = /home/flumedata

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f filechannel.conf -Dflume.root.logger=INFO,console

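Other Channels
I. JDBC Channel
1. JDBC Channel buffers data in a database. In theory its read and write speed is slightly higher than File Channel but lower than Memory Channel
2. So far, JDBC Channel supports only the Derby database. Because of Derby's characteristics (tiny, so it stores little data; single connection, so only one user can operate on it), Derby is rarely used in practice, and JDBC Channel is all but abandoned in production
II. Spillable Memory Channel
1. Spillable Memory Channel first tries to buffer data in memory. Once the in-memory queue is full, the Channel does not block; it spills the data to disk instead
2. So far this Channel is experimental and is not recommended for production use
III. Further Channels are listed in the official documentation. Custom Channels are rarely written: they are awkward to implement, and the existing Channels already cover the usual needs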

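Selector components

I. Overview
1. A Selector is a sub-component of the Source and decides which Channel the data is sent to
2. A Selector offers two modes:
a. replicating: every Event is copied to every Channel
b. multiplexing: routing. The value of a given field in the headers decides which Channel receives the Event
3. If no mode is specified, replicating is used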
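
The two modes can be sketched in a few lines of Java. `SelectorSketch` and its method names are hypothetical and only illustrate the routing rule; channel names such as c1 and c2 are plain strings here, and the real Selector classes in Flume are not used.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the two Selector modes.
class SelectorSketch {

    // replicating: every Channel receives a copy of the Event.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // multiplexing: the value of one header field picks the Channel;
    // Events with a missing or unmapped value go to the default Channel.
    static String multiplexing(Map<String, String> headers, String headerName,
                               Map<String, String> mapping, String defaultChannel) {
        String value = headers.get(headerName);
        if (value == null || !mapping.containsKey(value)) {
            return defaultChannel;
        }
        return mapping.get(value);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = Map.of("music", "c1", "video", "c2");
        System.out.println(replicating(List.of("c1", "c2")));
        System.out.println(multiplexing(Map.of("kind", "music"), "kind", mapping, "c2"));
        System.out.println(multiplexing(Map.of("kind", "log"), "kind", mapping, "c2"));
    }
}
```

With a mapping of music to c1 and video to c2 and a default of c2, an Event whose kind header is log ends up in c2, the same as a video Event.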
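II. Configuration properties

Property	Description
selector.type	replicating or multiplexing
selector.header	For multiplexing: the header field to inspect
selector.mapping.*	For multiplexing: the Channel for each value of that field
selector.default	For multiplexing: the Channel for Events that match none of the mapped values

III. Example
1. Create a configuration file with the following content
vim multiplexings.conf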

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.s1.type = http
a1.sources.s1.port = 8090
# Selector type
a1.sources.s1.selector.type = multiplexing
# Header field to inspect
a1.sources.s1.selector.header = kind
# Channel for each value of the field
a1.sources.s1.selector.mapping.music = c1
a1.sources.s1.selector.mapping.video = c2
# Channel for all other values
a1.sources.s1.selector.default = c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

2. Start Flume
① hadoop02 and hadoop03 (the receivers, started first):
../bin/flume-ng agent -n a1 -c ../conf -f shanchu.conf -Dflume.root.logger=INFO,console
② hadoop01:
../bin/flume-ng agent -n a1 -c ../conf -f multiplexings.conf -Dflume.root.logger=INFO,console
3. Test
Open another terminal on hadoop01:

curl -X POST -d '[{"headers":{"kind":"music"},"body":"music server"}]' http://hadoop01:8090
curl -X POST -d '[{"headers":{"kind":"video"},"body":"video server"}]' http://hadoop01:8090
curl -X POST -d '[{"headers":{"kind":"log"},"body":"log server"}]' http://hadoop01:8090

4. Check the Flume consoles on hadoop02 and hadoop03
hadoop02 logs the music Event
hadoop03 logs the video Event and the log Event, which falls through to the default Channel c2

Processor components

Default Processor
I. Overview
1. If no Processor is specified, Flume uses the Default Processor
2. With the Default Processor, every Sink forms a Sinkgroup of its own, so there are as many Sinkgroups as Sinks
3. The Default Processor needs no configuration

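Failover Sink Processor
I. Overview
1. Failover Sink Processor binds several Sinks into one group, and each Sink in the group is given a priority
2. As long as a higher-priority Sink is alive, no data is sent to a lower-priority Sink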
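
The selection rule can be sketched as follows. `FailoverSketch.pick` is a hypothetical helper, not part of Flume; it assumes that a larger priority number wins and that the set of live Sinks is known.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of failover: use the live Sink with the highest priority.
class FailoverSketch {

    // Returns the live Sink with the largest priority value, or null if none is alive.
    static String pick(Map<String, Integer> priorities, Set<String> alive) {
        String best = null;
        for (Map.Entry<String, Integer> e : priorities.entrySet()) {
            if (!alive.contains(e.getKey())) {
                continue;
            }
            if (best == null || e.getValue() > priorities.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new LinkedHashMap<>();
        priorities.put("k1", 7);
        priorities.put("k2", 2);
        System.out.println(pick(priorities, Set.of("k1", "k2")));  // k1 while it is alive
        System.out.println(pick(priorities, Set.of("k2")));        // k2 once k1 has failed
    }
}
```

When k1 comes back after its penalty period it is picked again, because the choice is re-evaluated against the set of live Sinks each time.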
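II. Configuration properties

Property	Description
sinks	Sinks bound to the group
processor.type	Must be failover
processor.priority.<sinkName>	Priority of the Sink
processor.maxpenalty	How long to wait for a failed Sink to recover

III. Example
1. Create the configuration files with the following content
① hadoop01:
vim failoversink.conf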

a1.sources = s1
a1.channels = c1
a1.sinks = k1 k2

# Name the Sinkgroup
a1.sinkgroups = g1
# Bind the Sinks to the Sinkgroup
a1.sinkgroups.g1.sinks = k1 k2
# Sinkgroup type
a1.sinkgroups.g1.processor.type = failover
# Priority of each Sink
a1.sinkgroups.g1.processor.priority.k1 = 7
a1.sinkgroups.g1.processor.priority.k2 = 2
# How long to wait for a failed Sink to recover
a1.sinkgroups.g1.processor.maxpenalty = 10000

a1.sources.s1.type = netcat
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 8090

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03
a1.sinks.k2.port = 8090

# Both Sinks drain the same Channel, so k2 can take over the data when k1 fails
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

② hadoop02 and hadoop03:
vim shanchu.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = avro
a1.sources.s1.bind = 0.0.0.0
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
① hadoop02 and hadoop03 (the receivers, started first):
../bin/flume-ng agent -n a1 -c ../conf -f shanchu.conf -Dflume.root.logger=INFO,console
② hadoop01:
../bin/flume-ng agent -n a1 -c ../conf -f failoversink.conf -Dflume.root.logger=INFO,console

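Load Balancing Processor
I. Overview
1. Load Balancing Processor spreads the load across the Sinks in a group and is worth considering when the data volume is large
2. Flume provides two load-balancing modes: round_robin and random
3. The load-balancing Processor that ships with Flume does not work very well in practice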
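
As an illustration of the round_robin mode only, here is a minimal sketch. `RoundRobinSketch` is a hypothetical class; it ignores failed Sinks and backoff, which a real load balancer would have to handle.

```java
import java.util.List;

// Hypothetical round-robin picker: hands out Sinks in turn, wrapping around at the end.
class RoundRobinSketch {
    private final List<String> sinks;
    private int next = 0;

    RoundRobinSketch(List<String> sinks) {
        this.sinks = sinks;
    }

    String pick() {
        String sink = sinks.get(next);
        next = (next + 1) % sinks.size();
        return sink;
    }

    public static void main(String[] args) {
        RoundRobinSketch lb = new RoundRobinSketch(List.of("k1", "k2"));
        System.out.println(lb.pick());
        System.out.println(lb.pick());
        System.out.println(lb.pick());
    }
}
```

The random mode would differ only in how the index is chosen.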

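Interceptor components

Timestamp Interceptor
I. Overview
1. Timestamp Interceptor adds a timestamp field to the headers to record when the data was collected
2. Combined with HDFS Sink, Timestamp Interceptor makes it possible to store data by day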
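
The idea behind storing by day is that the timestamp header, a value in epoch milliseconds, is turned into a date that becomes part of the output directory. The sketch below shows that conversion with java.time. `DayPathSketch.dayPath` is a hypothetical helper, and the time zone is passed in explicitly as an assumption of the sketch.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives a per-day directory from a timestamp header.
class DayPathSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    // timestampHeader holds epoch milliseconds as text, as in the timestamp header.
    static String dayPath(String base, String timestampHeader, ZoneId zone) {
        long millis = Long.parseLong(timestampHeader);
        String day = DAY.format(Instant.ofEpochMilli(millis).atZone(zone));
        return base + "/date=" + day;
    }

    public static void main(String[] args) {
        System.out.println(dayPath("/flumedata", "1651458810852", ZoneOffset.UTC));
    }
}
```

All Events collected on the same day map to the same directory, so each day's data ends up together.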
II. Configuration properties

Property	Description
type	timestamp

III. Example
1. Create a configuration file with the following content
vim timestamp.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Name the Interceptor
a1.sources.s1.interceptors = i1
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f timestamp.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal: nc hadoop01 8090
IV. Storing data by day
1. Create a configuration file with the following content
vim hdfsdata.conf

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Name the Interceptor
a1.sources.s1.interceptors = i1
# Use the Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flumedata/date=%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 3600

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

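2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f hdfsdata.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal: nc hadoop01 8090

Host Interceptor
I. Overview
1. Host Interceptor adds a host field to the headers
2. Host Interceptor can be used to mark which host the data came from
II. Configuration properties

Property	Description
type	Must be host

III. Example
See the combined example below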

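Static Interceptor
I. Overview
1. Static Interceptor adds a fixed, user-specified field to the headers
2. This Interceptor can be used to mark the type of the data
II. Configuration properties

Property	Description
type	Must be static
key	Name of the field to add to the headers
value	Value of the field

III. Example
See the combined example below

UUID Interceptor
I. Overview
1. UUID Interceptor adds an id field to the headers
2. It can be used to identify each Event uniquely
II. Configuration properties

Property	Description
type	Must be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

III. Combined example
1. Create a configuration file with the following content
vim zonghe.conf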

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090

# Name the Interceptors
a1.sources.s1.interceptors = i1 i2 i3 i4
# Timestamp Interceptor
a1.sources.s1.interceptors.i1.type = timestamp
# Host Interceptor
a1.sources.s1.interceptors.i2.type = host
# Static Interceptor
a1.sources.s1.interceptors.i3.type = static
a1.sources.s1.interceptors.i3.key = kind
a1.sources.s1.interceptors.i3.value = log
# UUID Interceptor
a1.sources.s1.interceptors.i4.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f zonghe.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal: nc hadoop01 8090

Search And Replace Interceptor
I. Overview
1. Search And Replace Interceptor takes a regular expression and replaces every part of the data that matches it with a given string
2. The replacement applies to the body only; the headers are left untouched
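
The effect on the body can be reproduced with java.util.regex. This sketch is not the Interceptor's source code: `SearchReplaceSketch.apply` is a hypothetical helper, and the parameter names simply mirror the searchPattern and replaceString properties.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper: replaces every regex match in an Event body.
class SearchReplaceSketch {

    static byte[] apply(byte[] body, String searchPattern, String replaceString) {
        String text = new String(body, StandardCharsets.UTF_8);
        String replaced = Pattern.compile(searchPattern).matcher(text).replaceAll(replaceString);
        return replaced.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] body = "test1234".getBytes(StandardCharsets.UTF_8);
        byte[] masked = apply(body, "[0-9]", "*");
        System.out.println(new String(masked, StandardCharsets.UTF_8));
    }
}
```

With the pattern [0-9] and the replacement *, each digit in the body is masked separately, so test1234 becomes test****.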

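II. Configuration properties

Property	Description
type	Must be search_replace
searchPattern	Regular expression to match
replaceString	Replacement string

III. Example
1. Create a configuration file with the following content
vim replace.conf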

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = http
a1.sources.s1.port = 8090
# Name the Interceptor
a1.sources.s1.interceptors = i1
# Interceptor type
a1.sources.s1.interceptors.i1.type = search_replace
a1.sources.s1.interceptors.i1.searchPattern = [0-9]
a1.sources.s1.interceptors.i1.replaceString = *

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f replace.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal:

curl -X POST -d '[{"headers":{"data":"2022-05-02"},"body":"test1234"}]' http://hadoop01:8090

Regex Filtering Interceptor
I. Overview
1. Regex Filtering Interceptor takes a regular expression
2. The excludeEvents property defaults to false
3. If excludeEvents is not set or is false, only data that matches the regular expression is kept and everything else is filtered out. If excludeEvents is true, data that matches the regular expression is filtered out and everything else is kept
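
The two settings of excludeEvents can be summarized in a short sketch. `RegexFilterSketch.filter` is a hypothetical helper that works on plain strings rather than Events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps or drops bodies depending on a regex and excludeEvents.
class RegexFilterSketch {

    static List<String> filter(List<String> bodies, String regex, boolean excludeEvents) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String body : bodies) {
            boolean matches = pattern.matcher(body).find();
            // excludeEvents = false keeps the matches; excludeEvents = true keeps the rest.
            if (matches != excludeEvents) {
                kept.add(body);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> bodies = List.of("hadoop", "zookeeper", "hadoop01", "zookeeper02");
        System.out.println(filter(bodies, ".*[0-9].*", false));
        System.out.println(filter(bodies, ".*[0-9].*", true));
    }
}
```

With excludeEvents set to false, only the lines that contain a digit survive; setting it to true inverts the result.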
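II. Configuration properties

Property	Description
type	Must be regex_filter
regex	Regular expression
excludeEvents	true or false

III. Example
1. Create a configuration file with the following content
vim regexfilter.conf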

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090
# Name the Interceptor
a1.sources.s1.interceptors = i1
# Interceptor type
a1.sources.s1.interceptors.i1.type = regex_filter
# Match every string that contains a digit
a1.sources.s1.interceptors.i1.regex = .*[0-9].*
a1.sources.s1.interceptors.i1.excludeEvents = false

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

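2. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f regexfilter.conf -Dflume.root.logger=INFO,console
3. Test:
Open another terminal:
nc hadoop01 8090
After typing hadoop, zookeeper, hadoop01, and zookeeper02, the lines hadoop and zookeeper are filtered out and only hadoop01 and zookeeper02 are logged

Custom Interceptor
I. Overview
1. Flume also allows custom Interceptors. Unlike the other components, a custom Interceptor additionally has to implement an inner interface
2. Steps:
a. Create a Maven project and add the dependencies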

  <!-- Flume core -->
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
  </dependency>
  <!-- Flume SDK -->
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-sdk</artifactId>
    <version>1.9.0</version>
  </dependency>
  <!-- Flume configuration -->
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-configuration</artifactId>
    <version>1.9.0</version>
  </dependency>

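b. Define a class that implements the Interceptor interface and overrides its initialize, intercept, and close methods
c. Define a static nested class that implements the inner interface Interceptor.Builder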

package sc.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Imitates the Timestamp Interceptor
public class AuthInterceptor implements Interceptor {

    // Initialization method; nothing to set up here
    @Override
    public void initialize() {
    }

    // Intercepts a single Event; the per-Event processing goes in this method
    @Override
    public Event intercept(Event event) {
        // The timestamp lives in the headers, so fetch them first
        Map<String, String> headers = event.getHeaders();
        // If the headers already carry a timestamp, leave the Event unchanged
        if (headers.containsKey("time") || headers.containsKey("timestamp")) {
            return event;
        }
        // Otherwise add a timestamp
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        event.setHeaders(headers);
        return event;
    }

    // Intercepts a batch of Events
    @Override
    public List<Event> intercept(List<Event> events) {
        // Holds the processed Events
        List<Event> es = new ArrayList<>();
        for (Event event : events) {
            // Process each Event in turn and add the result to the list
            es.add(intercept(event));
        }
        return es;
    }

    @Override
    public void close() {
    }

    // Implements the inner Builder interface
    public static class Builder implements Interceptor.Builder {

        // Creates the Interceptor instance to use
        @Override
        public Interceptor build() {
            return new AuthInterceptor();
        }

        // Reads configuration properties; this Interceptor has none
        @Override
        public void configure(Context context) {
        }
    }
}

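d. Package the class as a jar and put it in the lib directory of the Flume installation
e. Write a configuration file with the following content
vim authinterceptor.conf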

a1.sources = s1
a1.channels = c1
a1.sinks = k1

a1.sources.s1.type = netcat
a1.sources.s1.bind = hadoop01
a1.sources.s1.port = 8090
# Configure the Interceptor
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = sc.flume.AuthInterceptor$Builder

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

f. Start Flume
../bin/flume-ng agent -n a1 -c ../conf -f authinterceptor.conf -Dflume.root.logger=INFO,console
g. Test:
Open another terminal:
nc hadoop01 8090
Each line typed into nc is logged with a timestamp field in its headers