Flume is the easiest to use


1. Introduction

1. Definition

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera.

Flume is based on a streaming architecture and is flexible and simple.

  1. Flume official website address: http://flume.apache.org
  2. Document viewing address: http://flume.apache.org/FlumeUserGuide.html
  3. Download address: http://archive.apache.org/dist/flume

Insert image description here

2. Infrastructure

The composition structure of Flume is shown in the figure below:

Insert image description here

  • Agent

Agent: Flume's deployment unit, which is essentially a JVM process. Internally, the Agent moves data from the source to the destination in the form of events.

Composition: Agent mainly consists of three parts, Source, Channel, and Sink.

  • Source

Source: the component responsible for receiving data into the Flume Agent.
Features: the Source component can handle log data of many types and formats.
Source component types:

  1. avro: essentially an RPC framework that supports cross-language, cross-platform data transmission. The Avro Source is mostly used to connect Agents in Flume.
  2. netcat: essentially a Linux port tool; the netcat Source is used in Flume to collect data sent to a port.
  3. exec: executes a command and uses its standard output as the collected data. Mostly used to collect a single file that is appended to.
  4. spooling directory: monitors a directory and collects the new files generated in it.
  5. taildir: monitors multiple directories, collects one or more appendable files in them, and can resume from where it left off after a restart.
  6. In addition, there are: thrift, jms, sequence generator, syslog, http, custom Source.
  • Sink

Sink: the component of the Flume Agent responsible for sending data to external systems.
Features: the Sink component continuously polls the Channel for events, removes them in batches, and writes them in batched transactions to a storage or indexing system, or sends them to another Flume Agent.
Sink component types:

  1. logger: the logger Sink writes data to the Flume framework's own log. With the runtime parameter -Dflume.root.logger=INFO,console, the Flume log (which contains the collected data) is printed to the console. Mostly used in test environments.
  2. hdfs: The hdfs Sink component is responsible for transferring data to the HDFS distributed file system.
  3. avro: The avro Sink component cooperates with the avro Source component to realize Agent connection.
  4. file: The file Sink component directly outputs the collected data to the local file system, that is, the Linux disk.
  5. In addition, there are: thrift, ipc, HBase, solr, and custom Sink.
  • Channel

**Channel:** is responsible for temporarily storing data and is a buffer between the Source and Sink components.
Features:

  1. Due to the existence of the Channel component, the Source and Sink components can operate at different rates.
  2. Channel is thread-safe and can handle write operations from several Sources and read operations from several Sinks at the same time.

Flume comes with two channels:

  1. Memory Channel: an in-memory queue of events, suitable for scenarios that do not demand high data safety. Fast, but events are lost if the agent crashes.
  2. File Channel: stores events on disk, so data is not lost when the agent goes down; suitable for scenarios that are sensitive to data safety. Slower, but durable.
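
For reference, here is a minimal File Channel configuration sketch; the checkpointDir/dataDirs paths are illustrative placeholders, not part of this article's setup:

# File Channel example (paths are placeholders -- adjust to your environment)
a1.channels.c1.type = file
# where the channel keeps its checkpoint
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/datas/filechannel/checkpoint
# comma-separated list of directories for the event data files
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/datas/filechannel/data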
  • Event

Event: the basic unit of Flume data transmission; data is sent from the source to the destination in the form of Events.
Features: Event consists of two parts: Header and Body.

  1. Header: stores attributes of the event as a key-value (KV) structure.
  2. Body: stores the data itself, in the form of a byte array.
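
To make the Header/Body structure concrete, here is a small Java sketch built on flume-ng-core's EventBuilder (the same dependency used by the custom interceptor later in this article); it constructs an event and reads both parts back:

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class EventDemo {
    public static void main(String[] args) {
        // Header: a key-value map of event attributes
        Map<String, String> headers = new HashMap<>();
        headers.put("type", "letter");

        // Body: the payload itself, stored as a byte array
        byte[] body = "hello flume".getBytes(StandardCharsets.UTF_8);

        Event event = EventBuilder.withBody(body, headers);

        System.out.println("headers = " + event.getHeaders());
        System.out.println("body    = " + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}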

2. Quick Start

1. Download and unzip Flume

wget https://gitcode.net/weixin_44624117/software/-/raw/master/software/Linux/Flume/apache-flume-1.9.0-bin.tar.gz

Unzip the archive

tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/

Rename the directory

mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume-1.9.0

Delete guava-11.0.2.jar from the lib folder for compatibility with Hadoop 3.1.3

rm /opt/module/flume-1.9.0/lib/guava-11.0.2.jar

2. Case 1: Monitoring a port

Use Flume to listen to a port, collect the port data, and print it to the console.

Insert image description here

Install the netcat tool (used to send data to a TCP port)

sudo yum install -y nc

Check whether port 44444 is already in use

sudo netstat -nlp | grep 44444

In the Flume directory, create a job directory

cd /opt/module/flume-1.9.0
mkdir -p job/simpleCase
cd /opt/module/flume-1.9.0/job/simpleCase

Add configuration file

Note: The configuration file comes from the official manual http://flume.apache.org/FlumeUserGuide.html

vim flume-1-netcat-logger.conf
#Name the components on this agent
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat 
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger 

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100 

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Explanation:

#Name the components on this agent
a1.sources = r1                                 # name a1's Source component r1; separate multiple components with spaces
a1.sinks = k1                                   # name a1's Sink component k1; separate multiple components with spaces
a1.channels = c1                                # name a1's Channel component c1; separate multiple components with spaces

# Describe/configure the source
a1.sources.r1.type = netcat                     # type of r1
a1.sources.r1.bind = localhost                  # bind address of r1 (note the difference between localhost and the host's real hostname)
a1.sources.r1.port = 44444                      # listening port of r1

# Describe the sink
a1.sinks.k1.type = logger                       # type of k1 is logger, i.e. print to the console

# Use a channel which buffers events in memory
a1.channels.c1.type = memory                    # type of c1 is memory
a1.channels.c1.capacity = 1000                  # capacity of c1 is 1000 events
a1.channels.c1.transactionCapacity = 100        # transaction capacity of c1 is 100 events

# Bind the source and sink to the channel
a1.sources.r1.channels = c1                     # the channel(s) r1 writes to
a1.sinks.k1.channel = c1                        # the channel k1 reads from

Run Flume listening port

#	Option 1:
bin/flume-ng agent --conf conf/ --name a1 --conf-file job/simpleCase/flume-1-netcat-logger.conf -Dflume.root.logger=INFO,console
#	Option 2:
bin/flume-ng agent -c conf/ -n a1 -f job/simpleCase/flume-1-netcat-logger.conf -Dflume.root.logger=INFO,console

Parameter Description:

  • --conf/-c: the configuration files are stored in the conf/ directory
  • --name/-n: names the agent a1
  • --conf-file/-f: the configuration file to read is flume-1-netcat-logger.conf in the job/simpleCase folder
  • -Dflume.root.logger=INFO,console: -D dynamically overrides the flume.root.logger property at runtime, setting the console log level to INFO. Log levels include: debug, info, warn, error.

**Test:** start nc on Hadoop101 and send some data

nc localhost 44444
hello
world
hello world

Insert image description here

3. Case 2: Monitoring appended files in a directory

Source selection:

  • Exec Source: suitable for monitoring a single file that is appended to in real time, but it cannot resume from where it left off;
  • Spooldir Source: suitable for synchronizing newly created files, but not for files whose logs are appended in real time;
  • Taildir Source: suitable for monitoring multiple files that are appended to in real time, and it can resume from where it left off.

Case requirements:

  • Use Flume to monitor the entire directory for real-time append files and upload them to HDFS.

Insert image description here

Create configuration file

cd /opt/module/flume-1.9.0/job/simpleCase
vim flume-2-taildir-hdfs.conf

Configuration file

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = TAILDIR
a2.sources.r1.positionFile = /opt/module/flume-1.9.0/tail_dir.json
a2.sources.r1.filegroups = f1 f2
a2.sources.r1.filegroups.f1 = /opt/module/flume-1.9.0/datas/tailCase/files/.*file.*
a2.sources.r1.filegroups.f2 = /opt/module/flume-1.9.0/datas/tailCase/logs/.*log.*

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/tailDir/%Y%m%d/%H
# prefix of the uploaded files
a2.sinks.k1.hdfs.filePrefix = tail-
# whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# the time unit for rolling
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# how many Events to accumulate before flushing to HDFS once
a2.sinks.k1.hdfs.batchSize = 100
# file type (CompressedStream supports compression, DataStream does not)
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 60
# roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# file rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Start monitoring

cd /opt/module/flume-1.9.0
bin/flume-ng agent --conf conf/ --name a2 --conf-file job/simpleCase/flume-2-taildir-hdfs.conf

Test

Create a new monitored directory

mkdir -p datas/tailCase/files
mkdir -p datas/tailCase/logs

In the /opt/module/flume-1.9.0/datas/ directory, append content to files in the tailCase/files and tailCase/logs folders and check which ones are uploaded.

Test filegroup f1: /opt/module/flume-1.9.0/datas/tailCase/files/.*file.*

#	only files matching *file* in this directory are uploaded
cd /opt/module/flume-1.9.0/datas/tailCase/files

touch file1.txt
echo I am file1 >> file1.txt
touch log1.txt
echo I am log1 >> log1.txt

Test filegroup f2: /opt/module/flume-1.9.0/datas/tailCase/logs/.*log.*

#	only files matching *log* in this directory are uploaded
cd /opt/module/flume-1.9.0/datas/tailCase/logs
touch file2.txt
echo I am file2 >> file2.txt
touch log2.txt
echo I am log2 >> log2.txt

Upload files to HDFS

Insert image description here

Resuming from the position file

Stop the Flume agent, append to the files under logs/ and files/, then start the agent again to verify that collection resumes from where it left off.

Taildir Source maintains a position file in JSON format and regularly records the latest position read in each file, which is how it can resume after a restart. The position file looks like this:

{"inode":2496272,"pos":12,"file":"/opt/module/flume/datas/tailCase/files/file1.txt"}
{"inode":2496275,"pos":12,"file":"/opt/module/flume/datas/tailCase/logs/log2.txt"}

Note: in Linux, the area where file metadata is stored is called an inode, and each inode has a number. Unix/Linux systems identify files internally by inode number rather than by file name.
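
As a small illustration (Linux only; the path comes from the example above and must exist), the inode number that Taildir Source records can be read from Java through the "unix:ino" file attribute:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class InodeDemo {
    public static void main(String[] args) throws Exception {
        // any file being tailed; here, one of the files from the case above
        Path file = Paths.get("/opt/module/flume-1.9.0/datas/tailCase/files/file1.txt");

        // on Linux, the "unix:ino" attribute exposes the file's inode number --
        // the same identifier that appears in the Taildir position file
        Object inode = Files.getAttribute(file, "unix:ino");
        System.out.println(file + " -> inode " + inode);
    }
}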

3. Flume Advanced

1. Flume transactions

Insert image description here

There are two transactions in Flume

  • Put transaction: between the Source component and the Channel component, ensuring the reliability of data transfer from the Source component to the Channel component.
  • Take transaction: Between the Channel component and the Sink component, ensure the reliability of data transmission from the channel component to the Sink component.

Put transaction process

  1. The source component collects external data into the agent and packages the data into events.
  2. The source component starts transmitting events to the Channel component.
  3. First, a transaction is started. Within the transaction, the doPut method puts a batch of data into the putList.
  4. Then the doCommit method moves all the Events in the putList into the Channel; on success, the putList is cleared.

Failure retry mechanism

  • Before the events in the putList are committed to the channel, the channel first checks whether it has enough free capacity to hold them; if not, no data is put and doRollback is called.
  • After calling the doRollback method, the doRollback method will perform two steps:
    • Clear putList.
    • Throws ChannelException.
  • After the source component catches the exception thrown by doRollback, the source re-collects the previous batch of data and then starts a new transaction.
  • The size of the data batch depends on the value of the configuration parameter batch size of the Source component.
  • The size of putList depends on the value of the configuration parameter transactionCapacity of the Channel component (the capacity parameter refers to the capacity of the Channel).
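
The put flow above can be sketched directly against Flume's Channel/Transaction API. This is a simplified sketch for illustration -- real Sources hand events to the ChannelProcessor rather than driving the Channel like this:

import org.apache.flume.Channel;
import org.apache.flume.ChannelException;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

import java.util.List;

public class PutTransactionSketch {
    // Put a batch of events into a channel inside one transaction,
    // mirroring the doPut / doCommit / doRollback flow described above.
    public static void putBatch(Channel channel, List<Event> batch) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            for (Event event : batch) {
                channel.put(event);      // doPut: stage the event in the transaction
            }
            tx.commit();                 // doCommit: hand the whole batch to the channel
        } catch (ChannelException e) {
            tx.rollback();               // doRollback: channel is full, the batch is abandoned
            throw e;                     // the source re-collects and retries the batch
        } finally {
            tx.close();
        }
    }
}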

Take transaction process

  1. The Sink component continuously polls the Channel and starts the take transaction when new events arrive.
  2. After the take transaction is started, the doTake method takes Events from the Channel and puts them into the takeList.
  3. Once batch size Events have been stored in the takeList, the doCommit method is called.
  4. In the doCommit method, the data will first be written out to the external system, and the takeList will be cleared after success.
  5. When the transaction fails, the doRollback method will be called to roll back, that is, the data in the takeList will be returned to the channel intact.
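
A corresponding take-side sketch follows; writeToExternalSystem is a hypothetical placeholder for the sink's real write logic (HDFS, another agent, etc.):

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

import java.util.ArrayList;
import java.util.List;

public class TakeTransactionSketch {
    // Take up to batchSize events from the channel and commit only after
    // the external write succeeds; on failure the events return to the channel.
    public static void drainOnce(Channel channel, int batchSize) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            List<Event> takeList = new ArrayList<>();
            for (int i = 0; i < batchSize; i++) {
                Event event = channel.take();   // doTake: null when the channel is empty
                if (event == null) {
                    break;
                }
                takeList.add(event);
            }
            writeToExternalSystem(takeList);    // hypothetical destination write
            tx.commit();                        // doCommit: events leave the channel for good
        } catch (Exception e) {
            tx.rollback();                      // doRollback: events go back to the channel
        } finally {
            tx.close();
        }
    }

    private static void writeToExternalSystem(List<Event> events) {
        // placeholder for the sink's real write logic
        System.out.println("wrote " + events.size() + " events");
    }
}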

2. Internal principles of Flume Agent

Insert image description here

ChannelSelector: selects the Channel(s) an Event will be sent to.
  • Replicating Channel Selector: copies the Event to every channel (the default).
  • Multiplexing Channel Selector: routes the Event by a header value.
SinkProcessor: provides different behavior for a group of Sinks depending on its type.
  • DefaultSinkProcessor: a single Sink (the default).
  • LoadBalancingSinkProcessor: load balancing.
  • FailoverSinkProcessor: failover.

Implementation process

  1. The Source component collects external data into the agent and packages it as Event
  2. The event is then handed to the ChannelProcessor:
    • the interceptor chain intercepts and filters it, and events that meet the requirements are returned to the ChannelProcessor;
    • the ChannelSelector then decides, according to its type, which Channel(s) the event goes to, and the result is returned to the ChannelProcessor.
  3. A Put transaction is started and batches of Events are sent to the Channel.
  4. Depending on the type of SinkProcessor configured (load balancing or failover), only one Sink in the group pulls data at any given time.
  5. The Sink component continuously polls the Channel, and when a new Event arrives in the Channel, it is written to the external system.
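
To make the ChannelSelector step concrete, here is a conceptual Java sketch of header-based routing. It is not Flume's actual MultiplexingChannelSelector, only the idea behind it:

import org.apache.flume.Channel;
import org.apache.flume.Event;

import java.util.List;
import java.util.Map;

// Conceptual sketch: route an event to channels according to one header value.
public class HeaderRoutingSketch {
    private final String headerKey;                     // e.g. "type", as configured on the selector
    private final Map<String, List<Channel>> mapping;   // header value -> target channels
    private final List<Channel> defaultChannels;        // fallback when no mapping matches

    public HeaderRoutingSketch(String headerKey,
                               Map<String, List<Channel>> mapping,
                               List<Channel> defaultChannels) {
        this.headerKey = headerKey;
        this.mapping = mapping;
        this.defaultChannels = defaultChannels;
    }

    public List<Channel> requiredChannels(Event event) {
        String value = event.getHeaders().get(headerKey);
        return mapping.getOrDefault(value, defaultChannels);
    }
}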

3. Case 1: Monitoring a log and replicating it to multiple destinations

Requirements:

  • Use Flume-1 to monitor file changes.
  • Flume-1 passes the changed content to Flume-2, and Flume-2 is responsible for storing it in HDFS.
  • At the same time, Flume-1 passes the changes to Flume-3, and Flume-3 is responsible for outputting to the Local FileSystem.

Insert image description here

Simulated log file: create realtime.log in the /opt/module/flume-1.9.0/datas/ directory.

mkdir -p /opt/module/flume-1.9.0/datas
touch /opt/module/flume-1.9.0/datas/realtime.log

Configuration files: create the enterprise/copy folder under the /opt/module/flume-1.9.0/job directory.

mkdir -p /opt/module/flume-1.9.0/job/enterprise/copy
  • Source: flume-1-exec-avro.conf
  • Sinks: flume-2-avro-hdfs.conf and flume-3-avro-file.conf

Configuration file 1: flume-1-exec-avro.conf

vim /opt/module/flume-1.9.0/job/enterprise/copy/flume-1-exec-avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# replicate the data flow to all channels (replicating is already the default)
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/datas/realtime.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# on the sink side, avro is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop101
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop101
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configuration file 2: flume-2-avro-hdfs.conf

vim /opt/module/flume-1.9.0/job/enterprise/copy/flume-2-avro-hdfs.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# on the source side, avro is a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop101
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/copy/%Y%m%d/%H
# prefix of the uploaded files
a2.sinks.k1.hdfs.filePrefix = copy-
# whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# how many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# the time unit for rolling
a2.sinks.k1.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# how many Events to accumulate before flushing to HDFS once
a2.sinks.k1.hdfs.batchSize = 100
# file type (compression can be enabled)
a2.sinks.k1.hdfs.fileType = DataStream
# how often (seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 60
# roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# file rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Configuration file 3: flume-3-avro-file.conf

vim /opt/module/flume-1.9.0/job/enterprise/copy/flume-3-avro-file.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop101
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/flume-1.9.0/datas/copy_result

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Create the local output directory (the file_roll sink does not create it automatically), then start the three agents, downstream first:

mkdir /opt/module/flume-1.9.0/datas/copy_result
cd /opt/module/flume-1.9.0
bin/flume-ng agent -c conf/ -n a3 -f /opt/module/flume-1.9.0/job/enterprise/copy/flume-3-avro-file.conf
bin/flume-ng agent -c conf/ -n a2 -f /opt/module/flume-1.9.0/job/enterprise/copy/flume-2-avro-hdfs.conf
bin/flume-ng agent -c conf/ -n a1 -f /opt/module/flume-1.9.0/job/enterprise/copy/flume-1-exec-avro.conf

Test by appending a line to the monitored log:

echo 2021-10-41 09-10-32 >> /opt/module/flume-1.9.0/datas/realtime.log

4. Case 2: Multiplexing with a custom interceptor

4.1 Principle

Requirements:

When using Flume to collect a server's port log data, different types of logs need to be sent to different analysis systems.

Principle:

  • Background: in real development, one server may produce many types of logs, and different types of logs may need to be sent to different analysis systems. This is where Flume's Multiplexing Channel Selector is used.
  • Principle of Multiplexing: different events are sent to different Channels based on the value of a certain key in the event Header.
  • Custom Interceptor: assigns different values to that Header key for different types of events.
  • Summary: in this case, port data simulates the logs, and digits and letters simulate different log types. A custom interceptor distinguishes digits from letters and sends them to different analysis systems (Channels).

Insert image description here

4.2 Code writing

Maven configuration

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.lydms</groupId>
  <artifactId>first-flume</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>first-flume</name>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>


    <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-core</artifactId>
      <version>1.9.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>6</source>
          <target>6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Custom interceptor:

package com.lydms.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;

public class CustomInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // 1. Get the data from the event
        byte[] body = event.getBody();
        // 2. Check whether the first character is a letter or a digit
        if (body[0] >= 'a' && body[0] <= 'z') {
            // letters: set the header key "type" to "letter"
            event.getHeaders().put("type", "letter");
        } else if (body[0] >= '0' && body[0] <= '9') {
            // digits: set the header key "type" to "number"
            event.getHeaders().put("type", "number");
        }
        // 3. Return the event
        return event;
    }

    // Intercept a batch of events
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    // Builder used by Flume to construct the interceptor
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new CustomInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

Package the project and copy the jar into Flume's lib directory (/opt/module/flume-1.9.0/lib).
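
One way to do this, assuming Maven and the pom above (the jar name follows from its artifactId and version; adjust paths to your own environment):

mvn clean package
cp target/first-flume-1.0-SNAPSHOT.jar /opt/module/flume-1.9.0/lib/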

4.3 Writing configuration files

Hadoop101: add the configuration file

mkdir -p /opt/module/flume-1.9.0/job/custom/multi
vim /opt/module/flume-1.9.0/job/custom/multi/flume-1-netcat-avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
#	full class name of the custom interceptor's Builder
a1.sources.r1.interceptors.i1.type = com.lydms.flume.interceptor.CustomInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Hadoop102: add the configuration file

mkdir -p /opt/module/flume-1.9.0/job/custom/multi
vim /opt/module/flume-1.9.0/job/custom/multi/flume-2-avro-logger.conf
# agent
a2.sources=r1
a2.sinks = k1
a2.channels = c1

# source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# sink
a2.sinks.k1.type = logger

# Channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# bind
a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1

Hadoop103: add the configuration file

mkdir -p /opt/module/flume-1.9.0/job/custom/multi
vim /opt/module/flume-1.9.0/job/custom/multi/flume-3-avro-logger.conf
# agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop103
a3.sources.r1.port = 4242

# sink
a3.sinks.k1.type = logger

# Channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# bind
a3.sinks.k1.channel = c1
a3.sources.r1.channels = c1
4.4 Testing

Start the agents (downstream first)

cd /opt/module/flume-1.9.0
bin/flume-ng agent -c conf/ -n a3 -f /opt/module/flume-1.9.0/job/custom/multi/flume-3-avro-logger.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a2 -f /opt/module/flume-1.9.0/job/custom/multi/flume-2-avro-logger.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a1 -f /opt/module/flume-1.9.0/job/custom/multi/flume-1-netcat-avro.conf -Dflume.root.logger=INFO,console

Test:

nc localhost 44444
hello
world
1231231
41341

Insert image description here

5. Case 3: Aggregation

Requirements:

  • Hadoop101: flume-1 monitors the file /opt/module/flume-1.9.0/datas/realtime.log.
  • Hadoop102: flume-2 monitors the data sent to a port.
  • Hadoop103: flume-3 receives the data from both flume-1 and flume-2 and prints the final data to the console.

Insert image description here

Hadoop101: configuration file flume-1-exec-avro.conf

mkdir /opt/module/flume-1.9.0/job/enterprise/juhe
vim /opt/module/flume-1.9.0/job/enterprise/juhe/flume-1-exec-avro.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/datas/realtime.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Hadoop102: configuration file flume-2-netcat-avro.conf

mkdir -p /opt/module/flume-1.9.0/job/enterprise/juhe
vim /opt/module/flume-1.9.0/job/enterprise/juhe/flume-2-netcat-avro.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop103
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Hadoop103: configuration file flume-3-avro-logger.conf

mkdir -p /opt/module/flume-1.9.0/job/enterprise/juhe
vim /opt/module/flume-1.9.0/job/enterprise/juhe/flume-3-avro-logger.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop103
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

Test:

#	Hadoop103
/opt/module/flume-1.9.0/bin/flume-ng agent -c conf/ -n a3 -f /opt/module/flume-1.9.0/job/enterprise/juhe/flume-3-avro-logger.conf -Dflume.root.logger=INFO,console
#	Hadoop102
/opt/module/flume-1.9.0/bin/flume-ng agent -c conf/ -n a2 -f /opt/module/flume-1.9.0/job/enterprise/juhe/flume-2-netcat-avro.conf
#	Hadoop101
/opt/module/flume-1.9.0/bin/flume-ng agent -c conf/ -n a1 -f /opt/module/flume-1.9.0/job/enterprise/juhe/flume-1-exec-avro.conf

Hadoop101: append content to realtime.log in the /opt/module/flume-1.9.0/datas/ directory

echo 'Hello World Hadoop101' >> /opt/module/flume-1.9.0/datas/realtime.log

Hadoop102: send data to port 44444

nc hadoop102 44444
hello world

Hadoop103: View data

Insert image description here

4. Flume data flow monitoring

1. Introduction to Ganglia

Ganglia consists of three parts: gmond, gmetad and gweb.

  • gmond (Ganglia Monitoring Daemon):
    It is a lightweight service installed on each node host that needs to collect indicator data.
    Using gmond, you can easily collect many system indicator data, such as CPU, memory, disk, network and active process data.
  • gmetad (Ganglia Meta Daemon):
    A service that integrates all information and stores it to disk in RRD format.
  • gweb (Ganglia Web) Ganglia visualization tool:
    gweb is a PHP front-end that uses a browser to display data stored by gmetad.
    A variety of different indicator data collected under the running status of the cluster are displayed graphically in the web interface.

2. Deployment planning

Node        gweb    gmetad    gmond
Hadoop101   true    true      true
Hadoop102                     true
Hadoop103                     true

Installation steps

#	Hadoop101
sudo yum -y install epel-release
sudo yum -y install ganglia-gmetad
sudo yum -y install ganglia-web
sudo yum -y install ganglia-gmond

#	Hadoop102
sudo yum -y install epel-release
sudo yum -y install ganglia-gmond

#	Hadoop103
sudo yum -y install epel-release
sudo yum -y install ganglia-gmond

3. Modify the configuration files on Hadoop101

Modify configuration: Hadoop101

  • Modify the configuration file /etc/httpd/conf.d/ganglia.conf
sudo vim /etc/httpd/conf.d/ganglia.conf
#	modification (two options; choose one)
Require ip 192.168.1.1
#	Require all granted

Insert image description here

Modify the configuration file /etc/ganglia/gmetad.conf

sudo vim /etc/ganglia/gmetad.conf
#	modification
data_source "my cluster" hadoop101

Insert image description here

Modify the configuration file /etc/selinux/config (disabling SELinux only takes effect after a reboot; run sudo setenforce 0 to apply it temporarily)

sudo vim /etc/selinux/config
#	modification
SELINUX=disabled
SELINUXTYPE=targeted

4. Modify the configuration file on all three nodes

Modify configuration: Hadoop101, Hadoop102, Hadoop103

Modify the configuration file /etc/ganglia/gmond.conf

sudo vim /etc/ganglia/gmond.conf 
#	modifications
# send data to hadoop101
host = hadoop101

# accept data from any address
bind = 0.0.0.0

Insert image description here

5. Start the service

Modify file permissions

sudo chown -R ganglia:ganglia /var/lib/ganglia
sudo chmod -R 777 /var/lib/ganglia

Startup script (Hadoop101)

sudo systemctl start gmond
sudo systemctl start httpd
sudo systemctl start gmetad

Connection address: http://hadoop101/ganglia

If the page cannot be viewed, edit /etc/httpd/conf.d/ganglia.conf again (for example, point Require ip at your client's network, or use Require all granted) and restart httpd.

Require ip 192.168.1.1

Insert image description here

6. Test

The Ganglia web UI shows the following Flume channel metrics:

  • EventPutAttemptCount: total number of events the Source attempted to put into the Channel
  • EventPutSuccessCount: total number of events successfully put into the Channel and committed
  • EventTakeAttemptCount: total number of times the Sink attempted to take events from the Channel
  • EventTakeSuccessCount: total number of events the Sink successfully took from the Channel
  • StartTime: the time the Channel started
  • StopTime: the time the Channel stopped
  • ChannelSize: the current number of events in the Channel
  • ChannelFillPercentage: the percentage of the Channel's capacity in use
  • ChannelCapacity: the capacity of the Channel
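
To have an agent actually report these counters to Ganglia, start it with Flume's monitoring properties pointed at gmond; a sketch assuming gmond on hadoop101 listens on its default port 8649:

bin/flume-ng agent -c conf/ -n a1 -f job/simpleCase/flume-1-netcat-logger.conf \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop101:8649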
