Spark Streaming Real-Time Stream Processing in Practice: Notes

Chapter 2: Flume, a Distributed Log Collection Framework

Chapter outline
Business context analysis => Flume overview => Flume architecture and core components => Flume deployment => Flume in practice

1. Business context analysis

  • WebServer/ApplicationServer instances are scattered across many machines
  • Statistics and analysis run on the Hadoop big-data platform
  • How do the logs get collected onto the Hadoop platform?
  • Candidate solutions and their problems

Problems with the traditional way of shipping data from the servers to Hadoop:
1. Hard to monitor
2. Heavy I/O read/write overhead
3. Poor fault tolerance and load balancing
4. High latency: transfers only run when kicked off at fixed intervals

2. Flume overview

Flume website: http://flume.apache.org/

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
In other words, Flume (originally developed at Cloudera) is a distributed, highly reliable, highly available service for efficiently collecting, aggregating, and moving massive volumes of log data.

Design goals
Reliability
Scalability
Manageability

Comparison with similar products
Flume: Cloudera/Apache, Java
Scribe: Facebook, C/C++, no longer maintained
Chukwa: Yahoo/Apache, Java, no longer maintained
Kafka: LinkedIn/Apache, Scala/Java
Fluentd: Ruby
Logstash: part of the ELK stack (Elasticsearch, Logstash, Kibana)

Flume history
Cloudera releases through 0.9.2: Flume-OG
FLUME-728 rewrite: Flume-NG ==> donated to Apache
2012.7: 1.0
2015.5: 1.6
later: 1.7

Flume architecture and core components

  1. Source: collects the data
  2. Channel: aggregates/buffers the data
  3. Sink: writes the data out

[Figure: Flume agent architecture - Source -> Channel -> Sink]

Flume installation prerequisites
1. Java Runtime Environment - Java 1.8 or later
2. Memory - Sufficient memory for configurations used by sources, channels or sinks
3. Disk Space - Sufficient disk space for configurations used by channels or sinks
4. Directory Permissions - Read/Write permissions for directories used by agent

Install the JDK
Download it
Extract it to ~/app
Add Java to the system environment variables: vi ~/.bash_profile
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144
export PATH=$JAVA_HOME/bin:$PATH
Make the configuration take effect: source ~/.bash_profile
Verify: java -version


Install Flume
Download it
Extract it to ~/app
Add Flume to the system environment variables: vi ~/.bash_profile
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
Make the configuration take effect: source ~/.bash_profile
Configure flume-env.sh: export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144
Verify: flume-ng version


Flume in practice:

Use case 1: collect data from a specified network port and print it to the console

[Figure: agent flow for use case 1 - netcat source -> memory channel -> logger sink]

The key to using Flume is writing the configuration file:
A) Configure the Source
B) Configure the Channel
C) Configure the Sink
D) Wire the three components together

a1: the agent name
r1: the source name
k1: the sink name
c1: the channel name

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
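
A note on the memory channel above: it runs with its defaults, and once it fills up the source is throttled. A hedged tuning sketch (capacity and transactionCapacity are standard memory channel properties, both defaulting to 100; the values below are arbitrary):

a1.channels.c1.type = memory
# max number of events held in the channel
a1.channels.c1.capacity = 10000
# max events moved per transaction between source/sink and channel
a1.channels.c1.transactionCapacity = 1000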

Look it up in the official user guide (NetCat Source section):
http://flume.apache.org/FlumeUserGuide.html#netcat-source

a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444

type: The component type name, needs to be netcat
bind: The hostname or IP address to bind to
port: The port # to listen on

a1.sinks.k1.type = logger

type: The component type name, needs to be logger

a1.channels.c1.type = memory

type: The component type name, needs to be memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Note: a single source can fan events out to multiple channels, which is why the property above is the plural channels; a sink can only drain a single channel, hence the singular channel (see the sketch just below).
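
For illustration, a minimal fan-out sketch (hypothetical file name fan-out-example.conf; everything else reuses the components already shown): one netcat source replicated into two memory channels, each drained by its own logger sink. The replicating selector is Flume's default, so that line is optional.

# fan-out-example.conf: one source, two channels, two sinks
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop000
a1.sources.r1.port = 44444
# replicating (the default) copies every event to all listed channels
a1.sources.r1.selector.type = replicating

a1.sinks.k1.type = logger
a1.sinks.k2.type = logger

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# plural channels: one source feeds both
a1.sources.r1.channels = c1 c2
# singular channel: each sink drains exactly one
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2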

Steps:
1. Write the configuration file
In the conf directory: vi example.conf
and paste in the configuration above
2. Start the agent (--name must match the agent name in the file, --conf points at Flume's conf directory, --conf-file at this agent definition, and -Dflume.root.logger=INFO,console routes the logger output to the console):

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/example.conf \
-Dflume.root.logger=INFO,console

3. Test it with telnet: telnet hadoop000 44444
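
A rough sketch of the telnet session (illustrative output; the exact banner and IP vary by environment, and the netcat source acks each received line with OK by default):

$ telnet hadoop000 44444
Trying <ip of hadoop000>...
Connected to hadoop000.
Escape character is '^]'.
hello
OK

The agent console should then print an event along the lines of:

Event: { headers:{} body: 68 65 6C 6C 6F                                  hello }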

Use case 2: monitor a file and, in real time, collect newly appended data and print it to the console

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log
a1.sources.r1.shell = /bin/sh -c

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

channels: the channel(s) the source feeds
type: The component type name, needs to be exec
command: The command to execute
shell: A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
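
To make the shell property concrete, a hypothetical variant that requires it, because the pipe is a shell feature (grep's --line-buffered keeps lines flowing to Flume instead of being block-buffered in the pipe):

# hypothetical: forward only ERROR lines to the channel
a1.sources.r1.type = exec
a1.sources.r1.shell = /bin/sh -c
a1.sources.r1.command = tail -F /home/hadoop/data/data.log | grep --line-buffered ERROR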

Steps:
1. Write the configuration file
In the conf directory: vi exec-memory-logger.conf
and paste in the configuration above
2. Start the agent (same flags as before):

flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/exec-memory-logger.conf \
-Dflume.root.logger=INFO,console

3. Test:

Open a new terminal window and append a few lines:

[hadoop@hadoop001 data]$ echo hello >> data.log
[hadoop@hadoop001 data]$ echo world >> data.log
[hadoop@hadoop001 data]$ echo welcome >> data.log

The agent window then prints:

Event: { headers:{} body: 68 65 6C 6C 6F  				hello }
Event: { headers:{} body: 77 6F 72 6C 64  				world }
Event: { headers:{} body: 77 65 6C 63 6F 6D 65			welcome }

Use case 3: collect logs on server A and ship them to server B in real time

Technology selection:

Agent on A: exec source + memory channel + avro sink
Agent on B: avro source + memory channel + logger sink

The avro sink on A and the avro source on B form an RPC pair, so the sink's hostname/port must match the source's bind/port.

Two configuration files:

exec-memory-avro.conf

# Name the components on this agent
exec-memory-avro.sources = exec-source
exec-memory-avro.sinks = avro-sink
exec-memory-avro.channels = memory-channel

# Describe/configure the source
exec-memory-avro.sources.exec-source.type = exec
exec-memory-avro.sources.exec-source.command = tail -F /home/hadoop/data/data.log
exec-memory-avro.sources.exec-source.shell = /bin/sh -c

# Describe the sink
exec-memory-avro.sinks.avro-sink.type = avro
exec-memory-avro.sinks.avro-sink.hostname = hadoop000
exec-memory-avro.sinks.avro-sink.port = 44444

# Use a channel which buffers events in memory
exec-memory-avro.channels.memory-channel.type = memory

# Bind the source and sink to the channel
exec-memory-avro.sources.exec-source.channels = memory-channel
exec-memory-avro.sinks.avro-sink.channel = memory-channel

avro-memory-logger.conf

# Name the components on this agent
avro-memory-logger.sources = avro-source
avro-memory-logger.sinks = logger-sink
avro-memory-logger.channels = memory-channel

# Describe/configure the source
avro-memory-logger.sources.avro-source.type = avro
avro-memory-logger.sources.avro-source.bind = hadoop000
avro-memory-logger.sources.avro-source.port = 44444

# Describe the sink
avro-memory-logger.sinks.logger-sink.type = logger

# Use a channel which buffers events in memory
avro-memory-logger.channels.memory-channel.type = memory

# Bind the source and sink to the channel
avro-memory-logger.sources.avro-source.channels = memory-channel
avro-memory-logger.sinks.logger-sink.channel = memory-channel

Start avro-memory-logger first (the avro source must be listening before the avro sink tries to connect):

flume-ng agent \
--name avro-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/avro-memory-logger.conf \
-Dflume.root.logger=INFO,console

Then start exec-memory-avro:

flume-ng agent \
--name exec-memory-avro \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/exec-memory-avro.conf \
-Dflume.root.logger=INFO,console
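
In a real deployment you usually would not keep two foreground terminals open. A hedged sketch of backgrounding an agent and capturing its log (assumes ~/logs exists; create it with mkdir -p ~/logs):

nohup flume-ng agent \
--name avro-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/avro-memory-logger.conf \
-Dflume.root.logger=INFO,console \
> ~/logs/avro-memory-logger.out 2>&1 &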

Test:
Open a new terminal window and append a few lines:

[hadoop@hadoop001 data]$ echo hello spark >> data.log
[hadoop@hadoop001 data]$ echo hello hadoop >> data.log

The avro-memory-logger agent window then prints:

Event: { headers:{} body: 68 65 6C 6C 6F 20 73 70 61 72 6B                hello spark }
Event: { headers:{} body: 68 65 6C 6C 6F 20 68 61 64 6F 6F 70             hello hadoop }

Reposted from blog.csdn.net/mys_mys/article/details/83990407