Flume Single-Node Configuration with a Custom Interceptor

 

3. Extract Flume 1.7.0 and rename the directory

 # cd /opt
 # tar -xzvf apache-flume-1.7.0-bin.tar.gz
 # mv apache-flume-1.7.0-bin flume1.7.0

 # chmod 777 -R /opt/flume1.7.0        # grant permissions on the directory

4. Configure environment variables

 # vim /etc/profile

export FLUME_HOME=/opt/flume1.7.0
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin

 # source /etc/profile
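
To confirm the variables took effect, a quick sanity check is to print the Flume version; it should report 1.7.0 if PATH is set correctly:

 # flume-ng version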

5. Test run

5.1 Add the flume-conf.properties configuration file

 # cd /opt/flume1.7.0/conf
 # vim flume-conf.properties
# flume-conf.properties: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/log
a1.sources.r1.fileHeader = true
a1.sources.r1.deserializer.outputCharset=UTF-8
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop0:9000/log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.maxOpenFiles = 1
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 1000000
a1.sinks.k1.hdfs.batchSize = 100000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 100000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.2 Create the spool directory and grant permissions

 # mkdir /opt/log
 # chmod 777 -R /opt/log

Note: the /log directory on HDFS does not need to be created manually; it is generated automatically.
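
With the configuration and the spool directory in place, the agent can be started in the usual way (agent name and config file match the settings above; the console logger option is just convenient for testing):

 # cd /opt/flume1.7.0
 # bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console

Any file dropped into /opt/log should then be picked up by the spooldir source and written to hdfs://hadoop0:9000/log.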

For the same requirement as in Lesson 8, we will now switch to a different implementation: using an interceptor.

Recall that the spooldir source can write the file name into the event header under the key basename. Now imagine an interceptor that intercepts the event, extracts the value of that header key, splits it into three parts, and puts each part back into the header; that would satisfy the requirement.

Unfortunately, Flume does not ship an interceptor that works on headers. It does, however, provide one that extracts from the event body: RegexExtractorInterceptor, which looks quite powerful. Here is an example from the official documentation:

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used


a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3

In short, with this configuration, if the event body contains something like 1:2:3.4foobar5, the regex extracts the matching pieces and puts them into the header.
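
For completeness, the quoted snippet assumes the interceptor has already been declared on the source. With the stock interceptor that wiring looks roughly like this (regex_extractor is the built-in type alias; combine it with the regex and serializers lines above):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor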

So I decided to build on this interceptor: a small change to the code, switching it from extracting from the body to extracting from a specific header key, would do the job. The source turned out to be very tidy and easy to modify. Below is the new interceptor I added, RegexExtractorExtInterceptor:

 
package com.besttone.flume;

import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Charsets;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.collect.Lists;

/**
 * Interceptor that extracts matches using a specified regular expression and
 * appends the matches to the event headers using the specified serializers.
 * Note that all regular expression matching occurs through Java's built-in
 * java.util.regex package. Properties:
 * <p>
 * regex: The regex to use
 * <p>
 * serializers: Specifies the group the serializer will be applied to, and the
 * name of the header that will be added. If no serializer is specified for a
 * group the default {@link RegexExtractorInterceptorPassThroughSerializer} will
 * be used.
 * <p>
 * Sample config:
 * <p>
 * agent.sources.r1.channels = c1
 * agent.sources.r1.type = SEQ
 * agent.sources.r1.interceptors = i1
 * agent.sources.r1.interceptors.i1.type = REGEX_EXTRACTOR
 * agent.sources.r1.interceptors.i1.regex = (WARNING)|(ERROR)|(FATAL)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.type = com.blah.SomeSerializer
 * agent.sources.r1.interceptors.i1.serializers.s1.name = warning
 * agent.sources.r1.interceptors.i1.serializers.s2.type =
 *     org.apache.flume.interceptor.RegexExtractorInterceptorTimestampSerializer
 * agent.sources.r1.interceptors.i1.serializers.s2.name = error
 * agent.sources.r1.interceptors.i1.serializers.s2.dateFormat = yyyy-MM-dd
 *
 * <pre>
 * Example 1:
 * EventBody: 1:2:3.4foobar5
 * Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * agent.sources.r1.interceptors.i1.serializers.s3.name = three
 *
 * results in an event with the following
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2, three=>3
 *
 * Example 2:
 * EventBody: 1:2:3.4foobar5
 * Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 *
 * results in an event with the following
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2
 * </pre>
 */
public class RegexExtractorExtInterceptor implements Interceptor {

    static final String REGEX = "regex";
    static final String SERIALIZERS = "serializers";

    // added code: begin
    static final String EXTRACTOR_HEADER = "extractorHeader";
    static final boolean DEFAULT_EXTRACTOR_HEADER = false;
    static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";
    // added code: end

    private static final Logger logger = LoggerFactory
            .getLogger(RegexExtractorExtInterceptor.class);

    private final Pattern regex;
    private final List<NameAndSerializer> serializers;

    // added code: begin
    private final boolean extractorHeader;
    private final String extractorHeaderKey;
    // added code: end

    private RegexExtractorExtInterceptor(Pattern regex,
            List<NameAndSerializer> serializers, boolean extractorHeader,
            String extractorHeaderKey) {
        this.regex = regex;
        this.serializers = serializers;
        this.extractorHeader = extractorHeader;
        this.extractorHeaderKey = extractorHeaderKey;
    }

    @Override
    public void initialize() {
        // NO-OP...
    }

    @Override
    public void close() {
        // NO-OP...
    }

    @Override
    public Event intercept(Event event) {
        String tmpStr;
        if (extractorHeader) {
            tmpStr = event.getHeaders().get(extractorHeaderKey);
        } else {
            tmpStr = new String(event.getBody(), Charsets.UTF_8);
        }

        Matcher matcher = regex.matcher(tmpStr);
        Map<String, String> headers = event.getHeaders();
        if (matcher.find()) {
            for (int group = 0, count = matcher.groupCount(); group < count; group++) {
                int groupIndex = group + 1;
                if (groupIndex > serializers.size()) {
                    if (logger.isDebugEnabled()) {
                        logger.debug(
                                "Skipping group {} to {} due to missing serializer",
                                group, count);
                    }
                    break;
                }
                NameAndSerializer serializer = serializers.get(group);
                if (logger.isDebugEnabled()) {
                    logger.debug("Serializing {} using {}",
                            serializer.headerName, serializer.serializer);
                }
                headers.put(serializer.headerName, serializer.serializer
                        .serialize(matcher.group(groupIndex)));
            }
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());
        for (Event event : events) {
            Event interceptedEvent = intercept(event);
            if (interceptedEvent != null) {
                intercepted.add(interceptedEvent);
            }
        }
        return intercepted;
    }

    public static class Builder implements Interceptor.Builder {

        private Pattern regex;
        private List<NameAndSerializer> serializerList;

        // added code: begin
        private boolean extractorHeader;
        private String extractorHeaderKey;
        // added code: end

        private final RegexExtractorInterceptorSerializer defaultSerializer = new RegexExtractorInterceptorPassThroughSerializer();

        @Override
        public void configure(Context context) {
            String regexString = context.getString(REGEX);
            Preconditions.checkArgument(!StringUtils.isEmpty(regexString),
                    "Must supply a valid regex string");

            regex = Pattern.compile(regexString);
            regex.pattern();
            regex.matcher("").groupCount();
            configureSerializers(context);

            // added code: begin
            extractorHeader = context.getBoolean(EXTRACTOR_HEADER,
                    DEFAULT_EXTRACTOR_HEADER);

            if (extractorHeader) {
                extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);
                Preconditions.checkArgument(
                        !StringUtils.isEmpty(extractorHeaderKey),
                        "Must supply the header key to extract from");
            }
            // added code: end
        }

        private void configureSerializers(Context context) {
            String serializerListStr = context.getString(SERIALIZERS);
            Preconditions.checkArgument(
                    !StringUtils.isEmpty(serializerListStr),
                    "Must supply at least one name and serializer");

            String[] serializerNames = serializerListStr.split("\\s+");

            Context serializerContexts = new Context(
                    context.getSubProperties(SERIALIZERS + "."));

            serializerList = Lists
                    .newArrayListWithCapacity(serializerNames.length);
            for (String serializerName : serializerNames) {
                Context serializerContext = new Context(
                        serializerContexts.getSubProperties(serializerName
                                + "."));
                String type = serializerContext.getString("type", "DEFAULT");
                String name = serializerContext.getString("name");
                Preconditions.checkArgument(!StringUtils.isEmpty(name),
                        "Supplied name cannot be empty.");

                if ("DEFAULT".equals(type)) {
                    serializerList.add(new NameAndSerializer(name,
                            defaultSerializer));
                } else {
                    serializerList.add(new NameAndSerializer(name,
                            getCustomSerializer(type, serializerContext)));
                }
            }
        }

        private RegexExtractorInterceptorSerializer getCustomSerializer(
                String clazzName, Context context) {
            try {
                RegexExtractorInterceptorSerializer serializer = (RegexExtractorInterceptorSerializer) Class
                        .forName(clazzName).newInstance();
                serializer.configure(context);
                return serializer;
            } catch (Exception e) {
                logger.error("Could not instantiate event serializer.", e);
                Throwables.propagate(e);
            }
            return defaultSerializer;
        }

        @Override
        public Interceptor build() {
            Preconditions.checkArgument(regex != null,
                    "Regex pattern was misconfigured");
            Preconditions.checkArgument(serializerList.size() > 0,
                    "Must supply a valid group match id list");
            return new RegexExtractorExtInterceptor(regex, serializerList,
                    extractorHeader, extractorHeaderKey);
        }
    }

    static class NameAndSerializer {
        private final String headerName;
        private final RegexExtractorInterceptorSerializer serializer;

        public NameAndSerializer(String headerName,
                RegexExtractorInterceptorSerializer serializer) {
            this.headerName = headerName;
            this.serializer = serializer;
        }
    }
}


A brief explanation of the changes:

Two configuration parameters were added:

extractorHeader: whether to extract from the header. Defaults to false, in which case the behavior matches the original interceptor and the event body is extracted.

extractorHeaderKey: the header key whose value is extracted. This parameter is required when extractorHeader is true.

Following the approach from Lesson 8, we package this class into a jar and drop it, as a Flume plugin, into /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib, then restart Flume so the interceptor is loaded onto the classpath; a sketch of the steps follows.
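
A minimal deployment sketch, assuming the class has already been compiled and packaged (the jar name here is only a placeholder):

 # mkdir -p /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib
 # cp flume-regex-extractor-ext.jar /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib/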

The final flume.conf is as follows:

 
tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=spooldir
tier1.sources.source1.spoolDir=/opt/logs
tier1.sources.source1.fileHeader=true
tier1.sources.source1.basenameHeader=true
tier1.sources.source1.interceptors=i1
tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder
tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)
tier1.sources.source1.interceptors.i1.extractorHeader=true
tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename
tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3
tier1.sources.source1.interceptors.i1.serializers.s1.name=one
tier1.sources.source1.interceptors.i1.serializers.s2.name=two
tier1.sources.source1.interceptors.i1.serializers.s3.name=three
tier1.sources.source1.channels=channel1
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}
tier1.sinks.sink1.hdfs.round=true
tier1.sinks.sink1.hdfs.roundValue=10
tier1.sinks.sink1.hdfs.roundUnit=minute
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.writeFormat=Text
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.rollSize=10240
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.idleTimeout=60
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.channels.channel1.keep-alive=30

I switched the source type back to the built-in spooldir rather than the custom source from the previous lesson, and added an interceptor i1 whose type is the custom interceptor com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regular expression splits the value on "." into three parts and puts them into the header keys one, two, and three. For example, after a file named a.log.2014-07-31 passes through the interceptor, three keys are added to the header: one=a, two=log, three=2014-07-31. These keys can then be referenced in the sink path: tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}.

This fulfills exactly the same requirement as Lesson 8; a quick way to verify it is sketched below.
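
A minimal check, assuming the hostnames and paths from the config above (the test file name is only an example):

 # echo "hello flume" > /opt/logs/a.log.2014-07-31
 # hdfs dfs -ls -R /flume/events

After a short while a file should appear under /flume/events/a/2014-07-31/.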


Reposted from blog.csdn.net/weixin_39542448/article/details/82224867