Spark Streaming Integration with Flume

1. Purpose

  Integrate Spark Streaming with Flume. Refer to the official integration guide ( http://spark.apache.org/docs/2.2.0/streaming-flume-integration.html ).

2 Integration approach 1: push-based

2.1 Basic requirements

  • Flume and one of the Spark worker nodes must run on the same machine; Flume pushes data to a port configured on that machine
  • The streaming application must be started first, so that its receiver is already listening on that port before Flume starts pushing data
  • Add the following dependency
 groupId = org.apache.spark
 artifactId = spark-streaming-flume_2.11
 version = 2.2.0
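
  If the project is built with Maven (as in the packaging step later), these coordinates correspond to a dependency declaration roughly like the following in the pom:

 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-streaming-flume_2.11</artifactId>
     <version>2.2.0</version>
 </dependency>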

2.2 Configuring Flume

  We already know how to configure Flume through its configuration file. Here a local netcat source is used to simulate data; the configuration is as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop
a1.sources.r1.port = 5900

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop
a1.sinks.k1.port = 5901
#a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2.3 Running on the server

The overall steps are as follows:

  • Package the project with Maven
  • Submit it with spark-submit
  • Start Flume
  • Send simulated data
  • Verify

The code is as follows; it does a simple word count:

package flume_streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Durations, StreamingContext}

/**
 * @Author: SmallWild
 * @Date: 2019/11/2 9:42
 * @Desc: push-based Flume word count
 */
object flumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: flumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    // command-line arguments
    val Array(hostname, port) = args
    // note: do not use local[1] -- the receiver needs its own thread
    val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("flumePushWordCount")
    val ssc = new StreamingContext(sparkConf, Durations.seconds(5))
    // set the log level
    ssc.sparkContext.setLogLevel("WARN")
    // TODO: do a simple word count
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The verification steps are as follows:

1) Package the project
mvn clean package -DskipTests
2) Submit with spark-submit (local mode is used here)
./spark-submit --class flume_streaming.flumePushWordCount \
  --master local[2] \
  --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
  /smallwild/app/SparkStreaming-1.0.jar hadoop 5901
3) Start Flume
flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/simple.conf -Dflume.root.logger=INFO,console
4) Send simulated data
Here data is sent to local port 5900:
telnet hadoop 5900
5) Verify
Check that the streaming application prints the word counts for the words that were sent

Validation result: the words sent over the port are counted correctly in each batch.
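
For illustration (hypothetical input): if the line "hello world hello" is typed at the telnet prompt, the batch that contains it should be printed by the application roughly as:

 -------------------------------------------
 Time: 1572656705000 ms
 -------------------------------------------
 (hello,2)
 (world,1)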

3 Integration approach 2: pull-based (commonly used)

This approach is largely consistent with the one above.

3.1 Considerations

  • Start Flume first
  • A custom sink is used on the Flume side; data is buffered in the sink, and the streaming application actively pulls it
  • Transactions and replication provide reliability (transactions succeed only after data is received and replicated by Spark Streaming)
  • This gives stronger fault tolerance
  • Add the following dependencies, which are required for the custom Spark sink (see the note after this list)
     groupId = org.apache.spark
     artifactId = spark-streaming-flume-sink_2.11
     version = 2.2.0

     groupId = org.scala-lang
     artifactId = scala-library
     version = 2.11.8

     groupId = org.apache.commons
     artifactId = commons-lang3
     version = 3.5
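
For the custom sink to load, these jars must be available on the Flume agent's classpath; one common way to do this (an assumption, not shown in the original post) is to copy them into Flume's lib directory, for example:

 cp spark-streaming-flume-sink_2.11-2.2.0.jar scala-library-2.11.8.jar commons-lang3-3.5.jar $FLUME_HOME/lib/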

3.2 Configuring Flume

The difference from the earlier configuration is the sink, which must be the custom Spark sink:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop
a1.sources.r1.port = 5900

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hadoop
a1.sinks.k1.port = 5901
#a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3.3 Running on the server

  The application logic is essentially the same as before; the difference is that the stream is created with the following call (a complete sketch follows the snippet below):

 import org.apache.spark.streaming.flume._

 val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port])
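
A complete pull-based version of the word count could look roughly like the sketch below. It mirrors the push-based example above; the object name flumePullWordCount and the argument handling are assumptions, only FlumeUtils.createPollingStream comes from the official API.

 package flume_streaming

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.flume.FlumeUtils
 import org.apache.spark.streaming.{Durations, StreamingContext}

 object flumePullWordCount {
   def main(args: Array[String]): Unit = {
     if (args.length != 2) {
       System.err.println("Usage: flumePullWordCount <hostname> <port>")
       System.exit(1)
     }
     val Array(hostname, port) = args
     val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("flumePullWordCount")
     val ssc = new StreamingContext(sparkConf, Durations.seconds(5))
     ssc.sparkContext.setLogLevel("WARN")
     // pull events from the Spark sink instead of waiting for pushed data
     val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)
     flumeStream.map(x => new String(x.event.getBody.array()).trim)
       .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
     ssc.start()
     ssc.awaitTermination()
   }
 }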

3.4 Submitting and verifying

The overall steps are as follows:

  • Package the project with Maven
  • Start Flume
  • Submit it with spark-submit
  • Send simulated data
  • Verify

These steps are basically the same as before; a possible spark-submit command for this mode is sketched below.
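
Assuming the application class is flume_streaming.flumePullWordCount (as in the sketch in 3.3) and the Spark sink listens on hadoop:5901, the submit command might look like:

 ./spark-submit --class flume_streaming.flumePullWordCount \
   --master local[2] \
   --packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
   /smallwild/app/SparkStreaming-1.0.jar hadoop 5901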

4 Summary

  This post covered the two ways of integrating Spark Streaming with Flume: push-based and pull-based.

Origin www.cnblogs.com/truekai/p/11759162.html