1. Purpose
Integrating Spark Streaming with Flume. Refer to the official integration guide ( http://spark.apache.org/docs/2.2.0/streaming-flume-integration.html ).
2 Integration approach 1: push-based
2.1 Basic requirements
- Flume and one of the Spark worker nodes must run on the same machine; Flume pushes data to a port on that machine specified in the configuration
- The streaming application must be started first, with its receiver listening on that port, before Flume can push data
- Add the following dependency:
groupId = org.apache.spark
artifactId = spark-streaming-flume_2.11
version = 2.2.0
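Since the project is packaged with Maven later in this walkthrough, the same coordinates can be declared in pom.xml roughly as follows (a minimal sketch of the standard Maven form, not taken from the original project's build file):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.2.0</version>
</dependency>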
2.2 Configuring Flume
We already know how to configure Flume with a configuration file. Here a local netcat source is used to simulate data; the configuration is as follows:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop
a1.sources.r1.port = 5900

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop
a1.sinks.k1.port = 5901
#a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2.3 Running on the server
The steps are as follows:
- Package the project with Maven
- Submit it with spark-submit
- Start Flume
- Send simulated data
- Verify
The verification code is as follows; it does a simple word count:
package flume_streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Durations, StreamingContext}

/**
 * @Author: SmallWild
 * @Date: 2019/11/2 9:42
 * @Desc: push-based flumePushWordCount
 */
object flumePushWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Parameter error. Usage: flumePushWordCount <hostname> <port>")
      System.exit(1)
    }
    // Parameters passed in
    val Array(hostname, port) = args
    // Note: local[1] cannot be used, because the receiver needs its own core
    val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("kafkaDirectWordCount")
    val ssc = new StreamingContext(sparkConf, Durations.seconds(5))
    // Set the log level
    ssc.sparkContext.setLogLevel("WARN")
    // TODO: do a simple word count
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The verification steps are as follows:
1) Package the project
mvn clean package -DskipTests

2) Submit with spark-submit (local mode is used here)
./spark-submit --class flume_streaming.flumePushWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/smallwild/app/SparkStreaming-1.0.jar hadoop 5901

3) Start Flume
flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/<config file from 2.2> -Dflume.root.logger=INFO,console

4) Send simulated data
Here we send data to local port 5900:
telnet hadoop 5900

5) Verify
Check that the streaming application prints the counts of the words that were sent.
Expected result: the words sent through the port are correctly counted within each batch.
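As an illustration only (the timestamp and words are made up), typing hello hello world into the telnet session should make the print() call emit a batch that looks roughly like this in the application's console:

-------------------------------------------
Time: 1572656705000 ms
-------------------------------------------
(hello,2)
(world,1)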
3 Integration approach 2: pull-based (commonly used)
This approach is largely the same as the one above.
3.1 Considerations
- Start Flume first
- A custom sink is used; Spark Streaming actively pulls the data, which is first buffered in the sink
- Reliability comes from transactions and data replication: transactions succeed only after the data is received and replicated by Spark Streaming
- This provides stronger fault tolerance
- Add the following dependencies (see the note on Flume's classpath after the coordinates below):
groupId = org.apache.spark
artifactId = spark-streaming-flume-sink_2.11
version = 2.2.0

groupId = org.scala-lang
artifactId = scala-library
version = 2.11.8

groupId = org.apache.commons
artifactId = commons-lang3
version = 3.5
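Note that in the pull-based approach the custom SparkSink and its dependencies also have to be on the Flume agent's classpath, not only in the Spark application. One common way, sketched below, is to copy the jars into Flume's lib directory; the jar file names are assumptions derived from the coordinates above:

# Make the custom sink available to the Flume agent (jar names assumed from the coordinates above)
cp spark-streaming-flume-sink_2.11-2.2.0.jar $FLUME_HOME/lib/
cp scala-library-2.11.8.jar $FLUME_HOME/lib/
cp commons-lang3-3.5.jar $FLUME_HOME/lib/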
3.2 Configuring Flume
The difference from the earlier configuration is the sink: here a custom sink is required.
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop
a1.sources.r1.port = 5900

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = hadoop
a1.sinks.k1.port = 5901
#a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3.3 Running on the server
The application logic is essentially the same as before; the difference is that the stream is created with the following call:
import org.apache.spark.streaming.flume._

val flumeStream = FlumeUtils.createPollingStream(streamingContext, [sink machine hostname], [sink port])
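For completeness, here is a minimal sketch of the pull-based word count. It mirrors flumePushWordCount above and only swaps createStream for createPollingStream; the object name flumePullWordCount is assumed for illustration and does not come from the original project:

package flume_streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Durations, StreamingContext}

// Hypothetical pull-based counterpart of flumePushWordCount (name assumed for illustration)
object flumePullWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Parameter error. Usage: flumePullWordCount <hostname> <port>")
      System.exit(1)
    }
    val Array(hostname, port) = args
    // Note: local[1] cannot be used, because the receiver needs its own core
    val sparkConf = new SparkConf() //.setMaster("local[2]").setAppName("flumePullWordCount")
    val ssc = new StreamingContext(sparkConf, Durations.seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    // Pull events from the SparkSink instead of having Flume push them to a receiver
    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}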
3.4 Submission and verification
The steps are as follows:
- Package the project with Maven
- Start Flume
- Submit it with spark-submit
- Send simulated data
- Verify
These are basically the same as before; a sketch of the commands is shown below.
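As a sketch only (the class name flumePullWordCount and the jar path are carried over from the push-based example above and are assumptions), the commands look roughly like this:

1) Package the project
mvn clean package -DskipTests

2) Start Flume (it must be running before the streaming application is submitted)
flume-ng agent --name a1 --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/<config file from 3.2> -Dflume.root.logger=INFO,console

3) Submit with spark-submit (local mode)
./spark-submit --class flume_streaming.flumePullWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/smallwild/app/SparkStreaming-1.0.jar hadoop 5901

4) Send simulated data
telnet hadoop 5900

5) Verify that the word counts appear in the streaming application's output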
4 Summary
This post summarized the two ways of integrating Spark Streaming with Flume.