Article Directory
- Background of the project
- Case requirements
- Case architecture
- 1. Analysis
- 2. Log collection
- 3. Write the Spark Streaming code
- Step 1: Create a project
- Step 2: Choose to create a Scala project
- Step 3: Set the project name, project path, and Scala version, then finish creation
- Step 4: Create a Scala file
- Step 5: Import the dependency packages
- Step 6: Import the classes needed by the program
- Step 7: Create the main function and the Spark program entry point
- Step 8: Set the Kafka service host address and port, and the topic to receive data from
- Step 9: Data analysis
- Step 10: Save the calculation results
- Step 11: Database design
- jumpertab table
- pvtab table
- regusetab table
- 4. Compile and run
- The complete code is as follows
Download the source code of this case
Link: https://pan.baidu.com/s/1IzOvSCtLvZzj81XZaYl6CQ
Extraction code: i6i8
Background of the project
In the era of rapid Internet development, more and more people obtain information or run their business through the Internet. After building a website, an app, or a mini program and operating it for a while, many owners find that the growth of page views and user numbers stalls. When they want to redesign or improve the product, they have no idea where to start, because they do not understand users' browsing preferences or the makeup of the user base. The server logs clearly record every visit and the users' browsing preferences, but ordinary methods struggle to filter high-quality information out of massive logs in a timely and effective way. Spark Streaming is a real-time stream computing framework that can analyze data quickly in real time; combined with Flume and Kafka, it can achieve statistical analysis of data with near-zero delay.
Case requirements
Requirements: analyze server log data in real time, and compute in real time metrics such as the number of page views over a given period.
Technology stack: Flume → Kafka → Spark Streaming → MySQL database
Case architecture
In this architecture, Flume monitors the log file in real time. Whenever new data appears in the log file, Flume sends it to Kafka; Spark Streaming receives the data, analyzes it in real time, and finally saves the analysis results in the MySQL database.
1. Analysis
1. Log analysis
Each visit to a web page on the server through a browser generates a log entry. The log contains the visitor IP, access time, requested address, status code, and elapsed time, as shown in the figure below:
2. Log collection
Step 1: Edit the code
Flume monitors the content of the server log file in real time: each time new content is generated it is collected, and the collected data is sent to Kafka. The Flume configuration is as follows.
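The article does not reproduce the configuration file itself, so here is a minimal sketch of what such an agent might look like. The agent name `a1` and the file name `access_log-HDFS.properties` come from the launch command in the next step; the log path, broker address, and topic are assumptions taken from elsewhere in the article (the exec source and Flume 1.6-era KafkaSink property names are also assumptions):

```properties
# conf/access_log-HDFS.properties -- agent a1: tail the access log, forward to Kafka
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow the server access log (path is an assumption)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: publish each log line to the Kafka topic consumed by Spark Streaming
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = 192.168.10.10:9092
a1.sinks.k1.topic = testSpark
a1.sinks.k1.channel = c1
```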
Step 2: Start the collection
After editing the configuration, start Flume to monitor the server log. Enter the Flume installation directory and execute the following command.
[root@master flume]# bin/flume-ng agent --name a1 --conf conf --conf-file conf/access_log-HDFS.properties -Dflume.root.logger=INFO,console
The effect is shown in the figure below.
3. Write the Spark Streaming code
Step 1: Create a project
Step 2: Choose to create a Scala project
Step 3: Set the project name, project path, and Scala version, then finish creation
Step 4: Create a Scala file
Right-click the "src" directory of the project and select "New" -> "Package" to create a package named "com.wordcountdemo"; then right-click the package and select "New" -> "Scala Class" to create a file named wordcount.
Step 5: Import dependent packages
Import the Spark dependency package in IDEA: select "File" -> "Project Structure" -> "Libraries" from the menu, click the "+" button and choose the "Java" option, locate the spark-assembly-1.6.1-hadoop2.6.0.jar dependency package in the pop-up dialog, and click "OK" to load all dependencies into the project. The result is shown in Figure X.
Step 6: Import the classes needed by the program
Note that three jar packages not shipped with Spark 2 are used here: kafka_2.11-0.8.2.1.jar, metrics-core-2.2.0.jar, and spark-streaming-kafka_2.11-1.6.3.jar.
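The article adds these jars by hand in IDEA. If one were to build with sbt instead, the dependencies might be declared roughly as follows; the group/artifact/version strings are inferred from the jar file names above (1.6.1 from the assembly jar, 1.6.3 from the Kafka integration jar) and are not taken from the article:

```scala
// build.sbt sketch for the Spark 1.6.x / Kafka 0.8 toolchain used in this case
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.3",
  "mysql"            %  "mysql-connector-java"  % "5.1.38" // assumed JDBC driver version
)
```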
import java.sql.DriverManager // database connection
import kafka.serializer.StringDecoder // deserialize Kafka data
import org.apache.spark.streaming.dstream.DStream // the input data stream
import org.apache.spark.streaming.kafka.KafkaUtils // connect to Kafka
import org.apache.spark.streaming.{Seconds, StreamingContext} // real-time stream processing
import org.apache.spark.SparkConf // Spark program configuration
The result is shown in the figure.
Step 7: Create the main function and Spark program entry.
def main(args: Array[String]): Unit = {
  // create the Spark configuration and the streaming context
  val conf = new SparkConf().setAppName("Consumer")
  val ssc = new StreamingContext(conf, Seconds(20)) // receive and compute once every 20 seconds
}
The result is shown in the figure.
Step 8: Set the Kafka service host address and port, and the topic from which to receive data
// Kafka broker address
val kafkaParam = Map("metadata.broker.list" -> "192.168.10.10:9092")
// set the topic
val topic = "testSpark".split(",").toSet
// receive data from Kafka
val logDStream: DStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParam, topic).map(_._2)
Step 9: Data analysis
After the data is received, it is analyzed: the server log lines are split on spaces, and the page views, registered users, and bounces during the period are counted separately; each statistic is then converted into a key-value RDD.
// split the received data on spaces
val RDDIP = logDStream.transform(rdd => rdd.map(x => x.split(" ")))
// analyze the data
val pv = RDDIP.map(x => x(0)).count().map(x => ("pv", x)) // page views
val jumper = RDDIP.map(x => x(0)).map((_, 1)).reduceByKey(_ + _).filter(x => x._2 == 1).map(x => x._1).count.map(x => ("jumper", x)) // bounces: IPs seen only once
val reguser = RDDIP.filter(_(8).replaceAll("\"", "") == "/member.php?mod=register&inajax=1").count.map(x => ("reguser", x)) // registered users
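The per-batch counting logic above can be checked without Spark on a few sample lines using plain Scala collections. This is only a sketch; the sample log format is an assumption chosen to match the field indices the Spark code uses (visitor IP in field 0, quoted request path in field 8):

```scala
// Three sample access-log lines: two visits from one IP, one registration from another
val logLines = Seq(
  """192.168.1.1 - - [t] "GET" a b c "/index.php" 200""",
  """192.168.1.1 - - [t] "GET" a b c "/list.php" 200""",
  """192.168.1.2 - - [t] "GET" a b c "/member.php?mod=register&inajax=1" 200"""
)

val fields = logLines.map(_.split(" "))

// pv: total number of requests in the batch
val pv = fields.size

// jumper: number of IPs that appear exactly once (bounces)
val jumper = fields.map(_(0)).groupBy(identity).count(_._2.size == 1)

// reguser: requests whose unquoted path is the registration URL
val reguser = fields.count(_(8).replaceAll("\"", "") == "/member.php?mod=register&inajax=1")
```

On this sample, `pv` is 3, `jumper` is 1 (only 192.168.1.2 appears once), and `reguser` is 1.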
Step 10: Save the calculation results
Traverse the statistics RDDs, take the values out of the key-value pairs, and save the analysis results to the pvtab, jumpertab, and regusetab tables respectively; finally start the Spark Streaming program.
pv.foreachRDD(rdd => rdd.foreachPartition(partition => {
  partition.foreach(word => {
    val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
    val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
    val dateFf = format.format(new java.util.Date())
    val sql = "insert into pvtab(time,pv) values('" + dateFf + "','" + word._2 + "')"
    conn.prepareStatement(sql).executeUpdate()
    conn.close()
  })
}))
jumper.foreachRDD(rdd => rdd.foreachPartition(partition => {
  partition.foreach(word => {
    val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
    val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
    val dateFf = format.format(new java.util.Date())
    val sql = "insert into jumpertab(time,jumper) values('" + dateFf + "','" + word._2 + "')"
    conn.prepareStatement(sql).executeUpdate()
    conn.close()
  })
}))
reguser.foreachRDD(rdd => rdd.foreachPartition(partition => {
  partition.foreach(word => {
    val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
    val format = new java.text.SimpleDateFormat("yyyy-MM-dd H:mm:ss")
    val dateFf = format.format(new java.util.Date())
    val sql = "insert into regusetab(time,reguse) values('" + dateFf + "','" + word._2 + "')"
    conn.prepareStatement(sql).executeUpdate()
    conn.close()
  })
}))
ssc.start() // start the Spark Streaming program
The result is shown in the figure.
Step 11: Database design
Create a database named "test", and in it create three tables named "jumpertab", "pvtab", and "regusetab". The table structures are shown in the figures below.
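The figures of the table structures are not reproduced here. From the insert statements in the code, the schema can be sketched roughly as follows; the column types are assumptions, not taken from the article:

```sql
CREATE DATABASE IF NOT EXISTS test;
USE test;

-- time holds the timestamp or interval label; the metric column holds the count
CREATE TABLE pvtab     (time VARCHAR(40), pv     VARCHAR(20));
CREATE TABLE jumpertab (time VARCHAR(40), jumper VARCHAR(20));
CREATE TABLE regusetab (time VARCHAR(40), reguse VARCHAR(20));
```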
jumpertab table
pvtab table
regusetab table
4. Compile and run
Package the program into a jar file and submit it to the cluster to run.
Step 1: Add the project to a jar file and set the file name
Select "File" -> "Project Structure", click "Artifacts" in the pop-up dialog, choose "+" -> "JAR" -> "Empty", and set the JAR file name to "WordCount" under "Name". Then double-click "'firstSpark' compile output" under "firstSpark" on the right to load it to the left, indicating that the project has been added to the JAR package, and click "OK", as shown in the figure below.
Step 2: Generate the jar package
Click "Build" -> "Build Artifacts..." in the menu bar and click "Build" in the pop-up dialog. After the jar package is generated, an out directory is created automatically in the project root, where you can see the generated jar package, as shown in the figure below.
Step 3: Submit and run the Spark Streaming program
[root@master bin]# ./spark-submit --master local[*] --class com.spark.streaming.sparkword /usr/local/Streaminglog.jar
The result is shown in the figure below
Step 4: View the database
The complete code is as follows
package spark
import java.sql.DriverManager
import java.util.Calendar
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object kafkaspark {
  def main(args: Array[String]): Unit = {
    // create the Spark configuration and the streaming context
    val conf = new SparkConf().setAppName("Consumer")
    val ssc = new StreamingContext(conf, Seconds(1))
    val kafkaParam = Map("metadata.broker.list" -> "192.168.10.10:9092")
    val topic = "testSpark".split(",").toSet
    // receive data from Kafka
    val logDStream: DStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParam, topic).map(_._2)
    // split the received data on spaces
    val RDDIP = logDStream.transform(rdd => rdd.map(x => x.split(" ")))
    // analyze the data
    val pv = RDDIP.map(x => x(0)).count().map(x => ("pv", x))
    val jumper = RDDIP.map(x => x(0)).map((_, 1)).reduceByKey(_ + _).filter(x => x._2 == 1).map(x => x._1).count.map(x => ("jumper", x))
    val reguser = RDDIP.filter(_(8).replaceAll("\"", "") == "/member.php?mod=register&inajax=1").count.map(x => ("reguser", x))
    // save the analysis results to the MySQL database
    pv.foreachRDD(rdd => rdd.foreachPartition(partition => {
      partition.foreach(word => {
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("H:mm:ss")
        val dateFf = format.format(new java.util.Date())
        val cal: Calendar = Calendar.getInstance()
        cal.add(Calendar.SECOND, -1)
        val oneSecondAgo = format.format(cal.getTime())
        val date = oneSecondAgo + "-" + dateFf // label for the one-second window
        val sql = "insert into pvtab(time,pv) values('" + date + "','" + word._2 + "')"
        conn.prepareStatement(sql).executeUpdate()
        conn.close()
      })
    }))
    jumper.foreachRDD(rdd => rdd.foreachPartition(partition => {
      partition.foreach(word => {
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("H:mm:ss")
        val dateFf = format.format(new java.util.Date())
        val cal: Calendar = Calendar.getInstance()
        cal.add(Calendar.SECOND, -1)
        val oneSecondAgo = format.format(cal.getTime())
        val date = oneSecondAgo + "-" + dateFf
        val sql = "insert into jumpertab(time,jumper) values('" + date + "','" + word._2 + "')"
        conn.prepareStatement(sql).executeUpdate()
        conn.close()
      })
    }))
    reguser.foreachRDD(rdd => rdd.foreachPartition(partition => {
      partition.foreach(word => {
        val conn = DriverManager.getConnection("jdbc:mysql://master:3306/test", "root", "123456")
        val format = new java.text.SimpleDateFormat("H:mm:ss")
        val dateFf = format.format(new java.util.Date())
        val cal: Calendar = Calendar.getInstance()
        cal.add(Calendar.SECOND, -1)
        val oneSecondAgo = format.format(cal.getTime())
        val date = oneSecondAgo + "-" + dateFf
        val sql = "insert into regusetab(time,reguse) values('" + date + "','" + word._2 + "')"
        conn.prepareStatement(sql).executeUpdate()
        conn.close()
      })
    }))
    // print per-line counts for debugging
    val num = logDStream.map(x => (x, 1)).reduceByKey(_ + _)
    num.print()
    // start the streaming computation and wait for termination
    ssc.start()
    ssc.awaitTermination()
  }
}