Case 1: Processing socket data with Spark Streaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * @author YuZhansheng
 * @desc Process socket data with Spark Streaming
 * @create 2019-02-19 11:26
 */
object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Creating a StreamingContext takes two parameters: a SparkConf and the batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Test: use nc as the data source: nc -lk 6789
Send some words, and the word counts are printed to the console for each batch.
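The flatMap/map/reduceByKey pipeline can be sanity-checked on a plain Scala collection before wiring it to a DStream. A minimal sketch of the same three steps (the function name here is illustrative, not part of the original code):

```scala
// Same word-count pipeline as the DStream version, on a local Seq:
// split each line into words, pair each word with 1, then sum per word.
// groupBy + sum is the plain-Scala stand-in for reduceByKey(_ + _).
def wordCount(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split(" "))
    .map((_, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```

The only difference from the streaming version is that Spark applies this per batch and distributes the reduce across partitions.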
Case 2: Processing file system data with Spark Streaming (both HDFS and the local file system)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * @author YuZhansheng
 * @desc Process file system data with Spark Streaming (both HDFS and the local file system)
 * @create 2019-02-19 21:31
 */
object FileWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("FileWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Monitor the /root/DataSet directory for new files
    val lines = ssc.textFileStream("/root/DataSet")
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Copy or move text files into the /root/DataSet directory and watch the console output.
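One caveat worth knowing: textFileStream only picks up files that appear in the monitored directory after the stream starts, and each file is read once, so it should be moved in whole rather than written in place. A minimal sketch, using /tmp/DataSet as a stand-in for /root/DataSet:

```shell
# Write the file somewhere else first, then mv it in: a mv within the same
# filesystem is atomic, so the stream never observes a half-written file.
mkdir -p /tmp/DataSet
echo "hello spark hello streaming" > /tmp/words.txt
mv /tmp/words.txt /tmp/DataSet/words.txt
```

Writing directly into the directory with a slow producer risks the file being picked up before it is complete.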
Case 3: Word count with Spark Streaming, writing the results to MySQL
Side note: I forgot the MySQL root password and it took a while to reset it, so I am recording the steps here (adapted from another blog post).
Steps:
1. First make sure the server is in a safe state, i.e. nobody can connect to MySQL at will.
While the root password is being reset, the MySQL database is completely unprotected,
and any user could log in and modify its data. You can approximate a safe state by
closing MySQL's external port and stopping Apache and all user processes. The safest
approach is to work on the server's console with the network cable unplugged.
2. Change the MySQL startup settings:
# vim /etc/my.cnf
Add one line to the [mysqld] section: skip-grant-tables
For example:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
skip-grant-tables
Save and quit vim.
3. Restart mysqld
# service mysqld restart
Stopping MySQL: [ OK ]
Starting MySQL: [ OK ]
4. Log in and change MySQL's root password
# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 3.23.56
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> USE mysql ;
Database changed
mysql> UPDATE user SET Password = PASSWORD('new-password') WHERE User = 'root';
Query OK, 0 rows affected (0.00 sec)
Rows matched: 2 Changed: 0 Warnings: 0
mysql> flush privileges ;
Query OK, 0 rows affected (0.01 sec)
mysql> quit
5. Revert the MySQL login settings
# vim /etc/my.cnf
Remove the skip-grant-tables line you just added to the [mysqld] section
Save and quit vim
6. Restart mysqld again
# service mysqld restart
Stopping MySQL: [ OK ]
Starting MySQL: [ OK ]
Preparation: install MySQL first, start the MySQL service with service mysqld start (or service mysql start), log in with the mysql client, and create the database and table.
mysql> create database spark;
Query OK, 1 row affected (0.00 sec)
mysql> use spark;
Database changed
mysql> create table wordcount(
-> word varchar(50) default null,
-> wordcount int(10) default null
-> );
The project needs the MySQL JDBC driver, so add it to the pom file first:
<!-- MySQL JDBC driver -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
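The examples also need the Spark Streaming artifact itself on the classpath; if it is not already in the pom, a dependency along these lines is required (the Scala suffix and version below are assumptions — match them to your Spark installation):

```xml
<!-- Spark Streaming; artifact suffix and version are examples, match your setup -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
```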
import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * @author YuZhansheng
 * @desc Word count with Spark Streaming, writing the results to MySQL
 * @create 2019-02-20 11:04
 */
object ForeachRDDApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    //result.print() // this would only print the counts to the console
    //TODO write the results to MySQL
    result.foreachRDD(rdd => {
      val connection = createConnection() // created on the driver...
      rdd.foreach { record =>
        // ...but used inside a closure that is shipped to the executors
        val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
        connection.createStatement().execute(sql)
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }

  // Get a MySQL connection
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/spark", "root", "18739548870yu")
  }
}
Running the code above throws a serialization exception:
19/02/20 11:27:18 ERROR JobScheduler: Error running job streaming job 1550633235000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at com.xidian.spark.ForeachRDDApp$$anonfun$main$1.apply(ForeachRDDApp.scala:30)
at com.xidian.spark.ForeachRDDApp$$anonfun$main$1.apply(ForeachRDDApp.scala:28)
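The root cause: `connection` is created on the driver, but the closure passed to `rdd.foreach` captures it and must be serialized and shipped to the executors, and a JDBC `Connection` is not serializable. The same failure can be reproduced stand-alone (`FakeConnection` is a hypothetical stand-in for `java.sql.Connection`, not real Spark code):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for java.sql.Connection: note it does NOT extend Serializable.
class FakeConnection

// Stand-in for the closure Spark ships to executors: Serializable itself,
// but it captures a non-serializable field.
case class ShippedClosure(conn: FakeConnection) extends Serializable

// Java serialization succeeds or fails the same way Spark's
// ClosureCleaner.ensureSerializable check does in the stack trace above.
def canSerialize(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch {
    case _: NotSerializableException => false
  }
```

`canSerialize(ShippedClosure(new FakeConnection))` is false: marking the outer closure Serializable does not help while it still holds a non-serializable connection, which is why the fix moves connection creation onto the executors.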
Fix the code:
import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * @author YuZhansheng
 * @desc Word count with Spark Streaming, writing the results to MySQL
 * @create 2019-02-20 11:04
 */
object ForeachRDDApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    //result.print() // this would only print the counts to the console
    //TODO write the results to MySQL
    // Creating the connection on the driver and using it inside rdd.foreach
    // throws the serialization exception above, so instead create one
    // connection per partition, on the executor that owns the partition:
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        val connection = createConnection()
        partitionOfRecords.foreach(record => {
          val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
          connection.createStatement().execute(sql)
        })
        connection.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }

  // Get a MySQL connection
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/spark", "root", "18739548870yu")
  }
}
Test: nc -lk 6789
a d ff g h
In the MySQL table:
mysql> select * from wordcount;
+------+-----------+
| word | wordcount |
+------+-----------+
| d | 1 |
| h | 1 |
| ff | 1 |
| a | 1 |
| g | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Problems with this example:
1. Existing rows are never updated; every record is written as a plain insert.
Improvement: before inserting, check whether the word already exists, and update it if so, insert it otherwise. In real production systems, however, HBase or Redis is usually used for this kind of store instead.
2. A new connection is created for every RDD partition; switching to a connection pool would be more efficient.
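The accumulation semantics the first improvement should implement can be sketched without a database (names here are illustrative):

```scala
import scala.collection.mutable

// In-memory model of the upsert: a new word is inserted, an existing word
// has its count accumulated instead of producing a duplicate row.
def upsertBatch(table: mutable.Map[String, Int], batch: Seq[(String, Int)]): Unit =
  batch.foreach { case (word, count) =>
    table.update(word, table.getOrElse(word, 0) + count)
  }
```

On the SQL side, assuming `word` is given a PRIMARY KEY or UNIQUE index, MySQL can do this in a single statement with `INSERT INTO wordcount (word, wordcount) VALUES (?, ?) ON DUPLICATE KEY UPDATE wordcount = wordcount + VALUES(wordcount)`, ideally via a PreparedStatement, which also avoids the string-concatenation SQL injection risk in the code above.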