Spark Streaming: foreachRDD Output

Official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
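Because func runs in the driver while the records live on the executors, the common mistake is to open the database connection once in the driver and use it inside rdd.foreach: the connection object cannot be serialized and shipped to the workers, so the job fails. The official guide illustrates the pitfall with pseudocode roughly like this (createNewConnection and connection.send are placeholders, not real APIs):

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection() // executed at the driver
      rdd.foreach { record =>
        connection.send(record) // attempted at the worker: serialization error
      }
    }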

Implementing this in IDEA

The data will be written to MySQL.
Add the MySQL dependency to the pom file:

<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>${mysql.version}</version>
</dependency>
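The ${mysql.version} placeholder assumes a matching property is declared in the pom's <properties> block; for example (the version number below is only illustrative):

<properties>
  <mysql.version>5.1.47</mysql.version>
</properties>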

Set up the database in Xshell:

[root@hadoop001 ~]# su - mysqladmin
Last login: Wed Jan  9 10:12:35 CST 2019 on pts/1
[mysqladmin@hadoop001 ~]$ mysql -uroot -p123456

mysql> create database g5_spark 
    -> ;
Query OK, 1 row affected (0.07 sec)

mysql> use g5_spark 
Database changed
mysql> show tables;
Empty set (0.00 sec)

mysql>
mysql> create table wc(
    -> word varchar(20),
    -> count int(10)
    -> );

The code in IDEA:

package g5.learning

import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDApp {

  def main(args: Array[String]): Unit = {

    // Setup: local mode with 2 threads, 10-second batch interval
    val conf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Business logic: word count over a socket stream
    val lines = ssc.socketTextStream("hadoop001", 9999)
    val results = lines.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)

    results.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        // One connection per partition, created on the executor
        val connection = createConnection()

        partition.foreach(pair => {
          val sql = s"insert into wc(word,count) values('${pair._1}',${pair._2})"
          connection.createStatement().execute(sql)
        })

        connection.close()
      })
    })

    // Start the streaming application
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate

  }

  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://hadoop001:3306/g5_spark", "root", "123456")
  }

}
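Note that building the SQL string by interpolation works for this demo but breaks on words containing quotes and is open to SQL injection. A PreparedStatement is the safer variant; a sketch of the per-partition body (not part of the original post):

    val ps = connection.prepareStatement("insert into wc(word,count) values(?,?)")
    partition.foreach(pair => {
      ps.setString(1, pair._1) // bind word
      ps.setInt(2, pair._2)    // bind count
      ps.executeUpdate()
    })
    ps.close()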

Start netcat: nc -lk 9999
Run the application in IDEA
Check the result: mysql> select * from wc;
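For example, typing a,a,b into the nc session and waiting for one 10-second batch should, assuming the setup above, leave rows like these in the table (the code inserts rather than upserts, so later batches append new rows instead of updating counts):

mysql> select * from wc;
+------+-------+
| word | count |
+------+-------+
| a    |     2 |
| b    |     1 |
+------+-------+
2 rows in set (0.00 sec)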

This code uses the second of the three approaches described in the official guide:

The first approach creates a connection per record. This is a non-starter: opening a connection costs time and resources, and paying that cost for every single record is unnecessary.
The second approach creates one connection per partition and uses it for all records in that partition. Connections are still created per partition and per batch, but this is a large improvement.
The third approach builds on the second: maintain a static pool of connections and reuse them across batches (see the sketch below).
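A minimal sketch of the third approach, following the ConnectionPool pseudocode in the official guide. The pool object here is hypothetical and hand-rolled for illustration; in practice a pooling library such as HikariCP or c3p0 would be used:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// A lazily initialized, static pool of connections: one instance per executor JVM
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val cached = pool.poll() // null if the pool is empty
    if (cached != null && !cached.isClosed) cached
    else {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://hadoop001:3306/g5_spark", "root", "123456")
    }
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}

// Usage inside foreachPartition: borrow from the pool instead of opening a new connection
results.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    val connection = ConnectionPool.getConnection()
    partition.foreach(pair => {
      val sql = s"insert into wc(word,count) values('${pair._1}',${pair._2})"
      connection.createStatement().execute(sql)
    })
    ConnectionPool.returnConnection(connection) // return for reuse instead of closing
  })
})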

Reposted from blog.csdn.net/qq_43688472/article/details/86644474