Spark Streaming in Action: WordCount (with accumulation)

The previous example has a problem:

The word counts within each batch are computed correctly, but the results are not accumulated across batches!

To accumulate across batches, use updateStateByKey(func) to update the running state.
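For reference, the overload of updateStateByKey used below has roughly this shape (simplified from Spark's PairDStreamFunctions):

// For a DStream[(K, V)]: the update function receives all of a key's values
// from the current batch plus the key's previous state, and returns the new state.
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]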

Code

package SparkStreaming

/**
  * Created by 一个蔡狗 on 2020/4/10.
  *
  * The word counts within each batch are computed correctly, but the results are not accumulated!
  * To accumulate across batches, use updateStateByKey(func) to update the state.
  */


import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}


object SparkStreaming_02 {



  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    //spark.master should be set as local[n], n > 1
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(5)) // Seconds(5): cut the stream into one RDD every 5 seconds
    //requirement failed: ....Please set it by StreamingContext.checkpoint().
    //Note: updateStateByKey below accumulates the current batch's data with the historical data.
    //Where does that history live? We have to set a checkpoint directory for it.
    ssc.checkpoint("./") // local dir while developing; use HDFS in production
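    // In production, point the checkpoint at fault-tolerant storage such as HDFS
    // instead of the local working directory, e.g. (hypothetical path):
    //   ssc.checkpoint("hdfs://node001:8020/spark-checkpoint")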
    //2. Listen on a socket to receive data
    //ReceiverInputDStream is all of the received data, batched into RDDs and wrapped as a DStream; operating on the DStream is operating on those RDDs
    val dataDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node001",9999)
    //3. Process the data
    val wordDStream: DStream[String] = dataDStream.flatMap(_.split(" "))
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_,1))
    //val wordAndCount: DStream[(String, Int)] = wordAndOneDStream.reduceByKey(_+_)
    //==================== Use updateStateByKey to accumulate the current batch with the history ====================
    val wordAndCount: DStream[(String, Int)] = wordAndOneDStream.updateStateByKey(updateFunc)
    wordAndCount.print()
    ssc.start() // start the computation
    ssc.awaitTermination() // wait for a graceful stop
  }
  //currentValues: the values for a key in the current batch, e.g. 1,1,1 (taking hadoop from the test data as an example)
  //historyValue: the accumulated history; absent (i.e. 0) on the first batch, 3 on the second
  //Goal: return current data + historical data as the new result (the next batch's history)
  def updateFunc(currentValues: Seq[Int], historyValue: Option[Int]): Option[Int] = {
    // currentValues: the values from the current batch
    // historyValue: the accumulated value from previous batches
    val result: Int = currentValues.sum + historyValue.getOrElse(0)
    // Some(x) keeps x as the key's new state; returning None would drop the key's state
    Some(result)
  }
}
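Since updateFunc is an ordinary function, its behavior is easy to check in isolation. A minimal sketch, reusing the hadoop example from the comments above:

// Batch 1: hadoop appears 3 times and there is no history yet
updateFunc(Seq(1, 1, 1), None)  // returns Some(3)
// Batch 2: hadoop appears once more; the history is now 3
updateFunc(Seq(1), Some(3))     // returns Some(4)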

 

How to run

1. First run nc -lk 9999 on node001 (the host and port the code connects to)

2. Then execute the above code

3. Keep typing words into the nc session from step 1, for example:

hadoop spark sqoop hadoop spark hive hadoop

4. Observe the IDEA console output

Every 5 seconds, Spark Streaming counts the words received in the current 5-second batch, and updateStateByKey accumulates each batch's counts into a running total.
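For the sample input above (assuming the whole line arrives within one 5-second batch), the print() output would look roughly like this; the timestamp and the ordering of the pairs will vary:

-------------------------------------------
Time: <batch time> ms
-------------------------------------------
(hadoop,3)
(spark,2)
(sqoop,1)
(hive,1)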
