sparkStreaming-WordCountDemo处理socket数据

步骤一:配置环境,特别注意虚拟机的hosts文件和window中的hosts文件,以及关闭防火墙

     linux:   sudo vim /etc/hosts  添加ip到name的映射,从而让外网访问你时用name便可

     window:   C:\Windows\System32\drivers\etc 将linux /etc/hosts 中的映射写进去,这用于在window上可以访问linux

测试:ping inux虚拟机ip /name (测试网络互通性)

          window cmd : telnet +linux虚拟机ip /name + 端口 (测试端口是否能被访问到)


步骤二:在idea上编写wordcount案例! 

            

package com.imooc.spark
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
 * Hello world!
 *
 */
/**
  * Spark Streaming处理Socket数据
  *
  * 测试: nc
  */
object NetworkWordCount {
  /**官方介绍
    * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
    *
    * Usage: NetworkWordCount <hostname> <port>
    * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
    *
    * To run this on your local machine, you need to first run a Netcat server
    *    `$ nc -lk 9999`
    * and then run the example
    *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
    */

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    /**
      * 创建StreamingContext需要两个参数:SparkConf和batch interval
      */
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("hadoop002", 6789)

    val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

pom.xml

<properties>
  <scala.version>2.11.8</scala.version>
  <spark.version>2.2.0</spark.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
  <!-- Spark Streaming 依赖-->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${spark.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.5</version>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_2.11</artifactId>
    <version>2.6.5</version>
  </dependency>

  <dependency>
    <groupId>net.jpountz.lz4</groupId>
    <artifactId>lz4</artifactId>
    <version>1.3.0</version>
  </dependency>
</dependencies>

步骤三:启动测试:

扫描二维码关注公众号,回复: 1555190 查看本文章

linux 输入nc -lk 6789 

window上的idea启动项目

在linux nc -lk 6789 下输入 a b a b c d 

在window的idea上看到统计的结果


可能异常:connnect time out!

原因:1.防火墙没有关闭, 

          2. linux 机器ip name映射没有添加到windows的hosts文件中,

          3.linux 上机器name 没生成成功,ip 到name 映射没有生效

          4.代码上ip/name 和端口写错了

猜你喜欢

转载自blog.csdn.net/qq_35394891/article/details/80593561