Setting Up a Local Spark Development Environment on Windows 10

System Environment Setup

1. Install the JDK (Java 8+; Spark 2.3 no longer supports Java 7)

a. Set the JAVA_HOME environment variable

b. Append ;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin to the Path variable

c. Append .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar to the Classpath variable
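
To confirm the variables took effect, open a new command prompt and check the versions (standard JDK commands, shown here only as a quick sanity check):

java -version
javac -version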

2. Install Scala

Download: http://www.scala-lang.org/download/all.html

Version: 2.12.5

a. Set the SCALA_HOME environment variable

b. Append ;%SCALA_HOME%\bin to the Path variable

c. Append .;%SCALA_HOME%\bin; to the Classpath variable
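
Likewise, verify the Scala installation from a new command prompt:

scala -version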

3. Install Hadoop

Download: https://archive.apache.org/dist/hadoop/common/

Version: 3.1.0

a. Set the HADOOP_HOME environment variable

b. Append ;%HADOOP_HOME%\bin to the Path variable
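
To check that HADOOP_HOME and Path are set correctly, you can run the following from a new command prompt (note that on Windows, hadoop.cmd may complain if JAVA_HOME points to a path containing spaces, such as one under C:\Program Files):

hadoop version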

4. Install Spark. Download: http://spark.apache.org/downloads.html

Version: 2.3.0 (the package pre-built for Hadoop 2.7)

After extracting the archive, run:

spark-2.3.0-bin-hadoop2.7\bin\spark-shell

Once everything above is in place, open a new console and run spark-shell again. If you still see error logs (typically a complaint that winutils.exe cannot be located), check whether your hadoop-3.1.0\bin directory contains a winutils.exe file. If it does not, download winutils.exe from https://github.com/steveloughran/winutils : browse to the directory that matches your Hadoop version, open its bin folder, and download winutils.exe from there. Then copy it into hadoop-3.1.0\bin so that the directory contains winutils.exe.

After that, run the following command in the console to fix the permissions:

hadoop-3.1.0\bin\winutils.exe chmod 777 /tmp/hive

# /tmp/hive is the directory used to store data
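
With winutils.exe in place and the permissions fixed, reopen the console, start spark-shell again, and run a small job at the scala> prompt as a final sanity check (sc is the SparkContext that spark-shell creates for you; the sum of 1..100 is 5050):

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050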

5. Install Maven. Download: http://maven.apache.org/download.cgi

a. Set the MAVEN_HOME environment variable

b. Append ;%MAVEN_HOME%\bin; to the Path variable
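
Verify the installation from a new command prompt:

mvn -v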

6. Install IntelliJ IDEA. Download: http://www.jetbrains.com

Run the downloaded .exe installer.

IDE Setup

1. Configure Maven. Menu path: File >> Settings >> Build, Execution, Deployment >> Build Tools >> Maven. Set the Maven home directory to the Maven installation path.

2. Install the Scala plugin. Menu path: File >> Settings >> Plugins >> Browse Repositories. Search for Scala and install Scala (Version: 2018.1.9), SBT (Version: 1.8.0), and SBT Executor (Version: 1.2.1).

3. Create a new Maven project

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>spark.orrin.com</groupId>
    <artifactId>word-count</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <spark.version>2.3.0</spark.version>
        <scala.version>2.11</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>

            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>

        </plugins>
    </build>
    
</project>
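
Before moving on, you can optionally verify that the project builds from the command line (unit tests are skipped anyway, since the surefire plugin above is configured with skip=true):

mvn clean package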

4. Configure the JDK and Scala SDK. Menu path: File >> Project Structure >> Platform Settings

Under SDKs, add the JDK installed above.

Under Global Libraries, add a Scala SDK, select the Scala installation from above, choose version 2.11, and add Scala SDK 2.11 to the current project. (The Spark 2.3.0 artifacts in the pom are built for Scala 2.11, so the project SDK should be 2.11 even if a newer Scala is installed system-wide.)

5. Run an example class. Create a new Scala Class with Name = WordCount and Kind = Object:

package com.sparkstudy.wordcount

import org.apache.spark.{SparkConf, SparkContext}

/**
  *
  * @author migu-orrin on 2018/5/3.
  */
object WordCount {
  def main(args: Array[String]) {
    /**
      * Initializing a SparkContext requires a SparkConf object,
      * which holds the various configuration parameters for the Spark cluster.
      */
    val conf = new SparkConf()
      .setMaster("local")      // run the computation locally
      .setAppName("WordCount") // set the application name

    // Every Spark program starts from a SparkContext
    val sc = new SparkContext(conf)
    // The statement above is equivalent to: val sc = new SparkContext("local", "WordCount")
    val data = sc.textFile("E:/data/wordcount.txt") // read a local file
    val result = data.flatMap(_.split(" ")) // the underscore is a placeholder; flatMap splits each line into words
      .map((_, 1))                          // turn each word into a (word, 1) key-value pair
      .reduceByKey(_ + _)                   // add up the values of pairs sharing the same key

    result.collect()    // bring the distributed RDD back to the driver as a local Scala array
      .foreach(println) // print each (word, count) pair
    result.saveAsTextFile("E:/data/wordcountres")
  }
}
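
The program reads E:/data/wordcount.txt, a plain text file of space-separated words. Any content will do; for example, a file with the following three lines (an illustrative input, not part of the original post) yields word counts matching the output below:

hello word hehe word
he hello big word
adfads hehe

Note that saveAsTextFile throws an exception if the output directory E:/data/wordcountres already exists, so delete it before re-running the program.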

Run the WordCount class; the output looks like this:

18/05/03 17:34:50 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 41 ms on localhost (executor driver) (1/1)
18/05/03 17:34:50 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/05/03 17:34:50 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:27) finished in 0.051 s
18/05/03 17:34:50 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:27, took 0.654681 s
(hehe,2)
(big,1)
(he,1)
(word,3)
(hello,2)
(adfads,1)
18/05/03 17:34:50 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
18/05/03 17:34:50 INFO SparkContext: Starting job: runJob at SparkHadoopWriter.scala:78
18/05/03 17:34:50 INFO DAGScheduler: Got job 1 (runJob at SparkHadoopWriter.scala:78) with 1 output partitions
18/05/03 17:34:50 INFO DAGScheduler: Final stage: ResultStage 3 (runJob at SparkHadoopWriter.scala:78)
18/05/03 17:34:50 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)


Reprinted from my.oschina.net/orrin/blog/1812035