Setting Up a Local Spark Development Environment on Windows 10
System Environment Setup
1. JDK installation (Spark 2.3 requires Java 8+)
a. Set the JAVA_HOME variable
b. Append to the Path variable: ;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin
c. Append to the Classpath variable: .;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar
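Steps a–c can also be done from a Command Prompt instead of the System Properties dialog. A sketch, assuming a hypothetical JDK install path (substitute your own), and noting that setx truncates values longer than 1024 characters, so very long Path variables are better edited through the GUI:

```bat
rem Sketch of steps a-c; the JDK path is a hypothetical example.
set "JAVA_HOME=C:\Program Files\Java\jdk1.8.0_171"
setx JAVA_HOME "%JAVA_HOME%"
setx Path "%Path%;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin"
setx Classpath ".;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar"
rem setx does not affect the current session; open a new console and verify:
java -version
```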
2. Scala installation
Download: http://www.scala-lang.org/download/all.html
Choose a 2.11.x release (e.g. 2.11.12). Spark 2.3.0 is built against Scala 2.11, so a 2.12.x install will fail with binary-incompatibility errors; this also matches the scala.version used in the pom.xml below.
a. Set the SCALA_HOME variable
b. Append to the Path variable: ;%SCALA_HOME%\bin
c. Append to the Classpath variable: .;%SCALA_HOME%\bin;
3. Hadoop installation
Download: https://archive.apache.org/dist/hadoop/common/
Choose version: 3.1.0
a. Set the HADOOP_HOME variable
b. Append to the Path variable: ;%HADOOP_HOME%\bin
4. Spark installation
Download: http://spark.apache.org/downloads.html
Choose version: 2.3.0 (pre-built for Hadoop 2.7)
After extracting, run:
spark-2.3.0-bin-hadoop2.7\bin\spark-shell
With all of the above in place, open a new console and run spark-shell again. If it still fails with errors about a missing winutils.exe (a common problem on Windows), check whether your hadoop-3.1.0\bin directory contains a winutils.exe file. If not, download one from https://github.com/steveloughran/winutils : open the directory matching your Hadoop version, enter its bin directory, and download winutils.exe. Put the file into hadoop-3.1.0\bin and confirm it is there.
Then run the following command in the console to fix the directory permissions:
hadoop-3.1.0\bin\winutils.exe chmod 777 /tmp/hive
# /tmp/hive is the directory where Spark SQL stores temporary data
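If chmod complains that the path does not exist, the directory has to be created first. Note that /tmp/hive resolves against the drive spark-shell is launched from, assumed below to be C:. A sketch:

```bat
rem Assuming spark-shell is launched from the C: drive.
mkdir C:\tmp\hive
hadoop-3.1.0\bin\winutils.exe chmod 777 /tmp/hive
rem Verify the permissions are now rwxrwxrwx:
hadoop-3.1.0\bin\winutils.exe ls /tmp/hive
```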
5. Maven installation
Download: http://maven.apache.org/download.cgi
a. Set the MAVEN_HOME variable
b. Append to the Path variable: ;%MAVEN_HOME%\bin;
6. IntelliJ IDEA installation
Download: http://www.jetbrains.com
Run the downloaded .exe installer.
IDE Environment Setup
1. Configure Maven. Menu path: File >> Settings >> Build, Execution, Deployment >> Build Tools >> Maven; set the Maven home directory to the installation path above.
2. Install the Scala plugin. Menu path: File >> Settings >> Plugins >> Browse Repositories; search for Scala and install Scala (version 2018.1.9), SBT (version 1.8.0), and SBT Executor (version 1.2.1).
3. Create a new Maven project with the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>spark.orrin.com</groupId>
    <artifactId>word-count</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <spark.version>2.3.0</spark.version>
        <scala.version>2.11</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
4. Configure the JDK and Scala SDK. Menu path: File >> Project Structure >> Platform Settings.
Under SDKs, add the JDK installed above.
Under Global Libraries, add a Scala SDK, pick the Scala installation above (version 2.11), and add Scala SDK 2.11 to the current project.
5. Run an example class. Create a new Scala class with Name = WordCount and Kind = Object:
package com.sparkstudy.wordcount

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author migu-orrin on 2018/5/3.
 */
object WordCount {
  def main(args: Array[String]) {
    /**
     * Initializing a SparkContext requires a SparkConf object,
     * which holds the configuration parameters for the Spark cluster.
     */
    val conf = new SparkConf()
      .setMaster("local")      // run locally
      .setAppName("WordCount") // set the application name
    // Every Spark program starts from a SparkContext
    val sc = new SparkContext(conf)
    // The above is equivalent to: val sc = new SparkContext("local", "WordCount")
    val data = sc.textFile("E:/data/wordcount.txt") // read a local file
    val result = data.flatMap(_.split(" ")) // split each line into words (the underscore is a placeholder)
      .map((_, 1))                          // turn each word into a (word, 1) key-value pair
      .reduceByKey(_ + _)                   // sum the counts for each key
    result.collect()    // bring the distributed RDD back to the driver as a local Scala array
      .foreach(println) // print each (word, count) pair
    result.saveAsTextFile("E:/data/wordcountres")
  }
}
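The flatMap / map / reduceByKey pipeline above can be tried out on a plain Scala collection first, with no Spark at all. This standalone sketch (not part of the project above) reproduces the counting logic, using groupBy on a local Seq in place of reduceByKey on an RDD:

```scala
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello word", "hello big word")
    val counts = lines
      .flatMap(_.split(" "))                // split every line into words
      .groupBy(identity)                    // group identical words together
      .map { case (w, ws) => (w, ws.size) } // count each group, like reduceByKey(_ + _)
    counts.foreach(println)                 // (hello,2), (big,1), (word,2), in some order
  }
}
```

Because everything here is a local collection, it runs instantly and is a convenient way to check the transformation logic before paying Spark's startup cost.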
Run the WordCount class; the output looks like this:
18/05/03 17:34:50 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 41 ms on localhost (executor driver) (1/1)
18/05/03 17:34:50 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/05/03 17:34:50 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:27) finished in 0.051 s
18/05/03 17:34:50 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:27, took 0.654681 s
(hehe,2)
(big,1)
(he,1)
(word,3)
(hello,2)
(adfads,1)
18/05/03 17:34:50 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
18/05/03 17:34:50 INFO SparkContext: Starting job: runJob at SparkHadoopWriter.scala:78
18/05/03 17:34:50 INFO DAGScheduler: Got job 1 (runJob at SparkHadoopWriter.scala:78) with 1 output partitions
18/05/03 17:34:50 INFO DAGScheduler: Final stage: ResultStage 3 (runJob at SparkHadoopWriter.scala:78)
18/05/03 17:34:50 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)