(Repost) Reference: https://www.cnblogs.com/NextNight/p/6703362.html
Existing environment:
A Hadoop 2.9.0 cluster is already installed (see earlier blog posts for the setup).
Steps on the master node:
1 Install Scala
1.1 Download the package
Note: Starting version 2.0, Spark is built with Scala 2.11 by default.
Scala 2.10 users should download the Spark source package and build
with Scala 2.10 support.
Spark requires a matching Scala version; for Spark 2.4.0, download Scala 2.11.
Download page: https://www.scala-lang.org/download/2.11.12.html
Choose the tgz package at the bottom of the page.
1.2 Upload and install
Upload to: /opt/nfs_share/software
Create the scala directory: mkdir -p /opt/scala
Extract:
cd /opt/scala
tar -zxvf /opt/nfs_share/software/scala-2.11.12.tgz
1.3 Configure environment variables
vim ~/.bash_profile
# append at the end
#scala environment
export SCALA_HOME=/opt/scala/scala-2.11.12
export PATH=$PATH:$SCALA_HOME/bin
Apply immediately: source ~/.bash_profile
1.4 Verify
scala -version
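As a quick sanity check (assuming the PATH change above has already been sourced), the version line can be filtered for the expected 2.11; this is a small sketch, guarded so it is skipped on machines without Scala:

```shell
# scala -version prints to stderr, so redirect before filtering; an empty
# result means the wrong Scala (or none) is on PATH
if command -v scala >/dev/null 2>&1; then
  scala -version 2>&1 | grep "version 2.11" || echo "unexpected Scala version on PATH"
fi
```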
2 Install Spark
2.1 Download the package
Version notes: http://spark.apache.org/docs/latest/index.html
Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.0
uses Scala 2.11. You will need to use a compatible Scala version
(2.11.x).
Note that support for Java 7, Python 2.6 and old Hadoop versions before 2.6.5 were removed as of Spark 2.2.0.
Support for Scala 2.10 was removed as of 2.3.0.
Official downloads page: http://spark.apache.org/downloads.html
Mirror: https://mirrors.tuna.tsinghua.edu.cn/apache/spark/
Download: spark-2.4.0-bin-without-hadoop.tgz, which is compatible with any Hadoop version.
2.2 Upload and install
Upload to: /opt/nfs_share/software
Create the install directory: mkdir -p /opt/spark
Extract:
cd /opt/spark
tar -zxvf /opt/nfs_share/software/spark-2.4.0-bin-without-hadoop.tgz
2.3 Configure environment variables
vim ~/.bash_profile
# append:
#spark environment
export SPARK_HOME=/opt/spark/spark-2.4.0-bin-without-hadoop
export PATH=$PATH:$SPARK_HOME/bin
Do not add $SPARK_HOME/sbin to PATH here: its script names (e.g. start-all.sh, stop-all.sh) clash with the scripts of the same names that Hadoop ships.
source ~/.bash_profile
2.4 Edit the Spark configuration files
2.4.1 spark-env.sh:
This file configures the environment Spark runs tasks in and must match your machines: keep the core and memory settings within what the VMs actually have, and review settings that ship with defaults especially carefully before changing them.
cd $SPARK_HOME/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/hadoop-2.9.0/bin/hadoop classpath)
#----config
SPARK_LOCAL_DIRS=/opt/spark/spark-2.4.0-bin-without-hadoop/local #Spark scratch ("local") directory
SPARK_MASTER_HOST=hdp-01 #master hostname or IP (SPARK_MASTER_IP is deprecated since Spark 2.0)
SPARK_MASTER_WEBUI_PORT=8085 #master web UI port
#export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4" #default cores per application (e.g. for spark-shell)
SPARK_WORKER_CORES=1 #CPU cores per worker
SPARK_WORKER_MEMORY=512m #memory per worker
SPARK_WORKER_DIR=/opt/spark/spark-2.4.0-bin-without-hadoop/worker #worker work directory
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800" #enable automatic worker cleanup; TTL in seconds (7 days)
SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://hdp-01:9000/spark/history" #history server UI port, number of applications kept in the UI cache, and log directory on HDFS
SPARK_LOG_DIR=/opt/spark/spark-2.4.0-bin-without-hadoop/logs #Spark log directory
JAVA_HOME=/opt/java/jdk1.8.0_191 #Java install path
SCALA_HOME=/opt/scala/scala-2.11.12 #Scala install path
HADOOP_HOME=/opt/hadoop/hadoop-2.9.0 #Hadoop install root (not lib/native)
HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.9.0/etc/hadoop/ #Hadoop configuration directory
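A note on the first line above: the bin-without-hadoop build ships no Hadoop jars, so SPARK_DIST_CLASSPATH must hand Spark the full Hadoop classpath. A small sketch to preview what that line expands to, guarded so it is a no-op on machines without this Hadoop install:

```shell
# Ask the Hadoop launcher for its full classpath, exactly as the line in
# spark-env.sh does, and print it for inspection
HADOOP_BIN=/opt/hadoop/hadoop-2.9.0/bin/hadoop
if [ -x "$HADOOP_BIN" ]; then
  SPARK_DIST_CLASSPATH=$("$HADOOP_BIN" classpath)
  echo "$SPARK_DIST_CLASSPATH"
fi
```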
2.4.2 spark-defaults.conf
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
# settings
spark.master spark://hdp-01:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hdp-01:9000/spark/history
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 524m
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
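spark.eventLog.dir and the history server's log directory point at the same HDFS path, and it must exist before the first application runs or event logging fails at startup. A one-time setup step, guarded so it is skipped where the hdfs client is absent:

```shell
# Create the event-log directory on HDFS once, from any node
EVENT_LOG_DIR=/spark/history
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$EVENT_LOG_DIR"
fi
```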
2.4.3 slaves
Configure the worker nodes:
cp slaves.template slaves
vim slaves
# worker hostnames
hdp-01
hdp-03
hdp-04
3 Distribute the files to the other nodes
The scala directory:
scp -r /opt/scala/ hadoop@hdp-03:/opt/
scp -r /opt/scala/ hadoop@hdp-04:/opt/
The spark directory:
scp -r /opt/spark/ hadoop@hdp-03:/opt/
scp -r /opt/spark/ hadoop@hdp-04:/opt/
Distribute the environment variable file:
scp ~/.bash_profile hadoop@hdp-03:~/.bash_profile
scp ~/.bash_profile hadoop@hdp-04:~/.bash_profile
ssh hdp-03
source ~/.bash_profile
ssh hdp-04
source ~/.bash_profile
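The per-host copies above can be collapsed into one loop. This sketch echoes the commands by default (set DRY_RUN=0 to actually run them); the hostnames are the two workers from the slaves file:

```shell
# Echo instead of execute by default, so the commands can be reviewed first
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

for host in hdp-03 hdp-04; do
  run scp -r /opt/scala/ "hadoop@${host}:/opt/"
  run scp -r /opt/spark/ "hadoop@${host}:/opt/"
  run scp ~/.bash_profile "hadoop@${host}:~/.bash_profile"
done
```

The .bash_profile is read automatically at the next login on each node.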
4 Start the services
Start:
cd $SPARK_HOME
sbin/start-all.sh
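After start-all.sh, jps on each node is a quick way to confirm the daemons: hdp-01 should show a Master (plus a Worker, since it is also listed in slaves), and hdp-03/hdp-04 each a Worker. Note that start-all.sh does not start the history server configured in spark-env.sh. A guarded sketch:

```shell
# Confirm the Spark daemons where the JDK's jps tool is available
if command -v jps >/dev/null 2>&1; then
  jps | grep -E 'Master|Worker' || echo "no Spark daemons found"
fi
# The history server must be launched separately (run from $SPARK_HOME;
# requires hdfs://hdp-01:9000/spark/history to exist)
if [ -x sbin/start-history-server.sh ]; then
  sbin/start-history-server.sh
fi
```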
Check the services
In the browser, visit 192.168.1.126:8085 (8085 is the SPARK_MASTER_WEBUI_PORT set in spark-env.sh and can be changed). If the master page comes up and lists the workers, the cluster started successfully.
Run a Spark example to compute an approximation of pi: ./bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
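The same example can also be submitted explicitly with spark-submit, naming the standalone master from spark-defaults.conf; the examples jar path below matches the Spark 2.4.0 layout. Guarded so it is a no-op where Spark is not installed:

```shell
# Submit SparkPi straight to the standalone master; the final argument is
# the number of partitions (more partitions gives a closer approximation)
MASTER_URL=spark://hdp-01:7077  # as set in spark-defaults.conf
if [ -x "${SPARK_HOME:-}/bin/spark-submit" ]; then
  "${SPARK_HOME:-}/bin/spark-submit" --master "$MASTER_URL" \
    --class org.apache.spark.examples.SparkPi \
    "${SPARK_HOME:-}/examples/jars/spark-examples_2.11-2.4.0.jar" 100
fi
```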