Spark First Experience

        In this era, a technical person can hardly justify not knowing Spark: whenever big data comes up, Hadoop and Spark are mentioned in the same breath. I have been following Spark for quite some time, and this time I used my latest Hadoop experimental cluster to try out the newest Spark release firsthand, ups and downs included.

        Environment:

        CentOS 6.7, 64-bit

        Hadoop 2.7.2 (HA with QJM)

        Spark 1.6.1

        Scala 2.10.5

     

        I am running Spark in Standalone mode. Hadoop 2.7.2 is already installed with LZO compression support, and Hive 1.2.1 runs well on it. Spark and Hadoop run under the same user and group (hadoop:hadoop), with HADOOP_HOME=/home/hadoop/hadoop-2.7.2. Spark is co-deployed with Hadoop: every Hadoop node also runs Spark.

        Download the pre-built package directly from http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz , decompress it, and make the following changes to get Spark running.
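        The download-and-unpack steps can be sketched as below; the mirror URL and install paths here are assumptions, so pick a nearby mirror and adjust the paths to your own layout:

```shell
# Fetch the pre-built "without hadoop" package (archive mirror assumed)
wget http://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz

# Unpack under the hadoop user's home and keep a short, version-free path
tar -xzf spark-1.6.1-bin-without-hadoop.tgz -C /home/hadoop
ln -s /home/hadoop/spark-1.6.1-bin-without-hadoop /home/hadoop/spark
```

        The symlink keeps SPARK_HOME stable (/home/hadoop/spark, as used in the configuration below) across future upgrades.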

      (1) Change the content of spark-env.sh file

#SYSTEM
JAVA_HOME=/usr/local/jdk
SCALA_HOME=/usr/local/scala
HADOOP_HOME=/home/hadoop/hadoop-2.7.2
HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.2/etc/hadoop
SPARK_DIST_CLASSPATH=$(/home/hadoop/hadoop-2.7.2/bin/hadoop classpath) 
#SPARK_DIST_CLASSPATH=$(hadoop --config /home/hadoop/hadoop-2.7.2/etc/hadoop classpath)
#export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/home/hadoop/hadoop-2.7.2/share/hadoop/tools/lib/*"
#spark
SPARK_HOME=/home/hadoop/spark
SPARK_MASTER_IP=lrts5
SPARK_WORKER_CORES=4
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=4g
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=1g
#spark
SPARK_WORKER_DIR=/home/hadoop/spark/work
SPARK_LOG_DIR=/home/hadoop/spark/logs
SPARK_PID_DIR=/home/hadoop/spark/pid
#LZO
export SPARK_CLASSPATH=/home/hadoop/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar 
export SPARK_CLASSPATH=$SPARK_CLASSPATH:$CLASSPATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=/home/hadoop/hadoop-2.7.2/etc/hadoop

      (2) Change the contents of the slaves file

lrts5
lrts6
lrts7
lrts8
      (3) Copy Hadoop’s hdfs-site.xml and core-site.xml files to the spark/conf directory
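        Assuming the paths from the environment above, step (3) is just two copies:

```shell
# Give Spark the HDFS HA client configuration (paths match the layout above)
cp /home/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml /home/hadoop/spark/conf/
cp /home/hadoop/hadoop-2.7.2/etc/hadoop/core-site.xml /home/hadoop/spark/conf/
```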

      (4) Append the following content to the spark-defaults.conf file

spark.files file:///home/hadoop/spark/conf/hdfs-site.xml,file:///home/hadoop/spark/conf/core-site.xml
          Without this setting, the executors have no way to resolve the HA nameservice name (mycluster) into actual NameNode addresses, and the job fails with:

java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:418)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:231)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:139)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:510)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
    at org.apache.spark.rdd.RDD

      (5) Read an LZO file from HDFS and process it in splits

import org.apache.hadoop.io._
import com.hadoop.mapreduce._
val data = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs://mycluster/user/hive/warehouse/logs_app_nginx/logdate=20160322/loghost=70/var.log.nginx.access_20160322.log.70.lzo")
data.count()
      (6) Read files under HDFS directories and subdirectories matched by a wildcard, and process them in splits

import org.apache.hadoop.io._
import com.hadoop.mapreduce._
val dirdata = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs://mycluster/user/hive/warehouse/logs_app_nginx/logdate=20160322/loghost=*/")
dirdata.count()
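        Each record loaded this way is a (LongWritable, Text) pair whose value is one raw nginx access-log line, so a natural next step is to parse the lines before aggregating. Below is a minimal pure-Scala sketch; the combined-log field layout is an assumption, so adjust the pattern to your actual log_format:

```scala
// Minimal parser for an nginx "combined"-style access-log line.
// The field layout is an assumption; adjust the pattern to your log_format.
case class AccessRecord(ip: String, method: String, path: String, status: Int)

object NginxLogParser {
  // client - user [time] "METHOD path proto" status ...
  private val LinePattern =
    """^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}).*$""".r

  def parse(line: String): Option[AccessRecord] = line match {
    case LinePattern(ip, method, path, status) =>
      Some(AccessRecord(ip, method, path, status.toInt))
    case _ => None // malformed lines are dropped rather than crashing the job
  }
}
```

        In spark-shell this composes with the RDD above, e.g. dirdata.map(_._2.toString).flatMap(line => NginxLogParser.parse(line)).count() counts only the well-formed lines.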


Reference:

http://stackoverflow.com/questions/33174386/accessing-hdfs-ha-from-spark-job-unknownhostexception-error


Origin blog.csdn.net/zhangzhaokun/article/details/51030612