Install a Spark pseudo-distributed cluster + the Spark version of WordCount

Preliminary preparation: a cluster of 3 or more Linux virtual machines (CentOS recommended), with password-free SSH login configured between them
                  a Hadoop pseudo-distributed cluster
                  JDK installed on every virtual machine
 
 
 
 1. Download the pre-built Spark package from the official Spark website, as shown below

2. Upload the file to the virtual machine that will act as the Master, and extract it to the installation path /home/hadoop/apps

tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz  -C /home/hadoop/apps
 3. The directory name is too long, so rename it to spark with mv
mv spark-2.3.0-bin-hadoop2.7 spark
 
drwxr-xr-x. 10 hadoop hadoop 194 Apr 27 09:41 hadoop
drwxrwxr-x.  8 hadoop hadoop 159 Apr 28 07:47 hive
drwxr-xr-x. 13 hadoop hadoop 211 Feb 23 03:42 spark
 4. Enter spark's conf directory and rename spark-env.sh.template to spark-env.sh
mv spark-env.sh.template spark-env.sh
 5. vi spark-env.sh and configure JAVA_HOME, SPARK_MASTER_IP, and SPARK_MASTER_PORT
cat /etc/profile
Copy the JDK path from it, e.g. export JAVA_HOME=/usr/local/jdk1.8.0_162
Fill in the local JDK path and the Master's IP address (or host name); 7077 is Spark's default port, used for RPC communication.
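For example, a minimal spark-env.sh could contain the lines below (mini0 is assumed here as the Master host name; substitute your own JDK path and Master host):

export JAVA_HOME=/usr/local/jdk1.8.0_162
export SPARK_MASTER_IP=mini0
export SPARK_MASTER_PORT=7077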
6. mv slaves.template slaves, then configure the workers' IP addresses or host names (host names are recommended; they can be mapped via vi /etc/hosts). See the example after the notes below.
vi slaves

 The default entry is localhost, i.e. this single machine is used as the worker.

If this machine should act as both Master and Worker, leave it as is.
If this machine should only act as Master, delete localhost.
If every host in the cluster has the hosts file configured, you can write host names directly.
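For example, with the worker host names used in the scp step further below, the slaves file would simply list one host per line:

mini1
mini2
mini3
mini4

and the matching /etc/hosts entries on every machine might look like this (the IP addresses are only placeholders for illustration):

192.168.1.100  mini0
192.168.1.101  mini1
192.168.1.102  mini2
192.168.1.103  mini3
192.168.1.104  mini4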
 

 

 7. Use the scp command to copy the spark directory to the same path on every worker virtual machine
scp -r spark/ mini1:/home/hadoop/apps/
scp -r spark/ mini2:/home/hadoop/apps/
scp -r spark/ mini3:/home/hadoop/apps/
scp -r spark/ mini4:/home/hadoop/apps/
 

 

 8. When starting the cluster, an error like the one shown above may be reported: -bash: /home/hadoop/apps/hadoop-2.7.5/sbin/start-all.sh: No such file or directory.
  This happens because Hadoop's start-all.sh is the one being invoked. It is recommended to use the absolute location of Spark's script, such as: apps/spark/sbin/start-all.sh
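For example, with the layout used in this article, the standalone cluster can be started and stopped with the full path (adjust the path to your own installation):

/home/hadoop/apps/spark/sbin/start-all.sh    # starts the Master here plus a Worker on every host listed in slaves
/home/hadoop/apps/spark/sbin/stop-all.sh     # stops them again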
 
9. After startup succeeds, run jps to check whether the Master and Worker services are currently running
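On the Master node the jps output should look roughly like the following (process IDs will differ, and Worker only appears if this machine is also listed in slaves; worker machines show only Worker and Jps):

2913 Master
3021 Worker
3187 Jps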
 

 

10. Finally, log in to the web management page at the Master's IP, port 8080
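Assuming the Master host name used elsewhere in this article, the address would be something like:

http://mini0:8080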

 Part 2 ----- WordCount on Spark

1. Enter the hadoop directory and use LICENSE.txt as the input file for WordCount.
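If the input directory does not exist on HDFS yet, create it first (a small sketch; the Hadoop path is the one that appeared in the error message earlier):

cd /home/hadoop/apps/hadoop-2.7.5       # Hadoop installation directory, LICENSE.txt is located here
hadoop fs -mkdir -p /wordcount/input    # create the HDFS input directory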

 

2. hadoop fs -put LICENSE.txt /wordcount/input, which uploads the LICENSE.txt file to the HDFS directory /wordcount/input

View the files on HDFS with hadoop fs -ls /wordcount/input

3. Go to the spark directory and run bin/spark-shell
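By default this launches a local shell. To run the job on the standalone cluster started above, pass the master URL instead (mini0:7077 assumes the Master host and default port configured earlier):

bin/spark-shell --master spark://mini0:7077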

 4. Execute the following command (Scala is the language Spark recommends)

sc.textFile("hdfs://mini0:9000/wordcount/input/LICENSE.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://mini0:9000/wordcount/out")
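The same pipeline can be written step by step inside spark-shell to make each transformation visible, with a quick preview of the most frequent words (a sketch that uses the same HDFS paths as above):

val lines  = sc.textFile("hdfs://mini0:9000/wordcount/input/LICENSE.txt")   // read the file from HDFS
val words  = lines.flatMap(_.split(" "))                                    // split each line into words
val pairs  = words.map((_, 1))                                              // pair every word with a count of 1
val counts = pairs.reduceByKey(_ + _)                                       // add up the counts per word
counts.sortBy(_._2, ascending = false).take(10).foreach(println)            // print the 10 most frequent words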

5. Exit the Scala shell and execute hadoop fs -ls /wordcount/out

6. Execute hadoop fs -cat /wordcount/out/part-00000 to view the result (for example, the word absolutely appears 4 times)
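To check a single word without printing the whole file, the output can be piped through grep (using the same example word):

hadoop fs -cat /wordcount/out/part-00000 | grep absolutely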

 

 
