Hadoop environment configuration and testing
In the previous experiments we prepared the Linux environment and downloaded Hadoop. In this experiment we will configure and test the Hadoop environment on that basis.
Linux environment installation and configuration before the Hadoop environment is built:
https://blog.csdn.net/weixin_43640161/article/details/108614907
JDK installation and configuration under Linux:
https://blog.csdn.net/weixin_43640161/article/details/108619802
Installation and configuration of Eclipse under Linux:
https://blog.csdn.net/weixin_43640161/article/details/108691921
Downloading and decompressing Hadoop:
https://blog.csdn.net/weixin_43640161/article/details/108697510
There are three ways to install Hadoop: stand-alone mode, pseudo-distributed mode, and distributed mode.
• Stand-alone mode: Hadoop's default mode is non-distributed (local) mode, which runs without any further configuration. Everything runs in a single Java process, which is convenient for debugging.
• Pseudo-distributed mode: Hadoop runs on a single node, with each Hadoop daemon running as a separate Java process. The node acts as both NameNode and DataNode, and files are read from HDFS.
• Distributed mode: multiple nodes form a cluster environment to run Hadoop.
• This experiment installs Hadoop in pseudo-distributed mode on a single machine.
Important knowledge tips:
- Hadoop can run in pseudo-distributed mode on a single node: each Hadoop daemon runs as a separate Java process, and the node acts as both NameNode and DataNode, reading files from HDFS.
- Hadoop's configuration files are located in hadoop/etc/hadoop/. Pseudo-distributed mode requires modifying five of them: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
- Hadoop configuration files are in XML format; each configuration item is declared by a property's name and value.
Experiment steps:
- Modify the configuration files: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
- Initialize the file system: hdfs namenode -format
- Start all processes: start-all.sh, or start-dfs.sh plus start-yarn.sh
- Visit the web interface to view Hadoop information
- Run an example job
- Stop all instances: stop-all.sh
Step 1: configure the Hadoop environment (the exact edits depend on your JDK and Hadoop versions; jdk1.8.0_181 and hadoop-3.1.1 are used here)
1. Configure Hadoop (pseudo-distributed), modify 5 configuration files
- Enter Hadoop's etc directory.
Terminal command: cd /bigdata/hadoop-3.1.1/etc/hadoop
- Modify the first configuration file.
Terminal command: sudo vi hadoop-env.sh
Find line 54 and modify JAVA_HOME as follows (remember to remove the leading # sign):
export JAVA_HOME=/opt/java/jdk1.8.0_181
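The same edit can also be made non-interactively with sed instead of vi. The sketch below demonstrates the substitution on a temporary stand-in file, so it is safe to try; in the real setup the target is /bigdata/hadoop-3.1.1/etc/hadoop/hadoop-env.sh (the path used in this guide).

```shell
# Demonstration on a temporary copy; the real target file is
# /bigdata/hadoop-3.1.1/etc/hadoop/hadoop-env.sh.
tmpfile=$(mktemp)
# Simulate the commented-out JAVA_HOME line shipped with Hadoop:
echo '# export JAVA_HOME=' > "$tmpfile"
# Uncomment it and point it at the JDK used in this experiment:
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/opt/java/jdk1.8.0_181|' "$tmpfile"
cat "$tmpfile"
rm -f "$tmpfile"
```

Running the same sed command with sudo against the real hadoop-env.sh achieves the edit described above in one step.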
- Modify the second configuration file
Terminal command: sudo vi core-site.xml
<configuration>
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<!-- Storage directory for data generated at Hadoop runtime (not temporary data) -->
<property>
<name>hadoop.tmp.dir</name>
<value>file:/bigdata/hadoop-3.1.1/tmp</value>
</property>
</configuration>
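After saving core-site.xml, it is worth confirming that the NameNode address was written correctly. A minimal sketch, assuming one `<value>` element per line (as in the file above); it is shown against a temporary copy of the configuration so the commands can be tried anywhere:

```shell
# Extract the fs.defaultFS value from a core-site.xml-style file with
# grep/sed (no XML parser needed if each <value> sits on its own line).
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
# Print the line after the matching <name>, then strip the value tags:
grep -A1 '<name>fs.defaultFS</name>' "$conf" | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
rm -f "$conf"
```

Pointing the same pipeline at /bigdata/hadoop-3.1.1/etc/hadoop/core-site.xml should print hdfs://localhost:9000 if the edit took effect.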
- Modify the third configuration file
Terminal command: sudo vi hdfs-site.xml
<configuration>
<!-- Number of replicas for data stored in HDFS -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:50070</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/bigdata/hadoop-3.1.1/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/bigdata/hadoop-3.1.1/tmp/dfs/data</value>
</property>
</configuration>
In addition, although pseudo-distributed mode runs with only fs.defaultFS and dfs.replication configured (as in the official tutorial), if hadoop.tmp.dir is left unset the default temporary directory /tmp/hadoop-<username> is used, and that directory may be cleaned by the system on reboot, forcing you to run the format step again. We therefore set it, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir explicitly; otherwise errors may occur in later steps.
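Since the NameNode and DataNode directories configured above must exist and be writable, they can be created ahead of time. A sketch, shown under a temporary base directory so it is harmless to run; in the real setup the base is /bigdata/hadoop-3.1.1 as used throughout this guide:

```shell
# Pre-create the HDFS storage directories from hdfs-site.xml.
# BASE is a temporary stand-in here; the real base in this guide
# is /bigdata/hadoop-3.1.1.
BASE=$(mktemp -d)/bigdata/hadoop-3.1.1
mkdir -p "$BASE/tmp/dfs/name" "$BASE/tmp/dfs/data"
ls "$BASE/tmp/dfs"
```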
- Modify the fourth configuration file:
Terminal command: sudo vi mapred-site.xml
<configuration>
<!-- Run the MapReduce programming model on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Modify the fifth configuration file
sudo vi yarn-site.xml
<configuration>
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<!-- How MapReduce obtains data during the shuffle phase -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
- Initialize HDFS (format the NameNode).
Terminal commands:
cd /bigdata/hadoop-3.1.1/bin/
sudo ./hdfs namenode -format
If the output contains a message like "successfully formatted", the formatting succeeded.
Step 2: Start and test Hadoop
Terminal command:
cd /bigdata/hadoop-3.1.1/sbin/
ssh localhost
sudo ./start-dfs.sh
sudo ./start-yarn.sh
(or start everything at once: sudo ./start-all.sh)
If an error such as "ERROR: Attempting to operate on hdfs namenode as root" is reported, modify the following 4 files as follows:
Under the hadoop/sbin directory, add the following parameters at the top of the start-dfs.sh and stop-dfs.sh files:
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Terminal command: sudo vi start-dfs.sh
Terminal command: sudo vi stop-dfs.sh
Similarly, start-yarn.sh and stop-yarn.sh need the following parameters added at the top:
#!/usr/bin/env bash
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Terminal command: sudo vi start-yarn.sh
Terminal command: sudo vi stop-yarn.sh
After these modifications, run ./start-all.sh again; it should now succeed.
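Instead of opening each script in vi, the user variables can also be prepended with a heredoc. The sketch below demonstrates the technique on a temporary stand-in for sbin/start-dfs.sh, so it can be tried safely before applying it to the real scripts:

```shell
# Prepend the user variables to a startup script non-interactively.
# $script is a temporary stand-in for the real sbin/start-dfs.sh.
script=$(mktemp)
echo 'echo starting dfs...' > "$script"   # stand-in for the real script body
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
#!/usr/bin/env bash
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
EOF
cat "$script" >> "$tmp" && mv "$tmp" "$script"
head -1 "$script"   # the shebang is now the first line
rm -f "$script"
```

The same pattern (with sudo, and with the YARN variable block) applies to start-yarn.sh and stop-yarn.sh.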
In addition, if an error about insufficient permissions occurs (for example, on the logs or tmp directory), solve it in the following way:
terminal command:
ssh localhost
cd /bigdata/hadoop-3.1.1/
sudo chmod -R 777 logs
sudo chmod -R 777 tmp
- Use the jps command to check whether the processes exist. There should be 5 processes in total (besides Jps). The process IDs will differ each time you restart. To shut everything down, use the stop-all.sh command.
4327 DataNode
4920 NodeManager
4218 NameNode
4474 SecondaryNameNode
4651 ResourceManager
5053 Jps
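The check above can be scripted so a missing daemon is reported by name. A sketch, with the jps output simulated by a heredoc (matching the listing above) so the logic itself can be run anywhere; on a live machine, replace the heredoc with `jps_out=$(jps)`:

```shell
# Verify that all five Hadoop daemons appear in jps output.
expected="DataNode NodeManager NameNode SecondaryNameNode ResourceManager"
# Simulated jps output; on a real node use: jps_out=$(jps)
jps_out=$(cat <<'EOF'
4327 DataNode
4920 NodeManager
4218 NameNode
4474 SecondaryNameNode
4651 ResourceManager
5053 Jps
EOF
)
missing=0
for d in $expected; do
  echo "$jps_out" | grep -qw "$d" || { echo "missing: $d"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all daemons running"
```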
- Access the HDFS management interface: localhost:50070
- Access the YARN management interface: localhost:8088
- If you click on Nodes, you will find that ubuntu:8042 is also accessible
- If you want to stop all services, please enter sbin/stop-all.sh
The above covers Hadoop environment configuration and testing. If you run into unexpected errors, leave a message in the comment area.