1. Basic information
Official website http://hadoop.apache.org/
Quick start http://hadoop.apache.org/docs/r1.0.4/cn/quickstart.html
Online documentation http://tool.oschina.net/apidocs/apidoc?api=hadoop
Yibai tutorial https://www.yiibai.com/hadoop/
W3Cschool tutorial https://www.w3cschool.cn/hadoop/?
2. Description of environment and tools
1. Operating system: CentOS 7.4 x64 Minimal 1708
Install 5 virtual machines:
NameNode: 2 machines, 2 GB RAM, 1 CPU core each
DataNode: 3 machines, 2 GB RAM, 1 CPU core each
2. JDK version: JDK 1.8
3. Tools: Xshell 5
4. VMware version: VMware Workstation Pro 15
5. Hadoop: 3.2.0
6. ZooKeeper: 3.4.5
3. Installation and deployment (preparation of basic environment)
1. Virtual machine installation (install 5 virtual machines)
Reference https://blog.csdn.net/llwy1428/article/details/89328381
2. Connect each virtual machine to the Internet (the network card must be configured on all 5 nodes)
Network card configuration can refer to:
https://blog.csdn.net/llwy1428/article/details/85058028
3. Modify the host name (5 nodes need to modify the host name)
Edit the host name on each node in the cluster (taking the first node, node1.cn, as an example):
[root@localhost ~]# hostnamectl set-hostname node1.cn
node1.cn
node2.cn
node3.cn
node4.cn
node5.cn
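The five hostname commands can also be generated as a checklist. This is a sketch: it only prints the command to run on each machine (the file name set-hostname-commands.txt is illustrative); run each printed command on the matching node, or push it over ssh once passwordless login is configured in step 9.

```shell
# Dry run: print the hostnamectl command for each of the five nodes.
for host in node1.cn node2.cn node3.cn node4.cn node5.cn; do
  echo "hostnamectl set-hostname ${host}"
done > set-hostname-commands.txt
cat set-hostname-commands.txt
```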
4. Set up the JDK 8 environment (required on all 5 nodes)
Refer to https://blog.csdn.net/llwy1428/article/details/85232267
5. Configure the firewall (required on all 5 nodes)
Stop the firewall and disable it at boot:
Stop the firewall: systemctl stop firewalld
Check its status: systemctl status firewalld
Disable it at boot: systemctl disable firewalld
6. Configure static IP
Here the node1.cn node is taken as an example (the other nodes are omitted):
[root@node1 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
Can refer to: https://blog.csdn.net/llwy1428/article/details/85058028
7. Configure the hosts file
Take the node1.cn node as an example:
[root@node1 ~]# vim /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.11.131 node1.cn
192.168.11.132 node2.cn
192.168.11.133 node3.cn
192.168.11.134 node4.cn
192.168.11.135 node5.cn
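The five mappings above can be staged in a file and appended on every node, which avoids retyping them five times. A minimal sketch using the IPs from this article (cluster-hosts.txt is an illustrative staging name; adjust the IPs to your network):

```shell
# Stage the cluster mappings; on a real node, append with:
#   cat cluster-hosts.txt >> /etc/hosts
cat > cluster-hosts.txt <<'EOF'
192.168.11.131 node1.cn
192.168.11.132 node2.cn
192.168.11.133 node3.cn
192.168.11.134 node4.cn
192.168.11.135 node5.cn
EOF
cat cluster-hosts.txt
```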
8. Install basic tools
[root@node1 ~]# yum install -y vim wget lrzsz tree zip unzip net-tools ntp
[root@node1 ~]# yum update -y (optional)
(Depending on your network, this may take a few minutes.)
9. Configure password-free login between nodes
Refer to the specific steps:
https://blog.csdn.net/llwy1428/article/details/85911160
https://blog.csdn.net/llwy1428/article/details/85641999
10. Increase the open-file limit on each node of the cluster
Take the node1.cn node as an example:
[root@node1 ~]# vim /etc/security/limits.conf
Reference:
https://blog.csdn.net/llwy1428/article/details/89389191
11. Configure time synchronization on each node in the cluster
This article uses the Aliyun time server; its address is ntp6.aliyun.com.
Note: If you have a dedicated time server, use its host name or IP address instead. A host name must be mapped in the /etc/hosts file.
Take node1.cn as an example:
Set the system time zone to UTC+8 (Asia/Shanghai):
[root@node1 ~]# timedatectl set-timezone Asia/Shanghai
Stop the ntpd service
[root@node1 ~]# systemctl stop ntpd.service
Disable the ntpd service at boot
[root@node1 ~]# systemctl disable ntpd
Set up a scheduled task
[root@node1 ~]# crontab -e
Write the following (synchronizes with the Alibaba Cloud time server every 10 minutes):
0-59/10 * * * * /usr/sbin/ntpdate ntp6.aliyun.com
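The minute field `0-59/10` uses cron's step syntax, so the sync runs six times an hour. The firing minutes can be sketched with `seq` (cron-minutes.txt is just an illustrative file name):

```shell
# Minutes at which the 0-59/10 schedule fires within one hour.
seq 0 10 59 > cron-minutes.txt
cat cron-minutes.txt
```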
Restart the scheduled task service
[root@node1 ~]# /bin/systemctl restart crond.service
Make the crond service start at boot
[root@node1 ~]# vim /etc/rc.local
Add the following line, then save and exit (:wq):
/bin/systemctl start crond.service
The other nodes in the cluster are the same as node1.cn.
Reference https://blog.csdn.net/llwy1428/article/details/89330330
12. Disable SELinux on each node of the cluster
Take node1.cn as an example:
[root@node1 ~]# vim /etc/selinux/config
Set SELINUX=disabled, then save and exit (:wq).
The other nodes in the cluster are the same as node1.cn.
13. Disable Transparent HugePages on each node of the cluster
Reference https://blog.csdn.net/llwy1428/article/details/89387744
14. Set the system locale to UTF-8
Take node1.cn as an example:
[root@node1 ~]# echo "export LANG=zh_CN.UTF-8" >> ~/.bashrc
[root@node1 ~]# source ~/.bashrc
The other nodes in the cluster are the same as node1.cn.
15. Install the database
Note: MariaDB (MySQL) is installed to provide metadata storage for Hive, Spark, Oozie, Superset, etc. If you do not use these tools, you do not need to install it.
The MariaDB (MySQL) installation process can refer to:
https://blog.csdn.net/llwy1428/article/details/84965680
https://blog.csdn.net/llwy1428/article/details/85255621
4. Install and deploy the Hadoop cluster (HA mode)
(Note: During the construction and operation of the cluster, ensure that the time of all nodes in the cluster is synchronized)
1. Create a directory and upload files
Note: first configure everything on node1.cn, then distribute the configured files to each node for further per-node configuration.
Create the directory /opt/cluster/ on each node.
Take node1.cn as an example:
[root@node1 ~]# mkdir /opt/cluster
2. Download (or upload) and decompress the file
Download
[root@node1 opt]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
or
Download the file manually: hadoop-3.2.0.tar.gz
Upload the downloaded hadoop-3.2.0.tar.gz to /opt/cluster, then decompress it.
Enter the /opt/cluster directory and unpack the archive:
[root@node1 cluster]# tar zxvf hadoop-3.2.0.tar.gz
View directory structure
3. Create several directories in hadoop
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/tmp
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/name
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/data
[root@node1 ~]# mkdir /opt/cluster/hadoop-3.2.0/hdfs/journaldata
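The five mkdir calls above collapse into one command with `-p` and bash brace expansion. A sketch that is safe to run anywhere (BASE is a temp-dir stand-in; on the node it would be /opt/cluster/hadoop-3.2.0/hdfs):

```shell
# One mkdir instead of five: -p creates parents, braces expand to
# the four subdirectories.
BASE=$(mktemp -d)/hdfs
mkdir -p "${BASE}"/{tmp,name,data,journaldata}
ls "${BASE}"
```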
4. Configure Hadoop environment variables
[root@node1 ~]# vim /etc/profile
Append the following at the end:
export HADOOP_HOME="/opt/cluster/hadoop-3.2.0"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Save and exit :wq
Make the configuration file effective
[root@node1 ~]# source /etc/profile
Check the version with `hadoop version`.
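A quick way to confirm the variables took effect is to look for the Hadoop paths in `$PATH`. Simulated here with a local export (on the node, the values come from /etc/profile after `source`):

```shell
# Reproduce the profile lines locally and verify both bin and sbin
# landed on the PATH.
export HADOOP_HOME="/opt/cluster/hadoop-3.2.0"
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "$PATH" | tr ':' '\n' | grep "hadoop-3.2.0"
```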
5. Configure hadoop-env.sh
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/hadoop-env.sh
Add the following content
export JAVA_HOME=/opt/utils/jdk1.8.0_191
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
6. Edit the file core-site.xml
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/cluster/hadoop-3.2.0/hdfs/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node3.cn:2181,node4.cn:2181,node5.cn:2181</value>
</property>
</configuration>
7. Edit the file hdfs-site.xml
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.nameservices</name>
<value>cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.cluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.nn1</name>
<value>node1.cn:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.nn2</name>
<value>node2.cn:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.nn1</name>
<value>node1.cn:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.nn2</name>
<value>node2.cn:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node3.cn:8485;node4.cn:8485;node5.cn:8485/cluster</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/cluster/hadoop-3.2.0/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/cluster/hadoop-3.2.0/hdfs/data</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/cluster/hadoop-3.2.0/hdfs/journaldata</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence
shell(/bin/true)</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
</configuration>
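When a property name appears twice in one *-site.xml, Hadoop keeps the later value, so a stray duplicate silently overrides an earlier setting. A quick duplicate check, demonstrated on a small sample file (sample-site.xml and dup-names.txt are illustrative names; on the node, point the grep at etc/hadoop/hdfs-site.xml):

```shell
# Find property names that occur more than once in a config file.
cat > sample-site.xml <<'EOF'
<configuration>
<property><name>dfs.replication</name><value>3</value></property>
<property><name>dfs.replication</name><value>2</value></property>
</configuration>
EOF
grep -o '<name>[^<]*</name>' sample-site.xml | sort | uniq -d > dup-names.txt
cat dup-names.txt
```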
8. Edit the file mapred-site.xml
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/cluster/hadoop-3.2.0</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>node1.cn:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node1.cn:19888</value>
</property>
</configuration>
9. Edit the file yarn-site.xml
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster-yarn</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>node1.cn</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node2.cn</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>node1.cn:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node2.cn:8088</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>node3.cn:2181,node4.cn:2181,node5.cn:2181</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
</property>
</configuration>
10. Edit the workers file
[root@node1 ~]# vim /opt/cluster/hadoop-3.2.0/etc/hadoop/workers
node3.cn
node4.cn
node5.cn
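Note that in Hadoop 3.x this file is named `workers` (it replaced the `slaves` file of Hadoop 2.x); it lists one DataNode host per line. The file can also be generated from the node list. A sketch written to a local copy (the real path is /opt/cluster/hadoop-3.2.0/etc/hadoop/workers):

```shell
# One DataNode host per line.
printf '%s\n' node3.cn node4.cn node5.cn > workers
cat workers
```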
11. Distribute the entire hadoop-3.2.0 directory to each node
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node2.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node3.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node4.cn:/opt/cluster/
[root@node1 ~]# scp -r /opt/cluster/hadoop-3.2.0 node5.cn:/opt/cluster/
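The four scp commands above can be written as one loop. The sketch below is a dry run: the leading `echo` only prints each command so it can be reviewed first (scp-commands.txt is an illustrative name); drop the `echo` to actually copy:

```shell
# Print the distribution command for every target node.
for host in node2.cn node3.cn node4.cn node5.cn; do
  echo scp -r /opt/cluster/hadoop-3.2.0 "${host}:/opt/cluster/"
done > scp-commands.txt
cat scp-commands.txt
```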
12. Configure and start ZooKeeper
Reference:
https://hunter.blog.csdn.net/article/details/96651537
https://hunter.blog.csdn.net/article/details/85937442
13. Start the JournalNode on the three designated nodes
(Here node3.cn, node4.cn, and node5.cn serve as JournalNodes.)
[root@node3 ~]# hdfs --daemon start journalnode
[root@node4 ~]# hdfs --daemon start journalnode
[root@node5 ~]# hdfs --daemon start journalnode
14. Format the NameNode on node1.cn
[root@node1 ~]# hdfs namenode -format
15. Start the NameNode on node1.cn
[root@node1 ~]# hdfs --daemon start namenode
16. On node2.cn, synchronize the metadata of the NameNode just formatted on node1.cn
[root@node2 ~]# hdfs namenode -bootstrapStandby
17. Start the NameNode on node2.cn
[root@node2 ~]# hdfs --daemon start namenode
Verify the NameNode processes (e.g. with jps).
18. Stop the services
(1) Stop the NameNode on node1.cn and node2.cn
[root@node1 ~]# hdfs --daemon stop namenode
[root@node2 ~]# hdfs --daemon stop namenode
(2) Stop the JournalNode on node3.cn, node4.cn, and node5.cn
[root@node3 ~]# hdfs --daemon stop journalnode
[root@node4 ~]# hdfs --daemon stop journalnode
[root@node5 ~]# hdfs --daemon stop journalnode
19. Format ZKFC
First start ZooKeeper on node3.cn, node4.cn, and node5.cn.
Reference https://blog.csdn.net/llwy1428/article/details/85937442
After ZooKeeper is started, execute on node1.cn:
[root@node1 ~]# hdfs zkfc -formatZK
20. Start the HDFS and YARN services
[root@node1 ~]# /opt/cluster/hadoop-3.2.0/sbin/start-dfs.sh
[root@node1 ~]# /opt/cluster/hadoop-3.2.0/sbin/start-yarn.sh
21. Check the service startup status on each node
At this point, the Hadoop (HA) cluster on CentOS 7.4 has been built and is operational.
5. Basic shell operations
(1) Create a directory in hdfs
[root@node1 ~]# hdfs dfs -mkdir /hadoop
[root@node1 ~]# hdfs dfs -mkdir /hdfs
[root@node1 ~]# hdfs dfs -mkdir /tmp
(2) View the directory
[root@node2 ~]# hdfs dfs -ls /
(3) Upload files
For example: create a file test.txt in the /opt directory and write some words (the process is omitted)
[root@node3 ~]# hdfs dfs -put /opt/test.txt /hadoop
View uploaded files
[root@node4 ~]# hdfs dfs -ls /hadoop
[root@node4 ~]# hdfs dfs -cat /hadoop/test.txt
(4) Delete files
[root@node5 ~]# hdfs dfs -rm /hadoop/test.txt
Deleted /hadoop/test.txt
6. View the UI pages of some services in the browser
1. View HDFS information
Visit the NameNode web UIs on node1.cn and node2.cn, using the port configured in hdfs-site.xml: http://192.168.11.131:50070 and http://192.168.11.132:50070
Other pages: omitted.
2. View ResourceManager information
Visit http://192.168.11.131:8088
or http://192.168.11.132:8088 (the ports configured in yarn-site.xml); the standby ResourceManager redirects to the active one.
Other pages: omitted.
7. Run the MapReduce wordcount example
Taking the test.txt file above as an example:
[root@node5 ~]# /opt/cluster/hadoop-3.2.0/bin/yarn jar /opt/cluster/hadoop-3.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /hadoop /hadoop/output
View the job and its result in the ResourceManager web UI (screenshots omitted).
View the execution results on the command line:
[root@node5 ~]# hdfs dfs -cat /hadoop/output/part-r-00000
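What the wordcount job computes can be reproduced locally with coreutils, which is a handy sanity check against the cluster output: split the text on whitespace, count each word, and print "word<TAB>count" lines like part-r-00000. A sketch on a tiny sample (test.txt and local-wordcount.txt here are local illustrative files, not the HDFS paths above):

```shell
# Local equivalent of wordcount: tokenize, sort, count, reformat.
printf 'hello hadoop\nhello hdfs\n' > test.txt
tr -s ' ' '\n' < test.txt | sort | uniq -c | awk '{print $2 "\t" $1}' > local-wordcount.txt
cat local-wordcount.txt
```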