Setting up a Hadoop cluster with Docker

1. Get an Ubuntu image and enter a container

docker search ubuntu

Pick a version and pull it to the local machine (the sources.list example below targets 16.04 "xenial", so pull a matching release):

docker pull ubuntu:16.04

Run the container:

docker run -ti ubuntu:16.04

2. Prepare the Ubuntu environment

The apt-get sources in Docker's Ubuntu image point to servers outside China by default, so many packages cannot be downloaded; switch them to a domestic mirror.

Find a domestic mirror online that matches your Ubuntu release.

Back up the default sources:

mv /etc/apt/sources.list /etc/apt/sources.list.bak

Replace sources.list with the following (the Docker Ubuntu image has no vim, so write the file with cat or similar; an example follows the list):

deb http://mirrors.aliyun.com/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
## proposed (pre-release) packages
deb http://mirrors.aliyun.com/ubuntu/ xenial-proposed main restricted universe multiverse
# source packages
deb-src http://mirrors.aliyun.com/ubuntu/ xenial main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
## proposed (pre-release) source packages
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-proposed main restricted universe multiverse
# Canonical partners and extras
deb http://archive.canonical.com/ubuntu/ xenial partner
deb http://extras.ubuntu.com/ubuntu/ xenial main
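
Since the Docker image ships without an editor, one way to write the file is with a shell heredoc (a sketch; paste in whichever of the mirror lines above you need):

cat > /etc/apt/sources.list <<'EOF'
deb http://mirrors.aliyun.com/ubuntu/ xenial main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ xenial-backports main restricted universe multiverse
EOF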

After modifying the file, update the package index:

apt-get update

Now you can install vim, net-tools (for ifconfig) and iputils-ping (for ping) with apt-get, as shown below.
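
For example:

apt-get install -y vim net-tools iputils-ping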

With the preparation done, save the current container as an image:

exit
docker ps -a                      # find the id of the container you just used
docker commit -m "apt-get update" containerid ubuntu:v1
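
You can confirm the new image exists with:

docker images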

3. Install the JDK

Go back to the container you were using:

docker start -ai containerid

Or create a new container from the saved image. Then install the JDK:

apt-get install openjdk-8-jdk

Installing Java will not be covered in more detail here.
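
A quick check that the JDK is in place:

java -version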

4. Install Hadoop

apt-get install wget # install wget

Download Hadoop from the official site or a domestic mirror (pick whichever Hadoop version you want):

wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.8.4/hadoop-2.8.4.tar.gz

Extract Hadoop into a dedicated directory, here ~/hadoop/:

mkdir -p ~/hadoop
tar xvzf hadoop-2.8.4.tar.gz -C ~/hadoop/

5. Configure environment variables and configuration files

Edit the ~/.bashrc file and append the following at the end:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # adjust to your actual JDK path
export HADOOP_HOME=/root/hadoop/hadoop-2.8.4
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

The JAVA_HOME path can be found with:

update-alternatives --config java

After editing, run source ~/.bashrc so the environment variables take effect.
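
If the variables took effect, the hadoop command is now on the PATH; a quick sanity check:

hadoop version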

Now configure Hadoop (this part is adapted from another post; keep the file paths consistent with the install location used above, /root/hadoop/hadoop-2.8.4).

We now modify Hadoop's configuration files, mainly core-site.xml, hdfs-site.xml and mapred-site.xml.

Before editing them, run the following commands:

root@8ef06706f88d:~# cd $HADOOP_HOME/
root@8ef06706f88d:~/hadoop/hadoop-2.8.4# mkdir tmp namenode datanode
root@8ef06706f88d:~/hadoop/hadoop-2.8.4# cd $HADOOP_CONFIG_HOME/
root@8ef06706f88d:~/hadoop/hadoop-2.8.4/etc/hadoop# cp mapred-site.xml.template mapred-site.xml

This creates three directories that will be used in the configuration below:

  1. tmp: Hadoop's temporary directory
  2. namenode: storage directory for the NameNode
  3. datanode: storage directory for the DataNode

1). Configure core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
            <name>hadoop.tmp.dir</name>
            <value>/root/hadoop/hadoop-2.8.4/tmp</value>
            <description>A base for other temporary directories.</description>
    </property>

    <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000</value>
            <final>true</final>
            <description>The name of the default file system.  A URI whose
            scheme and authority determine the FileSystem implementation.  The
            uri's scheme determines the config property (fs.SCHEME.impl) naming
            the FileSystem implementation class.  The uri's authority is used to
            determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>

Notes:

  • hadoop.tmp.dir is set to the temporary directory created with the commands above.
  • fs.default.name is set to hdfs://master:9000 and points at the Master node's host (we will configure that node when we build the cluster later; it is written here in advance). A quick way to verify the setting is shown below.
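
If you want to confirm that Hadoop picks up this value, run (assuming $HADOOP_HOME/bin is on the PATH as configured in ~/.bashrc):

hdfs getconf -confKey fs.default.name

It should print hdfs://master:9000.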

2). Configure hdfs-site.xml

Edit the hdfs-site.xml file with the command nano hdfs-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/root/hadoop/hadoop-2.8.4/namenode</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/root/hadoop/hadoop-2.8.4/datanode</value>
        <final>true</final>
    </property>
</configuration>

Notes:

  • When we build the cluster later there will be one Master node and two Slave nodes, so dfs.replication is set to 2.
  • dfs.namenode.name.dir and dfs.datanode.data.dir are set to the NameNode and DataNode directories created earlier.

3). Configure mapred-site.xml

The Hadoop distribution ships a mapred-site.xml.template, which is why we ran cp mapred-site.xml.template mapred-site.xml earlier to create mapred-site.xml. Now edit that file with nano mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
        and reduce task.
        </description>
    </property>
</configuration>

There is only one property here, mapred.job.tracker, and we point it at the master node.

4). Set JAVA_HOME in hadoop-env.sh

Edit hadoop-env.sh (in $HADOOP_CONFIG_HOME) with nano hadoop-env.sh and set the following line to the same JAVA_HOME used in ~/.bashrc:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

5). Format the NameNode

This is an important step; run the command:

hadoop namenode -format

6. Install SSH

apt-get install ssh

Generate a key pair:

root@8ef06706f88d:/# cd ~/
root@8ef06706f88d:~# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
root@8ef06706f88d:~# cd .ssh
root@8ef06706f88d:~/.ssh# cat id_rsa.pub >> authorized_keys

The public key can be copied to the other containers. (In this walkthrough all containers are later started from the same committed image, so they already share the same key pair and authorized_keys.)

At this point the Docker container is fully configured; save it as an image:

docker commit -m "hadoop install" container ubuntu:v2

7. Build the distributed cluster

Start three containers from the image saved earlier, using the -h option to set the hostnames master, slave1 and slave2:

docker run -ti -h master ubuntu:v2
docker run -ti -h slave1 ubuntu:v2
docker run -ti -h slave2 ubuntu:v2

sshd is not started by default in the running containers, so start the SSH service manually on each node:

/etc/init.d/ssh start
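
To avoid starting it by hand every time a container is restarted, one workaround (an extra step, not in the original post) is to append the command to root's ~/.bashrc so it runs on every login shell:

echo '/etc/init.d/ssh start' >> ~/.bashrc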

Configure the hosts file so the containers can reach each other by name:

Use ifconfig to get each node's IP address.

Then add every node's IP to /etc/hosts on each container (the addresses below are examples; use the ones reported by ifconfig, typically 172.17.0.x on the default Docker bridge):

vi /etc/hosts

172.17.0.2        master
172.17.0.3        slave1
172.17.0.4        slave2
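
With the hosts entries in place, connectivity can be checked with the ping tool installed earlier, e.g. from the master:

ping -c 2 slave1
ping -c 2 slave2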

Configure the slaves file: on the master node, edit the configuration file ~/hadoop/hadoop-2.8.4/etc/hadoop/slaves

and write the slaves' hostnames into it:

slave1
slave2

Once all of the above is done, start Hadoop on the master node:

start-all.sh

Finally, run jps to check the status of each node.
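
If the cluster came up correctly, jps should roughly show the following daemons (the typical layout for this configuration, not output copied from the original post):

master: NameNode, SecondaryNameNode, ResourceManager
slave1 / slave2: DataNode, NodeManager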

Reposted from blog.csdn.net/bat_xu/article/details/80941677