Hadoop Learning Log (1)

Hadoop Installation

0. Install the Java environment.

1. Download Hadoop, version 1.2.1.

2. Unpack the Hadoop tarball into the target directory with tar zxvf.

3. Six files need to be configured: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves.

4. File descriptions:

hadoop-env.sh: environment settings used when starting Hadoop

core-site.xml: core configuration, e.g. the default filesystem URI

hdfs-site.xml: configuration for the distributed file system (HDFS)

mapred-site.xml: configuration for MapReduce jobs

masters: lists the master machine (the master runs the NameNode; the slaves run DataNodes)

slaves: lists the slave machines


5. hadoop-env.sh in detail:

 

# Set Hadoop-specific environment variables here.

 

# The only required environment variable is JAVA_HOME.  All others are

# optional.  When running a distributed configuration it is best to

# set JAVA_HOME in this file, so that it is correctly defined on

# remote nodes.

 

# The java implementation to use.  Required. Set this to the JDK installation directory.

export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45

 

# Extra Java CLASSPATH elements.  Optional. Extra CLASSPATH entries can be configured here.

# export HADOOP_CLASSPATH=

 

# The maximum amount of heap to use, in MB. Default is 1000. Sets the JVM heap size.

# export HADOOP_HEAPSIZE=2000

 

# Extra Java runtime options.  Empty by default. -server selects the server JVM, whose memory management and garbage-collection behavior suit long-running daemons better than the client JVM.

# export HADOOP_OPTS=-server

 

# Command specific options appended to HADOOP_OPTS when specified. Enables JMX management for the daemons below.

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"

export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"

export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"

# export HADOOP_TASKTRACKER_OPTS=

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)

# export HADOOP_CLIENT_OPTS

 

# Extra ssh options.  Empty by default.

# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"

 

# Where log files are stored.  $HADOOP_HOME/logs by default.

# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

 

# File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.

# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves

 

# host:path where hadoop code should be rsync'd from.  Unset by default.

# export HADOOP_MASTER=master:/home/$USER/src/hadoop

 

# Seconds to sleep between slave commands.  Unset by default.  This

# can be useful in large clusters, where, e.g., slave rsyncs can

# otherwise arrive faster than the master can service them.

# export HADOOP_SLAVE_SLEEP=0.1

 

# The directory where pid files are stored. /tmp by default.

# NOTE: this should be set to a directory that can only be written to by

#       the users that are going to run the hadoop daemons.  Otherwise there is

#       the potential for a symlink attack.

# export HADOOP_PID_DIR=/var/hadoop/pids

 

# A string representing this instance of hadoop. $USER by default.

# export HADOOP_IDENT_STRING=$USER

 

# The scheduling priority for daemon processes.  See 'man nice'.

# export HADOOP_NICENESS=10

 

 

Because the settings above reference ${HADOOP_HOME}, it must be defined in /etc/profile:

export HADOOP_HOME=/usr/local/hadoop/hadoop   # the directory where Hadoop is installed
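A fuller sketch of the /etc/profile additions, assuming the paths used above (the PATH line is an optional convenience and not part of the original setup):

export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45
export HADOOP_HOME=/usr/local/hadoop/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

Run source /etc/profile afterwards so the current shell picks up the new variables.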

 

 

6. core-site.xml core configuration:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<!-- Put site-specific property overrides in this file. -->

<!-- Filesystem URI and port. hadoop1 is this machine's hostname and must be mapped in /etc/hosts; an IP address works as well. -->

<configuration>

  <property>

   <name>fs.default.name</name>

    <value>hdfs://hadoop1:9000</value>

  </property>

<!-- Directory where the filesystem keeps its data -->

  <property> 

  <name>hadoop.tmp.dir</name> 

  <value>/usr/local/hadoop/hadoop/tmp</value>

  </property> 

</configuration>
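Since fs.default.name refers to the hostname hadoop1, every node must be able to resolve that name. A sketch of the /etc/hosts entries, assuming the example addresses used below for masters and slaves:

192.168.100.101   hadoop1
192.168.100.102   hadoop2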

 

7. Configure hdfs-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<!-- Put site-specific property overrides in this file. -->

 

<configuration>

<!-- Whether permission checking is enabled -->

<property>

<name>dfs.permissions</name>

<value>false</value>

</property>

<!-- Number of block replicas -->

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>
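Once the cluster is running, the replication state of stored files can be checked with fsck, shown here as a hedged example (assumes the hadoop command is on PATH):

$ hadoop fsck / -files -blocks -locations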

 

8. Configure mapred-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

<!-- Put site-specific property overrides in this file. -->

 

<configuration>

<!-- JobTracker address and port; the worker nodes (TaskTrackers) connect to it -->

<property>

<name>mapred.job.tracker</name>

<value>hdfs://hadoop1:9001</value>

</property>

</configuration>

9. masters configuration:

Lists which machines are masters; hostnames or IP addresses both work (in Hadoop 1.x this file actually determines where the SecondaryNameNode is started).

For example:

192.168.100.101

192.168.100.102

10. slaves configuration:

Lists which machines are slaves:

192.168.100.101

192.168.100.102

11. Configure passwordless SSH login

Passwordless ssh setup (correct permissions matter):

Log in as the hadoop account and create the ssh directory:    mkdir ~/.ssh

Now confirm that you can ssh to the local machine without entering a password:
$ ssh namenode

If you cannot ssh to namenode without a password, run the following command:
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa

At the passphrase prompt you may enter a passphrase or leave it empty. For safety, this walkthrough sets the passphrase to hadoop and uses ssh-agent below to keep logins to the rest of the cluster password-free.

The private key is written to the file named by the -f option, ~/.ssh/id_rsa here. The public key goes to a file with the same name plus a .pub suffix, ~/.ssh/id_rsa.pub in this example.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

The authorized key is now stored in ~/.ssh/authorized_keys on the namenode machine.
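Because the key was given a passphrase (hadoop, above), ssh-agent can cache the decrypted key so that logins stay password-free; a minimal sketch:

$ eval `ssh-agent`
$ ssh-add ~/.ssh/id_rsa

ssh-add prompts once for the passphrase; with an empty passphrase this step is unnecessary.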

Next, use scp to distribute authorized_keys and id_rsa.pub to the same directory on the other machines. For example, to send the public key to the .ssh directory on datanode1:

$ scp ~/.ssh/id_rsa.pub  hadoop@datanode1:/home/hadoop/.ssh

datanode1机器上  cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

In principle the public key is distributed to every machine and an authorized_keys file is then built on each of them; in practice, copying the finished authorized_keys file directly also works.

Then set the file permissions: 711 on each machine's .ssh directory and 644 on authorized_keys. Permissions that are too open fail the ssh-add check; permissions that are too restrictive prevent passwordless login.
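A sketch of the permission commands implied above, run on each machine under the hadoop account:

$ chmod 711 ~/.ssh
$ chmod 644 ~/.ssh/authorized_keys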

 
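Before the very first start, the HDFS NameNode must be formatted; this one-time operation erases any existing HDFS metadata. Run it on the master (assuming the Hadoop bin directory is on PATH):

$ hadoop namenode -format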

12. Run start-all.sh to bring up the Hadoop cluster. The slave machines are started automatically over passwordless ssh: the NameNode host launches the DataNode processes on the child nodes.

13. Check with jps.

Master machine:

[root@hadoop1 conf]# jps

32365 JobTracker

32090 NameNode

25900 Jps

3017 Bootstrap

32269 SecondaryNameNode

Slave machine:

[root@hadoop2 ~]# jps

10009 TaskTracker

9901 DataNode

27852 Jps

Congratulations, the cluster started successfully.
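As a quick smoke test, assuming the hadoop command is on PATH (the /test path is just an illustrative name):

$ hadoop fs -mkdir /test
$ hadoop fs -ls /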

 

 

 

HBase Installation

HBase and Hadoop versions must be matched; since my Hadoop is 1.2.1, the matching HBase release is hbase-0.94.23.

1. Unpack the archive: tar zxvf hbase-0.94.23.tar.gz

2. Edit the configuration files hbase-env.sh and hbase-site.xml.

3. hbase-env.sh:

#

#/**

# * Copyright 2007 The Apache Software Foundation

# *

# * Licensed to the Apache Software Foundation (ASF) under one

# * or more contributor license agreements.  See the NOTICE file

# * distributed with this work for additional information

# * regarding copyright ownership.  The ASF licenses this file

# * to you under the Apache License, Version 2.0 (the

# * "License"); you may not use this file except in compliance

# * with the License.  You may obtain a copy of the License at

# *

# *     http://www.apache.org/licenses/LICENSE-2.0

# *

# * Unless required by applicable law or agreed to in writing, software

# * distributed under the License is distributed on an "AS IS" BASIS,

# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# * See the License for the specific language governing permissions and

# * limitations under the License.

# */

 

# Set environment variables here.

 

# This script sets variables multiple times over the course of starting an hbase process,

# so try to keep things idempotent unless you want to take an even deeper look

# into the startup scripts (bin/hbase, etc.)

 

# The java implementation to use.  Java 1.6 required. Point this at the JDK directory.

export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45/

 

# Extra Java CLASSPATH elements.  Optional.

# export HBASE_CLASSPATH=

 

# The maximum amount of heap to use, in MB. Default is 1000.

# export HBASE_HEAPSIZE=1000

 

# Extra Java runtime options.

# Below are what we set by default.  May only work with SUN JVM.

# For more on why as well as other possible settings,

# see http://wiki.apache.org/hadoop/PerformanceTuning

export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

 

# Uncomment one of the below three options to enable java garbage collection logging for the server-side processes.

 

# This enables basic gc logging to the .out file.

# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

 

# This enables basic gc logging to its own file.

# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .

# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

 

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.

# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .

# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

 

# Uncomment one of the below three options to enable java garbage collection logging for the client processes.

 

# This enables basic gc logging to the .out file.

# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

 

# This enables basic gc logging to its own file.

# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .

# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

 

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.

# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .

# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

 

# Uncomment below if you intend to use the EXPERIMENTAL off heap cache.

# export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize="

# Set hbase.offheapcache.percentage in hbase-site.xml to a nonzero value.

 

 

# Uncomment and adjust to enable JMX exporting

# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.

# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html

#

# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"

# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"

# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"

# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"

# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"

 

# File naming hosts on which HRegionServers will run.  $HBASE_HOME/conf/regionservers by default.

# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers

 

# File naming hosts on which backup HMaster will run.  $HBASE_HOME/conf/backup-masters by default.

# export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters

 

# Extra ssh options.  Empty by default.

# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"

 

# Where log files are stored.  $HBASE_HOME/logs by default.

# export HBASE_LOG_DIR=${HBASE_HOME}/logs

 

# Enable remote JDWP debugging of major HBase processes. Meant for Core Developers

# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8070"

# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8071"

# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8072"

# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8073"

 

# A string representing this instance of hbase. $USER by default.

# export HBASE_IDENT_STRING=$USER

 

# The scheduling priority for daemon processes.  See 'man nice'.

# export HBASE_NICENESS=10

 

# The directory where pid files are stored. /tmp by default.

# export HBASE_PID_DIR=/var/hadoop/pids

 

# Seconds to sleep between slave commands.  Unset by default.  This

# can be useful in large clusters, where, e.g., slave rsyncs can

# otherwise arrive faster than the master can service them.

# export HBASE_SLAVE_SLEEP=0.1

 

# Tell HBase whether it should manage its own instance of Zookeeper or not.

# export HBASE_MANAGES_ZK=true

 

4. Configure hbase-site.xml:

<configuration>

 

<!-- Where HBase stores its data. In distributed mode this must point at HDFS;
     the commented-out file:// value is the local-filesystem alternative for
     standalone use. Only one hbase.rootdir may be active, otherwise the later
     value silently overrides the earlier one. -->
<property>
 <name>hbase.rootdir</name>
 <value>hdfs://hadoop1:9000/hbase</value>
</property>

<!--
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hadoop/hbase/hbase-0.94.23/data</value>
</property>
-->

 

<!-- Run HBase in distributed mode -->

<property>

<name>hbase.cluster.distributed</name>

<value>true</value>

</property>

 

 

<property>

<name>hbase.master</name>

<value>hdfs://hadoop1:60000</value>

</property>

 

<property>

<name>hbase.zookeeper.quorum</name>

<value>hadoop1,hadoop2</value>

</property>

 

</configuration>
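With the configuration in place, HBase can be started on top of the running Hadoop cluster. A sketch, run from the hbase-0.94.23 directory:

$ bin/start-hbase.sh
$ bin/hbase shell
hbase(main):001:0> status

jps on the master should then also show HMaster (plus HQuorumPeer if HBase manages its own ZooKeeper); the region server machines show HRegionServer.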



Reprinted from hougechuanqi.iteye.com/blog/2119644