Environment
hadoop: 1.0.0
java: 1.8.0_171
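Walking through the hadoop script
The "hadoop script" discussed below is the file $HADOOP_HOME/bin/hadoop. This article walks through the script from top to bottom.
1. Locating the directory that contains the hadoop script
# Get the directory that contains the hadoop command (possibly a relative path)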
bin=`dirname "$0"`
# cd into that bin directory and resolve it to an absolute path
bin=`cd "$bin"; pwd`
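The two-step idiom above can be tried in isolation. The following is a minimal sketch, not part of the hadoop script: the helper name `resolve_script_dir` and the temporary directory layout are assumptions made purely for the demonstration.

```shell
# Hypothetical helper (not in the hadoop script): print the absolute
# directory of the path passed as $1, using the same dirname + cd/pwd idiom.
resolve_script_dir() {
  dir=$(dirname "$1")     # may still be relative or contain ".."
  (cd "$dir" && pwd)      # subshell: resolve without changing our own cwd
}

# Demo with a throwaway layout: <tmp>/bin/hadoop
demo_root=$(mktemp -d)
mkdir "$demo_root/bin"
touch "$demo_root/bin/hadoop"

resolved=$(resolve_script_dir "$demo_root/bin/../bin/hadoop")
echo "$resolved"
```

Running the `cd` inside a subshell is a small design choice: the caller's working directory is left untouched, whereas the backtick form in the script achieves the same effect because command substitution also runs in a subshell.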
2. Loading hadoop-config.sh
# If $HADOOP_HOME/libexec/hadoop-config.sh exists, source it; otherwise source bin/hadoop-config.sh
if [ -e "$bin"/../libexec/hadoop-config.sh ]; then
. "$bin"/../libexec/hadoop-config.sh
else
. "$bin"/hadoop-config.sh
fi
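To see the fallback in action, here is a small sketch that reproduces the same if/else against a throwaway directory. The function name `load_config` and the variable `CONFIG_SOURCE` are hypothetical and exist only for this demo; the real hadoop-config.sh sets many other variables.

```shell
# Throwaway layout: at first only bin/hadoop-config.sh exists.
demo_root=$(mktemp -d)
mkdir "$demo_root/bin" "$demo_root/libexec"
echo 'CONFIG_SOURCE=bin' > "$demo_root/bin/hadoop-config.sh"

# Hypothetical wrapper around the same test the hadoop script performs.
load_config() {
  bin=$1
  if [ -e "$bin/../libexec/hadoop-config.sh" ]; then
    . "$bin/../libexec/hadoop-config.sh"
  else
    . "$bin/hadoop-config.sh"
  fi
}

load_config "$demo_root/bin"
first=$CONFIG_SOURCE     # libexec copy is absent, so bin/ is used

echo 'CONFIG_SOURCE=libexec' > "$demo_root/libexec/hadoop-config.sh"
load_config "$demo_root/bin"
second=$CONFIG_SOURCE    # the libexec copy now takes precedence

echo "$first $second"
```

Because the file is loaded with `.` (source) rather than executed, the variables it sets remain visible in the calling shell, which is why the rest of the hadoop script can rely on them.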
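3. Detecting Cygwin
The script inspects the output of uname to find out whether it is running under Cygwin on Windows. The resulting cygwin flag is used later, for example to convert CLASSPATH to Windows format with cygpath.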
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac
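The pattern match can be exercised without a Windows machine by feeding it sample strings. This is only a sketch: the helper `is_cygwin` and the sample value `CYGWIN_NT-10.0` are assumptions for illustration, while the `cygpath -p -w` call is the one the script itself uses in its `classpath` branch.

```shell
# Hypothetical helper: classify a uname string the way the script does.
is_cygwin() {
  case "$1" in
    CYGWIN*) echo true ;;
    *)       echo false ;;
  esac
}

cygwin=$(is_cygwin "$(uname)")
echo "running under Cygwin: $cygwin"

# The flag holds the name of a command (true or false), so it can be used
# directly as an if condition to guard Windows-only path conversion.
if $cygwin; then
  CLASSPATH=$(cygpath -p -w "$CLASSPATH")
fi
```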
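4. Setting CLASSPATH
The following entries are appended to CLASSPATH in order. Apart from the first entry and tools.jar, each one is conditional:
- $HADOOP_CONF_DIR (the initial value)
- $HADOOP_CLASSPATH (only if both HADOOP_USER_CLASSPATH_FIRST and HADOOP_CLASSPATH are non-empty)
- $JAVA_HOME/lib/tools.jar
- $HADOOP_HOME/build/classes (if the directory exists)
- $HADOOP_HOME/build (if $HADOOP_HOME/build/webapps exists)
- $HADOOP_HOME/build/test/classes (if the directory exists)
- $HADOOP_HOME/build/tools (if the directory exists)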
# CLASSPATH initially contains $HADOOP_CONF_DIR
CLASSPATH="${HADOOP_CONF_DIR}"
if [ "$HADOOP_USER_CLASSPATH_FIRST" != "" ] && [ "$HADOOP_CLASSPATH" != "" ] ; then
CLASSPATH=${CLASSPATH}:${HADOOP_CLASSPATH}
fi
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# for developers, add Hadoop classes to CLASSPATH
if [ -d "$HADOOP_HOME/build/classes" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/classes
fi
if [ -d "$HADOOP_HOME/build/webapps" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build
fi
if [ -d "$HADOOP_HOME/build/test/classes" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/test/classes
fi
if [ -d "$HADOOP_HOME/build/tools" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME/build/tools
fi
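The conditional appends are easier to follow when run against a known directory tree. The sketch below rebuilds the developer part of the logic with a hypothetical helper `append_if_dir`; the temporary HADOOP_HOME and the `/opt/java` value for JAVA_HOME are placeholders, and the HADOOP_USER_CLASSPATH_FIRST branch is left out for brevity.

```shell
# Throwaway HADOOP_HOME that contains only build/classes and build/webapps.
HADOOP_HOME=$(mktemp -d)
mkdir -p "$HADOOP_HOME/build/classes" "$HADOOP_HOME/build/webapps"
HADOOP_CONF_DIR="$HADOOP_HOME/conf"
JAVA_HOME=/opt/java    # placeholder path, not checked for existence

# Hypothetical helper: append entry $2 when directory $1 exists.
append_if_dir() {
  if [ -d "$1" ]; then
    CLASSPATH=${CLASSPATH}:$2
  fi
}

CLASSPATH="${HADOOP_CONF_DIR}"
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
append_if_dir "$HADOOP_HOME/build/classes"      "$HADOOP_HOME/build/classes"
append_if_dir "$HADOOP_HOME/build/webapps"      "$HADOOP_HOME/build"
append_if_dir "$HADOOP_HOME/build/test/classes" "$HADOOP_HOME/build/test/classes"
append_if_dir "$HADOOP_HOME/build/tools"        "$HADOOP_HOME/build/tools"

# Print one entry per line for readability.
echo "$CLASSPATH" | tr ':' '\n'
```

Note the second call: the directory that is tested (build/webapps) differs from the entry that is added (build), which is easy to miss when reading the original if blocks.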
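5. Adding core dependencies
- If $HADOOP_PREFIX/share/hadoop/hadoop-core-* exists (binary layout), the following are checked and appended in order:
  - $HADOOP_PREFIX/share/hadoop (if share/hadoop/webapps exists)
  - every jar in share/hadoop whose name starts with hadoop-core-
  - every jar in share/hadoop/lib
  - every jar in share/hadoop/lib/jsp-2.1/
  - every jar in share/hadoop whose name starts with hadoop-tools- (appended to TOOL_PATH rather than CLASSPATH)
- Otherwise (tarball layout), the following are checked and appended in order:
  - $HADOOP_HOME (if $HADOOP_HOME/webapps exists)
  - $HADOOP_HOME/hadoop-core-*.jar
  - $HADOOP_HOME/lib/*.jar
  - $HADOOP_HOME/build/ivy/lib/Hadoop/common/*.jar (if the directory exists)
  - $HADOOP_HOME/lib/jsp-2.1/*.jar
  - $HADOOP_HOME/hadoop-tools-*.jar (appended to TOOL_PATH)
  - $HADOOP_HOME/build/hadoop-tools-*.jar (appended to TOOL_PATH)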
# for releases, add core hadoop jar & webapps to CLASSPATH
if [ -e $HADOOP_PREFIX/share/hadoop/hadoop-core-* ]; then
# binary layout
if [ -d "$HADOOP_PREFIX/share/hadoop/webapps" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_PREFIX/share/hadoop
fi
for f in $HADOOP_PREFIX/share/hadoop/hadoop-core-*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
# add libs to CLASSPATH
for f in $HADOOP_PREFIX/share/hadoop/lib/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
for f in $HADOOP_PREFIX/share/hadoop/lib/jsp-2.1/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
for f in $HADOOP_PREFIX/share/hadoop/hadoop-tools-*.jar; do
TOOL_PATH=${TOOL_PATH}:$f;
done
else
# tarball layout
if [ -d "$HADOOP_HOME/webapps" ]; then
CLASSPATH=${CLASSPATH}:$HADOOP_HOME
fi
for f in $HADOOP_HOME/hadoop-core-*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
# add libs to CLASSPATH
for f in $HADOOP_HOME/lib/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
if [ -d "$HADOOP_HOME/build/ivy/lib/Hadoop/common" ]; then
for f in $HADOOP_HOME/build/ivy/lib/Hadoop/common/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
fi
for f in $HADOOP_HOME/lib/jsp-2.1/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
for f in $HADOOP_HOME/hadoop-tools-*.jar; do
TOOL_PATH=${TOOL_PATH}:$f;
done
for f in $HADOOP_HOME/build/hadoop-tools-*.jar; do
TOOL_PATH=${TOOL_PATH}:$f;
done
fi
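One detail of the `for f in .../*.jar` loops is worth knowing: in a POSIX shell with default settings, a pattern that matches nothing is left unexpanded, so the loop body still runs once with the literal pattern. The sketch below contrasts that behavior with a guarded variant. The guard is my own illustration and is not present in the hadoop script.

```shell
lib_dir=$(mktemp -d)            # empty directory: no *.jar inside

# Same shape as the loops in the script.
unguarded=""
for f in "$lib_dir"/*.jar; do
  unguarded=${unguarded}:$f     # runs once with the unexpanded pattern
done

# Variant with an existence check (an addition for illustration).
guarded=""
for f in "$lib_dir"/*.jar; do
  [ -e "$f" ] || continue       # skip the literal pattern
  guarded=${guarded}:$f
done

echo "unguarded=$unguarded"
echo "guarded=$guarded"
```

With an empty directory the unguarded loop leaves a stray `.../*.jar` entry behind, while the guarded loop adds nothing. This may explain odd-looking entries in the output of `hadoop classpath` when one of the optional directories is missing.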
6. Log configuration
# default log directory & file
if [ "$HADOOP_LOG_DIR" = "" ]; then
HADOOP_LOG_DIR="$HADOOP_HOME/logs"
fi
if [ "$HADOOP_LOGFILE" = "" ]; then
HADOOP_LOGFILE='hadoop.log'
fi
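The same "use a default when empty" logic can be written more compactly with the `${VAR:-default}` parameter expansion, which substitutes the default when the variable is unset or empty. This is an equivalent sketch rather than what the script actually contains; the `/opt/hadoop` path and `custom.log` name are placeholders.

```shell
HADOOP_HOME=/opt/hadoop         # placeholder install location
unset HADOOP_LOG_DIR            # simulate: user did not configure a log dir
HADOOP_LOGFILE=custom.log       # simulate: user did configure a log file

# Fall back to the defaults only when the variable is unset or empty.
HADOOP_LOG_DIR=${HADOOP_LOG_DIR:-$HADOOP_HOME/logs}
HADOOP_LOGFILE=${HADOOP_LOGFILE:-hadoop.log}

echo "$HADOOP_LOG_DIR/$HADOOP_LOGFILE"
```

In practice this means both values can be overridden by exporting HADOOP_LOG_DIR or HADOOP_LOGFILE before the script runs.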
7. Mapping each command to a class and its JVM options
The most important classes:
Command | Class | JVM options |
---|---|---|
namenode | org.apache.hadoop.hdfs.server.namenode.NameNode | $HADOOP_NAMENODE_OPTS |
secondarynamenode | org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode | $HADOOP_SECONDARYNAMENODE_OPTS |
datanode | org.apache.hadoop.hdfs.server.datanode.DataNode | -server $HADOOP_DATANODE_OPTS (-jvm server for a secure datanode) |
fs | org.apache.hadoop.fs.FsShell | $HADOOP_CLIENT_OPTS |
dfs | org.apache.hadoop.fs.FsShell | $HADOOP_CLIENT_OPTS |
jobtracker | org.apache.hadoop.mapred.JobTracker | $HADOOP_JOBTRACKER_OPTS |
tasktracker | org.apache.hadoop.mapred.TaskTracker | $HADOOP_TASKTRACKER_OPTS |
jar | org.apache.hadoop.util.RunJar | $HADOOP_CLIENT_OPTS |
In addition to the options listed above, every command also receives the common option $HADOOP_OPTS.
# figure out which class to run
if [ "$COMMAND" = "classpath" ] ; then
if $cygwin; then
CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
echo $CLASSPATH
exit
elif [ "$COMMAND" = "namenode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"
elif [ "$COMMAND" = "secondarynamenode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode'
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_SECONDARYNAMENODE_OPTS"
elif [ "$COMMAND" = "datanode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
if [ "$starting_secure_dn" = "true" ]; then
HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
else
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
fi
elif [ "$COMMAND" = "fs" ] ; then
CLASS=org.apache.hadoop.fs.FsShell
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfs" ] ; then
CLASS=org.apache.hadoop.fs.FsShell
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "dfsadmin" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "mradmin" ] ; then
CLASS=org.apache.hadoop.mapred.tools.MRAdmin
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "fsck" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DFSck
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "balancer" ] ; then
CLASS=org.apache.hadoop.hdfs.server.balancer.Balancer
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_BALANCER_OPTS"
elif [ "$COMMAND" = "fetchdt" ] ; then
CLASS=org.apache.hadoop.hdfs.tools.DelegationTokenFetcher
elif [ "$COMMAND" = "jobtracker" ] ; then
CLASS=org.apache.hadoop.mapred.JobTracker
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOBTRACKER_OPTS"
elif [ "$COMMAND" = "historyserver" ] ; then
CLASS=org.apache.hadoop.mapred.JobHistoryServer
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_JOB_HISTORYSERVER_OPTS"
elif [ "$COMMAND" = "tasktracker" ] ; then
CLASS=org.apache.hadoop.mapred.TaskTracker
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_TASKTRACKER_OPTS"
elif [ "$COMMAND" = "job" ] ; then
CLASS=org.apache.hadoop.mapred.JobClient
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "queue" ] ; then
CLASS=org.apache.hadoop.mapred.JobQueueClient
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "pipes" ] ; then
CLASS=org.apache.hadoop.mapred.pipes.Submitter
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "version" ] ; then
CLASS=org.apache.hadoop.util.VersionInfo
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "jar" ] ; then
CLASS=org.apache.hadoop.util.RunJar
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "distcp" ] ; then
CLASS=org.apache.hadoop.tools.DistCp
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "daemonlog" ] ; then
CLASS=org.apache.hadoop.log.LogLevel
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "archive" ] ; then
CLASS=org.apache.hadoop.tools.HadoopArchives
CLASSPATH=${CLASSPATH}:${TOOL_PATH}
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
elif [ "$COMMAND" = "sampler" ] ; then
CLASS=org.apache.hadoop.mapred.lib.InputSampler
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
else
CLASS=$COMMAND
fi
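The long if/elif chain is essentially a lookup from command name to class name. A condensed sketch of the same idea using `case` is shown below; the function name `resolve_class` is hypothetical, and only a handful of the commands are included.

```shell
# Hypothetical condensed version of the command-to-class lookup.
resolve_class() {
  case "$1" in
    namenode)   echo org.apache.hadoop.hdfs.server.namenode.NameNode ;;
    datanode)   echo org.apache.hadoop.hdfs.server.datanode.DataNode ;;
    fs|dfs)     echo org.apache.hadoop.fs.FsShell ;;
    jobtracker) echo org.apache.hadoop.mapred.JobTracker ;;
    jar)        echo org.apache.hadoop.util.RunJar ;;
    *)          echo "$1" ;;   # unknown command: treated as a class name
  esac
}

resolve_class dfs
resolve_class com.example.Main   # made-up class name for the demo
```

The final branch mirrors the `CLASS=$COMMAND` fallback in the script: anything that is not a known command is passed to the JVM as a fully qualified class name, so an arbitrary main class can be run with the Hadoop CLASSPATH already set up.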
8. Launching the Java program
# Check to see if we should start a secure datanode
if [ "$starting_secure_dn" = "true" ]; then
if [ "$HADOOP_PID_DIR" = "" ]; then
HADOOP_SECURE_DN_PID="/tmp/hadoop_secure_dn.pid"
else
HADOOP_SECURE_DN_PID="$HADOOP_PID_DIR/hadoop_secure_dn.pid"
fi
exec "$HADOOP_HOME/libexec/jsvc.${JSVC_ARCH}" -Dproc_$COMMAND -outfile "$HADOOP_LOG_DIR/jsvc.out" \
-errfile "$HADOOP_LOG_DIR/jsvc.err" \
-pidfile "$HADOOP_SECURE_DN_PID" \
-nodetach \
-user "$HADOOP_SECURE_DN_USER" \
-cp "$CLASSPATH" \
$JAVA_HEAP_MAX $HADOOP_OPTS \
org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter "$@"
else
# run it
exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"
fi
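To make the final `exec` line concrete, the sketch below assembles the same command line as a string and prints it instead of running it. All values are placeholders chosen for the demo (in the real script they come from hadoop-config.sh and the earlier steps), and the result is only a dry run.

```shell
# Placeholder values; the real script computes these in earlier steps.
JAVA=java
COMMAND=fs
JAVA_HEAP_MAX=-Xmx1000m
HADOOP_OPTS="-Dhadoop.log.dir=/tmp/logs"
CLASSPATH=/opt/hadoop/conf
CLASS=org.apache.hadoop.fs.FsShell
set -- -ls /                    # stand-in for the user's remaining arguments

# Same argument order as the exec line, but collected into a string.
cmdline="$JAVA -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath $CLASSPATH $CLASS $*"
echo "$cmdline"
```

Two points about the real line: `exec` replaces the shell process with the JVM instead of starting a child, and the `-Dproc_$COMMAND` property appears to serve mainly as a marker that makes the process easy to spot in `ps` output.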
Remote debugging
To debug a Java program remotely, the program being debugged must be started with the following JVM option:
"-agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y"
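* Usually the program being debugged acts as the server (server=y), and the debugger client simply attaches to it.
* With suspend=y, the program blocks right after startup and continues only once a debugger has connected.
* With suspend=n, the program starts without waiting. This works well for long-running server processes that execute a loop: once the debugger has attached, the program stops whenever execution reaches a breakpoint inside the loop.
For example, to debug the jobtracker remotely, run the following commands when starting the jobtracker process with start-all.sh:
# Step 1: make the jobtracker JVM listen for a debugger on port 8888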
export HADOOP_JOBTRACKER_OPTS="-agentlib:jdwp=transport=dt_socket,address=8888,server=y,suspend=y"
# Step 2: start the daemons
start-all.sh
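The other classes can be debugged in the same way by using the corresponding variable from the table in this article.
As the table section also notes, every class shares the common option $HADOOP_OPTS, so setting that variable enables debugging for all of them. Because each JVM would then try to listen on the same port, this works best when only one Hadoop JVM is started on the host at a time.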
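When several daemons are debugged on one host, each needs its own port. The sketch below builds the option string from a port and a suspend flag; the helper name `jdwp_opts` and the ports 8889 and 8890 are arbitrary assumptions, while the variable names come from the table in this article.

```shell
# Hypothetical helper: build the JDWP agent option.
#   $1: port to listen on
#   $2: y or n for suspend (defaults to n)
jdwp_opts() {
  echo "-agentlib:jdwp=transport=dt_socket,address=$1,server=y,suspend=${2:-n}"
}

# One port per daemon so the JVMs do not compete for the same socket.
export HADOOP_NAMENODE_OPTS="$(jdwp_opts 8889)"
export HADOOP_DATANODE_OPTS="$(jdwp_opts 8890 y)"

echo "$HADOOP_NAMENODE_OPTS"
echo "$HADOOP_DATANODE_OPTS"
```

After the daemons are started, a debugger can attach to the chosen port, for example from an IDE's remote-debug configuration or with the JDK's `jdb -attach localhost:8889`.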