Summary of Hadoop Errors

P1:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 
 
Problem: this is a native-library build issue; the bundled library is probably 32-bit while your machine is 64-bit, so the native Hadoop library needs to be recompiled.
 
 
P2:
 2015-01-31 08:45:57,977 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-01-31 08:46:04,613 ERROR [main] util.Shell (Shell.java:getWinUtilsPath(303)) - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.conf.Configuration.getTrimmedStrings(Configuration.java:1546)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:519)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
at MyWordCount.main(MyWordCount.java:83)
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/test":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:161)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3137)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3089)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3073)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:697)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:491)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59596)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1622)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:585)
at org.apache.hadoop.hdfs.DistributedFileSystem$11.doCall(DistributedFileSystem.java:581)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:581)
at MyWordCount.main(MyWordCount.java:86)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=Administrator, access=WRITE, inode="/test":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:214)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:161)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:3137)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:3089)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3073)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:697)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:491)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59596)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy9.delete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at $Proxy9.delete(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:449)
at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1620)
... 5 more
Solution:
This is an HDFS file-permission problem: the Windows user Administrator has no write access to /test (owned by hadoop:supergroup, mode drwxr-xr-x). Grant write permission on the target directory, e.g. hadoop fs -chmod 777 /test, or run the client as the hadoop user. (For the missing winutils.exe warning see P3 below; the connection and URI details appear under Common Problems, item 3.)
 


P3:


2015-01-31 10:00:01,100 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-01-31 10:00:02,074 ERROR [main] util.Shell (Shell.java:getWinUtilsPath(303)) - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.conf.Configuration.getTrimmedStrings(Configuration.java:1546)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:519)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:453)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:136)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
at MyWordCount.main(MyWordCount.java:83)
2015-01-31 10:00:02,285 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2015-01-31 10:00:02,287 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:435)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:277)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:344)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at MyWordCount.main(MyWordCount.java:94)

Solution:
The Windows native binaries (hadoop.dll, winutils.exe) are missing. They are included in the download from https://github.com/srccodes/hadoop-common-2.2.0-bin.
The best fix is to replace the local Hadoop bin directory with hadoop-common-2.2.0-bin-master/bin, add %HADOOP_HOME%\bin to PATH, and restart the machine.


Things to watch:
The environment variable must be configured correctly, otherwise it still will not run.
Set PATH to include %HADOOP_HOME%\bin; if that does not work, use the absolute path instead.
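
If editing PATH and rebooting is inconvenient, the client code can instead point Hadoop at the directory that contains bin\winutils.exe before any FileSystem or Job call. A minimal sketch, assuming the binaries were unpacked to D:\hadoop-2.2.0 (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WinUtilsSetup {
    public static void main(String[] args) throws Exception {
        // Shell.java checks the "hadoop.home.dir" system property first,
        // then falls back to the HADOOP_HOME environment variable.
        System.setProperty("hadoop.home.dir", "D:\\hadoop-2.2.0"); // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // no winutils error once the path is right
        System.out.println(fs.getUri());
    }
}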




Common problems:


WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Problem: the Hadoop native library does not match the system architecture.


Solution:
Check the native library file:


$ file /hadoop-2.2.0/lib/native/libhadoop.so.1.0.0


libhadoop.so.1.0.0: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped


That is, this Hadoop native library is 32-bit.


Option 1: download and install a 64-bit build of Hadoop.


Option 2: directly replace /opt/hadoop/hadoop-2.2.0/lib/native with a pre-built 64-bit native.tar.gz.


2. After running start-all.sh, the datanode reports that the service has started, but it shuts itself down shortly afterwards.
This is probably caused by the datanode failing to communicate with the master.


Solution: the master's listening port did not open successfully, or it is being blocked by the master's firewall. Check the listening ports on the master and turn off the firewall:


> service iptables stop


3. When local (client) code operates on HDFS, this error appears:


Call From ZBS9WJ52FEO4TQK/10.10.113.163 to hadoop1:9000 failed on 
connection exception: java.net.ConnectException: Connection refused: no further information;


Solution:


Debugging showed that the URL property of the hdfs object was hdfs://hadoop1:49002,


which does not match the configured port; it should be hdfs://hadoop1:9000. After correcting the URL, the call reports


Permission denied: user=Administrator, access=WRITE, inode="/":root:supergroup:drwxr-xr-x


which is a write-permission problem on the target directory:


  hadoop fs -chmod 777 /user/hadoop


After that, debugging works normally.


Note: the URI refers to the one in core-site.xml; do not configure the fs.default.name property at the same time.


<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop1:9000</value>
</property>
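
A minimal client sketch for this scenario (host name, path and user name are placeholders): connect with the NameNode URI that matches core-site.xml and act as the hadoop user so the write is allowed:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use the same URI as fs.defaultFS in core-site.xml (not the wrong 49002 port),
        // and act as the "hadoop" user instead of the local Windows "Administrator".
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop1:9000"), conf, "hadoop");
        Path out = new Path("/user/hadoop/test");
        fs.mkdirs(out);                       // succeeds once the URI and user are right
        System.out.println(fs.exists(out));
        fs.close();
    }
}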


 


4. The web UI at http://hadoop1:50070 works, but "Browse the filesystem" cannot be opened.


After clicking "Browse the filesystem", the page redirects to an address that uses the hostname of one of the cluster's datanodes; the client's browser cannot resolve that hostname, so the page cannot be displayed.


Solution: the local machine cannot resolve the datanode hostnames. Edit


C:\Windows\System32\drivers\etc\hosts


and add the corresponding hostnames and IPs, e.g.:


192.168.101.115    hadoop2


192.168.101.116    hadoop1


5. Running a Hadoop program produces java heap space errors.


Method 1: modify the Hadoop environment file conf/hadoop-env.sh and add the following two lines:


export HADOOP_HEAPSIZE=2000


export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"


Method 2: the above affects all programs; to target a single job, pass a parameter at run time, for example:


bin/hadoop jar hadoop-examples-*.jar grep -D mapred.child.java.opts=-Xmx1024M input output 'dfs[a-z.]+'
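
The same per-job setting can also be made in the driver; a minimal old-API sketch (the class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class HeapConfigDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf(HeapConfigDemo.class);
        // Same property as the -D option above: heap size for each map/reduce child JVM.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        // ... set the input/output formats, mapper and reducer, then submit the job ...
    }
}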


 
 
 2014-04-03 20:32:36,596 ERROR [main] security.UserGroupInformation (UserGroupInformation.java:doAs(1494)) - PriviledgedActionException as:Administrator (auth:SIMPLE) cause:java.io.IOException: Failed to run job : Application application_1396459813671_0001 failed 2 times due to AM Container for appattempt_1396459813671_0001_000002 exited with  exitCode: 1 due to: Exception from container-launch: 
org.apache.hadoop.util.Shell$ExitCodeException: /bin/bash: line 0: fg: no job control


 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)


Googling the error above leads to this page: https://issues.apache.org/jira/browse/MAPREDUCE-5655.
Yes, that page is our solution.


 1. Modify the source of MRApps.java and YARNRunner.java, then repackage and replace the corresponding class files in the original jars. The two patched jars can be downloaded from http://download.csdn.net/detail/fansy1990/7143547 .
 Then replace the corresponding jars on the cluster, and remember to also replace the jars imported into MyEclipse. The jars in question are:
 hadoop-mapreduce-client-common-2.2.0.jar
 hadoop-mapreduce-client-jobclient-2.2.0.jar
 
 2. Modify mapred-default.xml and add the property below (this only needs to be done in the jar imported into Eclipse; the modified jar does not need to be uploaded to the cluster):


<property>
 <name>mapred.remote.os</name>
 <value>Linux</value>
 <description>
  Remote MapReduce framework's OS, can be either Linux or Windows
 </description>
</property>


(As an aside: after adding this property, I would expect conf.get("mapred.remote.os") on a newly created Configuration to return Linux, but it actually returns null; I am not sure why.)


The file lives inside hadoop-mapreduce-client-core-2.2.0.jar as mapred-default.xml.


3. The following exception appears:
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
 
Use a 64-bit JDK 1.6.


 
1): How to add nodes to Hadoop
My actual procedure for adding a node:
1. Set up the environment on the new slave first, including ssh, the JDK, and copies of the relevant config, lib and bin directories;
2. Add the new datanode's hostname to the namenode and to the other datanodes in the cluster;
3. Add the new datanode's IP to conf/slaves on the master;
4. Restart the cluster and check that the new datanode shows up;
5. Run bin/start-balancer.sh; this takes a long time.
Notes:
1. Without balancing, the cluster puts all new data on the new node, which lowers MapReduce efficiency;
2. bin/start-balancer.sh can also be run with a parameter such as -threshold 5;
   threshold is the balancing threshold, 10% by default; the lower the value, the more evenly balanced the nodes, but the longer it takes.
3. The balancer can also run while MR jobs are active, but the default dfs.balance.bandwidthPerSec is very low, 1 MB/s. When no MR jobs are running, raise this setting to speed up balancing.


Other notes:
1. Make sure the slave's firewall is turned off;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely add the master's and other slaves' IPs to the new slave's /etc/hosts.
2): Number of mappers and reducers
URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks for your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but increases load balancing and lowers the cost of failures. At one extreme is the 1 map/1 reduce case where nothing is distributed. The other extreme is to have 1,000,000 maps/ 1,000,000 reduces where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
Number of Reduces
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num). 
My own understanding:
The number of mappers depends on the input files and on the file splits; the upper bound for a split is dfs.block.size and the lower bound can be set via mapred.min.split.size, but ultimately the InputFormat decides.


A good guideline:
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
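
Both task counts mentioned above can be set from the driver; a minimal old-API sketch (the class name and counts are illustrative):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TaskCountDemo.class);
        conf.setJobName("task-count-demo");
        conf.setNumMapTasks(20);     // only a hint to the InputFormat
        conf.setNumReduceTasks(4);   // e.g. 0.95 * (nodes * reduce slots per node)
        // ... configure the mapper, reducer and paths, then submit with JobClient.runJob(conf) ...
    }
}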


3): Adding a new disk to a single node
1. Modify dfs.data.dir on the node that gets the new disk, separating the new and old data directories with a comma;
2. Restart DFS.


4): Syncing the Hadoop code
hadoop-env.sh
# host:path where hadoop code should be rsync'd from.  Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop


5): Merging small HDFS files from the command line
hadoop fs -getmerge <src> <dest>
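
If you would rather merge from code, the Java counterpart in Hadoop 1.x/2.x is FileUtil.copyMerge; a minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class GetMergeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        // Concatenate every file under /user/hadoop/output into one local file,
        // like "hadoop fs -getmerge /user/hadoop/output /tmp/output_merged".
        FileUtil.copyMerge(hdfs, new Path("/user/hadoop/output"),
                           local, new Path("/tmp/output_merged"),
                           false /* keep the source files */, conf, null);
    }
}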


6): How to restart reduce jobs
Introduced recovery of jobs when JobTracker restarts. This facility is off by default.
Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
I have not verified this yet.


7): Problems with IO write operations
0-1246359584298, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/
172.16.100.165:50010 remote=/172.16.100.165:50930]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:293)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:179)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:94)
        at java.lang.Thread.run(Thread.java:619)


It seems there are many reasons that it can timeout, the example given in
HADOOP-3831 is a slow reading client.


Workaround: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml.
My understanding is that this issue should be fixed in Hadoop 0.19.1 so that
we should leave the standard timeout. However until then this can help
resolve issues like the one you're seeing.


8): How to decommission HDFS nodes
The dfsadmin help text in the current version does not explain this clearly (a bug has already been filed). The correct procedure is:
1. Point dfs.hosts at the current slaves file, using the full path; note that the hostnames in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the slaves to be decommissioned in another file, e.g. slaves.ex, and point the dfs.host.exclude parameter at that file's full path.
3. Run bin/hadoop dfsadmin -refreshNodes.
4. The web UI, or bin/hadoop dfsadmin -report, shows the decommissioning nodes with the status "Decommission in progress" until all data that needs re-replicating has been copied.
5. When that finishes, remove the decommissioned nodes from the slaves file (the one dfs.hosts points to).


Incidentally, the -refreshNodes command has three other uses:
2. Adding permitted nodes to the list (add hostnames to dfs.hosts);
3. Removing nodes directly without re-replicating their data (remove hostnames from dfs.hosts);
4. The reverse of decommissioning: stop the decommissioning of nodes that appear in both the exclude file and dfs.hosts, turning "Decommission in progress" nodes back to Normal (shown as "in service" in the web UI).




10): Hadoop lessons learned from others
1. Solving Hadoop OutOfMemoryError:
<property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx800M -server</value>
</property>
With the right JVM size in your hadoop-site.xml , you will have to copy this
to all mapred nodes and restart the cluster.
Or: hadoop jar jarfile [main class] -D mapred.child.java.opts=-Xmx800M


2. Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
When I use nutch 1.0 I get this error:
Hadoop java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) while indexing.
This is also easy to solve:
Delete conf/log4j.properties and you can then see the detailed error report.
What showed up here was out of memory.
The fix was to add the parameters -Xms64m -Xmx512m when running the main class org.apache.nutch.crawl.Crawl.
Your case may be different, but once you can see the detailed error report the problem becomes easy to solve.


11): Using the distributed cache
It behaves like a global variable, but because the data is large it cannot go into the config file, so the distributed cache is used instead.
Usage (see "The Definitive Guide", p. 240):
1. From the command line: pass -files to ship the files to be looked up (they can be local files or HDFS files (using hdfs://xxx?)), or -archives (JAR, ZIP, tar, etc.)
% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
  -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
2. From within the program:
   public void configure(JobConf conf) {
      metadata = new NcdcStationMetadata();
      try {
        metadata.initialize(new File("stations-fixed-width.txt"));
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
   }
An alternative, indirect approach (apparently not available in hadoop-0.19.0):
call addCacheFile() or addCacheArchive() to add files,
and use getLocalCacheFiles() or getLocalCacheArchives() to retrieve them.
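
A minimal old-API sketch of that indirect approach via org.apache.hadoop.filecache.DistributedCache (the file path and class names are placeholders; in real code the configure() method lives in the mapper or reducer):

import java.io.File;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheDemo.class);
        // Ship an HDFS file to every task node before the job starts.
        DistributedCache.addCacheFile(new URI("/user/hadoop/stations-fixed-width.txt"), conf);
        // ... configure and submit the job ...
    }

    // Inside the mapper/reducer, read the local copy of the cached file:
    public void configure(JobConf conf) {
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            File stations = new File(cached[0].toString());
            // ... parse the metadata file ...
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}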


12): Hadoop's web UIs for jobs
There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.


13): Hadoop monitoring
Use nagios for alerting and ganglia for the monitoring graphs.


14): status of 255 error
Error type:
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)


Cause:
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher value. By default, their values are 24 hours. These might be the reason for failure, though I'm not sure


15): split size
FileInputFormat input splits (see "The Definitive Guide", p. 190):
mapred.min.split.size: default=1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default=Long.MAX_VALUE, the largest valid size.
dfs.block.size: default = 64M; set to 128M on our system.
If minimum split size > block size, the number of blocks involved increases. (My guess: when data has to be fetched from other nodes, blocks are combined, so more blocks take part.)
If maximum split size < block size, blocks are split further.


split size = max(minimumSize, min(maximumSize, blockSize));
where minimumSize < blockSize < maximumSize.
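
The formula corresponds directly to what FileInputFormat computes; a tiny illustrative helper (the values are examples only):

public class SplitSizeDemo {
    // split size = max(minimumSize, min(maximumSize, blockSize))
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long min = 1L;                        // mapred.min.split.size
        long max = Long.MAX_VALUE;            // mapred.max.split.size
        long block = 128L * 1024 * 1024;      // dfs.block.size = 128M
        System.out.println(computeSplitSize(min, max, block));   // 134217728, i.e. one split per block
    }
}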


16): sort by value
Hadoop does not provide a direct sort-by-value mechanism, because it would hurt MapReduce performance.
It can be achieved with a composite approach; for details see "The Definitive Guide", p. 250.
The basic idea (a sketch of the pieces follows below):
1. Combine the key and value into a new composite key;
2. Override the partitioner so it partitions on the old key only:
conf.setPartitionerClass(FirstPartitioner.class);
3. Define a custom key comparator that sorts by the old key first and then by the old value:
conf.setOutputKeyComparatorClass(KeyComparator.class);
4. Override the grouping comparator so it also groups on the old key only:  conf.setOutputValueGroupingComparator(GroupComparator.class);
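
A minimal old-API sketch of those three classes, assuming the mapper emits a composite Text key of the form "oldKey\toldValue" (the class names match the driver calls above; the key layout is an assumption):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Step 2: partition on the old key only, so every value of one key reaches the same reducer.
class FirstPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
        String oldKey = key.toString().split("\t", 2)[0];
        return (oldKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    public void configure(JobConf job) { }
}

// Step 3: sort by the old key first, then by the old value.
class KeyComparator extends WritableComparator {
    protected KeyComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
        String[] x = a.toString().split("\t", 2);
        String[] y = b.toString().split("\t", 2);
        int cmp = x[0].compareTo(y[0]);
        return cmp != 0 ? cmp : x[1].compareTo(y[1]);
    }
}

// Step 4: group reducer input on the old key only.
class GroupComparator extends WritableComparator {
    protected GroupComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
        String x = a.toString().split("\t", 2)[0];
        String y = b.toString().split("\t", 2)[0];
        return x.compareTo(y);
    }
}

With the three driver calls from steps 2-4, the values for each old key then arrive at the reducer already sorted by value.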


17): Handling small input files
A long series of small files used as input lowers Hadoop efficiency.
There are three ways to handle small files:
1. Merge the series of small files into one SequenceFile to speed up MapReduce.
See WholeFileInputFormat and SmallFilesToSequenceFileConverter, "The Definitive Guide", p. 194.
2. Use CombineFileInputFormat, which builds on FileInputFormat (I have not tried this);
3. Use Hadoop archives (similar to packing) to reduce the metadata memory the small files consume on the namenode. (This method does not always work out, so it is not recommended.)
   How:
   Archive the /my/files directory and its subdirectories into files.har and place it under /my:
   bin/hadoop archive -archiveName files.har /my/files /my
   
   List the files in the archive:
   bin/hadoop fs -lsr har://my/files.har


18):  skip bad records
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));


SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));


String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);


For skipping failed tasks try : mapred.max.map.failures.percent


19): Restarting a single datanode
If a datanode runs into problems and, after fixing them, needs to rejoin the cluster without restarting the whole cluster:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker


20): reduce exceeds 100%
"Reduce Task Progress shows > 100% when the total size of map outputs (for a
single reducer) is high "
Cause:
During the reduce merge phase, the progress check has an error margin, so the status can exceed 100%, and the statistics code then throws: java.lang.ArrayIndexOutOfBoundsException: 3
        at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
        at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)


JIRA: https://issues.apache.org/jira/browse/HADOOP-5210


21):  counters
Three kinds of counters:
1. built-in counters: Map input bytes, Map output records...
2. enum counters
   Usage:
  enum Temperature {
    MISSING,
    MALFORMED
  }


reporter.incrCounter(Temperature.MISSING, 1)
   Result display:
09/04/20 06:33:36 INFO mapred.JobClient:   Air Temperature Recor
09/04/20 06:33:36 INFO mapred.JobClient:     Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient:     Missing=66136856
3. dynamic counters:
   Usage:
   reporter.incrCounter("TemperatureQuality", parser.getQuality(),1);
   
   Result display:
09/04/20 06:33:36 INFO mapred.JobClient:   TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient:     2=1246032
09/04/20 06:33:36 INFO mapred.JobClient:     1=973422173
09/04/20 06:33:36 INFO mapred.JobClient:     0=17


Namenode in safe mode
Solution:
bin/hadoop dfsadmin -safemode leave


22): java.net.NoRouteToHostException: No route to host
Solution:
sudo /etc/init.d/iptables stop


23): After changing the namenode, running select in Hive still points to the old namenode address.
This is because: when you create a table, Hive actually stores the location of the table (e.g.
hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but Hive's metastore is still pointing to the locations within the old
cluster. I could modify the metastore to update with the new IP every time I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master.
So all the namenode addresses previously recorded in the metastore must be replaced with the current namenode address.

10: Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into the HDFS (e.g., when you run a command like bin/hadoop dfs -put).
Solution:
Go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer then this URL will be http://localhost:50070). Once at that page click on the number where it tells you how many DataNodes you have to look at a list of the DataNodes in your cluster.
If it says you have used 100% of your space, then you need to free up room on local disk(s) of the DataNode(s).
If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried the disk had 1GB free but Hadoop reported that it was 100% full. Then we freed up another 1GB and then it said that the disk was 99.15% full and started writing data into the HDFS again. We encountered this bug on Windows XP SP2.


------------------------------ Common Errors --------------------------------------
11:Your DataNodes won't start, and you see something like this in logs/*datanode*:
Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
Cause:
Your Hadoop namespaceID became corrupted. Unfortunately the easiest thing to do is to reformat the HDFS.
Solution:
You need to do something like this:
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format




12:You can run Hadoop jobs written in Java (like the grep example), but your HadoopStreaming jobs 
(such as the Python example that fetches web page titles) won't work.
Cause:
You might have given only a relative path to the mapper and reducer programs. The tutorial originally 
just specified relative paths, but absolute paths are required if you are running in a real cluster.
Solution:
Use absolute paths like this from the tutorial:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper  $HOME/proj/hadoop/multifetch.py         \
  -reducer $HOME/proj/hadoop/reducer.py            \
  -input   urls/*                                  \
  -output  titles
  
13: 2009-01-08 10:02:40,709 ERROR metadata.Hive (Hive.java:getPartitions(499)) 
- javax.jdo.JDODataStoreException: Required table missing : ""PARTITIONS"" in Catalog "" Schema "".
 JPOX requires this table to perform its persistence operations. Either your MetaData is incorrect, 
 or you need to enable "org.jpox.autoCreateTables"
Cause: org.jpox.fixedDatastore was set to true in hive-default.xml.
14: 09/08/31 18:25:45 INFO hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException:Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Abandoning block blk_-2932256218448902464_1001
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:57 INFO hdfs.DFSClient: Abandoning block blk_-1014449966480421244_1001
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException:
Bad connect ack with firstBadLink 192.168.1.16:50010
> 09/08/31 18:26:03 INFO hdfs.DFSClient: Abandoning block blk_7193173823538206978_1001
> 09/08/31 18:26:09 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable
to create new block.
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)

> 09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001
bad datanode[2] nodes == null
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input"
- Aborting...
> put: Bad connect ack with firstBadLink 192.168.1.16:50010




Solution:
I have resolved the issue.
What I did:


1) '/etc/init.d/iptables stop' -->stopped firewall
2) SELINUX=disabled in '/etc/selinux/config' file.-->disabled selinux
It worked for me after these two changes.

15: After replacing IPs with hostnames, the datanodes fail to come up.
Solution:
Delete the temp files and restart the Hadoop cluster.
The cause is that repeated deployments left the temp files inconsistent with the namenode.

How to fix jline.ConsoleReader.readLine not taking effect on Windows:
In CliDriver.java's main() there is a statement, reader.readLine, used to read standard input, but on Windows this statement always returns null. The reader is a jline.ConsoleReader instance, which makes debugging in Eclipse on Windows inconvenient.
We can replace it with java.util.Scanner, changing the original

while ((line = reader.readLine(curPrompt + "> ")) != null)

to:

Scanner sc = new Scanner(System.in);
while ((line = sc.nextLine()) != null)

Recompile and redeploy, and SQL statements can then be read normally from standard input.


14): Possible causes of the "does not have a scheme" error when debugging Hive from Eclipse on Windows
1. The hive.metastore.local property in the Hive configuration file is set to false; change it to true, since this is a standalone setup.
2. The HIVE_HOME environment variable is not set, or is set incorrectly.
3. "does not have a scheme" is most likely because hive-default.xml cannot be found. For how to fix Eclipse not finding hive-default.xml when debugging Hive, see: http://bbs.hadoopor.com/thread-292-1-1.html

1. The Chinese-character problem
    Chinese text parsed out of a URL still prints as garbage in Hadoop. We once thought Hadoop did not support Chinese; after reading the source code, it turns out Hadoop merely does not support outputting Chinese in GBK.


    The code below is from TextOutputFormat.class. Hadoop's default output formats all inherit from FileOutputFormat; FileOutputFormat has two subclasses, one for binary-stream output and the other, TextOutputFormat, for text output.


    public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
    implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8"; // hard-coded to UTF-8 here
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength()); // this also needs to change
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }

}
    As you can see, Hadoop's default output is hard-wired to UTF-8, so if the Chinese text was decoded correctly, setting the Linux client's character encoding to utf-8 will display the Chinese, because Hadoop wrote it out in UTF-8.
    Since most databases define their fields in GBK, what if you want Hadoop to output Chinese in GBK to stay compatible with the database?
    We can define a new class:
    public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
    implements RecordWriter<K, V> {
    // simply use gbk here instead
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }

    private void writeObject(Object o) throws IOException {
      // unlike TextOutputFormat, write Text objects through getBytes(gbk) as well
      out.write(o.toString().getBytes(gbk));
    }

}
    Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce driver code
    and Chinese will be output in GBK.
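
    A minimal old-API driver sketch using the class above (job name, mapper/reducer and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GbkDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(GbkDriver.class);
        conf1.setJobName("gbk-output-demo");
        conf1.setOutputKeyClass(Text.class);
        conf1.setOutputValueClass(Text.class);
        // conf1.setMapperClass(...); conf1.setReducerClass(...);   // your own classes
        conf1.setOutputFormat(GbkOutputFormat.class);   // write results in GBK instead of UTF-8
        FileInputFormat.setInputPaths(conf1, new Path("input"));    // placeholder paths
        FileOutputFormat.setOutputPath(conf1, new Path("output"));
        JobClient.runJob(conf1);
    }
}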



2. One time, while running a normal MapReduce job, this error was thrown:
java.io.IOException: All datanodes xxx.xxx.xxx.xxx:xxx are bad. Aborting…
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
java.io.IOException: Could not get block locations. Aborting…
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
Investigation showed the cause was too many open files on the Linux machines.
The ulimit -n command shows that the default open-file limit on Linux is 1024.
Modify /etc/security/limits.conf and raise the limit for the hadoop user, e.g. add: hadoop soft nofile 65535
Then rerun the program (preferably after changing this on all the datanodes) and the problem is solved.


3. After running for a while, Hadoop cannot be stopped with stop-all.sh; it reports:
no tasktracker to stop, no datanode to stop


The cause is that when stopping, Hadoop relies on the pid files of the mapred and dfs processes on the datanodes.
By default the pid files are kept under /tmp, and Linux periodically (roughly every month or every 7 days) deletes the files in that directory.
So once hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid have been deleted,
the namenode naturally can no longer find those processes on the datanodes.


Setting export HADOOP_PID_DIR in the configuration file solves this.

Problem:
Incompatible namespaceIDs in /usr/local/hadoop/dfs/data: namenode namespaceID = 405233244966; 
datanode namespaceID = 33333244
Cause:
Every time hadoop namenode -format is run, a new namespaceID is generated for the NameNode,
but the DataNode under the hadoop.tmp.dir directory still keeps the previous namespaceID. Because the namespaceIDs differ,
the DataNode cannot start. So just delete the hadoop.tmp.dir directory before each hadoop namenode -format
and it will start successfully. Note that this means deleting the local directory hadoop.tmp.dir points to, not an HDFS directory.




1. Problems encountered with Hadoop


Problems encountered earlier were not written down, so they have been forgotten.


(1) The NameNode fails to start because HDFS was formatted multiple times, which left the namespaceID in the DataNodes' VERSION file different from the NameNode's. (For the NameNode, this file is in the current folder under the path given by the dfs.name.dir parameter in hdfs-site; for a DataNode, it is in the current folder under the path given by dfs.data.dir.)


Solution: either edit the namespaceID values so they match and restart Hadoop, or delete the directories specified by dfs.name.dir and dfs.data.dir and reformat with bin/hadoop namenode -format. The second method is risky, because it deletes everything that was previously stored on HDFS.


(2) Eclipse's Run On Hadoop is a trap: it does not actually run on the cluster. (You can set the number of reducer tasks with job.setNumReduceTasks, but no matter how many you set there is only one, because the job runs locally; only the input data is on the cluster. In other words, both the Mapper and the Reducer tasks run locally. You can also tell from the console output: on the cluster you see something like Running job: job_201405090934_0024, whereas a local job shows Running job: job_local426339719_0001, with "local" in the middle. You can also check the web UI at node1:50030; local jobs never show up there.)


Solution: to really run on the cluster, three things must be done, as shown below:


          


// Important: must be set, otherwise you get a "cannot read partitioner file" error
conf.set("fs.default.name", "node1:49000");
// Important: must be set, otherwise the job will not run on the cluster
conf.set("mapred.job.tracker", "node1:49001");
// Important: package the relevant classes and dependent jars; this is required for running
// on the cluster, otherwise the cluster cannot find the Mapper and other class files
File jarpath;
try {
    jarpath = JarTools.makeJar("bin");
    conf.set("mapred.jar", jarpath.toString());
} catch (Exception e) {
    logger.error("Error while building the jar!");
    e.printStackTrace();
    return;
}
(3) Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
At run time, in Run Configurations, under Arguments, add the following to VM arguments: -Djava.library.path=/home/hadoop/hadoop-1.2.1/lib/native/Linux-i386-32


Adjust the path to match your actual installation.
2. HBase problems


  (1) Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unable to locate a login configuration)
22:32:56,821 WARN                    ClientCnxn:1089 - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
22:32:56,951 WARN          RecoverableZooKeeper:253 - Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
22:32:56,952 INFO                  RetryCounter:155 - Sleeping 1000ms before retry #0...


This is because the ZooKeeper quorum was not set in the code:


//set the ZooKeeper quorum
HBASE_CONFIG.set("hbase.zookeeper.quorum", "node2,node3,node4,node5,node6,node7,node8");




It is best to also set the HMaster:


//set the HMaster
 HBASE_CONFIG.set("hbase.zookeeper.master","node1:60000");


 (2)
JobClient:1422 - Task Id : attempt_201405081252_0008_m_000000_0, Status : FAILED
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:676)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.FileNotFoundException: File /tmp/partitions_de363500-5535-466b-91bb-36472457386d does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:816)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:301)
    at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:88)
    ... 10 more


Because HFileOutputFormat and TotalOrderPartitioner are used when generating HFiles (HFiles require the RowKeys to be globally sorted), a partitions file is needed, but the TaskTracker cannot find it. For the TaskTracker to read the partitions file, it must exist on HDFS, so the following parameter needs to be set:


Configuration conf = HbaseOperation.HBASE_CONFIG; 
         conf.set("fs.default.name","node1:49000");


The precise explanation still needs further investigation.


(3)Wrong number of partitions in keyset


This means the number of partitions in your partitions file does not equal the number of reducers minus one, i.e. the number of Regions does not equal the number of reducers minus one. In practice it happens because your job ran locally (so there is only one reducer) while there are multiple Regions. If you are interested, look at the TotalOrderPartitioner source code, which contains this passage:


 for (int i = 0; i < splitPoints.length - 1; ++i) {
        if (comparator.compare(splitPoints[i], splitPoints[i+1]) >= 0) {
          throw new IOException("Split points are out of order");
        }
      }
HFileOutputFormat.configureIncrementalLoad(job, table) configures the job automatically. TotalOrderPartitioner first sorts the keys globally and then assigns them to the reducers, guaranteeing that the min-max key ranges of different reducers never overlap, because when the data is loaded into HBase as whole Regions the keys must be strictly ordered.


That is all for now. I have forgotten some of the problems; I will keep updating this as I run into new ones.


 


Reposted from blog.csdn.net/nysyxxg/article/details/79240631