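Hi everyone!
Our project needed data from Hive loaded into HBase. Starting from a blog post I found online, I added my own understanding, the steps I actually ran, and a few common errors, and put together this post. I hope it helps.
Bulk Load: Best Practice for Importing Data into HBase
I. Overview
HBase provides many ways to import data. Two are commonly used:
1. Use the TableOutputFormat provided by HBase, which writes data into HBase from a MapReduce job.
2. Use the native HBase client API.
Both approaches have to talk to the RegionServers that host the data for every write, so loading a large volume of data in one go consumes a lot of resources, and neither is the most efficient option. HBase stores its data on HDFS as HFiles, so a more efficient and convenient approach is "Bulk Loading": generate the HFiles directly with the HFileOutputFormat class that HBase provides, then hand them to the cluster.
II. How Bulk Load Works
A bulk load consists of two main steps:
1. Prepare the data files
The first step runs a MapReduce job that uses HFileOutputFormat to write HBase data files (StoreFiles).
HFileOutputFormat makes sure each output HFile fits within a single region. To achieve this, the job uses the TotalOrderPartitioner class to partition the map output into key ranges, each of which corresponds to one region of the HBase table.
2. Load into the HBase table
The second step uses the completebulkload tool to hand each file from step one to the RegionServer that serves the matching region, and moves the file into that region's storage directory on HDFS. Once that is done, the data is visible to clients.
If region boundaries change while the bulk load is being prepared, or between preparation and completion, completebulkload automatically splits the data files along the new boundaries. This splitting is not efficient, so keep the delay between preparing the files and loading them as short as possible, especially when other clients are loading data into the same table with other tools at the same time.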
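To make the key-range partitioning in step 1 concrete, here is a minimal sketch in plain Java. It is not HBase's TotalOrderPartitioner, only an illustration of the idea: given the sorted start keys of the regions, a row key is assigned to the region whose range contains it, using unsigned byte-wise comparison (the order HBase uses for row keys). The class name, the method names, and the split points key2 and key5 are made up for this example.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: maps a row key to a region index given sorted region start keys. */
final class RegionPartitionSketch {

    /** Compares two byte arrays as unsigned bytes, left to right; shorter prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /**
     * Returns the index of the partition whose key range contains rowKey.
     * splitPoints holds the start keys of partitions 1..n in ascending order;
     * partition 0 implicitly starts at the empty key.
     */
    static int partitionFor(byte[] rowKey, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(rowKey, splitPoints[mid]) >= 0) {
                lo = mid + 1; // rowKey is at or beyond this start key
            } else {
                hi = mid;     // rowKey belongs to an earlier partition
            }
        }
        return lo;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical table with three regions: [ "", key2 ), [ key2, key5 ), [ key5, end )
        byte[][] splits = { bytes("key2"), bytes("key5") };
        for (String k : new String[] { "key1", "key4", "key9" }) {
            System.out.println(k + " -> partition " + partitionFor(bytes(k), splits));
        }
    }
}
```

Because every reducer receives exactly one such key range, each HFile it writes falls inside a single region, which is what lets the second step move files into place without rewriting them.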
Note:
The completebulkload step simply imports the output files of importtsv or HFileOutputFormat into a table, using a command like the following:
hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
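The command finishes quickly and loads the HFiles under /user/todd/myoutput into the table mytable. Note: if the target table does not exist, the tool creates it automatically.
III. Notes on the HFile-generating program:
1. Final output: whether it comes from the map phase or the reduce phase, the output key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.
2. The value type of the final output is KeyValue or Put; the corresponding sorter is KeyValueSortReducer or PutSortReducer, respectively.
3. The MR example sets the output format with job.setOutputFormatClass(HFileOutputFormat2.class). The older HFileOutputFormat is only suited to organizing one column family into HFiles at a time; the example in this post uses HFileOutputFormat2 and writes two column families (fm1 and fm2) in a single job.
4. In the MR example, HFileOutputFormat2.configureIncrementalLoad(job, table) configures the job automatically. The TotalOrderPartitioner it sets up requires the keys to be globally sorted and then assigns them to the reducers so that the [min, max] key ranges of different reducers never overlap. This is necessary because, once loaded into HBase, the keys within a region must be strictly ordered.
5. The MR example finally writes the HFiles to HDFS, with one subdirectory per column family under the output path. Loading the HFiles into HBase amounts to moving them into the table's regions, so afterwards the column-family subdirectories under the output path are empty.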
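The ordering requirement in notes 2 and 4 can be illustrated with a small plain-Java sketch. It is not PutSortReducer itself, only the same idea: before cells are written to an HFile they must be sorted by row key, then column family, then qualifier. The Cell class and method names are hypothetical, and the sketch compares Java strings, which matches HBase's byte order only for ASCII keys such as the sample data in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: sorts cells the way they must appear inside an HFile. */
final class CellSortSketch {

    /** Hypothetical stand-in for an HBase cell: row key, family, qualifier, value. */
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Cell(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " " + family + ":" + qualifier + " " + value;
        }
    }

    /** Row key first, then column family, then qualifier. */
    static final Comparator<Cell> ORDER = Comparator
            .comparing((Cell c) -> c.row)
            .thenComparing(c -> c.family)
            .thenComparing(c -> c.qualifier);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // The four sample cells from this post, deliberately out of order.
        List<Cell> unsorted = Arrays.asList(
                new Cell("key4", "fm1", "col1", "value4"),
                new Cell("key1", "fm2", "col1", "value99"),
                new Cell("key1", "fm1", "col2", "value2"),
                new Cell("key1", "fm1", "col1", "value1"));
        for (Cell c : sorted(unsorted)) {
            System.out.println(c);
        }
    }
}
```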
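IV. Hands-on Walkthrough
Step 1: Create the test data and upload it to HDFS. The fields are tab-separated, matching the split("\t") in the mapper of step 3: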
[root@hadoop test]# hadoop fs -cat /test/hbase.txt
key1 fm1:col1 value1
key1 fm1:col2 value2
key1 fm2:col1 value99
key4 fm1:col1 value4
Step 2: Create the target table in HBase
-- create the table
create 'hfiletable','fm1','fm2'
Verify the newly created table in HBase
hbase(main):006:0> truncate 'hfiletable'
Truncating 'hfiletable' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 4.2430 seconds
hbase(main):007:0> scan 'hfiletable'
ROW COLUMN+CELL
0 row(s) in 0.3490 seconds
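Note: truncate is used here only to clear out any historical data. If the table has just been created for the first time, this step is not needed.
Step 3: Create the BulkLoadJob class in Eclipse. This is the program the job runs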
package day01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class BulkLoadJob {
    static Logger logger = LoggerFactory.getLogger(BulkLoadJob.class);

    // Mapper: turns each "rowkey<TAB>family:qualifier<TAB>value" line into a Put keyed by the row key.
    public static class BulkLoadMap extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // No validation here: a line without three tab-separated fields will throw.
            String[] valueStrSplit = value.toString().split("\t");
            String hkey = valueStrSplit[0];
            String family = valueStrSplit[1].split(":")[0];
            String column = valueStrSplit[1].split(":")[1];
            String hvalue = valueStrSplit[2];
            final byte[] rowKey = Bytes.toBytes(hkey);
            final ImmutableBytesWritable HKey = new ImmutableBytesWritable(rowKey);
            Put HPut = new Put(rowKey);
            byte[] cell = Bytes.toBytes(hvalue);
            HPut.add(Bytes.toBytes(family), Bytes.toBytes(column), cell);
            context.write(HKey, HPut);
        }
    }

    // args[0] = input path on HDFS, args[1] = HFile output path on HDFS, args[2] = HBase table name
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String inputPath = args[0];
        String outputPath = args[1];
        HTable hTable = null;
        try {
            Job job = Job.getInstance(conf, "ExampleRead");
            job.setJarByClass(BulkLoadJob.class);
            job.setMapperClass(BulkLoadJob.BulkLoadMap.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            // speculation
            job.setSpeculativeExecution(false);
            job.setReduceSpeculativeExecution(false);
            // in/out format
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(HFileOutputFormat2.class);

            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, new Path(outputPath));

            // Configures the reducer, partitioner and reduce count to match the table's regions.
            hTable = new HTable(conf, args[2]);
            HFileOutputFormat2.configureIncrementalLoad(job, hTable);

            if (job.waitForCompletion(true)) {
                // Open up permissions on the output so the HBase user can move the HFiles.
                FsShell shell = new FsShell(conf);
                try {
                    shell.run(new String[]{"-chmod", "-R", "777", args[1]});
                } catch (Exception e) {
                    logger.error("Couldn't change the file permissions ", e);
                    throw new IOException(e);
                }
                // Load the generated HFiles into the HBase table.
                LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
                loader.doBulkLoad(new Path(outputPath), hTable);
            } else {
                logger.error("loading failed.");
                System.exit(1);
            }
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } finally {
            if (hTable != null) {
                hTable.close();
            }
        }
    }
}
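The mapper above indexes straight into the split result, so a line with a missing field or a column without a colon fails the task with an ArrayIndexOutOfBoundsException. As a hedged sketch of a more defensive parse, the plain-Java helper below returns an empty Optional for malformed lines so the caller can skip or count them. LineParseSketch and ParsedCell are hypothetical names, not part of the original job. One small difference from the mapper: it splits the column at the first colon only, so a qualifier that itself contains a colon is kept whole.

```java
import java.util.Optional;

/** Illustrative only: validates one "rowkey<TAB>family:qualifier<TAB>value" input line. */
final class LineParseSketch {

    /** Hypothetical holder for the four parts of a parsed line. */
    static final class ParsedCell {
        final String rowKey;
        final String family;
        final String qualifier;
        final String value;

        ParsedCell(String rowKey, String family, String qualifier, String value) {
            this.rowKey = rowKey;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /** Returns the parsed cell, or Optional.empty() if the line is malformed. */
    static Optional<ParsedCell> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // limit -1 keeps trailing empty fields, so "a\tb:c\t" still yields three fields
        String[] fields = line.split("\t", -1);
        if (fields.length != 3 || fields[0].isEmpty()) {
            return Optional.empty();
        }
        int colon = fields[1].indexOf(':');
        // need at least one character on each side of the colon
        if (colon <= 0 || colon == fields[1].length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new ParsedCell(
                fields[0],
                fields[1].substring(0, colon),
                fields[1].substring(colon + 1),
                fields[2]));
    }

    public static void main(String[] args) {
        String[] lines = { "key1\tfm1:col1\tvalue1", "key1 fm1:col1 value1", "key1\tfm1\tvalue1" };
        for (String line : lines) {
            System.out.println(parse(line).isPresent() ? "ok:  " + line : "bad: " + line);
        }
    }
}
```

Inside the real mapper, the same check would let the job skip a bad line (and perhaps increment a counter) instead of failing.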
Step 4: Package the class from step 3 as Hdfs_To_Hbase.jar, upload it to /root/test, and check that it is there
[root@hadoop test]# ls -l /root/test/Hdfs_To_Hbase.jar
-rw-r--r-- 1 root root 44481940 Aug 20 06:14 /root/test/Hdfs_To_Hbase.jar
Step 5: On the Linux host, run the uploaded jar
hadoop jar /root/test/Hdfs_To_Hbase.jar /test/hbase.txt /Hdfs_to_Hbase hfiletable >123.log 2>&1 &
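Arguments:
1 /root/test/Hdfs_To_Hbase.jar is the jar file
2 /test/hbase.txt is the source data on HDFS
3 /Hdfs_to_Hbase is the temporary HDFS directory where the MR job writes the HFiles on each run; delete it before running
4 hfiletable is the HBase table name; the table must be created in HBase beforehand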
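The job reads these three positional arguments without checking them, so a missing argument shows up only as an ArrayIndexOutOfBoundsException. A minimal sketch of an up-front check, in plain Java, might look like the following. BulkLoadArgsSketch is a hypothetical helper and not part of the original program; the usage text simply mirrors the argument order listed above.

```java
/** Illustrative only: validates the three positional arguments of the bulk load job. */
final class BulkLoadArgsSketch {
    static final String USAGE =
            "Usage: hadoop jar Hdfs_To_Hbase.jar <inputPath> <outputPath> <tableName>";

    final String inputPath;
    final String outputPath;
    final String tableName;

    private BulkLoadArgsSketch(String inputPath, String outputPath, String tableName) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        this.tableName = tableName;
    }

    /** Throws IllegalArgumentException with the usage text if the arguments are unusable. */
    static BulkLoadArgsSketch parse(String[] args) {
        if (args == null || args.length != 3) {
            throw new IllegalArgumentException(USAGE);
        }
        for (String arg : args) {
            if (arg == null || arg.trim().isEmpty()) {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new BulkLoadArgsSketch(args[0], args[1], args[2]);
    }

    public static void main(String[] args) {
        try {
            BulkLoadArgsSketch parsed = parse(args);
            System.out.println("input=" + parsed.inputPath
                    + " output=" + parsed.outputPath
                    + " table=" + parsed.tableName);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```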
Step 6: Check the job log. The file is long, so read through it patiently
[root@hadoop ~]# cat 123.log
18/08/20 06:15:24 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x130dca52 connecting to ZooKeeper ensemble=localhost:2181
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:host.name=hadoop
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_45
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jre
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/httpclient-4.2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-api-1.7.10.jar:/usr/local/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-annotations-2.7.3.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar:/usr/local/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-framework-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.3.jar:/usr/local/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/local/hadoop/share/hadoop/common/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4
.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-client-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/common/lib/httpcore-4.2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/local/hadoop/share/hadoop/common/lib/jsch-0.1.42.jar:/usr/local/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/common/lib/avro-1.7.4.jar:/usr/local/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/common/
lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.3-tests.jar:/usr/local/hadoop/share/hadoop/common/hadoop-nfs-2.7.3.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.23.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.7.3-tests.jar:/usr/local/hadoo
p/share/hadoop/hdfs/hadoop-hdfs-nfs-2.7.3.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/loc
al/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6-tests.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.7.3.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hadoop-annotations-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-gu
ice-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.3-tests.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.7.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.3.jar:/usr/local/hadoop/contrib/capacity-scheduler/*.jar:/usr/local/hbase/lib/hbase-hadoop-compat-1.1.3.jar:/usr/local/hbase/lib/metrics-core-2.2.0.jar
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop/lib
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-504.el6.x86_64
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:user.name=root
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:user.home=/root
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Client environment:user.dir=/root
18/08/20 06:15:24 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x130dca520x0, quorum=localhost:2181, baseZNode=/hbase
18/08/20 06:15:24 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
18/08/20 06:15:24 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
18/08/20 06:15:24 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x165541a87e60009, negotiated timeout = 40000
18/08/20 06:15:25 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table hfiletable
18/08/20 06:15:25 INFO mapreduce.HFileOutputFormat2: Configuring 1 reduce partitions to match current region count
18/08/20 06:15:25 INFO mapreduce.HFileOutputFormat2: Writing partition information to /user/root/hbase-staging/partitions_21f6d6d4-c583-4f96-a31a-ad889828c257
18/08/20 06:15:26 INFO compress.CodecPool: Got brand-new compressor [.deflate]
18/08/20 06:15:26 INFO mapreduce.HFileOutputFormat2: Incremental table hfiletable output configured.
18/08/20 06:15:26 INFO client.RMProxy: Connecting to ResourceManager at hadoop/192.168.17.108:8032
18/08/20 06:15:27 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/08/20 06:15:28 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1245)
at java.lang.Thread.join(Thread.java:1319)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
18/08/20 06:15:28 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1245)
at java.lang.Thread.join(Thread.java:1319)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
18/08/20 06:15:28 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1245)
at java.lang.Thread.join(Thread.java:1319)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:609)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:370)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:546)
18/08/20 06:15:30 INFO input.FileInputFormat: Total input paths to process : 1
18/08/20 06:15:30 INFO mapreduce.JobSubmitter: number of splits:1
18/08/20 06:15:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1534714206732_0002
18/08/20 06:15:30 INFO impl.YarnClientImpl: Submitted application application_1534714206732_0002
18/08/20 06:15:31 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1534714206732_0002/
18/08/20 06:15:31 INFO mapreduce.Job: Running job: job_1534714206732_0002
18/08/20 06:15:45 INFO mapreduce.Job: Job job_1534714206732_0002 running in uber mode : false
18/08/20 06:15:45 INFO mapreduce.Job: map 0% reduce 0%
18/08/20 06:15:53 INFO mapreduce.Job: map 100% reduce 0%
18/08/20 06:16:05 INFO mapreduce.Job: map 100% reduce 100%
18/08/20 06:16:05 INFO mapreduce.Job: Job job_1534714206732_0002 completed successfully
18/08/20 06:16:05 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=263
FILE: Number of bytes written=300567
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=183
HDFS: Number of bytes written=10115
HDFS: Number of read operations=10
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6480
Total time spent by all reduces in occupied slots (ms)=7719
Total time spent by all map tasks (ms)=6480
Total time spent by all reduce tasks (ms)=7719
Total vcore-milliseconds taken by all map tasks=6480
Total vcore-milliseconds taken by all reduce tasks=7719
Total megabyte-milliseconds taken by all map tasks=6635520
Total megabyte-milliseconds taken by all reduce tasks=7904256
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=249
Map output materialized bytes=263
Input split bytes=98
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=263
Reduce input records=4
Reduce output records=4
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=195
CPU time spent (ms)=2870
Physical memory (bytes) snapshot=363540480
Virtual memory (bytes) snapshot=4150374400
Total committed heap usage (bytes)=222429184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=85
File Output Format Counters
Bytes Written=10115
18/08/20 06:16:05 WARN mapreduce.LoadIncrementalHFiles: managed connection cannot be used for bulkload. Creating unmanaged connection.
18/08/20 06:16:05 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x70e889e9 connecting to ZooKeeper ensemble=localhost:2181
18/08/20 06:16:05 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x70e889e90x0, quorum=localhost:2181, baseZNode=/hbase
18/08/20 06:16:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
18/08/20 06:16:05 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
18/08/20 06:16:05 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x165541a87e6000a, negotiated timeout = 40000
18/08/20 06:16:05 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://hadoop:9000/Hdfs_to_Hbase/_SUCCESS
18/08/20 06:16:06 INFO hfile.CacheConfig: CacheConfig:disabled
18/08/20 06:16:06 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hadoop:9000/Hdfs_to_Hbase/fm1/f1fd7bb279c04b39a13678d98a1dad19 first=key1 last=key4
18/08/20 06:16:06 INFO hfile.CacheConfig: CacheConfig:disabled
18/08/20 06:16:06 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hadoop:9000/Hdfs_to_Hbase/fm2/879add100bdc46c5887773e9c7f13703 first=key1 last=key1
18/08/20 06:16:06 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
18/08/20 06:16:06 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x165541a87e6000a
18/08/20 06:16:06 INFO zookeeper.ZooKeeper: Session: 0x165541a87e6000a closed
18/08/20 06:16:06 INFO zookeeper.ClientCnxn: EventThread shut down
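Step 7: Log in to the HBase shell and verify that the data was inserted into the table. To make the before-and-after comparison clear, the output below continues from the earlier session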
hbase(main):006:0> truncate 'hfiletable'
Truncating 'hfiletable' table (it may take a while):
- Disabling table...
- Truncating table...
0 row(s) in 4.2430 seconds
hbase(main):007:0> scan 'hfiletable'
ROW COLUMN+CELL
0 row(s) in 0.3490 seconds
hbase(main):008:0> scan 'hfiletable'
ROW COLUMN+CELL
key1 column=fm1:col1, timestamp=1534716962335, value=value1
key1 column=fm1:col2, timestamp=1534716962335, value=value2
key1 column=fm2:col1, timestamp=1534716962335, value=value99
key4 column=fm1:col1, timestamp=1534716962335, value=value4
2 row(s) in 0.0830 seconds
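As you can see, the row keys, column families, and values were inserted into HBase exactly as laid out in the source file. Data verification is complete.
V. Common missing-jar errors, which I ran into during testing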
1 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/CompatibilityFactory
Missing jar: hbase-hadoop-compat-1.1.3.jar
Fix: in Hadoop's hadoop-env.sh file, add
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HBASE_HOME}/lib/hbase-hadoop-compat-1.1.3.jar
2 Exception in thread "main" java.lang.NoClassDefFoundError: com/yammer/metrics/core/MetricsRegistry
Missing jar: metrics-core-2.2.0.jar
Fix: in Hadoop's hadoop-env.sh file, add
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HBASE_HOME}/lib/metrics-core-2.2.0.jar
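Both errors come down to a class that cannot be found at runtime. As a small, hedged aid for checking this before submitting the job, the plain-Java sketch below asks the class loader for each of the two classes named in the errors above and reports whether it is visible. ClasspathCheckSketch is a hypothetical helper; it only tells you something useful when it is run with the same classpath the job will use.

```java
/** Illustrative only: reports whether the classes named in the errors above can be loaded. */
final class ClasspathCheckSketch {

    /** Returns true if the class can be located, without running its static initializers. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the two NoClassDefFoundError messages in this section.
        String[] required = {
            "org.apache.hadoop.hbase.CompatibilityFactory",
            "com.yammer.metrics.core.MetricsRegistry"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```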