HBase and Hadoop Integration

Differences between Hadoop, Hive, and HBase

1. Hadoop: a distributed computing framework plus a distributed file system; the former is MapReduce, the latter is HDFS. HDFS can operate independently, while MapReduce is optional and may or may not be used on top of it.

2. Hive: loosely speaking, a data warehouse. It manages data files on HDFS and supports an SQL-like query language; Hive translates each statement into MapReduce jobs and hands them to Hadoop for execution, so you get distributed computation just by writing queries. The computation here is limited to querying and analysis rather than row-level updates, inserts, and deletes. Its strength is processing historical data, i.e. offline (batch) computing, because the underlying MapReduce engine performs poorly for real-time workloads. Hive loads data files as tables (managed or external), so its SQL operations feel like working with traditional database tables.

3. HBase: loosely speaking, HBase acts like a database. A traditional database manages local data files centrally, while HBase manages distributed data files on top of HDFS and supports CRUD operations on them. In other words, HBase only uses Hadoop's HDFS to persist its data files (HFiles); it has no dependency on MapReduce. HBase's strength is real-time access: data is written directly into HBase, and clients read it directly through the HBase API. Because it is a NoSQL, column-oriented store, lookup performance is high enough for big-data scenarios; this is what distinguishes it from MapReduce.

Summary:
Hadoop is the foundation of both Hive and HBase. Hive depends on all of Hadoop, while HBase depends only on Hadoop's HDFS module.
Hive suits offline data analysis. It operates on ordinary-format files (e.g. plain log files) managed by Hadoop and supports an SQL-like language, which is more convenient than writing MapReduce jobs in Java. Its role is a data warehouse: storing and analyzing historical data.
HBase suits real-time access. It uses a NoSQL column-oriented structure and operates on its own self-generated file format (HFile), with Hadoop managing the underlying data files. Its role is that of a database, i.e. a DBMS.

1. Integrating HBase with Hadoop

1) Copy the hbase-site.xml configuration file into the $HADOOP_HOME/etc/hadoop directory.

2) Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and append the following line at the end:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/home/sheng/app/hbase-1.3.2/lib/*

3) Distribute the modified hbase-site.xml and hadoop-env.sh to every Hadoop node.
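The three steps above can be sketched as a short script. $HADOOP_HOME is assumed to be set; the HBase install path comes from this setup, and the node hostnames (node2, node3) are invented examples — adjust them to your cluster:

```shell
# Sketch of the integration steps above; hostnames node2/node3 are examples.
HBASE_HOME=/home/sheng/app/hbase-1.3.2

# 1) copy hbase-site.xml into Hadoop's configuration directory
cp "$HBASE_HOME/conf/hbase-site.xml" "$HADOOP_HOME/etc/hadoop/"

# 2) append the HBase jars to Hadoop's classpath
echo "export HADOOP_CLASSPATH=\$HADOOP_CLASSPATH:$HBASE_HOME/lib/*" \
  >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh"

# 3) distribute both files to every Hadoop node
for node in node2 node3; do
  scp "$HADOOP_HOME/etc/hadoop/hbase-site.xml" \
      "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" \
      "$node:$HADOOP_HOME/etc/hadoop/"
done
```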

Integration test: count the rows of an HBase table (note that rowcounter takes an HBase table name as its argument, not an HDFS file path):

hadoop jar $HBASE_HOME/lib/hbase-server-1.3.2.jar rowcounter <table_name>

If the job runs and reports the table's row count, the integration was successful.

2. Importing HDFS Data into HBase
Bulk import: the idea is to first generate HBase HFile files from the original data files, and then load those HFiles into HBase.

importtsv generates HBase's physical storage files (HStore files), i.e. HFile files.
Syntax:

hadoop jar $HBASE_HOME/lib/hbase-server-1.3.2.jar importtsv
-Dimporttsv.bulk.output=<output path for the generated HFile files>
-Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:singer,info:gender,info:ryghme,info:terminal
-Dcreate.table=yes
<table name>  <HDFS input path>

Notes: -Dcreate.table=yes means the target table is created automatically.

    -Dimporttsv.bulk.output specifies the output path for the generated HFile files; this directory is created automatically.
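importtsv expects tab-separated input whose fields match the -Dimporttsv.columns spec above. A minimal sample input file can be sketched like this (the file name and all field values are invented for illustration):

```shell
# Build a small tab-separated file matching the column spec:
# HBASE_ROW_KEY, info:name, info:singer, info:gender, info:ryghme, info:terminal
printf '0001\tsong-a\tsinger-a\tmale\tpop\tweb\n'    >  music.tsv
printf '0002\tsong-b\tsinger-b\tfemale\trock\tapp\n' >> music.tsv

# Upload it to the HDFS input directory used below (requires a running cluster):
# hdfs dfs -mkdir -p /user/hbase
# hdfs dfs -put music.tsv /user/hbase/
```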

1) Generate the HFile files:

hadoop jar $HBASE_HOME/lib/hbase-server-1.3.2.jar  importtsv -Dimporttsv.bulk.output=/user/hbase_hfile -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:singer,info:gender,info:ryghme,info:terminal -Dcreate.table=yes music4  /user/hbase

2) completebulkload: load the generated HFile files into HBase.

Syntax:

hadoop jar $HBASE_HOME/lib/hbase-server-1.3.2.jar completebulkload <HDFS path of the HFiles> <table name>

hadoop jar  $HBASE_HOME/lib/hbase-server-1.3.2.jar  completebulkload /user/ambow/temp4 music4
Origin blog.csdn.net/weixin_43599377/article/details/104514464