Hadoop -- HBase

HBase is a distributed, scalable, column-oriented (it can hold millions of columns), highly reliable NoSQL database that supports real-time reads and writes.

HBase uses Hadoop's HDFS as its file storage system, uses MapReduce to process the massive data stored in HBase, and uses ZooKeeper as its distributed coordination service.

HBase basic operation commands: 

# Enter the HBase shell client
[zsm@hadoop102 hbase-1.3.1]$ bin/hbase shell

# List all tables
hbase(main):002:0> list
TABLE
stu
stu2
stu3
stu4

Create table: 

create 'tableName', 'cf1', 'cf2'

Create a table with two column families cf1 and cf2

Insert data: 

put 'tableName', 'row1', 'cf1:column1', 'value1'

Inserts the value [value1] into column [column1] of column family [cf1], in row [row1] of the table

Update data:

put 'tableName', 'row1', 'cf1:column1', 'value2'

Changes the value of column [column1] under column family [cf1] in row [row1] to [value2]. In HBase an update is just another put; the new value is written as a new version of the same cell.
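If the column family keeps more than one version (VERSIONS > 1), the previous value is still retrievable. A minimal sketch, reusing the table and column names from the examples above:

# Read up to two versions of the cell; value2 (the newest) and value1 (the older
# version) are each returned with their own timestamps
get 'tableName', 'row1', {COLUMN => 'cf1:column1', VERSIONS => 2}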

View data: 

# Get a specific row
get 'tableName', 'row1'

# Get a specific column of a specific column family in a specific row
get 'tableName', 'row1', 'cf1:column1'

Full table scan data: 

# Full table scan
scan 'tableName'

# Range scan, starting from row1 [, stopping before row2 -- STOPROW is exclusive]
scan 'tableName', {STARTROW => 'row1' [, STOPROW => 'row2']}

Conditional filter: 

# Single column value filter (SingleColumnValueFilter)
scan 'mytable', {FILTER => "SingleColumnValueFilter('cf1', 'column1', =, 'value1')"}

# Combine multiple filters with AND / OR
scan 'mytable', {FILTER => "SingleColumnValueFilter('cf1', 'column1', =, 'value1') AND SingleColumnValueFilter('cf2', 'column2', >=, '100')"}

# Page filter (limit the number of rows returned)
scan 'mytable', {FILTER => "PageFilter(10)"}

# Row key filter (RowFilter)
scan 'mytable', {FILTER => "RowFilter(>=, 'binary:row1')"}

Delete tables and data:

# Delete a table (it must be disabled first)
disable 'tableName'
drop 'tableName'

# Truncate (empty) a table
truncate 'tableName'

# Delete an entire row (use deleteall rather than delete for a whole row)
deleteall 'tableName', 'row1'

# Delete a specific column (cell) in a row
delete 'tableName', 'row1', 'cf1:column1'

HBase data model

The data model of HBase looks similar to that of a relational database: data is stored in tables with rows and columns. But from the perspective of HBase's underlying physical storage structure (key-value), HBase is better described as a multi-dimensional, sorted map from (row key, column family:qualifier, timestamp) to value. It is a column-oriented data store that can accommodate millions of columns.
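As an illustration of this key-value view (a sketch only; the table 'stu', the column family 'info', and the timestamps are hypothetical), every cell returned by a scan is addressed by row key, column (family:qualifier), and timestamp:

hbase(main):003:0> scan 'stu'
ROW                     COLUMN+CELL
 1001                   column=info:name, timestamp=1690000000000, value=zhangsan
 1001                   column=info:age, timestamp=1690000000001, value=18
 1002                   column=info:name, timestamp=1690000000002, value=lisi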

HBase physical storage structure:

The main components of HBase's data model:

  1. Table: Data in HBase is stored in tables, and each table has a unique name. A table consists of rows and columns, storing data in the form of key-value pairs.

  2. Row: In HBase, each row is identified by a unique row key. The row key is a variable-length byte array, sorted lexicographically. Row keys are the main way of data access and query in HBase.

  3. Column Family: The columns in a table are organized according to the column family, which defines a group of columns with similar properties. Each column family has a unique name. Columns of different column families are stored separately in physical storage, so that different storage and compression strategies can be set for data of different column families.

  4. Column Qualifier: The column qualifier identifies each column within a column family and is unique within that column family. For example, if the column family is "user", the column qualifiers could be "age", "name", etc.

  5. Cell: A cell is uniquely determined by row key, column family, column qualifier, and timestamp, and stores a value as an uninterpreted byte array. The data in a table consists of many such cells.

  6. Version: HBase supports storing multiple versions of data. Each cell can hold several versions of a value, each with its own timestamp. Version numbers enable historical queries and version control of the data.

  7. Namespace: A namespace is a logical grouping of tables, comparable to a database or schema in a relational database. It provides a way to logically isolate and manage tables.
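A short shell sketch tying these concepts together (the namespace ns1, the table user, and the sample values are hypothetical; the commands themselves are standard HBase shell):

# Namespace: a logical grouping of tables
create_namespace 'ns1'

# Table 'user' in namespace 'ns1', with a column family 'info' that keeps up to 3 versions
create 'ns1:user', {NAME => 'info', VERSIONS => 3}

# Row key '1001'; column qualifiers 'name' and 'age' under column family 'info'
put 'ns1:user', '1001', 'info:name', 'zhangsan'
put 'ns1:user', '1001', 'info:age', '18'
put 'ns1:user', '1001', 'info:age', '19'

# Read back up to 3 versions of the 'info:age' cell, each with its own timestamp
get 'ns1:user', '1001', {COLUMN => 'info:age', VERSIONS => 3}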

HBase architecture principle:

[Figure: HBase architecture diagram]

  • Client: Provides the interface for accessing HBase and maintains a cache to speed up access to HBase
  • ZooKeeper:
    • Ensures that there is only one active Master in the cluster at any time
    • Stores the addressing entry of the hbase:meta region, through which the locations of all user regions are looked up
    • Monitors the online/offline status of RegionServers in real time and notifies the Master
    • Stores HBase schema and table metadata
  • Master
    • Assigns regions to RegionServers
    • Responsible for load balancing across RegionServers
    • Detects failed RegionServers and reassigns their regions
    • Handles user DDL operations (create, delete, alter) on tables
  • RegionServer
    • Stores the actual data of HBase
    • Flushes the in-memory cache (MemStore) to HDFS
    • Maintains regions and handles I/O requests for those regions
    • Splits regions that grow too large during operation
  • Region: The smallest unit of distributed storage and load balancing in HBase. A region contains one or more Stores, and each Store holds one column family.
  • Store: Consists of a MemStore in memory and StoreFiles on disk. When a client reads data, HBase first looks in the MemStore and, if the data is not found there, then in the StoreFiles.
    • MemStore: Lives in memory and holds modified data as KeyValues. When the MemStore reaches a size threshold (128 MB by default, configurable via hbase.hregion.memstore.flush.size), it is flushed to a file.
    • StoreFile: When the data in the MemStore is flushed to disk, it becomes a StoreFile. Internally, StoreFiles are saved in the HFile format.
      • When the number of StoreFiles grows past a threshold, the system merges them (minor and major compaction); during a major compaction, versions are merged and deleted cells are physically removed, producing a larger StoreFile
      • When the total size of all StoreFiles in a region exceeds a threshold, the region is split into two, and the HMaster assigns the new regions to the appropriate RegionServers to achieve load balancing
    • HFile: The storage format for KeyValue data in HBase; it is a binary file format stored on Hadoop (HDFS).
  • HLog: The WAL (write-ahead log), used for disaster recovery. The HLog records all changes to the data; if a RegionServer goes down, the data can be recovered by replaying the log.
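The flush, compaction, and split behavior described above can also be triggered manually from the HBase shell (a sketch; 'tableName' is a placeholder):

# Force the MemStores of a table to be flushed to StoreFiles (HFiles) on HDFS
flush 'tableName'

# Trigger a minor compaction or a major compaction of the table's StoreFiles
compact 'tableName'
major_compact 'tableName'

# Manually split a region of the table (normally this happens automatically when a region grows too large)
split 'tableName'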
