h2database BTree design implementation and query optimization thinking | JD Cloud technical team

h2database is an open source database written in Java that is compatible with ANSI SQL-89. It implements both a conventional BTree-based storage engine and a log-structured storage engine. It is feature-rich (deadlock detection, transactions, MVCC, operations and maintenance tools, and more), which makes it a very good case study for learning how databases work.

This article combines theory with practice: by walking through the design and implementation of the BTree index, it aims to give a better understanding of database indexes and the principles behind optimizing them.

BTree implementation class

h2database uses the MVStore storage engine by default. To use the BTree-based storage engine, you must specify it explicitly (see the jdbcUrl in the sample code below).

The following are the key classes related to the general storage engine (BTree structure).

  • org.h2.table.RegularTable
  • org.h2.index.PageBtreeIndex  (the implementation of the SQL index itself)
  • org.h2.store.PageStore  (storage layer, connecting the logic layer to the file system)

Detailed descriptions of the BTree data structure are easy to find online, so it is not covered in depth here.

PageStore deserves special mention: the caching, disk reads, and undo logging that matter for data queries and their optimization are all handled by PageStore. Detailed documentation and a complete implementation can be found in the h2database source (org.h2.store.PageStore).

Call chain for adding a BTree index entry

The call chain for adding index data is listed below. Index deletion and lookup go through similar chains, so it is a convenient reference when debugging.

  1. org.h2.command.dml.Insert#insertRows ( an INSERT statement triggers the addition of data and index entries )
  2. org.h2.table.RegularTable#addRow ( takes the prepared Row and performs the addition )
  3. org.h2.index.PageBtreeIndex#add ( the logic layer adds the index entry )
  4. org.h2.index.PageDataIndex#addTry ( the storage layer adds the index entry )
  5. org.h2.index.PageDataLeaf#addRowTry ( the storage-layer implementation of the addition )
// Sample code
// CREATE TABLE city (id INT(10) NOT NULL AUTO_INCREMENT, code VARCHAR(40) NOT NULL, name VARCHAR(40) NOT NULL);
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// The wrapper class name is illustrative only.
public class BTreeIndexSample {
    public static void main(String[] args) throws SQLException {
        // Note: MV_STORE=false selects the BTree-based PageStore engine; MVStore is the default storage.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:~/test;MV_STORE=false", "sa", "");
             Statement statement = conn.createStatement()) {
            // CREATE INDEX IDX_NAME ON city(code); inserting data triggers an addition to the BTree index.
            // -- The index is instantiated as: IDX_NAME:16:org.h2.index.PageBtreeIndex
            statement.executeUpdate("INSERT INTO city(code,name) values('cch','长春')");
        }
    }
}

Code Insight

Building on the sample code above, this section follows the process of adding an index entry to learn the characteristics of the BTree index and the precautions for using it. Analyzing index operations from the bottom layer up gives a deeper understanding of how SQL indexes are used and optimized.

Adding data to the table

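public void addRow(Session session, Row row) {
    // MVCC control: record and compare the id of the current transaction
    lastModificationId = database.getNextModificationDataId();
    if (database.isMultiVersion()) {
        row.setSessionId(session.getId());
    }
    int i = 0;
    try {
        // By design, indexes always contains one clustered index (h2 calls it the scan index). ①
        for (int size = indexes.size(); i < size; i++) {
            Index index = indexes.get(i);
            index.add(session, row);
            checkRowCount(session, index, 1);
        }
        // Track the number of rows in the table; it is decremented again if the transaction rolls back.
        rowCount++;
    } catch (Throwable e) {
        try {
            while (--i >= 0) {
                Index index = indexes.get(i);
                // Conversely, if any exception occurs, remove the index entries that were already added.
                index.remove(session, row);
            }
        } catch (DbException e2) {
            // Undoing the partial insert failed; propagate that failure (abridged excerpt).
            throw e2;
        }
        // Convert the original failure to a DbException before rethrowing it.
        DbException de = DbException.convert(e);
        throw de;
    }
}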

① As with InnoDB data storage in MySQL, a RegularTable has exactly one clustered index. It uses the primary key (or an implicit auto-increment id) as the key and stores the complete row data.
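
The add-then-undo pattern in addRow can be tried in isolation. The following is a minimal sketch, not h2 code: the class MultiIndexTable and its two TreeMap-based indexes are hypothetical stand-ins, and the secondary index on code is assumed to be unique so that a failure is easy to provoke.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a table with one clustered index and one unique secondary index.
class MultiIndexTable {
    // Clustered ("scan") index: primary key -> full row {code, name}.
    final TreeMap<Long, String[]> clustered = new TreeMap<>();
    // Unique secondary index (an assumption of this sketch): code -> primary key.
    final TreeMap<String, Long> byCode = new TreeMap<>();
    long rowCount = 0;

    /** Adds a row to every index; undoes the partial work if any index rejects it. */
    void addRow(long id, String code, String name) {
        List<Runnable> undo = new ArrayList<>();
        try {
            if (clustered.putIfAbsent(id, new String[] {code, name}) != null) {
                throw new IllegalStateException("duplicate primary key: " + id);
            }
            undo.add(() -> clustered.remove(id));
            if (byCode.putIfAbsent(code, id) != null) {
                throw new IllegalStateException("duplicate code: " + code);
            }
            undo.add(() -> byCode.remove(code));
            rowCount++;
        } catch (RuntimeException e) {
            // Remove the entries added so far, newest first, then rethrow.
            for (int i = undo.size() - 1; i >= 0; i--) {
                undo.get(i).run();
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        MultiIndexTable t = new MultiIndexTable();
        t.addRow(1, "cch", "Changchun");
        try {
            t.addRow(2, "cch", "Duplicate code"); // rejected by the secondary index
        } catch (IllegalStateException e) {
            System.out.println("rolled back: " + e.getMessage());
        }
        System.out.println(t.clustered.keySet() + " rows=" + t.rowCount);
    }
}
```

After the failed second insert, the clustered index holds only key 1: the entry for key 2 was added first and then removed again when the secondary index rejected the row.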

Adding data to the clustered index

  • The key in the index is what the query is searching for, while its value can be one of two things: it can be the actual row (document, vertex), or it can be a reference to a row stored elsewhere. In the latter case, the place where the rows are stored is called a heap file, and the data is stored there in no particular order relative to the index.
  • In some situations, the extra hop from the index to the heap file is too much of a performance penalty for reads, so it can be desirable to store the indexed row directly within the index. This is called a clustered index.
  • A lookup by primary key uniquely identifies the row and retrieves it directly, so a clustered index lookup needs one scan fewer than a lookup through a non-primary-key index.
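public void add(Session session, Row row) {
    // Set only when the key is auto-generated; such a key may be retried on a collision.
    boolean retry = false;
    // Generate the index key ②
    if (mainIndexColumn != -1) {
        // If the primary key is not a long, org.h2.value.Value#convertTo tries to convert it to long
        row.setKey(row.getValue(mainIndexColumn).getLong());
    } else {
        if (row.getKey() == 0) {
            row.setKey((int) ++lastKey);
            retry = true;
        }
    }

    // Add the row to the clustered index ③
    while (true) {
        try {
            addTry(session, row);
            break;
        } catch (DbException e) {
            if (!retry) {
                throw getNewDuplicateKeyException();
            }
            // Abridged excerpt: an auto-generated key must be replaced before looping again.
            // Simplified here (an assumption, not the h2 source) to the next key.
            row.setKey(row.getKey() + 1);
        }
    }
}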

② When the table has a primary key, the primary key value of the current row is read and converted to a long. When no primary key is specified, a unique key is generated by auto-incrementing lastKey, an attribute of the clustered index.

Duplicate data is reported only when a primary key is specified (that is, when the index key is duplicated); a key auto-incremented from lastKey does not run into duplicate values.
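
The key rule described in ② can be illustrated with a small model. This is a sketch under assumptions, not the h2 implementation: ClusteredKeyDemo and its TreeMap are hypothetical, and passing 0 as the key stands for "no primary key specified".

```java
import java.util.TreeMap;

// Hypothetical model of key assignment in a clustered index.
class ClusteredKeyDemo {
    final TreeMap<Long, String> rows = new TreeMap<>();
    long lastKey = 0;

    /** key == 0 means "no primary key given": take the next auto-incremented key. */
    long add(long key, String row) {
        boolean autoKey = (key == 0);
        if (autoKey) {
            key = ++lastKey;
        }
        // Only an explicit key is reported as a duplicate; an auto key simply moves on.
        while (rows.containsKey(key)) {
            if (!autoKey) {
                throw new IllegalArgumentException("duplicate key: " + key);
            }
            key = ++lastKey;
        }
        rows.put(key, row);
        lastKey = Math.max(lastKey, key);
        return key;
    }

    public static void main(String[] args) {
        ClusteredKeyDemo index = new ClusteredKeyDemo();
        System.out.println(index.add(0, "a"));  // auto key
        System.out.println(index.add(10, "b")); // explicit key
        System.out.println(index.add(0, "c"));  // auto key continues after the largest key
    }
}
```

Calling add(10, ...) a second time throws, while repeated calls with 0 always succeed, which mirrors the statement that only explicitly specified primary keys are checked for duplicates.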

③ The clustered index PageDataIndex locates the position for the key by traversing the BTree structure and stores the Row in the page in primary key (key) order. The non-clustered index PageBtreeIndex follows the same flow.

This raises three questions:

  1. How is the position of the key found, that is, how is the position in the BTree calculated?
  2. How are the offsets of the Row (the actual data) within the storage page calculated?
  3. How is the Row written to disk, and when?

Index data access implementation

  • B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time.
  • Each page can be identified using an address or location, which allows one page to refer to another, similar to a pointer but on disk instead of in memory. (In h2database these correspond to PageBtreeLeaf and PageBtreeNode.)
  • Unlike PageDataIndex, PageBtreeIndex stores entries in the order of column.value. Adding an entry means comparing against column.value to find the subscript x in the block's offsets array; what remains is to calculate the offset of the data and store it at subscript x.
/**
 * Find an entry. Binary-searches for the position of compare; that position stores the offset of compare.
 * org.h2.index.PageBtree#find(org.h2.result.SearchRow, boolean, boolean, boolean)
 * @param compare the row to look for; in the example above, compare.value = 'cch'
 * @return the index of the found row
 */
int find(SearchRow compare, boolean bigger, boolean add, boolean compareKeys) {
    // Number of entries currently held by the page ④
    int l = 0, r = entryCount;
    int comp = 1;
    while (l < r) {
        int i = (l + r) >>> 1;
        // Read the row stored at offsets[i] ⑤
        SearchRow row = getRow(i);
        // Compare the two rows ⑥
        comp = index.compareRows(row, compare);
        if (comp == 0) {
            // Unique index check ⑦
            if (add && index.indexType.isUnique()) {
                if (!index.containsNullAndAllowMultipleNull(compare)) {
                    throw index.getDuplicateKeyException(compare.toString());
                }
            }
        }
        if (comp > 0 || (!bigger && comp == 0)) {
            r = i;
        } else {
            l = i + 1;
        }
    }
    return l;
}

④ The entryCount of each block (page) is initialized in one of two ways: when the block is allocated and its instance is created, or when PageStore reads the block from the file and parses it from the page data.

⑤ This is a deserialization step: the data is read from the page's bytes (a 4 KB byte array) according to the storage format and instantiated as a row object. Reference: org.h2.index.PageBtreeIndex#readRow(org.h2.store.Data, int, boolean, boolean).

⑥ All data types support ordering comparison; for the specific rules, see org.h2.index.BaseIndex#compareRows.

⑦ If the data contains duplicate key values, a unique index, UNIQUE constraint, or PRIMARY KEY constraint cannot be created on it. h2database offers compatibility modes for several databases, and they treat NULL differently: in MySQL mode NULL is not considered a duplicate, so multiple NULLs are allowed, whereas in MSSQLServer mode NULL counts as a value and may occur only once.
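
The binary search in find can be run on its own to see what the bigger flag changes. This is a minimal sketch that keeps the loop from the excerpt but replaces rows with plain strings; the class PageFindDemo and the sample codes are hypothetical.

```java
// Hypothetical stand-alone version of the binary search, over a sorted String array.
class PageFindDemo {
    /**
     * Returns the insertion point of key in sorted[0..count).
     * bigger == false: first index whose entry is >= key.
     * bigger == true:  first index whose entry is > key.
     */
    static int find(String[] sorted, int count, String key, boolean bigger) {
        int l = 0, r = count;
        while (l < r) {
            int i = (l + r) >>> 1;
            int comp = sorted[i].compareTo(key);
            if (comp > 0 || (!bigger && comp == 0)) {
                r = i;
            } else {
                l = i + 1;
            }
        }
        return l;
    }

    public static void main(String[] args) {
        String[] codes = {"bj", "cch", "sh", "sz"};
        System.out.println(find(codes, codes.length, "cch", false)); // 1: position of the match
        System.out.println(find(codes, codes.length, "cch", true));  // 2: position after the match
        System.out.println(find(codes, codes.length, "gz", false));  // 2: where a missing key would go
    }
}
```

With bigger set to false, an equal entry narrows the search to the left, so the result is the slot of the first match; this is the variant the insertion path calls.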

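private int addRow(SearchRow row, boolean tryOnly) {
    // Length in bytes occupied by the row
    int rowLength = index.getRowSize(data, row, onlyPosition);
    // Block size, 4 KB by default
    int pageSize = index.getPageStore().getPageSize();
    // Offset up to which the block file is still free
    int last = entryCount == 0 ? pageSize : offsets[entryCount - 1];
    if (last - rowLength < start + OFFSET_LENGTH) {
        // Check and try to allocate space; this is where splitting a page to grow the B-tree comes in ⑧
    }
    // The undo log makes the B-tree more reliable ⑨
    index.getPageStore().logUndo(this, data);
    if (!optimizeUpdate) {
        readAllRows();
    }

    int x = find(row, false, true, true);
    // Offset of the new entry: directly below its predecessor, or at the end of the page.
    // (This line is missing from the excerpt and is reconstructed here as an assumption.)
    int offset = (x == 0 ? pageSize : offsets[x - 1]) - rowLength;
    // Insert the new entry's offset into the offsets array; System.arraycopy moves the entries from x to x + 1.
    offsets = insert(offsets, entryCount, x, offset);
    // Recompute the offsets that follow; data is written to disk according to offsets.
    add(offsets, x + 1, entryCount + 1, -rowLength);
    // Insert the actual row data
    rows = insert(rows, entryCount, x, row);
    entryCount++;
    // Mark the page as changed: page.setChanged(true);
    index.getPageStore().update(this);
    return -1;
}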

⑧ To add a new key, you need to find the page whose range encompasses the new key and add it to that page. If there isn't enough free space in the page to accommodate the new key, the page is split into two half-full pages, and the parent page is updated to account for the new subdivision of key ranges.
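
The split described in ⑧ can be sketched as follows. This is a simplified model, not h2's page layout: LeafSplitDemo is hypothetical, capacity is counted in keys rather than bytes, and the new page is returned so that a parent could register its first key as the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical leaf page holding at most CAPACITY sorted keys.
class LeafSplitDemo {
    static final int CAPACITY = 4;
    final List<Integer> keys = new ArrayList<>();

    /**
     * Inserts key in sorted order. If the page is already full it is split first:
     * the upper half moves to a new page, which is returned so that the parent
     * can register it (null when no split was needed).
     */
    LeafSplitDemo insert(int key) {
        LeafSplitDemo right = null;
        if (keys.size() == CAPACITY) {
            int mid = keys.size() / 2;
            right = new LeafSplitDemo();
            right.keys.addAll(keys.subList(mid, keys.size()));
            keys.subList(mid, keys.size()).clear();
        }
        // After a split, the key goes to whichever half covers its range.
        LeafSplitDemo target = (right != null && key >= right.keys.get(0)) ? right : this;
        int pos = 0;
        while (pos < target.keys.size() && target.keys.get(pos) < key) {
            pos++;
        }
        target.keys.add(pos, key);
        return right;
    }

    public static void main(String[] args) {
        LeafSplitDemo page = new LeafSplitDemo();
        for (int k : new int[] {10, 20, 30, 40}) {
            page.insert(k);
        }
        LeafSplitDemo right = page.insert(25); // the page is full, so it splits
        System.out.println(page.keys + " | " + right.keys);
    }
}
```

Inserting 25 into the full page leaves 10, 20, 25 on the left and 30, 40 on the right; a parent page would then record 30 as the boundary between the two key ranges.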

⑨ To make the database resilient to crashes, B-tree implementations commonly include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after a crash, this log is used to restore the B-tree to a consistent state.
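
Returning to the offset bookkeeping in addRow above, the way the offsets array evolves can be reproduced with a small model. This is a sketch under assumptions: PageOffsetsDemo is hypothetical, every row has a fixed length of 6 bytes, and the rule that a new row sits directly below its predecessor (or at the end of a 4096-byte page) is an assumption about how the offset is derived.

```java
import java.util.Arrays;

// Hypothetical model of the offsets array of one page; rows are packed from the page end.
class PageOffsetsDemo {
    static final int PAGE_SIZE = 4096;
    int[] offsets = new int[0];
    int entryCount = 0;

    /** Inserts a row of rowLength bytes at slot x and returns its offset in the page. */
    int addAt(int x, int rowLength) {
        // The new row sits just below the row in slot x - 1 (or at the page end).
        int offset = (x == 0 ? PAGE_SIZE : offsets[x - 1]) - rowLength;
        int[] grown = new int[entryCount + 1];
        System.arraycopy(offsets, 0, grown, 0, x);
        grown[x] = offset;
        System.arraycopy(offsets, x, grown, x + 1, entryCount - x);
        // Rows after slot x move down by rowLength bytes.
        for (int i = x + 1; i <= entryCount; i++) {
            grown[i] -= rowLength;
        }
        offsets = grown;
        entryCount++;
        return offset;
    }

    public static void main(String[] args) {
        PageOffsetsDemo page = new PageOffsetsDemo();
        page.addAt(0, 6); // first row
        page.addAt(1, 6); // appended after it
        page.addAt(0, 6); // inserted in front, shifting the other two
        System.out.println(Arrays.toString(page.offsets)); // [4090, 4084, 4078]
    }
}
```

Whatever slot a row is inserted at, the array stays in descending order, so slot order and physical order within the page always agree.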

Practice summary

  • Query optimization is essentially about reducing the amount of data accessed and the amount of disk IO.

  • If all the data is cached in memory, it becomes a matter of reducing the amount of computation and CPU usage.

  • Saying that the index is ordered really means that the offsets in the block file are kept in an array ordered by key. Notably, in h2database the elements of the offsets array are themselves ordered (for example: [4090, 4084, 4078, 4072, 4066, 4060, 4054, 4048, 4042]), which should make sequential disk reads easier and help prevent disk fragmentation.

  • In theory, a clustered index scan incurs more IO than a BTree index scan, because a BTree index block holds more entries and the index therefore occupies fewer blocks. If the table's columns are small enough, the clustered index scan is the more efficient one.

    Take care when designing a table: keep the field length of each column as short as possible to save page space.

  • Use covering-index queries where reasonable to avoid back-to-table lookups. In the example above, select id from city where code = 'cch' gets its result from a single scan of the BTree index. By contrast, select name from city where code = 'cch' must scan the BTree index once to get the index key (the primary key) and then scan the clustered index by that key to get the result.

  • Use the cache sensibly to minimize the impact of disk IO. For example, configure the cache size appropriately and distinguish between queries on hot and cold data.
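
The difference between a covered query and a back-to-table query can be made concrete with an in-memory model. This is only a sketch: CoveringIndexDemo is hypothetical, two TreeMaps stand in for the clustered index and the BTree index on code, and code is assumed to be unique to keep the maps simple.

```java
import java.util.TreeMap;

// Hypothetical in-memory model of the city table's two indexes.
class CoveringIndexDemo {
    // Clustered index: id -> full row {code, name}.
    final TreeMap<Long, String[]> clustered = new TreeMap<>();
    // Secondary BTree index on code: code -> id.
    final TreeMap<String, Long> idxCode = new TreeMap<>();
    int clusteredLookups = 0;

    void insert(long id, String code, String name) {
        clustered.put(id, new String[] {code, name});
        idxCode.put(code, id);
    }

    /** select id from city where code = ? : answered by the secondary index alone. */
    Long selectId(String code) {
        return idxCode.get(code);
    }

    /** select name from city where code = ? : needs a second lookup in the clustered index. */
    String selectName(String code) {
        Long id = idxCode.get(code);
        if (id == null) {
            return null;
        }
        clusteredLookups++; // the "back to table" step
        return clustered.get(id)[1];
    }

    public static void main(String[] args) {
        CoveringIndexDemo city = new CoveringIndexDemo();
        city.insert(1, "cch", "Changchun");
        System.out.println(city.selectId("cch") + ", lookups=" + city.clusteredLookups);
        System.out.println(city.selectName("cch") + ", lookups=" + city.clusteredLookups);
    }
}
```

The counter stays at 0 for the id query and rises to 1 for the name query, which is the extra clustered index access that a covering index avoids.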

Other knowledge points

  • The number of references to child pages in one page of the B-tree is called the branching factor. A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB of data.
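
The 256 TB figure follows from multiplying the branching factor once per level and then by the page size. A short check of that arithmetic, assuming 4 KB means 4096 bytes and 1 TB means 10^12 bytes (the class and method names are illustrative):

```java
// Capacity of a B-tree under the stated assumptions.
class BTreeCapacity {
    /** Bytes addressable by a tree with the given number of levels, branching factor, and page size. */
    static long capacityBytes(int levels, long branchingFactor, long pageSizeBytes) {
        long pages = 1;
        for (int i = 0; i < levels; i++) {
            pages = Math.multiplyExact(pages, branchingFactor);
        }
        return Math.multiplyExact(pages, pageSizeBytes);
    }

    public static void main(String[] args) {
        long bytes = capacityBytes(4, 500, 4096);
        System.out.println(bytes);                              // 256000000000000
        System.out.println(bytes / 1_000_000_000_000L + " TB"); // 256 TB
    }
}
```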

reference

ddia/ch3.md B-tree

Author: JD Logistics Yang Pan

Content source: JD Cloud developer community

Origin blog.csdn.net/JDDTechTalk/article/details/131396216