3. Analysis of ClickHouse's MergeTree principle


6.1 MergeTree creation method

When MergeTree writes data, it always writes to disk in the form of immutable data parts (fragments). To keep the number of parts from growing unbounded, ClickHouse periodically merges them through a background process: parts belonging to the same partition are merged into a new, larger part.

MergeTree supports primary key indexes, data partitioning, data replication, and data sampling, and it supports ALTER operations.

Creation syntax:

```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]
```

PARTITION BY
    Partition key: defines how the table's data is partitioned; without it, all data goes into a single `all` partition.
    It may be a single column, a tuple of columns, or a column expression.
    Sensible partitioning effectively reduces the range of data files scanned by a query.

ORDER BY
    Sorting key: defines how data is sorted within a data part; by default it is the same as the primary key.
    It may be a single column or a tuple of columns. With ORDER BY (counterID, EventDate), for example, rows within a part are sorted first by counterID, and rows sharing the same counterID are then sorted by EventDate.

PRIMARY KEY
    Primary key: a first-level index is generated from these fields to accelerate queries; by default the primary key is the same as ORDER BY.

SAMPLE BY
    Sampling expression: declares how the data is sampled.

SETTINGS: index_granularity
    The index granularity of the MergeTree table; the default is 8192 (one index entry is generated every 8192 rows).

SETTINGS: index_granularity_bytes
    Before 19.11, ClickHouse supported only a fixed index interval, controlled by index_granularity (default 8192).
    Newer versions support an adaptive interval: the interval is sized dynamically according to the volume of each written batch, controlled by index_granularity_bytes (default 10 MB = 10*1024*1024; set it to 0 to disable the adaptive behavior).

SETTINGS: enable_mixed_granularity_parts
    Whether the adaptive index interval is enabled; on by default.

SETTINGS: merge_with_ttl_timeout
    Controls the data TTL feature.

SETTINGS: storage_policy
    Multi-path storage policies.



```sql
CREATE TABLE test20 (ID String, Price Int32, Val Float64, EventTime Date)
ENGINE = MergeTree() PARTITION BY toYYYYMM(EventTime) ORDER BY ID;

CREATE TABLE test (id UInt8, name String, age UInt8, shijian Date)
ENGINE = MergeTree() PARTITION BY toYYYYMM(shijian) ORDER BY id;
```

6.2 MergeTree storage structure

The MergeTree table engine stores data physically: data is saved to disk in the form of partition directories.


```
[root@postgresql test 08:51:37]# tree test20
test20
├── 202005_1_3_1                partition directory
│   ├── checksums.txt           checksum file: stores the sizes and size hashes of the other files, used to verify data integrity
│   ├── columns.txt             column information file: stores column names and data types in plain text
│   ├── count.txt               count file: records in plain text the total number of rows in this partition directory
│   ├── EventTime.bin
│   ├── EventTime.mrk2
│   ├── ID.bin                  data file: stores one column's data in compressed form (LZ4 by default)
│   ├── ID.mrk2
│   ├── minmax_EventTime.idx    partition-key index file: records the minimum and maximum raw values of the partition field within this partition
│   ├── partition.dat           partition key file (present when PARTITION BY is used): stores the final value produced by the partition expression for this partition
│   ├── Price.bin
│   ├── Price.mrk2              column mark file (adaptive index interval): binary file storing the offsets of the data in the corresponding .bin file
│   ├── primary.idx             first-level index file, in binary format; a MergeTree table can declare the primary index only once (via PRIMARY KEY or ORDER BY)
│   ├── Val.bin
│   └── Val.mrk2
├── detached
└── format_version.txt

2 directories, 15 files
```

6.3 Data partition

Data partitioning: on a single node, data is split vertically by partition.

Data sharding: across a ClickHouse (CK) cluster, data is split horizontally.

6.3.1 Data Partitioning Rules

MergeTree's partitioning rules are determined by the partition ID: each data partition has an ID, and that ID is derived from the value of the partition key.

Four rules generate the partition ID:

1. No partition key: the partition ID defaults to all.

2. Integer key: the integer's string form is used directly as the partition ID.

3. Date key: the value formatted as YYYYMMDD is used as the partition ID.

4. Other types: if the partition key is neither an integer nor a date (e.g. String or Float), a 128-bit hash of the value is used as the partition ID.

Examples of PartitionID under different rules

| Type | Sample data | Partition expression | Partition ID |
| --- | --- | --- | --- |
| No partition key | — | none | all |
| Integer | 18, 19, 20 | PARTITION BY Age | Partition 1: 18; Partition 2: 19; Partition 3: 20 |
| Integer | 'A0', 'A1', 'A3' | PARTITION BY length(Code) | Partition 1: 2 |
| Date | 2019-02-01, 2019-06-11 | PARTITION BY EventTime | Partition 1: 20190201; Partition 2: 20190611 |
| Date | 2019-05-01, 2019-06-11 | PARTITION BY toYYYYMM(EventTime) | Partition 1: 201905; Partition 2: 201906 |
| Other | 'www.oldba.cn' | PARTITION BY URL | Partition 1: 15r515rs15gr15615wg5e5h5548h3045h |
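The four ID-generation rules can be sketched in Python. This is a toy illustration, not ClickHouse's actual code; in particular, the `md5` call below is only a stand-in for the 128-bit hash ClickHouse applies to "other" types, so its hex digest will not match real partition IDs:

```python
import hashlib
from datetime import date

def partition_id(value):
    """Sketch of the four partition-ID rules (not ClickHouse's actual code)."""
    if value is None:                       # rule 1: no partition key
        return "all"
    if isinstance(value, int):              # rule 2: integer -> decimal string
        return str(value)
    if isinstance(value, date):             # rule 3: date -> YYYYMMDD
        return value.strftime("%Y%m%d")
    # rule 4: everything else -> 128-bit hash (md5 is just an illustrative stand-in)
    return hashlib.md5(str(value).encode()).hexdigest()

print(partition_id(None))              # all
print(partition_id(18))                # 18
print(partition_id(date(2019, 5, 1)))  # 20190501
```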

6.3.2 Data partition directory naming rules

An example:

202005_1_3_1        read directly, this directory uses year-month as its partition ID; data was inserted into this partition in three batches, and at some point after the three inserts a merge took place.

202005      PartitionID     partition directory ID

1           MinBlockNum     smallest data block number (by default MinBlockNum and MaxBlockNum both start from 1)

3           MaxBlockNum     largest data block number (when a merge happens, the largest block number among the merged parts)

1           Level           merge level: the number of times this partition has been merged, i.e. the "age" of the directory (incremented by 1 on each merge)
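Since a part directory name is just four underscore-separated fields, decoding one is a one-liner (a hypothetical helper, for illustration only):

```python
def parse_part_name(name):
    """Split a MergeTree part directory name into its four components."""
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return partition_id, int(min_block), int(max_block), int(level)

print(parse_part_name("202005_1_3_1"))  # ('202005', 1, 3, 1)
```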

6.3.3 Data partition merging process

MergeTree partition directories are created when data is written; after creation, they continue to change as more data is written or merged.

​ In other words: if a table does not have any data, there will not be any partition directories.

MergeTree partition directory merge process:

​ With each write of data (insert), MergeTree will generate a batch of new partition directories (even if the data written in different batches belong to the same partition, different partition directories will be generated). At some point after writing, ClickHouse will merge multiple directories belonging to the same partition into a new directory through a background task. The existing old partition will not be deleted immediately, but will be deleted by a background task at a later time (8 minutes by default).

The merging rules of the new directory name:

​ MinBlockNum: Take the smallest MinBlockNum value in all directories in the same partition.

​ MaxBlockNum: Take the largest MaxBlockNum value in all directories in the same partition.

​ Level: Take the maximum Level value in the same partition and add 1.

create table test(id UInt32,name String,age UInt8,shijian DateTime) engine = MergeTree() PARTITION BY toYYYYMM(shijian) ORDER BY id

insert into test values (1,'张三',18,'2020-12-08')					t1时刻

insert into test values (2,'李四',19,'2020-12-08')					t2时刻
	
insert into test values (3,'王五',22,'2021-01-03')					t3时刻

insert into test values (2,'李四',19,now())							t4时刻

SELECT now()

┌───────────────now()─┐
│ 2020-12-08 11:36:42 │
└─────────────────────┘

		
Directories before any merge, following the rules above:

PartitionID     202012
MinBlockNum     1
MaxBlockNum     1
Level           0

For newly created parts, MinBlockNum and MaxBlockNum are equal (they come from a table-wide auto-incrementing block number that starts at 1 and increases by 1 for each new directory).

    202012_1_1_0        directory created at t1

    202012_2_2_0        directory created at t2

    202101_3_3_0        directory created at t3

    202012_4_4_0        directory created at t4

Directories after merging, following the rules above:

If a merge happens between t2 and t3, there is only one directory at that point: 202012_1_2_1

If a merge happens between t3 and t4, there are exactly two directories: 202012_1_2_1, 202101_3_3_0

If a merge happens after t4, there are also exactly two directories: 202012_1_4_2, 202101_3_3_0

Note:
Merges happen at some point after the directories are created, and only parts of the same partition are merged, producing a new partition directory. The old directories are marked inactive, and inactive directories are deleted after a default delay of 8 minutes.
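The three naming rules for a merged directory can be checked against the example above with a small simulation (a sketch, not ClickHouse code):

```python
def merged_part_name(parts):
    """Name of the part produced by merging several parts of the SAME partition.

    Each part is (partition_id, min_block, max_block, level); the merge takes
    the smallest MinBlockNum, the largest MaxBlockNum, and max(Level) + 1.
    """
    pid = parts[0][0]
    assert all(p[0] == pid for p in parts), "only same-partition parts merge"
    min_block = min(p[1] for p in parts)
    max_block = max(p[2] for p in parts)
    level = max(p[3] for p in parts) + 1
    return f"{pid}_{min_block}_{max_block}_{level}"

# merge between t2 and t3, combining the two 202012 parts:
print(merged_part_name([("202012", 1, 1, 0), ("202012", 2, 2, 0)]))  # 202012_1_2_1
# merge after t4, combining the result with the t4 part:
print(merged_part_name([("202012", 1, 2, 1), ("202012", 4, 4, 0)]))  # 202012_1_4_2
```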

6.4 Primary index

MergeTree specifies the primary key in two ways:

1. PRIMARY KEY: MergeTree generates a primary index for the table at index_granularity intervals (8192 rows by default), sorted by the primary key and saved in the primary.idx file.

2. ORDER BY: the .bin data files are sorted by the same fields as PRIMARY KEY.

6.4.1 Sparse Index

The primary index in the primary.idx file is implemented by a sparse index

Dense index: each row of index marks corresponds to one specific row of data.

Sparse index: each row of index marks corresponds to a range of data rows, not a single row.

Comparison of the two:

a. A sparse index occupies much less index storage, but lookups take longer; it suits large data volumes, and the index data in primary.idx stays resident in memory.

b. A dense index gives shorter lookup times but larger index storage; it suits small data volumes.


6.4.2 Index Granularity

The data is divided into multiple small intervals at the granularity of index_granularity (the fixed default is 8192), and each interval holds at most 8192 rows of data. One such interval is a MarkRange, and its concrete range is expressed with start and end.


6.4.3 Index data generation rules

Because the index is sparse, MergeTree generates one index entry every index_granularity rows, taking the index value from the declared primary key fields. Figure 6-8 visualizes actual data from the test table hits_v1. hits_v1 is partitioned by year and month (PARTITION BY toYYYYMM(EventDate)), so all the data for March 2014 ends up in the same partition directory. If CounterID is used as the primary key (ORDER BY CounterID), then every 8192 rows the value of CounterID is taken as an index value, and the index data is finally written to the primary.idx file for storage.


For example, the CounterID of row 0 (8192*0) is 57, that of row 8192 (8192*1) is 1635, and that of row 16384 (8192*2) is 3266; the final index data is the contiguous sequence 57, 1635, 3266 (written as 5716353266).

As the figure shows, MergeTree stores the sparse index very compactly: index values are laid out back to back, tightly packed in primary-key order. This is not an isolated case; many data structures in ClickHouse are designed to be extremely compact, for example using bit reads instead of dedicated flag bytes or status codes, so that not even a single byte of space is wasted. Writ large, this is one of the deeper reasons for ClickHouse's outstanding performance.

If a compound primary key is used, such as ORDER BY (CounterID, EventDate), then every 8192 rows the values of both the CounterID and EventDate columns are taken together as the index value, as shown in the figure.
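The generation rule above amounts to "take the primary-key value of every index_granularity-th row". A minimal sketch, using a toy granularity of 4 instead of 8192:

```python
def build_primary_index(rows, index_granularity):
    """Take the primary-key value of every index_granularity-th row."""
    return [rows[i] for i in range(0, len(rows), index_granularity)]

# toy data mimicking the CounterID example, with granularity 4 instead of 8192
counter_ids = [57] * 4 + [1635] * 4 + [3266] * 4
print(build_primary_index(counter_ids, 4))  # [57, 1635, 3266]
```

For a compound key like (CounterID, EventDate), `rows` would simply hold tuples, and each index entry would be a tuple of both values.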


6.4.4 Index query process

MergeTree divides a complete set of data into multiple small data segments at the index_granularity interval; one specific segment is a MarkRange.

MarkRange corresponds to the index number, using start and end to indicate a specific range.

By taking the value of the index number corresponding to start and end, the corresponding numerical range can be obtained.

The index query is actually the judgment of the intersection of two numerical intervals:

​ 1. An interval is a conditional interval converted from a query condition based on the primary key;

​ 2. An interval is the numerical interval corresponding to MarkRange.

Index query process:

1. Generate the condition interval: convert the query condition into an interval.

	WHERE ID = 'A003'			['A003', 'A003']

	WHERE ID > 'A000'			('A000', '+inf')

	WHERE ID LIKE 'A006%'		['A006', 'A007')

2. Recursive intersection test: recursively intersect the numeric interval of each MarkRange with the condition interval.

	If there is no intersection, the entire MarkRange is pruned away.

	If there is an intersection and the MarkRange spans more than 8 marks (end - start > 8), split it into 8 sub-intervals (the factor is set by merge_tree_coarse_index_granularity, default 8) and repeat the intersection test recursively on each.

	If there is an intersection and the MarkRange cannot be split further (span smaller than 8), record the MarkRange and return.

3. Merge MarkRanges: gather the finally matched MarkRanges and merge their ranges.
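The recursive intersection step can be sketched in Python. This is a simplified model, not ClickHouse's implementation: closed numeric intervals stand in for condition intervals, `index[i]` is the primary-key value at mark i, and the `split` parameter plays the role of merge_tree_coarse_index_granularity:

```python
def intervals_intersect(a, b):
    """Closed-interval overlap test: [a0, a1] vs [b0, b1]."""
    return a[0] <= b[1] and b[0] <= a[1]

def query_marks(index, cond, start, end, split=8):
    """Recursively narrow a MarkRange [start, end) against a condition interval.

    The numeric range of MarkRange [start, end) is [index[start], index[end]].
    Ranges wider than `split` marks are cut into `split` sub-ranges and re-checked.
    """
    lo, hi = index[start], index[min(end, len(index) - 1)]
    if not intervals_intersect((lo, hi), cond):
        return []                        # no intersection: prune the whole range
    if end - start <= split:
        return [(start, end)]            # small enough: record and return
    step = max(1, (end - start) // split)
    out = []
    for s in range(start, end, step):    # split and recurse
        out.extend(query_marks(index, cond, s, min(s + step, end), split))
    return out

# 9 MarkRanges whose key values run 0,10,...,90; condition interval [25, 35]
print(query_marks(list(range(0, 100, 10)), (25, 35), 0, 9))  # [(2, 3), (3, 4)]
```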


​ Diagram of the complete process of index query

6.4.5 Secondary index (data skipping index)

A secondary index is built from aggregate information about the data; different index types aggregate different information.

MergeTree supports these skipping-index types: minmax, set, ngrambf_v1, and tokenbf_v1. A table may declare multiple skipping indexes at the same time.

Skipping indexes are disabled by default; they must be enabled with SET allow_experimental_data_skipping_indices = 1.

For a skipping index, index_granularity defines the granularity of the data, while granularity defines the granularity at which aggregate information is summarized.

In other words, granularity defines how many index_granularity intervals of data one row of the skipping index can skip.

To explain the role of granularity, start from how skipping-index data is generated. Roughly: first, the data is divided into n segments at index_granularity granularity, giving intervals [0, n-1] (n = total_rows / index_granularity, rounded up). Then, starting from interval 0 and following the expression declared in the index definition, aggregate information is collected one index_granularity segment at a time, accumulating as it advances. Finally, after moving across `granularity` sub-intervals, one row of skipping-index data is summarized and generated.

Take the minmax index as an example: its aggregate information is the minimum and maximum values of the data in each index_granularity interval. In the figure below, assuming index_granularity=8192 and granularity=3, the data is divided into n equal segments of index_granularity rows. MergeTree starts from segment 0 and collects aggregate information segment by segment; when the third segment has been processed (granularity=3), the first row of the minmax index is generated (the minmax extremes of the first 3 segments combine into [1, 9]), as shown in the figure.
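A minimal model of minmax skipping-index generation, using toy sizes (index_granularity=4, granularity=3) instead of the real defaults:

```python
def minmax_skip_index(values, index_granularity, granularity):
    """One (min, max) row per `granularity` consecutive index_granularity segments."""
    seg_mins, seg_maxs, rows = [], [], []
    for i in range(0, len(values), index_granularity):
        seg = values[i:i + index_granularity]          # one index_granularity segment
        seg_mins.append(min(seg))
        seg_maxs.append(max(seg))
        if len(seg_mins) == granularity:               # summarized one index row
            rows.append((min(seg_mins), max(seg_maxs)))
            seg_mins, seg_maxs = [], []
    if seg_mins:                                       # trailing partial group
        rows.append((min(seg_mins), max(seg_maxs)))
    return rows

# first three 4-row segments combine into (1, 9), matching the [1, 9] example
vals = [1, 3, 2, 4,  5, 9, 6, 7,  2, 8, 3, 1,  10, 12, 11, 13]
print(minmax_skip_index(vals, 4, 3))  # [(1, 9), (10, 13)]
```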


6.4.6 Data storage

Independent storage for each column

In MergeTree, data is stored in columns. Specific to each column field, each column field has a corresponding .bin data file (physical storage).

A .bin file only saves the part of the column's data that belongs to the current partition directory.

First, the data is compressed (LZ4, ZSTD, Multiple, and Delta are currently supported);

​ Secondly, the data will be sorted according to the ORDER BY statement in advance;

​ Finally, the data is organized in the form of multiple compressed data blocks and written into the .bin file.

Compressed data block

A compressed data block consists of two parts: a header and the compressed data. The header is always 9 bytes: one UInt8 (1 byte) and two UInt32 (4 bytes each), recording, respectively, the compression codec used, the compressed data size, and the uncompressed data size.


As can be seen from the figure, the .bin compressed file is composed of multiple compressed data blocks, and the header information of each compressed data block is generated based on the CompressionMethod_CompressedSize_UncompressedSize formula.
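The 9-byte header layout can be modeled with `struct`. This is an illustrative sketch of the CompressionMethod_CompressedSize_UncompressedSize layout described above; the 0x82 method byte for LZ4 and the sample sizes are assumptions, not values read from a real file:

```python
import struct

def pack_block_header(method, compressed_size, uncompressed_size):
    """9-byte header: UInt8 codec method + UInt32 compressed + UInt32 uncompressed."""
    return struct.pack("<BII", method, compressed_size, uncompressed_size)

hdr = pack_block_header(0x82, 12000, 65536)   # 0x82: assumed LZ4 method byte
print(len(hdr))                               # 9
print(struct.unpack("<BII", hdr))             # (130, 12000, 65536)
```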

​ During the specific data writing process, MergeTree will obtain and process the data in batches according to the index granularity. As shown below:


Many-to-one. 1. Single batch SIZE < 64 KB: if one batch is smaller than 64 KB, keep fetching the next batch until the accumulated SIZE >= 64 KB, then generate the next compressed data block.

One-to-one. 2. Single batch 64 KB <= SIZE <= 1 MB: if one batch falls between 64 KB and 1 MB, generate the next compressed data block directly.

One-to-many. 3. Single batch SIZE > 1 MB: if one batch exceeds 1 MB, it is first truncated at 1 MB to generate a data block, and the remainder is processed again according to these size rules.

Summary: A .bin file is composed of one or more compressed data blocks, and the size of each compressed block is between 64KB and 1MB. Between multiple compressed blocks, they are written end to end in sequence.
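The three sizing rules can be simulated for a stream of write batches. This is a sketch: real block sizes also include the 9-byte headers and checksums, which are ignored here:

```python
def split_into_blocks(batch_sizes, lo=64 * 1024, hi=1024 * 1024):
    """Apply the three compressed-block rules to a stream of batch sizes (bytes)."""
    blocks, acc = [], 0
    for size in batch_sizes:
        acc += size
        if acc < lo:
            continue                 # rule 1: keep accumulating until >= 64 KB
        while acc > hi:              # rule 3: cut 1 MB blocks off oversized batches
            blocks.append(hi)
            acc -= hi
        if acc >= lo:                # rule 2 (or a 64 KB..1 MB remainder)
            blocks.append(acc)
            acc = 0
    if acc:
        blocks.append(acc)           # flush whatever is left at end of stream
    return blocks

print(split_into_blocks([8192] * 8))          # [65536]: eight 8 KB batches -> one 64 KB block
print(split_into_blocks([3 * 1024 * 1024]))   # a 3 MB batch -> three 1 MB blocks
```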


The purpose of introducing compressed blocks into .bin files:

1. Compression effectively reduces data size, saving storage space and speeding up data transfer, although compression and decompression themselves cost some performance.

2. When reading a column (.bin file), the compressed data must first be loaded into memory and decompressed before it can be read. The compressed blocks (64 KB~1 MB) reduce the read granularity to the block level, so the whole .bin file need not be read.

6.5 Data Marking

6.5.1 Data tag generation rules

primary.idx primary index

.bin data file

.mrk establishes the association between the primary index and the data files. Using a book analogy, it records two main pieces of information:

1. the "page number" corresponding to a primary-index entry;

2. the starting position of the text on that "page".


Data marks have two characteristics: 1. The mark file is aligned with the index intervals; both are divided at the index_granularity granularity.

2. Mark files correspond one-to-one with .bin files: every [column].bin file has a matching [column].mrk mark file that records the offsets of that column's data within the .bin file.

A row of mark data is a tuple of two integer offsets: the offset within the compressed file, and the offset within the decompressed block.

Each row of mark data records the read position, in the .bin compressed file, of one interval of data (8192 rows by default).

Unlike the primary index, mark data does not stay resident in memory; instead, an LRU (least recently used) cache strategy is used to speed up its retrieval.

6.5.2 How data tagging works

When MergeTree reads data, it must find the required data through the location information of the marked data.

The search process is roughly divided into two steps: reading the compressed data block and reading the data.


The JavaEnable field has type UInt8, so each row occupies 1 byte.

The table's index_granularity is 8192, so each index interval is exactly 8192 B.

Per the compressed-block rules, 8192 B < 64 KB, so intervals accumulate until they reach 64 KB and then form the next block (64 KB / 8192 B = 8, i.e. 8 intervals of rows share one compressed data block).

How MergeTree locates the compressed block and reads the data:

1. Reading the compressed block: when querying a column, MergeTree does not need to load the whole .bin file at once; using the compressed-file offset stored in the mark file, it loads only the required compressed block.

2. Reading the data: after decompressing, MergeTree does not need to scan the whole decompressed segment; using the in-block offset stored in the mark file, it reads only the specific span at index_granularity granularity.
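A toy model of this two-step lookup, assuming a hypothetical decoded mark array for the JavaEnable example (the 12016-byte compressed offset of the second block is made up for illustration):

```python
def locate(row, marks, index_granularity=8192):
    """Given a row number, return (compressed_offset, offset_in_decompressed_block).

    `marks` is the decoded .mrk content: one (compressed_file_offset,
    decompressed_block_offset) tuple per index_granularity rows.
    """
    mark_no = row // index_granularity   # which mark covers this row
    return marks[mark_no]

# UInt8 column: 8 marks share compressed block at offset 0, the next 8 marks
# share the following block (assumed to start at byte 12016), and so on.
marks = [(0, i * 8192) for i in range(8)] + [(12016, i * 8192) for i in range(8)]
print(locate(0, marks))       # (0, 0)
print(locate(8192, marks))    # (0, 8192)
print(locate(65536, marks))   # (12016, 0)
```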

6.6 Collaborative summary of partitioning, indexing, marking, and compressed data

6.6.1 Writing process

​ 1. Generate a partition directory (with each insert operation, a new partition directory is generated);

​ 2. At a later time, merge the directories of the same partition;

​ 3. According to the index_granularity index granularity, respectively generate the primary.idx index file, the secondary index, the .mrk data mark of each column field and the .bin compressed data file.

The index intervals align with the mark intervals; mark intervals and compressed-block intervals are not aligned, producing one-to-one, one-to-many, and many-to-one relationships.


Reading the partition directory name 201403_1_34_3:

the N rows of data in this partition were written in 34 batches and merged 3 times.

6.6.2 Query process

1. minmax.idx (partition index)

2. primary.idx (primary index)

3. skp_idx.idx (secondary / skipping index)

4. .mrk (mark file)

5. .bin (compressed data file)

If the query has no WHERE condition, steps 1, 2, and 3 cannot prune anything: all partition directories, and the maximal index ranges inside them, are scanned. MergeTree then still uses the data marks and reads multiple compressed blocks with multiple threads.

6.6.3 Correspondence between data mark and compressed data block

The division of compressed blocks:

The index_granularity size, together with the three compression-block rules, bounds each data block between 64 KB and 1 MB.

And each index interval of data produces one row of marks.

Many-to-one: multiple marks correspond to one compressed block; this happens when the uncompressed size of one index_granularity interval is < 64 KB.

Assume JavaEnable is UInt8, so each row occupies 1 byte. With index_granularity = 8192, each index interval is exactly 8192 B. Per the block rules, 8192 B < 64 KB, so intervals accumulate to 64 KB before being compressed into the next block (64 KB / 8192 B = 8, i.e. 8 intervals per compressed block).


One-to-one: one mark corresponds to one compressed block; this happens when the uncompressed size of one index_granularity interval is between 64 KB and 1 MB.

Assume URLHash is UInt64 (8 B); one default interval is then 8 * 8192 = 65536 B, exactly 64 KB, so marks and compressed blocks correspond one-to-one.


One-to-many: one mark corresponds to multiple compressed blocks; this happens when the uncompressed size of one index_granularity interval is > 1 MB.

Assume the URL column is String and one interval's content is exactly 4.8 MB; that one row of marks then corresponds to 5 compressed blocks.


Origin blog.csdn.net/weixin_45320660/article/details/112761790