ClickHouse in action – using distributed tables
A distributed table is not a physical table but essentially a view. Because ClickHouse nodes are peers (there is no master-slave concept), when a distributed table receives a query, the query lands on one of the nodes; that node splits the SQL, sends the sub-queries to the other nodes, aggregates the data that comes back, and returns the aggregated result to the client.
View cluster information
Since the cluster name is needed when creating a distributed table, you first have to look up the cluster information, which can be done with a SQL query:
select * from system.clusters;
View system settings
select * from system.settings;
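For example, to narrow the output down to the settings related to distributed tables (a minimal sketch; the LIKE filter is only one possible way to restrict the result):
-- Show only the settings whose names mention "distributed", e.g. insert_distributed_sync.
select name, value, description from system.settings where name like '%distributed%';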
Create a distributed table
When creating a distributed table, the local table usually has to exist first; the distributed table is then created on top of it. The basic creation syntax is as follows:
CREATE TABLE [IF NOT EXISTS] [db.]table_name
[ON CLUSTER cluster] AS [db2.]name2
ENGINE = Distributed(cluster, database, table[, sharding_key[, policy_name]])
[SETTINGS name=value, ...]
It should be noted that, by default, data is inserted into a distributed table asynchronously. To make the insertion synchronous, the relevant settings have to be enabled (they are described below).
Explanation of distributed engine parameters
- cluster: the cluster name in the server configuration file
- database: the name of the remote database
- table: the name of the remote table
- sharding_key: (optional) the sharding key
- policy_name: (optional) the policy name; it is used for storing temporary files for asynchronous sends
When creating a distributed table, there are a few important settings to pay attention to (a usage sketch follows this list):
- insert_distributed_sync: By default, when inserting data into a distributed table, the ClickHouse server sends data to the cluster nodes asynchronously. When insert_distributed_sync=1, the data is processed synchronously, and the INSERT operation succeeds only after all the data is saved on all shards (if internal_replication is true, each shard has at least one replica).
- fsync_after_insert: do an fsync of the file data after an asynchronous insert into the Distributed table. Guarantees that the OS flushes the whole of the inserted data to a file on the initiator node's disk.
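As a minimal sketch of synchronous insertion (the table name db_name.dist_all and its columns are placeholders, not one of the tables created later in this article), the setting can be enabled per session before the INSERT:
-- Make inserts into distributed tables synchronous for this session.
SET insert_distributed_sync = 1;
-- This INSERT now returns only after every shard has stored its part of the data.
insert into db_name.dist_all (id, value) values (1, 'a');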
Distributed tables in practice
View distributed cluster information
First, check the cluster configuration of the system with a SQL statement.
:) select * from system.clusters;
┌─cluster────────────────────┬─host_name───┬─host_address─┬─port─┐
│ perftest_2shards_1replicas │ x.x.x.1xx │ x.x.x.xxx │ 9000 │
│ perftest_2shards_1replicas │ x.x.x.2xx │ x.x.x.xxx │ 9000 │
└────────────────────────────┴─────────────┴──────────────┴──────┘
From the above output, we can see that our cluster name is: perftest_2shards_1replicas.
Create cluster-local tables
CREATE TABLE if not exists test_db.city_local on cluster perftest_2shards_1replicas
(
`id` Int64,
`city_code` Int32,
`city_name` String,
`total_cnt` Int64,
`event_time` DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY id;
Create a distributed table
A distributed table is effectively a view, so it must have the same table structure as the local table on each shard (server node). Once the distributed table is created, queries are executed on every shard and the results are aggregated on the node that originally received the query.
Execute the following statement to create a distributed table on one of the nodes:
CREATE TABLE test_db.city_all on cluster perftest_2shards_1replicas
AS test_db.city_local
ENGINE = Distributed(perftest_2shards_1replicas, test_db, city_local, rand());
Note: ON CLUSTER xxx needs to be added when creating a distributed table. For the full syntax, refer to the official documentation on the Distributed engine.
Among the Distributed parameters, the first is the cluster name, the second is the database name, the third is the local table name, and the fourth is the sharding strategy.
Insert data into distributed table
insert into test_db.city_all (id, city_code, city_name, total_cnt, event_time) values (1, 4000, 'guangzhou', 420000, '2022-02-21 00:00:00');
insert into test_db.city_all (id, city_code, city_name, total_cnt, event_time) values (2, 5000, 'shenzhan', 55000, '2022-02-22 00:00:00');
insert into test_db.city_all (id, city_code, city_name, total_cnt, event_time) values (3, 6000, 'huizhou', 65000, '2022-02-23 00:00:00');
insert into test_db.city_all (id, city_code, city_name, total_cnt, event_time) values (4, 7000, 'huizhou', 75000, '2022-02-24 00:00:00');
insert into test_db.city_all (id, city_code, city_name, total_cnt, event_time) values (5, 8000, 'huizhou', 75001, '2022-02-25 00:00:00');
Check the number of data entries in the distributed table:
:) select count() from city_all;
SELECT count() FROM city_all
Query id: eff9e667-61d7-4302-93dc-9d7379d234db
┌─count()─┐
│       5 │
└─────────┘
Connect to one of the nodes and check the number of data entries in the local table:
:) select count() from city_local;
SELECT count() FROM city_local
Query id: 24347cdb-4cfe-4d6b-8492-90d40c8e0e2c
┌─count()─┐
│       3 │
└─────────┘
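To see how the five rows ended up spread across the two shards, you can also group by hostName() through the distributed table (a small sketch; hostName() returns the name of the server that executed each remote part of the query):
-- Count the rows stored on each shard, queried through the distributed table.
select hostName() as host, count() as cnt from city_all group by host;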
Update the data of the distributed table
When you want to update data in a distributed table, you do it by updating the local table. When updating the local table, ON CLUSTER xxx must be added; otherwise only one of the nodes is affected. The distributed table itself cannot be updated directly.
The statement to update the local table is as follows:
ALTER TABLE city_local ON CLUSTER perftest_2shards_1replicas UPDATE total_cnt = 5555 WHERE city_name = 'shenzhan';
With the above SQL statement, the update completes correctly. Once it has finished, you can view the updated data with the following query:
select * from city_all order by id;
If we instead try to update the data through the distributed table, an error is reported.
ALTER TABLE city_all ON CLUSTER perftest_2shards_1replicas UPDATE total_cnt = 4444 WHERE city_name = 'shenzhan';
The following error will be reported:
┌─host────────┬─port─┬─status─┬─error───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ xxx.xxx.1xx.xx│ 9000 │ 60 │ Code: 60, e.displayText() = DB::Exception: Table test_db.city_all doesn't exist (version 21.8.10.19 (official build)) │ 1 │ 0 │
│ xxx.xxx.2xx.xx │ 9000 │ 48 │ Code: 48, e.displayText() = DB::Exception: Table engine Distributed doesn't support mutations (version 21.8.10.19 (official build)) │ 0 │ 0 │
└─────────────┴──────┴────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
This confirms that data cannot be updated through a distributed table.
In addition, note that when ClickHouse updates data, even a single-row update rewrites the entire data parts that contain the affected rows. Updates are therefore heavy operations and should not be performed frequently; when the data volume is large, they need to be treated with even more caution.
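If you want to check whether such a mutation has finished, the system.mutations table can be queried (a minimal sketch, assuming the default system tables are available):
-- List the mutations (ALTER ... UPDATE / DELETE) that are still in progress.
select database, table, mutation_id, command, is_done from system.mutations where is_done = 0;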
Modify the structure of the distributed table
Generally speaking, the table structure of the distributed table has to stay consistent with the local tables. When the structure needs to change, it is usually better not to modify it directly with ALTER statements, but to create a new table and import the data from the old table into it.
For example, suppose we have a local table t1_local and a distributed table d1_all and need to modify the table structure. In general, the steps are:
(1) Create a new cluster-local table with the new structure (ON CLUSTER xxx must be added), for example t2_local.
(2) Create a new distributed table over the new local table (again with ON CLUSTER xxx), for example d2_all.
(3) Import the data from the old table into the new distributed table: insert into d2_all select f1, f2, f3, ..., fn from d1_all. Note: this step could also write directly into the local table t2_local; the point of going through the distributed table is to let the data be re-sharded according to the new sharding strategy.
(4) Verify that the data import is complete.
(5) Rename the old local table to t1_local_bk and the new local table to t1_local (ON CLUSTER xxx must be added). The original distributed table d1_all has not been modified, but because the local table name it points to now refers to the new table, operations on d1_all are performed on the latest local table.
Also note that, to avoid losing data, it is best to stop writes while the table structure is being changed. A SQL sketch of the whole procedure is shown below.
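The following is only a rough sketch under the assumptions of the example above: the names t1_local, t2_local, d1_all and d2_all come from the steps, while the columns f1, f2, the added column new_col, and the test_db database are placeholders for illustration.
-- (1) New cluster-local table with the new structure (new_col is the hypothetical added column).
CREATE TABLE test_db.t2_local on cluster perftest_2shards_1replicas
(
    `f1` Int64,
    `f2` String,
    `new_col` Int32
)
ENGINE = MergeTree()
ORDER BY f1;

-- (2) New distributed table on top of the new local table.
CREATE TABLE test_db.d2_all on cluster perftest_2shards_1replicas
AS test_db.t2_local
ENGINE = Distributed(perftest_2shards_1replicas, test_db, t2_local, rand());

-- (3) Copy the old data through the distributed table so it is re-sharded.
insert into test_db.d2_all (f1, f2) select f1, f2 from test_db.d1_all;

-- (5) Swap the local table names; d1_all now resolves to the new local table.
RENAME TABLE test_db.t1_local TO test_db.t1_local_bk on cluster perftest_2shards_1replicas;
RENAME TABLE test_db.t2_local TO test_db.t1_local on cluster perftest_2shards_1replicas;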
Summary
This article has covered the basic operations on distributed tables, including creating a distributed table, inserting data into it, updating data, and modifying the table structure.
References
- Description of the distributed engine
- https://medium.com/@merticariug/distributed-clickhouse-configuration-d412c211687c