Getting started with ClickHouse (one article is enough)

1. Knowledge base

1.1 What is OLAP?

OLAP (Online Analytical Processing) is a data processing method used to perform complex analysis on large-scale data sets. Unlike OLTP (Online Transaction Processing) systems, which focus on supporting day-to-day business transactions and operations, OLAP systems are designed to provide fast and flexible query and analysis capabilities over multi-dimensional data.

OLAP scenarios have the following key characteristics:

  • Mostly read requests: The database serves mainly read operations, with relatively few writes.
  • Bulk updates: Updates arrive in fairly large batches rather than as single-row changes, and may be infrequent or prohibited entirely.
  • Immutable data: Once added to the database, data is not modified.
  • Wide tables: Tables contain a large number of columns.
  • Few but complex queries: Relatively few queries run, but each may fetch a small subset of columns from a very large number of rows, and the queries themselves may be complex.
  • Moderate latency tolerance: For simple queries, latencies of around 50 milliseconds are acceptable.
  • Small column values: Individual column values are small, mostly numbers and short strings.
  • High throughput: Processing a single query requires high throughput, up to billions of rows per second.
  • Transactions not required: Strong transactional guarantees are not needed.
  • Low data-consistency requirements: Some data inconsistency can be tolerated.
  • One large table per query: Typically each query involves one large table; the other tables are small.
  • Results fit in RAM: Query results are filtered or aggregated, so they fit in the RAM of a single server.
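
These characteristics are easiest to see in an example. The following query — with hypothetical table and column names — scans many rows of one wide table but touches only three columns and aggregates heavily, exactly the access pattern OLAP systems are optimized for:

```sql
-- Hypothetical wide `hits` table: billions of rows, many columns.
-- Only three columns are read, and the result is small enough for RAM.
SELECT
    toDate(event_time) AS day,
    count() AS page_views,
    uniq(user_id) AS unique_users
FROM hits
WHERE event_time >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```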

1.2 What is columnar storage?

Columnar storage is a way of organizing databases and data storage systems in which data is stored by column rather than by row. Traditional row-store database systems, by contrast, store data row by row. The main features of columnar storage are:

  • Data organization: Columnar storage keeps the data of each column together instead of storing entire rows together, so the database engine can read and process a specific column more efficiently.
  • Data compression: Compression is generally easier to achieve with columnar storage. Values in the same column tend to be similar, so compression algorithms can exploit that similarity and reduce storage requirements.
  • Better query performance: Columnar storage generally performs better on analytical queries (aggregation, filtering, and so on), because only the columns required by the query are read and processed, rather than entire rows.
  • Ideal for analytical workloads: Columnar storage is particularly suited to workloads dominated by aggregation operations and complex queries.
  • Efficient column indexes: To support efficient columnar access, database systems usually use specialized column-oriented indexes that locate and retrieve data for specific columns quickly.
  • Suitable for large-scale data warehouses: Columnar storage, combined with compression and caching, is very useful for data warehouses, where analysis and reporting are the main workload.
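
The benefit is easiest to see in a query. With columnar storage, the statement below reads and decompresses only two columns from disk, even if the table has hundreds of columns (table and column names are illustrative):

```sql
-- Only `url` and `duration_ms` are read, regardless of table width.
SELECT url, avg(duration_ms) AS avg_ms
FROM page_visits
GROUP BY url
ORDER BY avg_ms DESC
LIMIT 10;
```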

2. What is ClickHouse?

2.1 Concept

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP).

It is intended for large-scale data analysis, especially scenarios where massive amounts of data must be processed and complex queries executed. Here are some of its main application areas:

  • Online advertising analysis: ClickHouse can be used to analyze the effectiveness of online advertising and process large amounts of click data and user behavior to optimize advertising strategies.
  • Log analysis: Processing large amounts of log data is one of ClickHouse's strengths. It can be used to analyze server logs, application logs, etc. to help identify potential problems, monitor system performance, etc.
  • E-commerce data analysis: Analyze a large amount of transaction and user behavior data on the e-commerce platform, including sales trends, user behavior patterns, etc.
  • Big Data Dashboard: ClickHouse can support the creation of real-time or quasi-real-time big data dashboards for monitoring business indicators, data visualization, etc.
  • Scientific research: In the field of scientific research, especially when large-scale experimental data needs to be analyzed, ClickHouse can be used to accelerate the data processing and analysis process.
  • Internet of Things (IoT) Data Analysis: Process large amounts of data generated by IoT devices to monitor, predict, and optimize IoT systems.

2.2 Advantages and Disadvantages

Here are some benefits of ClickHouse:

  • High performance: ClickHouse focuses on high-performance OLAP queries and is particularly suitable for complex analysis of large-scale data sets. Its columnar storage engine and optimized query execution make it excel at data scanning and aggregation.
  • Columnar storage engine: Tables are stored by column instead of by row, so analytical queries read only the required columns, improving query performance.
  • Distributed architecture: ClickHouse is a distributed system that scales horizontally to handle larger data sets. It supports distributed queries executed in parallel on multiple nodes, providing high availability and load balancing.
  • Suited to time-series data: ClickHouse has special support and optimizations for time-series data, including time windows, sampling, and gap filling.
  • Flexible data distribution and replication strategies: Users can define data distribution and replication strategies to suit different business needs, making it easier to configure high availability and load balancing in large distributed environments.
  • Multiple data formats: Supports importing and exporting many formats, including CSV, JSON, and Parquet, making it adaptable to different data sources and processing scenarios.
  • Low-latency queries: Achieves low-latency queries over large-scale data, suiting applications that need fast responses to analytical queries.
  • Open source and free: ClickHouse is open source and free to use, which reduces deployment and usage costs.

Although ClickHouse performs well in large-scale data analysis and OLAP scenarios, it has some potential limitations, depending on the specific usage scenarios and requirements. Possible shortcomings include:

  • Limited transaction support: Transaction support is relatively limited; the focus is on read and analysis performance. ClickHouse may not be the first choice for OLTP scenarios that require strong transactional consistency.
  • Relatively slow writes: Although ClickHouse excels at reading and querying large-scale data, frequent single-row INSERT, UPDATE, or DELETE operations can be relatively slow.
  • Not suited to frequent updates: ClickHouse is designed for large batch inserts and queries rather than frequent updates. Its columnar storage engine is better suited to immutable data, so frequent updates can degrade performance.
  • Slow complex joins: Complex multi-table JOIN operations can be relatively slow; ClickHouse's strength lies in single-table queries and simple joins.
  • No full-text search: Scenarios requiring full-text search may need a dedicated full-text search engine alongside ClickHouse.
  • No stored procedures or triggers: Stored-procedure-style business logic cannot be implemented in the database.
  • Incomplete SQL-standard support: ClickHouse implements most of the SQL standard, but some specific SQL features are unsupported or only partially supported.
  • Maintenance and management complexity: Configuration and management can be relatively complex, especially in a distributed cluster; configuration, performance tuning, and data backup all need careful attention.
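
The "slow updates" point is visible in the syntax itself: in ClickHouse, updates and deletes are asynchronous mutations issued via ALTER TABLE rather than ordinary row-level statements. A short sketch, using a hypothetical table name:

```sql
-- Mutations rewrite whole data parts in the background,
-- which is why frequent single-row updates are expensive.
ALTER TABLE user_events UPDATE status = 'archived' WHERE event_date < '2023-01-01';
ALTER TABLE user_events DELETE WHERE user_id = 42;

-- Progress of pending mutations can be checked in a system table:
SELECT mutation_id, is_done FROM system.mutations WHERE table = 'user_events';
```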

3. Table engine

As described above, ClickHouse uses a columnar storage engine that stores tables by column instead of by row. On top of this there is the concept of a table engine.

A ClickHouse table engine determines how a table's data is physically stored, read, and written. ClickHouse supports a variety of table engines; common ones include the MergeTree family, the Log family, external engines, and others.

  • MergeTree series:
      • MergeTree: large-scale data analysis and time-series data; columnar storage, suited to analytical queries, supports partitioning and indexing.
      • ReplicatedMergeTree: high availability and data redundancy; supports data replication across replicas, plus partitioning and indexing.
      • Distributed: distributed queries across multiple nodes; routes a query to every shard in the cluster.
  • Log series:
      • Log: a simple, lightweight engine, often used together with other engines for small or append-only data.
  • External engines:
      • Kafka: integrates ClickHouse with Apache Kafka; reads and writes data through Kafka topics.
      • MySQL: accesses data in a MySQL database from within ClickHouse.
      • ODBC: connects to other data sources through the ODBC interface.
  • Other engines:
      • TinyLog: a lightweight Log-engine implementation, suitable for small data volumes.
      • TinyMergeTree: a lightweight MergeTree-style implementation, suitable for small data volumes.
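
As a sketch, here is how the ENGINE clause selects among these engines. Table names are illustrative; the ReplicatedMergeTree path assumes `{shard}` and `{replica}` macros are defined in the server config, and the cluster name matches the `cluster_zookeeper` example used later in this article:

```sql
-- MergeTree: the workhorse for analytical data.
CREATE TABLE metrics
(
    ts DateTime,
    device_id UInt32,
    value Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (device_id, ts);

-- ReplicatedMergeTree: the same table with replication via ZooKeeper.
CREATE TABLE metrics_replicated
(
    ts DateTime,
    device_id UInt32,
    value Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/metrics', '{replica}')
PARTITION BY toYYYYMM(ts)
ORDER BY (device_id, ts);

-- Distributed: a facade that fans queries out across the cluster.
CREATE TABLE metrics_all AS metrics
ENGINE = Distributed(cluster_zookeeper, default, metrics, rand());
```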

4. ClickHouse operating mechanism

4.1 Brief description

When we use ClickHouse, its operating mechanism can be simplified to the following easy-to-understand description:

  • Table storage: Imagine a huge table, the data in the table is stored in columns instead of rows. This is like listing everyone's name in one column and their age in another column. This method is more suitable for large-scale data.
  • Partition: This table is divided into small blocks according to time or other rules, and each block is called a partition. It's like dividing time into day by day, and putting each day's data in different "boxes".
  • Insert data: When there is new data to be inserted, the data will not be directly inserted into the table. Instead, it is first put into a "staging area" and then organized into the corresponding partition of the table at the appropriate time.
  • Query processing: When we execute a query, ClickHouse will smartly read only the required columns instead of the entire row. It's like we only care about certain columns and don't have to look at the entire table.
  • Efficient merging: Since data is stored in partitions, ClickHouse can efficiently merge data in different partitions regularly to reduce storage space and improve query speed.

This is ClickHouse's basic mode of operation.

4.2 Detailed description

The following is a detailed description of the operating mechanism of ClickHouse:

  1. Table creation and structure definition: Users create tables in ClickHouse through SQL statements and define the table structure, including column data types, partitioning methods, indexes, etc. Each table is associated with a specific table engine, which determines how the data is stored and its processing characteristics.
  2. Column storage: ClickHouse uses a columnar storage engine to store data of the same column together. This storage method brings high compression ratio and higher query performance, and is especially suitable for large-scale data analysis.
  3. Data partitioning: The table can be partitioned according to specified fields. The most common partitioning is based on time. Partitioning helps improve query performance, allowing the system to locate and process data within a specific time range more quickly.
  4. Data insertion and merging: When the user inserts new data, it first lands as small parts of the MergeTree engine; background merge operations then combine these parts regularly according to the configured merge strategy. Merging optimizes data storage and improves query performance.
  5. Use of indexes: ClickHouse supports primary key indexes and auxiliary indexes, which improves query speed. The primary key index is used to quickly locate unique rows, while the secondary index is used for other query criteria.
  6. Asynchronous writing: ClickHouse supports asynchronous writing, and the insertion operation is very fast. Written data first goes into different partitions of the MergeTree engine and can then be sorted through periodic merge operations.
  7. Distributed Computing: If running in a distributed cluster, ClickHouse can execute queries in parallel on multiple nodes. The Distributed engine is used to execute distributed queries among multiple ClickHouse nodes, distribute the query to multiple nodes in the cluster, and merge the results back.
  8. Data copy and high availability: The ReplicatedMergeTree engine supports data replication between multiple copies to improve data redundancy and high availability. Failover can be implemented between nodes in the cluster to ensure data reliability.
  9. Query processing and optimization: When executing a query, ClickHouse uses its built-in optimizer and execution engine to optimize according to the query structure and table distribution, improving query performance.
  10. Asynchronous updates and deletes: ClickHouse supports asynchronous update and delete operations. Through the version control mechanism of the ReplicatedMergeTree engine, conditional updates and deletions can be performed in the table.
  11. Data compression: Column storage and support for multiple compression algorithms help reduce storage usage and improve I/O efficiency.
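
The steps above map directly onto a table definition. A minimal sketch, with hypothetical names (the `tokenbf_v1` parameters are illustrative defaults):

```sql
CREATE TABLE logs
(
    event_date Date,
    level LowCardinality(String),
    message String,
    -- Data-skipping ("secondary") index on the message column (step 5).
    INDEX msg_idx message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)  -- step 3: time-based partitioning
ORDER BY (event_date, level);      -- step 5: primary (sparse) index

-- Inserted rows first form small parts, which background merges
-- later combine (steps 4 and 6); the parts can be inspected here:
SELECT partition, name, rows
FROM system.parts
WHERE table = 'logs' AND active;
```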

5. Installation and configuration of ClickHouse

5.1 Installation

ClickHouse supports cluster installation, allowing the ClickHouse database to be deployed and managed on multiple servers. Such a clustered environment provides higher availability, load balancing and horizontal scalability. The following are the general installation steps for a ClickHouse cluster (general idea):

Step 1. Install ClickHouse: Install ClickHouse on each server as previously mentioned. Make sure every node has the same ClickHouse version.

Step 2. Configure ZooKeeper (optional):

  • ClickHouse clusters can use ZooKeeper for coordination and configuration. If you choose to use ZooKeeper, be sure to configure the ZooKeeper connection information on each node.
  • Add a ZooKeeper section to config.xml:
<zookeeper>
    <node index="1">
        <host>zookeeper1-host</host>
        <port>2181</port>
    </node>
    <!-- Additional ZooKeeper nodes -->
</zookeeper>

Step 3. Configure cluster nodes: Define the ClickHouse cluster topology in the config.xml file:

<remote_servers>
    <cluster_zookeeper>
        <shard>
            <weight>1</weight>
            <internal_replication>true</internal_replication>
            <replica>
                <host>node1-host</host>
                <port>9000</port>
            </replica>
            <!-- Additional replicas -->
        </shard>
        <!-- Additional shards -->
    </cluster_zookeeper>
</remote_servers>

Step 4. Start the ClickHouse service on each node:

sudo service clickhouse-server start

Step 5. Verify the cluster configuration: Use the ClickHouse client to connect to any node and run the following query:

SELECT * FROM system.clusters;

Others: ClickHouse also provides a web interface for monitoring and managing clusters, and third-party monitoring tools can be used to observe cluster status and performance.


ClickHouse achieves high availability through data replication and redundancy, automatic data sharding and load balancing, and built-in failover, monitoring, and alerting, somewhat similar to Redis.
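
With the cluster from Step 3 configured, the usual pattern is a replicated local table plus a Distributed table on top. A sketch: the cluster name matches the config above, the other names are illustrative, and the `{shard}`/`{replica}` macros are assumed to be defined on each node:

```sql
-- Create the local replicated table on every node of the cluster.
CREATE TABLE events ON CLUSTER cluster_zookeeper
(
    ts DateTime,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY ts;

-- Create a Distributed table that routes queries to all shards.
CREATE TABLE events_all ON CLUSTER cluster_zookeeper AS events
ENGINE = Distributed(cluster_zookeeper, default, events, rand());
```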

5.2 Configuration

ClickHouse's main configuration file is config.xml, usually located in the /etc/clickhouse-server/ directory. Its core configuration areas are:

  • Cluster configuration: If using a ClickHouse cluster, ensure the cluster information in config.xml is correctly configured, including ZooKeeper connection information (if ZooKeeper is used) and the shard and replica configuration.
  • Data distribution configuration: ClickHouse supports multiple data distribution and replication strategies. Based on the characteristics of the data and query requirements, select an appropriate distribution strategy and configure the data replication parameters when creating the table.
  • ZooKeeper configuration (optional): If using ZooKeeper for coordination and configuration, ensure that the ZooKeeper connection information is correctly configured in config.xml and that the appropriate paths are created in ZooKeeper.
  • System tables and engine configuration: ClickHouse uses system tables (such as system.tables and system.replicas) to store metadata about clusters and tables. Make sure these system tables are properly configured and the engine is set up.
  • Cache settings: ClickHouse has multiple levels of caching, including operating system cache and ClickHouse's own cache. Depending on memory and performance needs, adjust cache settings in config.xml.
  • Monitoring and Alerting: ClickHouse provides tools for monitoring cluster status and performance. Configure monitoring tools and set alerts to take action in the event of a failure or performance degradation.
  • Data directory and disk settings: Ensure that the correct data directory and disk settings are configured, as well as an appropriate backup strategy. ClickHouse requires sufficient disk space to store data and logs.
  • Network Configuration: Configure network settings to ensure that communication between nodes is safe and reliable. Check the firewall rules to make sure the port is open.
  • Log settings: Adjust the log level and log path to facilitate troubleshooting and performance analysis.

An example configuration covering these areas is shown below (illustrative; exact setting names can vary between ClickHouse versions):

<!-- /etc/clickhouse-server/config.xml -->

<yandex>
    <!-- Basic settings -->
    <listen_host>::</listen_host>
    <http_port>8123</http_port> <!-- HTTP interface -->
    <tcp_port>9000</tcp_port>   <!-- native client protocol -->
    <!-- Cap server memory usage at 75% of available RAM -->
    <max_server_memory_usage_to_ram_ratio>0.75</max_server_memory_usage_to_ram_ratio>

    <!-- Cluster configuration -->
    <zookeeper>
        <node index="1">
            <host>zookeeper1-host</host>
            <port>2181</port>
        </node>
        <!-- Additional ZooKeeper nodes -->
    </zookeeper>

    <remote_servers>
        <cluster_zookeeper>
            <shard>
                <weight>1</weight>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>node1-host</host>
                    <port>9000</port>
                </replica>
                <!-- Additional replicas -->
            </shard>
            <!-- Additional shards -->
        </cluster_zookeeper>
    </remote_servers>

    <!-- Data distribution: the sharding key and replication parameters are
         defined per table, via the Distributed and ReplicatedMergeTree
         engines, rather than in config.xml -->

    <!-- Macros expanded in ReplicatedMergeTree ZooKeeper paths -->
    <macros>
        <shard>01</shard>
        <replica>node1-host</replica>
    </macros>

    <!-- Cache settings -->
    <uncompressed_cache_size>8589934592</uncompressed_cache_size> <!-- 8 GB -->
    <mark_cache_size>5368709120</mark_cache_size>                 <!-- 5 GB -->

    <!-- Monitoring: write server metrics to system.metric_log -->
    <metric_log>
        <database>system</database>
        <table>metric_log</table>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    </metric_log>

    <!-- Data directory and disk settings -->
    <path>/var/lib/clickhouse/</path>
    <tmp_path>/var/lib/clickhouse/tmp/</tmp_path>

    <!-- Network: inter-server communication used for replication -->
    <interserver_http_host>node1-host</interserver_http_host>
    <interserver_http_port>9009</interserver_http_port> <!-- must not clash with tcp_port -->

    <!-- Logging -->
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
    </logger>
</yandex>

6. ClickHouse syntax

6.1 Basic syntax

Table definition syntax:

-- Create a table
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    column_name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr_ttl1],
    column_name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr_ttl2],
    ...
) ENGINE = engine

-- Alter a table
ALTER TABLE [db.]table
[ADD COLUMN [IF NOT EXISTS] col_name [type] [DEFAULT|MATERIALIZED|ALIAS expr] [AFTER col_after]]
[DROP COLUMN [IF EXISTS] col_name [FROM PARTITION partition]]
[MODIFY COLUMN [IF EXISTS] col_name [type] [DEFAULT|MATERIALIZED|ALIAS expr] [AFTER col_after]]
[MODIFY PRIMARY KEY|AFTER [col_name] ]
[MODIFY ORDER BY [col_name] ]
[MODIFY TTL [col_name] ]
[COMMENT col_name | TABLE 'comment']
[RENAME COLUMN [IF EXISTS] col_name TO new_col_name]
[DROP PARTITION partition]
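
A concrete instance of these templates, using a hypothetical database and table:

```sql
CREATE TABLE IF NOT EXISTS analytics.visits
(
    visit_date Date,
    user_id UInt64,
    url String DEFAULT '',
    duration_ms UInt32 TTL visit_date + INTERVAL 1 YEAR  -- column-level TTL
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(visit_date)
ORDER BY (visit_date, user_id);

ALTER TABLE analytics.visits ADD COLUMN IF NOT EXISTS referrer String AFTER url;
ALTER TABLE analytics.visits DROP PARTITION '202301';
```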

Data manipulation syntax:

-- Insert
INSERT INTO [db.]table [(column, ...)] VALUES (expr, ...)

-- Update (in ClickHouse this is an asynchronous mutation)
ALTER TABLE [db.]table UPDATE col1 = expr1, col2 = expr2, ... WHERE condition

-- Delete (also a mutation; recent versions additionally support a lightweight DELETE FROM ... WHERE)
ALTER TABLE [db.]table DELETE WHERE condition
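
Note that in ClickHouse, updates and deletes are implemented as asynchronous ALTER TABLE mutations rather than classic row-level UPDATE/DELETE statements. A concrete sketch, with hypothetical names:

```sql
INSERT INTO analytics.visits (visit_date, user_id, url) VALUES
    ('2024-01-01', 1, '/home'),
    ('2024-01-01', 2, '/pricing');

-- Updates and deletes run asynchronously in the background:
ALTER TABLE analytics.visits UPDATE url = '/index' WHERE url = '/home';
ALTER TABLE analytics.visits DELETE WHERE user_id = 2;
```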

Query syntax:

-- Basic query
SELECT [DISTINCT] select_expr [, ...]
FROM table
[GLOBAL] [ANY|ALL] INNER|LEFT|RIGHT|FULL [OUTER] JOIN table ON expr
[WHERE expr]
[GROUP BY expr_list]
[HAVING expr]
[ORDER BY expr [ASC|DESC], ...]
[LIMIT [n,] m]

-- Subquery
SELECT ...
FROM ...
WHERE expr IN (SELECT ...)

-- Aggregate functions
SELECT COUNT(*), AVG(column), SUM(column), MIN(column), MAX(column)
FROM table

-- Time-series queries: approximate sampling and gap filling
SELECT ...
FROM table
SAMPLE 0.1                   -- requires a SAMPLE BY clause in the table definition
ORDER BY time_col WITH FILL  -- fills gaps in the ordered result
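
Putting the query clauses together, with illustrative tables:

```sql
SELECT
    u.country,
    count() AS visits,
    avg(v.duration_ms) AS avg_duration
FROM visits AS v
INNER JOIN users AS u ON v.user_id = u.id
WHERE v.visit_date >= '2024-01-01'
GROUP BY u.country
HAVING visits > 100
ORDER BY visits DESC
LIMIT 10;

-- Gap filling for a daily time series:
SELECT visit_date, count() AS c
FROM visits
GROUP BY visit_date
ORDER BY visit_date WITH FILL STEP 1;
```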

Administration syntax:

-- Administrative commands
SHOW [SETTINGS|CREATE|TABLES|COLUMNS|PROCESSLIST|QUERIES|PROFILES|EVENTS|DICTIONARIES|ZOOKEEPER|STORAGE POLICIES|CLUSTER|GRANTS|ACCESS]

-- System tables
SELECT * FROM system.tables

6.2 Data types

Here are some of the main data types supported by ClickHouse:

Numeric types:

  • UInt8, UInt16, UInt32, UInt64: unsigned integers
  • Int8, Int16, Int32, Int64: signed integers
  • Float32, Float64: floating-point numbers

Date and Time Type:

  • Date: date
  • DateTime: date and time
  • DateTime64: date and time with precision

String types:

  • String: variable-length string
  • FixedString(n): fixed-length string, where n is the length in bytes

Special types:

  • UUID: universally unique identifier
  • IPv4, IPv6: IPv4 and IPv6 addresses
  • LowCardinality(T): Low cardinality column, used for enumeration types, etc.

Array and tuple types:

  • Array(T): array
  • Tuple(T1, T2, …): Tuple, which can contain elements of different types

Aggregation type:

  • AggregateFunction: intermediate state of an aggregate function

Other types:

  • Nullable(T): Nullable type
  • Enum8, Enum16: enumeration type
  • Nested: Nested data type
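
Many of these types combined in one hypothetical table definition:

```sql
CREATE TABLE type_demo
(
    id UUID,
    created DateTime64(3),            -- millisecond precision
    name String,
    code FixedString(2),
    tags Array(String),
    point Tuple(Float64, Float64),
    status Enum8('active' = 1, 'deleted' = 2),
    note Nullable(String),
    country LowCardinality(String),   -- dictionary-encoded low-cardinality column
    ip IPv4
)
ENGINE = MergeTree
ORDER BY id;
```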

7. ClickHouse FAQs and Troubleshooting

7.1 FAQ

1. Restarting the ClickHouse service will take a long time

  • The main reason is that the node is loaded slowly due to too many data fragments. Just wait patiently.

2. Data insertion error too many parts exception

  • Mainly because data is inserted too frequently: background merges cannot keep up with the number of new parts, so ClickHouse activates a self-protection mechanism and refuses further inserts. Try increasing the insert batch size (e.g. to 100,000 rows per batch) and reducing the insert frequency (e.g. to once per second) to alleviate the problem.

3. The replicated table becomes read-only

  • This is mainly caused by ClickHouse being unable to connect to the ZooKeeper cluster or the metadata of the replicated table on ZooKeeper being lost. At this time, new data cannot be inserted into the table. To solve this problem, first check the connection status of ZooKeeper. If the connection fails, you need to further check the network status and the status of ZooKeeper. After the connection is restored, the replicated table can continue to insert data. If the connection is normal but the metadata is lost, you can convert the replicated table to a non-replicated table and then insert the data again.

4. Memory exceeds limit when performing JOIN operation

  • This may be because no selective filter conditions were added to the subqueries on either side of the JOIN, or because the joined data is simply too large to fit in memory. Try adding filter conditions to reduce the data volume, or raise the memory limit in the configuration file so more data can be loaded.
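
Several of the problems above can be diagnosed from system tables. A few queries that may help (a sketch; column availability can vary slightly by version):

```sql
-- FAQs 1 and 2: how many active parts does each partition have?
SELECT table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY parts DESC;

-- FAQ 3: is a replica read-only, or has its ZooKeeper session expired?
SELECT database, table, is_readonly, is_session_expired
FROM system.replicas;
```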

7.2 Troubleshooting methods

  1. Check the ClickHouse running status to ensure that the service is running properly.
  2. Check the ClickHouse error log file to find the source of the problem.
  3. Check the system log file (/var/log/messages) for records related to ClickHouse to see if a system operation caused the ClickHouse exception.
  4. For unknown issues or bugs, you can seek help under the issue in the official GitHub repository. A complete problem description and error log information must be provided.

8. ClickHouse performance optimization

8.1 Performance optimization methods

Design the table structure appropriately:

  • Consider how the table is partitioned and choose appropriate primary keys and indexes to support query performance.
  • Consider the choice of data type, choose a data type of appropriate size, and reduce storage space.
  • Use a suitable engine, such as the MergeTree engine for large-scale data analysis.

Fair use partitioning:

  • Select appropriate partition fields according to the query mode to reduce the scan scope and improve query performance.
  • Avoid too many partitions as this can complicate the merge operation.

Usage of index:

  • Use primary key indexes to improve the performance of unique queries.
  • Use secondary indexes to speed up other types of queries. Note, however, that secondary indexes may increase write complexity.

Properly configure the hardware:

  • Make sure the server hardware is powerful enough, especially disk I/O, memory, and CPU.
  • Use SSDs instead of traditional disks to increase disk read and write speeds.
  • Consider using NVMe storage to further improve I/O performance.

Query optimization:

  • Use appropriate query conditions to avoid full table scans.
  • To avoid complex JOIN operations on large tables, queries can be broken down into multiple small queries through distributed computing.
  • Make reasonable use of ClickHouse's built-in functions to reduce data transmission and processing overhead.

Appropriate cluster size:

  • Consider the amount of data and query load and adjust the cluster size appropriately.
  • In large-scale clusters, use the Distributed engine to implement distributed queries.

Data import optimization:

  • Use bulk insert operations to increase data import speed.
  • Consider using asynchronous data import to avoid real-time requirements.

System parameter tuning:

  • Adjust ClickHouse configuration parameters such as cache size, number of threads, etc. to suit different hardware and workloads.

Periodic maintenance:

  • Regularly perform table optimization operations, such as OPTIMIZE TABLE, to reduce storage space and improve query performance.
  • System statistics are collected regularly to help ClickHouse perform better query plan optimization.
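
The periodic optimization mentioned above can be sketched as follows (table name hypothetical; use sparingly, since FINAL forces all parts to merge):

```sql
-- Force a full merge of the table's parts to reclaim space
-- and reduce the number of parts read per query.
OPTIMIZE TABLE visits FINAL;
```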

Monitoring and logging:

  • Set up monitoring and logging to detect performance issues and make adjustments in a timely manner.
  • Perform troubleshooting and performance analysis based on logs.

8.2 System parameter tuning

Here is an example of integrating some tuning suggestions into ClickHouse's configuration file config.xml, along with a detailed description:

<!-- config.xml -->

<!-- Maximum number of threads a single query may use -->
<max_threads>16</max_threads>

<!-- Maximum amount of memory (bytes) a single query may use -->
<max_memory_usage>10000000000</max_memory_usage>

<!-- Data-size threshold above which GROUP BY spills to external (on-disk) aggregation -->
<max_bytes_before_external_group_by>10000000000</max_bytes_before_external_group_by>

<!-- Data-size threshold above which ORDER BY spills to external (on-disk) sorting -->
<max_bytes_before_external_sort>10000000000</max_bytes_before_external_sort>

<!-- MergeTree merge settings: merge thread count and merge block size -->
<merge_tree>
    <merge_threads>4</merge_threads>
    <merge_max_block_size>10000000</merge_max_block_size>
</merge_tree>

<!-- Maximum length of query-log entries -->
<query_log>
    <max_length>1000000</max_length>
</query_log>
<!-- Log level used while recording query execution -->
<query_thread_log>
    <log_level>2</log_level>
</query_thread_log>

<!-- Enable memory-efficient distributed aggregation -->
<settings>
    <distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
</settings>

<!-- Enable the uncompressed-block cache -->
<settings>
    <use_uncompressed_cache>1</use_uncompressed_cache>
</settings>

max_threads: Specifies the maximum number of threads that can be used by a single query.

  • Tuning suggestions: Set max_threads appropriately according to the number of CPU cores and load conditions of the server. Under high load conditions, it can be reduced appropriately.

max_memory_usage: Specifies the maximum amount of memory that can be used by a single query.

  • Tuning suggestions: Set max_memory_usage appropriately based on the available memory and workload of the server. Avoid setting it too large to prevent the system from slowing down due to insufficient memory.

max_bytes_before_external_group_by: Specifies the maximum data size before performing GROUP BY operation. If this size is exceeded, external GROUP BY will be performed.

  • Tuning suggestions: Adjust this parameter appropriately according to the frequency and data volume of GROUP BY operations. For GROUP BY operations with large data volumes, consider increasing this value.

max_bytes_before_external_sort: Specifies the maximum data size before ORDER BY operation. If this size is exceeded, external sorting will be performed.

  • Tuning suggestions: Adjust this parameter appropriately according to the frequency and data volume of ORDER BY operations. For ORDER BY operations with large data volumes, consider increasing this value.

merge_tree's merge_threads: Specifies the number of threads used by the MergeTree engine to perform merge operations.

  • Tuning suggestions: Adjust merge_threads appropriately according to the number of CPU cores of the server and the I/O capability of the hard disk. Consider increasing this value to speed up data merging.

merge_tree's merge_max_block_size: Specifies the block size for the MergeTree engine merge operation.

  • Tuning suggestion: Adjust merge_max_block_size appropriately according to the size of hard disk I/O and table partition. Typically, increasing this value helps reduce the cost of merge operations.

query_log_max_length: Specifies the maximum length of the query log.

  • Tuning suggestions: Adjust query_log_max_length appropriately according to logging requirements. For scenarios where detailed log query is required, this value can be increased.

query_thread_log_level: Specifies the log level during query execution.

  • Tuning suggestion: When you need to debug query performance issues, set query_thread_log_level to a lower log level to obtain more detailed query execution information.

distributed_aggregation_memory_efficient: Enables memory optimization for distributed aggregation.

  • Tuning suggestions: For scenarios involving distributed aggregation operations, enable this parameter to optimize memory usage.

use_uncompressed_cache: Enables the cache of uncompressed data blocks.

  • Tuning suggestion: Enabling the uncompressed cache can speed up short queries that repeatedly read the same data, at the cost of extra memory; enable it only when RAM allows.

Origin blog.csdn.net/qq_20042935/article/details/134738163