【GaussDB(for MySQL)】 Big IN query optimization

This article is shared from the Huawei Cloud Community post "[MySQL Technology Column] GaussDB (for MySQL) Big IN Query Optimization", author: GaussDB database.

Background

In production environments, we often encounter customer SQL statements that filter rows with an IN predicate and then aggregate the result, where the IN list contains thousands or even tens of thousands of constant values. Such statements take a very long time to execute.

MySQL optimization

When open source MySQL processes column IN (const1, const2, ....), the optimizer chooses a range scan if the column is indexed; otherwise it uses a full table scan. The range_optimizer_max_mem_size system variable limits the memory that the range optimizer may use during its analysis. Each element in the IN list is treated as an OR condition, and each OR condition takes approximately 230 bytes, so the more elements the list contains, the more memory is used. If memory usage exceeds the configured limit, range optimization is abandoned and the optimizer falls back to another strategy, such as a full table scan, which degrades query performance.
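As a rough sanity check of the 230-bytes-per-OR figure, the statement below compares the estimated memory for an IN list with the limit currently in effect for the session. The list size of 10,000 is an illustrative assumption, and the estimate ignores any other overhead of range analysis.

```sql
-- Rough estimate: about 230 bytes per IN-list element
-- (each element is treated as one OR condition).
-- 10000 is an assumed, illustrative list size.
SELECT 10000 * 230                    AS approx_bytes_needed,
       @@range_optimizer_max_mem_size AS current_limit_bytes;
```

If approx_bytes_needed (2,300,000 bytes here) approaches or exceeds current_limit_bytes, the fallback described above becomes likely.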

This problem can be mitigated by increasing range_optimizer_max_mem_size. However, the limit applies per session, so every session that executes this kind of statement can consume memory up to that limit. Under high concurrency, this can lead to excessive memory usage on the instance and a risk of instance OOM.
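One way to contain that risk is to raise the limit only in the session that needs it and restore it afterwards. The following is a minimal sketch using standard MySQL syntax; the 16 MB value is an illustrative assumption, not a recommendation.

```sql
-- Inspect the current session value (in bytes).
SHOW SESSION VARIABLES LIKE 'range_optimizer_max_mem_size';

-- Raise the limit for this session only; 16777216 (16 MB) is an assumed value.
SET SESSION range_optimizer_max_mem_size = 16777216;

-- ... run the statement with the large IN list here ...

-- Reset the session value to the current global setting.
SET SESSION range_optimizer_max_mem_size = DEFAULT;
```

This keeps the higher limit out of the global configuration, but it does not remove the memory cost for the sessions that do use it.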

For range queries, MySQL provides the eq_range_index_dive_limit system variable to control whether the optimizer performs index dives when processing equality range queries. An index dive uses the index itself to estimate the number of rows in each range, which yields more accurate estimates and better query plans, but takes longer. When the number of equality ranges reaches the limit, index dives are no longer used and the optimizer relies on static index statistics to choose an index. These estimates are not necessarily accurate, which may prevent MySQL from making good use of the index and result in a performance regression.
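The variable can be inspected and adjusted per session. The sketch below follows the semantics documented for open source MySQL (a value of N + 1 permits index dives for up to N equality ranges, and 0 always uses index dives); the value 1001 is an illustrative assumption.

```sql
-- Inspect the current session value.
SHOW SESSION VARIABLES LIKE 'eq_range_index_dive_limit';

-- Permit index dives for up to 1000 equality ranges (N + 1 = 1001).
SET SESSION eq_range_index_dive_limit = 1001;

-- Alternatively, always use index dives regardless of the number of ranges.
SET SESSION eq_range_index_dive_limit = 0;
```

Raising the limit trades longer optimization time for more accurate row estimates, so it is worth testing against a representative IN list before changing it broadly.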

Big IN optimization of GaussDB (for MySQL)

GaussDB (for MySQL) addresses the Big IN performance problem by converting large IN predicates into IN subqueries. An IN predicate of the form:

column IN (const1, const2, ....)

is converted to the corresponding IN subquery:

column IN (SELECT ... FROM temporary_table)

After this conversion, the IN predicate becomes an IN subquery, and the subquery is non-correlated.
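To make the shape of the rewritten query concrete, the sketch below performs a comparable rewrite by hand with an explicit temporary table. The table names demo_t and in_values are hypothetical, and this only illustrates the query form; it is not how GaussDB (for MySQL) builds its internal temporary table.

```sql
-- Hypothetical demo table with an indexed column.
CREATE TABLE demo_t (id INT, a INT, KEY idx_a (a));

-- Hypothetical temporary table holding the IN-list constants.
-- The primary key enforces unique values and allows lookups by value.
CREATE TEMPORARY TABLE in_values (v INT NOT NULL PRIMARY KEY);
INSERT INTO in_values (v) VALUES (1), (2), (3), (4), (5);

-- Original form:  SELECT * FROM demo_t WHERE a IN (1, 2, 3, 4, 5);
-- Rewritten form with a non-correlated IN subquery:
SELECT * FROM demo_t WHERE a IN (SELECT v FROM in_values);

-- Clean up the demo objects.
DROP TEMPORARY TABLE in_values;
DROP TABLE demo_t;
```

The subquery does not reference any column of demo_t, which is what makes it non-correlated.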

For non-correlated IN subqueries, the MySQL optimizer provides the semi-join materialization strategy. This strategy materializes the subquery result into a temporary table and then joins that table with the outer table.

The join can be performed in two orders:

Materialization-scan: joins from the materialized table to the outer table, performing a full scan of the materialized table.
Materialization-lookup: joins from the outer table to the materialized table, using the primary key to look up rows in the materialized table.

Materialization-scan

1. Execute the subquery, using the auto_distinct_key index to deduplicate the results at the same time;
2. Save the results of the previous step in the temporary table template1;
3. Take a row from the temporary table and find the rows in the outer table that satisfy the join condition;
4. Repeat step 3 until the temporary table has been fully traversed.

Materialization-lookup

1. Execute the subquery first;
2. Save the results of the previous step in a temporary table;
3. Take a row from the outer table and look up the matching rows in the materialized temporary table, using the primary key of the materialized table and reading one row at a time;
4. Repeat step 3 until the entire outer table has been traversed.
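In open source MySQL, the materialization strategy can be requested explicitly and the chosen join order inspected with EXPLAIN. The following is a minimal sketch, assuming a MySQL version that supports optimizer hints; the tables outer_t and inner_t are hypothetical, and the hint is a request that the optimizer may still decline.

```sql
-- Make sure semi-join and materialization are enabled for this session.
SET SESSION optimizer_switch = 'semijoin=on,materialization=on';

-- Hypothetical tables: outer_t is the outer table, inner_t feeds the subquery.
CREATE TABLE outer_t (id INT PRIMARY KEY, a INT, KEY idx_a (a));
CREATE TABLE inner_t (v INT NOT NULL);

-- Ask the optimizer to use materialization for this semi-join
-- and show the resulting plan.
EXPLAIN
SELECT /*+ SEMIJOIN(@subq MATERIALIZATION) */ *
FROM outer_t
WHERE a IN (SELECT /*+ QB_NAME(subq) */ v FROM inner_t);
```

When materialization is used, the plan typically contains a materialized subquery table. If that table comes first in the join order, the plan corresponds to Materialization-scan; if it is probed by key from the outer table, it corresponds to Materialization-lookup.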

The optimizer chooses between the two join orders according to the relative sizes of the materialized (inner) table and the outer table. In real scenarios, the queried table is usually very large, with tens or even hundreds of millions of rows, while the number of elements in the IN list is much smaller than the number of rows in the table, so the optimizer chooses Materialization-scan. Let M be the number of rows in the outer table and N the number of elements in the IN list. If the outer table is accessed through the primary key index, the total number of rows scanned after optimization is about N rather than M. When M is much larger than N, the performance improvement is very obvious.

Usage

The rds_in_predicate_conversion_threshold parameter is the switch for the feature that rewrites IN predicates into subqueries. When the number of elements in the IN predicate list of an SQL statement exceeds the value of this parameter, the optimization is applied. The feature is controlled entirely through the value of this variable. The following simple example illustrates how to use the optimization:

Table Structure

create table t1(id int, a int, key idx1(a));

Query statement

select * from t1 where a in (1,2,3,4,5);

Set rds_in_predicate_conversion_threshold = 0 and range_optimizer_max_mem_size = 1 to turn off both the Big IN predicate optimization and the range scan optimization, then check the execution plan of the query above:

> set rds_in_predicate_conversion_threshold = 0;
> set range_optimizer_max_mem_size=1;
> explain select * from t1 where a in (1,2,3,4,5);

The result is as follows:

+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t3    | NULL       | ALL  | key1          | NULL | NULL    | NULL |    3 |    50.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 2 warnings (0.00 sec)

> show warnings;
+---------+------+---------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                   |
+---------+------+---------------------------------------------------------------------------------------------------------------------------+
| Warning | 3170 | Memory capacity of 1 bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query.   |
| Note    | 1003 | /* select#1 */ select `test`.`t3`.`id` AS `id`,`test`.`t3`.`a` AS `a` from `test`.`t3` where (`test`.`t3`.`a` in (3,4,5)) |
+---------+------+---------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Executing the statement above produces a warning. The warning shows that the memory needed during range optimization exceeded range_optimizer_max_mem_size, so range optimization was not applied to the statement. As a result, the access type becomes ALL, that is, a full table scan.

Set rds_in_predicate_conversion_threshold = 3 to enable the Big IN predicate optimization, so that the conversion to a subquery is applied when the IN predicate list contains more than 3 elements. Run EXPLAIN FORMAT=TREE to check whether the optimization takes effect:

> set rds_in_predicate_conversion_threshold = 3;
> explain format=tree select * from t1 where a in (1,2,3,4,5);
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Nested loop inner join  (cost=0.70 rows=1)
    -> Filter: (t1.a is not null)  (cost=0.35 rows=1)
        -> Table scan on t1  (cost=0.35 rows=1)
    -> Single-row index lookup on <in_predicate_2> using <auto_distinct_key> (a=t1.a)  (cost=0.35 rows=1)
 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

The <in_predicate_*> table (* is a number) in the execution plan is the temporary table constructed by the Big IN optimization; it stores all the values in the IN predicate list.

Usage restrictions

Big IN optimization supports the following statements and features:

SELECT
INSERT ... SELECT
REPLACE ... SELECT
Views
Prepared statements

Constraints and Limitations

The Big IN to subquery conversion relies on the subquery optimization provided by MySQL to improve performance. It is therefore subject to the following restrictions; outside them, performance may degrade.

Scenarios in which an index cannot be used are not supported
Only constant IN lists are supported (including NOW(), ? and other expressions that do not involve table queries)
Stored procedures, functions, and triggers are not supported
NOT IN is not supported
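The constant-only restriction can be illustrated with a prepared statement whose IN list consists of ? placeholders, next to a list that reads from a table. This is a sketch in standard MySQL syntax; the table demo_p is hypothetical, and whether a given statement is actually converted depends on the threshold and on the GaussDB (for MySQL) version.

```sql
-- Hypothetical demo table with an indexed column.
CREATE TABLE demo_p (id INT, a INT, KEY idx_a (a));

-- IN list made only of placeholders: a constant IN list.
PREPARE big_in_stmt FROM 'SELECT * FROM demo_p WHERE a IN (?, ?, ?, ?, ?)';
SET @v1 = 1, @v2 = 2, @v3 = 3, @v4 = 4, @v5 = 5;
EXECUTE big_in_stmt USING @v1, @v2, @v3, @v4, @v5;
DEALLOCATE PREPARE big_in_stmt;

-- By contrast, this list contains an element that queries a table,
-- so it is not a constant IN list in the sense of the restriction above.
SELECT * FROM demo_p WHERE a IN (1, 2, (SELECT MAX(id) FROM demo_p));
```

Running EXPLAIN FORMAT=TREE on each form is the simplest way to confirm whether an <in_predicate_*> table appears in the plan.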

Typical scenario test comparison

The test table structure is as follows:

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB;

The table contains 10 million rows.

> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+

The query statement is as follows; the column in the condition is indexed, and the IN list contains 10,000 constants.

select count(*) from sbtest1 where k in (2708275,5580784,7626186,8747250,228703,4589267,5938459,6982345,2665948,4830545,4929382,8723757,354179,1903875,5111120,5471341,7098051,3113388,2584956,6550102,2842606,2744112,7077924,4580644,5515358,1787655,6391388,6044316,2658197,5628504,413887,6058866,3321587,1430333,445303,7373496,9133196,6760595,4735642,4756387,9845147,9362192,7271805,4351748,6625915,3813276,4236692,8308973,4407131,9481423,3301846,432577,810938,3830320,6120078,6765157,6456566,6649509,1123840,2906490,9965014,3725748, ... );

The performance comparison is shown in the figure below:

It can be seen that after in-list optimization, the performance is improved by 36 times compared with the original method.
