What unknown optimizations does MySQL do to JOIN?

Hello everyone, I'm Kaka. No chasing overnight success, just a small step forward every day.

From the previous article, we know that MySQL has three join algorithms: the index nested-loop join (NLJ), the block nested-loop join (BNL, which batches rows through a cache buffer), and the brute-force simple nested-loop join.

In addition, I learned a new concept, the join_buffer. Its job is to read the driving table's data into the join_buffer, then take rows from the join_buffer one by one to match against the driven table. Since the matching happens in memory, efficiency improves.

At the same time, I ran into an unfamiliar concept in the previous article, hash_join, which was not explained in detail there and will be covered in this issue.

[Figure: Dead MySQL series]

1. Multi-Range Read optimization

Before getting into this issue's topic, let's first look at one piece of background: Multi-Range Read (MRR). Its main job is to make disk reads as sequential as possible; wherever access is ordered, there is some performance gain to be had.

Take MySQL indexes: by now you should know that indexes are inherently ordered, precisely so the server can avoid re-sorting data and building temporary tables.

Next, let's walk through a case to see how this optimization works.

Create two tables, join_test1 and join_test2

CREATE TABLE `join_test1` (
 `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
 `a` int(11) unsigned NOT NULL,
 `b` int(11) unsigned NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

CREATE TABLE `join_test2` (
 `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
 `a` int(11) unsigned NOT NULL,
 `b` int(11) unsigned NOT NULL,
 PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

Add some data to the two tables for the demonstration:

drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into join_test1 (a,b) values ( 1001-i, i);
    set i=i+1;
  end while;
  
  set i=1;
  while(i<=1000000)do
    insert into join_test2 (a,b)  values (i, i);
    set i=i+1;
  end while;

end;;
delimiter ;
call idata();

Assuming there is an index on field a of table join_test1 (say, added with alter table join_test1 add index a(a)), that index will be used when querying on a.

The execution flow is roughly: fetch all the qualifying values of field a from the index, then for each value go back to the table, looking the row up on the primary-key index.

In our case, if a is scanned in ascending order, the corresponding id values come back in descending order. The ids happen to look consecutive in this toy data, but in a production environment they generally are not, so the primary-key lookups become random access, and that is guaranteed to hurt performance.

Why does random access hurt performance?

MySQL's indexes are inherently ordered, and MySQL also takes advantage of the principle of locality: data and programs tend to cluster, so after a piece of data is accessed, there is a good chance that it, or data adjacent to it, will be accessed again soon.

You should also know that when MySQL reads data, it does not read only the requested row: by default it reads a whole 16 KB page, the size being determined by innodb_page_size.

That is why sequential reads are fast: the next rows usually sit in a page that is already in memory, so they are served without touching the disk again. Random access, by contrast, forces a fresh page read for each lookup, and that is where the poor performance comes from.

The role of MRR

Having said all that, you should now see MRR's role: turn the lookups into an ascending scan by primary-key id, so that reads from disk are as close to sequential as possible, which improves performance.

The execution flow of the statement therefore becomes:

  • First, locate all the index entries of a that satisfy the condition, and put the corresponding primary-key id values into read_rnd_buffer

  • Sort the ids in read_rnd_buffer in ascending order

  • Then fetch the rows from the primary-key index in that sorted id order, and return the result set
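To make the three steps concrete, here is a toy sketch in plain Python (dictionaries stand in for the B-trees; none of this is MySQL's actual code, and all names are illustrative):

```python
# Toy sketch of the Multi-Range Read idea: instead of fetching rows in the
# order the secondary index returns them, collect the primary-key ids,
# sort them, then fetch in id order so reads are (nearly) sequential.

def mrr_fetch(secondary_index_hits, clustered_index):
    """secondary_index_hits: primary-key ids in index-a order.
    clustered_index: dict id -> row, standing in for the primary-key B-tree."""
    read_rnd_buffer = list(secondary_index_hits)          # step 1: collect ids
    read_rnd_buffer.sort()                                # step 2: sort ascending
    return [clustered_index[i] for i in read_rnd_buffer]  # step 3: fetch in id order

clustered = {i: ("row", i) for i in range(1, 11)}
hits = [9, 3, 7, 1]                 # ids in the order index a produced them
print(mrr_fetch(hits, clustered))   # rows come back in id order: 1, 3, 7, 9
```

Without the sort, the fetch order would be 9, 3, 7, 1: exactly the random access pattern MRR is designed to avoid.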

How to enable MRR

The size of read_rnd_buffer is controlled by the read_rnd_buffer_size parameter, with a default of 256 KB. One thing to know, though: in its cost-based strategy the optimizer leans toward not using MRR, so if you want it, you need to change the configuration:

set optimizer_switch="mrr_cost_based=off";

[Figure: mrr default value]

What if read_rnd_buffer cannot hold everything?

Recall how an undersized join_buffer was handled in the previous issue: process what is already in the buffer, empty it, then load the next chunk. When a buffer holding intermediate data runs out of room, MySQL mostly handles it that way; here, the ids already in read_rnd_buffer are sorted and fetched, the buffer is cleared, and the scan continues.

The SQL execution flow after enabling read_rnd_buffer becomes the following:

[Figure: read_rnd_buffer execution flow]

The explain result looks like this:

[Figure: explain result with MRR]

One caveat

Suppose now that we widen the query range, and see what changes:

[Figure: expanded query range]

As you can see, once the range grows to cover nearly the whole table, index a is no longer used: the optimizer switches to a full table scan, and the MRR optimization no longer applies.

So getting a performance win out of MRR rests on two key points: the query must be a range query on an index, and the optimizer must actually choose that index. Naturally, the indexed column has to be the one in the range condition.

2. Nested-Loop Join optimization

It has been almost a month since the last article, so how much of the Nested-Loop Join (NLJ) algorithm can you still recall? The SQL execution flow is roughly as follows:

[Figure: NLJ algorithm execution flow]

  • Read a row R from table join_test1

  • Take the join field a from R, look it up in index a on join_test2, and fetch the matching rows via their primary-key ids

  • Combine the qualifying rows from join_test2 with R to form result rows

  • Repeat the first three steps until all qualifying rows of join_test1 have been scanned
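The steps above can be sketched in a few lines of plain Python (the driven table's index on the join field is modelled as a dict mapping key to rows; this is an illustration, not MySQL internals):

```python
# Minimal sketch of Index Nested-Loop Join (NLJ).
def nlj(driving_rows, driven_index):
    result = []
    for r in driving_rows:                      # 1. read row R from the driving table
        matches = driven_index.get(r["a"], [])  # 2. probe index a on the driven table
        for m in matches:                       # 3. pair R with each matching row
            result.append((r, m))
    return result                               # 4. repeat until the driving table is done

t1 = [{"id": 1, "a": 10}, {"id": 2, "a": 20}]
idx_a = {10: [{"id": 7, "a": 10}], 30: [{"id": 9, "a": 30}]}
print(nlj(t1, idx_a))   # only the row with a=10 finds a match
```

Note that the index is probed once per driving row, one key at a time; that detail matters for the next section.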

The logic of the NLJ algorithm is that after fetching one row from the driving table, the join probe against the driven table happens immediately, one value per probe. With only a single key at a time, the precondition for MRR (a batch of keys to sort) is not met.

From the last article you should know the role join_buffer plays in the BNL algorithm, but NLJ does not use it.

So can we find a way to hand the driving table's rows to the driven table in batches for the join operation?

That's right: the MySQL team introduced exactly that in version 5.6. A chunk of rows is taken out of the driving table and placed in a piece of temporary memory, and that temporary memory is the join_buffer from the previous issue.

The execution flow chart then becomes the following.

Note that the diagram does not draw the index-a-to-read_rnd_buffer step; if that part is unclear, scroll back up to the earlier picture.

[Figure: BKA algorithm optimization]

In the figure we are still querying 1000 rows, so join_buffer stores those 1000 rows; if it cannot hold them all, the rows are processed in chunks until execution finishes.

The official name for this optimization of the NLJ algorithm is Batched Key Access (BKA).

How to enable the BKA algorithm

Since BKA relies on the MRR optimization, MRR must be switched on, and batched_key_access=on must be set at the same time:

set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

3. Block Nested-Loop Join algorithm optimization

A very simple optimization is to add an index on the driven table's join field; the BNL algorithm then naturally becomes the BKA algorithm.

select * from join_test1 join join_test2 on (join_test1.b=join_test2.b) where join_test2.b>=1 and join_test2.b<=2000;

This SQL reads only 2000 rows from join_test2. If you can afford the extra index maintenance, the straightforward fix is to add an index on field b and be done with it.

If you cannot, say because the statement is rare and a permanent index would be pure overhead, we need to find another way.

Let's first review the execution flow of the BNL algorithm:

  • Read all the data of join_test1 into join_buffer

  • Scan join_test2, comparing each of its rows against every row in join_buffer; rows that don't match are skipped, rows that match go into the result set

Since field b on the driven table has no index, every row in join_buffer effectively has to be compared against the whole of join_test2.

In our case, join_test2 holds 1,000,000 rows, so the number of comparisons is 1000 × 1,000,000 = 1 billion: a billion operations just to return 2000 rows of data. You can imagine the performance.
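A quick back-of-the-envelope check of that count (plain Python, just counting the comparisons BNL would make without an index):

```python
# With no index on b, BNL compares every driven-table row
# against every row sitting in join_buffer.
def bnl_comparisons(driving_rows, driven_rows):
    comparisons = 0
    for _ in range(driven_rows):      # one full scan of the driven table
        comparisons += driving_rows   # each row checked against all of join_buffer
    return comparisons

print(bnl_comparisons(1000, 1_000_000))  # 1,000,000,000 comparisons
```

One billion in-memory comparisons is not the same cost as one billion disk reads, but it is still far too much work for a 2000-row result.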

Here we can take a slightly unorthodox route, a temporary table, to solve this problem. The idea is roughly as follows:

  • First store the qualifying rows of join_test2 in a temporary table tmp_join_test2

  • The temporary table now holds only the roughly 2000 rows in the condition's range, so adding an index on field b is perfectly affordable

  • Finally, let join_test1 do the join against tmp_join_test2

The corresponding SQL operation is as follows

create temporary table tmp_join_test2 (id int primary key, a int, b int, index(b))engine=innodb;
insert into tmp_join_test2 select * from join_test2 where b>=1 and b<=2000;
explain select * from join_test1 join tmp_join_test2 on (join_test1.b=tmp_join_test2.b);

Row-scan count

The insert does a full table scan of join_test2, scanning 1,000,000 rows

join_test1 gets one full table scan, scanning 1000 rows

Each join probe handles one row, 1000 times in total, for another 1000 rows scanned

With the temporary table, the total scan count drops from about 1 billion to roughly 1,000,000 + 2,000, and the query can be expected to return in under a second.

Summary

Whether you use the BKA algorithm directly or go through a temporary table, the common thread is the same: make an index usable on the driven table so the join runs as BKA, thereby improving performance.

4. Hash join

Remember this picture from the previous article's section on the Block Nested-Loop Join? The explain result came back showing hash_join, which went unexplained at the time.

That is because the hash_join algorithm only exists as of MySQL 8.0.18.

[Figure: BKA]

The precondition for hash_join to take effect is that the driven table's join field has no index. In MySQL 8.0.18 there is one more constraint: the join condition must be an equality, as with join_test1.b=tmp_join_test2.b in our case.

In 8.0.20, however, the equality constraint was removed, and hash join fully supports non-equi-join, semijoin, antijoin, and left/right outer join.

The implementation idea behind the hash_join algorithm is actually quite simple:

  • Compute a hash over the join field of the driving table

  • Create a hash_table in memory and store all the driving table's hashed entries in it

  • Fetch the qualifying rows from the driven table; in our case, select * from join_test2 where b>=1 and b<=2000 returns 2000 rows

  • Compare those 2000 rows against the hash_table one by one, and return the matches as the result set
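The four steps above can be sketched in plain Python (a dict stands in for the hash table; this is an illustration of the idea, not MySQL's implementation):

```python
# Toy sketch of a hash join: build a hash table from the driving (build)
# side, then probe it with the qualifying driven-table rows.
def hash_join(build_rows, probe_rows, key):
    # steps 1-2: hash the driving side into an in-memory table
    hash_table = {}
    for r in build_rows:
        hash_table.setdefault(r[key], []).append(r)
    # steps 3-4: scan the qualifying driven-table rows and probe the table
    result = []
    for r in probe_rows:
        for m in hash_table.get(r[key], []):
            result.append((m, r))
    return result

t1 = [{"b": 1}, {"b": 2}]
t2 = [{"id": 10, "b": 2}, {"id": 11, "b": 3}]
print(hash_join(t1, t2, "b"))   # only b=2 appears on both sides
```

Each side is scanned exactly once, which is why the row counts land so close to the temporary-table approach without needing any index at all.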

As you can see, the scan count of the hash_join algorithm is not far off the temporary-table approach. So why does MySQL now use hash_join by default? I'll leave that question for you to investigate.

5. Summary

This issue mainly covered the optimizations of the NLJ and BNL algorithms.

Among these optimizations, hash_join has built-in support as of MySQL 8.0.18; on lower versions the default remains the BKA algorithm (or BNL when no index is available).

It is recommended to add an index on the driven table's join field, converting the BNL algorithm into BKA or hash_join.

This issue also gave you a temporary-table solution. Temporary tables are an optimization point that is very easy to overlook in day-to-day development; learn to use them in the right situations.

Recommended reading

Deadline MySQL Series General Catalog

Heavy blockade, so that you can't get a single piece of data "Deadly Kick MySQL Series Thirteen"

Trouble, the production environment ran a DDL operation "Deadly Kick MySQL Series Fourteen"

Talk about the locking rules of MySQL "Deadly Kick MySQL Series Fifteen"

Why not use join? "Deadly Kick MySQL Series Sixteen"

Persistence in learning, persistence in writing, and persistence in sharing are the beliefs Kaka has upheld all along. I hope this article brings you a little help on the vast Internet. I'm Kaka, see you next issue.


Origin my.oschina.net/u/3828348/blog/5519394