MySQL in Action 10 | Why does MySQL sometimes choose the wrong index?

In earlier articles we introduced indexes, so you already know that a table in MySQL can have multiple indexes. But when you write a SQL statement, you don't explicitly specify which index to use; that decision is made by MySQL itself.

Have you ever run into this situation: a statement that could execute very quickly runs slowly because MySQL chose the wrong index?

Let's look at an example.

First, let's build a simple table with two fields, a and b, and create an index on each:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB;

Then we insert 100,000 rows into table t, with each value incrementing by 1: (1,1,1), (2,2,2), (3,3,3), and so on up to (100000,100000,100000).

I use a stored procedure to insert the data; here it is so you can reproduce the experiment:

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=100000)do
    insert into t values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

Next, let's analyze this SQL statement:

mysql> select * from t where a between 10000 and 20000;

You might say this statement is trivial to analyze: there's an index on a, so surely it will use index a.

You're right. Figure 1 shows the result of viewing this statement's execution with the explain command.

Figure 1: Using the explain command to view the execution of the statement

From Figure 1, the execution of this query is indeed in line with expectations: the key field's value is 'a', meaning the optimizer chose index a.

But wait — this case is not so simple. On the table we just prepared with 100,000 rows of data, we do the following.


Figure 2: Execution flow of session A and session B

Session A's operation should be familiar to you: it opens a transaction. Then session B deletes all the data and calls the stored procedure idata to insert 100,000 rows again.

At this point, session B's query select * from t where a between 10000 and 20000 no longer chooses index a. We can check the slow query log (slow log) to see what actually happened.

To show whether the optimizer's choice is correct, I added a control: using force index(a) to force the optimizer to use index a (I'll come back to this in the second half of the article).

The following three SQL statements make up this experiment:

set long_query_time=0;
select * from t where a between 10000 and 20000; /*Q1*/
select * from t force index(a) where a between 10000 and 20000;/*Q2*/

  • The first statement sets the slow query log threshold to 0, so this thread's subsequent statements will all be recorded in the slow query log;
  • The second statement, Q1, is session B's original query;
  • The third statement, Q2, adds force index(a) to session B's original query, for comparison.

Figure 3 shows the slow query log after the three SQL statements have run.

Figure 3: slow log results

You can see that Q1 scanned 100,000 rows — clearly a full table scan — and took 40 milliseconds. Q2 scanned 10,001 rows and took 21 milliseconds. In other words, without force index, MySQL used the wrong index, resulting in a longer execution time.

This example corresponds to a common scenario: deleting historical data while inserting new data. In this situation MySQL actually chooses the wrong index. Strange, isn't it? Today we'll start from this puzzling result.

Optimizer logic

In the first article of this series, we mentioned that choosing an index is the optimizer's job.

The optimizer's goal in choosing an index is to find the optimal execution plan, executing the statement at the lowest cost. In a database, the number of rows scanned is one of the factors affecting execution cost: the fewer rows scanned, the fewer accesses to data on disk and the less CPU consumed.

Of course, the number of rows scanned is not the only criterion; the optimizer also weighs factors such as whether temporary tables or sorting are needed before making an overall judgment.

Our simple query involves neither temporary tables nor sorting, so MySQL's wrong choice here must come from misjudging the number of rows to scan.

So the question is: how is the number of rows to scan estimated?

Before actually executing a statement, MySQL cannot know exactly how many records match the condition; it can only estimate the count from statistics.

These statistics describe an index's "selectivity". Obviously, the more distinct values an index contains, the better its selectivity. The number of distinct values in an index is called its "cardinality". In other words, the larger the cardinality, the better the index's selectivity.

We can use the show index command to see an index's cardinality. Figure 4 shows the result of show index for table t. Although every row in this table has the same value in all three fields, in the statistics the cardinalities of the three indexes differ — and in fact none of them is accurate.


Figure 4: Result of show index for table t

So how does MySQL obtain an index's cardinality? Here I'll briefly introduce MySQL's sampling-based statistics.

Why sampling? Scanning the whole table row by row would give an accurate count, but the cost is too high, so MySQL can only use sampled statistics.

When sampling, InnoDB by default selects N data pages, counts the distinct values on those pages to get an average, and then multiplies by the total number of pages in the index to obtain the index's cardinality.
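To make the mechanism concrete, here is a small simulation of sampled cardinality estimation — a sketch in Python, not InnoDB's actual code; the page size, sample count, and data skew are all made up for illustration:

```python
import random

def estimate_cardinality(pages, n_sample=8, seed=1):
    """Sample n_sample pages, average their distinct-value counts, then
    multiply by the total number of pages -- the same idea InnoDB's
    sampled statistics use (this is a simplified model, not its code)."""
    rng = random.Random(seed)
    sampled = rng.sample(pages, min(n_sample, len(pages)))
    avg_distinct = sum(len(set(p)) for p in sampled) / len(sampled)
    return round(avg_distinct * len(pages))

# Skewed, sorted data: one hot value repeated 50,000 times followed by
# 50,000 distinct values, split into pages of 100 entries each.
values = [0] * 50000 + list(range(1, 50001))
pages = [values[i:i + 100] for i in range(0, len(values), 100)]

true_cardinality = len(set(values))   # 50001
estimate = estimate_cardinality(pages)
print(true_cardinality, estimate)
```

Because roughly half the pages contain a single distinct value while the other half contain 100, which pages happen to be sampled swings the estimate widely — which is exactly why the cardinality shown by show index is only approximate.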

Since data tables are continuously updated, index statistics cannot stay fixed. So when the number of changed rows exceeds 1/M of the total, a recalculation of the index statistics is automatically triggered.

In MySQL, there are two ways to store index statistics, selected by setting the innodb_stats_persistent parameter:

  • When set to on, the statistics are stored persistently. In this case, the default N is 20 and M is 10.
  • When set to off, the statistics are stored only in memory. In this case, the default N is 8 and M is 16.

Because the statistics come from sampling, whether N is 8 or 20, the cardinality can easily be inaccurate.

But that's not the whole story.

As you can see from Figure 4, although the index statistics (the cardinality column) are not exact, they are roughly in the right ballpark, so the wrong index choice must have other causes.

In fact, index statistics are only one input. For a specific statement, the optimizer must also estimate how many rows executing that statement itself will scan.

Next, let's look at the optimizer's estimated scan-row counts for these two statements.


Figure 5: Unexpected explain results

The rows field indicates the expected number of rows to scan.

Q1's result is in line with expectations: its rows value is 104620. But Q2's rows value is 37116 — a large deviation. Meanwhile, the explain in Figure 1 showed rows of only 10001. It is this deviation that misled the optimizer's judgment.

At this point, your first question is probably not why the estimate is inaccurate, but why the optimizer chose an execution plan that scans 100,000 rows rather than the plan estimated at 37,000 rows.

This is because using index a means that for each value fetched from index a, MySQL must go back to the primary key index to fetch the whole row — a cost the optimizer must also account for.

Whereas the 100,000-row plan scans the primary key index directly, with no such extra cost.

The optimizer estimates the cost of both options and, judging from the result, concludes that scanning the primary key index directly is faster. Of course, in terms of actual execution time, this choice is not optimal.
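A toy cost comparison shows how an inflated row estimate flips the decision — illustrative Python with invented cost weights; MySQL's real cost model is page- and I/O-based, so these constants are assumptions, not its actual values:

```python
def cost_secondary_index(rows_estimated, row_cost=1.0, back_to_table_cost=2.0):
    # Every row found via index a also needs a primary-key lookup
    # ("back to the table"), modeled here as an extra per-row cost.
    return rows_estimated * (row_cost + back_to_table_cost)

def cost_full_scan(total_rows_estimated, row_cost=1.0):
    # Scanning the primary key index directly has no back-to-table cost.
    return total_rows_estimated * row_cost

# Accurate estimate (10,001 rows, as in Figure 1): index a wins.
accurate = cost_secondary_index(10001) < cost_full_scan(100000)

# Inflated estimate (37,116 rows, as in Figure 5): the full scan now
# looks cheaper, so the optimizer abandons index a.
inflated = cost_secondary_index(37116) < cost_full_scan(104620)

print(accurate, inflated)  # True False
```

With the hypothetical weights above, 37,116 indexed rows cost more than a 104,620-row sequential scan, while 10,001 indexed rows cost less — the same flip the optimizer made.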

Including the back-to-table cost when using a secondary index is necessary; the explain in Figure 1 considered that cost too, and the choice in Figure 1 was correct. In other words, the costing strategy itself is not the problem.

So, to assign blame where it's due: MySQL chose the wrong index because it failed to estimate the number of rows to scan accurately. As for why the estimate went wrong, I'll leave that as a homework question for you to analyze.

Since the statistics are inaccurate, let's correct them. The analyze table t command recomputes the index statistics. Let's look at the result.

Figure 6: explain results after running analyze table t

This time the result is correct.

So in practice, if you find that explain's rows estimate differs markedly from reality, you can use this method to fix it.

In fact, if the only problem were inaccurate index statistics, the analyze command would solve most of it. But as we said earlier, the optimizer does not look at the number of scanned rows alone.

Still using table t, let's look at another statement:

mysql> select * from t where (a between 1 and 1000)  and (b between 50000 and 100000) order by b limit 1;

From the conditions, this query matches no records and will therefore return an empty set.

Before running this statement, take a moment to guess: if an index is used, which one will it be?

To analyze this, let's first look at a diagram of the a and b indexes.


Figure 7: Structure of indexes a and b

If index a is used, the first 1000 values of index a are scanned; for each, the corresponding id is fetched, the full row is read from the primary key index, and the row is then filtered on field b. Clearly this scans 1000 rows.

If index b is used, the last 50001 values of index b are scanned; the process is otherwise the same — each value requires a lookup on the primary key index followed by a check of the condition — so 50001 rows must be scanned.
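We can verify both scan counts directly on the simulated data — plain Python over the 100,000 rows, a back-of-the-envelope sanity check rather than how MySQL executes the query:

```python
# Simulated table t: id = a = b = 1..100000
rows = [(i, i, i) for i in range(1, 100001)]

# Plan 1: range scan on index a for 1 <= a <= 1000, then filter on b.
scanned_via_a = [r for r in rows if 1 <= r[1] <= 1000]
matches_a = [r for r in scanned_via_a if 50000 <= r[2] <= 100000]

# Plan 2: scan index b for 50000 <= b <= 100000, then filter on a.
scanned_via_b = [r for r in rows if 50000 <= r[2] <= 100000]
matches_b = [r for r in scanned_via_b if 1 <= r[1] <= 1000]

print(len(scanned_via_a), len(scanned_via_b))  # 1000 vs 50001 rows scanned
print(len(matches_a), len(matches_b))          # both 0: the result set is empty
```

Both plans return the same (empty) result, but the work they do differs by a factor of 50.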

So you'd expect that using index a would obviously be much faster. Let's see whether that's actually the case.

Figure 8 shows the explain result.

mysql> explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;


Figure 8: Using explain to view execution plan 2

As you can see, the key field shows that the optimizer chose index b, and the rows field shows that it expects to scan 50198 rows.

From this result, you can draw two conclusions:

  1. The estimated number of rows to scan is still inaccurate;
  2. MySQL again chose the wrong index in this case.

Index selection anomalies and how to handle them

Most of the time the optimizer finds the correct index, but occasionally you'll hit cases like our two examples above: a SQL statement that could run quickly runs much more slowly than you expected. What should you do then?

One method, as in our first example, is to use force index to forcibly choose an index. MySQL parses the statement to determine which indexes are candidates, then estimates how many rows each candidate would need to scan. If the index specified by force index is in the candidate list, it is selected directly, without evaluating the execution cost of the other indexes.

Let's apply this to the second example. Our earlier analysis concluded that index a is the better choice. Here is the result:


Figure 9: Execution time of the statement with different indexes

As you can see, the original statement took 2.23 seconds, while the force index(a) version took only 0.05 seconds — over 40 times faster than the optimizer's choice.

That is, when the optimizer doesn't choose the correct index, force index serves as a "correction".

However, many programmers dislike force index: first, it isn't elegant to write; second, if the index is ever renamed, the statement must change too, which is troublesome. And if you later migrate to another database, the syntax may not be compatible.

But the main problem with force index is actually timeliness. Because wrong index choices are rare, developers usually don't write force index up front. Only after something goes wrong in production do you go back, modify the SQL statement, and add force index — then test and redeploy. For a production system, this process is anything but quick.

So this kind of problem is best solved inside the database itself. How can that be done?

Since the optimizer abandoned index a, index a must not have looked suitable enough. So the second method is to modify the statement to guide MySQL toward the index we expect. In this example, changing "order by b limit 1" to "order by b,a limit 1" keeps the logic semantically equivalent.

Let's look at the effect after the change:

Figure 10: Execution result of order by b,a limit 1

Previously, the optimizer chose index b because using it avoided sorting (b is itself an index and is already ordered, so choosing index b means no sort is needed — just a traversal). So even though it scans more rows, it was judged to cost less.

With order by b,a, sorting on both b and a means that either index would require a sort. The number of rows scanned therefore becomes the dominant factor, so this time the optimizer chooses index a, which only needs to scan 1000 rows.

Of course, this modification isn't a general-purpose optimization; it works only because the statement happens to have limit 1. Whether or not any record matches, order by b limit 1 and order by b,a limit 1 both return the row with the smallest b, so they are logically consistent and the rewrite is safe.
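The claimed equivalence is easy to check in miniature — a Python stand-in for the SQL semantics, using a few hypothetical rows with distinct b values:

```python
# Hypothetical (id, a, b) rows; b values are distinct, as in table t.
rows = [(1, 3, 7), (2, 5, 2), (3, 4, 4), (4, 9, 9)]

by_b = sorted(rows, key=lambda r: r[2])[:1]            # ORDER BY b LIMIT 1
by_b_a = sorted(rows, key=lambda r: (r[2], r[1]))[:1]  # ORDER BY b,a LIMIT 1

# Both return the row with the smallest b, so the rewrite is safe here.
print(by_b == by_b_a)  # True
```

Note that the equivalence relies on limit 1 (and, for identical physical rows, on b values being unique, as they are in table t); with a larger limit the two orderings could return rows in different orders.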

If you feel that tweaking the semantics like this is not quite right, there's another rewrite, whose effect is shown in Figure 11:

mysql> select * from  (select * from t where (a between 1 and 1000)  and (b between 50000 and 100000) order by b limit 100)alias limit 1;


Figure 11: explain of the rewritten SQL

In this example, limit 100 makes the optimizer realize that the cost of using index b is high. In effect, we are nudging the optimizer based on what we know about the data, which isn't a generally applicable technique.

The third method: in some scenarios, we can create a more suitable index for the optimizer to choose, or delete the index it misuses.

In this example, though, I couldn't find a new index that would change the optimizer's behavior. Such situations are in fact rare — especially after a DBA has already optimized the schema's indexes, finding a better index to add for this kind of bug is generally difficult.

If I suggest that one option is to delete index b, you might find that absurd. But I've actually come across two such cases: after the DBA discussed things with the business developers, they found that the index the optimizer kept mischoosing wasn't actually needed at all, so they deleted it, and the optimizer then picked the correct index.

Summary

Today we talked about the index statistics update mechanism and the possibility of the optimizer choosing the wrong index.

For problems caused by inaccurate index statistics, you can use analyze table to fix them.

For other optimizer misjudgments, you can force the choice with force index on the application side, guide the optimizer by modifying the statement, or add or delete an index to work around the problem.

You might say that the later examples in this article don't dig into the underlying principles. What I want to tell you is that for today's topic we are dealing with MySQL bugs, and drilling into the code line by line to quantify each one is not something we should do here.

So instead, I've shared the solutions I've actually used, in the hope that when you encounter a similar situation, you'll have some ideas to draw on.

If you have other ways of dealing with optimizer misjudgments in MySQL, please share them in the comments section.

Finally, a question to think about. In the first example, we used session A to cooperate, having session B delete the data and then re-insert it, after which the rows field in explain went from 10001 to more than 37000.

If instead, without session A, you separately execute delete from t, call idata(), and explain, you'll see that the rows field stays at about 10000. You can verify this result yourself.

What causes this? Analyze it yourself.

Write your conclusions in the comments section; I'll discuss this question with you at the end of the next article. Thanks for reading, and feel free to share this article with friends.

Answer to the previous question

The question I left at the end of the last article was: if a write uses the change buffer mechanism and the host then restarts abnormally, will the change buffer and its data be lost?

The answer is that they will not be lost, and many readers in the comments answered correctly. Although only memory is updated, when the transaction commits, the change buffer operation is also recorded in the redo log, so during crash recovery the change buffer can be restored.

Some readers asked whether the merge process writes data directly to disk — a good question. Let me analyze it for you.

The merge execution flow is:

  • Read the data page from disk into memory (the older version of the page);
  • Find the change buffer records for this page (there may be more than one) and apply them in order, obtaining the new version of the page;
  • Write the redo log. This redo log covers both the data change and the change to the change buffer.

At this point the merge is done. The data page in memory and the change buffer's corresponding location on disk have not yet been modified; both are dirty pages, and flushing them back to physical storage later is a separate process.
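The three steps above can be sketched as a toy simulation — illustrative Python, not InnoDB internals; the page and log structures are invented for clarity:

```python
# Toy model: a "page" is a dict of row_id -> value.
disk_page = {1: "old", 2: "old"}                 # older version on disk
change_buffer = [(1, "new-a"), (2, "new-b")]     # buffered changes, in order
redo_log = []

# Step 1: read the data page from disk into memory.
mem_page = dict(disk_page)

# Step 2: apply the change buffer records in order to get the new page.
for row_id, value in change_buffer:
    mem_page[row_id] = value

# Step 3: write redo covering both the data change and the change-buffer change.
redo_log.append({"page_after": dict(mem_page), "change_buffer_purged": True})

# Merge is done: the in-memory page is new, but the disk page is still the
# old version (a dirty page, flushed back later by a separate process).
print(mem_page, disk_page)
```

The key observation the simulation preserves: after the merge, nothing on "disk" has changed yet; only memory and the redo log have.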

Reproduced from: https://juejin.im/post/5d033c0cf265da1b855c5280

Origin: blog.csdn.net/weixin_34375251/article/details/93183469