What exactly does the SQL optimizer do for you?

In the previous article, "DB - How to Read and Store Data", we touched on the work of the SQL optimizer.

One of the great advantages of relational databases is that users do not need to care how the data is accessed, because the optimizer decides that for us. When optimizing SQL queries, however, we do have to pay attention to it, because it determines query performance.

Experienced programmers are familiar with some SQL optimization rules, such as the leftmost matching principle and avoiding non-BT predicates. But how does the optimizer apply these rules? Why must the match start from the leftmost column, and what is that principle based on?

In this article, we will use some examples to analyze what the optimizer does, so that we can better optimize SQL queries.

In this article you will learn:

  1. What the access path of an SQL statement is

  2. How the optimizer determines the optimal access path

  3. What the leftmost matching principle is based on

  4. How to estimate the number of rows an SQL statement will hit

 

Example table:

CREATE TABLE test (
  id int(11) NOT NULL AUTO_INCREMENT,
  user_name varchar(100) DEFAULT NULL,
  sex int(11) DEFAULT NULL,
  age int(11) DEFAULT NULL,
  c_date datetime DEFAULT NULL,
  PRIMARY KEY (id),
  # index
  KEY id_name_sex (id,user_name,sex),
  KEY name_sex_age (user_name,sex,age)
) ENGINE=InnoDB AUTO_INCREMENT=12 DEFAULT CHARSET=utf8;

 

1. Access path

Before an SQL statement can actually be executed, the optimizer must first determine how to access the data: which index to use, how to scan that index, whether additional random reads of the table are required, and so on.

From the SQL text, through the optimizer's decisions and the storage engine's lookup, down to the pages where the data is stored: determining this whole route is determining the access path.

2. Predicate

A predicate is a search condition in the WHERE clause; a WHERE clause contains one or more of them. Predicate expressions are the main starting point for index design. If an index can satisfy all predicate expressions of a SELECT statement, the optimizer may be able to build an efficient access path.

select * from test where id = 1 and user_name like 'test%';

In the query above, the conditions on id and user_name after WHERE are the predicates.

 

3. Index slice

An index slice is the range of index entries delimited by the predicate expressions. The cost of an access path depends largely on the thickness of this slice.

The thicker the index slice, the more index pages have to be scanned and the more index records have to be processed. The biggest overhead, though, is the synchronous reads that must be performed on the table. Conversely, a thinner index slice significantly reduces the cost of index access and requires fewer synchronous reads on the table.

A synchronous read is a random I/O operation, and a single read takes about 10 ms. We explained this in the previous article.

 

For example:

-- sql1: matches 5 rows
select * from test where sex = 1;
-- sql2: matches 2 rows
select * from test where sex = 1 and age < 10;
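
As a rough illustration of why slice thickness matters, the sketch below estimates the time spent on synchronous table reads, assuming one random read of about 10 ms for every row in the slice (the figure quoted above). The function name and the one-read-per-row model are assumptions made for illustration only; a real engine may find some pages already cached.

```python
RANDOM_READ_MS = 10  # approximate cost of one synchronous (random) read


def estimated_table_read_ms(rows_in_slice: int) -> int:
    """Rough cost of fetching every table row that the index slice points to."""
    return rows_in_slice * RANDOM_READ_MS


# The two example queries: a thicker slice (5 rows) and a thinner one (2 rows).
sql1_ms = estimated_table_read_ms(5)
sql2_ms = estimated_table_read_ms(2)
print(sql1_ms, sql2_ms)
```

Under this simple model the cost grows linearly with the number of rows in the slice, which is why an extra predicate that thins the slice pays off directly.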

 

Therefore, the thickness of the index slice is determined by the predicates: the smaller the range they select, the thinner the slice. For this to work, a predicate must be able to match the index. So what are the matching rules?

 

4. Matching columns & filter columns

Not every predicate can match the index. An index column whose predicate can be matched is called a matching column, and it takes part in defining the index slice.

Matching columns define the index slice and filter columns filter the entries inside it; other columns can do neither.

Here is how predicate matching is defined:

Check the index columns in order, from first to last, applying the following rules:

  1. Does the WHERE clause contain at least one sufficiently simple predicate for this column? If so, the column is a matching column. If not, this column and all index columns after it are non-matching columns.

  2. Is the predicate a range predicate? If so, all remaining index columns are non-matching columns.

  3. An index column that comes after the last matching column is a filter column if there is a sufficiently simple predicate for it.

 

1. Example

select * from test where user_name='test1' and sex>0 and age =10

Checking index id_name_sex

  1. Check its index columns (id, user_name, sex) one by one

  2. First check id: no predicate in the WHERE clause corresponds to it, so this column and all index columns after it are non-matching columns

  3. Matching for index id_name_sex ends with no matching columns

Checking index name_sex_age

  1. Check its index columns (user_name, sex, age) one by one

  2. First check user_name: the WHERE clause has a predicate on user_name, so this column is a matching column

  3. Next check sex: the WHERE clause has a predicate on sex, so this column is a matching column. Because the predicate on sex is a range predicate, the remaining index columns are non-matching columns

  4. The index column age comes after the last matching column, sex, and there is a predicate on age, so this column is a filter column

 

With this example, we finally determined:

  • Matching index: name_sex_age

  • Matching columns: user_name, sex

  • Filter column: age
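
The walk-through above can be sketched in a few lines of Python. This is a minimal sketch of the three rules, not the optimizer's actual implementation: the function name classify_index_columns and the "eq" / "range" labels are hypothetical, and every predicate passed in is assumed to be simple enough to match.

```python
def classify_index_columns(index_columns, predicates):
    """Split index columns into matching columns and filter columns.

    predicates maps a column name to "eq" (equality) or "range".
    Only simple predicates are assumed to appear in the mapping.
    """
    matching, filtering = [], []
    still_matching = True
    for column in index_columns:
        kind = predicates.get(column)
        if still_matching:
            if kind is None:
                # Rule 1: no predicate, so matching stops at this column.
                still_matching = False
            else:
                matching.append(column)
                if kind == "range":
                    # Rule 2: a range predicate is the last matching column.
                    still_matching = False
        elif kind is not None:
            # Rule 3: later columns with a predicate only filter the slice.
            filtering.append(column)
    return matching, filtering


# Predicates of the example query: user_name = ?, sex > ?, age = ?
where = {"user_name": "eq", "sex": "range", "age": "eq"}
print(classify_index_columns(["user_name", "sex", "age"], where))
print(classify_index_columns(["id", "user_name", "sex"], where))
```

For name_sex_age the sketch reports user_name and sex as matching columns and age as a filter column; for id_name_sex it reports no matching columns. Because the predicates are held in a mapping, the order in which they are written plays no role in the result.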

 

Running EXPLAIN on this query gives a result consistent with our analysis.

 

2. What matching columns are for

Once the matching columns are determined, we know which index the current query uses and which columns of that index are matched. The range of data to be accessed can then be narrowed in advance, which reduces the amount of data that has to be read.

In a query with matching columns, the conditions are applied before the rows are read. In a query that matches no index, the conditions are applied afterwards: the whole table is scanned and the results are then filtered, so the disk I/O load is much higher.

The "leftmost matching" principle is also based on the matching-column rules. Besides the structure of the B-tree itself, there is another important reason why matching is leftmost: when matching columns are determined, the index columns are checked in order from first to last.

So whether an index can be matched does not depend on the order of the predicates after WHERE; what matters is the order of the index columns.

 

For example:

select * from test where user_name='test1' and sex>0 and age =10
select * from test where sex>0 and user_name='test1' and age =10
select * from test where age =10 and user_name='test1' and sex>0

All three queries can match the name_sex_age index.

3. Complex Predicates

LIKE predicate

If the pattern is %xx, a full index scan is chosen and the predicate takes no part in index matching. If it is xx%, the predicate takes part in index matching and an index slice scan is chosen.
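
To see why the position of the wildcard matters, the sketch below uses a sorted Python list as a stand-in for the ordered user_name entries of an index. The sample keys and the helper names prefix_slice and suffix_scan are made up for illustration; this models the idea only and is not how any particular engine is implemented.

```python
from bisect import bisect_left

# A sorted list stands in for the ordered user_name entries of an index.
index_keys = sorted(["alice", "bob", "test1", "test2", "tester", "tom", "zed"])


def prefix_slice(keys, prefix):
    """LIKE 'prefix%': two binary searches bound one contiguous slice."""
    low = bisect_left(keys, prefix)
    # prefix + the largest code point sorts after ordinary keys with this prefix.
    high = bisect_left(keys, prefix + chr(0x10FFFF))
    return keys[low:high]


def suffix_scan(keys, suffix):
    """LIKE '%suffix': matches are scattered, so every key must be checked."""
    return [key for key in keys if key.endswith(suffix)]


print(prefix_slice(index_keys, "test"))
print(suffix_scan(index_keys, "er"))
```

With a fixed prefix, all matches sit next to each other in sorted order, so a slice can be delimited. With a leading wildcard there is no such range and the whole list has to be walked.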

OR operator

Even a simple predicate is extremely difficult for the optimizer if it is combined with other predicates by OR. Unless the access path is multi-index access, such a predicate cannot take part in defining an index slice, so try to avoid OR.

If a predicate evaluates to false for a row and the row still cannot be excluded without checking the other predicates, that predicate is very difficult for the optimizer.

BT predicate

If the WHERE clause uses only the AND operator, all of its simple predicates are BT (Boolean term) predicates, that is, good predicates. Unless the access path is a multi-index scan, only BT predicates can take part in defining an index slice.
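
The property described above (a false predicate is enough to reject the row) can be checked by brute force over a truth table. The sketch below assumes each predicate is reduced to a plain true/false value; is_boolean_term and the predicate names a, b, c are hypothetical and only serve the illustration.

```python
from itertools import product


def is_boolean_term(where, names, target):
    """Return True if `target` being false always makes the whole WHERE false."""
    for values in product([False, True], repeat=len(names)):
        row = dict(zip(names, values))
        if not row[target] and where(row):
            # The row survives although `target` is false: not a BT predicate.
            return False
    return True


names = ["a", "b", "c"]
and_only = lambda r: r["a"] and r["b"] and r["c"]   # a AND b AND c
with_or = lambda r: r["a"] and (r["b"] or r["c"])   # a AND (b OR c)

print(is_boolean_term(and_only, names, "b"))
print(is_boolean_term(with_or, names, "a"))
print(is_boolean_term(with_or, names, "b"))
```

In the AND-only clause every predicate passes the check. In the clause containing OR, only a passes, because a row with b false can still qualify through c.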

Predicate with an undetermined value

For example, the predicate value is produced by a function or takes part in a calculation. With static SQL binding, the optimizer then has to re-evaluate the selection on every execution. The result cannot be cached, it consumes a lot of CPU, and the predicate cannot take part in index column matching.

 

5. Filter factor

The matching columns determine which index columns are used, but the thickness of the index slice (that is, how many rows are expected to be accessed) has not been estimated yet. This is determined by the filter factor.

The filter factor describes the selectivity of a predicate, that is, the proportion of rows in the table that satisfy the predicate condition. It depends on the distribution of the column values.

 

1. Filter factor for a single predicate

For example, suppose our test table has 10,000 records and the predicate on user_name matches an index column. With 500 distinct values in user_name, its filter factor is 0.2% (1 / number of distinct user_name values), which means the query result is expected to contain about 20 rows.

 

select * from test where user_name='test'
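
Because the filter factor depends on how the column values are distributed, the average figure (1 / number of distinct values) can differ a lot from the filter factor of one specific value. The sketch below shows both on a small, made-up skewed column; the function name filter_factors and the sample data are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction


def filter_factors(values):
    """Return (average FF, per-value FF) for an equality predicate on a column."""
    counts = Counter(values)
    total = len(values)
    average = Fraction(1, len(counts))  # 1 / number of distinct values
    per_value = {value: Fraction(n, total) for value, n in counts.items()}
    return average, per_value


# The article's numbers: 10,000 rows and 500 distinct user_name values.
expected_rows = 10_000 * Fraction(1, 500)

# Hypothetical skewed column: 10 rows, 3 distinct names.
column = ["test"] * 6 + ["alice"] * 3 + ["bob"]
average, per_value = filter_factors(column)
print(expected_rows, average, per_value["test"], per_value["bob"])
```

The average filter factor is the right estimate only when values are spread evenly; for a skewed column, a frequent value such as "test" here selects far more rows than the average suggests.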

2. Filter factor of combined predicates

When several predicates match the matching columns, the combined filter factor can be derived from the filter factors of the individual predicates. The general formula, which assumes the predicates are independent of each other, is:

combined filter factor = filter factor of predicate 1 * filter factor of predicate 2 * ...

For example, take the following query:

select * from test where user_name='test' and sex=1 and age =10

It contains 3 predicates, on user_name, sex and age, where user_name has 500 distinct values, sex has 2 distinct values, and age has 40 distinct values.

Then the filter factor for each predicate is:

FF(user_name) = 1/500 = 0.2%

FF(sex) = 1/2 = 50%

FF(age) = 1/40 = 2.5%

Combined filter factor = 0.2% * 50% * 2.5% = 0.0025%

 

With the combined filter factor, the expected size of the result set is 10000 * 0.0025% = 0.25, which rounds up to about 1 row.
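
The arithmetic above can be reproduced with exact fractions. This is a minimal sketch that assumes equality predicates on evenly distributed, independent columns; combined_filter_factor is a hypothetical helper name.

```python
from fractions import Fraction
from math import ceil, prod


def combined_filter_factor(distinct_counts):
    """Multiply 1/NDV per equality predicate (independent, uniform columns assumed)."""
    return prod(Fraction(1, n) for n in distinct_counts)


table_rows = 10_000
ff = combined_filter_factor([500, 2, 40])  # user_name, sex, age
expected_rows = table_rows * ff
print(ff, expected_rows, ceil(expected_rows))
```

Here ff is the fraction 1/40000 (that is, 0.0025%), and the expected row count of 1/4 is rounded up to a single row.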

 

This estimate shows that the query ultimately needs to fetch only about one row, which greatly reduces the disk access the database has to perform.

This is why filter factor estimation matters: the optimizer relies on it when it evaluates the candidate access paths.

 

6. Sort

Materializing a result set means building it by performing the necessary database accesses. At best only one record has to be returned; at worst many records have to be returned, which requires a lot of disk reads. Sorting is one of the causes of the worst case.

In the following cases only one record needs to be materialized per fetch call; otherwise the entire result set has to be materialized so that it can be sorted.

  • There is no sorting requirement, such as order by or group by.

  • Sorting is required, but the following two conditions are both met:

  1. An index exists that satisfies the sorting requirement of the result set, such as (id_name_sex) or (name_sex_age) above.

  2. The optimizer decides to use this index in the traditional way: it accesses the first qualifying index row and reads the corresponding table row, then accesses the second qualifying index row and reads its table row, and so on.

For example, with the index (name_sex_age) and the query select * from test where user_name='test' order by sex, the index entries for that user_name are already ordered by sex.
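
The name_sex_age case can be modelled with a sorted list of tuples. The sketch below uses made-up (user_name, sex, age) entries and a hypothetical helper rows_for_user; it only illustrates the ordering property and does not model how an engine reads pages.

```python
# Hypothetical (user_name, sex, age) entries, kept in index order by sorting the tuples.
index_entries = sorted([
    ("test", 1, 30), ("alice", 0, 25), ("test", 0, 41),
    ("bob", 1, 19), ("test", 1, 12), ("test", 0, 8),
])


def rows_for_user(entries, user_name):
    """Read the entries for user_name = ? in index order, without sorting them."""
    return [entry for entry in entries if entry[0] == user_name]


slice_rows = rows_for_user(index_entries, "test")
sex_values = [sex for _, sex, _ in slice_rows]
already_sorted = sex_values == sorted(sex_values)
print(slice_rows, already_sorted)
```

Since all entries in the slice share the same user_name, the next index column (sex) decides their order, so rows can be handed back one at a time without materializing and sorting the whole result set.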

 

7. Finally

The SQL optimizer does much more than what we covered here, but estimating the size of the index slice and determining the access path are its most important tasks. We will continue the topic in later articles.

 
