《Pro Oracle SQL》CHAPTER2--2.11 Determining the Execution Plan

Determining the Execution Plan    确定执行计划 (page 74)
    When a hard parse occurs, Oracle will determine which execution plan is best for the query. An
execution plan is simply the set of steps that Oracle will take to access the objects used by your query
and return the data that satisfies your query’s question. In order to determine the plan, Oracle will
gather and use lots of information, as you’ve already seen. One of the key pieces of information that
Oracle will use to determine the plan is statistics. Statistics can be gathered on objects, such as tables
and indexes; system statistics can be gathered as well. System statistics provide Oracle data about
average speeds for block reads and much more. All this information is used to help Oracle review
different scenarios for how a query could execute and to determine which of these scenarios is likely to
result in the best performance.
    当硬解析一发生，Oracle就会确定哪一个执行计划最适合于查询。执行计划就是Oracle用于访问你的查询所用对象的一组步骤，返回的数据满足你查询的要求。为了确定计划，如你所见，Oracle要收集和使用很多信息。Oracle用于确定计划的一部分关键信息是统计信息。统计信息能从对象上获取，如表和索引；系统统计信息也能获取到。系统统计信息提供关于块读取平均速度等的Oracle数据。所有的这些信息用于帮助Oracle检查查询如何执行的不同场景和确定这些场景中哪一个最有可能产生最佳性能。
    Understanding how Oracle determines execution plans will not only help you write better SQL but
will help you to understand how and why performance is affected by certain execution plan choices.
After Oracle verifies the syntax and permissions for a SQL statement, it uses the statisics information it
collects from the data dictionary to compute a cost for each operation and combination of operations
that could be used in order to get the result set your query needs. Cost is an internal value Oracle uses to compare different plan operations for the same query to each other with the lowest costed option being considered best. For example, a statement could be executed using a full table scan or an index. Using the statistics, parameters, and other information, Oracle determines which method will result in the
fastest execution time.
    理解Oracle如何确定执行计划不仅有助于你写出更佳的SQL而且有助于你理解如何和为什么性能会因某些执行计划的选择而受影响。在Oracle确认SQL语句的句法和性能后，它使用从数据字典收集的统计信息计算每步操作、以及可被使用的为了获得你查询所需的结果集的操作组合的成本。“成本”是 Oracle用于比较的一内在值，对相同的查询，不同的计划操作相互之间具有最低成本项的被认为最好。例如，一条语句能用全表扫描或者索引执行。使用统计信息，参数，和其他信息，Oracle确定哪一个方法将产生最快的执行时间。
    Since Oracle’s main goal in determining an execution plan is to choose a set of operations that will
result in the fastest response time possible for the SQL statement being parsed, the more accurate
statistics are, the more likely Oracle will be to compute the best execution plan. In the chapters ahead,
I’ll review details about the various access methods and join methods available and how to review
execution plans in detail. For now, I want to make sure you understand what statistics are, why they’re
important, and how to review them for yourself.
    由于Oracle的首要目标是确定一个执行计划，选择一组操作尽可能产生最快的响应时间的执行计划，对被解析的SQL语句(而言)，越是准确的统计信息，越有可能Oracle计算出最好的执行计划。在接下来章节中，我将详细论述各种有效的访问方法和连接方法以及如何检查执行计划。现在，我想确保你理解统计信息是什么，为什么他们重要，和你自己如何查看它们。
    The optimizer is the code path within the Oracle kernel that is responsible for determining the optimal execution plan for a query. So, when I talk about statistics, I’m talking about how the optimizer uses statistics. I use the script named st-all.sql to display statistics for the employees table, as shown in Listing 2-11. I’ll use this information to discuss how statistics are used by the optimizer.
    优化器是位于Oracle内核的代码路径，负责确定查询的优化执行计划。如此，当我说到统计信息，我说的就是优化器如何使用统计信息。我使用名称为 st-all.sql的脚本显示employees表的统计信息，如图列表2-11所示。我使用这些信息讨论优化器如何使用统计信息。
Listing 2-11. Statistics for the Employees Table
SQL> @st-all
Enter the owner name: hr
Enter the table name: employees
============================================================================================
TABLE STATISTICS
============================================================================================
Owner         : hr
Table name    : employees
Tablespace    : example
Partitioned   : no
Last analyzed : 06/21/2010 17:27:14
Degree        : 1
# Rows        : 107
# Blocks      : 5
Empty Blocks : 0
Avg Space     : 0
Avg Row Length: 68
Monitoring?   : yes
Status        : valid
============================================================================================
COLUMN STATISTICS
============================================================================================
Name                  Null?      NDV      # Nulls      # Buckets      AvgLen Lo-Hi Values
============================================================================================
commission_pct       Y        7          72         1                 2      .1 | .4
department_id         Y        11        1          11                3     10 | 110
email                      N       107       0               1         8           ABANDA | WTAYLOR
employee_id            N       107        0              1                 4     100 | 206
first_name              Y        91    0              1                 7           Adam | Winston
hire_date               N        98         0               1         8     06/17/1987 | 04/21/2000
job_id                    N        19         0             19           9       AC_ACCOUNT | ST_MAN
last_name              N        102        0              1                8      Abel | Zlotkey
manager_id             Y        18         1          18        4     100 | 205
phone_number        Y        107    0          1         15    011.44.1343.329268 | 650.509.4876
salary                    Y        57         0              1                 4      2100 | 24000
============================================================================================
INDEX INFORMATION
============================================================================================

Index Name                  BLevel Leaf Blks #Rows Dist Keys LB/Key DB/Key ClustFactor Uniq?
-----------------         ------ --------- ----- --------- ------ ------ ----------- -----
EMP_DEPARTMENT_IX      0         1         106        11        1      1           7            NO
EMP_EMAIL_UK             0         1         107        107      1      1          19            YES
EMP_EMP_ID_PK            0         1        107        107      1      1           2             YES
EMP_JOB_IX               0         1        107         19     1      1           8             NO
EMP_MANAGER_IX           0         1         106        18     1      1           7 NO
EMP_NAME_IX           0         1         107       107      1      1          15    NO

Index Name               Pos#         Order   Column Name
------------------ ----------      -----   ------------------------------
emp_department_ix           1        ASC   department_id

emp_email_uk                 1          ASC   email

emp_emp_id_pk               1          ASC   employee_id

emp_job_ix                   1           ASC    job_id

emp_manager_ix             1          ASC   manager_id

emp_name_ix                 1           ASC   last_name

    The first set of statistics shown in Listing 2-11 is table statistics. These values can be queried from
the all_tables view (or dba_tables or user_tables as well). The next section lists column statistics and
can be queried from the all_tab_cols view. The final section lists index statistics and can be queried from the all_indexes and all_ind_columns views.
    列表2-11中的统计信息的第一组是表统计。这些值能通过查询all_tabel视图(或者dba_tables或者user_tables)。下一段列出列统计信息，能从all_tab_cols视图中查询到。最后一段列出索引统计信息，能从al_indexes和all_ind_columns视图中查询到。
    Just like statistics in baseball, the statistics the optimizer uses are intended to be predictive. For
example, if a baseball player has a batting average of .333, you’d expect that he’d get a hit about 1 out of every 3 times. That won’t always be true, but it is an indicator that most people rely on. Likewise, the optimizer relies on the num_distinct column statistic to compute how frequently values within a column will occur. By default, the assumption is that any value will occur in the same proportion as any other value. If you were looking at the num_distinct statistic for a column named color and it was set to 10, it means that the optimizer is going to expect there to be 10 possible colors and that each color would be present in one tenth of the total rows of the table.
     正如棒球中的统计那样，优化器使用的统计信息意图在于预测。例如，如果一棒球手的击球率是.333,你可能预期它能每三次击中一次。这并不总是对的，但是它是一个大多数人信赖的指示器。同样的，优化器依赖于num_distinct列统计计算在一列中发生的频率值。默认情况下，假设任意值的出现(几率)等同比率于其他值。如果你看到num_distinct对列名为color的统计值设定的是10，它意味着优化器将期望，有10种可能的颜色且每种颜色将占有表的总行数的十分之一。
    So, let’s say that the optimizer was parsing the following query:    我们看优化器在解析如下查询：
select * from widgets where color = ‘BLUE’

    The optimizer could choose to read the entire table ( TABLE ACCESS FULL operation) or it could
choose to use an index ( TABLE ACCESS BY INDEX ROWID ). But how does it decide which one is best?   It uses statistics. I’m just going to use two statistics for this example. I’ll use the statistic that indicates the number of rows in the widgets table (num_rows = 1000 ) and the statistic that indicates how many distinct values are in the color column ( num_distinct = 10 ). The math is quite simple in this case:
     优化器可能选择读整个表(TABLE ACCESS FULL 全表扫描操作)或者它选择使用一索引(TABLE ACCESS BY INDEX ROWID通过索引ROWID访问表)。但是它如何决定哪一个最好？它使用统计信息。本例中我将使用两个统计信息。我使用统计信息指出样例表中的行数 ( num_rows = 1000 ) 和用统计信息指出在color列中有多少相异的值( num_distinct = 10 )。在这个例子中数学相当简单：
     Number of rows query should return = (1 / num_distinct) x num_rows
   = (1 / 10) x 1000
   = 100
    If you think about it for a second, it makes perfect sense. If there are 1000 rows in the table and
there are 10 distinct colors present in the table, then if your query only wants rows where the color is
blue, you’ll be asking for only one tenth of the data, or 100 rows. This computed value is called selectivity. By dividing the number of distinct values into 1, you will determine how selectivity any single
value is. Easy, right?
    如果你思考一会，它挺符合逻辑。如果在表中有1000行且表中有10种颜色。即使你的查询只想要颜色为blue的行，你将仅请求十分之一的数据，或者 100行。这个计算的值称之为选择性。用1除以distinct值的数量，你将确定任意单个值的选择性是多少。是不是很容易？
    Well, as you can imagine, the computations do get more complex, but I hope this simple example
helps you see how the optimizer is doing nothing more than some fairly straightforward calculations.
No rocket science…really! But, as you can see, even something so simple can be dramatically affected if
the values used aren’t accurate.
    那么你能想象，实际计算更复杂，但是我希望这个简单的例子帮助你理解优化器实际只是做一些相当直接的计算。不是火箭科学...对吧。但是，你将见到，如果所用的值不准确，即使某些如此简单的东西能被显著的受影响。
    What if, at the time the optimizer parsed this query, the statistics were out of date or missing? For
example, let’s say that instead of indicating there were 1000 rows and 10 colors, the statistics showed 100 rows in the table and 1 color. Using these values, the number of rows the query should return would be computed to be 100 ( 1 / 1 x 100 ). The number of rows is the same as our original computation, but is it really the same? No, it’s very different. In the case of the first calcuation, the optimizer would have
assumed 10% of 1000 rows were returned while in the second case the 100 rows returned represent all
the rows in the table (at least according to the statistics). Can you see how this would influence the
optimizer’s decision about what operation to choose to retrieve the data?
    如果在优化器解析这个查询时，统计信息过时或者缺失会是什么样子？例如，我们不说表中有1000行和10种颜色，而是说统计信息显示在表中有100行是1 种颜色。用这些值，查询将返回的行数应该计算成100 ( 1 / 1 x 100 )。(计算的结果)行数等同我们最初的计算，但是它真是一样的么？不，非常的不同。在第一个计算中，优化器设1000行中的10%返回而第二个计算中返回的100行代表在表中的所有行(至少根据统计)。你能看出这将如何影响优化器的选择什么操作来检索数据的判断么？
    Understanding the importance of statistics will help you know how to identify performance
problems that are not necessarily related to the way you wrote the SQL but instead rooted in issues with
the statistics. You could have done everything right, but if the statistics are wrong or inaccurate enough
that they don’t accurately reflect the reality of your data, you need to be able to pinpoint that quickly—
and not spend hours or days trying to fix a code problem that really isn’t a code problem.
    理解统计信息的重要性有助于你知道如何定位性能问题，这与你书写SQL的方式不必然相关而是用统计信息定位问题。你本该把每件事都做对，但是如果统计信息是错误的或者不是足够的准确，它们没有准确的反应出你数据的真实性，你必须能快而准确的定位----不是发数小时或数天试图修复一代码问题而实际它又不是代码问题。
    But just to keep you from getting too happy that you’ve now got a way to point the finger of blame
away from yourself, let me show you an example of how you can write SQL in such a way that the
optimizer can’t use the statistics properly. In this case, you write a very simple query as follows:
    但是不要因为你已经知道一种方法避免自己被指责而太高兴，让我给你展示一个例子，你书写SQL方式如何能使得优化器不能恰当的使用统计信息。在这个例子中，你写一个如下非常简单的SQL：
    select * from car_purchases where manufacturer = ‘Ford’ and make = ‘Focus’

    The query uses a table containing information about car purchases for all American model cars. For
the sake of this example, let’s assert that each make of car is only produced by one manufacturer. That
means that only Ford will have a Focus. So, what’s the problem with the way this query is written? It will certainly return the correct result set, but that’s not the only question that needs to be answered. You also need to determine if the optimizer will be able to accurately understand the data given this query formulation. So, let’s look at the statistics:
    这个查询使用了一张包含全美汽车销售信息的表。在本例中，我们断言每种车只有一个制造商生成。也就是只有Ford生产Focus。那么这样书写的查询有什么问题？它将包含正确的结果集，但是这不是仅有需要回答的问题。你也需要确定是否优化器能准确的理解对这个查询的书写格式的(统计信息)数据。让我们看如下统计信息：
num_rows (car_purchases): 1,000,000
num_distinct (manufacturer): 4
num_distinct (make): 1000

    Since there are two different conditions (or predicates) to apply, you need to first figure out the
selectivities of each one by itself. The selectivity of manufacturer would be 1/4 or .25. The selectivity of make would be 1/1000 or .001. Since the predicates are combined with an AND , the two selectivities will be multiplied together to get the correct overall selectivity for both combined. So, the final selectivity would be .00025 (.25 x .001). That means the optimizer will determine that the query will return 250 rows (.00025 X 1,000,000).
    由于这里应用了2个不同的条件(或谓词)，你需要首先计算出每步自身的选择(性)值。制造商的选择值是1/4或者.25。生产的选择值是1/1000或者.001。由于谓词由AND组合，两个选择值相乘获得整体的两者组合的选择性值。这样，最终的选择值是.00025(.25x.001)。这意味着优化器将能确定查询将返回250行(.00025 X 1,000,000)。
    Remember that I started by asserting that only one manufacturer would produce a certain make of
car. That means that since none of the other three manufacturers could have possibily produced a
Focus, the calculation that includes the selectivity for manufacturer is flawed. The truth is that we know
all Focus model vehicles have to be manufacturered by Ford. Including the condition where manufacturer = ‘Ford’ reduces the overall selectivity by 25%. In this case, the true selectivity should have been only the selectivity for the model column alone. If just that predicate had been written, then the selectivity would have been 1/1000 or .001, and the optimizer would have computed that 1,000 rows would be returned by the query instead of 250. That means the answer the optimizer came up with was “off” by a factor of 4. You may look at the difference between 250 and 1,000 and think “so what’s the big deal; that’s not that far off, is it?” Let’s go back to the baseball example and apply this same logic to see if it stands out more to you. If a player normally has a .333 average and you were to tack on another meaningless condition that would require you to multiply his average by .25 as well, what happens? All of a sudden, the high-paid professional athlete looks like a sandlot wanna-be with an average of .083 (.333 x .25)!
    记住我的前提设定某种车型只有一个制造商生产。意味着其它的三家制造商不可能生产Focus，包含制造商选择值的计算是有缺陷的。我们知道所有的Focus车必须是由Ford生产才是真的。包含的条件manufacturer = ‘Ford’使得整体的选择值降低成(原有的)25%。在这种情况下，正确的选择值应该只是单独model列的选择值。如果只是写了谓词(manufacturer = ‘Ford’)，则选择值应该是 1/1000或.001，优化器应该计算出只返回1000行而不是返回250行。也就是说优化器得出结果“削价了 ”4倍。你可能已经看出1000与250之间的不同了，心想“有什么大不了的；没有差多远，不是么？”让我们回到棒球的例子，应用相同的逻辑看是否它还能让你站的住脚？如果一个棒球手通常有.333的准确率而你附加其他无意义的条件使得你将他的击球率再乘以.25，会发生什么？忽然间，高薪的职业运动员看上去像业余选手，击球率只有.083 (.333 x .25)!
    Numbers can change everything—and not just in baseball. The calculations the optimizer makes will drastically affect the choice of execution plan operations. Those choices can make the difference between response times of a few seconds to response times of several hours. In this particular example,
you get to see what happens when the optimizer doesn’t know something that you do. All the optimizer
can do is to plug in the statistics values and come up with an answer. If you know something about your
data that the optimizer can’t know, make sure you code your SQL accordingly and don’t lead the optimizer astray .

数字可以改变任何事----不只是棒球。优化器做出的计算将极大的影响执行计划操作的选择。这些选择能使得响应时间从几秒到几小时的差别。在这个特例中，你看到了当优化器不知道某些而你却知道的东西会发生什么。所有的优化器所能做的就是查询统计信息值然后得出答案。如果你知道了某些你的数据而优化器却不能知道的东西，务必相应的编写你的SQL而不让优化器迷失方向。

《Pro Oracle SQL》CHAPTER2--2.11 Determining the Execution Plan

猜你喜欢