Predicate pushdown for SQL optimization

1. What is predicate pushdown

Predicate pushdown means moving filter predicates as close to the data source as possible, so that irrelevant data can be skipped during the query. With columnar file formats such as Parquet or ORC, whole blocks of irrelevant data, or even entire files, can be skipped.

2. Predicate pushdown in Hive

Predicate pushdown in Hive works by pushing filter conditions down to the map side, so that filtering happens early and the amount of data shuffled from the map side to the reduce side is reduced, improving overall performance. In short: filter first, then aggregate and do other operations.

-- The relevant configuration option (enabled by default):
set hive.optimize.ppd = true

To summarize:

1. Predicate pushdown filters out large amounts of irrelevant data from big tables at the storage layer, reducing the data that must be scanned. "Pushed down" means the predicate is evaluated on the map side; "not pushed down" means it is evaluated on the reduce side.
2. In an inner join, a predicate gives the same result whether it is written in on or in where.
3. In a left join, a predicate on the left table should be written in where.
4. In a right join, a predicate on the left table should be written in on.
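The left-join rules above follow from standard outer-join semantics: in a left join, a condition on the left table placed in on never removes left-table rows; it only controls which right-table rows are considered matches. A minimal sketch using Python's built-in sqlite3 (table names are made up for illustration; any engine with standard join semantics behaves the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE t1 (id INTEGER, part_date TEXT);
    CREATE TABLE t2 (id INTEGER, part_date TEXT);
    INSERT INTO t1 VALUES (1, '2020-01-01'), (2, '2020-01-02');
    INSERT INTO t2 VALUES (1, '2020-01-01');
""")

# Left-table predicate in ON: does NOT filter t1 -- both t1 rows survive,
# the non-matching one simply pairs with NULL.
rows_on = cur.execute("""
    SELECT t1.id, t2.id FROM t1
    LEFT JOIN t2 ON t1.id = t2.id AND t1.part_date = '2020-01-01'
""").fetchall()

# Left-table predicate in WHERE: filters t1 as intended -- one row survives.
rows_where = cur.execute("""
    SELECT t1.id, t2.id FROM t1
    LEFT JOIN t2 ON t1.id = t2.id
    WHERE t1.part_date = '2020-01-01'
""").fetchall()

print(len(rows_on), len(rows_where))  # 2 1
```

This is exactly why rule 3 says the left table's filter belongs in where.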

3. Predicate pushdown leads to inconsistent results

Let's look at some typical SQL statements below. The two numbers after each label are the values of new_role_cnt and pay_role_cnt that the query returns.

SQL1: 20672 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from(
    select
        role_id, part_date
    from ods_game_dev.ods_role_create
    where part_date = '2020-01-01'
) t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01'

SQL2: 9721 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id
where t1.part_date = '2020-01-01' and t2.part_date = '2020-01-01'

SQL3: 20672 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01'
where t1.part_date = '2020-01-01'

SQL4: 184125 and 9721

select
    count(distinct t1.role_id) as new_role_cnt,
    count(distinct t2.role_id) as pay_role_cnt
from ods_game_dev.ods_role_create t1
left join ods_game_dev.ods_role_recharge t2
on t1.role_id = t2.role_id and t2.part_date = '2020-01-01' and t1.part_date = '2020-01-01'

From the four queries above we can see:

1) SQL1: the t1 subquery is filtered before the join, and the t2 condition is written in on, where it qualifies for predicate pushdown. Each table is filtered first and then joined, so the two counts reflect each table's own filter condition (20672 and 9721).

2) SQL2: the t1 condition is in where and is pushed down. The t2 condition is also in where, but as a filter on the right table of a left join it is not pushed down; it is applied after the join. Rows where t2.part_date is NULL (the unmatched left rows) are then discarded, which effectively turns the left join into an inner join, so both counts come out the same (9721 and 9721).

3) SQL3: the left table t1 is filtered by the where predicate, which is pushed down, and the right table t2 is filtered by the on predicate, which is also pushed down. Both tables are filtered before the join, so, as in SQL1, the counts reflect each table's own filter condition (20672 and 9721).

4) SQL4: the left table t1 is not filtered at all: in a left join, a condition on the left table written in on does not remove any left-table rows, so it effectively does not take effect. A filter on the left table must be placed in where. The right table's condition in on is pushed down and does take effect. As a result, t1 contributes its full data (184125) while t2 is filtered (9721).

To summarize:

Predicate pushdown itself works as intended, but where a predicate is placed determines when it is applied, so the same conditions in different positions can produce different results. When writing outer joins, put filters on the preserved (left) side in where and filters on the other side in on, or pre-filter each table in a subquery as SQL1 does.
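The four behaviors above follow from standard outer-join semantics, so they can be reproduced with any conforming engine. Here is a sketch using Python's sqlite3 on toy data (hypothetical tables with 4 created roles, 3 of them on the target date; the counts are tiny but follow the same pattern as the blog's numbers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE role_create   (role_id INTEGER, part_date TEXT);
    CREATE TABLE role_recharge (role_id INTEGER, part_date TEXT);
    -- roles 1, 2, 3 created on 2020-01-01; role 4 on another day
    INSERT INTO role_create VALUES (1,'2020-01-01'), (2,'2020-01-01'),
                                   (3,'2020-01-01'), (4,'2020-01-02');
    -- roles 1 and 4 recharged on 2020-01-01; role 2 on another day
    INSERT INTO role_recharge VALUES (1,'2020-01-01'), (4,'2020-01-01'),
                                     (2,'2020-01-02');
""")

def counts(sql):
    return cur.execute(sql).fetchone()

# SQL3 shape: left-table filter in WHERE, right-table filter in ON
# -> each table filtered independently before the join
sql3 = counts("""
    SELECT count(DISTINCT t1.role_id), count(DISTINCT t2.role_id)
    FROM role_create t1
    LEFT JOIN role_recharge t2
      ON t1.role_id = t2.role_id AND t2.part_date = '2020-01-01'
    WHERE t1.part_date = '2020-01-01'
""")

# SQL2 shape: both filters in WHERE -> behaves like an inner join
sql2 = counts("""
    SELECT count(DISTINCT t1.role_id), count(DISTINCT t2.role_id)
    FROM role_create t1
    LEFT JOIN role_recharge t2 ON t1.role_id = t2.role_id
    WHERE t1.part_date = '2020-01-01' AND t2.part_date = '2020-01-01'
""")

# SQL4 shape: both filters in ON -> left table not filtered at all
sql4 = counts("""
    SELECT count(DISTINCT t1.role_id), count(DISTINCT t2.role_id)
    FROM role_create t1
    LEFT JOIN role_recharge t2
      ON t1.role_id = t2.role_id
     AND t2.part_date = '2020-01-01' AND t1.part_date = '2020-01-01'
""")

print(sql3, sql2, sql4)  # (3, 1) (1, 1) (4, 1)
```

SQL3 counts each table under its own filter, SQL2 collapses both counts to the inner-join value, and SQL4 counts the full left table, mirroring the 20672 / 9721 / 184125 pattern above.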


Origin blog.csdn.net/d905133872/article/details/131245092