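A join between two tables that returns only the rows of the main table with no match in the child table is called an anti-join. In SQL it is usually written with not in or not exists.
In practice we often need to find the rows of one table whose column value does not appear in another table. Let's look at the ways to optimize this case in PostgreSQL.
Example:
–Create the tables and insert data:
Note: table a has 1,000 rows and table b has 1,000,000 rows. We want the rows of a whose id does not exist in b.aid.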
create table a(id int primary key, info text);
create table b(id int primary key, aid int, crt_time timestamp);
create index b_aid on b(aid);
insert into a select generate_series(1,1000), md5(random()::text);
insert into b select generate_series(1,1000000), generate_series(1,100), clock_timestamp();
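It may be worth checking what the second insert actually stored in b.aid. When two set-returning functions of different lengths sit in one select list, PostgreSQL 10 and later pad the shorter one with nulls, while older versions cycle it. The plans later in this article (101 index entries read from b_aid, an aid IS NOT NULL index condition) look like the padded case. The sketch below reproduces the pattern on a small temporary table; b_srf_demo is a hypothetical name used only for illustration.

```sql
-- Miniature of the article's insert: 12 ids paired with a 3-value series.
-- b_srf_demo is a hypothetical scratch table, not part of the article's schema.
create temp table b_srf_demo(id int primary key, aid int);
insert into b_srf_demo select generate_series(1,12), generate_series(1,3);

-- count(*) counts every row, count(aid) skips NULLs.
-- On PostgreSQL 10+ the shorter series should be padded with NULL, so
-- non_null_aid should be smaller than total_rows; on older versions the
-- series is cycled and the two should match.
select count(*)            as total_rows,
       count(aid)          as non_null_aid,
       count(distinct aid) as distinct_aid
from b_srf_demo;
```

Running the same select against b shows which case applies to your server. If every row of b should carry an aid between 1 and 100 regardless of version, one option is `insert into b select i, (i - 1) % 100 + 1, clock_timestamp() from generate_series(1,1000000) as i;`, although the timings recorded in this article would then no longer apply as-is.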
Common approach:
The most common way to write this is a plain not in query:
select * from a where id not in (select aid from b);
But this form clearly performs very poorly, taking more than 96 seconds here:
bill=# explain analyze select * from a where id not in (select aid from b);
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..13406521.50 rows=500 width=37) (actual time=96598.351..96598.352 rows=0 loops=1)
   Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 1000
   SubPlan 1
     ->  Materialize  (cost=0.00..24313.00 rows=1000000 width=4) (actual time=0.002..56.810 rows=900005 loops=1000)
           ->  Seq Scan on b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..84.128 rows=1000000 loops=1)
 Planning Time: 0.430 ms
 Execution Time: 96599.874 ms
(8 rows)
Time: 96600.994 ms (01:36.601)
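The plan above also reports rows=0, whereas the not exists rewrite further down returns 900 rows. That is consistent with NULLs in b.aid: x not in (...) can never be true when the list contains a NULL. A minimal sketch of the difference, using two hypothetical scratch tables (t_outer, t_inner) rather than the article's data:

```sql
-- t_outer and t_inner are hypothetical scratch tables for illustration only.
create temp table t_outer(id int);
create temp table t_inner(aid int);
insert into t_outer values (1), (2), (3);
insert into t_inner values (1), (null);

-- id = 1 is in the list, so not in is false.
-- For id = 2 and 3 the comparison with NULL is unknown, so not in is not true either.
-- Result: no rows.
select id from t_outer where id not in (select aid from t_inner);

-- not exists only asks whether a matching row exists, so NULLs in t_inner are harmless.
-- Result: 2 and 3.
select id from t_outer o
where not exists (select 1 from t_inner i where i.aid = o.id)
order by id;
```

So not in and not exists are interchangeable only when the subquery column cannot be NULL. Adding `where aid is not null` inside the not in subquery gives the same rows as not exists.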
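Clearly this form is unacceptable, so what better options are there?
Optimization 1: not exists
–SQL:
select * from a where not exists (select aid from b where b.aid = a.id);
–Execution plan:
With not exists, the check for each row of a can stop at the first match. Here the planner turns the query into a Merge Anti Join over the two indexes, and execution time drops to under 1 ms.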
bill=# explain analyze select * from a where not exists (select aid from b where b.aid = a.id);
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.70..38.76 rows=833 width=37) (actual time=0.173..0.780 rows=900 loops=1)
   Merge Cond: (a.id = b.aid)
   ->  Index Scan using a_pkey on a  (cost=0.28..31.07 rows=1000 width=37) (actual time=0.044..0.415 rows=1000 loops=1)
   ->  Index Only Scan using b_aid on b  (cost=0.42..16027.42 rows=1000000 width=4) (actual time=0.010..0.042 rows=101 loops=1)
         Heap Fetches: 0
 Planning Time: 0.373 ms
 Execution Time: 0.900 ms
(7 rows)
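To see which join strategy the planner picks on your own data, explain without analyze is enough. The sketch below uses hypothetical scratch tables (ne_a, ne_b). The node to look for contains "Anti Join"; whether it is the merge, hash, or nested-loop variant depends on table sizes and statistics.

```sql
-- ne_a and ne_b are hypothetical scratch tables for illustration only.
create temp table ne_a(id int primary key);
create temp table ne_b(aid int);
create index ne_b_aid on ne_b(aid);
insert into ne_a select generate_series(1,10);
-- i % 3 + 1 yields only the values 1, 2 and 3.
insert into ne_b select i % 3 + 1 from generate_series(1,30) as i;
analyze ne_a;
analyze ne_b;

-- Plan shape only; the query is not executed.
explain (costs off)
select * from ne_a a
where not exists (select 1 from ne_b b where b.aid = a.id);

-- The select list inside exists is ignored, so "select 1" is a common convention.
-- Ids 4 through 10 have no match in ne_b.
select a.id from ne_a a
where not exists (select 1 from ne_b b where b.aid = a.id)
order by a.id;
```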
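Optimization 2: left join
–SQL:
For an anti-join, the left join must keep only the rows of a for which no row of b matched, that is, where the columns of b come back as null:
select * from a left join b on (a.id = b.aid) where b.aid is null;
–Execution plan:
The plan below was captured with the filter b.* is not null instead. That filter keeps the 100 matched rows and removes the 900 unmatched ones, which is the complement of the result we want, but the plan still shows the join strategy and its cost.
bill=# explain analyze select * from a left join b on(a.id = b.aid) where b.* is not null;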
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=0.70..39.67 rows=995 width=53) (actual time=0.087..0.775 rows=100 loops=1)
   Merge Cond: (a.id = b.aid)
   Filter: (b.* IS NOT NULL)
   Rows Removed by Filter: 900
   ->  Index Scan using a_pkey on a  (cost=0.28..31.07 rows=1000 width=37) (actual time=0.057..0.458 rows=1000 loops=1)
   ->  Index Scan using b_aid on b  (cost=0.42..21433.72 rows=1000000 width=56) (actual time=0.018..0.103 rows=101 loops=1)
 Planning Time: 0.472 ms
 Execution Time: 0.827 ms
(8 rows)
Time: 1.960 ms
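A small self-contained check of the two possible filters may help here. With hypothetical scratch tables lj_a and lj_b, is null on a column of b keeps the unmatched rows of a, while is not null keeps the matched ones:

```sql
-- lj_a and lj_b are hypothetical scratch tables for illustration only.
create temp table lj_a(id int primary key);
create temp table lj_b(id int primary key, aid int);
insert into lj_a values (1), (2), (3), (4);
insert into lj_b values (10, 1), (11, 1), (12, 2);

-- Anti-join: rows of lj_a with no partner in lj_b.
-- Result: 3 and 4.
select a.id
from lj_a a left join lj_b b on a.id = b.aid
where b.aid is null
order by a.id;

-- Opposite filter: one output row per matching pair.
-- Result: 1, 1, 2 (id 1 appears twice because two rows of lj_b match it).
select a.id
from lj_a a left join lj_b b on a.id = b.aid
where b.aid is not null
order by a.id;
```

Besides selecting the wrong rows, the is not null form can return the same id several times, so it is not simply the inverse of the anti-join. Testing a column that takes part in the join condition, such as b.aid, is also the pattern the planner can generally recognise and execute as an anti join.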
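Optimization 3: sub query
–SQL:
Table a has only 1,000 rows, so a scalar sub query with limit 1 probes the b_aid index once per row of a, and each probe stops at the first match. An is null test then picks out the ids that never appear in b.aid.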
select * from
(
select
a.* ,
(select aid from b where b.aid=a.id limit 1) as aid
from a
) as t
where t.aid is null;
–Execution plan:
bill=# explain analyze
bill-# select * from
bill-# (
bill(# select
bill(# a.* ,
bill(# (select aid from b where b.aid=a.id limit 1) as aid
bill(# from a
bill(# ) as t
bill-# where t.aid is null;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..1770.21 rows=5 width=41) (actual time=0.160..1.879 rows=900 loops=1)
   Filter: ((SubPlan 2) IS NULL)
   Rows Removed by Filter: 100
   SubPlan 1
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=900)
           ->  Index Only Scan using b_aid on b  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=900)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
   SubPlan 2
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=1000)
           ->  Index Only Scan using b_aid on b b_1  (cost=0.42..1.74 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=1000)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
 Planning Time: 0.142 ms
 Execution Time: 1.925 ms
(15 rows)
Time: 2.402 ms
Optimization 4: sub query (short form)
–SQL:
The sub query above can also be shortened to:
select * from a where (select aid from b where b.aid=a.id limit 1) is null;
–Execution plan:
bill=# explain analyze select * from a where (select aid from b where b.aid=a.id limit 1) is null;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=0.00..1761.50 rows=5 width=37) (actual time=0.502..3.470 rows=900 loops=1)
   Filter: ((SubPlan 1) IS NULL)
   Rows Removed by Filter: 100
   SubPlan 1
     ->  Limit  (cost=0.42..1.74 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=1000)
           ->  Index Only Scan using b_aid on b  (cost=0.42..1.74 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1000)
                 Index Cond: (aid = a.id)
                 Heap Fetches: 0
 Planning Time: 0.262 ms
 Execution Time: 3.651 ms
(10 rows)
Time: 4.786 ms
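Optimization 5: recursive with
–SQL:
Use PostgreSQL's recursive with syntax:
Both this and the sub query above scan table a once in full. The difference is in how often the index on b is probed: the sub query probes it once per row of a, whereas the recursive with probes it roughly once per distinct aid value in b.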
select * from a where id not in
(
with recursive skip as (
(
select min(aid) aid from b where aid is not null
)
union all
(
select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)
from skip s where s.aid is not null
)
)
select aid from skip where aid is not null
);
–Execution plan:
bill=# explain analyze select * from a where id not in
bill-# (
bill(# with recursive skip as (
bill(# (
bill(# select min(aid) aid from b where aid is not null
bill(# )
bill(# union all
bill(# (
bill(# select (select min(aid) aid from b where b.aid > s.aid and b.aid is not null)
bill(# from skip s where s.aid is not null
bill(# )
bill(# )
bill(# select aid from skip where aid is not null
bill(# );
                                                                  QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on a  (cost=54.57..76.07 rows=500 width=37) (actual time=1.332..1.763 rows=900 loops=1)
   Filter: (NOT (hashed SubPlan 5))
   Rows Removed by Filter: 100
   SubPlan 5
     ->  CTE Scan on skip  (cost=52.30..54.32 rows=100 width=4) (actual time=0.087..1.146 rows=100 loops=1)
           Filter: (aid IS NOT NULL)
           Rows Removed by Filter: 1
           CTE skip
             ->  Recursive Union  (cost=0.45..52.30 rows=101 width=4) (actual time=0.083..1.087 rows=101 loops=1)
                   ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.082..0.083 rows=1 loops=1)
                         InitPlan 3 (returns $1)
                           ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.077..0.078 rows=1 loops=1)
                                 ->  Index Only Scan using b_aid on b b_1  (cost=0.42..4.65 rows=167 width=4) (actual time=0.075..0.076 rows=1 loops=1)
                                       Index Cond: (aid IS NOT NULL)
                                       Heap Fetches: 0
                   ->  WorkTable Scan on skip s  (cost=0.00..4.98 rows=10 width=4) (actual time=0.009..0.009 rows=1 loops=101)
                         Filter: (aid IS NOT NULL)
                         Rows Removed by Filter: 0
                         SubPlan 2
                           ->  Result  (cost=0.47..0.48 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=100)
                                 InitPlan 1 (returns $3)
                                   ->  Limit  (cost=0.42..0.47 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=100)
                                         ->  Index Only Scan using b_aid on b  (cost=0.42..2.84 rows=56 width=4) (actual time=0.006..0.006 rows=1 loops=100)
                                               Index Cond: ((aid > s.aid) AND (aid IS NOT NULL))
                                               Heap Fetches: 0
 Planning Time: 0.548 ms
 Execution Time: 2.027 ms
(27 rows)
Time: 4.036 ms
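The recursive part can be tried on its own. It walks the index from one distinct value to the next instead of reading every entry, a pattern often called a loose index scan. A minimal sketch on a hypothetical scratch table sk_b:

```sql
-- sk_b is a hypothetical scratch table for illustration only.
create temp table sk_b(aid int);
create index sk_b_aid on sk_b(aid);
-- 1,000 rows holding only the values 1 to 5, plus two NULLs.
insert into sk_b select i % 5 + 1 from generate_series(1,1000) as i;
insert into sk_b values (null), (null);

with recursive skip as (
    -- Anchor: the smallest non-null aid.
    (select min(aid) as aid from sk_b where aid is not null)
    union all
    -- Step: the smallest aid strictly greater than the previous one.
    -- When none is left, min() yields NULL and the recursion stops.
    (select (select min(aid) from sk_b b where b.aid > s.aid and b.aid is not null)
     from skip s
     where s.aid is not null)
)
-- Result: 1, 2, 3, 4, 5 (the same set as select distinct aid ... where aid is not null).
select aid from skip where aid is not null order by aid;
```

Each step is a single index probe, so the work grows with the number of distinct values rather than with the number of rows. That is why this form pays off when b has many rows but few distinct aid values.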
Summary
A plain not in query can be rewritten in this many ways in PostgreSQL, which shows how expressive its SQL syntax is. Do you know an even better way to write it?
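Before swapping one form for another, it may be worth confirming that they return the same rows on your data. One possible check, sketched on hypothetical scratch tables eq_a and eq_b, compares the rewrites with except in both directions; an empty result means the sets are equal.

```sql
-- eq_a and eq_b are hypothetical scratch tables for illustration only.
create temp table eq_a(id int primary key, info text);
create temp table eq_b(id int primary key, aid int);
insert into eq_a select i, md5(i::text) from generate_series(1,20) as i;
-- Only the first 5 rows of eq_b carry an aid; the rest are NULL.
insert into eq_b select i, case when i <= 5 then i end from generate_series(1,50) as i;

with ne as (   -- not exists
    select a.id from eq_a a
    where not exists (select 1 from eq_b b where b.aid = a.id)
), lj as (     -- left join ... is null
    select a.id from eq_a a left join eq_b b on a.id = b.aid
    where b.aid is null
), sq as (     -- scalar sub query with limit 1
    select a.id from eq_a a
    where (select aid from eq_b b where b.aid = a.id limit 1) is null
)
-- Symmetric difference of each rewrite against not exists; no rows expected.
(select id from ne except select id from lj)
union all
(select id from lj except select id from ne)
union all
(select id from ne except select id from sq)
union all
(select id from sq except select id from ne);
```

The plain not in form is deliberately left out of this comparison: with NULLs in the sub query column it returns no rows at all, so it would only agree with the others after adding `where aid is not null` to its sub query.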