加班遇到一个SQL问题,本想把别人的SQL改下成SparkSQL,在YARN上运行,然而数据一直对不上。
原SQL
⚠️说明:a.id,b.id没有空的,数据1:1,b.name可能存在空的
select a.id,b.id,b.name
from table_a a
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where
exists(select 1 from table_b c where a.id = c.id and c.is_delete = 0
)
改后的
想法是:既然exists过滤了,为什么不直接inner join呢,于是乎:
select a.id,b.id,b.name
from table_a a
inner join table_b b on a.id = b.id and b.is_delete = 0
求助群友
问了下群友,区别是我一直以为left join 后 b.name is not null并没什么用,就没有带,我问群友,下面这两个SQL有什么区别:
select *
from table_a a
left join table_b b on a.id = b.id and b.is_delete = 0
where
exists(select 1 from table_b c where a.id = c.id and c.is_delete = 0
)select *
from table_a a
inner join table_b b on a.id = b.id and b.is_delete = 0
群友问AI的结果
这样的回答,不太对
自悟
然后仔细去品味这两个SQL有什么不同
-- (1)
select a.id,b.id,b.name
from table_a a
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where
exists(select 1 from table_b c where a.id = c.id and c.is_delete = 0
)-- (2)select a.id,b.id,b.name
from table_a a
inner join table_b b on a.id = b.id and b.is_delete = 0
数据table_a
id |
1 |
2 |
3 |
数据table_b
id | name | id_delete |
1 | aa | 0 |
2 | bb | 0 |
3 | NULL | 0 |
结论:
(1)sql计算后的会剔除掉 table_a 不符合 a.id = c.id and c.is_delete = 0 条件的数据。加上 name is not null。最后的数据会出现这两类情况:
a.id,null,null # name为null,b表全部为空
a.id,b.id,b.name # 全部有值
a.id | b.id | b.name |
1 | 1 | aa |
2 | 2 | bb |
3 | NULL | NULL |
(2)sql计算后中则会出现这两类情况:
a.id,b.id,null
a.id,b.id,b.name # 全部有值
a.id | b.id | b.name |
1 | 1 | aa |
2 | 2 | bb |
3 | 3 | NULL |
所以,count的时候是没有问题的,两个都可以,如果是取具体的值有所区别。
拓展
如果a.id b.id 是1:n 呢?
数据table_a
id |
1 |
2 |
3 |
10 |
数据table_b
id | name | id_delete |
1 | aa | 0 |
1 | NULL | 0 |
2 | bb | 0 |
3 | NULL | 0 |
-- (1)
select a.id,b.id,b.name
from table_a a
left join table_b b on a.id = b.id and b.is_delete = 0 and b.name is not null
where
exists(select 1 from table_b c where a.id = c.id and c.is_delete = 0
)
a.id | b.id | b.name |
1 | 1 | aa |
2 | 2 | bb |
3 | NULL | NULL |
-- (2)select a.id,b.id,b.name
from table_a a
inner join table_b b on a.id = b.id and b.is_delete = 0
a.id | b.id | b.name |
1 | 1 | aa |
1 | 1 | NULL |
2 | 2 | bb |
3 | 3 | NULL |
所以,如果count的时候,1对n,相对于1:1是有区别的。
最后
(1)菜就要学,就要钻研。
(2)加班使人头疼,头脑不灵光。
(3)具体问题,具体分析。