Understanding Predicate Pushdown in Hive (the Difference Between ON and WHERE)

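Data warehouse development often involves joining several tables, and that is where ON and WHERE come in. If you are unsure what each of them does in a warehouse query, this article should clear it up.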

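First, the most visible difference between WHERE and ON in SQL:

In an outer join, ON only decides which rows match. Rows of the preserved table that fail the condition are still returned, with the other table's columns filled with NULL. WHERE filters the joined result and returns only the rows that satisfy the condition.
ON reacts to the join type (inner or outer); WHERE does not, it is applied to the joined result in the same way whatever the join type is.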
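
The difference is easy to check on a small data set. The sketch below uses Python's built-in sqlite3 module instead of Hive, so it only illustrates the result semantics, not Hive's execution plan. The partition column dt is modelled as an ordinary column, and the two rows with id 4 are made up for illustration (they are not part of the article's sample data).

```python
import sqlite3


def compare_on_and_where():
    """Run the same LEFT JOIN with the dt condition in WHERE and in ON."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # The Hive partition column dt is modelled as a plain column here.
    cur.execute("CREATE TABLE a (id TEXT, name TEXT, dt TEXT)")
    cur.execute("CREATE TABLE b (id TEXT, dept TEXT, dt TEXT)")
    cur.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [
            ("1", "Daniel", "2022-09-08"),
            ("2", "Andy", "2022-09-08"),
            ("3", "Marc", "2022-09-08"),
            ("4", "Eve", "2022-09-07"),  # hypothetical row from an older date
        ],
    )
    cur.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [
            ("1", "BD", "2022-09-08"),
            ("2", "BE", "2022-09-08"),
            ("4", "QA", "2022-09-08"),  # hypothetical row matching id 4
        ],
    )
    select = "SELECT a.id, a.dt, b.dept FROM a LEFT JOIN b ON a.id = b.id "
    # Condition in WHERE: applied to the joined rows, so id 4 is removed.
    where_rows = cur.execute(
        select + "WHERE a.dt = '2022-09-08' ORDER BY a.id"
    ).fetchall()
    # Condition in ON: only affects matching, so id 4 stays with dept NULL.
    on_rows = cur.execute(
        select + "AND a.dt = '2022-09-08' ORDER BY a.id"
    ).fetchall()
    conn.close()
    return where_rows, on_rows


if __name__ == "__main__":
    where_rows, on_rows = compare_on_and_where()
    print("condition in WHERE:", where_rows)
    print("condition in ON:   ", on_rows)
```

With the condition in WHERE, the row with id 4 disappears from the result. With the condition in ON, it is still returned, but its dept is NULL because the ON condition prevented the match.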

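That difference is not the main topic here. This article looks at how WHERE and ON affect performance in Hive. If you have an environment available, you can try it yourself with the following data: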

CREATE TABLE a (id STRING, name STRING) PARTITIONED BY (dt STRING);
CREATE TABLE b (id STRING, dept STRING) PARTITIONED BY (dt STRING);
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ('1', 'Daniel');
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ('2', 'Andy');
INSERT INTO TABLE a PARTITION (dt = '2022-09-08') VALUES ('3', 'Marc');
INSERT INTO TABLE b PARTITION (dt = '2022-09-08') VALUES ('1', 'BD');
INSERT INTO TABLE b PARTITION (dt = '2022-09-08') VALUES ('2', 'BE');
SELECT * FROM a WHERE dt = '2022-09-08';
SELECT * FROM b WHERE dt = '2022-09-08';

Start with a real requirement: join tables a and b, and take the data for a's latest date.

SELECT *
FROM a
JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';

Most people would write it this way. The conclusion first: there is nothing wrong with it.

Some might try this instead:

SELECT *
FROM a
JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';

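For an inner join this is equivalent to the query above, so it is fine too. So where is the problem?

If a has to be the main table and b is looked up against it, that is, a left outer join, the two forms are no longer interchangeable.

Query 1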

SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = '2022-09-08';

Efficient: Hive reads only the data for the specified date.

Query 2

SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
AND a.dt = '2022-09-08';

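Slow: Hive reads all of a for the join, and the date condition is only applied while joining. The result also differs from query 1: every row of a is returned, and rows of a with a different dt simply get NULL in b's columns.

Query 3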

SELECT *
FROM
  (SELECT *
   FROM a
   WHERE dt = '2022-09-08') t1
LEFT JOIN b ON t1.id = b.id;

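Efficient: Hive reads only the data for the specified date. It looks clumsy, but the effect is the same as query 1. To write SQL that is less clumsy, it helps to understand predicate pushdown in Hive.

The execution plans of query 1 and query 2 illustrate the point. The engine used here is Hive on Spark.

Query 1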

Explain
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53374
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE

                  Spark HashTable Sink Operator
                    keys:
                      0 id (type: string)
                      1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hive_20220909110604_3af93825-e92f-4a19-ab13-38a8d5ed0542:53373
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: a

                  filterExpr: (dt = '2022-09-08') (type: boolean)
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (dt = '2022-09-08') (type: boolean)
                    Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Left Outer Join0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col6, _col7, _col8
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), '2022-09-08' (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Query 2

Explain
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53395
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                  Spark HashTable Sink Operator

                    filter predicates:
                      0 {(dt = '2022-09-08')}
                      1
                    keys:
                      0 id (type: string)
                      1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hive_20220909110827_88d2aa5e-449a-442f-aa51-21d6a021455d:53394
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan

                  alias: a
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                  Map Join Operator
                    condition map:
                         Left Outer Join0 to 1

                    filter predicates:
                      0 {(dt = '2022-09-08')}
                      1
                    keys:
                      0 id (type: string)
                      1 id (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                    input vertices:
                      1 Map 2
                    Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                      Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 3 Data size: 58 Basic stats: COMPLETE Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

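The plans show the difference. In query 1 the predicate is pushed down: the TableScan of a carries filterExpr: (dt = '2022-09-08') followed by a Filter Operator, so rows are filtered as soon as they are scanned. In query 2 the predicate is not pushed down: a is scanned in full, and dt = '2022-09-08' only appears under "filter predicates" in the Spark HashTable Sink Operator and the Map Join Operator, so it is applied at join time.

Predicate Pushdown (PPD): in short, apply filter conditions as early as possible without changing the result. Once a predicate is pushed down, the filter runs on the map side, which reduces the map output, cuts the amount of data transferred across the cluster, saves cluster resources, and improves job performance.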
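
To make the idea concrete, here is a minimal Python sketch of "filter before the join" versus "filter after the join". It is not Hive code and says nothing about how Hive implements PPD; left_join, filter_after_join and filter_before_join are hypothetical helpers, and the two rows dated 2022-09-07 are made up. It shows that, for a predicate on the left table of a left join written in WHERE, both orders give the same result, while the pushed-down version feeds fewer rows into the join.

```python
def left_join(left_rows, right_rows):
    """Naive left outer join on 'id'; returns (left, right_or_None) pairs."""
    by_id = {}
    for r in right_rows:
        by_id.setdefault(r["id"], []).append(r)
    joined = []
    for l in left_rows:
        matches = by_id.get(l["id"])
        if matches:
            joined.extend((l, r) for r in matches)
        else:
            joined.append((l, None))  # unmatched left row, right side is NULL
    return joined


def filter_after_join(a_rows, b_rows, dt):
    """No pushdown: every row of a enters the join, then the filter runs."""
    joined = left_join(a_rows, b_rows)
    result = [(l, r) for l, r in joined if l["dt"] == dt]
    return result, len(a_rows)


def filter_before_join(a_rows, b_rows, dt):
    """Pushdown: the filter runs first, so fewer rows enter the join."""
    kept = [l for l in a_rows if l["dt"] == dt]
    return left_join(kept, b_rows), len(kept)


A_ROWS = [
    {"id": "1", "name": "Daniel", "dt": "2022-09-08"},
    {"id": "2", "name": "Andy", "dt": "2022-09-08"},
    {"id": "3", "name": "Marc", "dt": "2022-09-08"},
    {"id": "4", "name": "Eve", "dt": "2022-09-07"},   # hypothetical
    {"id": "5", "name": "Finn", "dt": "2022-09-07"},  # hypothetical
]
B_ROWS = [
    {"id": "1", "dept": "BD", "dt": "2022-09-08"},
    {"id": "2", "dept": "BE", "dt": "2022-09-08"},
]

if __name__ == "__main__":
    late, late_in = filter_after_join(A_ROWS, B_ROWS, "2022-09-08")
    early, early_in = filter_before_join(A_ROWS, B_ROWS, "2022-09-08")
    print("same result:", late == early)
    print("rows entering the join:", late_in, "vs", early_in)
```

On a real cluster the saving is in rows scanned and shuffled rather than in list lengths, but the principle is the same.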

PPD is controlled by the parameter hive.optimize.ppd, which is enabled by default.

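Terminology

Preserved Row table: in an outer join, the table that must return all of its rows.

In a left outer join, the left table is the Preserved Row table.

In a right outer join, the right table is the Preserved Row table.

In a full outer join, both tables must return all of their rows, so both are Preserved Row tables.

Null Supplying table: in an outer join, the table whose columns are filled with NULL for rows that have no match.

In a left outer join, every row of the left table is returned, and the right table's columns are NULL where a left row has no match, so the right table is the Null Supplying table.

In a right outer join, the left table is the Null Supplying table.

In a full outer join, both tables are Null Supplying tables, because NULL is used to fill in unmatched rows on either side.

During Join predicate: a predicate in the JOIN ... ON clause. For example, in a join b on a.id = 1, a.id = 1 is a During Join predicate.

After Join predicate: a predicate in the WHERE clause.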

The logic can be summarized by these two rules:

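During Join predicates cannot be pushed past Preserved Row tables. (A predicate on the Preserved Row table written in the ON clause is not pushed down.)
After Join predicates cannot be pushed past Null Supplying tables. (A predicate on the Null Supplying table written in WHERE, after the join, is not pushed down.)

This is captured in the following table:

|                 | Preserved Row Table | Null Supplying Table |
|-----------------|---------------------|----------------------|
| Join Predicate  | Case J1: Not Pushed | Case J2: Pushed      |
| Where Predicate | Case W1: Pushed     | Case W2: Not Pushed  |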

Concrete cases

| Pushed or Not | SQL |
|---------------|-----|
| Pushed     | select * from a join b on a.id = b.id and a.dt = '2022-09-08'; |
| Pushed     | select * from a join b on a.id = b.id where a.dt = '2022-09-08'; |
| Pushed     | select * from a join b on a.id = b.id and b.dt = '2022-09-08'; |
| Pushed     | select * from a join b on a.id = b.id where b.dt = '2022-09-08'; |
| Not Pushed | select * from a left join b on a.id = b.id and a.dt = '2022-09-08'; |
| Pushed     | select * from a left join b on a.id = b.id where a.dt = '2022-09-08'; |
| Pushed     | select * from a left join b on a.id = b.id and b.dt = '2022-09-08'; |
| Not Pushed | select * from a left join b on a.id = b.id where b.dt = '2022-09-08'; |
| Pushed     | select * from a right join b on a.id = b.id and a.dt = '2022-09-08'; |
| Not Pushed | select * from a right join b on a.id = b.id where a.dt = '2022-09-08'; |
| Not Pushed | select * from a right join b on a.id = b.id and b.dt = '2022-09-08'; |
| Pushed     | select * from a right join b on a.id = b.id where b.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id and a.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id where a.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id and b.dt = '2022-09-08'; |
| Not Pushed | select * from a full join b on a.id = b.id where b.dt = '2022-09-08'; |

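| Join type         | Table       | Predicate in join (ON) | Predicate in where |
|-------------------|-------------|------------------------|--------------------|
| join (inner join) | left table  | Pushed                 | Pushed             |
| join (inner join) | right table | Pushed                 | Pushed             |
| left outer join   | left table  | Not Pushed             | Pushed             |
| left outer join   | right table | Pushed                 | Not Pushed         |
| right outer join  | left table  | Pushed                 | Not Pushed         |
| right outer join  | right table | Not Pushed             | Pushed             |
| full outer join   | left table  | Not Pushed             | Not Pushed         |
| full outer join   | right table | Not Pushed             | Not Pushed         |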
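
The two rules are compact enough to write down as a small lookup. The Python sketch below is only a model of the rules quoted above, not of Hive's optimizer: it ignores non-deterministic functions and any other condition Hive may check. The names PRESERVED, NULL_SUPPLYING and is_pushed are hypothetical.

```python
# Which side(s) of each join type are Preserved Row tables.
PRESERVED = {
    "inner": set(),
    "left": {"left"},
    "right": {"right"},
    "full": {"left", "right"},
}

# Which side(s) of each join type are Null Supplying tables.
NULL_SUPPLYING = {
    "inner": set(),
    "left": {"right"},
    "right": {"left"},
    "full": {"left", "right"},
}


def is_pushed(join_type, side, clause):
    """Return True if a predicate on `side` written in `clause` is pushed.

    join_type: "inner", "left", "right" or "full"
    side:      "left" or "right" (the table the predicate refers to)
    clause:    "on" (During Join) or "where" (After Join)
    """
    if clause == "on":
        # Rule 1: During Join predicates cannot be pushed past
        # Preserved Row tables.
        return side not in PRESERVED[join_type]
    if clause == "where":
        # Rule 2: After Join predicates cannot be pushed past
        # Null Supplying tables.
        return side not in NULL_SUPPLYING[join_type]
    raise ValueError("clause must be 'on' or 'where'")


if __name__ == "__main__":
    for join_type in ("inner", "left", "right", "full"):
        for side in ("left", "right"):
            for clause in ("on", "where"):
                label = "Pushed" if is_pushed(join_type, side, clause) else "Not Pushed"
                print(f"{join_type:5} join, {side:5} table, {clause:5}: {label}")
```

Running it prints the same sixteen outcomes as the case list above, which is a quick way to check that the table and the two rules agree.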

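Predicates that use non-deterministic functions such as rand() cannot be pushed down. unix_timestamp() is an exception: its execution plan shows that it is pushed down (it appears as a constant in the filterExpr on the TableScan of a).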

EXPLAIN
SELECT *
FROM a
LEFT JOIN b ON a.id = b.id
WHERE a.dt = unix_timestamp();

Explain
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53432
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: b
                  Statistics: Num rows: 2 Data size: 30 Basic stats: COMPLETE Column stats: NONE
                  Spark HashTable Sink Operator

                    keys:
                      0 id (type: string)
                      1 id (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: hive_20220909114638_7c328579-23dc-434b-9109-8af34c166272:53431
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: a

                  filterExpr: (dt = 1662522398) (type: boolean)
                  Statistics: Num rows: 3 Data size: 53 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (dt = 1662522398) (type: boolean)
                    Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Left Outer Join0 to 1
                      keys:
                        0 id (type: string)
                        1 id (type: string)
                      outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                        Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 2 Data size: 33 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

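1. For Join (Inner Join) and Full outer Join, it makes no difference to performance whether a condition is written in ON or in WHERE.
2. For Left outer Join, performance improves when conditions on the right table are written in ON and conditions on the left table are written in WHERE.
3. For Right outer Join, performance improves when conditions on the left table are written in ON and conditions on the right table are written in WHERE.
4. When the conditions are spread across both tables, conclusions 2 and 3 apply to each predicate independently, as follows: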

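| SQL | When the filter runs |
|-----|----------------------|
| select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08' and b.id = '2022-09-08'); | id is filtered on the map side, dt on the reduce side; inefficient |
| select * from a left outer join b on (a.id = b.id and b.id = '2022-09-08') where a.dt = '2022-09-08'; | id and dt are both filtered on the map side; efficient |
| select * from a left outer join b on (a.id = b.id and a.dt = '2022-09-08') where b.id = '2022-09-08'; | id and dt are both filtered on the reduce side; very inefficient |
| select * from a left outer join b on (a.id = b.id) where a.dt = '2022-09-08' and b.id = '2022-09-08'; | id is filtered on the reduce side, dt on the map side; inefficient |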
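
The four combinations can be worked out mechanically by applying the two pushdown rules to each predicate separately. The sketch below does that for a left outer join only. filter_stage and classify are hypothetical helpers, and "map" / "reduce" simply mirror the wording used in the table (pushed down means map side, not pushed means reduce side); it is not a description of Hive internals.

```python
def filter_stage(side, clause):
    """Stage at which a predicate is applied in a LEFT OUTER JOIN.

    side:   "left" (preserved table a) or "right" (null supplying table b)
    clause: "on" or "where"
    """
    if clause == "on":
        pushed = side != "left"    # ON predicates on the preserved table stay
    elif clause == "where":
        pushed = side != "right"   # WHERE predicates on the null supplying table stay
    else:
        raise ValueError("clause must be 'on' or 'where'")
    return "map" if pushed else "reduce"


def classify(predicates):
    """Map each predicate label to the stage where it is filtered.

    predicates: iterable of (label, side, clause) tuples.
    """
    return {label: filter_stage(side, clause) for label, side, clause in predicates}


# The four rows of the table, with a.dt on the left table and b.id on the right.
CASES = {
    "both in ON": [("a.dt", "left", "on"), ("b.id", "right", "on")],
    "b.id in ON, a.dt in WHERE": [("a.dt", "left", "where"), ("b.id", "right", "on")],
    "a.dt in ON, b.id in WHERE": [("a.dt", "left", "on"), ("b.id", "right", "where")],
    "both in WHERE": [("a.dt", "left", "where"), ("b.id", "right", "where")],
}

if __name__ == "__main__":
    for name, predicates in CASES.items():
        print(name, "->", classify(predicates))
```

Keep in mind that for an outer join, moving a predicate between ON and WHERE can also change which rows are returned, so the placement should be chosen for correctness first and pushdown second.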

Original: https://blog.csdn.net/a805814077/article/details/126777345
Author: DanielMaster
Title: Understanding Predicate Pushdown in Hive (the Difference Between ON and WHERE)

