本文将介绍Hive SQL中窗口函数、分析函数以及Over的使用,其实不仅仅是Hive,对于很多数数据库来说同样也适用,比如Mysql8,Oracle,MSSQL等传统的关系型数据库。
测试环境准备
如有一张表stock_hq,表格中的数据如下所示:
TDATESECCODEBLOCKAMOUNTPRICE20221113000001.szA0121010.1020221113000002.szA022109.1020221113000003.szA012108.1020221114000001.szA0121010.1020221114000002.szA022109.1020221114000003.szA012108.1020221115000001.szA0121010.1020221115000002.szA022109.1020221115000003.szA012108.1020221116000001.szA0121010.1020221116000002.szA022109.1020221116000003.szA012108.1020221117000001.szA0111010.1020221117000002.szA021109.1020221117000003.szA011109.1020221118000001.szA0110010.1020221118000002.szA0210010.1020221118000003.szA0110010.10
窗口和Over
窗口其实就是一个数据范围,它指定了我们统计计算分些数据范围。在Spark和Flink中我们知道,窗口有全局窗口和滚动窗口之分,同样在SQL的窗口中也有类似的概念。在SQL中,窗口是通过Over来实现的。
OVER( [PARTITION BY xx] [ORDER BY XX] [Window specifications ])
在Over()中可以由上述3者进行不同的组合,或者3者都可以不指定。
PARTITON BY 选项是可选的,它可以指定一个或者多个字段进行开窗,如果不指定开窗字段,则只有”一个窗口”。
- 示例1:不指定开窗字段,不指定排序,不定义窗口大小,则默认表中所有的数据都在一个窗口中
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over() AS SUM_AMOUNT FROM stock_hq;
输出:
TDATESECCODEAMOUNTSUM_AMOUNT20221113000001.sz210315020221113000002.sz210315020221113000003.sz210315020221114000001.sz210315020221114000002.sz210315020221114000003.sz210315020221115000001.sz210315020221115000002.sz210315020221115000003.sz210315020221116000001.sz210315020221116000002.sz210315020221116000003.sz210315020221117000001.sz110315020221117000002.sz110315020221117000003.sz110315020221118000001.sz100315020221118000002.sz100315020221118000003.sz1003150
很明显,上述数据中Sum(Amount)统计的是所有记录的和,也就是说所有数据在同一窗口中。
- 示例2:指定一个开窗字段
SELECT SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY SECCODE) AS SUM_AMOUNT FROM stock_hq;
上述语句中指定了一个开窗字段,没有指定排序和窗口大小,则分成多个窗口,每个窗口中的数据是每个SECCODE的所有数据。
SECCODETDATEAMOUNTSUM_AMOUNT000001.sz202211132101050000001.sz202211142101050000001.sz202211181001050000001.sz202211152101050000001.sz202211171101050000001.sz202211162101050000002.sz202211132101050000002.sz202211181001050000002.sz202211171101050000002.sz202211162101050000002.sz202211152101050000002.sz202211142101050000003.sz202211152101050000003.sz202211162101050000003.sz202211142101050000003.sz202211171101050000003.sz202211132101050000003.sz202211181001050
- 示例3:指定多个开窗字段
select BLOCK,SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over(PARTITION BY BLOCK,SECCODE) AS SUM_AMOUNT FROM stock_hq;
输出:
BLOCKSECCODETDATEAMOUNTSUM_AMOUNTA01000001.sz202211132101050A01000001.sz202211181001050A01000001.sz202211142101050A01000001.sz202211171101050A01000001.sz202211152101050A01000001.sz202211162101050A01000003.sz202211181001050A01000003.sz202211171101050A01000003.sz202211162101050A01000003.sz202211152101050A01000003.sz202211142101050A01000003.sz202211132101050A02000002.sz202211162101050A02000002.sz202211152101050A02000002.sz202211171101050A02000002.sz202211142101050A02000002.sz202211181001050A02000002.sz202211132101050
开窗字段值相同的数据分在同一个窗口中(类似于group by).
指定了Order By后,那么窗口就不再是”静态”的了,而成了一个动态滚动的窗口,废话不多说,上菜:
- 示例1:不指定分区,只指定Order by
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(order by SECCODE) AS SUM_AMOUNT FROM stock_hq;
输出:
TDATESECCODEAMOUNTSUM_AMOUNT20221113000001.sz210105020221114000001.sz210105020221118000001.sz100105020221115000001.sz210105020221117000001.sz110105020221116000001.sz210105020221113000002.sz210210020221118000002.sz100210020221117000002.sz110210020221116000002.sz210210020221115000002.sz210210020221114000002.sz210210020221115000003.sz210315020221116000003.sz210315020221114000003.sz210315020221117000003.sz110315020221113000003.sz210315020221118000003.sz1003150
如果只单独指定了Order By,Order By字段相同的数据会先分成一组做一个统计,然后再到下一个组如000002.sz的数据时会将000002.sz的所有的数据先做个统计,再累加上一个分组000001.sz的统计结果。
- 示例2:指定分区,指定排序
select TDATE,SECCODE, AMOUNT,SUM(AMOUNT) over(partition by SECCODE order by TDATE) AS SUM_AMOUNT FROM stock_hq;
TDATESECCODEAMOUNTSUM_AMOUNT20221113000001.sz21021020221114000001.sz21042020221115000001.sz21063020221116000001.sz21084020221117000001.sz11095020221118000001.sz100105020221113000002.sz21021020221114000002.sz21042020221115000002.sz21063020221116000002.sz21084020221117000002.sz11095020221118000002.sz100105020221113000003.sz21021020221114000003.sz21042020221115000003.sz21063020221116000003.sz21084020221117000003.sz11095020221118000003.sz1001050
在每一个窗口中,每一行的统计结果为上一行的统计结果加上当前行的值。
窗口的定义主要用于指定窗口的大小,有如下几种语义进行指定:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
UNBOUNDED 无边界的
PRECEDING 当前行的前
FOLLOWING 当前行后跟多少行
示例组合:
- ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING 当前行的前3行到前一行
- ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING 当前行的前3行+当前行+当前行的后一行
- ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING 第一行到当前行的前2行
- ROWS BETWEEN CURRENT ROW AND UNBOUND FOLLWING 当前行到末尾行
- ROWS BETWEEN 3 FOLLOWING AND UNBOUND FOLLOWING 从当前行往后数3行开始到末尾
- ROWS BETWEEN 3 FOLLOWING AND 10 FOLLOWING 从当前行的后面第3行开始到后面第10行之间的数据。
由于我们上面的日期是连续的,所以需要删除某一天的数据,让效果看起来更明显。
delete from stock_hq where tdate = 20221115;
- ROWS是根据数据的物理顺序来指定窗口的大小,可以不指定排序字段(采用数据库的默认顺序)。
如:ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
在000001.sz窗口中,假如当前行的日期是20221117:到当前行时统计的是当前行前3行的值加当前行的值也就是740。统计的范围是固定的,与当前行的值无关。
示例:
select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE ROWS BETWEEN 3 PRECEDING AND CURRENT Row) AS SUM_AMOUNT FROM stock_hq;
输出:
SECCODETDATEAMOUNTSUM_AMOUNT000001.sz20221113210210000001.sz20221114210420000001.sz20221116210630000001.sz20221117110740000001.sz20221118100630000002.sz20221113210210000002.sz20221114210420000002.sz20221116210630000002.sz20221117110740000002.sz20221118100630000003.sz20221113210210000003.sz20221114210420000003.sz20221116210630000003.sz20221117110740000003.sz20221118100630
- RANGE 必须指定一个排序字段,且排序字段必须是数字类型或时间类型, 窗口的大小和当前行的排序字段的值有关。
如:RANGE BETWEEN 3 PRECEDING AND CURRENT ROW.
假如当前行的日期是20221117,到当前行时统计的是日期大于等于20221114到当前行20221117这一范围上的值,如下示例中是530,是一个逻辑上的窗口设定,与当前值有关。
select SECCODE,TDATE,AMOUNT,SUM(AMOUNT) over( partition BY SECCODE order by TDATE RANGE BETWEEN 3 PRECEDING AND CURRENT Row) AS SUM_AMOUNT FROM stock_hq;
输出:
SECCODETDATEAMOUNTSUM_AMOUNT000001.sz20221113210210000001.sz20221114210420000001.sz20221116210630000001.sz20221117110530000001.sz20221118100420000002.sz20221113210210000002.sz20221114210420000002.sz20221116210630000002.sz20221117110530000002.sz20221118100420000003.sz20221113210210000003.sz20221114210420000003.sz20221116210630000003.sz20221117110530000003.sz20221118100420
窗口函数
- 函数原型:LEAD(column,rows,default_value)
- 用途:将窗口的数据整体上移指定的行,上移后空缺的值使用指定的default_value填充。
示例:每个窗口的数据整体往上移一行,空缺的值默认为NULL
select TDATE,SECCODE, AMOUNT,LEAD(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;
输出:
TDATESECCODEAMOUNTLEAD_AMOUNT20221113000001.sz21021020221114000001.sz21021020221116000001.sz21011020221117000001.sz11010020221118000001.sz10020221113000002.sz21021020221114000002.sz21021020221116000002.sz21011020221117000002.sz11010020221118000002.sz10020221113000003.sz21021020221114000003.sz21021020221116000003.sz21011020221117000003.sz11010020221118000003.sz100
- 函数原型:LAG(column,rows,default_value)
- 用途:将窗口的数据整体下移指定的行,下移后空缺的值使用指定的default_value填充。
select TDATE,SECCODE, AMOUNT,LAG(AMOUNT,1) over(partition by SECCODE order by TDATE) AS LEAD_AMOUNT FROM stock_hq;
输出:
TDATESECCODEAMOUNTLAG_AMOUNT20221113000001.sz21020221114000001.sz21021020221116000001.sz21021020221117000001.sz11021020221118000001.sz10011020221113000002.sz21020221114000002.sz21021020221116000002.sz21021020221117000002.sz11021020221118000002.sz10011020221113000003.sz21020221114000003.sz21021020221116000003.sz21021020221117000003.sz11021020221118000003.sz100110
- 函数原型:FIRST_VALUE(column,Bool)
最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。
Tips:有些关系型数据库中只有一个参数,请注意。
-
用途:取窗口中的第一个值
-
函数原型:LAST_VALUE(column,Bool)
最多两个参数,第一个参数是列名,第二个参数是一个bool值,默认是false。如果设为true,则会跳过NULL值找第一个不为NULL的值。
- 用途:取窗口中的最后一个值。
Original: https://blog.csdn.net/wangzhongyudie/article/details/127928048
Author: wangzhongyudie
Title: Hive学习笔记:04 SQL的窗口函数、OVER的使用
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/818208/
转载文章受原作者版权保护。转载请注明原作者出处!