Hive开窗函数的进阶SQL案例
一、开窗函数基础
1. 定义与作用
开窗函数(Window Functions)在保留原始行数据的同时,对分组内的行进行聚合或排序分析,常用于累计计算、排名、移动平均等场景。与普通聚合函数(如SUM
、AVG
)的区别在于:
- 普通聚合函数:每组返回一行(行数减少)。
- 开窗函数:每组返回多行(保留原始行数)。
2. 核心语法结构
函数名() OVER ([PARTITION BY 列1, 列2...] -- 分组依据[ORDER BY 列A [ASC|DESC]] -- 排序依据[ROWS BETWEEN 范围] -- 窗口范围
)
- **
PARTITION BY
**:按列分组,函数在每个分组内独立计算(如按省份分组)。 - **
ORDER BY
**:分组内按列排序,影响窗口范围逻辑(如时间升序)。 - **
ROWS BETWEEN
**:指定窗口边界,常见选项如下表:
范围关键字 | 含义 |
---|---|
UNBOUNDED PRECEDING | 起点行(分区第一行) |
CURRENT ROW | 当前行 |
n PRECEDING | 前n行 |
n FOLLOWING | 后n行 |
UNBOUNDED FOLLOWING | 终点行(分区最后一行) |
默认行为说明:
- 仅用
PARTITION BY
→ 窗口为整个分组(UNBOUNDED PRECEDING TO UNBOUNDED FOLLOWING
)。PARTITION BY + ORDER BY
→ 窗口为起点到当前行(UNBOUNDED PRECEDING TO CURRENT ROW
)
二、累计计算场景
1. 年度招聘人数累加
WITH t1 AS (SELECT SUBSTR(hiredate, 1, 4) AS year FROM emp
),
t2 AS (SELECT year, COUNT(*) AS cnt FROM t1 GROUP BY year
)
SELECT year, cnt,SUM(cnt) OVER (ORDER BY year) AS cumulative_cnt -- 按年份滚动累加
FROM t2;
结果说明:
year | cnt | cumulative_cnt |
---|---|---|
2023 | 50 | 50 |
2024 | 80 | 130 |
2025 | 70 | 200 |
三、排名分析场景
3. 并列排名处理
SELECT name, class, score,ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num, -- 连续唯一序号RANK() OVER (ORDER BY score DESC) AS rank, -- 并列跳号(1,1,3)DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank -- 并列不跳号(1,1,2)
FROM Scores;
排名函数对比:
分数 | row_num | rank | dense_rank |
---|---|---|---|
95 | 1 | 1 | 1 |
95 | 2 | 1 | 1 |
90 | 3 | 3 | 2 |
四、移动计算场景
4. 1日移动平均销售额
SELECT shop, day, sales,AVG(sales) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 限定窗口为当前行及前2行) AS avg_3day
FROM daily_sales;
结果片段:
day | sales | avg_3day |
---|---|---|
2025-01-01 | 200 | 200.0 |
2025-01-02 | 300 | 250.0 |
2025-01-03 | 250 | 250.0 |
4. 2相邻行差值计算
SELECT record_date, temperature,temperature - LAG(temperature, 1) OVER (ORDER BY record_date) AS diff -- 今日与昨日温差
FROM weather;
五、连续行为分析
5. 用户连续登录统计
-- 步骤1:生成连续性标识
WITH t1 AS (SELECT user_id, login_date,DATE_SUB(login_date, INTERVAL ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) DAY) AS grpFROM logins
)
-- 步骤2:聚合连续区间
SELECT user_id,MIN(login_date) AS start_date,MAX(login_date) AS end_date,COUNT(*) AS consecutive_days
FROM t1
GROUP BY user_id, grp;
输出说明:
user_id | start_date | end_date | consecutive_days |
---|---|---|---|
1001 | 2025-01-01 | 2025-01-05 | 5 |
1001 | 2025-01-07 | 2025-01-08 | 2 |
六、高级偏移分析
6.1 会话最新未读消息提取
WITH RankedMessages AS (SELECT from_id, to_id, msg, isRead, time,ROW_NUMBER() OVER (PARTITION BY from_id, to_id ORDER BY time DESC) AS rn -- 按会话分组,时间倒序排FROM chat_history
)
SELECT from_id, to_id, msg
FROM RankedMessages
WHERE rn = 1 AND isRead = 0; -- 取最新且未读的消息
6.2 计算登录间隔天数
SELECT user_id, start_date, end_date,DATEDIFF(start_date, LAG(end_date, 1) OVER (PARTITION BY user_id ORDER BY start_date)) - 1 AS gap_days -- 本次开始日期与上次结束日期的间隔
FROM login_sessions;
七、分区聚合控制
7. 动态计算组内极值
SELECT id, val,MAX(val) OVER (PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS group_max -- 分组内全局最大值
FROM test;
结果特点:同一分组的所有行显示相同的最大值。
附:核心函数适用场景总结
场景 | 推荐函数 | 关键子句 |
---|---|---|
累计求和 | SUM() OVER(ORDER BY) | ORDER BY 定义累加顺序 |
组内排名 | DENSE_RANK() OVER(PARTITION BY) | PARTITION BY 分组依据 |
移动平均 | AVG() OVER(ROWS BETWEEN n PRECEDING) | ROWS BETWEEN 限定窗口范围 |
相邻记录比较 | LAG()/LEAD() | 偏移量参数控制前后行 |
连续性行为分析 | ROW_NUMBER() + 日期差值 | 生成连续性标识列 |
提示:实际开发中需注意
- 性能优化:避免全分区无边界窗口(如
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
),优先用ROWS
限定物理范围- 空值处理:
LAG/LEAD
的第三个参数可设置默认值(如LAG(col,1,0)
)