Big Data Learning (125): Hive Data Analysis
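1. Consecutive-login variant
- Problem: Find users who logged in on exactly 3 consecutive days (longer streaks do not count).
  Table: user_logs(user_id, login_date).
- Reference answer:
    WITH ranked_logs AS (
        -- Deduplicate first: several logins on one day would break the row-number trick.
        SELECT user_id, login_date,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
        FROM (SELECT DISTINCT user_id, login_date FROM user_logs) d
    ),
    consecutive_groups AS (
        -- login_date minus rn is constant within one consecutive streak.
        -- MySQL syntax shown; in Hive write DATE_SUB(login_date, rn).
        SELECT user_id,
               DATE_SUB(login_date, INTERVAL rn DAY) AS grp,
               MIN(login_date) AS start_date,
               MAX(login_date) AS end_date,
               COUNT(*) AS days
        FROM ranked_logs
        GROUP BY user_id, DATE_SUB(login_date, INTERVAL rn DAY)
    )
    SELECT user_id, start_date, end_date
    FROM consecutive_groups
    WHERE days = 3;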
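To sanity-check a query like this on hand-made data, the "date minus row number" grouping can be mirrored in plain Python. This is only a sketch: the function name `exact_streaks` and the sample rows are made up for illustration, and duplicates are removed first on the assumption that several logins on one day count once.

```python
from datetime import date, timedelta
from itertools import groupby

def exact_streaks(logs, length=3):
    """Return (user_id, start, end) for every maximal streak of exactly `length` days."""
    result = []
    # Deduplicate, then sort by user and date.
    rows = sorted(set(logs))
    for user_id, user_rows in groupby(rows, key=lambda r: r[0]):
        days = [d for _, d in user_rows]
        # date - row_number is constant inside one consecutive run.
        keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days, start=1)]
        for _, run in groupby(keyed, key=lambda k: k[0]):
            run_days = [d for _, d in run]
            if len(run_days) == length:
                result.append((user_id, run_days[0], run_days[-1]))
    return result

sample = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 2)), ("a", date(2024, 1, 3)),
    ("a", date(2024, 1, 3)),                      # duplicate login on the same day
    ("b", date(2024, 1, 1)), ("b", date(2024, 1, 2)),
    ("b", date(2024, 1, 3)), ("b", date(2024, 1, 4)),   # 4-day streak, must be excluded
    ("c", date(2023, 12, 30)), ("c", date(2023, 12, 31)),
    ("c", date(2024, 1, 1)), ("c", date(2024, 1, 5)),   # streak across the year boundary
]
print(exact_streaks(sample))
```

Users "a" and "c" are returned; "b" is dropped because the streak is longer than 3 days.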
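2. Consecutive days without login
- Problem: Find each user's longest run of consecutive days without a login (the table records login dates only).
  Table: user_logs(user_id, login_date).
- Reference answer:
    WITH next_logs AS (
        SELECT user_id, login_date,
               LEAD(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS next_login
        FROM (SELECT DISTINCT user_id, login_date FROM user_logs) d
    )
    SELECT user_id,
           -- Days strictly between two adjacent login dates.
           MAX(DATEDIFF(next_login, login_date) - 1) AS max_consecutive_missing
    FROM next_logs
    WHERE next_login IS NOT NULL
    GROUP BY user_id;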
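The gap calculation is easy to verify by hand. The sketch below (hypothetical function name and sample data) pairs each login date with the next one, as LEAD() does, and keeps the largest number of days strictly in between. Users with a single login date have no gap and are left out, matching the `next_login IS NOT NULL` filter.

```python
from collections import defaultdict
from datetime import date

def max_missing_days(logs):
    """Return {user_id: longest run of days without login between two logins}."""
    by_user = defaultdict(set)
    for user_id, d in logs:
        by_user[user_id].add(d)          # a set removes same-day duplicates
    result = {}
    for user_id, days in by_user.items():
        ordered = sorted(days)
        # Pair each date with the next one, like LEAD(login_date).
        gaps = [(nxt - cur).days - 1 for cur, nxt in zip(ordered, ordered[1:])]
        if gaps:
            result[user_id] = max(gaps)
    return result

sample = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 1, 2)), ("a", date(2024, 1, 10)),
    ("b", date(2024, 1, 1)),                               # single login, no gap
    ("c", date(2023, 12, 30)), ("c", date(2024, 1, 2)),    # gap across the year boundary
]
print(max_missing_days(sample))
```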
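II. Advanced window function applications
3. Moving average
- Problem: Compute each user's average order amount over the most recent 7 days (sliding window).
  Table: orders(user_id, order_date, amount).
- Reference answer:
    SELECT user_id, order_date,
           -- RANGE frames by date value: the current day plus the 6 days before it,
           -- however many rows that covers (ROWS would count rows instead).
           AVG(amount) OVER (
               PARTITION BY user_id
               ORDER BY order_date
               RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
           ) AS rolling_7day_avg
    FROM orders;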
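The difference between a date-based window and a row-based one is easiest to see on a few rows. This sketch (names and data are illustrative only) recomputes the average over every order of the same user whose date falls within the current date and the 6 days before it.

```python
from datetime import date, timedelta

def rolling_7day_avg(orders):
    """orders: list of (user_id, order_date, amount). Returns (user_id, order_date, avg) rows."""
    out = []
    for user_id, order_date, _ in orders:
        window_start = order_date - timedelta(days=6)
        # All of this user's orders dated within [order_date - 6 days, order_date].
        window = [amt for u, d, amt in orders
                  if u == user_id and window_start <= d <= order_date]
        out.append((user_id, order_date, sum(window) / len(window)))
    return out

sample = [
    ("a", date(2024, 1, 1), 100),
    ("a", date(2024, 1, 3), 200),
    ("a", date(2024, 1, 8), 300),   # Jan 1 is now outside the 7-day window
    ("a", date(2024, 1, 9), 400),
]
for row in rolling_7day_avg(sample):
    print(row)
```

On Jan 8 the window covers Jan 2 to Jan 8, so only the 200 and 300 orders are averaged, even though fewer than 7 rows precede it.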
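4. Growth rate
- Problem: Compute the month-over-month growth rate of each user's monthly spending.
  Table: orders(user_id, order_date, amount).
- Reference answer:
    WITH monthly_sales AS (
        -- MySQL format string; in Hive use DATE_FORMAT(order_date, 'yyyy-MM').
        SELECT user_id,
               DATE_FORMAT(order_date, '%Y-%m') AS month,
               SUM(amount) AS total_amount
        FROM orders
        GROUP BY user_id, DATE_FORMAT(order_date, '%Y-%m')
    )
    SELECT user_id, month, total_amount,
           -- NULL for a user's first month and when the previous total is 0.
           -- LAG compares with the previous month that has orders,
           -- which is not necessarily the previous calendar month.
           (total_amount / NULLIF(LAG(total_amount) OVER (PARTITION BY user_id ORDER BY month), 0) - 1) * 100 AS growth_rate
    FROM monthly_sales;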
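The growth formula is (current month total / previous month total - 1) × 100. A small Python sketch, with made-up names and data, reproduces the aggregation and the LAG() step. It returns None where no previous month exists or the previous total is zero, which is an assumption about how such rows should be reported.

```python
from collections import defaultdict
from datetime import date

def monthly_growth(orders):
    """orders: (user_id, order_date, amount). Returns (user_id, month, total, growth_pct) rows."""
    totals = defaultdict(float)
    for user_id, order_date, amount in orders:
        totals[(user_id, order_date.strftime("%Y-%m"))] += amount
    rows, prev = [], {}
    for (user_id, month), total in sorted(totals.items()):
        last = prev.get(user_id)                 # previous month's total, like LAG()
        growth = None if not last else (total / last - 1) * 100
        rows.append((user_id, month, total, growth))
        prev[user_id] = total
    return rows

sample = [
    ("a", date(2024, 1, 5), 100), ("a", date(2024, 1, 20), 100),
    ("a", date(2024, 2, 3), 300),
    ("a", date(2024, 3, 9), 150),
    ("b", date(2024, 2, 1), 80),
]
for row in monthly_growth(sample):
    print(row)
```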
III. Time series analysis
5. Filling in missing dates
- Problem: Produce each user's daily login status (0 = no login, 1 = login), including the dates missing from the table.
  Table: user_logs(user_id, login_date).
- Reference answer:
    -- MySQL 8 syntax. Recursion stops after cte_max_recursion_depth iterations (default 1000).
    -- Hive has no recursive CTEs; a common alternative there is
    -- LATERAL VIEW posexplode(split(space(DATEDIFF(end_date, start_date)), ' ')).
    WITH RECURSIVE date_range AS (
        SELECT user_id,
               MIN(login_date) AS start_date,
               MAX(login_date) AS end_date
        FROM user_logs
        GROUP BY user_id
    ),
    all_dates AS (
        -- One row per user per calendar day between the first and the last login.
        SELECT user_id, start_date AS calendar_date, end_date
        FROM date_range
        UNION ALL
        SELECT user_id, calendar_date + INTERVAL 1 DAY, end_date
        FROM all_dates
        WHERE calendar_date < end_date
    )
    SELECT ad.user_id, ad.calendar_date,
           -- COUNT + GROUP BY keeps one row per day even with several logins on that day.
           IF(COUNT(ul.login_date) > 0, 1, 0) AS is_logged_in
    FROM all_dates ad
    LEFT JOIN user_logs ul
           ON ad.user_id = ul.user_id AND ad.calendar_date = ul.login_date
    GROUP BY ad.user_id, ad.calendar_date;
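The expected output of a date-filling query can be produced by hand for a few rows. The sketch below (hypothetical names, made-up data) walks every calendar day between a user's first and last login and marks whether a login exists for it.

```python
from datetime import date, timedelta

def daily_status(logs):
    """Return (user_id, calendar_date, is_logged_in) for each day in the user's login range."""
    by_user = {}
    for user_id, d in logs:
        by_user.setdefault(user_id, set()).add(d)
    rows = []
    for user_id in sorted(by_user):
        days = by_user[user_id]
        current, end = min(days), max(days)
        while current <= end:                      # walk the calendar day by day
            rows.append((user_id, current, 1 if current in days else 0))
            current += timedelta(days=1)
    return rows

sample = [("a", date(2023, 12, 30)), ("a", date(2024, 1, 2)), ("a", date(2024, 1, 2))]
for row in daily_status(sample):
    print(row)
```

The two missing days (Dec 31 and Jan 1) appear with status 0, and the duplicated Jan 2 login yields a single row.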
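6. Periodicity detection
- Problem: Find users who log in on a fixed day of every week (for example, every Monday).
  Table: user_logs(user_id, login_date).
- Reference answer:
    WITH logs AS (
        SELECT DISTINCT user_id,
               DAYOFWEEK(login_date) AS dow,  -- 1 = Sunday ... 7 = Saturday
               -- Week index counted from Monday 1970-01-05; continuous across years.
               FLOOR(DATEDIFF(login_date, '1970-01-05') / 7) AS week_no
        FROM user_logs
    )
    SELECT user_id, dow,
           COUNT(*) AS weeks_count,
           MAX(week_no) - MIN(week_no) + 1 AS weeks_span
    FROM logs
    GROUP BY user_id, dow
    -- Logged in on this weekday in every week of the span; add a minimum
    -- week count if one-off logins should be ignored.
    HAVING COUNT(*) = MAX(week_no) - MIN(week_no) + 1;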
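IV. Complex business scenarios
7. Purchase interval analysis
- Problem: Compute each user's average interval between purchases and find the users whose average exceeds 30 days.
  Table: orders(user_id, order_date).
- Reference answer:
    WITH order_intervals AS (
        SELECT user_id, order_date,
               -- Days since the same user's previous order; NULL for the first order.
               DATEDIFF(order_date, LAG(order_date) OVER (PARTITION BY user_id ORDER BY order_date)) AS days_since_last
        FROM orders
    )
    SELECT user_id,
           AVG(days_since_last) AS avg_interval
    FROM order_intervals
    WHERE days_since_last IS NOT NULL
    GROUP BY user_id
    HAVING AVG(days_since_last) > 30;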
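A quick way to check the interval logic is to compute it directly. In this sketch (illustrative names and data), each order date is paired with the previous one as LAG() would do. Users with a single order have no interval and are skipped, and `threshold_days` is a hypothetical parameter standing in for the 30-day cutoff.

```python
from collections import defaultdict
from datetime import date

def slow_buyers(orders, threshold_days=30):
    """Return {user_id: average days between orders} for users above the threshold."""
    by_user = defaultdict(list)
    for user_id, d in orders:
        by_user[user_id].append(d)
    result = {}
    for user_id, days in by_user.items():
        days.sort()
        # Difference to the previous order, like DATEDIFF(order_date, LAG(order_date)).
        intervals = [(b - a).days for a, b in zip(days, days[1:])]
        if intervals:
            avg = sum(intervals) / len(intervals)
            if avg > threshold_days:
                result[user_id] = avg
    return result

sample = [
    ("a", date(2024, 1, 1)), ("a", date(2024, 2, 10)), ("a", date(2024, 3, 31)),
    ("b", date(2024, 1, 1)), ("b", date(2024, 1, 11)), ("b", date(2024, 1, 21)),
    ("c", date(2024, 1, 1)),   # single order: no interval to average
]
print(slow_buyers(sample))
```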
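8. Active / churned user analysis
- Problem: Label each user's monthly status (active = logged in during the month, churned = no login for 3 consecutive months).
  Table: user_logs(user_id, login_date).
- Reference answer:
    WITH months AS (
        -- One row per user per month with at least one login; month_no is a running month index.
        SELECT DISTINCT user_id,
               DATE_FORMAT(login_date, '%Y-%m') AS month,
               YEAR(login_date) * 12 + MONTH(login_date) AS month_no
        FROM user_logs
    ),
    status AS (
        SELECT user_id, month, month_no,
               LEAD(month_no) OVER (PARTITION BY user_id ORDER BY month_no) AS next_month_no,
               MAX(month_no) OVER () AS last_month_no  -- latest month present in the data
        FROM months
    )
    SELECT user_id, month,
           -- Every listed month has a login. It is flagged 'churned' when at least 3
           -- login-free months follow it (up to the next login or the end of the data).
           CASE WHEN COALESCE(next_month_no, last_month_no + 1) - month_no > 3
                THEN 'churned'
                ELSE 'active'
           END AS status
    FROM status;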
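V. Advanced challenges
9. Longest consecutive event chain
- Problem: Find each user's longest chain of consecutive events of the same type (for example, consecutive likes or comments).
  Table: events(user_id, event_time, event_type).
- Reference answer:
    WITH ranked_events AS (
        SELECT user_id, event_time, event_type,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS rn_all,
               ROW_NUMBER() OVER (PARTITION BY user_id, event_type ORDER BY event_time) AS rn_type
        FROM events
    ),
    event_groups AS (
        -- rn_all - rn_type stays constant while the same event type repeats without interruption.
        SELECT user_id, event_type,
               rn_all - rn_type AS grp,
               COUNT(*) AS chain_length
        FROM ranked_events
        GROUP BY user_id, event_type, rn_all - rn_type
    )
    SELECT user_id, event_type,
           MAX(chain_length) AS max_chain
    FROM event_groups
    GROUP BY user_id, event_type;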
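For chains of same-type events, a grouping key that does not depend on the spacing of the timestamps is the difference between two row numbers: one per user and one per user and event type. The sketch below simulates that technique on irregularly spaced events. The function name and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

def longest_chains(events):
    """events: (user_id, event_time, event_type). Returns {(user_id, event_type): longest run}."""
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    rn_user = defaultdict(int)    # ROW_NUMBER() OVER (PARTITION BY user_id ...)
    rn_type = defaultdict(int)    # ROW_NUMBER() OVER (PARTITION BY user_id, event_type ...)
    chain_sizes = Counter()
    for user_id, _, event_type in ordered:
        rn_user[user_id] += 1
        rn_type[(user_id, event_type)] += 1
        # The difference is constant while the same type repeats without interruption.
        grp = rn_user[user_id] - rn_type[(user_id, event_type)]
        chain_sizes[(user_id, event_type, grp)] += 1
    best = {}
    for (user_id, event_type, _), size in chain_sizes.items():
        key = (user_id, event_type)
        best[key] = max(best.get(key, 0), size)
    return best

sample = [
    ("a", datetime(2024, 1, 1, 9, 0, 0), "like"),
    ("a", datetime(2024, 1, 1, 9, 0, 7), "like"),
    ("a", datetime(2024, 1, 1, 9, 5, 0), "comment"),
    ("a", datetime(2024, 1, 1, 11, 0, 0), "like"),
    ("a", datetime(2024, 1, 2, 8, 0, 0), "like"),
    ("a", datetime(2024, 1, 2, 8, 0, 1), "like"),
]
print(longest_chains(sample))
```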
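10. Session identification
- Problem: Group user actions into sessions (a new session starts after a gap of more than 30 minutes).
  Table: actions(user_id, action_time, action_type).
- Reference answer:
    WITH time_diff AS (
        -- Compare in seconds: a minute-level difference truncates 30 min 59 s down to 30.
        SELECT user_id, action_time, action_type,
               TIMESTAMPDIFF(SECOND, LAG(action_time) OVER (PARTITION BY user_id ORDER BY action_time), action_time) AS seconds_since_last
        FROM actions
    ),
    session_markers AS (
        -- A user's first action, or an action after a gap of more than 30 minutes, starts a session.
        SELECT user_id, action_time, action_type,
               IF(seconds_since_last IS NULL OR seconds_since_last > 1800, 1, 0) AS new_session
        FROM time_diff
    ),
    sessions AS (
        -- The running total of session starts gives a per-user session number.
        SELECT user_id, action_time, action_type,
               SUM(new_session) OVER (PARTITION BY user_id ORDER BY action_time) AS session_id
        FROM session_markers
    )
    SELECT * FROM sessions;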
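The session logic (flag a new session, then take a running total of the flags) can be traced in a few lines of Python. This is a sketch with made-up names and data. It compares full timestamps, so a gap of 30 minutes 30 seconds counts as longer than 30 minutes, whereas a difference truncated to whole minutes would report exactly 30.

```python
from datetime import datetime, timedelta

def assign_sessions(actions, gap=timedelta(minutes=30)):
    """actions: (user_id, action_time, action_type). Appends a per-user session number."""
    ordered = sorted(actions, key=lambda a: (a[0], a[1]))
    last_time, session_no, out = {}, {}, []
    for user_id, ts, action_type in ordered:
        prev = last_time.get(user_id)              # previous action, like LAG()
        if prev is None or ts - prev > gap:        # first action or long pause: new session
            session_no[user_id] = session_no.get(user_id, 0) + 1
        last_time[user_id] = ts
        out.append((user_id, ts, action_type, session_no[user_id]))
    return out

sample = [
    ("a", datetime(2024, 1, 1, 10, 0, 0), "view"),
    ("a", datetime(2024, 1, 1, 10, 20, 0), "click"),
    ("a", datetime(2024, 1, 1, 10, 50, 30), "view"),   # 30 min 30 s after the previous action
    ("a", datetime(2024, 1, 1, 12, 0, 0), "buy"),
    ("b", datetime(2024, 1, 1, 10, 5, 0), "view"),
]
for row in assign_sessions(sample):
    print(row)
```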
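- Simulate the data by hand first: create a test table, insert a few rows, and verify that the logic is correct.
- Compare approaches: for consecutive-value problems, try several implementations, such as LEAD() and DATE_SUB + ROW_NUMBER.
- Watch the edge cases: NULLs, multiple records on the same day, and ranges that cross a year or month boundary.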
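The three tips can be combined in one small experiment: build a handful of dates that include a duplicate, a year boundary, and a leap day, then check that two different ways of finding consecutive runs agree. The sketch below is one possible setup with hypothetical function names. One function follows the "date minus row number" idea and the other the neighbor comparison that LAG()/LEAD() would give.

```python
from datetime import date, timedelta

def streaks_by_row_number(days):
    """Group dates by (date - row_number), which is constant inside a consecutive run."""
    runs = {}
    for i, d in enumerate(sorted(set(days)), start=1):
        runs.setdefault(d - timedelta(days=i), []).append(d)
    return sorted((r[0], r[-1]) for r in runs.values())

def streaks_by_neighbor(days):
    """Start a new run whenever the previous date is not exactly one day earlier."""
    runs = []
    for d in sorted(set(days)):
        if runs and d - runs[-1][1] == timedelta(days=1):
            runs[-1] = (runs[-1][0], d)      # extend the current run
        else:
            runs.append((d, d))              # open a new run
    return runs

days = [
    date(2023, 12, 30), date(2023, 12, 31), date(2024, 1, 1),   # crosses the year boundary
    date(2024, 1, 1),                                           # same-day duplicate
    date(2024, 1, 5),
    date(2024, 2, 29), date(2024, 3, 1),                        # leap day into March
]
assert streaks_by_row_number(days) == streaks_by_neighbor(days)
print(streaks_by_neighbor(days))
```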