芝麻酱's Work Tips, Part 2: Using MySQL Window Functions
1. Scenario Overview
Sometimes a single query needs to return row-level table data together with the aggregate values you would normally get from a GROUP BY. Let me set up a scenario: suppose we have an employee table and a department table:
CREATE TABLE `department` (
  `id` bigint NOT NULL COMMENT 'primary key, snowflake ID',
  `code` char(32) NOT NULL COMMENT 'department code',
  `name` varchar(64) DEFAULT NULL COMMENT 'department name',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='department table';

CREATE TABLE `sararimann` (
  `id` bigint NOT NULL COMMENT 'primary key, snowflake ID',
  `name` varchar(64) NOT NULL COMMENT 'employee name',
  `entrydate` datetime NOT NULL COMMENT 'hire date',
  `salary` bigint NOT NULL COMMENT 'salary',
  `dep_id` bigint NOT NULL COMMENT 'department ID',
  PRIMARY KEY (`id`),
  KEY `idx_dep_id` (`dep_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='employee table';
We want to find every employee whose salary is above their department's average, and also output each employee's salary rank within the department.
2. Traditional Solutions
2.1 Computing in Java
Faced with this requirement, if the company happens to be on the older MySQL 5.7 (window functions only arrived in 8.0) and the data volume is small, I think the most pragmatic approach is to query all the relevant rows, process them in Java, and return the result.
2.2 Correlated subquery
Setting the ranking requirement aside for now, let's simply find the employees earning above their department's average salary.
SELECT s.id AS id, s.name AS `name`, s.entrydate AS entry_date, s.salary AS salary,
       s.dep_id AS dep_id, dep.name AS dep_name, dep.code AS dep_code
FROM sararimann AS s
LEFT JOIN department AS dep ON s.dep_id = dep.id
WHERE s.salary > (
    SELECT AVG(s2.salary)
    FROM sararimann s2
    WHERE s2.dep_id = s.dep_id
);
To unpack this query: for each candidate row, the correlated subquery recomputes that department's average salary. Looking purely at the code, with n employees and m departments, the worst case is O(n²), since each of the n outer rows can trigger another scan of the employee table. Thanks to MySQL's caching of correlated subquery results, the average may not actually be recomputed for every single employee, so in practice the cost may be closer to O(m·n).
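One mitigation worth mentioning (my own suggestion, not part of the original post): a composite index covering both the filter column and the aggregated column lets the correlated AVG be answered from the index alone, without touching the clustered index. A sketch, with an index name of my own choosing:

```sql
-- Covering index so the subquery's range scan on dep_id can read
-- salary directly from the index (index-only access).
ALTER TABLE sararimann ADD INDEX idx_dep_salary (dep_id, salary);
```

This does not change the asymptotic shape of the correlated subquery, but each inner scan becomes a narrow index range read instead of a table scan.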
2.3 A slightly better JOIN approach
Instead of a correlated subquery, we can LEFT JOIN a derived table that has already been grouped:

SELECT s.id AS id, s.name AS `name`, s.entrydate AS entry_date, s.salary AS salary,
       s.dep_id AS dep_id, dep.name AS dep_name, dep.code AS dep_code
FROM sararimann AS s
LEFT JOIN department AS dep ON s.dep_id = dep.id
LEFT JOIN (
    SELECT dep_id, AVG(salary) AS avg_salary
    FROM sararimann
    GROUP BY dep_id
) AS ds ON s.dep_id = ds.dep_id
WHERE s.salary > ds.avg_salary;
Even on MySQL 8, EXPLAIN shows that the joined derived table is materialized and scanned in full (access type ALL); it does not use the idx_dep_id index on the base table.
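You can verify this yourself by prefixing the query with EXPLAIN. A sketch; the exact plan depends on your MySQL version and optimizer settings (MySQL 8 may also add an auto-generated key on the materialized derived table):

```sql
-- Inspect the plan for the derived-table JOIN. Look for the
-- <derived> table row and its access type in the output.
EXPLAIN
SELECT s.id, s.salary, ds.avg_salary
FROM sararimann AS s
LEFT JOIN (
    SELECT dep_id, AVG(salary) AS avg_salary
    FROM sararimann
    GROUP BY dep_id
) AS ds ON s.dep_id = ds.dep_id
WHERE s.salary > ds.avg_salary;
```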
2.4 A physical summary table
Table DDL:

CREATE TABLE `dep_salary` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `dep_id` bigint NOT NULL,
  `avg_salary` bigint DEFAULT NULL,
  `sum_salary` bigint DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_dep_id` (`dep_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='aggregated department salary stats';
Populate it:

INSERT INTO dep_salary (dep_id, avg_salary, sum_salary)
SELECT dep_id, AVG(salary) AS avg_salary, SUM(salary) AS sum_salary
FROM sararimann
GROUP BY dep_id;
Now we rewrite the earlier JOIN SQL:

SELECT s.id AS id, s.name AS `name`, s.entrydate AS entry_date, s.salary AS salary,
       s.dep_id AS dep_id, dep.name AS dep_name, dep.code AS dep_code, ds.avg_salary
FROM sararimann AS s
LEFT JOIN dep_salary ds ON s.dep_id = ds.dep_id
LEFT JOIN department AS dep ON s.dep_id = dep.id
WHERE s.salary > ds.avg_salary;
At this point someone will object: surely we can't refresh the dep_salary table by hand before every query. Well, one option is to keep it up to date with a Flink ingestion job; a later post may cover that.
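Short of a streaming pipeline, a scheduled upsert can also keep the summary table reasonably fresh. A minimal sketch, run periodically from a cron job or application scheduler; note that MySQL requires the GROUP BY to be wrapped in a derived table before ON DUPLICATE KEY UPDATE can reference its columns:

```sql
-- Recompute per-department aggregates and upsert them into dep_salary.
-- The UNIQUE KEY on dep_id makes the ON DUPLICATE KEY clause fire.
INSERT INTO dep_salary (dep_id, avg_salary, sum_salary)
SELECT t.dep_id, t.avg_s, t.sum_s
FROM (
    SELECT dep_id, AVG(salary) AS avg_s, SUM(salary) AS sum_s
    FROM sararimann
    GROUP BY dep_id
) AS t
ON DUPLICATE KEY UPDATE
    avg_salary = t.avg_s,
    sum_salary = t.sum_s;
```

The summary is only as fresh as the last run, so this suits reporting-style queries rather than anything that must see every write immediately.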
3. MySQL Window Functions
3.1 The official introduction
When learning any technology, please go to the official documentation first.
My personal mental model of window functions: you declare a window over the result set, which acts a bit like a temporary lookup table keyed by partition (think of a Java Map built once per group). As rows are scanned, each row is joined back to its partition's aggregate by key. The result is that every row carries the GROUP BY style aggregate values without the rows actually being collapsed, and those values can then be used for further filtering.
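The contrast with GROUP BY is easiest to see in a minimal pair of queries against the sararimann table defined above:

```sql
-- GROUP BY collapses each department to a single row:
SELECT dep_id, AVG(salary) AS dep_avg
FROM sararimann
GROUP BY dep_id;

-- A window function keeps every employee row and attaches the
-- department average to each one:
SELECT name, salary,
       AVG(salary) OVER (PARTITION BY dep_id) AS dep_avg
FROM sararimann;
```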
3.2 代码实现
SELECT t.id, t.name, t.entrydate,t.salary,t.dep_id,t.dep_name,t.salary_rank FROM
(SELECT s.id AS id,s.name AS name,s.entrydate AS entrydate, s.salary AS salary, dep.id AS dep_id, dep.name AS dep_name, ROW_NUMBER() OVER w_rank AS salary_rank, AVG(s.salary) OVER w_avg AS dep_avg_salary
FROM sararimann AS s
LEFT JOIN department AS dep ON s.dep_id = dep.id
WINDOW
w_rank AS (PARTITION BY s.dep_id ORDER BY s.salary DESC),
w_avg AS (PARTITION BY s.dep_id)
ORDER BY s.dep_id ASC, s.entrydate ASC) AS t
WHERE t.salary > t.dep_avg_salary;
Although this code appears to use a subquery, the outer query merely iterates once over the inner result. Apart from the sort within each partition, the work is a single pass over the data, roughly O(n log n) rather than the O(n²) of the correlated-subquery approach.
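One detail worth knowing about the ranking: ROW_NUMBER() breaks salary ties arbitrarily. If tied salaries should share a rank, swap in RANK() or DENSE_RANK(). A sketch comparing the three over the same window:

```sql
-- With two employees tied at the top salary of a department:
--   ROW_NUMBER() assigns 1, 2, 3  (ties broken arbitrarily)
--   RANK()       assigns 1, 1, 3  (gap after a tie)
--   DENSE_RANK() assigns 1, 1, 2  (no gap)
SELECT name, salary,
       ROW_NUMBER() OVER w AS row_num,
       RANK()       OVER w AS rnk,
       DENSE_RANK() OVER w AS dense_rnk
FROM sararimann
WINDOW w AS (PARTITION BY dep_id ORDER BY salary DESC);
```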
4. Complete Test Data
To make it easy to follow along, here is the full test data set, so you don't have to make up your own.

INSERT INTO `department` (`id`, `code`, `name`) VALUES
(1,'fin','财务部'),(2,'sales','销售部'),(3,'human_res','人力资源'),(4,'tec1','技术1'),(5,'tec2','技术2');

INSERT INTO `sararimann` (`id`, `name`, `entrydate`, `salary`, `dep_id`) VALUES
(1,'财务-o-1','2017-06-04 00:00:00',9000,1),(2,'财务-o-2','2016-07-01 00:00:00',8500,1),
(3,'财务-m-1','2021-08-27 00:00:00',6700,1),(4,'财务-m-2','2020-04-04 00:00:00',6100,1),
(5,'财务-j-1','2024-03-02 00:00:00',5000,1),(6,'财务-j-2','2025-02-02 00:00:00',4700,1),
(7,'财务-j-3','2025-01-02 00:00:00',4200,1),(8,'销售-o-1','2010-06-08 00:00:00',9500,2),
(9,'销售-o-2','2015-11-30 00:00:00',8500,2),(10,'销售-o-3','2014-08-07 00:00:00',12000,2),
(11,'销售-m-1','2019-03-25 00:00:00',5500,2),(12,'销售-m-2','2018-08-08 00:00:00',5200,2),
(13,'销售-m-3','2020-03-24 00:00:00',5000,2),(14,'销售-m-4','2022-02-28 00:00:00',4200,2),
(15,'销售-m-5','2019-05-06 00:00:00',4500,2),(16,'销售-j-1','2024-07-04 00:00:00',3500,2),
(17,'销售-j-2','2025-03-15 00:00:00',3000,2),(18,'销售-j-3','2024-04-27 00:00:00',3400,2),
(19,'销售-j-4','2023-12-30 00:00:00',3500,2),(20,'销售-j-5','2024-05-01 00:00:00',3100,2),
(21,'销售-j-6','2024-04-21 00:00:00',3200,2),(22,'销售-j-7','2025-05-21 00:00:00',3000,2),
(23,'hr-o-1','2010-03-08 00:00:00',8000,3),(24,'hr-m-1','2019-05-04 00:00:00',5400,3),
(25,'hr-j-1','2024-03-19 00:00:00',3500,3),(26,'hr-j-2','2025-02-28 00:00:00',3200,3),
(27,'技术-s-1','2014-08-21 00:00:00',21000,4),(28,'技术-s-2','2018-04-06 00:00:00',18000,5),
(29,'技术-s-3','2016-06-01 00:00:00',15000,4),(30,'技术-s-4','2015-03-06 00:00:00',19000,5),
(31,'技术-s-5','2017-08-01 00:00:00',12000,4),(32,'技术-m-1','2020-04-08 00:00:00',8500,4),
(33,'技术-m-2','2021-07-01 00:00:00',9000,4),(34,'技术-m-3','2019-11-21 00:00:00',11000,4),
(35,'技术-m-4','2020-08-21 00:00:00',10000,5),(36,'技术-m-5','2020-10-01 00:00:00',8500,5),
(37,'技术-m-6','2022-01-15 00:00:00',7500,5),(38,'技术-m-7','2021-09-21 00:00:00',8000,4),
(39,'技术-m-8','2021-12-03 00:00:00',9000,5),(40,'技术-m-9','2019-08-21 00:00:00',11500,4),
(41,'技术-j-1','2022-08-25 00:00:00',6500,5),(42,'技术-j-2','2023-04-21 00:00:00',5000,5),
(43,'技术-j-3','2023-07-30 00:00:00',5500,5),(44,'技术-j-4','2024-08-15 00:00:00',5000,4),
(45,'技术-j-5','2023-09-10 00:00:00',6000,4),(46,'技术-j-6','2025-03-01 00:00:00',4500,5),
(47,'技术-j-7','2024-08-21 00:00:00',5200,5),(48,'技术-j-8','2024-05-30 00:00:00',5000,4),
(49,'技术-j-9','2025-01-15 00:00:00',6000,4),(50,'技术-j-10','2023-07-01 00:00:00',5000,5),
(51,'技术-j-11','2024-03-21 00:00:00',6200,5),(52,'技术-j-12','2024-09-21 00:00:00',5500,4);