
Notes on what the cpu_util() function does in Linux 6.12

news 2025/10/28 10:40:04

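cpu_util -- estimates the CPU capacity (utilization) used by CFS tasks on a given CPU

1. Purpose

Energy Aware Scheduling (EAS): decides whether a task is placed on a big core or a little core

Frequency scaling (schedutil): decides the CPU frequency

Load balancing: judges whether a CPU is overloaded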

/**
 * cpu_util() - Estimates the amount of CPU capacity used by CFS tasks.
 * @cpu: the CPU to get the utilization for
 * @p: task for which the CPU utilization should be predicted or NULL
 * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
 * @boost: 1 to enable boosting, otherwise 0
 *
 * The unit of the return value must be the same as the one of CPU capacity
 * so that CPU utilization can be compared with CPU capacity.
 *
 * CPU utilization is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on that CPU.
 * It represents the amount of CPU capacity currently used by CFS tasks in
 * the range [0..max CPU capacity] with max CPU capacity being the CPU
 * capacity at f_max.
 *
 * The estimated CPU utilization is defined as the maximum between CPU
 * utilization and sum of the estimated utilization of the currently
 * runnable tasks on that CPU. It preserves a utilization "snapshot" of
 * previously-executed tasks, which helps better deduce how busy a CPU will
 * be when a long-sleeping task wakes up. The contribution to CPU utilization
 * of such a task would be significantly decayed at this point of time.
 *
 * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
 * CPU contention for CFS tasks can be detected by CPU runnable > CPU
 * utilization. Boosting is implemented in cpu_util() so that internal
 * users (e.g. EAS) can use it next to external users (e.g. schedutil),
 * latter via cpu_util_cfs_boost().
 *
 * CPU utilization can be higher than the current CPU capacity
 * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
 * of rounding errors as well as task migrations or wakeups of new tasks.
 * CPU utilization has to be capped to fit into the [0..max CPU capacity]
 * range. Otherwise a group of CPUs (CPU0 util = 121% + CPU1 util = 80%)
 * could be seen as over-utilized even though CPU1 has 20% of spare CPU
 * capacity. CPU utilization is allowed to overshoot current CPU capacity
 * though since this is useful for predicting the CPU capacity required
 * after task migrations (scheduler-driven DVFS).
 *
 * Return: (Boosted) (estimated) utilization for the specified CPU.
 */
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
	// Read the PELT-computed average utilization from this CPU's CFS run queue (cfs_rq).
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
	unsigned long runnable;

	if (boost) {
		runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
		util = max(util, runnable);
	}

	/*
	 * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
	 * contribution. If @p migrates from another CPU to @cpu add its
	 * contribution. In all the other cases @cpu is not impacted by the
	 * migration so its util_avg is already correct.
	 */
	// Case A: task p leaves cpu, so subtract p's contribution from the current util.
	if (p && task_cpu(p) == cpu && dst_cpu != cpu)
		lsub_positive(&util, task_util(p));
	// Case B: task p moves onto cpu, so add p's util contribution.
	else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
		util += task_util(p);

	if (sched_feat(UTIL_EST)) {
		unsigned long util_est;

		// Read cfs_rq->avg.util_est (the estimated util of the current run queue).
		util_est = READ_ONCE(cfs_rq->avg.util_est);

		/*
		 * During wake-up @p isn't enqueued yet and doesn't contribute
		 * to any cpu_rq(cpu)->cfs.avg.util_est.
		 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
		 * has been enqueued.
		 *
		 * During exec (@dst_cpu = -1) @p is enqueued and does
		 * contribute to cpu_rq(cpu)->cfs.util_est.
		 * Remove it to "simulate" cpu_util without @p's contribution.
		 *
		 * Despite the task_on_rq_queued(@p) check there is still a
		 * small window for a possible race when an exec
		 * select_task_rq_fair() races with LB's detach_task().
		 *
		 *   detach_task()
		 *     deactivate_task()
		 *       p->on_rq = TASK_ON_RQ_MIGRATING;
		 *       -------------------------------- A
		 *       dequeue_task()                    \
		 *         dequeue_task_fair()              + Race Time
		 *           util_est_dequeue()            /
		 *       -------------------------------- B
		 *
		 * The additional check "current == p" is required to further
		 * reduce the race window.
		 */
		// p will migrate to or wake up on cpu (dst_cpu == cpu):
		// add p's estimated util even though it is not enqueued yet.
		if (dst_cpu == cpu)
			util_est += _task_util_est(p);
		// p will leave cpu (e.g. exec); task_on_rq_queued(p) confirms that p
		// is currently on the run queue, so subtract p's contribution.
		else if (p && unlikely(task_on_rq_queued(p) || current == p))
			lsub_positive(&util_est, _task_util_est(p));

		// Take the larger of util_avg and util_est so bursty load is not underestimated.
		util = max(util, util_est);
	}

	// arch_scale_cpu_capacity(cpu) is this CPU's maximum capacity (it reflects
	// big/little cores, e.g. little core = 512, big core = 1024).
	// The result is capped so that util never exceeds it.
	return min(util, arch_scale_cpu_capacity(cpu));
}

Unit of the return value: the same as the CPU's maximum capacity (usually normalized to the range 0 to 1024).
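The capping argument in the kernel comment (CPU0 at 121% plus CPU1 at 80%) can be checked with a little arithmetic. The sketch below is a user-space illustration, not kernel code: capped_util(), group_util() and MAX_CAPACITY are hypothetical names, and both CPUs are assumed to have a capacity of 1024, so 121% and 80% become roughly 1239 and 819.

```c
#include <stdbool.h>

#define MAX_CAPACITY 1024UL	/* assumed capacity of each CPU in this example */

/* Clamp a raw utilization value to the CPU capacity, like the final min(). */
static unsigned long capped_util(unsigned long util, unsigned long capacity)
{
	return util < capacity ? util : capacity;
}

/* Sum the utilization of a two-CPU group, with or without per-CPU capping. */
static unsigned long group_util(unsigned long util0, unsigned long util1, bool cap)
{
	if (cap)
		return capped_util(util0, MAX_CAPACITY) +
		       capped_util(util1, MAX_CAPACITY);
	return util0 + util1;
}
```

With these inputs, group_util(1239, 819, false) gives 2058, which exceeds the group's total capacity of 2048, while group_util(1239, 819, true) gives 1843, so the spare capacity on CPU1 stays visible.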

2. Points to note

/*
 * During wake-up @p isn't enqueued yet and doesn't contribute
 * to any cpu_rq(cpu)->cfs.avg.util_est.
 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
 * has been enqueued.
 *
 * During exec (@dst_cpu = -1) @p is enqueued and does
 * contribute to cpu_rq(cpu)->cfs.util_est.
 * Remove it to "simulate" cpu_util without @p's contribution.
 *
 * Despite the task_on_rq_queued(@p) check there is still a
 * small window for a possible race when an exec
 * select_task_rq_fair() races with LB's detach_task().
 *
 *   detach_task()
 *     deactivate_task()
 *       p->on_rq = TASK_ON_RQ_MIGRATING;
 *       -------------------------------- A
 *       dequeue_task()                    \
 *         dequeue_task_fair()              + Race Time
 *           util_est_dequeue()            /
 *       -------------------------------- B
 *
 * The additional check "current == p" is required to further
 * reduce the race window.
 */

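This comment matters: it describes a subtle but important race condition in the Linux scheduler, and how the kernel mitigates it with an extra check (current == p).

🎯 Core question: how can CPU utilization be simulated accurately in the "gap" while a task's state is changing?

cpu_util() is often used for predictive scheduling decisions, for example:

choosing a target CPU when a task wakes up (select_task_rq_fair());
evaluating the effect of a migration during load balancing.

At these moments, however, task p may be in an intermediate state: it no longer fully belongs to the old CPU and has not yet fully joined the new one. The value of cfs_rq->avg.util_est alone then does not reliably tell whether p's contribution is included.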

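🔍 I. Two typical scenarios

Scenario 1: task wake-up

p is not enqueued yet (p->on_rq == 0; the kernel defines only TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING as named states);
so cpu_rq(cpu)->cfs.avg.util_est does not include p's contribution;
but the scheduler wants to know: "if p is placed on dst_cpu, what will util be?"
✅ So _task_util_est(p) is added explicitly.

Scenario 2: task exec() or migration

p is still on the run queue of its original CPU (p->on_rq == TASK_ON_RQ_QUEUED);
so cfs_rq->avg.util_est already includes p's contribution;
but the scheduler wants to know: "if p is moved away, how much util is left?"
✅ So _task_util_est(p) is subtracted explicitly.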
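The two scenarios can be sketched as a small user-space model. This is only an illustration under simplifying assumptions: struct task_model, sub_positive() and simulate_util_est() are hypothetical names, and the current == p check is left out on purpose.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the task state cpu_util() inspects. */
struct task_model {
	unsigned long util_est;	/* what _task_util_est(p) would return */
	bool queued;		/* models task_on_rq_queued(p) */
};

/* Subtract without underflowing, in the spirit of lsub_positive(). */
static unsigned long sub_positive(unsigned long val, unsigned long sub)
{
	return val > sub ? val - sub : 0;
}

/*
 * rq_util_est: the run queue's current util_est sum.
 * joins_cpu:   true models dst_cpu == cpu (p wakes up on or moves to this CPU).
 */
static unsigned long simulate_util_est(unsigned long rq_util_est,
				       const struct task_model *p,
				       bool joins_cpu)
{
	if (joins_cpu)		/* scenario 1: p is not counted yet, add it */
		return rq_util_est + p->util_est;
	if (p->queued)		/* scenario 2: p is counted, remove it */
		return sub_positive(rq_util_est, p->util_est);
	return rq_util_est;	/* p does not affect this run queue */
}
```

For a task with util_est = 200, a run queue at 300 is predicted to reach 500 after the wake-up, and a run queue at 500 that already contains the task is predicted to drop to 300 once it leaves.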

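⚠️ II. The race condition in detail

The problem lies in a very short time window in scenario 2 (exec/migration):

📜 Background: the steps of a task migration

When the load balancer (LB) migrates task p, it calls:

detach_task() → deactivate_task() → dequeue_task() → dequeue_task_fair()

In deactivate_task():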

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
	SCHED_WARN_ON(flags & DEQUEUE_SLEEP);

	// Mark the task as "migrating".
	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
	ASSERT_EXCLUSIVE_WRITER(p->on_rq);

	/*
	 * Code explicitly relies on TASK_ON_RQ_MIGRATING begin set *before*
	 * dequeue_task() and cleared *after* enqueue_task().
	 */

	dequeue_task(rq, p, flags);
}
EXPORT_SYMBOL_GPL(deactivate_task);

/*
 * Must only return false when DEQUEUE_SLEEP.
 */
inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
	bool dequeue_task_result;
	if (sched_core_enabled(rq))
		sched_core_dequeue(rq, p, flags);

	if (!(flags & DEQUEUE_NOCLOCK))
		update_rq_clock(rq);

	if (!(flags & DEQUEUE_SAVE))
		sched_info_dequeue(rq, p);

	psi_dequeue(p, flags);

	/*
	 * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
	 * and mark the task ->sched_delayed.
	 */
	uclamp_rq_dec(rq, p);
	trace_android_rvh_dequeue_task(rq, p, flags);
	dequeue_task_result = p->sched_class->dequeue_task(rq, p, flags);
	trace_android_rvh_after_dequeue_task(rq, p, flags, &dequeue_task_result);
	return dequeue_task_result;
}

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	// util_est_dequeue() runs here; this is where p is actually removed from util_est.
	if (!p->se.sched_delayed)
		util_est_dequeue(&rq->cfs, p);

	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
	if (dequeue_entities(rq, &p->se, flags) < 0)
		return false;

	/*
	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
	 */

	hrtick_update(rq);
	return true;
}

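🕒 The race window

Inside this gap (between A and B):

p->on_rq == TASK_ON_RQ_MIGRATING;
task_on_rq_queued(p) returns false (the state is no longer QUEUED);
but p's contribution has not yet been subtracted from util_est (util_est_dequeue() has not been called yet).

❌ The resulting problem

If an exec() running on another CPU calls select_task_rq_fair() at this moment:

it checks task_on_rq_queued(p) → false;
so lsub_positive(&util_est, ...) is not executed;
yet util_est still contains p's old value;
result: the util of the CPU being evaluated is overestimated, which can lead to a wrong scheduling decision.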

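🛡 III. The fix: the `current == p` check

To narrow this race window, the kernel adds the extra condition current == p:

if (p && unlikely(task_on_rq_queued(p) || current == p))
	lsub_positive(&util_est, _task_util_est(p));

💡 Why does `current == p` help?

In the exec() path, the context that calls select_task_rq_fair() is task p itself (current == p);
even if p is in the TASK_ON_RQ_MIGRATING state, as long as p itself is the one doing the exec, it is safe to reason:
"I am about to leave the current CPU, so my contribution should be removed from util_est."

✅ Effect

Inside the race window, a query triggered by p itself still subtracts its util_est correctly;
this greatly lowers the chance of a misjudgment (the race is not eliminated entirely, but the window becomes very small).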
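The effect of the extra check can be sketched in user space. This is a simplified model, not the kernel's implementation: enum on_rq_state, struct race_task and util_est_without() are hypothetical names, the is_current flag stands in for current == p, and the run queue's util_est is assumed to still contain p's share, as it does between points A and B.

```c
#include <stdbool.h>

/* Values mirror p->on_rq in the kernel; 0 means "not on a run queue". */
enum on_rq_state { NOT_QUEUED = 0, ON_RQ_QUEUED = 1, ON_RQ_MIGRATING = 2 };

struct race_task {
	enum on_rq_state on_rq;
	unsigned long util_est;
};

/*
 * Remaining util_est of the run queue once p's share is removed.
 * is_current models "current == p"; use_current_check toggles that test.
 */
static unsigned long util_est_without(unsigned long rq_util_est,
				      const struct race_task *p,
				      bool is_current, bool use_current_check)
{
	bool queued = (p->on_rq == ON_RQ_QUEUED);

	if (queued || (use_current_check && is_current))
		return rq_util_est > p->util_est ?
		       rq_util_est - p->util_est : 0;
	return rq_util_est;	/* p's share stays in: possible overestimate */
}
```

With p in the MIGRATING state and a run-queue util_est of 500 that still includes p's 200, the version without the check returns 500 (overestimated), while the version with the check returns 300.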

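3. What the different argument combinations mean

🔁 First, recall the function signature

unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)

cpu: the CPU whose util is being estimated
p: a task (may be NULL), used to simulate the effect of a migration or wake-up
dst_cpu:
- == cpu: p will migrate to, or wake up on, cpu
- == -1: p will be removed from cpu (e.g. exec or migrating away)
- any other value: p moves to some other CPU; its contribution is removed if p currently sits on cpu, otherwise cpu is unaffected
boost:
- 1: enable runnable boosting (take max(util_avg, runnable_avg))
- 0: use only util_avg / util_est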
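To make the parameter combinations concrete, here is a minimal user-space model of the function's logic. It is a sketch under stated assumptions, not the kernel code: struct cpu_model, struct task_sim and model_cpu_util() are hypothetical names, UTIL_EST is assumed to be enabled, and the current == p check is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

struct cpu_model {	/* per-CPU signals, all in capacity units */
	unsigned long util_avg, runnable_avg, util_est, capacity;
};

struct task_sim {
	int cpu;			/* models task_cpu(p) */
	unsigned long util_avg;		/* models task_util(p) */
	unsigned long util_est;		/* models _task_util_est(p) */
	bool queued;			/* models task_on_rq_queued(p) */
};

static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }
static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long sub_pos(unsigned long a, unsigned long b) { return a > b ? a - b : 0; }

static unsigned long model_cpu_util(const struct cpu_model *cpus, int cpu,
				    const struct task_sim *p, int dst_cpu,
				    bool boost)
{
	const struct cpu_model *c = &cpus[cpu];
	unsigned long util = c->util_avg;
	unsigned long util_est = c->util_est;

	if (boost)				/* runnable boosting */
		util = max_ul(util, c->runnable_avg);

	if (p && p->cpu == cpu && dst_cpu != cpu)	/* p leaves cpu */
		util = sub_pos(util, p->util_avg);
	else if (p && p->cpu != cpu && dst_cpu == cpu)	/* p joins cpu */
		util += p->util_avg;

	if (p && dst_cpu == cpu)		/* simulate enqueue of p */
		util_est += p->util_est;
	else if (p && p->queued)		/* simulate dequeue of p */
		util_est = sub_pos(util_est, p->util_est);

	/* Cap at the CPU's maximum capacity, like the final min(). */
	return min_ul(max_ul(util, util_est), c->capacity);
}
```

For a CPU with util_avg = 400, runnable_avg = 600 and util_est = 450, the model returns 450 without boosting and 600 with boosting; removing a queued task with util_avg = 150 and util_est = 200 from it yields 250.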

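📊 Comparison summary

| Call | Meaning | Accounts for task p? | Boost? | Typical use |
|---|---|---|---|---|
| `cpu_util(cpu, NULL, -1, 0)` | Current base util | ❌ | ❌ | EAS baseline, internal statistics |
| `cpu_util(cpu, NULL, -1, 1)` | Current boosted util (includes queueing pressure) | ❌ | ✅ | Load balancing, schedutil (via cpu_util_cfs_boost()) |
| `cpu_util(cpu, p, -1, 0)` | util after removing task `p` | ✅ (removed) | ❌ | exec, evaluating load on the source CPU |
| `cpu_util(cpu, p, cpu, 0)` | util after adding task `p` | ✅ (added) | ❌ | EAS CPU selection (`find_energy_efficient_cpu`) |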

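3.1 `cpu_util(cpu, p, cpu, 0)`

🔹 Meaning:

Prediction: if task p is migrated to (or wakes up on) CPU cpu, what will that CPU's utilization be?

🔸 Arguments:

dst_cpu == cpu: p will join cpu;
boost = 0: EAS cares about the actual compute demand rather than queueing pressure (after the migration the task may have the CPU to itself).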

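3.2 `cpu_util(cpu, p, -1, 0);`

🔹 Meaning:

Estimate: if task p is removed from CPU cpu, what will that CPU's utilization be?

🔸 Arguments:

p is a task currently on cpu;
dst_cpu = -1: p is leaving cpu (e.g. exec or migrating away);
boost = 0: boosting is disabled.

🔸 Typical uses:

At an exec() system call: the task is about to replace its own image, and the scheduler needs to evaluate the CPU load "with the current task removed";
Load-balancing decisions: computing "the load left on the source CPU after task p migrates away".

💡 Internally this computes util = current_util - task_util(p), clamped at 0 by lsub_positive().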
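The reason the subtraction is clamped is that util is an unsigned long, so a plain subtraction would wrap around whenever the task's value exceeds the CPU's. The sketch below contrasts the two; naive_sub() and clamped_sub() are hypothetical names, and clamped_sub() only imitates the behavior of lsub_positive() rather than reproducing the kernel macro.

```c
/* Plain unsigned subtraction: wraps to a huge value when sub > val. */
static unsigned long naive_sub(unsigned long val, unsigned long sub)
{
	return val - sub;
}

/* Clamped subtraction in the spirit of lsub_positive(): stops at 0. */
static void clamped_sub(unsigned long *val, unsigned long sub)
{
	*val -= (*val < sub) ? *val : sub;
}
```

Subtracting 150 from 500 gives 350 either way, but subtracting 150 from 100 leaves 0 with clamped_sub(), whereas naive_sub() returns a value close to ULONG_MAX.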

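3.3 `cpu_util(cpu, NULL, -1, 1);`

🔹 Meaning:

Get the current "boosted" utilization of CPU cpu, which reflects whether tasks are queueing on that CPU (contention).

🔸 Arguments:

p = NULL: no task migration is involved;
boost = 1: boosting is enabled → util = max(util_avg, runnable_avg)

🔸 Typical uses:

Load balancing: judging whether a CPU is "overloaded".
- Even with util_avg = 500, runnable_avg = 1000 means tasks spend about as long waiting for the CPU as running on it (for example two tasks competing for it), so the CPU is treated as effectively fully loaded.
cpu_util_cfs_boost(), used by the schedutil governor, is a wrapper around this call.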
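The numbers from the example above can be plugged into a tiny sketch of the boosting rule. boosted_util() and has_contention() are hypothetical helper names used only for illustration; they restate the max(util_avg, runnable_avg) rule and the "runnable > utilization" contention test from the kernel comment.

```c
#include <stdbool.h>

/* Boosted utilization: the larger of util_avg and runnable_avg. */
static unsigned long boosted_util(unsigned long util_avg,
				  unsigned long runnable_avg)
{
	return util_avg > runnable_avg ? util_avg : runnable_avg;
}

/* Contention is signalled when runnable_avg exceeds util_avg. */
static bool has_contention(unsigned long util_avg, unsigned long runnable_avg)
{
	return runnable_avg > util_avg;
}
```

With util_avg = 500 and runnable_avg = 1000, boosted_util() returns 1000 and has_contention() reports true; when both signals are 500, the boosted value stays at 500 and no contention is reported.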