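Notes on what the cpu_util() function does in Linux 6.12
cpu_util(): estimates the CPU capacity (utilization) used by CFS tasks on a given CPU
1. Purpose
Energy Aware Scheduling (EAS): deciding whether a task should be placed on a big core or a little core
Frequency selection (schedutil): deciding the CPU frequency
Load balancing: judging whether a CPU is overloaded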
/**
 * cpu_util() - Estimates the amount of CPU capacity used by CFS tasks.
 * @cpu: the CPU to get the utilization for
 * @p: task for which the CPU utilization should be predicted or NULL
 * @dst_cpu: CPU @p migrates to, -1 if @p moves from @cpu or @p == NULL
 * @boost: 1 to enable boosting, otherwise 0
 *
 * The unit of the return value must be the same as the one of CPU capacity
 * so that CPU utilization can be compared with CPU capacity.
 *
 * CPU utilization is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on that CPU.
 * It represents the amount of CPU capacity currently used by CFS tasks in
 * the range [0..max CPU capacity] with max CPU capacity being the CPU
 * capacity at f_max.
 *
 * The estimated CPU utilization is defined as the maximum between CPU
 * utilization and sum of the estimated utilization of the currently
 * runnable tasks on that CPU. It preserves a utilization "snapshot" of
 * previously-executed tasks, which helps better deduce how busy a CPU will
 * be when a long-sleeping task wakes up. The contribution to CPU utilization
 * of such a task would be significantly decayed at this point of time.
 *
 * Boosted CPU utilization is defined as max(CPU runnable, CPU utilization).
 * CPU contention for CFS tasks can be detected by CPU runnable > CPU
 * utilization. Boosting is implemented in cpu_util() so that internal
 * users (e.g. EAS) can use it next to external users (e.g. schedutil),
 * latter via cpu_util_cfs_boost().
 *
 * CPU utilization can be higher than the current CPU capacity
 * (f_curr/f_max * max CPU capacity) or even the max CPU capacity because
 * of rounding errors as well as task migrations or wakeups of new tasks.
 * CPU utilization has to be capped to fit into the [0..max CPU capacity]
 * range. Otherwise a group of CPUs (CPU0 util = 121% + CPU1 util = 80%)
 * could be seen as over-utilized even though CPU1 has 20% of spare CPU
 * capacity. CPU utilization is allowed to overshoot current CPU capacity
 * though since this is useful for predicting the CPU capacity required
 * after task migrations (scheduler-driven DVFS).
 *
 * Return: (Boosted) (estimated) utilization for the specified CPU.
 */
static unsigned long
cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
{
	/* Read the PELT average utilization from this CPU's CFS runqueue (cfs_rq) */
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
	unsigned long runnable;

	if (boost) {
		runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
		util = max(util, runnable);
	}

	/*
	 * If @dst_cpu is -1 or @p migrates from @cpu to @dst_cpu remove its
	 * contribution. If @p migrates from another CPU to @cpu add its
	 * contribution. In all the other cases @cpu is not impacted by the
	 * migration so its util_avg is already correct.
	 */
	/* Case A: p leaves cpu, so subtract p's contribution from the current util */
	if (p && task_cpu(p) == cpu && dst_cpu != cpu)
		lsub_positive(&util, task_util(p));
	/* Case B: p moves onto cpu, so add p's util contribution */
	else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
		util += task_util(p);

	if (sched_feat(UTIL_EST)) {
		unsigned long util_est;

		/* Read cfs_rq->avg.util_est (the runqueue's estimated util) */
		util_est = READ_ONCE(cfs_rq->avg.util_est);

		/*
		 * During wake-up @p isn't enqueued yet and doesn't contribute
		 * to any cpu_rq(cpu)->cfs.avg.util_est.
		 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
		 * has been enqueued.
		 *
		 * During exec (@dst_cpu = -1) @p is enqueued and does
		 * contribute to cpu_rq(cpu)->cfs.util_est.
		 * Remove it to "simulate" cpu_util without @p's contribution.
		 *
		 * Despite the task_on_rq_queued(@p) check there is still a
		 * small window for a possible race when an exec
		 * select_task_rq_fair() races with LB's detach_task().
		 *
		 *   detach_task()
		 *     deactivate_task()
		 *       p->on_rq = TASK_ON_RQ_MIGRATING;
		 *       -------------------------------- A
		 *       dequeue_task()                    \
		 *         dequeue_task_fair()              + Race Time
		 *           util_est_dequeue()            /
		 *       -------------------------------- B
		 *
		 * The additional check "current == p" is required to further
		 * reduce the race window.
		 */
		/* p will migrate to / wake up on cpu: add its estimate even though it is not enqueued yet */
		if (dst_cpu == cpu)
			util_est += _task_util_est(p);
		/* p will leave cpu (e.g. exec): task_on_rq_queued(p) confirms p is on the runqueue, then subtract */
		else if (p && unlikely(task_on_rq_queued(p) || current == p))
			lsub_positive(&util_est, _task_util_est(p));

		/* Take the max of util_avg and util_est so that bursty load is not underestimated */
		util = max(util, util_est);
	}

	/*
	 * arch_scale_cpu_capacity(cpu) is this CPU's maximum capacity; it differs
	 * between big and little cores (e.g. little = 512, big = 1024).
	 * util is capped so that it never exceeds it.
	 */
	return min(util, arch_scale_cpu_capacity(cpu));
}
Return value unit: the same as CPU capacity (usually normalized to the range 0..1024).
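To make the control flow above easier to experiment with, here is a minimal user-space sketch of the same arithmetic. It is not kernel code: `struct model_cpu`, `struct model_task`, `model_cpu_util()` and the helper names are hypothetical stand-ins invented for this note, and the sketch assumes the UTIL_EST feature is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU CFS signals read by cpu_util(). */
struct model_cpu {
	unsigned long util_avg;     /* models cfs_rq->avg.util_avg */
	unsigned long runnable_avg; /* models cfs_rq->avg.runnable_avg */
	unsigned long util_est;     /* models cfs_rq->avg.util_est */
	unsigned long capacity;     /* models arch_scale_cpu_capacity(cpu) */
};

/* Hypothetical stand-in for the task state consulted by cpu_util(). */
struct model_task {
	int cpu;                /* models task_cpu(p) */
	unsigned long util;     /* models task_util(p) */
	unsigned long util_est; /* models _task_util_est(p) */
	bool queued;            /* models task_on_rq_queued(p) */
	bool is_current;        /* models current == p */
};

static inline unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static inline unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Subtract without wrapping below zero, in the spirit of lsub_positive(). */
static inline unsigned long sub_positive(unsigned long a, unsigned long b)
{
	return a - min_ul(a, b);
}

/* Model of cpu_util() for the CPU numbered @cpu, whose signals are in @c. */
unsigned long model_cpu_util(const struct model_cpu *c, int cpu,
			     const struct model_task *p, int dst_cpu, bool boost)
{
	unsigned long util = c->util_avg;
	unsigned long util_est = c->util_est;

	assert(c->capacity > 0);

	if (boost)
		util = max_ul(util, c->runnable_avg);

	/* util_avg part: remove p if it leaves @cpu, add it if it arrives. */
	if (p && p->cpu == cpu && dst_cpu != cpu)
		util = sub_positive(util, p->util);
	else if (p && p->cpu != cpu && dst_cpu == cpu)
		util += p->util;

	/* util_est part (this sketch assumes UTIL_EST is enabled). */
	if (p && dst_cpu == cpu)
		util_est += p->util_est;
	else if (p && (p->queued || p->is_current))
		util_est = sub_positive(util_est, p->util_est);

	util = max_ul(util, util_est);

	/* Cap the result at the CPU's maximum capacity. */
	return min_ul(util, c->capacity);
}
```

With util_avg = 300, util_est = 350 and runnable_avg = 600 on a CPU of capacity 1024, the model returns 350 without boosting and 600 with boosting. The extra `p &&` guard in the util_est branch is a defensive addition of the sketch; the kernel relies on callers passing dst_cpu = -1 when p is NULL.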
2. Caveats
/*
 * During wake-up @p isn't enqueued yet and doesn't contribute
 * to any cpu_rq(cpu)->cfs.avg.util_est.
 * If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
 * has been enqueued.
 *
 * During exec (@dst_cpu = -1) @p is enqueued and does
 * contribute to cpu_rq(cpu)->cfs.util_est.
 * Remove it to "simulate" cpu_util without @p's contribution.
 *
 * Despite the task_on_rq_queued(@p) check there is still a
 * small window for a possible race when an exec
 * select_task_rq_fair() races with LB's detach_task().
 *
 *   detach_task()
 *     deactivate_task()
 *       p->on_rq = TASK_ON_RQ_MIGRATING;
 *       -------------------------------- A
 *       dequeue_task()                    \
 *         dequeue_task_fair()              + Race Time
 *           util_est_dequeue()            /
 *       -------------------------------- B
 *
 * The additional check "current == p" is required to further
 * reduce the race window.
 */
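This comment matters a great deal: it describes a subtle but important race condition in the Linux scheduler, and how the kernel mitigates it with an extra check (current == p).
🎯 Core question: how can CPU utilization be simulated accurately in the "gap" while a task's state is changing?
cpu_util() is often used for predictive scheduling decisions, for example:
On task wake-up, to choose the target CPU (select_task_rq_fair());
During load balancing, to evaluate the effect of a migration.
At these moments task p may be in an intermediate state: it no longer fully belongs to the old CPU and has not yet fully joined the new one. The value of cfs_rq->avg.util_est alone then does not tell you whether p's contribution is already included.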
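🔍 I. Two typical scenarios
Scenario 1: task wake-up
p is not enqueued yet (p->on_rq == 0);
so cpu_rq(cpu)->cfs.avg.util_est does not include p's contribution;
but the scheduler wants to know: "if p were placed on dst_cpu, what would the utilization be?"
✅ So _task_util_est(p) is added explicitly.
Scenario 2: task exec() or migration
p is still on the original CPU's runqueue (p->on_rq == TASK_ON_RQ_QUEUED);
so cfs_rq->avg.util_est already includes p's contribution;
but the scheduler wants to know: "if p were removed, how much utilization would be left?"
✅ So _task_util_est(p) is subtracted explicitly.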
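The two scenarios can be sketched with a toy runqueue that tracks nothing but the sum of its tasks' estimated utilization. All names here (`struct model_rq`, `model_enqueue()`, `predict_with_task()` and so on) are hypothetical and exist only for illustration; the point is that each prediction equals the value the runqueue would hold after the real enqueue or dequeue.

```c
/* Hypothetical model: the runqueue keeps only the sum of its tasks' util_est. */
struct model_rq {
	unsigned long util_est;
};

/* What enqueueing a task does to the runqueue's estimate. */
void model_enqueue(struct model_rq *rq, unsigned long task_est)
{
	rq->util_est += task_est;
}

/* What dequeueing does; clamped so the sum never wraps below zero. */
void model_dequeue(struct model_rq *rq, unsigned long task_est)
{
	rq->util_est -= (task_est < rq->util_est) ? task_est : rq->util_est;
}

/* Scenario 1: p is not enqueued yet, so add it to predict the value after wake-up. */
unsigned long predict_with_task(const struct model_rq *rq, unsigned long task_est)
{
	return rq->util_est + task_est;
}

/* Scenario 2: p is still enqueued, so subtract it to predict the value without p. */
unsigned long predict_without_task(const struct model_rq *rq, unsigned long task_est)
{
	return rq->util_est - ((task_est < rq->util_est) ? task_est : rq->util_est);
}
```

Starting from a runqueue estimate of 300 and a task estimate of 200, `predict_with_task()` gives 500, which is exactly what the runqueue holds after `model_enqueue()`; from there `predict_without_task()` gives 300 again, matching the state after `model_dequeue()`.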
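⚠️ II. The race condition in detail
The problem lies in a very short time window in scenario 2 (exec/migration):
📜 Background: the steps of a task migration
When the load balancer (LB) migrates task p, it calls:
detach_task() → deactivate_task() → dequeue_task() → dequeue_task_fair()
The listings below show these functions. The trace_android_rvh_* calls in dequeue_task() are Android vendor hooks, which suggests this copy was taken from an Android kernel tree rather than mainline. In deactivate_task():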
void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
	SCHED_WARN_ON(flags & DEQUEUE_SLEEP);

	/* Mark the task as "migrating" */
	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
	ASSERT_EXCLUSIVE_WRITER(p->on_rq);

	/*
	 * Code explicitly relies on TASK_ON_RQ_MIGRATING begin set *before*
	 * dequeue_task() and cleared *after* enqueue_task().
	 */

	dequeue_task(rq, p, flags);
}
EXPORT_SYMBOL_GPL(deactivate_task);

/*
 * Must only return false when DEQUEUE_SLEEP.
 */
inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
	bool dequeue_task_result;
	if (sched_core_enabled(rq))
		sched_core_dequeue(rq, p, flags);

	if (!(flags & DEQUEUE_NOCLOCK))
		update_rq_clock(rq);

	if (!(flags & DEQUEUE_SAVE))
		sched_info_dequeue(rq, p);

	psi_dequeue(p, flags);

	/*
	 * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
	 * and mark the task ->sched_delayed.
	 */
	uclamp_rq_dec(rq, p);
	trace_android_rvh_dequeue_task(rq, p, flags);
	dequeue_task_result = p->sched_class->dequeue_task(rq, p, flags);
	trace_android_rvh_after_dequeue_task(rq, p, flags, &dequeue_task_result);
	return dequeue_task_result;
}

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/* util_est_dequeue() runs here: this is where p is really removed from util_est */
	if (!p->se.sched_delayed)
		util_est_dequeue(&rq->cfs, p);

	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
	if (dequeue_entities(rq, &p->se, flags) < 0)
		return false;

	/*
	 * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
	 */

	hrtick_update(rq);
	return true;
}
🕒 Race window
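In this gap (between A and B):
p->on_rq == TASK_ON_RQ_MIGRATING;
task_on_rq_queued(p) returns false (p is no longer QUEUED);
but util_est has not yet had p's contribution subtracted (util_est_dequeue() has not been called yet)!
❌ Resulting problem
If p's exec path calls select_task_rq_fair() at this moment:
it checks task_on_rq_queued(p) → false;
so lsub_positive(&util_est, ...) would not be executed;
but util_est still contains p's old value;
result: the utilization of the CPU that p is leaving is overestimated, which can lead to a wrong scheduling decision!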
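The window between points A and B can be replayed step by step in a small model. Everything below is a hypothetical sketch written for this note: `struct model_state` and the `model_*` functions are invented names, and only the numeric values 1 and 2 are meant to mirror TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING. The estimator deliberately checks only the "queued" state, so it shows what would happen without any further check.

```c
/* Values chosen to mirror TASK_ON_RQ_QUEUED and TASK_ON_RQ_MIGRATING. */
enum {
	MODEL_ON_RQ_QUEUED = 1,
	MODEL_ON_RQ_MIGRATING = 2,
};

/* Hypothetical snapshot of the two pieces of state involved in the race. */
struct model_state {
	int on_rq;                 /* models p->on_rq */
	unsigned long rq_util_est; /* models cfs_rq->avg.util_est */
};

/* Point A: deactivate_task() marks the task as migrating. */
void model_mark_migrating(struct model_state *s)
{
	s->on_rq = MODEL_ON_RQ_MIGRATING;
}

/* Point B: util_est_dequeue() finally removes the task's estimate. */
void model_util_est_dequeue(struct model_state *s, unsigned long task_est)
{
	s->rq_util_est -= (task_est < s->rq_util_est) ? task_est : s->rq_util_est;
}

/* "Utilization without p", using only the queued check. */
unsigned long est_without_p_queued_only(const struct model_state *s,
					unsigned long task_est)
{
	if (s->on_rq == MODEL_ON_RQ_QUEUED)
		return s->rq_util_est > task_est ? s->rq_util_est - task_est : 0;

	/* Not queued: assume the runqueue sum no longer contains p. */
	return s->rq_util_est;
}
```

With a runqueue estimate of 500 that includes 200 from p, the estimator returns 300 before point A, 500 between A and B (the overestimate), and 300 again after point B.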
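🛡 III. The fix: the current == p check
To narrow this race window, the kernel adds the extra condition current == p:
if (p && unlikely(task_on_rq_queued(p) || current == p))
	lsub_positive(&util_est, _task_util_est(p));
💡 Why does current == p help?
In the exec() path, select_task_rq_fair() is called in the context of task p itself (current == p);
even if p is in the TASK_ON_RQ_MIGRATING state, as long as p itself is doing the exec it is safe to assume: "I am about to leave this CPU, so my contribution should be removed from util_est."
✅ Effect
Inside the race window, a query triggered by p itself still subtracts its util_est correctly;
this greatly lowers the chance of a misjudgment (the race is not eliminated completely, but the window becomes very small).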
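As a small illustration of the condition itself, the sketch below pulls the predicate out into a plain function so that the combinations of its inputs can be tried directly. `should_subtract_util_est()` and `util_est_without_task()` are hypothetical names used only here; the boolean parameters stand for `p != NULL`, `task_on_rq_queued(p)` and `current == p`.

```c
#include <stdbool.h>

/* Mirrors the condition guarding lsub_positive(&util_est, ...) in cpu_util(). */
bool should_subtract_util_est(bool has_task, bool on_rq_queued, bool is_current)
{
	return has_task && (on_rq_queued || is_current);
}

/* Estimated utilization of the runqueue once the task's share is discounted. */
unsigned long util_est_without_task(unsigned long rq_util_est,
				    unsigned long task_est,
				    bool on_rq_queued, bool is_current)
{
	if (!should_subtract_util_est(true, on_rq_queued, is_current))
		return rq_util_est;

	/* Clamp at zero instead of wrapping, as lsub_positive() does. */
	return rq_util_est > task_est ? rq_util_est - task_est : 0;
}
```

Inside the race window the task is no longer "queued", but in the exec path it is still the current task, so `util_est_without_task(500, 200, false, true)` yields 300 rather than 500.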
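3. What the different argument combinations mean
🔁 First, the function signature again
unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
cpu: the CPU whose utilization is being estimated
p: a task (may be NULL), used to simulate the effect of a migration or wake-up
dst_cpu:
== cpu: p will migrate to, or wake up on, cpu
== -1: p will be removed from cpu (e.g. exec), or p == NULL
any other value: p moves to a different CPU; its contribution is removed if it is currently on cpu, otherwise cpu is not affected
boost:
1: enable runnable boosting (take max(util_avg, runnable_avg))
0: use only util_avg / util_est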
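The way cpu, p and dst_cpu combine for the util_avg part can be written as a small decision function. This is a sketch for illustration only: `enum util_adjust` and `classify_util_adjust()` are hypothetical names, and the function covers just the two util_avg branches of cpu_util(), not the util_est handling.

```c
#include <stdbool.h>

/* What happens to @cpu's util_avg for a given (p, dst_cpu) combination. */
enum util_adjust {
	UTIL_UNCHANGED,   /* @cpu is not affected */
	UTIL_REMOVE_TASK, /* p leaves @cpu: subtract task_util(p) */
	UTIL_ADD_TASK,    /* p arrives on @cpu: add task_util(p) */
};

enum util_adjust classify_util_adjust(bool has_task, int task_cpu,
				      int cpu, int dst_cpu)
{
	/* p is on @cpu and goes elsewhere (dst_cpu == -1 or another CPU). */
	if (has_task && task_cpu == cpu && dst_cpu != cpu)
		return UTIL_REMOVE_TASK;

	/* p is on another CPU and is headed for @cpu. */
	if (has_task && task_cpu != cpu && dst_cpu == cpu)
		return UTIL_ADD_TASK;

	/* No task, p stays where it is, or the move does not involve @cpu. */
	return UTIL_UNCHANGED;
}
```

For example, `classify_util_adjust(true, 2, 2, -1)` and `classify_util_adjust(true, 2, 2, 3)` both report a removal, while `classify_util_adjust(true, 1, 2, 3)` leaves CPU 2 untouched.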
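📊 Summary table
| Call | Meaning | Simulates task p | Boost | Typical use |
|---|---|---|---|---|
| cpu_util(cpu, NULL, -1, 0) | Current base utilization | ❌ | ❌ | EAS baseline, internal statistics |
| cpu_util(cpu, NULL, -1, 1) | Current boosted utilization (includes queueing pressure) | ❌ | ✅ | Load balancing, schedutil (via cpu_util_cfs_boost()) |
| cpu_util(cpu, p, -1, 0) | Utilization after removing task p | ✅ (removed) | ❌ | exec, evaluating the load of the source CPU |
| cpu_util(cpu, p, cpu, 0) | Utilization after adding task p | ✅ (added) | ❌ | EAS CPU selection (find_energy_efficient_cpu) |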
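3.1 cpu_util(cpu, p, cpu, 0)
🔹 Meaning:
Predict: if task p is migrated to (or woken up on) CPU cpu, what will that CPU's utilization be?
🔸 Arguments:
dst_cpu == cpu: p will join cpu;
boost = 0: EAS cares about the actual compute demand rather than queueing pressure (after the migration the task may have the CPU to itself).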
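3.2 cpu_util(cpu, p, -1, 0)
🔹 Meaning:
Estimate: if task p is removed from CPU cpu, what will that CPU's utilization be?
🔸 Arguments:
p is a task currently on cpu;
dst_cpu = -1: p is leaving cpu (e.g. exec or migrating away);
boost = 0: boosting is disabled.
🔸 Typical uses:
On the exec() system call: the task is about to replace its own image, and the scheduler needs the CPU load "without the current task";
Load-balancing decisions: computing the remaining load of the source CPU after task p has moved away.
💡 Internally this performs:
util = current_util - task_util(p), clamped at 0 by lsub_positive() (util_est is reduced by _task_util_est(p) in the same way)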
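The subtraction is done with lsub_positive() rather than a plain minus because the operands are unsigned: if the task's tracked utilization happens to be larger than the CPU's current value, an ordinary subtraction would wrap around to a huge number. The function below is a user-space sketch of that clamping behaviour; `sub_positive_ul()` is a hypothetical name, not the kernel macro.

```c
/*
 * Subtract @amount from @value, clamping the result at zero.
 * At most @value is removed, so the unsigned result never wraps.
 */
unsigned long sub_positive_ul(unsigned long value, unsigned long amount)
{
	unsigned long delta = amount < value ? amount : value;

	return value - delta;
}
```

`sub_positive_ul(300, 100)` gives 200 and `sub_positive_ul(100, 300)` gives 0, whereas the plain unsigned expression `100UL - 300UL` wraps to a value far above any CPU capacity.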
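3.3 cpu_util(cpu, NULL, -1, 1)
🔹 Meaning:
Get the current "boosted" utilization of CPU cpu, which reflects whether tasks are queueing on the CPU (contention).
🔸 Arguments:
p = NULL: no task migration is involved;
boost = 1: boosting enabled → util = max(util_avg, runnable_avg)
🔸 Typical uses:
Load balancing: judging whether a CPU is "overloaded".
Even with util_avg = 500, a runnable_avg of 1000 shows that tasks are competing for the CPU (for example two tasks that both want to run), so the CPU is effectively fully loaded!
cpu_util_cfs_boost(), used by the schedutil governor, is a wrapper around this call.
🌟 Key difference: this value detects the case where CPU utilization is not high but tasks are queueing, so the CPU is not mistaken for idle.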
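The numbers in the example above can be checked with a few lines of C. This is a reduced sketch that leaves out util_est and any task simulation; `boosted_util()` and `has_contention()` are hypothetical helper names, and the capacity of 1024 used in the usage note is just the value for a big core assumed earlier in these notes.

```c
#include <stdbool.h>

/* Boosted utilization: max(util_avg, runnable_avg), capped at the CPU capacity. */
unsigned long boosted_util(unsigned long util_avg, unsigned long runnable_avg,
			   unsigned long capacity)
{
	unsigned long util = util_avg > runnable_avg ? util_avg : runnable_avg;

	return util < capacity ? util : capacity;
}

/* Contention shows up as runnable_avg exceeding util_avg. */
bool has_contention(unsigned long util_avg, unsigned long runnable_avg)
{
	return runnable_avg > util_avg;
}
```

With util_avg = 500 and runnable_avg = 1000 on a CPU of capacity 1024, `boosted_util()` returns 1000 and `has_contention()` is true; when both signals are 500, the boosted value stays at 500 and no contention is reported.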
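3.4 cpu_util(cpu, NULL, -1, 0)
🔹 Meaning:
Get the current "base" utilization of CPU cpu, with no task-migration simulation and no boosting.
🔸 Arguments:
p = NULL: no particular task is considered;
dst_cpu = -1: no migration;
boost = 0: only util_avg and util_est are used (their max); the runnable signal is ignored.
🔸 Typical uses:
As the baseline when EAS (Energy Aware Scheduling) computes the "current load";
Internal statistics that do not need to be contention-aware.
📌 This is the most "conservative" utilization estimate: it reflects actual running time plus the historical estimate, but not queueing pressure.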
