
BCC: An Analysis of the Scheduler Tools

news  Source: original  2025/6/9 14:13:20

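Tool	Type	Trace points
cpuunclaimed	Unclaimed idle CPUs	frequency (timed sampling, 99 Hz)
runqlen	Run queue length	frequency (timed sampling, 99 Hz)
runqlat	Latency from enqueue to getting the CPU, as a histogram	wakeup/wakeup_new -> sched_switch
runqslower	Latency from enqueue to getting the CPU, per event above a threshold	wakeup/wakeup_new -> sched_switch
offwaketime	Time from losing the CPU to being woken, by off-CPU and waker stack	sched_switch -> wakeup
wakeuptime	Time from going to sleep to being woken, by waker stack	sched_switch -> wakeup
cpudist	Time spent on-CPU each time a task is scheduled in	sched_switch -> sched_switch
offcputime	Time spent off-CPU	sched_switch -> sched_switch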


Hook points

tracepoint

sched_wakeup

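sched_wakeup is always reached through the try_to_wake_up function, so kprobe/try_to_wake_up has essentially the same effect as this tracepoint.

It fires when a task is woken up and placed on the run queue to wait for the scheduler to run it.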

linux-5.10.202/kernel/sched/core.c: 2846

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    trace_sched_wakeup(p);

The other function that reaches this tracepoint is ttwu_do_wakeup, which is itself called from try_to_wake_up.

linux-5.10.202/kernel/sched/core.c: 2472

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
			   struct rq_flags *rf)
{
	trace_sched_wakeup(p);

e.g. a wakeup coming from a softirq

#0  ttwu_do_wakeup (rq=rq@entry=0xffff888029632fc0, p=p@entry=0xffff88800423dc00, wake_flags=wake_flags@entry=0, rf=rf@entry=0xffffc90000003f58) at kernel/sched/core.c:2474
#1  0xffffffff81111760 in ttwu_do_activate (rq=rq@entry=0xffff888029632fc0, p=p@entry=0xffff88800423dc00, wake_flags=wake_flags@entry=0, rf=rf@entry=0xffffc90000003f58) at kernel/sched/core.c:2526
#2  0xffffffff8111296e in ttwu_queue (wake_flags=0, cpu=0, p=0xffff88800423dc00) at kernel/sched/core.c:2722
#3  try_to_wake_up (p=0xffff88800423dc00, state=state@entry=3, wake_flags=wake_flags@entry=0) at kernel/sched/core.c:3000
#4  0xffffffff81112cb1 in wake_up_process (p=<optimized out>) at kernel/sched/core.c:3069
#5  0xffffffff81e00239 in wakeup_softirqd () at kernel/softirq.c:76
#6  __do_softirq () at kernel/softirq.c:320

sched_wakeup_new

This usually comes from clone/fork creating a child process; the child is placed on the run queue right away to wait for scheduling.

linux-5.10.202/kernel/sched/core.c: 3355

void wake_up_new_task(struct task_struct *p)
{
    trace_sched_wakeup_new(p);          // tracepoint

e.g. coming from the clone system call

#0  wake_up_new_task (p=p@entry=0xffff8880041edc00) at kernel/sched/core.c:3356
#1  0xffffffff810d92b2 in kernel_clone (args=args@entry=0xffffc900001e3ed0) at kernel/fork.c:2530
#2  0xffffffff810d96b2 in __do_sys_clone (clone_flags=<optimized out>, newsp=<optimized out>, parent_tidptr=<optimized out>, child_tidptr=<optimized out>, tls=<optimized out>) at kernel/fork.c:2623

sched_switch

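The context switch happens immediately after this tracepoint: the prev task is switched out in favor of the next task. It is reached when a task is preempted, when its time slice runs out, when I/O becomes ready, or when the scheduler otherwise decides it is time to switch.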

linux-5.10.202/kernel/sched/core.c: 4430

static void __sched notrace __schedule(bool preempt)
{
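    trace_sched_switch(preempt, prev, next);        // tracepoint
    
    rq = context_switch(rq, prev, next, &rf);       // switches the address space (mm) and the register context
    
    return finish_task_switch(prev);                // tail of context_switch(); some bcc tools below kprobe this function, to the same effect as this tracepoint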

kprobe

finish_task_switch

Serves the same purpose as tracepoint/sched_switch.

try_to_wake_up

Serves the same purpose as tracepoint/sched_wakeup.

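References

Hello World from a new operating-system perspective (从操作系统新的视角看hello world)
JD.com listing: Understanding Linux Processes and Memory in Depth (深入理解Linux进程与内存：修炼底层内功，掌握高性能原理)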

libbpf-tools/runqlat

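Measures the interval from when a task is placed on the run queue (or loses the CPU while still runnable) to when it gets the CPU and starts running, and presents it as a histogram.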

# /usr/share/bcc/libbpf-tools/runqlat
Tracing run queue latency... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 71       |***                                     |
         2 -> 3          : 167      |*******                                 |
         4 -> 7          : 522      |**********************                  |
         8 -> 15         : 920      |****************************************|
        16 -> 31         : 784      |**********************************      |
        32 -> 63         : 233      |**********                              |
        64 -> 127        : 125      |*****                                   |
       128 -> 255        : 42       |*                                       |
       256 -> 511        : 20       |                                        |
       512 -> 1023       : 4        |                                        |
      1024 -> 2047       : 1        |                                        |

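How it works

The start time is taken when the task is enqueued at tracepoint/sched_wakeup or tracepoint/sched_wakeup_new, or when it is switched out at sched_switch while still runnable. The end time is taken at sched_switch, just before the context switch that puts the task on the CPU. The tool records the difference.

In other words, it is the delay between a task becoming eligible for the CPU and the task actually starting to run.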
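The bookkeeping described above can be sketched in user-space Python. This is only an illustration: the event tuples are a made-up format for the example, not what the tracepoints actually deliver, and `run_queue_latencies` is a hypothetical helper, not part of runqlat.

```python
# Hypothetical event tuples (an assumption for this sketch):
#   ("wakeup", ts_us, pid)                                  sched_wakeup / sched_wakeup_new
#   ("switch", ts_us, prev_pid, prev_runnable, next_pid)    sched_switch
def run_queue_latencies(events):
    enqueued = {}    # pid -> time at which the task became runnable
    latencies = []   # (pid, latency_us) in the order tasks got the CPU
    for ev in events:
        if ev[0] == "wakeup":
            _, ts, pid = ev
            enqueued[pid] = ts
        elif ev[0] == "switch":
            _, ts, prev_pid, prev_runnable, next_pid = ev
            if prev_runnable:
                # prev lost the CPU but can still run: it starts waiting again
                enqueued[prev_pid] = ts
            start = enqueued.pop(next_pid, None)
            if start is not None:
                latencies.append((next_pid, ts - start))
    return latencies


events = [
    ("wakeup", 100, 7),
    ("switch", 130, 1, True, 7),    # pid 1 is preempted, pid 7 waited 30 us
    ("switch", 180, 7, False, 1),   # pid 7 blocks, pid 1 waited 50 us
]
print(run_queue_latencies(events))  # [(7, 30), (1, 50)]
```

The real tool stores the timestamps in a BPF map keyed by task and feeds each latency into a log2 histogram instead of a list.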

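Use cases

For normal tasks, this latency depends among other things on the CFS scheduling period (the window within which the scheduler aims to run every runnable task on the current CPU once).

References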

github bcc/tools/runqlat_example.txt

libbpf-tools/runqslower

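Measures the interval from when a task is placed on the run queue (or loses the CPU while still runnable) to when it gets the CPU and starts running.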

# /usr/share/bcc/libbpf-tools/runqslower 
Tracing run queue latency higher than 10000 us
TIME     COMM             TID           LAT(us)
15:30:51 b'node'          2027            10163
15:30:51 b'node'          2022            11320
15:30:51 b'node'          2023            11117
15:30:51 b'node'          2023            10775
15:30:51 b'node'          1877            10810
15:30:51 b'rcu_sched'     14              12235
15:30:51 b'tuned'         1204            10152

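How it works

The implementation is almost identical to runqlat. The only difference is that runqlat aggregates a histogram, while runqslower prints each event whose scheduling latency exceeds the threshold as it happens.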
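The difference from runqlat comes down to a threshold test on each latency. A minimal sketch, assuming latencies have already been collected as `(comm, tid, latency_us)` tuples; `slower_than` is a hypothetical helper, and the `sshd` record is invented for the example (the other two rows are taken from the output above):

```python
def slower_than(records, min_us=10000):
    """Keep only (comm, tid, latency_us) records whose latency exceeds min_us."""
    return [rec for rec in records if rec[2] > min_us]


records = [
    ("node", 2027, 10163),
    ("sshd", 900, 42),        # invented fast wakeup, filtered out
    ("tuned", 1204, 10152),
]
for comm, tid, lat in slower_than(records):
    print("%-16s %-6d %14d" % (comm, tid, lat))
```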

Use cases

References

github bcc/tools/runqslower_example.txt

libbpf-tools/runqlen

Samples the run queue length, i.e. the number of tasks waiting on the run queue.

# /usr/share/bcc/libbpf-tools/runqlen 
Sampling run queue length... Hit Ctrl-C to end.
     runqlen       : count     distribution
        0          : 944      |****************************************|

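How it works

A BPF program runs on a timed (frequency) perf event 99 times per second and reads the number of runnable tasks in the CFS (Completely Fair Scheduler) run queue that the current task belongs to.

Each CPU has its own CFS run queue.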

int do_perf_event()
{
    unsigned int len = 0;
    pid_t pid = 0;
    struct task_struct *task = NULL;
    struct cfs_rq_partial *my_q = NULL;

    // Fetch the run queue length from task->se.cfs_rq->nr_running. This is an
    // unstable interface and may need maintenance. Perhaps a future version
    // of BPF will support task_rq(p) or something similar as a more reliable
    // interface.
    task = (struct task_struct *)bpf_get_current_task();
    my_q = (struct cfs_rq_partial *)task->se.cfs_rq;
    len = my_q->nr_running;

    // Calculate run queue length by subtracting the currently running task,
    // if present. len 0 == idle, len 1 == one running task.
    if (len > 0)
        len--;      // only one task, the running one, means nothing is waiting in the queue
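The effect of the `len--` adjustment on the resulting histogram can be sketched in Python. The sample values are invented, and `queue_length_histogram` is a hypothetical helper standing in for the BPF histogram map:

```python
from collections import Counter


def queue_length_histogram(nr_running_samples):
    """Count how often each run queue length was observed."""
    hist = Counter()
    for nr_running in nr_running_samples:
        # nr_running includes the task currently on the CPU, so subtract it
        waiting = nr_running - 1 if nr_running > 0 else 0
        hist[waiting] += 1
    return dict(sorted(hist.items()))


# Invented samples of cfs_rq->nr_running taken at 99 Hz
print(queue_length_histogram([0, 1, 1, 3, 2, 1]))  # {0: 4, 1: 1, 2: 1}
```

Note that an idle CPU (0) and a CPU running exactly one task (1) both land in bucket 0, which is why an all-zero histogram does not by itself mean the CPU was idle.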

Use cases

Related to the load reported in /proc/loadavg.

References

github bcc/tools/runqlen_example.txt

libbpf-tools/cpudist

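Traces how long a task stays on the CPU each time it is scheduled in. Under the CFS scheduler this is affected by priority and by the number of tasks (which feeds into the scheduling period calculation).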

# /usr/share/bcc/tools/cpudist 
Tracing on-CPU time... Hit Ctrl-C to end.
^C
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 13       |****                                    |
         4 -> 7          : 72       |**************************              |
         8 -> 15         : 84       |******************************          |
        16 -> 31         : 78       |****************************            |
        32 -> 63         : 110      |****************************************|
        64 -> 127        : 59       |*********************                   |
       128 -> 255        : 36       |*************                           |
       256 -> 511        : 13       |****                                    |
       512 -> 1023       : 4        |*                                       |
      1024 -> 2047       : 1        |                                        |

How it works

At tracepoint/sched_switch, it measures the time between a task being switched in and being switched out.
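A rough user-space sketch of this measurement, together with the power-of-two bucketing used in the histogram output above. The `(ts_us, prev_pid, next_pid)` tuples are an assumed format for the example, and both helpers are hypothetical:

```python
def on_cpu_durations(switches):
    """switches: (ts_us, prev_pid, next_pid) tuples in time order."""
    started = {}     # pid -> time at which the task was switched in
    durations = []
    for ts, prev_pid, next_pid in switches:
        if prev_pid in started:
            durations.append(ts - started.pop(prev_pid))
        started[next_pid] = ts
    return durations


def log2_bucket(value):
    """Return the (low, high) bounds of the power-of-two bucket holding value."""
    if value <= 1:
        return (0, 1)
    high_bit = value.bit_length() - 1
    return (1 << high_bit, (1 << (high_bit + 1)) - 1)


switches = [(0, 0, 10), (12, 10, 11), (17, 11, 10), (117, 10, 0)]
durations = on_cpu_durations(switches)
print(durations)                            # [12, 5, 100]
print([log2_bucket(d) for d in durations])  # [(8, 15), (4, 7), (64, 127)]
```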

Use cases

References

github bcc/tools/cpudist_example.txt

libbpf-tools/wakeuptime

Traces the time a task spends blocked between going to sleep and being woken, aggregated by waker stack.

# /usr/share/bcc/libbpf-tools/wakeuptime 
Tracing blocked time (us) by kernel stack... Hit Ctrl-C to end.
^C
    target:          sshd
    ffffffffab114bd0 try_to_wake_up
    ffffffffab113c66 ttwu_do_wakeup
    ffffffffab206a44 bpf_trace_run1
    ffffffffc06c8dc0 cleanup_module
    ffffffffab20697e bpf_get_stackid_raw_tp
    waker:           kworker/u256:1
        46

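How it works

At tracepoint/sched_switch, when the task is about to go off the CPU, a timestamp is recorded.

At tracepoint/sched_wakeup, the tool computes how much time has passed since the task went off the CPU and records a stack at that moment. The recorded stack is the waker's stack.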
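The two steps can be sketched as follows. The event tuples, the pid, and the stack are invented for illustration, and `blocked_time_by_waker` is a hypothetical helper rather than the tool's code:

```python
from collections import defaultdict


def blocked_time_by_waker(events):
    """Sum blocked time per (target pid, waker name, waker stack)."""
    # Assumed event formats for this sketch:
    #   ("switch_out", ts_us, pid)                 task leaves the CPU
    #   ("wakeup", ts_us, pid, waker, waker_stack) task is woken by `waker`
    off_since = {}
    totals = defaultdict(int)
    for ev in events:
        if ev[0] == "switch_out":
            _, ts, pid = ev
            off_since[pid] = ts
        elif ev[0] == "wakeup":
            _, ts, pid, waker, stack = ev
            start = off_since.pop(pid, None)
            if start is not None:
                totals[(pid, waker, stack)] += ts - start
    return dict(totals)


stack = ("try_to_wake_up", "ttwu_do_wakeup")
events = [
    ("switch_out", 100, 5),
    ("wakeup", 146, 5, "kworker/u256:1", stack),   # blocked for 46 us
    ("switch_out", 200, 5),
    ("wakeup", 230, 5, "kworker/u256:1", stack),   # blocked for 30 us
]
print(blocked_time_by_waker(events))
```

Both wakeups share the same key, so the example accumulates 76 us under a single entry, which mirrors how the tool prints one total per target, stack, and waker.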

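Use cases

For example, blocking on a lock takes a task off the CPU, and these tools measure how long that lasts. Both wakeuptime and offwaketime record stacks alongside the off-CPU time, which makes it easier to see why the task was scheduled out.

References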

github bcc/tools/wakeuptime_example.txt

tools/offwaketime

Measures the time from a task losing the CPU to being woken again, aggregated by stack.

# /usr/share/bcc/tools/offwaketime 
Tracing blocked time (us) by user + kernel off-CPU and waker stack... Hit Ctrl-C to end.
^C
    waker:            0
    --               --
    finish_task_switch
    __schedule
    schedule_idle
    do_idle
    cpu_startup_entry
    start_secondary
    secondary_startup_64_no_verify
    target:          swapper/3 0
        1

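How it works

This tool only offers a kprobe-based implementation. It instruments the finish_task_switch function, which serves the same purpose as tracepoint/sched_switch. The function is called at the end of context_switch, which switches the current CPU from the prev task to the next task; the stack at which the task lost the CPU is recorded here.

It also instruments the try_to_wake_up function, which sits close to the sched_wakeup tracepoint. Both are on the task wakeup path, and the stack at wakeup time is recorded here.

The tool is similar to libbpf-tools/wakeuptime. The difference is that this one also records the stack at which the task lost the CPU.

Use cases

References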

github bcc/tools/offwaketime_example.txt

tools/cpuunclaimed

Measures unclaimed idle CPUs: moments when tasks are waiting to run while some CPU sits idle with nothing assigned to it.

# /usr/share/bcc/tools/cpuunclaimed.py 
Sampling run queues... Output every 1 seconds. Hit Ctrl-C to end.
%CPU  12.63%, unclaimed idle 86.36%
%CPU  12.50%, unclaimed idle 87.50%
%CPU  12.63%, unclaimed idle 87.37%
%CPU  12.75%, unclaimed idle 87.25%
%CPU  12.50%, unclaimed idle 87.50%
%CPU  12.63%, unclaimed idle 87.37%
%CPU  12.50%, unclaimed idle 87.50%
%CPU  12.50%, unclaimed idle 87.50%

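How it works

The tool uses timed perf sampling at a frequency of 99 Hz.

It estimates how often CPUs go unclaimed, meaning tasks are crowded onto a few CPUs: some cores are idle while others have too many tasks.

The most likely cause of abnormal unclaimed idle is CPU pinning, with several tasks bound to the same CPU. The example above is an 8-core machine running 8 tasks that are all bound to a single CPU.

do_perf_event

It mainly collects a timestamp, the current CPU number, and the run queue length of the current CPU.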

struct data_t {
    u64 ts;
    u64 cpu;
    u64 len;
};

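int do_perf_event(struct bpf_perf_event_data *ctx)
{
    int cpu = bpf_get_smp_processor_id();
    u64 now = bpf_ktime_get_ns();

    // Read the run queue length from the current task's CFS run queue
    unsigned int len = 0;
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    struct cfs_rq_partial *my_q = (struct cfs_rq_partial *)task->se.cfs_rq;
    len = my_q->nr_running;

    // One sample per timer tick: timestamp, CPU number, queue length
    struct data_t data = {.ts = now, .cpu = cpu, .len = len};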

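During collection, the samples are analyzed in groups (one group per interval, 1 s), and a few numbers are derived:

running
- number of samples in the interval in which the CPU was running a task
idle
- number of samples in the interval in which the CPU was not running a task
positive
- number of samples in which a CPU went unclaimed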

# calculate stats
g_running = 0
g_queued = 0
for ge in group:
    if samples[ge]['len'] > 0:      # the CPU has a task, so it is not idle
        g_running += 1
    if samples[ge]['len'] > 1:      # extra tasks are queued; they should migrate if another CPU is idle
        g_queued += samples[ge]['len'] - 1
g_idle = ncpu - g_running

# calculate the number of threads that could have run as the
# minimum of idle and queued
if g_idle > 0 and g_queued > 0:     # some CPU is idle while tasks are queued: unclaimed CPU
    if g_queued > g_idle:
        i = g_idle
    else:
        i = g_queued
    positive += i
running += g_running
idle += g_idle

Final output

%CPU 12.50%, unclaimed idle 87.50%

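The first percentage is running / total. It represents CPU utilization, comparable to the load shown by top divided by the number of CPUs, or to the CPU percentage shown in Windows Task Manager.
The second percentage is unclaimed / total. It is the share of samples in which a CPU sat idle although a queued task could have been migrated to it.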

total = running + idle
unclaimed = float(positive) / total
util = float(running) / total

print("%%CPU %6.2f%%, unclaimed idle %0.2f%%" % (100 * util,
      100 * unclaimed))
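As a worked check of the arithmetic, the sketch below condenses the statistics into one function and feeds it the pinned-task scenario described earlier. It assumes each sample group is simply a list of per-CPU queue lengths and that all 8 tasks show up as a queue length of 8 on one CPU; `unclaimed_stats` is a hypothetical helper, not the tool's own code.

```python
def unclaimed_stats(groups, ncpu):
    """Return (utilization %, unclaimed idle %) over all sample groups."""
    running = idle = positive = 0
    for group in groups:
        g_running = sum(1 for n in group if n > 0)       # CPUs with a task
        g_queued = sum(n - 1 for n in group if n > 1)    # tasks waiting behind another
        g_idle = ncpu - g_running
        if g_idle > 0 and g_queued > 0:
            # threads that could have run: the smaller of idle CPUs and queued tasks
            positive += min(g_idle, g_queued)
        running += g_running
        idle += g_idle
    total = running + idle
    return 100.0 * running / total, 100.0 * positive / total


# Assumed scenario: 8 CPUs, 8 tasks pinned to CPU 0, 99 sample groups
groups = [[8, 0, 0, 0, 0, 0, 0, 0]] * 99
util, unclaimed = unclaimed_stats(groups, ncpu=8)
print("%%CPU %6.2f%%, unclaimed idle %0.2f%%" % (util, unclaimed))
```

Under these assumptions each group has 1 running CPU, 7 idle CPUs, and 7 queued tasks, giving 12.50% utilization and 87.50% unclaimed idle, the same figures as in the sample output.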

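Use cases

A healthy system also sees some unclaimed idle, but the share is usually below 1%. Possible causes:

Cache warmth: keeping a task on one core improves the cache hit rate, and the scheduler decides that migrating the task is not yet necessary
A task is about to be moved to another CPU but the move has not happened yet
A scheduler bug

References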

github bcc/tools/cpuunclaimed_example.txt

libbpf-tools/offcputime

Measures the time tasks spend off the CPU, shown by stack.

# /usr/share/bcc/libbpf-tools/offcputime 
    bpf_prog_5e757ea05056b671_sched_switch
    bpf_get_stackid_raw_tp
    bpf_prog_5e757ea05056b671_sched_switch
    bpf_trace_run3
    -                containerd (863)
        579526

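How it works

Like cpudist, it hooks the sched_switch tracepoint, but it measures the time a task spends off the CPU rather than on it.

Use cases

The sched_switch tracepoint sits close to the point where the CPU actually switches task context, so the measurement is less affected by other noise.

References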
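The off-CPU counterpart of the cpudist measurement can be sketched like this. The `(ts_us, prev_pid, next_pid)` tuples are an assumed format, and the sketch keys totals by pid only, whereas the tool keys them by stack:

```python
from collections import defaultdict


def off_cpu_time(switches):
    """switches: (ts_us, prev_pid, next_pid) tuples in time order."""
    left_at = {}                 # pid -> time at which the task was switched out
    totals = defaultdict(int)    # pid -> accumulated off-CPU time in us
    for ts, prev_pid, next_pid in switches:
        left_at[prev_pid] = ts
        start = left_at.pop(next_pid, None)
        if start is not None:
            totals[next_pid] += ts - start
    return dict(totals)


switches = [(0, 1, 2), (40, 2, 1), (100, 1, 2)]
print(off_cpu_time(switches))  # {1: 40, 2: 60}
```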

github bcc/tools/offcputime_example.txt
