Linux process scheduling: being scheduled during a system call
Linux Process Scheduling Series 2 - Scheduled during a system call
This series is about Linux process scheduling (not the scheduling algorithms themselves, but how the actual Linux source code performs a process switch).
For the prerequisite background, see the article "Linux Process Scheduling Series 1 - Scheduled in user mode".
First, two points should be clear:
- A process may voluntarily ask the scheduler to run (actively yielding the CPU).
- Or its time slice runs out, or it is preempted (passively yielding the CPU).
This article covers in detail only the case where a process's time slice expires while it is in kernel mode inside a system call and it gets scheduled out.
The other cases will be explored in the following articles!
1. Knowledge review
- When the timer interrupt is taken, the CPU saves the process's user-mode `PC` into `ELR_EL1` and its `PSTATE` into `SPSR_EL1`, masks interrupts, and switches to that process's kernel stack (entering kernel mode from user mode always starts from the very beginning of the process's kernel stack).
- The interrupt entry code then saves the process's user-mode context into the kernel (stored as a stack-frame structure, `pt_regs`, on the kernel stack).
- Inside the timer handler:
  - The interrupt count in this CPU's `preempt count` is incremented by 1 (it records the current interrupt nesting depth; a value greater than 0 means we are in interrupt context).
  - The time slice of the process control block currently pointed to by `current` is decremented (other bookkeeping details are omitted here), and the `need_resched` flag is set to indicate that rescheduling is required (see the sketch right after this list).
  - The interrupt count in `preempt count` is decremented by 1, and execution enters the `ret_to_user` function.
- Then the interrupt count and the preemption count in `preempt count` are checked for zero. If both are 0, we are back in process context and scheduling is allowed; otherwise scheduling is not allowed and execution goes to `finish_ret_to_user`, which restores the process's user-mode context from the stack frame on its kernel stack, points the kernel-stack SP back at the stack base, and executes `eret` to return to user mode and continue.
- From here on we follow the case where scheduling is allowed (interrupt count and preemption count both 0):
  - Since scheduling is allowed, `schedule()` is entered, i.e. the scheduler runs.
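The time-slice bookkeeping mentioned above can be pictured roughly as follows. This is a minimal sketch modelled on the kernel's round-robin tick (`task_tick_rt()` in kernel/sched/rt.c), not the real CFS code, which accounts virtual runtime instead of a plain countdown; `timer_tick_sketch` is a made-up name. The point is simply that the tick ends by marking the current task with `TIF_NEED_RESCHED`:

```c
#include <linux/sched.h>
#include <linux/sched/rt.h>

/* Sketch only: decrement the time slice; when it is used up, refill it
 * and mark the task so that a reschedule happens on the way back out. */
static void timer_tick_sketch(struct task_struct *p)
{
	if (--p->rt.time_slice)			/* time slice not used up yet */
		return;

	p->rt.time_slice = RR_TIMESLICE;	/* refill for the next round */
	set_tsk_need_resched(p);		/* set TIF_NEED_RESCHED on p */
}
```

Nothing is switched here; the flag is only acted on later, on the return path toward user space.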
- Inside the scheduler, the scheduling algorithm decides whether a switch is needed and, if so, which process should go on the CPU next (here we assume a switch is needed).
- With the process control block of the process picked by the scheduling algorithm in hand, execution enters `cpu_switch_to` to perform the process context switch.
- At this point the scheduler's job is to save the current process's `x19~x29`, `sp` (kernel stack) and `lr` registers into that process's `task_struct`, and likewise to restore `x19~x29`, `sp` (kernel stack) and `lr` from the incoming process's `task_struct` into the actual registers (the save area is sketched right after this list). This completes the kernel-stack switch, and **execution resumes at the point inside `cpu_switch_to` where the incoming process was switched off its kernel stack earlier**. (Every process switch happens at this one point in this function, and a process that is scheduled back in always resumes from this same point; since the kernel-stack `sp` and `lr` are both restored, the environment at the earlier scheduling point is recovered.) It follows that the only things that differ between one process's switch-out point and its resume point are the `x19~x29`, `sp` (kernel stack) and `lr` registers, and since those register values can all be saved and restored, the process switch works cleanly (we have not yet returned to the process's user mode; it simply continues running in kernel mode).
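For reference, the per-task save area that `x19~x29`, `sp` and `lr` are stored into looks roughly like this (based on `struct cpu_context` in arch/arm64/include/asm/processor.h; the exact layout can vary slightly between kernel versions). `THREAD_CPU_CONTEXT`, used by the `cpu_switch_to` assembly in the appendix, is the offset of this structure inside `task_struct`:

```c
/* Sketch of the save area used by cpu_switch_to() (thread.cpu_context). */
struct cpu_context {
	unsigned long x19;
	unsigned long x20;
	unsigned long x21;
	unsigned long x22;
	unsigned long x23;
	unsigned long x24;
	unsigned long x25;
	unsigned long x26;
	unsigned long x27;
	unsigned long x28;
	unsigned long fp;	/* x29 */
	unsigned long sp;	/* kernel stack pointer (sp_el1) */
	unsigned long pc;	/* resume point: the lr saved by cpu_switch_to */
};
```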
2. Flow diagrams
- Stage 1: a timer interrupt fires while process A is executing a system call.
- Stage 2: the CPU switches from process A to process B.
- Stage 3: process B gets back on the CPU and runs.
- Stage 4: a timer interrupt fires while process B is running.
- Stage 5: process A gets back on the CPU and runs.
- Stage 6: process A now continues executing the system-call code it was in (kernel-mode process context) until the system call finishes; the user-mode context is then restored in the same way (from the stack frame at the base of the process's kernel stack), and the `eret` instruction returns to user mode to continue execution.
3. Why doesn't a switch like this change the process's state?
- Because the process's kernel-stack `SP` is unchanged, its `lr` register is unchanged, and its `x19~x29` registers are unchanged ------> the process's call chain is preserved.
- In particular, every switch happens at the same point in the `cpu_switch_to` function ------> so the kernel-mode `PC` is unchanged as well: the pc at the moment the process is switched off the CPU and the pc at the moment it goes back on are the same.
- From these two points it follows that the process's execution environment (its state) has not changed.
- Of course, a question arises: `x0~x18` are not saved, so doesn't that mean the execution environment has changed?
  - At the physical level, yes, the values in the `x0~x18` registers may well be different.
  - But from the point of view of the process's execution state, `cpu_switch_to` itself does not use `x0~x18`, so even if they change it makes no difference at all.
- What if the function that calls `cpu_switch_to` does use `x0~x18`?
  - This is where the AArch64 procedure call standard comes in: the caller is responsible for saving `x0~x18` to the stack (if it still needs them after the call), while the callee is responsible for saving `x19~x29` (if the callee wants to use those registers).
  - So the function that calls `cpu_switch_to` has already saved the `x0~x18` values it cares about onto the stack; the stack contents are not touched by the process switch, and the SP value is saved and restored, so the execution environment is again unchanged. A small illustration follows this list.
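To make the caller-saved/callee-saved split concrete, here is a small stand-alone illustration (an AArch64 compiler experiment, not kernel code; `pretend_call()` and `demo()` are made-up names). The empty inline asm clobbers the caller-saved register set, just as a real call may, so any value that must survive it ends up in `x19~x28` or in the caller's own stack frame — exactly the guarantee `cpu_switch_to` relies on:

```c
/* AArch64-only illustration of AAPCS64: x0~x18 are caller-saved,
 * x19~x28/fp are callee-saved. */
static inline void pretend_call(void)
{
	/* Tell the compiler this "call" may trash the caller-saved set. */
	asm volatile("" :::
		"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9",
		"x10", "x11", "x12", "x13", "x14", "x15", "x16", "x17", "x18",
		"memory");
}

long demo(long a, long b)
{
	long live = a * b;	/* must survive the "call" below */

	pretend_call();		/* may change x0~x18 */

	/* 'live' and 'a' are still valid: the compiler kept them in
	 * callee-saved registers or spilled them to demo()'s own frame. */
	return live + a;
}
```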
4. Closing
This article examined the case where a process is scheduled out because its time slice expires during a system call in kernel mode; the remaining cases will be explored in depth in later articles!
For the case where a process is scheduled while running in user mode, see the article "Linux Process Scheduling Series 1 - Scheduled in user mode".
5. Appendix: key scheduling-related kernel source
The most important piece is `cpu_switch_to`; this is where the process switch actually takes place.
`el0_irq` — interrupt handling for an interrupt taken from user mode
SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
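// kernel_entry 0: save the user-mode registers (x0~x30, sp_el0, elr_el1, spsr_el1) as a pt_regs frame on the kernel stack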
kernel_entry 0
el0_irq_naked:
el0_interrupt_handler handle_arch_irq
b ret_to_user
SYM_CODE_END(el0_irq)
`ret_to_user` — the code path that returns to user mode
- This is where the interrupt count and the preemption count are checked.
- If the interrupt count is 0, we have already left interrupt context and are back in process context; if it is non-zero, we are still in interrupt context and process scheduling must not happen.
/*
#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
_TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT | \
_TIF_NOTIFY_SIGNAL)
*/
SYM_CODE_START_LOCAL(ret_to_user)
disable_daif
gic_prio_kentry_setup tmp=x3
#ifdef CONFIG_TRACE_IRQFLAGS
bl trace_hardirqs_off
#endif
ldr x19, [tsk, #TSK_TI_FLAGS]
// test the _TIF_WORK_MASK bits in TSK_TI_FLAGS
and x2, x19, #_TIF_WORK_MASK
// if non-zero, branch to work_pending
cbnz x2, work_pending
finish_ret_to_user:
user_enter_irqoff
/* Ignore asynchronous tag check faults in the uaccess routines */
clear_mte_async_tcf
enable_step_tsk x19, x2
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
bl stackleak_erase
#endif
kernel_exit 0
/*
* Ok, we need to do extra processing, enter the slow path.
*/
work_pending:
mov x0, sp // 'regs'
mov x1, x19
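// do_notify_resume(regs, thread_flags): calls schedule() when _TIF_NEED_RESCHED is set (see the C source below)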
bl do_notify_resume
ldr x19, [tsk, #TSK_TI_FLAGS] // re-check for single-step
b finish_ret_to_user
SYM_CODE_END(ret_to_user)
`do_notify_resume` — the function that calls `schedule()`
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned long thread_flags)
{
do {
/* Check valid user FS if needed */
addr_limit_user_check();
if (thread_flags & _TIF_NEED_RESCHED) {
/* Unmask Debug and SError for the next task */
local_daif_restore(DAIF_PROCCTX_NOIRQ);
schedule(); // -> __schedule()
} else {
local_daif_restore(DAIF_PROCCTX);
if (thread_flags & _TIF_UPROBE)
uprobe_notify_resume(regs);
if (thread_flags & _TIF_MTE_ASYNC_FAULT) {
clear_thread_flag(TIF_MTE_ASYNC_FAULT);
send_sig_fault(SIGSEGV, SEGV_MTEAERR,
(void __user *)NULL, current);
}
if (thread_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
do_signal(regs);
if (thread_flags & _TIF_NOTIFY_RESUME) {
tracehook_notify_resume(regs);
rseq_handle_notify_resume(NULL, regs);
}
if (thread_flags & _TIF_FOREIGN_FPSTATE)
fpsimd_restore_current_state();
}
local_daif_mask();
thread_flags = READ_ONCE(current_thread_info()->flags);
} while (thread_flags & _TIF_WORK_MASK);
}
`__schedule` — the main scheduler function
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPTION=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev, preempt);
if (sched_feat(HRTICK))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up():
*
* __set_current_state(@state) signal_wake_up()
* schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
* wake_up_state(p, state)
* LOCK rq->lock LOCK p->pi_state
* smp_mb__after_spinlock() smp_mb__after_spinlock()
* if (signal_pending_state()) if (p->state & @state)
*
* Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
switch_count = &prev->nivcsw;
/*
* We must load prev->state once (task_struct::state is volatile), such
* that:
*
* - we form a control dependency vs deactivate_task() below.
* - ptrace_{,un}freeze_traced() can change ->state underneath us.
*/
prev_state = prev->state;
if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
if (likely(prev != next)) {
rq->nr_switches++;
/*
* RCU users of rcu_dereference(rq->curr) may not see
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
}
balance_callback(rq);
}
`context_switch` — calls `__switch_to`
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
prepare_task_switch(rq, prev, next);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_start_context_switch(prev);
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab() active
*
* kernel -> user switch + mmdrop() active
* user -> user switch
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
next->active_mm = prev->active_mm;
if (prev->mm) // from user
mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev); // -> __switch_to()
barrier();
return finish_task_switch(prev);
}
`__switch_to` — the thread context switch (FP/SIMD, TLS, hardware breakpoints and other per-thread state; the address-space switch itself happens earlier, in `context_switch()` via `switch_mm_irqs_off()`)
/*
* Thread switching.
*/
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
struct task_struct *next)
{
struct task_struct *last;
fpsimd_thread_switch(next);
tls_thread_switch(next);
hw_breakpoint_thread_switch(next);
contextidr_thread_switch(next);
entry_task_switch(next);
uao_thread_switch(next);
ssbs_thread_switch(next);
erratum_1418040_thread_switch(next);
/*
* Complete any pending TLB or cache maintenance on this CPU in case
* the thread migrates to a different CPU.
* This full barrier is also required by the membarrier system
* call.
*/
dsb(ish);
/*
* MTE thread switching must happen after the DSB above to ensure that
* any asynchronous tag check faults have been logged in the TFSR*_EL1
* registers.
*/
mte_thread_switch(next);
/* the actual thread switch */
last = cpu_switch_to(prev, next);
return last;
}
`cpu_switch_to` — the function that actually performs the process switch
/*
* Register switch for AArch64. The callee-saved registers need to be saved
* and restored. On entry:
* x0 = previous task_struct (must be preserved across the switch)
* x1 = next task_struct
* Previous and next are guaranteed not to be the same.
*
*/
SYM_FUNC_START(cpu_switch_to)
mov x10, #THREAD_CPU_CONTEXT
add x8, x0, x10
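// x8 = &prev->thread.cpu_context (THREAD_CPU_CONTEXT is the offset of that save area within task_struct)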
mov x9, sp // at EL1, sp here is sp_el1 (the current task's kernel stack pointer)
stp x19, x20, [x8], #16 // store callee-saved registers
stp x21, x22, [x8], #16
stp x23, x24, [x8], #16
stp x25, x26, [x8], #16
stp x27, x28, [x8], #16
stp x29, x9, [x8], #16
str lr, [x8]
add x8, x1, x10
ldp x19, x20, [x8], #16 // restore callee-saved registers
ldp x21, x22, [x8], #16
ldp x23, x24, [x8], #16
ldp x25, x26, [x8], #16
ldp x27, x28, [x8], #16
ldp x29, x9, [x8], #16
ldr lr, [x8]
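// switch to the incoming task's kernel stack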
mov sp, x9
msr sp_el0, x1 // switch the current pointer (kept in sp_el0) to the task going onto the CPU
ptrauth_keys_install_kernel x1, x8, x9, x10
scs_save x0, x8
scs_load_current
ret
SYM_FUNC_END(cpu_switch_to)
`kernel_exit` — the exception exit path
.macro kernel_exit, el
.if \el != 0
disable_daif
/* Restore the task's original addr_limit. */
ldr x20, [sp, #S_ORIG_ADDR_LIMIT]
str x20, [tsk, #TSK_TI_ADDR_LIMIT]
/* No need to restore UAO, it will be restored from SPSR_EL1 */
.endif
/* Restore pmr */
alternative_if ARM64_HAS_IRQ_PRIO_MASKING
ldr x20, [sp, #S_PMR_SAVE]
msr_s SYS_ICC_PMR_EL1, x20
mrs_s x21, SYS_ICC_CTLR_EL1
tbz x21, #6, .L__skip_pmr_sync\@ // Check for ICC_CTLR_EL1.PMHE
dsb sy // Ensure priority change is seen by redistributor
.L__skip_pmr_sync\@:
alternative_else_nop_endif
ldp x21, x22, [sp, #S_PC] // load ELR, SPSR
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
alternative_if_not ARM64_HAS_PAN
bl __swpan_exit_el\el
alternative_else_nop_endif
#endif
.if \el == 0
ldr x23, [sp, #S_SP] // load return stack pointer
msr sp_el0, x23
tst x22, #PSR_MODE32_BIT // native task?
b.eq 3f
#ifdef CONFIG_ARM64_ERRATUM_845719
alternative_if ARM64_WORKAROUND_845719
#ifdef CONFIG_PID_IN_CONTEXTIDR
mrs x29, contextidr_el1
msr contextidr_el1, x29
#else
msr contextidr_el1, xzr
#endif
alternative_else_nop_endif
#endif
3:
scs_save tsk, x0
/* No kernel C function calls after this as user keys are set. */
ptrauth_keys_install_user tsk, x0, x1, x2
apply_ssbd 0, x0, x1
.endif
msr elr_el1, x21 // set up the return data
msr spsr_el1, x22
ldp x0, x1, [sp, #16 * 0]
ldp x2, x3, [sp, #16 * 1]
ldp x4, x5, [sp, #16 * 2]
ldp x6, x7, [sp, #16 * 3]
ldp x8, x9, [sp, #16 * 4]
ldp x10, x11, [sp, #16 * 5]
ldp x12, x13, [sp, #16 * 6]
ldp x14, x15, [sp, #16 * 7]
ldp x16, x17, [sp, #16 * 8]
ldp x18, x19, [sp, #16 * 9]
ldp x20, x21, [sp, #16 * 10]
ldp x22, x23, [sp, #16 * 11]
ldp x24, x25, [sp, #16 * 12]
ldp x26, x27, [sp, #16 * 13]
ldp x28, x29, [sp, #16 * 14]
.if \el == 0
alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
eret
alternative_else_nop_endif
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
bne 4f
msr far_el1, x29
tramp_alias x30, tramp_exit_native, x29
br x30
4:
tramp_alias x30, tramp_exit_compat, x29
br x30
#endif
.else
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
/* Ensure any device/NC reads complete */
alternative_insn nop, "dmb sy", ARM64_WORKAROUND_1508412
eret
.endif
sb
.endm