Linux process scheduling: being scheduled during a system call
Linux Process Scheduling Series 2 - Scheduled during a system call
This series is about Linux process scheduling (not the scheduling algorithms themselves, but how the actual Linux source code performs a process switch).
For the prerequisite background, see the article "Linux Process Scheduling Series 1 - Scheduled in user mode".
First, two points should be clear:
- A process may voluntarily ask the scheduler to run (actively yielding the CPU).
- Or its time slice runs out, or it is preempted (passively yielding the CPU).
This article covers in detail only the case where a process's time slice expires while it is in kernel mode inside a system call and it gets scheduled out.
The other cases will be explored in the following articles!
1. Knowledge review
- When the timer interrupt is taken, the CPU saves the process's user-mode `PC` into `ELR_EL1` and its `PSTATE` into `SPSR_EL1`, masks interrupts, and switches to that process's kernel stack (entering kernel mode from user mode always starts from the very beginning of the process's kernel stack).
- The interrupt entry code then saves the process's user-mode context into the kernel (stored as a stack-frame structure, `pt_regs`, on the kernel stack).
- Inside the timer handler:
  - The interrupt count in this CPU's `preempt count` is incremented by 1 (it records the current interrupt nesting depth; a value greater than 0 means we are in interrupt context).
  - The time slice of the process control block currently pointed to by `current` is decremented (other bookkeeping details are omitted here), and the `need_resched` flag is set to indicate that rescheduling is required (see the sketch right after this list).
  - The interrupt count in `preempt count` is decremented by 1, and execution enters the `ret_to_user` function.
- Then the interrupt count and the preemption count in `preempt count` are checked for zero. If both are 0, we are back in process context and scheduling is allowed; otherwise scheduling is not allowed and execution goes to `finish_ret_to_user`, which restores the process's user-mode context from the stack frame on its kernel stack, points the kernel-stack SP back at the stack base, and executes `eret` to return to user mode and continue.
- From here on we follow the case where scheduling is allowed (interrupt count and preemption count both 0):
  - Since scheduling is allowed, `schedule()` is entered, i.e. the scheduler runs.
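The time-slice bookkeeping mentioned above can be pictured roughly as follows. This is a minimal sketch modelled on the kernel's round-robin tick (`task_tick_rt()` in kernel/sched/rt.c), not the real CFS code, which accounts virtual runtime instead of a plain countdown; `timer_tick_sketch` is a made-up name. The point is simply that the tick ends by marking the current task with `TIF_NEED_RESCHED`:

```c
#include <linux/sched.h>
#include <linux/sched/rt.h>

/* Sketch only: decrement the time slice; when it is used up, refill it
 * and mark the task so that a reschedule happens on the way back out. */
static void timer_tick_sketch(struct task_struct *p)
{
	if (--p->rt.time_slice)			/* time slice not used up yet */
		return;

	p->rt.time_slice = RR_TIMESLICE;	/* refill for the next round */
	set_tsk_need_resched(p);		/* set TIF_NEED_RESCHED on p */
}
```

Nothing is switched here; the flag is only acted on later, on the return path toward user space.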
- Inside the scheduler, the scheduling algorithm decides whether a switch is needed and, if so, which process should go on the CPU next (here we assume a switch is needed).
- With the process control block of the process picked by the scheduling algorithm in hand, execution enters `cpu_switch_to` to perform the process context switch.
- At this point the scheduler's job is to save the current process's `x19~x29`, `sp` (kernel stack) and `lr` registers into that process's `task_struct`, and likewise to restore `x19~x29`, `sp` (kernel stack) and `lr` from the incoming process's `task_struct` into the actual registers (the save area is sketched right after this list). This completes the kernel-stack switch, and **execution resumes at the point inside `cpu_switch_to` where the incoming process was switched off its kernel stack earlier**. (Every process switch happens at this one point in this function, and a process that is scheduled back in always resumes from this same point; since the kernel-stack `sp` and `lr` are both restored, the environment at the earlier scheduling point is recovered.) It follows that the only things that differ between one process's switch-out point and its resume point are the `x19~x29`, `sp` (kernel stack) and `lr` registers, and since those register values can all be saved and restored, the process switch works cleanly (we have not yet returned to the process's user mode; it simply continues running in kernel mode).
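For reference, the per-task save area that `x19~x29`, `sp` and `lr` are stored into looks roughly like this (based on `struct cpu_context` in arch/arm64/include/asm/processor.h; the exact layout can vary slightly between kernel versions). `THREAD_CPU_CONTEXT`, used by the `cpu_switch_to` assembly in the appendix, is the offset of this structure inside `task_struct`:

```c
/* Sketch of the save area used by cpu_switch_to() (thread.cpu_context). */
struct cpu_context {
	unsigned long x19;
	unsigned long x20;
	unsigned long x21;
	unsigned long x22;
	unsigned long x23;
	unsigned long x24;
	unsigned long x25;
	unsigned long x26;
	unsigned long x27;
	unsigned long x28;
	unsigned long fp;	/* x29 */
	unsigned long sp;	/* kernel stack pointer (sp_el1) */
	unsigned long pc;	/* resume point: the lr saved by cpu_switch_to */
};
```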
2. Flow diagrams
- Stage 1: a timer interrupt fires while process A is executing a system call.
- Stage 2: the CPU switches from process A to process B.
- Stage 3: process B gets back on the CPU and runs.
- Stage 4: a timer interrupt fires while process B is running.
- Stage 5: process A gets back on the CPU and runs.
- Stage 6: process A now continues executing the system-call code it was in (kernel-mode process context) until the system call finishes; the user-mode context is then restored in the same way (from the stack frame at the base of the process's kernel stack), and the `eret` instruction returns to user mode to continue execution.
3. Why doesn't a switch like this change the process's state?
- Because the process's kernel-stack `SP` is unchanged, its `lr` register is unchanged, and its `x19~x29` registers are unchanged ------> the process's call chain is preserved.
- In particular, every switch happens at the same point in the `cpu_switch_to` function ------> so the kernel-mode `PC` is unchanged as well: the pc at the moment the process is switched off the CPU and the pc at the moment it goes back on are the same.
- From these two points it follows that the process's execution environment (its state) has not changed.
- Of course, a question arises: `x0~x18` are not saved, so doesn't that mean the execution environment has changed?
  - At the physical level, yes, the values in the `x0~x18` registers may well be different.
  - But from the point of view of the process's execution state, `cpu_switch_to` itself does not use `x0~x18`, so even if they change it makes no difference at all.
- What if the function that calls `cpu_switch_to` does use `x0~x18`?
  - This is where the AArch64 procedure call standard comes in: the caller is responsible for saving `x0~x18` to the stack (if it still needs them after the call), while the callee is responsible for saving `x19~x29` (if the callee wants to use those registers).
  - So the function that calls `cpu_switch_to` has already saved the `x0~x18` values it cares about onto the stack; the stack contents are not touched by the process switch, and the SP value is saved and restored, so the execution environment is again unchanged. A small illustration follows this list.
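To make the caller-saved/callee-saved split concrete, here is a small stand-alone illustration (an AArch64 compiler experiment, not kernel code; `pretend_call()` and `demo()` are made-up names). The empty inline asm clobbers the caller-saved register set, just as a real call may, so any value that must survive it ends up in `x19~x28` or in the caller's own stack frame — exactly the guarantee `cpu_switch_to` relies on:

```c
/* AArch64-only illustration of AAPCS64: x0~x18 are caller-saved,
 * x19~x28/fp are callee-saved. */
static inline void pretend_call(void)
{
	/* Tell the compiler this "call" may trash the caller-saved set. */
	asm volatile("" :::
		"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9",
		"x10", "x11", "x12", "x13", "x14", "x15", "x16", "x17", "x18",
		"memory");
}

long demo(long a, long b)
{
	long live = a * b;	/* must survive the "call" below */

	pretend_call();		/* may change x0~x18 */

	/* 'live' and 'a' are still valid: the compiler kept them in
	 * callee-saved registers or spilled them to demo()'s own frame. */
	return live + a;
}
```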
4. Closing
This article examined the case where a process is scheduled out because its time slice expires during a system call in kernel mode; the remaining cases will be explored in depth in later articles!
For the case where a process is scheduled while running in user mode, see the article "Linux Process Scheduling Series 1 - Scheduled in user mode".
5. Appendix: key scheduling-related kernel source
The most important piece is `cpu_switch_to`; this is where the process switch actually takes place.
`el0_irq` — interrupt handling for an interrupt taken from user mode
SYM_CODE_START_LOCAL_NOALIGN(el0_irq)
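// kernel_entry 0: save the user-mode registers (x0~x30, sp_el0, elr_el1, spsr_el1) as a pt_regs frame on the kernel stack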
kernel_entry 0
el0_irq_naked:
el0_interrupt_handler handle_arch_irq
b ret_to_user
SYM_CODE_END(el0_irq)
`ret_to_user` — the code path that returns to user mode
- This is where the interrupt count and the preemption count are checked.
- If the interrupt count is 0, we have already left interrupt context and are back in process context; if it is non-zero, we are still in interrupt context and process scheduling must not happen.
/*
#define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
_TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
_TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT | \
_TIF_NOTIFY_SIGNAL)
*/
SYM_CODE_START_LOCAL(ret_to_user)
disable_daif
gic_prio_kentry_setup tmp=x3
#ifdef CONFIG_TRACE_IRQFLAGS
bl trace_hardirqs_off
#endif
ldr x19, [tsk, #TSK_TI_FLAGS]
// test the _TIF_WORK_MASK bits in TSK_TI_FLAGS
and x2, x19, #_TIF_WORK_MASK
// if non-zero, branch to work_pending
cbnz x2, work_pending
finish_ret_to_user:
user_enter_irqoff
/* Ignore asynchronous tag check faults in the uaccess routines */
clear_mte_async_tcf
enable_step_tsk x19, x2
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
bl stackleak_erase
#endif
kernel_exit 0
/*
* Ok, we need to do extra processing, enter the slow path.
*/
work_pending:
mov x0, sp // 'regs'
mov x1, x19
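// do_notify_resume(regs, thread_flags): calls schedule() when _TIF_NEED_RESCHED is set (see the C source below)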
bl do_notify_resume
ldr x19, [tsk, #TSK_TI_FLAGS] // re-check for single-step
b finish_ret_to_user
SYM_CODE_END(ret_to_user)
`do_notify_resume` — the function that calls `schedule()`
asmlinkage void do_notify_resume(struct pt_regs *regs,
unsigned long thread_flags)
{
do {
/* Check valid user FS if needed */
addr_limit_user_check();
if (thread_flags & _TIF_NEED_RESCHED) {
/* Unmask Debug and SError for the next task */
local_daif_restore(DAIF_PROCCTX_NOIRQ);
schedule(); // -> __schedule()
} else {
local_daif_restore(DAIF_PROCCTX);
if (thread_flags & _TIF_UPROBE)
uprobe_notify_resume(regs);
if (thread_flags & _TIF_MTE_ASYNC_FAULT) {
clear_thread_flag(TIF_MTE_ASYNC_FAULT);
send_sig_fault(SIGSEGV, SEGV_MTEAERR,
(void __user *)NULL, current);
}
if (thread_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
do_signal(regs);
if (thread_flags & _TIF_NOTIFY_RESUME) {
tracehook_notify_resume(regs);
rseq_handle_notify_resume(NULL, regs);
}
if (thread_flags & _TIF_FOREIGN_FPSTATE)
fpsimd_restore_current_state();
}
local_daif_mask();
thread_flags = READ_ONCE(current_thread_info()->flags);
} while (thread_flags & _TIF_WORK_MASK);
}
`__schedule` — the main scheduler function
/*
* __schedule() is the main scheduler function.
*
* The main means of driving the scheduler and thus entering this function are:
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
* 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
* paths. For example, see arch/x86/entry_64.S.
*
* To drive preemption between tasks, the scheduler sets the flag in timer
* interrupt handler scheduler_tick().
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
* task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
* called on the nearest possible occasion:
*
* - If the kernel is preemptible (CONFIG_PREEMPTION=y):
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
* spin_unlock()!)
*
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
* - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
* then at the next:
*
* - cond_resched() call
* - explicit schedule() call
* - return from syscall or exception to user-space
* - return from interrupt-handler to user-space
*
* WARNING: must be called with preemption disabled!
*/
static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
schedule_debug(prev, preempt);
if (sched_feat(HRTICK))
hrtick_clear(rq);
local_irq_disable();
rcu_note_context_switch(preempt);
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
* done by the caller to avoid the race with signal_wake_up():
*
* __set_current_state(@state) signal_wake_up()
* schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
* wake_up_state(p, state)
* LOCK rq->lock LOCK p->pi_state
* smp_mb__after_spinlock() smp_mb__after_spinlock()
* if (signal_pending_state()) if (p->state & @state)
*
* Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
smp_mb__after_spinlock();
/* Promote REQ to ACT */
rq->clock_update_flags <<= 1;
update_rq_clock(rq);
switch_count = &prev->nivcsw;
/*
* We must load prev->state once (task_struct::state is volatile), such
* that:
*
* - we form a control dependency vs deactivate_task() below.
* - ptrace_{,un}freeze_traced() can change ->state underneath us.
*/
prev_state = prev->state;
if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
}
switch_count = &prev->nvcsw;
}
next = pick_next_task(rq, prev, &rf);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
if (likely(prev != next)) {
rq->nr_switches++;
/*
* RCU users of rcu_dereference(rq->curr) may not see
* changes to task_struct made by pick_next_task().
*/
RCU_INIT_POINTER(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
* rq->curr, before returning to user-space.
*
* Here are the schemes providing that barrier on the
* various architectures:
* - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
* switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
* - finish_lock_switch() for weakly-ordered
* architectures where spin_unlock is a full barrier,
* - switch_to() for arm64 (weakly-ordered, spin_unlock
* is a RELEASE barrier),
*/
++*switch_count;
psi_sched_switch(prev, next, !task_on_rq_queued(prev));
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
rq_unlock_irq(rq, &rf);
}
balance_callback(rq);
}
`context_switch` — calls `__switch_to`
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
prepare_task_switch(rq, prev, next);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_start_context_switch(prev);
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab() active
*
* kernel -> user switch + mmdrop() active
* user -> user switch
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
next->active_mm = prev->active_mm;
if (prev->mm) // from user
mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev); // -> __switch_to()
barrier();
return finish_task_switch(prev);
}
`__switch_to` — the thread context switch (FP/SIMD, TLS, hardware breakpoints and other per-thread state; the address-space switch itself happens earlier, in `context_switch()` via `switch_mm_irqs_off()`)
/*
* Thread switching.
*/
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
struct task_struct *next)
{
struct task_struct *last;
fpsimd_thread_switch(next);
tls_thread_switch(next);
hw_breakpoint_thread_switch(next);
contextidr_thread_switch(next);
entry_task_switch(next);
uao_thread_switch(next);
ssbs_thread_switch(next);
erratum_1418040_thread_switch(next);
/*
* Complete any pending TLB or cache maintenance on this CPU in case
* the thread migrates to a different CPU.
* This full barrier is also required by the membarrier system
* call.
*/
dsb(ish);
/*
* MTE thread switching must happen after the DSB above to ensure that
* any asynchronous tag check faults have been logged in the TFSR*_EL1
* registers.
*/
mte_thread_switch(next);
/* the actual thread switch */
last = cpu_switch_to(prev, next);
return last;
}
`cpu_switch_to` — the function that actually performs the process switch
/*
* Register switch for AArch64. The callee-saved registers need to be saved
* and restored. On entry:
* x0 = previous task_struct (must be preserved across the switch)
* x1 = next task_struct
* Previous and next are guaranteed not to be the same.
*
*/
SYM_FUNC_START(cpu_switch_to)
mov x10, #THREAD_CPU_CONTEXT
add x8, x0, x10
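// x8 = &prev->thread.cpu_context (THREAD_CPU_CONTEXT is the offset of that save area within task_struct)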
mov x9, sp // at EL1, sp here is sp_el1 (the current task's kernel stack pointer)
stp x19, x20, [x8], #16 // store callee-saved registers
stp x21, x22, [x8], #16
stp x23, x24, [x8], #16
stp x25, x26, [x8], #16
stp x27, x28, [x8], #16
stp x29, x9, [x8], #16
str lr, [x8]
add x8, x1, x10
ldp x19, x20, [x8], #16 // restore callee-saved registers
ldp x21, x22, [x8], #16
ldp x23, x24, [x8], #16
ldp x25, x26, [x8], #16
ldp x27, x28, [x8], #16
ldp x29, x9, [x8], #16
ldr lr, [x8]
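// switch to the incoming task's kernel stack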
mov sp, x9
msr sp_el0, x1 // switch the current pointer (kept in sp_el0) to the task going onto the CPU
ptrauth_keys_install_kernel x1, x8, x9, x10
scs_save x0, x8
scs_load_current
ret
SYM_FUNC_END(cpu_switch_to)
`kernel_exit` — the exception exit path
.macro kernel_exit, el
.if \el != 0
disable_daif
/* Restore the task's original addr_limit. */
ldr x20, [sp, #S_ORIG_ADDR_LIMIT]
str x20, [tsk, #TSK_TI_ADDR_LIMIT]
/* No need to restore UAO, it will be restored from SPSR_EL1 */
.endif
/* Restore pmr */
alternative_if ARM64_HAS_IRQ_PRIO_MASKING
ldr x20, [sp, #S_PMR_SAVE]
msr_s SYS_ICC_PMR_EL1, x20
mrs_s x21, SYS_ICC_CTLR_EL1
tbz x21, #6, .L__skip_pmr_sync\@ // Check for ICC_CTLR_EL1.PMHE
dsb sy // Ensure priority change is seen by redistributor
.L__skip_pmr_sync\@:
alternative_else_nop_endif
ldp x21, x22, [sp, #S_PC] // load ELR, SPSR
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
alternative_if_not ARM64_HAS_PAN
bl __swpan_exit_el\el
alternative_else_nop_endif
#endif
.if \el == 0
ldr x23, [sp, #S_SP] // load return stack pointer
msr sp_el0, x23
tst x22, #PSR_MODE32_BIT // native task?
b.eq 3f
#ifdef CONFIG_ARM64_ERRATUM_845719
alternative_if ARM64_WORKAROUND_845719
#ifdef CONFIG_PID_IN_CONTEXTIDR
mrs x29, contextidr_el1
msr contextidr_el1, x29
#else
msr contextidr_el1, xzr
#endif
alternative_else_nop_endif
#endif
3:
scs_save tsk, x0
/* No kernel C function calls after this as user keys are set. */
ptrauth_keys_install_user tsk, x0, x1, x2
apply_ssbd 0, x0, x1
.endif
msr elr_el1, x21 // set up the return data
msr spsr_el1, x22
ldp x0, x1, [sp, #16 * 0]
ldp x2, x3, [sp, #16 * 1]
ldp x4, x5, [sp, #16 * 2]
ldp x6, x7, [sp, #16 * 3]
ldp x8, x9, [sp, #16 * 4]
ldp x10, x11, [sp, #16 * 5]
ldp x12, x13, [sp, #16 * 6]
ldp x14, x15, [sp, #16 * 7]
ldp x16, x17, [sp, #16 * 8]
ldp x18, x19, [sp, #16 * 9]
ldp x20, x21, [sp, #16 * 10]
ldp x22, x23, [sp, #16 * 11]
ldp x24, x25, [sp, #16 * 12]
ldp x26, x27, [sp, #16 * 13]
ldp x28, x29, [sp, #16 * 14]
.if \el == 0
alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
eret
alternative_else_nop_endif
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
bne 4f
msr far_el1, x29
tramp_alias x30, tramp_exit_native, x29
br x30
4:
tramp_alias x30, tramp_exit_compat, x29
br x30
#endif
.else
ldr lr, [sp, #S_LR]
add sp, sp, #S_FRAME_SIZE // restore sp
/* Ensure any device/NC reads complete */
alternative_insn nop, "dmb sy", ARM64_WORKAROUND_1508412
eret
.endif
sb
.endm