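1. Introduction to CPU hotplug
In the single-core era, the operating system only had to manage one CPU. Whenever there was work to do, all tasks queued on that CPU's run queue, and the scheduler picked the best task to run according to its scheduling algorithm. Once every task on the run queue had finished, the CPU ran the idle task and entered the idle state. Because the idle task has the lowest priority, any task that enters the run queue preempts it and real work resumes. The CPU keeps switching between the tasks and the idle task, and that keeps the whole system running. The basic architecture is as follows:
With the introduction of SMP, the operating system manages several CPUs at once. Each CPU has its own run queue and idle task and can schedule independently. When the load across CPUs becomes unbalanced, the load-balancing module can migrate tasks between them. The corresponding basic architecture is as follows:
An operating system needs at least one CPU to run, so taking the surplus CPUs offline affects nothing except performance. For this reason the kernel provides CPU hotplug, which allows CPUs to be brought online or taken offline dynamically.
It currently has two main use cases:
(1) The user manually turns a CPU on or off through the sysfs interface
(2) As part of system suspend/resume: all secondary CPUs are turned off before the system suspends and turned back on after it wakes up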
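2. How CPU hotplug works
Bringing a CPU online means putting a powered-down CPU back into service, so the kernel has to set the kernel entry address for that CPU and then power it up. On startup the CPU runs a series of initialization steps and finally joins the kernel scheduler and takes part in task scheduling. Since this restarts a powered-down CPU, it is effectively the same as the secondary CPU boot flow covered in the previous article, so this article mainly describes the CPU offline flow.
Taking a CPU offline means powering down a CPU that is currently running tasks, and many modules in the system are closely tied to the CPU. Before a CPU can go offline it must therefore be decoupled from those modules: all tasks on its run queue and all interrupts bound to it are migrated away, the CPU is removed from the CPU topology and from its NUMA node, and so on.
Once all of these ties to the system have been cut, every software and hardware module ignores the CPU, and its power-off flow can be run through the cpu_ops callbacks, which completes its farewell to the system. The key to removing a CPU is therefore to sever its relationships with the related modules.
Because these modules play different roles, they are shut down at different points in the CPU hotplug flow. For example, task migration relies on periodic scheduling driven by the tick interrupt, so the CPU's interrupts can only be migrated after task migration has completed.
For this reason the kernel implements a state machine for CPU hotplug that defines the operations to run at each stage of a hotplug operation. It is defined as follows: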
enum cpuhp_state {
	CPUHP_INVALID = -1,
	CPUHP_OFFLINE = 0,
	CPUHP_CREATE_THREADS,
	…
	CPUHP_AP_DTPM_CPU_ONLINE,
	CPUHP_AP_ACTIVE,
	CPUHP_ONLINE,
};
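The state machine starts at CPUHP_OFFLINE and ends at CPUHP_ONLINE, and each state carries the operations to run when a CPU goes online and when it goes offline. Once the operation for a state has run successfully, the CPU moves to the next state in the state machine.
When a CPU is brought online, its initial state is CPUHP_OFFLINE, so the operations are run state by state starting from there. If execution finally reaches CPUHP_ONLINE, the CPU is online. The corresponding state transitions are as follows:
When a CPU is taken offline, its initial state is CPUHP_ONLINE. It walks the state machine in the reverse of the bring-up order, and if execution finally reaches CPUHP_OFFLINE, the CPU is offline. The corresponding state transitions are as follows: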
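The two walk directions can be sketched in user space. This is only a toy model under stated assumptions: the toy_* names, the six states, and the trace array are hypothetical and are not kernel code. It shows the startup callback of the next state running on the way up and the teardown callback of the current state running on the way down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy states; the real enum cpuhp_state has many more entries. */
enum toy_state {
	TOY_OFFLINE, TOY_PREPARE, TOY_BRINGUP, TOY_STARTING, TOY_ACTIVE, TOY_ONLINE,
	TOY_NR
};

struct toy_step {
	const char *name;
	int (*startup)(unsigned int cpu);	/* run when entering the state on the way up */
	int (*teardown)(unsigned int cpu);	/* run when leaving the state on the way down */
};

struct toy_cpu {
	unsigned int id;
	enum toy_state state;
	int trace[2 * TOY_NR];	/* +state for a startup call, -state for a teardown call */
	size_t ntrace;
};

static int toy_ok(unsigned int cpu) { (void)cpu; return 0; }

static const struct toy_step toy_steps[TOY_NR] = {
	[TOY_OFFLINE]  = { "offline",  NULL,   NULL   },
	[TOY_PREPARE]  = { "prepare",  toy_ok, toy_ok },
	[TOY_BRINGUP]  = { "bringup",  toy_ok, toy_ok },
	[TOY_STARTING] = { "starting", toy_ok, toy_ok },
	[TOY_ACTIVE]   = { "active",   toy_ok, toy_ok },
	[TOY_ONLINE]   = { "online",   NULL,   NULL   },
};

static void toy_record(struct toy_cpu *c, int entry)
{
	assert(c->ntrace < sizeof(c->trace) / sizeof(c->trace[0]));
	c->trace[c->ntrace++] = entry;
}

/* Walk upwards: run the startup callback of each next state, then enter it. */
static int toy_cpu_up(struct toy_cpu *c, enum toy_state target)
{
	while (c->state < target) {
		enum toy_state next = (enum toy_state)(c->state + 1);
		const struct toy_step *s = &toy_steps[next];

		if (s->startup) {
			int ret = s->startup(c->id);
			if (ret)
				return ret;	/* stay in the last good state */
			toy_record(c, (int)next);
		}
		c->state = next;
	}
	return 0;
}

/* Walk downwards: run the teardown callback of the current state, then leave it. */
static int toy_cpu_down(struct toy_cpu *c, enum toy_state target)
{
	while (c->state > target) {
		const struct toy_step *s = &toy_steps[c->state];

		if (s->teardown) {
			int ret = s->teardown(c->id);
			if (ret)
				return ret;
			toy_record(c, -(int)c->state);
		}
		c->state = (enum toy_state)(c->state - 1);
	}
	return 0;
}
```

Calling toy_cpu_up(&c, TOY_ONLINE) on a CPU in TOY_OFFLINE and then toy_cpu_down(&c, TOY_OFFLINE) leaves a trace in which the teardown entries mirror the startup entries in reverse order.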
Each state in the state machine is represented by a cpuhp_step structure, which mainly holds the callbacks to run on CPU online and offline. It is defined as follows:
struct cpuhp_step {
	const char		*name;				/* (1) */
	union {
		int		(*single)(unsigned int cpu);
		int		(*multi)(unsigned int cpu,
					 struct hlist_node *node);
	} startup;						/* (2) */
	union {
		int		(*single)(unsigned int cpu);
		int		(*multi)(unsigned int cpu,
					 struct hlist_node *node);
	} teardown;						/* (3) */
	struct hlist_head	list;				/* (4) */
	bool			cant_stop;			/* (5) */
	bool			multi_instance;			/* (6) */
};
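(1) The name of the state this step belongs to
(2) The callback to run on CPU online
(3) The callback to run on CPU offline. Some states can have multiple instances, and the callback then has to be run once per instance, so both the startup and the teardown callback come in a single and a multi version
(4) For a multi-instance state, all of its instances are linked into one list; this member is the list head
(5) If set, the hotplug flow cannot be stopped in this state
(6) Indicates whether this is a multi-instance state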
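How the single and multi variants differ at call time can be sketched as follows. This is a minimal user-space illustration, not the kernel's cpuhp_invoke_callback(): struct toy_node stands in for struct hlist_node, and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct hlist_node: one registered instance. */
struct toy_node {
	struct toy_node *next;
	int hits;		/* how many times the callback saw this instance */
};

struct toy_step {
	const char *name;
	union {
		int (*single)(unsigned int cpu);
		int (*multi)(unsigned int cpu, struct toy_node *node);
	} startup;
	struct toy_node *list;	/* head of the instance list (multi-instance only) */
	bool multi_instance;
};

static int toy_single_hits;

static int toy_mark_single(unsigned int cpu)
{
	(void)cpu;
	toy_single_hits++;
	return 0;
}

static int toy_mark_instance(unsigned int cpu, struct toy_node *node)
{
	(void)cpu;
	node->hits++;
	return 0;
}

/* Run the startup side of one step: once, or once per registered instance. */
static int toy_invoke_startup(const struct toy_step *step, unsigned int cpu)
{
	assert(step != NULL);

	if (!step->multi_instance)
		return step->startup.single ? step->startup.single(cpu) : 0;

	if (!step->startup.multi)
		return 0;

	for (struct toy_node *n = step->list; n; n = n->next) {
		int ret = step->startup.multi(cpu, n);
		if (ret)
			return ret;	/* stop at the first failing instance */
	}
	return 0;
}
```

A step with multi_instance set and two nodes on its list gets its callback invoked twice, once per node, while a single-instance step is invoked exactly once.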
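While a CPU is running, the modules most closely tied to it are task management, interrupt management, the time subsystem, and the topology-related modules. The corresponding block diagram is as follows:
When a CPU is shut down, these modules depend on one another in various ways, so handling their shutdown order correctly is the key to shutting the CPU down correctly. The rest of this article focuses on these modules.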
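3. Flow
3.1 Data structures
There are three key data structures, as shown in the figure below:
- struct cpuhp_cpu_state: stores the per-CPU hotplug state;
- enum cpuhp_state: enumerates the states. Each value corresponds to one entry in a global array, and that entry defines the callbacks. Callbacks can also be installed through a function interface;
- struct cpuhp_step: a hotplug state machine step. It mainly holds the function pointers that are called back when the state machine moves to that state.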
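To illustrate "installing callbacks through a function interface", here is a toy table with a registration helper. The kernel's real interface for this is cpuhp_setup_state(); the version below is only a simplified stand-in under stated assumptions: the toy_* names, the four-slot dynamic range, and the return convention (slot number on success, -1 on failure) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state space with a small range reserved for dynamic users. */
enum {
	TOY_OFFLINE,
	TOY_PREPARE,
	TOY_DYN_FIRST,
	TOY_DYN_LAST = TOY_DYN_FIRST + 3,
	TOY_ONLINE,
	TOY_NR
};

typedef int (*toy_cb)(unsigned int cpu);

struct toy_step {
	const char *name;	/* a non-NULL name marks the slot as taken */
	toy_cb startup;
	toy_cb teardown;
};

/* Statically defined entries, indexed by state, as in the global array. */
static struct toy_step toy_states[TOY_NR] = {
	[TOY_OFFLINE] = { .name = "offline" },
	[TOY_ONLINE]  = { .name = "online" },
};

static int toy_noop(unsigned int cpu) { (void)cpu; return 0; }

/*
 * Install callbacks for a state. Passing TOY_DYN_FIRST asks for any free
 * slot in the dynamic range. Returns the slot used, or -1 on failure.
 */
static int toy_setup_state(int state, const char *name,
			   toy_cb startup, toy_cb teardown)
{
	if (state < 0 || state >= TOY_NR || !name)
		return -1;

	if (state == TOY_DYN_FIRST) {
		while (state <= TOY_DYN_LAST && toy_states[state].name)
			state++;
		if (state > TOY_DYN_LAST)
			return -1;	/* dynamic range exhausted */
	} else if (toy_states[state].name) {
		return -1;		/* fixed slot already taken */
	}

	assert(toy_states[state].name == NULL);
	toy_states[state] = (struct toy_step){ name, startup, teardown };
	return state;
}
```

Two dynamic registrations in a row land in consecutive slots, and a second registration for an occupied fixed state is rejected.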
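Some additions and corrections to the above:
BP: Boot Processor, the CPU that brings up the other CPUs
AP: Application Processor, a CPU that is brought up
Regarding the separate BP and AP global arrays shown above: in current 5.15.* kernels they have already been merged into a single array: static struct cpuhp_step cpuhp_hp_states[];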
/* Boot processor state steps */
static struct cpuhp_step cpuhp_hp_states[] = {
	[CPUHP_OFFLINE] = {
		.name			= "offline",
		.startup.single		= NULL,
		.teardown.single	= NULL,
	},
#ifdef CONFIG_SMP
	[CPUHP_CREATE_THREADS]= {
		.name			= "threads:prepare",
		.startup.single		= smpboot_create_threads,
		.teardown.single	= NULL,
		.cant_stop		= true,
	},
	[CPUHP_PERF_PREPARE] = {
		.name			= "perf:prepare",
		.startup.single		= perf_event_init_cpu,
		.teardown.single	= perf_event_exit_cpu,
	},
	[CPUHP_RANDOM_PREPARE] = {
		.name			= "random:prepare",
		.startup.single		= random_prepare_cpu,
		.teardown.single	= NULL,
	},
	[CPUHP_WORKQUEUE_PREP] = {
		.name			= "workqueue:prepare",
		.startup.single		= workqueue_prepare_cpu,
		.teardown.single	= NULL,
	},
	[CPUHP_HRTIMERS_PREPARE] = {
		.name			= "hrtimers:prepare",
		.startup.single		= hrtimers_prepare_cpu,
		.teardown.single	= hrtimers_dead_cpu,
	},
	[CPUHP_SMPCFD_PREPARE] = {
		.name			= "smpcfd:prepare",
		.startup.single		= smpcfd_prepare_cpu,
		.teardown.single	= smpcfd_dead_cpu,
	},
	[CPUHP_RELAY_PREPARE] = {
		.name			= "relay:prepare",
		.startup.single		= relay_prepare_cpu,
		.teardown.single	= NULL,
	},
	[CPUHP_SLAB_PREPARE] = {
		.name			= "slab:prepare",
		.startup.single		= slab_prepare_cpu,
		.teardown.single	= slab_dead_cpu,
	},
	[CPUHP_RCUTREE_PREP] = {
		.name			= "RCU/tree:prepare",
		.startup.single		= rcutree_prepare_cpu,
		.teardown.single	= rcutree_dead_cpu,
	},
	/*
	 * On the tear-down path, timers_dead_cpu() must be invoked
	 * before blk_mq_queue_reinit_notify() from notify_dead(),
	 * otherwise a RCU stall occurs.
	 */
	[CPUHP_TIMERS_PREPARE] = {
		.name			= "timers:prepare",
		.startup.single		= timers_prepare_cpu,
		.teardown.single	= timers_dead_cpu,
	},
	/* Kicks the plugged cpu into life */
	[CPUHP_BRINGUP_CPU] = {
		.name			= "cpu:bringup",
		.startup.single		= bringup_cpu,
		.teardown.single	= finish_cpu,
		.cant_stop		= true,
	},
	/* Final state before CPU kills itself */
	[CPUHP_AP_IDLE_DEAD] = {
		.name			= "idle:dead",
	},
	/*
	 * Last state before CPU enters the idle loop to die. Transient state
	 * for synchronization.
	 */
	[CPUHP_AP_OFFLINE] = {
		.name			= "ap:offline",
		.cant_stop		= true,
	},
	/* First state is scheduler control. Interrupts are disabled */
	[CPUHP_AP_SCHED_STARTING] = {
		.name			= "sched:starting",
		.startup.single		= sched_cpu_starting,
		.teardown.single	= sched_cpu_dying,
	},
	[CPUHP_AP_RCUTREE_DYING] = {
		.name			= "RCU/tree:dying",
		.startup.single		= NULL,
		.teardown.single	= rcutree_dying_cpu,
	},
	[CPUHP_AP_SMPCFD_DYING] = {
		.name			= "smpcfd:dying",
		.startup.single		= NULL,
		.teardown.single	= smpcfd_dying_cpu,
	},
	/*
	 * Entry state on starting. Interrupts enabled from here on. Transient
	 * state for synchronization
	 */
	[CPUHP_AP_ONLINE] = {
		.name			= "ap:online",
	},
	/*
	 * Handled on control processor until the plugged processor manages
	 * this itself.
	 */
	[CPUHP_TEARDOWN_CPU] = {
		.name			= "cpu:teardown",
		.startup.single		= NULL,
		.teardown.single	= takedown_cpu,
		.cant_stop		= true,
	},
	[CPUHP_AP_SCHED_WAIT_EMPTY] = {
		.name			= "sched:waitempty",
		.startup.single		= NULL,
		.teardown.single	= sched_cpu_wait_empty,
	},
	/* Handle smpboot threads park/unpark */
	[CPUHP_AP_SMPBOOT_THREADS] = {
		.name			= "smpboot/threads:online",
		.startup.single		= smpboot_unpark_threads,
		.teardown.single	= smpboot_park_threads,
	},
	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
		.name			= "irq/affinity:online",
		.startup.single		= irq_affinity_online_cpu,
		.teardown.single	= NULL,
	},
	[CPUHP_AP_PERF_ONLINE] = {
		.name			= "perf:online",
		.startup.single		= perf_event_init_cpu,
		.teardown.single	= perf_event_exit_cpu,
	},
	[CPUHP_AP_WATCHDOG_ONLINE] = {
		.name			= "lockup_detector:online",
		.startup.single		= lockup_detector_online_cpu,
		.teardown.single	= lockup_detector_offline_cpu,
	},
	[CPUHP_AP_WORKQUEUE_ONLINE] = {
		.name			= "workqueue:online",
		.startup.single		= workqueue_online_cpu,
		.teardown.single	= workqueue_offline_cpu,
	},
	[CPUHP_AP_RANDOM_ONLINE] = {
		.name			= "random:online",
		.startup.single		= random_online_cpu,
		.teardown.single	= NULL,
	},
	[CPUHP_AP_RCUTREE_ONLINE] = {
		.name			= "RCU/tree:online",
		.startup.single		= rcutree_online_cpu,
		.teardown.single	= rcutree_offline_cpu,
	},
#endif
	/*
	 * The dynamically registered state space is here
	 */

#ifdef CONFIG_SMP
	/* Last state is scheduler control setting the cpu active */
	[CPUHP_AP_ACTIVE] = {
		.name			= "sched:active",
		.startup.single		= sched_cpu_activate,
		.teardown.single	= sched_cpu_deactivate,
	},
#endif

	/* CPU is fully up and running. */
	[CPUHP_ONLINE] = {
		.name			= "online",
		.startup.single		= NULL,
		.teardown.single	= NULL,
	},
};
The following shows the function call relationships on the BP and the AP during bring-up:
BP (assumed to be CPU0):
smp_init
  do_cpu_up
    --> cpuhp_invoke_callback
          --> smpboot_create_threads
              perf_event_init_cpu
              workqueue_prepare_cpu
              hrtimers_prepare_cpu
              ...
              bringup_cpu
                --> __cpu_up
                      --> boot_secondary (cpu_ops[cpu]->cpu_boot(cpu))
                            --> cpu_psci_cpu_boot
                                  --> psci_cpu_on (invoke_psci_fn(0xc4000003, cpuid, secondary_entry, 0);)
                                      secondary_entry (the entry address handed to the PSCI firmware; the symbol is defined in the kernel's arch/arm64/kernel/head.S)
                    bringup_wait_for_ap

AP (assumed to be CPU1 for now), the next CPU, entered at secondary_entry:
--> secondary_startup
      --> __secondary_switched
            --> secondary_start_kernel
                  --> notify_cpu_starting
                        --> cpuhp_invoke_callback
                              --> sched_cpu_starting
                                  gic_starting_cpu
                                  arch_timer_starting_cpu
                                  dummy_timer_starting_cpu
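The magic number 0xc4000003 passed to invoke_psci_fn above is an SMC Calling Convention function identifier. The sketch below decodes the fields of such an identifier. The bit layout follows the Arm SMCCC as I understand it; struct smccc_id and both helper names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fields of an SMCCC function identifier (names are illustrative only). */
struct smccc_id {
	bool fast;		/* bit 31: fast call */
	bool smc64;		/* bit 30: 64-bit calling convention */
	unsigned int owner;	/* bits 29:24: owning service range */
	unsigned int func;	/* bits 15:0: function number */
};

static struct smccc_id smccc_decode(uint32_t id)
{
	struct smccc_id d;

	d.fast  = (id >> 31) & 1u;
	d.smc64 = (id >> 30) & 1u;
	d.owner = (id >> 24) & 0x3fu;
	d.func  = id & 0xffffu;
	return d;
}

static uint32_t smccc_encode(struct smccc_id d)
{
	assert(d.owner <= 0x3fu && d.func <= 0xffffu);
	return ((uint32_t)d.fast << 31) | ((uint32_t)d.smc64 << 30) |
	       ((uint32_t)d.owner << 24) | (uint32_t)d.func;
}
```

Decoding 0xc4000003 gives a fast SMC64 call with owner 4 (the standard secure service range that PSCI uses) and function number 3, which is CPU_ON. The CPU_OFF identifier 0x84000002 decodes the same way, as a 32-bit call with function number 2.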
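The log below shows the function calls made while the CPUs are brought up. The callbacks being invoked are the ones defined in the cpuhp_hp_states[] array above: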
[ 0.064050] smp: ====cpuhp-walk:Bringing up secondary CPUs ...
[ 0.064051] smp: ====cpuhp-walk:before bringing up cpu:0
[ 0.064052] smp: ====cpuhp-walk:after bringing up cpu:0
[ 0.064053] smp: ====cpuhp-walk:before bringing up cpu:1
[ 0.064053] cpuhp_walk:cpu:1,do_cpu_up
[ 0.064060] cphp_walk:cpu:1,cpuhp_up_callbacks
[ 0.064061] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:smpboot_create_threads+0x0/0xe0
[ 0.104074] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.104080] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:workqueue_prepare_cpu+0x0/0x98
[ 0.112080] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:hrtimers_prepare_cpu+0x0/0xb0
[ 0.112084] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:smpcfd_prepare_cpu+0x0/0x70
[ 0.112089] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:relay_prepare_cpu+0x0/0xf8
[ 0.112092] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:rcutree_prepare_cpu+0x0/0x1c0
[ 0.112099] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:timers_prepare_cpu+0x0/0x80
[ 0.112101] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:bringup_cpu+0x0/0x140
[ 0.112104] cpuhp_walk:cpu:1,bringup_cpu
[ 0.112105] cpuhp_walk:cpu:1,__cpu_up
[ 0.112106] cpuhp_walk:cpu:1,boot_secondary, cpu_boot:cpu_psci_cpu_boot+0x0/0x94
[ 0.112109] psci: cpuhp_walk:cpu:1,cpu_psci_cpu_boot, cpu_on:psci_cpu_on+0x0/0x94
[ 0.112112] psci: cphp_walk:cpuid:1,psci_cpu_on, fn:0xc4000003
[ 0.145924] cphp_walk:cpu:1,secondary_start_kernel
[ 0.145934] Detected PIPT I-cache on CPU1
[ 0.145942] cpuhp_walk:cpu:1,notify_cpu_starting
[ 0.145943] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:sched_cpu_starting+0x0/0x120
[ 0.145946] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:gic_starting_cpu+0x0/0x40
[ 0.145950] GICv3: CPU1: found redistributor 1 region 1:0x00000000299a0000
[ 0.145958] GICv3: CPU1: using allocated LPI pending table @0x0000002176620000
[ 0.145964] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:arch_timer_starting_cpu+0x0/0x318
[ 0.145971] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:dummy_timer_starting_cpu+0x0/0x88
[ 0.145973] cpuhp_walk, CPU1: Booted secondary processor 0x0000000001 [0x701f6633]
[ 0.145978] cpuhp_walk:cpu:1,bringup_cpu, wait ap
[ 0.145980] cphp_walk:cpu:1,bringup_wait_for_ap, 1
[ 0.145984] cphp_walk:cpu:1,bringup_wait_for_ap, 2
[ 0.145987] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:smpboot_unpark_threads+0x0/0xb0
[ 0.145994] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:irq_affinity_online_cpu+0x0/0x110
[ 0.145998] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.146002] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:workqueue_online_cpu+0x0/0x228
[ 0.146019] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:rcutree_online_cpu+0x0/0x98
[ 0.146022] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:page_writeback_cpu_online+0x0/0x20
[ 0.146025] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:vmstat_cpu_online+0x0/0x70
[ 0.146029] cphp_walk:cpu:1,cpuhp_invoke_callback, 1.1, cb:sched_cpu_activate+0x0/0x168
[ 0.146035] smp: ====cpuhp-walk:after bringing up cpu:1
[ 0.146036] smp: ====cpuhp-walk:before bringing up cpu:2
[ 0.146037] cpuhp_walk:cpu:2,do_cpu_up
[ 0.146041] cphp_walk:cpu:2,cpuhp_up_callbacks
[ 0.146042] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:smpboot_create_threads+0x0/0xe0
[ 0.185939] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.185943] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:workqueue_prepare_cpu+0x0/0x98
[ 0.193946] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:hrtimers_prepare_cpu+0x0/0xb0
[ 0.193949] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:smpcfd_prepare_cpu+0x0/0x70
[ 0.193952] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:relay_prepare_cpu+0x0/0xf8
[ 0.193954] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:rcutree_prepare_cpu+0x0/0x1c0
[ 0.193959] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:timers_prepare_cpu+0x0/0x80
[ 0.193961] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:bringup_cpu+0x0/0x140
[ 0.193964] cpuhp_walk:cpu:2,bringup_cpu
[ 0.193964] cpuhp_walk:cpu:2,__cpu_up
[ 0.193965] cpuhp_walk:cpu:2,boot_secondary, cpu_boot:cpu_psci_cpu_boot+0x0/0x94
[ 0.193967] psci: cpuhp_walk:cpu:2,cpu_psci_cpu_boot, cpu_on:psci_cpu_on+0x0/0x94
[ 0.193969] psci: cphp_walk:cpuid:256,psci_cpu_on, fn:0xc4000003
[ 0.227952] cphp_walk:cpu:2,secondary_start_kernel
[ 0.227961] Detected PIPT I-cache on CPU2
[ 0.227967] cpuhp_walk:cpu:2,notify_cpu_starting
[ 0.227968] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:sched_cpu_starting+0x0/0x120
[ 0.227972] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:gic_starting_cpu+0x0/0x40
[ 0.227976] GICv3: CPU2: found redistributor 100 region 2:0x00000000299c0000
[ 0.227994] GICv3: CPU2: using allocated LPI pending table @0x0000002176630000
[ 0.227999] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:arch_timer_starting_cpu+0x0/0x318
[ 0.228007] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:dummy_timer_starting_cpu+0x0/0x88
[ 0.228010] cpuhp_walk, CPU2: Booted secondary processor 0x0000000100 [0x701f6633]
[ 0.228017] cpuhp_walk:cpu:2,bringup_cpu, wait ap
[ 0.228019] cphp_walk:cpu:2,bringup_wait_for_ap, 1
[ 0.228022] cphp_walk:cpu:2,bringup_wait_for_ap, 2
[ 0.228029] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:smpboot_unpark_threads+0x0/0xb0
[ 0.228039] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:irq_affinity_online_cpu+0x0/0x110
[ 0.228042] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.228047] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:workqueue_online_cpu+0x0/0x228
[ 0.228070] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:rcutree_online_cpu+0x0/0x98
[ 0.228073] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:page_writeback_cpu_online+0x0/0x20
[ 0.228076] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:vmstat_cpu_online+0x0/0x70
[ 0.228080] cphp_walk:cpu:2,cpuhp_invoke_callback, 1.1, cb:sched_cpu_activate+0x0/0x168
[ 0.228086] smp: ====cpuhp-walk:after bringing up cpu:2
[ 0.228087] smp: ====cpuhp-walk:before bringing up cpu:3
[ 0.228088] cpuhp_walk:cpu:3,do_cpu_up
[ 0.228092] cphp_walk:cpu:3,cpuhp_up_callbacks
[ 0.228093] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:smpboot_create_threads+0x0/0xe0
[ 0.267968] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.267972] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:workqueue_prepare_cpu+0x0/0x98
[ 0.275974] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:hrtimers_prepare_cpu+0x0/0xb0
[ 0.275977] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:smpcfd_prepare_cpu+0x0/0x70
[ 0.275980] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:relay_prepare_cpu+0x0/0xf8
[ 0.275983] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:rcutree_prepare_cpu+0x0/0x1c0
[ 0.275988] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:timers_prepare_cpu+0x0/0x80
[ 0.275990] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:bringup_cpu+0x0/0x140
[ 0.275993] cpuhp_walk:cpu:3,bringup_cpu
[ 0.275994] cpuhp_walk:cpu:3,__cpu_up
[ 0.275995] cpuhp_walk:cpu:3,boot_secondary, cpu_boot:cpu_psci_cpu_boot+0x0/0x94
[ 0.275996] psci: cpuhp_walk:cpu:3,cpu_psci_cpu_boot, cpu_on:psci_cpu_on+0x0/0x94
[ 0.275998] psci: cphp_walk:cpuid:257,psci_cpu_on, fn:0xc4000003
[ 0.309980] cphp_walk:cpu:3,secondary_start_kernel
[ 0.309985] Detected PIPT I-cache on CPU3
[ 0.309989] cpuhp_walk:cpu:3,notify_cpu_starting
[ 0.309990] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:sched_cpu_starting+0x0/0x120
[ 0.309993] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:gic_starting_cpu+0x0/0x40
[ 0.309996] GICv3: CPU3: found redistributor 101 region 3:0x00000000299e0000
[ 0.310002] GICv3: CPU3: using allocated LPI pending table @0x0000002176640000
[ 0.310007] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:arch_timer_starting_cpu+0x0/0x318
[ 0.310012] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:dummy_timer_starting_cpu+0x0/0x88
[ 0.310014] cpuhp_walk, CPU3: Booted secondary processor 0x0000000101 [0x701f6633]
[ 0.310019] cpuhp_walk:cpu:3,bringup_cpu, wait ap
[ 0.310021] cphp_walk:cpu:3,bringup_wait_for_ap, 1
[ 0.310024] cphp_walk:cpu:3,bringup_wait_for_ap, 2
[ 0.310028] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:smpboot_unpark_threads+0x0/0xb0
[ 0.310036] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:irq_affinity_online_cpu+0x0/0x110
[ 0.310039] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:perf_event_init_cpu+0x0/0x140
[ 0.310043] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:workqueue_online_cpu+0x0/0x228
[ 0.310050] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:rcutree_online_cpu+0x0/0x98
[ 0.310052] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:page_writeback_cpu_online+0x0/0x20
[ 0.310054] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:vmstat_cpu_online+0x0/0x70
[ 0.310057] cphp_walk:cpu:3,cpuhp_invoke_callback, 1.1, cb:sched_cpu_activate+0x0/0x168
[ 0.310063] smp: ====cpuhp-walk:after bringing up cpu:3
[ 0.310064] smp: ====cpuhp-walk:Brought up 1 node, 4 CPUs
[ 0.310066] SMP: Total of 4 processors activated.
[ 0.310067] CPU features: detected: 32-bit EL0 Support
[ 0.310069] CPU features: detected: CRC32 instructions
[ 0.310303] CPU: All CPU(s) started at EL2
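The log shows that CPU0 ran the callbacks up to and including cpuhp_hp_states[CPUHP_BRINGUP_CPU], while CPU1 ran the callbacks after it.
The log also reports the L1 instruction cache type (Detected PIPT I-cache on CPU1), and its last lines show that the CPUs were started at EL2 and that 32-bit EL0 is supported.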
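The BP/AP split read off the log can be written as a small predicate. It mirrors the check the kernel makes in cpuhp_is_ap_state() in kernel/cpu.c (every state above CPUHP_BRINGUP_CPU except CPUHP_TEARDOWN_CPU belongs to the AP), but the shortened enum and the toy_* names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened state list kept in the same relative order. */
enum toy_state {
	TOY_OFFLINE, TOY_CREATE_THREADS, TOY_TIMERS_PREPARE, TOY_BRINGUP_CPU,
	TOY_AP_SCHED_STARTING, TOY_AP_ONLINE, TOY_TEARDOWN_CPU,
	TOY_AP_SMPBOOT_THREADS, TOY_AP_ACTIVE, TOY_ONLINE
};

/* True if the state is handled on the CPU being plugged (the AP). */
static bool toy_is_ap_state(enum toy_state s)
{
	return s > TOY_BRINGUP_CPU && s != TOY_TEARDOWN_CPU;
}

/* Count the states in (from, to] that would be handled on the AP. */
static int toy_count_ap_states(enum toy_state from, enum toy_state to)
{
	int n = 0;

	assert(from <= to);
	for (int s = (int)from + 1; s <= (int)to; s++)
		n += toy_is_ap_state((enum toy_state)s) ? 1 : 0;
	return n;
}
```

With this list, nothing up to TOY_BRINGUP_CPU counts as an AP state, which matches the log: everything through bringup_cpu is issued by CPU0, and the callbacks from sched_cpu_starting onwards run on the new CPU.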
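3.2 cpu_up
The Linux kernel creates a virtual bus, cpu_subsys. Every CPU is attached to this bus when it is registered, and CPU online and offline operations end up in callbacks on this bus. CPU hotplug is controlled with echo 0 > /sys/devices/system/cpu/cpu1/online and echo 1 > /sys/devices/system/cpu/cpu1/online.
- The kernel creates one hotplug thread per CPU to run the teardown/startup callbacks;
- cpu_up relies on the underlying implementation of the __cpu_up function;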
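The same sysfs attribute can be driven from a program instead of echo. The following is a minimal sketch, assuming a Linux system and sufficient privileges; cpu_online_path() and cpu_set_online() are hypothetical helper names, not an existing library API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Build the sysfs path of a CPU's online attribute; returns its length or -1. */
static int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Write "1" or "0" to the attribute; returns 0 on success, -1 on failure. */
static int cpu_set_online(unsigned int cpu, bool online)
{
	char path[64];
	FILE *f;
	int ret = 0;

	if (cpu_online_path(path, sizeof(path), cpu) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such CPU, no permission, or not hot-pluggable */

	if (fputs(online ? "1" : "0", f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)	/* a rejected write may only surface at flush time */
		ret = -1;
	return ret;
}
```

Calling cpu_set_online(1, false) corresponds to echo 0 > /sys/devices/system/cpu/cpu1/online, and cpu_set_online(1, true) brings the CPU back.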
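3.3 Analysis of the CPU offline flow
CPU hotplug is driven through the CPU's online attribute. To turn a CPU off, it is enough to run the following command:
echo 0 > /sys/devices/system/cpu/cpu0/online
Because this flow is fairly complex inside the kernel, it is shown in the following two overview diagrams:
As the diagrams show, the flow involves three threads plus the periodic scheduler. The three threads are:
(1) The thread that issues the CPU hotplug command, called the shell thread below
(2) The per-CPU thread that runs the hotplug state machine, called the hotplug thread below
(3) The idle thread
Because the parts of the CPU power-down flow depend on one another, their execution has to be synchronized. The red and purple dashed lines in the diagram serve this purpose, and they relate to each other as follows: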
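(1) The shell thread issues the command to turn the CPU off and, inside the kernel, wakes the per-CPU hotplug thread to run the actual state machine processing. It then waits to be woken again once the hotplug thread has finished
(2) The hotplug thread marks the CPU inactive so that the load balancer no longer moves load onto it. It also enables the balance push mechanism, so that tasks on this CPU perform a balance_push operation in the scheduler path and migrate themselves to other CPUs
Because balance push is carried out by the periodic scheduler, which is triggered by the tick interrupt, the hotplug thread can keep running the subsequent state machine callbacks while tasks are being migrated. However, takedown_cpu, the callback for the CPUHP_TEARDOWN_CPU state, migrates the CPU's interrupts and shuts down the tick, among other things. That work must wait until task migration has finished, because the periodic scheduler is driven by the tick and by interrupts. The hotplug thread therefore waits for task migration to complete before that state is run
(3) When task migration has finished, the scheduler wakes the hotplug thread so that it can continue. The next step is takedown_cpu, which eventually triggers the power-down of the CPU. The hotplug thread is a per-CPU thread, and with the CPU itself about to quit, a thread bound to it has clearly done its job too. So the thread wakes the shell thread to run the remaining state machine callbacks, then stops running and is parked
(4) Once woken, the shell thread runs on another CPU (a woken task goes through CPU selection, and the CPU being turned off is already inactive, so it is not chosen) and starts the takedown_cpu flow
Since task migration is already complete, it calls stop_machine to remove the CPU from the CPU topology and from its NUMA node, set the CPU's state to offline, and migrate its interrupts to other CPUs. When that is done, the stopper thread parks itself. Only the idle task is now left on that CPU, so the idle task starts to run
(5) When the idle task sees that the CPU is offline, it starts the CPU's self-termination flow. It first stops the idle task's own nohz tick, which in effect finishes off the task itself. It then wakes the shell thread so that the shell thread can run the remaining state machine callbacks. Finally it calls the PSCI CPU power-down interface and traps into BL31, which completes the power-down of the CPU
(6) Once woken, the shell thread finishes the work left over after the CPU has powered down, which completes the whole CPU offline flow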
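Step (2) above, marking the CPU inactive and then draining its run queue before takedown_cpu may proceed, can be modelled in a few lines. This is a deliberately simplified toy: it moves one task per simulated tick to the least loaded active CPU, which is not how the kernel's balance_push is implemented, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

struct toy_sched {
	unsigned int nr_running[TOY_NR_CPUS];	/* runnable tasks per CPU */
	bool active[TOY_NR_CPUS];		/* may the CPU receive tasks? */
};

/* Pick the active CPU with the fewest runnable tasks, or -1 if none is active. */
static int toy_select_cpu(const struct toy_sched *s)
{
	int best = -1;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (!s->active[cpu])
			continue;
		if (best < 0 || s->nr_running[cpu] < s->nr_running[best])
			best = cpu;
	}
	return best;
}

/*
 * Mark the CPU inactive, then push one task away per simulated tick until
 * its run queue is empty. Returns the number of ticks used, or -1 if no
 * active CPU is left to take the tasks.
 */
static int toy_offline_drain(struct toy_sched *s, unsigned int cpu)
{
	int ticks = 0;

	assert(cpu < TOY_NR_CPUS);
	s->active[cpu] = false;		/* nothing new is placed here from now on */

	while (s->nr_running[cpu] > 0) {
		int dst = toy_select_cpu(s);

		if (dst < 0)
			return -1;
		s->nr_running[cpu]--;
		s->nr_running[dst]++;
		ticks++;
	}
	return ticks;			/* only now may the teardown continue */
}
```

The return from toy_offline_drain() plays the role of the wake-up in step (3): the caller continues with the teardown only after the run queue has emptied.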
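Here is an additional diagram from another reader:
- The cpu down flow is the reverse of cpu up, and the two are very similar overall;
- After the cpuhp hotplug thread is created, should_run is false, so it does no real work and sits in the S (sleeping) state. In _cpu_down, __cpuhp_kick_ap sets should_run to true and calls wake_up_process so that the cpuhp thread runs on the CPU being unplugged. The caller then waits for the cpuhp thread to run the teardown.single callbacks down to the requested state (from CPUHP_ONLINE to CPUHP_TEARDOWN_CPU);
- The cpuhp thread (cpuhp_thread_fun) runs from CPUHP_ONLINE down to CPUHP_TEARDOWN_CPU, invoking the corresponding teardown.single callbacks. takedown_cpu itself first runs on the CPU that is handling the unplug. It stops the per-CPU thread and then uses stop_machine_cpuslocked to queue take_cpu_down onto the stopper thread of the CPU being unplugged. Inside __stop_cpus, called from stop_machine_cpuslocked, the unplugging CPU waits in wait_for_completion(&done.completion) for the migration thread (cpu_stopper_thread) of the CPU being unplugged to finish the work callback (multi_cpu_stop, which calls take_cpu_down). When the CPU being unplugged has finished that callback, cpu_stop_signal_done signals the completion (complete(&done->completion)), and the unplugging CPU goes on to power off the CPU being unplugged;
- The stopper thread (migration) running on the CPU being unplugged executes take_cpu_down. It first calls __cpu_disable to mark the CPU offline, then runs the teardown.single callbacks from CPUHP_TEARDOWN_CPU down to CPUHP_AP_OFFLINE, and then calls stop_machine_park to set KTHREAD_SHOULD_PARK on the stopper thread. The per-CPU thread loop smpboot_thread_fn checks KTHREAD_SHOULD_PARK and parks the stopper thread, and kthread_parkme calls schedule_preempt_disabled to schedule it out. At that point the idle thread of the CPU being unplugged runs (or another task that is not on a run queue is pulled onto that CPU);
- The idle thread on the CPU being unplugged uses cpu_is_offline to check whether the current CPU is offline. If it is, it calls cpuhp_report_idle_dead, which calls cpuhp_complete_idle_dead. That sets the state to CPUHP_AP_IDLE_DEAD and then calls complete_ap_thread. Until then the unplugging CPU has been blocked in wait_for_ap_thread inside takedown_cpu; once released, it carries out the final unplug steps;
- CPU power is cut through do_idle->cpu_die->cpu_off, which uses PSCI to ask ATF to turn the CPU off. ATF then talks to the other power management modules, and the power management module finally cuts power to the CPU.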
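The bullets above can be condensed into one question: for a given teardown state, which context runs it? The sketch below encodes the answer as described in this section. The shortened enum, the toy_ctx values, and toy_teardown_ctx() are hypothetical names for illustration, not kernel code.

```c
#include <assert.h>

/* Hypothetical, shortened state list kept in the same relative order. */
enum toy_state {
	TOY_OFFLINE, TOY_TIMERS_PREPARE, TOY_BRINGUP_CPU,
	TOY_AP_IDLE_DEAD, TOY_AP_OFFLINE, TOY_AP_SCHED_STARTING, TOY_AP_ONLINE,
	TOY_TEARDOWN_CPU, TOY_AP_SCHED_WAIT_EMPTY, TOY_AP_ACTIVE, TOY_ONLINE
};

enum toy_ctx {
	CTX_HOTPLUG_THREAD,	/* cpuhp thread on the CPU being unplugged */
	CTX_CONTROL_CPU,	/* the thread that requested the offline, on another CPU */
	CTX_STOPPER,		/* stopper (migration) thread on the CPU being unplugged */
	CTX_IDLE		/* idle task on the CPU being unplugged */
};

/* Which context handles the teardown side of a state during cpu down. */
static enum toy_ctx toy_teardown_ctx(enum toy_state s)
{
	assert(s <= TOY_ONLINE);

	if (s > TOY_TEARDOWN_CPU)
		return CTX_HOTPLUG_THREAD;	/* CPUHP_ONLINE down to just above teardown */
	if (s == TOY_TEARDOWN_CPU)
		return CTX_CONTROL_CPU;		/* takedown_cpu */
	if (s >= TOY_AP_OFFLINE)
		return CTX_STOPPER;		/* take_cpu_down, down to CPUHP_AP_OFFLINE */
	if (s == TOY_AP_IDLE_DEAD)
		return CTX_IDLE;		/* reported from the idle loop */
	return CTX_CONTROL_CPU;			/* cleanup after the CPU has powered down */
}
```

Reading the enum from TOY_ONLINE downwards reproduces the hand-offs in the text: hotplug thread, control CPU, stopper thread, idle task, and finally the control CPU again for the remaining cleanup callbacks.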