Preface
The kernel prints a flood of "AMD-Vi: Completion-Wait loop timed out" messages, accompanied by soft lockups or RCU CPU stalls, as shown below:
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
Dec 8 10:02:22 kernel: CPU: 46 PID: 0 Comm: swapper/46 Tainted: G L 5.10.128 #2
Dec 8 10:02:22 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
Dec 8 10:02:22 kernel: Call Trace:
Dec 8 10:02:22 kernel: <IRQ>
Dec 8 10:02:22 kernel: amd_iommu_flush_iotlb_all+0x4e/0x60
Dec 8 10:02:22 kernel: iommu_dma_flush_iotlb_all+0x1d/0x20
Dec 8 10:02:22 kernel: iova_domain_flush+0x1e/0x30
Dec 8 10:02:22 kernel: fq_flush_timeout+0x39/0xb0
Dec 8 10:02:22 kernel: ? fq_ring_free+0x110/0x110
Dec 8 10:02:22 kernel: call_timer_fn+0x2e/0x100
Dec 8 10:02:22 kernel: __run_timers.part.0+0x1de/0x260
Dec 8 10:02:22 kernel: ? clockevents_program_event+0x8f/0xe0
Dec 8 10:02:22 kernel: ? tick_program_event+0x41/0x80
Dec 8 10:02:22 kernel: run_timer_softirq+0x2a/0x50
Dec 8 10:02:22 kernel: __do_softirq+0xce/0x281
Dec 8 10:02:22 kernel: asm_call_irq_on_stack+0x12/0x20
Dec 8 10:02:22 kernel: </IRQ>
Dec 8 10:02:22 kernel: do_softirq_own_stack+0x3d/0x50
Dec 8 10:02:22 kernel: irq_exit_rcu+0xc5/0x100
Dec 8 10:02:22 kernel: sysvec_apic_timer_interrupt+0x3d/0x90
Dec 8 10:02:22 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Dec 8 10:02:22 kernel: RIP: 0010:native_safe_halt+0xe/0x10
A diligent reader can quickly google their way to the following link:
https://support.lenovo.com/us/en/solutions/tt1512-thinksystem-server-with-amd-processor-running-linux-may-hang-or-crash-with-kernel-message-amd-vi-completion-wait-loop-timed-out
However, it does not explain why the machine hits a soft lockup, nor why the soft lockup keeps happening on the same CPU.
Where the timed-out log comes from
The message comes from COMPLETION_WAIT, a command in the AMD IOMMU architecture; see section 2.4.1 COMPLETION_WAIT of the specification (linked at the end of this article):
The COMPLETION_WAIT command allows software to serialize itself with IOMMU command processing. The COMPLETION_WAIT command does not finish until all older commands issued since a prior COMPLETION_WAIT have completely executed.
The command's fields also describe how its completion is signaled: when the command completes, the IOMMU writes cmd.store_data to the address given in cmd.store_addr. See the code:
5.10.128 iommu_completion_wait()
---
	data = ++iommu->cmd_sem_val;
	build_completion_wait(&cmd, iommu, data);

	ret = __iommu_queue_command_sync(iommu, &cmd, false);
	if (ret)
		goto out_unlock;

	ret = wait_on_sem(iommu, data);
---
build_completion_wait()
---
	u64 paddr = iommu_virt_to_phys((void *)iommu->cmd_sem);

	memset(cmd, 0, sizeof(*cmd));
	cmd->data[0] = lower_32_bits(paddr) | CMD_COMPL_WAIT_STORE_MASK;
	cmd->data[1] = upper_32_bits(paddr);
	cmd->data[2] = data;
	CMD_SET_TYPE(cmd, CMD_COMPL_WAIT);
---
wait_on_sem()
---
	while (*iommu->cmd_sem != data && i < LOOP_TIMEOUT) {
		udelay(1);
		i += 1;
	}

	if (i == LOOP_TIMEOUT) {
		pr_alert("Completion-Wait loop timed out\n");
		return -EIO;
	}
---
This log message therefore means that a COMPLETION_WAIT command did not complete within the polling window.
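The polling window itself is easy to quantify. Below is a minimal user-space sketch of the wait_on_sem() pattern above, assuming LOOP_TIMEOUT is 100000 as in the 5.10 driver (100000 iterations of a ~1 us delay, i.e. roughly 100 ms). The volatile variable stands in for iommu->cmd_sem, which only the IOMMU hardware would update; since nothing updates it here, the loop always burns the full window and times out, just like an IOMMU that never answers the COMPLETION_WAIT.
---
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define LOOP_TIMEOUT 100000		/* assumed to match the 5.10 driver */

static int wait_on_sem(volatile uint64_t *sem, uint64_t data)
{
	int i = 0;

	while (*sem != data && i < LOOP_TIMEOUT) {
		usleep(1);		/* stands in for udelay(1) */
		i += 1;
	}

	if (i == LOOP_TIMEOUT) {
		fprintf(stderr, "Completion-Wait loop timed out\n");
		return -1;		/* -EIO in the kernel */
	}
	return 0;
}

int main(void)
{
	volatile uint64_t sem = 0;

	/* Nothing ever stores 1 to sem, so this spends at least ~100 ms
	 * polling and then reports the timeout. */
	return wait_on_sem(&sem, 1) ? 1 : 0;
}
---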
So why did it never complete? I did not find a definitive answer, but the following patch offers a hint:
iommu/amd: flush IOTLB for specific domains only (v2) - Patchwork
That patch addresses "AMD-Vi: Completion-Wait loop timed out" by reducing the number of domain_flush_tlb() calls; this suggests that issuing too many TLB-flush-type operations can cause COMPLETION_WAIT commands to time out.
Linux Kernel Watchdog
The Linux kernel has two watchdogs, used to detect soft lockups and hard lockups respectively:
- soft lockup: detects the case where this CPU can no longer schedule tasks (but still services interrupts);
- hard lockup: detects the case where this CPU can no longer even respond to interrupts.
For the soft lockup watchdog:
- the dog is fed by the highest-priority task on that CPU, as follows:
5.10.128 watchdog_enable()
---
	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
	hrtimer->function = watchdog_timer_fn;
	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
		      HRTIMER_MODE_REL_PINNED_HARD);
---
watchdog_timer_fn()
  -> stop_one_cpu_nowait(smp_processor_id(), softlockup_fn, NULL,
			 this_cpu_ptr(&softlockup_stop_work));

softlockup_fn()
  -> update_touch_ts()
     -> __this_cpu_write(watchdog_touch_ts, get_timestamp());
A timer periodically hands softlockup_fn() to the stop machine for execution. Note in particular that stop_one_cpu_nowait() hands the callback to the highest-priority scheduling class on that CPU, i.e. stop_sched_class. See:
5.10.128
const struct sched_class stop_sched_class
	__section("__stop_sched_class") = {
	.enqueue_task		= enqueue_task_stop,
	.dequeue_task		= dequeue_task_stop,
	...
};

#define SCHED_DATA				\
	STRUCT_ALIGN();				\
	__begin_sched_classes = .;		\
	*(__idle_sched_class)			\
	*(__fair_sched_class)			\
	*(__rt_sched_class)			\
	*(__dl_sched_class)			\
	*(__stop_sched_class)			\
	__end_sched_classes = .;
If even the highest-priority task on the CPU cannot be scheduled to feed the dog, the CPU can be considered paralyzed, unable to schedule anything, and the watchdog barks.
The barking itself is also driven by the timer; note that this is an hrtimer, which runs in interrupt context:
watchdog_timer_fn()
  -> is_softlockup()
     ---
     if (time_after(now, touch_ts + get_softlockup_thresh()))
     	return now - touch_ts;
     ---
  -> pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
	      smp_processor_id(), duration,
	      current->comm, task_pid_nr(current));
So far we know the soft lockup detector is driven by an hrtimer. What if interrupts are disabled? That case is handled by the hard lockup detector, which is based on NMIs:
hardlockup_detector_event_create()
---
	/* Try to register using hardware perf events */
	evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
					       watchdog_overflow_callback, NULL);
---
watchdog_overflow_callback()
  -> is_hardlockup()
     ---
     unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

     if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
     	return true;

     __this_cpu_write(hrtimer_interrupts_saved, hrint);
     return false;
     ---

watchdog_timer_fn()
  -> watchdog_interrupt_count()
The hard lockup detector is driven by the NMI of a perf event; it is fed by the soft lockup hrtimer, which runs in interrupt context.
The hard lockup timeout is watchdog_thresh (/proc/sys/kernel/watchdog_thresh), typically 10 seconds; the soft lockup threshold is twice the hard lockup one.
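As a hedged sketch of that arithmetic (constants taken from kernel/watchdog.c in 5.10: watchdog_thresh defaults to 10, the soft lockup threshold is twice that, and the detection hrtimer fires every softlockup_thresh/5 seconds), the snippet below also shows why the report says "stuck for 22s" even though the threshold is 20 s: the hrtimer only samples every few seconds, so the first sample past the threshold lands a bit beyond it.
---
#include <stdio.h>

int main(void)
{
	int watchdog_thresh   = 10;			/* /proc/sys/kernel/watchdog_thresh, default 10 */

	int hardlockup_thresh = watchdog_thresh;	/* NMI watchdog window, ~10 s        */
	int softlockup_thresh = watchdog_thresh * 2;	/* get_softlockup_thresh(), 20 s     */
	int sample_period_s   = softlockup_thresh / 5;	/* watchdog hrtimer period, 4 s      */

	printf("hard lockup reported after ~%d s, soft lockup after ~%d s\n",
	       hardlockup_thresh, softlockup_thresh);
	printf("the detection hrtimer samples every %d s, so the first report lands a few seconds past %d s\n",
	       sample_period_s, softlockup_thresh);
	return 0;
}
---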
fq_flush_timeout()
When the kernel switches address spaces during a task switch, the TLB is flushed, as follows:
context_switch()
  -> switch_mm_irqs_off()
     -> choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
     -> load_new_mm_cr3(next->pgd, new_asid, true);
        ---
        if (need_flush) {
        	invalidate_user_asid(new_asid);
        	new_mm_cr3 = build_cr3(pgdir, new_asid);
        }
        ---
After an address-space switch, the same virtual address most likely maps to a different physical address, so the relevant TLB entries must all be flushed.
The IOMMU has an analogous mechanism, and fq_flush_timeout() plays exactly that role.
In this case, every soft lockup call stack lands in that function, so let us take a quick look at how it works.
The IOMMU flush queue works as follows:
- queue_iova() puts the address to be unmapped into the flush queue; the corresponding queue entry records a sequence number;
- it also arms a timer (an ordinary timer whose callback runs in softirq context) with fq_flush_timeout() as the callback and a 10 ms timeout;
- when the timer expires, fq_flush_timeout() calls iova_domain_flush() to flush all IOMMU TLB entries and bumps the sequence number;
- it then frees all flush queue entries whose sequence number is covered by the completed flush.
See the following code:
queue_iova()
---
	spin_lock_irqsave(&fq->lock, flags);

	fq_ring_free(iovad, fq);

	if (fq_full(fq)) {
		iova_domain_flush(iovad);
		fq_ring_free(iovad, fq);
	}

	idx = fq_ring_add(fq);

	fq->entries[idx].iova_pfn = pfn;
	fq->entries[idx].pages    = pages;
	fq->entries[idx].data     = data;
	fq->entries[idx].counter  = atomic64_read(&iovad->fq_flush_start_cnt);

	spin_unlock_irqrestore(&fq->lock, flags);
	-----------------------------------------------------------------------> STEP 1

	if (!atomic_read(&iovad->fq_timer_on) &&
	    !atomic_xchg(&iovad->fq_timer_on, 1))
		mod_timer(&iovad->fq_timer,
			  jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
	-----------------------------------------------------------------------> STEP 2
---
fq_flush_timeout()
---
	atomic_set(&iovad->fq_timer_on, 0);
	iova_domain_flush(iovad);
	-----------------------------------------------------------------------> STEP 3

	for_each_possible_cpu(cpu) {
		unsigned long flags;
		struct iova_fq *fq;

		fq = per_cpu_ptr(iovad->fq, cpu);
		spin_lock_irqsave(&fq->lock, flags);
		fq_ring_free(iovad, fq);
		spin_unlock_irqrestore(&fq->lock, flags);
	}
	-----------------------------------------------------------------------> STEP 4
---
The flush queue is per-CPU; fq_flush_timeout() performs a single iova_domain_flush() for the whole domain and then frees the iovas sitting on every CPU's flush queue in one batch.
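To make the mechanism more tangible, here is a minimal user-space sketch of a single flush queue, loosely modeled on the 5.10 code in drivers/iommu/iova.c (IOVA_FQ_SIZE and the start/finish counters are taken from there; locking, the per-CPU aspect and the timer are omitted, so this is an illustration rather than the real implementation). It demonstrates the slow path that becomes important later: when the ring fills up because fq_flush_timeout() has not run, queue_iova() flushes the whole domain itself.
---
#include <stdio.h>
#include <stdint.h>

#define IOVA_FQ_SIZE 256	/* as in the 5.10 driver */

struct iova_fq_entry {
	unsigned long iova_pfn;
	unsigned long pages;
	uint64_t counter;	/* fq_flush_start_cnt at queueing time */
};

struct iova_fq {
	struct iova_fq_entry entries[IOVA_FQ_SIZE];
	unsigned int head, tail;
};

static uint64_t fq_flush_start_cnt, fq_flush_finish_cnt;
static unsigned long flushes_from_queue_iova;

static int fq_full(struct iova_fq *fq)
{
	return ((fq->tail + 1) % IOVA_FQ_SIZE) == fq->head;
}

static void iova_domain_flush(void)
{
	fq_flush_start_cnt++;
	/* the real code flushes the whole IOMMU TLB here, which needs a COMPLETION_WAIT */
	fq_flush_finish_cnt++;
}

static void fq_ring_free(struct iova_fq *fq)
{
	/* free only entries queued before the last completed flush */
	while (fq->head != fq->tail &&
	       fq->entries[fq->head].counter < fq_flush_finish_cnt)
		fq->head = (fq->head + 1) % IOVA_FQ_SIZE;	/* free_iova_fast() in the kernel */
}

static void queue_iova(struct iova_fq *fq, unsigned long pfn, unsigned long pages)
{
	fq_ring_free(fq);

	if (fq_full(fq)) {		/* slow path: this CPU flushes on its own */
		iova_domain_flush();
		flushes_from_queue_iova++;
		fq_ring_free(fq);
	}

	fq->entries[fq->tail] = (struct iova_fq_entry){ pfn, pages, fq_flush_start_cnt };
	fq->tail = (fq->tail + 1) % IOVA_FQ_SIZE;
	/* the real code also arms the 10 ms fq_flush_timeout() timer here */
}

int main(void)
{
	static struct iova_fq fq;

	/* unmap 1000 pages without fq_flush_timeout() ever getting to run */
	for (unsigned long pfn = 0; pfn < 1000; pfn++)
		queue_iova(&fq, pfn, 1);

	printf("flushes triggered directly from queue_iova(): %lu\n",
	       flushes_from_queue_iova);
	return 0;
}
---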
Call stack analysis
With the background in place, let us look at the soft lockup call stacks seen in this case (all trimmed for brevity):
[Call stack A]
kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
kernel: CPU: 46 PID: 0 Comm: swapper/46 Not tainted 5.10.128 #2
kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
kernel: fq_flush_timeout+0x79/0xb0

[Call stack B]
kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
kernel: CPU: 46 PID: 0 Comm: swapper/46 Tainted: G L 5.10.128 #2
kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
kernel: amd_iommu_flush_iotlb_all+0x4e/0x60
kernel: iommu_dma_flush_iotlb_all+0x1d/0x20
kernel: iova_domain_flush+0x1e/0x30
kernel: fq_flush_timeout+0x39/0xb0
These two call stacks kept recurring for about two hours, always on CPU 46. See the code:
fq_flush_timeout()
---
	atomic_set(&iovad->fq_timer_on, 0);
	iova_domain_flush(iovad);
	  -> iommu_dma_flush_iotlb_all()
	     -> amd_iommu_flush_iotlb_all()
	        ---
	        spin_lock_irqsave(&dom->lock, flags);
	        domain_flush_tlb_pde(dom);
	        domain_flush_complete(dom);
	        spin_unlock_irqrestore(&dom->lock, flags); -----> [Call stack B]
	        ---

	for_each_possible_cpu(cpu) {
		unsigned long flags;
		struct iova_fq *fq;

		fq = per_cpu_ptr(iovad->fq, cpu);
		spin_lock_irqsave(&fq->lock, flags);
		fq_ring_free(iovad, fq);
		spin_unlock_irqrestore(&fq->lock, flags); ------> [Call stack A]
	}
---
From this we can draw the following conclusions:
- call stacks A and B appear alternately and repeatedly, which shows fq_flush_timeout() does return and is then entered again; it is not stuck inside a single invocation;
- the RIP of the soft lockup stacks sits in _raw_spin_unlock_irqrestore() because that is where interrupts are re-enabled, and only then can the soft lockup hrtimer fire;
- in call stack B the soft lockup is reported inside a spin_lock_irqsave() section, yet no hard lockup is reported; this means the ~20 seconds were not spent inside a single iova_domain_flush() with interrupts disabled.
Putting these together, we can further infer that the soft lockup occurs because the CPU keeps re-entering fq_flush_timeout(). Why would that happen?
First, look at the softirq handler:
__do_softirq()
---
	pending = local_softirq_pending();

	if (pending) {
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;

		wakeup_softirqd();
	}
---
This loop both checks need_resched() and is bounded by max_restart, so the problem is not here.
Next, look at the timer expiry handler:
__run_timers()
---
	while (time_after_eq(jiffies, base->clk) &&
	       time_after_eq(jiffies, base->next_expiry)) {
		levels = collect_expired_timers(base, heads);
		base->clk++;
		base->next_expiry = __next_timer_interrupt(base);

		while (levels--)
			expire_timers(base, heads + levels);
	}
---
The timer callback here will be invoked over and over as long as the following conditions hold:
- the timer is re-armed and expires again before timer->fn() has finished; this condition is satisfied here, because:
  - the timeout of fq_flush_timeout() is 10 ms,
  - each "AMD-Vi: Completion-Wait loop timed out" message means one iommu_completion_wait() took at least 100 ms,
  - once fq_flush_timeout() has started (and cleared fq_timer_on), queue_iova() can arm the timer again;
- the timer must keep being enqueued on the same CPU, i.e. the same timer_base. Is that the case? See the following code:
__mod_timer()
---
	base = lock_timer_base(timer, &flags);
	...
	new_base = get_target_base(base, timer->flags);

	if (base != new_base) {
		/*
		 * We are trying to schedule the timer on the new base.
		 * However we can't change timer's base while it is running,
		 * otherwise del_timer_sync() can't detect that the timer's
		 * handler yet has not finished. This also guarantees that the
		 * timer is serialized wrt itself.
		 */
		if (likely(base->running_timer != timer)) {
			/* See the comment in lock_timer_base() */
			timer->flags |= TIMER_MIGRATING;

			raw_spin_unlock(&base->lock);
			base = new_base;
			raw_spin_lock(&base->lock);
			WRITE_ONCE(timer->flags,
				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
			forward_timer_base(base);
		}
	}
---
If the timer being armed is the one currently running (base->running_timer == timer), it is not migrated and gets enqueued onto the same timer_base, i.e. the same CPU.
So the soft lockup happens because fq_flush_timeout() runs longer than its own timeout, and every time it runs it is re-armed onto the same CPU. A minimal simulation of this pattern follows.
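In the simulation below, the 10 ms period is IOVA_FQ_TIMEOUT and the 100 ms handler time is the lower bound implied by a single Completion-Wait timeout; everything else is made up for illustration. Because the callback always finishes after the next expiry, __run_timers() finds the timer expired again the moment the callback returns, and the CPU stays in timer softirq processing indefinitely.
---
#include <stdio.h>

int main(void)
{
	long now_ms     = 0;
	long period_ms  = 10;	/* IOVA_FQ_TIMEOUT */
	long handler_ms = 100;	/* >= one Completion-Wait loop timeout */
	long next_expiry;

	/* queue_iova() arms the timer for the first time */
	next_expiry = now_ms + period_ms;

	for (int round = 0; round < 5; round++) {
		/* __run_timers(): the timer has expired, run fq_flush_timeout() */
		now_ms = (now_ms > next_expiry) ? now_ms : next_expiry;
		printf("t=%4ld ms: fq_flush_timeout() starts\n", now_ms);

		/* fq_flush_timeout() clears fq_timer_on, so queue_iova() on some
		 * CPU immediately re-arms the timer ... */
		next_expiry = now_ms + period_ms;

		/* ... while the callback itself is stuck behind slow
		 * Completion-Wait commands for ~100 ms */
		now_ms += handler_ms;
		printf("t=%4ld ms: callback returns, timer already expired %ld ms ago\n",
		       now_ms, now_ms - next_expiry);
	}
	return 0;
}
---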
Where the abnormal time goes
Clearly, the excessive runtime of fq_flush_timeout() is related to the repeatedly printed "AMD-Vi: Completion-Wait loop timed out": each occurrence means fq_flush_timeout() was blocked for at least 100 ms, easily satisfying the condition above. So why does the timeout appear in the first place?
As discussed in the section "Where the timed-out log comes from", it is probably caused by issuing iova_domain_flush() (or similar flush operations) too often.
Looking back at queue_iova(), it contains the following fragment:
	if (fq_full(fq)) {
		iova_domain_flush(iovad);
		fq_ring_free(iovad, fq);
	}
If fq_flush_timeout() does not run in time, the per-CPU flush queues can fill up, and each CPU then calls iova_domain_flush() on its own; that alone can cause Completion-Wait timeouts, which in turn slow fq_flush_timeout() down even further, forming a vicious cycle. In addition, queue_iova() sits on the DMA unmap path, which is usually the I/O completion path:
dma_unmap_sg()
  -> dma_unmap_sg_attrs()
     ---
     if (dma_map_direct(dev, ops))
     	dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
     else if (ops->unmap_sg)
     	ops->unmap_sg(dev, sg, nents, dir, attrs);
     ---

iommu_dma_unmap_sg()
  -> __iommu_dma_unmap()
     -> iommu_dma_free_iova()
        -> queue_iova()
This also degrades I/O performance, and the system's sar data does show it:
12:00:01 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
09:50:01 AM dev259-0 26.80 12.68 570.67 21.77 0.00 0.10 0.29 0.7
09:51:01 AM dev259-0 49.92 12.93 1325.57 26.81 0.00 0.08 0.28 1.41
09:52:01 AM dev259-0 51.58 25.44 1319.07 26.07 0.00 0.08 0.27 1.39
09:53:01 AM dev259-0 26.80 12.74 590.68 22.52 0.00 0.11 0.30 0.79
09:54:01 AM dev259-0 27.26 12.89 588.92 22.08 0.00 0.11 0.29 0.80
09:55:01 AM dev259-0 23.45 12.73 548.27 23.93 0.00 0.10 0.33 0.78
09:56:01 AM dev259-0 25.63 12.89 575.09 22.94 0.00 0.10 0.31 0.79
09:57:01 AM dev259-0 27.56 25.37 596.40 22.56 0.00 0.11 0.29 0.81
09:58:01 AM dev259-0 24.81 12.96 587.04 24.18 0.00 0.10 0.32 0.80
09:59:01 AM dev259-0 25.07 12.70 564.55 23.03 0.00 0.10 0.31 0.78
10:00:01 AM dev259-0 25.66 12.89 566.23 22.57 0.00 0.10 0.31 0.79
10:01:01 AM dev259-0 55.91 12.70 2219.87 39.93 0.01 0.21 0.29 1.65
10:02:01 AM dev259-0 53.30 12.79 1255.96 23.80 0.01 0.17 0.33 1.76
10:03:01 AM dev259-0 23.84 0.00 401.68 16.85 0.01 0.53 0.37 0.89
10:04:01 AM dev259-0 23.05 0.00 375.07 16.28 0.04 1.62 1.01 2.33
10:05:01 AM dev259-0 22.49 0.00 360.32 16.02 0.02 0.79 0.58 1.30
10:06:01 AM dev259-0 24.27 0.00 402.58 16.59 0.02 0.90 0.65 1.57
10:07:02 AM dev259-0 23.85 0.00 398.34 16.70 0.01 0.23 0.41 0.98
10:08:01 AM dev259-0 23.58 0.00 383.09 16.25 0.01 0.27 0.41 0.98
10:09:01 AM dev259-0 21.90 0.00 358.85 16.39 0.01 0.30 0.56 1.23
10:10:01 AM dev259-0 22.21 0.27 354.27 15.96 0.02 0.96 0.43 0.95
10:11:01 AM dev259-0 47.02 1.20 1091.48 23.24 0.02 0.33 0.32 1.52
10:12:02 AM dev259-0 49.38 0.00 1128.20 22.85 0.22 4.38 0.38 1.89
10:13:01 AM dev259-0 23.64 0.00 399.66 16.90 0.00 0.08 0.35 0.82
10:14:01 AM dev259-0 23.70 0.00 374.11 15.79 0.00 0.16 0.35 0.84
10:15:01 AM dev259-0 22.53 0.00 363.78 16.14 0.01 0.53 0.43 0.96
So what is the original culprit behind the slowdown of fq_flush_timeout()?
It may be the situation described in this commit:
nvme-pci: clamp max_hw_sectors based on DMA optimized limitation - kernel/git/torvalds/linux.git - Linux kernel source tree
Lock contention on the iova allocator degrades the efficiency of fq_flush_timeout().
Appendix
iommu=pt
The relevant handling code is:
iommu_setup() // arch/x86/kernel/pci-dma.c
---
	if (!strncmp(p, "pt", 2))
		iommu_set_default_passthrough(true);
		---
		iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
		---
---
iommu_probe_device()
  -> iommu_alloc_default_domain()
     -> iommu_get_def_domain_type()
  -> __iommu_attach_device(group->default_domain, dev);
  -> amd_iommu_probe_finalize()
     ---
     domain = iommu_get_domain_for_dev(dev);
     if (domain->type == IOMMU_DOMAIN_DMA)
     	iommu_setup_dma_ops(dev, IOVA_START_PFN << PAGE_SHIFT, 0);
     ---
With iommu=pt, the default domain is an identity (passthrough) domain, so the IOMMU DMA ops are never installed and the device's dma_ops stay NULL. In that case:
dma_unmap_sg_attrs()
---
	if (dma_map_direct(dev, ops))
		dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
	else if (ops->unmap_sg)
		ops->unmap_sg(dev, sg, nents, dir, attrs);
---
Here dma_map_direct() is true and the direct-mapping path is taken, so queue_iova() and fq_flush_timeout() are never involved.
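As a rough sketch of why, assuming the 5.10 behaviour that dma_map_direct() essentially asks "does this device have per-device dma_map_ops installed?": with iommu=pt the default domain is IDENTITY, iommu_setup_dma_ops() never runs, dev->dma_ops stays NULL, and every unmap goes through the direct path instead of the IOMMU DMA ops. The structures below are simplified stand-ins, not the kernel's definitions.
---
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* simplified stand-ins for the kernel structures */
struct dma_map_ops { int dummy; };
struct device { const struct dma_map_ops *dma_ops; };

/* roughly what dma_map_direct()/dma_go_direct() decide in 5.10 */
static bool dma_map_direct(const struct device *dev)
{
	return dev->dma_ops == NULL;
}

int main(void)
{
	const struct dma_map_ops iommu_dma_ops = { 0 };

	struct device dev_pt  = { .dma_ops = NULL };		/* iommu=pt: no DMA ops installed  */
	struct device dev_dma = { .dma_ops = &iommu_dma_ops };	/* translated (DMA) default domain */

	printf("iommu=pt   -> direct unmap, no queue_iova(): %d\n", dma_map_direct(&dev_pt));
	printf("DMA domain -> iommu_dma_unmap_sg()/queue_iova(): %d\n", !dma_map_direct(&dev_dma));
	return 0;
}
---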
Call stacks
kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
kernel: CPU: 46 PID: 0 Comm: swapper/46 Not tainted 5.10.128 #2
kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
kernel: Call Trace:
kernel: <IRQ>
kernel: fq_flush_timeout+0x79/0xb0
kernel: ? fq_ring_free+0x110/0x110
kernel: call_timer_fn+0x2e/0x100
kernel: __run_timers.part.0+0x1de/0x260
kernel: ? clockevents_program_event+0x8f/0xe0
kernel: ? tick_program_event+0x41/0x80
kernel: run_timer_softirq+0x2a/0x50
kernel: __do_softirq+0xce/0x281
kernel: asm_call_irq_on_stack+0x12/0x20

kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/46:0]
kernel: CPU: 46 PID: 0 Comm: swapper/46 Tainted: G L 5.10.128 #2
kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
kernel: Call Trace:
kernel: <IRQ>
kernel: amd_iommu_flush_iotlb_all+0x4e/0x60
kernel: iommu_dma_flush_iotlb_all+0x1d/0x20
kernel: iova_domain_flush+0x1e/0x30
kernel: fq_flush_timeout+0x39/0xb0
kernel: ? fq_ring_free+0x110/0x110
kernel: call_timer_fn+0x2e/0x100
kernel: __run_timers.part.0+0x1de/0x260
kernel: ? clockevents_program_event+0x8f/0xe0
kernel: ? tick_program_event+0x41/0x80
kernel: run_timer_softirq+0x2a/0x50
kernel: __do_softirq+0xce/0x281
kernel: asm_call_irq_on_stack+0x12/0x20
AMD I/O Virtualization Technology (IOMMU) Specification
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf