LoViT: Long Video Transformer for Surgical Phase Recognition | Literature Digest: Generative Models and Transformers in Medical Imaging

Title

LoViT: Long Video Transformer for surgical phase recognition

Introduction

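The rapidly developing field of surgical data science (SDS) aims to transform interventional healthcare through the advanced use of data captured by medical devices in the operating room (OR) (Maier-Hein et al., 2022). At the core of SDS is the classification and understanding of surgical workflows, which matters not only for understanding the procedure itself but also for assessing surgical skill and providing context-sensitive intraoperative support (Vercauteren et al., 2020). Automated recognition of surgical phases and actions is key to this field: by giving the surgical team real-time feedback, it improves surgical proficiency, safety, and overall operational efficiency. These advances lay the groundwork for the continued evolution of surgical techniques and training methods.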

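In endoscopic surgery, surgical phase recognition segments the video frames into distinct operative phases, providing a high-level overview of the procedure (Garrow et al., 2021). This classification focuses on the macro-level stages of the surgery. Action recognition goes a step further and identifies the specific tasks and actions within individual frames.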

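Early work on automated surgical phase recognition (Blum et al., 2010; Bardram et al., 2011; Dergachyova et al., 2016; Jin et al., 2018; Gao et al., 2021; Quellec et al., 2015; Holden et al., 2014) relied mainly on statistical models and additional data such as annotations or tool-related information. Although these methods proved capable in analyzing complex surgical videos, their representational capacity was limited by predefined dependencies (Jin et al., 2018; Gao et al., 2021). The rise of deep learning brought a shift toward purely video-based methods. Multi-task learning strategies require annotations for both tools and phases, which increases the annotation burden (Twinanda et al., 2017; Twinanda, 2017; Jin et al., 2020). The current trend therefore favors single-task learning, which is the direction our study explores further.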

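Building an end-to-end model that can efficiently process long surgical videos is a major challenge. Conventional methods typically first develop a spatial feature extractor and then feed its output into a temporal feature extractor. However, existing strategies usually train the spatial features with frame-level phase labels (Czempiel et al., 2020; Gao et al., 2020; Jin et al., 2021). This is prone to ambiguity when different phases contain visually similar scenes (see Fig. 1), which is a significant obstacle to training the spatial feature extractor effectively and highlights the need for a finer pairing of input data and supervisory signal.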

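Moreover, as Fig. 1 shows, misrecognizing a key event such as "clipping and cutting" can cause subsequent frames to be misclassified. This points to the need for a method that reinforces the recognition of phase transitions and thereby improves the model's understanding of the entire surgical workflow.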

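For temporal analysis, existing single-task models for surgical phase recognition fall into three main categories: those based on recurrent neural networks (RNNs), on convolutional neural networks (CNNs) (LeCun et al., 1998), and on Transformers (Vaswani et al., 2017). RNNs, including long short-term memory networks (LSTM; Hochreiter and Schmidhuber, 1997), struggle to capture long-term temporal dependencies, while CNN-based methods such as temporal convolutional networks (TCNs; Lea et al., 2016; Farha and Gall, 2019) apply fixed filter sizes and may fail to capture long temporal patterns effectively.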

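To address these challenges, we propose the Long Video Transformer (LoViT), which achieves state-of-the-art performance on long surgical videos. Our contributions are as follows:

We introduce a temporally-rich spatial feature extractor that goes beyond the conventional spatial recognition paradigm. By incorporating temporal awareness into the feature extraction stage, it markedly improves the model's ability to interpret the complex temporal progression of a surgical procedure.

We design a phase transition-aware supervision mechanism that emphasizes the key transition moments in surgery. This strategy gives the model a better understanding of the inherent flow of surgical operations.

Finally, we incorporate multiscale temporal feature aggregation. Although it is not the core of our contribution, it is a key enhancement to the model: combining local and global temporal information makes the model more robust and supports our main innovations.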

Abstract

Online surgical phase recognition plays a significant role towards building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches are limited since they train spatial feature extractors using frame-level supervision that could lead to incorrect predictions due to similar frames appearing at different phases, and poorly fuse local and global features due to computational constraints, which can affect the analysis of long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), emphasizing the development of a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map provides essential insights into the dynamic transitions between different surgical phases. LoViT combines these innovations with a multiscale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then leverages the temporally-rich spatial features and phase transition map to classify surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.1 pp improvement on AutoLaparo. Our results demonstrate the effectiveness of our approach in achieving state-of-the-art performance of surgical phase recognition on two datasets of different surgical procedures and temporal sequencing characteristics.

Method

In this work, we target the problem of online surgical phase recognition. Formally, this is a video classification problem where we aim to solve for a mapping $f$ such that $f_{\boldsymbol{\theta}}(X_t) \approx p_t$, where $X_t = \{\boldsymbol{x}_j\}_{j=1}^{t}$ is a given input video stream and $\boldsymbol{x}_j \in \mathbb{R}^{H \times W \times C}$. The symbols $H$, $W$, and $C$ represent the image height, width, and number of channels, respectively. Since we deal with RGB images, $C = 3$. The height and width of each video frame change from dataset to dataset. The first frame of the video is denoted $\boldsymbol{x}_1$, and the current $t$th frame $\boldsymbol{x}_t$. The output $p_t \in \{k\}_{k=1}^{K}$ is the class index corresponding to the surgical phase of the video frame $\boldsymbol{x}_t$, where $K$ is the total number of classes or surgical phases. The symbol $\boldsymbol{\theta}$ is a vector of parameters corresponding to the weights of our network model $f$, which we call LoViT throughout the paper.
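The online constraint in this formulation is that the prediction for frame t may depend only on frames 1 through t. The following Python sketch illustrates that loop under stated assumptions: `online_phase_recognition` and `toy_model` are hypothetical names introduced here for illustration, and the toy model is a stand-in for the network, not LoViT itself.

```python
from typing import Callable, List, Sequence


def online_phase_recognition(
    frames: Sequence[object],
    model: Callable[[Sequence[object]], int],
) -> List[int]:
    """Return one phase index per frame, each predicted from frames 1..t only."""
    predictions = []
    for t in range(1, len(frames) + 1):
        prefix = frames[:t]  # X_t = {x_1, ..., x_t}; future frames are never passed
        predictions.append(model(prefix))
    return predictions


K = 3  # number of phases in this toy example (assumption)


def toy_model(prefix: Sequence[object]) -> int:
    """Hypothetical stand-in for f_theta: advance one phase every 3 frames.

    Returns a 1-based class index in {1, ..., K}, matching p_t in the text.
    """
    return min((len(prefix) - 1) // 3 + 1, K)


phases = online_phase_recognition(list(range(8)), toy_model)
print(phases)
```

An offline method could instead look at the whole video before labelling any frame; the prefix slice is what distinguishes the online setting.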

Conclusion

We propose a new surgical phase recognition method named LoViT, which first adopts video-clip level supervision to train a temporally-rich spatial feature extractor and then combines local fine-grained and global information via a multiscale temporal feature aggregator supported by phase transition maps. Compared to previous methods, our Transformer-based LoViT allows for efficient and robust phase recognition of long videos without losing local or global information. Moreover, LoViT is the first to demonstrate that phase transition maps are useful for identifying the relationships between phases. The proposed LoViT achieves state-of-the-art performance with improvements over existing methods.

Results

5.1. Comparison with state-of-the-art methods

To assess the effectiveness of our proposed method, we carried out a comparative study, contrasting our LoViT model against contemporary state-of-the-art techniques from the domains of action anticipation and surgical phase recognition. This study used two distinct datasets: Cholec80 (Twinanda et al., 2017) and AutoLaparo (Wang et al., 2022).

In the upper part of Table 1, we present a quantitative comparison on the Cholec80 dataset. It should be noted that we re-implemented Trans-SVNet using the weights made available by the authors of the original study. As for AVT, our implementation was based on the last 30 frames, according to the code released along with their publication. The reported results of other benchmark methods were cited directly from their respective publications. Methods such as OperA (Czempiel et al., 2021) were not included in our comparison due to discrepancies in dataset splits and the absence of publicly accessible code. Additionally, we excluded SAHC (Ding and Li, 2022) on account of an evaluative oversight: they used a frame rate of 25 fps instead of the 5 fps used for their ground truth.

Table 1 reveals that our LoViT model surpasses other methods in the majority of the evaluated metrics, with the sole exception being precision on the Cholec80 dataset. Specifically, LoViT attained an accuracy that exceeds the benchmark set by Trans-SVNet by a margin of 2.4 pp. Moreover, our model outperforms AVT, the leading model for action anticipation, by 4.8 pp in accuracy. It also exhibits more consistent performance, as evidenced by a standard deviation in accuracy that is roughly 1 pp lower than that of Trans-SVNet. Furthermore, LoViT showed better results when comparing our temporal module to a re-implementation of the temporal module proposed for long-term action recognition in TeSTra (Zhao and Krähenbühl, 2022).

Beyond standard metrics, LoViT also proved to be more effective when evaluated against relaxed metrics.
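The gains above are reported in percentage points on video-level accuracy, together with its standard deviation. As a rough illustration of how such numbers can be computed, the Python sketch below averages per-video frame accuracy across videos. The function names are hypothetical, the data are toy values rather than results from the paper, and the use of the population standard deviation is an assumption, since the text does not say which variant is reported.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def video_level_accuracy(
    predictions: Sequence[Sequence[int]],
    labels: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Return (mean, std) in percent of per-video frame accuracy."""
    per_video: List[float] = []
    for pred, gt in zip(predictions, labels):
        if len(pred) != len(gt) or len(gt) == 0:
            raise ValueError("each video needs equally many predictions and labels")
        correct = sum(p == g for p, g in zip(pred, gt))
        per_video.append(100.0 * correct / len(gt))
    # Assumption: population standard deviation across videos.
    return mean(per_video), pstdev(per_video)


def percentage_point_gain(accuracy_a: float, accuracy_b: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return accuracy_a - accuracy_b


# Toy data: two short "videos" with 1-based phase labels.
toy_predictions = [[1, 1, 2, 2], [1, 2, 2, 3, 3]]
toy_labels = [[1, 1, 2, 3], [1, 2, 2, 3, 3]]

acc_mean, acc_std = video_level_accuracy(toy_predictions, toy_labels)
print(acc_mean, acc_std)
print(percentage_point_gain(90.0, acc_mean))
```

A percentage-point gain is an absolute difference between two accuracies, not a relative improvement, so a 2.4 pp gain means the accuracy value itself is 2.4 higher.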

Figure

Fig. 1. Example of similar frames (first and third) corresponding to different phases in the Cholec80 dataset (Twinanda et al., 2017).

Fig. 2. The proposed LoViT framework for surgical video phase recognition. The temporally-rich spatial feature extractor produces temporally-rich spatial features $e$ from each video frame $x$. Two cascaded L-Trans modules ($L_s$-Trans and $L_l$-Trans) output local temporal features $s$ and $l$ with inputs of different local window sizes ($\lambda_1$ and $\lambda_2$). G-Informer captures the global relationships to generate the temporal feature $g$. A fusion head combines the multi-scale features $s$, $l$, and $g$, followed by two linear layers that learn a phase transition map $\hat{h}_t$ and a phase label $\hat{p}_t$ of the current $t$th video frame $x_t$. Modules with the same color share the same weights. During training, the spatial feature extractor is trained separately and its weights are then frozen to train the other temporal modules of LoViT.
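The caption for Fig. 2 describes two local branches that see windows of different sizes and a global branch that sees the whole history. A minimal Python sketch of how such inputs could be sliced from the per-frame feature sequence follows. The function name `multiscale_inputs` is hypothetical, and so is the handling of early frames (the window is simply shorter when fewer than the requested number of features exist); the paper may pad or treat this case differently.

```python
from typing import List, Sequence, Tuple, TypeVar

F = TypeVar("F")


def multiscale_inputs(
    features: Sequence[F],
    lambda_1: int,
    lambda_2: int,
) -> Tuple[List[F], List[F], List[F]]:
    """Split the features e_1..e_t into two local windows and one global view.

    features: per-frame features up to and including the current frame t.
    lambda_1, lambda_2: local window sizes for the two L-Trans branches.
    """
    if len(features) == 0:
        raise ValueError("at least one frame feature is required")
    if lambda_1 < 1 or lambda_2 < 1:
        raise ValueError("window sizes must be positive")
    window_1 = list(features[-lambda_1:])  # most recent lambda_1 features
    window_2 = list(features[-lambda_2:])  # most recent lambda_2 features
    global_view = list(features)           # full history for the global branch
    return window_1, window_2, global_view


# Toy usage: integers stand in for feature vectors.
w1, w2, g = multiscale_inputs(list(range(10)), lambda_1=3, lambda_2=6)
print(w1, w2, len(g))
```

All three views end at the current frame, so none of them uses future information.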

Fig. 3. The architecture for training the temporally-rich spatial feature extractor. During training on the $t$th frame, a video stream $X_t = \{x_j\}_{j=1}^{t}$ is sampled at evenly spaced intervals $w_t$ from the start of the current phase to the current frame, producing $X_t' \subseteq X_t$. Each frame $x \in X_t'$ is embedded using the spatial feature extractor, then grouped into a feature sequence (blue dashed box). A temporal aggregator follows to add temporal information for recognition. The predicted phase $\hat{p}_t$ is compared to the corresponding ground truth phase $p_t$ to compute a cross-entropy loss. After the training stage, the temporal aggregator is discarded and only the spatial feature extractor is retained for spatial feature extraction.
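The sampling step in Fig. 3 can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: `sample_clip_indices` is a hypothetical name, the interval is passed in as a plain parameter (the caption writes it as $w_t$, so it may vary with the frame), and the sketch walks backward from the current frame so that the current frame is always part of the clip.

```python
from typing import List


def sample_clip_indices(phase_start: int, t: int, interval: int) -> List[int]:
    """Pick evenly spaced frame indices between phase_start and t (inclusive).

    phase_start: index of the first frame of the current phase.
    t: index of the current frame.
    interval: spacing between sampled frames (w_t in the caption).
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if not 0 <= phase_start <= t:
        raise ValueError("phase_start must lie between 0 and t")
    # Walk backward from the current frame so that x_t is always included
    # (an assumption of this sketch), then restore chronological order.
    indices = list(range(t, phase_start - 1, -interval))
    indices.reverse()
    return indices


print(sample_clip_indices(phase_start=10, t=25, interval=5))
print(sample_clip_indices(phase_start=12, t=25, interval=5))
```

Because every sampled frame lies inside the current phase, the clip shares a single phase label, which is what makes clip-level supervision less ambiguous than labelling visually similar frames one at a time.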

Fig. 4. L-Trans: The L-Trans adopts two cascaded fusion modules to process two-branch temporal inputs (grey line and black line). Fusion module: It consists of an encoder and a decoder. The encoder comprises $m$ self-attention layers for the grey line input, and the decoder comprises $n$ cascaded layers of self-attention with cross-attention for processing the encoder's output and the black line input.

Fig. 5. Example of building a phase transition map. We project the phase transition area onto a phase transition map using a left-right asymmetric Gaussian kernel whose left-side and right-side kernel lengths are $3\sigma_l$ and $3\sigma_r$, respectively. $p_l$ and $p_r$ denote adjacent, different phases.
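The construction in Fig. 5 can be illustrated with a short Python sketch. It is written under several assumptions that the caption does not settle: the transition is centred on the first frame of the new phase, the kernel peaks at 1, values beyond $3\sigma_l$ on the left and $3\sigma_r$ on the right are zero, and overlapping kernels are merged with a maximum. The name `phase_transition_map` is hypothetical.

```python
import math
from typing import List, Sequence


def phase_transition_map(
    phases: Sequence[int],
    sigma_l: float,
    sigma_r: float,
) -> List[float]:
    """Build a per-frame heatmap that peaks at each phase transition."""
    if sigma_l <= 0 or sigma_r <= 0:
        raise ValueError("sigma_l and sigma_r must be positive")
    heat = [0.0] * len(phases)
    for c in range(1, len(phases)):
        if phases[c] == phases[c - 1]:
            continue  # no transition between frame c - 1 and frame c
        # Kernel support: 3 * sigma_l frames to the left, 3 * sigma_r to the right.
        left = max(0, c - math.floor(3 * sigma_l))
        right = min(len(phases) - 1, c + math.floor(3 * sigma_r))
        for t in range(left, right + 1):
            sigma = sigma_l if t < c else sigma_r
            value = math.exp(-((t - c) ** 2) / (2 * sigma ** 2))
            heat[t] = max(heat[t], value)  # merge overlapping kernels (assumption)
    return heat


# Toy usage: ten frames of phase 1 followed by ten frames of phase 2.
toy_phases = [1] * 10 + [2] * 10
heat = phase_transition_map(toy_phases, sigma_l=1.0, sigma_r=2.0)
print([round(v, 3) for v in heat])
```

With a larger right-side sigma, as in this toy call, the map decays more slowly after the transition than it rises before it; which side should be wider is a design choice the caption leaves open.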

Fig. 6. Qualitative comparisons with other methods on the Cholec80 and AutoLaparo datasets. The first row in (a) shows images from the video at the moments indicated by the red arrows, where light red marks examples that both AVT and Trans-SVNet get wrong, and dark red marks examples that only Trans-SVNet gets wrong. The following four rows in (a) and the first three rows in (b) show the phase results recognized by different methods and the corresponding ground truth $GT_p$. The last two rows in both subfigures show the heatmap $\hat{h}$ output by the proposed LoViT and its ground truth $GT_h$.

Fig. 7. Inference time visualization of LoViT for different input video lengths.

Fig. 8. Visualization of the spatial feature distribution of different extractors. Point set: video frames of Video 60 in Cholec80. Different colors: different tool annotations. First column: the spatial feature distribution of the frame-only spatial feature extractor in Trans-SVNet. Second column: the spatial feature distribution of the temporally-rich spatial feature extractor in our LoViT.

Fig. 9. Examples of the spatial feature distribution of similar video frames. Top: three rows of frames that are similar with respect to the tool environment. Bottom: visualization of the spatial feature distribution of the example images using two different extractors.

Table

Table 1. The results (%) of different state-of-the-art methods on both the Cholec80 and AutoLaparo datasets. The best results are marked in bold.

Table 2. The results (%) of different parts of the proposed LoViT on both the Cholec80 and AutoLaparo datasets. The best results are marked in bold.

Table 3. Effects (%) of the temporally-rich spatial feature extractor on the Cholec80 and AutoLaparo datasets. The best results are marked in bold.

Table 4. Effects of adding phase transition-aware supervision on video- and phase-level metrics (%) when evaluated on the Cholec80 and AutoLaparo datasets. Note that '✓' means adding phase transition-aware supervision.