Title
Mammography classification with multi-view deep learning techniques: Investigating graph and transformer-based architectures
Introduction
Mammography is the primary imaging modality for breast cancer screening and one of the most important tools for reducing breast cancer mortality (Broeders et al., 2012; Morra et al., 2015). Given its large reading volume, well-defined diagnostic task, and relatively standardized acquisition process, screening mammography is an ideal candidate for automated or semi-automated reading. Recent studies (Rodríguez-Ruiz et al., 2019; Kyono et al., 2018; Dembrower et al., 2020) suggest that deep learning systems could provide an independent assessment and thus relieve radiologists' burden. Designing deep learning systems for mammography, however, still poses several challenges: cancer prevalence is below 1%, making screening mammography a typical "needle in a haystack" problem that requires very large and rich datasets to achieve high performance (Wu et al., 2020; Schaffter et al., 2020); high-resolution images must be processed (Wu et al., 2020); and information from multiple scales (Shen et al., 2021b; Pinto Pereira et al., 2009) and multiple views (Van Schie et al., 2011; Samulski and Karssemeijer, 2011; Perek et al., 2018; Famouri et al., 2020; Ren et al., 2021) must be integrated.
A complete approach to the automated processing of screening mammograms is the so-called multi-view architecture, which combines the information from the four views typically included in a screening exam to produce an exam-level classification score (e.g., the probability that the exam contains a cancer). Multi-view architectures can perform both ipsi-lateral and contra-lateral analysis:
Ipsi-lateral analysis combines the cranio-caudal (CC) and medio-lateral oblique (MLO) views to cope with high breast density and tissue superposition effects (Sacchetto et al., 2016; Wei et al., 2011; Van Gils et al., 1998; Ren et al., 2021; Samulski and Karssemeijer, 2011).
Contra-lateral analysis integrates information from both breasts, for instance to detect asymmetries that may not be apparent when each view is analyzed in isolation (Rangayyan et al., 2007). One advantage of these architectures is that, in principle, they can be trained from exam-level labels, bypassing the need for expensive pixel-level supervision. In the DREAM challenge, a first attempt at training deep neural networks (DNNs) from weakly supervised image labels showed that DNNs trained with strongly supervised external data significantly outperformed those relying on image labels alone (Schaffter et al., 2020). With advances in deep learning architectures, the performance of breast cancer detection from image-level supervision has improved substantially (Wu et al., 2020; Shen et al., 2021b).
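As a rough illustration of this weakly supervised setup, the following sketch shows a four-view model trained only from an exam-level label. It is a minimal example with hypothetical module names, not the architecture of any specific paper discussed here:

```python
import torch
import torch.nn as nn

class FourViewClassifier(nn.Module):
    """Minimal sketch of an exam-level multi-view model (illustrative only)."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512):
        super().__init__()
        self.backbone = backbone                 # shared feature extractor per image
        self.head = nn.Linear(4 * feat_dim, 2)   # exam-level cancer / no-cancer

    def forward(self, l_cc, l_mlo, r_cc, r_mlo):
        # Each view is encoded independently by the shared backbone ...
        feats = [self.backbone(v) for v in (l_cc, l_mlo, r_cc, r_mlo)]
        # ... then fused, so ipsi-lateral (CC+MLO) and contra-lateral
        # (left vs. right) information can interact in the classifier.
        return self.head(torch.cat(feats, dim=1))

# Weak supervision: the loss uses only the exam-level label, no pixel masks.
# logits = model(l_cc, l_mlo, r_cc, r_mlo)
# loss = nn.CrossEntropyLoss()(logits, exam_label)
```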
While most state-of-the-art mammography solutions are based on convolutional neural networks (CNNs), and residual networks in particular, alternative deep architectures have also appeared in the literature:
Visual Transformers (ViT) have outperformed CNNs in several medical and non-medical tasks (Dosovitskiy et al., 2020; He et al., 2022; Xu et al., 2022; Matsoukas et al., 2022). Compared with CNNs, transformers excel in the following respects:
Optimizing the allocation of computational resources to the relevant regions of the image (not all pixels are equally important).
Optimizing semantic encoding, by connecting spatially distant semantic features through self-attention (Dosovitskiy et al., 2020).
Graph-based architectures, on the other hand, are designed to explicitly mimic radiologists' reading patterns, performing ipsi-lateral and contra-lateral analysis jointly (Ren et al., 2021; Du et al., 2019; Liu et al., 2021b; Zhang et al., 2021; Yang et al., 2021). These architectures exploit the ipsi-lateral views to resolve tissue superposition, and identify potential lesions by jointly analyzing structures that are spatially co-located and visually similar (Wei et al., 2011; Samulski and Karssemeijer, 2011; Ren et al., 2021; Yang et al., 2021).
In this paper, we present a direct comparison of multi-view architectures with different inductive biases. Our contributions are the following: we extend existing transformer-based (van Tulder et al., 2021; Matsoukas et al., 2022) and graph convolutional network (GCN)-based (Liu et al., 2021b) architectures to handle the four mammographic views; we introduce a novel transformer-based architecture that fuses ipsi-lateral and contra-lateral cross-view attention; and we evaluate the architectures not only in terms of performance, but also in terms of how they integrate local and global features. The results show that the architectures are intrinsically complementary, each being sensitive to different features, so that combining multiple architectures enables more effective breast cancer detection, even though transformers outperform convolution-based architectures overall.
The remainder of the paper is organized as follows: Section 2 reviews the main architectures for exam-level mammography analysis; Section 3 analyzes the architectures explored in our experiments; the dataset and experimental setup are described in Sections 4 and 5, respectively; results and discussion are presented in Sections 6 and 7; finally, brief conclusions are drawn in Section 8.
Abstract
The potential and promise of deep learning systems to provide an independent assessment and relieve radiologists' burden in screening mammography have been recognized in several studies. However, the low cancer prevalence, the need to process high-resolution images, and the need to combine information from multiple views and scales still pose technical challenges. Multi-view architectures that combine information from the four mammographic views to produce an exam-level classification score are a promising approach to the automated processing of screening mammography. However, training such architectures from exam-level labels, without relying on pixel-level supervision, requires very large datasets and may result in suboptimal accuracy. Emerging architectures such as Visual Transformers (ViT) and graph-based architectures can potentially integrate ipsi-lateral and contra-lateral breast views better than traditional convolutional neural networks, thanks to their stronger ability to model long-range dependencies. In this paper, we extensively evaluate novel transformer-based and graph-based architectures against state-of-the-art multi-view convolutional neural networks, trained in a weakly-supervised setting on a middle-scale dataset, both in terms of performance and interpretability. Extensive experiments on the CSAW dataset suggest that, while transformer-based architectures outperform the other architectures, different inductive biases lead to complementary strengths and weaknesses, as each architecture is sensitive to different signs and mammographic features. Hence, an ensemble of different architectures should be preferred over a winner-takes-all approach to achieve more accurate and robust results. Overall, the findings highlight the potential of a wide range of multi-view architectures for breast cancer classification, even in datasets of relatively modest size, although the detection of small lesions remains challenging without pixel-wise supervision or ad-hoc networks.
Conclusion
This paper presents a comparative analysis of three different multi-view architectures for breast cancer classification: a 4-view convolutional network, a graph-based architecture, and a transformer-based architecture. Given their fundamentally different inductive biases, these architectures not only achieve different performance, but also tend to focus on different areas of the breast. Even though transformer-based architectures achieve the most promising results among the three options, the results indicate that an ensemble model can improve overall performance by increasing the AUC and reducing the false positive rate (FPR). Heatmaps were used to analyze the regions of the breast that were most relevant for each model's predictions. Depending on the architecture, the selected areas were not always aligned with lesion annotations, but tended to concentrate in high-density regions. Overall, the findings highlight the potential of a wide range of multi-view architectures for breast cancer classification, even in datasets of relatively modest size. Further research is needed to validate these findings on larger-scale datasets, and to enhance the ability of multi-view architectures to integrate local cues to improve the detection of small and ill-defined lesions.
Results
In this section, we analyze the results from two distinct perspectives. The first subsection presents the results obtained using the reference performance metrics, broken down by breast and patient, while the second presents a focused analysis of single views to evaluate the accuracy of the provided predictions.
Figure
Fig. 1. A schematic representation of the NYU model from Wu et al. (2020). The backbone parameters are shared among images of the same view (CC and MLO), as indicated by the different colors. The loss is calculated from the softmax output, and the predictions of the CC and MLO views are averaged at inference time.
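A minimal sketch of this inference-time averaging, with hypothetical model and variable names (the actual NYU implementation differs in its details):

```python
import torch

def exam_score(model_cc, model_mlo, cc_images, mlo_images):
    """Average CC and MLO predictions at inference time (illustrative sketch).

    model_cc / model_mlo each share parameters across images of the same
    view, mirroring the weight sharing shown in Fig. 1.
    """
    p_cc = torch.softmax(model_cc(cc_images), dim=-1)
    p_mlo = torch.softmax(model_mlo(mlo_images), dim=-1)
    return (p_cc + p_mlo) / 2  # exam-level class probabilities
```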
Fig. 2. A representation of the pseudo-landmarks and the respective tessellation of the same breast in the two projections.
Fig. 3. A representation of the entire AGN4V model. It is worth noting that this architecture requires an additional set of inputs (that is, the pseudo-landmarks and their position), which are used by the IGN and BGN modules to simulate radiologists’ analysis.
Fig. 4. Representation of the MaMVT architecture used in this work: the four views are passed through a shared Swin backbone, with an additional cross-attention block inserted inside the backbone after the 10th layer of the 3rd block to perform cross-attention between each view. The final output for each view is then passed through a classification layer and used for additional loss computation. The left and right views are additionally concatenated to obtain a left and a right representation as well, which are also passed through a classification layer and are used both for loss computation and to obtain the final classification result for the exam.
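A schematic of this data flow, with hypothetical module names (the real MaMVT backbone is a Swin transformer with the cross-attention block spliced into stage 3; this is an illustrative skeleton, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MaMVTSketch(nn.Module):
    """Illustrative MaMVT-style data flow only (assumed structure)."""
    def __init__(self, encoder_front, cross_attention, encoder_back, dim):
        super().__init__()
        self.front = encoder_front     # shared Swin stages up to the insertion point
        self.cross = cross_attention   # four-view cross-attention block (see Figs. 6-7)
        self.back = encoder_back       # remaining shared Swin stages
        self.view_head = nn.Linear(dim, 2)       # per-view auxiliary classifier
        self.side_head = nn.Linear(2 * dim, 2)   # left / right breast classifier

    def forward(self, l_cc, l_mlo, r_cc, r_mlo):
        views = [self.front(v) for v in (l_cc, l_mlo, r_cc, r_mlo)]
        views = self.cross(*views)                 # exchange information across views
        # Pool tokens (assumed shape: batch, tokens, dim) into per-view features.
        feats = [self.back(v).mean(dim=1) for v in views]
        view_logits = [self.view_head(f) for f in feats]  # auxiliary per-view losses
        left = self.side_head(torch.cat([feats[0], feats[1]], dim=1))
        right = self.side_head(torch.cat([feats[2], feats[3]], dim=1))
        # The exam-level result is derived from the breast-level outputs,
        # e.g. by taking the maximum of the two scores.
        return view_logits, left, right
```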
Fig. 5. Simplified example of the patch supervision method: shown on the left is the image mask, split into patches and converted into the label vector below, where each value corresponds to one patch: indices 5 and 7 are set to 1, since their respective patches contain the lesion. Shown on the right is a hypothetical prediction for each image patch following the same structure: in this example, all patches were predicted correctly with the exception of patch 5.
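The mask-to-label conversion described in the caption can be sketched as follows (a minimal NumPy example, assuming the image dimensions are divisible by the patch size):

```python
import numpy as np

def patch_labels(mask: np.ndarray, patch: int) -> np.ndarray:
    """Convert a binary lesion mask (H, W) into a flat per-patch label vector.

    A patch is labeled 1 if it contains at least one lesion pixel,
    mirroring the scheme of Fig. 5.
    """
    h, w = mask.shape
    grid = mask.reshape(h // patch, patch, w // patch, patch)
    return (grid.sum(axis=(1, 3)) > 0).astype(np.int64).ravel()

# Example: a 4x4 mask split into 2x2 patches -> 4 patch labels.
m = np.zeros((4, 4), dtype=np.uint8)
m[0, 3] = 1                   # lesion pixel in the top-right patch
print(patch_labels(m, 2))     # [0 1 0 0]
```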
Fig. 6. Side-variant four-view cross-attention module scheme. First, for each side (L-CC and L-MLO, R-CC and R-MLO), the pair-wise attention operations are performed and then added to their respective views. Then the same operation is applied for each type of view (L-CC and R-CC, L-MLO and R-MLO).
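A minimal sketch of one such pair-wise cross-attention step, assuming token sequences of shape (batch, tokens, dim) and standard multi-head attention (illustrative only; a single shared attention module is used here for brevity):

```python
import torch
import torch.nn as nn

class PairCrossAttention(nn.Module):
    """One pair-wise step of the four-view cross-attention scheme (sketch).

    Each view queries the other view of the pair, and the result is added
    back to the querying view, as in Fig. 6.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        a2, _ = self.attn(query=a, key=b, value=b)  # a attends to b
        b2, _ = self.attn(query=b, key=a, value=a)  # b attends to a
        return a + a2, b + b2

# Side-variant ordering (Fig. 6): first ipsi-lateral pairs, then same-view pairs.
# l_cc, l_mlo = pair(l_cc, l_mlo);  r_cc, r_mlo = pair(r_cc, r_mlo)
# l_cc, r_cc  = pair(l_cc, r_cc);   l_mlo, r_mlo = pair(l_mlo, r_mlo)
```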
Fig. 7. Side-invariant four-view cross-attention module scheme. All the pair-wise attention operations are performed first, and then added to their respective views. Only the sum operations for the L-CC and R-MLO views are shown for clarity.
Fig. 8. Flowchart for exam selection and stratification into the training, validation and test sets, with the number of exams and images included at each step.
Fig. 9. Two examples of synthetic cases comparing the original healthy control image (left) and the result of the synthetic lesion insertion (right). The network was trained on both the original healthy controls and the synthetic lesions.
Fig. 10. The red line (a) represents the lesion annotation pixels (AP), while the green one (a) outlines the area covered by the Grad-CAM heatmap ($GP_t$). These two quantities were used to calculate the DICE score and the Intersection over Lesion. The blue line (b) represents the area covered by the entire breast (BP), and is used to calculate the Intersection over Breast.
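Under the definitions suggested by the caption (DICE between the binarized heatmap and the lesion; Intersection over Lesion = overlap / lesion area; Intersection over Breast = heatmap-breast overlap / breast area; the exact normalizations are our reading, not confirmed formulas), the computation can be sketched as:

```python
import numpy as np

def heatmap_metrics(gradcam: np.ndarray, lesion: np.ndarray,
                    breast: np.ndarray, t: float):
    """DICE, IOL and IOB between a Grad-CAM heatmap and the annotations (sketch).

    `gradcam` is first normalized to [0, 1] and binarized at threshold `t`,
    as described for Fig. 15.
    """
    g = (gradcam - gradcam.min()) / (gradcam.max() - gradcam.min() + 1e-8)
    gp = g >= t                                   # binarized heatmap (GP_t)
    ap, bp = lesion.astype(bool), breast.astype(bool)
    dice = 2 * (gp & ap).sum() / (gp.sum() + ap.sum() + 1e-8)
    iol = (gp & ap).sum() / (ap.sum() + 1e-8)     # Intersection over Lesion
    iob = (gp & bp).sum() / (bp.sum() + 1e-8)     # Intersection over Breast
    return dice, iol, iob
```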
Fig. 11. Correlation between the predictions of the three best runs of each architecture on the validation set.
Fig. 12. ROC curves on the test set, with 95% confidence intervals calculated by bootstrapping. All ROCs are calculated on all cancers, except for the ensemble, for which ROCs are calculated on both all cancers and screen detected (SD) cancers.
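A common way to obtain such bootstrap confidence intervals for the AUC (an illustrative sketch; the paper's exact protocol, e.g. 1000 repetitions over three training runs, is described with Tables 1 and 2):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot: int = 1000,
                     alpha: float = 0.05, seed: int = 0):
    """95% bootstrap confidence interval for the AUC (sketch)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                                     # skip single-class resamples
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```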
Fig. 13. Score distribution on the cancer cases (a) and negative control (b) exams for each architecture and ensemble (y axis in logarithmic scale). The best performing run is selected for each architecture on the validation set. MaMVT-v2 assigns high scores to the largest number of cancer cases, but also generates the highest percentage of highly scored false positives. For the ensemble, the distribution of positive cases is reported separately for all cancers and for screen detected (SD) cancers.
Fig. 14. Example of a negative control that was classified as positive by the AGN4V architecture. Note the asymmetry between the left and right breast.
Fig. 15. DICE (a-b), IOB (c-d) and IOL (e-f) scores calculated on the cancer cases of the Validation Test set (a, c, e) and Training set (b, d, f), divided by architecture and by correct and incorrect predictions (in other words, for detected and missed cancers). Each metric compares the GradCAM heatmaps with the lesion segmentation as detailed in Section 5.7. The scores at various thresholds were obtained by normalizing the GradCAM heatmaps to values between 0 and 1, and then binarizing the maps by applying the corresponding threshold.
Fig. 16. Grad-CAM heatmaps for the cancer prediction task. From left to right, each row displays the original image, the corresponding lesion annotation mask, and the GradCAMs obtained from the Baseline, AGN4V and MaMVT (v1 ImageNet, v1 PEAC and v2) architectures, along with the corresponding prediction score. While the GradCAMs for the Baseline and AGN4V architectures are more focused on local areas, the MaMVT architectures attend to a larger portion of the breast parenchyma independently of the prediction score.
Fig. 17. Grad-CAM heatmaps for the cancer prediction task on different views of the same exam. From top to bottom, each column displays the L-CC, L-MLO, R-CC and R-MLO views. From left to right, each row displays the original view image, the corresponding lesion annotation mask, and the GradCAMs obtained from the Baseline, AGN4V (when possible) and MaMVT (v1 ImageNet, v1 PEAC and v2) architectures, along with the corresponding prediction score. While the GradCAMs for the Baseline and AGN4V architectures are more focused on local areas, the MaMVT architectures attend to a larger portion of the breast parenchyma independently of the prediction score.
Table
Table 1. Performance metrics, at breast and patient level, calculated on the validation set. Performance metrics reported include the Area under the ROC Curve (AUC) for cancer detection (Cancer, for brevity) and recall prediction (Recall, for brevity). Performances are separately calculated on all cancers (including screen detected and interval cancers) and on screen detected cancers only; 95% confidence intervals are calculated based on 1000 bootstrap repetitions from three training runs. Models indicated with * (top rows) are trained from scratch. All other models are either pre-trained on ImageNet (MaMVT-v1, MaMVT-v2) or using self-supervised learning (Baseline, AGN4V, MaMVT-v1 (PEAC)). All models indicated with † (bottom rows in the table) are trained using random swapping of the left and right breast. The remaining models (intermediate rows) are trained using standard data augmentation. MaMVT models indicated with ∙ are trained using the side-invariant version of the cross-attention module. Best and second-best models are indicated in bold and underlined characters, respectively.
Table 2. Performance metrics, at breast and patient level, calculated on the test set. Performance metrics reported include the Area under the ROC Curve (AUC) for cancer detection (Cancer, for brevity) and recall prediction (Recall, for brevity). Performances are separately calculated on all cancers (including screen detected and interval cancers) and on screen detected cancers only; 95% confidence intervals are calculated based on 1000 bootstrap repetitions from three training runs. Models indicated with * (top rows) are trained from scratch. All other models are either pre-trained on ImageNet (MaMVT-v1, MaMVT-v2) or using self-supervised learning (Baseline, AGN4V, MaMVT-v1 (PEAC)). All models indicated with † (bottom rows in the table) are trained using random swapping of the left and right breast. The remaining models (intermediate rows) are trained using standard data augmentation. MaMVT models indicated with ∙ are trained using the side-invariant version of the cross-attention module. Best and second-best models are indicated in bold and underlined characters, respectively.
Table 3. Patient-level AUC calculated for the best run of each architecture and for the ensemble, with and without test-time augmentation (TTA). TTA increases performance for each individual architecture, but has minimal effect on the ensemble.
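A generic sketch of how TTA and the ensemble average could be combined (hypothetical transform and model lists; not the paper's exact recipe):

```python
import torch

def predict_with_tta(model, views, transforms):
    """Average one model's exam scores over test-time augmentations (sketch).

    `views` is the (L-CC, L-MLO, R-CC, R-MLO) tuple; `transforms` is a list
    of deterministic augmentations, e.g. [identity, horizontal_flip].
    """
    scores = [model(*[t(v) for v in views]) for t in transforms]
    return torch.stack(scores).mean(dim=0)

def ensemble_score(models, views, transforms):
    """Ensemble score: mean of the TTA-averaged scores of each architecture."""
    per_model = [predict_with_tta(m, views, transforms) for m in models]
    return torch.stack(per_model).mean(dim=0)
```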
Table 4. Comparison of model parameters (millions) and throughput (exams processed per second) for the four architectures.