Multi-task Video Enhancement for Dental Interventions

MICCAI 2022

Abstract

A microcamera firmly attached to a dental handpiece allows dentists to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates low light, noise, blur, and camera shake, all of which reduce visual comfort. To this end, we introduce a novel deep network for multi-task video enhancement that enables macro-visualization of dental scenes. In particular, the network jointly leverages video restoration and temporal alignment in a multi-scale manner for effective video enhancement. Our experiments on videos of natural teeth in phantom scenes show that the proposed network achieves state-of-the-art results on multiple tasks at near real-time processing speed. We release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dental video dataset with multi-task labels, to facilitate further research in relevant video processing applications.

Related Work

UberNet [9] and cross-stitch networks [16] are encoder-focused architectures that propagate task outputs across scales in the encoder.

PAD-Net [27] and PAP-Net [29] use multi-modal distillation; they are decoder-focused networks that fuse the outputs of task heads to make the final dense predictions, but only at a single scale.

MTI-Net [24], which is most similar to our architecture, extends the decoder fusion by propagating task-specific features bottom-up across multiple scales through the encoder.

Instead of propagating the task features in scale-specific distillation modules across scales to the encoder, our network simultaneously propagates task outputs to the encoder and to the task heads in the decoder. Furthermore, these networks make dense task predictions on static images, while we extend our network to videos.

Contribution

(i) a novel application of a microcamera in computer-aided dental intervention for continuous tooth macro-visualization during drilling (so this one is actually a hardware innovation) (I walk away slightly disappointed)

(ii)    a new, asymmetrically annotated dataset of natural teeth in phantom scenes with pairs of frames of compromised and good quality using a beam splitter,

(iii) a novel deep network for video processing that propagates task outputs to the encoder and decoder across multiple scales to model task interactions, and (iv) demonstration that an instantiated model effectively addresses multi-task video enhancement in our application by matching and surpassing state-of-the-art results of single-task networks in near real-time.

Method

Enhance video processing by exploiting the interactions between different tasks.

Video enhancement tasks are interrelated. For example:
-- Aligning video frames helps deblurring.
-- Denoising and deblurring reveal image features that help motion estimation.
These interdependencies can be fully exploited by designing a multi-task model.

MOST-Net is a multi-output, multi-scale, multi-task network architecture. Its goal is to model task interactions through multi-scale features across the encoder and decoder. The network produces outputs for multiple tasks (indexed by i, with T tasks in total) at multiple scales (indexed by s).

Propagation:

  • Within-scale propagation: task outputs are propagated within the current scale.
  • Cross-scale propagation: task outputs are upsampled from lower scales and propagated to the decoder layers and task branches of higher scales.

Constraint:

The output of task i at scale s is tied to its output at the next-lower scale, O_i^s = u_i(O_i^{s+1}), where u_i denotes some operator, for instance, the upsampling operator for segmentation or the scaling operator for homography estimation.
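To make the constraint concrete, here is a minimal sketch (my own, not from the paper) of the two operators named above, assuming each scale doubles the resolution of the one below it:

```python
import torch
import torch.nn.functional as F

def upsample_segmentation(mask_low: torch.Tensor) -> torch.Tensor:
    """u_i for segmentation: upsample an (N, 1, H, W) mask 2x to the next scale."""
    return F.interpolate(mask_low, scale_factor=2, mode="bilinear", align_corners=False)

def scale_homography(H_low: torch.Tensor) -> torch.Tensor:
    """u_i for homography: re-express a homography estimated at scale s+1 in the
    pixel coordinates of scale s (2x resolution): H' = S @ H @ S^-1."""
    S = torch.diag(torch.tensor([2.0, 2.0, 1.0], device=H_low.device))
    return S @ H_low @ torch.linalg.inv(S)
```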

Problem Statement

The model must jointly solve video restoration, teeth segmentation, and motion estimation, learned and optimized under an assumed degraded-image generation model.

T = 3, with O1: video restoration, O2: segmentation, O3: homography estimation.

The video stream generates observations B_{t−P}, ..., B_t, where t is the time index and P > 0 is a scalar value referring to the number of past frames.

The problem is to (1) estimate a clean frame, (2) a binary teeth segmentation mask, and (3) approximate the inter-frame motion by a homography matrix; the joint outputs of the three tasks at scale s = 1 are denoted by the triplet (R_t^1, M_t^1, H_t^1).



Let x denote a pixel location. Given per-pixel blur kernels k_{x,t} of size K, the degraded image at s = 1 (this models how the input video is degraded, e.g., by blur and noise) is generated by applying each kernel over the K × K neighborhood N(x) of the clean frame and adding noise: B_{x,t} = k_{x,t}^T R_{N(x),t} + n_{x,t}, where n_{x,t} is the noise map.

We assume multiple independently moving objects are present in the considered scenes, while our task is to estimate only the motion related to the object of interest (i.e., the teeth), which is present in the region indicated by the non-zero values of mask M; the motion constraint is required to hold ∀t ∀x, that is, for all time steps t and all pixel locations x in that region.

Training

How the loss function and optimization objective are defined in a multi-task, multi-scale deep model.

Dataset

Loss Function

The total objective sums the per-task losses over all samples, tasks, and scales, with per-task weights λ_i (cf. Eq. 4 and the Setup section):

L(Θ) = Σ_{n=1..N} Σ_{i=1..T} Σ_{s=1..S} λ_i L_i(O_{i,n}^s, Ô_{i,n}^s)

Summation therefore runs over N (samples) × T (tasks) × S (scales) terms in total.

Loss function types

The model learns the parameters Θ by minimizing the total loss, simultaneously optimizing the output predictions for all tasks at all scales. The optimization must account for the interplay between tasks and the synergy across scales (the core idea of multi-task, multi-scale learning).
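As a sanity check for myself, a minimal sketch of such a weighted multi-task, multi-scale objective; the specific per-task losses (L1/BCE) are my placeholder assumptions, not necessarily the paper's exact choices:

```python
import torch.nn.functional as F

# Per-task weights, mirroring lambda_1..lambda_3 from the Setup section.
LAMBDAS = {"restoration": 2e-4, "segmentation": 5e-5, "homography": 1.0}

def multi_task_loss(outputs, targets):
    """outputs/targets: dict task -> list of tensors, one entry per scale s = 1..S.
    Accumulates lambda_i * L_i over all tasks and scales (N is the batch dim)."""
    total = 0.0
    for task, preds in outputs.items():
        for pred, gt in zip(preds, targets[task]):
            if task == "segmentation":
                loss = F.binary_cross_entropy_with_logits(pred, gt)  # assumed
            else:  # restored frames / homography corner offsets
                loss = F.l1_loss(pred, gt)                           # assumed
            total = total + LAMBDAS[task] * loss
    return total
```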

I still feel I haven't fully grasped this multi-task learning part; I'll go read some other papers.

Structure

MOST-Net enables refinement of lower scale segmentations by upsampling and inputting them at the task-specific branches of higher scales.

Encoders

MOST-Net extracts features from the two input frames Bt−1 and Bt independently at three scales. In other words, the model processes the input at multiple scales simultaneously.

U-shaped downsampling: features are extracted via 3 × 3 convolutions with strides of 1, 2, 2 for s = 1, 2, 3, followed by ReLU activations and 5 residual blocks [4] at each scale. The residual connections are augmented with an additional branch of convolutions in the Fast Fourier domain.
Output channel dimension: 2^(s+4), i.e., 32, 64, 128 for s = 1, 2, 3.
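My rough sketch of one encoder scale; the Fourier-domain residual branch follows the spirit of the FFT blocks the paper references, but its exact structure here is an assumption:

```python
import torch
import torch.nn as nn

class FourierResBlock(nn.Module):
    """Residual block with a spatial branch and a frequency branch (assumed layout)."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        # 1x1 convs over the real/imag parts of the rFFT, stacked on channels
        self.freq = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 1),
        )

    def forward(self, x):
        spec = torch.fft.rfft2(x, norm="ortho")
        f = self.freq(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = f.chunk(2, dim=1)
        freq_out = torch.fft.irfft2(torch.complex(real, imag),
                                    s=x.shape[-2:], norm="ortho")
        return x + self.spatial(x) + freq_out

def encoder_scale(in_ch, s):
    out_ch = 2 ** (s + 4)              # 32, 64, 128 for s = 1, 2, 3
    stride = 1 if s == 1 else 2        # strides 1, 2, 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
        *[FourierResBlock(out_ch) for _ in range(5)],   # 5 residual blocks per scale
    )
```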

At each scale, the features extracted from Bt−1 and Bt are concatenated, and a channel attention mechanism [30] follows to fuse them into a single feature map per scale (the F1, F2, F3 referenced later).

MOST-Net uses the homography outputs from lower scales to warp the encoder features from the previous time step towards the current frame before fusion (a sketch of such a warp follows).
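A sketch of feature warping by a homography; this is my own implementation via grid_sample, since the paper does not spell out the routine:

```python
import torch
import torch.nn.functional as F

def warp_by_homography(feat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    """Warp (N, C, h, w) features with per-sample 3x3 homographies H (N, 3, 3).
    Assumes H acts on normalized [-1, 1] grid coordinates and stays near identity."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feat.device),
        torch.linspace(-1, 1, w, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(1, -1, 3)
    pts = grid.expand(n, -1, -1) @ H.transpose(1, 2)   # apply H to every pixel
    pts = pts[..., :2] / pts[..., 2:]                  # perspective divide
    return F.grid_sample(feat, pts.reshape(n, h, w, 2), align_corners=True)
```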

Decoders

The encoder features are passed onto the expanding blocks scale-wise via the skip connections.

At the lowest scale (s = 3), the encoder features are passed directly onto a stack of two residual blocks with 128 output channels. Transposed convolutions with strides of 2 are then used twice to recover the resolution.

At higher scales (s < 3), the encoder features are first concatenated with the upsampled decoder features and convolved with 3 × 3 kernels to halve the number of channels (why halve? presumably because concatenation doubles the channel count, so halving restores the original width). Subsequently, they are propagated onto two residual blocks with 64 and 32 output channels each. The residual block outputs constitute scale-specific shared backbones. Lightweight task-specific branches follow to estimate the dense outputs. Specifically, one 3 × 3 convolution estimates the segmentation mask, and two 3 × 3 convolutions, separated by ReLU, yield the restored frame at each scale; a sketch follows.
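Putting that description into code form, a condensed sketch of one decoder scale with its lightweight task heads (the naming is mine; channel numbers follow the text):

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.skip(x) + self.body(x)

class DecoderScale(nn.Module):
    """One decoder scale (s < 3): fuse skip + upsampled features, run the shared
    backbone, then the lightweight task heads."""
    def __init__(self, skip_ch, up_ch):
        super().__init__()
        fused = (skip_ch + up_ch) // 2                  # halve channels after concat
        self.fuse = nn.Conv2d(skip_ch + up_ch, fused, 3, padding=1)
        self.backbone = nn.Sequential(ResBlock(fused, 64), ResBlock(64, 32))
        self.seg_head = nn.Conv2d(32, 1, 3, padding=1)  # one 3x3 conv -> mask
        self.restore_head = nn.Sequential(              # two 3x3 convs -> frame
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, skip_feat, up_feat):
        x = self.backbone(self.fuse(torch.cat([skip_feat, up_feat], dim=1)))
        return x, self.seg_head(x), self.restore_head(x)
```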

At each scale, homography estimation modules estimate 4 offsets, related one-to-one to homographies via the Direct Linear Transformation (DLT), as in [5,12]. The motion gated attention modules multiply the backbone features with the segmentations to filter out context irrelevant to the motion of the teeth. The channel dimensionality is then halved by a 3 × 3 convolution, while a second convolution extracts features from the restored output. The concatenation of the two streams forms the motion features for homography estimation (see the sketch below).
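A sketch of the motion gated attention step, as I read it (channel counts are assumptions):

```python
import torch
import torch.nn as nn

class MotionGatedAttention(nn.Module):
    """Gate decoder features by the predicted teeth mask, then fuse with
    features extracted from the restored frame."""
    def __init__(self, feat_ch):
        super().__init__()
        self.halve = nn.Conv2d(feat_ch, feat_ch // 2, 3, padding=1)
        self.from_restored = nn.Conv2d(3, feat_ch // 2, 3, padding=1)

    def forward(self, feat, mask_logits, restored):
        gated = feat * torch.sigmoid(mask_logits)   # suppress non-teeth context
        a = self.halve(gated)                       # halve channel dimensionality
        b = self.from_restored(restored)            # features from restored output
        return torch.cat([a, b], dim=1)             # motion features for this scale
```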

Homography Estimation Module: At each scale, these motion features are employed to predict the offsets with shallow downstream networks. Predicted offsets at lower scales are transformed back to homographies and cascaded bottom-up [12] to refine the higher-scale ones.

Similarly to [5], we use blocks of 3 × 3 convolutions coupled with ReLU, batch normalization, and max-pooling to reduce the spatial size of the features. Before the regression layer, a 0.2 dropout is applied. For s = 1, the convolution output channels are 64, 128, 256, 256 and 256. For s = 2, 3 the network depth is cropped from the second and third layers onwards, respectively.
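A sketch of turning the 4 predicted corner offsets into a homography; I use kornia's DLT solver here as a stand-in for the paper's DLT step:

```python
import torch
import kornia.geometry as KG

def offsets_to_homography(offsets: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """offsets: (N, 4, 2) predicted displacements of the four image corners."""
    corners = torch.tensor(
        [[0.0, 0.0], [w - 1.0, 0.0], [w - 1.0, h - 1.0], [0.0, h - 1.0]],
        device=offsets.device,
    ).unsqueeze(0).expand(offsets.shape[0], -1, -1)
    # DLT: solve the 3x3 homography mapping corners -> corners + offsets
    return KG.get_perspective_transform(corners, corners + offsets)
```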

Task-Specific Branches

(This part is my own addition based on GPT; I hadn't done multi-task learning before, so it's here to aid understanding.)

Each task (colorization/restoration, motion estimation, segmentation) is handled by a separate branch of the network. These branches appear in the figure as the paths where F1, F2, F3 (the features at different scales) pass through different processing stages (e.g., motion gated attention, channel attention, homography estimation) to produce task-specific outputs: the restored, colorized frame Rt, the mask Mt, and the homography Ht.

The network is optimized for multiple tasks by sharing features across the task-specific branches, while each branch focuses on one task's output (colorization, segmentation, motion estimation). The losses for each task are computed separately and combined in the final objective function, which lets the model learn multiple tasks simultaneously while sharing common feature representations.

Experiment

Dataset

Vident-lab: a dataset for multi-task video processing of phantom dental scenes (hosted as Open Research Data on the Bridge of Knowledge portal)

  • Frame-to-Frame (F2F) Training:

    • The model is trained frame-to-frame on static video fragments recorded with camera C1. The goal is to apply the trained image denoiser to the noisy frames, obtaining denoised frames and their noise maps.
  • Denoising Process:

    • The noisy frames are first denoised using the trained model. These denoised frames are then temporally interpolated [19] 8 times and averaged over a temporal window of 17 frames to synthesize realistic motion blur.
  • Adding Noise:

    • After the blur effect, the noise maps are added to the blurry frames to form the input video frames (B). The noise maps represent the original noise that would have been present in the actual noisy frames (see the sketch after this list).
  • Colorization: registration of frames between two different modalities, C1 and C2

    • To generate output video frames (R), frames from camera C1 are colorized using a process where frames from a second camera (C2) are mapped to create the ground truth frames.
    • Specifically, the frames from C1 are colorized based on data from C2 to form the colorized video frames. This helps in overcoming the difficulty of aligning the frames between the two cameras and creating accurate pixel-to-pixel correspondences.
  • Color Mapping Network:

    • A color mapping (CM) network is learned to predict the parameters of 3D functions that map the dental scene colors of camera C2 to those of camera C1. This network achieves precise color mapping and ensures accurate spatial correspondence between frames B and R.
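A sketch of the blur-plus-noise synthesis described in the list above; `interpolate_8x` is a hypothetical placeholder for the frame interpolation method [19]:

```python
import numpy as np

def synthesize_degraded(denoised: np.ndarray, noise_maps: np.ndarray) -> np.ndarray:
    """denoised: (T, H, W, 3) clean frames; noise_maps: matching per-frame noise.
    Returns blurry+noisy input frames B via temporal averaging."""
    dense = interpolate_8x(denoised)           # hypothetical 8x frame interpolation [19]
    blurry = np.stack([
        dense[i : i + 17].mean(axis=0)         # average a 17-frame temporal window
        for i in range(0, len(dense) - 16, 8)  # stride 8 -> back to source frame rate
    ])
    return blurry + noise_maps[: len(blurry)]  # add back the original noise
```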

Segmentation masks and homographies

HRNet48 [22], pretrained on ImageNet, is fine-tuned on our annotations to automatically segment the teeth in the remaining frames of all three sets. We compute optical flows between consecutive clean frames with RAFT [23]. The motion fields are cropped with the teeth masks Mt to discard other moving objects, such as the dental bur or the suction tube, since we are interested in stabilizing the videos with respect to the teeth. Subsequently, a partial affine homography H is fitted to the segmented motion field by RANSAC.
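A plausible implementation of that fitting step with OpenCV (the paper only names RANSAC and the partial affine model):

```python
import cv2
import numpy as np

def fit_partial_affine(flow: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) RAFT motion field; mask: (H, W) binary teeth mask.
    Returns a 3x3 homography built from a RANSAC partial affine fit."""
    ys, xs = np.nonzero(mask)                          # keep teeth pixels only
    src = np.stack([xs, ys], axis=1).astype(np.float32)
    dst = src + flow[ys, xs]                           # where each pixel moves to
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return np.vstack([A, [0.0, 0.0, 1.0]])             # lift 2x3 affine to 3x3
```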

Setup

We train, validate, and test all methods on our dataset (Tab. 1). In all MOST-Net training runs, we set λ1, λ2, λ3 to 2 × 10−4, 5 × 10−5 and 1 for balancing tasks in Eq. 4.

Training data are augmented by horizontal and vertical flips with 0.5 probability, random channel perturbations, and color jittering, after [31].

Batch size 16, Adam optimizer, learning rate 1e−4, decayed to 1e−6 with cosine annealing.
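The same setup expressed in PyTorch (a direct reading of the numbers above; `model`, `train_loader`, and `num_epochs` are placeholders, and `multi_task_loss` is the sketch from the Training section):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6   # cosine decay 1e-4 -> 1e-6
)

for epoch in range(num_epochs):
    for inputs, targets in train_loader:        # batches of 16 clips
        optimizer.zero_grad()
        loss = multi_task_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```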

PyTorch 1.10 (FP32). Inference speed is reported in frames per second (FPS) on an NVIDIA RTX 5000 GPU.

Results
