《Single-Stage Extensive Semantic Fusion for multi-modal sarcasm detection》

Paper-reading series index

Contents

Paper-reading series index
Meaning of the paper title
ABSTRACT
Keywords
1. Introduction
2. Related work
3. Method
- 3.1. Multi-modal projection
- 3.2. Extensive Semantic Fusion Multiway Transformer
- 3.3. Multi-objective optimization
  - 3.3.1. Multi-stage pre-training
  - 3.3.2. Fine-tuning optimization
4. Experiments
- 4.1. Dataset
- 4.2. Parameter settings and pre-training
- 4.3. Baselines
- 4.4. Main results
- 4.5. Scaling up on multi-modal sentiment analysis
- 4.6. Ablation study
- 4.7. Analysis
5. Conclusion

Paper link

Meaning of the paper title

Single-stage extensive semantic fusion for multi-modal sarcasm detection

ABSTRACT

With the rise of social media and online interactions, there is a growing need for analytical models capable of understanding the nuanced, multi-modal communication inherent in these platforms, especially for detecting sarcasm. Existing research employs multi-stage models along with extensive semantic information extraction and single-modal encoders. These models often struggle to align and fuse multi-modal representations efficiently. To address these shortcomings, we introduce the Single-Stage Extensive Semantic Fusion (SSESF) model, designed to process multi-modal inputs concurrently in a unified framework that performs encoding and fusion in the same architecture with shared parameters. A projection mechanism is employed to overcome the challenges posed by the diversity of inputs and the integration of a wide range of semantic information. Additionally, we design a multi-objective optimization that enhances the model’s ability to learn latent semantic nuances with supervised contrastive learning. The unified framework emphasizes the interaction and integration of multi-modal data, while multi-objective optimization preserves the complexity of semantic nuances for sarcasm detection. Experimental results on a public multi-modal sarcasm dataset demonstrate the superiority of our model, achieving state-of-the-art performance. The findings highlight the model’s capability to integrate extensive semantic information, demonstrating its effectiveness in the simultaneous interpretation and fusion of multi-modal data for sarcasm detection.

Keywords

Multi-modal sarcasm detection
Multi-modal representation learning
Multi-modal fusion

1. Introduction

Sarcasm is a linguistic phenomenon in which the intended meaning often diverges from the literal interpretation of the words employed. It typically manifests through irony, mockery, or humorous derision, and it relies on contextual cues, tone of voice, and nuanced expressions, which makes its detection particularly challenging. In Fig. 1(a), an image shows a pile of snow in front of a house, accompanied by the text, “the snowplow went by this morning and dropped off a present in my driveway”. The incongruity between the image and the text conveys sarcasm, as the snow is sarcastically referred to as a “present”. Fig. 1(b) displays an image that contains the text “your joke was soo funny”. These examples show that detecting sarcasm requires thorough consideration of both visual and textual elements, and a deeper use of extensive semantic information than previous models provide.
In traditional multi-modal tasks, models typically encode each modality separately before aligning and understanding their semantics. Previous studies such as [1] extend this multi-stage approach to multi-modal sarcasm detection, a task that requires more detailed semantic capture. For example, traditional pre-trained image models often overlook text within images and fail to grasp the semantic details of multiple entities, which limits their ability to capture fine-grained semantics. To bridge this gap, subsequent approaches have begun to incorporate extensive semantic information to enhance detection, including hashtags [2], image attributes [3], and adjective-noun pairs (ANPs) [4], aiming to enrich the model’s understanding of content. Approaches such as [5,6] use object detection to dissect images into region-based objects for granular visual analysis. Additionally, [7] enhances semantic information extraction through image captioning techniques.
The integration of extensive semantic information into multi-modal understanding requires sophisticated fusion techniques. Methods such as 2D-Intra-Attention [8] and Graph Convolutional Networks (GCN) [6] have advanced the modeling of cross-modal relationships. However, their emphasis on post-encoding fusion may not fully exploit the inherent capabilities of each modality.
During the multi-modal pre-training phase, the prevalent image-text contrastive loss aims to align text with the corresponding image semantics [9,10]. While this approach is effective for basic alignment, it may not adequately address the semantic nuances and discrepancies that are critical in tasks requiring detailed semantic understanding, such as multi-modal sarcasm detection.
To address these challenges, we propose a single-stage approach that circumvents the reliance on pre-trained single-modality encoders and their limitations in fusing multi-modal representations. Our model uses a projection mechanism designed to accept and integrate multi-modal input along with extensive semantic information. This allows for a more comprehensive understanding and fusion of semantic information across modalities. Furthermore, our model addresses the loss of semantic detail by optimizing the integration of contrastive learning and transfer learning objectives for downstream tasks. Instead of enforcing alignment, it preserves the nuances of the various modalities and of the extensive semantic information. This optimization includes a redesigned multi-task objective and quantization-aware training, ensuring the model not only retains its original capabilities but also adapts effectively to the requirements of downstream tasks.
In this work, we introduce the Single-Stage Extensive Semantic Fusion (SSESF) model, a novel approach designed to transcend the limitations inherent in traditional multi-modal sarcasm detection. By integrating text extracted from images through Optical Character Recognition (OCR) as part of our model’s input, we significantly enhance the semantic understanding of image content. Our model operates in a single-stage framework, eliminating the dependency on the performance of single-modality encoders by simultaneously processing and fusing multiple modalities and extensive semantic information. Our contributions are summarized as follows:
• The application of a single-stage framework that processes and fuses multiple modalities and extensive semantic information simultaneously, enhancing semantic understanding.
• The introduction of a projection mechanism in a single-stage model, which effectively handles multiple modalities and additional semantic information, including text from images via OCR, image regions via object detection, and image captions.
• A redesigned fine-tuning process with a multi-objective optimization that focuses on preserving and integrating the unique semantic contributions of each modality and extensive semantic information to improve understanding and adaptability to downstream tasks.
• Demonstrated superior performance on multi-modal sarcasm detection and other tasks, validating the effectiveness of our approach through experimental results.
This model sets a new benchmark in the field by not only addressing the challenges of multi-modal sarcasm detection but also offering a framework that is adaptable and efficient for a wide range of applications, showing its versatility and capacity for semantic comprehension.

2. Related work

The study of multi-modal sarcasm detection has become increasingly relevant with the proliferation of social media platforms. Early studies in this domain, such as [1], marked milestones by integrating image and textual data. Despite their innovative approach, these models relied on multi-stage frameworks, concatenating global information from each modality without effectively capturing the intricate relationships between them.
Subsequent contributions, including the development of a Twitter (now X) multi-modal sarcasm dataset by [3], introduced a hierarchical fusion model that considers image attributes as a distinct modality. While advancing the field, such models still struggle to preserve detailed semantic information, primarily due to the limitations in capturing image-specific attributes. Similarly, [2] and others such as [4,5] have endeavored to enrich semantic understanding through extensive semantic information modeling, such as hashtags, adjective-noun pairs and image regions. [7] further attempts to generate descriptive captions using the Clipcap model [11], enriching the semantic context. Beyond these methods, other kinds of extensive semantic information are yet to be considered, such as advanced object detection [12,13], image or video captioning [14], and image segmentation [15].
The work of [4] introduces a novel dual-network architecture designed to simultaneously process cross-modality contrast and semantic associations. Using a Decomposition Network in conjunction with a Relation Network, this approach enables the differentiation and explicit mapping of semantic relationships between images and text. [2] leverages a BERT-based framework to examine the incongruities both within and across modalities. By implementing a specialized inter-modality attention mechanism and a co-attention model, this study pioneers the precise delineation of the discordance that underlies multi-modal sarcasm. In [5], cross-modal graphs model the relationships between textual and visual representations. This approach creates a dynamic and explicit link between the modalities through attribute-object pairs. [7] proposes a hierarchical framework that analyzes sarcasm through congruity at both the atomic and composition levels. This method also considers the influence of external knowledge.
Contrary to these multi-stage approaches, our work introduces a Single-Stage Extensive Semantic Fusion (SSESF) model that departs from the dependency on pre-trained single-modality encoders. The single-stage unified framework of our model can process multiple modalities and extensive semantic information simultaneously without the need for separate encoding stages. This is achieved through a projection mechanism designed to simplify the integration of diverse modalities and inputs. Moreover, traditional semantic alignment goals in multi-stage models often result in the loss of modality-specific details that do not directly correspond across modalities. In response, we have developed a multi-objective optimization that optimizes the preservation of these semantic details. This approach not only retains the nuanced semantics essential for sarcasm detection but also enhances the model’s adaptability to the complexities of multi-modal data interpretation.

3. Method

In this section, we introduce our SSESF model, which evolves from a multi-modal foundation model [16] and adapts it for a wider range of applications that extend beyond traditional vision and vision-language tasks. Our enhanced backbone model, which we call the Extensive Semantic Fusion Multiway (ESFM) Transformer, is specifically modified to accommodate a wider array of modal inputs, including extensive semantic information, thus facilitating its application to an expanded range of downstream tasks. The model architecture is shown in Fig. 2.
[Fig. 2: architecture of the SSESF model]
The SSESF model is characterized by several key differences that distinguish it from other multi-stage models:

Projection Mechanism for Multi-Modal Input Representations. The ESFM Transformer uses a projection mechanism that enables the modules to concurrently process embeddings from multiple modalities, including those carrying extensive semantic information. This capability ensures that the model can integrate and understand a composite view of the data it processes.
Shared Parameters in Multi-Head Self-Attention (MSA). Unlike multi-stage models that use a separate model for each modality, the ESFM Transformer shares network parameters across all inputs in the MSA layer. This feature facilitates interaction between different modalities and semantic information, enhancing the model’s ability to perform comprehensive semantic fusion.
Controllable Multiway Feed Forward Network (FFN) Parameters. The ESFM Transformer allows for selective freezing of parameters in the FFN layer. For instance, outputs from an MSA layer corresponding to image inputs can be directed to the corresponding part of the parameters within the Multiway FFN, enabling targeted interaction of modalities. Alternatively, inputs from all modalities, including extensive semantics, can activate the entirety of the Multiway FFN parameters, achieving a full fusion and interaction of multi-modal data.
Multi-Objective Optimization. During the pre-training phase, the framework employs multiple learning objectives: masked data modeling, contrastive learning, and multi-modal data matching. The fine-tuning phase leverages label prediction and supervised contrastive learning. This stage is further augmented by multi-task learning, which optimizes the gradients and weights assigned to different tasks, ensuring the efficacy of each task is preserved.
Expandability for Downstream Tasks. The design of the SSESF model is flexible, enabling easy expansion to accommodate additional modalities, semantic information, and task objectives. This adaptability ensures that the model can be tailored to a wide variety of applications and serve as a versatile tool for advancing research and development.

3.1. Multi-modal projection

In this section, we delve into the input projection mechanism of the SSESF model. This mechanism is designed to project and embed a wide range of inputs so that multiple modalities can be processed simultaneously by the ESFM Transformer. The input projection mechanism provides a foundation for the SSESF model to effectively process and fuse different modalities and semantic information.
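Image input projection. In line with the Vision Transformer (ViT) [17], an image is split into patches, serialized, and then embedded. This turns an image $v \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened patch embeddings, where $C$ is the number of channels and $(H, W)$ is the image resolution. With patches of size $(P, P)$, the image is reshaped into $v^p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$ is the total number of patches. The patch embeddings are obtained by linear projection and supplemented with a special learnable token [CLS], together with position and type embeddings, to form the image input projection: $H^v_0 = [v_{[CLS]}, V v^p_1, \dots, V v^p_N] + V_{pos} + V_{type}$, where $H^v_0 \in \mathbb{R}^{(N+1) \times D}$ and $D$ is the hidden size.

Text input projection. In line with BERT [18], the text is tokenized into sub-word units, with a start-of-sequence token [CLS] and a special boundary token [SEP]. These tokens are linearly projected into embeddings, and position and type embeddings are then added to build the text input projection: $H^w_0 = [w_{[CLS]}, w_1, \dots, w_M, w_{[SEP]}] + T_{pos} + T_{type}$, where $H^w_0 \in \mathbb{R}^{(M+2) \times D}$ and $M$ is the length of the tokenized sub-word sequence.

Additional modality and semantic information projection. Additional inputs, such as video, along with other semantic inputs, such as OCR results and image captions, are incorporated into a similar projection framework. The video projection, inspired by ViViT [19], extracts spatio-temporal tubes from the video and applies a linear projection as a 3D extension of the ViT embedding. The extensive semantic projection extends the text projection to embed OCR results and image captions, and extends the image projection to image regions obtained from object detection, which brings in layered semantic insight. Other extensive semantic projections, such as audio and depth maps, can be handled in a similar way. The additional modality and semantic information projection can be written as $H^e_0 \in \mathbb{R}^{K \times D}$, where $K$ is the length of the additional semantic projection.

Fusion projection. The multi-modal projections above are concatenated to form the fusion projection: $H_0 = [H^v_0; H^w_0; H^e_0]$.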
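The sequence lengths implied by the projections above can be checked with a few lines of arithmetic. This is only a sketch: the function name `projection_shapes` and the example sizes (a 224×224 image, 16×16 patches, 40 sub-word tokens, 30 extra semantic tokens) are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ProjectionShapes:
    image_tokens: int  # N patches + 1 [CLS] token
    text_tokens: int   # M sub-words + [CLS] + [SEP]
    extra_tokens: int  # K tokens of extensive semantic information
    fused_tokens: int  # length of the concatenated (fused) sequence


def projection_shapes(height: int, width: int, patch: int,
                      text_len: int, extra_len: int) -> ProjectionShapes:
    """Token counts for the image, text, extra and fused projections."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    n_patches = (height * width) // (patch * patch)  # N = HW / P^2
    image_tokens = n_patches + 1
    text_tokens = text_len + 2
    fused = image_tokens + text_tokens + extra_len
    return ProjectionShapes(image_tokens, text_tokens, extra_len, fused)


# Assumed example sizes, chosen only for illustration.
print(projection_shapes(224, 224, 16, text_len=40, extra_len=30))
```

Every token in the fused sequence has the same hidden size D, which is what lets a single Transformer attend over all of them at once.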

3.2. Extensive Semantic Fusion Multiway Transformer

Building upon the foundation established by MoME [16], we present the Extensive Semantic Fusion Multiway (ESFM) Transformer as an advanced architecture for vision-language tasks and beyond. The ESFM Transformer extends the capabilities of foundation multi-modal models by incorporating an input projection mechanism, allowing for the processing and fusion of a broader range of modalities and extensive semantic information. This enhancement is particularly crucial for complex tasks such as multi-modal sarcasm detection, where nuanced understanding across various data types is essential.
The ESFM Transformer employs a modified Transformer block that integrates multi-modal inputs through a multiway feed-forward network (Multiway FFN), replacing the conventional FFN with a more flexible design. This design enables the ESFM Transformer to handle not only image and text inputs but also additional modalities and semantic information, thus offering a comprehensive approach to multi-modal integration. The computation process within the ESFM Transformer architecture is defined as follows:
$H'_l = \mathrm{MSA}(\mathrm{LN}(H_{l-1})) + H_{l-1}$
$H_l = \text{Multiway-FFN}(\mathrm{LN}(H'_l)) + H'_l$
Here $H_{l-1}$ denotes the output vectors of the previous layer and $l$ is the layer index. LN stands for layer normalization and MSA for multi-head self-attention.
The MSA mechanism in the ESFM Transformer shares parameters across different modalities, enhancing the model’s ability to fuse them and understand the interplay between them. This feature promotes a deeper integration of visual and linguistic content, along with other modalities, ensuring a cohesive multi-modal representation. The Multiway FFN is characterized by its adaptability: it enables selective activation of parameters for specific modal inputs. In multi-modal fusion layers, all parameters can be activated to achieve a comprehensive fusion of modalities. This approach allows for the tailored processing of each modality while facilitating their integration at different stages of the network. The ESFM Transformer is designed with flexibility, allowing for the introduction of new modalities and extensive semantic information. This capability ensures that the model remains applicable and effective for a wide range of downstream tasks, offering a scalable solution to the evolving demands of multi-modal research.
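The routing behaviour of the Multiway FFN described above can be illustrated with a toy example. This is a minimal sketch under strong simplifications: each "expert" is a one-line ReLU map instead of a two-layer MLP, layer normalization is omitted, and the names `make_toy_ffn`, `multiway_ffn` and `fusion_expert` are hypothetical, not taken from the paper or from any library.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_toy_ffn(scale: float, bias: float) -> Expert:
    """Toy position-wise FFN: ReLU(scale * x + bias), a stand-in for a real MLP."""
    def ffn(x: Vector) -> Vector:
        return [max(0.0, scale * xi + bias) for xi in x]
    return ffn


def multiway_ffn(tokens: List[Vector],
                 modalities: List[str],
                 experts: Dict[str, Expert],
                 fusion_expert: Optional[Expert] = None) -> List[Vector]:
    """Route every token to the FFN of its modality and add a residual.

    If `fusion_expert` is given (an assumption standing in for a fusion
    layer), all tokens share that single expert instead.
    """
    if len(tokens) != len(modalities):
        raise ValueError("one modality tag is required per token")
    outputs = []
    for x, modality in zip(tokens, modalities):
        expert = fusion_expert if fusion_expert is not None else experts[modality]
        update = expert(x)
        outputs.append([xi + ui for xi, ui in zip(x, update)])  # residual
    return outputs


experts = {"image": make_toy_ffn(2.0, 0.0), "text": make_toy_ffn(1.0, 1.0)}
print(multiway_ffn([[1.0, -1.0], [2.0, 0.0]], ["image", "text"], experts))
```

The self-attention that precedes this step is shared by all tokens, so modality-specific behaviour enters only through the choice of expert.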

3.3. Multi-objective optimization

In this section, we detail the multi-objective optimization strategy employed for the SSESF model, designed to enhance its performance across downstream and multi-objective tasks. The comprehensive approach to model training involves multi-stage pre-training, fine-tuning optimization for multiple tasks, and efficiency improvements through quantization. Multi-task learning [20] can help to preserve the model performance on multiple objectives.

3.3.1. Multi-stage pre-training

Contrastive Learning. Image-Text Contrast (ITC) is applied to align the representations of images and texts within a batch of image-text pairs. This task aims to identify matched pairs, using the final output vectors of the [CLS] tokens as the image and text representations, respectively. Through linear projection and normalization, image and text vectors are derived to compute similarities, facilitating the model’s understanding of image-to-text and text-to-image relationships.
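The image-to-text and text-to-image similarities mentioned above are usually turned into a symmetric softmax cross-entropy. The following is a sketch of that common formulation, not the paper's exact loss: the temperature of 0.07 and the equal averaging of the two directions are assumptions, and the vectors stand for already-projected [CLS] outputs.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits: List[float], target: int) -> float:
    """-log softmax(logits)[target], computed stably with the max trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def itc_loss(image_vecs: List[Vector], text_vecs: List[Vector],
             temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; pair k of images and texts is the match."""
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(itc_loss(aligned, aligned), itc_loss(aligned, swapped))
```

The loss is close to zero when each image is most similar to its own text and grows when the pairing is wrong.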
Masked Data Modeling. Masked Data Modeling, in the form of Masked Language Modeling (MLM) and Masked Image Modeling (MIM), randomly replaces tokens in the input sequence with a [MASK] token. The model is trained to predict these tokens from the context provided by the unmasked tokens. This task enhances the model’s ability to understand data patterns and semantics.
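The masking step of masked data modeling can be sketched as follows. The 15% mask ratio, the fixed seed and the helper name `mask_tokens` are assumptions made for illustration; the source does not state the ratio it uses.

```python
import random
from typing import Dict, List, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens: List[str], mask_ratio: float = 0.15,
                mask_token: str = "[MASK]",
                seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of non-special tokens with `mask_token`.

    Returns the masked sequence and a map position -> original token,
    which is what the model would be trained to predict.
    """
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL_TOKENS]
    n_mask = max(1, round(len(candidates) * mask_ratio))
    chosen = set(rng.sample(candidates, n_mask))
    masked = [mask_token if i in chosen else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in chosen}
    return masked, targets


sentence = ["[CLS]", "your", "joke", "was", "soo", "funny", "[SEP]"]
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

For images, the same idea applies to patch tokens instead of sub-word tokens.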
Image-Text Matching. Image-Text Matching (ITM) further refines the model’s capability to discern whether a given image and text are congruent. Using the hidden vector of the last layer to represent the image-text pair, this task employs binary classification to predict whether the pair matches.

3.3.2. Fine-tuning optimization

In the fine-tuning phase, we adapt the model to fit the nuanced demands of the multi-modal sarcasm detection task. This adaptation is crucial for the model to effectively grasp and interpret the core semantic details inherent in sarcasm, and it is driven by supervised learning objectives. The fine-tuning process is enhanced by incorporating techniques such as quantization-aware training and multi-objective optimization to ensure efficiency in handling complex multi-modal tasks.
To address the computational efficiency and model size concerns in deploying sophisticated multi-modal models, we employ quantization-aware training. This approach considers the effects of quantization during the training process, enabling the model to adapt to the reduced precision without significant loss of accuracy. Quantization-aware training thus ensures that the SSESF model retains high performance while being more resource-efficient, making it suitable for deployment in constrained environments.
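Quantization-aware training typically simulates reduced precision in the forward pass by "fake-quantizing" values: they are rounded to a small grid and mapped back to floats, while gradients are passed through unchanged. The sketch below shows only the rounding step for a uniform affine quantizer; the function name, the 8-bit default and the fixed value range are assumptions, and the source does not say which quantization scheme it uses.

```python
def fake_quantize(x: float, x_min: float, x_max: float, num_bits: int = 8) -> float:
    """Round x to the nearest of 2**num_bits evenly spaced levels in [x_min, x_max].

    The result is still a float, so the rest of the network runs unchanged
    while seeing the rounding error that real quantization would introduce.
    """
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    clipped = min(max(x, x_min), x_max)      # saturate out-of-range values
    q = round((clipped - x_min) / scale)     # integer level index
    return x_min + q * scale                 # de-quantized value


for value in (-1.2, -0.3, 0.5, 0.9):
    print(value, fake_quantize(value, -1.0, 1.0, num_bits=4))
```

Inside the range, the rounding error is at most half a quantization step, which is the perturbation the model learns to tolerate during training.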
The SSESF model is fine-tuned with a dual focus: detecting sarcasm and supervised contrastive learning. The supervised contrastive learning objective aims to draw closer the representations of samples sharing sarcasm semantics while distancing those that do not [21,22]. This objective is formulated as:
[Equation: supervised contrastive loss]

where A(i) is the set of all samples in the batch, P(i) = {p ∈ A(i) : y_p = y_i} is the set of all positives in the batch, z is the feature, τ is the temperature, and "in" and "out" refer to data samples with the same and with different labels, respectively.
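A supervised contrastive loss built from the quantities named above (A(i), P(i), features z and a temperature) can be sketched as follows. This follows the widely used formulation in which the anchor itself is excluded from A(i) and anchors without positives are skipped; both choices, the temperature of 0.1 and the function names are assumptions rather than the paper's exact definition.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(features: List[Vector], labels: List[int],
                temperature: float = 0.1) -> float:
    """Mean over anchors of -1/|P(i)| * sum_p log softmax_a(z_i . z_a / t)[p]."""
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [a for a in range(n) if a != i]             # A(i)
        positives = [p for p in others if labels[p] == labels[i]]  # P(i)
        if not positives:
            continue  # no same-label sample to pull towards
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in others}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(supcon_loss(feats, [1, 1, 0, 0]), supcon_loss(feats, [1, 0, 1, 0]))
```

The loss is small when samples with the same label are already close in feature space and large when labels cut across the clusters.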
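Sarcasm detection is performed by feeding the output of the last layer to a decoder. This objective is formulated as:
[Equation: sarcasm detection objective combined with the supervised contrastive loss]

Here the cross-entropy loss is used for sarcasm detection, and the supervised contrastive loss enhances the model’s ability to distinguish nuanced sarcasm semantics. A weighting coefficient balances these two objectives, and pooling aggregates the latent vectors from the last layer.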
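Given the simultaneous pursuit of sarcasm detection and contrastive learning objectives, we adopt a multi-objective optimization that balances the gradients of these tasks. The optimization problem is defined as:
[Equation: multi-objective optimization problem]

Here the optimization runs over the parameters shared by all tasks and the parameters specific to the k-th task, and each task loss carries its own weight. This formulation allows the task-specific gradients to be weighted dynamically, so that the model can balance sarcasm detection against supervised contrastive learning. Applied to the two main tasks of the fine-tuning phase, the algorithm computes the task weights that keep the model’s learning objectives aligned with the desired outcome, which makes multi-task learning effective. Because only two tasks are optimized, the weight can be computed in closed form by deriving the descent direction:
[Equation: closed-form task weight for the two-task case]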
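For two tasks, a common way to obtain such a closed form is the min-norm solution used in gradient-based multi-task learning: choose the weight α in [0, 1] that minimizes the norm of α·g1 + (1 − α)·g2, where g1 and g2 are the gradients of the two losses with respect to the shared parameters. Whether this matches the paper's formula exactly is an assumption; the function names are illustrative.

```python
from typing import List

Vector = List[float]


def two_task_weight(g1: Vector, g2: Vector) -> float:
    """alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2.

    Setting the derivative to zero gives
        alpha = (g2 - g1) . g2 / ||g1 - g2||^2,
    which is then clipped to [0, 1].
    """
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    alpha = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    return min(1.0, max(0.0, alpha))


def combined_direction(g1: Vector, g2: Vector) -> Vector:
    """Weighted gradient used as the shared-parameter update direction."""
    alpha = two_task_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


print(two_task_weight([1.0, 0.0], [0.0, 1.0]))     # orthogonal gradients
print(combined_direction([1.0, 0.0], [2.0, 0.0]))  # same direction, different size
```

When the two gradients point the same way, the weight collapses onto the smaller one, so neither task can dominate the shared update simply by having a larger gradient.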

4. Experiments

4.1. Dataset

In our study, we use the Twitter multi-modal sarcasm detection dataset collected by [3], which was developed specifically for evaluating the multi-modal sarcasm detection task. The dataset comprises English tweets with pictures. According to [3], the dataset is pre-processed to discard tweets containing common words that are frequently related to sarcasm. Data masking and other processing steps are also applied to the tweets. Table 1 presents the statistics of the dataset.
[Table 1: statistics of the dataset]

4.2. Parameter settings and pre-training

The model undergoes a pre-training phase, employing tasks such as Image-Text Contrastive Learning (ITC), Masked Data Modeling (comprising Masked Language Modeling (MLM) and Masked Image Modeling (MIM)), and Image-Text Matching (ITM). These tasks are pivotal for the model to learn robust cross-modal representations and alignment. Pre-training leverages public datasets such as ImageNet-21k and English Wikipedia [23], and multi-modal datasets such as COCO [24] and VG [25]. Parameters are loaded from models proven effective in vision-language tasks, such as VLMo [16] and BEiT-3 [26]. In line with ViT-Base and VLMo-Base, SSESF consists of a 12-layer ESFM Transformer with a hidden size of 768 and 12 attention heads.
For fine-tuning, we use a dataset comprising sarcastic and non-sarcastic tweets, aiming to refine the model’s capability to detect sarcasm accurately. This phase not only leverages labels for sarcasm detection but also employs supervised contrastive learning objectives, enhancing the model’s discriminative power by learning detailed semantic nuances within the data. Image augmentation techniques such as random resized cropping, horizontal flipping, color jittering, and RandAugment are employed. Other data augmentation methods [27] may be added in the future. For text data, we employ the SentencePiece tokenizer. OCR is performed using Google’s Tesseract OCR engine. The experiments are run on two Nvidia Tesla V100 32 GB GPU cards. Table 2 presents the hyper-parameters used.
Regarding optimization, we employ the AdamW optimizer with hyperparameters β1 = 0.9, β2 = 0.999, and a learning rate of 2e−5. A cosine learning rate decay scheduler is used.
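The cosine decay mentioned above can be written in a few lines. This sketch starts at the stated learning rate of 2e−5 and decays to an assumed floor of zero over an assumed number of steps; any warm-up phase is left out because the text does not mention one.

```python
import math


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under cosine decay from base_lr to min_lr."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed schedule length of 1000 steps, for illustration only.
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.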

4.3. Baselines

To measure the effectiveness of our proposed SSESF model, we use Accuracy, Precision, Recall, and F1-score as performance metrics. To compare our model with existing state-of-the-art models, we evaluate the following:
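1. Image-modality methods: models that use only visual information.
(1) Image: following [3], uses a pre-trained ResNet [28] and updates only the parameters of the classification layer.
(2) ViT: uses the [CLS] token of a pre-trained ViT [17] for sarcasm detection.
2. Text-modality methods: models that use only textual information.
(1) TextCNN: [29] uses a CNN to classify sarcasm in text.
(2) Bi-LSTM: uses a bidirectional LSTM to classify sarcasm in text.
(3) SMSD: [30] explores a self-matching network to capture textual incongruity information.
(4) BERT: uses the pre-trained BERT-base [18] for sarcasm detection on text.
3. Multi-modal methods: models that use both text and image information.
(1) HFM: [3] proposes a hierarchical fusion method that takes image features, image attribute features, and text features as input.
(2) D&R Net: [4] models cross-modality contrast and semantic association by constructing decomposition and relation networks.
(3) Att-BERT: [2] explores a BERT-based inter-modality attention mechanism and a co-attention mechanism; the former models the incongruity between modalities and the latter the incongruity within the text modality.
(4) InCrossMGs: [6] builds in-modal and cross-modal graphs to model the relationships among semantic nodes.
(5) CMGCN: [5] applies object detection to enrich image regions and constructs intra-modal and inter-modal graphs.
(6) HKE: [7] applies a hierarchical framework that incorporates the effect of various knowledge resources for sarcasm detection.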
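The four metrics named at the start of this subsection can be computed from the confusion counts of a binary classifier. The sketch below treats label 1 as "sarcastic"; the toy labels are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, for illustration only.
print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```

F1 is the harmonic mean of precision and recall, so it drops sharply if either one is low, which is why it is reported alongside accuracy.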

4.4. Main results

In this section, we detail the evaluation of our SSESF model on a multi-modal sarcasm dataset in Table 3. The results underscore the efficacy of our model, which incorporates extensive semantic information and a projection mechanism that makes it extensible to further modalities and semantics. The ESFM Transformer’s capacity to simultaneously process multiple modal inputs and extensive semantic information significantly enhances semantic understanding and fusion. Moreover, our multi-objective optimization approach, featuring fine-tuning and model transfer capabilities, benefits distinctly from supervised contrastive learning, which focuses on sarcasm semantics rather than the generic semantic alignment targeted by unsupervised contrastive learning.
[Table 3: main results on the multi-modal sarcasm dataset]
Our SSESF model achieves state-of-the-art results, outperforming existing models across all metrics. Notably, our model improves accuracy and F1-score, highlighting the beneficial impact of integrating extensive semantic information for sarcasm detection. This is particularly evident in the model’s improvement over the HKE model, which previously set the benchmark with its use of external knowledge.

4.5. Scaling up on multi-modal sentiment analysis

The exploration of multi-modal sentiment analysis presents an opportunity to demonstrate the adaptability and effectiveness of our proposed model across diverse multi-modal tasks. Given the limited availability of datasets, we extend our validation efforts to include multi-modal sentiment analysis, leveraging the MVSA-Single and MVSA-Multiple datasets developed by [36]. This choice allows us not only to showcase our model’s capacity for semantic enhancement across modalities but also to refine its application in sentiment analysis, a field well suited to our approach.
Our model integrates OCR technology and object detection algorithms to extract and focus on semantic information from key regions within images. The comparison of our model against established benchmarks in multi-modal sentiment analysis is presented in Table 4. Our model achieves competitive performance, with accuracy and F1-scores that closely rival those of leading models, while leaving room for further optimization and adaptation to multi-modal sentiment analysis.

4.6. Ablation study

In this section, we present an ablation study that analyzes the impact of individual model components on the performance of our proposed method. The evaluation metrics used for comparison are Accuracy and F1-score.
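As a reminder of how the two metrics are defined, the following is a minimal sketch for the binary case. The function name `accuracy_and_f1` and the convention that label 1 marks the sarcastic class are assumptions made for illustration; the section does not state which averaging scheme or tooling was used for the reported numbers.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels.

    `positive` is the label treated as the sarcastic class
    (an assumption for this sketch).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
    print(f"accuracy={acc:.4f} f1={f1:.4f}")
```

Accuracy counts every correct prediction equally, while F1 only looks at the positive class, so the two can diverge when the classes are imbalanced.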
Table 5 shows the results obtained from different model configurations. ESFM: ESFM with parameter sharing; configurations without ESFM are similar to multi-stage models with separate encoders. ESI: extensive semantic information. CL: unsupervised contrastive learning. SCL: supervised contrastive learning. MOO: multi-objective optimization.
The results make the contribution of each component to multi-modal sarcasm detection apparent. Notably, the SSESF model equipped with extensive semantic information, supervised contrastive learning, and multi-objective optimization achieved the highest accuracy (87.88%) and F1-score (84.99%), outperforming the other configurations.
The effectiveness of integrating extensive semantic information can be attributed to its role in enhancing the comprehension of sarcasm semantics, even though not all images contain textual content. It generally aids understanding and is occasionally decisive. In contrast, unsupervised contrastive learning led to a decrease in performance in certain configurations. This is likely due to the design of traditional contrastive learning, which aims to align representations that correspond to identical entities. In complex multi-modal situations, however, images and text are not always directly aligned; they often resonate with and complement each other in more subtle ways. The configuration with supervised contrastive learning and multi-objective optimization yielded superior results. This improvement can be attributed to a shift in focus towards identifying similarities in sarcasm cues, as opposed to merely aligning entities. Furthermore, multi-objective optimization helps the model maintain the objective of sarcasm prediction through optimized gradient computation.
These findings underscore the critical role of the single-stage unified framework, the inclusion of extensive semantic information through the projection mechanism, and multi-objective optimization with supervised contrastive learning. Together, these components significantly enhance the effectiveness of the SSESF model, thereby advancing the capabilities of sarcasm detection.
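To illustrate the difference between aligning entities and grouping samples by sarcasm label, here is a minimal sketch of a generic supervised contrastive loss, in which every other sample with the same label counts as a positive for a given anchor. This is the commonly used general form, not necessarily the exact loss used in SSESF; the function name, the temperature of 0.1, and the toy embeddings are assumptions for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch.

    For each anchor, samples sharing its label are positives and all
    other samples form the contrast set. Anchors with no positive in
    the batch are skipped.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    # Toy fused representations: two near the x-axis, two near the y-axis.
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supervised_contrastive_loss(emb, [1, 1, 0, 0]))  # labels match clusters
    print(supervised_contrastive_loss(emb, [1, 0, 1, 0]))  # labels cut across clusters
```

In this toy batch the loss is small when same-label samples already sit close together and large when the labels cut across the clusters, which is the pressure that pulls sarcastic samples towards each other regardless of whether their image and text depict the same entity.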

4.7. Analysis

In this section, we provide a detailed analysis, emphasizing the integration of textual information extracted from images as an extensive modality. We examine the attention weights at the initial layer of the attention module to gain insight into the model's behavior and its ability to capture relevant cues. Fig. 3 presents the results as attention weight heatmaps on both the image and the text.
In Fig. 3, the left image is the original image, while the right image shows the same image with an attention weight heatmap overlaid on it. The attention weights on the image are taken from the first layer of the ESFM Transformer, using one of the attention heads and the weights assigned to the image's classification [CLS] token. The original text accompanying the image is provided below it. Additionally, we present the OCR text generated as extensive semantic information. To analyze the attention weights on the textual input, we construct a heatmap aligned with the early-fused text, where the weights are again obtained from the first layer of the ESFM Transformer, using one attention head and the weights assigned to the text's [CLS] token.
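The heatmap construction described above can be sketched as follows. This is an illustrative sketch only: it assumes a ViT-style token layout in which the [CLS] token sits at index 0 and is followed by image patches in row-major order, and it assumes the attention weights are available as a nested list indexed `[head][query][key]`. The function name `cls_attention_grid` is hypothetical, and none of these details are confirmed by the paper.

```python
def cls_attention_grid(attn, head, grid_h, grid_w):
    """Turn the [CLS] attention row of one head into a 2-D patch grid.

    attn   : nested list [heads][query][key] of first-layer attention weights
    head   : index of the attention head to visualize
    grid_h : number of patch rows (assumed layout)
    grid_w : number of patch columns (assumed layout)
    """
    cls_row = attn[head][0]                      # weights from [CLS] to every token
    patch_weights = cls_row[1:1 + grid_h * grid_w]  # drop [CLS] attending to itself
    if len(patch_weights) != grid_h * grid_w:
        raise ValueError("attention row is shorter than the patch grid")

    # Min-max scale to [0, 1] so the grid can be used directly as a heatmap.
    lo, hi = min(patch_weights), max(patch_weights)
    span = (hi - lo) or 1.0
    scaled = [(w - lo) / span for w in patch_weights]

    # Reshape the flat list into rows of the patch grid.
    return [scaled[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


if __name__ == "__main__":
    # One head, one [CLS] token plus a 2x2 patch grid (toy numbers).
    toy_attn = [[
        [0.2, 0.1, 0.3, 0.1, 0.3],
        [0.2, 0.2, 0.2, 0.2, 0.2],
        [0.2, 0.2, 0.2, 0.2, 0.2],
        [0.2, 0.2, 0.2, 0.2, 0.2],
        [0.2, 0.2, 0.2, 0.2, 0.2],
    ]]
    for row in cls_attention_grid(toy_attn, head=0, grid_h=2, grid_w=2):
        print(row)
```

The resulting grid would then be upsampled to the image resolution and overlaid on the original image; the same row-extraction idea applies to the text side, with tokens in place of patches.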
By examining the attention weight heatmaps, we observe that certain key cues (highlighted in blue) have been effectively captured within the text extracted from the images. These cues play an essential role in understanding and detecting sarcasm. Incorporating semantic information from images through OCR, and projecting it together with the textual input, demonstrates the advantage of our approach in capturing meaningful cues for sarcasm detection. The ESFM Transformer enables the model to understand and combine information from multiple modalities, and this integration of textual and visual cues enhances its ability to detect and interpret sarcasm accurately.
We also analyze a failure case, highlighting the challenges in accurately interpreting sarcastic content. Fig. 4 depicts an example where the model fails to detect sarcasm.
The image in Fig. 4 shows a boxed meal. At first glance, the text explicitly states that the food is good. However, the content creator is being sarcastic, as the meal is served in a hospital. Understanding the sarcasm in this example requires external knowledge and cultural context: it is challenging to perceive the connection between hospitals and negative emotions such as sadness or dissatisfaction. The failure may be due to the model's limited access to external knowledge or to its inability to incorporate such knowledge effectively in the sarcasm detection process. Additionally, reliance on visual cues alone may not be sufficient to capture the nuanced sarcasm in this example. We therefore recommend including more extensive semantic information to address this problem.

5. Conclusion

In this paper, we propose the Single-Stage Extensive Semantic Fusion (SSESF) model for multi-modal sarcasm detection. SSESF showcases the benefits of incorporating extensive semantic information in multi-modal sarcasm detection. The fusion of textual and visual cues through the projection mechanism and the Extensive Semantic Fusion Multiway (ESFM) Transformer, combined with supervised contrastive learning and multi-objective optimization, offers promising avenues for enhancing the understanding and detection of sarcasm in multi-modal contexts. Experimental results on a public multi-modal sarcasm dataset demonstrate the superiority of the proposed SSESF model, which achieves state-of-the-art performance. Future work should focus on further improving extensive semantic information and exploring the integration of other modalities.