[LLM Multimodal] The CogVLM2-Video Video Understanding Model and the MVBench Benchmark

note

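CogVLM2-Video samples 24 frames from the video (each 224 x 224) and encodes them with a ViT (a 2x2 convolution is added to the ViT to compress the visual information). An adapter module aligns the visual features with the text features; the resulting image features are concatenated with the text features and fed into the language model's decoder.
- Model structure code: cogvlm_video_chat/ZhipuAI/cogvlm2-video-llama3-chat/modeling_cogvlm.py
- In that code, the VisionExpertAttention class transposes a 3D tensor [B, L, H*HD] into a 4D tensor of size [B, H, L, HD].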
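As a hedged illustration of the reshape mentioned above, the sketch below splits the last dimension of a [B, L, H*HD] input into heads and moves the head axis forward. It uses nested Python lists so that it runs without dependencies; the function name `split_heads` is made up for this example and is not taken from modeling_cogvlm.py. In PyTorch the same layout change is typically written as `x.view(B, L, H, HD).permute(0, 2, 1, 3)`.

```python
def split_heads(x, num_heads):
    """Rearrange [B][L][H*HD] nested lists into [B][H][L][HD]."""
    out = []
    for seq in x:                          # iterate over the batch dimension B
        hidden = len(seq[0])
        assert hidden % num_heads == 0, "hidden size must be divisible by num_heads"
        head_dim = hidden // num_heads
        heads = []
        for h in range(num_heads):         # new axis 1: head index
            lo, hi = h * head_dim, (h + 1) * head_dim
            heads.append([tok[lo:hi] for tok in seq])   # [L][HD] slice for head h
        out.append(heads)
    return out


# B=1, L=2, H=2, HD=3: each token holds 6 values, 3 per head.
x = [[[0, 1, 2, 3, 4, 5],
      [6, 7, 8, 9, 10, 11]]]
y = split_heads(x, num_heads=2)
```

After the call, `y[0][h]` holds the full sequence as seen by head `h`, which is the layout attention needs to compute per-head scores.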
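MVBench benchmark: most existing benchmarks focus on spatial understanding in static image tasks and neglect temporal understanding in dynamic videos. MVBench evaluates models on 20 video understanding tasks, which fall into nine categories. The MVBench task figure contrasts two kinds of understanding, spatial understanding (Spatial Understanding: Inferring from a single frame) and temporal understanding (Temporal Understanding: Reasoning based on entire video): each of the nine spatial categories is extended into one or more of the 20 temporal tasks.
- Automatic QA generation: the authors designed an automatic QA generation paradigm that converts public video annotations into multiple-choice QA to evaluate each task in MVBench.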
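A minimal sketch of how one annotation could be turned into a multiple-choice item, in the spirit of the automatic QA generation described above. The input fields, the source of the distractors, and the function name `make_mcq` are assumptions for illustration; MVBench's actual pipeline is not reproduced here.

```python
import random


def make_mcq(question, answer, distractors, seed=0):
    """Build one multiple-choice item with shuffled options."""
    rng = random.Random(seed)              # fixed seed keeps the item reproducible
    options = [answer] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEFGH"[:len(options)]
    return {
        "question": question,
        "options": {letter: opt for letter, opt in zip(letters, options)},
        "answer": letters[options.index(answer)],   # letter of the correct option
    }


item = make_mcq(
    question="What happened after the person took the food?",
    answer="Ate the sandwich",
    distractors=["Put down the cup", "Opened the door", "Closed the book"],
)
```

Shuffling the options matters because a fixed answer position would let a model score well by always picking the same letter.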
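VideoChat2 model: to provide a strong baseline on MVBench, the authors developed a robust video MLLM baseline, VideoChat2, trained with progressive multimodal training on diverse instruction-tuning data.
External fine-tuning code for video understanding: https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Multi-Modal/cogvlm2-video-best-practice.md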

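Contents

note
I. The CogVLM2-Video video understanding model
- 1. Model architecture
- 2. Training process
  - (1) Pre-training
  - (2) Post-training
II. Video understanding capability
- Spatial understanding tasks (Spatial Understanding: Inferring from a single frame)
- Temporal understanding tasks (Temporal Understanding: Reasoning based on entire video)
III. Experimental findings
Reference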

I. The CogVLM2-Video video understanding model

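CogVLM2-Video interprets continuous footage by extracting key frames and supports videos of up to 1 minute (the 1-minute figure refers to the performance of cogvlm2-video-llama3-base).
- The chat model takes the first 24 frames; the base model takes 24 frames sampled evenly across the whole video.
- English is supported (in testing it sometimes answers in Chinese, presumably because little Chinese training data was used).
CogVLM2 uses a visual-expert architecture in both the pre-training and fine-tuning stages and supports input resolutions up to 1344 × 1344 pixels. CogVLM2-Video proposes an automated method for constructing temporal grounding data and takes multi-frame video images together with timestamps as encoder input.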

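Model name	cogvlm2-video-llama3-chat	cogvlm2-video-llama3-base
Base model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	English
Task	Video understanding, single-turn dialogue model	Video understanding, base model, no dialogue
Int4 model	1	1
Text length	2K	2K
Image resolution	224*224 (video, first 24 frames)	224*224 (video, 24 evenly sampled frames)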
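The two sampling strategies in the table can be sketched as index selection over the decoded frames. This follows the description in this post only (first 24 frames versus 24 evenly spaced frames); the function name and the `strategy` values are illustrative assumptions, and the loader shipped with the released models may choose frames differently.

```python
def sample_frame_indices(total_frames, num_frames=24, strategy="base"):
    """Pick frame indices following the two strategies described above."""
    if total_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))           # short clip: keep every frame
    if strategy == "chat":
        return list(range(num_frames))             # first num_frames frames
    if strategy == "base":
        if num_frames == 1:
            return [0]
        step = (total_frames - 1) / (num_frames - 1)
        return [round(i * step) for i in range(num_frames)]   # evenly spaced
    raise ValueError(f"unknown strategy: {strategy}")


chat_idx = sample_frame_indices(240, strategy="chat")
base_idx = sample_frame_indices(240, strategy="base")
```

With 240 decoded frames, the "chat" indices stay within the opening of the clip, while the "base" indices span from the first frame to the last.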

1. Model architecture

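Existing video understanding models usually lack temporal grounding and cannot accurately associate video frames with specific timestamps. CogVLM2-Video combines sampled frames with timestamp information, which improves both video understanding and temporal localization.
After frame sampling (24 frames, each 224 x 224), the frames are encoded by a ViT (with an added 2x2 convolution that compresses the visual information). An adapter module aligns the visual features with the text features, and the resulting image features are concatenated with the text features and fed into the language model's decoder.
The open-source cogvlm2-video-llama3-base and cogvlm2-video-llama3-chat models change the CogVLM2 language model from ChatGLM to Llama 3.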
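To give a feel for how much the 2x2 convolution compresses the visual input, here is a back-of-the-envelope token count. The ViT patch size of 14 and a convolution stride equal to its kernel size are assumptions, not values stated in this post, and special tokens (such as image boundary markers) are ignored.

```python
def visual_tokens_per_frame(image_size=224, patch_size=14, conv_kernel=2):
    """Tokens left per frame after patchify plus a stride-matched conv downsample."""
    patches_per_side = image_size // patch_size          # 224 // 14 = 16
    tokens_per_side = patches_per_side // conv_kernel    # 16 // 2 = 8
    return tokens_per_side * tokens_per_side


per_frame = visual_tokens_per_frame()
total = 24 * per_frame          # 24 sampled frames
print(per_frame, total)         # 64 1536
```

Under these assumptions the convolution cuts each frame from 256 patch tokens to 64, a 4x reduction in the visual sequence that is concatenated with the text.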

The model architecture from the paper:
[Figure: CogVLM2-Video model architecture]

2. Training process

(1) Pre-training

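Data construction strategy: label open-source data with a model and then check it manually, plus synthetic data.
Because CogVLM2-Video loads the CogVLM model weights directly and is then fine-tuned with SFT, its pre-training is the pre-training of CogVLM, which has two stages:

Stage 1 optimizes an image captioning loss on image-text pairs.
Stage 2 pre-trains on both image captioning and REC data, where REC (referring expression comprehension) is a bounding-box prediction task; the loss is computed only on the bounding-box coordinates in the answer.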
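Computing the loss only on the coordinate answer is usually done by masking the prompt positions in the label sequence. The sketch below shows that idea in plain Python; the token ids are made up, `-100` is the ignore value used by convention in common training frameworks, and the one-position shift between inputs and labels is omitted for brevity. It illustrates the masking idea and is not the CogVLM training code.

```python
import math

IGNORE_INDEX = -100   # conventional "do not score this position" label


def build_labels(prompt_ids, answer_ids):
    """Mask prompt tokens so only answer (coordinate) tokens contribute to the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)


def masked_nll(token_probs, labels):
    """Average negative log-likelihood over unmasked positions only.

    token_probs[i] is the probability the model assigned to the true token at i.
    """
    scored = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(scored) / len(scored) if scored else 0.0


prompt_ids = [11, 12, 13]            # hypothetical ids for the question text
answer_ids = [201, 202, 203, 204]    # hypothetical ids for "[x1, y1, x2, y2]"
labels = build_labels(prompt_ids, answer_ids)
loss = masked_nll([0.9, 0.1, 0.5, 0.5, 0.5, 0.5, 0.5], labels)
```

The probabilities at the three prompt positions do not affect `loss`; only the four answer positions are averaged.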

(2) Post-training

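Data construction strategy:

A large amount of VQA data.
TQA data: a VLM describes the video frame by frame, and GPT-4o then filters the data (videos whose scene content changes little are filtered out), using the following prompt: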

We extracted several frames from this video and described each frame using an image caption model, stored in the dictionary variable 'image_captions: Dict[str:str]'. In 'image_captions', the key is the second at which the image appears in the video, and the value is a detailed description of the image at that moment. Our image captions may contain hallucinations and errors. If you find any information that seems incorrect, please ignore the erroneous information. image_captions={images_caption} Please determine whether there are significant scene changes in each second of the video based on the frame descriptions. If there are significant changes, output "Yes"; otherwise, output "No". For example, if the description continuously mentions a pool scene or a kitchen scene, then return "No". If the description first mentions an indoor scene and then a beach scene, then return "Yes". Output your final answers directly. Do not give out your reason.
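A small sketch of the bookkeeping around this filter prompt: building the per-second caption dictionary that fills `{images_caption}`, and turning the model's Yes/No reply into a keep/drop decision. The helper names are hypothetical, and the call to the language model itself is deliberately left out.

```python
def build_filter_input(frame_captions):
    """Format per-second captions as the 'image_captions' dict the prompt expects."""
    # Keys are seconds rendered as strings, ordered by time.
    return {str(sec): cap for sec, cap in sorted(frame_captions.items())}


def keep_video(llm_reply):
    """Keep a video only when the reply reports a significant scene change."""
    return llm_reply.strip().strip('"').lower().startswith("yes")


captions = {3: "A beach with waves.", 0: "A kitchen with a stove."}
image_captions = build_filter_input(captions)
```

Videos for which `keep_video` returns False would be dropped, matching the rule that clips with little scene change are filtered out.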

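Note that the training data includes temporal QA data (for example, asking at which second an action occurs).
The prompt used to construct the data (question-answer pairs for a video) is as follows: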

We extracted several frames from this video and described each frame using an image caption model, stored in the dictionary variable 'image_captions: Dict[str:str]'. In 'image_captions', the key is the second at which the image appears in the video, and the value is a detailed description of the image at that moment. You will play two roles: a human asking a question related to the biggest change in the video and an intelligent chatbot designed for video understanding. Your question and answer should be based on the information provided by 'image_captions'.
image_captions={images_caption}
Please generate the response in the form of a Python dictionary string with keys "Human" for question and "Bot" for answer. Each corresponding value should be the question and answer text respectively. For example, your response should look like this: {"Human": "Your question here...", "Bot": "Your answer here..."}. Do not mention variables like 'image_captions' in your response. Always generate the question and the answer as if you are directly looking at the video. The questions you generate should focus on the biggest scene change. Do not generate questions that can be easily answered from a single image. Here are some examples of what we expect:
Example 1: {"Human": "At what second does the girl appear?", "Bot": "The girl appears at the 3rd second in the video."}
Example 2: {"Human": "When does the video switch from the swimming pool to the grass?", "Bot": "At the 3rd second."}
Example 3: {"Human": "When does a significant scene change occur in the video? ", "Bot": "At the 10th second, the scene changes from the playground to the classroom."}
If you think it is not possible to generate such questions and answers from the given image captions, output "None" directly. Keep your answers concise and accurate and generate only one question and answer pair.
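Since the prompt asks for a Python dictionary string or the literal "None", the reply has to be parsed before it can become a training sample. The following is one possible parser, offered as a sketch; the function name `parse_qa` and the choice to reject replies with extra keys are assumptions.

```python
import ast
import json


def parse_qa(reply):
    """Parse a reply into {'Human': ..., 'Bot': ...}, or None if unusable."""
    text = reply.strip()
    if text.strip('"') == "None":
        return None                        # the model declined to produce a pair
    for loader in (json.loads, ast.literal_eval):
        try:
            qa = loader(text)
        except (ValueError, SyntaxError, TypeError):
            continue                       # try the next loader
        if isinstance(qa, dict) and set(qa) == {"Human", "Bot"}:
            return qa
    return None


good = parse_qa('{"Human": "At what second does the girl appear?", "Bot": "At the 3rd second."}')
```

Trying `json.loads` first and `ast.literal_eval` second covers both double-quoted and single-quoted dictionary strings without evaluating arbitrary code.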

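Training strategy:
(i) Post-training of CogVLM2 and GLM-4V:

Stage 1:

The model is trained on all visual question answering (VQA) training datasets plus roughly 300K alignment samples to strengthen its basic capabilities.
Training runs for 3000 iterations with a learning rate of 1e-5 and a global batch size of 2340.
The goal is to improve the model's ability to answer image-related questions while keeping its understanding of image content accurate.

Stage 2:

The model is trained (with DPO) on a subset of the VQA datasets plus roughly 50K preference alignment samples, to tune its output style toward human preferences.
The global batch size is reduced to 1150 and training runs for 750 steps.
The goal is to bring the output closer to human expression and preferences, improving usability and accuracy in real applications.
Note:
Both stages use full-parameter fine-tuning.
To improve training stability and quality, the vision encoder's parameters are also updated, with a learning rate set to 1/10 of the learning rate used for the other parameters.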
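The step counts and batch sizes above translate directly into the number of samples each stage consumes, and the 1/10 rule into per-group learning rates. The sketch below does that arithmetic; the group names are hypothetical, and only the stage-1 learning rate is used because the post does not state one for stage 2.

```python
def samples_seen(steps, global_batch_size):
    """Number of training samples consumed (with repetition) in one stage."""
    return steps * global_batch_size


stage1 = samples_seen(3000, 2340)   # 7,020,000
stage2 = samples_seen(750, 1150)    # 862,500

base_lr = 1e-5                      # stage-1 learning rate quoted above
param_groups = [
    {"name": "language_model_and_adapter", "lr": base_lr},
    {"name": "vision_encoder", "lr": base_lr / 10},   # 1/10 of the base rate
]
```

In a training framework, groups like these would typically be passed to the optimizer so that the vision encoder moves more slowly than the rest of the model.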

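(ii) SFT of CogVLM2-Video: instruction tuning (Instruction Tuning) and temporal grounding tuning (Temporal Grounding Tuning)

Instruction tuning:

This stage aims to improve the model's general video understanding, especially video captioning and question answering.
The model is trained on processed caption datasets and publicly available question-answering data.
The learning rate is 4e-6. The data comes mainly from the instruction-tuning data provided by VideoChat2, excluding the simple caption datasets.
A total of 330k video samples are used in this stage.

Temporal grounding tuning:

This stage trains CogVLM2-Video on temporal grounding problems, that is, understanding what happens at specific time points in the video.
The model is trained on the temporal grounding question answering (TQA) dataset with a learning rate of 1e-6.
After this training, CogVLM2-Video not only performs well on public benchmarks but also has the temporal question answering ability that most earlier video models lack.
Note: both stages use full-parameter fine-tuning.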

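II. Video understanding capability

CogVLM2-Video's performance on MVBench, VideoChatGPT-Bench, and the zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA):
[Figure: CogVLM2-Video benchmark results]
CogVLM2-Video on the MVBench dataset:

MVBench benchmark:
Most existing benchmarks focus on spatial understanding in static image tasks and neglect temporal understanding in dynamic videos. MVBench evaluates models on 20 video understanding tasks, which fall into nine categories. The figure below contrasts two kinds of understanding, spatial understanding (Spatial Understanding: Inferring from a single frame) and temporal understanding (Temporal Understanding: Reasoning based on entire video): each of the nine spatial categories is extended into one or more of the 20 temporal tasks. Each category contains several specific subtasks:
[Figure: MVBench task taxonomy]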

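Spatial understanding tasks (Spatial Understanding: Inferring from a single frame)

These tasks require reasoning from a single frame of the video:

Action:

Example: What's the man doing?
Description: recognize and describe the action in the image.

Object:

Example: What's on the table?
Description: recognize the objects in the image.

Position:

Example: Is the man on the stage?
Description: determine the position of an object or person in the image.

Count:

Example: How many chairs?
Description: count the objects of a given kind in the image.

Scene:

Example: Where's the man?
Description: recognize the scene in the image.

Pose:

Example: What's the man's pose?
Description: determine the pose of an object or person in the image.

Attribute:

Example: What color is the desk?
Description: describe the attributes of objects in the image.

Character:

Example: What are the subtitles?
Description: recognize and read text in the image.

Cognition:

Example: Why is the man singing in the canteen?
Description: perform higher-level understanding and reasoning to answer "why" questions.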

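Temporal understanding tasks (Temporal Understanding: Reasoning based on entire video)

These tasks require reasoning over the whole video:

Action:

Action Sequence: identify the order in which actions occur.
Action Antonym: distinguish an action from its opposite.
Action Prediction: predict an action that has not happened yet.
Unexpected Action: identify actions that occur unexpectedly.
Fine-grained Action: identify subtle differences between actions.

Object:

Object Shuffle: track objects as they are rearranged or change position.
Object Existence: judge whether a given object is present.
Object Interaction: identify interactions between objects.

Position:

Moving Direction: identify the direction in which an object or person moves.
Action Localization: locate the point in the video at which a specific action occurs.

Count:

Action Count: count how many times a specific action occurs.
Moving Count: count the moving objects or people.

Scene:

Scene Transition: identify scene changes and transition points.

Pose:

Fine-grained Pose: identify subtle changes in pose.

Attribute:

State Change: identify state changes of objects or people.
Moving Attribute: identify attribute changes of objects or people as they move.

Character:

Character Order: identify the order in which characters appear.

Cognition:

Episodic Reasoning: understand how the plot develops and how events relate to each other.
Egocentric Navigation: navigate and reason from the first-person viewpoint of the video.
Counterfactual Inference: reason about hypothetical "what if" scenarios.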
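Because every MVBench question is multiple choice, scoring reduces to per-task accuracy. The sketch below computes accuracy for each task and an unweighted mean over tasks; the record layout and the function name are assumptions, and treating the overall score as the mean of per-task accuracies is an assumption about the reporting convention rather than something stated in this post.

```python
from collections import defaultdict


def mvbench_style_accuracy(records):
    """Per-task accuracy and the unweighted mean over tasks.

    Each record is (task_name, predicted_option, correct_option).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for task, pred, gold in records:
        totals[task] += 1
        hits[task] += int(pred == gold)
    per_task = {task: hits[task] / totals[task] for task in totals}
    average = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, average


records = [
    ("Action Sequence", "A", "A"),
    ("Action Sequence", "B", "C"),
    ("Scene Transition", "D", "D"),
]
per_task, average = mvbench_style_accuracy(records)
```

Averaging per task rather than per question keeps tasks with many questions from dominating the overall score.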

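III. Experimental findings

Summaries of videos within 1 minute are relatively good.
Changing the default frame count (frame_num=24, the number of frames the model processes at once): the more frames, the worse the results.
Findings from segment-then-merge tests:
- For videos such as movies, merging per-segment results does not connect information across segments well (for example, clips from The Shawshank Redemption).
- Segment-then-summarize is also unsuitable for user dialogue scenarios.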
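For reference, a segment-then-merge setup of the kind tested above might look like the sketch below: cut the video into windows of at most one minute, summarize each window with the model, then join the summaries. The function names and the 60-second window are assumptions, and the per-segment model call is omitted.

```python
def split_into_segments(duration_sec, segment_sec=60):
    """Return (start, end) second ranges covering the whole video."""
    if duration_sec <= 0 or segment_sec <= 0:
        return []
    segments = []
    start = 0
    while start < duration_sec:
        end = min(start + segment_sec, duration_sec)
        segments.append((start, end))
        start = end
    return segments


def merge_summaries(summaries):
    """Naive merge: concatenate per-segment summaries in time order."""
    return " ".join(s.strip() for s in summaries if s.strip())


segments = split_into_segments(150)   # a 2.5-minute clip
```

Each segment is summarized in isolation here, so nothing carries a character or plot thread from one window to the next, which is consistent with the weak cross-segment continuity noted above.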

Reference

[1] https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat/files
[2] https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base
[3] https://cogvlm2-video.github.io/
[4] Technical blog: https://cogvlm2-video.github.io/
[5] Evaluation scripts: https://github.com/magic-research/PLLaVA/blob/main/README.md
[6] [Paper share] PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
[7] LVM video understanding models
[8] [CVPR2024 Highlight] MVBench, a comprehensive evaluation of multimodal video understanding: https://zhuanlan.zhihu.com/p/669658267