Meta AI在视频理解方面取得了令人瞩目的里程碑式成就,推出了LongVU,这是一种开创性的模型,能够理解以前对人工智能系统来说具有挑战性的长视频。 研究论文 "LongVU:用于长视频语言理解的时空自适应压缩 "提出了一种革命性的方法,使人工智能能够有效地处理和理解长达几分钟甚至一小时的视频,而这在以前是无法实现的。
多模态大语言模型(MLLM)在理解和分析视频内容方面取得了可喜的进展。 然而,受限于给定的上下文长度,处理长视频仍然是一项重大挑战。 为了解决这一限制,我们提出了一种时空自适应压缩机制 LongVU,以减少视频标记的数量,同时保留长视频的视觉细节。 我们的想法是利用跨模态查询和帧间依赖关系,自适应地减少视频中的时空冗余。 具体来说,我们利用 DINOv2 特征来删除相似度高的冗余帧。 然后,我们利用文本引导的跨模态查询来选择性地减少帧特征。 此外,我们还根据帧与帧之间的时间依赖关系,对帧进行空间标记缩减。 我们的自适应压缩策略在有限的上下文长度内有效地处理了大量帧,几乎没有损失任何视觉信息。 在各种视频理解基准测试中,我们的 LongVU 始终超越现有方法,尤其是在长达一小时的视频理解任务(如 VideoMME 和 MLVU)中。 在轻量级 LLM 的情况下,我们的 LongVU 还能有效地扩展到更小的规模,并具有最先进的视频理解性能。
LongVU 架构
LongVU 的结构。 给定一个密集采样的视频帧,我们首先利用 DINOv2 去除冗余帧,然后融合 SigLIP 和 DINOv2 的剩余帧特征。 然后,我们通过跨模态查询有选择地减少视觉标记。 最后,我们基于时间依赖性进行空间标记压缩,以进一步满足 LLM 的有限上下文长度。
示例
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (DEFAULT_IMAGE_TOKEN,IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (KeywordsStoppingCriteria,process_images,tokenizer_image_token,
)
from decord import cpu, VideoReadertokenizer, model, image_processor, context_len = load_pretrained_model("./checkpoints/longvu_qwen", None, "cambrian_qwen",
)model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
video = []
for frame_index in frame_indices:img = vr[frame_index].asnumpy()video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():output_ids = model.generate(input_ids,images=video,image_sizes=image_sizes,do_sample=False,temperature=0.2,max_new_tokens=128,use_cache=True,stopping_criteria=[stopping_criteria],)
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
Github:https://github.com/Vision-CAIR/LongVU
如何 24GB VRAM 运行
https://github.com/Vision-CAIR/LongVU/issues/6
# git clone https://github.com/Vision-CAIR/LongVU
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (DEFAULT_IMAGE_TOKEN,IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (KeywordsStoppingCriteria,process_images,tokenizer_image_token,
)
from decord import cpu, VideoReadertokenizer, model, image_processor, context_len = load_pretrained_model("Vision-CAIR/LongVU_Qwen2_7B", model_base=None,model_name="cambrian_qwen",device="cuda:0"
)model.eval()
video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
# frame_indices = np.array([i for i in range(0, len(vr), round(fps),)])
num_frames = 1000 if len(vr) > 1000 else len(vr)
frame_indices = np.array([i for i in range(0, num_frames, round(fps),)])video = []
for frame_index in frame_indices:img = vr[frame_index].asnumpy()video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
# with torch.inference_mode():
# output_ids = model.generate(
# input_ids,
# images=video,
# image_sizes=image_sizes,
# do_sample=False,
# temperature=0.2,
# max_new_tokens=128,
# use_cache=True,
# stopping_criteria=[stopping_criteria],
# )
attention_mask = torch.ones_like(input_ids)
with torch.inference_mode():output_ids = model.generate(input_ids,attention_mask=attention_mask,images=video,image_sizes=image_sizes,do_sample=True,temperature=0.2,pad_token_id=tokenizer.eos_token_id,max_new_tokens=512,use_cache=True,stopping_criteria=[stopping_criteria],)
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
输出:
‘The video begins with a scene featuring two characters in an animated setting, one dressed in a bright yellow and red outfit with a mask, and the other in a blue and white traditional robe, standing on a rocky terrain with a green, leaf-like structure and a mountainous backdrop. The character in the yellow and red outfit is seen making a gesture with their right hand, while the other character appears to be speaking or reacting to the first character. The scene then transitions to a misty, ethereal environment where the same two characters are now standing on a staircase leading to a building with a golden roof, surrounded by smoke or clouds. The character in the yellow and red outfit is now holding a sword, while the other character is holding a fan, and both are looking up at the building. The scene shifts again to a large, ornate building with a golden roof, where a figure in a white and red outfit is seen descending a staircase, with smaller figures in white and red attire standing on the steps, and a large, white, cloud-like object in the foreground. The final scene shows the same building with the figure in white and red now seated on a golden throne, surrounded by smaller figures in white and red, and a large, white, cloud-like object still in the foreground, suggesting a ceremonial or significant event taking place.’