Published: 2025-10-11

Updated: 2025-11-27

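⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use this for serious academic work; it is intended only for screening papers before reading them.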

Updated 2025-10-11

UniVideo: Unified Understanding, Generation, and Editing for Videos

Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

Summary

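This paper introduces UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo uses a dual-stream design that pairs a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design allows accurate interpretation of complex multimodal instructions while preserving visual consistency. On top of this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is trained jointly across them. Experiments show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. The unified design also enables two forms of generalization. First, it supports task composition, such as combining editing with style transfer in a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, in which the MLLM interprets visual prompts and guides the MMDiT during synthesis.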

Key Takeaways

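UniVideo is a versatile framework that extends unified modeling to the video domain and uses a dual-stream design.
It combines a Multimodal Large Language Model (MLLM) with a Multimodal DiT (MMDiT).
It interprets complex multimodal instructions accurately while preserving visual consistency.
It unifies video generation and editing tasks and is trained jointly across them.
It matches or surpasses state-of-the-art task-specific baselines on several video generation and editing tasks.
It supports task composition, such as combining editing with style transfer.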
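
The dual-stream flow summarized above can be pictured with a toy pipeline. This is only an illustrative sketch under loud assumptions, not UniVideo's implementation: `Instruction`, `understand`, `generate`, and `run` are hypothetical names, both streams are replaced by trivial stand-ins, and the edit/generate rule is invented for the demo. It shows the data flow only: one instruction format covers several tasks, an understanding stream turns the instruction into conditioning, and a generation stream consumes that conditioning.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instruction:
    """One request format for every task (hypothetical structure)."""
    text: str
    images: List[str] = field(default_factory=list)  # placeholder identifiers
    videos: List[str] = field(default_factory=list)  # placeholder identifiers


def understand(instr: Instruction) -> Dict[str, object]:
    """Stand-in for the MLLM stream: turn the instruction into conditioning."""
    return {
        "words": instr.text.lower().split(),
        "num_visual_inputs": len(instr.images) + len(instr.videos),
    }


def generate(cond: Dict[str, object], instr: Instruction) -> str:
    """Stand-in for the MMDiT stream: returns a description, not a video."""
    # Assumption for the demo: a request that carries a video is an edit.
    mode = "edit" if instr.videos else "generate"
    return f"{mode} video from {cond['num_visual_inputs']} visual input(s)"


def run(instr: Instruction) -> str:
    """Both tasks go through the same two calls; only the instruction differs."""
    return generate(understand(instr), instr)


text_to_video = Instruction(text="A cat surfing at sunset")
# Task composition: two operations expressed in a single instruction.
composed_edit = Instruction(
    text="Replace the car with a bike and render it as watercolor",
    videos=["clip_0"],
)
print(run(text_to_video))
print(run(composed_edit))
```

The point of the sketch is that no per-task code path exists: text-to-video and a composed edit differ only in the contents of the `Instruction`.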

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Authors: Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with an LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signal scene changes detected in the video.

Summary

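This paper studies text-based summarization of video content. Because long-form video is dense and high-dimensional, a video captioner operating on shorter chunks of the video progressively builds a text-based memory. To improve an activity log composed solely of short video captions, the authors enrich the memory using Vision Language Models (VLMs), so that the system can understand and answer questions about the video. Partitioning the video into meaningful segments and adding static scene descriptions makes the caption log more detailed and complete. Finally, the LaViLa video captioner is fine-tuned to produce both action and scene captions, which makes the captioning pipeline significantly more efficient than using separate captioning models for the two tasks. The resulting controllable hybrid captioner alternates between caption types according to special input tokens that signal scene changes in the video.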

Key Takeaways

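Video data, especially long-form video, is dense and high-dimensional; text summaries represent query-relevant content more compactly.
A text memory built from short video chunks is central to understanding the video and answering questions about it.
Video captions tend to focus on human actions, while questions may concern other information in the scene, so the memory is enriched with static scene descriptions.
Combining Vision Language Models (VLMs) with the LaViLa video captioner improves understanding of video content.
Through special input tokens, the controllable hybrid captioner produces different caption types according to scene changes in the video.
Partitioning the video into meaningful segments reflects the structure of the video content more accurately.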
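
The token-controlled captioning loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the control tokens `<action>` and `<scene>`, the function `build_caption_log`, and `toy_captioner` are all hypothetical, and the rule "emit a scene caption before the action caption whenever a scene change is flagged" is one plausible reading of the abstract. A real system would call the fine-tuned captioner where the toy function is used.

```python
from typing import Callable, List, Tuple

ACTION_TOKEN = "<action>"  # hypothetical control token
SCENE_TOKEN = "<scene>"    # hypothetical control token

# A captioner takes (chunk, control_token) and returns a caption string.
Captioner = Callable[[str, str], str]


def build_caption_log(
    chunks: List[str],
    scene_changed: List[bool],
    captioner: Captioner,
) -> List[Tuple[int, str]]:
    """Caption chunks in order, adding a scene caption at each scene change."""
    if len(chunks) != len(scene_changed):
        raise ValueError("chunks and scene_changed must have the same length")
    log: List[Tuple[int, str]] = []
    for index, (chunk, changed) in enumerate(zip(chunks, scene_changed)):
        if changed:
            # Same model, different control token: describe the static scene.
            log.append((index, captioner(chunk, SCENE_TOKEN)))
        log.append((index, captioner(chunk, ACTION_TOKEN)))
    return log


def toy_captioner(chunk: str, control_token: str) -> str:
    """Stand-in for the fine-tuned captioner; only echoes its inputs."""
    kind = "scene" if control_token == SCENE_TOKEN else "action"
    return f"{kind} caption for {chunk}"


log = build_caption_log(
    ["chunk0", "chunk1", "chunk2"], [True, False, True], toy_captioner
)
for index, caption in log:
    print(index, caption)
```

The resulting list plays the role of the text memory: it can be joined into a prompt and handed to an LLM together with the user's question.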

Think With Videos For Agentic Long-Video Understanding

Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of “thinking with video”, which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer’s significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).

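Summary

Handling long videos is a challenge in video understanding. Existing methods either sacrifice fine-grained detail by downsampling frames, or rely on textual reasoning over task-agnostic representations, which hinders task-specific perception and exploration. This work proposes VideoExplorer, a framework built on the principle of "thinking with video" that combines planning, temporal grounding, and scalable perception into a coherent reasoning process. VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches a final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of training resources for long-video understanding, the authors build a long-video reasoning dataset using difficulty-adaptive sampling and design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization. Evaluations show a significant advantage over existing baselines, along with robustness, adaptability, and efficiency.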
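
The iterative plan, ground, perceive loop described above might look roughly like the sketch below. It is a toy under stated assumptions, not the VideoExplorer code: `explore`, the `("ask", ...)` / `("answer", ...)` step convention, and the three `toy_*` stand-ins are hypothetical names chosen for illustration, and real components would be model calls rather than fixed functions.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[float, float]    # (start_seconds, end_seconds)
Note = Tuple[str, Span, str]  # (sub_question, span, observation)
Step = Tuple[str, str]        # ("ask", sub_question) or ("answer", text)


def explore(
    question: str,
    planner: Callable[[str, List[Note]], Step],
    grounder: Callable[[str], Span],
    perceiver: Callable[[Span, str], str],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Note]]:
    """Alternate planning, temporal grounding, and perception until answered."""
    notes: List[Note] = []
    for _ in range(max_steps):
        kind, content = planner(question, notes)
        if kind == "answer":
            return content, notes
        span = grounder(content)               # locate the relevant moment
        observation = perceiver(span, content)  # look only at that span
        notes.append((content, span, observation))
    return None, notes  # step budget exhausted without an answer


def toy_planner(question: str, notes: List[Note]) -> Step:
    if not notes:
        return ("ask", "When does the chef start cooking?")
    return ("answer", f"Based on {len(notes)} observation(s): {notes[-1][2]}")


def toy_grounder(sub_question: str) -> Span:
    return (120.0, 150.0)


def toy_perceiver(span: Span, sub_question: str) -> str:
    return f"observed {span[0]:.0f}s-{span[1]:.0f}s"


answer, notes = explore(
    "What dish is prepared?", toy_planner, toy_grounder, toy_perceiver
)
print(answer)
```

Keeping the accumulated `notes` as the only state makes each reasoning trajectory easy to log and inspect, which fits the summary's emphasis on interpretable reasoning.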

Key Takeaways