Published: 2025-11-26

Updated: 2025-11-27


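⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading the papers.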

Updated 2025-11-26

ReAlign: Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Authors:Wanjiang Weng, Xiaofeng Tan, Junbo Wang, Guo-Sen Xie, Pan Zhou, Hongsong Wang

Text-to-motion generation, which synthesizes 3D human motions from text inputs, holds immense potential for applications in gaming, film, and robotics. Recently, diffusion-based methods have been shown to generate more diversity and realistic motion. However, there exists a misalignment between text and motion distributions in diffusion models, which leads to semantically inconsistent or low-quality motions. To address this limitation, we propose Reward-guided sampling Alignment (ReAlign), comprising a step-aware reward model to assess alignment quality during the denoising sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Extensive experiments of both motion generation and retrieval tasks demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.

Paper and project links

PDF Accepted by AAAI 2026

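Summary

Text-to-motion generation synthesizes 3D human motion from text input and has strong potential in gaming, film, and robotics. To address the misalignment between text and motion distributions in diffusion models, the paper proposes Reward-guided sampling Alignment (ReAlign). ReAlign pairs a step-aware reward model, which assesses alignment quality during denoising sampling, with a reward-guided strategy that steers the diffusion process toward an optimally aligned distribution. The reward model combines a text-aligned module for semantic consistency with a motion-aligned module for realism. Experiments on motion generation and retrieval show that the approach significantly improves text-motion alignment and motion quality over existing state-of-the-art methods.

Key Takeaways

Text-to-motion generation has broad application prospects in gaming, film, and robotics.
Diffusion-based methods generate more diverse and realistic motion, but suffer from a misalignment between text and motion distributions.
Reward-guided sampling Alignment (ReAlign) is proposed to address this misalignment.
ReAlign consists of a step-aware reward model that assesses alignment quality and a reward-guided strategy that directs the diffusion process.
The reward model combines a text-aligned module and a motion-aligned module to improve semantic consistency and motion realism.
Noisy motions are refined at each timestep to balance probability density and alignment.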
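
The reward-guided sampling idea summarized above can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation: the denoiser, the reward function, the target value, and the `guidance` scale are all hypothetical stand-ins chosen so the example runs with the standard library. It only shows the general pattern of alternating an ordinary denoising update (probability density) with a nudge along the gradient of a timestep-dependent reward (alignment).

```python
def denoise_step(x, t):
    """Hypothetical denoiser: shrinks the sample toward 0, the toy data mean."""
    return x * (1.0 - 1.0 / (t + 1))


def reward(x, t, target=2.0):
    """Hypothetical step-aware reward: higher when x is near `target`.

    The weight depends on the timestep t, so noisy early steps
    (large t) influence the sample less than late steps.
    """
    weight = 1.0 / (t + 1)
    return -weight * (x - target) ** 2


def reward_grad(x, t, eps=1e-4):
    """Central finite-difference gradient of the reward with respect to x."""
    return (reward(x + eps, t) - reward(x - eps, t)) / (2 * eps)


def guided_sample(x_init, num_steps=50, guidance=0.5):
    """Run the toy sampler from t = num_steps down to t = 1."""
    x = x_init
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)                # ordinary denoising update
        x = x + guidance * reward_grad(x, t)  # nudge toward higher reward
    return x


if __name__ == "__main__":
    print("unguided:", guided_sample(5.0, guidance=0.0))
    print("guided:  ", guided_sample(5.0, guidance=0.5))
```

With `guidance=0.0` the sampler reduces to plain denoising. A positive value trades some fidelity to the denoiser's distribution for a final sample closer to the hypothetical reward target.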

FineXtrol: Controllable Motion Generation via Fine-Grained Text

Authors:Keming Shen, Bizhu Wu, Junliang Chen, Xiaoqin Wang, Linlin Shen

Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.

Paper and project links

PDF 20 pages, 14 figures, AAAI 2026

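Summary

The paper proposes FineXtrol, a control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, fine-grained textual control signals that describe specific body part movements over time. It targets two weaknesses of existing methods: detailed texts produced by large language models (LLMs) often introduce misaligned details and lack explicit temporal cues, and global 3D coordinate sequences incur significant computational cost when converted to standard motion representations. To support the framework, a hierarchical contrastive learning module encourages the text encoder to produce more discriminative embeddings for the new control signals, improving motion controllability. Quantitative results show strong performance in controllable motion generation, and qualitative analysis shows flexibility in directing specific body part movements.

Key Takeaways

FineXtrol is a control framework for efficient motion generation, driven by precise, temporally-aware, user-friendly, fine-grained textual control signals.
It addresses the misaligned details introduced by LLM-generated texts and the computational cost of converting global 3D coordinate sequences.
A hierarchical contrastive learning module makes the text encoder's embeddings more discriminative, improving motion controllability.
Quantitative results show that FineXtrol achieves strong performance in controllable motion generation.
Qualitative analysis demonstrates its flexibility in directing specific body part movements.
The control signals describe specific body part movements over time, supplying the explicit temporal cues that LLM-expanded texts lack.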

HunyuanVideo 1.5 Technical Report

Authors:Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, Linus, Patrol, Peizhen Zhang, Peng Chen, Penghao Zhao, Qi Tian, Songtao Liu, Weijie Kong, Weiyan Wang, Xiao He, Xin Li, Xinchi Deng, Xuefei Zhe, Yang Li, Yanxin Long, Yuanbo Peng, Yue Wu, Yuhong Liu, Zhenyu Wang, Zuozhuo Dai, Bo Peng, Coopers Li, Gu Gong, Guojian Xiao, Jiahe Tian, Jiaxin Lin, Jie Liu, Jihong Zhang, Jiesong Lian, Kaihang Pan, Lei Wang, Lin Niu, Mingtao Chen, Mingyang Chen, Mingzhe Zheng, Miles Yang, Qiangqiang Hu, Qi Yang, Qiuyong Xiao, Runzhou Wu, Ryan Xu, Rui Yuan, Shanshan Sang, Shisheng Huang, Siruis Gong, Shuo Huang, Weiting Guo, Xiang Yuan, Xiaojia Chen, Xiawei Hu, Wenzhi Sun, Xiele Wu, Xianshun Ren, Xiaoyan Yuan, Xiaoyue Mi, Yepeng Zhang, Yifu Sun, Yiting Lu, Yitong Li, You Huang, Yu Tang, Yixuan Li, Yuhang Deng, Yuan Zhou, Zhichao Hu, Zhiguang Liu, Zhihe Yang, Zilin Yang, Zhenzhi Lu, Zixiang Zhou, Zhao Zhong

We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

Paper and project links

PDF

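Summary

HunyuanVideo 1.5 is a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. It builds on several key components: meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Together these yield a unified framework for high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Experiments show that the model sets a new state of the art among open-source video generation models. The code and model weights are released at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5, lowering the barrier to video creation and research.

Key Takeaways

HunyuanVideo 1.5 is a lightweight open-source video generation model that runs efficient inference on consumer-grade GPUs.
It achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters.
Its key components are data curation, a DiT architecture with SSTA, glyph-aware text encoding, progressive pre-training and post-training, and a video super-resolution network.
It supports text-to-video and image-to-video generation across multiple durations and resolutions.
It establishes a new state of the art among open-source video generation models.
The code and model weights are public, giving the community a high-performance foundation and lowering the barrier to video creation and research.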

VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models

Authors:Fufangchen Zhao, Liao Zhang, Daiqi Shi, Yuanjun Gao, Chen Ye, Yang Cai, Jian Gao, Danfeng Yan

We propose VideoPerceiver, a novel video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, addressing VMLLMs’ limited ability to reason about brief actions in short clips or rare transient events in long videos. VideoPerceiver adopts a two-stage training framework. During supervised fine-tuning (SFT), we construct “key-information-missing” videos by extracting event-action keywords from captions, identifying corresponding key frames, and replacing them with adjacent frames. We jointly encode original and modified video tokens with text tokens, aligning intermediate visual representations with keywords via an auxiliary contrastive loss to enhance sensitivity to fine-grained motion cues. In reinforcement learning (RL), both video variants are fed into the model to generate descriptions, and a novel relative reward ensures responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. We also curate a dataset of 80,000 videos with fine-grained actions and transient events. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks, while maintaining strong performance on standard tasks. By prioritizing task-relevant visual features, our work redefines video-language model training for fine-grained perception.

Paper and project links

PDF

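Summary

VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding, targeting VMLLMs' limited ability to reason about brief actions in short clips and rare transient events in long videos. It uses a two-stage training framework. In supervised fine-tuning (SFT), "key-information-missing" videos are built by replacing key frames with adjacent frames, and an auxiliary contrastive loss aligns intermediate visual representations with event-action keywords to increase sensitivity to fine-grained motion cues. In reinforcement learning (RL), a relative reward ensures that responses from complete videos outperform those from degraded inputs, explicitly training the model to recover temporally precise action details. Experiments show that VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks while maintaining strong performance on standard tasks.

Key Takeaways

VideoPerceiver is a video multimodal large language model designed to improve fine-grained perception in video understanding.
It is trained in two stages: supervised fine-tuning (SFT) followed by reinforcement learning (RL).
During SFT, "key-information-missing" videos are constructed to make the model more sensitive to fine-grained motion cues.
An auxiliary contrastive loss aligns intermediate visual representations with event-action keywords.
During RL, a relative reward ensures that responses from complete videos outperform those from degraded inputs, training the model to recover temporally precise action details.
The authors also curate a dataset of 80,000 videos with fine-grained actions and transient events, and report substantial gains over state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning.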
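
The two training signals described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's code: `make_key_info_missing`, `relative_reward`, the nearest-neighbour replacement rule, and the `margin` parameter are hypothetical, and the key-frame indices and description scores are assumed to come from upstream steps that are not shown.

```python
def make_key_info_missing(frames, key_indices):
    """Replace each key frame with its nearest non-key neighbour.

    `frames` is any list of frame objects; `key_indices` holds the
    positions judged (by an upstream step, not shown) to carry the
    key action. Ties are broken in favour of the earlier frame.
    """
    keys = set(key_indices)
    candidates = [i for i in range(len(frames)) if i not in keys]
    if not candidates:
        raise ValueError("every frame is a key frame; nothing to copy from")
    degraded = list(frames)
    for i in keys:
        nearest = min(candidates, key=lambda j: (abs(j - i), j))
        degraded[i] = frames[nearest]
    return degraded


def relative_reward(score_full, score_degraded, margin=0.0):
    """Hypothetical relative reward: positive only when the description
    generated from the full video scores higher than the one generated
    from the degraded video by more than `margin`."""
    return max(0.0, score_full - score_degraded - margin)


if __name__ == "__main__":
    frames = ["f0", "f1", "KEY", "f3", "f4"]
    print(make_key_info_missing(frames, [2]))
    print(relative_reward(0.8, 0.5))
```

The degraded clip keeps the same length as the original, so the two variants differ only in the frames that carried the key action.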

Gaussian process priors with Markov properties for effective reproduction number inference

Authors:Jessalyn N. Sebastian, Volodymyr M. Minin

Many quantities characterizing infectious disease outbreaks - like the effective reproduction number ($R_t$), defined as the average number of secondary infections a newly infected individual will cause over the course of their infection - need to be modeled as time-varying parameters. It is common practice to use Gaussian random walks as priors for estimating such functions in Bayesian analyses of pathogen surveillance data. In this setting, however, the random walk prior may be too permissive, as it fails to capture prior scientific knowledge about the estimand and results in high posterior variance. We propose several Gaussian Markov process priors for $R_t$ inference, including the Integrated Brownian Motion (IBM), which can be represented as a Markov process when augmented with its corresponding Brownian Motion component, and is therefore computationally efficient and simple to implement and tune. We use simulated outbreak data to compare the performance of these proposed priors with the Gaussian random walk prior and another state-of-the-art Gaussian process prior based on an approximation to a Matérn covariance function. We find that IBM can match or exceed the performance of other priors, and we show that it produces epidemiologically reasonable and precise results when applied to county-level SARS-CoV-2 data.

Paper and project links

PDF 19 pages, 5 figures, 2 tables in the main text

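Summary

The paper studies how to model time-varying quantities that characterize infectious disease outbreaks, in particular the effective reproduction number ($R_t$). Gaussian random walks are commonly used as priors in Bayesian analyses of pathogen surveillance data, but the authors argue that this prior can be too permissive: it fails to capture prior scientific knowledge about the estimand and results in high posterior variance. They propose several Gaussian Markov process priors for $R_t$ inference, including Integrated Brownian Motion (IBM), which can be represented as a Markov process when augmented with its Brownian Motion component. On simulated outbreak data, IBM matches or exceeds the performance of the random walk prior and of a Gaussian process prior based on an approximation to a Matérn covariance function, and it produces epidemiologically reasonable and precise results on county-level SARS-CoV-2 data.

Key Takeaways

The effective reproduction number ($R_t$) is a key quantity characterizing outbreaks and needs to be modeled as a time-varying parameter.
Gaussian random walk priors, though common, may be too permissive in this setting.
Integrated Brownian Motion (IBM) can be represented as a Markov process once augmented with its Brownian Motion component, which makes it computationally efficient and simple to implement and tune.
On simulated outbreak data, IBM matches or exceeds the performance of the other priors compared.
The random walk prior fails to capture prior scientific knowledge about the estimand and yields high posterior variance, which motivates the proposed priors.
Applied to county-level SARS-CoV-2 data, IBM produces epidemiologically reasonable and precise results.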
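
To make the Markov representation concrete, here is a minimal simulation sketch of Integrated Brownian Motion using its standard two-component state: the process `x` together with its Brownian Motion derivative `v`. The transition mean and covariance are the textbook ones for this process. The function names, the step size, the value of `sigma`, and any reading of `x` as a transformed $R_t$ trajectory are assumptions for illustration and are not taken from the paper.

```python
import math
import random


def ibm_step(x, v, dt, sigma, rng):
    """Draw (x, v) at time t + dt given (x, v) at time t.

    v is Brownian motion with scale sigma and x is its time integral.
    The pair is Markov with a Gaussian transition:
      mean = (x + v * dt, v)
      cov  = sigma^2 * [[dt^3 / 3, dt^2 / 2], [dt^2 / 2, dt]]
    """
    a = dt ** 3 / 3.0  # variance factor of x
    b = dt ** 2 / 2.0  # covariance factor of (x, v)
    c = dt             # variance factor of v
    # Cholesky factor of the 2x2 covariance, written out by hand.
    l11 = sigma * math.sqrt(a)
    l21 = sigma * b / math.sqrt(a)
    l22 = sigma * math.sqrt(c - b * b / a)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x_new = x + v * dt + l11 * z1
    v_new = v + l21 * z1 + l22 * z2
    return x_new, v_new


def simulate_ibm(num_steps, dt=1.0, sigma=0.05, seed=0):
    """Simulate one IBM path starting from x = 0, v = 0."""
    rng = random.Random(seed)
    x, v = 0.0, 0.0
    path = [x]
    for _ in range(num_steps):
        x, v = ibm_step(x, v, dt, sigma, rng)
        path.append(x)
    return path


if __name__ == "__main__":
    print(simulate_ibm(10))
```

Because each step needs only the current pair `(x, v)`, simulating or evaluating the prior costs time linear in the number of time points, which is the computational advantage the abstract attributes to the Markov representation.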

Zero-Shot Video Deraining with Video Diffusion Models

Authors:Tuomas Varanka, Juan Luis Gonzalez, Hyeongwoo Kim, Pablo Garrido, Xu Yao

Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works in fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that does not require synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of diffusion models, its reconstruction process can be intervened and pushed away from the model’s concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found is crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.

Paper and project links

PDF WACV 2026

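Summary

The paper introduces a zero-shot video deraining method for complex dynamic scenes. It requires neither synthetic data nor model fine-tuning. Instead, it leverages a pretrained text-to-video diffusion model: the input video is inverted into the model's latent space, and negative prompting pushes the reconstruction away from the model's concept of rain. At the core of the method is an attention switching mechanism that preserves dynamic backgrounds and structural consistency between the input and the derained video, mitigating the artifacts introduced by naive negative prompting. Experiments on real-world rain datasets show substantial improvements over prior methods and robust generalization without supervised training.

Key Takeaways

The method performs zero-shot video deraining and targets complex dynamic scenes.
It needs neither synthetic data nor model fine-tuning, relying on a pretrained text-to-video diffusion model.
Negative prompting pushes the reconstruction of the inverted input video away from the model's concept of rain.
An attention switching mechanism is the core of the method, maintaining dynamic backgrounds and structural consistency.
The mechanism mitigates artifacts introduced by naive negative prompting.
Experiments on real-world rain datasets show substantial improvements over prior methods.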
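
Negative prompting is commonly implemented as a variant of classifier-free guidance in which the negative-prompt prediction replaces the unconditional one. The sketch below shows that common combination rule on plain Python lists. Whether the paper uses exactly this form is an assumption, and `guided_noise`, its arguments, and the example values are hypothetical.

```python
def guided_noise(eps_positive, eps_negative, scale):
    """Combine two noise predictions element-wise.

    eps_positive: prediction conditioned on the desired (positive) prompt
    eps_negative: prediction conditioned on the negative prompt, e.g. "rain"
    scale:        guidance scale; values above 1 extrapolate away from
                  the negative-prompt prediction
    """
    return [n + scale * (p - n) for p, n in zip(eps_positive, eps_negative)]


if __name__ == "__main__":
    eps_pos = [1.0, 0.0]  # toy stand-in for a model output
    eps_neg = [0.0, 0.0]
    print(guided_noise(eps_pos, eps_neg, scale=2.0))
```

With `scale=1.0` the result equals the positive-prompt prediction. Larger values move each denoising step further from what the model associates with the negative prompt, which is also where artifacts from naive negative prompting tend to appear.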

MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning

Authors:Yi-Yang Zhang, Tengjiao Sun, Pengcheng Fang, Deng-Bao Wang, Xiaohao Cai, Min-Ling Zhang, Hansung Kim

3D Human motion generation is pivotal across film, animation, gaming, and embodied intelligence. Traditional 3D motion synthesis relies on costly motion capture, while recent work shows that 2D videos provide rich, temporally coherent observations of human behavior. Existing approaches, however, either map high-level text descriptions to motion or rely solely on video conditioning, leaving a gap between generated dynamics and real-world motion statistics. We introduce MotionDuet, a multimodal framework that aligns motion generation with the distribution of video-derived representations. In this dual-conditioning paradigm, video cues extracted from a pretrained model (e.g., VideoMAE) ground low-level motion dynamics, while textual prompts provide semantic intent. To bridge the distribution gap across modalities, we propose Dual-stream Unified Encoding and Transformation (DUET) and a Distribution-Aware Structural Harmonization (DASH) loss. DUET fuses video-informed cues into the motion latent space via unified encoding and dynamic attention, while DASH aligns motion trajectories with both distributional and structural statistics of video features. An auto-guidance mechanism further balances textual and visual signals by leveraging a weakened copy of the model, enhancing controllability without sacrificing diversity. Extensive experiments demonstrate that MotionDuet generates realistic and controllable human motions, surpassing strong state-of-the-art baselines.

Paper and project links

PDF

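Summary

The paper introduces MotionDuet, a multimodal framework that conditions 3D human motion generation on both text and video so that generated dynamics better match real-world motion statistics. Video cues extracted from a pretrained model (e.g., VideoMAE) ground low-level motion dynamics, while textual prompts provide semantic intent. Dual-stream Unified Encoding and Transformation (DUET) fuses video-informed cues into the motion latent space via unified encoding and dynamic attention, and a Distribution-Aware Structural Harmonization (DASH) loss aligns motion trajectories with both distributional and structural statistics of video features. Experiments show that MotionDuet generates realistic and controllable human motions and surpasses strong state-of-the-art baselines.

Key Takeaways

MotionDuet is a multimodal framework for human motion generation that combines text descriptions with video conditioning.
It aims to align motion generation with the distribution of video-derived representations, closing the gap to real-world motion statistics.
DUET fuses video-informed cues into the motion latent space through unified encoding and dynamic attention.
The DASH loss aligns motion trajectories with both distributional and structural statistics of video features.
An auto-guidance mechanism, which uses a weakened copy of the model, balances textual and visual signals and enhances controllability without sacrificing diversity.
Experiments show that MotionDuet surpasses strong state-of-the-art baselines in generating realistic and controllable human motion.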

Spotlight: Identifying and Localizing Video Generation Errors Using VLMs

Authors:Aditya Chinchure, Sahithya Ravi, Pushkar Shukla, Vered Shwartz, Leonid Sigal

Current text-to-video models (T2V) can generate high-quality, temporally coherent, and visually realistic videos. Nonetheless, errors still often occur, and are more nuanced and local compared to the previous generation of T2V models. While current evaluation paradigms assess video models across diverse dimensions, they typically evaluate videos holistically without identifying when specific errors occur or describing their nature. We address this gap by introducing Spotlight, a novel task aimed at localizing and explaining video-generation errors. We generate 600 videos using 200 diverse textual prompts and three state-of-the-art video generators (Veo 3, Seedance, and LTX-2), and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. We observe that adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors manifest in shorter segments. We then evaluate current VLMs on Spotlight and find that VLMs lag significantly behind humans in error identification and localization in videos. We propose inference-time strategies to probe the limits of current VLMs on our task, improving performance by nearly 2x. Our task paves a way forward to building fine-grained evaluation tools and more sophisticated reward models for video generators.

Paper and project links

PDF

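Summary

Current text-to-video (T2V) models generate high-quality, temporally coherent, visually realistic videos, yet errors still occur and are more nuanced and local than in earlier models. The paper introduces Spotlight, a task aimed at localizing and explaining video-generation errors. The authors generate 600 videos from 200 textual prompts with three video generators (Veo 3, Seedance, and LTX-2) and annotate over 1600 fine-grained errors across six types, including motion, physics, and prompt adherence. Adherence and physics errors are predominant and persist across longer segments, whereas appearance-disappearance and body pose errors appear in shorter segments. Current vision-language models (VLMs) lag significantly behind humans at identifying and localizing these errors, and the proposed inference-time strategies improve performance by nearly 2x. The task points toward fine-grained evaluation tools and more sophisticated reward models for video generators.

Key Takeaways

Current T2V models produce high-quality videos, but errors remain and have become more nuanced and local.
Spotlight is a new task aimed at localizing and explaining video-generation errors.
The benchmark contains 600 videos with over 1600 annotated errors across six types, including motion, physics, and prompt adherence.
Adherence and physics errors are predominant and span longer segments, whereas appearance-disappearance and body pose errors occur in shorter segments.
Current VLMs lag significantly behind humans in error identification and localization.
The proposed inference-time strategies improve VLM performance on the task by nearly 2x.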
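
Localizing an error means predicting when it occurs, so a prediction has to be compared against an annotated time span. One common way to score such a comparison is temporal intersection-over-union, sketched below. The abstract does not state which localization metric Spotlight uses, so this choice, the function name, and the `(start, end)` segment format in seconds are assumptions for illustration only.

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 4.0)  # hypothetical model-predicted error span
    annotated = (2.0, 6.0)  # hypothetical human-annotated error span
    print(temporal_iou(predicted, annotated))
```

A score near 1 means the predicted span almost coincides with the annotated one, and a score of 0 means the two spans do not overlap.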

Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Authors:Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, Yu Kong

Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, while the instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction. Experiments on diverse benchmarks demonstrate that our method outperforms expert models in both instructional image and video generation, highlighting the strength of video diffusion models as a unified action-object state transformer.

Paper and project links

PDF Accepted by WACV 2026

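Summary

The paper proposes ShowMe, a unified framework for generating visual instructions in a given context. By selectively activating the spatial and temporal components of a video diffusion model, it handles both instructional image manipulation and video prediction, two tasks usually treated in isolation. Structure and motion consistency rewards improve structural fidelity and temporal coherence. The unification brings two benefits: spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits, and the instruction-guided manipulation stage gives the model stronger goal-oriented reasoning for video prediction.

Key Takeaways

ShowMe is a unified framework for generating visual instructions in a given context.
It selectively activates the spatial and temporal components of video diffusion models to support both image manipulation and video prediction.
Structure and motion consistency rewards improve structural fidelity and temporal coherence.
Spatial knowledge from video pretraining improves contextual consistency and realism in non-rigid image edits.
The instruction-guided manipulation stage equips the model with stronger goal-oriented reasoning for video prediction.
Experiments on diverse benchmarks show that the method outperforms expert models in both instructional image and video generation.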

Pressure2Motion: Hierarchical Human Motion Reconstruction from Ground Pressure with Text Guidance

Authors:Zhengxuan Li, Qinhui Yang, Yiyu Zhuang, Chuan Guo, Xinxin Zuo, Xiaoxiao Long, Yao Yao, Xun Cao, Qiu Shen, Hao Zhu

We present Pressure2Motion, a novel motion capture algorithm that reconstructs human motion from a ground pressure sequence and text prompt. At inference time, Pressure2Motion requires only a pressure mat, eliminating the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminacy of pressure signals with respect to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint to resolve ambiguities. Specifically, our model adopts a dual-level feature extractor to accurately interpret pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion estimation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion reconstruction, and the established MPL benchmark is the first benchmark for this novel motion capture task. Experiments show that our method generates high-fidelity, physically plausible motions, establishing a new state of the art for this task. The codes and benchmarks will be publicly released upon publication.

Paper and project links

PDF

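Summary

Pressure2Motion is a motion capture algorithm that reconstructs human motion from a ground pressure sequence and a text prompt. At inference time it requires only a pressure mat, with no specialized lighting, cameras, or wearable devices. Because pressure signals are indeterminate with respect to full-body motion, the task is severely ill-posed. The generative model therefore takes pressure features as input and uses the text prompt as a high-level guiding constraint to resolve ambiguities. It adopts a dual-level feature extractor to interpret pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Physical cues from the pressure sequence and semantic guidance from the text jointly guide motion estimation. The authors describe the work as pioneering in combining pressure data with linguistic priors for motion reconstruction, and experiments show high-fidelity, physically plausible motions.

Key Takeaways

Pressure2Motion is a motion capture algorithm that reconstructs human motion from a ground pressure sequence and a text prompt.
It needs only a pressure mat at inference time, which suits privacy-preserving, low-light, and low-cost scenarios.
The task is severely ill-posed because pressure signals do not determine full-body motion. The text prompt serves as a high-level constraint that resolves the ambiguity.
A dual-level feature extractor interprets the pressure data, and a hierarchical diffusion model discerns broad-scale trajectories and subtle posture adjustments.
Physical cues from the pressure sequence and semantic guidance from the text are combined to guide motion estimation precisely.
To the authors' knowledge, Pressure2Motion is a pioneering work in using both pressure data and linguistic priors for motion reconstruction, and the MPL benchmark is the first benchmark for this task.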

Motion-R1: Enhancing Motion Generation with Decomposed Chain-of-Thought and RL Binding

Authors:Runqi Ouyang, Haoyun Li, Zhenyuan Zhang, Xiaofeng Wang, Zeyu Zhang, Zheng Zhu, Guan Huang, Sirui Han, Xingang Wang

Text-to-Motion generation has become a fundamental task in human-machine interaction, enabling the synthesis of realistic human motions from natural language descriptions. Although recent advances in large language models and reinforcement learning have contributed to high-quality motion generation, two major challenges remain. Existing approaches often fail to capture the temporal and causal complexities inherent in natural language, leading to oversimplified or incoherent motions. Additionally, RL-based methods are frequently overly complex, hindering their scalability and adaptability across various motion generation tasks. To address these challenges, we propose Motion-R1, a novel framework that combines decomposed Chain-of-Thought reasoning with reinforcement learning to enhance both the quality and interpretability of generated motions. Specifically, we introduce the Decomposed CoT Data Engine, which leverages an automated pipeline to synthesize high-quality reasoning data, allowing the model to better capture the temporal dependencies and causal relationships of human motion. We also propose RL Binding, a reinforcement learning strategy that incorporates multi-modal text-motion alignment into the RL reward function, guiding the model to produce motions that are both semantically accurate and motionally realistic. Extensive experiments across benchmark datasets demonstrate that Motion-R1 achieves state-of-the-art performance, with a 3.5% improvement in MM-Dist on HumanML3D and improvements in R-Precision and FID on KIT-ML and BABEL, surpassing existing methods across key metrics and highlighting its superior capability in handling complex motion generation tasks. Project page: https://motion-r1.github.io/.

Paper and project links

PDF

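Summary

The paper addresses two remaining challenges in text-to-motion generation. First, existing approaches often fail to capture the temporal and causal complexities inherent in natural language, which leads to oversimplified or incoherent motions. Second, RL-based methods are frequently overly complex, which hinders scalability and adaptability across motion generation tasks. The proposed Motion-R1 framework combines decomposed Chain-of-Thought (CoT) reasoning with reinforcement learning to improve both the quality and the interpretability of generated motions. It introduces a Decomposed CoT Data Engine, which synthesizes high-quality reasoning data through an automated pipeline, and RL Binding, which incorporates multi-modal text-motion alignment into the RL reward function. The framework achieves state-of-the-art performance on benchmark datasets.

Key Takeaways

Text-to-motion generation still struggles with the temporal and causal complexity of natural language and with the complexity of RL-based methods.
Motion-R1 combines decomposed Chain-of-Thought reasoning with reinforcement learning to improve the quality and interpretability of generated motions.
The Decomposed CoT Data Engine synthesizes high-quality reasoning data so that the model better captures the temporal dependencies and causal relationships of human motion.
RL Binding incorporates multi-modal text-motion alignment into the RL reward function, guiding the model toward motions that are semantically accurate and realistic.
Motion-R1 achieves state-of-the-art performance on benchmark datasets including HumanML3D, KIT-ML, and BABEL.
Reported gains include a 3.5% improvement in MM-Dist on HumanML3D and improvements in R-Precision and FID on KIT-ML and BABEL.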
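
Folding text-motion alignment into an RL reward can be pictured as scoring how close a generated motion's embedding is to the text embedding. The sketch below uses cosine similarity for that score. The abstract does not describe the actual reward formula, so the similarity measure, the function names, and the `weight` parameter are hypothetical, and the embeddings are assumed to come from pretrained text and motion encoders that are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_reward(text_emb, motion_emb, weight=1.0):
    """Hypothetical alignment term of an RL reward: higher when the
    motion embedding points in the same direction as the text embedding."""
    return weight * cosine_similarity(text_emb, motion_emb)


if __name__ == "__main__":
    text_emb = [1.0, 0.0]        # toy stand-in for a text embedding
    good_motion = [1.0, 0.0]     # toy stand-in for an aligned motion
    unrelated_motion = [0.0, 1.0]
    print(alignment_reward(text_emb, good_motion))
    print(alignment_reward(text_emb, unrelated_motion))
```

In a full training loop a term like this would typically be combined with other reward components. How Motion-R1 weights or combines them is not stated in the abstract.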

Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors:Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel

Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. Our code is open-source at https://github.com/Cuberick-Orion/FCA .

Paper and project links

PDF Accepted by TMLR, 11/2025. 29 pages, 15 figures

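Summary

Text-video prediction (TVP) is a downstream video generation task in which a model produces subsequent video frames given a series of initial frames and text describing the required motion. Previous methods adapt models pre-trained on text-to-image tasks and therefore tend to generate video that lacks the required continuity. Moving to pre-trained text-to-video (T2V) models is the natural next step, but the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. The authors propose Frame-wise Conditioning Adaptation (FCA), which includes a sub-module that produces frame-wise text embeddings from the input text as an additional condition for generation. FCA is used to fine-tune the T2V model with the initial frame(s) as an extra condition. The paper compares strategies for injecting the embeddings, reports extensive ablation studies with quantitative and qualitative analysis, and establishes a new state of the art on TVP.

Key Takeaways

TVP is a downstream task that generates subsequent video frames from initial frames and a text description of the required motion.
In practice, TVP methods focus on videos of object manipulation by humans or robot arms.
Earlier methods adapt text-to-image models, so the generated video tends to lack continuity.
Frame-wise Conditioning Adaptation (FCA) is proposed because LoRA yields undesirable results when fine-tuning T2V models for this task.
FCA includes a sub-module that produces frame-wise text embeddings, which act as an additional text condition to aid generation.
FCA fine-tunes the T2V model with the initial frame(s) as an extra condition, and the code is open-source at https://github.com/Cuberick-Orion/FCA.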

MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers

Authors:Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia

We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that MagicMirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while requiring minimal parameters added. The code and model will be made publicly available.

Paper and project links

PDF ICCV 2025, It is best viewed in Acrobat. Project Page: https://julianjuaner.github.io/projects/MagicMirror/

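Summary

MagicMirror is a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. It addresses the difficulty of maintaining a consistent identity while producing natural motion in text-to-video generation. Built upon Video Diffusion Transformers, it introduces three key components: a dual-branch facial feature extractor that captures both identity and structural features, a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and a two-stage training strategy that combines synthetic identity pairs with video data. Experiments show that MagicMirror effectively balances identity consistency with natural motion and outperforms existing methods across multiple metrics while adding very few parameters.

Key Takeaways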