Published: 2025-11-16

Updated: 2025-11-27

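⚠️ All summaries below are generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not rely on them in serious academic settings. They are only suitable for initial screening before reading a paper!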

2025-11-16 Update

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Authors:Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, Or Litany

Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit’s use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.

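Paper and project links

PDF

Summary

Diffusion-based video generation produces realistic videos, but existing image and text conditioning cannot provide precise motion control, and prior motion-conditioned methods typically require model-specific fine-tuning that is computationally expensive and restrictive. Time-to-Move (TTM) is a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Its key idea is to use crude reference animations, obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection, as coarse motion cues, adapting SDEdit's use of coarse layout cues from image editing to video. Appearance is preserved through image conditioning, while dual-clock denoising, a region-dependent strategy, enforces strong alignment in motion-specified regions and allows flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight change to the sampling process adds no training or runtime cost and is compatible with any backbone. Experiments on object and camera motion benchmarks show that TTM matches or exceeds training-based baselines in realism and motion control, and pixel-level conditioning gives it precise appearance control beyond text-only prompting. Video examples and code are on the project page: https://time-to-move.github.io/.

Key Takeaways

TTM is a training-free framework for motion- and appearance-controlled video generation.
Crude reference animations obtained through user-friendly manipulations such as cut-and-drag serve as motion cues.
It builds on image-to-video diffusion models to achieve precise motion control.
Image conditioning preserves appearance.
Dual-clock denoising balances fidelity to user intent with natural dynamics.
The modified sampling process requires no additional training or runtime cost.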
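
The abstract for TTM describes dual-clock denoising only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes an SDEdit-style sampler in which pixels inside the motion mask stay pinned to the noised reference until a lower noise level (`t_strong`), while pixels outside the mask are denoised freely from a higher level (`t_weak`). The names `add_noise`, `dual_clock_sample`, `denoise_step`, `t_weak`, and `t_strong` are hypothetical, and the toy forward process stands in for a real diffusion schedule.

```python
import math
import random


def add_noise(clean, t, rng):
    """Toy forward process: mix clean values with Gaussian noise at level t in [0, 1]."""
    return [math.sqrt(1.0 - t) * x + math.sqrt(t) * rng.gauss(0.0, 1.0) for x in clean]


def dual_clock_sample(reference, mask, denoise_step, t_weak, t_strong, num_steps, seed=0):
    """Region-dependent SDEdit-style sampling over a flat list of pixel values.

    reference    -- crude animation values (hypothetical flattened input)
    mask         -- True where motion is user-specified
    denoise_step -- hypothetical callable (x, t_cur, t_next) -> x standing in for the I2V model
    t_weak       -- high starting noise level, used outside the mask
    t_strong     -- lower noise level at which masked pixels are released (t_strong <= t_weak)
    """
    rng = random.Random(seed)
    # Linear schedule from t_weak down to exactly 0.
    schedule = [t_weak * (1.0 - i / num_steps) for i in range(num_steps + 1)]
    x = add_noise(reference, t_weak, rng)
    for t_cur, t_next in zip(schedule[:-1], schedule[1:]):
        x = denoise_step(x, t_cur, t_next)
        if t_next >= t_strong:
            # Masked pixels stay pinned to the noised reference until the clock reaches t_strong.
            pinned = add_noise(reference, t_next, rng)
            x = [p if m else v for p, m, v in zip(pinned, mask, x)]
    return x
```

Under these assumptions, lowering `t_strong` makes the masked region follow the crude animation more closely, while raising `t_weak` gives the rest of the frame more freedom to change.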

Decomate: Leveraging Generative Models for Co-Creative SVG Animation

Authors:Jihyeon Park, Jiyoon Myung, Seone Shin, Jungki Son, Joohyung Han

Designers often encounter friction when animating static SVG graphics, especially when the visual structure does not match the desired level of motion detail. Existing tools typically depend on predefined groupings or require technical expertise, which limits designers’ ability to experiment and iterate independently. We present Decomate, a system that enables intuitive SVG animation through natural language. Decomate leverages a multimodal large language model to restructure raw SVGs into semantically meaningful, animation-ready components. Designers can then specify motions for each component via text prompts, after which the system generates corresponding HTML/CSS/JS animations. By supporting iterative refinement through natural language interaction, Decomate integrates generative AI into creative workflows, allowing animation outcomes to be directly shaped by user intent.

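Paper and project links

PDF Accepted at the 1st Workshop on Generative and Protective AI for Content Creation (NeurIPS 2025)

Summary

Decomate enables intuitive SVG animation through natural language. It uses a multimodal large language model to restructure raw SVGs into semantically meaningful, animation-ready components. Designers specify motions for each component via text prompts, and the system generates the corresponding HTML/CSS/JS animations. By supporting iterative refinement through natural language interaction, Decomate integrates generative AI into creative workflows so that animation outcomes directly reflect user intent.

Key Takeaways

Decomate addresses the friction designers face when animating static SVG graphics, especially when the visual structure does not match the desired level of motion detail.
Existing tools typically depend on predefined groupings or require technical expertise, limiting designers' ability to experiment and iterate independently.
Decomate enables intuitive SVG animation through natural language.
A multimodal large language model restructures SVGs into semantic, animation-ready components.
Designers specify motions for each component via text prompts.
The system supports iterative refinement through natural language interaction.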

Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Authors:Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel

We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.

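Paper and project links

PDF

Summary

This work proposes a text-to-talking-face synthesis framework that leverages latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, and these embeddings jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, training proceeds in two stages: pretraining on Wav2Vec2 embeddings, then fine-tuning on TTS outputs. The result is tight audio-visual alignment, preserved speaker identity, and natural, expressive speech with synchronized facial motion, without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.

Key Takeaways

The paper proposes a text-to-talking-face synthesis framework built on HierSpeech++.
A Text-to-Vec module generates Wav2Vec2 embeddings that jointly condition speech and face generation.
Two-stage training handles the distribution shift and yields tight audio-visual alignment.
The framework produces natural, expressive speech and synchronized facial motion at inference without ground-truth audio.
Conditioning on TTS-predicted latent features outperforms cascaded pipelines.
Experiments show improved lip-sync and visual realism.
The framework preserves speaker identity.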

Pressure2Motion: Hierarchical Motion Synthesis from Ground Pressure with Text Guidance

Authors:Zhengxuan Li, Qinhui Yang, Yiyu Zhuang, Chuan Guo, Xinxin Zuo, Xiaoxiao Long, Yao Yao, Xun Cao, Qiu Shen, Hao Zhu

We present Pressure2Motion, a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and text prompt. It eliminates the need for specialized lighting setups, cameras, or wearable devices, making it suitable for privacy-preserving, low-light, and low-cost motion capture scenarios. Such a task is severely ill-posed due to the indeterminate nature of the pressure signals to full-body motion. To address this issue, we introduce Pressure2Motion, a generative model that leverages pressure features as input and utilizes a text prompt as a high-level guiding constraint. Specifically, our model utilizes a dual-level feature extractor that accurately interprets pressure data, followed by a hierarchical diffusion model that discerns broad-scale movement trajectories and subtle posture adjustments. Both the physical cues gained from the pressure sequence and the semantic guidance derived from descriptive texts are leveraged to guide the motion generation with precision. To the best of our knowledge, Pressure2Motion is a pioneering work in leveraging both pressure data and linguistic priors for motion generation, and the established MPL benchmark is the first benchmark for this task. Experiments show our method generates high-fidelity, physically plausible motions, establishing a new state-of-the-art for this task. The codes and benchmarks will be publicly released upon publication.

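Paper and project links

PDF

Summary

Pressure2Motion is a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and a text prompt, without specialized lighting setups, cameras, or wearable devices. Because mapping pressure signals to full-body motion is severely ill-posed, the model takes pressure features as input and uses the text prompt as a high-level guiding constraint. Experiments show that it generates high-fidelity, physically plausible motions and sets a new state of the art for this task.

Key Takeaways

Pressure2Motion is a novel motion capture algorithm that synthesizes human motion from a ground pressure sequence and a text prompt.
It suits privacy-preserving, low-light, and low-cost scenarios because it needs no specialized lighting setups, cameras, or wearable devices.
It addresses the ambiguity between pressure signals and full-body motion by combining pressure features with a text prompt as a high-level constraint.
A dual-level feature extractor interprets the pressure data, and a hierarchical diffusion model captures broad movement trajectories and subtle posture adjustments.
Physical cues from the pressure sequence and semantic guidance from descriptive text together guide motion generation.
To the authors' knowledge, Pressure2Motion is a pioneering work in using both pressure data and linguistic priors for motion generation.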

PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection

Authors:Peiyao Wang, Weining Wang, Qi Li

Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility - manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.

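Paper and project links

PDF

Summary

This paper proposes PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. It introduces PhysicsRM, a dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Building on it, PhyDPO is a direct preference optimization pipeline that uses contrastive feedback and physics-aware reweighting to steer generation toward physically coherent outputs. The approach is model-agnostic and scalable, and it integrates into a wide range of video diffusion and transformer-based backbones. Experiments show that PhysCorr improves physical realism while preserving visual fidelity and semantic alignment.

Key Takeaways

PhysCorr is a unified framework for improving physical consistency in video generation.
PhysicsRM is a dual-dimensional reward model that quantifies intra-object stability and inter-object interactions.
PhyDPO is a direct preference optimization pipeline built on PhysicsRM.
It uses contrastive feedback and physics-aware reweighting to steer generation toward physically coherent outputs.
PhysCorr applies to a wide range of video diffusion and transformer-based backbones.
Experiments show that PhysCorr improves physical realism while preserving visual fidelity and semantic alignment.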
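
The abstract does not give PhyDPO's loss, so the sketch below shows only the standard Direct Preference Optimization objective in its generic log-probability form, with an optional per-pair `weight` as a hypothetical stand-in for physics-aware reweighting. Both the log-probability formulation (diffusion models usually substitute denoising errors) and the way the weight enters are assumptions made for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, weight=1.0):
    """Standard DPO loss for one preference pair.

    logp_w, logp_l         -- policy log-probabilities of the preferred and rejected sample
    ref_logp_w, ref_logp_l -- the same quantities under the frozen reference model
    beta                   -- strength of the implicit KL regularization
    weight                 -- hypothetical per-pair weight (e.g. derived from a reward gap)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return weight * loss
```

When the policy and the reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy raises the preferred sample's likelihood relative to the rejected one.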

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Authors:Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

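Paper and project links

PDF

Summary

This paper addresses video question answering (VideoQA) in the surgical domain. Existing methods rely on static image features and ignore the dynamics of surgical procedures. SurgViVQA extends visual reasoning from static images to dynamic surgical scenes: a Masked Video-Text Encoder fuses video and question features, capturing temporal cues such as motion and tool-tissue interactions, and a fine-tuned large language model (LLM) decodes them into coherent answers. For evaluation, the authors curated REAL-Colon-VQA, a colonoscopic video dataset with motion-related questions, diagnostic attributes, and out-of-template questions. SurgViVQA outperforms existing image-based VQA models in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing.

Key Takeaways

Surgical VideoQA aims to enhance intraoperative understanding by letting AI models reason over temporally coherent events.
Current approaches rely on static image features and ignore surgical dynamics.
SurgViVQA extends visual reasoning from static images to dynamic surgical scenes.
It uses a Masked Video-Text Encoder to fuse video and question features and capture temporal cues.
The REAL-Colon-VQA dataset, with motion-related questions, diagnostic attributes, and out-of-template questions, is used to evaluate the model.
SurgViVQA outperforms existing image-based VQA models in keyword accuracy.
SurgViVQA generalizes well and is robust to variations in question phrasing.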

MotionStream: Real-Time Video Generation with Interactive Motion Controls

Authors:Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Schechtman, Xun Huang

Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons: (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.

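Summary

Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing, which prevents real-time interaction. MotionStream achieves sub-second latency with streaming generation at up to 29 FPS on a single GPU. The approach first augments a text-to-video model with motion control; this model produces high-quality videos that follow the global text prompt and local motion guidance, but it cannot run inference on the fly. The bidirectional teacher is therefore distilled into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Generating long, potentially infinite videos raises three challenges: (1) bridging the gap between training on finite-length videos and extrapolating to infinite horizons, (2) sustaining quality by preventing error accumulation, and (3) keeping inference fast without computational cost growing with the context window. The key is a carefully designed sliding-window causal attention combined with attention sinks. Training with self-rollout, attention sinks, and KV cache rolling simulates inference-time extrapolation with a fixed context window, giving constant-speed generation of arbitrarily long videos. The models achieve state-of-the-art motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion and see the results unfold in real time.

Key Takeaways

MotionStream generates video with sub-second latency, streaming at up to 29 FPS.
It first augments a text-to-video model with motion control to produce high-quality videos.
Self Forcing with Distribution Matching Distillation distills the bidirectional teacher into a causal student, enabling real-time inference.
Extrapolating from finite-length training to unbounded video length raises challenges in domain gap, quality maintenance, and inference speed.
Sliding-window causal attention with attention sinks addresses these challenges, simulating inference-time extrapolation with a fixed context window.
The models achieve state-of-the-art motion following and video quality while being two orders of magnitude faster, and they support infinite-length streaming.
MotionStream delivers an interactive experience in which users paint trajectories, control cameras, or transfer motion and see results in real time.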
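
To make the phrase "sliding-window causal attention combined with attention sinks" concrete, here is a minimal sketch of the attention mask such a scheme implies. It is a generic illustration, not MotionStream's code: positions are abstract indices (the paper may operate on chunks of frames), and the function name and the `window` and `num_sink` parameters are hypothetical.

```python
def sliding_window_sink_mask(seq_len, window, num_sink):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i             # never look at future positions
            is_sink = j < num_sink      # the first positions stay visible forever
            in_window = i - j < window  # only the most recent `window` positions
            row.append(causal and (is_sink or in_window))
        mask.append(row)
    return mask
```

Each query attends to at most `window + num_sink` keys regardless of sequence length. This is why the per-step cost stays constant as the generated video grows.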

MoSa: Motion Generation with Scalable Autoregressive Modeling

Authors:Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, Hong Liu

We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask’s 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web

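Paper and project links

PDF

Summary

This paper introduces MoSa, a hierarchical framework for text-driven 3D human motion generation. MoSa enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm with a coarse-to-fine scalable generation process. Its Multi-scale Token Preservation Strategy (MTPS), integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE), applies interpolation at each quantization level to retain coarse-to-fine multi-scale tokens. The generative transformer can then perform Scalable Autoregressive (SAR) modeling, predicting scale tokens rather than a single token per step as in traditional methods. As a result, MoSa needs only 10 inference steps, matching the number of RQ-VAE quantization layers. To counter reconstruction degradation from frequent interpolation, the authors propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE that improves the residual block design and adds attention to capture global dependencies. Experiments show state-of-the-art generation quality and efficiency: on Motion-X, MoSa reaches an FID of 0.06 (versus 0.20 for MoMask) while reducing inference time by 27 percent. MoSa also generalizes to downstream tasks such as motion editing without additional fine-tuning.

Key Takeaways

MoSa is a novel hierarchical framework for text-driven 3D human motion generation that improves on the VQ-GT paradigm.
Its Multi-scale Token Preservation Strategy (MTPS), integrated into an RQ-VAE, retains multi-scale tokens.
MoSa supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens at each step.
With only 10 inference steps, MoSa generates high-quality motion efficiently.
CAQ-VAE counters the reconstruction degradation caused by frequent interpolation.
On Motion-X, MoSa reaches an FID of 0.06 versus 0.20 for MoMask.
MoSa generalizes to downstream tasks such as motion editing without additional fine-tuning.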
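
MoSa builds on residual vector quantization, where each layer quantizes what the previous layers left unexplained. The sketch below illustrates only that generic mechanism with toy codebooks. It omits the interpolation used by MTPS and everything else specific to MoSa, and the function names and codebook values are illustrative assumptions.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual left by earlier layers."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        indices.append(k)
        recon = [r + c for r, c in zip(recon, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, recon
```

With one token index per layer, a model that predicts all tokens of a layer in one step needs as many steps as there are layers. This is consistent with the 10 inference steps for 10 quantization layers reported in the abstract.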

RealDPO: Real or Not Real, that is the Preference

Authors:Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si, Ziwei Liu

Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.

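Summary

Video generative models have made notable progress in synthesis quality, but generating complex motion remains a key challenge: existing models struggle to produce natural, smooth, and contextually consistent movements. RealDPO is a novel alignment paradigm that uses real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO applies Direct Preference Optimization (DPO) with a tailored loss function; by contrasting real-world videos with erroneous model outputs, it enables iterative self-correction that progressively refines motion quality. To support post-training for complex motion synthesis, the authors propose RealAction-5K, a curated dataset of high-quality videos of human daily activities with rich and precise motion details. Experiments show that RealDPO significantly improves video quality, text alignment, and motion realism over state-of-the-art models and existing preference optimization techniques.

Key Takeaways

Video generative models have advanced in synthesis quality, but generating complex motion remains challenging.
Existing models struggle to produce natural, smooth, and contextually consistent movements.
RealDPO uses real-world data as positive samples for preference learning, enabling more accurate motion synthesis.
RealDPO applies Direct Preference Optimization (DPO) with a tailored loss function, enabling iterative self-correction that refines motion quality.
RealAction-5K is proposed to support post-training for complex motion synthesis.
RealDPO significantly improves video quality, text alignment, and motion realism over other techniques.
RealDPO aims to narrow the gap between generated and real-world motion, which currently limits practical applicability.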

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

Authors:Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu

Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl

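Paper and project links

PDF NeurIPS 2025 Camera Ready Version

Summary

This paper introduces PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control, addressing the limited physical plausibility and 3D controllability of existing video generation models. A diffusion model learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid), trained on a large-scale synthetic dataset. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in visual quality and physical plausibility.

Key Takeaways

PhysCtrl is a novel framework for physics-grounded image-to-video generation.
Physical parameters and force control improve the physical plausibility and 3D controllability of video generation.
A diffusion model learns the distribution of physical dynamics across four materials.
Training uses a large-scale synthetic dataset of 550K animations.
A novel spatiotemporal attention block emulates particle interactions, and physics-based constraints during training enforce physical plausibility.
Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories.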

Light Future: Multimodal Action Frame Prediction via InstructPix2Pix

Authors:Zesen Zhong, Duomin Zhang, Yijia Li

Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.

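Paper and project links

PDF 9 pages including appendix, 4 tables, 8 figures, to be submitted to WACV 2026

Summary

This paper proposes a novel, efficient, and lightweight approach to robot action prediction. A deep learning-based visual prediction framework forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. The InstructPix2Pix model is repurposed and fine-tuned to accept both visual and textual inputs, enabling multimodal future frame prediction. Compared with conventional video prediction models, the approach has lower computational cost and inference latency. Experiments on the RoboTWin dataset show superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks.

Key Takeaways

The paper proposes a lightweight approach to robot action prediction.
It builds on a deep learning-based visual prediction framework that takes a current image and a textual instruction.
InstructPix2Pix is repurposed and fine-tuned to accept visual and textual inputs, enabling multimodal prediction.
The approach has lower computational cost and inference latency than conventional video prediction models.
On the RoboTWin dataset, it achieves superior SSIM and PSNR compared to state-of-the-art baselines.
It is particularly valuable for robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.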
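
The results above are reported in SSIM and PSNR. As a reminder of what the second metric measures, here is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE), over flattened pixel lists. The function name, the flat-list input format, and the 8-bit default `max_value` are assumptions for illustration; the paper's evaluation code is not described in the abstract.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the predicted frame is closer to the ground-truth frame. SSIM complements PSNR by comparing local structure rather than raw pixel error.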

T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Authors:Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xiaopeng Fan, Debin Zhao

Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding for Ultra-Low Bitrate (ULB) scenarios by leveraging powerful generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or excessive dependence on high-level text guidance, which tend to inadequately capture fine-grained motion details, leading to unrealistic or incoherent reconstructions. To address these challenges, we propose Trajectory-Guided Generative Video Coding (dubbed T-GVC), a novel framework that bridges low-level motion tracking with high-level semantic understanding. T-GVC features a semantic-aware sparse motion sampling pipeline that extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. In addition, by integrating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free guidance mechanism in latent space to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that T-GVC outperforms both traditional and neural video codecs under ULB conditions. Furthermore, additional experiments confirm that our framework achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.

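Paper and project links

PDF

Summary

This paper proposes Trajectory-Guided Generative Video Coding (T-GVC), a framework that bridges low-level motion tracking with high-level semantic understanding. A semantic-aware sparse motion sampling pipeline extracts pixel-wise motion as sparse trajectory points based on their semantic importance, significantly reducing the bitrate while preserving critical temporal semantic information. By integrating trajectory-aligned loss constraints into the diffusion process, T-GVC adds a training-free guidance mechanism in latent space that ensures physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experiments show that T-GVC outperforms both traditional and neural video codecs under Ultra-Low Bitrate (ULB) conditions.

Key Takeaways

T-GVC bridges low-level motion tracking with high-level semantic understanding to address the limitations of existing methods.
A semantic-aware sparse motion sampling pipeline extracts sparse trajectory points by semantic importance, reducing the bitrate while preserving temporal semantic information.
Trajectory-aligned loss constraints in the diffusion process ensure physically plausible motion patterns.
T-GVC outperforms traditional and neural video codecs under ultra-low bitrate conditions.
T-GVC achieves more precise motion control than existing text-guided methods.
T-GVC points to a new direction for generative video coding guided by geometric motion modeling.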

Generating Attribute-Aware Human Motions from Textual Prompt

Authors:Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu

Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes, such as age, gender, weight, and height, which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating attribute-aware motion aligned with the user’s text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate our model’s effectiveness.

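Text-driven human motion generation has recently attracted wide attention; it lets models generate human motions from textual descriptions. However, current methods overlook the influence of human attributes (such as age, gender, weight, and height), which are key factors shaping human motion patterns. This work is a pilot exploration toward closing that gap. We conceptualize each motion as a combination of attribute information and action semantics, where textual descriptions correspond only to action semantics. To achieve this, we propose a new framework inspired by Structural Causal Models that decouples action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model can generate attribute-aware motion consistent with the user's text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate the effectiveness of our model.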

Paper and project links

PDF Accepted by AAAI 2026

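Summary

Text-driven human motion generation, in which models generate human motions from textual descriptions, has attracted wide attention. Current methods, however, overlook human attributes such as age, gender, weight, and height, which are key factors shaping human motion patterns. This work aims to fill that gap. It conceptualizes each motion as containing both attribute information and action semantics, with textual descriptions aligned only with action semantics. Inspired by Structural Causal Models, the authors propose a new framework that decouples action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model generates attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, the authors introduce a comprehensive dataset with attribute annotations for text-motion pairs, establishing the first benchmark for attribute-aware motion generation. Extensive experiments validate the model's effectiveness.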

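Key Takeaways

Current text-driven human motion generation methods overlook the influence of human attributes such as age, gender, weight, and height.
The study proposes a new framework, inspired by Structural Causal Models, that generates motion aligned with both human attributes and text input.
The framework decouples action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation.
To evaluate the model, the authors introduce a comprehensive dataset with attribute annotations for text-motion pairs.
Extensive experiments validate the model's effectiveness at generating attribute-aware motion aligned with the user's text and attribute inputs.
The approach opens a new direction for text-driven human motion generation, particularly in accounting for the influence of human attributes.
As the technique develops further, it may enable more realistic and more personalized motion generation across a wider range of applications.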

Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Authors:Cong Wang, Zexuan Deng, Zhiwei Jiang, Yafeng Yin, Fei Shen, Zifeng Cheng, Shiping Ge, Shiwei Gan, Qing Gu

Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on the single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. (2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.

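Sign Language Video Generation (SLVG) aims to generate identity-preserving sign language videos from spoken language text. Existing methods rely mainly on a single coarse condition (for example, skeleton sequences) as the bridge between the translation model and the video generation model, which limits the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a new SLVG framework that incorporates multiple fine-grained conditions to improve generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent the fine-grained conditions (namely fine-grained poses and 3D hands). SignViP has three core components. (1) The Sign Video Diffusion Model is trained jointly with a multi-condition encoder to learn continuous embeddings that capture fine-grained motion and appearance. (2) The Finite Scalar Quantization (FSQ) Autoencoder is then trained to compress and quantize these embeddings into discrete tokens, giving a compact representation of the conditions. (3) The Multi-Condition Token Translator is trained to translate spoken language text into discrete multi-condition tokens. During inference, the Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. The FSQ Autoencoder then decodes these tokens into continuous embeddings, which are injected into the Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.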

Paper and project links

PDF

Summary

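This paper describes the challenges of Sign Language Video Generation (SLVG) and the limitations of existing methods. To improve the realism and expressiveness of generated videos, it proposes the SignViP framework, which raises generation fidelity by introducing multiple fine-grained conditions. SignViP has three core components: the Sign Video Diffusion Model, the Finite Scalar Quantization (FSQ) Autoencoder, and the Multi-Condition Token Translator. The framework integrates and represents fine-grained conditions, such as fine-grained poses and 3D hands, through discrete tokenization. Experiments show that SignViP achieves state-of-the-art performance in video quality, temporal coherence, and semantic fidelity.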

Key Takeaways

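SLVG aims to generate identity-preserving sign language videos from spoken language text.
Existing methods rely mainly on a single coarse condition as the bridge between the translation model and the video generation model, which limits the naturalness and expressiveness of the generated videos.
SignViP improves generation fidelity by introducing multiple fine-grained conditions, such as fine-grained poses and 3D hands.
SignViP has three core components: the Sign Video Diffusion Model, the FSQ Autoencoder, and the Multi-Condition Token Translator.
SignViP uses discrete tokenization to integrate and represent the fine-grained conditions, improving the realism of the generated videos.
Experiments show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity.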
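
To make the quantization step more concrete, here is a minimal sketch of the general Finite Scalar Quantization idea: bound each channel, round it to a small set of integer levels, and read the rounded vector as one token id. It is not taken from the SignViP code. The level counts `[7, 5, 5]`, the function names, and the restriction to odd level counts (which keeps the codes symmetric around zero) are assumptions for illustration, and the straight-through gradient used when training such a quantizer is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each channel with tanh, then round to one of levels[i] integer codes."""
    levels = np.asarray(levels)
    half = (levels - 1) / 2.0              # odd level counts only (assumption)
    bounded = np.tanh(z) * half            # each channel lies in (-half, half)
    return np.round(bounded).astype(int)   # integer codes in [-half, half]

def codes_to_index(codes, levels):
    """Flatten per-channel codes into a single token id (mixed-radix number)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half                  # shift each code to 0 .. levels[i] - 1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

levels = [7, 5, 5]                         # hypothetical; implicit codebook size 7 * 5 * 5 = 175
z = np.array([[0.0, 0.0, 0.0],
              [10.0, -10.0, 0.3]])         # two toy embedding vectors
codes = fsq_quantize(z, levels)            # [[0, 0, 0], [3, -2, 1]]
tokens = codes_to_index(codes, levels)     # [87, 111]
```

Because the codebook is implied by the level counts rather than learned, a token id can be mapped back to its code vector with plain integer arithmetic, which is one reason this style of quantizer gives a compact discrete representation.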

Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-Domain

Authors:Wenshuo Chen, Haozhe Jia, Songning Lai, Lei Wang, Yuqi Lin, Hongru Xiao, Lijie Hu, Yutao Yue

Enabling humanoid robots to synthesize complex, physically coherent motions from natural language commands is a cornerstone of autonomous robotics and human-robot interaction. While diffusion models have shown promise in this text-to-motion (T2M) task, they often generate semantically flawed or unstable motions, limiting their applicability to real-world robots. This paper reframes the T2M problem from a frequency-domain perspective, revealing that the generative process mirrors a hierarchical control paradigm. We identify two critical phases: a semantic planning stage, where low-frequency components establish the global motion trajectory, and a fine-grained execution stage, where high-frequency details refine the movement. To address the distinct challenges of each phase, we introduce Frequency enhanced text-to-motion (Free-T2M), a framework incorporating stage-specific frequency-domain consistency alignment. We design a frequency-domain temporal-adaptive module to modulate the alignment effects of different frequency bands. These designs enforce robustness in the foundational semantic plan and enhance the accuracy of detailed execution. Extensive experiments show our method dramatically improves motion quality and semantic correctness. Notably, when applied to the StableMoFusion baseline, Free-T2M reduces the FID from 0.152 to 0.060, establishing a new state-of-the-art within diffusion architectures. These findings underscore the critical role of frequency-domain insights for generating robust and reliable motions, paving the way for more intuitive natural language control of robots.

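Enabling humanoid robots to synthesize complex, physically coherent motions from natural language commands is a cornerstone of autonomous robotics and human-robot interaction. Although diffusion models have shown promise in the text-to-motion (T2M) task, they often produce semantically flawed or unstable motions, which limits their use on real-world robots. This paper reframes the T2M problem from a frequency-domain perspective and finds that the generative process mirrors a hierarchical control paradigm. We identify two critical phases: a semantic planning stage, in which low-frequency components establish the global motion trajectory, and a fine-grained execution stage, in which high-frequency details refine the movement. To address the distinct challenges of each phase, we introduce Frequency enhanced text-to-motion (Free-T2M), a framework that incorporates stage-specific frequency-domain consistency alignment. We design a frequency-domain temporal-adaptive module to modulate the alignment effects of different frequency bands. These designs make the foundational semantic plan more robust and the detailed execution more accurate. Extensive experiments show that our method significantly improves motion quality and semantic correctness. Notably, when applied to the StableMoFusion baseline, Free-T2M reduces the FID from 0.152 to 0.060, establishing a new state of the art among diffusion architectures. These findings underscore the critical role of frequency-domain insights in generating robust and reliable motions, paving the way for more intuitive natural language control of robots.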

Paper and project links

PDF

Summary

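This paper discusses why it matters for humanoid robots to synthesize complex, physically coherent motions from natural language commands. Because diffusion models often generate semantically flawed or unstable motions in the text-to-motion (T2M) task, the paper reconsiders T2M from a frequency-domain perspective and introduces the Frequency enhanced text-to-motion (Free-T2M) framework, which uses stage-specific frequency-domain consistency alignment to address the challenges of two critical phases: semantic planning and fine-grained execution. Experiments show that the method significantly improves motion quality and semantic correctness; applied to the StableMoFusion baseline, it reduces the FID from 0.152 to 0.060, a new state of the art among diffusion architectures. These findings underscore the critical role of frequency-domain insights in generating robust and reliable motions, paving the way for more intuitive natural language control of robots.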

Key Takeaways

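Synthesizing complex, physically coherent motions from natural language commands is a core task for autonomous robotics and human-robot interaction.
Diffusion models often generate semantically flawed or unstable motions in the text-to-motion (T2M) task.
The paper reconsiders T2M from a frequency-domain perspective and proposes the Frequency enhanced text-to-motion (Free-T2M) framework.
The generative process is found to have two critical phases: a semantic planning stage driven by low-frequency components and a fine-grained execution stage driven by high-frequency details.
Free-T2M addresses the challenges of both phases through stage-specific frequency-domain consistency alignment.
Experiments show that Free-T2M significantly improves motion quality and semantic correctness and sets a new state of the art among diffusion architectures.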
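
The low-frequency versus high-frequency split described above can be sketched with a Fourier transform along the time axis. This is a generic illustration, not the Free-T2M alignment module: the function name `split_frequency_bands`, the cutoff of 4 bins, and the synthetic one-dimensional motion signal are assumptions. The last lines also work out the relative size of the reported FID drop.

```python
import numpy as np

def split_frequency_bands(motion, cutoff):
    """Split a (frames, features) sequence into low- and high-frequency parts along time."""
    spectrum = np.fft.rfft(motion, axis=0)
    low_spectrum = spectrum.copy()
    low_spectrum[cutoff:] = 0                      # keep only the first `cutoff` frequency bins
    low = np.fft.irfft(low_spectrum, n=motion.shape[0], axis=0)
    high = motion - low                            # everything above the cutoff
    return low, high

t = np.linspace(0.0, 1.0, 64, endpoint=False)
trajectory = np.sin(2 * np.pi * 1 * t)             # slow global movement (1 cycle)
jitter = 0.1 * np.sin(2 * np.pi * 12 * t)          # fast fine detail (12 cycles)
motion = (trajectory + jitter)[:, None]            # shape (64, 1)
low, high = split_frequency_bands(motion, cutoff=4)

# Relative FID reduction reported for the StableMoFusion baseline
fid_before, fid_after = 0.152, 0.060
relative_reduction = (fid_before - fid_after) / fid_before   # about 0.605, i.e. roughly 60.5%
```

In this toy signal the low band recovers the slow trajectory and the high band recovers the jitter, which mirrors the planning and execution roles the abstract assigns to the two bands.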

LangPose: Language-Aligned Motion for Robust 3D Human Pose Estimation

Authors:Longyun Liao, Rong Zheng

2D-to-3D human pose lifting is an ill-posed problem due to depth ambiguity and occlusion. Existing methods relying on spatial and temporal consistency alone are insufficient to resolve these problems especially in the presence of significant occlusions or high dynamic actions. Semantic information, however, offers a complementary signal that can help disambiguate such cases. To this end, we propose LangPose, a framework that leverages action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. LangPose operates in two stages: pretraining and fine-tuning. In the pretraining stage, the model simultaneously learns to recognize actions and reconstruct 3D poses from masked and noisy 2D poses. During the fine-tuning stage, the model is further refined using real-world 3D human pose estimation datasets without action labels. Additionally, our framework incorporates masked body parts and masked time windows in motion modeling, encouraging the model to leverage semantic information when spatial and temporal consistency is unreliable. Experiments demonstrate the effectiveness of LangPose, achieving SOTA level performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, LangPose achieves an MPJPE of 36.7mm on Human3.6M with detected 2D poses as input and 15.5mm on MPI-INF-3DHP with ground-truth 2D poses as input.

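2D-to-3D human pose lifting is an ill-posed problem because of depth ambiguity and occlusion. Existing methods that rely only on spatial and temporal consistency are not enough to resolve these problems, especially under significant occlusion or highly dynamic actions. Semantic information, however, provides a complementary signal that can help disambiguate such cases. To this end, we propose LangPose, a framework that exploits action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. LangPose works in two stages: pretraining and fine-tuning. In the pretraining stage, the model simultaneously learns to recognize actions and to reconstruct 3D poses from masked and noisy 2D poses. In the fine-tuning stage, the model is further refined on real-world 3D human pose estimation datasets without action labels. In addition, our framework incorporates masked body parts and masked time windows into motion modeling, encouraging the model to use semantic information when spatial and temporal consistency is unreliable. Experiments demonstrate the effectiveness of LangPose, which achieves state-of-the-art performance in 3D pose estimation on public datasets, including Human3.6M and MPI-INF-3DHP. Specifically, LangPose achieves a mean per-joint position error (MPJPE) of 36.7 mm on Human3.6M with detected 2D poses as input, and 15.5 mm on MPI-INF-3DHP with ground-truth 2D poses as input.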

Paper and project links

PDF Accepted by WACV2026. Please find the supplementary material under the “Ancillary files”

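Summary

The paper proposes LangPose, a framework for 2D-to-3D human pose lifting, a problem made ill-posed by depth ambiguity and occlusion. The framework exploits action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels. LangPose works in two stages, pretraining and fine-tuning, and achieves state-of-the-art 3D pose estimation performance on public datasets.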

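Key Takeaways

2D-to-3D human pose lifting is an ill-posed problem because of depth ambiguity and occlusion.
Existing methods that rely only on spatial and temporal consistency cannot resolve these problems effectively, especially under significant occlusion or highly dynamic actions.
LangPose brings in action knowledge by aligning motion embeddings with text embeddings of fine-grained action labels.
LangPose has two stages: in pretraining, the model learns to recognize actions and to reconstruct 3D poses from masked and noisy 2D poses; in fine-tuning, it is further refined on real-world 3D human pose estimation datasets without action labels.
LangPose incorporates masked body parts and masked time windows into motion modeling, encouraging the model to use semantic information when spatial and temporal consistency is unreliable.
Experiments show that LangPose achieves state-of-the-art 3D pose estimation performance on public datasets such as Human3.6M and MPI-INF-3DHP.
LangPose achieves an MPJPE of 36.7 mm on Human3.6M (with detected 2D poses as input) and 15.5 mm on MPI-INF-3DHP (with ground-truth 2D poses as input).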
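
For readers unfamiliar with the metric, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and frames. The sketch below shows that computation together with a simple time-window mask. It is illustrative only: the function names, the 17-joint skeleton, and zero-filling as the masking strategy are assumptions rather than details from the paper, and benchmark protocols usually add steps such as root-joint alignment that are omitted here.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over all frames and joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def mask_time_window(poses_2d, start, length):
    """Zero out a contiguous window of frames (a simple stand-in for a masked time window)."""
    masked = poses_2d.copy()
    masked[start:start + length] = 0.0
    return masked

gt = np.zeros((2, 17, 3))          # 2 frames, 17 joints, (x, y, z) in millimetres
pred = gt.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0                # every joint is off by a (3, 4, 0) mm offset
error_mm = mpjpe(pred, gt)         # 5.0, since sqrt(3^2 + 4^2) = 5

poses_2d = np.ones((10, 17, 2))    # 10 frames of toy 2D keypoints
masked = mask_time_window(poses_2d, start=3, length=4)   # frames 3..6 are hidden
```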

A multi-phase thermo-mechanical model for rock-ice avalanche

Authors:Shiva P. Pudasaini

We propose a novel multi-phase thermo-mechanical rock-ice avalanche model. It considers rock, ice and fluid; includes rigorously derived ice melt rate, melting efficiency dependent fluid production rate and a general temperature equation. It explains advection-diffusion of heat including heat exchange across the avalanche, basal heat conduction, production and loss of heat due to frictional shearing and changing temperature, and temperature enhancement due to entrainment. Temperature equation couples rates of thermal conductivity and temperature. Ice melt intensity determines these rates as mixture conductivity evolves, characterizing thermo-mechanical processes. The model includes interfacial mass and momentum exchanges and mass and momentum productions due to entrainment. The latter significantly changes the state of temperature; yet, the former characterizes the rock-ice avalanche. Phase mass and momentum balances and temperature are coupled. New model offers the first-ever complete dynamical solution for rock-ice avalanche with changing temperature and ice melting. We develop an advection-diffusion-decay-source model and its analytical solutions providing novel understanding of temperature evolution. The 2021 Chamoli event simulations with r.avaflow (https://www.landslidemodels.org/r.avaflow/) illustrate the functionality of thermo-mechanical rock-ice avalanche model. Four scenarios are considered: variations in ice-melt-efficiency; fraction of ice; ice and rock frictions; governing the process of melting, flow transformation, spreading and mobility. Ice melting designates the motion and explains the rock-ice avalanche mobility: a phenomenal thermo-mechanical play. Essentially different controls of ice and rock frictions on the state of flow mobility are revealed, explaining complex thermo-mechanical processes. This provides a useful method for practitioners and engineers in solving problems associated with rock-ice avalanches.

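We propose a novel multi-phase thermo-mechanical rock-ice avalanche model. The model considers rock, ice, and fluid, and includes a rigorously derived ice melt rate, a fluid production rate that depends on melting efficiency, and a general temperature equation. It explains the advection-diffusion of heat, including heat exchange across the avalanche, basal heat conduction, the production and loss of heat due to frictional shearing and changing temperature, and temperature enhancement due to entrainment. The temperature equation couples the rates of change of thermal conductivity and temperature. Ice melt intensity determines these rates as the mixture conductivity evolves, characterizing the thermo-mechanical processes. The model includes interfacial mass and momentum exchanges as well as mass and momentum production due to entrainment. The latter significantly changes the temperature state, whereas the former characterizes the rock-ice avalanche. The phase mass and momentum balances and the temperature are coupled. The new model offers the first complete dynamical solution for rock-ice avalanches with changing temperature and ice melting. We develop an advection-diffusion-decay-source model and its analytical solutions, which provide a new understanding of temperature evolution. Simulations of the 2021 Chamoli event with r.avaflow (https://www.landslidemodels.org/r.avaflow/) illustrate the functionality of the thermo-mechanical rock-ice avalanche model. Four scenarios are considered, covering variations in ice melt efficiency, ice fraction, and ice and rock frictions, which govern the processes of melting, flow transformation, spreading, and mobility. Ice melting governs the motion and explains the mobility of the rock-ice avalanche: a phenomenal thermo-mechanical play. The simulations reveal essentially different controls of ice friction and rock friction on flow mobility, explaining complex thermo-mechanical processes. This provides a useful method for practitioners and engineers solving problems associated with rock-ice avalanches.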

Paper and project links

PDF Section 4 added on application, text enhanced accordingly

Summary

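This paper proposes a novel multi-phase thermo-mechanical rock-ice avalanche model that considers rock, ice, and fluid, and includes a rigorously derived ice melt rate, a fluid production rate that depends on melting efficiency, and a general temperature equation. The model explains the advection-diffusion of heat, including heat exchange across the avalanche, basal heat conduction, the production and loss of heat due to frictional shearing and changing temperature, and temperature enhancement due to entrainment. It offers the first complete dynamical solution for rock-ice avalanches with changing temperature and ice melting. The model also includes interfacial mass and momentum exchanges and mass and momentum production due to entrainment. Simulations of the 2021 Chamoli event demonstrate the functionality of the model. Four scenarios covering variations in ice melt efficiency, ice fraction, and ice and rock frictions highlight the key role of ice melting in avalanche motion and reveal the controls behind complex thermo-mechanical processes. This gives engineers and practitioners a practical method for solving problems associated with rock-ice avalanches.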
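
The abstract mentions an advection-diffusion-decay-source model for temperature but does not give its form here. As a rough illustration of what such an equation does, the sketch below integrates a generic one-dimensional version, dT/dt + u dT/dx = D d2T/dx2 - lam*T + S, on a periodic grid. The equation form, the constant coefficients, the grid, and the numerical scheme are all assumptions for illustration and are not the paper's model. With constant coefficients the temperature relaxes toward the uniform balance S / lam between source and decay.

```python
import numpy as np

def step(T, u, D, lam, S, dx, dt):
    """One explicit step: upwind advection (u >= 0), central diffusion, linear decay, constant source."""
    advection = -u * (T - np.roll(T, 1)) / dx
    diffusion = D * (np.roll(T, -1) - 2 * T + np.roll(T, 1)) / dx**2
    return T + dt * (advection + diffusion - lam * T + S)

nx, dx, dt = 100, 1.0, 0.1                 # hypothetical grid and time step (arbitrary units)
u, D, lam, S = 1.0, 0.5, 0.2, 1.0          # hypothetical constant coefficients
x = np.arange(nx) * dx
T = np.exp(-((x - 50.0) ** 2) / 50.0)      # initial warm patch in the middle of the domain

for _ in range(1000):
    T = step(T, u, D, lam, S, dx, dt)

equilibrium = S / lam                      # uniform steady state where decay balances the source
```

The chosen time step keeps the explicit scheme stable for these coefficients (u*dt/dx = 0.1 and D*dt/dx**2 = 0.05); larger steps or coefficients would need a stability check first.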

Key Takeaways