Published: 2025-10-18

Updated: 2025-11-27


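⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings; they are intended only for initial screening before reading the papers.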

2025-10-18 Update

OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Authors: Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong, Chang Xu

Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.


Paper and project links

PDF

Summary

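This paper proposes a multimodal human motion generation method built on a continuous masked autoregressive motion transformer. By introducing gated linear attention and an RMSNorm module, the method steers the transformer toward key actions and suppresses the instability caused by abnormal movements and by heterogeneous distributions across modalities. It also adopts a DiT structure to improve motion generation and multimodal generalization, and uses AdaLN and cross-attention to fuse signals from different modalities. According to the reported experiments, the method outperforms previous approaches across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance.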

Key Takeaways

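Proposes a multimodal human motion generation method based on a continuous masked autoregressive motion transformer, with causal attention that respects the sequential nature of human motion.
Uses gated linear attention and an RMSNorm module to focus the transformer on key actions and to stabilize generation.
Adopts a DiT structure to improve motion generation and multimodal generalization.
Uses AdaLN and cross-attention to inject text, speech, and music signals.
Reports better results than previous methods on text-to-motion, speech-to-gesture, and music-to-dance.
The authors state that the code will be made public.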
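
To make two of the building blocks named above more concrete, here is a minimal sketch of RMSNorm and an AdaLN-style modulation in plain Python. It uses the generic textbook formulations, not the authors' code (which was not public at the time of the summary). The token values, the `gain` vector, and the `shift`/`scale` values are hypothetical; in a real model `shift` and `scale` would be predicted from a condition embedding (text, speech, or music).

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale a feature vector so its root-mean-square is about 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]


def adaln_modulate(x_normed, shift, scale):
    """AdaLN-style modulation: apply a condition-dependent scale and shift."""
    return [v * (1.0 + s) + b for v, s, b in zip(x_normed, scale, shift)]


# Hypothetical 4-dimensional motion token with one abnormally large value.
token = [0.5, -1.0, 40.0, 2.0]
normed = rms_norm(token, gain=[1.0] * 4)

# Assumption: shift and scale stand in for values that a small network
# would regress from the conditioning signal.
conditioned = adaln_modulate(normed, shift=[0.1] * 4, scale=[0.0] * 4)
```

After normalization the outlier no longer dominates the magnitude of the token, which is one plausible reading of how RMSNorm helps with "abnormal movements"; how the paper combines these modules is not specified in the abstract.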


Authors: Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive objective that further strengthens alignment with conditioning signals and introduce synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.


Paper and project links

PDF Under review at ICLR 2026

Summary

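Generating multi-modal two-person motion conditioned on text, music, and prior motion sequences remains challenging. DualFlow is a unified, efficient framework for this task. Using rectified flow, it obtains deterministic straight-line sampling paths between noise and data, which shortens inference time and mitigates the error accumulation common in diffusion-based models. To strengthen semantic grounding, DualFlow adds a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. A contrastive objective strengthens alignment with the conditioning signals, and a synchronization loss improves inter-person coordination. Evaluations on text-to-motion, music-to-motion, and multi-modal interactive benchmarks show gains in motion quality, responsiveness, and efficiency. The generated motions are temporally coherent and rhythmically synchronized, and the authors report state-of-the-art results in multi-modal human motion generation.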
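
The "straight-line sampling path" idea can be illustrated with a small sketch of the generic rectified-flow recipe in plain Python. This is not DualFlow's implementation: the vectors `noise` and `data` are hypothetical, and `oracle_velocity` stands in for a trained network by returning the ideal constant velocity.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target along the straight path: constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-dimensional example.
noise = [0.0, 1.0, -2.0]
data = [1.0, 3.0, 2.0]


def oracle_velocity(x, t):
    # Assumption: a perfectly trained model would output this velocity.
    return target_velocity(noise, data)


midpoint = interpolate(noise, data, 0.5)
one_step = euler_sample(noise, oracle_velocity, num_steps=1)
four_steps = euler_sample(noise, oracle_velocity, num_steps=4)
```

With an ideal straight-path velocity, one Euler step and four Euler steps land on the same endpoint, which is the intuition behind the shorter inference time. A learned velocity field is only approximately straight, so in practice a few steps are usually still needed.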

Key Takeaways