
Text-to-Motion


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; use them only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-10-25

OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Authors: Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang, Wen Li, Lixin Duan, Qiang Xu

This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.

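The abstract mentions a progressive weak-to-strong mixed-condition training strategy for resolving multimodal conflicts, but does not spell out the schedule. Below is a minimal Python sketch of one way such a curriculum could look, assuming a hypothetical ordering of conditions from weak (text) to strong (trajectory control) together with per-condition dropout; the ordering, unlock schedule, and drop probability are illustrative assumptions, not the paper's actual recipe.

```python
import random

# Hypothetical condition modalities, ordered roughly from "weak" (loose semantic
# guidance) to "strong" (dense, frame-level constraints). Both the ordering and
# the schedule below are assumptions for illustration.
WEAK_TO_STRONG = ["text", "music", "speech", "reference_motion", "trajectory"]

def sample_condition_set(progress: float, drop_prob: float = 0.1) -> list[str]:
    """Pick which conditions to mix into one training sample.

    `progress` is the fraction of training completed, in [0, 1]. Early on, only
    the weakest conditions are used; stronger, denser conditions are unlocked
    progressively, and every active condition can still be dropped
    independently (classifier-free-guidance style).
    """
    n_active = 1 + int(progress * (len(WEAK_TO_STRONG) - 1))  # unlock gradually
    active = WEAK_TO_STRONG[:n_active]
    return [c for c in active if random.random() > drop_prob]

# Example: at 20% of training only text is available; near the end all five
# condition types may co-occur in a single sample.
for p in (0.2, 0.5, 1.0):
    print(p, sample_condition_set(p))
```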

Paper and project links

PDF

Summary

OmniMotion-X is a versatile framework for whole-body human motion generation. It uses an autoregressive diffusion transformer in a unified sequence-to-sequence manner to support diverse multimodal tasks such as text-to-motion and music-to-dance. OmniMotion-X introduces reference motion as a novel conditioning signal, improving the consistency, style, and temporal dynamics of the generated content. To handle multimodal conflicts, it adopts a progressive weak-to-strong mixed-condition training strategy. In addition, the authors build OmniMoCap-X, a large-scale unified multimodal motion dataset, rendering sequences into videos and automatically generating structured, hierarchical captions to ensure detailed and consistent annotations. Experiments show that OmniMotion-X significantly outperforms existing methods across multimodal tasks and can generate realistic, coherent, and controllable long-duration motions.

Key Takeaways

  1. OmniMotion-X is a versatile framework for whole-body human motion generation that supports a wide range of multimodal tasks.
  2. Reference motion is introduced as a conditioning signal, improving the consistency, style, and temporal dynamics of the generated content.
  3. A progressive weak-to-strong mixed-condition training strategy is used to handle multimodal conflicts.
  4. OmniMoCap-X, a large-scale unified multimodal motion dataset covering multiple tasks in a standardized data format, is constructed.
  5. Sequences are rendered into videos and structured, hierarchical captions are generated automatically to ensure detailed and consistent annotations (a minimal captioning sketch follows this list).
  6. Experiments confirm OmniMotion-X's state-of-the-art performance across multimodal tasks.
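The captioning step in takeaway 5 (render motion into video, then have GPT-4o write structured, hierarchical captions) can be sketched concretely. The snippet below sends a few rendered frames to GPT-4o through the OpenAI chat-completions API and asks for a two-level JSON caption; the prompt wording, the JSON schema, and the frame-sampling choice are illustrative assumptions rather than the paper's exact pipeline.

```python
import base64
import json
from pathlib import Path

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()

def caption_rendered_motion(frame_paths: list[str]) -> dict:
    """Ask GPT-4o for a hierarchical caption of a rendered motion clip.

    `frame_paths` are a handful of frames sampled from the rendered video.
    The two-level schema (low-level actions + high-level semantics) is a
    guess at what "structured and hierarchical captions" could look like.
    """
    content = [{
        "type": "text",
        "text": ("Describe the human motion shown in these frames as JSON with "
                 "two keys: 'low_level_actions' (a list of short per-segment "
                 "action phrases) and 'high_level_semantics' (one sentence on "
                 "intent and style)."),
    }]
    for p in frame_paths:
        b64 = base64.b64encode(Path(p).read_bytes()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```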

Cool Papers

Click here to view paper screenshots

SnapMoGen: Human Motion Generation from Expressive Texts

Authors: Chuan Guo, Inwoo Hwang, Jian Wang, Bing Zhou

Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring high-quality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs. 12 words of HumanML3D). Importantly, these motion clips preserve original temporal continuity as they were in long sequences, facilitating research in long-term motion generation and blending. We also improve upon previous generative masked modeling approaches. Our model, MoMask++, transforms motion into multi-scale token sequences that better exploit the token capacity, and learns to generate all tokens using a single generative masked transformer. MoMask++ achieves state-of-the-art performance on both HumanML3D and SnapMoGen benchmarks. Additionally, we demonstrate the ability to process casual user prompts by employing an LLM to reformat inputs to align with the expressivity and narration style of SnapMoGen. Project webpage: https://snap-research.github.io/SnapMoGen/

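MoMask++ is described as converting motion into multi-scale token sequences and generating all tokens with a single generative masked transformer, but the decoding procedure is not detailed in the abstract. The sketch below implements a generic MaskGIT-style confidence-based unmasking loop as one plausible reading; the mask token id, the linear unmasking schedule, and the random stand-in for the conditioned model are all assumptions.

```python
import torch

MASK_ID = 1024  # hypothetical [MASK] id; the real codebook size is not given here

@torch.no_grad()
def masked_decode(logits_fn, seq_len: int, steps: int = 10) -> torch.Tensor:
    """Iteratively fill a fully masked token sequence, most-confident-first.

    `logits_fn(tokens)` maps an int tensor [1, seq_len] containing MASK_ID holes
    to logits [1, seq_len, vocab]; it stands in for the text-conditioned
    masked transformer over the (flattened) multi-scale token sequence.
    """
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        probs = logits_fn(tokens).softmax(-1)
        conf, pred = probs.max(-1)                       # per-position best guess
        # never re-mask positions that are already decided
        conf = torch.where(tokens == MASK_ID, conf, torch.full_like(conf, float("inf")))
        fill = torch.where(tokens == MASK_ID, pred, tokens)  # only fill the holes
        # simple linear schedule: fewer positions stay masked at every step
        keep_masked = int(seq_len * (1 - (step + 1) / steps))
        if keep_masked > 0:
            thresh = conf.topk(keep_masked, largest=False).values[..., -1:]
            tokens = torch.where(conf <= thresh, torch.full_like(tokens, MASK_ID), fill)
        else:
            tokens = fill
    return tokens

# Toy usage with random logits standing in for the real conditioned model.
dummy = lambda toks: torch.randn(1, toks.shape[1], MASK_ID)  # vocab excludes MASK
print(masked_decode(dummy, seq_len=16))
```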

Paper and project links

PDF | Project Webpage: https://snap-research.github.io/SnapMoGen/

Summary

This paper introduces the SnapMoGen dataset, which pairs high-quality motion-capture data with accurate, expressive textual annotations. The dataset contains 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging about 48 words each. The proposed model, MoMask++, converts motion into multi-scale token sequences that better exploit token capacity and generates all tokens with a single generative masked transformer. The model can also handle casual user prompts by using a large language model to rewrite inputs to match SnapMoGen's expressive narration style. Overall, this work pairs high-quality, temporally continuous long-sequence motion-capture data with an improved generative model. See the project webpage for details.

Key Takeaways

  1. SnapMoGen is a new text-motion dataset with 20K motion clips (44 hours) paired with 122K expressive descriptions averaging 48 words each (vs. 12 words in HumanML3D).
  2. Motion clips keep their original temporal continuity from long sequences, supporting research on long-term motion generation and blending.
  3. MoMask++ converts motion into multi-scale token sequences and generates all tokens with a single generative masked transformer.
  4. MoMask++ achieves state-of-the-art results on both the HumanML3D and SnapMoGen benchmarks.
  5. Casual user prompts are handled by using an LLM to rewrite them in SnapMoGen's expressive narration style (see the sketch after this list).
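The last takeaway notes that casual prompts are rewritten by an LLM to match SnapMoGen's expressive narration style. A minimal sketch of such a reformatting step follows, assuming an OpenAI-style chat API and an illustrative system prompt; the actual model and rewriting instructions used by the authors are not specified here.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rewriting instructions; the authors' actual prompt is not given.
STYLE_INSTRUCTIONS = (
    "Rewrite the user's short motion request as a detailed description of roughly "
    "48 words in the expressive narration style of the SnapMoGen dataset: name the "
    "body parts involved, the timing, and the transitions, but do not invent actions."
)

def expand_prompt(casual_prompt: str) -> str:
    """Reformat a terse user prompt into a SnapMoGen-style expressive description."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable instruction-following LLM would do
        messages=[
            {"role": "system", "content": STYLE_INSTRUCTIONS},
            {"role": "user", "content": casual_prompt},
        ],
    )
    return resp.choices[0].message.content

print(expand_prompt("a person does a cartwheel"))
```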

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!