Published: 2025-10-02

Updated: 2025-11-27

Word count: 1.2k

Reading time: 4 min

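⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for preliminary screening before reading a paper.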

2025-10-02 Update

LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model

Authors: Haozhe Jia, Wenshuo Chen, Yuqi Lin, Yang Yang, Lei Wang, Mang Ning, Bowen Tian, Songning Lai, Nanqian Jia, Yifan Chen, Yutao Yue

While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4× compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
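
To make the "contrastive learning" in the first path concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss over paired motion and text embeddings. The abstract does not specify MoCLIP's actual loss, so the function name `symmetric_info_nce`, the embedding shapes, and the temperature value of 0.07 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def symmetric_info_nce(motion_emb, text_emb, temperature=0.07):
    """Generic CLIP-style symmetric contrastive loss (illustrative, not MoCLIP itself).

    motion_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    temperature: assumed value, chosen only for this sketch.
    """
    # L2-normalise so the dot product is a cosine similarity.
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = m @ t.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diag(z):
        # Row-wise log-softmax, then average the matched-pair (diagonal) entries.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the motion-to-text and text-to-motion directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

# Toy check: matched pairs should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = symmetric_info_nce(emb, emb)
loss_mismatched = symmetric_info_nce(emb, emb[::-1])
```

The loss pulls each motion embedding toward its own caption and pushes it away from the other captions in the batch, which is why such a model can be trained on the dataset's own motion-text pairs without external data.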

Paper and project links

PDF

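Summary

Diffusion models built on U-Net architectures perform well on text-to-motion generation, yet they still suffer from semantic misalignment and kinematic artifacts. Our analysis identifies gradient attenuation in the deep layers of the network as the key bottleneck, which leaves high-level features insufficiently learned. To address this, we propose LUMA (Low-dimension Unified Motion Alignment), a text-to-motion diffusion model that uses dual-path anchoring to strengthen semantic alignment. The first path uses a lightweight MoCLIP model, trained with contrastive learning and no external data, to provide semantic supervision in the temporal domain. The second path adds complementary alignment signals in the frequency domain, extracted from low-frequency DCT components that are rich in semantic content. The two anchors are adaptively fused through a temporal modulation mechanism, so the model moves progressively from coarse alignment to fine-grained semantic refinement during denoising. On HumanML3D and KIT-ML, LUMA achieves state-of-the-art performance with FID scores of 0.035 and 0.123, respectively. It also converges 1.4× faster than the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.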
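
As an illustration of what "low-frequency DCT components" of a motion sequence look like, the sketch below applies an orthonormal DCT-II along the time axis and keeps only the first few coefficients. It is a hedged example, not LUMA's code: the helper names (`dct_matrix`, `low_frequency_dct`, `reconstruct`), the `(frames, features)` layout, and the number of retained coefficients are assumptions chosen for demonstration.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of size (n, n); row k is the k-th cosine basis vector."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def low_frequency_dct(motion: np.ndarray, keep: int) -> np.ndarray:
    """First `keep` temporal DCT coefficients of a (frames, features) array."""
    frames = motion.shape[0]
    return dct_matrix(frames)[:keep] @ motion

def reconstruct(coeffs: np.ndarray, frames: int) -> np.ndarray:
    """Invert the truncated DCT, giving a low-pass (smoothed) version of the motion."""
    keep = coeffs.shape[0]
    return dct_matrix(frames)[:keep].T @ coeffs

# Toy motion (hypothetical data): a slow sinusoid per feature plus frame-level jitter.
rng = np.random.default_rng(0)
frames, features = 64, 3
time = np.linspace(0.0, 1.0, frames)[:, None]
motion = np.sin(2 * np.pi * time * np.array([[1.0, 2.0, 3.0]]))
noisy = motion + 0.05 * rng.normal(size=(frames, features))

coeffs = low_frequency_dct(noisy, keep=8)   # compact (8, 3) low-frequency summary
smooth = reconstruct(coeffs, frames)        # coarse trajectory without the jitter
```

Keeping only the leading coefficients discards high-frequency jitter and retains the overall trajectory, which is one intuition for why low-frequency components can act as a coarse alignment signal.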

Key Insights