Published: 2025-07-09

Updated: 2025-07-17

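⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic work. They are intended only for initial screening before reading a paper.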

Updated 2025-07-09

Spatio-Temporal Control for Masked Motion Synthesis

Authors: Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution that forces the generated motion to accurately align with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to combat the non-differential distribution sampling process encountered by logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/
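The abstract does not spell out how Differentiable Expectation Sampling works, so the following is only a minimal sketch of one plausible reading: instead of picking a single token (which blocks gradients), take the probability-weighted average of the codebook embeddings. The codebook values, the function names, and the `temperature` parameter are all hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_embedding(logits, codebook, temperature=1.0):
    """Probability-weighted average of codebook entries (smooth in the logits)."""
    probs = softmax(logits, temperature)
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]


def hard_embedding(logits, codebook):
    """Argmax lookup: piecewise constant in the logits, so it carries no useful gradient."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return codebook[best]


# Hypothetical 3-entry codebook of 2-D token embeddings (assumption, not from the paper).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

# A small change in the logits shifts the expected embedding smoothly...
uniform = expected_embedding([0.0, 0.0, 0.0], codebook)
nudged = expected_embedding([0.0, 0.1, 0.0], codebook)

# ...while the hard lookup does not move at all for a comparable change.
hard_a = hard_embedding([0.0, 0.2, 0.1], codebook)
hard_b = hard_embedding([0.0, 0.3, 0.1], codebook)

print(uniform, nudged)
print(hard_a, hard_b)
```

Because the expected embedding responds continuously to the logits, a loss computed on decoded joint positions could in principle be backpropagated to them, which is the property both the regularizer and the inference-time optimization rely on.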

Paper and project links

Accepted to ICCV. Renamed to MaskControl. Project page: https://exitudio.github.io/ControlMM-page

Summary

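This paper addresses a limitation of motion diffusion models, which struggle to combine high-precision control with high-quality generation, and proposes MaskControl, the first approach to bring controllability to generative masked motion models. Two key innovations, Logits Regularizer (applied at training time) and Logit Optimization (applied at inference time), align the distribution of motion tokens with the controlled joint positions while preserving high-fidelity generation. Differentiable Expectation Sampling (DES) is introduced to handle the non-differentiable distribution sampling step that both techniques encounter. Experiments show that MaskControl outperforms state-of-the-art methods in motion quality (FID reduced by about 77%) and control precision (average error 0.91 vs. 1.08), and that it supports applications such as any-joint-any-frame control, body-part timeline control, and zero-shot objective control.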

Key Takeaways

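MaskControl is the first approach to introduce controllability to generative masked motion models.
Two key innovations, Logits Regularizer and Logit Optimization, align the distribution of motion tokens with the controlled joint positions.
Differentiable Expectation Sampling (DES) is introduced to handle the non-differentiable distribution sampling step.
MaskControl outperforms state-of-the-art methods in control precision (average error 0.91 vs. 1.08).
MaskControl reduces FID by about 77%, indicating superior motion quality.
MaskControl supports diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control.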
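To make the idea of inference-time logit optimization concrete, here is a toy sketch under strong assumptions: a single token, a hypothetical 1-D codebook whose entries are themselves joint positions (so the decoder is the identity), a squared-error control loss, and plain gradient descent with a hand-derived gradient. The names `optimize_logits`, `positions`, `lr`, and `steps` are illustrative only; the actual method operates on a full masked motion model and its decoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_position(logits, positions):
    """Expectation of the codebook positions under softmax(logits)."""
    return sum(p * c for p, c in zip(softmax(logits), positions))


def optimize_logits(logits, positions, target, lr=0.5, steps=300):
    """Gradient descent on the logits to pull the expected position toward target."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        y = sum(p * c for p, c in zip(probs, positions))
        # Loss is (y - target)^2, so d(loss)/dy = 2 * (y - target).
        grad_y = 2.0 * (y - target)
        # Chain rule through softmax: dy/dlogit_j = p_j * (c_j - y).
        logits = [l - lr * grad_y * p * (c - y)
                  for l, p, c in zip(logits, probs, positions)]
    return logits


# Hypothetical codebook of 1-D joint positions and a control target (assumptions).
positions = [0.0, 0.5, 1.0, 1.5]
initial_logits = [0.0, 0.0, 0.0, 0.0]
target = 1.2

before = expected_position(initial_logits, positions)
optimized = optimize_logits(initial_logits, positions, target)
after = expected_position(optimized, positions)

print(before, after)
```

The sketch only reshapes the token distribution, leaving the model weights untouched, which mirrors the description of optimizing predicted logits rather than retraining.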

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-07-09/Text-to-Motion/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Text-to-Motion
