Published: 2025-10-22

Updated: 2025-11-27


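⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as a reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading a paper.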

Updated 2025-10-22

SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization

Authors:Xiaofeng Tan, Hongsong Wang, Xin Geng, Pan Zhou

Text-to-motion generation is essential for advancing the creative industry but often presents challenges in producing consistent, realistic motions. To address this, we focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions, a critical yet largely unexplored problem. In this work, we theoretically investigate DPO under both online and offline settings, and reveal their respective limitations: overfitting in offline DPO, and biased sampling in online DPO. Building on our theoretical insights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models using “semi-online” data pairs, each consisting of an unpreferred motion from the online distribution and a preferred motion from offline datasets. This method leverages both online and offline DPO, allowing each to compensate for the other’s limitations. Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist of 3.25% (vs. e.g. 0.76% of MoDiPO) on the MLD model and 2.91% (vs. e.g. 0.66% of MoDiPO) on the MDM model. Additionally, the MLD model fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM Dist. Visualization results also show the efficacy of our SoPo in preference alignment. Project page: https://xiaofeng-tan.github.io/projects/SoPo/ .



Summary

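This paper addresses the fine-tuning of text-to-motion generation models. It theoretically analyzes Direct Preference Optimization (DPO) in both online and offline settings and identifies a limitation of each: overfitting in offline DPO and biased sampling in online DPO. Building on this analysis, it proposes Semi-online Preference Optimization (SoPo), which trains text-to-motion models on pairs that combine online and offline data in order to improve motion quality. Experiments show that SoPo outperforms other preference alignment methods, and that the MLD model fine-tuned with SoPo surpasses the state-of-the-art model on R-precision and MM Dist.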

Key Takeaways

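Text-to-motion generation matters for the creative industry, but producing consistent, realistic motions remains challenging.
The paper focuses on fine-tuning text-to-motion models so that they consistently favor high-quality, human-preferred motions.
It theoretically analyzes Direct Preference Optimization (DPO) in online and offline settings and identifies overfitting in offline DPO and biased sampling in online DPO.
It proposes Semi-online Preference Optimization (SoPo), which trains on pairs of an unpreferred motion from the online distribution and a preferred motion from offline datasets.
By combining online and offline DPO, SoPo lets each compensate for the other's limitations.
Experiments show that SoPo outperforms other preference alignment methods, and that the SoPo-fine-tuned MLD model surpasses the state-of-the-art model on R-precision and MM Dist.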
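
The semi-online pairing summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `dpo_loss` is the standard DPO objective for a single preference pair, while `build_semi_online_pair`, `score_fn`, and the rule of picking the lowest-scoring online sample as the unpreferred motion are hypothetical choices made for this example. The sketch also assumes plain log-probabilities are available, which a diffusion-based motion model would have to approximate.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, unpreferred) pair.

    logp_w / logp_l: log-probabilities of the preferred and unpreferred
    samples under the model being trained; ref_logp_* are the same
    quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_semi_online_pair(offline_preferred, online_samples, score_fn):
    """Pair an offline preferred motion with an online unpreferred one.

    Assumption for this sketch: the unpreferred motion is the online
    sample with the lowest score under a hypothetical `score_fn`.
    """
    unpreferred = min(online_samples, key=score_fn)
    return offline_preferred, unpreferred


if __name__ == "__main__":
    # Toy stand-ins: strings instead of motions, length as the score.
    pair = build_semi_online_pair("dataset_motion", ["a", "bb", "ccc"], len)
    print(pair)
    print(dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0))
```

In this arrangement the preferred side never comes from the model's own samples, and the unpreferred side is refreshed from the current model as training proceeds, which is the sense in which the pair is "semi-online".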


MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Authors:Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution to force the generated motion to accurately align with the controlled joint positions. Moreover, we introduce Differentiable Expectation Sampling (DES) to combat the non-differentiable distribution sampling process encountered by the logits regularizer and optimization. Extensive experiments demonstrate that MaskControl outperforms state-of-the-art methods, achieving superior motion quality (FID decreases by ~77%) and higher control precision (average error 0.91 vs. 1.08). Additionally, MaskControl enables diverse applications, including any-joint-any-frame control, body-part timeline control, and zero-shot objective control. Video visualization can be found at https://www.ekkasit.com/ControlMM-page/


Paper and project links

Camera Ready Version. ICCV 2025 (Oral). Renamed from ControlMM to MaskControl. Project page: https://exitudio.github.io/ControlMM-page

Summary

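Motion diffusion models support spatially controllable text-to-motion generation but struggle to combine high-precision control with high-quality motion. This paper proposes MaskControl to address that gap in generative masked motion models. The method has two key components. Logits Regularizer implicitly perturbs the logits at training time so that the distribution of motion tokens aligns with the controlled joint positions. Logit Optimization explicitly optimizes the predicted logits at inference time, directly reshaping the token distribution so that the generated motion matches the controlled joint positions. The paper also introduces Differentiable Expectation Sampling (DES) to handle the non-differentiable distribution sampling step that both components encounter. Experiments show that MaskControl surpasses prior methods in motion quality and control precision, and that it supports applications such as any-joint-any-frame control, body-part timeline control, and zero-shot objective control.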

Key Takeaways

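MaskControl is the first approach to introduce controllability to the generative masked motion model.
It has two key components, Logits Regularizer and Logit Optimization, which provide implicit and explicit control respectively.
Logits Regularizer perturbs the logits at training time to align the motion token distribution with the controlled joint positions.
Logit Optimization optimizes the predicted logits at inference time so that the generated motion accurately matches the controlled joint positions.
MaskControl introduces Differentiable Expectation Sampling (DES) to handle the non-differentiable distribution sampling step.
Experiments show that MaskControl outperforms prior methods in motion quality (FID decreases by about 77%) and control precision (average error 0.91 vs. 1.08).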
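
To make the inference-time idea above concrete, here is a toy Python sketch of optimizing logits through a differentiable expectation. It is only one plausible reading of the abstract, not the paper's algorithm: instead of sampling a discrete token (which blocks gradients), it takes the softmax-weighted average of codebook values and runs gradient descent on the logits until that average matches a target. The 1-D `codebook`, the squared-error objective, and the names `expected_value` and `optimize_logits` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_value(logits, codebook):
    """Differentiable stand-in for sampling: the softmax-weighted mean
    of the codebook entries (here scalars, e.g. one joint coordinate)."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, codebook))


def optimize_logits(logits, codebook, target, lr=0.5, steps=200):
    """Gradient descent on the logits so the expectation moves to `target`.

    Loss is (E - target)^2, with gradient
    d/d(logit_j) = 2 * (E - target) * p_j * (c_j - E).
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expect = sum(p * c for p, c in zip(probs, codebook))
        err = expect - target
        grads = [2.0 * err * p * (c - expect) for p, c in zip(probs, codebook)]
        logits = [z - lr * g for z, g in zip(logits, grads)]
    return logits


if __name__ == "__main__":
    codebook = [0.0, 1.0, 2.0]      # hypothetical decoded joint positions
    start = [0.0, 0.0, 0.0]         # uniform token distribution
    tuned = optimize_logits(start, codebook, target=1.5)
    print(expected_value(start, codebook), expected_value(tuned, codebook))
```

A real model would decode token embeddings through a motion decoder before measuring joint error; the scalar codebook here only shows why replacing sampling with an expectation makes the control objective differentiable with respect to the logits.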


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-22/Text-to-Motion/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Text-to-Motion
