⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them in serious academic settings; they are only suitable as a first-pass screen before reading a paper!
Updated 2025-11-20
InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE
Authors:Lipeng Wang, Hongxing Fan, Haohua Chen, Zehuan Huang, Lu Sheng
Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine the selection capacity and focus on critical temporal features, thereby preserving specific individual characteristics while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
Paper and project links
PDF: Accepted to AAAI-26. Code: https://github.com/Lighten001/InterMoE
Summary
This paper introduces InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts for generating high-quality human interactions. Its core mechanism is a routing scheme that combines high-level text semantics with low-level motion context to dispatch temporal motion features to specialized experts. This lets the experts dynamically determine selection capacity and focus on critical temporal features, ensuring high semantic fidelity while preserving individual characteristics. InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and by 22% on InterX.
Key Takeaways
- The InterMoE framework targets high-quality human interaction generation, particularly for applications such as virtual reality and robotics.
- Existing methods often fail to preserve unique individual characteristics or fully follow textual descriptions; InterMoE is designed to address these problems.
- The core of InterMoE is a routing mechanism that synergistically uses high-level text semantics and low-level motion context (see the sketch after this list).
- The framework dispatches temporal motion features to specialized experts, letting them focus on critical temporal features.
- InterMoE dynamically determines selection capacity, ensuring high semantic fidelity while preserving individual characteristics.
- Experiments show InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation.
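To make the routing idea concrete, below is a minimal, hypothetical PyTorch sketch of temporal-selective MoE routing, not the authors' released code (see the GitHub link above). It conditions a per-frame router on both the motion feature and a pooled text embedding; plain top-k dispatch stands in for the paper's dynamically determined selection capacity, and all module names and dimensions are assumptions.

```python
# Hypothetical sketch of temporal-selective MoE routing; names, dimensions,
# and the fixed top-k dispatch are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelectiveMoE(nn.Module):
    def __init__(self, motion_dim=256, text_dim=512, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, 4 * motion_dim),
                          nn.GELU(),
                          nn.Linear(4 * motion_dim, motion_dim))
            for _ in range(num_experts)
        )
        # The router sees both low-level motion context and high-level
        # text semantics, mirroring the synergistic routing described above.
        self.router = nn.Linear(motion_dim + text_dim, num_experts)
        self.top_k = top_k

    def forward(self, motion, text):
        # motion: (B, T, motion_dim) per-frame features; text: (B, text_dim)
        B, T, _ = motion.shape
        cond = torch.cat([motion, text[:, None, :].expand(B, T, -1)], dim=-1)
        logits = self.router(cond)                      # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # per-frame expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(motion)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # frames routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask, None] * expert(motion[mask])
        return out
```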
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
Authors:Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang
Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, but they still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast specific joint waypoints of the wrist or the fingers, in addition to the widely studied hand center points. We also enable Uni-Hand to predict hand-object interaction states (contact/separation) to better facilitate downstream tasks. As the first work in the literature to incorporate downstream task evaluation, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. Experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation on multiple downstream tasks also demonstrates impressive human-robot policy transfer for robotic manipulation and effective feature enhancement for action anticipation and recognition.
Paper and project links
PDF: Extended journal version of MMTwin (IROS’25)
Summary
Hand trajectory forecasting is critical for applications such as augmented reality and human-robot interaction. To address the limitations of existing methods, including insufficient prediction targets, modality gaps, entangled hand-head motion, and limited downstream validation, a universal hand motion forecasting framework is proposed that considers multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances. It harmonizes modalities via vision-language fusion, global context incorporation, and task-aware text embedding injection to forecast hand waypoints in both 2D and 3D. A novel dual-branch diffusion concurrently predicts head and hand motion, capturing their synergy in egocentric vision; hand-object interaction states are also predicted to facilitate downstream tasks. New benchmarks are built to assess real-world applicability, and the method achieves state-of-the-art performance on multiple public datasets and the newly proposed benchmarks.
Key Takeaways
- Hand trajectory prediction is important for applications such as augmented reality and human-robot interaction.
- Existing hand trajectory prediction methods suffer from insufficient prediction targets, modality gaps, and related limitations.
- A universal hand motion forecasting framework is proposed, covering multi-modal input and multi-dimensional, multi-target prediction.
- Hand waypoints are forecast via vision-language fusion, global context incorporation, and task-aware text embedding injection.
- A novel dual-branch diffusion concurrently predicts head and hand motion (see the sketch after this list).
- Hand-object interaction states are predicted to facilitate downstream tasks.
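As a rough illustration of the dual-branch diffusion described above, here is a hypothetical PyTorch sketch of a noise predictor with separate hand and head branches that exchange a fused context. The shapes, module names, and fusion scheme are assumptions rather than the paper's architecture.

```python
# Hypothetical dual-branch noise predictor for a DDPM-style model; the
# single-layer fusion and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DualBranchDenoiser(nn.Module):
    def __init__(self, hand_dim=3, head_dim=3, hidden=128, num_steps=1000):
        super().__init__()
        self.hand_in = nn.Linear(hand_dim, hidden)
        self.head_in = nn.Linear(head_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)  # hand-head motion-synergy exchange
        self.hand_out = nn.Linear(2 * hidden, hand_dim)
        self.head_out = nn.Linear(2 * hidden, head_dim)
        self.t_embed = nn.Embedding(num_steps, hidden)

    def forward(self, hand_x, head_x, t):
        # hand_x, head_x: (B, T, 3) noisy waypoints; t: (B,) diffusion step
        h_hand = self.hand_in(hand_x) + self.t_embed(t)[:, None, :]
        h_head = self.head_in(head_x) + self.t_embed(t)[:, None, :]
        shared = self.fuse(torch.cat([h_hand, h_head], dim=-1))
        eps_hand = self.hand_out(torch.cat([h_hand, shared], dim=-1))
        eps_head = self.head_out(torch.cat([h_head, shared], dim=-1))
        return eps_hand, eps_head  # predicted noise for each branch

# Minimal usage: predict the noise added to both trajectories at step t.
model = DualBranchDenoiser()
hand, head = torch.randn(4, 16, 3), torch.randn(4, 16, 3)
t = torch.randint(0, 1000, (4,))
eps_hand, eps_head = model(hand, head, t)
```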
PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision
Authors:Arnav M. Das, Chi Ian Tang, Fahim Kawsar, Mohammad Malekzadeh
Sensing human motions through Inertial Measurement Units (IMUs) embedded in personal devices has enabled significant applications in health and wellness. Labeled IMU data is scarce; however, unlabeled or weakly labeled IMU data can be used to model human motions. For video or text modalities, the “pretrain and adapt” approach utilizes large volumes of unlabeled or weakly labeled data to build a strong feature extractor, followed by adaptation to specific tasks using limited labeled data. However, pretraining methods are poorly understood for IMU data, and pipelines are rarely evaluated on out-of-domain tasks. We propose PRIMUS: a method for PRetraining IMU encoderS that uses a novel pretraining objective that is empirically validated based on downstream performance on both in-domain and out-of-domain datasets. The PRIMUS objective effectively enhances downstream performance by combining self-supervision, multimodal, and nearest-neighbor supervision. With fewer than 500 labeled samples per class, PRIMUS improves test accuracy by up to 15% compared to state-of-the-art baselines. To benefit the broader community, we have open-sourced our code at github.com/nokia-bell-labs/pretrained-imu-encoders.
Paper and project links
PDF: Presented at ICASSP 2025. Also presented under the title “PRIMUS: Pretraining IMU Encoders with Multimodal and Self-Supervised Learning” at the NeurIPS 2024 TSALM Workshop (Time Series in the Age of Large Models)
Summary
Sensing human motion through Inertial Measurement Units (IMUs) embedded in personal devices has broad applications in health and wellness. Since labeled IMU data is scarce, PRIMUS pretrains IMU encoders on large volumes of unlabeled or weakly labeled data. Its pretraining objective combines self-supervision, multimodal supervision, and nearest-neighbor supervision, effectively improving downstream performance. With only a small number of labeled samples per class, it improves test accuracy by up to 15% over state-of-the-art baselines. The code has been open-sourced.
Key Takeaways
- IMU-based motion sensing has broad applications in health and wellness.
- Labeled IMU data is scarce, so unlabeled or weakly labeled data must be used for modeling.
- The “pretrain and adapt” approach works well for video and text modalities, but pretraining is poorly understood for IMU data and rarely evaluated on out-of-domain tasks.
- PRIMUS pretrains IMU encoders with a novel, empirically validated pretraining objective.
- PRIMUS combines self-supervision, multimodal, and nearest-neighbor supervision to improve downstream performance (see the sketch after this list).
- With few labeled samples per class, it improves test accuracy by up to 15% over state-of-the-art baselines.
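The abstract names three supervision signals; the sketch below shows one plausible way to combine them in PyTorch: an InfoNCE term across two augmented IMU views (self-supervision), an IMU-to-video alignment term (multimodal), and a nearest-neighbor positive swap in the style of NNCLR. The loss weights and pairing choices are assumptions, not the released PRIMUS objective (see the repository above for the real one).

```python
# Illustrative combination of self-supervised, multimodal, and
# nearest-neighbor contrastive terms; weights and pairings are assumptions.
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    # q, k: (B, D) L2-normalized embeddings; positives lie on the diagonal.
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def combined_loss(imu_a, imu_b, video_emb, w_ss=1.0, w_mm=1.0, w_nn=1.0):
    # imu_a, imu_b: encoder outputs for two augmentations of the same windows
    imu_a = F.normalize(imu_a, dim=-1)
    imu_b = F.normalize(imu_b, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    loss_ss = info_nce(imu_a, imu_b)      # self-supervision across views
    loss_mm = info_nce(imu_a, video_emb)  # multimodal IMU-video alignment
    # Nearest-neighbor supervision: replace each positive with its closest
    # other sample in the batch, encouraging semantically coherent clusters.
    sim = imu_a @ imu_b.t()
    sim.fill_diagonal_(float('-inf'))
    nn_idx = sim.argmax(dim=1)
    loss_nn = info_nce(imu_a, imu_b[nn_idx])
    return w_ss * loss_ss + w_mm * loss_mm + w_nn * loss_nn
```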
MoReFun: Past-Movement Guided Motion Representation Learning for Future Motion Prediction and Understanding
Authors:Junyu Shi, Haoting Wu, Zhiyuan Zhang, Lijiang Liu, Yong Sun, Qiang Nie
3D human motion prediction aims to generate coherent future motions from observed sequences, yet existing end-to-end regression frameworks often fail to capture complex dynamics and tend to produce temporally inconsistent or static predictions, a limitation rooted in representation shortcutting, where models rely on superficial cues rather than learning meaningful motion structure. We propose a two-stage self-supervised framework that decouples representation learning from prediction. In the pretraining stage, the model performs unified past-future self-reconstruction, reconstructing the past sequence while recovering masked joints in the future sequence under full historical guidance. A velocity-based masking strategy selects highly dynamic joints, forcing the model to focus on informative motion components and internalize the statistical dependencies between past and future states without regression interference. In the fine-tuning stage, the pretrained model predicts the entire future sequence, now treated as fully masked, and is further equipped with a lightweight future-text prediction head for joint optimization of low-level motion prediction and high-level motion understanding. Experiments on Human3.6M, 3DPW, and AMASS show that our method reduces average prediction errors by 8.8% over state-of-the-art methods while achieving competitive future-motion understanding performance compared to LLM-based models. Code is available at: https://github.com/JunyuShi02/MoReFun
Paper and project links
Summary
This paper proposes a two-stage self-supervised framework for improving 3D human motion prediction by decoupling representation learning from prediction. After a pretraining stage and a fine-tuning stage, the model generates more coherent future motions from observed sequences, reduces average prediction error, and improves future-motion understanding.
Key Takeaways
- Current 3D human motion prediction frameworks struggle to capture complex dynamics and tend to produce temporally inconsistent or static predictions.
- This limitation stems from representation shortcutting: models rely on superficial cues instead of learning meaningful motion structure.
- The proposed two-stage self-supervised framework decouples representation learning from prediction to improve accuracy.
- In the pretraining stage, the model performs unified past-future self-reconstruction, recovering masked future joints under full historical guidance.
- A velocity-based masking strategy selects highly dynamic joints, forcing the model to focus on informative motion components (see the sketch after this list).
- In the fine-tuning stage, the pretrained model predicts the entire future sequence and is equipped with a lightweight future-text prediction head for joint optimization of low-level motion prediction and high-level motion understanding.
- Experiments show an 8.8% reduction in average prediction error over state-of-the-art methods, with future-motion understanding competitive with LLM-based models.
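The velocity-based masking strategy can be sketched in a few lines: rank joints by mean frame-to-frame displacement over the future window and mask the most dynamic ones. This hypothetical PyTorch function is an assumption about the recipe; the mask ratio and exact scoring rule may differ from the paper's code (linked above).

```python
# Hypothetical velocity-based joint masking; ratio and scoring are assumptions.
import torch

def velocity_based_mask(future, mask_ratio=0.5):
    # future: (B, T, J, 3) future joint positions
    vel = future[:, 1:] - future[:, :-1]        # (B, T-1, J, 3) displacements
    speed = vel.norm(dim=-1).mean(dim=1)        # (B, J) mean per-joint speed
    num_mask = max(1, int(mask_ratio * speed.size(1)))
    masked_idx = speed.topk(num_mask, dim=1).indices  # most dynamic joints
    mask = torch.zeros_like(speed, dtype=torch.bool)
    mask.scatter_(1, masked_idx, True)          # True marks a masked joint
    return mask  # (B, J), applied across the future sequence in pretraining
```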