
Text-to-Motion


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on these summaries in serious academic settings; use them only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-22

TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Authors: Eddie Pokming Sheung, Qihao Liu, Wufei Ma, Prakhar Kaushik, Jianwen Xie, Alan Yuille

With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
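
The abstract describes a two-stage, skeleton-driven pipeline: a canonical 3D avatar (as a triplane) and a motion sequence are first generated from text, after which a second diffusion model re-poses the avatar frame by frame in an auto-regressive loop. Below is a minimal Python sketch of that control flow only; all module names, signatures, and tensor shapes are hypothetical placeholders, not the authors' implementation.

import torch

def generate_4d(prompt, avatar_diffusion, motion_diffusion, repose_diffusion,
                num_frames=120):
    """Hypothetical control flow of a TriDiff-4D-style pipeline."""
    # Stage 1: canonical avatar triplane and a skeleton motion sequence from text.
    canonical = avatar_diffusion(prompt)           # e.g. (3, C, H, W) triplane
    poses = motion_diffusion(prompt, num_frames)   # e.g. (T, J, 3) joint poses

    # Stage 2: one diffusion pass per frame; conditioning on the previous frame
    # keeps the sequence temporally coherent and allows arbitrary length.
    frames, prev = [], canonical
    for t in range(num_frames):
        frame = repose_diffusion(canonical, poses[t], prev)
        frames.append(frame)
        prev = frame
    return torch.stack(frames)                     # (T, 3, C, H, W) 4D sequence

# Toy stand-ins so the sketch runs end to end with random tensors.
if __name__ == "__main__":
    avatar = lambda p: torch.randn(3, 8, 16, 16)
    motion = lambda p, n: torch.randn(n, 24, 3)
    repose = lambda tri, pose, prev: tri + 0.01 * prev
    print(generate_4d("a person waving", avatar, motion, repose, num_frames=4).shape)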


Paper and Project Links

PDF 8 pages, 10 figures, Under review at a conference

Summary

This paper proposes TriDiff-4D, a novel 4D generation pipeline for producing high-quality, temporally coherent 4D avatars from text descriptions. The method uses diffusion-based triplane re-posing and learns 3D structure and motion priors from large-scale 3D and motion datasets, enabling skeleton-driven 4D generation with strong temporal coherence, motion accuracy, computational efficiency, and visual fidelity.

Key Takeaways

  1. The paper motivates the growing demand for 3D animation and the challenge of generating high-quality, controllable 4D avatars from text.
  2. Existing methods suffer from problems such as temporal and geometric inconsistencies, perceptual artifacts, and motion irregularities.
  3. TriDiff-4D is a novel 4D generation pipeline that uses diffusion-based triplane re-posing to produce high-quality, temporally coherent avatars.
  4. TriDiff-4D learns 3D structure and motion priors from large-scale datasets, supporting skeleton-driven animation generation.
  5. The method adopts an auto-regressive strategy to generate sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process.
  6. TriDiff-4D achieves excellent temporal coherence, motion accuracy, computational efficiency, and visual fidelity.

Cool Papers

Click here to view paper screenshots

GazeInterpreter: Parsing Eye Gaze to Generate Eye-Body-Coordinated Narrations

Authors: Qing Chang, Zhiming Hu

Comprehensively interpreting human behavior is a core challenge in human-aware artificial intelligence. However, prior works typically focused on body behavior, neglecting the crucial role of eye gaze and its synergy with body motion. We present GazeInterpreter - a novel large language model-based (LLM-based) approach that parses eye gaze data to generate eye-body-coordinated narrations. Specifically, our method features 1) a symbolic gaze parser that translates raw gaze signals into symbolic gaze events; 2) a hierarchical structure that first uses an LLM to generate eye gaze narration at semantic level and then integrates gaze with body motion within the same observation window to produce integrated narration; and 3) a self-correcting loop that iteratively refines the modality match, temporal coherence, and completeness of the integrated narration. This hierarchical and iterative processing can effectively align physical values and semantic text in the temporal and spatial domains. We validated the effectiveness of our eye-body-coordinated narrations on the text-driven motion generation task in the large-scale Nymeria benchmark. Moreover, we report significant performance improvements for the sample downstream tasks of action anticipation and behavior summarization. Taken together, these results reveal the significant potential of parsing eye gaze to interpret human behavior and open up a new direction for human behavior understanding.
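
The first component, the symbolic gaze parser, converts raw gaze signals into symbolic gaze events that an LLM can then narrate. The paper's own parser is not reproduced here, so the sketch below uses a standard velocity-threshold (I-VT style) rule to split a gaze stream into "fixation" and "saccade" events; the threshold and event vocabulary are illustrative assumptions.

import numpy as np

def parse_gaze(gaze_dirs: np.ndarray, fps: float = 30.0,
               saccade_deg_per_s: float = 30.0):
    """gaze_dirs: (T, 3) unit gaze-direction vectors sampled at `fps`."""
    # Angular velocity between consecutive samples, in degrees per second.
    dots = np.clip(np.sum(gaze_dirs[1:] * gaze_dirs[:-1], axis=-1), -1.0, 1.0)
    ang_vel = np.degrees(np.arccos(dots)) * fps

    labels = np.where(ang_vel > saccade_deg_per_s, "saccade", "fixation")
    events, start = [], 0
    for t in range(1, len(labels)):
        if labels[t] != labels[start]:
            events.append({"event": labels[start], "start": start, "end": t})
            start = t
    events.append({"event": labels[start], "start": start, "end": len(labels)})
    return events  # symbolic events, e.g. {"event": "fixation", "start": 0, "end": 12}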


Paper and Project Links

PDF Accepted to AAAI 2026. 9 pages, 4 figures

Summary

This paper presents GazeInterpreter, an LLM-based method that parses eye-gaze data to generate eye-body-coordinated narrations. A symbolic gaze parser converts raw gaze signals into symbolic gaze events; a hierarchical structure first uses an LLM to generate gaze narration at the semantic level and then integrates gaze with body motion within the same observation window to produce an integrated narration; a self-correcting loop iteratively refines modality match, temporal coherence, and completeness. On the large-scale Nymeria benchmark, the method proves effective for text-driven motion generation and improves the downstream tasks of action anticipation and behavior summarization, revealing the potential of parsing eye gaze for interpreting human behavior and opening a new direction for behavior understanding.

Key Takeaways

  1. GazeInterpreter is an LLM-based method that parses eye-gaze data to generate narrations coordinated with body motion.
  2. A symbolic gaze parser converts raw gaze signals into symbolic gaze events.
  3. A hierarchical structure first generates gaze narration at the semantic level and then combines it with body motion to produce an integrated narration.
  4. A self-correcting loop refines the modality match, temporal coherence, and completeness of the narration.
  5. On the Nymeria benchmark, GazeInterpreter is effective for the text-driven motion generation task.
  6. The method improves performance on downstream tasks such as action anticipation and behavior summarization.

Cool Papers

Click here to view paper screenshots

Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding

Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

Pre-training has proven effective for learning transferable features in sign language understanding (SLU) tasks. Recently, skeleton-based methods have gained increasing attention because they can robustly handle variations in subjects and backgrounds without being affected by appearance or environmental factors. Current SLU methods continue to face three key limitations: 1) weak semantic grounding, as models often capture low-level motion patterns from skeletal data but struggle to relate them to linguistic meaning; 2) imbalance between local details and global context, with models either focusing too narrowly on fine-grained cues or overlooking them for broader context; and 3) inefficient cross-modal learning, as constructing semantically aligned representations across modalities remains difficult. To address these, we propose Sigma, a unified skeleton-based SLU framework featuring: 1) a sign-aware early fusion mechanism that facilitates deep interaction between visual and textual modalities, enriching visual features with linguistic context; 2) a hierarchical alignment learning strategy that jointly maximises agreements across different levels of paired features from different modalities, effectively capturing both fine-grained details and high-level semantic relationships; and 3) a unified pre-training framework that combines contrastive learning, text matching and language modelling to promote semantic consistency and generalisation. Sigma achieves new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation on multiple benchmarks spanning different sign and spoken languages, demonstrating the impact of semantically informative pre-training and the effectiveness of skeletal data as a stand-alone solution for SLU.
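
The hierarchical alignment strategy maximises agreement between paired skeleton and text features at more than one level. A generic way to realise that idea is an InfoNCE-style contrastive loss applied both to pooled fine-grained features and to global features, as in the sketch below; this illustrates the general technique only, not the authors' exact loss or feature hierarchy.

import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07):
    """a, b: (B, D) paired embeddings; row i of `a` matches row i of `b`."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_alignment_loss(skel_local, skel_global, txt_local, txt_global):
    """*_local: (B, T, D) frame/token features; *_global: (B, D) clip/sentence features."""
    fine = info_nce(skel_local.mean(dim=1), txt_local.mean(dim=1))   # pooled fine-grained level
    coarse = info_nce(skel_global, txt_global)                        # global level
    return fine + coarse   # maximise agreement at both granularities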


Paper and Project Links

PDF

Summary

This paper shows that pre-training is effective for sign language understanding (SLU) and highlights the advantages of skeleton-based methods. It identifies three key limitations of current SLU approaches and proposes Sigma, a unified skeleton-based SLU framework that addresses them through early fusion of visual and textual modalities, a hierarchical alignment learning strategy, and a unified pre-training framework. Sigma sets new state-of-the-art results on isolated sign language recognition, continuous sign language recognition, and gloss-free sign language translation across multiple benchmarks.

Key Takeaways

  1. Pre-training helps learn transferable features for sign language understanding tasks.
  2. Skeleton-based methods robustly handle variations in subjects and backgrounds, unaffected by appearance or environmental factors.
  3. Current sign language understanding methods face three key limitations: weak semantic grounding, an imbalance between local details and global context, and inefficient cross-modal learning.
  4. Sigma enriches visual features with linguistic context by fusing the visual and textual modalities.
  5. Sigma uses a hierarchical alignment learning strategy that jointly optimizes cross-modal feature pairs at different levels, capturing both fine-grained details and high-level semantic relationships.
  6. Sigma's unified pre-training framework combines contrastive learning, text matching, and language modelling to promote semantic consistency and generalization.

Cool Papers

Click here to view paper screenshots

Human Motion Unlearning

Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso

We introduce the task of human motion unlearning to prevent the synthesis of toxic animations while preserving the general text-to-motion generative performance. Unlearning toxic motions is challenging as those can be generated from explicit text prompts and from implicit toxic combinations of safe motions (e.g., “kicking” is “loading and swinging a leg”). We propose the first motion unlearning benchmark by filtering toxic motions from the large and recent text-to-motion datasets of HumanML3D and Motion-X. We propose baselines, by adapting state-of-the-art image unlearning techniques to process spatio-temporal signals. Finally, we propose a novel motion unlearning model based on Latent Code Replacement, which we dub LCR. LCR is training-free and suitable to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms baselines qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.
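
LCR operates in the discrete latent space of a text-to-motion model. The paper's exact procedure is not reproduced here; the sketch below only illustrates the training-free flavour of the idea with a hypothetical VQ codebook, where code indices flagged as toxic are remapped to their nearest safe codebook entries before a motion is decoded.

import torch

def build_replacement_table(codebook: torch.Tensor, toxic_ids: torch.Tensor):
    """codebook: (K, D) VQ embeddings; toxic_ids: (M,) code indices to unlearn."""
    safe_mask = torch.ones(codebook.size(0), dtype=torch.bool)
    safe_mask[toxic_ids] = False
    safe_ids = safe_mask.nonzero(as_tuple=True)[0]
    # Map every toxic code to its nearest safe code in embedding space.
    dists = torch.cdist(codebook[toxic_ids], codebook[safe_ids])   # (M, S)
    table = torch.arange(codebook.size(0))
    table[toxic_ids] = safe_ids[dists.argmin(dim=1)]
    return table   # (K,) lookup table: code id -> possibly replaced code id

def replace_codes(motion_tokens: torch.Tensor, table: torch.Tensor):
    """motion_tokens: (T,) discrete latent codes of a generated motion."""
    return table[motion_tokens]   # applied before decoding, with no retraining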


Paper and Project Links

PDF

Summary

This paper introduces the task of human motion unlearning: preventing the synthesis of toxic animations while preserving general text-to-motion generative performance. The authors build the first motion unlearning benchmark by filtering toxic motions from the large, recent text-to-motion datasets HumanML3D and Motion-X, and establish baselines by adapting state-of-the-art image unlearning techniques to spatio-temporal signals. They further propose LCR (Latent Code Replacement), a training-free motion unlearning model suited to the discrete latent spaces of state-of-the-art text-to-motion diffusion models. LCR is simple and consistently outperforms the baselines both qualitatively and quantitatively. Project page: https://www.pinlab.org/hmu.

Key Takeaways

  • A new task, human motion unlearning, is introduced to prevent the synthesis of toxic animations while preserving general text-to-motion generative performance.
  • Toxic motions can be triggered by explicit text prompts or by implicit combinations of safe motions (e.g., “kicking” as “loading and swinging a leg”).
  • The first motion unlearning benchmark is built by filtering toxic motions from HumanML3D and Motion-X, with baselines adapted from state-of-the-art image unlearning techniques.
  • LCR (Latent Code Replacement) is training-free, suits the discrete latent spaces of text-to-motion diffusion models, and consistently outperforms the baselines qualitatively and quantitatively.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting!