⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are intended only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition
Authors: Ren Zhang, Huilai Li, Chao qi, Guoliang Xu, Tianyu Zhou, Wei wei, Jianqin Yin
Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.
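The abstract describes decoupling facial dynamics into structure, dynamic-texture, and motion-semantics experts whose features are passed to an LLM. Below is a minimal, hypothetical PyTorch sketch of that three-expert idea; the encoder designs, the choice of inputs (onset frame, apex-onset residual, two-channel optical flow), the token counts, and the projection into the LLM embedding space are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical three-expert disentanglement sketch; all module names, input
# choices, and dimensions are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class ExpertEncoder(nn.Module):
    """One expert: a tiny conv stack mapping an image-like input to 16 tokens."""
    def __init__(self, in_ch: int, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(4),                   # 4x4 spatial grid -> 16 tokens
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (B, C, H, W)
        feat = self.net(x)                             # (B, dim, 4, 4)
        tokens = feat.flatten(2).transpose(1, 2)       # (B, 16, dim)
        return self.proj(tokens)

class ThreeExpertDisentangler(nn.Module):
    """Structure expert sees the onset frame, the dynamic-texture expert sees the
    apex-onset residual, and the motion expert sees optical flow; all tokens are
    projected into the LLM embedding space and prepended to the text prompt."""
    def __init__(self, dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.structure = ExpertEncoder(in_ch=3, dim=dim)   # RGB onset frame
        self.texture = ExpertEncoder(in_ch=3, dim=dim)     # apex - onset residual
        self.motion = ExpertEncoder(in_ch=2, dim=dim)      # 2-channel optical flow
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, onset, apex, flow):
        tokens = torch.cat([
            self.structure(onset),
            self.texture(apex - onset),
            self.motion(flow),
        ], dim=1)                                      # (B, 48, dim)
        return self.to_llm(tokens)                     # visual tokens for the LLM

# toy usage
onset, apex = torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128)
flow = torch.randn(2, 2, 128, 128)
print(ThreeExpertDisentangler()(onset, apex, flow).shape)  # torch.Size([2, 48, 4096])
```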
Paper and Project Links
Summary
The paper tackles two key challenges in micro-expression recognition: the model's difficulty in focusing on subtle motion, and the mismatch between the textual labels of existing datasets and the actual facial muscle movements. By introducing the Uni-MER dataset and the multi-expert disentangled architecture DEFT-LLM, it aligns text semantics with local facial motion. The method captures subtle emotional cues effectively and delivers state-of-the-art performance on several challenging micro-expression recognition benchmarks.
Key Takeaways
- Micro-expression recognition (MER) is crucial for inferring genuine emotion; applying a multimodal large language model (MLLM) enables spatio-temporal analysis of facial motion with interpretable descriptions.
- Two core challenges remain: static appearance is entangled with dynamic motion cues, and existing text labels do not fully correspond to the underlying facial muscle movements.
Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction
Authors: Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang
Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multimodal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models have been released at https://github.com/IRMVLab/MMTwin.
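To make the twin-diffusion idea concrete, here is a minimal DDIM-style sampling sketch in PyTorch in which an egomotion denoiser and a hand-trajectory (HTP) denoiser share a fused multimodal condition and are stepped concurrently, with the HTP denoiser also conditioned on the current egomotion estimate. The network shapes, noise schedule, latent dimensions, and conditioning scheme are assumptions, not the released MMTwin implementation.

```python
# Twin-diffusion sampling sketch: two denoisers stepped together, sharing one
# fused multimodal condition; all shapes and the schedule are assumptions.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """A small MLP that predicts the noise added to its latent."""
    def __init__(self, x_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, cond, t):
        t = t.float().unsqueeze(-1) / 1000.0           # crude timestep embedding
        return self.net(torch.cat([x, cond, t], dim=-1))

@torch.no_grad()
def twin_sample(ego_denoiser, htp_denoiser, cond, steps=50, ego_dim=6, htp_dim=24):
    """Jointly denoise an egomotion latent and a latent of 8 future 3D waypoints;
    the HTP denoiser also sees the current egomotion estimate."""
    B = cond.shape[0]
    ego, htp = torch.randn(B, ego_dim), torch.randn(B, htp_dim)
    alphas = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, steps), dim=0)
    for i in reversed(range(steps)):
        t = torch.full((B,), i)
        a = alphas[i]
        ego_eps = ego_denoiser(ego, cond, t)
        htp_eps = htp_denoiser(htp, torch.cat([cond, ego], dim=-1), t)
        ego = (ego - (1 - a).sqrt() * ego_eps) / a.sqrt()   # predicted clean latents
        htp = (htp - (1 - a).sqrt() * htp_eps) / a.sqrt()
        if i > 0:                                           # DDIM-style re-noising (eta = 0)
            a_prev = alphas[i - 1]
            ego = a_prev.sqrt() * ego + (1 - a_prev).sqrt() * ego_eps
            htp = a_prev.sqrt() * htp + (1 - a_prev).sqrt() * htp_eps
    return ego, htp.view(B, 8, 3)

# toy usage: `cond` stands in for fused RGB / point-cloud / text / past-waypoint features
cond = torch.randn(4, 128)
ego_net = Denoiser(x_dim=6, cond_dim=128)
htp_net = Denoiser(x_dim=24, cond_dim=128 + 6)
ego, traj = twin_sample(ego_net, htp_net, cond)
print(ego.shape, traj.shape)   # torch.Size([4, 6]) torch.Size([4, 8, 3])
```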
Paper and Project Links
PDF Accepted to IROS 2025
Summary
This entry presents MMTwin, a diffusion-based model for multimodal 3D hand trajectory prediction. The model takes multimodal inputs, including 2D RGB images, 3D point clouds, past hand waypoints, and a text prompt. By integrating two latent diffusion models, an egomotion diffusion and an HTP diffusion, MMTwin predicts camera egomotion and future hand trajectories concurrently. It further introduces a hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. Experiments show that MMTwin predicts more plausible future 3D hand trajectories than state-of-the-art baselines and generalizes well.
Key Takeaways
- Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations.
- Existing hand trajectory prediction methods rely mainly on past 2D egocentric observations and lack fusion of multimodal environmental information.
- MMTwin improves 3D hand trajectory prediction by combining multimodal inputs, including 2D RGB images and 3D point clouds.
- MMTwin integrates two latent diffusion models to predict camera egomotion and future hand trajectories concurrently.
- A hybrid Mamba-Transformer module serves as the denoising model of the HTP diffusion to better fuse multimodal features (see the sketch after this list).
- Experiments show that MMTwin predicts future 3D hand trajectories more accurately than existing methods and generalizes well to unseen environments.
- The code and pretrained models have been released at https://github.com/IRMVLab/MMTwin.
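As a rough illustration of the hybrid Mamba-Transformer denoising block mentioned in the takeaways above, the sketch below mixes a recurrent branch (a GRU standing in for the selective state-space/Mamba layer, purely for illustration) with cross-attention to fused multimodal condition tokens. The widths, normalization placement, and residual fusion are assumptions rather than the paper's design.

```python
# Hybrid sequence-mixing block sketch; a GRU stands in for the Mamba/state-space
# branch purely for illustration, and the layout is an assumption.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ssm = nn.GRU(dim, dim, batch_first=True)          # stand-in for a Mamba layer
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        # x: (B, L, dim) noisy trajectory tokens; cond: (B, M, dim) fused multimodal tokens
        x = x + self.ssm(self.norm1(x))[0]                     # recurrent (sequential) mixing
        h = self.norm2(x)
        x = x + self.attn(h, cond, cond, need_weights=False)[0]  # cross-attend to conditions
        return x + self.mlp(self.norm3(x))

# toy usage
block = HybridBlock()
x, cond = torch.randn(2, 8, 256), torch.randn(2, 20, 256)
print(block(x, cond).shape)   # torch.Size([2, 8, 256])
```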
Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior
Authors: Foram N Shah, Parshwa Shah, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy
Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still face significant challenges in simultaneously achieving high realism, precise dance-music synchronization, diverse motion expression, and physical plausibility. To address these limitations, we propose a novel approach that leverages a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from diverse guidance signals, including music, genre, and pose, into high-quality dance motion sequences. Our framework also supports semantic motion editing, such as motion inpainting and body part modification. Specifically, we introduce a multi-tower masked motion model that integrates a text-conditioned masked motion backbone with two parallel, modality-specific branches: a music-guidance tower and a pose-guidance tower. The model is trained using synchronized and progressive masked training, which allows effective infusion of the pretrained text-to-motion prior into the dance synthesis process while enabling each guidance branch to optimize independently through its own loss function, mitigating gradient interference. During inference, we introduce classifier-free logits guidance and pose-guided token optimization to strengthen the influence of music, genre, and pose signals. Extensive experiments demonstrate that our method sets a new state of the art in dance generation, significantly advancing both the quality and editability over existing approaches. Project Page available at https://foram-s1.github.io/DanceMosaic/
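The "classifier-free logits guidance" described in the abstract can be pictured as mixing conditional and unconditional token logits during masked decoding. The sketch below runs a MaskGIT-style unmasking loop under that reading, with text, music, and pose conditions (genre could be folded into the text condition); the guidance formula, scales, schedule, and the stub model interface are assumptions for illustration, not the authors' code.

```python
# Masked-token decoding with multi-signal classifier-free logits guidance; the
# guidance formula, scales, schedule, and model interface are illustrative.
import torch

def guided_logits(l_uncond, l_text, l_music, l_pose, s_text=2.0, s_music=1.5, s_pose=1.5):
    """Push logits away from the unconditional prediction toward each condition."""
    return (l_uncond
            + s_text * (l_text - l_uncond)
            + s_music * (l_music - l_uncond)
            + s_pose * (l_pose - l_uncond))

@torch.no_grad()
def masked_decode(model, conds, seq_len=196, steps=10, mask_id=1024):
    """Iteratively unmask motion tokens, keeping the most confident predictions
    each round (MaskGIT-style). `model(tokens, cond)` returns (B, L, vocab) logits."""
    B = conds["text"].shape[0]
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = guided_logits(
            model(tokens, None),            # unconditional pass
            model(tokens, conds["text"]),   # genre can be folded into the text condition
            model(tokens, conds["music"]),
            model(tokens, conds["pose"]),
        )
        conf, pred = logits.softmax(-1).max(-1)                 # (B, L)
        conf = conf.masked_fill(tokens != mask_id, -1.0)        # only fill masked slots
        keep = int(seq_len * (step + 1) / steps) - int(seq_len * step / steps)
        idx = conf.topk(keep, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens

# toy usage with a stub "model" that emits random logits over a 1024-token codebook
stub = lambda tok, cond: torch.randn(tok.shape[0], tok.shape[1], 1024)
conds = {k: torch.randn(2, 512) for k in ("text", "music", "pose")}
print(masked_decode(stub, conds).shape)   # torch.Size([2, 196])
```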
Paper and Project Links
Summary
This entry proposes a new dance-generation method that uses a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from multiple guidance signals (music, genre, and pose) to high-quality dance motion sequences. The method supports semantic motion editing such as motion inpainting and body-part modification. With a multi-tower masked motion model and synchronized, progressive masked training, it achieves high realism, precise dance-music synchronization, diverse motion expression, and physical plausibility.
Key Takeaways
- Proposes a new dance-generation approach built on a generative masked text-to-motion model.
- Maps diverse guidance signals (music, genre, and pose) into high-quality dance motion sequences.
- Supports semantic motion editing, such as motion inpainting and body-part modification.
- A multi-tower masked motion model with synchronized, progressive masked training improves realism, dance-music synchronization, motion diversity, and physical plausibility.
- Training infuses the pretrained text-to-motion prior while letting each guidance branch optimize independently through its own loss function, mitigating gradient interference.
- At inference, classifier-free logits guidance and pose-guided token optimization strengthen the influence of music, genre, and pose signals.
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
Authors: Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLM for high-level planning to decompose a story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle with generating high-quality videos aligned with the complex single-scene description, as visualizing such complex description involves coherent composition of multiple characters and events, complex motion synthesis and multi-character customization. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout planning. Next, DREAMRUNNER presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DREAMRUNNER’s robust ability to generate multi-object interactions with qualitative examples.
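One way to picture the fine-grained object-motion binding of a region-based attention module like SR3AI is as a mask that restricts each object's prompt tokens to the latent patches inside that object's per-frame layout box. The helper below builds such a spatio-temporal mask; the latent grid size, box format, and mask semantics are assumptions for illustration, not the paper's actual module.

```python
# Spatio-temporal region mask sketch for binding an object's tokens to its
# per-frame layout box; grid size and box format are assumptions.
import torch

def region_attention_mask(boxes, num_frames, grid=16):
    """boxes: {object_id: [(x0, y0, x1, y1), ...]} with one normalized box per frame.
    Returns a bool mask (num_objects, num_frames * grid * grid); True means the
    object's prompt tokens may attend to that latent patch."""
    obj_ids = sorted(boxes)
    mask = torch.zeros(len(obj_ids), num_frames, grid, grid, dtype=torch.bool)
    for o, oid in enumerate(obj_ids):
        for f, (x0, y0, x1, y1) in enumerate(boxes[oid][:num_frames]):
            c0, c1 = int(x0 * grid), max(int(x0 * grid) + 1, int(x1 * grid))
            r0, r1 = int(y0 * grid), max(int(y0 * grid) + 1, int(y1 * grid))
            mask[o, f, r0:r1, c0:c1] = True            # patches inside the box
    return mask.flatten(1)

# toy usage: one character walking left-to-right over 4 frames, one static prop
boxes = {
    "knight": [(0.1 + 0.2 * f, 0.3, 0.3 + 0.2 * f, 0.9) for f in range(4)],
    "castle": [(0.6, 0.1, 0.95, 0.8)] * 4,
}
m = region_attention_mask(boxes, num_frames=4)
print(m.shape, m.sum(dim=1))   # torch.Size([2, 1024]) plus per-object patch counts
```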
Paper and Project Links
PDF AAAI 2026, Project website: https://zunwang1.github.io/DreamRunner
Summary
This entry introduces DREAMRUNNER, a new method for storytelling video generation (SVG). It uses a large language model (LLM) to structure the input script for coarse scene-level planning and fine-grained object-level layout planning, and applies retrieval-augmented test-time adaptation to capture target motion priors, enabling motion customization based on retrieved videos. It also proposes SR3AI, a spatial-temporal region-based 3D attention and prior-injection module for fine-grained object-motion binding and frame-by-frame spatio-temporal semantic control. DREAMRUNNER achieves strong results in character consistency, text alignment, and smooth transitions, and shows fine-grained condition-following ability in compositional text-to-video generation.
Key Takeaways
- DREAMRUNNER is a new method for storytelling video generation (SVG) that uses a large language model (LLM) for coarse scene planning and fine-grained object-level layout planning.
- Retrieval-augmented test-time adaptation captures target motion priors, supporting motion customization based on retrieved videos.
- The SR3AI module enables fine-grained object-motion binding and frame-by-frame spatio-temporal semantic control.
- DREAMRUNNER delivers strong performance in character consistency, text alignment, and smooth transitions.
- In compositional text-to-video generation, DREAMRUNNER shows strong fine-grained condition-following ability, significantly outperforming baselines on T2V-ComBench.
- Compared with other SVG baselines, DREAMRUNNER achieves state-of-the-art performance.