Published: 2025-11-20

Updated: 2025-11-27

Word count: 6.4k

Reading time: 25 min

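⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading the papers.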

Updated 2025-11-20

Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances

Authors:Rishu Kumar Singh, Navneet Shreya, Sarmistha Das, Apoorva Singh, Sriparna Saha

Existing approaches to complaint analysis largely rely on unimodal, short-form content such as tweets or product reviews. This work advances the field by leveraging multimodal, multi-turn customer support dialogues, where users often share both textual complaints and visual evidence (e.g., screenshots, product photos) to enable fine-grained classification of complaint aspects and severity. We introduce VALOR, a Validation-Aware Learner with Expert Routing, tailored for this multimodal setting. It employs a multi-expert reasoning setup using large-scale generative models with Chain-of-Thought (CoT) prompting for nuanced decision-making. To ensure coherence between modalities, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy. In alignment with the United Nations Sustainable Development Goals (UN SDGs), the proposed framework supports SDG 9 (Industry, Innovation and Infrastructure) by advancing AI-driven tools for robust, scalable, and context-aware service infrastructure. Further, by enabling structured analysis of complaint narratives and visual context, it contributes to SDG 12 (Responsible Consumption and Production) by promoting more responsive product design and improved accountability in consumer services. We evaluate VALOR on a curated multimodal complaint dataset annotated with fine-grained aspect and severity labels, showing that it consistently outperforms baseline models, especially in complex complaint scenarios where information is distributed across text and images. This study underscores the value of multimodal interaction and expert validation in practical complaint understanding systems. Resources related to data and codes are available here: https://github.com/sarmistha-D/VALOR

Paper and project links

PDF To be published in the Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026 Special Track on AI for Social Impact)

Summary

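This paper presents a complaint-analysis approach built on multimodal, multi-turn customer support dialogues. It uses the textual complaints and visual evidence (e.g., screenshots, product photos) that users share, and applies VALOR, a Validation-Aware Learner with Expert Routing, to classify complaint aspects and severity at a fine-grained level. VALOR uses a multi-expert reasoning setup with large-scale generative models and Chain-of-Thought (CoT) prompting, and computes a semantic alignment score to keep the modalities coherent. The authors position the work as supporting UN SDG 9 (Industry, Innovation and Infrastructure) and SDG 12 (Responsible Consumption and Production). On a curated multimodal complaint dataset with fine-grained aspect and severity labels, VALOR outperforms baseline models, especially in complex cases where information is spread across text and images.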

Key Takeaways

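Existing complaint analysis relies mainly on unimodal, short-form content such as tweets or product reviews; this work instead uses multimodal, multi-turn customer support dialogues.
Users often share both textual complaints and visual evidence (e.g., screenshots, product photos); VALOR handles this multimodal input to classify complaint aspects and severity at a fine-grained level.
VALOR uses a multi-expert reasoning setup with large-scale generative models and Chain-of-Thought (CoT) prompting for nuanced decisions.
To keep the modalities coherent, a semantic alignment score is computed and integrated into the final classification through a meta-fusion strategy.
The authors position VALOR as supporting UN SDG 9 (Industry, Innovation and Infrastructure) and SDG 12 (Responsible Consumption and Production).
On a curated multimodal complaint dataset with fine-grained labels, VALOR consistently outperforms baseline models, especially in complex complaint scenarios.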
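
The alignment-then-fuse idea can be sketched roughly as follows. This is a minimal illustration, not VALOR's actual implementation: it assumes the semantic alignment score is a rescaled cosine similarity between a text embedding and an image embedding, and that meta-fusion is an alignment-weighted average of per-expert class probabilities. The names `alignment_score` and `meta_fuse` and the weighting rule are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_score(text_emb, image_emb):
    """Assumed form of the score: cosine similarity rescaled from [-1, 1] to [0, 1]."""
    return 0.5 * (cosine(text_emb, image_emb) + 1.0)


def meta_fuse(text_probs, image_probs, align):
    """Hypothetical meta-fusion: weight the image expert by the alignment score.

    With align = 0 the text expert decides alone; with align = 1 both
    experts contribute equally.
    """
    w_img = 0.5 * align
    w_txt = 1.0 - w_img
    fused = [w_txt * t + w_img * i for t, i in zip(text_probs, image_probs)]
    total = sum(fused)
    return [p / total for p in fused]
```

Under this assumed rule, a screenshot that does not match the complaint text (low alignment) has little influence on the final label, while a well-aligned image shares the decision with the text expert.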

Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning

Authors:Rui Liu, Yuan Zhao, Zhenqi Jia

The automatic movie dubbing model generates vivid speech from given scripts, replicating a speaker’s timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow where actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize the context cues, specifically emotion, before performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme to achieve authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors. Note that we integrate Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy. This strategy retrieves the most relevant multimodal information that aligns with the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor’s final dubbing process. The above mechanisms enable the Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate the effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.

Paper and project links

PDF Accepted by AAAI 2026

Summary

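This paper proposes Authentic-Dubber, an automatic movie dubbing model that improves emotional expressiveness by simulating the interaction between director and actor. The model builds a multimodal Reference Footage library to stand in for the learning footage a director provides, and uses an Emotion-Similarity-based Retrieval-Augmentation strategy to emulate how an actor internalizes that footage for the target silent video. A Progressive Graph-based speech generation approach then incrementally incorporates the retrieved multimodal emotional knowledge, simulating the actor's final dubbing process. Subjective and objective evaluations on the V2C Animation benchmark validate the approach. Code and demos are available on GitHub.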

Key Takeaways

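Authentic-Dubber improves emotional expressiveness by simulating the director-actor interaction that existing dubbing pipelines skip.
It builds a multimodal Reference Footage library to simulate director-provided learning footage, using Large Language Models (LLMs) for deep comprehension of emotional representations across modalities.
An Emotion-Similarity-based Retrieval-Augmentation strategy retrieves the multimodal information most relevant to the target silent video, emulating how an actor internalizes the footage.
A Progressive Graph-based speech generation approach incrementally incorporates the retrieved emotional knowledge, simulating the actor's final dubbing process.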
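
The retrieval step can be illustrated with a small top-k similarity search. This is only a sketch under stated assumptions: it treats each library entry as a single emotion embedding and ranks entries by cosine similarity to the query. The abstract does not specify the actual similarity measure or data layout, and `retrieve_top_k` and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_emb, library, k=2):
    """Return the ids of the k reference clips most similar to the query.

    `library` maps a clip id to its (assumed) emotion embedding.
    """
    ranked = sorted(
        library.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [clip_id for clip_id, _ in ranked[:k]]


# Toy reference library with 2-D emotion embeddings (illustrative values only).
reference_library = {
    "happy_clip": [1.0, 0.0],
    "sad_clip": [-1.0, 0.0],
    "excited_clip": [0.9, 0.1],
}
top_clips = retrieve_top_k([1.0, 0.05], reference_library, k=2)
```

In a real system the query embedding would come from the target silent video and the retrieved entries would feed the speech generator; here they are plain lists so the ranking logic is easy to follow.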

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Authors:Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implemented a real-time interactive demo, highlighting the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.

Paper and project links

PDF

Summary

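This paper addresses speech-driven 3D facial animation, which aims to generate realistic, synchronized facial motion from speech input. Existing diffusion-based methods process the whole audio sequence in a single pass, so they degrade on sequences longer than the training horizon and incur significant latency on long inputs. The authors propose an autoregressive diffusion model that processes audio in a streaming manner, which gives flexibility across audio lengths and low latency independent of audio duration. A limited number of past frames serve as historical motion context and are combined with the audio input into a dynamic condition that guides the diffusion process to generate facial motion frames iteratively, enabling real-time, high-quality synthesis.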

Key Takeaways

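The paper targets speech-driven 3D facial animation, aiming for realistic and synchronized facial motion.
Existing single-pass methods perform poorly on audio longer than the training horizon and suffer high latency on long inputs.
The proposed autoregressive diffusion model processes input audio in a streaming manner.
The model adapts to varying audio lengths and achieves low latency independent of audio duration.
A limited number of past frames are combined with the audio input into a dynamic condition that guides the diffusion process.
The approach enables real-time synthesis with high-quality results.
The authors built a real-time interactive demo and will release the code at https://zju3dv.github.io/StreamingTalker/.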
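
The streaming loop described above can be sketched as a generator that keeps a bounded window of past frames. This is an illustration only: `denoise_chunk` is a hypothetical placeholder for the diffusion sampler (a real model would run iterative denoising conditioned on the audio chunk and the motion history), and frames are reduced to single floats to keep the control flow visible.

```python
from collections import deque


def denoise_chunk(audio_chunk, history):
    """Hypothetical stand-in for the conditioned diffusion sampler.

    Here the "frame" simply blends the newest history frame with the
    current audio value; it is not a real denoising process.
    """
    prev = history[-1] if history else 0.0
    return 0.5 * prev + 0.5 * audio_chunk


def stream_animation(audio_chunks, history_len=4):
    """Yield one motion frame per audio chunk as soon as the chunk arrives.

    Only the last `history_len` frames are kept as motion context, so the
    per-chunk cost does not grow with the total audio length.
    """
    history = deque(maxlen=history_len)
    for chunk in audio_chunks:
        frame = denoise_chunk(chunk, list(history))
        history.append(frame)
        yield frame


frames = list(stream_animation([1.0, 1.0, 0.0]))
```

Because the generator consumes chunks lazily and the history window is bounded, the same loop works for a short clip or an open-ended live stream, which is the property the summary attributes to the streaming design.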

FxSearcher: gradient-free text-driven audio transformation

Authors:Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

Achieving diverse and high-quality audio transformations from text prompts remains challenging, as existing methods are fundamentally constrained by their reliance on a limited set of differentiable audio effects. This paper proposes FxSearcher, a novel gradient-free framework that discovers the optimal configuration of audio effects (FX) to transform a source signal according to a text prompt. Our method employs Bayesian Optimization and CLAP-based score function to perform this search efficiently. Furthermore, a guiding prompt is introduced to prevent undesirable artifacts and enhance human preference. To objectively evaluate our method, we propose an AI-based evaluation framework. The results demonstrate that the highest scores achieved by our method on these metrics align closely with human preferences. Demos are available at https://hojoonki.github.io/FxSearcher/

Paper and project links

PDF

Summary

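The paper proposes FxSearcher, a gradient-free framework that transforms a source audio signal according to a text prompt. It searches for the best configuration of audio effects (FX) using Bayesian Optimization and a CLAP-based score function. A guiding prompt is added to prevent undesirable artifacts and improve human preference. The authors also propose an AI-based evaluation framework for objective assessment, and report that the highest scores their method achieves on these metrics align closely with human preferences.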

Key Takeaways

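The paper proposes FxSearcher, a gradient-free framework for text-driven audio transformation.
The framework searches audio-effect configurations efficiently using Bayesian Optimization and a CLAP-based score function.
A guiding prompt is introduced to prevent undesirable artifacts and enhance human preference.
An AI-based evaluation framework is proposed to assess the method objectively.
The highest scores FxSearcher achieves on these metrics align closely with human preferences.
Possible application areas include audio editing and voice conversion, although the abstract does not state these explicitly.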
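
To make the gradient-free search concrete, here is a minimal sketch that tunes two effect parameters against a black-box score. Two substitutions are made for simplicity and should not be read as the paper's method: plain random search stands in for Bayesian Optimization, and a toy quadratic `score` stands in for the CLAP-based text-audio score. The parameter names `reverb_mix` and `lowpass_hz` are hypothetical.

```python
import random


def score(params):
    """Toy stand-in for a CLAP-based score (higher is better).

    A real scorer would render the audio with these effect settings and
    compare it against the text prompt.
    """
    target = {"reverb_mix": 0.7, "lowpass_hz": 2000.0}
    return -(
        (params["reverb_mix"] - target["reverb_mix"]) ** 2
        + ((params["lowpass_hz"] - target["lowpass_hz"]) / 10000.0) ** 2
    )


def random_search(score_fn, bounds, n_trials=200, seed=0):
    """Gradient-free search: sample configurations and keep the best one.

    Random search is used here in place of Bayesian Optimization; both
    only need score evaluations, never gradients of the effects.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        current = score_fn(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


fx_bounds = {"reverb_mix": (0.0, 1.0), "lowpass_hz": (200.0, 8000.0)}
best_fx, best_fx_score = random_search(score, fx_bounds)
```

Because the search only calls `score_fn`, the effects themselves never need to be differentiable, which is the constraint the abstract says limits prior methods. A Bayesian optimizer would replace the uniform sampling with a model that proposes promising configurations.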

Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Authors:Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter

We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models – each trained by its original authors – across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies – enabling new evaluations without model reimplementation required – alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.

Paper and project links

PDF 23 pages, 10 figures. The last two authors made equal contributions

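Summary

The study reviews human evaluation practices in automated, speech-driven 3D gesture generation and finds a lack of standardisation and frequent use of flawed experimental setups. To address common shortcomings in evaluation design and standardise future user studies, the authors introduce a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using this protocol, they run a large-scale crowdsourced evaluation that ranks six recent gesture-generation models, each trained by its original authors, on two dimensions: motion realism and speech-gesture alignment. The results indicate that newer models do not consistently outperform earlier approaches, that published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation, and that the field needs disentangled assessments of motion quality and multimodal alignment for accurate benchmarking. To drive standardisation and enable new evaluation research, the authors will release five hours of synthetic motion from the benchmarked models, over 750 rendered video stimuli from the user studies, their open-source rendering script, and the 16,000 pairwise human preference votes collected for the benchmark. The released stimuli allow new evaluations without reimplementing the models.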

Key Takeaways

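A lack of standardisation and frequent use of flawed experimental setups make it impossible to compare methods or identify the state of the art.
A detailed human evaluation protocol for the BEAT2 motion-capture dataset is introduced to standardise future user studies.
A large-scale crowdsourced evaluation of six gesture-generation models shows that newer models do not consistently outperform earlier ones.
Published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation.
Accurate benchmarking requires assessing motion quality and multimodal alignment separately.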
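
One simple way to turn pairwise preference votes into a ranking is a per-model win rate, sketched below. The abstract does not say which statistical analysis the benchmark uses, so this is an assumed baseline (a Bradley-Terry or similar model could be used instead), and `rank_by_win_rate` and the sample votes are hypothetical.

```python
from collections import defaultdict


def rank_by_win_rate(votes):
    """Rank models by the fraction of pairwise comparisons they won.

    `votes` is an iterable of (winner, loser) pairs, one per human vote.
    Returns a list of (model, win_rate) sorted from best to worst.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        comparisons[winner] += 1
        comparisons[loser] += 1
    rates = {model: wins[model] / comparisons[model] for model in comparisons}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Illustrative votes only; not data from the benchmark.
sample_votes = [("A", "B"), ("A", "C"), ("B", "C")]
ranking = rank_by_win_rate(sample_votes)
```

To follow the disentangled design the paper argues for, the function would be run once on the motion-realism votes and once on the speech-gesture alignment votes, giving two separate rankings instead of a single blended score.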

Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback

Authors:Xingpei Ma, Shenneng Huang, Jiaran Cai, Yuansheng Guan, Shen Zheng, Hanfeng Zhao, Qiang Zhang, Shunsi Zhang

Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.

Paper and project links

PDF AAAI 2026

Summary

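Recent diffusion models have improved audio-driven human video generation in both quality and controllability, but lip-sync accuracy, temporal coherence in long videos, and multi-character animation remain challenging. This work proposes a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, together with a training-free method for multi-character audio-driven animation. A LoRA-based training strategy combined with a position shift inference approach enables efficient long video generation while preserving the capabilities of the foundation model. Partial parameter updates combined with reward feedback improve lip synchronization and natural body motion. The training-free method, Mask Classifier-Free Guidance (Mask-CFG), needs no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experiments show the method outperforms existing state-of-the-art approaches, producing high-quality, temporally coherent, multi-character audio-driven video in a simple, efficient, and cost-effective manner.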
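
The masked-guidance idea can be sketched on top of standard classifier-free guidance, which combines predictions as `eps_uncond + scale * (eps_cond - eps_uncond)`. The per-character form below is an assumption made for illustration: the abstract names Mask-CFG but does not give its formula. Here each character's audio-conditioned prediction is applied only inside that character's spatial mask. Predictions are flat lists of floats, and `mask_cfg` is a hypothetical name.

```python
def mask_cfg(eps_uncond, eps_cond_per_char, masks, scale=3.0):
    """Assumed masked variant of classifier-free guidance.

    eps_uncond:        unconditional noise prediction (flat list of floats)
    eps_cond_per_char: one audio-conditioned prediction per character
    masks:             one 0/1 mask per character, same length as eps_uncond
    Each character's guidance term is added only where its mask is 1.
    """
    guided = list(eps_uncond)
    for eps_cond, mask in zip(eps_cond_per_char, masks):
        for i, m in enumerate(mask):
            guided[i] += scale * m * (eps_cond[i] - eps_uncond[i])
    return guided


# Two characters on a 4-element "image"; the last position is background.
guided_eps = mask_cfg(
    eps_uncond=[0.0, 0.0, 0.0, 0.0],
    eps_cond_per_char=[[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]],
    masks=[[1, 1, 0, 0], [0, 0, 1, 0]],
    scale=2.0,
)
```

Since the combination happens at sampling time, a scheme like this needs no retraining and extends to any number of characters by adding more mask and prediction pairs, consistent with the training-free, three-or-more-characters claim in the summary.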

Key Takeaways