⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-19
Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification
Authors:Xingqi Lin, Liangyu Chen, Min Wu, Min Zhang, Zhenbing Zeng
Authors' abstract: Robustness verification is a promising technique for rigorously proving the robustness of Recurrent Neural Networks (RNNs). A key challenge is to over-approximate the nonlinear activation functions with linear constraints, which transforms the verification problem into an efficiently solvable linear programming problem. Existing methods over-approximate the nonlinear parts with linear bounding planes individually, which may cause significant over-estimation and lead to lower verification accuracy. In this paper, in order to tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, we propose a novel truncated rectangular prism formed by two linear relaxation planes, together with a refinement-driven method that minimizes both its volume and surface area for a tighter over-approximation. Based on this approximation, we implement a prototype, DeepPrism, for RNN robustness verification. The experimental results demonstrate that DeepPrism achieves significant improvements over state-of-the-art approaches in various tasks of image classification, speech recognition and sentiment analysis.
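For intuition about what linear bounding planes for the Hadamard product look like, the block below writes out the classic McCormick relaxation of a single bilinear term over an interval box. This is a generic, well-known relaxation shown only for illustration; the paper's truncated rectangular prism and its volume/surface-area refinement are a different, tighter construction.

```latex
% Generic linear bounding planes for one bilinear term z = x*y of the
% Hadamard product, with x in [l_x, u_x] and y in [l_y, u_y].
% (McCormick envelope, for intuition only; DeepPrism's truncated
% rectangular prism is a tighter construction than this.)
\begin{align}
  z &\ge l_x y + l_y x - l_x l_y, &  z &\ge u_x y + u_y x - u_x u_y,\\
  z &\le u_x y + l_y x - u_x l_y, &  z &\le l_x y + u_y x - l_x u_y.
\end{align}
```

The region enclosed between such planes over the input box is exactly the over-approximation slack that a tighter enclosure, like the prism proposed here, tries to shrink.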
Paper and project links
Summary
This paper proposes a new method for robustness verification of Recurrent Neural Networks (RNNs). The core challenge is over-approximating the nonlinear activation functions with linear constraints; the paper addresses it with a truncated rectangular prism formed by two linear relaxation planes that tightly encloses the three-dimensional nonlinear surface generated by the Hadamard product, together with a refinement-driven method that minimizes the prism's volume and surface area for a tighter over-approximation. Based on this approximation, a prototype called DeepPrism is implemented for RNN robustness verification and achieves significant improvements on image classification, speech recognition and sentiment analysis tasks.
Key Takeaways
- Robustness verification is a promising technique for rigorously proving the robustness of Recurrent Neural Networks (RNNs).
- Existing methods over-approximate the nonlinear parts with individual linear bounding planes, which can cause significant over-estimation and lower verification accuracy.
- To tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, a truncated rectangular prism formed by two linear relaxation planes is proposed.
- The prism's volume and surface area are minimized to obtain a tighter over-approximation.
- Based on this approximation, a prototype called DeepPrism is implemented for RNN robustness verification.
- DeepPrism achieves significant improvements over state-of-the-art approaches on image classification, speech recognition and sentiment analysis tasks.
DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
Authors:HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai
Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
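As a rough illustration of the semi-supervised ingredients described above, the sketch below implements dual-threshold (global plus per-class) confidence filtering for pseudo-labels and entropy-based ranking of unlabeled samples. The function names, default thresholds, and the "higher entropy = more informative" reading are assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np

def select_pseudo_labels(probs, global_threshold=0.9, class_thresholds=None):
    """Illustrative dual-threshold pseudo-label filtering (not the authors' code).

    probs: (N, C) softmax probabilities for N unlabeled dialogues over C intents.
    A sample is pseudo-labeled only if its top confidence passes BOTH a global
    threshold and a per-class threshold for the predicted intent.
    """
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    if class_thresholds is None:
        class_thresholds = np.full(probs.shape[1], 0.8)  # assumed default
    keep = (conf >= global_threshold) & (conf >= class_thresholds[preds])
    return np.where(keep)[0], preds[keep]

def rank_by_entropy(probs):
    """Rank unlabeled samples by predictive entropy (higher = more informative),
    one plausible reading of the entropy-based sample selection step."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)  # indices, most uncertain first
```

In such a scheme, the confident subset would feed pseudo-labeled training, while the entropy ranking prioritizes which remaining unlabeled dialogues to focus on next.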
Paper and project links
PDF 8 pages, 2 figures. To appear in: Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025), Frontiers in Artificial Intelligence and Applications, Vol. 413. DOI: 10.3233/FAIA251182
Summary: To tackle the challenging task of recognizing speaker intent in long audio dialogues, this work proposes an end-to-end framework named DialogGraph-LLM. The framework combines a Multi-Relational Dialogue Attention Network (MR-DAN) with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. It also designs an adaptive semi-supervised learning strategy around the LLM, with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering and entropy-based sample selection. Evaluations on the proprietary MarketCalls corpus and the public MIntRec 2.0 benchmark show that DialogGraph-LLM outperforms strong audio- and text-driven baselines and delivers strong performance and efficiency for intent recognition in audio dialogues.
Key Takeaways:
- DialogGraph-LLM is an end-to-end framework for speaker intent recognition in audio dialogues.
- The framework combines MR-DAN with multimodal foundation models for direct acoustic-to-intent inference.
- DialogGraph-LLM adopts an adaptive semi-supervised learning strategy built around a large language model.
- The strategy includes a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering and entropy-based sample selection.
- Evaluations on the MarketCalls corpus and the MIntRec 2.0 benchmark demonstrate the superiority of DialogGraph-LLM over strong baselines.
- DialogGraph-LLM shows strong performance and efficiency for intent recognition in real-world audio dialogues.
Reinforcing Trustworthiness in Multimodal Emotional Support Systems
Authors:Huy M. Le, Dat Tien Nguyen, Ngan T. T. Vo, Tuan D. Q. Nguyen, Nguyen Binh Le, Duy Minh Ho Nguyen, Daniel Sonntag, Lizi Liao, Binh T. Nguyen
In today’s world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations, often relying solely on text or converting other data types into text, or providing emotion recognition only, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art results on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
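The abstract does not spell out how the psychological criteria enter the RL objective; the minimal sketch below shows one plausible design, scoring each criterion and combining the scores into a scalar reward that could drive policy-optimization fine-tuning of the response-generating LLM. The criterion names and weights are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class CriterionScores:
    """Hypothetical per-response scores (in [0, 1]) for psychological criteria;
    the paper's actual criteria and scoring procedure may differ."""
    empathy: float
    safety: float
    strategy_fit: float  # alignment with the identified support strategy

def trust_reward(scores: CriterionScores,
                 weights=(0.4, 0.4, 0.2)) -> float:
    """Combine criterion scores into one scalar reward usable by RL fine-tuning
    (e.g., a PPO-style update of the response LLM)."""
    w_e, w_s, w_f = weights
    return w_e * scores.empathy + w_s * scores.safety + w_f * scores.strategy_fit

# Example: a response judged empathetic and safe but poorly matched to the strategy.
print(trust_reward(CriterionScores(empathy=0.9, safety=0.95, strategy_fit=0.4)))
```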
Paper and project links
Summary: Emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses. However, current methods often rely solely on text, convert other data types into text, or only perform emotion recognition, overlooking the full potential of multimodal inputs. To address these issues, this work introduces MultiMood, a framework that leverages multimodal embeddings from video, audio and text to predict emotional components and produce responses aligned with professional therapeutic standards. To improve trustworthiness, it incorporates novel psychological criteria and applies Reinforcement Learning (RL) to optimize large language models (LLMs). Experiments show state-of-the-art results on the MESC and DFEW datasets, and the RL-driven trustworthiness improvements are validated through human and LLM evaluations.
Key Takeaways:
- A key challenge for current emotional support systems is accurately identifying what the help-seeker needs and responding effectively; multimodal approaches can fuse diverse data sources to provide more targeted support.
- Existing methods are limited in how they process and express emotion, often relying on text alone or simplifying other signals, which overlooks the full potential of multimodal inputs; integrating video, audio and text is needed to better capture emotional complexity in practical use.
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Authors:Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Extending the speech understanding or generation abilities of pre-trained text Large Language Models (LLMs) by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
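To make the dual-token idea concrete, the PyTorch sketch below shows one plausible training objective: cross-entropy on the predicted acoustic tokens plus a semantic supervision term that keeps decoder states close to semantic embeddings derived from understanding-oriented (USToken-like) features. The tensor shapes, the MSE form of the semantic term, and the weighting are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def dual_token_loss(acoustic_logits, acoustic_targets,
                    hidden_states, semantic_targets, lambda_sem=0.5):
    """Illustrative objective for a dual-token speech LM (not the authors'
    exact formulation): next-acoustic-token cross-entropy plus a semantic
    supervision term on the decoder hidden states.

    acoustic_logits:  (B, T, V) predictions over the acoustic-token vocabulary
    acoustic_targets: (B, T)    ground-truth acoustic token ids
    hidden_states:    (B, T, D) decoder hidden states
    semantic_targets: (B, T, D) semantic embeddings to regress toward (assumed)
    """
    ce = F.cross_entropy(acoustic_logits.transpose(1, 2), acoustic_targets)
    sem = F.mse_loss(hidden_states, semantic_targets)
    return ce + lambda_sem * sem
```

The intended effect of such a combined loss is that detailed acoustic prediction (generation) and high-level semantic grounding (understanding) pull on the same model without one objective drowning out the other.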
Paper and project links
PDF Accepted by AAAI 2026
Summary
Extending the speech understanding or generation abilities of pre-trained text Large Language Models (LLMs) by introducing effective speech tokens has attracted great attention in the speech community. Building a unified speech understanding and generation model still faces two challenges: (1) because of the huge modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning; (2) generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features while understanding favors high-level semantics, which makes it hard to optimize both in one unified model. To address these challenges, the paper proposes an Understanding-driven Speech Tokenizer (USTokenizer) that extracts the high-level semantic information needed for understanding tasks, so that USToken enjoys better modality commonality with text and eases modality alignment when adapting text LLMs to speech LLMs. It then presents DualSpeechLM, a dual-token framework that models USToken as input and acoustic tokens as output within a unified end-to-end framework, integrating speech understanding and generation. A novel semantic supervision loss and a Chain-of-Condition (CoC) strategy further stabilize training and enhance speech generation. Experiments show that the approach fosters a complementary relationship between understanding and generation tasks in one unified model.
Key Takeaways
- There is a huge modality gap between speech and text tokens, so extending text LLMs to unified speech LLMs requires large-scale paired data for fine-tuning.
- Generation and understanding tasks prefer information at different levels, which makes it difficult to optimize both in a single unified model.
- An Understanding-driven Speech Tokenizer (USTokenizer) is proposed to extract high-level semantic information and reduce the difficulty of modality alignment.
- The DualSpeechLM framework unifies speech understanding and generation within one model.
- A novel semantic supervision loss stabilizes model training and improves speech generation performance.
- A Chain-of-Condition (CoC) strategy helps foster a complementary relationship between the understanding and generation tasks.
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation
Authors:Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.
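The abstract only names the asynchronous noise scheduler (ANS); the sketch below illustrates the general idea of asynchronous noising, where each temporal latent frame receives its own diffusion timestep that increases along the time axis, so earlier, less-noised frames can anchor the motion of later ones during extended generation. The linear timestep offset and the cosine schedule are assumptions for illustration, not the paper's scheduler.

```python
import torch

def asynchronous_add_noise(latents, base_t, offset_per_frame, num_steps=1000):
    """Illustrative asynchronous add-noise step (not the authors' ANS):
    each temporal latent frame gets its own diffusion timestep, increasing
    along the time axis, instead of one shared timestep for the whole clip.

    latents: (B, T, C, H, W) temporally compressed video latents
    base_t:  (B,) base timestep per sample
    """
    B, T = latents.shape[:2]
    frame_idx = torch.arange(T, device=latents.device)
    # Per-frame timesteps, clamped to the valid range of the noise schedule.
    t = (base_t[:, None] + offset_per_frame * frame_idx[None, :]).clamp(0, num_steps - 1)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2  # assumed cosine schedule
    noise = torch.randn_like(latents)
    a = alpha_bar.view(B, T, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return noisy, noise, t
```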
Paper and project links
PDF Project page: https://readportrait.github.io/READ/
Summary
Diffusion models have brought significant advances to audio-driven talking head generation, but their extremely slow inference limits practical use. This work proposes READ, a real-time diffusion-transformer-based talking head generation framework. A temporal VAE learns a spatiotemporally highly compressed video latent space, and a pre-trained Speech Autoencoder (SpeechAE) produces temporally compressed speech latents for better audio-visual alignment in that space. A carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone then models these latents for efficient talking head synthesis. A novel asynchronous noise scheduler (ANS) ensures temporal consistency and accelerated inference for extended generation. Experiments show that READ generates competitive talking head videos with significantly reduced runtime, achieving a strong balance between quality and speed.
Key Takeaways
- Diffusion models have advanced audio-driven talking head generation, but their inference is slow.
- READ is a real-time talking head generation framework built on a diffusion transformer.
- A temporal VAE learns a spatiotemporally compressed video latent space to reduce the token count and accelerate generation.
- A pre-trained Speech Autoencoder (SpeechAE) enables better audio-visual alignment within the compressed latent space.
- A carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone ensures efficient talking head synthesis.
- The asynchronous noise scheduler (ANS) provides temporal consistency and accelerated inference for long-duration generation.