Published: 2025-11-19

Updated: 2025-11-27

Word count: 4k

Reading time: 16 min

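⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading the papers.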

2025-11-19 update

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Authors: Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin

With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

Paper and project links

PDF Accepted by AAAI 2026 (Oral)

Summary

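As smart personal devices spread, service-oriented human-agent interaction is becoming increasingly common, which creates a need for personalized dialogue assistants that understand user-specific traits and interpret requirements accurately. Existing approaches often overlook the complexity of long-term interaction and fail to capture users' subjective characteristics. To address this, the authors introduce the PAL-Bench benchmark and build the PAL-Set dataset. They also propose H$^2$Memory, a hierarchical and heterogeneous memory framework that uses retrieval-augmented generation to improve personalized response generation.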

Key Takeaways

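The spread of smart personal devices is driving growth in service-oriented human-agent interaction, which calls for personalized dialogue assistants.
Existing approaches overlook the complexity of long-term interaction and fail to capture users' subjective characteristics.
PAL-Bench evaluates the personalization capabilities of service-oriented assistants in long-term user-agent interaction.
PAL-Set, a Chinese dataset built to support PAL-Bench, contains multi-session user logs and dialogue histories.
H$^2$Memory uses retrieval-augmented generation to improve personalized response generation.
H$^2$Memory adopts a hierarchical and heterogeneous memory structure.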

How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer

Authors: Minu Kim, Ji Sub Um, Hoirin Kim

Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
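
To relate the reported tone-cue spans to what an SSL encoder sees, it can help to convert milliseconds into encoder frames. The sketch below assumes a 20 ms output stride, which is typical of wav2vec 2.0 and HuBERT style encoders; the abstract does not state the stride of the models studied, so `FRAME_STRIDE_MS` and the helper `span_in_frames` are illustrative assumptions rather than values from the paper.

```python
import math

# Assumption: 20 ms output stride (typical of wav2vec 2.0 / HuBERT style encoders).
FRAME_STRIDE_MS = 20

# Tone-cue spans reported in the abstract, in milliseconds.
TONE_SPAN_MS = {"Burmese": 100, "Thai": 100, "Lao": 180, "Vietnamese": 180}


def span_in_frames(span_ms, stride_ms=FRAME_STRIDE_MS):
    """Number of encoder frames needed to cover span_ms, rounded up."""
    return math.ceil(span_ms / stride_ms)


frames = {lang: span_in_frames(ms) for lang, ms in TONE_SPAN_MS.items()}
print(frames)
```

Under this assumed stride, a 100 ms span covers 5 frames and a 180 ms span covers 9 frames, which gives a rough sense of how local the tone cues are relative to the model's frame rate.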

Paper and project links

PDF 5 pages, 7 figures, submitted to ICASSP 2026

Summary

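Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. This study examines four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to see how far such models listen for tone and how transfer works under low-resource conditions. As a baseline reference, the authors estimate the temporal span of tone cues at about 100 ms for Burmese and Thai and about 180 ms for Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models show that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. Tone transfer is therefore shaped by the downstream task, which affects the temporal focus of tone modeling.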

Key Takeaways

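Lexical tone is important in many languages, yet it remains underexplored in SSL speech models, especially beyond Mandarin.
The study covers four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese.
How fine-tuned SSL models capture tone information depends on the downstream task.
Tone cues span about 100 ms in Burmese and Thai, and a longer span of about 180 ms in Lao and Vietnamese.
Automatic speech recognition fine-tuning aligns the model's temporal span with language-specific tone cues.
Prosody- and voice-related tasks bias the model toward overly long temporal spans.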

DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition

Authors: HongYu Liu, Junxin Li, Changxi Guo, Hao Chen, Yaqian Huang, Yifu Guo, Huan Yang, Lihua Cai

Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at https://github.com/david188888/DialogGraph-LLM.
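
The dual-threshold filtering and entropy-based selection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the threshold values, and the rule that a prediction must clear both the global and the per-class threshold are assumptions made for illustration, and the abstract does not specify how the two steps are combined during training.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_pseudo_labels(probs_batch, global_thr, class_thr):
    """Keep predictions whose top probability clears BOTH thresholds.

    global_thr: single threshold shared by all classes (hypothetical value).
    class_thr:  per-class thresholds indexed by class id (hypothetical values).
    Returns a list of (sample_index, pseudo_label) pairs.
    """
    kept = []
    for i, probs in enumerate(probs_batch):
        label = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[label]
        if conf >= global_thr and conf >= class_thr[label]:
            kept.append((i, label))
    return kept


def rank_by_entropy(probs_batch, top_k):
    """Indices of the top_k highest-entropy (most uncertain) samples."""
    order = sorted(range(len(probs_batch)),
                   key=lambda i: entropy(probs_batch[i]), reverse=True)
    return order[:top_k]


# Toy class-probability outputs for three unlabeled samples.
probs_batch = [
    [0.92, 0.05, 0.03],  # confident, but below the stricter class-0 threshold
    [0.40, 0.35, 0.25],  # uncertain: high entropy
    [0.10, 0.85, 0.05],  # clears both the global and the class-1 threshold
]
pseudo = select_pseudo_labels(probs_batch, global_thr=0.8,
                              class_thr=[0.95, 0.8, 0.7])
informative = rank_by_entropy(probs_batch, top_k=1)
print(pseudo, informative)
```

In this toy batch only the third sample receives a pseudo-label, and the second sample ranks as the most informative because its distribution is closest to uniform.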

Paper and project links

PDF 8 pages, 2 figures. To appear in: Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025), Frontiers in Artificial Intelligence and Applications, Vol. 413. DOI: 10.3233/FAIA251182

Summary

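DialogGraph-LLM combines a Multi-Relational Dialogue Attention Network (MR-DAN) with a multimodal foundation model (e.g., Qwen2.5-Omni-7B) and an adaptive semi-supervised learning strategy to infer intent directly from audio. The strategy uses an LLM to generate confidence-aware pseudo-labels, filtered with dual thresholds on global and class confidences. On the proprietary MarketCalls corpus and the public MIntRec 2.0 benchmark, the framework outperforms strong audio- and text-driven baselines, showing strong performance and efficiency for intent recognition in real-world audio dialogues. The code is available at https://github.com/david188888/DialogGraph-LLM.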

Key Takeaways

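DialogGraph-LLM combines MR-DAN with a multimodal foundation model for direct acoustic-to-intent inference.
An adaptive semi-supervised learning strategy addresses intent recognition in complex audio dialogues with scarce annotated data.
An LLM generates confidence-aware pseudo-labels, which are filtered with dual thresholds on global and class confidences.
The framework outperforms strong baselines on the MarketCalls corpus and the MIntRec 2.0 benchmark.
It applies to intent recognition in real-world audio dialogues and has practical value for audio-rich domains with limited supervision.
The authors report strong performance and efficiency on intent recognition in complex audio dialogues.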

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

Authors: Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu

Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://github.com/sotayang/SVBench.
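
One way to picture the QA chains and their temporal linkages is as a small data structure, sketched below. The class and field names (`QAPair`, `QAChain`, `linked_to`) and the two toy chains are hypothetical and do not reflect the benchmark's actual schema; only the two counts come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class QAChain:
    """A multi-turn dialogue over one video segment (field names are hypothetical)."""
    start_s: float
    end_s: float
    turns: list = field(default_factory=list)
    # Indices of earlier chains that this chain refers back to (temporal linkage).
    linked_to: list = field(default_factory=list)


# Counts reported in the abstract.
NUM_QA_PAIRS = 49_979
NUM_VIDEOS = 1_353
avg_pairs_per_video = NUM_QA_PAIRS / NUM_VIDEOS

# Toy example: the second chain depends on what happened in the first segment.
chains = [
    QAChain(0.0, 30.0, [QAPair("What is the person holding?", "A cup.")]),
    QAChain(30.0, 60.0, [QAPair("Where did they put it?", "On the table.")],
            linked_to=[0]),
]
print(round(avg_pairs_per_video, 1))
```

The reported counts work out to roughly 37 QA pairs per video on average, which is consistent with each video carrying several multi-turn chains.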

Paper and project links

PDF ICLR 2025 Accepted (Spotlight)

Summary

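The paper addresses a gap in evaluating how well Large Vision-Language Models (LVLMs) understand long-context streaming video. The authors introduce SVBench, a benchmark built on temporal multi-turn question-answering chains over video segments, to assess streaming video understanding thoroughly. Across 14 models, the closed-source GPT-4o performs best, while most open-source LVLMs struggle with long-context streaming video. The authors also build a StreamingChat model that significantly outperforms open-source LVLMs on SVBench and achieves comparable performance on diverse vision-language benchmarks. SVBench is intended to support in-depth analysis of current LVLMs for streaming video understanding research.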

Key Takeaways

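There is a notable gap in evaluating LVLMs on long-context streaming video understanding.
SVBench is designed to assess LVLMs' streaming video understanding; its distinguishing feature is chains of consecutive multi-turn question answering.
GPT-4o outperforms the other models tested, while most open-source LVLMs struggle with long-context streaming video.
StreamingChat significantly outperforms open-source LVLMs on SVBench.
StreamingChat achieves comparable performance on diverse vision-language benchmarks.
SVBench provides a basis for comprehensive analysis in streaming video understanding research.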

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-19/Interactive/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Interactive
