
Interactive


⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution
🔴 Please note: never use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-13

Listening for “You”: Enhancing Speech Image Retrieval via Target Speaker Extraction

Authors:Wenhao Yang, Jianguo Wei, Wenhuan Lu, Xinyue Song, Xianghu Yue

Image retrieval using spoken language cues has emerged as a promising direction in multimodal perception, yet leveraging speech in multi-speaker scenarios remains challenging. We propose a novel Target Speaker Speech-Image Retrieval task and a framework that learns the relationship between images and multi-speaker speech signals in the presence of a target speaker. Our method integrates pre-trained self-supervised audio encoders with vision models via target speaker-aware contrastive learning, conditioned on a Target Speaker Extraction and Retrieval module. This enables the system to extract spoken commands from the target speaker and align them with corresponding images. Experiments on SpokenCOCO2Mix and SpokenCOCO3Mix show that TSRE significantly outperforms existing methods, achieving 36.3% and 29.9% Recall@1 in 2 and 3 speaker scenarios, respectively - substantial improvements over single speaker baselines and state-of-the-art models. Our approach demonstrates potential for real-world deployment in assistive robotics and multimodal interaction systems.


Paper and project links

PDF 5 pages, 2 figures

Summary

This paper proposes a novel target-speaker speech-image retrieval task and framework that learns the relationship between images and multi-speaker speech signals in the presence of a target speaker. By integrating pre-trained self-supervised audio encoders with vision models through target speaker-aware contrastive learning, the system extracts the target speaker's spoken commands and aligns them with the corresponding images. Experiments show the method significantly outperforms existing approaches on SpokenCOCO2Mix and SpokenCOCO3Mix, reaching 36.3% and 29.9% Recall@1 respectively, and it has practical deployment potential for assistive robotics and multimodal interaction systems.

Key Takeaways

  1. The paper proposes a new target-speaker speech-image retrieval task that uses speech to retrieve images in multi-speaker scenarios.
  2. By integrating pre-trained self-supervised audio encoders with vision models, the system can extract the target speaker's spoken commands.
  3. Target speaker-aware contrastive learning aligns the spoken commands with the corresponding images (a toy version of such a loss is sketched after this list).
  4. Experiments on the SpokenCOCO2Mix and SpokenCOCO3Mix datasets validate the effectiveness of the method.
  5. The method achieves high Recall@1, a substantial improvement over single-speaker baselines and recent state-of-the-art models.
  6. The approach has practical value for applications such as assistive robotics and multimodal interaction systems.
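
The core alignment step pairs an embedding of the extracted target-speaker speech with an image embedding under a contrastive objective. Below is a minimal, illustrative sketch of such a speech-image InfoNCE loss in PyTorch; the encoder choices, embedding size, and batch pairing are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def infonce_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (speech, image) embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = audio_emb @ image_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(audio_emb.size(0))          # matching pairs lie on the diagonal
    loss_a2i = F.cross_entropy(logits, targets)        # speech -> image direction
    loss_i2a = F.cross_entropy(logits.t(), targets)    # image -> speech direction
    return 0.5 * (loss_a2i + loss_i2a)

# Stand-in embeddings; in the paper the audio side would come from the speech
# of the target speaker extracted out of the multi-speaker mixture.
audio_emb = torch.randn(8, 512)   # e.g. pooled self-supervised audio features
image_emb = torch.randn(8, 512)   # e.g. pooled vision-model features
print(infonce_loss(audio_emb, image_emb))
```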

Cool Papers

Click here to view paper screenshots

Data Driven Discovery of Emergent Dynamics in Reaction Diffusion Systems from Sparse and Noisy Observations

Authors:Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen

Data-driven discovery of emergent dynamics is gaining popularity, particularly in the context of reaction-diffusion systems. These systems are widely studied across various fields, including neuroscience, ecology, epidemiology, and several other subject areas that deal with emergent dynamics. A current challenge in the discovery process relates to system identification when there is no prior knowledge of the underlying physics. We attempt to address this challenge by learning Soft Artificial Life (Soft ALife) models, such as Agent-based and Cellular Automata (CA) models, from observed data for reaction-diffusion systems. In this paper, we present findings on the applicability of a conceptual framework, the Data-driven Rulesets for Soft Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately represent emergent dynamics in a reaction-diffusion system from observed data. This model has demonstrated promising results for Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is one of the few studies that explore machine-based Soft ALife ruleset learning and system identification for reaction-diffusion dynamics without any prior knowledge of the underlying physics. Moreover, we provide comprehensive findings from experiments investigating the potential effects of using noisy and sparse observed datasets on learning emergent dynamics. Additionally, we successfully identify the structure and parameters of the underlying partial differential equations (PDEs) representing these dynamics. Experimental results demonstrate that the learned models are able to predict the emergent dynamics with good accuracy (74%) and exhibit quite robust performance when subjected to Gaussian noise and temporal sparsity.


Paper and project links

PDF

Summary

Data-driven discovery of emergent dynamics is attracting growing attention, particularly for reaction-diffusion systems. This paper studies the applicability of the Data-driven Rulesets for Soft Artificial Life (DRSALife) conceptual framework for learning Soft ALife rulesets from observed data that accurately represent the emergent dynamics of a reaction-diffusion system. Experimental results show that the learned models are robust to noisy and sparse observations and predict the emergent dynamics with good accuracy.

Key Takeaways

  1. Data-driven discovery of emergent dynamics is receiving broad attention in the context of reaction-diffusion systems.
  2. Soft Artificial Life (Soft ALife) models, such as agent-based and Cellular Automata (CA) models, are learned from observed data to capture reaction-diffusion dynamics.
  3. The DRSALife model is successfully applied to learn Soft ALife rulesets that represent the emergent dynamics of a reaction-diffusion system.
  4. The model has previously shown promising results on the Elementary CA Rule 30, Game of Life, and Vicsek Flocking problems.
  5. Soft ALife ruleset learning and system identification for reaction-diffusion dynamics are performed without any prior knowledge of the underlying physics, and the structure and parameters of the underlying PDEs are also recovered (a toy identification sketch follows this list).
  6. The learned models are robust to noisy and temporally sparse observations and predict the emergent dynamics with good accuracy (74%).
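
To make the system-identification idea concrete, here is a minimal toy sketch (my illustration, not the DRSALife implementation): a 1-D reaction-diffusion field is simulated, and the structure and parameters of the governing PDE are recovered by least-squares regression of the time derivative on a small library of candidate terms. The Fisher-KPP-style ground truth and the candidate library are assumptions.

```python
import numpy as np

# Simulate 1-D reaction-diffusion data: u_t = D * u_xx + r * u * (1 - u).
D_true, r_true = 0.1, 1.0
nx, nt, dx, dt = 100, 400, 0.1, 0.001
u = np.zeros((nt, nx))
u[0] = np.exp(-((np.arange(nx) * dx - 5.0) ** 2))   # initial bump in mid-domain
for k in range(nt - 1):
    lap = (np.roll(u[k], 1) - 2 * u[k] + np.roll(u[k], -1)) / dx ** 2
    u[k + 1] = u[k] + dt * (D_true * lap + r_true * u[k] * (1 - u[k]))

# Regress the observed time derivative on a small library of candidate terms.
u_t = (u[1:] - u[:-1]) / dt
lap = (np.roll(u[:-1], 1, axis=1) - 2 * u[:-1] + np.roll(u[:-1], -1, axis=1)) / dx ** 2
library = np.stack([lap.ravel(), u[:-1].ravel(), (u[:-1] ** 2).ravel()], axis=1)
coeffs, *_ = np.linalg.lstsq(library, u_t.ravel(), rcond=None)
print("estimated [D, r, -r]:", coeffs)   # should recover ~[0.1, 1.0, -1.0]
```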

Cool Papers

Click here to view paper screenshots

Automated Classification of Tutors’ Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus

Authors:Liqun He, Jiaqi Xu

This study explores the use of generative AI for automating the classification of tutors’ Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors’ responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen’s Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.


Paper and project links

PDF Accepted for publication in the journal Reflecting Digital Learning. First submitted: 30 Oct 2023. The final version will be available open access via the journal

Summary

This study uses generative AI to automate the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. It uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories, and tests GPT-3.5-turbo and GPT-4 with tailored prompts. GPT-4 reached 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74 against the human annotations, surpassing baseline performance. This suggests an efficient and accessible approach to DA classification for educational dialogue analysis. The study also stresses the importance of task-specific label definitions and contextual information for annotation quality, and underscores the ethical considerations of using generative AI and the need for responsible, transparent research practices.

Key Takeaways

  1. The study uses generative AI to automatically classify tutors' dialogue acts, improving efficiency and reducing the manual coding workload.
  2. GPT-4 performs well on the DA classification task, reaching 80% accuracy (a prompt-and-scoring sketch follows this list).
  3. Contextual information and task-specific label definitions are crucial for the quality of automated annotation.
  4. Generative AI shows strong potential for educational dialogue analysis.
  5. Ethical issues, including responsible and transparent research practices, must be considered when using generative AI.
  6. The publicly available script supports further research and application.
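
As a concrete illustration of prompt-based tagging and the reported metrics, here is a minimal sketch using the OpenAI Python client and scikit-learn. The label names, prompt wording, and example utterances are assumptions rather than the paper's exact setup (the paper's script is available at the GitHub link above).

```python
from openai import OpenAI
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

LABELS = ["question", "hint", "correction", "confirmation"]  # assumed CIMA-style tags
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_dialogue_act(utterance: str, model: str = "gpt-4") -> str:
    """Ask the model for exactly one label for a tutor utterance."""
    prompt = (
        "You are annotating tutoring dialogue. Classify the tutor utterance "
        f"into exactly one of: {', '.join(LABELS)}. Reply with the label only.\n\n"
        f"Tutor: {utterance}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

# Toy evaluation against gold labels; the paper reports accuracy, weighted F1,
# and Cohen's Kappa on the pre-annotated CIMA corpus.
utterances = ["Remember that the colour word comes before the noun.",
              "How would you say 'the red cat'?"]
gold = ["hint", "question"]
pred = [classify_dialogue_act(u) for u in utterances]
print(accuracy_score(gold, pred),
      f1_score(gold, pred, average="weighted", labels=LABELS, zero_division=0),
      cohen_kappa_score(gold, pred))
```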

Cool Papers

Click here to view paper screenshots

Binaural Target Speaker Extraction using HRTFs

Authors:Yoav Ellinson, Sharon Gannot

In this work, we aim to imitate the human ability to selectively attend to a single speaker, even in the presence of multiple simultaneous talkers. To achieve this, we propose a novel approach for binaural target speaker extraction that leverages the listener’s Head-Related Transfer Function (HRTF) to isolate the desired speaker. Notably, our method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets and languages. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods demonstrates that our approach attains performance on par with competing techniques in terms of noise reduction and perceptual quality, while offering a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io


Paper and project links

PDF

Summary

This paper proposes a new approach to binaural target speaker extraction that leverages the listener's Head-Related Transfer Function (HRTF). The method does not rely on speaker embeddings, making it speaker-independent and enabling strong generalization across multiple speech datasets and languages. A fully complex-valued neural network operating directly on the complex STFT of the mixture is compared with a Real-Imaginary (RI)-based network, showing the advantages of the former. The method performs excellently in an anechoic, noise-free scenario and remains robust under reverberation, preserving speech clarity and source directionality while reducing reverberation. Compared with existing binaural target speaker extraction methods, it achieves comparable noise reduction and perceptual quality while better preserving binaural cues.

Key Takeaways

  1. The listener's Head-Related Transfer Function (HRTF) is used to extract the target speaker.
  2. The method is speaker-independent and generalizes across datasets and languages.
  3. A fully complex-valued neural network operates directly on the complex Short-Time Fourier Transform (STFT) of the mixed audio signals (a complex-mask sketch follows this list).
  4. The complex-valued network shows advantages over a Real-Imaginary (RI)-based neural network.
  5. The method performs excellently in anechoic, noise-free conditions and remains robust under reverberation.
  6. Compared with other binaural target speaker extraction methods, it achieves comparable noise reduction and perceptual quality while better preserving binaural cues.
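
The complex-STFT masking idea can be illustrated in a few lines. The sketch below uses a single channel and an oracle complex ratio mask purely for demonstration; in the paper, a fully complex-valued network conditioned on the listener's HRTF would estimate the mask and apply it per ear so that binaural cues are preserved.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)             # stand-in target speech (one ear)
interferer = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in competing talker
mixture = target + interferer

_, _, X_mix = stft(mixture, fs=fs, nperseg=512)
_, _, X_tgt = stft(target, fs=fs, nperseg=512)

# Oracle complex ratio mask; magnitude capped at 1 for numerical stability.
mask = X_tgt / (X_mix + 1e-8)
mask = mask / np.maximum(1.0, np.abs(mask))

# Apply the mask in the STFT domain and resynthesise the extracted signal.
_, extracted = istft(mask * X_mix, fs=fs, nperseg=512)
print("reconstruction MSE:", np.mean((extracted[:len(target)] - target) ** 2))
```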

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under the CC BY 4.0 license. Please credit Kedreamix when reposting!