Published: 2025-09-13

Updated: 2025-10-07

Word count: 5.2k

Reading time: 20 min

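⚠️ All summaries below were generated by a large language model and may contain errors. Treat them as reference only and use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings. They are intended only for initial screening before reading the papers.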

2025-09-13 Update

Listening for “You”: Enhancing Speech Image Retrieval via Target Speaker Extraction

Authors:Wenhao Yang, Jianguo Wei, Wenhuan Lu, Xinyue Song, Xianghu Yue

Image retrieval using spoken language cues has emerged as a promising direction in multimodal perception, yet leveraging speech in multi-speaker scenarios remains challenging. We propose a novel Target Speaker Speech-Image Retrieval task and a framework that learns the relationship between images and multi-speaker speech signals in the presence of a target speaker. Our method integrates pre-trained self-supervised audio encoders with vision models via target speaker-aware contrastive learning, conditioned on a Target Speaker Extraction and Retrieval module. This enables the system to extract spoken commands from the target speaker and align them with corresponding images. Experiments on SpokenCOCO2Mix and SpokenCOCO3Mix show that TSRE significantly outperforms existing methods, achieving 36.3% and 29.9% Recall@1 in 2 and 3 speaker scenarios, respectively - substantial improvements over single speaker baselines and state-of-the-art models. Our approach demonstrates potential for real-world deployment in assistive robotics and multimodal interaction systems.

Paper and project links

PDF 5 pages, 2 figures

Summary

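This paper proposes a target-speaker speech-image retrieval task and a framework that learns the relationship between images and multi-speaker speech signals in the presence of a target speaker. The framework combines pre-trained self-supervised audio encoders with vision models through target speaker-aware contrastive learning. On SpokenCOCO2Mix and SpokenCOCO3Mix it significantly outperforms existing methods, reaching 36.3% and 29.9% Recall@1 in the 2- and 3-speaker settings respectively, which suggests potential for deployment in assistive robotics and multimodal interaction systems.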

Key Takeaways

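Speech-image retrieval remains challenging in multi-speaker scenarios.
The paper proposes a new target-speaker speech-image retrieval task and framework.
Audio encoders and vision models are integrated via target speaker-aware contrastive learning.
Experiments are run on the SpokenCOCO2Mix and SpokenCOCO3Mix datasets.
The method delivers substantial gains over existing methods and single-speaker baselines.
The framework shows potential for deployment in assistive robotics and multimodal interaction systems.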
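
The Recall@1 figures quoted for this paper measure how often the top-ranked image for a speech query is the correct one. A minimal sketch of that metric follows. It assumes, hypothetically, that speech and image embeddings share a space, are compared by cosine similarity, and that query `i` is paired with image `i`. The function names and toy vectors are illustrative and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(speech_embs, image_embs, k=1):
    """Fraction of speech queries whose paired image (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(speech_embs):
        scores = [cosine(query, img) for img in image_embs]
        ranked = sorted(range(len(image_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(speech_embs)

# Toy data (assumed): the third query points at the wrong image.
speech = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.0]]
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r1 = recall_at_k(speech, images, k=1)  # two of three queries hit at rank 1
r3 = recall_at_k(speech, images, k=3)  # every query hits within the top 3
```

In a real evaluation the embeddings would come from the trained audio and vision encoders, and the candidate pool would be the full test set rather than three vectors.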

MAPSS: Manifold-based Assessment of Perceptual Source Separation

Authors:Amir Ivry, Samuele Cornell, Shinji Watanabe

Objective assessment of source-separation systems still mismatches subjective human perception, especially when leakage and self-distortion interact. We introduce the Perceptual Separation (PS) and Perceptual Match (PM), the first pair of measures that functionally isolate these two factors. Our intrusive method begins with generating a bank of fundamental distortions for each reference waveform signal in the mixture. Distortions, references, and their respective system outputs from all sources are then independently encoded by a pre-trained self-supervised learning model. These representations are aggregated and projected onto a manifold via diffusion maps, which aligns Euclidean distances on the manifold with dissimilarities of the encoded waveforms. On this manifold, the PM measures the Mahalanobis distance from each output to its attributed cluster that consists of its reference and distortions embeddings, capturing self-distortion. The PS accounts for the Mahalanobis distance of the output to the attributed and to the closest non-attributed clusters, quantifying leakage. Both measures are differentiable and granular, operating at a resolution as low as 50 frames per second. We further derive, for both measures, deterministic error radius and non-asymptotic, high-probability confidence intervals (CIs). Experiments on English, Spanish, and music mixtures show that the PS and PM nearly always achieve the highest linear correlation coefficients with human mean-opinion scores than 14 competitors, reaching as high as 86.36% for speech and 87.21% for music. We observe, at worst, an error radius of 1.39% and a probabilistic 95% CI of 12.21% for these coefficients, which improves reliable and informed evaluation. Using mutual information, the measures complement each other most as their values decrease, suggesting they are jointly more informative as system performance degrades.

Paper and project links

PDF Submitted to ICLR

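Summary

Objective assessment of source-separation systems still diverges from subjective human perception, especially when leakage and self-distortion interact. The paper introduces Perceptual Separation (PS) and Perceptual Match (PM), the first pair of measures that isolate these two factors. The method is intrusive: it generates a bank of fundamental distortions for each reference waveform in the mixture, then encodes the distortions, the references, and the system outputs from all sources independently with a pre-trained self-supervised model. The representations are aggregated and projected onto a manifold via diffusion maps, so that Euclidean distances on the manifold align with dissimilarities between the encoded waveforms. On this manifold, PM measures the Mahalanobis distance from each output to its attributed cluster (its reference and distortion embeddings), capturing self-distortion. PS uses the Mahalanobis distances from the output to the attributed cluster and to the closest non-attributed cluster, quantifying leakage. Both measures are differentiable and granular, operating at a resolution as low as 50 frames per second. The authors also derive a deterministic error radius and non-asymptotic, high-probability confidence intervals (CIs) for both measures. On English, Spanish, and music mixtures, PS and PM nearly always achieve higher linear correlation with human mean-opinion scores than 14 competing measures, reaching up to 86.36% for speech and 87.21% for music. In the worst case, the error radius is 1.39% and the probabilistic 95% CI is 12.21% for these coefficients. A mutual-information analysis shows that the measures complement each other most as their values decrease, suggesting that they are jointly more informative as system performance degrades.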

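Key Takeaways

Objective assessment of source-separation systems still mismatches subjective human perception, especially when leakage and self-distortion interact.
Two new measures, Perceptual Separation (PS) and Perceptual Match (PM), are introduced to reflect human perception more closely.
The intrusive method builds a bank of distortions for each reference and encodes everything with a self-supervised learning model.
Diffusion maps project the encoded waveforms onto a manifold where Euclidean distances align with waveform dissimilarities.
PM captures self-distortion and PS quantifies leakage; both measures are differentiable and granular.
In experiments, PS and PM nearly always correlate more strongly with human mean-opinion scores than 14 competing measures.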
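
Both measures are built on Mahalanobis distances between an output embedding and clusters of reference and distortion embeddings. The sketch below shows only that building block, in two dimensions. The cluster points and the helper names `mean_and_cov_2d` and `mahalanobis_2d` are illustrative assumptions. It does not reproduce the paper's PS or PM formulas, which are not given in the abstract.

```python
import math

def mean_and_cov_2d(points):
    """Sample mean and unbiased 2x2 covariance of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, points):
    """Mahalanobis distance from x to the cluster formed by `points`."""
    (mx, my), (sxx, sxy, syy) = mean_and_cov_2d(points)
    det = sxx * syy - sxy * sxy  # must be non-zero for the inverse to exist
    dx, dy = x[0] - mx, x[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Toy clusters (assumed): one attributed to the output, one belonging to another source.
attributed = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
non_attributed = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]
output = (1.0, 3.0)

d_own = mahalanobis_2d(output, attributed)        # small: close to its own cluster
d_other = mahalanobis_2d(output, non_attributed)  # large: far from the other source
```

In this toy setup, a small `d_own` would correspond to little self-distortion, and a large gap between `d_other` and `d_own` would correspond to little leakage.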

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

Authors:Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. Therefore, we rethink the audio codec as a universal disentangled representation learner to enable controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound, and within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correlate with the speech and background sound, respectively. These allow parallel RVQs to quantize speech and background sound components independently. Furthermore, we employ semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS. Demo Page: https://luo404.github.io/DeCodecV2/

Paper and project links

PDF

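Summary

Real-world audio often mixes several sources, such as speech and background sound, and existing audio codecs handle such audio poorly: universal codecs learn entangled representations, while codecs with decoupled representations are restricted to speech. This work proposes DeCodec, a neural codec that decouples audio representations into speech and background sound and further decomposes speech into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Experiments show that DeCodec maintains signal reconstruction quality while also enabling speech enhancement, one-shot voice conversion on noisy speech, more robust ASR, and controllable background sound preservation or suppression in TTS.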

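Key Takeaways

DeCodec learns disentangled audio representations that separate speech from background sound.
Built on a codec framework, DeCodec adds two key innovations: a subspace orthogonal projection module and a representation swap training procedure.
The subspace orthogonal projection module factorizes the input into two decoupled orthogonal subspaces, one for speech and one for background sound.
The representation swap training procedure ensures that these two subspaces actually correlate with speech and background sound, respectively.
Parallel RVQs quantize the speech and background sound components independently, which is what enables the downstream capabilities.
Semantic guidance applied to the speech RVQ yields the semantic and paralinguistic decomposition.
Experiments report strong results across tasks, including speech enhancement, one-shot voice conversion on noisy speech, improved ASR robustness, and background sound control in TTS.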
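
The RVQs mentioned above are residual vector quantizers: each stage picks the nearest codeword for whatever the previous stages left unexplained. The following is a minimal sketch of a single generic RVQ with hand-written codebooks. The codebook values and function names are illustrative assumptions. DeCodec's learned codebooks, its projection module, and its semantic guidance are not modelled here.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy two-stage quantizer (assumed values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],     # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage refines the residual
]
codes = rvq_encode([1.2, 0.9], codebooks)
recon = rvq_decode(codes, codebooks)
```

Under these assumptions, running RVQs in parallel would amount to calling `rvq_encode` separately on the speech-subspace and background-subspace features, each with its own codebooks.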

Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition

Authors:Yujian Ma, Jinqiu Sang, Ruizhe Li

Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA’s matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models. Our code is available at https://github.com/harryporry77/Behind-the-Scenes.

Paper and project links

PDF Work in process

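Summary

Large pre-trained speech models such as Whisper generalize well but are hard to adapt under tight resource budgets. This work presents the first systematic mechanistic interpretability study of Low-Rank Adaptation (LoRA) in the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, it reveals two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a "forward alignment, backward differentiation" dynamic between LoRA's matrices. The findings offer both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models. Code is available at https://github.com/harryporry77/Behind-the-Scenes.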
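
The abstract lists centered kernel alignment (CKA) among the tools used to compare representations. The sketch below implements linear CKA in its standard formulation for two activation matrices with one row per example. Whether the paper uses the linear or a kernel variant is not stated here, so treat this as an assumption. The toy matrices are illustrative.

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    num = cross_gram_sq_norm(yc, xc)
    den = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return num / den

# Toy activations (assumed): y is x rotated by 90 degrees and scaled by 2.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
y = [[-2.0 * b, 2.0 * a] for a, b in x]
same = linear_cka(x, x)
rotated = linear_cka(x, y)
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling, so both values above equal 1 up to floating-point error. That invariance is what makes it usable for comparing layers whose coordinate systems differ.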

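Key Takeaways

LoRA is a popular parameter-efficient fine-tuning method, but its underlying mechanisms in speech tasks remain poorly understood.
A suite of analytical tools reveals two key mechanisms of LoRA in speech emotion recognition.
Delayed specialization: LoRA preserves general features in early layers before consolidating task-specific information.
LoRA's matrices exhibit a "forward alignment, backward differentiation" dynamic.
LoRA reshapes the encoder hierarchy of the speech model.
The findings provide empirical insights and support the design of more efficient and interpretable adaptation strategies for large speech models.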
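
For readers unfamiliar with the two LoRA matrices discussed above, here is a minimal sketch of the standard LoRA forward pass: a frozen weight `W` plus a scaled low-rank update `B A`. The dimensions, values, and `alpha` are illustrative assumptions. The paper's actual Whisper configuration is not given in this digest.

```python
def matvec(m, v):
    """Matrix-vector product for a row-major list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha):
    """Frozen weight plus scaled low-rank update: W x + (alpha / r) * B (A x)."""
    r = len(a)  # rank = number of rows in A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + (alpha / r) * u for h, u in zip(base, update)]

w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen 2x3 weight, never updated
a = [[1.0, 1.0, 1.0]]                   # A: r x d_in, here rank r = 1
b_init = [[0.0], [0.0]]                 # B starts at zero, so the update is a no-op
b_trained = [[0.5], [-1.0]]             # hypothetical values after fine-tuning
x = [1.0, 2.0, 3.0]

y_init = lora_forward(w, a, b_init, x, alpha=2.0)
y_trained = lora_forward(w, a, b_trained, x, alpha=2.0)
```

Only `A` and `B` are trained, which is why the method is parameter-efficient: here they hold 5 values against 6 in `W`, and the gap widens rapidly for realistic layer sizes.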

MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

Authors:Muhammad Huzaifah, Geyu Lin, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Jinyang Wu, Nancy F. Chen, Ai Ti Aw

This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore’s National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.

Paper and project links

PDF

Summary

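MERaLiON-SpeechEncoder is a foundation model designed to address speech processing needs in Singapore and the surrounding Southeast Asian region and to support a wide range of downstream speech applications. It was pre-trained from scratch on 200,000 hours of unlabelled speech using self-supervised learning based on masked language modelling. The evaluation shows improvements on spontaneous and Singapore speech benchmarks for speech recognition, while the model remains competitive with other state-of-the-art speech encoders across ten other speech tasks.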

Key Takeaways