
Speech


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-05-28 更新

Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition

Authors:Raphaël Bagat, Irina Illina, Emmanuel Vincent

We aim to improve the robustness of Automatic Speech Recognition (ASR) systems against non-native speech, particularly in low-resourced multi-accent settings. We introduce Mixture of Accent-Specific LoRAs (MAS-LoRA), a fine-tuning method that leverages a mixture of Low-Rank Adaptation (LoRA) experts, each specialized in a specific accent. This method can be used when the accent is known or unknown at inference time, without the need to fine-tune the model again. Our experiments, conducted using Whisper on the L2-ARCTIC corpus, demonstrate significant improvements in Word Error Rate compared to regular LoRA and full fine-tuning when the accent is unknown. When the accent is known, the results further improve. Furthermore, MAS-LoRA shows less catastrophic forgetting than the other fine-tuning methods. To the best of our knowledge, this is the first use of a mixture of LoRA experts for non-native multi-accent ASR.

我们的目标是提高自动语音识别(ASR)系统对非母语语音的稳健性,特别是在资源稀缺的多口音环境中。我们提出了口音特定LoRA混合方法(MAS-LoRA),这是一种利用多个低秩适应(LoRA)专家的微调方法,每个专家专门针对一种口音。无论推理时口音已知还是未知,该方法都可以直接使用,无需再次微调模型。我们使用Whisper在L2-ARCTIC语料库上进行的实验表明,当口音未知时,与常规LoRA和全量微调相比,MAS-LoRA在词错误率方面取得了显著改进;当口音已知时,结果进一步提升。此外,与其他微调方法相比,MAS-LoRA的灾难性遗忘更少。据我们所知,这是首次将LoRA专家混合用于非母语多口音ASR。

论文及项目相关链接

PDF Submitted to Interspeech 2025

Summary
针对非母语语音的自动语音识别(ASR),特别是资源稀缺的多口音环境,我们提出了口音特定LoRA混合(MAS-LoRA)微调方法。该方法由多个专门针对特定口音的低秩适应(LoRA)专家混合而成,推理时无论口音已知还是未知均可使用,无需再次微调模型。实验表明,与常规LoRA和全量微调相比,MAS-LoRA在口音未知时显著降低了词错误率;当口音已知时,结果进一步改善。此外,MAS-LoRA的灾难性遗忘也少于其他微调方法。这是首次将LoRA专家混合用于非母语多口音ASR。
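
下面是一个按上述思路写的极简示意(并非论文官方实现):在一个冻结的线性层上叠加多个口音专属的LoRA专家,并按门控权重加权组合;其中专家数量、秩 r、缩放系数以及门控权重的来源均为示例假设(口音已知时可用one-hot门控,未知时可用均匀或预测得到的权重)。

```python
import torch
import torch.nn as nn

class MixtureOfLoRALinear(nn.Module):
    """示意:在冻结的线性层上叠加多个LoRA专家,按门控权重加权求和(非官方实现)。"""

    def __init__(self, base: nn.Linear, num_experts: int = 6, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # 冻结预训练权重
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.ParameterList([nn.Parameter(torch.randn(r, d_in) * 0.01) for _ in range(num_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(d_out, r)) for _ in range(num_experts)])
        self.scaling = alpha / r

    def forward(self, x, gate):
        # x: (batch, ..., d_in);gate: (num_experts,) 的混合权重
        out = self.base(x)
        for g, A, B in zip(gate, self.A, self.B):
            out = out + g * self.scaling * (x @ A.t() @ B.t())
        return out

layer = MixtureOfLoRALinear(nn.Linear(512, 512), num_experts=6, r=8)
x = torch.randn(2, 10, 512)
print(layer(x, gate=torch.full((6,), 1 / 6)).shape)  # 口音未知:均匀混合各专家
```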

Key Takeaways

  1. 研究目标是提高自动语音识别(ASR)系统对非母语语音的稳健性,特别是在资源有限的多口音环境中。
  2. 引入了MAS-LoRA方法,它结合了多个针对特定口音的LoRA专家,以提高ASR系统的性能。
  3. MAS-LoRA可在推理阶段应用于已知或未知口音的情况,无需对模型进行二次微调。
  4. 实验表明,MAS-LoRA在口音未知时显著降低了词错误率,口音已知时效果更佳。
  5. 与其他微调方法相比,MAS-LoRA显示出较少的灾难性遗忘。
  6. MAS-LoRA是首次将LoRA专家混合应用于非母语多口音ASR的研究。

Cool Papers

点此查看论文截图

Mel-McNet: A Mel-Scale Framework for Online Multichannel Speech Enhancement

Authors:Yujie Yang, Bing Yang, Xiaofei Li

Online multichannel speech enhancement has been intensively studied recently. Though Mel-scale frequency is more matched with human auditory perception and computationally efficient than linear frequency, few works are implemented in a Mel-frequency domain. To this end, this work proposes a Mel-scale framework (namely Mel-McNet). It processes spectral and spatial information with two key components: an effective STFT-to-Mel module compressing multi-channel STFT features into Mel-frequency representations, and a modified McNet backbone directly operating in the Mel domain to generate enhanced LogMel spectra. The spectra can be directly fed to vocoders for waveform reconstruction or ASR systems for transcription. Experiments on CHiME-3 show that Mel-McNet can reduce computational complexity by 60% while maintaining comparable enhancement and ASR performance to the original McNet. Mel-McNet also outperforms other SOTA methods, verifying the potential of Mel-scale speech enhancement.

近期,在线多通道语音增强得到了广泛研究。尽管梅尔尺度频率比线性频率更符合人类听觉感知且计算效率更高,但很少有工作在梅尔频域中实现。为此,本文提出了一个梅尔尺度框架(Mel-McNet)。它通过两个关键组件处理谱信息和空间信息:一个有效的STFT-to-Mel模块,将多通道STFT特征压缩为梅尔频率表示;以及一个改进的McNet骨干网络,直接在梅尔域中运行以生成增强的LogMel谱。这些谱可以直接送入声码器进行波形重建,或送入ASR系统进行转录。在CHiME-3上的实验表明,Mel-McNet可将计算复杂度降低60%,同时保持与原始McNet相当的增强和ASR性能。Mel-McNet还优于其他SOTA方法,验证了梅尔尺度语音增强的潜力。

论文及项目相关链接

PDF Accepted by Interspeech 2025

总结

本研究提出了一种基于Mel频域的多通道语音增强框架(Mel-McNet)。它通过两个关键组件处理频谱和空间信息:有效的STFT-to-Mel模块将多通道STFT特征压缩成Mel频域表示,以及修改的McNet骨干网直接在Mel域生成增强的LogMel频谱。该框架可以应用于语音合成或语音识别系统,实验表明,Mel-McNet在降低计算复杂度60%的同时,保持了与原始McNet相当的增强和语音识别性能。与其他先进方法相比,Mel-McNet的表现更加出色,验证了Mel频域语音增强的潜力。
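
下面给出“STFT-to-Mel压缩”这一步的极简示意(非Mel-McNet官方实现):用固定的mel滤波器组把多通道STFT功率谱压缩到mel频带;通道数、FFT点数、mel带数均为假设值,论文中的可学习STFT-to-Mel模块可视为此映射的可学习替代。

```python
import torch
import torchaudio

n_fft, n_mels, sr = 512, 80, 16000
num_chans, num_frames = 6, 100  # 假设:6通道麦克风阵列

# 多通道复数STFT,形状 (C, F, T),F = n_fft//2 + 1
stft = torch.randn(num_chans, n_fft // 2 + 1, num_frames, dtype=torch.cfloat)

# 线性频率 -> mel频率的固定映射
mel_scale = torchaudio.transforms.MelScale(n_mels=n_mels, sample_rate=sr, n_stft=n_fft // 2 + 1)
log_mel = torch.log(mel_scale(stft.abs() ** 2) + 1e-8)  # (C, n_mels, T)
print(log_mel.shape)  # torch.Size([6, 80, 100])
```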

关键见解

  1. Mel频域在多通道语音增强中具有重要的应用价值。
  2. Mel-McNet框架利用STFT-to-Mel模块将多通道STFT特征转换为Mel频域表示。
  3. Mel-McNet通过直接生成增强的LogMel频谱,提高了语音质量和语音识别性能。
  4. Mel-McNet在计算复杂度方面相比原始McNet降低了60%。
  5. Mel-McNet在CHiME-3数据集上的表现优于其他先进方法。
  6. 实验结果证明了Mel频域语音增强的潜力和优势。
  7. Mel-McNet框架可为语音合成和语音识别系统提供有效的语音增强解决方案。

Cool Papers

点此查看论文截图

Training-Free Multi-Step Audio Source Separation

Authors:Yongyi Zang, Jingyi Li, Qiuqiang Kong

Audio source separation aims to separate a mixture into target sources. Previous audio source separation systems usually conduct one-step inference, which does not fully explore the separation ability of models. In this work, we reveal that pretrained one-step audio source separation models can be leveraged for multi-step separation without additional training. We propose a simple yet effective inference method that iteratively applies separation by optimally blending the input mixture with the previous step’s separation result. At each step, we determine the optimal blending ratio by maximizing a metric. We prove that our method always yield improvement over one-step inference, provide error bounds based on model smoothness and metric robustness, and provide theoretical analysis connecting our method to denoising along linear interpolation paths between noise and clean distributions, a property we link to denoising diffusion bridge models. Our approach effectively delivers improved separation performance as a “free lunch” from existing models. Our empirical results demonstrate that our multi-step separation approach consistently outperforms one-step inference across both speech enhancement and music source separation tasks, and can achieve scaling performance similar to training a larger model, using more data, or in some cases employing a multi-step training objective. These improvements appear not only on the optimization metric during multi-step inference, but also extend to nearly all non-optimized metrics (with one exception). We also discuss limitations of our approach and directions for future research.

音频源分离旨在将混合音频分离为目标源。以往的音频源分离系统通常只进行单步推理,未能充分挖掘模型的分离能力。在这项工作中,我们发现预训练的单步音频源分离模型无需额外训练即可用于多步分离。我们提出了一种简单而有效的推理方法:迭代地进行分离,每一步都将输入混合信号与上一步的分离结果按最优比例混合,最优混合比例通过最大化某个指标来确定。我们证明该方法总能相对单步推理带来提升,给出了基于模型平滑性和指标稳健性的误差界,并提供理论分析,将该方法与在噪声分布和干净分布之间沿线性插值路径进行去噪联系起来,这一性质与去噪扩散桥模型相关。我们的方法有效地从现有模型中“免费”获得了更好的分离性能。实验结果表明,我们的多步分离方法在语音增强和音乐源分离任务上均持续优于单步推理,并且可以取得与训练更大模型、使用更多数据、或在某些情况下采用多步训练目标相当的扩展性能。这些改进不仅体现在多步推理所优化的指标上,也几乎延伸到所有未被优化的指标(仅有一个例外)。我们还讨论了方法的局限性和未来研究方向。

论文及项目相关链接

PDF

Summary

本文揭示了预训练的单步音频源分离模型无需额外训练即可用于多步分离。通过迭代地将输入混合信号与上一步分离结果按最优比例混合后再次分离,提出了一种简单有效的推理方法。该方法在理论上与去噪扩散桥模型相联系,可提升现有模型的分离性能。实验结果表明,多步分离方法在语音增强和音乐源分离任务上始终优于单步推理,并能获得与训练更大模型相当的性能提升。
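
下面用伪实现勾勒这种免训练多步推理的核心循环(非官方代码):每一步在若干候选混合比例中选出使指标最大的那个再分离;其中 separate 与 metric 均为假设的外部接口,示例中的占位函数仅用于演示。

```python
import numpy as np

def multi_step_separate(mixture, separate, metric, num_steps=4, alphas=np.linspace(0.0, 1.0, 11)):
    """示意:迭代地把上一步分离结果与原始混合信号按最优比例混合后再次分离。

    separate(x) -> 估计的目标源;metric(est) -> 越大越好的指标(均为假设接口)。
    """
    est = separate(mixture)  # 第一步:普通的单步推理
    for _ in range(num_steps - 1):
        candidates = [separate(a * mixture + (1.0 - a) * est) for a in alphas]
        scores = [metric(c) for c in candidates]
        est = candidates[int(np.argmax(scores))]  # 选择使指标最大的混合比例
    return est

# 用法示意(以缩放“分离器”和能量指标作为占位):
mix = np.random.randn(16000)
out = multi_step_separate(mix, separate=lambda x: 0.9 * x, metric=lambda e: -np.mean(e ** 2))
```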

Key Takeaways

  1. 音频源分离旨在将混合音频分离为目标源。
  2. 预训练的单步音频源分离模型无需额外训练即可用于多步分离。
  3. 通过迭代应用分离方法并优化混合比例,提出简单有效的推理方法。
  4. 方法在理论上与去噪扩散桥模型相联系。
  5. 多步分离推理始终优于单步推理。
  6. 该方法在语音增强和音乐源分离任务上实现性能提升。

Cool Papers

点此查看论文截图

AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection

Authors:Yejin Lee, Joonghyuk Hahn, Hyeseon Ahn, Yo-Sub Han

Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which are shown to be effective on distinguishing hate and non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these target relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit target using a pretrained Named Entity Recognition model and capture implicit target information via [CLS] tokens. It computes attention-based relationships between explicit, implicit targets and sentence context and then, directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieve faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness.

隐式仇恨言论检测因其细微性和依赖于上下文解读而非明显的冒犯性词汇而具有挑战性。当前的方法依赖于对比学习,在区分仇恨和非仇恨句子方面显示出有效性。然而,人类检测隐式仇恨言论是先识别文本中的特定目标,然后解释这些目标如何与他们的上下文相关联。受这种推理过程的启发,我们提出了AmpleHate,这是一种旨在模拟人类推断隐式仇恨检测的新方法。AmpleHate使用预训练的命名实体识别模型识别显式目标,并通过[CLS]令牌捕获隐式目标信息。它计算显式、隐式目标与句子上下文之间的基于注意力的关系,然后将这些关系向量直接注入最终的句子表示中。这放大了确定隐式仇恨的目标-上下文关系的关键信号。实验表明,AmpleHate达到了最先进的性能,平均比对比学习基线高出82.14%,并且实现了更快的收敛速度。定性分析进一步表明,AmpleHate产生的注意力模式与人类判断紧密吻合,强调了其可解释性和稳健性。

论文及项目相关链接

PDF 13 pages, 4 figures, Under Review

摘要

隐恶言检测面临诸多挑战,因其表达微妙且依赖于语境解读而非直接冒犯性词汇。当前方法主要依赖对比学习,在区分仇恨与非仇恨语句方面效果显著。然而,人类检测隐恶言时首先识别文本中的特定目标,随后解读这些目标如何与周遭语境关联。受这一过程启发,我们提出AmpleHate这一新方法,旨在模拟人类推理过程进行隐恶言检测。AmpleHate运用预训练命名实体识别模型识别显性目标,并通过[CLS]令牌捕捉隐性目标信息。它计算显性、隐性目标与句子语境间的关注关系,随后将这些关系向量直接注入最终句子表述中。这放大了目标语境关系的关键信号,以确定隐恶言的存在。实验证明,AmpleHate实现了最先进的性能表现,较对比学习基线平均高出82.14%,且收敛速度更快。定性分析进一步表明,AmpleHate产生的关注模式与人类判断高度一致,凸显其可解释性和稳健性。
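
下面是“将目标-上下文的注意力关系向量直接注入最终句子表示”这一思路的极简示意(非官方实现):显式目标向量、[CLS]向量和上下文隐藏状态均用随机张量代替,维度与分类头均为假设。

```python
import torch
import torch.nn.functional as F

d, seq_len = 768, 32
context = torch.randn(seq_len, d)        # 句子各token的隐藏状态(示意)
cls_vec = context[0]                     # [CLS] 作为隐式目标信息
explicit_target = torch.randn(d)         # NER 抽取的显式目标的表示(示意)

def relation_vector(query, context):
    """注意力加权的目标-上下文关系向量。"""
    attn = F.softmax(context @ query / d ** 0.5, dim=0)  # (seq_len,)
    return attn @ context                                 # (d,)

# 将两类关系向量直接加回最终句子表示,再做二分类
sentence_repr = (cls_vec
                 + relation_vector(explicit_target, context)
                 + relation_vector(cls_vec, context))
logits = torch.nn.Linear(d, 2)(sentence_repr)  # 隐式仇恨 / 非仇恨
print(logits.shape)
```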

要点概括

  1. 隐恶言检测的挑战性在于其表达方式的微妙和依赖语境解读的特点。
  2. 当前的主要方法是通过对比学习来区分仇恨和非仇恨语句。
  3. 人类检测隐恶言时,首先识别文本中的特定目标,并解读这些目标与语境的关系。
  4. AmpleHate方法模拟人类推理过程进行隐恶言检测,通过识别显性目标和隐性目标信息,计算它们与句子语境的关注关系。
  5. AmpleHate将关系向量直接注入句子最终表述中,放大目标语境关系的关键信号。
  6. 实验证明AmpleHate性能卓越,较对比学习基线有显著提高,且收敛速度更快。

Cool Papers

点此查看论文截图

Multi-Channel Acoustic Echo Cancellation Based on Direction-of-Arrival Estimation

Authors:Fei Zhao, Xueliang Zhang, Zhong-Qiu Wang

Acoustic echo cancellation (AEC) is an important speech signal processing technology that can remove echoes from microphone signals to enable natural-sounding full-duplex speech communication. While single-channel AEC is widely adopted, multi-channel AEC can leverage spatial cues afforded by multiple microphones to achieve better performance. Existing multi-channel AEC approaches typically combine beamforming with deep neural networks (DNN). This work proposes a two-stage algorithm that enhances multi-channel AEC by incorporating sound source directional cues. Specifically, a lightweight DNN is first trained to predict the sound source directions, and then the predicted directional information, multi-channel microphone signals, and single-channel far-end signal are jointly fed into an AEC network to estimate the near-end signal. Evaluation results show that the proposed algorithm outperforms baseline approaches and exhibits robust generalization across diverse acoustic environments.

声学回声消除(AEC)是一项重要的语音信号处理技术,能够消除麦克风信号中的回声,从而实现自然的全双工语音通信。虽然单通道AEC已得到广泛应用,但多通道AEC可以利用多个麦克风提供的空间线索来实现更好的性能。现有的多通道AEC方法通常将波束形成与深度神经网络(DNN)相结合。这项工作提出了一个两阶段的算法,通过融入声源方向线索来增强多通道AEC。具体来说,首先训练一个轻量级的DNN来预测声源方向,然后将预测的方向信息、多通道麦克风信号和单通道远端信号一起输入到AEC网络中来估计近端信号。评估结果表明,该算法优于基线方法,并在各种声学环境中表现出稳健的泛化能力。

论文及项目相关链接

PDF Accepted by Interspeech 2025

Summary
多通道声学回声消除(AEC)技术利用多个麦克风的空域线索,通过结合波束成形和深度神经网络(DNN)实现更好的性能。本研究提出了一种两阶段算法,通过引入声源方向线索来增强多通道AEC。首先训练一个轻量级的DNN来预测声源方向,然后将预测的方位信息、多通道麦克风信号和单通道远端信号一起输入到AEC网络中以估计近端信号。评估结果表明,该算法优于基准方法,并在各种声学环境中表现出稳健的泛化能力。
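
下面示意两阶段方案的输入组织方式(非官方实现):先用轻量网络从多通道特征预测声源方向的后验分布,再把它与多通道麦克风特征、单通道远端特征拼接后送入AEC网络;麦克风数、方向类别数、特征维度均为假设。

```python
import torch
import torch.nn as nn

num_mics, num_doa_classes, feat_dim, frames = 4, 36, 257, 100  # 假设:4麦克风,方向按10°量化为36类

mic_feats = torch.randn(1, num_mics, feat_dim, frames)  # 多通道麦克风信号特征(如STFT幅度)
far_feats = torch.randn(1, 1, feat_dim, frames)          # 单通道远端参考信号特征

# 轻量DOA网络(仅作占位):每个通道展平后映射到方向类别,再对通道取平均
doa_net = nn.Sequential(nn.Flatten(2), nn.Linear(feat_dim * frames, num_doa_classes))
doa_probs = doa_net(mic_feats).mean(dim=1).softmax(dim=-1)  # (1, num_doa_classes)

# 把方向后验沿频率/时间广播后,与其余特征拼接,作为AEC网络的输入
doa_map = doa_probs[:, :, None, None].expand(-1, -1, feat_dim, frames)
aec_input = torch.cat([mic_feats, far_feats, doa_map], dim=1)  # (1, 4+1+36, F, T)
print(aec_input.shape)
```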

Key Takeaways

  1. 多通道AEC技术利用多个麦克风的空域线索,能够实现更好的回声消除性能。
  2. 现有多通道AEC方法通常结合波束成形和深度神经网络(DNN)。
  3. 本研究提出了一种两阶段算法,通过引入声源方向线索来增强多通道AEC。
  4. 该算法首先使用轻量级DNN预测声源方向。
  5. 该算法将预测的声源方向、多通道麦克风信号和单通道远端信号一起输入到AEC网络。
  6. 评估结果表明,该算法性能优于基准方法。

Cool Papers

点此查看论文截图

FlowSE: Efficient and High-Quality Speech Enhancement via Flow Matching

Authors:Ziqian Wang, Zikai Liu, Xinfa Zhu, Yike Zhu, Mingshuai Liu, Jun Chen, Longshuai Xiao, Chao Weng, Lei Xie

Generative models have excelled in audio tasks using approaches such as language models, diffusion, and flow matching. However, existing generative approaches for speech enhancement (SE) face notable challenges: language model-based methods suffer from quantization loss, leading to compromised speaker similarity and intelligibility, while diffusion models require complex training and high inference latency. To address these challenges, we propose FlowSE, a flow-matching-based model for SE. Flow matching learns a continuous transformation between noisy and clean speech distributions in a single pass, significantly reducing inference latency while maintaining high-quality reconstruction. Specifically, FlowSE trains on noisy mel spectrograms and optional character sequences, optimizing a conditional flow matching loss with ground-truth mel spectrograms as supervision. It implicitly learns speech’s temporal-spectral structure and text-speech alignment. During inference, FlowSE can operate with or without textual information, achieving impressive results in both scenarios, with further improvements when transcripts are available. Extensive experiments demonstrate that FlowSE significantly outperforms state-of-the-art generative methods, establishing a new paradigm for generative-based SE and demonstrating the potential of flow matching to advance the field. Our code, pre-trained checkpoints, and audio samples are available.

生成模型借助语言模型、扩散和流匹配等方法在音频任务中表现出色。然而,现有的语音增强(SE)生成方法面临明显挑战:基于语言模型的方法存在量化损失,损害说话人相似度和可懂度;扩散模型则需要复杂的训练且推理延迟高。为了应对这些挑战,我们提出了基于流匹配的SE模型FlowSE。流匹配在单次前向过程中学习带噪与干净语音分布之间的连续变换,在保持高质量重建的同时显著降低推理延迟。具体来说,FlowSE以带噪梅尔频谱和可选的字符序列为输入进行训练,以真实梅尔频谱为监督,优化条件流匹配损失,从而隐式学习语音的时频结构以及文本与语音的对齐。在推理时,FlowSE可以在有或没有文本信息的情况下运行,两种情形下均取得了令人印象深刻的结果,且在有转录文本时效果进一步提升。大量实验表明,FlowSE显著优于最先进的生成式方法,为基于生成模型的SE建立了新范式,并展示了流匹配推动该领域发展的潜力。我们的代码、预训练检查点和音频样本均已公开。

论文及项目相关链接

PDF Accepted to InterSpeech 2025

Summary

基于流匹配技术的新型语音增强模型FlowSE表现出优异性能。它通过一次过程学习噪声语音和纯净语音的分布转换,具备高效推理时间和高重建质量。FlowSE可通过优化有条件流匹配损失进行训练,在有声谱图监督下,可隐式学习语音的时频结构和文本语音对齐。在推理过程中,FlowSE可在有或无文本信息的情况下运行,并在两种情况下均取得了令人印象深刻的结果。
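
下面几行代码示意条件流匹配训练目标的一般形式(非FlowSE官方实现):在带噪与干净梅尔谱的线性插值点上回归速度场 x1 − x0;模型接口、条件形式与特征维度均为假设。

```python
import torch

def flow_matching_loss(model, noisy_mel, clean_mel, cond):
    """示意:x_t = (1-t)*x0 + t*x1,目标速度 v = x1 - x0,回归均方误差。"""
    x0, x1 = noisy_mel, clean_mel
    t = torch.rand(x0.shape[0], 1, 1, device=x0.device)  # 每个样本一个 t ~ U(0,1)
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t.squeeze(), cond)                # cond:可选的文本/字符序列条件
    return torch.mean((v_pred - v_target) ** 2)

# 用法示意(以零输出模型为占位):
B, n_mels, T = 4, 80, 200
loss = flow_matching_loss(lambda x, t, c: torch.zeros_like(x),
                          torch.randn(B, n_mels, T), torch.randn(B, n_mels, T), cond=None)
print(loss.item())
```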

Key Takeaways

  1. 生成模型在音频任务中表现出卓越性能,采用语言模型、扩散和流匹配等方法。
  2. 现有语音增强生成方法面临挑战,如语言模型的量化损失和扩散模型的高复杂度和推理延迟。
  3. FlowSE是一个基于流匹配的模型,旨在解决这些挑战。
  4. FlowSE通过优化有条件流匹配损失进行训练,可在声谱图监督下隐式学习语音结构。
  5. FlowSE在推理过程中,可灵活处理有或无文本信息的情况,并均取得良好效果。
  6. FlowSE显著优于现有生成方法,为生成式语音增强建立了新范例。

Cool Papers

点此查看论文截图

SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Authors:Helin Wang, Jiarui Hai, Dongchao Yang, Chen Chen, Kai Li, Junyi Peng, Thomas Thebaud, Laureano Moro Velazquez, Jesus Villalba, Najim Dehak

Target Speech Extraction (TSE) aims to isolate a target speaker’s voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio’s latent space, aligning it with the mixture audio’s latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction and speech separation tasks while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.

目标语音提取(TSE)旨在利用特定于说话人的线索(通常作为辅助音频(又称线索音频)提供)从多个说话人的混合声音中分离出目标说话人的声音。尽管最近的TSE进展主要采用了提供高感知质量的判别模型,但这些模型往往会引入不需要的伪影,降低自然度,并对训练和测试环境之间的差异敏感。另一方面,TSE的生成模型在感知质量和清晰度方面有所不足。为了解决这些挑战,我们提出了SoloSpeech,这是一种新型级联生成管道,它集成了压缩、提取、重建和校正过程。SoloSpeech的特点是无需说话人嵌入的目标提取器,它利用线索音频的潜在空间的条件信息,将其与混合音频的潜在空间对齐,以防止不匹配。在广泛使用的Libri2Mix数据集上,SoloSpeech在目标语音提取和语音分离任务中实现了最新的清晰度和质量水平,同时在跨领域数据和真实世界场景上展示了出色的泛化能力。

论文及项目相关链接

PDF

Summary
目标语音提取(TSE)旨在利用特定于说话人的线索(通常作为辅助音频提供),从多个说话人的混合声音中分离出目标说话人的声音。尽管最近的研究主要采用了提供高感知质量的判别模型,但这些模型常常引入不必要的伪影,降低了自然度,并对训练和测试环境之间的差异敏感。为解决这些问题,我们提出了名为SoloSpeech的新型级联生成管道,它集成了压缩、提取、重建和校正过程。SoloSpeech的特点是无需使用说话人嵌入的目标提取器,利用来自提示音频的潜在空间的条件信息,与混合音频的潜在空间对齐,以避免不匹配。在广泛使用的Libri2Mix数据集上,SoloSpeech在目标语音提取和语音分离任务中实现了新的最先进的可理解性和质量,并在超出领域的数据和真实世界场景中表现出卓越的泛化能力。

Key Takeaways

  1. 目标语音提取(TSE)旨在从混合声音中分离出目标说话人的声音,利用特定于说话人的线索。
  2. 现有模型(判别模型)存在引入伪影、降低自然度和对环境差异敏感的问题。
  3. SoloSpeech是一种新型的级联生成管道,集成了压缩、提取、重建和校正过程。
  4. SoloSpeech采用无说话人嵌入的目标提取器设计,利用提示音频和混合音频的潜在空间条件信息。
  5. SoloSpeech在Libri2Mix数据集上实现了新的最先进的可理解性和质量。
  6. SoloSpeech在超出领域的数据和真实世界场景中表现出良好的泛化能力。

Cool Papers

点此查看论文截图

EnvSDD: Benchmarking Environmental Sound Deepfake Detection

Authors:Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley

Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.

音频生成系统如今能够创建非常逼真的声音场景,这可以增强媒体制作,但也带来潜在风险。已有一些研究考察了语音或歌唱声音中的深度伪造。然而,环境声音具有不同的特性,这可能使针对语音和歌唱深度伪造的检测方法在真实环境声音上效果较差。此外,现有的环境声音深度伪造检测数据集在规模和音频类型上都很有限。为了填补这一空白,我们推出了EnvSDD,这是首个为该任务构建的大规模精选数据集,包含45.25小时的真实音频和316.74小时的伪造音频。测试集包含多种条件以评估泛化能力,例如未见过的生成模型和未见过的数据集。我们还提出了一个基于预训练音频基础模型的音频深度伪造检测系统。在EnvSDD上的结果显示,我们提出的系统优于语音和歌唱领域的最先进系统。

论文及项目相关链接

PDF Accepted by Interspeech 2025

摘要
音频生成系统所创造的声音环境越来越真实,既增强了媒体制作的效果,也带来了潜在风险。尽管已有研究探讨了语音或歌唱声音的深度伪造问题,但环境音具有不同的特性,使得检测语音和歌唱深度伪造的方法对真实世界声音的效果可能不佳。此外,环境音深度伪造检测的现有数据集规模有限、音频类型有限。为解决这一空白,我们推出了EnvSDD,这是专门为这一任务设计的大型精选数据集,包含45.25小时的真实音频和316.74小时的伪造音频。测试集包括各种条件,以评估对未见过的生成模型和未见过的数据集的泛化能力。我们还基于预训练的音频基础模型提出了音频深度伪造检测系统。在EnvSDD上的结果表明,我们提出的系统优于语音和歌唱领域的最新系统。
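
论文的检测系统基于预训练音频基础模型;下面给出通用的“冻结基础模型 + 轻量分类头”二分类示意(非官方实现),其中基础模型以一个占位模块代替,接口与特征维度均为假设。

```python
import torch
import torch.nn as nn

class DeepfakeDetector(nn.Module):
    """示意:在预训练音频基础模型的帧级特征上做均值池化,再接线性层判别真/伪。"""

    def __init__(self, backbone, feat_dim=768):
        super().__init__()
        self.backbone = backbone                 # 假设:输入波形,输出 (B, T, feat_dim)
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, 2)       # 0 = 真实音频, 1 = 伪造音频

    def forward(self, wav):
        with torch.no_grad():
            feats = self.backbone(wav)
        return self.head(feats.mean(dim=1))

class DummyBackbone(nn.Module):
    """占位的“基础模型”:简单分帧后线性投影,仅用于演示接口形状。"""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(400, 768)
    def forward(self, wav):                      # wav: (B, num_samples)
        frames = wav.unfold(1, 400, 320)         # (B, T, 400)
        return self.proj(frames)                 # (B, T, 768)

detector = DeepfakeDetector(DummyBackbone())
print(detector(torch.randn(2, 16000)).shape)     # torch.Size([2, 2])
```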

要点掌握

  1. 音频生成系统能创造逼真的声音环境,既有积极影响,也存在潜在风险。
  2. 环境音具有独特特性,使得现有的语音和歌唱深度伪造检测方法对其可能不够有效。
  3. 当前针对环境音深度伪造检测的数据集存在规模及音频类型上的局限。
  4. 推出EnvSDD数据集,专门为环境音深度伪造检测任务设计,包含真实和伪造音频。
  5. 测试集包含多种条件,以评估系统的泛化能力,如对未见过的生成模型和数据集。
  6. 基于预训练的音频基础模型提出音频深度伪造检测系统。

Cool Papers

点此查看论文截图

BR-ASR: Efficient and Scalable Bias Retrieval Framework for Contextual Biasing ASR in Speech LLM

Authors:Xun Gong, Anqi Lv, Zhiming Wang, Huijia Zhu, Yanmin Qian

While speech large language models (SpeechLLMs) have advanced standard automatic speech recognition (ASR), contextual biasing for named entities and rare words remains challenging, especially at scale. To address this, we propose BR-ASR: a Bias Retrieval framework for large-scale contextual biasing (up to 200k entries) via two innovations: (1) speech-and-bias contrastive learning to retrieve semantically relevant candidates; (2) dynamic curriculum learning that mitigates homophone confusion which negatively impacts the final performance. The is a general framework that allows seamless integration of the retrieved candidates into diverse ASR systems without fine-tuning. Experiments on LibriSpeech test-clean/-other achieve state-of-the-art (SOTA) biased word error rates (B-WER) of 2.8%/7.1% with 2000 bias words, delivering 45% relative improvement over prior methods. BR-ASR also demonstrates high scalability: when expanding the bias list to 200k where traditional methods generally fail, it induces only 0.3 / 2.9% absolute WER / B-WER degradation with a 99.99% pruning rate and only 20ms latency per query on test-other.

尽管语音大语言模型(SpeechLLM)已经推动了标准自动语音识别(ASR)的发展,但针对命名实体和罕见词的上下文偏置(contextual biasing)仍然具有挑战性,尤其是在大规模场景下。为此,我们提出了BR-ASR:一个面向大规模上下文偏置(最多20万条目)的偏置检索框架,包含两项创新:(1)语音与偏置词的对比学习,用于检索语义相关的候选词;(2)动态课程学习,缓解损害最终性能的同音词混淆。这是一个通用框架,可将检索到的候选词无缝集成到各种ASR系统中而无需微调。在LibriSpeech test-clean/test-other上的实验中,使用2000个偏置词取得了2.8%/7.1%的最优偏置词错误率(B-WER),相比以往方法有45%的相对提升。BR-ASR还展现出很高的可扩展性:当偏置列表扩展到传统方法通常失效的20万条时,它仅带来0.3%/2.9%的绝对WER/B-WER退化,剪枝率达99.99%,且在test-other上每次查询仅20ms延迟。

论文及项目相关链接

PDF Accepted by InterSpeech 2025

Summary

本文提出BR-ASR,通过两项创新解决大规模上下文偏置问题:一是语音与偏置词的对比学习,用于检索语义相关的候选词;二是动态课程学习,缓解同音词混淆对最终性能的影响。该框架无需微调即可无缝集成到不同的ASR系统中。在LibriSpeech上,使用2000个偏置词取得了最优的偏置词错误率,相比以往方法有45%的相对提升;将偏置列表扩展到20万条时,性能退化很小,且每次查询仅约20ms延迟,体现出高度的可扩展性与效率。
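
下面用极简代码示意偏置检索在推理侧的做法(非官方实现):将语音查询嵌入与大规模偏置词嵌入做余弦相似度并取top-k作为候选,送入SpeechLLM做上下文偏置;嵌入的来源、维度与条目均为随机占位。

```python
import torch
import torch.nn.functional as F

num_bias, d, k = 200_000, 256, 100
bias_list = [f"entity_{i}" for i in range(num_bias)]      # 最多20万条偏置词(示意)
bias_emb = F.normalize(torch.randn(num_bias, d), dim=-1)  # 假设:偏置词编码器离线算好的嵌入
speech_emb = F.normalize(torch.randn(1, d), dim=-1)       # 假设:语音编码器得到的查询向量

scores = speech_emb @ bias_emb.t()                        # 余弦相似度 (1, num_bias)
topk = scores.topk(k, dim=-1).indices[0]
candidates = [bias_list[i] for i in topk.tolist()]        # 作为上下文偏置候选词注入ASR
print(len(candidates), f"剪枝率约 {1 - k / num_bias:.4%}")
```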

Key Takeaways

  1. BR-ASR采用两项创新解决大规模上下文偏置问题:语音与偏置词的对比学习,以及动态课程学习。
  2. 对比学习能够检索语义相关候选词,动态课程学习则有助于减少同音字混淆对性能的影响。
  3. 模型可以无缝集成到不同的语音识别系统中,无需微调。

Cool Papers

点此查看论文截图

FiLLM – A Filipino-optimized Large Language Model based on Southeast Asia Large Language Model (SEALLM)

Authors:Carlos Jude G. Maminta, Isaiah Job Enriquez, Deandre Nigel Nunez, Michael B. Dela Fuente

This study presents FiLLM, a Filipino-optimized large language model, designed to enhance natural language processing (NLP) capabilities in the Filipino language. Built upon the SeaLLM-7B 2.5 model, FiLLM leverages Low-Rank Adaptation (LoRA) fine-tuning to optimize memory efficiency while maintaining task-specific performance. The model was trained and evaluated on diverse Filipino datasets to address key NLP tasks, including Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and Text Summarization. Performance comparisons with the CalamanCy model were conducted using F1 Score, Precision, Recall, Compression Rate, and Keyword Overlap metrics. Results indicate that Calamancy outperforms FILLM in several aspects, demonstrating its effectiveness in processing Filipino text with improved linguistic comprehension and adaptability. This research contributes to the advancement of Filipino NLP applications by providing an optimized, efficient, and scalable language model tailored for local linguistic needs.

本研究提出了FiLLM,这是一个针对菲律宾语优化的大型语言模型,旨在增强菲律宾语的自然语言处理(NLP)能力。FiLLM基于SeaLLM-7B 2.5模型构建,利用低秩适应(LoRA)微调在保持特定任务性能的同时优化内存效率。该模型在多种菲律宾语数据集上进行训练和评估,以应对命名实体识别(NER)、词性标注(POS)、依存句法分析和文本摘要等关键NLP任务。我们使用F1分数、精确率、召回率、压缩率和关键词重叠等指标,与CalamanCy模型进行了性能比较。结果表明,CalamanCy在多个方面优于FiLLM,展示了其在菲律宾语文本处理中更好的语言理解能力和适应性。本研究通过提供一个针对本地语言需求优化、高效且可扩展的语言模型,为菲律宾语NLP应用的发展做出了贡献。

论文及项目相关链接

PDF

Summary
本研究介绍了FiLLM,这是一个针对菲律宾语优化的大型语言模型,旨在提高菲律宾语的自然语言处理(NLP)能力。该模型基于SeaLLM-7B 2.5构建,采用低秩适应(LoRA)微调技术,在优化内存效率的同时保持特定任务性能。模型在多种菲律宾语数据集上进行训练和评估,以应对关键的自然语言处理任务,包括命名实体识别(NER)、词性标注(POS)、依存解析和文本摘要。通过与CalamanCy模型的F1分数、精确度、召回率、压缩率和关键词重叠度等指标的比较,结果显示CalamanCy在多个方面优于FiLLM,证明其在处理菲律宾语文本时的有效性和适应性。该研究为菲律宾语NLP应用的进步做出了贡献,提供了一个针对本地语言需求的优化、高效和可扩展的语言模型。

Key Takeaways

  1. FiLLM是一个针对菲律宾语优化的语言模型,基于SeaLLM-7B 2.5构建。
  2. FiLLM利用低秩适应(LoRA)微调技术以提高内存效率和特定任务性能。
  3. 模型在多种菲律宾语数据集上进行训练和评估,涵盖NER、POS标注、依存解析和文本摘要等关键NLP任务。
  4. 与CalamanCy模型的比较表明,CalamanCy在多个方面表现优于FiLLM。
  5. CalamanCy在处理菲律宾语文本时表现出更好的语言理解能力和适应性。
  6. 该研究为菲律宾语NLP应用的进步做出了贡献。

Cool Papers

点此查看论文截图

StandUp4AI: A New Multilingual Dataset for Humor Detection in Stand-up Comedy Videos

Authors:Valentin Barriere, Nahuel Gomez, Leo Hemamou, Sofia Callejas, Brian Ravenet

Aiming towards improving current computational models of humor detection, we propose a new multimodal dataset of stand-up comedies, in seven languages: English, French, Spanish, Italian, Portuguese, Hungarian and Czech. Our dataset of more than 330 hours, is at the time of writing the biggest available for this type of task, and the most diverse. The whole dataset is automatically annotated in laughter (from the audience), and the subpart left for model validation is manually annotated. Contrary to contemporary approaches, we do not frame the task of humor detection as a binary sequence classification, but as word-level sequence labeling, in order to take into account all the context of the sequence and to capture the continuous joke tagging mechanism typically occurring in natural conversations. Alongside unimodal baseline results, we propose a method to enhance automatic laughter detection based on Audio Speech Recognition errors. Our code and data are available online: https://tinyurl.com/EMNLPHumourStandUpPublic

为了改进当前的幽默检测计算模型,我们提出了一个新的脱口秀喜剧多模态数据集,涵盖七种语言:英语、法语、西班牙语、意大利语、葡萄牙语、匈牙利语和捷克语。该数据集总时长超过330小时,是撰写本文时此类任务中规模最大、最多样化的数据集。整个数据集都根据观众笑声进行了自动标注,其中留作模型验证的子集则经过人工标注。与当前主流做法不同,我们不把幽默检测建模为二元序列分类,而是建模为词级序列标注,以利用序列的全部上下文,并捕捉自然对话中典型的连续笑点标注机制。在给出单模态基线结果的同时,我们还提出了一种基于语音识别错误来增强自动笑声检测的方法。我们的代码和数据已在线公开:https://tinyurl.com/EMNLPHumourStandUpPublic

论文及项目相关链接

PDF

Summary

本文提出一个新的多模态数据集,包含七种语言的脱口秀喜剧内容,总计超过330小时,是目前此类任务中最大且最多元的数据集。数据集通过观众笑声自动标注,部分数据用于模型验证并进行手动标注。与当代方法不同,本文不将幽默检测视为二元序列分类任务,而是作为词级序列标记,以考虑序列的上下文并捕捉自然对话中连续的笑话标记机制。此外,本文还提出一种基于语音识别错误增强自动笑声检测的方法。数据和代码已在线发布。

Key Takeaways

  1. 引入一个包含七种语言的脱口秀喜剧多模态数据集,规模超过330小时,为当前最大的幽默检测数据集。
  2. 数据集通过观众笑声自动标注,部分数据手动标注,用于模型验证。
  3. 不同于现有的幽默检测二元序列分类方法,本文将其视为词级序列标记,以捕捉自然对话中的连续笑话。
  4. 考虑到序列的上下文信息,提高幽默检测的准确性。
  5. 提出一种基于语音识别错误增强自动笑声检测的方法。
  6. 数据集和代码已在线发布,便于其他研究者使用。

Cool Papers

点此查看论文截图

Enhancing Generalization of Speech Large Language Models with Multi-Task Behavior Imitation and Speech-Text Interleaving

Authors:Jingran Xie, Xiang Li, Hui Wang, Yue Yu, Yang Xiang, Xixin Wu, Zhiyong Wu

Large language models (LLMs) have shown remarkable generalization across tasks, leading to increased interest in integrating speech with LLMs. These speech LLMs (SLLMs) typically use supervised fine-tuning to align speech with text-based LLMs. However, the lack of annotated speech data across a wide range of tasks hinders alignment efficiency, resulting in poor generalization. To address these issues, we propose a novel multi-task ‘behavior imitation’ method with speech-text interleaving, called MTBI, which relies solely on paired speech and transcripts. By ensuring the LLM decoder generates equivalent responses to paired speech and text, we achieve a more generalized SLLM. Interleaving is used to further enhance alignment efficiency. We introduce a simple benchmark to evaluate prompt and task generalization across different models. Experimental results demonstrate that our MTBI outperforms SOTA SLLMs on both prompt and task generalization, while requiring less supervised speech data.

大型语言模型(LLM)在各种任务中表现出了显著的泛化能力,这引发了将语音与LLM集成的兴趣增长。这些语音LLM(SLLM)通常使用有监督微调来将语音与基于文本的LLM对齐。然而,由于缺乏广泛任务的标注语音数据,阻碍了对齐效率,导致泛化性能较差。为了解决这些问题,我们提出了一种新型多任务“行为模仿”方法,该方法采用语音文本交替方式,称为MTBI,仅依赖于配对语音和文本。通过确保LLM解码器对配对语音和文本产生等效响应,我们实现了更通用的SLLM。交替方式被用来进一步提高对齐效率。我们引入了一个简单的基准测试来评估不同模型在提示和任务方面的泛化能力。实验结果表明,我们的MTBI在提示和任务泛化方面都优于最新的SLLM,同时需要更少的监督语音数据。

论文及项目相关链接

PDF Accepted by Interspeech 2025

摘要

语言大模型(LLM)在跨任务中展现出显著的泛化能力,引发了将语音与LLM集成的兴趣增加。然而,由于缺乏跨各种任务的标注语音数据,影响了与文本型LLM的对齐效率,导致泛化性能不佳。为解决这些问题,我们提出了一种新型多任务“行为模仿”方法,采用语音与文本的交错方式,称为MTBI,仅依赖于成对的语音和文本。通过确保LLM解码器对成对的语音和文本产生等效响应,我们实现了更通用的SLLM。交错方式进一步提高了对齐效率。我们还引入了一个简单的基准测试来评估不同模型在提示和任务上的泛化能力。实验结果表明,我们的MTBI在提示和任务泛化方面优于当前最佳水平的SLLM,同时需要更少的监督语音数据。

要点掌握

  1. LLM在跨任务中展现出良好的泛化能力,引发了对集成语音和LLM的研究兴趣。
  2. 现有的语音LLM(SLLM)通常使用监督微调与文本型LLM对齐,但缺乏标注语音数据影响了对齐效率和泛化性能。
  3. 提出了一种新型多任务“行为模仿”方法(MTBI),仅利用成对的语音和文本,确保LLM解码器对语音和文本产生等效响应,提高泛化能力。
  4. MTBI采用语音与文本的交错方式,进一步提高对齐效率。
  5. 引入了一个简单的基准测试来评估模型在提示和任务上的泛化能力。
  6. 实验结果表明,MTBI在泛化性能和监督语音数据需求方面优于现有SLLM。

Cool Papers

点此查看论文截图

TS-URGENet: A Three-stage Universal Robust and Generalizable Speech Enhancement Network

Authors:Xiaobin Rong, Dahan Wang, Qinwen Hu, Yushi Wang, Yuxiang Hu, Jing Lu

Universal speech enhancement aims to handle input speech with different distortions and input formats. To tackle this challenge, we present TS-URGENet, a Three-Stage Universal, Robust, and Generalizable speech Enhancement Network. To address various distortions, the proposed system employs a novel three-stage architecture consisting of a filling stage, a separation stage, and a restoration stage. The filling stage mitigates packet loss by preliminarily filling lost regions under noise interference, ensuring signal continuity. The separation stage suppresses noise, reverberation, and clipping distortion to improve speech clarity. Finally, the restoration stage compensates for bandwidth limitation, codec artifacts, and residual packet loss distortion, refining the overall speech quality. Our proposed TS-URGENet achieved outstanding performance in the Interspeech 2025 URGENT Challenge, ranking 2nd in Track 1.

通用语音增强旨在处理具有不同失真和输入格式的输入语音。为了应对这一挑战,我们提出了TS-URGENet,一个三阶段的通用、稳健且可泛化的语音增强网络。针对各种失真,该系统采用由填充阶段、分离阶段和恢复阶段组成的新颖三阶段架构。填充阶段在噪声干扰下对丢失区域进行初步填充,缓解丢包问题,确保信号连续性;分离阶段抑制噪声、混响和削波失真,提高语音清晰度;最后,恢复阶段补偿带宽限制、编解码器伪影和残留的丢包失真,进一步提升整体语音质量。我们提出的TS-URGENet在Interspeech 2025 URGENT挑战赛中表现出色,在Track 1中排名第二。

论文及项目相关链接

PDF Accepted by Interspeech 2025

Summary

TS-URGENet是一个三阶段的通用、稳健且可泛化的语音增强网络,用于处理不同失真和输入格式的语音。它通过填充、分离和恢复三个阶段解决各种失真问题,提高语音清晰度并优化整体语音质量,在Interspeech 2025 URGENT挑战赛Track 1中排名第二。

Key Takeaways

  1. TS-URGENet是一个三阶段架构的通用语音增强网络。
  2. 它包括填充、分离和恢复三个阶段,针对各种语音失真问题。
  3. 填充阶段初步填充噪声干扰下丢失的区域,确保信号连续性。
  4. 分离阶段抑制噪声、混响和削波失真,提高语音清晰度。
  5. 恢复阶段补偿带宽限制、编码解码器产生的伪影和残留包丢失失真,优化语音质量。
  6. TS-URGENet在Interspeech 2025 URGENT Challenge中取得了优异表现。

Cool Papers

点此查看论文截图

ABHINAYA – A System for Speech Emotion Recognition In Naturalistic Conditions Challenge

Authors:Soumya Dutta, Smruthi Balaji, Varada R, Viveka Salinamakki, Sriram Ganapathy

Speech emotion recognition (SER) in naturalistic settings remains a challenge due to the intrinsic variability, diverse recording conditions, and class imbalance. As participants in the Interspeech Naturalistic SER Challenge which focused on these complexities, we present Abhinaya, a system integrating speech-based, text-based, and speech-text models. Our approach fine-tunes self-supervised and speech large language models (SLLM) for speech representations, leverages large language models (LLM) for textual context, and employs speech-text modeling with an SLLM to capture nuanced emotional cues. To combat class imbalance, we apply tailored loss functions and generate categorical decisions through majority voting. Despite one model not being fully trained, the Abhinaya system ranked 4th among 166 submissions. Upon completion of training, it achieved state-of-the-art performance among published results, demonstrating the effectiveness of our approach for SER in real-world conditions.

由于内在的可变性、多样的录音条件和类别不平衡,自然场景下的语音情感识别(SER)仍然是一个挑战。作为聚焦这些复杂性的Interspeech自然场景SER挑战赛的参赛者,我们提出了Abhinaya,一个融合了基于语音、基于文本以及语音-文本模型的系统。我们的方法对自监督模型和语音大语言模型(SLLM)进行微调以获得语音表示,利用大语言模型(LLM)提供文本上下文,并采用基于SLLM的语音-文本建模来捕捉细微的情感线索。为了应对类别不平衡,我们使用了定制的损失函数,并通过多数投票得到类别决策。尽管其中一个模型尚未完成训练,Abhinaya系统仍在166个提交中排名第4。完成训练后,它在已发表的结果中达到了最先进的性能,证明了我们的方法在真实条件下进行SER的有效性。

论文及项目相关链接

PDF 5 pages, 2 figures, 4 tables, accepted at Interspeech 2025

Summary

该文本介绍了针对自然场景下的语音情感识别(SER)挑战,提出了一种名为Abhinaya的系统。该系统结合了基于语音、文本和语音文本模型,通过微调自我监督和语音大型语言模型进行语音表示,利用大型语言模型提供文本背景,并通过语音文本建模捕捉微妙的情感线索。为应对类别不平衡问题,采用了定制的损失函数,并通过多数投票进行类别决策。Abhinaya系统在未完全训练的情况下排名第4,且在完成训练后达到了最新的性能水平,证明了该系统在自然条件下的SER的有效性。
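
下面示意其中两个工程要点(非官方实现):用类别加权交叉熵应对类别不平衡,用多数投票融合多个子系统的预测;情感类别数与各类样本数均为假设。

```python
import torch
import torch.nn as nn
from collections import Counter

num_classes = 8  # 假设:8类情感
class_counts = torch.tensor([5000, 300, 1200, 800, 4000, 150, 600, 950], dtype=torch.float)
weights = class_counts.sum() / (num_classes * class_counts)  # 频率越低,权重越大
criterion = nn.CrossEntropyLoss(weight=weights)              # 定制的类别加权损失(示意)

logits = torch.randn(4, num_classes)
loss = criterion(logits, torch.tensor([0, 1, 5, 2]))

def majority_vote(predictions):
    """predictions: 各子系统(语音/文本/语音-文本模型)对同一条语音的类别预测列表。"""
    return Counter(predictions).most_common(1)[0][0]

print(loss.item(), majority_vote([3, 3, 5]))
```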

Key Takeaways

  1. 介绍了针对自然场景下的语音情感识别(SER)的挑战。
  2. 提出了名为Abhinaya的系统,结合了基于语音、文本和语音文本模型的方法。
  3. 通过微调自我监督和语音大型语言模型进行语音表示。
  4. 利用大型语言模型提供文本背景信息。
  5. 采用语音文本建模以捕捉微妙的情感线索。
  6. 采用定制的损失函数和多数投票机制应对类别不平衡问题。

Cool Papers

点此查看论文截图

Dysfluent WFST: A Framework for Zero-Shot Speech Dysfluency Transcription and Detection

Authors:Chenxu Guo, Jiachen Lian, Xuanru Zhou, Jinming Zhang, Shuhe Li, Zongli Ye, Hwi Joo Park, Anaisha Das, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary Miller, Maria Gorno-Tempini, Gopala Anumanchipalli

Automatic detection of speech dysfluency aids speech-language pathologists in efficient transcription of disordered speech, enhancing diagnostics and treatment planning. Traditional methods, often limited to classification, provide insufficient clinical insight, and text-independent models misclassify dysfluency, especially in context-dependent cases. This work introduces Dysfluent-WFST, a zero-shot decoder that simultaneously transcribes phonemes and detects dysfluency. Unlike previous models, Dysfluent-WFST operates with upstream encoders like WavLM and requires no additional training. It achieves state-of-the-art performance in both phonetic error rate and dysfluency detection on simulated and real speech data. Our approach is lightweight, interpretable, and effective, demonstrating that explicit modeling of pronunciation behavior in decoding, rather than complex architectures, is key to improving dysfluency processing systems.

言语不流畅的自动检测可以帮助言语-语言病理学家高效转录障碍性语音,从而改进诊断和治疗规划。传统方法通常仅限于分类,提供的临床信息不足;而与文本无关的模型容易误判不流畅,尤其是在依赖上下文的情形中。本文提出了Dysfluent-WFST,一种零样本解码器,可同时转录音素并检测言语不流畅。与以往模型不同,Dysfluent-WFST直接搭配WavLM等上游编码器工作,无需任何额外训练。它在模拟和真实语音数据上的音素错误率和不流畅检测两方面都达到了最先进的性能。我们的方法轻量、可解释且有效,表明在解码中显式建模发音行为,而非采用复杂架构,才是改进不流畅处理系统的关键。

论文及项目相关链接

PDF Accepted for Interspeech2025

Summary

本文介绍了一种名为Dysfluent-WFST的零样本解码器,它能同时转录音素并检测言语不流畅。与传统的分类方法相比,该解码器在模拟和真实语音数据上的音素错误率和不流畅检测两方面均表现出卓越的性能。其设计轻量、可解释,并表明在解码过程中显式建模发音行为是改进不流畅处理系统的关键。

Key Takeaways

  1. 自动检测语言流畅性有助于语言病理学家有效转录障碍性语言,提升诊断和制定治疗方案。
  2. 传统方法主要限于分类,提供有限的临床洞察,而文本独立模型在语境依赖的案例中误分类语言流畅性问题。
  3. Dysfluent-WFST是一种零样本解码器,可同时进行语音转录和语言流畅性检测。
  4. 与其他模型不同,Dysfluent-WFST使用上游编码器如WavLM,无需额外训练。
  5. Dysfluent-WFST在模拟和实际语音数据上实现了语音错误率和语言流畅性检测的最新性能。
  6. 该方法采用轻量级、可解释的设计,强调在解码过程中明确建模发音行为的重要性。

Cool Papers

点此查看论文截图

Data-Efficient Hate Speech Detection via Cross-Lingual Nearest Neighbor Retrieval with Limited Labeled Data

Authors:Faeze Ghorbanpour, Daryna Dementieva, Alexander Fraser

Considering the importance of detecting hateful language, labeled hate speech data is expensive and time-consuming to collect, particularly for low-resource languages. Prior work has demonstrated the effectiveness of cross-lingual transfer learning and data augmentation in improving performance on tasks with limited labeled data. To develop an efficient and scalable cross-lingual transfer learning approach, we leverage nearest-neighbor retrieval to augment minimal labeled data in the target language, thereby enhancing detection performance. Specifically, we assume access to a small set of labeled training instances in the target language and use these to retrieve the most relevant labeled examples from a large multilingual hate speech detection pool. We evaluate our approach on eight languages and demonstrate that it consistently outperforms models trained solely on the target language data. Furthermore, in most cases, our method surpasses the current state-of-the-art. Notably, our approach is highly data-efficient, retrieving as small as 200 instances in some cases while maintaining superior performance. Moreover, it is scalable, as the retrieval pool can be easily expanded, and the method can be readily adapted to new languages and tasks. We also apply maximum marginal relevance to mitigate redundancy and filter out highly similar retrieved instances, resulting in improvements in some languages.

考虑到检测仇恨性言论的重要性,标注仇恨言论的数据收集既昂贵又耗时,特别是对于资源贫乏的语言来说。先前的研究已经证明了跨语言迁移学习和数据增强在改善有限标记数据任务性能方面的有效性。为了开发一种高效且可扩展的跨语言迁移学习方法,我们利用最近邻检索来增强目标语言的少量标记数据,从而提高检测性能。具体来说,我们假设目标语言中有少量标记的训练实例可供访问,并使用这些实例从大型多语言仇恨言论检测池中检索最相关的标记示例。我们在八种语言上评估了我们的方法,并证明它始终优于仅使用目标语言数据训练的模型。此外,在大多数情况下,我们的方法超越了当前的最佳水平。值得注意的是,我们的方法非常高效,在某些情况下只需检索200个实例就能保持卓越的性能。而且,它是可扩展的,因为检索池可以很容易地扩展,并且该方法可以很容易地适应新的语言和任务。我们还应用最大边缘相关性来缓解冗余并过滤掉高度相似的检索实例,从而在某些语言中取得了改进。

论文及项目相关链接

PDF

Summary

本研究关注仇恨言论检测的重要性,针对低资源语言标注仇恨言论数据收集的高成本和耗时的挑战,提出一种高效且可扩展的跨语言迁移学习方法。通过利用最近邻检索技术增强目标语言的少量标注数据,提高检测性能。研究假设拥有目标语言的少量标注训练实例,并从大型多语言仇恨言论检测池中检索最相关的实例。在八种语言上的评估显示,该方法优于仅使用目标语言数据的模型,并在多数情况下达到当前最佳水平。该方法具有高效性和可扩展性,可轻松扩展检索池,并易于适应新语言和任务。同时,研究还采用最大边际相关性方法减少冗余,过滤出高度相似的检索实例,进一步提高某些语言的性能。
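
其中用于去冗余的最大边际相关性(MMR)是标准算法,下面给出其选择过程的示意实现(非论文官方代码);相似度按余弦计算,λ取常见的0.7,均为示例设置。

```python
import numpy as np

def mmr_select(query_vec, cand_vecs, k=5, lam=0.7):
    """标准MMR:兼顾与查询的相关性和已选集合的多样性。向量需已做L2归一化。"""
    selected, remaining = [], list(range(len(cand_vecs)))
    rel = cand_vecs @ query_vec                       # 与查询(目标语言样例)的相似度
    while remaining and len(selected) < k:
        if selected:
            red = np.max(cand_vecs[remaining] @ cand_vecs[selected].T, axis=1)  # 与已选样例的最大相似度
        else:
            red = np.zeros(len(remaining))
        scores = lam * rel[remaining] - (1 - lam) * red
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

cands = np.random.randn(50, 128); cands /= np.linalg.norm(cands, axis=1, keepdims=True)
q = np.random.randn(128); q /= np.linalg.norm(q)
print(mmr_select(q, cands, k=5))
```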

Key Takeaways

  1. 跨语言迁移学习用于提高低资源语言的仇恨言论检测性能。
  2. 利用最近邻检索技术增强目标语言的少量标注数据。
  3. 方法通过检索最相关的实例从大型多语言仇恨言论检测池中丰富数据。
  4. 在八种语言上的评估显示该方法优于其他方法。
  5. 方法具有高效性和可扩展性,可适应新语言和任务。
  6. 采用最大边际相关性方法减少冗余检索实例。

Cool Papers

点此查看论文截图

LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models

Authors:Danilo de Oliveira, Julius Richter, Tal Peer, Timo Gerkmann

We present LipDiffuser, a conditional diffusion model for lip-to-speech generation synthesizing natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as a denoiser model. To effectively condition the model, we incorporate visual features using magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 and TCD-TIMIT demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition (ASR). These findings are also supported by a formal listening experiment. Extensive ablation studies and cross-dataset evaluation confirm the effectiveness and generalization capabilities of our approach.

我们提出了LipDiffuser,一种用于唇语到语音生成的条件扩散模型,可直接从无声视频中合成自然且可懂的语音。我们的方法采用幅度保持消融扩散模型(MP-ADM)架构作为去噪模型。为了有效地为模型提供条件信息,我们通过幅度保持的特征级线性调制(MP-FiLM)引入视觉特征,并结合说话人嵌入。随后由神经声码器从生成的梅尔频谱重建语音波形。在LRS3和TCD-TIMIT上的评估表明,LipDiffuser在感知语音质量和说话人相似度上优于现有的唇语到语音基线,同时在下游自动语音识别(ASR)上保持竞争力。一项正式的主观听测实验也支持这些结论。大量消融研究和跨数据集评估证实了我们方法的有效性和泛化能力。

论文及项目相关链接

PDF

Summary

本文介绍了LipDiffuser,一种用于唇语到语音生成的条件扩散模型。它采用MP-ADM架构作为去噪模型,并通过MP-FiLM引入视觉特征和说话人嵌入进行条件化建模,从而直接从无声视频中合成自然可懂的语音。实验结果表明,在LRS3和TCD-TIMIT数据集上,LipDiffuser在感知语音质量和说话人相似度方面优于现有唇语到语音基线,同时在下游自动语音识别(ASR)方面保持竞争力。一项正式的主观听测实验进一步证实了这些发现,广泛的消融研究和跨数据集评估也验证了该方法的有效性和泛化能力。
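
论文的条件注入基于FiLM(特征级线性调制);下面给出普通FiLM层的最简示意(非幅度保持版本,也非官方实现),“条件向量由视觉特征与说话人嵌入拼接而成”是依据摘要所作的假设,维度均为示例值。

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """普通FiLM:由条件向量生成逐通道的缩放gamma与偏移beta,对特征做仿射调制。"""

    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feats, cond):
        # feats: (B, C, T);cond: (B, cond_dim),例如视觉特征与说话人嵌入的拼接
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * feats + beta.unsqueeze(-1)

film = FiLM(cond_dim=512 + 192, num_channels=256)
out = film(torch.randn(2, 256, 100), torch.randn(2, 704))
print(out.shape)  # torch.Size([2, 256, 100])
```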

Key Takeaways

  1. LipDiffuser是一种条件扩散模型,用于从无声视频生成自然可理解的语音。
  2. MP-ADM架构作为去噪模型用于处理核心任务。
  3. MP-FiLM视觉特征和说话人嵌入用于有效地条件化模型。
  4. 神经vocoder从生成的mel-spectrograms重建语音波形。
  5. 在LRS3和TCD-TIMIT数据集上的实验表明,LipDiffuser在感知语音质量和说话人相似性方面优于现有方法。
  6. 正式的主观听测实验证实了这些发现。

Cool Papers

点此查看论文截图

SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation

Authors:Zhaoxi Mu, Xinyu Yang, Gang Wang

While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.

尽管当代语音分离技术能够熟练处理较长的混合音频波形,但它们经常受到真实环境复杂性的挑战,包括噪声和混响等场景,这会导致分离出的语音出现伪影或失真。为了克服这些局限,我们提出了SepALM,一种开创性方法,在初步分离之后利用音频语言模型(ALM)在文本域内对语音进行纠错并重新合成。SepALM包含四个核心组件:分离器、校正器、合成器和对齐器。通过集成基于ALM的端到端纠错机制,我们降低了误差累积的风险,并规避了将自动语音识别(ASR)与大语言模型(LLM)组合的传统方法中常见的优化难题。此外,我们还开发了思维链(CoT)提示和知识蒸馏技术,以辅助ALM的推理和训练过程。实验证实,SepALM不仅提升了语音分离的精度,还显著增强了在全新声学环境中的适应能力。

论文及项目相关链接

PDF Appears in IJCAI 2025

Summary

针对当前语音分离技术在处理复杂现实环境(如噪声和混响环境)时面临的挑战,提出了一种名为SepALM的创新方法。该方法采用音频语言模型(ALM)在文本域内进行修正和重新合成。SepALM包括四个核心组件:分离器、校正器、合成器和对齐器。通过端到端的错误校正机制,减少了误差累积的风险,并克服了传统方法中自动语音识别(ASR)与大型语言模型(LLM)融合的优化难题。实验证明,SepALM不仅提高了语音分离的精度,还显著提高了对新声学环境的适应能力。

Key Takeaways

  1. SepALM是一种针对复杂现实环境设计的创新语音分离方法。
  2. SepALM采用音频语言模型(ALM)在文本域进行修正和重新合成。
  3. SepALM包括分离器、校正器、合成器和对齐器四个核心组件。
  4. 通过端到端的错误校正机制,SepALM减少了误差累积的风险。
  5. SepALM克服了传统方法中ASR与LLM融合的优化难题。
  6. 实验证明SepALM提高了语音分离的精度。

Cool Papers

点此查看论文截图

Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

Authors:Yu-Hang Wu, Yu-Jie Xiong, Hao Zhang, Jia-Chen Zhang, Zheng Zhou

With the increasingly deep integration of large language models (LLMs) across diverse domains, the effectiveness of their safety mechanisms is encountering severe challenges. Currently, jailbreak attacks based on prompt engineering have become a major safety threat. However, existing methods primarily rely on black-box manipulation of prompt templates, resulting in poor interpretability and limited generalization. To break through the bottleneck, this study first introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs’ benign generation: as benign content generation in LLMs increases, the model’s focus on input instructions progressively diminishes. Building on this insight, we propose the Sugar-Coated Poison (SCP) attack paradigm, which uses a “semantic reversal” strategy to craft benign inputs that are opposite in meaning to malicious intent. This strategy induces the models to generate extensive benign content, thereby enabling adversarial reasoning to bypass safety mechanisms. Experiments show that SCP outperforms existing baselines. Remarkably, it achieves an average attack success rate of 87.23% across six LLMs. For defense, we propose Part-of-Speech Defense (POSD), leveraging verb-noun dependencies for syntactic analysis to enhance safety of LLMs while preserving their generalization ability.

随着大型语言模型(LLMs)在各个领域越来越深入的融合,其安全机制的有效性正面临严峻挑战。目前,基于提示工程的越狱攻击已成为主要的安全威胁。然而,现有方法主要依赖于提示模板的黑盒操作,导致解释性差和泛化能力有限。为了突破瓶颈,本研究首先引入防御阈值衰减(DTD)的概念,揭示LLMs良性生成所潜在的安全影响:随着LLMs中良性内容生成的增加,模型对输入指令的关注逐渐减少。基于这一见解,我们提出了“糖衣毒药”(SCP)攻击范式,采用“语义反转”策略来制作与恶意意图相反含义的良性输入。该策略诱导模型生成大量良性内容,从而通过对抗性推理绕过安全机制。实验表明,SCP优于现有基线。值得一提的是,它在六个LLMs上平均攻击成功率达到87.23%。为防御此攻击,我们提出了基于词性的防御(POSD)方法,利用动词-名词依赖进行句法分析来提高LLMs的安全性,同时保留其泛化能力。

论文及项目相关链接

PDF

Summary

大语言模型(LLM)在多领域深度集成中面临安全挑战,出现了基于提示工程的越狱攻击等安全威胁。现有方法主要依赖黑盒操作提示模板,存在解释性差和泛化能力有限的问题。本研究引入防御阈值衰减(DTD)概念,揭示LLM良性生成对安全的影响:随着良性内容生成增加,模型对输入指令的关注逐渐减少。基于此,提出“糖衣毒药”(SCP)攻击范式,采用“语义反转”策略制作与恶意意图相反的良性输入,诱导模型生成大量良性内容,实现对抗推理,绕过安全机制。实验显示,SCP优于现有基线,平均攻击成功率达87.23%,覆盖六种LLM。为防御,提出利用动词-名词依赖进行句法分析的词性防御(POSD),增强LLM安全性同时保持其泛化能力。

Key Takeaways

  1. 大语言模型(LLM)在多个领域集成时面临严重的安全挑战。
  2. 目前存在的攻击方法主要基于提示工程,存在解释性差和泛化能力有限的问题。
  3. 引入防御阈值衰减(DTD)概念,揭示LLM良性生成对安全的影响。
  4. 提出“糖衣毒药”(SCP)攻击策略,通过语义反转绕过LLM的安全机制。
  5. SCP攻击策略在实验中表现优异,平均攻击成功率达87.23%。
  6. 为应对此攻击,提出词性防御(POSD)方法,结合句法分析增强LLM安全性。

Cool Papers

点此查看论文截图

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Authors:Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-yi Lee

Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.

对语音表示模型进行微调可以提升特定任务的性能,但往往会损害其跨任务泛化能力。这种退化通常源于表示的过度改变,使模型难以保留预训练期间学到的信息。现有方法(例如在微调过程中对权重变化进行正则化)可能无法与预训练模型保持足够高的特征相似度,因而仍可能丧失跨任务泛化能力。为解决这一问题,我们提出了Speech-FT,一种新颖的两阶段微调框架,旨在在受益于微调的同时保持跨任务泛化能力。Speech-FT首先进行专门设计的、用于减少表示漂移的微调,随后与预训练模型进行权重空间插值以恢复跨任务泛化能力。在HuBERT、wav2vec 2.0、DeCoAR 2.0和WavLM Base+上的大量实验表明,Speech-FT在各种有监督、无监督和多任务微调场景中都能持续提升性能。此外,与显式约束权重变化的微调基线(如权重空间正则化和LoRA微调)相比,Speech-FT取得了更好的跨任务泛化。我们的分析表明,尽管允许更大的权重空间更新,Speech-FT与预训练模型的特征相似度仍高于其他策略。值得注意的是,Speech-FT在SUPERB基准上取得了显著提升。例如,在用HuBERT微调自动语音识别时,Speech-FT将音素错误率从5.17%降低到3.94%,将词错误率从6.38%降低到5.75%,并将说话人识别准确率从81.86%提高到84.11%。Speech-FT为预训练之后进一步精炼语音表示模型提供了一种简单而强大的解决方案。

论文及项目相关链接

PDF

Summary

本文提出一种名为Speech-FT的新型两阶段微调框架,旨在保持跨任务泛化能力的同时,受益于微调。Speech-FT首先应用专门设计的减少表示漂移的微调,然后通过预训练模型进行权重空间插值以恢复跨任务泛化。实验表明,Speech-FT在多种监督、无监督和多任务微调场景下均能提高性能,并在跨任务泛化方面优于其他微调方法。
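
“与预训练模型做权重空间插值”可以写成对两份state dict的逐参数线性插值;下面是这一步的示意(非官方实现),插值系数λ为假设的超参数。

```python
import torch

def interpolate_state_dicts(pretrained_sd, finetuned_sd, lam=0.5):
    """θ_merged = lam * θ_pretrained + (1 - lam) * θ_finetuned(逐参数线性插值)。"""
    return {name: lam * pretrained_sd[name] + (1.0 - lam) * finetuned_sd[name]
            for name in pretrained_sd}

# 用法示意:model.load_state_dict(interpolate_state_dicts(pre.state_dict(), ft.state_dict(), lam=0.5))
pre = torch.nn.Linear(4, 4); ft = torch.nn.Linear(4, 4)
merged = interpolate_state_dicts(pre.state_dict(), ft.state_dict())
print(merged["weight"].shape)
```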

Key Takeaways

  1. 现有微调方法在提高特定任务性能时,可能会损害模型的跨任务泛化能力。
  2. Speech-FT采用两阶段微调策略,旨在减少表示漂移并恢复跨任务泛化。
  3. Speech-FT通过权重空间插值与预训练模型相结合,保持与预训练模型的高特征相似性。
  4. Speech-FT在多种语音模型及不同任务场景下均表现出优异的性能提升。
  5. Speech-FT在跨任务泛化方面优于其他显式约束权重变化的微调方法。
  6. Speech-FT降低了自动语音识别中的音素错误率和词错误率,并提高了说话人识别准确率。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !