Speech


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them for serious academic purposes; they are only meant as an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-18

GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

Authors:Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin

End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuses speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To the best of our knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and training dataset can be found at https://github.com/NKU-HLT/GLAD.
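
As a rough illustration of the routing idea described above, here is a minimal PyTorch sketch of an MoE layer whose gate fuses a global, utterance-level summary (mean-pooled over frames) with frame-level local features. The layer sizes, number of experts, and pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalMoE(nn.Module):
    """Minimal sketch of a global-local aware MoE layer (not the paper's code).

    The gate sees each frame's local feature concatenated with a global,
    utterance-level summary (mean over time), and mixes expert outputs
    with the resulting per-frame weights.
    """

    def __init__(self, dim: int = 256, num_experts: int = 4, hidden: int = 512):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )
        # Gate input: [local frame feature ; global mean-pooled feature]
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) encoder features
        global_ctx = x.mean(dim=1, keepdim=True).expand_as(x)   # utterance-level context
        weights = F.softmax(self.gate(torch.cat([x, global_ctx], dim=-1)), dim=-1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)

# Example: 2 utterances, 50 frames, 256-dim features
moe = GlobalLocalMoE()
print(moe(torch.randn(2, 50, 256)).shape)  # torch.Size([2, 50, 256])
```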

Paper and Project Links

PDF

Summary

To address the challenges of multi-talker automatic speech recognition (MTASR), this paper proposes the Global-Local Aware Dynamic (GLAD) Mixture-of-Experts. GLAD dynamically fuses speaker-aware global information with fine-grained local features to guide expert selection, using both global context and local acoustic cues for speaker-specific routing. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR methods, especially in challenging multi-talker scenarios.

Key Takeaways

  1. End-to-end multi-talker automatic speech recognition (MTASR) struggles to transcribe overlapping speech accurately.
  2. The Global-Local Aware Dynamic (GLAD) Mixture-of-Experts is proposed to address this challenge.
  3. GLAD fuses speaker-aware global information with fine-grained local features to dynamically guide expert selection.
  4. GLAD leverages global context and local acoustic cues for speaker-specific routing.
  5. Experiments on LibriSpeechMix show GLAD's advantage in challenging multi-talker scenarios.
  6. GLAD is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR, using a global-local fusion strategy.

Cool Papers

Click here to view paper screenshots

Token-based Attractors and Cross-attention in Spoof Diarization

Authors:Kyo-Won Koo, Chan-yeong Lim, Jee-weon Jung, Hye-jin Shim, Ha-Jin Yu

Spoof diarization identifies "what spoofed when" in a given speech signal by temporally locating spoofed regions and determining their manipulation techniques. As a first step toward this task, prior work proposed a two-branch model for localization and spoof type clustering, which laid the foundation for spoof diarization. However, its simple structure limits the ability to capture complex spoofing patterns and lacks explicit reference points for distinguishing between bona fide and various spoofing types. To address these limitations, our approach introduces learnable tokens where each token represents acoustic features of bona fide and spoofed speech. These attractors interact with frame-level embeddings to extract discriminative representations, improving separation between genuine and generated speech. Extensive experiments on the PartialSpoof dataset consistently demonstrate that our approach outperforms existing methods in bona fide detection and spoofing method clustering.
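
To make the attractor mechanism concrete, the following is a minimal PyTorch sketch (not the paper's code) in which one learnable token per class cross-attends over frame-level embeddings, and per-frame logits are computed as frame-attractor similarities. The number of spoof classes, dimensions, and the similarity choice are assumptions.

```python
import torch
import torch.nn as nn

class TokenAttractorHead(nn.Module):
    """Illustrative sketch of learnable class attractors with cross-attention.

    One learnable token per class (bona fide + each spoof type) attends over
    frame-level embeddings, and per-frame logits are the similarity between
    each frame and each refined attractor.
    """

    def __init__(self, dim: int = 256, num_classes: int = 1 + 4, heads: int = 4):
        super().__init__()
        self.attractors = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level embeddings from the encoder
        B = frames.size(0)
        queries = self.attractors.unsqueeze(0).expand(B, -1, -1)   # (B, C, dim)
        refined, _ = self.cross_attn(queries, frames, frames)      # attractors gather acoustic evidence
        logits = torch.einsum("btd,bcd->btc", frames, refined)     # frame-vs-attractor similarity
        return logits  # (batch, time, num_classes)

head = TokenAttractorHead()
print(head(torch.randn(2, 120, 256)).shape)  # torch.Size([2, 120, 5])
```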

Paper and Project Links

PDF Accepted to IEEE ASRU 2025

Summary

This paper advances spoof diarization. To address the limitations of prior models in identifying spoofed speech, it proposes a new approach based on learnable tokens. By introducing these tokens, the method improves the discrimination between bona fide and spoofed speech and achieves better results on the PartialSpoof dataset.

Key Takeaways

  1. Spoof diarization locates the spoofed regions in a given speech signal and determines the manipulation techniques used.
  2. Prior models are limited in capturing complex spoofing patterns and lack explicit reference points for distinguishing bona fide from spoofed speech.
  3. The new method introduces learnable tokens that represent acoustic features of bona fide and spoofed speech.
  4. These tokens interact with frame-level embeddings to extract more discriminative representations.
  5. Experiments on the PartialSpoof dataset show the method outperforms existing approaches, particularly in bona fide detection and spoofing-method clustering.
  6. The improved model structure strengthens spoof detection and the separation between genuine and spoofed speech.

Cool Papers

Click here to view paper screenshots

The CCF AATC 2025: Speech Restoration Challenge

Authors:Junan Zhang, Mengyao Zhu, Xin Xu, Hui Bu, Zhenhua Ling, Zhizheng Wu

Real-world speech communication is often hampered by a variety of distortions that degrade quality and intelligibility. While many speech enhancement algorithms target specific degradations like noise or reverberation, they often fall short in realistic scenarios where multiple distortions co-exist and interact. To spur research in this area, we introduce the Speech Restoration Challenge as part of the China Computer Federation (CCF) Advanced Audio Technology Competition (AATC) 2025. This challenge focuses on restoring speech signals affected by a composite of three degradation types: (1) complex acoustic degradations including non-stationary noise and reverberation; (2) signal-chain artifacts such as those from MP3 compression; and (3) secondary artifacts introduced by other pre-processing enhancement models. We describe the challenge’s background, the design of the task, the comprehensive dataset creation methodology, and the detailed evaluation protocol, which assesses both objective performance and model complexity. Homepage: https://ccf-aatc.org.cn/.
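
For intuition about the first degradation category, here is a small numpy/scipy sketch that composes reverberation (convolution with a room impulse response) and additive noise at a target SNR. MP3 compression and secondary enhancement artifacts are omitted because they require external codecs or models, and all signals below are synthetic placeholders rather than challenge data.

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(clean: np.ndarray, rir: np.ndarray, noise: np.ndarray,
            snr_db: float = 5.0) -> np.ndarray:
    """Apply reverberation (RIR convolution) and additive noise at a target SNR.

    Only covers the challenge's first degradation category; MP3 and
    secondary enhancement artifacts would require external tools.
    """
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    noise = noise[: len(reverberant)]
    # Scale noise to the requested SNR relative to the reverberant speech
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

# Toy example with synthetic signals (16 kHz, 1 second)
sr = 16000
clean = np.random.randn(sr) * 0.1
rir = np.exp(-np.linspace(0, 8, sr // 4)) * np.random.randn(sr // 4)
noise = np.random.randn(sr)
noisy = degrade(clean, rir, noise, snr_db=5.0)
```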

Paper and Project Links

PDF Technical Report

Summary

This paper introduces the Speech Restoration Challenge of the China Computer Federation (CCF) Advanced Audio Technology Competition (AATC) 2025. The challenge targets realistic speech communication where multiple distortions co-exist, covering complex acoustic degradations, signal-chain artifacts, and secondary artifacts introduced by other pre-processing enhancement models. The paper describes the challenge background, task design, dataset creation methodology, and detailed evaluation protocol.

Key Takeaways

  1. The Speech Restoration Challenge addresses the co-existence of multiple distortions in real-world speech communication.
  2. The challenge covers three degradation types: complex acoustic degradations, signal-chain artifacts, and secondary artifacts.
  3. The challenge is part of the CCF Advanced Audio Technology Competition (AATC) 2025.
  4. A comprehensive dataset creation methodology covers the various degradation conditions.
  5. The evaluation protocol assesses both objective performance and model complexity.
  6. The competition homepage is https://ccf-aatc.org.cn/.
  7. The challenge aims to spur research in this area.

Cool Papers

Click here to view paper screenshots

PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition

Authors:Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu, Xiaodong He

This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
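
As a rough illustration of the interleaved grapheme-phoneme context with grapheme-only distractors, here is a small Python sketch that builds such a biasing context string. The lexicon, word list, and prompt format are hypothetical and not taken from the paper.

```python
import random

def build_context(bias_words, lexicon, num_distractors=2, seed=0):
    """Sketch of an interleaved grapheme-phoneme biasing context.

    Target hotwords are rendered as "word (phonemes)" pairs, while distractor
    words are inserted as graphemes only, nudging the model to rely on the
    phonemic cues when they are present. The lexicon and words below are
    illustrative, not from the paper.
    """
    rng = random.Random(seed)
    distractor_pool = [w for w in lexicon if w not in bias_words]
    entries = [f"{w} ({' '.join(lexicon[w])})" for w in bias_words]
    entries += rng.sample(sorted(distractor_pool), k=min(num_distractors, len(distractor_pool)))
    rng.shuffle(entries)
    return "Context words: " + "; ".join(entries)

lexicon = {
    "knight": ["N", "AY", "T"],
    "night": ["N", "AY", "T"],
    "gnome": ["N", "OW", "M"],
    "kitten": ["K", "IH", "T", "AH", "N"],
}
print(build_context(["knight"], lexicon))
```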

Paper and Project Links

PDF Submitted to ICASSP 2026

Summary

This paper proposes a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in LLM-based ASR: effective pronunciation modeling and robust homophone discrimination. The framework adopts a two-stage learning paradigm. It first introduces a pronunciation-guided context learning method that uses an interleaved grapheme-phoneme context modeling strategy with grapheme-only distractors, encouraging the model to exploit phonemic cues for accurate recognition. It then proposes a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further strengthen the model's ability to distinguish contextualized homophones. Experiments on the public English Librispeech and Mandarin AISHELL-1 datasets show that PAC reduces relative word error rate (WER) by 30.2% and 53.8% compared with pre-trained LLM-based ASR models, and lowers biased WER for long-tail words by 31.8% and 60.5% relative to strong baselines.

Key Takeaways

  1. The PAC framework targets effective pronunciation modeling and homophone discrimination in LLM-based ASR.
  2. It adopts a two-stage learning paradigm: pronunciation-guided context learning followed by pronunciation-discriminative reinforcement learning.
  3. The context learning stage uses interleaved grapheme-phoneme context modeling with grapheme-only distractors, encouraging the model to rely on phonemic cues.
  4. The reinforcement learning stage uses perturbed label sampling to further improve discrimination of contextualized homophones.
  5. Experiments on Librispeech and AISHELL-1 show that PAC substantially reduces WER compared with the baselines.
  6. PAC markedly improves recognition of long-tail words.

Cool Papers

Click here to view paper screenshots

High-Energy Concentration for Federated Learning in Frequency Domain

Authors:Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Leyuan Fang

Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy like high-frequency components usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.
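
To illustrate the high-energy-concentration idea, the following numpy/scipy sketch applies a 2-D DCT, keeps only a low-frequency block with a binary mask, and inverts the transform. The block size is an arbitrary illustrative choice, and the sketch omits the federated training and frequency-domain distribution alignment described in the abstract.

```python
import numpy as np
from scipy.fft import dctn, idctn

def low_frequency_filter(image: np.ndarray, keep: int = 8) -> np.ndarray:
    """Sketch of DCT-based low-frequency (high-energy) filtering.

    A 2-D DCT concentrates most energy in the top-left (low-frequency)
    coefficients; a binary mask keeps a keep x keep block and zeroes the
    rest before inverting. The block size is illustrative, not the
    paper's setting.
    """
    coeffs = dctn(image, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return idctn(coeffs * mask, norm="ortho")

img = np.random.rand(32, 32)
compact = low_frequency_filter(img, keep=8)
kept_ratio = (8 * 8) / (32 * 32)
print(f"coefficients kept: {kept_ratio:.1%}")  # 6.2% of coefficients retained
```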

Paper and Project Links

PDF

Summary

This paper proposes a frequency-domain aware federated learning method (FedFD) to reduce the communication burden caused by redundant information and noise in federated learning. FedFD preserves the low-frequency components and filters out the redundant high-frequency ones, lowering communication cost while improving performance. Experiments on image and speech datasets show that FedFD outperforms state-of-the-art methods; for example, on CIFAR-10 with Dirichlet coefficient α = 0.01, it reduces communication cost by at least 37.78% while gaining 10.88% in performance.

Key Takeaways

  • Federated learning (FL) with synthetic data (dataset distillation) protects real data privacy and mitigates data heterogeneity.
  • Existing spatial-domain designs carry redundant information and noise, which inflates the communication burden.
  • FedFD preserves low-frequency components and filters high-frequency ones to optimize communication cost and performance.
  • FedFD exploits the high-energy concentration of the discrete cosine transform and optimizes via frequency-domain distribution alignment.
  • A real data-driven synthetic classification term in the loss improves the quality of the low-frequency components.
  • Experiments on multiple image and speech datasets show that FedFD outperforms existing methods.

Cool Papers

Click here to view paper screenshots

Multi-Modal Embedding-based Target Speaker Enhancement

Authors:Zhan Jin

Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This paper presents a comprehensive study on the interactions and robustness of various multimodal fusion strategies under varying degrees of modality dropout. We build upon a state-of-the-art audio-visual speech enhancement system and integrate four distinct speaker identity cues: lip embeddings for synchronized contextual information, a voice speaker embedding extracted via cross-attention for acoustic consistency, a static face embedding for speaker identity, and a novel dynamic expression embedding for frame-wise emotional features. We systematically evaluate different combinations of these modalities under two key training regimes: zero dropout and 80% modality dropout. Extensive experiments demonstrate that while a full multimodal ensemble achieves optimal performance under ideal (zero dropout) conditions, its effectiveness diminishes significantly when test-time dropout occurs without prior exposure during training. Crucially, we show that training with a high (80%) modality dropout rate dramatically enhances model robustness, enabling the system to maintain superior performance even under severe test-time missing modalities. Our findings highlight that voice embeddings exhibit consistent robustness, while the proposed expression embedding provides valuable complementary information. This work underscores the importance of training strategies that account for real-world imperfection, moving beyond pure performance maximization to achieve practical reliability in multimodal speech enhancement systems.
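
A minimal PyTorch sketch of training-time modality dropout is shown below: each auxiliary embedding is independently zeroed with probability 0.8, matching the dropout rate mentioned in the abstract. The modality shapes and the choice to always keep the mixture audio are assumptions.

```python
import torch

def modality_dropout(embeddings: dict, p_drop: float = 0.8,
                     always_keep: str = "audio") -> dict:
    """Sketch of training-time modality dropout for multimodal TSE.

    Each auxiliary embedding (lip, voice, face, expression) is independently
    zeroed with probability p_drop, so the model learns to cope with missing
    cues at test time. The 80% rate and modality names follow the abstract;
    shapes are illustrative.
    """
    out = {}
    for name, emb in embeddings.items():
        if name != always_keep and torch.rand(1).item() < p_drop:
            out[name] = torch.zeros_like(emb)
        else:
            out[name] = emb
    return out

batch = {
    "audio": torch.randn(4, 100, 256),      # mixture features (never dropped)
    "lip": torch.randn(4, 100, 128),
    "voice": torch.randn(4, 192),
    "face": torch.randn(4, 512),
    "expression": torch.randn(4, 100, 64),
}
print({k: bool(v.abs().sum() > 0) for k, v in modality_dropout(batch).items()})
```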

Paper and Project Links

PDF

Summary

This paper studies the interactions and robustness of multimodal fusion strategies for target speaker extraction (TSE) under intermittent modality dropout. Building on an audio-visual speech enhancement system, it integrates four speaker identity cues and evaluates them under two training regimes: zero dropout and 80% modality dropout. Experiments show that the full multimodal ensemble performs best under ideal conditions, but its effectiveness drops sharply when dropout occurs at test time without prior exposure during training. Training with a high modality dropout rate greatly improves robustness, keeping performance high even when several modalities are missing at test time. The work highlights the importance of training strategies that account for real-world imperfections.

Key Takeaways

  1. Target speaker extraction (TSE) with multimodal fusion suffers from intermittent modality dropout in practice.
  2. The study uses four speaker identity cues: lip embeddings, a voice speaker embedding, a static face embedding, and a dynamic expression embedding.
  3. Models are evaluated under two training regimes: zero dropout and 80% modality dropout.
  4. The full multimodal ensemble performs best under ideal conditions, but adding modalities alone does not help when unseen dropout occurs at test time.
  5. Training with a high modality dropout rate improves robustness, maintaining strong performance even with multiple missing modalities.
  6. Voice embeddings are consistently robust, while the dynamic expression embedding provides valuable complementary information.

Cool Papers

Click here to view paper screenshots

FunAudio-ASR Technical Report

Authors:Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou

In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

Paper and Project Links

PDF

Summary

In recent years, ASR has advanced through data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which degrades user experience in real-world ASR applications. This report presents FunAudio-ASR, which combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. FunAudio-ASR is further optimized for practical deployment, with improved streaming, noise robustness, code-switching, hotword customization, and other real-world capabilities. Experiments show that while most LLM-based ASR systems perform well on open-source benchmarks, they often underperform on real industry evaluation sets; thanks to its production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practice.

Key Takeaways

  1. ASR has advanced markedly through data scaling, model scaling, and integration with large language models (LLMs).
  2. LLMs are prone to hallucination, which hurts user experience in real-world ASR applications.
  3. FunAudio-ASR combines massive data, large model capacity, LLM integration, and reinforcement learning to reach state-of-the-art performance.
  4. FunAudio-ASR is optimized for practical deployment, including streaming, noise robustness, code-switching, and hotword customization.
  5. Most LLM-based ASR systems show a gap between open-source benchmarks and real industry evaluation sets.
  6. FunAudio-ASR achieves the best performance on real application datasets, demonstrating its effectiveness in practice.

Cool Papers

Click here to view paper screenshots

More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition

Authors:James Tavernor, Emily Mower Provost

Speech emotion recognition systems often predict a consensus value generated from the ratings of multiple annotators. However, these models have limited ability to predict the annotation of any one person. Alternatively, models can learn to predict the annotations of all annotators. Adapting such models to new annotators is difficult as new annotators must individually provide sufficient labeled training data. We propose to leverage inter-annotator similarity by using a model pre-trained on a large annotator population to identify a similar, previously seen annotator. Given a new, previously unseen, annotator and limited enrollment data, we can make predictions for a similar annotator, enabling off-the-shelf annotation of unseen data in target datasets, providing a mechanism for extremely low-cost personalization. We demonstrate our approach significantly outperforms other off-the-shelf approaches, paving the way for lightweight emotion adaptation, practical for real-world deployment.
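
As one plausible reading of the enrollment step, here is a small numpy sketch that picks the previously seen annotator whose output-head predictions best match a new annotator's few enrollment labels (lowest mean squared error). The MSE criterion and the toy numbers are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def select_similar_annotator(head_predictions: np.ndarray,
                             enrollment_labels: np.ndarray) -> int:
    """Pick a previously seen annotator for a new, unseen one (sketch).

    head_predictions: (num_seen_annotators, num_enroll_utts) predictions of
        each seen annotator's output head on the new annotator's enrollment
        utterances.
    enrollment_labels: (num_enroll_utts,) the new annotator's own ratings.
    Returns the index of the seen annotator with the lowest mean squared
    error against the enrollment labels; that head is then used off the
    shelf for the new annotator.
    """
    mse = np.mean((head_predictions - enrollment_labels[None, :]) ** 2, axis=1)
    return int(np.argmin(mse))

preds = np.array([[3.0, 4.5, 2.0],   # annotator A's head
                  [2.5, 4.0, 1.5],   # annotator B's head
                  [4.0, 5.0, 3.0]])  # annotator C's head
new_labels = np.array([2.4, 4.1, 1.6])
print(select_similar_annotator(preds, new_labels))  # 1 -> annotator B
```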

Paper and Project Links

PDF \copyright 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Summary

This paper proposes leveraging a model pre-trained on a large annotator population to identify a similar, previously seen annotator and thereby predict the annotations of a new annotator. By exploiting inter-annotator similarity, predictions can be made for unseen data in target datasets from only limited enrollment data, providing an extremely low-cost personalization mechanism. The approach significantly outperforms other off-the-shelf methods and is practical for real-world deployment.

Key Takeaways

  1. Speech emotion recognition systems usually predict a consensus value derived from the ratings of multiple annotators.
  2. Such models have limited ability to predict the annotation of any single annotator.
  3. Models can instead learn to predict all annotators' labels, but adapting them to new annotators is difficult.
  4. The paper proposes using a model pre-trained on a large annotator population to identify a similar, previously seen annotator.
  5. Given limited enrollment data from a new annotator, predictions can be made via the similar annotator.
  6. The approach significantly outperforms other off-the-shelf methods, enabling extremely low-cost personalization of emotion adaptation.

Cool Papers

Click here to view paper screenshots

TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

Authors:Minh N. H. Nguyen, Anh Nguyen Tran, Dung Truong Dinh, Nam Van Vo

Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The challenge is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 20.8% with reduced training resources. Furthermore, the phonetic-based two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex CS Vietnamese-English ASR scenarios.
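
The following PyTorch skeleton sketches the two-stage phoneme-centric idea: an acoustic stage produces phoneme logits over an extended phoneme inventory, and a second stage converts the phoneme sequence to text. The layer types, vocabulary sizes, and greedy decoding are placeholders, not the TSPC architecture itself.

```python
import torch
import torch.nn as nn

class TwoStagePhonemeASR(nn.Module):
    """Skeleton of a two-stage phoneme-centric CS ASR pipeline (illustrative).

    Stage 1 maps acoustic features to an extended Vietnamese phoneme
    inventory (CTC-style frame logits); stage 2 converts the phoneme
    sequence to mixed Vietnamese-English text. Sizes are placeholders.
    """

    def __init__(self, feat_dim=80, phoneme_vocab=120, text_vocab=8000, dim=256):
        super().__init__()
        self.acoustic = nn.GRU(feat_dim, dim, num_layers=2, batch_first=True)
        self.to_phonemes = nn.Linear(dim, phoneme_vocab)      # stage 1: acoustics -> phonemes
        self.phoneme_emb = nn.Embedding(phoneme_vocab, dim)
        self.linguistic = nn.GRU(dim, dim, batch_first=True)
        self.to_text = nn.Linear(dim, text_vocab)             # stage 2: phonemes -> words

    def forward(self, feats: torch.Tensor):
        enc, _ = self.acoustic(feats)                  # (B, T, dim)
        phoneme_logits = self.to_phonemes(enc)         # train with CTC against phoneme targets
        phoneme_ids = phoneme_logits.argmax(dim=-1)    # greedy phonemes (sketch only)
        dec, _ = self.linguistic(self.phoneme_emb(phoneme_ids))
        return phoneme_logits, self.to_text(dec)

model = TwoStagePhonemeASR()
ph, txt = model(torch.randn(2, 200, 80))
print(ph.shape, txt.shape)  # torch.Size([2, 200, 120]) torch.Size([2, 200, 8000])
```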

Paper and Project Links

PDF I need to withdraw the paper as there is something wrong

Summary

This paper proposes a Two-Stage Phoneme-Centric model (TSPC) for Vietnamese-English code-switching ASR. The model adopts a phoneme-centric approach, built on an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experiments show that TSPC consistently outperforms existing baselines, including PhoWhisper-base, achieving a significantly lower word error rate of 20.8% with reduced training resources. The phoneme-based two-stage architecture also enables phoneme adaptation and language conversion, improving recognition in complex Vietnamese-English code-switching scenarios.

Key Takeaways

  1. Code-switching (CS) challenges general ASR systems, especially for Vietnamese-English, where distinct phonological features and ambiguity from similar-sounding words are both present.
  2. A new architecture for Vietnamese-English CS ASR, the Two-Stage Phoneme-Centric model (TSPC), is proposed.
  3. TSPC uses an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling.
  4. Experiments show TSPC consistently outperforms existing baselines, including PhoWhisper-base, on Vietnamese-English CS ASR.
  5. TSPC achieves a significantly lower word error rate of 20.8% while using fewer training resources.
  6. The phoneme-based two-stage architecture enables phoneme adaptation and language conversion for complex CS scenarios.

Cool Papers

Click here to view paper screenshots

Quality Assessment of Noisy and Enhanced Speech with Limited Data: UWB-NTIS System for VoiceMOS 2024

Authors:Marie Kunešová, Aleš Pražák, Jan Lehečka

We present a system for non-intrusive prediction of speech quality in noisy and enhanced speech, developed for Track 3 of the VoiceMOS 2024 Challenge. The task required estimating the ITU-T P.835 metrics SIG, BAK, and OVRL without reference signals and with only 100 subjectively labeled utterances for training. Our approach uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. The system achieved the best performance on BAK prediction (LCC=0.867) and a very close second place in OVRL (LCC=0.711) in the official evaluation. Post-challenge experiments show that adding artificially degraded data to the first fine-tuning stage substantially improves SIG prediction, raising correlation with ground truth scores from 0.207 to 0.516. These results demonstrate that transfer learning with targeted data generation is effective for predicting P.835 scores under severe data constraints.
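
A minimal sketch of the kind of model described above: a pre-trained wav2vec 2.0 encoder (via Hugging Face transformers) with a small regression head predicting SIG, BAK, and OVRL. The checkpoint name, pooling, and head size are illustrative assumptions, and the two-stage fine-tuning itself is not shown.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class P835Predictor(nn.Module):
    """Sketch of a wav2vec 2.0 regressor for SIG/BAK/OVRL (not the UWB-NTIS code).

    A pre-trained wav2vec 2.0 encoder is mean-pooled over time and followed by
    a small regression head producing the three P.835 scores. The checkpoint
    and head size are illustrative; the paper's two fine-tuning stages would
    reuse this module with different training data.
    """

    def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, 128),
            nn.ReLU(),
            nn.Linear(128, 3),  # SIG, BAK, OVRL
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        hidden = self.encoder(waveform).last_hidden_state   # (B, frames, hidden)
        return self.head(hidden.mean(dim=1))                # (B, 3)

model = P835Predictor()
scores = model(torch.randn(1, 16000))   # one second of audio
print(scores.shape)  # torch.Size([1, 3])
```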

Paper and Project Links

PDF Submitted to ICASSP 2026

Summary

This paper presents a system for non-intrusive prediction of speech quality in noisy and enhanced speech, developed for Track 3 of the VoiceMOS 2024 Challenge. The task requires estimating the ITU-T P.835 metrics SIG, BAK, and OVRL without reference signals and with only 100 subjectively labeled utterances for training. The approach uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, followed by adaptation to the challenge data. In the official evaluation, the system achieved the best BAK prediction (LCC=0.867) and a very close second place on OVRL (LCC=0.711). Post-challenge experiments show that adding artificially degraded data to the first fine-tuning stage substantially improves SIG prediction, raising correlation with ground-truth scores from 0.207 to 0.516. This demonstrates that transfer learning with targeted data generation is effective under severe data constraints.

Key Takeaways

  1. The system performs non-intrusive prediction of noisy and enhanced speech quality for Track 3 of the VoiceMOS 2024 Challenge.
  2. It estimates the ITU-T P.835 metrics SIG, BAK, and OVRL from only 100 subjectively labeled utterances.
  3. It uses wav2vec 2.0 with a two-stage transfer learning strategy: initial fine-tuning on automatically labeled noisy data, then adaptation to the challenge data.
  4. In the official evaluation, the system achieved the best BAK prediction (LCC=0.867) and second place on OVRL (LCC=0.711).
  5. Adding artificially degraded data to the first fine-tuning stage significantly improves SIG prediction.
  6. The study demonstrates the effectiveness of transfer learning under severe data constraints.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!