⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only meant as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-19
CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Authors:Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, Carlos Carvalho, Karen Rosero
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
Paper and project links
Summary
CS-FLEURS is a new dataset for developing and evaluating code-switched speech recognition and translation systems. It consists of four test sets covering 113 unique code-switched language pairs across 52 languages. The dataset supports code-switching scenarios in many languages and also provides a training set, with the aim of broadening future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
Key Takeaways
- CS-FLEURS is a new dataset for developing and evaluating code-switched speech recognition and translation systems.
- It comprises four test sets covering 113 unique code-switched language pairs across 52 languages.
- The dataset supports code-switching applications across a variety of scenarios.
- It also provides a training set with 128 hours of generative text-to-speech data.
- CS-FLEURS offers broad language coverage, including pairings of English with many other languages.
- The dataset is intended to broaden the scope of future code-switched speech research.
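For readers who want to inspect the data, below is a minimal loading sketch using the Hugging Face datasets library. The configuration name and field names are illustrative assumptions; the dataset card at the link above lists the actual ones.

```python
# Minimal sketch: pull one CS-FLEURS test split from the Hugging Face Hub.
# The config name "hi_en" and the field names are hypothetical; consult the
# dataset card at https://huggingface.co/datasets/byan/cs-fleurs for the
# actual configuration and split names.
from datasets import load_dataset

ds = load_dataset("byan/cs-fleurs", "hi_en", split="test")  # hypothetical config
sample = ds[0]
print(sample.keys())  # inspect the available fields (audio, transcript, etc.)
```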

Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Authors:Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nick Karpov, Jagadeesh Balam, Boris Ginsburg
This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages, primarily European. The model was trained on 1.7M hours of total data samples, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.
Paper and project links
PDF Mini Version of it Submitted to ICASSP 2026
Summary
This report introduces Canary-1B-v2, a fast, robust multilingual model for automatic speech recognition (ASR) and speech-to-text translation (AST). The model uses a FastConformer encoder and a Transformer decoder and supports 25 languages, primarily European. Trained on large-scale data with two-stage pre-training and fine-tuning, it delivers strong performance. Canary-1B-v2 uses the NeMo Forced Aligner (NFA) to provide reliable segment-level timestamps, and it outperforms Whisper-large-v3 on English ASR while running faster. The report also releases Parakeet-TDT-0.6B-v3, a model with a much smaller parameter count that still provides multilingual ASR.
Key Takeaways
- Canary-1B-v2 is a fast, robust multilingual model for automatic speech recognition (ASR) and speech-to-text translation (AST).
- The model uses a FastConformer encoder and a Transformer decoder and supports 25 languages, primarily European.
- It is trained with two-stage pre-training and fine-tuning, using dynamic data balancing to improve performance.
- The NeMo Forced Aligner (NFA) provides reliable timestamps.
- On English ASR, Canary-1B-v2 outperforms Whisper-large-v3 while running faster.
- The approach scales to massive data, and FastConformer excels after fine-tuning.
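As an orientation for readers, here is a hedged sketch of how such a checkpoint is typically loaded and run with the open-source NeMo toolkit; the model identifier, the transcribe() call, and the returned format are assumptions based on NeMo conventions rather than details from this report.

```python
# Hedged sketch: load a NeMo ASR/AST checkpoint and transcribe a single file.
# "nvidia/canary-1b-v2" and the local file name are assumed identifiers; check
# the official model card for the exact model name and transcribe() options.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2")
hypotheses = model.transcribe(["sample_16khz_mono.wav"])  # hypothetical audio file
print(hypotheses[0])
```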

A Lightweight Fourier-based Network for Binaural Speech Enhancement with Spatial Cue Preservation
Authors:Xikun Lu, Yujian Ma, Xianquan Jiang, Xuelong Wang, Jinqiu Sang
Binaural speech enhancement faces a severe trade-off challenge, where state-of-the-art performance is achieved by computationally intensive architectures, while lightweight solutions often come at the cost of significant performance degradation. To bridge this gap, we propose the Global Adaptive Fourier Network (GAF-Net), a lightweight deep complex network that aims to establish a balance between performance and computational efficiency. The GAF-Net architecture consists of three components. First, a dual-feature encoder combining short-time Fourier transform and gammatone features enhances the robustness of acoustic representation. Second, a channel-independent globally adaptive Fourier modulator efficiently captures long-term temporal dependencies while preserving the spatial cues. Finally, a dynamic gating mechanism is implemented to reduce processing artifacts. Experimental results show that GAF-Net achieves competitive performance, particularly in terms of binaural cues (ILD and IPD error) and objective intelligibility (MBSTOI), with fewer parameters and computational cost. These results confirm that GAF-Net provides a feasible way to achieve high-fidelity binaural processing on resource-constrained devices.
Paper and project links
PDF Submitted to ICASSP 2026
Summary
GAF-Net is a lightweight deep complex network built around a globally adaptive Fourier architecture, designed to balance performance and computational efficiency. It combines a dual-feature encoder, a globally adaptive Fourier modulator, and a dynamic gating mechanism to deliver strong yet lightweight speech enhancement. Experiments show that GAF-Net is competitive in terms of binaural cues, objective intelligibility, and parameter and computation cost.
Key Takeaways
- GAF-Net is a lightweight deep complex network for speech enhancement.
- The network combines a dual-feature encoder, a globally adaptive Fourier modulator, and a dynamic gating mechanism.
- GAF-Net aims to balance high performance with computational efficiency.
- Experiments show that GAF-Net delivers competitive speech enhancement performance.
- GAF-Net offers a feasible route to high-fidelity binaural processing on resource-constrained devices.
- GAF-Net is particularly strong on binaural cues (ILD and IPD error) and objective intelligibility (MBSTOI).
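Since the evaluation centers on binaural cues, the generic sketch below shows how ILD and IPD are commonly computed from the left/right STFTs; this is standard signal-processing code, not the paper's implementation, and the frame parameters are arbitrary.

```python
# Generic sketch of interaural cues: level difference (ILD, dB) and phase
# difference (IPD, radians) per time-frequency bin of a binaural signal.
import numpy as np
from scipy.signal import stft

def binaural_cues(left, right, fs=16000, nperseg=512):
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    eps = 1e-8
    ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
    ipd = np.angle(L * np.conj(R))
    return ild, ipd

# ILD/IPD "error" is then typically the mean absolute deviation of these cues
# between the enhanced output and the clean binaural reference.
rng = np.random.default_rng(0)
ild, ipd = binaural_cues(rng.standard_normal(16000), rng.standard_normal(16000))
print(ild.shape, ipd.shape)
```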

Mixture of Low-Rank Adapter Experts in Generalizable Audio Deepfake Detection
Authors:Janne Laakkonen, Ivan Kukanov, Ville Hautamäki
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, we propose a mixture-of-LoRA-experts approach that integrates multiple low-rank adapters (LoRA) into the model’s attention layers. A routing mechanism selectively activates specialized experts, enhancing adaptability to evolving deepfake attacks. Experimental results show that our method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. Notably, our best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness in achieving generalizable audio deepfake detection.
Paper and project links
PDF 6 pages, 3 figures, 1 table
Summary
Foundation models such as Wav2Vec2 excel at representation learning for speech tasks, especially audio deepfake detection. However, after fine-tuning on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, a mixture-of-LoRA-experts approach integrates multiple low-rank adapters (LoRA) into the model's attention layers, and a routing mechanism selectively activates specialized experts, improving adaptability to evolving deepfake attacks. Experiments show the method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. In particular, the best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness for generalizable audio deepfake detection.
Key Takeaways
- Models such as Wav2Vec2 perform strongly at speech representation learning, especially for audio deepfake detection.
- After fine-tuning on a fixed audio set, existing models struggle to adapt to novel deepfake methods.
- The proposed mixture-of-LoRA-experts approach improves adaptability by integrating multiple low-rank adapters.
- A routing mechanism selectively activates specific experts to counter evolving deepfake attacks.
- Experiments show the method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios.
- The MoE-LoRA model lowers the average out-of-domain equal error rate, demonstrating its effectiveness.
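To make the idea concrete, here is a hedged PyTorch sketch of a mixture-of-LoRA-experts layer: several low-rank adapters attached to one frozen linear projection, mixed by a learned router. The expert count, rank, and dimensions are illustrative and not the paper's configuration.

```python
# Hedged sketch of a mixture-of-LoRA-experts wrapper around a frozen linear layer.
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, n_experts: int = 4, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained projection frozen
        d_in, d_out = base.in_features, base.out_features
        self.down = nn.ModuleList(nn.Linear(d_in, rank, bias=False) for _ in range(n_experts))
        self.up = nn.ModuleList(nn.Linear(rank, d_out, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(d_in, n_experts)  # soft routing over experts
        self.scale = alpha / rank

    def forward(self, x):  # x: (batch, time, d_in)
        gates = torch.softmax(self.router(x), dim=-1)                    # (B, T, E)
        experts = [up(down(x)) for down, up in zip(self.down, self.up)]  # E tensors of (B, T, d_out)
        delta = torch.stack(experts, dim=-1)                             # (B, T, d_out, E)
        delta = (delta * gates.unsqueeze(-2)).sum(dim=-1)                # weighted expert mix
        return self.base(x) + self.scale * delta

layer = MoELoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 50, 768)).shape)  # torch.Size([2, 50, 768])
```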

Conducting Mission-Critical Voice Experiments with Automated Speech Recognition and Crowdsourcing
Authors:Jan Janak, Kahlil Dozier, Lauren Berny, Liang Hu, Dan Rubenstein, Charles Jennings, Henning Schulzrinne
Mission-critical voice (MCV) communications systems have been a critical tool for the public safety community for over eight decades. Public safety users expect MCV systems to operate reliably and consistently, particularly in challenging conditions. Because of these expectations, the Public Safety Communications Research (PSCR) Division of the National Institute of Standards and Technology (NIST) has been interested in correlating impairments in MCV communication systems and public safety user quality of experience (QoE). Previous research has studied MCV voice quality and intelligibility in a controlled environment. However, such research has been limited by the challenges inherent in emulating real-world environmental conditions. Additionally, there is the question of the best metric to use to reflect QoE accurately. This paper describes our efforts to develop the methodology and tools for human-subject experiments with MCV. We illustrate their use in human-subject experiments in emulated real-world environments. The tools include a testbed for emulating real-world MCV systems and an automated speech recognition (ASR) robot approximating human subjects in transcription tasks. We evaluate QoE through a Levenshtein Distance-based metric, arguing it is a suitable proxy for measuring comprehension and the QoE. We conducted human-subject studies with Amazon MTurk volunteers to understand the influence of selected system parameters and impairments on human subject performance and end-user QoE. We also compare the performance of several ASR system configurations with human-subject performance. We find that humans generally perform better than ASR in accuracy-related MCV tasks and that the codec significantly influences the end-user QoE and ASR performance.
Paper and project links
Summary
This paper addresses mission-critical voice (MCV) communication systems, which are vital to public safety. To understand how impairments in MCV systems affect public safety users' quality of experience (QoE), the Public Safety Communications Research Division of NIST has been investigating the correlation between MCV impairments and QoE. Earlier experiments studied MCV voice quality and intelligibility in controlled environments, but emulating real-world conditions remains challenging, and the best metric for QoE is still an open question. This paper describes the methodology and tools developed for human-subject MCV experiments, including a testbed that emulates real-world MCV systems and an automated speech recognition (ASR) robot that approximates human subjects in transcription tasks. QoE is evaluated with a Levenshtein-distance-based metric, argued to be a suitable proxy for comprehension and QoE. Human-subject studies with Amazon MTurk volunteers examine how selected system parameters and impairments affect human performance and end-user QoE, and several ASR configurations are compared with human performance. Humans generally outperform ASR on accuracy-related MCV tasks, and the codec significantly influences end-user QoE and ASR performance.
Key Takeaways
- MCV systems are a critical tool for the public safety community, where reliability and consistency are essential.
- NIST is interested in the correlation between MCV system impairments and public safety users' quality of experience (QoE).
- Previous research was conducted mainly in controlled environments, which are difficult to match to real-world conditions.
- A Levenshtein-distance-based metric is chosen to evaluate QoE and serves as an effective proxy for comprehension and user experience.
- Human-subject experiments show that humans outperform ASR systems on accuracy-related MCV tasks.
- System parameters and impairments affect human performance and end-user QoE.
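Because the QoE proxy is a Levenshtein-distance-based metric, the sketch below shows the underlying edit-distance computation on word tokens; the word-level normalization by reference length is an assumption for illustration, and the paper defines its own exact formulation.

```python
# Minimal sketch of a Levenshtein-distance metric between a reference transcript
# and a (human or ASR) transcription, normalized by reference length.
def levenshtein(a, b):
    """Dynamic-programming edit distance between two token sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

ref = "send backup to the north entrance".split()   # invented example transcripts
hyp = "send back up to north entrance".split()
print(f"normalized edit distance: {levenshtein(ref, hyp) / len(ref):.2f}")
```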

Invisible Ears at Your Fingertips: Acoustic Eavesdropping via Mouse Sensors
Authors:Mohamad Fakih, Rahul Dharmaji, Youssef Mahmoud, Halima Bouzidi, Mohammad Abdullah Al Faruque
Modern optical mouse sensors, with their advanced precision and high responsiveness, possess an often overlooked vulnerability: they can be exploited for side-channel attacks. This paper introduces Mic-E-Mouse, the first-ever side-channel attack that targets high-performance optical mouse sensors to covertly eavesdrop on users. We demonstrate that audio signals can induce subtle surface vibrations detectable by a mouse’s optical sensor. Remarkably, user-space software on popular operating systems can collect and broadcast this sensitive side channel, granting attackers access to raw mouse data without requiring direct system-level permissions. Initially, the vibration signals extracted from mouse data are of poor quality due to non-uniform sampling, a non-linear frequency response, and significant quantization. To overcome these limitations, Mic-E-Mouse employs a sophisticated end-to-end data filtering pipeline that combines Wiener filtering, resampling corrections, and an innovative encoder-only spectrogram neural filtering technique. We evaluate the attack’s efficacy across diverse conditions, including speaking volume, mouse polling rate and DPI, surface materials, speaker languages, and environmental noise. In controlled environments, Mic-E-Mouse improves the signal-to-noise ratio (SNR) by up to +19 dB for speech reconstruction. Furthermore, our results demonstrate a speech recognition accuracy of roughly 42% to 61% on the AudioMNIST and VCTK datasets. All our code and datasets are publicly accessible on https://sites.google.com/view/mic-e-mouse.
Paper and project links
PDF Appearing in the Annual Computer Security Applications Conference (ACSAC 2025)
Summary
Modern high-performance optical mouse sensors have an often overlooked vulnerability: they can be exploited for side-channel attacks. This paper introduces Mic-E-Mouse, a novel side-channel attack that targets high-performance optical mouse sensors to covertly eavesdrop on users. Audio signals induce subtle surface vibrations that a mouse's optical sensor can detect, and user-space software on popular operating systems can collect and broadcast this sensitive side channel, giving attackers access to raw mouse data without system-level permissions. To overcome the poor quality of the vibration signals extracted from mouse data (non-uniform sampling, a non-linear frequency response, and significant quantization), Mic-E-Mouse employs an end-to-end data filtering pipeline that combines Wiener filtering, resampling corrections, and an innovative encoder-only spectrogram neural filtering technique. The attack is evaluated under diverse conditions, including speaking volume, mouse polling rate and DPI, surface materials, speaker language, and environmental noise. In controlled environments, Mic-E-Mouse improves the signal-to-noise ratio (SNR) for speech reconstruction by up to +19 dB and achieves roughly 42% to 61% speech recognition accuracy on the AudioMNIST and VCTK datasets. All code and datasets are publicly available at https://sites.google.com/view/mic-e-mouse.
Key Takeaways
- Modern optical mouse sensors are vulnerable to side-channel attacks, which Mic-E-Mouse exploits for covert eavesdropping.
- Audio signals induce subtle surface vibrations that the optical mouse sensor can detect.
- User-space software can collect and broadcast the mouse side channel, so attackers do not need system-level permissions to obtain the data.
- Mic-E-Mouse uses a sophisticated data filtering pipeline to deal with the poor quality of the raw vibration signal.
- The attack is effective under a variety of conditions, including speaking volume, mouse settings, surface materials, language, and environmental noise.
- In controlled environments, Mic-E-Mouse improves the SNR of reconstructed speech by up to +19 dB.
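As a rough illustration of two of the classical steps mentioned above, the sketch below puts irregularly timestamped sensor readings onto a uniform time grid and applies a standard Wiener filter; the data are synthetic, and the paper's actual pipeline additionally uses spectrogram-domain neural filtering.

```python
# Hedged sketch: resampling-correction and Wiener-filtering steps on synthetic
# "mouse displacement" data; parameters and signals are invented for illustration.
import numpy as np
from scipy.signal import wiener

rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(0.8e-3, 1.2e-3, size=4000))   # jittered ~1 kHz polling times (s)
y = np.sin(2 * np.pi * 220 * t) + 0.3 * rng.standard_normal(t.size)  # tone + noise

# Put the non-uniform samples on a uniform grid (simple linear interpolation here;
# the paper describes a more careful resampling correction).
fs = 1000
t_uniform = np.arange(t[0], t[-1], 1.0 / fs)
y_uniform = np.interp(t_uniform, t, y)

y_denoised = wiener(y_uniform, mysize=15)  # classic Wiener smoothing
print(y_uniform.shape, y_denoised.shape)
```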

Enhancing Speaker-Independent Dysarthric Speech Severity Classification with DSSCNet and Cross-Corpus Adaptation
Authors:Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remains challenging. This work introduces DSSCNet, a novel deep neural architecture that combines Convolutional, Squeeze-Excitation (SE), and Residual networks, helping it extract discriminative representations of dysarthric speech from mel spectrograms. The addition of the SE block selectively focuses on the important features of the dysarthric speech, thereby minimizing loss and enhancing overall model performance. We also propose a cross-corpus fine-tuning framework for severity classification, adapted from detection-based transfer learning approaches. DSSCNet is evaluated on two benchmark dysarthric speech corpora, TORGO and UA-Speech, under speaker-independent evaluation protocols: One-Speaker-Per-Severity (OSPS) and Leave-One-Speaker-Out (LOSO). DSSCNet achieves accuracies of 56.84% and 62.62% under OSPS and 63.47% and 64.18% under LOSO on TORGO and UA-Speech respectively, outperforming existing state-of-the-art methods. Upon fine-tuning, the performance improves substantially, with DSSCNet achieving up to 75.80% accuracy on TORGO and 68.25% on UA-Speech in OSPS, and up to 77.76% and 79.44%, respectively, in LOSO. These results demonstrate the effectiveness and generalizability of DSSCNet for fine-grained severity classification across diverse dysarthric speech datasets.
Paper and project links
PDF Speaker-independent experiments on classification of dysarthric speech severity
Summary
This work addresses severity classification of dysarthric speech in individuals with motor speech disorders. Although prior methods have tackled the task, robust generalization in speaker-independent settings remains challenging. The paper proposes DSSCNet, a deep neural architecture that combines convolutional, squeeze-excitation (SE), and residual networks to extract discriminative representations of dysarthric speech from mel spectrograms. The SE block selectively attends to the important features of dysarthric speech, reducing loss and improving overall model performance. A cross-corpus fine-tuning framework, adapted from detection-based transfer learning, is also proposed for severity classification. DSSCNet is evaluated on the TORGO and UA-Speech corpora under speaker-independent protocols, One-Speaker-Per-Severity (OSPS) and Leave-One-Speaker-Out (LOSO). It reaches 56.84% and 62.62% accuracy under OSPS and 63.47% and 64.18% under LOSO on TORGO and UA-Speech respectively, outperforming existing state-of-the-art methods. After fine-tuning, performance improves substantially, up to 75.80% on TORGO and 68.25% on UA-Speech under OSPS, and up to 77.76% and 79.44% under LOSO. The results demonstrate DSSCNet's effectiveness and generalizability for fine-grained severity classification across diverse dysarthric speech datasets.
Key Takeaways
- DSSCNet combines convolutional, squeeze-excitation, and residual networks to extract discriminative representations of dysarthric speech.
- The squeeze-excitation block focuses on the important features of dysarthric speech, improving model performance.
- A cross-corpus fine-tuning framework, adapted from detection-based transfer learning, is proposed for severity classification.
- DSSCNet outperforms existing methods on multiple benchmark datasets, showing its effectiveness and generalizability.
- Fine-tuning further improves DSSCNet's performance substantially.
- The results are significant for developing more accurate tools for speech recognition and dysarthria assessment.
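For readers unfamiliar with squeeze-excitation, below is a generic PyTorch sketch of an SE block of the kind DSSCNet inserts into its convolutional/residual stack; the channel count and reduction ratio are illustrative rather than the paper's values.

```python
# Generic squeeze-and-excitation block: global pooling ("squeeze") followed by a
# small bottleneck MLP that produces per-channel gates ("excitation").
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, channels, freq, time) feature maps
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels by learned importance

feats = torch.randn(4, 64, 80, 200)  # e.g. features computed from mel spectrograms
print(SEBlock(64)(feats).shape)      # torch.Size([4, 64, 80, 200])
```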

CAMEO: Collection of Multilingual Emotional Speech Corpora
Authors:Iwona Christop, Maciej Czajka
This paper presents CAMEO – a curated collection of multilingual emotional speech datasets designed to facilitate research in emotion recognition and other speech-related tasks. The main objectives were to ensure easy access to the data, to allow reproducibility of the results, and to provide a standardized benchmark for evaluating speech emotion recognition (SER) systems across different emotional states and languages. The paper describes the dataset selection criteria, the curation and normalization process, and provides performance results for several models. The collection, along with metadata, and a leaderboard, is publicly available via the Hugging Face platform.
Paper and project links
PDF Under review at ICASSP
Summary
This paper presents CAMEO, a curated collection of multilingual emotional speech datasets designed to facilitate research on emotion recognition and other speech-related tasks. Its main goals are easy access to the data, reproducibility of results, and a standardized benchmark for evaluating speech emotion recognition (SER) systems across emotional states and languages. The paper describes the dataset selection criteria, the curation and normalization process, and reports performance results for several models. The collection, together with metadata and a leaderboard, is publicly available on the Hugging Face platform.
Key Takeaways
- CAMEO is a curated collection of multilingual emotional speech datasets for advancing emotion recognition research.
- Its main goals are easy data access, reproducible results, and a standardized benchmark for speech emotion recognition systems.
- The dataset selection criteria and the curation and normalization process are described in detail.
- Performance results for several models on CAMEO are reported.
- The collection and its metadata are publicly available on the Hugging Face platform.
- A public leaderboard allows comparison of different systems' performance.

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
Authors:Rajvee Sheth, Himanshu Beniwal, Mayank Singh
We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification, Part-Of-Speech Tagging, Named Entity Recognition, and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss’ Kappa ≥ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts and spans diverse domains, ensuring real-world linguistic coverage. Evaluation reveals that closed-source LLMs significantly outperform traditional tools and open-source models in zero-shot settings. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER. Fine-tuning state-of-the-art LLMs on COMI-LINGUA demonstrates substantial improvements, achieving up to 95.25 F1 in NER, 98.77 F1 in MLI, and competitive MT performance, setting new benchmarks for Hinglish code-mixed text. COMI-LINGUA is publicly available at this URL: https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA.
Paper and project links
Summary
COMI-LINGUA is the largest manually annotated Hindi-English code-mixed dataset, with over 125K high-quality instances covering five core NLP tasks. Each instance is annotated by three bilingual annotators with strong inter-annotator agreement. The dataset is rigorously preprocessed and filtered, spans multiple domains and both scripts, and ensures real-world linguistic coverage. Evaluation shows that closed-source LLMs significantly outperform traditional tools and open-source models in zero-shot settings, and one-shot prompting consistently boosts performance, especially on structure-sensitive predictions such as POS and NER. Fine-tuning state-of-the-art LLMs on COMI-LINGUA yields substantial improvements, reaching 95.25 F1 on NER and 98.77 F1 on MLI with competitive MT performance, setting new benchmarks for Hinglish code-mixed text.
Key Takeaways
- COMI-LINGUA is the largest manually annotated Hindi-English code-mixed dataset.
- The dataset covers five core NLP tasks: matrix language identification, token-level language identification, part-of-speech tagging, named entity recognition, and machine translation.
- Each instance is annotated by three bilingual annotators with strong inter-annotator agreement (Fleiss' Kappa ≥ 0.81).
- The dataset spans multiple domains and both Devanagari and Roman scripts, ensuring real-world linguistic coverage.
- Closed-source LLMs significantly outperform traditional tools and open-source models in zero-shot settings.
- One-shot prompting consistently boosts performance across tasks, especially on structure-sensitive predictions such as POS and NER.
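As a side note on the agreement figure, the sketch below shows how Fleiss' Kappa over three annotators can be computed with statsmodels; the toy label matrix is invented purely to illustrate the calculation and has no relation to the dataset's actual annotations.

```python
# Hedged sketch: computing Fleiss' Kappa for three annotators with statsmodels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated tokens, columns = the three annotators' language labels (toy data)
labels = np.array([
    ["HI", "HI", "HI"],
    ["EN", "EN", "HI"],
    ["HI", "HI", "HI"],
    ["EN", "EN", "EN"],
    ["HI", "EN", "HI"],
])
table, _ = aggregate_raters(labels)  # item x category count matrix
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```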