Published: 2025-10-03

Updated: 2025-11-27

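⚠️ All summaries below were generated by a large language model and may contain errors. Treat them as a reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper!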

2025-10-03 update

Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

Authors: Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit

Voice-controlled interfaces can support older adults in clinical contexts – with chatbots being a prime example – but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to real-world datasets. Moreover, our results indicate that truncating generic models is helpful in balancing the accuracy-speed trade-off. Nonetheless, we also find inputs which cause a high word error rate and place them in context.

Paper and project links

PDF Forthcoming in the Proceedings of the Fourth Workshop on Bridging Human Computer Interaction and Natural Language Processing HCINLP (EMNLP)

Summary
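Voice-controlled interfaces such as chatbots can support older adults in clinical settings, but reliable automatic speech recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on the speech of older Dutch adults who interacted with Welzijn.AI, a chatbot designed for geriatric contexts. It benchmarks generic multilingual ASR models against models fine-tuned for Dutch spoken by older adults, and also considers processing speed. Generic multilingual models outperform the fine-tuned ones, which suggests that recent ASR models generalise well to real-world datasets out of the box. Truncating generic models helps balance the accuracy-speed trade-off. The study also identifies inputs that cause a high word error rate and places them in context.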

Key Takeaways

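Voice-controlled interfaces such as chatbots can support older adults in clinical settings.
Reliable ASR for underrepresented groups, including older adults, remains a bottleneck.
The study evaluates state-of-the-art ASR models on the speech of older Dutch adults.
Generic multilingual ASR models outperform models fine-tuned for Dutch spoken by older adults.
Truncating generic ASR models helps balance the trade-off between accuracy and processing speed.
Some inputs still cause a high word error rate.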
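
The word error rate mentioned above is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the Dutch example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped.
print(word_error_rate("ik heb goed geslapen", "ik heb geslapen"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference, which is one way a single input can produce a very high word error rate.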

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

Authors: Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu

Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced codec model. Existing approaches distill semantically rich self-supervised (SSL) representations into the first-layer codec tokens. This work proposes DualCodec, a dual-stream encoding approach that integrates SSL and waveform representations within an end-to-end codec framework. In this setting, DualCodec enhances the semantic information in the first-layer codec and enables the codec system to maintain high audio quality while operating at a low frame rate. Note that a low-frame-rate codec improves the efficiency of speech generation. Experimental results on audio codec and speech generation tasks confirm the effectiveness of the proposed DualCodec compared to state-of-the-art codec systems, such as Mimi Codec, SpeechTokenizer, DAC, and Encodec. Demos are available at: https://dualcodec.github.io, code is available at: https://github.com/jiaqili3/DualCodec

Paper and project links

PDF Accepted to Interspeech 2025

Summary

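This paper introduces DualCodec, a low-frame-rate, semantically enhanced neural audio codec. It uses dual-stream encoding to integrate self-supervised (SSL) and waveform representations within an end-to-end codec framework, which enriches the semantic information in the first-layer codec tokens and lets the codec maintain high audio quality at a low frame rate. A low frame rate in turn makes speech generation more efficient. Experiments on audio codec and speech generation tasks confirm the effectiveness of DualCodec compared to state-of-the-art codec systems such as Mimi Codec, SpeechTokenizer, DAC, and Encodec.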
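
To make the efficiency argument concrete: a language model that generates speech must produce one token per codec frame per codebook, so the sequence length scales linearly with the frame rate. The sketch below only illustrates that arithmetic. The function name `tokens_for_utterance` and the frame rates and codebook counts are illustrative assumptions, not figures reported for DualCodec or any other codec.

```python
import math


def tokens_for_utterance(duration_s: float,
                         frame_rate_hz: float,
                         num_codebooks: int = 1) -> int:
    """Number of codec tokens needed to represent an utterance."""
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames * num_codebooks


# Illustrative values only: a 10-second utterance at two assumed frame rates.
for rate in (75.0, 12.5):
    print(rate, tokens_for_utterance(10.0, rate))
```

Under these assumed numbers, dropping the frame rate from 75 Hz to 12.5 Hz shortens the first-layer token sequence by a factor of six for the same audio duration.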

Key Takeaways