
Speech


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on them for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-10-11

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Authors: Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.


Paper and Project Links

PDF

Summary

This paper proposes MeanVC, a lightweight, streaming zero-shot voice conversion (VC) method. It combines the strengths of autoregressive and non-autoregressive frameworks through a diffusion transformer with a chunk-wise autoregressive denoising strategy, enabling efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training and performs zero-shot voice conversion in a single sampling step with excellent speech quality and speaker similarity. Diffusion adversarial post-training is further applied to mitigate over-smoothing and improve speech quality. Experiments show that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving higher conversion quality with greater efficiency and fewer parameters.
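
To make the single-step sampling idea concrete, here is a minimal PyTorch sketch of how a trained mean-flow network might be invoked to jump from the start of the flow trajectory to its endpoint in one evaluation. The network interface, argument names, and tensor shapes are assumptions for illustration and are not taken from the paper.

```python
import torch

@torch.no_grad()
def one_step_voice_conversion(mean_velocity_net, content_feats, target_spk_emb,
                              num_frames=200, mel_dim=80):
    """Illustrative one-step sampler for a mean-flow model.

    `mean_velocity_net` is assumed to predict the *average* velocity u(x, r, t)
    over the interval [r, t]; with r=0 and t=1 a single evaluation moves the
    sample from the trajectory start (noise) to its endpoint (mel frames).
    """
    x0 = torch.randn(1, num_frames, mel_dim)      # trajectory start: Gaussian noise
    r = torch.zeros(1)                            # interval start time
    t = torch.ones(1)                             # interval end time
    u = mean_velocity_net(x0, r, t, content_feats, target_spk_emb)
    x1 = x0 + (t - r).view(-1, 1, 1) * u          # endpoint reached in one step
    return x1                                     # converted mel-spectrogram chunk
```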

Key Takeaways

  1. MeanVC is a lightweight, streaming zero-shot voice conversion method that combines the strengths of autoregressive and non-autoregressive frameworks.
  2. A diffusion transformer with a chunk-wise autoregressive denoising strategy enables efficient streaming processing.
  3. By introducing mean flows and regressing the average velocity field during training, MeanVC improves conversion quality and efficiency.
  4. MeanVC performs zero-shot voice conversion in a single sampling step, with excellent speech quality and speaker similarity.
  5. Diffusion adversarial post-training mitigates over-smoothing and further improves speech quality.
  6. Experiments show that MeanVC significantly outperforms existing zero-shot streaming voice conversion systems.

Cool Papers

Click here to view paper screenshots

CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

Authors: Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.


Paper and Project Links

PDF

Summary

Advances in multimodal large language models have accelerated the development of speech-to-speech interaction systems. While current models handle natural monolingual interaction, they show deficiencies in language alignment. On the proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models show a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Data construction and training approaches, specifically Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation, improve language alignment: knowledge accuracy rises from 25.14% to 46.13%, the open-ended understanding rate from 64.5% to 86.5%, and pronunciation errors in the secondary language are significantly reduced. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
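
As a rough, hypothetical illustration of how Chain of Recognition and Keyword Highlighting could be scaffolded at the data-construction level, the sketch below builds a CoR-style instruction (transcribe the code-switched input first, then answer) and inserts keyword markers into a training target. The prompt wording, tag format, and helper names are assumptions; the paper's actual data pipeline may differ.

```python
def build_cor_prompt(audio_placeholder="<audio>"):
    """Chain of Recognition (CoR): ask the model to first write down what it heard,
    then answer, so understanding of the code-switched input is made explicit."""
    return (
        f"Listen to the user: {audio_placeholder}\n"
        "Step 1 - Recognition: transcribe the utterance, keeping the original "
        "mix of Mandarin and English.\n"
        "Step 2 - Answer: respond to the transcribed question."
    )

def highlight_keywords(response_text, keywords):
    """Keyword Highlighting (KH): mark salient (often secondary-language) terms in the
    training target so generation is guided toward producing them correctly."""
    for kw in keywords:
        response_text = response_text.replace(kw, f"<kw>{kw}</kw>")
    return response_text

# Toy usage
print(build_cor_prompt())
print(highlight_keywords("Transformer 是一种 attention 架构。", ["Transformer", "attention"]))
```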

Key Takeaways

  1. Advances in multimodal large language models are driving the development of speech-to-speech interaction systems.
  2. Current models handle natural monolingual interaction well but show deficiencies in language alignment.
  3. The Code-Switching Speech-to-Speech Benchmark (CS3-Bench) reveals performance drops in knowledge-intensive question answering and misunderstandings in open-ended conversations for mainstream models.
  4. Data construction and training approaches improve language alignment capabilities.
  5. Chain of Recognition (CoR) and Keyword Highlighting (KH) improve language alignment, raising knowledge accuracy and the open-ended understanding rate.
  6. The proposed improvements significantly reduce pronunciation errors in the secondary language.

Cool Papers

Click here to view paper screenshots

A Multimodal GUI Architecture for Interfacing with LLM-Based Conversational Assistants

Authors: Hans G. W. van Dam

Advances in large language models (LLMs) and real-time speech recognition now make it possible to issue any graphical user interface (GUI) action through natural language and receive the corresponding system response directly through the GUI. Most production applications were never designed with speech in mind. This article provides a concrete architecture that enables GUIs to interface with LLM-based speech-enabled assistants. The architecture makes an application’s navigation graph and semantics available through the Model Context Protocol (MCP). The ViewModel, part of the MVVM (Model-View-ViewModel) pattern, exposes the application’s capabilities to the assistant by supplying both tools applicable to a currently visible view and application-global tools extracted from the GUI tree router. This architecture facilitates full voice accessibility while ensuring reliable alignment between spoken input and the visual interface, accompanied by consistent feedback across modalities. It future-proofs apps for upcoming OS super assistants that employ computer use agents (CUAs) and natively consume MCP if an application provides it. To address concerns about privacy and data security, the practical effectiveness of locally deployable, open-weight LLMs for speech-enabled multimodal UIs is evaluated. Findings suggest that recent smaller open-weight models approach the performance of leading proprietary models in overall accuracy and require enterprise-grade hardware for fast responsiveness. A demo implementation of the proposed architecture can be found at https://github.com/hansvdam/langbar


Paper and Project Links

PDF 24 pages, 19 figures, code available at https://github.com/hansvdam/langbar

Summary

Advances in large language models (LLMs) and real-time speech recognition make it possible to trigger any graphical user interface (GUI) action through natural language and receive the system's response directly through the GUI. The article presents a concrete architecture for interfacing GUIs with LLM-based speech-enabled assistants, exposing an application's navigation graph and semantics through the Model Context Protocol (MCP). It also addresses privacy and data-security concerns, finding that recent smaller open-weight models approach the overall accuracy of leading proprietary models but require enterprise-grade hardware for fast responsiveness.
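
The sketch below illustrates the ViewModel idea in plain Python, independent of any concrete MCP SDK: the currently visible view contributes view-local tools, an application router derived from the GUI tree contributes global tools, and their union is what would be advertised to the assistant over MCP. All class names, tool names, and signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., str]

class OrderViewModel:
    """ViewModel for the currently visible 'order' view: exposes view-local tools."""
    def local_tools(self) -> List[Tool]:
        return [Tool("add_item", "Add a product to the current order",
                     lambda product, qty=1: f"added {qty} x {product}")]

class AppRouter:
    """Navigation-graph router: exposes application-global tools (e.g. navigation)."""
    def __init__(self, routes: Dict[str, str]):
        self.routes = routes
    def global_tools(self) -> List[Tool]:
        return [Tool("navigate", "Open a named screen from the navigation graph",
                     lambda screen: f"navigated to {self.routes[screen]}")]

def tools_for_assistant(view_model: OrderViewModel, router: AppRouter) -> List[Tool]:
    # The union of view-local and app-global tools is what an MCP server would advertise.
    return view_model.local_tools() + router.global_tools()

if __name__ == "__main__":
    tools = tools_for_assistant(OrderViewModel(), AppRouter({"cart": "/cart"}))
    print([t.name for t in tools])  # ['add_item', 'navigate']
```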

Key Takeaways

  1. Large language models (LLMs) combined with real-time speech recognition allow a GUI to be controlled through natural language, with system feedback returned through the GUI.
  2. The article introduces a concrete architecture in which the GUI interfaces with an LLM-based speech assistant through the Model Context Protocol (MCP).
  3. Within the MVVM pattern, the ViewModel exposes the application's capabilities to the assistant, supplying tools for the currently visible view as well as application-global tools.
  4. The architecture ensures reliable alignment between spoken input and the visual interface, with consistent feedback across modalities.
  5. The article also considers privacy and data security, evaluating locally deployable open-weight LLMs for speech-enabled multimodal UIs.
  6. Findings show that recent smaller open-weight models approach the overall accuracy of leading proprietary models.

Cool Papers

Click here to view paper screenshots

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Authors: Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.


Paper and Project Links

PDF

Summary

An empirical study of the diffusion-based large language model LLaDA for automatic speech recognition (ASR). Used as an external deliberation-based processing module over Whisper-LLaMA transcripts, LLaDA's bidirectional attention and denoising capabilities are exploited through random masking, low-confidence masking, and semi-autoregressive strategies, substantially reducing the word error rate (WER). On LibriSpeech, the best cascade system reaches 2.25%/4.94% WER on test-clean/test-other, a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. A plain-text LLaDA without acoustic features fails to improve accuracy, underscoring the importance of audio-conditioned embeddings. Whisper-LLaDA is also evaluated as a standalone ASR decoder with diffusion-based and semi-autoregressive decoding: most configurations run faster than the Whisper-LLaMA baseline, though recognition accuracy is slightly lower.
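
A minimal sketch of the low-confidence masking strategy used for deliberation: tokens from the first-pass transcript whose confidence falls below a threshold are replaced by a mask symbol and re-predicted by an audio-conditioned bidirectional denoiser. The mask id, call signature, and single-pass structure are illustrative assumptions rather than the paper's implementation.

```python
import torch

MASK_ID = 0  # hypothetical mask token id

def low_confidence_mask(token_ids, token_confidences, threshold=0.8):
    """Replace low-confidence first-pass tokens with MASK_ID for re-prediction."""
    token_ids = token_ids.clone()
    token_ids[token_confidences < threshold] = MASK_ID
    return token_ids

@torch.no_grad()
def deliberate(diffusion_lm, token_ids, confidences, audio_embeddings, threshold=0.8):
    """One deliberation pass: mask uncertain tokens, then let the (audio-conditioned)
    diffusion LM fill them in using bidirectional context."""
    masked = low_confidence_mask(token_ids, confidences, threshold)
    logits = diffusion_lm(masked, audio_embeddings)       # assumed call signature
    refined = logits.argmax(dim=-1)
    # keep high-confidence tokens from the first pass, take new predictions elsewhere
    return torch.where(confidences >= threshold, token_ids, refined)
```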

Key Takeaways

  1. The diffusion-based large language model LLaDA is studied empirically for automatic speech recognition (ASR).
  2. As an external deliberation module for Whisper-LLaMA, LLaDA substantially reduces the word error rate (WER) through masking-based strategies.
  3. On LibriSpeech, the cascade system outperforms the baseline, with a notable relative improvement on the test-other split.
  4. A plain-text LLaDA without acoustic features fails to improve accuracy, underscoring the importance of audio-conditioned embeddings.
  5. Whisper-LLaDA is also evaluated as a standalone ASR decoder; most configurations infer faster but with slightly lower recognition accuracy.
  6. The study offers an empirical view of diffusion-based LLMs for ASR.

Cool Papers

Click here to view paper screenshots

I$^2$RF-TFCKD: Intra-Inter Representation Fusion with Time-Frequency Calibration Knowledge Distillation for Speech Enhancement

Authors: Jiaming Cheng, Ruiyu Liang, Ye Ni, Chao Xu, Jing Li, Wei Zhou, Rui Liu, Björn W. Schuller, Xiaoshuai Hao

In this paper, we propose an intra-inter representation fusion knowledge distillation (KD) framework with time-frequency calibration (I$^2$RF-TFCKD) for SE, which achieves distillation through the fusion of multi-layer teacher-student feature flows. Different from previous distillation strategies for SE, the proposed framework fully utilizes the time-frequency differential information of speech while promoting global knowledge flow. Firstly, we construct a collaborative distillation paradigm for intra-set and inter-set correlations. Within a correlated set, multi-layer teacher-student features are pairwise matched for calibrated distillation. Subsequently, we generate representative features from each correlated set through residual fusion to form the fused feature set that enables inter-set knowledge interaction. Secondly, we propose a multi-layer interactive distillation based on dual-stream time-frequency cross-calibration, which calculates the teacher-student similarity calibration weights in the time and frequency domains respectively and performs cross-weighting, thus enabling refined allocation of distillation contributions across different layers according to speech characteristics. The proposed distillation strategy is applied to the dual-path dilated convolutional recurrent network (DPDCRN) that ranked first in the SE track of the L3DAS23 challenge. To evaluate the effectiveness of I$^2$RF-TFCKD, we conduct experiments on both single-channel and multi-channel SE datasets. Objective evaluations demonstrate that the proposed KD strategy consistently and effectively improves the performance of the low-complexity student model and outperforms other distillation schemes.


Paper and Project Links

PDF submitted to Information Fusion

Summary

This paper proposes an intra-inter representation fusion knowledge distillation framework with time-frequency calibration (I$^2$RF-TFCKD) for speech enhancement (SE), which achieves distillation by fusing multi-layer teacher-student feature flows. Unlike previous distillation strategies for SE, the framework fully exploits the time-frequency differential information of speech while promoting global knowledge flow. Experiments on single-channel and multi-channel SE datasets show that the distillation strategy consistently improves the performance of the low-complexity student model and outperforms other distillation schemes.
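
To give a rough picture of the dual-stream time-frequency cross-calibration, the sketch below computes teacher-student similarities separately along the time and frequency axes of spectrogram-like layer features and cross-weights them into per-layer distillation weights. The pooling, the weighting rule, and the tensor shapes are illustrative assumptions, not the paper's equations.

```python
import torch
import torch.nn.functional as F

def tf_calibration_weights(teacher_feats, student_feats):
    """teacher_feats / student_feats: lists of per-layer tensors [B, T, F].
    Returns one weight per layer; here the weight grows where teacher and
    student agree less, so those layers receive more distillation emphasis."""
    weights = []
    for t_feat, s_feat in zip(teacher_feats, student_feats):
        # similarity along the time axis (pool over frequency)
        sim_t = F.cosine_similarity(t_feat.mean(dim=2), s_feat.mean(dim=2), dim=-1).mean()
        # similarity along the frequency axis (pool over time)
        sim_f = F.cosine_similarity(t_feat.mean(dim=1), s_feat.mean(dim=1), dim=-1).mean()
        # cross-weighting: combine the two views into a single mismatch score
        weights.append((1 - sim_t) * (1 - sim_f))
    w = torch.stack(weights)
    return w / (w.sum() + 1e-8)                      # normalise across layers

def distillation_loss(teacher_feats, student_feats):
    w = tf_calibration_weights(teacher_feats, student_feats).detach()  # fixed importances
    losses = [F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats)]
    return (w * torch.stack(losses)).sum()
```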

Key Takeaways

  1. An intra-inter representation fusion knowledge distillation framework with time-frequency calibration (I$^2$RF-TFCKD) is proposed.
  2. The framework fuses multi-layer teacher-student feature flows to perform distillation for speech enhancement (SE).
  3. It fully exploits the time-frequency differential information of speech while promoting global knowledge flow.
  4. A collaborative distillation paradigm covers intra-set and inter-set correlations, with pairwise matching of multi-layer teacher-student features within each correlated set.
  5. Representative features generated from each correlated set via residual fusion enable inter-set knowledge interaction.
  6. A multi-layer interactive distillation method based on dual-stream time-frequency cross-calibration allocates distillation contributions across layers according to speech characteristics.

Cool Papers

Click here to view paper screenshots

SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement

Authors: Kuan-Yu Chen, Jeng-Lin Li, Jian-Jiun Ding

With the fast development of zero-shot text-to-speech technologies, it is possible to generate high-quality speech signals that are indistinguishable from the real ones. Speech editing, including speech insertion and replacement, appeals to researchers due to its potential applications. However, existing studies only considered clean speech scenarios. In real-world applications, the existence of environmental noise could significantly degrade the quality of generation. In this study, we propose a noise-resilient speech editing framework, SeamlessEdit, for noisy speech editing. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-content refinement strategy. It can well address the scenario where the frequency bands of voice and background noise are not separated. The proposed SeamlessEdit framework outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.


Paper and Project Links

PDF 5 pages, 3 figures

Summary

With the rapid development of zero-shot text-to-speech technology, it is now possible to generate high-quality speech that is indistinguishable from real recordings. Speech editing, including insertion and replacement, has attracted researchers because of its potential applications. Existing studies, however, focus on clean-speech scenarios, whereas in real-world use environmental noise can significantly degrade generation quality. This work proposes SeamlessEdit, a noise-resilient framework for editing noisy speech. It adopts a frequency-band-aware noise suppression module and an in-content refinement strategy, and handles cases where the frequency bands of voice and background noise are not separated. SeamlessEdit outperforms state-of-the-art approaches in multiple quantitative and qualitative evaluations.
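
As a toy illustration of frequency-band-aware noise suppression, the sketch below applies a Wiener-like gain independently per frequency band of an STFT, so bands dominated by voice are treated differently from bands dominated by background noise. The band split, gain rule, and noise estimate are assumptions for illustration only and do not reproduce SeamlessEdit's module.

```python
import numpy as np

def band_aware_suppress(stft, noise_stft, n_bands=8, floor=0.1):
    """stft, noise_stft: complex arrays of shape [freq_bins, frames].
    Applies a per-band Wiener-like gain estimated from a noise-only segment."""
    freq_bins = stft.shape[0]
    edges = np.linspace(0, freq_bins, n_bands + 1, dtype=int)
    out = stft.copy()
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = stft[lo:hi]
        noise_power = np.mean(np.abs(noise_stft[lo:hi]) ** 2) + 1e-10
        signal_power = np.abs(band) ** 2
        # keep the band mostly intact where speech dominates, attenuate where noise dominates
        gain = np.maximum(1.0 - noise_power / (signal_power + 1e-10), floor)
        out[lo:hi] = band * gain
    return out
```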

Key Takeaways

  1. Zero-shot text-to-speech technology is developing rapidly and can generate high-quality speech that is hard to distinguish from real recordings.
  2. Speech editing, including speech insertion and replacement, has significant application potential.
  3. Existing studies focus on clean-speech scenarios and ignore the impact of environmental noise.
  4. This work proposes the SeamlessEdit framework for editing noisy speech.
  5. SeamlessEdit adopts a frequency-band-aware noise suppression module and an in-content refinement strategy.
  6. SeamlessEdit handles scenarios where the frequency bands of voice and background noise are not separated.

Cool Papers

Click here to view paper screenshots

A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Authors: Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.


Paper and Project Links

PDF

Summary

This paper proposes a differentiable alignment framework based on one-dimensional optimal transport for sequence-to-sequence (seq2seq) alignment, which matters for applications such as medical speech analysis and language-learning tools that rely on automatic speech recognition (ASR). Conventional end-to-end (E2E) ASR systems such as Connectionist Temporal Classification (CTC) and transducer-based models suffer from peaky behavior and alignment inaccuracies. The paper introduces a pseudo-metric over the sequence space, the Sequence Optimal Transport Distance (SOTD), and builds the Optimal Temporal Transport Classification (OTTC) loss for ASR on top of it. Experiments on TIMIT, AMI, and LibriSpeech show considerably better alignment than CTC and the recently proposed Consistency-Regularized CTC, at the cost of a trade-off in ASR performance. The work opens new avenues for seq2seq alignment research and provides a solid foundation for further exploration by the community.
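
For intuition about why one-dimensional optimal transport is attractive for alignment, the sketch below evaluates the closed-form 1D Wasserstein-1 distance between two normalized per-frame weight profiles by comparing cumulative distributions; in 1D no combinatorial matching is required and the quantity stays differentiable. This is a generic 1D OT example, not the paper's SOTD pseudo-metric or OTTC loss.

```python
import torch

def wasserstein1_1d(p_weights, q_weights):
    """p_weights, q_weights: non-negative 1-D tensors over the same time grid.
    Closed-form W1 in 1D: integrate the absolute difference of the two CDFs."""
    p = p_weights / p_weights.sum()
    q = q_weights / q_weights.sum()
    cdf_p = torch.cumsum(p, dim=0)
    cdf_q = torch.cumsum(q, dim=0)
    return torch.sum(torch.abs(cdf_p - cdf_q))

# toy example: a predicted token-occupancy profile vs. a shifted target profile
pred = torch.tensor([0.0, 0.2, 0.6, 0.2, 0.0], requires_grad=True)
target = torch.tensor([0.0, 0.0, 0.2, 0.6, 0.2])
loss = wasserstein1_1d(pred, target)
loss.backward()   # gradients flow, so the distance can sit inside a training loss
```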

Key Takeaways

  1. Accurate sequence-to-sequence alignment is critical for ASR-dependent applications such as medical speech analysis and language-learning tools.
  2. Current end-to-end (E2E) ASR systems suffer from peaky behavior and alignment inaccuracies.
  3. A differentiable alignment framework based on one-dimensional optimal transport is introduced to address seq2seq alignment.
  4. A pseudo-metric, the Sequence Optimal Transport Distance (SOTD), is proposed, and the Optimal Temporal Transport Classification (OTTC) loss for ASR is built on it.
  5. Experiments show considerably improved alignment performance compared with CTC and other methods.
  6. Despite a trade-off in ASR performance, the work opens a new direction for seq2seq alignment research.

Cool Papers

Click here to view paper screenshots

An Investigation of Incorporating Mamba for Speech Enhancement

Authors: Rong Chao, Wen-Huang Cheng, Moreno La Quatra, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Szu-Wei Fu, Yu Tsao

This work aims to investigate the use of a recently proposed, attention-free, scalable state-space model (SSM), Mamba, for the speech enhancement (SE) task. In particular, we employ Mamba to deploy different regression-based SE models (SEMamba) with different configurations, namely basic, advanced, causal, and non-causal. Furthermore, loss functions either based on signal-level distances or metric-oriented are considered. Experimental evidence shows that SEMamba attains a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced, non-causal configuration. A new state-of-the-art PESQ of 3.69 is also reported when SEMamba is combined with Perceptual Contrast Stretching (PCS). Compared against Transformed-based equivalent SE solutions, a noticeable FLOPs reduction up to ~12% is observed with the advanced non-causal configurations. Finally, SEMamba can be used as a pre-processing step before automatic speech recognition (ASR), showing competitive performance against recent SE solutions.


Paper and Project Links

PDF Accepted to IEEE SLT 2024

Summary

This work investigates Mamba, a recently proposed attention-free, scalable state-space model (SSM), for speech enhancement (SE). Mamba is used to build regression-based SE models (SEMamba) in basic, advanced, causal, and non-causal configurations, with loss functions based on signal-level distances or oriented toward evaluation metrics. SEMamba reaches a competitive PESQ of 3.55 on the VoiceBank-DEMAND dataset with the advanced non-causal configuration, and a new state-of-the-art PESQ of 3.69 when combined with Perceptual Contrast Stretching (PCS). Compared with Transformer-based SE counterparts, the advanced non-causal configurations reduce FLOPs by up to about 12%. Finally, SEMamba can serve as a pre-processing step before automatic speech recognition (ASR), performing competitively against recent SE solutions.
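
The sketch below shows one plausible way a regression-based SE model could be assembled around Mamba blocks: a stack of residual Mamba layers over log-magnitude frames predicts a mask applied to the noisy magnitude spectrogram. It assumes the `mamba_ssm` package's `Mamba` block; the layer sizes and masking formulation are illustrative and are not SEMamba's actual architecture.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumes the mamba_ssm package is installed

class MambaMaskEstimator(nn.Module):
    """Toy regression-based SE model: Mamba layers over spectrogram frames -> mask."""
    def __init__(self, n_freq=257, d_model=256, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(n_freq, d_model)
        self.blocks = nn.ModuleList([Mamba(d_model=d_model) for _ in range(n_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.proj_out = nn.Linear(d_model, n_freq)

    def forward(self, noisy_mag):                  # noisy_mag: [B, frames, n_freq]
        x = self.proj_in(torch.log1p(noisy_mag))
        for block, norm in zip(self.blocks, self.norms):
            x = x + block(norm(x))                 # residual Mamba block (causal scan)
        mask = torch.sigmoid(self.proj_out(x))
        return mask * noisy_mag                    # enhanced magnitude estimate
```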

Key Takeaways

  1. The work applies Mamba, a recently proposed attention-free state-space model, to speech enhancement.
  2. Mamba is used to build regression-based SE models (SEMamba) in basic, advanced, causal, and non-causal configurations.
  3. Experiments show that SEMamba achieves a competitive PESQ score on the VoiceBank-DEMAND dataset.
  4. Combining SEMamba with Perceptual Contrast Stretching (PCS) yields a new state-of-the-art PESQ score.
  5. Compared with Transformer-based SE solutions, SEMamba's advanced non-causal configurations achieve a notable FLOPs reduction.
  6. SEMamba can be used as a pre-processing step for automatic speech recognition (ASR).

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!