Published: 2025-11-25

Updated: 2025-11-27

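⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.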

Updates for 2025-11-25

A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Authors: Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community. Our code is publicly available at: https://github.com/idiap/OTTC

Summary

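This paper proposes a differentiable alignment framework, based on one-dimensional optimal transport, for end-to-end automatic speech recognition (ASR). By introducing a pseudo-metric called the Sequence Optimal Transport Distance (SOTD), the framework lets the model learn a single alignment while performing ASR end to end. Experiments on TIMIT, AMI, and LibriSpeech show considerably better alignment performance than CTC and Consistency-Regularized CTC, at some cost in ASR performance.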

Key Takeaways

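Accurate sequence-to-sequence (seq2seq) alignment is critical for applications that rely on automatic speech recognition (ASR), such as medical speech analysis and language learning tools.
Current end-to-end (E2E) ASR systems, such as Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and inaccurate alignments.
The paper proposes a differentiable alignment framework based on one-dimensional optimal transport, which learns a single alignment and performs ASR end to end.
It introduces a pseudo-metric over the sequence space, the Sequence Optimal Transport Distance (SOTD), and discusses its theoretical properties.
Building on SOTD, it proposes the Optimal Temporal Transport Classification (OTTC) loss for ASR and contrasts its behavior with CTC.
On TIMIT, AMI, and LibriSpeech, the method considerably improves alignment performance over CTC and Consistency-Regularized CTC, with a trade-off in ASR performance.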
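
For readers unfamiliar with one-dimensional optimal transport, the following is a minimal sketch of why it yields a single alignment. It is not the paper's SOTD or OTTC implementation (see the linked repository for that). It only illustrates the standard fact that, for two weight vectors ordered along a line, the optimal coupling under a convex ground cost is monotone and can be built greedily. The function names `monotone_coupling` and `frame_alignment` and the example weights are hypothetical, chosen for illustration; how the paper parameterizes and learns the weights is not described in this summary.

```python
def monotone_coupling(a, b, tol=1e-12):
    """Greedy (north-west corner) coupling between two probability
    vectors whose entries are ordered in time. Illustrative only."""
    n, m = len(a), len(b)
    plan = [[0.0] * m for _ in range(n)]
    rem_a, rem_b = list(a), list(b)  # mass still to be transported
    i = j = 0
    while i < n and j < m:
        mass = min(rem_a[i], rem_b[j])
        plan[i][j] += mass
        rem_a[i] -= mass
        rem_b[j] -= mass
        # Advance whichever side has been exhausted (possibly both).
        if rem_a[i] <= tol:
            i += 1
        if rem_b[j] <= tol:
            j += 1
    return plan


def frame_alignment(plan):
    """Assign each frame (row) to the label (column) that receives
    most of its mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Hypothetical example: 4 audio frames with equal weight, 2 labels.
frames = [0.25, 0.25, 0.25, 0.25]
even_labels = [0.5, 0.5]
skewed_labels = [0.75, 0.25]  # first label "lasts" longer

print(frame_alignment(monotone_coupling(frames, even_labels)))
print(frame_alignment(monotone_coupling(frames, skewed_labels)))
```

Because the coupling never crosses itself, changing the label weights shifts the boundary between labels without reordering them, which is what makes the weights a natural handle for a learned alignment in a sketch like this.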
