
Speech


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: never use them for serious academic purposes; they are only intended as a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-02-12 Update

Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings

Authors:Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN) which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD; yielding a relative improvement in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.
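
The abstract does not give implementation details, but the core idea of combining an audiovisual score with a speaker comparison can be illustrated with a minimal sketch. The code below is not the authors' SCAN architecture; the fusion weight `alpha`, the embedding dimension, and the function names are hypothetical, and the paper learns the combination rather than interpolating fixed scores.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fused_asd_score(av_sync_score, ref_speaker_emb, candidate_audio_emb, alpha=0.5):
    """Fuse an audiovisual sync score with a speaker-comparison score.

    av_sync_score: output of an AV baseline (e.g. TalkNet / Light-ASD), in [0, 1].
    ref_speaker_emb: speaker embedding enrolled from reference speech.
    candidate_audio_emb: speaker embedding extracted from the candidate audio.
    alpha: interpolation weight (hypothetical; the paper learns the fusion).
    """
    speaker_score = 0.5 * (cosine(ref_speaker_emb, candidate_audio_emb) + 1.0)  # map to [0, 1]
    return alpha * av_sync_score + (1.0 - alpha) * speaker_score

# toy usage with random 192-dimensional embeddings
rng = np.random.default_rng(0)
ref, cand = rng.normal(size=192), rng.normal(size=192)
print(fused_asd_score(0.7, ref, cand))
```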


Paper and Project Links

PDF Accepted to ICASSP 2025. 5 pages, 4 figures. To appear in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 6-11, 2025, Hyderabad, India

Summary

This paper addresses audiovisual active speaker detection (ASD), which determines a candidate speaker's speech activity from acoustic and visual data. It proposes the speaker comparison auxiliary network (SCAN), which extracts speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes where the visual signal is unresolvable. In addition, an improved face-speaker library enrolment method is developed, using a self-supervised approach to video-based face recognition. Motivated by wearable devices, the work focuses on improving speaker-embedding-informed ASD for egocentric recordings, which are characterised by acoustic noise and highly dynamic scenes. Implemented on top of the TalkNet and Light-ASD baselines, SCAN yields relative mAP improvements of 14.5% and 10.3%, respectively, on the Ego4D benchmark.

Key Takeaways

  1. Audiovisual active speaker detection (ASD) determines a candidate speaker's speech activity from acoustic and visual data.
  2. SCAN combines speaker-specific information from reference speech and the candidate audio signal to resolve scenes where the visual signal is unresolvable.
  3. The face-speaker library enrolment method is improved using self-supervised, video-based face recognition.
  4. The work targets wearable devices, specifically ASD for egocentric recordings.
  5. Such recordings pose the challenges of acoustic noise and highly dynamic scenes.
  6. SCAN is implemented on top of the TalkNet and Light-ASD baselines and achieves substantial mAP gains on the Ego4D benchmark.


Speech to Speech Translation with Translatotron: A State of the Art Review

Authors:Jules R. Kala, Emmanuel Adetiba, Abdultaofeek Abayom, Oluwatobi E. Dare, Ayodele H. Ifijeh

A cascade-based speech-to-speech translation has been considered a benchmark for a very long time, but it is plagued by many issues, like the time taken to translate a speech from one language to another and compound errors. These issues are because a cascade-based method uses a combination of methods such as speech recognition, speech-to-text translation, and finally, text-to-speech translation. Translatotron, a sequence-to-sequence direct speech-to-speech translation model was designed by Google to address the issues of compound errors associated with cascade model. Today there are 3 versions of the Translatotron model: Translatotron 1, Translatotron 2, and Translatotron3. The first version was designed as a proof of concept to show that a direct speech-to-speech translation was possible, it was found to be less effective than the cascade model but was producing promising results. Translatotron2 was an improved version of Translatotron 1 with results similar to the cascade model. Translatotron 3 the latest version of the model is better than the cascade model at some points. In this paper, a complete review of speech-to-speech translation will be presented, with a particular focus on all the versions of Translatotron models. We will also show that Translatotron is the best model to bridge the language gap between African Languages and other well-formalized languages.


Paper and Project Links

PDF 12 pages and 3 figures

Summary

Cascade-based speech-to-speech translation has long been regarded as the benchmark, but it suffers from long translation times and compounding errors. Google designed Translatotron, a sequence-to-sequence direct speech-to-speech translation model, to address these issues. This paper presents a comprehensive review of speech-to-speech translation with a particular focus on all versions of the Translatotron models, and argues that Translatotron is the best model for bridging the language gap between African languages and other well-formalized languages.

Key Takeaways

  1. Cascade-based speech-to-speech translation suffers from long translation times and compounding errors.
  2. Translatotron is a sequence-to-sequence direct speech-to-speech translation model designed to address the issues of the cascade model.
  3. The Translatotron model has three versions: Translatotron 1, Translatotron 2, and Translatotron 3.
  4. Translatotron 1 was a proof of concept showing that direct speech-to-speech translation is possible, though it was less effective than the cascade model.
  5. Translatotron 2 is an improved version of Translatotron 1 with results similar to the cascade model.
  6. Translatotron 3 outperforms the cascade model in some respects.


Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models

Authors:Jing-Xuan Zhang, Genshun Wan, Jianqing Gao, Zhen-Hua Ling

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks, such as lipreading and audio-visual speech recognition. Recently, speech foundation models (SFMs) have shown remarkable generalization capabilities across various speech-related tasks. Building on this progress, we propose an audio-visual representation learning model that leverages cross-modal knowledge distillation from SFMs. In our method, SFMs serve as teachers, from which multi-layer hidden representations are extracted using clean audio inputs. We also introduce a multi-teacher ensemble method to distill the student, which receives audio-visual data as inputs. A novel representational knowledge distillation loss is employed to train the student during pretraining, which is also applied during finetuning to further enhance the performance on downstream tasks. Our experiments utilized both a self-supervised SFM, WavLM, and a supervised SFM, iFLYTEK-speech. The results demonstrated that our proposed method achieved superior or at least comparable performance to previous state-of-the-art baselines across automatic speech recognition, visual speech recognition, and audio-visual speech recognition tasks. Additionally, comprehensive ablation studies and the visualization of learned representations were conducted to evaluate the effectiveness of our proposed method.
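
As a rough illustration of multi-layer, multi-teacher representational distillation, the sketch below averages a per-layer loss between student and teacher hidden states over an ensemble of teachers. The specific loss terms (L1 plus cosine), shapes, and function names are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def representation_kd_loss(student_layers, teacher_layers_per_teacher):
    """Multi-layer, multi-teacher representational distillation loss (sketch).

    student_layers: list of [B, T, D] hidden states from the audio-visual student.
    teacher_layers_per_teacher: list (one entry per SFM teacher) of lists of
        [B, T, D] hidden states extracted from clean audio inputs.
    """
    losses = []
    for teacher_layers in teacher_layers_per_teacher:
        for s, t in zip(student_layers, teacher_layers):
            l1 = F.l1_loss(s, t.detach())
            cos = 1.0 - F.cosine_similarity(s, t.detach(), dim=-1).mean()
            losses.append(l1 + cos)
    return torch.stack(losses).mean()

# toy usage: 2 teachers, 3 matched layers, batch of 2, 50 frames, 256 dims
student = [torch.randn(2, 50, 256, requires_grad=True) for _ in range(3)]
teachers = [[torch.randn(2, 50, 256) for _ in range(3)] for _ in range(2)]
print(representation_kd_loss(student, teachers))
```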


Paper and Project Links

PDF accepted to Pattern Recognition

Summary

Audio-visual representation learning is crucial for advancing multimodal speech processing tasks such as lipreading and audio-visual speech recognition. This work proposes an audio-visual representation learning method that leverages cross-modal knowledge distillation from speech foundation models (SFMs). The SFMs act as teachers, from which multi-layer hidden representations are extracted using clean audio inputs, and a multi-teacher ensemble method distils a student model that takes audio-visual data as input. A novel representational knowledge distillation loss is used to pretrain the student and is also applied during finetuning to further improve downstream performance. Experiments with a self-supervised SFM (WavLM) and a supervised SFM (iFLYTEK-speech) show that the method matches or exceeds previous state-of-the-art baselines on automatic, visual, and audio-visual speech recognition, and comprehensive ablation studies and visualizations of the learned representations are provided to evaluate its effectiveness.

Key Takeaways

  1. Audio-visual representation learning is crucial for multimodal speech processing tasks such as lipreading and audio-visual speech recognition.
  2. The proposed method leverages cross-modal knowledge distillation from speech foundation models (SFMs).
  3. SFMs serve as teacher models, with multi-layer hidden representations extracted from clean audio inputs.
  4. A multi-teacher ensemble is used to distil the student model, which receives audio-visual data.
  5. A novel representational knowledge distillation loss is applied during both pretraining and finetuning to improve downstream performance.
  6. Experiments with both self-supervised and supervised SFMs show superior performance across multiple speech tasks.
  7. Ablation studies and representation visualizations evaluate the effectiveness of the method.


IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Authors:Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities.Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise model. We add some novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of the Vector Quantization (VQ) with Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speechcode decoder with BigVGAN2. Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. As for the popular TTS systems in the open-source, such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.
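
The Finite-Scalar Quantization (FSQ) alternative mentioned in the VQ/FSQ comparison can be sketched in a few lines: each latent dimension is bounded and rounded to a small fixed number of levels, giving an implicit codebook without a learned embedding table. The sketch below follows the generic FSQ idea only loosely (the level layout and bounding are simplified assumptions) and is not IndexTTS's actual acoustic tokenizer.

```python
import torch

def fsq_quantize(z, levels):
    """Finite-Scalar Quantization sketch: each latent dimension is squashed and
    rounded to a small, fixed number of levels, so the implicit codebook is the
    product of per-dimension levels and cannot suffer codebook collapse.

    z: [..., D] latent with D == len(levels).
    levels: e.g. [8, 5, 5, 5] gives an implicit codebook of 8*5*5*5 entries.
    """
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half          # squash each dim into [-half, half]
    quantized = torch.round(bounded)        # snap to the integer grid
    # straight-through estimator: forward uses quantized, backward uses bounded
    return bounded + (quantized - bounded).detach()

codes = fsq_quantize(torch.randn(4, 100, 4), levels=[8, 5, 5, 5])
print(codes.shape, codes.unique().numel())
```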


Paper and Project Links

PDF

Summary

LLM-based text-to-speech (TTS) systems have become the industry mainstream thanks to their high naturalness and strong zero-shot voice cloning capabilities. This paper introduces IndexTTS, a system built mainly on the XTTS and Tortoise models. For Chinese scenarios it adopts a hybrid character-and-pinyin modelling method to make the pronunciation of polyphonic and long-tail characters controllable, and it compares Vector Quantization (VQ) with Finite-Scalar Quantization (FSQ) for codebook utilisation of acoustic speech tokens. To improve the effect and stability of voice cloning, a conformer-based speech conditional encoder is introduced and the speech-code decoder is replaced with BigVGAN2. Compared with XTTS, IndexTTS achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning; compared with open-source TTS systems such as Fish-Speech and CosyVoice2, it offers a simpler training process, more controllable usage, and faster inference.

Key Takeaways

  1. Large language model (LLM) based text-to-speech (TTS) systems have become mainstream thanks to their high naturalness and zero-shot voice cloning capabilities.
  2. IndexTTS builds on the XTTS and Tortoise models and adopts a hybrid character-and-pinyin modelling method for Chinese scenarios.
  3. To improve the effect and stability of voice cloning, IndexTTS introduces a conformer-based speech conditional encoder and replaces the speech-code decoder with BigVGAN2.
  4. Compared with XTTS, IndexTTS improves naturalness, content consistency, and zero-shot voice cloning.
  5. Compared with other open-source TTS systems, IndexTTS has a simpler training process, more controllable usage, and faster inference speed, while also performing better.
  6. IndexTTS makes the pronunciation of polyphonic and long-tail characters controllable.


Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

Authors:Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao

This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis, which provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving the voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on unseen LibriTTS test-clean and emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.
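
A conditional flow matching training step of the kind used for Mel-spectrogram synthesis can be sketched as follows: sample a point on a straight path between noise and the target Mel frames and regress the constant velocity. The tiny network, conditioning layout, and dimensions below are hypothetical placeholders, not PFlow-VC's masked pitch-conditioned model.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for a Mel-spectrogram flow-matching decoder (hypothetical)."""
    def __init__(self, mel_dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, 512), nn.SiLU(), nn.Linear(512, mel_dim)
        )

    def forward(self, x_t, t, cond):
        # t: [B, 1, 1] scalar time per utterance, broadcast over frames
        t_feat = t.expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def cfm_loss(model, mel, cond):
    """One conditional flow matching step on a linear (rectified-flow style) path."""
    x0 = torch.randn_like(mel)              # noise sample
    t = torch.rand(mel.shape[0], 1, 1)      # random time per utterance
    x_t = (1 - t) * x0 + t * mel            # point on the straight path
    target_velocity = mel - x0              # d x_t / d t along that path
    pred = model(x_t, t, cond)
    return torch.mean((pred - target_velocity) ** 2)

# toy usage; `cond` stands in for pitch-token and speaker-prompt conditioning
model = TinyVelocityNet()
print(cfm_loss(model, mel=torch.randn(2, 120, 80), cond=torch.randn(2, 120, 256)))
```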


Paper and Project Links

PDF Accepted by ICASSP 2025

Summary
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target-speaker prompt information for expressive voice conversion (VC). Unlike previous VC work, which focuses mainly on speaker conversion, it takes a simple and efficient approach to improving style expressiveness: a self-supervised pitch VQVAE discretizes speaker-irrelevant pitch information, and a masked pitch-conditioned flow matching model synthesizes Mel-spectrograms, providing in-context pitch modelling. In addition, global timbre embeddings are combined with time-varying timbre tokens to improve timbre similarity. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD demonstrate the superiority of PFlow-VC in both timbre conversion and style transfer.

Key Takeaways

  1. PFlow-VC is a conditional flow matching voice conversion model designed for expressive voice conversion.
  2. The model leverages fine-grained discrete pitch tokens and target-speaker prompt information.
  3. Unlike conventional voice conversion models, which focus mainly on speaker conversion, PFlow-VC improves style expressiveness.
  4. A pretrained self-supervised pitch VQVAE discretizes speaker-irrelevant pitch information.
  5. A masked pitch-conditioned flow matching model synthesizes Mel-spectrograms, improving voice style transfer.
  6. Combining global timbre embeddings with time-varying timbre tokens improves timbre similarity.


Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Authors:Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
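
Classifier-free guidance at inference time is a standard construction and can be sketched independently of Koel-TTS's specifics: the model is run with and without conditioning and the two sets of logits are extrapolated. The guidance scale and vocabulary size below are illustrative assumptions.

```python
import torch

def cfg_logits(cond_logits, uncond_logits, guidance_scale=2.0):
    """Classifier-free guidance for autoregressive token generation (sketch).

    cond_logits:   next-token logits with text + reference-audio conditioning.
    uncond_logits: logits from the same model with the conditioning dropped.
    guidance_scale is a hypothetical value; 1.0 recovers the conditional model.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# toy usage over an assumed vocabulary of 1024 acoustic tokens
cond = torch.randn(1, 1024)
uncond = torch.randn(1, 1024)
probs = torch.softmax(cfg_logits(cond, uncond), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print(next_token)
```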


Paper and Project Links

PDF

Summary

This paper introduces Koel-TTS, a suite of enhanced encoder-decoder Transformer text-to-speech models that addresses the lack of controllability and the hallucination issues of autoregressive speech token generation. It incorporates preference alignment guided by automatic speech recognition and speaker verification models, and adds classifier-free guidance to further improve synthesis adherence to the transcript and the reference speaker audio. Experiments show that these optimizations significantly improve target-speaker similarity, intelligibility, and naturalness of the synthesized speech. Koel-TTS maps text and context audio directly to acoustic tokens and outperforms state-of-the-art TTS models on these metrics despite being trained on a significantly smaller dataset. Audio samples and demos are available on the authors' website.

Key Takeaways

  1. Koel-TTS is a suite of enhanced encoder-decoder Transformer TTS models that addresses the lack of controllability and the inconsistent outputs of autoregressive speech token generation.
  2. The model incorporates preference alignment guided by automatic speech recognition (ASR) and speaker verification models.
  3. Koel-TTS maps text and context audio directly to acoustic tokens.
  4. Classifier-free guidance is used to further improve synthesis adherence to the transcript and the reference speaker audio.
  5. Experiments show significant improvements in target-speaker similarity, intelligibility, and naturalness.
  6. Koel-TTS outperforms state-of-the-art TTS models despite being trained on a significantly smaller dataset.


Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Authors:Adam Stooke, Rohit Prabhavalkar, Khe Chai Sim, Pedro Moreno Mengibar

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the “Aligner-Encoder”. To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention – it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform “self-transduction”.
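
Below is a minimal sketch of the frame-wise cross-entropy idea, under the strong assumption that the encoder has already aligned audio to text so that the u-th encoder frame can be scored against the u-th target token. The real Aligner-Encoder additionally runs a light text-only recurrent decoder, which is omitted here; shapes and names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def framewise_ce_loss(encoder_frames, proj, targets, pad_id=0):
    """Frame-wise cross-entropy for an aligner-encoder style model (sketch).

    Because the encoder is assumed to have aligned audio to text internally,
    the u-th encoder frame simply predicts the u-th output token; no RNN-T
    lattice or dynamic programming is needed.

    encoder_frames: [B, T, D] encoder outputs with T >= U.
    proj:           linear layer mapping D -> vocabulary size.
    targets:        [B, U] token ids, padded with pad_id.
    """
    U = targets.shape[1]
    logits = proj(encoder_frames[:, :U, :])               # [B, U, V]
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), targets.reshape(-1), ignore_index=pad_id
    )

# toy usage: 200 encoder frames, 30 target tokens, 1000-token vocabulary
proj = nn.Linear(256, 1000)
print(framewise_ce_loss(torch.randn(2, 200, 256), proj, torch.randint(1, 1000, (2, 30))))
```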


Paper and Project Links

PDF

Summary

This paper reports a new finding for automatic speech recognition: transformer-based encoders can perform the audio-to-text alignment internally during the forward pass, before decoding, which enables a simpler and more efficient model. The resulting Aligner-Encoder is trained with the frame-wise cross-entropy loss of AED instead of the dynamic programming of RNN-T, and uses a simplified decoder, achieving fast inference and performance close to the state of the art. Experiments measure total inference time at 2x faster than RNN-T and 16x faster than AED. In addition, the audio-text alignment is clearly visible in the self-attention weights of a certain layer.

Key Takeaways

  1. Transformer-based encoders can perform the alignment internally during the forward pass, before decoding, which simplifies automatic speech recognition models.
  2. The proposed Aligner-Encoder discards the dynamic programming of RNN-T and is trained with the frame-wise cross-entropy loss of AED.
  3. The decoder uses the lighter text-only recurrence of RNN-T without learned cross-attention, scanning embedding frames in order and producing one token each.
  4. The model performs remarkably close to the state of the art, and its inference is much faster than RNN-T and AED.
  5. A special inference configuration enables long-form recognition.
  6. The audio-text alignment is clearly visible in the self-attention weights of a certain layer.


Unveiling Interpretability in Self-Supervised Speech Representations for Parkinson’s Diagnosis

Authors:David Gimeno-Gómez, Catarina Botelho, Anna Pompili, Alberto Abad, Carlos-D. Martínez-Hinarejos

Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson’s Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework’s capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.
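
One common way to obtain interpretable temporal weights from self-supervised speech features is single-query attention pooling, sketched below: the softmax weights over frames can be read as a saliency map. This is a generic construction under assumed shapes (e.g. 768-dimensional frames), not the paper's exact embedding-level and temporal-level cross-attention design.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Single-query cross-attention pooling over SSL speech features (sketch).

    A learned query attends over the frame sequence; the attention weights can
    be inspected as a temporal saliency map, the kind of interpretability
    signal the paper builds on.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, feats):                      # feats: [B, T, D]
        k, v = self.key(feats), self.value(feats)
        scores = (self.query @ k.transpose(1, 2)) / feats.shape[-1] ** 0.5  # [B, 1, T]
        weights = torch.softmax(scores, dim=-1)
        pooled = weights @ v                       # [B, 1, D]
        return pooled.squeeze(1), weights.squeeze(1)

pool = AttentionPooling()
pooled, saliency = pool(torch.randn(4, 300, 768))  # e.g. WavLM-style frame features
print(pooled.shape, saliency.shape)                # [4, 768], [4, 300]
```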


Paper and Project Links

PDF Accepted in the Special Issue on “Modelling and Processing Language and Speech in Neurodegenerative Disorders” published by Journal of Selected Topics in Signal Processing (JSTSP)

Summary
Pathological speech analysis for Parkinson's Disease has become an active research topic, and self-supervised speech representations have achieved promising results, but their lack of interpretability limits adoption for clinical diagnosis. This study proposes a novel interpretable framework designed to support Parkinson's Disease (PD) diagnosis. By designing simple yet effective cross-attention mechanisms for embedding-level and temporal-level analysis, the framework provides interpretability from two distinct but complementary perspectives. Experiments show that it identifies meaningful speech patterns within self-supervised representations across a wide range of assessment tasks, improves the interpretability of deep-learning pathological speech models, and paves the way for more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems. Its classification accuracy is competitive with state-of-the-art approaches, and it is robust in cross-lingual scenarios on spontaneous speech.

Key Takeaways

  1. Pathological speech analysis for Parkinson's Disease is currently an active research area.
  2. Self-supervised speech representations have achieved strong results in this domain.
  3. Their limited interpretability remains a major obstacle to adoption for clinical diagnosis.
  4. A novel interpretable framework is proposed to support Parkinson's Disease diagnosis.
  5. The framework provides two complementary interpretability perspectives through embedding-level and temporal-level cross-attention mechanisms.
  6. Experiments on five well-established PD speech benchmarks demonstrate its effectiveness and reliability across a range of assessment tasks.


Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Authors:Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.
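
The speculative-decoding loop with draft heads can be sketched generically: the draft heads propose several future tokens in one pass, the base autoregressive model verifies them, and a tolerance parameter relaxes the acceptance test. The greedy top-k acceptance rule, the dummy models, and the parameter names below are assumptions for illustration and are simpler than VADUSA's sampling-based scheme.

```python
import torch

def speculative_step(base_logits_fn, draft_heads_fn, prefix, tolerance=1):
    """One speculative decoding step with draft heads (sketch, greedy variant).

    draft_heads_fn(prefix) proposes the next k tokens in one pass; the base
    autoregressive model then scores the extended sequence, and proposals are
    accepted while they stay within the verifier's top-`tolerance` tokens
    (a loose stand-in for the paper's tolerance mechanism).
    """
    draft = draft_heads_fn(prefix)                       # [k] proposed token ids
    candidate = torch.cat([prefix, draft])
    logits = base_logits_fn(candidate)                   # [len(candidate), V]
    accepted = []
    for i, tok in enumerate(draft.tolist()):
        pos = len(prefix) + i - 1                        # logits predicting this token
        topk = torch.topk(logits[pos], k=tolerance).indices.tolist()
        if tok in topk:
            accepted.append(tok)
        else:
            accepted.append(int(logits[pos].argmax()))   # fall back to the verifier
            break
    return torch.cat([prefix, torch.tensor(accepted)])

# toy usage with random "models" over a 64-token vocabulary and 3 draft heads
V, k = 64, 3
base = lambda seq: torch.randn(len(seq), V)
draft = lambda seq: torch.randint(0, V, (k,))
print(speculative_step(base, draft, prefix=torch.tensor([1, 2, 3]), tolerance=2))
```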


Paper and Project Links

PDF Accepted by ICASSP 2025

Summary

This paper introduces VADUSA, a method for accelerating GPT-style autoregressive text-to-speech (TTS) systems through speculative decoding. VADUSA not only significantly improves inference speed but also improves performance by incorporating draft heads that predict future speech content autoregressively. A tolerance mechanism during sampling further accelerates inference without compromising quality, and the approach generalizes well across large datasets and various types of speech tokens.

Key Takeaways

  1. VADUSA is one of the first approaches to accelerate autoregressive text-to-speech (TTS) through speculative decoding.
  2. Speculative decoding significantly improves inference speed.
  3. Incorporating draft heads that predict future speech content autoregressively also improves performance.
  4. A tolerance mechanism during sampling further accelerates inference.
  5. VADUSA achieves this acceleration without compromising quality.
  6. The approach demonstrates strong generalization across large datasets and various types of speech tokens.


High-Resolution Speech Restoration with Latent Diffusion Model

Authors:Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu

Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-instrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
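
A generic latent-diffusion training step for restoration can be sketched as follows: given latents of the clean target and of the degraded recording (the autoencoder is omitted), add noise to the clean latent at a random timestep and train a denoiser, conditioned on the degraded latent, to predict that noise. The toy noise schedule, network, and shapes below are assumptions, not Hi-ResLDM's actual parameterisation.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the latent denoising network (hypothetical shapes)."""
    def __init__(self, latent_dim=64, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, z_t, t, cond):
        # normalise the integer timestep and broadcast it over the frame axis
        t_feat = (t.float() / 1000.0).view(-1, 1, 1).expand(-1, z_t.shape[1], 1)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

def restoration_diffusion_loss(denoiser, z_clean, z_degraded, num_steps=1000):
    """DDPM-style epsilon-prediction loss in latent space, conditioned on the
    degraded recording's latent (a generic sketch, not Hi-ResLDM's exact setup)."""
    t = torch.randint(0, num_steps, (z_clean.shape[0],))
    alpha_bar = (torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2).view(-1, 1, 1)
    noise = torch.randn_like(z_clean)
    z_t = alpha_bar.sqrt() * z_clean + (1 - alpha_bar).sqrt() * noise
    return torch.mean((denoiser(z_t, t, z_degraded) - noise) ** 2)

denoiser = TinyDenoiser()
print(restoration_diffusion_loss(denoiser, torch.randn(2, 150, 64), torch.randn(2, 150, 64)))
```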


Paper and Project Links

PDF

Summary

To address the limitations of traditional speech enhancement methods, this paper proposes Hi-ResLDM, a novel generative model based on latent diffusion that is designed to remove multiple distortions and restore speech recordings to near studio quality at a 48kHz sampling rate. Compared with state-of-the-art methods built on GAN and Conditional Flow Matching (CFM) components, Hi-ResLDM shows superior performance in regenerating high-frequency-band details; it not only excels on non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it an ideal solution for high-resolution speech restoration.

Key Takeaways

  1. Traditional speech enhancement methods usually focus on a single type of distortion, whereas Hi-ResLDM, a generative model based on latent diffusion, aims to remove multiple distortions.
  2. Hi-ResLDM can restore speech recordings to studio quality at a sampling rate of 48kHz.
  3. Compared with state-of-the-art methods built on GAN and CFM components, Hi-ResLDM performs better at regenerating high-frequency-band details.
  4. Hi-ResLDM excels on non-intrusive metrics.
  5. Hi-ResLDM is consistently preferred in human evaluation.
  6. Hi-ResLDM also performs competitively on intrusive evaluations.



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!