
Speech


⚠️ 以下所有内容总结均由大语言模型生成,如有错误仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-10-18 更新

OmniMotion: Multimodal Motion Generation with Continuous Masked Autoregression

Authors:Zhe Li, Weihao Yuan, Weichao Shen, Siyu Zhu, Zilong Dong, Chang Xu

Whole-body multi-modal human motion generation poses two primary challenges: creating an effective motion generation mechanism and integrating various modalities, such as text, speech, and music, into a cohesive framework. Unlike previous methods that usually employ discrete masked modeling or autoregressive modeling, we develop a continuous masked autoregressive motion transformer, where a causal attention is performed considering the sequential nature within the human motion. Within this transformer, we introduce a gated linear attention and an RMSNorm module, which drive the transformer to pay attention to the key actions and suppress the instability caused by either the abnormal movements or the heterogeneous distributions within multi-modalities. To further enhance both the motion generation and the multimodal generalization, we employ the DiT structure to diffuse the conditions from the transformer towards the targets. To fuse different modalities, AdaLN and cross-attention are leveraged to inject the text, speech, and music signals. Experimental results demonstrate that our framework outperforms previous methods across all modalities, including text-to-motion, speech-to-gesture, and music-to-dance. The code of our method will be made public.

全身多模态人体运动生成主要面临两大挑战:创建有效的运动生成机制,以及将文本、语音、音乐等不同模态集成到一个协调的框架中。不同于通常采用离散掩蔽建模或自回归建模的方法,我们开发了一种连续掩蔽自回归运动Transformer,其中考虑到人体运动的序列性采用了因果注意力。在该Transformer中,我们引入门控线性注意力和RMSNorm模块,促使模型关注关键动作,并抑制由多模态中的异常动作或异质分布导致的不稳定性。为了进一步增强运动生成和多模态泛化能力,我们采用DiT结构将来自Transformer的条件向目标扩散。为了融合不同的模态,利用AdaLN和交叉注意力来注入文本、语音和音乐信号。实验结果表明,我们的框架在包括文本到运动、语音到手势和音乐到舞蹈的所有模态上都优于以前的方法。我们方法的代码将公开。
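
下面给出一个极简的 PyTorch 草图,示意摘要中提到的 RMSNorm 与 AdaLN 式条件注入的常见写法,帮助理解“用条件向量调制归一化后的特征”这一机制;其中的维度、模块名(AdaLNBlock 等)均为说明用的假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm:仅用均方根做缩放,不做均值中心化。"""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class AdaLNBlock(nn.Module):
    """AdaLN 式条件注入:由条件向量(文本/语音/音乐嵌入)回归出 scale/shift。"""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # cond: (batch, cond_dim),在序列维度上广播
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

if __name__ == "__main__":
    x = torch.randn(2, 16, 256)   # (batch, 帧数, 运动特征维度) —— 假设的维度
    cond = torch.randn(2, 512)    # 模态条件嵌入(假设)
    print(AdaLNBlock(256, 512)(x, cond).shape)   # torch.Size([2, 16, 256])
```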

论文及项目相关链接

PDF

Summary
本文提出一种全新的全身多模态人类动作生成方法,包括连续遮蔽自回归动作转换器(Continuous Masked Autoregressive Motion Transformer),能够解决动作生成机制和多模态集成问题。通过引入因果注意力机制、门控线性注意力模块和RMSNorm模块,解决了关键动作关注与多模态异常抑制问题。使用DiT结构促进条件扩散和模态融合,通过AdaLN和跨注意力注入文本、语音和音乐信号。实验证明,该方法在多模态动作生成上优于现有技术。

Key Takeaways

  1. 提出连续遮蔽自回归动作转换器,针对全身多模态人类动作生成进行高效建模。
  2. 利用因果注意力机制处理动作序列的连续性。
  3. 门控线性注意力模块和RMSNorm模块增强了对关键动作的关注并抑制了多模态异常。
  4. 使用DiT结构促进条件扩散,提升动作生成和跨模态泛化能力。
  5. AdaLN和跨注意力技术用于融合不同模态(文本、语音、音乐)。
  6. 实验证明该方法在多种模态上的优越性,包括文本转动作、语音转手势和音乐转舞蹈。

Cool Papers

点此查看论文截图

TRI-DEP: A Trimodal Comparative Study for Depression Detection Using Speech, Text, and EEG

Authors:Annisaa Fitri Nurfidausi, Eleonora Mancini, Paolo Torroni

Depression is a widespread mental health disorder, yet its automatic detection remains challenging. Prior work has explored unimodal and multimodal approaches, with multimodal systems showing promise by leveraging complementary signals. However, existing studies are limited in scope, lack systematic comparisons of features, and suffer from inconsistent evaluation protocols. We address these gaps by systematically exploring feature representations and modelling strategies across EEG, together with speech and text. We evaluate handcrafted features versus pre-trained embeddings, assess the effectiveness of different neural encoders, compare unimodal, bimodal, and trimodal configurations, and analyse fusion strategies with attention to the role of EEG. Consistent subject-independent splits are applied to ensure robust, reproducible benchmarking. Our results show that (i) the combination of EEG, speech and text modalities enhances multimodal detection, (ii) pretrained embeddings outperform handcrafted features, and (iii) carefully designed trimodal models achieve state-of-the-art performance. Our work lays the groundwork for future research in multimodal depression detection.

抑郁症是一种普遍的心理健康障碍,但其自动检测仍然具有挑战性。先前的研究已经探索了单模态和多模态的方法,其中多模态系统通过利用互补信号显示出良好的前景。然而,现有研究在范围上有限,缺乏对特征的系统比较,并且受到评估协议不一致的影响。我们通过系统地探索脑电图(EEG)、语音和文本的特征表示和建模策略来填补这些空白。我们比较了手工特征与预训练嵌入,评估了不同神经编码器的有效性,比较了单模态、双模态和三模态配置,并在关注EEG作用的同时分析了融合策略。我们采用一致的受试者独立划分,以确保稳健、可复现的基准测试。我们的结果表明:(i)结合EEG、语音和文本模态能增强多模态检测效果;(ii)预训练嵌入优于手工特征;(iii)精心设计的三模态模型达到了最先进的性能。我们的工作为多模态抑郁症检测的未来研究奠定了基础。
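
摘要强调采用一致的受试者独立划分(subject-independent splits)以避免身份信息泄漏。下面用 scikit-learn 的 GroupKFold 给出一个与论文实现无关的示意草图,其中的样本数、特征与受试者数量均为假设。

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 假设的数据:每条样本对应一段 EEG/语音/文本特征与一个受试者 ID
n_samples = 120
X = np.random.randn(n_samples, 64)           # 占位特征
y = np.random.randint(0, 2, size=n_samples)  # 抑郁/非抑郁标签(占位)
subject_ids = np.repeat(np.arange(12), 10)   # 12 名受试者,每人 10 条样本

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=subject_ids)):
    # 同一受试者的样本只会出现在训练集或测试集之一,避免身份信息泄漏
    assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```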

论文及项目相关链接

PDF

摘要

关于抑郁症的自动检测是一项重要的挑战,虽然已经有许多关于单模态和多模态方法的研究,但现有的研究仍存在局限性,缺乏系统的特征比较和一致的评价协议。本研究系统地探讨了EEG、语音和文本的特征表示和建模策略,评估了手工特征和预训练嵌入的效果,比较了单模态、双模态和三模态配置,并分析了融合策略中EEG的作用。研究采用一致的受试者独立分割方法,以确保稳健、可重复性的评估。结果表明,多模态检测中结合EEG、语音和文本模态的效果更佳,预训练嵌入优于手工特征,精心设计的三模态模型达到最佳性能。本研究为未来的多模态抑郁症检测研究奠定了基础。

要点摘要

一、多模态检测融合EEG、语音和文本模态可以提高抑郁症检测的准确性。
二、预训练嵌入特征在抑郁症检测中表现优于手工特征。
三、精心设计的三模态模型在多模态抑郁症检测中达到最佳性能。
四、本研究系统地探讨了不同神经编码器的有效性,并比较了单模态、双模态和三模态配置。
五、融合策略在抑郁症检测中起着重要作用,其中EEG的作用不可忽视。
六、采用一致的受试者独立分割方法,确保评估的稳健性和可重复性。
七、本研究为未来的多模态抑郁症检测研究提供了重要的参考和基础。

Cool Papers

点此查看论文截图

RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Authors:Qing Yang, Zhenghao Liu, Junxin Wang, Yangfan Du, Pengcheng Huang, Tong Xiao

Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, incorporating a Reinforcement Learning from AI Feedback (RLAIF) mechanism to employ Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to respectively judge semantic accuracy and prosodic-emotional label alignment as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the Libri Speech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

文本转语音合成在中性语音方面已经达到了接近人类的质量,但在情感表达方面仍存在挑战。现有方法通常依赖于昂贵的情感标注,或优化无法捕捉语音情感表达力和感知自然度的间接目标,导致生成的语音虽然准确但情感平淡。为了解决这些挑战,我们提出了RLAIF-SPA框架,引入基于AI反馈的强化学习(RLAIF)机制,采用自动语音识别(ASR)和大型语言模型(LLM)技术分别判断语义准确性和韵律-情感标签对齐,作为优化情感表达力和可懂度的直接奖励。具体来说,它利用韵律标签对齐来提升表达质量,沿结构、情感、语速和语调四个细粒度维度同时考虑语义准确性与韵律-情感对齐。此外,它还结合了语义准确性反馈,以确保生成清晰准确的语音。在Libri Speech数据集上的实验表明,RLAIF-SPA优于Chat-TTS:WER降低了26.1%,SIM-O提高了9.1%,人工评估提升超过10%。
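
为帮助理解“以 ASR 衡量语义准确性、以 LLM 评审衡量韵律-情感标签对齐,并将两者合成直接奖励”的思路,下面给出一个高度简化的奖励函数草图:WER 用开源库 jiwer 计算,llm_judge_alignment 是假设的占位函数,权重 w_sem、w_prosody 亦为示例值,均非论文的官方实现。

```python
from jiwer import wer

def llm_judge_alignment(audio_path: str, style_prompt: str) -> float:
    """占位函数:假设由 LLM 按结构、情感、语速、语调四个维度打分并归一化到 0~1。
    这里直接返回常数以保证草图可运行;真实流程需接入具体的 LLM 评审。"""
    return 0.5

def rlaif_reward(reference_text: str, asr_hypothesis: str,
                 audio_path: str, style_prompt: str,
                 w_sem: float = 0.5, w_prosody: float = 0.5) -> float:
    # 语义准确性:1 - WER(由 ASR 转写与参考文本计算)
    semantic_score = max(0.0, 1.0 - wer(reference_text, asr_hypothesis))
    # 韵律-情感对齐:由 LLM 评审给出
    prosody_score = llm_judge_alignment(audio_path, style_prompt)
    return w_sem * semantic_score + w_prosody * prosody_score

if __name__ == "__main__":
    r = rlaif_reward("today is a sunny day", "today is sunny day",
                     audio_path="sample.wav", style_prompt="开心、语速稍快")
    print(r)
```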

论文及项目相关链接

PDF

Summary
情感化语音合成仍面临挑战,现有方法依赖昂贵的情感标注或优化间接目标,无法捕捉情感表达力和语音的感知自然度。为解决这些问题,本文提出RLAIF-SPA框架,采用基于AI反馈的强化学习(RLAIF)机制,结合语音识别和大型语言模型技术,分别判断语义准确性和韵律情感标签对齐,作为优化情感表达力和可懂度的直接奖励。实验表明,RLAIF-SPA在Libri Speech数据集上的表现优于Chat-TTS,字词错误率降低26.1%,SIM-O得分提高9.1%,人工评估得分提高超过10%。

Key Takeaways

  1. 情感化语音合成是一个挑战,现有方法难以捕捉情感的表达力和语音的自然感知。
  2. RLAIF-SPA框架采用基于AI反馈的强化学习(RLAIF)机制来解决这个问题。
  3. 框架结合语音识别和大型语言模型技术,判断语义准确性和韵律情感标签对齐。
  4. RLAIF-SPA框架通过考虑语义准确性和韵律情感对齐,在四个精细粒度(结构、情感、速度和音调)上提高表达质量。
  5. 框架还结合了语义准确性反馈,确保生成的语音清晰准确。
  6. 在Libri Speech数据集上的实验表明,RLAIF-SPA优于Chat-TTS,字词错误率降低26.1%,SIM-O得分提高9.1%。

Cool Papers

点此查看论文截图

Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks

Authors:Supriti Sinhamahapatra, Jan Niehues

State-of-the-art (SOTA) Automatic Speech Recognition (ASR) systems primarily rely on acoustic information while disregarding additional multi-modal context. However, visual information are essential in disambiguation and adaptation. While most work focus on speaker images to handle noise conditions, this work also focuses on integrating presentation slides for the use cases of scientific presentation. In a first step, we create a benchmark for multi-modal presentation including an automatic analysis of transcribing domain-specific terminology. Next, we explore methods for augmenting speech models with multi-modal information. We mitigate the lack of datasets with accompanying slides by a suitable approach of data augmentation. Finally, we train a model using the augmented dataset, resulting in a relative reduction in word error rate of approximately 34%, across all words and 35%, for domain-specific terms compared to the baseline model.

当前最先进的自动语音识别(ASR)系统主要依赖声学信息,而忽略了额外的多模态上下文。然而,视觉信息对于消歧和自适应至关重要。大多数工作专注于利用说话人图像来应对噪声条件,而本工作还关注在科学报告场景中整合演示幻灯片。首先,我们为多模态演示创建了一个基准,其中包含对领域特定术语转录的自动分析。接下来,我们探索了用多模态信息增强语音模型的方法,并通过合适的数据增强手段缓解带幻灯片数据集稀缺的问题。最后,我们使用增强数据集训练模型,与基线模型相比,所有词的词错误率相对降低约34%,领域特定术语的词错误率相对降低约35%。

论文及项目相关链接

PDF

Summary

本文探讨了多模态信息在自动语音识别(ASR)系统中的应用,特别是视觉信息在解决歧义和适应不同场景中的作用。文章提出了一种集成演讲幻灯片的多模态演示方法,建立了一个数据集用于评估领域特定术语的自动分析。通过数据增强技术弥补了缺乏附带幻灯片的数据集问题,并使用增强数据集训练的模型在词错误率方面相比基线模型降低了约34%(针对所有单词)和35%(针对领域特定术语)。

Key Takeaways

  1. 多模态信息在ASR系统中具有重要作用,尤其是视觉信息用于解决歧义和适应不同场景。
  2. 文章提出了一种新的多模态演示方法,集成了演讲幻灯片,为科学演讲等场景提供辅助。
  3. 建立了一个数据集用于评估领域特定术语的自动分析。
  4. 采用数据增强技术来弥补缺乏附带幻灯片的数据集的问题。
  5. 使用增强数据集训练的模型相比基线模型在词错误率方面有了显著降低。
  6. 该方法不仅关注噪声条件下的处理,也注重特定领域的术语识别。

Cool Papers

点此查看论文截图

Switchboard-Affect: Emotion Perception Labels from Conversational Speech

Authors:Amrit Romana, Jaya Narain, Tien Dung Tran, Andrea Davis, Jason Fong, Ramya Rasipuram, Vikramjit Mitra

Understanding the nuances of speech emotion dataset curation and labeling is essential for assessing speech emotion recognition (SER) model potential in real-world applications. Most training and evaluation datasets contain acted or pseudo-acted speech (e.g., podcast speech) in which emotion expressions may be exaggerated or otherwise intentionally modified. Furthermore, datasets labeled based on crowd perception often lack transparency regarding the guidelines given to annotators. These factors make it difficult to understand model performance and pinpoint necessary areas for improvement. To address this gap, we identified the Switchboard corpus as a promising source of naturalistic conversational speech, and we trained a crowd to label the dataset for categorical emotions (anger, contempt, disgust, fear, sadness, surprise, happiness, tenderness, calmness, and neutral) and dimensional attributes (activation, valence, and dominance). We refer to this label set as Switchboard-Affect (SWB-Affect). In this work, we present our approach in detail, including the definitions provided to annotators and an analysis of the lexical and paralinguistic cues that may have played a role in their perception. In addition, we evaluate state-of-the-art SER models, and we find variable performance across the emotion categories with especially poor generalization for anger. These findings underscore the importance of evaluation with datasets that capture natural affective variations in speech. We release the labels for SWB-Affect to enable further analysis in this domain.

理解语音情感数据集在整理和标注上的细微差别,对于评估语音情感识别(SER)模型在真实应用中的潜力至关重要。大多数训练和评估数据集包含表演性或半表演性的语音(例如播客语音),其中的情感表达可能被夸大或被有意修改。此外,基于大众感知进行标注的数据集往往对提供给标注者的指南缺乏透明度。这些因素使得理解模型性能并定位需要改进之处变得困难。为了填补这一空白,我们将Switchboard语料库确定为自然会话语音的理想来源,并培训众包标注者为该数据集标注分类情感(愤怒、蔑视、厌恶、恐惧、悲伤、惊讶、快乐、温柔、平静和中性)以及维度属性(激活度、效价和支配性)。我们将该标签集称为Switchboard-Affect(SWB-Affect)。在这项工作中,我们详细介绍了我们的方法,包括提供给标注者的定义,并分析了可能影响其感知的词汇和副语言线索。此外,我们评估了最先进的SER模型,发现其在各情绪类别上的性能差异明显,对愤怒的泛化能力尤其差。这些发现强调了使用能捕捉语音中自然情感变化的数据集进行评估的重要性。我们发布SWB-Affect的标签,以推动该领域的进一步分析。

论文及项目相关链接

PDF 2025 13th International Conference on Affective Computing and Intelligent Interaction (ACII) https://github.com/apple/ml-switchboard-affect

Summary

本文主要探讨了语音情感识别(SER)模型在实际应用中评估的重要性,以及情感数据集收集和标注的复杂性。文章介绍了使用Switchboard语料库构建新数据集Switchboard-Affect(SWB-Affect)的过程,该数据集包含自然对话语音,并要求标注人员对其进行情感类别和维度属性的标注。通过对当前先进的SER模型进行评估,发现模型在不同情感类别上的表现存在差距,特别是在识别愤怒情绪方面的泛化性能较差。本文强调了使用捕捉自然情感变化的语音数据集进行评估的重要性,并发布了SWB-Affect的标签供进一步研究分析使用。

Key Takeaways

  1. 语音情感识别(SER)模型在实际应用中的评估至关重要。
  2. 现有训练与评估数据集往往包含夸张或故意修改的情感表达形式,导致模型性能难以准确评估。
  3. Switchboard语料库被用作构建新的数据集Switchboard-Affect(SWB-Affect),以包含自然对话语音为特色。
  4. SWB-Affect要求标注人员按情感类别(如愤怒、悲伤等)和维度属性(激活度、效价、支配性)进行标注。
  5. 当前先进的SER模型在情感类别上表现存在差异,特别是愤怒情绪的识别泛化性能较差。
  6. 使用捕捉自然情感变化的语音数据集进行评估至关重要。

Cool Papers

点此查看论文截图

InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

Authors:Wenwen Tong, Hewei Guo, Dongchuan Ran, Jiangnan Chen, Jiefan Lu, Kaibin Wang, Keqiang Li, Xiaoxu Zhu, Jiakui Li, Kehan Li, Xueheng Li, Lumin Li, Chenxu Guo, Jiasheng Zhou, Jiandong Chen, Xianye Wu, Jiahao Wang, Silei Wu, Lei Chen, Hanming Deng, Yuxuan Song, Dinghao Zhou, Guiping Zhong, Ken Zheng, Shiyin Kang, Lewei Lu

We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model’s ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to the much larger model like Qwen2.5-Omni-7B on general benchmarks, and it can retain 97% of the performance of the InteractiveOmni-8B while utilizing only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.

我们介绍InteractiveOmni,这是一个统一的开源全模态大语言模型,用于视听多轮交互,参数规模从4B到8B,旨在通过提供全面的全模态理解和语音生成能力来引领轻量级模型领域。为实现这一目标,我们将视觉编码器、音频编码器、大型语言模型和语音解码器集成到一个统一的模型中,用于理解和生成任务。我们设计了多阶段训练策略以确保强大的跨模态能力,包括面向全模态理解的预训练,以及随后针对语音对话和视听交互的后训练。为了实现类似人类的长期对话能力,我们精心构建了多轮训练数据集,以增强模型处理复杂多轮交互的能力。为了有效评估多轮记忆和语音交互能力,我们构建了多模态多轮记忆基准和多轮语音交互基准。实验表明,InteractiveOmni显著优于领先的开源模型,并提供更智能的多轮视听体验,尤其在长期记忆能力方面。值得注意的是,InteractiveOmni-4B在通用基准测试上的表现与Qwen2.5-Omni-7B等更大的模型相当,并能在仅使用50%模型大小的情况下保留InteractiveOmni-8B约97%的性能。InteractiveOmni在图像、音频、视频理解和语音生成任务上均达到同等规模模型的最先进水平,是面向下一代智能交互系统的、易于获取的开源基础。

论文及项目相关链接

PDF

摘要

本文介绍了InteractiveOmni,一个统一、开源的跨模态大语言模型,适用于音频视觉多轮交互。该模型从4B到8B参数,设计用于轻型模型领域,提供全面的跨模态理解和语音生成能力。通过整合视觉编码器、音频编码器、大型语言模型和语音解码器,实现理解和生成任务的统一模型。采用多阶段训练策略,确保跨模态能力稳健,包括先对跨模态理解的预训练,然后进行语音对话和视听交互的后训练。为了具备类似人类的长效对话能力,精心策划了多轮训练数据集,增强模型处理复杂和多轮交互的能力。为有效评估多轮记忆和语音交互能力,构建了多模态多轮记忆基准和多轮语音交互基准。实验表明,InteractiveOmni显著优于领先的开源模型,提供了更智能的多轮视听体验,尤其在长期记忆能力方面。值得注意的是,InteractiveOmni-4B在一般基准测试上与更大的模型如Qwen2.5-Omni-7B相当,并且能在仅使用50%模型大小的情况下保留InteractiveOmni-8B的97%性能。InteractiveOmni在图像、音频、视频理解和语音生成任务上达到了同类模型的最优结果,是下一代智能交互系统的开放、可访问基础。

关键见解

  1. InteractiveOmni是一个跨模态的大型语言模型,旨在实现音频视觉多轮交互。
  2. 模型采用统一框架,整合了视觉、音频、语言和语音解码器。
  3. 通过多阶段训练策略确保跨模态能力的稳健性。
  4. 模型具备人类长期对话能力,通过精心策划的多轮训练数据集实现。
  5. 建立了多模态多轮记忆基准和多轮语音交互基准进行评估。
  6. InteractiveOmni在性能上显著优于其他开源模型,特别是在长期记忆能力方面。

Cool Papers

点此查看论文截图

UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Authors:Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang

Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each “proto-expert” without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html

统一多模态模型的最新进展显示出朝着全面内容生成发展的明确趋势。然而,听觉领域仍然是一个重大挑战,音乐与语音往往被孤立地研究,阻碍了通用音频合成的进步。这种割裂源于固有的任务冲突和严重的数据不平衡,阻碍了真正统一的音频生成模型的发展。为了应对这一挑战,我们提出了UniMoE-Audio,这是一个基于新型动态容量专家混合(MoE)框架的统一语音和音乐生成模型。在结构上,UniMoE-Audio引入了用于动态分配专家数量的Top-P路由策略,以及由面向特定领域知识的路由专家、面向领域无关特征的共享专家和用于自适应跳过计算的空专家组成的混合专家设计。为了解决数据不平衡问题,我们引入了三阶段训练课程:1)独立专家训练利用原始数据集,在互不干扰的情况下向每个“原型专家”灌输特定领域知识;2)MoE集成与预热将这些专家纳入UniMoE-Audio架构,并使用平衡数据集的子集预热门控模块和共享专家;3)协同联合训练在完全平衡的数据集上对整个模型进行端到端训练,以促进更强的跨域协同。大量实验表明,UniMoE-Audio不仅在主要的语音和音乐生成基准测试中达到最先进的性能,还展现出卓越的协同学习能力,缓解了朴素联合训练中常见的性能下降问题。我们的研究结果突显了专门化MoE架构和精心设计的训练策略在推进通用音频生成领域的巨大潜力。主页链接:https://mukioxun.github.io/Uni-MoE-site/home.html
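
下面用一个简短的 PyTorch 草图示意“Top-P 路由”的一种常见实现思路:按门控概率降序累加,保留使累计概率首次达到阈值 p 的最小专家集合,因而每个 token 激活的专家数量是动态的。这只是基于摘要描述的假设性示意,未包含共享专家与空专家等完整设计,并非 UniMoE-Audio 的官方实现。

```python
import torch
import torch.nn.functional as F

def top_p_routing(gate_logits: torch.Tensor, p: float = 0.7) -> torch.Tensor:
    """gate_logits: (tokens, num_experts)。返回稀疏化并重新归一化的路由权重。"""
    probs = F.softmax(gate_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # 保留“加入该专家之前累计概率仍小于 p”的专家,至少保留一个
    keep = (cumulative - sorted_probs) < p
    sorted_probs = sorted_probs * keep
    weights = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    # 重新归一化,使被激活专家的权重之和为 1
    return weights / weights.sum(dim=-1, keepdim=True)

if __name__ == "__main__":
    logits = torch.randn(4, 8)            # 4 个 token、8 个专家(假设)
    w = top_p_routing(logits, p=0.7)
    print((w > 0).sum(dim=-1))            # 每个 token 动态激活的专家数
```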

论文及项目相关链接

PDF

Summary

随着近期统一多模态模型的进展,内容生成趋向全面,但音频领域的挑战依旧显著,音乐和语音的孤立开发阻碍了通用音频合成的进展。UniMoE-Audio模型在动态容量混合专家(MoE)框架下统一了语音和音乐生成,通过Top-P路由策略实现专家数量的动态分配,并设计混合专家架构分别处理领域特定知识和领域无关特征。为应对数据不平衡问题,采用三阶段训练课程。实验证明,UniMoE-Audio不仅在语音和音乐生成基准测试中达到最佳性能,还展现出优越的协同学习能力,缓解联合训练中的性能下降。

Key Takeaways

  1. 最近的多模态模型进展推动了内容生成的全面性,但音频生成仍面临挑战。
  2. 音乐和语音的孤立开发阻碍了通用音频合成的进展。
  3. UniMoE-Audio模型在动态容量MoE框架下实现了语音和音乐的统一生成。
  4. UniMoE-Audio采用Top-P路由策略和混合专家架构设计。
  5. 三阶段训练课程解决了数据不平衡问题。
  6. UniMoE-Audio在基准测试中表现优秀,达到最佳性能。
  7. UniMoE-Audio展现出优越的协同学习能力,缓解联合训练中的性能下降。

Cool Papers

点此查看论文截图

Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Authors:Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma

While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG’s impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.

虽然文本转语音(TTS)系统可以通过自然语言提示实现细粒度的情感表达控制,但当所期望的情感(风格提示)与文本语义内容相冲突时,就会出现一个重大的挑战。这种不匹配通常会导致语音听起来不自然,从而破坏了实现精细情感控制的目标。无分类器引导(CFG)是增强提示对齐的关键技术,但其应用于自回归(AR)TTS模型的情况仍被较少探索,这可能导致音频质量下降。本文直接解决了AR TTS模型中风格与内容不匹配这一挑战,提出了一种自适应CFG方案,该方案可以根据使用大型语言模型或自然语言推理模型检测到的不同级别的不匹配进行调整。该解决方案基于对CFG在最新AR TTS模型中的情感表达影响进行全面分析。我们的结果表明,所提出的自适应CFG方案在保持音频质量和清晰度的同时,提高了AR TTS模型的情感表达能力。
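
无分类器引导(CFG)的标准形式是在无条件与有条件预测之间做外推;摘要中的“自适应”思路是让引导强度随检测到的风格-内容不匹配程度变化。下面给出一个 logits 层面的假设性草图:mismatch 分数假定已由 LLM 或 NLI 模型给出并归一化到 0~1,线性映射与 w_min、w_max 的取值均为假设,并非论文的官方方案。

```python
import torch

def adaptive_cfg_logits(logits_cond: torch.Tensor,
                        logits_uncond: torch.Tensor,
                        mismatch: float,
                        w_min: float = 1.0, w_max: float = 3.0) -> torch.Tensor:
    """标准 CFG: l = l_uncond + w * (l_cond - l_uncond)。
    此处让引导强度 w 随风格-内容不匹配分数(0~1)线性增大,属于示意性设计。"""
    w = w_min + (w_max - w_min) * float(mismatch)
    return logits_uncond + w * (logits_cond - logits_uncond)

if __name__ == "__main__":
    lc, lu = torch.randn(1, 1024), torch.randn(1, 1024)   # 假设的词表大小
    print(adaptive_cfg_logits(lc, lu, mismatch=0.8).shape)
```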

论文及项目相关链接

PDF Submitted to ICASSP 2026

Summary

本文探讨了自回归(AR)文本转语音(TTS)模型中的风格与内容不匹配问题。当期望的情感(风格提示)与文本语义内容冲突时,会导致语音不自然。本文提出了一种自适应的无分类器引导(CFG)方案,根据检测到的不匹配程度进行调整,以提高AR TTS模型的情感表现力,同时保持音频质量和可懂度。

Key Takeaways

  1. 当文本转语音(TTS)系统中的期望情感与文本语义内容冲突时,会产生不自然的声音。
  2. 无分类器引导(CFG)是增强提示对齐的关键技术。
  3. AR TTS模型中CFG的应用仍存在不足,可能导致音频质量下降。
  4. 本文提出了一种自适应CFG方案,该方案可以根据检测到的不匹配程度进行调整。
  5. 该方案通过使用大型语言模型或自然语言推理模型来衡量不匹配程度。
  6. 提出的自适应CFG方案在提高AR TTS模型的情感表现力的同时,维持了音频质量和可理解性。

Cool Papers

点此查看论文截图

Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses

Authors:Sungnyun Kim, Kangwook Jang, Sungwoo Cho, Joon Son Chung, Hoirin Kim, Se-Young Yun

This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidences directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for an accurate correction. Under various corruption scenarios, our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.

本文为视听语音识别(AVSR)中的生成式错误校正(GER)框架引入了一种新范式,可直接在语言空间中对模态特定的证据进行推理。我们的框架DualHyp使大型语言模型(LLM)能够综合来自独立的自动语音识别(ASR)和视觉语音识别(VSR)模型的N-best假设。为了最大化DualHyp的有效性,我们进一步引入了RelPrompt,这是一种噪声感知的指导机制,为LLM提供基于模态的提示。RelPrompt给出各模态流的时序可靠性,引导模型在ASR与VSR假设之间动态切换关注点,以实现准确校正。在各种音频/视频受损场景下,与标准ASR基线相比,我们的框架在LRS2基准测试上实现了高达57.7%的错误率增益,而单流GER方法仅获得约10%的增益。为了促进基于DualHyp框架的研究,我们在https://github.com/sungnyun/dualhyp发布了代码以及包含ASR和VSR假设的数据集。
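
下面是一个纯字符串层面的示意草图(非论文实现),说明如何把 ASR 与 VSR 各自的 N-best 假设连同可靠性提示(RelPrompt 的思想)拼接成供 LLM 纠错的提示词;提示词格式、字段名与可靠性分数的来源均为假设。

```python
def build_dualhyp_prompt(asr_nbest, vsr_nbest, asr_reliability, vsr_reliability):
    """asr_nbest / vsr_nbest: N-best 转写列表;reliability: 0~1 的可靠性分数(假设由噪声估计得到)。"""
    lines = ["请根据两路识别假设及其可靠性,输出最可能正确的转写:"]
    lines.append(f"[音频流可靠性 {asr_reliability:.2f}] ASR N-best:")
    lines += [f"  {i + 1}. {hyp}" for i, hyp in enumerate(asr_nbest)]
    lines.append(f"[视频流可靠性 {vsr_reliability:.2f}] VSR N-best:")
    lines += [f"  {i + 1}. {hyp}" for i, hyp in enumerate(vsr_nbest)]
    lines.append("纠错后的转写:")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_dualhyp_prompt(
        ["nice to meet you", "nice to mit you"],
        ["nice to meet you", "rice to meet you"],
        asr_reliability=0.35, vsr_reliability=0.90))
```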

论文及项目相关链接

PDF Preprint work

摘要

本文介绍了一种新的生成式错误校正(GER)框架,该框架用于视听语音识别(AVSR)。提出的DualHyp框架使大型语言模型(LLM)能够直接在语言空间中推理模态特定证据,并组合来自自动语音识别(ASR)和视觉语音识别(VSR)模型的独立N-best假设。为了最大化DualHyp的有效性,进一步引入了RelPrompt,这是一种噪声感知指导机制,为LLM提供基于模态的提示。RelPrompt提供了各模态流的时序可靠性,指导模型在ASR和VSR假设之间动态切换,以实现准确校正。在各种音频/视频受损场景下,我们的框架在LRS2基准测试上实现了高达57.7%的错误率增益,而单流GER方法仅实现10%的增益。

要点

  1. 提出了新的生成式错误校正(GER)框架,用于视听语音识别(AVSR)。
  2. DualHyp框架使大型语言模型(LLM)能够直接处理模态特定证据并组合N-best假设。
  3. RelPrompt机制提供模态基础提示,指导模型在ASR和VSR假设之间动态切换。
  4. 在各种场景下,DualHyp框架实现了显著的性能提升,错误率降低高达57.7%。
  5. 相较于单流GER方法,DualHyp框架性能更优。
  6. 公开了包含ASR和VSR假设的代码和数据集,以推动研究。
  7. DualHyp框架对于复杂环境下的语音识别错误校正具有潜在应用价值。

Cool Papers

点此查看论文截图

Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models

Authors:Tsung-En Lin, Kuan-Yi Lee, Hung-Yi Lee

Large Audio-Language Models and Multi-Modal Large Language Models have demonstrated strong capabilities in tasks such as Audio Question Answering (AQA), Audio Captioning, and Automatic Speech Recognition (ASR). However, there is growing evidence that these models can hallucinate about the content of the audio. To address this issue, we probe the models’ internal states and propose Adaptive Vector Steering (AVS), a method that better grounds generation in audio content. We also identify a strong correlation between output correctness and internal representations. Experiments show consistent performance gains across two models and two benchmarks. On the Audio Hallucination QA dataset, our method boosts the F1-score of Gemma from 0.550 to 0.619 and Qwen from 0.626 to 0.632. Furthermore, our method increases the accuracy of Qwen on MMAU from 0.548 to 0.592, marking an 8% relative increase. To the best of our knowledge, this is the first work to apply vector steering to mitigate hallucination in audio.

大型音频语言模型和多模态大型语言模型在音频问答(AQA)、音频描述和自动语音识别(ASR)等任务中表现出了强大的能力。然而,有越来越多的证据表明,这些模型会对音频内容产生幻觉。为了解决这个问题,我们探索了模型的内部状态,并提出了自适应向量转向(AVS)方法,更好地将生成与音频内容相结合。我们还发现输出正确性与内部表示之间存在强烈的相关性。实验显示,我们的方法在两个模型和两个基准测试上均取得了性能提升。在音频幻觉问答数据集上,我们的方法将Gemma的F1分数从0.550提高到0.619,将Qwen的分数从0.626提高到0.632。此外,我们的方法还将Qwen在MMAU上的准确率从0.548提高到0.592,相对提高了8%。据我们所知,这是首次将向量转向应用于减轻音频幻觉的工作。
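
向量引导(vector steering)的常见做法是在推理时向某一层的隐藏状态加上缩放后的引导向量。下面用 PyTorch 的 forward hook 给出一个通用示意草图:steering_vector 如何构造、缩放系数 alpha 如何随层或输入自适应,摘要并未给出细节,此处均为假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

def register_steering_hook(layer: nn.Module, steering_vector: torch.Tensor, alpha: float):
    """在指定层的输出上加 alpha * v;返回 handle 以便之后 remove()。"""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

if __name__ == "__main__":
    # 用一个小的线性层代替真实模型的某个中间层,仅演示 hook 的效果
    layer = nn.Linear(8, 8)
    v = torch.randn(8)
    handle = register_steering_hook(layer, v, alpha=0.5)
    x = torch.randn(2, 8)
    print(layer(x).shape)
    handle.remove()
```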

论文及项目相关链接

PDF Note: This preprint is a version of the paper submitted to ICASSP 2026. The author list here includes contributors who provided additional supervision and guidance. The official ICASSP submission may differ slightly in author composition

Summary

大型音频语言模型和多模态大型语言模型在音频问答、音频标注和自动语音识别等任务中展现出强大的能力。然而,有越来越多的证据表明这些模型会对音频内容产生幻觉。为解决这一问题,本文探讨了模型的内部状态,并提出了一种自适应向量转向(AVS)方法,更好地将生成内容与音频内容相结合。同时,本文发现输出正确性与内部表征之间存在强相关性。实验表明,该方法在两种模型和两种基准测试上的性能均有所提高。在音频幻觉问答数据集上,我们的方法将Gemma的F1分数从0.550提高到0.619,将Qwen的分数从0.626提高到0.632。此外,我们的方法将Qwen在MMAU上的准确率从0.548提高到0.592,相对提高了8%。据我们所知,这是首次将向量转向应用于缓解音频幻觉问题。

Key Takeaways

  1. 大型音频语言模型和多模态语言模型在多项任务中表现出色,但存在对音频内容产生幻觉的问题。
  2. 提出了自适应向量转向(AVS)方法,以改善模型在音频内容生成方面的表现。
  3. 发现输出正确性与模型内部表征之间存在强相关性。
  4. AVS方法在多个模型和基准测试上提高了性能。
  5. 在音频幻觉问答数据集上,AVS方法提高了Gemma和Qwen的F1分数。
  6. AVS方法提高了Qwen在MMAU上的准确率,相对提高了8%。

Cool Papers

点此查看论文截图

Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models

Authors:Prasenjit K Mudi, Anshi Sachan, Dahlia Devapriya, Sheetal Kalyani

Whisper models have achieved remarkable progress in speech recognition; yet their large size remains a bottleneck for deployment on resource-constrained edge devices. This paper proposes a framework to design fine-tuned variants of Whisper which address the above problem. Structured sparsity is enforced via the Sparse Group LASSO penalty as a loss regularizer, to reduce the number of FLOating Point operations (FLOPs). Further, a weight statistics aware pruning algorithm is proposed. We also design our custom text normalizer for WER evaluation. On Common Voice 11.0 Hindi dataset, we obtain, without degrading WER, (a) 35.4% reduction in model parameters, 14.25% lower memory consumption and 18.5% fewer FLOPs on Whisper-small, and (b) 31% reduction in model parameters, 15.29% lower memory consumption and 16.95% fewer FLOPs on Whisper-medium; and, (c) substantially outperform the state-of-the-art Iterative Magnitude Pruning based method by pruning 18.7% more parameters along with a 12.31 reduction in WER.

Whisper模型在语音识别方面取得了显著的进步,但其庞大的规模仍是部署在资源受限的边缘设备上的瓶颈。本文提出了一个设计Whisper微调变体的框架来解决上述问题。通过采用稀疏组LASSO(Sparse Group LASSO)惩罚作为损失正则化项来施加结构化稀疏性,以减少浮点运算次数(FLOPs)。此外,还提出了一种感知权重统计信息的剪枝算法。我们还为WER评估设计了自定义文本规范化器。在Common Voice 11.0印地语数据集上,在不降低WER的情况下,我们获得了:(a)Whisper-small模型参数减少35.4%、内存消耗降低14.25%、FLOPs减少18.5%;(b)Whisper-medium模型参数减少31%、内存消耗降低15.29%、FLOPs减少16.95%;(c)显著优于最先进的基于迭代幅度剪枝(Iterative Magnitude Pruning)的方法,多剪枝18.7%的参数,同时WER降低12.31。
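
稀疏组 LASSO 惩罚通常写作 L1 项与按组加权的 L2 范数之和:λ₁·Σ|w| + λ₂·Σ_g √(p_g)·‖w_g‖₂。下面是一个把“每个线性/卷积层的输出通道视为一组”的示意性 PyTorch 实现;分组方式与系数取值均为假设,并非论文的官方实现。

```python
import torch
import torch.nn as nn

def sparse_group_lasso(model: nn.Module,
                       lam_l1: float = 1e-5, lam_group: float = 1e-4) -> torch.Tensor:
    """把每个权重矩阵的一行(即一个输出通道/神经元)当作一个组。"""
    penalty = torch.zeros(())
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d)):
            w = module.weight
            groups = w.flatten(1)                        # (out_channels, 组内参数数)
            group_size = groups.shape[1]
            penalty = penalty + lam_l1 * w.abs().sum()
            penalty = penalty + lam_group * (group_size ** 0.5) * groups.norm(dim=1).sum()
    return penalty

if __name__ == "__main__":
    net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    task_loss = torch.tensor(0.0)                        # 占位:真实训练中为 CTC/交叉熵等
    total_loss = task_loss + sparse_group_lasso(net)
    print(total_loss.item())
```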

论文及项目相关链接

PDF

Summary

本文提出了一个针对Whisper模型的优化框架,通过结构化稀疏性和权重统计感知的剪枝算法,减少了模型的参数、内存消耗和浮点运算数量。在Common Voice 11.0 Hindi数据集上,对Whisper-small和Whisper-medium模型进行了优化,并在保持字词错误率(WER)不变的情况下,取得了显著的效率提升。相较于当前最先进的方法,该框架能够进一步剪枝更多的参数并降低WER。

Key Takeaways

  1. Whisper模型在语音识别方面取得了显著进展,但模型大小仍是限制其在资源受限的边缘设备上部署的瓶颈。
  2. 论文提出了一个针对Whisper模型的优化框架,通过结构化稀疏性和权重统计感知的剪枝算法来解决上述问题。
  3. 使用Sparse Group LASSO惩罚作为损失正则化器来实现结构化稀疏性。
  4. 定制了文本规范化器进行WER评估。
  5. 在Common Voice 11.0 Hindi数据集上,对Whisper-small和Whisper-medium模型进行优化,显著减少了模型参数、内存消耗和浮点运算数量。
  6. 优化后的模型在保持字词错误率(WER)不变的情况下,实现了效率提升。

Cool Papers

点此查看论文截图

I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-based Single-channel Speech Enhancement

Authors:Jiatong Li, Simon Doclo

Recently, a complex variational autoencoder (VAE)-based single-channel speech enhancement system based on the DCCRN architecture has been proposed. In this system, a noise suppression VAE (NSVAE) learns to extract clean speech representations from noisy speech using pretrained clean speech and noise VAEs with skip connections. In this paper, we improve DCCRN-VAE by incorporating three key modifications: 1) removing the skip connections in the pretrained VAEs to encourage more informative speech and noise latent representations; 2) using $\beta$-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) a NSVAE generating both speech and noise latent representations. Experiments show that the proposed system achieves comparable performance as the DCCRN and DCCRN-VAE baselines on the matched DNS3 dataset but outperforms the baselines on mismatched datasets (WSJ0-QUT, Voicebank-DEMEND), demonstrating improved generalization ability. In addition, an ablation study shows that a similar performance can be achieved with classical fine-tuning instead of adversarial training, resulting in a simpler training pipeline.

最近,有研究提出了一种基于DCCRN架构的复数变分自编码器(VAE)单通道语音增强系统。在该系统中,噪声抑制VAE(NSVAE)借助带跳跃连接的预训练干净语音VAE和噪声VAE,学习从带噪语音中提取干净语音表示。在本文中,我们通过三项关键改进来提升DCCRN-VAE:1)移除预训练VAE中的跳跃连接,以鼓励更具信息量的语音和噪声潜在表示;2)在预训练中使用β-VAE,以更好地平衡重建与潜在空间正则化;3)让NSVAE同时生成语音和噪声的潜在表示。实验表明,在匹配的DNS3数据集上,所提出的系统达到了与DCCRN和DCCRN-VAE基线相当的性能,而在不匹配的数据集(WSJ0-QUT、Voicebank-DEMEND)上优于基线,显示出更好的泛化能力。此外,消融研究(ablation study)表明,使用经典微调代替对抗训练也能达到类似性能,从而得到更简单的训练流程。
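
β-VAE 的做法是在标准 VAE 目标中给 KL 项乘以系数 β,以权衡重建质量与潜在空间正则化:L = 重建误差 + β·KL(q(z|x)‖p(z))。下面给出对角高斯后验下该损失的通用 PyTorch 草图,与论文中的 DCCRN 编码器无关,β 的取值仅为示例。

```python
import torch

def beta_vae_loss(x_hat: torch.Tensor, x: torch.Tensor,
                  mu: torch.Tensor, logvar: torch.Tensor, beta: float = 4.0):
    """x_hat/x: 重建与原始信号;mu/logvar: 编码器输出的高斯后验参数。beta 为示例值。"""
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="mean")
    # KL(N(mu, sigma^2) || N(0, 1)):对潜在维度求和、对 batch 取均值
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return recon + beta * kl, recon, kl

if __name__ == "__main__":
    x = torch.randn(8, 257)
    x_hat = x + 0.1 * torch.randn_like(x)
    mu, logvar = torch.randn(8, 64), torch.randn(8, 64)
    total, recon, kl = beta_vae_loss(x_hat, x, mu, logvar, beta=2.0)
    print(float(total), float(recon), float(kl))
```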

论文及项目相关链接

PDF

摘要
基于复杂变分自编码器(VAE)的单一通道语音增强系统近日已有提出,其中噪声抑制VAE(NSVAE)利用带有跳跃连接的预训练干净语音和噪声VAE学习从带噪语音中提取干净语音表示。本文改进了DCCRN-VAE,通过融入三项关键修改:1)移除预训练VAE中的跳跃连接以鼓励更有信息量的语音和噪声潜在表征;2)在预训练中使用β-VAE以更好地平衡重建和潜在空间正则化;3)NSVAE生成语音和噪声的潜在表征。实验表明,该系统在匹配的DNS3数据集上表现与DCCRN和DCCRN-VAE基线相当,但在不匹配的WSJ0-QUT和Voicebank-DEMEND数据集上表现优于基线,显示出更好的泛化能力。此外,消融研究表明,使用经典微调而非对抗性训练也可达到类似性能,简化训练流程。

要点提炼

  1. 提出了基于复杂变分自编码器的单一通道语音增强系统。
  2. 通过噪声抑制VAE学习从带噪语音中提取干净语音表示。
  3. 对DCCRN-VAE进行了三项关键改进,包括去除预训练VAE的跳跃连接、使用β-VAE进行预训练以及NSVAE生成语音和噪声的潜在表征。
  4. 系统在不同数据集上的性能表现优异,尤其在不匹配的数据集上表现出更好的泛化能力。
  5. 消融研究证明了使用经典微调可以达到与对抗性训练相似的性能,简化了训练流程。
  6. 该系统提供了一种有效的语音增强方法,有助于改善语音质量和可懂度。

Cool Papers

点此查看论文截图

TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction

Authors:Youhao Si, Yuan Liao, Qiushi Han, Yuhang Yang, Rui Dai, Liya Huang

The rapid development of auditory attention decoding (AAD) based on electroencephalography (EEG) signals offers the possibility EEG-driven target speaker extraction. However, how to effectively utilize the target-speaker common information between EEG and speech remains an unresolved problem. In this paper, we propose a model for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to effectively extract information from EEG signals, we derive multi-scale time–frequency features and further incorporate cortical topological structures that are selectively engaged during the task. Moreover, to effectively exploit the non-Euclidean structure of EEG signals and capture their global features, the graph convolutional networks and self-attention mechanism are used in the EEG encoder. In addition, to make full use of the fused EEG and speech feature and preserve global context and capture speech rhythm and prosody, we introduce MossFormer2 which combines MossFormer and RNN-Free Recurrent as separator. Experimental results on both the public Cocktail Party and KUL dataset in this paper show that our TFGA-Net model significantly outper-forms the state-of-the-art method in certain objective evaluation metrics. The source code is available at: https://github.com/LaoDa-X/TFGA-NET.

基于脑电图(EEG)信号的听觉注意力解码(AAD)的快速发展为EEG驱动的目标说话人提取提供了可能性。然而,如何有效利用EEG和语音之间的目标说话人的共同信息仍然是一个未解决的问题。在本文中,我们提出了一种脑控说话人提取模型,该模型利用记录下来的听众的脑电图来提取目标语音。为了有效地从脑电图信号中提取信息,我们推导出了多尺度时间-频率特征,并进一步结合了执行任务时选择性涉及的皮层拓扑结构。此外,为了有效利用脑电图信号的非欧几里得结构并捕捉其全局特征,我们在EEG编码器中使用图卷积网络和自注意力机制。为了充分利用融合的EEG和语音特征,并保留全局上下文,同时捕捉语音的节奏和韵律,我们引入了MossFormer2,它结合了MossFormer和RNN-Free Recurrent作为分离器。本文在公共鸡尾酒会和KUL数据集上的实验结果表明,我们的TFGA-Net模型在某些客观评价指标上显著优于最新方法。源代码可在:https://github.com/LaoDa-X/TFGA-NET获取。

论文及项目相关链接

PDF 5 pages, 3 figures

Summary

基于脑电图信号的听觉注意力解码(AAD)快速发展,为EEG驱动的目标语音提取提供了可能。本文提出一种脑控语音提取模型,利用听众的脑电图进行目标语音提取。通过提取EEG信号的多尺度时间-频率特征并融入皮质拓扑结构,结合图卷积网络和自注意力机制的EEG编码器,有效挖掘EEG信号的非欧几里得结构并捕捉全局特征。同时,通过MossFormer2分离器结合MossFormer和RNN-Free Recurrent技术,充分利用融合后的EEG和语音特征,保留全局语境并捕捉语音节奏和语调。在公共鸡尾酒会和KUL数据集上的实验结果表明,本文的TFGA-Net模型在某些客观评价指标上显著优于现有方法。

Key Takeaways

  1. AAD基于EEG信号实现目标语音提取。
  2. 利用听众的EEG数据进行目标语音提取。
  3. 通过多尺度时间-频率特征和皮质拓扑结构进行EEG信号处理。
  4. 使用图卷积网络和自注意力机制的EEG编码器。
  5. MossFormer2分离器结合MossFormer和RNN-Free Recurrent技术用于处理融合后的EEG和语音特征。
  6. 在公共数据集上的实验结果表明TFGA-Net模型在客观评价指标上表现优异。

Cool Papers

点此查看论文截图

BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis

Authors:Jingyuan Xing, Mingru Yang, Zhipeng Li, Xiaofen Xing, Xiangmin Xu

Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed-quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.

自回归(AR)框架最近借助离散语音标记和大语言模型技术,在零样本文本转语音(TTS)方面取得了显著进展。尽管如此,现有基于AR的零样本TTS系统面临两个关键局限:(i)固有的速度-质量权衡,顺序生成标记要么以牺牲表现力为代价降低帧率,要么以牺牲效率为代价丰富标记;(ii)面向文本的监督失配,交叉熵损失对标记错误一视同仁地惩罚,而不考虑相邻标记之间细粒度的声学相似性。为了解决这些挑战,我们提出了BridgeTTS,一种建立在双重语音表示范式BridgeCode之上的新型AR-TTS框架。BridgeTTS通过预测稀疏标记来减少AR迭代次数,同时重建丰富的连续特征以实现高质量合成。对标记级与特征级目标的联合优化进一步提升了自然度和可懂度。实验表明,BridgeTTS在显著加快合成速度的同时,达到了具有竞争力的质量和说话人相似度。语音演示见 https://test1562.github.io/demo/ 。

论文及项目相关链接

PDF

Summary

基于离散语音标记技术和大型语言模型技术的自回归(AR)框架在零样本文本到语音(TTS)领域取得了显著进展。然而,现有的AR-TTS系统面临速度与质量的权衡问题以及文本导向的监督不匹配问题。为解决这些问题,我们提出了基于双重语音表示范式BridgeCode的BridgeTTS框架。BridgeTTS通过预测稀疏标记并结合重建丰富的连续特征来减少AR迭代次数,以实现高质量合成。对标记级和特征级的联合优化进一步提高了自然度和清晰度。实验证明,BridgeTTS在保证语音质量和说话人相似性的同时,显著提高了合成速度。更多语音演示请访问:[链接地址]。

Key Takeaways

  1. AR框架结合离散语音标记和大型语言模型技术在零样本TTS中表现突出。
  2. AR-TTS系统面临速度与质量的权衡挑战。
  3. BridgeTTS框架基于双重语音表示范式BridgeCode,旨在解决上述问题。
  4. BridgeTTS通过预测稀疏标记并重建连续特征来减少AR迭代次数。
  5. 联合优化标记级和特征级目标提高了语音的自然度和清晰度。
  6. 实验证明BridgeTTS在保证语音质量和说话人相似性的同时,显著提高了合成速度。

Cool Papers

点此查看论文截图

An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification

Authors:Ba-Quang Nguyen

We propose a novel neural architecture named TextGraphFuseGAT, which integrates a pretrained transformer encoder (PhoBERT) with Graph Attention Networks for token-level classification tasks. The proposed model constructs a fully connected graph over the token embeddings produced by PhoBERT, enabling the GAT layer to capture rich inter-token dependencies beyond those modeled by sequential context alone. To further enhance contextualization, a Transformer-style self-attention layer is applied on top of the graph-enhanced embeddings. The final token representations are passed through a classification head to perform sequence labeling. We evaluate our approach on three Vietnamese benchmark datasets: PhoNER-COVID19 for named entity recognition in the COVID-19 domain, PhoDisfluency for speech disfluency detection, and VietMed-NER for medical-domain NER. VietMed-NER is the first Vietnamese medical spoken NER dataset, featuring 18 entity types collected from real-world medical speech transcripts and annotated with the BIO tagging scheme. Its specialized vocabulary and domain-specific expressions make it a challenging benchmark for token-level classification models. Experimental results show that our method consistently outperforms strong baselines, including transformer-only and hybrid neural models such as BiLSTM + CNN + CRF, confirming the effectiveness of combining pretrained semantic features with graph-based relational modeling for improved token classification across multiple domains.

我们提出了一种名为TextGraphFuseGAT的新型神经网络架构,它将预训练的Transformer编码器(PhoBERT)与图注意力网络结合,用于标记级分类任务。该模型在PhoBERT产生的标记嵌入上构建一个全连接图,使GAT层能够捕获仅靠顺序上下文无法建模的丰富标记间依赖关系。为进一步增强上下文建模,在图增强后的嵌入之上应用了Transformer风格的自注意力层。最终的标记表示经由分类头完成序列标注。我们在三个越南语基准数据集上评估了该方法:用于COVID-19领域命名实体识别的PhoNER-COVID19、用于语音不流畅检测的PhoDisfluency,以及用于医疗领域NER的VietMed-NER。VietMed-NER是首个越南语医疗口语NER数据集,包含从真实医疗语音转写中收集的18种实体类型,并按BIO标注方案进行标注。其专业词汇和领域特定表达使其成为标记级分类模型颇具挑战性的基准。实验结果表明,我们的方法始终优于强基线,包括纯Transformer模型和BiLSTM + CNN + CRF等混合神经模型,证实了将预训练语义特征与基于图的关系建模相结合,能够在多个领域改进标记分类。

论文及项目相关链接

PDF 11 pages, 1 figure. Submitted to VLSP 2025 and reviewed

Summary

提出一种名为TextGraphFuseGAT的新型神经网络架构,结合了预训练的PhoBERT编码器和图注意力网络,用于令牌级分类任务。该模型构建了一个完全连接的图,基于PhoBERT产生的令牌嵌入,使GAT层能够捕获丰富的令牌间依赖关系,而不仅仅是基于顺序上下文建模。通过图增强嵌入之上应用Transformer风格的自注意力层,进一步增强了上下文。最终的令牌表示通过分类头进行传递,以执行序列标注。在越南语基准数据集上进行评估,包括PhoNER-COVID19、PhoDisfluency和VietMed-NER。实验结果证实,该方法优于包括仅使用变换器和混合神经网络模型在内的强基线,证明了将预训练语义特征与图关系建模相结合的有效性,可改善跨多个领域的令牌分类。

Key Takeaways

  1. 提出了名为TextGraphFuseGAT的新型神经网络架构。
  2. 该架构结合了预训练的PhoBERT编码器和图注意力网络。
  3. 通过构建完全连接的图来捕获令牌间的丰富依赖关系。
  4. 在图增强嵌入之上应用了Transformer风格的自注意力层以增强上下文。
  5. 在多个越南语基准数据集上进行了评估,包括PhoNER-COVID19、PhoDisfluency和VietMed-NER。
  6. 实验结果证明了该方法的优越性,相比强基线有更佳表现。

Cool Papers

点此查看论文截图

Dynamically Slimmable Speech Enhancement Network with Metric-Guided Training

Authors:Haixin Zhao, Kaixuan Yang, Nilesh Madhu

To further reduce the complexity of lightweight speech enhancement models, we introduce a gating-based Dynamically Slimmable Network (DSN). The DSN comprises static and dynamic components. For architecture-independent applicability, we introduce distinct dynamic structures targeting the commonly used components, namely, grouped recurrent neural network units, multi-head attention, convolutional, and fully connected layers. A policy module adaptively governs the use of dynamic parts at a frame-wise resolution according to the input signal quality, controlling computational load. We further propose Metric-Guided Training (MGT) to explicitly guide the policy module in assessing input speech quality. Experimental results demonstrate that the DSN achieves comparable enhancement performance in instrumental metrics to the state-of-the-art lightweight baseline, while using only 73% of its computational load on average. Evaluations of dynamic component usage ratios indicate that the MGT-DSN can appropriately allocate network resources according to the severity of input signal distortion.

为了进一步降低轻量级语音增强模型的复杂度,我们提出了一种基于门控的动态可缩放网络(DSN)。DSN包含静态和动态组件。为了获得与具体架构无关的适用性,我们针对常用组件(即分组循环神经网络单元、多头注意力、卷积层和全连接层)设计了不同的动态结构。策略模块根据输入信号质量,以逐帧分辨率自适应地控制动态部分的使用,从而控制计算负载。我们还提出了Metric-Guided Training(MGT),以显式指导策略模块评估输入语音质量。实验结果表明,DSN在客观(instrumental)指标上达到了与最先进轻量级基线相当的增强性能,而平均计算负载仅为其73%。对动态组件使用率的评估表明,MGT-DSN能够根据输入信号失真的严重程度合理地分配网络资源。

论文及项目相关链接

PDF Preprint version of a paper under review at ICASSP2026

摘要

本文提出了一种基于门控的动态可缩放网络(DSN),以进一步简化轻量级语音增强模型的复杂性。DSN包括静态和动态组件,为了具有独立于架构的适用性,我们针对常用的组件(如分组循环神经网络单元、多头注意力、卷积和全连接层)引入了不同的动态结构。策略模块根据输入信号质量以帧级分辨率自适应地控制动态部分的使用,从而控制计算负载。此外,我们还提出了度量指导训练(MGT)来明确指导策略模块评估输入语音质量。实验结果表明,DSN在仪器度量上实现了与最新轻量级基线相当的增强性能,平均计算负载仅为其73%。对动态组件使用率的评估表明,MGT-DSN可以根据输入信号失真的严重程度适当地分配网络资源。

要点

  1. 引入了基于门控的动态可缩放网络(DSN)来简化轻量级语音增强模型的复杂性。
  2. DSN包含静态和动态组件,适用于多种架构。
  3. 针对常见的组件如分组循环神经网络单元、多头注意力、卷积和全连接层,引入了动态结构。
  4. 策略模块能自适应地根据输入信号质量控制动态部分的使用,并管理计算负载。
  5. 提出了度量指导训练(MGT)来指导策略模块评估输入语音质量。
  6. 实验显示,DSN在仪器度量上的增强性能与最新轻量级基线相当,且计算效率更高。

Cool Papers

点此查看论文截图

Nepali Sign Language Characters Recognition: Dataset Development and Deep Learning Approaches

Authors:Birat Poudel, Satyam Ghimire, Sijan Bhattarai, Saurav Bhandari, Suramya Sharma Dahal

Sign languages serve as essential communication systems for individuals with hearing and speech impairments. However, digital linguistic dataset resources for underrepresented sign languages, such as Nepali Sign Language (NSL), remain scarce. This study introduces the first benchmark dataset for NSL, consisting of 36 gesture classes with 1,500 samples per class, designed to capture the structural and visual features of the language. To evaluate recognition performance, we fine-tuned MobileNetV2 and ResNet50 architectures on the dataset, achieving classification accuracies of 90.45% and 88.78%, respectively. These findings demonstrate the effectiveness of convolutional neural networks in sign recognition tasks, particularly within low-resource settings. To the best of our knowledge, this work represents the first systematic effort to construct a benchmark dataset and assess deep learning approaches for NSL recognition, highlighting the potential of transfer learning and fine-tuning for advancing research in underexplored sign languages.

手语对于听力和语言障碍人士来说是一种重要的沟通系统。然而,对于代表性不足的尼泊尔手语(NSL)等手语的数字语言数据集资源仍然非常稀缺。本研究介绍了首个尼泊尔手语基准数据集,包含36个手势类别,每个类别有1500个样本,旨在捕捉该语言的结构和视觉特征。为了评估识别性能,我们对MobileNetV2和ResNet50架构进行了微调,分别取得了90.45%和88.78%的分类准确率。这些发现证明了卷积神经网络在手势识别任务中的有效性,特别是在资源稀缺的环境中。据我们所知,这项工作代表了构建基准数据集和评估深度学习在手势识别方面的首次系统性努力,突显了迁移学习和微调在推进对未被充分研究的手语的探索方面的潜力。
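
摘要中的做法是微调在 ImageNet 上预训练的 MobileNetV2 / ResNet50。下面给出用 torchvision 把 MobileNetV2 的分类头替换为 36 类(对应 NSL 的手势类别数)的常见写法草图;“冻结主干、仅训练分类头”是此处的假设策略,论文未必如此设置。

```python
import torch.nn as nn
from torchvision import models

def build_nsl_mobilenet(num_classes: int = 36, freeze_backbone: bool = True) -> nn.Module:
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.features.parameters():
            p.requires_grad = False           # 仅微调分类头(假设的训练策略)
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model

if __name__ == "__main__":
    net = build_nsl_mobilenet()
    print(sum(p.numel() for p in net.parameters() if p.requires_grad))
```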

论文及项目相关链接

PDF 6 pages, 9 figures

Summary

本研究介绍了尼泊尔手语(NSL)的首个基准数据集,包含36个手势类别,每类1500个样本,旨在捕捉该语言的结构和视觉特征。研究通过MobileNetV2和ResNet50架构对该数据集进行微调,分类准确率分别为90.45%和88.78%,表明卷积神经网络在手势识别任务中的有效性,特别是在资源匮乏的环境下。此研究是构建NSL识别基准数据集和评估深度学习方法的首个系统性尝试,突显了迁移学习和微调在推进对未被充分研究的手语研究方面的潜力。

Key Takeaways

  1. 本研究建立了尼泊尔手语(NSL)的首个基准数据集。
  2. 数据集包含36个手势类别,每类有1500个样本。
  3. 通过MobileNetV2和ResNet50架构对数据集进行微调,获得了较高的分类准确率。
  4. 研究表明卷积神经网络在手势识别任务中的有效性。
  5. 此研究尤其在资源有限的环境下具有显著意义。
  6. 该研究是系统性地构建手语识别基准数据集和评估深度学习方法的首个尝试。

Cool Papers

点此查看论文截图

Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction

Authors:Xinyu Luo, Jie Liu, Kecheng Chen, Junyi Yang, Bo Ding, Arindam Basu, Haoliang Li

Edge devices face significant challenges due to limited computational resources and distribution shifts, making efficient and adaptable machine learning essential. Existing test-time adaptation (TTA) methods often rely on gradient-based optimization or batch processing, which are inherently unsuitable for resource-constrained edge scenarios due to their reliance on backpropagation and high computational demands. Gradient-free alternatives address these issues but often suffer from limited learning capacity, lack flexibility, or impose architectural constraints. To overcome these limitations, we propose a novel single-instance TTA method tailored for edge devices (TED), which employs forward-only coordinate optimization in the principal subspace of latent using the covariance matrix adaptation evolution strategy (CMA-ES). By updating a compact low-dimensional vector, TED not only enhances output confidence but also aligns the latent representation closer to the source latent distribution within the latent principal subspace. This is achieved without backpropagation, keeping the model parameters frozen, and enabling efficient, forgetting-free adaptation with minimal memory and computational overhead. Experiments on image classification and keyword spotting tasks across the ImageNet and Google Speech Commands series datasets demonstrate that TED achieves state-of-the-art performance while $\textit{reducing computational complexity by up to 63 times}$, offering a practical and scalable solution for real-world edge applications. Furthermore, we successfully $\textit{deployed TED on the ZYNQ-7020 platform}$, demonstrating its feasibility and effectiveness for resource-constrained edge devices in real-world deployments.

边缘设备由于计算资源有限和分布偏移而面临重大挑战,这使得高效且可适应的机器学习方法至关重要。现有的测试时自适应(TTA)方法通常依赖基于梯度的优化或批量处理,由于依赖反向传播且计算需求较高,本质上并不适合资源受限的边缘场景。无梯度替代方案虽能解决这些问题,但往往学习能力有限、缺乏灵活性或受架构约束。为克服这些局限,我们提出了一种面向边缘设备的新型单实例TTA方法(TED),它利用协方差矩阵自适应进化策略(CMA-ES),在潜在空间的主子空间中进行仅前向的坐标优化。通过更新一个紧凑的低维向量,TED不仅提高了输出置信度,还使潜在表示在潜在主子空间内更接近源域潜在分布。这一过程无需反向传播,模型参数保持冻结,从而以极小的内存和计算开销实现高效、无遗忘的自适应。在ImageNet和Google Speech Commands系列数据集上的图像分类与关键词识别任务实验表明,TED在将计算复杂度降低最多63倍的同时实现了最先进的性能,为现实世界的边缘应用提供了实用且可扩展的解决方案。此外,我们已在ZYNQ-7020平台上成功部署TED,证明了其在资源受限边缘设备的实际部署中的可行性和有效性。
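
摘要描述的是在潜在主子空间中用 CMA-ES 做仅前向(无反向传播)的坐标优化。下面用开源 pycma 库给出一个与具体模型无关的示意草图:优化一个低维坐标向量 z 使某个仅前向可得的目标最小;目标函数、维度与各项超参数均为假设,并非论文的官方实现。

```python
import numpy as np
import cma  # pip install cma

def objective(z: np.ndarray) -> float:
    """占位目标:真实场景中应为“前向推理一次,返回负的输出置信度(或与源分布的距离)”。"""
    return float(np.sum((z - 0.3) ** 2))

if __name__ == "__main__":
    dim = 8                                   # 潜在主子空间维度(假设)
    es = cma.CMAEvolutionStrategy(dim * [0.0], 0.5, {"maxiter": 50, "verbose": -9})
    while not es.stop():
        candidates = es.ask()                 # 一批候选坐标
        es.tell(candidates, [objective(z) for z in candidates])
    print("best z:", es.result.xbest)
```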

论文及项目相关链接

PDF Under review

摘要
针对边缘设备资源受限和分布变化带来的挑战,提出一种新型的单实例测试时自适应方法(TED),采用前向坐标优化,在潜在主成分子空间中使用协方差矩阵自适应进化策略(CMA-ES)。通过更新紧凑的低维向量,TED在不进行反向传播、冻结模型参数的情况下,提高了输出信心,并使潜在表示更接近源潜在分布。在ImageNet和Google语音命令系列数据集上的图像分类和关键词识别任务实验表明,TED实现了最先进的性能,同时减少了高达63倍的计算复杂度,为现实世界中的边缘应用提供了实用且可扩展的解决方案。此外,成功在ZYNQ-7020平台上部署TED,证明了其在资源受限的边缘设备上的可行性和有效性。

关键见解

  1. 边缘设备面临计算资源有限和分布变化的挑战,需要高效且可适应的机器学习方法。
  2. 现有测试时自适应(TTA)方法依赖于梯度优化或批量处理,不适用于资源受限的边缘场景。
  3. 无梯度替代方案解决了这些问题,但存在学习容量有限、缺乏灵活性或施加架构约束的缺点。
  4. 提出的TED方法是一种新型单实例TTA方法,采用前向坐标优化,更新低维向量以提高输出信心和潜在表示。
  5. TED不使用反向传播,冻结模型参数,实现高效、无遗忘的适应,具有最小的内存和计算开销。
  6. 实验表明,TED在图像分类和关键词识别任务上实现了最先进的性能,并显著减少了计算复杂度。
  7. 成功在资源受限的边缘设备上部署TED,证明了其在实际应用中的可行性和有效性。

Cool Papers

点此查看论文截图

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF: A Reproducibility Study

Authors:Anirudh Ganesh, Jayavardhan Reddy

We present a reproducibility study of the state-of-the-art neural architecture for sequence labeling proposed by Ma and Hovy (2016)\cite{ma2016end}. The original BiLSTM-CNN-CRF model combines character-level representations via Convolutional Neural Networks (CNNs), word-level context modeling through Bi-directional Long Short-Term Memory networks (BiLSTMs), and structured prediction using Conditional Random Fields (CRFs). This end-to-end approach eliminates the need for hand-crafted features while achieving excellent performance on named entity recognition (NER) and part-of-speech (POS) tagging tasks. Our implementation successfully reproduces the key results, achieving 91.18% F1-score on CoNLL-2003 NER and demonstrating the model’s effectiveness across sequence labeling tasks. We provide a detailed analysis of the architecture components and release an open-source PyTorch implementation to facilitate further research.

我们对Ma和Hovy(2016)提出的最新神经网络架构进行了可重复性研究\cite{ma2016end}。原始的BiLSTM-CNN-CRF模型通过卷积神经网络(CNN)结合字符级表示,通过双向长短时记忆网络(BiLSTM)进行单词级上下文建模,并使用条件随机场(CRF)进行结构化预测。这种端到端的方法不需要手工特征,同时在命名实体识别(NER)和词性标注(POS)任务上取得了出色的性能。我们的实现成功地再现了关键结果,在CoNLL-2003 NER上达到了91.18%的F1分数,并展示了该模型在序列标注任务中的有效性。我们提供了架构组件的详细分析,并发布了一个开源的PyTorch实现,以促进进一步的研究。

论文及项目相关链接

PDF

摘要
针对Ma和Hovy(2016)提出的最新神经网络架构进行可重复性研究。原BiLSTM-CNN-CRF模型通过卷积神经网络(CNN)进行字符级表示,通过双向长短时记忆网络(BiLSTM)进行单词级上下文建模,并使用条件随机字段(CRF)进行结构化预测。这种端到端的方法无需手工特征,在命名实体识别(NER)和词性标注(POS)任务上表现优异。我们的实现成功复现了关键结果,在CoNLL-2003 NER任务上实现了91.18%的F1分数,证明了该模型在序列标注任务中的有效性。我们提供了对该架构组件的详细分析,并发布了开源PyTorch实现,以促进进一步的研究。

要点

  1. 该研究对Ma和Hovy(2016)提出的先进神经网络架构进行了可重复性验证。
  2. BiLSTM-CNN-CRF模型结合了字符级表示、单词级上下文建模和结构化预测。
  3. 该模型实现了端到端的训练,无需手工特征。
  4. 在命名实体识别(NER)和词性标注(POS)任务上取得了优异性能。
  5. 研究成功复现了关键结果,并在CoNLL-2003 NER任务上达到了91.18%的F1分数。
  6. 提供了对该神经网络架构的详细分析。

Cool Papers

点此查看论文截图

LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

Authors:Jun Chen, Shichao Hu, Jiuxin Lin, Wenjie Li, Zihan Zhang, Xingchen Li, JinJiang Liu, Longshuai Xiao, Chao Weng, Lei Xie, Zhiyong Wu

In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.

车内多区域语音分离技术捕捉来自不同语音区域的声音,在人机交互中扮演着至关重要的角色。尽管之前的SpatialNet已经取得了显著成果,但其较高的计算成本仍然阻碍了其在车辆中的实时应用。为此,本文提出了LSZone,一种用于实时车内多区域语音分离的轻量级空间信息建模架构。我们设计了一个空间信息提取-压缩(SpaIEC)模块,结合梅尔频谱图和耳间相位差(IPD),在减少计算负担的同时保持性能。此外,为了有效地对空间信息进行建模,我们引入了一个极轻量级的Conv-GRU跨频带-窄频带处理(CNP)模块。实验结果表明,LSZone的计算复杂度为0.56G MACs,实时因子(RTF)为0.37,在复杂噪声和多说话人场景中表现出令人印象深刻的效果。
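
耳间(通道间)相位差通常由两路 STFT 的相位差得到:IPD = ∠(X₁ · conj(X₂))。下面给出把 IPD 与对数梅尔频谱拼接成空间特征的 torchaudio 草图,帮助理解 SpaIEC 模块所用的两类输入;帧长、跳长、梅尔通道数等参数均为假设,并非 LSZone 的官方配置。

```python
import torch
import torchaudio

def spatial_features(mic1: torch.Tensor, mic2: torch.Tensor, sr: int = 16000):
    """mic1/mic2: 形状为 (采样点,) 的两路麦克风信号。返回 (帧数, mel+IPD) 的特征。"""
    n_fft, hop = 512, 256
    window = torch.hann_window(n_fft)
    X1 = torch.stft(mic1, n_fft, hop, window=window, return_complex=True)  # (freq, frames)
    X2 = torch.stft(mic2, n_fft, hop, window=window, return_complex=True)
    ipd = torch.angle(X1 * X2.conj())                     # 通道间相位差
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=64)(mic1)
    logmel = torch.log(mel + 1e-6)                        # (n_mels, frames)
    frames = min(logmel.shape[-1], ipd.shape[-1])
    return torch.cat([logmel[:, :frames], ipd[:, :frames]], dim=0).T

if __name__ == "__main__":
    sig = torch.randn(2, 16000)                           # 两路 1 秒的假设信号
    print(spatial_features(sig[0], sig[1]).shape)
```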

论文及项目相关链接

PDF submitted to ICASSP 2026

Summary

本文提出了一种名为LSZone的轻量级空间信息建模架构,用于实时车内多区域语音分离。该架构通过结合梅尔频谱图和跨耳相位差设计了一个空间信息提取压缩模块,以减少计算负担同时保持性能。此外,为了有效地建模空间信息,还引入了一个超轻量级的Conv-GRU跨频带处理模块。LSZone在复杂噪声和多说话人场景中表现出色,计算复杂度为0.56G MACs,实时因子为0.37。

Key Takeaways

  1. LSZone是一个轻量级的空间信息建模架构,用于实时车内多区域语音分离。
  2. 通过结合梅尔频谱图和跨耳相位差设计了一个空间信息提取压缩模块,减少计算负担。
  3. 引入了一个超轻量级的Conv-GRU跨频带处理模块,以有效地建模空间信息。
  4. LSZone在复杂噪声和多说话人场景中具有出色的性能。
  5. LSZone的计算复杂度为0.56G MACs,实时因子为0.37,适用于实时应用。
  6. SpatialNet虽然有一定的成果,但其高计算成本阻碍了实时应用,而LSZone解决了这一问题。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !