
Speech


⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on them for serious academic work; they are only meant as a first-pass filter before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-05

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Authors:Jiarong Du, Zhan Jin, Peijun Yang, Juan Liu, Zhuo Li, Xin Liu, Ming Li

Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker’s speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a “separation before dereverberation” pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.


Paper & Project Links

PDF

Summary

This paper presents an audio-visual speech enhancement (AVSE) system that uses visual auxiliary information to extract the target speaker's speech from mixed audio. The system performs well in complex acoustic environments thanks to a "separation before dereverberation" pipeline that can be extended to other AVSE networks. In the 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-4), the system achieved excellent results on the three objective metrics on the leaderboard and won first place in the human subjective listening test.

Key Takeaways

  1. Audio-visual speech enhancement (AVSE) extracts a target speaker's speech from mixed audio using visual auxiliary information.
  2. Complex acoustic environments with various interfering sounds and reverberation make speech extraction challenging.
  3. The paper proposes an AVSE system that performs well in complex acoustic environments, built around a "separation before dereverberation" pipeline (a minimal sketch follows this list).
  4. The pipeline can be extended to other AVSE networks.
  5. The system was validated in the 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-4).
  6. It achieved excellent, leading results on the three objective metrics.
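To make the "separation before dereverberation" idea concrete, here is a minimal PyTorch sketch of a two-stage pipeline: a separation stage that extracts the target speaker from the mixture using a visual embedding, followed by a dereverberation stage applied to the separated signal. All module names, layer choices, and dimensions are illustrative assumptions, not the network described in the paper.

```python
import torch
import torch.nn as nn


class Separator(nn.Module):
    """Stage 1: extract the target speaker from the mixture, guided by a visual embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        self.fuse = nn.GRU(dim + 512, dim, batch_first=True)   # 512-d lip/visual features (assumed)
        self.mask = nn.Conv1d(dim, dim, kernel_size=1)
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, visual):
        # mixture: (B, 1, T_samples); visual: (B, T_frames, 512), upsampled to the audio frame rate
        feats = self.audio_enc(mixture)                          # (B, dim, T)
        visual = nn.functional.interpolate(visual.transpose(1, 2), size=feats.shape[-1])
        h, _ = self.fuse(torch.cat([feats, visual], dim=1).transpose(1, 2))
        masked = feats * torch.sigmoid(self.mask(h.transpose(1, 2)))
        return self.dec(masked)                                  # separated but still reverberant


class Dereverberator(nn.Module):
    """Stage 2: remove reverberation from the separated target speech."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, 16, stride=8), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(dim, 1, 16, stride=8),
        )

    def forward(self, separated):
        return self.net(separated)


if __name__ == "__main__":
    mixture = torch.randn(2, 1, 16000)        # 1 s of 16 kHz audio
    visual = torch.randn(2, 25, 512)          # 25 video frames of lip embeddings
    clean = Dereverberator()(Separator()(mixture, visual))
    print(clean.shape)                         # torch.Size([2, 1, 16000])
```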

Cool Papers

Click here to view paper screenshots

See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Authors:Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu

Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.


Paper & Project Links

PDF 16 pages, 15 figures, accepted by TASLP

Summary

This work proposes a novel approach that extracts information directly from speech to address key challenges in speech-to-talking-face generation. The method has two stages: a speech-to-face portrait generation stage that uses a speech-conditioned diffusion model with a statistical facial prior and a sample-adaptive weighting module to produce high-quality portraits, and a speech-driven talking face generation stage that embeds expressive dynamics into the diffusion model's latent space and optimizes lip synchronization with a region-enhancement module. A pre-trained Transformer-based discrete codebook is integrated with an image rendering network to enhance video frame details end to end. Experiments show the method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets, and it is the first method able to generate high-resolution, high-quality talking face videos from a single speech input alone.

Key Takeaways

  1. The method extracts information directly from speech, without relying on a source image as an appearance reference.
  2. It consists of two stages: speech-to-face portrait generation and speech-driven talking face generation.
  3. High-quality portrait generation is achieved with a speech-conditioned diffusion model, a statistical facial prior, and a sample-adaptive weighting module.
  4. Expressive dynamics such as lip movement, facial expressions, and eye movements are embedded into the diffusion model's latent space.
  5. A region-enhancement module is used to optimize lip synchronization.
  6. Integrating a pre-trained Transformer-based discrete codebook with an image rendering network enhances video frame details.

Cool Papers

Click here to view paper screenshots

UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Authors:Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Yinghao Liu, Zheng Xue, Gang Song, Boyang Zhou

Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose \textbf{UniTok-Audio}, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous feature of conditions to generates discrete tokens of target audio in an autoregressive manner; 2) a special task identifier token unifies different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec involving acoustic and semantic branch is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.


Paper & Project Links

PDF 21 pages, 3 figures

Summary

UniTok-Audio is a scalable, extensible framework for unified audio generation that addresses the audio-quality and cross-task generalization challenges of existing audio generation models. It extracts continuous condition features and autoregressively generates discrete tokens of the target audio, uses a special task-identifier token to unify the learning patterns of multiple tasks in a single framework, and employs a dual-stream audio codec with acoustic and semantic branches for high-fidelity waveform reconstruction. Across five time-aligned tasks, UniTok-Audio is competitive with state-of-the-art task-specific and multi-task systems.

Key Takeaways

  1. UniTok-Audio is a scalable and extensible framework for unified audio generation tasks.
  2. It generates discrete tokens of the target audio autoregressively from continuous condition features, and a special task-identifier token unifies the learning patterns of multiple tasks (see the sketch after this list).
  3. A dual-stream audio codec with acoustic and semantic branches enables high-fidelity waveform reconstruction.
  4. The framework achieves competitive performance on five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation.
  5. The codebase will be open-sourced and a demo page is available to foster future research.
  6. The framework helps address the audio-quality and task-generalization challenges of audio generation models.
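The following is a minimal sketch of the task-identifier idea: continuous condition features and a special task token drive an autoregressive decoder over discrete codec tokens. The task list, vocabulary size, and the GRU decoder are illustrative assumptions, not UniTok-Audio's actual components.

```python
import torch
import torch.nn as nn

TASKS = ["restore", "tse", "separate", "vc", "lass"]   # five time-aligned tasks (names assumed)
CODEBOOK = 1024                                        # codec token vocabulary size (assumed)


class TaskConditionedDecoder(nn.Module):
    """Autoregressive decoder over codec tokens, steered by a task-identifier token."""

    def __init__(self, cond_dim=80, dim=256):
        super().__init__()
        self.task_emb = nn.Embedding(len(TASKS), dim)   # special task-identifier tokens
        self.tok_emb = nn.Embedding(CODEBOOK, dim)
        self.cond_proj = nn.Linear(cond_dim, dim)       # continuous condition features
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, CODEBOOK)

    def forward(self, cond, prev_tokens, task_id):
        # cond: (B, T, cond_dim); prev_tokens: (B, T) shifted target tokens; task_id: (B,)
        c = self.cond_proj(cond)
        t = self.tok_emb(prev_tokens)
        start = self.task_emb(task_id).unsqueeze(1)     # the first input step carries the task identifier
        t = torch.cat([start, t[:, 1:]], dim=1)
        h, _ = self.rnn(torch.cat([c, t], dim=-1))
        return self.head(h)                             # (B, T, CODEBOOK) logits over codec tokens


if __name__ == "__main__":
    model = TaskConditionedDecoder()
    cond = torch.randn(2, 100, 80)                      # e.g. features of the degraded / source audio
    prev = torch.randint(0, CODEBOOK, (2, 100))
    task = torch.tensor([TASKS.index("tse"), TASKS.index("separate")])
    print(model(cond, prev, task).shape)                # torch.Size([2, 100, 1024])
```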

Cool Papers

Click here to view paper screenshots

Modeling strategies for speech enhancement in the latent space of a neural audio codec

Authors:Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive

Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.


Paper & Project Links

PDF

Summary

Neural audio codecs (NACs) provide compact latent speech representations as sequences of continuous vectors or discrete tokens. This work compares the two representation types as training targets for supervised speech enhancement, considering autoregressive and non-autoregressive Conformer-based enhancement models as well as a simple baseline in which the NAC encoder is fine-tuned for enhancement. Three key findings emerge: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.

Key Takeaways

  1. Neural audio codecs provide compact latent speech representations that can serve as training targets for supervised speech enhancement.
  2. The paper compares continuous latent representations and discrete tokens as enhancement targets (a minimal sketch of the two losses follows this list).
  3. Predicting continuous latent representations performs better for speech enhancement.
  4. Autoregressive models reach higher quality, but non-autoregressive models are more attractive in practice.
  5. Simply fine-tuning the NAC encoder for enhancement yields the strongest enhancement metrics.
  6. The study highlights the quality-versus-efficiency trade-off of autoregressive models for speech enhancement.
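A minimal sketch of the two training targets compared in the paper is shown below: regressing continuous NAC latents versus classifying discrete codec tokens, on top of a shared backbone. The plain linear backbone and all dimensions are placeholders, not the Conformer models used in the work.

```python
import torch
import torch.nn as nn

D_LATENT, V_TOKENS = 128, 1024

backbone = nn.Linear(80, 256)                  # stand-in for a Conformer encoder over noisy features
cont_head = nn.Linear(256, D_LATENT)           # predicts continuous latent vectors
disc_head = nn.Linear(256, V_TOKENS)           # predicts a distribution over codec tokens

noisy = torch.randn(4, 200, 80)                # (batch, frames, feature dim)
clean_latents = torch.randn(4, 200, D_LATENT)  # targets from the NAC encoder on clean speech
clean_tokens = torch.randint(0, V_TOKENS, (4, 200))

h = backbone(noisy)
# Continuous target: regress the clean latent vectors.
loss_continuous = nn.functional.mse_loss(cont_head(h), clean_latents)
# Discrete target: classify the clean codec token at every frame.
loss_discrete = nn.functional.cross_entropy(
    disc_head(h).reshape(-1, V_TOKENS), clean_tokens.reshape(-1)
)
print(float(loss_continuous), float(loss_discrete))
```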

Cool Papers

Click here to view paper screenshots

Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Authors:Pedro Corrêa, João Lima, Victor Moreno, Lucas Ueda, Paula Dornhofer Paro Costa

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models’ generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.


Paper & Project Links

PDF Submitted to IEEE ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

Summary

This work examines spoken language models (SLMs) on speech emotion recognition. Although SLMs are designed to achieve universal audio understanding by jointly learning text and audio representations, experiments with emotionally incongruent speech samples show that the models rely predominantly on textual semantics rather than speech emotion, indicating that text-related representations largely dominate over acoustic representations. The code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) are released.

Key Takeaways

  1. Spoken language models (SLMs) jointly learn text and audio representations with the goal of universal audio understanding.
  2. On emotionally incongruent speech samples, SLMs rely mainly on textual semantics to perform the task.
  3. Speech emotion plays a limited role, showing that acoustic representations are largely dominated by text-related representations.
  4. The results add to the ongoing discussion about the generalization capabilities of SLMs.
  5. The code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) are publicly released.
  6. How SLMs should balance textual and acoustic emotional information remains an open question.

Cool Papers

Click here to view paper screenshots

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Authors:Chin-Jou Li, Kalvin Chang, Shikhar Bharadwaj, Eunjung Yeo, Kwanghee Choi, Jian Zhu, David Mortensen, Shinji Watanabe

Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, code and models are released to foster open science.


Paper & Project Links

PDF 14 pages, under review

Summary

This paper introduces POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. The model outperforms or matches specialized phone recognition models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. The training data, code, and models are released to foster open science.

Key Takeaways

  1. POWSM is the first unified framework that jointly performs multiple phone-related tasks, including automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme (G2P), and phoneme-to-grapheme (P2G) conversion.
  2. POWSM enables seamless conversion between audio, text, and phones, opening new possibilities for speech processing.
  3. The model is competitive with specialized phone recognition models of similar size.
  4. Supporting multiple phonetic tasks in one model improves generality and lowers resource requirements.
  5. The training data, code, and models are publicly released, promoting open science.
  6. POWSM is expected to foster further research on phonetic tasks and advance speech processing.

Cool Papers

Click here to view paper screenshots

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Authors:Inclusion AI, :, Bowen Ma, Cheng Zou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianing Li, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jianping Jiang, Jun Peng, Kaixiang Ji, Kaimeng Ren, Libin Wang, Lixiang Ru, Longhua Tan, Lan Wang, Mochen Bai, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Tianqi Li, Tinghao Liu, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaolong Wang, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He

We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in contextual ASR and highly competitive results in dialect-aware ASR. In image generation, Ming-Flash-Omni introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-Flash-Omni introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. Notably, Ming-Flash-Omni achieves state-of-the-art results in text-to-image generation and generative segmentation, and sets new records on all 12 contextual ASR benchmarks, all within a single unified architecture.


Paper & Project Links

PDF 18 pages, 5 figures

Summary

Ming-Flash-Omni is an upgraded version of Ming-Omni built on a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0, with 100 billion total parameters of which only 6.1 billion are active per token. This architecture scales efficiently, improving computational efficiency while expanding model capacity, and strengthens unified multimodal intelligence across vision, speech, and language, a key step toward artificial general intelligence (AGI). Compared with its predecessor, it substantially improves multimodal understanding and generation: it achieves state-of-the-art contextual ASR and highly competitive dialect-aware ASR, introduces high-fidelity text rendering with better scene consistency and identity preservation in image editing, and adds generative segmentation, which also improves spatial control in image generation. Overall, it reaches state-of-the-art results in text-to-image generation and generative segmentation within a single unified architecture.

Key Takeaways

  1. Ming-Flash-Omni is an upgraded model built on a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 (a minimal top-k routing sketch follows this list).
  2. Only 6.1 billion of its 100 billion total parameters are active per token, enabling efficient scaling of compute and capacity.
  3. It provides strong unified multimodal intelligence across vision, speech, and language.
  4. It achieves state-of-the-art contextual ASR and highly competitive dialect-aware ASR.
  5. It introduces high-fidelity text rendering and improves scene consistency and identity preservation in image editing.
  6. It delivers state-of-the-art performance in text-to-image generation and generative segmentation.
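As a reference for the sparse-MoE idea behind the "100B total / 6.1B active" figure, here is a minimal top-k routing sketch: a router picks k experts per token, so only a small fraction of the parameters is exercised for each token. Sizes and the dispatch loop are illustrative, not Ming-Flash-Omni's implementation.

```python
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    def __init__(self, dim=512, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)                # mixing weights of the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                       # only the k selected experts run per token
            for e, expert in enumerate(self.experts):
                hit = idx[:, slot] == e
                if hit.any():
                    out[hit] += weights[hit, slot].unsqueeze(-1) * expert(x[hit])
        return out


x = torch.randn(32, 512)
print(TopKMoE()(x).shape)                                # torch.Size([32, 512])
```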

Cool Papers

Click here to view paper screenshots

BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Authors:Raphaël Bagat, Irina Illina, Emmanuel Vincent

Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder’s complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.


Paper & Project Links

PDF Submitted to ICASSP 2026

Summary

To address the difficulty ASR systems face in out-of-domain, low-resource scenarios with scarce labeled data, the authors propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a framework that adapts Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, keeping the encoder complementary to the pre-trained decoder. Experiments on the ATCO2 corpus from the Air Traffic Control (ATC) communications domain, using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, show that the approach significantly outperforms the baseline and the fine-tuned model, with a 12% relative improvement over the latter. This is the first work to use a self-supervised learning objective for domain adaptation of Whisper.

Key Takeaways

  1. BEARD adapts Whisper's encoder to a new domain using unlabeled data.
  2. Unlike traditional self-supervised learning methods, BEARD combines a BEST-RQ objective with knowledge distillation (a minimal sketch of the two losses follows this list).
  3. BEARD keeps the adapted encoder complementary to the pre-trained decoder.
  4. The approach is validated on the ATCO2 corpus from the Air Traffic Control (ATC) communications domain.
  5. About 5,000 hours of untranscribed speech are used for BEARD training.
  6. Only 2 hours of transcribed speech are used for fine-tuning.
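Below is a minimal sketch of the two training signals combined in BEARD: a BEST-RQ-style masked-prediction loss against labels from a frozen random-projection quantizer, plus a distillation loss toward a frozen teacher encoder. All modules are toy stand-ins (the real system adapts Whisper's encoder), and the masking ratio and sizes are assumptions.

```python
import torch
import torch.nn as nn

D_FEAT, D_MODEL, CODEBOOK = 80, 256, 512

# Frozen BEST-RQ quantizer: a random projection and a random codebook, never trained.
proj = torch.randn(D_FEAT, 16)
codebook = torch.randn(CODEBOOK, 16)

def bestrq_labels(feats):                               # feats: (B, T, D_FEAT)
    z = feats @ proj                                    # (B, T, 16)
    dist = torch.cdist(z, codebook.expand(feats.shape[0], -1, -1))
    return dist.argmin(dim=-1)                          # index of the nearest random code per frame

student = nn.GRU(D_FEAT, D_MODEL, batch_first=True)     # encoder being adapted (stand-in for Whisper's)
head = nn.Linear(D_MODEL, CODEBOOK)
teacher = nn.GRU(D_FEAT, D_MODEL, batch_first=True)     # frozen copy of the original encoder
for p in teacher.parameters():
    p.requires_grad_(False)

feats = torch.randn(2, 100, D_FEAT)                     # untranscribed target-domain speech features
mask = torch.rand(2, 100) < 0.4                         # frames whose input is masked out
masked_in = feats.masked_fill(mask.unsqueeze(-1), 0.0)

h, _ = student(masked_in)
with torch.no_grad():
    t, _ = teacher(feats)

loss_bestrq = nn.functional.cross_entropy(head(h)[mask], bestrq_labels(feats)[mask])
loss_distill = nn.functional.mse_loss(h, t)             # keep the adapted encoder close to the teacher
loss = loss_bestrq + loss_distill
loss.backward()
print(float(loss))
```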

Cool Papers

Click here to view paper screenshots

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

Authors:Máté Gedeon, Péter Mihajlik

We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.


Paper & Project Links

PDF Submitted to LREC 2026

Summary

LibriConvo is a simulated multi-speaker conversational dataset built with speaker-aware conversation simulation (SASC) to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources, it ensures semantic coherence and realistic conversational timing, and a series of processing steps maintains contextual consistency and enhances acoustic realism. With realistic conversational dynamics and controlled experimental conditions, LibriConvo is a valuable resource for advancing multi-speaker speech processing research.

Key Takeaways

  1. LibriConvo is a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC).
  2. It is designed for training and evaluating speaker diarization and automatic speech recognition (ASR) systems.
  3. LibriConvo emphasizes semantic coherence and realistic conversational timing.
  4. The pipeline leverages CallHome with external VAD for reliable boundaries, compresses unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency.
  5. Acoustic realism is enhanced with a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity.
  6. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation.

Cool Papers

Click here to view paper screenshots

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

Authors:Wenming Tu, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, Zilong Zheng

Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in the UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset’s utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.


Paper & Project Links

PDF 23 pages, 4 figures

Summary

UltraVoice is the first large-scale speech dialogue dataset engineered for multiple kinds of fine-grained speech style control, addressing the lack of such control in current spoken dialogue models. It contains over 830 hours of speech dialogues with instructions across six stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly improves their fine-grained style controllability without degrading core conversational abilities. The dataset is also useful for training controllable text-to-speech (TTS) models, underscoring its quality and broad applicability.

Key Takeaways

  1. UltraVoice is the first large-scale speech dialogue dataset for multiple kinds of fine-grained speech style control.
  2. It contains over 830 hours of speech dialogues covering six key stylistic dimensions.
  3. Fine-tuning leading models such as SLAM-Omni and VocalNet on it significantly improves fine-grained style controllability.
  4. The fine-tuned models improve by 29.12-42.33% in Mean Opinion Score (MOS) and by 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks designed in UltraVoice.
  5. On the URO-Bench benchmark, the fine-tuned models also show substantial gains in core understanding, reasoning, and conversational abilities.
  6. The dataset can be used to train controllable text-to-speech (TTS) models, demonstrating its quality and broad applicability.

Cool Papers

Click here to view paper screenshots

FlexIO: Flexible Single- and Multi-Channel Speech Separation and Enhancement

Authors:Yoshiki Masuyama, Kohei Saijo, Francesco Paissan, Jiangyu Han, Marc Delcroix, Ryo Aihara, François G. Germain, Gordon Wichern, Jonathan Le Roux

Speech separation and enhancement (SSE) has advanced remarkably and achieved promising results in controlled settings, such as a fixed number of speakers and a fixed array configuration. Towards a universal SSE system, single-channel systems have been extended to deal with a variable number of speakers (i.e., outputs). Meanwhile, multi-channel systems accommodating various array configurations (i.e., inputs) have been developed. However, these attempts have been pursued separately. In this paper, we propose a flexible input and output SSE system, named FlexIO. It performs conditional separation using prompt vectors, one per speaker as a condition, allowing separation of an arbitrary number of speakers. Multi-channel mixtures are processed together with the prompt vectors via an array-agnostic channel communication mechanism. Our experiments demonstrate that FlexIO successfully covers diverse conditions with one to five microphones and one to three speakers. We also confirm the robustness of FlexIO on CHiME-4 real data.


Paper & Project Links

PDF Submitted to ICASSP 2026

Summary

This paper proposes FlexIO, a flexible input and output speech separation and enhancement (SSE) system. It performs conditional separation using prompt vectors, one per speaker, which allows an arbitrary number of speakers to be separated. Multi-channel mixtures are processed together with the prompt vectors through an array-agnostic channel communication mechanism. Experiments show that FlexIO covers diverse conditions with one to five microphones and one to three speakers, and its robustness is confirmed on CHiME-4 real data.

Key Takeaways

  1. FlexIO is a flexible input and output SSE system designed to separate an arbitrary number of speakers.
  2. It performs conditional separation using one prompt vector per speaker as the condition (a minimal sketch follows this list).
  3. Multi-channel mixtures are handled together with the prompt vectors via an array-agnostic channel communication mechanism.
  4. Experiments cover one to five microphones and one to three speakers, with strong results across conditions.
  5. Robustness is confirmed on CHiME-4 real data.
  6. FlexIO points toward universal SSE systems that handle complex environments and varying input/output configurations.
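The sketch below illustrates prompt-conditioned separation with a flexible number of outputs: the same network is applied once per prompt vector, so any number of speakers can be extracted, and the channel dimension is reduced in an array-agnostic way (here simply averaged). Everything is a simplified stand-in, not FlexIO's architecture or its channel-communication mechanism.

```python
import torch
import torch.nn as nn


class PromptConditionedSeparator(nn.Module):
    def __init__(self, dim=128, prompt_dim=64):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, 16, stride=8)
        self.cond = nn.Linear(prompt_dim, dim)
        self.dec = nn.ConvTranspose1d(dim, 1, 16, stride=8)

    def forward(self, mixture, prompts):
        # mixture: (B, C, T) with any channel count C; prompts: (B, S, prompt_dim) with any speaker count S
        mono = mixture.mean(dim=1, keepdim=True)         # crude array-agnostic reduction over channels
        feats = self.enc(mono)                           # (B, dim, T')
        outs = []
        for s in range(prompts.shape[1]):                # one pass per target speaker
            g = self.cond(prompts[:, s]).unsqueeze(-1)   # (B, dim, 1) conditioning gate
            outs.append(self.dec(feats * torch.sigmoid(g)))
        return torch.stack(outs, dim=1)                  # (B, S, 1, T)


mix = torch.randn(2, 3, 16000)         # 3 microphones
prompts = torch.randn(2, 2, 64)        # 2 target speakers
print(PromptConditionedSeparator()(mix, prompts).shape)   # torch.Size([2, 2, 1, 16000])
```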

Cool Papers

Click here to view paper screenshots

Can large audio language models understand child stuttering speech? speech summarization, and source separation

Authors:Chibuzor Okocha, Maya Bakri, Christan Grant

Child speech differs from adult speech in acoustics, prosody, and language development, and disfluencies (repetitions, prolongations, blocks) further challenge Automatic Speech Recognition (ASR) and downstream Natural Language Processing (NLP). Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding; however, their behavior in disfluent child speech remains underexplored. We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child). The tasks are (i) single-channel source separation to isolate the child and (ii) child-only summarization that preserves clinically relevant disfluencies and avoids adult-speech leakage. Evaluation combines Large Language Model (LLM) as a judge, human expert ratings, and BERTScore (F1), and we report agreement between models and between models and humans to assess reliability. Our findings delineate the conditions under which LALMs produce faithful child-only summaries from mixed audio and where they fail, offering practical guidance for clinical and educational deployments. We provide prompts and evaluation scripts to support replication.


Paper & Project Links

PDF 7 pages, 1 Figure, 8 tables, Under review ICASSP 2026

Summary

This study evaluates several state-of-the-art large audio-language models (LALMs) on disfluent child speech in two settings: an interview with mixed speakers and a reading task with a single child. The tasks are single-channel source separation to isolate the child and child-only summarization that preserves clinically relevant disfluencies while avoiding adult-speech leakage. Evaluation combines an LLM judge, human expert ratings, and BERTScore (F1), with agreement reported between models and between models and humans to assess reliability. The findings delineate when LALMs produce faithful child-only summaries from mixed audio and when they fail, offering practical guidance for clinical and educational deployments.

Key Takeaways

  1. Child speech differs from adult speech in acoustics, prosody, and language development, which challenges ASR and downstream NLP.
  2. Large audio-language models show strong cross-modal audio understanding, but their behavior on disfluent child speech is underexplored.
  3. Several state-of-the-art LALMs are evaluated in two settings: an interview and a reading task.
  4. Evaluation combines an LLM judge, human expert ratings, and BERTScore (F1) to assess reliability (a minimal BERTScore example follows this list).
  5. The models can produce faithful child-only summaries under some conditions but fail under others.
  6. The study offers practical guidance for clinical and educational deployments.
  7. Prompts and evaluation scripts are provided to support replication.
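For the BERTScore part of the evaluation, a minimal example with the `bert-score` package is shown below; the candidate and reference summaries are invented placeholders, not data from the study, and the LLM-judge and human-rating components are not reproduced here.

```python
from bert_score import score

candidate = ["The child describes the picture, with several sound repetitions on b-initial words."]
reference = ["The child narrates the picture; repetitions and blocks occur on words starting with b."]

# BERTScore compares candidate and reference summaries with contextual embeddings.
P, R, F1 = score(candidate, reference, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.3f}")
```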

Cool Papers

Click here to view paper screenshots

WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

Authors:Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie

In this paper, we present WEST (WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly available at https://github.com/wenet-e2e/west/


Paper & Project Links

PDF

Summary

WEST is a speech toolkit built on a large language model (LLM) for speech understanding, generation, and interaction. It has three key features: it is fully LLM-based, reusing mature architectures, ecosystems, and methods from large models; it is full-stack, supporting recognition, synthesis, understanding, dialogue, and multimodal tasks with extensibility to open-source models; and it is deliberately simple so that everyone can use it. WEST provides two types of recipes and models: one built entirely on open-source models and data, which reproduces the paper's experiments and serves as a verification or minimal baseline system, and one trained on massive data that offers superior out-of-the-box performance.

Key Takeaways

  1. WEST is a speech toolkit built on a large language model.
  2. It is full-stack, supporting speech recognition, synthesis, understanding, and dialogue.
  3. Its three key features are being fully LLM-based, reusing mature architectures and ecosystems, and being simple to use.
  4. It provides two kinds of recipes and models: one based on open-source models and data, and one trained on massive data for high performance.
  5. Experimental results are shared alongside the recipes and models.
  6. Users can use WEST's tools for speech interaction and generation.

Cool Papers

Click here to view paper screenshots

High-Energy Concentration for Federated Learning in Frequency Domain

Authors:Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Yunsong Li, Leyuan Fang

Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy like high-frequency components usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.


Paper & Project Links

PDF

Summary

Federated learning (FL) enables collaborative optimization without data sharing; by sending synthetic data to the server in the spirit of dataset distillation, such frameworks protect real-data privacy and alleviate data heterogeneity. However, redundant information and noise in full spatial-domain designs still increase the communication burden. This paper proposes FedFD, a frequency-domain aware FL method with high-energy concentration, inspired by the observation that the discrete cosine transform concentrates energy in specific regions. Since low-energy components such as high-frequency coefficients mostly carry redundant information and noise, filtering them reduces communication costs and improves performance. FedFD preserves the low-frequency components with a binary mask and reaches an optimal solution through frequency-domain distribution alignment, and real-data-driven synthetic classification is imposed on the loss to improve the quality of the low-frequency components. On five image and speech datasets, FedFD outperforms state-of-the-art methods while reducing communication costs; for example, on CIFAR-10 with Dirichlet coefficient α = 0.01, it reduces communication cost by at least 37.78% while gaining 10.88% in performance.

Key Takeaways

  1. Federated learning enables collaborative optimization without sharing data, protecting real-data privacy and alleviating data heterogeneity.
  2. Existing methods suffer from redundant information and noise in full spatial-domain designs, which increases the communication burden.
  3. FedFD is a frequency-domain aware federated learning method that keeps low-frequency components and filters out redundant information and noise (a minimal DCT-masking sketch follows this list).
  4. FedFD exploits the high-energy concentration of the discrete cosine transform, which places most energy in specific regions.
  5. A binary mask preserves the low-frequency components, and frequency-domain distribution alignment drives the optimization.
  6. Real-data-driven synthetic classification is imposed on the loss to improve the quality of the low-frequency components.
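A minimal sketch of the frequency-domain intuition follows: take a 2-D DCT, keep only a low-frequency block with a binary mask (where most of the energy of a smooth signal concentrates), and communicate those few coefficients. The 8x8 cutoff and the toy input are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Smooth toy "image" standing in for a synthetic sample shared in FL.
x = np.linspace(0.0, 1.0, 32)
img = np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))

coeffs = dctn(img, norm="ortho")     # energy of a smooth signal sits in the top-left (low-frequency) corner
mask = np.zeros_like(coeffs)
mask[:8, :8] = 1.0                    # binary mask keeping low-frequency components only

kept = coeffs * mask                  # what a client would communicate (64 of 1024 values)
recon = idctn(kept, norm="ortho")     # reconstruction from the kept coefficients

energy_kept = (kept ** 2).sum() / (coeffs ** 2).sum()
max_err = np.abs(recon - img).max()
print(f"coefficients sent: {int(mask.sum())}/{coeffs.size}, "
      f"energy retained: {energy_kept:.2%}, max reconstruction error: {max_err:.3f}")
```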

Cool Papers

Click here to view paper screenshots

SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Authors:Chenyang Le, Bing Han, Jinshun Li, Songyong Chen, Yanmin Qian

Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency quality tradeoffs.


Paper & Project Links

PDF NeurIPS 2025 poster

Summary

SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating) is an unsupervised policy learning framework for simultaneous speech translation. It combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions implicitly, without adding inference-time overhead, addressing the difficulty of balancing translation quality, latency, and semantic coherence in multilingual many-to-many settings. The design requires only minimal modifications to standard Transformer architectures and generalizes to both speech-to-text and text-to-speech streaming. On six language pairs, the 500M-parameter speech-to-text model outperforms the Seamless baseline, with under 7% BLEU degradation at 1.5 seconds average lag and under 3% at 3 seconds, and extending SimulMEGA to streaming TTS with a unidirectional backbone yields superior latency-quality trade-offs.

Key Takeaways

  1. SimulMEGA addresses the balance between translation quality, latency, and semantic coherence in simultaneous speech translation (SimulST).
  2. It learns read/write policies through Mixture-of-Experts gating, combining prefix-based training with a MoE refiner.
  3. Effective read and write decisions are learned implicitly, without inference-time overhead, while jointly optimizing speech recognition and machine translation.
  4. SimulMEGA applies to both speech-to-text and text-to-speech streaming tasks.
  5. On six language pairs it outperforms the Seamless baseline, with strong results on both latency and translation quality.
  6. It generalizes well to streaming TTS, delivering favorable latency-quality trade-offs.

Cool Papers

Click here to view paper screenshots

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Authors:Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.


Paper & Project Links

PDF

Summary

HM-Talker is a framework for generating high-fidelity, temporally coherent talking-head videos. It tackles the motion blur and lip jitter common in existing methods by combining implicit and explicit motion cues: explicit cues use anatomically defined Action Units (AUs) alongside implicit features to minimize phoneme-viseme misalignment. A Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features and predicts AUs directly from audio aligned to visual cues, and a Hybrid Motion Modeling Module (HMMM) dynamically merges randomly paired implicit/explicit features to reduce identity-dependent biases and improve cross-subject generalization. Experiments show HM-Talker surpasses state-of-the-art methods in visual quality and lip-sync accuracy.

Key Takeaways

  1. HM-Talker is a framework for generating high-fidelity talking-head videos.
  2. It combines implicit and explicit motion cues to address motion blur and lip jitter.
  3. Explicit cues are modeled with Action Units (AUs), anatomically defined facial muscle movements.
  4. The Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features and predicts AUs from audio aligned to visual cues.
  5. The Hybrid Motion Modeling Module (HMMM) dynamically merges implicit and explicit features to reduce identity-dependent bias and improve cross-subject generalization.
  6. Experiments show HM-Talker outperforms existing methods in visual quality and lip-sync accuracy.

Cool Papers

Click here to view paper screenshots

PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective

Authors:Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performances while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO’s practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model’s low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.


Paper & Project Links

PDF

Summary

PESTO is a self-supervised learning approach for single-pitch estimation based on a Siamese architecture. The model processes individual frames of a Variable-Q Transform (VQT) and predicts pitch distributions, and the network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. Pitch-shifted pairs are constructed by translating and cropping the VQT frames, and the model is trained with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Despite being very lightweight (130k parameters), PESTO achieves remarkable performance: on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) it outperforms self-supervised baselines, competes with supervised methods, and shows superior cross-dataset generalization. A streamable VQT implementation using cached convolutions, combined with the model's low latency (under 10 ms) and small parameter count, makes PESTO well suited to real-time applications.

Key Takeaways

  1. PESTO is a self-supervised single-pitch estimation method built on a Siamese architecture.
  2. The model processes VQT frames and predicts pitch distributions, using a Toeplitz fully-connected layer for translation equivariance.
  3. Pitch-shifted training pairs are built by translating and cropping VQT frames, so no annotated data is needed (a minimal sketch of this objective follows this list).
  4. PESTO performs strongly while remaining very lightweight (only 130k parameters).
  5. Evaluations on several datasets show it outperforms self-supervised baselines, competes with supervised methods, and generalizes well across datasets.
  6. Low latency and a streamable VQT implementation make PESTO suitable for real-time applications.
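Below is a minimal sketch of the transposition-equivariance idea: a pitch-shifted view is built by cropping the same VQT frame at an offset of k bins, and the predicted pitch distribution is asked to shift by the same k bins. The tiny convolutional model stands in for the Toeplitz/translation-equivariant network, a simple MSE objective replaces the paper's class-based loss, and bin counts and the shift range are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BINS, CROP, MAX_SHIFT = 300, 260, 12        # VQT bins, cropped input size, shift range (all assumed)

model = nn.Sequential(                        # translation-equivariant stand-in (convolutions over frequency)
    nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
    nn.Conv1d(16, 1, 5, padding=2),
)

def pitch_logits(x):                          # x: (B, CROP), one VQT frame per item
    return model(x.unsqueeze(1)).squeeze(1)   # (B, CROP) unnormalized pitch distribution

vqt = torch.randn(8, N_BINS)                  # a batch of single VQT frames
k = int(torch.randint(1, MAX_SHIFT + 1, ()))  # random transposition, in bins
view_a = vqt[:, :CROP]                        # original crop
view_b = vqt[:, k:k + CROP]                   # same frame, cropped k bins higher

# Equivariance: predictions on the shifted view should match the original
# predictions shifted by k bins (compare the overlapping region only).
pa, pb = pitch_logits(view_a), pitch_logits(view_b)
loss_equiv = F.mse_loss(pa[:, k:], pb[:, :CROP - k])
loss_equiv.backward()
print(k, float(loss_equiv))
```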

Cool Papers

Click here to view paper screenshots

Authors:Samuele Cornell, Christoph Boeddeker, Taejin Park, He Huang, Desh Raj, Matthew Wiesner, Yoshiki Masuyama, Xuankai Chang, Zhong-Qiu Wang, Stefano Squartini, Paola Garcia, Shinji Watanabe

The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges’ design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.


Paper & Project Links

PDF

Summary

The CHiME-7 and CHiME-8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With 9 teams submitting 32 diverse systems, the challenges have advanced the state of the art. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems, and analyzes key trends in the submissions: most participants now use end-to-end ASR systems, enabled by robust large-scale pre-trained models; despite progress in neural speech separation and enhancement, all teams still rely heavily on guided source separation; all top systems refine diarization with target-speaker diarization techniques, making accurate speaker counting in the first pass crucial; downstream evaluation via meeting summarization can correlate only weakly with transcription quality because large language models handle errors remarkably well (on NOTSOFAR-1, systems with over 50% time-constrained minimum-permutation WER can perform roughly on par with the best systems, around 11%); and accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even with computationally intensive system ensembles.

Key Takeaways

  1. The CHiME-7 and CHiME-8 challenges push research on multi-channel, generalizable joint ASR and diarization of conversational speech.
  2. End-to-end ASR systems have largely replaced hybrid systems, thanks to widely available large-scale pre-trained models that lower the data burden.
  3. Despite advances in neural speech separation and enhancement, teams still depend on guided source separation, suggesting current neural techniques struggle with complex scenarios and varied recording setups.
  4. All best systems refine diarization with target-speaker diarization techniques, so accurate speaker counting in the first diarization pass is crucial to avoid compounding errors.
  5. Downstream evaluation such as meeting summarization can correlate weakly with transcription quality, especially when large language models are involved.
  6. Accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even with computationally intensive system ensembles.

Cool Papers

Click here to view paper screenshots

Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task

Authors:Milena Davudova, Ziyuan Cai, Valentina Giunchiglia, Dragos C. Gruia, Giulia Sanguedolce, Adam Hampshire, Fatemeh Geranmayeh

Detailed assessment of language impairment following stroke remains a cognitively complex and clinician-intensive task, limiting timely and scalable diagnosis. Automatic Speech Recognition (ASR) foundation models offer a promising pathway to augment human evaluation through intelligent systems, but their effectiveness in the context of speech and language impairment remains uncertain. In this study, we evaluate whether Whisper, a state-of-the-art ASR foundation model, can be applied to transcribe and analyze speech from patients with stroke during a commonly used picture-naming task. We assess both verbatim transcription accuracy and the model’s ability to support downstream prediction of language function, which has major implications for outcomes after stroke. Our results show that the baseline Whisper model performs poorly on single-word speech utterances. Nevertheless, fine-tuning Whisper significantly improves transcription accuracy (reducing Word Error Rate by 87.72% in healthy speech and 71.22% in speech from patients). Further, learned representations from the model enable accurate prediction of speech quality (average F1 Macro of 0.74 for healthy, 0.75 for patients). However, evaluations on an unseen (TORGO) dataset reveal limited generalizability, highlighting the inability of Whisper to perform zero-shot transcription of single-word utterances on out-of-domain clinical speech and emphasizing the need to adapt models to specific clinical populations. While challenges remain in cross-domain generalization, these findings highlight the potential of foundation models, when appropriately fine-tuned, to advance automated speech and language assessment and rehabilitation for stroke-related impairments.


Paper & Project Links

PDF

Summary

This study evaluates whether Whisper, a state-of-the-art ASR foundation model, can transcribe and analyze speech from stroke patients during a picture-naming task. The baseline model performs poorly on single-word utterances, but fine-tuning substantially improves transcription accuracy, and the learned representations support accurate prediction of language function. Evaluation on an unseen dataset (TORGO) reveals limited generalizability, emphasizing the need to adapt models to specific clinical populations. The findings highlight the potential of appropriately fine-tuned foundation models to advance automated speech and language assessment and rehabilitation after stroke.

Key Takeaways

  1. ASR foundation models have potential for assessing post-stroke language impairment.
  2. The baseline Whisper model performs poorly on single-word utterances, but fine-tuning significantly improves transcription accuracy, reducing Word Error Rate by 87.72% on healthy speech and 71.22% on patient speech (a minimal WER example follows this list).
  3. Learned representations from the model support prediction of language function, which matters for outcomes after stroke.
  4. Generalization to an unseen dataset is limited, so models need to be adapted to specific clinical populations.
  5. Cross-domain generalization remains a challenge.
  6. Appropriately fine-tuned foundation models can advance automated speech and language assessment and rehabilitation.
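As a small illustration of the word-error-rate comparison cited above, the sketch below scores baseline and fine-tuned hypotheses for single-word naming responses with the `jiwer` package; the word lists are invented placeholders, not data from the study.

```python
import jiwer

targets   = ["cat", "umbrella", "giraffe", "anchor"]     # picture-naming target words
baseline  = ["hat", "um umbrella", "draft", "anchor"]    # hypotheses before fine-tuning
finetuned = ["cat", "umbrella", "giraffe", "anchor"]     # hypotheses after fine-tuning

wer_base = jiwer.wer(targets, baseline)
wer_ft = jiwer.wer(targets, finetuned)
print(f"baseline WER = {wer_base:.2f}, fine-tuned WER = {wer_ft:.2f}, "
      f"relative reduction = {(wer_base - wer_ft) / wer_base:.1%}")
```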

Cool Papers

Click here to view paper screenshots

FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Authors:Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng

The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.


Paper & Project Links

PDF NeurIPS 2025. The code is at https://github.com/ictnlp/FastLongSpeech. This model is at https://huggingface.co/ICTNLP/FastLongSpeech. The dataset is at https://huggingface.co/datasets/ICTNLP/LongSpeech-Eval

Summary

Progress in large language models has driven large speech-language models (LSLMs), but efficient processing of long-form speech remains underexplored because long-speech training data is scarce and long sequences are computationally expensive. FastLongSpeech extends LSLM capabilities to efficient long-speech processing without dedicated long-speech training data. It uses an iterative fusion strategy to compress excessively long speech sequences into manageable lengths, and a dynamic compression training approach that exposes the model to short-speech sequences at varying compression ratios, transferring LSLM capabilities to long-speech tasks. A long-speech understanding benchmark, LongSpeech-Eval, is introduced for evaluation. Experiments show strong performance on both long-speech and short-speech tasks, together with greatly improved inference efficiency.

Key Takeaways

  1. Advances in large language models have driven progress in large speech-language models.
  2. Current models struggle with long-form speech because of scarce training data and high computational cost.
  3. FastLongSpeech addresses these problems and enables efficient long-speech processing.
  4. It combines an iterative fusion strategy with dynamic compression training (a minimal fusion sketch follows this list).
  5. The iterative fusion strategy compresses long speech sequences to manageable lengths.
  6. Dynamic compression training exposes the model to short-speech sequences at varying compression ratios, improving its ability to handle long speech.
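The sketch below shows one simple reading of an iterative-fusion compressor: repeatedly merge the most similar pair of adjacent frames until the sequence fits a target length. This is an illustrative simplification, not FastLongSpeech's exact algorithm; frame counts and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def iterative_fusion(frames: torch.Tensor, target_len: int) -> torch.Tensor:
    """frames: (T, D) speech features; returns (target_len, D)."""
    frames = frames.clone()
    while frames.shape[0] > target_len:
        sim = F.cosine_similarity(frames[:-1], frames[1:], dim=-1)    # similarity of adjacent frames
        i = int(sim.argmax())                                         # most redundant adjacent pair
        merged = (frames[i] + frames[i + 1]) / 2                      # fuse the pair by averaging
        frames = torch.cat([frames[:i], merged.unsqueeze(0), frames[i + 2:]], dim=0)
    return frames

long_speech = torch.randn(1500, 256)          # frame sequence of a long-form utterance
short = iterative_fusion(long_speech, 375)    # compress 4x before feeding the speech-language model
print(short.shape)                            # torch.Size([375, 256])
```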

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise noted, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.