⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-10-02 更新
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
Authors:Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing “thinking time” yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
我们提出了语音推理能力评估(VERA),这是一个用于在实时对话约束下评估语音交互系统推理能力的基准。VERA 包含 2,931 个由既有文本基准改编而来的语音原生题目,组织为五个赛道(数学、网络、科学、长上下文、事实)。每个条目都在保留推理难度的前提下被改编为适合语音交互的形式。VERA 支持在同一模型家族内进行文本与语音的直接对比,并支持分析架构选择如何影响可靠性。我们评估了 12 个当代语音系统以及强大的文本基线,观察到一致且显著的模态差距:在竞赛数学上,领先的文本模型达到 74.8% 的准确率,而其语音对应模型仅为 6.1%;跨赛道宏平均下,最佳文本模型达到 54.0%,而语音模型为 11.3%。延迟-准确率分析揭示了一个低延迟平台期:快速的语音系统聚集在约 10% 的准确率附近,而要接近文本性能则必须牺牲实时交互。诊断实验表明,常见的缓解手段并不足够:增加“思考时间”带来的收益微乎其微;将推理与叙述分离的解耦级联方案可以提高准确率,但仍远低于文本水平,并会引入典型的依据性(grounding)/一致性错误。失败分析进一步显示,原生流式、端到端和级联三类设计具有各自不同的错误特征。VERA 为“将思考与说话解耦”的架构提供了可复现的测试平台和有针对性的诊断,为衡量实时语音助手朝着既流畅又能可靠推理的方向取得的进展提供了一种有原则的方法。
论文及项目相关链接
PDF Code and data available at https://github.com/linyueqian/VERA
Summary
本文提出了语音推理能力评估基准(VERA),用于在实时对话约束下评估语音交互系统的推理能力。VERA 包含从既有文本基准衍生的 2,931 个语音原生题目,分为五个赛道(数学、网络、科学、长上下文、事实),每个条目在保留推理难度的同时适配了语音交互。VERA 使同一模型家族内的文本与语音直接对比成为可能,并支持分析架构选择对可靠性的影响。对当前语音系统与强大文本基线的评估显示出一致且显著的模态差距:竞赛数学上,顶尖文本模型准确率达 74.8%,而其语音版本仅为 6.1%;跨赛道宏平均下,最佳文本模型准确率为 54.0%,语音模型仅为 11.3%。延迟-准确率分析揭示了低延迟平台期:快速语音系统的准确率聚集在约 10%,而接近文本性能则需要牺牲实时交互。诊断实验表明常见的缓解方法收效有限:增加思考时间的收益微乎其微;将推理与叙述分离的级联设计能提高准确率,但仍远未达到文本水平,并会引入典型的依据性(grounding)/一致性错误。失败分析显示原生流式、端到端和级联设计各有独特的错误特征。VERA 为将思考与说话解耦的架构提供了可复现的测试平台和有针对性的诊断工具,为构建既流畅又能可靠推理的实时语音助手提供了衡量进展的原则性方法。
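下面是一个极简的 Python 示意,说明摘要中“跨赛道宏平均”准确率的计算口径:先分别计算五个赛道的准确率,再做等权平均。其中的题数与正确数仅为占位示例,并非论文数据。

```python
# 最小示意:按摘要口径计算跨赛道的宏平均(macro-average)准确率;数字为占位示例。
track_results = {                 # {赛道: (答对题数, 总题数)},非论文真实数据
    "Math": (6, 100), "Web": (15, 100), "Science": (12, 100),
    "Long-Context": (10, 100), "Factual": (13, 100),
}
per_track_acc = {k: c / n for k, (c, n) in track_results.items()}
macro_acc = sum(per_track_acc.values()) / len(per_track_acc)   # 各赛道等权平均
print(per_track_acc, f"macro = {macro_acc:.1%}")
```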
Key Takeaways
- VERA是一个用于评估语音交互系统推理能力的基准。
- VERA包含多个轨道,涵盖不同的领域和语境。
- 语音系统与文本系统性能存在显著差异,尤其在复杂任务上。
- 延迟与准确性之间存在权衡,快速响应往往导致准确性下降。
- 增加思考时间对改善语音系统性能效果有限。
- 分离推理与叙述的级联设计能提高准确性,但仍存在特定误差。
点此查看论文截图

An Analysis of Joint Nonlinear Spatial Filtering for Spatial Aliasing Reduction
Authors:Alina Mannanova, Jakob Kienegger, Timo Gerkmann
The performance of traditional linear spatial filters for speech enhancement is constrained by the physical size and number of channels of microphone arrays. For instance, for large microphone distances and high frequencies, spatial aliasing may occur, leading to unwanted enhancement of signals from non-target directions. Recently, it has been proposed to replace linear beamformers by nonlinear deep neural networks for joint spatial-spectral processing. While it has been shown that such approaches result in higher performance in terms of instrumental quality metrics, in this work we highlight their ability to efficiently handle spatial aliasing. In particular, we show that joint spatial and tempo-spectral processing is more robust to spatial aliasing than traditional approaches that perform spatial processing alone or separately with tempo-spectral filtering. The results provide another strong motivation for using deep nonlinear networks in multichannel speech enhancement, beyond their known benefits in managing non-Gaussian noise and multiple speakers, especially when microphone arrays with rather large microphone distances are used.
传统线性空间滤波器用于语音增强时,其性能受限于麦克风阵列的物理尺寸和通道数量。例如,在麦克风间距较大、频率较高时可能出现空间混叠,导致非目标方向的信号被不期望地增强。最近有研究提出用非线性深度神经网络取代线性波束形成器,进行联合空间-频谱处理。虽然已有研究表明这类方法在客观质量指标上性能更高,但在本工作中,我们重点展示了它们高效处理空间混叠的能力。特别是,我们表明联合的空间与时频处理在应对空间混叠方面,比仅做空间处理、或将空间处理与时频滤波分开进行的传统方法更加稳健。这些结果在深度非线性网络已知的处理非高斯噪声和多说话人的优势之外,为其在多通道语音增强中的应用提供了又一个有力的动机,尤其是在使用麦克风间距较大的阵列时。
论文及项目相关链接
PDF Submitted to ICASSP 2026. This work has been submitted to the IEEE for possible publication
摘要
传统的线性空间滤波器进行语音增强时,性能受限于麦克风阵列的物理尺寸和通道数量。在麦克风间距大、频率高的情况下会发生空间混叠,导致非目标方向的信号被不期望地增强。最近有工作提议用非线性深度神经网络取代线性波束形成器,进行联合空间-频谱处理。虽然已证明这些方法在客观质量指标上表现更好,但本工作重点强调了它们处理空间混叠的能力。特别是,我们表明联合的空间与时频处理在对抗空间混叠方面,比仅做空间处理或将其与时频滤波分开进行的传统方法更加稳健。这些结果在深度非线性网络已知的处理非高斯噪声和多说话人的优势之外,为其在多通道语音增强中的应用提供了另一个强有力的动机,特别是在使用间距较大的麦克风阵列时。
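作为补充,下面用一个极简的 Python 片段说明空间混叠的经典判据(空间奈奎斯特条件):当相邻麦克风间距 d 超过半波长,即频率 f 大于 c/(2d) 时,均匀线阵的方向图会出现栅瓣,也就是摘要所说的“大间距 + 高频”导致非目标方向被增强的情形。数值仅作量级示意。

```python
# 最小示意:均匀线阵的空间混叠起始频率 f = c / (2 d)
c = 343.0                                 # 声速 (m/s)

def aliasing_onset_frequency(mic_spacing_m):
    """间距为 d 的相邻阵元开始出现空间混叠的频率 (Hz)。"""
    return c / (2.0 * mic_spacing_m)

for d in (0.02, 0.08, 0.20):              # 2 cm / 8 cm / 20 cm 间距
    print(f"d = {d*100:.0f} cm -> 混叠起始频率约 {aliasing_onset_frequency(d)/1000:.1f} kHz")
```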
关键见解
- 传统线性空间滤波器在语音增强方面受限于麦克风阵列的物理尺寸和通道数量。
- 在大麦克风距离和高频时,会发生空间混叠,导致性能下降。
- 非线性深度神经网络已被提议用于联合空间-频谱处理,以提高性能。
- 深度神经网络在处理空间混叠方面表现出色,与传统的空间处理方法相比更加稳健。
- 联合的空间与时频处理对空间混叠问题具有更好的稳健性。
- 使用深度非线性网络进行多通道语音增强具有优势,尤其是在处理非高斯噪声和多说话人场景时。
点此查看论文截图

Detecting Hope Across Languages: Multiclass Classification for Positive Online Discourse
Authors:T. O. Abiola, K. D. Abiodun, O. E. Olumide, O. O. Adebanji, O. Hiram Calvo, Grigori Sidorov
The detection of hopeful speech in social media has emerged as a critical task for promoting positive discourse and well-being. In this paper, we present a machine learning approach to multiclass hope speech detection across multiple languages, including English, Urdu, and Spanish. We leverage transformer-based models, specifically XLM-RoBERTa, to detect and categorize hope speech into three distinct classes: Generalized Hope, Realistic Hope, and Unrealistic Hope. Our proposed methodology is evaluated on the PolyHope dataset for the PolyHope-M 2025 shared task, achieving competitive performance across all languages. We compare our results with existing models, demonstrating that our approach significantly outperforms prior state-of-the-art techniques in terms of macro F1 scores. We also discuss the challenges in detecting hope speech in low-resource languages and the potential for improving generalization. This work contributes to the development of multilingual, fine-grained hope speech detection models, which can be applied to enhance positive content moderation and foster supportive online communities.
在社交媒体中检测希望言论(hope speech)已成为促进积极话语和身心健康的一项重要任务。本文提出了一种跨多语言(包括英语、乌尔都语和西班牙语)的多类别希望言论检测的机器学习方法。我们利用基于 Transformer 的模型,特别是 XLM-RoBERTa,将希望言论检测并划分为三个不同的类别:通用希望、现实希望和非现实希望。我们提出的方法在 PolyHope-M 2025 共享任务的 PolyHope 数据集上进行了评估,在所有语言上均取得了有竞争力的表现。我们将结果与现有模型进行比较,证明我们的方法在宏 F1 分数上显著优于先前的最先进技术。我们还讨论了在低资源语言中检测希望言论的挑战以及提升泛化能力的潜力。这项工作为多语言、细粒度的希望言论检测模型的发展做出了贡献,可用于加强积极内容审核并促进支持型在线社区的建设。
论文及项目相关链接
Summary
社交媒体中希望言论的检测对促进积极对话和心理健康至关重要。本文提出一种跨多种语言(包括英语、乌尔都语和西班牙语)的希望言论机器学习检测方案,利用基于 Transformer 的模型 XLM-RoBERTa 将希望言论检测并分类为三种类别:通用型希望、现实型希望和幻想型希望。该方法在 PolyHope-M 2025 共享任务的 PolyHope 数据集上进行评估,在所有语言上均表现出有竞争力的性能;与现有模型相比,在宏 F1 分数上显著优于先前的最先进方法。我们还讨论了在低资源语言中检测希望言论的挑战以及提高泛化能力的潜力。本研究为开发多语言、细粒度的希望言论检测模型做出贡献,可应用于增强积极内容审核和促进支持性在线社区。
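下面给出一个最小的 Python 示意,展示用 XLM-RoBERTa 做三分类希望言论检测的基本接口(Hugging Face transformers),以及共享任务常用的宏 F1 计算方式。其中 `xlm-roberta-base` 检查点、示例句子和标签顺序均为假设;分类头未经训练,需在 PolyHope 数据上微调后预测才有意义。

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score
import torch

labels = ["Generalized Hope", "Realistic Hope", "Unrealistic Hope"]   # 摘要给出的三个类别
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)   # 分类头随机初始化,需在 PolyHope 上微调
)

batch = tok(["Things will surely get better next year."],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    pred = model(**batch).logits.argmax(dim=-1)
print(labels[pred.item()])

# 宏 F1:对每个类别分别计算 F1 再取平均(此处标签为演示用的虚构数据)
print(f1_score([0, 1, 2, 1], [0, 2, 2, 1], average="macro"))
```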
Key Takeaways
- 希望言论检测在社交媒体中对于促进积极对话和心理健康至关重要。
- 提出一种跨多种语言的希望言论机器学习检测方案,包括英语、乌尔都语和西班牙语。
- 利用基于transformer的模型XLM-RoBERTa进行希望言论检测和分类。
- 将希望言论分为三种类别:通用型希望、现实型希望和幻想型希望。
- 在PolyHope数据集上的评估表现出有竞争力的性能,宏F1分数显著优于先前的最先进方法。
- 讨论了在低资源语言中进行希望言论检测的挑战。
点此查看论文截图

LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
Authors:Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng
Lip-to-speech (L2S) synthesis for Mandarin is a significant challenge, hindered by complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address this issue, we propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy. This strategy not only transfers universal knowledge learned from extensive English data to the Mandarin domain but also circumvents the prohibitive cost of training such a model from scratch. To specifically model lexical tones and enhance intelligibility, we further employ a flow-matching model to generate the F0 contour. This generation process is guided by ASR-fine-tuned SSL speech units, which contain crucial suprasegmental information. The overall speech quality is then elevated through a two-stage training paradigm, where a flow-matching postnet refines the coarse spectrogram from the first stage. Extensive experiments demonstrate that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
面向普通话的唇语到语音(lip-to-speech, L2S)合成是一项重大挑战,其难点在于复杂的唇形-音素映射,以及词汇声调对可懂度的关键作用。为了解决这一问题,我们提出了词汇声调感知的唇语到语音合成(LTA-L2S)。为应对唇形到音素映射的复杂性,我们的模型通过跨语言迁移学习策略,对一个在英语上预训练的视听自监督学习(SSL)模型进行适配。该策略不仅把从大量英语数据中学到的通用知识迁移到普通话领域,还避免了从头训练此类模型的高昂成本。为了显式建模词汇声调并提升可懂度,我们进一步采用流匹配(flow matching)模型生成 F0 轮廓,其生成过程由经 ASR 微调的 SSL 语音单元引导,这些单元包含关键的超音段信息。随后通过两阶段训练范式提升整体语音质量:流匹配后处理网络(postnet)对第一阶段产生的粗糙频谱图进行细化。大量实验表明,LTA-L2S 在语音可懂度和声调准确性方面均显著优于现有方法。
论文及项目相关链接
PDF Submitted to ICASSP 2026
Summary
面向普通话的唇动到语音合成(L2S)面临挑战,主要源于复杂的唇形-音素映射以及词汇声调对可懂度的关键作用。我们提出了词汇声调感知的唇动到语音合成(LTA-L2S)来解决这一问题。模型通过跨语言迁移学习策略,适配英语预训练的视听自监督学习(SSL)模型来应对唇形-音素映射的复杂性;这一策略不仅将通用知识从大量英语数据迁移到普通话领域,还避免了从头训练此类模型的高昂成本。为了专门建模词汇声调并增强可懂度,我们进一步采用流匹配模型生成 F0 轮廓,该过程由经 ASR 微调的 SSL 语音单元引导,这些单元包含关键的超音段信息。整体语音质量通过两阶段训练模式提升:流匹配后处理网络对第一阶段产生的粗略频谱图进行细化。实验表明,LTA-L2S 在语音可懂度和声调准确性方面均显著优于现有方法。
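作为参考,下面给出条件流匹配(flow matching)训练目标的一个极简 PyTorch 示意:在噪声与目标 F0 轨迹的线性插值路径上回归恒定速度场。网络结构、条件特征维度等均为示例假设,仅用于说明摘要中“用流匹配生成 F0 轮廓”的通用做法,并非论文的实际实现。

```python
import torch
import torch.nn as nn

class F0FlowNet(nn.Module):
    """极简速度场网络:输入当前点 x_t、时间 t 与条件特征,输出预测速度。"""
    def __init__(self, d_cond=256, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + d_cond + 1, d_hidden), nn.SiLU(), nn.Linear(d_hidden, 1)
        )
    def forward(self, x_t, t, cond):
        t_feat = t.expand(-1, x_t.size(1), 1)              # 时间广播到每一帧
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def flow_matching_loss(model, f0_target, cond):
    x1 = f0_target.unsqueeze(-1)        # 数据端点 (B, T, 1)
    x0 = torch.randn_like(x1)           # 噪声端点
    t = torch.rand(x1.size(0), 1, 1)    # 每个样本随机采样时间
    x_t = (1 - t) * x0 + t * x1         # 线性插值路径
    v_target = x1 - x0                  # 该路径的恒定速度场
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

# 用法示例(假设 cond 来自经 ASR 微调的 SSL 单元,已与 F0 帧对齐)
model = F0FlowNet()
loss = flow_matching_loss(model, torch.randn(2, 120), torch.randn(2, 120, 256))
```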
Key Takeaways
- 普通话的唇动到语音合成面临挑战,主要是由于复杂的音素映射和词汇音调的重要性。
- 提出了词汇音调感知的唇动到语音合成(LTA-L2S)来解决这一问题。
- 通过跨语言迁移学习策略,借鉴英语预训练的音频视觉自我监督学习模型来解决音素映射的复杂性。
- 采用流匹配模型生成F0轮廓,以增强语音的音调表现和可理解性。
- 经ASR微调的SSL语音单元提供关键的超音段信息,指导F0轮廓的生成。
- 两阶段训练模式用于提高语音质量,其中包括流匹配后网对第一阶段产生的粗略频谱图的细化。
点此查看论文截图

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition
Authors:Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao
Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.
大型音频语言模型(LALM)在各类语音任务上表现出强大的零样本性能,但由于副语言建模能力较弱、跨模态推理有限,在语音情感识别(SER)上表现欠佳。我们提出了面向情感推理的组合式思维链提示(CCoT-Emo)框架,通过引入结构化的情感图(EG)来引导 LALM 进行情感推断,而无需微调。每个情感图编码七种声学特征(如音高、语速、基频微扰 jitter、振幅微扰 shimmer)、文本情感、关键词以及跨模态关联。情感图嵌入提示词后,提供了可解释且可组合的表示,增强了 LALM 的推理能力。在多个 SER 基准上的实验表明,CCoT-Emo 优于先前的最先进方法,并且相比零样本基线提高了准确率。
论文及项目相关链接
Summary:针对大型音频语言模型(LALM)在语音情感识别(SER)方面的不足,提出了基于情感图的情绪推理框架Compositional Chain-of-Thought Prompting(CCoT-Emo)。该框架通过引入情感图来引导LALM进行情感推理,无需微调。情感图结合了语音的声学特征、文本情感、关键词和跨模态关联,嵌入到提示中,提供了可解释和组合性的表示,提高了LALM的推理能力。实验表明,CCoT-Emo在SER基准测试中表现优于先前的方法,并在零样本基准上提高了准确性。
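下面是一个极简的 Python 示意,按摘要描述把一个“情感图”(七种声学特征 + 文本情感 + 关键词 + 跨模态关联)序列化进提示词。字段名、取值与提示模板均为示例假设,论文的具体图结构未在摘要中给出。

```python
# 最小示意:构造一个示例情感图(EG)并嵌入提示词;字段与数值均为假设。
emotion_graph = {
    "acoustic": {                      # 七种声学特征(摘要列举了音高、语速、jitter、shimmer 等)
        "pitch_mean_hz": 210.0,
        "speech_rate_sps": 4.2,        # 每秒音节数(假设单位)
        "jitter": 0.013,
        "shimmer": 0.041,
        "energy_db": -23.5,
        "pause_ratio": 0.18,
        "f0_range_hz": 95.0,
    },
    "text_sentiment": "negative",
    "keywords": ["tired", "again"],
    "cross_modal": "low pitch + negative wording -> likely sadness or fatigue",
}

def eg_to_prompt(eg, transcript):
    lines = [f"Transcript: {transcript}", "Emotion graph:"]
    lines += [f"- {k}: {v}" for k, v in eg["acoustic"].items()]
    lines += [f"- text sentiment: {eg['text_sentiment']}",
              f"- keywords: {', '.join(eg['keywords'])}",
              f"- cross-modal cue: {eg['cross_modal']}",
              "Reason step by step over the graph, then output one emotion label."]
    return "\n".join(lines)

print(eg_to_prompt(emotion_graph, "I'm so tired of this again."))
```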
Key Takeaways:
- 大型音频语言模型(LALMs)在语音情感识别(SER)方面存在挑战,主要是由于副语言建模能力较弱、跨模态推理有限。
- 提出了Compositional Chain-of-Thought Prompting for Emotion Reasoning(CCoT-Emo)框架,通过引入情感图来改进LALMs在SER方面的性能。
- 情感图结合了语音的声学特征、文本情感和关键词,以及跨模态关联。
- 嵌入式情感图提供可解释和组合性的表示,增强LALM的推理能力。
- CCoT-Emo框架无需微调,即可提高LALMs在SER任务上的性能。
- 实验结果表明,CCoT-Emo在SER基准测试中表现优于先前的方法。
点此查看论文截图

VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale
Authors:Chi Zhang, Zehua Chen, Kaiwen Zheng, Jun Zhu
Bridge models have recently been explored for speech enhancement tasks such as denoising, dereverberation, and super-resolution, while these efforts are typically confined to a single task or small-scale datasets, with constrained general speech restoration (GSR) capability at scale. In this work, we introduce VoiceBridge, a GSR system rooted in latent bridge models (LBMs), capable of reconstructing high-fidelity speech at full-band (i.e., 48kHz) from various distortions. By compressing speech waveform into continuous latent representations, VoiceBridge models the diverse LQ-to-HQ tasks (namely, low-quality to high-quality) in GSR with a single latent-to-latent generative process backed by a scalable transformer architecture. To better inherit the advantages of bridge models from the data domain to the latent space, we present an energy-preserving variational autoencoder, enhancing the alignment between the waveform and latent space over varying energy levels. Furthermore, to address the difficulty of HQ reconstruction from distinctively different LQ priors, we propose a joint neural prior, uniformly alleviating the reconstruction burden of LBM. At last, considering the key requirement of GSR systems, human perceptual quality, a perceptually aware fine-tuning stage is designed to mitigate the cascading mismatch in generation while improving perceptual alignment. Extensive validation across in-domain and out-of-domain tasks and datasets (e.g., refining recent zero-shot speech and podcast generation results) demonstrates the superior performance of VoiceBridge. Demo samples can be visited at: https://VoiceBridge-demo.github.io/.
最近,桥梁模型已被探索用于去噪、去混响和超分辨率等语音增强任务,但这些工作通常局限于单一任务或小规模数据集,大规模通用语音恢复(GSR)能力受限。在这项工作中,我们提出了 VoiceBridge,一个基于潜在桥梁模型(LBM)的 GSR 系统,能够从各种失真中重建全频带(即 48kHz)的高保真语音。通过把语音波形压缩为连续潜在表示,VoiceBridge 用单一的潜在到潜在生成过程(由可扩展的 Transformer 架构支撑)来建模 GSR 中多样的 LQ-to-HQ(低质量到高质量)任务。为了更好地把桥梁模型在数据域中的优势继承到潜在空间,我们提出了一种能量保持的变分自编码器,增强波形与潜在空间在不同能量水平上的对齐。此外,针对从差异显著的低质量先验重建高质量语音的困难,我们提出了联合神经先验,统一减轻 LBM 的重建负担。最后,考虑到 GSR 系统对人类感知质量这一关键需求,我们设计了感知意识微调阶段,在缓解生成过程中的级联失配的同时改善感知对齐。在域内和域外任务与数据集上的广泛验证(例如改进近期零样本语音与播客生成的结果)表明了 VoiceBridge 的卓越性能。演示样例见:https://VoiceBridge-demo.github.io/。
论文及项目相关链接
摘要
本文介绍了基于潜在桥梁模型(LBMs)的VoiceBridge系统,用于通用语音恢复(GSR)。该系统能够重建高质量的全频带语音,并具有从各种失真中恢复语音的能力。通过压缩语音波形到连续的潜在表示,VoiceBridge用一个潜在到潜在生成过程对各种低质量到高质量的任务进行建模。通过引入能量保持变分自编码器,增强波形和潜在空间之间在不同能量水平上的对齐。针对从独特低质量先验重构高质量语音的困难,提出了一种联合神经先验,减轻了LBM的重构负担。最后,考虑到GSR系统的关键需求——人类感知质量,设计了一个感知意识微调阶段,以缓解生成过程中的级联不匹配问题,同时提高感知对齐。在跨领域任务和数据集上的广泛验证表明VoiceBridge的卓越性能。
关键见解
- VoiceBridge是一个基于潜在桥梁模型(LBMs)的通用语音恢复(GSR)系统。
- 能够重建高质量的全频带(即48 kHz)语音,并从各种失真中恢复语音。
- 通过连续潜在表示压缩语音波形,用一个潜在到潜在生成过程对各种低质量到高质量的任务进行建模。
- 引入能量保持变分自编码器,增强波形和潜在空间之间在不同能量水平的对齐。
- 提出了一种联合神经先验,以处理从独特低质量先验重构高质量语音的难题。
- 设计了一个感知意识微调阶段,以提高语音恢复的感知质量和缓解生成过程中的不匹配问题。
- VoiceBridge在跨领域任务和数据集上的表现经过广泛验证,并展示了其卓越性能。
点此查看论文截图

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Authors:Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan, Yijing Chen, Peng Zhang, Xiaojiang Liu, Meng Cao, Ruihua Song
Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, are conventionally addressed as separate tasks, with limited exploration to unify them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges in handling distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases in the process of introducing conditions. Therefore, VSSFlow leverages these inductive biases to effectively handle different representations: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from the end-to-end joint learning process for sound and speech generation without extra designs on training stages. Detailed analysis attributes it to the learned general audio prior shared between tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses the state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
视频条件下的声音与语音生成,包括视频到声音(V2S)和视觉文本到语音(VisualTTS)两类任务,传统上被当作独立任务处理,将二者统一到单一框架中的探索仍然有限。近期统一 V2S 与 VisualTTS 的尝试在处理不同类型的条件(例如异质的视频条件与文本转录条件)方面面临挑战,且需要复杂的训练阶段,统一这两个任务仍是一个悬而未决的问题。为弥合这一差距,我们提出了 VSSFlow,将 V2S 和 VisualTTS 无缝集成到统一的流匹配框架中。VSSFlow 使用一种新颖的条件聚合机制来处理不同的输入信号。我们发现,交叉注意力层和自注意力层在引入条件的过程中表现出不同的归纳偏置,因此 VSSFlow 利用这些归纳偏置来有效处理不同的表示:交叉注意力用于含义较模糊的视频条件,自注意力用于更确定的语音转录文本。此外,与“联合训练这两个任务需要复杂训练策略且可能损害性能”的普遍看法相反,我们发现 VSSFlow 能从声音与语音生成的端到端联合学习过程中获益,而无需在训练阶段做额外设计。详细分析将其归因于任务间共享的通用音频先验,它加速了收敛、增强了条件生成,并稳定了无分类器引导(classifier-free guidance)过程。大量实验表明,VSSFlow 在 V2S 和 VisualTTS 基准上均超越了最先进的领域专用基线,凸显了统一生成模型的重要潜力。
论文及项目相关链接
PDF Paper Under Review
Summary
本文提出了VSSFlow,一个将视频转声音(V2S)和视觉文本转语音(VisualTTS)任务统一于一个框架的方法。VSSFlow使用新型的条件聚合机制来处理不同的输入信号,利用交叉注意力和自注意力层不同的归纳偏置来处理不同的表示。此外,研究发现联合训练两个任务无需复杂的训练策略,反而能从端到端的联合学习过程获益,这得益于任务间共享的通用音频先验知识。VSSFlow在V2S和VisualTTS基准测试中均超越现有领域专用基线,展现了统一生成模型的巨大潜力。
Key Takeaways
- VSSFlow是一个统一的框架,整合了视频转声音(V2S)和视觉文本转语音(VisualTTS)任务。
- VSSFlow采用新型条件聚合机制处理不同输入信号。
- VSSFlow利用交叉注意力和自注意力层不同的归纳偏置来处理不同类型的条件表示。
- 联合训练两个任务无需复杂策略,得益于端到端的联合学习过程。
- 任务间共享的通用音频先验知识加速了收敛,增强了条件生成,稳定了无分类器指导过程。
- VSSFlow在V2S和VisualTTS基准测试中表现优异。
点此查看论文截图

MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
Authors:Yike Zhu, Boyi Kang, Ziqian Wang, Xingchen Li, Zihan Zhang, Wenjie Li, Longshuai Xiao, Wei Xue, Lei Xie
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. In the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
语音增强(SE)旨在从带噪信号中恢复干净语音,对电信和自动语音识别(ASR)等应用至关重要。虽然生成式方法能获得很强的感知质量,但它们通常依赖多步采样(扩散/流匹配)或大型语言模型,限制了实时部署。为缓解这些约束,我们提出了 MeanFlowSE,一个单步生成式语音增强框架。它采用 MeanFlow 预测平均速度场以完成单步潜变量细化,并以自监督学习(SSL)表示而非 VAE 潜变量作为模型的条件。这一设计加速了推理,并在训练中提供了稳健的声学-语义引导。在 Interspeech 2020 DNS Challenge 盲测集和模拟测试集上,MeanFlowSE 达到最先进水平的感知质量和有竞争力的可懂度,同时与近期的生成式方法相比显著降低了实时因子(RTF)和模型规模,适合实际应用。代码将在论文发表后于 https://github.com/Hello3orld/MeanFlowSE 发布。
论文及项目相关链接
PDF Submitted to ICASSP 2026
摘要
语音增强(SE)从噪声信号中恢复清洁语音,对电信和自动语音识别(ASR)等应用至关重要。虽然生成方法可以达到很强的感知质量,但它们通常依赖于多步采样(扩散/流匹配)或大型语言模型,限制了实时部署。为了缓解这些约束,我们提出了MeanFlowSE,这是一个一步生成SE框架。它采用MeanFlow预测平均速度场进行一步潜在细化,并在自我监督学习(SSL)表示的基础上对模型进行条件设置,而不是VAE潜变量。这种设计加速了推理,并在训练过程中提供了稳健的声学语义指导。在Interspeech 2020 DNS Challenge的盲测试集和模拟测试集中,MeanFlowSE达到了先进的感知质量和有竞争力的可理解性,同时大大降低了实时因子(RTF)和模型大小,与最近的生成竞争对手相比,适合实际应用。代码将在https://github.com/Hello3orld/MeanFlowSE发布。
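下面用一个极简的 PyTorch 示意说明 MeanFlow 式单步生成的通用思路:网络 u_theta 预测区间 [r, t] 上的平均速度场,单步更新 z_r = z_t - (t - r)·u 即可从带噪潜变量得到细化后的潜变量。u_theta 的接口、潜变量形状与 SSL 条件均为示例假设,并非 MeanFlowSE 的完整实现。

```python
import torch

@torch.no_grad()
def one_step_enhance(u_theta, noisy_latent, ssl_cond):
    z1 = noisy_latent                       # 起点:带噪/失真潜变量,假设形状 (B, T, D)
    t = torch.ones(z1.size(0), 1, 1)        # 区间终点 t = 1
    r = torch.zeros(z1.size(0), 1, 1)       # 区间起点 r = 0
    u = u_theta(z1, r, t, ssl_cond)         # 预测 [r, t] 上的平均速度场,形状同 z1(假设的接口)
    z0 = z1 - (t - r) * u                   # 单步更新得到增强后的潜变量
    return z0
```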
要点
- 语音增强(SE)在电信和自动语音识别(ASR)等应用中至关重要。
- 生成方法虽然能达成高感知质量,但存在实时部署的局限性。
- MeanFlowSE是一个一步生成SE框架,采用MeanFlow预测平均速度场进行潜在细化。
- MeanFlowSE在自我监督学习(SSL)表示的基础上进行条件设置,加速推理并提供稳健的声学语义指导。
- 在Interspeech 2020 DNS Challenge的测试中,MeanFlowSE达到先进的感知质量和有竞争力的可懂度。
- 与其他生成方法相比,MeanFlowSE降低实时因子(RTF)和模型大小。
- MeanFlowSE适合实际应用,代码将在相关仓库发布。
点此查看论文截图

i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents
Authors:Anupam Purwar, Aditya Choudhary
We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing components essential to a voice-to-voice (V-2-V) system, viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work analyzes how to reduce processing time while maintaining high-quality interactions to identify the levers for optimizing the V-2-V system. Our work identifies that the TTS component, which generates life-like voice full of emotions including natural pauses and exclamations, has the highest impact on the real-time factor (RTF). The experimented V-2-V architecture utilizes CSM1b, which has the capability to understand the tone as well as the context of a conversation by ingesting both audio and text of prior exchanges to generate contextually accurate speech. We explored optimization of the Residual Vector Quantization (RVQ) iterations used by the TTS decoder, which comes at the cost of a decrease in the quality of the generated voice. Our experimental evaluations also demonstrate that for V-2-V implementations based on CSM, the most important optimizations can be brought by reducing the number of RVQ iterations along with the codebooks used in Mimi.
我们对一个低延迟、端到端的语音到语音(V-2-V)通信模型进行了实验,以针对实时对话应用进行优化。通过分析 V-2-V 系统的关键组件,即自动语音识别(ASR)、文本到语音(TTS)和对话管理,我们的工作分析了如何在保持高质量交互的同时减少处理时间,从而找出优化 V-2-V 系统的关键杠杆。我们发现,负责生成逼真、富有情感(包括自然停顿与感叹)语音的 TTS 组件对实时因子(RTF)的影响最大。所实验的 V-2-V 架构使用 CSM1b,它能通过同时输入先前对话轮次的音频和文本来理解对话的语气和上下文,从而生成符合上下文的语音。我们探索了对 TTS 解码器中残差向量量化(RVQ)迭代次数的优化,但这以生成语音质量的下降为代价。我们的实验评估还表明,对于基于 CSM 的 V-2-V 实现,最重要的优化来自减少 RVQ 迭代次数以及 Mimi 中使用的码本数量。
论文及项目相关链接
PDF This paper analyzes a low-latency, end-to-end voice-to-voice (V-2-V) architecture, identifying that the Text-to-Speech (TTS) component has the highest impact on real-time performance. By reducing the number of Residual Vector Quantization (RVQ) iterations in the TTS model, latency can be effectively halved. It is accepted at AIML Systems 2025
Summary
本文研究了面向实时对话应用的端到端语音到语音通信模型的优化策略。通过分析自动语音识别(ASR)、文本到语音(TTS)和对话管理等关键环节,探索在保持高质量交互的同时减少处理时间的方法。研究发现,生成情感丰富、带有自然停顿和感叹的语音的 TTS 组件对实时因子(RTF)影响最大。实验的 V-2-V 架构利用 CSM1b,通过输入先前对话的音频和文本来理解对话内容与语气,生成符合上下文的语音。研究还探索了减少 TTS 解码器残差向量量化(RVQ)迭代次数以提升效率的做法:实验结果表明,减少 RVQ 迭代次数可以有效降低延迟,但会带来一定的语音质量损失。
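作为补充,下面给出实时因子(RTF)常用定义的一个极简 Python 示意:处理耗时除以音频时长,RTF 小于 1 表示快于实时。`process_fn` 只是占位的处理函数,代表 V-2-V 流水线中被计时的环节。

```python
import time

# 最小示意:RTF = 处理耗时 / 音频时长
def real_time_factor(process_fn, audio, sample_rate):
    start = time.perf_counter()
    process_fn(audio)                       # 被计时的处理环节(占位)
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate     # 音频时长(秒)
    return elapsed / duration

# 用法示例(假设的处理函数):
# rtf = real_time_factor(lambda x: x, audio_samples, 16000)   # RTF < 1 表示快于实时
```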
Key Takeaways
- 研究针对端到端语音通信模型的优化策略,用于实时语音对话应用。
- 分析自动语音识别(ASR)、文本到语音(TTS)和对话管理等关键环节对实时因子(RTF)的影响。
- 发现TTS组件是影响RTF最大的部分,它能生成带有情感和自然停顿的语音。
- 利用CSM技术理解和生成与对话内容和语气相关的语音,提升交互质量。
- 优化TTS解码器的RVQ迭代次数可提升效率,但以牺牲部分生成语音质量为代价。
- 实验评估表明,减少RVQ迭代次数以及Mimi中使用的码本数量,是优化基于CSM的V-2-V系统的关键手段。
点此查看论文截图

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
Authors:Bingsong Bai, Qihang Lu, Wenbing Yang, Zihan Sun, Yueran Hou, Peilei Jia, Songbai Pu, Ruibo Fu, Yingming Gao, Ya Li, Jun Gao
Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale paralinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.
副语言声音(如笑声和叹息)对于合成更真实、更有吸引力的语音至关重要。然而,现有方法通常依赖专有数据集,而公开可用的资源往往存在语音不完整、时间戳不准确或缺失、与真实场景相关性有限等问题。为了解决这些问题,我们提出了一个自动化框架用于生成大规模副语言数据,并应用它构建了 SynParaSpeech 数据集。该数据集涵盖 6 个副语言类别,共 118.75 小时数据并带有精确时间戳,全部源自自然对话语音。我们的贡献在于提出了首个自动化构建大规模副语言数据集的方法,并发布了 SynParaSpeech 语料库:它通过更自然的副语言合成推进语音生成,并通过改进副语言事件检测提升语音理解。数据集和音频样例见 https://github.com/ShawnPi233/SynParaSpeech。
论文及项目相关链接
PDF Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
摘要
文本指出,副语言声音(如笑声和叹息声)对于合成更真实、更吸引人的语音至关重要。然而,现有方法通常依赖于专有数据集,而公开可用的资源往往存在语音不完整、时间戳不准确或缺失、以及现实相关性有限等问题。为解决这些问题,文本提出了一种用于生成大规模副语言数据集的自动化框架,并应用该框架构建了SynParaSpeech数据集。该数据集包含6个副语言类别,共有118.75小时的数据和精确的时间戳,所有内容均来自自然对话语音。文本的主要贡献在于引入了一种构建大规模副语言数据集的自动化方法,并发布了SynParaSpeech语料库。该语料库通过更自然的副语言合成推进了语音生成,并通过提高副语言事件检测改善了语音理解。数据集和音频样本可在https://github.com/ShawnPi233/SynParaSpeech获取。
关键见解
- 副语言声音对于合成更真实、更吸引人的语音至关重要。
- 现有数据集存在依赖专有数据、公开资源语音不完整、时间戳不准确或缺失等问题。
- 提出了一个自动化框架来生成大规模副语言数据集。
- 构建了包含6个副语言类别、118.75小时数据和精确时间戳的SynParaSpeech数据集。
- 该数据集来自自然对话语音,提高了现实相关性。
- 通过更自然的副语言合成推进了语音生成。
点此查看论文截图

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Authors:Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology’s applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
语音 token 化实现了语音的离散表示,并促进了语音语言建模。然而,现有的神经编解码器主要捕捉低层声学特征,忽略了人类语音中固有的语义与上下文线索。尽管近期工作引入了来自自监督语音模型的语义表示,或融合了预训练语言模型的上下文表示,但在对齐和统一语义与上下文表示方面仍存在挑战。我们提出了 FuseCodec,通过强跨模态对齐和全局信息监督,统一声学、语义和上下文表示。我们提出三种互补技术:(i)潜在表示融合,将语义和上下文特征直接融入编码器潜在空间,以实现稳健且统一的表示学习;(ii)全局语义-上下文监督,用全局池化并广播的表示监督离散 token,以增强时间一致性和跨模态对齐;(iii)时间对齐的上下文监督,通过在局部窗口内动态匹配上下文 token 与语音 token 来加强对齐,实现细粒度的 token 级监督。我们进一步提出 FuseCodec-TTS,展示了该方法在零样本语音合成中的适用性。实验上,FuseCodec 在 LibriSpeech 上取得了最先进的性能,在转写准确率、感知质量、可懂度和说话人相似度方面超越了 EnCodec、SpeechTokenizer 和 DAC。结果凸显了上下文与语义引导的 token 化对语音 token 化及下游任务的有效性。代码和预训练模型见 https://github.com/mubtasimahasan/FuseCodec。
论文及项目相关链接
Summary
本文介绍了语音 token 化的重要性及其在语音语言建模中的应用。现有神经编解码方法主要关注低层声学特征,忽略了语音的语义和上下文线索。文章提出了FuseCodec,通过强大的跨模态对齐和全局监督,融合了声学、语义和上下文表示,并采用三种技术:潜在表示融合、全局语义-上下文监督、时间对齐的上下文监督。FuseCodec-TTS的引入证明了该方法在零样本语音合成中的适用性。实验结果表明,FuseCodec在LibriSpeech上的表现达到最新水平,超越了EnCodec、SpeechTokenizer和DAC,在转写准确率、感知质量、可懂度和说话人相似度方面均有改进。
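下面是“潜在表示融合”思路的一个极简 PyTorch 示意:把语义特征与上下文特征分别线性投影后,加到编解码器的声学潜在表示上再做归一化。维度、相加式融合与 LayerNorm 均为示例假设,论文的具体实现未在摘要中给出。

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """示例:将语义/上下文特征投影到声学潜在空间并相加融合(假设的实现方式)。"""
    def __init__(self, d_latent=512, d_sem=768, d_ctx=768):
        super().__init__()
        self.proj_sem = nn.Linear(d_sem, d_latent)
        self.proj_ctx = nn.Linear(d_ctx, d_latent)
        self.norm = nn.LayerNorm(d_latent)
    def forward(self, acoustic_latent, semantic_feat, contextual_feat):
        # 三路特征假设已对齐到相同帧数 T,形状均为 (B, T, d)
        fused = acoustic_latent + self.proj_sem(semantic_feat) + self.proj_ctx(contextual_feat)
        return self.norm(fused)

fusion = LatentFusion()
out = fusion(torch.randn(2, 100, 512), torch.randn(2, 100, 768), torch.randn(2, 100, 768))
```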
Key Takeaways
- Speech tokenization对于离散表示和语音语言建模至关重要。
- 现有神经编码方法主要关注声学特征而忽略了语义和上下文线索。
- FuseCodec融合了声学、语义和上下文表示,通过强大的跨模态对齐和全局监督实现。
- FuseCodec采用潜在表示融合、全局语义-上下文监督和时间对齐的上下文监督三种技术。
- FuseCodec-TTS证明了该方法在零样本语音合成中的适用性。
- FuseCodec在LibriSpeech上的表现优于其他模型,如EnCodec、SpeechTokenizer和DAC。
点此查看论文截图

Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
Authors:Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
我们提出了延迟流建模(DSM),一种用于流式、多模态序列到序列学习的灵活建模框架。序列到序列生成通常以离线方式进行:模型在生成第一个输出时间步之前需要读入完整的输入序列。另一类流式序列到序列方法则依赖学习一个策略,决定何时在输入流上前进、何时向输出流写入。DSM 则不同:它用仅解码器(decoder-only)的语言模型直接对已按时间对齐的多条流进行建模。通过把对齐放到预处理步骤,并在各条流之间引入适当的延迟,DSM 可以从任意输入组合对任意输出序列进行流式推理,因而适用于许多序列到序列问题。特别地,给定文本流和音频流,将文本流延迟即对应自动语音识别(ASR),反之(延迟音频流)则得到文本到语音(TTS)模型。我们在这两个主要的序列到序列任务上进行了大量实验,结果表明 DSM 在支持任意长序列的同时提供了最先进的性能和延迟,甚至可与离线基线相竞争。代码、样例和演示见 https://github.com/kyutai-labs/delayed-streams-modeling
论文及项目相关链接
Summary
本文介绍了延迟流建模(DSM),这是一种用于流式、多模态序列到序列学习的灵活方法。与传统的离线序列到序列生成不同,DSM对已经时间对齐的流进行建模,通过预处理步骤完成对齐,并在流之间引入适当的延迟,以实现任意输出序列的流式推断。这种方法适用于许多序列到序列问题,包括语音识别和文本到语音转换。实验表明,DSM具有最先进的性能和低延迟,支持任意长序列,与离线基准测试具有竞争力。
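下面用一个极简的 Python 片段直观展示“给流加延迟”的做法:两条已按帧对齐的流,把输出流整体右移 delay 步并用填充符补齐,ASR 对应延迟文本流,TTS 则对音频流加延迟。填充符与交织方式为示例假设,具体实现以官方代码库为准。

```python
PAD = "<pad>"

def with_delay(stream, delay):
    """把一条流整体右移 delay 步,前面用 PAD 补齐(示例做法)。"""
    return [PAD] * delay + list(stream)

audio = ["a1", "a2", "a3", "a4", "a5"]   # 输入流(例如音频帧)
text  = ["t1", "t2", "t3", "t4", "t5"]   # 输出流(例如文字 token)

# ASR 设置:文本流相对音频延迟 2 步,模型先“听到”音频再产出对应文字
for step, (a, t) in enumerate(zip(audio, with_delay(text, 2))):
    print(step, a, t)
```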
Key Takeaways
- 延迟流建模(DSM)是一种灵活的序列到序列学习方法,适用于流式和多模态数据。
- DSM通过预处理步骤完成流之间的时间对齐,并引入延迟。
- DSM支持任意输出序列的流式推断,适用于多种序列到序列问题。
- DSM在语音识别和文本到语音转换等任务中具有卓越性能。
- DSM具有先进的性能和低延迟,能够处理任意长序列。
- DSM与离线基准测试具有竞争力。
点此查看论文截图

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Authors:Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao
Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves \textbf{1st place} on the monaural leaderboard.
近期基于 Mamba 的模型通过高效建模长程时间依赖,在语音增强方面展现出良好前景。然而,像语音增强 Mamba(SEMamba)这样的模型仍局限于单说话人场景,在鸡尾酒会问题等复杂多说话人环境中表现不佳。为此,我们提出了 AVSEMamba,一个将全脸视觉线索与基于 Mamba 的时间主干网络相结合的视听语音增强模型。借助时空视觉信息,AVSEMamba 能在具有挑战性的条件下更准确地提取目标语音。在 AVSEC-4 挑战赛的开发集和盲测集上,AVSEMamba 在语音可懂度(STOI)、感知质量(PESQ)和非侵入式质量(UTMOS)方面均优于其他单声道基线,并在单声道赛道排行榜上获得第一名。
论文及项目相关链接
PDF Accepted to Interspeech 2025 Workshop
Summary
Mamba模型在语音增强领域展现出良好性能,能够高效建模长时序依赖关系。然而,面对复杂的多说话人环境如鸡尾酒会问题,现有模型如SEMamba仅限于单说话人场景。为解决此问题,本文提出了AVSEMamba音频视觉语音增强模型,它融合了面部视觉线索与Mamba的时序主干网络。借助时空视觉信息,AVSEMamba能在复杂条件下更准确地提取目标语音。在AVSEC-4挑战的开发和盲测试集上评估,AVSEMamba在语音清晰度(STOI)、感知质量(PESQ)和非侵入质量(UTMOS)方面超越其他单声道基线,并在单声道排行榜上获得第一名。
Key Takeaways
- Mamba模型在语音增强领域表现优异,擅长建模长时序依赖关系。
- 现有模型如SEMamba在复杂多说话人环境下面临挑战。
- AVSEMamba是一个音频视觉语音增强模型,融合了面部视觉线索和Mamba的时序主干网络。
- AVSEMamba借助时空视觉信息,能更准确地提取目标语音。
- 在AVSEC-4挑战的开发和盲测试集上,AVSEMamba性能超越其他模型。
- AVSEMamba在语音清晰度、感知质量和非侵入质量方面表现优秀。
点此查看论文截图

IML-Spikeformer: Input-aware Multi-Level Spiking Transformer for Speech Processing
Authors:Zeyang Song, Shimin Zhang, Yuhong Chou, Jibin Wu, Haizhou Li
Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome the issues, we introduce Input-aware Multi-Level Spikeformer, i.e. IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Re-parameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64$\times$ and 4.32$\times$ respectively. IML-Spikeformer marks an advance of scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency. Our source code and model checkpoints are publicly available at github.com/Pooookeman/IML-Spikeformer.
脉冲神经网络(Spiking Neural Networks, SNNs)受生物神经机制启发,是一种有前景的神经形态计算范式,为传统人工神经网络(ANN)提供了高能效的替代方案。尽管其有效性已得到验证,SNN 架构在大规模语音处理任务上仍难以取得有竞争力的性能。两个关键挑战阻碍了进展:(1)多时间步脉冲发放导致的训练计算开销高;(2)缺乏为语音处理任务量身定制的大规模 SNN 架构。为了解决这些问题,我们提出了输入感知多级 Spikeformer(IML-Spikeformer),一种专为大规模语音处理设计的脉冲 Transformer 架构。设计的核心是输入感知多级脉冲(IMLS)机制,它通过自适应的、输入感知的阈值方案,在单个时间步内模拟多时间步的脉冲发放。IML-Spikeformer 进一步将重参数化脉冲自注意力(RepSSA)模块与分层衰减掩码(HDM)结合,构成 HD-RepSSA 模块。该模块提高了注意力图的精度,并能建模语音信号中的多尺度时间依赖。实验表明,IML-Spikeformer 在 AiShell-1 上取得 6.0% 的词错误率,在 Librispeech-960 上取得 3.4%,与传统 ANN Transformer 相当,同时将理论推理能耗分别降低 4.64 倍和 4.32 倍。IML-Spikeformer 标志着可扩展 SNN 架构在大规模语音处理的任务性能与能效两方面的进步。源代码和模型检查点公开于 github.com/Pooookeman/IML-Spikeformer。
论文及项目相关链接
PDF Accepted by TNNLS
摘要
受生物神经机制启发的脉冲神经网络(SNN)是一种前景看好的神经形态计算范式,为传统人工神经网络(ANN)提供了能效更高的替代方案。然而,SNN 架构在大规模语音处理任务上的表现竞争力有限。本文提出输入感知多级 Spikeformer(IML-Spikeformer),一种专门用于大规模语音处理的脉冲 Transformer 架构。其核心的输入感知多级脉冲(IMLS)机制能够在单个时间步内模拟多时间步的脉冲发放。此外,IML-Spikeformer 将重参数化脉冲自注意力(RepSSA)模块与分层衰减掩码(HDM)结合,形成 HD-RepSSA 模块,提高了注意力图的精度,并实现了语音信号多尺度时间依赖性的建模。实验表明,IML-Spikeformer 在 AiShell-1 上的词错误率为 6.0%,在 Librispeech-960 上为 3.4%,与传统 ANN Transformer 相当,同时理论推理能耗分别降低 4.64 倍和 4.32 倍。IML-Spikeformer 标志着可扩展 SNN 架构在大规模语音处理任务性能和能效方面的进步。
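下面用一个极简的 PyTorch 片段示意“输入感知阈值 + 多级脉冲”的一种可能实现思路:按输入自适应地确定阈值,把激活量化为 0 到 L 级的整数脉冲,从而在单个时间步内近似多时间步发放。阈值的具体取法为示例假设,并非论文 IMLS 机制的精确定义。

```python
import torch

def input_aware_multilevel_spike(x, levels=4, eps=1e-6):
    # 输入感知阈值(假设:按最后一维的最大幅值自适应确定)
    theta = x.abs().amax(dim=-1, keepdim=True) / levels + eps
    # 把激活量化为 0..levels 的整数脉冲数,相当于在单个时间步内的多级发放
    spikes = torch.clamp(torch.round(torch.relu(x) / theta), 0, levels)
    return spikes * theta        # 还原幅度尺度,便于后续层使用

print(input_aware_multilevel_spike(torch.randn(2, 8)))
```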
关键见解
- 脉冲神经网络(SNNs)是一种受生物神经启发的计算范例,相较于传统人工神经网络(ANNs)更节能。
- SNN面临两大挑战:多时间步脉冲发射导致的高计算开销,以及缺乏针对语音处理任务的大规模SNN架构。
- IML-Spikeformer,一种专为大规模语音处理设计的脉冲Transformer架构被提出。
- IML-Spikeformer的Input-aware Multi-Level Spike(IMLS)机制能在单个时间步内模拟多时间步的脉冲发射。
- HD-RepSSA模块通过增强注意力图的精度和实现多尺度时间依赖性建模,提高了语音处理的性能。
- IML-Spikeformer在大规模语音识别任务上表现出色,词错误率与常规ANN Transformer相当。
- IML-Spikeformer相比传统方法大幅降低了推理能耗。
点此查看论文截图

EnvSDD: Benchmarking Environmental Sound Deepfake Detection
Authors:Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Haohe Liu, Wenwu Wang, Mark D Plumbley
Audio generation systems now create very realistic soundscapes that can enhance media production, but also pose potential risks. Several studies have examined deepfakes in speech or singing voice. However, environmental sounds have different characteristics, which may make methods for detecting speech and singing deepfakes less effective for real-world sounds. In addition, existing datasets for environmental sound deepfake detection are limited in scale and audio types. To address this gap, we introduce EnvSDD, the first large-scale curated dataset designed for this task, consisting of 45.25 hours of real and 316.74 hours of fake audio. The test set includes diverse conditions to evaluate the generalizability, such as unseen generation models and unseen datasets. We also propose an audio deepfake detection system, based on a pre-trained audio foundation model. Results on EnvSDD show that our proposed system outperforms the state-of-the-art systems from speech and singing domains.
音频生成系统如今能够生成非常逼真的声音场景,这既能提升媒体制作效果,也带来潜在风险。已有一些研究关注语音或歌声中的深度伪造。然而,环境声具有不同的特性,可能使针对语音和歌声深度伪造的检测方法在真实世界声音上效果欠佳。此外,现有用于环境声深度伪造检测的数据集在规模和音频类型上都很有限。为弥补这一空白,我们提出了 EnvSDD,首个为该任务精心构建的大规模数据集,包含 45.25 小时的真实音频和 316.74 小时的伪造音频。测试集涵盖多种条件以评估泛化能力,例如未见过的生成模型和未见过的数据集。我们还提出了一个基于预训练音频基础模型的音频深度伪造检测系统。在 EnvSDD 上的结果表明,我们提出的系统优于来自语音和歌声领域的最先进系统。
论文及项目相关链接
PDF Proceedings of Interspeech 2025
Summary
本文指出,音频生成系统如今能生成逼真的声音场景,在增强媒体制作效果的同时也存在潜在风险。针对环境声音的特性,现有的语音和歌唱声音深度伪造检测手段可能并不完全适用。为了填补这一空白,文章推出了EnvSDD数据集,用于专门的环境声音深度伪造检测任务,其中包括真实音频与合成音频,并在测试集中设置多种条件以评估泛化能力。同时,文章还提出了一种基于预训练音频基础模型的音频深度伪造检测系统,该系统在EnvSDD上的表现优于语音和歌唱领域的最新系统。
Key Takeaways
- 音频生成系统能生成逼真的声音场景,增强媒体制作效果,但也存在潜在风险。
- 环境声音的特性使得现有的语音和歌唱深度伪造检测手段可能不够有效。
- 推出EnvSDD数据集,用于专门的环境声音深度伪造检测任务。
- EnvSDD数据集包括真实音频与合成音频,提供了大量的样本以供学习和检测。
- 数据集中的测试集包含了各种条件以评估检测模型的泛化能力。
- 提出了一种基于预训练音频基础模型的音频深度伪造检测系统。
点此查看论文截图

MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Authors:Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).
社交媒体上多模态内容的激增,给理解和审核错误信息、仇恨言论和宣传等复杂且依赖语境的问题带来了重大挑战。尽管已有工作致力于开发资源并提出新的自动检测方法,但对“标签检测”与“基于解释的理由(rationale)生成”进行联合建模的关注仍然有限,而同时训练二者往往会导致分类性能下降。为应对这一挑战,我们提出了 MemeXplain,一个解释增强的数据集,涵盖阿拉伯语的宣传类 meme 和英语的仇恨类 meme,是这两类任务的首个大规模资源。为解决这些任务,我们提出了多阶段优化方法并训练视觉语言模型(VLM)。结果表明,该策略在标签检测和解释生成质量上都显著优于基础模型,并超越了当前最先进水平:在 ArMeme 上准确率绝对提升约 1.4%,在 Hateful Memes 上约 2.2%。为了可复现性和未来研究,我们计划公开 MemeXplain 数据集和脚本(https://github.com/MohamedBayan/MemeIntel)。
论文及项目相关链接
PDF disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images
Summary
社交媒体上多模态内容的激增,为理解和审核复杂、依赖语境的问题(如错误信息、仇恨言论和宣传)带来了重大挑战。尽管已有一些自动检测资源的开发和新方法的提出,但在联合建模标签检测和基于解释的理由生成方面关注较少,这往往导致同时训练时分类性能下降。为应对这一挑战,我们推出了MemeXplain数据集,包含阿拉伯语的宣传性meme和英语的仇恨性meme,成为首个大规模资源用于这些任务。为解决这些任务,我们提出了多阶段优化方法并训练了视觉语言模型(VLMs)。结果显示,此策略在标签检测和解释生成质量上均显著提高了基础模型性能,相较于当前最先进技术,在ArMeme上准确率提高了约1.4%,在Hateful Memes上准确率提高了约2.2%。我们旨在让MemeXplain数据集和脚本公开可用,以供未来研究(https://github.com/MohamedBayan/MemeIntel)。
Key Takeaways
- 社交媒体多模态内容带来的挑战:理解和审核错误信息、仇恨言论和宣传等复杂、依赖语境的问题具有难度。
- 数据集缺失问题:尽管已有自动检测资源的开发和新方法提出,但在联合建模标签检测和基于解释的理由生成方面关注不足。
- MemeXplain数据集的推出:包含阿拉伯语的宣传性meme和英语的仇恨性meme,成为大规模资源用于相关任务。
- 解决方案:采用多阶段优化方法并训练视觉语言模型(VLMs)。
- 性能和效果:策略显著提高标签检测和解释生成质量,相较于当前技术有显著提升。
- 数据集和脚本的公开可用性:旨在促进未来研究。
点此查看论文截图
