⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Free trial on HuggingFace
Updated 2025-11-19
PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement
Authors:Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu
Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to incorrect spoken content or inconsistent speaker characteristics, which we term linguistic and acoustic hallucinations, respectively. We argue that linguistic hallucination stems from models’ failure to constrain valid phonological structures and it is a more fundamental challenge. While language models (LMs) are well-suited for capturing the underlying speech structure through modeling the distribution of discrete tokens, existing approaches are limited in learning from noise-corrupted representations, which can lead to contaminated priors and hallucinations. To overcome these limitations, we propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model to mitigate hallucinations. First, we adapt WavLM into a denoising expert via representation distillation to clean its final-layer features. Guided by the model’s intrinsic phonological prior, this process enables robust denoising while minimizing linguistic hallucinations. To further reduce acoustic hallucinations, we train the vocoder with a dual-stream representation: the high-level phonetic representation provides clean linguistic content, while a low-level acoustic representation retains speaker identity and prosody. Experimental results demonstrate that PASE not only surpasses state-of-the-art discriminative models in perceptual quality, but also significantly outperforms prior generative models with substantially lower linguistic and acoustic hallucinations.
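To make the distillation step concrete, here is a minimal sketch of the kind of representation distillation described above: a frozen WavLM teacher encodes clean speech, a trainable copy encodes the paired noisy speech, and the student's final-layer features are pulled toward the teacher's. The checkpoint name, loss weighting, and optimizer settings are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of representation distillation toward a denoising WavLM expert.
# Assumptions (not from the paper): checkpoint id, equal loss weights, AdamW settings.
import copy
import torch
import torch.nn.functional as F
from transformers import WavLMModel

teacher = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()  # frozen, sees clean speech
student = copy.deepcopy(teacher).train()                                  # trainable, sees noisy speech
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(noisy_wav: torch.Tensor, clean_wav: torch.Tensor) -> float:
    """One step: pull the student's final-layer features on noisy input toward
    the teacher's final-layer features on the paired clean input."""
    with torch.no_grad():
        target = teacher(clean_wav).last_hidden_state    # (B, T, D) clean reference
    pred = student(noisy_wav).last_hidden_state          # (B, T, D) denoised estimate
    l1 = F.l1_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    loss = l1 + cos                                      # assumed equal weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage: noisy_wav and clean_wav are float tensors of shape (batch, num_samples) at 16 kHz
```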
Paper & Project Links
PDF Accepted by AAAI 2026
Summary
This paper addresses generative speech enhancement, where generative models offer clear perceptual advantages over traditional discriminative models but risk hallucinating under severe noise. The authors argue that linguistic hallucination stems from a model's failure to constrain valid phonological structure and is the more fundamental challenge. They propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model. WavLM is adapted into a denoising expert via representation distillation that cleans its final-layer features; guided by the model's intrinsic phonological prior, this yields robust denoising with minimal linguistic hallucination. To further reduce acoustic hallucination, the vocoder is trained on a dual-stream representation in which a high-level phonetic stream supplies clean linguistic content while a low-level acoustic stream preserves speaker identity and prosody. Experiments show that PASE surpasses state-of-the-art discriminative models in perceptual quality and significantly outperforms prior generative models with substantially fewer linguistic and acoustic hallucinations.
Key Takeaways
- Generative models show clear advantages in speech enhancement, achieving higher perceptual quality than traditional discriminative models.
- Existing generative models risk hallucinating under severe noise; the hallucinations fall into linguistic and acoustic types.
- Linguistic hallucination stems from the model's failure to constrain valid phonological structure and is the more fundamental challenge.
- The proposed PASE framework leverages the robust phonological prior of WavLM to mitigate hallucinations.
- WavLM is adapted into a denoising expert via representation distillation, cleaning its features and reducing linguistic hallucination.
- A vocoder trained on a dual-stream representation further reduces acoustic hallucination while preserving linguistic content, speaker identity, and prosody.
Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis
Authors:Zaara Zabeen Arpa, Sadnam Sakib Apurbo, Nazia Karim Khan Oishee, Ajwad Abrar
Automatic Speech Recognition (ASR) transcripts, especially in low-resource languages like Bangla, contain a critical ambiguity: word-word repetitions can be either Repetition Disfluency (unintentional ASR error/hesitation) or Morphological Reduplication (a deliberate grammatical construct). Standard disfluency correction fails by erroneously deleting valid linguistic information. To solve this, we introduce the first publicly available, 20,000-row Bangla corpus, manually annotated to explicitly distinguish between these two phenomena in noisy ASR transcripts. We benchmark this novel resource using two paradigms: state-of-the-art multilingual Large Language Models (LLMs) and task-specific fine-tuning of encoder models. LLMs achieve competitive performance (up to 82.68% accuracy) with few-shot prompting. However, fine-tuning proves superior, with the language-specific BanglaBERT model achieving the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically-informed baseline and provides essential data for developing sophisticated, semantic-preserving text normalization systems for Bangla.
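For orientation, a minimal sketch of the fine-tuning paradigm benchmarked above: an encoder with a two-way classification head labels each word-word repetition as disfluency or reduplication. The checkpoint id, learning rate, and batching are assumptions rather than the paper's exact setup.

```python
# Minimal sketch of fine-tuning an encoder to label a repeated word as
# repetition disfluency (0) vs. morphological reduplication (1).
# Assumed: "csebuetnlp/banglabert" checkpoint id and hyperparameters.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "csebuetnlp/banglabert"  # assumed BanglaBERT checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(sentences: list[str], labels: list[int]) -> float:
    """sentences: ASR transcript strings containing a word-word repetition;
    labels: 0 = repetition disfluency, 1 = morphological reduplication."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```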
Paper & Project Links
Summary
This paper tackles an ambiguity caused by word-word repetitions in automatic speech recognition (ASR) transcripts. In low-resource languages such as Bangla, a repetition can be either a repetition disfluency (an unintentional ASR error or hesitation) or morphological reduplication (a deliberate grammatical construct), and standard disfluency correction wrongly deletes valid linguistic information. The authors introduce the first publicly available, manually annotated Bangla corpus that explicitly distinguishes the two phenomena in noisy ASR transcripts. Benchmarking with two paradigms, state-of-the-art multilingual large language models (LLMs) and task-specific fine-tuning of encoder models, shows that LLMs are competitive with few-shot prompting but fine-tuning is superior: the language-specific BanglaBERT model reaches the highest accuracy of 84.78% and an F1 score of 0.677. This establishes a strong, linguistically informed baseline and provides essential data for semantics-preserving Bangla text normalization systems.
Key Takeaways
- Word-word repetitions in ASR transcripts are ambiguous: repetition disfluency must be distinguished from morphological reduplication.
- The first publicly available, manually annotated Bangla corpus is introduced to resolve this ambiguity.
- Two evaluation paradigms are benchmarked: state-of-the-art multilingual large language models and task-specific fine-tuning of encoder models.
- Multilingual LLMs are competitive (up to 82.68% accuracy with few-shot prompting), but fine-tuning performs better.
- The language-specific BanglaBERT model achieves the highest accuracy of 84.78% and an F1 score of 0.677.
- The study provides a strong, linguistically informed baseline and essential data for Bangla text normalization systems.
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Authors:Sina Rashidi, Hossein Sameti
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves improvement of 4.6 ASR BLEU with the synthetic data over direct baselines. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English
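The discrete target speech units mentioned above are typically obtained by clustering SSL features; the sketch below shows that standard recipe (a k-means codebook followed by run-length de-duplication). The feature source, the number of clusters, and the de-duplication step are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of turning continuous SSL speech features into discrete target
# units via k-means, the usual recipe for unit-based S2ST targets.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def fit_unit_codebook(feature_matrix: np.ndarray, n_units: int = 1000) -> MiniBatchKMeans:
    """feature_matrix: (num_frames_total, dim) SSL features pooled over the target-language corpus."""
    km = MiniBatchKMeans(n_clusters=n_units, batch_size=4096, random_state=0)
    km.fit(feature_matrix)
    return km

def speech_to_units(features: np.ndarray, km: MiniBatchKMeans) -> list[int]:
    """Map one utterance's frame features (T, dim) to a de-duplicated unit sequence."""
    ids = km.predict(features)
    # collapse consecutive repeats so the decoder predicts unit changes, not frames
    return [int(ids[0])] + [int(u) for prev, u in zip(ids, ids[1:]) if u != prev]
```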
Paper & Project Links
Summary
This paper presents a direct speech-to-speech translation (S2ST) system from Persian to English, together with a pipeline for generating synthetic Persian-English parallel speech. The system has three components: a conformer-based encoder initialized from self-supervised pre-training, a causal Transformer decoder with relative-position multi-head attention, and a unit-based neural vocoder. To counter data scarcity, Persian speech transcriptions are translated into English with a large language model and the corresponding English speech is synthesized with a state-of-the-art zero-shot text-to-speech system, yielding a new Persian-English parallel speech corpus. On the Persian-English portion of the CVSS corpus, the model trained with the synthetic data improves ASR BLEU by 4.6 over direct baselines.
Key Takeaways
- Direct speech-to-speech translation (S2ST) offers a simpler pipeline and lower inference latency than cascaded systems.
- Direct S2ST models need large amounts of parallel speech data, which is scarce for low-resource languages such as Persian.
- The proposed Persian-to-English direct S2ST system comprises a conformer-based encoder, a causal Transformer decoder, and a unit-based neural vocoder.
- To address data scarcity, a large language model translates Persian transcriptions into English, and a zero-shot text-to-speech system synthesizes the corresponding English speech to build a Persian-English parallel corpus.
- The new corpus increases the amount of available parallel speech by roughly a factor of six.
- On the Persian-English portion of the CVSS corpus, the model improves ASR BLEU by 4.6 over direct baselines.
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Authors:Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee’s Uni-MoE series in language-centric multimodal understanding, reasoning, and generating. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. of 8), omnimodallity understanding (+7% avg. of 4), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
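As a rough illustration of an MoE layer with shared, routed, and null experts, the sketch below always applies a shared expert, routes each token to its top-k routed experts, and lets a null expert skip extra computation. Expert sizes, gating, and top-k are assumptions and do not reflect the paper's exact architecture.

```python
# Minimal sketch of a dynamic-capacity MoE layer with shared, routed, and null experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int, n_routed: int = 4, top_k: int = 1):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.routed = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_routed)
        )
        # Router scores n_routed experts plus one null expert (index n_routed).
        self.router = nn.Linear(dim, n_routed + 1)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        out = self.shared(x)                              # every token gets the shared expert
        probs = F.softmax(self.router(x), dim=-1)         # (B, T, n_routed + 1)
        weights, idx = probs.topk(self.top_k, dim=-1)
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e
                if mask.any():
                    out = out + mask * weights[..., k : k + 1] * expert(x)
        # Tokens whose choice is the null expert (index n_routed) add nothing extra.
        return x + out                                    # residual connection

# usage
layer = MoELayer(dim=256)
y = layer(torch.randn(2, 10, 256))
```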
Paper & Project Links
PDF 47 pages,10 Figures, Project Website: https://idealistxy.github.io/Uni-MoE-v2.github.io/; Codes: https://github.com/HITsz-TMG/Uni-MoE
Summary
Uni-MoE 2.0, from the Lychee family, is a fully open-source omnimodal large model (OLM) built from scratch on the Qwen2.5-7B dense architecture. It advances language-centric multimodal understanding, reasoning, and generation through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and carefully curated multimodal data matching. The model handles omnimodal understanding and can generate images, text, and speech. Its MoE framework balances computational efficiency and capability over cross-modal inputs with shared, routed, and null experts, while Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. After cross-modal pre-training, progressive supervised fine-tuning activates modality-specific experts, aided by balanced data composition and an iterative GSPO-DPO method that stabilizes RL training and improves reasoning. Trained on roughly 75B tokens of open-source multimodal data with special speech and image generation tokens, the model achieves SOTA or highly competitive results across 85 benchmarks, with notable strengths in video understanding, omnimodal understanding, and audio-visual reasoning.
Key Takeaways
- Uni-MoE 2.0 is a fully open-source omnimodal large model (OLM) from the Lychee family, built on the Qwen2.5-7B dense architecture.
- Uni-MoE-2.0-Omni combines a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy, and curated multimodal data matching.
- The model supports omnimodal understanding and can generate images, text, and speech.
- The new MoE framework uses shared, routed, and null experts to balance computational efficiency and capability across cross-modal inputs.
- Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer.
- Progressive supervised fine-tuning with balanced data composition and an iterative GSPO-DPO method stabilizes training and improves reasoning.
- The model performs strongly across 85 benchmarks, with clear gains in video understanding, omnimodal understanding, and audio-visual reasoning.
How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer
Authors:Minu Kim, Ji Sub Um, Hoirin Kim
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems: Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by downstream task, highlighting task effects on temporal focus in tone modeling.
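A minimal sketch of one way to estimate "how far a model listens" for tone with gradients: back-propagate a single tone logit to the waveform and measure the width of the window holding most of the saliency mass. The 90% mass criterion and the probe interface are assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a gradient-based estimate of the temporal span used for a tone decision.
import torch

def tone_saliency_span(model, waveform: torch.Tensor, tone_id: int,
                       sr: int = 16000, mass: float = 0.9) -> float:
    """waveform: (1, num_samples); model maps it to tone logits of shape (1, num_tones).
    Returns the duration in seconds of the smallest sample set holding `mass` of |gradient|."""
    wav = waveform.clone().requires_grad_(True)
    logits = model(wav)                       # (1, num_tones)
    logits[0, tone_id].backward()
    sal = wav.grad.abs().squeeze(0)           # (num_samples,) per-sample saliency
    order = torch.argsort(sal, descending=True)
    cum = torch.cumsum(sal[order], dim=0)
    k = int(torch.searchsorted(cum, mass * sal.sum()).item()) + 1
    span_samples = (order[:k].max() - order[:k].min()).item() + 1
    return span_samples / sr
```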
Paper & Project Links
PDF 5 pages, 7 figures, submitted to ICASSP 2026
Summary
Lexical tone remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. This paper studies four languages with complex and diverse tone systems, Burmese, Thai, Lao, and Vietnamese, to examine how far such models listen for tone and how tone transfers under low-resource conditions. As a baseline reference, the temporal span of tone cues is estimated at about 100 ms in Burmese and Thai and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models show that tone transfer depends on the downstream task: ASR fine-tuning aligns the span with language-specific tone cues, whereas prosody- and voice-related tasks bias the model toward overly long spans. Tone transfer is thus shaped by the downstream task, highlighting task effects on the temporal focus of tone modeling.
Key Takeaways
- Tone is underexplored in self-supervised (SSL) speech models, especially for languages beyond Mandarin such as Burmese, Thai, Lao, and Vietnamese.
- The temporal span of tone cues differs by language: roughly 100 ms in Burmese and Thai versus roughly 180 ms in Lao and Vietnamese, reflecting distinct acoustic characteristics of the tone systems.
- ASR fine-tuning aligns the model's temporal span with language-specific tone cues.
- Prosody- and voice-related downstream tasks bias the model toward overly long spans, showing that tone transfer is shaped by the downstream task.
Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification
Authors:Xingqi Lin, Liangyu Chen, Min Wu, Min Zhang, Zhenbing Zeng
Robustness verification is a promising technique for rigorously proving Recurrent Neural Networks (RNNs) robustly. A key challenge is to over-approximate the nonlinear activation functions with linear constraints, which can transform the verification problem into an efficiently solvable linear programming problem. Existing methods over-approximate the nonlinear parts with linear bounding planes individually, which may cause significant over-estimation and lead to lower verification accuracy. In this paper, in order to tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, we propose a novel truncated rectangular prism formed by two linear relaxation planes and a refinement-driven method to minimize both its volume and surface area for tighter over-approximation. Based on this approximation, we implement a prototype DeepPrism for RNN robustness verification. The experimental results demonstrate that \emph{DeepPrism} has significant improvement compared with the state-of-the-art approaches in various tasks of image classification, speech recognition and sentiment analysis.
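For background, the classical linear relaxation of a bilinear (Hadamard-product) term is the McCormick envelope, which bounds z = x·y with four planes over a box; tighter constructions such as the paper's truncated prism aim to improve on this kind of baseline. The sketch below implements the standard McCormick bounds, not the paper's prism.

```python
# Classical McCormick envelope for a bilinear term z = x * y with
# x in [lx, ux] and y in [ly, uy]: background material, not the paper's construction.
def mccormick_bounds(x: float, y: float, lx: float, ux: float, ly: float, uy: float):
    """Return (lower, upper) linear-relaxation bounds on x*y at a point inside the box."""
    lower = max(lx * y + ly * x - lx * ly,
                ux * y + uy * x - ux * uy)
    upper = min(ux * y + ly * x - ux * ly,
                lx * y + uy * x - lx * uy)
    return lower, upper

# sanity check on the unit box: 0 <= 0.25 <= 0.5
lo, hi = mccormick_bounds(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)
assert lo <= 0.5 * 0.5 <= hi
```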
Paper & Project Links
Summary
This paper addresses robustness verification of recurrent neural networks (RNNs). The key challenge is to over-approximate the nonlinear activation functions with linear constraints, which turns verification into an efficiently solvable linear programming problem. Existing methods bound the nonlinear parts individually with linear planes, which can cause significant over-estimation and lower verification accuracy. To tightly enclose the three-dimensional nonlinear surface generated by the Hadamard product, the authors propose a truncated rectangular prism formed by two linear relaxation planes, together with a refinement-driven method that minimizes the prism's volume and surface area for a tighter over-approximation. A prototype, DeepPrism, built on this approximation achieves significant improvements over state-of-the-art approaches on image classification, speech recognition, and sentiment analysis tasks.
Key Takeaways
- Robustness verification is a promising technique for rigorously proving the robustness of RNNs.
- Over-approximating the nonlinear activation functions with linear constraints is the key challenge in the verification process.
- Existing methods bound the nonlinear parts individually with linear planes, which can cause significant over-estimation and lower verification accuracy.
- A truncated rectangular prism formed by two linear relaxation planes, refined to minimize volume and surface area, yields a tighter over-approximation of the Hadamard-product surface.
- Based on this approximation, a prototype named DeepPrism is implemented for RNN robustness verification.
- Experiments show that DeepPrism significantly improves on state-of-the-art approaches for image classification, speech recognition, and sentiment analysis.
Regularized Schrödinger: Alleviating Distortion and Exposure Bias in Solving Inverse Problems
Authors:Qing Yao, Lijian Gao, Qirong Mao, Dong Ming
Diffusion models serve as a powerful generative framework for solving inverse problems. However, they still face two key challenges: 1) the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and 2) the exposure bias problem, where the training-inference input mismatch leads to prediction error accumulation and reduced reconstruction quality. In this work, we propose the Regularized Schrödinger Bridge (RSB), an adaptation of Schrödinger Bridge tailored for inverse problems that addresses the above limitations. RSB employs a novel regularized training strategy that perturbs both the input states and targets, effectively mitigating exposure bias by exposing the model to simulated prediction errors and also alleviating distortion by well-designed interpolation via the posterior mean. Extensive experiments on two typical inverse problems for speech enhancement demonstrate that RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.
Paper & Project Links
Summary
Diffusion models are a powerful generative framework for solving inverse problems, but they face two key challenges: the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and exposure bias, where the training-inference input mismatch accumulates prediction errors and lowers reconstruction quality. This work proposes the Regularized Schrödinger Bridge (RSB), an adaptation of the Schrödinger Bridge tailored to inverse problems. RSB uses a regularized training strategy that perturbs both the input states and the targets, mitigating exposure bias by exposing the model to simulated prediction errors and alleviating distortion through well-designed interpolation via the posterior mean. On two typical inverse problems for speech enhancement, RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.
Key Takeaways
- Diffusion models are a strong generative framework for solving inverse problems.
- They face two key challenges: the distortion-perception tradeoff and exposure bias.
- The Regularized Schrödinger Bridge (RSB) adapts the Schrödinger Bridge to inverse problems and addresses both issues with a regularized training strategy.
- RSB mitigates exposure bias by perturbing both the input states and the targets during training.
- RSB alleviates distortion through well-designed interpolation via the posterior mean.
- Experiments on speech enhancement show that RSB significantly improves distortion metrics over state-of-the-art methods.
VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Authors:Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
Paper & Project Links
PDF This article will serve as an extension of the preceding work, “VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models” (arXiv:2505.15727). Therefore, we have chosen to withdraw to avoid potential duplicate publication. We will update the previously open-sourced paper of VocalBench in several weeks to include the content of VocalBench-zh
Summary
Multi-modal large language models (LLMs) have enabled intelligent speech interaction, and most models support Mandarin, one of the world's most widely spoken languages, to broaden their applicability and reach. However, the lack of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and fair model comparison for users. This work proposes VocalBench-zh, an ability-level-divided evaluation suite for the Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characteristics. Evaluation of 14 mainstream models reveals common challenges for current approaches and highlights the need for new insights into next-generation speech interactive systems. The evaluation code and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
Key Takeaways
- The development of multi-modal large language models is driving intelligent speech interaction.
- Mandarin support broadens model applicability and reach.
- The lack of comprehensive Mandarin speech-to-speech benchmarks hinders systematic evaluation and fair model comparison.
- VocalBench-zh addresses this gap with 10 well-crafted subsets and over 10K high-quality instances.
- The suite covers 12 user-oriented characteristics.
- Evaluation of 14 mainstream models reveals common challenges and underscores the need for new insights into next-generation speech interactive systems.
Invisible Ears at Your Fingertips: Acoustic Eavesdropping via Mouse Sensors
Authors:Mohamad Fakih, Rahul Dharmaji, Youssef Mahmoud, Halima Bouzidi, Mohammad Abdullah Al Faruque
Modern optical mouse sensors, with their advanced precision and high responsiveness, possess an often overlooked vulnerability: they can be exploited for side-channel attacks. This paper introduces Mic-E-Mouse, the first-ever side-channel attack that targets high-performance optical mouse sensors to covertly eavesdrop on users. We demonstrate that audio signals can induce subtle surface vibrations detectable by a mouse’s optical sensor. Remarkably, user-space software on popular operating systems can collect and broadcast this sensitive side channel, granting attackers access to raw mouse data without requiring direct system-level permissions. Initially, the vibration signals extracted from mouse data are of poor quality due to non-uniform sampling, a non-linear frequency response, and significant quantization. To overcome these limitations, Mic-E-Mouse employs a sophisticated end-to-end data filtering pipeline that combines Wiener filtering, resampling corrections, and an innovative encoder-only spectrogram neural filtering technique. We evaluate the attack’s efficacy across diverse conditions, including speaking volume, mouse polling rate and DPI, surface materials, speaker languages, and environmental noise. In controlled environments, Mic-E-Mouse improves the signal-to-noise ratio (SNR) by up to +19 dB for speech reconstruction. Furthermore, our results demonstrate a speech recognition accuracy of roughly 42% to 61% on the AudioMNIST and VCTK datasets. All our code and datasets are publicly accessible on https://sites.google.com/view/mic-e-mouse.
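A minimal sketch of the classical part of such a pipeline: interpolate non-uniformly timestamped mouse displacement reports onto a uniform grid and apply a Wiener filter. The target sampling rate and filter length are assumptions, and the paper's neural spectrogram filtering stage is not reproduced here.

```python
# Minimal sketch: resample non-uniform mouse reports to a uniform grid, then Wiener-filter.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import wiener

def mouse_to_audio(timestamps: np.ndarray, dy: np.ndarray, fs: int = 8000) -> np.ndarray:
    """timestamps: ascending report times in seconds (non-uniform);
    dy: vertical displacement per report; fs: assumed uniform output rate."""
    t_uniform = np.arange(timestamps[0], timestamps[-1], 1.0 / fs)
    resampled = interp1d(timestamps, dy, kind="linear")(t_uniform)  # correct non-uniform sampling
    resampled -= resampled.mean()                                   # remove DC drift
    return wiener(resampled, mysize=29)                             # suppress broadband sensor noise

# usage: signal = mouse_to_audio(ts, displacements); feed the result to a downstream denoiser/ASR
```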
Paper & Project Links
PDF Appearing in the Annual Computer Security Applications Conference (ACSAC 2025)
Summary
Modern high-performance optical mouse sensors carry an often-overlooked vulnerability: they can be exploited for side-channel attacks. This paper introduces Mic-E-Mouse, the first side-channel attack that targets high-performance optical mouse sensors to covertly eavesdrop on users. Audio signals induce subtle surface vibrations that the mouse's optical sensor can detect, and user-space software on popular operating systems can collect and broadcast this sensitive side channel, giving attackers access to raw mouse data without system-level permissions. Because the extracted vibration signals suffer from non-uniform sampling, a non-linear frequency response, and significant quantization, Mic-E-Mouse applies an end-to-end data filtering pipeline that combines Wiener filtering, resampling corrections, and an encoder-only spectrogram neural filtering technique. The attack is evaluated across speaking volume, mouse polling rate and DPI, surface materials, speaker languages, and environmental noise. In controlled environments it improves the signal-to-noise ratio (SNR) of speech reconstruction by up to +19 dB and reaches roughly 42% to 61% speech recognition accuracy on the AudioMNIST and VCTK datasets. All code and datasets are publicly available at https://sites.google.com/view/mic-e-mouse.
Key Takeaways
- Modern optical mouse sensors are vulnerable to side-channel attacks; Mic-E-Mouse is the first attack to exploit them for eavesdropping.
- Audio signals induce surface vibrations that the mouse's optical sensor can detect.
- User-space software can collect and broadcast this sensitive side channel, giving attackers raw mouse data without system-level permissions.
- Mic-E-Mouse uses a sophisticated data filtering pipeline to recover usable vibration signals from low-quality mouse data.
- The attack remains effective across varied conditions, including speaking volume, mouse settings, surface materials, and environmental noise.
- In controlled environments, the SNR of reconstructed speech improves by up to +19 dB.
DualSpeechLM: Towards Unified Speech Understanding and Generation via Dual Speech Token Modeling with Large Language Models
Authors:Yuanyuan Wang, Dongchao Yang, Yiwen Shao, Hangting Chen, Jiankun Zhao, Zhiyong Wu, Helen Meng, Xixin Wu
Extending pre-trained text Large Language Models (LLMs)’s speech understanding or generation abilities by introducing various effective speech tokens has attracted great attention in the speech community. However, building a unified speech understanding and generation model still faces the following challenges: (1) Due to the huge modality gap between speech and text tokens, extending text LLMs to unified speech LLMs relies on large-scale paired data for fine-tuning, and (2) Generation and understanding tasks prefer information at different levels, e.g., generation benefits from detailed acoustic features, while understanding favors high-level semantics. This divergence leads to difficult performance optimization in one unified model. To solve these challenges, in this paper, we present two key insights in speech tokenization and speech language modeling. Specifically, we first propose an Understanding-driven Speech Tokenizer (USTokenizer), which extracts high-level semantic information essential for accomplishing understanding tasks using text LLMs. In this way, USToken enjoys better modality commonality with text, which reduces the difficulty of modality alignment in adapting text LLMs to speech LLMs. Secondly, we present DualSpeechLM, a dual-token modeling framework that concurrently models USToken as input and acoustic token as output within a unified, end-to-end framework, seamlessly integrating speech understanding and generation capabilities. Furthermore, we propose a novel semantic supervision loss and a Chain-of-Condition (CoC) strategy to stabilize model training and enhance speech generation performance. Experimental results demonstrate that our proposed approach effectively fosters a complementary relationship between understanding and generation tasks, highlighting the promising strategy of mutually enhancing both tasks in one unified model.
Paper & Project Links
PDF Accepted by AAAI 2026
Summary
Extending pre-trained text large language models (LLMs) with speech understanding or generation abilities via speech tokens has attracted great attention, but building a unified speech understanding and generation model faces two challenges: the large modality gap between speech and text tokens requires large-scale paired data for fine-tuning, and understanding and generation prefer information at different levels (generation benefits from detailed acoustic features, while understanding favors high-level semantics). This paper proposes an Understanding-driven Speech Tokenizer (USTokenizer) that extracts the high-level semantic information needed for understanding tasks, giving USToken better modality commonality with text and easing modality alignment when adapting text LLMs to speech. It further presents DualSpeechLM, a dual-token framework that models USToken as input and acoustic tokens as output within one unified, end-to-end model, seamlessly integrating speech understanding and generation. A novel semantic supervision loss and a Chain-of-Condition (CoC) strategy stabilize training and enhance speech generation. Experiments show that the approach fosters a complementary relationship between understanding and generation tasks in one unified model.
Key Takeaways
- Extending pre-trained text LLMs to speech faces a large modality gap, and understanding and generation tasks prefer information at different levels.
- The Understanding-driven Speech Tokenizer (USTokenizer) extracts high-level semantic information for understanding tasks, improving modality commonality with text.
- The DualSpeechLM framework unifies speech understanding and generation by modeling USToken as input and acoustic tokens as output.
- A new semantic supervision loss and a Chain-of-Condition (CoC) strategy stabilize training and improve speech generation performance.
PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution
Authors:Omkar Shende, Gayathri Ananthanarayanan, Marcello Traiola
Deep neural networks (DNNs) have become ubiquitous thanks to their remarkable ability to model complex patterns across various domains such as computer vision, speech recognition, robotics, etc. While large DNN models are often more accurate than simpler, lightweight models, they are also resource- and energy-hungry. Hence, it is imperative to design methods to reduce reliance on such large models without significant degradation in output accuracy. The high computational cost of these models is often necessary only for a reduced set of challenging inputs, while lighter models can handle most simple ones. Thus, carefully combining properties of existing DNN models in a dynamic, input-based way opens opportunities to improve efficiency without impacting accuracy. In this work, we introduce PERTINENCE, a novel online method designed to analyze the complexity of input features and dynamically select the most suitable model from a pre-trained set to process a given input effectively. To achieve this, we employ a genetic algorithm to explore the training space of an ML-based input dispatcher, enabling convergence towards the Pareto front in the solution space that balances overall accuracy and computational efficiency. We showcase our approach on state-of-the-art Convolutional Neural Networks (CNNs) trained on the CIFAR-10 and CIFAR-100, as well as Vision Transformers (ViTs) trained on TinyImageNet dataset. We report results showing PERTINENCE’s ability to provide alternative solutions to existing state-of-the-art models in terms of trade-offs between accuracy and number of operations. By opportunistically selecting among models trained for the same task, PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations.
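A minimal sketch of input-based dynamic model selection: a lightweight dispatcher scores the candidate models from cheap input features, and only the selected model runs for each sample. The dispatcher architecture and feature extractor are assumptions; the paper additionally trains its dispatcher with a genetic algorithm, which is not shown here.

```python
# Minimal sketch of an input-based dispatcher selecting among pre-trained models.
import torch
import torch.nn as nn

class Dispatcher(nn.Module):
    """Scores the candidate models for one input; higher score = preferred model."""
    def __init__(self, feat_dim: int, n_models: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_models))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

@torch.no_grad()
def dynamic_predict(x: torch.Tensor, dispatcher: Dispatcher, models: list, feat_fn) -> torch.Tensor:
    """Route each input in the batch to exactly one model and run only that model."""
    choice = dispatcher(feat_fn(x)).argmax(dim=-1)            # (batch,) model index per input
    outputs = [models[int(choice[i])](x[i : i + 1]) for i in range(x.shape[0])]
    return torch.cat(outputs, dim=0)
```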
Paper & Project Links
Summary
Deep neural networks (DNNs) excel at modeling complex patterns in domains such as computer vision, speech recognition, and robotics, but large models are resource- and energy-hungry even though their full capacity is needed only for a small set of challenging inputs. This work introduces PERTINENCE, an online method that analyzes the complexity of input features and dynamically selects the most suitable model from a pre-trained set to process each input. A genetic algorithm explores the training space of an ML-based input dispatcher, converging toward the Pareto front that balances overall accuracy and computational efficiency. Experiments on CNNs trained on CIFAR-10 and CIFAR-100 and on Vision Transformers (ViTs) trained on TinyImageNet show that PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations.
Key Takeaways
- Large DNN models are accurate but consume substantial resources and energy.
- The complexity of the input features strongly influences which model is appropriate for a given input.
- PERTINENCE is an online method that dynamically selects a model according to input complexity.
- A genetic algorithm searches the training space of the ML-based input dispatcher for solutions that balance accuracy and computational efficiency.
- Experiments on the CIFAR and TinyImageNet datasets demonstrate the effectiveness of PERTINENCE.
- PERTINENCE achieves better or comparable accuracy with up to 36% fewer operations.