⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: never rely on them for serious academic work; they are intended only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-06
Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token
Authors:Shogo Seki, Shaoxiang Dang, Li Li
The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention through positive orthogonal random features (FAVOR+) to mitigate the quadratic complexity associated with self-attention, while utilizing dilated convolution to expand the receptive field. This combination results in impressive performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba, framed within the structured matrix mixer framework. Experiments conducted using a generative SE model on discrete codec tokens, known as Genhancer, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
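For readers unfamiliar with the attention mechanism being replaced, the minimal numpy sketch below (our own illustration, not code from the paper) shows how FAVOR+-style positive random features approximate softmax attention in time linear in the sequence length; the feature map, dimensions, and scaling are simplified assumptions.

```python
import numpy as np

def favor_plus_features(x, proj, eps=1e-6):
    """Positive random features phi(x) approximating the softmax kernel.

    x:    (T, d) queries or keys
    proj: (d, m) random Gaussian projection matrix
    """
    # exp(w^T x - ||x||^2 / 2) yields strictly positive features (FAVOR+ style)
    norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(x @ proj - norm) / np.sqrt(proj.shape[1]) + eps

def linear_attention(q, k, v, proj):
    """O(T * m * d) attention: no T x T matrix is ever formed."""
    q_f, k_f = favor_plus_features(q, proj), favor_plus_features(k, proj)
    kv = k_f.T @ v                    # (m, d_v): summarise keys and values once
    z = q_f @ k_f.sum(axis=0)         # (T,): per-query normalisation
    return (q_f @ kv) / z[:, None]

rng = np.random.default_rng(0)
T, d, m = 200, 32, 64                 # sequence length, head dim, feature count (illustrative)
q, k, v = (rng.standard_normal((T, d)) * 0.1 for _ in range(3))
proj = rng.standard_normal((d, m))
out = linear_attention(q, k, v, proj)
print(out.shape)                      # (200, 32)
```

The paper's point is that a bidirectional SSM (Hydra) can provide this same linear-cost global mixing without the random-feature approximation.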
Paper and Project Links
PDF Submitted to ICASSP 2026. Audio samples available at https://s-seki.github.io/dc_hydra/
Summary
A new variant of the efficient speech-enhancement model DF-Conformer is proposed that replaces FAVOR+ with bidirectional selective structured state-space sequence models, improving global sequence modeling while retaining linear complexity. The replacement uses Hydra, a bidirectional extension of Mamba framed within the structured matrix mixer framework, and experiments show that it outperforms the DF-Conformer.
Key Takeaways
- DF-Conformer is an efficient Conformer variant for speech enhancement: dilated convolution enlarges the receptive field, and FAVOR+ reduces the computational complexity of self-attention.
- The authors replace FAVOR+ in the DF-Conformer with bidirectional selective structured state-space sequence models to strengthen global sequence modeling and eliminate the approximations inherent in FAVOR+.
- The replacement module is Hydra, a bidirectional extension of Mamba formulated within the structured matrix mixer framework.
- Experiments with the generative speech-enhancement model Genhancer on discrete codec tokens show that the proposed method outperforms the DF-Conformer, suggesting more effective enhancement in practice.
Click here to view paper screenshots
Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study
Authors:Lucky Onyekwelu-Udoka, Md Shafiqul Islam, Md Shahedul Hasan
Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifying six core emotions from the CREMA-D dataset. We benchmark their performance against a traditional CNN-LSTM baseline model using MFCC features. DistilHuBERT demonstrates superior accuracy (70.64%) and F1 score (70.36%) while maintaining an exceptionally small model size (0.02 MB), outperforming both PaSST and the baseline. Furthermore, we conducted an ablation study on three variants of the PaSST, Linear, MLP, and Attentive Pooling heads, to understand the effect of classification head architecture on model performance. Our results indicate that PaSST with an MLP head yields the best performance among its variants but still falls short of DistilHuBERT. Among the emotion classes, angry is consistently the most accurately detected, while disgust remains the most challenging. These findings suggest that lightweight transformers like DistilHuBERT offer a compelling solution for real-time speech emotion recognition on edge devices. The code is available at: https://github.com/luckymaduabuchi/Emotion-detection-.
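To make the head ablation concrete, here is a hypothetical numpy sketch of the three classification-head types (linear, MLP, and attentive pooling) applied to a sequence of frame embeddings; the dimensions and random weights are placeholders, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_classes = 120, 768, 6            # frames, embedding dim, CREMA-D emotion classes
frames = rng.standard_normal((T, d))     # stand-in for per-frame backbone embeddings

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Linear head: mean-pool the frames, then a single projection.
W_lin = rng.standard_normal((d, n_classes)) * 0.02
logits_linear = frames.mean(axis=0) @ W_lin

# MLP head: mean-pool, one hidden ReLU layer, then project.
W1 = rng.standard_normal((d, 256)) * 0.02
W2 = rng.standard_normal((256, n_classes)) * 0.02
logits_mlp = np.maximum(frames.mean(axis=0) @ W1, 0.0) @ W2

# Attentive pooling head: a learned query scores each frame and the frames
# are averaged with those weights before the final projection.
w_query = rng.standard_normal(d) * 0.02
attn = softmax(frames @ w_query)         # (T,) weights over frames
pooled = attn @ frames                   # weighted average embedding
logits_attn = pooled @ W_lin

print(softmax(logits_linear), softmax(logits_mlp), softmax(logits_attn))
```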
Paper and Project Links
Summary
This paper compares the lightweight transformer-based emotion-recognition models DistilHuBERT and PaSST with a traditional CNN-LSTM baseline on the CREMA-D dataset. DistilHuBERT achieves higher accuracy and F1 score while keeping a very small model size. Among the PaSST variants, the MLP head performs best but still falls short of DistilHuBERT. The study suggests that lightweight transformers offer an effective solution for real-time speech emotion recognition on edge devices.
Key Takeaways
- Emotion recognition from speech plays an important role in building empathetic human-computer interaction systems.
- DistilHuBERT delivers strong emotion-recognition performance with a very small model size.
- Among the PaSST variants, the MLP classification head performs best.
- Transformer-based models such as DistilHuBERT are well suited to real-time speech emotion recognition.
- "Angry" is the most reliably detected emotion class, while "disgust" is the hardest.
- The study provides a new comparative benchmark for speech emotion recognition.
Click here to view paper screenshots
DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation
Authors:Chunxi Wang, Maoshen Jia, Wenyu Jin
Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
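As a rough illustration of the early/late segmentation idea behind the DAT decoder, the sketch below splits a synthetic RIR with a simple energy-decay heuristic; the threshold-based rule is our assumption and merely stands in for the learned, adaptive segmentation described in the paper.

```python
import numpy as np

def split_rir(rir, fs, floor_db=-20.0, min_early_ms=5.0):
    """Split an RIR into early reflections and late reverberation.

    The boundary is placed where the smoothed energy envelope first drops
    floor_db below the direct-path peak (a common heuristic, not DARAS's
    learned segmentation).
    """
    energy = rir ** 2
    win = max(1, int(fs * 0.001))                        # ~1 ms smoothing window
    env = np.convolve(energy, np.ones(win) / win, mode="same")
    peak = np.argmax(env)
    threshold = env[peak] * 10.0 ** (floor_db / 10.0)
    start = peak + int(fs * min_early_ms / 1000.0)
    below = np.nonzero(env[start:] < threshold)[0]
    boundary = start + (below[0] if below.size else 0)
    return rir[:boundary], rir[boundary:]

fs = 16000
t = np.arange(int(0.4 * fs)) / fs
rng = np.random.default_rng(0)
toy_rir = rng.standard_normal(t.size) * np.exp(-t / 0.12)   # exponentially decaying noise tail
toy_rir[0] = 3.0                                            # direct path
early, late = split_rir(toy_rir, fs)
print(len(early), len(late))
```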
Paper and Project Links
PDF 14 pages, 9 figures, accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing
Summary
This paper presents the dynamic audio-room acoustic synthesis (DARAS) model, a deep learning framework designed for blind estimation of room impulse responses (RIRs) from monaural reverberant speech. DARAS extracts nonlinear latent features with a deep audio encoder, estimates key room acoustic parameters with a Mamba-based self-supervised blind room parameter estimation module, and deepens the integration of audio and room acoustic features with a hybrid-path cross-attention fusion module. Finally, a dynamic acoustic tuning decoder adaptively segments early reflections and late reverberation to improve the realism of the synthesized RIRs. Experiments, including a MUSHRA-based subjective listening study, show that DARAS clearly outperforms existing baselines and offers a robust, practical solution for blind RIR estimation in real-world acoustic environments.
Key Takeaways
- The dynamic audio-room acoustic synthesis (DARAS) model is a deep learning framework for blind estimation of room impulse responses (RIRs).
- DARAS extracts relevant features directly from monaural reverberant speech.
- A Mamba-based self-supervised blind room parameter estimation module accurately estimates key room acoustic parameters.
- A hybrid-path cross-attention fusion module deepens the integration of audio and room acoustic features.
- The dynamic acoustic tuning decoder adaptively segments early reflections and late reverberation, improving the realism of synthesized RIRs.
- Experimental results show that DARAS performs strongly in real-world acoustic environments.
Click here to view paper screenshots
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Authors:Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, Xing Sun
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the available paired speech-text data is insufficient to support the pretraining of MLLMs compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average performance drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
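A toy numpy sketch of the underlying idea, measuring each expert's per-modality routing load and partitioning experts accordingly, is shown below; the gating scheme, dimensions, and thresholds are invented for illustration and are not DeepTalk's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, top_k = 8, 64, 2

def route(tokens, gate):
    """Standard top-k MoE gating; returns the selected expert ids per token."""
    scores = tokens @ gate                                # (n_tokens, n_experts)
    return np.argsort(-scores, axis=-1)[:, :top_k]

gate = rng.standard_normal((d, n_experts)) * 0.1
text_tokens = rng.standard_normal((500, d))
speech_tokens = rng.standard_normal((300, d)) + 0.5       # pretend modality shift

def modality_load(tokens):
    """Fraction of routing decisions each expert receives for one modality."""
    ids = route(tokens, gate).ravel()
    return np.bincount(ids, minlength=n_experts) / ids.size

text_load, speech_load = modality_load(text_tokens), modality_load(speech_tokens)

# Experts whose speech load dominates are tagged as speech experts (and vice
# versa); each group would then get single-modality training before the joint stage.
assignment = np.where(speech_load > text_load, "speech", "text")
print(dict(zip(range(n_experts), assignment)))
```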
Paper and Project Links
PDF Under Review
Summary
Native multimodal large language models (MLLMs) can generate both speech and text within a single backbone. Compared with modular and aligned MLLMs, they preserve richer paralinguistic features such as emotion and prosody, but they require large amounts of paired speech-text data for pretraining and their performance degrades when such data is scarce. To address this, the paper proposes DeepTalk, a framework for adaptive modality expert learning based on a Mixture-of-Experts (MoE) architecture. DeepTalk adaptively distinguishes modality experts inside the LLM and combines single-modality training with joint multimodal collaborative training, preserving performance while keeping responses fast. Code and models are released on GitHub.
Key Takeaways
- Native multimodal large language models (MLLMs) can generate both speech and text while preserving rich paralinguistic features.
- Compared with modular and aligned MLLMs, native MLLMs offer lower response latency and smoother interaction.
- Native MLLMs suffer from catastrophic forgetting and degraded performance because paired speech-text data is scarce.
- The DeepTalk framework uses a Mixture-of-Experts (MoE) architecture for adaptive modality expert learning to address this data-scarcity problem.
- DeepTalk proceeds in three main steps: adaptively distinguishing modality experts, specialized single-modality training, and joint multimodal collaborative training.
- DeepTalk incurs only a 5.5% performance drop relative to the original LLM, far better than most native MLLMs (such as GLM-4-Voice).
Click here to view paper screenshots
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation
Authors:Jiatong Shi, Yifan Cheng, Bo-Hao Su, Hye-jin Shim, Jinchuan Tian, Samuele Cornell, Yiwen Zhao, Siddhant Arora, Shinji Watanabe
Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships. Across tasks, ARECHO offers reference-free evaluation using its dynamic classifier chain to support subset queries (single or multiple metrics) and reduces error propagation via confidence-oriented decoding.
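The sketch below illustrates, with placeholder classifiers, what a confidence-ordered classifier chain can look like: the most confident remaining metric is decoded first and fed back as context for the rest. It is a schematic analogue, not ARECHO's actual tokenization or decoding algorithm, and every dimension and weight is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
metrics = ["PESQ", "STOI", "MOS"]
n_bins = 5                                  # each metric discretised into 5 levels
feat_dim = 16

# Placeholder "classifiers": one weight matrix per metric over
# [utterance features ++ one-hot encodings of already-decoded metrics].
in_dim = feat_dim + len(metrics) * n_bins
W = {m: rng.standard_normal((in_dim, n_bins)) * 0.1 for m in metrics}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode(features):
    """Greedy confidence-ordered chain: decode the most confident remaining
    metric next, conditioning later metrics on the earlier decisions."""
    context = np.zeros(len(metrics) * n_bins)
    remaining, results = set(metrics), {}
    while remaining:
        probs = {m: softmax(np.concatenate([features, context]) @ W[m]) for m in remaining}
        best = max(remaining, key=lambda m: probs[m].max())   # highest confidence first
        level = int(np.argmax(probs[best]))
        results[best] = level
        context[metrics.index(best) * n_bins + level] = 1.0   # feed back as context
        remaining.remove(best)
    return results

print(decode(rng.standard_normal(feat_dim)))
```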
Paper and Project Links
PDF NeurIPS 2025 Spotlight
Summary
Speech signal analysis is challenging, particularly for tasks that predict multiple perceptual and objective metrics, such as speech quality evaluation and profiling. To handle the differing scales, assumptions, and dependencies of these metrics, the authors propose ARECHO, a chain-based autoregressive evaluation system built on dependency modeling. ARECHO has three key innovations, a comprehensive speech information tokenization pipeline, a dynamic classifier chain, and a two-step confidence-oriented decoding algorithm, which together improve the reliability and accuracy of speech assessment. Experiments show that ARECHO outperforms the baseline framework across diverse evaluation scenarios and supports reference-free evaluation with subset queries.
Key Takeaways
- Speech signal analysis is challenging, especially when predicting multiple perceptual and objective metrics at once.
- ARECHO is a chain-based, versatile speech evaluation system built around three key innovations.
- A comprehensive speech information tokenization pipeline improves the accuracy of speech assessment.
- The dynamic classifier chain explicitly captures inter-metric dependencies, improving performance.
- The two-step confidence-oriented decoding algorithm enhances inference reliability.
- ARECHO significantly outperforms the baseline framework across multiple evaluation scenarios.
Click here to view paper screenshots
OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
Authors:Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem
In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker’s multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners’ facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
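As a loose illustration of timestamping generated text so that speech and facial frames can share a clock, the toy snippet below interleaves time markers with tokens; the marker format, token durations, and frame rate are invented and may differ from the paper's Chrono-Text Markup.

```python
# Hypothetical illustration only: align text tokens, TTS audio, and facial
# frames on one timeline by emitting explicit time markers between tokens.
tokens = ["yeah", ",", "that", "makes", "sense", "."]
durations_s = [0.22, 0.08, 0.18, 0.30, 0.35, 0.10]   # assumed per-token durations
fps = 25                                              # assumed facial-frame rate

stream, t = [], 0.0
for tok, dur in zip(tokens, durations_s):
    stream.append(f"<t={t:.2f}s>")                    # timestamp marker before the token
    stream.append(tok)
    t += dur
stream.append(f"<t={t:.2f}s>")
print(" ".join(stream))

# A face renderer could map each marker time to a frame index:
frame_index = [int(round(sum(durations_s[:i]) * fps)) for i in range(len(tokens) + 1)]
print(frame_index)
```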
Paper and Project Links
PDF 25 pages, 9 figures
Summary
This paper introduces Online Multimodal Conversational Response Generation (OMCRG), a task that aims to generate synchronized verbal and non-verbal listener feedback online from the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and raises new challenges in aligning generated audio with the listener's facial responses; text is used as an intermediate modality to connect the two. The proposed OmniResponse is a multimodal large language model (MLLM) that autoregressively generates accurate multimodal listener responses, built on a pretrained LLM with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module whose output is synchronized with facial responses. To advance OMCRG research, the authors release ResponseNet, a dataset of 696 detailed dyadic interactions with synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Evaluations on ResponseNet show that OmniResponse outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality. The dataset, code, and models are publicly available.
Key Takeaways
- The OMCRG task aims to generate synchronized verbal and non-verbal listener feedback online from the speaker's multimodal inputs.
- OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with the listener's facial responses.
- The OmniResponse MLLM autoregressively generates accurate multimodal listener responses, building on a pretrained LLM with two core components, Chrono-Text Markup and TempoVoice.
- The ResponseNet dataset contains detailed dyadic interactions with synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors.
- OmniResponse outperforms baseline models in semantic speech content, audio-visual synchronization, and generation quality.
- The dataset, code, and models are publicly available to support further research.
Click here to view paper screenshots
Multi-head Temporal Latent Attention
Authors:Keqi Deng, Philip C. Woodland
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency. Multi-head latent attention was recently developed to compress the KV cache into a low-rank latent space. This paper proposes Multi-head Temporal Latent Attention (MTLA), which further reduces the KV cache size along the temporal dimension, greatly lowering the memory footprint of self-attention inference. MTLA employs a hyper-network to dynamically merge temporally adjacent KV cache vectors. To address the mismatch between the compressed KV cache and processed sequence lengths, a stride-aware causal mask is proposed to ensure efficient parallel training and consistency with inference behaviour. Experiments across tasks, including speech translation, speech recognition, speech understanding and text summarisation, demonstrate that MTLA achieves competitive performance compared to standard Multi-Head Attention (MHA), while greatly improving inference speed and GPU memory usage. For example, on a English-German speech translation task, MTLA achieves a 5.3x speedup and a reduction in GPU memory usage by a factor of 8.3 compared to MHA, while maintaining translation quality.
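A minimal numpy sketch of the two ideas, temporal merging of adjacent KV vectors and a stride-aware causal mask, is given below; the plain averaging stands in for MTLA's learned hyper-network, and the masking rule and dimensions are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, stride = 8, 16, 2                       # sequence length, head dim, temporal stride

q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))

def temporal_merge(x, stride):
    """Merge temporally adjacent KV vectors (average here as a stand-in for
    the learned hyper-network), shrinking the cache from T to T // stride."""
    return x.reshape(-1, stride, x.shape[-1]).mean(axis=1)

k_c, v_c = temporal_merge(k, stride), temporal_merge(v, stride)   # (T // stride, d)

# Stride-aware causal mask: a query at step t may attend to compressed slot j
# only if that slot starts at an original position <= t.
slots = np.arange(T // stride)
mask = (slots[None, :] * stride) <= np.arange(T)[:, None]          # (T, T // stride)

scores = q @ k_c.T / np.sqrt(d)
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_c
print(out.shape)                               # (8, 16) with a KV cache of only 4 slots
```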
Paper and Project Links
PDF Accepted by NeurIPS 2025
Summary
The paper proposes Multi-head Temporal Latent Attention (MTLA), which shrinks the KV cache along the temporal dimension by dynamically merging temporally adjacent KV cache vectors, greatly reducing the memory footprint of self-attention inference. To resolve the mismatch between the compressed KV cache and the processed sequence length, a stride-aware causal mask is introduced, ensuring efficient parallel training that is consistent with inference behaviour. Experiments on speech translation, speech recognition, speech understanding, and text summarisation show that MTLA achieves performance competitive with standard multi-head attention while markedly improving inference speed and GPU memory usage.
Key Takeaways
- MTLA further reduces the KV cache size by dynamically merging temporally adjacent KV cache vectors.
- MTLA substantially lowers the memory footprint of self-attention inference.
- A stride-aware causal mask resolves the mismatch between the compressed KV cache and the processed sequence length.
- On English-German speech translation, MTLA achieves a 5.3x speedup and an 8.3x reduction in GPU memory usage over standard multi-head attention while maintaining translation quality.
- MTLA performs well across multiple tasks, including speech translation, speech recognition, speech understanding, and text summarisation.
- MTLA improves inference efficiency.
Click here to view paper screenshots
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Authors:Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field’s history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
Paper and Project Links
PDF EMNLP 2025
Summary
This paper describes how multilingual speech translation (ST) and machine translation (MT) in the medical domain can improve patient care by bridging language barriers, easing specialized workforce shortages, and supporting better diagnosis and treatment, particularly during pandemics. It presents the first systematic study of medical ST and releases MultiMed-ST, a large-scale ST dataset covering all translation directions across five languages, together with the models. The authors also provide a comprehensive ST analysis, including empirical baselines and bilingual-multilingual comparative studies. All code, data, and models are available online.
Key Takeaways
- Multilingual speech translation (ST) and machine translation (MT) in the medical domain improve patient care by overcoming language barriers, easing specialized workforce shortages, and supporting better diagnosis and treatment.
- This is the first systematic study of medical ST, and it releases MultiMed-ST, a large-scale dataset spanning five languages that provides rich resources for medical translation.
- With 290,000 samples, MultiMed-ST is the largest medical MT dataset to date and covers all translation directions.
- The comprehensive ST analysis includes empirical baselines, bilingual-multilingual comparisons, end-to-end vs. cascaded comparisons, task-specific vs. multi-task sequence-to-sequence comparisons, code-switch analysis, and quantitative-qualitative error analysis.
- All code, data, and models are available online for researchers to use and build upon.
- The work is significant for improving multilingual communication in healthcare and for raising the quality and efficiency of medical services.
Click here to view paper screenshots
Modelling Emotions in Face-to-Face Setting: The Interplay of Eye-Tracking, Personality, and Temporal Dynamics
Authors:Meisam Jamshidi Seikavandi, Jostein Fimland, Maria Barrett, Paolo Burelli
Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we showcase how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset, while being recorded for eye-tracking signals (pupil size, fixation patterns), Big Five personality assessments, and self-reported emotional states. Our neural network models combined these diverse inputs including stimulus emotion labels for contextual cues and yielded marked performance gains compared to the state-of-the-art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information demonstrated significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
Paper and Project Links
PDF The paper has been significantly revised and my colleague has already submitted the updated version in the link below: arXiv:2510.24720
Summary
This study shows that integrating eye-tracking data, temporal dynamics, and personality traits substantially improves the recognition of both perceived and felt emotions. Eye-tracking signals, Big Five personality assessments, and self-reported emotional states were collected while participants watched speech-containing videos, and neural network models combining these inputs with stimulus context achieved clear performance gains, including a macro F1-score of 0.76 for perceived valence; models incorporating personality traits and stimulus information also improved felt-emotion accuracy. The work highlights the value of unifying physiological, individual, and contextual factors to address the subjectivity and complexity of emotional expression.
Key Takeaways
- Eye-tracking data is increasingly valuable for emotion recognition, particularly for detecting emotions in dynamic settings.
- Combining eye-tracking data, temporal dynamics, and personality traits significantly improves emotion-recognition accuracy.
- Neural network models perform well on these emotion-recognition tasks.
- Using user-specific data to capture subtle internal states confirms the influence of individual factors on emotional expression.
- The findings offer valuable guidance for making future affective computing and human-agent systems more adaptive.
- Combining contextual cues with user-specific information is essential for accurate emotion recognition.
Click here to view paper screenshots
As Good as It KAN Get: High-Fidelity Audio Representation
Authors:Patryk Marszałek, Maciej Rut, Piotr Kawa, Przemysław Spurek, Piotr Syga
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN’s utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at https://github.com/gmum/fewsound.git.
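For intuition about layers built from learnable activation functions, here is a toy sketch of a KAN-style layer in which every input-output edge carries its own learnable 1-D function; Gaussian bumps stand in for the usual B-spline parameterization, and the INR usage at the end is only a schematic example, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyKANLayer:
    """Each input->output edge carries its own learnable 1-D function,
    here a weighted sum of Gaussian bumps standing in for B-splines."""

    def __init__(self, in_dim, out_dim, n_basis=8, grid=(-2.0, 2.0)):
        self.centres = np.linspace(*grid, n_basis)             # shared basis grid
        self.width = (grid[1] - grid[0]) / n_basis
        # one coefficient vector per (input, output) edge
        self.coef = rng.standard_normal((in_dim, out_dim, n_basis)) * 0.1

    def __call__(self, x):                                      # x: (batch, in_dim)
        # basis responses per scalar input: (batch, in_dim, n_basis)
        phi = np.exp(-((x[..., None] - self.centres) / self.width) ** 2)
        # apply every edge's function and sum over inputs -> (batch, out_dim)
        return np.einsum("bip,iop->bo", phi, self.coef)

# Toy INR-style use: map a time coordinate to an audio sample value.
layer1, layer2 = ToyKANLayer(1, 32), ToyKANLayer(32, 1)
t = np.linspace(-1.0, 1.0, 480)[:, None]                        # 10 ms at 48 kHz
audio_hat = layer2(layer1(t))
print(audio_hat.shape)                                           # (480, 1)
```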
Paper and Project Links
PDF Accepted to the 34th ACM International Conference on Information and Knowledge Management (CIKM ‘25)
Summary
This paper introduces the Kolmogorov-Arnold Network (KAN), a novel architecture with learnable activation functions, as an effective implicit neural representation (INR) model for audio. KAN outperforms previous INR models on audio signals, achieving the lowest Log-Spectral Distance and the highest Perceptual Evaluation of Speech Quality for 1.5 s audio. To extend KAN's utility, the paper proposes FewSound, a hypernetwork-based architecture that improves INR parameter updates; FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. The results show KAN to be a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks.
Key Takeaways
- The Kolmogorov-Arnold Network (KAN) is introduced as an implicit neural representation (INR) model for audio.
- KAN uses learnable activation functions and outperforms other INR models.
- KAN achieves a low Log-Spectral Distance and a high Perceptual Evaluation of Speech Quality score on audio signals.
- FewSound, a hypernetwork-based architecture, is proposed to improve INR parameter updates.
- FewSound surpasses the existing HyperSound, with clear gains in MSE and SI-SNR.
- KAN has the potential to scale and to integrate into various hypernetwork frameworks.
Click here to view paper screenshots
AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
Authors:Junan Zhang, Jing Yang, Zihao Fang, Yuancheng Wang, Zehua Zhang, Zhuo Wang, Fan Fan, Zhizheng Wu
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker’s timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
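The snippet below sketches the general pattern of iterative masked-token generation with confidence-based re-masking, the family of decoding that masked generative models of this kind build on; the random stand-in model, the schedule, and the critic-style scoring are assumptions, not AnyEnhance's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, vocab, n_rounds = 24, 128, 4               # codec-token length, codebook size, decode steps
MASK = -1

def toy_model(tokens):
    """Stand-in for the masked generative model: per-position probabilities
    over the codebook (random here, conditioned on nothing)."""
    return rng.dirichlet(np.ones(vocab), size=tokens.size)

tokens = np.full(T, MASK)
for step in range(n_rounds):
    probs = toy_model(tokens)
    proposal = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    # keep already-committed tokens; score only currently masked positions
    confidence = np.where(tokens == MASK, confidence, np.inf)
    # critic-style step: commit only the most confident fraction this round and
    # re-mask the rest so they are re-predicted with more context next round
    n_keep = int(np.ceil(T * (step + 1) / n_rounds))
    keep = np.argsort(-confidence)[:n_keep]
    new_tokens = np.full(T, MASK)
    new_tokens[keep] = np.where(tokens[keep] == MASK, proposal[keep], tokens[keep])
    tokens = new_tokens
print(tokens)                                  # fully committed codec tokens after the last round
```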
Paper and Project Links
PDF Accepted by IEEE TASLP 2025. Demopage: https://amphionspace.github.io/anyenhance. Open-source implementation: https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc
Summary
AnyEnhance is a unified generative model for voice enhancement that handles both speech and singing voices and supports a wide range of tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction. Built on a masked generative model, it requires no fine-tuning across tasks, uses a prompt-guidance mechanism for in-context learning so that a reference speaker's timbre can boost enhancement performance, and adds a self-critic mechanism that improves output quality through iterative self-assessment and refinement. Experiments show that AnyEnhance outperforms existing methods on both objective metrics and subjective listening tests.
Key Takeaways
- AnyEnhance is a unified generative model that handles a wide range of enhancement tasks for both speech and singing voices.
- Built on a masked generative model, it handles multiple tasks simultaneously without fine-tuning.
- A prompt-guidance mechanism allows the model to natively accept a reference speaker's timbre.
- Using a reference audio boosts enhancement performance and enables target speaker extraction.
- A self-critic mechanism improves output quality through iterative self-assessment and refinement.
- Experiments show that AnyEnhance outperforms existing methods on both objective and subjective evaluations.
Click here to view paper screenshots
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Authors:Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, making near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.
Paper and Project Links
PDF NeurIPS 2025 Spotlight, Code 2.4K Stars: https://github.com/VITA-MLLM/VITA
Summary
This paper proposes a carefully designed multi-stage training methodology that progressively teaches a large language model (LLM) to understand both visual and speech information, enabling fluent vision and speech interaction. The approach preserves strong vision-language capability while enabling efficient speech-to-speech dialogue without separate ASR and TTS modules, which significantly accelerates the multimodal end-to-end response. Experiments show the model performs well across image, video, and speech tasks, achieving near real-time vision and speech interaction.
Key Takeaways
- Recent multimodal large language models (MLLMs) have focused mainly on integrating visual and textual modalities, yet speech plays a crucial role in interaction.
- Achieving high performance in both vision and speech tasks is a major challenge because of fundamental modality differences.
- A multi-stage training strategy progressively trains the LLM to understand visual and speech information, enabling fluent vision and speech interaction.
- The approach preserves strong vision-language capability while adding efficient speech-to-speech dialogue capability.
- The model needs no separate ASR or TTS modules, which significantly accelerates the multimodal end-to-end response.
- Experiments show strong performance across image, video, and speech tasks.
Click here to view paper screenshots
Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency
Authors:Akshaya Rajesh, Pavithra Ananthasubramanian, Nagarajan Raghavan, Ankush Kumar
Efficient audio feature extraction is critical for low-latency, resource-constrained speech recognition. Conventional preprocessing techniques, such as Mel Spectrogram, Perceptual Linear Prediction (PLP), and Learnable Spectrogram, achieve high classification accuracy but require large feature sets and significant computation. The low-latency and power efficiency benefits of neuromorphic computing offer a strong potential for audio classification. Here, we introduce memristive nanowire networks as a neuromorphic hardware preprocessing layer for spoken-digit classification, a capability not previously demonstrated. Nanowire networks extract compact, informative features directly from raw audio, achieving a favorable trade-off between accuracy, dimensionality reduction from the original audio size (data compression), and training time efficiency. Compared with state-of-the-art software techniques, nanowire features reach 98.95% accuracy with 66 times data compression (XGBoost) and 97.9% accuracy with 255 times compression (Random Forest) in sub-second training latency. Across multiple classifiers, nanowire features consistently achieve more than 90% accuracy with more than 62.5 times compression, outperforming features extracted by conventional state-of-the-art techniques such as MFCC in efficiency without loss of performance. Moreover, nanowire features achieve 96.5% accuracy classifying multispeaker audios, outperforming all state-of-the-art feature accuracies while achieving the highest data compression and lowest training time. Nanowire network preprocessing also enhances linear separability of audio data, improving simple classifier performance and generalizing across speakers. These results demonstrate that memristive nanowire networks provide a novel, low-latency, and data-efficient feature extraction approach, enabling high-performance neuromorphic audio classification.
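As a software analogue of reservoir-style preprocessing (not a model of the physical memristive device), the sketch below drives a fixed random recurrent network with a raw waveform and uses its final state as a compact feature vector; the reservoir size, leak rate, and toy inputs are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes = 32

# A fixed random recurrent network acts as a software stand-in for the
# physical nanowire reservoir (the same weights are reused for every input).
w_in = rng.uniform(-1.0, 1.0, size=n_nodes)
w_res = rng.uniform(-1.0, 1.0, size=(n_nodes, n_nodes))
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res)))       # keep the dynamics stable

def reservoir_features(audio, leak=0.3):
    """Drive the reservoir with the raw waveform (no spectrogram/MFCC stage)
    and return its final state as a compact feature vector."""
    state = np.zeros(n_nodes)
    for sample in audio:
        state = (1 - leak) * state + leak * np.tanh(w_res @ state + w_in * sample)
    return state

# Toy spoken-digit stand-ins: two noisy tones, 0.2 s at 8 kHz.
fs = 8000
t = np.arange(int(0.2 * fs)) / fs
make = lambda f: np.sin(2 * np.pi * f * t) + 0.1 * rng.standard_normal(t.size)
feat_a, feat_b = reservoir_features(make(300.0)), reservoir_features(make(700.0))
print(feat_a.shape, np.round(np.linalg.norm(feat_a - feat_b), 3))
```

The resulting n_nodes-dimensional vectors would then be fed to a simple classifier such as Random Forest or XGBoost, mirroring the compressed-feature pipeline described in the abstract.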
Paper and Project Links
PDF 14 pages, 5 Figures
Summary
This paper introduces memristive nanowire networks as a neuromorphic hardware preprocessing layer for spoken-digit classification. The nanowire network extracts compact, informative features directly from raw audio, achieving a favorable balance between accuracy, data compression, and training-time efficiency. Compared with conventional software techniques, the nanowire features deliver clear advantages in classification accuracy, compression, and training latency, showing that nanowire networks offer a new solution for low-latency, data-efficient audio classification.
Key Takeaways
- Audio feature extraction is critical for low-latency, resource-constrained speech recognition.
- Conventional preprocessing techniques achieve high classification accuracy but require large feature sets and significant computation.
- Neuromorphic computing offers low latency and high energy efficiency for audio classification.
- As a neuromorphic hardware preprocessing layer, the nanowire network extracts features directly from raw audio.
- Nanowire features offer clear advantages in data compression and training-time efficiency, reaching high classification accuracy compared with state-of-the-art software techniques.
- Nanowire features remain stable across multiple classifiers and achieve high accuracy on multi-speaker audio classification.
Click here to view paper screenshots