Published: 2025-11-21

Updated: 2025-11-27

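⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not rely on them for serious academic work. They are suitable only for initial screening before reading the papers.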

Update: 2025-11-21

Scriboora: Rethinking Human Pose Forecasting

Authors:Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

Human pose forecasting predicts future poses based on past observations, and has many significant applications in areas such as action recognition, autonomous driving or human-robot interaction. This paper evaluates a wide range of pose forecasting algorithms in the task of absolute pose forecasting, revealing many reproducibility issues, and provides a unified training and evaluation pipeline. After drawing a high-level analogy to the task of speech understanding, it is shown that recent speech models can be efficiently adapted to the task of pose forecasting, and improve current state-of-the-art performance. At last the robustness of the models is evaluated, using noisy joint coordinates obtained from a pose estimator model, to reflect a realistic type of noise, which is more close to real-world applications. For this a new dataset variation is introduced, and it is shown that estimated poses result in a substantial performance degradation, and how much of it can be recovered again by unsupervised finetuning.

Paper and project links

PDF

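Summary

This paper studies human pose forecasting and evaluates a wide range of algorithms on the task of absolute pose forecasting. The evaluation reveals many reproducibility issues, and the authors provide a unified training and evaluation pipeline. By drawing a high-level analogy to speech understanding, they show that recent speech models can be adapted efficiently to pose forecasting and improve on the current state of the art. Finally, they evaluate robustness using noisy joint coordinates obtained from a pose estimator, which reflects a more realistic type of noise, and introduce a new dataset variation for this purpose. Estimated poses cause a substantial performance drop, and the paper measures how much of it unsupervised finetuning can recover.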

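Key Takeaways

Human pose forecasting predicts future poses from past observations and is used in action recognition, autonomous driving, and human-robot interaction.
Evaluating a wide range of pose forecasting algorithms on absolute pose forecasting reveals many reproducibility issues.
A unified training and evaluation pipeline is provided for these algorithms.
Recent speech models can be adapted efficiently to pose forecasting and improve on the current state of the art.
Model robustness is evaluated with noisy joint coordinates obtained from a pose estimator.
A new dataset variation is introduced to reflect this more realistic type of noise.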
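
To make the task setup concrete, here is a minimal sketch of absolute pose forecasting and of a clean-versus-noisy robustness check. It is not the paper's pipeline: the constant-velocity baseline, the Gaussian jitter used as a stand-in for pose-estimator noise, the toy sequence, and all function names are assumptions chosen for illustration.

```python
import random


def forecast_constant_velocity(past, horizon):
    """Extrapolate every joint linearly from the last two observed frames.

    past: list of frames; each frame is a list of (x, y, z) joint tuples
    in absolute coordinates. Returns `horizon` predicted frames.
    """
    last, prev = past[-1], past[-2]
    future = []
    for step in range(1, horizon + 1):
        frame = [
            tuple(l + step * (l - p) for l, p in zip(joint_last, joint_prev))
            for joint_last, joint_prev in zip(last, prev)
        ]
        future.append(frame)
    return future


def mean_joint_error(pred, target):
    """Average Euclidean distance between predicted and true joints."""
    total, count = 0.0, 0
    for pred_frame, true_frame in zip(pred, target):
        for pj, tj in zip(pred_frame, true_frame):
            total += sum((a - b) ** 2 for a, b in zip(pj, tj)) ** 0.5
            count += 1
    return total / count


def add_noise(frames, sigma, rng):
    """Gaussian jitter on every coordinate (an assumed noise model,
    not the estimator-derived noise used in the paper)."""
    return [
        [tuple(c + rng.gauss(0.0, sigma) for c in joint) for joint in frame]
        for frame in frames
    ]


def make_sequence(n_frames):
    """Toy data: two joints moving at constant velocity."""
    return [
        [(0.1 * t, 0.0, 1.0), (0.1 * t, 0.5, 1.0 + 0.05 * t)]
        for t in range(n_frames)
    ]


seq = make_sequence(8)
past, future = seq[:5], seq[5:]
rng = random.Random(0)

clean_error = mean_joint_error(forecast_constant_velocity(past, 3), future)
noisy_past = add_noise(past, 0.02, rng)
noisy_error = mean_joint_error(forecast_constant_velocity(noisy_past, 3), future)
```

On this toy sequence the baseline is essentially exact on clean input, so any error in `noisy_error` comes from the perturbed observations. That is the kind of gap the robustness evaluation is meant to expose.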

Regularized Schrödinger Bridge: Alleviating Distortion and Exposure Bias in Solving Inverse Problems

Authors:Qing Yao, Lijian Gao, Qirong Mao, Ming Dong

Diffusion models serve as a powerful generative framework for solving inverse problems. However, they still face two key challenges: 1) the distortion-perception tradeoff, where improving perceptual quality often degrades reconstruction fidelity, and 2) the exposure bias problem, where the training-inference input mismatch leads to prediction error accumulation and reduced reconstruction quality. In this work, we propose the Regularized Schrödinger Bridge (RSB), an adaptation of Schrödinger Bridge tailored for inverse problems that addresses the above limitations. RSB employs a novel regularized training strategy that perturbs both the input states and targets, effectively mitigating exposure bias by exposing the model to simulated prediction errors and also alleviating distortion by well-designed interpolation via the posterior mean. Extensive experiments on two typical inverse problems for speech enhancement demonstrate that RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.

Paper and project links

PDF

Summary

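Diffusion models are a powerful generative framework for solving inverse problems, but they still face the distortion-perception tradeoff and exposure bias. This work proposes the Regularized Schrödinger Bridge (RSB), an adaptation of the Schrödinger Bridge tailored to inverse problems that addresses both limitations. RSB uses a novel regularized training strategy that perturbs both the input states and the targets: exposing the model to simulated prediction errors mitigates exposure bias, and interpolation via the posterior mean alleviates distortion. In experiments on two typical inverse problems in speech enhancement, RSB outperforms state-of-the-art methods, significantly improving distortion metrics and effectively reducing exposure bias.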

Key Takeaways

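Diffusion models are a powerful generative framework for solving inverse problems.
They face two key challenges: the distortion-perception tradeoff and exposure bias.
The Regularized Schrödinger Bridge (RSB) is proposed to address both.
RSB uses a novel regularized training strategy that perturbs both the input states and the targets.
Exposure to simulated prediction errors mitigates exposure bias, and interpolation via the posterior mean alleviates distortion.
In speech enhancement experiments, RSB outperforms state-of-the-art methods.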

Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models

Authors:Jiangyu Han, Petr Pálka, Marc Delcroix, Federico Landini, Johan Rohdin, Jan Cernocký, Lukáš Burget

Self-supervised learning (SSL) models such as WavLM have substantially advanced speaker diarization by providing rich contextual speech representations. However, the high computational and memory costs of these models hinder deployment in real-time and resource-constrained scenarios. This work presents a systematic study on compressing SSL-based diarization models through structured pruning guided by knowledge distillation. We investigate pruning objectives that target both model parameters and computational complexity, and analyze alternative strategies, showing that a simple overall pruning approach provides the best balance between efficiency and accuracy. Our method achieves up to 80% model size reduction and 4x faster inference without performance degradation. Comprehensive experiments across eight public diarization datasets demonstrate that the pruned models consistently match or surpass the performance of their uncompressed counterparts. Furthermore, we show strong out-of-domain generalization on the CHiME-6 dataset, achieving accuracy comparable to the top systems in the CHiME-7 challenge without any domain adaptation. These results highlight that structured pruning, when guided by distillation, can yield efficient and generalizable diarization systems suitable for real-world applications.

Paper and project links

PDF 11 pages, 6 figures

Summary

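SSL models such as WavLM have substantially advanced speaker diarization, but their high computational and memory costs limit deployment in real-time and resource-constrained scenarios. This work compresses SSL-based diarization models through structured pruning guided by knowledge distillation. It investigates pruning objectives that target both model parameters and computational complexity, analyzes alternative strategies, and finds that a simple overall pruning approach gives the best balance between efficiency and accuracy. The method achieves up to 80% model size reduction and 4x faster inference without performance degradation. Experiments on eight public diarization datasets show that the pruned models consistently match or surpass their uncompressed counterparts. The pruned models also generalize well out of domain on the CHiME-6 dataset, reaching accuracy comparable to the top systems in the CHiME-7 challenge without any domain adaptation. These results indicate that structured pruning, when guided by distillation, can yield efficient and generalizable diarization systems suitable for real-world applications.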

Key Takeaways

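Self-supervised models such as WavLM have substantially advanced speaker diarization, but their high computational and memory costs limit practical deployment.
The study compresses SSL-based diarization models using structured pruning guided by knowledge distillation.
A simple overall pruning approach gives the best balance between efficiency and accuracy.
The method achieves up to 80% model size reduction and 4x faster inference without performance degradation.
Across eight public diarization datasets, the pruned models match or surpass their uncompressed counterparts.
The pruned models show strong out-of-domain generalization on the CHiME-6 dataset without any domain adaptation.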
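
The idea of structured pruning checked against a teacher can be sketched on a toy two-layer network. This is an illustration only: it uses magnitude-based unit selection as a stand-in, not the pruning objectives studied in the paper, and it only measures the distillation loss rather than training with it. The network, its weights, and all function names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forward(w1, w2, x):
    """Two-layer network: y = W2 * relu(W1 * x)."""
    return matvec(w2, relu(matvec(w1, x)))


def prune_hidden_units(w1, w2, keep_ratio):
    """Structured pruning: remove whole hidden units.

    A unit is ranked by the L2 norm of its row in W1 (an assumed
    importance score). Dropping a unit deletes its row in W1 and the
    matching column in W2, so both matrices actually get smaller.
    """
    norms = [sum(a * a for a in row) ** 0.5 for row in w1]
    n_keep = max(1, round(len(w1) * keep_ratio))
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    pruned_w1 = [w1[i] for i in keep]
    pruned_w2 = [[row[i] for i in keep] for row in w2]
    return pruned_w1, pruned_w2


def count_params(*mats):
    return sum(len(m) * len(m[0]) for m in mats)


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(teacher_out)


# Hypothetical teacher: 3 inputs, 5 hidden units, 2 outputs.
teacher_w1 = [
    [0.9, -0.4, 0.3],
    [0.01, 0.02, -0.01],
    [-0.7, 0.8, 0.5],
    [0.02, -0.01, 0.01],
    [0.6, 0.1, -0.9],
]
teacher_w2 = [
    [0.5, 0.3, -0.2, 0.1, 0.4],
    [-0.3, 0.2, 0.6, -0.1, 0.7],
]

student_w1, student_w2 = prune_hidden_units(teacher_w1, teacher_w2, keep_ratio=0.6)

x = [1.0, 0.5, -0.5]
loss = distillation_loss(
    forward(student_w1, student_w2, x),
    forward(teacher_w1, teacher_w2, x),
)
reduction = 1 - count_params(student_w1, student_w2) / count_params(teacher_w1, teacher_w2)
```

Because entire units are removed, the student's weight matrices shrink in shape, which is what allows structured pruning to reduce both model size and inference cost. In a real system the distillation loss would be minimized during training so that the smaller student keeps tracking the teacher.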

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-21/Speech/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Speech
