
TTS


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings — they are only for pre-screening papers before reading!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-10-22

TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model

Authors:Bin Yu, Xinming Wang, Shijie Lian, Haotian Li, Changti Wu, Ruina Hu, Bailing Wang, Yuliang Wei, Kai Chen

Large language models (LLMs) have shown remarkable progress in complex reasoning tasks, largely enabled by test-time scaling (TTS) paradigms that allocate additional compute during inference. Among these, external TTS (particularly the Best-of-N selection paradigm) yields scalable performance improvements by selecting from multiple independently generated reasoning trajectories. However, this approach faces key limitations: (i) the high computational overhead of deploying process reward models, and (ii) the underutilization of the LLM’s intrinsic latent representations. We introduce TrajSelector, an efficient and effective Best-of-N framework that exploits the hidden states in the sampler LLM for process-level scoring. A lightweight verifier (with only 0.6B parameters) evaluates step-wise trajectory quality and then aggregates these scores to identify the optimal reasoning trajectory. Our framework employs a fully data-driven, end-to-end training recipe that eliminates reliance on massive step-level annotations. Experimental results across five benchmarks demonstrate that TrajSelector delivers consistent performance gains. In Best-of-32 settings, it surpasses majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, all while maintaining lower inference costs.


Paper and Project Links

PDF 13 pages, 6 figures. Project website: https://zgca-ai4edu.github.io/TrajSelector

Summary

Large language models (LLMs) have made notable progress on complex reasoning tasks, largely thanks to test-time scaling (TTS) paradigms that allocate additional compute during inference. External TTS (in particular the Best-of-N paradigm) achieves scalable performance gains by selecting among multiple independently generated reasoning trajectories, but it faces two key limitations: the high computational overhead of deploying process reward models, and the underutilization of the LLM's intrinsic latent representations. This work introduces TrajSelector, an efficient and effective Best-of-N framework that exploits the sampler LLM's hidden states for process-level scoring. A lightweight verifier (only 0.6B parameters) evaluates step-wise trajectory quality, and these scores are then aggregated to identify the best reasoning trajectory. The framework adopts a fully data-driven, end-to-end training recipe that removes the need for massive step-level annotations. Experiments on five benchmarks show that TrajSelector delivers consistent gains: in the Best-of-32 setting, it beats majority voting by 4.61% accuracy and outperforms existing process reward models by 4.31% to 12.21%, while keeping inference costs lower.

Key Takeaways

  1. Large language models (LLMs) have made notable progress on complex reasoning tasks with the help of test-time scaling (TTS).
  2. External TTS methods suffer from high computational overhead and underuse of the LLM's intrinsic latent representations.
  3. The TrajSelector framework uses hidden states from the LLM for process-level scoring, improving efficiency.
  4. TrajSelector uses a lightweight verifier to assess step-wise trajectory quality and aggregates the scores to select the best reasoning trajectory.
  5. TrajSelector is trained in a fully data-driven, end-to-end fashion, reducing dependence on large-scale step-level annotations.
  6. Experiments show TrajSelector improves performance across multiple benchmarks, with higher accuracy and lower inference cost.
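The Best-of-N selection step described above is simple to state in code. Below is a minimal, illustrative sketch, not the paper's implementation: `step_scores` stands in for the 0.6B verifier's per-step outputs (which TrajSelector computes from the sampler's hidden states), and the choice of aggregation function is our assumption, since the abstract only says the step scores are aggregated.

```python
def select_best_trajectory(trajectories, step_scores, aggregate=min):
    """Return the candidate trajectory with the highest aggregated score.

    trajectories: list of candidate reasoning trajectories.
    step_scores:  parallel list; step_scores[i] holds per-step quality
                  scores for trajectories[i] (stand-ins for the
                  verifier's outputs).
    aggregate:    collapses step scores into one trajectory score; min
                  treats a chain as only as strong as its weakest step.
    """
    best = max(range(len(trajectories)),
               key=lambda i: aggregate(step_scores[i]))
    return trajectories[best]
```

Note that the aggregator changes the winner: under `min`, one bad step sinks an otherwise strong trajectory, while `sum` rewards long chains of decent steps.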

Cool Papers

Click here to view paper screenshots

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling

Authors:Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs) by using additional inference-time computation to explore multiple reasoning paths through search. Yet how to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. To bridge this gap, we formulate test-time search as a resource allocation problem and derive the optimal allocation strategy that maximizes the probability of obtaining a correct solution under a fixed rollout budget. Within this formulation, we reveal a core limitation of existing search methods: solution-level allocation tends to favor reasoning directions with more candidates, leading to theoretically suboptimal and inefficient use of compute. To address this, we propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias by decoupling direction quality from candidate count and allocating resources at the direction level. To demonstrate DORA’s effectiveness, we conduct extensive experiments on challenging mathematical reasoning benchmarks including MATH500, AIME2024, and AIME2025. The empirical results show that DORA consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art accuracy. We hope our findings contribute to a broader understanding of optimal TTS for LLMs.


Paper and Project Links

PDF Accepted at NeurIPS2025

Summary

This paper studies how to optimally allocate a limited compute budget at test time to improve large language model performance. It proposes Direction-Oriented Resource Allocation (DORA), which addresses a core flaw of existing search methods: under a fixed rollout budget, solution-level allocation favors reasoning directions with more candidates, leading to inefficient use of compute. Experiments show that DORA consistently outperforms strong baselines at comparable computational cost, achieving state-of-the-art accuracy.

Key Takeaways

  1. Test-Time Scaling (TTS) uses additional inference-time computation to improve the performance of large language models (LLMs).
  2. Existing search methods allocate compute poorly, leading to inefficient use of the rollout budget.
  3. The paper proposes a new allocation strategy, Direction-Oriented Resource Allocation (DORA), which overcomes this flaw by decoupling direction quality from candidate count and allocating resources at the direction level.
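As a hedged illustration of the bias the paper identifies (the function names and the simple proportional rule below are our own, not DORA's provably optimal derivation): solution-level allocation splits the budget one share per candidate, so a direction's budget grows with how many candidates it happened to receive, while direction-level allocation splits the budget by estimated direction quality alone.

```python
from collections import Counter

def solution_level_allocation(candidate_directions, budget):
    """Baseline: one share per candidate, so a direction's budget grows
    with how many candidates it happened to receive."""
    counts = Counter(candidate_directions)
    total = sum(counts.values())
    return {d: budget * c / total for d, c in counts.items()}

def direction_level_allocation(direction_quality, budget):
    """Direction-level: shares depend only on estimated direction
    quality, decoupled from candidate count."""
    total = sum(direction_quality.values())
    return {d: budget * q / total for d, q in direction_quality.items()}
```

If direction d1 drew three candidates and d2 only one, the solution-level rule gives d1 three quarters of the budget even when both directions are equally promising; the direction-level rule splits it evenly.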

Cool Papers

Click here to view paper screenshots

Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Authors:Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai

This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings: (1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e., MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR testset, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.


Paper and Project Links

PDF Project: https://github.com/HiThink-Research/NEXUS-O

Summary

This paper presents an industry-level omni-modal large language model (LLM) pipeline that integrates the auditory, visual, and linguistic modalities to address challenges such as limited tri-modal datasets, high computational cost, and complex feature alignment. The pipeline comprises three main components: a modular framework that supports flexible configuration of encoder-LLM-decoder architectures; a lightweight training strategy that pre-trains audio-language alignment on top of the state-of-the-art vision-language model Qwen2.5-VL, avoiding costly vision-specific pre-training; and an audio synthesis pipeline that generates high-quality audio-text data, supporting applications such as automatic speech recognition and speech-to-speech chat. The resulting industry-level omni-modal LLM, Nexus, demonstrates strong performance in extensive experiments.

Key Takeaways

  1. Proposes an omni-modal large language model pipeline integrating audio, vision, and language to address data, compute, and alignment challenges.
  2. The pipeline has three main components: a modular framework, a lightweight training strategy, and an audio synthesis pipeline.
  3. Nexus outperforms its backbone Qwen2.5-VL-7B on visual understanding, validating the efficiency of the training strategy.
  4. On English spoken question answering, Nexus is more accurate than the same-period competitor MiniCPM-o2.6-7B.
  5. On a real-world ASR test set, Nexus performs strongly, demonstrating robustness in real scenarios.
  6. On speech-to-text translation, Nexus outperforms Qwen2-Audio-Instruct-7B.
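To make the "modular framework enabling flexible configuration of encoder-LLM-decoder architectures" concrete, here is a hypothetical configuration sketch in the spirit of such a pipeline; the class and field names are our assumptions for illustration, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OmniModalConfig:
    """Hypothetical config for one encoder-LLM-decoder combination."""
    audio_encoder: str                # speech encoder checkpoint (assumed name)
    vision_encoder: str               # inherited from Qwen2.5-VL, per the paper
    llm_backbone: str                 # e.g. "Qwen2.5-VL-7B"
    audio_decoder: Optional[str]      # pretrained vocoder, e.g. CosyVoice2.0
    train_audio_alignment_only: bool = True  # lightweight strategy: align audio
                                             # on top of a frozen vision-language
                                             # model instead of full pre-training

cfg = OmniModalConfig(
    audio_encoder="whisper-style-encoder",
    vision_encoder="qwen2.5-vl-vision",
    llm_backbone="Qwen2.5-VL-7B",
    audio_decoder="CosyVoice2.0",
)
```

Swapping one field (say, the vocoder) yields a different pipeline instance without touching the rest, which is the kind of flexibility the modular framework claims.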

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!