⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these for serious academic work; they are only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-28
Thinking While Listening: Simple Test Time Scaling For Audio Classification
Authors:Prateek Verma, Mert Pilanci
We propose a framework that enables neural models to “think while listening” to everyday sounds, thereby enhancing audio classification performance. Motivated by recent advances in the reasoning capabilities of large language models, we address two central questions: (i) how can thinking be incorporated into existing audio classification pipelines to enable reasoning in the category space and improve performance, and (ii) can a new architecture be designed from the ground up to support both thinking and test-time scaling? We demonstrate that in both settings, our models exhibit improved classification accuracy. Leveraging test-time scaling, we observe consistent gains as the number of sampled traces increases. Furthermore, we evaluate two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, showing that while such models are capable of zero-shot reasoning, a lightweight approach–retraining only the embedding matrix of a frozen, smaller model like GPT-2–can surpass the performance of billion-parameter text-based reasoning models.
Paper and Project Links
PDF 6 pages, 3 figures, 2 Tables, ICASSP 2026
Summary
This paper proposes a framework that enables neural models to "think" while listening to everyday sounds, improving audio classification performance. Inspired by advances in the reasoning capabilities of large language models, it addresses two core questions: how to incorporate thinking into existing audio classification pipelines to support reasoning in the category space and improve performance, and whether a new architecture can be designed from the ground up to support both thinking and test-time scaling. Experiments show that the framework improves classification accuracy, with gains growing as the number of sampled traces increases at test time. Two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, are also evaluated; while they are capable of zero-shot reasoning, retraining only the embedding matrix of a smaller frozen model such as GPT-2 can surpass the performance of billion-parameter text-based reasoning models.
Key Takeaways
- The proposed framework lets neural models "think" during audio classification, improving performance.
- It addresses two core questions: how to incorporate thinking into existing pipelines, and how to design a new architecture that supports thinking.
- Experiments show the framework improves classification accuracy.
- Performance gains grow as the number of sampled traces increases at test time.
- Two open-source reasoning models, GPT-OSS-20B and Qwen3-14B, are evaluated.
- These open-source models are capable of zero-shot reasoning.
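The test-time scaling here follows the familiar recipe of sampling several reasoning traces and aggregating their predicted labels. Below is a minimal sketch of that aggregation step, assuming a model that yields one category label per sampled trace; `sample_label` and the temperature parameter are illustrative stand-ins, not details from the paper.

```python
import math
import random
from collections import Counter

def sample_label(logits, temperature=1.0):
    """Sample one class label from a softmax over per-class scores.

    Stands in for sampling a full reasoning trace and reading off its
    final predicted category; the real model generates token sequences.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

def classify_with_scaling(logits, num_traces=16):
    """Majority vote over independently sampled traces; accuracy tends to
    improve as num_traces grows, mirroring the gains reported above."""
    votes = Counter(sample_label(logits) for _ in range(num_traces))
    return votes.most_common(1)[0][0]

# Toy usage: class 2 has the highest score, so it should win the vote.
print(classify_with_scaling([0.1, 0.5, 2.0, 0.3], num_traces=32))
```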



Listening to the long ringdown: A novel way to pinpoint the EOS in neutron-star cores
Authors:Christian Ecker, Tyler Gorda, Aleksi Kurkela, Luciano Rezzolla
Gravitational waves (GWs) from binary neutron star (BNS) merger remnants complement constraints from the inspiral phase, mass-radius measurements, and microscopic theory by providing information about the neutron-star equation of state (EOS) at extreme densities. We perform general-relativistic simulations of BNS mergers using EOS models that span the uncertain high-density regime. We find a robust correlation between the ratio of energy and angular momentum lost during the late-time post-merger GW signal - the long ringdown - and the EOS at the highest densities in neutron star cores. Applying this correlation to post-merger GW signals reduces EOS uncertainty at several times saturation density, where no direct constraints currently exist.
Paper and Project Links
PDF 4 pages, 2 figures, Quark Matter 2025 Proceedings
Summary
This work studies how gravitational waves from binary neutron-star merger remnants constrain the neutron-star equation of state (EOS) at extreme densities. Using general-relativistic simulations with EOS models spanning the uncertain high-density regime, the authors find a robust correlation between the ratio of energy to angular momentum radiated in the late-time post-merger signal (the "long ringdown") and the EOS at the highest densities in neutron-star cores. Applying this correlation to post-merger signals reduces the EOS uncertainty at several times saturation density, where no direct constraints currently exist.
Key Takeaways
- Gravitational waves from binary neutron-star merger remnants carry information about the neutron-star equation of state (EOS) at extreme densities.
- High-density EOS models are uncertain, motivating simulation studies.
- The ratio of energy to angular momentum lost in the late-time post-merger signal correlates robustly with the EOS in neutron-star cores.
- This correlation can reduce the EOS uncertainty at several times saturation density, where no direct constraints currently exist.
- Gravitational-wave signals thus give a better view of neutron-star interiors and their physics.
- The result is valuable for understanding the equation of state of matter under extreme conditions.




NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery
Authors:Minki Hong, Jangho Choi, Jihie Kim
Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.
Paper and Project Links
PDF 39 pages, 17 figures, EMNLP 2025 Main Conference
Summary
NormGenesis is a multicultural framework for generating and annotating socially grounded dialogues in English, Chinese, and Korean. It introduces a Violation-to-Resolution (V2R) dialogue type that models how conversations progress from a norm violation through recognition to socially appropriate repair, and applies exemplar-based iterative refinement early in synthesis to improve pragmatic consistency in underrepresented languages. The resulting dataset of 10,800 multi-turn dialogues is annotated at the turn level for norm adherence, speaker intent, and emotional response; models trained on the V2R-augmented data show improved pragmatic competence in ethically sensitive contexts.
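The turn-level annotation scheme (norm adherence, speaker intent, emotional response) and the V2R progression can be pictured with a simple record type. The sketch below is hypothetical; the paper's actual schema and field names are not given in this digest.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: str
    text: str
    norm_adherence: str    # e.g. "adherence" or "violation" (labels assumed)
    speaker_intent: str    # illustrative free-text intent label
    emotional_response: str

@dataclass
class V2RDialogue:
    """Violation-to-Resolution: a violation turn is followed by
    recognition and a socially appropriate repair."""
    language: str          # "en", "zh", or "ko" in this corpus
    turns: List[Turn] = field(default_factory=list)

    def has_resolution(self) -> bool:
        # Hypothetical check: a violation occurs and a later turn adheres.
        saw_violation = False
        for t in self.turns:
            if t.norm_adherence == "violation":
                saw_violation = True
            elif saw_violation and t.norm_adherence == "adherence":
                return True
        return False
```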




WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Authors:Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked, primarily because intelligent chatbots convey a wealth of non-textual information that cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates a deep reasoning process and a nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K covers both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, instruction chats with nine acoustic attributes, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, raising objective accuracy from Qwen2.5-Omni's 53.4% to 91.5%. In subjective A/B testing, WavReward also leads by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be made publicly available at https://github.com/jishengpeng/WavReward after the paper is accepted.
Paper and Project Links
Summary
End-to-end spoken dialogue models such as GPT-4o-audio have drawn wide attention, but evaluation of their conversational performance is often overlooked, since intelligent chatbots convey rich non-textual information that text-based models cannot easily measure. WavReward is a reward feedback model built on audio language models that evaluates both the IQ and EQ of spoken dialogue systems from speech input. It incorporates a deep reasoning process and a nonlinear reward mechanism for post-training, and uses multi-sample feedback via reinforcement learning to build an evaluator specialized for spoken dialogue models. The ChatReward-30K preference dataset, covering text-based chats, instruction chats with nine acoustic attributes, and implicit chats, is introduced to train it. WavReward outperforms prior state-of-the-art evaluation models across multiple spoken dialogue scenarios, raising objective accuracy from Qwen2.5-Omni's 53.4% to 91.5%, and leads subjective A/B tests by a margin of 83%. Comprehensive ablation studies confirm the necessity of each component. All data and code will be released at https://github.com/jishengpeng/WavReward after the paper is accepted.
Key Takeaways
- End-to-end spoken dialogue models such as GPT-4o-audio have attracted attention, but existing evaluation focuses on textual information and overlooks conversational performance.
- WavReward, built on audio language models, can evaluate both the IQ and EQ of spoken dialogue systems.
- It incorporates a deep reasoning process and a nonlinear reward mechanism for post-training.
- The ChatReward-30K preference dataset, spanning diverse dialogue scenarios, is introduced to train WavReward.
- WavReward delivers clear gains in both objective accuracy and subjective A/B testing.
- Ablation studies confirm that each component of WavReward is necessary.
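Reward models trained from preference data such as ChatReward-30K are commonly fit with a pairwise (Bradley-Terry) objective that scores the preferred response above the rejected one. The sketch below shows only that baseline idea; WavReward's actual post-training (deep reasoning, a nonlinear reward mechanism, and multi-sample RL feedback) is more involved.

```python
import math

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the
    rejected one: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))

# Toy usage over a few preference pairs (scores would come from the
# audio-language reward model; these numbers are made up).
pairs = [(2.1, 0.3), (1.5, 1.4), (0.2, 1.0)]
mean_loss = sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
print(f"mean preference loss: {mean_loss:.3f}")
```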



BAP v2: An Enhanced Task Framework for Instruction Following in Minecraft Dialogues
Authors:Prashant Jayannavar, Liliang Ren, Marisa Hudspeth, Risham Sidhu, Charlotte Lambert, Ariel Cordes, Elizabeth Kaplan, Anjali Narayan-Chen, Julia Hockenmaier
Developing interactive agents that can understand language, perceive their surroundings, and act within the physical world is a long-standing goal of AI research. The Minecraft Collaborative Building Task (MCBT) (Narayan-Chen, Jayannavar, and Hockenmaier 2019), a two-player game in which an Architect (A) instructs a Builder (B) to construct a target structure in a simulated 3D Blocks World environment, offers a rich platform to work towards this goal. In this work, we focus on the Builder Action Prediction (BAP) subtask: predicting B’s actions in a multimodal game context (Jayannavar, Narayan-Chen, and Hockenmaier 2020) - a challenging testbed for grounded instruction following, with limited training data. We holistically re-examine this task and introduce BAP v2 to address key challenges in evaluation, training data, and modeling. Specifically, we define an enhanced evaluation benchmark, featuring a cleaner test set and fairer, more insightful metrics that also reveal spatial reasoning as the primary performance bottleneck. To address data scarcity and to teach models basic spatial skills, we generate different types of synthetic MCBT data. We observe that current, LLM-based SOTA models trained on the human BAP dialogues fail on these simpler, synthetic BAP ones, but show that training models on this synthetic data improves their performance across the board. We also introduce a new SOTA model, Llama-CRAFTS, which leverages richer input representations, and achieves an F1 score of 53.0 on the BAP v2 task and strong performance on the synthetic data. While this result marks a notable 6 points improvement over previous work, it also underscores the task’s remaining difficulty, establishing BAP v2 as a fertile ground for future research, and providing a useful measure of the spatial capabilities of current text-only LLMs in such embodied tasks.
Paper and Project Links
PDF major revision; few examples of changes: added contemporary LLMs and new SOTA model, improved readability, expanded related work, etc
Summary
This paper targets a long-standing goal of AI research: developing interactive agents that understand language, perceive their surroundings, and act in the physical world. Using the Minecraft Collaborative Building Task (MCBT), it focuses on the Builder Action Prediction (BAP) subtask, a challenging testbed for grounded instruction following with limited training data. The authors introduce BAP v2, which improves the evaluation benchmark, training data, and modeling, and generate synthetic MCBT data to address data scarcity and teach models basic spatial skills. They also present a new state-of-the-art model, Llama-CRAFTS, which achieves an F1 score of 53.0 on BAP v2 and highlights both the remaining difficulty of the task and the spatial limitations of current text-only LLMs in such embodied tasks.
Key Takeaways
- A key goal of AI research is interactive agents that understand language, perceive their environment, and act in the physical world.
- The Minecraft Collaborative Building Task (MCBT) provides a rich platform toward this goal.
- The Builder Action Prediction (BAP) subtask is a challenging testbed for grounded instruction following, with key open challenges.
- BAP v2 addresses key challenges in evaluation, training data, and modeling.
- Synthetic MCBT data is generated to address data scarcity and teach models basic spatial skills.
- Current LLM-based SOTA models trained on human BAP dialogues fail on the simpler synthetic BAP data, but training on that synthetic data improves performance across the board.
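The headline F1 of 53.0 is computed over predicted builder actions. As a rough illustration of an action-level F1, the sketch below assumes each action is an exact-match (type, position, color) tuple; the benchmark's real metrics are fairer and more detailed than this.

```python
def action_f1(predicted, gold):
    """Micro F1 with exact-match actions, e.g. ("place", (x, y, z), "red")."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [("place", (1, 0, 1), "red"), ("place", (1, 1, 1), "red")]
gold = [("place", (1, 0, 1), "red"), ("remove", (0, 0, 0), "blue")]
print(action_f1(pred, gold))  # 0.5: one of two predictions matches gold
```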



Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation
Authors:Yu Zhang, Ruijie Yu, Kaipeng Zeng, Ding Li, Feng Zhu, Xiaokang Yang, Yaohui Jin, Yanyan Xu
Identifying reaction conditions that are broadly applicable across diverse substrates is a longstanding challenge in chemical and pharmaceutical research. While many methods are available to generate conditions with acceptable performance, a universal approach for reliably discovering effective conditions during reaction exploration is rare. Consequently, current reaction optimization processes are often labor-intensive, time-consuming, and costly, relying heavily on trial-and-error experimentation. Nowadays, large language models (LLMs) are capable of tackling chemistry-related problems, such as molecule design and chemical reasoning tasks. Here, we report the design, implementation and application of Chemma-RC, a text-augmented multimodal LLM to identify effective conditions through task-specific dialogue and condition generation. Chemma-RC learns a unified representation of chemical reactions by aligning multiple modalities-including text corpus, reaction SMILES, and reaction graphs-within a shared embedding module. Performance benchmarking on datasets showed high precision in identifying optimal conditions, with up to 17% improvement over the current state-of-the-art methods. A palladium-catalysed imidazole C-H arylation reaction was investigated experimentally to evaluate the functionalities of the Chemma-RC in practice. Our findings suggest that Chemma-RC holds significant potential to accelerate high-throughput condition screening in chemical synthesis.
Paper and Project Links
Summary
Identifying reaction conditions that apply broadly across diverse substrates is a long-standing challenge in chemical and pharmaceutical research, and current reaction optimization is labor-intensive, time-consuming, and costly, relying heavily on trial-and-error experimentation. This work presents Chemma-RC, a text-augmented multimodal LLM that identifies effective conditions through task-specific dialogue and condition generation. Chemma-RC learns a unified representation of chemical reactions by aligning text corpora, reaction SMILES, and reaction graphs within a shared embedding module. Benchmarks show high precision in identifying optimal conditions, with up to a 17% improvement over current state-of-the-art methods, and a palladium-catalysed imidazole C-H arylation experiment demonstrates its practical potential to accelerate high-throughput condition screening in chemical synthesis.
Key Takeaways
- Identifying reaction conditions that apply broadly across substrates is a long-standing challenge.
- Current reaction optimization is labor-intensive, time-consuming, and costly.
- Chemma-RC is a text-augmented multimodal LLM designed to identify effective reaction conditions.
- It learns a unified representation of chemical reactions by aligning multiple modalities.
- Benchmarks show high precision in identifying optimal conditions, with up to 17% improvement over prior state-of-the-art methods.
- A palladium-catalysed imidazole C-H arylation experiment demonstrates Chemma-RC's practical potential.
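Aligning text, SMILES, and reaction-graph encodings in a shared embedding module is commonly done with a contrastive objective that pulls representations of the same reaction together. The sketch below shows one such loss (symmetric InfoNCE) as an assumed mechanism; Chemma-RC's exact objective is not specified in this digest.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    m = x.max(axis=1, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=1, keepdims=True))

def info_nce(text_emb, smiles_emb, tau=0.07):
    """Symmetric InfoNCE: row i of each matrix encodes the same reaction
    in two modalities; all off-diagonal pairings act as negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = smiles_emb / np.linalg.norm(smiles_emb, axis=1, keepdims=True)
    logits = t @ s.T / tau                  # (batch, batch) similarities
    idx = np.arange(len(logits))
    loss_t2s = -log_softmax(logits)[idx, idx].mean()
    loss_s2t = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_t2s + loss_s2t) / 2

# Toy batch: 4 reactions, 8-dim embeddings from two modality encoders.
rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```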




REACT: Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation
Authors:Maëlic Neau, Paulo E. Santos, Anne-Gwenn Bosser, Cédric Buche, Akihiro Sugimoto
Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we propose the Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture, which achieves the highest inference speed among existing SGG models, improving object detection accuracy without sacrificing relation prediction performance. Compared to state-of-the-art approaches, REACT is 2.7 times faster and improves object detection accuracy by 58%. Furthermore, our proposal significantly reduces model size, with an average of 5.5x fewer parameters. The code is available at https://github.com/Maelic/SGG-Benchmark
Paper and Project Links
PDF Accepted at the 2025 British Machine Vision Conference (BMVC)
Summary
Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures and is a promising foundation for downstream tasks such as reasoning for embodied agents. Real-time use requires balancing performance against inference speed, but current methods focus on improving relation prediction accuracy, improving object detection accuracy, or reducing latency, without balancing all three at once. The proposed REACT architecture achieves the highest inference speed among existing SGG models and improves object detection accuracy without sacrificing relation prediction performance: it is 2.7x faster than state-of-the-art approaches, improves object detection accuracy by 58%, and significantly reduces model size, with on average 5.5x fewer parameters.
Key Takeaways
- Scene Graph Generation (SGG) encodes visual relationships between objects in images and supports downstream tasks such as embodied-agent reasoning.
- Real-time SGG must trade off performance against inference speed.
- Current SGG methods optimize only one of relation prediction accuracy, object detection accuracy, or latency, rather than balancing all three.
- REACT achieves high inference speed and better object detection accuracy without sacrificing relation prediction performance.
- Compared with existing SGG models, REACT is 2.7x faster and improves object detection accuracy by 58%.
- REACT also significantly reduces model size, with on average 5.5x fewer parameters.
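A scene graph of the kind REACT produces can be viewed as a set of (subject, predicate, object) triples over detected objects, usually scored with Recall@K. The sketch below only illustrates that data shape and metric with invented labels; it is not REACT's pipeline.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def relation_recall(predicted: List[Triple], gold: List[Triple],
                    k: int = 20) -> float:
    """Recall@K: fraction of gold triples found in the top-K predictions,
    the standard headline metric in SGG benchmarks."""
    top_k = set(predicted[:k])
    hits = sum(1 for t in gold if t in top_k)
    return hits / len(gold) if gold else 0.0

pred = [("person", "riding", "horse"), ("person", "wearing", "hat"),
        ("horse", "on", "grass")]
gold = [("person", "riding", "horse"), ("horse", "eating", "grass")]
print(relation_recall(pred, gold))  # 0.5: one of two gold triples recovered
```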



