Published: 2025-11-06

Updated: 2025-11-27

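⚠️ All summaries below were generated by a large language model and may contain errors. Treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are meant only for screening papers before reading them.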

2025-11-06 update

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

Authors:Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed “BBox and Index as Visual Prompt” (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.

Paper and project links

PDF

Summary
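Large-scale chemical reaction datasets are crucial for AI research in chemistry, but most existing reaction data sits in papers as images, where machine learning models can neither read it nor train on it. The authors propose the RxnCaption framework for chemical Reaction Diagram Parsing (RxnDP). It reformulates the traditional coordinate-prediction parsing process as an image captioning problem, a task that Large Vision-Language Models (LVLMs) handle naturally. A strategy called "BBox and Index as Visual Prompt" (BIVP) uses the authors' molecular detector, MolYOLO, to draw molecular bounding boxes and indices directly onto the input image, which turns downstream parsing into a natural-language description problem. Experiments show that BIVP significantly improves structural extraction quality while simplifying model design. The authors also construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a test subset balanced across four layout archetypes. RxnCaption-VL achieves state-of-the-art performance on multiple metrics.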

Key Takeaways

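Large-scale chemical reaction datasets are crucial for AI research in chemistry.
Existing reaction data mostly exists as images in papers, so it cannot be used directly to train machine learning models.
The RxnCaption framework recasts reaction diagram parsing from coordinate prediction into image captioning.
The BIVP strategy uses the MolYOLO molecular detector to pre-draw bounding boxes and indices on the input image, which improves structural extraction quality and simplifies model design.
The RxnCaption-11k dataset is an order of magnitude larger than prior real-world literature benchmarks and includes a test subset balanced across four layout archetypes.
RxnCaption-VL achieves state-of-the-art performance on multiple metrics.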
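
The BIVP idea can be illustrated with a minimal sketch. It assumes a detector has already produced bounding boxes; the `Box` class, the reading-order rule, and the caption schema (`reactants`, `conditions`, `products`) are hypothetical names chosen for illustration, not the RxnCaption output format. Drawing the boxes and indices onto the image is omitted here. The point is only that once every molecule carries an index, the model can name indices instead of predicting coordinates, and the coordinates are recovered by lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def index_boxes(boxes):
    """Assign 1-based indices in a simple top-to-bottom, left-to-right order.

    Assumption: this ordering rule is illustrative only.
    """
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {i: b for i, b in enumerate(ordered, start=1)}


def resolve_caption(caption, indexed):
    """Map the indices named in a caption-style parse back to boxes."""
    return {role: [indexed[i] for i in ids] for role, ids in caption.items()}


# Detector output in arbitrary order (made-up coordinates).
detections = [
    Box(220, 10, 300, 90),
    Box(10, 10, 90, 90),
    Box(110, 10, 190, 90),
]
indexed = index_boxes(detections)

# Hypothetical model output: roles refer to drawn indices, not coordinates.
caption = {"reactants": [1], "conditions": [2], "products": [3]}
resolved = resolve_caption(caption, indexed)
```

Here `resolved["products"][0]` is the rightmost box, recovered purely from the index that the caption mentions.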

Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue

Authors:Mahammad Nuriyev

We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA) adapters, we instantiate three specialists: tool calling, tool-response interpretation, and direct dialogue. Our system comfortably meets the computational efficiency requirements, delivering fast responses and maintaining modest resource usage on L40S GPUs. In the Commonsense Persona-Grounded Dialogue Challenge 2025, our method ranked second overall. Code available at: https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/

Paper and project links

PDF 10 pages, 1 figure, 2 tables. Technical report for the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025, part of the Wordplay 2025 Workshop @ EMNLP 2025

Summary

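Building on the Qwen3 base model with Low-Rank Adaptation (LoRA) adapters, the author presents a multi-expert system for creating Non-Player Characters (NPCs) that can both hold natural dialogue and execute contextual actions in interactive environments. The system instantiates three specialists: tool calling, tool-response interpretation, and direct dialogue. It meets the computational efficiency requirements, responding quickly while keeping resource usage modest on L40S GPUs. The method ranked second overall in the Commonsense Persona-Grounded Dialogue Challenge 2025.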

Key Takeaways

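The system is a multi-expert design for NPCs that can hold natural dialogue and execute contextual actions in interactive environments.
It uses Qwen3 as the base model together with LoRA adapters.
It instantiates three specialists: tool calling, tool-response interpretation, and direct dialogue.
It meets the computational efficiency requirements, with fast responses and modest resource usage on L40S GPUs.
The method ranked second overall in the Commonsense Persona-Grounded Dialogue Challenge 2025.
The code is publicly available at https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/.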
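
How one base model can serve three specialists can be sketched as follows. This is a toy illustration under stated assumptions: the routing rule in `pick_expert`, the `ADAPTERS` table, and the 2x2 matrices are all hypothetical and are not taken from the challenge solution. The only part that follows the standard LoRA formulation is `lora_weight`, which adds the scaled low-rank product `(alpha / r) * B @ A` to a frozen base weight.

```python
from enum import Enum


class Expert(Enum):
    TOOL_CALLING = "tool_calling"
    TOOL_INTERPRETATION = "tool_response_interpretation"
    DIALOGUE = "direct_dialogue"


def pick_expert(needs_tool, tool_result=None):
    """Choose which adapter to activate next (hypothetical routing rule)."""
    if tool_result is not None:
        return Expert.TOOL_INTERPRETATION
    if needs_tool:
        return Expert.TOOL_CALLING
    return Expert.DIALOGUE


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_weight(W, B, A, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(A)  # A has shape (r, k), so its row count is the rank
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy frozen base weight shared by all experts (made-up values).
BASE_W = [[1.0, 0.0], [0.0, 1.0]]

# Each expert owns only a small rank-1 (B, A) pair (made-up values).
ADAPTERS = {
    Expert.TOOL_CALLING: ([[1.0], [0.0]], [[0.0, 2.0]]),
    Expert.TOOL_INTERPRETATION: ([[0.0], [1.0]], [[3.0, 0.0]]),
    Expert.DIALOGUE: ([[1.0], [1.0]], [[0.5, 0.5]]),
}

expert = pick_expert(needs_tool=True)
B, A = ADAPTERS[expert]
W_active = lora_weight(BASE_W, B, A, alpha=1.0)
```

Because the base weight is shared and only the small `(B, A)` pairs differ, switching specialists changes a few parameters rather than loading a second model, which fits the efficiency emphasis in the abstract.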

Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond

Authors:Xin Qiao, Matteo Poggi, Xing Wei, Pengchao Deng, Yanhui Zhou, Stefano Mattoccia

Under-display ToF imaging aims to achieve accurate depth sensing through a ToF camera placed beneath a screen panel. However, transparent OLED (TOLED) layers introduce severe degradations, such as signal attenuation, multi-path interference (MPI), and temporal noise, that significantly compromise depth quality. To alleviate this drawback, we propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. Specifically, we implement a time-fractional reaction-diffusion module that enables iterative depth refinement with dynamically generated differential orders, capturing long-term dependencies. In addition, we introduce an efficient continuous convolution operator via coefficient prediction and repeated differentiation to further improve restoration quality. Experiments on four benchmark datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/wudiqx106/LFRD2.

Paper and project links

PDF

Summary

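This paper addresses accurate depth sensing with a ToF camera placed beneath a screen panel, where transparent OLED (TOLED) layers cause signal attenuation, multi-path interference (MPI), and temporal noise that severely degrade depth quality. The authors propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. A time-fractional reaction-diffusion module refines depth iteratively with dynamically generated differential orders, and an efficient continuous convolution operator, built from coefficient prediction and repeated differentiation, further improves restoration quality. Experiments on four benchmark datasets demonstrate the method's effectiveness. The code is publicly available on GitHub.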

Key Takeaways

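Under-display ToF imaging senses depth through a camera beneath the screen, but transparent OLED layers cause signal attenuation, multi-path interference, and temporal noise.
The LFRD2 framework combines the expressive power of neural networks with the interpretability of physical modeling.
A time-fractional reaction-diffusion module refines depth iteratively with dynamically generated differential orders, capturing long-term dependencies.
An efficient continuous convolution operator, built from coefficient prediction and repeated differentiation, further improves restoration quality.
Experiments on four benchmark datasets demonstrate the method's effectiveness.
The LFRD2 code is publicly available at https://github.com/wudiqx106/LFRD2.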
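
Why a time-fractional derivative captures long-term dependencies can be seen in a small numerical sketch. It uses the textbook Grünwald-Letnikov discretization on a single scalar value, which is an assumption made for illustration: the spatial diffusion term is left out, the order `alpha` is fixed rather than dynamically generated, and nothing here is taken from the LFRD2 code. The function names `gl_weights` and `fractional_steps` are hypothetical.

```python
def gl_weights(alpha, n):
    """First n Grünwald-Letnikov weights w_k = (-1)^k * binom(alpha, k).

    Uses the recurrence w_0 = 1, w_k = w_{k-1} * (1 - (alpha + 1) / k).
    """
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (1.0 - (alpha + 1.0) / k))
    return w


def fractional_steps(u0, alpha, h, reaction, steps):
    """Explicit scheme for D^alpha u = reaction(u) with step size h.

    Each new value depends on a weighted sum over the whole history,
    which is the long-memory effect of a fractional order.
    """
    u = [u0]
    w = gl_weights(alpha, steps + 1)
    for n in range(1, steps + 1):
        memory = sum(w[k] * u[n - k] for k in range(1, n + 1))
        u.append(h ** alpha * reaction(u[-1]) - memory)
    return u


def decay(u):
    """Toy reaction term: linear decay toward zero."""
    return -u


# alpha = 1 collapses to the ordinary forward Euler update.
euler_like = fractional_steps(1.0, alpha=1.0, h=0.1, reaction=decay, steps=2)

# alpha = 0.5 keeps nonzero weights on every past state.
half_weights = gl_weights(0.5, 4)
```

With `alpha = 1` the weights are `[1, -1, 0, 0, ...]`, so only the previous state matters. With `alpha = 0.5` they are `[1.0, -0.5, -0.125, -0.0625]`, so older states keep contributing with slowly decaying weight.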

MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models

Authors:Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, Eng Siong Chng

Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first benchmark explicitly designed to evaluate SDMs in multi-turn interactive dialogue with an emphasis on emotional intelligence. Multi-Bench employs a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emotion support and application. It comprises five carefully designed tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue, supported by a reproducible evaluation framework. We evaluate six representative SDMs on eight subsets of Multi-Bench. Results show that while current SDMs achieve good performance on basic understanding tasks, they still have room for improvement in advanced multi-turn interactive dialogue and reasoning-related tasks, particularly in emotion awareness and application.

Paper and project links

PDF Submitted to ICASSP 2026

Summary

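This paper introduces Multi-Bench, a benchmark designed to evaluate Spoken Dialogue Models (SDMs) in multi-turn interactive dialogue with an emphasis on emotional intelligence. It has a hierarchical structure: a basic track for emotion understanding and reasoning, and an advanced track for emotion support and application. It comprises five carefully designed tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue. An evaluation of six representative SDMs on eight subsets shows that current models perform well on basic understanding tasks but still have room for improvement in advanced multi-turn interactive dialogue and reasoning-related tasks, particularly in emotion awareness and application.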

Key Takeaways

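Multi-Bench is presented as the first benchmark explicitly designed to evaluate Spoken Dialogue Models (SDMs) in multi-turn interactive dialogue.
It emphasizes emotional intelligence and uses a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emotion support and application.
The benchmark contains five tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue.
Current SDMs perform well on basic understanding tasks but still need improvement in advanced multi-turn interactive dialogue and reasoning-related tasks.
Emotion awareness and application are the weakest areas of current SDMs.
Multi-Bench comes with a reproducible evaluation framework; six representative SDMs were evaluated on eight subsets.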

Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue

Authors:Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx

Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.

Paper and project links

PDF This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in TSD 2025. Lecture Notes in Computer Science, vol 16029

Summary

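Lexical alignment, where speakers start to use similar words over the course of a conversation, is known to contribute to successful communication, yet its implementation in conversational agents remains underexplored. As a first step, this study draws on strategies for personalising conversational agents and investigates how to construct stable, personalised lexical profiles as a basis for lexical alignment. The authors varied the amount of transcribed speech used for construction and the number of items per part-of-speech (POS) category, then evaluated the profiles over time using recall, coverage, and cosine similarity. Smaller, more compact profiles built from 10 minutes of transcribed speech, with 5 items each for adjectives and conjunctions and 10 items each for adverbs, nouns, pronouns, and verbs, gave the best balance of performance and data efficiency.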
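
A minimal sketch of such a profile is shown below. The per-category sizes come from the abstract; everything else is an assumption made for illustration: the input is taken to be already POS-tagged `(word, tag)` pairs, the tag names are hypothetical, and the `recall` and `cosine` definitions are one plausible reading of the metrics named in the abstract, not the paper's exact formulas.

```python
from collections import Counter
from math import sqrt

# Items kept per part-of-speech category, as reported in the abstract.
PROFILE_SIZES = {"ADJ": 5, "CONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}


def build_profile(tagged_tokens, sizes=PROFILE_SIZES):
    """Keep the most frequent words per POS category from (word, tag) pairs."""
    counts = {pos: Counter() for pos in sizes}
    for word, pos in tagged_tokens:
        if pos in counts:
            counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(n)]
            for pos, n in sizes.items()}


def recall(profile, later_tokens):
    """Share of profile items that reappear in later speech (assumed definition)."""
    later = {(w.lower(), p) for w, p in later_tokens}
    items = [(w, p) for p, words in profile.items() for w in words]
    if not items:
        return 0.0
    return sum(item in later for item in items) / len(items)


def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Made-up tagged tokens standing in for early and later transcribed speech.
early = [("nice", "ADJ"), ("nice", "ADJ"), ("dog", "NOUN"), ("and", "CONJ")]
later = [("Nice", "ADJ"), ("cat", "NOUN")]

profile = build_profile(early)
profile_recall = recall(profile, later)
```

In this toy case the profile holds three items ("nice", "dog", "and") and only "nice" recurs, so `profile_recall` is one third. Tracking such scores across later time windows is one way to judge whether a profile is stable.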

Key Takeaways