LLM


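⚠️ All summaries below are generated by a large language model. They may contain errors and are for reference only, so use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading a paper.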

Updated 2025-09-24

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Authors:Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.

Paper and project links

PDF NeurIPS 2025. Project page: https://cvlab-kaist.github.io/Seg4Diff/

Summary

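Text-to-image diffusion models implicitly ground textual concepts through cross-modal attention, which is what lets them turn language prompts into photorealistic images. Recent multi-modal diffusion transformers (MM-DiT) go further by applying joint self-attention over concatenated image and text tokens, giving richer and more scalable cross-modal alignment. How and where these attention maps contribute to image generation is still poorly understood. This paper proposes Seg4Diff (Segmentation for Diffusion), a framework for systematically analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. The analysis identifies a semantic grounding expert layer: a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions and naturally produces high-quality semantic segmentation masks. The authors further show that lightweight fine-tuning on mask-annotated image data strengthens the semantic grouping ability of these layers, improving both segmentation performance and the fidelity of generated images. The results indicate that semantic grouping is an emergent property of diffusion transformers that can be selectively amplified, paving the way for unified models that bridge visual perception and generation.

Key Takeaways

  1. Text-to-image diffusion models turn language prompts into photorealistic images by grounding textual concepts through cross-modal attention.
  2. Multi-modal diffusion transformers achieve richer cross-modal alignment through joint self-attention over concatenated image and text tokens.
  3. Seg4Diff is a framework for analyzing the attention structures of MM-DiT and tracing how semantic information propagates from text to image.
  4. The semantic grounding expert layer is a specific MM-DiT block that aligns text tokens with spatially coherent image regions and yields high-quality semantic segmentation masks.
  5. Lightweight fine-tuning on mask-annotated image data strengthens the semantic grouping ability of these layers.
  6. Semantic grouping is an emergent property of diffusion transformers, and amplifying it selectively improves both segmentation and generation performance.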
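
The link between attention maps and segmentation masks can be illustrated with a toy example. The sketch below assumes the image-to-text attention scores of one layer have already been extracted as a matrix with one row per image token. The function name `attention_to_mask` and the toy values are hypothetical and are not taken from the Seg4Diff code.

```python
def attention_to_mask(attn, height, width):
    """Assign each image token to the text token it attends to most.

    attn: list of rows, one per image token (row-major over the grid),
          each row holding attention scores over the text tokens.
    Returns a height x width grid of text-token indices.
    """
    assert len(attn) == height * width
    # argmax over the text-token axis for every image token
    labels = [max(range(len(row)), key=row.__getitem__) for row in attn]
    # reshape the flat label list back into the image-token grid
    return [labels[r * width:(r + 1) * width] for r in range(height)]


# Toy 2x2 image with two text tokens (index 0 and index 1).
toy_attn = [
    [0.9, 0.1], [0.7, 0.3],
    [0.2, 0.8], [0.1, 0.9],
]
mask = attention_to_mask(toy_attn, height=2, width=2)
```

Here the top row of image tokens is assigned to text token 0 and the bottom row to text token 1. A real pipeline would additionally upsample the token grid to pixel resolution.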

GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer

Authors:Md. Mahmudul Hasan, Ahmed Nesar Tahsin Choudhury, Mahmudul Hasan, Md. Mosaddek Khan

Despite Bengali being the sixth most spoken language in the world, handwritten text recognition (HTR) systems for Bengali remain severely underdeveloped. The complexity of Bengali script–featuring conjuncts, diacritics, and highly variable handwriting styles–combined with a scarcity of annotated datasets makes this task particularly challenging. We present GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system based on a Grapheme-aware Decoder-only Transformer architecture. To address the unique challenges of Bengali script, we augment the performance of a decoder-only transformer by integrating a grapheme-based tokenizer and demonstrate that it significantly improves recognition accuracy compared to conventional subword tokenizers. Our model is pretrained on large-scale synthetic data and fine-tuned on real human-annotated samples, achieving state-of-the-art performance on multiple benchmark datasets.

Paper and project links

PDF 7 pages. Accepted at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) System Demonstrations. Equal Contribution: Md. Mahmudul Hasan and Ahmed Nesar Tahsin Choudhury

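Summary

Although Bengali is the sixth most spoken language in the world, handwritten text recognition (HTR) systems for it remain severely underdeveloped. The complexity of Bengali script (conjuncts, diacritics, and highly variable handwriting styles), combined with a scarcity of annotated datasets, makes the task particularly challenging. The authors present GraDeT-HTR, a resource-efficient Bengali HTR system based on a Grapheme-aware Decoder-only Transformer architecture. To address the challenges of Bengali script, they integrate a grapheme-based tokenizer into a decoder-only transformer and show that it significantly improves recognition accuracy over conventional subword tokenizers. The model is pretrained on large-scale synthetic data and fine-tuned on real human-annotated samples, achieving state-of-the-art performance on multiple benchmark datasets.

Key Takeaways

  1. HTR systems for Bengali are underdeveloped and face several challenges.
  2. Bengali script is complex, with conjuncts, diacritics, and highly variable handwriting styles.
  3. The scarcity of annotated datasets makes building HTR systems harder.
  4. GraDeT-HTR is built on a Grapheme-aware Decoder-only Transformer architecture.
  5. Integrating a grapheme-based tokenizer improves the performance of the decoder-only transformer over conventional subword tokenizers.
  6. GraDeT-HTR achieves state-of-the-art performance on multiple benchmark datasets.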
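
To make the idea of grapheme-level tokens concrete, here is a minimal sketch that groups Bengali code points into grapheme-like clusters. It is a simplified approximation and not the tokenizer from the paper: it assumes a cluster is a base character plus any following combining marks, and that a virama (hasanta, U+09CD) joins the next consonant into a conjunct. Special cases such as zero-width joiners are ignored.

```python
import unicodedata

VIRAMA = "\u09cd"  # Bengali hasanta, which links consonants into conjuncts


def graphemes(text):
    """Split text into approximate Bengali grapheme clusters."""
    clusters = []
    join_next = False  # True right after a virama
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or join_next):
            clusters[-1] += ch  # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
        join_next = ch == VIRAMA
    return clusters


# KA + VIRAMA + SSA forms one conjunct cluster.
conjunct = graphemes("\u0995\u09cd\u09b7")
# BA + AA-sign + ANUSVARA, then LA + AA-sign.
word = graphemes("\u09ac\u09be\u0982\u09b2\u09be")
```

A subword tokenizer may split such clusters at arbitrary byte or character boundaries, whereas a grapheme-based vocabulary keeps each visual unit intact.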

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1

Paper and project links

PDF Accepted at NeurIPS 2025

Summary

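This paper introduces TempSamp-R1, a reinforcement fine-tuning framework that makes adapting multimodal large language models (MLLMs) to video temporal grounding more effective. Existing reinforcement learning methods such as GRPO rely on on-policy sampling, which is inefficient and often fails to find temporally accurate solutions when the temporal search space is large. TempSamp-R1 uses ground-truth annotations as off-policy supervision to provide temporally precise guidance, and introduces a non-linear soft advantage computation that stabilizes training and reduces the variance of reward-based updates. A hybrid Chain-of-Thought (CoT) training paradigm optimizes a single model to support both CoT and non-CoT inference. TempSamp-R1 sets new state-of-the-art results on Charades-STA (R1@0.7: 52.9%), ActivityNet Captions (R1@0.5: 56.0%), and QVHighlights (mAP: 30.0%), and shows robust few-shot generalization under limited data.

Key Takeaways

  1. TempSamp-R1 is a reinforcement fine-tuning framework for adapting MLLMs to video temporal grounding tasks.
  2. Existing on-policy reinforcement learning methods perform poorly in large temporal search spaces; TempSamp-R1 addresses this by using ground-truth annotations as off-policy supervision.
  3. A non-linear soft advantage computation stabilizes training and reduces variance in reward-based updates.
  4. A hybrid Chain-of-Thought (CoT) training paradigm lets a single model support both CoT and non-CoT inference modes.
  5. TempSamp-R1 outperforms GRPO-based baselines on Charades-STA, ActivityNet Captions, and QVHighlights.
  6. TempSamp-R1 shows robust few-shot generalization under limited data.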
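
The R1@0.7 and R1@0.5 figures above are recall-at-1 scores under a temporal IoU threshold. As a rough illustration of how such a metric is usually computed, the sketch below scores one predicted segment per query against its ground-truth segment. The function names and the toy segments are illustrative assumptions, not code from the TempSamp-R1 repository.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Two toy queries: the first prediction overlaps well, the second barely.
preds = [(2.0, 8.0), (0.0, 5.0)]
gts = [(2.0, 10.0), (4.0, 9.0)]
r1_at_07 = recall_at_1(preds, gts, threshold=0.7)
```

The first pair has an IoU of 0.75 and counts as a hit at the 0.7 threshold, while the second pair (IoU of about 0.11) does not, so `r1_at_07` is 0.5.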

V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-of-Thoughts

Authors:Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Yu-Chiang Frank Wang, Min-Hung Chen, Stephen F. Smith

Current state-of-the-art autonomous vehicles could face safety-critical situations when their local sensors are occluded by large nearby objects on the road. Vehicle-to-vehicle (V2V) cooperative autonomous driving has been proposed as a means of addressing this problem, and one recently introduced framework for cooperative autonomous driving has further adopted an approach that incorporates a Multimodal Large Language Model (MLLM) to integrate cooperative perception and planning processes. However, despite the potential benefit of applying graph-of-thoughts reasoning to the MLLM, this idea has not been considered by previous cooperative autonomous driving research. In this paper, we propose a novel graph-of-thoughts framework specifically designed for MLLM-based cooperative autonomous driving. Our graph-of-thoughts includes our proposed novel ideas of occlusion-aware perception and planning-aware prediction. We curate the V2V-GoT-QA dataset and develop the V2V-GoT model for training and testing the cooperative driving graph-of-thoughts. Our experimental results show that our method outperforms other baselines in cooperative perception, prediction, and planning tasks.

Paper and project links

PDF

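Summary

State-of-the-art autonomous vehicles can face safety-critical situations when their local sensors are occluded by large nearby objects on the road. Vehicle-to-vehicle (V2V) cooperative autonomous driving has been proposed to address this, and one recent framework uses a multimodal large language model (MLLM) to integrate cooperative perception and planning. Prior cooperative driving research has not, however, applied graph-of-thoughts reasoning to the MLLM. This paper proposes a graph-of-thoughts framework designed for MLLM-based cooperative autonomous driving, built around two new ideas: occlusion-aware perception and planning-aware prediction. The authors curate the V2V-GoT-QA dataset and develop the V2V-GoT model to train and test the cooperative driving graph-of-thoughts. Experiments show that the method outperforms other baselines on cooperative perception, prediction, and planning tasks.

Key Takeaways

  1. Current autonomous vehicles face safety-critical situations when large nearby objects occlude their local sensors.
  2. V2V cooperative autonomous driving has been proposed as a remedy, and a recent framework uses an MLLM to integrate cooperative perception and planning.
  3. The paper proposes a graph-of-thoughts framework for MLLM-based cooperative driving, introducing occlusion-aware perception and planning-aware prediction.
  4. The authors curate the V2V-GoT-QA dataset and develop the V2V-GoT model to train and test the cooperative driving graph-of-thoughts.
  5. The approach targets the safety risks that arise when objects occlude an autonomous vehicle's sensors.
  6. The method outperforms other baselines on cooperative perception, prediction, and planning tasks.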
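
One way to picture graph-of-thoughts reasoning is as a dependency graph of reasoning steps, where each step sees the answers of the steps it depends on. The sketch below is only an illustration under assumptions: the three node names come from the abstract, but the edges between them, the `run_graph` helper, and the stub answer function are hypothetical and do not reproduce the V2V-GoT model.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key depends on the nodes in its set.
graph = {
    "occlusion_aware_perception": set(),
    "planning_aware_prediction": {"occlusion_aware_perception"},
    "planning": {"occlusion_aware_perception", "planning_aware_prediction"},
}


def run_graph(graph, answer_fn):
    """Answer every node after its dependencies, passing their answers along."""
    answers = {}
    for node in TopologicalSorter(graph).static_order():
        context = {dep: answers[dep] for dep in sorted(graph[node])}
        answers[node] = answer_fn(node, context)
    return answers


def stub_answer(node, context):
    """Stand-in for an MLLM call: records which earlier answers were visible."""
    return f"{node}({', '.join(context)})"


answers = run_graph(graph, stub_answer)
```

In a real system `stub_answer` would be replaced by a model query whose prompt includes the earlier answers, so that planning is conditioned on both perception and prediction.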

Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Authors:Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model’s spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5’s predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.

Paper and project links

PDF

Summary

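Frontier LLMs and their multimodal counterparts have shown promise on medical quizzes and diagnostic tasks, but localizing pathological findings is a separate and fundamental skill. This paper systematically evaluates two general-purpose MLLMs (GPT-4 and GPT-5) and one domain-specific model (MedGemma) on localizing pathologies in chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 reached a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all below a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite the modest performance, error analysis shows that GPT-5's predictions mostly fall in anatomically plausible regions without always being precisely localized. GPT-4 does well on pathologies with fixed anatomical locations but struggles with spatially variable findings and produces anatomically implausible predictions more frequently. MedGemma performs worst on all pathologies, showing limited capacity to generalize to this novel task. The study highlights both the promise and the limitations of current MLLMs in medical imaging and the importance of integrating them with task-specific tools.

Key Takeaways

  • Frontier LLMs and their multimodal counterparts show promise on medical quizzes and diagnostic tasks.
  • On localizing pathologies in chest radiographs, GPT-5 performed best among the evaluated MLLMs (49.7%), but still below the task-specific CNN baseline (59.9%) and the radiologist benchmark (80.1%).
  • GPT-4 did well on pathologies with fixed anatomical locations but struggled with spatially variable findings.
  • MedGemma had the lowest performance on all pathologies, showing limited capacity to generalize to this novel task.
  • Current MLLMs show both promise and limitations in medical imaging and should be integrated with task-specific tools for reliable use.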
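
The grid-and-coordinates prompting idea can be sketched as follows. This is an assumption-laden illustration, not the paper's pipeline: it supposes the model answers with a (row, column) grid cell, maps that cell to its center pixel, and counts a hit when the pixel lies inside a binary ground-truth mask. The abstract does not state the exact scoring rule, so `is_hit` is a hypothetical stand-in.

```python
def cell_to_pixel(row, col, grid_size, height, width):
    """Center pixel (y, x) of a grid cell on a height x width image."""
    y = int((row + 0.5) * height / grid_size)
    x = int((col + 0.5) * width / grid_size)
    return y, x


def is_hit(row, col, grid_size, mask):
    """True if the predicted cell's center falls inside the ground-truth mask."""
    y, x = cell_to_pixel(row, col, grid_size, len(mask), len(mask[0]))
    return bool(mask[y][x])


# Toy 4x4 mask with a finding in the lower-right quadrant, under a 2x2 grid.
mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
predictions = [(1, 1), (0, 0)]
accuracy = sum(is_hit(r, c, 2, mask) for r, c in predictions) / len(predictions)
```

With these toy inputs the first prediction lands inside the finding and the second does not, giving an accuracy of 0.5.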

Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs

Authors:Richard Cornelius Suwandi, Feng Yin, Juntao Wang, Renjie Li, Tsung-Hui Chang, Sergios Theodoridis

The efficiency of Bayesian optimization (BO) relies heavily on the choice of the Gaussian process (GP) kernel, which plays a central role in balancing exploration and exploitation under limited evaluation budgets. Traditional BO methods often rely on fixed or heuristic kernel selection strategies, which can result in slow convergence or suboptimal solutions when the chosen kernel is poorly suited to the underlying objective function. To address this limitation, we propose a freshly-baked Context-Aware Kernel Evolution (CAKE) to enhance BO with large language models (LLMs). Concretely, CAKE leverages LLMs as the crossover and mutation operators to adaptively generate and refine GP kernels based on the observed data throughout the optimization process. To maximize the power of CAKE, we further propose BIC-Acquisition Kernel Ranking (BAKER) to select the most effective kernel through balancing the model fit measured by the Bayesian information criterion (BIC) with the expected improvement at each iteration of BO. Extensive experiments demonstrate that our fresh CAKE-based BO method consistently outperforms established baselines across a range of real-world tasks, including hyperparameter optimization, controller tuning, and photonic chip design. Our code is publicly available at https://github.com/cake4bo/cake.

Paper and project links

PDF Accepted as Poster at NeurIPS 2025

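Summary

The efficiency of Bayesian optimization (BO) depends heavily on the choice of Gaussian process (GP) kernel, which is central to balancing exploration and exploitation under a limited evaluation budget. Traditional BO methods often rely on fixed or heuristic kernel selection, which can lead to slow convergence or suboptimal solutions when the chosen kernel is poorly suited to the underlying objective function. To address this, the authors propose Context-Aware Kernel Evolution (CAKE), which enhances BO with large language models (LLMs). CAKE uses LLMs as crossover and mutation operators to adaptively generate and refine GP kernels from the data observed during optimization. To get the most out of CAKE, they further propose BIC-Acquisition Kernel Ranking (BAKER), which selects the most effective kernel at each BO iteration by balancing model fit, measured by the Bayesian information criterion (BIC), against the expected improvement. Extensive experiments show that the CAKE-based BO method consistently outperforms established baselines on real-world tasks including hyperparameter optimization, controller tuning, and photonic chip design.

Key Takeaways

  1. The efficiency of BO depends on the choice of GP kernel, which is central to balancing exploration and exploitation.
  2. Fixed or heuristic kernel selection in traditional BO can cause slow convergence or suboptimal solutions.
  3. CAKE uses LLMs to adaptively generate and refine GP kernels based on the observed data.
  4. In CAKE, the LLM acts as the crossover and mutation operator of a kernel evolution process.
  5. BAKER selects the most effective kernel by balancing BIC-measured model fit against expected improvement.
  6. CAKE-based BO outperforms established baselines on hyperparameter optimization, controller tuning, and photonic chip design.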
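
The two quantities that BAKER is described as balancing have standard closed forms, sketched below. The BIC and the expected improvement (written here for a minimization problem with a Gaussian posterior) are textbook formulas. How BAKER combines them into a single ranking is not given in the abstract, so this sketch deliberately stops at computing both numbers for a hypothetical candidate kernel.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion; lower values indicate a better fit."""
    return num_params * math.log(num_obs) - 2.0 * log_likelihood


def expected_improvement(mu, sigma, best):
    """Expected improvement for minimization under a posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


# Hypothetical candidate kernel: 3 hyperparameters fitted on 20 observations.
candidate_bic = bic(log_likelihood=-10.0, num_params=3, num_obs=20)
candidate_ei = expected_improvement(mu=0.0, sigma=1.0, best=0.0)
```

A kernel with a low BIC explains the observed data well with few parameters, while a high expected improvement suggests its posterior still promises better points, which is why a ranking might want to weigh both.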

ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs

Authors:Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen

Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations: First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards potentially lead to vanishing gradients. Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model’s own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.

Paper and project links

PDF

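Summary

Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. RL with verifiable rewards (RLVR) derives reward signals from automatically verifiable outcomes such as correctness or executability. It is efficient but has two key limitations: its binary feedback is too sparse to capture the quality of the reasoning process, and its coarse-grained rewards can lead to vanishing gradients. Inspired by observations from human learning, the authors introduce an RL technique that combines verifiable outcomes with the model's own confidence estimates. This joint design enriches the reward signal, gives finer-grained feedback, and implicitly supervises the reasoning process. Experiments show that the method improves RL performance across multiple datasets and reduces token consumption during inference at negligible additional training cost. It can also be used as a plug-in module to enhance other state-of-the-art RL methods.

Key Takeaways

  1. RL has become a standard way to refine LLMs beyond pre-training and instruction tuning.
  2. RLVR builds rewards from automatically verifiable outcomes, but its feedback is sparse and coarse-grained.
  3. Combining verifiable outcomes with the model's own confidence estimates yields a richer reward signal.
  4. The method improves RL performance on multiple datasets and reduces token consumption during inference.
  5. The additional training cost is negligible.
  6. The method can serve as a plug-in module for other state-of-the-art RL methods.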
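
To show what a confidence-weighted and clipped reward could look like, here is one possible formulation. It is not the formula from the paper, which the abstract does not give. The sketch assumes confidence is the geometric-mean token probability of the sampled answer, that the verifiable outcome sets the sign of the reward, and that the magnitude is clipped to a fixed range. The clip bounds and function names are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a sampled answer, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def confidence_weighted_reward(is_correct, token_logprobs,
                               clip_low=0.2, clip_high=1.0):
    """Signed reward whose magnitude is the clipped model confidence."""
    conf = sequence_confidence(token_logprobs)
    magnitude = min(max(conf, clip_low), clip_high)  # keep a minimum signal
    return magnitude if is_correct else -magnitude


confident_right = confidence_weighted_reward(True, [0.0, 0.0])
hesitant_right = confidence_weighted_reward(True, [math.log(0.01)])
confident_wrong = confidence_weighted_reward(False, [math.log(0.5)] * 2)
```

Compared with a binary reward of plus or minus one, this variant separates confident from hesitant answers, and the lower clip bound keeps even low-confidence answers from producing a vanishing signal.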

CorefInst: Leveraging LLMs for Multilingual Coreference Resolution

Authors:Tuğba Pamay Arslan, Emircan Erol, Gülşen Eryiğit

Coreference Resolution (CR) is a crucial yet challenging task in natural language understanding, often constrained by task-specific architectures and encoder-based language models that demand extensive training and lack adaptability. This study introduces the first multilingual CR methodology which leverages decoder-only LLMs to handle both overt and zero mentions. The article explores how to model the CR task for LLMs via five different instruction sets using a controlled inference method. The approach is evaluated across three LLMs; Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when instruction-tuned with a suitable instruction set, can surpass state-of-the-art task-specific architectures. Specifically, our best model, a fully fine-tuned Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model (i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages in the CorefUD v1.2 dataset collection.

Paper and project links

PDF Accepted for publication in Transactions of the Association for Computational Linguistics (TACL) (2025 August). Submission: March, 2025. Revision: July, 2025. Acceptance: August, 2025

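Summary

This paper introduces a multilingual coreference resolution (CR) methodology that uses decoder-only LLMs to handle both overt and zero mentions. It explores how to model the CR task for LLMs through five different instruction sets, using a controlled inference method. The approach is evaluated on three LLMs: Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when instruction-tuned with a suitable instruction set, can surpass state-of-the-art task-specific architectures. In particular, the best model, a fully fine-tuned Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model (the Corpipe 24 single stage variant) by 2 percentage points on average across all languages in the CorefUD v1.2 dataset collection.

Key Takeaways

  1. The paper introduces a multilingual CR methodology that uses decoder-only LLMs to handle both overt and zero mentions.
  2. It explores five different instruction sets for modeling the CR task for LLMs.
  3. A controlled inference method is used when applying the LLMs to CR.
  4. Experiments cover three LLMs: Llama 3.1, Gemma 2, and Mistral 0.3.
  5. LLMs instruction-tuned with a suitable instruction set can surpass state-of-the-art task-specific architectures.
  6. The best model is a fully fine-tuned Llama 3.1, which outperforms the leading multilingual CR model, the Corpipe 24 single stage variant.
  7. The margin is 2 percentage points on average across all languages in the CorefUD v1.2 dataset collection.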

Transformer-Gather, Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution

Authors:Mohammadreza Sharifi, Danial Ahmadzadeh

Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from computational costs or the excessive need for parallel computation. In this study, we introduce a scalable hybrid framework, which is designed to address several important problems, including scalability, noise robustness, and reliable results. We utilized a pre-trained language model to encode each structured data into corresponding semantic embedding vectors. Subsequently, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage using fuzzy string matching techniques to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared to other methods, this approach exhibits an outstanding performance in terms of both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.

Paper and project links

PDF Accepted at ICCKE 2025 Conference. 6 tables, 7 figures

Summary

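This paper introduces a scalable hybrid framework for entity resolution in enterprise systems, targeting scalability, noise robustness, and reliable results. A pre-trained language model encodes each structured record into a semantic embedding vector. After a semantically relevant subset of candidates is retrieved, a syntactic verification stage based on fuzzy string matching refines the classification on unlabeled data. The approach was applied to a real-world task that links a central user management database with numerous shared hosting server records, where it performed well in both processing time and robustness while maintaining a retrieval recall of approximately 0.97. Because the framework scales well, it can be deployed on standard CPU-based infrastructure, making it a practical solution for enterprise-level data integrity auditing.

Key Takeaways

  1. Entity resolution matters in enterprise systems where data integrity must be rigorously maintained.
  2. Traditional methods struggle with noisy data and semantic understanding.
  3. Modern methods suffer from high computational cost or an excessive need for parallel computation.
  4. The proposed scalable hybrid framework targets scalability, noise robustness, and reliable results.
  5. A pre-trained language model encodes structured data into semantic embedding vectors.
  6. Fuzzy string matching provides a syntactic verification stage that refines classification on unlabeled data.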
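
The gather-then-reconsider pattern can be sketched with the standard library alone. This is a toy illustration under assumptions, not the authors' system: character-trigram counts stand in for the pre-trained language model embeddings, `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the records, `top_k`, and `min_ratio` values are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(text):
    """Stand-in for a language-model embedding: character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def resolve(query, records, top_k=2, min_ratio=0.8):
    """Stage 1: retrieve top_k by embedding similarity. Stage 2: fuzzy verify."""
    q = embed(query)
    candidates = sorted(records, key=lambda r: cosine(q, embed(r)),
                        reverse=True)[:top_k]
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
              for c in candidates]
    ratio, best = max(scored)
    return best if ratio >= min_ratio else None


records = ["john.smith@example.com", "jane.doe@example.com", "j.smith@example.org"]
match = resolve("jon.smith@example.com", records)
no_match = resolve("zzzz", records)
```

The first stage keeps the expensive string comparison limited to a handful of candidates, which is what makes the two-stage design attractive on CPU-only infrastructure. The second stage rejects candidates that are only loosely related.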

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

Authors:Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Treleaven

As the industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap by conducting a comparative red teaming analysis of GPT-OSS-20B, a 20-billion parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model level and agentic level vulnerability profiles. Critically, we discover the existence of agentic-only vulnerabilities, attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone model vulnerabilities do not always generalize to deployed systems.

Paper and project links

PDF Winner of the OpenAI GPT-OSS-20B Red Teaming Challenge (Kaggle, 2025)

Summary

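As the industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. This paper studies the vulnerabilities of GPT-OSS-20B both as a standalone model and inside an agentic loop. Using iterative red teaming attacks with harmful objectives from HarmBench and the AgentSeer observability framework, the authors find fundamental differences between model-level and agentic-level vulnerability profiles. Some vulnerabilities emerge only within agentic execution contexts, and tool-calling contexts show 24% higher vulnerability than non-tool contexts. Conversely, certain exploits work only at the model level and fail when transferred to agentic contexts. Model-level security testing alone is therefore not enough to capture the risks of a deployed agentic system.

Key Takeaways

  1. Understanding the vulnerabilities of agentic AI systems is critical for deploying them safely.
  2. The paper compares red teaming of GPT-OSS-20B at the model level and at the agentic level.
  3. Vulnerability profiles differ fundamentally between the two levels, so each needs targeted testing.
  4. Agentic-only vulnerabilities exist: attack vectors that emerge exclusively within agentic execution contexts and remain inert against the standalone model.
  5. Tool-calling contexts show 24% higher vulnerability than non-tool contexts.
  6. Model-level security testing alone does not fully capture the risks of agentic deployments.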

Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs

Authors:Yuhang Jia, Xu Zhang, Yang Chen, Hui Wang, Enzhi Wang, Yong Qin

Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting, and lightweight instruction tuning, to enhance step-by-step reasoning. Experiments demonstrate that our framework delivers accurate, interpretable, and text-based editing evaluation, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.

Paper and project links

PDF

Summary

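Automatic mean opinion score (MOS) prediction offers a more perceptual alternative to objective metrics and gives deeper insight into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their stronger perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. This work tackles audio editing evaluation and proposes the first natural language-based automated evaluation framework built on MLLMs. The approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting and lightweight instruction tuning to strengthen step-by-step reasoning. Experiments show that the framework delivers accurate, interpretable, text-based editing evaluation that aligns closely with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.

Key Takeaways

  1. Automatic MOS prediction offers a more perceptual alternative to objective metrics.
  2. MLLMs bring stronger perceptual and reasoning abilities to audio quality assessment.
  3. The paper proposes a natural language-based automated evaluation framework, built on MLLMs, for audio editing evaluation.
  4. Two fine-tuning tasks are introduced to boost multi-audio understanding.
  5. Chain-of-Thought prompting and lightweight instruction tuning strengthen step-by-step reasoning.
  6. The framework's evaluations are accurate and interpretable, and align closely with human judgments and objective metrics.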

Mental Multi-class Classification on Social Media: Benchmarking Transformer Architectures against LSTM Models

Authors:Khalid Hasan, Jamil Saquer, Yifan Zhang

Millions of people openly share mental health struggles on social media, providing rich data for early detection of conditions such as depression, bipolar disorder, etc. However, most prior Natural Language Processing (NLP) research has focused on single-disorder identification, leaving a gap in understanding the efficacy of advanced NLP techniques for distinguishing among multiple mental health conditions. In this work, we present a large-scale comparative study of state-of-the-art transformer versus Long Short-Term Memory (LSTM)-based models to classify mental health posts into exclusive categories of mental health conditions. We first curate a large dataset of Reddit posts spanning six mental health conditions and a control group, using rigorous filtering and statistical exploratory analysis to ensure annotation quality. We then evaluate five transformer architectures (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against several LSTM variants (with or without attention, using contextual or static embeddings) under identical conditions. Experimental results show that transformer models consistently outperform the alternatives, with RoBERTa achieving 91-99% F1-scores and accuracies across all classes. Notably, attention-augmented LSTMs with BERT embeddings approach transformer performance (up to 97% F1-score) while training 2-3.5 times faster, whereas LSTMs using static embeddings fail to learn useful signals. These findings represent the first comprehensive benchmark for multi-class mental health detection, offering practical guidance on model selection and highlighting an accuracy-efficiency trade-off for real-world deployment of mental health NLP systems.

Paper and project links

PDF 24th IEEE International Conference on Machine Learning and Applications, ICMLA 2025 (camera-ready)

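Summary

Millions of people openly share mental health struggles on social media, providing rich data for early detection of conditions such as depression and bipolar disorder. Most prior NLP research has focused on single-disorder identification, leaving open how well advanced NLP techniques distinguish among multiple mental health conditions. This work presents a large-scale comparison of transformer and Long Short-Term Memory (LSTM)-based models for classifying mental health posts into exclusive categories. The authors curate a large dataset of Reddit posts spanning six mental health conditions and a control group, using rigorous filtering and statistical exploratory analysis to ensure annotation quality. They evaluate five transformer architectures (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against several LSTM variants under identical conditions. Transformer models consistently perform best, with RoBERTa achieving 91-99% F1-scores and accuracies across all classes. Attention-augmented LSTMs with BERT embeddings approach transformer performance (up to 97% F1-score) while training 2-3.5 times faster, whereas LSTMs using static embeddings fail to learn useful signals. The authors describe the findings as the first comprehensive benchmark for multi-class mental health detection, offering practical guidance on model selection and highlighting an accuracy-efficiency trade-off for real-world deployment.

Key Takeaways

  1. Social media provides abundant mental health data that can support early detection.
  2. Prior NLP research has left a gap in distinguishing among multiple mental health conditions.
  3. The study compares transformer models and LSTM-based models for classifying mental health posts.
  4. RoBERTa performs best, with 91-99% F1-scores and accuracies across all classes.
  5. Attention-augmented LSTMs with BERT embeddings train 2-3.5 times faster and approach transformer performance.
  6. Comparing models under identical conditions gives practical guidance for model selection in deployment.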
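
Since the results are reported as per-class F1-scores, a small worked example may help in reading them. The sketch below computes the F1-score of each class from true and predicted labels. The toy labels are made up for illustration and do not come from the paper's dataset.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example: one bipolar post is misclassified as depression.
y_true = ["depression", "bipolar", "control", "depression"]
y_pred = ["depression", "depression", "control", "depression"]
scores = per_class_f1(y_true, y_pred)
```

In this toy case the depression class scores 0.8 because of one false positive, the bipolar class scores 0.0 because its only post was missed, and the control class scores 1.0.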

Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection

Authors:Zhixion Chen, Jiangzhou Wang, Hyundong Shin, Arumugam Nallanathan

The deployment of unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise in supporting diverse Internet of Things (IoT) applications. Nevertheless, the limited endurance and communication range of UAVs necessitate intelligent trajectory planning. While reinforcement learning (RL) has been extensively explored for UAV trajectory optimization, its interactive nature entails high costs and risks in real-world environments. Offline RL mitigates these issues but remains susceptible to unstable training and relies heavily on expert-quality datasets. To address these challenges, we formulate a joint UAV trajectory planning and resource allocation problem to maximize the energy efficiency of data collection. The resource allocation subproblem is first transformed into an equivalent linear programming formulation and solved optimally with polynomial-time complexity. Then, we propose a large language model (LLM)-empowered critic-regularized decision transformer (DT) framework, termed LLM-CRDT, to learn effective UAV control policies. In LLM-CRDT, we incorporate critic networks to regularize the DT model training, thereby integrating the sequence modeling capabilities of DT with critic-based value guidance to enable learning effective policies from suboptimal datasets. Furthermore, to mitigate the data-hungry nature of transformer models, we employ a pre-trained LLM as the transformer backbone of the DT model and adopt a parameter-efficient fine-tuning strategy, i.e., LoRA, enabling rapid adaptation to UAV control tasks with a small-scale dataset and low computational overhead. Extensive simulations demonstrate that LLM-CRDT outperforms benchmark online and offline RL methods, achieving up to 36.7% higher energy efficiency than the current state-of-the-art DT approaches.

Paper and project links

PDF 14 pages, 8 figures

Summary

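This paper examines the use of unmanned aerial vehicles (UAVs) for reliable, energy-efficient data collection in support of Internet of Things (IoT) applications. Because UAVs have limited endurance and communication range, they need intelligent trajectory planning. Reinforcement learning (RL) has been widely studied for UAV trajectory optimization, but its interactive nature makes it costly and risky in real-world environments, and offline RL remains prone to unstable training and dependent on expert-quality data. The authors formulate a joint UAV trajectory planning and resource allocation problem to maximize the energy efficiency of data collection. The resource allocation subproblem is transformed into an equivalent linear program and solved optimally in polynomial time. For the control policy, they propose LLM-CRDT, an LLM-empowered critic-regularized decision transformer (DT) in which critic networks regularize DT training so that effective policies can be learned from suboptimal datasets. A pre-trained LLM serves as the transformer backbone and is fine-tuned with LoRA, a parameter-efficient strategy, so the model adapts to UAV control tasks with small datasets and low computational overhead. Simulations show that LLM-CRDT outperforms benchmark online and offline RL methods and achieves up to 36.7% higher energy efficiency than current state-of-the-art DT approaches.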

Key Takeaways

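  1. UAVs show promise for data collection in IoT applications.
  2. UAV trajectory planning and resource allocation are key challenges.
  3. Online RL for UAV trajectory optimization incurs high costs and risks in real-world environments.
  4. The paper formulates a joint trajectory planning and resource allocation problem to maximize energy efficiency.
  5. The resource allocation subproblem is recast as a linear program and solved optimally.
  6. LLM-CRDT, an LLM-empowered critic-regularized decision transformer, learns effective UAV control policies.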
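
As a rough illustration of the sequence-modeling side, a standard decision transformer conditions each action on the return-to-go, the sum of rewards still to be collected from that step onward. The sketch below computes it for one offline trajectory. The reward values are made up, and the abstract does not say how LLM-CRDT encodes returns or combines them with the critic term, so this is the generic DT preprocessing step, not the authors' pipeline.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go for each step of one trajectory.

    rewards: per-step rewards (hypothetical values, e.g. energy-efficiency gains).
    gamma:   discount factor; 1.0 gives the plain sum of future rewards.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the suffix sum already computed.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Usage: a three-step trajectory with illustrative rewards.
trajectory_rewards = [1.0, 2.0, 3.0]
print(returns_to_go(trajectory_rewards))
```

At training time each (return-to-go, state, action) triple becomes three consecutive tokens, which is what lets a transformer backbone treat control as sequence prediction.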

MALTA: An Automated CGRA Design Framework

Authors:Zesong Jiang, Yuqi Sun, Qing Zhong, Mahathi Krishna, Deepak Patil, Cheng Tan, Sriram Krishnamoorthy, Jeff Zhang

Coarse-grained Reconfigurable Arrays (CGRAs) are a promising computing architecture that can deliver high-performance, energy-efficient acceleration across diverse domains. By supporting reconfiguration at the functional unit level, CGRAs efficiently adapt to varying computational patterns and optimize resource utilization. However, designing CGRAs is highly challenging due to the vast design space, independent architectural parameters, and the time-consuming nature of manual design. Fortunately, the rapid advancement of large language models (LLMs) presents new opportunities to automate this process. In this work, we propose MALTA, an open-source multi-agent LLM-based framework for Hardware/Software (HW/SW) co-design of CGRAs. The framework employs LLM reasoning to generate CGRAs across four stages: HW/SW co-design, Design error correction, Best design selection, and Evaluation & Feedback. Furthermore, MALTA iteratively optimizes the generated CGRAs, leveraging agent reasoning and feedback to achieve higher PPA (that is, power, performance, and area) design points for a given domain. In addition, we introduce an LLM self-learning mechanism that employs LLM-driven decision making to select the optimal CGRA to accelerate the design process. We evaluate the framework against state-of-the-art LLM-based methods and manual CGRA design, in terms of performance, power consumption, and area. Experimental results show that MALTA efficiently generates high-quality CGRA architectures, significantly reducing manual design effort and demonstrating the potential of our framework for real-world CGRA design.

Paper and project links

PDF Due to certain confidentiality requirements, this article needs to be withdrawn

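Summary

Coarse-grained reconfigurable arrays (CGRAs) are a promising computing architecture that delivers high-performance, energy-efficient acceleration across diverse domains. By supporting reconfiguration at the functional-unit level, CGRAs adapt to varying computational patterns and optimize resource utilization. Designing them is hard, however, because of the vast design space, independent architectural parameters, and the time that manual design takes. The rapid progress of large language models (LLMs) opens a way to automate the process. The authors propose MALTA, an open-source multi-agent LLM-based framework for hardware/software (HW/SW) co-design of CGRAs. It uses LLM reasoning to generate CGRAs in four stages: HW/SW co-design, design error correction, best design selection, and evaluation and feedback. MALTA iteratively optimizes the generated CGRAs, using agent reasoning and feedback to reach better power, performance, and area (PPA) design points for a given domain. An LLM self-learning mechanism uses LLM-driven decision making to select the optimal CGRA and speed up the design process. The framework is compared with state-of-the-art LLM-based methods and manual CGRA design in terms of performance, power consumption, and area. The experiments show that MALTA efficiently generates high-quality CGRA architectures and significantly reduces manual design effort, which suggests it has potential for real-world CGRA design.

Key Takeaways

  1. CGRAs are a high-performance, energy-efficient computing architecture that accelerates workloads across diverse domains and supports reconfiguration at the functional-unit level.
  2. Designing CGRAs is challenging because of the vast design space, independent architectural parameters, and time-consuming manual design.
  3. Rapid advances in LLMs create new opportunities to automate CGRA design.
  4. MALTA is a multi-agent LLM framework for HW/SW co-design of CGRAs, covering co-design, error correction, design selection, and evaluation stages.
  5. MALTA uses LLM reasoning and a self-learning mechanism to optimize the generated CGRAs toward better PPA design points.
  6. Experiments show that MALTA generates high-quality CGRA architectures and significantly reduces manual design effort.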

The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization

Authors:Talha Tahir

Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic 'process' over imitating 'content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.

Paper and project links

PDF

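Summary

This study examines how post-training method and explicit reasoning affect the ability of a small open-weight large language model (LLM) to deliver Acceptance and Commitment Therapy (ACT). Using synthetic ACT transcripts generated by Mistral-Large, the author trained Llama-3.2-3b-Instruct with supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) step, and compared the four variants with the base Instruct model in simulated therapy sessions. ORPO-trained models significantly outperformed the SFT and Instruct models on ACT fidelity and therapeutic empathy. COT significantly helped the SFT models but offered no discernible advantage to the stronger ORPO or instruct-tuned variants. The author attributes ORPO's advantage to learning the therapeutic process rather than imitating content, a key aspect of ACT. The study concludes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the usefulness of explicit reasoning depends on the underlying training paradigm.

Key Takeaways

  1. The study investigates how training method affects an LLM's ability to deliver ACT, focusing on a small open-weight model.
  2. Synthetic ACT transcripts were used for training, and four post-trained variants (SFT and ORPO, each with and without COT) were evaluated against the base model.
  3. ORPO-trained models significantly outperformed the others on ACT fidelity and therapeutic empathy, an advantage attributed to learning the therapeutic process rather than imitating content.
  4. COT improved the SFT models by an average of 2.68 ACT-FM points but gave no additional benefit to the ORPO-trained models.
  5. Preference-aligned policy optimization can effectively instill ACT competencies in small LLMs.
  6. The usefulness of explicit reasoning depends on the underlying training paradigm.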
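
For readers unfamiliar with ORPO, the sketch below computes the odds-ratio term that the original ORPO formulation adds to the ordinary SFT loss: it penalizes the model unless the preferred response has higher odds than the rejected one. The probabilities are illustrative stand-ins for length-normalized sequence likelihoods, and nothing here reproduces the study's training code or hyperparameters.

```python
import math


def odds(p):
    """Odds of an event with probability p, where 0 < p < 1."""
    return p / (1.0 - p)


def odds_ratio_loss(p_chosen, p_rejected):
    """ORPO odds-ratio term: -log(sigmoid(log(odds_chosen / odds_rejected))).

    p_chosen / p_rejected are assumed to be the model's (length-normalized)
    likelihoods of the preferred and rejected responses.
    """
    log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    sigmoid = 1.0 / (1.0 + math.exp(-log_odds_ratio))
    return -math.log(sigmoid)


# Usage: the penalty shrinks as the preferred response becomes more likely.
print(odds_ratio_loss(0.5, 0.5))  # no preference yet
print(odds_ratio_loss(0.8, 0.2))  # clear preference for the chosen response
```

In full ORPO training this term is scaled by a weight and added to the negative log-likelihood of the preferred response, so imitation and preference learning happen in a single stage.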

Measuring Scalar Constructs in Social Science with LLMs

Authors:Hauke Licht, Rupak Sarkar, Patrick Y. Wu, Pranav Goel, Niklas Stoehr, Elliott Ash, Alexander Miserlis Hoyle

Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex,” but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study finds that pairwise comparisons made by LLMs produce better measurements than simply prompting the LLM to directly output the scores, which suffers from bunching around arbitrary numbers. However, taking the weighted mean over the token probability of scores further improves the measurements over the two previous approaches. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.

Paper and project links

PDF Accepted to EMNLP 2025 (Main)

Summary

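Many constructs that characterize language, such as complexity or emotionality, are continuous, and large language models (LLMs) treat numerical outputs idiosyncratically, which raises the question of how best to use them for measurement. The study evaluates four LLM-based approaches to scalar construct measurement in social science: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Pairwise comparisons made by LLMs yield better measurements than prompting the model to output a score directly, which suffers from bunching around arbitrary numbers. Taking the probability-weighted mean over score tokens improves on both. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.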

Key Takeaways

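  • Many language constructs have a naturally continuous semantic structure, which makes measuring them with LLMs nontrivial.
  • LLMs handle numerical outputs idiosyncratically, so how they are applied matters.
  • Four approaches are evaluated on datasets from the political science literature: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning.
  • Pairwise comparisons by LLMs measure better than directly prompted scores.
  • Weighting scores by their token probabilities further improves the measurements.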
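
Token-probability-weighted scoring can be sketched in a few lines: instead of keeping only the single score the model prints, take the expected value over all candidate score tokens. The example assumes a 1 to 7 scale and a hypothetical dictionary of log-probabilities for the score tokens. Renormalizing over just those tokens is an assumption made here, not a detail confirmed by the abstract.

```python
import math


def weighted_score(token_logprobs):
    """Probability-weighted mean over candidate score tokens.

    token_logprobs: maps each score token (e.g. "1" .. "7") to the
    log-probability the model assigned to it at the answer position.
    """
    probs = {int(tok): math.exp(lp) for tok, lp in token_logprobs.items()}
    total = sum(probs.values())  # renormalize over the score tokens only
    return sum(score * p for score, p in probs.items()) / total


# Usage: the model's top token is "3", but mass on "1" and "2" pulls the
# estimate down to about 2.6 instead of a flat 3.
example = {"1": math.log(0.1), "2": math.log(0.2), "3": math.log(0.7)}
print(weighted_score(example))
```

Because the result is a real number, two texts that would both receive the same printed score can still be ordered, which is how this approach avoids bunching around a few favored integers.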

Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding

Authors:Maciej Skorski, Alina Landowska

How do Large Language Models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluated the best language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from nearly 700 annotators in 100K+ texts spanning social networks, news and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, performing much better than average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.

Paper and project links

PDF Appears in UncertaiNLP@EMNLP 2025

Summary

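How do large language models understand moral dimensions compared with humans? This large-scale Bayesian evaluation of market-leading models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) addresses the question. Rather than relying on deterministic ground truth, it models annotator disagreement to capture both aleatoric and epistemic uncertainty. The evaluation covers 250K+ annotations from nearly 700 annotators on 100K+ texts spanning social networks, news, and forums, and the GPU-optimized Bayesian framework processed 1M+ model queries. AI models typically rank among the top 25% of human annotators, with balanced accuracy well above average. They also produce far fewer false negatives than humans, pointing to more sensitive moral detection.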

Key Takeaways

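  1. AI models typically rank among the top 25% of human annotators at detecting moral dimensions.
  2. The authors describe this as the first large-scale Bayesian evaluation of market-leading language models.
  3. Unlike prior work that used deterministic ground truth (majority or inclusion rules), the study models annotator disagreement, and its GPU-optimized Bayesian framework processed 1M+ model queries.
  4. The framework captures two kinds of uncertainty: aleatoric (inherent human disagreement) and epistemic (model domain sensitivity).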
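
The two headline metrics, balanced accuracy and false negatives, can be made concrete with a small sketch. The labels below are invented for illustration, and the code scores against a single fixed reference labeling, which is exactly the deterministic setup the paper moves away from. It does not reproduce the Bayesian model of annotator disagreement.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = moral content present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class.

    Assumes both classes occur in y_true.
    """
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def false_negative_rate(y_true, y_pred):
    """Share of truly positive items the rater missed."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return fn / (tp + fn)


# Usage with made-up labels: one of two positive items is missed.
reference = [1, 1, 0, 0, 0, 0]
rater = [1, 0, 0, 0, 0, 1]
print(balanced_accuracy(reference, rater), false_negative_rate(reference, rater))
```

Balanced accuracy matters here because moral content is usually the minority class: a rater who labels nothing as moral scores high plain accuracy but only 0.5 balanced accuracy.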

Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Authors:Pranam Shetty, Abhisek Upadhayaya, Parth Mitesh Shah, Srikanth Jagabathula, Shilpi Nayak, Anna Joo Fee

As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.

Paper and project links

PDF Accepted at FinLLM @ IJCAI 2025

Summary

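This paper benchmarks 23 state-of-the-art large language models (LLMs) on the Chartered Financial Analyst (CFA) Level III exam, covering both multiple-choice questions and essay-style responses under several prompting strategies. Leading models show strong advanced financial reasoning, with composite scores of 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) under a revised, stricter essay grading methodology. The findings guide practitioners on model selection and point to remaining challenges in cost-effective deployment and in interpreting performance against professional benchmarks.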

Key Takeaways

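  1. Financial institutions are increasingly adopting LLMs, which makes rigorous domain-specific evaluation critical.
  2. The study benchmarks 23 state-of-the-art LLMs on the CFA Level III exam as a test of advanced financial reasoning.
  3. Multiple-choice and essay-style responses are assessed under several prompting strategies, including Chain-of-Thought and Self-Discover.
  4. Leading models perform strongly on CFA Level III, with composite scores of 79.1% for o4-mini and 77.3% for Gemini 2.5 Flash.
  5. These results were obtained under a revised, stricter essay grading methodology and indicate significant progress in LLM capabilities for high-stakes financial applications.
  6. The findings offer practitioners guidance on model selection and highlight the remaining challenge of cost-effective deployment.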

Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards

Authors:Jaehoon Yun, Jiwoong Sohn, Jungwoo Park, Hyunjae Kim, Xiangru Tang, Yanjun Shao, Yonghoe Koo, Minhyeok Ko, Qingyu Chen, Mark Gerstein, Michael Moor, Jaewoo Kang

Large language models have shown promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. This limitation is critical in medicine, where identifying and addressing reasoning errors is essential for accurate diagnosis and effective patient care. We introduce Med-PRM, a process reward modeling framework that leverages retrieval-augmented generation to verify each reasoning step against established medical knowledge bases. By verifying intermediate reasoning steps with evidence retrieved from clinical guidelines and literature, our model can precisely assess the reasoning quality in a fine-grained manner. Evaluations on five medical QA benchmarks and two open-ended diagnostic tasks demonstrate that Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. Moreover, we demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat, achieving over 80% accuracy on MedQA for the first time using small-scale models of 8 billion parameters. Our code and data are available at: https://med-prm.github.io/

Paper and project links

PDF Accepted to EMNLP 2025 (Oral)

Summary

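Large language models show promise in clinical decision making, but current approaches struggle to localize and correct errors at specific steps of the reasoning process. Med-PRM is a process reward modeling framework that uses retrieval-augmented generation to verify each reasoning step against established medical knowledge bases, with evidence retrieved from clinical guidelines and literature, so reasoning quality can be assessed at a fine grain. On five medical QA benchmarks and two open-ended diagnostic tasks, Med-PRM achieves state-of-the-art performance and improves base models by up to 13.50%. Integrated in a plug-and-play fashion with strong policy models such as Meerkat, it reaches over 80% accuracy on MedQA for the first time with small-scale models of 8 billion parameters.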

Key Takeaways

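  1. LLMs show promise in clinical decision making but struggle to localize and correct errors at specific reasoning steps.
  2. Med-PRM uses retrieval-augmented generation to verify each reasoning step.
  3. Verification against evidence from clinical guidelines and literature lets Med-PRM assess reasoning quality step by step.
  4. Med-PRM achieves state-of-the-art results on five medical QA benchmarks and two open-ended diagnostic tasks, improving base models by up to 13.50%.
  5. Combined with strong policy models such as Meerkat, it exceeds 80% accuracy on MedQA with 8-billion-parameter models.
  6. Med-PRM integrates in a plug-and-play fashion, so it can be paired with different policy models.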
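
One common way to use a process reward model with a separate policy model is best-of-N selection: sample several reasoning chains, score every step, and keep the answer whose chain has the strongest weakest step. The sketch below assumes hypothetical per-step rewards in [0, 1] and minimum-based aggregation. The abstract does not state how Med-PRM aggregates step scores, so both choices are illustrative.

```python
def solution_score(step_rewards):
    """Score a reasoning chain by its weakest step (assumed aggregation rule)."""
    return min(step_rewards)


def best_of_n(candidates):
    """Pick the answer whose reasoning chain scores highest.

    candidates: list of (answer, step_rewards) pairs, where step_rewards are
    hypothetical per-step scores from a process reward model.
    """
    best_answer, _ = max(candidates, key=lambda c: solution_score(c[1]))
    return best_answer


# Usage: chain "A" contains one badly supported step, so "B" is selected
# even though most of A's steps score higher.
sampled = [
    ("A", [0.90, 0.20, 0.95]),
    ("B", [0.80, 0.70, 0.75]),
]
print(best_of_n(sampled))
```

Scoring by the weakest step reflects the motivation given in the abstract: a single faulty inference can invalidate a diagnosis, so a chain is only as trustworthy as its least supported step.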

TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games

Authors:Yuan Yuan, Muyu He, Muhammad Adil Shahid, Jiani Huang, Ziyang Li, Li Zhang

This paper introduces TurnaboutLLM, a novel framework and dataset for evaluating the deductive reasoning abilities of Large Language Models (LLMs) by leveraging the interactive gameplay of the detective games Ace Attorney and Danganronpa. The framework tasks LLMs with identifying contradictions between testimonies and evidence within long narrative contexts, a challenging task due to the large answer space and diverse reasoning types presented by its questions. We evaluate twelve state-of-the-art LLMs on the dataset, hinting at limitations of popular strategies for enhancing deductive reasoning such as extensive thinking and Chain-of-Thought prompting. The results also suggest varying effects of context size, the number of reasoning steps and answer space size on model performance. Overall, TurnaboutLLM presents a substantial challenge for LLMs’ deductive reasoning abilities in complex, narrative-rich environments.

Paper and project links

PDF In EMNLP 2025 main conference

Summary

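This paper introduces TurnaboutLLM, a framework and dataset that evaluates the deductive reasoning of large language models (LLMs) through the interactive gameplay of the detective games Ace Attorney and Danganronpa. LLMs must identify contradictions between testimonies and evidence within long narrative contexts, a task made hard by the large answer space and the diverse reasoning types involved. An evaluation of twelve state-of-the-art LLMs hints at the limits of popular strategies for enhancing deductive reasoning, such as extensive thinking and Chain-of-Thought prompting. The results also suggest that context size, number of reasoning steps, and answer space size affect model performance in different ways. Overall, TurnaboutLLM poses a substantial challenge to LLMs' deductive reasoning in complex, narrative-rich settings.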

Key Takeaways

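  1. TurnaboutLLM is a new framework and dataset for evaluating the deductive reasoning of LLMs.
  2. It builds complex reasoning scenarios from the interactive gameplay of detective games.
  3. LLMs must identify contradictions between testimonies and evidence in long narratives.
  4. The evaluation hints at limitations of popular strategies such as extensive thinking and Chain-of-Thought prompting.
  5. Context size, number of reasoning steps, and answer space size have varying effects on model performance.
  6. TurnaboutLLM poses a substantial challenge to LLMs' deductive reasoning in narrative-rich environments.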

Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.