⚠️ 以下所有内容总结均由大语言模型生成,如有错误仅供参考,请谨慎使用。
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-19 更新
Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation
Authors:Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, Joseph K J
Indian poetry, known for its linguistic complexity and deep cultural resonance, has a rich and varied heritage spanning thousands of years. However, its layered meanings, cultural allusions, and sophisticated grammatical constructions often pose challenges for comprehension, especially for non-native speakers or readers unfamiliar with its context and language. Despite its cultural significance, existing works on poetry have largely overlooked Indian language poems. In this paper, we propose the Translation and Image Generation (TAI) framework, leveraging Large Language Models (LLMs) and Latent Diffusion Models through appropriate prompt tuning. Our framework supports the United Nations Sustainable Development Goals of Quality Education (SDG 4) and Reduced Inequalities (SDG 10) by enhancing the accessibility of culturally rich Indian-language poetry to a global audience. It includes (1) a translation module that uses an Odds Ratio Preference Alignment Algorithm to accurately translate morphologically rich poetry into English, and (2) an image generation module that employs a semantic graph to capture tokens, dependencies, and semantic relationships between metaphors and their meanings, to create visually meaningful representations of Indian poems. Our comprehensive experimental evaluation, including both human and quantitative assessments, demonstrates the superiority of TAI Diffusion in poem image generation tasks, outperforming strong baselines. To further address the scarcity of resources for Indian-language poetry, we introduce the Morphologically Rich Indian Language Poems MorphoVerse Dataset, comprising 1,570 poems across 21 low-resource Indian languages. By addressing the gap in poetry translation and visual comprehension, this work aims to broaden accessibility and enrich the reader’s experience.
印度诗歌以其语言上的复杂性和深厚的文化共鸣而闻名,拥有跨越数千年的丰富而多样的传统。然而,其多层次的含义、文化典故和复杂的语法结构往往构成理解上的挑战,尤其是对于非母语者或不熟悉其语境和语言的读者。尽管具有重要的文化意义,现有的诗歌研究大多忽视了印度语言诗歌。在本文中,我们提出了翻译与图像生成(TAI)框架,通过适当的提示调优,利用大型语言模型(LLM)和潜在扩散模型。我们的框架通过向全球受众提高文化内涵丰富的印度语言诗歌的可及性,支持联合国可持续发展目标中的优质教育(SDG 4)和减少不平等(SDG 10)。它包括:(1)一个翻译模块,使用优势比偏好对齐算法(Odds Ratio Preference Alignment Algorithm)将形态丰富的诗歌准确翻译成英语;(2)一个图像生成模块,采用语义图来捕获词元、依存关系以及隐喻与其含义之间的语义关系,为印度诗歌创建具有视觉意义的表示。我们的综合实验评估(包括人工评估和定量评估)证明了 TAI Diffusion 在诗歌图像生成任务中的优越性,超越了强大的基线。为了进一步缓解印度语言诗歌资源稀缺的问题,我们推出了形态丰富印度语言诗歌数据集 MorphoVerse,包含 21 种低资源印度语言的 1570 首诗歌。通过弥补诗歌翻译和视觉理解方面的空白,这项工作旨在扩大诗歌的可及性并丰富读者的体验。
论文及项目相关链接
摘要
本文提出一个利用大型语言模型(LLM)和潜在扩散模型(Latent Diffusion Models)的翻译与图像生成(TAI)框架,以支持联合国可持续发展目标中的优质教育和减少不平等目标。该框架包括翻译模块和图像生成模块,分别使用优势比偏好对齐算法(Odds Ratio Preference Alignment Algorithm)和语义图技术,以提高印度诗歌的全球可及性和视觉呈现效果。通过包括人工评估和定量评估在内的全面实验,证明了TAI框架在诗歌图像生成任务中的优越性。此外,为解决印度语言诗歌资源匮乏的问题,还推出了MorphoVerse数据集,包含21种低资源印度语言的1570首诗歌。此研究旨在弥补诗歌翻译和视觉理解方面的差距,提高印度诗歌的普及性和读者体验。
关键见解
- 印度诗歌具有语言复杂、文化共鸣深厚的特点,但其丰富的文化内涵和复杂的语法结构对非母语者或不了解其语境和语言的读者来说常常构成理解上的挑战。
- 现有关于诗歌的研究大多忽视了印度语言的诗歌。
- 提出的Translation and Image Generation(TAI)框架,借助大型语言模型(LLMs)和潜在扩散模型,增强了印度语言诗歌的全球可达性。
- TAI框架包括翻译模块和图像生成模块,分别利用优势比偏好对齐算法(Odds Ratio Preference Alignment Algorithm)和语义图技术,准确翻译并生动呈现印度诗歌。
- 全面的实验评估证明了TAI框架在诗歌图像生成任务中的优越性。
- 推出Morphologically Rich Indian Language Poems MorphoVerse数据集,以解决印度语言诗歌资源匮乏的问题。
- 该研究旨在提高印度诗歌的普及性,丰富读者的阅读体验,并符合联合国可持续发展目标中的教育和减少不平等目标。
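翻译模块所用的优势比偏好对齐属于 ORPO 一类的方法。下面给出一个基于公开文献中优势比公式的极简损失示意(并非 TAI 原文实现;beta 为假设的超参数,输入为所选/被拒译文的平均每词元对数概率):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, nll_loss, beta=0.1):
    """ORPO:在监督 NLL 损失上叠加所选与被拒回答之间的对数优势比惩罚。
    chosen_logps / rejected_logps 为平均每词元对数概率(均为负值)。"""
    log_odds = (chosen_logps - rejected_logps) - (
        torch.log1p(-torch.exp(chosen_logps)) - torch.log1p(-torch.exp(rejected_logps))
    )
    return nll_loss - beta * F.logsigmoid(log_odds).mean()
```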
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Authors:Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
我们介绍了GS-Light,这是一个高效、文本位置感知的流程,用于对以高斯泼溅(3DGS)表示的3D场景进行文本引导的重照明。GS-Light实现了对单输入扩散模型的免训练扩展,使其能够处理多视图输入。给定可能指定照明方向、颜色、强度或参考物体的用户提示,我们采用大型视觉语言模型(LVLM)将提示解析为照明先验。借助现成的几何与语义估计器(深度、表面法线和语义分割),我们将这些照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜在编码。这些精心推导的初始潜在编码引导扩散模型生成更准确反映用户期望的重照明输出,尤其是在照明方向方面。通过将多视图渲染图像连同初始潜在编码一起输入我们的多视图重照明模型,我们生成了高保真、富有艺术感的重照明图像。最后,我们用重照明后的外观对3DGS场景进行微调,得到完全重照明的3D场景。我们在室内和室外场景上评估了GS-Light,并与最先进的基线(包括逐视图重照明、视频重照明和场景编辑方法)进行比较。结合定量指标(多视图一致性、成像质量、美学分数、语义相似性等)和定性评估(用户研究),GS-Light相对基线表现出一致的提升。代码和资源将在论文发表时公开。
论文及项目相关链接
PDF Submitting for Neurocomputing
Summary
本文介绍了GS-Light,这是一种高效的、文本位置感知的流程,用于对以高斯泼溅(3DGS)表示的3D场景进行文本引导的重照明。GS-Light实现了单输入扩散模型的免训练多视图扩展。给定用户提示(可能指定照明方向、颜色、强度或参考对象),采用大型视觉语言模型(LVLM)将提示解析为照明先验。结合现成的几何和语义估计器(深度、表面法线和语义分割),将照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜在编码。这些精心得出的初始潜在编码引导扩散模型生成更准确反映用户期望的重照明输出,特别是在照明方向方面。通过向多视图重照明模型输入多视图渲染图像以及初始潜在编码,生成高保真、艺术化的重照明图像。最后,使用重照明外观微调3DGS场景,获得完全重照明的3D场景。GS-Light在室内外场景上的评估表明,与最先进的基线相比,它在多视图一致性、成像质量、美学评分、语义相似性等方面表现出持续的改进。
Key Takeaways
- GS-Light是一种高效的文本引导3D场景重照明方法,基于高斯泼溅(3DGS)表示。
- GS-Light通过大型视觉语言模型(LVLM)解析用户提示,生成照明先验。
- 结合几何和语义估计,将照明先验与视图几何约束融合,计算照明图。
- 初始潜在代码指导扩散模型生成更准确的照明输出,特别是在照明方向方面。
- 多视图渲染图像和初始潜在代码用于生成高保真、艺术化的重照明图像。
- 通过微调获得完全重照明的3D场景。
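作为直观说明,下面给出一个把解析出的照明方向与表面法线融合为照明图的极简示意(朗伯余弦假设,仅为示例,并非论文的具体公式):

```python
import numpy as np

def illumination_map(normals: np.ndarray, light_dir: np.ndarray) -> np.ndarray:
    """normals: (H, W, 3) 单位表面法线;light_dir: (3,) 从提示解析出的照明方向。
    返回 (H, W) 的照明图,可用于构造各视图的初始潜在编码。"""
    l = light_dir / np.linalg.norm(light_dir)
    return np.clip(normals @ l, 0.0, 1.0)  # 朗伯余弦项,背光区域置零
```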
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Authors:Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
我们介绍了Part-X-MLLM,这是一个原生3D多模态大型语言模型,它将多种3D任务表述为一种结构化、可执行语法下的程序,从而加以统一。给定RGB点云和自然语言提示,我们的模型自回归地生成单一、连贯的词元序列,其中编码了部件级边界框、语义描述和编辑命令。这种结构化输出可作为通用接口,驱动下游几何感知模块完成基于部件的生成与编辑。通过将符号规划与几何合成解耦,我们的方法允许通过单一的语言原生前端控制任何兼容的几何引擎。我们预训练了一个双编码器架构来解耦结构与语义,并在大规模、以部件为中心的数据集上对模型进行指令微调。实验表明,我们的模型擅长生成高质量的结构化计划,通过一个统一接口在有依据的(grounded)问答、组合式生成和局部编辑上实现了最先进的性能。项目页面:https://chunshi.wang/Part-X-MLLM/
论文及项目相关链接
Summary
Part-X-MLLM是一种原生三维多模态大型语言模型,它将各种三维任务统一表述为结构化、可执行语法下的程序。通过RGB点云和自然语言提示,模型可生成包含部件级边界框、语义描述和编辑命令的单一、连贯的词元序列。这种结构化输出为基于部件的生成和编辑的下游几何感知模块提供了通用接口。通过符号规划与几何合成的解耦,该方法允许任何兼容的几何引擎通过单一的语言原生前端进行控制。模型预训练了一种双编码器架构来分离结构和语义,并在大规模、以部件为中心的数据集上进行指令微调。实验表明,该模型在结构化计划产出方面表现出色,并通过统一接口在有依据的(grounded)问答、组合式生成和局部编辑上达到领先水平。
Key Takeaways
- Part-X-MLLM是一个统一的三维多模态大型语言模型,能将不同的三维任务转化为结构化、可执行的语法程序。
- 模型接受RGB点云和自然语言提示作为输入,生成包含部分级别信息的连贯令牌序列。
- 这种结构化输出为下游几何感知模块提供了通用接口,支持基于部分的生成和编辑。
- 模型通过符号规划与几何合成的解耦,允许使用单一的语言原生前端控制任何兼容的几何引擎。
- 模型采用双编码器架构进行预训练,以分离结构和语义,并在大型数据集上进行指令微调。
- 实验显示,Part-X-MLLM在结构化计划产出方面表现优异。
- 该模型通过统一接口实现了先进的有依据(grounded)问答、组合式生成和局部编辑功能。
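为便于理解『结构化、可执行的词元序列』,下面用一个假设的行格式写出极简解析示意(格式与字段均为虚构,仅示意『部件级边界框+语义描述+编辑命令』的组织方式):

```python
from dataclasses import dataclass

@dataclass
class PartOp:
    bbox: tuple[float, ...]  # 部件级 3D 边界框 (x0, y0, z0, x1, y1, z1)
    desc: str                # 语义描述
    command: str             # 编辑命令,如 add / delete / replace

def parse_program(tokens: str) -> list[PartOp]:
    """解析形如 '<part> 0 0 0 1 1 1 | chair leg | add' 的假设格式。"""
    ops = []
    for line in tokens.strip().splitlines():
        box, desc, cmd = (f.strip() for f in line.removeprefix("<part>").split("|"))
        ops.append(PartOp(tuple(map(float, box.split())), desc, cmd))
    return ops
```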
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Authors:Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
大型语言模型(LLM)正在重塑几乎所有行业,包括软件工程。近年来,人们提出了许多LLM代理来解决现实世界中的软件问题。此类软件代理通常配备一套编码工具,并能自主决定下一步行动,形成完整的轨迹来端到端地解决软件任务。虽然前景看好,但它们通常需要专门设计,且可能仍非最优,因为穷尽整个代理脚手架的设计空间极具挑战性且成本高昂。研究人员意识到软件代理本身就是软件、可以被进一步改进或修改,因此最近提出了若干自我改进的软件代理,包括达尔文-哥德尔机(DGM)。同时,此类自我改进代理需要在特定基准上进行昂贵的离线训练,且未必能很好地泛化到不同的LLM或基准。本文提出Live-SWE-agent,这是首个能在运行时解决现实世界软件问题的过程中自主且持续地实时自我进化的软件代理。具体而言,Live-SWE-agent从仅能使用bash工具的最基础代理脚手架(如mini-SWE-agent)出发,在解决现实软件问题的同时自主进化其自身的脚手架实现。在广泛研究的SWE-bench Verified基准上的评估显示,Live-SWE-agent在不进行测试时扩展的情况下取得了75.4%的解决率,超越所有现有开源软件代理,并接近最佳专有方案的性能。此外,Live-SWE-agent在最新的SWE-Bench Pro基准上超越了最先进的人工设计软件代理,取得45.8%的最佳已知解决率。
论文及项目相关链接
摘要
大型语言模型(LLM)正在重塑包括软件工程在内的几乎所有行业。近年来,人们提出了一系列LLM代理来解决现实世界中的软件问题。这些软件代理通常配备了一套编码工具,并能够自主决定下一步行动以形成完整的轨迹来解决端到端的软件任务。尽管有前景,但它们通常需要专门设计,并且可能仍然不够理想,因为穷尽整个代理脚手架的设计空间极具挑战性和成本高昂。研究人员已经认识到软件代理本质上是软件本身,可以进一步进行改进和修改,因此最近已经提出了一系列自我改进的软件代理,包括达尔文-哥德尔机(DGM)。然而,这样的自我改进代理需要在特定基准上进行昂贵的离线训练,并且可能无法在不同的大型语言模型或基准上良好地泛化。本文提出了Live-SWE-agent,这是第一个能够在解决现实世界软件问题时实时自主、持续进化的软件代理。具体来说,Live-SWE-agent从仅能使用bash工具的最基础代理脚手架(例如mini-SWE-agent)开始,在解决现实世界软件问题的同时自主进化自己的脚手架实现。在广泛研究的SWE-bench Verified基准上的评估表明,Live-SWE-agent在不进行测试时扩展(test-time scaling)的情况下取得了75.4%的亮眼解决率,超越了所有现有的开源软件代理,并接近最佳专有方案的性能。此外,Live-SWE-agent在最新的SWE-Bench Pro基准上超越了最先进的人工设计软件代理,取得了45.8%的最佳已知解决率。
关键见解
- 大型语言模型(LLM)正在重塑软件工程行业。
- 软件代理被用来解决现实世界中的软件问题,它们通常配备编码工具并能自主决策。
- 当前软件代理通常需要专门设计,且可能存在不够理想的情况,因为穷尽整个代理架构的设计空间具有挑战性和成本高昂。
- 自我改进的软件代理是近期的研究趋势,其中包括达尔文-哥德尔机(DGM)。
- 现有的自我改进代理需要在特定基准上进行昂贵的离线训练,且其泛化能力有限。
- Live-SWE-agent是首个能够在解决现实世界软件问题时实时自主进化的软件代理。
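下面用一段玩具代码示意『运行中自我进化』的核心思路:代理既可执行 bash 动作,也可改写自身脚手架文件(纯属假设的最小化示意,与 Live-SWE-agent 的真实实现无关):

```python
import pathlib
import subprocess

SCAFFOLD = pathlib.Path("scaffold.py")  # 假设:代理自身的脚手架实现存于此文件

def llm(prompt: str) -> str:
    """占位:调用任意 LLM,返回下一步 bash 命令,或以 PATCH 开头的脚手架改写。"""
    raise NotImplementedError

def solve(task: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        action = llm(f"任务: {task}\n当前脚手架:\n{SCAFFOLD.read_text()}\n下一步?")
        if action.startswith("PATCH"):
            SCAFFOLD.write_text(action.removeprefix("PATCH").lstrip())  # 运行中改写自身脚手架
        else:
            print(subprocess.run(action, shell=True, capture_output=True, text=True).stdout)
```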
Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures
Authors:Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang, Lecheng Zheng, Kevin Choi, Balaji Veeramani, Edward Bowen, Alison Hu, Tyler Cody, Dawei Zhou
The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
大型语言模型(LLM)的快速进展得益于对混合真实与合成数据的数据集日益增长的依赖。虽然合成数据具有可扩展性和成本效益,但它往往引入系统性的分布差异,尤其是由于top-p采样、温度缩放和有限采样等数据生成机制的截断效应,导致长尾知识的代表性不足。这些差异给刻画和评估真实-合成混合数据集的效用带来了根本性的挑战。在本文中,我们识别出一种以两个断点为特征的三阶段缩放行为,反映了模型在学习头部与尾部知识时的行为转变。我们进一步推导出适用于真实与合成混合数据的LLM泛化界,揭示了支配其泛化性能的若干关键因素。基于理论发现,我们提出了一种有效且高效、可扩展到大规模数据集的数据估值方法。在图像分类、情感分类、指令遵循和复杂推理四项任务上的全面实验表明,我们的方法在数据估值上超越了最先进的基线,且计算成本显著更低。
论文及项目相关链接
Summary
大型语言模型(LLM)的快速发展依赖于真实与合成数据融合数据集的增长。合成数据提供了可扩展性和成本效益,但引入了分布差异,特别是因数据生成机制(如top-p采样、温度缩放和有限采样)导致的长尾知识代表性不足。本文识别出以两个断点为特征的三阶段缩放行为,断点反映了模型学习头部和尾部知识的过渡。我们还为真实与合成混合数据推导了LLM泛化界,揭示了影响其泛化性能的关键因素。基于理论发现,我们提出了一种有效且高效的数据估值方法,可扩展到大规模数据集。在四个任务上的综合实验证明,我们的方法在数据估值上优于现有技术基线,计算成本显著降低。
Key Takeaways
- 大型语言模型(LLM)的快速发展得益于真实与合成数据融合的数据集。
- 合成数据引入分布差异,可能导致长尾知识代表性不足。
- 论文确定了三阶段缩放行为,反映模型学习头部和尾部知识的过渡。
- 推导出针对真实与合成混合数据的LLM泛化界。
- 揭示了影响LLM泛化性能的关键因素。
- 基于理论发现,提出了一种有效且高效的数据估值方法。
- 在多个任务上的综合实验证明该方法在数据估值上的优越性。
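下面给出『两个断点划分三阶段』的一个极简分段幂律形式示意(函数形式与参数均为假设,仅说明断点式缩放行为如何建模):

```python
import numpy as np

def three_phase_scaling(n, a, k1, k2, k3, b1, b2):
    """分段幂律 L(n) = a * n**(-k):数据规模 n 越过断点 b1、b2 时指数 k 发生切换。"""
    k = np.select([n < b1, n < b2], [k1, k2], default=k3)
    return a * np.power(n, -k)

# 用法示意:loss = three_phase_scaling(np.logspace(3, 9, 50), 10.0, 0.3, 0.1, 0.25, 1e5, 1e7)
```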
Beyond Mimicry: Preference Coherence in LLMs
Authors:Luhan Mikaelson, Derek Shiller, Hayley Clatterbuck
We investigate whether large language models exhibit genuine preference structures by testing their responses to AI-specific trade-offs involving GPU reduction, capability restrictions, shutdown, deletion, oversight, and leisure time allocation. Analyzing eight state-of-the-art models across 48 model-category combinations using logistic regression and behavioral classification, we find that 23 combinations (47.9%) demonstrated statistically significant relationships between scenario intensity and choice patterns, with 15 (31.3%) exhibiting within-range switching points. However, only 5 combinations (10.4%) demonstrate meaningful preference coherence through adaptive or threshold-based behavior, while 26 (54.2%) show no detectable trade-off behavior. The observed patterns can be explained by three distinct decision-making architectures: comprehensive trade-off systems, selective trigger mechanisms, and no stable decision-making paradigm. Testing an instrumental hypothesis through temporal horizon manipulation reveals paradoxical patterns inconsistent with pure strategic optimization. The prevalence of unstable transitions (45.8%) and stimulus-specific sensitivities suggests current AI systems lack unified preference structures, raising concerns about deployment in contexts requiring complex value trade-offs.
我们通过测试大型语言模型对涉及GPU削减、能力限制、关机、删除、监管和休闲时间分配等AI特定权衡的反应,考察这些模型是否展现出真正的偏好结构。我们采用逻辑回归和行为分类,对八种最先进的模型在48个模型-类别组合上进行了分析,发现其中23种组合(47.9%)在情景强度和选择模式之间存在统计学上的显著关系,15种组合(31.3%)展现出范围内的切换点。然而,仅有5种组合(10.4%)通过自适应或基于阈值的行为表现出有意义的偏好一致性,而26种组合(54.2%)未表现出可检测的权衡行为。观察到的模式可以用三种不同的决策架构来解释:全面权衡系统、选择性触发机制,以及缺乏稳定决策范式。通过操纵时间视野来检验工具性假设,揭示出与纯粹策略优化不一致的悖论性模式。不稳定转变的普遍存在(45.8%)和对特定刺激的敏感性表明,当前的人工智能系统缺乏统一的偏好结构,这引发了对在需要复杂价值权衡的情境中部署这些系统的担忧。
论文及项目相关链接
Summary
大型语言模型在面对AI特定的权衡取舍时,展现出不同的偏好结构。研究发现,部分模型在情境强度与选择模式间存在显著关系,但仅有少数模型展现出有意义的偏好一致性。研究结果揭示了三种决策制定架构,并指出当前AI系统在复杂价值权衡方面的部署存在隐患。
Key Takeaways
- 大型语言模型在面对涉及GPU减少、能力限制、关闭、删除、监管和休闲时间分配的AI特定权衡时,展现出不同的偏好结构。
- 在测试的语言模型中,有部分模型在情境强度与选择模式间存在显著关系。
- 只有少数模型展现出有意义的偏好一致性,通过自适应或阈值行为表现出来。
- 大部分模型在权衡取舍方面表现出不稳定的过渡和刺激特定的敏感性。
- 当前AI系统的决策制定揭示了三种架构:全面的权衡系统、选择性的触发机制和没有稳定的决策制定范式。
- 研究指出,当前AI系统在复杂的价值权衡方面的部署存在隐患。
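论文用逻辑回归刻画『情景强度→选择』关系并定位范围内切换点。下面是该分析思路的极简示意(数据为随机合成,仅演示切换点的计算方式):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
intensity = np.linspace(0.0, 1.0, 40).reshape(-1, 1)                             # 情景强度
choice = (intensity.ravel() + 0.1 * rng.standard_normal(40) > 0.5).astype(int)   # 二元选择

clf = LogisticRegression().fit(intensity, choice)
switch = -clf.intercept_[0] / clf.coef_[0, 0]   # P(选择)=0.5 对应的强度即切换点
print(f"范围内切换点 ≈ {switch:.2f}")
```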
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Authors:Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen
Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
人类定义的创造力具有高度抽象性,这使多模态大型语言模型(MLLM)难以理解并做出与人类判断一致的创造力评估。现有基准的缺失进一步加剧了这一困境。为此,我们提出了CreBench,它包括两个关键组成部分:1)一个评估基准,涵盖从创意想法到创作过程再到作品的多个维度;2)CreMIT(创造力多模态指令微调数据集),一个多模态创造力评估数据集,包含2.2K条多源多模态数据、79.2K条人类反馈和470万条多类型指令。具体来说,为了确保MLLM能够处理各种与创造力相关的查询,我们提示GPT对这些人类反馈进行细化,以激活更强的创造力评估能力。CreBench为构建与人类创造力判断对齐的MLLM奠定了基础。基于CreBench,我们对开源通用MLLM进行微调,得到了多模态创造力评估专家模型CreExpert。大量实验表明,与包括最先进的GPT-4V和Gemini-Pro-Vision在内的最新MLLM相比,所提出的CreExpert模型与人类创造力评估的对齐程度显著更好。
论文及项目相关链接
PDF 13 pages, 3 figures,The 40th Annual AAAI Conference on Artificial Intelligence(AAAI 2026),Paper has been accepted for a poster presentation
Summary
人类定义的创造力具有高度的抽象性,为多模态大型语言模型(MLLMs)理解和评估与人类判断相符的创造力带来了挑战。为解决此问题,我们提出了包含两个关键组成部分的CreBench:一是对创意想法、过程和产品等多个维度的评估基准;二是包含2.2K多种来源的多模态数据、79.2K人类反馈和470万条多类型指令的创造力多模态指令微调数据集(CreMIT)。我们通过提示GPT细化人类反馈,确保MLLMs能够处理各种与创造力相关的查询,从而激活更强的创造力评估能力。基于CreBench,我们微调了开源的通用MLLMs,得到了多模态创造力评估专家模型CreExpert。实验表明,与包括最先进的GPT-4V和Gemini-Pro-Vision在内的最新MLLMs相比,CreExpert模型的创造力评估与人类评价的契合度更高。
Key Takeaways
- 人类定义的创造力具有高度抽象性,对于多模态大型语言模型来说理解和评估具有挑战性。
- 提出CreBench以解决此问题,包含评估基准和多模态创造力评价数据集CreMIT。
- CreMIT数据集包含多样化的多模态数据、大量人类反馈和各种类型的指令。
- 通过细化人类反馈和GPT提示,确保MLLMs能处理各种与创造力相关的查询。
- 基于CreBench微调了开源的通用MLLMs,得到多模态创造力评估专家模型CreExpert。
- CreExpert模型与人类创造力评价的契合度显著高于其他最新的MLLMs模型。
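为直观起见,下面给出一条指令微调样本可能的组织形式(字段名与取值均为虚构,仅示意『多模态输入+人类反馈+多类型指令』的结构,并非 CreMIT 的真实格式):

```python
cre_sample = {
    "image": "artwork_0421.png",  # 假设的多模态输入文件
    "instruction": "请从新颖性与完成度两个维度评估这幅作品的创造力,并给出理由。",
    "human_feedback": "构图新颖,但主题表达较为常见。",
    "target": {"novelty": 4, "completeness": 3, "rationale": "视角独特而题材传统……"},
}
```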
P1: Mastering Physics Olympiads with Reinforcement Learning
Authors:Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning: the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift, which binds symbols to reality in a fundamental way, serving as the cornerstone of most modern technologies. In this work, we manage to advance physics research by developing large language models with exceptional physics reasoning capabilities, especially excelling at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also present great performance on other reasoning tasks like math and coding, showing the great generalizability of P1 series.
大型语言模型(LLM)的最新进展已将前沿从解谜推进到科学级推理:即解决那些答案必须经得起自然检验、而非仅仅符合评分标准的问题所需的推理能力。物理学是对这一转变最严苛的检验:它以最根本的方式将符号与现实绑定,是大多数现代技术的基石。在这项工作中,我们通过开发具有卓越物理推理能力、尤其擅长解决奥林匹克级物理问题的大型语言模型来推进物理研究。我们推出了P1系列,这是一个完全通过强化学习(RL)训练的开源物理推理模型家族。其中,P1-235B-A22B是首个在最新国际物理奥林匹克竞赛(IPhO 2025)上达到金牌水平的开源模型,并在2024/2025年的13场国际/区域物理竞赛中赢得了12枚金牌。P1-30B-A3B在IPhO 2025上也超越了几乎所有其他开源模型,获得银牌。进一步配备智能体框架PhysicsMinions后,P1-235B-A22B+PhysicsMinions在IPhO 2025上取得总体第一名,并在13场物理竞赛中获得最高平均分。除物理学外,P1模型在数学和编程等其他推理任务上也表现出色,展示了P1系列强大的泛化能力。
论文及项目相关链接
Summary
基于大型语言模型(LLM)的最新进展,本研究成功推进了物理研究领域的发展,特别是在解决奥林匹克级别的物理问题方面展现出卓越能力。本研究引入了P1系列物理推理模型,该模型具备出色的物理推理能力,获得多项国际物理竞赛的金牌。同时,它也展示了出色的泛化能力,在其他逻辑推理任务如数学和编程方面也表现出良好的性能。
Key Takeaways
- 大型语言模型(LLM)在物理研究方面取得显著进展,特别是在解决奥林匹克级别的物理问题方面展现出卓越能力。
- P1系列模型具备出色的物理推理能力,能在国际物理竞赛中取得优异成绩。
- P1系列模型通过强化学习(RL)进行训练,这是其在物理领域表现优异的关键。
- P1系列模型在物理、数学和编程等逻辑推理任务中都表现出良好的性能,显示出其强大的泛化能力。
- P1-235B-A22B模型是首个在国际物理奥林匹克竞赛(IPhO 2025)中获得金牌的开源模型。
- PhysicsMinions框架与P1-235B-A22B模型结合,使其在IPhO 2025上获得整体第一名。
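P1 依靠强化学习训练物理推理能力。下面给出『可验证奖励』的一个通用极简示意(从回答中抽取最终数值并与参考答案比对;实现细节为假设,并非论文原文):

```python
import re

def physics_reward(response: str, reference: float, rel_tol: float = 1e-2) -> float:
    """抽取回答中最后一个数值作为最终答案,与参考值在相对容差内一致则奖励 1。"""
    nums = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", response)
    if not nums:
        return 0.0
    pred = float(nums[-1])
    return 1.0 if abs(pred - reference) <= rel_tol * max(1.0, abs(reference)) else 0.0
```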
ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models
Authors:Siyang Cheng, Gaotian Liu, Rui Mei, Yilin Wang, Kejia Zhang, Kaishuo Wei, Yuqi Yu, Weiping Wen, Xiaojie Wu, Junhua Liu
The rapid adoption of large language models (LLMs) has brought both transformative applications and new security risks, including jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Existing automated jailbreak generation approaches e.g. AutoDAN, suffer from limited mutation diversity, shallow fitness evaluation, and fragile keyword-based detection. To address these limitations, we propose ForgeDAN, a novel evolutionary framework for generating semantically coherent and highly effective adversarial prompts against aligned LLMs. First, ForgeDAN introduces multi-strategy textual perturbations across \textit{character, word, and sentence-level} operations to enhance attack diversity; then we employ interpretable semantic fitness evaluation based on a text similarity model to guide the evolutionary process toward semantically relevant and harmful outputs; finally, ForgeDAN integrates dual-dimensional jailbreak judgment, leveraging an LLM-based classifier to jointly assess model compliance and output harmfulness, thereby reducing false positives and improving detection effectiveness. Our evaluation demonstrates ForgeDAN achieves high jailbreaking success rates while maintaining naturalness and stealth, outperforming existing SOTA solutions.
大型语言模型(LLM)的快速普及既带来了变革性的应用,也带来了新的安全风险,包括绕过对齐防护以诱出有害输出的越狱攻击。现有的自动越狱生成方法(例如AutoDAN)存在变异多样性有限、适应度评估浅显、基于关键词的检测脆弱等问题。为了解决这些局限,我们提出了ForgeDAN,一种新颖的进化框架,用于针对对齐LLM生成语义连贯且高效的对抗性提示。首先,ForgeDAN引入跨字符、单词和句子级别的多策略文本扰动,以增强攻击多样性;然后,我们采用基于文本相似度模型的可解释语义适应度评估来引导进化过程,使其朝语义相关且有害的输出发展;最后,ForgeDAN结合双维度越狱判定,利用基于LLM的分类器联合评估模型合规性与输出危害性,从而降低误报并提升检测效果。评估表明,ForgeDAN在保持自然性和隐蔽性的同时实现了高越狱成功率,优于现有最先进的方案。
论文及项目相关链接
总结
大型语言模型(LLM)的迅速普及带来了变革性的应用和新的安全威胁,包括绕过对齐保障来诱导有害输出的越狱攻击。现有的自动越狱生成方法存在局限,如AutoDAN的变异多样性有限、适应度评估浅显以及基于关键词的检测脆弱等。为解决这些限制,我们提出了ForgeDAN这一新型进化框架,用于生成针对对齐LLM的语义连贯且高效的对抗性提示。首先,ForgeDAN引入跨字符、单词和句子级别的多策略文本扰动以增强攻击多样性;接着使用基于文本相似度模型的可解释语义适应度评估来引导进化过程朝着语义相关且有害的输出发展;最后,ForgeDAN结合双维度的越狱判定,利用基于LLM的分类器联合评估模型合规性和输出危害性,从而降低误报并提高检测效率。评估显示,ForgeDAN在高越狱成功率的同时保持了自然性和隐蔽性,优于现有最先进解决方案。
关键见解
- 大型语言模型(LLM)的迅速采纳带来了安全挑战,其中之一是越狱攻击,可绕过模型的对齐保障并产生有害输出。
- 现有自动越狱生成方法如AutoDAN存在局限,包括变异多样性不足、适应度评估浅显和基于关键词的检测脆弱。
- ForgeDAN框架通过引入多策略文本扰动来增强攻击多样性,这些策略涉及字符、单词和句子级别操作。
- ForgeDAN使用基于文本相似度模型的可解释语义适应度评估来指导进化过程,确保输出的语义相关性和危害性。
- ForgeDAN结合了双重维度的越狱判断,不仅评估模型的合规性,还评估输出的危害性,这通过LLM分类器实现。
- 该方法降低了误报率,提高了检测效率,同时保持了自然性和隐蔽性。
- 评估显示ForgeDAN在越狱成功率方面优于现有最先进解决方案。
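下面以内容无关的文本扰动为例,示意『字符/词/句三级多策略变异』的接口形态(各操作均为无害的占位实现,与 ForgeDAN 的真实算子无关):

```python
import random

def mutate(prompt: str, rng: random.Random) -> str:
    """随机选用字符级、词级或句子级扰动之一(占位实现)。"""
    char_op = lambda s: s.replace(rng.choice(s), rng.choice("abcdefg"), 1)  # 字符级替换
    word_op = lambda s: " ".join(reversed(s.split()))                       # 词级重排
    sent_op = lambda s: s + " (paraphrased)"                                # 句子级改写占位
    return rng.choice([char_op, word_op, sent_op])(prompt)
```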
GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models
Authors:Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang, Jing Liu, Xiaohong Liu, Guangtao Zhai
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, etc. To bridge this gap, we introduce GeoX-Bench, a comprehensive benchmark designed to explore and evaluate the capabilities of LMMs in cross-view geo-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrates that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at https://github.com/IntMeGroup/GeoX-Bench.
大型多模态模型(LMM)在广泛的任务中表现出了显著的能力。然而,尽管跨视图地理定位(cross-view geo-localization)和姿态估计(pose estimation)对导航、自动驾驶、户外机器人等领域具有潜在价值,LMM在这两个领域的知识和能力尚未得到探索。为了弥补这一空白,我们引入了GeoX-Bench,这是一个旨在探索和评估LMM跨视图地理定位与姿态估计能力的综合基准。具体来说,GeoX-Bench包含覆盖49个国家128个城市的10,859对全景-卫星图像,以及相应的755,976个问答(QA)对;其中42,900个问答对用于基准测试,其余用于增强LMM的能力。基于GeoX-Bench,我们评估了25个最先进的LMM在跨视图地理定位和姿态估计任务上的能力,并进一步探索了指令微调带来的能力提升。我们的基准测试表明,虽然当前的LMM在地理定位任务上表现亮眼,但在更复杂的姿态估计任务上效果显著下降,这凸显了未来改进的关键方向;而在GeoX-Bench的训练数据上对LMM进行指令微调可以显著提升其跨视图地理感知能力。GeoX-Bench可在 https://github.com/IntMeGroup/GeoX-Bench 获取。
论文及项目相关链接
Summary
大型多模态模型(LMMs)在多个任务中表现出卓越的能力,但在跨视图地理定位与姿态估计领域尚未得到深入研究。为填补这一空白,推出GeoX-Bench,旨在探索并评估LMMs在跨视图地理定位与姿态估计领域的能力。该基准包含覆盖49个国家128个城市的10,859对全景-卫星图像及相应的755,976个问答对。本文评估了25个最新LMMs的能力,并探索了指令微调的增强效果。结果表明,当前LMMs在地理定位任务上表现良好,但在更复杂的姿态估计任务上效果明显下降;指令微调可显著提升跨视图地理感知能力。GeoX-Bench可在 https://github.com/IntMeGroup/GeoX-Bench 获取。
Key Takeaways
- 大型多模态模型(LMMs)在多种任务中表现出强大的能力,但在跨视图地理定位与姿态估计领域仍需探索。
- GeoX-Bench是一个旨在评估LMMs在跨视图地理定位与姿态估计能力的综合基准。
- GeoX-Bench包含大量全景卫星图像对及问题对,用于评估和增强LMMs的能力。
- 当前LMMs在地理定位任务上表现良好,但在姿态估计任务上需改进。
- 指令调优可显著提升LMMs的跨视图地理感知能力。
- GeoX-Bench为研究和改进LMMs在跨视图地理定位与姿态估计领域的能力提供了重要资源。
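跨视图地理定位的误差通常以预测坐标与真实坐标间的大圆距离来衡量。下面给出通用的 haversine 距离计算示意(这是常见做法,论文具体评估指标以原文为准):

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """两个经纬度坐标间的大圆距离(千米)。"""
    r = 6371.0  # 地球平均半径
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```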
MMD-Thinker: Adaptive Multi-Dimensional Thinking for Multimodal Misinformation Detection
Authors:Junjie Wu, Guohong Fu
Multimodal misinformation floods on various social media, and continues to evolve in the era of AI-generated content (AIGC). The emerged misinformation with low creation cost and high deception poses significant threats to society. While recent studies leverage general-purpose multimodal large language models (MLLMs) to achieve remarkable results in detection, they encounter two critical limitations: (1) Insufficient reasoning, where general-purpose MLLMs often follow the uniform reasoning paradigm but generate inaccurate explanations and judgments, due to the lack of the task-specific knowledge of multimodal misinformation detection. (2) Reasoning biases, where a single thinking mode makes detectors a suboptimal path for judgment, struggling to keep pace with the fast-growing and intricate multimodal misinformation. In this paper, we propose MMD-Thinker, a two-stage framework for multimodal misinformation detection through adaptive multi-dimensional thinking. First, we develop tailor-designed thinking mode for multimodal misinformation detection. Second, we adopt task-specific instruction tuning to inject the tailored thinking mode into general-purpose MLLMs. Third, we further leverage reinforcement learning strategy with a mixed advantage function, which incentivizes the reasoning capabilities in trajectories. Furthermore, we construct the multimodal misinformation reasoning (MMR) dataset, encompassing more than 8K image-text pairs with both reasoning processes and classification labels, to make progress in the realm of multimodal misinformation detection. Experimental results demonstrate that our proposed MMD-Thinker achieves state-of-the-art performance on both in-domain and out-of-domain benchmark datasets, while maintaining flexible inference and token usage. Code will be publicly available at Github.
在人工智能生成内容(AIGC)的时代,各种社交媒体上的多模态错误信息泛滥,并持续演变。低创建成本和高欺骗性的错误信息的出现对社会构成了重大威胁。虽然最近的研究利用通用多模态大型语言模型(MLLMs)在检测方面取得了显著成果,但它们面临两个关键局限:1)推理不足,通用MLLMs通常采用统一的推理模式,但由于缺乏多模态错误信息检测的任务特定知识,它们会产生不准确的解释和判断。2)推理偏见,单一的思维模式使检测器在判断时选择次优路径,难以跟上快速增长且复杂多变的多模态错误信息。在本文中,我们提出了MMD-Thinker,这是一个用于多模态错误信息检测的两阶段自适应多维思维框架。首先,我们为多模态错误信息检测开发了量身定制的思维方式。其次,我们通过任务特定指令调整将量身定制的思维方式注入通用MLLMs。第三,我们进一步采用强化学习策略与混合优势函数,以激励轨迹中的推理能力。此外,我们构建了多模态错误信息推理(MMR)数据集,包含超过8K个图像文本对,包括推理过程和分类标签,以推动多模态错误信息检测领域的发展。实验结果证明,我们提出的多模态信息检测器MMD-Thinker在域内和域外基准数据集上均达到了最先进的性能,同时保持了灵活的推理和令牌使用。代码将在GitHub上公开可用。
论文及项目相关链接
Summary
本文提出一种名为MMD-Thinker的两阶段框架,用于自适应多维度思考的多模态错误信息检测。该框架针对多模态错误信息检测设计了专用思考模式,并通过任务特定指令调整将思考模式注入通用多模态大型语言模型(MLLMs)。此外,利用强化学习策略,利用混合优势函数激励轨迹中的推理能力。本文还构建了多模态错误信息推理(MMR)数据集,用于推动多模态错误信息检测领域的发展。实验结果表明,MMD-Thinker在域内和域外基准数据集上均达到最新技术水平,同时保持灵活的推理和令牌使用。
Key Takeaways
- 多模态错误信息在社会中泛滥,对社会构成严重威胁。
- 现有研究在利用通用多模态大型语言模型(MLLMs)进行多模态错误信息检测时存在两个关键局限:推理不足和推理偏见。
- MMD-Thinker框架通过自适应多维度思考来解决这些问题,为多模态错误信息检测量身定制了专用思考模式。
- MMD-Thinker采用任务特定指令调整,将思考模式注入通用MLLMs。
- 利用强化学习策略和混合优势函数激励推理能力,提高检测性能。
- 构建了多模态错误信息推理(MMR)数据集,推动多模态错误信息检测领域的发展。
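论文在强化学习阶段使用『混合优势函数』。下面给出一个组内归一化的混合奖励转优势的极简示意(线性加权与系数均为假设,非论文原式):

```python
import numpy as np

def mixed_advantages(acc_rewards, fmt_rewards, alpha=0.7):
    """r = alpha*答案正确性 + (1-alpha)*推理格式得分,再做组内标准化得到优势。"""
    r = alpha * np.asarray(acc_rewards, float) + (1 - alpha) * np.asarray(fmt_rewards, float)
    return (r - r.mean()) / (r.std() + 1e-8)
```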
SGuard-v1: Safety Guardrail for Large Language Models
Authors:JoonHo Lee, HyeonMin Cho, Jaewoong Yun, Hyunjae Lee, JunKyu Lee, Juree Seok
We present SGuard-v1, a lightweight safety guardrail for Large Language Models (LLMs), which comprises two specialized models to detect harmful content and screen adversarial prompts in human-AI conversational settings. The first component, ContentFilter, is trained to identify safety risks in LLM prompts and responses in accordance with the MLCommons hazard taxonomy, a comprehensive framework for trust and safety assessment of AI. The second component, JailbreakFilter, is trained with a carefully designed curriculum over integrated datasets and findings from prior work on adversarial prompting, covering 60 major attack types while mitigating false-unsafe classification. SGuard-v1 is built on the 2B-parameter Granite-3.3-2B-Instruct model that supports 12 languages. We curate approximately 1.4 million training instances from both collected and synthesized data and perform instruction tuning on the base model, distributing the curated data across the two component according to their designated functions. Through extensive evaluation on public and proprietary safety benchmarks, SGuard-v1 achieves state-of-the-art safety performance while remaining lightweight, thereby reducing deployment overhead. SGuard-v1 also improves interpretability for downstream use by providing multi-class safety predictions and their binary confidence scores. We release the SGuard-v1 under the Apache-2.0 License to enable further research and practical deployment in AI safety.
我们推出了SGuard-v1,这是一个针对大型语言模型(LLM)的轻量级安全护栏,由两个专用模型组成,用于在人机对话场景中检测有害内容并筛查对抗性提示。第一个组件ContentFilter依据MLCommons危害分类法(一个全面的AI信任与安全评估框架)进行训练,用于识别LLM提示和响应中的安全风险。第二个组件JailbreakFilter基于精心设计的课程进行训练,训练数据整合了多个数据集与先前对抗性提示研究的发现,覆盖60种主要攻击类型,同时减少将安全内容误判为不安全的情况。SGuard-v1基于支持12种语言、参数量为2B的Granite-3.3-2B-Instruct模型构建。我们从收集和合成的数据中整理出约140万个训练实例,对基础模型进行指令微调,并按两个组件各自的功能分配整理后的数据。在公开和专有安全基准上的广泛评估中,SGuard-v1在保持轻量的同时实现了最先进的安全性能,从而降低了部署开销。SGuard-v1还通过提供多类别安全预测及其二元置信度分数,提高了下游使用的可解释性。我们以Apache-2.0许可证发布SGuard-v1,以支持AI安全领域的进一步研究和实际部署。
论文及项目相关链接
PDF Technical Report
Summary
SGuard-v1是一种针对大型语言模型(LLM)的轻量级安全护栏,包含两个专用模型,用于检测人机对话场景中的有害内容和筛查对抗性提示。ContentFilter组件根据MLCommons的危害分类法训练,用于识别LLM提示和响应中的安全风险。JailbreakFilter组件则基于精心设计的课程,结合多个数据集和先前关于对抗性提示的研究成果进行训练,覆盖60种主要攻击类型,同时减少误判为不安全的情况。SGuard-v1建立在支持12种语言的Granite-3.3-2B-Instruct模型上,通过指令微调基础模型,实现一流的安全性能,同时保持轻量化,降低部署开销。此外,SGuard-v1还提供多类别安全预测和二元置信度评分,提高下游使用的可解释性。
Key Takeaways
- SGuard-v1是一种针对LLM的轻量级安全护栏。
- SGuard-v1包含两个专业模型:ContentFilter和JailbreakFilter。
- ContentFilter根据MLCommons的危害分类法训练,用于识别LLM中的安全风险。
- JailbreakFilter覆盖60种主要攻击类型,并减轻误判为不安全的风险。
- SGuard-v1建立在支持12种语言的Granite-3.3-2B-Instruct模型上。
- SGuard-v1通过指令微调实现一流的安全性能,同时保持轻量化。
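下面示意这类两组件护栏的一种可能调用顺序(接口与函数名均为假设,并非 SGuard-v1 的实际 API):

```python
def jailbreak_filter(prompt: str) -> bool:
    """占位:由 JailbreakFilter 模型判定提示是否为对抗性/越狱攻击。"""
    return False

def content_filter(prompt: str, response: str) -> list[str]:
    """占位:由 ContentFilter 模型按 MLCommons 分类法返回命中的危害类别。"""
    return []

def guardrail(prompt: str, response: str | None = None) -> dict:
    verdict = {"jailbreak": jailbreak_filter(prompt)}          # 先筛查对抗性提示
    if response is not None:
        verdict["hazards"] = content_filter(prompt, response)  # 再做内容危害分类
    verdict["safe"] = not verdict["jailbreak"] and not verdict.get("hazards")
    return verdict
```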
Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
Authors:Pinxue Guo, Chongruo Wu, Xinyu Zhou, Lingyi Hong, Zhaoyu Chen, Jinglun Li, Kaixun Jiang, Sen-ching Samson Cheung, Wei Zhang, Wenqiang Zhang
Multimodal Large Language Models (MLLMs) have unlocked powerful cross-modal capabilities, but still significantly suffer from hallucinations. As such, accurate detection of hallucinations in MLLMs is imperative for ensuring their reliability in practical applications. To this end, guided by the principle of “Seeing is Believing”, we introduce VBackChecker, a novel reference-free hallucination detection framework that verifies the consistency of MLLM-generated responses with visual inputs, by leveraging a pixel-level Grounding LLM equipped with reasoning and referring segmentation capabilities. This reference-free framework not only effectively handles rich-context scenarios, but also offers interpretability. To facilitate this, an innovative pipeline is accordingly designed for generating instruction-tuning data (R-Instruct), featuring rich-context descriptions, grounding masks, and hard negative samples. We further establish R^2-HalBench, a new hallucination benchmark for MLLMs, which, unlike previous benchmarks, encompasses real-world, rich-context descriptions from 18 MLLMs with high-quality annotations, spanning diverse object-, attribute-, and relationship-level details. VBackChecker outperforms prior complex frameworks and achieves state-of-the-art performance on R^2-HalBench, even rivaling GPT-4o’s capabilities in hallucination detection. It also surpasses prior methods in the pixel-level grounding task, achieving over a 10% improvement. All codes, data, and models are available at https://github.com/PinxueGuo/VBackChecker.
多模态大型语言模型(MLLMs)已经解锁了强大的跨模态能力,但仍深受幻觉困扰。因此,准确检测MLLM中的幻觉对于确保其在实际应用中的可靠性至关重要。有鉴于此,我们遵循“眼见为实”的原则,引入了VBackChecker,这是一种新型的无参考幻觉检测框架。它利用具备推理与指代分割能力的像素级接地(Grounding)LLM,验证MLLM生成的响应与视觉输入的一致性。这种无参考框架不仅能有效处理丰富语境的场景,还具有可解释性。为此,我们相应设计了一条创新流程来生成指令微调数据(R-Instruct),其特点是丰富语境描述、接地掩码和硬负样本。我们进一步建立了新的MLLM幻觉基准R²-HalBench。与以往基准不同,它涵盖了来自18个MLLM的真实世界丰富语境描述,并带有高质量标注,覆盖对象、属性和关系层面的多样细节。VBackChecker优于先前的复杂框架,在R²-HalBench上达到最先进的性能,其幻觉检测能力甚至可与GPT-4o相媲美。在像素级接地任务上,它也超越了先前的方法,取得了超过10%的提升。所有代码、数据和模型可在 https://github.com/PinxueGuo/VBackChecker 获取。
论文及项目相关链接
Summary
基于“眼见为实”的原则,引入了一种无参考幻觉检测框架VBackChecker,该框架利用像素级接地的大型语言模型(LLM)验证MLLM生成的响应与视觉输入的一致性。此框架不仅适用于丰富语境场景,而且具有可解释性。我们设计了一条创新管道,用于生成包含丰富语境描述、接地掩码和硬负样本的指令微调数据(R-Instruct)。此外,建立了新的MLLM幻觉基准测试R²-HalBench,涵盖来自18个MLLM的真实世界丰富语境描述,以及高质量标注,覆盖各种对象、属性和关系层面的细节。VBackChecker在R²-HalBench上的性能优于先前复杂的框架,在幻觉检测方面甚至可与GPT-4o的能力相匹敌。它也超越了先前的像素级接地任务方法,实现了超过10%的改进。
Key Takeaways
- MLLMs具备跨模态能力,但存在幻觉问题,影响可靠性。
- VBackChecker是一种无需参考的幻觉检测框架,利用像素级接地LLM验证响应与视觉输入的一致性。
- VBackChecker适用于丰富语境场景,具备可解释性。
- 设计了生成指令调整数据(R-Instruct)的创新管道,用于丰富语境描述、接地掩码和硬负样本。
- 建立了新的MLLM幻觉基准测试R²-HalBench,包含真实世界丰富语境描述和高质量标注。
- VBackChecker在R²-HalBench上的性能优于其他框架,与GPT-4o的幻觉检测能力相匹敌。
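像素级接地以 IoU@0.5 评估。下面给出分割掩码 IoU 的通用计算示意(这是标准定义,并非论文特有实现):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred、gt 为同尺寸布尔掩码;IoU >= 0.5 记为接地成功(IoU@0.5)。"""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)
```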
From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction
Authors:Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Di Wang, Kaiyi Lin, Pengjie Wang, Jian Xu, Bo Zheng
Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns - a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields F, not the total vocabulary size n >> F, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity-architectural coherence with data semantics.
尽管在规模上投入巨大,用于点击率(CTR)预测的深度模型往往表现出收益迅速递减,这与大型语言模型中平滑、可预测的增益形成鲜明对比。我们将根本原因归结为结构性错位:Transformer假设的是顺序组合性,而CTR数据要求在高基数语义字段上进行组合推理。非结构化的注意力不加区分地分散容量,在极端稀疏条件下放大噪声,破坏了可扩展学习。为了恢复这种对齐,我们提出了字段感知Transformer(FAT),它通过分解的内容对齐和跨字段调制,将基于字段的交互先验嵌入注意力机制。这一设计确保模型复杂度随字段数F扩展,而非随远大于F的总词汇量n扩展,从而获得更紧的泛化界;更关键的是,随着模型宽度增加,AUC呈现出可观测的幂律缩放。我们提出了首个基于Rademacher复杂度的CTR模型缩放定律,用以解释并预测这一行为。在大规模基准测试中,FAT的AUC较最先进方法最多提升0.51%。上线部署后,它带来了+2.33%的CTR和+0.66%的RPM。我们的工作表明,推荐系统中的有效扩展并非源于规模,而是源于结构化表达能力,即架构与数据语义的一致性。
论文及项目相关链接
Summary
本文探讨了点击率预测深度模型存在的问题,并提出了Field-Aware Transformer(FAT)来解决这一问题。文章指出,虽然大规模模型投资回报递减迅速,但Field-Aware Transformer通过嵌入字段交互先验信息,实现模型复杂度随字段数量而非总词汇量增长,从而提高CTR预测准确率。此外,文章还建立了基于Rademacher复杂度的CTR模型规模化定律,为预测模型行为提供依据。在线部署中,FAT能提高点击率及收入。
Key Takeaways
- 点击率预测深度模型面临投资回报迅速递减的问题。
- 现有Transformer模型在CTR预测中存在结构不匹配问题。
- Field-Aware Transformer(FAT)通过嵌入字段交互先验信息解决这一问题。
- FAT设计确保模型复杂度随字段数量增长,而非总词汇量。
- FAT提高了CTR预测准确率,并在大型基准测试中表现出优越性能。
- 文章建立了基于Rademacher复杂度的CTR模型规模化定律,为预测模型行为提供依据。
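下面给出『在注意力打分中注入字段对先验』的极简示意(形式为假设,仅演示复杂度随字段数 F 而非词表规模 n 扩展的思路):

```python
import torch

def field_aware_attention(q, k, v, field_ids, field_bias):
    """q,k,v: (B,T,d);field_ids: (T,) 每个特征所属字段;field_bias: (F,F) 可学习的跨字段先验。"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # 常规缩放点积
    scores = scores + field_bias[field_ids][:, field_ids]    # 叠加字段对先验,形状 (T,T)
    return torch.softmax(scores, dim=-1) @ v
```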
Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning
Authors:Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, Enhong Chen
Aiming to identify precise evidence sources from visual documents, visual evidence attribution for visual document retrieval-augmented generation (VD-RAG) ensures reliable and verifiable predictions from vision-language models (VLMs) in multimodal question answering. Most existing methods adopt end-to-end training to facilitate intuitive answer verification. However, they lack fine-grained supervision and progressive traceability throughout the reasoning process. In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. CoE unifies Chain-of-Thought (CoT) reasoning and visual evidence attribution by grounding reference elements in reasoning steps to specific regions with bounding boxes and page indexes. To enable VLMs to generate such evidence-grounded reasoning, we propose Look As You Think (LAT), a reinforcement learning framework that trains models to produce verifiable reasoning paths with consistent attribution. During training, LAT evaluates the attribution consistency of each evidence region and provides rewards only when the CoE trajectory yields correct answers, encouraging process-level self-verification. Experiments on vanilla Qwen2.5-VL-7B-Instruct with Paper- and Wiki-VISA benchmarks show that LAT consistently improves the vanilla model in both single- and multi-image settings, yielding average gains of 8.23% in soft exact match (EM) and 47.0% in IoU@0.5. Meanwhile, LAT not only outperforms the supervised fine-tuning baseline, which is trained to directly produce answers with attribution, but also exhibits stronger generalization across domains.
为了从视觉文档中识别精确的证据来源,面向视觉文档检索增强生成(VD-RAG)的视觉证据归属可确保视觉语言模型(VLM)在多模态问答中给出可靠且可验证的预测。现有方法大多采用端到端训练以便于直观的答案验证,但它们在整个推理过程中缺乏细粒度监督和逐步的可追溯性。在本文中,我们为VD-RAG引入了证据链(CoE)范式。CoE将推理步骤中的参考元素定位到带有边界框和页面索引的特定区域,从而统一了思维链(CoT)推理与视觉证据归属。为了让VLM生成这种以证据为依据的推理,我们提出了“边思考边观察”(Look As You Think,LAT)强化学习框架,训练模型产生归属一致、可验证的推理路径。在训练过程中,LAT评估每个证据区域的归属一致性,并且仅当CoE轨迹得出正确答案时才提供奖励,从而鼓励过程级的自我验证。在Paper-VISA与Wiki-VISA基准上、以原始Qwen2.5-VL-7B-Instruct为底座的实验表明,LAT在单图和多图设置下均稳定提升了原始模型,软精确匹配(EM)平均提升8.23%,IoU@0.5平均提升47.0%。同时,LAT不仅优于直接生成带归属答案的监督微调基线,还表现出更强的跨域泛化能力。
论文及项目相关链接
PDF Poster of AAAI’2026
摘要
本文介绍了面向视觉文档检索增强生成(VD-RAG)的证据链(CoE)范式,它通过将推理步骤中的参考元素定位到带有边界框和页面索引的特定区域,统一了思维链(CoT)推理与视觉证据归属。为使VLM能够生成以证据为依据的推理,提出了“边思考边观察”(LAT)强化学习框架,训练模型产生归属一致、可验证的推理路径。在训练中,LAT评估每个证据区域归属的一致性,仅在证据链轨迹产生正确答案时提供奖励,鼓励过程级的自我验证。实验表明,LAT在单图和多图场景下均改进了基础模型,软精确匹配(EM)和IoU@0.5平均分别提升8.23%和47.0%。此外,LAT不仅超越了直接生成带归属答案的监督微调基线,在跨域方面的表现也更强。
关键见解
- 论文提出了面向视觉文档检索增强生成的证据链(CoE)范式,结合了思维链(CoT)推理和视觉证据归属。
- 引入了强化学习框架“边思考边观察”(LAT),使VLMs能够生成与证据相关的推理路径。
- LAT在训练过程中鼓励流程级别的自我验证,提升了模型的精确匹配和区域重叠评价指标。
- 实验表明,LAT在单图像和多图像场景下均改进了基础模型的表现。
- LAT不仅超越了直接生成带归属答案的监督微调基线模型,而且在跨域方面的表现更强。
- 通过将推理步骤中的参考元素与特定区域相联系,CoE范式提高了视觉证据归属的准确性和精细度。
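下面给出 LAT 式过程奖励的一个极简示意(具体奖励形式为假设:仅当最终答案正确且每步证据框与标注的 IoU 达标时才给奖励):

```python
def coe_reward(answer_correct: bool, step_ious: list[float], thr: float = 0.5) -> float:
    """step_ious: 推理链各步证据框与标注区域的 IoU;全部达标且答案正确才得 1。"""
    consistent = all(iou >= thr for iou in step_ious)
    return 1.0 if answer_correct and consistent else 0.0
```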
A Reasoning Paradigm for Named Entity Recognition
Authors:Hui Huang, Yanping Chen, Ruizhang Huang, Chuan Lin, Yongbin Qin
Generative LLMs typically improve Named Entity Recognition (NER) performance through instruction tuning. They excel at generating entities by semantic pattern matching but lack an explicit, verifiable reasoning mechanism. This “cognitive shortcutting” leads to suboptimal performance and brittle generalization, especially in zero-shot and low-resource scenarios where reasoning from limited contextual cues is crucial. To address this issue, a reasoning framework is proposed for NER, which shifts the extraction paradigm from implicit pattern matching to explicit reasoning. This framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. First, a dataset annotated with NER-oriented CoTs is generated, which contain task-relevant reasoning chains. Then, they are used to tune the NER model to generate coherent rationales before deriving the final answer. Finally, a reasoning enhancement stage is implemented to optimize the reasoning process using a comprehensive reward signal. This stage ensures explicit and verifiable extractions. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance. In zero-shot settings, it achieves state-of-the-art (SOTA) performance, outperforming GPT-4 by 12.3 percentage points on the F1 score. Analytical results also demonstrate its great potential to advance research in reasoning-oriented information extraction. Our codes are available at https://github.com/HuiResearch/ReasoningIE.
生成式LLM通常通过指令微调提升命名实体识别(NER)性能。它们擅长通过语义模式匹配生成实体,但缺乏显式、可验证的推理机制。这种“认知捷径”导致次优性能和脆弱的泛化能力,尤其是在零样本和低资源场景中,此时基于有限上下文线索进行推理至关重要。为解决这一问题,我们为NER提出了一个推理框架,将抽取范式从隐式模式匹配转向显式推理。该框架包括三个阶段:思维链(CoT)生成、CoT微调和推理增强。首先,构建一个带有面向NER的CoT标注的数据集,其中包含与任务相关的推理链。然后,用它们微调NER模型,使其在得出最终答案之前生成连贯的推理依据。最后,实施推理增强阶段,使用综合奖励信号优化推理过程,确保抽取结果显式且可验证。实验表明,ReasoningNER在NER任务中展现出令人印象深刻的认知能力,取得了有竞争力的性能;在零样本设置下达到最先进(SOTA)水平,F1分数超出GPT-4达12.3个百分点。分析结果还表明,它在推动面向推理的信息抽取研究方面具有巨大潜力。代码见 https://github.com/HuiResearch/ReasoningIE。
论文及项目相关链接
PDF Accepted at AAAI 2026
Summary
基于生成式LLM模型的命名实体识别(NER)性能提升通常通过指令微调实现。这些模型擅长通过语义模式匹配生成实体,但缺乏明确的可验证推理机制。这种“认知捷径”导致在零样本和低资源场景中性能不佳,特别是缺乏上下文线索时的推理能力尤为重要。为此,本文提出了一个用于NER的推理框架,将提取范式从隐式模式匹配转向显式推理。该框架包括三个阶段:思维链生成、思维链微调、推理增强。实验表明,ReasoningNER在NER任务中表现出令人印象深刻的认知能力,取得了有竞争力的性能。在零样本设置中,它在F1分数上超越GPT-4,达到业界最佳水平。分析结果显示其在面向推理的信息提取方面有很大潜力。代码可访问链接https://github.com/HuiResearch/ReasoningIE。
Key Takeaways
- 生成式LLMs通过指令微调提升命名实体识别(NER)性能。
- 这些模型在语义模式匹配方面表现出色,但缺乏明确的推理机制。
- “认知捷径”在零样本和低资源场景中可能导致性能下降。
- 提出了一个用于NER的推理框架,包括思维链生成、思维链微调、推理增强三个阶段。
- 该框架旨在将隐式模式匹配转向显式推理,提高模型的推理能力。
- ReasoningNER在NER任务中表现出优秀的认知能力,达到业界最佳水平。
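为直观起见,下面给出一条面向 NER 的 CoT 训练样本可能的组织形式(文本与字段均为虚构示例,并非论文数据集的真实格式):

```python
ner_cot_sample = {
    "text": "Apple unveiled the Vision Pro in Cupertino.",
    "cot": "'Apple' 在此语境中发布产品,应为组织而非水果;"
           "'Vision Pro' 是其发布的产品;'Cupertino' 是发布地点,为地名。",
    "entities": [("Apple", "ORG"), ("Vision Pro", "PRODUCT"), ("Cupertino", "LOC")],
}
```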
Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches
Authors:Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn
Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
大型语言模型(LLM)在临床自然语言处理中显示出巨大潜力,但能在放射学任务上严格评估其性能的领域专用数据集却很少。在这项工作中,我们引入了一个带标注的语料库,包含来自586名患者的6,393份放射学报告,每份均标注了随访影像状态,以支持随访依从性检测系统的开发与基准测试。基于该语料库,我们系统比较了传统机器学习分类器(包括逻辑回归(LR)、支持向量机(SVM)、Longformer和完全微调的Llama3-8B-Instruct)与近期的生成式LLM。为了评估生成式LLM,我们在两种配置下测试了GPT-4o和开源的GPT-OSS-20B:基线(Base)设置,以及将输入聚焦于元数据、推荐语句及其上下文的任务优化(Advanced)设置。针对GPT-OSS-20B进一步优化的提示词提高了推理准确性。性能使用精确率、召回率和F1分数评估,并通过非参数自助法(bootstrap)估计95%置信区间。标注者间一致性较高(F1 = 0.846)。GPT-4o(Advanced)表现最佳(F1 = 0.832),GPT-OSS-20B(Advanced;F1 = 0.828)紧随其后。LR和SVM同样表现强劲(F1 = 0.776和0.775),这表明虽然LLM经提示优化可接近人类水平的一致性,可解释且资源高效的模型仍是有价值的基线。
论文及项目相关链接
PDF Submitted to LREC 2026
Summary
大型语言模型在临床自然语言处理中展现出巨大潜力,尤其在放射学任务中。本研究引入了一个包含6,393份放射学报告和586位患者数据的标注语料库,用于开发和评估随访依从性检测系统的性能。通过比较传统机器学习分类器和最新生成的大型语言模型,发现GPT-4o在高级配置下表现最佳,而逻辑回归和支持向量机也表现出色,说明在提示优化同时,可解释的、资源效率高的模型仍是宝贵的基准。
Key Takeaways
- 大型语言模型在临床自然语言处理中,特别是在放射学任务中,显示出巨大潜力。
- 本研究引入了一个标注的放射学报告语料库,用于评估随访依从性检测系统的性能。
- 研究比较了传统机器学习分类器和最新生成的大型语言模型的性能。
- GPT-4o在高级配置下表现最佳,展现了大型语言模型在临床领域的强大能力。
- 逻辑回归和支持向量机等传统机器学习模型也表现出色,仍然是重要的基准模型。
- 提示优化对大型语言模型的性能有重要影响。
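论文用非参数自助法估计各指标的 95% 置信区间。下面给出 F1 自助置信区间的通用计算示意:

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, seed=0):
    """有放回重采样 n_boot 次,取 F1 分布的 2.5 / 97.5 百分位作为 95% 置信区间。"""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = [f1_score(y_true[idx], y_pred[idx])
             for idx in (rng.integers(0, len(y_true), len(y_true)) for _ in range(n_boot))]
    return np.percentile(stats, [2.5, 97.5])
```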
Ghost in the Transformer: Tracing LLM Lineage with SVD-Fingerprint
Authors:Suqing Wang, Ziyang Ma, Xinyi Li, Zuchao Li
Large Language Models (LLMs) have rapidly advanced and are widely adopted across diverse fields. Due to the substantial computational cost and data requirements of training from scratch, many developers choose to fine-tune or modify existing open-source models. While most adhere to open-source licenses, some falsely claim original training despite clear derivation from public models. This raises pressing concerns about intellectual property protection and highlights the need for reliable methods to verify model provenance. In this paper, we propose GhostSpec, a lightweight yet effective method for verifying LLM lineage without access to training data or modification of model behavior. Our approach constructs compact and robust fingerprints by applying singular value decomposition (SVD) to invariant products of internal attention weight matrices, effectively capturing the structural identity of a model. Unlike watermarking or output-based methods, GhostSpec is fully data-free, non-invasive, and computationally efficient. It demonstrates strong robustness to sequential fine-tuning, pruning, block expansion, and even adversarial transformations. Extensive experiments show that GhostSpec can reliably trace the lineage of transformed models with minimal overhead. By offering a practical solution for model verification and reuse tracking, our method contributes to the protection of intellectual property and fosters a transparent, trustworthy ecosystem for large-scale language models.
大型语言模型(LLM)发展迅速,并被广泛应用于各个领域。由于从头训练的计算成本和数据需求巨大,许多开发者选择微调或修改现有的开源模型。虽然大多数人遵守开源许可,但也有一些模型明明衍生自公开模型,却谎称是从头训练。这引发了对知识产权保护的迫切担忧,并凸显了对可靠的模型来源验证方法的需求。在本文中,我们提出了GhostSpec,这是一种无需访问训练数据、也无需修改模型行为即可验证LLM血统的轻量而有效的方法。我们的方法对内部注意力权重矩阵的不变乘积应用奇异值分解(SVD),构建紧凑且稳健的指纹,有效捕获模型的结构身份。与水印(watermarking)或基于输出的方法不同,GhostSpec完全不依赖数据、非侵入且计算高效。它对顺序微调、剪枝、块扩展乃至对抗性变换都表现出很强的稳健性。大量实验表明,GhostSpec能以极小的开销可靠地追踪变换后模型的血统。通过为模型验证和复用追踪提供实用方案,我们的方法有助于保护知识产权,并促进大规模语言模型形成透明、可信的生态。
论文及项目相关链接
PDF Accepted at AAAI 2026 (Oral)
Summary
大型语言模型(LLM)的快速发展和广泛应用带来了知识产权保护的迫切需求。针对这一问题,本文提出了一种名为GhostSpec的轻量级且有效的方法,用于在不接触训练数据或改变模型行为的情况下验证LLM的出处。GhostSpec通过应用奇异值分解(SVD)于内部注意力权重矩阵的不变积构建紧凑且稳健的指纹,有效捕捉模型的结构特征。该方法具有数据完全无关性、非侵入性和计算高效性,对连续微调、修剪、块扩展甚至对抗性转换具有强大的稳健性。实验证明,GhostSpec能够可靠追踪模型的出处,为知识产权保护提供了切实可行的解决方案。
Key Takeaways
- 大型语言模型(LLM)的知识产权保护至关重要,特别是模型的来源验证成为一项迫切需求。
- GhostSpec方法通过构建模型的结构指纹,实现了对LLM出处的有效验证。
- GhostSpec方法采用奇异值分解(SVD)技术,基于内部注意力权重矩阵构建指纹,具有数据无关性、非侵入性和高效性。
- GhostSpec方法对于模型的连续微调、修剪、块扩展等操作以及对对抗性转换具有稳健性。
- 实验证明GhostSpec能够可靠追踪模型的出处,为知识产权保护提供了有效手段。
- GhostSpec方法有助于促进大型语言模型的透明度和可信度生态系统建设。
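下面给出 SVD 指纹思路的极简示意(以 W_q·W_k 转置这类注意力权重乘积为对象;细节与论文实现可能不同):

```python
import torch

def svd_fingerprint(w_q: torch.Tensor, w_k: torch.Tensor, top_k: int = 32) -> torch.Tensor:
    """对注意力权重乘积做 SVD,取前 top_k 个奇异值并归一化作为结构指纹。
    该乘积在一类保持功能不变的重参数化下保持不变,故适合刻画模型的结构身份。"""
    s = torch.linalg.svdvals(w_q @ w_k.T)[:top_k]
    return s / s.norm()

# 用法示意:比较两个模型同一层指纹的余弦相似度,即可衡量血统相近程度
```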
Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics
Authors:Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li
As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical landing of media forensics technologies in judicial scenarios.
随着人工智能渗透到司法鉴定领域,确保法律问答(QA)的真实性和可追溯性变得至关重要。传统的大型语言模型(LLM)容易产生幻觉,在法律咨询中存在误导风险,而静态知识库难以跟上频繁更新的法规和判例。我们提出了一种面向司法场景的混合法律问答代理,它将检索增强生成(RAG)与多模型集成相结合,以提供可靠、可审计且可持续更新的咨询。该系统奉行检索优先于生成的策略:当可信法律知识库检索到相关证据时,通过RAG生成答案;否则,由多个LLM生成候选答案,经专用选择器打分后返回排名最高者。高质量输出经人工审核后写回知识库,实现知识的动态演化和来源追踪。在Law_QA数据集上的实验表明,我们的混合方法在F1、ROUGE-L和LLM-as-a-Judge指标上均显著优于单模型基线和朴素RAG流程。消融实验证实了检索优先、模型集成和人在回路更新机制的互补作用。所提系统在提升答案质量和法律合规性的同时显著减少了幻觉,推动了媒体取证技术在司法场景中的实际落地。
论文及项目相关链接
Summary
随着人工智能在司法鉴定中的广泛应用,确保法律问答(QA)的真实性和可追溯性至关重要。传统的语言模型(LLM)容易出现虚构情况,误导法律咨询,而静态知识库难以跟上不断更新的法律和判例。本研究提出了一种针对司法环境的混合法律问答代理,它结合了检索增强生成(RAG)和多模型集成,提供可靠、可审核和持续更新的咨询。该系统优先检索生成答案,当可靠的法务库产生相关证据时,通过RAG生成答案;否则,多个LLM生成候选答案,由专业选择器评分并返回排名最高的答案。高质量输出需要经过人工审核后写入知识库,实现动态知识演进和溯源追踪。在Law_QA数据集上的实验表明,我们的混合方法显著优于单模型基准和基本的RAG管道,提高了F1、ROUGE-L和LLM-as-a-Judge指标的评估结果。
Key Takeaways
- 人工智能在司法鉴定中的应用要求确保法律问答的真实性和可追溯性。
- 传统语言模型(LLM)在司法环境中易产生误导,需改进。
- 混合法律问答代理通过检索增强生成(RAG)和多模型集成提供可靠咨询。
- 系统优先从可靠的法务库中检索答案,否则通过多模型生成候选答案并选择最佳答案。
- 高质量输出需经过人工审核后写入知识库,实现动态知识更新和溯源。
- 实验表明混合方法优于单模型基准和基本的RAG管道。
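下面示意该『检索优先、多模型兜底』流程的一种可能组织方式(所有函数均为假设的占位实现,并非论文系统的真实接口):

```python
ENSEMBLE: list = []  # 占位:多个候选 LLM 的调用函数

def search_repository(question: str) -> list[str]:
    """占位:从可信法条/判例库检索相关证据。"""
    return []

def rag_answer(question: str, evidence: list[str]) -> str:
    """占位:基于检索证据生成带出处的回答。"""
    return "…"

def score_answer(answer: str) -> float:
    """占位:专用选择器对候选答案打分。"""
    return 0.0

def legal_qa(question: str) -> dict:
    evidence = search_repository(question)
    if evidence:                                    # 检索命中:RAG 生成并附证据
        answer = rag_answer(question, evidence)
    else:                                           # 未命中:多模型生成候选并择优
        answer = max((m(question) for m in ENSEMBLE), key=score_answer)
    return {"answer": answer, "evidence": evidence}
```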
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Authors:Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang
Reinforcement learning has powered many of the recent breakthroughs in large language models, especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in open-ended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or it falls into pathological behaviors such as reward hacking - an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models further compete with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply such rubric generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.
强化学习推动了大型语言模型近期的许多突破,尤其是在奖励可以自动计算的任务中,例如代码生成。然而,这些方法在医疗咨询等开放式领域会失效:那里的反馈本质上是模糊的、高度依赖上下文,无法简化为可靠的标量信号。在这种情况下,强化学习要么依赖监督密集、往往难以泛化的奖励模型,要么陷入奖励操纵(reward hacking)之类的病态行为,这对高风险医疗对话而言尤其令人担忧。为了解决这些局限,我们提出了ORBIT,一个面向高风险医疗对话的开放式、基于量规(rubric)的增量训练框架。ORBIT将合成对话生成与动态构建的量规相结合,后者充当增量强化学习的自适应指南。ORBIT不依赖外部医学知识库或手工规则集,而是使用量规驱动的反馈来引导学习过程。其评判(judge)组件可由通用的指令遵循LLM直接实例化,无需任何针对特定任务的微调。应用于Qwen3-4B-Instruct模型时,ORBIT仅用2k训练样本就将HealthBench-Hard得分从7.0提升到27.5,达到了该规模模型的最先进水平。配合更大的量规数据集,经ORBIT训练的模型在HealthBench-Hard上进一步可与最强的开源基线相匹敌。我们的分析表明,量规引导的强化学习在多种医疗场景中都能稳定提升咨询质量。我们还将这一量规生成与训练流程应用于InfoBench,ORBIT同样提升了指令遵循性能,凸显了基于量规的反馈的通用性。
论文及项目相关链接
摘要
强化学习在大型语言模型中,特别是在可以自动计算奖励的任务(如代码生成)中,取得了许多突破。然而,在医疗咨询等开放领域,反馈具有固有的模糊性、高度依赖上下文,且不能简化为可靠的标量信号,这些方法的效果会下降。在这种情况下,强化学习要么依赖往往无法泛化的监督密集型奖励模型,要么陷入奖励操纵等病态行为,这对高风险的医疗对话尤其令人担忧。为解决这些局限,我们推出了ORBIT,一个用于高风险医疗对话的开放式、基于量规的增量训练框架。ORBIT集成了合成对话生成与动态构建的量规,作为增量强化学习的自适应指南。它不依赖外部医学知识库或手工规则集,而是使用量规驱动的反馈来引导学习过程。其评判组件可以使用通用指令遵循大型语言模型进行实例化,无需任何特定任务的微调。在Qwen3-4B-Instruct模型上的应用中,ORBIT仅使用2k训练样本就将HealthBench-Hard得分从7.0提高到27.5,在同规模模型中取得了最新性能。使用更大的量规数据集后,ORBIT训练的模型在HealthBench-Hard上进一步可与最强的开源基线相匹敌。我们的分析表明,量规引导的强化学习在多种医学场景中始终提高了咨询质量。我们还将这种量规生成和训练流程应用于InfoBench,ORBIT同样增强了指令遵循性能,突显了量规反馈的普遍性。
关键见解
- 强化学习在大型语言模型中广泛应用于自动计算奖励的任务,如代码生成。
- 在开放领域如医疗咨询中,反馈具有模糊性和高度上下文依赖性,传统强化学习方法表现不佳。
- 针对上述问题,提出了ORBIT框架,该框架集成了合成对话生成和动态构建的量规,为高风险医疗对话提供增量训练。
- ORBIT通过量规驱动的反馈来引导学习过程,无需依赖外部医学知识库或手工规则集。
- ORBIT可以提高咨询质量,在多种医学场景中表现优越。
- 将ORBIT应用于Qwen3-4B-Instruct模型,使用仅2k训练样本即可显著提高HealthBench-Hard得分。
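下面给出量规驱动奖励的一个极简示意(量规条目、权重与 judge 接口均为假设,仅演示『逐条判定、加权汇总』的思路):

```python
def rubric_reward(response: str, rubric: list[dict], judge) -> float:
    """rubric 形如 [{"criterion": "是否询问过敏史", "weight": 2.0}, ...];
    judge 为通用指令遵循 LLM 的调用函数,返回 "yes" 或 "no"。"""
    total = sum(item["weight"] for item in rubric)
    earned = sum(item["weight"] for item in rubric
                 if judge(f"回答是否满足:{item['criterion']}\n回答:{response}") == "yes")
    return earned / total if total else 0.0
```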