⚠️ All of the content summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these in serious academic settings; they are only meant for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-18
Human-AI collaborative autonomous synthesis with pulsed laser deposition for remote epitaxy
Authors:Asraful Haque, Daniel T. Yimam, Jawad Chowdhury, Ralph Bulanadi, Ivan Vlassiouk, John Lasseter, Sujoy Ghosh, Christopher M. Rouleau, Kai Xiao, Yongtao Liu, Eva Zarkadoula, Rama K. Vasudevan, Sumner B. Harris
Autonomous laboratories typically rely on data-driven decision-making, occasionally with human-in-the-loop oversight to inject domain expertise. Fully leveraging AI agents, however, requires tightly coupled, collaborative workflows spanning hypothesis generation, experimental planning, execution, and interpretation. To address this, we develop and deploy a human-AI collaborative (HAIC) workflow that integrates large language models for hypothesis generation and analysis, with collaborative policy updates driving autonomous pulsed laser deposition (PLD) experiments for remote epitaxy of BaTiO$_3$/graphene. HAIC accelerated the hypothesis formation and experimental design and efficiently mapped the growth space to graphene-damage. In situ Raman spectroscopy reveals that chemistry drives degradation while the highest energy plume components seed defects, identifying a low-O$_2$ pressure low-temperature synthesis window that preserves graphene but is incompatible with optimal BaTiO$_3$ growth. Thus, we show a two-step Ar/O$_2$ deposition is required to exfoliate ferroelectric BaTiO$_3$ while maintaining a monolayer graphene interlayer. HAIC stages human insight with AI reasoning between autonomous batches to drive rapid scientific progress, providing an evolution to many existing human-in-the-loop autonomous workflows.
Paper & Project Links
Summary
Autonomous laboratories typically rely on data-driven decision-making, but expert knowledge is still needed, and close human-AI collaboration is the key. This paper introduces a human-AI collaborative (HAIC) workflow that integrates large language models for hypothesis generation and analysis, with collaborative policy updates driving autonomous pulsed laser deposition (PLD) experiments. HAIC accelerates hypothesis formation and experimental design and uncovers how the growth space relates to graphene damage. The highest-energy plume components, however, damage the graphene. A two-step Ar/O$_2$ deposition process is required to exfoliate ferroelectric BaTiO$_3$ without sacrificing the monolayer graphene interlayer. The study demonstrates the potential of human-AI collaboration in advancing scientific progress.
Key Takeaways
- Autonomous laboratories rely on data-driven decision-making but still need injections of human domain knowledge.
- The human-AI collaborative (HAIC) workflow combines large language models with collaborative policy updates to drive autonomous experiments.
- HAIC accelerated hypothesis formation and experimental design and uncovered problems arising during graphene growth.
- The highest-energy plume components damage graphene, so a balance must be struck to exfoliate BaTiO$_3$.
- Exfoliating BaTiO$_3$ while preserving the graphene interlayer requires a two-step deposition process.
Click here to view paper screenshots
W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search
Authors:Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Guojun Ma, Mingyang Wan, Caigui Jiang, Ning Ding
Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging weak model’s real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during strong model’s generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9 on the summarization task.
Paper & Project Links
PDF AAAI 2026 Oral
Summary
Large language models (LLMs) show strong capabilities, yet their outputs often misalign with human preferences due to inadequate weak supervision and a lack of fine-grained control. Training-time alignment methods such as Reinforcement Learning from Human Feedback (RLHF) suffer from high expert-supervision costs and inherent scalability limits. To address this, W2S-AlignTree is proposed: a pioneering plug-and-play inference-time alignment framework that, for the first time, combines Monte Carlo Tree Search (MCTS) with the weak-to-strong generalization paradigm. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. It uses the weak model's real-time, step-level signals as alignment proxies and introduces an entropy-aware exploration mechanism, providing fine-grained guidance during the strong model's generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments show that W2S-AlignTree outperforms strong baselines on sentiment generation, summarization, and instruction following. Notably, on the summarization task it raises Llama3-8B's performance from 1.89 to 2.19, a relative improvement of 15.9%.
Key Takeaways
- LLMs are highly capable, yet their outputs can misalign with human preferences, so alignment mechanisms are needed.
- Training-time alignment methods such as RLHF face high expert-supervision costs and scalability limits.
- W2S-AlignTree is the first plug-and-play inference-time alignment framework combining MCTS with the weak-to-strong generalization paradigm.
- W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree.
- W2S-AlignTree uses weak-model signals as alignment proxies to give fine-grained guidance to the strong model's generation.
- W2S-AlignTree balances exploration and exploitation in high-dimensional generation search trees.
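The abstract does not spell out the exact selection rule, but the core mechanism (UCB-style tree search whose exploration bonus is modulated by predictive entropy, with the weak model's step-level score as the value signal) can be sketched roughly as follows. The names (`Node`, `select_child`) and the additive `beta * entropy` term are assumptions for illustration, not the paper's implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One partial generation, i.e. a node in the generative search tree."""
    text: str
    visits: int = 0
    value_sum: float = 0.0   # accumulated weak-model alignment scores (proxy reward)
    entropy: float = 0.0     # strong model's predictive entropy at this step
    children: list = field(default_factory=list)

def select_child(parent: Node, c_puct: float = 1.4, beta: float = 0.5) -> Node:
    """UCB-style selection with an assumed entropy-aware exploration bonus."""
    def score(ch: Node) -> float:
        q = ch.value_sum / ch.visits if ch.visits else 0.0            # exploitation
        u = c_puct * math.sqrt(math.log(parent.visits + 1) / (ch.visits + 1))
        return q + u + beta * ch.entropy  # explore more where the strong model is uncertain
    return max(parent.children, key=score)
```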
Click here to view paper screenshots
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Authors:Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum
Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
Paper & Project Links
Summary
Blocking communication is a major hurdle when running Mixture-of-Experts (MoE) models efficiently in distributed settings. To address this, FarSkip-Collective modifies the architecture of modern models so that their computation can overlap with communication. The modification introduces skip connections, and it was unclear a priori whether the modified architecture would remain as capable, especially for large state-of-the-art models with every layer modified. The study answers this affirmatively: models from 16B to 109B parameters are fully converted to enable communication overlap while matching the accuracy of their original open-source releases. For example, Llama 4 Scout (109B), converted via self-distillation, stays within 1% of the average accuracy of its instruction-tuned release across a wide range of downstream evaluations. Beyond demonstrating the retained accuracy of the large modified models, optimized implementations that explicitly overlap communication with computation accelerate both training and inference in existing frameworks.
Key Takeaways
- FarSkip-Collective addresses blocking communication when running MoE models in distributed settings.
- By modifying the architecture of modern models, FarSkip-Collective overlaps computation with communication.
- It was unclear a priori whether skip-connection modifications would preserve the capability of large state-of-the-art models.
- Results show FarSkip-Collective works on large models while retaining accuracy on par with the original releases.
- Llama 4 Scout, converted via self-distillation, stays within 1% of the accuracy of its instruction-tuned release.
- Optimized implementations explicitly overlap communication with computation, accelerating training and inference.
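As a rough illustration of the compute-communication overlap that the skip-connection rewrite enables, the sketch below launches an asynchronous all-reduce on the skip branch while the local layer runs. It assumes `torch.distributed` is already initialized; the block structure is a placeholder, not the paper's actual architecture.

```python
import torch
import torch.distributed as dist

def overlapped_block(x: torch.Tensor, layer: torch.nn.Module) -> torch.Tensor:
    """Overlap the skip branch's collective with the layer's local compute."""
    skip = x.clone()
    work = dist.all_reduce(skip, op=dist.ReduceOp.SUM, async_op=True)  # starts immediately
    y = layer(x)   # local computation proceeds while the all-reduce is in flight
    work.wait()    # block only when the communicated tensor is actually needed
    return y + skip
```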
Click here to view paper screenshots
PAS : Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models
Authors:Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai
Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Paper & Project Links
Summary: Large vision-language models (LVLMs) are powerful but unreliable due to object hallucinations. This work finds that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. Measuring this behavior via the mutual information between the image and the predicted object, conditioned on the prelim tokens, shows that weak image dependence correlates strongly with hallucination. Building on this, the Prelim Attention Score (PAS) is introduced: a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no extra forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Key Takeaways:
- LVLMs suffer from object hallucination: the model may ignore the image and rely on previously generated output tokens to infer new objects.
- Weak mutual information between the image and the predicted object, conditioned on the prelim tokens, correlates strongly with hallucination.
- A new signal, the Prelim Attention Score (PAS), is introduced for detecting object hallucinations.
- PAS is a lightweight, training-free signal computed from attention weights over prelim tokens, computable on the fly during inference.
- PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets.
- PAS enables real-time filtering and intervention, helping improve LVLM reliability.
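The abstract defines PAS only as a training-free signal computed from attention weights over prelim tokens; a minimal version, assuming the attention weights are already extracted and simply averaging the prelim attention mass over layers and heads, could look like this (the tensor layout is an assumption):

```python
import torch

def prelim_attention_score(attn: torch.Tensor, prelim_start: int) -> float:
    """attn: [num_layers, num_heads, seq_len] attention from the current decoding
    position over all previous positions (each row sums to 1). Returns the mean
    attention mass placed on prelim (previously generated) tokens; a high value
    suggests the model leans on its own prior output rather than the image."""
    return attn[:, :, prelim_start:].sum(dim=-1).mean().item()

# toy usage: 4 layers, 8 heads, 100 positions, prelim tokens starting at index 60
attn = torch.softmax(torch.randn(4, 8, 100), dim=-1)
score = prelim_attention_score(attn, prelim_start=60)
```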
Click here to view paper screenshots
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models
Authors:Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use “visual prompts” (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
Paper & Project Links
PDF This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Summary
This paper addresses the ability of multimodal large language models (MLLMs) in vision-language applications, in particular their ability to interpret visual prompts (VPs), such as bounding boxes, that users naturally draw to refer to specific regions or objects in an image. Since no existing benchmark systematically evaluates how well MLLMs interpret VPs, the paper introduces VP-Bench to assess MLLMs' VP perception and utilization. VP-Bench adopts a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, and Stage 2 studies the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem solving. Multiple MLLMs are evaluated with VP-Bench, and the factors affecting VP understanding are analyzed. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
Key Takeaways
- Multimodal large language models (MLLMs) are widely used in vision-language applications.
- Visual prompts (VPs) are the natural way for human users to refer to specific regions or objects in an image.
- No existing benchmark systematically evaluates MLLMs' ability to interpret VPs.
- The VP-Bench benchmark assesses MLLMs' capability in VP perception and utilization.
- VP-Bench uses a two-stage evaluation framework covering VP perception and VP use in real problem-solving scenarios.
- Evaluating multiple MLLMs with VP-Bench reveals performance differences across models.
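Stage 1 evaluates perception of visual prompts rendered onto images; producing one such visualized prompt is straightforward, as in this Pillow sketch (a box-style prompt with color and width attributes is just one of the many shape/attribute combinations the benchmark varies):

```python
from PIL import Image, ImageDraw

def draw_box_prompt(img, box, color="red", width=3):
    """Overlay a rectangular visual prompt on a copy of the image."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out

prompted = draw_box_prompt(Image.new("RGB", (640, 480), "white"), (100, 120, 300, 360))
prompted.save("vp_example.png")
```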
Click here to view paper screenshots
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Authors:Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen
The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
Paper & Project Links
Summary
The applications of multimodal large language models (MLLMs) now extend beyond high-level vision tasks, but their potential for document image quality assessment (DIQA) remains underexplored. To fill this gap, Q-Doc, a three-tiered evaluation framework, systematically probes MLLMs' DIQA capabilities at coarse, middle, and fine granularity, covering quality scoring, distortion-type identification, and distortion-severity assessment. The results show that MLLMs have nascent DIQA abilities but exhibit notable limitations in scoring consistency, distortion identification, and severity judgment. Chain-of-Thought (CoT) prompting significantly improves performance at every level. The study provides a benchmark for MLLMs' DIQA capabilities, reveals clear deficiencies in their quality perception, and points the way for future improvement.
Key Takeaways
- MLLMs are widely applied across multimodal tasks, but their use for document image quality assessment (DIQA) is underexplored.
- Q-Doc is a three-tiered framework that systematically probes MLLMs' DIQA capabilities at different granularities.
- MLLMs can assign preliminary quality scores but are limited by inconsistent scoring.
- MLLMs can identify distortion types in document images, though multi-distortion scenarios still need improvement.
- MLLMs can assess distortion severity, but with errors relative to human-annotated references.
- Chain-of-Thought (CoT) prompting helps improve MLLM performance on DIQA tasks.
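The coarse level correlates model-assigned scores with human quality annotations. The abstract does not name the exact correlation metrics, but the standard choices for quality assessment are rank (SRCC) and linear (PLCC) correlation, sketched here:

```python
from scipy.stats import spearmanr, pearsonr

def coarse_level_agreement(model_scores, human_scores):
    """Rank (SRCC) and linear (PLCC) agreement between model-assigned quality
    scores and human quality annotations for a set of document images."""
    srcc, _ = spearmanr(model_scores, human_scores)
    plcc, _ = pearsonr(model_scores, human_scores)
    return srcc, plcc

srcc, plcc = coarse_level_agreement([3.1, 4.5, 2.2, 3.8], [3.0, 4.8, 2.0, 3.5])
```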
Click here to view paper screenshots
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Authors:Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom’s level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.
Paper & Project Links
PDF 11 pages, 4 figures
Summary
This paper introduces MicroVQA++, an innovative effort on applying multimodal large language models to biomedical imaging. The project contributes a new three-stage method for building a large-scale, high-quality microscopy VQA corpus, easing the scarcity of high-quality training data that constrains scientific reasoning for microscopy. It proposes and implements HiCQA-Graph, the first heterogeneous graph over images, captions, and QAs, enabling cross-modal consistency filtering. With the carefully constructed dataset, the project demonstrates competitive microscopy-reasoning performance for 4B-scale models and reaches state-of-the-art performance among open-source multimodal large language models. Code and dataset will be released after the review process concludes.
Key Takeaways
- MicroVQA++ builds a large-scale, high-quality microscopy VQA corpus in three stages.
- Expert-validated figure-caption pairs, sourced from peer-reviewed articles, serve as the initial supervision.
- The HiCQA-Graph technique builds a heterogeneous graph over images, captions, and QAs, improving data quality.
- It fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples.
- The carefully constructed dataset demonstrates the microscopy-reasoning potential of compact models.
- Models trained on this dataset reach state-of-the-art performance among open-source MLLMs.
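HiCQA-Graph fuses NLI entailment, CLIP alignment, and agent signals to flag inconsistent samples. The sketch below shows only the final thresholding step over precomputed scores; the field names and threshold values are invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    nli_entailment: float   # P(caption entails the QA's answer), from an NLI model
    clip_alignment: float   # cosine similarity of image and caption embeddings
    agent_ok: bool          # an MLLM agent's consistency verdict

def keep(s: Sample, nli_min: float = 0.6, clip_min: float = 0.25) -> bool:
    """Keep an (image, caption, QA) triple only if all three signals agree."""
    return s.nli_entailment >= nli_min and s.clip_alignment >= clip_min and s.agent_ok

pool = [Sample(0.9, 0.31, True), Sample(0.4, 0.28, True)]
filtered = [s for s in pool if keep(s)]   # drops the low-entailment second sample
```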
Click here to view paper screenshots
When Genes Speak: A Semantic-Guided Framework for Spatially Resolved Transcriptomics Data Clustering
Authors:Jiangkai Long, Yanran Zhu, Chang Tang, Kun Sun, Yuanyuan Liu, Xuesong Yan
Spatial transcriptomics enables gene expression profiling with spatial context, offering unprecedented insights into the tissue microenvironment. However, most computational models treat genes as isolated numerical features, ignoring the rich biological semantics encoded in their symbols. This prevents a truly deep understanding of critical biological characteristics. To overcome this limitation, we present SemST, a semantic-guided deep learning framework for spatial transcriptomics data clustering. SemST leverages Large Language Models (LLMs) to enable genes to “speak” through their symbolic meanings, transforming gene sets within each tissue spot into biologically informed embeddings. These embeddings are then fused with the spatial neighborhood relationships captured by Graph Neural Networks (GNNs), achieving a coherent integration of biological function and spatial structure. We further introduce the Fine-grained Semantic Modulation (FSM) module to optimally exploit these biological priors. The FSM module learns spot-specific affine transformations that empower the semantic embeddings to perform an element-wise calibration of the spatial features, thus dynamically injecting high-order biological knowledge into the spatial context. Extensive experiments on public spatial transcriptomics datasets show that SemST achieves state-of-the-art clustering performance. Crucially, the FSM module exhibits plug-and-play versatility, consistently improving the performance when integrated into other baseline methods.
Paper & Project Links
PDF AAAI’2026 poster paper. 12 pages, 8 figures
Summary
Spatial transcriptomics combines gene expression profiling with spatial context, offering unprecedented insight into the tissue microenvironment. Most computational models, however, treat genes as isolated numerical features and ignore the rich biological semantics encoded in their symbols. To overcome this limitation, SemST is proposed: a semantic-guided deep learning framework for clustering spatial transcriptomics data. SemST uses large language models (LLMs) to let genes "speak" through their symbolic meanings, converting the gene set in each tissue spot into biologically informed embeddings. These embeddings are fused with the spatial neighborhood relationships captured by graph neural networks (GNNs), coherently integrating biological function with spatial structure. A Fine-grained Semantic Modulation (FSM) module optimally exploits these biological priors: it learns spot-specific affine transformations that let the semantic embeddings perform element-wise calibration of the spatial features, dynamically injecting high-order biological knowledge into the spatial context. Extensive experiments on public spatial transcriptomics datasets show that SemST achieves state-of-the-art clustering performance.
Key Takeaways
- Spatial transcriptomics profiles gene expression in its spatial context, enabling a deeper understanding of the tissue microenvironment.
- Existing computational models largely ignore gene-level biological semantics; SemST uses LLMs to let genes speak through their symbolic meanings.
- SemST turns gene sets into biologically informed embeddings and fuses them with the spatial relationships captured by GNNs.
- The Fine-grained Semantic Modulation (FSM) module dynamically injects high-order biological knowledge into the spatial context.
- SemST achieves state-of-the-art clustering performance on public spatial transcriptomics datasets.
- The FSM module is plug-and-play and can be integrated into other baseline methods to improve their performance.
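The FSM module learns spot-specific affine transformations through which semantic embeddings calibrate spatial features element-wise. A minimal FiLM-style PyTorch sketch of that idea follows; the layer shapes and the residual `(1 + gamma)` form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Fine-grained Semantic Modulation (sketch): per-spot affine calibration of
    GNN spatial features by LLM-derived semantic embeddings."""
    def __init__(self, sem_dim: int, feat_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(sem_dim, feat_dim)  # element-wise scale
        self.to_beta = nn.Linear(sem_dim, feat_dim)   # element-wise shift

    def forward(self, spatial: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # spatial: [n_spots, feat_dim], semantic: [n_spots, sem_dim]
        return (1.0 + self.to_gamma(semantic)) * spatial + self.to_beta(semantic)

calibrated = FSM(sem_dim=384, feat_dim=64)(torch.randn(100, 64), torch.randn(100, 384))
```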
Click here to view paper screenshots
Instella: Fully Open Language Models with Stellar Performance
Authors:Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.
Paper & Project Links
Summary
Instella, a family of fully open language models trained on openly available data, delivers excellent performance. The models are developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences; despite using relatively few pre-training tokens, they achieve state-of-the-art results among fully open models. Two specialized variants are also released: Instella-Long, which handles context lengths of up to 128K tokens, and Instella-Math, which focuses on reasoning. These contributions give the community a transparent, performant, and versatile choice of language model, advancing the goal of open and reproducible language-modeling research.
Key Takeaways
- Instella is a fully open family of language models trained entirely on openly available data.
- Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences.
- Instella uses substantially fewer pre-training tokens than many contemporary models.
- Instella achieves state-of-the-art results among fully open models and is competitive with open-weight models of comparable size.
- Two specialized variants are released: Instella-Long for long contexts and Instella-Math for mathematical tasks.
- Instella-Long handles context lengths up to 128K tokens, while Instella-Math is enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks.
Click here to view paper screenshots
OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
Authors:Haosong Peng, Hao Li, Yalun Dai, Yushi Lan, Yihang Luo, Tianyu Qi, Zhengshen Zhang, Yufeng Zhan, Junfei Zhang, Wenchao Xu, Ziwei Liu
General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model’s representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
Paper & Project Links
PDF Project Page: https://livioni.github.io/OmniVGGT-official/
Summary
This paper introduces the OmniVGGT framework, which effectively exploits multiple auxiliary geometric modalities, including depth and camera intrinsics/extrinsics, to enhance general 3D foundation models. Through the GeoAdapter module, the framework injects geometric information progressively without disturbing the foundation model's representation space. A stochastic multimodal fusion scheme randomly samples modality subsets per instance during training, so the model accepts any number of modality inputs at test time and learns robust spatial representations. Experiments show that OmniVGGT surpasses prior methods that use auxiliary inputs on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation, and reaches state-of-the-art performance even with RGB-only input. Integrated into a vision-language-action (VLA) model, OmniVGGT not only improves on the point-cloud-based baseline but also achieves consistent gains on robotic tasks by leveraging accessible auxiliary inputs.
Key Takeaways
- OmniVGGT exploits auxiliary geometric modalities (e.g., depth and camera intrinsics/extrinsics) to strengthen general 3D foundation models.
- The GeoAdapter module injects geometric information progressively without disrupting the foundation model's representation space.
- A stochastic multimodal fusion scheme allows any number of modality inputs at test time and promotes robust spatial representations.
- OmniVGGT performs strongly across monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation.
- OmniVGGT improves on the point-cloud-based baseline and yields consistent gains on robotic tasks.
- OmniVGGT shows practical potential for wide use in computer vision and robotics.
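GeoAdapter's key trick is zero-initialized convolutions: at the start of training the geometric branch contributes exactly nothing, so the foundation model's representation space is untouched and geometry is injected progressively. A toy version follows (channel counts and the residual placement are assumptions):

```python
import torch
import torch.nn as nn

class GeoAdapter(nn.Module):
    """Sketch: inject a geometric modality (e.g. a depth map) into backbone
    features through a zero-initialized projection."""
    def __init__(self, geo_channels: int, feat_channels: int):
        super().__init__()
        self.encode = nn.Conv2d(geo_channels, feat_channels, kernel_size=3, padding=1)
        self.zero_proj = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)  # output is exactly zero at init...
        nn.init.zeros_(self.zero_proj.bias)    # ...so training starts from the RGB-only model

    def forward(self, feats: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return feats + self.zero_proj(self.encode(depth))

out = GeoAdapter(1, 256)(torch.randn(2, 256, 32, 32), torch.randn(2, 1, 32, 32))
```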
Click here to view paper screenshots
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
Authors:Changhai Man, Joongun Park, Hanjiang Wu, Huan Xu, Srinivas Sridharan, Tushar Krishna
Optimizing the performance of large language models (LLMs) on large-scale AI training and inference systems requires a scalable and expressive mechanism to model distributed workload execution. Such modeling is essential for pre-deployment system-level optimizations (e.g., parallelization strategies) and design-space explorations. While recent efforts have proposed collecting execution traces from real systems, access to large-scale infrastructure remains limited to major cloud providers. Moreover, traces obtained from existing platforms cannot be easily adapted to study future larger-scale system configurations. We introduce Symbolic Tensor grAph GEnerator(STAGE), a framework that synthesizes high-fidelity execution traces to accurately model LLM workloads. STAGE supports a comprehensive set of parallelization strategies, allowing users to systematically explore a wide spectrum of LLM architectures and system configurations. STAGE demonstrates its scalability by synthesizing high-fidelity LLM traces spanning over 32K GPUs, while preserving tensor-level accuracy in compute, memory, and communication. STAGE is publicly available to facilitate further research in distributed machine learning systems: https://github.com/astra-sim/symbolic tensor graph
Paper & Project Links
Summary
The Symbolic Tensor grAph GEnerator (STAGE) framework synthesizes high-fidelity execution traces that accurately model large language model (LLM) workloads. It supports a comprehensive set of parallelization strategies, enabling systematic exploration of a wide range of LLM architectures and system configurations, and demonstrates its scalability by synthesizing high-fidelity LLM traces spanning more than 32K GPUs while preserving tensor-level accuracy in compute, memory, and communication.
Key Takeaways
- The STAGE framework synthesizes high-fidelity execution traces to model LLM workloads.
- The framework supports a comprehensive set of parallelization strategies for systematically exploring LLM architectures and system configurations.
- STAGE scales to synthesizing high-fidelity LLM traces spanning more than 32K GPUs.
- STAGE preserves tensor-level accuracy in compute, memory, and communication during modeling.
- The framework matters for pre-deployment system-level optimizations such as parallelization strategies.
- STAGE helps study future, larger-scale system configurations.
Click here to view paper screenshots
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Authors:Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposing new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which imply unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that, even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits 66% attack success rate in SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
Paper & Project Links
Summary
Recent progress in large language models (LLMs) enables them to understand both speech and non-speech audio, but it also exposes new safety risks from complex audio inputs. SACRED-Bench (Speech-Audio Composition for RED-teaming) evaluates the robustness of LLMs under complex audio attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition through three mechanisms: (a) speech overlap and multi-speaker dialogue that embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixtures that imply unsafe intent via non-speech audio alongside benign speech; and (c) diverse spoken instruction formats (open-ended QA, yes/no questions) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM, exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities to cross-modal speech-audio composition attacks. To close this gap, SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reduces the attack success rate to 20%. The results highlight the need for audio-aware defenses for multimodal LLM safety. The benchmark and SALMONN-Guard checkpoints are available at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: the paper includes examples that may be offensive or harmful.
Key Takeaways
- LLMs can understand speech and non-speech audio but face new safety risks from complex audio inputs.
- SACRED-Bench evaluates LLM robustness under complex audio attacks using three new composition mechanisms that simulate harmful inputs.
- Experiments show that even state-of-the-art LLMs remain vulnerable to particular attack forms.
- SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, substantially reduces attack success.
- Audio-aware defenses are needed for multimodal LLMs.
- The SACRED-Bench benchmark and SALMONN-Guard checkpoints are publicly available.
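Mechanism (a) overlays a second utterance beneath benign speech at the waveform level; a minimal NumPy composition is sketched below, where the gain and offset are arbitrary illustration values rather than the benchmark's settings.

```python
import numpy as np

def overlay(benign: np.ndarray, injected: np.ndarray,
            gain: float = 0.3, offset: int = 8000) -> np.ndarray:
    """Mix a second utterance beneath benign speech at reduced gain,
    starting `offset` samples in (mono float waveforms in [-1, 1])."""
    out = benign.copy()
    end = min(len(out), offset + len(injected))
    out[offset:end] += gain * injected[: end - offset]
    return np.clip(out, -1.0, 1.0)

mixed = overlay(np.zeros(160_000, dtype=np.float32),
                np.random.uniform(-1, 1, 48_000).astype(np.float32))
```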
Click here to view paper screenshots
Large Language Model-assisted Autonomous Vehicle Recovery from Immobilization
Authors:Zhipeng Bao, Qianwen Li
Despite significant advancements in recent decades, autonomous vehicles (AVs) continue to face challenges in navigating certain traffic scenarios where human drivers excel. In such situations, AVs often become immobilized, disrupting overall traffic flow. Current recovery solutions, such as remote intervention (which is costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel Large Language Model (LLM) driven recovery framework that enables AVs to resolve immobilization scenarios through self-reasoning and/or passenger-guided decision-making. StuckSolver is designed as a plug-in add-on module that operates on top of the AV’s existing perception-planning-control stack, requiring no modification to its internal architecture. Instead, it interfaces with standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that can be executed by the AV’s native planner. We evaluate StuckSolver on the Bench2Drive benchmark and in custom-designed uncertainty scenarios. Results show that StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and exhibits further improvements when passenger guidance is incorporated.
Paper & Project Links
PDF 7 pages
Summary
Despite notable progress in recent years, autonomous vehicles (AVs) still face challenges in traffic scenarios where human drivers excel. Existing recovery solutions, such as remote intervention (costly and inefficient) and manual takeover (which excludes non-drivers and limits AV accessibility), are inadequate. This paper introduces StuckSolver, a novel recovery framework driven by a large language model (LLM) that resolves immobilization scenarios through self-reasoning and passenger-guided decision-making. Designed as a plug-in module, StuckSolver runs on top of the AV's existing perception-planning-control stack without modifying its internal architecture: it uses standard sensor data streams to detect immobilization states, interpret environmental context, and generate high-level recovery commands that the AV's native planner can execute. Evaluated on the Bench2Drive benchmark and custom-designed uncertainty scenarios, StuckSolver achieves near-state-of-the-art performance through autonomous self-reasoning alone and improves further when passenger guidance is incorporated.
Key Takeaways
- Autonomous vehicles still struggle in certain traffic scenarios and need more capable recovery solutions.
- Current recovery solutions such as remote intervention and manual takeover are inadequate, motivating new approaches.
- StuckSolver is an LLM-driven recovery framework that resolves AV immobilization through self-reasoning and passenger-guided decision-making.
- As a plug-in module, StuckSolver integrates seamlessly with existing AV systems without modifying their internal architecture.
- StuckSolver detects immobilization states and environmental context from standard sensor data streams.
- StuckSolver generates high-level recovery commands that the AV's native planner can execute.
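A plug-in recovery module of this kind reduces to: detect an immobilization state from standard sensor streams, assemble a textual context, and ask an LLM for a high-level command the native planner can execute. The skeleton below is schematic; `query_llm`, the thresholds, and the command vocabulary are hypothetical stand-ins, not the paper's interface.

```python
import time

STUCK_SPEED = 0.1      # m/s; below this the AV is considered stationary (assumed)
STUCK_SECONDS = 10.0   # how long it must stay stationary before intervening (assumed)

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; would return e.g. 'NUDGE_FORWARD' or 'REVERSE_AND_REROUTE'."""
    raise NotImplementedError

def recovery_step(speed_history, scene_description: str):
    """speed_history: list of (timestamp, speed) samples from the sensor stream.
    Returns a high-level recovery command, or None while the AV is still moving."""
    now = time.time()
    recent = [v for t, v in speed_history if now - t < STUCK_SECONDS]
    if not recent or max(recent) > STUCK_SPEED:
        return None   # not immobilized; the native planner stays in charge
    prompt = (f"The vehicle has been stationary for {STUCK_SECONDS:.0f}s.\n"
              f"Scene: {scene_description}\n"
              "Choose one command: NUDGE_FORWARD, REVERSE_AND_REROUTE, WAIT, ASK_PASSENGER.")
    return query_llm(prompt)
```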
Click here to view paper screenshots
ORIC: Benchmarking Object Recognition under Contextual Incongruity in Large Vision-Language Models
Authors:Zhaoyang Li, Zhan Ling, Yuchen Zhou, Litian Gong, Erdem Bıyık, Hao Su
Large Vision-Language Models (LVLMs) excel at captioning, visual question answering, and robotics by combining vision and language, yet they often miss obvious objects or hallucinate nonexistent ones in atypical scenes. We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to MSCOCO, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves results on ORIC-Bench, AMBER, and HallusionBench. Overall, we show that contextual incongruity is a key source of uncertainty and provide tools for more reliable LVLMs. The code is available at https://github.com/ZhaoyangLi-1/ORIC.
Paper & Project Links
Summary
Large vision-language models (LVLMs) excel at captioning, visual question answering, and robotics, but in atypical scenes they often miss obvious objects or hallucinate nonexistent ones. This study examines these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and shows that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, the Object Recognition in Incongruous Context (ORIC) framework constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image, and (2) CLIP-guided sampling to mine plausible but absent ones. Applied to the MSCOCO dataset, ORIC produces ORIC-Bench and ORIC-style training data. Evaluating 18 LVLMs and 2 open-vocabulary detectors reveals substantial performance drops and bias patterns under incongruous contexts. Fine-tuning Qwen3-VL-8B-Instruct with Visual Reinforcement Fine-Tuning on 600 ORIC-style samples improves its results on ORIC-Bench, AMBER, and HallusionBench. Overall, contextual incongruity is a key source of uncertainty, and the work provides tools for more reliable LVLMs. Code is available at https://github.com/ZhaoyangLi-1/ORIC.
Key Takeaways
- Large vision-language models (LVLMs) perform well at captioning, visual question answering, and robotics, but have limits in atypical scenes.
- Recognition becomes harder in incongruous contexts; contextual incongruity is a key factor behind model failures.
- The Object Recognition in Incongruous Context (ORIC) framework constructs datasets of incongruous object-context pairs.
- ORIC creates its training data via two strategies: LLM-guided sampling and CLIP-guided sampling.
- Evaluating multiple LVLMs and open-vocabulary detectors shows performance drops and bias patterns under incongruous contexts.
- Visual Reinforcement Fine-Tuning on ORIC-style samples improves performance on the related benchmarks.
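Strategy (2), CLIP-guided sampling of plausible but absent objects, amounts to ranking candidate object names by image-text similarity and keeping high-scoring names missing from the ground-truth annotations. The sketch assumes embeddings are precomputed and L2-normalized; the threshold is illustrative.

```python
import numpy as np

def mine_absent_objects(img_emb: np.ndarray, cand_embs: np.ndarray,
                        cand_names: list, present: set, thresh: float = 0.25) -> list:
    """Return object names CLIP finds plausible for the image but that are
    absent from its annotations (all embeddings assumed L2-normalized)."""
    sims = cand_embs @ img_emb   # cosine similarity per candidate name
    return [name for name, s in zip(cand_names, sims)
            if s >= thresh and name not in present]

# Names scoring above the threshold yet missing from the label set become
# "plausible but absent" probes for hallucination testing.
```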
Click here to view paper screenshots
CAMA: Enhancing Mathematical Reasoning in Large Language Models with Causal Knowledge
Authors:Lei Zan, Keli Zhang, Ruichu Cai, Lujia Pan
Large Language Models (LLMs) have demonstrated strong performance across a wide range of tasks, yet they still struggle with complex mathematical reasoning, a challenge fundamentally rooted in deep structural dependencies. To address this challenge, we propose CAusal MAthematician (CAMA), a two-stage causal framework that equips LLMs with explicit, reusable mathematical structure. In the learning stage, CAMA first constructs the Mathematical Causal Graph (MCG), a high-level representation of solution strategies, by combining LLM priors with causal discovery algorithms applied to a corpus of question-solution pairs. The resulting MCG encodes essential knowledge points and their causal dependencies. To better align the graph with downstream reasoning tasks, CAMA further refines the MCG through iterative feedback derived from a selected subset of the question-solution pairs. In the reasoning stage, given a new question, CAMA dynamically extracts a task-relevant subgraph from the MCG, conditioned on both the question content and the LLM's intermediate reasoning trace. This subgraph, which encodes the most pertinent knowledge points and their causal dependencies, is then injected back into the LLM to guide its reasoning process. Empirical results on real-world datasets show that CAMA significantly improves LLM performance on challenging mathematical problems. Furthermore, our experiments demonstrate that structured guidance consistently outperforms unstructured alternatives, and that incorporating asymmetric causal relationships yields greater improvements than using symmetric associations alone.
Paper & Project Links
Summary
Large language models (LLMs) perform strongly across many tasks but still struggle with complex mathematical reasoning. To address this challenge, the Causal Mathematician (CAMA) framework is proposed: a two-stage framework that equips LLMs with explicit, reusable mathematical structure. The learning stage builds a Mathematical Causal Graph (MCG) by combining LLM priors with causal discovery algorithms over a corpus of question-solution pairs, yielding a high-level representation of solution strategies. To align better with downstream reasoning tasks, CAMA further refines the MCG with iterative feedback derived from the question-solution pairs. In the reasoning stage, given a new question, CAMA extracts a task-relevant subgraph from the MCG, conditioned on the question and the LLM's intermediate reasoning trace, and injects it into the LLM to guide its reasoning process. Empirical results show that CAMA significantly improves LLM performance on challenging mathematical problems.
Key Takeaways
- LLMs underperform on complex mathematical reasoning, calling for new approaches.
- CAMA addresses this challenge by combining LLM priors with causal discovery algorithms.
- CAMA has two stages: a learning stage and a reasoning stage.
- In the learning stage, CAMA constructs a Mathematical Causal Graph (MCG) representing solution strategies.
- Iterative feedback refines the MCG so it aligns better with downstream reasoning tasks.
- In the reasoning stage, CAMA extracts a task-relevant subgraph from the MCG, conditioned on the question and the LLM's reasoning trace, to guide reasoning.
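In the reasoning stage, the task-relevant subgraph can be thought of as each question-relevant knowledge point plus everything that causally feeds into it. With the MCG held as a directed graph, a networkx sketch looks like this (how relevant nodes are selected from the question and reasoning trace is stubbed out):

```python
import networkx as nx

def extract_task_subgraph(mcg: nx.DiGraph, relevant: set) -> nx.DiGraph:
    """Keep each relevant knowledge point plus its causal ancestors,
    preserving the dependency edges among them."""
    keep = set()
    for node in relevant:
        keep |= nx.ancestors(mcg, node) | {node}
    return mcg.subgraph(keep).copy()

mcg = nx.DiGraph([("ratio", "proportion"), ("proportion", "similar_triangles")])
sub = extract_task_subgraph(mcg, {"similar_triangles"})  # pulls in both prerequisites
```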
Click here to view paper screenshots
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
Authors:Zheng Zhang
Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between comprehension and competence. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access, but in computational execution. We term this phenomenon the computational split-brain syndrome, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.
Paper & Project Links
PDF v2: Two TMLR revision rounds addressing reviewer feedback. Added real-world validation (3.4), interpretability analysis (7), computational hallucination framework, strengthened theory. v3: Sec 3.2 - added transformer architecture diagram, clarified UAT capacity vs computational limits, improved role specialization theorem presentation
Summary
This paper shows that large language models (LLMs), despite striking surface fluency, systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. It offers a structural diagnosis of these failures, revealing a persistent gap between comprehension and competence. Through controlled experiments and architectural analysis, the paper demonstrates that LLMs often articulate correct principles yet cannot reliably apply them, a failure rooted in computational execution rather than knowledge access. This phenomenon is termed the computational split-brain syndrome, in which instruction and execution pathways are geometrically and functionally dissociated. The core limitation recurs across domains, from mathematical operations to relational inference, and explains why model behavior remains brittle even under idealized prompting. The paper argues that LLMs act as powerful pattern-completion engines but lack the architectural scaffolding for principled, compositional reasoning.
Key Takeaways
- LLMs are fluent on the surface yet systematically fail at tasks requiring symbolic reasoning, arithmetic, and logic.
- LLMs can articulate principles but cannot reliably apply them in practice.
- The failures are rooted in computational execution, not in knowledge access.
- LLMs exhibit a computational split-brain syndrome: instruction and execution pathways are geometrically and functionally dissociated.
- The limitation recurs across domains, including mathematical operations and relational inference.
- LLMs are powerful pattern-completion engines but lack the architectural scaffolding for principled, compositional reasoning.
- Improving LLMs will require future models with metacognitive control, principle lifting, and structurally grounded execution.
Click here to view paper screenshots
Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Authors:Jônata Tyska Carvalho, Stefano Nolfi
We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as GPT-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.
Paper & Project Links
PDF Article updated with results from gpt-oss:120b and gpt-oss:20b. 27 pages (13 pages are from appendix), 8 figures, 2 tables, code for experiments replication and supplementary material provided at https://github.com/jtyska/llm-robotics-article/
Summary
This paper proposes a method that enables large language models (LLMs) to control embodied agents by generating control policies that map continuous observation vectors to continuous action vectors. The LLM first generates a control strategy from a textual description of the agent, its environment, and the goal; the strategy is then iteratively refined through a learning process in which the LLM is repeatedly prompted to improve the current strategy using performance feedback and the sensory-motor data collected during evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library, and proves effective with relatively compact models such as GPT-oss:120b and Qwen2.5:72b. By integrating symbolic knowledge derived through reasoning with the sensory-motor data gathered as the agent interacts with its environment, it finds optimal or near-optimal solutions in most cases.
Key Takeaways
- An LLM-based control method is proposed that controls embodied agents through generated control policies.
- The LLM generates an initial control strategy from a textual description of the agent, environment, and goal.
- The control strategy is refined through an iterative learning process combining performance feedback and sensory-motor data.
- The method is validated on classic control tasks and the inverted pendulum task.
- Relatively compact models, such as GPT-oss:120b and Qwen2.5:72b, prove effective.
- Symbolic knowledge from reasoning is successfully combined with sensory-motor data to find optimal or near-optimal solutions.
- The approach demonstrates the potential of mapping textual descriptions directly to actions, pointing to a new direction for language-based control research.
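The refinement loop alternates between evaluating the current policy in the environment and prompting the LLM with the resulting feedback. A schematic with Gymnasium is below; `ask_llm_for_policy` is a hypothetical stand-in for the actual prompting-and-parsing step.

```python
import gymnasium as gym

def evaluate(policy, env_id: str = "CartPole-v1", episodes: int = 5) -> float:
    """Average episodic return of a policy mapping observations to actions."""
    env, total = gym.make(env_id), 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
    env.close()
    return total / episodes

def ask_llm_for_policy(feedback: str):
    """Hypothetical: prompt the LLM with the task description plus feedback,
    then parse its reply into a new obs -> action function."""
    raise NotImplementedError

# Refinement skeleton:
#   policy = ask_llm_for_policy("initial task description")
#   for _ in range(10):
#       policy = ask_llm_for_policy(f"mean return was {evaluate(policy):.1f}; improve it")
```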
Click here to view paper screenshots
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning
Authors:Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He
Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model’s own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model’s learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot’s inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot’s logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability. Our code is released at https://github.com/jiaruzouu/TransformerCopilot.
Paper & Project Links
PDF NeurIPS 2025 Spotlight
Summary
Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. The Transformer Copilot framework introduces a Mistake Log to systematically track the model's learning behavior and recurring errors during fine-tuning. The framework comprises a novel Copilot model design, a joint training paradigm in which the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and a fused inference paradigm in which the Copilot rectifies the Pilot's logits for enhanced generation. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks show that Transformer Copilot consistently improves performance by up to 34.5% while adding only marginal computational overhead to Pilot models, with strong scalability and transferability.
Key Takeaways
- LLMs adapt to downstream tasks via fine-tuning, but standard fine-tuning only minimizes generation loss to optimize model parameters.
- Transformer Copilot introduces a Mistake Log to track the model's learning behavior and errors during fine-tuning.
- Transformer Copilot includes a Copilot model designed to improve the Pilot model's inference performance.
- In the joint training paradigm, the Copilot learns from the continuously evolving Mistake Log alongside the Pilot.
- In the fused inference paradigm, the Copilot rectifies the Pilot's logits to improve generation quality.
- Experiments show significant performance gains across a range of tasks, up to 34.5%.
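The fused inference paradigm has the Copilot rectify the Pilot's output distribution at each decoding step; in its simplest form this is a weighted combination of the two logit vectors before the next token is chosen. The additive form and the mixing weight `alpha` below are assumptions for illustration, not the paper's exact rectification rule.

```python
import torch

def fused_decode_step(pilot_logits: torch.Tensor,
                      copilot_logits: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Rectify the Pilot's next-token logits with the Copilot's correction
    signal, then pick a token (greedy here for simplicity)."""
    fused = pilot_logits + alpha * copilot_logits   # assumed additive rectification
    return torch.argmax(fused, dim=-1)

next_token = fused_decode_step(torch.randn(1, 32000), torch.randn(1, 32000))
```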
Click here to view paper screenshots
Unifying Segment Anything in Microscopy with Vision-Language Knowledge
Authors:Manyu Li, Ruian He, Zixian Zhang, Chenxi Ma, Weimin Tan, Bo Yan
Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We owe the deficiency to lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose a novel framework that seamlessly uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to regularize SAM. Our method achieves performance improvements of 11.8% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 9.2% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
Paper & Project Links
PDF 15 pages, 5 figures
Summary
Accurate segmentation of regions of interest in biomedical images holds substantial value for image analysis. Several current foundation models perform well on certain datasets but are typically sub-optimal on unseen-domain data, largely because vision-language knowledge is missing before segmentation. Multimodal large language models (MLLMs) bring excellent understanding and reasoning to multimodal tasks, motivating their use to inject vision-language knowledge (VLK) so that vision models generalize better across domains. This paper proposes uLLSAM, a novel framework that seamlessly uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy. A Vision-Language Semantic Alignment (VLSA) module injects VLK into SAM; once SAM receives global VLK prompts its performance improves markedly, but boundary contour perception remains weak, so Semantic Boundary Regularization (SBR) is further proposed to regularize SAM. The method improves SA by 11.8% across 9 in-domain microscopy datasets, reaching state-of-the-art performance, and by 9.2% across 10 out-of-domain datasets, demonstrating strong generalization.
Key Takeaways
- Accurate region segmentation is valuable in biomedical image analysis.
- Current models underperform on unseen-domain data, largely because vision-language knowledge is missing before segmentation.
- Multimodal large language models (MLLMs) offer strong understanding and reasoning and can inject vision-language knowledge (VLK).
- The uLLSAM framework uses MLLMs to guide SAM in learning microscopy cross-domain data.
- Through the VLSA module, SAM receives and applies VLK, significantly improving performance.
- Because SAM's boundary contour perception is deficient, Semantic Boundary Regularization (SBR) is proposed to address it.
Click here to view paper screenshots
The Empty Chair: Using LLMs to Raise Missing Perspectives in Policy Deliberations
Authors:Suyash Fulay, Dimitra Dimitrakopoulou, Deb Roy
Deliberation is essential to well-functioning democracies, yet physical, economic, and social barriers often exclude certain groups, reducing representativeness and contributing to issues like group polarization. In this work, we explore the use of large language model (LLM) personas to introduce missing perspectives in policy deliberations. We develop and evaluate a tool that transcribes conversations in real-time and simulates input from relevant but absent stakeholders. We deploy this tool in a 19-person student citizens’ assembly on campus sustainability. Participants and facilitators found that the tool was useful to spark new discussions and surfaced valuable perspectives they had not previously considered. However, they also raised skepticism about the ability of LLMs to accurately characterize the perspectives of different groups, especially ones that are already underrepresented. Overall, this case study highlights that while AI personas can usefully surface new perspectives and prompt discussion in deliberative settings, their successful deployment depends on clarifying their limitations and emphasizing that they complement rather than replace genuine participation.
Paper & Project Links
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: PersonaLLM: Workshop on LLM Persona Modeling
Summary
Deliberation is essential to well-functioning democracies, but physical, economic, and social barriers can exclude certain groups from participating. This paper explores introducing missing perspectives into policy deliberations by using large language model (LLM) personas to simulate input from relevant but absent stakeholders. A field test in a student citizens' assembly on campus sustainability found that the tool helped spark new discussions and surfaced perspectives not previously considered, though participants questioned how accurately LLMs can characterize the perspectives of different groups. Overall, AI tools can contribute new perspectives and stimulate discussion in deliberative settings, but their limitations must be made clear and they should be framed as complements to, not replacements for, genuine participation.
Key Takeaways
- Large language model (LLM) personas show real potential for introducing missing perspectives into policy deliberations.
- Simulating input from relevant but absent stakeholders can make discussions more comprehensive and representative.
- A field test in a student citizens' assembly showed the tool sparked new discussions and surfaced valuable perspectives.
- Skepticism remains about LLMs' ability to accurately characterize the perspectives of different groups, especially those already underrepresented.
- Successful deployment of AI tools in deliberative settings depends on clearly stating their limitations.
- AI personas should be treated as aids that complement, rather than replace, genuine participation.