
LLM


⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only meant as a first-pass filter before reading the papers.

Updated 2025-11-16

Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Authors:Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.

Paper and Project Links

PDF Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)

Summary

Multimodal large language models (MLLMs) are commonly refined with outcome-reward reinforcement learning (RL), but in multiple-choice settings this suffers from unfaithful trajectories: a faulty reasoning path that guesses the correct option receives the same reward as genuine reasoning. To fix this, the paper proposes Self-Consistency Sampling (SCS), which introduces small visual perturbations and repeatedly truncates and resamples the initial trajectory, derives a consistency score from the agreement among the resulting trajectories, and down-weights unreliable traces during policy updates. Experiments show that SCS improves accuracy on six multimodal benchmarks by up to 7.7 percentage points with negligible extra computation, and also yields notable gains on Qwen2.5-VL and InternVL, offering a simple, general remedy for outcome-reward RL in MLLMs.

Key Takeaways

  1. Outcome-reward reinforcement learning (RL) is widely used to refine the step-by-step reasoning of multimodal large language models (MLLMs).
  2. In multiple-choice settings, RL faces a key problem: unfaithful trajectories, where faulty reasoning paths still earn the reward.
  3. Self-Consistency Sampling (SCS) is proposed to address this issue.
  4. SCS computes a consistency score via visual perturbations and repeated truncation and resampling of trajectories.
  5. SCS improves accuracy on six multimodal benchmarks by up to 7.7 percentage points.
  6. SCS adds negligible computational overhead.
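The down-weighting idea can be illustrated with a toy sketch: score a trace by how often its perturbed and resampled variants land on the same option, then scale the outcome reward by that score. The function names and the simple agreement-fraction score are illustrative assumptions, not the paper's implementation (which uses a differentiable consistency score):

```python
def consistency_score(answers):
    """Fraction of resampled trajectories that agree with the original answer."""
    original, resampled = answers[0], answers[1:]
    if not resampled:
        return 1.0
    return sum(a == original for a in resampled) / len(resampled)

def weighted_reward(outcome_reward, answers):
    """Down-weight the outcome reward by the consistency of the trajectories."""
    return outcome_reward * consistency_score(answers)

# A faithful trace: perturbed/resampled runs keep landing on the same option.
faithful = ["B", "B", "B", "B"]
# An unfaithful trace: the model guessed "B" once, but resamples disagree.
unfaithful = ["B", "A", "C", "D"]

print(weighted_reward(1.0, faithful))    # 1.0
print(weighted_reward(1.0, unfaithful))  # 0.0
```

With a differentiable consistency estimate in place of this hard agreement fraction, the same scaling could be applied inside RLOO- or GRPO-style policy updates.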

Cool Papers

Click here to view paper screenshots

Instella: Fully Open Language Models with Stellar Performance

Authors:Jiang Liu, Jialian Wu, Xiaodong Yu, Yusheng Su, Prakamya Mishra, Gowtham Ramesh, Sudhanshu Ranjan, Chaitanya Manem, Ximeng Sun, Ze Wang, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

Paper and Project Links

PDF

Summary

Progress toward open, reproducible language modeling research. This work introduces Instella, a family of fully open three-billion-parameter language models trained entirely on openly available data and code, using AMD Instinct MI300X GPUs. Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, it achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. Two specialized variants are also released: Instella-Long, which handles context lengths up to 128K tokens, and the reasoning-focused Instella-Math, enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community.

Key Takeaways

  1. Instella is a family of fully open language models trained entirely on openly available data and code.
  2. Instella achieves high performance through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences.
  3. Instella uses substantially fewer pre-training tokens than many contemporary models.
  4. Instella achieves state-of-the-art results among fully open models and is competitive with open-weight models of comparable size.
  5. Two specialized variants are released: Instella-Long for long contexts and Instella-Math for mathematical reasoning.
  6. Instella-Math is enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks.

Querying Labeled Time Series Data with Scenario Programs

Authors:Edward Kim, Devan Shanker, Varun Bharadwaj, Hongbeen Park, Jinkyu Kim, Hazem Torfah, Daniel J Fremont, Sanjit A Seshia

Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.

Paper and Project Links

PDF

Summary

Simulation-based testing has become a crucial complement for ensuring the safety of cyber-physical systems (CPS). Current research focuses on identifying failure scenarios in simulation, but whether those scenarios reproduce on real systems remains in question. This paper addresses the sim-to-real gap by locating occurrences of simulated failure scenarios in real-world datasets and verifying whether the failures persist. It formally defines how labeled time series sensor data can match an abstract scenario expressed as a scenario program in the Scenic probabilistic programming language, and presents a querying algorithm that, given a scenario program and a labeled dataset, identifies the matching subset of data. Experiments show the algorithm is more accurate and orders of magnitude faster than state-of-the-art commercial vision large language models, and scales with the duration of the queried time series data.

Key Takeaways

  1. Simulation-based testing is a crucial complement for ensuring the safety of cyber-physical systems.
  2. Whether failure scenarios found in simulation reproduce on real systems remains an open question.
  3. The sim-to-real gap can be addressed by validating simulated failure scenarios against real-world data.
  4. A formal definition matches labeled time series sensor data to abstract scenarios written in the Scenic probabilistic programming language.
  5. A querying algorithm locates occurrences of a specified failure scenario in real-world datasets.
  6. Experiments show the querying algorithm is more accurate and orders of magnitude faster than commercial vision large language models.
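The core matching task can be pictured as evaluating a scenario predicate over consecutive labeled frames. The predicate below (a pedestrian labeled within 10 m while ego speed exceeds 5 m/s) is a made-up stand-in for a compiled Scenic scenario program, not the paper's algorithm:

```python
def matches(frame):
    """Hypothetical per-frame predicate standing in for a scenario program."""
    return frame["ped_dist"] < 10.0 and frame["ego_speed"] > 5.0

def query(dataset, min_len=2):
    """Return (start, end) index ranges where the predicate holds for
    at least `min_len` consecutive frames."""
    spans, start = [], None
    for i, frame in enumerate(dataset):
        if matches(frame):
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(dataset) - start >= min_len:
        spans.append((start, len(dataset)))
    return spans

data = [
    {"ped_dist": 20.0, "ego_speed": 8.0},
    {"ped_dist": 7.0,  "ego_speed": 8.0},
    {"ped_dist": 6.0,  "ego_speed": 9.0},
    {"ped_dist": 25.0, "ego_speed": 9.0},
]
print(query(data))  # [(1, 3)]
```

The real system matches richer temporal structure than a single predicate, but the output shape is the same: the subset of the labeled dataset on which the scenario occurs.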

SSR: Socratic Self-Refine for Large Language Model Reasoning

Authors:Haizhou Shi, Ye Liu, Bo Pang, Zeyu Leo Liu, Hao Wang, Silvio Savarese, Caiming Xiong, Yingbo Zhou, Semih Yavuz

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.

Paper and Project Links

PDF Preprint; work in progress

Summary
Large language models (LLMs) show strong reasoning ability, but existing test-time frameworks rely on coarse self-verification and self-correction, limiting performance on complex tasks. This paper proposes Socratic Self-Refine (SSR), a framework for fine-grained evaluation and precise refinement of LLM reasoning. SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs and estimates step-level confidence through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Across five reasoning benchmarks and three LLMs, SSR consistently outperforms state-of-the-art iterative self-refinement baselines, and it additionally offers a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs.

Key Takeaways

  1. LLMs show strong reasoning ability, but existing test-time frameworks are too coarse for complex tasks.
  2. Socratic Self-Refine (SSR) enables fine-grained evaluation of LLM reasoning.
  3. SSR decomposes responses into verifiable (sub-question, sub-answer) pairs for step-level confidence estimation.
  4. SSR pinpoints and refines unreliable steps, producing more accurate and interpretable reasoning chains.
  5. SSR consistently outperforms baselines across multiple benchmarks and LLMs.
  6. SSR provides a principled black-box approach for evaluating and understanding LLMs' internal reasoning.
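The step-level confidence estimate can be sketched as counting agreement when each sub-question is re-solved several times. Here the lists of sub-answers stand in for actual LLM re-solving calls; all names and the 0.6 threshold are illustrative, not the paper's implementation:

```python
from collections import Counter

def step_confidence(answers):
    """Confidence of one reasoning step = share of re-solves that agree
    on the most common sub-answer."""
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common, count / len(answers)

def find_unreliable_steps(chain_resolves, threshold=0.6):
    """chain_resolves: for each (sub-question, sub-answer) step, a list of
    sub-answers obtained by controlled re-solving."""
    flagged = []
    for i, answers in enumerate(chain_resolves):
        _, conf = step_confidence(answers)
        if conf < threshold:
            flagged.append(i)  # step i should be refined
    return flagged

# Step 0 is stable across re-solves; step 1 is not and gets flagged.
resolves = [["12", "12", "12"], ["7", "9", "4"]]
print(find_unreliable_steps(resolves))  # [1]
```

In SSR's loop, the flagged steps would be rewritten and the chain re-checked iteratively.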

Impact of Layer Norm on Memorization and Generalization in Transformers

Authors:Rishi Singhal, Jung-Eun Kim

Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further precisely identify that early layers LayerNorm are the most critical over middle/later layers and their influence varies across Pre and Post LayerNorm models. We have validated it through 13 models across 6 Vision and Language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.

Paper and Project Links

PDF NeurIPS 2025

Summary
Layer Normalization (LayerNorm) is a key component that stabilizes training and optimization in transformers. This paper studies how LayerNorm affects memorization and learning in Pre- and Post-LayerNorm transformers: in Pre-LayerNorm transformers, LayerNorm is a key factor for stable learning, while in Post-LayerNorm transformers it impacts memorization. Removing LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, whereas in Post-LayerNorm models it effectively mitigates memorization by restoring genuine labels. Early-layer LayerNorm is more critical than middle or later layers, and its influence differs between the two architectures. Validated on 13 models across 6 vision and language datasets, these insights shed new light on LayerNorm's role in shaping memorization and learning in transformers.

Key Takeaways

  1. LayerNorm is a fundamental transformer component that stabilizes training and optimization.
  2. In Pre-LayerNorm transformers, LayerNorm plays a key role in stable learning.
  3. In Post-LayerNorm transformers, LayerNorm affects memorization.
  4. Removing LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning.
  5. In Post-LayerNorm models, removing LayerNorm helps mitigate memorization.
  6. Early-layer LayerNorm is the most critical, with effects that differ between Pre- and Post-LayerNorm models.
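The architectural difference under study is simply where normalization sits relative to the residual connection. A minimal numpy sketch, with a stand-in `sublayer` and plain standardization in place of a learned LayerNorm (the learnable gain/bias are exactly the parameters whose removal the paper studies):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain standardization over the feature axis; a real LayerNorm
    additionally applies a learnable gain and bias."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sublayer(x):
    """Stand-in for an attention/MLP sublayer; any function of x works here."""
    return np.tanh(x)

def pre_ln_block(x):
    # Pre-LN: normalize the sublayer input; the residual path stays un-normalized.
    return x + sublayer(layer_norm(x))

def post_ln_block(x):
    # Post-LN: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

x = np.array([[1.0, -2.0, 3.0, 0.5]])
print(pre_ln_block(x).shape, post_ln_block(x).shape)  # (1, 4) (1, 4)
```

Note how the Post-LN output is always re-standardized, while the Pre-LN residual stream can grow freely; this is the gradient-flow difference that made Pre-LN the preferred choice.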

URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Authors:Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin

Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.

Paper and Project Links

PDF Accepted by AAAI 2026 (Oral)

Summary

Multimodal large language models (MLLMs) struggle with long document understanding; this paper proposes URaG, a framework that unifies retrieval and generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that turns the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages so that the deeper layers can concentrate computation on pertinent information. This improves both accuracy and efficiency, achieving state-of-the-art performance while reducing computational overhead by 44-56%.

Key Takeaways

  1. MLLMs struggle with long document understanding due to information interference and quadratic computational cost.
  2. Existing approaches either compress tokens, sacrificing fine-grained details, or add external retrievers, increasing system complexity.
  3. MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages.
  4. URaG leverages the inherent evidence-localization ability of MLLMs to perform retrieval during reasoning.
  5. URaG's lightweight cross-modal retrieval module turns the early Transformer layers into an efficient evidence selector, keeping the most relevant pages and discarding irrelevant content.
  6. This design lets the deeper layers concentrate computation on pertinent information, improving both accuracy and efficiency.
  7. URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
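The "early layers as evidence selector" idea can be caricatured as scoring pages by early-layer attention mass from the query and keeping the top-k. The attention vector and page boundaries below are invented for illustration; this is not URaG's actual retrieval module:

```python
import numpy as np

def select_pages(attn, page_spans, k=2):
    """attn: early-layer attention weights from the query token to every
    document token (1-D). page_spans: [(start, end)] token ranges per page.
    Keep the k pages that receive the most attention mass."""
    scores = [attn[s:e].sum() for s, e in page_spans]
    keep = sorted(np.argsort(scores)[-k:])
    return [int(i) for i in keep]

# 9 document tokens split into 3 pages of 3 tokens each.
attn = np.array([0.01, 0.02, 0.01, 0.30, 0.25, 0.20, 0.05, 0.10, 0.06])
pages = [(0, 3), (3, 6), (6, 9)]
print(select_pages(attn, pages, k=2))  # [1, 2]
```

Only the kept pages would be passed on to the deeper layers, which is where the reported 44-56% compute savings would come from.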

LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Authors:Zihan Gao, Yifei Xu, Jacob Thebault-Spieker

Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini’s accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

Paper and Project Links

PDF

Summary

Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, but their ability to handle hyper-local knowledge remains poorly understood, even as applications from civic platforms to community journalism demand reasoning about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks cannot capture this complexity, often relying on coarse-grained data or isolated references. This paper presents LocalBench, the first benchmark for systematically evaluating LLMs on county-level local knowledge in the United States. Grounded in the Localness Conceptual Framework, it contains 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating Census statistics, local subreddit discourse, and regional news, and spans physical, cognitive, and relational dimensions of locality. Evaluating 13 state-of-the-art LLMs under closed-book and web-augmented settings reveals critical limitations: even the best models reach only 56.8% accuracy on narrative-style questions and below 15.5% on numerical reasoning. Larger model size and web augmentation do not guarantee better performance; search improves Gemini's accuracy by +13.6% but reduces GPT-series performance by -11.4%. These results underscore the need for language models that can support equitable, place-aware AI systems across diverse geographic and cultural contexts.

Key Takeaways

  1. LLMs' ability to handle hyper-local knowledge remains poorly understood.
  2. Existing benchmarks cannot adequately evaluate LLMs on county-level local knowledge.
  3. LocalBench is the first benchmark for evaluating LLMs on U.S. county-level local knowledge, spanning physical, cognitive, and relational dimensions of locality.
  4. Current models score low on narrative-style questions and on numerical reasoning.
  5. Larger model size and web augmentation do not guarantee better performance.
  6. The effect of search augmentation on performance varies by model.

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Authors:Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

Paper and Project Links

PDF This paper has been accepted by AAAI 2026. 16 pages, 3 figures, 10 tables

Summary

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition with explainable, fine-grained reasoning. This paper proposes Facial-R1, a three-stage alignment framework that addresses hallucinated reasoning and reasoning-recognition misalignment with minimal supervision: instruction fine-tuning establishes basic emotional reasoning capability, reinforcement training guided by emotion and AU labels as reward signals aligns the generated reasoning with the predicted emotion, and a data synthesis pipeline iteratively expands the training set for scalable self-improvement. The paper also introduces FEA-20K, a benchmark dataset with fine-grained emotion analysis annotations. Experiments across eight standard benchmarks show that Facial-R1 achieves state-of-the-art FEA performance with strong generalization and robust interpretability.

Key Takeaways

  • FEA extends traditional facial emotion recognition with explainable, fine-grained reasoning.
  • Facial-R1 is a three-stage framework addressing two key challenges of emotion analysis: hallucinated reasoning and misalignment between reasoning and recognition.
  • Facial-R1 combines instruction fine-tuning, reinforcement training, and a data synthesis pipeline to improve model performance.
  • FEA-20K is a new dataset for facial emotion analysis with fine-grained annotations (17,737 training and 1,688 test samples).
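The second stage's reward signal, which uses both emotion and AU labels, can be sketched as a scalar mixing emotion correctness with AU-set overlap. The 50/50 weighting and the F1-style overlap below are invented for illustration, not the paper's reward definition:

```python
def au_f1(pred_aus, gold_aus):
    """F1 overlap between predicted and gold Action Unit sets."""
    pred, gold = set(pred_aus), set(gold_aus)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def reward(pred_emotion, gold_emotion, pred_aus, gold_aus, w=0.5):
    """Scalar reward mixing emotion correctness and AU overlap."""
    return w * float(pred_emotion == gold_emotion) + (1 - w) * au_f1(pred_aus, gold_aus)

print(reward("happy", "happy", ["AU6", "AU12"], ["AU6", "AU12"]))  # 1.0
print(reward("sad", "happy", ["AU6"], ["AU6", "AU12"]))            # partial credit
```

A reward of this shape explicitly ties the AU-based reasoning to the final emotion label, which is the alignment the stage is designed to enforce.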

LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning

Authors:Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Lei Huang, Weitao Ma, Qichen Hong, Yunfei Lu, Duyu Tang, Dandan Tu, Bing Qin

Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs), but the resulting multilingual capability remains highly sensitive to the composition and selection of the training data. Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. In this paper, we propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability which quantifies how well samples in different languages can be distinguished in the model’s representation space. LangGPS first filters training data based on separability scores and then refines the subset using existing selection methods. Extensive experiments across six benchmarks and 22 languages demonstrate that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability in multilingual training, especially for understanding tasks and low-resource languages. Further analysis reveals that highly separable samples facilitate the formation of clearer language boundaries and support faster adaptation, while low-separability samples tend to function as bridges for cross-lingual alignment. Besides, we also find that language separability can serve as an effective signal for multilingual curriculum learning, where interleaving samples with diverse separability levels yields stable and generalizable gains. Together, we hope our work offers a new perspective on data utility in multilingual contexts and support the development of more linguistically informed LLMs.

Paper and Project Links

PDF AAAI2026 Main Track Accepted

Summary

Joint multilingual instruction tuning improves LLMs' multilingual instruction-following ability and downstream performance, but the result is highly sensitive to the composition and selection of training data. This paper proposes LangGPS, a lightweight two-stage pre-selection framework guided by language separability, which quantifies how well samples in different languages can be distinguished in the model's representation space. Experiments show that applying LangGPS on top of existing selection methods improves their effectiveness and generalizability, especially for understanding tasks and low-resource languages. Highly separable samples help form clearer language boundaries and support faster adaptation, while low-separability samples act as bridges for cross-lingual alignment. This offers a new perspective on data utility in multilingual contexts.

Key Takeaways

  • Joint multilingual instruction tuning is widely used to improve LLMs' multilingual instruction following and downstream performance.
  • The composition and selection of training data are critical to the resulting multilingual capability.
  • LangGPS pre-selects training data by language separability, filtering on separability scores before refining with existing selection methods.
  • LangGPS improves the effectiveness and generalizability of existing selection methods, especially for understanding tasks and low-resource languages.
  • Highly separable samples help form clearer language boundaries and support faster adaptation.
  • Low-separability samples act as bridges for cross-lingual alignment.
  • Language separability can serve as an effective signal for multilingual curriculum learning.
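One simple way to quantify how separable two language clusters are in embedding space, offered as a hedged stand-in for the paper's actual separability definition, is a ratio of between-cluster centroid distance to within-cluster spread:

```python
import numpy as np

def separability(emb_a, emb_b):
    """Crude separability of two language clusters: centroid distance
    divided by average within-cluster spread. Higher = more separable."""
    ca, cb = emb_a.mean(0), emb_b.mean(0)
    between = np.linalg.norm(ca - cb)
    within = (np.linalg.norm(emb_a - ca, axis=1).mean()
              + np.linalg.norm(emb_b - cb, axis=1).mean()) / 2
    return between / (within + 1e-9)

rng = np.random.default_rng(0)
# Well-separated clusters vs. heavily overlapping ones.
en = rng.normal(0.0, 0.1, size=(50, 8))
de_far = rng.normal(3.0, 0.1, size=(50, 8))
de_near = rng.normal(0.05, 0.1, size=(50, 8))

print(separability(en, de_far) > separability(en, de_near))  # True
```

A score like this could then drive the first-stage filter (keep high-separability samples) or order samples for the curriculum-learning variant mentioned above.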

MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

Authors:Zihan Wang, Guansong Pang, Wenjun Miao, Jin Zheng, Xiao Bai

Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.

Paper and Project Links

PDF AAAI2026, with supplementary material

Summary

Large vision-language models (LVLMs) achieve impressive performance through large-scale image-text pretraining and instruction tuning, but they are vulnerable to backdoor attacks. Existing backdoor attacks are single-target, tying one malicious output to one trigger. This work uncovers multi-target backdoor attacks, where multiple independent triggers for different attack targets are added in a single pass of training, posing a greater real-world threat. Executing such attacks is challenging because severe feature interference among triggers causes many incorrect trigger-target mappings. MTAttack, the first multi-target backdoor attack framework for LVLMs, jointly optimizes multiple triggers in latent space under two constraints, Proxy Space Partitioning and Trigger Prototype Anchoring, so that each trigger independently maps clean images to a unique proxy class while the classes remain separable. Experiments on popular benchmarks show a high success rate for multi-target attacks, substantially outperforming existing methods, with strong cross-dataset generalizability and robustness against backdoor defenses. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and the urgent need to mitigate such threats. Code is available at https://github.com/mala-lab/MTAttack.

Key Takeaways

  1. Large vision-language models (LVLMs) perform well but are vulnerable to backdoor attacks.
  2. MTAttack is a multi-target backdoor attack that adds multiple independent triggers in a single pass of training.
  3. MTAttack jointly optimizes multiple triggers in latent space to enforce accurate trigger-target mappings.
  4. MTAttack uses two constraints: Proxy Space Partitioning and Trigger Prototype Anchoring.
  5. Experiments show high success rates, strong generalizability across datasets, and robustness against backdoor defenses.
  6. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks.

GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

Authors:Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang, Qi Song

Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.

Paper and Project Links

PDF

Summary

Multi-turn instruction following is essential for building intelligent conversational systems, but existing approaches fine-tune LLMs on large multi-turn dialogue datasets, treating each response as an isolated task and struggling with complex long-distance constraints. This paper proposes GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and uses graph prompts to enhance the instruction-following ability of LLMs. GraphIF comprises three key components for constructing structured graphs, generating relation graph prompts, and rewriting responses. Experiments on two long multi-turn dialogue datasets show that GraphIF integrates seamlessly into instruction-tuned LLMs and brings significant improvements on all four multi-turn instruction-following evaluation metrics.

Key Takeaways

  1. Multi-turn instruction following is essential for conversational systems that must adhere to instructions across dialogue turns.
  2. Existing approaches fine-tune LLMs on large multi-turn datasets but handle complex long-distance constraints poorly.
  3. GraphIF models multi-turn dialogues as directed relation graphs to better handle complex instruction-following tasks.
  4. GraphIF has three key components: a relation extraction module, a relation graph prompt generation module, and a response rewriting module.
  5. The relation extraction module captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs.
  6. The relation graph prompt generation module converts structured graph information into natural language prompts.
  7. The response rewriting module refines initial LLM outputs using the generated graph prompts.
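The graph-to-prompt step can be sketched as serializing labeled directed edges between dialogue turns into natural language. The edge labels and wording below are invented for illustration, not GraphIF's actual prompt templates:

```python
def graph_to_prompt(edges):
    """edges: (src_turn, relation, dst_turn) triples modeling cross-turn
    constraints as labeled directed edges. Serialize them into a
    natural-language graph prompt."""
    lines = ["Constraints carried over from earlier turns:"]
    for src, rel, dst in edges:
        lines.append(f"- Turn {dst} must {rel} turn {src}.")
    return "\n".join(lines)

edges = [
    (1, "reuse the persona defined in", 3),
    (2, "avoid the forbidden words listed in", 3),
]
prompt = graph_to_prompt(edges)
print(prompt)
```

The serialized prompt would then be prepended when rewriting the response for turn 3, making long-distance constraints explicit rather than relying on the model to recover them from raw dialogue history.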

How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Authors:Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa

Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.


Paper and Project Links

PDF Accepted in IEEE BigData 2025

Summary

Large language models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. This study focuses on English-to-German critical error detection (CED), benchmarking several small LLMs on WMT21, WMT22, and SynCED-EnDe-2025. The framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: after merged-weights fine-tuning, Gemma-3-1B reaches MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 for the best quality-efficiency balance, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). The larger Qwen-3-1.7B attains the highest absolute MCC but at higher compute cost. Ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs with lightweight calibration and small-sample supervision can deliver trustworthy on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available in the authors' GitHub repository.

Key Takeaways

  1. Large language models (LLMs) excel at machine translation evaluation, but deploying them on edge devices and in privacy-sensitive settings is costly.
  2. The study focuses on English-to-German critical error detection (CED).
  3. Several small LLMs are benchmarked across multiple datasets.
  4. Standardized prompts, lightweight calibration, and majority voting are used to improve model performance.
  5. A roughly one-billion-parameter model (Gemma-3-1B) hits the quality-efficiency sweet spot with strong error-detection performance.
  6. Qwen-3-1.7B reaches a higher MCC but at greater compute cost.
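Two ingredients mentioned above, majority voting over repeated model votes and MCC scoring, can be illustrated with a hypothetical sketch (the vote data and helper names are invented here, and the pipeline's logit-bias calibration step is omitted):

```python
# Illustrative sketch, not the paper's code: majority-vote per sample,
# then score against gold labels with the Matthews correlation coefficient.
from collections import Counter
import math

def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]

def mcc(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# three sampled votes per sentence (1 = critical error, 0 = ok)
per_sample_votes = [[1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0]]
y_pred = [majority_vote(v) for v in per_sample_votes]
y_true = [1, 0, 0, 0]
print(y_pred, round(mcc(y_true, y_pred), 3))
```

MCC is used here (rather than plain accuracy) because CED labels are imbalanced: it only rewards models that get both error and non-error classes right.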


SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Authors:Arman Zarei, Samyadeep Basu, Mobina Pournemat, Sayan Nag, Ryan Rossi, Soheil Feizi

Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.


Paper and Project Links

PDF

Summary

This paper introduces SliderEdit, an instruction-based framework for fine-grained, continuous image editing. The framework disentangles each edit instruction into an independently adjustable slider, letting users smoothly tune the strength of every instruction. Unlike prior approaches, SliderEdit learns a single set of low-rank adaptation matrices that generalizes across diverse edits, attributes, and compositional instructions, enabling continuous interpolation along individual edit dimensions while preserving spatial locality and global semantic consistency. Applied to state-of-the-art image editing models such as FLUX-Kontext and Qwen-Image-Edit, SliderEdit substantially improves edit controllability, visual consistency, and user steerability. It is the first framework to explore continuous, fine-grained instruction control in instruction-based image editing models.

Key Takeaways

  1. The SliderEdit framework enables fine-grained, continuous control in instruction-based image editing.
  2. Each edit instruction is disentangled into an independent slider, so users can smoothly adjust instruction strength.
  3. A single set of low-rank adaptation matrices is learned that generalizes across diverse edits, attributes, and compositional instructions.
  4. SliderEdit preserves spatial locality and global semantic consistency while enabling continuous interpolation along individual edit dimensions.
  5. The framework is applied to state-of-the-art image editing models such as FLUX-Kontext and Qwen-Image-Edit.
  6. With SliderEdit, edit controllability, visual consistency, and user steerability improve substantially.
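A minimal sketch of the slider idea, under the assumption that each instruction contributes a low-rank delta B_i A_i whose effect is scaled by a user-set strength alpha_i (pure-Python toy with invented matrices, not SliderEdit's actual implementation):

```python
# Toy sketch of slider-controlled low-rank edits (illustrative only):
# y = (W + sum_i alpha_i * B_i A_i) x, with one slider alpha_i per edit.

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def slider_forward(W, x, deltas, alphas):
    """Apply the frozen base weight plus slider-scaled low-rank deltas."""
    y = matvec(W, x)
    for (B, A), alpha in zip(deltas, alphas):
        d = matvec(matmul(B, A), x)
        y = [yi + alpha * di for yi, di in zip(y, d)]
    return y

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight
B = [[1.0], [0.0]]; A = [[0.0, 2.0]]  # rank-1 edit direction
x = [1.0, 1.0]
print(slider_forward(W, x, [(B, A)], [0.0]))  # slider off -> base output
print(slider_forward(W, x, [(B, A)], [0.5]))  # half-strength edit
```

Because alpha enters linearly, sliding it from 0 to 1 interpolates continuously between the unedited and fully edited behavior, which is the property the paper exposes per instruction.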


General Intelligence-based Fragmentation (GIF): A framework for peak-labeled spectra simulation

Authors:Margaret R. Martin, Soha Hassoun

Despite growing reference libraries and advanced computational tools, progress in the field of metabolomics remains constrained by low rates of annotating measured spectra. The recent developments of large language models (LLMs) have led to strong performance across a wide range of generation and reasoning tasks, spurring increased interest in LLMs’ application to domain-specific scientific challenges, such as mass spectra annotation. Here, we present a novel framework, General Intelligence-based Fragmentation (GIF), that guides pretrained LLMs through spectra simulation using structured prompting and reasoning. GIF utilizes tagging, structured inputs/outputs, system prompts, instruction-based prompts, and iterative refinement. Indeed, GIF offers a structured alternative to ad hoc prompting, underscoring the need for systematic guidance of LLMs on complex scientific tasks. Using GIF, we evaluate current generalist LLMs’ ability to use reasoning towards fragmentation and to perform intensity prediction after fine-tuning. We benchmark performance on a novel QA dataset, the MassSpecGym QA-sim dataset, that we derive from the MassSpecGym dataset. Through these implementations of GIF, we find that GPT-4o and GPT-4o-mini achieve a cosine similarity of 0.36 and 0.35 between the simulated and true spectra, respectively, outperforming other pretrained models including GPT-5, Llama-3.1, and ChemDFM, despite GPT-5’s recency and ChemDFM’s domain specialization. GIF outperforms several deep learning baselines. Our evaluation of GIF highlights the value of using LLMs not only for spectra simulation but for enabling human-in-the-loop workflows and structured, explainable reasoning in molecular fragmentation.


Paper and Project Links

PDF

Summary

This work presents General Intelligence-based Fragmentation (GIF), a framework that guides pretrained large language models (LLMs) through mass-spectra simulation via structured prompting and reasoning. GIF steers LLMs systematically using tagging, structured inputs/outputs, system prompts, instruction-based prompts, and iterative refinement. Evaluation shows that GPT-4o and GPT-4o-mini achieve the best cosine similarity between simulated and true spectra, outperforming other pretrained models including GPT-5 and ChemDFM. Beyond spectra simulation, the framework supports human-in-the-loop workflows and structured, explainable reasoning about molecular fragmentation.

Key Takeaways

  1. Despite growing reference libraries and advanced computational tools, progress in metabolomics remains constrained by low annotation rates for measured spectra.
  2. LLMs perform strongly across generation and reasoning tasks, spurring interest in applying them to domain-specific scientific challenges such as mass-spectra annotation.
  3. The proposed General Intelligence-based Fragmentation (GIF) framework guides pretrained LLMs through spectra simulation with structured prompting and reasoning.
  4. GIF steers LLMs systematically via tagging, structured inputs/outputs, system prompts, instruction-based prompts, and iterative refinement.
  5. On the new MassSpecGym QA-sim dataset, GPT-4o and GPT-4o-mini stand out in cosine similarity between simulated and true spectra.
  6. GPT-4o and GPT-4o-mini outperform other pretrained models such as GPT-5 and ChemDFM.
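The cosine-similarity metric used to compare simulated and true spectra can be illustrated with a toy binned-spectrum example (the peak lists, bin count, and m/z range below are made up; this is not the paper's evaluation code):

```python
# Illustrative sketch: score a simulated spectrum against ground truth
# with cosine similarity over binned peak intensities.
import math

def bin_spectrum(peaks, n_bins=10, max_mz=100.0):
    """peaks: list of (m/z, intensity); returns a fixed-length intensity vector."""
    vec = [0.0] * n_bins
    for mz, inten in peaks:
        idx = min(int(mz / max_mz * n_bins), n_bins - 1)
        vec[idx] += inten
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

true_peaks = [(15.0, 1.0), (43.0, 0.5), (58.0, 1.0)]
sim_peaks = [(15.0, 0.8), (43.0, 0.7), (91.0, 0.2)]
score = cosine(bin_spectrum(true_peaks), bin_spectrum(sim_peaks))
print(round(score, 3))
```

Binning makes the two spectra comparable as fixed-length vectors; a simulated spectrum that places intensity in the right m/z bins scores close to 1, while spurious or missing peaks pull the score down.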


LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication

Authors:Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, Akarsh Srivastava, Harshitha Menon, Charles Fredrick Jekel, Abhinav Bhatele

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.


Paper and Project Links

PDF 12 Figures

Summary

A study of multi-node efficiency in distributed inference for large language models (LLMs). This work presents a detailed performance study of model-parallel strategies for multi-node distributed inference on GPU-based supercomputers, running experiments with several state-of-the-art inference engines alongside the research prototype YALIS, and analyzes the strong-scaling behavior of different model-parallel schemes. To address all-reduce operations, a common performance bottleneck, the authors develop NVRAR, a hierarchical all-reduce algorithm based on NVSHMEM that lowers latency by 1.9x-3.6x. Integrated into YALIS, NVRAR substantially reduces end-to-end batch latency for the Llama 3.1 405B model on multi-node decode-heavy workloads.

Key Takeaways

  1. As large language models (LLMs) grow, multi-node efficiency in distributed inference becomes critical.
  2. Model-parallel strategies require detailed performance studies not only across multiple GPUs but also across multiple nodes.
  3. The study covers several state-of-the-art inference engines alongside the research prototype YALIS.
  4. All-reduce operations are a common performance bottleneck.
  5. NVRAR, a new hierarchical all-reduce algorithm based on NVSHMEM, achieves lower latency than NCCL for message sizes between 128 KB and 2 MB.
  6. Integrated into YALIS, NVRAR significantly reduces end-to-end batch latency on multi-node decode-heavy workloads.
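The recursive-doubling pattern underlying NVRAR can be simulated in plain Python. This is a toy single-process model of the communication schedule only (not the NVSHMEM implementation): at step s, rank r exchanges its partial sum with rank r XOR 2^s, so after log2(P) steps every rank holds the full reduction.

```python
# Toy simulation of a recursive-doubling all-reduce over P = 2^k ranks.

def allreduce_recursive_doubling(values):
    p = len(values)               # number of ranks; must be a power of two
    assert p & (p - 1) == 0
    vals = list(values)
    step = 1
    while step < p:
        new_vals = []
        for r in range(p):
            partner = r ^ step    # pairwise exchange with partner rank
            new_vals.append(vals[r] + vals[partner])
        vals = new_vals
        step *= 2
    return vals

print(allreduce_recursive_doubling([1, 2, 3, 4]))  # every rank ends with 10
```

Recursive doubling needs only log2(P) exchange steps, which is why it pays off for the small-to-medium message sizes (128 KB to 2 MB) where NVRAR beats NCCL.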


RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Authors:Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, Neehar Peri

Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the “tunable knobs” for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is at https://github.com/roboflow/rf-detr


Paper and Project Links

PDF Project Page: https://rfdetr.roboflow.com/

Summary

This paper introduces RF-DETR, a light-weight specialist detector that uses weight-sharing neural architecture search (NAS) to discover accuracy-latency Pareto curves for any target dataset. RF-DETR fine-tunes a pre-trained base network on the target dataset and evaluates thousands of network configurations with different accuracy-latency trade-offs without re-training. It also revisits the tunable knobs of NAS to improve the transferability of DETRs to diverse target domains. Experiments show that RF-DETR outperforms prior real-time detection methods on datasets such as COCO and Roboflow100-VL.

Key Takeaways

  1. RF-DETR is a light-weight specialist detector optimized with weight-sharing neural architecture search (NAS).
  2. RF-DETR fine-tunes a pre-trained base network and evaluates network configurations with different accuracy-latency trade-offs without re-training.
  3. By revisiting the tunable knobs of NAS, RF-DETR improves the transferability of DETRs to diverse target domains.
  4. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency.
  5. RF-DETR (2x-large) outperforms GroundingDINO (tiny) on Roboflow100-VL while running 20x as fast.
  6. RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO.
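Selecting an accuracy-latency Pareto frontier from NAS-evaluated configurations, as described above, can be sketched as follows (the configuration names and numbers are invented for illustration; this is not RF-DETR's search code):

```python
# Illustrative sketch: keep only configurations that are not dominated,
# i.e. no other config has both lower latency and higher AP.

def pareto_frontier(configs):
    """configs: list of (name, latency_ms, AP); returns the frontier by latency."""
    frontier = []
    for name, lat, ap in configs:
        dominated = any(l2 <= lat and a2 >= ap and (l2, a2) != (lat, ap)
                        for _, l2, a2 in configs)
        if not dominated:
            frontier.append((name, lat, ap))
    return sorted(frontier, key=lambda c: c[1])

configs = [
    ("cfg-a", 4.0, 45.0),
    ("cfg-b", 6.0, 48.0),
    ("cfg-c", 6.5, 47.0),   # dominated by cfg-b: slower and less accurate
    ("cfg-d", 9.0, 52.0),
]
print(pareto_frontier(configs))
```

Because weight-sharing NAS evaluates each configuration without re-training, this filtering step can be run over thousands of candidates cheaply, and a user then picks the frontier point matching their latency budget.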


vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Authors:Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.


Paper and Project Links

PDF Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)

Summary

Context optimization (CoOp) guided by LLM-distilled medical semantic priors offers a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this setting is challenged by semantic misalignment between LLMs and CLIP variants and lacks scalability across continuously evolving families of foundation models. This work proposes vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared hyperspherical manifold and aligns semantic biases between arbitrary LLMs and CLIP backbones via unified semantic anchors, achieving robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp shows consistent improvements across medical datasets, imaging modalities, and anatomical regions. The work will continue to expand to more downstream applications, with resources shared at https://github.com/VinyehShaw/UniEqui.

Key Takeaways

  1. LLM-distilled medical semantic priors play an important role in context optimization.
  2. Semantic misalignment between LLMs and CLIP variants, caused by divergent training corpora and model architectures, poses a challenge.
  3. vMFCoOp addresses this challenge by inversely estimating von Mises-Fisher distributions on a shared hyperspherical manifold to strengthen semantic alignment.
  4. vMFCoOp uses unified semantic anchors to achieve robust biomedical prompting and few-shot classification.
  5. vMFCoOp performs consistently well across 14 medical datasets, 12 imaging modalities, and 13 anatomical regions.
  6. The vMFCoOp framework aims to expand to more downstream applications, with resources to be shared.
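As background for the vMF machinery mentioned above, here is a toy fit of a von Mises-Fisher mean direction and concentration on the unit sphere, using the standard closed-form kappa approximation of Banerjee et al. Note this is generic vMF estimation for intuition, not the paper's inverse-estimation procedure:

```python
# Toy vMF fit: mean direction mu on the unit hypersphere and an
# approximate concentration kappa from the mean resultant length.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fit_vmf(points):
    """points: unit vectors; returns (mean direction mu, concentration kappa)."""
    d = len(points[0])
    s = [sum(p[i] for p in points) for i in range(d)]
    r = math.sqrt(sum(x * x for x in s)) / len(points)  # mean resultant length
    mu = normalize(s)
    kappa = r * (d - r * r) / (1 - r * r)               # Banerjee et al. approx.
    return mu, kappa

# tightly clustered unit vectors -> large kappa (high concentration)
pts = [normalize([1.0, 0.1]), normalize([1.0, -0.1]), normalize([0.9, 0.05])]
mu, kappa = fit_vmf(pts)
print([round(x, 3) for x in mu], round(kappa, 1))
```

On the hypersphere, kappa plays the role the inverse variance plays in Euclidean space: tightly clustered embeddings yield a large kappa, which is why vMF distributions are a natural fit for aligning normalized CLIP-style features.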


Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Authors:Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine’s effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.


Paper and Project Links

PDF

Summary

Lumine is the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. It adopts a human-like interaction paradigm that unifies perception, reasoning, and action end to end, powered by a vision-language model that processes raw pixels and invokes reasoning adaptively. Trained in Genshin Impact, Lumine completes the five-hour Mondstadt main storyline with human-level efficiency and follows natural-language instructions across a broad range of tasks in both 3D open-world exploration and 2D GUI manipulation. Beyond its in-domain performance, Lumine shows strong zero-shot cross-game generalization: without any fine-tuning, it completes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These results highlight the potential of generalist agents in open-ended environments.

Key Takeaways

  1. Lumine is the first open recipe for generalist agents that complete hours-long real-time missions in complex 3D open-world environments.
  2. Lumine adopts a human-like interaction paradigm, unifying perception, reasoning, and action through a vision-language model.
  3. Lumine processes raw pixels while adaptively invoking reasoning only when necessary, balancing efficiency and accuracy.
  4. Trained in Genshin Impact, Lumine completes the five-hour Mondstadt main storyline with human-level efficiency.
  5. Lumine performs a broad spectrum of tasks across 3D open-world exploration and 2D GUI manipulation.
  6. Beyond its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization.


Toward Automated Cognitive Assessment in Parkinson’s Disease Using Pretrained Language Models

Authors:Varada Khanna, Nilay Bhatt, Ikgyu Shin, Sule Tinaz, Yang Ren, Hua Xu, Vipina K. Keloth

Understanding how individuals with Parkinson’s disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance on extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparable to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.


Paper and Project Links

PDF 15 pages, 4 figures, 1 table. Varada Khanna and Nilay Bhatt are co-first authors. Sule Tinaz and Hua Xu are co-senior authors. Corresponding author: Vipina K. Keloth (vipina.kuttichikeloth@yale.edu)

Summary

How individuals with Parkinson's disease (PD) describe their cognitive experiences offers valuable insight into disease-related cognitive and emotional changes. This study developed and evaluated natural language processing (NLP) models to automatically identify categories reflecting various cognitive processes from de-identified first-person narratives. Results show that performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest micro- and macro-averaged F1-scores, particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall, performing comparably to Llama on categories such as location and time but falling short on thought, emotion, and social interaction. With continued refinement, these NLP systems hold promise for low-burden, longitudinal monitoring of cognitive function and could serve as a valuable complement to formal neuropsychological assessments in PD.

Key Takeaways

  1. How PD patients describe everyday cognitive experiences is crucial for understanding disease-related cognitive and emotional changes.
  2. NLP models can automatically extract cognitive-process categories from patients' unstructured narratives.
  3. Model families differ in how well they handle different cognitive categories.
  4. The Meta-Llama-3-8B-Instruct model performs best overall, especially on context-dependent categories such as thought and social interaction.
  5. The Bio_ClinicalBERT model offers high precision but low recall: fairly accurate on categories such as location and time, weaker on thought, emotion, and social interaction.
  6. Compared with conventional information extraction tasks, extracting complex cognitive processes from narratives is harder because the accounts are abstract and overlapping.
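The gap between the micro-averaged (0.74) and macro-averaged (0.59) F1-scores reported above arises because micro-averaging weights categories by frequency, while macro-averaging treats all categories equally. A toy illustration with invented per-category counts:

```python
# Illustrative sketch (counts are made up): micro- vs macro-averaged F1
# over entity categories of very different frequency and difficulty.

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# per-category (true positive, false positive, false negative) counts
counts = {"location": (40, 5, 5), "time": (30, 4, 6), "thought": (5, 8, 15)}

macro = sum(f1(*c) for c in counts.values()) / len(counts)  # average of per-class F1
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)                                      # F1 over pooled counts
print(round(micro, 3), round(macro, 3))
```

Here the rare, hard "thought" category drags the macro average down while barely moving the micro average, mirroring how a model can look strong overall yet struggle on abstract categories.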


RESTL: Reinforcement Learning Guided by Multi-Aspect Rewards for Signal Temporal Logic Transformation

Authors:Yue Fang, Jin Zhi, Jie An, Hongshen Chen, Xiaohong Chen, Naijun Zhan

Signal Temporal Logic (STL) is a powerful formal language for specifying real-time specifications of Cyber-Physical Systems (CPS). Transforming specifications written in natural language into STL formulas automatically has attracted increasing attention. Existing rule-based methods depend heavily on rigid pattern matching and domain-specific knowledge, limiting their generalizability and scalability. Recently, Supervised Fine-Tuning (SFT) of large language models (LLMs) has been successfully applied to transform natural language into STL. However, the lack of fine-grained supervision on atomic proposition correctness, semantic fidelity, and formula readability often leads SFT-based methods to produce formulas misaligned with the intended meaning. To address these issues, we propose RESTL, a reinforcement learning (RL)-based framework for the transformation from natural language to STL. RESTL introduces multiple independently trained reward models that provide fine-grained, multi-faceted feedback from four perspectives, i.e., atomic proposition consistency, semantic alignment, formula succinctness, and symbol matching. These reward models are trained with a curriculum learning strategy to improve their feedback accuracy, and their outputs are aggregated into a unified signal that guides the optimization of the STL generator via Proximal Policy Optimization (PPO). Experimental results demonstrate that RESTL significantly outperforms state-of-the-art methods in both automatic metrics and human evaluations.


Paper and Project Links

PDF 12 pages, 4 figures

Summary

Signal Temporal Logic (STL) is a powerful formal language for specifying real-time requirements of cyber-physical systems. Automatically transforming natural-language specifications into STL formulas has attracted growing attention, but existing methods have limitations. Supervised fine-tuning (SFT) of large language models has been applied successfully to this transformation, yet the lack of fine-grained supervision on atomic-proposition correctness, semantic fidelity, and formula readability often yields formulas misaligned with the intended meaning. To address this, the paper proposes RESTL, a reinforcement-learning-based transformation framework in which multiple independently trained reward models provide fine-grained, multi-faceted feedback, with a curriculum learning strategy improving feedback accuracy. Experiments show that RESTL significantly outperforms existing methods on both automatic metrics and human evaluations.

Key Takeaways

  1. STL is a formal language for specifying real-time requirements of cyber-physical systems.
  2. Automatically transforming natural-language specifications into STL formulas has attracted wide attention.
  3. Existing rule-based methods depend on rigid pattern matching and domain-specific knowledge, limiting their generality and scalability.
  4. Although SFT has been applied successfully to language models, the lack of fine-grained supervision can yield formulas misaligned with the intended meaning.
  5. The RESTL framework uses reinforcement learning, with multiple reward models providing fine-grained feedback.
  6. The reward models evaluate four aspects: atomic-proposition consistency, semantic alignment, formula succinctness, and symbol matching.
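The aggregation of the four reward aspects into one scalar that a PPO-style update would maximize can be sketched as a weighted sum (the aspect scores and equal weights below are hypothetical; RESTL's actual aggregation may differ):

```python
# Illustrative sketch: combine per-aspect rewards for a generated STL
# formula into a single scalar training signal.

def aggregate_reward(scores, weights=None):
    """scores: dict of per-aspect rewards in [0, 1]; returns a weighted sum."""
    weights = weights or {k: 1.0 / len(scores) for k in scores}
    return sum(weights[k] * v for k, v in scores.items())

scores = {
    "atomic_consistency": 0.9,   # atomic propositions match the specification
    "semantic_alignment": 0.8,   # formula semantics match the intent
    "succinctness": 0.6,         # formula is reasonably short
    "symbol_matching": 1.0,      # operators/symbols are well-formed
}
r = aggregate_reward(scores)
print(round(r, 3))  # equal weights -> mean of the four aspects
```

Keeping the aspects as separate reward models and only collapsing them at the end lets each model be trained (here, per the paper, with a curriculum) on its own facet before the unified signal drives the policy update.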



Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!