LLM

Published: 2025-09-12

Updated: 2025-10-07

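⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading the papers.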

Updates for 2025-09-12

A Survey of Reinforcement Learning for Large Reasoning Models

Authors:Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

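Paper and project links

PDF

Summary

Reinforcement learning (RL) has markedly advanced the capabilities of large language models (LLMs), particularly on complex logical tasks such as mathematics and coding, and has become a foundational methodology for turning LLMs into Large Reasoning Models (LRMs). As the field moves quickly, further scaling of RL for LRMs faces challenges in computational resources, algorithm design, training data, and infrastructure. The survey revisits research applying RL to LLMs and LRMs for reasoning, especially since the release of DeepSeek-R1, and explores strategies for making RL more scalable toward Artificial SuperIntelligence (ASI).

Key Takeaways

RL has substantially advanced LLM capabilities, especially on complex logical tasks such as mathematics and coding.
RL has become a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs).
Further scaling of RL for LRMs faces challenges in computational resources, algorithm design, training data, and infrastructure.
Research applying RL to LLMs and LRMs continues to evolve rapidly, particularly since the release of DeepSeek-R1.
The survey revisits the field's development and explores strategies to improve the scalability of RL toward ASI.
It covers foundational components, core problems, training resources, and downstream applications in order to identify future opportunities and directions.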

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Authors:Joachim Baumann, Paul Röttger, Aleksandra Urman, Albert Wendsjö, Flor Miriam Plaza-del-Arco, Johannes B. Gruber, Dirk Hovy

Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection. Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.
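The four error types named in the abstract can be made concrete with a small sketch that compares a conclusion drawn from LLM labels against one drawn from reference (for example, human) labels. The function name, the significance level `alpha`, and the magnitude threshold `m_ratio` are illustrative assumptions, not values or code from the paper.

```python
def classify_error(ref_effect, ref_p, llm_effect, llm_p, alpha=0.05, m_ratio=2.0):
    """Label the disagreement between an LLM-based and a reference conclusion.

    alpha and m_ratio are illustrative assumptions, not values from the paper.
    """
    ref_sig = ref_p < alpha
    llm_sig = llm_p < alpha
    if llm_sig and not ref_sig:
        return "Type I"   # false positive: effect reported where the reference finds none
    if ref_sig and not llm_sig:
        return "Type II"  # false negative: an effect in the reference data is missed
    if ref_sig and llm_sig:
        if (ref_effect > 0) != (llm_effect > 0):
            return "Type S"  # significant in both, but the sign is flipped
        if ref_effect != 0:
            ratio = abs(llm_effect) / abs(ref_effect)
            if ratio > m_ratio or ratio < 1 / m_ratio:
                return "Type M"  # same sign, magnitude off by more than m_ratio
    return "correct"


# Hypothetical numbers: the reference finds no effect, the LLM labels suggest one.
print(classify_error(ref_effect=0.0, ref_p=0.60, llm_effect=0.3, llm_p=0.01))
```

Running the same hypothesis through several model and prompt configurations and counting the non-"correct" labels gives a rough picture of how often implementation choices change the conclusion.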

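Paper and project links

PDF

Summary

LLMs are transforming social science research by automating labor-intensive tasks such as data annotation, but their outputs vary considerably with researchers' implementation choices (model selection, prompting strategy, temperature settings). That variation introduces systematic biases and random errors that propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors, a risk the authors call LLM hacking. The study replicates 37 data annotation tasks from 21 published social science studies with 18 models, analyzing 13 million LLM labels and testing 2,361 hypotheses. Conclusions based on LLM-annotated data are incorrect for approximately one in three hypotheses with state-of-the-art models and for half the hypotheses with small language models. Higher task performance and stronger general capabilities reduce the risk but do not eliminate it. Human annotations are important for reducing false positives and improving model selection, while common regression estimator correction techniques are largely ineffective. Intentional LLM hacking is also simple: with a few LLMs and a handful of prompt paraphrases, anything can be presented as statistically significant.

Key Takeaways

LLMs make annotation in social science research more efficient, but implementation choices such as model, prompt, and temperature introduce systematic biases and random errors (LLM hacking).
Errors in LLM-annotated data propagate to downstream analyses and cause Type I, Type II, Type S, and Type M errors.
Higher task performance and better general model capabilities reduce LLM hacking risk but do not eliminate it.
The risk decreases as effect sizes increase, so findings near significance thresholds need more rigorous verification.
Human annotations are important for reducing false positive findings and improving model selection.
Common regression estimator correction techniques are largely ineffective because they heavily trade off Type I against Type II errors.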

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Authors:Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee

Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these “judge” models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge’s true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
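The idea of assigning prompt weights from each sample's visual characteristics can be sketched as a cluster-conditioned weighted average. This is only a minimal illustration of the mixing step: the function names, the nearest-centroid rule, and the fixed weight table are assumptions, and the abstract does not describe how MMB actually learns its weights.

```python
def nearest_cluster(embedding, centroids):
    """Index of the centroid closest (squared Euclidean distance) to an image embedding."""
    def dist(centroid):
        return sum((e - c) ** 2 for e, c in zip(embedding, centroid))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))


def ensemble_preference(prompt_probs, embedding, centroids, cluster_weights):
    """Mix per-prompt preference probabilities using the weights of the image's cluster.

    prompt_probs[k] is the judge's probability, under prompt k, that image A is preferred.
    cluster_weights[c][k] is a hypothetical, pre-estimated weight of prompt k in cluster c.
    """
    weights = cluster_weights[nearest_cluster(embedding, centroids)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total


# Two hypothetical image clusters and two judge prompts.
centroids = [[0.0, 0.0], [10.0, 10.0]]
cluster_weights = [[0.8, 0.2], [0.1, 0.9]]
print(ensemble_preference([0.9, 0.3], [1.0, 1.0], centroids, cluster_weights))
```

The same pair of per-prompt probabilities yields a different ensemble judgment depending on which cluster the image falls into, which is the behavior the abstract attributes to the image-clustering component.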

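Paper and project links

PDF 17 pages, 8 figures, Accepted at ICCV 2025

Summary

Multimodal large language models (MLLMs) are increasingly used as judges of text-to-image (TTI) generation systems, but these judges suffer from biases, overconfidence, and inconsistent performance across image domains. Standard prompt ensembling, which helps in text-only settings, does not generalize effectively to TTI tasks. The authors propose Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), which augments a Bayesian prompt ensemble with image clustering so that prompt weights are assigned dynamically from each sample's visual characteristics. On HPSv2 and MJBench, MMB improves pairwise preference accuracy and calibration over existing baselines.

Key Takeaways

MLLMs are increasingly used to evaluate TTI generation systems.
These judge models suffer from biases, overconfidence, and inconsistent performance across image domains.
Standard prompt ensembling methods fail to generalize effectively to TTI tasks.
The paper proposes MMB (Multimodal Mixture-of-Bayesian Prompt Ensembles), a multimodal-aware method.
MMB combines a Bayesian prompt ensemble with image clustering to assign prompt weights dynamically from each sample's visual characteristics.
MMB improves accuracy in pairwise preference judgments and greatly improves calibration on HPSv2 and MJBench.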

Authors:Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei

As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.

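Paper and project links

PDF

Summary

As multimodal large language models (MLLMs) advance, their large-scale architectures are hard to deploy in resource-constrained environments, which makes lightweight, high-performance models important for real-world use. The authors propose a lightweight MLLM framework for end-to-end visual question answering built around BreezeCLIP, a compact vision-language encoder optimized for efficient multimodal understanding. With only 1.2 billion parameters overall, the model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs, and experiments on multiple datasets support its balance of accuracy and efficiency. The modular, extensible framework is named BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model).

Key Takeaways

Deploying MLLMs in resource-constrained environments is challenging.
Lightweight, high-performance models are critical for real-world applications.
The proposed lightweight MLLM framework centers on BreezeCLIP, a compact vision-language encoder optimized for efficient multimodal understanding.
The model has only 1.2 billion parameters overall, reducing computational cost while matching the performance of standard-size MLLMs.
Experiments on multiple datasets validate its balance of accuracy and efficiency.
The modular and extensible design enables generalization to broader multimodal tasks.
BcQLM offers a promising path toward deployable MLLMs under practical hardware constraints.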

AdsQA: Towards Advertisement Video Understanding

Authors:Xinwei Long, Kai Tian, Peng Xu, Guoli Jia, Jingxuan Li, Sa Yang, Yihua Shao, Kaiyan Zhang, Che Jiang, Hao Xu, Yang Liu, Jiaheng Ma, Bowen Zhou

Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense traits of ad videos, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1 styled RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves state-of-the-art performance, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.

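Paper and project links

PDF ICCV-2025

Summary

LLMs have taken a large step toward AGI, and domain-specific problems such as math and programming keep pushing them to acquire deeper expertise. The authors use advertisement (ad) videos, which are clue-rich and information-dense, as a test bed for probing whether models can perceive beyond the objective physical content of a video, for example marketing logic, persuasive strategies, and audience engagement. They contribute AdsQA, an ad video QA benchmark built from 1,544 ad videos with 10,962 clips (22.7 hours in total) and 5 tasks, and ReAd-R, a DeepSeek-R1 styled RL model that reflects on questions and generates answers via reward-driven optimization. Across 14 benchmarked LLMs, ReAd-R achieves state-of-the-art results, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.

Key Takeaways

LLMs have made substantial progress toward AGI, and domain-specific problems such as math and programming continue to drive their evolution.
Ad videos are clue-rich and information-dense, which makes them a challenging test bed for perception beyond objective physical content.
To the authors' knowledge, this is the first attempt to evaluate LLMs on ad videos with well-designed tasks; the result is the AdsQA benchmark.
The paper proposes ReAd-R, which reflects on questions and generates answers via reward-driven optimization.
ReAd-R achieves the best results on AdsQA, outperforming strong competitors equipped with long-chain reasoning capabilities.
AdsQA spans 1,544 ad videos, 10,962 clips, and 22.7 hours, and defines 5 challenging tasks on which 14 top-tier LLMs are benchmarked.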

LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge

Authors:Dima Galat, Diego Molla-Aliod

Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.
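Aggregating Yes/No answers from several zero-shot models can be sketched as a vote over normalized replies. The abstract does not say how the outputs are combined, so majority voting, the tie-break rule, and the function names here are assumptions chosen for illustration.

```python
from collections import Counter


def normalize(answer):
    """Map a free-text model reply to 'yes', 'no', or None if it starts with neither."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,:;!") if words else ""
    return first if first in ("yes", "no") else None


def ensemble_vote(answers, tie_break="no"):
    """Majority vote over parseable replies; ties and empty votes fall back to tie_break."""
    votes = Counter(a for a in map(normalize, answers) if a is not None)
    if votes["yes"] == votes["no"]:
        return tie_break
    return "yes" if votes["yes"] > votes["no"] else "no"


# Hypothetical replies from three different models to the same question.
print(ensemble_vote(["Yes.", "no", "YES, the snippets support this."]))
```

Unparseable replies are dropped rather than counted, so a single model that ignores the output format does not decide the vote.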

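Paper and project links

PDF CEUR-WS, CLEF2025

Summary

The paper explores how LLMs can be used for information retrieval (IR) in biomedical question answering (QA), and shows that an ensemble of zero-shot models can reach state-of-the-art performance on a domain-specific Yes/No QA task. On the BioASQ challenge tasks, ensembles outperform individual LLMs and in some cases rival or surpass domain-tuned systems, while preserving generalizability and avoiding costly fine-tuning or labeled data. The method aggregates outputs from multiple LLM variants, including models from Anthropic and Google. The study also finds a relationship between context length and performance: expanded contexts are meant to provide valuable evidence, but they risk information dilution and model disorientation. Precise, focused retrieval therefore remains essential in Retrieval-Augmented Generation (RAG) pipelines, and ensemble-based zero-shot approaches paired with effective RAG are a practical, scalable alternative to domain-tuned systems.

Key Takeaways

LLMs can be used for information retrieval (IR) in biomedical question answering (QA).
An ensemble of zero-shot models can achieve state-of-the-art performance on a domain-specific Yes/No QA task.
On the BioASQ challenge tasks, ensembles outperform individual LLMs.
In some cases, ensembles rival or surpass domain-tuned systems.
Context length and LLM performance are related.
Expanded contexts risk information dilution and model disorientation.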

Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model

Authors:Yu Cheng Chih, Yong Hao Hou

Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. Most recent instruction-tuning studies focus on seven-billion-parameter or larger models, leaving limited evidence on whether much smaller models can work reliably under low-resource, multi-task conditions. This work presents ETLCH, a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition. Despite its small scale, ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained environments.
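Low-rank adaptation, the fine-tuning technique named in the abstract, keeps a pretrained weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / r) * B A. The sketch below shows that arithmetic in plain Python; the rank, scaling factor, and matrix sizes are illustrative assumptions, since the abstract does not give ETLCH's adapter configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank.

    w is d_out x d_in (frozen), b is d_out x r and a is r x d_in (trainable).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, r):
    """Number of adapter parameters for one adapted matrix."""
    return r * (d_out + d_in)


# Hypothetical sizes: one 4096 x 4096 projection adapted with rank 8.
print(trainable_params(4096, 4096, 8), "adapter parameters vs", 4096 * 4096, "frozen")
```

Because only B and A are trained, the number of updated parameters grows with r * (d_out + d_in) rather than d_out * d_in, which is why a few hundred samples per task can be enough to fit the adapter.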

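Paper and project links

PDF 13 pages, 8 figures, includes experiments on JSON extraction, knowledge graph extraction, and NER

Summary

The paper presents ETLCH, a billion-parameter LLaMA-based model for structured data extraction under low-resource, multi-task conditions, covering JSON extraction, knowledge graph extraction, and named entity recognition. The model is fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task. Despite its small scale, it outperforms strong baselines on most evaluation metrics, with substantial gains even at the lowest data scale. The findings indicate that well-tuned small models can deliver stable, accurate structured outputs at a fraction of the computational cost, enabling cost-effective information extraction pipelines in resource-constrained environments.

Key Takeaways

ETLCH is a billion-parameter LLaMA-based model for structured data extraction under low-resource, multi-task conditions.
It is fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task.
It targets JSON extraction, knowledge graph extraction, and named entity recognition.
ETLCH outperforms strong baselines on most evaluation metrics, with substantial gains even at the lowest data scale.
Well-tuned small models can support cost-effective, reliable information extraction pipelines in resource-constrained environments.
This lowers the cost of deploying LLM-based extraction, particularly for smaller teams.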

MM-DREX: Multimodal-Driven Dynamic Routing of LLM Experts for Financial Trading

Authors:Yang Chen, Yueheng Jiang, Zhaozhao Ma, Yuchen Cao, Jacky Keung, Kun Kuang, Leilei Gan, Yiquan Wu, Fei Wu

The inherent non-stationarity of financial markets and the complexity of multi-modal information pose significant challenges to existing quantitative trading models. Traditional methods relying on fixed structures and unimodal data struggle to adapt to market regime shifts, while large language model (LLM)-driven solutions - despite their multi-modal comprehension - suffer from static strategies and homogeneous expert designs, lacking dynamic adjustment and fine-grained decision mechanisms. To address these limitations, we propose MM-DREX: a Multimodal-driven, Dynamically-Routed EXpert framework based on large language models. MM-DREX explicitly decouples market state perception from strategy execution to enable adaptive sequential decision-making in non-stationary environments. Specifically, it (1) introduces a vision-language model (VLM)-powered dynamic router that jointly analyzes candlestick chart patterns and long-term temporal features to allocate real-time expert weights; (2) designs four heterogeneous trading experts (trend, reversal, breakout, positioning) generating specialized fine-grained sub-strategies; and (3) proposes an SFT-RL hybrid training paradigm to synergistically optimize the router’s market classification capability and experts’ risk-adjusted decision-making. Extensive experiments on multi-modal datasets spanning stocks, futures, and cryptocurrencies demonstrate that MM-DREX significantly outperforms 15 baselines (including state-of-the-art financial LLMs and deep reinforcement learning models) across key metrics: total return, Sharpe ratio, and maximum drawdown, validating its robustness and generalization. Additionally, an interpretability module traces routing logic and expert behavior in real time, providing an audit trail for strategy transparency.
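The routing step, in which a router allocates real-time weights across the four experts, can be sketched as a softmax over router scores followed by a weighted combination of expert signals. The score and signal formats, and the use of a softmax, are assumptions for illustration; the abstract does not specify how MM-DREX computes or applies its weights.

```python
import math

EXPERTS = ("trend", "reversal", "breakout", "positioning")


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, expert_signals):
    """Turn router scores into expert weights and blend the experts' signals.

    router_scores: hypothetical per-expert scores from the router.
    expert_signals: hypothetical target positions in [-1, 1], one per expert.
    """
    weights = dict(zip(EXPERTS, softmax([router_scores[e] for e in EXPERTS])))
    position = sum(weights[e] * expert_signals[e] for e in EXPERTS)
    return weights, position


# Hypothetical market state in which the router favors the trend expert.
scores = {"trend": 2.0, "reversal": 0.0, "breakout": 1.0, "positioning": 0.0}
signals = {"trend": 1.0, "reversal": -1.0, "breakout": 0.5, "positioning": 0.0}
weights, position = route(scores, signals)
print(weights, position)
```

Keeping the weights separate from the experts' outputs mirrors the decoupling of market state perception from strategy execution described in the abstract, and logging the weights per step is one simple way to obtain an audit trail.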

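Paper and project links

PDF

Summary

The non-stationarity of financial markets and the complexity of multi-modal information challenge existing quantitative trading models. Traditional methods struggle to adapt to market regime shifts, while LLM-driven solutions, despite their multi-modal comprehension, rely on static strategies and homogeneous expert designs. MM-DREX is a Multimodal-driven, Dynamically-Routed EXpert framework that decouples market state perception from strategy execution. A vision-language model (VLM)-powered dynamic router jointly analyzes candlestick chart patterns and long-term temporal features to allocate real-time expert weights; four heterogeneous trading experts (trend, reversal, breakout, positioning) generate specialized sub-strategies; and an SFT-RL hybrid training paradigm optimizes the router's market classification and the experts' risk-adjusted decision-making. On multi-modal datasets spanning stocks, futures, and cryptocurrencies, MM-DREX significantly outperforms 15 baselines on total return, Sharpe ratio, and maximum drawdown. An interpretability module traces routing logic and expert behavior in real time, providing an audit trail.

Key Takeaways

The non-stationarity of financial markets and the complexity of multi-modal information challenge quantitative trading models.
Traditional methods and existing LLM-driven solutions adapt poorly to market shifts and lack dynamic adjustment and fine-grained decision mechanisms.
MM-DREX, built on large language models, decouples market state perception from strategy execution.
A VLM-powered dynamic router analyzes candlestick chart patterns and long-term temporal features to allocate real-time expert weights.
Four heterogeneous trading experts (trend, reversal, breakout, positioning) generate specialized fine-grained sub-strategies.
An SFT-RL hybrid training paradigm jointly optimizes the router's market classification and the experts' risk-adjusted decision-making.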

Subjective Behaviors and Preferences in LLM: Language of Browsing

Authors:Sai Sundaresan, Harshita Chopra, Atanu R. Sinha, Koustava Goswami, Nagasai Saketh Naidu, Raghav Karan, N Anushka

A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user’s self-constructed “language”, albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the “language of browsing” better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users’ heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.
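A page-level tokenizer treats each visited page as one token, so a browsing session becomes a short sequence of page ids rather than a string of sub-word pieces. The class below is a minimal sketch of that idea; the class name, the unknown-page handling, and the example page names are assumptions, since the abstract does not describe the tokenizer's details.

```python
class PageTokenizer:
    """Assigns one integer token id to each distinct page identifier."""

    def __init__(self, sessions, unk="<unk>"):
        self.unk = unk
        self.vocab = {unk: 0}
        for session in sessions:
            for page in session:
                # setdefault keeps the first id assigned to a page.
                self.vocab.setdefault(page, len(self.vocab))

    def encode(self, session):
        """Map a session (list of page identifiers) to token ids; unseen pages map to <unk>."""
        return [self.vocab.get(page, self.vocab[self.unk]) for page in session]


# Hypothetical browsing logs from two users.
logs = [["home", "search", "product"], ["home", "cart"]]
tokenizer = PageTokenizer(logs)
print(tokenizer.encode(["home", "cart", "checkout"]))
```

With such a vocabulary, clusterwise training as described in the abstract would amount to fitting one small LM per user cluster on the encoded sessions of that cluster.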

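Paper and project links

PDF Accepted at EMNLP 2025

Summary

LLMs are versatile across domains and tasks and are said to benefit users with a wide variety of behaviors and preferences. The paper questions this when behaviors and preferences are inherently subjective, as in users' idiosyncratic browsing of websites or apps, whose sequential page logs form something like a self-constructed "language" without the structure and grammar of natural language. The authors introduce HeTLM (Heterogeneity aware Training of Language Model), a clusterwise LM training method suited to subjective behaviors. They find that a small LM trained with a page-level tokenizer outperforms large pretrained or finetuned LMs; that HeTLM with cluster-specific parameter sets outperforms a single LM of the same family when the number of parameters is controlled; and that generation shows a higher mean and lower variance, implying improved alignment.

Key Takeaways

LLMs are claimed to serve users with a wide variety of behaviors and preferences.
Users' browsing of websites or apps is inherently subjective and idiosyncratic.
The sequential page logs produced by browsing can be viewed as each user's self-constructed "language of browsing".
A small LM trained with a page-level tokenizer outperforms large pretrained or finetuned LMs.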

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Authors:Jungkoo Kang

Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. These findings underscore the importance of understanding error sources within LLM reasoning as systems scale to more complex tasks. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
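Translating a structured JSON problem into PDDL can be sketched as a small rendering function. The JSON field names (`name`, `domain`, `objects`, `init`, `goal`) and the example problem are hypothetical; the abstract does not show NL2Flow's actual intermediate representation or its PDDL domain.

```python
import json


def to_pddl_problem(problem_json):
    """Render a JSON problem description as a PDDL problem definition.

    The schema used here is an assumption for illustration, not NL2Flow's format.
    """
    spec = json.loads(problem_json)
    objects = " ".join(spec["objects"])
    init = " ".join(f"({fact})" for fact in spec["init"])
    goal = " ".join(f"({fact})" for fact in spec["goal"])
    return (
        f"(define (problem {spec['name']})\n"
        f"  (:domain {spec['domain']})\n"
        f"  (:objects {objects})\n"
        f"  (:init {init})\n"
        f"  (:goal (and {goal})))"
    )


# A hypothetical workflow problem in the assumed JSON schema.
example = json.dumps({
    "name": "book-trip",
    "domain": "workflow",
    "objects": ["trip1", "card1"],
    "init": ["has-payment card1"],
    "goal": ["booked trip1"],
})
print(to_pddl_problem(example))
```

Having the LLM emit the JSON form and leaving the final rendering and planning to deterministic code is one way to realize the neuro-symbolic split the abstract reports as beneficial.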

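Paper and project links

PDF 31 pages, 7 figures

Summary

NL2Flow is a fully automated pipeline for generating and evaluating workflow planning problems. It generates problems parametrically in a structured intermediate representation and translates them into both natural language and formal PDDL. On a dataset of 2296 low-difficulty problems, the best-performing open-source, instruct-tuned LLM achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation depends on both model and prompt design. Translating natural language problems into a structured JSON representation before symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. As LLM reasoning scales to more complex tasks, understanding the sources of error will be crucial.

Key Takeaways

NL2Flow is a fully automated pipeline for generating and evaluating workflow planning problems for LLMs.
It generates problems parametrically and translates them into both natural language and PDDL.
The best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems).
The influence of problem characteristics on plan generation depends on both model and prompt design.
Translating natural language problems into structured JSON before symbolic planning significantly improved success rates.
The results suggest a benefit from neuro-symbolic integration in LLM planning.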

Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences

Authors:Mohammad Saqib Hasan, Saikat Chakraborty, Santu Karmaker, Niranjan Balasubramanian

LLM generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with a security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at https://github.com/StonyBrookNLP/disco-lpo.
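One way to picture a preference loss that focuses on localized regions is to compute a SimPO-style margin only over tokens flagged as security-related. The sketch below is an interpretation under stated assumptions, not the paper's LPO objective: it assumes the mask selects the tokens that contribute to the loss, it omits the regularizer, and the `beta` and `gamma` values are arbitrary.

```python
import math


def masked_avg_logprob(token_logprobs, mask):
    """Mean log-probability over the tokens selected by a 0/1 mask.

    Assumes the mask selects at least one token.
    """
    selected = [lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(selected) / len(selected)


def localized_preference_loss(win_lp, win_mask, lose_lp, lose_mask, beta=2.0, gamma=0.5):
    """Negative log-sigmoid of a length-normalized margin restricted to masked tokens.

    win_* describe the secure response, lose_* the insecure one (hypothetical inputs).
    """
    margin = beta * (masked_avg_logprob(win_lp, win_mask)
                     - masked_avg_logprob(lose_lp, lose_mask)) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Hypothetical per-token log-probabilities; masks flag the security-relevant tokens.
secure = ([-0.1, -0.2, -3.0], [1, 1, 0])
insecure = ([-0.1, -2.0, -2.5], [0, 1, 1])
print(localized_preference_loss(*secure, *insecure))
```

Tokens outside the mask, which are typically shared between the secure and insecure versions of the code, do not move the loss in this sketch; the regularizer mentioned in the abstract would be an additional term meant to protect overall code quality.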

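Paper and project links

PDF Accepted to ACL 2025 (Main)

Summary

LLM-generated code often contains security issues. The paper addresses two challenges in improving secure code generation. First, to obtain high-quality training data covering a broad set of security issues, the authors distill a preference dataset of insecure and secure code pairs from frontier LLMs, together with security reasoning that explains each issue and its fix, using security knowledge sources to devise a systematic prompting strategy. Second, because aligning models to secure code requires focusing on localized regions of code, and preference optimization methods such as SimPO are not designed for such localized differences, they propose a localized preference optimization algorithm that masks the security-related tokens in both the winning (secure) and losing (insecure) responses, with an added regularizer to prevent loss in code quality. Training on the dataset (DiSCo) and using the new algorithm (LPO) both yield substantial reductions in code insecurity while improving overall code quality.

Key Takeaways

LLM-generated code often contains security issues.
High-quality training data covering a broad set of security issues is critical for secure code generation.
The authors distill a preference dataset of insecure and secure code pairs, with security reasoning, from frontier LLMs.
Aligning models to secure code requires focusing on localized regions of code.
Preference optimization methods such as SimPO are not designed to handle these localized differences and turn out to be ineffective.
The proposed localized preference optimization algorithm masks security-related tokens in both responses and adds a regularizer to preserve code quality.
Both training on DiSCo and using LPO substantially reduce code insecurity while improving overall code quality.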

Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations

Authors:Daniele Barolo, Chiara Valentin, Fariba Karimi, Luis Galárraga, Gonzalo G. Méndez, Lisette Espín-Noboa

This paper evaluates the performance of six open-weight LLMs (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, llama3.1-70b) in recommending experts in physics across five tasks: top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts. The evaluation examines consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex, we establish scholarly benchmarks by comparing model outputs to real-world academic records. Our analysis reveals inconsistencies and biases across all models. mixtral-8x7b produces the most stable outputs, while llama3.1-70b shows the highest variability. Many models exhibit duplication, and some, particularly gemma2-9b and llama3.1-8b, struggle with formatting errors. LLMs generally recommend real scientists, but accuracy drops in field-, epoch-, and seniority-specific queries, consistently favoring senior scholars. Representation biases persist, replicating gender imbalances (reflecting male predominance), under-representing Asian scientists, and over-representing White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, reinforcing the rich-get-richer effect while offering limited geographical representation. These findings highlight the need to improve LLMs for more reliable and equitable scholarly recommendations.
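Output consistency and duplication, two of the properties the audit examines, can be quantified with simple set-based measures. The metrics below (mean pairwise Jaccard similarity across repeated runs, and the share of repeated names within one list) are assumptions for illustration; the abstract does not state which measures the paper uses, and the names are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two name lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def consistency(runs):
    """Mean pairwise Jaccard similarity across repeated runs of the same prompt.

    Assumes at least two runs.
    """
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def duplicate_rate(names):
    """Share of entries in one recommendation list that repeat an earlier name."""
    return 1 - len(set(names)) / len(names)


# Placeholder names standing in for three runs of one recommendation prompt.
runs = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"]]
print(consistency(runs), duplicate_rate(["A", "A", "B", "C"]))
```

A model whose runs score close to 1.0 on the first measure returns nearly the same scholars every time, which is the kind of stability the abstract reports for mixtral-8x7b.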

论文及项目相关链接

PDF 40 pages: 10 main (incl. 9 figures), 3 references, and 27 appendix. Paper under-review

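Summary:

This paper evaluates six open-weight large language models (LLMs) of different sizes on recommending experts in physics. Across five tasks (top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts), it assesses consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex as the benchmark, model outputs are compared with real-world academic records, and every model shows inconsistencies and biases. mixtral-8x7b produces the most stable outputs, while llama3.1-70b is the most variable. Many models return duplicate recommendations, and some (notably gemma2-9b and llama3.1-8b) make formatting errors. The recommended scholars are usually real people, but accuracy drops in field-, epoch-, and seniority-specific queries, and the models tend to favor senior scholars. Representation biases persist: outputs replicate gender imbalance, under-represent Asian scientists, and over-represent White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, which reinforces the rich-get-richer effect and leaves geographical representation limited. These findings underline the need to improve LLMs for more reliable and equitable scholarly recommendations.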

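Key Takeaways:

Six open-weight LLMs are evaluated on recommending experts in physics.
Model behavior is examined across five tasks that recommend experts by different criteria.
All models show inconsistencies and biases related to gender, ethnicity, academic popularity, and scholar similarity.
mixtral-8x7b produces the most stable outputs, while llama3.1-70b is the most variable.
Some models produce duplicate recommendations and formatting errors.
LLMs generally recommend real scholars, but accuracy drops in field-, epoch-, and seniority-specific queries.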
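
One way to quantify the output consistency discussed above is the mean pairwise Jaccard similarity between the sets of names returned across repeated runs of the same prompt. The following is a minimal sketch only: the choice of metric, the function names, and the placeholder scholar names are illustrative assumptions, not the paper's actual evaluation protocol.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: list[list[str]]) -> float:
    """Average Jaccard similarity over all pairs of runs (hypothetical consistency score)."""
    # Normalize names and collapse duplicates within a run by converting to a set.
    sets = [{name.strip().lower() for name in run} for run in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up outputs from three runs of the same prompt (placeholder names).
runs = [
    ["Scholar A", "Scholar B", "Scholar C"],
    ["Scholar A", "Scholar B", "Scholar D"],
    ["Scholar A", "Scholar B", "Scholar C"],
]
score = mean_pairwise_jaccard(runs)
print(f"mean pairwise Jaccard: {score:.3f}")
```

A score near 1 means the model returns nearly the same names on every run; lower scores indicate more variable outputs. Because each run is converted to a set, duplicated names within a run are not penalized here and would need a separate check.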


MPO: Boosting LLM Agents with Meta Plan Optimization

Authors:Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li

Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent’s task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previously unseen scenarios.


Paper and project links

PDF EMNLP 2025 Findings

Summary

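LLM-based agents can now tackle interactive planning tasks, but existing approaches still suffer from planning hallucinations and require retraining for each new agent. The Meta Plan Optimization (MPO) framework addresses this by directly incorporating explicit guidance into agent planning. MPO supplies high-level, general guidance through meta plans and continuously optimizes those meta plans based on feedback from the agent's task execution. Experiments on two representative tasks show that MPO significantly outperforms existing baselines and offers a plug-and-play solution that improves both task completion efficiency and generalization to previously unseen scenarios.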

Key Takeaways

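LLM-based agents can handle interactive planning tasks but still suffer from planning hallucinations.
The Meta Plan Optimization (MPO) framework targets these problems by directly incorporating explicit guidance to strengthen agent planning.
MPO uses meta plans to provide high-level, general guidance for agent planning.
MPO continuously optimizes the meta plans based on feedback from the agent's task execution.
Experiments show that MPO significantly outperforms existing baselines on two representative tasks.
MPO is a plug-and-play solution that improves task completion efficiency and generalization to unseen scenarios.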


Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension

Authors:Yulong Wu, Viktor Schlegel, Riza Batista-Navarro

As neural language models achieve human-comparable performance on Machine Reading Comprehension (MRC) and see widespread adoption, ensuring their robustness in real-world scenarios has become increasingly important. Current robustness evaluation research, though, primarily develops synthetic perturbation methods, leaving unclear how well they reflect real-life scenarios. Considering this, we present a framework to automatically examine MRC models on naturally occurring textual perturbations, by replacing paragraphs in MRC benchmarks with their counterparts based on available Wikipedia edit history. Such perturbation type is natural as its design does not stem from an artificial generative process, inherently distinct from the previously investigated synthetic approaches. In a large-scale study encompassing SQUAD datasets and various model architectures we observe that natural perturbations result in performance degradation in pre-trained encoder language models. More worryingly, these state-of-the-art Flan-T5 and Large Language Models (LLMs) inherit these errors. Further experiments demonstrate that our findings generalise to natural perturbations found in other more challenging MRC benchmarks. In an effort to mitigate these errors, we show that it is possible to improve the robustness to natural perturbations by training on naturally or synthetically perturbed examples, though a noticeable gap still remains compared to performance on unperturbed data.


Paper and project links

PDF

Summary

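As neural models reach human-comparable performance on Machine Reading Comprehension (MRC) and are widely adopted, ensuring their robustness in real-world scenarios becomes essential. Existing robustness evaluations rely mainly on synthetic perturbation methods, and it is unclear how well these reflect real-world conditions. The paper therefore proposes a framework that automatically evaluates MRC models on naturally occurring textual perturbations by replacing paragraphs in MRC benchmarks with their counterparts drawn from Wikipedia edit history. Experiments show that natural perturbations degrade the performance of pre-trained encoder language models, and that state-of-the-art Flan-T5 and large language models (LLMs) inherit these errors. Training on naturally or synthetically perturbed examples improves robustness to natural perturbations, but a noticeable gap remains relative to performance on unperturbed data.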

Key Takeaways

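Neural models perform strongly on Machine Reading Comprehension (MRC), which makes robustness evaluation in real-world scenarios increasingly important.
Existing robustness evaluations rely mainly on synthetic perturbations, and how well these match real-world scenarios is unclear.
The paper proposes a natural-perturbation framework based on Wikipedia edit history for evaluating the robustness of MRC models.
Natural perturbations degrade the performance of pre-trained encoder language models, and state-of-the-art Flan-T5 and large language models (LLMs) inherit these errors.
Training on naturally or synthetically perturbed examples improves robustness to natural perturbations, but a gap to performance on unperturbed data remains.
The natural-perturbation framework helps expose model weaknesses under realistic conditions and points to directions for improvement.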
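
The performance degradation described above can be measured as the difference in answer quality between original and perturbed passages. The sketch below uses a SQuAD-style token-level F1; it assumes the gold answers stay valid after perturbation, and the function names and example strings are hypothetical rather than taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def robustness_gap(original_preds, perturbed_preds, references) -> float:
    """Mean F1 on original passages minus mean F1 on perturbed passages."""
    n = len(references)
    f1_orig = sum(token_f1(p, r) for p, r in zip(original_preds, references)) / n
    f1_pert = sum(token_f1(p, r) for p, r in zip(perturbed_preds, references)) / n
    return f1_orig - f1_pert


# Illustrative answers only: the model is right on both original passages
# but changes one answer after the passage is perturbed.
references = ["the Eiffel Tower", "1889"]
original_preds = ["Eiffel Tower", "1889"]
perturbed_preds = ["Eiffel Tower", "1887"]
gap = robustness_gap(original_preds, perturbed_preds, references)
print(f"F1 drop under perturbation: {gap:.2f}")
```

A positive gap indicates that the model answers less accurately once the passage is replaced by its edited counterpart.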


Investigating Compositional Reasoning in Time Series Foundation Models

Authors:Willa Potosnak, Cristian Challu, Mononito Goswami, Kin G. Olivares, Michał Wiliński, Nina Żukowska, Artur Dubrawski

Large pre-trained time series foundation models (TSFMs) have demonstrated promising zero-shot performance across a wide range of domains. However, a question remains: Do TSFMs succeed by memorizing patterns in training data, or do they possess the ability to reason about such patterns? While reasoning is a topic of great interest in the study of Large Language Models (LLMs), it is undefined and largely unexplored in the context of TSFMs. In this work, inspired by language modeling literature, we formally define compositional reasoning in forecasting and distinguish it from in-distribution generalization. We evaluate the reasoning and generalization capabilities of 16 popular deep learning forecasting models on multiple synthetic and real-world datasets. Additionally, through controlled studies, we systematically examine which design choices in 7 popular open-source TSFMs contribute to improved reasoning capabilities. Our study yields key insights into the impact of TSFM architecture design on compositional reasoning and generalization. We find that patch-based Transformers have the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and 86% smaller in terms of the number of trainable parameters. Interestingly, in some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer model performance.


Paper and project links

PDF

Summary

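Whether time series foundation models (TSFMs) succeed by memorizing patterns in training data or by reasoning about those patterns remains an open question. This work formally defines compositional reasoning in forecasting and distinguishes it from in-distribution generalization. It evaluates 16 popular deep learning forecasting models on multiple synthetic and real-world datasets and runs controlled studies on design choices in 7 open-source TSFMs. Patch-based Transformers show the best reasoning performance, closely followed by residualized MLP-based architectures, which need 97% fewer FLOPs and have 86% fewer trainable parameters. In some zero-shot out-of-distribution scenarios, these models outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, have a significant (negative) impact on Transformer performance.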

Key Takeaways

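Time series foundation models (TSFMs) show promising zero-shot performance across many domains.
It is still unclear whether TSFMs succeed through memorization or through reasoning.
The study defines compositional reasoning in forecasting and evaluates the reasoning and generalization abilities of 16 deep learning forecasting models.
Patch-based Transformers reason best, closely followed by residualized MLP-based architectures that are far cheaper in FLOPs and parameters.
In some zero-shot out-of-distribution scenarios, these models outperform moving average and exponential smoothing baselines trained on in-distribution data.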
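
For readers unfamiliar with the statistical baselines named above, the following is a minimal sketch of one-step-ahead moving average and simple exponential smoothing forecasts. The window size, the smoothing factor `alpha`, and the sample series are arbitrary illustrative choices; the paper's actual baseline configuration is not given in the abstract.

```python
def moving_average_forecast(series: list[float], window: int) -> float:
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def exponential_smoothing_forecast(series: list[float], alpha: float) -> float:
    """Simple exponential smoothing: level = alpha * y + (1 - alpha) * previous level."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialize the level with the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level  # the final level is the one-step-ahead forecast


history = [10.0, 12.0, 14.0, 16.0]
ma = moving_average_forecast(history, window=2)
ses = exponential_smoothing_forecast(history, alpha=0.5)
print(ma, ses)
```

Both baselines lag behind the upward trend in this toy series, since neither one models a trend component.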


Understanding Museum Exhibits using Vision-Language Reasoning

Authors:Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models excel in queries requiring deeper historical context and reasoning. We further demonstrate the necessity of fine-tuning models on large-scale domain-specific datasets by showing that our fine-tuned models significantly outperform current SOTA VLMs in answering questions related to specific attributes, highlighting their limitations in handling complex, nuanced queries.


Paper and project links

PDF Accepted at ICCV 2025

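Summary

Museums are repositories of cultural heritage and historical artifacts, and their well-documented collections hold vast knowledge that can train specialized models once structured into large-scale datasets. Visitors approach exhibits with curiosity and questions, so domain-specific expert models are needed for interactive query resolution and historical insight. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge. The authors collect a dataset of 65M images and 200M question-answer pairs, train large vision-language models (VLMs) on it, and benchmark them on five visual question answering tasks. The complete dataset is labeled by museum experts, which ensures label quality and practical relevance. Two types of VLM are trained: BLIP, which has vision-language aligned embeddings but lacks the expressive power of large language models, and LLaVA, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Experiments show that both model types answer visually grounded questions effectively, while large vision-language models do better on queries that need deeper historical context and reasoning. The fine-tuned models significantly outperform current SOTA VLMs on questions about specific attributes, which demonstrates the need for fine-tuning on large-scale domain-specific data and exposes the limits of existing VLMs on complex, nuanced queries.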

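Key Takeaways

Museum collections hold a large amount of knowledge, and structuring them into large-scale datasets makes it possible to train specialized models.
Visitors' curiosity and questions call for domain-specific expert models for interactive query resolution and historical insight.
Understanding museum exhibits from images requires vision-language models (VLMs) that link visual features to historical knowledge.
A large-scale dataset of 65M images and 200M question-answer pairs was collected to train VLMs.
Museum experts labeled the dataset, which ensures data quality.
Comparing BLIP and LLaVA shows that large vision-language models handle queries needing deeper historical context and reasoning better.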


Flash STU: Fast Spectral Transform Units

Authors:Y. Isabel Liu, Windsor Nguyen, Yagiz Devre, Evan Dogariu, Anirudha Majumdar, Elad Hazan

Recent advances in state-space model architectures have shown great promise for efficient sequence modeling, but challenges remain in balancing computational efficiency with model expressiveness. We propose the Flash STU architecture, a hybrid model that interleaves spectral state space model layers with sliding window attention, enabling scalability to billions of parameters for language modeling while maintaining a near-linear time complexity. We evaluate the Flash STU and its variants on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. We find that, given a fixed parameter budget, the Flash STU architecture consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.


Paper and project links

PDF

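Summary

Recent advances in state-space model architectures show great promise for efficient sequence modeling, but balancing computational efficiency with model expressiveness remains a challenge. The Flash STU architecture is a hybrid model that interleaves spectral state space model layers with sliding window attention, scaling to billions of parameters for language modeling while keeping near-linear time complexity. Flash STU and its variants are evaluated on diverse sequence prediction tasks, including linear dynamical systems, robotics control, and language modeling. Given a fixed parameter budget, Flash STU consistently outperforms the Transformer and other leading state-space models such as S4 and Mamba-2.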

Key Takeaways

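Recent advances in state-space model architectures show promise for efficient sequence modeling.
Flash STU is a hybrid model that combines spectral state space model layers with sliding window attention.
Flash STU scales to billions of parameters while keeping near-linear time complexity.
Flash STU is evaluated on diverse sequence prediction tasks: linear dynamical systems, robotics control, and language modeling.
Under a fixed parameter budget, Flash STU outperforms the Transformer.
Flash STU also outperforms other leading state-space models such as S4 and Mamba-2.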
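
To illustrate why sliding window attention keeps cost close to linear in sequence length, the sketch below builds a causal sliding-window mask and counts the attended query-key pairs. It is an assumption-laden toy: it treats the window as causal and as including the current token, it does not model the spectral state space layers, and it is not the Flash STU implementation.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that attention has to score."""
    return sum(sum(row) for row in mask)


mask = sliding_window_causal_mask(seq_len=6, window=3)
full = sliding_window_causal_mask(seq_len=6, window=6)  # plain causal attention

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(attended_pairs(mask), "pairs with window 3 vs", attended_pairs(full), "with full causal attention")
```

With a fixed window `w`, each query scores at most `w` keys, so the total work is bounded by `seq_len * w` instead of growing quadratically with `seq_len`.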


Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Authors:Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou

We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.


Paper and project links

PDF ICLR 2025

Summary

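Show-o is a single transformer that unifies multimodal understanding and generation. It combines autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The model supports a wide range of vision-language tasks, including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it performs comparably to or better than existing individual models with an equivalent or larger number of parameters, which highlights its potential as a next-generation foundation model. Code and models are released at https://github.com/showlab/Show-o.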

Key Takeaways

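Show-o is a unified model for multimodal understanding and generation that combines autoregressive and discrete diffusion modeling.
The model adaptively handles inputs and outputs of various and mixed modalities.
Show-o supports a wide range of vision-language tasks, such as visual question answering and text-to-image generation.
Across multiple benchmarks, Show-o performs comparably to or better than existing individual models.
Show-o shows potential as a next-generation foundation model.
Code and models are publicly released.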


Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective

Authors:Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, Ge Yu

Affective Computing (AC) integrates computer science, psychology, and cognitive science to enable machines to recognize, interpret, and simulate human emotions across domains such as social media, finance, healthcare, and education. AC commonly centers on two task families: Affective Understanding (AU) and Affective Generation (AG). While fine-tuned pre-trained language models (PLMs) have achieved solid AU performance, they often generalize poorly across tasks and remain limited for AG, especially in producing diverse, emotionally appropriate responses. The advent of Large Language Models (LLMs) (e.g., ChatGPT and LLaMA) has catalyzed a paradigm shift by offering in-context learning, broader world knowledge, and stronger sequence generation. This survey presents an NLP-oriented overview of AC in the LLM era. We (i) consolidate traditional AC tasks and preliminary LLM-based studies; (ii) review adaptation techniques that improve AU/AG, including Instruction Tuning (full and parameter-efficient methods such as LoRA, P-/Prompt-Tuning), Prompt Engineering (zero/few-shot, chain-of-thought, agent-based prompting), and Reinforcement Learning. For the latter, we summarize RL from human preferences (RLHF), verifiable/programmatic rewards (RLVR), and AI feedback (RLAIF), which provide preference- or rule-grounded optimization signals that can help steer AU/AG toward empathy, safety, and planning, achieving finer-grained or multi-objective control. To assess progress, we compile benchmarks and evaluation practices for both AU and AG. We also discuss open challenges, from ethics, data quality, and safety to robust evaluation and resource efficiency, and outline research directions. We hope this survey clarifies the landscape and offers practical guidance for building affect-aware, reliable, and responsible LLM systems.


Paper and project links

PDF Compared with the previous version, reinforcement learning has been added (as a new section), including RLHF, RLVR, and RLAIF

Summary

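This survey reviews progress in Affective Computing (AC) in the era of large language models (LLMs). It outlines what AC is and its two task families: Affective Understanding (AU) and Affective Generation (AG). The arrival of LLMs such as ChatGPT and LLaMA has triggered a paradigm shift in the field. The survey reviews adaptation techniques that improve AU/AG, including instruction tuning, prompt engineering, and reinforcement learning. For reinforcement learning, it covers RLHF, RLVR, and RLAIF, which supply preference- or rule-grounded optimization signals for finer-grained or multi-objective control of affective understanding and generation. It also compiles benchmarks and evaluation practices for AU and AG, discusses open challenges, and offers practical guidance for building affect-aware, reliable, and responsible LLM systems.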

Key Takeaways

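Affective Computing (AC) combines computer science, psychology, and cognitive science so that machines can recognize, interpret, and simulate human emotions.
AC centers on two task families: Affective Understanding (AU) and Affective Generation (AG).
Large language models (LLMs) have shifted the field by offering in-context learning, broader world knowledge, and stronger sequence generation.
Adaptation techniques that improve AU/AG include instruction tuning, prompt engineering, and reinforcement learning.
Reinforcement learning from human preferences, verifiable rewards, and AI feedback (RLHF, RLVR, RLAIF) can help achieve finer-grained or multi-objective control.
The survey compiles benchmarks and evaluation practices for AU and AG and discusses open challenges.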
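
Among the parameter-efficient instruction-tuning methods the survey lists, LoRA can be summarized in a few lines: the frozen weight `W` is augmented with a trainable low-rank product scaled by `alpha / r`. The sketch below is a dependency-free illustration of that general formula with made-up toy matrices and layer sizes; it is not code from the survey, and real implementations operate on tensors in a deep learning framework.

```python
def matvec(m: list[list[float]], v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float) -> list[float]:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


def trainable_params(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Trainable parameter counts for (full fine-tuning, LoRA adapters)."""
    return d_out * d_in, r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # r x d_in, here r = 1
B = [[0.5], [0.0]]            # d_out x r
y = lora_forward(W, A, B, x=[2.0, 3.0], alpha=2.0)

# Hypothetical layer size, chosen only to show the scale of the saving.
full, lora = trainable_params(d_out=4096, d_in=4096, r=8)
print(y, full, lora)
```

For the hypothetical 4096 x 4096 layer, the rank-8 adapter trains 65,536 parameters instead of roughly 16.8 million, which is the main appeal of the method for task-specific affective tuning.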


Baba Is AI: Break the Rules to Beat the Benchmark

Authors:Nathan Cloos, Meagan Jens, Michelangelo Naim, Yen-Ling Kuo, Ignacio Cases, Andrei Barbu, Christopher J. Cueva

Humans solve problems by following existing rules and procedures, and also by leaps of creativity to redefine those rules and objectives. To probe these abilities, we developed a new benchmark based on the game Baba Is You where an agent manipulates both objects in the environment and rules, represented by movable tiles with words written on them, to reach a specified goal and win the game. We test three state-of-the-art multi-modal large language models (OpenAI GPT-4o, Google Gemini-1.5-Pro and Gemini-1.5-Flash) and find that they fail dramatically when generalization requires that the rules of the game must be manipulated and combined.


Paper and project links

PDF 8 pages, 8 figures

Summary

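The paper introduces a new benchmark based on the game Baba Is You, in which an agent manipulates both the objects in the environment and the rules themselves, represented by movable tiles with words written on them, to reach a specified goal. Three state-of-the-art multi-modal large language models are tested, and they fail dramatically when generalization requires manipulating and combining the rules of the game. The benchmark thus probes both rule following and the creative redefinition of rules and objectives, two abilities that humans combine when solving problems.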

Key Takeaways

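The authors develop a new benchmark based on the game Baba Is You to probe how agents solve problems.
In the benchmark, an agent manipulates both objects and rules in the environment to reach a specified goal.
Three state-of-the-art multi-modal large language models are tested: OpenAI GPT-4o, Google Gemini-1.5-Pro, and Gemini-1.5-Flash.
These models fail dramatically when generalization requires manipulating and combining the rules.
The results indicate that current large language models have limited ability to manipulate and combine rules in such environments.
The benchmark tests not only rule following but also the ability to redefine rules and objectives.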


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-12/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.

