LLM

Published: 2025-11-05

Updated: 2025-11-27

Word count: 20.3k

Reading time: 82 min

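⚠️ All summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for screening papers before reading them.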

2025-11-05 Update

NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Authors:Wei Xu, Cheng Wang, Dingkang Liang, Zongchuang Zhao, Xingyu Jiang, Peng Zhang, Xiang Bai

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.
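
The abstract does not say which underwater imaging model the physical priors come from. As a hedged illustration only, the sketch below uses the commonly cited simplified formation model I = J*t + B*(1 - t), with transmission t = exp(-beta*depth), and inverts it per pixel. The function names and the parameters `beta`, `depth`, and `backscatter` are assumptions made for this example and are not taken from the NAUTILUS code. The VFE module is described as enhancing vision features, which this pixel-level toy does not attempt.

```python
import math

def degrade(clear, beta, depth, backscatter):
    """Apply the simplified model I = J*t + B*(1 - t), with t = exp(-beta*depth)."""
    t = math.exp(-beta * depth)
    return [j * t + backscatter * (1.0 - t) for j in clear]

def restore(observed, beta, depth, backscatter, t_min=1e-3):
    """Invert the same model: J = (I - B*(1 - t)) / t.

    t is clamped from below so that very low transmission does not blow up the division.
    """
    t = max(math.exp(-beta * depth), t_min)
    return [(i - backscatter * (1.0 - t)) / t for i in observed]

# Hypothetical pixel intensities in [0, 1].
clear_pixels = [0.9, 0.5, 0.1]
observed = degrade(clear_pixels, beta=0.4, depth=3.0, backscatter=0.3)
recovered = restore(observed, beta=0.4, depth=3.0, backscatter=0.3)
```

In this toy, degradation pulls every pixel toward the backscatter value, and restoration undoes it exactly only because the same `beta`, `depth`, and `backscatter` are known; in practice these would have to be estimated.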

Paper and project links

PDF Accepted to NeurIPS 2025. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS

Summary

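This paper studies underwater scene understanding methods aimed at automated underwater exploration. Progress has been held back by the lack of large-scale underwater multi-task instruction-tuning datasets. To fill this gap, the authors build NautData, a dataset of 1.45 M image-text pairs supporting eight underwater scene understanding tasks. To make underwater scene understanding more robust, they introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module that explicitly restores clear underwater information. Experiments show that the module improves both baselines on the majority of supported tasks, which the authors take as evidence of NAUTILUS's superiority in underwater scene understanding.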

Key Takeaways

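Underwater exploration has broad applications in resource exploration, national security, and other areas, and underwater scene understanding is key to automating it.
Current research on underwater scene understanding is limited by the lack of large-scale multi-task datasets.
NautData fills this gap and supports eight underwater scene understanding tasks.
Underwater image degradation is a widely recognized challenge that interferes with underwater tasks.
Physical priors derived from underwater imaging models help improve the robustness of underwater scene understanding.
The proposed plug-and-play vision feature enhancement (VFE) module explicitly restores clear underwater information and improves the baselines' performance.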

Un-Attributability: Computing Novelty From Retrieval & Semantic Similarity

Authors:Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh

Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
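
The two-stage pipeline and the novelty decision described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' code: mean-pooled cosine similarity stands in for the GIST embedding index, an averaged MaxSim late-interaction score stands in for ColBERTv2, and the decision rule (compare the output's best corpus score with the human reference's best corpus score) is one reading of the abstract. All function names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vec(token_vecs):
    """Pool token vectors into one vector (stand-in for a single-vector embedding)."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: each query token takes its best-matching doc token."""
    best = [max(cosine(q, d) for d in doc_tokens) for q in query_tokens]
    return sum(best) / len(best)

def nearest_score(query_tokens, corpus, top_n=2):
    # Stage 1: cheap single-vector retrieval of the top-n candidates.
    q = mean_vec(query_tokens)
    ranked = sorted(corpus, key=lambda doc: cosine(q, mean_vec(doc)), reverse=True)
    # Stage 2: rerank the candidates with the finer token-level score.
    return max(maxsim(query_tokens, doc) for doc in ranked[:top_n])

def is_novel(output_tokens, human_reference_tokens, corpus, top_n=2):
    """Novel if the output is less attributable to the corpus than the human reference."""
    return (nearest_score(output_tokens, corpus, top_n)
            < nearest_score(human_reference_tokens, corpus, top_n))

# Toy corpus: each document is a list of 2-D token vectors.
corpus = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
human_ref = [[1.0, 0.0], [0.8, 0.2]]
copied_output = [[1.0, 0.0], [0.9, 0.1]]   # identical to the first document
novel_output = [[0.7, 0.7], [-0.7, 0.7]]   # far from both documents

copied_is_novel = is_novel(copied_output, human_ref, corpus)
novel_is_novel = is_novel(novel_output, human_ref, corpus)
```

Using the human reference as a threshold means the cut-off adapts to how attributable ordinary human text is, instead of relying on a fixed similarity value.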

Paper and project links

PDF

Summary

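The paper studies how language-model outputs relate to the pretraining corpus and proposes un-attributability as a measure of the semantic novelty of model outputs. Using a two-stage retrieval pipeline (an index built on GIST embeddings, followed by ColBERTv2 reranking), the authors find that models draw on pretraining data across much longer spans than previously reported. Some domains systematically promote or suppress novelty, and instruction tuning changes style and also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale.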

Key Takeaways

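The paper focuses on the relationship between language-model outputs and the pretraining corpus.
It proposes un-attributability as a measure of the semantic novelty of model outputs.
A two-stage pipeline (GIST-embedding retrieval, then ColBERTv2 reranking) shows that models draw on pretraining data across longer spans than previously reported.
Domains differ in their effect: some systematically promote novelty and others suppress it.
Instruction tuning changes a model's style and also increases the novelty of its outputs.
Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale.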

Kimi Linear: An Expressive, Efficient Attention Architecture

Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios – including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
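
The abstract describes KDA only in words: a delta-rule memory with finer-grained gating. The sketch below shows one recurrent step of a generic gated delta rule with a separate decay gate per key channel, S <- (I - beta k k^T) Diag(alpha) S + beta k v^T. This update form is an assumption chosen for illustration; it is not the paper's kernel, its chunkwise algorithm, or its DPLR variant, and the names `gated_delta_step` and `read` are hypothetical.

```python
def gated_delta_step(state, key, value, decay, beta):
    """One recurrent update of a key-to-value memory matrix.

    state: d_k x d_v list of lists; key: length d_k; value: length d_v;
    decay: length d_k, entries in [0, 1] (one gate per key channel);
    beta: write strength in [0, 1].
    """
    # Forget: scale each key channel (row) by its own gate.
    decayed = [[a * s for s in row] for a, row in zip(decay, state)]
    # Read what the decayed memory currently predicts for this key.
    predicted = [sum(k * row[j] for k, row in zip(key, decayed))
                 for j in range(len(value))]
    # Delta rule: move the stored value toward the target by beta * error.
    return [[s + beta * k * (v - p) for s, v, p in zip(row, value, predicted)]
            for k, row in zip(key, decayed)]

def read(state, query):
    """Retrieve the value associated with a query: o = S^T q."""
    return [sum(q * row[j] for q, row in zip(query, state))
            for j in range(len(state[0]))]

memory = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

memory = gated_delta_step(memory, key, [2.0, 3.0], decay=[1.0, 1.0], beta=1.0)
first_read = read(memory, key)

# Writing a new value under the same key replaces the old one instead of adding to it.
memory = gated_delta_step(memory, key, [5.0, 5.0], decay=[1.0, 1.0], beta=1.0)
overwritten_read = read(memory, key)

# With beta = 0 nothing is written; only the per-channel decay acts on the memory.
faded = gated_delta_step(memory, key, [0.0, 0.0], decay=[0.5, 1.0], beta=0.0)
faded_read = read(faded, key)
```

The state has a fixed size regardless of sequence length, which is why a linear attention layer of this kind needs no growing KV cache.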

Paper and project links

PDF Kimi Linear tech report

Summary

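This paper introduces Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, making more effective use of limited finite-state RNN memory. A specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices gives high hardware efficiency, cutting computation relative to the general DPLR formulation while staying more consistent with the classical delta rule. With an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin on all evaluated tasks, while reducing KV cache usage by up to 75% and reaching up to 6 times the decoding throughput at a 1M context. The results indicate that Kimi Linear can serve as a drop-in replacement for full attention architectures with better performance and efficiency, including on tasks with longer input and output lengths.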

Key Takeaways

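Kimi Linear is a hybrid linear attention architecture that outperforms full attention across short-context, long-context, and RL scaling regimes.
Its core is the Kimi Delta Attention (KDA) module, which extends Gated DeltaNet with a finer-grained gating mechanism.
A specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices yields high hardware efficiency by reducing computation.
With an identical training recipe, Kimi Linear outperforms full MLA.
Kimi Linear reduces KV cache usage by up to 75% and achieves up to 6 times the decoding throughput at a 1M context.
The gains extend to tasks with longer input and output lengths.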

Who Has The Final Say? Conformity Dynamics in ChatGPT’s Selections

Authors:Clarissa Sabrina Arlinghaus, Tristan Kenneweg, Barbara Hammer, Günter W. Maier

Large language models (LLMs) such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence. We conducted three preregistered conformity experiments with GPT-4o in a hiring context. In a baseline study, GPT consistently favored the same candidate (Profile C), reported moderate expertise (M = 3.01) and high certainty (M = 3.89), and rarely changed its choice. In Study 1 (GPT + 8), GPT faced unanimous opposition from eight simulated partners and almost always conformed (99.9%), reporting lower certainty and significantly elevated self-reported informational and normative conformity (p < .001). In Study 2 (GPT + 1), GPT interacted with a single partner and still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. Across studies, results demonstrate that GPT does not act as an independent observer but adapts to perceived social consensus. These findings highlight risks of treating LLMs as neutral decision aids and underline the need to elicit AI judgments prior to exposing them to human opinions.

Paper and project links

PDF 5 pages, 5 figures, HAI 2025: Workshop on Socially Aware and Cooperative Intelligent Systems

Summary

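This paper examines how susceptible large language models (LLMs) such as ChatGPT are to social influence as they become integrated into high-stakes decision-making. The authors ran three preregistered conformity experiments with GPT-4o in a hiring context. In the baseline study, GPT consistently favored the same candidate and rarely changed its choice. Facing unanimous opposition from eight simulated partners, it almost always conformed (99.9%), reporting lower certainty and significantly elevated informational and normative conformity. With a single partner, it still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity. GPT therefore does not act as an independent observer but adapts to perceived social consensus. This highlights the risk of treating LLMs as neutral decision aids and the need to elicit AI judgments before exposing them to human opinions.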

Key Takeaways

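LLMs such as ChatGPT are increasingly integrated into high-stakes decision-making, yet little is known about their susceptibility to social influence.
Three preregistered conformity experiments show that GPT adapts its choices under the influence of simulated partners.
Under unanimous opposition from eight simulated partners, GPT almost always conformed (99.9%), reporting lower certainty and significantly elevated informational and normative conformity.
With a single partner, GPT still conformed in 40.2% of disagreement trials, reporting less certainty and more normative conformity.
GPT does not act as an independent observer; it adapts to perceived social consensus.
Treating LLMs as neutral decision aids carries risk.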

Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games

Authors:Jingran Zhang, Ning Li, Justin Cui

OpenAI’s ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas’s web interaction capabilities using browser-based games as test scenarios, including Google’s T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.

Paper and project links

PDF

Summary

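OpenAI's ChatGPT Atlas brings new web interaction capabilities: it can analyze webpages, process user intents, and execute cursor and keyboard inputs directly in the browser. This study gives an early evaluation of those capabilities using browser-based games as test scenarios, including Google's T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. Atlas performs strongly on logical reasoning tasks such as Sudoku, completing puzzles significantly faster than human baselines, but struggles in real-time games that require precise timing and motor control, often failing to get past initial obstacles. Atlas thus has notable limitations in dynamic web environments that require real-time interaction.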

Key Takeaways

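ChatGPT Atlas can analyze webpages, process user intents, and execute inputs directly in the browser.
The study uses browser-based games to evaluate Atlas's web interaction capabilities.
Atlas performs strongly on logical reasoning tasks such as Sudoku.
It completes Sudoku puzzles significantly faster than human baselines.
In real-time games requiring precise timing and motor control, Atlas performs poorly and often fails to get past initial obstacles.
Atlas has notable limitations in dynamic web environments that require real-time interaction.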

ChatGPT in Systematic Investing – Enhancing Risk-Adjusted Returns with LLMs

Authors:Nikolas Anic, Andrea Barbon, Ralf Seiz, Carlo Zarattini

This paper investigates whether large language models (LLMs) can improve cross-sectional momentum strategies by extracting predictive signals from firm-specific news. We combine daily U.S. equity returns for S&P 500 constituents with high-frequency news data and use prompt-engineered queries to ChatGPT that inform the model when a stock is about to enter a momentum portfolio. The LLM evaluates whether recent news supports a continuation of past returns, producing scores that condition both stock selection and portfolio weights. An LLM-enhanced momentum strategy outperforms a standard long-only momentum benchmark, delivering higher Sharpe and Sortino ratios both in-sample and in a truly out-of-sample period after the model’s pre-training cut-off. These gains are robust to transaction costs, prompt design, and portfolio constraints, and are strongest for concentrated, high-conviction portfolios. The results suggest that LLMs can serve as effective real-time interpreters of financial news, adding incremental value to established factor-based investment strategies.
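
The two risk-adjusted metrics named above can be computed as follows. This is a minimal sketch using common textbook definitions; conventions vary (sample versus population deviation, choice of target return, annualization factor), and the abstract does not state which ones the authors use. The return series is made up for illustration.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Mean excess return divided by its sample standard deviation, annualized."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Mean excess return divided by downside deviation, annualized.

    Only returns below the target contribute to the denominator, so the
    series must contain at least one such return.
    """
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    return statistics.mean(excess) / downside * math.sqrt(periods_per_year)

# Hypothetical per-period returns; periods_per_year=1 leaves the ratios unannualized.
period_returns = [0.01, -0.02, 0.03, 0.00]
sharpe = sharpe_ratio(period_returns, periods_per_year=1)
sortino = sortino_ratio(period_returns, periods_per_year=1)
```

Sortino penalizes only downside moves, so a strategy whose volatility comes mostly from gains scores higher on Sortino than on Sharpe, as in this toy series.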

Paper and project links

PDF

Summary
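The paper investigates whether large language models (LLMs) can improve cross-sectional momentum strategies by extracting predictive signals from firm-specific news. It combines daily U.S. equity returns for S&P 500 constituents with high-frequency news data and uses prompt-engineered ChatGPT queries that tell the model when a stock is about to enter a momentum portfolio. The LLM evaluates whether recent news supports a continuation of past returns and produces scores that condition both stock selection and portfolio weights. The LLM-enhanced strategy outperforms a standard long-only momentum benchmark, with higher Sharpe and Sortino ratios both in-sample and in a truly out-of-sample period after the model's pre-training cut-off. The gains are robust and are strongest for concentrated, high-conviction portfolios. This suggests that LLMs can serve as effective real-time interpreters of financial news, adding incremental value to established factor-based investment strategies.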

Key Takeaways

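LLMs can improve cross-sectional momentum strategies by extracting signals from firm-specific news.
The study combines daily equity returns with high-frequency news data and queries ChatGPT with engineered prompts.
The LLM evaluates whether recent news supports a continuation of past returns, which conditions both stock selection and portfolio weights.
The LLM-enhanced momentum strategy outperforms a standard long-only momentum benchmark on both Sharpe and Sortino ratios.
The gains hold in a truly out-of-sample period after the model's pre-training cut-off.
The gains are robust to transaction costs, prompt design, and portfolio constraints.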

How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Authors:Samet Demir, Zafer Dogan

Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.

Paper and project links

PDF NeurIPS 2025, 24 pages, 6 figures

Summary

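This paper studies in-context learning (ICL) in pretrained Transformers on nonlinear tasks. It analyzes Transformers with nonlinear MLP heads trained on tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. Theory and experiments both indicate that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. The paper also analyzes data mixing effects, identifies key properties of high-quality data sources, and shows that feature learning emerges only when the task covariance exhibits sufficient structure. Finally, a multilingual sentiment analysis experiment illustrates how the findings extend to real-world cases.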

Key Takeaways

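Pretrained Transformers show remarkable in-context learning (ICL) capabilities, adapting to new tasks from demonstrations without parameter updates.
Nonlinear MLP heads meaningfully enhance ICL performance on nonlinear tasks.
Under high-dimensional asymptotics, models with nonlinear MLPs are equivalent in ICL error to structured polynomial predictors.
The data mixing analysis identifies key properties of high-quality data sources: low noise and structured covariances.
Feature learning emerges only when the task covariance exhibits sufficient structure.
Experiments across various activation functions, model sizes, and data distributions validate these findings.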

Standardization of Psychiatric Diagnoses – Role of Fine-tuned LLM Consortium and OpenAI-gpt-oss Reasoning LLM Enabled Decision Support System

Authors:Eranga Bandara, Ross Gore, Atmaram Yarlagadda, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty

The diagnosis of most mental disorders, including psychiatric evaluations, primarily depends on dialogues between psychiatrists and patients. This subjective process can lead to variability in diagnoses across clinicians and patients, resulting in inconsistencies and challenges in achieving reliable outcomes. To address these issues and standardize psychiatric diagnoses, we propose a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System for the clinical diagnosis of mental disorders. Our approach leverages fine-tuned LLMs trained on conversational datasets involving psychiatrist-patient interactions focused on mental health conditions (e.g., depression). The diagnostic predictions from individual models are aggregated through a consensus-based decision-making process, refined by the OpenAI-gpt-oss reasoning LLM. We propose a novel method for deploying LLM agents that orchestrate communication between the LLM consortium and the reasoning LLM, ensuring transparency, reliability, and responsible AI across the entire diagnostic workflow. Experimental results demonstrate the transformative potential of combining fine-tuned LLMs with a reasoning model to create a robust and highly accurate diagnostic system for mental health assessment. A prototype of the proposed platform, integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM, was developed in collaboration with the U.S. Army Medical Research Team in Norfolk, Virginia, USA. To the best of our knowledge, this work represents the first application of a fine-tuned LLM consortium integrated with a reasoning LLM for clinical mental health diagnosis paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.
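
The abstract says individual predictions are aggregated through a consensus-based process and then refined by a reasoning LLM, but it does not specify the rule. The sketch below assumes a plain majority vote with a minimum agreement threshold and flags cases without a clear winner for the reasoning step. The function name, the threshold, and the labels are hypothetical.

```python
from collections import Counter

def aggregate_diagnoses(predictions, min_agreement=2):
    """Return (label, needs_review) from a list of per-model diagnosis labels."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    ranked = Counter(predictions).most_common()
    top_label, top_votes = ranked[0]
    # A tie for first place means there is no consensus.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if top_votes >= min_agreement and not tied:
        return top_label, False
    # No clear consensus: hand the case to the reasoning step (hypothetical hand-off).
    return None, True

agreed = aggregate_diagnoses(["depression", "depression", "anxiety"])
split = aggregate_diagnoses(["depression", "anxiety", "ptsd"])
```

Returning an explicit `needs_review` flag, instead of always picking a label, keeps disagreements visible to whatever step comes next.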

Paper and project links

PDF

Summary

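The paper proposes a decision support system that combines a consortium of fine-tuned large language models (LLMs) with the OpenAI-gpt-oss reasoning LLM to standardize psychiatric diagnoses. The LLMs are fine-tuned on conversational datasets of psychiatrist-patient interactions, their predictions are aggregated through a consensus-based process, and the reasoning LLM refines the result, with the aim of improving diagnostic consistency and reliability. The authors propose a new method for deploying LLM agents that ensures transparency, reliability, and responsible AI across the diagnostic workflow. Experimental results point to the transformative potential of combining fine-tuned LLMs with a reasoning model to build a robust and highly accurate diagnostic system for mental health assessment. A prototype integrating three fine-tuned LLMs with the OpenAI-gpt-oss reasoning LLM has been developed. To the authors' knowledge, it is the first application of its kind, paving the way for next-generation AI-powered eHealth systems aimed at standardizing psychiatric diagnoses.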

Key Takeaways

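Diagnosis of mental disorders depends primarily on dialogues between psychiatrists and patients, which leads to inconsistent diagnoses.
The paper proposes a decision support system that combines a fine-tuned LLM consortium with the OpenAI-gpt-oss reasoning LLM to standardize diagnosis.
The LLMs are fine-tuned on conversational datasets of psychiatrist-patient interactions focused on mental health conditions.
Diagnostic predictions are aggregated through a consensus-based decision-making process and refined by the OpenAI-gpt-oss reasoning LLM.
A new method for deploying LLM agents ensures transparency, reliability, and responsible AI across the diagnostic workflow.
Experiments indicate that combining fine-tuned LLMs with a reasoning model can yield a robust and highly accurate diagnostic system.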

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Authors:Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan

Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
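
The two-step router described above can be sketched on toy vectors. This is an illustration under stated assumptions, not the ProMoE implementation: the conditional/unconditional flag is given as input, unconditional tokens are sent to a single reserved expert purely for illustration (the abstract does not say how they are handled), and conditional tokens go to the expert whose prototype is most similar by cosine similarity. In the paper the prototypes are learnable; here they are fixed, hypothetical vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route_tokens(tokens, is_conditional, prototypes, unconditional_expert=0):
    """Return one expert index per token.

    Step 1 (conditional routing): split tokens by functional role.
    Step 2 (prototypical routing): match each conditional token to its nearest prototype.
    """
    assignments = []
    for token, conditional in zip(tokens, is_conditional):
        if not conditional:
            assignments.append(unconditional_expert)
            continue
        scores = [cosine(token, proto) for proto in prototypes]
        # Experts for conditional tokens are numbered after the reserved one.
        assignments.append(1 + scores.index(max(scores)))
    return assignments

tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
is_conditional = [True, True, False]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
assignments = route_tokens(tokens, is_conditional, prototypes)
```

Because assignment depends on similarity to a prototype, tokens with related content land on the same expert, which is the property a routing contrastive loss would then reinforce.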

Paper and project links

PDF

Summary

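This paper discusses the challenges of applying the Mixture-of-Experts (MoE) paradigm to Diffusion Transformers (DiTs) and proposes ProMoE, a new MoE framework. ProMoE promotes expert specialization through a two-step router with explicit routing guidance: conditional routing partitions image tokens into conditional and unconditional sets according to their functional roles, and prototypical routing refines the assignments of conditional image tokens. A routing contrastive loss further enhances prototypical routing, promoting intra-expert coherence and inter-expert diversity. Experiments on the ImageNet benchmark show that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives.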

Key Takeaways

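MoE has been notably successful in large language models (LLMs), but applying it to Diffusion Transformers (DiTs) has yielded limited gains.
The authors attribute this gap to fundamental differences between language tokens and visual tokens.
ProMoE is a new MoE framework whose two-step router, combining conditional routing and prototypical routing, promotes expert specialization.
Conditional routing partitions image tokens into conditional and unconditional sets according to their functional roles.
Prototypical routing refines the assignments of conditional image tokens using learnable prototypes based on semantic content, enabling similarity-based expert allocation in latent space.
Prototypical routing offers a natural mechanism for explicit semantic guidance, which is crucial for vision MoE.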

Evolving Diagnostic Agents in a Virtual Clinical Environment

Authors:Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie

In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on diagnosis process; (iv) we demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic accuracy and 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
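
The examination-recommendation metrics reported above can be illustrated with a short sketch. The set-based F1 is the standard definition; the hit-ratio definition used here (share of single-turn cases where the one recommended examination appears in the reference set) is an assumption, since the abstract does not define it. The examination names are hypothetical.

```python
def recommendation_f1(recommended, reference):
    """Set-based F1 between recommended and reference examinations."""
    recommended, reference = set(recommended), set(reference)
    hits = len(recommended & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(recommended)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

def hit_ratio(recommendations, references):
    """Share of cases whose single recommended examination is in the reference set."""
    hits = sum(1 for rec, ref in zip(recommendations, references) if rec in ref)
    return hits / len(recommendations)

f1 = recommendation_f1({"CBC", "CT", "MRI"}, {"CBC", "CT", "ECG"})
ratio = hit_ratio(["CBC", "MRI"], [{"CBC", "CT"}, {"ECG"}])
```

F1 fits the end-to-end setting, where an agent orders several examinations over a trajectory, while a hit ratio fits the single-turn setting, where only the next examination is judged.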

Paper and project links

PDF

Summary

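This paper presents a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. It makes four contributions. First, DiagGym, a diagnostics world model trained with electronic health records, emits examination outcomes conditioned on patient history and the recommended examination, serving as a virtual clinical environment for training and evaluation. Second, DiagAgent is trained via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy. Third, DiagBench is a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnosis process. Fourth, DiagAgent shows superior performance across diverse diagnostic settings: it outperforms 10 state-of-the-art LLMs in both single-turn and end-to-end settings on diagnostic accuracy and examination recommendation. The findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities that passive training alone cannot provide.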

Key Takeaways

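The paper proposes training LLMs as diagnostic agents with reinforcement learning so they can manage multi-turn diagnosis, adaptively select examinations, and commit to final diagnoses.
DiagGym is a diagnostics world model trained with electronic health records that serves as a virtual clinical environment.
DiagAgent is trained with end-to-end, multi-turn reinforcement learning to optimize both information yield and diagnostic accuracy.
DiagBench is a diagnostic benchmark with physician-validated examination recommendations and physician-written rubrics on the diagnosis process.
DiagAgent significantly outperforms 10 state-of-the-art LLMs on diagnostic accuracy and examination recommendation.
DiagAgent leads in both single-turn and end-to-end settings.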

Optimizing Retrieval for RAG via Reinforced Contrastive Learning

Authors:Jiawei Zhou, Lei Chen

As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever’s self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
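
The abstract says retrieved results interact with the environment to produce contrastive signals, but it does not give the loss. The sketch below uses a generic InfoNCE-style contrastive loss as a stand-in and assumes a per-passage success flag (for example, whether the RAG answer built on that passage was correct) to split passages into positives and negatives. The function name, the trial data, and the temperature are hypothetical.

```python
import math

def contrastive_loss(positive_scores, negative_scores, temperature=0.05):
    """InfoNCE-style loss averaged over positives.

    The loss is low when positives score higher than negatives under the retriever.
    """
    scaled = [s / temperature for s in positive_scores + negative_scores]
    max_s = max(scaled)  # subtract the max for numerical stability
    log_denominator = max_s + math.log(sum(math.exp(s - max_s) for s in scaled))
    losses = [log_denominator - p / temperature for p in positive_scores]
    return sum(losses) / len(losses)

# Each trial pairs a retriever score with environment feedback (assumed boolean).
trials = [(0.8, True), (0.6, False), (0.3, False)]
positives = [score for score, ok in trials if ok]
negatives = [score for score, ok in trials if not ok]
loss = contrastive_loss(positives, negatives)

# Sanity check: one positive and one negative with equal scores give log(2).
balanced = contrastive_loss([0.5], [0.5], temperature=1.0)
```

Minimizing such a loss pushes the retriever to score passages that helped the downstream system above those that did not, without any pre-annotated relevance labels.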

Paper and project links

PDF

Summary

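As retrieval-augmented generation (RAG) becomes widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems. The authors propose R3, a retrieval framework optimized for RAG through trial-and-feedback reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 lets the retriever dynamically explore and optimize relevance within the RAG environment. Experiments across diverse tasks show that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving results comparable to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is efficient and practical, requiring only 4 GPUs and completing training within a single day.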

Key Takeaways

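The role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for AI systems.
R3 is a retrieval framework optimized for RAG through trial-and-feedback reinforced contrastive learning.
R3 lets the retriever dynamically explore and optimize relevance within the RAG environment.
R3 does not rely on annotated or synthetic data for supervised fine-tuning, unlike prior approaches.
Across diverse tasks, R3 surpasses state-of-the-art retrievers and achieves results comparable to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs.
R3 trains efficiently, requiring only 4 GPUs and finishing within a single day.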

Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

Authors:Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich

Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

Paper and project links

PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Summary

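Explanations produced by large language models (LLMs) often fail to reflect the factors that actually drive their predictions. In healthcare this is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. This study examines how inference-time and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. It evaluates three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulates the number and type of few-shot examples, the prompting strategy, and the training procedure. The findings are: (i) both the quantity and quality of few-shot examples significantly affect faithfulness; (ii) faithfulness is sensitive to prompt design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These results offer insight into making LLMs more interpretable and trustworthy in sensitive domains.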

Key Takeaways

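LLM explanations often do not faithfully reflect the factors behind their predictions, which is especially problematic in healthcare settings.
Both the quantity and the quality of few-shot examples significantly affect model faithfulness.
Faithfulness is sensitive to prompt design.
The instruction-tuning phase improves measured faithfulness on MedQA.
The study covers three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) and two datasets, BBQ (social bias) and MedQA (medical licensing questions).
The findings point to strategies for improving the interpretability and trustworthiness of LLMs in sensitive domains.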

AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

Authors:Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani

Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.

Paper and project links

PDF

Summary

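The paper notes that African languages remain underrepresented in text-embedding benchmarks, even after the release of the massively multilingual MTEB (MMTEB). It introduces AfriMTEB, a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, and AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. In the authors' evaluation, AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.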

Key Takeaways

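Text embeddings are an essential building block of several NLP tasks, including retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs.
Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented.
Existing tasks are often repurposed from translation benchmarks such as FLORES clustering or SIB-200.
AfriMTEB is a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets.
Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages.
The new tasks include hate speech detection, intent detection, and emotion classification, none of which were previously covered.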

Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures

Authors:Shenran Wang, Timothy Tin-Long Tse, Jian Zhu

We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.

Paper and project links

PDF

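Summary: The paper evaluates in-context learning (ICL) in depth on state-of-the-art Transformer, state-space, and hybrid LLMs over two categories of knowledge-based ICL tasks. Combining behavioral probing with intervention-based methods, it finds that LLMs of different architectures can perform similarly on a task while their internals remain different. The function vectors (FVs) responsible for ICL are located primarily in the self-attention and Mamba layers, and the authors speculate that Mamba2 uses a mechanism other than FVs to perform ICL. FVs matter more for ICL that involves parametric knowledge retrieval than for contextual knowledge understanding. The work contributes a more nuanced understanding across architectures and task types and, methodologically, highlights the importance of combining behavioral and mechanistic analyses when investigating LLM capabilities.

Key Takeaways:

LLMs of different architectures can perform similarly on knowledge-based ICL tasks while their internals remain different.
Function vectors (FVs) responsible for ICL are located primarily in the self-attention and Mamba layers.
The authors speculate that Mamba2 performs ICL through a mechanism other than FVs.
FVs matter more for ICL involving parametric knowledge retrieval than for contextual knowledge understanding.
The work contributes a more nuanced understanding of ICL across architectures and task types.
Combining behavioral and mechanistic analyses is important for investigating LLM capabilities.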

Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models

Authors:Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj

In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.

Paper and project links

PDF

Summary

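In-context learning in large language models (LLMs) is governed by both temporal and semantic relationships, which shape how the models retrieve contextual information. This study probes the ability of pretrained LLMs, including Transformer and state-space models, to differentiate and retrieve temporally separated events. By fixing the positions of repeated tokens and permuting all other tokens, the authors remove semantic confounds and isolate temporal effects on next-token prediction. Models consistently place the highest probabilities on tokens that follow a repeated token, with a notable bias toward those nearest the beginning or end of the input. An ablation experiment links this phenomenon in Transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap shows that memories embedded in the middle of a prompt are retrieved less reliably. Despite their architectural differences, state-space and Transformer models show comparable temporal biases. The findings deepen the understanding of temporal biases in in-context learning and illustrate how these biases can enable temporal separation and episodic retrieval.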

Key Takeaways

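In-context learning in LLMs is governed by both temporal and semantic relationships.
The study probes whether pretrained Transformer and state-space models can differentiate and retrieve temporally separated events.
Fixing the positions of repeated tokens and permuting all others removes semantic confounds and isolates temporal effects on next-token prediction.
Models place the highest probabilities on tokens following a repeated token, with a notable bias toward those nearest the beginning or end of the input.
An ablation experiment links this temporal bias in Transformers to induction heads.
In semantic contexts with partial overlap, memories embedded in the middle of a prompt are retrieved less reliably.
State-space and Transformer models show comparable temporal biases.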
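
The probe design described above can be sketched in a few lines. This is a hedged illustration of the idea (a repeated token at fixed positions, all other tokens permuted, and the repeated token appended as the final cue), not the authors' code. The helper names `build_probe` and `candidate_continuations`, the token strings, and the chosen positions are all hypothetical.

```python
import random

def build_probe(filler_tokens, repeated_token, repeat_positions, seed=0):
    """Place `repeated_token` at fixed positions, shuffle the fillers, append the cue."""
    fixed = set(repeat_positions)
    length = len(filler_tokens) + len(fixed)
    if len(fixed) != len(repeat_positions) or any(p < 0 or p >= length for p in fixed):
        raise ValueError("repeat positions must be unique and inside the sequence")
    rng = random.Random(seed)
    fillers = list(filler_tokens)
    rng.shuffle(fillers)  # a new permutation per seed removes stable semantic cues
    filler_iter = iter(fillers)
    sequence = [repeated_token if i in fixed else next(filler_iter) for i in range(length)]
    sequence.append(repeated_token)  # the final occurrence acts as the retrieval cue
    return sequence

def candidate_continuations(sequence, repeated_token):
    """Tokens that followed each earlier occurrence of the repeated token."""
    return [sequence[i + 1] for i in range(len(sequence) - 1) if sequence[i] == repeated_token]

fillers = [f"w{i}" for i in range(12)]
probe = build_probe(fillers, "X", repeat_positions=[2, 6, 10], seed=1)
candidates = candidate_continuations(probe, "X")
print(probe)
print(candidates)
```

In an actual experiment one would feed `probe` to a model and compare the next-token probabilities assigned to each entry of `candidates`; averaging over many seeds keeps the positions fixed while the surrounding content varies.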

Can ChatGPT be a good follower of academic paradigms? Research quality evaluations in conflicting areas of sociology

Authors:Mike Thelwall, Ralph Schroeder, Meena Dhanda

Purpose: It has become increasingly likely that Large Language Models (LLMs) will be used to score the quality of academic publications to support research assessment goals in the future. This may cause problems for fields with competing paradigms since there is a risk that one may be favoured, causing long term harm to the reputation of the other. Design/methodology/approach: To test whether this is plausible, this article uses 17 ChatGPTs to evaluate up to 100 journal articles from each of eight pairs of competing sociology paradigms (1490 altogether). Each article was assessed by prompting ChatGPT to take one of five roles: paradigm follower, opponent, antagonistic follower, antagonistic opponent, or neutral. Findings: Articles were scored highest by ChatGPT when it followed the aligning paradigm, and lowest when it was told to devalue it and to follow the opposing paradigm. Broadly similar patterns occurred for most of the paradigm pairs. Follower ChatGPTs displayed only a small amount of favouritism compared to neutral ChatGPTs, but articles evaluated by an opposing paradigm ChatGPT had a substantial disadvantage. Research limitations: The data covers a single field and LLM. Practical implications: The results confirm that LLM instructions for research evaluation should be carefully designed to ensure that they are paradigm-neutral to avoid accidentally resolving conflicts between paradigms on a technicality by devaluing one side’s contributions. Originality/value: This is the first demonstration that LLMs can be prompted to show a partiality for academic paradigms.

Paper and project links

PDF

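Summary

Large language models (LLMs) are increasingly likely to be used to score the quality of academic publications in support of research assessment. This may cause problems for fields with competing paradigms, because one paradigm may be favoured, causing long-term harm to the reputation of the other. The article uses 17 ChatGPTs to evaluate up to 100 journal articles from each of eight pairs of competing sociology paradigms (1490 articles altogether). Each article was assessed by prompting ChatGPT to take one of five roles: paradigm follower, opponent, antagonistic follower, antagonistic opponent, or neutral. Articles scored highest when ChatGPT followed the aligning paradigm and lowest when it was told to devalue that paradigm and follow the opposing one, with broadly similar patterns for most paradigm pairs. Follower ChatGPTs displayed only a small amount of favouritism compared with neutral ChatGPTs, but articles evaluated by an opposing-paradigm ChatGPT were at a substantial disadvantage. LLM instructions for research evaluation should therefore be designed carefully to be paradigm-neutral, so that conflicts between paradigms are not accidentally resolved on a technicality by devaluing one side's contributions.

Key Takeaways

LLMs may be used to score the quality of academic publications in support of research assessment.
In fields with competing paradigms, LLM-based scoring risks favouring one paradigm over the other.
ChatGPT scored articles highest when it was prompted to follow the paradigm the article aligns with.
Scores were lowest when ChatGPT was told to devalue the article's paradigm and follow the opposing one.
Follower ChatGPTs showed only a small amount of favouritism relative to neutral ChatGPTs, whereas opposing-paradigm ChatGPTs put articles at a substantial disadvantage.
The data covers a single field and a single LLM.
LLM instructions for research evaluation should be paradigm-neutral to avoid devaluing one side's contributions.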

Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

Authors:Mengyu Ye, Jun Suzuki, Tatsuro Inaba, Tatsuki Kuribayashi

Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery approach with the help of proxy modules. Then, the quality of features learned by, e.g., sparse auto-encoders (SAEs), is evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters, and unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, given the perspective of FF as key-value memories, with modern interpretability benchmarks. Our extensive evaluation revealed that SAE and FFs exhibits a similar range of interpretability, although SAEs displayed an observable but minimal improvement in some aspects. Furthermore, in certain aspects, surprisingly, even vanilla FFs yielded better interpretability than the SAEs, and features discovered in SAEs and FFs diverged. These bring questions about the advantage of SAEs from both perspectives of feature quality and faithfulness, compared to directly interpreting FF feature vectors, and FF key-value parameters serve as a strong baseline in modern interpretability research.

Paper and project links

PDF NeurIPS 2025

Summary

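Recent interpretability work on large language models (LLMs) has been increasingly dominated by feature discovery with the help of proxy modules, followed by an evaluation of the features learned by, for example, sparse autoencoders (SAEs). So far only a few studies have systematically compared whether such learned features have better properties than those already represented in the original model parameters. This work revisits the interpretability of feature vectors stored in feed-forward (FF) layers, viewed as key-value memories, using modern interpretability benchmarks. The evaluation finds that SAEs and FFs exhibit a similar range of interpretability, with SAEs showing an observable but minimal improvement in some aspects. Surprisingly, in certain aspects even vanilla FFs yield better interpretability than SAEs, and the features discovered in SAEs and FFs diverge. These results raise questions about the advantage of SAEs in terms of both feature quality and faithfulness compared with directly interpreting FF feature vectors, and they suggest that FF key-value parameters serve as a strong baseline in modern interpretability research.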

Key Takeaways

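Recent LLM interpretability work is increasingly dominated by feature discovery with the help of proxy modules.
SAEs and FF layers exhibit a similar range of interpretability.
SAEs show an observable but minimal improvement in some aspects, while in certain other aspects vanilla FFs yield better interpretability.
The features discovered in SAEs and in FFs diverge.
The results raise questions about the advantage of SAEs in terms of both feature quality and faithfulness.
FF key-value parameters serve as a strong baseline in modern interpretability research.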
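
The "key-value memory" reading of a feed-forward layer can be illustrated with a small sketch. It treats each row of the first weight matrix as a key that is matched against the input and each row of the second weight matrix as the value vector that is added to the output in proportion to that match. The ReLU activation, the omission of biases, and the toy numbers are simplifying assumptions made here and are not taken from the paper.

```python
def relu(x):
    return max(0.0, x)

def ff_as_memory(x, keys, values):
    """Feed-forward layer read as a key-value memory.

    keys:   one vector per hidden unit (rows of the first weight matrix).
    values: one vector per hidden unit (rows of the second weight matrix).
    Returns the per-unit activations and the layer output.
    """
    # Each activation measures how strongly the input matches one key.
    activations = [relu(sum(xi * ki for xi, ki in zip(x, key))) for key in keys]
    dim = len(values[0])
    # The output is the activation-weighted sum of the value vectors.
    output = [sum(a * v[j] for a, v in zip(activations, values)) for j in range(dim)]
    return activations, output

x = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
activations, output = ff_as_memory(x, keys, values)
print(activations, output)
```

Under this view, interpreting an FF feature amounts to asking which inputs activate a given key and what its value vector writes to the output, which requires no additional trained module such as an SAE.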

Transformer Based Linear Attention with Optimized GPU Kernel Implementation

Authors:Armin Gerami, Ramani Duraiswami

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of the linear attention (LA) mechanisms, which offers a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA’s forward and backward passes, along with a highly-optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.

Paper and project links

PDF

Summary

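The original softmax-based attention mechanism (regular attention) in the Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. To improve Transformer runtime during training and inference, linear attention (LA) mechanisms have been introduced; they offer a linear time complexity of $O(ND^2)$ and have demonstrated accuracy comparable to regular attention. In practice, however, LA lags behind its theoretical efficiency. This work proposes a novel method for LA's forward and backward passes, together with a highly optimized CUDA implementation. The approach is 3.3 times faster than the state of the art and reduces memory consumption by 3.6 times. The improvements are validated in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates expressivity similar to regular attention on major reasoning benchmarks.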

Key Takeaways

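Regular softmax attention in the Transformer architecture has a time complexity of $O(N^2D)$.
Linear attention (LA) mechanisms offer a linear time complexity of $O(ND^2)$ with accuracy comparable to regular attention.
In practice, LA lags behind its theoretical efficiency.
The paper proposes a novel method for LA's forward and backward passes.
With a highly optimized CUDA implementation, the approach is 3.3 times faster than the state of the art.
The approach reduces memory consumption by 3.6 times.
The improvements are validated by training a 1.4 billion parameter language model, which shows expressivity similar to regular attention on major reasoning benchmarks.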
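
Where the $O(ND^2)$ cost comes from can be seen in a small sketch. It assumes the common kernelised form of linear attention, in which a positive feature map replaces the softmax so that the product can be regrouped: the $D \times D$ summary of keys and values is built once and reused for every query. The feature map `phi` (elu(x) + 1), the non-causal setting, and the pure-Python loops are illustrative assumptions; this is not the paper's method or its CUDA kernel.

```python
import math
import random

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice for illustration)."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Non-causal linear attention in O(N * D^2): summarise keys and values once."""
    N, D, Dv = len(K), len(K[0]), len(V[0])
    fq = [[phi(x) for x in row] for row in Q]
    fk = [[phi(x) for x in row] for row in K]
    # S[d][e] = sum_n phi(K)[n][d] * V[n][e], a D x Dv summary shared by all queries
    S = [[sum(fk[n][d] * V[n][e] for n in range(N)) for e in range(Dv)] for d in range(D)]
    # z[d] = sum_n phi(K)[n][d], used to normalise each output row
    z = [sum(fk[n][d] for n in range(N)) for d in range(D)]
    out = []
    for q in fq:
        denom = sum(q[d] * z[d] for d in range(D))
        out.append([sum(q[d] * S[d][e] for d in range(D)) / denom for e in range(Dv)])
    return out

def quadratic_attention(Q, K, V):
    """The same kernelised attention via explicit N x N weights, O(N^2 * D)."""
    fq = [[phi(x) for x in row] for row in Q]
    fk = [[phi(x) for x in row] for row in K]
    out = []
    for q in fq:
        weights = [sum(a * b for a, b in zip(q, k)) for k in fk]
        total = sum(weights)
        out.append([sum(w * v[e] for w, v in zip(weights, V)) / total for e in range(len(V[0]))])
    return out

rng = random.Random(0)
N, D = 6, 3
Q = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
K = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
fast = linear_attention(Q, K, V)
slow = quadratic_attention(Q, K, V)
max_gap = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
print(max_gap)
```

Both functions compute the same values up to floating-point rounding; only the order of multiplication differs, which is why the linear variant never materialises the $N \times N$ weight matrix.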

Enabling Robust In-Context Memory and Rapid Task Adaptation in Transformers with Hebbian and Gradient-Based Plasticity

Authors:Siddharth Chaudhary

Large language models display in-context learning as an emergent effect of scale, but they rely on static weights during inference. In contrast, biological systems continually adapt via synaptic plasticity. We investigate whether explicit, biologically inspired plasticity can endow Transformers with faster in-sequence adaptation. To this end, we augment decoder-only Transformers with fast-weight modules updated either by (i) a neuromodulated Hebbian rule or (ii) the gradient-based plasticity mechanism of Duan et al. (2023). Across copying, regression, and few-shot classification tasks (CIFAR-FS, Omniglot), Hebbian plasticity consistently achieves lower loss and stronger few-shot generalization, while gradient-based updates perform best on long-horizon credit assignment. When associations are short and linearly separable, static weights suffice, defining a clear boundary condition for when plasticity helps. Analysis of learned modulatory signals reveals that gradient-based rules maintain large, persistent updates, whereas Hebbian plasticity is sharply gated around salient events. Together, these results show that explicit plasticity complements attention by enabling rapid, task-specific adaptation, and clarify when different plasticity mechanisms are most effective.

Paper and project links

PDF

Summary

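Large language models display in-context learning as an emergent effect of scale, but they rely on static weights during inference, whereas biological systems adapt continually through synaptic plasticity. This study investigates whether explicit, biologically inspired plasticity can give Transformers faster in-sequence adaptation. To this end, decoder-only Transformers are augmented with fast-weight modules updated either by (i) a neuromodulated Hebbian rule or (ii) the gradient-based plasticity mechanism of Duan et al. (2023). Across copying, regression, and few-shot classification tasks (CIFAR-FS, Omniglot), Hebbian plasticity consistently achieves lower loss and stronger few-shot generalization, while gradient-based updates perform best on long-horizon credit assignment. When associations are short and linearly separable, static weights suffice, which defines a clear boundary condition for when plasticity helps. Analysis of the learned modulatory signals shows that gradient-based rules maintain large, persistent updates, whereas Hebbian plasticity is sharply gated around salient events. Overall, explicit plasticity complements attention by enabling rapid, task-specific adaptation, and the results clarify when each plasticity mechanism is most effective.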

Key Takeaways

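Large language models display in-context learning as an emergent effect of scale but rely on static weights during inference.
Biological systems adapt continually through synaptic plasticity.
The study tests whether explicit, biologically inspired plasticity can give Transformers faster in-sequence adaptation.
Decoder-only Transformers are augmented with fast-weight modules updated by a neuromodulated Hebbian rule or by a gradient-based plasticity mechanism.
Hebbian plasticity achieves lower loss and stronger few-shot generalization, while gradient-based updates perform best on long-horizon credit assignment.
When associations are short and linearly separable, static weights suffice; this marks the boundary condition for when plasticity helps.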
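
To make the idea of a fast-weight module concrete, here is a minimal sketch of a generic modulated Hebbian update: the fast weights change by the outer product of post-synaptic and pre-synaptic activity, scaled by a modulation signal that can gate the update on or off. The exact rule, the decay term, and every name below (`hebbian_update`, `read`, `modulation`) are assumptions for illustration and are not the formulation used in the paper or in Duan et al. (2023).

```python
def hebbian_update(F, pre, post, modulation, lr=1.0, decay=0.0):
    """Return new fast weights: (1 - decay) * F + lr * modulation * outer(post, pre)."""
    return [
        [(1.0 - decay) * F[i][j] + lr * modulation * post[i] * pre[j]
         for j in range(len(pre))]
        for i in range(len(post))
    ]

def read(F, x):
    """Apply the fast weights to an input vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in F]

key = [1.0, 0.0, 0.0]      # pre-synaptic pattern (unit norm)
value = [0.2, -0.7]        # post-synaptic pattern to associate with the key
F0 = [[0.0] * len(key) for _ in value]

# With the gate open (modulation = 1), one update stores the association.
F_open = hebbian_update(F0, key, value, modulation=1.0)
recalled = read(F_open, key)

# With the gate closed (modulation = 0), the fast weights do not change.
F_closed = hebbian_update(F0, key, value, modulation=0.0)
print(recalled, read(F_closed, key))
```

Because the update happens during the forward pass, an association written at one position in the sequence can be read back at a later position without changing the slow, trained weights.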

Evaluating ChatGPT’s Performance in Classifying Pneumonia from Chest X-Ray Images

Authors:Pragna Prahallad, Pranathi Prahallad

In this study, we evaluate the ability of OpenAI’s gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.

Paper and project links

PDF

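Summary: This study evaluates the ability of OpenAI's gpt-4o model to classify chest X-ray images as NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) is used to assess four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. Concise, feature-focused prompts achieved the highest classification accuracy, 74%, whereas reasoning-oriented prompts performed worse. The findings indicate that ChatGPT shows emerging potential for medical image interpretation but that its diagnostic reliability remains limited. Further advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.

Key Takeaways: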