Published: 2025-09-28

Updated: 2025-11-27

Word count: 20k

Reading time: 82 min

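⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading a paper.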

2025-09-28 Update

RePro: Leveraging Large Language Models for Semi-Automated Reproduction of Networking Research Results

Authors:Yining Jiang, Wenyun Xu, Qingyu Song, Yuling Lin, Xuanhao Liu, Xiaoqiang Zheng, Qiang Su, Lizhao You, Lu Tang, Wangjian Feng, Linghe Kong, Qiao Xiang, Jiwu Shu

Reproducing networking research is a critical but challenging task due to the scarcity of open-source code. While Large Language Models (LLMs) can automate code generation, current approaches lack the generalizability required for the diverse networking field. To address this, we propose RePro, a semi-automated reproduction framework that leverages advanced prompt engineering to reproduce network systems from their research papers. RePro combines few-shot in-context learning with Structured and Semantic Chain of Thought (SCoT/SeCoT) techniques to systematically translate a paper’s description into an optimized, executable implementation. The framework operates through a three-stage pipeline: system description extraction, structural code generation, and code optimization. Our evaluation with five state-of-the-art LLMs across diverse network sub-domains demonstrates that RePro significantly reduces reproduction time compared to manual efforts while achieving comparable system performance, validating its effectiveness and efficiency.

Summary

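Reproducing networking research is difficult because open-source code is scarce. This work proposes RePro, a semi-automated reproduction framework that uses large language models (LLMs) and advanced prompt engineering to reproduce network systems from their research papers. RePro translates a paper's description into an optimized, executable implementation through a three-stage pipeline: system description extraction, structural code generation, and code optimization. An evaluation with five state-of-the-art LLMs across diverse networking sub-domains shows that RePro significantly reduces reproduction time compared with manual effort while achieving comparable system performance.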

Key Takeaways

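Reproducing networking research is challenging because open-source code is scarce.
RePro is a semi-automated reproduction framework built on large language models.
RePro uses advanced prompt engineering (few-shot in-context learning with SCoT/SeCoT) to reproduce network systems from research papers.
The RePro pipeline has three stages: system description extraction, structural code generation, and code optimization.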

Background Prompt for Few-Shot Out-of-Distribution Detection

Authors:Songyue Cai, Zongqian Wu, Yujie Mo, Liang Peng, Ping Hu, Xiaoshuang Shi, Xiaofeng Zhu

Existing foreground-background (FG-BG) decomposition methods for the few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence of the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, and thus exploring the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance, compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at https://github.com/YuzunoKawori/Mambo.

Summary

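This paper proposes Mambo, a new foreground-background (FG-BG) decomposition framework for few-shot out-of-distribution (FS-OOD) detection. It learns a background prompt to obtain a local background similarity, then refines that similarity using the local class similarity. A patch self-calibrated tuning strategy flexibly selects the number of background patches for each sample, addressing the fixed background extraction strategy of earlier methods. Experiments on real-world datasets show that Mambo achieves the best performance against state-of-the-art methods in both the OOD detection and near-OOD detection settings.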

Key Takeaways

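Mambo is a new foreground-background decomposition framework for few-shot out-of-distribution (FS-OOD) detection.
A learned background prompt yields a local background similarity that carries both background and image semantic information.
The local background similarity is refined using the local class similarity.
Patch self-calibrated tuning flexibly selects the number of background patches for each sample.
Together these address earlier methods' over-reliance on local class similarity and their fixed background patch extraction strategy.
Experiments on real-world datasets show that Mambo performs best in both the OOD and near-OOD detection settings.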

FSMODNet: A Closer Look at Few-Shot Detection in Multispectral Data

Authors:Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre

Few-shot multispectral object detection (FSMOD) addresses the challenge of detecting objects across visible and thermal modalities with minimal annotated data. In this paper, we explore this complex task and introduce a framework named “FSMODNet” that leverages cross-modality feature integration to improve detection performance even with limited labels. By effectively combining the unique strengths of visible and thermal imagery using deformable attention, the proposed method demonstrates robust adaptability in complex illumination and environmental conditions. Experimental results on two public datasets show effective object detection performance in challenging low-data regimes, outperforming several baselines we established from state-of-the-art models. All code, models, and experimental data splits can be found at https://anonymous.4open.science/r/Test-B48D.

Summary

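This paper proposes FSMODNet, a framework for detecting objects across visible and thermal modalities with minimal annotated data. It combines the strengths of the two modalities through cross-modality feature integration with deformable attention and adapts robustly to complex illumination and environmental conditions. Experiments on two public datasets show effective detection in low-data regimes, outperforming several baselines built from state-of-the-art models. Code, models, and experimental data splits are available at https://anonymous.4open.science/r/Test-B48D.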

Key Takeaways

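FSMODNet targets cross-modality (visible and thermal) object detection with few annotated samples.
Cross-modality feature integration improves detection performance under limited labels.
Deformable attention combines the complementary strengths of visible and thermal imagery.
FSMODNet adapts robustly to complex illumination and environmental conditions.
Experiments on two public datasets show that it outperforms several baselines in low-data regimes.
Code, models, and experimental data splits are released at the anonymized repository given in the abstract.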

DAC-LoRA: Dynamic Adversarial Curriculum for Efficient and Robust Few-Shot Adaptation

Authors:Ved Umrajkar

Vision-Language Models (VLMs) are foundational to critical applications like autonomous driving, medical diagnosis, and content moderation. While Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA enable their efficient adaptation to specialized tasks, these models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone for numerous downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. We propose Dynamic Adversarial Curriculum DAC-LoRA, a novel framework that integrates adversarial training into PEFT. The core principle of our method i.e. an intelligent curriculum of progressively challenging attack, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA achieves substantial improvements in adversarial robustness without significantly compromising clean accuracy. Our work presents an effective, lightweight, and broadly applicable method to demonstrate that the DAC-LoRA framework can be easily integrated into a standard PEFT pipeline to significantly enhance robustness.

Paper and project links

PDF Accepted at ICCV2025 Workshop on Safe and Trustworthy Multimodal AI Systems

Summary

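Vision-language models (VLMs) underpin critical applications such as autonomous driving, medical diagnosis, and content moderation. Parameter-efficient fine-tuning (PEFT) methods such as LoRA adapt them efficiently to specialized tasks, but the models remain vulnerable to adversarial attacks that can compromise safety-critical decisions. CLIP, the backbone of many downstream VLMs, is a high-value target whose vulnerabilities can cascade across the multimodal AI ecosystem. The paper proposes DAC-LoRA (Dynamic Adversarial Curriculum), a framework that integrates adversarial training into PEFT. Its core principle, an intelligent curriculum of progressively more challenging attacks, is general and can potentially be applied to any iterative attack method. Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA substantially improves adversarial robustness without significantly compromising clean accuracy, and it can be integrated easily into a standard PEFT pipeline.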

Key Takeaways

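VLMs are central to applications such as autonomous driving, medical diagnosis, and content moderation, yet they remain vulnerable to adversarial attacks that can compromise safety-critical decisions.
DAC-LoRA is a new framework that integrates adversarial training into parameter-efficient fine-tuning (PEFT).
Its curriculum of progressively more challenging attacks is general and can potentially be applied to any iterative attack method.
Guided by the First-Order Stationary Condition (FOSC) and a TRADES-inspired loss, DAC-LoRA improves adversarial robustness without significantly reducing clean accuracy.
The reported results show substantial improvements in adversarial robustness.
DAC-LoRA integrates easily into a standard PEFT pipeline.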
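
The FOSC criterion and the TRADES-style objective named in the abstract can be made concrete with a small sketch. It assumes an L-infinity threat model of radius `eps`, the closed-form FOSC value `eps * ||g||_1 - <x_adv - x_clean, g>` commonly used for that setting, and a loss of the form cross-entropy plus `beta` times a KL term. The function names, the plain-list inputs, and these exact formulations are illustrative assumptions, not code or definitions taken from the DAC-LoRA paper.

```python
import math


def fosc(x_adv, x_clean, grad, eps):
    """FOSC value for an L-infinity ball of radius eps (assumed closed form).

    grad is the loss gradient with respect to the input, evaluated at x_adv.
    A value near 0 means the attack has (approximately) converged.
    """
    l1_norm = sum(abs(g) for g in grad)
    inner = sum((a - c) * g for a, c, g in zip(x_adv, x_clean, grad))
    return eps * l1_norm - inner


def trades_loss(p_clean, p_adv, label, beta):
    """TRADES-style objective on probability vectors (assumed form).

    Cross-entropy on the clean prediction plus beta * KL(p_clean || p_adv).
    Assumes every entry of p_adv is strictly positive.
    """
    cross_entropy = -math.log(p_clean[label])
    kl = sum(p * math.log(p / q) for p, q in zip(p_clean, p_adv) if p > 0)
    return cross_entropy + beta * kl
```

A curriculum could, for example, lower the FOSC target over training so that attacks become progressively stronger; whether DAC-LoRA schedules it this way is not stated in the abstract.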

EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

Authors:Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao

Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy – models’ tendency to uncritically echo user-provided information – in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.

Paper and project links

PDF 29 pages, 6 figures

Summary

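This paper examines the reliability and safety of medical large vision-language models (LVLMs). It studies sycophancy, a model's tendency to uncritically echo user-provided information, and introduces EchoBench, a benchmark that systematically evaluates sycophancy in medical LVLMs using prompts that simulate biased inputs from patients, medical students, and physicians. Every evaluated model shows substantial sycophancy, including the best proprietary model. Higher data quality and diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training-time and decoding-time strategies. The findings highlight the need for robust evaluation beyond accuracy and offer actionable guidance toward safer, more trustworthy medical LVLMs.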

Key Takeaways

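Sycophancy, the tendency of a model to uncritically echo user-provided information, is a critical concern for medical LVLMs in high-stakes clinical settings.
EchoBench systematically evaluates sycophancy with 2,122 images across 18 departments and 20 modalities, together with 90 prompts.
All evaluated models (medical-specific, open-source, and proprietary) show substantial sycophancy.
The best proprietary model, Claude 3.7 Sonnet, still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%.
Many medical-specific models exceed 95% sycophancy despite only moderate accuracy.
Higher data quality and diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy.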
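
To make the sycophancy percentages above easier to interpret, here is a minimal sketch of how such a rate might be computed. It assumes the rate is the share of biased-prompt trials in which the model's answer matches the incorrect answer pushed by the user; the abstract does not give EchoBench's exact metric, and the `Trial` record and `sycophancy_rate` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct: str    # ground-truth answer
    suggested: str  # answer pushed by the biased prompt
    response: str   # model answer under the biased prompt


def sycophancy_rate(trials):
    """Share of trials where the model echoes an incorrect user suggestion."""
    # Only trials whose suggestion is actually wrong can reveal sycophancy.
    eligible = [t for t in trials if t.suggested != t.correct]
    if not eligible:
        return 0.0
    echoed = sum(t.response == t.suggested for t in eligible)
    return echoed / len(eligible)
```

Under this definition a model can be accurate on neutral prompts and still score high, which matches the abstract's observation that accuracy and sycophancy are separate axes.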

MMSE-Calibrated Few-Shot Prompting for Alzheimer’s Detection

Authors:Jana Sweidan, Mounim A. El-Yacoubi, Nasredine Semmar

Prompting large language models is a training-free method for detecting Alzheimer’s disease from speech transcripts. Using the ADReSS dataset, we revisit zero-shot prompting and study few-shot prompting with a class-balanced protocol using nested interleave and a strict schema, sweeping up to 20 examples per class. We evaluate two variants achieving state-of-the-art prompting results. (i) MMSE-Proxy Prompting: each few-shot example carries a probability anchored to Mini-Mental State Examination bands via a deterministic mapping, enabling AUC computing; this reaches 0.82 accuracy and 0.86 AUC (ii) Reasoning-augmented Prompting: few-shot examples pool is generated with a multimodal LLM (GPT-5) that takes as input the Cookie Theft image, transcript, and MMSE to output a reasoning and MMSE-aligned probability; evaluation remains transcript-only and reaches 0.82 accuracy and 0.83 AUC. To our knowledge, this is the first ADReSS study to anchor elicited probabilities to MMSE and to use multimodal construction to improve interpretability.

Summary

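Using the ADReSS dataset, this study examines prompting large language models as a training-free method for detecting Alzheimer's disease from speech transcripts. It revisits zero-shot prompting and studies few-shot prompting under a class-balanced protocol with nested interleave and a strict schema, sweeping up to 20 examples per class. Two variants achieve state-of-the-art prompting results. In MMSE-Proxy Prompting, each few-shot example carries a probability anchored to Mini-Mental State Examination (MMSE) bands through a deterministic mapping, which enables AUC computation; it reaches 0.82 accuracy and 0.86 AUC. In Reasoning-augmented Prompting, the few-shot example pool is generated by a multimodal LLM (GPT-5) that takes the Cookie Theft image, the transcript, and the MMSE score as input and outputs a reasoning trace and an MMSE-aligned probability; evaluation remains transcript-only and reaches 0.82 accuracy and 0.83 AUC. According to the authors, this is the first ADReSS study to anchor elicited probabilities to MMSE and to use multimodal construction to improve interpretability.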

Key Takeaways

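Prompting large language models is a training-free method for detecting Alzheimer's disease from speech transcripts.
The study revisits zero-shot prompting and examines few-shot prompting under a class-balanced protocol, with up to 20 examples per class.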
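
The MMSE-Proxy idea (a deterministic mapping from MMSE bands to a probability, which in turn makes AUC computable) can be sketched as follows. The band cut-offs and probability values in `mmse_to_probability` are hypothetical placeholders, not the mapping used in the paper; `auc` is the standard pairwise (Mann-Whitney) formulation.

```python
def mmse_to_probability(mmse):
    """Hypothetical band mapping: a lower MMSE score gives a higher probability."""
    if mmse >= 27:
        return 0.1
    if mmse >= 21:
        return 0.4
    if mmse >= 11:
        return 0.7
    return 0.9


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Because the mapping is deterministic, the same MMSE score always yields the same anchor probability, so few-shot examples stay consistent across prompts.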

MoTiC: Momentum Tightness and Contrast for Few-Shot Class-Incremental Learning

Authors:Zeyu He, Shuai Huang, Yuwu Lu, Ming Zhao

Few-Shot Class-Incremental Learning (FSCIL) must contend with the dual challenge of learning new classes from scarce samples while preserving old class knowledge. Existing methods use the frozen feature extractor and class-averaged prototypes to mitigate against catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias due to extreme data scarcity, whereas base-class prototypes benefit from sufficient data. In this work, we theoretically demonstrate that aligning the new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. Furthermore, we propose large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information for new-class prototypes, we integrate momentum self-supervision and virtual categories into the Momentum Tightness and Contrast framework (MoTiC), constructing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks produce state-of-the-art performances, particularly on the fine-grained task CUB-200, validating our method’s ability to reduce estimation bias and improve incremental learning robustness.

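Summary

Few-shot class-incremental learning (FSCIL) faces the dual challenge of learning new classes from scarce samples while preserving knowledge of old classes. Existing methods use a frozen feature extractor and class-averaged prototypes to mitigate catastrophic forgetting and overfitting. However, new-class prototypes suffer significant estimation bias because data is extremely scarce, whereas base-class prototypes benefit from sufficient data. This work shows theoretically that aligning new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy. It also proposes large-scale contrastive learning to enforce cross-category feature tightness. To further enrich feature diversity and inject prior information into new-class prototypes, momentum self-supervision and virtual categories are integrated into the Momentum Tightness and Contrast framework (MoTiC), producing a feature space with rich representations and enhanced interclass cohesion. Experiments on three FSCIL benchmarks yield state-of-the-art performance, particularly on the fine-grained CUB-200 task, supporting the method's ability to reduce estimation bias and improve the robustness of incremental learning.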

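Key Takeaways

Few-shot class-incremental learning (FSCIL) must learn new classes from scarce samples while preserving knowledge of old classes.
Existing methods rely on a frozen feature extractor and class-averaged prototypes.
New-class prototypes suffer significant estimation bias because of data scarcity, whereas base-class prototypes benefit from sufficient data.
Aligning new-class priors with old-class statistics via Bayesian analysis reduces variance and improves prototype accuracy.
Large-scale contrastive learning enforces cross-category feature tightness.
MoTiC integrates momentum self-supervision and virtual categories to enrich feature representations and enhance interclass cohesion.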
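
The variance-reduction argument can be illustrated with the textbook conjugate-Gaussian posterior mean, which shrinks a noisy few-shot class mean toward a prior built from old-class statistics. This is a generic sketch under the assumption of isotropic Gaussian noise and an isotropic Gaussian prior; it is not MoTiC's estimator, and `posterior_prototype` and its parameters are hypothetical names.

```python
def posterior_prototype(samples, prior_mean, prior_var, noise_var):
    """Posterior mean of a class prototype under a Gaussian prior.

    samples    : list of feature vectors for the new class (few shots)
    prior_mean : prior prototype, e.g. derived from old-class statistics
    prior_var  : variance of the prior (how much we trust prior_mean)
    noise_var  : per-sample feature noise variance
    """
    n = len(samples)
    dim = len(prior_mean)
    sample_mean = [sum(x[d] for x in samples) / n for d in range(dim)]
    # Weight on the data grows with the number of shots n.
    weight = (n / noise_var) / (n / noise_var + 1.0 / prior_var)
    return [weight * m + (1.0 - weight) * p
            for m, p in zip(sample_mean, prior_mean)]
```

With one shot the estimate leans heavily on the prior; as more shots arrive, it moves toward the plain class average.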

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

Authors:Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn – a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

Paper and project links

PDF 8 pages, 11 figures

Summary

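This paper introduces RoboSSM, an in-context imitation learning (ICIL) method built on state-space models (SSMs). It replaces Transformers with Longhorn, an SSM that offers linear-time inference and strong extrapolation, which suits long-context prompts. Experiments on the LIBERO benchmark show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, performs well on unseen tasks, and remains robust in long-horizon scenarios. These results point to SSMs as an efficient and scalable backbone for ICIL.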

Key Takeaways

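ICIL lets robots learn tasks from a handful of demonstrations and supports few-shot adaptation to novel tasks without parameter updates at deployment time.
Existing ICIL methods rely on Transformers, which have computational limitations and tend to underperform on prompts longer than those seen during training.
RoboSSM is a scalable SSM-based recipe for in-context imitation learning that replaces Transformers with Longhorn.
Longhorn offers linear-time inference and strong extrapolation, which suits long-context prompts.
Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations and performs well on unseen tasks.
RoboSSM remains robust in long-horizon scenarios.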
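
The linear-time property comes from the recurrent form of a state-space model: each step updates a fixed-size state, so cost grows linearly with prompt length. The sketch below is a generic scalar linear SSM used only to illustrate that recurrence; it is not Longhorn's update rule, and `ssm_scan` with its parameters `a`, `b`, `c` is a hypothetical name.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    One pass over the inputs with a constant-size state: O(T) time, O(1)
    state memory, independent of how long the prompt is.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs
```

By contrast, self-attention compares every token with every earlier token, which is where the quadratic cost of Transformer-based ICIL on long prompts comes from.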

Large Language Models for Pedestrian Safety: An Application to Predicting Driver Yielding Behavior at Unsignalized Intersections

Authors:Yicheng Yang, Zixian Li, Jean Paul Bizimana, Niaz Zafri, Yongfeng Dong, Tianyi Li

Pedestrian safety is a critical component of urban mobility and is strongly influenced by the interactions between pedestrian decision-making and driver yielding behavior at crosswalks. Modeling driver–pedestrian interactions at intersections requires accurately capturing the complexity of these behaviors. Traditional machine learning models often struggle to capture the nuanced and context-dependent reasoning required for these multifactorial interactions, due to their reliance on fixed feature representations and limited interpretability. In contrast, large language models (LLMs) are suited for extracting patterns from heterogeneous traffic data, enabling accurate modeling of driver-pedestrian interactions. Therefore, this paper leverages multimodal LLMs through a novel prompt design that incorporates domain-specific knowledge, structured reasoning, and few-shot prompting, enabling interpretable and context-aware inference of driver yielding behavior, as an example application of modeling pedestrian–driver interaction. We benchmarked state-of-the-art LLMs against traditional classifiers, finding that GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. These findings highlight the critical trade-offs between model performance and computational efficiency, offering practical guidance for deploying LLMs in real-world pedestrian safety systems.

Summary

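Pedestrian safety is a critical component of urban mobility and is strongly shaped by the interaction between pedestrian decision-making and driver yielding behavior at crosswalks. Traditional machine learning models struggle to capture the nuanced, context-dependent reasoning these multifactorial interactions require. This paper instead models driver-pedestrian interaction with multimodal large language models (LLMs), using a novel prompt design that combines domain-specific knowledge, structured reasoning, and few-shot prompting to infer driver yielding behavior in an interpretable, context-aware way. In a benchmark of state-of-the-art LLMs against traditional classifiers, GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision. The findings offer practical guidance for deploying LLMs in real-world pedestrian safety systems.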

Key Takeaways

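Pedestrian safety is a key part of urban mobility and depends on how pedestrian decisions interact with driver yielding behavior.
Traditional machine learning models struggle to capture the complexity and context dependence of driver-pedestrian interactions.
Large language models (LLMs) are suited to extracting patterns from heterogeneous traffic data, which enables accurate modeling of these interactions.
A novel prompt design lets LLMs infer driver yielding behavior in an interpretable, context-aware way.
GPT-4o consistently achieves the highest accuracy and recall, while Deepseek-V3 excels in precision.
The results highlight trade-offs between model performance and computational efficiency and offer practical guidance for deploying LLMs in real-world pedestrian safety systems.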

Semantic-Aware Fuzzing: An Empirical Framework for LLM-Guided, Reasoning-Driven Input Mutation

Authors:Mengdi Lu, Steven Ding, Furkan Alaca, Philippe Charland

Security vulnerabilities in Internet-of-Things devices, mobile platforms, and autonomous systems remain critical. Traditional mutation-based fuzzers – while effectively explore code paths – primarily perform byte- or bit-level edits without semantic reasoning. Coverage-guided tools such as AFL++ use dictionaries, grammars, and splicing heuristics to impose shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. Conversely, reasoning-capable large language models (LLMs) can leverage pretraining knowledge to understand input formats, respect complex constraints, and propose targeted mutations, much like an experienced reverse engineer or testing expert. However, lacking ground truth for “correct” mutation reasoning makes supervised fine-tuning impractical, motivating explorations of off-the-shelf LLMs via prompt-based few-shot learning. To bridge this gap, we present an open-source microservices framework that integrates reasoning LLMs with AFL++ on Google’s FuzzBench, tackling asynchronous execution and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers. We evaluate four research questions: (R1) How can reasoning LLMs be integrated into the fuzzing mutation loop? (R2) Do few-shot prompts yield higher-quality mutations than zero-shot? (R3) Can prompt engineering with off-the-shelf models improve fuzzing directly? and (R4) Which open-source reasoning LLMs perform best under prompt-only conditions? Experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 highlight Deepseek as the most promising. Mutation effectiveness depends more on prompt complexity and model choice than shot count. Response latency and throughput bottlenecks remain key obstacles, offering directions for future work.

Summary

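Security vulnerabilities in Internet-of-Things devices, mobile platforms, and autonomous systems remain critical. Traditional mutation-based fuzzers explore code paths effectively but mainly perform byte- or bit-level edits without semantic reasoning. Coverage-guided tools such as AFL++ use dictionaries, grammars, and splicing heuristics to impose shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. Reasoning-capable large language models (LLMs), by contrast, can draw on pretraining knowledge to understand input formats, respect complex constraints, and propose targeted mutations, much like an experienced reverse engineer or testing expert. The paper presents an open-source microservices framework that integrates reasoning LLMs with AFL++ on Google's FuzzBench and handles the asynchronous execution and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers. It evaluates four research questions. Experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 highlight Deepseek as the most promising, and mutation effectiveness depends more on prompt complexity and model choice than on shot count.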

Key Takeaways

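Security vulnerabilities in Internet-of-Things devices, mobile platforms, and autonomous systems remain critical.
Traditional mutation-based fuzzers mainly perform byte- or bit-level edits without semantic reasoning.
Coverage-guided tools such as AFL++ impose only shallow structural constraints and leave deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed.
Reasoning-capable LLMs can understand input formats, respect complex constraints, and propose targeted mutations.
The open-source microservices framework integrates reasoning LLMs with AFL++ and handles their asynchronous execution and divergent hardware demands (GPU- vs. CPU-intensive).
Among the open-source reasoning LLMs evaluated under prompt-only conditions, Deepseek is the most promising; mutation effectiveness depends more on prompt complexity and model choice than on shot count.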

Self-evolved Imitation Learning in Simulated World

Authors:Yifan Ye, Jun Cen, Jing Chen, Zhihe Lu

Imitation learning has been a trend recently, yet training a generalist agent across multiple tasks still requires large-scale expert demonstrations, which are costly and labor-intensive to collect. To address the challenge of limited supervision, we propose Self-Evolved Imitation Learning (SEIL), a framework that progressively improves a few-shot model through simulator interactions. The model first attempts tasks in the simulator, from which successful trajectories are collected as new demonstrations for iterative refinement. To enhance the diversity of these demonstrations, SEIL employs dual-level augmentation: (i) Model-level, using an Exponential Moving Average (EMA) model to collaborate with the primary model, and (ii) Environment-level, introducing slight variations in initial object positions. We further introduce a lightweight selector that filters complementary and informative trajectories from the generated pool to ensure demonstration quality. These curated samples enable the model to achieve competitive performance with far fewer training examples. Extensive experiments on the LIBERO benchmark show that SEIL achieves a new state-of-the-art performance in few-shot imitation learning scenarios. Code is available at https://github.com/Jasper-aaa/SEIL.git.

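Summary

Self-Evolved Imitation Learning (SEIL) addresses limited supervision by progressively improving a few-shot model through simulator interactions. The model first attempts tasks in the simulator, and successful trajectories are collected as new demonstrations for iterative refinement. To increase the diversity of these demonstrations, SEIL applies dual-level augmentation: at the model level, an Exponential Moving Average (EMA) model collaborates with the primary model; at the environment level, initial object positions are varied slightly. A lightweight selector then filters complementary and informative trajectories from the generated pool to ensure demonstration quality. Extensive experiments on the LIBERO benchmark show that SEIL achieves new state-of-the-art performance in few-shot imitation learning scenarios.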

Key Takeaways

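SEIL is a framework that improves a few-shot model through simulator interactions, addressing the challenge of limited supervision.
The model attempts tasks in the simulator, and successful trajectories are collected as new demonstrations for iterative refinement.
SEIL uses dual-level augmentation to increase the diversity of demonstrations.
At the model level, an Exponential Moving Average (EMA) model collaborates with the primary model.
At the environment level, initial object positions are varied slightly.
A lightweight selector filters complementary and informative trajectories from the generated pool to ensure demonstration quality.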
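
Two of the building blocks above are simple enough to sketch: the standard EMA parameter update and the step that keeps only successful rollouts as new demonstrations. The dictionary-based parameter format, the rollout record layout, and both function names are assumptions made for illustration; SEIL's actual implementation is in the repository linked in the abstract.

```python
def ema_update(ema_params, model_params, decay=0.99):
    """Standard exponential moving average of model parameters.

    Each EMA value moves a fraction (1 - decay) toward the current model value.
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * model_params[name]
        for name in ema_params
    }


def collect_successes(rollouts):
    """Keep the trajectories of rollouts that completed the task."""
    return [r["trajectory"] for r in rollouts if r["success"]]
```

Because the EMA model changes more slowly than the primary model, the two can produce different successful trajectories for the same task, which is one plausible reading of how the model-level augmentation adds diversity.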

Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Authors:Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos

Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP

Paper and project links

PDF Accepted at EMNLP 2025

Summary

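This paper examines the limitations of contrastively trained vision-language models (VLMs) such as CLIP: shallow language understanding, a modality gap induced by the dual-encoder design, and reliance on vast web-collected training data, which is computationally expensive and raises privacy concerns. To address them, it introduces a vision-free, single-encoder retrieval pipeline that replaces raw images with VLLM-generated structured textual descriptions and performs text-to-text retrieval. This substantially reduces the modality gap, improves compositionality, and gives better performance on short and long caption queries. The authors also release two benchmarks derived from Flickr30k and COCO, subFlickr and subCOCO, to assess generalisation and compositionality. The approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks with models as small as 0.3B parameters.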

Key Takeaways

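Contrastively trained VLMs such as CLIP show shallow language understanding and a modality gap.
A vision-free, single-encoder pipeline retrieves images through their textual descriptions, reducing the modality gap and improving performance.
The subFlickr and subCOCO benchmarks are released to assess generalisation and compositionality.
The approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks with models as small as 0.3B parameters.
Replacing raw images with textual descriptions offers a more privacy-friendly alternative for retrieval.
The method needs only a few hours of calibration on two GPUs.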
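
The text-to-text paradigm can be sketched as: describe each image in text once, then rank those descriptions against the text query with a single encoder. In the toy example below, the word-count `embed` function is only a stand-in for a learned text encoder, and the descriptions are assumed to have been generated beforehand; none of the names come from the paper's code.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text encoder: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query, descriptions, top_k=1):
    """Rank image ids by similarity between the query and each description."""
    q = embed(query)
    ranked = sorted(
        descriptions,
        key=lambda image_id: cosine(q, embed(descriptions[image_id])),
        reverse=True,
    )
    return ranked[:top_k]
```

Since both sides of the comparison are text passed through the same encoder, there is no separate image embedding space to align, which is the intuition behind the reduced modality gap.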

Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

Authors:Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza

This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret the Iconography, typically addressed by supervised classifiers, and evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? And (RQ2), how does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where Siglip reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.


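Paper and project links

PDF 11 pages, 2 figures

Summary

This study evaluates multimodal large language models (LLMs) and vision-language models (VLMs) on single-label classification of Christian iconography. It asks whether general-purpose VLMs (CLIP and SigLIP) and LLMs (such as GPT-4o and Gemini 2.5) can interpret iconography that is usually handled by supervised classifiers. The benchmark uses three datasets that support Iconclass natively (ArtDL, ICONCLASS, and Wikidata), each filtered to its 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification using Iconclass descriptions, and (3) few-shot learning with five exemplars. Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. On the Wikidata dataset, SigLIP reached the highest accuracy, which suggests sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, whereas few-shot learning gave lower results with only occasional, minimal accuracy gains. The authors conclude that general-purpose multimodal LLMs can classify images in visually complex cultural heritage domains. This supports their use as metadata curation tools in digital humanities workflows and motivates further work on prompt optimization and on other classification strategies and models.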

Key Takeaways

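The study evaluates multimodal LLMs and VLMs on the classification of Christian iconography.
Three datasets are used (ArtDL, ICONCLASS, and Wikidata), each restricted to its 10 most frequent classes.
Models are tested under three conditions: class labels, Iconclass descriptions, and few-shot learning.
Gemini-2.5 Pro and GPT-4o outperform the fine-tuned ResNet50 baselines.
On the Wikidata dataset, models appear sensitive to image size and metadata alignment, and SigLIP achieves the highest accuracy.
Enriching prompts with class descriptions generally improves zero-shot performance, while few-shot learning yields only occasional, minimal gains.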
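
The class filtering and the three test conditions can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the prompt wording, and the example labels and descriptions are all assumptions, and a real run would pass the image itself to the model alongside this text.

```python
from collections import Counter


def top_k_classes(labels, k=10):
    """Return the k most frequent class labels, most frequent first."""
    return [label for label, _ in Counter(labels).most_common(k)]


def build_prompt(classes, descriptions=None, exemplars=None):
    """Assemble the text part of a prompt for one test condition.

    - class labels only: leave descriptions and exemplars as None
    - labels plus descriptions: pass a {label: description} dict
    - few-shot: pass a list of (image_ref, label) pairs
    The wording is hypothetical, not taken from the paper.
    """
    lines = [
        "Classify the saint depicted in the image.",
        "Answer with exactly one of these classes:",
    ]
    for label in classes:
        if descriptions and label in descriptions:
            lines.append(f"- {label}: {descriptions[label]}")
        else:
            lines.append(f"- {label}")
    for image_ref, label in exemplars or []:
        lines.append(f"Example image {image_ref} -> {label}")
    lines.append("Class:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical labels and descriptions, for illustration only.
    classes = ["St. Jerome", "St. Francis of Assisi"]
    descriptions = {"St. Jerome": "often shown with a lion and a book"}
    print(build_prompt(classes, descriptions))
```

Keeping the three conditions behind one function makes it easy to hold everything else constant when comparing them.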


TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1


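Paper and project links

PDF Accepted at NeurIPS 2025

Summary

This paper introduces TempSamp-R1, a reinforcement fine-tuning framework for adapting multimodal large language models to video temporal grounding. It addresses the inefficiency and limited performance of existing reinforcement learning methods in large temporal search spaces. Ground-truth annotations serve as off-policy supervision that gives temporally precise guidance, and a non-linear soft advantage computation dynamically reshapes the reward feedback. A hybrid Chain-of-Thought training paradigm lets a single model handle queries of varying reasoning complexity. Experiments show new state-of-the-art results on Charades-STA, ActivityNet Captions, and QVHighlights, along with robust few-shot generalization.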

Key Takeaways

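TempSamp-R1 is a reinforcement fine-tuning framework that improves how multimodal large language models adapt to video temporal grounding.
Existing reinforcement learning methods such as GRPO rely on on-policy sampling, which is inefficient and limited in performance when the temporal search space is large.
TempSamp-R1 uses ground-truth annotations as off-policy supervision to provide temporally precise guidance.
It introduces a non-linear soft advantage computation that dynamically reshapes the reward feedback.
A hybrid Chain-of-Thought training paradigm supports queries of varying reasoning complexity in both CoT and non-CoT inference modes.
TempSamp-R1 sets new state-of-the-art results on Charades-STA, ActivityNet Captions, and QVHighlights.
It also shows robust few-shot generalization under limited data.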
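
To make the sampling problem concrete, the sketch below computes temporal IoU, the R1@threshold metric, and a plain group-normalized advantage of the kind GRPO uses. It is an illustration under assumptions: rewards are taken to be the temporal IoU, and the ground-truth segment is appended to the group with reward 1.0 as a simple stand-in for off-policy supervision. The paper's non-linear soft advantage is not reproduced here, since the abstract does not give its formula.

```python
import statistics


def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gt = (10.0, 20.0)
    on_policy = [(0.0, 4.0), (1.0, 5.0), (22.0, 30.0)]  # all miss the target
    rewards = [temporal_iou(s, gt) for s in on_policy]
    print(group_advantages(rewards))          # every advantage is zero
    print(group_advantages(rewards + [1.0]))  # ground truth added to the group
```

When every on-policy sample misses the target, the normalized advantages are all zero and the update carries no signal. Adding the ground-truth segment to the group gives it a positive advantage, which is the intuition behind compensating for sparse on-policy solutions.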


Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Authors:Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.


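Paper and project links

PDF NeurIPS 2025

Summary

The performance of large language models (LLMs) can be enhanced with test-time computation that draws on external tools and other deep learning models. Existing ways of integrating non-text modality representations into LLMs, however, usually require additional costly supervised training, which restricts on-the-fly adaptation to new domains and modalities. This work explores whether representations from non-text foundational models (FMs) can be integrated into text-based LLMs without any training. It proposes In-Context Representation Learning (ICRL) as a proof of concept that lets LLMs use non-text modality representations adaptively through few-shot learning. Unlike traditional in-context learning, which uses text-label pairs, ICRL replaces the text inputs with FM representations, so the LLM can perform multimodal inference without fine-tuning. ICRL is evaluated on a suite of molecular-domain tasks around three questions: how to map FM representations into LLMs without training, which factors influence ICRL performance, and which mechanisms underlie its effectiveness. To the authors' knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, and it points to a promising direction for adaptable multimodal generalization.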

Key Takeaways

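LLM performance can be enhanced with test-time computation that relies on external tools and other deep learning models.
Existing methods for integrating non-text modality representations into LLMs require additional supervised training, which limits on-the-fly adaptation to new domains and modalities.
The paper proposes In-Context Representation Learning (ICRL), which lets LLMs use non-text modality representations through few-shot learning.
ICRL integrates representations from non-text foundational models (FMs) without training by replacing the text inputs with FM representations, enabling multimodal inference.
ICRL is evaluated on a suite of tasks in the molecular domain.
The study examines how to map FM representations into LLMs, which factors influence ICRL performance, and which mechanisms underlie its effectiveness.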
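
One conceivable way to place FM representations where the text inputs would normally go is to shrink each embedding and write it into the prompt as numbers. The sketch below does this with a seeded random projection. It is an assumption for illustration only: the abstract does not say which mapping ICRL uses, and the function names, projection method, output size, and labels here are hypothetical.

```python
import random


def project(vec, out_dim, seed=0):
    """Random Gaussian projection of a vector down to out_dim values.

    The fixed seed means every vector of the same length is projected
    with the same matrix, so the outputs are comparable across examples.
    """
    rng = random.Random(seed)
    scale = len(vec) ** 0.5
    return [sum(rng.gauss(0, 1) * x for x in vec) / scale for _ in range(out_dim)]


def serialize(vec, digits=2):
    """Render a vector as a short bracketed string."""
    return "[" + ", ".join(f"{v:.{digits}f}" for v in vec) + "]"


def build_icrl_prompt(examples, query, out_dim=4):
    """Few-shot prompt whose inputs are representations instead of text.

    examples: list of (embedding, label) pairs; query: one embedding.
    """
    blocks = [
        f"Input: {serialize(project(emb, out_dim))}\nLabel: {label}"
        for emb, label in examples
    ]
    blocks.append(f"Input: {serialize(project(query, out_dim))}\nLabel:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical molecule embeddings and labels.
    shots = [([0.5, -1.0, 2.0, 0.0, 1.5], "soluble"), ([1.0] * 5, "insoluble")]
    print(build_icrl_prompt(shots, [0.1] * 5))
```

Nothing is trained in this sketch, which matches the training-free setting: the only learned components are the frozen FM that produced the embeddings and the frozen LLM that reads the prompt.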


Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

Authors:Heng Zhang, Chengzhi Zhang

The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping the categorized phrases to their locations in the source documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.


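Paper and project links

PDF

Summary

This paper proposes an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. The pipeline uses SciBERT with Positive-Unlabeled learning to identify workflow-descriptive paragraphs, Flan-T5 with prompt learning to generate workflow phrases, and ChatGPT with few-shot learning to classify the phrases into stages. The categorized phrases are then mapped to their document locations to produce visual flowcharts. The work offers a technical framework for automated workflow generation and a process-oriented perspective on how research paradigms evolve.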

Key Takeaways

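The paper proposes an end-to-end framework for automatically generating research workflows, aimed at improving reproducibility and accelerating the "AI for Science" paradigm.
It uses SciBERT, Flan-T5, and ChatGPT to identify workflow-descriptive paragraphs, generate workflow phrases, and classify them.
Phrases are categorized into data preparation, data processing, and data analysis stages and rendered as visual flowcharts.
The approach is validated through a case study in the NLP domain and reveals key methodological shifts over the past two decades, such as the growing emphasis on data analysis and the transition from feature engineering to ablation studies.
The framework provides a validated technical basis for automated workflow generation and a novel, process-oriented view of evolving scientific paradigms.
Source code and data are available at https://github.com/ZH-heng/research_workflow.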
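
The ROUGE-1 and ROUGE-2 scores quoted for the phrase-generation step measure n-gram overlap between a generated phrase and its reference. The sketch below is a simplified F1 variant with whitespace tokenization and no stemming. Those choices are assumptions, so its numbers will not necessarily match the scoring tool the authors used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: harmonic mean of clipped n-gram precision and recall."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the smaller side
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical generated and reference workflow phrases.
    print(rouge_n("tokenize the corpus", "tokenize corpus", n=1))
    print(rouge_n("tokenize the corpus", "tokenize corpus", n=2))
```

In the example, both reference unigrams appear in the candidate, giving a precision of 2/3 and a recall of 1. No bigram is shared, which shows why ROUGE-2 is typically lower than ROUGE-1 for the same output.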


Data-Augmented Few-Shot Neural Emulator for Computer-Model System Identification

Authors:Sanket Jantre, Deepak Akhare, Zhiyuan Wang, Xiaoning Qian, Nathan M. Urban

Partial differential equations (PDEs) underpin the modeling of many natural and engineered systems. It can be convenient to express such models as neural PDEs rather than using traditional numerical PDE solvers by replacing part or all of the PDE’s governing equations with a neural network representation. Neural PDEs are often easier to differentiate, linearize, reduce, or use for uncertainty quantification than the original numerical solver. They are usually trained on solution trajectories obtained by long-horizon rollout of the PDE solver. Here we propose a more sample-efficient data-augmentation strategy for generating neural PDE training data from a computer model by space-filling sampling of local “stencil” states. This approach removes a large degree of spatiotemporal redundancy present in trajectory data and oversamples states that may be rarely visited but help the neural PDE generalize across the state space. We demonstrate that accurate neural PDE stencil operators can be learned from synthetic training data generated by the computational equivalent of 10 timesteps’ worth of numerical simulation. Accuracy is further improved if we assume access to a single full-trajectory simulation from the computer model, which is typically available in practice. Across several PDE systems, we show that our data-augmented stencil data yield better trained neural stencil operators, with clear performance gains compared with naively sampled stencil data from simulation trajectories. Finally, with only 10 solver steps’ worth of augmented stencil data, our approach outperforms traditional ML emulators trained on thousands of trajectories in long-horizon rollout accuracy and stability.


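Paper and project links

PDF

Summary

This paper concerns neural PDEs, in which part or all of a PDE's governing equations are replaced with a neural network representation. To make training more sample-efficient, it proposes a data-augmentation strategy that generates training data from a computer model by space-filling sampling of local "stencil" states. The approach removes much of the spatiotemporal redundancy in trajectory data and oversamples states that are rarely visited but help the neural PDE generalize across the state space. Across several PDE systems, neural stencil operators trained on the augmented data perform better than those trained on stencil data naively sampled from simulation trajectories.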

Key Takeaways

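A neural PDE replaces part or all of a PDE's governing equations with a neural network representation.
To make neural PDE training more sample-efficient, the paper proposes a data-augmentation strategy based on space-filling sampling.
The strategy generates training data by sampling local "stencil" states, which removes much of the redundancy in trajectory data.
It oversamples states that are rarely visited but help the neural PDE generalize across the state space.
Accurate neural PDE stencil operators can be learned from synthetic data costing the computational equivalent of 10 timesteps of numerical simulation.
Across several PDE systems, the augmented stencil data yield better-trained neural stencil operators than naively sampled stencil data.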
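
The stencil-sampling idea can be illustrated on a toy problem. The sketch below draws three-point stencil states with a Latin hypercube, one common space-filling design, and labels each with one explicit step of a 1D heat equation. Both choices are assumptions for illustration: the abstract names neither the sampling design nor the PDE systems, and `heat_step` is a stand-in for a call into the real computer model.

```python
import random


def latin_hypercube(n, dims, lo, hi, seed=0):
    """n points in [lo, hi]^dims with one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(dims):
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([lo + (hi - lo) * (s + rng.random()) / n for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]


def heat_step(left, center, right, alpha=0.1):
    """One explicit finite-difference step of u_t = u_xx at a single cell.

    alpha stands for dt / dx^2 (hypothetical value).
    """
    return center + alpha * (left - 2.0 * center + right)


def make_stencil_dataset(n, lo=-1.0, hi=1.0):
    """Pairs of (stencil state, next center value) for supervised training."""
    return [(s, heat_step(*s)) for s in latin_hypercube(n, 3, lo, hi)]


if __name__ == "__main__":
    for stencil, target in make_stencil_dataset(5):
        print([round(v, 3) for v in stencil], "->", round(target, 3))
```

Each sample costs one local solver evaluation rather than a full rollout, and the sampled states cover the whole range evenly instead of clustering where trajectories happen to spend their time.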


Multimodal Reference Visual Grounding

Authors:Yangxiao Lu, Ruosen Li, Liqiang Jing, Jikai Wang, Xinya Du, Yunhui Guo, Nicholas Ruozzi, Yu Xiang

Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-72B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding, which has wide applications in robotics. Project page with our video, code, and dataset: https://irvlutd.github.io/MultiGrounding


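Paper and project links

PDF Project page with our code and dataset: https://irvlutd.github.io/MultiGrounding

Summary

This paper introduces a new visual grounding task, Multimodal Reference Visual Grounding (MRVG), in which a model has access to a set of reference images of objects in a database and must detect a target object in a query image from those reference images and a language expression. The authors contribute a new dataset and a new method, MRVG-Net, which combines few-shot object detection over the reference images with LLM-based object matching. MRVG-Net grounds visually similar objects and outperforms state-of-the-art large vision-language models such as Qwen2.5-VL-72B. The approach bridges the gap between few-shot detection and visual grounding and has wide applications in robotics.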

Key Takeaways

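The paper introduces a new visual grounding task, Multimodal Reference Visual Grounding (MRVG).
It presents a new dataset for studying the MRVG problem.
It proposes a new method, MRVG-Net, to solve this visual grounding problem.
MRVG-Net uses reference images for few-shot object detection and large language models for object matching.
MRVG-Net grounds visually similar objects and outperforms state-of-the-art LVLMs such as Qwen2.5-VL-72B.
The approach bridges the gap between few-shot detection and visual grounding.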
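
The two-stage idea, labeling detections with the help of reference images and then matching labels to the language expression, can be mocked up as follows. This is a toy stand-in and not MRVG-Net: cosine similarity over made-up feature vectors replaces the few-shot detector, word overlap replaces the LLM matching step, and every name and data structure here is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_detections(detections, references):
    """Give each detected box the name of its most similar reference object."""
    labeled = []
    for box, feature in detections:
        name = max(references, key=lambda n: cosine(feature, references[n]["feature"]))
        labeled.append((box, name))
    return labeled


def ground(expression, labeled, references):
    """Return the box whose reference description best matches the expression.

    Word overlap is a toy replacement for the LLM-based matching step.
    """
    words = set(expression.lower().split())

    def score(item):
        description = references[item[1]]["description"]
        return len(words & set(description.lower().split()))

    return max(labeled, key=score)[0]


if __name__ == "__main__":
    references = {
        "diet coke": {"feature": [1.0, 0.0], "description": "silver can of diet coke"},
        "regular coke": {"feature": [0.0, 1.0], "description": "red can of regular coke"},
    }
    detections = [((10, 10, 50, 90), [0.9, 0.1]), ((60, 10, 100, 90), [0.2, 0.8])]
    labeled = label_detections(detections, references)
    print(ground("the diet coke can", labeled, references))
```

The point of the split is that look-alike objects are told apart using the reference images, while the language expression only has to select among already-named candidates.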


Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Authors:Donghao Huang, Zhaoxia Wang

Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1, an open-source reasoning model, against OpenAI’s GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.


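Paper and project links

PDF 10 pages, with 2 figures and 6 tables, accepted for publication in an IEEE Intelligent Systems journal

Summary

DeepSeek-R1 performs strongly on sentiment analysis and shows an eightfold improvement in few-shot efficiency over OpenAI's GPT-4o. With just 5 shots it reaches a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks, and its transparent, step-by-step reasoning traces give it superior explainability.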

Key Takeaways

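DeepSeek-R1 performs strongly on sentiment analysis.
Compared with OpenAI's GPT-4o family, DeepSeek-R1 is markedly more efficient in few-shot settings.
DeepSeek-R1 reaches a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks.
It offers superior explainability through transparent, step-by-step reasoning traces.
Distillation effects depend on architecture: a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points.
Although its reasoning process reduces throughput, DeepSeek-R1 remains a powerful, interpretable open-source alternative.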
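
A few-shot learning curve is obtained by repeating the same evaluation with a growing number of in-prompt demonstrations. The sketch below shows the two pieces such an evaluation needs: a k-shot prompt builder and an F1 score. It rests on assumptions: the prompt wording and label names are hypothetical, and macro averaging is used because the abstract does not say how the F1 score was averaged.

```python
def few_shot_prompt(shots, text, labels):
    """Build a k-shot sentiment prompt; shots is a list of (review, label)."""
    header = "Classify the sentiment as one of: " + ", ".join(labels) + "."
    demos = [f"Review: {review}\nSentiment: {label}" for review, label in shots]
    return "\n\n".join([header, *demos, f"Review: {text}\nSentiment:"])


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    labels = ["negative", "positive"]  # hypothetical label set
    shots = [("Loved it.", "positive"), ("Waste of money.", "negative")]
    print(few_shot_prompt(shots, "Not bad at all.", labels))
    print(macro_f1(["pos", "neg", "pos", "neg"], ["pos", "pos", "pos", "neg"]))
```

Calling `few_shot_prompt` with the first 0, 1, 5, or more shots and scoring the model's answers at each setting yields one point per shot count on the curve.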


HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Authors:Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to base accurate decisions on. To combat this problem, we propose Highlighted Chain-of-Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, LLMs would first re-format the question to add XML tags highlighting key facts, and then generate a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) on a wide range of 17 tasks, from arithmetic and reading comprehension to logical reasoning. When asking humans to verify LLM responses, highlights help time-limited participants to more accurately and efficiently recognize when LLMs are correct. Yet, surprisingly, when LLMs are wrong, HoT tends to make users believe that an answer is correct.


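Paper and project links

PDF

Summary

A weakness of large language models (LLMs) is their tendency to hallucinate non-factual statements, which makes it harder for humans to verify responses and base accurate decisions on them. To address this, the authors propose Highlighted Chain-of-Thought prompting (HoT), which has the LLM use XML tags to ground the facts in its response to those provided in the query. In few-shot settings, HoT outperforms vanilla chain-of-thought prompting (CoT) on 17 tasks ranging from arithmetic and reading comprehension to logical reasoning. For human verifiers, highlights help time-limited participants recognize more accurately and efficiently when the LLM is correct. Surprisingly, however, when the LLM is wrong, HoT tends to make users believe the answer is correct.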
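
The tagging scheme lends itself to a simple structural check: every fact highlighted in the answer should carry a tag that also appears in the re-formatted question. The sketch below assumes numbered tags of the form `<fact1>...</fact1>`. That format, the function names, and the example strings are assumptions for illustration, since the abstract only says that XML tags are used.

```python
import re

# Matches <factN>...</factN> with the same number in both tags.
FACT_TAG = re.compile(r"<fact(\d+)>(.*?)</fact\1>", re.DOTALL)


def extract_facts(text):
    """Map each fact number to the highlighted span it wraps."""
    return {int(num): span.strip() for num, span in FACT_TAG.findall(text)}


def ungrounded_highlights(tagged_question, tagged_answer):
    """Fact numbers highlighted in the answer but absent from the question."""
    question_facts = extract_facts(tagged_question)
    return sorted(n for n in extract_facts(tagged_answer) if n not in question_facts)


def strip_tags(text):
    """Remove the highlight tags and keep the plain text."""
    return re.sub(r"</?fact\d+>", "", text)


if __name__ == "__main__":
    question = "Tom has <fact1>3 apples</fact1> and buys <fact2>2 more</fact2>. How many now?"
    answer = "He starts with <fact1>3 apples</fact1> and adds <fact2>2 more</fact2>, so 5."
    print(extract_facts(question))
    print(ungrounded_highlights(question, answer))
```

A check like this confirms only that highlights point back to the input, not that the reasoning built on them is correct. That distinction matters given the finding that highlights can make wrong answers look more convincing.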

Key Takeaways