⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic settings; they are only suitable for a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-22
AcademicEval: Live Long-Context LLM Benchmark
Authors:Haozhen Zhang, Tao Feng, Pengrui Han, Jiaxuan You
Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs’ long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
Paper & Project Links
PDF Accepted by TMLR. Code is available at https://github.com/ulab-uiuc/AcademicEval
Summary
Large language models (LLMs) have made remarkable progress in long-context understanding, but existing long-context LLM benchmarks suffer from rigid context lengths, labor-intensive annotation, and label leakage. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval draws on arXiv papers to define several academic writing tasks with long-context inputs, including Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. It also integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context lengths. Notably, AcademicEval features efficient live evaluation that ensures no label leakage. A holistic evaluation on AcademicEval shows that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, underscoring the benchmark's difficulty. Experimental analysis also yields insights for strengthening LLMs' long-context modeling capabilities.
Key Takeaways
- Large language models (LLMs) show remarkable performance in long-context understanding.
- Existing long-context LLM benchmarks suffer from rigid context lengths, labor-intensive annotation, and label leakage.
- AcademicEval is proposed as a live benchmark for evaluating LLMs on long-context generation tasks.
- AcademicEval builds on arXiv papers and covers several academic writing tasks, including Title, Abstract, Introduction, and Related Work, with no manual labeling required.
- AcademicEval integrates high-quality, expert-curated few-shot demonstrations from a co-author graph, enabling flexible context lengths.
- LLMs perform poorly on tasks with hierarchical abstraction levels and on long few-shot demonstrations, highlighting the benchmark's difficulty.
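The "live" property above boils down to filtering candidate papers by date, so that evaluation targets postdate any model's training data. A minimal sketch of that idea (illustrative only; the function and field names are assumptions, not the actual AcademicEval pipeline):

```python
from datetime import date

def live_split(papers, training_cutoff):
    """Keep only papers published after the model's training cutoff,
    so ground-truth targets (e.g. the real abstracts) cannot have
    leaked into the model's training data."""
    return [p for p in papers if p["published"] > training_cutoff]

papers = [
    {"id": "a", "published": date(2023, 1, 1)},
    {"id": "b", "published": date(2025, 10, 1)},
]
fresh = live_split(papers, training_cutoff=date(2024, 6, 1))
```

Re-running this split as new papers arrive is what keeps the benchmark "live" rather than a fixed snapshot.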




PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition
Authors:Nanda Kumar Rengarajan, Jun Yan, Chun Wang
Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.
Paper & Project Links
Summary
This paper presents a new approach to named entity recognition (NER) in low-resource scenarios. It rests on two key innovations: a new instruction-tuning template with a simplified output format that exploits the large context windows of recent large language models, and a strategic data augmentation technique that paraphrases the surrounding context while preserving entity information, expanding the training data without compromising semantic relationships. Experiments show performance comparable to state-of-the-art models on benchmark datasets, with the few-shot approach reaching an average F1 score of 80.1 on the CrossNER datasets. Models trained with the paraphrasing approach improve F1 scores by up to 17 points over baseline versions, offering a promising solution for teams with limited NER training data and compute.
Key Takeaways
- A lightweight few-shot NER framework tackles low-resource scenarios through two key innovations.
- A new instruction-tuning template simplifies the output format and takes full advantage of the large context windows of recent large language models.
- A strategic data augmentation technique expands the training data without compromising semantic relationships.
- Experiments on benchmark datasets are strong, with the few-shot approach averaging an F1 score of 80.1 on CrossNER.
- Models trained with the paraphrasing approach improve F1 scores by up to 17 points over baseline versions.
- The method offers a practical solution for teams with limited NER training data and compute resources.
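The entity-preserving idea above can be illustrated in a few lines: paraphrase only the context tokens and copy entity spans through unchanged, so the BIO labels stay aligned. This is a toy sketch with a dictionary standing in for the LLM paraphraser, not the paper's actual implementation:

```python
def entity_preserving_paraphrase(tokens, tags, rewrite):
    """Rewrite only non-entity tokens (tag 'O'); tokens inside entity
    spans are kept verbatim so the labels remain valid."""
    return [rewrite.get(tok, tok) if tag == "O" else tok
            for tok, tag in zip(tokens, tags)]

tokens = ["Alice", "visited", "Paris", "yesterday"]
tags = ["B-PER", "O", "B-LOC", "O"]
rewrite = {"visited": "toured", "yesterday": "recently"}
augmented = entity_preserving_paraphrase(tokens, tags, rewrite)
```

Because the entity tokens and their tags are untouched, the augmented sentence can be added to the training set with the original label sequence.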






On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Authors:Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as “fishing boat” and “yacht” since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples - drastically reducing the high costs of RS imagery annotation. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less than a minute, which is significantly faster than state-of-the-art alternatives. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
Paper & Project Links
Summary
This paper examines the challenges of applying open-vocabulary object detection (OVD) models to remote sensing, notably their weakness on fine-grained class distinctions. To address this, it proposes a cascaded approach that couples a large pre-trained OVD model with a lightweight few-shot classifier: the zero-shot model first generates high-recall object proposals, which a compact classifier, trained in real time on a handful of user-annotated samples, then refines for precision. The core of the framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. Using density estimation and clustering, FLAME quickly identifies uncertain marginal candidates near the decision boundary while ensuring sample diversity. The method achieves efficient sampling and high accuracy without costly full-model fine-tuning, adapts in under a minute, significantly faster than existing alternatives, and surpasses state-of-the-art performance on remote sensing benchmarks, establishing a practical, resource-efficient framework for adapting foundation models to specific user needs.
Key Takeaways
- OVD models can detect objects from arbitrary text queries, but in specialized domains such as remote sensing, their zero-shot performance is constrained by the inherent ambiguity of natural language.
- Existing OVD models may struggle with fine-grained class distinctions (e.g., fishing boats vs. yachts).
- A cascaded approach couples a large pre-trained OVD model with a lightweight few-shot classifier to improve detection precision and efficiency.
- The zero-shot model generates high-recall object proposals, which are refined by a compact classifier trained in real time on a few user-annotated samples.
- At the core of the framework is FLAME, a one-step active learning strategy that selects the most informative samples for training, achieving efficient sampling and high accuracy.
- FLAME identifies uncertain marginal candidates via density estimation and clustering, ensuring sample diversity.
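The boundary-focused half of FLAME's selection can be sketched as ranking proposals by how close their classifier score sits to the 0.5 decision boundary. The density-estimation and clustering steps for diversity are omitted here; this is an assumed simplification for intuition, not the authors' code:

```python
def select_marginal_samples(scored_proposals, k):
    """scored_proposals: (proposal_id, probability) pairs from the
    few-shot classifier. Return the k proposals whose scores lie
    closest to the 0.5 decision boundary, i.e. the most uncertain
    'marginal' candidates worth annotating."""
    return sorted(scored_proposals, key=lambda s: abs(s[1] - 0.5))[:k]

scored = [("a", 0.95), ("b", 0.52), ("c", 0.10), ("d", 0.47), ("e", 0.70)]
picked = select_marginal_samples(scored, k=2)
```

Confident proposals ("a", "c") are skipped; the annotation budget goes to the ambiguous ones.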



One Dinomaly2 Detect Them All: A Unified Framework for Full-Spectrum Unsupervised Anomaly Detection
Authors:Jia Guo, Shuai Lu, Lei Fan, Zelin Li, Donglin Di, Yang Song, Weihang Zhang, Wenbing Zhu, Hong Yan, Fang Chen, Huiqi Li, Hongen Liao
Unsupervised anomaly detection (UAD) has evolved from building specialized single-class models to unified multi-class models, yet existing multi-class models significantly underperform the most advanced one-for-one counterparts. Moreover, the field has fragmented into specialized methods tailored to specific scenarios (multi-class, 3D, few-shot, etc.), creating deployment barriers and highlighting the need for a unified solution. In this paper, we present Dinomaly2, the first unified framework for full-spectrum image UAD, which bridges the performance gap in multi-class models while seamlessly extending across diverse data modalities and task settings. Guided by the “less is more” philosophy, we demonstrate that the orchestration of five simple elements achieves superior performance in a standard reconstruction-based framework. This methodological minimalism enables natural extension across diverse tasks without modification, establishing that simplicity is the foundation of true universality. Extensive experiments on 12 UAD benchmarks demonstrate Dinomaly2’s full-spectrum superiority across multiple modalities (2D, multi-view, RGB-3D, RGB-IR), task settings (single-class, multi-class, inference-unified multi-class, few-shot) and application domains (industrial, biological, outdoor). For example, our multi-class model achieves unprecedented 99.9% and 99.3% image-level (I-) AUROC on MVTec-AD and VisA respectively. For multi-view and multi-modal inspection, Dinomaly2 demonstrates state-of-the-art performance with minimum adaptations. Moreover, using only 8 normal examples per class, our method surpasses previous full-shot models, achieving 98.7% and 97.4% I-AUROC on MVTec-AD and VisA. The combination of minimalistic design, computational scalability, and universal applicability positions Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
Paper & Project Links
PDF Extended version of CVPR2025
Summary
This paper presents Dinomaly2, a unified framework for full-spectrum unsupervised image anomaly detection. The framework bridges the performance gap of multi-class models and extends seamlessly across data modalities and task settings. Guided by a "less is more" philosophy, Dinomaly2 achieves superior performance within a standard reconstruction-based framework. Across multiple unsupervised anomaly detection benchmarks, Dinomaly2 demonstrates full-spectrum superiority over diverse modalities, task settings, and application domains; for example, its multi-class model achieves unprecedented image-level AUROC.
Key Takeaways
- Dinomaly2 is the first unified framework for full-spectrum unsupervised image anomaly detection, bridging the performance gap of multi-class models while extending easily across data modalities and task settings.
- Following the "less is more" philosophy, Dinomaly2 achieves superior performance by orchestrating five simple elements within a reconstruction-based framework.
- Its minimalistic design, computational scalability, and universal applicability position Dinomaly2 as a unified solution for the full spectrum of real-world anomaly detection applications.
- Extensive experiments on multiple unsupervised anomaly detection benchmarks demonstrate Dinomaly2's superiority across modalities, task settings, and application domains.
- The multi-class model reaches unprecedented image-level AUROC on MVTec-AD and VisA.
- Dinomaly2 delivers state-of-the-art performance on multi-view and multi-modal inspection with only minimal adaptation.
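In a reconstruction-based framework like the one described above, the anomaly score is typically the error between an input and its reconstruction: normal samples reconstruct well, anomalies do not. The following is a generic sketch of that scoring rule, not Dinomaly2's specific architecture:

```python
def reconstruction_anomaly_score(x, x_rec):
    """Mean squared error between a feature vector and its
    reconstruction; higher scores indicate likelier anomalies."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)

# A well-reconstructed (normal) sample vs. a poorly reconstructed one.
normal_score = reconstruction_anomaly_score([1.0, 2.0], [1.1, 1.9])
outlier_score = reconstruction_anomaly_score([1.0, 2.0], [3.0, 0.0])
```

Thresholding this score (per image or per pixel) turns the reconstruction model into a detector.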





DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning
Authors:Yongxin He, Shan Zhang, Yixuan Cao, Lei Ma, Ping Luo
Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct. However, AI text generation includes diverse collaborative processes (AI-written text edited by humans, human-written text edited by AI, and AI-generated text refined by other AI), where various or even new LLMs could be involved. Texts generated through these varied processes exhibit complex characteristics, presenting significant challenges for detection. Current methods model these processes rather crudely, primarily employing binary classification (purely human vs. AI-involved) or multi-classification (treating human-AI collaboration as a new class). We observe that representations of texts generated through different processes exhibit inherent clustering relationships. Therefore, we propose DETree, a novel approach that models the relationships among different processes as a Hierarchical Affinity Tree structure, and introduces a specialized loss function that aligns text representations with this tree. To facilitate this learning, we developed RealBench, a comprehensive benchmark dataset that automatically incorporates a wide spectrum of hybrid texts produced through various human-AI collaboration processes. Our method improves performance in hybrid text detection tasks and significantly enhances robustness and generalization in out-of-distribution scenarios, particularly in few-shot learning conditions, further demonstrating the promise of training-based approaches in OOD settings. Our code and dataset are available at https://github.com/heyongxin233/DETree.
Paper & Project Links
PDF To appear in NeurIPS 2025
Summary
This paper addresses the importance of detecting AI-involved text and, to overcome the shortcomings of existing methods, proposes DETree, a new detection approach built on a hierarchical affinity tree. DETree models the relationships among different human-AI collaborative generation processes as a tree structure and introduces a specialized loss function that aligns text representations with this tree. To facilitate learning, the authors also build RealBench, a comprehensive benchmark dataset covering diverse human-AI collaboration scenarios. The method improves hybrid text detection and, especially under few-shot conditions, significantly enhances robustness and generalization in out-of-distribution scenarios.
Key Takeaways
- Detecting AI-involved text is essential for combating misinformation, plagiarism, and academic misconduct.
- Existing detection methods struggle with texts produced through diverse generation processes (e.g., AI-assisted or purely AI-generated).
- DETree addresses this challenge by modeling the relationships among different text generation processes.
- DETree captures these relationships with a hierarchical affinity tree structure and introduces a specialized loss function to optimize performance.
- The RealBench benchmark dataset covers a wide range of real-world human-AI collaborative text generation scenarios, facilitating model learning.
- The method performs strongly on hybrid text detection, particularly under few-shot conditions.


TabR1: Taming GRPO for tabular reasoning LLMs
Authors:Pengxiang Cai, Zihao Gao, Jintai Chen
Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
Paper & Project Links
Summary
This paper presents TabR1, a reasoning model for tabular prediction that combines the cross-task adaptability of large language models with a new algorithm, Permutation Relative Policy Optimization (PRPO). PRPO encodes the column-permutation invariance of tabular data as a structural prior, transforming sparse rewards into dense learning signals and improving generalization. With limited supervision, TabR1 activates the reasoning ability of large language models, improving few-shot and zero-shot performance as well as interpretability. Experiments show that TabR1 matches strong baselines under full-supervision fine-tuning and, in the zero-shot setting, approaches the performance of baselines given 32 shots. TabR1 also substantially outperforms much larger LLMs across a range of tasks.
Key Takeaways
- TabR1 brings the cross-task adaptability of large language models to tabular prediction.
- TabR1 introduces Permutation Relative Policy Optimization (PRPO) to exploit the column-permutation invariance of tabular data.
- PRPO transforms sparse rewards into dense learning signals, improving the model's generalization.
- With limited supervision, TabR1 activates the reasoning ability of large language models.
- TabR1 improves few-shot and zero-shot performance while enhancing interpretability.
- Experiments show TabR1 performs comparably to baseline models under full-supervision fine-tuning.
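The permutation step PRPO builds on is easy to picture: serialize the same labeled row under several column orders, leaving the label untouched. A hedged sketch follows; the "key=value" serialization format is an assumption, and the paper's actual prompt construction may differ:

```python
import itertools

def column_permutation_views(row, n_views):
    """Produce n_views textual serializations of one tabular example
    under different column orders; each is a label-preserving view
    of the same sample."""
    orders = itertools.permutations(sorted(row))
    return ["; ".join(f"{col}={row[col]}" for col in order)
            for order in itertools.islice(orders, n_views)]

row = {"age": 39, "edu": "BSc", "income": "50K"}
views = column_permutation_views(row, n_views=3)
```

Since every view carries the same label, advantages can be compared both within and across views, which is how PRPO densifies the reward signal.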




Reliability of Large Language Model Generated Clinical Reasoning in Assisted Reproductive Technology: Blinded Comparative Evaluation Study
Authors:Dou Liu, Ying Long, Sophia Zuoqiu, Di Liu, Kang Li, Yiting Lin, Hanyi Liu, Rong Yin, Tian Tang
Creating high-quality clinical Chains-of-Thought (CoTs) is crucial for explainable medical Artificial Intelligence (AI) while constrained by data scarcity. Although Large Language Models (LLMs) can synthesize medical data, their clinical reliability remains unverified. This study evaluates the reliability of LLM-generated CoTs and investigates prompting strategies to enhance their quality. In a blinded comparative study, senior clinicians in Assisted Reproductive Technology (ART) evaluated CoTs generated via three distinct strategies: Zero-shot, Random Few-shot (using shallow examples), and Selective Few-shot (using diverse, high-quality examples). These expert ratings were compared against evaluations from a state-of-the-art AI model (GPT-4o). The Selective Few-shot strategy significantly outperformed other strategies across all human evaluation metrics (p < .001). Critically, the Random Few-shot strategy offered no significant improvement over the Zero-shot baseline, demonstrating that low-quality examples are as ineffective as no examples. The success of the Selective strategy is attributed to two principles: “Gold-Standard Depth” (reasoning quality) and “Representative Diversity” (generalization). Notably, the AI evaluator failed to discern these critical performance differences. The clinical reliability of synthetic CoTs is dictated by strategic prompt curation, not the mere presence of examples. We propose a “Dual Principles” framework as a foundational methodology to generate trustworthy data at scale. This work offers a validated solution to the data bottleneck and confirms the indispensable role of human expertise in evaluating high-stakes clinical AI.
Paper & Project Links
Summary
This study examines how to create high-quality clinical chains-of-thought (CoTs) from limited clinical data to make medical artificial intelligence (AI) more explainable under data scarcity. It evaluates the reliability of LLM-generated CoTs and investigates prompting strategies for improving their quality. In a blinded comparative study, CoTs were generated via three strategies: zero-shot, random few-shot (using shallow examples), and selective few-shot (using diverse, high-quality examples). The selective few-shot strategy significantly outperformed the others across all human evaluation metrics (p < .001). Its success rests on two principles: "gold-standard depth" (reasoning quality) and "representative diversity" (generalization). The AI evaluator could not discern these critical performance differences, so clinical reliability hinges on strategic prompt curation rather than the mere presence of examples. The study proposes a "Dual Principles" framework as a foundational methodology for generating trustworthy data at scale.
Key Takeaways
- Creating high-quality CoTs is crucial for explainable medical AI, especially under data scarcity.
- Large language models show potential for synthesizing medical data, but their clinical reliability remains unverified.
- In the comparative study, the selective few-shot strategy performed best at generating CoTs.
- The key to its success lies in two principles: "gold-standard depth" and "representative diversity".
- The AI evaluator failed to discern the critical performance differences between strategies, underscoring the indispensable role of human expertise in evaluating high-stakes clinical AI.
- Strategic prompt curation, more than the mere presence of examples, determines clinical reliability.



Synergistic Enhancement of Requirement-to-Code Traceability: A Framework Combining Large Language Model based Data Augmentation and an Advanced Encoder
Authors:Jianzhang Zhang, Jialong Zhou, Nan Niu, Jinping Hua, Chuang Liu
Automated requirement-to-code traceability link recovery, essential for industrial system quality and safety, is critically hindered by the scarcity of labeled data. To address this bottleneck, this paper proposes and validates a synergistic framework that integrates large language model (LLM)-driven data augmentation with an advanced encoder. We first demonstrate that data augmentation, optimized through a systematic evaluation of bi-directional and zero/few-shot prompting strategies, is highly effective, while the choice among leading LLMs is not a significant performance factor. Building on the augmented data, we further enhance an established, state-of-the-art pre-trained language model based method by incorporating an encoder distinguished by a broader pre-training corpus and an extended context window. Our experiments on four public datasets quantify the distinct contributions of our framework’s components: on its own, data augmentation consistently improves the baseline method, providing substantial performance gains of up to 26.66%; incorporating the advanced encoder provides an additional lift of 2.21% to 11.25%. This synergy culminates in a fully optimized framework with maximum gains of up to 28.59% on $F_1$ score and 28.9% on $F_2$ score over the established baseline, decisively outperforming ten established baselines from three dominant paradigms. This work contributes a pragmatic and scalable methodology to overcome the data scarcity bottleneck, paving the way for broader industrial adoption of data-driven requirement-to-code traceability.
Paper & Project Links
Summary
To overcome the labeled-data bottleneck that hinders requirement-to-code traceability link recovery, which is essential for industrial system quality and safety, this paper proposes a synergistic framework that integrates large language model driven data augmentation with an advanced encoder. A systematic evaluation of bi-directional and zero/few-shot prompting strategies demonstrates that data augmentation is highly effective. Building on the augmented data, the authors further enhance a state-of-the-art method based on pre-trained language models by incorporating an encoder with a broader pre-training corpus and an extended context window. Experiments show that each component of the framework makes a distinct contribution, and combining data augmentation with the advanced encoder yields substantial performance gains. The synergistic framework overcomes the data scarcity bottleneck and paves the way for data-driven requirement-to-code traceability in industrial applications.
Key Takeaways
- The paper proposes a synergistic framework for recovering traceability links between requirements and code in industrial systems.
- The framework integrates large language model (LLM) driven data augmentation with an advanced encoder.
- The data augmentation strategy, validated through systematic evaluation of different prompting strategies, proves key to the performance gains.
- The advanced encoder, combining a broader pre-training corpus and an extended context window, further boosts performance.
- Experiments show substantial gains of up to 28.59% on F1 score and 28.9% on F2 score.
- The approach overcomes the data scarcity bottleneck, offering a viable path for data-driven requirement-to-code traceability in industrial systems.
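Since gains are reported on both F1 and F2, it is worth recalling that F_beta weights recall beta times as much as precision, so F2 rewards recall-heavy link recovery. The standard formula, shown here for reference:

```python
def f_beta(precision, recall, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta = 1 balances precision and recall (F1); beta = 2 weights
    recall more heavily, as in the paper's F2 metric."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.5, 0.5, beta=1)
f2 = f_beta(0.25, 1.0, beta=2)
```

A recall-favoring metric fits traceability recovery, where missing a true link is usually costlier than reviewing a spurious one.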


Hierarchical Material Recognition from Local Appearance
Authors:Matthew Beveridge, Shree K. Nayar
We introduce a taxonomy of materials for hierarchical recognition from local appearance. Our taxonomy is motivated by vision applications and is arranged according to the physical traits of materials. We contribute a diverse, in-the-wild dataset with images and depth maps of the taxonomy classes. Utilizing the taxonomy and dataset, we present a method for hierarchical material recognition based on graph attention networks. Our model leverages the taxonomic proximity between classes and achieves state-of-the-art performance. We demonstrate the model’s potential to generalize to adverse, real-world imaging conditions, and that novel views rendered using the depth maps can enhance this capability. Finally, we show the model’s capacity to rapidly learn new materials in a few-shot learning setting.
Paper & Project Links
PDF ICCV 2025 Camera Ready
Summary
This paper proposes a taxonomy of materials, arranged according to their physical traits, for hierarchical recognition from local appearance. The authors build a diverse in-the-wild dataset containing images and depth maps of the taxonomy classes. Using the taxonomy and dataset, they present a hierarchical material recognition method based on graph attention networks. The method leverages the taxonomic proximity between classes and achieves state-of-the-art performance. Experiments show the model generalizes to adverse real-world imaging conditions and that novel views rendered from the depth maps enhance this capability. The model also demonstrates the capacity to rapidly learn new materials in a few-shot learning setting.
Key Takeaways
- A taxonomy of materials organized by physical traits is proposed for hierarchical recognition.
- A diverse dataset containing images and depth maps of the taxonomy classes is contributed.
- A hierarchical material recognition method based on graph attention networks is presented, built on the taxonomy and dataset.
- The model leverages taxonomic proximity between classes and achieves state-of-the-art performance.
- The model generalizes to adverse real-world imaging conditions, and novel views rendered from depth maps enhance this capability.
- The model rapidly learns new materials in a few-shot setting.





Model Predictive Task Sampling for Efficient and Robust Adaptation
Authors:Qi Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji
Foundation models have revolutionized general-purpose problem-solving, offering rapid task adaptation through pretraining, meta-training, and finetuning. Recent crucial advances in these paradigms reveal the importance of challenging task prioritized sampling to enhance adaptation robustness under distribution shifts. However, ranking task difficulties over iteration as a preliminary step typically requires exhaustive task evaluation, which is practically unaffordable in computation and data-annotation. This study provides a novel perspective to illuminate the possibility of leveraging the dual importance of adaptation robustness and learning efficiency, particularly in scenarios where task evaluation is risky or costly, such as iterative agent-environment interactions for robotic policy evaluation or computationally intensive inference steps for finetuning foundation models. Firstly, we introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk distributions, providing a theoretical foundation for robust active task sampling. MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference. The resulting risk predictive model amortizes the costly evaluation of task adaptation performance and provably approximates task difficulty rankings. MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings. Empirically, we conduct extensive experiments in pattern recognition using foundation models and sequential decision-making. Our results demonstrate that MPTS significantly enhances adaptation robustness for tail risk or out-of-distribution (OOD) tasks and improves learning efficiency compared to state-of-the-art (SoTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.
Paper & Project Links
Summary
The Model Predictive Task Sampling (MPTS) framework bridges the task space and adaptation risk distributions, providing a theoretical foundation for robust active task sampling. The framework employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference. The resulting risk-predictive model amortizes the costly evaluation of task adaptation performance and provably approximates task difficulty rankings. In experiments on pattern recognition and sequential decision-making, MPTS significantly enhances adaptation robustness on tail-risk and out-of-distribution (OOD) tasks and improves learning efficiency.
Key Takeaways
- Foundation models enable rapid task adaptation through pretraining, meta-training, and fine-tuning.
- Ranking task difficulties over iterations typically requires exhaustive task evaluation, which is impractical in terms of computation and data annotation.
- The paper introduces the Model Predictive Task Sampling (MPTS) framework to predict task adaptation risk and optimize task sampling.
- MPTS employs a generative model to characterize the optimization process and predicts task-specific adaptation risk via posterior inference.
- MPTS significantly enhances adaptation robustness on tail-risk and out-of-distribution (OOD) tasks.
- Experiments on pattern recognition and sequential decision-making show MPTS improves learning efficiency.
- The code is publicly available at https://github.com/thu-rllab/MPTS.
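The amortization idea can be caricatured in a few lines: keep a cheap per-task risk prediction, sample the highest-predicted-risk tasks, and update predictions from the risks actually observed. The exponential-moving-average update below is a crude stand-in for the paper's posterior inference over a generative model, offered purely for intuition:

```python
def mpts_round(predicted_risk, observed_risk, k, lr=0.5):
    """Select the k tasks with the highest predicted adaptation risk,
    then move their predictions toward the risks actually observed,
    avoiding exhaustive evaluation of every task each round."""
    batch = sorted(predicted_risk, key=predicted_risk.get, reverse=True)[:k]
    for task in batch:
        predicted_risk[task] += lr * (observed_risk[task] - predicted_risk[task])
    return batch

pred = {"task_a": 0.9, "task_b": 0.2, "task_c": 0.6}
obs = {"task_a": 0.5, "task_b": 0.2, "task_c": 0.6}
batch = mpts_round(pred, obs, k=2)
```

Only the sampled tasks are ever evaluated, which is the point: the predictor stands in for the expensive evaluations on the rest.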



Packet Inspection Transformer: A Self-Supervised Journey to Unseen Malware Detection with Few Samples
Authors:Kyle Stein, Arash Mahyari, Guillermo Francia III, Eman El-Sheikh
As networks continue to expand and become more interconnected, the need for novel malware detection methods becomes more pronounced. Traditional security measures are increasingly inadequate against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) has been pivotal in enhancing network security, offering an in-depth analysis of network traffic that surpasses conventional monitoring techniques. DPI not only examines the metadata of network packets, but also dives into the actual content being carried within the packet payloads, providing a comprehensive view of the data flowing through networks. While the integration of advanced deep learning techniques with DPI has introduced modern methodologies into malware detection and network traffic classification, state-of-the-art supervised learning approaches are limited by their reliance on large amounts of annotated data and their inability to generalize to novel, unseen malware threats. To address these limitations, this paper leverages the recent advancements in self-supervised learning (SSL) and few-shot learning (FSL). Our proposed self-supervised approach trains a transformer via SSL to learn the embedding of packet content, including payload, from vast amounts of unlabeled data by masking portions of packets, leading to a learned representation that generalizes to various downstream tasks. Once the representation is extracted from the packets, they are used to train a malware detection algorithm. The representation obtained from the transformer is then used to adapt the malware detector to novel types of attacks using few-shot learning approaches. Our experimental results demonstrate that our method achieves classification accuracies of up to 94.76% on the UNSW-NB15 dataset and 83.25% on the CIC-IoT23 dataset.
Paper & Project Links
Summary
As networks continue to expand and interconnect, novel malware detection methods become increasingly important; traditional security measures struggle against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) is pivotal for network security, providing an in-depth analysis of network traffic that goes beyond conventional monitoring. Leveraging recent advances in self-supervised learning and few-shot learning, this paper proposes a new malware detection method: a transformer is trained via self-supervised learning to learn representations from vast amounts of unlabeled packets, these representations are used to train a malware detection algorithm, and few-shot learning then adapts the detector to novel attack types. Experiments show classification accuracies of up to 94.76% on the UNSW-NB15 dataset and 83.25% on the CIC-IoT23 dataset.
Key Takeaways
- Growing network interconnectivity makes novel malware detection methods increasingly important.
- Traditional security measures fall short against modern cyber attacks.
- Deep Packet Inspection (DPI) plays a key role in network security, enabling in-depth analysis of network traffic.
- The paper combines self-supervised learning and few-shot learning to improve malware detection.
- A transformer is trained via self-supervised learning to learn representations from vast amounts of unlabeled packets.
- The learned representations are used to train a malware detection algorithm that adapts to novel attack types.
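The masking objective described above is the familiar masked-modeling recipe applied to raw payload bytes: hide a fraction of byte positions and train the transformer to reconstruct them. The following sketches only the corruption step; the mask value and ratio are assumptions, and the reconstruction model itself is omitted:

```python
import random

def mask_payload(payload, mask_ratio=0.25, mask_byte=0x00, seed=0):
    """Return (masked_payload, masked_positions). The self-supervised
    target is to predict the original bytes at the masked positions."""
    rng = random.Random(seed)
    n_masked = int(len(payload) * mask_ratio)
    positions = sorted(rng.sample(range(len(payload)), n_masked))
    masked = bytearray(payload)
    for i in positions:
        masked[i] = mask_byte
    return bytes(masked), positions

payload = bytes(range(16))  # toy 16-byte payload: 0x00 .. 0x0F
masked, positions = mask_payload(payload, mask_ratio=0.25)
```

Because no labels are needed, this pretraining can consume arbitrarily large captures of unlabeled traffic before the few labeled malware samples come into play.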





Adv-SSL: Adversarial Self-Supervised Representation Learning with Theoretical Guarantees
Authors:Chenguang Duan, Yuling Jiao, Huazhen Lin, Wensen Ma, Jerry Zhijian Yang
Learning transferable data representations from abundant unlabeled data remains a central challenge in machine learning. Although numerous self-supervised learning methods have been proposed to address this challenge, a significant class of these approaches aligns the covariance or correlation matrix with the identity matrix. Despite impressive performance across various downstream tasks, these methods often suffer from biased sample risk, leading to substantial optimization shifts in mini-batch settings and complicating theoretical analysis. In this paper, we introduce a novel \underline{\bf Adv}ersarial \underline{\bf S}elf-\underline{\bf S}upervised Representation \underline{\bf L}earning (Adv-SSL) for unbiased transfer learning with no additional cost compared to its biased counterparts. Our approach not only outperforms the existing methods across multiple benchmark datasets but is also supported by comprehensive end-to-end theoretical guarantees. Our analysis reveals that the minimax optimization in Adv-SSL encourages representations to form well-separated clusters in the embedding space, provided there is sufficient upstream unlabeled data. As a result, our method achieves strong classification performance even with limited downstream labels, shedding new light on few-shot learning.
Paper & Project Links
PDF Accepted at the Conference on Neural Information Processing Systems (NeurIPS 2025)
Summary
This paper proposes Adversarial Self-Supervised Representation Learning (Adv-SSL), a new self-supervised method for unbiased transfer learning. Adv-SSL outperforms existing methods on multiple benchmark datasets and is supported by comprehensive end-to-end theoretical guarantees. Its minimax optimization encourages representations to form well-separated clusters in the embedding space, provided there is sufficient upstream unlabeled data, so the method achieves strong classification performance even with limited downstream labels, shedding new light on few-shot learning.
Key Takeaways
- Adversarial Self-Supervised Representation Learning (Adv-SSL) addresses biased sample risk in transfer learning.
- Compared with its biased counterparts, Adv-SSL incurs no additional cost.
- Adv-SSL outperforms existing methods on multiple benchmark datasets.
- The minimax optimization encourages representations to form well-separated clusters in the embedding space.
- Sufficient upstream unlabeled data is crucial to Adv-SSL's performance.
- Adv-SSL achieves strong classification performance even with limited downstream labels.
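The "covariance aligned with the identity" family that the abstract contrasts against can be made concrete with the penalty such methods minimize: the squared Frobenius distance between the embeddings' sample covariance and the identity matrix. A dependency-free sketch using the biased covariance estimator follows; it illustrates the prior family of methods, not Adv-SSL itself:

```python
def covariance_identity_penalty(z):
    """Squared Frobenius distance between the (biased) sample covariance
    of the embedding rows z and the identity matrix."""
    n, d = len(z), len(z[0])
    mean = [sum(col) / n for col in zip(*z)]
    cov = [[sum((row[a] - mean[a]) * (row[b] - mean[b]) for row in z) / n
            for b in range(d)] for a in range(d)]
    return sum((cov[a][b] - (1.0 if a == b else 0.0)) ** 2
               for a in range(d) for b in range(d))

# Coordinates are uncorrelated with unit variance, so the penalty is zero.
z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
penalty = covariance_identity_penalty(z)
```

The biased `1/n` estimator in this loss is exactly the source of the biased sample risk that the paper argues complicates mini-batch optimization.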

