

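⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings. They are suitable only for an initial screening before reading the papers.
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace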

Updated 2025-01-23

CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Authors:Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves

The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.



PDF This work has been submitted to the IEEE for possible publication



Key Takeaways

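  1. The main obstacles to deep learning in medical workflows are the limited availability of annotated data and the lack of interpretability.
  2. Concept Bottleneck Models (CBMs) address interpretability by constraining the disease prediction to a set of predefined, human-interpretable concepts, but they increase the annotation burden and must be retrained whenever a concept is added.
  3. CBVLM prompts a Large Vision-Language Model (LVLM) to predict each concept and then classifies the image from those concept predictions, which tackles both challenges.
  4. A retrieval module selects in-context examples in both stages, and the few-shot capabilities of LVLMs lower the annotation cost.
  5. Across four medical datasets and twelve LVLMs, CBVLM consistently outperforms CBMs and task-specific supervised methods.
  6. CBVLM requires no training and uses only a few annotated examples.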
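
The two-stage procedure described in the abstract (one question per concept, then classification from the predicted concepts) can be sketched as follows. This is a minimal illustration, not the authors' code: the `ask` callable stands in for any LVLM call, and the prompt wording, the function names, and the way retrieved demonstrations are passed in are all assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a function that sends a text prompt (the image is
# assumed to be attached elsewhere) to an LVLM and returns its text answer.
AskFn = Callable[[str], str]


def predict_concepts(ask: AskFn, concepts: List[str],
                     demos: Dict[str, List[str]]) -> Dict[str, bool]:
    """Stage 1: ask one yes/no question per concept.

    `demos` maps a concept to already-retrieved in-context examples (plain
    strings here); how they are retrieved is outside this sketch.
    """
    predictions = {}
    for concept in concepts:
        context = "\n".join(demos.get(concept, []))
        prompt = (f"{context}\nIs the concept '{concept}' present in the image? "
                  "Answer yes or no.")
        predictions[concept] = ask(prompt).strip().lower().startswith("yes")
    return predictions


def classify_from_concepts(ask: AskFn, predictions: Dict[str, bool],
                           classes: List[str], demos: List[str]) -> str:
    """Stage 2: classify the image conditioned on the stage-1 predictions."""
    findings = ", ".join(f"{name}: {'present' if found else 'absent'}"
                         for name, found in predictions.items())
    context = "\n".join(demos)
    prompt = (f"{context}\nConcepts: {findings}\n"
              f"Which class fits best? Options: {', '.join(classes)}.")
    return ask(prompt).strip()
```

Because the class prompt contains only the concept findings, the final answer can be traced back to the concepts that produced it, which is the explainability argument the abstract makes.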

Cool Papers


Can open source large language models be used for tumor documentation in Germany? – An evaluation on urological doctors’ notes

Authors:Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Torsten Panholzer

Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors’ notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12 B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.

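In Germany, tumor documentation is mostly manual: patient records are read and the data entered into structured databases. The study evaluates eleven open source LLMs with 1 to 70 billion parameters on three basic tasks (identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis), using annotated text snippets based on anonymized urology doctors' notes. Different prompting strategies probe the effect of the number of few-shot examples. Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B perform comparably well; models with less extensive training data or fewer than 7 billion parameters perform notably worse, and larger models bring no gains. Few-shot examples from medical domains other than urology also improve results. Models with 7 to 12 billion parameters may offer the best balance between performance and resource efficiency, and with tailored fine-tuning and well-designed prompting they might become important tools for clinical documentation. The evaluation code is available at https://github.com/stefan-m-lenz/UroLlmEval, and the dataset is released as a benchmark for German-language medical NLP.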


PDF 48 pages, 5 figures



Key Takeaways

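  • LLMs could improve the efficiency and reliability of tumor documentation in Germany.
  • The study evaluates LLMs of different sizes on three basic tasks: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis.
  • Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well on these tasks.
  • Models with fewer than 7 billion parameters performed notably worse, while larger models showed no performance gains.
  • Few-shot examples from a medical domain other than urology also improved results, which demonstrates that LLMs can handle the tasks needed for tumor documentation.
  • The released code and dataset help address the shortage of authentic, easily accessible benchmarks in German-language medical NLP.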



Directional Diffusion-Style Code Editing Pre-training

Authors:Qingyuan Liang, Zeyu Sun, Qihao Zhu, Junhao Hu, Yifan Zhao, Yizhou Chen, Mingxuan Zhu, Guoqing Wang, Lu Zhang

Code pre-trained models have shown promising effectiveness in various software engineering tasks. Among these tasks, many tasks are related to software evolution and/or code editing. However, existing code pre-trained models often overlook the real-world code editing data and the evolutionary nature of the editing process. In this paper, to simulate the step-by-step code editing process of human developers, we propose DivoT5, a pre-trained model based on directional diffusion at the data level. In DivoT5, we adopt two categories of pre-training tasks. The first category is mask and denoising tasks augmented with a diffusion direction representing code evolution. That is, we first apply a noising process to the code snippets before evolution, and then ask the pre-training process to restore the snippets with noise into the code snippets after evolution. The second category is tasks aiming to reinforce the evolutionary direction. That is, we first generate various intermediate versions for each pair of snippets before and after evolution, and then ask the pre-training process to transform the intermediate versions into the snippet after evolution for each pair. We evaluate DivoT5 for two code-editing scenarios and one non-editing scenario using five downstream tasks. Given each downstream task, we fine-tune the pre-trained DivoT5 to evaluate its effectiveness. Our experimental results show that DivoT5 achieves state-of-the-art (SOTA) performance on most tasks in comparison to models of the same scale (220M), large scale (770M) models in fine-tuning, and billion-scale (6.7B, 8B, ChatGPT) models in few-shot settings. For one code-editing task (i.e., automated code review), DivoT5 pre-trained on top of CodeT5-small (60M) can even outperform CodeT5-base (220M) and other pre-trained models with 220M parameters except for DivoT5 pre-trained on top of CodeT5-base (220M).






Key Takeaways

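  1. Code pre-trained models are effective on software engineering tasks but often overlook real-world code editing data and the evolutionary nature of the editing process.
  2. DivoT5 is a pre-trained model based on directional diffusion at the data level that simulates the step-by-step code editing process of human developers.
  3. DivoT5 uses two categories of pre-training tasks: mask and denoising tasks augmented with a diffusion direction, and tasks that reinforce the evolutionary direction.
  4. DivoT5 achieves state-of-the-art performance on most tasks against same-scale (220M) and larger (770M) fine-tuned models and against billion-scale models in few-shot settings.
  5. On automated code review, DivoT5 pre-trained on top of CodeT5-small (60M) outperforms CodeT5-base (220M) and other 220M-parameter pre-trained models, except for DivoT5 pre-trained on top of CodeT5-base.
  6. DivoT5's gains may stem from modeling code evolution and using real-world code editing data.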
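
The second task category relies on intermediate versions between the snippet before and after evolution. The abstract does not say how DivoT5 builds them, so the sketch below is only one plausible construction: it uses Python's standard `difflib` to find the edit operations and applies them one at a time. The function name and the token-level granularity are assumptions.

```python
import difflib
from typing import List


def intermediate_versions(before: List[str], after: List[str]) -> List[List[str]]:
    """Apply the edits that turn `before` into `after` one at a time.

    Returns the token list after each edit; the last entry equals `after`.
    """
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    edits = [op for op in matcher.get_opcodes() if op[0] != "equal"]

    versions = []
    current = list(before)
    # Apply edits from the end of the sequence backwards so that the indices
    # of the remaining (earlier) edits still point at the right tokens.
    for _tag, i1, i2, j1, j2 in reversed(edits):
        current = current[:i1] + after[j1:j2] + current[i2:]
        versions.append(list(current))
    return versions


if __name__ == "__main__":
    old = "x = a + b".split()
    new = "y = a - b".split()
    for step in intermediate_versions(old, new):
        print(" ".join(step))
```

Each intermediate version paired with the final snippet gives a training example that points in the direction of the edit, which is the idea behind reinforcing the evolutionary direction.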



The Value of Nothing: Multimodal Extraction of Human Values Expressed by TikTok Influencers

Authors:Alina Starovolsky-Shitrit, Alon Neduva, Naama Appel Doron, Ella Daniel, Oren Tsur

Societal and personal values are transmitted to younger generations through interaction and exposure. Traditionally, children and adolescents learned values from parents, educators, or peers. Nowadays, social platforms serve as a significant channel through which youth (and adults) consume information, as the main medium of entertainment, and possibly the medium through which they learn different values. In this paper we extract implicit values from TikTok movies uploaded by online influencers targeting children and adolescents. We curated a dataset of hundreds of TikTok movies and annotated them according to the Schwartz Theory of Personal Values. We then experimented with an array of Masked and Large Language Models, exploring how values can be detected. Specifically, we considered two pipelines – direct extraction of values from video and a 2-step approach in which videos are first converted to elaborated scripts and then values are extracted. Achieving state-of-the-art results, we find that the 2-step approach performs significantly better than the direct approach and that using a trainable Masked Language Model as a second step significantly outperforms a few-shot application of a number of Large Language Models. We further discuss the impact of fine-tuning and compare the performance of the different models on identification of values present or contradicted in the TikTok videos. Finally, we share the first values-annotated dataset of TikTok videos. Our results pave the way to further research on influence and value transmission in video-based social platforms.






Key Takeaways

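  1. TikTok has become a significant channel through which values reach youth and also adults.
  2. For TikTok videos, the 2-step approach (video to script, then value extraction) detects values significantly better than direct extraction.
  3. A trainable Masked Language Model as the second step significantly outperforms few-shot Large Language Models.
  4. The paper also discusses the impact of fine-tuning on model performance.
  5. The authors share the first values-annotated dataset of TikTok videos.
  6. Values were traditionally learned from parents, educators, or peers; social platforms are now a significant additional channel.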



ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models

Authors:Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma

The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP’s effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.



PDF Code available at https://ybendou.github.io/ProKeR



Key Takeaways

  1. CLIP's popularity and wide use in visual downstream tasks have driven research on efficient few-shot adaptation techniques.
  2. Training-free caching methods such as Tip-Adapter have gained attention because they need no additional fine-tuning.
  3. Revisiting Tip-Adapter from a kernel perspective shows that caching methods act as local adapters and connects them to the established kernel literature, which gives a theoretical account of how they work.
  4. The analysis shows the importance of incorporating global information in local adapters.
  5. ProKeR (Proximal Kernel ridge Regression) learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) with CLIP as the base learner.
  6. ProKeR has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
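
One way to read "a proximal regularizer in an RKHS with CLIP as a base learner" is as kernel ridge regression fitted to the residuals of the base predictor, which has the closed form f(x) = f_0(x) + k(x, X)(K + λI)^{-1}(Y − f_0(X)). The sketch below implements that reading in pure Python. The RBF kernel, the `base` callable standing in for CLIP's zero-shot prediction, and the hyperparameter values are assumptions, not details taken from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]


def rbf(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(A: List[Vector], B: List[Vector]) -> List[Vector]:
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row_a[:] + row_b[:] for row_a, row_b in zip(A, B)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_residual_krr(X: List[Vector], Y: List[Vector],
                     base: Callable[[Vector], Vector],
                     lam: float = 0.1, gamma: float = 1.0
                     ) -> Callable[[Vector], Vector]:
    """Return f(x) = base(x) + k(x, X) (K + lam I)^{-1} (Y - base(X))."""
    n, num_classes = len(X), len(Y[0])
    K = [[rbf(a, b, gamma) + (lam if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    residuals = [[y - f for y, f in zip(y_row, base(x))]
                 for x, y_row in zip(X, Y)]
    alpha = solve(K, residuals)

    def predict(x: Vector) -> Vector:
        k = [rbf(x, xi, gamma) for xi in X]
        correction = [sum(k[i] * alpha[i][c] for i in range(n))
                      for c in range(num_classes)]
        return [b + d for b, d in zip(base(x), correction)]

    return predict
```

Far from the few-shot samples the kernel term vanishes and the prediction falls back to the base model, while near them the labels dominate. That is one way to picture a global base learner combined with a local correction.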



Beyond Any-Shot Adaptation: Predicting Optimization Outcome for Robustness Gains without Extra Pay

Authors:Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji

The foundation model enables fast problem-solving without learning from scratch, and such a desirable adaptation property benefits from its adopted cross-task generalization paradigms, e.g., pretraining, meta-training, or finetuning. Recent trends have focused on the curation of task datasets during optimization, which includes task selection as an indispensable consideration for either adaptation robustness or sampling efficiency purposes. Despite some progress, selecting crucial task batches to optimize over iteration mostly exhausts massive task queries and requires intensive evaluation and computations to secure robust adaptation. This work underscores the criticality of both robustness and learning efficiency, especially in scenarios where tasks are risky to collect or costly to evaluate. To this end, we present Model Predictive Task Sampling (MPTS), a novel active task sampling framework to establish connections between the task space and adaptation risk landscape to achieve robust adaptation. Technically, MPTS characterizes the task episodic information with a generative model and predicts optimization outcome after adaptation from posterior inference, i.e., forecasting task-specific adaptation risk values. The resulting risk learner amortizes expensive annotation, evaluation, or computation operations in task robust adaptation learning paradigms. Extensive experimental results show that MPTS can be seamlessly integrated into zero-shot, few-shot, and many-shot learning paradigms, increases adaptation robustness, and retains learning efficiency without affording extra cost. The code will be available at the project site https://github.com/thu-rllab/MPTS.






Key Takeaways

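  1. Cross-task generalization paradigms (e.g., pretraining, meta-training, finetuning) give foundation models the ability to solve problems quickly.
  2. Task selection is an indispensable consideration during optimization, for both adaptation robustness and sampling efficiency.
  3. Robustness and learning efficiency both matter, especially when tasks are risky to collect or costly to evaluate.
  4. Model Predictive Task Sampling (MPTS) is a new active task sampling framework that connects the task space to the adaptation risk landscape by predicting task-specific adaptation risk values.
  5. The resulting risk learner amortizes expensive annotation, evaluation, and computation in robust adaptation learning.
  6. MPTS integrates into zero-shot, few-shot, and many-shot learning paradigms.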



Few-shot Human Motion Recognition through Multi-Aspect mmWave FMCW Radar Data

Authors:Hao Fan, Lingfeng Chen, Chengbai Xu, Jiadong Zhou, Yongpeng Dai, Panhe HU

Radar human motion recognition methods based on deep learning models have been a hot topic in remote sensing in recent years, yet the existing methods are mostly radial-oriented. In practical applications, the test data could be multi-aspect and the sample number of each motion could be very limited, causing model overfitting and reduced recognition accuracy. This paper proposes channel-DN4, a multi-aspect few-shot human motion recognition method. First, local descriptors are introduced for a precise classification metric. Moreover, an episodic training strategy is adopted to reduce model overfitting. To utilize the invariant semantic information in multi-aspect conditions, we apply channel attention after the embedding network to obtain a precise implicit high-dimensional representation of semantic information. We tested the performance of channel-DN4 and methods for comparison on measured mmWave FMCW radar data. The proposed channel-DN4 produced competitive and convincing results, reaching the highest 87.533% recognition accuracy in the 3-way 10-shot condition while other methods suffer from overfitting. Code is available at: https://github.com/MountainChenCad/channel-DN4






Key Takeaways

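  1. Deep-learning-based radar human motion recognition has become a hot topic in remote sensing.
  2. Existing methods are mostly radial-oriented, whereas practical test data can be multi-aspect with very few samples per motion.
  3. channel-DN4 introduces local descriptors for a precise classification metric.
  4. An episodic training strategy reduces model overfitting.
  5. Channel attention after the embedding network exploits the semantic information that stays invariant across aspects.
  6. On measured mmWave FMCW radar data, channel-DN4 reaches up to 87.533% recognition accuracy in the 3-way 10-shot condition.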
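
The name channel-DN4 suggests that the method builds on DN4, which classifies a query by comparing its local descriptors with the pooled local descriptors of each class. A minimal sketch of that image-to-class measure follows. It omits the channel attention step, and the abstract does not state how the paper computes its score, so the function names, the cosine similarity, and the default `k` are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def image_to_class_score(query_descriptors: List[Vector],
                         class_descriptors: List[Vector], k: int = 1) -> float:
    """Sum, over the query's local descriptors, of the k highest similarities
    to any local descriptor pooled from the class's support samples."""
    total = 0.0
    for q in query_descriptors:
        sims = sorted((cosine(q, d) for d in class_descriptors), reverse=True)
        total += sum(sims[:k])
    return total


def classify_by_local_descriptors(query_descriptors: List[Vector],
                                  support: Dict[str, List[Vector]],
                                  k: int = 1) -> str:
    """Pick the class whose descriptor pool best matches the query."""
    return max(support,
               key=lambda label: image_to_class_score(
                   query_descriptors, support[label], k))
```

In a 3-way 10-shot episode, `support` would hold three labels, each with the local descriptors of its ten support samples pooled together.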



Visual RAG: Expanding MLLM visual knowledge without fine-tuning

Authors:Mirco Bonomo, Simone Bianco

Multimodal Large Language Models (MLLMs) have achieved notable performance in computer vision tasks that require reasoning across visual and textual modalities, yet their capabilities are limited to their pre-trained data, requiring extensive fine-tuning for updates. Recent research has explored the use of In-Context Learning (ICL) to overcome these challenges by providing a set of demonstrating examples as context to augment MLLM performance in several tasks, showing that many-shot ICL leads to substantial improvements compared to few-shot ICL. However, the reliance on numerous demonstrating examples and the limited context windows of MLLMs present significant obstacles. This paper aims to address these challenges by introducing a novel approach, Visual RAG, that synergistically combines the MLLM's capability to learn from the context with a retrieval mechanism. The crux of this approach is to augment the MLLM's knowledge by selecting only the most relevant demonstrating examples for the query, pushing it to learn by analogy. In this way, relying on the new information provided dynamically during inference time, the resulting system is not limited to the knowledge extracted from the training data, but can be updated rapidly and easily without fine-tuning. Furthermore, this greatly reduces the computational costs for improving the model's image classification performance, and extends the model's knowledge to new visual domains and tasks it was not trained for. Extensive experiments on eight different state-of-the-art datasets spanning several domains and image classification tasks show that the proposed Visual RAG, compared to the most recent state of the art (i.e., many-shot ICL), is able to obtain an accuracy that is very close or even higher (approx. +2% improvement on average) while using a much smaller set of demonstrating examples (approx. only 23% on average).






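Multimodal Large Language Models (MLLMs) show strong reasoning across visual and textual modalities in computer vision tasks, but they are limited to their pre-training data and need extensive fine-tuning to be updated. Recent work explores In-Context Learning (ICL), which supplies demonstrating examples as context, and shows that many-shot ICL improves substantially over few-shot ICL. The reliance on numerous demonstrating examples and the limited context windows of MLLMs remain significant obstacles. Visual RAG addresses them by combining the MLLM's ability to learn from context with a retrieval mechanism: it selects only the demonstrating examples most relevant to the query and pushes the model to learn by analogy. Because new information is provided dynamically at inference time, the system is not limited to knowledge from the training data and can be updated quickly and easily without fine-tuning. This also greatly reduces the computational cost of improving image classification performance and extends the model's knowledge to visual domains and tasks it was not trained for. Experiments on eight datasets spanning several domains and image classification tasks show that, compared with many-shot ICL, Visual RAG obtains accuracy that is very close or even higher (about +2% on average) while using far fewer demonstrating examples (about 23% on average).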

Key Takeaways

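  1. MLLMs show strong cross-modal reasoning in computer vision tasks but need extensive fine-tuning to be updated.
  2. In-Context Learning (ICL) overcomes this by providing demonstrating examples as context, but it relies on many examples and is constrained by limited context windows.
  3. Visual RAG combines the MLLM's ability to learn from context with a retrieval mechanism that selects the most relevant demonstrating examples.
  4. Visual RAG improves performance without large sets of demonstrating examples, lowers computational cost, and extends the model's knowledge to new visual domains and tasks.
  5. Experiments show that Visual RAG obtains accuracy close to or higher than many-shot ICL.
  6. Visual RAG uses new information supplied dynamically at inference time, so knowledge can be updated quickly without fine-tuning.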
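
The retrieval step, which keeps only the demonstrations most relevant to the query, can be illustrated with a top-k search over embedding vectors. This is a sketch under assumptions: the abstract does not specify the encoder or the similarity measure, so the cosine similarity, the `(embedding, label)` pool layout, and the function names here are hypothetical.

```python
import heapq
import math
from typing import List, Tuple

Vector = List[float]
Demo = Tuple[Vector, str]  # (embedding of a labeled image, its label)


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def top_k_demonstrations(query: Vector, pool: List[Demo], k: int = 3) -> List[Demo]:
    """Return the k demonstrations whose embeddings are closest to the query,
    most similar first."""
    return heapq.nlargest(k, pool, key=lambda demo: cosine_similarity(query, demo[0]))
```

Only the returned demonstrations would be placed in the MLLM's context, which is how the approach stays within a limited context window while still adapting at inference time.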



Class Incremental Fault Diagnosis under Limited Fault Data via Supervised Contrastive Knowledge Distillation

Authors:Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang

Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model’s decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge distillation for improved representation learning capability and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and the Random Forest Classifier to address the class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at https://github.com/Zhang-Henry/SCLIFD_TII.






Key Takeaways

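  1. The SCLIFD framework targets class-incremental fault diagnosis, where a model must adapt to new fault classes while retaining previous knowledge.
  2. With imbalanced and long-tailed data, extracting discriminative features from few-shot fault data is challenging.
  3. Incremental training of existing methods risks catastrophic forgetting.
  4. SCLIFD uses supervised contrastive knowledge distillation to improve representation learning and reduce forgetting.
  5. A novel prioritized exemplar selection method for sample replay alleviates catastrophic forgetting.
  6. A Random Forest Classifier addresses the class imbalance that would otherwise bias decisions toward normal classes.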



IDEA: Image Description Enhanced CLIP-Adapter

Authors:Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang

CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it is comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model’s performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named “IMD-11”. Our code and data are released at https://github.com/FourierAI/IDEA.






Key Takeaways

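  • CLIP has attained great success in pattern recognition and computer vision.
  • Transferring CLIP to downstream tasks (e.g., zero- or few-shot classification) is a hot topic in multimodal learning.
  • IDEA adapts CLIP to few-shot image classification by leveraging both visual features and textual descriptions of images.
  • IDEA is a training-free method for CLIP that is comparable to, or even exceeds, state-of-the-art models on multiple tasks.
  • T-IDEA adds two lightweight learnable components (a projector and a learnable latent space), further improving performance and achieving SOTA results on 11 datasets.
  • Using the Llama model, the authors generated IMD-11, a set of 1,637,795 image-text pairs covering 11 datasets.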



ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning

Authors:Wonduk Seo, Zonghao Yuan, Yi Bu

Cultural values alignment in Large Language Models (LLMs) is a critical challenge due to their tendency to embed Western-centric biases from training data, leading to misrepresentations and fairness issues in cross-cultural contexts. Recent approaches, such as role-assignment and few-shot learning, often struggle with reliable cultural alignment as they heavily rely on pre-trained knowledge, lack scalability, and fail to capture nuanced cultural values effectively. To address these issues, we propose ValuesRAG, a novel and effective framework that applies Retrieval-Augmented Generation (RAG) with In-Context Learning (ICL) to integrate cultural and demographic knowledge dynamically during text generation. Leveraging the World Values Survey (WVS) dataset, ValuesRAG first generates summaries of values for each individual. Subsequently, we curate several representative regional datasets to serve as test datasets and retrieve relevant summaries of values based on demographic features, followed by a reranking step to select the top-k relevant summaries. ValuesRAG consistently outperforms baseline methods, both in the main experiment and in the ablation study where only the values summary was provided. Notably, ValuesRAG demonstrates an accuracy of 21% improvement over other baseline methods, highlighting its potential to foster culturally aligned AI systems and enhance the inclusivity of AI-driven applications.



PDF preprint



Key Takeaways

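  1. Cultural values alignment is a critical challenge for LLMs because they tend to embed Western-centric biases from their training data.
  2. ValuesRAG combines Retrieval-Augmented Generation (RAG) with In-Context Learning (ICL) to integrate cultural and demographic knowledge dynamically during text generation.
  3. ValuesRAG uses the World Values Survey (WVS) dataset to generate a summary of values for each individual.
  4. It retrieves relevant value summaries based on demographic features and then reranks them to select the top-k.
  5. ValuesRAG outperforms the baseline methods in the experiments, with a reported 21% accuracy improvement.
  6. ValuesRAG could help foster culturally aligned AI systems.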



3DGS-CD: 3D Gaussian Splatting-based Change Detection for Physical Object Rearrangement

Authors:Ziqi Lu, Jianbo Ye, John Leonard

We present 3DGS-CD, the first 3D Gaussian Splatting (3DGS)-based method for detecting physical object rearrangements in 3D scenes. Our approach estimates 3D object-level changes by comparing two sets of unaligned images taken at different times. Leveraging 3DGS’s novel view rendering and EfficientSAM’s zero-shot segmentation capabilities, we detect 2D object-level changes, which are then associated and fused across views to estimate 3D change masks and object transformations. Our method can accurately identify changes in cluttered environments using sparse (as few as one) post-change images within as little as 18s. It does not rely on depth input, user instructions, pre-defined object classes, or object models – An object is recognized simply if it has been re-arranged. Our approach is evaluated on both public and self-collected real-world datasets, achieving up to 14% higher accuracy and three orders of magnitude faster performance compared to the state-of-the-art radiance-field-based change detection method. This significant performance boost enables a broad range of downstream applications, where we highlight three key use cases: object reconstruction, robot workspace reset, and 3DGS model update. Our code and data will be made available at https://github.com/520xyxyzq/3DGS-CD.







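Key Takeaways

  1. 3DGS-CD is the first 3D Gaussian Splatting (3DGS)-based method for detecting physical object rearrangements in 3D scenes.
  2. It estimates 3D object-level changes by comparing two sets of unaligned images taken at different times.
  3. It detects 2D object-level changes using 3DGS's novel view rendering and EfficientSAM's zero-shot segmentation.
  4. It works accurately in cluttered environments with sparse post-change images (as few as one).
  5. It does not rely on depth input, user instructions, pre-defined object classes, or object models.
  6. Compared to the state-of-the-art radiance-field-based method, it achieves up to 14% higher accuracy and runs three orders of magnitude faster.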



Few-Shot Domain Adaptation for Learned Image Compression

Authors:Tianyu Zhang, Haotian Zhang, Yuqi Li, Li Li, Dong Liu

Learned image compression (LIC) has achieved state-of-the-art rate-distortion performance, deemed promising for next-generation image compression techniques. However, pre-trained LIC models usually suffer from significant performance degradation when applied to out-of-training-domain images, implying their poor generalization capabilities. To tackle this problem, we propose a few-shot domain adaptation method for LIC by integrating plug-and-play adapters into pre-trained models. Drawing inspiration from the analogy between latent channels and frequency components, we examine domain gaps in LIC and observe that out-of-training-domain images disrupt pre-trained channel-wise decomposition. Consequently, we introduce a method for channel-wise re-allocation using convolution-based adapters and low-rank adapters, which are lightweight and compatible with mainstream LIC schemes. Extensive experiments across multiple domains and multiple representative LIC schemes demonstrate that our method significantly enhances pre-trained models, achieving comparable performance to H.266/VVC intra coding with merely 25 target-domain samples. Additionally, our method matches the performance of full-model finetuning while transmitting fewer than 2% of the parameters.






Key Takeaways

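  1. Pre-trained learned image compression (LIC) models degrade significantly on out-of-training-domain images.
  2. The paper proposes a few-shot domain adaptation method based on convolution-based adapters and low-rank adapters.
  3. Channel-wise re-allocation improves how pre-trained LIC models handle new domains.
  4. The method significantly enhances pre-trained models with only 25 target-domain samples.
  5. It achieves performance comparable to H.266/VVC intra coding.
  6. It matches the performance of full-model finetuning while transmitting fewer than 2% of the parameters.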
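
A short calculation shows why low-rank adapters are cheap to transmit. A dense layer with a d_out × d_in weight has d_in · d_out parameters, while a rank-r adapter stores two factors with r · (d_in + d_out) entries in total. The layer width and rank below are hypothetical and are not taken from the paper; the function name is also an assumption.

```python
def low_rank_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Parameters of a rank-`rank` adapter (factors of shape d_out x rank and
    rank x d_in) as a fraction of the dense d_out x d_in layer it adapts."""
    return rank * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    # Hypothetical 192-channel 1x1 convolution with a rank-1 adapter.
    print(f"{low_rank_fraction(192, 192, 1):.2%}")  # prints 1.04%
```

For wider layers the fraction shrinks further, since the adapter grows linearly with the channel count while the dense weight grows quadratically.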



VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought

Authors:Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki

Large-scale LLMs and VLMs excel at few-shot learning but require high-quality examples. We introduce In-Context Abstraction Learning (ICAL), which iteratively refines suboptimal trajectories into high-quality data with optimized actions and detailed reasoning. Given an inefficient demonstration, a VLM corrects actions and annotates causal relationships, object states, subgoals, and task-relevant visuals, forming “programs of thought.” With human feedback, these programs are improved as the agent executes them in a similar environment. The resulting examples, used as prompt context or fine-tuning data, significantly boost decision-making while reducing human feedback needs. ICAL surpasses state-of-the-art in TEACh (dialogue-based instruction following), VisualWebArena (multimodal web agents), and Ego4D (egocentric video action anticipation). In TEACh, combining fine-tuning and retrieval on ICAL examples outperforms raw human demonstrations and expert examples, achieving a 17.5% increase in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with ICAL improves task success rate 1.6x over GPT-4V, while fine-tuning Qwen2-VL achieves a 2.8x improvement. In Ego4D, ICAL outperforms few-shot GPT-4V and remains competitive with supervised models. Overall, ICAL scales 2x better than raw human demonstrations and reduces manual prompt engineering.



PDF Project website: https://ical-learning.github.io/


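Large models excel at few-shot learning but require high-quality examples. The paper proposes In-Context Abstraction Learning (ICAL), which refines suboptimal trajectories into high-quality data with optimized actions and detailed reasoning. Given an inefficient demonstration, a VLM corrects actions and annotates causal relationships, object states, subgoals, and task-relevant visuals, forming "programs of thought." With human feedback, these programs are improved as the agent executes them in a similar environment. Used as prompt context or fine-tuning data, the resulting examples significantly boost decision-making while reducing the need for human feedback. ICAL surpasses the state of the art on TEACh, VisualWebArena, and Ego4D.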

Key Takeaways

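  1. Large-scale LLMs and VLMs excel at few-shot learning but require high-quality examples.
  2. In-Context Abstraction Learning (ICAL) refines suboptimal trajectories into high-quality data.
  3. A VLM annotates causal relationships, object states, subgoals, and task-relevant visuals, forming "programs of thought."
  4. With human feedback, these programs are improved as the agent executes them in a similar environment.
  5. ICAL surpasses the state of the art on TEACh, VisualWebArena, and Ego4D.
  6. The resulting examples boost decision-making and reduce the need for human feedback; overall, ICAL scales 2x better than raw human demonstrations.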



One size doesn’t fit all: Predicting the Number of Examples for In-Context Learning

Authors:Manish Chandra, Debasis Ganguly, Iadh Ounis

In-context learning (ICL) refers to the process of adding a small number of localized examples from a training set of labelled data to an LLM’s prompt with an objective to effectively control the generative process seeking to improve the downstream task performance. Existing ICL approaches use an identical number of examples (a pre-configured hyper-parameter) for each data instance. Our work alleviates the limitations of this ‘one fits all’ approach by dynamically predicting the number of examples for each data instance to be used in few-shot inference with LLMs. In particular, we employ a multi-label classifier, the parameters of which are fitted using a training set, where the label for each instance in this training set indicates if using a specific value of k (number of most similar examples from 0 up to a maximum value) leads to correct k-shot downstream predictions. Our experiments on a number of text classification benchmarks show that AICL substantially outperforms standard ICL by up to 17%.





Key Takeaways

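  • In-context learning (ICL) adds a small number of examples to an LLM's prompt to improve downstream task performance.
  • Existing ICL approaches use the same number of examples for every data instance, whereas AICL dynamically predicts the number for each instance.
  • AICL employs a multi-label classifier whose parameters are fitted on a training set.
  • Each training label indicates whether using a specific number of examples k leads to a correct downstream prediction.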
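
The multi-label setup can be made concrete with two small helpers: one builds the training target for an instance from its k-shot predictions, and one turns the classifier's per-k scores into a choice of k at inference time. The abstract does not state the selection rule, so choosing the smallest k above a threshold (and falling back to the highest-scoring k) is an assumption, as are the function names.

```python
from typing import List


def multilabel_target(predictions_by_k: List[str], gold: str) -> List[int]:
    """Training label for one instance: entry k is 1 if the k-shot
    prediction (k = 0, 1, 2, ...) matched the gold label, else 0."""
    return [int(prediction == gold) for prediction in predictions_by_k]


def choose_num_examples(scores: List[float], threshold: float = 0.5) -> int:
    """Pick how many examples to use for one instance.

    `scores[k]` is the classifier's estimate that k-shot inference is correct.
    Assumed rule: the smallest k whose score reaches the threshold, otherwise
    the k with the highest score.
    """
    for k, score in enumerate(scores):
        if score >= threshold:
            return k
    return max(range(len(scores)), key=lambda k: scores[k])
```

Preferring the smallest sufficient k keeps prompts short for easy instances and spends context only where it is predicted to help.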



UniGraph: Learning a Unified Cross-Domain Foundation Model for Text-Attributed Graphs

Authors:Yufei He, Yuan Sui, Xiaoxin He, Bryan Hooi

Foundation models like ChatGPT and GPT-4 have revolutionized artificial intelligence, exhibiting remarkable abilities to generalize across a wide array of tasks and applications beyond their initial training objectives. However, graph learning has predominantly focused on single-graph models, tailored to specific tasks or datasets, lacking the ability to transfer learned knowledge to different domains. This limitation stems from the inherent complexity and diversity of graph structures, along with the different feature and label spaces specific to graph data. In this paper, we recognize text as an effective unifying medium and employ Text-Attributed Graphs (TAGs) to leverage this potential. We present our UniGraph framework, designed to learn a foundation model for TAGs, which is capable of generalizing to unseen graphs and tasks across diverse domains. Unlike single-graph models that use pre-computed node features of varying dimensions as input, our approach leverages textual features for unifying node representations, even for graphs such as molecular graphs that do not naturally have textual features. We propose a novel cascaded architecture of Language Models (LMs) and Graph Neural Networks (GNNs) as backbone networks. Additionally, we propose the first pre-training algorithm specifically designed for large-scale self-supervised learning on TAGs, based on Masked Graph Modeling. We introduce graph instruction tuning using Large Language Models (LLMs) to enable zero-shot prediction ability. Our comprehensive experiments across various graph learning tasks and domains demonstrate the model’s effectiveness in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer, even surpassing or matching the performance of GNNs that have undergone supervised training on target datasets.



PDF KDD 2025


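This paper presents UniGraph, a framework for learning a foundation model for Text-Attributed Graphs (TAGs) that generalizes to unseen graphs and tasks across domains. It uses textual features to unify node representations and proposes a cascaded architecture of language models and graph neural networks. The paper also proposes a pre-training algorithm for large-scale self-supervised learning on TAGs, along with graph instruction tuning using large language models to enable zero-shot prediction. Experiments show that the model performs well in self-supervised representation learning on unseen graphs, few-shot in-context transfer, and zero-shot transfer.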

Key Takeaways

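  1. UniGraph uses Text-Attributed Graphs (TAGs) as a unifying medium to generalize across different domains.
  2. The framework cascades language models and graph neural networks, using textual features to unify node representations.
  3. It proposes a pre-training algorithm for large-scale self-supervised learning on TAGs, based on Masked Graph Modeling.
  4. Graph instruction tuning with large language models enables zero-shot prediction.
  5. UniGraph performs well in self-supervised representation learning on unseen graphs.
  6. The framework supports few-shot in-context transfer and zero-shot transfer.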
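
The cascaded idea in the abstract (a text encoder feeding a graph encoder, so that graphs from different domains share one feature space) can be illustrated with a toy sketch. Everything here is an assumption for illustration: `embed_text` is a hashed bag-of-words stand-in for a language model, `propagate` is a single mean-aggregation step standing in for a GNN, and `DIM`, `encode_graph`, and the example graphs are hypothetical names and data, not part of UniGraph.

```python
import hashlib

DIM = 8  # shared embedding size for every graph (illustrative value)


def embed_text(text, dim=DIM):
    """Toy stand-in for a language model: hashed bag-of-words, L1-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def propagate(features, edges):
    """Toy stand-in for a GNN layer: average each node with its neighbours."""
    neighbours = {node: [node] for node in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return {
        node: [
            sum(features[n][i] for n in ns) / len(ns)
            for i in range(len(features[node]))
        ]
        for node, ns in neighbours.items()
    }


def encode_graph(node_texts, edges):
    """Cascade: text -> node features -> neighbourhood-aware node features."""
    features = {node: embed_text(text) for node, text in node_texts.items()}
    return propagate(features, edges)


# Two graphs from unrelated domains end up in the same DIM-sized space.
citation = encode_graph(
    {"p1": "graph neural networks", "p2": "language models"},
    [("p1", "p2")],
)
molecule = encode_graph(
    {"a1": "carbon atom", "a2": "oxygen atom", "a3": "hydrogen atom"},
    [("a1", "a2"), ("a1", "a3")],
)
```

Because node features come from text rather than dataset-specific precomputed vectors, the same encoder applies to both graphs without changing the input dimension, which is the property the abstract relies on for cross-domain transfer.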



AutoMix: Automatically Mixing Language Models

Authors:Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam

Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to Automix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP based router that can effectively select an appropriately sized model, based on answer confidence. Experiments across five language models and five challenging datasets show that Automix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.



PDF 38th Conference on Neural Information Processing Systems (NeurIPS 2024). The first two authors contributed equally. Work started and partly done during Aman’s internship at Google. This version adds results on additional models and datasets



Key Takeaways

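  1. Large language models (LLMs) come in many sizes and configurations, which makes it challenging to balance computational cost and performance.
  2. Automix routes queries to larger LMs based on the approximate correctness of a smaller LM's outputs.
  3. Automix uses a few-shot self-verification mechanism that estimates output reliability without requiring extensive training.
  4. Because self-verification can be noisy, Automix uses a POMDP-based router to select an appropriately sized model based on answer confidence.
  5. Experiments across five language models and five datasets show that Automix consistently surpasses strong baselines.
  6. Automix reduces computational cost by over 50% for comparable performance.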
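
The routing idea can be sketched as follows. This is a simplified illustration, not the paper's method: it swaps the POMDP-based router for a plain confidence threshold, and the function name `automix_style_route`, the parameters `n_samples` and `threshold`, and the three toy callables are all hypothetical stand-ins for real model and verifier calls.

```python
from typing import Callable, Tuple


def automix_style_route(
    query: str,
    small_lm: Callable[[str], str],
    large_lm: Callable[[str], str],
    verifier: Callable[[str, str], bool],
    n_samples: int = 8,
    threshold: float = 0.5,
) -> Tuple[str, str, float]:
    """Answer with the small model; escalate when verification confidence is low.

    Assumption: confidence is the fraction of verifier calls that accept the
    draft, and a fixed threshold replaces the paper's POMDP router.
    """
    draft = small_lm(query)
    votes = [verifier(query, draft) for _ in range(n_samples)]
    confidence = sum(votes) / n_samples
    if confidence >= threshold:
        return draft, "small", confidence
    return large_lm(query), "large", confidence


# Hypothetical deterministic stand-ins for real model calls.
def toy_small_lm(query: str) -> str:
    return "4" if query == "2+2?" else "unsure"


def toy_large_lm(query: str) -> str:
    return "large-model answer to: " + query


def toy_verifier(query: str, answer: str) -> bool:
    return answer != "unsure"


easy = automix_style_route("2+2?", toy_small_lm, toy_large_lm, toy_verifier)
hard = automix_style_route("hard question", toy_small_lm, toy_large_lm, toy_verifier)
```

In practice the verifier would be a sampled few-shot prompt to the small model itself, so its votes vary between calls; a single threshold ignores that noise, which is the gap the paper's router is meant to address.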



Post author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!