

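⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for preliminary screening before reading the papers.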

Updated 2025-01-17

IDEA: Image Description Enhanced CLIP-Adapter

Authors:Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang

CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and its performance is comparable to or even exceeds that of state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model’s performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named “IMD-11”. Our code and data are released at https://github.com/FourierAI/IDEA.






Key Takeaways

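  1. CLIP performs strongly in pattern recognition and computer vision.
  2. Transferring CLIP to downstream tasks such as zero- or few-shot classification is an active topic in multimodal learning.
  3. Current work does not fully exploit the complementary information and correlations between images and text.
  4. IDEA adapts CLIP to few-shot image classification by combining visual features with textual descriptions of images.
  5. IDEA is training-free for CLIP, yet matches or even exceeds state-of-the-art models on multiple tasks.
  6. T-IDEA adds two lightweight learnable components (a projector and a learnable latent space) and achieves SOTA results on 11 datasets.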
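
As a rough illustration of how a training-free few-shot classifier can combine visual features with image descriptions, the sketch below starts from zero-shot logits and adds affinities between the query and cached few-shot image embeddings and description embeddings. It is not the authors' implementation: the function names, the weights `alpha` and `beta`, the sharpness `gamma`, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def training_free_logits(query, class_text, shots, alpha=1.0, beta=1.0, gamma=5.0):
    """Score a query embedding without any gradient updates.

    query:      embedding of the test image
    class_text: one text embedding per class (zero-shot prototypes)
    shots:      list of (image_emb, description_emb, label) few-shot triples
    alpha/beta/gamma are hypothetical hyperparameters, not values from the paper.
    """
    q = normalize(query)
    # Zero-shot term: cosine similarity to each class text embedding.
    logits = [dot(q, normalize(t)) for t in class_text]
    for image_emb, description_emb, label in shots:
        sim_image = dot(q, normalize(image_emb))
        sim_text = dot(q, normalize(description_emb))
        # Turn each similarity into a sharp affinity and credit the shot's class.
        logits[label] += alpha * math.exp(-gamma * (1.0 - sim_image))
        logits[label] += beta * math.exp(-gamma * (1.0 - sim_text))
    return logits
```

Because nothing is optimized, adding a new class or a new shot only means appending to `class_text` or `shots`.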



Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning

Authors:Alain Komaty, Hatef Otroshi Shahreza, Anjith George, Sebastien Marcel

This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model’s performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o’s promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: https://gitlab.idiap.ch/bob/bob.paper.wacv2025_chatgpt_face_pad



PDF Accepted in WACV workshop 2025



Key Takeaways

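  1. GPT-4o is competitive for face presentation attack detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios.
  2. GPT-4o is highly consistent in few-shot in-context learning, and its performance improves as more reference examples are provided.
  3. Detailed prompts let the model provide scores reliably, which concise prompts do not.
  4. Explanation-seeking prompts slightly improve performance by making the model more interpretable.
  5. In few-shot scenarios GPT-4o predicts the attack type (print or replay) with high accuracy without being instructed to classify attack types, an emergent reasoning capability.
  6. Experiments used a subset of the SOTERIA dataset containing only data from consenting individuals, in compliance with data privacy regulations.
  7. The findings underscore GPT-4o's promise for PAD and lay the groundwork for future research on broader data privacy concerns and cross-dataset generalization.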



Few-Shot Learner Generalizes Across AI-Generated Image Detection

Authors:Shiyu Wu, Jing Liu, Jing Li, Yequan Wang

Current fake image detectors trained on large synthetic image datasets perform satisfactorily on the limited set of generative models studied. However, their performance declines notably on unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector that learns a specialized metric space to effectively distinguish unseen fake images using very few samples. Experiments show that FSD achieves state-of-the-art performance, improving average accuracy (ACC) on the GenImage dataset by 7.4%. More importantly, our method better captures the intra-category common features of unseen images without further training.



PDF 11 pages, 5 figures


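This paper introduces Few-Shot Detector (FSD), a new detector for AI-generated images. It learns a specialized metric space from very few samples and uses it to distinguish unseen fake images. Experiments show that FSD reaches state-of-the-art performance on the GenImage dataset, improving average accuracy by 7.4%. More importantly, the method better captures the intra-category common features of unseen images without further training.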

Key Takeaways

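  1. Few-Shot Detector (FSD) is a new AI-generated image detector that targets the performance drop existing fake image detectors suffer on unseen generative models.
  2. FSD learns a specialized metric space and needs only a few samples to distinguish fake images effectively.
  3. On the GenImage dataset, FSD improves average accuracy by 7.4% and reaches state-of-the-art performance.
  4. FSD captures the intra-category common features of unseen images without further training, which sets it apart from other detectors.
  5. FSD needs little training data, so it remains usable when data from a generative model is scarce or expensive to collect.
  6. FSD offers a new direction for fake image detection.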
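
One common way to classify with a learned metric space and only a few samples is nearest-prototype matching, as in prototypical networks. The sketch below shows that idea on toy 2-D embeddings. Whether FSD uses exactly this rule is an assumption; the function names and the class labels in the usage note are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_by_prototype(query, support):
    """Assign the query to the class whose prototype is closest.

    support maps each class label to a few embeddings of that class,
    assumed to come from an already trained embedding network.
    """
    prototypes = {label: mean_vector(embs) for label, embs in support.items()}
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))
```

Handling a generator that was never seen in training then only requires adding a handful of its images to `support`, for example `support["new_generator"] = [...]`, with no further training of the embedding network.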



Normalize Then Propagate: Efficient Homophilous Regularization for Few-shot Semi-Supervised Node Classification

Authors:Baoming Zhang, MingCai Chen, Jianqing Song, Shuangjie Li, Jie Zhang, Chongjun Wang

Graph Neural Networks (GNNs) have demonstrated remarkable ability in semi-supervised node classification. However, most existing GNNs rely heavily on a large amount of labeled data for training, which is labor-intensive and requires extensive domain knowledge. In this paper, we first analyze the limits on GNN generalization from the perspective of supervision signals in the context of few-shot semi-supervised node classification. To address these challenges, we propose a novel algorithm named NormProp, which utilizes the homophily assumption of unlabeled nodes to generate additional supervision signals, thereby enhancing generalization under label scarcity. The key idea is to efficiently capture both the class information and the consistency of aggregation during message passing by decoupling the direction and Euclidean norm of node representations. Moreover, we conduct a theoretical analysis to determine the upper bound of the Euclidean norm, and then propose homophilous regularization to constrain the consistency of unlabeled nodes. Extensive experiments demonstrate that NormProp achieves state-of-the-art performance under low-label-rate scenarios with low computational complexity.



PDF Accepted by AAAI 2025



Key Takeaways

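  1. Most GNNs rely on large amounts of labeled data, which is labor-intensive to obtain and requires extensive domain knowledge, so few-shot semi-supervised node classification is challenging for them.
  2. NormProp uses the homophily assumption on unlabeled nodes to generate additional supervision signals and improve generalization.
  3. By decoupling the direction and the Euclidean norm of node representations, NormProp captures both class information and the consistency of aggregation during message passing.
  4. A theoretical analysis determines the upper bound of the Euclidean norm.
  5. Homophilous regularization is proposed to constrain the consistency of unlabeled nodes.
  6. NormProp achieves state-of-the-art performance in low-label-rate scenarios with low computational complexity.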
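
The decoupling idea can be seen in a small numeric sketch: if every node representation is scaled to unit length before mean aggregation, the aggregated vector has norm at most 1, and the norm reaches 1 only when all aggregated directions agree. The code below illustrates that property only. It is not NormProp itself, and the mean-aggregation rule and function names are assumptions.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Keep only the direction of a representation."""
    n = norm(v)
    return [x / n for x in v]


def propagate(features, neighbors):
    """Mean-aggregate unit-length features over each node and its neighbors.

    features:  one raw representation per node
    neighbors: neighbors[i] lists the node indices adjacent to node i
    """
    unit = [normalize(f) for f in features]
    aggregated = []
    for i, adjacent in enumerate(neighbors):
        group = [unit[i]] + [unit[j] for j in adjacent]
        aggregated.append([sum(column) / len(group) for column in zip(*group)])
    return aggregated
```

After `propagate`, the direction of each output can be read as a class signal and its norm as a measure of neighborhood agreement: a node whose neighbors all point the same way keeps norm 1, while a node on a class boundary ends up with a smaller norm.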



Get Rid of Isolation: A Continuous Multi-task Spatio-Temporal Learning Framework

Authors:Zhongchao Yi, Zhengyang Zhou, Qihe Huang, Yanjiang Chen, Liheng Yu, Xu Wang, Yang Wang

Spatiotemporal learning has become a pivotal technique for enabling urban intelligence. Traditional spatiotemporal models mostly focus on a specific task and assume the same distribution between training and testing sets. However, given that urban systems are usually dynamic and multi-sourced, with imbalanced data distributions, current task-specific models fail to generalize to new urban conditions and adapt to new domains because they do not explicitly model interdependencies across various dimensions and types of urban data. To this end, we argue that it is essential to propose a Continuous Multi-task Spatio-Temporal learning framework (CMuST) to empower collective urban intelligence, which reforms urban spatiotemporal learning from single-domain to cooperatively multi-dimensional and multi-task learning. Specifically, CMuST proposes a new multi-dimensional spatiotemporal interaction network (MSTI) to expose cross-interactions between context and main observations as well as self-interactions within spatial and temporal aspects, which is also the core for capturing task-level commonality and personalization. To ensure continuous task learning, a novel Rolling Adaptation training scheme (RoAda) is devised, which not only preserves task uniqueness by constructing data summarization-driven task prompts, but also harnesses correlated patterns among tasks by iterative model behavior modeling. We further establish a benchmark of three cities for multi-task spatiotemporal learning, and empirically demonstrate the superiority of CMuST via extensive evaluations on these datasets. CMuST achieves impressive improvements over existing SOTA methods on both few-shot streaming data and new-domain tasks. Code is available at https://github.com/DILab-USTCSZ/CMuST.



PDF Accepted by NeurIPS 2024



Key Takeaways

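  1. Spatiotemporal learning is a pivotal technique for enabling urban intelligence.
  2. Traditional spatiotemporal models struggle with the dynamic nature of urban systems and their imbalanced data distributions.
  3. The Continuous Multi-task Spatio-Temporal learning framework (CMuST) moves urban spatiotemporal learning from single-domain to multi-dimensional, multi-task learning.
  4. CMuST uses a multi-dimensional spatiotemporal interaction network (MSTI) to capture task-level commonality and personalization.
  5. The Rolling Adaptation training scheme (RoAda) ensures continuous task learning and exploits correlated patterns among tasks.
  6. CMuST performs best on a three-city benchmark, particularly on few-shot streaming data and new-domain tasks.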



RobustEMD: Domain Robust Matching for Cross-domain Few-shot Medical Image Segmentation

Authors:Yazhou Zhu, Minxian Li, Qiaolin Ye, Shidong Wang, Tong Xin, Haofeng Zhang

Few-shot medical image segmentation (FSMIS) aims to learn from limited annotated data in medical image analysis. Despite the progress achieved, current FSMIS models are all trained and deployed on the same data domain, which is inconsistent with the clinical reality that medical imaging data spans different data domains (e.g. imaging modalities, institutions and equipment sequences). How can FSMIS models be made to generalize well across specific medical imaging domains? In this paper, we focus on the matching mechanism of few-shot semantic segmentation models and introduce a domain-robust matching mechanism based on Earth Mover’s Distance (EMD) for the cross-domain scenario. Specifically, we formulate the EMD transportation process between the foreground support and query features. A texture-structure-aware weight generation method, which performs Sobel-based image gradient calculation over the nodes, is introduced into the EMD matching flow to restrain the domain-relevant nodes. In addition, a point-set-level distance metric is introduced to calculate the cost of transportation from support set nodes to query set nodes. To evaluate our model, we conduct experiments in three scenarios (i.e., cross-modal, cross-sequence and cross-institution), which include eight medical datasets and involve three body regions. The results demonstrate that our model achieves SoTA performance against the compared models.






Key Takeaways

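  1. FSMIS addresses learning from limited annotated data in medical image analysis.
  2. Current FSMIS models are trained and deployed on a single data domain, whereas medical imaging data usually spans different domains.
  3. A domain-robust matching mechanism based on Earth Mover's Distance (EMD) is introduced so that models generalize across medical imaging domains.
  4. The mechanism formulates an EMD transportation process between foreground support and query features and uses Sobel-based image gradients to generate node weights that restrain domain-relevant nodes.
  5. A point-set-level distance metric is used to calculate the transportation cost.
  6. The model is evaluated in three scenarios (cross-modal, cross-sequence and cross-institution) covering eight medical datasets and three body regions.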
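
To make the transport step concrete, the sketch below approximates the EMD between two small sets of weighted nodes with Sinkhorn iterations. Sinkhorn is an entropy-regularized approximation swapped in here for brevity, not the solver described in the paper. The node weights are taken as plain inputs (the paper derives them from Sobel gradients, which is not reproduced), and `eps`, `iters` and the function name are assumptions.

```python
import math


def sinkhorn_cost(a, b, cost, eps=0.05, iters=500):
    """Approximate the EMD between node weights a and b.

    a, b:  positive node weights, each summing to 1
    cost:  cost[i][j] is the cost of moving mass from node i to node j
    eps:   entropy regularization strength (smaller is closer to exact EMD)
    Returns (total transport cost, transport plan).
    """
    rows, cols = len(a), len(b)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * rows
    v = [1.0] * cols
    for _ in range(iters):
        # Rescale rows, then columns, so the plan matches both marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(rows)) for j in range(cols)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(cols)] for i in range(rows)]
    total = sum(plan[i][j] * cost[i][j] for i in range(rows) for j in range(cols))
    return total, plan
```

In a matching setting, `a` and `b` would hold the weights of support and query foreground nodes and `cost` their pairwise distances, so lowering a node's weight reduces how much it can contribute to the final matching score.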



PACE: Marrying generalization in PArameter-efficient fine-tuning with Consistency rEgularization

Authors:Yao Ni, Shan Zhang, Piotr Koniusz

Parameter-Efficient Fine-Tuning (PEFT) effectively adapts pre-trained transformers to downstream tasks. However, the optimization of task performance often comes at the cost of generalizability in fine-tuned models. To address this issue, we theoretically connect smaller weight gradient norms during training and larger datasets to improvements in model generalization. Motivated by this connection, we propose reducing gradient norms for enhanced generalization and aligning the fine-tuned model with its pre-trained counterpart to retain knowledge from large-scale pre-training data. Yet, naive alignment does not guarantee gradient reduction and can potentially cause gradient explosion, complicating efforts to manage gradients. To address this issue, we propose PACE, marrying generalization of PArameter-efficient fine-tuning with Consistency rEgularization. We perturb features learned from the adapter with multiplicative noise and ensure the fine-tuned model remains consistent for the same sample under different perturbations. Theoretical analysis shows that PACE not only implicitly regularizes gradients for enhanced generalization, but also implicitly aligns the fine-tuned and pre-trained models to retain knowledge. Experimental evidence supports our theories. PACE surpasses existing PEFT methods in visual adaptation tasks (VTAB-1k, FGVC, few-shot learning, domain adaptation), showcasing its potential for resource-efficient fine-tuning. It also improves LoRA in text classification (GLUE) and mathematical reasoning (GSM-8K). The code is available at https://github.com/MaxwellYaoNi/PACE



PDF Accepted by NeurIPS 2024 as a spotlight



Key Takeaways

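  1. PEFT adapts pre-trained transformers to downstream tasks effectively, but often at the cost of generalizability.
  2. Smaller weight gradient norms during training and larger datasets are theoretically connected to better generalization.
  3. PACE combines parameter-efficient fine-tuning with consistency regularization to improve generalization.
  4. PACE perturbs adapter features with multiplicative noise and keeps the model consistent for the same sample under different perturbations.
  5. PACE implicitly regularizes gradients for better generalization and implicitly aligns the fine-tuned and pre-trained models to retain knowledge.
  6. PACE surpasses existing PEFT methods on visual adaptation tasks and improves LoRA in text classification and mathematical reasoning.
  7. The code is publicly available.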
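
The consistency term can be sketched for a single linear layer: the frozen pre-trained output is kept as is, the adapter's contribution is multiplied by noise centred at 1, and the penalty is the squared difference between two independently perturbed outputs. This is a minimal sketch, not the released PACE code; the per-feature Gaussian noise, `sigma`, and both function names are assumptions.

```python
import random


def adapted_output(x, w_pretrained, w_adapter, sigma, rng):
    """Frozen linear output plus an adapter output under multiplicative noise."""
    output = sum(w * xi for w, xi in zip(w_pretrained, x))
    for w, xi in zip(w_adapter, x):
        # Noise is centred at 1, so the expected adapter contribution is unchanged.
        output += rng.gauss(1.0, sigma) * w * xi
    return output


def consistency_penalty(x, w_pretrained, w_adapter, sigma=0.5, rng=None):
    """Squared difference between two noisy forward passes of the same sample."""
    if rng is None:
        rng = random.Random(0)
    first = adapted_output(x, w_pretrained, w_adapter, sigma, rng)
    second = adapted_output(x, w_pretrained, w_adapter, sigma, rng)
    return (first - second) ** 2
```

In this toy setting the penalty is exactly zero when the adapter contributes nothing and grows with the size of the adapter's contribution, which gives some intuition for why a consistency term can pull the fine-tuned model toward its pre-trained counterpart.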



CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

Authors:Zijian Zhao, Tingwei Chen, Zhijie Cai, Xiaoyang Li, Hang Li, Qimei Chen, Guangxu Zhu

In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain scenario and cross-domain scenario, including few-shot, zero-shot scenarios, and even works in few-shot new-class scenario where testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In gesture recognition task, our CrossFi achieves an accuracy of 98.17% in in-domain scenario, 91.72% in one-shot cross-domain scenario, 64.81% in zero-shot cross-domain scenario, and 84.75% in one-shot new-class scenario. The code for our model is publicly available at https://github.com/RS2002/CrossFi.






Key Takeaways

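  1. Wi-Fi sensing has attracted wide attention for benefits such as privacy protection, low cost and penetration ability.
  2. Data-driven Wi-Fi sensing methods suffer from domain shift: models perform poorly in environments that differ from the training data.
  3. CrossFi is built on a siamese network whose sample-similarity network, CSi-Net, uses an attention mechanism to capture similarity information, and it addresses the domain shift problem.
  4. CrossFi works in in-domain and cross-domain scenarios, including few-shot, zero-shot and few-shot new-class scenarios.
  5. In gesture recognition, CrossFi reaches 98.17% accuracy in the in-domain scenario, 91.72% in one-shot cross-domain, 64.81% in zero-shot cross-domain, and 84.75% in one-shot new-class scenarios.
  6. The code is publicly available.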



HeadGAP: Few-Shot 3D Head Avatar via Generalizable Gaussian Priors

Authors:Xiaozheng Zheng, Chao Wen, Zhaohu Li, Weiyi Zhang, Zhuo Su, Xu Chang, Yang Zhao, Zheng Lv, Xiaoyuan Zhang, Yongjie Zhang, Guidong Wang, Lan Xu

In this paper, we present a novel 3D head avatar creation approach capable of generalizing from few-shot in-the-wild data with high-fidelity and animatable robustness. Given the underconstrained nature of this problem, incorporating prior knowledge is essential. Therefore, we propose a framework comprising prior learning and avatar creation phases. The prior learning phase leverages 3D head priors derived from a large-scale multi-view dynamic dataset, and the avatar creation phase applies these priors for few-shot personalization. Our approach effectively captures these priors by utilizing a Gaussian Splatting-based auto-decoder network with part-based dynamic modeling. Our method employs identity-shared encoding with personalized latent codes for individual identities to learn the attributes of Gaussian primitives. During the avatar creation phase, we achieve fast head avatar personalization by leveraging inversion and fine-tuning strategies. Extensive experiments demonstrate that our model effectively exploits head priors and successfully generalizes them to few-shot personalization, achieving photo-realistic rendering quality, multi-view consistency, and stable animation.



PDF Accepted to 3DV 2025. Project page: https://headgap.github.io/



Key Takeaways

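  1. A new 3D head avatar creation approach generalizes from few-shot in-the-wild data with high fidelity and animatable robustness.
  2. The framework has a prior learning phase and an avatar creation phase; the former derives 3D head priors from a large-scale multi-view dynamic dataset.
  3. The priors are captured by a Gaussian Splatting-based auto-decoder network with part-based dynamic modeling.
  4. Identity-shared encoding with personalized latent codes for individual identities is used to learn the attributes of Gaussian primitives.
  5. Inversion and fine-tuning strategies give fast head avatar personalization.
  6. The model generalizes the head priors to few-shot personalization, achieving photo-realistic rendering quality, multi-view consistency and stable animation.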



A Foundation Language-Image Model of the Retina (FLAIR): Encoding Expert Knowledge in Text Supervision

Authors:Julio Silva-Rodríguez, Hadi Chakor, Riadh Kobbi, Jose Dolz, Ismail Ben Ayed

Foundation vision-language models are currently transforming computer vision, and are on the rise in medical imaging fueled by their very promising generalization capabilities. However, the initial attempts to transfer this new paradigm to medical imaging have shown less impressive performances than those observed in other domains, due to the significant domain shift and the complex, expert domain knowledge inherent to medical-imaging tasks. Motivated by the need for domain-expert foundation models, we present FLAIR, a pre-trained vision-language model for universal retinal fundus image understanding. To this end, we compiled 38 open-access, mostly categorical fundus imaging datasets from various sources, with up to 101 different target conditions and 288,307 images. We integrate the expert’s domain knowledge in the form of descriptive textual prompts, during both pre-training and zero-shot inference, enhancing the less-informative categorical supervision of the data. Such a textual expert’s knowledge, which we compiled from the relevant clinical literature and community standards, describes the fine-grained features of the pathologies as well as the hierarchies and dependencies between them. We report comprehensive evaluations, which illustrate the benefit of integrating expert knowledge and the strong generalization capabilities of FLAIR under difficult scenarios with domain shifts or unseen categories. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, more so in the few-shot regimes. Interestingly, FLAIR outperforms by a wide margin larger-scale generalist image-language models and retina domain-specific self-supervised networks, which emphasizes the potential of embedding experts’ domain knowledge and the limitations of generalist models in medical imaging.



PDF Accepted in Medical Image Analysis. The pre-trained model is available at: https://github.com/jusiro/FLAIR




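Key Takeaways

  1. Transferring vision-language foundation models to medical imaging is hampered by significant domain shift and by the expert knowledge that medical-imaging tasks require.
  2. FLAIR, a pre-trained vision-language model for retinal fundus image understanding, integrates expert domain knowledge as descriptive textual prompts during both pre-training and zero-shot inference.
  3. The textual prompts describe fine-grained features of the pathologies as well as the hierarchies and dependencies between them.
  4. Comprehensive evaluations show the benefit of integrating expert knowledge and FLAIR's strong generalization under domain shifts or unseen categories.
  5. When adapted with a lightweight linear probe, FLAIR outperforms fully-trained, dataset-focused models, especially in few-shot regimes.
  6. FLAIR outperforms larger-scale generalist image-language models and retina-specific self-supervised networks by a wide margin, highlighting the potential of embedding expert domain knowledge.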



Controlling Equational Reasoning in Large Language Models with Prompt Interventions

Authors:Jordan Meadows, Marco Valentino, Andre Freitas

This paper investigates how hallucination rates in Large Language Models (LLMs) may be controlled via a symbolic data generation framework, exploring a fundamental relationship between the rate of certain mathematical errors and types of input intervention. Specifically, we systematically generate data for a derivation generation task using a symbolic engine, applying targeted interventions to prompts to perturb features of mathematical derivations such as the surface forms of symbols, equational tree structures, and mathematical context. We then evaluate the effect of prompt interventions across a range of LLMs including fine-tuned T5 models, GPT, and LLaMa-based models. Our experiments suggest that T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework. However, an extensive evaluation based on human analysis, template-based error detection, and text generation metrics reveals model weaknesses beyond what the reference-based metrics singularly describe. We use these results to tie characteristic distributional footprints of interventions to the human evaluation of LLM derivation quality, potentially leading to significant control over fine-grained mathematical capabilities of language models with respect to specific types of errors.



PDF AAAI 2025 (7 pages)



Key Takeaways

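  1. The paper investigates how hallucination rates in large language models can be controlled via a symbolic data generation framework.
  2. It explores a fundamental relationship between the rate of certain mathematical errors and types of input intervention.
  3. Data for a derivation generation task is generated systematically with a symbolic engine, and targeted prompt interventions perturb features of the derivations such as the surface forms of symbols, equational tree structures and mathematical context.
  4. The effect of prompt interventions is evaluated across a range of LLMs, including fine-tuned T5 models, GPT, and LLaMa-based models.
  5. T5-Large can outperform the few-shot performance of GPT-4 on various evaluation sets generated via the framework.
  6. An extensive evaluation based on human analysis, template-based error detection and text generation metrics reveals model weaknesses beyond what reference-based metrics alone describe.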
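
One of the interventions mentioned above, perturbing the surface forms of symbols, can be sketched as a consistent renaming applied to every step of a derivation. The example below is a plain string-level illustration using Python's `re` module. It does not use the paper's symbolic engine, and the function name and sample derivation are hypothetical.

```python
import re


def rename_symbols(derivation, mapping):
    """Consistently rename whole-word symbols in every step of a derivation.

    derivation: list of equation strings
    mapping:    old symbol name -> new symbol name
    All replacements happen in a single pass, so swaps such as
    {"x": "y", "y": "x"} behave as expected.
    """
    names = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + names + r")\b")
    return [pattern.sub(lambda m: mapping[m.group(1)], step) for step in derivation]
```

For instance, `rename_symbols(["y = x**2 + a"], {"x": "t", "y": "u"})` yields a derivation with the same structure but different symbol names, so a model's output on the original and the perturbed prompt can be compared. Because matching is on word boundaries, a symbol embedded in a longer token such as `dx` is left unchanged.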



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.