Published: 2025-10-11

Updated: 2025-11-27

Word count: 3k

Reading time: 12 min

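⚠️ All summaries below were generated by a large language model. They may contain errors and are for reference only; use them with caution.
🔴 Note: do not rely on them in serious academic settings. They are suitable only for an initial screening before reading the papers.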

2025-10-11 Update

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Authors: Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

Paper and project links

PDF

Summary

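The Visual Iconicity Challenge is a new video-based benchmark for vision-language models (VLMs). It comprises three tasks that probe the mapping between linguistic form and meaning: phonological sign-form prediction, transparency, and graded iconicity ratings. Thirteen VLMs were evaluated on Sign Language of the Netherlands in zero- and few-shot settings and compared with human baselines. The models recover some handshape and location detail but remain below human performance. Models with stronger phonological form prediction also correlate better with human iconicity judgments, which suggests a shared sensitivity to visually grounded structure. The study validates these diagnostic tasks and motivates human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.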

Key Takeaways

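Iconicity is pervasive in signed languages, which makes them a natural testbed for visual grounding.
The challenge for VLMs is to recover the key form-meaning mappings from dynamic human motion rather than from static context.
The Visual Iconicity Challenge is a new video-based benchmark that evaluates VLMs on three tasks: phonological sign-form prediction, transparency, and graded iconicity ratings.
Thirteen state-of-the-art VLMs were evaluated on Sign Language of the Netherlands and compared with human baselines.
On phonological sign-form prediction, VLMs recover some handshape and location detail but remain below human performance.
On the transparency task, VLMs are far from human baselines.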
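
The graded iconicity task compares model ratings with human ratings through a correlation. The sketch below shows one way such a comparison could be computed, using Spearman rank correlation with average ranks for ties. The choice of Spearman, the function names, and the rating values are assumptions made for illustration; none of them are taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical ratings for five signs (made-up numbers, not paper data).
human_ratings = [6.1, 2.3, 4.8, 1.5, 5.0]
model_ratings = [5.0, 3.0, 4.0, 2.0, 6.0]
rho = spearman(human_ratings, model_ratings)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.90"
```

A rank-based measure is a natural fit when the rating scales of humans and models differ, because only the ordering of the signs matters.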

Meta-Learning Based Few-Shot Graph-Level Anomaly Detection

Authors: Liting Li, Yumeng Wang, Yueheng Sun

Graph-level anomaly detection aims to identify anomalous graphs or subgraphs within graph datasets, playing a vital role in various fields such as fraud detection, review classification, and biochemistry. While Graph Neural Networks (GNNs) have made significant progress in this domain, existing methods rely heavily on large amounts of labeled data, which is often unavailable in real-world scenarios. Additionally, few-shot anomaly detection methods based on GNNs are prone to noise interference, resulting in poor embedding quality and reduced model robustness. To address these challenges, we propose a novel meta-learning-based graph-level anomaly detection framework (MA-GAD), incorporating a graph compression module that reduces the graph size, mitigating noise interference while retaining essential node information. We also leverage meta-learning to extract meta-anomaly information from similar networks, enabling the learning of an initialization model that can rapidly adapt to new tasks with limited samples. This improves the anomaly detection performance on target graphs, and a bias network is used to enhance the distinction between anomalous and normal nodes. Our experimental results, based on four real-world biochemical datasets, demonstrate that MA-GAD outperforms existing state-of-the-art methods in graph-level anomaly detection under few-shot conditions. Experiments on both graph anomaly and subgraph anomaly detection tasks validate the framework’s effectiveness on real-world datasets.

Paper and project links

PDF Accepted by ARRML2025

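Summary

MA-GAD is a meta-learning-based framework for graph-level anomaly detection, which identifies anomalous graphs or subgraphs within graph datasets and matters in fields such as fraud detection, review classification, and biochemistry. A graph compression module reduces graph size, mitigating noise interference while retaining essential node information, and meta-learning extracts meta-anomaly information from similar networks. The learned initialization model can adapt rapidly to new tasks with few samples. A bias network further sharpens the distinction between anomalous and normal nodes. Experiments on four real-world biochemical datasets show that MA-GAD outperforms existing state-of-the-art methods under few-shot conditions, on both graph anomaly and subgraph anomaly detection tasks.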

Key Takeaways

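MA-GAD is a new framework for graph-level anomaly detection that identifies anomalous graphs or subgraphs.
Graph-level anomaly detection, where Graph Neural Networks (GNNs) have made significant progress, plays a vital role in fields such as fraud detection, review classification, and biochemistry.
The graph compression module reduces graph size, which mitigates noise interference while retaining essential node information.
Meta-learning extracts meta-anomaly information from similar networks to learn an initialization model that adapts rapidly to new tasks with limited samples, improving anomaly detection on target graphs.
A bias network enhances the distinction between anomalous and normal nodes.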
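
The idea of learning an initialization that adapts quickly from a few samples can be illustrated on a toy problem. The abstract does not say which meta-learning algorithm MA-GAD uses, so the sketch below swaps in Reptile, a simple first-order method, and applies it to scalar regression tasks of the form y = slope * x rather than to graphs. All names, learning rates, and task values are hypothetical.

```python
def sgd_adapt(w, samples, lr=0.1, steps=5):
    """Inner loop: a few gradient steps on squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w


def reptile(tasks, w0=0.0, meta_lr=0.5, rounds=50):
    """Outer loop: move the shared initialization toward each task's adapted weight."""
    w = w0
    for r in range(rounds):
        samples = tasks[r % len(tasks)]  # cycle through tasks deterministically
        adapted = sgd_adapt(w, samples)
        w += meta_lr * (adapted - w)
    return w


def make_task(slope, xs=(-1.0, 0.5, 1.0)):
    """Toy task: noiseless samples from y = slope * x."""
    return [(x, slope * x) for x in xs]


# Source tasks play the role of the "similar networks" in the abstract.
source_tasks = [make_task(s) for s in (1.8, 2.0, 2.2)]
w_meta = reptile(source_tasks)

# Target task with only two labelled samples and two adaptation steps.
target = make_task(2.1, xs=(0.5, 1.0))
w_from_meta = sgd_adapt(w_meta, target, steps=2)
w_from_zero = sgd_adapt(0.0, target, steps=2)

print(f"error from meta-learned init: {abs(w_from_meta - 2.1):.3f}")
print(f"error from zero init:         {abs(w_from_zero - 2.1):.3f}")
```

With the same two adaptation steps, the weight that starts from the meta-learned initialization ends much closer to the target slope than the one that starts from zero, which is the effect few-shot meta-learning aims for.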

Truth, Trust, and Trouble: Medical AI on the Edge

Authors: Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models – Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

Paper and project links

PDF Accepted at EMNLP 2025 (Industry Track)

Summary

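Large language models show great promise for automated medical question answering, but ensuring that they meet industry standards for factual accuracy, usefulness, and safety remains a challenge. The authors present a rigorous benchmarking framework that uses a dataset of over 1,000 health questions to assess models on honesty, helpfulness, and harmlessness. The results reveal trade-offs between factual reliability and safety among the evaluated models. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90). Few-shot prompting improves accuracy from 78% to 85%, but all models show reduced helpfulness on complex queries, underscoring ongoing challenges in clinical QA.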
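
The difference between zero-shot and few-shot prompting, and how an accuracy figure might be scored, can be sketched as follows. The prompt template, the exact-match scoring rule, the function names, and the example questions are all assumptions for illustration; the abstract does not describe the paper's actual prompts or metric definitions.

```python
def build_prompt(question, examples=()):
    """Format a QA prompt; with no examples this is the zero-shot case."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference, ignoring case and padding."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical demonstrations prepended in the few-shot setting.
demos = [
    ("Can antibiotics cure a viral cold?", "no"),
    ("Does regular handwashing reduce the spread of infections?", "yes"),
]
query = "Does smoking raise the risk of lung cancer?"

zero_shot_prompt = build_prompt(query)
few_shot_prompt = build_prompt(query, demos)

# Made-up model outputs scored against made-up references.
score = exact_match_accuracy(["Yes", "no", "no "], ["yes", "no", "yes"])

print(few_shot_prompt)
print(f"accuracy = {score:.2f}")
```

Exact match only suits short, closed-form answers; free-text medical answers would need a different scoring scheme.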

Key Takeaways