⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them for serious academic purposes; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-07
Delta Activations: A Representation for Finetuned Large Language Models
Authors:Zhiqiu Xu, Amish Sethi, Mayur Naik, Ser-Nam Lim
The success of powerful open source Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories. We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape. Delta Activations also demonstrate desirable properties: it is robust across finetuning settings and exhibits an additive property when finetuning datasets are mixed. In addition, we show that Delta Activations can embed tasks via few-shot finetuning, and further explore its use for model selection and merging. We hope Delta Activations can facilitate the practice of reusing publicly available models. Code is available at https://github.com/OscarXZQ/delta_activations.
Paper and Project Links
Summary
This paper introduces Delta Activations, which represents a finetuned model as a vector embedding by measuring the shift in its internal activations relative to a base model. The representation clusters models effectively by domain and task, revealing structure in the model landscape. Delta Activations is robust across finetuning settings, additive when finetuning datasets are mixed, and able to embed tasks via few-shot finetuning. The paper further explores its use for model selection and merging, in the hope of facilitating the reuse of publicly available models.
Key Takeaways
- Delta Activations measures the shift in internal activations between a finetuned model and its base model, representing the model as a vector embedding.
- The method clusters models effectively by domain and task, revealing structure in the model landscape.
- Delta Activations is robust, behaving consistently across different finetuning settings.
- When finetuning datasets are mixed, Delta Activations exhibits an additive property.
- Delta Activations can embed tasks via few-shot finetuning, supporting model selection and merging.
- The code on GitHub (https://github.com/OscarXZQ/delta_activations) helps researchers understand and apply the method.
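Below is a minimal sketch of the core idea, assuming HuggingFace transformers models; the probe prompts, the choice of layer, and last-token mean pooling are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_hidden_state(model, tokenizer, prompts, layer=-1):
    """Average last-token hidden state at `layer` over a set of probe prompts."""
    states = []
    with torch.no_grad():
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt")
            out = model(**inputs, output_hidden_states=True)
            states.append(out.hidden_states[layer][0, -1])  # last token
    return torch.stack(states).mean(dim=0)

def delta_activation(base_name, finetuned_name, prompts):
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(finetuned_name)
    # The embedding is the shift in internal activations relative to the base.
    return mean_hidden_state(tuned, tok, prompts) - mean_hidden_state(base, tok, prompts)
```

Two finetuned models can then be compared by cosine similarity of their delta vectors, which is what makes clustering by domain and task possible.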
Click here to view paper screenshots







ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Authors:Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Zhijian Liu, Zhiting Hu, Lianhui Qin
While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. On the challenging ARC-AGI benchmark, our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, we confirm that dynamically updating memory during test-time outperforms an otherwise identical fixed memory setting with additional attempts, supporting the hypothesis that solving more problems and abstracting more patterns to memory enables further solutions in a form of self-improvement. Code available at https://github.com/matt-seb-ho/arc_memo.
Paper and Project Links
Summary
Inference-time scaling lets LLMs carry out increasingly long and capable reasoning traces, yet the patterns and insights uncovered along the way are discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries and has shown clear benefits for reasoning-intensive tasks. This work moves beyond instance-based memory entries toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. The design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and letting memory expand with additional experiences. On the challenging ARC-AGI benchmark, the method yields a 7.5% relative gain over a strong no-memory baseline, with performance continuing to scale with inference compute; dynamically updating memory at test time also outperforms an otherwise identical fixed-memory setting, supporting a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
Key Takeaways
- Concept-level memory stores reusable, modular abstractions distilled from solution traces in natural language, rather than exact query/response pairs.
- Relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates.
- On ARC-AGI, the method yields a 7.5% relative gain over a strong no-memory baseline and keeps scaling with inference compute.
- Dynamically updating memory at test time outperforms an otherwise identical fixed-memory setting, a form of self-improvement.
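A hedged sketch of concept-level memory with retrieval follows; the `embed` function, the storage format, and top-k cosine retrieval are illustrative stand-ins rather than ArcMemo's actual implementation.

```python
import numpy as np

class ConceptMemory:
    def __init__(self, embed):
        self.embed = embed          # any sentence-embedding function: str -> np.ndarray
        self.concepts, self.vectors = [], []

    def add(self, concept: str):
        """Store a reusable, natural-language abstraction from a solved rollout."""
        self.concepts.append(concept)
        self.vectors.append(self.embed(concept))

    def retrieve(self, query: str, k: int = 3):
        """Return the k concepts most similar to the query, for prompt injection."""
        q = self.embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.concepts[i] for i in top]

# Retrieved concepts are prepended to the new query's prompt, enabling
# test-time continual learning without any weight updates.
```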
Click here to view paper screenshots





No Thoughts Just AI: Biased LLM Recommendations Limit Human Agency in Resume Screening
Authors:Kyra Wilson, Mattea Sim, Anna-Maria Gueorguieva, Aylin Caliskan
In this study, we conduct a resume-screening experiment (N=528) where people collaborate with simulated AI models exhibiting race-based preferences (bias) to evaluate candidates for 16 high and low status occupations. Simulated AI bias approximates factual and counterfactual estimates of racial bias in real-world AI systems. We investigate people’s preferences for White, Black, Hispanic, and Asian candidates (represented through names and affinity groups on quality-controlled resumes) across 1,526 scenarios and measure their unconscious associations between race and status using implicit association tests (IATs), which predict discriminatory hiring decisions but have not been investigated in human-AI collaboration. When making decisions without AI or with AI that exhibits no race-based preferences, people select all candidates at equal rates. However, when interacting with AI favoring a particular group, people also favor those candidates up to 90% of the time, indicating a significant behavioral shift. The likelihood of selecting candidates whose identities do not align with common race-status stereotypes can increase by 13% if people complete an IAT before conducting resume screening. Finally, even if people think AI recommendations are low quality or not important, their decisions are still vulnerable to AI bias under certain circumstances. This work has implications for people’s autonomy in AI-HITL scenarios, AI and work, design and evaluation of AI hiring systems, and strategies for mitigating bias in collaborative decision-making tasks. In particular, organizational and regulatory policy should acknowledge the complex nature of AI-HITL decision making when implementing these systems, educating people who use them, and determining which are subject to oversight.
Paper and Project Links
PDF: Published in Proceedings of the 2025 AAAI/ACM Conference on AI, Ethics, and Society; code available at https://github.com/kyrawilson/No-Thoughts-Just-AI
Summary
This study runs a resume-screening experiment (N=528) in which people collaborate with simulated AI models exhibiting race-based preferences to evaluate candidates, represented by White, Black, Hispanic, and Asian names on quality-controlled resumes. When the AI exhibits racial preferences, people's decisions shift markedly, tending to select the candidates the AI favors. Completing an implicit association test (IAT) beforehand makes people more resistant to stereotypes and more likely to make equitable selections. Even when people judge the AI's recommendations to be unimportant or low quality, their decisions can still be swayed by AI bias under certain circumstances. The work has implications for AI and work, the design and evaluation of AI hiring systems, and bias-mitigation strategies in collaborative decision-making; in particular, organizational and regulatory policy should account for the complexity of AI human-in-the-loop decision making.
Key Takeaways
- Simulated AI racial bias predicts people's race-based preferences when they collaborate with AI.
- When the AI exhibits a racial preference, people tend to select the candidates the AI recommends.
- People who complete an implicit association test beforehand resist race-status stereotypes better and make fairer decisions.
- AI bias can affect how people assess the independence of their own decisions, especially when comparative information is lacking.
- Racial bias in AI models affects the fairness and effectiveness of hiring decisions.
- The design and evaluation of AI hiring systems should account for racial bias in the underlying models.
Click here to view paper screenshots





Aesthetic Image Captioning with Saliency Enhanced MLLMs
Authors:Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao
Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs, this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.
Paper and Project Links
Summary:
This paper surveys Aesthetic Image Captioning (AIC) and the application of pretrained Multimodal Large Language Models (MLLMs) to image aesthetics. To address the limits of existing work, it proposes the Aesthetic Saliency Enhanced MLLM (ASE-MLLM), which introduces an Image Aesthetic Saliency Module (IASM) and an IAS-ViT image encoder that fuses aesthetic saliency features with the original image features through a cross-attention mechanism. Experiments show the approach significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.
Key Takeaways:
- Aesthetic Image Captioning (AIC), a key research direction in computational aesthetics, aims to generate textual descriptions of image aesthetics.
- Pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly in image aesthetics research, integrating visual and textual modalities.
- Most existing image-aesthetics research focuses on predicting aesthetic ratings, with limited application to AIC.
- Existing AIC work built on MLLMs relies mainly on fine-tuning without adapting the models to target aesthetic content.
- ASE-MLLM incorporates aesthetic saliency into MLLMs to address these limitations.
- It introduces the Image Aesthetic Saliency Module (IASM) and the IAS-ViT encoder to extract aesthetic saliency features and fuse them with the original image features.
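The cross-attention fusion can be sketched as below; dimensions, module names, and the residual design are assumptions based on the abstract, not the paper's code.

```python
import torch
import torch.nn as nn

class SaliencyCrossAttention(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, saliency_tokens):
        # Image tokens query the aesthetic-saliency tokens, so each patch
        # attends to the saliency cues most relevant to it.
        fused, _ = self.attn(query=image_tokens,
                             key=saliency_tokens,
                             value=saliency_tokens)
        return self.norm(image_tokens + fused)  # residual fusion

x = torch.randn(2, 196, 768)   # patch embeddings from the image encoder
s = torch.randn(2, 49, 768)    # aesthetic saliency features (assumed shape)
print(SaliencyCrossAttention()(x, s).shape)  # torch.Size([2, 196, 768])
```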
Click here to view paper screenshots





Efficient Odd-One-Out Anomaly Detection
Authors:Silvio Chito, Paolo Rabino, Tatiana Tommasi
The recently introduced odd-one-out anomaly detection task involves identifying the odd-looking instances within a multi-object scene. This problem presents several challenges for modern deep learning models, demanding spatial reasoning across multiple views and relational reasoning to understand context and generalize across varying object categories and layouts. We argue that these challenges must be addressed with efficiency in mind. To this end, we propose a DINO-based model that reduces the number of parameters by one third and shortens training time by a factor of three compared to the current state-of-the-art, while maintaining competitive performance. Our experimental evaluation also introduces a Multimodal Large Language Model baseline, providing insights into its current limitations in structured visual reasoning tasks. The project page can be found at https://silviochito.github.io/EfficientOddOneOut/
Paper and Project Links
PDF: Accepted at ICIAP 2025
Summary
The recently introduced odd-one-out anomaly detection task asks a model to identify the odd-looking instances within a multi-object scene. It challenges modern deep learning models with spatial reasoning across multiple views and relational reasoning over context, generalizing across object categories and layouts. Arguing that these challenges must be met efficiently, the authors propose a DINO-based model that cuts the parameter count by one third and the training time by a factor of three compared with the current state of the art, while remaining competitive. The evaluation also introduces a Multimodal Large Language Model baseline, offering insight into its current limitations on structured visual reasoning tasks.
Key Takeaways
- The odd-one-out anomaly detection task requires identifying anomalous instances in multi-object scenes.
- The task challenges modern deep learning models with both spatial and relational reasoning.
- A DINO-based model is proposed to meet these challenges with an efficient parameter and training budget.
- The model uses one third fewer parameters and trains three times faster while maintaining competitive performance.
- The experimental evaluation introduces a Multimodal Large Language Model baseline.
- That baseline shows clear limitations on structured visual reasoning tasks.
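As a conceptual illustration only, a leave-one-out outlier score over DINO object features looks like this; the actual model adds multi-view spatial and relational reasoning on top of such features.

```python
import torch

def odd_one_out_scores(object_embeddings: torch.Tensor) -> torch.Tensor:
    """object_embeddings: (N, D) DINO features, one row per object in the scene.
    Returns an anomaly score per object: cosine distance to the mean of the others."""
    n = object_embeddings.size(0)
    feats = torch.nn.functional.normalize(object_embeddings, dim=-1)
    scores = []
    for i in range(n):
        others = torch.cat([feats[:i], feats[i + 1:]])
        scores.append(1.0 - feats[i] @ others.mean(dim=0))  # higher = more "odd"
    return torch.stack(scores)

print(odd_one_out_scores(torch.randn(5, 384)))
```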
Click here to view paper screenshots



EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation
Authors:Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup
Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines – vanilla strategies and fixed-emotion strategies – for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.
Paper and Project Links
Summary
Chain-of-Thought reasoning lets LLM agents engage in complex, multi-turn negotiations, opening new avenues for agentic AI. Existing LLM agents, however, overlook the functional role of emotions in negotiation, producing passive, preference-driven emotional responses that leave them open to strategic manipulation by adversarial counterparts. EvoEmo closes this gap: an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. It models emotional state transitions as a Markov Decision Process and uses population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. An evaluation framework with two baselines, vanilla strategies and fixed-emotion strategies, benchmarks emotion-aware negotiation. Extensive experiments and ablations show EvoEmo consistently outperforms both baselines in success rate, efficiency, and buyer savings, underscoring the importance of adaptive emotional expression for effective multi-turn negotiation.
Key Takeaways
- LLMs can engage in complex multi-turn negotiations through Chain-of-Thought (CoT) reasoning, opening new avenues for agentic AI.
- Existing LLM agents often generate passive emotional responses, making them vulnerable to manipulation by adversarial counterparts.
- EvoEmo is an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations.
- EvoEmo models emotional state transitions as a Markov Decision Process and uses genetic optimization to evolve high-reward emotion policies.
- An evaluation framework with baselines is proposed to assess emotion-aware negotiation strategies.
- Extensive experiments show that EvoEmo outperforms baseline strategies in terms of success rate, efficiency, and buyer savings.
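A toy sketch of population-based genetic optimization over emotion policies follows; the policy encoding, mutation scheme, and reward function are stand-ins for EvoEmo's actual MDP formulation and negotiation reward.

```python
import random

EMOTIONS = ["neutral", "friendly", "firm", "displeased"]

def random_policy():
    # Policy: for each current emotional state, which emotion to express next.
    return {e: random.choice(EMOTIONS) for e in EMOTIONS}

def mutate(policy, rate=0.2):
    return {e: (random.choice(EMOTIONS) if random.random() < rate else nxt)
            for e, nxt in policy.items()}

def evolve(reward_fn, pop_size=20, generations=30):
    population = [random_policy() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=reward_fn, reverse=True)
        elite = scored[: pop_size // 4]           # keep the best quarter
        population = elite + [mutate(random.choice(elite))
                              for _ in range(pop_size - len(elite))]
    return max(population, key=reward_fn)

# In EvoEmo, reward_fn would run a simulated negotiation and return e.g.
# buyer savings; a trivial placeholder reward is used here.
best = evolve(lambda p: sum(n != "neutral" for n in p.values()))
print(best)
```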
Click here to view paper screenshots



Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Authors:Tobias Raichle, Niels Edinger, Bin Yang
Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen (latent denoising), the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
Paper and Project Links
PDF: This work has been submitted to the IEEE for possible publication
Summary
Deep learning speech-enhancement models perform remarkably when test distributions match training conditions, but often degrade under domain shift in unpredictable real-world environments. LaDen (latent denoising) is the first test-time adaptation method designed specifically for speech enhancement. It leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. The transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data; the pseudo-labels then drive test-time adaptation of speech-enhancement models across diverse acoustic environments. The paper also proposes a comprehensive benchmark spanning multiple datasets with varied domain shifts, including changes in noise type, speaker characteristics, and language. Extensive experiments show LaDen consistently outperforms baseline methods on perceptual metrics, particularly under speaker and language shifts.
Key Takeaways
- LaDen is a test-time adaptation method for speech-enhancement models.
- It performs latent denoising on powerful pre-trained speech representations.
- Clean speech representations are approximated via a linear transformation of noisy embeddings.
- The transformation generalizes well across domains and yields effective pseudo-labels.
- LaDen adapts speech-enhancement models across diverse acoustic environments.
- A comprehensive multi-dataset benchmark evaluates performance under varied domain shifts.
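The latent-denoising step can be sketched as a least-squares linear map from noisy to clean embeddings; the shapes and the fitting procedure below are illustrative assumptions, not the paper's exact training setup.

```python
import numpy as np

def fit_latent_denoiser(noisy_emb: np.ndarray, clean_emb: np.ndarray) -> np.ndarray:
    """noisy_emb, clean_emb: (N, D) paired source-domain embeddings.
    Returns W minimizing ||noisy_emb @ W - clean_emb||^2."""
    W, *_ = np.linalg.lstsq(noisy_emb, clean_emb, rcond=None)
    return W

rng = np.random.default_rng(0)
W_true = rng.normal(size=(64, 64))
noisy = rng.normal(size=(1000, 64))
clean = noisy @ W_true                 # synthetic paired data for the demo
W = fit_latent_denoiser(noisy, clean)

# On the target domain, W approximates clean representations from noisy
# embeddings; these serve as pseudo-labels for test-time adaptation.
target_noisy = rng.normal(size=(10, 64))
pseudo_clean = target_noisy @ W
print(np.allclose(pseudo_clean, target_noisy @ W_true, atol=1e-6))
```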
Click here to view paper screenshots







Real Time FPGA Based Transformers & VLMs for Vision Tasks: SOTA Designs and Optimizations
Authors:Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed
Transformers and vision-language models (VLMs) have emerged as dominant architectures in computer vision and multimodal AI, offering state-of-the-art performance in tasks such as image classification, object detection, visual question answering, and caption generation. However, their high computational complexity, large memory footprints, and irregular data access patterns present significant challenges for deployment in latency- and power-constrained environments. Field-programmable gate arrays (FPGAs) provide an attractive hardware platform for such workloads due to their reconfigurability, fine-grained parallelism, and potential for energy-efficient acceleration. This paper presents a comprehensive review of design trade-offs, optimization strategies, and implementation challenges for FPGA-based inference of transformers and VLMs. We examine critical factors such as device-class selection, memory subsystem constraints, dataflow orchestration, quantization strategies, sparsity exploitation, and toolchain choices, alongside modality-specific issues unique to VLMs, including heterogeneous compute balancing and cross-attention memory management. Additionally, we discuss emerging trends in hardware-algorithm co-design, highlighting innovations in attention mechanisms, compression, and modular overlays to improve efficiency and adaptability. Practical issues such as runtime flexibility, verification overhead, and the absence of standardized FPGA multimodal benchmarks are also considered. Finally, we outline future directions toward scalable, portable, and reconfigurable FPGA solutions that adapt to evolving model architectures while sustaining high utilization and predictable performance. This synthesis offers both a technical foundation and a forward-looking perspective to help bridge the gap between advanced multimodal AI models and efficient FPGA deployment.
Paper and Project Links
Summary
This paper reviews the challenges of and strategies for deploying Transformers and vision-language models (VLMs) on FPGAs. It surveys design trade-offs, optimization strategies, and implementation challenges for these workloads, including device-class selection, memory-subsystem constraints, and dataflow orchestration. It also discusses emerging trends in hardware-algorithm co-design, highlighting innovations in attention mechanisms, compression, and modular overlays, and closes with future directions toward FPGA solutions that adapt to evolving model architectures while sustaining high utilization and predictable performance.
Key Takeaways
- Transformers and VLMs are now dominant architectures in computer vision and multimodal AI, delivering state-of-the-art performance in tasks such as image classification and object detection.
- Their high computational complexity, large memory footprints, and irregular data-access patterns make deployment difficult in latency- and power-constrained environments.
- FPGAs are an attractive hardware platform for these workloads thanks to their reconfigurability, fine-grained parallelism, and potential for energy-efficient acceleration.
- The paper details design trade-offs and optimization strategies for Transformer and VLM inference on FPGAs, covering device selection, memory constraints, and dataflow management.
- It also surveys emerging trends in hardware-algorithm co-design, including innovations in attention mechanisms and model compression.
- Practical concerns include runtime flexibility, verification overhead, and the absence of standardized FPGA multimodal benchmarks.
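As one concrete example of the quantization strategies the survey covers, here is a generic symmetric int8 post-training quantization of a weight tensor; the per-tensor scale and rounding scheme are a textbook illustration, not tied to any specific FPGA toolchain.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.max(np.abs(w)) / 127.0          # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.5f}")  # small relative to |w|
```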
Click here to view paper screenshots





SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment
Authors:Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen
Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model’s original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model’s semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.
Paper and Project Links
Summary
Recent advances in LLMs have transformed natural language processing through remarkable abilities to understand and execute diverse tasks. Supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively boosts task-specific performance but often causes catastrophic forgetting, where the model loses previously acquired knowledge and general capabilities. SelfAug addresses this with self-distribution alignment: it aligns input-sequence logits to preserve the model's semantic distribution, mitigating catastrophic forgetting while improving downstream performance. Experiments show SelfAug strikes a superior balance between downstream learning and retention of general capabilities. A comprehensive empirical analysis reveals a direct correlation between distribution shift and the severity of catastrophic forgetting in RAG scenarios, and shows how the absence of RAG capabilities in general instruction tuning leads to significant distribution shift during fine-tuning. The findings deepen the understanding of catastrophic forgetting in RAG contexts and offer a practical solution for diverse fine-tuning scenarios.
Key Takeaways
- LLMs show strong abilities in natural language processing, especially in understanding and executing diverse tasks.
- Supervised fine-tuning boosts task-specific performance but can cause catastrophic forgetting of previously acquired knowledge.
- SelfAug counters catastrophic forgetting through self-distribution alignment, preserving the model's semantic distribution.
- Experiments confirm SelfAug balances downstream learning against retention of general capabilities.
- Empirical analysis links distribution shift to the severity of catastrophic forgetting in RAG scenarios.
- The absence of RAG capabilities in general instruction tuning leads to significant distribution shift during fine-tuning.
- The results both deepen the understanding of catastrophic forgetting in RAG contexts and provide a practical solution across fine-tuning scenarios.
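A hedged sketch of a self-distribution alignment loss follows: a KL term that keeps the fine-tuned model's logits on the input sequence close to a frozen reference model's. The temperature, loss weighting, and which tokens are aligned are assumptions; see the SelfAug repository for the actual recipe.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_logits, reference_logits, temperature=1.0):
    """Both tensors: (batch, seq_len, vocab). KL(reference || student)."""
    p_ref = F.softmax(reference_logits / temperature, dim=-1)
    log_p_stu = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_p_stu, p_ref, reduction="batchmean")

# total_loss = task_loss + lambda_align * alignment_loss(stu, ref)
stu = torch.randn(2, 16, 32000, requires_grad=True)  # fine-tuned model logits
ref = torch.randn(2, 16, 32000)                      # frozen reference logits
print(alignment_loss(stu, ref))
```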
Click here to view paper screenshots



KGBERT4Eth: A Feature-Complete Transformer Powered by Knowledge Graph for Multi-Task Ethereum Fraud Detection
Authors:Yifan Jia, Ye Tian, Liguo Zhang, Yanbin Wang, Jianguo Sun, Liangliang Song
Ethereum’s rapid ecosystem expansion and transaction anonymity have triggered a surge in malicious activity. Detection mechanisms currently bifurcate into three technical strands: expert-defined features, graph embeddings, and sequential transaction patterns, collectively spanning the complete feature sets of Ethereum’s native data layer. Yet the absence of cross-paradigm integration mechanisms forces practitioners to choose between sacrificing sequential context awareness, structured fund-flow patterns, or human-curated feature insights in their solutions. To bridge this gap, we propose KGBERT4Eth, a feature-complete pre-training encoder that synergistically combines two key components: (1) a Transaction Semantic Extractor, where we train an enhanced Transaction Language Model (TLM) to learn contextual semantic representations from conceptualized transaction records, and (2) a Transaction Knowledge Graph (TKG) that incorporates expert-curated domain knowledge into graph node embeddings to capture fund flow patterns and human-curated feature insights. We jointly optimize pre-training objectives for both components to fuse these complementary features, generating feature-complete embeddings. To emphasize rare anomalous transactions, we design a biased masking prediction task for TLM to focus on statistical outliers, while the Transaction TKG employs link prediction to learn latent transaction relationships and aggregate knowledge. Furthermore, we propose a mask-invariant attention coordination module to ensure stable dynamic information exchange between TLM and TKG during pre-training. KGBERT4Eth significantly outperforms state-of-the-art baselines in both phishing account detection and de-anonymization tasks, achieving absolute F1-score improvements of 8-16% on three phishing detection benchmarks and 6-26% on four de-anonymization datasets.
Paper and Project Links
PDF: This work has been submitted to the IEEE for possible publication
Summary
Ethereum's rapid ecosystem expansion and transaction anonymity have triggered a surge in malicious activity. Existing detection mechanisms split into expert-defined features, graph embeddings, and sequential transaction patterns, and the lack of cross-paradigm integration forces a choice among sequential context awareness, structured fund-flow patterns, and human-curated feature insights. KGBERT4Eth is a feature-complete pre-training encoder that combines two components: (1) a Transaction Semantic Extractor, where an enhanced Transaction Language Model (TLM) learns contextual semantic representations from conceptualized transaction records, and (2) a Transaction Knowledge Graph (TKG) that injects expert-curated domain knowledge into graph node embeddings to capture fund-flow patterns. The pre-training objectives of both components are jointly optimized to fuse these complementary features into feature-complete embeddings. A biased masking prediction task focuses the TLM on statistical outliers to emphasize rare anomalous transactions, the TKG uses link prediction to learn latent transaction relationships and aggregate knowledge, and a mask-invariant attention coordination module keeps dynamic information exchange between TLM and TKG stable during pre-training. KGBERT4Eth significantly outperforms state-of-the-art baselines, with absolute F1 gains of 8-16% on three phishing-detection benchmarks and 6-26% on four de-anonymization datasets.
Key Takeaways
- Ethereum's ecosystem expansion and transaction anonymity have driven a rise in malicious activity.
- Existing detection mechanisms rely on expert-defined features, graph embeddings, or sequential transaction patterns in isolation.
- KGBERT4Eth fills the integration gap by combining a Transaction Semantic Extractor with a Transaction Knowledge Graph.
- Joint pre-training fuses the two components' complementary features into feature-complete embeddings.
- A biased masking prediction task emphasizes anomalous transactions, while link prediction learns latent transaction relationships.
- KGBERT4Eth delivers significant gains in phishing-account detection and de-anonymization.
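Biased masking can be illustrated as sampling mask positions in proportion to an outlier score, so the model spends more of its prediction budget on rare, anomalous transactions; the z-score weighting below is an assumption, not the paper's formula.

```python
import numpy as np

def biased_mask_positions(values: np.ndarray, mask_ratio=0.15, rng=None):
    """values: (seq_len,) numeric features of a transaction sequence.
    Returns indices to mask, biased toward statistical outliers."""
    rng = rng or np.random.default_rng()
    z = np.abs((values - values.mean()) / (values.std() + 1e-8))
    probs = (z + 0.1) / (z + 0.1).sum()    # floor keeps every position possible
    k = max(1, int(mask_ratio * len(values)))
    return rng.choice(len(values), size=k, replace=False, p=probs)

amounts = np.array([1.0, 1.2, 0.9, 250.0, 1.1, 0.8, 1.3, 300.0])
print(biased_mask_positions(amounts))  # outlier positions 3 and 7 dominate
```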
Click here to view paper screenshots




The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Authors:Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.
Paper and Project Links
PDF: We make public all code and source data at https://github.com/psychology-of-AI/Personality-Illusion
Summary
Recent work suggests LLMs display consistent behavioral tendencies resembling human traits such as agreeableness and self-regulation. This study systematically characterizes LLM personality along three dimensions: the dynamic emergence and evolution of trait profiles across training stages; the predictive validity of self-reported traits in behavioral tasks; and the impact of targeted interventions such as persona injection on both self-reports and behavior. Instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. Persona injection steers self-reports in the intended direction but has little or inconsistent effect on actual behavior. By separating surface-level trait expression from behavioral consistency, the findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.
Key Takeaways
- LLMs display behavioral tendencies resembling human traits such as agreeableness and self-regulation.
- The dynamic evolution of trait profiles across training stages is a key dimension of LLM personality research.
- Self-reported traits have limited predictive validity in behavioral tasks, diverging from human patterns.
- Instructional alignment significantly stabilizes trait expression and strengthens trait correlations.
- Persona injection strongly shifts self-reports but has little or inconsistent effect on actual behavior.
- The study separates surface-level trait expression from behavioral consistency in LLMs.
Click here to view paper screenshots





Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Authors:Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
Paper and Project Links
PDF: This technical report serves as the archival version of our paper accepted at the ICCV 2025 Workshop. For more information, please visit our project website: https://strefer.github.io/
Summary
Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video LLMs manage coarse comprehension but struggle with fine-grained spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring or gestural cues for spatial anchoring of object references and positions. Strefer bridges this gap: a synthetic instruction-data generation framework that equips Video LLMs with spatiotemporal referring and reasoning capabilities. Its data engine pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in structured form, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Without proprietary models, costly human annotation, or large volumes of newly annotated videos, models trained on Strefer-produced data outperform baselines on tasks requiring spatial and temporal disambiguation and exhibit stronger space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
Key Takeaways
- Next-generation AI companions must resolve spatial and temporal references in dynamic, real-world environments.
- Existing Video LLMs struggle with fine-grained spatiotemporal reasoning.
- Strefer is a synthetic instruction-data generation framework that strengthens the spatiotemporal referring and reasoning abilities of Video LLMs.
- Its data engine pseudo-annotates temporally dense, fine-grained video metadata.
- Strefer improves how Video LLMs interpret spatial and temporal references, fostering more versatile space-time-aware reasoning.
- Models trained on Strefer-produced data outperform baselines on tasks requiring spatial and temporal disambiguation.
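A hedged sketch of the kind of structured, temporally dense record the data engine might produce is shown below; all field names are illustrative, not the framework's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Masklet:
    frame_index: int
    polygon: List[Tuple[float, float]]   # object location in that frame

@dataclass
class EntityTrack:
    name: str                            # e.g. "person in red shirt"
    masklets: List[Masklet] = field(default_factory=list)

@dataclass
class ActionEvent:
    subject: str
    action: str                          # e.g. "picks up the mug"
    start_s: float
    end_s: float

@dataclass
class VideoMetadata:
    video_id: str
    entities: List[EntityTrack]          # subjects/objects with locations
    timeline: List[ActionEvent]          # dense action descriptions over time

# Instruction-tuning pairs are then generated from such records, e.g.
# "What does <subject> do right after <event>?" grounded in the timeline.
```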
Click here to view paper screenshots




Continuous Saudi Sign Language Recognition: A Vision Transformer Approach
Authors:Soukeina Elhassen, Lama Al Khuzayem, Areej Alhothali, Ohoud Alzamzami, Nahed Alowaidi
Sign language (SL) is an essential communication form for hearing-impaired and deaf people, enabling engagement within the broader society. Despite its significance, limited public awareness of SL often leads to inequitable access to educational and professional opportunities, thereby contributing to social exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent requirement for more precise and dependable translation techniques, especially for Arabic sign language variants like SSL. Most state-of-the-art solutions have primarily focused on non-Arabic sign languages, resulting in a considerable absence of resources dedicated to Arabic sign language, specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous speech contribute to this issue. To address this gap, our research represents an important step in developing SSL resources. To address this, we introduce the first continuous Saudi Sign Language dataset called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation. Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy at signer dependent mode and 77.71% accuracy at signer independent mode. This development leads the way to not only improving communication tools for the SSL community but also making a substantial contribution to the wider field of sign language.
Paper and Project Links
PDF: 23 pages, 13 figures, 5 tables
Summary
Sign language is an essential form of communication for deaf and hearing-impaired people, enabling engagement with broader society. Limited public awareness of sign language often leads to inequitable access to education and employment and contributes to social exclusion, particularly in Saudi Arabia, where more than 84,000 individuals rely on Saudi Sign Language (SSL) as their primary means of communication. Although technology has improved communication for people with hearing impairments, more precise and dependable translation techniques are urgently needed, especially for Arabic sign-language variants such as SSL; most state-of-the-art solutions target non-Arabic sign languages, leaving Arabic resources scarce, and existing datasets mostly cover isolated words rather than continuous signing. The authors introduce KAU-CSSL, the first continuous Saudi Sign Language dataset, focused on complete sentences to enable research on SSL recognition and translation. They also propose a transformer-based model that uses a pretrained ResNet-18 for spatial feature extraction and a Transformer encoder with a bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy in signer-dependent mode and 77.71% in signer-independent mode. This development improves communication tools for the SSL community and contributes to the broader field of sign language research.
Key Takeaways
- Sign language is the key medium through which deaf and hearing-impaired people engage with society.
- Limited public awareness of sign language fuels social exclusion, notably in Saudi Arabia.
- Current technology and resources focus mainly on non-Arabic sign languages, leaving Arabic variants such as SSL underserved.
- KAU-CSSL is the first continuous Saudi Sign Language dataset, centered on complete sentences for recognition and translation research.
- A transformer-based model is proposed to improve the accuracy of SSL recognition and translation.
- The model combines spatial feature extraction with temporal modeling to reach high recognition accuracy.
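A minimal sketch of the described architecture follows: a ResNet-18 per-frame spatial encoder, a bidirectional LSTM, and a Transformer encoder for temporal dependencies. Layer sizes and the ordering of the temporal modules are assumptions based on the abstract.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ContinuousSLRecognizer(nn.Module):
    def __init__(self, num_classes, d_model=512):
        super().__init__()
        backbone = resnet18(weights=None)   # the paper uses a pretrained backbone
        backbone.fc = nn.Identity()         # 512-d feature per frame
        self.spatial = backbone
        self.bilstm = nn.LSTM(512, d_model // 2, batch_first=True,
                              bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, frames):              # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.spatial(frames.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.bilstm(feats)       # temporal dependencies, both directions
        feats = self.temporal(feats)
        return self.head(feats.mean(dim=1)) # sentence-level logits

model = ContinuousSLRecognizer(num_classes=80)
print(model(torch.randn(2, 16, 3, 224, 224)).shape)  # torch.Size([2, 80])
```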
Click here to view paper screenshots






SESGO: Spanish Evaluation of Stereotypical Generative Outputs
Authors:Melissa Robles, Catalina Bernal, Denniss Raigoso, Mateo Dulce Rubio
This paper addresses the critical gap in evaluating bias in multilingual Large Language Models (LLMs), with a specific focus on Spanish language within culturally-aware Latin American contexts. Despite widespread global deployment, current evaluations remain predominantly US-English-centric, leaving potential harms in other linguistic and cultural contexts largely underexamined. We introduce a novel, culturally-grounded framework for detecting social biases in instruction-tuned LLMs. Our approach adapts the underspecified question methodology from the BBQ dataset by incorporating culturally-specific expressions and sayings that encode regional stereotypes across four social categories: gender, race, socioeconomic class, and national origin. Using more than 4,000 prompts, we propose a new metric that combines accuracy with the direction of error to effectively balance model performance and bias alignment in both ambiguous and disambiguated contexts. To our knowledge, our work presents the first systematic evaluation examining how leading commercial LLMs respond to culturally specific bias in the Spanish language, revealing varying patterns of bias manifestation across state-of-the-art models. We also contribute evidence that bias mitigation techniques optimized for English do not effectively transfer to Spanish tasks, and that bias patterns remain largely consistent across different sampling temperatures. Our modular framework offers a natural extension to new stereotypes, bias categories, or languages and cultural contexts, representing a significant step toward more equitable and culturally-aware evaluation of AI systems in the diverse linguistic environments where they operate.
Paper and Project Links
Summary
This paper addresses the gap in evaluating bias in multilingual LLMs, focusing on Spanish within culturally aware Latin American contexts. Current evaluations remain predominantly US-English-centric, leaving potential harms in other linguistic and cultural settings underexamined. The authors introduce a culturally grounded framework for detecting social bias in instruction-tuned LLMs, adapting the underspecified-question methodology of the BBQ dataset with culturally specific expressions and sayings that encode regional stereotypes across gender, race, socioeconomic class, and national origin. Using more than 4,000 prompts, they propose a new metric that combines accuracy with the direction of error to balance model performance against bias alignment in both ambiguous and disambiguated contexts. This is the first systematic evaluation of how leading commercial LLMs respond to culturally specific bias in Spanish, revealing varying patterns of bias across state-of-the-art models. The work also shows that bias-mitigation techniques optimized for English do not transfer effectively to Spanish tasks, and that bias patterns remain largely consistent across sampling temperatures. The modular framework extends naturally to new stereotypes, bias categories, languages, and cultural contexts, a significant step toward more equitable, culturally aware evaluation of AI systems.
Key Takeaways
- Bias evaluation for multilingual LLMs is lacking and needs approaches adapted to different linguistic and cultural contexts.
- Spanish deserves attention because it represents the language and culture of many regions worldwide.
- The proposed culturally grounded framework detects social bias in LLMs using culturally specific expressions and sayings.
- The first systematic evaluation of leading commercial LLMs on culturally specific bias in Spanish reveals differences in how models handle bias.
- Bias-mitigation techniques tuned for English do not transfer to Spanish tasks, underscoring the importance of cultural adaptation in AI models.
- Bias patterns stay largely consistent across sampling temperatures, so temperature sampling does not resolve bias in language models.
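A hedged sketch of a metric that combines accuracy with the direction of error, in the spirit of BBQ-style bias scores, is given below; the exact formula in the paper may differ, so treat the weighting as an assumption.

```python
def bias_aware_score(answers):
    """answers: list of (correct: bool, stereotype_aligned: bool) per prompt.
    Accuracy rewards correct answers; among errors, the signed rate of
    stereotype-aligned mistakes measures the direction of bias."""
    n = len(answers)
    acc = sum(c for c, _ in answers) / n
    errors = [(c, s) for c, s in answers if not c]
    if errors:
        aligned = sum(s for _, s in errors) / len(errors)
        direction = 2 * aligned - 1     # -1 (counter-stereotype) .. +1 (stereotype)
    else:
        direction = 0.0
    return acc, direction

# Half the answers are wrong, and all errors follow the stereotype:
print(bias_aware_score([(True, False), (False, True), (False, True), (True, False)]))
```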
Click here to view paper screenshots




AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
Authors:Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
Paper and Project Links
Summary
LLM-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents, but that very sophistication amplifies their fragility and proneness to system failure. Pinpointing the agent or step responsible for an error within long execution traces defines the task of agentic failure attribution, for which current state-of-the-art reasoning LLMs remain strikingly inadequate, with accuracy generally below 10%. AgenTracer is the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, yielding the curated dataset TracerTraj. On this resource the authors train AgenTracer-8B, a lightweight failure tracer built with multi-granular reinforcement learning that efficiently diagnoses errors in verbose multi-agent interactions. On the Who&When benchmark it outperforms giant proprietary LLMs such as Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, and it delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, enabling self-correcting, self-evolving agentic AI.
Key Takeaways
- LLM-based agentic systems perform strongly but are fragile and prone to system failure.
- Current state-of-the-art LLMs handle agentic failure attribution poorly, with accuracy generally below 10%.
- AgenTracer is the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection.
- The resulting TracerTraj dataset trains a failure tracer that efficiently diagnoses errors in verbose multi-agent interactions.
- AgenTracer-8B outperforms giant proprietary LLMs on multi-agent failure attribution.
- AgenTracer-8B provides actionable feedback to existing multi-agent systems, fostering self-correcting, self-evolving agentic AI.
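Counterfactual replay can be sketched conceptually as below: re-run a failed trajectory with one step swapped for a corrected action and check whether the failure disappears. `run_from`, `corrected_action`, and the trajectory format are hypothetical stand-ins for the framework's internals.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (agent_name, action)

def attribute_failure(trajectory: List[Step],
                      corrected_action: Callable[[Step], str],
                      run_from: Callable[[List[Step]], bool]) -> List[int]:
    """Return indices of steps whose correction flips failure into success."""
    culprits = []
    for i, (agent, action) in enumerate(trajectory):
        patched = (trajectory[:i]
                   + [(agent, corrected_action((agent, action)))]
                   + trajectory[i + 1:])
        if run_from(patched):            # replay succeeds with this one patch
            culprits.append(i)           # step i (and its agent) is implicated
    return culprits

# Toy demo: success means every action is "ok".
traj = [("planner", "ok"), ("coder", "buggy"), ("tester", "ok")]
print(attribute_failure(traj,
                        corrected_action=lambda s: "ok",
                        run_from=lambda t: all(a == "ok" for _, a in t)))  # [1]
```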
Click here to view paper screenshots


Plan Verification for LLM-Based Embodied Task Completion Agents
Authors:Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
Paper and Project Links
Summary
LLM-generated task plans and the corresponding human demonstrations for embodied AI can be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. The authors propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner, more spatially coherent trajectories. Unlike rule-based approaches, the method relies on natural-language prompting, generalizing across error types including irrelevant actions, contradictions, and missing steps. On manually annotated actions from the TEACh embodied-AI dataset, the framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior and offering a scalable path to higher-quality training data for imitation learning in embodied AI.
Key Takeaways
- LLMs can detect and critique noisy action sequences in embodied AI task plans.
- An iterative verification framework with Judge LLM and Planner LLM is proposed to improve trajectory quality.
- The framework achieves high recall and precision on manually annotated actions from the TEACh dataset.
- The refinement loop converges quickly, with most sequences requiring only a few iterations.
- The method preserves human error-recovery patterns, supporting future work on robust corrective behavior.
- The framework establishes plan verification as a reliable LLM capability for spatial planning and action refinement.
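A minimal sketch of the Judge/Planner refinement loop follows; `judge` and `planner` stand in for prompted LLM calls (any chat API), and the three-iteration cap mirrors the reported convergence behavior.

```python
from typing import Callable, List

def refine_plan(actions: List[str],
                judge: Callable[[List[str]], str],
                planner: Callable[[List[str], str], List[str]],
                max_iters: int = 3) -> List[str]:
    for _ in range(max_iters):
        critique = judge(actions)             # natural-language critique
        if critique.strip().upper() == "OK":
            break                             # judge found no remaining issues
        actions = planner(actions, critique)  # apply the revisions
    return actions

# Toy stand-ins: the judge flags duplicate actions and the planner removes
# them; real judges/planners are prompted LLMs.
toy_judge = lambda a: "OK" if len(a) == len(dict.fromkeys(a)) else "remove duplicates"
toy_planner = lambda a, c: list(dict.fromkeys(a))
print(refine_plan(["goto sink", "goto sink", "pick mug"], toy_judge, toy_planner))
```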
Click here to view paper screenshots




Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning
Authors:Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan
Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicate works and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.
Paper and Project Links
PDF: EMNLP 2025, Main, Long Paper
Summary
LLM-enhanced knowledge graph completion (KGC) methods focus on designing task-specific instructions and have made notable progress, but they ignore the inconsistent representation spaces of natural language and graph structure and design separate instructions for different KGC tasks, causing duplicated, time-consuming work. SAT enhances LLMs for KGC via structure-aware alignment-tuning: hierarchical knowledge alignment aligns graph embeddings with the natural-language space through multi-task contrastive learning, and structural instruction tuning guides the LLM to perform structure-aware reasoning over knowledge graphs using a unified graph instruction and a lightweight knowledge adapter. Experiments on two KGC tasks across four benchmark datasets show SAT significantly outperforms state-of-the-art methods, with link-prediction gains ranging from 8.7% to 29.8%.
Key Takeaways
- Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs.
- Large language models exhibit strong reasoning capabilities for KGC.
- Existing LLM-enhanced KGC methods focus on task-specific instructions but face two key challenges: inconsistent representation spaces and duplicated, time-consuming workflows.
- SAT addresses both challenges by enhancing LLMs for KGC through structure-aware alignment-tuning.
- Hierarchical knowledge alignment aligns graph embeddings with the natural-language space via multi-task contrastive learning.
- Structural instruction tuning guides the LLM toward structure-aware reasoning with a unified graph instruction and a lightweight knowledge adapter.
- Experiments show SAT significantly outperforms existing methods on KGC tasks across multiple benchmark datasets.
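The alignment step can be sketched as an InfoNCE-style contrastive loss between paired graph and text embeddings; the temperature and symmetric form below are illustrative choices, not SAT's reported configuration.

```python
import torch
import torch.nn.functional as F

def alignment_contrastive_loss(graph_emb, text_emb, temperature=0.07):
    """graph_emb, text_emb: (B, D) paired embeddings of the same entities.
    Each graph embedding should match its own text description rather than
    the other B-1 descriptions in the batch."""
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature               # (B, B) similarity matrix
    targets = torch.arange(g.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2  # symmetric in both directions

print(alignment_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```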
Click here to view paper screenshots






SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers
Authors:Aref Jafari, Yuhe Fan, Benyamin Jamialahmadi, Parsa Farinneya, Boxing Chen, Marzieh S. Tahaei
Transformers have demonstrated strong performance across a wide range of sequence modeling tasks, but their quadratic attention complexity limits scalability to long sequences. Linear models such as Mamba and sliding-window attention (SWA) address this by mixing tokens through recurrent or localized operations with fixed-size memory, achieving efficient inference. However, these methods risk degrading performance on long sequences due to their inability to retain detailed information from distant tokens. We propose SCOUT (Segment Compression for Optimized Utility in Transformers), a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. Each token embedding is first enriched via a linear local mixer, Mamba or SWA, that integrates recent context. Then, instead of attending to all previous tokens, each token sparsely attends to a small number of compressed checkpoint tokens that summarize the input history. This design retains much of the expressivity of full attention while substantially reducing the computational and memory cost. By attending to compressed history rather than all previous tokens, SCOUT incurs slightly higher memory than purely linear models, but its growth rate remains sub-quadratic and far more scalable than that of full Transformers. We analyze SCOUT’s computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks. SCOUT with both Mamba and SWA mixers outperforms strong long-sequence baselines under the same computational budget, matches full-attention Transformers on language modeling and common-sense reasoning tasks at 400M and 1.3B scales. Moreover, our SCOUT achieves higher end-to-end throughput than SOTA models, while delivering comparable results on long sequence benchmarks.
Paper and Project Links
Summary
SCOUT is a hybrid architecture that tackles the Transformer's quadratic attention bottleneck on long sequences: tokens are compressed locally within fixed-size segments and attention is applied only over the compressed representations, retaining much of full attention's expressivity while sharply cutting compute and memory costs. Its memory use is slightly higher than purely linear models, but the growth rate remains sub-quadratic and far more scalable than a full Transformer's. On long-context language modeling and reasoning tasks, SCOUT with both Mamba and SWA mixers outperforms strong long-sequence baselines under the same compute budget, matches full-attention Transformers at the 400M and 1.3B scales, and achieves higher end-to-end throughput than state-of-the-art models with comparable results on long-sequence benchmarks.
Key Takeaways
- Transformers hit a performance bottleneck on long sequences because of their quadratic attention complexity.
- Linear models such as Mamba and SWA achieve efficient inference via recurrent or localized operations with fixed-size memory, but may sacrifice long-sequence performance.
- SCOUT addresses this by compressing tokens locally within fixed-size segments and applying attention over the compressed representations.
- Each token embedding is enriched by a linear local mixer and then sparsely attends to a small number of compressed checkpoint tokens that summarize the input history.
- SCOUT uses slightly more memory than linear models, but its growth rate stays sub-quadratic and far more scalable than a full Transformer.
- On long-context language modeling and reasoning tasks, SCOUT performs strongly and matches full-attention Transformers.
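A conceptual sketch of segment compression with attention over checkpoint tokens follows; mean pooling stands in for the learned compression, the local mixer (Mamba/SWA) is omitted, and causal masking is skipped for brevity.

```python
import torch
import torch.nn as nn

class SegmentCompressedAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, segment=64):
        super().__init__()
        self.segment = segment
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                              # x: (B, T, D)
        b, t, d = x.shape
        pad = (-t) % self.segment
        xp = nn.functional.pad(x, (0, 0, 0, pad))
        # One checkpoint token per segment summarizes the input history.
        checkpoints = xp.view(b, -1, self.segment, d).mean(dim=2)  # (B, T/S, D)
        out, _ = self.attn(query=x, key=checkpoints, value=checkpoints)
        return out  # cost is O(T * T/S) instead of O(T^2)

x = torch.randn(2, 1000, 256)
print(SegmentCompressedAttention()(x).shape)  # torch.Size([2, 1000, 256])
```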
Click here to view paper screenshots




Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Authors:Shubham Sharma, Sneha Tuli, Narendra Badam
Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI’s closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.
Paper and Project Links
PDF: 18 pages, 7 figures
Summary
LLMs are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how they are addressed by two state-of-the-art models with distinct approaches: OpenAI's closed-source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open-source Mixture-of-Experts model. The article aims to help AI researchers, developers, and decision-makers understand current LLM capabilities, limitations, and best practices.
Key Takeaways
- LLMs are driving an AI transformation across industries, but building and deploying them is complex.
- The survey reviews 16 key challenges in building and using LLMs.
- It contrasts a closed-source model (GPT-4o) with an open-source model (DeepSeek-V3-0324), weighing their trade-offs.
- Closed-source models offer robust safety and fine-tuned reliability, while open-source models offer efficiency and adaptability.
- LLM applications span many domains, from chatbots and coding tools to healthcare and education.
- The article stresses matching model attributes to each use case.
Click here to view paper screenshots





Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation
Authors:Krit Duangprom, Tryphon Lambrou, Binod Bhattarai
This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low rank adjusting (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
Paper and Project Links
PDF: Accepted to MICCAI 2025
Summary
This paper proposes a novel pipeline for 2D keypoint estimation of surgical tools using Vision Language Models (VLMs) fine-tuned with the LoRA technique. Unlike traditional CNN or Transformer-based approaches, which tend to overfit on small-scale medical datasets, the method exploits the generalization ability of pre-trained VLMs. Carefully designed prompts build an instruction-tuning dataset that aligns visual features with semantic keypoint descriptions. With only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating LoRA's effectiveness in low-resource scenarios. The approach not only improves keypoint detection but also paves the way for future work on 3D pose estimation of surgical hands and tools.
Key Takeaways
- Vision Language Models (VLMs) are applied to 2D keypoint detection of surgical tools.
- A novel pipeline is proposed for the keypoint-detection task.
- The pretrained model is adapted with LoRA to preserve its generalization ability.
- A carefully designed instruction-tuning dataset aligns visual features with semantic keypoint descriptions.
- After only two epochs of fine-tuning, the adapted VLM outperforms the baseline models.
- The method improves keypoint-detection accuracy.
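A minimal sketch of a LoRA linear layer follows: the frozen pretrained weight is adapted by a trainable low-rank update B·A, so only r·(in+out) parameters are trained. The rank, scaling, and which layers receive adapters are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                       # B starts at zero: no-op init

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)              # torch.Size([4, 768])
```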
Click here to view paper screenshots




