Published: 2025-09-19

Updated: 2025-10-07

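⚠️ All of the summaries below were generated by a large language model and may contain errors. Treat them as a reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for initial screening before reading a paper.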

2025-09-19 Update

White Aggregation and Restoration for Few-shot 3D Point Cloud Semantic Segmentation

Authors:Jiyun Im, SuBeen Lee, Miso Lee, Jae-Pil Heo

Few-Shot 3D Point Cloud Segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited support set, existing methods have constructed prototypes using conventional algorithms such as farthest point sampling. However, we point out that its initial randomness significantly affects FS-PCS performance and that the prototype generation process remains underexplored despite its prevalence. This motivates us to investigate an advanced prototype generation method based on attention mechanism. Despite its potential, we found that vanilla module suffers from the distributional gap between learnable prototypical tokens and support features. To overcome this, we propose White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross-attention between whitening and coloring transformations. Specifically, whitening aligns the support features to prototypical tokens before attention process, and subsequently coloring restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating representative prototypes by capturing the semantic relationships among support features. Our method achieves state-of-the-art performance with a significant margin on multiple FS-PCS benchmarks, demonstrating its effectiveness through extensive experiments.

Paper and project links

PDF 9 pages, 5 figures

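Summary

Few-shot 3D point cloud segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud from only a few labeled examples. Existing methods build prototypes with conventional algorithms such as farthest point sampling, but the initial randomness of that sampling strongly affects performance, and the prototype generation process itself remains underexplored. The authors therefore investigate an attention-based prototype generation method. Despite its potential, vanilla cross-attention suffers from a distributional gap between the learnable prototypical tokens and the support features. To close this gap, they propose the White Aggregation and Restoration Module (WARM), which sandwiches cross-attention between whitening and coloring transformations. Whitening aligns the support features with the prototypical tokens before attention, and coloring then restores the original distribution to the attended tokens. This simple design yields robust attention and representative prototypes that capture the semantic relationships among support features. The method achieves state-of-the-art performance by a significant margin on multiple FS-PCS benchmarks.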

Key Takeaways

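FS-PCS aims to predict per-point labels for an unlabeled point cloud using only a few labeled examples.
Existing methods build prototypes with conventional algorithms, but their initial randomness hurts performance and the prototype generation process is underexplored.
Attention-based prototype generation is promising, but it must overcome the distributional gap between learnable prototypical tokens and support features.
The proposed White Aggregation and Restoration Module (WARM) resolves this misalignment by sandwiching cross-attention between whitening and coloring transformations, improving prototype quality.
In WARM, whitening aligns the support features with the prototypical tokens, and coloring then restores the original distribution, which makes the attention robust.
The method generates representative prototypes by capturing the semantic relationships among support features.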
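
The whitening, cross-attention, coloring sandwich described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the ZCA-style whitening, the single-head attention without learned projections, the function names, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def whitening_stats(x, eps=1e-5):
    """Return the mean, whitening matrix, and coloring matrix for features x of shape (N, C)."""
    mean = x.mean(axis=0, keepdims=True)
    centered = x - mean
    cov = centered.T @ centered / (x.shape[0] - 1) + eps * np.eye(x.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov)
    whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # cov^(-1/2)
    color = eigvecs @ np.diag(eigvals ** 0.5) @ eigvecs.T    # cov^(+1/2)
    return mean, whiten, color

def cross_attention(queries, keys, values):
    """Plain scaled dot-product attention (assumed form, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

def warm_like_prototypes(support, tokens):
    mean, whiten, color = whitening_stats(support)
    white = (support - mean) @ whiten                # whitening: zero mean, ~identity covariance
    attended = cross_attention(tokens, white, white)
    return attended @ color + mean                   # coloring: restore the support statistics

rng = np.random.default_rng(0)
support = rng.normal(loc=3.0, scale=2.0, size=(200, 8))  # toy support features
tokens = rng.normal(size=(4, 8))                         # stand-in for learnable prototype tokens
prototypes = warm_like_prototypes(support, tokens)
print(prototypes.shape)
```

In this sketch the tokens attend over features that share their rough scale (zero mean, unit covariance), and the coloring step maps the attended result back into the support feature space.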

Are Prompts All You Need? Evaluating Prompt-Based Large Language Models (LLM)s for Software Requirements Classification

Authors:Manal Binkhonain, Reem Alfayaz

Requirements classification assigns natural language requirements to predefined classes, such as functional and non functional. Accurate classification reduces risk and improves software quality. Most existing models rely on supervised learning, which needs large labeled data that are costly, slow to create, and domain dependent; they also generalize poorly and often require retraining for each task. This study tests whether prompt based large language models can reduce data needs. We benchmark several models and prompting styles (zero shot, few shot, persona, and chain of thought) across multiple tasks on two English datasets, PROMISE and SecReq. For each task we compare model prompt configurations and then compare the best LLM setups with a strong fine tuned transformer baseline. Results show that prompt based LLMs, especially with few shot prompts, can match or exceed the baseline. Adding a persona, or persona plus chain of thought, can yield further gains. We conclude that prompt based LLMs are a practical and scalable option that reduces dependence on large annotations and can improve generalizability across tasks.

Paper and project links

PDF 33 pages, 12 figures

Summary

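Requirements classification assigns natural-language requirements to predefined classes such as functional and non-functional. Accurate classification reduces risk and improves software quality. Most existing models rely on supervised learning, which needs large amounts of labeled data that are costly, slow to create, and domain dependent; these models also generalize poorly and often need retraining for each task. This study tests whether prompt-based large language models (LLMs) can reduce data needs. It benchmarks several models and prompting styles (zero-shot, few-shot, persona, and chain-of-thought) across multiple tasks on two English datasets, PROMISE and SecReq. For each task, the model-prompt configurations are compared, and the best LLM setups are then compared with a strong fine-tuned transformer baseline. The results show that prompt-based LLMs, especially with few-shot prompts, can match or exceed the baseline, and that adding a persona, or a persona plus chain of thought, can yield further gains. The authors conclude that prompt-based LLMs are a practical and scalable option that reduces dependence on large annotated datasets and can improve generalizability across tasks.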

Key Takeaways

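Accurate requirements classification reduces risk and improves software quality.
Existing models rely on supervised learning with large labeled datasets, which are costly, slow to create, and domain dependent.
Prompt-based LLMs can reduce data needs.
Few-shot prompting performs well for LLM-based classification and can match or exceed a fine-tuned transformer baseline.
Adding a persona, or a persona plus chain of thought, can further improve performance.
Prompt-based LLMs are a practical and scalable option that reduces dependence on large annotated datasets.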
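
The four prompting styles named above (zero-shot, few-shot, persona, chain-of-thought) can be sketched as one prompt builder. The prompt wording, the label set, the example requirements, and the `build_prompt` helper are hypothetical and are not taken from the paper; the sketch only shows how the styles compose.

```python
from typing import List, Optional, Tuple

LABELS = ("functional", "non-functional")  # assumed label set for illustration

def build_prompt(requirement: str,
                 examples: Optional[List[Tuple[str, str]]] = None,
                 persona: Optional[str] = None,
                 chain_of_thought: bool = False) -> str:
    """Compose a classification prompt; each optional argument adds one prompting style."""
    parts = []
    if persona:  # persona prompting
        parts.append(f"You are {persona}.")
    parts.append("Classify the software requirement as one of: " + ", ".join(LABELS) + ".")
    for text, label in examples or []:  # few-shot demonstrations
        parts.append(f"Requirement: {text}\nLabel: {label}")
    if chain_of_thought:  # chain-of-thought instruction
        parts.append("Reason step by step, then give the final label on its own line.")
    cue = "Reasoning:" if chain_of_thought else "Label:"
    parts.append(f"Requirement: {requirement}\n{cue}")
    return "\n\n".join(parts)

zero_shot = build_prompt("The system shall export reports as PDF.")
few_shot_persona_cot = build_prompt(
    "The system shall respond within 2 seconds.",
    examples=[("Users can reset their password.", "functional"),
              ("The UI must be available 99.9% of the time.", "non-functional")],
    persona="an experienced requirements engineer",
    chain_of_thought=True,
)
print(few_shot_persona_cot)
```

Leaving all optional arguments unset gives the zero-shot prompt, so the same function covers every configuration compared in a benchmark of this kind.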

Exploring Data and Parameter Efficient Strategies for Arabic Dialect Identifications

Authors:Vani Kanjirangat, Ljiljana Dolamic, Fabio Rinaldi

This paper discusses our exploration of different data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI). In particular, we investigate various soft-prompting strategies, including prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, as well as LoRA reparameterizations. For the data-efficient strategy, we analyze hard prompting with zero-shot and few-shot inferences to analyze the dialect identification capabilities of Large Language Models (LLMs). For the parameter-efficient PEFT approaches, we conducted our experiments using Arabic-specific encoder models on several major datasets. We also analyzed the n-shot inferences on open-source decoder-only models, a general multilingual model (Phi-3.5), and an Arabic-specific one (SILMA). We observed that the LLMs generally struggle to differentiate the dialectal nuances in the few-shot or zero-shot setups. The soft-prompted encoder variants perform better, while the LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

Paper and project links

PDF 4 main pages, 4 additional, 5 figures

Summary

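This paper explores data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI). It investigates several soft-prompting strategies (prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2) as well as LoRA reparameterizations. For the data-efficient strategy, it analyzes hard prompting with zero-shot and few-shot inference to assess the dialect identification capabilities of large language models (LLMs). For the parameter-efficient fine-tuning (PEFT) approaches, experiments use Arabic-specific encoder models on several major datasets. The n-shot inference analysis covers open-source decoder-only models, including a general multilingual model (Phi-3.5) and an Arabic-specific one (SILMA). LLMs generally struggle to differentiate dialectal nuances in few-shot or zero-shot setups; the soft-prompted encoder variants perform better, and the LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.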

Key Takeaways

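The paper explores data-efficient and parameter-efficient approaches to Arabic dialect identification.
It studies several soft-prompting strategies, including prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, alongside LoRA.
For the data-efficient strategy, it analyzes the dialect identification ability of LLMs under zero-shot and few-shot hard prompting.
For the PEFT approaches, experiments use Arabic-specific encoder models on several major datasets.
The n-shot inference analysis covers open-source decoder-only models, including a general multilingual model (Phi-3.5) and an Arabic-specific one (SILMA).
LLMs struggle to distinguish dialectal nuances in few-shot and zero-shot setups.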
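
For readers unfamiliar with the LoRA reparameterization mentioned above, the following is a minimal NumPy sketch of the general idea: a frozen weight matrix plus a trainable low-rank update. The shapes, rank, scaling factor, and function name are illustrative assumptions and do not reflect the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 6, 8, 2, 4.0  # toy sizes, assumed for illustration

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha, rank):
    """y = x W^T + (alpha / rank) * x A^T B^T; only A and B would be trained."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))
y0 = lora_forward(x, W, A, B, alpha, rank)     # equals the frozen layer while B is zero

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)
```

Because the update is the product of two thin matrices, the number of trainable values grows with the rank rather than with the full weight size, which is what makes the approach parameter-efficient.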

Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs

Authors:Andrea Piergentili, Beatrice Savoldi, Matteo Negri, Luisa Bentivogli

Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM’s performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.

Paper and project links

PDF Accepted at CLiC-it 2025

Summary

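This work presents the first systematic evaluation of state-of-the-art large language models (LLMs) for gender-neutral rewriting (GNR) in Italian. It introduces a two-dimensional framework that measures both neutrality and semantic fidelity to the input. The authors compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Open-weight LLMs outperform the only existing model dedicated to Italian GNR, while the fine-tuned models match or exceed the best open-weight LLM's performance at a fraction of its size. Finally, the paper discusses the trade-off between optimizing the training data for neutrality and for meaning preservation.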

Key Takeaways

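This is the first systematic evaluation of LLMs for gender-neutral rewriting in Italian.
A two-dimensional framework measures both neutrality and semantic fidelity.
Few-shot prompting is compared across multiple LLMs, and selected models are fine-tuned.
Targeted cleaning is applied to boost task relevance.
Open-weight LLMs outperform the only existing model dedicated to Italian GNR.
The fine-tuned models match or exceed the best open-weight LLM at a fraction of its size.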

Singular Value Few-shot Adaptation of Vision-Language Models

Authors:Taha Koleilat, Hassan Rivaz, Yiming Xiao

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model’s total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.

Paper and project links

PDF 10 pages, 2 figures, 8 tables

Summary

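This paper presents CLIP-SVD, a new multi-modal, parameter-efficient adaptation technique for CLIP. It uses Singular Value Decomposition (SVD) to modify CLIP's internal parameter space for domain adaptation without injecting additional modules. Only the singular values of the CLIP parameter matrices are fine-tuned, which rescales the basis vectors for the new domain while retaining the pretrained model. Using only 0.04% of the model's total parameters, the method achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets.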

Key Takeaways

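CLIP-SVD is a multi-modal, parameter-efficient adaptation technique for CLIP based on Singular Value Decomposition (SVD).
It adapts CLIP to new domains by modifying its internal parameter space, without injecting additional modules or components.
Only the singular values of the parameter matrices are fine-tuned, which rescales the basis vectors for the new domain while retaining the pretrained model.
CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in few-shot settings.
The method trains only 0.04% of the model's parameters and better preserves its generalization ability.
The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.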
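
The idea of tuning only singular values can be sketched on a single toy matrix with NumPy. This is an assumption-laden illustration, not the released CLIP-SVD code: the matrix size, the random "update" standing in for a gradient step, and the helper name are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))  # stand-in for one pretrained weight matrix

# Decompose once: W = U @ diag(s) @ Vt. U and Vt stay frozen.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def adapted_weight(U, s_adapted, Vt):
    """Rebuild the weight from frozen bases and (trainable) singular values."""
    return (U * s_adapted) @ Vt  # scales column j of U by s_adapted[j]

# One hypothetical update: nudge each singular value slightly.
s_adapted = s * (1.0 + 0.05 * rng.normal(size=s.shape))
W_new = adapted_weight(U, s_adapted, Vt)

trainable = s.size               # only the singular values are trainable
fraction = trainable / W.size    # toy ratio; the 0.04% figure above refers to the real model
print(trainable, fraction)
```

Changing a singular value rescales one basis direction without rotating it, which is why the pretrained bases, and much of what they encode, are left intact.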

Deep Learning for Crack Detection: A Review of Learning Paradigms, Generalizability, and Datasets

Authors:Xinan Zhang, Haolin Wang, Yung-An Hsieh, Zhongyu Yang, Anthony Yezzi, Yi-Chang Tsai

Crack detection plays a crucial role in civil infrastructures, including inspection of pavements, buildings, etc., and deep learning has significantly advanced this field in recent years. While numerous technical and review papers exist in this domain, emerging trends are reshaping the landscape. These shifts include transitions in learning paradigms (from fully supervised learning to semi-supervised, weakly-supervised, unsupervised, few-shot, domain adaptation and fine-tuning foundation models), improvements in generalizability (from single-dataset performance to cross-dataset evaluation), and diversification in dataset acquisition (from RGB images to specialized sensor-based data). In this review, we systematically analyze these trends and highlight representative works. Additionally, we introduce a new annotated dataset collected with 3D laser scans, 3DCrack, to support future research and conduct extensive benchmarking experiments to establish baselines for commonly used deep learning methodologies, including recent foundation models. Our findings provide insights into the evolving methodologies and future directions in deep learning-based crack detection. Project page: https://github.com/nantonzhang/Awesome-Crack-Detection

Paper and project links

PDF under review

Summary

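Deep learning has significantly advanced crack detection in recent years. This review systematically analyzes the emerging trends: shifts in learning paradigms, improvements in generalizability, and diversification in dataset acquisition. It also introduces 3DCrack, a new annotated dataset collected with 3D laser scans, and runs extensive benchmarking experiments to establish baselines for commonly used deep learning methods, including recent foundation models. The findings offer insight into the evolving methodologies and future directions of deep learning-based crack detection.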

Key Takeaways

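Crack detection is crucial for civil infrastructure such as pavements and buildings, and deep learning has significantly advanced the field.
Learning paradigms are shifting from fully supervised learning to semi-supervised, weakly supervised, unsupervised, few-shot, domain adaptation, and fine-tuned foundation-model approaches.
Generalizability is improving, with evaluation moving from single-dataset performance to cross-dataset evaluation.
Dataset acquisition is diversifying, from RGB images to specialized sensor-based data.
The review introduces 3DCrack, a new annotated dataset collected with 3D laser scans, to support future research.
Extensive benchmarking experiments establish baselines for commonly used deep learning methods.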

Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

Authors:Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Sean O’Brien, Vasu Sharma, Kevin Zhu

Sarcasm is a form of humor where expressions convey meanings opposite to their literal interpretations. Classifying and generating sarcasm using large language models is vital for interpreting human communication. Sarcasm poses challenges for computational models, due to its nuanced nature. We introduce Sarc7, a benchmark that classifies 7 types of sarcasm: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic by annotating entries of the MUStARD dataset. Classification was evaluated using zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. We propose an emotion-based generation method developed by identifying key components of sarcasm: incongruity, shock value, and context dependency. Our classification experiments show that Gemini 2.5, using emotion-based prompting, outperforms other setups with an F1 score of 0.3664. Human evaluators preferred our emotion-based prompting, with 38.46% more successful generations than zero-shot prompting.

Paper and project links

PDF Accepted to EMNLP WiNLP and COLM Melt, Solar, PragLM, and Origen

Summary

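This paper introduces Sarc7, a benchmark that classifies seven types of sarcasm (self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic) by annotating entries of the MUStARD dataset. Classification is evaluated with zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based prompting technique. The authors also propose an emotion-based generation method built on key components of sarcasm: incongruity, shock value, and context dependency. In the classification experiments, Gemini 2.5 with emotion-based prompting outperforms the other setups with an F1 score of 0.3664. Human evaluators preferred emotion-based prompting, which produced 38.46% more successful generations than zero-shot prompting.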

Key Takeaways

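Sarc7 is a benchmark for classifying seven types of sarcasm, built by annotating entries of the MUStARD dataset.
Classification is evaluated with zero-shot, few-shot, chain-of-thought, and emotion-based prompting.
The emotion-based generation method builds on key components of sarcasm: incongruity, shock value, and context dependency.
Gemini 2.5 with emotion-based prompting achieves the best classification result, an F1 score of 0.3664.
Human evaluators preferred emotion-based prompting, which produced 38.46% more successful generations than zero-shot prompting.
Sarcasm is challenging for computational models because of its nuanced nature.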

Benchmarking Large Language Models for Cryptanalysis and Side-Channel Vulnerabilities

Authors:Utsav Maskey, Chencheng Zhu, Usman Naseem

Recent advancements in large language models (LLMs) have transformed natural language understanding and generation, leading to extensive benchmarking across diverse tasks. However, cryptanalysis - a critical area for data security and its connection to LLMs’ generalization abilities - remains underexplored in LLM evaluations. To address this gap, we evaluate the cryptanalytic potential of state-of-the-art LLMs on ciphertexts produced by a range of cryptographic algorithms. We introduce a benchmark dataset of diverse plaintexts, spanning multiple domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings along with chain-of-thought prompting, we assess LLMs’ decryption success rate and discuss their comprehension abilities. Our findings reveal key insights into LLMs’ strengths and limitations in side-channel scenarios and raise concerns about their susceptibility to under-generalization-related attacks. This research highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.

Paper and project links

PDF EMNLP’25 Findings

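Summary

Recent advances in large language models (LLMs) have led to extensive benchmarking across diverse tasks. However, cryptanalysis, an area critical to data security and connected to LLMs' generalization abilities, remains underexplored in LLM evaluations. This study evaluates the cryptanalytic potential of state-of-the-art LLMs on ciphertexts produced by a range of cryptographic algorithms. It introduces a benchmark dataset of diverse plaintexts, spanning multiple domains, lengths, writing styles, and topics, paired with their encrypted versions. Using zero-shot and few-shot settings along with chain-of-thought prompting, it assesses the LLMs' decryption success rate and discusses their comprehension abilities. The findings reveal LLMs' strengths and limitations in side-channel scenarios and raise concerns about their susceptibility to under-generalization-related attacks. The work highlights the dual-use nature of LLMs in security contexts and contributes to the ongoing discussion on AI safety and security.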

Key Takeaways

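LLMs have been benchmarked extensively on natural language understanding and generation, but cryptanalysis remains underexplored.
The study evaluates the cryptanalytic potential of LLMs on ciphertexts produced by a range of cryptographic algorithms.
It introduces a benchmark dataset of diverse plaintexts paired with their encrypted versions and uses it to assess decryption success rates.
The findings reveal LLMs' strengths and limitations in side-channel scenarios.
The findings raise concerns about LLMs' susceptibility to under-generalization-related attacks.
The study highlights the dual-use nature of LLMs in security contexts: they can support security work but can also be misused against it.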
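
A benchmark of plaintext and ciphertext pairs scored by decryption success rate can be sketched as follows. The Caesar cipher, the exact-match scoring rule, the sample sentences, and the placeholder model outputs are all assumptions chosen for brevity; the source does not list the paper's ciphers or its scoring procedure.

```python
import string

def caesar_encrypt(plaintext: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged."""
    out = []
    for ch in plaintext:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - 97 + shift) % 26 + 97))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - 65 + shift) % 26 + 65))
        else:
            out.append(ch)
    return "".join(out)

def success_rate(predictions, references):
    """Fraction of predictions that exactly match the reference plaintext."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

plaintexts = ["Attack at dawn.", "The quarterly report is due Friday."]
pairs = [(p, caesar_encrypt(p, 3)) for p in plaintexts]  # (plaintext, ciphertext)

# Placeholder for model outputs: a real run would send each ciphertext to an LLM.
model_outputs = [caesar_encrypt(pairs[0][1], -3), "The quarterly report is late."]
rate = success_rate(model_outputs, plaintexts)
print(pairs[0][1], rate)
```

Exact match is a strict criterion; a softer character-level or edit-distance score is a reasonable alternative when partial decryptions should receive credit.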

MAFA: A multi-agent framework for annotation

Authors:Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem

Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single-agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.

Paper and project links

PDF

Summary

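This paper proposes a multi-agent framework for FAQ annotation that combines multiple specialized agents using different approaches with a judge agent that reranks candidates to produce optimal results. The agents use a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. A few-shot example strategy gives each agent different few-shot examples, enhancing ensemble diversity and coverage of the query space. Evaluation on a real-world dataset from a major bank and on public benchmark datasets (LCQMC and FiQA) shows significant improvements over traditional and single-agent annotation techniques across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on the bank dataset. The framework is particularly effective at handling ambiguous queries, making it well suited to production banking applications, and it generalizes well across domains and languages.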
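
The three metrics reported above (Top-1 accuracy, Top-5 accuracy, and Mean Reciprocal Rank) have standard definitions, sketched here in plain Python. The FAQ identifiers and rankings are made-up toy data, and the function names are hypothetical; nothing here is taken from the paper's evaluation code.

```python
from typing import List, Sequence

def top_k_accuracy(rankings: Sequence[List[str]], gold: Sequence[str], k: int) -> float:
    """Share of queries whose gold FAQ appears among the first k ranked candidates."""
    hits = sum(g in r[:k] for r, g in zip(rankings, gold))
    return hits / len(gold)

def mean_reciprocal_rank(rankings: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Average of 1/rank of the gold FAQ; a query contributes 0 if the gold FAQ is absent."""
    total = 0.0
    for r, g in zip(rankings, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

# Toy reranked candidate lists, one per user query (hypothetical identifiers).
rankings = [
    ["faq_7", "faq_2", "faq_9"],
    ["faq_4", "faq_1", "faq_3"],
    ["faq_5", "faq_6", "faq_8"],
]
gold = ["faq_7", "faq_1", "faq_0"]

top1 = top_k_accuracy(rankings, gold, k=1)
top3 = top_k_accuracy(rankings, gold, k=3)
mrr = mean_reciprocal_rank(rankings, gold)
print(top1, top3, mrr)
```

Top-k accuracy only checks whether the correct FAQ is present in the first k results, while MRR also rewards placing it higher, so the two are usually reported together for reranking systems.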

Key Takeaways