Published: 2025-10-10

Updated: 2025-11-27

Word count: 14.2k

Reading time: 57 min

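⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not rely on them in serious academic settings. They are meant only for screening papers before you read them.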

2025-10-10 Update

Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization

Authors:Kanglei Zhou, Qingyi Pan, Xingxing Zhang, Hubert P. H. Shum, Frederick W. B. Li, Xiaohui Liang, Liyuan Wang

Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at https://github.com/ZhouKanglei/MAGRPP.

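Paper and project links

PDF Extended Version of MAGR (ECCV 2024 Oral Presentation)

Summary

Action Quality Assessment (AQA) from video is hampered by non-stationary quality distributions in real-world settings, which limits how well conventional methods generalize. The paper introduces Continual AQA (CAQA), which adds Continual Learning (CL) to AQA so that it can track evolving distributions while mitigating catastrophic forgetting. Empirical and theoretical analyses show that Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning, but that uncontrolled FPFT causes overfitting and feature manifold shift, which aggravates forgetting. To address this, the paper proposes Adaptive Manifold-Aligned Graph Regularization (MAGR++). It couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector translates deviated historical features into the current representation space, and a graph regularizer aligns local and global distributions. The authors build four CAQA benchmarks from three datasets, on which MAGR++ achieves state-of-the-art performance, with average correlation gains over the strongest baseline of 3.6% offline and 12.2% online.

Key Takeaways

AQA faces non-stationary quality distributions in real-world scenarios.
CAQA equips AQA with Continual Learning (CL) so that it can handle evolving distributions while mitigating catastrophic forgetting.
Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning, but uncontrolled FPFT induces overfitting and feature manifold shift.
MAGR++ fine-tunes the backbone so that shallow layers stay stable while deeper layers adapt, and rectifies features with a manifold projector followed by a graph regularizer.
Four CAQA benchmarks are constructed from three datasets, with tailored evaluation protocols and strong baselines.
MAGR++ improves average correlation over the strongest baseline by 3.6% offline and 12.2% online.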
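
The gains above are reported as a correlation between predicted and ground-truth quality scores. As a rough illustration, and assuming the metric is Spearman's rank correlation (common in AQA work, although the abstract only says "correlation"), it could be computed as in the sketch below. The function names are illustrative and are not taken from the MAGR++ repository.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted_scores, true_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted_scores), average_ranks(true_scores))
```

Because only the ordering matters, a model that ranks every performance correctly scores 1.0 even if its absolute scores are off.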

Benchmarking AI-evolved cosmological structure formation

Authors:Xiaofeng Dong, Nesar Ramachandra, Salman Habib, Katrin Heitmann

The potential of deep learning-based image-to-image translations has recently attracted significant attention. One possible application of such a framework is as a fast, approximate alternative to cosmological simulations, which would be particularly useful in various contexts, including covariance studies, investigations of systematics, and cosmological parameter inference. To investigate different aspects of learning-based cosmological mappings, we choose two approaches for generating suitable cosmological matter fields as datasets: a simple analytical prescription provided by the Zel’dovich approximation, and a numerical N-body method using the Particle-Mesh approach. The evolution of structure formation is modeled using U-Net, a widely employed convolutional image translation framework. Because of the lack of a controlled methodology, validation of these learned mappings requires multiple benchmarks beyond simple visual comparisons and summary statistics. A comprehensive list of metrics is considered, including higher-order correlation functions, conservation laws, topological indicators, and statistical independence of density fields. We find that the U-Net approach performs well only for some of these physical metrics, and accuracy is worse at increasingly smaller scales, where the dynamic range in density is large. By introducing a custom density-weighted loss function during training, we demonstrate a significant improvement in the U-Net results at smaller scales. This study provides an example of how a family of physically motivated benchmarks can, in turn, be used to fine-tune optimization schemes – such as the density-weighted loss used here – to significantly enhance the accuracy of scientific machine learning approaches by focusing attention on relevant features.

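Paper and project links

PDF Expanded and thoroughly revised version of our prior NeurIPS submission (arXiv:2112.05681; which has no DOI), with new sections, experiments, and analyses

Summary

Deep-learning image-to-image translation has recently attracted significant attention. One possible application is a fast, approximate alternative to cosmological simulations, useful for covariance studies, investigations of systematics, and cosmological parameter inference. The authors model the evolution of structure formation with U-Net, a convolutional image translation framework, and generate cosmological matter-field datasets in two ways: a simple analytical prescription given by the Zel'dovich approximation, and a numerical N-body method using the Particle-Mesh approach. Because no controlled methodology exists, validating the learned mappings requires multiple benchmarks beyond visual comparison and summary statistics. The study considers higher-order correlation functions, conservation laws, topological indicators, and statistical independence of density fields. U-Net performs well on only some of these physical metrics, and its accuracy worsens at smaller scales, where the dynamic range in density is large. A custom density-weighted loss function used during training significantly improves the U-Net results at smaller scales. The study illustrates how a family of physically motivated benchmarks can in turn be used to fine-tune optimization schemes, such as the density-weighted loss, so that scientific machine learning focuses on relevant features and becomes more accurate.

Key Takeaways

Deep-learning image-to-image translation could serve as a fast, approximate alternative to cosmological simulations.
Structure formation is modeled with U-Net, and the training data come from two sources: the Zel'dovich approximation and a Particle-Mesh N-body method.
Validating the learned mappings requires multiple benchmarks beyond visual comparison, including higher-order correlation functions, conservation laws, topological indicators, and statistical independence of density fields.
U-Net performs well on only some of the physical metrics, and its accuracy degrades at smaller scales, where the dynamic range in density is large.
A custom density-weighted loss function significantly improves the U-Net results at smaller scales.
Physically motivated benchmarks can be used to fine-tune optimization schemes so that training focuses on relevant features.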
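
To make the density-weighted loss concrete, here is a minimal sketch. The abstract does not give the loss formula, so the weighting used here (each cell's squared error is weighted by the target density raised to a power `alpha`) is an assumption, as are the function and parameter names.

```python
def density_weighted_mse(predicted, target, alpha=1.0):
    """Squared error averaged with weights target_density ** alpha.

    `predicted` and `target` are flat lists of strictly positive densities
    (for example rho / mean_rho per grid cell). This weighting is a
    hypothetical choice for illustration, not the paper's exact loss.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    weights = [t ** alpha for t in target]
    weighted_error = sum(
        w * (p - t) ** 2 for w, p, t in zip(weights, predicted, target)
    )
    # Normalize by the total weight so the result stays comparable to plain MSE.
    return weighted_error / sum(weights)
```

With `alpha=0` the weights are uniform and the function reduces to plain mean squared error. Larger `alpha` shifts the emphasis toward high-density cells, which is where the abstract reports the unweighted U-Net to be least accurate.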

Personalizing Retrieval using Joint Embeddings or “the Return of Fluffy”

Authors:Bruno Korbar, Andrew Zisserman

The goal of this paper is to be able to retrieve images using a compound query that combines object instance information from an image, with a natural text description of what that object is doing or where it is. For example, to retrieve an image of “Fluffy the unicorn (specified by an image) on someone’s head”. To achieve this we design a mapping network that can “translate” from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP style text encoding, and image retrieval. Generating a text token in this manner involves a simple training procedure, that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.

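Paper and project links

PDF Published as an oral in CBMI2025

Summary

The paper aims to retrieve images with a compound query that combines object instance information from an image with a text description of what the object is doing or where it is. To this end, the authors design a mapping network that translates a local image embedding of the object instance into a text token, so that the token combined with a natural language query suits CLIP-style text encoding and image retrieval. Using this trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders improves the state of the art on two benchmarks for personalized retrieval.

Key Takeaways

The goal is image retrieval from a compound query that combines object instance information from an image with a text description.
A mapping network, termed pi-map, translates a local image embedding into a text token.
The token is combined with a natural language query for CLIP-style text encoding and image retrieval.
Generating the token involves a simple training procedure that is performed only once per object instance.
pi-map is trained while the CLIP text and image encoders stay frozen.
The approach improves the state of the art on two benchmarks designed to assess personalized retrieval.
Retrieval can be conditioned on a specific object instance together with its context, such as what the object is doing or where it is.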

Bridging Clinical Narratives and ACR Appropriateness Guidelines: A Multi-Agent RAG System for Medical Imaging Decisions

Authors:Satrio Pambudi, Filippo Menolascina

The selection of appropriate medical imaging procedures is a critical and complex clinical decision, guided by extensive evidence-based standards such as the ACR Appropriateness Criteria (ACR-AC). However, the underutilization of these guidelines, stemming from the difficulty of mapping unstructured patient narratives to structured criteria, contributes to suboptimal patient outcomes and increased healthcare costs. To bridge this gap, we introduce a multi-agent cognitive architecture that automates the translation of free-text clinical scenarios into specific, guideline-adherent imaging recommendations. Our system leverages a novel, domain-adapted dense retrieval model, ColBERT, fine-tuned on a synthetically generated dataset of 8,840 clinical scenario-recommendation pairs to achieve highly accurate information retrieval from the ACR-AC knowledge base. This retriever identifies candidate guidelines with a 93.9% top-10 recall, which are then processed by a sequence of LLM-based agents for selection and evidence-based synthesis. We evaluate our architecture using GPT-4.1 and MedGemma agents, demonstrating a state-of-the-art exact match accuracy of 81%, meaning that in 81% of test cases the predicted procedure set was identical to the guideline’s reference set, and an F1-score of 0.879. This represents a 67-percentage-point absolute improvement in accuracy over a strong standalone GPT-4.1 baseline, underscoring the contribution that our architecture makes to a frontier model. These results were obtained on a challenging test set with substantial lexical divergence from the source guidelines. Our code is available at https://anonymous.4open.science/r/demo-iclr-B567/

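Paper and project links

PDF

Summary

A multi-agent cognitive architecture, built around a domain-adapted ColBERT dense retrieval model, automates the translation of free-text clinical scenarios into specific, guideline-adherent imaging recommendations. The retriever is fine-tuned on a synthetically generated dataset of 8,840 clinical scenario-recommendation pairs and reaches 93.9% top-10 recall on the ACR-AC knowledge base. A sequence of LLM-based agents then selects among the candidate guidelines and synthesizes the evidence. With GPT-4.1 and MedGemma agents, the system attains an exact match accuracy of 81% and an F1-score of 0.879, a 67-percentage-point absolute accuracy improvement over a standalone GPT-4.1 baseline. The authors motivate the work by noting that underuse of the guidelines contributes to suboptimal patient outcomes and increased healthcare costs. The code is available at https://anonymous.4open.science/r/demo-iclr-B567/.

Key Takeaways

Selecting an appropriate medical imaging procedure is a critical and complex clinical decision, guided by evidence-based standards such as the ACR Appropriateness Criteria (ACR-AC).
These guidelines are underused because unstructured patient narratives are hard to map to structured criteria, which contributes to suboptimal patient outcomes and increased healthcare costs.
The proposed multi-agent cognitive architecture translates free-text clinical scenarios into guideline-adherent imaging recommendations. Its retriever is a ColBERT model fine-tuned on a synthetic dataset of 8,840 clinical scenario-recommendation pairs.
The retriever identifies candidate guidelines from the ACR-AC knowledge base with 93.9% top-10 recall.
A sequence of LLM-based agents performs selection and evidence-based synthesis, and the architecture is evaluated with GPT-4.1 and MedGemma agents.
Exact match accuracy reaches 81% with an F1-score of 0.879, a 67-percentage-point absolute improvement over a standalone GPT-4.1 baseline (which implies a baseline accuracy of roughly 14%).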
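
The three reported numbers (top-10 recall, exact match, F1) are all set-based metrics. The sketch below shows one plausible way to compute them for a single test case. The abstract defines exact match as the predicted procedure set being identical to the reference set. How F1 and recall are aggregated over cases is not stated, so the per-case definitions and all function names here are assumptions.

```python
def exact_match(predicted, reference):
    """True when the predicted procedure set equals the reference set."""
    return set(predicted) == set(reference)


def set_f1(predicted, reference):
    """Harmonic mean of precision and recall over two sets of labels."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant guideline ids found in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)
```

Exact match is the strictest of the three: one extra or one missing procedure turns a case into a failure, while F1 still gives partial credit.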

Bidirectional Mammogram View Translation with Column-Aware and Implicit 3D Conditional Diffusion

Authors:Xin Li, Kaixiang Yang, Qiang Li, Zhiwei Wang

Dual-view mammography, including craniocaudal (CC) and mediolateral oblique (MLO) projections, offers complementary anatomical views crucial for breast cancer diagnosis. However, in real-world clinical workflows, one view may be missing, corrupted, or degraded due to acquisition errors or compression artifacts, limiting the effectiveness of downstream analysis. View-to-view translation can help recover missing views and improve lesion alignment. Unlike natural images, this task in mammography is highly challenging due to large non-rigid deformations and severe tissue overlap in X-ray projections, which obscure pixel-level correspondences. In this paper, we propose Column-Aware and Implicit 3D Diffusion (CA3D-Diff), a novel bidirectional mammogram view translation framework based on conditional diffusion model. To address cross-view structural misalignment, we first design a column-aware cross-attention mechanism that leverages the geometric property that anatomically corresponding regions tend to lie in similar column positions across views. A Gaussian-decayed bias is applied to emphasize local column-wise correlations while suppressing distant mismatches. Furthermore, we introduce an implicit 3D structure reconstruction module that back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. The reconstructed 3D structure is refined and injected into the denoising UNet to guide cross-view generation with enhanced anatomical awareness. Extensive experiments demonstrate that CA3D-Diff achieves superior performance in bidirectional tasks, outperforming state-of-the-art methods in visual fidelity and structural consistency. Furthermore, the synthesized views effectively improve single-view malignancy classification in screening settings, demonstrating the practical value of our method in real-world diagnostics.

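Paper and project links

PDF BIBM2025 accept, 8 pages, 4 figures

Summary

Dual-view mammography, comprising craniocaudal (CC) and mediolateral oblique (MLO) projections, provides complementary anatomical views for breast cancer diagnosis, but in real clinical workflows one view may be missing, corrupted, or degraded. View-to-view translation can help recover missing views and improve lesion alignment. The paper proposes Column-Aware and Implicit 3D Diffusion (CA3D-Diff), a bidirectional mammogram view translation framework based on a conditional diffusion model. To address cross-view structural misalignment, it uses a column-aware cross-attention mechanism that exploits the geometric property that anatomically corresponding regions tend to lie in similar column positions across views. It also introduces an implicit 3D structure reconstruction module that back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. Experiments show that CA3D-Diff performs strongly in both translation directions, outperforming state-of-the-art methods in visual fidelity and structural consistency. The synthesized views also improve single-view malignancy classification in screening settings, which indicates practical value for real-world diagnostics.

Key Takeaways

Dual-view mammography is crucial for breast cancer diagnosis, but in clinical practice a view may be missing, corrupted, or degraded.
View-to-view translation can help recover missing views and improve lesion alignment.
CA3D-Diff combines a column-aware cross-attention mechanism with an implicit 3D structure reconstruction module inside a conditional diffusion model to handle cross-view structural misalignment.
The column-aware cross-attention applies a Gaussian-decayed bias that emphasizes local column-wise correlations and suppresses distant mismatches.
The implicit 3D structure reconstruction module back-projects noisy 2D latents into a coarse 3D feature volume based on breast-view projection geometry. The refined volume is injected into the denoising UNet to guide generation.
CA3D-Diff outperforms state-of-the-art methods in visual fidelity and structural consistency, and its synthesized views improve single-view malignancy classification.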
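
The Gaussian-decayed column bias can be illustrated with a small sketch. It assumes the bias is added to the attention logits as the logarithm of a Gaussian in column distance, so that after the softmax each query column favors nearby key columns. The abstract does not specify the exact form, and `sigma` and the function names are hypothetical.

```python
import math


def column_bias(num_query_cols, num_key_cols, sigma):
    """bias[i][j] = -(i - j)^2 / (2 sigma^2), the log of a Gaussian decay."""
    return [
        [-((i - j) ** 2) / (2.0 * sigma ** 2) for j in range(num_key_cols)]
        for i in range(num_query_cols)
    ]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def column_aware_attention(scores, sigma=1.0):
    """Add the column bias to raw attention scores, then normalize each row.

    scores[i][j] is the raw similarity between query column i in one view
    and key column j in the other view.
    """
    bias = column_bias(len(scores), len(scores[0]), sigma)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]
```

With uniform raw scores, each row of the result peaks at its own column index and falls off symmetrically. A smaller `sigma` makes the attention more local.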

The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation

Authors:Jincan Lou, Jingkun Chen, Haoquan Li, Hang Li, Wenjian Huang, Weihua Chen, Fan Wang, Jianguo Zhang

Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.

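Paper and project links

PDF 11 pages, 3 figures

Summary

The paper proposes CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality, aimed at accurate liver segmentation from contrast-enhanced MRI. The framework builds on nnU-Netv2 and adds a semi-supervised mean teacher scheme that exploits large amounts of unlabeled volumes. A domain adaptation module enriches domain diversity and mitigates cross-center variability, and a continual test-time adaptation strategy improves robustness during inference. The framework consistently outperforms the nnU-Netv2 baseline and generalizes well to unseen domains under low-annotation conditions.

Key Takeaways

CoSSeg-TTA is a compact framework for liver segmentation from contrast-enhanced MRI, targeting the GED4 modality.
It builds on nnU-Netv2 and uses a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes.
The domain adaptation module combines a randomized histogram-based style appearance transfer function with a trainable contrast-aware network to enrich domain diversity and mitigate cross-center variability.
A continual test-time adaptation strategy improves robustness during inference.
Compared with the nnU-Netv2 baseline, CoSSeg-TTA achieves better Dice score and Hausdorff Distance.
The framework generalizes strongly to unseen domains under low-annotation conditions.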
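
Two ingredients named above have simple standard forms. In a mean teacher scheme the teacher's weights are usually an exponential moving average (EMA) of the student's, and the Dice score measures overlap between a predicted mask and the ground truth. The sketch below shows both on flat Python lists. The decay value and the function names are illustrative assumptions and are not taken from CoSSeg-TTA.

```python
def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def dice_score(predicted_mask, true_mask):
    """Dice = 2 |A and B| / (|A| + |B|) for flat binary masks of 0s and 1s."""
    intersection = sum(1 for p, t in zip(predicted_mask, true_mask) if p and t)
    size_sum = sum(1 for p in predicted_mask if p) + sum(1 for t in true_mask if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / size_sum
```

A decay close to 1 makes the teacher change slowly, which is what lets it provide stable targets for unlabeled volumes.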

BLADE: Bias-Linked Adaptive DEbiasing

Authors:Piyush Arora, Navlika Singh, Vasubhya Diwan, Pratik Mazumder

Neural networks have revolutionized numerous fields, yet they remain vulnerable to a critical flaw: the tendency to learn implicit biases, spurious correlations between certain attributes and target labels in training data. These biases are often more prevalent and easier to learn, causing models to rely on superficial patterns rather than task-relevant features necessary for generalization. Existing methods typically rely on strong assumptions, such as prior knowledge of these biases or access to bias-conflicting samples, i.e., samples that contradict spurious correlations and counterbalance bias-aligned samples, samples that conform to these spurious correlations. However, such assumptions are often impractical in real-world settings. We propose BLADE (Bias-Linked Adaptive DEbiasing), a generative debiasing framework that requires no prior knowledge of bias or bias-conflicting samples. BLADE first trains a generative model to translate images across bias domains while preserving task-relevant features. Then, it adaptively refines each image with its synthetic counterpart based on the image’s susceptibility to bias. To encourage robust representations, BLADE aligns an image with its bias-translated synthetic counterpart that shares task-relevant features but differs in bias, while misaligning it with samples sharing the same bias. We evaluate BLADE on multiple benchmark datasets and show that it significantly outperforms state-of-the-art methods. Notably, it exceeds the closest baseline by an absolute margin of around 18% on the corrupted CIFAR-10 dataset under the worst group setting, establishing a new benchmark in bias mitigation and demonstrating its potential for developing more robust deep learning models without explicit supervision.

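Paper and project links

PDF The authors have contributed equally

Summary

Neural networks tend to learn implicit biases, that is, spurious correlations between certain attributes and target labels in the training data. BLADE is a generative debiasing framework that requires no prior knowledge of the bias and no bias-conflicting samples. It translates images across bias domains while preserving task-relevant features, then adaptively refines each image with its synthetic counterpart according to the image's susceptibility to bias. It significantly outperforms state-of-the-art methods on multiple benchmark datasets.

Key Takeaways

Neural networks have transformed many fields, yet they tend to learn implicit biases and spurious correlations present in the training data.
Existing methods typically rely on strong assumptions, such as prior knowledge of the biases or access to bias-conflicting samples.
BLADE needs neither. It trains a generative model to translate images across bias domains while preserving task-relevant features.
BLADE adaptively refines each image with its synthetic counterpart according to the image's susceptibility to bias, which encourages robust representations.
BLADE aligns an image with its bias-translated synthetic counterpart, which shares task-relevant features but differs in bias, and misaligns it with samples that share the same bias.
On multiple benchmark datasets BLADE significantly outperforms state-of-the-art methods. On corrupted CIFAR-10 under the worst group setting, it exceeds the closest baseline by an absolute margin of around 18%.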

Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

Authors:Minh Hoang Nguyen, Su Nguyen Thiet

Recognizing and processing Classical Chinese (Han-Nom) texts play a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.

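Paper and project links

PDF 5 pages, 6 figures, 2 tables

Summary

Recognizing and processing Classical Chinese (Han-Nom) texts plays a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. Existing OCR systems, however, struggle with degraded scans, non-standard glyphs, and handwriting variations. The paper proposes fine-tuning PaddleOCRv5 to improve character recognition on Han-Nom texts. The text recognition module is retrained on a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Exact accuracy rises from 37.5% to 50.0% over the base model, particularly under noisy image conditions. The authors also provide an interactive demo that visually compares recognition results before and after fine-tuning, which supports downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.

Key Takeaways

Classical Chinese (Han-Nom) texts matter for digitizing Vietnamese historical documents and for cross-lingual semantic research.
Existing OCR systems struggle with such texts because of degraded scans, non-standard glyphs, and handwriting variations.
The paper fine-tunes PaddleOCRv5 to improve character recognition on Han-Nom texts.
The text recognition module is retrained on a curated subset of ancient Vietnamese Chinese manuscripts.
Exact accuracy rises from 37.5% to 50.0%, particularly under noisy image conditions.
An interactive demo visually compares recognition results before and after fine-tuning.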

Contrastive-SDE: Guiding Stochastic Differential Equations with Contrastive Learning for Unpaired Image-to-Image Translation

Authors:Venkata Narendra Kotyada, Revanth Eranki, Nagesh Bhattu Sristy

Unpaired image-to-image translation involves learning mappings between source domain and target domain in the absence of aligned or corresponding samples. Score-based diffusion models have demonstrated state-of-the-art performance in generative tasks. Their ability to approximate complex data distributions through stochastic differential equations (SDEs) enables them to generate high-fidelity and diverse outputs, making them particularly well-suited for unpaired I2I settings. In parallel, contrastive learning provides a powerful framework for learning semantic similarities without the need for explicit supervision or paired data. By pulling together representations of semantically similar samples and pushing apart dissimilar ones, contrastive methods are inherently aligned with the objectives of unpaired translation. Its ability to selectively enforce semantic consistency at the feature level makes contrastive learning particularly effective for guiding generation in unpaired scenarios. In this work, we propose a time-dependent contrastive learning approach where a model is trained with SimCLR by considering an image and its domain-invariant feature as a positive pair, enabling the preservation of domain-invariant features and the discarding of domain-specific ones. The learned contrastive model then guides the inference of a pretrained SDE for the I2I translation task. We empirically compare Contrastive-SDE with several baselines across three common unpaired I2I tasks, using four metrics for evaluation. Contrastive-SDE achieves comparable results to the state-of-the-art on several metrics. Furthermore, we observe that our model converges significantly faster and requires no label supervision or classifier training, making it a more efficient alternative for this task.

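Paper and project links

PDF 9 pages, 3 figures

Summary

The paper presents a time-dependent contrastive learning approach for unpaired image-to-image (I2I) translation. It combines contrastive learning with score-based diffusion models to learn a mapping between source and target domains without aligned samples. A contrastive model, trained with SimCLR on positive pairs formed by an image and its domain-invariant feature, enforces semantic consistency at the feature level and then guides the inference of a pretrained SDE. Across three common unpaired I2I tasks, the method achieves results comparable to the state of the art on several metrics, converges significantly faster, and requires no label supervision or classifier training.

Key Takeaways

The paper proposes a time-dependent contrastive learning approach for unpaired image-to-image translation.
Contrastive learning enforces semantic consistency at the feature level, which suits guiding generation in unpaired scenarios.
Score-based diffusion models approximate complex data distributions through SDEs and generate high-fidelity, diverse outputs. The learned contrastive model guides a pretrained SDE at inference time.
On three common unpaired I2I tasks, evaluated with four metrics, Contrastive-SDE is comparable to the state of the art on several metrics.
Contrastive-SDE converges significantly faster and needs no label supervision or classifier training.
Treating an image and its domain-invariant feature as a positive pair lets the model preserve domain-invariant features and discard domain-specific ones.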
ML2B: Multi-Lingual ML Benchmark For AutoML

Authors:Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin

Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.

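Paper and project links

PDF

Summary

Large language models have shown strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. Existing benchmarks for ML code generation are mainly restricted to English, however, and overlook the global and multilingual nature of ML research and practice. To address this gap, the authors present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. Evaluation uses AIDE, an automated framework for end-to-end assessment of data science pipelines, and yields insights into cross-lingual model performance. Performance degrades by 15-45% on non-English tasks, which highlights critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and full results are available at https://github.com/enaix/ml2b.

Key Takeaways

Large language models can generate ML code and build end-to-end pipelines from natural language instructions.
Existing benchmarks for ML code generation are mainly restricted to English and overlook multilingual use.
ML2B is the first benchmark for multilingual ML code generation, with 30 Kaggle competitions translated into 13 natural languages.
ML2B covers tabular, text, and image data types, with structured metadata and validated human-reviewed translations.
Evaluation uses the AIDE automated framework and examines cross-lingual model performance.
Performance degrades by 15-45% on non-English tasks, which highlights critical challenges in multilingual representation learning.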

GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Authors:Jiafeng Xiong, Yuting Zhao

Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

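Paper and project links

PDF Accepted as an oral presentation at the EMNLP 2025 Workshop on Machine Translation (WMT)

Summary

The paper constructs novel multimodal scene graphs to preserve and integrate modality-specific information and introduces GIIFT, a two-stage Graph-guided Inductive Image-Free Multimodal Machine Translation framework. GIIFT uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalizes it to broader image-free translation domains. On the English-to-French and English-to-German tasks of the Multi30K dataset, GIIFT surpasses existing approaches and reaches the state of the art even without images during inference. Results on the WMT benchmark also show significant improvements over image-free translation baselines, which demonstrates GIIFT's strength in inductive image-free inference.

Key Takeaways

Multimodal Machine Translation (MMT) has shown that visual information helps machine translation significantly.
Existing MMT methods struggle to leverage the modality gap because they enforce rigid visual-linguistic alignment and are confined to inference within their trained multimodal domains.
The paper constructs multimodal scene graphs to preserve and integrate modality-specific information.
It introduces GIIFT, a two-stage graph-guided, inductive, image-free MMT framework.
A cross-modal Graph Attention Network adapter learns multimodal knowledge in a unified fused space.
GIIFT generalizes inductively to broader translation domains without requiring images at inference time.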

RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

Authors:Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok

Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

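Paper and project links

PDF

Summary

The paper introduces RoboSwap, a framework for swapping the robotic arm in one video with another, motivated by the scarcity of diverse, high-quality datasets for video-conditioned robotic learning. Unlike earlier methods that depend on paired video demonstrations in the same environmental settings, RoboSwap operates on unpaired data from diverse environments, which reduces data collection needs. It combines the advantages of GANs and diffusion models: robotic arms are segmented from their backgrounds, an unpaired GAN translates one arm into another, the translated arm is blended with the original video background, and a diffusion model refines the result to enhance coherence, motion realism, and object interaction. Experiments show that RoboSwap outperforms state-of-the-art video and image editing models in structural coherence and motion consistency, offering a robust way to generate reliable cross-embodiment data for robotic learning.

Key Takeaways

RoboSwap addresses swapping a robotic arm in one video with another, a key step for cross-embodiment learning.
Unlike methods that depend on paired video demonstrations, RoboSwap works on unpaired data from diverse environments, which reduces the data collection burden.
RoboSwap integrates GANs and diffusion models into a novel video editing pipeline.
Robotic arms are segmented from their backgrounds, and an unpaired GAN translates one arm into another.
The translated arm is blended with the original background and refined by a diffusion model to enhance coherence, motion realism, and object interaction. The GAN and diffusion stages are trained independently.
On three benchmarks RoboSwap outperforms state-of-the-art video and image editing models in structural coherence and motion consistency.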
Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

Authors:Jingjun Yang, Liangwei Fan, Jinpu Zhang, Xiangkai Lian, Hui Shen, Dewen Hu

The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency.

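Paper and project links

PDF Accepted by NeurIPS2025

Summary

The paper proposes SpikeFET, which the authors describe as the first fully spiking frame-event tracking framework. It combines convolutional local feature extraction with Transformer-based global modeling within the spiking paradigm to fuse frame and event data. To counter the loss of translation invariance caused by convolutional padding, it introduces a Random Patchwork Module (RPM) that removes positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. A Spatial-Temporal Regularization (STR) strategy enforces spatio-temporal consistency among temporal template features in latent space, which overcomes the similarity metric degradation caused by asymmetric features. Experiments on multiple benchmarks show better tracking accuracy than existing methods together with significantly lower power consumption.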

Key Takeaways