
I2I Translation


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Please note: never rely on them for serious academic work; they are only meant as a first-pass screen before actually reading a paper!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-02-28

ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding

Authors:Qihang Peng, Henry Zheng, Gao Huang

Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation suitable for multimodal task to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing the computational overhead of attention blocks by 40.6%. These results establish a new SOTA in ego-centric 3D visual grounding, showcasing the effectiveness and robustness of our approach.
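
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea described above, assuming points have already been grouped into sub-manifolds by a clustering step: a text-derived proxy feature produces a per-cluster translation vector, an image-derived proxy feature produces a per-cluster 3x3 linear transform, and each cluster is reshaped as x' = xAᵀ + t. All module names, feature dimensions, and shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubmanifoldTransform(nn.Module):
    """Illustrative sketch: text guides per-cluster translation, image guides
    a per-cluster linear transform (dimensions are assumptions)."""
    def __init__(self, text_dim=256, img_dim=256):
        super().__init__()
        self.to_translation = nn.Linear(text_dim, 3)   # text proxy -> per-cluster translation
        self.to_linear = nn.Linear(img_dim, 9)         # image proxy -> per-cluster 3x3 transform

    def forward(self, clusters, text_proxy, img_proxy):
        # clusters:   (K, N, 3) points grouped into K sub-manifolds
        # text_proxy: (K, text_dim) text-conditioned proxy features, one per cluster
        # img_proxy:  (K, img_dim)  image-conditioned proxy features, one per cluster
        K = clusters.shape[0]
        t = self.to_translation(text_proxy)                   # (K, 3)
        A = self.to_linear(img_proxy).view(K, 3, 3)
        A = A + torch.eye(3, device=clusters.device)          # start near the identity for stability
        return torch.einsum('knd,ked->kne', clusters, A) + t[:, None, :]

# Usage with random tensors (shapes are illustrative):
pts = torch.randn(8, 512, 3)                                  # 8 clusters of 512 points
out = SubmanifoldTransform()(pts, torch.randn(8, 256), torch.randn(8, 256))
print(out.shape)                                              # torch.Size([8, 512, 3])
```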


Paper and Project Links

PDF 11 pages, 3 figures

Summary
Proxy Transformation is proposed as a multimodal method for efficiently reshaping the point cloud manifold in real-time embodied tasks. By identifying the sub-manifolds of target regions and transforming them under text and image guidance, it improves how well the manifold structure of target regions is captured, which benefits multimodal grounding. Extensive experiments show that Proxy Transformation clearly outperforms existing methods on ego-centric 3D visual grounding while also improving computational efficiency.

Key Takeaways

  • Proxy Transformation aims to better capture the manifold structure of target regions in the point cloud.
  • By exploiting multimodal proxy information, it improves the point cloud efficiently for grounding.
  • The method consists of two parts: Deformable Point Clustering and a Proxy Attention module.
  • Deformable Point Clustering identifies the point cloud sub-manifolds of target regions; Proxy Attention uses multimodal proxies to guide the point cloud transformation.
  • Text information globally guides translation vectors for the different sub-manifolds, optimizing the relative spatial relationships between target regions; image information guides linear transformations within each sub-manifold, refining its local point cloud manifold.

Cool Papers

Click here to view paper screenshots

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Authors:Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Huan Teng, Yu Qiao, Peng Gao, Hongsheng Li

This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we tackle a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at https://github.com/AFeng-x/PixWizard
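
The "flexible any resolution mechanism" is only described at a high level in the abstract; the helper below is a hedged illustration of one way such a mechanism could pick a patch grid for the DiT: keep the input's aspect ratio while staying under a fixed token budget. The function name, patch size, and token budget are hypothetical and not taken from the PixWizard code.

```python
import math

def choose_patch_grid(height, width, patch_size=16, max_tokens=1024):
    """Return (rows, cols, padded_size) of a patch grid that roughly preserves
    the input aspect ratio while keeping rows * cols <= max_tokens."""
    aspect = width / height
    rows = max(1, int(math.sqrt(max_tokens / aspect)))       # rows so that cols/rows ~= aspect
    cols = max(1, min(int(rows * aspect), max_tokens // rows))
    return rows, cols, (rows * patch_size, cols * patch_size)

# Example: a 720x1280 input maps to a 24x42 grid (close to 16:9) under a 1024-token budget.
print(choose_patch_grid(720, 1280))   # (24, 42, (384, 672))
```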


Paper and Project Links

PDF Code is released at https://github.com/AFeng-x/PixWizard

Summary
PixWizard is a new-generation image-to-image assistant that supports image generation, manipulation, and translation from natural-language instructions. It builds a unified image-text-to-image generation framework, uses Diffusion Transformers as the foundation model, and adds a flexible any-resolution processing mechanism. PixWizard handles a wide range of vision tasks, including text-to-image generation, image restoration, and image grounding, and shows strong generalization and the ability to understand images at diverse resolutions.

Key Takeaways

  1. PixWizard is a versatile image-to-image assistant that supports image generation, manipulation, and translation driven by natural-language instructions.
  2. It unifies a wide range of vision tasks within a single image-text-to-image generation framework.
  3. PixWizard builds on Diffusion Transformers (DiT) as its foundation model and adds a flexible any-resolution mechanism.
  4. The model processes images dynamically according to the input's aspect ratio, closely matching human perceptual processes.
  5. PixWizard incorporates structure-aware and semantic-aware guidance to fuse information from the input image effectively.
  6. Experiments show that PixWizard not only generates and understands images well across resolutions but also generalizes to unseen tasks and human instructions.

Cool Papers

Click here to view paper screenshots

TransForce: Transferable Force Prediction for Vision-based Tactile Sensors with Sequential Image Translation

Authors:Zhuo Chen, Ni Ou, Xuyang Zhang, Shan Luo

Vision-based tactile sensors (VBTSs) provide high-resolution tactile images crucial for robot in-hand manipulation. However, force sensing in VBTSs is underutilized due to the costly and time-intensive process of acquiring paired tactile images and force labels. In this study, we introduce a transferable force prediction model, TransForce, designed to leverage collected image-force paired data for new sensors under varying illumination colors and marker patterns while improving the accuracy of predicted forces, especially in the shear direction. Our model effectively achieves translation of tactile images from the source domain to the target domain, ensuring that the generated tactile images reflect the illumination colors and marker patterns of the new sensors while accurately aligning the elastomer deformation observed in existing sensors, which is beneficial to force prediction of new sensors. As such, a recurrent force prediction model trained with generated sequential tactile images and existing force labels is employed to estimate higher-accuracy forces for new sensors with lowest average errors of 0.69N (5.8% in full work range) in $x$-axis, 0.70N (5.8%) in $y$-axis, and 1.11N (6.9%) in $z$-axis compared with models trained with single images. The experimental results also reveal that pure marker modality is more helpful than the RGB modality in improving the accuracy of force in the shear direction, while the RGB modality show better performance in the normal direction.
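
As a rough sketch of the recurrent force predictor described above (the architecture is not specified in the abstract, so every layer choice below is an assumption): a small CNN encodes each translated tactile frame, a GRU aggregates the sequence, and a linear head regresses the 3-axis force in newtons.

```python
import torch
import torch.nn as nn

class RecurrentForcePredictor(nn.Module):
    """Assumed architecture: per-frame CNN encoder + GRU over the sequence + linear head."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 3)   # Fx, Fy, Fz in newtons

    def forward(self, frames):
        # frames: (B, T, 3, H, W) sequence of (translated) tactile images
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)   # per-frame features
        out, _ = self.gru(feats)                                    # temporal aggregation
        return self.head(out[:, -1])                                # force at the last time step

# Usage with a dummy 8-frame sequence:
force = RecurrentForcePredictor()(torch.randn(2, 8, 3, 64, 64))
print(force.shape)   # torch.Size([2, 3])
```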


Paper and Project Links

PDF Accepted to ICRA2025

Summary

This work introduces TransForce, a transferable force prediction model that reuses existing image-force paired data to predict forces for new vision-based tactile sensors with different illumination colors and marker patterns. The model translates tactile images across domains so that the generated images reflect the new sensor's illumination and markers while accurately preserving the elastomer deformation observed on the existing sensor, which improves force prediction for the new sensor. The resulting predictor reaches average errors of 0.69 N (5.8%), 0.70 N (5.8%), and 1.11 N (6.9%) on the x-, y-, and z-axes respectively, more accurate than models trained on single images. The experiments also show that the pure marker modality is better than the RGB modality at improving force accuracy in the shear direction, while the RGB modality performs better in the normal direction.

Key Takeaways

  1. Vision-based tactile sensors (VBTSs) provide high-resolution tactile images crucial for robot in-hand manipulation.
  2. Force sensing in VBTSs is underutilized due to the cost and time of acquiring paired tactile images and force labels.
  3. TransForce model leverages collected image-force paired data to improve force prediction accuracy for new sensors under varying conditions.
  4. TransForce achieves cross-domain translation of tactile images, ensuring generated images reflect new sensors’ illumination and marker patterns.
  5. Experimental results show the model’s high accuracy in force prediction, with average errors of less than 1N in all axes.
  6. Pure marker modality is more effective in improving force accuracy in the shear direction compared to the RGB modality.

Cool Papers

Click here to view paper screenshots

Faster Diffusion via Temporal Attention Decomposition

Authors:Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber

We explore the role of attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. However, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show when widely applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
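
A simplified caching wrapper in the spirit of TGATE is sketched below; it is illustrative only, and the official implementation lives at the repository linked above. After a chosen number of denoising steps, the cross-attention output is cached and reused for the remaining steps, skipping that computation during the fidelity-improving phase. The wrapper interface (attn(hidden_states, context)) and the gate step value are assumptions.

```python
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Wraps an existing cross-attention module and reuses its cached output
    after `gate_step` denoising steps. Call `reset()` before each new prompt."""
    def __init__(self, attn: nn.Module, gate_step: int = 10):
        super().__init__()
        self.attn = attn              # any module called as attn(hidden_states, context)
        self.gate_step = gate_step
        self.reset()

    def reset(self):
        self.step = 0
        self.cache = None

    def forward(self, hidden_states, context):
        if self.step < self.gate_step or self.cache is None:
            self.cache = self.attn(hidden_states, context)   # planning phase: compute normally
        self.step += 1
        return self.cache                                     # later steps: reuse cached output
```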


Paper and Project Links

PDF Accepted by TMLR: https://openreview.net/forum?id=xXs2GKXPnH

Summary: This work studies the role of the attention mechanism during inference in text-conditional diffusion models. Cross-attention outputs converge to a fixed point after several inference steps, dividing inference into two phases: an early phase that plans text-oriented visual semantics and a later phase that turns them into images with improved fidelity. Cross-attention is essential early on and nearly irrelevant afterwards, while self-attention matters little at first but becomes crucial later. These observations yield TGATE (temporally gating the attention), a simple, training-free method that caches and reuses attention outputs at scheduled time steps. Applied to a range of text-conditional diffusion models, TGATE accelerates them by 10%-50%.

Key Takeaways

  1. The role of the attention mechanism in text-conditional diffusion models is studied.
  2. Cross-attention outputs converge to a fixed point after several inference steps.
  3. Inference splits into two phases: planning text-oriented visual semantics, then improving image fidelity.
  4. Cross-attention is essential in the early phase and plays little role afterwards.
  5. Self-attention becomes important in the later phase.
  6. TGATE (temporally gating the attention) is a simple, training-free attention optimization built on these observations.
  7. TGATE markedly improves the efficiency of text-conditional diffusion models, with speedups of 10%-50%.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!