⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: do not rely on them in serious academic settings; they are only meant as a first-pass screen before actually reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-19
Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
Authors: Michal Szczepanski, Martyna Poreba, Karim Haroun
Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. The encoder blocks also integrate early exits to remove high-confidence supertokens, lowering the computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
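The abstract does not describe dCTS in detail. As a rough illustration of the general idea (a lightweight CNN that decides which neighbouring patches can be merged into a single superpatch), here is a minimal sketch; `TinyMergePolicy`, `merge_tokens`, the 2x2 neighbourhood size, the threshold, and mean-pooling of merged embeddings are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a dCTS-style merging policy (not the authors' code).
import torch
import torch.nn as nn

class TinyMergePolicy(nn.Module):
    """Lightweight CNN that scores each 2x2 patch neighbourhood for merging."""
    def __init__(self, in_ch=3, patch=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=patch, stride=patch),  # one cell per 16x16 patch
            nn.GELU(),
            nn.Conv2d(32, 1, kernel_size=2, stride=2),              # one score per 2x2 patch block
        )

    def forward(self, img):                       # img: (B, 3, H, W)
        return torch.sigmoid(self.features(img))  # (B, 1, H/32, W/32) merge probabilities

def merge_tokens(tokens, merge_prob, grid_hw, thresh=0.5):
    """Replace each 2x2 group of patch tokens by their mean when the policy
    deems the group homogeneous; otherwise keep the four individual tokens."""
    B, N, C = tokens.shape
    H, W = grid_hw                                # patch grid, e.g. 64x64 for a 1024x1024 input
    x = tokens.view(B, H, W, C)
    out = []
    for b in range(B):
        kept = []
        for i in range(0, H, 2):
            for j in range(0, W, 2):
                block = x[b, i:i+2, j:j+2].reshape(-1, C)             # 4 patch tokens
                if merge_prob[b, 0, i // 2, j // 2] > thresh:
                    kept.append(block.mean(dim=0, keepdim=True))      # one supertoken
                else:
                    kept.append(block)                                # keep all four
        out.append(torch.cat(kept, dim=0))
    return out  # ragged list: one (N_b, C) token tensor per image
```

In this toy version, every confidently merged 2x2 group drops three tokens, which is how the token count can fall well below that of the standard 16 x 16 patching scheme on high-resolution inputs.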
Paper and project links
Summary
This paper proposes STEP, a hybrid token-reduction framework that improves the efficiency of Vision Transformers (ViTs) for semantic segmentation through dynamic patch merging and token pruning, without significantly reducing accuracy. The framework combines supertokens with an early-pruning strategy: a lightweight CNN policy network, dCTS, flexibly merges patches into superpatches, and early-exit mechanisms in the encoder blocks remove high-confidence supertokens to lower the computational load. On high-resolution semantic segmentation benchmarks, the approach effectively reduces the token count, lowers computational cost, and increases throughput, and the full framework improves efficiency further while keeping the accuracy loss very small.
Key Takeaways
- Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but have high computational and memory costs.
- The STEP framework is proposed to improve the efficiency of ViTs in semantic segmentation without significantly compromising accuracy.
- STEP combines dynamic patch merging and token pruning to reduce the number of tokens.
- dCTS, a lightweight CNN-based policy network, enables flexible merging into superpatches.
- Encoder blocks with early exits remove high-confidence supertokens, reducing the computational load.
- When applied to high-resolution semantic segmentation, STEP can significantly reduce computational complexity and increase inference speed with minimal accuracy loss.
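To make the early-pruning side of STEP concrete: a supertoken whose class can already be predicted with high confidence at an intermediate encoder block need not be processed by the remaining blocks. Below is a minimal sketch assuming a hypothetical per-token auxiliary classification head and a fixed confidence threshold; the actual gating criterion and exit placement in STEP may differ.

```python
# Minimal sketch of confidence-based early-exit token halting (assumed mechanism).
import torch
import torch.nn as nn

class EarlyExitGate(nn.Module):
    def __init__(self, dim, num_classes, thresh=0.95):
        super().__init__()
        self.head = nn.Linear(dim, num_classes)   # auxiliary per-token segmentation head
        self.thresh = thresh

    def forward(self, tokens):                    # tokens: (N, C) for one image
        probs = self.head(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        halt = conf > self.thresh                 # confidently predicted supertokens
        return tokens[~halt], pred, halt          # only undecided tokens continue

# Pseudo-flow: undecided tokens pass through the remaining encoder blocks, while
# halted tokens keep the auxiliary prediction pred[halt] as their final label.
```

The abstract reports that, with suitable STEP configurations, up to 40% of tokens can be halted in this spirit before reaching the final encoder layer.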

SAIL-VL2 Technical Report
Authors: Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
Paper and project links
PDF Technical Report
Summary
SAIL-VL2 is an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. Through large-scale data curation and filtering strategies, a progressive training framework, and a Mixture-of-Experts design, SAIL-VL2 delivers excellent performance across diverse image and video benchmarks, with strong capabilities ranging from fine-grained perception to complex reasoning. The model is competitive across many datasets and reaches state-of-the-art results on challenging benchmarks such as MMMU and MathVista. In addition, SAIL-VL2 ranks near the top of the OpenCompass leaderboard, making it an efficient and extensible foundation for the open-source multimodal community.
Key Takeaways
- SAIL-VL2, the successor to SAIL-VL, is an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning.
- Large-scale data scoring and filtering strategies improve data quality and distribution, making training more efficient.
- A progressive training framework starts from a strong pre-trained vision encoder, proceeds through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities.
- Architectural improvements go beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs.
- SAIL-VL2 achieves excellent performance across diverse image and video benchmarks, covering fine-grained perception and complex reasoning.
- It reaches state-of-the-art results on challenging benchmarks such as MMMU and MathVista.
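The report names sparse Mixture-of-Experts (MoE) designs as one of the architectural advances but the abstract gives no details. The block below is a generic top-k routed MoE feed-forward layer, shown only to illustrate the technique; the expert count, hidden sizes, and routing scheme are arbitrary assumptions and not SAIL-VL2's actual configuration.

```python
# Generic top-k sparse Mixture-of-Experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=1024, hidden=4096, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                    # x: (num_tokens, dim)
        gate = self.router(x).softmax(dim=-1)                # (T, E) routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)         # each token picks top_k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):            # dense loop for clarity, not speed
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue                                     # no token routed to this expert
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```

Only the selected experts run for each token, which is how MoE layers grow parameter count without a proportional increase in per-token compute.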

EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
Authors: Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang
Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at: https://github.com/einsteinxia/EDITS.
Paper and project links
Summary
Dataset distillation synthesizes a compact dataset from a large-scale original so that models can be trained efficiently while remaining competitive. Traditional techniques, however, mainly capture low-level visual features and ignore the semantic and structural information inherent in images. This paper proposes EDITS, a framework that exploits the implicit textual semantics in image data to enhance distillation. External texts generated by a Vision Language Model are fused with image features through a Global Semantic Query module to form a prior clustered buffer. Local Semantic Awareness then selects representative samples from this buffer to build image and text prototypes, with the text prototypes produced by guiding a Large Language Model with carefully designed prompts. Finally, a Dual Prototype Guidance strategy generates the synthetic dataset with a diffusion model.
Key Takeaways
- Dataset distillation synthesizes a compact dataset that enables efficient learning while maintaining model performance.
- Traditional techniques mainly capture low-level visual features and neglect the semantic and structural information in images.
- The EDITS framework exploits implicit textual semantics to achieve enhanced distillation.
- EDITS fuses external texts generated by a Vision Language Model with image features to form a prior clustered buffer.
- Local Semantic Awareness selects representative samples from the buffer to construct image and text prototypes.
- The text prototypes are produced by guiding a Large Language Model with carefully crafted prompts.
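As a rough, hypothetical sketch of the two steps summarised above — fusing VLM-generated caption embeddings with image features, then selecting representative samples from the clustered buffer — consider the following. Concatenation-plus-projection fusion and k-means clustering are stand-ins chosen for illustration; `fuse_features` and `select_prototypes` are not the paper's Global Semantic Query or Local Semantic Awareness modules.

```python
# Illustrative sketch only: feature fusion and prototype selection under assumed designs.
import numpy as np
from sklearn.cluster import KMeans

def fuse_features(img_feats, txt_feats, proj=None):
    """Concatenate per-sample image and caption embeddings and optionally project."""
    fused = np.concatenate([img_feats, txt_feats], axis=1)    # (N, d_img + d_txt)
    if proj is not None:                                      # proj: (d_img + d_txt, d_out)
        fused = fused @ proj
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

def select_prototypes(fused, n_clusters=10, per_cluster=5):
    """Cluster the fused buffer and keep the samples closest to each centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(fused)
    proto_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(fused[members] - km.cluster_centers_[c], axis=1)
        proto_idx.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return proto_idx  # indices of representative images in the buffer
```

Per the abstract, the selected image prototypes, together with LLM-generated text prototypes for the same clusters, would then guide a diffusion model that produces the final synthetic dataset.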
