Published: 2025-11-22

Updated: 2025-11-27


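⚠️ All summaries below are produced by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only suitable for initial screening before reading a paper!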

Update for 2025-11-22

Panel-by-Panel Souls: A Performative Workflow for Expressive Faces in AI-Assisted Manga Creation

Authors:Qing Zhang, Jing Huang, Yifei Huang, Jun Rekimoto

Current text-to-image models struggle to render the nuanced facial expressions required for compelling manga narratives, largely due to the ambiguity of language itself. To bridge this gap, we introduce an interactive system built on a novel, dual-hybrid pipeline. The first stage combines landmark-based auto-detection with a manual framing tool for robust, artist-centric face preparation. The second stage maps expressions using the LivePortrait engine, blending intuitive performative input from video for fine-grained control. Our case study analysis suggests that this integrated workflow can streamline the creative process and effectively translate narrative intent into visual expression. This work presents a practical model for human-AI co-creation, offering artists a more direct and intuitive means of "infusing souls" into their characters. Our primary contribution is not a new generative model, but a novel, interactive workflow that bridges the gap between artistic intent and AI execution.


Paper and project links

PDF NeurIPS 2025 Creative AI Track, The Thirty-Ninth Annual Conference on Neural Information Processing Systems

Summary

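Current text-to-image models struggle to render the subtle facial expressions that manga storytelling requires, mainly because language is ambiguous. To address this, the paper introduces an interactive system built on a novel dual-hybrid pipeline. The first stage combines landmark-based auto-detection with a manual framing tool for robust, artist-centric face preparation. The second stage maps expressions with the LivePortrait engine, using intuitive performative input from video for fine-grained control. The case study analysis suggests that this integrated workflow can streamline the creative process and translate narrative intent into visual expression. The work offers a practical model for human-AI co-creation, giving artists a more direct and intuitive way to "infuse souls" into their characters. Its main contribution is not a new generative model but a novel interactive workflow that narrows the gap between artistic intent and AI execution.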

Key Takeaways

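Current text-to-image models struggle to render subtle facial expressions in manga, mainly because language is ambiguous.
The paper introduces a new interactive system, built on a dual-hybrid pipeline, to address this problem.
The system has two stages: the first performs robust face preparation, and the second maps expressions, using intuitive performative input from video for fine-grained control.
The case study analysis suggests that the integrated workflow can streamline the creative process and translate narrative intent into visual expression.
The work offers a practical model for human-AI co-creation, letting artists "infuse souls" into their characters more directly and intuitively.
The main contribution is the novel interactive workflow, not a new generative model.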
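
The first-stage face preparation described above can be pictured with a small sketch. The paper does not spell out how auto-detection and manual framing are combined, so everything here is an assumption made for illustration: the priority rule (a manual frame overrides detected landmarks), the `margin` padding, and the function names `face_box_from_landmarks` and `prepare_face_box`. The landmark coordinates would come from whichever detector is in use.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def face_box_from_landmarks(
    landmarks: Sequence[Tuple[float, float]],
    image_size: Tuple[int, int],
    margin: float = 0.25,
) -> Box:
    """Bounding box around the landmarks, padded by `margin` and clamped to the image."""
    width, height = image_size
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Pad each side by a fraction of the landmark extent (hypothetical choice).
    pad_x = (right - left) * margin
    pad_y = (bottom - top) * margin
    return (
        max(0, int(left - pad_x)),
        max(0, int(top - pad_y)),
        min(width, int(right + pad_x)),
        min(height, int(bottom + pad_y)),
    )


def prepare_face_box(
    landmarks: Optional[Sequence[Tuple[float, float]]],
    manual_box: Optional[Box],
    image_size: Tuple[int, int],
) -> Box:
    """Prefer the artist's manual frame; otherwise fall back to detected landmarks."""
    if manual_box is not None:
        return manual_box
    if landmarks:
        return face_box_from_landmarks(landmarks, image_size)
    raise ValueError("no landmarks detected and no manual frame supplied")
```

Under these assumptions, a stylized face that the detector misses can still be processed by passing `manual_box`, which is one plausible reading of "artist-centric" preparation.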


Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

Authors:Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu

Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.


Paper and project links

PDF

Summary

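This paper addresses multi-person human mesh recovery from a single image, a challenging task limited by the scarcity of in-the-wild training data. Existing pseudo-ground-truth pipelines process each person individually, so they lack scene-level consistency and produce conflicting depths and scales within the same image. To address this, the paper introduces Depth-conditioned Translation Optimization (DTO), a new method that jointly optimizes the camera-space translations of all people in the scene. Using anthropometric priors on human height together with depth cues, DTO solves for a scene-consistent placement of all subjects within a Maximum a Posteriori (MAP) framework. The paper also proposes the Metric-Aware HMR network, which directly estimates human mesh and camera parameters in metric scale.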

Key Takeaways

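Multi-person human mesh recovery from a single image is challenging because in-the-wild training data is scarce.
Existing pipelines process each person individually and therefore lack scene-level consistency.
DTO addresses this by jointly optimizing the camera-space translations of everyone in the scene.
DTO relies on anthropometric priors on human height and on depth cues.
DTO solves for a scene-consistent placement of all subjects within a Maximum a Posteriori framework.
The paper proposes the Metric-Aware HMR network, which directly estimates human mesh and camera parameters in metric scale.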
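
To make the idea of jointly placing a crowd under a height prior and depth cues concrete, here is a toy MAP-style sketch. It is not the paper's DTO objective. It assumes a pinhole camera (metric height ≈ depth × pixel height / focal length), a Gaussian prior on height, and a monocular depth estimate that is correct only up to one shared scale for the whole image. All names and default values (`solve_depths_map`, `sigma_height_m`, `sigma_depth_m`, and so on) are hypothetical. The resulting least-squares energy is minimized by alternating closed-form updates.

```python
from typing import List, Sequence, Tuple


def solve_depths_map(
    pixel_heights: Sequence[float],
    rel_depths: Sequence[float],
    focal_px: float,
    mean_height_m: float = 1.7,
    sigma_height_m: float = 0.1,
    sigma_depth_m: float = 0.5,
    iterations: int = 200,
) -> Tuple[List[float], float]:
    """Jointly estimate per-person depths z_i and one shared depth scale s.

    Minimizes, over all z_i and s (assumed toy energy):
        sum_i (a_i * z_i - mean_height)^2 / sigma_height^2
            + (z_i - s * d_i)^2 / sigma_depth^2
    with a_i = pixel_height_i / focal_px and d_i the relative depth cue.
    """
    a = [p / focal_px for p in pixel_heights]
    w_h = 1.0 / sigma_height_m ** 2  # weight of the height prior
    w_d = 1.0 / sigma_depth_m ** 2   # weight of the depth cue
    scale = 1.0
    depths = [scale * d for d in rel_depths]
    for _ in range(iterations):
        # With the scale fixed, each depth has a closed-form minimizer.
        depths = [
            (w_h * ai * mean_height_m + w_d * scale * di) / (w_h * ai * ai + w_d)
            for ai, di in zip(a, rel_depths)
        ]
        # With the depths fixed, the shared scale is a 1-D least-squares fit.
        scale = sum(d * z for d, z in zip(rel_depths, depths)) / sum(
            d * d for d in rel_depths
        )
    return depths, scale
```

Because the scale `s` is shared, each person's depth is influenced by everyone else in the image. That shared coupling is what a per-person pipeline lacks.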


Zero-Shot Video Translation via Token Warping

Authors:Haiming Zhu, Yangyang Xu, Jun Yu, Shengfeng He

With the revolution of generative AI, video-related tasks have been widely studied. However, current state-of-the-art video models still lag behind image models in visual quality and user control over generated content. In this paper, we introduce TokenWarping, a novel framework for temporally coherent video translation. Existing diffusion-based video editing approaches rely solely on key and value patches in self-attention to ensure temporal consistency, often sacrificing the preservation of local and structural regions. Critically, these methods overlook the significance of the query patches in achieving accurate feature aggregation and temporal coherence. In contrast, TokenWarping leverages complementary token priors by constructing temporal correlations across different frames. Our method begins by extracting optical flows from source videos. During the denoising process of the diffusion model, these optical flows are used to warp the previous frame’s query, key, and value patches, aligning them with the current frame’s patches. By directly warping the query patches, we enhance feature aggregation in self-attention, while warping the key and value patches ensures temporal consistency across frames. This token warping imposes explicit constraints on the self-attention layer outputs, effectively ensuring temporally coherent translation. Our framework does not require any additional training or fine-tuning and can be seamlessly integrated with existing text-to-image editing methods. We conduct extensive experiments on various video translation tasks, demonstrating that TokenWarping surpasses state-of-the-art methods both qualitatively and quantitatively. Video demonstrations are available in supplementary materials.


Paper and project links

PDF

Summary

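This paper introduces TokenWarping, a new framework for video translation that targets the gap in visual quality and user control between video models and image models. The framework builds temporal correlations across frames from token priors and uses optical flow to warp the previous frame's query tokens directly, alongside its key and value tokens, which enhances feature aggregation in self-attention and helps ensure temporal coherence. TokenWarping needs no additional training or fine-tuning, integrates seamlessly with existing text-to-image editing methods, and, according to the authors' experiments, surpasses state-of-the-art methods on a range of video translation tasks.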
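
The flow-based warping step can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation. It assumes the optical flow has already been resized to the token grid, that `flow[r][c] = (dx, dy)` points from a current-frame location to its match in the previous frame, and that nearest-neighbour sampling is good enough. The names `warp_tokens` and `warp_qkv` are hypothetical.

```python
from typing import List, Sequence, Tuple

Token = Sequence[float]
Grid = List[List[Token]]                 # grid[row][col] is one token vector
Flow = List[List[Tuple[float, float]]]   # flow[row][col] = (dx, dy) in token units


def warp_tokens(prev: Grid, flow: Flow) -> Grid:
    """Backward-warp the previous frame's tokens onto the current frame's grid."""
    rows, cols = len(prev), len(prev[0])
    warped: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            dx, dy = flow[r][c]
            # Nearest-neighbour lookup, clamped to the grid borders.
            src_c = min(max(int(round(c + dx)), 0), cols - 1)
            src_r = min(max(int(round(r + dy)), 0), rows - 1)
            row.append(prev[src_r][src_c])
        warped.append(row)
    return warped


def warp_qkv(prev_q: Grid, prev_k: Grid, prev_v: Grid, flow: Flow) -> Tuple[Grid, Grid, Grid]:
    """Apply the same flow to the previous frame's query, key, and value tokens."""
    return warp_tokens(prev_q, flow), warp_tokens(prev_k, flow), warp_tokens(prev_v, flow)
```

How the warped tokens are then combined with the current frame's own tokens inside self-attention is not specified in the abstract, so the sketch stops at the alignment step.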

Key Takeaways