PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

¹Fudan University  ²Youtu Lab, Tencent  ³Hong Kong University of Science and Technology  ⁴Western University  ⁵University of Chinese Academy of Sciences
*Equal Contribution  †Corresponding Author

Visual comparison between Uni-ControlNet and our proposed method under different conditional controls with the same text prompt. (a, left two columns) Text and various visual controls, where C1, C2, C3, and C4 denote the edge, sketch, depth, and pose maps, respectively. (b, middle columns) Generation results from Uni-ControlNet. (c, last column) Generation results from our PixelPonder. Previous methods struggle to generate coherent results under multiple conditions, while our results maintain strong similarity to the respective visual controls.

Abstract

Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning: simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. Because they employ separate control branches, these methods often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in the generated images. To address this issue, we present PixelPonder, a novel unified control framework that enables effective control over multiple visual conditions within a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme modulates the influence of each condition according to the denoising timestep, progressively transitioning from structural preservation to texture refinement and fully exploiting the control information of the different condition categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvements in spatial alignment accuracy while maintaining high textual semantic consistency.

Method

Overall pipeline of the proposed PixelPonder.
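
The patch-level adaptive condition selection described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of the idea, not the official implementation: it scores each candidate condition's features against the current latent features for every non-overlapping patch (cosine similarity is our assumed relevance measure, and the function name select_patches is ours) and composites the per-patch winners into a single condition map.

    import torch
    import torch.nn.functional as F

    def select_patches(cond_feats: torch.Tensor,
                       latent_feats: torch.Tensor,
                       patch_size: int = 8) -> torch.Tensor:
        """Composite one condition map by picking, for every non-overlapping
        spatial patch, the candidate condition whose features best match the
        current latent features.

        cond_feats:   (K, C, H, W)  K candidate condition feature maps
                      (e.g. encoded edge, sketch, depth, pose)
        latent_feats: (C, H, W)     features of the current noisy latent
        returns:      (C, H, W)     patchwise-composited condition map
        """
        K, C, H, W = cond_feats.shape
        p = patch_size
        assert H % p == 0 and W % p == 0, "H, W must be divisible by patch_size"
        # Unfold into non-overlapping patches: (K, n_patches, C*p*p)
        cond_patches = F.unfold(cond_feats, kernel_size=p, stride=p).transpose(1, 2)
        lat_patches = F.unfold(latent_feats.unsqueeze(0),
                               kernel_size=p, stride=p).transpose(1, 2)
        # Per-patch relevance score against the latent patch
        scores = F.cosine_similarity(cond_patches, lat_patches, dim=-1)  # (K, n_patches)
        best = scores.argmax(dim=0)                                      # (n_patches,)
        # Gather the winning condition's features for every patch
        idx = best.view(1, -1, 1).expand(1, -1, cond_patches.shape[-1])
        chosen = cond_patches.gather(0, idx)          # (1, n_patches, C*p*p)
        # Fold the selected patches back into one (C, H, W) condition map
        out = F.fold(chosen.transpose(1, 2), output_size=(H, W),
                     kernel_size=p, stride=p)
        return out.squeeze(0)

Because patches are non-overlapping, the fold at the end is a lossless reassembly; each sub-region of the composite map carries exactly one condition's guidance, which is what avoids global interference between heterogeneous controls.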


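The time-aware control injection scheme modulates how strongly each condition steers the backbone as denoising progresses. Below is a minimal sketch under assumed details: a cosine schedule (our assumption, not necessarily the paper's exact schedule, and injection_weight is a hypothetical name) gates structural conditions such as depth and pose more heavily at high-noise timesteps, then hands influence to texture-oriented conditions such as edge and sketch for late refinement.

    import torch

    def injection_weight(t: torch.Tensor, T: int = 1000,
                         structural: bool = True) -> torch.Tensor:
        """Timestep-dependent gate on a control residual.

        With this (assumed) cosine schedule, w is close to 1 near t = T
        (high noise), so structural conditions dominate early, and close
        to 0 near t = 0, shifting control to texture-oriented conditions.
        """
        s = t.float() / T                          # normalize to [0, 1]
        w = 0.5 * (1.0 - torch.cos(torch.pi * s))  # 0 at t=0, 1 at t=T
        return w if structural else 1.0 - w

    # Usage inside a denoising step (h_t: backbone features,
    # r_struct / r_texture: encoded control residuals):
    #   h_t = h_t + injection_weight(t) * r_struct \
    #             + injection_weight(t, structural=False) * r_texture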


Comparison with Other Methods

BibTeX

@article{pan2025pixelponder,
  title={PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation},
  author={Pan, Yanjie and He, Qingdong and Jiang, Zhengkai and Xu, Pengcheng and Wang, Chaoyi and Peng, Jinlong and Wang, Haoxuan and Cao, Yun and Gan, Zhenye and Chi, Mingmin and Peng, Bo and Wang, Yabiao},
  journal={arXiv preprint arXiv:2503.06684},
  year={2025}
}