EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

arXiv preprint

Zicheng Duan¹     Yuxuan Ding²     Chenhui Gou³     Ziqin Zhou¹     Ethan Smith⁴     Lingqiao Liu*¹
¹Australian Institute of Machine Learning, University of Adelaide     ²Xidian University
³Monash University     ⁴Leonardo.AI

TL;DR: EZIGen enhances zero-shot personalized image generation by integrating a carefully designed Reference UNet extractor and decoupled guidance, preserving subject identity while maintaining text-prompt flexibility.





Abstract

Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and a subject image, requiring the model to incorporate both sources of guidance effectively. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. In this study, we uncover key insights into achieving a high-quality balance between subject identity preservation and text adherence, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, built on two main components: a carefully crafted subject image encoder based on the pretrained UNet of the Stable Diffusion model, and a generation process that balances the two guidances by separating their dominant stages and revisiting certain timesteps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data.

Method


Illustration of the proposed system. We start by encoding and injecting subject features: using our fixed Reference UNet, we extract a set of intermediate latent features from the noisy subject image during a simulated late denoising process, treat these features as offline Subject Guidance, and inject them into the Main UNet via a learnable adapter. We then decouple the generation process into a Layout Generation Process and an Appearance Transfer Process: the text prompt first serves as Text Guidance in the original text-guided diffusion process to produce a layout latent at a middle timestep t, after which the offline Subject Guidance transfers the subject appearance into the layout latent over the remaining timesteps. Finally, to achieve a complete transfer, our Iterative Appearance Transfer mechanism re-noises the generated image and repeats the Appearance Transfer Process for N rounds until the subject is fully transferred; a code sketch of this loop is given below.
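To make the staged procedure concrete, here is a minimal PyTorch-style sketch of the decoupled loop. The interfaces (main_unet, ref_unet, scheduler, t_switch, num_rounds) are hypothetical placeholders chosen for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def ezigen_generate(main_unet, ref_unet, scheduler, text_emb, subject_latent,
                    t_switch, num_rounds=3, latent_shape=(1, 4, 64, 64)):
    """Sketch of the decoupled generation loop (hypothetical interfaces).

    Assumed interfaces:
      main_unet(x, t, text_emb, subject_feats) -> noise prediction
      ref_unet(x, t) -> (noise prediction, intermediate features)
      scheduler.timesteps: descending timesteps
      scheduler.step(eps, t, x) -> x at the previous timestep
      scheduler.add_noise(x0, noise, t) -> noisy latent at timestep t
    """
    # 1) Offline Subject Guidance: run the fixed Reference UNet on the noisy
    #    subject image at the late (small) timesteps and cache its features.
    subject_feats = {}
    for t in scheduler.timesteps:
        if t <= t_switch:
            noisy_subj = scheduler.add_noise(subject_latent,
                                             torch.randn_like(subject_latent), t)
            _, subject_feats[int(t)] = ref_unet(noisy_subj, t)

    # 2) Layout Generation: plain text-guided denoising from T down to t_switch,
    #    with no subject features injected.
    x = torch.randn(latent_shape)
    for t in scheduler.timesteps:
        if t <= t_switch:
            break
        eps = main_unet(x, t, text_emb, subject_feats=None)
        x = scheduler.step(eps, t, x)

    # 3) Iterative Appearance Transfer: finish denoising with the cached subject
    #    features injected, then re-noise back to t_switch and repeat N rounds.
    for _ in range(num_rounds):
        x_cur = x
        for t in scheduler.timesteps:
            if t > t_switch:
                continue
            eps = main_unet(x_cur, t, text_emb,
                            subject_feats=subject_feats[int(t)])
            x_cur = scheduler.step(eps, t, x_cur)
        x = scheduler.add_noise(x_cur, torch.randn_like(x_cur), t_switch)

    return x_cur  # fully denoised latent; decode with the VAE to get the image
```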

Comparisons

We conduct in-depth comparisons of our method against prior work on several tasks: personalized image generation, personalized image editing, and human content generation.

Results on DreamBench dataset


Our design showcases strong subject identity preservation without sacrificing text-prompt adherence, outperforming all previous competitors.

Results on DreamEdit dataset


When equipped with a foreground/background mask and image inversion, our method naturally becomes a subject-driven editor, and it demonstrates outstanding performance on the DreamEditBench dataset; a sketch of the masked editing step follows.
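One plausible way the mask and inversion fit together is standard masked latent blending: invert the source image to obtain a latent trajectory, then at each denoising step regenerate only the foreground while restoring the background from the inverted latents. The sketch below is an assumption for illustration; the names (inverted_latents, fg_mask) are hypothetical.

```python
import torch

@torch.no_grad()
def masked_edit_step(main_unet, scheduler, x_t, t, text_emb, subject_feats,
                     inverted_latents, fg_mask):
    """One denoising step of subject-driven editing (hypothetical interfaces).

    inverted_latents: dict mapping timestep -> source-image latent obtained
    via DDIM inversion; fg_mask: 1 where the subject is regenerated,
    0 where the original background is kept.
    """
    eps = main_unet(x_t, t, text_emb, subject_feats=subject_feats)
    x_prev = scheduler.step(eps, t, x_t)
    # Blend: keep the freshly generated subject region, restore the
    # background from the inverted source latent at this step.
    return fg_mask * x_prev + (1.0 - fg_mask) * inverted_latents[int(t)]
```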

Results on FastComposer benchmark


Thanks to high-quality feature extraction and the decoupled generation technique, our method produces versatile, high-quality human face images WITHOUT training on domain-specific or large-scale datasets.

Subject Interpolation

Thanks to the iterative nature of its generation, our method naturally interpolates between the original and reference subjects during subject-driven editing; see the sketch below.
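A rough sketch of how this falls out of the iterative loop: decoding the latent after each appearance-transfer round yields a sequence of frames that drifts from the source subject toward the reference subject. All callables here are hypothetical placeholders, not the released code.

```python
def subject_interpolation(layout_latent, num_rounds,
                          appearance_transfer, renoise, decode):
    """Collect per-round outputs of iterative appearance transfer.

    appearance_transfer(x): one full transfer round (see the Method sketch);
    renoise(x): add noise back to the switch timestep;
    decode(x): map a latent to an image. All three are placeholders.
    """
    frames, x = [], layout_latent
    for _ in range(num_rounds):
        x = appearance_transfer(x)
        frames.append(decode(x))  # each frame is one step of the interpolation
        x = renoise(x)
    return frames  # frames sweep from the source subject to the reference
```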


More Visualization Results

Personalized Image Generation


Personalized Image Editing


Subject Interpolation
