EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

arXiv preprint

Zicheng Duan¹     Yuxuan Ding²     Chenhui Gou³     Ziqin Zhou¹     Ethan Smith⁴     Lingqiao Liu*¹
¹Australian Institute of Machine Learning, University of Adelaide     ²Xidian University
³Monash University     ⁴Leonardo.AI

TL;DR: EZIGen enhances zero-shot personalized image generation by integrating a carefully designed Reference UNet extractor and decoupled guidance, preserving subject identity while maintaining text-prompt flexibility.





Abstract

Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and a subject image, requiring the model to incorporate both sources of guidance effectively. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. In this study, we uncover key insights into achieving a high-quality balance between subject identity preservation and text adherence, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, built on two main components: a carefully crafted subject image encoder based on the pretrained UNet of the Stable Diffusion model, and a generation process that balances the two guidances by separating their dominant stages and revisiting certain timesteps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data.

Method


Illustration of the proposed system. We start by encoding and injecting subject features: using our fixed Reference UNet, we extract a set of intermediate latent features from the noisy subject image during a simulated late denoising process, treat these features as offline Subject Guidance, and inject them into the Main UNet via a learnable adapter. We then decouple the generation process into a Layout Generation Process and an Appearance Transfer Process: the text prompt first serves as Text Guidance in the original text-guided diffusion process to produce a layout latent at a middle timestep t, after which the offline Subject Guidance transfers the subject appearance into the layout latent over the remaining timesteps. Finally, to achieve a complete transfer, our Iterative Appearance Transfer mechanism re-noises the generated image and repeats the Appearance Transfer Process for N rounds until the subject is fully transferred; a code sketch of this loop is given below.
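To make the staged procedure concrete, here is a minimal PyTorch-style sketch of the decoupled loop. The interfaces (main_unet, ref_unet, scheduler, t_switch, num_rounds) are hypothetical placeholders chosen for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def ezigen_generate(main_unet, ref_unet, scheduler, text_emb, subject_latent,
                    t_switch, num_rounds=3, latent_shape=(1, 4, 64, 64)):
    """Sketch of the decoupled generation loop (hypothetical interfaces).

    Assumed interfaces:
      main_unet(x, t, text_emb, subject_feats) -> noise prediction
      ref_unet(x, t) -> (noise prediction, intermediate features)
      scheduler.timesteps: descending timesteps
      scheduler.step(eps, t, x) -> x at the previous timestep
      scheduler.add_noise(x0, noise, t) -> noisy latent at timestep t
    """
    # 1) Offline Subject Guidance: run the fixed Reference UNet on the noisy
    #    subject image at the late (small) timesteps and cache its features.
    subject_feats = {}
    for t in scheduler.timesteps:
        if t <= t_switch:
            noisy_subj = scheduler.add_noise(subject_latent,
                                             torch.randn_like(subject_latent), t)
            _, subject_feats[int(t)] = ref_unet(noisy_subj, t)

    # 2) Layout Generation: plain text-guided denoising from T down to t_switch,
    #    with no subject features injected.
    x = torch.randn(latent_shape)
    for t in scheduler.timesteps:
        if t <= t_switch:
            break
        eps = main_unet(x, t, text_emb, subject_feats=None)
        x = scheduler.step(eps, t, x)

    # 3) Iterative Appearance Transfer: finish denoising with the cached subject
    #    features injected, then re-noise back to t_switch and repeat N rounds.
    for _ in range(num_rounds):
        x_cur = x
        for t in scheduler.timesteps:
            if t > t_switch:
                continue
            eps = main_unet(x_cur, t, text_emb,
                            subject_feats=subject_feats[int(t)])
            x_cur = scheduler.step(eps, t, x_cur)
        x = scheduler.add_noise(x_cur, torch.randn_like(x_cur), t_switch)

    return x_cur  # fully denoised latent; decode with the VAE to get the image
```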

Comparisons

We conduct in-depth comparisons of our method against prior work on several tasks: personalized image generation, personalized image editing, and human content generation.

Results on DreamBench dataset


Our design showcases strong subject identity preservation without sacrificing text-prompt adherence, outperforming all previous competitors.

Results on DreamEdit dataset


When equipped with a foreground/background mask and image inversion, our method naturally becomes a subject-driven editor, and it demonstrates outstanding performance on the DreamEditBench dataset; a sketch of the masked editing step follows.
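One plausible way the mask and inversion fit together is standard masked latent blending: invert the source image to obtain a latent trajectory, then at each denoising step regenerate only the foreground while restoring the background from the inverted latents. The sketch below is an assumption for illustration; the names (inverted_latents, fg_mask) are hypothetical.

```python
import torch

@torch.no_grad()
def masked_edit_step(main_unet, scheduler, x_t, t, text_emb, subject_feats,
                     inverted_latents, fg_mask):
    """One denoising step of subject-driven editing (hypothetical interfaces).

    inverted_latents: dict mapping timestep -> source-image latent obtained
    via DDIM inversion; fg_mask: 1 where the subject is regenerated,
    0 where the original background is kept.
    """
    eps = main_unet(x_t, t, text_emb, subject_feats=subject_feats)
    x_prev = scheduler.step(eps, t, x_t)
    # Blend: keep the freshly generated subject region, restore the
    # background from the inverted source latent at this step.
    return fg_mask * x_prev + (1.0 - fg_mask) * inverted_latents[int(t)]
```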

Results on FastComposer benchmark


Thanks to high-quality feature extraction and the decoupled generation technique, our method produces versatile, high-quality human face images WITHOUT training on domain-specific or large-scale datasets.

Subject Interpolation

Thanks to the iterative nature of its generation, our method naturally interpolates between the original and reference subjects during subject-driven editing; see the sketch below.
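A rough sketch of how this falls out of the iterative loop: decoding the latent after each appearance-transfer round yields a sequence of frames that drifts from the source subject toward the reference subject. All callables here are hypothetical placeholders, not the released code.

```python
def subject_interpolation(layout_latent, num_rounds,
                          appearance_transfer, renoise, decode):
    """Collect per-round outputs of iterative appearance transfer.

    appearance_transfer(x): one full transfer round (see the Method sketch);
    renoise(x): add noise back to the switch timestep;
    decode(x): map a latent to an image. All three are placeholders.
    """
    frames, x = [], layout_latent
    for _ in range(num_rounds):
        x = appearance_transfer(x)
        frames.append(decode(x))  # each frame is one step of the interpolation
        x = renoise(x)
    return frames  # frames sweep from the source subject to the reference
```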


More Visualization Results

Personalized Image Generation


Personalized Image Editing


Subject Interpolation
