EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance

arXiv preprint

Zicheng Duan1     Yuxuan Ding2      Chenhui Gou3      Ziqin Zhou1     
Ethan Smith4      Lingqiao Liu*1
1Australian Institute of Machine Learning, University of Adelaide     2Xidian University    
3Monash University     4Leonardo.AI    

TL;DR: EZIGen enhances zero-shot subject-driven generation by combining a carefully designed Reference UNet feature extractor with decoupled guidance, preserving subject identity while maintaining text-alignment flexibility.

Abstract

Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advancements in diffusion-model-based methods, existing approaches still struggle to balance identity preservation with text prompt alignment. In this study, we conduct an in-depth investigation into this issue and uncover key insights for achieving effective identity preservation while maintaining a strong balance. Our key findings include: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image Encoder based on the UNet architecture of the pretrained Stable Diffusion model to ensure high-quality identity transfer, and a process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.

Method


Illustration of the proposed system. We begin by encoding and injecting the subject feature: a fixed Reference UNet extracts a set of intermediate latent features during a simulated late denoising step of the noised subject image, and these features serve as offline Subject Guidance that is injected into the Main UNet through a learnable adapter. We then decouple the generation process into a Layout Generation Process and an Appearance Transfer Process. The text prompt first acts as Text Guidance in the original text-guided diffusion process to obtain a layout latent at an intermediate timestep t; the offline Subject Guidance is then incorporated over the remaining timesteps to transfer the subject's appearance into the layout latent. Finally, to achieve a complete transfer, an Iterative Appearance Transfer mechanism re-adds noise to the generated image and repeats the Appearance Transfer Process for N rounds until the result is satisfactory.
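The sketch below is a minimal, illustrative rendering of this decoupled loop, assuming generic `denoise_step`, `extract_subject_features`, and `add_noise` helpers (hypothetical names wrapping the Main UNet, the frozen Reference UNet, and the noise scheduler); it is not the authors' actual implementation.

```python
import torch

@torch.no_grad()
def ezigen_generate(latent, subject_latent, text_emb, timesteps,
                    denoise_step, extract_subject_features, add_noise,
                    t_switch, num_rounds=3):
    """Decoupled guidance: text-only Layout Generation down to `t_switch`,
    then subject-guided Appearance Transfer, repeated `num_rounds` times.
    `timesteps` is assumed to be in descending order (T -> 0)."""
    # Offline Subject Guidance: intermediate features from a simulated late
    # denoising step of the (slightly noised) subject image, via the Reference UNet.
    subject_feats = extract_subject_features(subject_latent)

    # 1) Layout Generation: plain text-guided denoising from T down to t_switch.
    for t in timesteps:
        if t <= t_switch:
            break
        latent = denoise_step(latent, t, text_emb, subject_feats=None)

    # 2) Iterative Appearance Transfer: finish denoising with Subject Guidance
    #    injected through the adapter, then re-noise and repeat.
    subject_steps = [t for t in timesteps if t <= t_switch]
    x = latent
    for r in range(num_rounds):
        for t in subject_steps:
            x = denoise_step(x, t, text_emb, subject_feats=subject_feats)
        if r < num_rounds - 1:
            # Re-add noise to the generated latent back to t_switch, keeping the
            # layout while continuing to refine the subject appearance.
            x = add_noise(x, t_switch)
    return x
```

Decoupling the process this way lets the text prompt settle the global layout before subject features are injected, so identity transfer does not interfere with text alignment.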

Comparisons

We conduct in-depth comparisons of our method against prior work on several tasks: subject-driven image generation, subject-driven image editing, and human content generation.

Results on DreamBench dataset


Our design showcases strong subject identity preservation without sacrificing text prompt adherence, outperforming all previous competitors.

Results on DreamEdit dataset


Our method is naturally a subject-driven editor when equipped with a foreground/background mask and image inversion, and it demonstrates outstanding performance on the DreamEditBench dataset.
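As a rough sketch of how this editing variant can be realized, the snippet below blends the subject-guided latent with latents obtained by inverting the source image, using the foreground/background mask. The helper and tensor names are hypothetical, and the choice of inversion method (e.g., DDIM inversion) is an assumption rather than a detail stated here.

```python
import torch

def blend_background(gen_latent: torch.Tensor,
                     inverted_latent: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Keep the background from the inverted source-image latent and let the
    subject-guided generation fill in the masked foreground region.

    `mask` is 1 inside the foreground region to be edited, 0 elsewhere.
    """
    return mask * gen_latent + (1.0 - mask) * inverted_latent
```

Applying such a blend at each denoising step keeps the background consistent with the source image, while the subject's appearance is transferred only inside the masked region.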

Results on FastComposer benchmark


Thanks to high-quality feature extraction and the decoupled generation technique, our method produces versatile, high-quality human face images without training on domain-specific or large-scale datasets.

Subject Interpolation

Owing to the iterative nature of the generation process, our method naturally produces interpolations between subjects during subject-driven editing.


More Visualization Results

Subject-Driven Generation


Subject-Driven Editing


Human Content Generation
