Dense-Face: Realistic Personalized Face Generation Model via Dense Annotation Prediction


[Paper]      [Code]      [Dataset]      [Model Weights]     


Gallery

Abstract

Text-to-image (T2I) personalization diffusion models can generate images of a novel concept based on a text prompt. However, in the human face generation domain, existing T2I methods either require test-time fine-tuning or fail to generate images that align well with the given text prompt. In this work, we propose a new T2I personalization diffusion model, called Dense-Face, which generates face images whose identity is consistent with the given reference subject and which align well with the given text prompt. Specifically, we introduce a pose-controllable adapter to achieve high-fidelity, pose-controllable generation while maintaining the text-based editing ability of the pretrained Stable Diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.




Gallery

Figure: our proposed Dense-Face adds components on top of the pre-trained SD, including a pose branch and a PC-adapter. These two components give Dense-Face two generation modes: a text-editing mode and a face-generation mode, which are combined via latent-space blending for personalized generation. For example, given one of the reference subject images, the text-editing mode generates a base image, and the face-generation mode updates the face region to preserve identity in the final result (see the sketch below).
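The following is a minimal sketch of the latent-space blending step described above, written in PyTorch. The function and argument names (blend_latents, base_latents, face_latents, face_mask) are illustrative placeholders, not the released implementation.

```python
import torch

def blend_latents(base_latents: torch.Tensor,
                  face_latents: torch.Tensor,
                  face_mask: torch.Tensor) -> torch.Tensor:
    """Combine the two generation modes in latent space.

    base_latents : latents from the text-editing mode (whole scene).
    face_latents : latents from the face-generation mode (identity-preserving face).
    face_mask    : binary face-region mask resized to the latent resolution,
                   broadcastable to the latent shape (e.g., B x 1 x H x W).
    """
    # Keep the background from the text-editing result and take the face
    # region from the identity-preserving face-generation result.
    return face_mask * face_latents + (1.0 - face_mask) * base_latents
```

In practice such a blend could be applied once to the final latents or at intermediate denoising steps; the exact schedule used by Dense-Face is not specified in this caption.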

Gallery

Figure: Dense-Face in face-generation mode generates realistic face images at different pose views.



Gallery

Left: We propose Dense-Face for personalized image generation, which introduces additional components on top of the pre-trained T2I SD: a pose-controllable (PC) adapter, a pose branch (i.e., εpose), and an annotation prediction module (i.e., εdense). The input includes captions, head pose, and reference image (i.e., Ipose and Iid). The output includes the generated face (Itar.) and dense face annotations (e.g., face depths (D), pseudo masks (P), and landmarks (L)). During training, we only train εpose, εdense, and the PC-adapter, and freeze the pre-trained SD.
Right: The PC-adapter (w′q, w′k, and w′v) modifies the forward propagation of the cross-attention module (wq, wk, and wv), from fout (orange dashed line) to f′out (red solid line). εdense utilizes internal UNet features (i.e., fdense) to predict dense face annotations.
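As a rough illustration of the PC-adapter mechanism in the right panel, the sketch below adds trainable adapter projections (w′q, w′k, w′v) in parallel to the frozen cross-attention projections (wq, wk, wv) and sums the two attention outputs to turn fout into f′out. The blending weight `alpha` and the exact conditioning features are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCAdapterCrossAttention(nn.Module):
    """Cross-attention whose output f_out is modified to f'_out by adapter projections."""

    def __init__(self, dim: int, cond_dim: int, alpha: float = 1.0):
        super().__init__()
        # Frozen projections of the pre-trained SD cross-attention (w_q, w_k, w_v).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(cond_dim, dim, bias=False)
        self.w_v = nn.Linear(cond_dim, dim, bias=False)
        # Trainable PC-adapter projections (w'_q, w'_k, w'_v).
        self.w_q_adapter = nn.Linear(dim, dim, bias=False)
        self.w_k_adapter = nn.Linear(cond_dim, dim, bias=False)
        self.w_v_adapter = nn.Linear(cond_dim, dim, bias=False)
        self.alpha = alpha  # assumed blending weight for the adapter path

    def forward(self, x, text_cond, id_pose_cond):
        # Original text cross-attention path: f_out.
        q, k, v = self.w_q(x), self.w_k(text_cond), self.w_v(text_cond)
        f_out = F.scaled_dot_product_attention(q, k, v)
        # Adapter path attending to identity/pose conditioning features.
        q2 = self.w_q_adapter(x)
        k2 = self.w_k_adapter(id_pose_cond)
        v2 = self.w_v_adapter(id_pose_cond)
        f_adapter = F.scaled_dot_product_attention(q2, k2, v2)
        # Modified output f'_out.
        return f_out + self.alpha * f_adapter
```

Consistent with the training setup described above, only the adapter projections would be optimized here while the pre-trained SD projections stay frozen.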



Demo



Gallery

Dense-Face can place subjects in diverse contexts with changed attributes, such as hair color and clothes.



Gallery

Additional comparisons among different personalized generation methods. Our proposed Dense-Face generates images with an identity consistent with the reference image, which can even be an old photo.



T2I-Dense-Face Dataset



Gallery

Additional face swapping results from the proposed method. Dense-Face achieves identity preservation comparable to previous work.



Gallery

Additional visualizations of dense annotation prediction. The proposed Dense-Face generates high-fidelity, identity-preserved images and the corresponding annotations (e.g., depth images, pseudo masks, and landmarks). Generated images can be at large pose views.



Gallery

Additional visualizations of different subject stylizations.