[Paper] [Code] [Dataset] [Model Weights]
Text-to-image (T2I) personalization diffusion models can generate images of a novel concept based on a text prompt. However, in the human face generation domain, existing T2I methods either require test-time fine-tuning or fail to generate images that align well with the given text prompt. In this work, we propose a new T2I personalization diffusion model, called Dense-Face, which generates face images whose identity is consistent with the given reference subject and which align well with the given text prompt. Specifically, we introduce a pose-controllable adapter to achieve high-fidelity, pose-controllable generation while maintaining the text-based editing ability of the pre-trained stable diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge of face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.
Figure: Our proposed Dense-Face adds two components, a pose branch and a PC-adapter, on top of the pre-trained SD. These two components give Dense-Face two generation modes: a text-editing mode and a face-generation mode. The two modes are combined via latent-space blending for personalized generation. For example, given one reference image of the subject, the text-editing mode generates a base image, and the face-generation mode updates the face region to preserve identity in the final result.
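The latent-space blending described above can be sketched as a mask-weighted combination of the two modes' latents: inside the face region the face-generation latent is used, and outside it the text-editing latent is kept. The function and variable names below are hypothetical, a minimal sketch of the blending step, not the paper's exact implementation.

```python
import numpy as np

def blend_latents(base_latent, face_latent, face_mask):
    """Blend two diffusion latents with a spatial face mask.

    base_latent: latent from the text-editing mode, shape (C, H, W)
    face_latent: latent from the face-generation mode, same shape
    face_mask:   binary/soft mask of the face region, shape (H, W),
                 1 inside the face region, 0 outside.
    """
    m = face_mask[None, ...]  # broadcast the mask over the channel axis
    return m * face_latent + (1.0 - m) * base_latent

# Toy example: 4-channel 8x8 latents with a hypothetical face region.
base = np.zeros((4, 8, 8))   # stands in for the text-editing latent
face = np.ones((4, 8, 8))    # stands in for the face-generation latent
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0         # face region placeholder
blended = blend_latents(base, face, mask)
```

In practice the mask would come from the predicted pseudo face mask, and the blend would be applied in the SD latent space during or after denoising.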
Figure: Dense-Face in face-generation mode generates realistic face images from different pose views.
Left: We propose Dense-Face for personalized image generation, which introduces additional components on top of the pre-trained T2I-SD: a pose-controllable (PC) adapter, a pose branch (i.e., εpose), and an annotation prediction module (i.e., εdense). The inputs include captions, the head pose, and reference images (i.e., Ipose and Iid). The outputs include the generated face (Itar.) and dense face annotations (e.g., face depth (D), pseudo mask (P), and landmarks (L)). During training, we only train εpose, εdense, and the PC-adapter, and freeze the pre-trained SD.
Right: The PC-adapter (w′q, w′k, and w′v) modifies the forward propagation of the cross-attention module (wq, wk, and wv), from fout (orange dashed line) to f′out (red solid line). εdense uses internal UNet features (i.e., fdense) to predict dense face annotations.
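One plausible reading of the adapter mechanism is sketched below: standard single-head cross-attention with frozen SD projections (wq, wk, wv), where an optional adapter supplies alternative projections (w′q, w′k, w′v) that switch the output from fout to f′out. This is an assumption for illustration; the paper's exact adapter formulation may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, c, wq, wk, wv, adapter=None):
    """Single-head cross-attention sketch.

    x: image tokens, shape (N, d); c: text-condition tokens, shape (M, d).
    wq, wk, wv: frozen SD projection matrices, shape (d, d).
    adapter: optional (w'_q, w'_k, w'_v) tuple; when given, this
    hypothetical PC-adapter path is used instead of the frozen
    projections, changing the output from f_out to f'_out.
    """
    if adapter is not None:
        wq, wk, wv = adapter          # adapter path (red solid line)
    q, k, v = x @ wq, c @ wk, c @ wv  # project image/text tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                   # f_out (or f'_out with the adapter)
```

Only the adapter projections would be trained; the frozen path keeps the pre-trained SD's text-editing behavior intact.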
Dense-Face can place subjects in diverse contexts with changed attributes, such as hair color and clothes.
Additional comparisons among different personalized generation methods. Our proposed Dense-Face generates images whose identity is consistent with the reference image, even when the reference is an old photo.
Captions produced by different image-captioning models for three sample images (score in parentheses):

git-large-coco (0.3060): a young boy wearing glasses and a blue shirt.
blip2-2.7b (0.3017): a young boy wearing glasses and a blue shirt
blip2-flan-t5-xl (0.2991): a boy wearing glasses and a tie
blip-large (0.2968): smiling boy wearing glasses and a blue shirt and tie
blip-base (0.2949): a boy wearing glasses and a blue shirt
vit-swin (0.2733): A young man wearing glasses and a tie.
vit-gpt2 (0.2732): a young man wearing glasses and a tie.

blip2-2.7b (0.2752): a young boy with a brown shirt on looks surprised
blip2-flan-t5-xl (0.2693): a boy in a brown shirt is staring at a picture
git-large-coco (0.2585): a young boy looks at the camera.
vit-swin (0.2353): A young boy is smiling and holding a cell phone.
blip-base (0.2275): a young boy is holding a remote control
blip-large (0.2201): there is a young boy that is holding a remote in his hand
vit-gpt2 (0.2038): a young boy wearing a tie and smiling.

blip2-2.7b (0.3335): a young boy with glasses and a black suit
blip-base (0.3184): a young boy wearing glasses and a suit
blip2-flan-t5-xl (0.3168): a young boy in glasses and a suit is making a funny face
vit-swin (0.3072): A young boy wearing glasses and a tie.
blip-large (0.3049): there is a young boy wearing glasses and a suit and tie
git-large-coco (0.2284): [ unused0 ] is the most handsome boy in the world
vit-gpt2 (0.2256): a man wearing a suit and tie
Additional face swapping results from the proposed method. Dense-Face achieves identity preservation comparable to previous work.
Additional visualizations of the dense annotation prediction. The proposed Dense-Face can generate high-fidelity, identity-preserved images and their corresponding annotations (e.g., depth image, pseudo mask, and landmarks). Generated images can exhibit large pose variations.
Additional visualizations on different subject stylizations.