We release a dataset of 101 concepts, with 3-15 images per concept, for evaluating model customization methods. For a more detailed view of the target images, please refer to our webpage.
```
pip install gdown
gdown 1jj8JMtIS5-8vRtNtZ2x8isieWH9yetuK
unzip benchmark_dataset.zip
```
We provide a set of text prompts for each concept in the prompts folder. The prompt file corresponding to each concept is listed in dataset.json and dataset_multiconcept.json. The CLIP-feature-based image and text similarity can be calculated as:
```
python evaluate.py --sample_root {folder} --target_path {target-folder} --numgen {numgen}
```
- `sample_root`: the root location of the generated images. The folder should contain a subfolder `samples` with the generated images, as well as a `prompts.json` file with `{'imagename.stem': 'text prompt'}` for each image in the `samples` subfolder.
- `target_path`: path to the target real images.
- `numgen`: number of images in the `sample_root/samples` folder.
- `outpkl`: the location to save evaluation results (default: evaluation.pkl).
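Conceptually, each alignment score above reduces to a mean pairwise cosine similarity between L2-normalized feature vectors (CLIP or DINO embeddings of generated images, target images, or text prompts). A minimal sketch, assuming the features have already been extracted; the function name and shapes are illustrative and not taken from evaluate.py:

```python
import numpy as np

def mean_cosine_similarity(feats_a, feats_b):
    """Mean pairwise cosine similarity between two sets of feature vectors.

    feats_a: (n, d) array, e.g. CLIP embeddings of generated images.
    feats_b: (m, d) array, e.g. CLIP embeddings of target images or prompts.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=-1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=-1, keepdims=True)
    # Dot products of unit vectors are cosine similarities in [-1, 1].
    return float((a @ b.T).mean())

# Toy example with random "embeddings" (a real run would use CLIP features):
rng = np.random.default_rng(0)
gen = rng.normal(size=(4, 512))   # generated-image features
tgt = rng.normal(size=(3, 512))   # target-image features
score = mean_cosine_similarity(gen, tgt)
assert -1.0 <= score <= 1.0
```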
We compare our method (Custom Diffusion) with DreamBooth and Textual Inversion on this dataset. We trained DreamBooth and Textual Inversion with the hyperparameters suggested in the respective papers. Both our method and DreamBooth are trained with generated images as regularization.
Single concept
**200 DDPM steps**

| Method | Textual-alignment (CLIP) | Image-alignment (CLIP) | Image-alignment (DINO) |
|---|---|---|---|
| Textual Inversion | 0.6126 | 0.7524 | 0.5111 |
| DreamBooth | 0.7522 | 0.7520 | 0.5533 |
| Custom Diffusion (Ours) | 0.7602 | 0.7440 | 0.5311 |

**50 DDPM steps**

| Method | Textual-alignment (CLIP) | Image-alignment (CLIP) | Image-alignment (DINO) |
|---|---|---|---|
| Textual Inversion | 0.6117 | 0.7530 | 0.5128 |
| DreamBooth | 0.7514 | 0.7521 | 0.5541 |
| Custom Diffusion (Ours) | 0.7583 | 0.7456 | 0.5335 |
Multiple concepts
**200 DDPM steps**

| Method | Textual-alignment (CLIP) | Image-alignment (CLIP) | Image-alignment (DINO) |
|---|---|---|---|
| DreamBooth | 0.7383 | 0.6625 | 0.3816 |
| Custom Diffusion (Opt) | 0.7627 | 0.6577 | 0.3650 |
| Custom Diffusion (Joint) | 0.7567 | 0.6680 | 0.3760 |

**50 DDPM steps**

| Method | Textual-alignment (CLIP) | Image-alignment (CLIP) | Image-alignment (DINO) |
|---|---|---|---|
| DreamBooth | 0.7366 | 0.6636 | 0.3849 |
| Custom Diffusion (Opt) | 0.7599 | 0.6595 | 0.3684 |
| Custom Diffusion (Joint) | 0.7534 | 0.6704 | 0.3799 |
We used ChatGPT to generate 40 image captions for each concept, with instructions to either (1) change the background of the scene while keeping the main subject, (2) insert a new object or living thing into the scene along with the main subject, (3) vary the style of the main subject, or (4) change the property or material of the main subject. The generated text prompts are then manually filtered or modified to obtain the final 20 prompts for each concept. A similar strategy is applied for multiple concepts. Some of the prompts are also inspired by concurrent works, e.g., Perfusion, DreamBooth, SuTI, and BLIP-Diffusion.
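The four instruction categories can be pictured with hypothetical prompt templates for a single concept; the strings below are illustrative examples only (including the "V* dog" placeholder token), not the released prompts:

```python
# Hypothetical prompt templates, one per category; "V*" stands in for the
# custom-concept token. The actual released prompt files differ.
prompt_categories = {
    "background change": "a V* dog on a beach at sunset",
    "object insertion": "a V* dog playing with a ball in a park",
    "style variation": "a watercolor painting of a V* dog",
    "property change": "a V* dog made of origami paper",
}

for category, prompt in prompt_categories.items():
    print(f"{category}: {prompt}")
```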
Images taken from Unsplash are under the Unsplash License. Images captured by ourselves are released under the CC BY-SA 4.0 license. Flower category images are downloaded from Wikimedia/Flickr/Pixabay, and the links to the original images can also be found here for attribution.
We are grateful to Sheng-Yu Wang, Songwei Ge, Daohan Lu, Ruihan Gao, Roni Shechtman, Avani Sethi, Yijia Wang, Shagun Uppal, and Zhizhuo Zhou for helping with the dataset collection, and Nick Kolkin for the feedback.