We propose a simple yet effective sampling method, requiring no training, that generates creative combinations from two object texts. Bottom row: original images from Stable-Diffusion2 [2]. Middle row: combinations produced by our algorithm. Top row: artworks by Les Créatonautes [7], a French creative agency. Remarkably, the objects we create rival the artistry of works crafted by professional artists.
Generating creative combinatorial objects from two object texts poses a significant challenge for text-to-image (T2I) synthesis, which is often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method called balance swap-sampling. First, we propose a swapping mechanism that generates a novel set of combinatorial object images by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing the CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Our experiments on text pairs of objects from ImageNet show that our approach outperforms recent state-of-the-art T2I methods, even when the associated concepts appear implausible, such as lionfish-abacus and monarch-cheetah.
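To make the procedure concrete, the minimal sketch below illustrates the first two steps under stated assumptions: it uses the Hugging Face diffusers and transformers APIs, swaps a random subset of embedding channels as a stand-in for the paper's exact exchange of intrinsic elements, and accepts an image when its CLIP distances to the two original generations are roughly balanced (the threshold tau is an illustrative choice, and the final segmentation-based selection is omitted). It is not the authors' implementation.

```python
# Minimal sketch of balance swap-sampling (illustrative, not the official code).
# Assumptions: Hugging Face diffusers/transformers; the channel-wise swap and the
# threshold `tau` are hypothetical stand-ins for the paper's exact criteria.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(prompt: str) -> torch.Tensor:
    """Encode a prompt into the (1, 77, d) embedding that conditions the UNet."""
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    return pipe.text_encoder(ids)[0]

def clip_feat(image) -> torch.Tensor:
    """L2-normalized CLIP image feature."""
    inputs = proc(images=image, return_tensors="pt").to(device)
    return torch.nn.functional.normalize(clip.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def swap_sample(text_a: str, text_b: str, n_samples: int = 16, tau: float = 0.15):
    e_a, e_b = embed(text_a), embed(text_b)
    # CLIP features of the two original generations serve as references.
    ref_a = clip_feat(pipe(prompt_embeds=e_a).images[0])
    ref_b = clip_feat(pipe(prompt_embeds=e_b).images[0])
    accepted = []
    for _ in range(n_samples):
        # Step 1: swap a random subset of embedding channels between the prompts.
        mask = torch.rand(e_a.shape[-1], device=device) < 0.5
        img = pipe(prompt_embeds=torch.where(mask, e_a, e_b)).images[0]
        # Step 2: keep images whose CLIP distances to both originals are balanced,
        # i.e. they resemble neither parent too closely nor drift from both.
        f = clip_feat(img)
        d_a, d_b = 1 - (f @ ref_a.T).item(), 1 - (f @ ref_b.T).item()
        if abs(d_a - d_b) < tau:
            accepted.append((img, d_a, d_b))
    return accepted  # Step 3 (segmentation-based selection) would pick from these.

# Example: candidates = swap_sample("lionfish", "abacus")
```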
Visual results of combinatorial object generation.
The pipeline of our balance swap-sampling method. Starting from the text embeddings obtained by feeding the two given texts into the text encoder, we introduce a swapping operation to collect a set F of randomly swapped vectors as novel embeddings, generate a new image set I from them, and propose an acceptable region on which we build a sampling method for selecting an optimal combinatorial object image.
Visual comparisons of combinatorial object generation. We compare our BASS method with Stable-Diffusion2 [2], DALL-E 2 [1], ERNIE-ViLG 2.0 (Baidu) [3], and Bing (Microsoft), using the prompt 'Hybrid of [prompt1] and [prompt2]'. The results show that our model outperforms these counterparts in creative combination. Moreover, our results differ markedly from the images retrieved from the LAION-5B dataset [6] in the first row, directly illustrating our model's ability to generate out-of-distribution images. In addition, for a fairer comparison, we also craft intricate textual descriptions of our generated images and use them as input prompts; even then, these models fail to produce results that closely match ours.
Comparing our BASS with state-of-the-art human-preference evaluation methods, PickScore [4] and HPSv2 [5], through visualizations of the sampling results.
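For reference, scoring a set of candidate images with PickScore follows the usage published alongside the PickScore_v1 checkpoint; HPSv2 exposes a similar score(images, prompt) call through its hpsv2 package. The sketch below is a plausible scoring routine under those assumptions, not the exact script behind the figure.

```python
# Hedged sketch: rank candidate images with PickScore [4], following the usage
# released with the PickScore_v1 checkpoint (not the comparison script itself).
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

@torch.no_grad()
def pick_scores(prompt: str, images) -> torch.Tensor:
    """Return one preference score per image for the given prompt."""
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt").to(device)
    img = torch.nn.functional.normalize(model.get_image_features(**image_inputs), dim=-1)
    txt = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)
    return (model.logit_scale.exp() * txt @ img.T).squeeze(0)

# Example: best = images[pick_scores("hybrid of lionfish and abacus", images).argmax()]
```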
Generalization to four different styles: aquarelle, line art, cartoon, and ink painting.
We fix two words, bee eater and green lizard, to generate a variety of results.
We sincerely thank the French creative agency Les Créatonautes for granting us permission to use their images, and we express our heartfelt gratitude for their support.
[1] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022.
[3] Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10135–10145, 2023.
[4] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
[5] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
[6] Christoph Schuhmann, Romain Beaumont, Richard Vencu, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[7] Les Créatonautes. https://www.instagram.com/les.creatonautes/