TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling.

 



Teaser figure.


We propose a simple yet effective sampling method, requiring no training, to generate creative combinations from two object texts. Bottom row: original images from Stable Diffusion 2 [2]. Middle row: combinations produced by our algorithm. Top row: artworks by Les Créatonautes [7], a French creative agency. Remarkably, the objects we create rival the artistry of masterpieces crafted by artists.



Abstract

Generating creative combinatorial objects from two object texts poses a significant challenge for text-to-image synthesis, which is often hindered by its focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method called balance swap-sampling. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing the CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Our experiments on text pairs of objects from ImageNet show that our approach outperforms recent state-of-the-art T2I methods, even when the associated concepts appear implausible, such as lionfish-abacus and monarch-cheetah.
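To make the second step concrete, the snippet below sketches one possible form of the balance acceptance test: a candidate image is kept only when its CLIP distances to the two parent generations are both moderate and roughly equal. The thresholds and the exact criterion here are illustrative assumptions, not the values used by our method.

```python
# Toy balance test for one candidate (illustrative thresholds, see above).
def balance_accept(d1, d2, tol=0.10, d_min=0.30, d_max=0.80):
    """d1, d2: CLIP distances from the candidate to the two parent generations."""
    balanced = abs(d1 - d2) <= tol        # neither parent concept dominates
    novel_enough = min(d1, d2) >= d_min   # not a near-copy of either parent
    still_related = max(d1, d2) <= d_max  # not an unrelated artifact
    return balanced and novel_enough and still_related


# Distances of three hypothetical candidates to the lionfish / abacus parents.
candidates = [(0.45, 0.48), (0.15, 0.70), (0.90, 0.88)]
print([balance_accept(d1, d2) for d1, d2 in candidates])  # [True, False, False]
```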



Creative Generation

Visual results of combinatorial object generations.

Row 1 text pairs: lobster + sea slug, house finch + crash helmet, lionfish + abacus, fire engine + scotch terrier, daisy + fiddler crab.

Row 2 text pairs: grey fox + scotch terrier, bee eater + military uniform, bobsled + barn spider, fire engine + scotch terrier, gasmask + custard apple.



Our framework

The pipeline of our balance swap-sampling method. Starting from the text embeddings obtained by feeding the two given texts into the text encoder, we introduce a swapping operation that collects a set F of random swapping vectors to form novel embeddings, generate a new image set I from these embeddings, and propose a balance swapping region on which we build a sampling method for selecting an optimal combinatorial object image.

Pipeline figure.
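As a rough illustration of the swapping step, the sketch below encodes two object texts with the Stable Diffusion 2 text encoder, exchanges a random subset of embedding coordinates between them, and generates one candidate image from the mixed embedding. The per-coordinate mask and the 0.5 swap ratio are illustrative assumptions; the exact choice of which intrinsic elements to exchange, and how many candidates to generate, follows the paper rather than this sketch.

```python
# Minimal sketch of the embedding-swap step (assumptions noted above).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def encode(prompt):
    # Standard CLIP text encoding used by the pipeline: (1, 77, hidden_dim).
    tokens = pipe.tokenizer(
        prompt, padding="max_length", max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(tokens)[0]

e1, e2 = encode("a lionfish"), encode("an abacus")

# One random swapping vector from the set F: each coordinate takes its value
# from either text embedding with probability 0.5.
mask = torch.rand(e1.shape, device=e1.device) < 0.5
swapped = torch.where(mask, e1, e2)

# Generate one candidate image of the new set I from the swapped embedding.
image = pipe(prompt_embeds=swapped, num_inference_steps=30).images[0]
image.save("lionfish_abacus_candidate.png")
```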


Comparing Results

Visual comparisons of combinatorial object generations. We compare our BASS method with Stable Diffusion 2 [2], DALL·E 2 [1], ERNIE-ViLG 2.0 (Baidu) [3], and Bing (Microsoft) using the prompt 'Hybrid of [prompt1] and [prompt2]'. The results reveal that our model outperforms these counterparts in terms of combinatorial creativity. Moreover, our results differ significantly from the images retrieved from the LAION-5B dataset [6], shown in the first row, which directly illustrates our model's ability to generate out-of-distribution images. In addition, for a fairer comparison, we also feed these models intricate textual descriptions of our generated images as input prompts; even then, they do not produce results that closely match ours.

Comparison figure.
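For completeness, the baseline inputs are produced with the plain string template quoted above; a minimal sketch, using example pairs taken from the galleries on this page:

```python
# Assemble the 'Hybrid of [prompt1] and [prompt2]' prompts fed to the baselines.
pairs = [("lionfish", "abacus"), ("fire engine", "scotch terrier"), ("daisy", "fiddler crab")]
baseline_prompts = [f"Hybrid of {a} and {b}" for a, b in pairs]
print(baseline_prompts[0])  # Hybrid of lionfish and abacus
```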


Sampling Comparison

Comparing our BASS with state-of-the-art human-preference-based selection methods, PickScore [4] and HPSv2 [5], through visualizations of the sampling results.

Sampling comparison figure.
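For context, PickScore [4] and HPSv2 [5] both select images by scoring every candidate and keeping the highest-scoring one. The sketch below mimics that pattern with a plain CLIP model from Hugging Face transformers as a stand-in scorer; it is not the actual PickScore or HPSv2 code, and the candidate file names and query text are hypothetical.

```python
# Score-and-pick selection with a generic CLIP scorer (stand-in, see above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def pick_best(image_paths, text):
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=[text], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one score per image
    return image_paths[int(scores.argmax())]

# Hypothetical candidate files produced by the swapping step.
best = pick_best(
    ["cand_0.png", "cand_1.png", "cand_2.png"],
    "a creative hybrid of a lionfish and an abacus",
)
print(best)
```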


More Results

Different Styles Generation

Generalization to four different styles: aquarelle, line art, cartoon, and ink painting.

Style generalization figure.


Fixing one text and varying the other

We fix one of two texts, bee eater or green lizard, and vary the other to generate different results.

Row 1: bee eater combined with electric guitar, Irish wolfhound, redbone, and tree frog.

Row 2: green lizard combined with border terrier, head cabbage, killer whale, and silky terrier.



More visual results

Row 1 text pairs: basenji + frilled lizard, airliner + hot pot, African grey + warplane, Afghan hound + great owl, jacamar + hippopotamus.

Row 2 text pairs: Gordon setter + power drill, jacamar + Indian elephant, tricycle + head cabbage, lemon + scorpion, junco + go-kart.

Row 3 text pairs: conch + pirate, bee + tiger, barn spider + acoustic guitar, albatross + joystick, toucan + bathing cap.



Acknowledgement

We sincerely thank the French creative agency Les Créatonautes for granting us permission to use their images. We express our heartfelt gratitude for their support.

References

[1] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022.

[3] Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10135–10145, 2023.

[4] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.

[5] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

[6] Christoph Schuhmann, Romain Beaumont, Richard Vencu, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

[7] Les Créatonautes, https://www.instagram.com/les.creatonautes/