We propose a simple yet effective sampling method, requiring no training, that generates creative combinations from two object texts. Bottom row: original images from Stable-Diffusion2 [2]. Middle row: combinations produced by our algorithm. Top row: artworks by Les Créatonautes [7], a French creative agency. Remarkably, the objects we create rival the artistry of works crafted by professional artists.
Generating creative combinatorial objects from two object texts poses a significant challenge for text-to-image (T2I) synthesis, which is often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method called balance swap-sampling. First, we propose a swapping mechanism that generates a novel set of combinatorial object images by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing the CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Our experiments on text pairs of objects from ImageNet show that our approach outperforms recent state-of-the-art T2I methods, even when the associated concepts appear implausible, such as lionfish-abacus and monarch-cheetah.
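To make the procedure concrete, the minimal sketch below illustrates the first two steps under stated assumptions: it uses the Hugging Face diffusers and transformers APIs, swaps a random subset of embedding channels as a stand-in for the paper's exact exchange of intrinsic elements, and accepts an image when its CLIP distances to the two original generations are roughly balanced (the threshold tau is an illustrative choice, and the final segmentation-based selection is omitted). It is not the authors' implementation.

```python
# Minimal sketch of balance swap-sampling (illustrative, not the official code).
# Assumptions: Hugging Face diffusers/transformers; the channel-wise swap and the
# threshold `tau` are hypothetical stand-ins for the paper's exact criteria.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(prompt: str) -> torch.Tensor:
    """Encode a prompt into the (1, 77, d) embedding that conditions the UNet."""
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(device)
    return pipe.text_encoder(ids)[0]

def clip_feat(image) -> torch.Tensor:
    """L2-normalized CLIP image feature."""
    inputs = proc(images=image, return_tensors="pt").to(device)
    return torch.nn.functional.normalize(clip.get_image_features(**inputs), dim=-1)

@torch.no_grad()
def swap_sample(text_a: str, text_b: str, n_samples: int = 16, tau: float = 0.15):
    e_a, e_b = embed(text_a), embed(text_b)
    # CLIP features of the two original generations serve as references.
    ref_a = clip_feat(pipe(prompt_embeds=e_a).images[0])
    ref_b = clip_feat(pipe(prompt_embeds=e_b).images[0])
    accepted = []
    for _ in range(n_samples):
        # Step 1: swap a random subset of embedding channels between the prompts.
        mask = torch.rand(e_a.shape[-1], device=device) < 0.5
        img = pipe(prompt_embeds=torch.where(mask, e_a, e_b)).images[0]
        # Step 2: keep images whose CLIP distances to both originals are balanced,
        # i.e. they resemble neither parent too closely nor drift from both.
        f = clip_feat(img)
        d_a, d_b = 1 - (f @ ref_a.T).item(), 1 - (f @ ref_b.T).item()
        if abs(d_a - d_b) < tau:
            accepted.append((img, d_a, d_b))
    return accepted  # Step 3 (segmentation-based selection) would pick from these.

# Example: candidates = swap_sample("lionfish", "abacus")
```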
Visual results of combinatorial object generation.
The pipeline of our balance swap-sampling method. Starting from the text embeddings obtained by feeding the two given texts into the text encoder, we introduce a swapping operation to collect a set F of randomly swapped vectors as novel embeddings, generate a new image set I from them, and propose an acceptable region on which we build a sampling method for selecting an optimal combinatorial object image.
Visual comparisons of combinatorial object generation. We compare our BASS method with Stable-Diffusion2 [2], DALL-E 2 [1], ERNIE-ViLG 2.0 (Baidu) [3], and Bing (Microsoft), using the prompt 'Hybrid of [prompt1] and [prompt2]'. The results show that our model outperforms these counterparts in creative combination. Moreover, our results differ markedly from the images retrieved from the LAION-5B dataset [6] in the first row, directly illustrating our model's ability to generate out-of-distribution images. In addition, for a fairer comparison, we also craft intricate textual descriptions of our generated images and use them as input prompts; even then, these models fail to produce results that closely match ours.
Comparing our BASS with state-of-the-art human-preference evaluation methods, PickScore [4] and HPSv2 [5], through visualizations of the sampling results.
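For reference, scoring a set of candidate images with PickScore follows the usage published alongside the PickScore_v1 checkpoint; HPSv2 exposes a similar score(images, prompt) call through its hpsv2 package. The sketch below is a plausible scoring routine under those assumptions, not the exact script behind the figure.

```python
# Hedged sketch: rank candidate images with PickScore [4], following the usage
# released with the PickScore_v1 checkpoint (not the comparison script itself).
import torch
from transformers import AutoProcessor, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval().to(device)

@torch.no_grad()
def pick_scores(prompt: str, images) -> torch.Tensor:
    """Return one preference score per image for the given prompt."""
    image_inputs = processor(images=images, return_tensors="pt").to(device)
    text_inputs = processor(text=prompt, padding=True, truncation=True,
                            max_length=77, return_tensors="pt").to(device)
    img = torch.nn.functional.normalize(model.get_image_features(**image_inputs), dim=-1)
    txt = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)
    return (model.logit_scale.exp() * txt @ img.T).squeeze(0)

# Example: best = images[pick_scores("hybrid of lionfish and abacus", images).argmax()]
```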
Generalization to four different styles: aquarelle, line art, cartoon, and ink painting.
We fix two words, bee eater and green lizard, to generate a variety of results.
We sincerely thank the French creative agency Les Créatonautes for granting us permission to use their images, and we express our heartfelt gratitude for their support.
[1] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, 2022.
[3] Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10135–10145, 2023.
[4] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
[5] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
[6] Christoph Schuhmann, Romain Beaumont, Richard Vencu, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[7] Les Créatonautes. https://www.instagram.com/les.creatonautes/