Würstchen: A Fast and Efficient Way to Generate Images from Text

by curvature
Explore Würstchen, a text-to-image synthesis model that delivers fast, efficient, high-resolution image generation.

Introduction

Have you ever wondered what it would be like to create realistic images from text descriptions? Imagine being able to type “a sunset over the ocean” and see a beautiful picture of the sky and the sea, or typing “a cute dog wearing a hat” and getting a funny image of a furry friend. Sounds amazing, right?

Well, this is not just a fantasy anymore. Advances in artificial intelligence and deep learning now make it possible to generate images from text with impressive quality and diversity. The systems that do this are called text-to-image synthesis models, and they have many potential applications in content creation, education, entertainment, and more.

However, there is a catch. Text-to-image synthesis is a challenging and computationally expensive task. It takes a lot of data, memory, and processing power to train and run these models. State-of-the-art models such as Stable Diffusion XL (SDXL) can take hundreds of thousands of GPU hours to train, and generating a single image can take anywhere from several seconds on a high-end GPU to minutes on modest hardware. This puts them out of reach for many users and researchers.

But what if there was a way to make text-to-image synthesis faster and cheaper, without compromising the quality and diversity of the images? This is where Würstchen comes in.

What is Würstchen?

Würstchen is a novel text-to-image synthesis model that generates images noticeably faster than SDXL while using considerably less GPU memory, and its text-conditional stage was trained with roughly an order of magnitude less compute than comparable Stable Diffusion models. Würstchen produces images at 1024×1024 pixels by default and supports resolutions up to around 1536×1536. And the best part is, Würstchen achieves results comparable to far more expensive models in terms of image quality and diversity.

Würstchen is an open-source model developed by an independent team of researchers and made available through Hugging Face’s Diffusers library. It is built as a pipeline of three models: a text-conditional diffusion prior, a diffusion-based decoder, and a VQGAN. Together, these models compress the image data into a highly compact latent space, where the text-to-image synthesis itself can be performed efficiently and effectively.

How does Würstchen work?

Würstchen’s pipeline consists of three stages: Stage A, Stage B, and Stage C. Each stage plays a different role in the text-to-image synthesis process. The stages are named in the order in which an image is compressed during training; at generation time they run in reverse, from Stage C down to Stage A.

Stage A: VQGAN

Stage A is a VQGAN, which stands for Vector Quantized Generative Adversarial Network. A VQGAN is a type of generative model that can learn to compress images into discrete codes, called tokens, that represent the most important features and patterns in the image. A VQGAN consists of two components: an encoder and a decoder. The encoder maps an image into a sequence of tokens, and the decoder reconstructs the image from the tokens.

Stage A’s VQGAN is trained on a large collection of images to learn a general-purpose tokenization. It compresses an image by a factor of roughly four in each spatial dimension, so a 1024×1024 image becomes a 256×256 grid of discrete tokens drawn from a learned codebook. This first stage only removes low-level redundancy; the much stronger compression that makes Würstchen fast happens in Stage B.
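
To make the idea of image tokens concrete, here is a minimal, illustrative sketch of the vector-quantization step at the heart of a VQGAN. It is a toy PyTorch example with arbitrary sizes, not Würstchen’s actual implementation:

import torch

# Toy vector quantization: map each spatial feature vector produced by an
# encoder to the index of its nearest codebook entry (its "token").
codebook = torch.randn(1024, 8)          # 1,024 learnable code vectors (toy sizes)
features = torch.randn(1, 8, 64, 64)     # encoder output for one image: (B, C, H, W)

# Flatten the spatial grid so every position is a single feature vector.
flat = features.permute(0, 2, 3, 1).reshape(-1, 8)        # (64*64, 8)

# Nearest-neighbour lookup in the codebook gives the discrete token ids.
distances = torch.cdist(flat, codebook)                    # (4096, 1024)
tokens = distances.argmin(dim=1).reshape(1, 64, 64)        # token grid

# Decoding starts by swapping each token back for its code vector; a learned
# decoder network then turns this grid of code vectors back into pixels.
quantized = codebook[tokens].permute(0, 3, 1, 2)           # (1, 8, 64, 64)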

Stage B: Diffusion Autoencoder

Stage B acts as a kind of diffusion autoencoder. During training, a small convolutional encoder (referred to in the paper as the Semantic Compressor) squeezes the image down into a very compact latent representation, and a diffusion model learns to reconstruct Stage A’s token grid from that compact latent. In other words, it behaves like an autoencoder whose decoder is a diffusion model, which lets it recover far more detail from a tiny latent than a plain feed-forward decoder could.

This second step is where most of the compression comes from. The compact latent corresponds to a spatial compression of roughly 42:1 relative to the original pixels: a 1024×1024 image ends up as a latent with about 24×24 spatial positions (and 16 channels in the released model). This tiny latent space is where Stage C does its work.

Stage C: Prior

Stage C, the Prior, is a text-conditional diffusion model that learns to generate the compact latents from text descriptions. A diffusion model of this kind learns to reverse a gradual noising process: starting from pure noise, it iteratively denoises its input until a sample emerges that matches the conditioning text, which is supplied as embeddings from a CLIP text encoder. Crucially, Stage C does all of this in the tiny latent space produced by Stage B’s compression, not in pixel space.

Stage C’s Prior is trained on the compact latents computed for real images, paired with the images’ captions. At generation time the order reverses: Stage C samples a compact latent from the text prompt, Stage B expands that latent back into Stage A’s token grid, and Stage A’s decoder turns the tokens into the final pixels. Because only the small Stage C model is conditioned on text, the pipeline can generate images at 1024×1024 and, with the released weights, at resolutions up to around 1536×1536.
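
The generation path through the three stages can be summarized in a few lines. The functions and tensor shapes below are illustrative placeholders meant to show the data flow and rough sizes, not the actual Diffusers API (that is shown later in the usage section):

import torch

# Placeholder stand-ins for the three stages. Shapes roughly follow the
# released 1024x1024 model (a ~24x24x16 Stage C latent and a 256x256 token
# grid for Stage A), but the exact channel counts here are illustrative.

def stage_c_prior(text_embedding: torch.Tensor) -> torch.Tensor:
    # Text-conditional diffusion in the highly compressed latent space.
    return torch.randn(1, 16, 24, 24)

def stage_b_decoder(compact_latent: torch.Tensor) -> torch.Tensor:
    # Diffusion model that expands the compact latent into Stage A's latent grid.
    return torch.randn(1, 4, 256, 256)

def stage_a_decode(vq_latent: torch.Tensor) -> torch.Tensor:
    # VQGAN decoder that maps the latent grid back to pixels.
    return torch.rand(1, 3, 1024, 1024)

text_embedding = torch.randn(1, 77, 1024)       # e.g. CLIP text features for the prompt
compact = stage_c_prior(text_embedding)         # Stage C: text -> tiny latent
expanded = stage_b_decoder(compact)             # Stage B: tiny latent -> VQGAN latent
image = stage_a_decode(expanded)                # Stage A: VQGAN latent -> 1024x1024 image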

Why is Würstchen fast and efficient?

Würstchen’s main advantage is that the text-conditional part of the synthesis happens in a highly compressed latent space, where a 1024×1024 image is represented by a latent grid of only about 24×24 positions. This slashes the computational cost and memory usage of the diffusion process, as well as the amount of information that needs to be conditioned on the text. Models such as SDXL are latent diffusion models too, but their diffusion runs on a much larger latent (on the order of 128×128 for a 1024×1024 image), so every denoising step is far more expensive.
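
A quick back-of-the-envelope comparison makes the difference tangible. The numbers below assume a Stable-Diffusion-style latent (8× spatial downsampling, 4 channels) and Würstchen’s reported roughly 42:1 spatial compression with 16 channels; they are ballpark figures, not exact model dimensions:

image_size = 1024

# Stable-Diffusion-style latent: 8x spatial downsampling, 4 channels (assumed).
sd_latent = (image_size // 8, image_size // 8, 4)             # (128, 128, 4)

# Würstchen Stage C latent: ~42:1 spatial compression, 16 channels (reported).
wuerstchen_latent = (image_size // 42, image_size // 42, 16)  # (24, 24, 16)

def num_elements(shape):
    h, w, c = shape
    return h * w * c

print(num_elements(sd_latent))          # 65536 latent values per image
print(num_elements(wuerstchen_latent))  # 9216 latent values per image

# Spatial positions drive the cost of attention and convolution layers.
print((sd_latent[0] * sd_latent[1]) / (wuerstchen_latent[0] * wuerstchen_latent[1]))  # ~28x fewer positions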

Würstchen’s pipeline also decouples the stages cleanly. Stage A and Stage B only deal with reconstructing images, so they can be trained once on large image datasets and then reused unchanged for different text-to-image tasks. Only Stage C needs text-image pairs, and because it operates in the tiny latent space, training or fine-tuning it for a new task is far cheaper than training a full text-to-image model from scratch.

How to use Würstchen?

Würstchen is available through the Diffusers library, a Python library that provides easy access to a wide range of diffusion models. You can install it with pip, together with transformers (which the pipeline uses for its text encoder) and accelerate:

pip install diffusers transformers accelerate

You can then use the AutoPipelineForText2Image class to load and run Würstchen:

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

# Load the full Würstchen pipeline (Stages C, B and A) in half precision on the GPU.
pipeline = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

caption = "A smiling woman holding a cup of coffee"

images = pipeline(
    caption,
    height=1024,                                # output height in pixels
    width=1536,                                 # output width in pixels
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,  # recommended sampling schedule for Stage C
    prior_guidance_scale=4.0,                   # classifier-free guidance strength for the prior
    num_images_per_prompt=4,                    # generate four candidate images
).images
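
For more control, the prior (Stage C) and the decoder (Stages B and A) can also be loaded and run as separate pipelines. The sketch below follows the Diffusers API at the time of writing; argument names may differ slightly between library versions:

import torch
from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

# Stage C: text-conditional prior that produces the compact image embeddings.
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", torch_dtype=torch.float16
).to("cuda")

# Stages B and A: decoder that turns the image embeddings into pixels.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

caption = "A smiling woman holding a cup of coffee"

prior_output = prior_pipeline(
    prompt=caption,
    height=1024,
    width=1536,
    timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    guidance_scale=4.0,
    num_images_per_prompt=2,
)

images = decoder_pipeline(
    image_embeddings=prior_output.image_embeddings,
    prompt=caption,
    guidance_scale=0.0,   # the decoder needs little or no guidance
    output_type="pil",
).images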

You can also try Würstchen without installing anything by using the official demo on Hugging Face Spaces.

Conclusion

Würstchen is a significant step forward in text-to-image synthesis, offering a fast and efficient way to generate images from text without sacrificing quality or diversity. By pushing the text-conditional diffusion into a tiny latent space, it generates images faster than SDXL while using far less memory, and it was trained at a fraction of the compute cost of comparable models. It produces images at 1024×1024 pixels by default and supports resolutions up to around 1536×1536.

Würstchen is an open-source model that can be easily accessed through the Diffusers library. It has many potential applications, such as content creation, education, and entertainment, and it is a great tool for researchers and enthusiasts who want to explore and experiment with text-to-image synthesis without expensive, powerful hardware.

If you are interested in learning more about Würstchen, the original research paper and the Würstchen documentation in the Diffusers library are good places to start.

Also Read: How to Generate Images from Text on Your Mobile Device with MobileDiffusion

Also Read: Diffusion Models: The Next Big Thing in AI
