Stable Diffusion is a deep learning model for generating detailed images from text descriptions. It was publicly released in 2022 and was primarily designed for text-to-image generation, but it can also be applied to other tasks such as inpainting, outpainting, and text-guided image-to-image translation. The technology behind Stable Diffusion is a latent diffusion model (LDM), a kind of deep generative neural network. Unlike earlier proprietary text-to-image models, Stable Diffusion's code and model weights are publicly available, and it can run on consumer hardware with a modest GPU.
The architecture of Stable Diffusion includes a variational autoencoder (VAE), a U-Net (a convolutional neural network with an encoder-decoder structure and skip connections), and an optional text encoder. The VAE encoder compresses the image into a lower-dimensional latent space, while the U-Net iteratively denoises the latent representation in a reverse diffusion process. The VAE decoder then generates the final image from the denoised latent. One of the key advantages of Stable Diffusion is that it can be flexibly conditioned on different inputs, such as text or images, which allows for diverse image generation based on the provided conditioning data. The model was trained on a large dataset called LAION-5B, which consists of image-caption pairs derived from publicly available web data.
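The three-stage flow described above (encode to latent space, iteratively denoise with the U-Net, decode back to pixels) can be sketched with toy stand-in functions. This is a minimal illustration of the control flow only, not the real model: the actual system operates on tensors (latents of shape 4×64×64), uses a trained U-Net and a CLIP text encoder, and applies a learned noise schedule; the function names, dimensions, and arithmetic below are simplified assumptions for clarity.

```python
import random

# Toy sketch of the latent diffusion pipeline (illustrative only; the real
# Stable Diffusion uses tensors, a trained U-Net, and a CLIP text encoder).

LATENT_DIM = 4   # real SD latents are 4x64x64; here just a 4-vector
NUM_STEPS = 10   # real samplers typically use ~20-50 denoising steps

def unet_predict_noise(latent, t, text_embedding):
    """Stand-in for the U-Net: predict the noise in the latent,
    conditioned on the timestep and the text embedding."""
    return [0.1 * x + 0.01 * t * e for x, e in zip(latent, text_embedding)]

def vae_decode(latent):
    """Stand-in for the VAE decoder: expand the latent back toward
    image space (here, naively doubling its length)."""
    return [x for x in latent for _ in range(2)]

def generate(text_embedding):
    # 1. Text-to-image starts from pure Gaussian noise in latent space
    #    (the VAE encoder is only needed for image-to-image tasks).
    random.seed(0)
    latent = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
    # 2. Reverse diffusion: at each step the U-Net predicts the noise,
    #    and a fraction of it is subtracted from the latent.
    for t in range(NUM_STEPS, 0, -1):
        noise = unet_predict_noise(latent, t, text_embedding)
        latent = [x - n / NUM_STEPS for x, n in zip(latent, noise)]
    # 3. Decode the denoised latent into the final image.
    return vae_decode(latent)

image = generate(text_embedding=[1.0] * LATENT_DIM)
```

The key design point the sketch preserves is that the expensive denoising loop runs entirely in the compressed latent space; only a single decode step touches full image resolution, which is what makes the model feasible on consumer GPUs.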
With 860 million parameters in the U-Net and 123 million in the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards and can be run on consumer-grade GPUs. It represents a significant advancement in text-to-image generation and opens up new possibilities for creative AI applications.