Diffusion models have emerged as a powerful approach in generative AI, producing state-of-the-art results in image, audio, and video generation. In this in-depth technical article, we’ll explore how diffusion models work, their key innovations, and why they’ve become so successful. We’ll cover the mathematical foundations, training process, sampling algorithms, and cutting-edge applications of this exciting new technology.
Introduction to Diffusion Models
Diffusion models are a class of generative models that learn to gradually denoise data by reversing a diffusion process. The core idea is to start with pure noise and iteratively refine it into a high-quality sample from the target distribution.
This approach was inspired by non-equilibrium thermodynamics – specifically, the process of reversing diffusion to recover structure. In the context of machine learning, we can think of it as learning to reverse the gradual addition of noise to data.
Some key advantages of diffusion models include:
- State-of-the-art image quality, surpassing GANs in many cases
- Stable training without adversarial dynamics
- Highly parallelizable
- Flexible architecture – any model that maps inputs to outputs of the same dimensionality can be used
- Strong theoretical grounding
Let’s dive deeper into how diffusion models work.
Stochastic Differential Equations govern the forward and reverse processes in diffusion models. The forward SDE adds noise to the data, gradually transforming it into a noise distribution. The reverse SDE, guided by a learned score function, progressively removes noise, leading to the generation of realistic images from random noise. This approach is key to achieving high-quality generative performance in continuous state spaces
The Forward Diffusion Process
The forward diffusion process starts with a data point x₀ sampled from the real data distribution, and gradually adds Gaussian noise over T timesteps to produce increasingly noisy versions x₁, x₂, …, xT.
At each timestep t, we add a small amount of noise according to:
x_t = √(1 - β_t) * x_{t-1} + √(β_t) * ε
Where:
- β_t is a variance schedule that controls how much noise is added at each step
- ε is random Gaussian noise
This process continues until xT is nearly pure Gaussian noise.
Mathematically, we can describe this as a Markov chain:
q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) * x_{t-1}, β_t * I)
Where N denotes a Gaussian distribution.
The β_t schedule is typically chosen to be small for early timesteps and increase over time. Common choices include linear, cosine, or sigmoid schedules.
The Reverse Diffusion Process
The goal of a diffusion model is to learn the reverse of this process – to start with pure noise xT and progressively denoise it to recover a clean sample x₀.
We model this reverse process as:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_θ^2(x_t, t))
Where μ_θ and σ_θ^2 are learned functions (typically neural networks) parameterized by θ.
The key innovation is that we don’t need to explicitly model the full reverse distribution. Instead, we can parameterize it in terms of the forward process, which we know.
Specifically, we can show that the optimal reverse process mean μ* is:
μ* = 1/√(1 - β_t) * (x_t - β_t/√(1 - α_t) * ε_θ(x_t, t))
Where:
- α_t = 1 – β_t
- ε_θ is a learned noise prediction network
This gives us a simple objective – train a neural network ε_θ to predict the noise that was added at each step.
Training Objective
The training objective for diffusion models can be derived from variational inference. After some simplification, we arrive at a simple L2 loss:
L = E_t,x₀,ε [ ||ε - ε_θ(x_t, t)||² ]
Where:
- t is sampled uniformly from 1 to T
- x₀ is sampled from the training data
- ε is sampled Gaussian noise
- x_t is constructed by adding noise to x₀ according to the forward process
In other words, we’re training the model to predict the noise that was added at each timestep.
Model Architecture
The U-Net architecture is central to the denoising step in the diffusion model. It features an encoder-decoder structure with skip connections that help preserve fine-grained details during the reconstruction process. The encoder progressively downsamples the input image while capturing high-level features, and the decoder up-samples the encoded features to reconstruct the image. This architecture is particularly effective in tasks requiring precise localization, such as image segmentation.
The noise prediction network ε_θ
can use any architecture that maps inputs to outputs of the same dimensionality. U-Net style architectures are a popular choice, especially for image generation tasks.
A typical architecture might look like:
class DiffusionUNet(nn.Module): def __init__(self): super().__init__() # Downsampling self.down1 = UNetBlock(3, 64) self.down2 = UNetBlock(64, 128) self.down3 = UNetBlock(128, 256) # Bottleneck self.bottleneck = UNetBlock(256, 512) # Upsampling self.up3 = UNetBlock(512, 256) self.up2 = UNetBlock(256, 128) self.up1 = UNetBlock(128, 64) # Output self.out = nn.Conv2d(64, 3, 1) def forward(self, x, t): # Embed timestep t_emb = self.time_embedding(t) # Downsample d1 = self.down1(x, t_emb) d2 = self.down2(d1, t_emb) d3 = self.down3(d2, t_emb) # Bottleneck bottleneck = self.bottleneck(d3, t_emb) # Upsample u3 = self.up3(torch.cat([bottleneck, d3], dim=1), t_emb) u2 = self.up2(torch.cat([u3, d2], dim=1), t_emb) u1 = self.up1(torch.cat([u2, d1], dim=1), t_emb) # Output return self.out(u1)