Dhariwal & Nichol's 2021 paper “Diffusion Models Beat GANs on Image Synthesis” (the follow-up to “Improved DDPM”) showed that unconditional DDPMs still lagged behind GANs in sample quality, and it proposed the conditional diffusion model. The authors found that adding a class label as an additional input to the diffusion model significantly improved image quality (especially FID), even surpassing contemporary GAN models. This is why researchers began adding conditions to diffusion models. Only later did conditioning mechanisms evolve into a general tool for semantic and user-driven control, as seen in Classifier Guidance, CFG, and eventually LDM.
The evolution of conditional diffusion models follows the path: Conditional Diffusion → Classifier Guidance → Classifier-Free Guidance (CFG) → LDM.
In this note, we focus on the models before LDM. Before the emergence of LDM, almost all diffusion models (including conditional ones) were built upon the DDPM framework. Therefore, all pre-LDM conditional diffusion models can essentially be viewed as Conditional DDPMs. These models used conditions (such as class labels or semantic maps) to assist image generation and improve sample quality, but their inputs and outputs remained within the image domain.
In the LDM note, we will discuss LDM in detail. LDM differs from conditional DDPM in two ways: (1) it introduces a VAE to encode images into a low-dimensional latent space and runs the diffusion process there; (2) it conditions on embeddings from other modalities (most notably text), making conditional diffusion truly multimodal (text ↔ image).
| Stage | Representative Model | Core Idea | Conditional Generation |
|---|---|---|---|
| 1️⃣ DDPM (Basic Diffusion Model) | Ho et al., “Denoising Diffusion Probabilistic Models” (NeurIPS 2020) | Simulates the process of gradually adding and then removing noise, learning the noise distribution. | ❌ Unconditional (only learns to generate images) |
| 2️⃣ Conditional Diffusion | Dhariwal & Nichol, “Diffusion Models Beat GANs on Image Synthesis” (2021), etc. | Adds conditioning (e.g., labels, embeddings, or images) as an input to the denoising prediction. | ✅ Conditional |
| 3️⃣ Classifier Guidance | Dhariwal & Nichol, “Diffusion Models Beat GANs on Image Synthesis” (2021) | Uses the gradient from an external classifier to guide the diffusion direction, enabling conditional control. | ✅ Conditional (via external classifier) |
| 4️⃣ Classifier-Free Guidance (CFG) | Ho & Salimans (2022) | Learn both conditional and unconditional modes in one model and mix them at inference. | ✅ Conditional (internally implemented) |
| 5️⃣ LDM (Latent Diffusion Model) | Rombach et al., CVPR 2022 | Trains the diffusion process in the VAE latent space, conditioned on text embeddings (CLIP/Text Encoder). | ✅ Conditional (multimodal embedding) |
Takeaway:
- In conditional diffusion, what are some examples of inputs that can serve as the condition for the noise prediction network?
- In classifier guidance, what (if anything) is fed into the noise prediction network as a condition?
- How is the condition injected into the noise prediction network in conditional diffusion?
- How does Classifier Guidance modify the model’s predicted noise? What is the intuition via Bayes’ rule?
- Does CFG use the same noise prediction network as an ordinary conditional DDPM?
--- Conditional Diffusion¶
In a standard DDPM, a neural network learns to predict noise $\hat{\boldsymbol\epsilon} = \epsilon_\theta(\mathbf{x}_t, t)$. In a conditional diffusion model, we introduce a condition $c$ as an extra input to the network: $$\hat{\boldsymbol\epsilon} = \epsilon_\theta(\mathbf{x}_t, t, c)$$ This allows the denoising process to be guided by external information ($c$ can be a class label, a text embedding, an image, etc.). For example:
- if $c$ is a class label: $c = 5$ (could be any int);
- if $c$ is text embedding: $c$ = text_encoder("a cute cat");
- if $c$ is image: $c$ = original img or $c$ = img_encoder(original img).
Condition injection into network: usually by addition, concatenation, or cross-attention.
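As a rough sketch of these three mechanisms (toy shapes and layer names chosen for illustration, not taken from any particular model):
import torch
import torch.nn as nn

B, C, H, W, D = 4, 64, 16, 16, 128            # feature map [B, C, H, W], condition embedding [B, D]
h = torch.randn(B, C, H, W)
cemb = torch.randn(B, D)

# 1) Addition: project the condition to C channels and add it at every spatial location
add_proj = nn.Linear(D, C)
h_add = h + add_proj(cemb)[:, :, None, None]                  # [B, C, H, W]

# 2) Concatenation: broadcast the condition spatially and concatenate along channels
c_map = cemb[:, :, None, None].expand(B, D, H, W)
h_cat = torch.cat([h, c_map], dim=1)                          # [B, C + D, H, W]; next conv takes C + D input channels

# 3) Cross-attention: image tokens (queries) attend to condition tokens (keys/values), e.g. text tokens
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, kdim=D, vdim=D, batch_first=True)
tokens = h.flatten(2).transpose(1, 2)                         # [B, H*W, C]
ctx = cemb[:, None, :]                                        # [B, 1, D] (a single condition token here)
h_attn, _ = attn(tokens, ctx, ctx)                            # [B, H*W, C]
h_attn = h_attn.transpose(1, 2).reshape(B, C, H, W)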
Architecture¶
The forward and reverse processes are exactly the same as in the original DDPM. The only difference is that the noise prediction network now takes a condition $c$ as an extra input, so its architecture necessarily differs from the network in the original DDPM.
Training Data: labeled images $(\mathbf{x}_0, c)$.
Forward Process: $$ q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)I) $$
Reverse Process ($\mu_\theta$ can be derived using $\epsilon_\theta(\mathbf{x}_t, t, c)$): $$ p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t, c) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t, c), \Sigma_t) $$
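Concretely, $\mu_\theta$ takes the same form as in the unconditional DDPM, only with the condition threaded through the noise predictor: $$ \mu_\theta(\mathbf{x}_t, t, c) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t, c)\right) $$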
Toy Example Implementation (Additive Conditioning)¶
This implements condition injection in the U-Net for DDPM. Compare it with the original (unconditional) DDPM U-Net to see exactly what changes adding a condition introduces. In this implementation, we add the global condition embedding to the global time embedding, and at each injection point we project the combined vector to the required number of channels.
Here we use a single time_mlp for both the time and class (condition) embeddings, so both act as a unified modulation signal; this simplifies the architecture and training. The shared mechanism is not due to any semantic similarity between time and class, but because simple linear modulation works well in diffusion models. If the condition is more complex (e.g., text or image features), a shared MLP is not sufficient.
import torch
import torch.nn as nn
import torch.nn.functional as F
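import math

# NOTE: `timestep_embedding` is used below but was not defined in the original snippet.
# This is a minimal sinusoidal embedding (standard DDPM/Transformer style), added here
# only so that the toy example runs end to end; `dim` is assumed to be even.
def timestep_embedding(t, dim):
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / (half - 1))
    args = t.float()[:, None] * freqs[None, :]                      # [B, half]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)    # [B, dim]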
class ToyUNet_ConditionalDDPM(nn.Module):
def __init__(self, in_ch=3, num_classes=10):
super().__init__()
self.in_ch = in_ch
out_ch = in_ch
emb_dim = 128 # embedding dim for time & class
# --- Time embedding ---
self.time_mlp = nn.Sequential(
nn.Linear(emb_dim, 64),
nn.ReLU(),
nn.Linear(64, 32)
)
# --- Class embedding ---
self.class_emb = nn.Embedding(num_classes, emb_dim)
# projection layers
self.time_proj_enc1 = nn.Linear(32, 16)
self.time_proj_enc2 = nn.Linear(32, 32)
self.time_proj_dec1 = nn.Linear(32, 16)
self.time_proj_dec2 = nn.Linear(32, 8)
# --- Encoder ---
self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
# --- Decoder ---
self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
self.dec1 = nn.Sequential(nn.Conv2d(16 + 32, 16, 3, padding=1), nn.ReLU())
self.up2 = nn.ConvTranspose2d(16, 8, 2, stride=2)
self.dec2 = nn.Sequential(nn.Conv2d(8 + 16, 8, 3, padding=1), nn.ReLU())
self.out_conv = nn.Conv2d(8, out_ch, 1)
    def forward(self, x, t, y):
        # x: noisy images x_t [B, in_ch, H, W]; t: timesteps [B]; y: class labels [B] (the condition)
# --- time embedding ---
temb = timestep_embedding(t, dim=128) # [B, 128]
temb = self.time_mlp(temb) # [B, 32]
# --- class embedding ---
cemb = self.class_emb(y) # [B, 128]
cemb = self.time_mlp(cemb) # reuse same MLP to project
temb = temb + cemb # merge condition with time (Improved DDPM style)
# --- Encoder ---
x1 = self.enc1(x) + self.time_proj_enc1(temb)[:, :, None, None]
x2 = F.max_pool2d(x1, 2)
x3 = self.enc2(x2) + self.time_proj_enc2(temb)[:, :, None, None]
        # temb is 32-d, which happens to match the 32 bottleneck channels, so no projection is needed here
        bottleneck = F.max_pool2d(x3, 2) + temb[:, :, None, None]
# --- Decoder ---
        # (use `h` for decoder features so that the class label `y` is not shadowed)
        h = self.up1(bottleneck)
        h = torch.cat([h, x3], dim=1)
        h = self.dec1(h) + self.time_proj_dec1(temb)[:, :, None, None]
        h = self.up2(h)
        h = torch.cat([h, x1], dim=1)
        h = self.dec2(h) + self.time_proj_dec2(temb)[:, :, None, None]
        return self.out_conv(h)
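A quick shape sanity check of the toy network (the 32×32 resolution, batch size, and label range below are arbitrary choices for illustration):
model = ToyUNet_ConditionalDDPM(in_ch=3, num_classes=10)
x = torch.randn(8, 3, 32, 32)        # a batch of noisy images x_t
t = torch.randint(0, 1000, (8,))     # diffusion timesteps
y = torch.randint(0, 10, (8,))       # class labels, i.e. the condition c
print(model(x, t, y).shape)          # torch.Size([8, 3, 32, 32]), same shape as the input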
--- Classifier Guidance¶
The conditioning in plain conditional diffusion is relatively weak and calls for stronger guidance; Classifier Guidance addresses this issue. It was originally introduced in Dhariwal & Nichol (2021), “Diffusion Models Beat GANs on Image Synthesis”. Solution: introduce the gradient of an external classifier $p_\phi(y | \mathbf{x}_t)$ to steer the sampling direction toward a desired class during the diffusion process.
Architecture¶
The forward and reverse processes of a model with Classifier Guidance follow exactly the same flow as the original DDPM. The only extra step is to feed $\mathbf{x}_t$ into the external classifier together with the desired label $y$, compute the gradient $\nabla_{x_t} \log p_\phi(y|\mathbf{x}_t)$, and use it to adjust the output of the noise prediction network, yielding a new noise prediction.
Training Data: labeled images $(\mathbf{x}_0, y)$ (the labels are needed to train the classifier; the diffusion model itself is unconditional).
Diffusion model: Unconditional DDPM.
Extra network: a classifier trained on noisy images, whose input is $\mathbf{x}_t$ (and typically the timestep $t$) and whose output is the class logits / predicted label $\hat y$. It is trained separately before sampling, or an open-source pre-trained noise-aware classifier is used.
Noise Prediction Process:
Predict noise via noise prediction network (usually U-Net): $\varepsilon_\theta(\mathbf{x}_t,t) \rightarrow \hat{\boldsymbol{\varepsilon}}$.
Compute $\nabla_{x_t} \log p_\phi(y|\mathbf{x}_t)$ and use it to update $\hat{\boldsymbol{\varepsilon}}$, where $s$ is the guidance strength, a hyperparameter: $$\hat{\boldsymbol{\varepsilon}}' = \hat{\boldsymbol{\varepsilon}} - s \cdot \nabla_{x_t} \log p_\phi(y|\mathbf{x}_t).$$ (The original paper additionally scales the gradient by $\sqrt{1-\bar\alpha_t}$; that constant can be absorbed into $s$.)
What exactly happens when we compute $\nabla_{x_t} \log p_\phi(y|\mathbf{x}_t)$?
Let the classifier logits on the noisy image $\mathbf{x}_t$ be $\mathbf{z} = (z_1, z_2, ..., z_y, ..., z_C)\in\mathbb{R}^C$, i.e., the pre-softmax layer, whose dimension equals the number of classes $C$. The log posterior probability of class $y$, $\log p_\phi(y\mid \mathbf{x}_t)$, is given by the softmax: $$ p_\phi(y\mid \mathbf{x}_t)=\frac{e^{z_y}}{\sum_{c=1}^C e^{z_c}} \quad\Rightarrow\quad \log p_\phi(y\mid \mathbf{x}_t)= z_y - \log\sum_{c=1}^C e^{z_c}. $$ To obtain $\nabla_{\mathbf{x}_t} \log p_\phi(y\mid \mathbf{x}_t)$, the key is to compute $\nabla_{\mathbf{z}}\log p_\phi(y\mid \mathbf{x}_t)$. Consider each component $z_k$: $$ \frac{\partial}{\partial z_k}\log p_\phi(y\mid \mathbf{x}_t) = \frac{\partial}{\partial z_k}\Big(z_y - \log\sum_{c=1}^C e^{z_c}\Big) = \mathbf{1}[k=y] - p_\phi(k\mid \mathbf{x}_t). $$ In vector form this is the classic result (with $\mathbf{p}=\text{softmax}(\mathbf{z})$ and $\mathbf{e}_y$ the one-hot vector with a 1 in the $y$-th position): $$ \nabla_{\mathbf{z}}\log p_\phi(y\mid \mathbf{x}_t)= \mathbf{e}_y - \mathbf{p} \quad \in\mathbb{R}^C $$ Then by the chain rule we obtain the following, where $\frac{\partial \mathbf{z}}{\partial \mathbf{x}_t} \in\mathbb{R}^{C\times\text{dim }\mathbf{x}_t}$ is the Jacobian of the classifier from its input to the logits, computed by standard backpropagation: $$ \nabla_{\mathbf{x}_t}\log p_\phi(y\mid \mathbf{x}_t) = \frac{\partial \mathbf{z}}{\partial \mathbf{x}_t}^{\top} \nabla_{\mathbf{z}}\log p_\phi(y\mid \mathbf{x}_t) \in\mathbb{R}^{\text{dim }\mathbf{x}_t} $$ A quick numerical check of the softmax-gradient identity is sketched right after these steps.
Compute the posterior mean using $\hat{\boldsymbol{\varepsilon}}'$ (the same formula as original DDPM).
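A quick numerical check of the softmax-gradient identity $\nabla_{\mathbf{z}}\log p_\phi(y\mid \mathbf{x}_t)= \mathbf{e}_y - \mathbf{p}$, using PyTorch autograd on toy logits:
import torch
import torch.nn.functional as F

z = torch.randn(5, requires_grad=True)                     # logits for C = 5 classes
y = 2                                                      # target class index
F.log_softmax(z, dim=0)[y].backward()                      # fills z.grad with ∇_z log p(y|z)
p = F.softmax(z, dim=0).detach()
e_y = F.one_hot(torch.tensor(y), num_classes=5).float()
print(torch.allclose(z.grad, e_y - p))                     # True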
Implementation Highlights¶
We show how to compute the classifier gradient and combine it with the predicted noise; the rest of the sampling loop is the same as in the original DDPM.
# -----------
# Toy Example (Not Runnable)
# -----------
import torch
import torch.nn.functional as F
x_t.requires_grad_(True) # shape [B, Channels * Height * Width]
logits = classifier(x_t) # shape [B, C]. Classifier returns logits, not log probabilities.
log_probs = F.log_softmax(logits, dim=1) # shape [B, C]. Apply softmax for every row: logits -> log probabilities.
# y is the label index which provided with the data x. For a batch, y is a tensor of shape [B]. Example: y = [0, 2, 5, ..., 4]
# NOTICE that we CANNOT use log_probs[:, y]; for example, if y = [0, 2, 5], log_probs[:, y] will select columns 0, 2, and 5 from every row, returning a tensor of shape [3, 3]
selected = log_probs[torch.arange(len(y)), y] # shape [B]. This is log p(y|x_t).
# Compute ∇_{x_t} log p(y|x_t)
# torch.autograd.grad() returns a TUPLE, the length of which equals the number of inputs you pass in. For example:
# grad_tuple = torch.autograd.grad(outputs, inputs=(x_t, w, b)) -> (grad_x_t, grad_w, grad_b)
#
# NOTICE that we ACTUALLY want to compute the gradient of selected[i] with respect to x_t[i].
# We use selected.sum() because the gradient of the sum is the same as the gradient of selected[i] wrt x_t[i].
grad_tuple = torch.autograd.grad(selected.sum(), x_t) # -> (grad_x_t,) grad_x_t has the same shape as x_t
grad = grad_tuple[0]
# Combine them (classifier guidance)
eps_guided = eps_pred - s * grad
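The guided noise $\hat{\boldsymbol{\varepsilon}}'$ then simply replaces $\hat{\boldsymbol{\varepsilon}}$ in the usual DDPM posterior-mean update. A sketch of that final step (alpha_t, alpha_bar_t, and sigma_t are assumed to be precomputed scalars for the current timestep):
# detach because x_t had requires_grad enabled for the classifier pass; skip the added noise at the final step t = 0
mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps_guided) / alpha_t ** 0.5
x_prev = mean.detach() + sigma_t * torch.randn_like(x_t)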
Why Classifier Guidance Works¶
For conditional diffusion, adding the condition embedding to every layer does not count as true “strong guidance.” It only makes the conditional information propagate more thoroughly and be less easily forgotten, but it still belongs to the category of “weak conditioning”. This is structural enhancement, not signal amplification: it does not change the model’s sampling mechanism.
In contrast, “strong guidance” directly modifies the model’s predicted noise during sampling, altering the generation trajectory. Specifically, $$\hat{\boldsymbol{\varepsilon}}' = \hat{\boldsymbol{\varepsilon}} - s \cdot \nabla_{x_t} \log p_\phi(y|\mathbf{x}_t)$$ shifts the diffusion trajectory toward the desired class.
Math¶
Let's build some intuition for why we use $\hat{\boldsymbol{\varepsilon}}' = \hat{\boldsymbol{\varepsilon}} - s \cdot \nabla_{x_t} \log p_\phi(y|\mathbf{x}_t)$.
Recall that under maximum likelihood we aim to maximize $\log p_\theta(x_0)$. Diffusion sampling does not maximize $\log p_\theta(x_0)$ directly; instead, it gradually increases the likelihood of the final $x_0$ by moving along $\nabla_{x_t} \log p(x_t)$ at each step. Here the label is known information, so we want to maximize $\log p_\theta(x_0 \mid y)$; thus, at each step we should move along $\nabla_{x_t} \log p(x_t \mid y)$.
According to Bayes’ rule, $p(x_t \mid y) = \frac{p(y \mid x_t)\,p(x_t)}{p(y)}$; since $p(y)$ does not depend on $x_t$, taking $\nabla_{x_t}\log$ of both sides gives: $$ \nabla_{x_t} \log p(x_t | y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y | x_t) $$
The first term, $\nabla_{x_t} \log p(x_t)$, corresponds to the denoising direction of the diffusion model itself (i.e., to $\hat{\boldsymbol{\varepsilon}}$). In fact, $$\nabla_{x_t}\log p_t(x_t) \approx -\frac{1}{\sqrt{1-\bar\alpha_t}}\hat{\boldsymbol{\varepsilon}}.$$
This can be easily derived from the forward process: $$ q(x_t\mid x_0)=\mathcal N\big(x_t;\sqrt{\bar\alpha_t}\,x_0,(1-\bar\alpha_t)\mathbf I\big) $$ Taking the gradient of the log-density of the Gaussian: $$ \nabla_{x_t}\log q(x_t\mid x_0) = -\frac{x_t-\sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t} = -\frac{\sqrt{1-\bar\alpha_t}\,\varepsilon}{1-\bar\alpha_t} = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}} $$ The training objective of DDPM is to make the model $\varepsilon_\theta(x_t,t)$ approximate the true noise $\boldsymbol\varepsilon$, so we can substitute $\hat{\boldsymbol{\varepsilon}}$ here to obtain the intuition.
The second term, $\nabla_{x_t} \log p(y|x_t)$, is the additional “classifier guidance” direction.
In one sentence¶
Conditional diffusion “tells the network who I am,” while classifier guidance “forces the network to become more like me.” It not only lets the model “know the condition” but also “enforces the response to the condition.”
- Drawback: In theory, for classifier guidance to work well, the classifier must be highly robust to noise, but such a classifier is very difficult to train. This is why later models like GLIDE, Imagen, and Stable Diffusion fully switched to CFG.
--- Classifier-Free Guidance (CFG)¶
As mentioned before, classifier guidance requires a classifier that is robust to noise, which is hard to obtain. So is there a way to add strong guidance while eliminating the need for a classifier? Classifier-Free Guidance (CFG), proposed by Ho & Salimans, 2022 (“Classifier-Free Diffusion Guidance”), removes the need for an external classifier but still provides strong conditioning. It is now the most popular trick for imposing a condition.
Architecture¶
The network used by CFG is exactly the same as the one used for conditional diffusion, except that during training the condition embedding is sometimes “zeroed out” or “replaced with a special null token”.
- During training: With probability $p$, drop the condition $c$ (replace with null token). Thus the same model learns two modes $\epsilon_\theta(x_t, t, c) \text{ and } \epsilon_\theta(x_t, t, \varnothing)$.
- During inference: no random dropping; the two predictions are combined with the CFG scale $s$: $$ \hat{\boldsymbol{\epsilon}}' = \epsilon_\theta(x_t, t, \varnothing) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)) $$
Network:
- Inputs: $x_t, t, c$ ($c$ may be empty).
- Outputs: noise prediction $\hat{\boldsymbol{\epsilon}}$.
Implementation Highlights¶
import torch
import random
## training: with probability p_uncond (e.g., 0.1), drop the condition so that the
## same network also learns the unconditional mode
if random.random() < p_uncond:
    cond = torch.zeros_like(cond)   # null condition → unconditional mode (must match what inference uses)
pred = model(x_t, t, cond)          # otherwise cond is kept → conditional mode
## inference: run the network twice and extrapolate with the CFG scale
def classifier_free_guidance(model, x_t, t, cond, scale):
    eps_cond = model(x_t, t, cond)                        # conditional prediction
    eps_uncond = model(x_t, t, torch.zeros_like(cond))    # null condition → unconditional prediction
    eps_guided = eps_uncond + scale * (eps_cond - eps_uncond)
    return eps_guided
Why CFG Works¶
In a standard conditional DDPM, during training the model learns “how much noise to remove after seeing condition $c$”, and during sampling we fully trust that learned direction: the model takes a single step with whatever strength it has learned. It may or may not have learned this pull well, and there is no way to manually increase it.
CFG does not change the network structure; it only modifies the sampling combination formula: $$ \hat{\boldsymbol{\epsilon}}' = \epsilon_\theta(x_t, t, \varnothing) + s \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)) $$ Mathematically, this takes a step $s$ times larger along the direction of the difference between the conditional and unconditional predictions. The unconditional prediction $\epsilon_\theta(x_t,t, \varnothing)$ is “where the model would go without knowing $c$”, and the difference between the two is the “pure conditional signal direction”. Multiplying by $s > 1$ means strengthening the conditional signal by a factor of $s$.
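Rearranging the same formula makes the link to classifier guidance explicit. Using the score relation from the previous section, $\epsilon_\theta(x_t,t,c)-\epsilon_\theta(x_t,t,\varnothing)\approx -\sqrt{1-\bar\alpha_t}\,\big(\nabla_{x_t}\log p(x_t\mid c)-\nabla_{x_t}\log p(x_t)\big) = -\sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log p(c\mid x_t)$, so the difference term plays exactly the role of the classifier gradient, with the “classifier” defined implicitly by the diffusion model itself: $$ \hat{\boldsymbol{\epsilon}}' = \epsilon_\theta(x_t, t, c) + (s-1)\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big) \approx \epsilon_\theta(x_t, t, c) - (s-1)\,\sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log p(c\mid x_t). $$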
Through $\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)$, CFG extracts the “pure conditional direction”, then multiplies it by $s$ to amplify it, so CFG effectively “pulls harder” toward the condition, giving stronger guidance.
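A trivial sanity check of the combination formula (random tensors stand in for the two predictions):
import torch

eps_uncond = torch.randn(4, 3, 32, 32)   # ε_θ(x_t, t, ∅)
eps_cond = torch.randn(4, 3, 32, 32)     # ε_θ(x_t, t, c)

def cfg(eps_uncond, eps_cond, s):
    return eps_uncond + s * (eps_cond - eps_uncond)

assert torch.allclose(cfg(eps_uncond, eps_cond, 0.0), eps_uncond)  # s = 0: condition ignored
assert torch.allclose(cfg(eps_uncond, eps_cond, 1.0), eps_cond)    # s = 1: plain conditional prediction
# s > 1 extrapolates past the conditional prediction, i.e. amplifies the conditional signal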