DDPM (Denoising Diffusion Probabilistic Model)

DDPM is the foundational model of the diffusion family and forms the basis for understanding all subsequent variants. A diffusion model learns to reverse a noising process.

  • Forward process: start from clean data (e.g., an image) and gradually add Gaussian noise over many steps until it becomes pure noise.
  • Reverse process: a neural network is trained to invert the forward process step by step, reconstructing clean data from noise.
  • Sampling (generation): begin with random Gaussian noise and iteratively denoise it using the trained neural network to produce a realistic data sample.

Takeaways:

  1. Why does the forward process converge to a standard Gaussian?
  2. What are the input and output of the neural network in the reverse process?
  3. How do we compute the mean and variance of $p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)$ in the reverse process?
  4. What are the steps of the generation process?
  5. How do we derive the L2 loss from MLE / ELBO?
  6. What is the principle of the skip connection in U-Net?
  7. How do we implement DDPM in PyTorch (U-Net as noise predictor)?
  8. In U-Net, how does the residual block differ from the skip connection?
  9. How do we add attention to U-Net?
  10. What are the loss functions in $x_0$-prediction and $v$-prediction?

--- DDPM Architecture¶

Forward Process: From Data to Noise¶

The forward process is a fixed, non-learned Markov chain that gradually adds noise to a clean input $\mathbf{x}_0$. The one‑step transition makes the forward process a Markov chain since each $\mathbf{x}_t$ depends only on $\mathbf{x}_{t-1}$, not the full history.

  1. One‑step transition

    At each time step $t$, we define $$ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t I \right) $$ That is, given $\mathbf{x}_{t-1}$, we sample $\mathbf{x}_t$ by:

    • Shrinking $\mathbf{x}_{t-1}$ a bit (multiply by $\sqrt{1 - \beta_t}$)
    • Adding Gaussian noise with variance $\beta_t$

    where $\beta_t \in (0, 1)$ is a small positive number controlling how much noise is added at step $t$. Typically chosen as a linear or cosine schedule, e.g. $$ \beta_t = \text{linspace}(10^{-4}, 0.02, T) $$

  2. Closed‑form for any time step

    We often define $\alpha_t = 1 - \beta_t$, and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. By recursively applying the above Gaussian transitions, the marginal at time $t$ has a simple form $$ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\bigl(\mathbf{x}_t; \sqrt{\bar\alpha_t}\mathbf{x}_0, (1-\bar\alpha_t)I\bigr) \quad\text{with}\quad \bar\alpha_t = \prod_{s=1}^t (1-\beta_s). $$ At training time we can sample any noisy state $\mathbf{x}_t$ directly from $\mathbf{x}_0$ and $\boldsymbol{\epsilon}$ without running the full chain $$ \mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(0,I). $$

    We can prove the closed‑form expression for $q(\mathbf{x}_t\mid \mathbf{x}_0)$ by simple induction. For $t=1$ we have $\mathbf{x}_1 = \sqrt{\alpha_1}\mathbf{x}_0 + \sqrt{1-\alpha_1}\boldsymbol{\epsilon}_1$ so the mean is $\sqrt{\alpha_1}\mathbf{x}_0$ and the variance is $(1-\alpha_1)I$, which matches $1-\bar\alpha_1$. Assume for some $t-1$ that $\mathbf{x}_{t-1} = \sqrt{\bar\alpha_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar\alpha_{t-1}}\boldsymbol{\epsilon}'$ with $\boldsymbol{\epsilon}' \sim \mathcal{N}(0,I)$. Plugging this into $\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\boldsymbol{\epsilon}_t$ gives a mean $\sqrt{\alpha_t\bar\alpha_{t-1}}\mathbf{x}_0 = \sqrt{\bar\alpha_t}\mathbf{x}_0$ and a variance $\alpha_t(1-\bar\alpha_{t-1}) + (1-\alpha_t) = 1-\bar\alpha_t$. Thus by induction, $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0,I)$.

  3. Convergence to a standard Gaussian

    As $t$ grows, $\bar\alpha_t \to 0$, so the signal term shrinks and the noise term dominates. Mathematically, $\mathbf{x}_t = \sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\boldsymbol{\epsilon} \rightarrow \boldsymbol{\epsilon} \sim \mathcal{N}(0,I)$, so the marginal $q(\mathbf{x}_t)$ approaches a standard Gaussian regardless of $\mathbf{x}_0$.
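
To make this concrete, here is a quick numeric sanity check (a sketch, assuming the linear schedule $\beta_t = \text{linspace}(10^{-4}, 0.02, T)$ with $T=1000$): simulating the chain step by step gives the same marginal statistics as the closed-form jump, and the final marginal is approximately $\mathcal{N}(0, I)$ even though the toy data is centered far from 0.

In [ ]:
import torch

torch.manual_seed(0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

x0 = 3.0 + torch.randn(100_000)            # toy 1-D "data", mean 3
x = x0.clone()
for t in range(T):                         # run the forward chain step by step
    x = alphas[t].sqrt() * x + betas[t].sqrt() * torch.randn_like(x)

# closed-form jump directly to step T
x_jump = alpha_bar[-1].sqrt() * x0 + (1 - alpha_bar[-1]).sqrt() * torch.randn_like(x0)

print(alpha_bar[-1].item())                # ~4e-5: the signal term has vanished
print(x.mean().item(), x.std().item())     # ~0, ~1
print(x_jump.mean().item(), x_jump.std().item())  # matches the simulated chain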

Reverse Process: From Noise Back to Data¶

Gaussian Conditioning Formula. Given two Gaussians:

  • $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \mu_{\mathbf{x}}, \Sigma_{\mathbf{x}})$
  • $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; A \mathbf{x} + b, \Sigma_{\mathbf{y}})$

Then: $$ p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\left(\mathbf{x}; \mu', \Sigma'\right) $$ where: $$ \Sigma' = \left(\Sigma_{\mathbf{x}}^{-1} + A^T \Sigma_{\mathbf{y}}^{-1} A\right)^{-1} $$ $$ \mu' = \Sigma' \left(\Sigma_{\mathbf{x}}^{-1} \mu_{\mathbf{x}} + A^T \Sigma_{\mathbf{y}}^{-1} (\mathbf{y} - b)\right) $$
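
As a quick check of this formula, here is a 1-D Monte-Carlo sketch (toy numbers, chosen only for illustration): compare the closed-form posterior with the empirical mean and variance of $x$ among samples whose $y$ falls near a fixed value.

In [ ]:
import torch

torch.manual_seed(0)
mu_x, var_x = 1.5, 0.49        # p(x) = N(mu_x, var_x)
a, b, var_y = 0.9, 0.1, 0.04   # p(y|x) = N(a*x + b, var_y)

# closed-form posterior p(x|y) from the conditioning formula (1-D case)
var_post = 1.0 / (1.0 / var_x + a * a / var_y)
mu_post = lambda y: var_post * (mu_x / var_x + a * (y - b) / var_y)

# Monte-Carlo: sample (x, y), then keep samples with y close to y0
x = mu_x + var_x ** 0.5 * torch.randn(2_000_000)
y = a * x + b + var_y ** 0.5 * torch.randn_like(x)
y0 = 1.4
mask = (y - y0).abs() < 0.005
print(mu_post(y0), x[mask].mean().item())  # both ~1.45
print(var_post, x[mask].var().item())      # both ~0.045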

The reverse process inverts the forward process by learning the reverse Markov chain. The forward process defines $$ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t I \right) $$ $$ q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\bigl(\mathbf{x}_t; \sqrt{\bar\alpha_t}\mathbf{x}_0, (1-\bar\alpha_t)I\bigr) \quad\text{with}\quad \bar\alpha_t = \prod_{s=1}^t (1-\beta_s). $$ By the Gaussian Conditioning Formula, let $\mathbf{x} = \mathbf{x}_{t-1}$ (with prior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$) and $\mathbf{y} = \mathbf{x}_t$; then the posterior $q(\mathbf{x}_{t-1}\mid \mathbf{x}_t, \mathbf{x}_0)$ is still a Gaussian distribution whose mean and variance can be computed in closed form. Plugging in the parameters above, we obtain the reduced form (see the appendix for the detailed calculation) $$ \Sigma' = \frac{(1 - \bar\alpha_{t-1}) \beta_t}{1 - \bar\alpha_t} I $$ $$ \mu' = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} \mathbf{x}_t $$ From this expression, we see that the variance $\Sigma'$ depends only on the time step. The mean $\mu'$, however, requires knowledge of $\mathbf{x}_0$. Since we only observe $\mathbf{x}_t$ but not $\mathbf{x}_0$, we need a way to approximate $\mathbf{x}_0$. To do this, we train a neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ (typically an MLP or U-Net) to output $\hat{\boldsymbol{\epsilon}}$: a prediction of the noise $\boldsymbol{\epsilon}$ added during the forward process. Then, using the forward process formula $$ \mathbf{x}_t = \sqrt{\bar\alpha_t} \mathbf{x}_0 + \sqrt{1 - \bar\alpha_t} \boldsymbol{\epsilon}, $$ we can rearrange and estimate $$ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar\alpha_t}} \left( \mathbf{x}_t - \sqrt{1 - \bar\alpha_t} \cdot \hat{\boldsymbol{\epsilon}} \right) $$ Plugging $\hat{\mathbf{x}}_0$ into the formula for $\mu'$ gives an approximate mean, denoted $\mu_\theta(\mathbf{x}_t, t)$ (see the appendix for the detailed calculation) $$ \mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}} \cdot \hat{\boldsymbol{\epsilon}} \right) $$ The intuition behind this expression: by removing the noise component from $\mathbf{x}_t$, we obtain a direction pointing toward $\mathbf{x}_{t-1}$.

That is, predicting the noise $\boldsymbol{\epsilon}$ is sufficient to approximate the mean of the reverse distribution. This is one of the most elegant and fundamental tricks in DDPM. However, since the predicted mean is computed via an estimate $\hat{\mathbf{x}}_0$, the resulting distribution is no longer the true reverse process $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, but rather an approximation. We denote it by $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$; it is a Gaussian distribution: $$ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma'\right) $$
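
The algebra that collapses the two-term mean (in $\hat{\mathbf{x}}_0$ and $\mathbf{x}_t$) into the compact form of $\mu_\theta$ above is easy to verify numerically; a minimal sketch, assuming the linear schedule:

In [ ]:
import torch

T = 200
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = torch.cumprod(alphas, 0)
abar_prev = torch.cat([torch.tensor([1.0]), abar[:-1]])  # abar_0 := 1

t = 123
x_t = torch.randn(5)
eps_hat = torch.randn(5)

# two-term posterior mean with x0_hat plugged in
x0_hat = (x_t - (1 - abar[t]).sqrt() * eps_hat) / abar[t].sqrt()
mu_two_term = (abar_prev[t].sqrt() * betas[t] / (1 - abar[t])) * x0_hat \
            + (alphas[t].sqrt() * (1 - abar_prev[t]) / (1 - abar[t])) * x_t
# compact form
mu_compact = (x_t - (1 - alphas[t]) / (1 - abar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
print(torch.allclose(mu_two_term, mu_compact, atol=1e-5))  # True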

The neural network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ takes as input a noisy image $\mathbf{x}_t$ and a time step $t$, and outputs the predicted noise $\hat{\boldsymbol{\epsilon}}$ that was used to create $\mathbf{x}_t$ from $\mathbf{x}_0$ in the forward process. The training target is the actual noise $\boldsymbol{\epsilon}$ we sampled during forward diffusion. The loss is $$\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}, t} \left[ \| \boldsymbol{\epsilon} - \hat{\boldsymbol{\epsilon}} \|^2 \right]$$ In the "Math Behind the Training Objective of DDPM" section we will see that DDPM's L2 loss is not an arbitrary choice. We'll introduce the variational objective used in DDPM and show that the L2 loss emerges naturally from it under a Gaussian assumption, specifically as the negative log-likelihood of a Gaussian with fixed variance.

Generation Process of DDPM: Iterative Denoising¶

Starting from $\mathbf{x}_T\sim\mathcal{N}(0,I)$, repeatedly apply $p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t)$ to obtain a sample $\mathbf{x}_{t-1}$, eventually obtaining a clean sample $\mathbf{x}_0$. Each step uses the predicted noise to compute a denoised mean, then samples from a Gaussian centered at that mean.

  • Initialize: $$\mathbf{x}_T \sim \mathcal{N}(0, I)$$

  • Iteratively sample: for $t = T, T-1, \dots, 1$, $$ \mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{1- \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) $$ $$ \mathbf{x}_{t-1} \sim \mathcal{N}\left( \mu_\theta(\mathbf{x}_t, t), \sigma_t^2 I \right) $$ where $\sigma_t$ only depends on the timestep $t$, and $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is the predicted noise.

--- Math Behind the Training Objective of DDPM¶

The theoretical objective of DDPM is to maximize the log-likelihood of the real data under the generative distribution. Let $q(x_0)$ denote the true data distribution and $p_\theta(x_0)$ the generative distribution defined by the model. $$ \max_\theta \ \mathbb{E}_{x_0 \sim q(x_0)} \big[ \log p_\theta(x_0) \big] = \max_\theta \; \int q(\mathbf{x}_0) \, \log p_\theta(\mathbf{x}_0) \, d\mathbf{x}_0 $$ where $\log p_\theta(\mathbf{x}_0)$ is the marginal likelihood. The empirical distribution $q(\mathbf{x}_0)$ does not depend on $\theta$, so optimizing $\mathbb{E}_{x_0 \sim q(\mathbf{x}_0)}[\log p_\theta(\mathbf{x}_0)]$ is equivalent to maximizing $\log p_\theta(\mathbf{x}_0)$ over the training samples. Thus the objective is $$\max_\theta \log p_\theta(\mathbf{x}_0)$$

0. Notation and Setup¶

  • Time steps: $t = 1, \dots, T$, with $\alpha_t = 1 - \beta_t,\ \bar\alpha_t = \prod_{s=1}^t \alpha_s$.

  • Forward “noising” Markov chain (known and fixed): $$q(x_t \mid x_{t-1}) \sim \mathcal N\big(\sqrt{\alpha_t}x_{t-1},\ \beta_t I\big).$$

    Equivalently, $$q(x_t \mid x_0) \sim \mathcal N\big(\sqrt{\bar\alpha_t}x_0,\ (1-\bar\alpha_t)I\big), \quad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ \ \epsilon \sim \mathcal N(0,I).$$

  • Reverse “denoising” generative model (to be learned): $$p_\theta(x_{0}x_1...x_T) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t), \quad p(x_T) = \mathcal N(0,I),$$

    and in general we let $$p_\theta(x_{t-1}\mid x_t) \sim \mathcal N\big(\mu_\theta(x_t,t),\ \sigma_t^2 I\big),$$

    where the variance $\sigma_t^2$ only depends on $t$.

1. From Likelihood to ELBO¶

The objective is to maximize the log-likelihood: $$ \log p_\theta(x_0) = \log \int p_\theta(x_{0}x_1...x_T) \,\mathrm{d}x_{1}x_2...x_T $$ We derive its ELBO (see the Variational Inference and VAE notes for what the ELBO is; this section closely parallels the analysis of the VAE training objective) by introducing a distribution $q(x_{1}x_2...x_T\mid x_0)$ (we choose the forward chain $q$) to enable Bayes' Rule $$ \begin{align} \log p_\theta(x_0) & = \log p_\theta(x_0 x_1...x_T) - \log p_\theta(x_1 x_2...x_T \mid x_0) \nonumber\newline & = \int q(x_{1}x_2...x_T\mid x_0) \left[ \log p_\theta(x_0 x_1...x_T) - \log p_\theta(x_1 x_2...x_T \mid x_0) \right] \mathrm{d}x_{1}x_2...x_T \nonumber\newline \end{align} $$ The equality still holds after taking the integral because $\int q(x_{1}x_2...x_T\mid x_0) \log p_\theta(x_0) \mathrm{d}x_{1}x_2...x_T = \log p_\theta(x_0)$. Inside the integral we subtract and add $\log q(x_1 x_2...x_T \mid x_0)$ to get $$ \begin{align} & = \int q(x_{1}x_2...x_T\mid x_0) \left[ \log p_\theta(x_0 x_1...x_T) - \log q(x_1 x_2...x_T \mid x_0) + \log q(x_1 x_2...x_T \mid x_0) - \log p_\theta(x_1 x_2...x_T \mid x_0) \right] \mathrm{d}x_{1}x_2...x_T \nonumber\newline & = \int q(x_{1}x_2...x_T\mid x_0) \left[ \log p_\theta(x_0 x_1...x_T) - \log q(x_1 x_2...x_T \mid x_0) \right] \mathrm{d}x_{1}x_2...x_T + D_{\mathrm{KL}}\left(q(x_1 x_2...x_T \mid x_0) \| p_\theta(x_1 x_2...x_T \mid x_0)\right) \nonumber\newline & \geq \int q(x_{1}x_2...x_T\mid x_0) \left[ \log p_\theta(x_0 x_1...x_T) - \log q(x_1 x_2...x_T \mid x_0) \right] \mathrm{d}x_{1}x_2...x_T \nonumber\newline \end{align} $$ since the KL divergence is always non-negative. Thus we obtain the ELBO as the learning objective: maximizing the ELBO approximately maximizes $\log p_\theta(x_0)$. $$ \begin{align} \mathrm{ELBO} & = \int q(x_{1}x_2...x_T\mid x_0) \left[ \log p_\theta(x_0 x_1...x_T) - \log q(x_1 x_2...x_T \mid x_0) \right] \mathrm{d}x_{1}x_2...x_T \nonumber\newline & = \mathbb{E}_{q}\left[\log p_\theta(x_{0}x_1...x_T) - \log q(x_{1}x_2...x_T\mid x_0)\right] \nonumber\newline \end{align} $$ Expanding both terms using the Markov factorization: $$ \begin{aligned} \log p_\theta(x_{0}x_1...x_T) &= \log p(x_T) + \sum_{t=1}^T \log p_\theta(x_{t-1}\mid x_t), \newline \log q(x_{1}x_2...x_T\mid x_0) &= \sum_{t=1}^T \log q(x_t \mid x_{t-1}). \end{aligned} $$ Finally, $$ \boxed{ \mathrm{ELBO} = \mathbb{E}_q\Big[\log p(x_T) + \sum_{t=1}^T \log p_\theta(x_{t-1}\mid x_t) - \sum_{t=1}^T \log q(x_t \mid x_{t-1})\Big] } $$

2. From ELBO to DDPM's L2 Loss¶

Add and subtract $\log q(x_T\mid x_0)$, and for each $t=2,\dots,T$, add and subtract $\log q(x_{t-1}\mid x_t,x_0)$. Then we obtain $$ \begin{align} \mathrm{ELBO} &=\mathbb E_q\Big[ \underbrace{\big(\log p(x_T)-\log q(x_T\mid x_0)\big)}_{\color{#888}{(A)}} \nonumber\newline &\qquad\qquad+ \sum_{t=2}^T \underbrace{\big(\log p_\theta(x_{t-1}\mid x_t)-\log q(x_{t-1}\mid x_t,x_0)\big)}_{\color{#888}{(B_t)}} \nonumber\newline &\qquad\qquad+ \underbrace{\log p_\theta(x_0\mid x_1)}_{\color{#888}{(C)}}\nonumber\newline &\qquad\qquad+ \underbrace{\Big(\log q(x_T\mid x_0)+\sum_{t=2}^T\log q(x_{t-1}\mid x_t,x_0)-\sum_{t=1}^T\log q(x_t\mid x_{t-1})\Big)}_{\color{#888}{(R)}} \Big]\nonumber\newline \end{align} $$ The entire term $(R)$ cancels out to 0. To see this, use Bayes' rule together with the Markov property $q(x_t\mid x_{t-1},x_0)=q(x_t\mid x_{t-1})$ to obtain the identity $$ \log q(x_{t-1}\mid x_t,x_0) = \log q(x_t\mid x_{t-1}) + \log q(x_{t-1}\mid x_0) - \log q(x_t\mid x_0) \quad (t\ge2). $$ Substituting this into $(R)$: the $\log q(x_t\mid x_{t-1})$ terms cancel the $t=2,\dots,T$ part of the last sum, leaving $-\log q(x_1\mid x_0)$; the sum $\sum_{t=2}^T\big(\log q(x_{t-1}\mid x_0)-\log q(x_t\mid x_0)\big)$ telescopes to $\log q(x_1\mid x_0)-\log q(x_T\mid x_0)$; together with the leading $\log q(x_T\mid x_0)$, everything cancels and $(R)=0$.

  • First group $(A)$: $$ \mathbb E_q[\log p(x_T)-\log q(x_T\mid x_0)] = -\mathrm{KL}\big(q(x_T\mid x_0)\|p(x_T)\big) = -L_T $$
  • Second group $(B_t)$. Notice that in the ELBO, all terms are expectations over the full approximate posterior $q(x_{1}x_2...x_T\mid x_0)$, i.e. $\mathbb E_q[\cdot] = \mathbb E_{q(x_{1}x_2...x_T\mid x_0)}[\cdot]$. So for each $t=2,\dots,T$, by the law of total expectation: $$ \begin{align} \mathbb E_q[\log p_\theta(x_{t-1}\mid x_t)-\log q(x_{t-1}\mid x_t,x_0)] & = \mathbb E_{q(x_{1}x_2...x_T\mid x_0)} \Big[ \log p_\theta(x_{t-1}\mid x_t)-\log q(x_{t-1}\mid x_t,x_0)\Big] \nonumber\newline & = \mathbb E_{q(x_{t-1}x_t\mid x_0)} \Big[ \log p_\theta(x_{t-1}\mid x_t)-\log q(x_{t-1}\mid x_t,x_0)\Big] \nonumber\newline & = \mathbb E_{q(x_t\mid x_0)} \mathbb E_{q(x_{t-1}\mid x_t,x_0)} \Big[ \log p_\theta(x_{t-1}\mid x_t)-\log q(x_{t-1}\mid x_t,x_0)\Big] \nonumber\newline & = -\mathbb E_q\Big[ \mathrm{KL}\big(q(x_{t-1}\mid x_t,x_0)\|p_\theta(x_{t-1}\mid x_t)\big)\Big] = -L_{t-1} \nonumber\newline \end{align} $$
  • The third group $(C)$ is kept as is. Define $$ \mathbb E_q\log p_\theta(x_0\mid x_1) = - L_0 $$ Therefore, $$ \boxed{ \mathrm{ELBO} = -L_T-\sum_{t=2}^T L_{t-1} - L_0 } $$ Intuitively:
  • $L_T$ encourages the final noise to match the prior $\mathcal N(0,I)$. Since both distributions are Gaussian with known parameters, we can actually get a closed form of $L_T$. During training, $L_T$ is usually not included in the loss because it does not provide gradients for $\theta$, but when evaluating the upper bound, it can be computed exactly.
  • $L_0$ is the negative log-likelihood of the real $x_0$ under the learned reverse distribution. Since $p_\theta(x_0\mid x_1)=\mathcal N\big(\mu_\theta(x_1,1),\ \sigma_1^2 I\big)$, we get $$ L_0 \quad = \quad \mathbb E_q\left[\frac{1}{2\sigma_1^2}\|x_0-\mu_\theta(x_1,1)\|^2\right]+\text{const} \quad \propto \quad \mathbb E_q\left[ \|x_0-\mu_\theta(x_1,1)\|^2 \right] $$ Notice that $\mu_\theta(x_1,1)$ will be computed from $\epsilon_\theta(x_1,1)$. In the original DDPM (Ho et al. 2020), training often does not include $L_0$ separately. Improved or hybrid loss versions may add a small weight of $L_0$, which can slightly improve likelihood stability.
  • Each intermediate $L_{t-1}$ encourages the learned reverse to match the true reverse. The L2 loss in DDPM actually comes from this term. For a Gaussian Markov chain, each $L_{t-1}$ term becomes a quadratic form with respect to the means. By rewriting the means in terms of "predicted noise", we obtain a weighted L2 loss on $\epsilon$.

Since the forward process is a Gaussian Markov chain, the reverse conditional distribution $q(x_{t-1}\mid x_t,x_0)$ is also Gaussian, with its mean and variance given in closed form by the Gaussian conditioning formula. In the reverse process, we approximate $q(x_{t-1}\mid x_t,x_0)$ by $$ p_\theta(x_{t-1}\mid x_t)=\mathcal N\big(\mu_\theta(x_t,t),\ \sigma_t^2 I\big), $$ where the variance is the same as that of $q(x_{t-1}\mid x_t, x_0)$, while the mean is learned through a neural network (by learning $\epsilon$, in fact). For Gaussians with equal variance, the KL term in $L_{t-1}$ reduces to a quadratic form in the means: $$ L_{t-1} =\mathbb E_q \Big[ \mathrm{KL}\big(\mathcal N(\mu_q,\sigma_t^2 I)\,\|\,\mathcal N(\mu_\theta,\sigma_t^2 I)\big) \Big] =\frac{1}{2\sigma_t^2}\mathbb E_q\big[\|\mu_q-\mu_\theta\|^2\big]+\text{const}. $$ We can rewrite the means in terms of "predicted noise", $$ \mu = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}} \cdot \epsilon \right) $$ so that $$ \mu_q-\mu_\theta=\frac{1}{\sqrt{\alpha_t}}\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\Big(\epsilon-\epsilon_\theta(x_t,t)\Big) $$ Plugging this into $L_{t-1}$ gives $$ \boxed{ L_{t-1}=\underbrace{\frac{(1-\alpha_t)^2}{2\sigma_t^2\alpha_t(1-\bar\alpha_t)}}_{=:w_t} \mathbb E_q\left[\big\|\epsilon-\epsilon_\theta(x_t,t)\big\|^2\right]+\text{const} } $$ In other words, at each timestep, the KL divergence between $q$ and $p_{\theta}$ measures the discrepancy between the "true noise" $\epsilon$ and the "network-predicted noise" $\epsilon_\theta$.

Strictly speaking, the ELBO results in a time-weighted noise MSE, but as shown in Ho et al. (2020) and subsequent works, using a simpler and more numerically stable objective directly leads to better sample quality. This is equivalent to removing the above weight $w_t$ (or absorbing it into the learning rate/sampling distribution), and uniformly sampling the timestep $t$. $$ \boxed{ L_{\mathrm{simple}}=\mathbb E_{t\sim\mathcal U(\{1,\dots,T\}),\ x_0\sim q,\ \epsilon\sim\mathcal N(0,I)} \Big[\ \big\|\epsilon-\epsilon_\theta(x_t,t)\big\|^2\ \Big] } $$ Many implementations train directly with $L_{\mathrm{simple}}$ (sometimes adding a small weight of $L_0$ as a reconstruction/perceptual constraint).
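
To see concretely why dropping $w_t$ matters, we can evaluate the weights for the standard linear schedule (a sketch, taking $\sigma_t^2$ to be the posterior variance): they range over roughly two orders of magnitude across $t$, so the unweighted $L_{\mathrm{simple}}$ rebalances the objective considerably.

In [ ]:
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = torch.cumprod(alphas, 0)
abar_prev = torch.cat([torch.tensor([1.0]), abar[:-1]])
sigma2 = betas * (1 - abar_prev) / (1 - abar)   # posterior variance

# w_t = (1-alpha_t)^2 / (2 sigma_t^2 alpha_t (1-abar_t)); note w[0] is infinite
w = betas ** 2 / (2 * sigma2 * alphas * (1 - abar))
print(w[1].item(), w[T // 2].item(), w[-1].item())  # largest at small t, ~100x smaller later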

--- DDPM Implementation (Simple MLP as Noise Predictor)¶

In [9]:
import math, argparse, os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
if hasattr(torch, "set_float32_matmul_precision"):
    torch.set_float32_matmul_precision("high")  # use TF32 matmuls on supported GPUs
In [ ]:
# ---------------------------
# Utilities
# ---------------------------
def cosine_beta_schedule(T, s=0.008):
    """
    Example:
        >>> betas = cosine_beta_schedule(5)
        >>> betas
        tensor([0.0064, 0.0243, 0.0630, 0.1310, 0.2250])
    """
    steps = T
    x = torch.linspace(0, T, steps+1, dtype=torch.float32)
    alphas_cumprod = torch.cos(((x/T)+s)/(1+s)*math.pi/2)**2
    alphas_cumprod = alphas_cumprod/alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:]/alphas_cumprod[:-1])
    return betas.clamp(1e-5, 0.999)

def t_embed(t, dim=64):
    """
    Map time step t to a vector. Input a number, output a vector.
    Benefits: Removes scale effects; Capture relative relationships; Unifies representation range.
    # Example:
    # Input: t = torch.tensor([1, 10, 100]), dim=8
    # Output: t_embed(t, 8) =>
    # tensor([[ 0.8415,  0.9093,  0.1411,  0.7568,  0.5403, -0.4161, -0.9899,  0.6536],
    #         [ 0.9415,  0.1367,  0.0560,  0.6251,  0.3366,  0.9905, -0.9984,  0.7805],
    #         [ 0.5064, -0.7061, -0.0627,  0.4431, -0.8623,  0.7081, -0.9980,  0.8965]])
    """
    half = dim // 2
    freqs = torch.exp(torch.arange(half, device=t.device)*(-math.log(10000.0)/max(half-1,1)))
    args = t.float().unsqueeze(1)*freqs.unsqueeze(0)
    emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)
    if dim % 2 == 1: emb = torch.nn.functional.pad(emb, (0,1))
    return emb
In [ ]:
# ---------------------------
# Forward and Reverse Process
# ---------------------------
class Diffusion:
    def __init__(self, T=200, beta_schedule="cosine", device="cpu"):
        if beta_schedule == "cosine":
            betas = cosine_beta_schedule(T).to(device)
        else:
            betas = torch.linspace(1e-4, 0.02, T, device=device)
        self.T = T
        self.device = device
        self.betas = betas
        self.alphas = 1. - betas
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1. - self.alphas_cumprod)
        # series of \bar{a_{t-1}}; let \bar{a_0} = 1. Only used in self.posterior_variance.
        self.alphas_cumprod_prev = torch.cat([torch.tensor([1.], device=device), self.alphas_cumprod[:-1]])
        # Posterior variance of q(x_{t-1} | x_t, x_0) from the Gaussian conditioning formula; used as the variance of p_theta(x_{t-1} | x_t).
        self.posterior_variance = betas * (1. - self.alphas_cumprod_prev) / (1. - self.alphas_cumprod)

    @torch.no_grad()
    def sample(self, model, shape):
        x = torch.randn(shape, device=self.device)
        for t in reversed(range(self.T)): # t: T-1, T-2, ..., 0
            # duplicate t to (batchsize,), so that each sample corresponds to its own current timestep
            t_cur = torch.full((shape[0],), t, device=self.device, dtype=torch.long)
            eps = model(x, t_cur)
            alpha = self.alphas[t]
            alpha_bar = self.alphas_cumprod[t]
            beta = self.betas[t]
            mean = (1/torch.sqrt(alpha))*(x - (beta/torch.sqrt(1-alpha_bar))*eps)
            if t > 0:
                noise = torch.randn_like(x)
                var = self.posterior_variance[t]
                x = mean + torch.sqrt(var)*noise
            else: # t == 0: last step, no noise added; return the mean as the final x_0.
                x = mean
        return x

    def q_sample(self, x0, t, noise=None): # Closed-form forward sampling: jump from x_0 directly to x_t. Input t is an integer tensor, not an embedding.
        if noise is None: noise = torch.randn_like(x0)
        # self.sqrt_alphas_cumprod[t] has shape (batch,), x0 has shape (batch, xdim)
        # So we add [:, None] → (batch, 1), to enable element-wise multiplication
        return self.sqrt_alphas_cumprod[t][:,None]*x0 + self.sqrt_one_minus_alphas_cumprod[t][:,None]*noise, noise
In [ ]:
# ---------------------------
# Simple MLP εθ(x_t, t)
# ---------------------------
class EpsModel(nn.Module):
    def __init__(self, xdim, tdim=64, hidden=256):
        super().__init__()
        self.tproj = nn.Sequential(nn.Linear(tdim, hidden), nn.SiLU()) 
        self.net = nn.Sequential(
            nn.Linear(xdim+hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, xdim)
        )
    def forward(self, x, t):
        emb = t_embed(t, dim=self.tproj[0].in_features) # emb shape: (B, dim)
        temb = self.tproj(emb)  # tproj learns to adapt the fixed sinusoidal embedding into the network's hidden space
        h = torch.cat([x, temb], dim=1)
        return self.net(h)
In [13]:
# ---------------------------
# Training step (MSE on ε)
# ---------------------------
def train(model, diffusion, loader, epochs=5, lr=2e-4, device="cpu", log_every=200):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    step=0
    model.train()
    for ep in range(1, epochs+1):
        for (x,) in loader:
            x = x.to(device)
            # t is randomly sampled for each data point in the batch.
            # Why random? This follows the DDPM training strategy (Ho et al., 2020),
            # where the model is trained to predict noise at a random timestep t.
            t = torch.randint(0, diffusion.T, (x.size(0),), device=device).long()
            x_t, noise = diffusion.q_sample(x, t)
            pred = model(x_t, t)
            loss = mse(pred, noise)
            opt.zero_grad(); loss.backward(); opt.step()
            if step % log_every == 0:
                print(f"[epoch {ep}] step {step} loss {loss.item():.4f}")
            step += 1
In [14]:
# ---------------------------
# Datasets
# ---------------------------
def make_gaussian(n=20000, mean=0., std=1., device="cpu"):
    x = torch.randn(n, 1)*std + mean
    return TensorDataset(x.to(device))

def make_circle(n=20000, radius=2.0, noise=0.05, device="cpu"):
    angles = torch.rand(n)*2*math.pi
    x = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)*radius
    x += noise*torch.randn_like(x)
    return TensorDataset(x.to(device))

def make_mnist(device="cpu"):
    # MNIST flattened & normalized to [-1, 1]
    from torchvision import datasets, transforms
    tfm = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda x: x*2-1), transforms.Lambda(lambda x: x.view(-1))])
    ds = datasets.MNIST(root="./data", train=True, download=True, transform=tfm)
    X = torch.stack([ds[i][0] for i in range(len(ds))]).to(device)
    return TensorDataset(X)
In [18]:
# ---------------------------
# Training
# ---------------------------
data = "circle"
batch = 256
epochs = 5
T = 200
lr = 2e-4
device = "cuda" if torch.cuda.is_available() else "cpu"

# dataset
if data == "gaussian":
    ds = make_gaussian(device=device); xdim = 1
elif data == "circle":
    ds = make_circle(device=device); xdim = 2
else:
    ds = make_mnist(device=device); xdim = 28*28

dl = DataLoader(ds, batch_size=batch, shuffle=True, drop_last=True)

# diffusion + model
diffusion = Diffusion(T=T, device=device)
model = EpsModel(xdim=xdim).to(device)

# train
train(model, diffusion, dl, epochs=epochs, lr=lr, device=device)
[epoch 1] step 0 loss 0.9971
[epoch 3] step 200 loss 0.5452
In [19]:
# ---------------------------
# Sampling
# ---------------------------
model.eval()
with torch.no_grad():
    samples = diffusion.sample(model, (64, xdim)).cpu()

# save few samples
os.makedirs("samples", exist_ok=True)
if xdim == 1:
    torch.save(samples, "samples/gaussian.pt")
    print("Saved samples to samples/gaussian.pt (tensor of shape [64, 1])")
elif xdim == 2:
    torch.save(samples, "samples/circle.pt")
    print("Saved samples to samples/circle.pt (tensor of shape [64, 2])")
else:
    import torchvision.utils as vutils
    imgs = (samples.view(-1,1,28,28).clamp(-1,1)+1)/2
    grid = vutils.make_grid(imgs, nrow=8)
    vutils.save_image(grid, "samples/mnist.png")
    print("Saved image grid to samples/mnist.png")
Saved samples to samples/circle.pt (tensor of shape [64, 2])

--- U-Net As Noise Predictor (Instead of MLP)¶

What is U-Net?¶

What is the background of U-Net? U-Net was not proposed in the original DDPM paper; it actually dates back to 2015. However, back then it was introduced for image segmentation rather than generation. Image segmentation means: given an input image $x \in \mathbb{R}^{C\times H\times W}$, the output is a mask in $\mathbb{R}^{K\times H\times W}$, where each pixel has probabilities for $K$ different classes (for example: sky, tree, person), so each pixel is assigned to the object or region it belongs to.

Ronneberger et al., "U-Net: Convolutional Networks for Biomedical Image Segmentation" first proposed U-Net for biomedical image segmentation—tasks like segmenting "cell nuclei", "tumor regions", or "organ boundaries" from CT or microscope images. U-Net is structured as encoder + decoder + skip connections. As shown in the figure below, it's called "U-Net" because its architecture looks like the letter "U".

Encoder                                               Decoder
─────────────────────────────────────                ─────────────────────────────────────
Input → [Enc1] → pool → [Enc2] → pool → [Bottleneck] → up → [Dec1] → up → [Dec2] → Output
            │              │                                  ↑               ↑
            │              └──────── skip1 (Enc2→Dec1) ──────┘               │
            └─────────────────────── skip2 (Enc1→Dec2) ──────────────────────┘

In this network:

  • Encoder: Repeatedly uses Conv2d + pooling to reduce the image resolution $(H, W)$ and increase the number of channels, making the representation coarser but more abstract, so it learns semantic features (e.g., "there is a cat in this image").
  • Decoder: Uses ConvTranspose2d or interpolate to increase the image resolution and decrease the number of channels, gradually restoring the original image size, e.g., from 32×32 → 64×64 → 128×128.
  • Bottleneck: This is the endpoint of the encoder and the starting point of the decoder. The encoder compresses the information to this point; the decoder restores it from here. This is the most compressed semantic representation of the entire image: the spatial resolution is the smallest, channel count is the largest, and features are the most abstract. At this point, the model has a "global view" and understands "what is in the image," but spatial details are lost.
  • Skip Connection: The core idea is to connect decoder and encoder layers of the same spatial resolution (same $H \times W$). Features from different encoder layers (with different resolutions) are concatenated with decoder layers of the same size, so each decoder layer receives both global semantics (from the decoder) and local details (from the corresponding encoder layer). Also, only when $H \times W$ matches can feature maps be concatenated along the channel dimension (dim=1) with torch.cat().

Here is the code for a toy 2-layer encoder + 2-layer decoder U-Net, which fleshes out the structure sketch above.

In [ ]:
import torch.nn.functional as F  # for F.max_pool2d below

class ToyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=2):
        super().__init__()
        # --- Encoder ---
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1),nn.ReLU()) # -> x1: [B,16,H,W]
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1),nn.ReLU()) # -> x3: [B,32,H/2,W/2]
        # --- Decoder ---
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(16 + 32, 16, 3, padding=1),nn.ReLU())
        self.up2 = nn.ConvTranspose2d(16, 8, kernel_size=2, stride=2) 
        self.dec2 = nn.Sequential(nn.Conv2d(8 + 16, 8, 3, padding=1),nn.ReLU())
        self.out_conv = nn.Conv2d(8, out_ch, 1)

    def forward(self, x):
        # --- Encoder ---
        x1 = self.enc1(x)                # [B,16,H,W]
        x2 = F.max_pool2d(x1, 2)         # [B,16,H/2,W/2]
        x3 = self.enc2(x2)               # [B,32,H/2,W/2]
        bottleneck = F.max_pool2d(x3, 2) # [B,32,H/4,W/4]
        # --- Decoder ---
        y = self.up1(bottleneck)        # [B,16,H/2,W/2]
        y = torch.cat([y, x3], dim=1)   # Enc2 → Dec1. shape: [B, 16 + 32, H/2, W/2]
        y = self.dec1(y)                # [B,16,H/2,W/2]

        y = self.up2(y)                 # [B,8,H,W]
        y = torch.cat([y, x1], dim=1)   # Enc1 → Dec2. shape: [B, 8 + 16, H, W]
        y = self.dec2(y)                # [B,8,H,W]

        return self.out_conv(y)      
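
A quick shape check of the toy network (the input height/width must be divisible by 4 because of the two pooling steps):

In [ ]:
net = ToyUNet(in_ch=3, out_ch=2)
x = torch.randn(1, 3, 64, 64)
print(net(x).shape)  # torch.Size([1, 2, 64, 64]): per-pixel logits for out_ch classes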

Use U-Net in DDPM¶

The original DDPM paper (Ho et al., "Denoising Diffusion Probabilistic Models") used U-Net as the backbone for the noise prediction network εθ(xₜ, t). Since then, nearly all diffusion models (including conditional DDPM, LDM, and Stable Diffusion) have continued to adopt this U-Net architecture.

  • Why does DDPM use U-Net as the noise predictor?

    The task of DDPM is: given a noisy image as input, output a noise map of the same size. In other words, this is a pixel-wise mapping task (image-to-image regression), the same family as image segmentation, super-resolution, and denoising. U-Net was specifically designed for such tasks; at the time, it was the most mature dense-prediction model (pixel-level output), making it especially suitable for predicting noise maps with the same shape as the input.

  • How to use U-Net as noise predictor in DDPM?

    Recall that in DDPM, the noise prediction network $\epsilon_\theta(\mathbf{x}_t, t)$ takes $\mathbf{x}_t$ (shape $(B, C, H, W)$) and $t$ (shape $(B, )$) as input, and outputs the noise $\hat{\boldsymbol\epsilon}$ (shape $(B, C, H, W)$). To incorporate $t$ into the network, we compute a global embedding of $t$ via an MLP, and inject the time embedding into each layer (by addition or concatenation).

  • Why use an MLP to compute one global time embedding instead of generating a new embedding at each injection point?

    Because this vector represents a global attribute: which step $t$ of the diffusion process the entire image is currently at, that is, how much noise it carries. $t$ is a global variable; it does not change across space or layers; for the whole image, $t$ is the same everywhere. Repeatedly generating new embeddings would be redundant, since the time information itself doesn't change; the only difference is how each layer interprets that information.

  • Why inject the time embedding into every block?

    Because: In the early layers (high resolution), the model needs to know "this is a low-noise image at step t"; in the deeper layers (low resolution), the model needs to know "this is a high-noise latent feature at step t". The level of noise and the details to be restored change with time, and each layer's features depend differently on the time step. Therefore, Improved DDPM, DDIM, and Stable Diffusion all choose to inject the time embedding into every block (or ResidualBlock).

In [ ]:
def timestep_embedding(t, dim): # t: [B] (int); return: [B, dim]
    """
    sinusoidal position embedding, freqs = exp(-log(10000) * arange(0, half) / half)    (from DDPM paper Appendix B)
    Remove the scale effect of timestamps like 1, 2, 3, ..., 100, ...
    Allows the network to sense the relationship between different time steps (for example, 1 and 2 are close, 1 and 100 are far apart).
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(0, half, dtype=torch.float32) / half) # freqs shape: (half,)
    args = t[:, None].float() * freqs[None] # args shape: (B, half)
    emb = torch.cat([torch.sin(args), torch.cos(args)], dim=-1) # emb shape: (B, dim)
    return emb

class ToyUNet_DDPM(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.in_ch = in_ch
        out_ch = in_ch  # noise prediction → same shape as input
        # --- Time embedding ---
        self.time_mlp = nn.Sequential( # global time embedding. also to match the bottleneck dimension
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32)
        )
        self.time_proj_enc1 = nn.Linear(32, 16)
        self.time_proj_enc2 = nn.Linear(32, 32)
        self.time_proj_dec1 = nn.Linear(32, 16)
        self.time_proj_dec2 = nn.Linear(32, 8)
        # --- Encoder ---
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # --- Decoder ---
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(16 + 32, 16, 3, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(16, 8, 2, stride=2)
        self.dec2 = nn.Sequential(nn.Conv2d(8 + 16, 8, 3, padding=1), nn.ReLU())
        self.out_conv = nn.Conv2d(8, out_ch, 1)

    def forward(self, x, t): # x: [B, C, H, W], t: [B] (int)
        # --- time embedding ---
        temb = timestep_embedding(t, dim=128)   # [B,128]
        temb = self.time_mlp(temb)              # [B,32]
        # --- Encoder ---
        x1 = self.enc1(x) + self.time_proj_enc1(temb)[:, :, None, None]                      # [B, 16, H, W]
        x2 = F.max_pool2d(x1, 2)               # [B, 16, H/2, W/2]
        x3 = self.enc2(x2) + self.time_proj_enc2(temb)[:, :, None, None]                     # [B, 32, H/2, W/2]
        bottleneck = F.max_pool2d(x3, 2) + temb[:, :, None, None]      # [B, 32, H/4, W/4]
        # --- Decoder ---
        y = self.up1(bottleneck)                # [B, 16, H/2, W/2]
        y = torch.cat([y, x3], dim=1)           # [B, 16 + 32, H/2, W/2]
        y = self.dec1(y) + self.time_proj_dec1(temb)[:, :, None, None]                       # [B, 16, H/2, W/2]

        y = self.up2(y)                         # [B, 8, H, W]
        y = torch.cat([y, x1], dim=1)           # [B, 8 + 16, H, W]
        y = self.dec2(y) + self.time_proj_dec2(temb)[:, :, None, None]                       # [B, 8, H, W]

        return self.out_conv(y) # [B, out_ch, H, W]
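
A quick shape check: the predicted noise has exactly the same shape as the input, as required of $\epsilon_\theta(\mathbf{x}_t, t)$:

In [ ]:
net = ToyUNet_DDPM(in_ch=3)
x = torch.randn(4, 3, 32, 32)     # a batch of noisy images x_t
t = torch.randint(0, 200, (4,))   # one timestep per sample
print(net(x, t).shape)  # torch.Size([4, 3, 32, 32])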

Add residual block to U-Net¶

There are two types of "skip connections" in DDPM. Both allow information to bypass intermediate layers, but they differ in scope and mathematical operation.

  • U-Net long skip connections (encoder↔decoder): the key is concatenation, not addition.
  • ResBlock short residual connections (addition): usually happen between consecutive layers (short distance). For example, $y = F(x) + x$ (the key is element-wise addition). This occurs in blocks such as "Input x → Layer1 → Layer2 → Layer3 → Output." The residual block in U-Net is essentially the same concept as in ResNet.
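
Below is a minimal residual block with time-embedding injection, in the style of DDPM-flavored U-Nets (a sketch under assumed channel sizes, not the exact block from Ho et al.; GroupNorm here assumes channel counts divisible by 8):

In [ ]:
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, tdim):
        super().__init__()
        self.conv1 = nn.Sequential(nn.GroupNorm(8, in_ch), nn.SiLU(),
                                   nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.temb_proj = nn.Linear(tdim, out_ch)  # inject t into the block
        self.conv2 = nn.Sequential(nn.GroupNorm(8, out_ch), nn.SiLU(),
                                   nn.Conv2d(out_ch, out_ch, 3, padding=1))
        # 1x1 conv so the identity path matches the new channel count
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, temb):   # x: [B, in_ch, H, W], temb: [B, tdim]
        h = self.conv1(x)
        h = h + self.temb_proj(temb)[:, :, None, None]  # broadcast over H, W
        h = self.conv2(h)
        return h + self.skip(x)   # short residual connection: addition, not concat

Inside a U-Net, such blocks replace the plain Conv+ReLU stages, while the long encoder-decoder skips (concatenation) stay unchanged.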

Add attention to U-Net¶

Self-Attention in U-Net is an independent module that is "inserted" at a specific convolutional layer. It only models global relationships within that layer and does not connect across layers. Its Q, K, and V all come from the feature map of this layer, and it models the global dependencies among all pixels within this feature map. Typically, it looks like:

Conv → [optional Self-Attention] → Conv

Notice that for $X_{out} = \text{Attention}(X_{in})$, the shape of $X_{out}$ is the same as that of $X_{in}$.

We usually use self-attention in low resolution layers because

  • The complexity of self-attention is $O(N^2)$, where $N = H \times W$. For high-resolution layers (such as 64×64 or 128×128), $N$ is very large, and the computational and memory cost becomes prohibitive.
  • Convolutions already capture local features, so self-attention at medium or low resolutions is used to model global dependencies (the semantic relationships of the whole image), while the high-resolution layers are expected to add fine details rather than global dependencies.

Example of Computing Attention¶

We use Scaled Dot-Product Attention as an example. Suppose the input feature is $$ X \in \mathbb{R}^{N \times C} $$ where:

  • $N = H \times W$: the number of pixels (or latent tokens) after flattening;
  • $C$: the channel dimension at each position.

Linear transformations give Q, K, V: $$ Q = X W_Q, \quad K = X W_K, \quad V = X W_V $$

| Symbol | Shape | Meaning |
|---|---|---|
| $W_Q, W_K, W_V$ | $C \times d$ | learned linear projection matrices |
| $Q, K, V$ | $N \times d$ | query, key, and value vectors at each position |

The similarity matrix and attention weights: $$ A = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) $$ Element-wise form: $$ S = \frac{QK^T}{\sqrt{d}}, \quad S_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} \quad \Rightarrow \quad A_{ij} = \frac{\exp(S_{ij})}{\sum_{m=1}^{N} \exp(S_{im})} $$

| Symbol | Shape | Meaning |
|---|---|---|
| $QK^T$ | $N \times N$ | similarity between every pair of positions |
| $A$ | $N \times N$ | attention weights matrix, row-normalized |

Weighted sum to get the output: $$ Y = A V, \quad Y_i = \sum_{j=1}^{N} A_{ij} V_j $$

| Symbol | Shape | Meaning |
|---|---|---|
| $V$ | $N \times d$ | value representation at each position |
| $A V$ | $N \times d$ | global features for each query after attention |
| $Y$ | $N \times d$ | self-attention output (reshape back to $C \times H \times W$) |

Generally, in the attention framework, Q means "what am I looking for," K means "what am I," and V means "what am I providing." In an LDM U-Net:

| Symbol | What it means in LDM U-Net | Intuitive meaning |
|---|---|---|
| Q (Query) | A current pixel (latent location) "asks" the entire latent feature map: how should I adjust my noise prediction? | Each position decides how to combine information from other positions |
| K (Key) | The feature signature of each pixel (latent location), representing its structure/semantics | Indicates to others "what kind of feature am I" |
| V (Value) | The feature values each pixel offers for sharing (content others may use) | What features others extract from me to update themselves |

Why does attention achieve "global semantic fusion"?

  • $q_i \cdot k_j$: if position $i$ and $j$ have similar features (semantically related), the inner product is large;
  • after softmax, $A_{ij}$ converts this similarity into a weight;
  • the weighted sum $\sum_j A_{ij} v_j$ allows position $i$ to aggregate content from the entire map that is semantically related.

In other words: each position's new representation $y_i$ is "I (the query) retrieve, from all positions (keys), the information (values) weighted by semantic similarity."
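
A minimal single-head implementation mapping the equations above to code (a sketch; production U-Nets typically add GroupNorm, multiple heads, and a zero-initialized output projection). Here $d = C$, i.e., the projections keep the channel count:

In [ ]:
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)  # W_Q as a 1x1 conv
        self.k = nn.Conv2d(channels, channels, 1)  # W_K
        self.v = nn.Conv2d(channels, channels, 1)  # W_V
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                          # x: [B, C, H, W]
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # [B, N, C], N = H*W
        k = self.k(x).flatten(2).transpose(1, 2)   # [B, N, C]
        v = self.v(x).flatten(2).transpose(1, 2)   # [B, N, C]
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # A: [B, N, N]
        y = attn @ v                               # Y = A V: [B, N, C]
        y = y.transpose(1, 2).reshape(B, C, H, W)  # back to a feature map
        return x + self.proj(y)                    # residual: output shape = input shape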

--- $\varepsilon$ / $x_0$ / $v$ Prediction¶

Recall that in the reverse process, we need to compute the posterior mean $\mu_\theta(\mathbf{x}_t, t)$ where $$ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma'\right) $$ In the original DDPM, we use the fact that $\mu$ can be derived from $x_t$ and $x_0$, and since $x_0$ can be written as a function of $\varepsilon$, we use a network $\varepsilon_\theta(x_t,t)$ to predict $\varepsilon$ to derive $\mu$. In fact, we have three options for the network’s output, which are mathematically equivalent but differ in numerical stability during training.

  1. $\varepsilon$-prediction:$\varepsilon_\theta(x_t,t) \rightarrow \hat\varepsilon$ (from original DDPM)
  2. $x_0$-prediction:$x_{0,\theta}(x_t,t) \rightarrow \hat x_{0,\theta}$ (In the original DDPM paper (Ho et al., 2020, Sec. 3.2), the authors mentioned that the model could alternatively predict $x_0$, but they found it produced lower-quality samples compared to $\varepsilon$-prediction, so they adopted the latter as the main approach).
  3. $v$-prediction: $v_\theta(x_t,t) \rightarrow \hat v$. Predicts $v$: a linear combination of $\varepsilon$ and $x_0$ (The v-prediction idea originates from Salimans & Ho (2022) (Progressive Distillation), and it was popularized by Imagen (Saharia et al., 2022)).

By the Gaussian Conditioning Formula, the posterior mean of $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is $$ \mu' = \frac{\sqrt{\bar\alpha_{t-1}} \beta_t}{1 - \bar\alpha_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t-1})}{1 - \bar\alpha_t} \mathbf{x}_t $$ No matter which of $\varepsilon$ / $x_0$ / $v$ the network predicts, we ultimately use the predicted quantity to derive $\hat{\mathbf{x}}_0$ and plug it back into this formula to get $\mu_\theta(\mathbf{x}_t, t)$.

Below we will discuss the three approaches separately:

  • How to recover $\hat{\mathbf{x}}_0$, then plug $\hat{\mathbf{x}}_0$ into the formula to get $\mu_\theta(\mathbf{x}_t, t)$.
  • What the loss function looks like, and a brief derivation of how it comes about.
  • What are the advantages of using this prediction.

1. $\varepsilon$-prediction¶

Network output: $\varepsilon_\theta(x_t,t) \rightarrow \hat\varepsilon$

Recover $\hat{\mathbf{x}}_0$ from $\hat\varepsilon$ $$ \hat x_0(x_t,t)=\frac{x_t-\sqrt{1-\bar\alpha_t}\hat\varepsilon}{\sqrt{\bar\alpha_t}}. $$

Directly substitute $\hat x_0$ into the posterior mean formula to get $\mu_\theta$ $$ \mu_\theta(x_t,t)=\frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}\hat x_0 +\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t. $$

Loss: $\mathcal L=\mathbb{E} \left[ \| \boldsymbol{\epsilon} - \hat{\boldsymbol{\epsilon}} \|^2 \right]$

Advantages: Historically the most widely used; the formulas are simple and training is stable.

2. $x_0$-prediction¶

Network output: $x_{0,\theta}(x_t,t) \rightarrow \hat x_{0}$

Directly substitute $\hat x_0$ into the posterior mean formula to get $\mu_\theta$ $$ \mu_\theta(x_t,t)=\frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}\hat x_0 +\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t. $$

Loss: $\mathcal L=\mathbb E\left[\|x_0-\hat x_{0}\|^2\right]$ (Easily obtained by plugging $\mu_\theta(x_t,t)$ into $L_{t-1}$.)

Advantages: Directly targets the pixel space, which is convenient when combining with perceptual/discriminator losses.

3. $v$-prediction¶

Apply an orthogonal linear transformation (coefficients related to $t$) to $(x_0,\varepsilon)$: $$ a_t=\sqrt{\bar\alpha_t},\quad b_t=\sqrt{1-\bar\alpha_t},\quad \Rightarrow \quad v_t = a_t\varepsilon - b_tx_0 $$

Network output: $v_\theta(x_t,t) \rightarrow \hat v$

Recover $\hat x_0$ and $\hat\varepsilon$ from $\hat v$: Solve the system (Note $a_t^2+b_t^2=1$) $$ \begin{cases} x_t = a_t x_0 + b_t \varepsilon, \\ v_t = a_t \varepsilon - b_t x_0 \end{cases} \quad \Rightarrow \quad \begin{cases} \hat x_0=a_tx_t - b_t\hat v_\theta, \\ \hat\varepsilon=b_tx_t + a_t\hat v_\theta \end{cases} $$

Directly substitute $\hat x_0$ into the posterior mean formula to get $\mu_\theta$ $$ \mu_\theta(x_t,t)=\frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}[a_t x_t-b_t\hat v_\theta] +\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t $$

Loss: $\mathcal L=\mathbb E\left[\|v_t-\hat v\|^2\right]$ (Easily obtained by plugging $\mu_\theta(x_t,t)$ into $L_{t-1}$.)

Advantages: $(x_0,\varepsilon)$ are "rotated" so that gradients are more balanced during high/low SNR phases, leading to a trade-off between quality and controllability. Widely adopted by Imagen, Stable Diffusion v2+, etc.

In practice, when using a neural network for estimation, the three methods all take the same inputs, namely $x_t$ and $t$. The only differences are in the network head (output) and the loss function: you regress either $\epsilon$, $x_0$, or $v$.

In fact, during real training, if you already have an implementation and want to switch to a different prediction method (for numerical reasons, for example), you only need to change the output head and the loss target to the corresponding quantity. The posterior mean ($\mu$) can be computed using the relevant formula for each prediction type. All other training and sampling procedures stay the same.
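
As a concrete sketch of that switch (reusing the Diffusion class defined earlier; `pred_type` is a hypothetical flag, and the $x_0$/$v$ heads are assumptions rather than part of the implementation above):

In [ ]:
import torch
import torch.nn.functional as F

def training_loss(model, diffusion, x0, pred_type="eps"):
    # assumes flat data (B, xdim), matching q_sample above
    t = torch.randint(0, diffusion.T, (x0.size(0),), device=x0.device)
    x_t, eps = diffusion.q_sample(x0, t)
    a = diffusion.sqrt_alphas_cumprod[t][:, None]            # a_t = sqrt(abar_t)
    b = diffusion.sqrt_one_minus_alphas_cumprod[t][:, None]  # b_t = sqrt(1 - abar_t)
    pred = model(x_t, t)
    if pred_type == "eps":    # original DDPM: regress the noise
        target = eps
    elif pred_type == "x0":   # regress the clean sample
        target = x0
    else:                     # "v": regress v_t = a_t * eps - b_t * x0
        target = a * eps - b * x0
    return F.mse_loss(pred, target)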

--- Appendix¶

[Image: ddpm (detailed calculations referenced in the text above)]