FlowBasedModels

Flow Based Models

Takeaway:

What is a coupling layer?
What is additive coupling?
What is affine coupling? How does it differ from additive coupling?
What are the training objectives look like in NICE (Non-linear Independent Components Estimation) and RealNVP (Real-valued Non-Volume Preserving)?
How to implement and train a NICE / RealNVP model?

--- Flow Based Models Overview¶

Flow based models follow the principle of "first sample latent $\mathbf{z}$, then generate data using $\mathbf{z}$", similar to GANs, VAEs, and Diffusion models.

In flow based models, to generate data from the latent variable $\mathbf{z}$, we use the following transformation $$ \mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}) $$ Let us define $G = f_K \circ f_{K-1} \circ \cdots \circ f_1$, so that $\mathbf{x} = G(\mathbf{z})$.

However, during training, we optimize the inverse function $G^{-1}$ (i.e., mapping data $\mathbf{x}$ to latent variable $\mathbf{z}$), and during generation, we use the forward function $G$ to map $\mathbf{z}$ back to $\mathbf{x}$.

Definition of Coupling Layer

In flow-based models, a coupling layer typically refers to a basic invertible transformation unit denoted by $f$. It is defined in both directions: the forward function $f$, and its inverse $f^{-1}$. In other words, a coupling layer is a bidirectional mapping $$ f: \mathbf{x} \mapsto \mathbf{y}, \quad f^{-1}: \mathbf{y} \mapsto \mathbf{x} $$ Specifically, during training (i.e., mapping $\mathbf{x} \to \mathbf{z}$), you use the inverse $f^{-1}$; during generation (i.e., mapping $\mathbf{z} \to \mathbf{x}$), you use the forward function $f$.

Training Objective Of Flow Based Models¶

To build a good model, we want the generated samples to resemble the data distribution. A reasonable training objective is to make the model distribution $p_G(\mathbf{x})$ be as close as possible to the true data distribution $p_\text{data}(\mathbf{x})$. A natural metric is the KL divergence $D_{\text{KL}}(p_\text{data}(\mathbf{x}) \| p_G(\mathbf{x}))$, and we want to minimize it. And remember that, mathematically, minimizing the KL divergence is equivalent to maximizing the likelihood $\mathbb{E}_{p_\text{data}(\mathbf{x})} [ \log p_G(\mathbf{x}) ]$ since $$D_{\text{KL}}(p_\text{data}(\mathbf{x}) \| p_G(\mathbf{x})) = \mathbb{E}_{p_\text{data}(\mathbf{x})}\big[ \log p_\text{data}(\mathbf{x}) - \log p_G(\mathbf{x}) \big]$$ and $\log p_\text{data}(\mathbf{x})$ does not depend on model parameters. And here is a standard trick: maximizing $\mathbb{E}_{p_\text{data}(\mathbf{x})}[\log p_G(\mathbf{x})]$ is equivalent to maximizing $\log p_G(\mathbf{x})$ over all samples, since $\max_G \; \mathbb{E}_{p_\text{data}(\mathbf{x})} [ \log p_G(\mathbf{x}) ] = \max_G \; \int p_\text{data}(\mathbf{x}) \, \log p_G(\mathbf{x}) \, d\mathbf{x}$ and the empirical distribution $p_\text{data}(\mathbf{x})$ does not depend on $G$. Thus we use the marginal likelihood $$\max_G \log p_G(\mathbf{x})$$ as the training objective of flow based models. We want to maximize it during training. The meaning of likelihood also fits the intuition: the probability that the model assigns to the observed data.

So how do we maximize the training objective $\log p_G(\mathbf{x})$? The design of $G$ ensures that if the prior $p_Z(\mathbf{z})$ has a closed form (e.g., Gaussian) and the determinant of the Jacobian matrix $J_G$ is tractable, then $\log p_G(\mathbf{x})$ can be computed exactly and optimized directly.

The change of variables formula gives: $$ p_G(\mathbf{x}) = p_Z(\mathbf{z}) \cdot \left| \det \left( \frac{\partial G^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right) \right| = p_Z(\mathbf{z}) \cdot \left| \det \left( J_{G^{-1}}(\mathbf{x}) \right) \right|, $$ where the Jacobian matrix is $$ J_{G^{-1}}(\mathbf{x}) = \frac{\partial G^{-1}(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \cdots & \frac{\partial z_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_d}{\partial x_1} & \cdots & \frac{\partial z_d}{\partial x_d} \end{bmatrix} $$

The Jacobian determinant characterizes how a small volume element in space is stretched or compressed after passing through a coupling layer (in either the forward or inverse direction). It quantifies the local volume scaling effect of the transformation and serves as the key link between the density of the original variables and the transformed ones.

Taking the log: $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log \left| \det \left( J_{G^{-1}}(\mathbf{x}) \right) \right| $$ Let $\mathbf{h}$ be the intermediate output of the chain of multiple invertible transformations $$ \mathbf{h}_K = \mathbf{x}, \quad \mathbf{h}_{K-1} = f_K^{-1}(\mathbf{h}_K), \quad \mathbf{h}_{K-2} = f_{K-1}^{-1}(\mathbf{h}_{K-1}), \; \ldots, \; \mathbf{h}_0 = f_1^{-1}(\mathbf{h}_1) = \mathbf{z} $$ By the chain rule for Jacobians, $$ J_{G^{-1}}(\mathbf{x}) = J_{f_1^{-1}}(\mathbf{h}_1) \cdot J_{f_2^{-1}}(\mathbf{h}_2) \cdots J_{f_K^{-1}}(\mathbf{h}_K) $$ Therefore, its determinant factorizes: $$ \left| \det J_{G^{-1}}(\mathbf{x}) \right| = \prod_{k=1}^{K} \left| \det \big( J_{f_k^{-1}}(\mathbf{h}_k) \big) \right| $$ Taking logs turns this into a sum: $$ \log \left| \det J_{G^{-1}}(\mathbf{x}) \right| = \sum_{k=1}^{K} \log \left| \det \big( J_{f_k^{-1}}(\mathbf{h}_k) \big) \right| $$ The marginal log-likelihood for a flow composed of multiple coupling layers is: $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{k=1}^{K} \log \left| \det \big( J_{f_k^{-1}}(\mathbf{h}_k) \big) \right| $$

In the rest of the notes, we’ll explore the earliest and most basic flow-based models: NICE and RealNVP. These models mainly differ in their choice of coupling layers. We’ll examine how their coupling layers are designed and how the corresponding $\log p_G(\mathbf{x})$ is computed.

--- NICE Architecture: Additive Coupling (the most basic flow model)¶

In an additive coupling layer, to formulate an invertible transformation, we firstly need to split the input vector $\mathbf{x} \in \mathbb{R}^d$ into two parts by a mask vector $m\in\{0, 1\}^d$. One part remains unchanged (simply copied); The other part is updated by adding a shift computed by a neural network that takes the unchanged part as input. The essence of additive coupling lies in the design of the mask and the I/O of the neural network. We'll see this design enables an elegant invertible transformation because: 1. It guarantees invertibility by preventing the updated part from depending on itself; 2. It makes the Jacobian triangular, so the determinant is easy to compute; 3. With multiple layers and alternating masks, every dimension can eventually be updated.

⚠️ The mask itself is manually designed and non-trainable. It serves as a structural constraint rather than a model parameter.

Formally, the split is $$ \mathbf{x}_1 = \mathbf{x} \odot m, \quad \mathbf{x}_2 = \mathbf{x} \odot (\mathbf{1} - m) $$

And the two parts are transformed as the following, where $\text{NET}$ is a neural network $\mathbb{R}^{\text{sum}(m)} \rightarrow \mathbb{R}^{d - \text{sum}(m)}$. To make data fit into the neural network, we also need to define extraction/embedding operators:

$\text{extract}(\mathbf{x}) \in \mathbb{R}^{\text{sum}(m)}$ means extract the components at positions $m=1$;
$\text{embed}(s) \in \mathbb{R}^d$ means takes a vector in $\mathbb{R}^{d - \text{sum}(m)}$, places it back into the positions where $m=0$, and pads zeros elsewhere.

The output of the neural network is used to transform $\mathbf{x}_2$. Specifically, it gives $\mathbf{x}_2$ a shift. We'll see in RealNVP to give $\mathbf{x}_2$ a scaling term additionally. $$ \mathbf{y}_1 = \mathbf{x}_1, \quad \mathbf{y}_2 = \mathbf{x}_2 + \text{embed}\!\left(\text{NET}\big(\text{extract}(\mathbf{x}_1)\big)\right) $$

Notice that $\mathbf{y}_1$ only has non-zero bits at positions $m=1$, and $\mathbf{y}_2$ only has non-zero bits at positions $m=0$. Based on the split, we can define an invertible transformation $f(\cdot)$ as following.

Forward function $f$: $\mathbf{x} \rightarrow \mathbf{y}$ $$ \mathbf{y} = \mathbf{y}_1 + \mathbf{y}_2 % = \mathbf{x} \odot \mathbf{m} + \mathbf{x} \odot (\mathbf{1} - \mathbf{m}) + \text{embed}\!\left(\text{NET}\big(\text{extract}(\mathbf{x}_1)\big)\right) $$

Inverse function $f^{-1}$: $\mathbf{y} \rightarrow \mathbf{x}$ (notice that $\text{extract}(\mathbf{x}_1) = \text{extract}(\mathbf{x}) = \text{extract}(\mathbf{y}_1) = \text{extract}(\mathbf{y})$) $$ \begin{aligned} \mathbf{x}_1 &= \mathbf{y}_1 = \mathbf{y} \odot \mathbf{m} \\ \mathbf{x}_2 &= \mathbf{y}_2 - \text{embed}\!\left(\text{NET}\big(\text{extract}(\mathbf{x}_1)\big)\right) = \mathbf{y} \odot (\mathbf{1} - \mathbf{m}) - \text{embed}\!\left(\text{NET}\big(\text{extract}(\mathbf{y})\big)\right) \\ \mathbf{x} &= \mathbf{x}_1 + \mathbf{x}_2 \end{aligned} $$

Design Intuition Of Splitting By Mask¶

From the forward and inverse functions—$\mathbf{y} = \mathbf{x} + \text{embed}(\text{NET}(\text{extract}(\mathbf{x})))$ and $\mathbf{x} = \mathbf{y} - \text{embed}(\text{NET}(\text{extract}(\mathbf{y})))$—we already see the simplicity and elegance of the computation. This arises from splitting $\mathbf{x}$ with a mask, which also ensures a triangular Jacobian, and by stacking layers with alternating masks, every dimension can eventually be transformed.

① Guarantee invertibility (avoiding self-dependence of the updated part)

If the input and output of the neural network overlap in dimensions, we can no longer guarantee that the transformation is invertible—because the inverse mapping $f^{-1}(\mathbf{y})$ may not be available in closed form, making it difficult to recover $\mathbf{x}$ from $\mathbf{y}$. For example, suppose we incorrectly define the transformation as: $$ \mathbf{y}_1 = \mathbf{x}_1, \quad \mathbf{y}_2 = \mathbf{x}_2 + \text{net}(\mathbf{x}_1, \mathbf{x}_2). $$ Then, to compute the inverse, we would need to solve: $$ \begin{cases} \mathbf{x}_1 = \mathbf{y}_1, \\ \mathbf{x}_2 = \mathbf{y}_2 - \text{net}(\mathbf{x}_1, \mathbf{x}_2). \end{cases} $$ The problem lies in the second equation: the right-hand side, $\text{net}(\mathbf{x}_1, \mathbf{x}_2)$, includes $\mathbf{x}_2$ itself, so this results in an implicit equation: $$ \mathbf{x}_2 + \text{net}(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{y}_2 $$ To solve for $\mathbf{x}_2$, we would typically need iterative methods or numerical solvers, which means the inverse is no longer in closed form.
② Triangular Jacobian, determinant easy to compute

Let $d_1 = \text{sum}(m), d_2 = d - \text{sum}(m)$, and let $t=\text{embed}\!\left(\text{NET}\big(\text{extract}(\mathbf{\cdot})\big)\right)$. Use the transformation $$ \mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 + t(\mathbf{x}_1), \quad t: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}, $$ we compute the Jacobian of $(\mathbf{x}_1, \mathbf{x}_2) \mapsto (\mathbf{y}_1, \mathbf{y}_2)$, written in block matrix form as: $$ J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}_1} & \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}_2} \\[2pt] \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}_1} & \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}_2} \end{bmatrix} = \begin{bmatrix} I_{d_1} & 0 \\[2pt] \displaystyle \frac{\partial t}{\partial \mathbf{x}_1}(\mathbf{x}_1) & I_{d_2} \end{bmatrix}. $$ This is a lower triangular block matrix with identity matrices along the diagonal. Therefore, $$ \det J = \det(I_{d_1}) \cdot \det(I_{d_2}) = 1, \qquad \log|\det J| = 0. $$ Recall that the Jacobian determinant $\det J$ characterizes the change in volume induced by the transformation $f$. Thus $\det J = 1$ shows the volume-preserving property of NICE, which makes sense because the additive coupling layer only imposes shift but not scaling. The term $\log|\det J| = 0$ in the log-likelihood can be computed with zero cost. If this value is nontrivial to compute, it introduces significant computational overhead at each step. That's why NICE is nice.
③ Multiple layers with alternating masks ensure coverage of all dimensions

In a single layer, only $\mathbf{x}_2$ is updated while $\mathbf{x}_1$ stays unchanged. With just one fixed mask, some dimensions would never be directly modified. So we can stack multiple layers with alternating (or permuted) masks so that the updated subset rotates across layers. For example, if the input dim is 4, we have two layers of additive coupling with alternating mask.
- In the first layer, mask $m^{(1)}=(1,0,1,0)$ so that $\mathbf{x}_1=(x_1,x_3), \; \mathbf{x}_2=(x_2,x_4)$: $$ \text{Layer 1:}\quad \mathbf{y}_1=\mathbf{x}_1,\quad \mathbf{y}_2=\mathbf{x}_2+t_1(\mathbf{x}_1). $$
- In the second layer, flip the mask: $m^{(2)}=(0,1,0,1)$, so now $\mathbf{y}_2$ is the unchanged block and $\mathbf{y}_1$ is updated: $$ \text{Layer 2:}\quad \mathbf{z}_2=\mathbf{y}_2,\quad \mathbf{z}_1=\mathbf{y}_1+t_2(\mathbf{y}_2). $$
We can see the second layer uses the already updated part $\mathbf{y}_2$ to update the other half $\mathbf{y}_1$, so information flows between the two halves. By alternating masks across layers, every dimension will eventually be directly updated, and through chaining, information propagates across all dimensions.

Training Objective Of NICE¶

Recall that the marginal log-likelihood is: $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{k=1}^{K} \log \left| \det \big( J_{f_k^{-1}}(\mathbf{h}_k) \big) \right| $$ And in NICE, we have $$ \left| \det \left( J_{f_k^{-1}}(\mathbf{h}_k) \right) \right| = 1 \quad \Rightarrow \quad \log \left| \det \left( J_{f_k^{-1}}(\mathbf{h}_k) \right) \right| = 0 $$ Thus $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}), $$ All of the log-likelihood information comes from the latent density.

--- RealNVP Architecture: Affine Coupling¶

In RealNVP, the same mask-and-network design is used, but it uses a different coupling layer: affine coupling, where the invertible transformation adds not only a shift but also a scale factor. Below is the affine forward and inverse process.

Define two networks:

$\text{SNET}: \mathbb{R}^{\sum m} \to \mathbb{R}^{d - \sum m}$ for the scale,
$\text{TNET}: \mathbb{R}^{\sum m} \to \mathbb{R}^{d - \sum m}$ for the translation.

We use extract and embed operations as in NICE. We define $s$ and $t$ as: $$ s = \text{embed}\big(\text{SNET}(\text{extract}(\cdot))\big), \qquad t = \text{embed}\big(\text{TNET}(\text{extract}(\cdot))\big), $$

Forward function $f: \mathbf{x} \to \mathbf{y}$ $$ \begin{aligned} \mathbf{y}_1 &= \mathbf{x}_1 = \mathbf{x} \odot \mathbf{m}, \\ \mathbf{y}_2 &= \mathbf{x}_2\odot \exp(s(\mathbf{x}_1)) + t(\mathbf{x}_1) = \left(\mathbf{x} \odot (1 - \mathbf{m})\right) \odot \exp(s(\mathbf{x})) + t(\mathbf{x}), \\ \mathbf{y} &= \mathbf{y}_1 + \mathbf{y}_2. \end{aligned} $$

Inverse function $f^{-1}: \mathbf{y} \to \mathbf{x}$ (note that $\text{extract}(\mathbf{x}_1) = \text{extract}(\mathbf{x}) = \text{extract}(\mathbf{y}_1) = \text{extract}(\mathbf{y})$) $$ \begin{aligned} \mathbf{x}_1 &= \mathbf{y}_1 = \mathbf{y} \odot \mathbf{m}, \\ \mathbf{x}_2 &= \big( \mathbf{y}_2 - t(\mathbf{x}_1) \big) \odot \exp(-s(\mathbf{x}_1)) = \left(\mathbf{y} \odot (1 - \mathbf{m}) - t(\mathbf{y}) \right) \odot \exp(-s(\mathbf{y})), \\ \mathbf{x} &= \mathbf{x}_1 + \mathbf{x}_2. \end{aligned} $$

We compute the Jacobian of the transformation $(\mathbf{x}_1, \mathbf{x}_2) \mapsto (\mathbf{y}_1, \mathbf{y}_2)$, written in block matrix form as: $$ J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}_1} & \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}_2} \\[2pt] \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}_1} & \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}_2} \end{bmatrix} = \begin{bmatrix} I_{d_1} & 0 \\[2pt] \displaystyle \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}_1} & \text{diag}(\exp(s(\mathbf{x}))) \end{bmatrix}. $$ This is again a triangular matrix, but now with a nontrivial diagonal in the lower-right block. Therefore, the determinant is the product of the diagonal entries: $$ \det J = \det(I_{d_1}) \cdot \det\big(\text{diag}(\exp(s(\mathbf{x})))\big) = \prod_{j=1}^{d_2} \exp(s_j(\mathbf{x})), $$ $$ \log|\det J| = \sum_{j=1}^{d_2} s_j(\mathbf{x}) = \mathbf{1}^\top s(\mathbf{x}). $$

Recall that the marginal log-likelihood is: $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}) + \sum_{k=1}^{K} \log \left| \det \big( J_{f_k^{-1}}(\mathbf{h}_k) \big) \right| $$ And in RealNVP, we have (let $s^{(k)}$ be the scale network of layer $k$) $$ \left| \det \left( J_{f_k^{-1}}(\mathbf{h}_k) \right) \right| = \prod_i \exp(-s_i^{(k)}(\mathbf{h}_k)) = \exp\left(-\sum_i s_i^{(k)}(\mathbf{h}_k)\right) \quad \Rightarrow \quad \log \left| \det \left( J_{f_k^{-1}}(\mathbf{h}_k) \right) \right| = -\sum_i s_i^{(k)}(\mathbf{h}_k) $$ Thus $$ \log p_G(\mathbf{x}) = \log p_Z(\mathbf{z}) - \sum_{k=1}^K \sum_i s_i^{(k)}(\mathbf{h}_{k}) $$

The affine coupling keeps all the advantages of additive coupling: it remains invertible because $\mathbf{x}_2$ can be recovered in closed form, the Jacobian is still triangular so the log-determinant is easy to compute, and by stacking multiple layers with alternating masks, all dimensions can be updated. At the same time, introducing scaling greatly increases the model’s expressiveness, since the transformation can now change volume rather than being strictly volume-preserving as in NICE.

--- NICE Implementation¶

In [ ]:

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

In [ ]:

# ===== NICE 架构 =====
class AdditiveCoupling(nn.Module):
    def __init__(self, dim, mask):
        super().__init__()
        self.dim = dim
        self.register_buffer("mask", mask)
        hidden = 128
        in_dim = int(mask.sum().item()) # how many dimensions are 1
        out_dim = dim - in_dim # how many dimensions are 0
        self.net = nn.Sequential( # dims with mask is 1 -> dims with mask is 0
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim)
        )

    def forward(self, x, reverse=False):
        x1 = x * self.mask
        x2 = x * (1 - self.mask)
        h = x1[:, self.mask.bool()]
        shift = self.net(h)
        y = x.clone()
        if not reverse:
            y2 = x2[:, (1 - self.mask).bool()] + shift
            logdet = x.new_zeros(x.size(0))  # log|det J| = 0
        else:
            y2 = x2[:, (1 - self.mask).bool()] - shift
            logdet = x.new_zeros(x.size(0))
        y[:, (1 - self.mask).bool()] = y2
        return y, logdet


class NICE(nn.Module):
    def __init__(self, dim=2, num_coupling=6):
        super().__init__()
        masks = []
        base = torch.tensor([0, 1])  # 交替 mask
        for i in range(num_coupling):
            masks.append(base if i % 2 == 0 else 1 - base)
        self.layers = nn.ModuleList([AdditiveCoupling(dim, m.float()) for m in masks])
        self.prior = torch.distributions.MultivariateNormal(
            torch.zeros(dim), torch.eye(dim)
        )

    def fwd(self, x):
        logdet = x.new_zeros(x.size(0))
        z = x
        for layer in self.layers:
            z, ld = layer(z, reverse=False)
            logdet += ld
        return z, logdet

    def inv(self, z):
        x = z
        logdet = z.new_zeros(z.size(0))
        for layer in reversed(self.layers):
            x, ld = layer(x, reverse=True)
            logdet += ld
        return x, logdet

    def log_prob(self, x):
        z, logdet = self.fwd(x)
        log_pz = self.prior.log_prob(z)
        return log_pz + logdet  # logdet = 0 in NICE

    def sample(self, n):
        z = self.prior.sample((n,))
        x, _ = self.inv(z)
        return x

In [ ]:

# ===== 数据准备 =====
n_samples = 10000
X, _ = make_moons(n_samples=n_samples, noise=0.05)
X = torch.tensor(X, dtype=torch.float32)

train_loader = torch.utils.data.DataLoader(
    X, batch_size=256, shuffle=True
)

# ===== 训练 =====
device = "cuda" if torch.cuda.is_available() else "cpu"
model = NICE(dim=2, num_coupling=6).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    total_loss = 0
    for batch in train_loader:
        batch = batch.to(device)
        loss = -model.log_prob(batch).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * batch.size(0)
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader.dataset):.4f}")

# ===== 生成 =====
with torch.no_grad():
    samples = model.sample(1000).cpu().numpy()

plt.figure(figsize=(6,6))
plt.scatter(samples[:,0], samples[:,1], alpha=0.5, s=10, label="Generated")
plt.scatter(X[:1000,0], X[:1000,1], alpha=0.5, s=10, label="Real Data")
plt.legend()
plt.show()

Epoch 1, Loss: 2.2348
Epoch 2, Loss: 2.1082
Epoch 3, Loss: 2.0361
Epoch 4, Loss: 2.0241
Epoch 5, Loss: 1.9980
Epoch 6, Loss: 1.9760
Epoch 7, Loss: 1.9686
Epoch 8, Loss: 1.9661
Epoch 9, Loss: 1.9602
Epoch 10, Loss: 1.9542

No description has been provided for this image

--- History of Flow Based Models¶

NICE (2015)
- The first practical flow-based model, introducing coupling layers.
RealNVP (2017)
- Builds on NICE by adding affine coupling layers, greatly improving expressiveness.
- Extends flow-based modeling to images, making it a milestone in the field.
Glow (2018)
- Improves upon RealNVP by introducing invertible 1×1 convolutions, further enhancing modeling capacity.
- Became widely known for generating high-quality images (e.g., CelebA-HQ faces).
Flow++ (2019)
- Improves on RealNVP/Glow by using a mixture of logistics as the base distribution and more powerful coupling networks.
- Demonstrates the development direction of flow-based models: moving toward richer base distributions, more expressive transformations, and stronger architectures for scaling to complex, high-dimensional data.