GANs consist of two neural networks—a generator and a discriminator—that compete against each other in a zero-sum game, essentially minimizing the divergence between the generated distribution and the real distribution.
Takeaway:
- What is a Generative Adversarial Network (GAN) and how to implement a GAN model?
- GAN is a zero-sum game with infinite strategies.
- How are the generator and discriminator losses of a GAN related to the minimax objective of a zero-sum game?
- Why, at the Nash equilibrium, does the generated data distribution $p_G(x)$ converge to the real distribution $p_{\text{data}}(x)$ (the Jensen-Shannon divergence is 0, so that $p_G(x) = p_{\text{data}}(x)$)?
--- GAN Architecture¶
GANs consist of two neural networks—a generator and a discriminator.
Generator: Learns to generate data resembling the training dataset. It is a network that takes some input (for example, random noise, or old faces in an image-to-image setting) and generates fake data resembling real data (for example, young faces).
Discriminator: Learns to distinguish between real data and generated data. It is a classification network that takes real or fake data as input (both have the same dimensionality) and outputs how likely the input is to be real. Initially, the discriminator is good at distinguishing real from fake, but as the generator improves, the discriminator gets fooled more and more; once training converges, it assigns the same prediction to real and fake data.
Mathematical Description of GAN Architecture¶
Notation
- Let $z \sim p_z(z)$: latent variable (e.g., Gaussian noise) sampled from a prior distribution.
- Let $x \sim p_{\text{data}}(x)$: real data sampled from the true data distribution.
- Let $G(z; \theta_g)$: generator network with parameters $\theta_g$, maps $z$ to generated data $\tilde{x}$.
- Let $D(x; \theta_d)$: discriminator network with parameters $\theta_d$, outputs the probability that input $x$ is real.
Generator
- Input: Random noise $z \sim p_z(z)$
- Output: Fake/generated sample $\tilde{x} = G(z)$
Discriminator
- Input: Data sample $x$ (either real from $p_{\text{data}}$ or fake from $G(z)$)
- Output: Probability $D(x) \in [0, 1]$ that $x$ is real
Loss Functions
Discriminator's Loss:
The loss function computes the cross-entropy for real and fake samples separately and sums them. The discriminator aims to maximize correct classification, which is equivalent to minimizing this loss.
For real samples, the true label is always $y = 1$, so the cross-entropy becomes: $$ \mathbb{E}_{x \sim p_{\text{data}}} \left[ -1 \cdot \log D(x) - (1 - 1) \cdot \log(1 - D(x)) \right] = \mathbb{E}_{x \sim p_{\text{data}}} \left[ -\log D(x) \right] $$ For fake samples, the true label is always $y = 0$, so the cross-entropy becomes: $$ \mathbb{E}_{z \sim p_z} \left[ -0 \cdot \log D(G(z)) - (1 - 0) \cdot \log(1 - D(G(z))) \right] = \mathbb{E}_{z \sim p_z} \left[ -\log(1 - D(G(z))) \right] $$ So the loss function is $$ \mathcal{L}_D = - \mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)] - \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))] $$
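In code, this discriminator loss is just binary cross-entropy with target 1 for real samples and target 0 for fake samples. A minimal PyTorch sketch (the function and tensor names here are illustrative, not a fixed API):
import torch
import torch.nn.functional as F
def discriminator_loss(d_real, d_fake):
    # d_real: D(x) on real samples, d_fake: D(G(z)) on fake samples, both in (0, 1)
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))   # -E[log D(x)]
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))  # -E[log(1 - D(G(z)))]
    return loss_real + loss_fake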
Generator's Loss:
Non-Saturating Form
The generator tries to fool the discriminator, i.e., to make the discriminator believe that fake samples are real. This means maximizing $D(G(z))$, making it close to 1. This can be formulated as minimizing a binary cross-entropy loss between $D(G(z))$ and the target label (all set to $y = 1$): $$ \mathbb{E}_{z \sim p_z} \left[ -1 \cdot \log D(G(z)) - (1 - 1) \cdot \log(1 - D(G(z))) \right] = \mathbb{E}_{z \sim p_z} \left[ -\log D(G(z)) \right] $$ So the generator's cross-entropy loss function is the following; minimizing this loss is equivalent to maximizing $D(G(z))$. $$ \mathcal{L}_G = -\mathbb{E}_{z \sim p_z} [\log D(G(z))] $$
Saturating Form
Another way of maximizing $D(G(z))$ is to push it away from 0. This can be formulated as maximizing a binary cross-entropy between $D(G(z))$ and the target label (all set to $y = 0$). Notice that this term is not a cross-entropy loss in the usual sense, because the generator does not minimize this cross-entropy; instead, it attempts to maximize it. Here, we refer to the fundamental definition of cross-entropy as a measure of divergence between two distributions. $$ \mathbb{E}_{z \sim p_z} \left[ -0 \cdot \log D(G(z)) - (1 - 0) \cdot \log(1 - D(G(z))) \right] = \mathbb{E}_{z \sim p_z} \left[ -\log(1 - D(G(z))) \right] $$ Maximizing the above cross-entropy is equivalent to minimizing the following expression, so the generator's saturating loss function is: $$ \mathcal{L}_G^{\text{saturating}} = \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z)))] $$ ⚠️ Limitation: This loss can lead to vanishing gradients early in training, especially when $D(G(z))$ is close to 0 (i.e., the discriminator is very confident that the samples are fake). In that case, $\log(1 - D(G(z)))$ saturates and provides almost no gradient to the generator. That's why we commonly use the non-saturating form.
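To see the vanishing-gradient issue concretely, here is a small PyTorch sketch (illustrative only) that compares the gradient each generator loss sends back through the discriminator's sigmoid logit when $D(G(z))$ is close to 0:
import torch
# Discriminator logit for a generated sample; sigmoid(logit) = D(G(z)).
# A large negative logit means the discriminator is confident the sample is fake.
logit = torch.tensor(-5.0, requires_grad=True)
d_fake = torch.sigmoid(logit)                      # D(G(z)) ~= 0.0067
# Saturating generator loss: minimize log(1 - D(G(z)))
loss_sat = torch.log(1 - d_fake)
grad_sat, = torch.autograd.grad(loss_sat, logit, retain_graph=True)
# Non-saturating generator loss: minimize -log D(G(z))
loss_ns = -torch.log(d_fake)
grad_ns, = torch.autograd.grad(loss_ns, logit)
print(grad_sat.item())  # ~ -0.0067: nearly zero gradient, the generator barely learns
print(grad_ns.item())   # ~ -0.9933: large gradient, the generator keeps learning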
Training Procedure
- Fix $G$, update $D$ to maximize $\log D(x) + \log(1 - D(G(z)))$
- Fix $D$, update $G$ to minimize $\log(1 - D(G(z)))$ or maximize $\log D(G(z))$
--- The Convergence of GAN: Game Theory Foundation¶
In the previous sections, we introduced the architecture of GANs and how they are trained using loss functions for the generator and discriminator. However, it's still unclear how close the trained generator actually gets to the true data distribution. Since GAN training involves two separate objectives—the generator loss and the discriminator loss—it is not a well-defined single-objective convex optimization problem. In fact, the GAN training process can be understood as a saddle point problem, where the generator aims to minimize a value function while the discriminator aims to maximize it. This adversarial dynamic leads to an optimization landscape that is fundamentally different from traditional convex problems, often resulting in instability during training.
This section provides the theoretical foundation of GANs via game theory, showing that the generator, when trained properly, indeed reaches a form of optimality—specifically, it tries to minimize the Jensen-Shannon divergence between the real data distribution and the generator’s distribution.
This section includes:
- An introduction to zero-sum games and to finite versus infinite strategy spaces. How the discriminator and generator loss functions lead to the minimax objective, showing that a GAN is fundamentally a zero-sum game with an infinite strategy space.
- An introduction to the Nash equilibrium and von Neumann's minimax theorem. We show that at the global optimum, the system reaches $p_G = p_{\text{data}}$ and the optimal discriminator becomes $D^*(x) = 0.5$ for all $x$, and that this global optimum is also a Nash equilibrium.
GAN: A Zero-Sum Game¶
A zero-sum game is a type of game in game theory where one player's gain is exactly equal to the other player's loss. In other words, the sum of payoffs for all players is always zero: $$ \text{Payoff}_{\text{Player A}} + \text{Payoff}_{\text{Player B}} = 0 $$ Formally, for a two-player game with payoff function $V(A, B)$, the objective is: $$ \max_A \min_B V(A, B) $$ This means Player A wants to maximize the value while Player B wants to minimize it: a purely adversarial interaction.
A finite strategy space means each player only has finitely many pure strategies to choose from (e.g., Rock–Paper–Scissors). In contrast, an infinite strategy space means players can choose from a continuum of possible strategies (e.g., choosing a real number, or parameterizing a function).
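As a small, purely illustrative side note on the finite-strategy case (not part of the GAN itself), here is a NumPy sketch using the Rock–Paper–Scissors payoff matrix: with pure strategies, the best payoff Player A can guarantee (maximin) and the best Player B can hold A to (minimax) do not coincide, which is why the minimax theorem discussed later allows mixed strategies.
import numpy as np
# Rock-Paper-Scissors payoff matrix for player A (rows: A's move, cols: B's move).
# Each entry is A's payoff; B's payoff is the negative, so the game is zero-sum.
#            B: Rock  Paper  Scissors
V = np.array([[  0,    -1,     1],   # A: Rock
              [  1,     0,    -1],   # A: Paper
              [ -1,     1,     0]])  # A: Scissors
maximin = V.min(axis=1).max()   # best payoff A can guarantee with a pure strategy: -1
minimax = V.max(axis=0).min()   # best A can be held to when B picks a pure strategy:  1
print(maximin, minimax)  # -1 1: no pure-strategy saddle point exists;
# with mixed strategies (each move with probability 1/3) the game value is 0.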
It is not hard to see that a GAN is a zero-sum game: the discriminator and the generator have opposing goals, so any improvement for one comes exactly at the expense of the other. Moreover, both players effectively have infinite strategy spaces:
- Generator: its strategy is defined by choosing a neural network $G(z; \theta_G)$ that maps noise $z \sim p_z(z)$ to data space. Since $\theta_G$ are real-valued parameters in a high-dimensional continuous space, the generator has infinitely many possible strategies.
- Discriminator: its strategy is defined by choosing a function $D(x; \theta_D) \in [0,1]$. Similarly, $\theta_D$ are real-valued continuous parameters, so the discriminator also has infinitely many possible strategies.
Mathematically, GANs can be shown to form a zero-sum game. The opposing goals of the discriminator and generator are explicitly encoded in their loss functions (derived in the GAN architecture section), which combine into a minimax objective: $$ \min_G \max_D V(D, G). $$ This minimax formulation captures their adversarial interaction, provides the theoretical foundation for GAN training, and explicitly shows that GAN is a zero-sum game.
Deriving the Minimax Objective of GAN¶
The discriminator aims to maximize the probability of assigning the correct label (real vs fake), leading to the following objective (as previously derived from the binary cross-entropy loss): $$ \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$
The generator, on the other hand, aims to minimize the probability that the discriminator correctly identifies generated samples (using the saturating form derived previously): $$ \min_G \; \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$
Note that the generator loss does not include the term $\mathbb{E}_{x \sim p_{\text{data}}} [\log D(x)]$, while the minimax objective does. However, this term is independent of the generator $G$ and thus has no effect on updating $G$, so from $G$'s perspective it can be freely included in the minimax objective as a constant.
Putting the two together, we arrive at the minimax objective of GAN: $$ \min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] $$
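To connect this objective to code, here is a minimal sketch (illustrative only; `D` and `G` are assumed to be PyTorch modules like those in the toy implementation below) that estimates $V(D, G)$ from a batch of real samples and a batch of noise:
import torch
def value_fn(D, G, real_batch, z_batch):
    # Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
    return torch.log(D(real_batch)).mean() + torch.log(1 - D(G(z_batch))).mean()
# The discriminator takes gradient steps to increase this value,
# while the generator takes gradient steps to decrease it.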
Convergence of GAN: Nash Equilibrium¶
In game theory, a Nash equilibrium is a stable state of the system where neither player can improve their outcome by unilaterally changing their strategy.
In two-player zero-sum games with finite strategy sets, there exists a saddle point (minimax solution) in mixed strategies, where each player plays an optimal strategy. — John von Neumann, 1928 (von Neumann's Minimax Theorem)
The minimax theorem only ensures the existence of a Nash equilibrium for zero-sum games with finite strategy spaces. In infinite or continuous strategy spaces, a Nash equilibrium is not always guaranteed. Nevertheless, in certain specific games, we can still construct a Nash equilibrium through mathematical derivation.
Since the strategy space of GANs is not finite, the minimax theorem does not guarantee the existence of a Nash equilibrium. Yet in our case, we can explicitly construct a strategy pair that satisfies the Nash equilibrium conditions. This Nash equilibrium is not arbitrarily guessed, but naturally derived from the minimax objective function $\min_G \max_D \; V(D, G)$ of GANs. Let $p_G$ be the distribution of data generated by the generator $G$, i.e., the distribution of $G(z)$ when $z \sim p_z$. We will see that at equilibrium, the generator produces samples indistinguishable from real data ($p_G = p_{\text{data}}$), and the discriminator cannot tell the difference, outputting $D(x) = 0.5$ everywhere.
Derivation of the GAN Nash Equilibrium: $p_G = p_{\text{data}},\; D^*(x) = 0.5$¶
Let’s now see mathematically how the GAN objective leads to this equilibrium.
Step 1: Optimal Discriminator $D^*$ for Fixed Generator
Given a fixed generator $G$ (which implicitly defines a distribution $p_G$), we find the optimal discriminator $D^*(x)$ that maximizes: $$ V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x \sim p_G}[\log(1 - D(x))] $$ Write each expectation as an integral (or a sum in the discrete case). Let $P(x)=p_{\text{data}}(x)$ and $Q(x)=p_G(x)$. Then $$ V(D,G)=\int \big[\,P(x)\log D(x)+Q(x)\log(1-D(x))\,\big]\,dx. $$ When we maximize $V(D,G)$ over $D$, we are really picking the $D(x)$ that maximizes the pointwise value for each $x$, because there is no interaction between $D(x)$ and $D(x')$ for $x\neq x'$, and $P(x), Q(x)$ are regarded as constants in the optimization (both are fixed functions of $x$ once $G$ is fixed). Hence, for each fixed $x$ (except possibly where $P(x)=Q(x)=0$, which does not affect the integral), we solve the scalar problem $$ \max_{D(x)\in(0,1)}\; P(x)\log D(x)+Q(x)\log(1-D(x)). $$ Taking the derivative w.r.t. $D(x)$ and setting it to 0: $$ \frac{P(x)}{D(x)} - \frac{Q(x)}{1 - D(x)} = 0 \;\Rightarrow\; D^*(x) = \frac{P(x)}{P(x) + Q(x)} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} $$ The second-order condition shows that $f(d)=P\log d+Q\log(1-d)$ is strictly concave on $d\in(0,1)$, so the stationary point is the unique global maximizer: $$ f''(d)=-\frac{P}{d^{2}}-\frac{Q}{(1-d)^{2}}<0\qquad\text{for }d\in(0,1)\text{ if }P+Q>0. $$
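As a quick numerical sanity check (purely illustrative), fix example density values $P(x)$ and $Q(x)$ at a single point and verify on a grid that $d = P/(P+Q)$ maximizes $P\log d + Q\log(1-d)$:
import numpy as np
P, Q = 0.3, 0.1                        # example values of p_data(x) and p_G(x) at one point x
d = np.linspace(1e-6, 1 - 1e-6, 100001)
f = P * np.log(d) + Q * np.log(1 - d)  # pointwise objective in D(x)
d_star = d[np.argmax(f)]
print(d_star, P / (P + Q))             # both ~0.75, matching D*(x) = P / (P + Q)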
Step 2: Plug $D^*(x)$ Back to the Objective
Now compute the value function at the optimal discriminator: $$ V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\right] $$ Let’s define the mixture distribution: $$ m(x) = \frac{1}{2}(p_{\text{data}}(x) + p_G(x)) $$ Then, this value function becomes: $$ V(D^*,G)=\int p_{\text{data}}(x)\log\frac{p_{\text{data}}(x)}{2m(x)} \;+\; p_G(x)\log\frac{p_G(x)}{2m(x)}\,dx, $$ Using the fact that both $p_{\text{data}}$ and $p_G$ integrate to 1, $$ V(D^*,G) =\Big[\mathrm{KL}(p_{\text{data}}\|m)+\mathrm{KL}(p_G\|m)\Big]-2\log 2 =2\,\mathrm{JS}(p_{\text{data}}\|p_G) -\log 4 $$ Where: $$ \text{JS}(p_{\text{data}} \| p_G) = \frac{1}{2} \text{KL}(p_{\text{data}} \| m) + \frac{1}{2} \text{KL}(p_G \| m) $$ is the Jensen-Shannon divergence between the real and generated data distributions.
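The identity $V(D^*, G) = 2\,\mathrm{JS}(p_{\text{data}} \| p_G) - \log 4$ can also be checked numerically on small discrete distributions; a short sketch (the toy probability vectors are made up for illustration, natural logs throughout):
import numpy as np
def kl(p, q):
    return np.sum(p * np.log(p / q))
p_data = np.array([0.5, 0.3, 0.2])   # toy discrete "real" distribution
p_g    = np.array([0.2, 0.5, 0.3])   # toy discrete "generated" distribution
m = 0.5 * (p_data + p_g)
# Value function at the optimal discriminator D*(x) = p_data / (p_data + p_g)
d_star = p_data / (p_data + p_g)
V = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))
js = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
print(V, 2 * js - np.log(4))         # the two numbers agree
# If p_g equals p_data, then JS = 0, d_star = 0.5 everywhere, and V = -log 4.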
Step 3: Analyze the Optimal Case
At the optimal discriminator $D^*$, the GAN value function equals twice the Jensen–Shannon divergence between the real and generated distributions, minus the constant $\log 4$. Recall that the generator seeks to minimize the objective $V(D^*, G)$. Since the JS divergence is always non-negative and equals zero iff $p_G = p_{\text{data}}$, the global minimum of the GAN game is achieved when the JS divergence is zero. That is, $$ p_G = p_{\text{data}} \quad \text{and} \quad D^*(x) = \frac{1}{2},\; \forall x $$ At this point, the generator perfectly matches the data distribution, and the discriminator is maximally confused: it assigns 0.5 to all inputs. This is exactly the Nash equilibrium, since neither the generator nor the discriminator can unilaterally change its strategy to improve its own objective.
--- GAN Toy Implementation¶
This is a classic toy example: generating data from a Gaussian distribution.
Generator: Takes random noise as input and generates data resembling the target Gaussian distribution.
Discriminator: Distinguishes between real data (from the Gaussian distribution) and fake data (from the Generator).
import torch
import torch.nn as nn
import torch.optim as optim
# toy real data: Normal(3, 1)
def real_data_sampler(batch_size=16):
    return torch.randn(batch_size, 1) + 3

# Generator: z -> x
G = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(),
    nn.Linear(16, 1)
)

# Discriminator: x -> prob(real)
D = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
opt_G = optim.Adam(G.parameters(), lr=0.01)
opt_D = optim.Adam(D.parameters(), lr=0.01)

for step in range(1000):
    # real data
    real = real_data_sampler(16)

    # fake data
    z = torch.randn(16, 1)
    fake = G(z)

    # ----- Train D -----
    pred_real = D(real)
    pred_fake = D(fake.detach())
    loss_D = loss_fn(pred_real, torch.ones_like(pred_real)) + \
             loss_fn(pred_fake, torch.zeros_like(pred_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # ----- Train G -----
    pred_fake = D(fake)
    loss_G = loss_fn(pred_fake, torch.ones_like(pred_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    if step % 200 == 0:
        print(f"Step {step}: D_loss={loss_D.item():.4f}, G_loss={loss_G.item():.4f}")
Step 0: D_loss=1.6131, G_loss=0.8809
Step 200: D_loss=1.3623, G_loss=0.9018
Step 400: D_loss=1.3865, G_loss=0.6928
Step 600: D_loss=1.3863, G_loss=0.6930
Step 800: D_loss=1.3855, G_loss=0.6933
The losses settle near the theoretical equilibrium values derived above: $\mathcal{L}_D \to \log 4 \approx 1.386$ and $\mathcal{L}_G \to \log 2 \approx 0.693$, which correspond to the discriminator outputting roughly 0.5 everywhere.
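As a follow-up check (assuming `G` from the training loop above is still in scope), the generator's samples should roughly match the target Normal(3, 1):
with torch.no_grad():
    samples = G(torch.randn(1000, 1))
print(samples.mean().item(), samples.std().item())  # roughly 3 and 1 if training converged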
--- GAN History¶
- 2014: Ian Goodfellow proposed GAN, opening up a new direction for generative models.
- 2015–2017: Variants such as DCGAN, WGAN, and LSGAN were introduced, improving stability and performance.
- 2017 onward: GANs flourished in tasks like image generation (e.g., StyleGAN), image-to-image translation (CycleGAN), and super-resolution (SRGAN).
- Now: Although diffusion models have surpassed GANs in many tasks, GANs remain an important milestone in generative modeling.
--- GAN Use Cases¶
Image, Speech, and Music Generation Example: use GANs to generate synthetic human faces (e.g., StyleGAN), artwork, or game characters. Input for the generator is a latent vector (a random noise vector) and optional conditioning signals such as style or attributes (e.g., age, gender, hairstyle). The output is a realistic image, voice, or music clip.
Image-to-Image Translation Example: use GANs to convert sketches to photos (Pix2Pix), or translate images between domains such as day ↔ night or summer ↔ winter (CycleGAN). Input for the generator is the source image plus a latent vector. The output is a synthetic target-domain image. The discriminator learns to distinguish between real and generated target-domain images.
Super-Resolution Reconstruction Example: use GANs (like SRGAN) to upscale blurry low-resolution images into sharper high-resolution ones. Input for the generator is a low-res image. The output is a high-res reconstruction. The discriminator enforces realism by distinguishing between real high-res images and generated high-res images.
Image Inpainting and Denoising Example: use GANs to restore missing parts of an image (e.g., removing objects, repairing damaged areas, or denoising). Input for the generator is a corrupted or masked image plus a latent vector. The output is a completed or cleaned-up image. The discriminator distinguishes between real intact images and reconstructed ones.
Data Augmentation Example: use GANs to augment training datasets in domains like autonomous driving. Input for the generator is the latent vector and conditioning information such as weather, time of day, road conditions (dry, wet, icy), and traffic conditions. The output is a synthetic driving scene. The discriminator distinguishes between real scenes and generated scenes.