Natural Gradient

The expression for the Natural Gradient shows that it uses the Fisher information matrix to "correct" the ordinary gradient, so that the direction of optimization corresponds to the true steepest direction in the space of probability distributions $$\tilde\nabla_\theta U(\theta) = \mathcal I(\theta)^{-1} \nabla_\theta U(\theta)$$

Historically, Natural Gradient originates from information geometry and statistics. It was systematically introduced by Shun-ichi Amari in the late 1990s, where he studied the geometry of parametric probability distributions and showed that the Fisher information defines a natural Riemannian metric on the space of distributions. In this framework, the natural gradient emerges as a coordinate-invariant steepest descent direction. Only later was this concept adopted in machine learning and reinforcement learning (e.g., natural policy gradient around 2002, TRPO / PPO around 2015) as a principled way to stabilize and improve optimization in probabilistic models.

Takeaways:

  1. Why is the score function the direction of feedback from the data w.r.t. parameter changes?
  2. Why is the score function defined on the log-likelihood instead of the likelihood?
  3. Why does the score function have the zero-mean property?
  4. What is the connection between the two forms of the Fisher information expression?
  5. Why is the Euclidean gradient "unnatural" for probabilistic models?
  6. What problem is the Natural Gradient really trying to solve?
  7. Based on the mathematical expression of the natural gradient, how does it "correct" the Euclidean gradient?

--- Background Recap: Score Function and Fisher Information¶

Score Function¶

The original definition of the score function comes from classical statistics. Given a parametric distribution $p_\theta(x)$ and an observed data point $x$, the score function is defined as: $$ \boxed{ \text{score}(x) \triangleq \nabla_\theta \log p_\theta(x) } $$ A score function is always defined with respect to a parametric probability distribution. For such a distribution, once the data point $x$ is given, the likelihood / log-likelihood functions are functions of the parameter $\theta$; they measure how well a given parameter value explains the observed data, and are widely used for parameter estimation, such as maximum likelihood estimation.

  • Likelihood Function: $L(\theta;x)=p_\theta(x)$
  • Log-likelihood Function: $\ell(\theta;x)=\log p_\theta(x)$

The above expressions show that the score function is simply the gradient of the log-likelihood function of that distribution.

- 🍵 Why is the score function the direction of feedback from data w.r.t. parameter changes?¶

Treat $f(\theta) := \log p_\theta(x)$ as an ordinary function with respect to $\theta$. Here, $x$ is fixed, and $\theta$ is the knob you can turn. In calculus, what does $\nabla_\theta f(\theta)$ mean? It is the direction in which $f(\theta)$ increases most rapidly when you make a change in $\theta$. The value of $f(\theta)$ itself represents how plausible or likely this data sample $x$ is under the model. Therefore, $\nabla_\theta f(\theta) = \nabla_\theta \log p_\theta(x)$ naturally answers the question: "To make this observed sample $x$ more likely under the model, in which direction should we adjust the parameters?"

Why do we say this is feedback from the data to the parameters? Because in $f(\theta) := \log p_\theta(x)$, $x$ determines the shape of the function, and $\theta$ is what we adjust. The derivative describes how sensitive the function is to changes in $\theta$ for this given data $x$. In other words, the score is like a "gradient letter" written from the data to the parameters.

A key point: this is not feedback from "average data", but from this specific sample $x$. For intuition, consider a cooking analogy:

  • $\theta$: parameters of the cooking technique (amounts of salt, acidity, spiciness, cooking time)
  • $x$: a demonstration dish, presenting a specific, reproducible flavor profile. For an iid sample, $x_1$ might be a hummus sample made by mentor A, while $x_2$ could be a sample from mentor B, and so on.
  • $f(\theta) := \log p_\theta(x)$: under the current cooking technique $\theta$, the likelihood of producing this specific demonstration flavor

The name "score function" dates back to Fisher (1920s). "Score" here means the direction and magnitude of feedback that data provides in response to parameter changes. The score function literally grades the parameters: if you want to make "producing this demonstration flavor" more likely under your technique, which parameters should be increased, decreased, and by how much. The evaluation isn’t of the flavor itself, but of how much the technique should be changed. And each taste only gives you feedback for that specific bite: "How should you adjust according to this particular sample"?

For a fixed data point $x$, the likelihood $p_\theta(x)$ is a scalar-valued function of $\theta$, so its value acts only as a scalar multiplier on the gradient $\nabla_\theta p_\theta(x)$. Thus, the gradients of the likelihood and log-likelihood have exactly the same direction; the only difference is a positive scaling factor $\frac{1}{p_\theta(x)}$ in their magnitudes. $$\nabla_\theta \log p_\theta(x) = \frac{1}{p_\theta(x)} \nabla_\theta p_\theta(x)$$ Therefore, everything discussed above about the log-likelihood also applies to the likelihood function, and it too can serve as a direction of feedback from data with respect to parameter changes. Next, we'll see why the score function is specifically defined using the log-likelihood rather than the likelihood itself.
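As a sanity check, here is a minimal numeric sketch, assuming a Gaussian model $p_\theta(x) = \mathcal N(x;\theta,1)$ chosen purely for illustration, confirming that the likelihood and log-likelihood gradients share a direction and differ only by the factor $1/p_\theta(x)$:

```python
import math

# For a Gaussian model p_theta(x) = N(x; theta, 1) with x fixed, the
# likelihood gradient and the log-likelihood gradient (the score) point
# in the same direction, differing only by the positive factor 1/p_theta(x).

def likelihood(theta, x):
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def num_grad(f, theta, h=1e-6):
    # central finite difference
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta, x = 0.0, 1.5
grad_lik = num_grad(lambda t: likelihood(t, x), theta)
grad_loglik = num_grad(lambda t: math.log(likelihood(t, x)), theta)

# Closed form for this model: score = x - theta, and grad_lik = likelihood * score
```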

- 🍵 To get the direction, why do we take the derivative of the log-likelihood, instead of the likelihood or other forms?¶

Although the gradients of the likelihood and log-likelihood are in the same direction (their lengths differing only by a positive scaling factor), $\nabla_\theta \log p_\theta(x)$ has a more fundamental interpretation: it captures the relative rate of change of the probability as the parameter varies, because $$ \nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)} $$ That is, $$ \text{relative change} = \frac{\text{absolute change}}{\text{current value}} $$ This expression measures the proportional change in probability caused by a unit change in the parameter, not just the absolute increment of the probability. This is crucial: when $p_\theta(x)$ itself is very small, $\nabla_\theta p_\theta(x)$ also tends to be tiny. Even if the model fits this sample very poorly, the derivative of the likelihood may provide almost no effective feedback; but the log-likelihood derivative, through normalization, cancels out the influence of the probability scale and preserves sensitivity to parameter directions.

Notice that the relative change is not normalized to $[0,1]$. The rate of increase can easily be greater than 1, or even unbounded. For example, if $p_\theta(x)=e^{\theta x}$, then $\frac{d}{d\theta}\log p_\theta(x)=x$. When $x$ is large, the relative change can also be very large, and that's perfectly normal.

This is also called reparameterization-friendly. If the likelihood is multiplied by any positive function $c(x)$ independent of $\theta$ (such as a measure or a constant): $$ \tilde p_\theta(x) = c(x)p_\theta(x), $$ then the score remains completely unchanged $$ \nabla_\theta \log \tilde p_\theta(x) = \nabla_\theta \log p_\theta(x) $$ which shows that this scalar eliminates the effect of the representation and keeps only the information relevant to parameter changes.
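This invariance is easy to verify numerically. A minimal sketch, again assuming a Gaussian model for illustration, with a constant factor standing in for $c(x)$:

```python
import math

# Multiplying the likelihood by a theta-independent positive factor c
# leaves the score unchanged: the log turns the product into a sum, and
# the log(c) term has zero derivative w.r.t. theta.

def log_p(theta, x):
    # log-density of N(x; theta, 1)
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def num_grad(f, theta, h=1e-6):
    # central finite difference
    return (f(theta + h) - f(theta - h)) / (2 * h)

theta, x, c = 0.3, 1.1, 42.0  # c stands in for c(x)
score = num_grad(lambda t: log_p(t, x), theta)
score_tilde = num_grad(lambda t: math.log(c) + log_p(t, x), theta)
# score == score_tilde up to floating-point error; both equal x - theta
```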

- 🍵 The Zero-mean Property of the Score Function¶

A fundamental property of score function is $$\mathbb{E}_{x\sim p_\theta}[\text{score}(x)] = \mathbb{E}_{x\sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$$ The math is simple: $$ \begin{align} \mathbb{E}_{x\sim p_\theta} [\nabla_\theta \log p_\theta(x)] &= \int p_\theta(x)\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}dx \nonumber\newline &= \int \nabla_\theta p_\theta(x)dx \nonumber\newline &= \nabla_\theta \int p_\theta(x)dx \nonumber\newline &= \nabla_\theta 1 = 0\nonumber \end{align} $$

The intuition of the zero-mean property in one sentence: the expected score is the sum of the absolute changes of all probabilities; since the total probability always equals 1, this sum can only be 0.

Let's use a "bucket of soil" analogy to explain the zero-mean property. Imagine there are many buckets, and each has an amount of soil $p_\theta(x)$. No matter how things change, the total amount of soil in all buckets remains 1.

  • Key Point 1: Sum of total mass is always 1 ⇒ The sum of all absolute changes is always 0. $$ \sum_x p_\theta(x)=1 \quad \Rightarrow \quad \sum_x \nabla_\theta p_\theta(x)=0 $$ Since the total soil in all buckets is always 1, regardless of how you adjust the parameters, some buckets will gain soil and some will lose soil, but the net change in total soil across all buckets must always be 0.

  • Key Point 2: The expectation of the score function $\mathbb{E}_{x\sim p_\theta}[\nabla_\theta \log p_\theta(x)]$ is actually just the sum of all absolute changes.

    $\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$ is asking: for this bucket, relative to its original volume, how much has it increased or decreased? If you multiply this "rate of change" by the bucket's original amount of soil, $p_\theta(x)\frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$, you get the absolute change for that bucket, $\nabla_\theta p_\theta(x)$.

A concrete toy example: a discrete variable with only two values

This toy example makes the intuition explicit: when the probability of one outcome increases, the probability of the other must decrease by exactly the same amount in total mass.

Setup: There are only two possible values, $x_1, x_2$. A single parameter $\theta\in(0,1)$ determines the probabilities: $$ p_\theta(x_1) = \theta,\quad p_\theta(x_2) = 1-\theta $$

Directly calculate the score function. For $x_1$: $$ \nabla_\theta \log p_\theta(x_1) = \frac{1}{\theta} $$ For $x_2$: $$ \nabla_\theta \log p_\theta(x_2) = \frac{-1}{1-\theta} $$ Note that one is positive and one is negative. With only two outcomes, any increase in one probability must be paid for by a decrease in the other, making the zero-mean property of the score function completely transparent.

Calculate the expectation of the score function: $$ \mathbb{E}_{x\sim p_\theta}[\nabla_\theta \log p_\theta(x)] = \theta \cdot \frac{1}{\theta} + (1-\theta)\cdot \frac{-1}{1-\theta} = 1 - 1 = 0 $$
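The zero-mean property can also be checked by simulation. A small Monte Carlo sketch of the two-outcome model above (the sample size is an arbitrary choice):

```python
import random

# Monte Carlo check of the zero-mean property for the two-outcome model
# p_theta(x1) = theta, p_theta(x2) = 1 - theta; the sample average of the
# score fluctuates around 0 at the O(1/sqrt(n)) scale.

def score(outcome, theta):
    # d/dtheta log p_theta: 1/theta for x1, -1/(1-theta) for x2
    return 1.0 / theta if outcome == 1 else -1.0 / (1.0 - theta)

random.seed(0)
theta, n = 0.3, 200_000
samples = [1 if random.random() < theta else 2 for _ in range(n)]
mean_score = sum(score(s, theta) for s in samples) / n
# mean_score is close to 0 (but not exactly 0 at finite n)
```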

Fisher Information¶

The original definition of the Fisher Information also comes from classical statistics. Given a parametric distribution $p_\theta(x)$, an observed data point $x$, and the score function $\text{score}(x) \triangleq \nabla_\theta \log p_\theta(x)$, the Fisher Information is defined as the second moment of the score function. Also, since $\mathbb{E}[\text{score}(x)]=0$, the second moment of the score function equals its variance $$ \boxed{ \mathcal I(\theta) \triangleq \mathbb{E}_{x\sim p_\theta}\big[\text{score}(x)\text{score}(x)^\top\big] = \mathbb{E}_{x\sim p_\theta}\big[(\nabla_\theta\log p_\theta(x))(\nabla_\theta \log p_\theta(x))^\top\big] } $$ Another equivalent form of the Fisher Information is the negative expected Hessian, which is essentially taking the Jacobian of $\text{score}(x)$, taking the expectation, and then taking the negative $$ \boxed{ \mathcal I(\theta) \triangleq - \mathbb{E}_{x\sim p_\theta}\big[\nabla_\theta\text{score}(x)\big] = -\mathbb{E}_{x\sim p_\theta}\big[\nabla_\theta^2 \log p_\theta(x)\big] } $$

Fisher Information measures how much a small change in parameters causes observable changes in the probability distribution. This “change strength” can be seen as the curvature (Hessian) of the log-likelihood, or as the random fluctuation (second moment) of the score. These two views describe the same thing in expectation. Fisher ties together the “randomness of samples” and the “curvature in parameter space” into one quantity.

Simple Derivation of Those Two Forms:

The key to the equivalence of those two forms is that we have a conservation law: $$ \mathbb{E}[\text{score}(x)] = 0 \quad \text{for all } \theta $$ What this equation means is that, at the true parameter $\theta$, if you repeatedly sample data from the model, calculate the score each time, and then average them, the average is zero in every direction. In other words, there is no systematic bias; if there were any, you would observe $\mathbb{E}[\text{score}(x)] \neq 0$, meaning the score would consistently be more often positive or negative in some direction.

Now, let's take the derivative of this equation: $$ \begin{align} 0 = \nabla_\theta \mathbb{E}_{x\sim p_\theta}[\text{score}(x)] &= \nabla_\theta \int \nabla_\theta \log p_\theta(x)\, p_\theta(x)\, dx \nonumber\newline &= \int \nabla_\theta\!\left(\nabla_\theta \log p_\theta(x)\, p_\theta(x)\right) dx \nonumber\newline &= \int \Big(\nabla_\theta^2 \log p_\theta(x)\, p_\theta(x) + \nabla_\theta \log p_\theta(x)\, \nabla_\theta p_\theta(x)^\top \Big) dx \nonumber\newline &= \mathbb{E}_{x\sim p_\theta}\!\left[\nabla_\theta^2 \log p_\theta(x)\right] + \mathbb{E}_{x\sim p_\theta}\!\left[\text{score}(x)\text{score}(x)^\top\right] \nonumber \end{align} $$ We can see that when taking the derivative of an expectation $\nabla_\theta \mathbb{E}_{x\sim p_\theta}[\text{score}(x)]$, because the integrand contains both the probability and the function being averaged, the product rule generates two derivative terms. The first term is how the function itself changes with $\theta$; the second is how the probability distribution changes with $\theta$, and it reduces to the random fluctuation of the score:

  1. [How the score itself changes] The first term, $\nabla_\theta^2 \log p_\theta(x) = \nabla_\theta\text{score}(x)$, describes how the score changes with $\theta$ for a fixed sample $x$. Its expectation means, on average under this distribution, how the score changes with $\theta$.
  2. [Random fluctuations of the score over samples] The second term, $\text{score}(x)\text{score}(x)^\top$, describes the squared value for a single sample realization at a fixed $\theta$. Its expectation quantifies, on average, the intensity of random sample-wise fluctuations of the score under this distribution.

Because the total change must sum to zero, the deterministic change of the score due to parameter variation must exactly offset the “random fluctuation effect” of sample-wise score variability.
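To make the equivalence concrete, here is a sketch that evaluates both Fisher forms exactly for the two-outcome toy model from earlier (the parameter is a scalar, so each "matrix" is just a number):

```python
# Both Fisher forms evaluated exactly for the two-outcome model
# p_theta(x1) = theta, p_theta(x2) = 1 - theta; both forms should give
# 1 / (theta * (1 - theta)).

theta = 0.3
prob = {1: theta, 2: 1.0 - theta}

# Form 1: second moment of the score, E[score(x)^2]
score = {1: 1.0 / theta, 2: -1.0 / (1.0 - theta)}
fisher_moment = sum(prob[x] * score[x] ** 2 for x in (1, 2))

# Form 2: negative expected second derivative of the log-likelihood
dd_loglik = {1: -1.0 / theta ** 2, 2: -1.0 / (1.0 - theta) ** 2}
fisher_hessian = -sum(prob[x] * dd_loglik[x] for x in (1, 2))
```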

Notice that if we replace the score function with a general $f(x)$, then $\mathbb{E}[f(x)\text{score}(x)^\top]$ no longer represents the sample fluctuation strength of $f$, but instead characterizes the correlation between $f$ and the direction of variation in the distribution parameters, that is, how changes in the distribution affect $\mathbb{E}[f]$. The score function is special because $\nabla_\theta p_\theta(x)=p_\theta(x)\,\text{score}(x)$, so the distribution-variation term reduces to $\mathbb{E}[\text{score}(x)\text{score}(x)^\top]$, which uniquely turns parameter sensitivity into a second-moment measure of sample-wise fluctuation.

Local Second-Order Approximation of KL = Fisher¶

Consider two close parameters, $\theta$ and $\theta+\delta$. Let’s look at the KL divergence:

$$ D_{\mathrm{KL}}(p_\theta \mid p_{\theta+\delta}) = \int p_\theta(x) \log \frac{p_\theta(x)}{p_{\theta+\delta}(x)} dx = \mathbb E_\theta\left[\log p_\theta(X)-\log p_{\theta+\delta}(X)\right] $$

Expand $\log p_{\theta+\delta}(x)$ in a Taylor series around $\theta$:

$$ \log p_{\theta+\delta}(x) = \log p_\theta(x) + \delta^\top \nabla_\theta \log p_\theta(x) + \frac12 \delta^\top \nabla^2_\theta \log p_\theta(x)\delta + o(\|\delta\|^2) $$

Substitute this back into the KL expression:

$$ \begin{align} D_{\mathrm{KL}}(p_\theta \mid p_{\theta+\delta}) & = -\mathbb E_\theta\left[ \delta^\top \nabla_\theta \log p_\theta(X) + \frac12 \delta^\top \nabla^2_\theta \log p_\theta(X)\delta\right] + o(\|\delta\|^2) \nonumber\newline & \approx \frac12\delta^\top \mathcal I(\theta)\delta \nonumber \end{align} $$

because $\mathbb E_\theta\big[\nabla_\theta \log p_\theta(X)\big] = 0$, and $o(\|\delta\|^2)$ means that as $\delta \to 0$, this term is a higher-order infinitesimal compared to $\|\delta\|^2$, i.e., $\lim_{\delta\to 0}\frac{o(\|\delta\|^2)}{\|\delta\|^2} = 0$, so it can be neglected at second-order accuracy.

Key Point: The Fisher information matrix is the local quadratic metric (Riemannian metric) in parameter space induced by the KL divergence. It defines the "true distance" in parameter space. When you take a step $\delta$ in parameter space, the true amount of change in the distribution is not $\|\delta\|$, but rather $\delta^\top \mathcal I(\theta)\delta$.
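A quick numeric sketch of this key point, using the Bernoulli toy model from earlier (where $\mathcal I(\theta) = 1/(\theta(1-\theta))$; the values of $\theta$ and $\delta$ are arbitrary small choices):

```python
import math

# Exact KL for Bernoulli(theta) vs Bernoulli(theta + delta), compared with
# the local quadratic approximation 0.5 * delta^2 * I(theta), where the
# Fisher information is I(theta) = 1 / (theta * (1 - theta)).

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta, delta = 0.3, 1e-3          # arbitrary small parameter step
exact = kl_bernoulli(theta, theta + delta)
fisher = 1.0 / (theta * (1.0 - theta))
approx = 0.5 * delta ** 2 * fisher
# exact / approx -> 1 as delta -> 0
```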

--- Motivation: Why We Need Natural Gradient?¶

Consider a family of parameterized probability distributions: $$ p_\theta(x), \quad \theta \in \mathbb{R}^d $$ and an objective function $$ U(\theta) $$ Here, $U$ depends on $\theta$ indirectly through $p_\theta$. For example, $U(\theta)$ is typically of the following form: $$ U(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)] \quad \text{or} \quad = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\log p_\theta(y|x)] $$ Specifically, with this dependency, if two parameters $\theta_1 \neq \theta_2$ but $p_{\theta_1} \approx p_{\theta_2}$, then $U(\theta_1) \approx U(\theta_2)$.

Suppose we use Euclidean gradient ascent to move $\theta$, hoping this will move the objective function $U(\theta)$ accordingly towards its optimum: $$ \theta_{new} = \theta_{old} + \alpha \nabla_\theta U(\theta) $$ But the question is, will the movement of $\theta$ guarantee the corresponding movement of $U(\theta)$?

The answer is absolutely not. As we mentioned before, if two parameters $\theta_1 \neq \theta_2$ but $p_{\theta_1} \approx p_{\theta_2}$, then $U(\theta_1) \approx U(\theta_2)$. The same $\|\Delta\theta\|$ might change $p_\theta$ barely at all, making the update ineffective; or it might change $p_\theta$ drastically, causing the linear approximation to completely break down. So, with Euclidean gradient updates, the movement of $U(\theta)$ is not predictable from the movement of $\theta$.

Then if we want a predictable movement of $U(\theta)$, which quantity should we control? Since $U(\theta)$ depends on $\theta$ through $p_{\theta}$, controlling the movement of $p_{\theta}$ will guarantee a corresponding movement of $U(\theta)$. So what we really want to control is the "distance" between the probability distributions $p_\theta$ before and after updating $\theta$!

Since $p_\theta$ is the quantity we need to control in order to get a predictable movement of the objective function $U(\theta)$, how do we measure the change of $p_{\theta}$? Instead of the Euclidean distance, the metric in distribution space is the KL (Kullback–Leibler) divergence $$D_{\mathrm{KL}}(p_\theta \mid p_{\theta+\delta})$$ When the parameter change is small, the KL divergence can be approximated to second order at $\theta$: $$ D_{\mathrm{KL}}\left(p_\theta \mid p_{\theta+\delta}\right) \approx \frac12 \delta^\top \mathcal I(\theta)\delta, $$ where $\mathcal I(\theta) = \mathbb E_\theta\left[\nabla_\theta \log p_\theta(X)\nabla_\theta \log p_\theta(X)^\top\right]$ is the Fisher Information Matrix.

What the natural gradient does is, under the premise of controlling the movement of the probability distribution $p_\theta$, move the objective function in its steepest ascent direction. The distance in the probability distribution space is the KL divergence; using its second-order Taylor expansion discussed earlier, the KL distance is computed through the Fisher information.

--- Natural Gradient¶

Back to the optimization problem.

To maximize the objective function $U(\theta)$, the natural gradient method does not blindly follow $\nabla_\theta U(\theta)$ in parameter space. Instead, it constrains the change in the probability distribution $p_\theta$, ensuring that the distribution before and after the update does not differ too much. In this way, we achieve the greatest local improvement of the objective function $U(\theta)$ while controlling the change in the distribution. The original optimization problem is $$ \max_{\delta}\quad U(\theta+\delta) \quad \text{s.t.}\quad D_{\mathrm{KL}}\left(p_\theta \mid p_{\theta+\delta}\right) \le \varepsilon $$ We also call $D_{\mathrm{KL}}\left(p_\theta \mid p_{\theta+\delta}\right) \le \varepsilon$ the trust region.

Standard Quadratic-Constrained LP¶

Consider the optimization problem above. If the update is small, we can use a first-order Taylor expansion of the objective function: $$ U(\theta+\delta) \approx U(\theta) + \nabla_\theta U(\theta)^\top \delta $$ The constant term $U(\theta)$ does not affect the optimization direction, so this is equivalent to: $$ \max_{\delta}\quad \nabla_\theta U(\theta)^\top \delta $$ As explained earlier, when $\delta$ is small: $$ D_{\mathrm{KL}}\left(p_\theta \mid p_{\theta+\delta}\right) \approx \frac{1}{2}\delta^\top \mathcal I(\theta)\delta $$ So the constraint becomes: $$ \frac{1}{2}\delta^\top \mathcal I(\theta)\delta \le \varepsilon $$ Thus, we have: $$ \max_{\delta}\quad \nabla_\theta U(\theta)^\top \delta \quad \text{s.t.}\quad \delta^\top \mathcal I(\theta)\delta \le 2\varepsilon $$ This is a standard quadratic-constrained LP problem. Specifically, it amounts to finding the optimal direction under a linear objective within an ellipsoid defined by the Fisher information.
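The claim that this ellipsoid-constrained linear objective is maximized along $\mathcal I(\theta)^{-1}\nabla_\theta U(\theta)$ can be spot-checked numerically. A sketch, where the matrix `I`, the gradient `g`, and $\varepsilon$ are arbitrary illustrative values:

```python
import numpy as np

# Spot-check: on the ellipsoid delta^T I delta = 2*eps, the linear
# objective g^T delta is maximized along the direction I^{-1} g.

rng = np.random.default_rng(0)
I = np.array([[2.0, 0.5], [0.5, 1.0]])   # a symmetric positive-definite "Fisher"
g = np.array([1.0, 0.3])                 # the Euclidean gradient
eps = 0.01

# Lagrangian solution, rescaled to lie exactly on the constraint boundary
d = np.linalg.solve(I, g)                # natural-gradient direction I^{-1} g
d_star = d * np.sqrt(2 * eps / (d @ I @ d))
best = g @ d_star

# Random directions projected onto the same boundary never do better
others = []
for _ in range(1000):
    v = rng.normal(size=2)
    v = v * np.sqrt(2 * eps / (v @ I @ v))
    others.append(g @ v)
```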

Solving the Quadratic-Constrained LP¶

We use the method of Lagrange multipliers to solve the quadratic-constrained linear problem. Construct the Lagrangian: $$ \mathcal L(\delta, \lambda) = \nabla_\theta U(\theta)^\top \delta - \lambda\left(\delta^\top \mathcal I(\theta)\delta - 2\varepsilon\right) $$ Take the derivative with respect to $\delta$ and set it to zero: $$ \nabla_\theta U(\theta) - 2\lambda \mathcal I(\theta)\delta = 0 $$ Solving gives: $$ \delta \propto \mathcal I(\theta)^{-1} \nabla_\theta U(\theta) $$

Therefore, in the space of distributions constrained by KL divergence, the direction of steepest ascent of the objective is: $$ \boxed{ \tilde\nabla_\theta U(\theta) = \mathcal I(\theta)^{-1} \nabla_\theta U(\theta) } $$ This is the Natural Gradient.

The final update rule is: $$ \theta_{new} = \theta_{old} + \alpha \mathcal I(\theta)^{-1} \nabla_\theta U(\theta) $$ From the above expression, we see that the Natural Gradient uses the Fisher information matrix to "adjust" the ordinary gradient. However, from its original definition and derivation, it is not merely a correction of the Euclidean gradient; instead, it determines the direction of steepest improvement within the "trust region" defined by KL geometry. The final result just happens to take the form of adjusting the gradient by the Fisher information matrix.
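To close, a small sketch of one natural-gradient step for a Gaussian model $p_\theta(x)=\mathcal N(x;\mu,\sigma^2)$ with $\theta=(\mu,\sigma)$, maximizing the log-likelihood of a single observation $x_0$. The values of $\mu,\sigma,x_0,\alpha$ are arbitrary; the diagonal Fisher matrix $\operatorname{diag}(1/\sigma^2,\,2/\sigma^2)$ for this parameterization is a standard result:

```python
# One natural-gradient ascent step for p_theta(x) = N(x; mu, sigma^2),
# theta = (mu, sigma), maximizing U(theta) = log p_theta(x0) for a single
# observation x0. The Fisher matrix here is diag(1/sigma^2, 2/sigma^2).

mu, sigma, x0, alpha = 0.0, 2.0, 1.0, 0.1   # arbitrary illustrative values

# Euclidean gradient of the log-likelihood
g_mu = (x0 - mu) / sigma ** 2
g_sigma = ((x0 - mu) ** 2 - sigma ** 2) / sigma ** 3

# Natural gradient: I^{-1} grad (diagonal Fisher, so invert elementwise)
ng_mu = sigma ** 2 * g_mu                   # = x0 - mu
ng_sigma = (sigma ** 2 / 2.0) * g_sigma     # = ((x0 - mu)^2 - sigma^2) / (2*sigma)

mu_new = mu + alpha * ng_mu
sigma_new = sigma + alpha * ng_sigma
```

Note how the Fisher preconditioning removes the $1/\sigma^2$ scale: the natural step in $\mu$ is simply $x_0-\mu$ regardless of $\sigma$, whereas the Euclidean step shrinks as $\sigma$ grows.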