Policy Gradient

Policy Gradient is methodologically orthogonal to Dynamic Programming. Instead of deriving from the Bellman equation or optimality conditions, it formulates reinforcement learning directly as optimization of the expected value of a parameterized probabilistic model. In Policy Gradient, the policy $\pi_\theta(a\mid s)$ is viewed as a differentiable family of probability distributions, and the objective is to maximize the expected return under that distribution, $\mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta}\big[Q^\pi(s,a)\big]$. It follows the path of statistical learning and stochastic optimization. The core idea is not to "solve for the optimal value function," but to "adjust the policy probabilities in directions that increase expected return under the current visitation distribution."

--- Learning Objectives¶

Policy Gradient = Directly maximize expected return by performing gradient ascent on the policy parameters.

Problem Setting¶

We consider a Markov Decision Process with policy $\pi_\theta(a\mid s)$.

At each time step $t$, the agent:

  • observes a state $s_t$,
  • samples an action $a_t \sim \pi_\theta(\cdot\mid s_t)$,
  • receives a reward $r(s_t,a_t)$,
  • transitions to $s_{t+1} \sim P(\cdot\mid s_t,a_t)$.

Recall the definitions of the state value and action value functions under policy $\pi_\theta$: $$ V^\pi(s) = \mathbb{E}_\pi \left[\sum_{k=0}^\infty \gamma^k r(s_{t+k},a_{t+k}) \middle| s_t=s\right], $$ $$ Q^\pi(s,a) = \mathbb{E}_\pi \left[\sum_{k=0}^\infty \gamma^k r(s_{t+k},a_{t+k}) \middle| s_t=s,a_t=a\right] $$

The state value function represents the expected return starting from state $s$ and following policy π; the action value function represents the expected return starting from state $s$, taking action $a$ first, then following policy π.

Policy Objective¶

The learning objective is the expected return when following policy $\pi_\theta$. In its original form, it is the expected discounted return $$ U(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^\infty \gamma^t r_t\Big] $$ This objective can be equivalently written in the following forms:

  • State-value (initial state) form

    $$ U(\theta) = \mathbb{E}_{s_0\sim\mu}\big[ V^{\pi_\theta}(s_0) \big], \quad \text{where } \mu \text{ is the initial-state distribution and } V^{\pi}(s) \text{ is as defined above} $$

  • State–action (initial state-action) value form

    $$ U(\theta) = \mathbb{E}_{s_0\sim\mu,a_0\sim\pi_\theta} \big[ Q^{\pi_\theta}(s_0,a_0) \big], \quad \text{where } Q^{\pi}(s,a) \text{ is as defined above} $$

The original definition and value function forms are not friendly for computing the gradient $\nabla_\theta U(\theta)$. The critical issue is: $\theta$ affects both the action distribution at each step $\pi_\theta(a|s)$ and the distribution over future states (the entire trajectory distribution). This makes $\nabla_\theta U(\theta)$ the gradient of a trajectory distribution, which is very hard to compute directly.

Now we rewrite the learning objective, i.e., the expected return, to a less intuitive but differentiable and decomposable form: all the “future-related complexity” is absorbed into $Q^\pi$, and the dependence on the parameters $\theta$ is isolated in $\pi_\theta(a|s)$. This step is essential for the Policy Gradient Theorem to hold. The key intuition is that we only need to take the gradient with respect to the current action probability, without worrying about how the future evolves.

$$ \begin{align} U(\theta) &= \mathbb{E}_{\tau\sim\pi_\theta} \Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\Big] \nonumber\newline &= \sum_{t=0}^\infty \gamma^t \mathbb{E}_{\tau\sim\pi_\theta}\big[r(s_t,a_t)\big] \qquad\text{(linearity of expectation)} \nonumber\newline &= \sum_{t=0}^\infty \gamma^t \mathbb{E}_{s_t,a_t} \Big[ \mathbb{E}\big[r(s_t,a_t)\mid s_t,a_t\big] \Big] \qquad\text{(law of total expectation)} \nonumber\newline &= \sum_{t=0}^\infty \gamma^t \mathbb{E}_{s_t,a_t\sim\pi_\theta}\big[r(s_t,a_t)\big] \nonumber \end{align} $$

Introduce the discounted state–action occupancy measure $\rho_{\pi_\theta}(s,a)$ and discounted state distribution $d^{\pi_\theta}(s)$: $$ \boxed{ \rho_{\pi_\theta}(s,a) \triangleq (1-\gamma)\sum_{t=0}^\infty \gamma^t \Pr_{\pi_\theta}(s_t=s,a_t=a) } $$ $$ \boxed{ d^{\pi_\theta}(s) \triangleq (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr_{\pi_\theta}(s_t = s) } $$ and it follows immediately that $\rho_{\pi_\theta}(s,a)=d^{\pi_\theta}(s)\pi_\theta(a|s)$.

Using the definition of the action-value function $Q^{\pi_\theta}(s,a)$, the return can be rewritten as an expectation over $(s,a)$ sampled from $\rho_{\pi_\theta}$ $$ U(\theta) = \sum_{s,a}\rho_{\pi_\theta}(s,a)Q^{\pi_\theta}(s,a) $$ Then, factoring $\rho_{\pi_\theta}(s,a)=d^{\pi_\theta}(s)\pi_\theta(a|s)$, the return can be further rewritten in occupancy–Q form, which plays a key role in computing the gradient $$ U(\theta) = \sum_{s,a} d^{\pi_\theta}(s)\pi_\theta(a|s)Q^{\pi_\theta}(s,a) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a|s)Q^{\pi_\theta}(s,a) $$

Interpretation of $d^{\pi}(s)$

The discounted state distribution $d^{\pi}(s)$ will be used throughout the policy gradient derivation. It represents the frequency distribution of visiting state $s$ under policy $\pi$ in a discounted sense. You can think of $d^\pi(s)$ as: "If I randomly pick a time along an infinitely long trajectory (with discounting), what's the probability I'm in state $s$?"

Let's break down the formula for $d^{\pi}(s)$:

  • First factor: $\displaystyle \sum_{t=0}^{\infty} \gamma^t \Pr_\pi(s_t = s)$

    This is the discounted visitation frequency of state $s$ under policy $\pi$. Specifically:

    1. $\Pr_\pi(s_t = s)$ is the probability of being in state $s$ at time $t$, determined by the environment and the policy.
    2. $\gamma^t$ discounts later time steps, making earlier visits matter more—just like reward discounting.
    3. Summing over all $t$ accumulates the (discounted) number of times state $s$ is visited.

    Simple example: Suppose there are only two states, A → B → B → B → ..., starting in A, $\gamma = 0.9$. A appears only at $t=0$, B appears from $t=1$ onward:

    $$ \sum_t \gamma^t \Pr(s_t = A) = 1, \quad \sum_t \gamma^t \Pr(s_t = B) = \gamma + \gamma^2 + \cdots $$

  • Second factor: the normalization constant $(1-\gamma)$, ensuring $\sum_s d^\pi(s) = 1$

    This converts the discounted visitation count into a proper probability distribution. The sum $\sum_{t=0}^\infty \gamma^t \Pr_\pi(s_t = s)$ can exceed 1, so it's not a probability—multiplying by $(1-\gamma)$ ensures normalization.

    Summing over all states $s$: $$ \sum_s d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \sum_s \Pr_\pi(s_t = s) $$ For each time step $t$, $\sum_s \Pr_\pi(s_t = s) = 1$. Substitute back: $$ \sum_s d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \cdot 1 = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t = (1-\gamma)\cdot\frac{1}{1-\gamma} = 1 $$

    For example, if the environment has only one state $s$, then $\Pr(s_t = s) = 1$ for all $t$, and without normalization $\sum_t \gamma^t = \frac{1}{1-\gamma}$. Multiplying by $(1-\gamma)$ gives $d^\pi(s) = 1$.
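The A → B chain example and the normalization can be confirmed numerically; a minimal sketch (truncating the infinite sums at a large horizon $T$, which is an approximation):

```python
import numpy as np

gamma, T = 0.9, 2000                # T truncates the infinite sums
t = np.arange(T)
p_A = (t == 0).astype(float)        # A is visited only at t = 0
p_B = 1.0 - p_A                     # B is visited from t = 1 onward

d_A = (1 - gamma) * np.sum(gamma**t * p_A)
d_B = (1 - gamma) * np.sum(gamma**t * p_B)

assert np.isclose(d_A, 1 - gamma)   # d(A) = 1 - gamma = 0.1
assert np.isclose(d_B, gamma)       # d(B) = (1-gamma) * gamma/(1-gamma) = 0.9
assert np.isclose(d_A + d_B, 1.0)   # a proper probability distribution
```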

Intuitively, policy gradient should optimize performance where the policy actually visits, not over the entire state space. Weighting states by $d^{\pi}(s)$ ensures that updates focus on frequently and early visited states under the current policy. A uniform or global state average would waste optimization effort on states that are rarely or never encountered.

--- Gradient Ascent on the Learning Objective¶

Derivation of the Gradient (Policy Gradient Theorem)¶

We start from the objective written in occupancy–Q form of $U(\theta)$. Naively differentiating gives $$ \nabla_\theta U(\theta) = \sum_{s,a} \Big( \nabla_\theta d^{\pi_\theta}(s)\pi_\theta(a|s)Q^{\pi_\theta}(s,a) + d^{\pi_\theta}(s)\nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a) + d^{\pi_\theta}(s)\pi_\theta(a|s)\nabla_\theta Q^{\pi_\theta}(s,a) \Big) $$

And by the key cancellation lemma (this is the key step of the policy gradient derivation; for a proof and more intuition, see the last part of this note) $$ \sum_{s,a} \Big( \nabla_\theta d^{\pi_\theta}(s)\pi_\theta(a|s)Q^{\pi_\theta}(s,a) + d^{\pi_\theta}(s)\pi_\theta(a|s)\nabla_\theta Q^{\pi_\theta}(s,a) \Big) = 0 $$ the gradient reduces to the direct policy term alone: $$ \nabla_\theta U(\theta) = \sum_{s,a} d^{\pi_\theta}(s)\nabla_\theta \pi_\theta(a|s)Q^{\pi_\theta}(s,a) $$ Using the log-derivative trick $\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s)\nabla_\theta \log \pi_\theta(a|s)$, we obtain: $$ \nabla_\theta U(\theta) = \sum_{s,a} d^{\pi_\theta}(s) \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) Q^{\pi_\theta}(s,a) $$ Writing this as an expectation: $$ \boxed{ \nabla_\theta U(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a|s) Q^{\pi_\theta}(s,a) \big] } $$ This expression is the Policy Gradient Theorem. It states that for a differentiable stochastic policy $\pi_\theta$, the gradient of the expected discounted return satisfies the expression above. Notice that $\nabla_\theta U(\theta)$ is the expectation of a score function weighted by $Q^{\pi_\theta}(s,a)$, not the gradient of a $Q^{\pi_\theta}(s,a)$-weighted log-likelihood objective!

Another interesting point is that this formula gives the first-order ascent direction at the current policy, yet it is not the gradient of any fixed surrogate objective: the visitation distribution $d^{\pi_\theta}$ itself moves with $\theta$, so there is no single "loss function" whose gradient this is at every point.
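As a sanity check, the boxed formula can be verified on a toy MDP where $d^{\pi_\theta}$, $Q^{\pi_\theta}$, and $U(\theta)$ are all computable exactly by solving linear systems. The sketch below uses a hypothetical randomly generated two-state, two-action MDP with a softmax logit-table policy; note that since $d^{\pi_\theta}$ is normalized by $(1-\gamma)$, the expectation matches the true gradient up to the constant factor $\frac{1}{1-\gamma}$ (accounted for below), which does not change the ascent direction:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9

# Hypothetical random MDP: transitions P[s, a, s'] and rewards r[s, a]
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
mu = np.array([0.5, 0.5])                 # initial-state distribution
theta = rng.standard_normal((nS, nA))     # softmax logits, one row per state

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Exact pi, V, Q, normalized discounted visitation d, and objective U."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)        # state transitions under pi
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    # d^T = (1 - gamma) * mu^T (I - gamma P_pi)^{-1}
    d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * P_pi).T, mu)
    return pi, V, Q, d, mu @ V

pi, V, Q, d, U = evaluate(theta)

# Policy Gradient Theorem, using the unnormalized visitation d(s)/(1 - gamma)
grad_pg = np.zeros((nS, nA))
for s in range(nS):
    for ap in range(nA):
        # softmax Jacobian: dpi(a|s)/dtheta[s, ap] = pi(a|s) (1[a=ap] - pi(ap|s))
        dpi = pi[s] * ((np.arange(nA) == ap) - pi[s, ap])
        grad_pg[s, ap] = d[s] / (1 - gamma) * (dpi @ Q[s])

# Finite-difference gradient of U(theta) for comparison
grad_fd = np.zeros_like(theta)
eps = 1e-6
for s in range(nS):
    for a in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps
        tm[s, a] -= eps
        grad_fd[s, a] = (evaluate(tp)[-1] - evaluate(tm)[-1]) / (2 * eps)

assert np.allclose(grad_pg, grad_fd, atol=1e-5)
```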

Baseline Trick (Advantage Form)¶

For any baseline $b(s)$ that depends only on the state $s$ and not on the action $a$, $$ \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s) b(s)\right] = b(s)\nabla_\theta \sum_a \pi_\theta(a|s) = b(s)\nabla_\theta 1 = 0. $$ Thus $\mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a|s) b(s) \big] = \mathbb{E}_{s\sim d^{\pi_\theta}}\big[ \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s) b(s)\right] \big] = 0$. Therefore subtracting $b(s)$ does not change the expectation in $\nabla_\theta U(\theta)$, which yields the baseline trick form of the policy gradient $$ \boxed{ \nabla_\theta U(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a|s) \big(Q^{\pi_\theta}(s,a)-b(s)\big) \right] } $$ Choosing $b(s)=V^{\pi_\theta}(s)$ yields the advantage function $$ A^{\pi_\theta}(s,a) \triangleq Q^{\pi_\theta}(s,a)-V^{\pi_\theta}(s), $$ and hence the advantage form of gradient is $$ \boxed{ \nabla_\theta U(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a\mid s) A^{\pi_\theta}(s,a) \right] } $$
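That the baseline term vanishes can be checked exactly for a small softmax policy over one state (the logits and the baseline value below are arbitrary illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
nA = 4
theta = rng.standard_normal(nA)            # logits for one fixed state
pi = np.exp(theta) / np.exp(theta).sum()
b = 3.7                                    # arbitrary baseline value b(s)

# Score of a softmax policy: grad_theta log pi(a) = e_a - pi
scores = np.eye(nA) - pi                   # row a holds the score vector for action a

# E_{a ~ pi}[ grad log pi(a) * b ], computed exactly by summing over actions
expectation = (pi[:, None] * scores * b).sum(axis=0)
assert np.allclose(expectation, 0.0)
```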

Under policy $\pi$, there is one and only one advantage function, defined as: $$ \boxed{ A^\pi(s,a) \triangleq Q^\pi(s,a) - V^\pi(s) } $$ Here $V^\pi(s)$ and $Q^\pi(s,a)$ are as defined at the very beginning: $V^\pi(s)$ is the state value function, which represents the expected return starting from state $s$ and following policy $\pi$; $Q^\pi(s,a)$ is the action value function, which represents the expected return starting from state $s$, taking action $a$ first, then following policy $\pi$.

Why is this called the Advantage? Semantically, $Q^\pi(s,a)$ measures the expected outcome of committing to action $a$, while $V^\pi(s)$ reflects the average outcome over actions in state $s$. By subtracting the state value from the action value, we remove the "state bias," isolating the net contribution of action $a$ compared to the average in state $s$. In other words, it measures whether taking action $a$ in state $s$ is better or worse than the average in that state.

  • If $A^\pi(s_t,a_t)>0$: The action outperforms the average, so its probability should be increased.
  • If $A^\pi(s_t,a_t)<0$: The action is worse than average; its probability should be decreased.
  • If $A^\pi(s_t,a_t)=0$: The action matches the expectation; no update is needed.

--- Common Policy Parameterizations & Log-Gradients¶

The policy gradient theorem gives the update direction $$ \nabla_\theta U(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a\mid s) A^{\pi_\theta}(s,a) \right] $$ To apply this in practice, we only need to specify:

  1. the policy form $\pi_\theta(a\mid s)$, and its log-gradient $\nabla_\theta \log \pi_\theta(a\mid s)$.
  2. the advantage $A^{\pi_\theta}(s,a)$.

Below are some common choices of the policy form. We'll talk about how to obtain the advantage $A^{\pi_\theta}(s,a)$ in the next part.

Gaussian Policy (Continuous Actions)¶

A standard choice for continuous control is a Gaussian policy with state-dependent mean: $$ a \sim \pi_\theta(\cdot\mid s) = \mathcal{N}\big(\mu_\theta(s),\Sigma\big) $$ where $\Sigma$ is fixed (or separately parameterized).

The log-likelihood is $$ \log \pi_\theta(a\mid s) = - \tfrac12 (a-\mu_\theta(s))^\top \Sigma^{-1}(a-\mu_\theta(s)) + \text{const} $$

Taking the gradient w.r.t. $\theta$ (with $\nabla_\theta \mu_\theta(s)$ denoting the Jacobian of the mean), $$ \nabla_\theta \log \pi_\theta(a\mid s) = \nabla_\theta \mu_\theta(s)^\top \Sigma^{-1}(a-\mu_\theta(s)) $$ Plugging into the policy gradient $$ \nabla_\theta U(\theta) = \mathbb{E} \left[ \nabla_\theta \mu_\theta(s)^\top \Sigma^{-1}(a-\mu_\theta(s)) A^{\pi_\theta}(s,a) \right] $$

Interpretation. Actions above the mean $(a>\mu)$ with positive advantage push the mean upward; actions below the mean push it downward.
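The Gaussian log-gradient can be checked against a finite-difference derivative in the scalar case; the linear mean $\mu_\theta(s) = \theta s$ and the numbers below are illustrative assumptions:

```python
import numpy as np

SIGMA = 0.5  # fixed standard deviation

def log_pi(theta, s, a):
    mu = theta * s  # illustrative linear mean mu_theta(s) = theta * s
    return -0.5 * ((a - mu) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

theta, s, a = 0.8, 1.5, 2.0
# Analytic score: Sigma^{-1} (a - mu_theta(s)) * dmu/dtheta = (a - theta*s)/sigma^2 * s
analytic = (a - theta * s) / SIGMA**2 * s

eps = 1e-6
numeric = (log_pi(theta + eps, s, a) - log_pi(theta - eps, s, a)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```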

Softmax Policy (Discrete Actions)¶

For discrete action spaces, a common parameterization is a softmax over logits: $$ \pi_\theta(a\mid s) = \frac{e^{h_\theta(s,a)}}{\sum_b e^{h_\theta(s,b)}} $$ The log-gradient has a clean closed form $$ \nabla_\theta \log \pi_\theta(a\mid s) = \nabla_\theta h_\theta(s,a) - \sum_b \pi_\theta(b\mid s)\nabla_\theta h_\theta(s,b). $$ Substituting into the policy gradient $$ \nabla_\theta U(\theta) = \mathbb{E} \left[ \Big( \nabla_\theta h_\theta(s,a) - \mathbb{E}_{b\sim\pi_\theta(\cdot\mid s)}[\nabla_\theta h_\theta(s,b)] \Big) A^{\pi_\theta}(s,a) \right] $$

Interpretation. The gradient increases the logit of the taken action relative to the average logit under the current policy, weighted by advantage.
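The closed form can be verified numerically in the simplest parameterization, where the logits are the parameters themselves for a single fixed state, $h_\theta(s,a) = \theta_a$, so that $\nabla_\theta h_\theta(s,a)$ is the one-hot vector $e_a$ and the formula reduces to $e_a - \pi$:

```python
import numpy as np

def log_pi(theta, a):
    # h_theta(s, a) = theta[a] for a single fixed state
    return theta[a] - np.log(np.exp(theta).sum())

rng = np.random.default_rng(2)
nA = 5
theta = rng.standard_normal(nA)
pi = np.exp(theta) / np.exp(theta).sum()
a = 3

# Closed form: grad h(s,a) - E_b[grad h(s,b)] = e_a - pi
closed = (np.arange(nA) == a).astype(float) - pi

# Finite-difference check, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (log_pi(theta + eps * np.eye(nA)[i], a)
     - log_pi(theta - eps * np.eye(nA)[i], a)) / (2 * eps)
    for i in range(nA)
])
assert np.allclose(closed, numeric, atol=1e-6)
```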

--- How to Learn the Advantage Function?¶

From the policy gradient theorem, $$ \nabla_\theta U(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},a\sim\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a\mid s) A^{\pi_\theta}(s,a) \right] $$ once the policy form is fixed and $\nabla_\theta \log \pi_\theta(a\mid s)$ is known, the only remaining problem is: How do we estimate the advantage function $A^\pi(s,a)$?

Why Exact $Q^\pi$ and $V^\pi$ Are Infeasible¶

By definition, $$ A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) $$ However, computing $Q^\pi$ and $V^\pi$ exactly is generally impossible in modern RL settings:

  • High-dimensional or continuous state spaces / Continuous action spaces

    High-dimensional states $s \in \mathbb{R}^d$ make tabular representations infeasible, and in continuous action spaces, evaluating $Q^\pi(s,a)$ for every action is impossible.

  • Unknown dynamics

    The environment transition $P(s'|s,a)$ is not known, so expectations must be approximated via sampling.

As a result, advantage functions must be estimated from sampled trajectories.

Two Core Estimation Philosophies¶

Almost all policy gradient algorithms—including natural policy gradient methods—ultimately estimate the advantage using either Monte Carlo returns or temporal-difference (bootstrapped) estimates, or a mixture of the two (e.g. $n$-step returns, GAE). Natural policy gradients change how the gradient is preconditioned, not how the advantage is estimated.

  • Monte Carlo Estimation (Trajectory-Based)¶

    Under Monte Carlo estimation, both $Q^\pi$ and $V^\pi$ are estimated directly from sampled trajectories, without bootstrapping.

    Given a trajectory $\tau = (s_0,a_0,r_0,s_1,a_1,r_1,\dots)$, define the return $$ G_t \triangleq \sum_{k=0}^{\infty} \gamma^k r_{t+k} $$

    • Action-value estimate $$ Q^\pi(s_t,a_t) \approx G_t $$ since $G_t$ is an unbiased sample of the return following $(s_t,a_t)$.

    • State-value estimate $$ V^\pi(s) \approx \hat V^\pi(s) = \frac{1}{N(s)}\sum_{i:\, s_t^{(i)} = s} G_t^{(i)} $$ i.e., the empirical average of the returns observed from visits to state $s$, where $N(s)$ counts those visits.

    Thus, the Monte Carlo advantage estimator is $$ A_t \approx G_t - \hat V^\pi(s_t) $$

    Key properties: Unbiased; High variance; Requires waiting until episode termination.

    This idea underlies REINFORCE-style methods.
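The return computation above can be sketched in a few lines, accumulating backwards over one finite episode (the rewards are made up for illustration):

```python
import numpy as np

def mc_returns(rewards, gamma):
    """Discounted returns G_t for one finite episode, accumulated backwards."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rewards = [1.0, 0.0, 2.0]           # made-up episode rewards
G = mc_returns(rewards, gamma=0.5)
# G_2 = 2, G_1 = 0 + 0.5*2 = 1, G_0 = 1 + 0.5*1 = 1.5
assert np.allclose(G, [1.5, 1.0, 2.0])
```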

  • Temporal-Difference Estimation (Bootstrapping)¶

    To be clear: here we use TD (stochastic approximation) only to estimate the state-value function; the action-value function is then obtained by bootstrapping from the estimated state-value function.

    At time step $t$, we observe a one-step transition $(s_t, a_t, r_t, s_{t+1}), a_t \sim \pi(\cdot \mid s_t)$.

    • State-value estimate

      The state-value function is learned directly by TD learning (stochastic approximation) $$ V^\pi(s_t) \approx \hat V^\pi(s_t), $$ where $\hat V^\pi$ is updated to satisfy the Bellman fixed-point equation $V^\pi(s) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t=s\right]$.

    • Action-value estimate

      The estimation of action-value function uses a single transition and a state-value function approximation $\hat V^\pi$, to form the one-step bootstrapped estimate $$ Q^\pi(s_t,a_t) \approx r_t + \gamma \hat V^\pi(s_{t+1}) $$

    Thus, the TD advantage estimator is $$ A_t \approx r_t + \gamma \hat V^\pi(s_{t+1}) - \hat V^\pi(s_t) $$

    Key properties: Biased (due to bootstrapping); Lower variance; Can learn online and incrementally.

    This idea underlies Actor–Critic-style methods.

    Why use TD to estimate $V^\pi$ rather than $Q^\pi$?

    The policy gradient depends on the advantage $A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s)$; when TD is used to learn $V^\pi$, the advantage estimate $\hat A_t$ is exactly the TD error $\delta_t = r_t + \gamma \hat V^\pi(s_{t+1}) - \hat V^\pi(s_t)$ already computed in the TD update of $V^\pi$! In other words, we get an advantage estimate for free (and it is an unbiased estimate of $A^\pi(s_t,a_t)$ when $\hat V^\pi = V^\pi$). In contrast, learning $Q^\pi$ directly would still require computing $V^\pi(s)=\mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]$ to form a baseline, which introduces additional estimation noise and computational cost, especially in continuous action spaces. Consequently, learning $V^\pi$ is the minimal and most efficient choice for actor–critic methods.
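The one-step TD advantage estimator can likewise be sketched directly from its formula; the critic values below are hypothetical placeholders standing in for a learned $\hat V^\pi$:

```python
import numpy as np

def td_advantages(rewards, values, gamma):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step of an episode.
    `values` has length len(rewards) + 1: V-hat at s_0 .. s_T (terminal V = 0)."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    return rewards + gamma * values[1:] - values[:-1]

rewards = [1.0, 0.0, 2.0]                # made-up episode rewards
values = [0.5, 1.0, 0.4, 0.0]            # hypothetical critic outputs
A = td_advantages(rewards, values, gamma=0.9)
# A_0 = 1 + 0.9*1.0 - 0.5, A_1 = 0 + 0.9*0.4 - 1.0, A_2 = 2 + 0 - 0.4
assert np.allclose(A, [1.4, -0.64, 1.6])
```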

  • Generalized Advantage Estimation (MC–TD Interpolation)¶

    GAE strictly follows the TD approach, only changing the way aggregation is performed. Starting from the TD error $\delta_t = r_t + \gamma \hat V^\pi(s_{t+1}) - \hat V^\pi(s_t)$, the definition of GAE is $$ \hat A_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^\infty (\gamma\lambda)^l\delta_{t+l} $$

    Notice that $\lambda=0$ → TD(0) and $\lambda=1$ → Monte Carlo advantage.

    Key properties: Bias–variance tradeoff controlled by $\lambda$; Lower variance than Monte Carlo; Lower bias than TD; Uses bootstrapping via TD errors; Can be computed online with truncated horizons (no need to wait for episode termination).

    This idea underlies A2C / A3C / PPO methods.
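A minimal sketch of truncated GAE via the backward recursion $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$, checking the $\lambda=0$ and $\lambda=1$ limits (the rewards and critic outputs are hypothetical, with terminal value 0):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """Truncated GAE via the backward recursion A_t = delta_t + gamma*lam*A_{t+1}."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    deltas = rewards + gamma * values[1:] - values[:-1]
    A = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        A[t] = running
    return A

rewards = np.array([1.0, 0.0, 2.0])      # made-up episode rewards
values = np.array([0.5, 1.0, 0.4, 0.0])  # hypothetical critic outputs, terminal V = 0
gamma = 0.9

# lambda = 0 recovers the one-step TD error delta_t
deltas = rewards + gamma * values[1:] - values[:-1]
assert np.allclose(gae(rewards, values, gamma, 0.0), deltas)

# lambda = 1 recovers the Monte Carlo advantage G_t - V(s_t)
G = np.zeros(3)
running = 0.0
for t in reversed(range(3)):
    running = rewards[t] + gamma * running
    G[t] = running
assert np.allclose(gae(rewards, values, gamma, 1.0), G - values[:-1])
```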