Solving Bellman equations directly via matrix inversion is often too expensive. Instead, we use iterative methods: policy iteration and value iteration. This note focuses on policy iteration; value iteration will be discussed in the next note.
Policy Iteration simply repeats two steps:
- Evaluate the current policy (Policy Evaluation)
- For each state, choose a better action (Policy Improvement / Greedification)
Continue until the policy no longer changes ⇒ Obtain the optimal policy $\pi^*$.
Takeaways:
- What is the expression for the Bellman expectation backup?
- In policy iteration, if the policy is unchanged after the greedification step, what does this mean?
- When GPI converges, does it converge to the optimum?
- What is the difference between policy iteration and value iteration in terms of evaluation granularity?
--- Step 1: Policy Evaluation¶
Recall that the Bellman Expectation Equation for the state-value function is $$ v_\pi(s) = \sum_a \pi(a|s) \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \right] $$ To get $v_\pi$, we could solve this linear system directly in closed form. Alternatively, we can use the iterative policy evaluation algorithm described below.
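To see the closed form concretely, stack the values into a vector. Writing $r_\pi$ for the expected-reward vector and $P_\pi$ for the state-transition matrix induced by $\pi$ (notation introduced here for illustration, with $(r_\pi)_s=\sum_a\pi(a|s)\,r(s,a)$ and $(P_\pi)_{ss'}=\sum_a\pi(a|s)\,p(s'|s,a)$), the equation becomes $$ v_\pi = r_\pi + \gamma P_\pi v_\pi \quad\Longrightarrow\quad v_\pi = (I - \gamma P_\pi)^{-1} r_\pi $$ The inversion costs $\mathcal O(n^3)$ for $n$ states, which is why the iterative scheme is preferred for large state spaces.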
Iterative Policy Evaluation Algorithm¶
Initialization: $$ v_0(s) = \text{any value, usually 0} \quad \forall s $$ Synchronous backup: $$ v_{k+1}(s) = \sum_a \pi(a|s) \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_k(s') \right] $$
Here, "synchronous" means that in the $k$-th iteration, the updates for all states $s$ are performed using only $v_k$, and after all updates are finished, we obtain $v_{k+1}$ all at once. One iteration is also called a sweep, which means completely traversing all the states once.
In contrast, with Asynchronous Backup there is no globally consistent $v_k$ shared by all states: different states may have been updated a different number of times, so some states carry more up-to-date information from more frequent updates while others still use older values.
Stopping criterion: $$ \max_s |v_{k+1}(s) - v_k(s)| < \varepsilon $$
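As a minimal sketch of the algorithm above, assuming the model is stored as NumPy arrays `P` of shape `[n, m, n]` with `P[s, a, s']` $= p(s'|s,a)$, `r` of shape `[n, m]` with `r[s, a]` $= r(s,a)$, and a stochastic policy `pi` of shape `[n, m]` (these names and shapes are assumptions for illustration, not fixed by the text):

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma, eps=1e-8):
    """Synchronous iterative policy evaluation: v_{k+1} = F_pi(v_k)."""
    n = r.shape[0]
    v = np.zeros(n)                          # v_0(s) = 0 for all s
    while True:
        q = r + gamma * P @ v                # q[s, a] = r(s,a) + γ Σ_{s'} p(s'|s,a) v_k(s')
        v_new = (pi * q).sum(axis=1)         # average over actions with weights π(a|s)
        if np.max(np.abs(v_new - v)) < eps:  # stop when max_s |v_{k+1}(s) - v_k(s)| < ε
            return v_new
        v = v_new
```

Because `v_new` is built entirely from the old `v`, each pass through the loop is exactly one synchronous sweep.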
Why does the algorithm converge to the unique $v_\pi$?¶
Useful facts:
What "distance" do we use to measure how close two value functions are? $$|v'-v|_{\infty}=\max_{s\in\mathcal S}|v'(s)-v(s)|$$ Intuitively: on the worst-case state, how different are the two estimates.
Definition of the "Policy Evaluation update operator" (Bellman expectation operator): $$(F_\pi v)(s) =\sum_a \pi(a|s)\left[r(s,a)+\gamma\sum_{s'}p(s'|s,a)v(s')\right]$$ The iterative evaluation is: $$v_{k+1}=F_\pi(v_k)$$
$F_\pi$ is a $\gamma$-contraction: $|F_\pi(u)-F_\pi(v)|_{\infty} \le \gamma |u-v|_{\infty}$
Proof:
Write out the difference; the reward terms cancel: $$ (F_\pi u)(s)-(F_\pi v)(s) =\sum_a\pi(a|s)\gamma\sum_{s'}p(s'|s,a)\big(u(s')-v(s')\big) $$
Take the absolute value and upper bound: $$ |(F_\pi u)(s)-(F_\pi v)(s)| \le \gamma \sum_a\pi(a|s)\sum_{s'}p(s'|s,a)|u(s')-v(s')| $$
By the definition of the $\infty$-norm: for all $s'$, $|u(s')-v(s')|\le |u-v|_\infty$ $$ |(F_\pi u)(s)-(F_\pi v)(s)| \le \gamma \sum_a\pi(a|s)\sum_{s'}p(s'|s,a)|u-v|_\infty $$
The probabilities sum to 1 $(\sum_a\pi(a|s)=1$, $\sum_{s'}p(s'|s,a)=1)$: $$ |(F_\pi u)(s)-(F_\pi v)(s)| \le \gamma|u-v|_\infty $$
Take the maximum over all $s$: $$ |F_\pi(u)-F_\pi(v)|_\infty \le \gamma|u-v|_\infty $$
Policy Evaluation works because the Bellman expectation backup operator $F_\pi$ is a $\gamma$-contraction under the $|\cdot|_\infty$ norm. Therefore, according to the contraction mapping theorem, iterating from any initial value will always converge to the unique fixed point $v_\pi$.
Definition (Contraction). Let $F$ be an operator on a normed vector space $(\mathcal V,|\cdot|)$. $F$ is called a $\gamma$-contraction, for $0<\gamma<1$, if for all $x,y \in \mathcal V$, $$ |F(x) - F(y)| \le \gamma |x - y| $$
Theorem (Contraction Mapping Theorem). For a $\gamma$-contraction $F$ on a complete normed vector space $\mathcal V$:
$F$ has a unique fixed point $x^* \in \mathcal V$, i.e. $$ F(x^*) = x^*. $$
For any initial point $x_0 \in \mathcal V$, the sequence defined by $$ x_{k+1} = F(x_k) $$ converges to the unique fixed point $x^*$.
The convergence is linear, with rate $\gamma$.
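To see what "linear with rate $\gamma$" means here, apply the contraction property of $F_\pi$ repeatedly to the iterate $v_k$ and the fixed point $v_\pi$: $$ |v_k - v_\pi|_\infty = |F_\pi(v_{k-1}) - F_\pi(v_\pi)|_\infty \le \gamma\,|v_{k-1} - v_\pi|_\infty \le \cdots \le \gamma^k\,|v_0 - v_\pi|_\infty $$ so the worst-case error shrinks geometrically; for example, with $\gamma = 0.9$ about 22 sweeps are needed to reduce it by a factor of 10.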
Complexity¶
Under the following three assumptions, and writing $n = |\mathcal S|$ for the number of states and $m = |\mathcal A|$ for the number of actions, we can derive the per-iteration complexity of synchronous DP (the analysis is the same for iterative policy evaluation and for value iteration):
- The transition is dense: for every $(s,a)$, any $s'$ is possible
- The model is explicit: $P(s'|s,a)$ is known and can be accessed in $\mathcal{O}(1)$ time
- Synchronous sweep: every iteration must update all states
Complexity analysis:
For a single state $s$:
- Enumerate all actions to compute expectation: $\rightarrow$ $m$ times
- For each action, compute the expectation $\sum_{s'} P(s'|s,a)v(s')$ by summing over all possible $s'$: $\rightarrow$ $n$ times
Therefore: the cost of a single update for one state = $O(mn)$
Each iteration performs a sweep over all $n$ states
Therefore: $$ \text{cost per iteration} = n \times O(mn) = O(mn^2) $$
The real bottleneck is the $n^2$ term. Under this quadratic cost, synchronous DP can in practice handle state spaces only up to roughly $n \approx 10^3 \sim 10^4$; once $n$ reaches $10^5$, the time and memory requirements become unacceptable. Yet many problems we would like to apply classical DP to have millions of states, which is why we need other update schemes such as asynchronous value / policy iteration.
Also, TD and Q-learning use sample-based, local, asynchronous updates that avoid enumerating all successor states and thus escape the $n^2$ bottleneck. This is the fundamental reason they can scale to very large state spaces.
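As a sanity check on the $O(mn^2)$ count, here is one synchronous sweep written as explicit loops over the same assumed arrays `P`, `r`, `pi` used in the evaluation sketch; the triple loop touches every $(s, a, s')$ combination exactly once:

```python
import numpy as np

def one_sweep(P, r, pi, v, gamma):
    """One synchronous backup as explicit loops: n states × m actions × n successors."""
    n, m = r.shape
    v_new = np.zeros(n)
    for s in range(n):                                                   # sweep over all n states
        for a in range(m):                                               # enumerate all m actions
            expected_next = sum(P[s, a, s2] * v[s2] for s2 in range(n))  # n terms per (s, a)
            v_new[s] += pi[s, a] * (r[s, a] + gamma * expected_next)
    return v_new
```

The vectorized `P @ v` used earlier performs the same $mn^2$ multiply-adds; vectorization improves the constant factor, not the asymptotic cost.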
--- Step 2: Policy Improvement¶
As a prerequisite, recall the Bellman Expectation Equation for the state-action value $q_\pi$. Using the relationship $v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a)$, we can rewrite it as $$ \begin{align} q_\pi(s,a) & = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')\Bigr] \nonumber\newline & = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma v_\pi(s')\Bigr] \nonumber\newline & = r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \nonumber \end{align} $$ where $r(s,a) = \sum_{s', r} p(s', r \mid s, a)\, r$; and since $v_\pi(s')$ does not depend on $r$, summing over $r$ gives $\sum_{s', r} p(s', r \mid s, a) v_\pi(s') = \sum_{s'} p(s'|s,a) v_\pi(s')$.
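Computationally, this identity is just a one-step lookahead from $v_\pi$. With the same assumed arrays as in the evaluation sketch (`P`, `r`) and an evaluated `v`, both the lookahead and the greedification used below are one-liners:

```python
q = r + gamma * P @ v       # q[s, a] = r(s,a) + γ Σ_{s'} p(s'|s,a) v_π(s')
greedy = q.argmax(axis=1)   # π'(s) = argmax_a q_π(s, a), used in the improvement step
```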
Policy Improvement Theorem¶
Do the greedification for all states to get a new policy $\pi'$ that is greedy w.r.t. $q_\pi(s, a)$: $$\pi'(s) = \arg\max_a q_\pi(s,a)$$ Then we have $$v_{\pi'}(s) \ge v_\pi(s), \quad \forall s$$ That means, the improved policy $\pi'$ is no worse than the original policy $\pi$.
Proof:
A trivial inequality showing max ≥ average: $$q_\pi(s,\pi'(s))=\max_a q_\pi(s,a)\ge \sum_a\pi(a|s)q_\pi(s,a)=v_\pi(s)$$ In other words, when evaluated with the old value function $v_\pi$, the greedy action is no worse than following the old policy.
At each step, we replace the action chosen by $\pi$ with the greedy action $\pi'(s)$ (greedy with respect to the old value function $q_\pi$); this does not decrease the value at that single step, and unrolling the argument over time shows that the overall value does not decrease:
- By the definition of state-action value function $q$: $$ v_\pi(s) \le q_\pi(s,\pi'(s)) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t=s, A_t=\pi'(s)] $$
- Again, use the trivial inequality $v_\pi(S_{t+1}) \le q_\pi(S_{t+1}, \pi'(S_{t+1}))$ to replace $v_\pi(S_{t+1})$: $$ \begin{align} v_\pi(s) & \le \mathbb E \left[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1})) \mid S_t=s,\ A_t=\pi'(s) \right] \nonumber\newline & = \mathbb E \Big[ R_{t+1} + \gamma \mathbb E \left[ R_{t+2}+\gamma v_\pi(S_{t+2}) \mid S_{t+1},\ A_{t+1}=\pi'(S_{t+1}) \right] \Big| S_t=s,\ A_t=\pi'(s) \Big] \nonumber\newline & = \mathbb E \left[R_{t+1}+\gamma R_{t+2}+\gamma^2 v_\pi(S_{t+2}) \mid S_t=s,\ A_t=\pi'(s) \right] \nonumber \end{align} $$ The last step used the law of total expectation $\mathbb E \left[\mathbb E[X \mid Y,Z]\mid Z\right] = \mathbb E[X \mid Z]$.
- Recursive unrolling (pattern repeats) $$ v_\pi(s) \le \mathbb E \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t=s,\ A_t=\pi'(s) \right] = v_{\pi'}(s) $$
Policy Iteration Algorithm¶
Initialize π arbitrarily
repeat:
    # Policy Evaluation
    compute v_π (approximately)
    # Policy Improvement
    policy_stable = True
    for s in S:
        old_action = π(s)
        π(s) = argmax_a [ r(s,a) + γ Σ_{s'} p(s'|s,a) v_π(s') ]
        if old_action ≠ π(s):
            policy_stable = False
until policy_stable
return π*, v*
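A compact NumPy version of the pseudocode might look like the following sketch; it reuses the `policy_evaluation` function assumed in Step 1 and the same array layout for `P` and `r`, and represents a deterministic policy as an integer array of chosen actions:

```python
import numpy as np

def policy_iteration(P, r, gamma, eps=1e-8):
    n, m = r.shape
    policy = np.zeros(n, dtype=int)                  # initialize π arbitrarily (all action 0 here)
    while True:
        pi = np.eye(m)[policy]                       # one-hot π(a|s) for the evaluation routine
        v = policy_evaluation(P, r, pi, gamma, eps)  # Policy Evaluation
        q = r + gamma * P @ v                        # one-step lookahead
        new_policy = q.argmax(axis=1)                # Policy Improvement (greedification)
        if np.array_equal(new_policy, policy):       # policy stable ⇒ return π*, v*
            return policy, v
        policy = new_policy
```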
Convergence and Optimality Criteria¶
When to stop?
If the policy does not change after greedification, then in every state the greedy action coincides with the action $\pi$ already takes, i.e. $v_\pi(s) = q_\pi(s,\pi(s)) = \max_a q_\pi(s,a)$, so $$ \begin{align} v_\pi(s) & = \max_a q_\pi(s,a) \nonumber\newline & = \max_a \left[ r(s,a) + \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \right] \nonumber \end{align} $$ This is exactly the Bellman Optimality Equation. Thus, $v_\pi = v^*$ and $\pi = \pi^*$.
Why is optimality guaranteed?
By the policy improvement theorem $$ v_{\pi_{k+1}}(s)\ge v_{\pi_k}(s)\quad \forall s $$ So the value function is monotonically non-decreasing.
In a finite MDP, the number of deterministic policies is finite $(|\mathcal A|^{|\mathcal S|})$, each improvement step never decreases performance, and it typically strictly improves it somewhere. An infinite sequence of strict improvements is therefore impossible, and the process eventually reaches a policy-stable point.
Put simply, a bounded, monotonically non-decreasing sequence must converge.
--- Generalized Policy Iteration¶
Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, regardless of their granularity. In particular, the other iterative method, value iteration, can be viewed as GPI in which evaluation is truncated to a single sweep and improvement is implicit in the max (see the sketch after the table below). More broadly, almost all RL methods can be described as a form of GPI.
| Method | Evaluation | Improvement |
|---|---|---|
| Policy Iteration | Full / Multi-step | Greedy |
| Value Iteration | 1-step | Greedy at each step |
| TD / Q-learning | Approximate | Implicit |
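To make the "1-step evaluation, greedy at each step" row concrete: with the assumed arrays from the earlier sketches, the value-iteration backup collapses evaluation and improvement into a single line, where the max plays both roles (a preview of the next note):

```python
v = (r + gamma * P @ v).max(axis=1)  # one truncated evaluation sweep with greedy improvement folded into the max
```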
When GPI converges, does it converge to the optimum?¶
Answer: Yes — under the standard assumptions.
Reasoning: Policy improvement is monotonic. Regardless of how often improvement is interleaved with evaluation, once the process stabilizes (the value function is consistent with the policy and the policy is greedy with respect to that value function), the limit must satisfy the Bellman Optimality Equation. Since the Bellman optimality operator is also a contraction (like the Bellman expectation operator), the contraction mapping theorem implies that the solution of the Bellman optimality equation is unique. Thus the fixed point of GPI must be $(v^*, \pi^*)$.
Conclusion: GPI may or may not converge, but if it does converge, it converges to the optimal solution.