Solving Bellman equations directly via matrix inversion is often too expensive. Instead, we use iterative methods: policy iteration and value iteration. This note focuses on value iteration.
Unlike policy iteration, which alternates between two steps (policy evaluation and policy improvement), value iteration simply repeats a Bellman optimality backup, either on the state value function $$ v_{k+1}(s) \leftarrow \max_a \bigl[r(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s')\bigr] \quad \text{for all } s\in\mathcal{S} $$ or on the action value function $$ q_{k+1}(s,a) \leftarrow r(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} q_k(s',a') \quad \text{for all } s\in\mathcal{S},\ a\in\mathcal{A} $$ We continue until the value function converges, then extract the optimal policy $\pi^*$ greedily from $v^*$ or $q^*$.
We'll also see the connection between value iteration and policy iteration: a single value iteration step implicitly performs one step of policy evaluation and policy improvement at the same time.
Takeaways:
- Is the Bellman optimality operator a contraction?
- Is the per-iteration complexity of value iteration the same as that of iterative policy evaluation?
- What is the main difference between synchronous DP and asynchronous DP?
- How can value iteration be viewed as one-step GPI?
--- Value Iteration¶
Recall the Bellman Optimality Equations for the state value function $$ v^*(s) = \max_{a \in \mathcal{A}} \Bigl[ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, v^*(s') \Bigr] $$
Unlike the Bellman Expectation Equations, this is a system of non-linear equations, so we cannot compute a closed-form solution for $v^*$. Recall that even though the Bellman Expectation Equations do have a closed-form solution, we still use the iterative policy evaluation algorithm to solve them. Analogously, we use value iteration to solve the Bellman Optimality Equations above. The update forms of the two are almost identical; the only essential difference is that one takes the expectation over the policy ($\sum_a \pi$), while the other takes the maximum over actions ($\max_a$).
| Method | Equation Being Solved | Per-Iteration Update |
|---|---|---|
| Iterative Policy Evaluation | Bellman Expectation Equation | $v_{k+1}(s) \leftarrow \sum_a \pi(a\mid s)\Big[r(s,a) + \gamma \sum_{s'} P(s'\mid s,a) v_k(s')\Big]$ |
| Value Iteration | Bellman Optimality Equation | $v_{k+1}(s) \leftarrow \max_a \Big[r(s,a) + \gamma \sum_{s'} P(s'\mid s,a)v_k(s')\Big]$ |
Value Iteration Algorithm¶
Initialization: $$ v_0(s) = \text{any value, usually 0} \quad \forall s $$ Synchronous backup: $$ v_{k+1}(s) \leftarrow \max_{a\in\mathcal{A}} \bigl[r(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s')\bigr] $$
Here, "synchronous" means that in the $k$-th iteration, the updates for all states $s$ are performed using only $v_k$, and after all updates are finished, we obtain $v_{k+1}$ all at once. One iteration is also called a sweep, which means completely traversing all the states once.
In contrast, in an asynchronous backup (which we will introduce later), there is no globally consistent $v_k$ shared by all states: different states may have been updated a different number of times, so some states carry more up-to-date information from more frequent updates while others still hold older values.
Stopping criterion: $$ \max_s |v_{k+1}(s) - v_k(s)| < \varepsilon $$
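The loop above fits in a few lines of numpy. A minimal sketch, assuming the model is stored as arrays `P` of shape `(n, m, n)` with `P[s, a, s']` $= P(s'\mid s,a)$ and `R` of shape `(n, m)` with `R[s, a]` $= r(s,a)$; these names and shapes are illustrative conventions, not fixed by anything above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Synchronous value iteration.

    P: transition probabilities, shape (n, m, n), P[s, a, s2] = P(s2 | s, a).
    R: expected rewards, shape (n, m), R[s, a] = r(s, a).
    Returns the converged values and a greedy policy extracted from them.
    """
    n, m, _ = P.shape
    v = np.zeros(n)                          # v_0(s) = 0 for all s
    while True:
        q = R + gamma * P @ v                # q[s, a] = r(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s')
        v_new = q.max(axis=1)                # Bellman optimality backup for every state at once
        if np.max(np.abs(v_new - v)) < eps:  # stopping criterion: max_s |v_{k+1}(s) - v_k(s)|
            return v_new, q.argmax(axis=1)   # extract the greedy policy from the final values
        v = v_new
```

Computing the full table `q` before taking the max is exactly what makes this a synchronous sweep: every state's update reads only the old `v`.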
Why does the algorithm converge to the unique $v^*$?¶
The proof idea for value iteration is structurally identical to that of iterative policy evaluation: the only difference is replacing the Bellman expectation operator $F_\pi$ $$(F_\pi v)(s) \triangleq \sum_a \pi(a|s)\left[r(s,a)+\gamma\sum_{s'}P(s'|s,a)v(s')\right]$$ with the Bellman optimality operator $T$ $$(Tv)(s) \triangleq \max_{a\in\mathcal{A}}\Bigl[r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) v(s')\Bigr]$$ The key step is again to show that $T$ is a $\gamma$-contraction under the $\|\cdot\|_\infty$ norm, and then apply the Contraction Mapping Theorem to obtain the unique fixed point $v^*$ and convergence from any initial value.
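For reference, the contraction step itself is short. Using the elementary inequality $\lvert\max_a f(a) - \max_a g(a)\rvert \le \max_a \lvert f(a) - g(a)\rvert$, for any $v$, $u$ and any state $s$: $$ \begin{align} \lvert (Tv)(s) - (Tu)(s) \rvert & \le \max_{a}\, \gamma \sum_{s'} P(s' \mid s,a)\,\lvert v(s') - u(s') \rvert \nonumber\newline & \le \gamma\, \lVert v - u \rVert_\infty \nonumber \end{align} $$ Taking the maximum over $s$ gives $\lVert Tv - Tu \rVert_\infty \le \gamma \lVert v - u \rVert_\infty$.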
Complexity¶
Under the following three assumptions, we can derive the complexity of value iteration:
- The transition is dense: for every $(s,a)$, any $s'$ is possible
- The model is explicit: $P(s'|s,a)$ is known and can be accessed in $\mathcal{O}(1)$ time
- Synchronous sweep: every iteration must update all states
Complexity analysis:
For a single state $s$ (writing $n = |\mathcal{S}|$ for the number of states and $m = |\mathcal{A}|$ for the number of actions):
- Enumerate all actions to compute max: $\rightarrow$ $m$ times
- For each action, compute the expectation $\sum_{s'} P(s'|s,a)v(s')$ by summing over all possible $s'$: $\rightarrow$ $n$ times
Therefore: the cost of a single update for one state = $O(mn)$
Each iteration performs a sweep over all $n$ states
Therefore: $$ \text{cost per iteration} = n \times O(mn) = O(mn^2) $$
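For a concrete sense of scale (the numbers here are purely illustrative): with $n = 10^4$ states and $m = 10$ actions, a single sweep already costs on the order of $$ m n^2 = 10 \times (10^4)^2 = 10^9 $$ basic operations, and the sweep must be repeated until convergence.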
The real bottleneck is the $n^2$ term. Under this quadratic complexity, synchronous DP in practice can usually only handle state spaces of size $n \approx 10^3 \sim 10^4$. When $n$ reaches $10^5$, the time and memory requirements become unacceptable. But in practice, classical DP may need to be applied to problems at the scale of millions of states, which is why we need other update methods such as asynchronous value iteration.
Also, TD and Q-learning use sample-based, local, asynchronous updates that avoid enumerating all successor states and thus escape the $n^2$ bottleneck; this is the fundamental reason they can scale to very large state spaces.
Value Iteration for Action Value Function¶
As we have already seen, the state value function $v^*$ satisfies the Bellman Optimality Equation, and therefore can be solved using Value Iteration. In fact, the action value function $q^*$ also satisfies a Bellman Optimality Equation. This means we can use a completely analogous method to perform value iteration for $q^*$. The synchronous backup expression is: $$ q_{k+1}(s,a) \leftarrow r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_k(s',a') $$
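A sketch of this backup under the same illustrative `P`/`R` array conventions as the state-value code above:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Synchronous value iteration on the action value function.

    P: shape (n, m, n), P[s, a, s2] = P(s2 | s, a).  R: shape (n, m), R[s, a] = r(s, a).
    Returns q* and the greedy policy argmax_a q*(s, a).
    """
    n, m, _ = P.shape
    q = np.zeros((n, m))
    while True:
        # target[s, a] = r(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} q_k(s', a')
        q_new = R + gamma * P @ q.max(axis=1)
        if np.max(np.abs(q_new - q)) < eps:      # stop when the whole q-table has settled
            return q_new, q_new.argmax(axis=1)   # optimal policy falls out directly from q*
        q = q_new
```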
The convergence proof is identical to that of state value iteration: the Bellman optimality operator for $q$ is a $\gamma$-contraction, which guarantees the existence and uniqueness of $q^*$ and convergence of value iteration.
Value iteration for the action value function is computationally more expensive than its state-value counterpart, as it maintains values for all state–action pairs. This additional cost can be interpreted as trading increased computation during learning for reduced complexity during policy extraction: while state value iteration requires a separate policy improvement step after computing $v^*$, action value iteration yields the optimal policy directly through a greedy operation on $q^*$.
For a single state–action pair $(s,a)$:
- Enumerate all next states $s'$ to compute the expectation $\sum_{s'} P(s' \mid s,a)\max_{a'} q(s',a')$: $\rightarrow$ $n$ times
- For each next state $s'$, enumerate all actions $a'$ to compute the max: $\rightarrow$ $m$ times
Therefore: the cost of a single update for one $(s,a)$ pair = $O(nm)$
Each iteration performs a sweep over all state–action pairs: $$ |\mathcal{S}|\times|\mathcal{A}| = n \times m $$ Therefore: $$ \text{cost per iteration} = (nm) \times O(nm) = O(n^2 m^2) $$
--- Asynchronous Value Iteration¶
The classical convergence proofs for policy/value iteration implicitly assume that every state is visited and updated in each iteration. In reality, however, even when the state space is enumerable, the number of states can grow exponentially with the problem's dimensionality (curse of dimensionality), making a complete sweep impossible. Asynchronous methods therefore drop the exhaustive sweep: in each iteration we select a state (or state-action pair), for example by random sampling, and apply the Bellman backup only to that selection.
Asynchronous DP methods are still guaranteed to converge as long as all states continue to be selected, i.e., every state is backed up infinitely often.
Three asynchronous DP methods:
In-Place Value Iteration (default)
Synchronous value iteration stores two copies of the value function table, the old one and the new one. In-place value iteration keeps a single table and overwrites a state's value immediately whenever that state is selected, so later updates in the same sweep can already use the new values.
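A minimal sketch of the in-place variant (same illustrative `P`/`R` conventions as above): a single array `v` is both read and written, so each state's backup sees the freshest available values.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, eps=1e-6):
    """In-place value iteration: one value table, updated immediately state by state."""
    n, m, _ = P.shape
    v = np.zeros(n)
    while True:
        delta = 0.0
        for s in range(n):                             # update states one at a time
            backup = np.max(R[s] + gamma * P[s] @ v)   # uses the current (partially updated) v
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                              # overwrite immediately, no second table
        if delta < eps:
            return v
```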
Prioritized Sweeping
- At each step, give priority to backing up the state with the largest Bellman residual (error), and after each backup, update only the Bellman errors of states affected by this change.
- The affected states are all predecessor states that can reach the updated state in one transition, since their Bellman targets depend on the value of the updated state.
- Use a priority queue for efficient implementation.
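A sketch of this scheme under the same illustrative conventions (`max_backups`, the lazy handling of stale queue entries, and the predecessor construction are choices made for the example, not prescribed above):

```python
import heapq
import numpy as np

def prioritized_sweeping_vi(P, R, gamma=0.9, eps=1e-6, max_backups=100_000):
    """Prioritized sweeping: always back up the state with the largest Bellman residual,
    then refresh the residuals of its one-step predecessors only."""
    n, m, _ = P.shape
    v = np.zeros(n)
    # predecessors[s2] = states s that can reach s2 in one transition under some action
    predecessors = [np.nonzero(P[:, :, s2].sum(axis=1) > 0)[0] for s2 in range(n)]

    def residual(s):
        return abs(np.max(R[s] + gamma * P[s] @ v) - v[s])

    # max-heap via negated priorities; stale entries are harmless (they trigger redundant backups)
    heap = [(-residual(s), s) for s in range(n)]
    heapq.heapify(heap)

    for _ in range(max_backups):
        neg_prio, s = heapq.heappop(heap)
        if -neg_prio < eps:                        # largest residual is below threshold: done
            return v
        v[s] = np.max(R[s] + gamma * P[s] @ v)     # Bellman backup on the selected state
        for p in predecessors[s]:                  # only predecessors' Bellman targets changed
            heapq.heappush(heap, (-residual(p), p))
    return v
```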
Real-Time Dynamic Programming (RTDP)
Real-time Dynamic Programming (RTDP) no longer performs full Bellman updates over the entire state space. Instead, as the agent interacts with the environment, it only applies the Bellman backup to those states it actually visits.
The choice of which states to back up is naturally determined by the agent’s true trajectory and its current policy. For example, in a grid world, the agent starts at an initial state and moves step by step, performing a complete Bellman backup only on the current grid cell after each move, rather than updating the whole map. In contrast, standard Value Iteration sweeps through all states and performs Bellman backups at each one during every iteration—even for grid cells the agent might never visit.
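A sketch of RTDP under the same illustrative conventions, with the trajectory simulated from the known model; the fixed episode length here simply stands in for whatever termination condition the task actually provides:

```python
import numpy as np

def rtdp(P, R, start_state, gamma=0.9, episodes=1000, horizon=100, seed=0):
    """Real-Time Dynamic Programming sketch: back up only the states the agent visits."""
    rng = np.random.default_rng(seed)
    n, m, _ = P.shape
    v = np.zeros(n)
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            backups = R[s] + gamma * P[s] @ v   # one-step lookahead for all actions at s
            a = int(np.argmax(backups))         # act greedily w.r.t. the current values
            v[s] = backups[a]                   # Bellman backup only on the visited state
            s = rng.choice(n, p=P[s, a])        # sample the next state from the model
    return v
```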
--- Value Iteration From the GPI Point of View¶
Value iteration = GPI in which policy evaluation is truncated to a single backup. In the $k$-th iteration, we first perform policy improvement by acting greedily with respect to a one-step lookahead on $v_k$ $$ \pi_k^{\text{greedy}}(s) = \arg\max_a \Big(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_k(s')\Big) $$ Then we immediately perform a single Bellman expectation backup under this policy. Writing $q_k(s,a) \triangleq r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_k(s')$, this combination is exactly the Bellman optimality backup: $$ \begin{align} v_{k+1}(s) & = \sum_a \pi_k^{\text{greedy}}(a|s)\, q_k(s,a) \nonumber\newline & = \max_a \Big(r(s,a) + \gamma \sum_{s'} P(s'|s,a)v_k(s')\Big) \nonumber \end{align} $$
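This identity is easy to check numerically. The sketch below builds a small random MDP (all names and sizes are arbitrary, chosen only for the check) and verifies that greedy improvement followed by a single expectation backup coincides with the optimality backup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 5, 3, 0.9

# A random illustrative MDP: P[s, a, :] is a distribution over next states, R[s, a] arbitrary.
P = rng.random((n, m, n)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((n, m))
v_k = rng.random(n)

q_k = R + gamma * P @ v_k            # q_k(s, a) computed from v_k

# Policy improvement: one-step greedy policy with respect to v_k.
greedy = q_k.argmax(axis=1)

# Policy evaluation truncated to a single expectation backup under that (deterministic) policy...
v_gpi = q_k[np.arange(n), greedy]

# ...coincides with the Bellman optimality backup of value iteration.
v_vi = q_k.max(axis=1)
assert np.allclose(v_gpi, v_vi)
```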