Takeaways:
- What makes the Bellman Optimality Equations a system of non-linear equations?
- Do the Bellman Expectation Equations have a closed-form solution?
- Do the Bellman Optimality Equations have a closed-form solution?
--- Background¶
In the MDP framework, we typically distinguish three major learning objectives:
- Prediction: given a fixed policy, evaluate how good it is by computing its value function
- Control: among all policies, find the optimal policy and the corresponding optimal value function
- Learning: when the environment dynamics and reward function are unknown, learn value functions or policies through interaction with the environment (basically, prediction + control)
Two Types of Bellman Equations¶
Bellman equations come in two closely related forms: Bellman Expectation Equations and Bellman Optimality Equations. The former is the core tool for the "Prediction" objective, and the latter is the core tool for the "Control" objective.
Bellman Expectation Equations
- Concern Prediction (evaluating a policy)
- Define and compute value functions under a fixed policy
- Express the value function as the expected immediate reward plus the discounted value of successor states, where the expectation is taken over the policy’s action probabilities
- Lead to a system of linear equations, which can be written in matrix form and solved exactly (in principle)
Bellman Optimality Equations
- Concern Control (finding optimal policy)
- Characterize optimal value functions
- Replace the expectation over actions with a maximization, reflecting optimal decision making
- Lead to a system of nonlinear equations, whose solution defines the optimal value function and the optimal policy
Bellman equations address Prediction and Control, but not Learning directly. This is because Bellman equations assume an MDP with known transition dynamics and rewards and characterize the recursive consistency conditions that value functions must satisfy.
The "Learning" objective focuses instead on how to approximate these Bellman equations from data when the transition dynamics and rewards are unknown, which motivates iterative and sample-based algorithms studied later in approximate DP / RL.
--- Bellman Expectation Equations (Prediction)¶
The Bellman Expectation Equation is a system of linear equations in $v_{\pi}$ or $q_{\pi}$ (the unknowns appear only to first order), where each state (or state–action pair) corresponds to one equation.
Unique solution: the state or state-action value function under that policy.
Derivation of the State Value Function¶
Start with the recursive definition of the return (the source of all Bellman equations): $$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots $$ Through simple algebraic manipulation, we get the following form. It illustrates the key Bellman idea that "the current long-term return = one-step immediate reward + discounted future return": $$ G_t = R_{t+1} + \gamma G_{t+1} $$
By conditioning the above recursion on the state and taking expectation, we obtain the Bellman Expectation Equation (state value): $$ \begin{align} v_\pi(s) & = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] \nonumber\newline & = \sum_a \pi(a|s) \mathbb{E} \left[R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s,A_t=a\right] \nonumber\newline & = \sum_a \pi(a|s)\sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr] \nonumber \end{align} $$
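A minimal code sketch of this backup, assuming the environment is given in the common tabular form of expected rewards `r[s, a]` and transition probabilities `p[s, a, s']` (the decomposition derived later in this post); all array names are hypothetical placeholders, not from the text:

```python
import numpy as np

def bellman_expectation_backup_v(s, pi, p, r, v, gamma=0.9):
    """Right-hand side of the Bellman expectation equation for v_pi at state s.

    pi[s, a]    : probability of taking action a in state s under the policy
    p[s, a, s2] : transition probability T(s'|s, a)
    r[s, a]     : expected immediate reward r(s, a)
    v[s2]       : current estimate of v_pi(s2)
    """
    # sum_a pi(a|s) * ( r(s, a) + gamma * sum_{s'} T(s'|s, a) * v(s') )
    return float(np.sum(pi[s] * (r[s] + gamma * p[s] @ v)))
```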
Derivation of the State–Action Value Function¶
Similarly, if we condition on $(s, a)$, we obtain the Bellman Expectation Equation (state-action value): $$ \begin{align} q_\pi(s,a) & = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a] \nonumber\newline & = \sum_{s',r} p(s',r\mid s,a)\Bigl[r+\gamma\mathbb{E}_\pi[q_\pi(s',A_{t+1})\mid S_{t+1}=s']\Bigr] \nonumber\newline & = \sum_{s', r} p(s', r \mid s, a)\Bigl[r + \gamma \sum_{a'} \pi(a'|s') q_\pi(s', a')\Bigr] \nonumber \end{align} $$
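The same idea for $q_\pi$, as a minimal sketch using the same hypothetical tabular arrays:

```python
import numpy as np

def bellman_expectation_backup_q(s, a, pi, p, r, q, gamma=0.9):
    """Right-hand side of the Bellman expectation equation for q_pi at (s, a).

    q[s2, a2] holds the current estimate of q_pi(s', a'); pi, p, r are as before.
    """
    # E_pi[ q_pi(s', A') ] for every successor state s'
    expected_next_q = (pi * q).sum(axis=1)
    # r(s, a) + gamma * sum_{s'} T(s'|s, a) * sum_{a'} pi(a'|s') q_pi(s', a')
    return float(r[s, a] + gamma * p[s, a] @ expected_next_q)
```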
--- Bellman Optimality Equation (Control)¶
The Bellman Optimality Equation is a system of non-linear equations in $v^{*}$ or $q^{*}$, where each state (or state–action pair) corresponds to one equation.
This is a nonlinear system of equations because $\max(x,y)$ does not satisfy linearity: in general $\max(x_1+x_2,\,y_1+y_2)\neq \max(x_1,y_1)+\max(x_2,y_2)$ (for example, $\max(1+0,\,0+2)=2$ while $\max(1,0)+\max(0,2)=3$).
Unique solution: the state or state-action value function corresponding to the optimal policy.
The only difference between Expectation and Optimality is in one place: Expectation takes an expectation over actions $(\sum_a \pi(a|s))$, while Optimality takes the maximum over actions $(\max_a)$.
Optimal State Value Function¶
$$ v^*(s) = \max_{a \in \mathcal{A}} \sum_{s', r} p(s', r \mid s, a) \bigl[r + \gamma v^*(s')\bigr] $$
Optimal Action Value Function¶
$$ q^*(s,a) = \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \max_{a'} q^*(s', a') \bigr] $$
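Both optimality backups as a minimal sketch (same hypothetical tabular arrays `p[s, a, s']` and `r[s, a]` as before); the only change from the expectation backups above is that the policy average over actions becomes a $\max$:

```python
import numpy as np

def bellman_optimality_backup_v(s, p, r, v, gamma=0.9):
    """max_a [ r(s, a) + gamma * sum_{s'} T(s'|s, a) * v(s') ]"""
    return float(np.max(r[s] + gamma * p[s] @ v))

def bellman_optimality_backup_q(s, a, p, r, q, gamma=0.9):
    """r(s, a) + gamma * sum_{s'} T(s'|s, a) * max_{a'} q(s', a')"""
    return float(r[s, a] + gamma * p[s, a] @ q.max(axis=1))
```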
--- Solving the Bellman Equations¶
Bellman Expectation Equation (Prediction)¶
Starting from the Bellman expectation equation: $$ v_\pi(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r\mid s,a)\bigl[r+\gamma v_\pi(s')\bigr] $$ Decompose the environment into reward + transition:
- Recall the one-step expected reward in the MDP five-tuple definition:
$$
r(s,a)\doteq \mathbb{E}[R_{t+1}\mid S_t=s,A_t=a]=\sum_{s',r}p(s',r\mid s,a) r
$$
Here, $r$ and $r(s,a)$ are two completely different things. This $r$ refers to a particular value taken by the random reward $R_{t+1}$, whereas $r(s,a)$ denotes the expected-reward function.
- Recall the state transition in the MDP five-tuple definition: $$ T(s'|s,a)\doteq \Pr(S_{t+1}=s'\mid S_t=s,A_t=a)=\sum_{r}p(s',r\mid s,a) $$ Substituting back, we get: $$ v_\pi(s)=\sum_{a}\pi(a|s)\Bigl(r(s,a)+\gamma\sum_{s'}T(s'|s,a) v_\pi(s')\Bigr) $$ This can be simplified as (this is also the Bellman equation for the Markov Reward Process): $$ v_\pi(s)=r^{\pi}_s+\gamma\sum_{s'}T^{\pi}_{s'|s} v_\pi(s') $$ where
- expected reward at state $s$: $r^{\pi}_s\doteq \sum_{a}\pi(a|s) r(s,a)$
- state transition dynamics: $T^{\pi}_{s'|s}\doteq \sum_{a}\pi(a|s) T(s'|s,a)$
Stack the value of all states into a vector $\mathbf v_\pi\in\mathbb{R}^{|\mathcal S|}$, similarly the expected reward vector $\mathbf r^\pi\in\mathbb{R}^{|\mathcal S|}$, and the transition matrix $T^\pi\in\mathbb{R}^{|\mathcal S|\times|\mathcal S|}$:
- $(T^\pi)_{j,k} = \Pr(S_{t+1}=s_k \mid S_t=s_j, a\sim\pi(\cdot|s_j))$
- $(\mathbf r^\pi)_j = \mathbb{E}[R_{t+1}\mid S_t=s_j, a\sim\pi(\cdot|s_j)]$
- $(\mathbf v_\pi)_j = v_\pi(s_j)$
Then: $$ \mathbf v_\pi=\mathbf r^\pi+\gamma T^\pi\mathbf v_\pi $$
Rearrange into the standard linear system: $$ (I-\gamma T^\pi)\mathbf v_\pi=\mathbf r^\pi $$
As long as $I-\gamma T^\pi$ is invertible (which always holds for a finite MRP with $\gamma<1$, since $T^\pi$ is a stochastic matrix with spectral radius $1$, so every eigenvalue of $\gamma T^\pi$ has modulus at most $\gamma<1$), solving for $\mathbf v_\pi$ is a standard linear algebra problem, and you can obtain it directly using the matrix inverse: $$ \mathbf v_\pi=(I-\gamma T^\pi)^{-1}\mathbf r^\pi $$
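As a sanity check, here is a minimal numpy sketch that builds $T^\pi$ and $\mathbf r^\pi$ for a small made-up MDP (the arrays below are random placeholders, not an example from the text) and solves the linear system directly:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)      # normalize so p[s, a, :] is a distribution over s'
r = rng.random((nS, nA))               # expected reward r(s, a)
pi = np.full((nS, nA), 1.0 / nA)       # uniform random policy

T_pi = np.einsum('sa,sat->st', pi, p)  # T^pi_{s'|s} = sum_a pi(a|s) T(s'|s, a)
r_pi = (pi * r).sum(axis=1)            # r^pi_s     = sum_a pi(a|s) r(s, a)

# Solve (I - gamma T^pi) v = r^pi; np.linalg.solve avoids forming the inverse explicitly.
v_pi = np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)

# The solution satisfies the Bellman expectation equation.
assert np.allclose(v_pi, r_pi + gamma * T_pi @ v_pi)
```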
Why don't we actually use the matrix inverse to compute state values / state-action values?
The typical computational complexity of directly inverting a matrix or solving a linear system is $\mathcal{O}(|\mathcal S|^3)$, which becomes infeasible when the state space is large. That's why "directly solving the linear system" is mainly a theoretical tool: good to know, but in practice we use iterative algorithms. Similarly, in linear algebra courses we learn how to compute a matrix inverse by hand to get a closed form, but numerical software rarely forms an explicit inverse; linear systems are solved via matrix factorizations or, for large problems, iterative algorithms.
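A sketch of the iterative alternative (fixed-point iteration of the Bellman expectation backup, i.e. iterative policy evaluation), again on the hypothetical tabular arrays used above:

```python
import numpy as np

def iterative_policy_evaluation(pi, p, r, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman expectation backup to all states until convergence.

    pi[s, a], p[s, a, s'], and r[s, a] are the same hypothetical tabular arrays as above.
    """
    v = np.zeros(p.shape[0])
    while True:
        # v_new(s) = sum_a pi(a|s) * ( r(s, a) + gamma * sum_{s'} p(s'|s, a) v(s') )
        v_new = (pi * (r + gamma * p @ v)).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```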
Bellman Optimality Equation (Control)¶
$$ v^*(s) = \max_a \sum_{s',r}p(s',r|s,a)\bigl[r+\gamma v^*(s')\bigr] $$
The max operator is nonlinear, so the equation cannot be written in the form $A\mathbf v = \mathbf b$, and thus we cannot get a closed-form solution by matrix inversion. Instead, we must rely on iterative methods to approximate the optimal solution.
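For instance, value iteration (previewed at the end of this post) is just fixed-point iteration on this optimality backup; a minimal sketch on hypothetical tabular arrays:

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on the Bellman optimality backup (a sketch).

    p[s, a, s'] and r[s, a] are hypothetical tabular dynamics / expected rewards.
    Returns the (approximately) optimal state values and a greedy deterministic policy.
    """
    v = np.zeros(p.shape[0])
    while True:
        q = r + gamma * p @ v               # q[s, a] = r(s, a) + gamma * sum_{s'} p v
        v_new = q.max(axis=1)               # max over actions instead of a policy average
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # greedy policy w.r.t. the converged values
        v = v_new
```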
Solving Bellman Equations for State–Action Values
In this section, we mainly discuss how to solve Bellman equations for state values $v$. A natural question is: can the state–action values $q$ also be solved through Bellman equations?
Bellman Expectation Equation
For a given policy $\pi$, the state–action value function $q_\pi$ satisfies a set of linear equations. In theory, it can also be written in matrix form and a unique solution can be obtained by solving the linear system. However, since its dimension is $|\mathcal{S}||\mathcal{A}|$, the corresponding matrix is much larger, making computation and storage much more expensive. Therefore, in practice and in classroom derivations, this approach is rarely used.
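For concreteness, here is what that (rarely used) closed form looks like as a sketch: the unknown vector stacks all $q_\pi(s,a)$, so the system matrix is $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}||\mathcal{A}|$ (arrays are hypothetical as before):

```python
import numpy as np

def solve_q_pi_exactly(pi, p, r, gamma=0.9):
    """Closed-form q_pi via an |S||A| x |S||A| linear system (a sketch, rarely used in practice)."""
    nS, nA = r.shape
    # P_pi[(s, a), (s', a')] = T(s'|s, a) * pi(a'|s')
    P_pi = np.einsum('sat,tb->satb', p, pi).reshape(nS * nA, nS * nA)
    # Solve (I - gamma * P_pi) q = r for the stacked vector of q_pi(s, a) values.
    q_flat = np.linalg.solve(np.eye(nS * nA) - gamma * P_pi, r.reshape(-1))
    return q_flat.reshape(nS, nA)
```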
Bellman Optimality Equation
For the optimal state–action value function $q^*$, the Bellman equation contains the $\max$ operator, making the equation nonlinear. As a result, it cannot be solved in closed-form by matrix inversion. In this case, we also have to rely on iterative methods to approximate the optimal solution.
Therefore, when introducing "how to solve Bellman equations," we mainly focus on state value functions $v$; for control problems, whether it is $v^*$ or $q^*$, we ultimately need to use iteration-based algorithms, such as value iteration, policy iteration, and their corresponding methods in $q$-space.
Preview: Policy Iteration And Value Iteration¶
Because:
| Equation | Linear? | Closed-form via matrix inverse? |
|---|---|---|
| Bellman Expectation | ✅ Yes | Theoretically possible (but very costly) |
| Bellman Optimality | ❌ No (contains $\max$) | Impossible |
Therefore, we must use iterative algorithms to solve control and prediction problems.
- Policy Iteration
- Policy Evaluation (solve Bellman Expectation)
- Policy Improvement (greedy with respect to value)
- Value Iteration
- Directly iterates on the Bellman Optimality equation
Essentially, both are different forms of iterative backup for the Bellman Equation.
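A compact sketch of policy iteration that ties the two pieces together (exact policy evaluation via the linear system, then greedy improvement), on the same hypothetical tabular arrays:

```python
import numpy as np

def policy_iteration(p, r, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement (a sketch).

    p[s, a, s'] and r[s, a] are hypothetical tabular dynamics / expected rewards.
    """
    nS, nA = r.shape
    policy = np.zeros(nS, dtype=int)                # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma T^pi) v = r^pi for the current policy.
        T_pi = p[np.arange(nS), policy]             # T^pi[s, s'] = p[s, policy[s], s']
        r_pi = r[np.arange(nS), policy]
        v = np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated values.
        q = r + gamma * p @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy
```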