Markov Decision Process

Problem formulation, not solution. Can be compared with bandits:

Bandits (problem)        ⟶  UCB / TS / EXP3 (solution)
MDP (problem)            ⟶  Stochastic DP / Approx DP / RL (solution)

We dedicate a separate set of notes to Markov Decision Processes because understanding them thoroughly is essential. MDPs provide the mathematical foundation and a unified language for all sequential decision-making algorithms: DP, RL, and planning all share this modeling approach.

"Given an MDP, different assumptions about what is known lead to different solution paradigms."

Takeaways:

  1. What are the five components of the MDP five-tuple?
  2. As a turn-based game, what is the main difference between MDP and bandits?
  3. Is a bandit equivalent to a one-state MDP? Write out the state transition probabilities for a bandit.
  4. Why must we use discounted returns in a continuing task?
  5. Can we know $\pi^*$ if we know $q^*$? How about if we know $v^*$?
  6. What are the three main learning objectives of MDP?

--- The Five-Tuple Definition of an MDP

💡 Note: In theoretical studies, we usually assume the MDP is finite, which ensures that the Bellman equation has a unique solution, an optimal policy exists, and dynamic programming algorithms converge. In real-world applications (such as deep reinforcement learning), however, the state and action spaces are often infinite or continuous. The finite MDP serves as the theoretical foundation for analysis and teaching, even though real-world RL problems are usually infinite.

Regardless of whether the MDP is finite or infinite, both share the same five-tuple structure $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$; the only difference lies in whether the state space and action space are finite or infinite.

The five-tuple is the modeling language of an MDP, not an algorithmic assumption. Any solution must first be described using these five components.

A Markov Decision Process is defined by a five-tuple: $$ \mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r, \gamma) $$ where:

  1. State Space $\mathcal{S}$: The set of all possible environment states. $s_t \in \mathcal{S}$ denotes the concrete state at time $t$.

    A state should be a sufficient summary of the past. We should be able to throw away the history once the state is known, i.e., it should have the Markov Property: $$ \mathbb{P}\!\left[R_{t+1}=r,\,S_{t+1}=s' \mid S_0,A_0,R_1,\ldots,A_{t-1},R_t,S_t,A_t\right] = \mathbb{P}\!\left[R_{t+1}=r,\,S_{t+1}=s' \mid S_t,A_t\right] $$

  2. Action Space $\mathcal{A}$: The set of actions the agent can take, which depends on the state. $a_t \in \mathcal{A}(s_t)$ denotes the action taken at state $s_t$.

  3. State Transition $T(s', r \mid s, a)$: The probability of transitioning to state $s'$ and receiving reward $r$ after taking action $a$ in state $s$.

    $$T(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$ This equation expresses the Markov property we mentioned before.

    Notice that sometimes $r$ is deterministic, in which case the state transition can also be written as $T(s' \mid s, a)$.

  4. Reward Function $r(s, a, s')$: The immediate reward received after the transition, where $s'$ is the resulting state. The reward itself may be a random variable; $r(s,a,s')$ is its expectation:

    $$r(s,a,s') = \mathbb{E}\!\left[ R_{t+1} \mid S_t = s,\; A_t = a ,\; S_{t+1} = s'\right]$$

    Notice that $r$ does not always depend on $s'$, in which case the reward function can also be written as $r(s, a)$.

  5. Discount Factor $\gamma \in [0,1]$: Controls the importance of future rewards. $\gamma = 0$ considers only immediate rewards, while $\gamma \to 1$ emphasizes long-term returns.
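
As a concrete illustration, here is a minimal sketch of how the five-tuple could be written down in Python for a tiny, made-up two-state MDP; every state name, probability, and reward below is an illustrative assumption, not part of any particular problem in these notes.

```python
# A minimal, illustrative encoding of the five-tuple (S, A, T, r, gamma)
# for a made-up two-state MDP. All names and numbers are assumptions
# chosen purely for illustration.

states = ["s0", "s1"]                      # S: state space
actions = {"s0": ["stay", "go"],           # A(s): actions available in each state
           "s1": ["stay", "go"]}

# T[(s, a)] maps s' -> probability, i.e. the transition kernel T(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# r[(s, a, s')] is the expected immediate reward r(s, a, s').
r = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"):   0.0,
    ("s0", "go", "s1"):   1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go", "s0"):  -1.0,
    ("s1", "go", "s1"):   0.0,
}

gamma = 0.9                                # discount factor
```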

In reinforcement learning and dynamic programming, lowercase letters usually denote concrete values or true objects, e.g. states $s$, actions $a$, value functions $v_\pi, q_\pi$; uppercase letters usually denote random variables or estimators in algorithms, such as random state $S_t$, random action $A_t$, and learned values $V, Q$. A standard and clear convention in the field is: uppercase = random variable / learned estimate, lowercase = realized value or ground truth.

--- Markov Decision Process As A Turn-Based Game

In the bandits section, we abstracted bandits as a turn-based game; an MDP can be viewed the same way. If the bandit problem describes a "turn-based game where each round is independent," then the Markov Decision Process (MDP) describes a "turn-based game where the situation in each round can change based on your previous actions": rounds are no longer independent. The basic interaction pattern (aligned with the form of bandits) is:

  1. Given the set of available actions $\mathcal{A}(S_t)$ at the current state $S_t \in \mathcal{S}$
  2. Choose an action $A_t$
  3. The environment transitions to a new state according to the state transition probability $T(S_{t+1}\mid S_t, A_t)$
  4. Simultaneously, receive a reward $R_{t+1} = r(S_t, A_t, S_{t+1})$
  5. The agent updates its policy
  6. Move to the next round

Objective: Maximize expected cumulative reward (or minimize regret).
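
To make the turn-based loop concrete, here is a minimal interaction sketch. It reuses the toy `states`, `actions`, `T`, `r` dictionaries from the five-tuple sketch above and stands in a uniformly random policy for the agent; both are assumptions for illustration only.

```python
import random

def step(s, a):
    """Sample the next state from T(. | s, a) and look up the reward r(s, a, s')."""
    next_states, probs = zip(*T[(s, a)].items())
    s_next = random.choices(next_states, probs)[0]
    return s_next, r[(s, a, s_next)]

s = "s0"                                   # initial state (an assumption)
for t in range(5):
    a = random.choice(actions[s])          # 2. choose an action A_t (random policy)
    s_next, reward = step(s, a)            # 3.-4. environment transitions, reward R_{t+1}
    # 5. a real agent would update its policy here
    print(f"t={t}: s={s}, a={a}, r={reward}, s'={s_next}")
    s = s_next                             # 6. move to the next round
```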

Similarities between MDPs and bandits:

  • From the perspective of interaction, both follow: observe → choose → receive feedback → update → next round
  • The learning objective is to maximize expected cumulative return (or minimize regret)
  • In every step, an action is selected to maximize cumulative return

Key differences between MDPs and bandits:

  • There is an additional step (step 3): state transition. The set of available actions each round depends on the state $\mathcal{A}(S_t)$, so the interaction pattern in the next round may change depending on the action chosen in the previous round.

Looking at the MDP through the two core assumptions of bandits:

  1. Partial Feedback ✅

    In an MDP, at each step you can only observe the action you took, the reward you received, and the next state you transitioned to. You can't see "what would have happened if you had selected a different action"—just like in bandits.

  2. Stateless ❌

    In an MDP, your action directly influences the state transition. The current action affects which states you will visit in the future, which actions will be available to you, and the distribution of future rewards.

    In contextual bandits, the environment changes, but not due to your actions—it changes independently of the agent.

Bandit As A One-State MDP

Bandit is equivalent to an MDP where:

  • There is only one state $s^*$ (or the state contains no information)
  • The transition is always identical: $T(S_{t+1}=s^* \mid S_t=s^*, A_t)=1$
  • The reward depends only on the action (and possibly exogenous context): $R_t = r(A_t)$ or $R_t = r(A_t, C_t)$

This is a one-state MDP: actions never change the state. $A_t$ does not enter the environment's evolution, so the way the next context $C_{t+1}$, the next action set $\mathcal{A}_{t+1}$, and the reward mechanism are generated is independent of $A_t$.
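
A minimal sketch of this reduction, assuming a three-armed Bernoulli bandit with made-up arm means: the state set is a single $s^*$, every action transitions back to $s^*$ with probability 1, and the reward depends only on the action.

```python
import random

# Bandit written in the MDP five-tuple language: one state, self-loop transitions.
bandit_states = ["s*"]
bandit_actions = {"s*": ["arm0", "arm1", "arm2"]}
bandit_T = {("s*", a): {"s*": 1.0} for a in bandit_actions["s*"]}  # T(s* | s*, a) = 1

arm_means = {"arm0": 0.2, "arm1": 0.5, "arm2": 0.8}  # illustrative assumptions

def bandit_reward(a):
    """R_t = r(A_t): Bernoulli reward that depends only on the action, never on a state."""
    return 1.0 if random.random() < arm_means[a] else 0.0
```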

--- Learning Objective of MDP

Based on the MDP five-tuple, we can define the following key quantities:

  • Return $G_t$: The cumulative (discounted) reward starting from time $t$. The MDP objective function is based on the expected return.

    • Episodic task: The interaction will naturally end (has a terminal state), such as playing a game, a chess match, or completing a maze. Use simple summation: $$G_t = R_{t+1} + \cdots + R_T$$
    • Continuing task: The interaction does not naturally terminate (keeps going), such as stock trading, server scheduling, or temperature control systems. Use the discounted return to ensure $G_t$ converges: $$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$ It is also easy to see that $G_t = R_{t+1} + \gamma G_{t+1}$ (a small computational sketch appears after this list).
  • Policy $\pi$: A distribution over actions given states. It fully defines the behavior of the learner. It is updated during the learning process, but should be stationary (time-independent, i.e. $\mathbb{P}(A_t = a \mid S_t = s) = \mathbb{P}(A_{t'} = a \mid S_{t'} = s)$).

    • Stochastic policy: $\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$
    • Deterministic policy: $\pi(s) = a$ with probability 1.
  • Value Functions: Expected returns.

    • State value function: Under policy $\pi$, the cumulative expected return starting from state $s$. It measures the goodness of a state. $$v_\pi(s) = \mathbb{E}_\pi [G_t \mid S_t = s]$$
    • Action value function: Under policy $\pi$, the cumulative expected return starting from state $s$ and taking action $a$. It measures the goodness of a state-action pair. $$q_\pi(s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a]$$
  • Optimal Value Functions: Best achievable expected returns.

    • Optimal state-value function: The maximum cumulative expected return starting from state $s$ over all policies. It measures the best possible goodness of a state. $$v^*(s) = \max_\pi v_\pi(s)$$
    • Optimal action-value function: The maximum cumulative expected return starting from state $s$ and action $a$ over all policies. It measures the best possible goodness of a state-action pair. $$q^*(s,a) = \max_\pi q_\pi(s,a)$$
  • Optimal policy: $$ \pi^* = \arg\max_\pi \mathbb{E}_\pi [G_0] $$ The optimal policy is usually not unique; any policy that achieves the highest expected return is considered optimal.
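
As referenced above, here is a small sketch of the recursion $G_t = R_{t+1} + \gamma G_{t+1}$: given a finite reward sequence, it computes the return from every time step by sweeping backwards. The reward values are arbitrary examples.

```python
def returns_from_rewards(rewards, gamma):
    """rewards[k] holds R_{k+1}; the result's entry t is G_t."""
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = G
    return returns

print(returns_from_rewards([1.0, 0.0, 2.0], gamma=0.9))
# approximately [2.62, 1.8, 2.0]
```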

Why are the value functions useful? Consider the relationship between the optimal policy $\pi^*$ and $q^*$, $v^*$:

  • $\pi^*$ is greedy with respect to $q^*$. In state $s$, if you want to maximize total return, you should of course choose the action with the highest $q^*(s,a)$. Therefore, the optimal policy is $$\pi^*(s) = \arg\max_a q^*(s,a)$$
  • If you only know $v^*$, you also need the environment dynamics to derive $q^*$, and then obtain $\pi^*$. By definition, $q^*(s,a)$ is the expected return obtained by taking action $a$ in state $s$ and then behaving optimally. After one transition, the agent receives reward $r$ and moves to the next state $s'$. Therefore (we will see in the Bellman Equation notes that this is exactly the Bellman Optimality Equation), $$q^*(s,a) = \mathbb{E}[r + \gamma v^*(s') \mid s,a] = \sum_{s',r} T(s',r \mid s,a)\bigl(r + \gamma v^*(s')\bigr)$$ and accordingly $$\pi^*(s) = \arg\max_a \sum_{s',r} T(s',r \mid s,a)\bigl(r + \gamma v^*(s')\bigr)$$
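
A minimal sketch of both extraction rules, assuming dictionaries `q_star` and `v_star` that already hold optimal values for the toy MDP sketched earlier; these inputs are placeholders for illustration, not something the notes compute here.

```python
def greedy_from_q(q_star, actions, s):
    """pi*(s) = argmax_a q*(s, a): no model of the environment needed."""
    return max(actions[s], key=lambda a: q_star[(s, a)])

def greedy_from_v(v_star, actions, T, r, gamma, s):
    """pi*(s) = argmax_a sum_{s'} T(s'|s,a) * (r(s,a,s') + gamma * v*(s')):
    requires the environment dynamics T and r."""
    def q(a):
        return sum(p * (r[(s, a, s_next)] + gamma * v_star[s_next])
                   for s_next, p in T[(s, a)].items())
    return max(actions[s], key=q)
```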

Learning Objectives of MDP

In the MDP framework, there are three main objectives we might pursue:

  1. Evaluating a given policy (prediction)

    Given an MDP $(\mathcal S, \mathcal A, T, r, \gamma)$ and a policy $\pi(a|s)$, the goal is to find the state-value function $v_{\pi}$ or action-value function $q_{\pi}$. That is, “What is the long-term expected return if we always follow this policy?”

    Example: Evaluating the performance of a "beginner player policy" in a maze.

    • Policy (cannot be changed): at each intersection, 50% chance to go left, 50% to go right.
    • All you want to know is: "If you always follow this strategy, what is the average score you will achieve?"
    • Output: The state-value function $v_{\pi}$ for each position.

    In industry systems, the ultimate goal is decision-making; prediction is not itself a way of making decisions but a tool for evaluation, so it rarely exists as a final product. Its main purpose is to support control and learning: it serves as a sub-problem of planning/control and as a source of training signal for RL (a minimal policy-evaluation sketch appears after this list).

  2. Finding the optimal policy (optimal control / planning)

    Given an MDP $(\mathcal S, \mathcal A, T, r, \gamma)$, the goal is to find $\pi^*$.

    Example: Solving a board game with known rules.

    • You know the winning/losing rules and all possible policies.
    • You just want to know: "What is the best move in each situation?" In other words, which policy achieves the highest state value in each state?
    • Output: The optimal policy $\pi^*$ and the corresponding $v^*$, $q^*$.
  3. Learning via interaction when the environment is unknown (learning / RL)

    Given an MDP $(\mathcal S, \mathcal A, T, r, \gamma)$ with unknown $T$ and $r$, the objective is to find $\pi^*$ (usually aiming for an approximately optimal policy).

    Example: Teaching a robot to walk on unknown terrain.

    • The environmental rules are unknown: you do not know which actions will move forward, fall, or cause damage, and must rely on trial-and-error interaction with the environment.
    • During learning, keep updating the state-value function / action-value function / policy.
    • Output: a policy $\pi$ learned through interaction (ideally close to $\pi^*$), and its corresponding $v$ or $q$.

    Learning combines prediction and control: when the model is unknown, you estimate value functions and improve policy at the same time.
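
As mentioned under objective 1, here is a minimal iterative policy-evaluation sketch for the toy MDP defined in the earlier snippets, assuming a uniformly random policy. It is a plain fixed-point iteration on $v_\pi$, not any particular library's API.

```python
def evaluate_policy(states, actions, T, r, gamma, tol=1e-8):
    """Iteratively apply the expected one-step update until v_pi stops changing."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = 0.0
            for a in actions[s]:
                pi_a = 1.0 / len(actions[s])             # uniformly random policy
                for s_next, p in T[(s, a)].items():
                    new_v += pi_a * p * (r[(s, a, s_next)] + gamma * v[s_next])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

print(evaluate_policy(states, actions, T, r, gamma))
```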

An intuitive analogy

Planning

  • You have: a complete map and the travel time for each road
  • Approach: Calculate the shortest path. You can "simulate it in your mind" and compute it exactly.

Learning

  • You only see: one step at a time as you go
  • Approach: Keep trying and continuously update your experience