Temporal Difference Learning

TD = use stochastic approximation to solve Bellman expectation / optimality equation.

Prerequisite: read stochastic approximation note.

Bootstrapping means using an estimate to estimate another estimate. In TD learning, the target contains the current value estimate of the next state $V(s')$, e.g. $r + \gamma V(s')$, so the value function is updated using its own prediction.

TD learning is not a standalone algorithm in industrial RL. It functions as a fundamental building block—most commonly the value/critic update in actor–critic architectures. Pure tabular TD is impractical; real systems rely on TD combined with function approximation and stabilization techniques.

--- Prerequisites of TD Learning¶

Notations¶

Return $$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T $$ State Value $$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] $$ Action Value $$ q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a] $$

--- TD Policy Evaluation (Prediction)¶

Goal: Given a fixed policy $\pi$, estimate the value function of each state / state-action: $$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s] \quad \text{ or } \quad q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a] $$

TD: Solving An Expectation-Based Equation By Stochastic Approximation¶

Basically, the TD policy evaluation has the same goal as Monte Carlo policy evaluation: to have the estimation of the value functions $v_\pi(s)$ / $q_\pi(s,a)$. The fundamental difference is: Monte Carlo estimates an expectation, while TD solves an equation involving an expectation.

Monte Carlo methods treat the value function according to the definition: $$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] $$ $$ q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] $$

viewing the value function as the expectation of a well-defined random variable $G_t$, and use the empirical mean of $G_t$ to estimate that expectation. To compute the empirical mean, they collect each sample point $G_t^{(1)}, G_t^{(2)}, \ldots$ by adding up the observed rewards $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$.

Temporal difference methods, in contrast, take a step further from the definition of value function to a Bellman expectation equation $$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi \bigl[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \bigr] $$ $$ q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi \bigl[ R_{t+1} + \gamma \, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \bigr] $$

In this formulation, the value function appears as an unknown variable inside an equation involving an expectation. Estimating the value function therefore amounts to solving this expectation-based fixed-point equation, typically via stochastic approximation, rather than estimating the expectation of a random variable like MC.

TD(0) Algo: Solving the State-Value Bellman Equation¶

The Bellman expectation equation of state-value function can be written as an operator form $$ V^\pi = T^\pi V^\pi $$ where $T^\pi$ is the Bellman operator. The solution is the equilibrium of this ODE system $$ \dot V = T^\pi V - V $$ because based on the definition, we have $T^\pi V^\pi - V^\pi = 0$ at the equilibrium.

Recall that to solve for the equilibrium of the ODE system, we can use Euler discretization, a.k.a. fixed-point iteration. Since the expectation in $T^\pi$ is impossible to compute, we use stochastic approximation (also called noisy fixed-point iteration): at time $t$, use a single unbiased sample built from the transition $(S_t,A_t,R_{t+1},S_{t+1})$ in the update / iteration $$ \boxed{ V(S_t) \leftarrow V(S_t) + \alpha_t \Bigl(\widehat{T^\pi V}(S_t) - V(S_t) \Bigr) } $$ where $$ \widehat{T^\pi V}(S_t) := R_{t+1} + \gamma V(S_{t+1}), \qquad \mathbb{E}[\widehat{T^\pi V}(S_t)\mid V, S_t]= (T^\pi V)(S_t). $$

Only the visited state $S_t$ is updated; all other states remain unchanged. The temporal-difference (TD) error is defined as $$ \delta_t := R_{t+1} + \gamma V(S_{t+1}) - V(S_t) $$ and the update can be written as $$ V(S_t) \leftarrow V(S_t) + \alpha_t \delta_t $$
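As a sanity check, the update above can be run on a tiny example. The two-state chain below is invented for illustration (deterministic transitions, $\gamma = 0.9$); only the update line follows the algorithm, and a small constant step size is used for simplicity even though the theory asks for a decaying $\alpha_t$:

```python
GAMMA = 0.9

def step(s):
    """One transition of a toy Markov reward process (illustrative):
    state 0 -> state 1 with reward 0; state 1 -> terminal state 2 with reward 1."""
    if s == 0:
        return 0.0, 1
    return 1.0, 2

def td0_evaluate(episodes=2000, alpha=0.1):
    V = [0.0, 0.0, 0.0]  # V(terminal) stays 0 and is never updated
    for _ in range(episodes):
        s = 0
        while s != 2:
            r, s_next = step(s)
            # TD(0): V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))
            V[s] += alpha * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

V = td0_evaluate()
# True values for this chain: V(1) = 1, V(0) = gamma * V(1) = 0.9
```

Because the chain is deterministic, the iterates converge to the exact fixed point of $T^\pi$.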

The common step-size requirements for TD(0) in classic convergence theorems (together with conditions such as ergodicity and bounded second moments) are the same as the step-size conditions in stochastic approximation. $$ \sum_k \alpha_k=\infty,\qquad \sum_k \alpha_k^2<\infty $$

[Where do $R_{t+1}$ and $V(S_{t+1})$ come from?]

At time $t$, the agent is in state $S_t$. It selects an action $A_t \sim \pi(\cdot \mid S_t)$ and interacts with the environment once.

As a result of this single transition, the environment returns: $$ (S_t, A_t) \longrightarrow (R_{t+1}, S_{t+1}). $$

  • $R_{t+1}$ is the immediate reward emitted by the environment after taking action $A_t$ in state $S_t$.
  • $S_{t+1}$ is the next state produced by the environment.
  • $V(S_{t+1})$ is simply the agent’s current estimate of the value of that next state, read directly from the value table (or function approximator).

$R_{t+1} + \gamma V(S_{t+1})$ is also called the TD target. It is the one-step sample of the Bellman expectation $(T^\pi V)(S_t)$, formed using one real observed reward and one bootstrapped value prediction.

n-step TD: The General Form of TD(0)¶

From the discussion above, we see that TD learning is directly solving Bellman expectation equation. TD(0) corresponds to solving the one-step Bellman expectation equation, while the Bellman equation can be expanded over $n$ steps: $$ \begin{align} v_\pi(s) & = \mathbb{E}_\pi \left[\sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n v_\pi(S_{t+n}) \mid S_t = s \right] \nonumber\newline & = \mathbb{E}_\pi \left[ R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n v_\pi(S_{t+n}) \mid S_t = s \right] \nonumber \end{align} $$ Solving the $n$-step Bellman expectation equation gives out the $n$-step TD algorithm, so TD(0) is a special case of $n$-step TD.

As before, the solution of the $n$-step Bellman expectation equation can be viewed as the equilibrium of an ODE: $$ \dot V = T_n^\pi V - V $$ where $T_n^\pi V(s) = \mathbb{E}_\pi \left[\sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n}) \mid S_t = s \right]$. We solve this new fixed-point equation using the same stochastic approximation principle as TD(0), but with a different operator $T_n^\pi$. At time $t$, we form a single-sample, unbiased estimate of $(T_n^\pi V)(S_t)$ using one trajectory fragment of length $n$: $$ \widehat{T_n^\pi V}(S_t) := \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n}) $$ The update is then $$ \boxed{ V(S_t) \leftarrow V(S_t) + \alpha_t\bigl(\widehat{T_n^\pi V}(S_t) - V(S_t)\bigr) } $$

The step-size choice is the same as the step-size conditions in stochastic approximation (the same as TD(0) as well). $$ \sum_k \alpha_k=\infty,\qquad \sum_k \alpha_k^2<\infty $$

Notice that TD(0) can update immediately after one transition, while $n$-step TD must wait $n$ steps to collect enough data before it can form the target and perform the update.

[Where do the rewards and bootstrapped term in n-step TD come from?]

At time $t$, the agent is in state $S_t$. It selects actions according to the policy $A_k \sim \pi(\cdot \mid S_k)$ and interacts with the environment for $n$ consecutive steps. As a result of this short trajectory segment, the environment produces: $$ (S_t, A_t) \longrightarrow (R_{t+1}, S_{t+1}) \longrightarrow (R_{t+2}, S_{t+2}) \longrightarrow \cdots \longrightarrow (R_{t+n}, S_{t+n}) $$

From this single length-$n$ rollout, we obtain:

  • $R_{t+1}, \dots, R_{t+n}$: the sequence of real rewards emitted by the environment.
  • $S_{t+n}$: the state reached after $n$ transitions.
  • $V(S_{t+n})$: the agent’s current estimate of the value of that state, read from the value function.

These quantities are combined to form the n-step TD target $\sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$. This target is a single-sample estimate of the $n$-step Bellman expectation $(T_n^\pi V)(S_t)$. It mixes: $n$ steps of real experience (the rewards), followed by one bootstrapped value prediction at step $n$.
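The target construction can be sketched in a few lines; the rewards and bootstrap value in the example are made up for illustration:

```python
GAMMA = 0.9

def n_step_target(rewards, v_boot, gamma=GAMMA):
    """n-step TD target: sum_{k=0}^{n-1} gamma^k R_{t+k+1} + gamma^n V(S_{t+n}).
    `rewards` is the observed list [R_{t+1}, ..., R_{t+n}];
    `v_boot` is the current estimate V(S_{t+n})."""
    n = len(rewards)
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**n * v_boot

# Example: 3-step fragment with rewards [1, 0, 2] and bootstrap value 5
target = n_step_target([1.0, 0.0, 2.0], v_boot=5.0)
# = 1 + 0.9*0 + 0.81*2 + 0.729*5 = 6.265
```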

SARSA: Solving the Action-Value Bellman Equation¶

This part has the same structure & template as TD(0), because the only difference between them is solving the Bellman expectation equation for the state-value function vs. the Bellman expectation equation for the action-value function.

The Bellman expectation equation for the action-value function under policy $\pi$ can be written in operator form $$ Q^\pi = T^\pi Q^\pi, $$ where the Bellman operator $T^\pi$ is defined by $$ (T^\pi Q)(s,a) := \mathbb{E}_\pi \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t=s, A_t=a \right] $$

As before, the solution is the equilibrium of the ODE system $$ \dot Q = T^\pi Q - Q, $$ since at equilibrium $T^\pi Q^\pi - Q^\pi = 0$. Applying Euler discretization (fixed-point iteration) to the ODE gives $$ Q \leftarrow Q + \alpha \bigl(T^\pi Q - Q\bigr) $$ Because the expectation in $T^\pi$ is not available, we use stochastic approximation. At time $t$, use a single unbiased sample from the on-policy transition $$ (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), $$ where $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$. This is exactly why this algorithm is called SARSA: a single update uses the random variables $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$. The original definition of SARSA is nothing else but using the TD method (stochastic approximation) to solve the action-value Bellman expectation equation.

The update becomes $$ \boxed{ Q(S_t, A_t) \leftarrow Q(S_t, A_t) +\alpha_t \Bigl(\widehat{T^\pi Q}(S_t, A_t) - Q(S_t, A_t)\Bigr) } $$ with $$ \widehat{T^\pi Q}(S_t, A_t) := R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}), \quad \mathbb{E}\left[\widehat{T^\pi Q}(S_t, A_t)\mid Q, S_t, A_t \right] = (T^\pi Q)(S_t, A_t) $$ Only the visited state–action pair $(S_t, A_t)$ is updated; all other pairs remain unchanged.

The temporal-difference error is $$ \delta_t := R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t). $$ The update can be written compactly as $$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \delta_t. $$
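A minimal sketch of one SARSA update, assuming a table `Q` keyed by (state, action) pairs and illustrative numbers for the on-policy transition:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA update on a table Q indexed as Q[(state, action)]."""
    td_target = r + gamma * Q[(s_next, a_next)]   # sampled (T^pi Q)(S_t, A_t)
    td_error = td_target - Q[(s, a)]              # delta_t
    Q[(s, a)] += alpha * td_error
    return td_error

# One transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) = (0, 0, 1.0, 1, 1)
Q = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
delta = sarsa_update(Q, 0, 0, 1.0, 1, 1)
# td_target = 1 + 0.9 * 2 = 2.8, delta = 2.8, Q[(0, 0)] = 0.28
```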

Under standard assumptions (ergodicity of the induced Markov chain, bounded second moments, etc.), the step-size requirements are the same as in stochastic approximation: $$ \sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty. $$

[Where Do $R_{t+1}$ and $Q(S_{t+1}, A_{t+1})$ Come From?]

At time $t$, the agent is in state $S_t$ and takes action $A_t \sim \pi(\cdot \mid S_t)$. After one interaction with the environment, $$ (S_t, A_t) \longrightarrow (R_{t+1}, S_{t+1}). $$ Then the agent samples the next action $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$

  • $R_{t+1}$ is the immediate reward from the environment.
  • $Q(S_{t+1}, A_{t+1})$ is the agent’s current estimate of the value of the next state–action pair.

The quantity $$ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) $$ is the SARSA TD target: a one-step Monte Carlo sample of the Bellman expectation operator $T^\pi Q$, combining one observed reward and one bootstrapped action-value estimate.

Expected SARSA to Reduce Variance¶

The only difference between SARSA and Expected SARSA lies in the source of the TD target.

  • SARSA uses a single sampled next action $A_{t+1} \sim \pi(\cdot\mid S_{t+1})$ and the target $$ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) $$

  • Expected SARSA replaces the sampled term $Q(S_{t+1}, A_{t+1})$ with its expectation under the policy: $$ \sum_a \pi(a\mid S_{t+1}) Q(S_{t+1}, a). $$

No other part of the algorithm is changed.

Define the expected Bellman sample as $$ \widehat{T^\pi_{\text{exp}} Q}(S_t, A_t) := R_{t+1} + \gamma \sum_a \pi(a\mid S_{t+1}) Q(S_{t+1}, a) = R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1})] $$ It is easy to see that the sample is unbiased: $$ \begin{align} \mathbb{E}\left[\widehat{T^\pi_{\mathrm{exp}} Q}(S_t, A_t)\middle| S_t, A_t, Q \right] &= \mathbb{E}\left[R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1})] \middle| S_t, A_t, Q \right] \nonumber\newline &= \mathbb{E} \left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \middle| S_t, A_t, Q \right] \nonumber\newline &= (T^\pi Q)(S_t, A_t) \nonumber \end{align} $$

The Expected SARSA update is $$ \boxed{ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \Bigl(\widehat{T^\pi_{\text{exp}} Q}(S_t, A_t) - Q(S_t, A_t)\Bigr) } $$ and the TD error changes accordingly.

Why Does Expected SARSA Reduce Variance? In SARSA, the TD target contains randomness from two sources: (1) the environment transition, and (2) the sampled next action $A_{t+1}$. Expected SARSA eliminates the second source of randomness by replacing $Q(S_{t+1}, A_{t+1})$ with $\mathbb{E}_\pi[Q(S_{t+1}, A_{t+1})]$.
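To see the variance reduction concretely, here is a small simulation with illustrative numbers; the reward and next state are held fixed so that only the action randomness remains:

```python
import random

random.seed(0)
GAMMA = 0.9
Q_next = [0.0, 10.0]   # current estimates Q(S_{t+1}, a) (illustrative)
pi_next = [0.5, 0.5]   # pi(a | S_{t+1})
r = 1.0                # observed reward, held fixed

def sarsa_target():
    # sample A_{t+1} ~ pi(. | S_{t+1}), then bootstrap on the sampled action
    a = 0 if random.random() < pi_next[0] else 1
    return r + GAMMA * Q_next[a]

def expected_sarsa_target():
    # average out the action randomness analytically
    return r + GAMMA * sum(p * q for p, q in zip(pi_next, Q_next))

samples = [sarsa_target() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
# Both targets have mean 1 + 0.9 * 5 = 5.5, but the SARSA target has
# variance ~ 0.81 * 25 = 20.25 while the Expected SARSA target is deterministic here.
```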

[Where Do $R_{t+1}$ and $\sum_a \pi(a\mid S_{t+1}) Q(S_{t+1}, a)$ Come From?]

At time $t$, the agent is in state $S_t$ and takes action $A_t \sim \pi(\cdot\mid S_t)$. After one interaction with the environment, $$ (S_t, A_t) \longrightarrow (R_{t+1}, S_{t+1}) $$

  • $R_{t+1}$ is the immediate reward observed from the environment.
  • $Q(S_{t+1}, a)$ is the agent’s current estimate of the value of each possible next action.
  • $\pi(a\mid S_{t+1})$ is the known policy used to compute the expectation.

The Expected SARSA TD target is $R_{t+1} + \gamma \sum_a \pi(a\mid S_{t+1}) Q(S_{t+1}, a)$. It is a combination of the observed reward and a one-step Monte Carlo sample of the Bellman expectation operator in which the action randomness has been analytically averaged out.

On-Policy & Off-Policy Policy Evaluation¶

Typically, policy evaluation is performed on-policy, since the goal is to evaluate a specific policy. It might seem counterintuitive to use samples generated by a different policy: if you're collecting data under another policy, aren't you then evaluating the new one instead? This intuition is mostly correct, which is why most policy evaluation in practice is on-policy. However, it's important to note that off-policy policy evaluation is also possible—it arises primarily in offline reinforcement learning settings.

A typical off-policy policy evaluation scenario is: you already have a batch of historical data $\{(S_t, A_t, R_{t+1}, S_{t+1})\}_t$, generated by some old policy $\pi'$. Now you have designed a new policy $\pi$, but you do not want to / cannot / are not allowed to interact with the environment again. The question is: can you use this data generated by $\pi'$ to estimate $V^\pi$? By definition, this is exactly off-policy policy evaluation.

Actually, TD(0) and SARSA policy evaluation are naturally on-policy. Their TD targets are unbiased only under on-policy sampling:

  • The TD(0) target is $R_{t+1} + \gamma V(S_{t+1})$, and the unbiasedness requires $\mathbb{E}\left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t=s \right] = (T^\pi V)(s)$, which holds only if the trajectory is generated under policy $\pi$, i.e., the transition distribution is $P^\pi$.
  • The SARSA TD target is $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}), A_{t+1} \sim \pi(\cdot \mid S_{t+1})$, and the unbiasedness requires $\mathbb{E}\left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \mid S_t, A_t \right] = (T^\pi Q)(S_t, A_t)$, which holds only when $A_{t+1}$ is sampled from the same policy $\pi$
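One standard fix, sketched below for illustration (it is not derived in this note), is to reweight the TD error by the per-step importance ratio $\rho_t = \pi(A_t \mid S_t) / \mu(A_t \mid S_t)$, which restores unbiasedness with respect to $T^\pi$ at the cost of extra variance:

```python
def off_policy_td0_update(V, s, r, s_next, rho, alpha=0.1, gamma=0.9):
    """TD(0) update reweighted by rho = pi(A_t|S_t) / mu(A_t|S_t), so that data
    collected under a behavior policy mu can still be used to estimate V^pi.
    This is a sketch of one common approach, not a derivation."""
    V[s] += alpha * rho * (r + gamma * V[s_next] - V[s])

V = [0.0, 2.0, 0.0]
# transition sampled under mu; suppose pi(A_t|S_t) = 0.8 and mu(A_t|S_t) = 0.4
off_policy_td0_update(V, s=0, r=1.0, s_next=1, rho=0.8 / 0.4)
# V[0] = 0.1 * 2 * (1 + 0.9*2 - 0) = 0.56
```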

--- TD Control¶

Basic On-Policy TD Control: SARSA-style Policy Evaluation + Policy Improvement¶

On-policy TD control alternates between:

  1. Policy evaluation

    Evaluate the current policy $\pi$ using TD learning on the action-value function $Q^\pi$ (this step is SARSA in its original meaning. SARSA by itself does not imply control; SARSA + policy improvement = on-policy TD control).

  2. Policy improvement

    Improve the policy to be greedy (or $\epsilon$-greedy) with respect to the updated $Q$.

Unlike Monte Carlo control, policy evaluation is performed incrementally and online, using one-step bootstrapping instead of full returns.

[Pseudo Code]

Initialize Q(s,a) arbitrarily for all s,a
Initialize π to be ε-greedy w.r.t. Q

repeat for each episode:
    Initialize S
    Choose A ~ π(. | S)

    repeat for each step of episode:
        Take action A
        Observe R, S'
        Choose A' ~ π(. | S')        # on-policy action

        # Policy Evaluation (TD / SARSA update)
        Q(S,A) = Q(S,A) + α * [ R + γ Q(S',A') - Q(S,A) ]

        S = S'
        A = A'

    until S is terminal

    # Policy Improvement (implicit)
    For all s:
        π(s) = ε-greedy w.r.t. Q(s,·)

until convergence
return π*, Q

Notice that we can also switch SARSA $\rightarrow$ Expected SARSA in the policy evaluation step to reduce the variance.

ε-greedy Version¶

Purely greedy control can prematurely commit to suboptimal actions: some state–action pairs may never be visited, so their values are never corrected. ε-greedy injects minimal stochastic exploration, ensuring continued visitation while still exploiting current value estimates.

The core structure of ε-greedy Policy TD Control is the same as standard on-policy TD (SARSA) Control. The only difference is that, during the policy improvement step, TD Control replaces the deterministic greedy policy with an ε-greedy (stochastic) policy, ensuring persistent exploration while learning online via bootstrapping.

Formally, the ε-greedy policy is defined as $$ \pi(a|s)= \begin{cases} 1-\varepsilon + \varepsilon/|A(s)| & a=a^* \\ \varepsilon/|A(s)| & a\neq a^* \end{cases} \quad\text{where}\quad a^* \in \arg\max_a Q(s,a) $$

To convert the standard on-policy TD control to the ε-greedy version, just replace the greedy policy improvement step

π(s) = argmax_a Q(s,a)

with the ε-greedy policy

a* = argmax_a Q(s,a)
for a in A(s):
    π(a|s) = ε/|A(s)| if a != a* else 1 - ε + ε/|A(s)|

All other lines remain unchanged.

Why is ε-greedy still on-policy? Because the behavior policy equals the target policy; both are ε-greedy. We use this ε-greedy policy to sample data, and we evaluate/update the value $Q^\pi$ of the same policy, so it remains on-policy. Since it is on-policy learning, the algorithm converges to an ε-greedy policy, not to the oracle greedy policy, unless ε is driven to zero appropriately. Notice that we can make ε-greedy control off-policy via importance sampling, just like off-policy MC control. But we usually don't do that, because importance sampling combined with bootstrapping is even more unstable. Instead, we use Q-learning as the go-to off-policy TD control, which avoids importance sampling altogether.

Basic Off-Policy TD Control: Q-Learning¶

Q-Learning = use stochastic approximation to solve the Bellman optimality equation.

This part has the same structure & template as SARSA & TD(0), because the only difference between them is solving the Bellman optimality equation vs. the Bellman expectation equation.

The Bellman optimality equation for the action-value function can be written in operator form $$ Q^* = T^* Q^*, $$ where the Bellman optimality operator $T^*$ is defined by $$ (T^* Q)(s,a) := \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') \mid S_t=s, A_t=a \right] $$

As before, the solution is the equilibrium of the ODE system $$ \dot Q = T^* Q - Q, $$ since at equilibrium $T^* Q^* - Q^* = 0$. Applying Euler discretization (fixed-point iteration) to the ODE gives $$ Q \leftarrow Q + \alpha \bigl(T^* Q - Q\bigr). $$

Because the expectation in $T^*$ is not available, we use stochastic approximation. At time $t$, use a single sample from the transition $$ (S_t, A_t, R_{t+1}, S_{t+1}), $$ where $A_t$ is generated by an arbitrary behavior policy $\mu$. Notice: this is the key difference between solving the expectation equation (SARSA) and solving the optimality equation (Q-learning). In SARSA, we have $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$; that is, SARSA samples $A_{t+1}$. Here we do not sample $A_{t+1}$. Instead, we will later evaluate the current action-value estimates for all actions available at $S_{t+1}$ and select the largest one to get $\max_{a'} Q(S_{t+1}, a')$.

The update becomes $$ \boxed{ Q(S_t, A_t) \leftarrow Q(S_t, A_t) +\alpha_t \Bigl(\widehat{T^* Q}(S_t, A_t) - Q(S_t, A_t)\Bigr) } $$ with $$ \widehat{T^* Q}(S_t, A_t) := R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'), \quad \mathbb{E}\left[\widehat{T^* Q}(S_t, A_t)\mid Q, S_t, A_t\right] = (T^* Q)(S_t, A_t). $$

Only the visited state–action pair $(S_t, A_t)$ is updated; all other pairs remain unchanged.

The temporal-difference error is $$ \delta_t := R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t). $$ The update can be written compactly as $$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha_t \delta_t. $$

Under standard assumptions (sufficient exploration of all state–action pairs, bounded second moments, etc.), the step-size requirements are the same as in stochastic approximation: $$ \sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty $$
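Putting the pieces together, a self-contained tabular Q-learning loop might look like the sketch below. The chain environment and all constants are invented for illustration; only the ε-greedy behavior policy and the greedy TD target follow the algorithm:

```python
import random

# Toy deterministic chain: states 0..3, actions 0 = left, 1 = right.
# Reaching state 3 yields reward 1 and ends the episode.
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
N_STATES, ACTIONS = 4, (0, 1)

def env_step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next, s_next == N_STATES - 1

def q_learning(episodes=500, seed=0):
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy behavior policy (random tie-breaking)
            if random.random() < EPS:
                a = random.choice(ACTIONS)
            else:
                best = max(Q[s])
                a = random.choice([x for x in ACTIONS if Q[s][x] == best])
            r, s_next, done = env_step(s, a)
            # greedy TD target, regardless of how A_t was chosen
            target = r + GAMMA * max(Q[s_next])
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
# Optimal values here: Q*(2, right) = 1, Q*(1, right) = 0.9, Q*(0, right) = 0.81
```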

[Where Do $R_{t+1}$ and $\max_{a'} Q(S_{t+1}, a')$ Come From?]

At time $t$, the agent observes a transition $$ (S_t, A_t) \longrightarrow (R_{t+1}, S_{t+1}), $$ generated by an arbitrary behavior policy $\mu$.

  • $R_{t+1}$ is the observed immediate reward.

  • Given the next state $S_{t+1}$, the term $\max_{a'} Q(S_{t+1}, a')$ is computed deterministically by evaluating the current action-value estimates for all actions available at $S_{t+1}$ and selecting the largest one. No next action $A_{t+1}$ is sampled. In implementation, this corresponds to

    target = r + gamma * max(Q[s_next, a] for a in actions)
    Q[s, a] += alpha * (target - Q[s, a])
    

    The resulting TD target $$ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') $$ combines one observed reward with a greedy bootstrap at the next state.

Why Q-learning Is Off-Policy¶

Q-learning is an off-policy algorithm because the policy used to generate data is different from the policy implicitly assumed in the update target. The behavior policy controls which transitions are observed; the update target corresponds to a greedy policy. These two policies are generally not the same.

  • Behavior Policy (Data Collection)

    In practice, Q-learning typically uses an ε-greedy behavior policy with respect to the current Q-function

    • with probability $\varepsilon$: choose a random action,
    • with probability $1-\varepsilon$: choose $\arg\max_a Q(s,a)$.

    In code:

    # behavior policy
    if random() < epsilon:
        a = random_action()
    else:
        a = argmax(Q[s, :])
    

    This policy determines how the state–action pair $(S_t, A_t)$ is sampled.

  • Target Policy (Learning Target)

    The Q-learning update uses the TD target $$ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'), $$ which always assumes a greedy action choice at the next state, regardless of how the current action was selected.

    In code:

    target = r + gamma * max(Q[s_next, :])
    Q[s, a] += alpha * (target - Q[s, a])
    

    No action $A_{t+1}$ is sampled, and the behavior policy does not appear in the update.

Double Q-Learning: Bias-Reduced Off-Policy TD Control¶

Double Q-Learning = decouple action selection and action evaluation in Q-learning by maintaining two Q-functions.

Action Selection vs. Action Evaluation in the Q-learning Target¶

Consider the Q-learning TD target $$ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') $$ This expression actually contains two conceptually distinct steps:

  • Step 1: Action Selection (argmax) $a^* := \arg\max_{a'} Q(S_{t+1}, a')$. The algorithm selects a greedy action at the next state according to the current Q-function.
  • Step 2: Action Evaluation (value lookup) $Q(S_{t+1}, a^*)$. The algorithm evaluates the selected action by looking up its estimated value.

Equivalently, the TD target can be rewritten as $$ R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\right) $$ which makes explicit that: the same $Q$ is used once inside the argmax (selection), and again to evaluate the selected action.

Double Q-Learning Algorithm¶

In double Q-learning, we maintain two action value functions $$ Q^{(1)}(s,a), \quad Q^{(2)}(s,a). $$

At each time step, observe a transition $$ (S_t, A_t, R_{t+1}, S_{t+1}), $$ where $A_t$ is generated by an arbitrary behavior policy $\mu$ (for example, ε-greedy w.r.t. $Q^{(1)}+Q^{(2)}$).

With probability $1/2$, update $Q^{(1)}$; otherwise update $Q^{(2)}$. Notice that in both cases only the visited state–action pair $(S_t, A_t)$ is updated.

  • Case 1: Update $Q^{(1)}$

    Use $Q^{(1)}$ for action selection and $Q^{(2)}$ for evaluation $$ \begin{align} a^* &:= \arg\max_{a'} Q^{(1)}(S_{t+1}, a'), \nonumber\newline \delta_t^{(1)} &:= R_{t+1} + \gamma Q^{(2)}(S_{t+1}, a^*) - Q^{(1)}(S_t, A_t), \nonumber\newline Q^{(1)}(S_t, A_t) &\leftarrow Q^{(1)}(S_t, A_t) + \alpha_t \, \delta_t^{(1)} \nonumber \end{align} $$

  • Case 2: Update $Q^{(2)}$

    Symmetrically, $$ \begin{align} a^* &:= \arg\max_{a'} Q^{(2)}(S_{t+1}, a'), \nonumber\newline \delta_t^{(2)} &:= R_{t+1} + \gamma Q^{(1)}(S_{t+1}, a^*) - Q^{(2)}(S_t, A_t), \nonumber\newline Q^{(2)}(S_t, A_t) &\leftarrow Q^{(2)}(S_t, A_t) + \alpha_t \, \delta_t^{(2)} \nonumber \end{align} $$

Just like Q-learning, double Q-learning is off-policy and has the same step-size conditions. Double Q-learning converges to the optimal action-value function $Q^*$. In practice, the final estimate is often taken as $$ Q(s,a) = Q^{(1)}(s,a) + Q^{(2)}(s,a) \quad \text{or} \quad \frac{1}{2}\big(Q^{(1)} + Q^{(2)}\big) $$
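A sketch of one double Q-learning step, assuming two small tables and illustrative numbers for the transition:

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One double Q-learning step: a fair coin picks which table to update;
    the updated table selects the argmax action, the other table evaluates it."""
    if random.random() < 0.5:
        sel, ev = Q1, Q2   # Case 1: select with Q1, evaluate with Q2, update Q1
    else:
        sel, ev = Q2, Q1   # Case 2: symmetric
    a_star = max(actions, key=lambda x: sel[s_next][x])
    target = r + gamma * ev[s_next][a_star]
    sel[s][a] += alpha * (target - sel[s][a])

# Tiny illustrative tables: Q[state] = [value of action 0, value of action 1]
Q1 = {0: [0.0, 0.0], 1: [0.5, 0.0]}
Q2 = {0: [0.0, 0.0], 1: [0.0, 1.0]}
double_q_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1, actions=(0, 1))
# In either branch the cross-evaluated value is 0, so exactly one of
# Q1[0][0], Q2[0][0] becomes 0.1 * (1 + 0.9*0 - 0) = 0.1
```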

Why Need Double Q-Learning? -- Reduce Maximization Bias¶

Here we use a minimal example to explain the overestimation in Q-learning and how it is mitigated by double Q-learning. Basically, the max operator "picks out" the entry with the largest noise, so even if each entry's noise has zero mean, the maximum has a positive mean.

Consider a single state with two actions $a_1,a_2$. True values $Q^*(a_1)=Q^*(a_2)=0$; Estimates are pure noise $Q(a_i)=\varepsilon_i, \varepsilon_1,\varepsilon_2 \stackrel{i.i.d.}{\sim}\mathcal N(0,1)$.

  • Why Q-learning Overestimates

    Q-learning uses the bootstrap target $$ \max_i Q(a_i)=\max(\varepsilon_1,\varepsilon_2) $$ Although each estimate is unbiased ($\mathbb{E}[\varepsilon_i]=0$), the max operator introduces selection bias. Using the identity $$ \max(x,y)=\tfrac{x+y}{2}+\tfrac{|x-y|}{2}, $$ we obtain $$ \mathbb{E}[\max(\varepsilon_1,\varepsilon_2)] =\tfrac{1}{2}\mathbb{E}[|\varepsilon_1-\varepsilon_2|]. $$ Since $\varepsilon_1-\varepsilon_2\sim\mathcal N(0,2)$, we have $$ \mathbb{E}[|\varepsilon_1-\varepsilon_2|] =\sqrt{2}\sqrt{\tfrac{2}{\pi}} =\tfrac{2}{\sqrt{\pi}}, $$ hence $$ \mathbb{E}[\max(\varepsilon_1,\varepsilon_2)] =\tfrac{1}{\sqrt{\pi}}>0 $$

    Conclusion: Even in a pure-noise system with zero true value, Q-learning produces a strictly positive expected target. The bias is caused by using the same noisy estimates for both action selection and evaluation.

  • Why Double Q-Learning Removes This Bias

    Double Q-learning maintains two independent estimators: $$ Q^A(a_i)=\varepsilon_i^A,\qquad Q^B(a_i)=\varepsilon_i^B, $$ with $\varepsilon^A \perp \varepsilon^B$ and zero mean.

    • Selection: $a^*=\arg\max_i Q^A(a_i)$
    • Evaluation: $Q^B(a^*)=\varepsilon^B_{a^*}$

    Since the selection event depends only on $Q^A$, and $Q^B$ is independent, $$ \mathbb{E}[\varepsilon^B_{a^*}\mid Q^A]=0 \quad\Rightarrow\quad \mathbb{E}[\varepsilon^B_{a^*}]=0. $$
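Both claims can be checked numerically. The simulation below reproduces the pure-noise example: the single-table maximum is biased toward $1/\sqrt{\pi} \approx 0.564$, while the decoupled estimator is unbiased:

```python
import math
import random

random.seed(1)
N = 200_000
single_sum, double_sum = 0.0, 0.0
for _ in range(N):
    # two actions with true value 0; estimates are pure N(0, 1) noise
    eA = [random.gauss(0, 1), random.gauss(0, 1)]  # estimator A
    eB = [random.gauss(0, 1), random.gauss(0, 1)]  # independent estimator B
    # Q-learning style: the same noisy table both selects and evaluates
    single_sum += max(eA)
    # double Q style: A selects the action, B evaluates it
    a_star = 0 if eA[0] >= eA[1] else 1
    double_sum += eB[a_star]

single_bias = single_sum / N   # approx 1/sqrt(pi)
double_bias = double_sum / N   # approx 0
```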

Many lecture notes explain overestimation in Q-learning by appealing to the convexity of the max operator and Jensen’s inequality, writing $$ \mathbb{E}\left[\max_{a'} Q(S_{t+1},a')\right] \ge \max_{a'} Q^*(S_{t+1},a') $$ This inequality, however, is not fully rigorous as stated. Stochastic approximation theory establishes almost sure convergence $Q_t \to Q^*$, but does not in general guarantee convergence of expectations $\mathbb{E}[Q_t]\to Q^*$ without additional uniform integrability or boundedness assumptions. As a result, directly replacing $\mathbb{E}[Q_t]$ with $Q^*$ inside Jensen-type arguments can be mathematically unjustified. A fully rigorous analysis of overestimation should therefore be formulated in a local, finite-time manner, comparing $\mathbb{E}[\max_a Q_t]$ with $\max_a \mathbb{E}[Q_t]$, rather than relying on asymptotic expectation convergence.

TD vs. MC: Bias-Variance Tradeoff¶

Monte Carlo (MC) and Temporal-Difference (TD) learning estimate the same quantity, the state value function $$ V^\pi(s) = \mathbb{E}^\pi[G_t \mid s_t = s], \quad G_t = \sum_{k=0}^\infty \gamma^k r_{t+k}. $$ They differ only in how the learning target approximates this expectation.

  • Bias

    In Monte Carlo methods, the learning target is the full return $G_t$. By construction, $$ \mathbb{E}[G_t \mid s_t = s] = V^\pi(s), $$ so the MC target is unbiased.

    In TD learning, the target is bootstrapped: $$ Y_t^{\mathrm{TD}} = r_t + \gamma V(s_{t+1}). $$ Its expectation satisfies $$ \mathbb{E}[Y_t^{\mathrm{TD}} \mid s_t = s] = \mathbb{E}[r_t + \gamma V(s_{t+1}) \mid s_t = s], $$ which equals $V^\pi(s)$ only if $V = V^\pi$. Thus, TD introduces bias whenever the value estimate is imperfect. This bias is not intrinsic and vanishes as the value function converges.

  • Variance

    The variance difference arises from how much future uncertainty is retained in the target. The Monte Carlo return can be decomposed as $$ G_t = r_t + \gamma V^\pi(s_{t+1}) + \gamma \varepsilon_{t+1}, $$ where $$ \varepsilon_{t+1} = \sum_{k=1}^\infty \gamma^{k-1} \big(r_{t+k} - \mathbb{E}[r_{t+k} \mid s_{t+1}]\big) $$ is a zero-mean random variable capturing unresolved future randomness.

    Accordingly, $$ \mathrm{Var}(G_t \mid s_t) = \mathrm{Var}(r_t + \gamma V^\pi(s_{t+1})) + \gamma^2 \mathrm{Var}(\varepsilon_{t+1}) + 2\,\mathrm{Cov}(\cdot), $$ showing that MC variance includes an explicit contribution from future stochasticity.

    In contrast, assuming $V \approx V^\pi$, $$ \mathrm{Var}(Y_t^{\mathrm{TD}} \mid s_t) = \mathrm{Var}(r_t + \gamma V(s_{t+1})) \approx \mathrm{Var}(r_t + \gamma V^\pi(s_{t+1})), $$ since TD replaces the random future return with a value estimate and removes the $\gamma \varepsilon_{t+1}$ term.

This comparison is structural rather than pointwise: there is no universal inequality guaranteeing that TD variance is always smaller than MC variance. In practice it hinges on how $\mathrm{Var}(V(s_{t+1}))$ compares with $\mathrm{Var}(\varepsilon_{t+1})$. TD only reduces variance if the bootstrapped value estimate is more stable than the Monte Carlo return. For example, consider a two-step MDP: $$ s_0 \rightarrow s_1 \rightarrow \text{terminal}, \quad \gamma = 1, $$ with rewards $$ r_0 = 0, \qquad r_1 \sim \mathcal{N}(0,\sigma^2). $$ The true value function is $$ V^\pi(s_1) = \mathbb{E}[r_1] = 0. $$

At state $s_0$, the Monte Carlo target is $$ Y^{\mathrm{MC}} = r_0 + r_1 = r_1, $$ with $$ \mathbb{E}[Y^{\mathrm{MC}}] = 0, \qquad \mathrm{Var}(Y^{\mathrm{MC}}) = \sigma^2. $$

The TD(0) target is $$ Y^{\mathrm{TD}} = r_0 + V(s_1). $$ If the value estimate is highly unstable, e.g. $$ V(s_1) \sim \mathcal{N}(0,\tau^2), \qquad \tau^2 \gg \sigma^2, $$ then $$ \mathbb{E}[Y^{\mathrm{TD}}] = 0, \qquad \mathrm{Var}(Y^{\mathrm{TD}}) = \tau^2, $$ which is strictly larger than the Monte Carlo variance.
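The variance blow-up in this example is easy to verify by simulation, using the same $\gamma = 1$ and $\sigma^2 = 1$, with an exaggerated $\tau^2 = 25$ for the unstable estimate:

```python
import random

random.seed(0)
SIGMA, TAU, N = 1.0, 5.0, 100_000   # sigma^2 = 1, tau^2 = 25

mc_targets, td_targets = [], []
for _ in range(N):
    r1 = random.gauss(0, SIGMA)      # stochastic reward of the second step
    mc_targets.append(0.0 + r1)      # MC target: r0 + r1 = r1
    v_s1 = random.gauss(0, TAU)      # highly unstable estimate V(s1)
    td_targets.append(0.0 + v_s1)    # TD target: r0 + V(s1), gamma = 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

var_mc = variance(mc_targets)   # approx sigma^2 = 1
var_td = variance(td_targets)   # approx tau^2 = 25, far larger than var_mc
```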

This is why practical TD methods rely on stabilization techniques such as target networks, replay buffers, and slow updates—to ensure that bootstrapping reduces variance rather than amplifying it.