Monte Carlo Learning

In settings where rewards and transition dynamics are known, value functions can be estimated via DP methods like policy iteration or value iteration.

In many settings, rewards and transition dynamics are unknown, and value functions must be estimated directly from interaction experience. Monte Carlo methods address this by sampling complete trajectories and replacing probability-based expectations with empirical expectations, obtained by averaging sampled returns.

Basically, Monte Carlo refers to a general idea: using random sampling to approximate expectations. That is, Monte Carlo = Monte Carlo expectation estimation. In the context of reinforcement learning, MC learning = Monte Carlo estimation of value functions (which are expectations of returns) using sampled returns.

In real-world production, pure Monte Carlo (MC) Learning is rarely used on its own. The main reasons are:

  • MC methods have low sample efficiency.
  • MC methods require waiting until the end of an entire episode before updating, making learning slow and unsuitable for long-horizon or continuing tasks.
  • The high variance of returns leads to instability during training (TD learning reduces this variance by introducing extra bias).

As a result, MC Learning is mainly used for teaching, theoretical analysis, or as a component for approximate return estimation within Actor-Critic algorithms, rather than as a standalone method in engineering systems.

Takeaways:

  1. What are Monte Carlo Methods? And what is Monte Carlo Learning?
  2. Why must the task be episodic, with a clear termination, for MC learning to be applicable?
  3. When should we use a constant step size $\alpha$?
  4. Practically, which is used more often: first-visit MC or every-visit MC?
  5. Can we add exploration to the policy evaluation step?
  6. What is the difference between on-policy and off-policy learning?
  7. Is ε-soft policy MC Control an on-policy or off-policy algorithm?
  8. Which one usually has lower variance: the ordinary IS estimator or weighted IS estimator? Why?

--- Prerequisites of MC Learning¶

Notations¶

Return
$$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T $$

State Value
$$ v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] $$

Action Value
$$ q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a] $$

Assumptions¶

We can see from the notations that the value functions are expected returns, and DP computes these expectations from the reward function and transition dynamics. MC learning instead uses the empirical mean return in place of the expected return. Basically,

MC Policy Evaluation ≈ Model-free Policy Evaluation
MC Control ≈ Model-free Policy Iteration
  • A critical assumption of MC learning is that the task is episodic with a clear termination. Why?

    The reason why classical Monte Carlo (MC) methods require the task to be episodic with guaranteed termination is that MC uses the complete return $G_t$ as the learning signal. This return must be explicitly computed after the episode ends, and MC does not use bootstrapping; it does not estimate the future based on current value estimates.

    For a non-terminating game, the return is defined as: $$ G_t^{(\infty)} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$ Mathematically, if $\gamma < 1$, this sum may converge; but for Monte Carlo methods, this is an infinite random sum, and it is impossible to sample a complete $G_t^{(\infty)}$ within finite time.

    If the episode terminates at time $T$, then the return is: $$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T $$ This is a finite sum, a random variable that can be explicitly computed after the episode ends.
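    As a concrete illustration, here is a minimal Python sketch (the function name and input format are illustrative choices, not from the text) that computes the finite returns $G_0, \ldots, G_{T-1}$ backward from the complete reward list $R_1, \ldots, R_T$ of a finished episode; the same backward recursion $G \leftarrow R + \gamma G$ appears in all the pseudocode below.

    def returns_from_rewards(rewards, gamma):
        # rewards[t] holds R_{t+1}; the list must be complete, which is
        # exactly why MC requires the episode to terminate.
        G, out = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G          # G_t = R_{t+1} + γ G_{t+1}
            out.append(G)
        return out[::-1]               # [G_0, G_1, ..., G_{T-1}]

    # e.g. returns_from_rewards([0, 0, 1], gamma=0.9) -> [0.81, 0.9, 1.0]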

--- Monte Carlo Policy Evaluation (Prediction)¶

Goal: Learn $v_\pi(s)$ (or $q_\pi(s,a)$) from episodes of experience under policy $\pi$ $$S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_T \sim \pi$$

Design Principle¶

The core idea of Monte Carlo Policy Evaluation is to estimate the state value function or action value function for a given policy using empirical expectation. Specifically, for each state, we maintain a count of visits $N(s)$ and the accumulated return $S(s)$: $$ N(s) \leftarrow N(s) + 1,\qquad S(s) \leftarrow S(s) + G_t $$ Then, the state value is estimated by the sample mean: $$ V(s) = \frac{S(s)}{N(s)} $$

Engineering Trick: Incremental Mean

This step is key to implementing MC directly in code. Generally, the mean $\mu_k$ of a sequence $x_1, \ldots, x_k$ can be computed incrementally: $$\begin{align} \mu_k &= \frac{1}{k}\sum_{j=1}^{k} x_j \nonumber\\ &= \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right) \nonumber\\ &= \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) \nonumber\\ &= \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right) \nonumber \end{align}$$ Thus we can update $V$ as $$V(s)\leftarrow V(s)+\frac{1}{N(s)}(G_t - V(s))$$ Notice that MC Policy Evaluation is essentially asynchronous + sample-based: each update occurs only for states that are actually visited; updates are triggered by timestep $t$, without any global sweep, so the fundamental index is $t$.
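A minimal Python sketch of this incremental-mean update (the dict-based V and N and the helper name are my own illustrative choices):

from collections import defaultdict

V = defaultdict(float)    # V(s), initialized to 0
N = defaultdict(int)      # visit counts N(s)

def mc_update(s, G):
    # One incremental-mean update: V(s) <- V(s) + (1/N(s)) * (G - V(s))
    N[s] += 1
    V[s] += (G - V[s]) / N[s]

# e.g. after mc_update("s1", 2.0) and mc_update("s1", 4.0), V["s1"] == 3.0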

MC Policy Evaluation For Non-Stationary Problems

The original MC update is essentially calculating the average of all historical returns, where by default each past sample has equal weight: $$ V(s)\leftarrow V(s)+\frac{1}{N(s)}(G_t - V(s)) $$ However, in non-stationary environments, the return distribution $\mathbb{E}[G_t \mid S=s]$ changes over time, so samples from long ago become less reliable and we do not want old data to have the same weight forever. By using a constant step size $\alpha$, we deliberately forget old samples: $$ V(S_t)\leftarrow V(S_t)+\alpha(G_t - V(S_t)) $$ Using a constant step size $\alpha$ for the update is actually performing exponential weighting: each update multiplies the old estimate by $(1-\alpha)$, causing the weight of older samples to decay exponentially over time $$ \begin{align} V_{t+1} & = V_t + \alpha (G_t - V_t) \nonumber \\ & = (1-\alpha) V_t + \alpha G_t \nonumber \\ & = (1-\alpha)\big[(1-\alpha)V_{t-1} + \alpha G_{t-1}\big] + \alpha G_t \nonumber \\ & = (1-\alpha)^2 V_{t-1} + \alpha(1-\alpha) G_{t-1} + \alpha G_t \nonumber \\ & = (1-\alpha)^{t+1} V_0 + \sum_{i=0}^{t} \alpha (1-\alpha)^{t-i} G_i \nonumber \\ \end{align} $$
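The closed form above can be checked numerically; the sketch below uses made-up returns purely for illustration:

import numpy as np

alpha, V0 = 0.1, 0.0
returns = [1.0, 2.0, 0.5, 3.0]          # hypothetical sampled returns G_0..G_3

# Incremental constant-step-size updates
V = V0
for G in returns:
    V += alpha * (G - V)

# Closed form: (1-α)^(t+1) V_0 + Σ_i α (1-α)^(t-i) G_i
t = len(returns) - 1
V_closed = (1 - alpha) ** (t + 1) * V0 + sum(
    alpha * (1 - alpha) ** (t - i) * G for i, G in enumerate(returns))

assert np.isclose(V, V_closed)          # both equal ≈ 0.5799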

Algorithm¶

In terms of real implementation, there are two options:

  • First-Visit MC: Update only the first time a state is visited in an episode
  • Every-Visit MC: Update every time the state is visited in an episode

The only difference between the two is whether $N(s)$ and $S(s)$ are updated only at the first visit to $s$ in an episode, or at every visit. For both methods, as the number of visits increases, by the law of large numbers, $V(s)$ converges to the true $v_\pi(s)$. In real use cases, we usually use every-visit MC, since using more samples usually leads to lower estimation variance and higher data efficiency.

First-Visit MC¶

Initialize V(s) arbitrarily for all s
Initialize N(s) = 0 for all s

Loop forever (for each episode):
    Generate an episode: S0, A0, R1, S1, A1, R2, ... , ST (terminal)

    G = 0

    for t = T-1 down to 0:
        G = γ * G + R[t+1]          # now G == G_t

        if S[t] does not appear in S[0], ..., S[t-1]:     # FIRST visit in this episode

            N(S[t]) = N(S[t]) + 1
            V(S[t]) = V(S[t]) + (1 / N(S[t])) * (G - V(S[t]))
            # same as: V(S[t]) ← (1 - 1/N) V(S[t]) + (1/N) G
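For reference, a runnable Python version of the first-visit algorithm above; the episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) is an assumption of this sketch:

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    # Each episode is [(S0, R1), (S1, R2), ..., (S_{T-1}, R_T)].
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        # Record the earliest index of every state so the backward pass
        # can tell which occurrence is the first visit.
        first_idx = {}
        for t, (s, _) in enumerate(episode):
            first_idx.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r                  # G now equals G_t
            if first_idx[s] == t:              # update only on the first visit
                N[s] += 1
                V[s] += (G - V[s]) / N[s]
    return dict(V)

# Two hypothetical episodes from a toy two-state task:
episodes = [[("A", 0.0), ("B", 1.0)], [("B", 1.0)]]
print(first_visit_mc(episodes, gamma=0.9))     # V("A") ≈ 0.9, V("B") = 1.0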

Every-Visit MC¶

Initialize V(s) arbitrarily for all s
Initialize N(s) = 0 for all s

Loop forever (for each episode):
    Generate an episode: S0, A0, R1, ... , ST

    G = 0

    for t = T-1 down to 0:
        G = γ * G + R[t+1]          # G == G_t

        N(S[t]) = N(S[t]) + 1
        V(S[t]) = V(S[t]) + (1 / N(S[t])) * (G - V(S[t]))

In non-stationary environments, everything stays the same, except that the step size $1/N(s)$ is replaced with a constant $\alpha$: $$ V(S_t)\leftarrow V(S_t)+\alpha(G_t - V(S_t)) $$
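A compact every-visit sketch covering both cases; the episode format matches the first-visit sketch above, and the default `alpha` is a hypothetical choice:

from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0, alpha=None):
    # alpha=None -> incremental mean (stationary case);
    # alpha=constant -> exponential forgetting for non-stationary problems.
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r                     # G == G_t
            N[s] += 1
            step = alpha if alpha is not None else 1.0 / N[s]
            V[s] += step * (G - V[s])             # update on every visit of s
    return dict(V)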

MC Policy Evaluation For Action Values¶

The algorithm structure, return calculation, and incremental update formula are exactly the same as in MC Policy Evaluation for state value.

First-Visit MC¶

Initialize Q(s,a) arbitrarily for all s, a
Initialize N(s,a) = 0 for all s, a

Loop forever (for each episode):
    Generate an episode using policy π:
        S0, A0, R1, S1, A1, R2, ... , ST

    G = 0

    for t = T-1 down to 0:
        G = γ * G + R[t+1]         # now G == G_t

        if (S[t], A[t]) does not appear in (S[0],A[0]), ..., (S[t-1],A[t-1]):   # FIRST visit

            N(S[t], A[t]) = N(S[t], A[t]) + 1
            Q(S[t], A[t]) =
                Q(S[t], A[t]) + (1 / N(S[t], A[t])) * (G - Q(S[t], A[t]))

Every-Visit MC¶

Initialize Q(s,a) arbitrarily for all s, a
Initialize N(s,a) = 0 for all s, a

Loop forever (for each episode):
    Generate an episode using policy π

    G = 0

    for t = T-1 down to 0:
        G = γ * G + R[t+1]

        N(S[t], A[t]) = N(S[t], A[t]) + 1
        Q(S[t], A[t]) =
            Q(S[t], A[t]) + (1 / N(S[t], A[t])) * (G - Q(S[t], A[t]))

--- Monte Carlo Control (Standard Version, No Exploration)¶

Monte Carlo Control = MC Policy Iteration.

Its policy improvement step is structurally identical to the one in traditional Policy Iteration: both make the policy greedy with respect to the current value function (greedify). The only difference is in policy evaluation: PI uses the model, while MC uses sampled averages.

Initialize π arbitrarily
Initialize Q(s,a) arbitrarily for all s,a
Initialize N(s,a) = 0 for all s,a

repeat:
    # Policy Evaluation (Monte Carlo)
    # Evaluate q_π by generating episodes following π and averaging returns.
    Generate some number of episodes using policy π:
        for each episode: S0,A0,R1,S1,A1,R2,...,ST
            G = 0
            for t = T-1 down to 0
                G = γ * G + R[t+1]                 # G == G_t
                N(S[t],A[t]) = N(S[t],A[t]) + 1
                Q(S[t],A[t]) = Q(S[t],A[t]) + (1 / N(S[t],A[t])) * (G - Q(S[t],A[t]))

    # Policy Improvement
    policy_stable = True
    for s in S:
        old_action = π(s)
        π(s) = argmax_a Q(s,a)
        if old_action ≠ π(s):
            policy_stable = False

until policy_stable
return π*, Q
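The policy improvement step is just greedification plus a stability check; here is a minimal Python sketch of that step (dict-based Q and policy, with names of my own choosing):

def greedy_improvement(Q, policy, states, actions):
    # Make π greedy w.r.t. Q; return True if π did not change (policy_stable).
    stable = True
    for s in states:
        old_action = policy.get(s)
        policy[s] = max(actions, key=lambda a: Q.get((s, a), 0.0))
        if old_action != policy[s]:
            stable = False
    return stable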

Add Exploration¶

Why exploration is needed

  • Early noise: The initial estimates of MC / TD are inherently high-variance. Without exploration, a single accidental high return can determine all future sampled paths.

  • Coverage: Under the current sampling policy, when generating episodes, it is possible that certain low-probability states or actions are never visited. Specifically, in this step

    Generate some number of episodes using policy π: 
        for each episode: S0,A0,R1,S1,A1,R2,...,ST
    

    To learn the optimal policy, it is necessary to have samples for every relevant state/action.

Where to add exploration

Exploration does not happen in the evaluation step, but rather in the choice of which policy to use for sampling.

  • In the ε-soft case, exploration is introduced directly through policy improvement.
  • In the off-policy case, exploration comes from a separately designed behavior policy b.

In any case, exploration does not occur during policy evaluation. Policy evaluation means: given a sampling policy, how do we estimate the value of the target policy? It only does the following: evaluates the value of the ε-soft policy itself (on-policy), or uses importance sampling (IS) to correct for the difference caused by the behavior policy (off-policy).

--- On-Policy MC Control (with Exploration)¶

In practice, MC Control must include exploration (such as ε-soft or a behavior policy); otherwise, there is no guarantee of sufficient action coverage and it is easy to get stuck in suboptimal behavior. The fundamental version without exploration is only used for theoretical illustration of the structure of MC Policy Iteration and is not actually learnable.

On-policy & Off-policy

Almost all RL algorithms can be classified as on-policy or off-policy. For example,

  • MC Control (ε-soft) ↔ Off-policy MC Control
  • SARSA ↔ Q-learning
  • Policy Gradient / A2C / PPO ↔ DQN

In reinforcement learning, the distinction between on-policy and off-policy always depends solely on whether the sampling policy is the same as the policy being learned or improved. This definition is consistent across all RL algorithms.

  • On-policy: The policy used to collect data is the same as the policy being evaluated or improved.
  • Off-policy: The policy used to collect data is different from the policy being evaluated or improved.

Some algorithms (such as PPO, A2C) update a new policy using data from an old policy in practice. However, since the data comes from a slightly lagged version of the current policy and not from long-term experience replay, these are still classified as belonging to the on-policy family.

Exploring Starts (Theoretical Construction)¶

Exploring Starts has an algorithm structure identical to standard MC Control, with the only difference being that each episode's initial $(s_0, a_0)$ is subject to the enforced exploration condition: "Every state–action pair has a non-zero probability of being the starting pair." In the MC Control pseudocode above, you just need to add (or replace) one line:

# Exploring Starts (only difference)
Choose S0, A0 such that P(S0=s, A0=a) > 0 for all (s,a)
Generate episode starting from S0, A0, then follow policy π

Why does it guarantee exploration? Because the initial $(s_0,a_0)$ of every episode is chosen with nonzero probability for every $(s,a)$ pair, so as the number of episodes goes to infinity, every state–action pair is (with probability 1) selected as a starting pair infinitely often. Since the starting point already enforces visiting $(s,a)$, and the rest of the episode follows the policy, we no longer need to rely on the policy itself to "accidentally explore" it.

Why is this called a "theoretical solution"? Because in real systems it is almost impossible to randomly assign the initial $(s_0,a_0)$ in a fully controllable manner.

Exploring Starts is on-policy because, in each episode, the policy used to generate trajectories is the same as the one being evaluated and improved, denoted by $\pi$. By default, standard MC Control is considered on-policy.

ε-soft Policy MC (Practical Solution)¶

The core structure of ε-soft policy MC Control is the same as standard MC Control. The only difference is that, during the policy improvement step, ε-soft MC Control replaces the deterministic greedy policy with an "ε-soft greedy" (stochastic) policy, ensuring that the algorithm always continues to explore. $$ \pi(a|s)= \begin{cases} 1-\varepsilon + \varepsilon/|A(s)| & a=a^* \\ \varepsilon/|A(s)| & a\neq a^* \end{cases} $$ Basically, replace

π(s) = argmax_a Q(s,a)

with

a* = argmax_a Q(s,a)
for a in A(s):
    π(a|s) = ε/|A(s)| if a != a* else 1 - ε + ε/|A(s)|
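A minimal Python sketch of the ε-soft greedy policy (the dict-based Q and the helper names are illustrative assumptions):

import random

def epsilon_soft_probs(Q, s, actions, epsilon=0.1):
    # π(a|s) = 1 - ε + ε/|A(s)| for the greedy action, ε/|A(s)| otherwise.
    n = len(actions)
    a_star = max(actions, key=lambda a: Q.get((s, a), 0.0))
    return {a: (1 - epsilon + epsilon / n) if a == a_star else epsilon / n
            for a in actions}

def sample_action(probs):
    # Used when generating episodes: every action keeps nonzero probability.
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions], k=1)[0]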

ε-soft Policy MC converges to the optimal ε-soft policy, but cannot converge to the oracle policy $\pi^*$ (a fully greedy policy). This is because an ε-soft policy is required to assign nonzero probability to every action, whereas a completely greedy policy violates this constraint and thus is not even within the permissible set of policies for the algorithm. Therefore, ε-soft MC Control is not "approaching greedy," but rather achieves the best within the policy space where exploration is mandatory; a completely greedy policy is excluded from the beginning.

ε-soft policy MC is on-policy because, in each episode, the policy used to generate trajectories is the same as the one being evaluated and improved, denoted by $\pi$. By default, standard MC Control is considered on-policy.

--- Off-policy MC Control (with Exploration)¶

Off-policy MC(-control): Use a behavior policy $\mu$ to sample data, and use this data to evaluate or improve another target policy $\pi$.

Regarding how to choose the behavior policy $\mu$, our requirement is: as long as any action the target policy $\pi$ might take can also be taken (with nonzero probability) by the behavior policy $\mu$, that's sufficient. That is, $$\mu(a \mid s) > 0 \quad \text{for every } (s,a) \text{ such that } \pi(a \mid s) > 0$$

Core difficulty of Off-policy MC: You cannot simply average returns.

  • In on-policy settings, sampled trajectories come from π, and the returns G are sampled from the correct distribution, so $$ V_\pi(s) = \mathbb{E}_\pi[G \mid S_t=s] $$ Therefore, directly averaging the returns to estimate the value function (for example, $V(s) = \frac{S(s)}{N(s)}$) is unbiased.

  • In off-policy settings, the data comes from μ, but what you want is the expectation under π. However, $$ \mathbb{E}_\mu[G] \neq \mathbb{E}_\pi[G] $$ So directly averaging the returns (for example, $V(s) = \frac{S(s)}{N(s)}$) gives a biased estimate. If the estimator is inherently biased, then by the law of large numbers the sample average converges to the wrong quantity, not the true value function. Therefore, in theory, we generally require the estimator to be unbiased first, and then try to reduce its variance.

To construct an unbiased estimator, we need to use importance sampling, giving higher weights to trajectories that are more like π.

Importance Sampling Estimators¶

  • Ordinary IS Estimator (unbiased, high variance)

    Corresponds to ordinary importance sampling: it directly averages importance-weighted returns and is unbiased, but highly sensitive to large fluctuations in the importance ratios. $$ V(s) = \frac{1}{|T(s)|} \sum_{t\in T(s)} \rho_t^T G_t $$

  • Weighted / Self-Normalized IS Estimator (biased but asymptotically unbiased, low variance)

    Corresponds to self-normalized importance sampling (SNIS): it normalizes by the sum of importance weights, introducing bias but typically achieving much lower variance. $$ V(s) = \frac{\sum_t \rho_t^T G_t}{\sum_t \rho_t^T} $$

The variance reduction of the weighted estimator comes from the normalization: when a few trajectories happen to receive very large importance weights, dividing by the sum of weights bounds their influence, so the estimate remains a convex combination of the observed returns instead of being blown up by those few trajectories.
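A small numeric illustration of the two estimators (the weights and returns below are made up; in MC they would be per-trajectory importance ratios and sampled returns for a fixed state s):

import numpy as np

rho = np.array([0.1, 0.2, 5.0, 0.4])          # importance ratios ρ_t^T
G   = np.array([1.0, 0.5, 2.0, 1.5])          # sampled returns G_t

ordinary_is = np.mean(rho * G)                 # unbiased, but one large ρ dominates
weighted_is = np.sum(rho * G) / np.sum(rho)    # normalized: stays within range of G

print(ordinary_is, weighted_is)                # ≈ 2.70 vs ≈ 1.89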

For formal definitions of ordinary importance sampling, self-normalized importance sampling, and their variance comparison, see the Importance Sampling notes.

Algorithm¶

The structure of Off-Policy MC Control is basically the same as standard On-Policy MC Control (still: Policy Evaluation → Policy Improvement → policy_stable check), but the main modifications for Off-Policy are concentrated in the Policy Evaluation part. Below is the pseudocode for Off-Policy Monte Carlo Control (Weighted Importance Sampling version). The structure is kept as close as possible to On-Policy MC-Control, and any lines that are additional or modified compared to On-Policy are marked with # [DIFF].

Off-Policy Monte Carlo Control (Weighted Importance Sampling)

Initialize π arbitrarily                                  # target policy
Initialize μ as an exploratory behavior policy             # [DIFF] behavior policy for sampling
Initialize Q(s,a) arbitrarily for all s,a
Initialize C(s,a) = 0 for all s,a                          # [DIFF] replaces N(s,a)

repeat:
    # Policy Evaluation (Off-Policy Monte Carlo)
    # Evaluate q_π using episodes generated by μ, weighted by importance sampling.

    Generate some number of episodes using behavior policy μ:     # [DIFF]
        for each episode: S0,A0,R1,S1,A1,R2,...,ST

            G = 0
            ρ = 1                                                # [DIFF] importance sampling ratio accumulator

            for t = T-1 down to 0:
                G = γ * G + R[t+1]                               # G == G_t

                # Weighted importance sampling update
                # (A[t] is given when estimating q_π(S[t],A[t]), so at this
                #  point ρ only covers the actions from t+1 to T-1)
                C(S[t],A[t]) = C(S[t],A[t]) + ρ                  # [DIFF]
                Q(S[t],A[t]) = Q(S[t],A[t]) + (ρ / C(S[t],A[t])) * (G - Q(S[t],A[t]))   # [DIFF]

                # Update the importance sampling ratio to include timestep t
                ρ = ρ * π(A[t] | S[t]) / μ(A[t] | S[t])          # [DIFF]
                if ρ == 0: break                                 # [DIFF] π can no longer generate this trajectory

    # Policy Improvement (same as on-policy)
    policy_stable = True
    for s in S:
        old_action = π(s)
        π(s) = argmax_a Q(s,a)
        if old_action ≠ π(s):
            policy_stable = False

until policy_stable
return π*, Q
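A Python sketch of the Policy Evaluation inner loop for one episode sampled from μ (the episode format and function names are assumptions of this sketch; Q and C are expected to be defaultdict(float)):

def off_policy_episode_update(episode, Q, C, pi, mu, gamma=1.0):
    # episode: [(S0, A0, R1), (S1, A1, R2), ..., (S_{T-1}, A_{T-1}, R_T)]
    # pi(a, s), mu(a, s): action probabilities of target / behavior policy.
    G, rho = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r                          # G == G_t
        # A_t is given when estimating q_π(S_t, A_t), so the ratio for
        # timestep t is applied only after this update.
        C[(s, a)] += rho
        Q[(s, a)] += (rho / C[(s, a)]) * (G - Q[(s, a)])
        rho *= pi(a, s) / mu(a, s)
        if rho == 0.0:                             # π cannot produce this prefix
            break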

Differences from On-Policy Standard MC-Control:

  1. An additional behavior policy μ is introduced, and μ is used to sample episodes.
     • Initialize μ ... and Generate episodes using μ are the two corresponding # [DIFF] lines.
  2. The counter N(s,a) is replaced with the cumulative weight C(s,a).
     • Initialize C(s,a) = 0 (# [DIFF])
     • When updating, use C += ρ instead of N += 1
  3. During episode replay, the importance sampling ratio ρ is introduced, and Q is updated in a weighted manner.
     • ρ = 1 at the start of each episode
     • C ← C + ρ, then Q ← Q + (ρ/C)(G − Q)
     • ρ ← ρ · π/μ after the update at timestep t, since A[t] is already given