Importance sampling is a concept that originated in statistical physics and numerical integration, initially used to efficiently estimate integrals or expectations when it is difficult to sample directly from a distribution. Later, this method was widely adopted in machine learning and reinforcement learning to "approximate the expectation under one distribution using data from another distribution." Let's break down the logic and intuition behind importance sampling.
Takeaways:
- Is self-normalized importance sampling only used for unnormalized distributions?
- Is the self-normalized importance sampling estimator asymptotically biased?
- Why do we only compute importance weights with normalized distributions in RL?
--- Ordinary Importance Sampling¶
Suppose you want to compute an expectation: $$ \mathbb{E}_{p(z)}[f(z)] = \int f(z)p(z)dz $$ If you can sample directly from $p(z)$, then it's very straightforward. You can easily use the Monte Carlo estimator to approximate this expectation (also called the empirical expectation): $$ \hat{I} = \frac{1}{N}\sum_{i=1}^N f(z_i), \quad z_i \sim p $$
Core statistical properties of $\hat{I}$:
- Unbiasedness $$ \mathbb{E}[\hat{I}] = \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^N f(z_i)\right] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{p(z)}[f(z_i)] = \mathbb{E}_{p(z)}[f(z)] $$
- The variance decreases with $1/N$ $$ \mathrm{Var}[\hat{I}] = \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}[f(z_i)] = \frac{1}{N}\mathrm{Var}_{p(z)}[f(z)] $$
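To make the estimator concrete, here is a minimal sketch in Python (NumPy). The toy problem is an assumption for illustration only: $p$ is a standard normal and $f(z)=z^2$, so the true expectation is $1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_p, n):
    """Plain Monte Carlo estimate of E_p[f(Z)] from n i.i.d. samples of p."""
    z = sample_p(n)
    return np.mean(f(z))

# Toy problem (illustrative assumption): p = N(0, 1), f(z) = z^2, so E_p[f] = 1.
f = lambda z: z ** 2
sample_p = lambda n: rng.standard_normal(n)

for n in (100, 10_000, 1_000_000):
    print(n, mc_estimate(f, sample_p, n))   # estimates concentrate around 1 as n grows
```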
If sampling from $p(z)$ is difficult, we can introduce an alternative distribution $q(z)$ that is easier to sample from: $$ \mathbb{E}_{p}[f(z)] = \int f(z)\frac{p(z)}{q(z)}q(z) dz $$ Thus, you can sample from $q(z)$ and use importance weights to correct for the distribution mismatch, obtaining the empirical expectation $$ \hat{I}_{IS} = \frac{1}{N}\sum_{i=1}^N f(z_i)w(z_i), \quad z_i \sim q \text{ and } w(z) = \frac{p(z)}{q(z)}\text{ (importance weight)} $$
Core statistical properties of $\hat{I}_{IS}$:
Unbiasedness
As long as the support condition holds ($q(z)>0$ wherever $p(z)>0$) and the relevant expectations are finite, the estimator is unbiased: $$ \mathbb{E}[\hat{I}_{\text{IS}}] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_q[f(Z)w(Z)] = \mathbb{E}_q\left[f(Z)\frac{p(Z)}{q(Z)}\right] = \int f(z)p(z)dz = I $$
If $\mathbb{E}_q[f^2 w^2]<\infty$, the variance still scales as $1/N$ $$ \mathrm{Var}[\hat{I}_{\text{IS}}] = \frac{1}{N}\mathrm{Var}_q \big(f(Z)w(Z)\big) = \frac{1}{N}\big(\mathbb{E}_q \big[f(Z)^2 w(Z)^2\big]-I^2\big) $$ The variance of $\hat{I}_{IS}$ strongly depends on the fluctuations of the ratio $w(z)=p(z)/q(z)$. If $q$ assigns too little probability mass in regions where $p$ is large (poor tail matching), $w$ will "explode," causing $\mathrm{Var}_q(fw)$ to increase drastically (this is exactly why off-policy MC can have very high variance). On the other hand, if $q$ matches $p$ well, the variance is significantly reduced.
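The sensitivity to tail matching is easy to see numerically. Below is a minimal sketch; the specific setup ($p=\mathcal N(0,1)$, $f(z)=z^2$, and two Gaussian proposals, one heavier-tailed and one lighter-tailed than $p$) is an assumption chosen purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ordinary_is(f, p_pdf, q_pdf, sample_q, n):
    """Ordinary IS estimate of E_p[f(Z)] from n samples of q."""
    z = sample_q(n)
    w = p_pdf(z) / q_pdf(z)            # importance weights w(z) = p(z)/q(z)
    return np.mean(f(z) * w)

# Toy problem (illustrative assumption): p = N(0, 1), f(z) = z^2, so E_p[f] = 1.
f = lambda z: z ** 2
p_pdf = lambda z: normal_pdf(z, 0.0, 1.0)

for sigma_q in (2.0, 0.75):            # heavier-tailed vs. lighter-tailed proposal
    q_pdf = lambda z, s=sigma_q: normal_pdf(z, 0.0, s)
    sample_q = lambda n, s=sigma_q: s * rng.standard_normal(n)
    estimates = [ordinary_is(f, p_pdf, q_pdf, sample_q, 1_000) for _ in range(200)]
    print(f"q = N(0, {sigma_q}^2): mean {np.mean(estimates):.3f}, std {np.std(estimates):.3f}")
# The lighter-tailed proposal produces occasional very large weights, so its
# estimates fluctuate far more across runs, even though both are unbiased.
```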
--- Self-Normalized Importance Sampling¶
In practical situations, we often only know the unnormalized forms of distributions $p$ and $q$. In other words, we know the relative magnitudes of the probabilities, but not their exact values. For example, in Bayesian inference, the posterior distribution $p(\theta|x) \propto p(x|\theta)p(\theta)$ is a typical example where only the unnormalized form is known.
In such cases, we can still use only the unnormalized $\tilde p(z)$ and $\tilde q(z)$ to compute a Monte Carlo estimator of the expectation. The technique used is called self-normalized importance sampling.
If both the true distribution $p(z)$ and the proposal $q(z)$ are only known in unnormalized form $\tilde p(z)$ and $\tilde q(z)$, such that $$ p(z)=\frac{\tilde p(z)}{Z_p}, \qquad q(z)=\frac{\tilde q(z)}{Z_q}, $$ where $Z_p = \int \tilde p(z)dz$ and $Z_q = \int \tilde q(z)dz$ are unknown or intractable normalization constants, the ratio $\frac{p(z)}{q(z)}$ can be expressed as $$ \frac{p(z)}{q(z)} = \frac{\tilde p(z)/Z_p}{\tilde q(z)/Z_q} = \frac{Z_q}{Z_p}\cdot\frac{\tilde p(z)}{\tilde q(z)} $$ Thus, the expectation under $p$ becomes $$ \mathbb{E}_p[f] = \frac{Z_q}{Z_p}\mathbb{E}_q\left[f(z)\frac{\tilde p(z)}{\tilde q(z)}\right], \quad \text{where } w(z)=\frac{\tilde p(z)}{\tilde q(z)} $$ Using Monte Carlo samples $z_i \sim q$, we can get the Monte Carlo estimator of the expectation (or empirical expectation) as $$ \hat{I} = \frac{Z_q}{Z_p}\cdot\frac{1}{N}\sum_{i=1}^N f(z_i)w(z_i) ,\quad z_i \sim q $$ The ratio $\frac{Z_q}{Z_p}$ is unknown, but its reciprocal can be estimated from the same samples $z_i \sim q$: $$ \frac{Z_p}{Z_q} = \int \frac{\tilde p(z)}{\tilde q(z)} q(z)dz \approx \frac{1}{N}\sum_{i=1}^N \frac{\tilde p(z_i)}{\tilde q(z_i)} = \frac{1}{N}\sum_{i=1}^N w(z_i) ,\quad z_i \sim q $$ Plugging this back into $\hat{I}$ gives the self-normalized estimator: $$ \hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^N f(z_i)w(z_i)}{\sum_{i=1}^N w(z_i)} = \sum_{i=1}^N \tilde{w}_i f(z_i), \quad \text{ where } \tilde{w}_i = \frac{w(z_i)}{\sum_{j=1}^N w(z_j)}. $$
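The derivation translates directly into a few lines of code. The sketch below assumes a toy target $p=\mathcal N(2,1)$ known only up to an arbitrary constant through $\tilde p$, and a normalized proposal $q=\mathcal N(0,2^2)$; the densities, the constant 7.3, and $f(z)=z$ are illustrative choices, not anything prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(2)

def snis(f_vals, w):
    """Self-normalized IS: weights are normalized by their own sample sum."""
    return np.sum(f_vals * w) / np.sum(w)

# Toy setup (illustrative assumption): target p = N(2, 1), known only up to an
# arbitrary constant via p_tilde; proposal q = N(0, 2^2), fully normalized.
p_tilde = lambda z: 7.3 * np.exp(-0.5 * (z - 2.0) ** 2)            # unnormalized target
q_pdf   = lambda z: np.exp(-0.5 * (z / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

z = 2.0 * rng.standard_normal(50_000)        # samples from q
w = p_tilde(z) / q_pdf(z)                    # weights from the *unnormalized* target
f_vals = z                                   # f(z) = z, so E_p[f] = 2

print("SNIS estimate:", snis(f_vals, w))         # close to 2 despite the unknown constant
print("naive mean(f*w):", np.mean(f_vals * w))   # off by the unknown constant Z_p
```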
Notice! Self-normalized importance sampling is not an algorithm specifically designed for unnormalized distributions; rather, it is a method that uses sample weights to estimate the normalization constant, which reduces variance at the cost of introducing nonlinearity and bias. Even in the case where both $\tilde p$ and $\tilde q$ are fully normalized, it is still different from ordinary IS: if $\tilde p = p$ and $\tilde q = q$, then the importance weight $w$ here is exactly the same as the importance weight $w$ in the ordinary IS setting, since $\tilde p / \tilde q = p / q$. However, the estimators themselves are different. $$ \hat{I}_{IS}= \frac{1}{N}\sum_{i=1}^N f(z_i)w(z_i) , \quad \hat{I}_{\text{SNIS}} = \frac{\sum_{i=1}^N f(z_i)w(z_i)}{\sum_{i=1}^N w(z_i)} $$
Core statistical properties of $\hat{I}_{\text{SNIS}}$:
Biased but consistent
In general, there is no closed-form exact expectation; the first-order bias of the ratio estimator can be obtained from a second-order Taylor expansion (delta method). Writing $\mu_0 = \mathbb{E}_q[w(Z)]$ (equal to $Z_p/Z_q$ for unnormalized weights, and to $1$ for normalized ones), $$ \mathbb{E}[\hat I_{\text{SNIS}}] \approx \mathbb{E}_p[f] + \frac{1}{N} \frac{\mathbb{E}_p[f]\operatorname{Var}_q\big(w(Z)\big) - \operatorname{Cov}_q\big(f(Z)w(Z),w(Z)\big)}{\mu_0^{2}} $$ Therefore, SNIS is typically biased for finite samples (with bias of order $\mathcal O(1/N)$), but under standard regularity conditions it is consistent: $\hat I_{\text{SNIS}}\xrightarrow{a.s.} \mathbb{E}_p[f]$.
If $\mathbb{E}_q[(f-I)^2 w^2]<\infty$, the variance still scales as $1/N$ (here $I=\mathbb{E}_p[f]$).
Using the delta method for the ratio estimator, $$ \operatorname{Var}[\hat I_{\text{SNIS}}] \approx \frac{1}{N}\frac{\operatorname{Var}_q\big((f(Z)-I)w(Z)\big)}{\mu_0^{2}} = \frac{1}{N}\frac{\mathbb{E}_q\big[(f(Z)-I)^2 w(Z)^2\big]}{\mu_0^{2}} $$
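A quick simulation makes the $\mathcal O(1/N)$ bias visible. The sketch below assumes a toy problem with $p=\mathcal N(1,1)$, $q=\mathcal N(0,1)$, and $f(z)=z$ (so $I=1$), and averages many independent SNIS runs per sample size; all specifics are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy problem (illustrative assumption): p = N(1, 1), q = N(0, 1), f(z) = z, so I = 1.
p_pdf = lambda z: normal_pdf(z, 1.0, 1.0)
q_pdf = lambda z: normal_pdf(z, 0.0, 1.0)
I_true = 1.0

for N in (10, 100, 1000):
    estimates = []
    for _ in range(20_000):                       # many independent SNIS runs
        z = rng.standard_normal(N)                # samples from q
        w = p_pdf(z) / q_pdf(z)                   # importance weights
        estimates.append(np.sum(z * w) / np.sum(w))
    print(N, "empirical bias:", np.mean(estimates) - I_true)  # shrinks roughly like 1/N
```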
--- The Importance Sampling Ratio in RL¶
Why do we need importance sampling in RL? We want to estimate the expectation under a policy that is hard to sample from, using data collected from a policy that is easy to sample from. For example, the target policy $\pi$ might be deterministic, making it unsuitable for exploration and data collection, while an exploratory behavior policy $\mu$ is used for sampling. But of course, this is not the only use case!
The importance sampling ratio in reinforcement learning is exactly the importance weight $w = p/q$ with normalized distributions, where the random variable is a trajectory (or a trajectory suffix). We only ever compute importance weights from normalized distributions because an "unnormalized policy" does not exist in RL. For any fixed $s$, every legitimate policy satisfies: $$ \sum_a \pi(a \mid s) = 1 $$
In RL, a sample is actually a trajectory $\tau = (S_0, A_0, \dots, S_T)$: $$ p_\pi(\tau) = P(S_0)\prod_{t=0}^{T-1}\pi(A_t|S_t)P(S_{t+1}|S_t,A_t) $$ Because the initial state distribution $P(S_0)$ and the state transition probabilities $P(S_{t+1}|S_t,A_t)$ are determined by the environment, we have: $$ \frac{p_\pi(\tau)}{p_\mu(\tau)} = \prod_{t=0}^{T-1}\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} $$
In RL, the importance sampling ratio from time $t$ to the end $T$ is defined as: $$ \rho_t^T = \prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{\mu(A_k|S_k)}. $$ It quantifies the relative probability that the target policy $\pi$ and the behavior policy $\mu$ would generate the same sequence of actions from time $t$ onward. This is the fundamental weighting factor for all off-policy corrections (MC, TD, PG).
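In code, $\rho_t^T$ is just a running product of per-step probability ratios along a logged trajectory. The sketch below uses hypothetical per-step probabilities for $\pi$ and $\mu$; how those probabilities are produced depends entirely on your policy parameterization.

```python
import numpy as np

def is_ratio(pi_probs, mu_probs, t, T):
    """rho_t^T = prod_{k=t}^{T-1} pi(A_k|S_k) / mu(A_k|S_k).

    pi_probs[k] and mu_probs[k] are the probabilities that the target policy pi
    and the behavior policy mu assign to the action actually taken at step k.
    """
    return float(np.prod(np.asarray(pi_probs[t:T]) / np.asarray(mu_probs[t:T])))

# Hypothetical 3-step trajectory (numbers chosen only for illustration):
pi_probs = [0.9, 0.8, 0.7]   # pi(A_k | S_k) along the logged trajectory
mu_probs = [0.5, 0.5, 0.5]   # mu(A_k | S_k) along the same trajectory

print(is_ratio(pi_probs, mu_probs, t=0, T=3))   # (0.9/0.5)*(0.8/0.5)*(0.7/0.5) ≈ 4.03
```

For long trajectories one would typically accumulate the product in log space to avoid numerical overflow or underflow.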
--- Self-Normalized Importance Sampling Estimator ≈ Control Variate¶
Self-Normalized Importance Sampling can be understood as ordinary importance sampling plus an automatically constructed control variate. Its variance-reduction effect comes from canceling fluctuations in the overall scale of the importance weights.
Absorbing the global scale (accidental weight inflation)¶
In importance sampling, it is common that all weights in a given Monte Carlo run are simultaneously inflated or deflated by a random factor. This is not a modeling choice but a sampling accident, typically caused by poor tail matching between $q$ and $p$:
- the proposal $q$ happens to place too little mass in regions where $p$ is large;
- the sampled points therefore all receive unusually large importance weights.
Formally, this corresponds to $$ w_i = c\,\tilde w_i, $$ where the random constant $c$ reflects an accidental global scale error in that particular sample set.
Ordinary IS is highly sensitive to this accident: $$ \hat I_{\text{OIS}} = \frac1N\sum_i w_i f_i = c\cdot \frac1N\sum_i \tilde w_i f_i, $$ so the estimator (and its variance) is amplified linearly by $c$.
Self-Normalized IS automatically cancels this accidental scale: $$ \hat I_{\text{SNIS}} = \frac{\sum_i w_i f_i}{\sum_i w_i} = \frac{c\sum_i \tilde w_i f_i}{c\sum_i \tilde w_i} = \frac{\sum_i \tilde w_i f_i}{\sum_i \tilde w_i}. $$
Thus, SNIS treats the overall weight scale as nuisance randomness and removes it through normalization, leaving only the relative weights to determine the estimate.
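This cancellation is easy to verify directly: multiplying every weight by the same constant scales the OIS estimate but leaves the SNIS estimate untouched. The weights and function values below are random placeholders, chosen only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(3)

w = rng.lognormal(size=1_000)        # some positive importance weights (placeholder)
f_vals = rng.normal(size=1_000)      # f evaluated at the same samples (placeholder)
c = 5.0                              # accidental global scale factor

ois  = lambda f, w: np.mean(f * w)
snis = lambda f, w: np.sum(f * w) / np.sum(w)

print(ois(f_vals, w),  ois(f_vals, c * w))    # OIS is scaled linearly by c
print(snis(f_vals, w), snis(f_vals, c * w))   # SNIS is unchanged
```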
Control variate view (delta-method approximation)¶
Write SNIS as a ratio $\bar X/\bar Y$ with $X=f(Z)w(Z)$, $Y=w(Z)$, and recall $\mu_0=\mathbb{E}_q[w(Z)]$. A first-order expansion around $(\mathbb{E}_q[X], \mu_0)$ yields $$ \hat I_{\text{SNIS}} \approx \frac{1}{\mu_0}\Big(\hat I_{\text{OIS}} - I(\bar w-\mu_0)\Big), \quad \bar w=\frac1N\sum_{i=1}^N w(Z_i), $$ which reduces to $\hat I_{\text{OIS}} - I(\bar w - 1)$ for normalized weights ($\mu_0=1$).
- The term $(\bar w-\mu_0)$ has zero mean.
- Subtracting it does not change the expectation to first order.
- It does reduce variance when it is sufficiently positively correlated with $\hat I_{\text{OIS}}$ (the usual control-variate condition).
Hence, SNIS ≈ OIS plus a control variate with coefficient $I$.
Variance decomposition (asymptotic)¶
$$ \operatorname{Var}(\hat I_{\text{SNIS}}) \approx \frac{1}{N\mu_0^2} \Big( \operatorname{Var}(fw) + I^2\operatorname{Var}(w) - 2I\operatorname{Cov}(fw,w) \Big) $$
Compared to OIS $\big(\frac{1}{N}\operatorname{Var}(fw)\big)$, the negative covariance term $$ -2I \operatorname{Cov}(fw,w) $$ is the variance reduction provided by the control variate, partially offset by the $+I^2\operatorname{Var}(w)$ penalty. When $fw$ and $w$ are strongly positively correlated (common in practice), the covariance term dominates and SNIS yields substantially smaller variance.
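To see the decomposition at work, the sketch below compares the empirical variances of OIS and SNIS over many repeated runs on a toy problem where $f(Z)w(Z)$ and $w(Z)$ are positively correlated. The setup ($p=\mathcal N(1,1)$, $q=\mathcal N(0,1.5^2)$, $f(z)=z$) is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy setup (illustrative assumption): p = N(1, 1), q = N(0, 1.5^2), f(z) = z, so I = 1.
p_pdf = lambda z: normal_pdf(z, 1.0, 1.0)
q_pdf = lambda z: normal_pdf(z, 0.0, 1.5)
N, runs = 500, 2_000

ois_est, snis_est = [], []
for _ in range(runs):
    z = 1.5 * rng.standard_normal(N)          # samples from q
    w = p_pdf(z) / q_pdf(z)                   # normalized importance weights
    ois_est.append(np.mean(z * w))
    snis_est.append(np.sum(z * w) / np.sum(w))

print("OIS  empirical variance:", np.var(ois_est))
print("SNIS empirical variance:", np.var(snis_est))
# Here f(Z)w(Z) and w(Z) are positively correlated, so the -2*I*Cov(fw, w) term
# outweighs the +I^2*Var(w) penalty and SNIS ends up with the smaller variance.
```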
Takeaway: SNIS reduces variance by normalizing away random global weight scales; statistically, this is equivalent to adding a strong, automatically constructed control variate to ordinary importance sampling.