Stochastic Approximation

Prerequisite: read the ODE-Discretization note.

Stochastic approximation = noisy Euler discretization.

--- Background Recap: Fixed-Point Iteration¶

Problem Setting¶

We consider the problem of finding a solution to the equation $$ x = g(x) $$ where $g: \mathbb{R}^d \to \mathbb{R}^d$ is a given deterministic function. Such a solution $x^*$ is called a fixed point of $g$.

Fixed-Point Iteration Algorithm¶

The most basic numerical method for solving the equation $x = g(x)$ is fixed-point iteration. Starting from an initial guess $x_0$, the algorithm repeatedly applies the mapping $g(\cdot)$ to generate a sequence $\{x_t\}$, with the hope that this sequence converges to a fixed point $x^*$.

Input:

  • Mapping $g: \mathbb{R}^d \to \mathbb{R}^d$
  • Initial point $x_0 \in \mathbb{R}^d$
  • Maximum number of iterations $T$ (or a stopping tolerance)

Procedure:

  1. Set $t = 0$.

  2. Repeat until a stopping criterion is satisfied:

    • Compute $$ x_{t+1} = g(x_t). $$
    • Set $t \leftarrow t+1$.
  3. Output: the final iterate $x_t$ as an approximation of a fixed point $x^*$.
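As an illustrative sketch (the map $g=\cos$ and the tolerance are example choices, not part of the note), the procedure above is a few lines of Python:

```python
import math

def fixed_point_iteration(g, x0, tol=1e-10, max_iter=1000):
    """Repeatedly apply g until successive iterates agree to within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# x = cos(x) has a unique fixed point (the Dottie number, ~0.739);
# cos is a contraction near it since |sin(0.739)| ~ 0.67 < 1.
x_star = fixed_point_iteration(math.cos, x0=1.0)
print(x_star, abs(math.cos(x_star) - x_star))
```

The residual $|g(x^\star)-x^\star|$ is printed as a sanity check that the output is indeed (approximately) a fixed point.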

⭐Remark

Fixed-point iteration is a purely numerical method: it does not require an explicit analytical form of $g(\cdot)$; it only requires that $g(x)$ can be evaluated exactly (or numerically) at any given point $x$. Consequently, it outputs the value rather than a closed-form expression of the fixed point $x^*$.

Convergence¶

A standard sufficient condition for the convergence of fixed-point iteration is that $g$ is a contraction mapping: $$ \| g(x) - g(y) \| \le \gamma \| x - y \|, \quad \text{for some } \gamma \in (0,1). $$

Under this condition:

  • the fixed point $x^*$ exists and is unique,
  • the iteration converges to $x^*$ from any initial point.

This is a direct consequence of the Banach fixed-point theorem: in a complete space (Banach space), any contraction mapping has a unique fixed point, and repeated application of the mapping converges to it.

Proof sketch:

Assume $g:\mathbb{R}^d\to\mathbb{R}^d$ is a contraction with constant $\gamma\in(0,1)$. The iteration $x_{t+1}=g(x_t)$ satisfies $$ \|x_{t+1}-x_t\|\le \gamma^t\|x_1-x_0\|, $$ which implies $\{x_t\}$ is a Cauchy sequence. Since $\mathbb{R}^d$ is complete, $x_t\to x^*$ for some $x^*$. By continuity of $g$, $$ x^*=\lim_{t\to\infty}x_{t+1}=\lim_{t\to\infty}g(x_t)=g(x^*), $$ so $x^*$ is a fixed point. If $x^*,y^*$ are two fixed points, then $\|x^*-y^*\|\le \gamma\|x^*-y^*\|$, implying $x^*=y^*$. Moreover, $$ \|x_t-x^*\|\le \gamma^t\|x_0-x^*\|, $$ so the convergence is linear with rate $\gamma$, from any initialization.

Compare to Newton's Method¶

Newton’s method is another classical numerical method for solving equations of the form $$ h(x)=0, \quad h:\mathbb{R}^d\to\mathbb{R}^d. $$ Starting from an initial point $x_0$, the iteration is defined as $$ x_{k+1} = x_k - \bigl[\nabla h(x_k)\bigr]^{-1} h(x_k), $$ where $\nabla h(x_k)$ denotes the Jacobian of $h$ at $x_k$.

Fixed-point iteration and Newton’s method solve the same mathematical problem, but expressed in different forms. Given any equation $$ h(x)=0 $$ it can be rewritten as a fixed-point problem $$ x=g(x), \quad \text{with } g(x)=x-h(x). $$ Conversely, any fixed-point equation $$ x=g(x) $$ can be written as a root-finding problem $$ h(x)=g(x)-x=0 $$ Therefore, both methods aim to find the same solution set; the difference lies entirely in how the iteration is constructed.

Fixed-point iteration only requires function evaluation and is therefore cheap and robust, but it typically converges slowly (linearly). In contrast, Newton’s method requires computing and inverting the Jacobian, which is computationally more expensive, but enjoys much faster (quadratic) local convergence. Also, Newton’s method is more sensitive to initialization and may fail when the Jacobian is singular or ill-conditioned, whereas fixed-point iteration is more robust.
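To make the trade-off concrete, here is a small Python comparison on the equivalent pair $h(x)=\cos x - x = 0$ and $x=\cos x$ (an illustrative example; the iteration counts depend on the chosen tolerance):

```python
import math

def newton(h, dh, x0, tol=1e-12, max_iter=50):
    """Newton's method in 1-D: x <- x - h(x)/h'(x); returns (root, iterations)."""
    x = x0
    for k in range(max_iter):
        step = h(x) / dh(x)
        x -= step
        if abs(step) < tol:
            return x, k + 1
    return x, max_iter

def fixed_point(g, x0, tol=1e-12, max_iter=10_000):
    """Plain fixed-point iteration; returns (fixed point, iterations)."""
    x = x0
    for k in range(max_iter):
        x_next = g(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter

h = lambda x: math.cos(x) - x        # root-finding form
dh = lambda x: -math.sin(x) - 1.0    # derivative (1-D Jacobian) of h

x_newton, it_newton = newton(h, dh, 1.0)
x_fp, it_fp = fixed_point(math.cos, 1.0)  # equivalent fixed-point form
print(it_newton, it_fp)  # Newton converges in far fewer iterations
```

Both methods return the same root; only the per-iteration cost and the iteration count differ.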

--- Background Recap: Relaxed Fixed-Point Iteration¶

Although fixed-point iteration is more robust than Newton’s method, stability can still be an issue. Relaxed fixed-point iteration was introduced primarily to improve stability rather than to accelerate convergence. Historically, this idea emerged gradually in numerical linear algebra (e.g., Jacobi, Gauss–Seidel, and Successive Over-Relaxation methods in the 1950s–1960s), where it was observed that direct substitution is often too aggressive.

Relaxed fixed-point iteration originates from an ODE / dynamical system viewpoint of fixed-point iteration (for ODE discretization context see the ODE-Discretization note). Consider the continuous-time dynamical system $$ \dot{x} = g(x) - x $$ A point $x^*$ is an equilibrium of this system if and only if $$ g(x^*) = x^* $$ i.e., $x^*$ is a fixed point of $g$. If this equilibrium is stable, then in continuous time we have $x(t)\to x^*$.

Applying forward Euler discretization with step size $\alpha>0$ yields $$ \boxed{ x_{t+1} = x_t + \alpha\bigl(g(x_t)-x_t\bigr) } $$ which is exactly the relaxed fixed-point iteration.

To choose a proper stepsize according to the stability condition, compute the Jacobian of the vector field $g(x)-x$ as $$ J = \nabla g(x^*) - I $$ The Euler discretization is stable if and only if, for all eigenvalues $\lambda_i(J)$, $$ \bigl|1 + \alpha \lambda_i(J)\bigr| < 1 $$ This condition directly determines an admissible range of $\alpha$. Choosing $\alpha$ sufficiently small ensures that the discrete-time iteration inherits the stability of the continuous system.
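This stability check is easy to automate. The Python sketch below uses an illustrative affine map $g(x) = Ax + b$ (so $\nabla g = A$ and $J = A - I$; the matrix is an assumption for the example) and tests $|1+\alpha\lambda_i(J)|<1$ for candidate step sizes:

```python
import numpy as np

# Illustrative affine map g(x) = A x + b; its Jacobian is A, so J = A - I.
A = np.array([[0.5, 1.2],
              [0.0, 0.8]])
J = A - np.eye(2)
eigs = np.linalg.eigvals(J)  # triangular J: eigenvalues -0.5 and -0.2

def euler_stable(alpha, eigs):
    """Discrete-time stability condition |1 + alpha * lambda_i| < 1."""
    return bool(np.all(np.abs(1.0 + alpha * eigs) < 1.0))

# For real eigenvalues lambda_i < 0 the condition reads 0 < alpha < 2/|lambda_i|,
# so the admissible range here is the intersection (0, 4).
print(euler_stable(1.0, eigs), euler_stable(5.0, eigs))  # True False
```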

Why can standard fixed-point iteration be unstable?

The original fixed-point iteration corresponds to the special case $$ x_{t+1} = x_t + 1\cdot (g(x_t)-x_t) $$ i.e., Euler discretization with step size $\alpha=1$.

From a dynamical systems perspective, this is an extremely large time step. As a result, even when the continuous-time system is stable, the discrete iteration may become unstable purely due to over-aggressive stepping.

In this sense, if we choose $0 < \alpha < 1$, the relaxed fixed-point iteration stabilizes the method by reducing the effective step size, trading speed for robustness.
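A scalar experiment (with the illustrative map $g(x) = -1.5x + 1$, not from the note) shows the effect: the fixed point is $x^* = 0.4$, but $|g'| = 1.5 > 1$ so $g$ is not a contraction and $\alpha = 1$ diverges; since $J = g' - 1 = -2.5$, the stability condition gives the admissible range $0 < \alpha < 0.8$:

```python
def relaxed_iteration(g, x0, alpha, steps):
    """x <- x + alpha * (g(x) - x), i.e. relaxed fixed-point iteration."""
    x = x0
    for _ in range(steps):
        x = x + alpha * (g(x) - x)
    return x

g = lambda x: -1.5 * x + 1.0   # fixed point x* = 0.4, but |g'| = 1.5 > 1

x_plain = relaxed_iteration(g, 0.0, alpha=1.0, steps=50)   # plain iteration
x_relax = relaxed_iteration(g, 0.0, alpha=0.5, steps=50)   # relaxed, in (0, 0.8)
print(x_plain, x_relax)  # plain blows up; relaxed converges to ~0.4
```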

--- Stochastic Approximation¶

Stochastic approximation studies iterative schemes of the form $$ \boxed{ x_{t+1} = x_t + \alpha_t \bigl(f(x_t) + \varepsilon_t\bigr), } $$ where:

  • $f:\mathbb{R}^d\to\mathbb{R}^d$ is a deterministic drift function,
  • $\{\varepsilon_t\}$ is a stochastic noise sequence satisfying $$ \mathbb{E}[\varepsilon_t\mid x_t]=0, $$
  • $\{\alpha_t\}$ is a diminishing step-size sequence. Note that a constant step size is generally insufficient here; otherwise, the noise injected at each step would persist forever.

This iteration can be viewed as a noisy forward Euler discretization of the ODE $$ \dot{x} = f(x) $$

The central question of stochastic approximation is whether the discrete-time noisy iterates track the trajectories of this limiting ODE and converge to its stable equilibria.
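A toy instance in Python (the drift $f(x) = 1 - x$, the Gaussian noise, and the steps $\alpha_t = 1/t$ are all illustrative choices) shows the noisy iterates settling at the ODE equilibrium $x^* = 1$:

```python
import random

random.seed(0)

def stochastic_approximation(x0, n_steps):
    """x_{t+1} = x_t + alpha_t * (f(x_t) + eps_t) with f(x) = 1 - x."""
    x = x0
    for t in range(1, n_steps + 1):
        eps = random.gauss(0.0, 1.0)           # zero-mean noise
        x = x + (1.0 / t) * ((1.0 - x) + eps)  # diminishing step alpha_t = 1/t
    return x

result = stochastic_approximation(5.0, 100_000)
print(result)  # close to the equilibrium x* = 1 of x' = 1 - x
```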

Fixed-Point and Root-Finding Formulations¶

The stochastic approximation framework admits different problem-specific representations; they differ only in the choice of the drift function $f(\cdot)$.

  • Fixed-point formulation¶

    Suppose the goal is to find a solution to $$ x = g(x) $$ where $g:\mathbb{R}^d\to\mathbb{R}^d$ is not directly observable but can be sampled via $$ \widehat g(x_t)=F(x_t,\xi_t), \quad \mathbb{E}[\widehat g(x_t) \mid x_t]=g(x_t). $$ Since an equilibrium satisfies $f(x) = 0$, the drift function $f(\cdot)$ in the ODE is defined as $$ f(x)=g(x)-x. $$ Finding a solution to $x = g(x)$ is then equivalent to finding an equilibrium of this system. Substituting into the Euler discretization yields $$ x_{t+1} = x_t + \alpha_t\bigl(\widehat g(x_t)-x_t\bigr), $$ which is also called the stochastic fixed-point iteration.

  • Root-finding formulation¶

    Alternatively, consider solving an equation of the form used in Newton's method, $$ h(x)=0, $$ when only noisy observations are available: $$ \widehat h(x_t) = H(x_t,\xi_t), \quad \mathbb{E}[\widehat h(x_t) \mid x_t]=h(x_t). $$ Since an equilibrium satisfies $f(x) = 0$, the drift function $f(\cdot)$ in the ODE is defined as $$ f(x)= - h(x). $$ Finding a solution to $h(x) = 0$ is then equivalent to finding an equilibrium of this system. Applying the Euler discretization directly gives $$ x_{t+1} = x_t - \alpha_t \widehat h(x_t). $$ This is exactly the setup of the classical Robbins–Monro algorithm. This algorithm also comes with a step-size choice, which we discuss later in the convergence analysis.

    When formulating the ODE, both choices $\dot x = f(x)=h(x)$ and $\dot x = f(x)=-h(x)$ are in principle valid. The equilibrium $x^*$ satisfies $h(x^*)=0$, and its stability is determined by the linearization $$ \dot y = f'(x^*)\,y, \qquad y=x-x^*, $$ whose solution is $y(t)=y(0)e^{f'(x^*)t}$. Thus, stability requires $f'(x^*)<0$. If $f(x)=-h(x)$, then $f'(x^*)=-h'(x^*)$, and the equilibrium is stable whenever $h'(x^*)>0$; if $f(x)=h(x)$, stability instead requires $h'(x^*)<0$. Therefore, the sign choice is not intrinsic to stochastic approximation: what matters is that the eigenvalues of the Jacobian of $f(\cdot)$ have negative real parts. The conventional choice $f(x)=-h(x)$ is motivated by common applications in statistics, optimization, and estimation, where $h(x)$ typically represents a gradient, score, or residual, and the solution $x^*$ satisfies a monotonicity or positive-definiteness condition (e.g., $h'(x^*)>0$ in one dimension, or $J_h(x^*)$ has eigenvalues with positive real parts). Under this assumption, choosing $f(x)=-h(x)$ automatically yields a stable ODE. If such conditions do not hold, the sign must be re-evaluated to ensure stability.

The two formulations both lead to the same limiting ODE $$ \dot{x}=f(x) $$ thus share identical stability properties, step-size requirements, and convergence behavior. The distinction is purely conceptual and reflects different ways of packaging the same stochastic dynamical system.
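As a concrete instance of the root-finding formulation, here is a Python sketch of the classical Robbins–Monro quantile estimator (the exponential distribution and the step sizes are illustrative assumptions): $h(x) = \mathbb{P}(\xi \le x) - p$, observed through the unbiased sample $\widehat h(x) = \mathbf{1}\{\xi \le x\} - p$:

```python
import math
import random

random.seed(1)

def robbins_monro_quantile(p, n_steps, x0=0.0):
    """Estimate the p-quantile of Exp(1) from samples, without knowing its CDF."""
    x = x0
    for k in range(1, n_steps + 1):
        xi = random.expovariate(1.0)              # fresh sample from Exp(1)
        h_hat = (1.0 if xi <= x else 0.0) - p     # unbiased estimate of F(x) - p
        x = x - (1.0 / k) * h_hat                 # x_{k+1} = x_k - alpha_k * h_hat
    return x

est = robbins_monro_quantile(p=0.5, n_steps=200_000)
print(est)  # the median of Exp(1) is log(2) ~ 0.693
```

Here $h'(x^*)$ equals the density at the quantile, which is positive, so the conventional sign $f(x)=-h(x)$ yields a stable ODE.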

Convergence Analysis via the ODE Method¶

The convergence of stochastic approximation is governed by the stability of the limiting ODE rather than the specific problem formulation (root finding or fixed point finding).

After linearization, deterministic discretizations admit the closed-form representation $y_k=A^k y_0$. Stochastic discretizations, in contrast, involve time-varying coefficients and additive noise, leading to random matrix products and noise accumulation. As a result, closed-form expressions are no longer useful; convergence must instead be established using a quadratic Lyapunov function $V=y^\top Py$, showing "expected decrease plus summable $\alpha_k^2$ terms" in order to control the effect of noise.

Consider the noisy Euler / stochastic approximation iteration $$ x_{k+1}=x_k+\alpha_k\bigl(f(x_k)+\xi_{k+1}\bigr), \qquad \mathbb{E}[\xi_{k+1}\mid \mathcal{F}_k]=0, $$ where $\mathcal{F}_k$ is the history up to time $k$. Assume $x^\star$ is an equilibrium: $f(x^\star)=0$.

Define the error $y_k=x_k-x^\star$. Then $$ y_{k+1}=y_k+\alpha_k\bigl(f(x^\star+y_k)+\xi_{k+1}\bigr). $$ Taylor expand around $x^\star$: $$ f(x^\star+y_k)=Jy_k+r(y_k), \qquad \|r(y_k)\|\le C\|y_k\|^2, $$ where $J:=Df(x^\star)$. Hence $$ y_{k+1}=(I+\alpha_k J)y_k+\alpha_k r(y_k)+\alpha_k\xi_{k+1}. $$

Assumptions (local):

  1. ODE stability (Hurwitz): all eigenvalues of $J$ satisfy $\Re(\lambda_i(J))<0$. Equivalently, there exists $P\succ0$ and $Q\succ0$ such that the Lyapunov equation holds: $$ J^\top P+PJ=-Q $$
  2. Noise second moment: $\mathbb{E}[\|\xi_{k+1}\|^2\mid\mathcal{F}_k]\le \sigma^2$
  3. Step sizes: $$ \boxed{ \sum_k \alpha_k=\infty,\qquad \sum_k \alpha_k^2<\infty } $$

Let $V_k:=y_k^\top P y_k$. Using $J^\top P+PJ=-Q$ and $\mathbb{E}[\xi_{k+1}\mid\mathcal{F}_k]=0$, one can expand $V_{k+1}$ and take conditional expectation to obtain, for $y_k$ sufficiently small, $$ \mathbb{E}[V_{k+1}\mid\mathcal{F}_k] \le V_k-\alpha_ky_k^\top Q y_k + C_1\alpha_k\|y_k\|^3 + C_2\alpha_k^2\bigl(\|y_k\|^2+1\bigr). $$ Since $y_k^\top Q y_k\ge c\|y_k\|^2$ and $\|y_k\|^3$ is higher order, in a small neighborhood we absorb the cubic term and get the key inequality (the core step) $$ \mathbb{E}[V_{k+1}\mid\mathcal{F}_k] \le V_k - c\alpha_k\|y_k\|^2 + C\alpha_k^2 $$

Summing over $k$ and using $\sum_k\alpha_k^2<\infty$ implies $$ \sum_k \alpha_k\|y_k\|^2<\infty \quad (\text{a.s.}), $$ and since $\sum_k\alpha_k=\infty$, we must have $\|y_k\|\to 0$ (a.s.), hence $$ x_k\to x^\star \quad (\text{a.s.}). $$

What changed compared to the deterministic proof?

  • Deterministic Euler: stability is enforced by the original continuous ODE stability and a fixed step-size spectral condition $\rho(I+\alpha J)<1$.
  • Noisy Euler (SA): we still need the same continuous ODE stability $(\Re(\lambda(J))<0)$, but fixed $\alpha$ is not enough to kill noise; we additionally require diminishing steps $\sum\alpha_k=\infty, \sum\alpha_k^2<\infty$ so the stabilizing drift is $O(\alpha_k)$ while the accumulated noise energy is $O(\sum\alpha_k^2)<\infty$.
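The contrast can be seen numerically. In the sketch below (illustrative scalar system $\dot x = -x$ with unit-variance Gaussian noise), a constant step leaves a persistent noise floor of order $\sqrt{\alpha}$, while diminishing steps drive the error toward zero:

```python
import random

random.seed(2)

def run(step_fn, n_steps, x0=5.0):
    """x_{k+1} = x_k + alpha_k * (-x_k + noise): noisy Euler for x' = -x."""
    x = x0
    for k in range(1, n_steps + 1):
        x = x + step_fn(k) * (-x + random.gauss(0.0, 1.0))
    return x

# Average final error over 20 independent runs for each step-size rule.
const_mean = sum(abs(run(lambda k: 0.1, 20_000)) for _ in range(20)) / 20
dimin_mean = sum(abs(run(lambda k: 1.0 / k, 20_000)) for _ in range(20)) / 20
print(const_mean, dimin_mean)  # constant step plateaus; diminishing step shrinks
```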

Example: SGD¶

This is a follow-up to the gradient flow example in the ODE Discretization note. Gradient descent and stochastic gradient descent arise from the same gradient flow.

Assume now that the objective function is given as an expectation $$ V(x)=\mathbb{E}_{\xi}[\ell(x,\xi)] $$ where the full gradient $\nabla V(x)$ cannot be computed exactly. Instead, at each iteration we observe an unbiased stochastic gradient $$ \widehat{\nabla V}(x_k)=\nabla \ell(x_k,\xi_k), \qquad \mathbb{E}[\widehat{\nabla V}(x_k)\mid x_k]=\nabla V(x_k) $$

The underlying continuous-time system is still the gradient flow $$ \dot{x}(t)=-\nabla V(x(t)) $$ whose equilibria satisfy $\nabla V(x^\star)=0$. If $x^\star$ is a (strict) local minimizer, then $\nabla^2 V(x^\star)\succ0$, and the equilibrium is asymptotically stable.

Replacing the exact gradient by its noisy observation and allowing a time-dependent step size yields $$ x_{k+1} = x_k - \alpha_k\,\widehat{\nabla V}(x_k) $$ which is stochastic gradient descent (SGD).

This update is a noisy Euler discretization of the gradient flow ODE.

How to choose a proper $\alpha$? In contrast to deterministic gradient descent, a constant step size is no longer sufficient (recall that in deterministic gradient descent we solve a spectral stability condition $|1-\alpha\lambda_i(\nabla^2V(x^\star))|<1$ and get $0<\alpha<2/L$ for $L$-smooth functions), since the noise term does not vanish. Based on the convergence analysis above, we require diminishing step sizes $$ \sum_{k=0}^\infty \alpha_k=\infty, \qquad \sum_{k=0}^\infty \alpha_k^2<\infty $$ Under these conditions, together with standard regularity assumptions (local smoothness and bounded gradient noise variance), the noisy Euler discretization tracks the stable gradient flow, and $x_k \xrightarrow{\text{a.s.}} x^\star$.
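A minimal SGD sketch (with the illustrative objective $V(x) = \mathbb{E}[(x-\xi)^2/2]$, $\xi \sim \mathcal{N}(3,1)$, so $\nabla V(x) = x - 3$ and $x^\star = 3$; these choices are assumptions for the example):

```python
import random

random.seed(3)

def sgd(x0, n_steps):
    """SGD with unbiased stochastic gradient x - xi and steps alpha_k = 1/k."""
    x = x0
    for k in range(1, n_steps + 1):
        xi = random.gauss(3.0, 1.0)
        grad_hat = x - xi                 # unbiased estimate of grad V(x) = x - 3
        x = x - (1.0 / k) * grad_hat      # sum(alpha_k) = inf, sum(alpha_k^2) < inf
    return x

x_final = sgd(0.0, 100_000)
print(x_final)  # close to the minimizer x* = 3
```

With these particular choices the update is exactly the running sample mean of the $\xi_k$, which makes the almost-sure convergence transparent.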