Takeaways:
- Is a change in the action set $\mathcal{A}$ the criterion that distinguishes Stochastic / Adversarial / Contextual bandits?
- What is the core assumption about reward in stochastic bandits?
- Why is pure greedy not suitable for practical use?
- What is the simplest stochastic bandit algorithm that actually "works"?
- Does the gradient bandit algorithm have a "temperature" hyperparameter? If not, how does it balance exploration and exploitation?
- In the gradient updates of the gradient bandit algorithm, what is the point of subtracting a baseline?
- Is UCB a randomized or deterministic policy? What about Thompson Sampling?
- How does UCB balance exploration / exploitation?
- How does Thompson Sampling balance exploration / exploitation?
- If we want continuous and smooth exploration, which algorithm should we use?
--- What Is a Stochastic Bandit?
In a classic Stochastic Bandit problem (also called the multi-armed bandit problem, although that name is somewhat unfortunate, since the other bandit settings such as adversarial and contextual bandits are also multi-armed), the learner and environment follow the standard bandit interaction protocol:
- At each round $t$, the learner selects an action $A_t \in \mathcal{A}$.
- The environment reveals the reward $X_t$ for the chosen action only (partial feedback).
- The learner updates its policy and proceeds to the next round.
What distinguishes Stochastic Bandits is the assumption that each arm has a fixed, unknown reward distribution: $$ \{\nu_a : a\in\mathcal{A}\}, \qquad \text{with rewards drawn IID over time.} $$
The distributions $\nu_a$ do not change over time and are independent across arms.
Key words: IID, stationary, fixed distribution.
The main criterion distinguishing Stochastic, Adversarial, and Contextual settings is how the reward is generated.
Notice that whether the action set $\mathcal{A}$ changes is not the criterion for distinguishing Stochastic, Adversarial, and Contextual bandits. The main difference lies in how rewards are generated. For classic Stochastic Bandits, $\mathcal{A}$ is fixed over time, but in many extensions and variants, the action set $\mathcal{A}$ can change. For example, in Sleeping Bandits, some arms may not be available for selection in certain rounds (they are "asleep").
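To make the protocol concrete, here is a minimal sketch of a stochastic bandit environment; the class name `StochasticBanditEnv` and the Bernoulli arms with means 0.2 / 0.5 / 0.8 are illustrative assumptions, mirroring the toy setup used in the implementation section later.

```python
import numpy as np

class StochasticBanditEnv:
    """Each arm has a fixed, unknown reward distribution; rewards are drawn IID over time."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # one fixed Bernoulli mean per arm (hidden from the learner)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # Partial feedback: only the chosen arm's reward is revealed.
        return float(self.rng.random() < self.means[arm])

env = StochasticBanditEnv([0.2, 0.5, 0.8])
for t in range(6):
    A_t = t % 3          # placeholder policy; a real algorithm chooses adaptively
    X_t = env.pull(A_t)  # the learner observes only this reward, then updates its policy
```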
Example
To illustrate the core distinctions among stochastic, adversarial, and contextual bandits, we carefully design a single application scenario and instantiate it under all three settings, highlighting how the same setup behaves differently depending on how rewards are generated. For the corresponding examples in the Adversarial Bandits and Contextual Bandits notes, please refer to the parallel sections in those notebooks.
Setup: Online News Recommendation
At each round (for each user visit), the system must select one article to recommend from a fixed set, and then observe whether the user clicks.
Action set: $$ \mathcal{A} = \{\text{Article 1}, \text{Article 2}, \dots, \text{Article K}\} $$
Reward: $$ X_t = \begin{cases} 1 & \text{(clicked)} \\ 0 & \text{(not clicked)} \end{cases} $$
Stochastic Bandit Variation (Fixed, Unknown Click-Through Rates)
- Setting: Each article has a fixed but unknown click-through rate (CTR) $$ \nu_a = \text{Bernoulli}(p_a), \quad p_a \text{ is an unknown constant} $$ Throughout all rounds of the process, the reward for an Article always comes from the same fixed distribution, and is IID over time. That is, the user population is "stable" and the attractiveness of the articles does not change over time.
- Key points that match the definition:
- IID: Each time you choose article a, the click/no click outcome is an independent sample.
- stationary: $p_a$ does not change over time.
- fixed distribution: Rewards are generated entirely by a fixed Bernoulli distribution.
- In one sentence: The true attractiveness of each article is fixed and unchanging; the system simply needs to gradually learn these fixed CTRs. The learner's goal is to identify the article with the highest CTR.
Note: In this example, "news" can easily be replaced with "ads" or product recommendations, demonstrating the broad applicability of bandit algorithms across various information retrieval and recommendation scenarios.
To gain a clearer understanding, we suggest comparing the definitions and example here with those provided for adversarial and contextual bandits in their respective notes. The table below succinctly summarizes the core differences among the three bandit settings within the context of this example.
| Category | How rewards are generated (the key distinction) | The corresponding variant in the news recommendation example |
|---|---|---|
| Stochastic | Each arm's reward is IID from a fixed distribution. | Each article has a fixed CTR that does not change over time. |
| Adversarial | No distributional assumption for rewards; can change anytime. | The adversary decides clicks at each round; CTR can vary dramatically. |
| Contextual | Reward distribution depends on context. | Different user groups have different CTRs for each article. |
--- Algorithms: Stochastic Bandits Policy
The five classic algorithms for stochastic bandits:
| Algorithm | Exploration Strategy | Theoretical Guarantee | Practical Performance | Engineering Feasibility | Parameter Sensitivity |
|---|---|---|---|---|---|
| Greedy / Optimistic Initialization | Implicit optimism: set initial Q values high | No strict regret guarantee | Good short-term performance, insufficient later exploration | ★★★★ Very easy to implement | ★★★ Sensitive to the initial value |
| ε-greedy | Random exploration: with probability ε choose randomly, otherwise choose the current best | Suboptimal regret bound if ε is fixed; ε decay can achieve $O(\log T)$ | Simple but inefficient exploration, easy to waste exploration | ★★★★★ Very easy to implement | ★★★★ Very sensitive to ε |
| UCB1 | Optimism: mean + confidence bound | Strong theoretical guarantee, $O(\log T)$ regret | Performs well in low-noise, stationary environments | ★★★★ Moderately complex to implement | ★★ Few, stable parameters |
| Thompson Sampling | Bayesian sampling: sample from the posterior | Theoretical guarantees developed later, $O(\log T)$ regret | Usually best in practice, exploration is natural | ★★★★ Requires prior modeling | ★★ Relatively robust to the prior |
| Softmax (Boltzmann) | Weighted random: sample according to $e^{Q/\tau}$ | Weak theoretical support, difficult to guarantee optimal regret | Smoother than ε-greedy, but easily over-explores | ★★★ Easy to implement | ★★★★ Highly sensitive to temperature |
When to use what?
- For small datasets / quick start: ε-Greedy or UCB1.
- For stable performance and scalability: Thompson Sampling is usually the first choice.
- Greedy: use only as a baseline or teaching example.
- Boltzmann: use as a teaching example, and as a foundation for policy gradient methods.
Format
In the following sections, we present each stochastic bandit algorithm using a simple and consistent four-part template. We use the same structure across Stochastic, Adversarial, and Contextual bandits so that algorithms are easy to compare and understand.
Notation:
- Arms: $\mathcal{A} = \{1, \dots, K\}$
- Action: pick $A_t \in \mathcal{A}$
- Reward model: if arm $a$ is picked at round $t$, then $X_t \sim \nu_a$ (IID over time), with mean $\mu_a$
- Optimal arm: $a^* = \arg\max_a \mu_a$
Each algorithm is described through:
Learning Objective: The objective function. The common goal of all Stochastic Bandit algorithms is minimizing regret $$R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_a \mu_a \text{ is a constant},$$ which is equivalent to maximizing the cumulative reward $\mathbb{E}\left[\sum_{t=1}^T X_t\right]$.
Algorithm Design Principle: The key idea or intuition behind how the algorithm handles exploration vs. exploitation and helps achieve the learning objective.
How the Algorithm Works: The concrete decision rule, update formulas, and pseudocode.
Regret Bound: A brief summary of the algorithm's theoretical performance. In regret analysis we care about non-trivial upper bounds on an algorithm's regret (guarantees on how much it can lose) and, for the problem class, lower bounds (how well any algorithm can possibly perform, e.g. the Lai-Robbins bound). Note that with rewards bounded in $[0,1]$, a regret of $\mathcal{O}(T)$ is trivial for any bandit algorithm, since at most 1 can be lost per round; only sublinear bounds such as $\mathcal{O}(\log T)$ are therefore of interest.
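To make the regret definition concrete, here is a minimal sketch that computes realized regret and pseudo-regret for one simulated run; the Bernoulli means and the uniform-random placeholder policy are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # true means, known only to the simulator
T = 1000

actions = rng.integers(0, len(mu), size=T)             # placeholder policy: uniform random
rewards = (rng.random(T) < mu[actions]).astype(float)  # Bernoulli rewards

mu_star = mu.max()
realized_regret = T * mu_star - rewards.sum()      # uses the observed rewards
pseudo_regret = T * mu_star - mu[actions].sum()    # uses the true means of the pulled arms
print(realized_regret, pseudo_regret)              # both should be roughly 0.3 * T here
```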
--- Pure Greedy
Pure greedy is the most naive bandit algorithm; it is not recommended for practical use and mainly serves as a baseline.
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$ Greedy is no exception; it also aims to minimize regret, but the strategy it adopts typically leads to larger regret.
2. Algorithm Design Principle
Core Idea: Always choose the arm with the highest current empirical mean reward. Act as if the estimated means are correct. Pure exploitation.
In the news recommendation example:
- Once an article's estimated CTR is slightly higher, the system keeps recommending it.
- After a few rounds, it gets "locked in" to that article.
3. How the Algorithm Works
- Exploration warm-start:
- Pull each arm at least once (or a few times) to initialize the estimate $\hat{\mu}_a$. (First $K$ rounds are exploration)
- Core rule:
- For $t > K$, each round choose $$ A_t = \arg\max_a \hat{\mu}_a $$
- After receiving the reward, update the empirical mean and count for that arm $$ \hat{\mu}_a \leftarrow \frac{N_a \hat{\mu}_a + X_t}{N_a + 1} $$
4. Regret Bound
If the random initialization is lucky (picked the optimal arm after $K$ rounds), performance may look good. But in the general case: $$ R_T = \mathcal{O}(T) $$
Intuition: Once the algorithm makes a wrong decision early on, it never explores and thus misses the better arm. Let the optimal arm have mean $\mu^*$, and a suboptimal arm have mean $\mu_i = \mu^* - \Delta_i$. There exists a constant probability $p_0 > 0$ such that after the initial $K$ rounds, the algorithm incorrectly identifies arm $i$ as the best and pulls only that arm forever. Then, $$\mathbb{E}[R_T] \ge p_0 \sum_{t=K+1}^T (\mu^*-\mu_i) = p_0 \Delta_i (T-K)=\mathcal{O}(T)$$
Comment: Lacks exploration. If early noise causes it to lock in the wrong arm, it will never correct the mistake.
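To see the lock-in failure in a few lines, here is a minimal sketch of pure greedy on the same three Bernoulli arms used in the implementation section at the end of this note; the means and the random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.2, 0.5, 0.8])
K, T = len(mu), 1000
Q, N = np.zeros(K), np.zeros(K)

# Warm start: pull each arm once to initialize the estimates.
for a in range(K):
    r = float(rng.random() < mu[a])
    Q[a], N[a] = r, 1

# Pure greedy afterwards: no exploration at all.
for t in range(K, T):
    a = int(np.argmax(Q))
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental empirical mean

print(Q.round(3), N)  # if early noise favors a suboptimal arm, N concentrates on it forever
```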
--- Optimistic Initialization
The only difference from Pure Greedy: the initial estimate is optimistic (larger than $\max_a \mu_a$), rather than 0 or a random value.
Comment: this makes the agent more inclined to explore. It is the simplest embodiment of the exploration-by-optimism idea.
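A minimal sketch of the change, under two assumptions of this sketch (not the only implementation choice): rewards lie in $[0,1]$, and the optimistic value is treated as one pseudo-observation so that a plain sample-average update does not erase it after the first pull.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])
K, T = len(mu), 1000

Q = np.full(K, 5.0)   # optimistic: well above any achievable reward in [0, 1]
N = np.ones(K)        # count the optimistic value as one pseudo-observation so it decays gradually

for t in range(T):
    a = int(np.argmax(Q))            # still purely greedy on Q
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]        # sample average pulls Q down toward the true mean

print(Q.round(3), N)  # untried arms keep their optimistic value, so every arm gets tried early
```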
--- ε-Greedy
ε-greedy is the first bandit algorithm that is truly usable and effective as a baseline. In contrast, pure greedy is generally not considered a valid algorithm, because it almost always gets locked into the wrong arm due to early noise; as a result, it is rarely used as a real baseline in practice or research, only occasionally for teaching or comparison. ε-greedy is the first simple method that explicitly balances exploration and exploitation. Recall that the heart of bandits is exploration vs. exploitation.
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$
2. Algorithm Design Principle
Core Idea: Inject a controlled amount of random exploration.
- Most of the time, exploit (choose the currently estimated best arm).
- A small portion of the time, explore (randomly pick any arm).
- Control the exploration-exploitation trade-off by $\varepsilon$. Large $\varepsilon$: explore more; small $\varepsilon$: behave more greedily.
From the perspective of news recommendation:
- Most of the time, recommend the article with the highest estimated CTR.
- Occasionally, randomly recommend other articles to see if there is a "hidden gem".
3. How the Algorithm Works
Fixed ε version:
In each round:
- With probability $1-\varepsilon$: select $\arg\max_a \hat{\mu}_a$;
- With probability $\varepsilon$: randomly select an arm uniformly.
- Empirical mean is updated in the same way as in Greedy.
Decaying ε version (theoretically superior; a code sketch follows the steps below):
Common schedule (more exploration at the beginning, less exploration later): $$ \varepsilon_t = \min\left(1, \frac{cK}{t}\right) \quad \text{or} \quad \varepsilon_t = c / t $$ In each round:
- With probability $1-\varepsilon_t$: select $\arg\max_a \hat{\mu}_a$;
- With probability $\varepsilon_t$: randomly select an arm uniformly.
- Empirical mean is updated in the same way as in Greedy.
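A minimal sketch of the decaying-ε rule above, reusing the same Bernoulli toy arms; the constant $c = 1$ and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])
K, T, c = len(mu), 5000, 1.0
Q, N = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    eps_t = min(1.0, c * K / t)        # decaying exploration probability
    if rng.random() < eps_t:
        a = rng.integers(K)            # explore: uniform over arms
    else:
        a = int(np.argmax(Q))          # exploit: current empirical best
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]          # same incremental mean as Greedy

print(Q.round(3), N)   # pull counts should concentrate on the arm with mean 0.8
```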
4. Regret Bound
- Fixed ε: $$ R_T = \mathcal{O}(T) \quad (\text{because the exploration probability is fixed}) $$
Intuition: There is always a constant probability of wasting pulls on suboptimal arms. In ε-greedy, at each round, with probability ε you explore and with probability 1-ε you exploit. Let the optimal arm have mean reward $\mu^*$, and a suboptimal arm have mean reward $\mu_i = \mu^* - \Delta_i$. When ε > 0 is a constant, each round still pulls each suboptimal arm with probability at least ε/K. Thus, $$\mathbb{E}[R_T] \geq \sum_i \left(\frac{\varepsilon}{K}\right)\Delta_i T = \mathcal{O}(T)$$
- Decaying ε (for example, $\varepsilon_t \propto 1/t$): $$ R_T = \mathcal{O}(\log T) $$
Intuition: Each suboptimal arm is explored about $\mathcal{O}(\log T)$ times. For ε-greedy, at each round, with probability ε you explore and with probability 1-ε you exploit. Let the optimal arm have mean reward $\mu^*$ and a suboptimal arm have mean $\mu_i = \mu^* - \Delta_i$. If ε decays over time (e.g., $\varepsilon_t = 1/t$), the cumulative exploration is approximately $\sum_t \varepsilon_t = \mathcal{O}(\log T)$, so $$\mathbb{E}[R_T] = \mathcal{O}(\log T)$$ (This matches the order of UCB1 and Thompson Sampling, but the constant is worse.)
Comment: Simple to implement and easy to explain, but requires tuning ε and is sensitive to this parameter.
--- Softmax / Boltzmann Exploration
In Stochastic Bandits, Softmax and Boltzmann Exploration fundamentally refer to the exact same algorithm. The different names stem from different historical backgrounds and mathematical viewpoints.
Recall Softmax and Boltzmann Distribution
Given any set of real numbers $$s_1, s_2, \dots, s_n$$
the softmax is defined as $$ \text{softmax}(s_i) = \frac{\exp(s_i)}{\sum_j \exp(s_j)} $$ If a "temperature" parameter is introduced, it becomes the Boltzmann distribution (in short: the Boltzmann distribution is mathematically just a temperature-scaled softmax) $$ \text{softmax}_\tau(s_i) = \frac{\exp(s_i / \tau)}{\sum_j \exp(s_j / \tau)} $$
This algorithm aims to maintain an empirical mean $\hat{\mu}_a$ for each arm, then uses the temperature-scaled softmax (the Boltzmann distribution expression above) to convert these scores into selection probabilities: $$ \Pr(a) = \frac{\exp(\hat{\mu}_a / \tau)}{\sum_b \exp(\hat{\mu}_b / \tau)} $$
Meaning of Each Name
- Softmax: From the machine learning softmax function, emphasizing the mapping of each arm's "score" into a probability distribution.
- Boltzmann Exploration: From statistical physics, where the softmax distribution is precisely the Boltzmann distribution. The temperature parameter $\tau$ controls the level of exploration: high temperature (large $\tau$) makes the distribution approach uniform random (more exploration); low temperature (small $\tau$) makes it approach greedy selection (more exploitation).
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$
2. Algorithm Design Principle
Core Idea: Rather than always picking the arm with the highest empirical mean, convert the empirical means into a probability distribution over actions using the Boltzmann distribution. The temperature parameter $\tau$ controls the exploration-exploitation trade-off.
In the news recommendation example:
- Articles with a higher CTR are more likely to be recommended, but articles with slightly lower CTR can still be shown with some probability
- No arm is ever completely abandoned or excluded from being tried again
3. How the Algorithm Works
At each round, maintain the empirical mean $\hat{\mu}_a$ for each arm, and sample using the softmax:
$$ \Pr(A_t = a) = \frac{\exp(\hat{\mu}_a / \tau)}{\sum_{b=1}^K \exp(\hat{\mu}_b / \tau)} $$
$\tau > 0$: temperature
- $\tau \to \infty$: approaches uniform random selection
- $\tau \to 0$: approaches greedy selection
The update rule for empirical means is the same as in the Greedy / ε-Greedy algorithms.
Decaying temperature version (more common): $\tau_t \downarrow 0 \text{ as } t \to \infty$. For example: $$ \tau_t = \frac{c}{\log t} \quad \text{or} \quad \frac{c}{\sqrt{t}} $$
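A minimal sketch of Boltzmann exploration with a decaying temperature; the schedule constant, the temperature floor, and the Bernoulli arms are illustrative assumptions, and the max-subtraction is only for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])
K, T, c = len(mu), 5000, 0.5
Q, N = np.zeros(K), np.zeros(K)

def boltzmann_probs(Q, tau):
    z = (Q - Q.max()) / tau            # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

for t in range(1, T + 1):
    tau_t = max(c / np.log(t + 1), 1e-2)   # decaying temperature, floored away from zero
    p = boltzmann_probs(Q, tau_t)
    a = rng.choice(K, p=p)
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q.round(3), N, p.round(3))   # the selection probabilities should concentrate on the best arm
```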
4. Regret Bound
This is where Softmax / Gradient Bandit is relatively weaker than UCB.
Fixed temperature ($\tau$ is a constant): $$ R_T = \mathcal{O}(T) $$
Intuition: The reason is similar to the fixed-ε greedy strategy: there is always a fixed probability of choosing a suboptimal arm.
Decaying temperature (carefully tuned): $$ R_T = \mathcal{O}(\log T) \quad \text{(under certain conditions and careful tuning)} $$
The theoretical analysis is much more complicated than for UCB, and many results are asymptotic or conditional; for now, just remember that the regret is highly sensitive to the decay rate of $\tau$. Note: In theory, Softmax / Gradient Bandit is weaker than UCB. It can achieve logarithmic regret, but unlike UCB, there is no "guarantee by design." Only with careful design can the regret be $\mathcal{O}(\log T)$, and the constant factor is usually larger than UCB's.
Comment: The advantage is that it offers a continuous and smooth exploration method, which other stochastic bandit algorithms cannot achieve (their decisions are all "jumpy"); the drawback is that it is not parameter-free and is very sensitive to temperature / learning rate. In pure Stochastic Bandits, the Gradient Bandit is rarely used as the final algorithm alone; but it serves as the direct prototype for Contextual Bandit and Policy Gradient methods, and thus has great educational value.
--- Gradient Bandit
Gradient Bandit does not directly estimate the mean $\mu_a$ of each arm; instead, it learns a stochastic policy $\pi(a)$ which increases the probability of choosing better arms and decreases it for inferior ones. It is the simplest prototype of Policy Gradient / REINFORCE in the bandit setting.
The form you saw in Softmax / Boltzmann Exploration, $$ \Pr(a)\propto \exp(\text{score}_a/\tau) $$ also appears in the Gradient Bandit, except that the "score" here is not the empirical mean $\hat\mu_a$ but a learnable preference parameter $\theta_a$, and $\theta$ is updated by gradient ascent (not by mean updates). Notice that the bandit literature usually writes the preference parameter as $H$ rather than $\theta$; but since we treat the Gradient Bandit as the prototype of RL methods, we write it with $\theta$ to match the RL literature.
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$ (Equivalently, you can understand it as: at each step, try to maximize the expected reward.)
2. Algorithm Design Principle
Core Idea: Learn the parameters $\theta = (\theta_1, ..., \theta_K)$ of a softmax policy.
- Maintain a "preference" $\theta_a \in \mathbb{R}$ for each arm.
- Use softmax to convert these preferences into probabilities of selecting an arm: $$ \pi_{\theta}(a)=\Pr(A_t=a)=\frac{\exp(\theta_a)}{\sum_{b=1}^K \exp(\theta_b)} $$
- After observing the reward, perform gradient ascent to update $\theta$ so that the probability of actions with high reward increases and the probability of other actions decreases.
In the news recommendation example:
- Instead of "always recommending the article with the highest CTR", maintain a probability distribution over articles.
- If the feedback (click/time spent) from recommending an article is higher than expected, increase the probability of recommending it next time; decrease it if the feedback is lower than expected.
3. How the Algorithm Works
Maintain $\theta = (\theta_1, ..., \theta_K)$ and define the policy: $$ \pi_\theta(a)=\frac{e^{\theta_a}}{\sum_{b=1}^K e^{\theta_b}}. $$ (Optional: you can also introduce a "temperature/scaling" parameter $\tau$: $\pi(a)\propto e^{\theta_a/\tau}$. In many implementations this is absorbed into the learning rate.)
Initialization: set $\theta_1, ..., \theta_K=0$, $\bar X=0$
For each round $t=1,2,\dots$:
- Compute $\pi_{\theta}(i)$ for every arm $i$.
- Sample $A_t \sim \pi$, observe reward $X_t$.
- Update the baseline (here we use the running average $\bar X_t \triangleq \frac{1}{t}\sum_{s=1}^t X_s$): $$ \bar X \leftarrow \bar X + \frac{1}{t}(X_t-\bar X) $$
- For each arm $i$, update: $$ \theta_i \leftarrow \theta_i + \alpha (X_t-\bar X)\big(\mathbf{1}\{i=A_t\}-\pi_{\theta}(i)\big) $$
Full derivation of the gradient update:
Objective at each step: Adjust the policy parameter $\theta$ to increase expected reward. The most common approach is to maximize $$J(\theta) = \mathbb{E}_{A\sim \pi_{\theta}} [R(A)] = \mathbb{E}_{A_t\sim\pi_{\theta}}\big[\mathbb{E}[X_t\mid A_t]\big] = \sum_a \pi_{\theta}(a) \mu_a$$ Here $R(A)$ is the stochastic reward received by pulling arm $A$, and $\mu_a=\mathbb{E}[X\mid A=a]$ is the true mean of arm $a$.
Step 1: REINFORCE
REINFORCE is a policy gradient algorithm, first proposed by Ronald J. Williams (1992). Its core idea is: directly estimate the policy gradient using sampled rewards. If an action yields a high reward, increase ("reinforce") its probability in the policy; if it yields a low reward, decrease its probability.
Using the log-derivative identity $\nabla \pi(a)=\pi(a)\nabla \log \pi(a)$ (valid whenever $\pi(a)>0$): $$\nabla_{\theta} J({\theta})= \sum_a \mu_a \nabla_{\theta} \pi_{\theta}(a) =\sum_a \pi_{\theta}(a) \mu_a \nabla_{\theta} \log \pi_{\theta}(a) = \mathbb{E}_{A\sim \pi_{\theta}}\left[\mu_A \nabla_{\theta} \log \pi_{\theta}(A)\right]$$ By the law of total expectation, since $\mu_a=\mathbb{E}[X\mid A=a]$, $$\nabla_{\theta} J({\theta}) = \mathbb{E}\left[X_t \nabla_{\theta} \log \pi_{\theta}(A_t)\right]$$ Since this expectation cannot be computed directly, we use a single sample as an unbiased estimator for the gradient update: $$\nabla_{\theta} J(\theta)\approx X_t \nabla_{\theta} \log \pi_{\theta}(A_t)$$ Note that because this estimator is unbiased, the update is valid stochastic gradient ascent, which converges under the Robbins-Monro step-size conditions.
Step 2: Add the baseline trick (does not change the expected gradient, but can significantly reduce variance)
For any $b$ that does not depend on action $A$, $$\mathbb{E}_{A\sim\pi_{\theta}}\left[b\nabla_\theta\log\pi_\theta(A)\right] = \sum_a \pi_\theta(a)b\nabla_\theta\log\pi_\theta(a) = \sum_a b\pi_\theta(a)\frac{\nabla_\theta\pi_\theta(a)}{\pi_\theta(a)} = b\nabla_\theta\sum_a \pi_\theta(a)$$ $$b\nabla_\theta\sum_a \pi_\theta(a)=b\nabla_\theta 1=0$$ Therefore, $$\nabla_{\theta} J(\theta) = \mathbb{E}_{A\sim \pi_{\theta}}\left[(X_t-b) \nabla_{\theta} \log \pi_{\theta}(A)\right] \approx (X_t-b) \nabla_{\theta} \log \pi_{\theta}(A_t)$$ In practice, $b=\bar X$ (the running mean reward up to now) is commonly used as the baseline.
Step 3: Obtain the final preference update (gradient ascent)
Given $$\pi_\theta(a)=\frac{e^{\theta_a}}{\sum_j e^{\theta_j}}, \qquad \log\pi_\theta(a)=\theta_a-\log\Big(\sum_j e^{\theta_j}\Big)$$ Then $$\frac{\partial}{\partial \theta_i}\log\pi_\theta(a) = \frac{\partial}{\partial \theta_i}\theta_a - \frac{1}{\sum_j e^{\theta_j}}\frac{\partial}{\partial \theta_i}\sum_j e^{\theta_j} = \mathbf{1}\{i=a\}-\frac{e^{\theta_i}}{\sum_j e^{\theta_j}} = \mathbf{1}\{i=a\}-\pi_\theta(i)$$ Substitute back into the Gradient Bandit update: $$\theta_i \leftarrow \theta_i + \alpha (X_t-b_t)\big(\mathbf{1}\{i=A_t\}-\pi_\theta(i)\big)$$ This is exactly the core structure of the Gradient Bandit / Policy Gradient: the gradient of the log-softmax is +1 for the chosen action minus the selection probability for every action. Also, if an arm's reward is higher than the baseline, its preference increases; if it is lower than the baseline, its preference decreases.
Notice: Since the policy gradient's update magnitude is proportional to the reward, if the reward scale is too large, the gradient variance and parameter step size will be proportionally amplified, leading to unstable or even divergent training.
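Putting the pieces together, here is a minimal sketch of the full Gradient Bandit loop with a running-average baseline; the Bernoulli arms and the learning rate $\alpha = 0.1$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])
K, T, alpha = len(mu), 5000, 0.1

theta = np.zeros(K)     # preferences
baseline = 0.0          # running average of observed rewards

def softmax(theta):
    z = theta - theta.max()          # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

for t in range(1, T + 1):
    pi = softmax(theta)
    a = rng.choice(K, p=pi)
    x = float(rng.random() < mu[a])                  # observed reward
    baseline += (x - baseline) / t                   # running-average baseline
    onehot = np.zeros(K)
    onehot[a] = 1.0
    theta += alpha * (x - baseline) * (onehot - pi)  # REINFORCE-style preference update

print(softmax(theta).round(3))   # the policy should put most probability on the arm with mean 0.8
```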
How the Gradient Bandit Controls Exploration-Exploitation:
- Scale of $\theta$ (implicit temperature): The softmax policy $\pi(a)\propto e^{\theta_a}$ becomes more peaked as the magnitude and relative differences of $\theta$ grow. Large $|\theta|$ corresponds to a low effective temperature (more exploitation), while small $|\theta|$ yields a flatter distribution (more exploration).
- Learning rate $\alpha$ (speed of polarization): The learning rate controls how fast the preferences $\theta$ separate. A larger $\alpha$ accelerates concentration on high-reward actions (faster exploitation), while a smaller $\alpha$ keeps the policy smoother for longer (more exploration).
4. Regret Bound
Because Gradient Bandit uses a softmax policy, its regret order mirrors Boltzmann exploration: linear with fixed parameters, logarithmic with careful decay, but with weaker and harder-to-prove guarantees.
Softmax / Boltzmann Exploration and Gradient Bandit are equivalent in their policy form, but not strictly equivalent in their parameter update dynamics in optimization; the Gradient Bandit can be seen as using policy gradients to implicitly learn parameters that correspond to the "softmax value function".
The action distributions of both methods are essentially: $$\pi(a) = \frac{\exp(\text{score}_a)}{\sum_b \exp(\text{score}_b)}$$ In Softmax / Boltzmann Exploration: $$\text{score}_a = \frac{\hat{\mu}_a}{\tau}$$ In Gradient Bandit: $$\text{score}_a = \theta_a$$ As long as $$\theta_a \approx \frac{\hat{\mu}_a}{\tau}$$ the policies they induce are exactly the same.
Comment: In pure stochastic bandit problems, the Gradient Bandit is usually not the "final winner" (UCB/TS are more stable, require less tuning, and offer better guarantees), but it is the direct prototype of Policy Gradient / Reinforcement Learning methods: it has very high educational and generalization value.
--- UCB (Upper Confidence Bound)
The essence of UCB is optimism: for each arm, construct an Upper Confidence Bound and select the arm with the highest Upper Confidence Bound in each round. All variants of UCB simply differ in how they model this confidence bound. This note focuses primarily on UCB1, while other variants are only introduced conceptually here:
- UCB2: Reduces the frequency of switching arms by employing an "epoch-based" strategy, which significantly lowers the exploration-induced switching costs while maintaining sublinear regret.
- KL-UCB: constructs a tighter confidence interval using KL-divergence (especially effective for Bernoulli rewards).
- MOSS (Minimax Optimal Strategy in the Stochastic case): achieves the optimal regret upper bound in the minimax sense.
- Linear / Generalized UCB: extends UCB to contextual settings (covered in the Contextual notes).
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$
2. Algorithm Design Principle
Core Idea: Explore when uncertain, exploit when confident. For each arm, construct an Upper Confidence Bound, which consists of the empirical mean plus an uncertainty bonus. The uncertainty bonus is larger when there are fewer samples or at early times, and shrinks as more samples are collected or as time progresses. This allows the algorithm to automatically try little-tested arms more in the beginning, and gradually shift focus to the optimal arm. Unlike "explore-then-exploit" methods, UCB simultaneously balances exploration and exploitation in every step, automatically adjusting the ratio.
In the news recommendation example:
- At first, all articles are uncertain, so each is recommended in turn.
- Later, the algorithm focuses more on articles with higher UCB scores.
3. How the Algorithm Works
The classic UCB1 algorithm is designed for bounded rewards in $[0,1]$: $$ \text{UCB}_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2 \log t}{N_a(t)}} $$
- $\hat{\mu}_a(t)$: the average reward of arm $a$ up to the current time
- $N_a(t)$: the number of times arm $a$ has been selected
Algorithm steps (a code sketch follows this list):
- Initialization: pull each arm once.
- For $t > K$:
- For each $a$, compute its UCB score;
- $A_t = \arg\max_a \text{UCB}_a(t)$;
- Pull arm $A_t$, observe the reward, update $\hat{\mu}_a$ and $N_a$.
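A minimal sketch of the UCB1 rule above on the same Bernoulli toy arms; the arm means and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # Bernoulli means, rewards bounded in [0, 1]
K, T = len(mu), 5000
Q, N = np.zeros(K), np.zeros(K)

# Initialization: pull each arm once.
for a in range(K):
    r = float(rng.random() < mu[a])
    Q[a], N[a] = r, 1

for t in range(K + 1, T + 1):
    ucb = Q + np.sqrt(2.0 * np.log(t) / N)   # empirical mean + confidence bonus
    a = int(np.argmax(ucb))
    r = float(rng.random() < mu[a])
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q.round(3), N)   # suboptimal arms should be pulled only O(log T) times
```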
4. Regret Bound
Classic UCB1 regret bound: $$ R_T = \mathcal{O}(\log T) $$
Intuition: decompose regret as $$R_T=\sum_{a\neq a^*}\Delta_a\mathbb{E}[N_a(T)]$$ The key is to estimate $N_a(T)$. Why will each suboptimal arm be pulled only $O(\log T / \Delta_a^2)$ times?
- If a suboptimal arm $a$ is selected at time $t$, it must be that $$\hat\mu_a(t)+\sqrt{\tfrac{2\log t}{N_a(t)}} \ge \hat\mu_{a^*}(t)+\sqrt{\tfrac{2\log t}{N_{a^*}(t)}}$$ This can happen only if one of the following low-probability estimation-error events occurs, or the confidence bonus of arm $a$ is still large (the case handled in the next bullet): $$\hat\mu_a(t)\ge \mu_a+\sqrt{\tfrac{2\log t}{N_a(t)}} \quad\text{or}\quad \hat\mu_{a^*}(t)\le \mu^*-\sqrt{\tfrac{2\log t}{N_{a^*}(t)}}$$ (By Hoeffding's inequality $\Pr\big(|\hat\mu_a - \mu_a| \ge \delta\big)\le 2 \exp(-2 N \delta^2)$, the empirical mean concentrates to the true mean at rate $O(\exp(-N\delta^2))$, so $\Pr\left(\left|\hat{\mu}_a - \mu_a\right| \ge \sqrt{\frac{2\log t}{N_a}}\right) \le t^{-4}$. Therefore, the total number of such events across all $t$ is very small, only constant order.)
- Therefore, with high probability, the only reason for selection is $$\sqrt{\tfrac{2\log t}{N_a(t)}} \gtrsim \Delta_a$$ Solving gives $$N_a(T)=O\left(\tfrac{\log T}{\Delta_a^2}\right) \quad\Rightarrow\quad R_T=O\left(\sum_a\tfrac{\log T}{\Delta_a}\right)$$
Comment: Theoretically elegant, parameter-free, widely used in practice, and one of the most classic algorithms for stochastic bandits.
--- Thompson Sampling
Thompson Sampling always means sampling model parameters from their posterior distribution. The difference between various Thompson Sampling methods lies only in what the parameter is and what the posterior distribution looks like. For example, in Beta-Bernoulli Thompson Sampling, we sample from the Beta posterior to obtain the Bernoulli parameter; in Gaussian Thompson Sampling, we sample each arm's mean from its Gaussian posterior (typically when each arm's rewards are Gaussian as well). But the core logic is always: sample → choose → update.
Here we present Beta-Bernoulli Thompson Sampling as the canonical example. Other reward models are briefly mentioned below:
- Gaussian rewards + Normal-Normal conjugate TS.
- Poisson rewards + Gamma-Poisson TS, etc.
- Extension of TS to contextual bandits: posterior sampling for linear models, etc. (see Contextual Bandits note)
1. Learning Objective
The common objective of all Stochastic Bandit algorithms is: $$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right], \qquad \mu^* = \max_{a\in\mathcal{A}} \mu_a. $$
2. Algorithm Design Principle
Core Idea: First, randomly sample a "possible world" (parameter), then make the optimal decision in that world. This provides an automatic balance of exploration and exploitation: if an arm's posterior is uncertain (wide), it is more likely to sample a high value and thus be explored; if an arm's posterior is well-known and concentrated, it leads to more stable exploitation.
News recommendation perspective:
- Maintain a Beta distribution for the CTR of each article,
- Each time a user visits, sample the CTR for each article and recommend the one with the highest sampled CTR.
3. How the Algorithm Works
Model assumption: Bernoulli reward + Beta prior
- Prior: $\theta_a \sim \text{Beta}(\alpha_a, \beta_a)$
- Reward: $X_t \mid A_t=a, \theta_a \sim \text{Bernoulli}(\theta_a)$
Algorithm steps (a code sketch follows this list):
- Initialize $(\alpha, \beta)$ for all arms (e.g., $(1,1)$ uniform prior).
- For each round:
- For each arm, sample $\tilde{\theta}_a \sim \text{Beta}(\alpha_a, \beta_a)$;
- Select $A_t = \arg\max_a \tilde{\theta}_a$;
- Observe reward $X_t \in \{0,1\}$, update the chosen arm's $(\alpha_a, \beta_a)$: $\alpha_a \leftarrow \alpha_a + X_t$, $\beta_a \leftarrow \beta_a + (1-X_t)$
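A minimal sketch of Beta-Bernoulli Thompson Sampling; the arm CTRs and the horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])   # true CTRs, unknown to the learner
K, T = len(mu), 5000
alpha = np.ones(K)               # Beta(1, 1) uniform prior for every arm
beta = np.ones(K)

for t in range(T):
    theta = rng.beta(alpha, beta)        # sample one "possible world" per arm
    a = int(np.argmax(theta))            # act optimally in the sampled world
    x = float(rng.random() < mu[a])      # observe a Bernoulli reward
    alpha[a] += x                        # conjugate posterior update
    beta[a] += 1.0 - x

posterior_means = alpha / (alpha + beta)
print(posterior_means.round(3))          # the best arm's posterior mean should be close to 0.8
```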
4. Regret Bound
Classic result (for Bernoulli bandits): $$ R_T = O(\log T) $$ Moreover, Thompson Sampling is asymptotically optimal in many settings, matching the order of the Lai-Robbins lower bound.
Intuition: regret decomposition
$$R_T = \sum_{a \neq a^*} \Delta_a\, \mathbb{E}[N_a(T)]$$
The key: estimate $N_a(T)$. Why will each suboptimal arm be pulled only $O(\log T/\Delta_a^2)$ times?
- If a suboptimal arm $a$ is selected at time $t$ (i.e., $\tilde\theta_a(t) \ge \tilde\theta_{a^*}(t)$), then necessarily $$\tilde\theta_a(t) \ge \mu_a + \tfrac{\Delta_a}{2} \quad \text{or} \quad \tilde\theta_{a^*}(t) \le \mu^* - \tfrac{\Delta_a}{2}$$ (Similar to the UCB analysis: by a Hoeffding-type concentration bound for the posterior, $\Pr(\tilde\theta_a - \mu_a \ge \delta) \le \exp(-cN\delta^2)$, so when $N_a \gtrsim \log t/\Delta_a^2$ this probability is $\le t^{-4}$, very small, and summing over all $t$ yields a constant.)
- Therefore, with high probability, the only way $a$ gets selected is when $$N_a(t) \lesssim \tfrac{\log t}{\Delta_a^2}$$
Solving gives $$N_a(T) = O\left(\tfrac{\log T}{\Delta_a^2}\right) \quad \Rightarrow \quad R_T = O\left(\sum_a \tfrac{\log T}{\Delta_a}\right)$$ This is similar to UCB, which explores until the uncertainty shrinks to the scale of the gap $\Delta_a$. Specifically, UCB explores while the confidence interval still covers the optimal arm; TS explores while the posterior still places enough probability on the arm appearing optimal.
Comment: Asymptotically optimal, matching the Lai-Robbins lower bound up to constants. In practice, Thompson Sampling often performs extremely well (especially in more complex or noisy environments). It also requires no manual tuning of confidence intervals or exploration rates, and is often the default choice for stochastic bandits.
--- Algorithms Implementation
Environment settings:
- 3 arms
- Success probabilities: $$ p_1=0.2,\quad p_2 = 0.5,\quad p_3=0.8 $$
- Reward follows Bernoulli($p_a$)
Intuitive Results (you'll see these after running the code below)
| Method | Behavior | Outcome |
|---|---|---|
| Greedy | Always selects the arm with current highest mean | Easily gets stuck in local optima |
| Optimistic Initialization | High initial values drive early exploration, then switches to greedy | Stable early exploration, rapid convergence later |
| Softmax (Boltzmann) | Chooses arms probabilistically, softer exploration | Good exploration-exploitation balance, performs well |
| ε-greedy | Random exploration at a fixed rate | Simple but generally less efficient |
| UCB | Exploration decreases quickly, driven by "confidence" | Fast convergence, rational exploration |
| Thompson Sampling | Sample-based exploration, smoother and more natural | Usually strongest and most stable in practice |
import numpy as np
import matplotlib.pyplot as plt

# --- Environment ---
true_probs = np.array([0.2, 0.5, 0.8])  # True probabilities (unknown to agent)
n_arms = len(true_probs)
T = 1000

def pull_arm(arm):
    return 1 if np.random.rand() < true_probs[arm] else 0

# --- General Bandit execution function ---
def run_bandit(algo, T=1000, epsilon=0.1, Q_init=0.0, c=2, tau=0.1):
    Q = np.ones(n_arms) * Q_init  # Value estimates
    N = np.zeros(n_arms)          # Number of times each arm is chosen
    rewards = []

    # Only UCB requires initial exploration
    if algo == 'ucb':
        for a in range(n_arms):
            r = pull_arm(a)
            Q[a] = r
            N[a] = 1
            rewards.append(r)

    # Thompson Sampling parameters
    if algo == 'thompson':
        alpha = np.ones(n_arms)
        beta = np.ones(n_arms)

    start = n_arms if algo == 'ucb' else 0
    for t in range(start, T):
        # --- Select action ---
        if algo == 'greedy':
            action = np.argmax(Q)
        elif algo == 'optimistic':
            # Same as greedy, but Q_init is high; note that with a sample-average update
            # the optimistic value is largely overwritten after an arm's first pull.
            action = np.argmax(Q)
        elif algo == 'epsilon':
            if np.random.rand() < epsilon:
                action = np.random.randint(n_arms)
            else:
                action = np.argmax(Q)
        elif algo == 'ucb':
            ucb_values = Q + c * np.sqrt(np.log(t + 1) / (N + 1e-6))
            action = np.argmax(ucb_values)
        elif algo == 'softmax':
            # Boltzmann distribution
            exp_values = np.exp(Q / tau)
            probs = exp_values / np.sum(exp_values)
            action = np.random.choice(np.arange(n_arms), p=probs)
        elif algo == 'thompson':
            theta = np.random.beta(alpha, beta)
            action = np.argmax(theta)
        else:
            raise ValueError("Unknown algorithm")

        # --- Execute action and obtain reward ---
        r = pull_arm(action)
        rewards.append(r)

        # --- Update estimates ---
        N[action] += 1
        if algo == 'thompson':
            # Beta distribution update (Q is not used by Thompson Sampling and stays at Q_init)
            alpha[action] += r
            beta[action] += 1 - r
        else:
            # Standard incremental average update
            Q[action] += (r - Q[action]) / N[action]

    return np.array(rewards), Q, N

# --- Run all algorithms ---
algorithms = {
    "Greedy": {"algo": "greedy", "Q_init": 0.0},
    "Optimistic": {"algo": "optimistic", "Q_init": 1.0},
    "ε-Greedy": {"algo": "epsilon", "epsilon": 0.1},
    "UCB": {"algo": "ucb", "c": 2},
    "Softmax": {"algo": "softmax", "tau": 0.1},
    "Thompson": {"algo": "thompson"},
}

results = {}
for name, params in algorithms.items():
    rewards, Q, N = run_bandit(**params, T=T)
    cumulative = np.cumsum(rewards)
    avg_reward = cumulative / np.arange(1, T + 1)
    results[name] = avg_reward
    # "Best Arm" is read off Q, so it is only meaningful for the Q-based methods;
    # for Thompson, the posterior means alpha / (alpha + beta) identify the best arm.
    print(f"{name:12s} → Final Q: {Q.round(3)}, Best Arm: {np.argmax(Q)}, Total Reward: {np.sum(rewards)}")
Greedy → Final Q: [0.198 0. 0. ], Best Arm: 0, Total Reward: 198
Optimistic → Final Q: [0.219 0. 0. ], Best Arm: 0, Total Reward: 219
ε-Greedy → Final Q: [0.222 0.344 0.759], Best Arm: 2, Total Reward: 625
UCB → Final Q: [0.111 0.482 0.81 ], Best Arm: 2, Total Reward: 749
Softmax → Final Q: [0. 0. 0.798], Best Arm: 2, Total Reward: 796
Thompson → Final Q: [0. 0. 0.], Best Arm: 0, Total Reward: 811
# --- Plot comparison ---
plt.figure(figsize=(10,6))
for name, avg_reward in results.items():
    plt.plot(avg_reward, label=name)
plt.title("Average Reward Comparison of Bandit Algorithms")
plt.xlabel("Steps")
plt.ylabel("Average Reward")
plt.legend()
plt.grid(True)
plt.show()
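As an optional extension (not part of the original experiment), the same runs can also be compared in terms of cumulative regret against the best arm's mean, reusing the `results`, `true_probs`, and `T` defined above.

```python
# Optional: cumulative (realized) regret, reusing the average-reward curves above.
plt.figure(figsize=(10, 6))
steps = np.arange(1, T + 1)
best_mean = true_probs.max()
for name, avg_reward in results.items():
    cumulative_reward = avg_reward * steps          # undo the running average
    regret = best_mean * steps - cumulative_reward  # t * mu* minus rewards collected so far
    plt.plot(regret, label=name)
plt.title("Cumulative Regret Comparison")
plt.xlabel("Steps")
plt.ylabel("Cumulative Regret")
plt.legend()
plt.grid(True)
plt.show()
```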
--- Applications
Summary: As long as the problem consists of repeated independent trials + a fixed action set + IID random rewards, it can be regarded as a Stochastic Bandit.
Online advertising/recommendation systems (in a static environment)
- Action: Display ads/recommend content
- Reward: Clicked/not clicked (CTR)
- Independent sample: Each user visit is a new independent trial
- Feature: Perfectly fits the bandit assumptions
Clinical trials
- Action: Assign drug A/B/C
- Reward: Treatment success/failure
- Independent sample: Each patient is an independent trial
- Feature: Bandit is applicable; classic scenario for TS algorithm
A/B testing optimization (experimental design)
- Action: Choose version A/B/C
- Reward: Conversion rate
- Independent sample: Each user is independent
- Feature: Typical bandit application scenario