Adversarial Bandits

Takeaways:

  1. Is a change in the action set $\mathcal{A}$ the criterion that distinguishes Stochastic / Adversarial / Contextual bandits?
  2. What is the core assumption about reward in adversarial bandits?
  3. Can actions in adversarial bandits influence future rewards? If actions do affect future rewards, does that mean I can consider the "history" as a state? Wouldn't that make Adversarial Bandit essentially the same as RL/MDP?
  4. What is the difference between the regret function in adversarial bandits and stochastic bandits?
  5. What are the two tweaks EXP3 makes to adapt the Hedge algorithm to the adversarial bandit setting?
  6. Why do we need to require $x_{t,a}\in[0, 1]$ in Hedge?
  7. Why do we need to require $X_t\in[0, 1]$ in EXP3?
  8. What is the key difference between EXP3-IX and EXP3?

--- What Is an Adversarial Bandit?¶

In a classic Adversarial Bandit problem, the learner and environment follow the standard bandit interaction protocol:

  • At each round $t$, the learner selects an action $A_t \in \mathcal{A}$.
  • The environment reveals the reward $X_t$ for the chosen action only (partial feedback).
  • The learner updates its policy and proceeds to the next round.

What distinguishes Adversarial Bandits is that the reward process is unlearnable — no assumptions are made about it, and it can change at any time. The reason it is called "adversarial" is that its design philosophy assumes the worst-case scenario: the adversary can change the rules at any time, always aiming to make things as difficult as possible for the learner. Thus, the learner cannot rely on any fixed law governing the rewards. Even if the reward currently behaves like a stable function, the learner must always assume the adversary can change it at any moment. That is,

$$ X_t \quad \text{is chosen by an adversary} $$

The goal of Adversarial Bandits is not prediction, but worst-case robust performance.

Key words: no distribution assumptions, adversary, worst-case, non-stationary

The main criterion distinguishing Stochastic, Adversarial, and Contextual settings is how the reward is generated.

Notice that whether the action set $\mathcal{A}$ changes is not the criterion for distinguishing Stochastic, Adversarial, and Contextual bandits. The main difference lies in how rewards are generated. For classic Adversarial Bandits, $\mathcal{A}$ is fixed over time. The adversary can change the reward, but cannot change the action set available to the learner. However, in many extensions and variants, the action set $\mathcal{A}$ can change. For example, in Sleeping Adversarial Bandits, some arms might not be available for selection in certain rounds (they are "asleep").

Example¶

To illustrate the core distinctions among stochastic, adversarial, and contextual bandits, we carefully design a single application scenario and instantiate it under all three settings, highlighting how the same setup behaves differently depending on how rewards are generated. For the corresponding examples in the Stochastic Bandits and Contextual Bandits notes, please refer to the parallel sections in those notebooks.

Setup: Online News Recommendation

At each round (for each user visit), the system must select one article to recommend from a fixed set, and then observe whether the user clicks.

Action set: $$ \mathcal{A} = \{\text{Article 1}, \text{Article 2}, \dots, \text{Article K}\} $$

Reward: $$ X_t = 1 \text{ (clicked)},\quad 0 \text{ (not clicked)} $$

Adversarial Bandit Variation (the adversary can change click rates at any time)

  • Setting: In each round, the click outcome for each article is decided by an adversary. $$ X_t(a) \in \{0,1\} \text{ is chosen adversarially} $$ The reward mechanism is unlearnable; the adversary can change the rules at any time (always ready to act in the way most unfavorable to the learner). Therefore, the learner cannot rely on any fixed reward-generating law. For example: a competitor deliberately tries to sabotage your recommendation system, or sudden social events cause large fluctuations in user interests.
  • Core points strictly adhering to the definition:
    • no distribution assumption: the reward does not come from any fixed distribution.
    • non-stationary: the reward distribution can change at any time.
    • worst-case: the algorithm aims for robust regret against any reward sequence.
  • In one sentence: we no longer trust that the CTR has a stable structure; we must maintain robustness against any adversarial reward sequence.

Note: In this example, "news" can easily be replaced with "ads" or product recommendations—demonstrating the broad applicability of bandit algorithms across various information retrieval and recommendation scenarios.

To gain a clearer understanding, we suggest comparing the definitions and example here with those provided for contextual and stochastic bandits in their respective notes. The table below succinctly summarizes the core differences among the three bandit settings within the context of this example.

| Category | How rewards are generated (the key distinction) | The corresponding variant in the news recommendation example |
|---|---|---|
| Stochastic | Each arm's reward is i.i.d. from a fixed distribution. | Each article has a fixed CTR that does not change over time. |
| Adversarial | No distributional assumption for rewards; they can change anytime. | The adversary decides clicks at each round; the CTR can vary dramatically. |
| Contextual | The reward distribution depends on the context. | Different user groups have different CTRs for each article. |
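
For concreteness, here is a minimal sketch (with hypothetical click rates, using NumPy) of how the click reward for the same article set would be generated under each of the three settings. All numbers and names (`fixed_ctr`, `ctr_by_group`, etc.) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 5  # 3 articles, 5 user visits (illustrative only)

# Stochastic: each article has a fixed CTR; clicks are i.i.d. Bernoulli draws.
fixed_ctr = np.array([0.10, 0.05, 0.02])                 # hypothetical CTRs
def stochastic_click(a):
    return int(rng.random() < fixed_ctr[a])

# Adversarial: clicks are just an arbitrary 0/1 table fixed by the adversary;
# no distributional assumption, and the pattern may change at any round.
adversarial_clicks = rng.integers(0, 2, size=(T, K))     # stand-in for "any" sequence
def adversarial_click(a, t):
    return int(adversarial_clicks[t, a])

# Contextual: the CTR depends on an observed context (e.g. user group).
ctr_by_group = {"sports_fan":  [0.20, 0.02, 0.05],       # hypothetical
                "news_junkie": [0.03, 0.15, 0.06]}
def contextual_click(a, group):
    return int(rng.random() < ctr_by_group[group][a])
```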

--- Algorithms: Adversarial Bandits Policy¶

We will cover the details of EXP3 and compare EXP3-IX with EXP3. Other variants will be briefly mentioned, as understanding EXP3 makes it easy to learn the rest independently. All algorithms in the Adversarial Bandit family can be viewed as descendants of EXP3, which forms the foundation of this setting. They share the same exponential-weight core but differ in how they handle exploration and variance control. In essence, each variant modifies a specific component of EXP3 — either the probability distribution, the reward estimator, or the update rule — to balance exploration, bias, and variance differently.

Format¶

In the following sections, we present adversarial bandit algorithms using a simple and consistent four-part template. We use the same structure across Stochastic, Adversarial, and Contextual bandits so that algorithms are easy to compare and understand.

Notation:

  • Arms: $\mathcal{A} = \{1, \dots, K\}$
  • Action: pick $A_t \in \mathcal{A}$
  • Reward: pick an arm and observe $X_{t}$
  • True reward: $x_{t,a}$ for arm $a$ at round $t$. It's the reward you could get at round $t$ if you pull arm $a$. In the adversarial setting, this is a deterministic value. Notice this is different from the observed reward $X_t$ (a random variable)
  • Optimal arm in hindsight: $a^* = \arg\max_{a\in\{1,\dots,K\}} \sum_{t=1}^T x_{t,a}$

Each algorithm is described through:

  1. Learning Objective: The objective function. The common goal of all Adversarial Bandit algorithms is to minimize regret. However, in Adversarial Bandits, regret is no longer defined as the difference between the expected rewards of the optimal arm and the cumulative reward actually received, $R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^T X_t\right]$, because there is no fixed distribution for each arm. Instead, the regret in Adversarial Bandits is the difference between "the maximum cumulative reward you could have obtained by always selecting the best single arm in hindsight", and the cumulative reward actually received. $$ R_T = \sum_{t=1}^T x_{t,a^*} - \mathbb{E}\left[\sum_{t=1}^T X_{t}\right] \quad \text{where } a^* \text{ is the best arm in hindsight.} $$

    Hindsight Optimal ≠ Expectation Optimal:

    In both stochastic bandit and adversarial bandit settings, regret can be understood as "the best possible reward we could have gotten" minus "the reward we actually received." However, the precise definition of "best possible reward" differs subtly between the two setups. The distinction can be clarified with a simple example. Suppose there are two arms, independently and identically distributed, with fixed distributions:

    • Arm A (expectation-optimal): reward per round $\sim \text{Bernoulli}(0.6)$
    • Arm B (expectation-sub-optimal): reward per round $\sim \text{Bernoulli}(0.4)$

    In a particular realization (say $T=5$), Arm B happens to get the reward sequence $(1,1,1,1,1)$, while Arm A gets $(0,1,0,1,0)$. In hindsight, always picking Arm B yields a total reward of 5, which is greater than Arm A's 2, even though Arm A is optimal in expectation. (A short code sketch after this list computes the hindsight-optimal arm and the resulting regret.)

  2. Algorithm Design Principle: The key idea or intuition behind how the algorithm handles exploration vs. exploitation and helps achieve the learning objective.

  3. How the Algorithm Works: The concrete decision rule, update formulas, and pseudocode.

  4. Regret Bound: A brief summary of the algorithm's theoretical performance. In regret analysis we care about upper bounds on regret (equivalently, lower bounds on the total reward the algorithm is guaranteed to collect); regret lower bounds, by contrast, describe the intrinsic difficulty of the problem rather than a particular algorithm. Note that a regret upper bound of $\mathcal{O}(T)$ is trivial for any bandit algorithm with bounded rewards, since at most 1 can be lost per round, so the interesting guarantees are the sublinear ones.
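
To make the adversarial regret definition and the hindsight example above concrete, here is a small sketch (NumPy, with realized rewards matching the Arm A / Arm B example and a made-up action sequence) that computes the best fixed arm in hindsight and the corresponding regret:

```python
import numpy as np

# Realized rewards from the Arm A / Arm B example above (T = 5, K = 2):
# column 0 = Arm A (Bernoulli(0.6)), column 1 = Arm B (Bernoulli(0.4)).
x = np.array([[0, 1],
              [1, 1],
              [0, 1],
              [1, 1],
              [0, 1]])

def hindsight_regret(x, actions):
    """Adversarial regret: best fixed arm in hindsight minus reward actually collected."""
    best_fixed = x.sum(axis=0).max()                        # sum_t x_{t,a*}
    collected = x[np.arange(len(actions)), actions].sum()   # sum_t X_t
    return best_fixed - collected

actions = np.array([0, 0, 1, 0, 1])   # an arbitrary action sequence the learner played
print(x.sum(axis=0))                  # per-arm totals: [2 5] -> a* is Arm B in hindsight
print(hindsight_regret(x, actions))   # 5 - (0 + 1 + 1 + 1 + 1) = 1
```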

--- EXP3 (Exponential-weight algorithm for Exploration and Exploitation)¶

EXP3 is the most classic, fundamental, and also the most important algorithm in the Adversarial Bandit setting. Other algorithms such as EXP3-IX and EXP3.P are derived from it.

We study the regret-bound proofs of Hedge and EXP3 carefully because the key design choices in EXP3 — including the importance-weighted estimator $\hat{x}_{t,a}$, the exponential weight update of $w_{t,a}$, and the probability distribution $p_{t,a}$ — are derived directly from the structure of the proof itself.

1. Learning Objective¶

The common objective of all Adversarial Bandit algorithms: $$ R_T = \sum_{t=1}^T x_{t,a^*} - \mathbb{E}\left[\sum_{t=1}^T X_{t}\right] \quad \text{where } a^* \text{ is the best arm in hindsight.} $$

2. Algorithm Design Principle¶

Core Idea:

  • Maintain a weight $w_a$ for each arm. The larger the weight, the higher the probability the arm is selected. The weights are updated in an exponential form, amplifying arms that perform well.
  • The probability of selecting each arm depends on its weight $w_a$ and an exploration term. The weight makes us increasingly favor arms that have performed well in the past, while the exploration term ensures sufficient randomness so the algorithm cannot be easily exploited by an adversary.
  • The rewards must satisfy $X_t\in[0, 1]$, or at least be bounded. If not, we need to rescale them; otherwise the regret bound does not hold.

In the news recommendation example:

  • The system does not assume that any article has a well-defined or learnable CTR; past clicks may provide no reliable information about future clicks.
  • Even if an article performs extremely well early on, the algorithm never commits to it, since an adversary could immediately change the click outcomes to penalize that choice.
  • The goal is not to identify a single best article, but to guarantee that the total number of clicks is close to that of the best fixed article in hindsight, no matter how user behavior evolves.

3. How the Algorithm Works¶

Core variables:

  • Weight: $w_a > 0$ for each arm $a$
  • Selection probability: $p_a$ for each arm $a$
  • Exploration parameter: $\gamma \in (0,1]$
  • Learning Rate: $\eta$

Initialization:

  • For all arms, $w_a = 1$

At each round $t = 1,2,\dots,T$:

  1. Construct the probability distribution $$p_a =(1-\gamma)\frac{w_a}{\sum_{b=1}^K w_b} + \frac{\gamma}{K}$$ Explanation: $(1-\gamma)$: exploitation of current weights; $\gamma/K$: uniform random exploration. This ensures that every arm has probability at least $\frac{\gamma}{K}$ of being selected.

    Without the exploration term $\frac{\gamma}{K}$, the probability $p_{t,a}$ for some arms may approach zero, causing the variance of $\hat x_{t,a} = \frac{X_t\mathbf{1}\{A_t=a\}}{p_{t,a}}$ to go to infinity. This makes regret analysis fail to converge. Adding exploration ensures $p_{t,a} \ge \frac{\gamma}{K}$, keeping the variance under control and allowing the regret proof to remain valid.

  2. Randomly select arm $$A_t \sim p$$

  3. Observe the reward $X_t$ of the selected arm.

  4. Construct an unbiased estimator for $x_{t,a}$ (also called the importance-weighted estimator), and update $w_a$ exponentially with learning rate $\eta$. (The classical EXP3 of Auer et al. takes $\eta = \gamma/K$; here we keep $\eta$ and $\gamma$ as separate parameters.) $$ \hat{x}_{t,a} = \frac{X_t \mathbf{1}\{A_t=a\}}{p_a} $$ $$ w_a \leftarrow w_a \exp\left(\eta\hat{x}_{t,a}\right) $$ A small simulation after these steps illustrates the unbiasedness of $\hat{x}_{t,a}$.

    Unbiased Estimator:

    An unbiased estimator for $x_{t,a}$ is not a strict requirement for the algorithm to run, but the existing regret bound analysis for EXP3 relies on unbiasedness. Unbiasedness allows the analysis of the bandit setting to copy the regret analysis of Hedge.

    $\hat{x}_{t,a}$ is the estimated reward constructed for arm $a$ alone, and we can see it is an unbiased estimator for $x_{t,a}$: $$\mathbb{E}[\hat{x}_{t,a}] = \mathbb{E}\left[\frac{X_t \mathbf{1}\{A_t=a\}}{p_a}\right] = \mathbb{E}\left[\frac{x_{t,a} \mathbf{1}\{A_t=a\}}{p_a}\right] = \frac{x_{t,a}}{p_a}\mathbb{E}[\mathbf{1}\{A_t=a\}] = x_{t,a}$$

    $X_t$ is "what you see is what you get for the arm chosen in this round," which is completely different. The expected value of $X_t$ is the average reward weighted by the arm-selection strategy, not the reward of any particular arm. $$\mathbb E[X_t] = \mathbb E[x_{t,A_t}] = \sum_{a} p_{a} x_{t,a}$$

    Importance-Weighted Estimator:

    Why is this estimator called "importance weighted"? It's because the weight $\frac{1}{p_{a}}$ comes from the sampling probability: the smaller the probability that an arm is sampled, the more "rare" the information provided by a single observation is, and thus the more importance we assign to it in our estimate by using a larger weight.

    Why do we need importance weighting in adversarial bandits? Here is a key difference between the regret analyses of stochastic and adversarial bandits: in stochastic bandits the analysis is arm-wise, and we never try to estimate the reward of every arm in every round. The existing regret analysis for EXP3, however, relies on updating the weights of all arms in every round (in order to reuse the Hedge analysis; see the regret bound section). The key issue in adversarial bandit algorithms is therefore to estimate the rewards of arms that were not played. Leaving an arm's weight unchanged in a round is equivalent to assuming its reward was zero, which introduces a systematic bias under partial feedback. So whenever we select an arm that is rarely chosen, we give the observed reward a larger weight, making the update for this arm larger, so that in expectation it is as if we observed full feedback for every arm in every round.

    The exponential weight update is directly copied from the Hedge algorithm's form, in order for the regret bound analysis to work. Exponential updates ensure that arms which perform well increase their weights rapidly, while poorly performing arms have their weights quickly suppressed.
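
As a quick numerical sanity check (one hypothetical round with made-up rewards and probabilities), the simulation below verifies that the importance-weighted estimator is unbiased for every arm, while the naive estimator that treats unplayed arms as zero is not; it also shows that the estimator's variance grows as $p_a$ shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
x_t = np.array([0.9, 0.5, 0.1])   # hypothetical true rewards for one fixed round t
p_t = np.array([0.2, 0.3, 0.5])   # current sampling distribution

n_sim = 200_000
a = rng.choice(K, size=n_sim, p=p_t)        # simulate many draws of A_t
onehot = np.eye(K)[a]                       # 1{A_t = a} for each simulated draw
iw_est = onehot * (x_t / p_t)               # importance-weighted: X_t 1{A_t=a} / p_a
naive_est = onehot * x_t                    # naive: treat unplayed arms as reward 0

print("true x_t            :", x_t)
print("mean IW estimate    :", iw_est.mean(axis=0).round(3))     # ~ x_t (unbiased)
print("mean naive estimate :", naive_est.mean(axis=0).round(3))  # ~ p_t * x_t (biased)
print("IW per-arm std      :", iw_est.std(axis=0).round(2))      # grows as p_a shrinks
```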

4. Regret Bound¶

The classical regret bound for EXP3 holds in expectation, where the expectation is with respect to the algorithm's own randomness (i.e., arm selection), not the adversary's rewards. If one wants a high-probability (pathwise) regret bound, one needs improved variants such as EXP3.P or EXP3-IX, which further control variance and tails in the analysis. $$ \mathbb{E}[R_T] = \mathcal{O}\left(\sqrt{T K \log K}\right) $$

🍵Proof: EXP3 is essentially a Hedge algorithm under partial information. The proof for EXP3 almost copies the Hedge proof, substituting $x$ by $\hat x$ and adding a variance term. Understanding the proof for Hedge (see the Hedge section) is a prerequisite for understanding the EXP3 proof.

Three Key Modifications from Hedge:

  • Definition-level change: replace true rewards by unbiased estimates

    The algorithm's "expected cumulative reward" is now expressed using estimated values: $$ \hat G \triangleq \sum_{t=1}^T \sum_a p_{t,a}\hat x_{t,a}, \qquad \mathbb E[\hat G] = \mathbb E\Big[\sum_{t=1}^T \sum_a p_{t,a} x_{t,a}\Big]. $$ The exponential weight update is changed accordingly: $$ w_{t+1,a} = w_{t,a}\exp(\eta \hat x_{t,a}), \qquad W_t \triangleq \sum_a w_{t,a}. $$

  • Upper bound: replacing Hoeffding's lemma with explicit variance control

    Since the importance-weighted estimator in EXP3, $\hat x_{t,a}=\frac{X_t\mathbf 1\{A_t=a\}}{p_{t,a}}$, is no longer bounded in $[0,1]$ (its range scales with $K/\gamma$), Hoeffding's lemma no longer applies in the form used for Hedge. Instead, an exponential bound based on the second moment is used. This is the key difference between the EXP3 and Hedge proofs.

    Lemma (Second-moment exponential bound). For any real $x\le 1$, $$ e^x \le 1+x+x^2. $$ Consequently, for any random variable $X$ with $X\le 1$ almost surely, $$ \log\mathbb E[e^X] \le \mathbb E[X]+\mathbb E[X^2]. $$ Applying this to EXP3 (take $X=\eta\hat x_{t,a}$ under the distribution $p_t$, with $\eta$ small enough that $\eta\hat x_{t,a}\le 1$), the per-round step of the Hedge proof becomes $$ \log\frac{W_{t+1}}{W_t} = \log\Big(\sum_a p_{t,a}e^{\eta\hat x_{t,a}}\Big) \le \eta\sum_a p_{t,a}\hat x_{t,a} +\eta^2\sum_a p_{t,a}\hat x_{t,a}^2. $$

    And since EXP3's exploration mechanism guarantees $p_{t,a}\ge \frac{\gamma}{K}$ and $X_t\in[0,1]$, we have $$ \hat x_{t,a}^2=\frac{X_t^2\mathbf 1\{A_t=a\}}{p_{t,a}^2}\le \frac{\mathbf 1\{A_t=a\}}{p_{t,a}^2} \quad \Rightarrow \quad \mathbb E_t[\hat x_{t,a}^2] =\frac{x_{t,a}^2}{p_{t,a}} \le \frac{1}{p_{t,a}} \le \frac{K}{\gamma} \quad \Rightarrow \quad \mathbb E\left[\sum_{t=1}^T\sum_a p_{t,a}\hat x_{t,a}^2\right] \le \frac{TK}{\gamma} $$

    Summing over $t$, using $\log W_1=\log K$, and taking expectations, we obtain $$ \mathbb E[\log W_{T+1}] \le \log K + \eta\mathbb E[\hat G] + \eta^2\frac{TK}{\gamma} $$

  • Lower bound: structurally unchanged, just replace by estimates

    There is no new trick for the lower bound in EXP3; simply replace $x_{t,a}$ everywhere by $\hat x_{t,a}$.

    Since $W_{T+1}$ is the sum of all weights, $$ W_{T+1} \ge w_{T+1,a^*} = \exp\Big(\eta \sum_{t=1}^T \hat x_{t,a^*}\Big) \quad \Rightarrow \quad \log W_{T+1} \ge \eta \sum_{t=1}^T \hat x_{t,a^*} $$ Taking expectations and using unbiasedness, $\mathbb E[\hat x_{t,a^*}]=x_{t,a^*}$, we get $$ \mathbb E[\log W_{T+1}] \ge \eta G^*, \qquad \text{where } G^* \triangleq \sum_{t=1}^T x_{t,a^*}. $$

Combine the upper and lower bounds to obtain the EXP3 regret bound: $$ \mathbb E[R_T] \le \frac{\log K}{\eta} + \eta \frac{TK}{\gamma}. $$ For fixed $\gamma$, the optimal $\eta$ is to balance the two terms: $$ \eta^*=\sqrt{\frac{\gamma\log K}{TK}} $$ Plugging this back in gives $$ \mathbb E[R_T]\le 2\sqrt{\frac{TK\log K}{\gamma}} $$

Note: the upper bound is monotonically decreasing in $\gamma\in (0, 1]$, with the optimum being $\gamma = 1$. However, in practice, this is almost never used, as it corresponds to a purely random policy that does not exploit the learned weights at all. In engineering practice, we usually use a theoretically optimal-order value such as $\gamma = \min\left(1,\sqrt{\frac{K\log K}{(e-1)T}}\right)$, or directly use improved algorithms like EXP3.P or EXP3-IX that are more robust to the choice of $\gamma$, rather than tuning $\gamma$ via grid search.
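
For reference, here is a tiny helper (a sketch that simply evaluates the expressions above; the function name `exp3_params` is our own) computing the theoretically motivated $\gamma$ and the corresponding $\eta$ for a given $K$ and $T$:

```python
import numpy as np

def exp3_params(K, T):
    """Theoretically motivated EXP3 parameters (order-optimal choices from the text)."""
    gamma = min(1.0, np.sqrt(K * np.log(K) / ((np.e - 1) * T)))
    eta = np.sqrt(gamma * np.log(K) / (T * K))   # balances the two regret terms
    return gamma, eta

print(exp3_params(K=3, T=1000))   # e.g. gamma ~ 0.044, eta ~ 0.0040
```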

Comment: EXP3 is the most essential and classical algorithm for adversarial bandits, but in real-world applications, the original EXP3 is rarely used directly because its variance is quite high. In practice, we mostly use improved versions of it or more stable algorithms based on the same ideas.

5. Hedge (Exponentially Weighted Average (EWA) / Exponential Weights Algorithm)¶

Let's talk about Hedge, because EXP3 is not an entirely new algorithm; rather, EXP3 essentially brings Hedge into the bandit setting. The algorithm and regret bound analysis are almost identical between the two. The only real difference lies in the setting: in each round, Hedge sees the rewards of all experts, whereas EXP3 only observes the reward of the selected arm. As we'll see in the "Relation to EXP3" part, EXP3 makes two tweaks to bridge this difference.

Background Info

"Hedge" is equivalent to the Exponentially Weighted Average (EWA) / Exponential Weights Algorithm. They refer to online learning algorithms for the expert advice problem: suppose you face a decision problem and have several experts providing advice; in each round you can observe the true performance of all experts (full information), and the algorithm uses exponential weighting to gradually increase the weight of experts with better historical performance, guaranteeing that its cumulative performance is close to that of the optimal fixed expert in hindsight.

Hedge is not an acronym, nor does it refer to technical details of the algorithm. It describes a decision philosophy aimed at adversarial or worst-case scenarios: instead of directly selecting a single expert, each expert is assigned a weight, and decisions are randomized according to these weights. The weights are adjusted based on historical performance. This approach "hedges" against the uncertainty of which expert will ultimately perform best in the future.

Hedge originates from the online learning and expert advice framework. Its core idea can be traced back to the early Weighted Majority algorithm of Littlestone and Warmuth (1989), which assigned weights to multiple experts and updated these weights based on their historical performance in order to avoid the worst-case outcomes brought by always following a single expert in adversarial environments. In 1997, Freund and Schapire systematically proposed and analyzed the Exponentially Weighted Average (EWA) / Exponential Weights Algorithm, naming it "Hedge" to emphasize the philosophy of diversifying bets and hedging risks. This theoretical framework also directly inspired the development of boosting algorithms (such as AdaBoost, where the sample weight update is itself an exponential weighting), and laid the foundation for algorithms in the partial feedback setting (such as EXP3).

Algorithm

Problem setup: Full-information online learning / expert advice

  • There are $K$ experts (or arms). In each round, maintain a weight $w_{t,1},...,w_{t,K}$ for each expert, and use the current weights to form a probability distribution $p_{t,1},...,p_{t,K}$.
  • In each round, we can observe the true reward $x_{t,a}$ for all experts. The rewards are bounded: $x_{t,a}\in[0,1]$ for all $t,a$. If they are not, we rescale them to lie in $[0, 1]$ (this is important; otherwise the regret bound does not hold).
  • The algorithm's core output is the distribution $p_{t,1},...,p_{t,K}$, not a single sampled expert.
  • The cumulative reward in the analysis is typically $\sum_t \sum_a p_{t,a} x_{t,a}$, compared to the best fixed expert in hindsight: $a^* = \arg\max_a \sum_{t=1}^T x_{t,a}$.

Pseudocode:

  • Initialization: $w_{1, a}=1$ for all $a\in\{1,...,K\}$
  • Probability distribution: $$ p_{t,a}=\frac{w_{t,a}}{\sum_b w_{t,b}} $$
  • Exponential update: $$ w_{t+1,a}=w_{t,a}\exp(\eta x_{t,a}) $$ ($\eta$ is the learning rate / step size, controlling how strongly the algorithm responds to new information; a large $\eta$ will make the algorithm quickly favor the currently better-performing experts, while a smaller $\eta$ makes the updates more conservative)
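
For concreteness, here is a minimal Hedge sketch following the pseudocode above (the function `run_hedge` is our own; it assumes the full reward matrix `x` of shape `(T, K)` with entries in $[0,1]$ is observable, as the full-information setting allows):

```python
import numpy as np

def run_hedge(x, eta=0.1):
    """Hedge / exponential weights under full information.

    x   : (T, K) array of true rewards x_{t,a} in [0, 1] for every expert.
    eta : learning rate.
    Returns the algorithm's expected cumulative reward sum_t sum_a p_{t,a} x_{t,a}.
    """
    T, K = x.shape
    w = np.ones(K)
    total = 0.0
    for t in range(T):
        p = w / w.sum()              # p_{t,a} = w_{t,a} / sum_b w_{t,b}
        total += p @ x[t]            # full information: every expert's reward is seen
        w = w * np.exp(eta * x[t])   # exponential update for all experts
        w = w / w.max()              # renormalize for numerical stability (p unchanged)
    return total

# Usage sketch (generate_adversarial_rewards is defined in the implementation section below):
# x = generate_adversarial_rewards(T=1000, K=3)
# print(run_hedge(x, eta=np.sqrt(8 * np.log(3) / 1000)))  # eta chosen as in the proof below
```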

Regret Bound

Because it enjoys full information, Hedge achieves regret of order $$R_T=\mathcal O\left(\sqrt{T\log K}\right)$$ EXP3, in contrast, suffers an additional $\sqrt{K}$ factor due to partial feedback and importance-weighted reward estimation (the variance of the importance-weighted estimator is $\sim 1/p_{t,a}$, which is at worst $\mathcal{O}(K)$), resulting in regret $\mathcal{O}(\sqrt{TK\log K})$.

🍵Proof:

Idea: Use the growth of the total weight $W_t$ in logarithmic scale (the potential function) to sandwich the algorithm's expected cumulative reward and the best expert’s cumulative reward, thereby obtaining the regret bound.

  1. Definitions of comparison objects and the potential function:

    • Expected cumulative reward of the algorithm $$ G \triangleq \sum_{t=1}^T \sum_{a=1}^K p_{t,a} x_{t,a} $$
    • Cumulative reward of the optimal fixed expert in hindsight $$ G^* \triangleq \max_a \sum_{t=1}^T x_{t,a} $$
    • The exponential weights potential function $$ W_t \triangleq \sum_{a=1}^K w_{t,a} $$ where the initialization is $W_1=K$.
  2. Upper bound: Controlling $\log W_{T+1}-\log W_1$

    By the exponential update and definition: $$ W_{t+1}=\sum_a w_{t,a}e^{\eta x_{t,a}} = W_t \sum_a p_{t,a}e^{\eta x_{t,a}} \quad \Rightarrow \quad \log\frac{W_{t+1}}{W_t}=\log\Big(\sum_a p_{t,a}e^{\eta x_{t,a}}\Big) $$

    Hoeffding’s lemma (⭐The key of this proof. In EXP3, we make a tweak from here because the r.v. is no longer bounded)

    If random variable $X\in[a,b]$, then for any $\lambda\in\mathbb{R}$, $$ \log \mathbb{E}[e^{\lambda X}] \le \lambda \mathbb{E}[X] + \frac{\lambda^2 (b-a)^2}{8}. $$

    Construct the random variable $X$: At a fixed round $t$, let the random variable $X$ take values in the set $\{x_{t,1},\dots,x_{t,K}\}$, and let $\mathbb P(X = x_{t,a}) = p_{t,a}$. By the definition of the expectation of a discrete random variable: $$ \mathbb E[e^{\eta X}] = \sum_{a} \mathbb P(X=x_{t,a}) e^{\eta x_{t,a}} = \sum_{a} p_{t,a} e^{\eta x_{t,a}} $$ Directly applying Hoeffding's lemma. If $x_{t,a}\in[0,1]$, then $$\log\Big(\sum_a p_{t,a}e^{\eta x_{t,a}}\Big) \le \eta\sum_a p_{t,a}x_{t,a}+\frac{\eta^2}{8}$$

    So we can bound $\log W_{T+1}-\log W_1$ by $$ \log W_{T+1}-\log W_1 = \sum_{t=1}^T (\log W_{t+1}-\log W_t) = \sum_{t=1}^T \log\frac{W_{t+1}}{W_t} \le \eta \sum_{t=1}^T\sum_a p_{t,a}x_{t,a}+\frac{\eta^2T}{8} = \eta G + \frac{\eta^2T}{8}. $$ Also, $W_1=\sum_a 1=K$, so $\log W_1=\log K$. Therefore, the upper bound is: $$ \log W_{T+1} \le \log K+\eta G + \frac{\eta^2T}{8}. $$

  3. Lower bound: Lower bounding $\log W_{T+1}$ using the "best expert"

    Let $a^* = \arg\max_a\sum_t x_{t,a}$. Since $W_{T+1}$ is the sum of all weights, it must be at least as large as any single one of them: $$ W_{T+1} \ge w_{T+1,a^*}. $$ Expanding $w_{T+1,a^*}$ using the update rule: $$ w_{T+1,a^*} = w_{1,a^*}\exp\Big(\eta\sum_{t=1}^T x_{t,a^*}\Big) = \exp\Big(\eta\sum_{t=1}^T x_{t,a^*}\Big), $$ since $w_{1,a^*}=1$. Taking the logarithm gives $$ \log W_{T+1} \ge \eta\sum_{t=1}^T x_{t,a^*} = \eta G^*. $$

  4. Combine upper and lower bounds to get the classic Hedge bound

    Combine the upper bound from step 2 and the lower bound from step 3: $$ \eta G^* \le \log W_{T+1} \le \log K+\eta G + \frac{\eta^2T}{8}. $$ Rearranging and dividing by $\eta$: $$ G^* - G \le \frac{\log K}{\eta}+\frac{\eta T}{8}. $$ This is the classic Hedge bound (reward version): $$ \sum_{t=1}^T\sum_a p_{t,a}x_{t,a} \ge \max_a\sum_{t=1}^T x_{t,a} - \frac{\log K}{\eta} - \frac{\eta T}{8} $$

  5. Choose $\eta$ to get a more "elegant" order

    Optimize $\frac{\log K}{\eta}+\frac{\eta T}{8}$ (set the two terms equal): $$ \eta = \sqrt{\frac{8\log K}{T}} \quad\Rightarrow\quad R_T \le \sqrt{\frac{T\log K}{2}}. $$ Therefore, the regret of Hedge is $\mathcal{O}(\sqrt{T\log K})$.

Relation to EXP3

Hedge and EXP3 share the same core exponential weighting mechanism. The only essential difference lies in the feedback each round: full information vs. bandit feedback. Hedge observes the true rewards for all experts in every round, whereas EXP3 only observes the reward of the chosen arm each round.

EXP3 uses unbiased estimation and enforced exploration so that Hedge's analysis still applies in the partial feedback setting. In more detail,

  • Are the exponential updates the same? -- The idea is the same, but the mathematical objects differ.
    • Hedge: $$ w_{t+1,a}=w_{t,a}\exp(\eta x_{t,a}) $$
    • EXP3: $$ w_{t+1,a}=w_{t,a}\exp(\eta \hat x_{t,a}) $$ It's exactly the same exponential weighting mechanism; only the true reward is replaced by an unbiased estimator: $x_{t,a} \longrightarrow \hat x_{t,a}$
  • Difference in probability distribution:
    • Hedge: No need for exploration, since the feedback is full. $$ p_{t,a}=\frac{w_{t,a}}{\sum_b w_{t,b}} $$
    • EXP3: The exploration term is not a stylistic difference, but a necessary correction under partial feedback. $$ p_{t,a} =(1-\gamma)\frac{w_{t,a}}{\sum_b w_{t,b}} +\frac{\gamma}{K} $$

--- EXP3-IX¶

Algorithm¶

The essential difference of EXP3-IX is that it moves the exploration from the probability distribution into the estimator itself. The algorithm is almost identical to EXP3, except for where the exploration term appears.

| Component | EXP3 | EXP3-IX |
|---|---|---|
| Sampling distribution | $p_a =(1-\gamma)\frac{w_a}{\sum_b w_b} + \frac{\gamma}{K}$ | $p_a = \frac{w_a}{\sum_b w_b}$ |
| Reward estimator | $\hat{x}_{t,a} = \frac{X_t \mathbf{1}\{A_t=a\}}{p_a}$ | $\hat{x}_{t,a}^{\text{IX}} = \frac{X_t \mathbf{1}\{A_t=a\}}{p_a + \gamma / K}$ |
| Weight update | $w_a \leftarrow w_a \exp\left(\eta\,\hat{x}_{t,a}\right)$ | $w_a \leftarrow w_a \exp\left(\eta\,\hat{x}_{t,a}^{\text{IX}}\right)$ |

In EXP3-IX, the algorithm removes this explicit exploration term and instead introduces an implicit exploration bias inside the estimator denominator. The $\gamma/K$ offset prevents the estimator from exploding when $p_a$ is small, effectively regularizing the variance without modifying the sampling distribution. However, because of this denominator shift, $$ \mathbb{E}[\hat{x}_{t,a}^{\text{IX}}] \neq x_{t,a}, $$ i.e. the estimator is no longer unbiased. We deliberately introduce a small bias–variance tradeoff: a controlled bias that yields much lower estimator variance and a tighter, more stable regret bound in practice.
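
A small simulation (with hypothetical values for a rarely played arm) illustrating this tradeoff: the IX estimator's mean falls below the true reward, but its standard deviation stays far smaller than that of the unbiased estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
x_a, p_a = 1.0, 0.01          # hypothetical: arm a has reward 1 but is sampled rarely
gamma, K = 0.1, 3
n_sim = 1_000_000

played = rng.random(n_sim) < p_a                     # did we play arm a this round?
std_est = played * x_a / p_a                         # standard importance-weighted estimator
ix_est = played * x_a / (p_a + gamma / K)            # EXP3-IX estimator

print("standard: mean %.3f, std %.1f" % (std_est.mean(), std_est.std()))  # ~1.0, ~10
print("IX:       mean %.3f, std %.1f" % (ix_est.mean(), ix_est.std()))    # ~0.23, ~2.3 (biased, lower variance)
```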

Regret Bound¶

The expected regret remains of the same asymptotic order as EXP3: $$ R_T = \mathcal{O}\left(\sqrt{TK\log K}\right) $$

Comment: The variance of the importance-weighted estimator is smaller, so the algorithm is empirically more stable. Exploration arises implicitly through the biased estimator.

--- Algorithms Implementation¶

In [2]:
import numpy as np
import matplotlib.pyplot as plt

# --- Environment Set Up: Adversarial Bandit ---
np.random.seed(42)

K = 3           # number of arms
T = 1000        # time horizon

# Adversarially generated rewards (time-varying pattern)
def generate_adversarial_rewards(T, K):
    rewards = np.zeros((T, K))
    for t in range(T):
        # alternating best arm every 200 rounds
        rewards[t, 0] = 0.8 if (t // 200) % 2 == 0 else 0.2
        rewards[t, 1] = 0.5
        rewards[t, 2] = 0.2 if (t // 200) % 2 == 0 else 0.8
    return rewards
In [3]:
# --- EXP3 ---
def run_exp3(X, eta=0.07, gamma=0.1):
    T, K = X.shape
    w = np.ones(K)
    rewards = []

    for t in range(T):
        p = (1 - gamma) * (w / np.sum(w)) + gamma / K
        a = np.random.choice(K, p=p)
        r = X[t, a]
        rewards.append(r)

        x_hat = np.zeros(K)
        x_hat[a] = r / p[a]
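        # Note: eta/K is used here as the effective learning rate (mirroring the
        # classic gamma/K scaling of the original EXP3); the pseudocode above
        # writes the update as w_a * exp(eta * x_hat). The same scaling is used
        # in run_exp3ix below.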
        w = w * np.exp((eta / K) * x_hat)

    return np.array(rewards)


# --- EXP3-IX ---
def run_exp3ix(X, eta=0.07, gamma=0.1):
    T, K = X.shape
    w = np.ones(K)
    rewards = []

    for t in range(T):
        p = w / np.sum(w)
        a = np.random.choice(K, p=p)
        r = X[t, a]
        rewards.append(r)

        x_hat = np.zeros(K)
        x_hat[a] = r / (p[a] + gamma / K)
        w = w * np.exp((eta / K) * x_hat)

    return np.array(rewards)
In [1]:
# --- Run multiple experiments ---
eta = 0.07      # learning rate
gamma = 0.1     # exploration parameter
n_runs = 20     # number of independent runs for averaging

rewards_exp3_all = []
rewards_exp3ix_all = []

for _ in range(n_runs):
    X = generate_adversarial_rewards(T, K)
    rewards_exp3_all.append(run_exp3(X, eta, gamma))
    rewards_exp3ix_all.append(run_exp3ix(X, eta, gamma))

rewards_exp3_all = np.array(rewards_exp3_all)
rewards_exp3ix_all = np.array(rewards_exp3ix_all)

# --- Compute metrics ---
def compute_avg_std(rewards_all):
    avg_reward = np.mean(np.cumsum(rewards_all, axis=1) / np.arange(1, rewards_all.shape[1] + 1), axis=0)
    std_reward = np.std(np.cumsum(rewards_all, axis=1) / np.arange(1, rewards_all.shape[1] + 1), axis=0)
    return avg_reward, std_reward

avg_exp3, std_exp3 = compute_avg_std(rewards_exp3_all)
avg_exp3ix, std_exp3ix = compute_avg_std(rewards_exp3ix_all)

# Best fixed arm reward (for regret)
best_arm_reward = np.max(np.sum(generate_adversarial_rewards(T, K), axis=0))

cum_rewards_exp3 = np.mean(np.cumsum(rewards_exp3_all, axis=1), axis=0)
cum_rewards_exp3ix = np.mean(np.cumsum(rewards_exp3ix_all, axis=1), axis=0)
regret_exp3 = best_arm_reward - cum_rewards_exp3
regret_exp3ix = best_arm_reward - cum_rewards_exp3ix

# --- Print summary ---
print(f"Over {n_runs} runs:")
print(f"EXP3    → Mean total reward: {np.mean(np.sum(rewards_exp3_all, axis=1)):.1f}, Std: {np.std(np.sum(rewards_exp3_all, axis=1)):.2f}")
print(f"EXP3-IX → Mean total reward: {np.mean(np.sum(rewards_exp3ix_all, axis=1)):.1f}, Std: {np.std(np.sum(rewards_exp3ix_all, axis=1)):.2f}")

# --- Plot comparison ---
plt.figure(figsize=(12,5))

# Average reward curves with variance bands
plt.subplot(1,2,1)
plt.plot(avg_exp3, label='EXP3')
plt.fill_between(np.arange(T), avg_exp3 - std_exp3, avg_exp3 + std_exp3, alpha=0.2)
plt.plot(avg_exp3ix, label='EXP3-IX')
plt.fill_between(np.arange(T), avg_exp3ix - std_exp3ix, avg_exp3ix + std_exp3ix, alpha=0.2)
plt.title("Average Reward (mean ± std)")
plt.xlabel("Time step")
plt.ylabel("Average Reward")
plt.legend()
plt.grid(True)

# Regret comparison
plt.subplot(1,2,2)
plt.plot(regret_exp3, label='EXP3')
plt.plot(regret_exp3ix, label='EXP3-IX')
plt.title("Cumulative Regret (averaged over runs)")
plt.xlabel("Time step")
plt.ylabel("Regret")
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
Over 20 runs:
EXP3    → Mean total reward: 518.9, Std: 11.92
EXP3-IX → Mean total reward: 527.7, Std: 9.71
[Figure: average reward (mean ± std, left panel) and cumulative regret averaged over runs (right panel) for EXP3 vs. EXP3-IX]

--- Applications¶

Summary: whenever a problem involves repeated rounds of decisions over a fixed action set, with rewards that may be non-i.i.d. or adversarially chosen, it can be modeled as an Adversarial Bandit.

  1. Online advertising / real-time bidding (RTB)
    • Action: Choose bid level / select ad to display
    • Reward: Click / conversion / revenue
    • Independent sample: ❌ No — competitors adapt their bids over time
    • Feature: Market is strategic and adversarial; reward distribution changes in response to your policy (EXP3 is commonly used)
  2. Network security / adversarial defense
    • Action: Choose defense strategy (firewall rules, detection thresholds)
    • Reward: Attack blocked or loss incurred
    • Independent sample: ❌ No — attackers adapt to observed defenses
    • Feature: Explicit adversary exists; reward sequence can be worst-case rather than stochastic
  3. Decentralized systems / network routing
    • Action: Select routing path or resource allocation
    • Reward: Latency / throughput
    • Independent sample: ❌ No — congestion and delays vary over time
    • Feature: Environment is non-stationary and not i.i.d; adversarial bandits handle time-varying costs
  4. Stock trading / portfolio allocation
    • Action: Allocate capital among assets
    • Reward: Profit or loss
    • Independent sample: ❌ No — market rewards are non-stationary and reactive
    • Feature: Financial markets cannot be modeled with fixed reward distributions; Hedge / EXP3 are used for adversarial settings
  5. Game AI / repeated competitive games
    • Action: Choose strategy or move
    • Reward: Win / loss / payoff
    • Independent sample: ❌ No — opponent learns and adapts
    • Feature: Rewards depend on an intelligent adversary rather than randomness
  6. Real-time fraud detection
    • Action: Select detection rule or threshold
    • Reward: Fraud caught minus false positives
    • Independent sample: ❌ No — fraud patterns change in response to detection
    • Feature: Adversarial behavior breaks i.i.d assumptions; worst-case guarantees are needed