Stochastic bandit: a collection of distributions \(\nu = (P_a: a\in\mathcal{A})\), where \(\mathcal{A}\) is the set of available actions (arms). The learner and the environment interact sequentially over \(n\) rounds. In each round \(t\in\{1,\dots,n\}\) the learner chooses an action \(A_t\in\mathcal{A}\) and feeds it to the environment. The environment then samples a reward \(X_t\in\mathbb{R}\) from the distribution \(P_{A_t}\) and reveals \(X_t\) to the learner.
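A minimal sketch of this interaction protocol in Python, assuming Gaussian arms and a placeholder learner that picks arms uniformly at random (the variable and function names are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

means = [0.5, 0.3, 0.7]   # one mean per arm: nu = (P_a : a in A)
n = 1000                  # horizon

def sample_reward(a):
    """Environment side: draw X_t ~ P_{A_t} (here a unit-variance Gaussian)."""
    return rng.normal(means[a], 1.0)

rewards = []
for t in range(n):
    A_t = rng.integers(len(means))   # placeholder learner: uniform random choice
    X_t = sample_reward(A_t)         # environment samples and reveals X_t
    rewards.append(X_t)
```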
Environment class: a class \(\xi\) encodes what the observer knows in advance about the bandits belonging to it, for example that the reward distributions have finite variance. Environment classes can be divided into two types according to the dependence among arms: in an unstructured class, playing one arm reveals nothing about the other arms; in a structured class, playing one arm can also give information about other arms.
The notation we need to state the regret properties of stochastic bandits: \[\mu^*(\nu) \text{ is the mean of the optimal arm,}\] \[\mu_a(\nu) \text{ is the mean of arm } a,\] \[\Delta_a = \mu^*(\nu) - \mu_a(\nu) \text{ is the suboptimality gap of arm } a,\] \[T_a(n) = \sum\limits_{t=1}^n I\{A_t = a\} \text{ is the number of times arm } a \text{ is played in the first } n \text{ rounds.}\]
The learning objective of a bandit algorithm is to maximize the cumulative reward or, equivalently, to minimize the regret.
Cumulative reward: \[S_n = \sum\limits_{t=1}^n X_t\]
Empirical regret: \[\hat{R}_n = n\mu^* - \sum\limits_{t=1}^n X_t\]
Expected Regret: \[R_n(\pi, \nu) = n\mu^*(\nu) - \sum\limits_{t=1}^n \mathbb{E}[X_t] = \sum\limits_{a\in\mathcal{A}}\Delta_a \mathbb{E}[T_a(n)]\]
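As a quick illustration of the decomposition (a hypothetical two-armed example, not from the text): let \(\mathcal{A}=\{1,2\}\) with \(\mu_1 = 0.6\) and \(\mu_2 = 0.5\), so \(\mu^* = 0.6\), \(\Delta_1 = 0\), and \(\Delta_2 = 0.1\). If a policy satisfies \(\mathbb{E}[T_2(n)] = \sqrt{n}\), then \[R_n(\pi,\nu) = \Delta_1\mathbb{E}[T_1(n)] + \Delta_2\mathbb{E}[T_2(n)] = 0 + 0.1\sqrt{n},\] which grows sublinearly in \(n\).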
Of course, for a fixed policy, different bandits will have different regrets. A decently good policy is one whose regret is sublinear, i.e. \(\lim\limits_{n\rightarrow\infty}R_n(\pi, \nu)/n = 0\). More specifically, \[R_n(\pi, \nu)\leq C(\nu)f(n),\] where \(C(\nu)\) is a constant depending on the bandit instance and \(f(n)/n\rightarrow 0\) (for example, \(f(n)=\log n\) or \(f(n)=\sqrt{n}\)).
A toy implementation and experiment in Python for the Bernoulli bandit (Exercises 4.6, 4.7, 4.10, 4.11); a sketch follows below.
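A minimal sketch of what such a Bernoulli bandit and a simple follow-the-leader experiment might look like; the class and function names are my own assumptions rather than those prescribed by the exercises:

```python
import numpy as np


class BernoulliBandit:
    def __init__(self, means, seed=None):
        self.means = np.asarray(means, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.best_mean = self.means.max()     # mu^*
        self.pseudo_regret = 0.0              # accumulates Delta_{A_t} each round

    def K(self):
        """Number of arms."""
        return len(self.means)

    def pull(self, a):
        """Sample X_t ~ Bernoulli(mu_a) and update the accumulated regret."""
        self.pseudo_regret += self.best_mean - self.means[a]
        return float(self.rng.random() < self.means[a])

    def regret(self):
        """Return sum_t (mu^* - mu_{A_t}) accumulated so far."""
        return self.pseudo_regret


def follow_the_leader(bandit, n):
    """Play each arm once, then always play the arm with the highest
    empirical mean (ties broken by the lowest index)."""
    K = bandit.K()
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(n):
        a = t if t < K else int(np.argmax(sums / counts))
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return bandit.regret()


if __name__ == "__main__":
    bandit = BernoulliBandit([0.5, 0.6], seed=42)
    print("regret after 1000 rounds:", follow_the_leader(bandit, 1000))
```

Running the script once gives a single sample of the regret; averaging `follow_the_leader` over many seeds would estimate \(R_n\) for this policy on this instance.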