Nov 18, 2024 · 2 min read
Multi-armed Bandit Problem
The multi-armed bandit problem is a simplified reinforcement learning setting in which an agent faces the exploration versus exploitation dilemma
The problem is defined as a tuple $\langle A, R, \gamma \rangle$, where
$A$ is a set of $k$ actions, $A = \{a_1, a_2, \ldots, a_k\}$,
and $R$ is an unknown probability distribution $R_a = P[r \mid a]$ of rewards given the chosen action;
we choose $\gamma = 0$ for totally discounted rewards
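For concreteness, here is a minimal sketch of such a bandit environment in Python (the class name `GaussianBandit`, the Gaussian reward distributions, and the NumPy usage are illustrative assumptions, not part of the formal definition):

```python
import numpy as np

class GaussianBandit:
    """A k-armed bandit whose arms pay out Gaussian rewards with unknown means."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # true action values, hidden from the agent
        self.means = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        # sample a reward r ~ R_a for the chosen arm
        return self.rng.normal(self.means[action], 1.0)
```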
at each time step $t$, the agent applies a policy $\pi(a) = P[a]$ to select an action $a_t \in A$, based on the actions taken previously and the rewards obtained for them
subsequently, the environment returns a reward $r_t \sim R_{a_t}$
given that we set $\gamma = 0$, we define the value of an action, $Q^\pi(a)$, as its instantaneous mean reward
$$Q^\pi(a) = E_\pi[r \mid a] \tag{3}$$
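Under this definition, $Q^\pi(a)$ can be estimated online by the sample average of the rewards observed for each arm; a sketch of the interaction loop, assuming the `GaussianBandit` environment above and a `policy(Q, N, t)` callable (both hypothetical names):

```python
import numpy as np

def run(env, policy, steps=1000):
    """Interact with a bandit for `steps` pulls, keeping sample-average estimates of Q(a)."""
    Q = np.zeros(env.k)   # value estimates Q(a)
    N = np.zeros(env.k)   # number of times each arm has been pulled
    total = 0.0
    for t in range(steps):
        a = policy(Q, N, t)           # select a_t from the current estimates
        r = env.pull(a)               # r_t ~ R_{a_t}
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]     # incremental sample mean: Q(a) -> E[r | a]
        total += r
    return Q, total
```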
the goal is to find the optimal policy $\pi^*$ that maximizes the cumulative reward $\sum_{t=1}^{T} r_t$
policies must take the exploration vs. exploitation dilemma into account, combining explorative actions, which sample arms with uncertain rewards to update their value estimates, and greedy actions, which increase the total cumulative reward by choosing the action with the highest current value estimate
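Two textbook policies that combine greedy and explorative actions in this way are ε-greedy and UCB1; the sketch below is generic rather than specific to any particular adaptive-sampling scheme (the function names and the exploration constant `c = 2` are arbitrary choices):

```python
import numpy as np

def epsilon_greedy(epsilon=0.1, seed=1):
    """With probability ε take a random (explorative) action, otherwise a greedy one."""
    rng = np.random.default_rng(seed)
    def policy(Q, N, t):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explorative action
        return int(np.argmax(Q))               # greedy action
    return policy

def ucb1(c=2.0):
    """UCB1: greedy on Q plus an exploration bonus that shrinks for well-sampled arms."""
    def policy(Q, N, t):
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:                   # pull every arm at least once
            return int(untried[0])
        return int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
    return policy

# example usage with the sketches above:
# env = GaussianBandit(k=10)
# Q, total = run(env, epsilon_greedy(0.1), steps=1000)
```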
the main advantage of describing adaptive sampling as a multi-armed bandit is that we can draw on the extensive bandit literature for solutions and replace heuristic policies with more mathematically sound ones