Nov 18, 2024 · 2 min read
Multi-armed Bandit Problem
The multi-armed bandit problem is a simplified reinforcement learning setting in which an agent faces the exploration versus exploitation dilemma
The problem is defined as a tuple $\langle A, R, \gamma \rangle$, where
$A$ is a set of $k$ actions, $A = \{a_1, a_2, \ldots, a_k\}$,
and $R$ is an unknown probability distribution $R_a = P[r \mid a]$ of rewards given the chosen action;
we choose $\gamma = 0$ for totally discounted rewards
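For concreteness, here is a minimal sketch of such a bandit environment in Python (the class name `GaussianBandit`, the Gaussian reward distributions, and the NumPy usage are illustrative assumptions, not part of the formal definition):

```python
import numpy as np

class GaussianBandit:
    """A k-armed bandit whose arms pay out Gaussian rewards with unknown means."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # true action values, hidden from the agent
        self.means = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        # sample a reward r ~ R_a for the chosen arm
        return self.rng.normal(self.means[action], 1.0)
```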
at each time step $t$, the agent applies a policy $\pi(a) = P[a]$ to select an action $a_t \in A$, based on the actions taken previously and the rewards obtained for them
subsequently, the environment returns a reward $r_t \sim R_{a_t}$
given that we set $\gamma = 0$, we define the value of an action, $Q^\pi(a)$, as its instantaneous mean reward
$$Q^\pi(a) = E_\pi[r \mid a] \tag{3}$$
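Under this definition, $Q^\pi(a)$ can be estimated online by the sample average of the rewards observed for each arm; a sketch of the interaction loop, assuming the `GaussianBandit` environment above and a `policy(Q, N, t)` callable (both hypothetical names):

```python
import numpy as np

def run(env, policy, steps=1000):
    """Interact with a bandit for `steps` pulls, keeping sample-average estimates of Q(a)."""
    Q = np.zeros(env.k)   # value estimates Q(a)
    N = np.zeros(env.k)   # number of times each arm has been pulled
    total = 0.0
    for t in range(steps):
        a = policy(Q, N, t)           # select a_t from the current estimates
        r = env.pull(a)               # r_t ~ R_{a_t}
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]     # incremental sample mean: Q(a) -> E[r | a]
        total += r
    return Q, total
```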
the goal is to find the optimal policy $\pi^*$ that maximizes the cumulative reward $\sum_{t=1}^{T} r_t$
policies must take the exploration vs. exploitation dilemma into account, combining explorative actions, which sample arms with uncertain rewards to update their value estimates, and greedy actions, which increase the total cumulative reward by choosing the action with the highest current value estimate
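Two textbook policies that combine greedy and explorative actions in this way are ε-greedy and UCB1; the sketch below is generic rather than specific to any particular adaptive-sampling scheme (the function names and the exploration constant `c = 2` are arbitrary choices):

```python
import numpy as np

def epsilon_greedy(epsilon=0.1, seed=1):
    """With probability ε take a random (explorative) action, otherwise a greedy one."""
    rng = np.random.default_rng(seed)
    def policy(Q, N, t):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))   # explorative action
        return int(np.argmax(Q))               # greedy action
    return policy

def ucb1(c=2.0):
    """UCB1: greedy on Q plus an exploration bonus that shrinks for well-sampled arms."""
    def policy(Q, N, t):
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:                   # pull every arm at least once
            return int(untried[0])
        return int(np.argmax(Q + c * np.sqrt(np.log(t + 1) / N)))
    return policy

# example usage with the sketches above:
# env = GaussianBandit(k=10)
# Q, total = run(env, epsilon_greedy(0.1), steps=1000)
```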
the main advantage of describing adaptive sampling as a multi-armed bandit is that we can draw on the extensive bandit literature for solutions and replace heuristic policies with more mathematically sound ones