Multi-armed Bandit Problem

  • The Multi-arm Bandit problem is a simplified reinforcement learning setting where one faces the exploration versus exploitation dilemma
  • The problem is defined as a tuple , where
    • is a set of actions
    • and is an unknown probability distribution of rewards given the chosen action
  • we choose for totally discounted rewards
  • at each time step , the agent applies a policy to select an action , based on previous actions taken and the respectively obtained rewards
  • subsequently, the environment returns a reward
  • given that we set , we define the value of ac action as its instantaneous mean reward
  • the goal is to find the optimal policy that maximized the cumulative rewards
  • policies must take into account the exploration vs. exploitation dilemma and combine both explorative actions, to sample their associated unknown reward function to update their value estimates, and greedy actions, to increase the total cumulative reward by choosing the action with the highest value estimate
  • the main advantage of describing adaptive sampling in terms of a multi-armed bandit is that we can benefit from the extensive literature on bandits to find solutions and replace heuristic policies with more mathematically sound ones