The Trade-Off

  • There are two competing concerns when learning on the go while trying to maximize the amount of reward.
    • The agent can exploit what it already knows so that it does not get penalized; however, this limits the agent to a local maximum.
    • The agent can explore the environment to discover the global maximum, at the cost of potentially being penalized.
  • The heart of the trade-off is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
  • We can model this using the k-armed bandit problem:

Choose among k actions, each of which yields a reward drawn from a probability distribution that is unknown to the agent. The goal is to maximize the total reward over time.
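
As a concrete illustration, a minimal k-armed bandit environment might look like the sketch below. This is my own sketch (the class name KArmedBandit and its interface are not from the source); each arm's reward is drawn from a Gaussian whose mean is hidden from the agent.

```python
import numpy as np

class KArmedBandit:
    """A k-armed bandit with Gaussian rewards (illustrative sketch).

    Each arm a has an unknown true value q_star[a]; pulling it returns
    a reward sampled from a normal distribution centered on that value.
    """

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values, hidden from the agent.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        """Return a noisy reward for the chosen arm."""
        return float(self.rng.normal(self.q_star[action], 1.0))

# Example: pull arm 3 once.
bandit = KArmedBandit(k=10, seed=0)
print(bandit.pull(3))
```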

Balancing Exploration and Exploitation

  • We may encourage exploration by choosing optimistic initial value estimates, so that early rewards fall short of the estimates and the model is constantly "disappointed" into trying other actions (see the sketch after this list).
    • This is only effective for stationary problems.
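
A minimal sketch of optimistic initialization, assuming the KArmedBandit environment sketched above and sample-average value updates (the function name and the initial value of 5.0 are my own illustrative choices):

```python
import numpy as np

def optimistic_greedy(bandit, steps=1000, initial_value=5.0):
    """Greedy selection with optimistically initialized value estimates."""
    Q = np.full(bandit.k, initial_value)   # deliberately too optimistic
    N = np.zeros(bandit.k)                 # pull counts
    for _ in range(steps):
        a = int(np.argmax(Q))              # purely greedy choice
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # sample-average update
    return Q

Q = optimistic_greedy(KArmedBandit(k=10, seed=0))
```

The initial value should exceed any plausible reward; how much larger it is controls how long the early exploration phase lasts.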

Random Choice

  • Always choose to explore by performing random search.
  • Comes at the cost of not exploiting local information.

Greedy

  • Always choose to exploit the option with the highest known reward.
  • Comes at the cost of not exploring at all.

ε-Greedy

  • Choose the greedy (best-known) action, but with some probability ε choose an action at random so that the other actions keep being explored (see the sketch after this list).
  • Eventually, by the law of large numbers, the whole search space will be explored.
  • The choice of ε can be made based on domain knowledge.
  • Continued exploration also lets the agent handle nonstationary reward distributions.
    • The step size (see here) used for the incremental value updates guarantees convergence when it satisfies the following constraints:
  1. The steps are large enough to eventually overcome any initial conditions and noise: $\sum_{n=1}^{\infty} \alpha_n = \infty$.

  2. The steps eventually become small enough to assure convergence: $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$.

    For nonstationary distributions, a constant step size (which violates the second condition) is often preferred, so that the estimates keep tracking the changing rewards.
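
A sketch of ε-greedy selection with an incremental value update, assuming the KArmedBandit environment sketched earlier; the default ε = 0.1 and the optional constant step size are illustrative choices, not from the source.

```python
import numpy as np

def epsilon_greedy(bandit, steps=1000, epsilon=0.1, alpha=None, seed=0):
    """ε-greedy bandit agent.

    With probability epsilon a random arm is explored; otherwise the arm
    with the highest current estimate is exploited.  epsilon=0 recovers
    the greedy strategy and epsilon=1 recovers pure random search.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)   # value estimates
    N = np.zeros(bandit.k)   # pull counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(bandit.k))   # explore
        else:
            a = int(np.argmax(Q))             # exploit
        r = bandit.pull(a)
        N[a] += 1
        # Step size: 1/N(a) (sample average) satisfies the two conditions
        # above; a constant alpha keeps adapting to nonstationary rewards.
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (r - Q[a])
    return Q
```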

Upper Confidence Bound Action Selection

  • An extension of ε-greedy which selects non-greedy actions according to their potential for actually being optimal (sketched in code at the end of this section). It is given as

    $$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

    where $N_t(a)$ denotes the number of times action $a$ has been taken prior to time step $t$, and $c > 0$ controls the degree of exploration.

  • Essentially, rather than using the reward signals as they are, the model also takes into account the potential future reward that could be received from each action. This potential is modeled as an uncertainty and is captured via the following observations:

    • Each time an action is selected, the uncertainty about this action’s reward is reduced.

    • Each time an action is not selected, the uncertainty about the action increases.

      The uncertainty eventually grows at a slower pace, since having more samples overall should also reduce it, but it keeps increasing without bound, so every action will eventually be selected.

  • This method favors actions that have not been explored that much.

  • However, this method does not generalize well to nonstationary distributions, and it is also computationally expensive.
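
A sketch of UCB action selection under the same assumptions as before (the KArmedBandit environment and sample-average updates; c = 2.0 is just an illustrative default):

```python
import numpy as np

def ucb(bandit, steps=1000, c=2.0):
    """Upper-confidence-bound action selection (sketch)."""
    Q = np.zeros(bandit.k)   # value estimates
    N = np.zeros(bandit.k)   # pull counts
    for t in range(1, steps + 1):
        if np.any(N == 0):
            a = int(np.argmin(N))   # try every arm once first
        else:
            # Exploration bonus shrinks as an arm is pulled more often
            # and grows slowly (via ln t) while it is ignored.
            bonus = c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q
```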

Gradient Bandits

  • Gradient bandits make use of stochastic gradient ascent to improve the preference towards a certain action.

    • It can be shown that the expected update of the gradient bandit algorithm is equal to the gradient of expected reward.
    • This holds under the assumption that the baseline does not depend on the action selected.
  • The idea in general is that (sketched in code after this list):

    • For the chosen action, update its preference based on how much better the reward is relative to a baseline (e.g., the average reward so far).
    • If the probability of choosing this action was low, then there is a more drastic change in its preference.
    • For the actions that were not chosen, move their preferences in the opposite direction: decrease them when the reward beats the baseline, and increase them when it falls short.
    • This change is more drastic for actions that were more likely to be chosen.
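
A sketch of a gradient bandit with a running-average baseline, again assuming the KArmedBandit environment above (α = 0.1 is an arbitrary step size):

```python
import numpy as np

def gradient_bandit(bandit, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit: softmax over preferences, updated by gradient ascent."""
    rng = np.random.default_rng(seed)
    H = np.zeros(bandit.k)   # action preferences
    baseline = 0.0           # running average reward
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax policy
        a = int(rng.choice(bandit.k, p=pi))
        r = bandit.pull(a)
        baseline += (r - baseline) / t       # incremental average reward
        one_hot = np.zeros(bandit.k)
        one_hot[a] = 1.0
        # Chosen action's preference moves by (r - baseline) * (1 - pi[a]);
        # the other preferences move the opposite way, scaled by pi[b].
        H += alpha * (r - baseline) * (one_hot - pi)
    return H
```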

Contextual Bandits

  • The goal here is to learn a policy that maps each situation to the action that works best in that situation.
  • The associative search task involves both trial and error to search for the best actions, and association of these actions with the situations in which they are best.
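
A minimal associative-search sketch under assumed interfaces (get_context() announcing a discrete situation and pull(context, action) returning a reward; neither is from the source): the agent simply keeps one ε-greedy value table per situation.

```python
import numpy as np

def contextual_epsilon_greedy(n_contexts, k, get_context, pull,
                              steps=1000, epsilon=0.1, seed=0):
    """ε-greedy with a separate value table for every situation."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_contexts, k))   # one row of estimates per context
    N = np.zeros((n_contexts, k))
    for _ in range(steps):
        s = get_context()                     # which situation are we in?
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore
        else:
            a = int(np.argmax(Q[s]))          # exploit within this context
        r = pull(s, a)
        N[s, a] += 1
        Q[s, a] += (r - Q[s, a]) / N[s, a]
    return Q
```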

Bayesian Method

  • An alternative is to balance the probability of performing exploration and exploitation via Bayesian statistics.
  • Compute, for every possible action, the probability of each possible immediate reward and the resulting posterior distribution over action values.
  • Note that this computation explodes in complexity very quickly, so it is not feasible to do exactly (though it may be approximated).
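
One common approximate realization of this idea is Thompson sampling (my example; the notes do not name a specific method): keep a posterior over each action's value, sample from every posterior, and play the arm with the largest draw. For 0/1 rewards the Beta distribution is a conjugate prior, so the posterior update is just two counters.

```python
import numpy as np

def thompson_sampling(pull, k, steps=1000, seed=0):
    """Thompson sampling for Bernoulli rewards (illustrative sketch).

    pull(action) -> 0 or 1 is an assumed interface, not from the source.
    """
    rng = np.random.default_rng(seed)
    successes = np.zeros(k)
    failures = np.zeros(k)
    for _ in range(steps):
        # Sample a plausible success rate for each arm from its posterior.
        samples = rng.beta(successes + 1, failures + 1)
        a = int(np.argmax(samples))
        r = pull(a)                 # observed reward, 0 or 1
        successes[a] += r
        failures[a] += 1 - r
    return successes, failures
```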

Links