The Trade-Off

  • There are two competing concerns when learning on the go while trying to maximize the amount of reward.
    • The agent can exploit what it already knows so that it does not get penalized; however, this limits the agent to a local maximum.
    • The agent can explore the environment to discover the global maximum, at the cost of potentially being penalized.
  • The heart of the trade-off is that neither exploration nor exploitation can be pursued exclusively without failing at the task.
  • We can model this using the k-armed bandit problem:

Choose among k actions, each of which yields a reward drawn from a probability distribution that is unknown to the agent. The goal is to maximize the total reward over time.
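
As a concrete illustration, a minimal k-armed bandit environment might look like the sketch below. This is my own sketch (the class name KArmedBandit and its interface are not from the source); each arm's reward is drawn from a Gaussian whose mean is hidden from the agent.

```python
import numpy as np

class KArmedBandit:
    """A k-armed bandit with Gaussian rewards (illustrative sketch).

    Each arm a has an unknown true value q_star[a]; pulling it returns
    a reward sampled from a normal distribution centered on that value.
    """

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values, hidden from the agent.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        """Return a noisy reward for the chosen arm."""
        return float(self.rng.normal(self.q_star[action], 1.0))

# Example: pull arm 3 once.
bandit = KArmedBandit(k=10, seed=0)
print(bandit.pull(3))
```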

Balancing Exploration and Exploitation

  • We may encourage exploration by choosing optimistic initial value estimates, so that early rewards fall short of the estimates and the model is constantly "disappointed" into trying other actions (see the sketch after this list).
    • This is only effective for stationary problems.
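
A minimal sketch of optimistic initialization, assuming the KArmedBandit environment sketched above and sample-average value updates (the function name and the initial value of 5.0 are my own illustrative choices):

```python
import numpy as np

def optimistic_greedy(bandit, steps=1000, initial_value=5.0):
    """Greedy selection with optimistically initialized value estimates."""
    Q = np.full(bandit.k, initial_value)   # deliberately too optimistic
    N = np.zeros(bandit.k)                 # pull counts
    for _ in range(steps):
        a = int(np.argmax(Q))              # purely greedy choice
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # sample-average update
    return Q

Q = optimistic_greedy(KArmedBandit(k=10, seed=0))
```

The initial value should exceed any plausible reward; how much larger it is controls how long the early exploration phase lasts.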

Random Choice

  • Always choose to explore by performing random search.
  • Comes at the cost of not exploiting local information.

Greedy

  • Always choose to exploit the option with the highest known reward.
  • Comes at the cost of not exploring at all.

ε-Greedy

  • Choose the greedy (best-known) action, but with some probability ε choose an action at random so that the other actions keep being explored (see the sketch after this list).
  • Eventually, by the law of large numbers, the whole search space will be explored.
  • The choice of ε can be made based on domain knowledge.
  • Continued exploration also lets the agent handle nonstationary reward distributions.
    • The step size (see here) used for the incremental value updates guarantees convergence when it satisfies the following constraints:
  1. The steps are large enough to eventually overcome any initial conditions and noise: $\sum_{n=1}^{\infty} \alpha_n = \infty$.

  2. The steps eventually become small enough to assure convergence: $\sum_{n=1}^{\infty} \alpha_n^2 < \infty$.

    For nonstationary distributions, a constant step size (which violates the second condition) is often preferred, so that the estimates keep tracking the changing rewards.
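
A sketch of ε-greedy selection with an incremental value update, assuming the KArmedBandit environment sketched earlier; the default ε = 0.1 and the optional constant step size are illustrative choices, not from the source.

```python
import numpy as np

def epsilon_greedy(bandit, steps=1000, epsilon=0.1, alpha=None, seed=0):
    """ε-greedy bandit agent.

    With probability epsilon a random arm is explored; otherwise the arm
    with the highest current estimate is exploited.  epsilon=0 recovers
    the greedy strategy and epsilon=1 recovers pure random search.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)   # value estimates
    N = np.zeros(bandit.k)   # pull counts
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(bandit.k))   # explore
        else:
            a = int(np.argmax(Q))             # exploit
        r = bandit.pull(a)
        N[a] += 1
        # Step size: 1/N(a) (sample average) satisfies the two conditions
        # above; a constant alpha keeps adapting to nonstationary rewards.
        step = alpha if alpha is not None else 1.0 / N[a]
        Q[a] += step * (r - Q[a])
    return Q
```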

Upper Confidence Bound Action Selection

  • An extension of ε-greedy which selects non-greedy actions according to their potential for actually being optimal (sketched in code at the end of this section). It is given as

    $$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

    where $N_t(a)$ denotes the number of times action $a$ has been taken prior to time step $t$, and $c > 0$ controls the degree of exploration.

  • Essentially, rather than using the reward signals as they are, the model also takes into account the potential future reward that could be received from each action. This potential is modeled as an uncertainty and is captured via the following observations:

    • Each time an action is selected, the uncertainty about this action’s reward is reduced.

    • Each time an action is not selected, the uncertainty about the action increases.

      The uncertainty eventually grows at a slower pace, since having more samples overall should also reduce it, but it keeps increasing without bound, so every action will eventually be selected.

  • This method favors actions that have not been explored that much.

  • However, this method does not generalize well to nonstationary distributions, and it is also computationally expensive.
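
A sketch of UCB action selection under the same assumptions as before (the KArmedBandit environment and sample-average updates; c = 2.0 is just an illustrative default):

```python
import numpy as np

def ucb(bandit, steps=1000, c=2.0):
    """Upper-confidence-bound action selection (sketch)."""
    Q = np.zeros(bandit.k)   # value estimates
    N = np.zeros(bandit.k)   # pull counts
    for t in range(1, steps + 1):
        if np.any(N == 0):
            a = int(np.argmin(N))   # try every arm once first
        else:
            # Exploration bonus shrinks as an arm is pulled more often
            # and grows slowly (via ln t) while it is ignored.
            bonus = c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
    return Q
```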

Gradient Bandits

  • Gradient bandits make use of stochastic gradient ascent to improve the preference towards a certain action.

    • It can be shown that the expected update of the gradient bandit algorithm is equal to the gradient of expected reward.
    • This holds under the assumption that the baseline does not depend on the action selected.
  • The idea in general is that (sketched in code after this list):

    • For the chosen action, update its preference based on how much better the reward is relative to a baseline (e.g., the average reward so far).
    • If the probability of choosing this action was low, then there is a more drastic change in its preference.
    • For the actions that were not chosen, move their preferences in the opposite direction: decrease them when the reward beats the baseline, and increase them when it falls short.
    • This change is more drastic for actions that were more likely to be chosen.
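
A sketch of a gradient bandit with a running-average baseline, again assuming the KArmedBandit environment above (α = 0.1 is an arbitrary step size):

```python
import numpy as np

def gradient_bandit(bandit, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit: softmax over preferences, updated by gradient ascent."""
    rng = np.random.default_rng(seed)
    H = np.zeros(bandit.k)   # action preferences
    baseline = 0.0           # running average reward
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                       # softmax policy
        a = int(rng.choice(bandit.k, p=pi))
        r = bandit.pull(a)
        baseline += (r - baseline) / t       # incremental average reward
        one_hot = np.zeros(bandit.k)
        one_hot[a] = 1.0
        # Chosen action's preference moves by (r - baseline) * (1 - pi[a]);
        # the other preferences move the opposite way, scaled by pi[b].
        H += alpha * (r - baseline) * (one_hot - pi)
    return H
```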

Contextual Bandits

  • The goal here is to learn a policy that maps each situation to the action that works best in that situation.
  • The associative search task involves both trial and error to search for the best actions, and association of these actions with the situations in which they are best.
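
A minimal associative-search sketch under assumed interfaces (get_context() announcing a discrete situation and pull(context, action) returning a reward; neither is from the source): the agent simply keeps one ε-greedy value table per situation.

```python
import numpy as np

def contextual_epsilon_greedy(n_contexts, k, get_context, pull,
                              steps=1000, epsilon=0.1, seed=0):
    """ε-greedy with a separate value table for every situation."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_contexts, k))   # one row of estimates per context
    N = np.zeros((n_contexts, k))
    for _ in range(steps):
        s = get_context()                     # which situation are we in?
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore
        else:
            a = int(np.argmax(Q[s]))          # exploit within this context
        r = pull(s, a)
        N[s, a] += 1
        Q[s, a] += (r - Q[s, a]) / N[s, a]
    return Q
```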

Bayesian Method

  • An alternative is to balance the probability of performing exploration and exploitation via Bayesian statistics.
  • Compute, for every possible action, the probability of each possible immediate reward and the resulting posterior distribution over action values.
  • Note that this computation explodes in complexity very quickly, so it is not feasible to do exactly (though it may be approximated).
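
One common approximate realization of this idea is Thompson sampling (my example; the notes do not name a specific method): keep a posterior over each action's value, sample from every posterior, and play the arm with the largest draw. For 0/1 rewards the Beta distribution is a conjugate prior, so the posterior update is just two counters.

```python
import numpy as np

def thompson_sampling(pull, k, steps=1000, seed=0):
    """Thompson sampling for Bernoulli rewards (illustrative sketch).

    pull(action) -> 0 or 1 is an assumed interface, not from the source.
    """
    rng = np.random.default_rng(seed)
    successes = np.zeros(k)
    failures = np.zeros(k)
    for _ in range(steps):
        # Sample a plausible success rate for each arm from its posterior.
        samples = rng.beta(successes + 1, failures + 1)
        a = int(np.argmax(samples))
        r = pull(a)                 # observed reward, 0 or 1
        successes[a] += r
        failures[a] += 1 - r
    return successes, failures
```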

Links