• Temporal Difference (TD) learning is a combination of Dynamic Programming (via bootstrapping) and Monte Carlo Methods (via learning from experience).
  • The central idea is to learn while in the middle of the episode using an appropriate estimate based on the agent’s current predictions.

Advantages

  • Unlike DP, it does not require a model of the environment.
  • Unlike Monte Carlo, it does not require simulating the whole episode.
    • This means that small changes in the configuration of the environment (e.g., the start states) do not require us to re-simulate many full episodes; we can simply transfer the knowledge already accumulated in the value estimates.
    • It also makes TD applicable to continuing tasks, which Monte Carlo cannot handle.
  • Compared to Monte Carlo methods that must ignore or discount episodes on which exploratory actions are taken, TD learns much faster because it learns from every transition.
  • Compared to batch Monte Carlo, batch TD tends to converge towards a model that would likely be correct for future data rather than one that is correct only for the data already seen.
    • Specifically, batch TD(0) converges to the certainty-equivalence estimate: the value function that would be exactly correct if the maximum-likelihood model of the underlying process were known with certainty rather than being approximated.

TD Prediction

  • TD methods wait only until the next time step to form a target and make a useful update using the observed reward $R_{t+1}$ and the current estimate $V(S_{t+1})$. The update takes the form (a minimal code sketch follows)
      $V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$
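A minimal tabular TD(0) prediction sketch of the update above. The gym-like environment interface (`env.reset()`, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not part of the original notes.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V_pi from individual transitions, mid-episode."""
    V = defaultdict(float)  # arbitrary initialization (zero here)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target R_{t+1} + gamma * V(S_{t+1}); terminal states have value 0
            target = reward + (0.0 if done else gamma * V[next_state])
            # move V(S_t) towards the target by the TD error, scaled by alpha
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```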

TD-errors

  • The target for updating is the bootstrapped return estimate $G_{t:t+n}$ (i.e., the estimated discounted return when the agent takes $n$ steps from the current state).

    • $G_{t:t+n}$ is the target estimate. In TD(0) this is given as
        $G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1})$

    • It makes use of two estimates: the sampled reward $R_{t+1}$ and the estimated value $V(S_{t+1})$. The sampled part uses only the observed successor state's reward rather than the full distribution over successors.

    • The error in this estimate is proportional to the temporal difference between predictions at successive time steps.

    • The TD-error is defined as
        $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$

  • The Monte Carlo error can be written as a sum of discounted TD errors (a short derivation follows this list):
      $G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$

    • This identity is not exact if $V$ is updated during the episode, but if the step size $\alpha$ is small then it still holds approximately.
  • In an average-reward setting the TD error is taken relative to the average reward:
      $\delta_t \doteq R_{t+1} - \bar{R}_t + V(S_{t+1}) - V(S_t)$
    where $\bar{R}_t$ is the estimate at time $t$ of the average reward (the action-value version uses $Q$ in place of $V$).

  • Note that every quantity in these TD-error formulae can itself be replaced by its current estimate (e.g., the average reward by the running estimate $\bar{R}_t$).
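A short derivation of the Monte Carlo identity above (the telescoping argument; it assumes $V$ is held fixed over the episode and that $G_T = V(S_T) = 0$ at the terminal time $T$):

$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
             &= \underbrace{R_{t+1} + \gamma V(S_{t+1}) - V(S_t)}_{\delta_t} + \gamma \bigl( G_{t+1} - V(S_{t+1}) \bigr) \\
             &= \delta_t + \gamma \delta_{t+1} + \gamma^2 \bigl( G_{t+2} - V(S_{t+2}) \bigr) \\
             &= \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k .
\end{aligned}
$$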

TD Control

SARSA

  • Stands for State-Action-Reward-State-Action, i.e., the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

  • It is an on-policy control method. Hence, it uses an $\varepsilon$-greedy policy for exploration.

  • We estimate the action-value function $q_\pi$ for the current behaviour policy by considering transitions from state-action pair $(S_t, A_t)$ to state-action pair $(S_{t+1}, A_{t+1})$ using the following update rule (a code sketch follows this list):
      $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$
    with $Q(S_{t+1}, A_{t+1}) \doteq 0$ whenever $S_{t+1}$ is terminal.

  • We then choose actions (near-)greedily with respect to $Q$. Note: this is important; because SARSA is on-policy, the learned values are those of the policy actually followed, so enacting a suboptimal policy degrades what is learned.

  • SARSA learns during the episode which policies are bad and can quickly adjust away from them.

  • It can only perform well in the long run with a small step size $\alpha$, at which short-term performance is poor.
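A minimal tabular SARSA sketch of the update rule above. As before, the environment interface and the `epsilon_greedy` helper are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Random action with probability epsilon, otherwise greedy w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: the target uses the action the policy actually takes next."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # SARSA target: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}); zero at terminal states
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```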

Q-Learning

  • An off-policy TD control algorithm. It is off-policy because actions are selected with a behaviour policy derived from $Q$ (e.g., $\varepsilon$-greedy), which moves us to $S_{t+1}$, but the update for the current state-action pair $(S_t, A_t)$ bootstraps from the greedy action at $S_{t+1}$ rather than the action actually taken, so the learned $Q$ directly approximates the optimal action-value function.

  • It is characterized by the update (a code sketch follows this list)
      $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]$

  • It actually learns the values of the optimal policy, but its online (during-learning) performance can be worse than SARSA's.
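A minimal tabular Q-learning sketch under the same assumed environment interface; only the bootstrap target differs from the SARSA sketch above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, bootstrap from the greedy action."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy derived from the current Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target uses max_a Q(S_{t+1}, a), not the action the behaviour policy takes next
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```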

Expected SARSA

  • A generalization of Q-learning and SARSA that can also be used off-policy (Q-learning is the special case in which the target policy is greedy).

  • The target is
      $R_{t+1} + \gamma\, \mathbb{E}_\pi\!\big[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1} \big] = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$

  • Moves towards the expected value, taking into account how likely each action is under the current policy. That is, the rule becomes (see the sketch after this list)
      $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \big]$

  • It is more computationally complex than SARSA, but it eliminates the variance due to the random selection of $A_{t+1}$.
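A sketch of computing the expected target above, assuming (purely for illustration) that the target policy $\pi$ is $\varepsilon$-greedy with respect to the current $Q$.

```python
def expected_sarsa_target(Q, next_state, actions, reward, gamma, epsilon, done):
    """Expected SARSA target: R_{t+1} + gamma * sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a)."""
    if done:
        return reward  # no bootstrap from a terminal state
    q_vals = [Q[(next_state, a)] for a in actions]
    greedy_idx = max(range(len(actions)), key=lambda i: q_vals[i])
    # epsilon-greedy probabilities: epsilon spread uniformly, the rest on the greedy action
    probs = [epsilon / len(actions)] * len(actions)
    probs[greedy_idx] += 1.0 - epsilon
    expected_q = sum(p * q for p, q in zip(probs, q_vals))
    return reward + gamma * expected_q
```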

Double Learning

  • Maximization bias: a bias that results from implicitly using the maximum over estimated values as an estimate of the maximum value, as in the other TD control methods. The bias arises because the maximum of noisy estimates tends to exceed the true (or even the expected) maximum, and we use the same samples both to determine the maximizing action and to estimate its value.

  • Double learning means dividing the plays into two sets and using them to learn two independent estimates, $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$.

    • $Q_1$ is used to determine the maximizing action $A^* = \arg\max_a Q_1(a)$.

    • $Q_2$ provides the value estimate via $Q_2(A^*) = Q_2\!\big(\arg\max_a Q_1(a)\big)$, which is unbiased in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$.

    • A second unbiased estimate is also obtainable with the roles swapped: $Q_1\!\big(\arg\max_a Q_2(a)\big)$.

    • The actual process of updating involves dividing the time steps in two (e.g., by a coin flip) such that on each step we update either $Q_1$ or $Q_2$ using the formulae above.

  • Double Q-learning makes use of the following update rule (a code sketch follows this list)
      $Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma\, Q_2\!\big(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t) \Big]$

    where we alternate (e.g., by a coin flip) between using this update and the same update with the roles of $Q_1$ and $Q_2$ swapped.

  • We may also base the behaviour policy on the average or sum of $Q_1$ and $Q_2$.
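A minimal tabular double Q-learning sketch of the rule above, under the same assumed environment interface as the earlier sketches.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Two tables: one selects the maximizing action, the other evaluates it."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily w.r.t. the sum of both estimates
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # coin flip: update table A using table B's evaluation of A's argmax
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: A[(next_state, a)])  # argmax under A
                target = reward + gamma * B[(next_state, a_star)]        # evaluated by B
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```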

Retrace

  • Retrace($\lambda$)¹ is an off-policy Q-value estimation algorithm with guaranteed convergence.

  • It addresses the high variance that importance weights on their own can introduce.

  • The estimation uses truncated importance weights: the trace coefficients are $c_s = \lambda \min\!\left(1, \frac{\pi(A_s \mid S_s)}{\mu(A_s \mid S_s)}\right)$, i.e., each importance ratio should be no more than $1$ (before scaling by $\lambda$).

    This means that to improve the estimate of $Q(S_t, A_t)$ we increment it by (a code sketch follows)
      $\Delta Q(S_t, A_t) = \sum_{k \ge t} \gamma^{k-t} \Big( \prod_{s=t+1}^{k} c_s \Big) \big[ R_{k+1} + \gamma\, \mathbb{E}_\pi Q(S_{k+1}, \cdot) - Q(S_k, A_k) \big]$
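A sketch of computing that Retrace($\lambda$) increment along one stored trajectory. The argument names and array layout are assumptions for illustration; $\pi$ is the target policy and $\mu$ the behaviour policy that generated the data.

```python
def retrace_increment(q_vals, rewards, taken, pi_probs, mu_probs, gamma=0.99, lam=1.0):
    """Retrace(lambda) increment for Q(S_0, A_0) along one stored trajectory.

    Hypothetical inputs (lists indexed by step k):
      q_vals[k][a]    current Q(S_k, a), length T+1 (all-zero row for a terminal S_T)
      rewards[k]      R_{k+1} received after acting in S_k, length T
      taken[k]        index of the action A_k actually executed, length T
      pi_probs[k][a]  target-policy probabilities pi(a | S_k), length T+1
      mu_probs[k][a]  behaviour-policy probabilities mu(a | S_k), length T
    """
    increment, coeff = 0.0, 1.0  # coeff accumulates gamma^k * prod_{s<=k} c_s
    for k in range(len(rewards)):
        if k > 0:
            # truncated importance weight c_k = lam * min(1, pi/mu)
            rho = pi_probs[k][taken[k]] / mu_probs[k][taken[k]]
            coeff *= gamma * lam * min(1.0, rho)
        # expected TD error: R_{k+1} + gamma * E_pi[Q(S_{k+1}, .)] - Q(S_k, A_k)
        expected_next = sum(p * q for p, q in zip(pi_probs[k + 1], q_vals[k + 1]))
        delta = rewards[k] + gamma * expected_next - q_vals[k][taken[k]]
        increment += coeff * delta
    return increment  # e.g. Q[S_0][A_0] += alpha * increment
```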

Footnotes

  1. Munos et al. (2016) Safe and efficient off-policy reinforcement learning