- Temporal Difference (TD) learning is a combination of Dynamic Programming (via bootstrapping) and Monte Carlo Methods (via learning from experience).
- The central idea is to learn while in the middle of the episode using an appropriate estimate based on the agent’s current predictions.
Advantages
- Unlike DP, it does not require a model of the environment.
- Unlike Monte Carlo, it does not require simulating the whole episode.
- This means that small changes in the configuration of the environment (e.g., the start states) do not require us to re-simulate many episodes. Instead, we can simply transfer knowledge from one model to another.
- It also makes TD applicable to continuing tasks, which Monte Carlo methods cannot handle.
- Compared to Monte Carlo, it tends to learn faster, in part because it does not need to discount or ignore episodes on which exploratory actions are taken.
- Compared to batch Monte Carlo, batch TD tends to converge towards a model that is likely to be correct for future data rather than one that is correct only for the present data.
- Specifically, batch TD(0) converges to the certainty-equivalence estimate: the value function that would be exactly correct if the maximum-likelihood model of the underlying process were known with certainty rather than being approximated.
TD Prediction
- TD methods wait only until the next time step to form a target and make a useful update using the current estimate $V(S_{t+1})$. The update takes the form (a minimal sketch follows this list) $V(S_t) \leftarrow V(S_t) + \alpha\big[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big]$
TD-errors
- The target for updating is $R_{t+1} + \gamma V(S_{t+1})$ (i.e., the sampled reward plus the estimated discounted value of the state the agent steps into from the current state). In TD(0) this target stands in for the full return $G_t$.
- It makes use of two estimates: the sampled reward $R_{t+1}$, and the estimated value $V(S_{t+1})$. The sampled part uses only the successor state that actually occurred rather than the full distribution of successors.
- The error in this estimate is proportional to the temporal difference between successive predictions.
- The TD-error is defined as $\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
- The Monte Carlo error can be written as a sum of discounted TD-errors: $G_t - V(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$
- This identity is not exact if $V$ is updated during the episode, but if the step size is small then it may still hold approximately.
- In an average-reward setting the TD-error becomes $\delta_t \doteq R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$
- Note we can replace everything in the TD-error formulae with estimates.
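A minimal sketch of tabular TD(0) prediction under the update above. The Gym-style interface (`env.reset()` returning an integer state, `env.step(action)` returning `(next_state, reward, done, info)`) and the `policy` callable are assumptions for illustration, not part of the note.

```python
import numpy as np

def td0_prediction(env, policy, n_states, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): update V(S_t) towards R_{t+1} + gamma * V(S_{t+1})."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # TD target bootstraps from the current estimate of the next state;
            # terminal states contribute zero value.
            target = reward + gamma * V[next_state] * (not done)
            # TD-error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            delta = target - V[state]
            V[state] += alpha * delta
            state = next_state
    return V
```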
TD Control
SARSA
- Stands for the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.
- It is an on-policy control method. Hence, it uses an $\varepsilon$-greedy policy for exploration (see here).
- We estimate the action-value function for the current policy by considering transitions from state-action pair to state-action pair, using the following update rule (see the sketch after this list): $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$
- We then choose actions based on $Q$ (e.g., $\varepsilon$-greedily). Note: this is important because SARSA is on-policy, so if a suboptimal policy is enacted, the quality of the learned values degrades with it.
- SARSA can learn during the episode that the current policy is bad and quickly adjust.
- It only performs well in the long run with a small step size $\alpha$, at which short-term performance is poor.
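A sketch of tabular SARSA under the same assumed Gym-style interface as the TD(0) sketch above; the $\varepsilon$-greedy helper is illustrative.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps, rng):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular SARSA: on-policy TD control over (S, A, R, S', A') transitions."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, eps, rng)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, eps, rng)
            # The target uses the action actually selected by the behaviour policy.
            target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```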
Q-Learning
- An off-policy TD control algorithm. It is off-policy because the behaviour policy derived from $Q$ (e.g., $\varepsilon$-greedy, which selects the action that takes us to $S_{t+1}$) differs from the greedy target policy used to update the value of the current state-action pair $(S_t, A_t)$.
- It is characterized by the update (sketched below) $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big]$
- It actually learns the values of the optimal policy, but it is not as good for online learning, since those values do not account for the exploratory actions the behaviour policy still takes.
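A sketch of tabular Q-learning with the same assumed interface. The only change from the SARSA sketch is the target, which bootstraps from the greedy action rather than the action the behaviour policy takes next.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: off-policy TD control."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to Q.
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Target policy: greedy -- bootstrap from max_a Q(S_{t+1}, a),
            # regardless of which action is actually taken next.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```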
Expected SARSA
- An off-policy generalization of both Q-learning and SARSA.
- The target is $R_{t+1} + \gamma\, \mathbb{E}_\pi\big[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\big] = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$
- It moves towards the expected value, taking into account how likely each action is under the current policy. That is, the update rule becomes (see the sketch below) $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\big]$
- It is more computationally complex than SARSA, but eliminates the variance due to the random selection of $A_{t+1}$.
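A sketch of how the Expected SARSA target could be computed when the current policy is $\varepsilon$-greedy with respect to `Q`; the function name and tabular layout are assumptions for illustration.

```python
import numpy as np

def expected_sarsa_target(Q, next_state, reward, done, gamma=0.99, eps=0.1):
    """Expected SARSA target: average over next actions under the
    epsilon-greedy policy instead of sampling A_{t+1}."""
    n_actions = Q.shape[1]
    # Probabilities of each action under an epsilon-greedy policy w.r.t. Q.
    probs = np.full(n_actions, eps / n_actions)
    probs[int(np.argmax(Q[next_state]))] += 1.0 - eps
    expected_q = float(np.dot(probs, Q[next_state]))
    return reward + gamma * expected_q * (not done)

# The update itself is then the usual TD step:
# Q[state, action] += alpha * (expected_sarsa_target(...) - Q[state, action])
```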
Double Learning
- Maximization bias: a bias that results from implicitly using the maximum over estimated values as an estimate of the maximum value, as in the other TD control methods above. The bias arises because the maximum over noisy estimates is, in expectation, at least the maximum of the true values (and typically above it), and we use the same samples both to determine the maximizing action and to estimate its value.
- Double learning divides the plays into two sets and uses them to learn two independent estimates, $Q_1(a)$ and $Q_2(a)$, each an estimate of the true value $q(a)$.
- $Q_1$ is used to determine the maximizing action $A^* = \arg\max_a Q_1(a)$.
- $Q_2$ provides the value estimate via $Q_2(A^*) = Q_2(\arg\max_a Q_1(a))$, which is unbiased in the sense that $\mathbb{E}[Q_2(A^*)] = q(A^*)$.
- A second unbiased estimate is also derivable with the roles reversed: $Q_1(\arg\max_a Q_2(a))$.
- The actual process of updating involves dividing the time steps in two (e.g., by a coin flip) such that on each step we update either $Q_1$ or $Q_2$ using the formulae above.
- Double Q-learning makes use of the following update rule (a sketch follows this list): $Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha\big[R_{t+1} + \gamma\, Q_2\big(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)\big) - Q_1(S_t, A_t)\big]$, where we alternate between using this and the symmetric update with the roles of $Q_1$ and $Q_2$ swapped.
- We may also base the behaviour policy on the average or sum of $Q_1$ and $Q_2$.
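A sketch of tabular Double Q-learning under the same assumed Gym-style interface, flipping a coin each step to decide which estimate is updated and acting on the sum of the two estimates.

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, episodes=1000,
                      alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Double Q-learning: one table selects the action, the other evaluates it."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy on the sum of the two estimates.
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q1[state] + Q2[state]))
            next_state, reward, done, _ = env.step(action)
            # Coin flip: which estimate gets updated this step.
            if rng.random() < 0.5:
                a_star = int(np.argmax(Q1[next_state]))                          # Q1 selects...
                target = reward + gamma * Q2[next_state, a_star] * (not done)    # ...Q2 evaluates
                Q1[state, action] += alpha * (target - Q1[state, action])
            else:
                a_star = int(np.argmax(Q2[next_state]))                          # Q2 selects...
                target = reward + gamma * Q1[next_state, a_star] * (not done)    # ...Q1 evaluates
                Q2[state, action] += alpha * (target - Q2[state, action])
            state = next_state
    return Q1, Q2
```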
Retrace
- Retrace(λ)[^1] is an off-policy Q-value estimation algorithm with guaranteed convergence.
- It addresses the fact that importance weights on their own can introduce high variance.
- The estimation uses truncated importance weights, $c_k = \lambda \min\!\left(1, \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}\right)$, which means the importance ratio is clipped to be no more than $1$. To improve the estimate $Q(S_t, A_t)$ we increment it by (see the sketch below) $\Delta Q(S_t, A_t) = \sum_{s \ge t} \gamma^{s-t} \left(\prod_{k=t+1}^{s} c_k\right) \delta_s$, where $\delta_s = R_{s+1} + \gamma\, \mathbb{E}_\pi\big[Q(S_{s+1}, \cdot)\big] - Q(S_s, A_s)$ is the TD-error under the target policy $\pi$ and $\mu$ is the behaviour policy that generated the data.
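A sketch of the Retrace(λ) increments for a single recorded trajectory, following the formula above; the tabular `Q`, `pi`, `mu` arrays and treating the trajectory end as terminal are simplifying assumptions for illustration.

```python
import numpy as np

def retrace_corrections(Q, states, actions, rewards, pi, mu, gamma=0.99, lam=1.0):
    """For one recorded trajectory, compute the Retrace(lambda) increment
    sum_{s>=t} gamma^{s-t} (prod_{k=t+1}^{s} c_k) * delta_s  for every step t.

    Q, pi, mu: arrays of shape (n_states, n_actions); pi is the target
    policy, mu the behaviour policy that generated the data.
    states, actions, rewards: sequences of length T, with rewards[t] = R_{t+1}.
    """
    T = len(states)
    states = np.asarray(states)
    actions = np.asarray(actions)

    # TD-errors under the target policy's expectation:
    # delta_s = R_{s+1} + gamma * E_pi[Q(S_{s+1}, .)] - Q(S_s, A_s)
    deltas = np.zeros(T)
    for s in range(T):
        if s + 1 < T:
            exp_q_next = float(np.dot(pi[states[s + 1]], Q[states[s + 1]]))
        else:
            exp_q_next = 0.0  # treat the end of the recorded trajectory as terminal
        deltas[s] = rewards[s] + gamma * exp_q_next - Q[states[s], actions[s]]

    # Truncated importance weights c_k = lambda * min(1, pi(A_k|S_k) / mu(A_k|S_k)).
    c = lam * np.minimum(1.0, pi[states, actions] / mu[states, actions])

    corrections = np.zeros(T)
    for t in range(T):
        coeff = 1.0  # gamma^{s-t} * prod_{k=t+1}^{s} c_k, starting at s = t
        for s in range(t, T):
            if s > t:
                coeff *= gamma * c[s]
            corrections[t] += coeff * deltas[s]
    # Q[states[t], actions[t]] can then be nudged by alpha * corrections[t].
    return corrections
```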
Links
- Dynamic Programming for Reinforcement Learning - for more on Dynamic Programming in RL
- Monte Carlo Methods in Reinforcement Learning - for more on Monte Carlo in RL
- N-step Bootstrapping - a generalization of TD-learning
- Reinforcement Learning: An Introduction (Sutton & Barto) - relevant sections:
    - 6.4 - SARSA
    - 6.5 - Q-learning
    - 6.6 - Expected SARSA
    - 6.7 - Double learning
    - 7.2 - n-step SARSA
    - 7.4 - Control variates
    - 7.5 - Tree backup
    - 7.6 - n-step Q(σ)
    - 10.3 - Average reward TD-error
- Q-Learning: Model Free Reinforcement Learning and Temporal Difference Learning - supplementary material for Q-learning
Footnotes
[^1]: Munos et al. (2016), Safe and efficient off-policy reinforcement learning.