The need for eligibility traces arises in general whenever one tries to learn long-term predictions in an efficient manner.
Every result below generalizes from state value functions to action-value functions with the appropriate replacements.
In practice, implementing eligibility traces is quite efficient, since we only need to keep track of and update the few traces that are significantly greater than zero.
The eligibility trace $\mathbf{z}_t \in \mathbb{R}^d$ acts as a short-term memory vector that parallels the weight vector $\mathbf{w}_t$ used in function approximation.
When a component of $\mathbf{w}_t$ participates in producing an estimate, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away at a rate governed by the trace-decay parameter $\lambda$.
Learning then occurs in that component of $\mathbf{w}_t$ if a nonzero TD error occurs before the trace falls back to zero.
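For reference, the accumulating-trace form of semi-gradient TD($\lambda$) for state values (as in Sutton and Barto, Chapter 12) is

$$
\begin{aligned}
\mathbf{z}_{-1} &= \mathbf{0}, \\
\mathbf{z}_t &= \gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{v}(S_t, \mathbf{w}_t), \\
\delta_t &= R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t), \\
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t.
\end{aligned}
$$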
Compared to $n$-step bootstrapping, eligibility traces offer computational advantages.
We no longer need to store the last $n$ feature vectors, just a single $d$-dimensional trace vector.
Learning occurs continually and uniformly in time rather than being delayed by $n$ steps.
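A minimal sketch of one such update with linear value features (the function and argument names here, e.g. `td_lambda_step` and `x_s`, are illustrative, not from the text):

```python
import numpy as np

def td_lambda_step(w, z, x_s, x_next, reward, alpha, gamma, lam, done=False):
    """One online semi-gradient TD(lambda) update with linear features.

    Only the d-dimensional trace z is carried between steps; no feature
    vectors from earlier time steps need to be stored.
    """
    # Accumulating trace: decay by gamma * lambda, then add the current
    # gradient, which for a linear value function is just the feature vector.
    z = gamma * lam * z + x_s
    # One-step TD error (the value of a terminal state is zero).
    v_next = 0.0 if done else w @ x_next
    delta = reward + gamma * v_next - w @ x_s
    # Learning happens wherever the trace is still non-zero.
    w = w + alpha * delta * z
    return w, z
```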
One way to view eligibility traces is as a way of interpolating between temporal-difference and Monte Carlo methods.
Error reduction property: in expectation, the $n$-step return is a better estimate of the value function than the current estimate, with its worst-case error reduced by a factor of $\gamma^n$, so repeated updates converge toward the true value function.
Another way is to view the trace as keeping a temporary record of which components (states, actions, or components of the weight vector) are eligible for undergoing learning changes.
The reinforcing events we are concerned with here are the moment-by-moment TD errors.
Eligibility traces make use of compound updates: updates that average over simpler ($n$-step) updates. The key example is the $\lambda$-return,
$$G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}.$$
In practice, because of the potentially infinite number of terms, we cannot compute this exactly. To compensate, we may use the truncated $\lambda$-return $G_{t:h}^{\lambda}$, which cuts the sum off at some horizon $h$.
It is the weighted average of the $n$-step returns up to the horizon, each weighted by $(1-\lambda)\lambda^{n-1}$, with the residual weight $\lambda^{h-t-1}$ given to the longest one:
$$G_{t:h}^{\lambda} = (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1}G_{t:h}.$$
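As a small worked example, a sketch that computes the truncated $\lambda$-return from recorded rewards and current value estimates (the helper name `truncated_lambda_return` and the data layout are assumptions for illustration):

```python
def truncated_lambda_return(t, h, rewards, values, gamma, lam):
    """Truncated lambda-return G_{t:h}.

    rewards[k] holds R_{k+1}; values[k] holds the current estimate of v(S_k).
    values[h] is used to bootstrap the longest (h - t)-step return; if S_h is
    terminal, it should be zero.
    """
    def n_step_return(n):
        # G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n})
        g = sum(gamma**k * rewards[t + k] for k in range(n))
        return g + gamma**n * values[t + n]

    g_lam = (1 - lam) * sum(lam**(n - 1) * n_step_return(n)
                            for n in range(1, h - t))
    g_lam += lam**(h - t - 1) * n_step_return(h - t)
    return g_lam
```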
We perform updates with the truncated $\lambda$-return using the following (truncated TD, or TTD($\lambda$)) update rule:
$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha\left[G_{t:t+n}^{\lambda} - \hat{v}(S_t, \mathbf{w}_{t+n-1})\right]\nabla\hat{v}(S_t, \mathbf{w}_{t+n-1}), \qquad 0 \le t < T.$$
Note that, as with $n$-step methods, no updates are made on the first $n-1$ time steps, and $n-1$ additional updates are made upon termination.
For efficient implementation, an identity is used that writes the $k$-step $\lambda$-return as the current value estimate plus a sum of $(\gamma\lambda)$-discounted TD errors.
There is a tradeoff: $n$ should be large so that the truncated return closely approximates the full $\lambda$-return, but it should also be small so that updates can be made sooner.
We can make better approximations to the offline $\lambda$-return update, while making the updates happen sooner, at the cost of computational complexity.
Updates are performed during the episode: on each time step, we go back and redo all updates since the beginning of the episode, this time taking the current time step's new data into account.
The online $\lambda$-return algorithm: let $\mathbf{w}_t^h$ denote the weights used to generate the value at time $t$ in the sequence of updates up to horizon $h$. The update takes the form
$$\mathbf{w}_{t+1}^{h} = \mathbf{w}_t^{h} + \alpha\left[G_{t:h}^{\lambda} - \hat{v}(S_t, \mathbf{w}_t^{h})\right]\nabla\hat{v}(S_t, \mathbf{w}_t^{h}), \qquad 0 \le t < h \le T.$$
It has more computational complexity, but it is also more accurate than the offline update, because the weight vector used for bootstrapping has received more updates and is therefore more informative.
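A sketch of this "redo everything at each new horizon" structure, reusing the hypothetical `truncated_lambda_return` helper from above and assuming linear features:

```python
import numpy as np

def online_lambda_return_pass(features, rewards, h, w_init, alpha, gamma, lam):
    """Redo all updates from the start of the episode up to horizon h.

    features[k] is x(S_k); rewards[k] is R_{k+1}. Returns the final weight
    vector of this pass. Called once per time step with h = 1, 2, ..., which
    is what makes the online algorithm expensive: O(h) updates per step.
    """
    w = w_init.copy()  # every pass restarts from the weights at episode start
    for t in range(h):
        # Value estimates of S_0, ..., S_h under the current weights w_t^h.
        values = [w @ x for x in features[: h + 1]]
        g = truncated_lambda_return(t, h, rewards, values, gamma, lam)
        # Semi-gradient update toward the truncated lambda-return.
        w = w + alpha * (g - w @ features[t]) * features[t]
    return w
```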
The true online TD($\lambda$) algorithm[^1] is derived by using the following update rule; note that it works for the linear case, where $\hat{v}(s,\mathbf{w}) = \mathbf{w}^{\top}\mathbf{x}(s)$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t + \alpha\left(\mathbf{w}_t^{\top}\mathbf{x}_t - \mathbf{w}_{t-1}^{\top}\mathbf{x}_t\right)\left(\mathbf{z}_t - \mathbf{x}_t\right),$$
with the "dutch" trace
$$\mathbf{z}_t = \gamma\lambda\,\mathbf{z}_{t-1} + \left(1 - \alpha\gamma\lambda\,\mathbf{z}_{t-1}^{\top}\mathbf{x}_t\right)\mathbf{x}_t.$$
The above algorithm produces exactly the same sequence of weight vectors as the regular online $\lambda$-return algorithm but is much less computationally expensive (in terms of computation; in terms of memory, the two are the same).
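A sketch of true online TD($\lambda$) over an episode of linear-feature transitions; the `(x, reward, x_next, done)` data format is an assumption for illustration:

```python
import numpy as np

def true_online_td_lambda(transitions, d, alpha, gamma, lam):
    """True online TD(lambda) for linear value functions v_hat(s, w) = w @ x(s).

    transitions: iterable of (x, reward, x_next, done), where x is the
    feature vector of the current state and x_next that of the next state.
    """
    w = np.zeros(d)
    z = np.zeros(d)      # dutch-style trace used by the true online update
    v_old = 0.0
    for x, reward, x_next, done in transitions:
        v = w @ x
        v_next = 0.0 if done else w @ x_next
        delta = reward + gamma * v_next - v
        # Dutch trace: like the accumulating trace, but the new feature
        # vector's contribution is shrunk based on the existing trace.
        z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
        # True online update; the extra (v - v_old) terms are what make the
        # result match the online lambda-return algorithm exactly.
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next
        if done:
            z = np.zeros(d)
            v_old = 0.0
    return w
```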
SARSA($\lambda$) is an extension of SARSA that makes use of eligibility traces. We substitute the state-value function with the action-value function, and the eligibility traces are now over the space of state-action pairs.
Like SARSA, it is an on-policy algorithm.
We make use of the action-value form of the $n$-step return, $G_{t:t+n}$, as defined previously.
We can form the action-value form of the $\lambda$-return (which just uses $\hat{q}$ instead of $\hat{v}$ but is otherwise the same).
The update rule is mostly the same as for TD($\lambda$), except it uses the action-value form of the TD error,
$$\delta_t = R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t),$$
and the action-value form of the eligibility trace,
$$\mathbf{z}_t = \gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{q}(S_t, A_t, \mathbf{w}_t).$$
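A minimal sketch of one SARSA($\lambda$) step with linear action-value features (`x_sa` denoting the feature vector of a state-action pair is an illustrative convention):

```python
import numpy as np

def sarsa_lambda_step(w, z, x_sa, x_next_sa, reward, alpha, gamma, lam, done=False):
    """One semi-gradient SARSA(lambda) update for q_hat(s, a, w) = w @ x(s, a)."""
    # Action-value form of the accumulating trace.
    z = gamma * lam * z + x_sa
    # Action-value form of the TD error; terminal states have value zero.
    q_next = 0.0 if done else w @ x_next_sa
    delta = reward + gamma * q_next - w @ x_sa
    w = w + alpha * delta * z
    return w, z
```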
Rationale: The result here gives us an eligibility trace which, when combined with semi-gradient descent, forms a general algorithm that can be applied to on/off-policy data.
Caveat: It is not necessarily stable as an off-policy algorithm. It also has high variance even with importance sampling.
Rationale: We cannot directly apply importance sampling to the $\lambda$-return, so instead we use control variates.
Here $\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ is the single-step importance sampling ratio.
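For reference, with constant $\gamma$ and $\lambda$ the $\lambda$-return with control variates can be written recursively as follows (a reconstruction following the control-variate construction in Sutton and Barto):

$$G_t^{\lambda s} = \rho_t\Bigl(R_{t+1} + \gamma\bigl((1-\lambda)\,\hat{v}(S_{t+1}, \mathbf{w}_t) + \lambda\,G_{t+1}^{\lambda s}\bigr)\Bigr) + (1-\rho_t)\,\hat{v}(S_t, \mathbf{w}_t).$$

The $(1-\rho_t)\,\hat{v}(S_t, \mathbf{w}_t)$ term is the control variate: when the target and behavior policies disagree and $\rho_t$ is small, the return falls back toward the current estimate rather than simply shrinking toward zero.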
The above can be approximated using a sum of TD errors:
$$G_t^{\lambda s} \approx \hat{v}(S_t, \mathbf{w}_t) + \rho_t\sum_{k=t}^{\infty}\delta_k^{s}\prod_{i=t+1}^{k}\gamma\lambda\rho_i, \qquad \delta_t^{s} = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t),$$
where the approximation becomes exact if the value function does not change.[^2]
A single step of the forward view is then given as
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\rho_t\left(\sum_{k=t}^{\infty}\delta_k^{s}\prod_{i=t+1}^{k}\gamma\lambda\rho_i\right)\nabla\hat{v}(S_t, \mathbf{w}_t).$$
Summing the forward-view updates over an episode and exchanging the order of summation ($\sum_t\sum_{k\ge t} = \sum_k\sum_{t\le k}$), the total update has the form of a sum of backward-view TD updates, $\sum_k \alpha\,\delta_k^{s}\,\mathbf{z}_k$, so we can perform the update incrementally as an eligibility trace using the general accumulating trace update for state values (see above).[^3] For the general accumulating trace we get
$$\mathbf{z}_t = \rho_t\left(\gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{v}(S_t, \mathbf{w}_t)\right).$$
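A sketch of the resulting backward-view step, i.e. semi-gradient TD($\lambda$) with the importance-sampling-weighted trace (linear features assumed; names illustrative):

```python
import numpy as np

def off_policy_td_lambda_step(w, z, x_s, x_next, reward, rho,
                              alpha, gamma, lam, done=False):
    """One off-policy TD(lambda) update using the general accumulating trace.

    rho is the single-step importance sampling ratio pi(A|S) / b(A|S).
    Note: this semi-gradient scheme is not guaranteed to be stable off-policy.
    """
    # General accumulating trace: the whole trace is scaled by rho.
    z = rho * (gamma * lam * z + x_s)
    v_next = 0.0 if done else w @ x_next
    delta = reward + gamma * v_next - w @ x_s
    w = w + alpha * delta * z
    return w, z
```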
At $\lambda = 1$, this behaves like a Monte Carlo method, except that the eligibility-trace version still performs bootstrapping; however, the dependence on the bootstrapped estimates cancels out in expected value.
Methods that achieve exact equivalence (rather than equivalence in expectation) require provisional weights that keep track of updates that have been made but may need to be retracted or emphasized depending on later actions. These are called provisional TD (PTD) algorithms.
Watkins's Q($\lambda$) extends Q-learning to make use of eligibility traces.
One caveat is that, instead of looking ahead all the way to the end of the episode, it only looks ahead as far as the action after the first exploratory action (or to the episode's end if there is none).
Whenever a non-greedy action is taken, the eligibility trace is cut back to zero at that point.
Note: This relies on the action sequence of the behavior policy having few exploratory actions, as otherwise this performs no better than one-step Q-learning.
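A tabular sketch of the trace handling in Watkins's Q($\lambda$); the array layout (`Q` and `E` as $|\mathcal{S}| \times |\mathcal{A}|$ arrays) is an assumption for illustration:

```python
import numpy as np

def watkins_q_lambda_step(Q, E, s, a, r, s_next, a_next,
                          alpha, gamma, lam, done=False):
    """One tabular Watkins's Q(lambda) update.

    a_next is the action actually taken next by the behavior policy; the TD
    target always bootstraps from the greedy action in s_next.
    """
    greedy_next = np.argmax(Q[s_next])
    target = 0.0 if done else gamma * Q[s_next, greedy_next]
    delta = r + target - Q[s, a]

    E[s, a] += 1.0                 # accumulate trace for the visited pair
    Q += alpha * delta * E         # update all traced state-action pairs

    if not done and Q[s_next, a_next] == Q[s_next, greedy_next]:
        E *= gamma * lam           # next action is greedy (or tied): decay traces
    else:
        E[:] = 0.0                 # exploratory action or episode end: cut traces
    return Q, E
```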
Watkins's Q($\lambda$) is a crude way to perform off-policy tracing. However, we must consider the following:
We should use importance sampling to assign credit for actions: actions are taken according to a probability distribution, so even following the greedy action depends on its probability under the behavior policy.
It does not actually bootstrap since it cuts off estimation after the first non-greedy action is taken.
[^1]: The derivation can be found in Sutton and Barto, and also in a paper by van Seijen et al. (2016). The general idea is that if we arrange all of the weight vectors $\mathbf{w}_t^h$ in a matrix, only the diagonal terms of this matrix are important, and each diagonal term can be computed from the previous diagonal term.
[^2]: This follows just by expanding the recursive formula given for the exact value of $G_t^{\lambda s}$.