• The need for eligibility traces arises in general whenever one tries to learn long-term predictions in an efficient manner.
  • Every result below generalizes from state value functions to action-value functions with the appropriate replacements.
  • In practice, implementing eligibility traces is quite efficient, since we only need to keep track of and update the few traces that are significantly greater than zero.

Overview

  • The eligibility trace acts as a short-term memory vector $\mathbf{z}$ that parallels the weight vector $\mathbf{w}$ in function approximation.

  • When a component of $\mathbf{w}$ participates in producing an estimate, the corresponding component of $\mathbf{z}$ is bumped up and then begins to fade away at a rate controlled by the trace-decay parameter $\lambda$.

  • Learning occurs in a component of $\mathbf{w}$ if a non-zero TD error occurs before the trace falls back to zero.

  • Compared to n-step bootstrapping, eligibility traces offer computational advantages.

    • We no longer need to store the last $n$ feature vectors, just the single trace vector $\mathbf{z}$ (same dimension as $\mathbf{w}$).
    • Learning occurs continually and uniformly in time rather than being delayed $n$ steps.
  • One way to view this is as a way of interpolating between temporal-difference and Monte Carlo methods: $\lambda = 0$ gives one-step TD, while $\lambda = 1$ gives Monte Carlo.

    • Error reduction property: the expected $n$-step return is a better estimate of $v_\pi$ than the current estimate, in the sense that
      $$\max_s\left|\mathbb{E}_\pi[G_{t:t+n}\mid S_t=s]-v_\pi(s)\right|\le\gamma^n\max_s\left|\hat{v}(s,\mathbf{w})-v_\pi(s)\right|,$$
      so repeated updates converge to the true value function as $n$ grows.
  • Another way is to view this as keeping a temporary record of which components (states, actions, or components of the weight vector) are eligible for learning changes.

    • We are concerned with the TD errors in particular.
  • Eligibility traces make use of compound updates — averaging over simpler updates.

λ-return

The λ-return is defined as
$$G_t^\lambda \doteq (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}$$

  • Each $n$-step return $G_{t:t+n}$ is weighted by $\lambda^{n-1}$. The whole sum is renormalized by the $(1-\lambda)$ term so that the weights sum to one.
  • An alternative formulation, if the episode terminates at time $T$, is as follows (a computational sketch follows this list):
    $$G_t^\lambda = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{T-t-1}G_t$$
  • $\lambda \in [0, 1]$ is called the trace-decay parameter.
    • $\lambda$ controls how much credit we give to earlier states when we perform updates.
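A minimal Python sketch (mine, not from the text) of the terminated-episode formulation above; `n_step_return`, `rewards`, and `values` are hypothetical names, with `rewards[k]` standing for $R_{k+1}$ and `values[k]` for the current estimate of $v(S_k)$.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: n-step return from time t, bootstrapping from values[t+n]."""
    T = len(rewards)
    h = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:
        g += gamma ** n * values[t + n]   # bootstrap only if we stop before the end
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda for an episode that terminates at T = len(rewards)."""
    T = len(rewards)
    g = 0.0
    # weighted n-step returns for n = 1, ..., T - t - 1
    for n in range(1, T - t):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # all remaining weight, lambda^(T-t-1), goes to the full (Monte Carlo) return
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g

# tiny usage example with made-up numbers
rewards = [0.0, 0.0, 1.0]        # R_1, R_2, R_3 (episode ends after three steps)
values = [0.1, 0.2, 0.5, 0.0]    # rough estimates of v(S_0), ..., v(S_3)
print(lambda_return(rewards, values, t=0, gamma=0.9, lam=0.8))
```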

Kinds of Traces

  • The accumulating trace is defined as
    $$\mathbf{z}_{-1} \doteq \mathbf{0}, \qquad \mathbf{z}_t \doteq \gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{v}(S_t,\mathbf{w}_t)$$
  • The action-value form of the eligibility trace is defined as
    $$\mathbf{z}_t \doteq \gamma\lambda\,\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t)$$
  • The general accumulating trace update for state values is of the form
    $$\mathbf{z}_t \doteq \rho_t\left(\gamma_t\lambda_t\,\mathbf{z}_{t-1} + \nabla\hat{v}(S_t,\mathbf{w}_t)\right)$$

    Where $\rho_t$ is the importance sampling ratio, and we allow the trace-decay rate $\lambda_t$ and discounting rate $\gamma_t$ to vary with time.

  • The general accumulating trace update for action values is of the form
    $$\mathbf{z}_t \doteq \gamma_t\lambda_t\rho_t\,\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t)$$

  • The Dutch trace is defined (for the linear case with feature vector $\mathbf{x}_t$) as
    $$\mathbf{z}_t \doteq \gamma\lambda\,\mathbf{z}_{t-1} + \left(1-\alpha\gamma\lambda\,\mathbf{z}_{t-1}^{\top}\mathbf{x}_t\right)\mathbf{x}_t$$
  • The replacing trace is defined on a component-by-component basis, depending on whether the component of the (binary) feature vector was $1$ or $0$:
    $$z_{i,t} \doteq \begin{cases} 1 & \text{if } x_{i,t} = 1 \\ \gamma\lambda\,z_{i,t-1} & \text{otherwise} \end{cases}$$
  • Replacing traces can be viewed as approximations of Dutch traces, which are generally preferred (see the sketch after this list).
  • Accumulating Traces are used when there is no theoretical basis for Dutch Traces
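A minimal sketch (my own, for the linear case where the gradient of $\hat{v}$ is just the feature vector `x`) contrasting the three trace updates; the replacing trace assumes binary features, and `alpha`, `gamma_`, and `lam` are assumed hyperparameters.

```python
import numpy as np

def accumulating_trace(z, x, gamma_, lam):
    # z <- gamma * lambda * z + x
    return gamma_ * lam * z + x

def dutch_trace(z, x, gamma_, lam, alpha):
    # z <- gamma * lambda * z + (1 - alpha * gamma * lambda * z.x) * x
    return gamma_ * lam * z + (1.0 - alpha * gamma_ * lam * z.dot(x)) * x

def replacing_trace(z, x, gamma_, lam):
    # component-wise: reset to 1 where the binary feature is active, decay elsewhere
    z = gamma_ * lam * z
    z[x == 1.0] = 1.0
    return z

# quick comparison on a toy binary feature vector
z = np.zeros(4)
x = np.array([1.0, 0.0, 1.0, 0.0])
print(accumulating_trace(z, x, gamma_=0.9, lam=0.8))
print(dutch_trace(z, x, gamma_=0.9, lam=0.8, alpha=0.1))
print(replacing_trace(z.copy(), x, gamma_=0.9, lam=0.8))
```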

Relation to Monte Carlo

  • Eligibility traces can arise even when we use Monte Carlo.
  • A backward view of Monte Carlo can be used to improve performance (for example by using Dutch traces).

TD(λ) and its variants

  • In the off-line λ-return algorithm, the updates follow the usual semi-gradient approach with the λ-return $G_t^\lambda$ as the target.
    • Updates are performed after the episode, possibly in batches.
  • In TD(λ), the eligibility vector is computed using the accumulating trace (see above).
  • The weight vector is then updated using the TD error and the eligibility trace vector (a code sketch of the episode loop follows this list):
    $$\delta_t \doteq R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t), \qquad \mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t$$
  • TD(λ) is preferable to the off-line λ-return algorithm because

    • It updates backward on every step of an episode rather than only at the end, so estimates improve sooner.
    • Its computation is properly amortized, i.e., distributed evenly in time.
    • It can be applied to continuing problems.
      • TD(1) is a variation of Monte Carlo that is also applicable to continuing tasks.
  • Linear TD(λ) converges, but not to the minimum-error weight vector. For the continuing discounted case:
    $$\overline{VE}(\mathbf{w}_\infty) \le \frac{1-\gamma\lambda}{1-\gamma}\,\min_{\mathbf{w}}\overline{VE}(\mathbf{w})$$
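A sketch of one episode of linear semi-gradient TD(λ) with accumulating traces. The `env` object (with `reset()` and `step(action)` returning `(next_state, reward, done)`), `policy`, and `features` are hypothetical stand-ins, not anything defined in these notes.

```python
import numpy as np

def td_lambda_episode(env, policy, features, w, alpha, gamma_, lam):
    """One episode of semi-gradient TD(lambda); v_hat(s) = w . features(s)."""
    z = np.zeros_like(w)                      # eligibility trace, z_{-1} = 0
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        x = features(state)
        x_next = np.zeros_like(w) if done else features(next_state)
        # TD error: delta = R + gamma * v_hat(S') - v_hat(S)
        delta = reward + gamma_ * w.dot(x_next) - w.dot(x)
        # accumulating trace, then the semi-gradient weight update
        z = gamma_ * lam * z + x
        w = w + alpha * delta * z
        state = next_state
    return w
```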

Truncated λ-return

  • In practice, because of the potentially infinite number of terms, we cannot precisely calculate the λ-return above. To compensate, we may use the truncated λ-return.

  • It is the weighted average of the $n$-step returns up to a horizon $h$, with the residual weight going to the longest available return (see the sketch at the end of this section):
    $$G_{t:h}^\lambda \doteq (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} + \lambda^{h-t-1}G_{t:h}$$

  • We perform truncated TD(λ) using the following update rule:
    $$\mathbf{w}_{t+n} \doteq \mathbf{w}_{t+n-1} + \alpha\left[G_{t:t+n}^\lambda - \hat{v}(S_t,\mathbf{w}_{t+n-1})\right]\nabla\hat{v}(S_t,\mathbf{w}_{t+n-1}), \qquad 0 \le t < T$$

  • Note that no updates are made in the first $n-1$ steps, and $n-1$ additional updates are made upon termination.
  • The following identity is used for efficient implementation:
    $$G_{t:t+k}^\lambda = \hat{v}(S_t,\mathbf{w}_{t-1}) + \sum_{i=t}^{t+k-1}(\gamma\lambda)^{i-t}\,\delta_i', \qquad \delta_t' \doteq R_{t+1} + \gamma\hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_{t-1})$$
  • There is a tradeoff: $n$ should be large to properly approximate the λ-return, but also small so that updates can be made sooner.
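A sketch that computes the truncated λ-return $G_{t:h}^\lambda$ directly from stored rewards and value estimates (names are mine; an efficient implementation would instead use the TD-error identity above rather than recomputing each $n$-step return).

```python
def truncated_lambda_return(rewards, values, t, h, gamma_, lam):
    """G_{t:h}^lambda with horizon h; rewards[k] is R_{k+1}, values[k] estimates v(S_k)."""
    def n_step(n):
        # G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n * v_hat(S_{t+n})
        g = sum(gamma_ ** (k - t) * rewards[k] for k in range(t, t + n))
        return g + gamma_ ** n * values[t + n]

    # (1 - lam) weighted n-step returns, plus the residual weight on G_{t:h}
    g = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, h - t))
    return g + lam ** (h - t - 1) * n_step(h - t)

# example: look three steps ahead of t = 0
rewards = [1.0, 0.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.2, 0.1]
print(truncated_lambda_return(rewards, values, t=0, h=3, gamma_=0.9, lam=0.9))
```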

Online λ-return

  • We can make better approximations to the offline update, while making them happen sooner, at the cost of computational complexity.

    • Updates are performed during the episode.
  • On each time step, go back and redo all updates since the beginning of the episode, accounting for this time step’s new data.

  • The online λ-return algorithm: let $\mathbf{w}_t^h$ denote the weights used to generate the value at time $t$ in the sequence of updates up to horizon $h$. The update takes the form
    $$\mathbf{w}_{t+1}^h \doteq \mathbf{w}_t^h + \alpha\left[G_{t:h}^\lambda - \hat{v}(S_t,\mathbf{w}_t^h)\right]\nabla\hat{v}(S_t,\mathbf{w}_t^h), \qquad 0 \le t < h \le T$$

  • It is more computationally complex but also more accurate than off-line updates, because the weight vector used in bootstrapping has been updated more times during the episode and is therefore more informative.

  • The true online TD(λ) algorithm (footnote 1) is derived by using the following update rule. Note that it works for the linear case, where $\hat{v}(s,\mathbf{w}) = \mathbf{w}^{\top}\mathbf{x}(s)$:
    $$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t + \alpha\left(\mathbf{w}_t^{\top}\mathbf{x}_t - \mathbf{w}_{t-1}^{\top}\mathbf{x}_t\right)\left(\mathbf{z}_t - \mathbf{x}_t\right)$$

  • The above algorithm gives the same results as the regular online λ-return algorithm but is far less computationally expensive per step (memory requirements are essentially unchanged). A per-step sketch follows below.
    • Note that it makes use of Dutch traces.
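A per-step sketch of true online TD(λ) in a commonly used pseudocode form (intended to be equivalent to the update rule above for the linear case); function and variable names are mine, and `v_old` must be reset to 0 at the start of each episode.

```python
import numpy as np

def true_online_td_step(w, z, v_old, x, x_next, reward, alpha, gamma_, lam):
    """One step of true online TD(lambda) with v_hat(s) = w . x(s).

    Returns the updated weights, the Dutch trace, and the value to pass back
    in as v_old on the next step.
    """
    v = w.dot(x)
    v_next = w.dot(x_next)
    delta = reward + gamma_ * v_next - v
    # Dutch trace
    z = gamma_ * lam * z + (1.0 - alpha * gamma_ * lam * z.dot(x)) * x
    # true online weight update
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    return w, z, v_next
```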

  • Sarsa(λ) is an extension of SARSA that makes use of eligibility traces. We substitute the state-value function with the action-value function, and the eligibility traces are now over the space of state-action pairs.
  • Like SARSA, it is an on-policy algorithm.
  • We make use of the $n$-step return for action values as defined here.
  • We can form the action-value form of the λ-return (which just makes use of $\hat{q}$ instead of $\hat{v}$ but is otherwise the same as the λ-return used above).
  • The update rule is mostly the same as before, except it uses the action-value form of the TD error and the action-value form of the eligibility trace (see the sketch below):
    $$\delta_t \doteq R_{t+1} + \gamma\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) - \hat{q}(S_t,A_t,\mathbf{w}_t)$$
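A sketch of one episode of linear Sarsa(λ) with accumulating traces and an ε-greedy behavior; `env` and `features(state, action)` follow the same hypothetical interface as the TD(λ) sketch above, and `rng` is a `numpy.random.Generator`.

```python
import numpy as np

def sarsa_lambda_episode(env, features, actions, w, alpha, gamma_, lam, eps, rng):
    """One episode of linear Sarsa(lambda); q_hat(s, a) = w . features(s, a)."""
    def q(s, a):
        return w.dot(features(s, a))

    def eps_greedy(s):
        if rng.random() < eps:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    z = np.zeros_like(w)
    state = env.reset()
    action = eps_greedy(state)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        x = features(state, action)
        if done:
            delta = reward - w.dot(x)                 # no bootstrap at termination
        else:
            next_action = eps_greedy(next_state)
            delta = reward + gamma_ * q(next_state, next_action) - w.dot(x)
        z = gamma_ * lam * z + x                      # action-value accumulating trace
        w = w + alpha * delta * z
        if not done:
            state, action = next_state, next_action
    return w
```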

Generalization

Varying λ and γ

  • Another (theoretical) technique for generalizing the λ-return is to allow $\lambda$ and $\gamma$ to vary from step to step (see here for more on this).
  • In this case, we write $\lambda_t \doteq \lambda(S_t)$ and $\gamma_t \doteq \gamma(S_t)$ for the values used at timestep $t$. The new λ-return is defined recursively using the state and state-action formulations below (a computational sketch follows this list):
    $$G_t^{\lambda s} \doteq R_{t+1} + \gamma_{t+1}\left[(1-\lambda_{t+1})\,\hat{v}(S_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda s}\right]$$
    $$G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\left[(1-\lambda_{t+1})\,\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda a}\right]$$
  • One rationale for this is to let the degree of bootstrapping depend on the state, e.g., relying more heavily on the estimates of states whose values we are highly certain of.
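A sketch of computing the state-based recursion above backward over a finished episode; `lam` and `gamma_` are hypothetical state-dependent functions, and `v_hat` is the current value estimate.

```python
def state_based_lambda_returns(rewards, next_states, v_hat, lam, gamma_):
    """Backward computation of G_t^{lambda s} with state-dependent lambda and gamma.

    rewards[t] is R_{t+1} and next_states[t] is S_{t+1}; at the terminal state
    gamma_(S_T) should return 0, which cuts off the recursion.
    """
    T = len(rewards)
    returns = [0.0] * T
    g = 0.0                                   # G_T^{lambda s} (unused when gamma_(S_T) = 0)
    for t in reversed(range(T)):
        s_next = next_states[t]
        g = rewards[t] + gamma_(s_next) * (
            (1.0 - lam(s_next)) * v_hat(s_next) + lam(s_next) * g
        )
        returns[t] = g
    return returns
```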

Incorporating Control Variates

  • See here for more about control variates.
  • Rationale: The result here gives us an eligibility trace which, when combined with semi-gradient descent, forms a general algorithm that can be applied to on/off-policy data.
    • Caveat: It is not necessarily stable as an off-policy algorithm. It also has high variance even with importance sampling.
  • Rationale: We cannot directly apply importance sampling to λ-returns, so instead we directly use control variates.

Using State-Values

  • Doing so generalizes the λ-return as
    $$G_t^{\lambda s} \doteq \rho_t\left(R_{t+1} + \gamma_{t+1}\left[(1-\lambda_{t+1})\,\hat{v}(S_{t+1},\mathbf{w}_t) + \lambda_{t+1}G_{t+1}^{\lambda s}\right]\right) + (1-\rho_t)\,\hat{v}(S_t,\mathbf{w}_t)$$

Where $\rho_t = \frac{\pi(A_t\mid S_t)}{b(A_t\mid S_t)}$ is the single-step importance sampling ratio.

  • The above can be approximated using TD errors:
    $$G_t^{\lambda s} \approx \hat{v}(S_t,\mathbf{w}_t) + \rho_t\sum_{k=t}^{\infty}\delta_k^s\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i, \qquad \delta_t^s \doteq R_{t+1} + \gamma_{t+1}\hat{v}(S_{t+1},\mathbf{w}_t) - \hat{v}(S_t,\mathbf{w}_t)$$

Where the approximation becomes exact if the value function does not change (footnote 2).

  • A single step of a forward view is then given as
    $$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\left(G_t^{\lambda s} - \hat{v}(S_t,\mathbf{w}_t)\right)\nabla\hat{v}(S_t,\mathbf{w}_t)$$
  • The sum of the forward-view updates over an episode is given as
    $$\sum_{t=1}^{\infty}\left(\mathbf{w}_{t+1}-\mathbf{w}_t\right) \approx \sum_{t=1}^{\infty}\sum_{k=t}^{\infty}\alpha\,\rho_t\,\delta_k^s\,\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i$$

And, after exchanging the order of summation, this sum has the form of a sum of backward-view TD updates, $\sum_k \alpha\,\delta_k^s\,\mathbf{z}_k$, because the inner expression can be maintained incrementally as an eligibility trace using the general accumulating trace update for state values (see above). [^3] We get that for the general accumulating trace
    $$\mathbf{z}_k = \sum_{t=1}^{k}\rho_t\,\nabla\hat{v}(S_t,\mathbf{w}_t)\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i = \rho_k\left(\gamma_k\lambda_k\,\mathbf{z}_{k-1} + \nabla\hat{v}(S_k,\mathbf{w}_k)\right)$$
A per-step sketch follows below.
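A per-step sketch of the resulting backward-view algorithm: semi-gradient TD(λ) with the general accumulating trace for state values (linear features, constant γ and λ for brevity; `rho` is the single-step ratio π(A|S)/b(A|S), and all names are mine).

```python
import numpy as np

def off_policy_td_lambda_step(w, z, x, x_next, reward, rho, alpha, gamma_, lam):
    """One backward-view step with z_t = rho_t * (gamma * lambda * z_{t-1} + x_t)."""
    delta = reward + gamma_ * w.dot(x_next) - w.dot(x)   # state-value TD error
    z = rho * (gamma_ * lam * z + x)                     # general accumulating trace
    w = w + alpha * delta * z
    return w, z
```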

Using Action-Values

  • For the action-value case the derivation is similar. We get that
    $$G_t^{\lambda a} \doteq R_{t+1} + \gamma_{t+1}\left[(1-\lambda_{t+1})\,\bar{V}_t(S_{t+1}) + \lambda_{t+1}\left(\rho_{t+1}G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1}\hat{q}(S_{t+1},A_{t+1},\mathbf{w}_t)\right)\right]$$

Where $\bar{V}_t(s) \doteq \sum_a \pi(a\mid s)\,\hat{q}(s,a,\mathbf{w}_t)$ is given here.

  • This gives an approximation using the expectation form of the action-based TD error:
    $$G_t^{\lambda a} \approx \hat{q}(S_t,A_t,\mathbf{w}_t) + \sum_{k=t}^{\infty}\delta_k^a\prod_{i=t+1}^{k}\gamma_i\lambda_i\rho_i, \qquad \delta_t^a \doteq R_{t+1} + \gamma_{t+1}\bar{V}_t(S_{t+1}) - \hat{q}(S_t,A_t,\mathbf{w}_t)$$
  • This approximation becomes exact if the approximate value function does not change.
  • Similar to the case for state-values, we can derive a form that uses the generalized eligibility trace for action values (see here)

Relation to Monte Carlo

  • At $\lambda = 1$, it behaves like a Monte Carlo method, except that the eligibility-trace (TD-error) form still involves bootstrapping estimates. However, the dependence on those estimates cancels out in expectation.
  • Methods that achieve exact equivalence maintain provisional weights to keep track of updates that may need to be retracted or emphasized depending on actions taken later. These are called provisional TD (PTD) algorithms.
  • At $\lambda < 1$ (i.e., when bootstrapping), the deadly triad applies.

Watkins’ Q(λ)

  • Extends Q-learning to make use of eligibility traces.
  • One caveat of this is that, instead of looking ahead all the way to the end, it only looks ahead as far as the action after the first exploratory action (or to the episode’s end if none exist)
  • Whenever a non-greedy action is taken, the eligibility trace at that point is cut, i.e., set to zero (see the tabular sketch after this list).
  • Note: This relies on the action sequence of the behavior policy having few exploratory actions, as otherwise this performs no better than one-step Q-learning.
  • Watkins’ Q(λ) is a crude way to perform off-policy tracing. However, we should consider the following:
    • Importance sampling (or the actions’ probabilities under the target policy) could be used to assign credit for actions, since actions are taken according to a probability distribution, so following the greedy action depends on that probability.
    • It cuts off estimation after the first non-greedy action is taken, so it makes no further use of the data beyond that point.
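A tabular sketch of Watkins’ Q(λ) trace handling (classic accumulating-trace form as I recall it; ties between greedy actions are ignored for simplicity). `Q` and `Z` are numpy arrays of shape `[n_states, n_actions]`, and `a_next` is the action the behavior policy actually takes in `s_next`.

```python
import numpy as np

def watkins_q_lambda_step(Q, Z, s, a, r, s_next, a_next, alpha, gamma_, lam):
    """One tabular Watkins' Q(lambda) step; updates Q and Z in place."""
    a_star = np.argmax(Q[s_next])                       # greedy action at S'
    delta = r + gamma_ * Q[s_next, a_star] - Q[s, a]    # Q-learning TD error
    Z[s, a] += 1.0                                      # accumulate trace for (S, A)
    Q += alpha * delta * Z                              # update every traced pair
    if a_next == a_star:
        Z *= gamma_ * lam                               # decay while we follow the greedy action
    else:
        Z[:] = 0.0                                      # cut all traces at an exploratory action
    return Q, Z
```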

Tree Backup with Eligibility Traces

  • See here for more on the tree backup algorithm.
  • The return is calculated (in terms of TD errors) as
    $$G_t^{\lambda a} \approx \hat{q}(S_t,A_t,\mathbf{w}_t) + \sum_{k=t}^{\infty}\delta_k^a\prod_{i=t+1}^{k}\gamma_i\lambda_i\,\pi(A_i\mid S_i)$$

Where $\delta_t^a$ is the expectation-form TD error defined here.

  • The eligibility trace update is defined using the target-policy probability of the action taken in place of the importance sampling ratio:
    $$\mathbf{z}_t \doteq \gamma_t\lambda_t\,\pi(A_t\mid S_t)\,\mathbf{z}_{t-1} + \nabla\hat{q}(S_t,A_t,\mathbf{w}_t)$$
  • And then apply the usual update rule of TD(λ) (a per-step sketch follows below).
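A per-step sketch of the backward view suggested by the trace above (linear $\hat{q}$, constant γ and λ); `v_bar_next` and `pi_a_given_s` are assumed inputs: the expected action value of the next state under the target policy, and π(A_t | S_t) respectively.

```python
import numpy as np

def tree_backup_lambda_step(w, z, x_sa, reward, v_bar_next, pi_a_given_s,
                            alpha, gamma_, lam):
    """One backward-view Tree Backup(lambda) step with q_hat(s, a) = w . x(s, a).

    v_bar_next = sum_a pi(a | S_{t+1}) * q_hat(S_{t+1}, a), or 0 at termination.
    """
    # expectation form of the action-value TD error
    delta = reward + gamma_ * v_bar_next - w.dot(x_sa)
    # trace decays by the target-policy probability of the action taken, not by rho
    z = gamma_ * lam * pi_a_given_s * z + x_sa
    w = w + alpha * delta * z
    return w, z
```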

Stable Off-Policy Methods

  • GTD(λ) is the eligibility-trace counterpart to TDC. The update rule is given as
    $$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\,\delta_t^s\,\mathbf{z}_t - \alpha\gamma_{t+1}(1-\lambda_{t+1})\left(\mathbf{z}_t^{\top}\mathbf{v}_t\right)\mathbf{x}_{t+1}$$
    $$\mathbf{v}_{t+1} \doteq \mathbf{v}_t + \beta\,\delta_t^s\,\mathbf{z}_t - \beta\left(\mathbf{v}_t^{\top}\mathbf{x}_t\right)\mathbf{x}_t$$

Where $\mathbf{v}$ is a vector of the same dimension as $\mathbf{w}$, initialized to $\mathbf{0}$, and $\beta$ is a second step-size parameter.

And we use $\rho_t$ as the per-decision importance sampling ratio, $\mathbf{z}_t$ is the general accumulating trace for state values, and $\delta_t^s$ is the TD error defined here. (A per-step sketch is given below.)

  • TD(λ) is often faster than GTD(λ) when both algorithms converge.
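A per-step sketch of the GTD(λ) updates as reconstructed above (linear features, constant γ and λ); this is my reading of the update rule and should be checked against the original pseudocode.

```python
import numpy as np

def gtd_lambda_step(w, v, z, x, x_next, reward, rho, alpha, beta, gamma_, lam):
    """One GTD(lambda) step; w are the main weights, v the secondary weights."""
    delta = reward + gamma_ * w.dot(x_next) - w.dot(x)          # state-value TD error
    z = rho * (gamma_ * lam * z + x)                            # general accumulating trace
    w = w + alpha * delta * z - alpha * gamma_ * (1.0 - lam) * z.dot(v) * x_next
    v = v + beta * delta * z - beta * v.dot(x) * x
    return w, v, z
```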

  • GQ(λ) is the Gradient-TD algorithm for action values with eligibility traces.
  • It learns $\hat{q} \approx q_\pi$ from off-policy data.
  • If the target policy is $\varepsilon$-greedy or otherwise biased towards the greedy policy, then GQ(λ) can be used for control.
  • The update is given as
    $$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\,\delta_t^a\,\mathbf{z}_t - \alpha\gamma_{t+1}(1-\lambda_{t+1})\left(\mathbf{z}_t^{\top}\mathbf{v}_t\right)\bar{\mathbf{x}}_{t+1}$$

Where $\mathbf{z}_t$ is defined using the general accumulating trace for action values, $\bar{\mathbf{x}}_t \doteq \sum_a \pi(a\mid S_t)\,\mathbf{x}(S_t,a)$ is the average feature vector for $S_t$ under the target policy, $\delta_t^a$ is the expectation form of the TD error, and the rest follows from GTD(λ), including the update of $\mathbf{v}_t$.

  • HTD(λ) is a hybrid state-value algorithm combining GTD(λ) and TD(λ).
  • It is a strict generalization of TD(λ) to off-policy learning: if the behavior policy happens to be the target policy, then HTD(λ) is the same as TD(λ).
  • We have $\beta$ as an additional step-size parameter and $\mathbf{v}$ as a second set of weights.
  • $\mathbf{z}_t^b$ acts as a second set of accumulating eligibility traces, this one for the behavior policy.
    • It reduces to $\mathbf{z}_t$ if all the importance sampling ratios $\rho_t$ are $1$.

Emphatic TD(λ)

  • An extension of Emphatic TD to eligibility traces.
  • It retains strong off-policy convergence guarantees while allowing bootstrapping.
  • Downsides: High variance and slow convergence
  • The emphasis is generalized with
    $$M_t \doteq \lambda_t\,I_t + (1-\lambda_t)\,F_t$$
  • $F_t \doteq \rho_{t-1}\gamma_t\,F_{t-1} + I_t$ is termed the followon trace.
  • $I_t \doteq i(S_t)$ is the interest as defined here (a per-step sketch follows below).
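A per-step sketch of Emphatic TD(λ) following the emphasis and followon-trace definitions above, with the trace and weight updates I believe go with them ($\mathbf{z}_t = \rho_t(\gamma\lambda\,\mathbf{z}_{t-1} + M_t\,\mathbf{x}_t)$ and $\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{z}_t$); linear features, constant γ and λ, and names are mine.

```python
import numpy as np

def emphatic_td_lambda_step(w, z, followon, rho_prev, x, x_next, reward, rho,
                            interest, alpha, gamma_, lam):
    """One Emphatic TD(lambda) step.

    followon is F_{t-1} and rho_prev is rho_{t-1}, carried between calls;
    initializing both to 0 before the first step makes F_0 equal the interest.
    """
    delta = reward + gamma_ * w.dot(x_next) - w.dot(x)
    followon = rho_prev * gamma_ * followon + interest        # followon trace F_t
    emphasis = lam * interest + (1.0 - lam) * followon        # emphasis M_t
    z = rho * (gamma_ * lam * z + emphasis * x)               # emphatic eligibility trace
    w = w + alpha * delta * z
    return w, z, followon, rho
```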

Links

Footnotes

  1. Derivation found in Sutton and Barto, but also in a paper by van Seijen et al., 2016. The general idea is that if we arrange all the $\mathbf{w}_t^h$’s in a matrix, we see that only the diagonal terms of this matrix are important, and each of these can be computed from the previous diagonal term.

  2. This follows just by expanding the recursive formula, starting from the exact value of $G_t^{\lambda s}$ (with the value function held fixed).