• In an on-policy context, we refer to $\mu(s)$ as the on-policy distribution.
    • In continuing tasks, $\mu(s)$ is the stationary distribution under policy $\pi$.

    • In an episodic task, we calculate the following.

      Let $h(s)$ be the probability that an episode begins in state $s$, and $\eta(s)$ the number of time steps spent, on average, in $s$ within an episode with discounting rate $\gamma$:

      $$\eta(s) = h(s) + \gamma \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s}) \, p(s \mid \bar{s}, a)$$

      We then have

      $$\mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}$$

Policy Prediction

  • Assume that states appear in the data according to the distribution $\mu$.

Gradient Descent

  • We can use gradient descent methods (specifically Stochastic Gradient Descent) to adjust the weight vector $\mathbf{w}_t$ at each step (a minimal sketch follows this list):

    $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ U_t - \hat{v}(S_t, \mathbf{w}_t) \right] \nabla \hat{v}(S_t, \mathbf{w}_t)$$

    • In supervised learning terms: treat the estimate $\hat{v}(S_t, \mathbf{w}_t)$ as the predicted value and $U_t$ as the ground truth value.
    • Where $U_t$ denotes some approximation to the true value $v_\pi(S_t)$ (e.g., one obtained from bootstrapping using $\hat{v}$, or it may be the Monte Carlo return $G_t$).
    • If $U_t$ is an unbiased estimate, then we can guarantee convergence via SGD (under the usual step-size conditions).
    • If $U_t$ is derived from bootstrapping, then we cannot guarantee convergence with gradient descent, because the bootstrapped target itself also depends on $\mathbf{w}_t$.
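
A minimal sketch of the SGD update above with a linear value estimate, using the Monte Carlo return $G_t$ as the target $U_t$. The `features` function and the episode format are assumptions for illustration.

```python
import numpy as np

def gradient_mc_update(w, episode, gamma, alpha, features):
    """One gradient Monte Carlo pass over a finished episode.

    episode  -- list of (S_t, R_{t+1}) pairs, in time order
    features -- maps a state to its feature vector x(s)
    For a linear estimate v(s, w) = w @ x(s), the gradient is just x(s).
    """
    G = 0.0
    # Walk backwards so G holds the discounted return following each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        x = features(state)
        w = w + alpha * (G - w @ x) * x   # SGD step toward the MC target U_t = G_t
    return w
```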

Semi-Gradient Descent and TD(0)

  • Modifies the Gradient Descent based approach by including only part of the gradient.
  • Suitable for bootstrapping targets, since such methods take into account the effect of changing $\mathbf{w}$ on the estimate, but ignore its effect on the target.
  • Convergence to global optima not guaranteed, but enables much faster learning, and allows continual, online learning.
  • An example is semi-gradient TD(0) (sketched below), where

    $$U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t)$$
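
A minimal sketch of one semi-gradient TD(0) step with a linear estimate; the `features` function and the transition arguments are assumptions for illustration.

```python
import numpy as np

def semi_gradient_td0_step(w, s, r, s_next, done, gamma, alpha, features):
    """One semi-gradient TD(0) update for a linear estimate v(s, w) = w @ x(s).

    Only the gradient of the estimate is taken; the target
    r + gamma * v(s_next, w) is treated as a constant (hence "semi"-gradient).
    """
    x = features(s)
    target = r if done else r + gamma * (w @ features(s_next))
    return w + alpha * (target - w @ x) * x   # gradient of a linear estimate is x(s)
```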

State Aggregation

  • Group states together with one estimated value (one component of $\mathbf{w}$).
  • We typically see a staircase effect where the approximate value within a group abruptly changes.
  • A special case of SGD where the gradient is 1 for $S_t$’s group’s component, and 0 for all other components (see the sketch below).
  • We may think of state aggregation as a special case of Coarse Coding where the features correspond to the group the state belongs in.
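
A minimal sketch of state aggregation as a one-hot feature vector, assuming a hypothetical `group_of` mapping from states to group indices; the SGD update then touches only a single component of the weight vector.

```python
import numpy as np

def group_features(state, group_of, n_groups):
    """State aggregation: a one-hot feature vector over state groups."""
    x = np.zeros(n_groups)
    x[group_of(state)] = 1.0
    return x

def aggregated_update(w, state, target, alpha, group_of):
    """SGD step: the gradient is 1 for the state's group component, 0 elsewhere."""
    g = group_of(state)
    w[g] += alpha * (target - w[g])
    return w
```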

Using Supervised Learning

  • We may consider using Linear Models where each state $s$ is encoded with a feature vector $\mathbf{x}(s)$.
    • Linear semi-gradient TD(0) converges to the TD fixed point $\mathbf{w}_{TD}$, where

      $$\mathbf{w}_{TD} = \mathbf{A}^{-1} \mathbf{b}, \qquad \mathbf{A} = \mathbb{E}\left[ \mathbf{x}_t (\mathbf{x}_t - \gamma \mathbf{x}_{t+1})^\top \right], \qquad \mathbf{b} = \mathbb{E}\left[ R_{t+1} \mathbf{x}_t \right]$$

      Where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ denotes the estimate of state $s$, and $\mathbf{x}_t = \mathbf{x}(S_t)$. 1

    • At the TD fixed point,

      $$\overline{VE}(\mathbf{w}_{TD}) \le \frac{1}{1 - \gamma} \min_{\mathbf{w}} \overline{VE}(\mathbf{w})$$

      Note that $\gamma$ close to 1 means this bound is loose, so the asymptotic error of the TD approximation can be quite high.

  • Some tips for choosing the basis function
    • High degree polynomial bases give more accurate approximations but the number of features blows up exponentially, necessitating dimensionality reduction.

    • When using a Fourier cosine basis, with features $x_i(s) = \cos(\pi \mathbf{s}^\top \mathbf{c}^i)$, it may be helpful to use a different step size for each feature. In particular, for feature $i$, we have the $i$-th learning rate as

      $$\alpha_i = \frac{\alpha}{\sqrt{(c_1^i)^2 + \cdots + (c_k^i)^2}}$$

      (with $\alpha_i = \alpha$ for the constant feature whose $\mathbf{c}^i = \mathbf{0}$).

    • It is easy to select features with Fourier bases by setting the coefficient vectors $\mathbf{c}^i$ to account for suspected interactions among state variables (see the sketch after this list). However, Fourier bases are better suited to global properties of states than to local ones.

    • RBF does not seem to be a good basis function. Too complex for little benefit.

  • We may use neural nets to create our features for us.
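
A minimal sketch of the Fourier cosine basis referenced above, for a state normalized to $[0, 1]^k$, together with the per-feature step sizes. Using the full coefficient grid up to a given order is one common construction and is an assumption here.

```python
import itertools
import numpy as np

def fourier_basis(k, order):
    """Order-n Fourier cosine basis for a state in [0, 1]^k.

    Features are x_i(s) = cos(pi * c_i . s) for every integer coefficient
    vector c_i with entries in {0, ..., order}. Also returns per-feature
    step sizes alpha_i = alpha / ||c_i|| (alpha itself for c_i = 0).
    """
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=k)), dtype=float)

    def features(s):
        return np.cos(np.pi * coeffs @ np.asarray(s, dtype=float))

    def step_sizes(alpha):
        norms = np.linalg.norm(coeffs, axis=1)
        norms[norms == 0.0] = 1.0   # constant feature keeps the base step size
        return alpha / norms

    return features, step_sizes
```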

State Representations

Coarse Coding

  • Represent a state as a subset of present features. Features can overlap with each other.
  • Essentially, states are represented with non-mutually exclusive features.
  • Generalization from state $s$ to state $s'$ depends on the number of features whose receptive fields cover both states (a small sketch follows this list).
  • Features with large receptive fields give broad generalization. The finest discrimination possible between states depends on the total number of features.
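
A minimal sketch of coarse coding in a 2D state space, assuming circular receptive fields with hypothetical centers: each feature is binary and active whenever the state falls inside its circle, so nearby states share active features and generalize to each other.

```python
import numpy as np

def circle_features(centers, radius):
    """Coarse coding with circular receptive fields.

    centers -- array of shape (n_features, 2) giving the field centers.
    Returns a function mapping a 2D state to its binary feature vector.
    """
    centers = np.asarray(centers, dtype=float)

    def features(state):
        dists = np.linalg.norm(centers - np.asarray(state, dtype=float), axis=1)
        return (dists <= radius).astype(float)

    return features
```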

Tile Coding

  • A form of coarse coding for multidimensional continuous spaces that is flexible and computationally efficient
  • Think of tile coding as bucketing, but where we have many flavors of buckets. When two data points belong to the same bucket under some flavor, generalization occurs between them.
  • The receptive fields of the features are grouped into partitions of the state space. Each such partition is called a tiling, and each element of the partition is called a tile.
  • A coarse coding is obtained using multiple tilings, shifted around.
  • Generalization occurs to states other than the one trained if those states fall within any of the same tiles.
  • Prefer asymmetric offsets between tilings for better generalization. Intuition: irregular offsets give irregular overlap, avoiding artifacts aligned with a single diagonal.
  • Denser tilings along a dimension give that dimension more weight (finer discrimination) during generalization.
  • We may use hashing to reduce memory requirements: each bin of the hash table corresponds to a (pseudo-randomly assembled) tile. A small sketch of a tile coder follows this list.
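
A minimal sketch of a tile coder for a low-dimensional continuous state, returning one active tile index per tiling. Uniform offsets are used here for brevity, even though asymmetric offsets are preferred as noted above; the bounds and grid sizes are assumptions.

```python
import numpy as np

def tile_coder(n_tilings, tiles_per_dim, low, high):
    """Return a function mapping a state to its active tile indices.

    Each tiling is a grid partition of the state space; tilings are offset
    from one another so their tiles overlap differently.
    low, high -- per-dimension bounds of the state space.
    """
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    dims = low.shape[0]
    tile_width = (high - low) / tiles_per_dim
    tiles_per_tiling = (tiles_per_dim + 1) ** dims   # +1 leaves room for the shift

    def active_tiles(state):
        s = (np.asarray(state, dtype=float) - low) / tile_width  # state in tile units
        indices = []
        for t in range(n_tilings):
            shifted = np.floor(s + t / n_tilings).astype(int)    # offset this tiling
            shifted = np.clip(shifted, 0, tiles_per_dim)
            flat = np.ravel_multi_index(shifted, (tiles_per_dim + 1,) * dims)
            indices.append(t * tiles_per_tiling + int(flat))     # unique per tiling
        return indices   # exactly one active tile per tiling

    return active_tiles
```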

Step Size

  • Suppose we want to learn in about $\tau$ experiences with substantially the same feature vector $\mathbf{x}$. The step size should then be set as

    $$\alpha = \left( \tau \, \mathbb{E}\!\left[ \mathbf{x}^\top \mathbf{x} \right] \right)^{-1}$$

  • This method works best if the feature vectors do not vary greatly in length (a small worked instance follows this list).
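
A small worked instance of the rule above, assuming binary tile-coding features where exactly one tile per tiling is active, so $\mathbf{x}^\top \mathbf{x}$ equals the number of tilings.

```python
def tile_coding_step_size(tau, n_tilings):
    """Rule-of-thumb step size alpha = (tau * E[x^T x])^{-1}.

    With binary tile coding, x^T x is just the number of active tiles,
    i.e. the number of tilings.
    """
    return 1.0 / (tau * n_tilings)

# Learning in roughly tau = 10 presentations with 8 tilings:
alpha = tile_coding_step_size(10, 8)   # 0.0125
```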

Least Squares TD

  • A more data-efficient form of linear TD(0): rather than iterating toward the TD fixed point, it estimates it directly from the data as

    $$\mathbf{w}_t = \widehat{\mathbf{A}}_t^{-1} \widehat{\mathbf{b}}_t, \qquad \widehat{\mathbf{A}}_t = \sum_{k=0}^{t-1} \mathbf{x}_k (\mathbf{x}_k - \gamma \mathbf{x}_{k+1})^\top + \varepsilon \mathbf{I}, \qquad \widehat{\mathbf{b}}_t = \sum_{k=0}^{t-1} R_{k+1} \mathbf{x}_k$$

    We add $\varepsilon \mathbf{I}$ to make the matrix invertible; $\varepsilon$ is taken to be really small.

    For computation’s sake, we make use of the Sherman-Morrison formula to maintain $\widehat{\mathbf{A}}_t^{-1}$ incrementally (a sketch follows this list).

  • The $\varepsilon$ functions similarly to the step size parameter $\alpha$.

  • Since LSTD has no step-size parameter, it never forgets (this is both an advantage and a disadvantage).
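
A minimal sketch of LSTD with the inverse of $\widehat{\mathbf{A}}$ maintained incrementally via the Sherman-Morrison formula, so each step costs $O(d^2)$ rather than the $O(d^3)$ of a fresh matrix inverse. The class layout and argument names are assumptions.

```python
import numpy as np

class LSTD:
    """Least-squares TD(0) over linear features x(s)."""

    def __init__(self, n_features, epsilon=1e-3):
        self.inv_A = np.eye(n_features) / epsilon   # inverse of the initial epsilon * I
        self.b = np.zeros(n_features)

    def update(self, x, r, x_next, gamma):
        # A_hat grows by the rank-1 term x (x - gamma * x_next)^T.
        u = x
        v = x - gamma * x_next
        Au = self.inv_A @ u
        vA = self.inv_A.T @ v                              # v^T A^{-1}, as a vector
        self.inv_A -= np.outer(Au, vA) / (1.0 + v @ Au)    # Sherman-Morrison update
        self.b += r * x

    def weights(self):
        return self.inv_A @ self.b
```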

Memory-Based Function Approximation

  • Non-parametric method that saves training examples in memory as they arrive without updating parameters.
  • When we need an approximation for a query state, we compute it on demand from the stored training examples (i.e., lazy learning).
    • These methods improve with more data.
    • We are also not reliant on global approximations. We instead rely on local approximations.
  • Local learning involves approximating the value function using the local neighborhood of the current query state.
    • One approach is doing (weighted) kNN with the incoming query state as input and the stored states as the memory (a small sketch follows this list).
    • Another approach is locally weighted regression where we fit a surface to the values of a set of nearest states. The weights we use are dependent on distance.
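
A minimal sketch of the distance-weighted kNN approach mentioned above; the stored states and values stand in for the memory of training examples, and the inverse-distance weighting is one common choice.

```python
import numpy as np

def knn_value_estimate(query, states, values, k=5):
    """Lazy, local value estimate from the k nearest stored examples."""
    states = np.asarray(states, dtype=float)
    values = np.asarray(values, dtype=float)
    dists = np.linalg.norm(states - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-8)   # closer neighbors count more
    return float(np.sum(weights * values[nearest]) / np.sum(weights))
```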

Kernel-Based Function Approximation

  • We make use of a kernel function $k(s, s')$ that assigns a number indicating the similarity between two states $s$ and $s'$.

  • We may view the kernel function as the strength of generalization from $s'$ to $s$.

    • If we have feature vectors $\mathbf{x}(s)$, we may use the following kernel

      $$k(s, s') = \mathbf{x}(s)^\top \mathbf{x}(s')$$

    • And, of course, we can use the kernel trick as needed.

  • Kernel Regression involves using a kernel-weighted average of the targets of all examples stored in memory. In particular, if we have stored states $\mathcal{D}$ with targets $g(s')$ for $s' \in \mathcal{D}$, we have (a sketch follows)

    $$\hat{v}(s, \mathcal{D}) = \sum_{s' \in \mathcal{D}} k(s, s') \, g(s')$$
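
A minimal sketch of kernel regression over the stored memory, here with a Gaussian (RBF) kernel as the similarity $k(s, s')$ and with the weights normalized so the estimate is a proper weighted average; the kernel choice and normalization are assumptions.

```python
import numpy as np

def kernel_regression(query, states, targets, bandwidth=1.0):
    """Kernel-weighted average of all stored targets."""
    states = np.asarray(states, dtype=float)
    targets = np.asarray(targets, dtype=float)
    sq_dists = np.sum((states - np.asarray(query, dtype=float)) ** 2, axis=1)
    k = np.exp(-sq_dists / (2.0 * bandwidth ** 2))   # k(s, s') for every stored s'
    return float(np.sum(k * targets) / np.sum(k))
```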

Interest and Emphasis

  • Rationale: Relax our assumption of updating all states equally.

  • Let $I_t$ be the interest, indicating the degree to which we are interested in state $S_t$.

  • Let $M_t$ be the emphasis, which multiplies the learning update and thus emphasizes or de-emphasizes the learning done at time $t$.

  • The distribution $\mu$ for $\overline{VE}$ then depends on $I$.

  • The $n$-step update rule with emphasis becomes

    $$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha M_t \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_{t+n-1}) \right] \nabla \hat{v}(S_t, \mathbf{w}_{t+n-1})$$

  • The emphasis is related to the interest with the formula (a one-step sketch follows this list)

    $$M_t = I_t + \gamma^n M_{t-n}$$

    And $M_t = 0$ for all $t < 0$.
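
A minimal sketch of the one-step ($n = 1$) case with a linear estimate, where the emphasis recursion reduces to $M_t = I_t + \gamma M_{t-1}$; the transition and interest inputs are assumptions for illustration.

```python
import numpy as np

def emphatic_td_episode(w, transitions, interests, gamma, alpha, features):
    """One-step semi-gradient TD with interest and emphasis.

    transitions -- list of (s, r, s_next, done) tuples, in time order
    interests   -- I_t for each time step, supplied by the user
    """
    M = 0.0                                   # M_t = 0 for t < 0
    for (s, r, s_next, done), I in zip(transitions, interests):
        M = I + gamma * M                     # emphasis recursion for n = 1
        x = features(s)
        target = r if done else r + gamma * (w @ features(s_next))
        w = w + alpha * M * (target - w @ x) * x
    return w
```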

Policy Control

  • Update rules are based on the action-value estimate $\hat{q}(S_t, A_t, \mathbf{w})$.

Episodic Semi-Gradient Control

  • Episodic, Semi-Gradient One-Step SARSA: the update rule makes use of $U_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)$, giving

    $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w}_t)$$

  • Episodic, Semi-Gradient n-step SARSA is obtained by using the following as the target $U_t$ (a sketch of the one-step case follows this list):

    $$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$

    with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.
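
A minimal sketch of episodic semi-gradient one-step SARSA with a linear action-value estimate and ε-greedy action selection; the `features(s, a)` function and the action set are assumptions.

```python
import numpy as np

def epsilon_greedy(w, s, actions, features, epsilon, rng):
    """Pick an action by q(s, a, w) = w @ x(s, a), exploring with probability epsilon."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [w @ features(s, a) for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_update(w, s, a, r, s_next, a_next, done, gamma, alpha, features):
    """One semi-gradient one-step SARSA update."""
    x = features(s, a)
    target = r if done else r + gamma * (w @ features(s_next, a_next))
    return w + alpha * (target - w @ x) * x
```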

Handling Continuing Tasks

  • One concern is how to handle action selection when actions are chosen from a continuum.

  • We can use the average reward formulation. It is, in fact, questionable whether we should even be using the discounted setting.

    • Because the return depends on where we start measuring it, in the discounted setting we would have to take the average of the discounted returns over the on-policy distribution.
    • For continuing tasks, discounting then has no effect: the average of the discounted returns is proportional to the average reward (by a factor of $1/(1-\gamma)$), so the ordering of all policies in the average discounted return setting and the average reward setting is the same.
    • A formal proof is given in Sutton and Barto 10.4.
  • Differential Semi-Gradient n-step SARSA is defined using the differential form of the return (a one-step sketch follows) as

    $$G_{t:t+n} = R_{t+1} - \bar{R} + R_{t+2} - \bar{R} + \cdots + R_{t+n} - \bar{R} + \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$

    Where $\bar{R}$ estimates the average reward $r(\pi)$. As usual, $G_{t:t+n} \doteq G_t$ if $t + n \ge T$.
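
A minimal sketch of the one-step differential (average-reward) semi-gradient SARSA update, where the average-reward estimate $\bar{R}$ is itself updated from the TD error; the argument names and the separate step size `beta` for $\bar{R}$ are assumptions.

```python
import numpy as np

def differential_sarsa_step(w, avg_r, s, a, r, s_next, a_next, alpha, beta, features):
    """One differential semi-gradient SARSA update with a linear q estimate.

    avg_r -- current estimate R_bar of the average reward r(pi)
    beta  -- step size for updating R_bar
    """
    x = features(s, a)
    delta = r - avg_r + (w @ features(s_next, a_next)) - (w @ x)   # differential TD error
    avg_r += beta * delta          # move the average-reward estimate
    w = w + alpha * delta * x      # semi-gradient update of the weights
    return w, avg_r
```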

Links

Footnotes

  1. Convergence is guaranteed because $\mathbf{A}$ is positive definite and so the inverse exists. See Sutton and Barto 9.12.