- In an on-policy context, we refer to
as the on-policy distribution. -
In continuing tasks,
is the stationary distribution under policy . -
In an episodic task, we calculate the following.
Let
be the probability that an episode begins with state . the number of time steps spent, on average, in within an episode with discounting rate We then have
-
Policy Prediction
- Assume that states appear in distribution
.
Gradient Descent
- We can use gradient descent methods (specifically Stochastic Gradient Descent) to adjust the weight vector at each step.
- In supervised learning terms: Treat the estimate
as the predicted value and as the ground truth value. -
Where
denotes some to the true value (i.e., one obtained from bootstrapping using or it may be ) - If
is unbiased then we can guarantee convergence via SGD. - For example, if
as in Monte Carlo approximation.
- For example, if
- If
is derived from bootstrapping, then we cannot guarantee convergence with Gradient Descent because the estimates also depend on .
- In supervised learning terms: Treat the estimate
Semi-Gradient Descent and TD-0
- Modifies the Gradient Descent based approach by including only part of the gradient.
- Suitable for bootstrapping since they take into account the effect of changing
on the estimate, but ignore its effect on the target. - Convergence to global optima not guaranteed, but enables much faster learning, and allows continual, online learning.
- An example is semi-gradient TD-0 where
State Aggregation
- Group states together with one estimated value (one component of
). - We typically see a staircase effect where the approximate value within a group abruptly changes.
- A special case of SGD where the gradient is
for ’s group’s component, and for all other components - We may think of state aggregation as a special case of Coarse Coding where the features correspond to the group the state belongs in.
Using Supervised Learning
- We may consider using Linear Models where each state is encoded with a series of feature vectors.
-
Linear semi-gradient
converges to the TD fixed point where Where
denotes the estimate of state . 1 -
At the TD fixed point,
Note that,
close to means the error of TD approximation can be quite high.
-
- Some tips for choosing the basis function
-
High degree polynomial bases give more accurate approximations but the number of features blows up exponentially, necessitating dimensionality reduction.
-
When using a Fourier cosine basis, it may be helpful to use a different step size for each feature. In particular, for feature
, we have the -th learning rate as -
It is easy to select features with Fourier bases by setting the seeds
to account for suspected interactions among state variables. However, Fourier bases are better suited for global properties not local properties of states. -
RBF does not seem to be a good basis function. Too complex for little benefit.
-
- We may use neural nets to create our features for us.
- ANNs can use TD errors to learn value functions.
- ANNs can be used to maximize expected reward (see Gradient bandits)
State Representations
Coarse Coding
- Represent a state as a subset of present features. Features can overlap with each other.
- Essentially, states are represented with non-mutually exclusive features.
- Generalization from state
to depends on the number of their features whose receptive fields overlap. - Features with large receptive fields give broad generalization. The finest discrimination possible between states depends on the total number of features.
Tile Coding
- A form of coarse coding for multidimensional continuous spaces that is flexible and computationally efficient
- Think of tile coding as bucketing but where we have many flavors of buckets. When two data points belong to the same bucket in a certain flavor, we can perform generalization..
- The receptive fields of the features are grouped into partitions of the state space. Each such partition is called a tiling, and each element of the partition is called a tile.
- A coarse coding is obtained using multiple tilings, shifted around.
- Generalization occurs to states other than the one trained if those states fall within any of the same tiles,
- Prefer asymmetrical offsets for better generalization. Intuition, irregular offsets = irregular overlap.
- Denser tilings = more importance given during generalization.
- We may use hashing to reduce memory requirements — each bin in the hash set corresponds to a tile.
Step Size
-
Suppose we want to learn
experiences with feature vector . The step size should be set as -
This method works best if the feature vectors do not vary greatly in length
Least Squares TD
-
Optimizes Semi-Gradient TD(0) by computing it iteratively..
We add
to make the matrix invertible. is taken to be really small.For computation’s sake, we make use of the Sherman-Morrison formula.
-
The
functions similarly to the step size parameter . -
Since LSTD has no step-size parameter, it never forgets (this is both an advantage and a disadvantage).
Memory-Based Function Approximation
- Non-parametric method that saves training examples in memory as they arrive without updating parameters.
- When we need an approximation for a query state, we simply fetch from the stored training examples (aka, we do lazy learning).
- These methods improve with more data.
- We are also not reliant on global approximations. We instead rely on local approximations.
- Local learning involves approximating the value function using the local neighborhood of the current query state.
- One approach is doing (weighted) kNN with the incoming query state as input, and the stored states as memory.
- Another approach is locally weighted regression where we fit a surface to the values of a set of nearest states. The weights we use are dependent on distance.
Kernel-Based Function Approximation
-
We make use of a kernel function that assigns a number to indicate the similarity between two states.
-
We may view the kernel function
as the strength of generalization from to .-
If we have feature vectors, we may have the following kernel
-
And, of course, we can use the kernel trick as needed.
-
-
Kernel Regression involves using a kernel weighted average of the targets of all examples stored in memory. In particular if we have samples
and targets we have
Interest and Emphasis
-
Rationale: Relax our assumption of updating all states equally.
-
Let
be the interest - indicating the degree to which we are interested in state . -
Let
be the emphasis which multiplies the learning update, and thus emphasizes or de-emphasizes the learning done at time . -
The distribution
for then depends on . -
The update rule with emphasis becomes
-
The emphasis is related to the interest with the formula
And
for .
Policy Control
-
Update rules are based on
.
Episodic Semi-Gradient Control
-
Episodic, Semi-Gradient One-Step SARSA The update rule makes use of
-
Episodic, Semi-Gradient n-step SARSA is obtained by using the following as
Handling Continuing Tasks
-
One concern is how do we handle action selection when actions are selected on a continuum.
-
We can use the average reward formulation. It is in fact, questionable if we should even be using the discounted setting.
- To account for variance in return, in the discounted setting we would have to take the average.
- For continuing tasks, discounting has no effect because the average of the discounted returns is proportional to the average reward. The ordering of all policies in the average discounted return setting and the average reward setting are the same.
- A formal proof is given in Sutton and Barto 10.4.
-
Differential Semi-Gradient n-step SARSA is defined using the differential form of the return as
Where
estimates . As usual if .
Links
-
- 9.3 - Gradient Monte Carlo Algorithm, and Semi-Gradient
. - 9.5 - Different Basis Functions
- 10.1 - Episodic Semi-Gradient SARSA
- 9.3 - Gradient Monte Carlo Algorithm, and Semi-Gradient
-
Tile-Coding: An Efficient Sparse-Coding Method for Real-Valued Data
-
A Unified View on Reinforcement Learning Approaches - more on On-Policy learning
-
Linear Models - more on basic supervised learning models.
Footnotes
-
Convergence is guaranteed because
is positive definite and so the inverse exists. See Sutton and Barto 9.12. ↩