- Rationale: we often have memory constraints that do not allow us to represent large state spaces exactly, so we only ever work with a sample of the states.
- However, we need a way to make decisions about unsampled states using the states we have sampled. This is done via function approximation.
- This is, in fact, supervised learning. Generalization lets each update apply beyond the sampled states, but it also makes those updates harder to control.
- It should be noted that we must use function approximators that can handle non-stationarity.
- Non-stationarity arises either because the environment itself is non-stationary or because of bootstrapping, which makes the update targets change as our estimates change.
- Function approximation also makes RL applicable to partially observable problems, where the state is not fully visible to the agent.
- It cannot, however, augment the state representation with memories of past observations.
- A tradeoff with function approximation is that we can no longer rely on the policy improvement theorem.
Prediction
- We represent the value function as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$. We denote the approximate value of state $s$ under weight vector $\mathbf{w}$ by $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$.
- Note: assume we have far more states than weights ($|\mathcal{S}| \gg d$), so changing one weight changes the estimated values of many states at once; this is what forces generalization.
- We may do something similar for the action-value function: $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$.
- We specify a state distribution $\mu(s) \ge 0$ (with $\sum_s \mu(s) = 1$) which specifies how much we care about the error in the value estimate of each state $s$.
- We require this state distribution because updating the estimate for one state affects the estimates of other states.
- Making one state's estimate more accurate generally makes the estimates of other states less accurate, so we must say which errors matter most.
- We often choose $\mu(s)$ to be the fraction of time spent in state $s$ (see the sketch after this list).
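As a concrete illustration of the setup above, here is a minimal sketch of a linear value-function approximator together with an empirical estimate of $\mu$. The state count, feature matrix, and random "visits" are made-up assumptions for illustration, not from the text.

```python
import numpy as np

# Assumed toy setup: 6 states, 3 features, deliberately more states than weights.
num_states, num_features = 6, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(num_states, num_features))    # row s is the feature vector x(s)

def v_hat(s, w):
    """Linear approximation: v_hat(s, w) = w^T x(s)."""
    return X[s] @ w

def estimate_mu(visited_states):
    """mu(s): fraction of time spent in each state, estimated from experience."""
    counts = np.bincount(visited_states, minlength=num_states)
    return counts / counts.sum()

w = np.zeros(num_features)
visits = rng.integers(0, num_states, size=10_000)  # stand-in for a real state trajectory
mu = estimate_mu(visits)
print(v_hat(2, w), mu.round(3))
```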
Objective
- The objective function is called the mean square value error (MSVE), defined as
  $$\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2$$
- Note that minimizing MSVE does not necessarily give an optimal policy; our goal is always to find the best policy, not the best value function.
- We can generalize the notion of comparing value functions with the following norm:
  $$\lVert v \rVert_\mu^2 \doteq \sum_{s \in \mathcal{S}} \mu(s)\, v(s)^2, \qquad \text{so that}\quad \overline{VE}(\mathbf{w}) = \lVert \hat{v}_{\mathbf{w}} - v_\pi \rVert_\mu^2$$
- A geometric way to view things: value functions are points in a vector space, and the functions representable by the approximator, parameterized by the weight vector $\mathbf{w}$, form a subspace of it. Different approaches implicitly use different characterizations of the solution within this picture.
- In this geometric view, Monte Carlo's solution is found by projecting the true value function onto the representable subspace, i.e. the closest representable value function under the projection operator $\Pi$, defined by
  $$\Pi v \doteq \hat{v}_{\mathbf{w}^*} \quad \text{where} \quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \lVert v - \hat{v}_{\mathbf{w}} \rVert_\mu^2.$$
  For linear features this is the projection matrix $\Pi = X(X^\top D X)^{-1} X^\top D$ with $D = \mathrm{diag}(\mu)$.
- An alternative is the Bellman error, obtained by substituting $\hat{v}_{\mathbf{w}}$ for $v_\pi$ in the Bellman equation and computing the difference between the two sides:
  $$\bar{\delta}_{\mathbf{w}}(s) \doteq \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma\,\hat{v}(s', \mathbf{w})\big] \;-\; \hat{v}(s, \mathbf{w}).$$
  In other words, it is the expectation of the TD error.
- The vector of Bellman errors at all states is the Bellman error vector. It can be seen as the result of applying the Bellman operator $B_\pi$ to the approximate value function, so that $\bar{\delta}_{\mathbf{w}} = B_\pi \hat{v}_{\mathbf{w}} - \hat{v}_{\mathbf{w}}$.
- The $\mu$-norm of the Bellman error vector can be used as a measure of error called the mean square Bellman error: $\overline{BE}(\mathbf{w}) = \lVert \bar{\delta}_{\mathbf{w}} \rVert_\mu^2$.
- In an approximation context we only deal with representable value functions; the Bellman error vector generally lies outside the representable subspace, so it is projected back onto it. The mean square projected Bellman error measures the size of this projected vector: $\overline{PBE}(\mathbf{w}) = \lVert \Pi \bar{\delta}_{\mathbf{w}} \rVert_\mu^2$ (computed concretely in the sketch after this list).
- The mean square return error is the expectation, under $\mu$, of the square of the difference between the value estimate and the return:
  $$\overline{RE}(\mathbf{w}) \doteq \mathbb{E}\big[\big(G_t - \hat{v}(S_t, \mathbf{w})\big)^2\big] = \overline{VE}(\mathbf{w}) + \mathbb{E}\big[\big(G_t - v_\pi(S_t)\big)^2\big].$$
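To make the objectives above concrete, here is a sketch that computes $\overline{VE}$, $\overline{BE}$, and $\overline{PBE}$ for a small, fully known Markov reward process with linear features. The transition matrix, rewards, discount, and features are assumed toy values; the formulas follow the definitions above, with $\Pi = X(X^\top D X)^{-1} X^\top D$ and $D = \mathrm{diag}(\mu)$.

```python
import numpy as np

# Assumed toy Markov reward process (policy already applied): 3 states, 2 weights.
P = np.array([[0.0, 1.0, 0.0],       # state-transition probabilities under the policy
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([0.0, 0.0, 1.0])        # expected reward on leaving each state
gamma = 0.9
X = np.array([[1.0, 0.0],            # feature matrix: row s is x(s)
              [0.0, 1.0],
              [1.0, 1.0]])
mu = np.array([1 / 3, 1 / 3, 1 / 3]) # state distribution (stationary for this cycle)
D = np.diag(mu)

v_pi = np.linalg.solve(np.eye(3) - gamma * P, r)    # true values: (I - gamma P)^-1 r

def objectives(w):
    v_w = X @ w                                     # representable value function
    ve = mu @ (v_pi - v_w) ** 2                     # mean square value error
    bellman_err = r + gamma * P @ v_w - v_w         # Bellman error vector B_pi v_w - v_w
    be = mu @ bellman_err ** 2                      # mean square Bellman error
    Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D   # projection onto representable subspace
    pbe = mu @ (Pi @ bellman_err) ** 2              # mean square projected Bellman error
    return ve, be, pbe

print(objectives(np.zeros(2)))
print(objectives(np.array([1.0, 2.0])))
```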
Learnability
- A quantity (such as an objective) is learnable if, given any amount of experience, we can converge to its true value from the observed data alone.
- The Bellman Error is not learnable unless we have access to the underlying model itself.
- The VE objective is not learnable: two MDPs can generate the same streams of experience yet have different VE, so we cannot distinguish them from the experience alone.
- Still, the parameter $\mathbf{w}$ that optimizes VE is learnable. This follows from the mean square return error: RE is just VE plus a variance term that does not depend on $\mathbf{w}$, so both objectives share the same minimizer (illustrated in the sketch after this list).
- PBE and TDE are determined from data and are learnable. However, note they have different minima.
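A small sketch of why the VE-optimal parameter is still learnable: gradient Monte Carlo does SGD on the observable return error, and since RE differs from VE only by a term independent of $\mathbf{w}$, it converges toward the VE-minimizing weights. The episodic toy task and features below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 1.0
# Assumed one-hot features for the two non-terminal states of a toy chain.
X = np.array([[1.0, 0.0],
              [0.0, 1.0]])

def run_episode():
    """Assumed toy episodic task: start in state 0, step right, noisy reward of mean 1."""
    states, rewards = [], []
    s = 0
    while s < 2:                                   # state 2 is terminal
        states.append(s)
        rewards.append(1.0 + rng.normal(scale=0.5))
        s += 1
    return states, rewards

def gradient_mc(num_episodes=5000, alpha=0.01):
    """Gradient Monte Carlo: SGD on the squared return error (G_t - v_hat(S_t, w))^2."""
    w = np.zeros(2)
    for _ in range(num_episodes):
        states, rewards = run_episode()
        G = 0.0
        for t in reversed(range(len(states))):     # backwards pass accumulates each G_t
            G = rewards[t] + gamma * G
            x = X[states[t]]
            w += alpha * (G - x @ w) * x           # gradient step on the return error
    return w

print(gradient_mc())  # approaches v_pi = [2, 1] despite the reward noise
```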
Topics
- On Policy Prediction and Control with Approximation
- Off Policy Prediction and Control with Approximation
- Eligibility Traces - a useful (in most cases essential) construct when it comes to function approximation
- Policy Gradient Methods
Links
- Sutton and Barto
    - 9.1 - 9.2 - the objectives of function approximation
    - 11.4 - more on the geometry of the value function.
    - 11.6 - why the Bellman Error is not learnable.
- Reinforcement Learning