• Rationale: We often have memory (and computation) constraints that keep us from representing or visiting every state in a large state space, so in practice we only ever work with a small sample of states.

    • However, we still need a way to make decisions at unsampled states using the states we have sampled. This is done via function approximation.
    • This is, in fact, supervised learning: an update at one state generalizes to many others, which makes learning generalize but makes the updates harder to control.
      • Note that we must use function approximators that can handle non-stationarity.
      • Non-stationarity arises either because the environment itself is non-stationary or because bootstrapping makes the update targets change as our estimates change.
    • It also makes RL extensible to partially observable problems, where the state is not fully visible to the agent.
    • What function approximation cannot do, however, is augment the state representation with memories of past observations.
  • A tradeoff with function approximation is that we can no longer rely on the policy improvement theorem.

Prediction

  • We represent the value function as a parameterized functional form with weight vector $\mathbf{w} \in \mathbb{R}^d$. We denote

    $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$

    for the approximate value of state $s$ under weight vector $\mathbf{w}$.

    • Note: we assume there are far more states than weights ($d \ll |\mathcal{S}|$). This is what makes it approximation: changing one weight changes the estimated values of many states, and in general no weight vector gets every state exactly right.
  • We may do something similar for the action-value function. That is,

    $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a).$
  • We specify a state distribution $\mu(s) \ge 0$, with $\sum_s \mu(s) = 1$, that says how much we care about the error in the value estimate of each state $s$.

    • We require this state distribution because, under function approximation, updating the value of one state changes the values of other states.
    • Making one state's estimate more accurate generally makes other states' estimates less accurate, so we must say which errors matter most.
    • We often choose $\mu(s)$ to be the fraction of time spent in state $s$ (the on-policy distribution); a minimal sketch follows.
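  • As a concrete illustration (not from the text), here is a minimal Python sketch of a linear parameterization $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ together with an empirical estimate of $\mu$ as the fraction of time spent in each state; the feature map, trajectory, and sizes are all made up.

```python
import numpy as np

# Minimal sketch: linear value-function approximation and an empirical
# estimate of the state distribution mu. Everything here (number of states,
# features, and the example trajectory) is made up for illustration.
n_states, n_features = 5, 3
rng = np.random.default_rng(0)
features = rng.normal(size=(n_states, n_features))   # x(s) for each state s
w = np.zeros(n_features)                              # weight vector

def v_hat(s, w):
    """Approximate value of state s: v_hat(s, w) = w . x(s)."""
    return features[s] @ w

# mu(s) estimated as the fraction of time spent in s along a trajectory.
trajectory = [0, 1, 1, 2, 4, 2, 1, 0, 3, 2]           # hypothetical state visits
counts = np.bincount(trajectory, minlength=n_states)
mu = counts / counts.sum()
```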

Objective

  • The objective function is called the mean square value error (MSVE, written $\overline{VE}$), defined as

    $\overline{VE}(\mathbf{w}) \doteq \sum_{s} \mu(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2.$
    • Note that minimizing the MSVE does not necessarily yield an optimal policy; our goal is always the best policy, not the most accurate value function.
  • We can generalize this to a norm for comparing value functions:

    $\|v\|_\mu^2 \doteq \sum_{s} \mu(s)\, v(s)^2, \qquad \text{so that } \overline{VE}(\mathbf{w}) = \|\hat{v}_{\mathbf{w}} - v_\pi\|_\mu^2.$
  • A geometric way to view this: each weight vector $\mathbf{w}$ is a point in the space of all value functions, and the representable functions $v_{\mathbf{w}}$ form a subspace of it. Different approaches implicitly use different characterizations of the best point in this subspace.

    • In this geometric view, Monte Carlo's solution is the projection of $v_\pi$ onto the representable subspace, i.e. the closest representable value function, defined via the projection operator

      $\Pi v \doteq v_{\mathbf{w}^*} \quad \text{where} \quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \|v - v_{\mathbf{w}}\|_\mu^2;$

      for linear features this is the matrix $\Pi = \mathbf{X}(\mathbf{X}^\top \mathbf{D}\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{D}$, with $\mathbf{D} = \mathrm{diag}(\mu)$ and feature matrix $\mathbf{X}$.
    • An alternative is the Bellman error, obtained by substituting $v_{\mathbf{w}}$ for $v_\pi$ in the Bellman equation and taking the difference between the two sides:

      $\bar{\delta}_{\mathbf{w}}(s) \doteq \Big[ \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big(r + \gamma v_{\mathbf{w}}(s')\big) \Big] - v_{\mathbf{w}}(s) = \mathbb{E}_\pi\big[ R_{t+1} + \gamma v_{\mathbf{w}}(S_{t+1}) - v_{\mathbf{w}}(S_t) \,\big|\, S_t = s \big].$

      In other words, it is the expected TD error at $s$.

    • The vector of Bellman errors at all states is the Bellman error vector. It can be seen as the result of applying the Bellman operator $B_\pi$ to the approximate value function and subtracting that function, so that

      $\bar{\delta}_{\mathbf{w}} = B_\pi v_{\mathbf{w}} - v_{\mathbf{w}}.$
    • The squared norm of the Bellman error vector is an overall measure of error called the mean square Bellman error:

      $\overline{BE}(\mathbf{w}) = \|\bar{\delta}_{\mathbf{w}}\|_\mu^2.$
    • In an approximation context we can only work with representable value functions, but applying the Bellman operator typically takes us outside the representable subspace, so the Bellman error vector is projected back onto it. The mean square projected Bellman error measures the size of that projection:

      $\overline{PBE}(\mathbf{w}) = \|\Pi \bar{\delta}_{\mathbf{w}}\|_\mu^2.$
    • The mean square return error is the expectation, under $\mu$ and the policy, of the squared error between the value estimate and the return:

      $\overline{RE}(\mathbf{w}) = \mathbb{E}\big[\big(G_t - \hat{v}(S_t, \mathbf{w})\big)^2\big].$
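  • To make these objectives concrete, the sketch below computes $\overline{VE}$, $\overline{BE}$, and $\overline{PBE}$ exactly for a tiny two-state Markov reward process with one linear feature per state; the transition matrix, rewards, features, weights, and distribution $\mu$ are all made-up numbers, not anything from the text.

```python
import numpy as np

# Tiny two-state Markov reward process (the policy is already folded in).
# All numbers below are made up purely to illustrate the objectives.
P = np.array([[0.5, 0.5],
              [0.9, 0.1]])        # transition probabilities under pi
r = np.array([1.0, 0.0])          # expected one-step reward from each state
gamma = 0.9
mu = np.array([0.6, 0.4])         # state weighting, e.g. the on-policy distribution
X = np.array([[1.0],
              [2.0]])             # feature matrix: one feature per state
w = np.array([0.5])               # current weight vector

D = np.diag(mu)
v_w = X @ w                                         # approximate values
v_pi = np.linalg.solve(np.eye(2) - gamma * P, r)    # true values: (I - gamma*P)^{-1} r

# Mean square value error: ||v_w - v_pi||_mu^2
VE = (v_w - v_pi) @ D @ (v_w - v_pi)

# Bellman error vector B_pi v_w - v_w, and the mean square Bellman error
delta_bar = r + gamma * P @ v_w - v_w
BE = delta_bar @ D @ delta_bar

# Projection onto the representable subspace, then the projected Bellman error
Pi = X @ np.linalg.inv(X.T @ D @ X) @ X.T @ D
PBE = (Pi @ delta_bar) @ D @ (Pi @ delta_bar)

print(f"VE={VE:.4f}  BE={BE:.4f}  PBE={PBE:.4f}")
```

    For linear features, the TD fixed point is exactly the weight vector at which $\overline{PBE} = 0$, while Monte Carlo converges to the minimizer of $\overline{VE}$ (the projection of $v_\pi$).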

Learnability

  • A quantity (such as an objective or a value function) is learnable if, given any amount of experience, it can be determined from the observable data alone, without knowledge of the underlying MDP.
  • The Bellman error (BE) is not learnable: minimizing it requires access to the underlying model itself, not just the experience it generates.
  • The VE objective is not learnable either: two different MDPs can produce identical streams of experience yet have different VE, so VE cannot be determined from the experience stream alone.
    • Still, the parameter that optimizes VE is learnable. This follows from the mean square return error: RE is just VE plus a variance term that is independent of $\mathbf{w}$, so the two share the same minimizer (see the expansion after this list).
  • The PBE and the mean square TD error (TDE) are determined by the data and are therefore learnable; note, however, that they have different minima.
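  • To see why $\overline{VE}$ and $\overline{RE}$ share a minimizer, here is the standard expansion, using only that $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$:

```latex
\begin{align*}
\overline{RE}(\mathbf{w})
  &= \mathbb{E}\!\left[\big(G_t - \hat{v}(S_t,\mathbf{w})\big)^2\right] \\
  &= \mathbb{E}\!\left[\big((G_t - v_\pi(S_t)) + (v_\pi(S_t) - \hat{v}(S_t,\mathbf{w}))\big)^2\right] \\
  &= \mathbb{E}\!\left[(G_t - v_\pi(S_t))^2\right]
     + 2\,\mathbb{E}\!\left[(G_t - v_\pi(S_t))\big(v_\pi(S_t) - \hat{v}(S_t,\mathbf{w})\big)\right]
     + \overline{VE}(\mathbf{w}) \\
  &= \overline{VE}(\mathbf{w}) + \mathbb{E}\!\left[(G_t - v_\pi(S_t))^2\right].
\end{align*}
```

    The cross term vanishes because, conditioning on $S_t$, $\mathbb{E}[G_t - v_\pi(S_t) \mid S_t] = 0$ while the other factor depends only on $S_t$. The remaining term does not depend on $\mathbf{w}$, so minimizing RE from data also minimizes VE.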

Topics

Links

  • Sutton and Barto
    • 9.1 - 9.2 - the objectives of function approximation
    • 11.4 - more on the geometry of the value function.
    • 11.6 - why the Bellman Error is not learnable.
  • Reinforcement Learning