Definition
- A backup is an operation that performs an update on the value function.
- We may divide backups based on the following:
  - Whether they update state values or action values (i.e., whether they use $v$ or $q$).
  - Whether they update the value of the optimal policy or an arbitrary policy (i.e., whether they use $v_*$ or $v_\pi$).
  - Whether they operate on all possible successors (full / expected backup) or a sample of successors (sample backup).
- The equation for performing a backup is as follows:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
  Where $p(s', r \mid s, a)$ is the dynamics function. We can write the above in a more compact manner:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) v_\pi(s') \right]$$
  Where $r(s, a)$ and $p(s' \mid s, a)$ are marginal distributions given the provided parameters.
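- As a concrete illustration (not from the source), a minimal sketch of a full / expected backup for $v_\pi$ on a tiny tabular MDP; the dynamics table `p`, the policy `pi`, and the two-state example are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical tabular MDP: p[s][a] is a list of (prob, next_state, reward) triples,
# pi[s] is a list of action probabilities, gamma is the discount factor.
def expected_backup(v, s, p, pi, gamma):
    """One full / expected backup of v(s): sums over every action and every successor."""
    new_value = 0.0
    for a, action_prob in enumerate(pi[s]):
        for prob, s_next, reward in p[s][a]:
            new_value += action_prob * prob * (reward + gamma * v[s_next])
    return new_value

# Tiny 2-state example: action 0 stays put (reward 0), action 1 switches states (reward 1).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]},
}
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # uniform random policy
v = np.zeros(2)
v[0] = expected_backup(v, 0, p, pi, gamma=0.9)  # backs up state 0 using all successors
```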
- The optimal Bellman equations describe the optimal policy $\pi_*$.
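- For reference, the standard form of these equations (as in Sutton and Barto, Ch. 3), written in the same notation as above:
  $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
  $$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$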
- Full / Expected backups are more precise, but at the cost of requiring more computation. For a given state, they back up based on all successors of that state.
  - For stochastic models, we perform the backup using the expected value over the successor states / actions.
  - These are what we use for Dynamic Programming-based approaches.
- Sample backups are computationally cheaper, but at the cost of sampling error (i.e., they are inaccurate). For a given state, they back up based on a sample of the successor states.
  - These are what we use for Temporal Difference Learning approaches.
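- A minimal sketch (not from the source) of a sample backup, here a TD(0)-style update of $v_\pi$; the environment interface `env.step` and the step size `alpha` are hypothetical assumptions.

```python
def td0_sample_backup(v, s, a, env, alpha, gamma):
    """One sample backup of v(s): uses a single sampled successor instead of all of them."""
    s_next, reward = env.step(s, a)          # sample one transition from the environment
    td_target = reward + gamma * v[s_next]   # bootstrapped target from the sampled successor
    v[s] += alpha * (td_target - v[s])       # move v(s) part-way toward the target
    return s_next
```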
- The Bellman operator $T^\pi$ applies to a value function $v$ and is defined as:
  $$(T^\pi v)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v(s') \right]$$
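- In the tabular case the operator can also be written in matrix form, which makes the fixed-point view explicit. A sketch, assuming precomputed inputs `r_pi` (expected reward per state under $\pi$) and `P_pi` (policy-weighted transition matrix), neither of which appears in the source:

```python
import numpy as np

def bellman_operator(v, r_pi, P_pi, gamma):
    """Applies T^pi once: (T^pi v) = r_pi + gamma * P_pi @ v."""
    return r_pi + gamma * P_pi @ v

def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-8):
    """Because T^pi is a gamma-contraction, repeated application converges to v_pi."""
    v = np.zeros(len(r_pi))
    while True:
        v_new = bellman_operator(v, r_pi, P_pi, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```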
Typology of Updates
Approach | Description |
---|---|
[[Dynamic Programming for Reinforcement Learning#policy-evaluation \| Policy Evaluation]] | Expected update on $v_\pi$ |
[[Dynamic Programming for Reinforcement Learning#value-iteration \| Value Iteration]] | Expected update on $v_*$ |
Q-Policy Evaluation | Expected update on $q_\pi$ |
Q-Value Iteration | Expected update on $q_*$ |
TD(0) | Sample update on $v_\pi$ |
??? | Sample update on $v_*$ |
[[Temporal Difference Learning#sarsa \| SARSA]] | Sample update on $q_\pi$ |
[[Temporal Difference Learning#q-learning \| Q-learning]] | Sample update on $q_*$ |
- When performing an expected update on a stochastic problem, it is recommended to use Prioritized Sweeping.
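- To make the sample-update rows of the table concrete, a sketch (not from the source) of the SARSA and Q-learning update rules; `Q` is assumed to map each state to an indexable collection of action values, and `alpha`, `gamma` are assumed step-size / discount parameters.

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """Sample update toward q_pi: bootstraps with the action actually taken next (on-policy)."""
    target = reward + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """Sample update toward q_*: bootstraps with the greedy (max) action value (off-policy)."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```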
Average Reward Bellman Equations
- For the following, assume we are dealing with the average reward context:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right]$$
  $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right]$$
  Notice: we just replaced all rewards $r$ with $r - r(\pi)$. For the optimality equations, we likewise removed all discounting.
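- A sketch of how this shows up in a sample update: a tabular differential TD(0) step in the style of the average reward methods in Sutton and Barto (Ch. 10). The average reward estimate `r_bar` and the step sizes `alpha`, `beta` are assumed parameters, not from the source.

```python
def differential_td0_update(v, s, reward, s_next, r_bar, alpha, beta):
    """Average reward TD(0): the TD error uses (reward - r_bar) and no discount factor."""
    delta = reward - r_bar + v[s_next] - v[s]  # differential TD error
    v[s] += alpha * delta                      # update the differential value estimate
    r_bar += beta * delta                      # update the average reward estimate
    return r_bar
```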
Hamilton-Jacobi-Bellman Equations
- The Hamilton-Jacobi-Bellman (HJB) equations can be viewed as a generalization of the Bellman equations to the continuous case:
  $$\rho v_*(s) = \max_a \left[ r(s, a) + \nabla_s v_*(s)^\top f(s, a) \right]$$
  Where $f(s, a)$ describes the (deterministic) continuous-time dynamics, $\dot{s} = f(s, a)$. And $\rho > 0$ is the continuous-time discount rate.
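- One way to see the connection (a sketch, assuming the deterministic dynamics and discount rate above): write the discrete Bellman optimality equation over a small time step $\Delta t$, with discount factor $\gamma = 1 - \rho \Delta t$,
  $$v_*(s) = \max_a \left[ r(s, a)\,\Delta t + (1 - \rho \Delta t)\, v_*\big(s + f(s, a)\,\Delta t\big) \right]$$
  then expand $v_*\big(s + f(s, a)\,\Delta t\big) \approx v_*(s) + \nabla_s v_*(s)^\top f(s, a)\,\Delta t$, subtract $v_*(s)$, divide by $\Delta t$, and let $\Delta t \to 0$ to recover the HJB equation.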
Links
- Sutton and Barto
  - Ch. 3 - introduction of Bellman backups
  - Ch. 8.5 - Types of Backups
Footnotes
- TODO: Move this to the appropriate location ↩