Definition
- A backup is an operation that performs an update on the value function.
- We may divide backups based on the following:
  - Whether they update state values or action values (i.e., whether they use $v$ or $q$).
  - Whether they update the value of the optimal policy or an arbitrary policy (i.e., whether they use $v_*$ or $v_\pi$).
  - Whether they operate on all possible successors (full / expected backup) or a sample of successors (sample backup).
- The equation for performing a backup is as follows:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
  Where $p(s', r \mid s, a)$ is the dynamics function. We can write the above in a more compact manner:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) v_\pi(s') \right]$$
  Where $r(s, a)$ and $p(s' \mid s, a)$ are marginal distributions given the provided parameters.
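- As a concrete illustration (not from the source), a minimal sketch of a full / expected backup for $v_\pi$ on a tiny tabular MDP; the dynamics table `p`, the policy `pi`, and the two-state example are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical tabular MDP: p[s][a] is a list of (prob, next_state, reward) triples,
# pi[s] is a list of action probabilities, gamma is the discount factor.
def expected_backup(v, s, p, pi, gamma):
    """One full / expected backup of v(s): sums over every action and every successor."""
    new_value = 0.0
    for a, action_prob in enumerate(pi[s]):
        for prob, s_next, reward in p[s][a]:
            new_value += action_prob * prob * (reward + gamma * v[s_next])
    return new_value

# Tiny 2-state example: action 0 stays put (reward 0), action 1 switches states (reward 1).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]},
}
pi = {0: [0.5, 0.5], 1: [0.5, 0.5]}  # uniform random policy
v = np.zeros(2)
v[0] = expected_backup(v, 0, p, pi, gamma=0.9)  # backs up state 0 using all successors
```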
- The optimal Bellman equations describe the optimal policy $\pi_*$.
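- For reference, the standard form of these equations (as in Sutton and Barto, Ch. 3), written in the same notation as above:
  $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$
  $$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$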
- Full / Expected backups are more precise, but at the cost of requiring more computation. For a given state, they back up based on all successors of that state.
  - For stochastic models, we perform the backup using the expected value over the successor states / actions.
  - These are what we use for Dynamic Programming-based approaches.
- Sample backups are computationally cheaper, but at the cost of sampling error (i.e., they are inaccurate). For a given state, they back up based on a sample of the successor states.
  - These are what we use for Temporal Difference Learning approaches.
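- A minimal sketch (not from the source) of a sample backup, here a TD(0)-style update of $v_\pi$; the environment interface `env.step` and the step size `alpha` are hypothetical assumptions.

```python
def td0_sample_backup(v, s, a, env, alpha, gamma):
    """One sample backup of v(s): uses a single sampled successor instead of all of them."""
    s_next, reward = env.step(s, a)          # sample one transition from the environment
    td_target = reward + gamma * v[s_next]   # bootstrapped target from the sampled successor
    v[s] += alpha * (td_target - v[s])       # move v(s) part-way toward the target
    return s_next
```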
- The Bellman operator $T^\pi$ applies to a value function $v$ and is defined as:
  $$(T^\pi v)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v(s') \right]$$
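- In the tabular case the operator can also be written in matrix form, which makes the fixed-point view explicit. A sketch, assuming precomputed inputs `r_pi` (expected reward per state under $\pi$) and `P_pi` (policy-weighted transition matrix), neither of which appears in the source:

```python
import numpy as np

def bellman_operator(v, r_pi, P_pi, gamma):
    """Applies T^pi once: (T^pi v) = r_pi + gamma * P_pi @ v."""
    return r_pi + gamma * P_pi @ v

def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-8):
    """Because T^pi is a gamma-contraction, repeated application converges to v_pi."""
    v = np.zeros(len(r_pi))
    while True:
        v_new = bellman_operator(v, r_pi, P_pi, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```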
Typology of Updates
Approach | Description |
---|---|
[[Dynamic Programming for Reinforcement Learning#policy-evaluation \| Policy Evaluation]] | Expected update on $v_\pi$ |
[[Dynamic Programming for Reinforcement Learning#value-iteration \| Value Iteration]] | Expected update on $v_*$ |
Q-Policy Evaluation | Expected update on $q_\pi$ |
Q-Value Iteration | Expected update on $q_*$ |
TD(0) | Sample update on $v_\pi$ |
??? | Sample update on $v_*$ |
[[Temporal Difference Learning#sarsa \| SARSA]] | Sample update on $q_\pi$ |
[[Temporal Difference Learning#q-learning \| Q-learning]] | Sample update on $q_*$ |
- When performing an expected update on a stochastic problem, it is recommended to use Prioritized Sweeping.
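- To make the sample-update rows of the table concrete, a sketch (not from the source) of the SARSA and Q-learning update rules; `Q` is assumed to map each state to an indexable collection of action values, and `alpha`, `gamma` are assumed step-size / discount parameters.

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """Sample update toward q_pi: bootstraps with the action actually taken next (on-policy)."""
    target = reward + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, reward, s_next, alpha, gamma):
    """Sample update toward q_*: bootstraps with the greedy (max) action value (off-policy)."""
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```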
Average Reward Bellman Equations
- For the following, assume we are dealing with the average reward context:
  $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right]$$
  $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right]$$
  Notice: we just replaced all rewards $r$ with $r - r(\pi)$. For the optimality equations, we likewise removed all discounting.
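- A sketch of how this shows up in a sample update: a tabular differential TD(0) step in the style of the average reward methods in Sutton and Barto (Ch. 10). The average reward estimate `r_bar` and the step sizes `alpha`, `beta` are assumed parameters, not from the source.

```python
def differential_td0_update(v, s, reward, s_next, r_bar, alpha, beta):
    """Average reward TD(0): the TD error uses (reward - r_bar) and no discount factor."""
    delta = reward - r_bar + v[s_next] - v[s]  # differential TD error
    v[s] += alpha * delta                      # update the differential value estimate
    r_bar += beta * delta                      # update the average reward estimate
    return r_bar
```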
Hamilton-Jacobi-Bellman Equations
- The Hamilton-Jacobi-Bellman (HJB) equations can be viewed as a generalization of the Bellman equations to the continuous case:
  $$\rho v_*(s) = \max_a \left[ r(s, a) + \nabla_s v_*(s)^\top f(s, a) \right]$$
  Where $f(s, a)$ describes the (deterministic) continuous-time dynamics, $\dot{s} = f(s, a)$. And $\rho > 0$ is the continuous-time discount rate.
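- One way to see the connection (a sketch, assuming the deterministic dynamics and discount rate above): write the discrete Bellman optimality equation over a small time step $\Delta t$, with discount factor $\gamma = 1 - \rho \Delta t$,
  $$v_*(s) = \max_a \left[ r(s, a)\,\Delta t + (1 - \rho \Delta t)\, v_*\big(s + f(s, a)\,\Delta t\big) \right]$$
  then expand $v_*\big(s + f(s, a)\,\Delta t\big) \approx v_*(s) + \nabla_s v_*(s)^\top f(s, a)\,\Delta t$, subtract $v_*(s)$, divide by $\Delta t$, and let $\Delta t \to 0$ to recover the HJB equation.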
Links
- Sutton and Barto
  - Ch. 3 - introduction of Bellman backups
  - Ch. 8.5 - Types of Backups
Footnotes
- TODO: Move this to the appropriate location ↩