Definition

  • A backup is an operation that performs an update on the value function

  • We may divide backups based on the following:

    • Whether they update state values or action values (i.e., whether they use $v$ or $q$).
    • Whether they estimate the value of the optimal policy or of an arbitrary given policy (i.e., whether they use $v_*$ / $q_*$ or $v_\pi$ / $q_\pi$).
    • Whether they operate on all possible successors (full / expected backup) or on a single sampled successor (sample backup).
  • The equation for performing a (full) backup of $v_\pi$ is as follows

    $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$

    Where $p(s', r \mid s, a)$ is the dynamics function.

    We can write the above in a more compact manner

    $$v_\pi(s) = \sum_a \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \, v_\pi(s') \right]$$

    Where $r(s, a)$ and $p(s' \mid s, a)$ are marginals of the dynamics function given the provided parameters (the expected reward and the state-transition distribution, respectively).

  • The optimal Bellman equations describe the optimal policy $\pi_*$. For state values:

    $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$

  • Full / expected backups are more precise, but at the cost of requiring more computation. For a given state, they back up based on all possible successors of that state.

    • For stochastic models, we perform the backup using the expectation over successor states / rewards.
    • These are what we use for Dynamic Programming-based approaches.
  • Sample backups are computationally cheaper, but at the cost of sampling error (i.e., any single update is noisy). For a given state, they back up based on a single sampled successor state. (Sketches of both backup types appear after this list.)

  • The Bellman operator $T^\pi$ applies to a value function and is defined as

    $$(T^\pi v)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v(s') \right]$$

    (The Bellman optimality operator $T^*$ replaces the expectation over actions under $\pi$ with a $\max_a$. A sketch applying the operator appears after this list.)
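
To make the two kinds of backup concrete, here is a minimal Python sketch (not from the source notes). It assumes a hypothetical tabular setup: `policy[s]` maps actions to probabilities $\pi(a \mid s)$, and `dynamics(s, a)` returns `(prob, s_next, reward)` triples drawn from $p(s', r \mid s, a)$.

```python
import random

def expected_backup(V, s, policy, dynamics, gamma=0.99):
    """Full / expected backup of v_pi(s): sum over every action and every successor."""
    total = 0.0
    for a, pi_a in policy[s].items():              # pi(a | s)
        for prob, s_next, r in dynamics(s, a):     # p(s', r | s, a)
            total += pi_a * prob * (r + gamma * V[s_next])
    V[s] = total

def sample_backup(V, s, policy, dynamics, alpha=0.1, gamma=0.99):
    """Sample backup of v_pi(s): draw one action and one successor, then move
    V[s] a small step toward the sampled target (this is the TD(0) update)."""
    actions, action_probs = zip(*policy[s].items())
    a = random.choices(actions, weights=action_probs)[0]
    probs, next_states, rewards = zip(*dynamics(s, a))
    i = random.choices(range(len(next_states)), weights=probs)[0]
    target = rewards[i] + gamma * V[next_states[i]]
    V[s] += alpha * (target - V[s])
```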

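Similarly, a hedged sketch (again, not from the source notes) of applying the Bellman operator $T^\pi$ to an entire value function: repeatedly applying it is iterative policy evaluation, since $v_\pi$ is the operator's fixed point, and replacing the expectation over actions with a max gives the optimality operator used by value iteration.

```python
def bellman_operator(V, policy, dynamics, states, gamma=0.99):
    """Apply T^pi once: every state receives one full / expected backup."""
    return {
        s: sum(
            pi_a * prob * (r + gamma * V[s_next])
            for a, pi_a in policy[s].items()
            for prob, s_next, r in dynamics(s, a)
        )
        for s in states
    }
```
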
Typology of Updates

| Approach | Description |
| --- | --- |
| [[Dynamic Programming for Reinforcement Learning#policy-evaluation\|Policy Evaluation]] | Expected update on $v_\pi$ |
| [[Dynamic Programming for Reinforcement Learning#value-iteration\|Value Iteration]] | Expected update on $v_*$ |
| Q-Policy Evaluation | Expected update on $q_\pi$ |
| Q-Value Iteration | Expected update on $q_*$ |
| TD(0) | Sample update on $v_\pi$ |
| ??? | Sample update on $v_*$ |
| [[Temporal Difference Learning#sarsa\|SARSA]] | Sample update on $q_\pi$ |
| [[Temporal Difference Learning#q-learning\|Q-learning]] | Sample update on $q_*$ |
  • When performing an expected update on a stochastic problem, it is recommended to use Prioritized Sweeping.
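
The sample-update rows of the table above correspond to one-line update rules. A minimal sketch (not from the source notes), assuming tabular `V` / `Q` arrays and a hypothetical step size `alpha`:

```python
import numpy as np

def td0_update(V, s, r, s_next, gamma=0.99, alpha=0.1):
    """Sample update on v_pi (TD(0)): bootstrap from the sampled successor state."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.99, alpha=0.1):
    """Sample update on q_pi (SARSA): bootstrap from the action actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """Sample update on q_* (Q-learning): bootstrap from the greedy next action."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```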

Average Reward Bellman Equations

  • For the following, assume we are dealing with the average reward context.

    $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r - r(\pi) + v_\pi(s') \right]$$

    $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r - \max_\pi r(\pi) + v_*(s') \right]$$

    Notice: we just replaced all $r$ with $r - r(\pi)$, where $r(\pi)$ is the average reward under $\pi$. For the optimality equations (as for these), we also removed all discounting.
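
In the sample-backup case, this corresponds to the differential TD(0) update, where a running estimate of the average reward replaces discounting. A minimal sketch (not from the source notes), with hypothetical step sizes `alpha` (values) and `beta` (average-reward estimate):

```python
def differential_td0_update(V, r_bar, s, r, s_next, alpha=0.1, beta=0.01):
    """One average-reward sample backup: no discount factor, and the reward is
    replaced by its difference from the current average-reward estimate r_bar."""
    delta = (r - r_bar) + V[s_next] - V[s]   # differential TD error
    V[s] += alpha * delta                    # update the value estimate
    r_bar += beta * delta                    # update the average-reward estimate
    return r_bar
```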

Hamilton-Jacobi-Bellman Equations

  • The Hamilton-Jacobi-Bellman (HJB) equations can be viewed as a generalization of the Bellman equations to the continuous case

    $$\rho V(x) = \max_u \left\{ r(x, u) + \nabla V(x) \cdot f(x, u) \right\}$$

Where $r(x, u)$ is the instantaneous reward function given state $x$ and control vector $u$, and $\rho$ is the continuous-time discount rate.

And $\dot{x} = f(x, u)$ is the law of motion for the state (analogous to the environment’s dynamics) 1
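
To see the connection to the discrete Bellman equations, here is a short derivation sketch (not from the source notes), assuming the infinite-horizon discounted form with discount rate $\rho$. Over a small time step $\Delta t$, the discrete Bellman optimality equation reads

$$V(x) \approx \max_u \left\{ r(x, u)\, \Delta t + e^{-\rho \Delta t}\, V\big(x + f(x, u)\, \Delta t\big) \right\}$$

Expanding $e^{-\rho \Delta t} \approx 1 - \rho \Delta t$ and $V(x + f \Delta t) \approx V(x) + \nabla V(x) \cdot f(x, u)\, \Delta t$, subtracting $V(x)$ from both sides, dividing by $\Delta t$, and letting $\Delta t \to 0$ recovers the HJB equation above.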

Links

  • Sutton and Barto
    • Ch. 3 - introduction of Bellman backups
    • Ch. 8.5 - Types of Backups

Footnotes

  1. TODO: Move this to the appropriate location