All Actions Update

  • Update the preferences for all actions at every step using the following rule:

    $\theta_{t+1} = \theta_t + \alpha \sum_a \hat{q}(S_t, a, \mathbf{w}) \, \nabla \pi(a \mid S_t, \theta_t)$
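    As a rough illustration, here is a minimal sketch of this all-actions update for a tabular softmax policy with a table of action-value estimates. The names (`softmax_policy`, `q_hat`, `alpha`) and the specific parameterization are assumptions for illustration, not part of the original notes.

    ```python
    import numpy as np

    def softmax_policy(theta, s):
        """pi(a | s, theta) for a tabular softmax parameterization of action preferences."""
        prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def all_actions_update(theta, q_hat, s, alpha=0.1):
        """One step of theta[s] += alpha * sum_a q_hat(s, a) * grad_theta pi(a | s, theta)."""
        pi = softmax_policy(theta, s)
        n_actions = theta.shape[1]
        grad_sum = np.zeros(n_actions)
        for a in range(n_actions):
            # Softmax gradient w.r.t. this state's preferences:
            # d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s))
            grad_pi_a = pi[a] * (np.eye(n_actions)[a] - pi)
            grad_sum += q_hat[s, a] * grad_pi_a
        theta[s] += alpha * grad_sum
        return theta
    ```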

REINFORCE: Monte Carlo Policy Gradient

  • Update, at time $t$, only the action $A_t$ that was actually taken:

    $\theta_{t+1} = \theta_t + \alpha G_t \dfrac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$

  • We obtain this by redefining the all-actions update as follows: we replace $\hat{q}(S_t, A_t, \mathbf{w})$ by the sampled return $G_t$, noting that its expectation $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$ is the expected return.
  • Dividing by $\pi(A_t \mid S_t, \theta_t)$ makes actions that are selected (and therefore updated) frequently receive smaller increments, balancing out their higher update frequency.
  • The vector $\nabla \ln \pi(A_t \mid S_t, \theta_t)$ is called the eligibility vector.
  • Being a Monte Carlo method, REINFORCE has high variance and learns slowly. However, it comes with convergence guarantees.
    • We can add a baseline $b(S_t)$, typically a learned state-value estimate $\hat{v}(S_t, \mathbf{w})$, to the regular REINFORCE update in order to reduce variance (see the sketch below):

      $\theta_{t+1} = \theta_t + \alpha \left(G_t - b(S_t)\right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$
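  A minimal sketch of one episode of REINFORCE with a baseline, assuming a tabular softmax policy and a tabular state-value baseline. The helper names and the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) are assumptions for illustration.

  ```python
  import numpy as np

  def softmax_policy(theta, s):
      prefs = theta[s] - theta[s].max()
      p = np.exp(prefs)
      return p / p.sum()

  def reinforce_with_baseline_episode(env, theta, v, alpha_theta=1e-2, alpha_v=1e-1, gamma=0.99):
      """Run one episode, then do Monte Carlo policy-gradient updates with a value baseline.
      theta: (n_states, n_actions) action preferences; v: (n_states,) baseline values."""
      states, actions, rewards = [], [], []
      s, done = env.reset(), False
      while not done:
          pi = softmax_policy(theta, s)
          a = np.random.choice(len(pi), p=pi)
          s_next, r, done = env.step(a)
          states.append(s); actions.append(a); rewards.append(r)
          s = s_next

      # Work backwards to compute returns G_t, then update baseline and policy.
      G = 0.0
      for t in reversed(range(len(states))):
          G = rewards[t] + gamma * G
          s_t, a_t = states[t], actions[t]
          delta = G - v[s_t]                  # baseline-corrected return
          v[s_t] += alpha_v * delta           # move baseline toward the observed return
          pi = softmax_policy(theta, s_t)
          # Eligibility vector grad ln pi(a_t | s_t, theta) for a softmax parameterization.
          grad_ln_pi = -pi
          grad_ln_pi[a_t] += 1.0
          theta[s_t] += alpha_theta * (gamma ** t) * delta * grad_ln_pi
      return theta, v
  ```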

Actor-Critic Methods

  • We effectively learn two things:

    • An actor, corresponding to the parameterized policy $\pi(a \mid s, \theta)$
    • A critic, corresponding to the learned value function $\hat{v}(s, \mathbf{w})$
  • These methods act as analogues of Temporal Difference Learning and N-step Bootstrapping, but for Policy Gradient Methods.

  • Rather than have an estimate that relies solely on the full return (as in Monte Carlo approaches such as REINFORCE), we instead use an $n$-step estimate coupled with a learned baseline:

    $\theta_{t+1} = \theta_t + \alpha \left(G - \hat{v}(S_t, \mathbf{w})\right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$

    Where $G$ is a generalization of our return.

    • It could be the one-step return from regular TD, $G_{t:t+1} = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$, the $n$-step return $G_{t:t+n}$, or even a $\lambda$-return $G_t^\lambda$ computed with eligibility traces (see the sketch below).
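  A minimal sketch of one-step actor-critic with a TD(0) critic and a softmax actor, again using tabular parameterizations. The function names and environment interface are assumptions for illustration.

  ```python
  import numpy as np

  def softmax_policy(theta, s):
      prefs = theta[s] - theta[s].max()
      p = np.exp(prefs)
      return p / p.sum()

  def one_step_actor_critic_episode(env, theta, w, alpha_theta=1e-2, alpha_w=1e-1, gamma=0.99):
      """One episode of one-step actor-critic: the critic w is a tabular state-value
      estimate updated by TD(0); the actor theta is updated along the eligibility vector
      scaled by the TD error."""
      s, done = env.reset(), False
      I = 1.0                                  # discount factor applied to the policy step
      while not done:
          pi = softmax_policy(theta, s)
          a = np.random.choice(len(pi), p=pi)
          s_next, r, done = env.step(a)

          # Critic: one-step TD error (the one-step return minus the current estimate).
          target = r if done else r + gamma * w[s_next]
          delta = target - w[s]
          w[s] += alpha_w * delta

          # Actor: grad ln pi(a | s, theta) for the softmax parameterization.
          grad_ln_pi = -pi
          grad_ln_pi[a] += 1.0
          theta[s] += alpha_theta * I * delta * grad_ln_pi

          I *= gamma
          s = s_next
      return theta, w
  ```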
