All Actions Update

  • Update the preferences for all actions at every step using the following rule:

    $\theta_{t+1} = \theta_t + \alpha \sum_a \hat{q}(S_t, a, \mathbf{w}) \, \nabla \pi(a \mid S_t, \theta_t)$
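    As a rough illustration, here is a minimal sketch of this all-actions update for a tabular softmax policy with a table of action-value estimates. The names (`softmax_policy`, `q_hat`, `alpha`) and the specific parameterization are assumptions for illustration, not part of the original notes.

    ```python
    import numpy as np

    def softmax_policy(theta, s):
        """pi(a | s, theta) for a tabular softmax parameterization of action preferences."""
        prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    def all_actions_update(theta, q_hat, s, alpha=0.1):
        """One step of theta[s] += alpha * sum_a q_hat(s, a) * grad_theta pi(a | s, theta)."""
        pi = softmax_policy(theta, s)
        n_actions = theta.shape[1]
        grad_sum = np.zeros(n_actions)
        for a in range(n_actions):
            # Softmax gradient w.r.t. this state's preferences:
            # d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s))
            grad_pi_a = pi[a] * (np.eye(n_actions)[a] - pi)
            grad_sum += q_hat[s, a] * grad_pi_a
        theta[s] += alpha * grad_sum
        return theta
    ```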

REINFORCE: Monte Carlo Policy Gradient

  • Update, at time $t$, only the action $A_t$ that was actually taken:

    $\theta_{t+1} = \theta_t + \alpha G_t \dfrac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)} = \theta_t + \alpha G_t \nabla \ln \pi(A_t \mid S_t, \theta_t)$

  • We obtain this by redefining the all-actions update as follows: we replace $\hat{q}(S_t, A_t, \mathbf{w})$ by the sampled return $G_t$, noting that its expectation $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$ is the expected return.
  • Dividing by $\pi(A_t \mid S_t, \theta_t)$ makes actions that are selected (and therefore updated) frequently receive smaller increments, balancing out their higher update frequency.
  • The vector $\nabla \ln \pi(A_t \mid S_t, \theta_t)$ is called the eligibility vector.
  • Being a Monte Carlo method, REINFORCE has high variance and learns slowly. However, it comes with convergence guarantees.
    • We can add a baseline $b(S_t)$, typically a learned state-value estimate $\hat{v}(S_t, \mathbf{w})$, to the regular REINFORCE update in order to reduce variance (see the sketch below):

      $\theta_{t+1} = \theta_t + \alpha \left(G_t - b(S_t)\right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$
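  A minimal sketch of one episode of REINFORCE with a baseline, assuming a tabular softmax policy and a tabular state-value baseline. The helper names and the environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) are assumptions for illustration.

  ```python
  import numpy as np

  def softmax_policy(theta, s):
      prefs = theta[s] - theta[s].max()
      p = np.exp(prefs)
      return p / p.sum()

  def reinforce_with_baseline_episode(env, theta, v, alpha_theta=1e-2, alpha_v=1e-1, gamma=0.99):
      """Run one episode, then do Monte Carlo policy-gradient updates with a value baseline.
      theta: (n_states, n_actions) action preferences; v: (n_states,) baseline values."""
      states, actions, rewards = [], [], []
      s, done = env.reset(), False
      while not done:
          pi = softmax_policy(theta, s)
          a = np.random.choice(len(pi), p=pi)
          s_next, r, done = env.step(a)
          states.append(s); actions.append(a); rewards.append(r)
          s = s_next

      # Work backwards to compute returns G_t, then update baseline and policy.
      G = 0.0
      for t in reversed(range(len(states))):
          G = rewards[t] + gamma * G
          s_t, a_t = states[t], actions[t]
          delta = G - v[s_t]                  # baseline-corrected return
          v[s_t] += alpha_v * delta           # move baseline toward the observed return
          pi = softmax_policy(theta, s_t)
          # Eligibility vector grad ln pi(a_t | s_t, theta) for a softmax parameterization.
          grad_ln_pi = -pi
          grad_ln_pi[a_t] += 1.0
          theta[s_t] += alpha_theta * (gamma ** t) * delta * grad_ln_pi
      return theta, v
  ```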

Actor-Critic Methods

  • We effectively learn two things:

    • An actor, corresponding to the parameterized policy $\pi(a \mid s, \theta)$
    • A critic, corresponding to the learned value function $\hat{v}(s, \mathbf{w})$
  • These methods act as analogues of Temporal Difference Learning and N-step Bootstrapping, but for Policy Gradient Methods.

  • Rather than have an estimate that relies solely on the full return (as in Monte Carlo approaches such as REINFORCE), we instead use an $n$-step estimate coupled with a learned baseline:

    $\theta_{t+1} = \theta_t + \alpha \left(G - \hat{v}(S_t, \mathbf{w})\right) \nabla \ln \pi(A_t \mid S_t, \theta_t)$

    Where $G$ is a generalization of our return.

    • It could be the one-step return from regular TD, $G_{t:t+1} = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$, the $n$-step return $G_{t:t+n}$, or even a $\lambda$-return $G_t^\lambda$ computed with eligibility traces (see the sketch below).
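  A minimal sketch of one-step actor-critic with a TD(0) critic and a softmax actor, again using tabular parameterizations. The function names and environment interface are assumptions for illustration.

  ```python
  import numpy as np

  def softmax_policy(theta, s):
      prefs = theta[s] - theta[s].max()
      p = np.exp(prefs)
      return p / p.sum()

  def one_step_actor_critic_episode(env, theta, w, alpha_theta=1e-2, alpha_w=1e-1, gamma=0.99):
      """One episode of one-step actor-critic: the critic w is a tabular state-value
      estimate updated by TD(0); the actor theta is updated along the eligibility vector
      scaled by the TD error."""
      s, done = env.reset(), False
      I = 1.0                                  # discount factor applied to the policy step
      while not done:
          pi = softmax_policy(theta, s)
          a = np.random.choice(len(pi), p=pi)
          s_next, r, done = env.step(a)

          # Critic: one-step TD error (the one-step return minus the current estimate).
          target = r if done else r + gamma * w[s_next]
          delta = target - w[s]
          w[s] += alpha_w * delta

          # Actor: grad ln pi(a | s, theta) for the softmax parameterization.
          grad_ln_pi = -pi
          grad_ln_pi[a] += 1.0
          theta[s] += alpha_theta * I * delta * grad_ln_pi

          I *= gamma
          s = s_next
      return theta, w
  ```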
