All Actions Update
- Update using the following rule, which considers all actions rather than only the one actually taken (see the sketch below):
  $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \sum_a \hat{q}(S_t, a, \mathbf{w}) \, \nabla \pi(a \mid S_t, \boldsymbol{\theta})$.
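As a concrete illustration (not part of the original notes), here is a minimal sketch of the all-actions update for a softmax policy over linear state features. The names `softmax_probs`, `all_actions_update`, and `q_hat` are illustrative assumptions, not references to any particular library.

```python
import numpy as np

def softmax_probs(theta, x):
    """pi(. | s, theta) for a linear softmax policy; theta has shape (n_actions, n_features)."""
    prefs = theta @ x                      # action preferences h(s, a, theta)
    prefs = prefs - prefs.max()            # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def all_actions_update(theta, x, q_hat, alpha):
    """theta <- theta + alpha * sum_a q_hat(s, a) * grad pi(a | s, theta)."""
    n_actions = theta.shape[0]
    pi = softmax_probs(theta, x)
    total = np.zeros_like(theta)
    for a in range(n_actions):
        # For a softmax policy: d pi(a) / d theta[b, :] = pi(a) * (1{a==b} - pi(b)) * x
        grad_pi_a = pi[a] * (np.eye(n_actions)[a] - pi)[:, None] * x[None, :]
        total += q_hat[a] * grad_pi_a
    return theta + alpha * total
```

For instance, starting from all-zero parameters (a uniform policy) with estimated action values `q_hat = [1, 0, 0]`, a single call shifts probability toward the first action.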
REINFORCE: Monte Carlo Policy Gradient
- Update, at time $t$, only the action $A_t$ that was actually taken.
- We redefine the update as follows:
  $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \, G_t \, \dfrac{\nabla \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)}$.
  We replace $\hat{q}(S_t, A_t, \mathbf{w})$ by the sample $G_t$ and note that $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$ is the expected return.
- Dividing by $\pi(A_t \mid S_t, \boldsymbol{\theta}_t)$ makes it so that actions that are selected (and therefore updated) frequently receive proportionally smaller increments, balancing out their selection frequency.
- The vector $\nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t) = \dfrac{\nabla \pi(A_t \mid S_t, \boldsymbol{\theta}_t)}{\pi(A_t \mid S_t, \boldsymbol{\theta}_t)}$ is called the eligibility vector.
- Being a Monte Carlo method, REINFORCE has high variance and learns slowly. However, it has convergence guarantees.
- We can add a baseline $b(S_t)$ (commonly a learned state-value estimate $\hat{v}(S_t, \mathbf{w})$) to the regular REINFORCE algorithm in order to reduce variance: $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \, \big(G_t - b(S_t)\big) \, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t)$. A sketch of the resulting algorithm appears below.
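Below is a minimal sketch of REINFORCE with a learned baseline, under the same illustrative assumptions as the earlier sketch: a linear softmax policy (reusing `softmax_probs` from above) and a linear state-value baseline $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$. The episode format and function names are assumptions, and the per-step discount factor $\gamma^t$ on the policy update is omitted, as is common in practice.

```python
import numpy as np

def grad_ln_pi(theta, x, a):
    """Eligibility vector: grad of ln pi(a | s, theta) for the linear softmax policy."""
    pi = softmax_probs(theta, x)            # helper from the previous sketch
    g = -pi[:, None] * x[None, :]           # row b holds -pi(b) * x
    g[a] += x                               # +x for the action actually taken
    return g

def reinforce_with_baseline(theta, w, episode, alpha_theta, alpha_w, gamma):
    """One update of the policy (theta) and baseline (w) from a completed episode.

    episode: list of (x, a, r) tuples -- state features, action index, reward received.
    """
    G = 0.0
    # Walk the episode backwards so the return G_t is accumulated incrementally.
    for x, a, r in reversed(episode):
        G = r + gamma * G                   # Monte Carlo return G_t
        delta = G - w @ x                   # G_t minus the baseline v_hat(S_t, w)
        w = w + alpha_w * delta * x         # update the baseline parameters
        theta = theta + alpha_theta * delta * grad_ln_pi(theta, x, a)
    return theta, w
```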
Actor-Critic Methods
- We effectively learn two things: a parameterized policy $\pi(a \mid s, \boldsymbol{\theta})$ (the actor) and a value function $\hat{v}(s, \mathbf{w})$ (the critic).
- Actor-critic methods act as analogues to Temporal Difference Learning and n-step Bootstrapping, but for Policy Gradient Methods.
- Rather than an estimate that relies solely on the full return $G_t$ (as in Monte Carlo approaches such as REINFORCE), we instead use an $n$-step estimate coupled with a learned baseline:
  $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \, \big(G_{t:t+n} - \hat{v}(S_t, \mathbf{w})\big) \, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t)$, where $G_{t:t+n}$ is a generalization of our return.
- It could be the one-step return from regular TD, $G_{t:t+1} = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$, the $n$-step return $G_{t:t+n}$, or even the $\lambda$-return computed with an eligibility trace. A one-step actor-critic sketch follows this list.
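For comparison with REINFORCE, here is a minimal sketch of one-step (TD(0)) actor-critic under the same illustrative assumptions: a linear softmax actor (reusing `grad_ln_pi` from the REINFORCE sketch) and a linear critic $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$. Both parameter vectors are updated from a single transition using the one-step TD error.

```python
import numpy as np

def one_step_actor_critic(theta, w, x, a, r, x_next, done,
                          alpha_theta, alpha_w, gamma):
    """Update actor (theta) and critic (w) from one transition (S_t, A_t, R_{t+1}, S_{t+1})."""
    v = w @ x                                # critic estimate v_hat(S_t, w)
    v_next = 0.0 if done else w @ x_next     # v_hat(S_{t+1}, w); zero at terminal states
    delta = r + gamma * v_next - v           # one-step TD error (target minus baseline)
    w = w + alpha_w * delta * x              # critic update: semi-gradient TD(0)
    theta = theta + alpha_theta * delta * grad_ln_pi(theta, x, a)   # actor update
    return theta, w, delta
```

Swapping the one-step TD target for an n-step or λ-return target gives the n-step and eligibility-trace variants mentioned above.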
Topics
- Basic Policy Gradient Methods
- Deterministic Policy Gradient
- Trust Region Policies
- Energy-Based Reinforcement Learning Models
- Distributional Reinforcement Learning
- D4PG