TRPO

  • Trust Region Policy Optimization 1. It aims to optimize large non-linear policies (i.e., those parameterized by neural networks) using a surrogate loss function. In particular, the surrogate is built around the KL divergence between the old and new policies.

  • For convenience, we let $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)$ denote the local approximation to the expected return $\eta(\tilde{\pi})$, and $D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)$.

  • The notes here are helpful as background for why we have TRPO. The theoretical bound relating the old policy $\pi$ and the new policy $\tilde{\pi}$ is then $\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi})$, where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ and $\epsilon = \max_{s,a} |A_{\pi}(s, a)|$.

  • This gives the following algorithm for policy improvement.

Theoretical TRPO algorithm. Image from Schulman et al. (2015). TRPO itself is an approximation of this procedure.
  • TRPO reframes the objective as maximizing $L_{\theta_{\text{old}}}(\theta) - C\, D_{\mathrm{KL}}^{\max}(\theta_{\text{old}}, \theta)$ (see above). However, we work with a sample-based approximation of $L_{\theta_{\text{old}}}$. More specifically, we can use the Importance Sampling ratio $\frac{\pi_\theta(a \mid s)}{q(a \mid s)}$ (with sampling distribution $q$) and replace the advantage $A_{\theta_{\text{old}}}$ with the Q-value $Q_{\theta_{\text{old}}}$, which changes the objective only by a constant. That is, the objective becomes $\max_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\!\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a)\right]$.

  • In practice, we do not use the penalty coefficient $C$ because this gives very small step sizes. Instead we maximize $L_{\theta_{\text{old}}}(\theta)$ subject to the constraint $\bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \leq \delta$ for a threshold $\delta$, where $\bar{D}_{\mathrm{KL}}$ is the average (rather than maximum) KL divergence over states.

    This is the trust region constraint

  • The trust region constraint keeps the old and new policies from diverging too much while preserving the monotonic improvement guarantee. 2 A small sketch of the surrogate objective and the KL constraint appears at the end of this section.

  • The Q-values $Q_{\theta_{\text{old}}}$ can be replaced with empirical estimates, using one of two sampling schemes.

    • Single Path - estimates of $Q_{\theta_{\text{old}}}(s, a)$ come from individual trajectories generated by rolling out $\pi_{\theta_{\text{old}}}$.
    • Vine - given a single trajectory, sample a subset of states along it, and from each sampled state, sample several actions. From each sampled state-action pair, perform a rollout and estimate $Q_{\theta_{\text{old}}}(s, a)$ from these rollout trajectories.
    • Vine gives a better, lower-variance estimate at the cost of requiring more simulator calls; it is also infeasible for systems where states cannot be restored, i.e., where “undoing” a transition is not possible.
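
To make the surrogate objective and trust-region constraint concrete, here is a minimal NumPy sketch for a discrete, tabular policy. The function names, array shapes, and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# pi_old, pi_new: (num_states, num_actions) action probabilities of the old/new policy.
# advantages:     (num_states, num_actions) estimates of A_{theta_old}(s, a).
# state_dist:     (num_states,) visitation distribution rho_{theta_old}.

def surrogate_objective(pi_new, pi_old, advantages, state_dist):
    """Importance-sampled surrogate L_{theta_old}(theta)."""
    ratio = pi_new / pi_old                                   # pi_theta / pi_theta_old
    per_state = np.sum(pi_old * ratio * advantages, axis=1)   # E_{a ~ pi_old}[ratio * A]
    return float(np.dot(state_dist, per_state))

def mean_kl(pi_old, pi_new, state_dist):
    """Average KL divergence D_KL(pi_old || pi_new) over the state distribution."""
    kl_per_state = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)
    return float(np.dot(state_dist, kl_per_state))

# Toy example: accept a candidate update only if it stays inside the trust region.
pi_old = np.array([[0.5, 0.5], [0.8, 0.2]])
pi_new = np.array([[0.6, 0.4], [0.7, 0.3]])
advantages = np.array([[1.0, -1.0], [0.5, -0.5]])
state_dist = np.array([0.5, 0.5])
delta = 0.01

if mean_kl(pi_old, pi_new, state_dist) <= delta:
    print("accept update, surrogate =", surrogate_objective(pi_new, pi_old, advantages, state_dist))
else:
    print("reject update: KL constraint violated")
```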

PPO

  • Proximal Policy Optimization 3. PPO methods alternate between sampling data through interaction with the environment and optimizing a “surrogate” objective function with minibatch SGD.
  • The goal is simplicity and scalability: TRPO is relatively complicated and is not compatible with architectures that include noise (such as dropout) or parameter sharing.

PPO-CLIP

  • It introduces the clipped surrogate objective. Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$.

    The surrogate objective becomes $L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$.

    The first term inside the min is the same surrogate objective as TRPO. The second term modifies it by clipping the probability ratio, removing the incentive for moving $r_t$ outside the interval $[1-\epsilon, 1+\epsilon]$.

    Taking the min makes the clipped objective a lower bound (i.e., a pessimistic bound) on the unclipped objective.

    Thus, we ignore the change in the probability ratio when it would make the objective better, and include it when it makes the objective worse.
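
A minimal NumPy sketch of this clipped objective (the function name and toy batch are illustrative assumptions; in practice this would be computed on batches of log-probabilities and advantage estimates):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """L^CLIP = E_t[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ]."""
    ratio = np.exp(logp_new - logp_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))  # maximize this quantity

# Toy batch: the second sample's ratio (~1.65) gets clipped to 1 + eps = 1.2.
logp_old = np.log(np.array([0.20, 0.40, 0.70]))
logp_new = np.log(np.array([0.25, 0.66, 0.60]))
advantages = np.array([1.0, 2.0, -1.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```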

PPO-KL

  • An alternative is KL penalization: we use a penalty on the KL divergence and adapt the penalty coefficient $\beta$ based on a target KL divergence $d_{\text{targ}}$. This can be done by alternating between the following steps (a small sketch of the adaptation rule appears after this list).
    • Use minibatch SGD to optimize $L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\hat{A}_t - \beta\, D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right]$

    • Compute $d = \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right]$

      • If $d < d_{\text{targ}} / 1.5$, then $\beta \leftarrow \beta / 2$
      • If $d > d_{\text{targ}} \times 1.5$, then $\beta \leftarrow \beta \times 2$.
    • We use the updated $\beta$ for the next policy update. The constants 1.5 and 2 above are heuristic, but the algorithm is not very sensitive to them.

    • Note that we can alternatively use the reverse (backward) KL, $D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_{\theta_{\text{old}}}(\cdot \mid s_t)\big)$. In practice this makes no real difference.

    • Techniques presented here may be combined with the loss functions above, for example introducing an entropy bonus $S[\pi_\theta](s_t)$ as an additional term, or using Experience replay.

    • Adding the entropy bonus (together with a value-function error term $L_t^{\mathrm{VF}}(\theta) = (V_\theta(s_t) - V_t^{\mathrm{targ}})^2$ when the policy and value function share parameters) gives the combined loss $L_t^{\mathrm{CLIP+VF+S}}(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{\mathrm{CLIP}}(\theta) - c_1 L_t^{\mathrm{VF}}(\theta) + c_2 S[\pi_\theta](s_t)\right]$ for coefficients $c_1, c_2$.
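
The adaptation rule for $\beta$ referenced above, as a minimal sketch (the function name is an illustrative assumption; the thresholds follow the rule described in the list):

```python
def update_kl_coefficient(d, beta, d_targ):
    """Adapt the KL penalty coefficient beta toward the target KL divergence d_targ."""
    if d < d_targ / 1.5:
        beta = beta / 2.0    # policy changed too little: weaken the penalty
    elif d > d_targ * 1.5:
        beta = beta * 2.0    # policy changed too much: strengthen the penalty
    return beta

# Example: a measured KL well above target doubles the penalty coefficient.
print(update_kl_coefficient(d=0.05, beta=1.0, d_targ=0.01))  # -> 2.0
```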

PPO algorithm. Image from Schulman et al. (2017).

Failure Modes

  • Hsu et al. 4 revisited PPO to examine two common design choices:

    • The clipped probability ratio
    • Parameterizing the policy by a continuous Gaussian distribution or a discrete softmax over the action space.
  • The following failure modes occur in standard PPO:

    1. Reward signals with bounded support. This is problematic because the reward can drive the policy into regions of low reward from which it cannot recover.
      • This is common when we have unknown action boundaries or sparse rewards.
    2. High-dimensional discrete action spaces. PPO can get stuck at suboptimal actions.
    3. Locally optimal actions close to initialization. PPO is sensitive to initialization and tends to converge to locally optimal actions near where it was initialized.
  • The following are the proposed solutions

    • Use KL regularization instead of the usual clipped loss. That is, we maximize $\hat{\mathbb{E}}_t\!\left[r_t(\theta)\hat{A}_t\right] - \beta\, \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right]$, the KL-penalized objective from above. This solves failure modes 1 and 2.
      • The KL penalty gives large steps a large opposing gradient, so we do not immediately leave the trust region.
    • Instead of a Gaussian, use a Beta distribution as the parameterization for continuous action spaces. This solves failure modes 1 and 3 (see the sketch after this list).
      • Compared to Gaussian policies, where actions in the tails are given disproportionate importance, Beta policies have support matching the bounded action range, so all actions are weighted more evenly.
      • Additionally, it is easy to encode a uniform prior with a Beta distribution by setting $\alpha = \beta = 1$.
      • It eliminates the bias towards the boundaries that a truncated Gaussian exhibits on bounded action spaces.
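
A minimal sketch of a Beta-parameterized action for a bounded one-dimensional action space, as referenced above (the rescaling scheme and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_beta_action(alpha, beta, low, high):
    """Sample x ~ Beta(alpha, beta) on [0, 1] and rescale it to the action range [low, high]."""
    x = rng.beta(alpha, beta)
    return low + (high - low) * x

# alpha = beta = 1 recovers a uniform prior over the action range;
# larger alpha, beta concentrate probability away from the boundaries.
print(sample_beta_action(alpha=1.0, beta=1.0, low=-2.0, high=2.0))
print(sample_beta_action(alpha=5.0, beta=5.0, low=-2.0, high=2.0))
```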

PPG

  • Phasic Policy Gradient 5 extends PPO to have two separate training phases for the policy and value functions. This leads to significant improvements in sample efficiency.

  • It provides an alternative to sharing network parameters between the policy and value function, which has the following disadvantages:

    • It is hard to balance the two objectives so that they do not interfere with each other. This interference can hurt performance.
    • It enforces a hard restriction that the policy and value function objectives are trained on the same data with the same sample reuse, even though value functions tend to tolerate a higher level of sample reuse.
  • In PPG, we have disjoint policy and value networks.

    • The policy network has the policy $\pi_\theta$ and an auxiliary value head $V_{\theta_\pi}$.
    • The value network has a value head $V_{\theta_V}$.
  • PPG introduces two training phases. Learning proceeds by alternating between them.

    • Policy Phase - train the agent with PPO, using $L^{\mathrm{CLIP}}$ for the policy and $L^{\mathrm{value}} = \hat{\mathbb{E}}_t\!\left[\tfrac{1}{2}\big(V_{\theta_V}(s_t) - \hat{V}_t^{\mathrm{targ}}\big)^2\right]$ for the value network.

    • Auxiliary Phase - distill features from the value function into the policy network to aid future policy phases. We use the following joint loss $L^{\mathrm{joint}} = L^{\mathrm{aux}} + \beta_{\mathrm{clone}} \cdot \hat{\mathbb{E}}_t\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\right]$

      where $\pi_{\theta_{\text{old}}}$ is the policy right before the auxiliary phase begins, $\beta_{\mathrm{clone}}$ is a hyperparameter controlling the tradeoff between the auxiliary objective and preserving the original policy, and $L^{\mathrm{aux}}$ is the auxiliary objective, which can be anything. The paper uses $L^{\mathrm{aux}} = \tfrac{1}{2}\hat{\mathbb{E}}_t\!\left[\big(V_{\theta_\pi}(s_t) - \hat{V}_t^{\mathrm{targ}}\big)^2\right]$ (a small sketch of the joint loss appears after the figure notes below).

PPG. Image taken from Cobbe et al. (2020)
  • In the figure, $N_\pi$ controls the number of policy updates performed in each policy phase, $E_\pi$ controls sample reuse for the policy, $E_V$ controls sample reuse for the true value function, and $E_{\mathrm{aux}}$ controls sample reuse during the auxiliary phase. We increase $E_{\mathrm{aux}}$ to increase sample reuse for the value function.
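
A minimal sketch of the auxiliary-phase joint loss for a discrete policy, as referenced above (array shapes, function names, and the toy batch are illustrative assumptions, not the paper's code):

```python
import numpy as np

def aux_phase_joint_loss(v_aux, v_targ, pi_old, pi_new, beta_clone=1.0):
    """L_joint = L_aux + beta_clone * E_t[ KL(pi_old || pi_new) ]."""
    l_aux = 0.5 * np.mean((v_aux - v_targ) ** 2)                     # auxiliary value regression
    kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))   # behavioral cloning term
    return l_aux + beta_clone * kl

# Toy batch of 2 states with 3 actions each.
v_aux = np.array([0.9, 0.2])                   # auxiliary value head predictions
v_targ = np.array([1.0, 0.0])                  # value targets
pi_old = np.array([[0.20, 0.50, 0.30], [0.60, 0.30, 0.10]])
pi_new = np.array([[0.25, 0.45, 0.30], [0.55, 0.35, 0.10]])
print(aux_phase_joint_loss(v_aux, v_targ, pi_old, pi_new))
```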

ACKTR

  • Actor Critic using Kronecker-factored trust region 6

  • It builds upon TRPO. It proposes using Kronecker-factored approximate curvature (K-FAC) to perform the gradient updates for both the actor and the critic.

  • We make use of natural gradients rather than ordinary gradients. This means that step sizes are measured by the KL divergence from the current policy rather than by distance in parameter space.

  • K-FAC approximates the curvature used for the natural gradient (the Fisher information matrix) block-wise as Kronecker products of smaller matrices, which makes the required matrix inversions cheap.
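
To see why this helps, here is a small NumPy illustration (not the ACKTR code) of the Kronecker-product identity that makes inverting the approximate curvature cheap:

```python
import numpy as np

rng = np.random.default_rng(0)

# If a curvature block factors as F = A kron S, then F^{-1} = A^{-1} kron S^{-1},
# so we only ever invert the small factors instead of the large block.
A = rng.standard_normal((3, 3)); A = A @ A.T + 3 * np.eye(3)   # small positive-definite factor
S = rng.standard_normal((4, 4)); S = S @ S.T + 4 * np.eye(4)   # small positive-definite factor

F = np.kron(A, S)                                              # 12 x 12 block
F_inv_direct = np.linalg.inv(F)                                # invert the big matrix
F_inv_kron = np.kron(np.linalg.inv(A), np.linalg.inv(S))       # invert only the factors

print(np.allclose(F_inv_direct, F_inv_kron))                   # True
```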

Footnotes

  1. Schulman et al. (2015) Trust Region Policy Optimization

  2. The KL divergence makes this intuitive since it measures how much two probability distributions differ, and recall that policies are (conditional) probability distributions.

  3. Schulman et al. (2017) Proximal Policy Optimization Algorithms

  4. Hsu, Mendler-Dünner, and Hardt (2020) Revisiting Design Choices in Proximal Policy Optimization

  5. Cobbe et al. (2020) Phasic Policy Gradient

  6. Wu et al. (2017) Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation