Asynchronous Learning

  • The aim is to speed up the training of DQN-style agents using asynchronous methods, that is, by running learning in parallel. 1
  • Instead of using a replay memory, we rely on different threads running different policies: multiple actors perform exploration in parallel (via $\epsilon$-greedy policies, where $\epsilon$ is sampled from some distribution).
    • This reduces training time by exploiting parallelism.
    • This also allows us to use on-policy methods, since the parallel actors decorrelate the data and keep training stable; a toy sketch of this setup is given below the figure.

Asynchronous Deep Learning. Taken from Mnih et al. (2016)
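The snippet below is a toy, hypothetical sketch of this parallel-actor idea, not the paper's neural-network setup: tabular one-step Q-learning on a made-up chain MDP, with several threads sharing one Q-table, no replay memory, and a per-thread exploration rate $\epsilon$ sampled from a small set.

```python
# Toy sketch: asynchronous one-step Q-learning on a chain MDP (illustrative only).
import threading
import random

N_STATES, N_ACTIONS, GOAL = 10, 2, 9
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # shared parameters
lock = threading.Lock()

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 moves left; reward 1 at the goal."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def greedy(s, rng):
    best = max(Q[s])
    return rng.choice([a for a in range(N_ACTIONS) if Q[s][a] == best])

def worker(seed, episodes=200, gamma=0.99, alpha=0.1):
    rng = random.Random(seed)
    eps = rng.choice([0.1, 0.3, 0.5])                   # exploration rate sampled per actor
    for _ in range(episodes):
        s, done = 0, False
        for _ in range(500):                            # step cap per episode
            a = rng.randrange(N_ACTIONS) if rng.random() < eps else greedy(s, rng)
            s2, r, done = step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2]))
            with lock:                                  # asynchronous update of shared parameters
                Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("greedy action per state:", [greedy(s, random.Random(0)) for s in range(N_STATES)])
```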

A3C

  • Asynchronous Advantage Actor-Critic

A3C. Taken from Algorithm S3 in Mnih et al. (2016)
  • Here, the critic learns the value function while multiple actors are trained in parallel, and their gradients are accumulated over several steps before being applied, for more stable and robust training.

  • In practice, we share some parameters between the value function and the policy function (typically all layers except the output heads).

  • Updates are performed as follows. Here we let $V_{\theta_v}$ and $\pi_\theta$ be the parameterized value function and policy respectively, and $\hat{A}(s_t, a_t)$ is the estimated advantage function:

    $$d\theta \leftarrow d\theta + \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}(s_t, a_t), \qquad d\theta_v \leftarrow d\theta_v + \nabla_{\theta_v}\big(R_t - V_{\theta_v}(s_t)\big)^2,$$

    where $R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{\theta_v}(s_{t+k})$ is the $k$-step return and $\hat{A}(s_t, a_t) = R_t - V_{\theta_v}(s_t)$.

  • We may also add the entropy of the policy as a regularization term, i.e., add $\beta\, \nabla_{\theta} H\big(\pi_{\theta}(\cdot \mid s_t)\big)$ to the policy gradient to discourage premature convergence to deterministic policies.
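As a rough illustration of these updates, the sketch below assumes a small PyTorch actor-critic with a shared body (the `ActorCritic` class and the stand-in rollout data are invented for this example): it computes the $k$-step returns, advantages, value loss, and entropy bonus, and calls `backward()` to obtain the gradients that, in A3C, a worker would accumulate and apply asynchronously to the shared (global) model.

```python
# Illustrative A3C-style accumulated update for one worker; rollout data is random stand-in.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Tiny shared-parameter actor-critic: a common body with policy and value heads."""
    def __init__(self, n_obs=4, n_act=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_obs, 32), nn.Tanh())
        self.pi = nn.Linear(32, n_act)   # policy head
        self.v = nn.Linear(32, 1)        # value head

    def forward(self, x):
        h = self.body(x)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

model = ActorCritic()
gamma, beta, k = 0.99, 0.01, 5

# Stand-in rollout: k observed states plus the bootstrap state, k actions, k rewards.
obs = torch.randn(k + 1, 4)
acts = torch.randint(0, 2, (k,))
rews = torch.rand(k)

dist, values = model(obs[:-1])
with torch.no_grad():
    _, v_boot = model(obs[-1:])          # bootstrap value V(s_{t+k})

# k-step returns R_t = r_t + gamma * R_{t+1}, seeded with the bootstrap value.
R = v_boot.squeeze(0)
returns = []
for t in reversed(range(k)):
    R = rews[t] + gamma * R
    returns.append(R)
returns = torch.stack(list(reversed(returns)))

adv = returns - values.detach()          # advantage estimate A_hat(s_t, a_t)
policy_loss = -(dist.log_prob(acts) * adv).sum()
value_loss = ((returns - values) ** 2).sum()
entropy = dist.entropy().sum()           # optional entropy regularizer
loss = policy_loss + 0.5 * value_loss - beta * entropy
loss.backward()                          # in A3C these gradients go to the shared model
```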

A2C - Advantage Actor-Critic

  • A synchronous version of A3C that resolves data inconsistency.
  • It introduces a coordinator that waits for all parallel actors to finish their work before updating global parameters. Actors then start from the same policy.
  • It can utilize the GPU more efficiently while achieving performance comparable to or better than A3C; a minimal sketch of the coordination pattern is given below the figure.

Image from Lilian Weng https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
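The sketch below illustrates only the synchronous coordination pattern, with a made-up `DummyEnv` and a placeholder `Policy` (no real learning happens): all actors step in lockstep with the same parameter version, the coordinator batches their transitions, and a single global update is applied before anyone continues.

```python
# Illustrative A2C coordination pattern: synchronous rollouts, one global update per iteration.
import random

class DummyEnv:
    """Stand-in environment: random observations and rewards."""
    def reset(self): return random.random()
    def step(self, action): return random.random(), random.random(), False

class Policy:
    def __init__(self): self.version = 0
    def act(self, obs): return 0                       # placeholder action
    def update(self, batch): self.version += 1         # one synchronous gradient step

envs = [DummyEnv() for _ in range(8)]                  # parallel actors
policy = Policy()
obs = [env.reset() for env in envs]

for iteration in range(10):
    batch = []
    for t in range(5):                                 # every actor uses the SAME policy version
        actions = [policy.act(o) for o in obs]
        results = [env.step(a) for env, a in zip(envs, actions)]
        batch.extend(zip(obs, actions, [r for _, r, _ in results]))
        obs = [o2 for o2, _, _ in results]
    policy.update(batch)                               # coordinator waits, then updates globally
print("policy updates applied:", policy.version)
```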

ACER

  • Actor-Critic with Experience Replay 2. It presents an actor-critic method with stable, sample-efficient experience replay. It is the off-policy counterpart to A3C.

  • It makes use of the action value computed with Retrace, denoted $Q^{\text{ret}}(x_t, a_t)$, as a target to train the critic by minimizing the squared error

    $$\big(Q^{\text{ret}}(x_t, a_t) - Q_{\theta_v}(x_t, a_t)\big)^2,$$

    where $Q^{\text{ret}}$ is computed recursively as $Q^{\text{ret}}(x_t, a_t) = r_t + \gamma\, \bar{\rho}_{t+1}\big[Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q_{\theta_v}(x_{t+1}, a_{t+1})\big] + \gamma V_{\theta_v}(x_{t+1})$, with truncated importance ratio $\bar{\rho}_t = \min\{c, \rho_t\}$.
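A small sketch of how this Retrace target can be computed backwards over a sampled trajectory, assuming per-step Python lists of rewards, $Q_{\theta_v}$ estimates, state values, and importance ratios (the function name and arguments are invented for this example):

```python
# Illustrative backward computation of the Retrace targets for the critic.
def retrace_targets(r, q, v, rho, v_last, done, gamma=0.99, c=1.0):
    """r, q, v, rho: per-step lists; v_last: V(x_k) at the end of the segment."""
    q_ret_next = 0.0 if done else v_last     # bootstrap: 0 at terminal, V(x_k) otherwise
    targets = [0.0] * len(r)
    for t in reversed(range(len(r))):
        q_ret = r[t] + gamma * q_ret_next    # Q^ret(x_t, a_t)
        targets[t] = q_ret
        # Propagate to the previous step: rho_bar_t * (Q^ret_t - Q_t) + V_t.
        q_ret_next = min(c, rho[t]) * (q_ret - q[t]) + v[t]
    return targets                            # critic minimizes (targets[t] - q[t])**2
```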

  • To reduce the high variance of the policy gradient, we truncate the importance weights at a constant $c$ and add a correction term. We get the gradient as

    $$\hat{g}^{\text{acer}}_t = \bar{\rho}_t\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\big[Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t)\big] + \mathbb{E}_{a \sim \pi}\!\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+} \nabla_\theta \log \pi_\theta(a \mid x_t)\big[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\big]\right]$$

    As a shorthand, $\rho_t = \frac{\pi_\theta(a_t \mid x_t)}{\mu(a_t \mid x_t)}$ denotes the usual importance sampling ratio under the behaviour policy $\mu$, $\bar{\rho}_t = \min\{c, \rho_t\}$ is its truncated version, and $\rho_t(a) = \frac{\pi_\theta(a \mid x_t)}{\mu(a \mid x_t)}$ denotes the ratio evaluated at action $a$.

    The first term clips the importance weight and subtracts the baseline $V_{\theta_v}(x_t)$ to reduce variance. The second term corrects the bias introduced by the clipping so that the overall estimate remains unbiased.
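As a concrete sketch of this truncated-with-bias-correction gradient for a discrete policy, the function below (names, arguments, and the default $c$ are illustrative) builds the scalar objective whose gradient with respect to the policy logits matches the expression above:

```python
# Illustrative ACER policy objective for a single state with a discrete action space.
import torch

def acer_policy_objective(logits, mu, q, q_ret, a_t, c=10.0):
    """logits: policy logits at x_t; mu: behaviour probabilities; q: Q_{theta_v}(x_t, .)."""
    pi = torch.softmax(logits, dim=-1)
    log_pi = torch.log_softmax(logits, dim=-1)
    v = (pi * q).sum().detach()                      # baseline V(x_t) = E_pi[Q]
    rho = (pi / mu).detach()                         # importance ratios pi/mu per action
    # First term: truncated weight on the sampled action a_t with the Retrace target.
    trunc = torch.clamp(rho[a_t], max=c) * log_pi[a_t] * (q_ret - v)
    # Second term: bias correction, an expectation over actions drawn from pi.
    coef = torch.clamp((rho - c) / rho, min=0.0)     # [(rho(a) - c) / rho(a)]_+
    correction = (pi.detach() * coef * log_pi * (q.detach() - v)).sum()
    return trunc + correction                        # maximize; negate for a loss

# Usage sketch with made-up numbers:
logits = torch.tensor([0.2, -0.1, 0.4], requires_grad=True)
mu = torch.tensor([0.3, 0.3, 0.4])
q = torch.tensor([0.5, 0.1, 0.9])
obj = acer_policy_objective(logits, mu, q, q_ret=torch.tensor(0.7), a_t=1)
(-obj).backward()                                    # gradient ascent on the objective
```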

  • Finally, it makes use of an efficient, modified version of TRPO. Instead of constraining the KL divergence to the previous policy, we maintain a running average of past policies and force the updated policy not to deviate far from this average, which reduces the variance of policy updates.
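A minimal sketch of the two pieces involved, assuming PyTorch tensors (function names and the $\delta$, $\alpha$ defaults are illustrative): the closed-form trust-region projection of the policy gradient against the linearized KL to the average policy, and the soft running average that maintains the average-policy parameters.

```python
# Illustrative trust-region projection and average-policy update used in ACER-style training.
import torch

def trust_region_adjust(g, k, delta=1.0):
    """g: policy gradient w.r.t. distribution statistics; k: gradient of the KL to the average policy."""
    # z* = g - max(0, (k.g - delta) / ||k||^2) * k
    scale = torch.clamp((torch.dot(k, g) - delta) / (k.pow(2).sum() + 1e-8), min=0.0)
    return g - scale * k

def update_average_policy(theta_avg, theta, alpha=0.99):
    """Soft running average of policy parameters: theta_avg <- alpha*theta_avg + (1-alpha)*theta."""
    with torch.no_grad():
        for p_avg, p in zip(theta_avg, theta):
            p_avg.mul_(alpha).add_(p, alpha=1 - alpha)
```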

  • For the continuous actor-critic, we modify how the value estimates are computed and how the off-policy corrections are applied. Using a stochastic dueling network, we compute a stochastic estimate $\tilde{Q}_{\theta_v}$ of the action value and a deterministic estimate $V_{\theta_v}$ of the state value, given by

    $$\tilde{Q}_{\theta_v}(x_t, a_t) = V_{\theta_v}(x_t) + A_{\theta_v}(x_t, a_t) - \frac{1}{n}\sum_{i=1}^{n} A_{\theta_v}(x_t, u_i), \qquad u_i \sim \pi_\theta(\cdot \mid x_t).$$

    The target for the value function then becomes

    $$V^{\text{target}}(x_t) = \min\left\{1, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right\}\Big(Q^{\text{ret}}(x_t, a_t) - \tilde{Q}_{\theta_v}(x_t, a_t)\Big) + V_{\theta_v}(x_t).$$

    When estimating $Q^{\text{ret}}$ in continuous domains, we use the following truncated importance weights, where $d$ is the dimensionality of the action space:

    $$\bar{\rho}_t = \min\left\{1, \left(\frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)^{1/d}\right\}.$$
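A brief sketch of these continuous-case estimates, assuming the value and advantage heads have already been evaluated (function names and arguments are invented for this example): the stochastic dueling estimate $\tilde{Q}_{\theta_v}$ and the value target $V^{\text{target}}$ above.

```python
# Illustrative stochastic dueling estimate and value target for continuous ACER.
import torch

def stochastic_dueling_q(value, adv_taken, adv_sampled):
    """value: V(x_t); adv_taken: A(x_t, a_t); adv_sampled: A(x_t, u_i) for n actions u_i ~ pi."""
    # Q~(x_t, a_t) = V(x_t) + A(x_t, a_t) - (1/n) * sum_i A(x_t, u_i)
    return value + adv_taken - adv_sampled.mean()

def v_target(q_ret, q_tilde, value, rho):
    """rho: importance ratio pi(a_t|x_t) / mu(a_t|x_t)."""
    # V^target(x_t) = min{1, rho_t} * (Q^ret - Q~) + V(x_t)
    return torch.clamp(rho, max=1.0) * (q_ret - q_tilde) + value
```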

  • For trust region updating, we modify the Retrace estimate by replacing the truncated importance ratio with 1, yielding the off-policy-corrected estimate $Q^{\text{opc}}$.

ACER for the discrete actor-critic. Taken from Wang et al. (2017)

ACER for the continuous actor-critic. Taken from Wang et al. (2017)

Footnotes

  1. Mnih et al. (2016). Asynchronous Methods for Deep Reinforcement Learning.

  2. Wang et al. (2017). Sample Efficient Actor-Critic with Experience Replay.