Asynchronous Learning

  • The aim is to speed up the training of DQN-style agents using asynchronous methods, that is, by running learning in parallel. 1
  • Instead of using a replay memory, we rely on different threads running different policies: multiple actors perform exploration in parallel (via $\epsilon$-greedy policies, where $\epsilon$ is sampled from some distribution).
    • This reduces training time by exploiting parallelism.
    • This also allows us to use on-policy methods, since the parallel actors decorrelate the data and keep training stable; a toy sketch of this setup is given below the figure.

Asynchronous Deep Learning. Taken from Mnih et al. (2016)
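The snippet below is a toy, hypothetical sketch of this parallel-actor idea, not the paper's neural-network setup: tabular one-step Q-learning on a made-up chain MDP, with several threads sharing one Q-table, no replay memory, and a per-thread exploration rate $\epsilon$ sampled from a small set.

```python
# Toy sketch: asynchronous one-step Q-learning on a chain MDP (illustrative only).
import threading
import random

N_STATES, N_ACTIONS, GOAL = 10, 2, 9
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]        # shared parameters
lock = threading.Lock()

def step(s, a):
    """Chain MDP: action 1 moves right, action 0 moves left; reward 1 at the goal."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def greedy(s, rng):
    best = max(Q[s])
    return rng.choice([a for a in range(N_ACTIONS) if Q[s][a] == best])

def worker(seed, episodes=200, gamma=0.99, alpha=0.1):
    rng = random.Random(seed)
    eps = rng.choice([0.1, 0.3, 0.5])                   # exploration rate sampled per actor
    for _ in range(episodes):
        s, done = 0, False
        for _ in range(500):                            # step cap per episode
            a = rng.randrange(N_ACTIONS) if rng.random() < eps else greedy(s, rng)
            s2, r, done = step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2]))
            with lock:                                  # asynchronous update of shared parameters
                Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("greedy action per state:", [greedy(s, random.Random(0)) for s in range(N_STATES)])
```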

A3C

  • Asynchronous Advantage Actor-Critic

A3C. Taken from Algorithm S3 in Mnih et al. (2016)
  • Here, the critic learns the value function while multiple actors are trained in parallel, and their gradients are accumulated over several steps before being applied, for more stable and robust training.

  • In practice, we share some parameters between the value function and the policy function (typically all layers except the output heads).

  • Updates are performed as follows. Here we let $V_{\theta_v}$ and $\pi_\theta$ be the parameterized value function and policy respectively, and $\hat{A}(s_t, a_t)$ is the estimated advantage function:

    $$d\theta \leftarrow d\theta + \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \hat{A}(s_t, a_t), \qquad d\theta_v \leftarrow d\theta_v + \nabla_{\theta_v}\big(R_t - V_{\theta_v}(s_t)\big)^2,$$

    where $R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{\theta_v}(s_{t+k})$ is the $k$-step return and $\hat{A}(s_t, a_t) = R_t - V_{\theta_v}(s_t)$.

  • We may also add the entropy of the policy as a regularization term, i.e., add $\beta\, \nabla_{\theta} H\big(\pi_{\theta}(\cdot \mid s_t)\big)$ to the policy gradient to discourage premature convergence to deterministic policies.
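As a rough illustration of these updates, the sketch below assumes a small PyTorch actor-critic with a shared body (the `ActorCritic` class and the stand-in rollout data are invented for this example): it computes the $k$-step returns, advantages, value loss, and entropy bonus, and calls `backward()` to obtain the gradients that, in A3C, a worker would accumulate and apply asynchronously to the shared (global) model.

```python
# Illustrative A3C-style accumulated update for one worker; rollout data is random stand-in.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Tiny shared-parameter actor-critic: a common body with policy and value heads."""
    def __init__(self, n_obs=4, n_act=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_obs, 32), nn.Tanh())
        self.pi = nn.Linear(32, n_act)   # policy head
        self.v = nn.Linear(32, 1)        # value head

    def forward(self, x):
        h = self.body(x)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h).squeeze(-1)

model = ActorCritic()
gamma, beta, k = 0.99, 0.01, 5

# Stand-in rollout: k observed states plus the bootstrap state, k actions, k rewards.
obs = torch.randn(k + 1, 4)
acts = torch.randint(0, 2, (k,))
rews = torch.rand(k)

dist, values = model(obs[:-1])
with torch.no_grad():
    _, v_boot = model(obs[-1:])          # bootstrap value V(s_{t+k})

# k-step returns R_t = r_t + gamma * R_{t+1}, seeded with the bootstrap value.
R = v_boot.squeeze(0)
returns = []
for t in reversed(range(k)):
    R = rews[t] + gamma * R
    returns.append(R)
returns = torch.stack(list(reversed(returns)))

adv = returns - values.detach()          # advantage estimate A_hat(s_t, a_t)
policy_loss = -(dist.log_prob(acts) * adv).sum()
value_loss = ((returns - values) ** 2).sum()
entropy = dist.entropy().sum()           # optional entropy regularizer
loss = policy_loss + 0.5 * value_loss - beta * entropy
loss.backward()                          # in A3C these gradients go to the shared model
```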

A2C - Advantage Actor-Critic

  • A synchronous version of A3C that resolves data inconsistency.
  • It introduces a coordinator that waits for all parallel actors to finish their work before updating global parameters. Actors then start from the same policy.
  • It can utilize the GPU more efficiently while achieving performance comparable to or better than A3C; a minimal sketch of the coordination pattern is given below the figure.

Image from Lilian Weng https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
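The sketch below illustrates only the synchronous coordination pattern, with a made-up `DummyEnv` and a placeholder `Policy` (no real learning happens): all actors step in lockstep with the same parameter version, the coordinator batches their transitions, and a single global update is applied before anyone continues.

```python
# Illustrative A2C coordination pattern: synchronous rollouts, one global update per iteration.
import random

class DummyEnv:
    """Stand-in environment: random observations and rewards."""
    def reset(self): return random.random()
    def step(self, action): return random.random(), random.random(), False

class Policy:
    def __init__(self): self.version = 0
    def act(self, obs): return 0                       # placeholder action
    def update(self, batch): self.version += 1         # one synchronous gradient step

envs = [DummyEnv() for _ in range(8)]                  # parallel actors
policy = Policy()
obs = [env.reset() for env in envs]

for iteration in range(10):
    batch = []
    for t in range(5):                                 # every actor uses the SAME policy version
        actions = [policy.act(o) for o in obs]
        results = [env.step(a) for env, a in zip(envs, actions)]
        batch.extend(zip(obs, actions, [r for _, r, _ in results]))
        obs = [o2 for o2, _, _ in results]
    policy.update(batch)                               # coordinator waits, then updates globally
print("policy updates applied:", policy.version)
```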

ACER

  • Actor-Critic with Experience Replay 2. It presents an actor-critic method with stable, sample-efficient experience replay. It is the off-policy counterpart to A3C.

  • It makes use of the action value computed with Retrace, denoted $Q^{\text{ret}}(x_t, a_t)$, as a target to train the critic by minimizing the squared error

    $$\big(Q^{\text{ret}}(x_t, a_t) - Q_{\theta_v}(x_t, a_t)\big)^2,$$

    where $Q^{\text{ret}}$ is computed recursively as $Q^{\text{ret}}(x_t, a_t) = r_t + \gamma\, \bar{\rho}_{t+1}\big[Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q_{\theta_v}(x_{t+1}, a_{t+1})\big] + \gamma V_{\theta_v}(x_{t+1})$, with truncated importance ratio $\bar{\rho}_t = \min\{c, \rho_t\}$.
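A small sketch of how this Retrace target can be computed backwards over a sampled trajectory, assuming per-step Python lists of rewards, $Q_{\theta_v}$ estimates, state values, and importance ratios (the function name and arguments are invented for this example):

```python
# Illustrative backward computation of the Retrace targets for the critic.
def retrace_targets(r, q, v, rho, v_last, done, gamma=0.99, c=1.0):
    """r, q, v, rho: per-step lists; v_last: V(x_k) at the end of the segment."""
    q_ret_next = 0.0 if done else v_last     # bootstrap: 0 at terminal, V(x_k) otherwise
    targets = [0.0] * len(r)
    for t in reversed(range(len(r))):
        q_ret = r[t] + gamma * q_ret_next    # Q^ret(x_t, a_t)
        targets[t] = q_ret
        # Propagate to the previous step: rho_bar_t * (Q^ret_t - Q_t) + V_t.
        q_ret_next = min(c, rho[t]) * (q_ret - q[t]) + v[t]
    return targets                            # critic minimizes (targets[t] - q[t])**2
```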

  • To reduce the high variance of the policy gradient, we truncate the importance weights at a constant $c$ and add a correction term. We get the gradient as

    $$\hat{g}^{\text{acer}}_t = \bar{\rho}_t\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\big[Q^{\text{ret}}(x_t, a_t) - V_{\theta_v}(x_t)\big] + \mathbb{E}_{a \sim \pi}\!\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_{+} \nabla_\theta \log \pi_\theta(a \mid x_t)\big[Q_{\theta_v}(x_t, a) - V_{\theta_v}(x_t)\big]\right]$$

    As a shorthand, $\rho_t = \frac{\pi_\theta(a_t \mid x_t)}{\mu(a_t \mid x_t)}$ denotes the usual importance sampling ratio under the behaviour policy $\mu$, $\bar{\rho}_t = \min\{c, \rho_t\}$ is its truncated version, and $\rho_t(a) = \frac{\pi_\theta(a \mid x_t)}{\mu(a \mid x_t)}$ denotes the ratio evaluated at action $a$.

    The first term clips the importance weight and subtracts the baseline $V_{\theta_v}(x_t)$ to reduce variance. The second term corrects the bias introduced by the clipping so that the overall estimate remains unbiased.
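As a concrete sketch of this truncated-with-bias-correction gradient for a discrete policy, the function below (names, arguments, and the default $c$ are illustrative) builds the scalar objective whose gradient with respect to the policy logits matches the expression above:

```python
# Illustrative ACER policy objective for a single state with a discrete action space.
import torch

def acer_policy_objective(logits, mu, q, q_ret, a_t, c=10.0):
    """logits: policy logits at x_t; mu: behaviour probabilities; q: Q_{theta_v}(x_t, .)."""
    pi = torch.softmax(logits, dim=-1)
    log_pi = torch.log_softmax(logits, dim=-1)
    v = (pi * q).sum().detach()                      # baseline V(x_t) = E_pi[Q]
    rho = (pi / mu).detach()                         # importance ratios pi/mu per action
    # First term: truncated weight on the sampled action a_t with the Retrace target.
    trunc = torch.clamp(rho[a_t], max=c) * log_pi[a_t] * (q_ret - v)
    # Second term: bias correction, an expectation over actions drawn from pi.
    coef = torch.clamp((rho - c) / rho, min=0.0)     # [(rho(a) - c) / rho(a)]_+
    correction = (pi.detach() * coef * log_pi * (q.detach() - v)).sum()
    return trunc + correction                        # maximize; negate for a loss

# Usage sketch with made-up numbers:
logits = torch.tensor([0.2, -0.1, 0.4], requires_grad=True)
mu = torch.tensor([0.3, 0.3, 0.4])
q = torch.tensor([0.5, 0.1, 0.9])
obj = acer_policy_objective(logits, mu, q, q_ret=torch.tensor(0.7), a_t=1)
(-obj).backward()                                    # gradient ascent on the objective
```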

  • Finally, it makes use of an efficient, modified version of TRPO. Instead of constraining the KL divergence to the previous policy, we maintain a running average of past policies and force the updated policy not to deviate far from this average, which reduces the variance of policy updates.
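A minimal sketch of the two pieces involved, assuming PyTorch tensors (function names and the $\delta$, $\alpha$ defaults are illustrative): the closed-form trust-region projection of the policy gradient against the linearized KL to the average policy, and the soft running average that maintains the average-policy parameters.

```python
# Illustrative trust-region projection and average-policy update used in ACER-style training.
import torch

def trust_region_adjust(g, k, delta=1.0):
    """g: policy gradient w.r.t. distribution statistics; k: gradient of the KL to the average policy."""
    # z* = g - max(0, (k.g - delta) / ||k||^2) * k
    scale = torch.clamp((torch.dot(k, g) - delta) / (k.pow(2).sum() + 1e-8), min=0.0)
    return g - scale * k

def update_average_policy(theta_avg, theta, alpha=0.99):
    """Soft running average of policy parameters: theta_avg <- alpha*theta_avg + (1-alpha)*theta."""
    with torch.no_grad():
        for p_avg, p in zip(theta_avg, theta):
            p_avg.mul_(alpha).add_(p, alpha=1 - alpha)
```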

  • For the continuous actor-critic, we modify how the value estimates are computed and how the off-policy corrections are applied. Using a stochastic dueling network, we compute a stochastic estimate $\tilde{Q}_{\theta_v}$ of the action value and a deterministic estimate $V_{\theta_v}$ of the state value, given by

    $$\tilde{Q}_{\theta_v}(x_t, a_t) = V_{\theta_v}(x_t) + A_{\theta_v}(x_t, a_t) - \frac{1}{n}\sum_{i=1}^{n} A_{\theta_v}(x_t, u_i), \qquad u_i \sim \pi_\theta(\cdot \mid x_t).$$

    The target for the value function then becomes

    $$V^{\text{target}}(x_t) = \min\left\{1, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right\}\Big(Q^{\text{ret}}(x_t, a_t) - \tilde{Q}_{\theta_v}(x_t, a_t)\Big) + V_{\theta_v}(x_t).$$

    When estimating $Q^{\text{ret}}$ in continuous domains, we use the following truncated importance weights, where $d$ is the dimensionality of the action space:

    $$\bar{\rho}_t = \min\left\{1, \left(\frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)^{1/d}\right\}.$$
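A brief sketch of these continuous-case estimates, assuming the value and advantage heads have already been evaluated (function names and arguments are invented for this example): the stochastic dueling estimate $\tilde{Q}_{\theta_v}$ and the value target $V^{\text{target}}$ above.

```python
# Illustrative stochastic dueling estimate and value target for continuous ACER.
import torch

def stochastic_dueling_q(value, adv_taken, adv_sampled):
    """value: V(x_t); adv_taken: A(x_t, a_t); adv_sampled: A(x_t, u_i) for n actions u_i ~ pi."""
    # Q~(x_t, a_t) = V(x_t) + A(x_t, a_t) - (1/n) * sum_i A(x_t, u_i)
    return value + adv_taken - adv_sampled.mean()

def v_target(q_ret, q_tilde, value, rho):
    """rho: importance ratio pi(a_t|x_t) / mu(a_t|x_t)."""
    # V^target(x_t) = min{1, rho_t} * (Q^ret - Q~) + V(x_t)
    return torch.clamp(rho, max=1.0) * (q_ret - q_tilde) + value
```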

  • For trust region updating, we modify the Retrace estimate by replacing the truncated importance ratio with 1, yielding the off-policy-corrected estimate $Q^{\text{opc}}$.

ACER for the discrete actor-critic. Taken from Wang et al. (2017)

ACER for the continuous actor-critic. Taken from Wang et al. (2017)

Footnotes

  1. Mnih et al. (2016). Asynchronous Methods for Deep Reinforcement Learning.

  2. Wang et al. (2017). Sample Efficient Actor-Critic with Experience Replay.