Soft Q Learning

  • Aims to learn maximum entropy policies [1]. It expresses the optimal policy as a Boltzmann distribution, using the property that Boltzmann distributions maximize entropy for a given expected energy.

  • Instead of learning the best way to perform the task, the resulting policies try to learn all of the ways of performing the task

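  • We reframe the task as finding the optimal policy such that

    $$\pi^*_{\text{MaxEnt}} = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right)\right]$$

    where $\alpha$ is the temperature parameter for balancing reward and entropy.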

  • Policies are represented as

    $$\pi(a_t \mid s_t) \propto \exp\left(-\mathcal{E}(s_t, a_t)\right)$$

    where $\mathcal{E}$ is an energy function.

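  • (Haarnoja Thm. 1) Let the soft Q-function and soft value function be defined as

    $$Q^*_{\text{soft}}(s_t, a_t) = r_t + \mathbb{E}_{(s_{t+1}, \ldots) \sim \rho_\pi}\left[\sum_{l=1}^{\infty} \gamma^l \left(r_{t+l} + \alpha \mathcal{H}\left(\pi^*_{\text{MaxEnt}}(\cdot \mid s_{t+l})\right)\right)\right]$$

    $$V^*_{\text{soft}}(s_t) = \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q^*_{\text{soft}}(s_t, a')\right) da'$$

    The optimal policy for maximum entropy RL is given as

    $$\pi^*_{\text{MaxEnt}}(a_t \mid s_t) = \exp\left(\frac{1}{\alpha}\left(Q^*_{\text{soft}}(s_t, a_t) - V^*_{\text{soft}}(s_t)\right)\right)$$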
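For a finite action set the integral in the soft value becomes a sum, which makes the relationship between the soft Q-function, the soft value, and the Boltzmann policy easy to check numerically. The following is a minimal sketch under that assumption; the Q-values and the function names `soft_value` and `boltzmann_policy` are illustrative and do not come from the paper.

```python
import math

def soft_value(q_values, alpha):
    """V_soft = alpha * log(sum_a exp(Q(a) / alpha)), computed stably."""
    m = max(q_values)  # subtract the max before exponentiating to avoid overflow
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def boltzmann_policy(q_values, alpha):
    """pi(a) = exp((Q(a) - V_soft) / alpha); sums to 1 by construction."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

q = [1.0, 2.0, 0.5]  # hypothetical soft Q-values for a single state
for alpha in (0.1, 1.0, 10.0):
    pi = boltzmann_policy(q, alpha)
    rounded = [round(p, 3) for p in pi]
    print(f"alpha={alpha}: V_soft={soft_value(q, alpha):.3f}, pi={rounded}")
```

A small temperature concentrates the policy on the highest-valued action and pushes the soft value toward the ordinary maximum, while a large temperature pushes the policy toward uniform.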

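  • Soft Bellman equation (Haarnoja Thm. 2)

    $$Q^*_{\text{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[V^*_{\text{soft}}(s_{t+1})\right]$$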

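  • Soft Q-iteration (Haarnoja Thm. 3) The following is a soft version of the Bellman backup.

    Assume that $Q_{\text{soft}}(\cdot, \cdot)$ and $V_{\text{soft}}(\cdot)$ are bounded, $\int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(\cdot, a')\right) da' < \infty$, and $Q^*_{\text{soft}} < \infty$ exists.

    Then we can perform the following fixed point iteration to converge to $Q^*_{\text{soft}}$ and $V^*_{\text{soft}}$

    $$Q_{\text{soft}}(s_t, a_t) \leftarrow r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[V_{\text{soft}}(s_{t+1})\right] \quad \forall s_t, a_t$$

    $$V_{\text{soft}}(s_t) \leftarrow \alpha \log \int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q_{\text{soft}}(s_t, a')\right) da' \quad \forall s_t$$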
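In a small tabular problem the soft Q-iteration can be run exactly, with the integral over actions replaced by a sum. The sketch below assumes a hypothetical two-state, two-action MDP with known dynamics; the function names and the MDP are illustrative only.

```python
import math

def soft_value(q_row, alpha):
    """alpha * log(sum_a exp(Q(s, a) / alpha)) for one state, computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_q_iteration(rewards, transitions, gamma=0.9, alpha=1.0, iters=500):
    """rewards[s][a] is the reward; transitions[s][a] lists next-state probabilities."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # V_soft(s) <- alpha * log sum_a exp(Q_soft(s, a) / alpha)
        v = [soft_value(q[s], alpha) for s in range(n_states)]
        # Q_soft(s, a) <- r(s, a) + gamma * E_{s'}[V_soft(s')]
        q = [[rewards[s][a]
              + gamma * sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
              for a in range(n_actions)] for s in range(n_states)]
    v = [soft_value(q[s], alpha) for s in range(n_states)]
    return q, v

# Hypothetical MDP: action 0 stays in the current state, action 1 switches state.
rewards = [[0.0, 1.0], [2.0, 0.0]]
transitions = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
q_star, v_star = soft_q_iteration(rewards, transitions)
print(q_star, v_star)
```

With a discount below one the backup is a contraction, so the loop settles on values that satisfy the soft Bellman equation up to numerical precision.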

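  • A stochastic version involves optimizing via importance sampling, using $q_{a'}$ as an arbitrary distribution over the action space. We compute the soft value and a general loss function as follows

    $$V^{\theta}_{\text{soft}}(s_t) = \alpha \log \mathbb{E}_{q_{a'}}\left[\frac{\exp\left(\frac{1}{\alpha} Q^{\theta}_{\text{soft}}(s_t, a')\right)}{q_{a'}(a')}\right]$$

    $$J_Q(\theta) = \mathbb{E}_{s_t \sim q_{s_t},\, a_t \sim q_{a_t}}\left[\frac{1}{2}\left(\hat{Q}^{\bar{\theta}}_{\text{soft}}(s_t, a_t) - Q^{\theta}_{\text{soft}}(s_t, a_t)\right)^2\right]$$

    where $\hat{Q}^{\bar{\theta}}_{\text{soft}}(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1} \sim p_s}\left[V^{\bar{\theta}}_{\text{soft}}(s_{t+1})\right]$ is the target Q-value and $\bar{\theta}$ are the target parameters.

    Actions cannot be sampled exactly from the energy-based policy, so the stochastic version draws them approximately from a sampling network trained with Stein Variational Gradient Descent.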
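The importance-sampled soft value can be illustrated on a one-dimensional toy problem where the integral has a closed form. This is only a sketch: the quadratic Q-function, the uniform proposal on $[-3, 3]$, and all function names are assumptions made for the example, and the Stein Variational Gradient Descent sampler is not covered.

```python
import math
import random

def soft_value_importance_sampled(q_fn, alpha, sample_fn, density_fn, n_samples, rng):
    """Monte Carlo estimate of alpha * log E_{a' ~ q}[exp(Q(a') / alpha) / q(a')]."""
    total = 0.0
    for _ in range(n_samples):
        a = sample_fn(rng)
        total += math.exp(q_fn(a) / alpha) / density_fn(a)
    return alpha * math.log(total / n_samples)

def q_fn(a):
    return -a * a  # hypothetical Q(s, a) for one fixed state

def sample_fn(rng):
    return rng.uniform(-3.0, 3.0)  # proposal q_{a'}: uniform on [-3, 3]

def density_fn(a):
    return 1.0 / 6.0  # density of the uniform proposal

rng = random.Random(0)
estimate = soft_value_importance_sampled(q_fn, 1.0, sample_fn, density_fn, 20000, rng)
# Closed form for this toy case: log of the integral of exp(-a^2) over [-3, 3].
exact = math.log(math.sqrt(math.pi) * math.erf(3.0))
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

The estimate is only as good as the proposal: a proposal that rarely covers high-Q actions gives a high-variance estimate.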

Soft Q Learning. Taken from Haarnoja et al. (2017)


  • Soft Actor Critic [^Haarnoja _2018] extends Soft Q Learning with an actor that maximizes both expected reward and entropy, that is, it tries to succeed at the task while acting as randomly as possible. The approach is reported to be very stable and to achieve state-of-the-art performance on continuous control tasks. It is motivated by the following limitations of earlier methods.

    • On-policy methods such as Trust Region Policy Optimization (TRPO) require new samples to be collected for each gradient step, which becomes expensive as tasks become more complex.
    • DDPG, while sample efficient, is challenging to use due to its extreme brittleness and hyperparameter sensitivity.
    • Soft Q Learning requires complex approximate inference procedures in continuous action spaces.
  • SAC relies on three components

    • An actor-critic architecture with separate policy and value function networks. The actor itself is stochastic.
    • An off-policy formulation that enables reuse of previously collected data for efficiency.
    • Entropy maximization to enable stability and exploration.
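  • Soft Policy Evaluation (Haarnoja Lem. 1) - Let $\mathcal{T}^\pi Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\left[V(s_{t+1})\right]$ where $V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \log \pi(a_t \mid s_t)\right]$ (the temperature is absorbed into the reward scale here). Let $Q^{k+1} = \mathcal{T}^\pi Q^k$. The sequence $Q^k$ converges to the soft Q-value of $\pi$ as $k \to \infty$. In other words, we can compute the soft value of a fixed policy using repeated applications of the soft Bellman operator.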
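Soft policy evaluation can be sketched in the tabular case by applying the soft Bellman operator repeatedly for a fixed policy. The MDP below is hypothetical and chosen so the answer is known in closed form: with zero rewards and a uniform policy over two actions, the only contribution is the entropy bonus $\log 2$ per step, so $V = \log 2 / (1 - \gamma)$ and $Q = \gamma V$.

```python
import math

def soft_policy_evaluation(policy, rewards, transitions, gamma=0.9, iters=500):
    """Repeatedly apply the soft Bellman operator T^pi, starting from Q = 0."""
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # V(s) = E_{a ~ pi}[Q(s, a) - log pi(a | s)]
        v = [sum(p * (q[s][a] - math.log(p))
                 for a, p in enumerate(policy[s]) if p > 0.0)
             for s in range(n_states)]
        # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        q = [[rewards[s][a]
              + gamma * sum(pr * v[s2] for s2, pr in enumerate(transitions[s][a]))
              for a in range(n_actions)] for s in range(n_states)]
    return q

# Hypothetical check: zero rewards, uniform policy over two actions.
rewards = [[0.0, 0.0], [0.0, 0.0]]
transitions = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
q_pi = soft_policy_evaluation(uniform, rewards, transitions)
expected = 0.9 * math.log(2.0) / (1.0 - 0.9)
print(q_pi[0][0], expected)
```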

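  • We restrict the choice of policy to be in some set of policies $\Pi$. To that end, we project the improved policy back onto $\Pi$ using the KL divergence. Thus, for each improvement step, we update according to

    $$\pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{\text{KL}}\left(\pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left(Q^{\pi_{\text{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\text{old}}}(s_t)}\right)$$

    where $Z^{\pi_{\text{old}}}(s_t)$ is a partition function that normalizes the distribution. It does not depend on the new policy, so it does not contribute to the gradient and can be ignored.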

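  • Soft Policy Improvement (Haarnoja Lem. 2) Let $\pi_{\text{old}} \in \Pi$ and $\pi_{\text{new}}$ be as defined above. Then for all $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$, where $|\mathcal{A}| < \infty$, we have that $Q^{\pi_{\text{new}}}(s_t, a_t) \geq Q^{\pi_{\text{old}}}(s_t, a_t)$.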

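  • Soft Policy Iteration (Haarnoja Thm. 1) Repeated application of soft policy evaluation and soft policy improvement from any $\pi \in \Pi$ converges to a policy $\pi^*$, assuming $|\mathcal{A}| < \infty$. $\pi^*$ is the optimal policy among all $\pi \in \Pi$, in the sense that $Q^{\pi^*}(s_t, a_t) \geq Q^{\pi}(s_t, a_t)$ for all $\pi \in \Pi$ and all $(s_t, a_t)$.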

    • In other words, generalized policy iteration (GPI) carries over to soft RL.
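Soft policy iteration can be sketched in a tabular setting by alternating the two steps. This sketch assumes $\Pi$ is the set of all policies, in which case the KL projection has the closed form $\pi_{\text{new}}(a \mid s) = \exp(Q(s, a)) / Z(s)$; with a restricted policy class the projection would need to be optimized instead. The MDP and function names are hypothetical.

```python
import math

def evaluate(policy, rewards, transitions, gamma, iters=400):
    """Soft policy evaluation: iterate the soft Bellman operator for a fixed policy."""
    n_s, n_a = len(rewards), len(rewards[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        v = [sum(p * (q[s][a] - math.log(p))
                 for a, p in enumerate(policy[s]) if p > 0.0)
             for s in range(n_s)]
        q = [[rewards[s][a]
              + gamma * sum(pr * v[s2] for s2, pr in enumerate(transitions[s][a]))
              for a in range(n_a)] for s in range(n_s)]
    return q

def improve(q):
    """Unrestricted KL projection: pi_new(a | s) = exp(Q(s, a)) / Z(s)."""
    policy = []
    for row in q:
        m = max(row)
        weights = [math.exp(x - m) for x in row]
        z = sum(weights)
        policy.append([w / z for w in weights])
    return policy

def soft_policy_iteration(rewards, transitions, gamma=0.9, rounds=20):
    n_a = len(rewards[0])
    policy = [[1.0 / n_a] * n_a for _ in rewards]  # start from the uniform policy
    history = []  # soft Q-values of each successive policy
    for _ in range(rounds):
        q = evaluate(policy, rewards, transitions, gamma)
        history.append(q)
        policy = improve(q)
    return policy, history

# Hypothetical MDP: action 0 stays in the current state, action 1 switches state.
rewards = [[0.0, 1.0], [2.0, 0.0]]
transitions = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
policy, history = soft_policy_iteration(rewards, transitions)
print(policy)
```

Comparing consecutive entries of `history` shows the monotone improvement stated in the lemma: every soft Q-value is at least as large as in the previous round.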
  • Soft Actor-Critic approximates Soft Policy Iteration with function approximation, using

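    • A parameterized state value function $V_\psi(s_t)$ to approximate the soft value. Strictly speaking, a separate network is not needed, since the soft value can be computed from the Q-function and the policy, but it helps stabilize training.

      Its goal is to minimize the squared residual error

      $$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\frac{1}{2}\left(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)\right]\right)^2\right]$$

      using the distribution $\mathcal{D}$ of previously sampled states and actions.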

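    • A soft Q-function $Q_\theta(s_t, a_t)$, which is trained to minimize the soft Bellman residual

      $$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right], \quad \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar{\psi}}(s_{t+1})\right]$$

      where $V_{\bar{\psi}}$ is a target value network. We optimize the above with stochastic gradients.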

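    • A tractable policy $\pi_\phi(a_t \mid s_t)$. This is learned by minimizing the expected KL divergence given by

      $$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[D_{\text{KL}}\left(\pi_\phi(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left(Q_\theta(s_t, \cdot)\right)}{Z_\theta(s_t)}\right)\right]$$

      Since the target density is defined by the Q-function, which is a differentiable neural network, we can optimize this by reparameterizing the policy with a neural network transformation

      $$a_t = f_\phi(\epsilon_t; s_t)$$

      where $\epsilon_t$ is an input noise vector. With this, the objective can be written as follows.

      $$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\left[\log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\right]$$

      where $a_t$ is evaluated at $f_\phi(\epsilon_t; s_t)$.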
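The reparameterized policy objective can be illustrated with a one-dimensional Gaussian policy and a fixed state. The sketch below is a toy under stated assumptions: the quadratic Q-function is hypothetical, the policy is $a = \mu + \sigma \epsilon$ with no neural network, and the pathwise gradients are written out by hand, whereas a practical implementation would rely on automatic differentiation. For this Q-function the target density $\exp(Q)$ is proportional to a Gaussian with mean 1 and variance 1/2, so minimizing the objective should recover those parameters.

```python
import math
import random

def q_fn(a):
    return -(a - 1.0) ** 2  # hypothetical Q(s, a) for one fixed state

def policy_objective(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of E_eps[log pi(a | s) - Q(s, a)] with a = mu + sigma * eps."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        log_pi = -0.5 * eps * eps - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        total += log_pi - q_fn(mu + sigma * eps)
    return total / n_samples

def train_policy(steps=2000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    mu, log_sigma = 0.0, 0.0
    for _ in range(steps):
        sigma = math.exp(log_sigma)
        g_mu = g_ls = 0.0
        for _ in range(batch):
            eps = rng.gauss(0.0, 1.0)
            a = mu + sigma * eps              # a = f_phi(eps; s)
            dq_da = -2.0 * (a - 1.0)          # gradient of Q with respect to the action
            g_mu += -dq_da                    # log pi term does not depend on mu here
            g_ls += -1.0 - dq_da * sigma * eps  # -1 comes from the -log(sigma) term
        mu -= lr * g_mu / batch
        log_sigma -= lr * g_ls / batch
    return mu, math.exp(log_sigma)

mu, sigma = train_policy()
print(f"mu={mu:.2f}, sigma={sigma:.2f}, target sigma={math.sqrt(0.5):.2f}")
```

Because the noise is sampled independently of the parameters, the gradient flows through the sampled action into $Q$, which is what makes the lower-variance pathwise estimator possible.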

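  • We use two Q-functions, trained independently, in a manner similar to double Q-learning, and take the minimum of the two wherever a Q-value is needed in the value and policy updates. This mitigates positive bias in the policy improvement step and speeds up training.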
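A minimal sketch of how the two Q-functions might enter the value target: for each sampled state, the smaller of the two Q estimates is used before subtracting the log-probability of the sampled action. The function name and the batch values are hypothetical.

```python
def soft_value_targets(q1_values, q2_values, log_probs):
    """Per-sample target min(Q1, Q2) - log pi(a | s) for the value function update."""
    return [min(q1, q2) - lp for q1, q2, lp in zip(q1_values, q2_values, log_probs)]

# Hypothetical batch of two samples.
q1_batch = [1.0, 3.0]
q2_batch = [2.0, 2.5]
log_prob_batch = [-0.5, -1.0]
targets = soft_value_targets(q1_batch, q2_batch, log_prob_batch)
print(targets)  # → [1.5, 3.5]
```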

Soft Actor Critic. Image taken from Haarnoja et al. (2018)

[^Haarnoja _2018] Haarnoja, Zhou, Abbeel, and Levine (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor



  1. Haarnoja, Tang, Abbeel, and Levine (2017). Reinforcement Learning with Deep Energy-based Policies