Characterization

  • Let $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ be a policy’s parameter vector, and $\pi(a \mid s, \boldsymbol{\theta}) = \Pr\{A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta}\}$ be the probability that action $a$ is taken given state $s$ and parameter $\boldsymbol{\theta}$ at time $t$.
    • Assume the policy is differentiable with respect to its parameters.
    • To ensure exploration, we generally require policies never become deterministic.
  • The performance of a policy is calculated with respect to a scalar measure $J(\boldsymbol{\theta})$. The goal is to maximize performance using gradient ascent, $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}$ (a minimal sketch follows below).
  • Methods that follow this scheme are policy gradient methods
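
A minimal sketch of this update loop, assuming nothing about how the gradient estimate is produced (the names `gradient_ascent_step` and `grad_J_estimate` are illustrative, not from these notes):

```python
import numpy as np

def gradient_ascent_step(theta, grad_J_estimate, alpha=0.01):
    """One generic policy-gradient update: theta <- theta + alpha * grad_J_hat(theta)."""
    return theta + alpha * grad_J_estimate

theta = np.zeros(4)                                # policy parameter vector
grad_J_estimate = np.array([0.1, -0.2, 0.0, 0.3])  # stand-in stochastic gradient estimate
theta = gradient_ascent_step(theta, grad_J_estimate)
print(theta)
```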

Why Policy Gradient Methods?

  • Sometimes the policy is a simpler function to approximate.
  • We can also generalize well to continuous actions or stochastic policies.
  • It may also be a good way of injecting prior knowledge about the policy.
  • With a continuous parameterization, the action probabilities change smoothly as a function of the parameters, so there is less variance in the updates and we get stronger convergence guarantees than with $\varepsilon$-greedy action-value methods, where a small change in estimated values can discontinuously change the selected action.

Parameterization

  • For discrete and not-too-large action spaces, we form a parameterized numerical preference $h(s, a, \boldsymbol{\theta})$ for every state-action pair.
    • The action preferences may be computed using any form of function approximator (e.g., a neural network or a linear function of features).
    • The softmax in action preferences applies softmax to the preferences to obtain a probability distribution as follows:

    $$\pi(a \mid s, \boldsymbol{\theta}) = \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_b e^{h(s, b, \boldsymbol{\theta})}}$$
  • One advantage to using the softmax on action preferences is that policies can approach a deterministic policy.
    • Using softmax on action values does not guarantee convergence to an optimal (deterministic) policy; the estimates only converge to the true action values, which translate to action probabilities strictly between 0 and 1.
    • Softmax in action preferences drives the parameters towards the optimal stochastic policy
    • It also enables selecting actions according to arbitrary probabilities.
  • For continuous action spaces we can instead parameterize the policy as a probability distribution over actions (e.g., a Gaussian).
    • In such a case, we typically want the probability distribution itself to be parameterized by the state and/or action to be taken.
    • These parameterizations usually depend on further parameterizations themselves; for example, the mean $\mu(s, \boldsymbol{\theta})$ and standard deviation $\sigma(s, \boldsymbol{\theta})$ of a Gaussian policy are parameterized functions of the state (see the sketch below).
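
A minimal sketch of both parameterizations, assuming linear preferences for the discrete case and a linear mean / log-standard-deviation for the Gaussian case (all function names and feature choices here are illustrative):

```python
import numpy as np

def softmax_policy(theta, action_features):
    """pi(a | s, theta) via softmax over linear preferences h(s, a, theta) = theta . x(s, a).
    `action_features` has one row x(s, a) per action."""
    h = action_features @ theta
    h -= h.max()                      # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def gaussian_policy_sample(theta_mu, theta_sigma, state_features, rng):
    """Continuous actions: a ~ N(mu(s, theta), sigma(s, theta)^2), where the mean and
    (log) standard deviation are themselves linear functions of the state features."""
    mu = theta_mu @ state_features
    sigma = np.exp(theta_sigma @ state_features)   # exponentiate to keep sigma > 0
    return rng.normal(mu, sigma)

rng = np.random.default_rng(0)
print(softmax_policy(np.array([1.0, -1.0]), rng.normal(size=(3, 2))))  # 3 actions, 2 features
print(gaussian_policy_sample(np.ones(2), np.zeros(2), rng.normal(size=2), rng))
```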

Policy Gradient Theorem

  • The Policy Gradient Theorem guarantees that we can perform gradient ascent on the parameterized policy without any knowledge of the model’s dynamics or the initial distribution of states.

    • This arises partly because, when we calculate the gradient, any parameterization of the environment’s dynamics would not have a dependence on $\boldsymbol{\theta}$.
  • The Policy Gradient Theorem. Let $J(\boldsymbol{\theta})$ be the expected reward. If we take the expected reward using the undiscounted value function, $J(\boldsymbol{\theta}) = \mathbb{E}_{s_0}\!\left[v_{\pi_{\boldsymbol{\theta}}}(s_0)\right]$,¹ we get the following (the gradient is with respect to $\boldsymbol{\theta}$ as usual):

    $$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a)\, \nabla \pi(a \mid s, \boldsymbol{\theta}) = \mathbb{E}_{\pi}\!\left[\sum_a q_{\pi}(S_t, a)\, \nabla \pi(a \mid S_t, \boldsymbol{\theta})\right]$$

    where $\mu(s)$ is the on-policy distribution under $\pi$.

    The proportionality constant is the average length of an episode in the episodic case² and 1 in the continuing case.³

    The expectation over $S_t \sim \mu$ on the right-hand side is just a way to abbreviate the notation.
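
A sketch of how the theorem is used for sampling-based estimation (REINFORCE-style), assuming the linear-softmax parameterization above and using the return $G_t$ as a sample of $q_\pi(S_t, A_t)$; the episode format and function names are assumptions for illustration:

```python
import numpy as np

def grad_log_softmax(theta, action_features, a):
    """grad_theta log pi(a | s, theta) for linear preferences:
    x(s, a) - sum_b pi(b | s, theta) x(s, b)."""
    h = action_features @ theta
    probs = np.exp(h - h.max())
    probs /= probs.sum()
    return action_features[a] - probs @ action_features

def reinforce_gradient(theta, episode):
    """episode: list of (action_features_of_state, chosen_action, return_G_t).
    Returns a sample estimate of (something proportional to) grad J(theta)."""
    grad = np.zeros_like(theta)
    for action_features, a, G in episode:
        grad += G * grad_log_softmax(theta, action_features, a)
    return grad

rng = np.random.default_rng(0)
episode = [(rng.normal(size=(3, 2)), rng.integers(3), rng.normal()) for _ in range(5)]
print(reinforce_gradient(np.zeros(2), episode))
```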

EGLP Lemma

  • The Expected Grad-Log-Prob lemma is an intermediate result that is applied extensively in policy gradients
  • Let $P_{\theta}$ be a parameterized probability distribution over a random variable $x$; then

    $$\mathbb{E}_{x \sim P_{\theta}}\!\left[\nabla_{\theta} \log P_{\theta}(x)\right] = 0$$
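
A quick Monte Carlo sanity check of the lemma for a Gaussian $\mathcal{N}(\mu, 1)$, where $\nabla_\mu \log p(x) = x - \mu$ (a toy illustration, not part of the lemma's proof):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
x = rng.normal(loc=mu, scale=1.0, size=1_000_000)  # samples from P_theta
grad_log_p = x - mu                                 # d/dmu log N(x; mu, 1)
print(grad_log_p.mean())                            # ~ 0, up to Monte Carlo error
```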

Baselines

  • We can generalize the policy gradient theorem to include comparison with a baseline $b(s)$:

    $$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \Big(q_{\pi}(s, a) - b(s)\Big)\, \nabla \pi(a \mid s, \boldsymbol{\theta})$$

    where the baseline can be any function, as long as it is not dependent on the action $a$.

  • The rationale for baselines is that they reduce the variance of sample estimates of the policy gradient (see the toy check after this list).

  • The usual choices are either the average return under the on-policy distribution or an estimated state value $\hat{v}(S_t, \mathbf{w})$.

  • The baseline does not introduce bias precisely because of the EGLP Lemma: for each state, $\sum_a b(s)\, \nabla \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla 1 = 0$.

  • The baseline reduces variance, which can be understood through the bias-variance tradeoff.

    • In fact, we can view the variance of the above estimator as a sum of squares that can be decomposed with the bias-variance decomposition.⁴
    • In line with that, we have the common choice of the estimated state value $\hat{v}(S_t, \mathbf{w})$ as a sort of MLE.
    • The tradeoff applies: we reduce variance at the cost of adding bias.
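
A toy two-armed-bandit check, assuming a fixed uniform softmax policy and made-up action values (purely illustrative), that a state-value baseline leaves the mean of the gradient estimate unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.5])           # softmax policy with equal preferences
q = np.array([1.0, 3.0])               # true action values
v = probs @ q                          # state value, used as the baseline b(s)

def single_sample_grad(baseline):
    a = rng.choice(2, p=probs)
    g = q[a] + rng.normal()            # noisy sampled return for the chosen arm
    grad_log_pi = np.eye(2)[a] - probs # grad of log softmax w.r.t. the preferences
    return (g - baseline) * grad_log_pi

no_base = np.array([single_sample_grad(0.0) for _ in range(100_000)])
with_base = np.array([single_sample_grad(v) for _ in range(100_000)])
print(no_base.mean(axis=0), with_base.mean(axis=0))            # same expectation (no bias)
print(no_base.var(axis=0).sum(), with_base.var(axis=0).sum())  # lower variance with baseline
```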

Baselined Discounting

  • If we want to apply discounting, the baseline has to satisfy

    $$\mathbb{E}_{a_t \sim \pi}\!\left[\nabla_{\boldsymbol{\theta}} \log \pi(a_t \mid s_t, \boldsymbol{\theta})\, b(s_t)\right] = 0$$

    i.e., it may depend on the state (and on anything that happened before $a_t$) but not on the action $a_t$ itself, so the EGLP lemma still cancels its contribution.

Continuing Cases

  • For the continuing case, we simply generalize by using the average-reward formulation, $J(\boldsymbol{\theta}) \doteq r(\pi)$ (see below). The derivation is otherwise the same.
  • In this case, the proportionality becomes an equality.
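
For reference, the average-reward objective (as defined in Sutton and Barto, added here for completeness; $\mu$ is the steady-state distribution under $\pi$):

$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] = \sum_s \mu(s) \sum_a \pi(a \mid s, \boldsymbol{\theta}) \sum_{s', r} p(s', r \mid s, a)\, r$$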

Advantage function

  • The advantage function is defined as

    $$A_{\pi}(s, a) = q_{\pi}(s, a) - v_{\pi}(s)$$

    Intuitively, it tells us how much better action $a$ is compared to the average action taken in state $s$ under $\pi$ (a code sketch for estimating it follows after this list).

  • We can reformulate the policy gradient using the advantage function (see The General Policy Gradient Theorem).

  • The advantage function formulation is equivalent to the baselined formulation because

    • $q_{\pi}(s, a) - v_{\pi}(s) = A_{\pi}(s, a)$, which follows by definition of $A_{\pi}$.
    • $b(s) = v_{\pi}(s)$, which must be asserted as true (it is a choice of baseline, not a consequence).
  • In general,⁵ we can frame the expected return of a new policy $\tilde{\pi}$ in terms of the advantage over the old policy $\pi$ accumulated over time steps:

    $$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]$$

    • An important corollary: any new policy that has a non-negative expected advantage at every state is guaranteed to increase the policy performance or leave it constant.

    • In an approximate setting, we may encounter errors for which the expected advantage of some state is negative. A local approximation is then given as follows:

      $$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)$$

      which uses the state-visitation frequencies $\rho_{\pi}$ of the old policy $\pi$ rather than those of $\tilde{\pi}$. If $\pi_{\boldsymbol{\theta}}$ is parameterized and differentiable on $\boldsymbol{\theta}$, then we have that

      $$L_{\pi_{\boldsymbol{\theta}_0}}(\pi_{\boldsymbol{\theta}_0}) = \eta(\pi_{\boldsymbol{\theta}_0}), \qquad \nabla_{\boldsymbol{\theta}} L_{\pi_{\boldsymbol{\theta}_0}}(\pi_{\boldsymbol{\theta}})\,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} = \nabla_{\boldsymbol{\theta}}\, \eta(\pi_{\boldsymbol{\theta}})\,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_0}$$

      which means that a sufficiently small step from $\boldsymbol{\theta}_0$ that improves $L_{\pi_{\boldsymbol{\theta}_0}}$ will also improve $\eta$.
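
A sketch of two common ways to estimate $A_\pi(s_t, a_t)$ from a recorded episode, assuming `rewards` has length $T$ and `v_hat` holds value estimates for all $T+1$ visited states (terminal included); these conventions and names are assumptions for illustration:

```python
import numpy as np

def advantages_from_returns(rewards, v_hat, gamma=0.99):
    """A_t ~= G_t - v_hat(s_t), with G_t the discounted return from step t."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = np.array(returns[::-1])
    return returns - v_hat[:-1]

def advantages_from_td(rewards, v_hat, gamma=0.99):
    """A_t ~= r_t + gamma * v_hat(s_{t+1}) - v_hat(s_t)  (the TD residual)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards + gamma * v_hat[1:] - v_hat[:-1]

rewards = [1.0, 0.0, 2.0]
v_hat = np.array([0.5, 0.4, 1.0, 0.0])   # value estimates for s_0..s_3 (terminal = 0)
print(advantages_from_returns(rewards, v_hat))
print(advantages_from_td(rewards, v_hat))
```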

The General Policy Gradient Theorem

  • We can generalize the Policy Gradient Theorem as follows. Let $J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}[R(\tau)]$ be the performance measure, specifically defined as the expected total reward of a trajectory $\tau$.
  • All policy gradient theorems follow the form

    $$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}\!\left[\sum_{t=0}^{T} \Psi_t\, \nabla_{\boldsymbol{\theta}} \log \pi(a_t \mid s_t, \boldsymbol{\theta})\right]$$

    where $\Psi_t$ depends on our algorithm (a code sketch follows below).
    • $\Psi_t = R(\tau)$ - total reward of the trajectory.
    • $\Psi_t = \sum_{t'=t}^{T} r_{t'}$ - reward following action $a_t$.
    • $\Psi_t = \sum_{t'=t}^{T} r_{t'} - b(s_t)$ - baselined reward after action $a_t$.
    • $\Psi_t = q_{\pi}(s_t, a_t)$ - state-action value function.
    • $\Psi_t = A_{\pi}(s_t, a_t)$ - advantage function.
    • $\Psi_t = r_t + \gamma v_{\pi}(s_{t+1}) - v_{\pi}(s_t)$ - TD residual.
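
A small sketch computing several of the listed $\Psi_t$ choices from one trajectory, under the same assumed array conventions as the advantage sketch above (names are illustrative):

```python
import numpy as np

def psi_choices(rewards, v_hat, gamma=1.0):
    """Return a few Psi_t options for a trajectory of length T.
    rewards: r_0..r_{T-1}; v_hat: value estimates for s_0..s_T (terminal included)."""
    r = np.asarray(rewards, dtype=float)
    T = len(r)
    total = np.full(T, ((gamma ** np.arange(T)) * r).sum())      # Psi_t = R(tau)
    reward_to_go = np.array([((gamma ** np.arange(T - t)) * r[t:]).sum()
                             for t in range(T)])                 # Psi_t = sum_{t'>=t} r_{t'}
    baselined = reward_to_go - v_hat[:-1]                        # reward-to-go minus b(s_t)
    td_residual = r + gamma * v_hat[1:] - v_hat[:-1]             # r_t + gamma v(s_{t+1}) - v(s_t)
    return total, reward_to_go, baselined, td_residual

for psi in psi_choices([1.0, 0.0, 2.0], np.array([0.5, 0.4, 1.0, 0.0]), gamma=0.9):
    print(psi)
```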

Footnotes

  1. Sutton and Barto use $J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0)$, but if we are measuring expected reward, we should average over all the states. The reason why we can just sum over $s_0$ is that the environment is Markovian, so we can simply get “all trajectories” just from using any possible starting state.

  2. In practice, we don’t really care about this because it will be weighted by a step size anyway.

  3. The proportionality-constant part of the formula follows by multiplying the RHS of the equation with $\frac{\sum_{s'} \eta(s')}{\sum_{s'} \eta(s')}$, where $\eta(s)$ is the expected number of visits to $s$ per episode and $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$.

  4. See more here for a demonstration that the baseline does not introduce bias.

  5. Kakade and Langford (2002). Approximately Optimal Approximate Reinforcement Learning.