Characterization

  • Let $\boldsymbol{\theta} \in \mathbb{R}^{d'}$ be a policy’s parameter vector, and $\pi(a \mid s, \boldsymbol{\theta}) = \Pr\{A_t = a \mid S_t = s, \boldsymbol{\theta}_t = \boldsymbol{\theta}\}$ be the probability that action $a$ is taken given state $s$ and parameter $\boldsymbol{\theta}$ at time $t$.
    • Assume the policy is differentiable with respect to its parameters.
    • To ensure exploration, we generally require policies never become deterministic.
  • The performance of a policy is calculated with respect to a scalar measure $J(\boldsymbol{\theta})$. The goal is to maximize performance using gradient ascent, $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha \widehat{\nabla J(\boldsymbol{\theta}_t)}$ (a minimal sketch follows below).
  • Methods that follow this scheme are policy gradient methods
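
A minimal sketch of this update loop, assuming nothing about how the gradient estimate is produced (the names `gradient_ascent_step` and `grad_J_estimate` are illustrative, not from these notes):

```python
import numpy as np

def gradient_ascent_step(theta, grad_J_estimate, alpha=0.01):
    """One generic policy-gradient update: theta <- theta + alpha * grad_J_hat(theta)."""
    return theta + alpha * grad_J_estimate

theta = np.zeros(4)                                # policy parameter vector
grad_J_estimate = np.array([0.1, -0.2, 0.0, 0.3])  # stand-in stochastic gradient estimate
theta = gradient_ascent_step(theta, grad_J_estimate)
print(theta)
```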

Why Policy Gradient Methods?

  • Sometimes the policy is a simpler function to approximate.
  • We can also generalize well to continuous actions or stochastic policies.
  • It may also be a good way of injecting prior knowledge about the policy.
  • With a continuous parameterization, the action probabilities change smoothly as a function of the parameters, so there is less variance in the updates and we get stronger convergence guarantees than with $\varepsilon$-greedy action-value methods, where a small change in estimated values can discontinuously change the selected action.

Parameterization

  • For discrete and not-too-large action spaces, we form a parameterized numerical preference $h(s, a, \boldsymbol{\theta})$ for every state-action pair.
    • The action preferences may be computed using any form of function approximator (e.g., a neural network or a linear function of features).
    • The softmax in action preferences applies softmax to the preferences to obtain a probability distribution as follows:

    $$\pi(a \mid s, \boldsymbol{\theta}) = \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_b e^{h(s, b, \boldsymbol{\theta})}}$$
  • One advantage to using the softmax on action preferences is that policies can approach a deterministic policy.
    • Using softmax on action values does not guarantee convergence to an optimal (deterministic) policy; the estimates only converge to the true action values, which translate to action probabilities strictly between 0 and 1.
    • Softmax in action preferences drives the parameters towards the optimal stochastic policy
    • It also enables selecting actions according to arbitrary probabilities.
  • For continuous action spaces we can instead parameterize the policy as a probability distribution over actions (e.g., a Gaussian).
    • In such a case, we typically want the probability distribution itself to be parameterized by the state and/or action to be taken.
    • These parameterizations usually depend on further parameterizations themselves; for example, the mean $\mu(s, \boldsymbol{\theta})$ and standard deviation $\sigma(s, \boldsymbol{\theta})$ of a Gaussian policy are parameterized functions of the state (see the sketch below).
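
A minimal sketch of both parameterizations, assuming linear preferences for the discrete case and a linear mean / log-standard-deviation for the Gaussian case (all function names and feature choices here are illustrative):

```python
import numpy as np

def softmax_policy(theta, action_features):
    """pi(a | s, theta) via softmax over linear preferences h(s, a, theta) = theta . x(s, a).
    `action_features` has one row x(s, a) per action."""
    h = action_features @ theta
    h -= h.max()                      # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

def gaussian_policy_sample(theta_mu, theta_sigma, state_features, rng):
    """Continuous actions: a ~ N(mu(s, theta), sigma(s, theta)^2), where the mean and
    (log) standard deviation are themselves linear functions of the state features."""
    mu = theta_mu @ state_features
    sigma = np.exp(theta_sigma @ state_features)   # exponentiate to keep sigma > 0
    return rng.normal(mu, sigma)

rng = np.random.default_rng(0)
print(softmax_policy(np.array([1.0, -1.0]), rng.normal(size=(3, 2))))  # 3 actions, 2 features
print(gaussian_policy_sample(np.ones(2), np.zeros(2), rng.normal(size=2), rng))
```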

Policy Gradient Theorem

  • The Policy Gradient Theorem guarantees that we can perform gradient ascent on the parameterized policy without any knowledge of the model’s dynamics or the initial distribution of states.

    • This arises partly because, when we calculate the gradient, any parameterization of the environment’s dynamics would not have a dependence on $\boldsymbol{\theta}$.
  • The Policy Gradient Theorem. Let $J(\boldsymbol{\theta})$ be the expected reward. If we take the expected reward using the undiscounted value function, $J(\boldsymbol{\theta}) = \mathbb{E}_{s_0}\!\left[v_{\pi_{\boldsymbol{\theta}}}(s_0)\right]$,¹ we get the following (the gradient is with respect to $\boldsymbol{\theta}$ as usual):

    $$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_{\pi}(s, a)\, \nabla \pi(a \mid s, \boldsymbol{\theta}) = \mathbb{E}_{\pi}\!\left[\sum_a q_{\pi}(S_t, a)\, \nabla \pi(a \mid S_t, \boldsymbol{\theta})\right]$$

    where $\mu(s)$ is the on-policy distribution under $\pi$.

    The proportionality constant is the average length of an episode in the episodic case² and 1 in the continuing case.³

    The expectation over $S_t \sim \mu$ on the right-hand side is just a way to abbreviate the notation.
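
A sketch of how the theorem is used for sampling-based estimation (REINFORCE-style), assuming the linear-softmax parameterization above and using the return $G_t$ as a sample of $q_\pi(S_t, A_t)$; the episode format and function names are assumptions for illustration:

```python
import numpy as np

def grad_log_softmax(theta, action_features, a):
    """grad_theta log pi(a | s, theta) for linear preferences:
    x(s, a) - sum_b pi(b | s, theta) x(s, b)."""
    h = action_features @ theta
    probs = np.exp(h - h.max())
    probs /= probs.sum()
    return action_features[a] - probs @ action_features

def reinforce_gradient(theta, episode):
    """episode: list of (action_features_of_state, chosen_action, return_G_t).
    Returns a sample estimate of (something proportional to) grad J(theta)."""
    grad = np.zeros_like(theta)
    for action_features, a, G in episode:
        grad += G * grad_log_softmax(theta, action_features, a)
    return grad

rng = np.random.default_rng(0)
episode = [(rng.normal(size=(3, 2)), rng.integers(3), rng.normal()) for _ in range(5)]
print(reinforce_gradient(np.zeros(2), episode))
```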

EGLP Lemma

  • The Expected Grad-Log-Prob lemma is an intermediate result that is applied extensively in policy gradients
  • Let $P_{\theta}$ be a parameterized probability distribution over a random variable $x$; then

    $$\mathbb{E}_{x \sim P_{\theta}}\!\left[\nabla_{\theta} \log P_{\theta}(x)\right] = 0$$
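
A quick Monte Carlo sanity check of the lemma for a Gaussian $\mathcal{N}(\mu, 1)$, where $\nabla_\mu \log p(x) = x - \mu$ (a toy illustration, not part of the lemma's proof):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
x = rng.normal(loc=mu, scale=1.0, size=1_000_000)  # samples from P_theta
grad_log_p = x - mu                                 # d/dmu log N(x; mu, 1)
print(grad_log_p.mean())                            # ~ 0, up to Monte Carlo error
```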

Baselines

  • We can generalize the policy gradient theorem to include comparison with a baseline $b(s)$:

    $$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a \Big(q_{\pi}(s, a) - b(s)\Big)\, \nabla \pi(a \mid s, \boldsymbol{\theta})$$

    where the baseline can be any function, as long as it is not dependent on the action $a$.

  • The rationale for baselines is that they reduce the variance of sample estimates of the policy gradient (see the toy check after this list).

  • The usual choices are either the average return under the on-policy distribution or an estimated state value $\hat{v}(S_t, \mathbf{w})$.

  • The baseline does not introduce bias precisely because of the EGLP Lemma: for each state, $\sum_a b(s)\, \nabla \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla \sum_a \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla 1 = 0$.

  • The baseline reduces variance, which can be understood through the bias-variance tradeoff.

    • In fact, we can view the variance of the above estimator as a sum of squares that can be decomposed with the bias-variance decomposition.⁴
    • In line with that, we have the common choice of the estimated state value $\hat{v}(S_t, \mathbf{w})$ as a sort of MLE.
    • The tradeoff applies: we reduce variance at the cost of adding bias.
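
A toy two-armed-bandit check, assuming a fixed uniform softmax policy and made-up action values (purely illustrative), that a state-value baseline leaves the mean of the gradient estimate unchanged while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.5])           # softmax policy with equal preferences
q = np.array([1.0, 3.0])               # true action values
v = probs @ q                          # state value, used as the baseline b(s)

def single_sample_grad(baseline):
    a = rng.choice(2, p=probs)
    g = q[a] + rng.normal()            # noisy sampled return for the chosen arm
    grad_log_pi = np.eye(2)[a] - probs # grad of log softmax w.r.t. the preferences
    return (g - baseline) * grad_log_pi

no_base = np.array([single_sample_grad(0.0) for _ in range(100_000)])
with_base = np.array([single_sample_grad(v) for _ in range(100_000)])
print(no_base.mean(axis=0), with_base.mean(axis=0))            # same expectation (no bias)
print(no_base.var(axis=0).sum(), with_base.var(axis=0).sum())  # lower variance with baseline
```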

Baselined Discounting

  • If we want to apply discounting, the baseline has to satisfy

    $$\mathbb{E}_{a_t \sim \pi}\!\left[\nabla_{\boldsymbol{\theta}} \log \pi(a_t \mid s_t, \boldsymbol{\theta})\, b(s_t)\right] = 0$$

    i.e., it may depend on the state (and on anything that happened before $a_t$) but not on the action $a_t$ itself, so the EGLP lemma still cancels its contribution.

Continuing Cases

  • For the continuing case, we simply generalize by using the average-reward formulation, $J(\boldsymbol{\theta}) \doteq r(\pi)$ (see below). The derivation is otherwise the same.
  • In this case, the proportionality becomes an equality.
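
For reference, the average-reward objective (as defined in Sutton and Barto, added here for completeness; $\mu$ is the steady-state distribution under $\pi$):

$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\!\left[R_t \mid S_0, A_{0:t-1} \sim \pi\right] = \sum_s \mu(s) \sum_a \pi(a \mid s, \boldsymbol{\theta}) \sum_{s', r} p(s', r \mid s, a)\, r$$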

Advantage function

  • The advantage function is defined as

    $$A_{\pi}(s, a) = q_{\pi}(s, a) - v_{\pi}(s)$$

    Intuitively, it tells us how much better action $a$ is compared to the average action taken in state $s$ under $\pi$ (a code sketch for estimating it follows after this list).

  • We can reformulate the policy gradient using the advantage function (see The General Policy Gradient Theorem).

  • The advantage function formulation is equivalent to the baselined formulation because

    • $q_{\pi}(s, a) - v_{\pi}(s) = A_{\pi}(s, a)$, which follows by definition of $A_{\pi}$.
    • $b(s) = v_{\pi}(s)$, which must be asserted as true (it is a choice of baseline, not a consequence).
  • In general,⁵ we can frame the expected return of a new policy $\tilde{\pi}$ in terms of the advantage over the old policy $\pi$ accumulated over time steps:

    $$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t)\right]$$

    • An important corollary: any new policy that has a non-negative expected advantage at every state is guaranteed to increase the policy performance or leave it constant.

    • In an approximate setting, we may encounter errors for which the expected advantage of some state is negative. A local approximation is then given as follows:

      $$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)$$

      which uses the state-visitation frequencies $\rho_{\pi}$ of the old policy $\pi$ rather than those of $\tilde{\pi}$. If $\pi_{\boldsymbol{\theta}}$ is parameterized and differentiable on $\boldsymbol{\theta}$, then we have that

      $$L_{\pi_{\boldsymbol{\theta}_0}}(\pi_{\boldsymbol{\theta}_0}) = \eta(\pi_{\boldsymbol{\theta}_0}), \qquad \nabla_{\boldsymbol{\theta}} L_{\pi_{\boldsymbol{\theta}_0}}(\pi_{\boldsymbol{\theta}})\,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} = \nabla_{\boldsymbol{\theta}}\, \eta(\pi_{\boldsymbol{\theta}})\,\Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}_0}$$

      which means that a sufficiently small step from $\boldsymbol{\theta}_0$ that improves $L_{\pi_{\boldsymbol{\theta}_0}}$ will also improve $\eta$.
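
A sketch of two common ways to estimate $A_\pi(s_t, a_t)$ from a recorded episode, assuming `rewards` has length $T$ and `v_hat` holds value estimates for all $T+1$ visited states (terminal included); these conventions and names are assumptions for illustration:

```python
import numpy as np

def advantages_from_returns(rewards, v_hat, gamma=0.99):
    """A_t ~= G_t - v_hat(s_t), with G_t the discounted return from step t."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = np.array(returns[::-1])
    return returns - v_hat[:-1]

def advantages_from_td(rewards, v_hat, gamma=0.99):
    """A_t ~= r_t + gamma * v_hat(s_{t+1}) - v_hat(s_t)  (the TD residual)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards + gamma * v_hat[1:] - v_hat[:-1]

rewards = [1.0, 0.0, 2.0]
v_hat = np.array([0.5, 0.4, 1.0, 0.0])   # value estimates for s_0..s_3 (terminal = 0)
print(advantages_from_returns(rewards, v_hat))
print(advantages_from_td(rewards, v_hat))
```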

The General Policy Gradient Theorem

  • We can generalize the Policy Gradient Theorem as follows. Let $J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}[R(\tau)]$ be the performance measure, specifically defined as the expected total reward of a trajectory $\tau$.
  • All policy gradient theorems follow the form

    $$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}\!\left[\sum_{t=0}^{T} \Psi_t\, \nabla_{\boldsymbol{\theta}} \log \pi(a_t \mid s_t, \boldsymbol{\theta})\right]$$

    where $\Psi_t$ depends on our algorithm (a code sketch follows below).
    • $\Psi_t = R(\tau)$ - total reward of the trajectory.
    • $\Psi_t = \sum_{t'=t}^{T} r_{t'}$ - reward following action $a_t$.
    • $\Psi_t = \sum_{t'=t}^{T} r_{t'} - b(s_t)$ - baselined reward after action $a_t$.
    • $\Psi_t = q_{\pi}(s_t, a_t)$ - state-action value function.
    • $\Psi_t = A_{\pi}(s_t, a_t)$ - advantage function.
    • $\Psi_t = r_t + \gamma v_{\pi}(s_{t+1}) - v_{\pi}(s_t)$ - TD residual.
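
A small sketch computing several of the listed $\Psi_t$ choices from one trajectory, under the same assumed array conventions as the advantage sketch above (names are illustrative):

```python
import numpy as np

def psi_choices(rewards, v_hat, gamma=1.0):
    """Return a few Psi_t options for a trajectory of length T.
    rewards: r_0..r_{T-1}; v_hat: value estimates for s_0..s_T (terminal included)."""
    r = np.asarray(rewards, dtype=float)
    T = len(r)
    total = np.full(T, ((gamma ** np.arange(T)) * r).sum())      # Psi_t = R(tau)
    reward_to_go = np.array([((gamma ** np.arange(T - t)) * r[t:]).sum()
                             for t in range(T)])                 # Psi_t = sum_{t'>=t} r_{t'}
    baselined = reward_to_go - v_hat[:-1]                        # reward-to-go minus b(s_t)
    td_residual = r + gamma * v_hat[1:] - v_hat[:-1]             # r_t + gamma v(s_{t+1}) - v(s_t)
    return total, reward_to_go, baselined, td_residual

for psi in psi_choices([1.0, 0.0, 2.0], np.array([0.5, 0.4, 1.0, 0.0]), gamma=0.9):
    print(psi)
```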

Footnotes

  1. Sutton and Barto use $J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0)$, but if we are measuring expected reward, we should average over all the states. The reason why we can just sum over $s_0$ is that the environment is Markovian, so we can simply get “all trajectories” just from using any possible starting state.

  2. In practice, we don’t really care about this because it will be weighted by a step size anyway.

  3. The proportionality-constant part of the formula follows by multiplying the RHS of the equation with $\frac{\sum_{s'} \eta(s')}{\sum_{s'} \eta(s')}$, where $\eta(s)$ is the expected number of visits to $s$ per episode and $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$.

  4. See more here for a demonstration that the baseline does not introduce bias.

  5. Kakade and Langford (2002). Approximately Optimal Approximate Reinforcement Learning.