• $t$ - discrete time step
  • $T$, $T(t)$ - final time step of an episode, or of the episode including time step $t$.

  • $\tau$ - a trajectory.
  • $S_t$ - state at time $t$
  • $A_t$ - action at time $t$
  • $\mathcal{S}$ - set of all non-terminal states
  • $\mathcal{S}^+$ - set of all states (including terminal states).
  • $\mathcal{A}$ - set of all actions
  • $\mathcal{A}(s)$ - set of all actions available from state $s$.

  • $N_t(a)$ - number of times action $a$ has been taken prior to time $t$.
  • $\eta(s)$ - expected number of visits to state $s$.
  • - probability of visiting a state under policy $\pi$

  • $R_t$ - reward at time $t$
  • - reward obtained by going from $s$ to $s'$ via action $a$.
  • $\bar{R}_t$ - estimate of the reward at time $t$.
  • $\mathcal{R}$ - set of all possible rewards.
  • $R(\tau)$ - total reward obtained in a trajectory
  • $r(\pi)$ - average reward from following policy $\pi$
  • $r(s, a)$ - expected immediate reward obtained from state $s$ taking action $a$.
  • $r(s, a, s')$ - expected immediate reward obtained on transitioning from $s$ to $s'$ via $a$ (see the expansions below).
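
For reference, the last two expectations can be written in terms of the dynamics $p$ (standard definitions, following Sutton & Barto's conventions):

$$
r(s, a) \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a\right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a),
\qquad
r(s, a, s') \doteq \mathbb{E}\left[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'\right] = \sum_{r \in \mathcal{R}} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}.
$$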

  • $p(s', r \mid s, a)$ - (hidden) environment dynamics. The probability of going to $s'$ and receiving reward $r$ by taking action $a$ from state $s$ (see the definition below).
  • $p(s' \mid s, a)$ - probability of transitioning from $s$ to $s'$ at time step $t$.
  • $\pi$ - policy
  • $b$ - in the context of off-policy learning, the behavior policy.
  • $\pi(s)$ - action taken under (deterministic) policy $\pi$ at state $s$
  • $\pi(a \mid s)$ - probability of taking action $a$ in state $s$.
  • $\pi_t(a)$ - probability of selecting action $a$ at time $t$.
  • $\pi_*$ - optimal policy
  • $\pi_\theta$ - policy parameterized by $\theta$.
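
As a reminder, the dynamics entries above are shorthand for (assuming the standard MDP formulation):

$$
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\},
\qquad
p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a).
$$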

  • $\varepsilon$ - in an $\varepsilon$-greedy policy, denotes the degree of exploration (see the sketch below).
  • $\alpha$, $\beta$ - step size parameters
  • $\gamma$ - discount rate parameter
  • $\gamma_t$ - discount at time $t$.
  • $\lambda$ - trace decay rate for eligibility traces.
  • $\lambda_t$ - trace decay at time $t$.
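
As a sketch of how $\varepsilon$ controls exploration, the usual $\varepsilon$-greedy policy with respect to an action-value estimate $Q$ is:

$$
\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a'), \\[4pt]
\dfrac{\varepsilon}{|\mathcal{A}(s)|} & \text{otherwise.}
\end{cases}
$$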

  • $v_\pi(s)$ - true state value function for policy $\pi$ (see the definitions below).
  • $v_*(s)$ - optimal true value function.
  • $q_\pi(s, a)$ - true state-action value function for policy $\pi$.
  • $q_*(s, a)$ - true optimal state-action value function.
  • $B_\pi$ - Bellman operator for value functions
  • $A^\pi(s, a)$ - the advantage function for $\pi$
  • $\hat{A}_t$ - estimated advantage. [1]
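
The value functions and the advantage are related by the standard definitions (using $A^\pi$ for the advantage here, which is an assumption about the original symbol):

$$
v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right],
\qquad
q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right],
\qquad
A^\pi(s, a) \doteq q_\pi(s, a) - v_\pi(s).
$$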

  • $\bar{V}_t(s)$ - expected approximate action value at time $t$ for policy $\pi$ (see below)
  • $Q_t(s, a)$ - estimated approximate action value at time $t$ for policy $\pi$
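
For reference, the expected approximate value is typically defined by averaging the action-value estimates under the policy (Sutton & Barto's convention):

$$
\bar{V}_t(s) \doteq \sum_{a} \pi(a \mid s) \, Q_t(s, a).
$$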

  • $G_t$ - return at time $t$ (see the definitions below).
  • $h$ - horizon, the time step we look up to during a forward view.
  • $G_{t:t+n}$, $G_{t:h}$ - n-step return from $t+1$ to $t+n$, or to $h$.
  • $\bar{G}_{t:h}$ - flat return (undiscounted, uncorrected).
  • $G_t^\lambda$ - $\lambda$-return.
  • $G_{t:h}^\lambda$ - truncated, corrected $\lambda$-return.
  • $G_t^{\lambda s}$, $G_t^{\lambda a}$ - $\lambda$-return, corrected by estimated state or action values.
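
These returns follow the standard definitions; as a sketch (discounted episodic case, bootstrapping from an approximate state value):

$$
G_t \doteq \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},
\qquad
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \mathbf{w}),
\qquad
G_t^\lambda \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}.
$$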

  • $\mu(s)$ - on-policy distribution over state $s$.
  • $\boldsymbol{\mu}$ - vector of all the $\mu(s)$'s.
  • $\mu_\pi$ - the steady state distribution under the average reward formulation of the return.
  • $\rho_t$ - importance sampling ratio, taken to be the per-step importance sampling ratio (see below).
  • $\rho_{t:h}$ - importance sampling ratio from time $t$ to $h$.
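
With $\pi$ the target policy and $b$ the behavior policy, the importance sampling ratios are:

$$
\rho_{t:h} \doteq \prod_{k=t}^{h} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
\rho_t \doteq \rho_{t:t} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}.
$$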

  • $U_t$ - target for the estimate at time $t$.
  • $\delta_t$ - temporal difference error at $t$ (see below).
  • $\delta_t^s$, $\delta_t^a$ - state- and action-specific forms of the TD error
  • $\bar{\delta}_{\mathbf{w}}(s)$ - Bellman error.
  • $\bar{\delta}_{\mathbf{w}}$ - Bellman error vector
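
As a sketch of the relationship (state-value prediction case): the TD error is a sampled quantity, and the Bellman error is its expectation under the policy:

$$
\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}),
\qquad
\bar{\delta}_{\mathbf{w}}(s) \doteq \mathbb{E}_\pi\left[\delta_t \mid S_t = s\right].
$$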

  • $\mathbf{w}$ - $d$-dimensional vector underlying an approximate value function.
  • $\hat{v}(s, \mathbf{w})$ - approximate value of state $s$ given parameters $\mathbf{w}$.
  • $\hat{q}(s, a, \mathbf{w})$ - approximate value of state-action pair $(s, a)$ given $\mathbf{w}$.
  • $v_{\mathbf{w}}$ - estimate parameterized with $\mathbf{w}$
  • $\mathbf{x}(s)$ - feature vector visible when in state $s$.
  • $\mathbf{x}(s, a)$ - feature vector visible when in state-action pair $(s, a)$
  • $\mathbf{z}_t$ - eligibility trace (see the linear-case sketch below)
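
A minimal sketch for the linear case with accumulating traces (assuming the TD($\lambda$) setting):

$$
\hat{v}(s, \mathbf{w}) \doteq \mathbf{w}^\top \mathbf{x}(s),
\qquad
\mathbf{z}_t \doteq \gamma \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t),
\qquad
\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t.
$$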

  • $J(\boldsymbol{\theta}_t)$ - performance measure using the parameter vector at time $t$. Typically, this denotes the expected return.
  • $\eta(\pi)$ - expected return of policy $\pi$
  • $L_{\pi_{\text{old}}}(\pi)$ - approximation to $\eta(\pi)$ using $\pi_{\text{old}}$ for generating samples.
  • $b(s)$ - baseline function, assumed to depend only on the state estimate.
  • $\overline{VE}(\mathbf{w})$ - mean square value error (see below).
  • $\lVert v \rVert_\mu^2$ - $\mu$-weighted square norm of the value function $v$
  • $\overline{BE}(\mathbf{w})$ - mean squared Bellman error
  • $\overline{PBE}(\mathbf{w})$ - mean squared projected Bellman error.
  • $\overline{TDE}(\mathbf{w})$ - mean squared temporal difference error.
  • $\overline{RE}(\mathbf{w})$ - mean squared return error.
  • $\hat{v}(s, \mathbf{w})$, $\pi(a \mid s, \boldsymbol{\theta})$ - a critic and an actor in an actor-critic model. Generally we will use $\mathbf{w}$ for the critic's parameters and $\boldsymbol{\theta}$ for the actor's.
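
For example, the mean square value error is the $\mu$-weighted squared error between the true and approximate value functions:

$$
\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2.
$$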

  • $\mathcal{D}$ - experience replay (the replay memory).
  • $\mathcal{B}$ - batch.
  • $Q^{\text{ret}}$ - the retrace estimate for the $Q$ function.
  • $e_t$ - an episode step (see the convention sketch below).
  • $\rho$ - behavior distribution
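
The symbols in this group follow one common convention (the DQN paper's) and should be read as an assumption rather than the only notation in use. There, an experience tuple and the replay memory are:

$$
e_t \doteq (s_t, a_t, r_t, s_{t+1}),
\qquad
\mathcal{D}_t \doteq \{e_1, \ldots, e_t\}.
$$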

  • $\mu_\theta$ - a deterministic policy parameterized with $\theta$.
  • $p_1(s)$ - initial distribution over states in the context of DPG
  • $\rho^\mu(s)$ - discounted state distribution
  • $r_t(\theta)$ - probability ratio (in the context of TRPO and PPO)
  • $L_t^{CLIP}(\theta)$ - clip surrogate objective at time $t$ (see PPO; sketched below)
  • $L_t^{KLPEN}(\theta)$ - KL divergence penalty objective (see PPO) at time $t$
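
Following the PPO paper, the per-time-step terms are as below; the full objectives average these over time steps, $\epsilon$ here is the clip range (not the exploration parameter above), and $\beta$ is the KL penalty coefficient:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L_t^{CLIP}(\theta) = \min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right),
\qquad
L_t^{KLPEN}(\theta) = r_t(\theta)\hat{A}_t - \beta \, \mathrm{KL}\!\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right].
$$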

  • $V_{\text{soft}}$, $Q_{\text{soft}}$ - the subscript denotes soft value functions.
  • $\alpha$ - (in the context of energy-based RL) denotes the temperature hyperparameter.
  • $\mathcal{T}^\pi$ - soft Bellman operator (see below)
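
A sketch of the soft Bellman backup, following the soft actor-critic formulation (assuming $\alpha$ as the temperature):

$$
\mathcal{T}^\pi Q_{\text{soft}}(s_t, a_t) \doteq r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}}\!\left[ V_{\text{soft}}(s_{t+1}) \right],
\qquad
V_{\text{soft}}(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[ Q_{\text{soft}}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right].
$$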

  • $q$, $Q$ - relating to the action-value function
  • $v$, $V$ - relating to the state-value function

Footnotes

  1. Capital letters typically indicate that the estimate operates on an entire vector, but the logic remains the same.