- $t$ - discrete time step
- $T$, $T(t)$ - final time step of an episode, or of the episode including time step $t$.
- $\tau$ - a trajectory.
- $S_t$ - state at time $t$.
- $A_t$ - action at time $t$.
- $\mathcal{S}$ - set of all non-terminal states.
- $\mathcal{S}^+$ - set of all states (including terminal states).
- $\mathcal{A}$ - set of all actions.
- $\mathcal{A}(s)$ - set of all actions available from state $s$.
- $N_t(a)$ - number of times action $a$ is taken prior to time $t$.
- $\eta(s)$ - expected number of visits of $\pi$ to state $s$.
- $d_\pi(s)$ - probability of visiting $s$ under policy $\pi$.
- $R_t$ - reward at time $t$.
- $R(s, a, s')$ - reward obtained by going from $s$ to $s'$ via action $a$.
- $\hat{R}_t$ - estimate of a reward at time $t$.
- $\mathcal{R}$ - set of all possible rewards.
- $R(\tau)$ - total reward obtained in a trajectory $\tau$.
- $r(\pi)$ - average reward from following policy $\pi$.
- $r(s, a)$ - expected immediate reward obtained from state $s$ taking action $a$.
- $r(s, a, s')$ - expected immediate reward obtained on transitioning from $s$ to $s'$ via $a$.
- $p(s', r \mid s, a)$ - (hidden) environment dynamics: the probability of going to $s'$ and receiving reward $r$ by taking action $a$ from state $s$.
- $p_t(s' \mid s)$ - probability of transitioning from $s$ to $s'$ at time step $t$.
- $\pi$ - policy.
- $b$ - in the context of off-policy learning, the behavior policy.
- $\pi(s)$ - action taken under (deterministic) policy $\pi$ at state $s$.
- $\pi(a \mid s)$ - probability of taking action $a$ in state $s$.
- $\pi_t(a)$ - probability of selecting action $a$ at time $t$.
- $\pi_*$ - optimal policy.
- $\pi_\theta$ - policy parameterized by $\theta$.
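For concreteness, $p(s', r \mid s, a)$ can be represented as a table of outcome probabilities per state-action pair. The sketch below samples a transition from such a table; the two-state MDP and the `step` helper are invented purely for illustration.

```python
import random

# Toy dynamics table for p(s', r | s, a): each (s, a) maps to (probability, s', r) triples.
P = {
    ("s0", "a0"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 0.0)],
}

def step(s, a):
    """Sample a next state and reward (s', r) from p(s', r | s, a)."""
    probs = [p for p, _, _ in P[(s, a)]]
    outcomes = [(s2, r) for _, s2, r in P[(s, a)]]
    return random.choices(outcomes, weights=probs, k=1)[0]

print(step("s0", "a0"))
```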
- $\varepsilon$ - in an $\varepsilon$-greedy policy, denotes the degree of exploration.
- $\alpha$, $\beta$ - step-size parameters.
- $\gamma$ - discount-rate parameter.
- $\gamma_t$ - discount at time $t$.
- $\lambda$ - trace-decay rate for eligibility traces.
- $\lambda_t$ - trace decay at time $t$.
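A minimal sketch of how $\varepsilon$ governs exploration in an $\varepsilon$-greedy policy, assuming a tabular action-value array `Q` indexed by state and action (the array and its shape are assumptions of the example):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over actions
    return int(np.argmax(Q[state]))            # exploit: argmax_a Q(state, a)

# Toy usage: 5 states, 3 actions, 10% exploration.
Q = np.zeros((5, 3))
action = epsilon_greedy(Q, state=2, epsilon=0.1)
```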
- $v_\pi(s)$ - true state-value function for policy $\pi$.
- $v_*(s)$ - optimal true value function.
- $q_\pi(s, a)$ - true state-action value function for policy $\pi$.
- $q_*(s, a)$ - true optimal state-action value function.
- $B_\pi$ - Bellman operator for value functions.
- $A_\pi(s, a)$ - the advantage function for $\pi$.
- $\hat{A}_t$ - estimated advantage. [1]
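Under the usual definition, the advantage is the gap between the action value and the state value, $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$, so a toy tabular sketch is enough to illustrate it (the `Q` and `V` tables below are invented for the example):

```python
import numpy as np

def advantage(Q, V, s, a):
    """A(s, a) = q(s, a) - v(s): how much better a is than the policy's average at s."""
    return Q[s, a] - V[s]

# Toy tables (2 states, 2 actions), invented for the example.
Q = np.array([[1.0, 2.0], [0.5, 0.0]])
V = np.array([1.5, 0.25])
print(advantage(Q, V, s=0, a=1))   # 0.5
```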
- $\bar{V}_t(s)$ - expected approximate action value at time $t$ for policy $\pi$.
- $Q_t(s, a)$ - estimated approximate action value at time $t$ for policy $\pi$.
- $G_t$ - return at time $t$.
- $h$ - horizon: the time step we look up to during a forward view.
- $G_{t:t+n}$ - $n$-step return from $t+1$ to $t+n$.
- $\bar{G}_{t:h}$ - flat return (undiscounted, uncorrected).
- $G_t^\lambda$ - $\lambda$-return.
- $G_{t:h}^\lambda$ - truncated, corrected $\lambda$-return.
- $G_t^{\lambda s}$, $G_t^{\lambda a}$ - $\lambda$-return, corrected by estimated state or action values.
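As a sketch of how these returns relate in the episodic case, the snippet below computes an $n$-step return (discounted rewards plus a bootstrapped value at the horizon) and the $\lambda$-return as the $(1-\lambda)$-weighted mixture of all $n$-step returns; the reward and value lists are assumed inputs for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: up to n discounted rewards, plus a bootstrapped value at the horizon."""
    T = len(rewards)                       # terminal time step of the episode
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                            # horizon reached before the terminal state
        G += gamma ** n * values[end]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: (1 - lambda)-weighted mixture of all n-step returns from time t."""
    T = len(rewards)
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    return G + lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)

rewards = [0.0, 0.0, 1.0]    # toy episode of length T = 3
values = [0.5, 0.6, 0.9]     # V estimates at the visited states
print(lambda_return(rewards, values, t=0, gamma=0.99, lam=0.9))
```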
- $\mu(s)$ - on-policy distribution over state $s$.
- $\boldsymbol{\mu}$ - vector of all $\mu(s)$'s.
- $\mu_\pi(s)$ - the steady-state distribution following the average-reward formulation of the return.
- $\rho_t$ - importance sampling ratio, taken to be the per-step importance sampling ratio.
- $\rho_{t:h}$ - importance sampling ratio from time $t$ to $h$.
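The importance sampling ratio is the product of per-step probability ratios $\pi(A_k \mid S_k) / b(A_k \mid S_k)$; a minimal sketch, assuming the per-step action probabilities under both policies are already available as arrays:

```python
import numpy as np

def importance_ratio(pi_probs, b_probs, t, h):
    """rho_{t:h} = prod_{k=t}^{h} pi(A_k | S_k) / b(A_k | S_k)."""
    pi_probs, b_probs = np.asarray(pi_probs), np.asarray(b_probs)
    return float(np.prod(pi_probs[t:h + 1] / b_probs[t:h + 1]))

# The per-step ratio rho_t is the special case h = t.
print(importance_ratio([0.9, 0.5], [0.5, 0.5], t=0, h=0))   # 1.8
```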
- $U_t$ - target for the estimate at time $t$.
- $\delta_t$ - temporal-difference (TD) error at $t$.
- $\delta_t^s$, $\delta_t^a$ - state- and action-specific forms of the TD error.
- $\bar{\delta}_{\mathbf{w}}(s)$ - Bellman error at state $s$.
- $\bar{\delta}_{\mathbf{w}}$ - Bellman error vector.
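A small sketch of the TD(0) case, where the target is $U_t = R_{t+1} + \gamma V(S_{t+1})$ and the TD error is $\delta_t = U_t - V(S_t)$; the dictionary-based value table is an assumption of the example.

```python
def td_error(V, s, reward, s_next, gamma, terminal=False):
    """delta_t = U_t - V(S_t), with TD(0) target U_t = R_{t+1} + gamma * V(S_{t+1})."""
    target = reward + (0.0 if terminal else gamma * V[s_next])
    return target - V[s]

# One tabular TD(0) update: V(S_t) <- V(S_t) + alpha * delta_t.
V = {0: 0.0, 1: 0.5}
delta = td_error(V, s=0, reward=1.0, s_next=1, gamma=0.9)
V[0] += 0.1 * delta
```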
- $\mathbf{w}$, $\mathbf{w}_t$ - $d$-dimensional vector underlying an approximate value function.
- $\hat{v}(s, \mathbf{w})$ - approximate value of state $s$ given parameters $\mathbf{w}$.
- $\hat{q}(s, a, \mathbf{w})$ - approximate value of state-action pair $(s, a)$ given $\mathbf{w}$.
- $\hat{v}_{\mathbf{w}}$ - an estimate parameterized with $\mathbf{w}$.
- $\mathbf{x}(s)$ - feature vector visible when in state $s$.
- $\mathbf{x}(s, a)$ - feature vector visible when in state-action pair $(s, a)$.
- $\mathbf{z}_t$ - eligibility trace.
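In the common linear special case, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and the accumulating eligibility trace is $\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \mathbf{x}(S_t)$. The sketch below shows one semi-gradient TD($\lambda$) update under those assumptions; the one-hot features and step size are invented for the example.

```python
import numpy as np

def v_hat(x_s, w):
    """Linear approximation: v_hat(s, w) = w^T x(s)."""
    return float(np.dot(w, x_s))

def td_lambda_step(w, z, x_s, x_next, reward, gamma, lam, alpha, terminal=False):
    """One semi-gradient TD(lambda) update with an accumulating eligibility trace z."""
    z = gamma * lam * z + x_s                        # z_t = gamma * lambda * z_{t-1} + x(S_t)
    v_next = 0.0 if terminal else v_hat(x_next, w)
    delta = reward + gamma * v_next - v_hat(x_s, w)  # TD error delta_t
    w = w + alpha * delta * z                        # w_{t+1} = w_t + alpha * delta_t * z_t
    return w, z

# Toy one-hot features for a 4-state problem, invented for the example.
w, z = np.zeros(4), np.zeros(4)
x0, x1 = np.eye(4)[0], np.eye(4)[1]
w, z = td_lambda_step(w, z, x0, x1, reward=1.0, gamma=0.99, lam=0.9, alpha=0.1)
```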
- $J(\boldsymbol{\theta}_t)$ - performance measure using the parameter vector at time $t$; typically, this denotes the expected return.
- $J(\pi)$ - expected return of policy $\pi$.
- $L_\pi(\pi')$ - approximation to $J(\pi')$ using $\pi$ for generating samples.
- $b(s)$ - baseline function, assumed to depend only on the state estimate.
- $\overline{VE}(\mathbf{w})$ - mean square value error.
- $\|v\|_\mu^2$ - $\mu$-weighted square norm of the value function $v$.
- $\overline{BE}(\mathbf{w})$ - mean squared Bellman error.
- $\overline{PBE}(\mathbf{w})$ - mean squared Projected Bellman error.
- $\overline{TDE}(\mathbf{w})$ - mean squared Temporal Difference error.
- $\overline{RE}(\mathbf{w})$ - mean squared Return error.
- $\hat{v}_{\mathbf{w}}$, $\pi_{\boldsymbol{\theta}}$ - a critic and an actor in an actor-critic model. Generally we will use $\mathbf{w}$ for the critic's parameters and $\boldsymbol{\theta}$ for the actor's.
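As one concrete instance of these error measures, the mean square value error weights the squared gap between $v_\pi(s)$ and $\hat{v}(s, \mathbf{w})$ by the state distribution $\mu(s)$; a minimal sketch, assuming all three quantities are given as arrays over states:

```python
import numpy as np

def msve(mu, v_true, v_approx):
    """Mean square value error: sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    mu, v_true, v_approx = map(np.asarray, (mu, v_true, v_approx))
    return float(np.sum(mu * (v_true - v_approx) ** 2))

# Toy numbers over three states, invented for the example.
print(msve(mu=[0.5, 0.3, 0.2], v_true=[1.0, 0.0, -1.0], v_approx=[0.8, 0.1, -0.7]))
```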
- $\mathcal{D}$ - experience replay buffer.
- $\mathcal{B}$ - batch.
- $Q^{\text{ret}}$ - the Retrace estimate for the $Q$ function.
- $e$ - an episode step.
- $\rho$ - behavior distribution.
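A minimal sketch of an experience replay buffer $\mathcal{D}$ with uniform sampling of a minibatch $\mathcal{B}$; the capacity and transition-tuple layout are assumptions of the example, not a reference implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay D with uniform sampling of a minibatch B."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

D = ReplayBuffer()
D.add(0, 1, 1.0, 1, False)
batch = D.sample(1)
```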
- $\mu_\theta$ - a deterministic policy parameterized with $\theta$.
- $p_1(s)$ - initial distribution over states in the context of DPG.
- $\rho^\mu(s)$ - discounted state distribution.
- $r_t(\theta)$ - probability ratio (in the context of TRPO and PPO).
- $L_t^{\text{CLIP}}(\theta)$ - clip surrogate objective at time $t$ (see PPO).
- $L_t^{\text{KLPEN}}(\theta)$ - KL divergence penalty objective at time $t$ (see PPO).
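A sketch of the PPO clipped surrogate, $L_t^{\text{CLIP}}(\theta) = \min\big(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)$, assuming the probability ratios and advantage estimates have already been computed:

```python
import numpy as np

def clip_surrogate(ratio, advantage, clip_eps=0.2):
    """L_t^CLIP = min(r_t * A_hat_t, clip(r_t, 1 - eps, 1 + eps) * A_hat_t)."""
    ratio, advantage = np.asarray(ratio), np.asarray(advantage)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), precomputed here.
print(clip_surrogate(ratio=[0.8, 1.5], advantage=[1.0, 1.0]))   # [0.8, 1.2]
```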
- $V_{\text{soft}}$, $Q_{\text{soft}}$ - denote soft value functions.
- $\alpha$ - (in the context of energy-based RL) denotes the temperature hyperparameter.
- $\mathcal{T}$ - soft Bellman operator.
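In the energy-based setting the soft state value is the temperature-scaled log-sum-exp of the soft action values, $V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q_{\text{soft}}(s,a)/\alpha)$; a numerically stabilized sketch, with the toy $Q$ row invented for the example:

```python
import numpy as np

def soft_value(q_soft_row, temperature):
    """V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha), computed stably."""
    q = np.asarray(q_soft_row) / temperature
    m = q.max()                                   # shift for numerical stability
    return float(temperature * (m + np.log(np.sum(np.exp(q - m)))))

# As the temperature alpha -> 0 this approaches max_a Q_soft(s, a).
print(soft_value([1.0, 2.0, 3.0], temperature=0.5))
```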
- subscript $q$ - relating to the action-value function $q$.
- subscript $v$ - relating to the state-value function $v$.
Footnotes
1. Capital letters typically indicate that the estimate operates on an entire vector, but the logic should remain the same.