- $t$ - discrete time step
- $T$, $T(t)$ - final time step of an episode, or of the episode including time step $t$.
- $\tau$ - a trajectory.
- $S_t$ - state at time $t$.
- $A_t$ - action at time $t$.
- $\mathcal{S}$ - set of all non-terminal states.
- $\mathcal{S}^+$ - set of all states (including terminal states).
- $\mathcal{A}$ - set of all actions.
- $\mathcal{A}(s)$ - set of all actions available from state $s$.
- $N_t(a)$ - number of times action $a$ is taken prior to time $t$.
- $\eta(s)$ - expected number of visits of $\pi$ to state $s$.
- $d_\pi(s)$ - probability of visiting $s$ under policy $\pi$.
- $R_t$ - reward at time $t$.
- $R(s, a, s')$ - reward obtained by going from $s$ to $s'$ via action $a$.
- $\hat{R}_t$ - estimate of a reward at time $t$.
- $\mathcal{R}$ - set of all possible rewards.
- $R(\tau)$ - total reward obtained in a trajectory $\tau$.
- $r(\pi)$ - average reward from following policy $\pi$.
- $r(s, a)$ - expected immediate reward obtained from state $s$ taking action $a$.
- $r(s, a, s')$ - expected immediate reward obtained on transitioning from $s$ to $s'$ via $a$.
- $p(s', r \mid s, a)$ - (hidden) environment dynamics: the probability of going to $s'$ and receiving reward $r$ by taking action $a$ from state $s$.
- $p_t(s' \mid s)$ - probability of transitioning from $s$ to $s'$ at time step $t$.
- $\pi$ - policy.
- $b$ - in the context of off-policy learning, the behavior policy.
- $\pi(s)$ - action taken under (deterministic) policy $\pi$ at state $s$.
- $\pi(a \mid s)$ - probability of taking action $a$ in state $s$.
- $\pi_t(a)$ - probability of selecting action $a$ at time $t$.
- $\pi_*$ - optimal policy.
- $\pi_\theta$ - policy parameterized by $\theta$.
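For concreteness, $p(s', r \mid s, a)$ can be represented as a table of outcome probabilities per state-action pair. The sketch below samples a transition from such a table; the two-state MDP and the `step` helper are invented purely for illustration.

```python
import random

# Toy dynamics table for p(s', r | s, a): each (s, a) maps to (probability, s', r) triples.
P = {
    ("s0", "a0"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s0", 0.0)],
}

def step(s, a):
    """Sample a next state and reward (s', r) from p(s', r | s, a)."""
    probs = [p for p, _, _ in P[(s, a)]]
    outcomes = [(s2, r) for _, s2, r in P[(s, a)]]
    return random.choices(outcomes, weights=probs, k=1)[0]

print(step("s0", "a0"))
```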
- $\varepsilon$ - in an $\varepsilon$-greedy policy, denotes the degree of exploration.
- $\alpha$, $\beta$ - step-size parameters.
- $\gamma$ - discount-rate parameter.
- $\gamma_t$ - discount at time $t$.
- $\lambda$ - trace-decay rate for eligibility traces.
- $\lambda_t$ - trace decay at time $t$.
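A minimal sketch of how $\varepsilon$ governs exploration in an $\varepsilon$-greedy policy, assuming a tabular action-value array `Q` indexed by state and action (the array and its shape are assumptions of the example):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over actions
    return int(np.argmax(Q[state]))            # exploit: argmax_a Q(state, a)

# Toy usage: 5 states, 3 actions, 10% exploration.
Q = np.zeros((5, 3))
action = epsilon_greedy(Q, state=2, epsilon=0.1)
```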
- $v_\pi(s)$ - true state-value function for policy $\pi$.
- $v_*(s)$ - optimal true value function.
- $q_\pi(s, a)$ - true state-action value function for policy $\pi$.
- $q_*(s, a)$ - true optimal state-action value function.
- $B_\pi$ - Bellman operator for value functions.
- $A_\pi(s, a)$ - the advantage function for $\pi$.
- $\hat{A}_t$ - estimated advantage. [1]
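Under the usual definition, the advantage is the gap between the action value and the state value, $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$, so a toy tabular sketch is enough to illustrate it (the `Q` and `V` tables below are invented for the example):

```python
import numpy as np

def advantage(Q, V, s, a):
    """A(s, a) = q(s, a) - v(s): how much better a is than the policy's average at s."""
    return Q[s, a] - V[s]

# Toy tables (2 states, 2 actions), invented for the example.
Q = np.array([[1.0, 2.0], [0.5, 0.0]])
V = np.array([1.5, 0.25])
print(advantage(Q, V, s=0, a=1))   # 0.5
```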
- $\bar{V}_t(s)$ - expected approximate action value at time $t$ for policy $\pi$.
- $Q_t(s, a)$ - estimated approximate action value at time $t$ for policy $\pi$.
- $G_t$ - return at time $t$.
- $h$ - horizon: the time step we look up to during a forward view.
- $G_{t:t+n}$ - $n$-step return from $t+1$ to $t+n$.
- $\bar{G}_{t:h}$ - flat return (undiscounted, uncorrected).
- $G_t^\lambda$ - $\lambda$-return.
- $G_{t:h}^\lambda$ - truncated, corrected $\lambda$-return.
- $G_t^{\lambda s}$, $G_t^{\lambda a}$ - $\lambda$-return, corrected by estimated state or action values.
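As a sketch of how these returns relate in the episodic case, the snippet below computes an $n$-step return (discounted rewards plus a bootstrapped value at the horizon) and the $\lambda$-return as the $(1-\lambda)$-weighted mixture of all $n$-step returns; the reward and value lists are assumed inputs for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: up to n discounted rewards, plus a bootstrapped value at the horizon."""
    T = len(rewards)                       # terminal time step of the episode
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                            # horizon reached before the terminal state
        G += gamma ** n * values[end]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: (1 - lambda)-weighted mixture of all n-step returns from time t."""
    T = len(rewards)
    G = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    return G + lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)

rewards = [0.0, 0.0, 1.0]    # toy episode of length T = 3
values = [0.5, 0.6, 0.9]     # V estimates at the visited states
print(lambda_return(rewards, values, t=0, gamma=0.99, lam=0.9))
```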
- $\mu(s)$ - on-policy distribution over state $s$.
- $\boldsymbol{\mu}$ - vector of all $\mu(s)$'s.
- $\mu_\pi(s)$ - the steady-state distribution following the average-reward formulation of the return.
- $\rho_t$ - importance sampling ratio, taken to be the per-step importance sampling ratio.
- $\rho_{t:h}$ - importance sampling ratio from time $t$ to $h$.
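The importance sampling ratio is the product of per-step probability ratios $\pi(A_k \mid S_k) / b(A_k \mid S_k)$; a minimal sketch, assuming the per-step action probabilities under both policies are already available as arrays:

```python
import numpy as np

def importance_ratio(pi_probs, b_probs, t, h):
    """rho_{t:h} = prod_{k=t}^{h} pi(A_k | S_k) / b(A_k | S_k)."""
    pi_probs, b_probs = np.asarray(pi_probs), np.asarray(b_probs)
    return float(np.prod(pi_probs[t:h + 1] / b_probs[t:h + 1]))

# The per-step ratio rho_t is the special case h = t.
print(importance_ratio([0.9, 0.5], [0.5, 0.5], t=0, h=0))   # 1.8
```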
- $U_t$ - target for the estimate at time $t$.
- $\delta_t$ - temporal-difference (TD) error at $t$.
- $\delta_t^s$, $\delta_t^a$ - state- and action-specific forms of the TD error.
- $\bar{\delta}_{\mathbf{w}}(s)$ - Bellman error at state $s$.
- $\bar{\delta}_{\mathbf{w}}$ - Bellman error vector.
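A small sketch of the TD(0) case, where the target is $U_t = R_{t+1} + \gamma V(S_{t+1})$ and the TD error is $\delta_t = U_t - V(S_t)$; the dictionary-based value table is an assumption of the example.

```python
def td_error(V, s, reward, s_next, gamma, terminal=False):
    """delta_t = U_t - V(S_t), with TD(0) target U_t = R_{t+1} + gamma * V(S_{t+1})."""
    target = reward + (0.0 if terminal else gamma * V[s_next])
    return target - V[s]

# One tabular TD(0) update: V(S_t) <- V(S_t) + alpha * delta_t.
V = {0: 0.0, 1: 0.5}
delta = td_error(V, s=0, reward=1.0, s_next=1, gamma=0.9)
V[0] += 0.1 * delta
```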
- $\mathbf{w}$, $\mathbf{w}_t$ - $d$-dimensional vector underlying an approximate value function.
- $\hat{v}(s, \mathbf{w})$ - approximate value of state $s$ given parameters $\mathbf{w}$.
- $\hat{q}(s, a, \mathbf{w})$ - approximate value of state-action pair $(s, a)$ given $\mathbf{w}$.
- $\hat{v}_{\mathbf{w}}$ - an estimate parameterized with $\mathbf{w}$.
- $\mathbf{x}(s)$ - feature vector visible when in state $s$.
- $\mathbf{x}(s, a)$ - feature vector visible when in state-action pair $(s, a)$.
- $\mathbf{z}_t$ - eligibility trace.
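In the common linear special case, $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ and the accumulating eligibility trace is $\mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \mathbf{x}(S_t)$. The sketch below shows one semi-gradient TD($\lambda$) update under those assumptions; the one-hot features and step size are invented for the example.

```python
import numpy as np

def v_hat(x_s, w):
    """Linear approximation: v_hat(s, w) = w^T x(s)."""
    return float(np.dot(w, x_s))

def td_lambda_step(w, z, x_s, x_next, reward, gamma, lam, alpha, terminal=False):
    """One semi-gradient TD(lambda) update with an accumulating eligibility trace z."""
    z = gamma * lam * z + x_s                        # z_t = gamma * lambda * z_{t-1} + x(S_t)
    v_next = 0.0 if terminal else v_hat(x_next, w)
    delta = reward + gamma * v_next - v_hat(x_s, w)  # TD error delta_t
    w = w + alpha * delta * z                        # w_{t+1} = w_t + alpha * delta_t * z_t
    return w, z

# Toy one-hot features for a 4-state problem, invented for the example.
w, z = np.zeros(4), np.zeros(4)
x0, x1 = np.eye(4)[0], np.eye(4)[1]
w, z = td_lambda_step(w, z, x0, x1, reward=1.0, gamma=0.99, lam=0.9, alpha=0.1)
```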
- $J(\boldsymbol{\theta}_t)$ - performance measure using the parameter vector at time $t$; typically, this denotes the expected return.
- $J(\pi)$ - expected return of policy $\pi$.
- $L_\pi(\pi')$ - approximation to $J(\pi')$ using $\pi$ for generating samples.
- $b(s)$ - baseline function, assumed to depend only on the state estimate.
- $\overline{VE}(\mathbf{w})$ - mean square value error.
- $\|v\|_\mu^2$ - $\mu$-weighted square norm of the value function $v$.
- $\overline{BE}(\mathbf{w})$ - mean squared Bellman error.
- $\overline{PBE}(\mathbf{w})$ - mean squared Projected Bellman error.
- $\overline{TDE}(\mathbf{w})$ - mean squared Temporal Difference error.
- $\overline{RE}(\mathbf{w})$ - mean squared Return error.
- $\hat{v}_{\mathbf{w}}$, $\pi_{\boldsymbol{\theta}}$ - a critic and an actor in an actor-critic model. Generally we will use $\mathbf{w}$ for the critic's parameters and $\boldsymbol{\theta}$ for the actor's.
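As one concrete instance of these error measures, the mean square value error weights the squared gap between $v_\pi(s)$ and $\hat{v}(s, \mathbf{w})$ by the state distribution $\mu(s)$; a minimal sketch, assuming all three quantities are given as arrays over states:

```python
import numpy as np

def msve(mu, v_true, v_approx):
    """Mean square value error: sum_s mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    mu, v_true, v_approx = map(np.asarray, (mu, v_true, v_approx))
    return float(np.sum(mu * (v_true - v_approx) ** 2))

# Toy numbers over three states, invented for the example.
print(msve(mu=[0.5, 0.3, 0.2], v_true=[1.0, 0.0, -1.0], v_approx=[0.8, 0.1, -0.7]))
```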
- $\mathcal{D}$ - experience replay buffer.
- $\mathcal{B}$ - batch.
- $Q^{\text{ret}}$ - the Retrace estimate for the $Q$ function.
- $e$ - an episode step.
- $\rho$ - behavior distribution.
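A minimal sketch of an experience replay buffer $\mathcal{D}$ with uniform sampling of a minibatch $\mathcal{B}$; the capacity and transition-tuple layout are assumptions of the example, not a reference implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay D with uniform sampling of a minibatch B."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

D = ReplayBuffer()
D.add(0, 1, 1.0, 1, False)
batch = D.sample(1)
```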
- $\mu_\theta$ - a deterministic policy parameterized with $\theta$.
- $p_1(s)$ - initial distribution over states in the context of DPG.
- $\rho^\mu(s)$ - discounted state distribution.
- $r_t(\theta)$ - probability ratio (in the context of TRPO and PPO).
- $L_t^{\text{CLIP}}(\theta)$ - clip surrogate objective at time $t$ (see PPO).
- $L_t^{\text{KLPEN}}(\theta)$ - KL divergence penalty objective at time $t$ (see PPO).
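A sketch of the PPO clipped surrogate, $L_t^{\text{CLIP}}(\theta) = \min\big(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)$, assuming the probability ratios and advantage estimates have already been computed:

```python
import numpy as np

def clip_surrogate(ratio, advantage, clip_eps=0.2):
    """L_t^CLIP = min(r_t * A_hat_t, clip(r_t, 1 - eps, 1 + eps) * A_hat_t)."""
    ratio, advantage = np.asarray(ratio), np.asarray(advantage)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), precomputed here.
print(clip_surrogate(ratio=[0.8, 1.5], advantage=[1.0, 1.0]))   # [0.8, 1.2]
```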
- $V_{\text{soft}}$, $Q_{\text{soft}}$ - denote soft value functions.
- $\alpha$ - (in the context of energy-based RL) denotes the temperature hyperparameter.
- $\mathcal{T}$ - soft Bellman operator.
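In the energy-based setting the soft state value is the temperature-scaled log-sum-exp of the soft action values, $V_{\text{soft}}(s) = \alpha \log \sum_a \exp(Q_{\text{soft}}(s,a)/\alpha)$; a numerically stabilized sketch, with the toy $Q$ row invented for the example:

```python
import numpy as np

def soft_value(q_soft_row, temperature):
    """V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha), computed stably."""
    q = np.asarray(q_soft_row) / temperature
    m = q.max()                                   # shift for numerical stability
    return float(temperature * (m + np.log(np.sum(np.exp(q - m)))))

# As the temperature alpha -> 0 this approaches max_a Q_soft(s, a).
print(soft_value([1.0, 2.0, 3.0], temperature=0.5))
```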
- subscript $q$ - relating to the action-value function $q$.
- subscript $v$ - relating to the state-value function $v$.
Footnotes
1. Capital letters typically indicate that the estimate operates on an entire vector, but the logic should remain the same.