| Characterization | Approach |
| --- | --- |
| Sample Updates, Uses Bootstrapping | Temporal Difference Learning |
| Sample Updates, No Bootstrapping | Monte Carlo |
| Expected Updates, Uses Bootstrapping | Dynamic Programming |
| Expected Updates, No Bootstrapping | Exhaustive Search |
- Note that the above are extremes; RL approaches form a spectrum between them (as seen with n-step bootstrapping). A sketch contrasting a sample update with an expected update is given below.
- Regardless, all approaches make use of Generalized Policy Iteration.
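A minimal sketch of the sample vs. expected distinction (the value table `V`, transition model `P`, and reward model `R` below are illustrative assumptions, not from the source):

```python
def td0_sample_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Sample update with bootstrapping (TD(0)): uses one sampled
    # transition (s, r, s_next) plus the current estimate V[s_next].
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def dp_expected_update(V, s, P, R, gamma=0.9):
    # Expected update with bootstrapping (Dynamic Programming):
    # averages over every possible next state using the model P, R.
    V[s] = sum(P[s][sn] * (R[s][sn] + gamma * V[sn]) for sn in P[s])

# Illustrative two-state model and value table (made-up numbers).
V = {"A": 0.0, "B": 0.0}
P = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 1.0}}  # P[s][s'] = transition probability
R = {"A": {"A": 0.0, "B": 1.0}, "B": {"A": 0.0}}  # R[s][s'] = expected reward
td0_sample_update(V, "A", r=1.0, s_next="B")  # back up one sampled transition
dp_expected_update(V, "A", P, R)              # back up over all successors
```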
Model-based and Model-Free
- A model-based approach captures the ability to plan (i.e., to decide the course of action by considering the future states before they are experienced).
- A model-free approach captures the ability to learn by trial and error
Off vs On Policy Learning
- On-policy learning evaluates and improves the policy that is being used to make decisions, while simultaneously estimating its value function.
  - These methods maintain only one policy, which compromises between exploration and exploitation by being a near-optimal policy that still explores.
  - Generally simpler than off-policy methods, and they converge faster.
- On-policy methods generally start stochastic but shift towards being deterministic. These include $\varepsilon$-greedy policies, which sometimes select a non-greedy action (see the sketch below).
  - A variant is soft policies, for which $\pi(a \mid s) > 0$ for all states $s$ and actions $a$, but where the policy gradually shifts to a deterministic optimal policy.
  - These policies iterate towards the optimal policy directly from experience.
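A minimal sketch of $\varepsilon$-greedy action selection over a tabular Q-function (the `Q` dictionary and `epsilon` value are illustrative assumptions, not from the source):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon, explore uniformly; otherwise exploit
    # the action with the highest current estimate Q[(state, action)].
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}  # made-up estimates
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1)
```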
- Off-policy learning creates an optimal policy by learning from experience generated by another policy, even while the current policy or value function is suboptimal.
  - Off-policy learning provides a solution to the Exploration-Exploitation trade-off by having the behavior policy explore and the target policy exploit.
  - Generally more complex, and subject to higher variance, since we learn from data generated by a policy different from the target.
- We call the first policy the behavior policy, denoted $b$. This is the policy that is used to generate behavior.
- We call the second policy the target policy, denoted $\pi$. This is the policy that will become the optimal policy.
- We require two assumptions:
  - The Assumption of Coverage: every action taken under $\pi$ is also taken, at least occasionally, under $b$ (i.e., $\pi(a \mid s) > 0$ implies $b(a \mid s) > 0$).
  - Soft Policy Assumption: $b$ is a soft policy, under which all actions from all states may be taken.
- Off-policy is analogous to learning by imitation.
- It allows the agent to learn from multiple sources in parallel, by having many target policies, each of which learns something different.
- Off-policy learning is the general case; on-policy learning is the special case in which the target and behavior policies are exactly the same. A minimal sketch contrasting a behavior policy with a greedy target policy is given below.
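A minimal sketch of the behavior/target split, using an $\varepsilon$-greedy behavior policy and a greedy target policy over the same Q-table (a Q-learning-style arrangement; the function names and `Q` values are illustrative assumptions):

```python
import random

def behavior_policy(Q, state, actions, epsilon=0.2):
    # b: explores -- epsilon-greedy, so every action has nonzero
    # probability (which also satisfies the coverage assumption).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def target_policy(Q, state, actions):
    # pi: exploits -- purely greedy with respect to the current Q.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s0", "left"): 0.1, ("s0", "right"): 0.4}      # made-up estimates
print(behavior_policy(Q, "s0", ["left", "right"]))   # may explore
print(target_policy(Q, "s0", ["left", "right"]))     # always "right"
```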
Tasks and Goals
- The goal of Reinforcement Learning is to maximize the expected return by changing the agent’s policy.
- The return is quantified as a function of the reward that the agent receives. It is a measure of the cumulative performance of the agent on a particular task.
- In episodic tasks the agent receives a series of rewards before the environment is eventually reset to its starting state. One run is called an episode. The length of an episode is called its horizon.
- In continuing tasks the agent continually receives rewards and there is no reset to a starting state.
- The return is quantified differently by different algorithms. The most common formulations are given below; most RL algorithms extend these definitions.
Discounted Formulation
- The expected return makes use of a discount factor such that future rewards are worth less than current rewards.
- The value function that is being optimized is the expected return under a policy $\pi$.
  - A state-value function quantifies the expected return given that the agent started at a particular state. This is also called the V-function (denoted $v_\pi$ or $V$).
  - A state-action-value function quantifies the expected return given that the agent started at a particular state and took a particular action. This is also called the Q-function (denoted $q_\pi$ or $Q$).
    - With respect to a policy $\pi$, the value of a state-action pair assumes the given action is taken first and that all subsequent actions follow $\pi$ (for the optimal Q-function, all subsequent actions are optimal).
  - We have the following for policy $\pi$:
    $$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
- For episodic tasks, the calculation will terminate with an absorbing state which continuously gives a reward of $0$.
- Thus, the discounted return $G_t$ with discount rate $\gamma \in [0, 1]$ is calculated as
  $$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
  (a short sketch computing this quantity follows below).
  - For episodic tasks, we simply set all rewards after the episode end $T$ to $0$.
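A minimal sketch of computing the discounted return for a finite list of rewards (the reward list and $\gamma$ are made-up values):

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    # Iterating backwards uses the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```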
Generalized Discounted Return Formulation
- In this case, we vary $\gamma$ at each time step, denoting the value at time $t$ as $\gamma_t$.
- The return is defined more generally as
  $$G_t \doteq R_{t+1} + \gamma_{t+1} G_{t+1} = \sum_{k=t}^{\infty} \left( \prod_{i=t+1}^{k} \gamma_i \right) R_{k+1}$$
  and to assure all sums are finite, we require that $\prod_{k=t+1}^{\infty} \gamma_k = 0$ with probability one for all $t$.
  - Rationale: it enables episodic and continuing settings and algorithms to be presented in terms of a single stream of experience (a short sketch follows below).
  - What would be terminating states are now states at which $\gamma_t = 0$ and which transition to the start distribution.
  - We can also include pseudo-termination cases, where we predict quantities without altering the flow of the Markov process.
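A minimal sketch of the generalized return computed with the recursion $G_t = R_{t+1} + \gamma_{t+1} G_{t+1}$ (the reward and discount sequences are illustrative; a zero discount marks an episode boundary):

```python
def generalized_return(rewards, gammas):
    # rewards[k] is R_{t+k+1}; gammas[k] is gamma_{t+k+1}.
    # A zero in gammas acts like a termination within the single stream.
    g = 0.0
    for r, gamma in zip(reversed(rewards), reversed(gammas)):
        g = r + gamma * g
    return g

rewards = [1.0, 2.0, 3.0, 4.0]
gammas  = [0.9, 0.9, 0.0, 0.9]   # gamma = 0 ends the "episode" after the third reward
print(generalized_return(rewards, gammas))  # 1 + 0.9*(2 + 0.9*3) = 5.23
```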
Average Reward Formulation
- Applies to continuing problems, but without discounting.
- The quality of a policy $\pi$ is defined as the average rate of reward (i.e., the average reward), given as
  $$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\bigl[R_t \mid S_0, A_{0:t-1} \sim \pi\bigr] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$
  where $S_0$ denotes the initial state, $A_{0:t-1}$ denotes the sequence $A_0, A_1, \dots, A_{t-1}$, and $\mu_\pi$ is the steady-state distribution, given by $\mu_\pi(s) \doteq \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t-1} \sim \pi\}$, which is independent of $S_0$ (i.e., the Markov process is ergodic).
- The steady-state distribution is the special distribution under which selecting actions according to $\pi$ means we remain in the same distribution:
  $$\sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu_\pi(s')$$
- Returns are defined in terms of differences between rewards and the average reward. This is called the differential return:
  $$G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + \dots$$
- The corresponding value functions are known as differential value functions, which we still denote $v_\pi$ and $q_\pi$ (a small differential TD-style sketch follows below).
  - Notation: we will be explicit when dealing with differential value functions, but by default assume we are not in an average-reward context.
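A minimal sketch of a tabular differential TD(0)-style update, where `avg_reward` plays the role of $r(\pi)$ (the step sizes and state names are illustrative assumptions):

```python
def differential_td0_update(V, s, r, s_next, avg_reward, alpha=0.1, beta=0.01):
    # TD error uses (reward - average reward) instead of discounting:
    # delta = R - r(pi) + V(S') - V(S)
    delta = r - avg_reward + V[s_next] - V[s]
    V[s] += alpha * delta        # update the differential value estimate
    avg_reward += beta * delta   # update the average-reward estimate
    return avg_reward

V = {"A": 0.0, "B": 0.0}         # made-up states and initial values
avg_reward = 0.0
avg_reward = differential_td0_update(V, "A", 1.0, "B", avg_reward)
```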
Forward vs Backward
- In the forward view, we perform updates based on the next $n$ rewards and states up to $n$ steps ahead in the future. This is the approach in (n-step) Temporal Difference Learning.
- In the backward view, we look backward to recently visited states. This is achieved via Eligibility Traces (a short sketch follows below).
- Both views are equivalent, and one can transform from one to the other.
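A minimal sketch of the backward view with accumulating eligibility traces in a tabular TD(λ)-style step (the state space, step sizes, and λ are illustrative assumptions):

```python
def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    # Compute the one-step TD error at the current transition ...
    delta = r + gamma * V[s_next] - V[s]
    # ... bump the trace of the state just visited ...
    E[s] = E.get(s, 0.0) + 1.0
    # ... and propagate the error backward to all recently visited states.
    for state, trace in E.items():
        V[state] += alpha * delta * trace
        E[state] = gamma * lam * trace   # traces decay over time

V, E = {"A": 0.0, "B": 0.0}, {}
td_lambda_step(V, E, "A", 1.0, "B")
```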
Action-Based vs Policy Based
- Action-value methods learn the values of actions and then select actions based on the estimated values.
- Policy-based methods learn a parameterized policy that can select actions without consulting a value function (this is the approach for Policy Gradient Methods; see the sketch below).
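A minimal sketch of a parameterized policy (softmax over action preferences) that selects actions without a value function (the preference table `theta` is an illustrative assumption):

```python
import math, random

def softmax_policy(theta, state, actions):
    # theta holds one preference per (state, action); the policy is a
    # softmax over these preferences -- no value function is consulted.
    prefs = [theta.get((state, a), 0.0) for a in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(actions, weights=probs)[0]

theta = {("s0", "left"): 0.5, ("s0", "right"): 1.5}  # made-up preferences
print(softmax_policy(theta, "s0", ["left", "right"]))
```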
Other Dimensions
- Value Estimated: Which do we estimate?
- Action values
- State values
- Use a model for action selection.
- Use a policy for action selection.
- Afterstate values.
- Action Selection / Exploration
- How do we address the Exploration-Exploitation trade-off?
- Synchronous / Asynchronous:
- Synchronous - all values are updated in systematic sweeps, one by one in order, with each sweep based on the values from the previous one.
- Asynchronous - values are updated in any order (possibly in parallel), in place, using whatever values of other states happen to be available.
- Real / Simulated - do we back up real or simulated experience, and to what extent?
- Location of Updates - what should be updated?
- For Model-free - we choose only among those states / state-action pairs encountered.
- For Model-based - we can choose arbitrarily.
- Timing of Updates - are updates done
  - as part of selecting actions (e.g., Value Iteration), or
  - after selecting actions (e.g., trajectory sampling and rollout algorithms)?
- Memory for updates - how long should we retain updated values?
  - Permanent retention
  - Retain only while computing an action selection (e.g., heuristic search)
- Use of function approximation
- Use of Policy Gradient Methods and extensions of the Policy Gradient Theorem.
Prediction and Update
- Estimates and update rules in Reinforcement Learning can be formulated with the following formula (analogous to Gradient Descent):
  $$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$
  We notate the step size with $\alpha$ (a short sketch follows below).
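A minimal sketch of this general update pattern (the names `estimate`, `target`, and `step_size` are illustrative):

```python
def incremental_update(estimate, target, step_size=0.1):
    # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
    return estimate + step_size * (target - estimate)

estimate = 0.0
for target in [1.0, 1.0, 1.0]:     # made-up targets
    estimate = incremental_update(estimate, target)
print(estimate)  # moves a fraction of the way toward the target each step
```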
Links
- Sutton and Barto Ch. 8.13
- 10.3 - on Average Reward Formulation
- Backups in Reinforcement Learning - more on sample and expected updates.