- Importance sampling is a technique for estimating the expected value under one distribution given samples drawn from another.
- In the context of Reinforcement Learning, we use it in off-policy learning to correct for the mismatch between the target policy and the behavior policy.
Standard Definition
- Given a start state $S_t$, we have the probability of the subsequent trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ under a policy $\pi$ as
$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$
And the importance sampling ratio is given as the relative probability of the trajectory under the target $\pi$ and behavior $b$:
$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
The transition probabilities cancel, so the ratio depends only on the two policies.
- The importance sampling ratio is then applied to $G_t$, the returns under the behavior policy. We have that
$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$
- We can perform ordinary importance sampling by taking a simple average. Let $\mathcal{T}(s)$ denote the time steps at which $s$ is visited, and $T(t)$ denote the first time of termination after $t$:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
- Weighted importance sampling is done using a weighted average:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
Or $V(s) \doteq 0$ if the denominator is zero. Both estimators are sketched in code below.
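To make the two estimators concrete, here is a minimal Python sketch, assuming tabular policies given as `{(state, action): probability}` dictionaries and episodes recorded as lists of `(state, action, reward)` tuples; this encoding and all names are my own, not from the text.

```python
import numpy as np

def trajectory_ratio(pi, b, episode, t):
    """rho_{t:T-1}: product of pi(a|s) / b(a|s) from step t until termination."""
    rho = 1.0
    for (s, a, _) in episode[t:]:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

def is_estimates(pi, b, episodes, target_state, gamma=1.0):
    """Ordinary and weighted importance sampling estimates of v_pi(target_state)."""
    scaled_returns, ratios = [], []
    for episode in episodes:
        for t, (s, _, _) in enumerate(episode):
            if s != target_state:
                continue
            # Return G_t observed under the behavior policy.
            G = sum(gamma**k * r for k, (_, _, r) in enumerate(episode[t:]))
            rho = trajectory_ratio(pi, b, episode, t)
            scaled_returns.append(rho * G)
            ratios.append(rho)
    ordinary = np.mean(scaled_returns)          # divide by |T(s)|
    denom = np.sum(ratios)
    weighted = np.sum(scaled_returns) / denom if denom != 0 else 0.0
    return ordinary, weighted
```

Ordinary importance sampling is unbiased but can have very high variance; weighted importance sampling is biased but typically has much lower variance, which is why it is usually preferred in practice.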
Variants
Discounting-Aware
- In the context of discounted returns, we may make use of discounting-aware importance sampling.
- We interpret the discount rate $\gamma$ as a degree of partial termination. The return can be written as
$$G_t = (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1}\, \bar{G}_{t:h} + \gamma^{T-t-1}\, \bar{G}_{t:T}$$
Where $\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \dots + R_h$ is the flat partial return (where the discount rate is $1$ and the sum is up to horizon $h$).
- The estimator is then obtained by using the standard formula for importance sampling (see above), but where $\bar{G}_{t:h}$ is scaled by $\rho_{t:h-1}$ and $\bar{G}_{t:T(t)}$ by $\rho_{t:T(t)-1}$. $T(t)$ denotes the first termination time step after $t$. A code sketch follows below.
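Under the same hypothetical episode encoding as above, here is a sketch of one episode's contribution to the numerator of the ordinary discounting-aware estimator, scaling each flat partial return $\bar{G}_{t:h}$ by $\rho_{t:h-1}$:

```python
def discounting_aware_term(pi, b, episode, t, gamma):
    """One episode's numerator term for the ordinary discounting-aware estimator:
    (1 - gamma) * sum_{h=t+1}^{T-1} gamma^(h-t-1) * rho_{t:h-1} * Gbar_{t:h}
      + gamma^(T-t-1) * rho_{t:T-1} * Gbar_{t:T}
    """
    T = len(episode)                       # termination time
    rho, g_flat, total = 1.0, 0.0, 0.0
    for h in range(t + 1, T + 1):
        s, a, r = episode[h - 1]
        rho *= pi[(s, a)] / b[(s, a)]      # rho is now rho_{t:h-1}
        g_flat += r                        # g_flat is now Gbar_{t:h}
        if h < T:
            total += (1 - gamma) * gamma**(h - t - 1) * rho * g_flat
        else:
            total += gamma**(T - t - 1) * rho * g_flat
    return total
```

Note that with $\gamma = 1$ the partially terminated terms vanish and this reduces to the ordinary estimator above, which is consistent with the partial-termination interpretation.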
Per-Step
- The per-step importance sampling ratio is defined as follows for target $\pi$ and behavior $b$:
$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$
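In code, under the same hypothetical dictionary-based policy encoding, this is a single division:

```python
def per_step_ratio(pi, b, s, a):
    """rho_t = pi(A_t | S_t) / b(A_t | S_t) for a single time step."""
    return pi[(s, a)] / b[(s, a)]
```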
Per-Decision
- We may also consider per-decision importance sampling, based on the observation that
$$\rho_{t:T-1} G_t = \rho_{t:T-1}\left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \right)$$
and that each reward needs only the ratios up to its own time step, since the later factors have expected value $1$: $\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}]$. In place of $G_t$, we use the following in ordinary importance sampling:
$$\tilde{G}_t \doteq \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$
And¹
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$
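A sketch of the per-decision return $\tilde{G}_t$ under the same hypothetical episode encoding; each reward is scaled only by the ratios of the steps that precede it:

```python
def per_decision_return(pi, b, episode, t, gamma):
    """G~_t = rho_{t:t} R_{t+1} + gamma rho_{t:t+1} R_{t+2} + ...
    Each reward is weighted only by the ratios up to its own time step."""
    g, rho = 0.0, 1.0
    for k, (s, a, r) in enumerate(episode[t:]):
        rho *= pi[(s, a)] / b[(s, a)]   # rho is now rho_{t:t+k}
        g += gamma**k * rho * r         # r is R_{t+k+1}
    return g
```

Because the truncated ratios never include factors for steps after the reward they scale, this estimator tends to have lower variance than scaling every reward by the full $\rho_{t:T-1}$.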
N-step Returns
- We extend importance sampling to apply to n-step returns by using the following importance sampling ratio:
$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
Where $b$ is the behavior policy.
- When an action would never be taken by the target policy $\pi$ (i.e. $\pi(A_k \mid S_k) = 0$), the ratio is zero and the n-step return is given zero weight, so the sample is effectively ignored.
- When an action has $\pi(A_k \mid S_k) \gg b(A_k \mid S_k)$, it is characteristic of $\pi$ but is rarely explored by $b$. To compensate for the rare exploration, we give it higher weight (which is consistent with the importance sampling ratio's properties). The update that uses this ratio is sketched below.
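To show the truncated ratio in use, here is a sketch of the off-policy n-step TD update for state values from Chapter 7, $V(S_t) \leftarrow V(S_t) + \alpha\, \rho_{t:t+n-1}\,[G_{t:t+n} - V(S_t)]$, again with my hypothetical tabular encoding (`V` is a dict from state to value):

```python
def nstep_ratio(pi, b, episode, t, h):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    T = len(episode)
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        s, a, _ = episode[k]
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

def nstep_offpolicy_update(V, pi, b, episode, t, n, alpha, gamma):
    """Off-policy n-step TD: V(S_t) += alpha * rho_{t:t+n-1} * (G_{t:t+n} - V(S_t))."""
    T = len(episode)
    h = min(t + n, T)
    # n-step return G_{t:t+n}: rewards up to the horizon, then bootstrap from V.
    G = sum(gamma**(k - t) * episode[k][2] for k in range(t, h))
    if t + n < T:
        G += gamma**n * V[episode[t + n][0]]
    rho = nstep_ratio(pi, b, episode, t, t + n - 1)
    s_t = episode[t][0]
    V[s_t] += alpha * rho * (G - V[s_t])
```

When the ratio is zero the update leaves $V(S_t)$ unchanged, matching the point above that such samples are ignored.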
Links
- Sutton and Barto Ch. 5, 7
- 5.5 - more information about Importance Sampling.
- 5.8 - discounting-aware importance sampling.
- 5.9 - per-decision importance sampling.
- 7.3 - Importance Sampling for n-step Off-policy Learning.
Footnotes
1. Note: It is less clear if there is a weighted per-decision importance sampling; the estimators proposed so far are inconsistent. ↩