- Importance sampling is a technique for estimating the expected value under one distribution given samples drawn from another.
- In the context of Reinforcement Learning, we use it in off-policy learning to correct for the mismatch between the target policy and the behavior policy.
Standard Definition
- Given a start state $S_t$, we have the probability of the subsequent trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ under a policy $\pi$ as
$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$
And the importance sampling ratio is given as the relative probability of the trajectory under the target $\pi$ and behavior $b$:
$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
The transition probabilities cancel, so the ratio depends only on the two policies.
- The importance sampling ratio is then applied to $G_t$, the returns under the behavior policy. We have that
$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$
- We can perform ordinary importance sampling by taking a simple average. Let $\mathcal{T}(s)$ denote the time steps at which $s$ is visited, and $T(t)$ denote the first time of termination after $t$:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$
- Weighted importance sampling is done using a weighted average:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$
Or $V(s) \doteq 0$ if the denominator is zero. Both estimators are sketched in code below.
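To make the two estimators concrete, here is a minimal Python sketch, assuming tabular policies given as `{(state, action): probability}` dictionaries and episodes recorded as lists of `(state, action, reward)` tuples; this encoding and all names are my own, not from the text.

```python
import numpy as np

def trajectory_ratio(pi, b, episode, t):
    """rho_{t:T-1}: product of pi(a|s) / b(a|s) from step t until termination."""
    rho = 1.0
    for (s, a, _) in episode[t:]:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

def is_estimates(pi, b, episodes, target_state, gamma=1.0):
    """Ordinary and weighted importance sampling estimates of v_pi(target_state)."""
    scaled_returns, ratios = [], []
    for episode in episodes:
        for t, (s, _, _) in enumerate(episode):
            if s != target_state:
                continue
            # Return G_t observed under the behavior policy.
            G = sum(gamma**k * r for k, (_, _, r) in enumerate(episode[t:]))
            rho = trajectory_ratio(pi, b, episode, t)
            scaled_returns.append(rho * G)
            ratios.append(rho)
    ordinary = np.mean(scaled_returns)          # divide by |T(s)|
    denom = np.sum(ratios)
    weighted = np.sum(scaled_returns) / denom if denom != 0 else 0.0
    return ordinary, weighted
```

Ordinary importance sampling is unbiased but can have very high variance; weighted importance sampling is biased but typically has much lower variance, which is why it is usually preferred in practice.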
Variants
Discounting-Aware
- In the context of discounted returns, we may make use of discounting-aware importance sampling.
- We interpret the discount rate $\gamma$ as a degree of partial termination. The return can be written as
$$G_t = (1-\gamma)\sum_{h=t+1}^{T-1} \gamma^{h-t-1}\, \bar{G}_{t:h} + \gamma^{T-t-1}\, \bar{G}_{t:T}$$
Where $\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \dots + R_h$ is the flat partial return (where the discount rate is $1$ and the sum is up to horizon $h$).
- The estimator is then obtained by using the standard formula for importance sampling (see above), but where $\bar{G}_{t:h}$ is scaled by $\rho_{t:h-1}$ and $\bar{G}_{t:T(t)}$ by $\rho_{t:T(t)-1}$. $T(t)$ denotes the first termination time step after $t$. A code sketch follows below.
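Under the same hypothetical episode encoding as above, here is a sketch of one episode's contribution to the numerator of the ordinary discounting-aware estimator, scaling each flat partial return $\bar{G}_{t:h}$ by $\rho_{t:h-1}$:

```python
def discounting_aware_term(pi, b, episode, t, gamma):
    """One episode's numerator term for the ordinary discounting-aware estimator:
    (1 - gamma) * sum_{h=t+1}^{T-1} gamma^(h-t-1) * rho_{t:h-1} * Gbar_{t:h}
      + gamma^(T-t-1) * rho_{t:T-1} * Gbar_{t:T}
    """
    T = len(episode)                       # termination time
    rho, g_flat, total = 1.0, 0.0, 0.0
    for h in range(t + 1, T + 1):
        s, a, r = episode[h - 1]
        rho *= pi[(s, a)] / b[(s, a)]      # rho is now rho_{t:h-1}
        g_flat += r                        # g_flat is now Gbar_{t:h}
        if h < T:
            total += (1 - gamma) * gamma**(h - t - 1) * rho * g_flat
        else:
            total += gamma**(T - t - 1) * rho * g_flat
    return total
```

Note that with $\gamma = 1$ the partially terminated terms vanish and this reduces to the ordinary estimator above, which is consistent with the partial-termination interpretation.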
Per-Step
- The per-step importance sampling ratio is defined as follows for target $\pi$ and behavior $b$:
$$\rho_t \doteq \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$
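In code, under the same hypothetical dictionary-based policy encoding, this is a single division:

```python
def per_step_ratio(pi, b, s, a):
    """rho_t = pi(A_t | S_t) / b(A_t | S_t) for a single time step."""
    return pi[(s, a)] / b[(s, a)]
```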
Per-Decision
- We may also consider per-decision importance sampling, based on the observation that
$$\rho_{t:T-1} G_t = \rho_{t:T-1}\left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \right)$$
and that each reward needs only the ratios up to its own time step, since the later factors have expected value $1$: $\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}]$. In place of $G_t$, we use the following in ordinary importance sampling:
$$\tilde{G}_t \doteq \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$
And¹
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$
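A sketch of the per-decision return $\tilde{G}_t$ under the same hypothetical episode encoding; each reward is scaled only by the ratios of the steps that precede it:

```python
def per_decision_return(pi, b, episode, t, gamma):
    """G~_t = rho_{t:t} R_{t+1} + gamma rho_{t:t+1} R_{t+2} + ...
    Each reward is weighted only by the ratios up to its own time step."""
    g, rho = 0.0, 1.0
    for k, (s, a, r) in enumerate(episode[t:]):
        rho *= pi[(s, a)] / b[(s, a)]   # rho is now rho_{t:t+k}
        g += gamma**k * rho * r         # r is R_{t+k+1}
    return g
```

Because the truncated ratios never include factors for steps after the reward they scale, this estimator tends to have lower variance than scaling every reward by the full $\rho_{t:T-1}$.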
N-step Returns
- We extend importance sampling to apply to n-step returns by using the following importance sampling ratio:
$$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$
Where $b$ is the behavior policy.
- When an action would never be taken by the target policy $\pi$ (i.e. $\pi(A_k \mid S_k) = 0$), the ratio is zero and the n-step return is given zero weight, so the sample is effectively ignored.
- When an action has $\pi(A_k \mid S_k) \gg b(A_k \mid S_k)$, it is characteristic of $\pi$ but is rarely explored by $b$. To compensate for the rare exploration, we give it higher weight (which is consistent with the importance sampling ratio's properties). The update that uses this ratio is sketched below.
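To show the truncated ratio in use, here is a sketch of the off-policy n-step TD update for state values from Chapter 7, $V(S_t) \leftarrow V(S_t) + \alpha\, \rho_{t:t+n-1}\,[G_{t:t+n} - V(S_t)]$, again with my hypothetical tabular encoding (`V` is a dict from state to value):

```python
def nstep_ratio(pi, b, episode, t, h):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    T = len(episode)
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        s, a, _ = episode[k]
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

def nstep_offpolicy_update(V, pi, b, episode, t, n, alpha, gamma):
    """Off-policy n-step TD: V(S_t) += alpha * rho_{t:t+n-1} * (G_{t:t+n} - V(S_t))."""
    T = len(episode)
    h = min(t + n, T)
    # n-step return G_{t:t+n}: rewards up to the horizon, then bootstrap from V.
    G = sum(gamma**(k - t) * episode[k][2] for k in range(t, h))
    if t + n < T:
        G += gamma**n * V[episode[t + n][0]]
    rho = nstep_ratio(pi, b, episode, t, t + n - 1)
    s_t = episode[t][0]
    V[s_t] += alpha * rho * (G - V[s_t])
```

When the ratio is zero the update leaves $V(S_t)$ unchanged, matching the point above that such samples are ignored.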
Links
- Sutton and Barto Ch. 5, 7
- 5.5 - more information about Importance Sampling.
- 5.8 - discounting-aware importance sampling.
- 5.9 - per-decision importance sampling.
- 7.3 - Importance Sampling for n-step Off-policy Learning.
Footnotes
1. Note: It is less clear if there is a weighted per-decision importance sampling; the estimators proposed so far are inconsistent. ↩