• Importance sampling is a technique for estimating an expected value under one distribution using samples drawn from another.
  • In reinforcement learning, it is used in off-policy learning: returns generated by the behavior policy $b$ are reweighted so that their expectation matches the target policy $\pi$.

Standard Definition

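  • Given a start state $S_t$, the probability of the subsequent trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ under a policy $\pi$ is

    $$\Pr\{A_t, S_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$$

    The importance sampling ratio is the relative probability of the trajectory under the target policy $\pi$ and the behavior policy $b$. The transition probabilities $p$ cancel, so the ratio depends only on the two policies:

    $$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$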

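  • The importance sampling ratio is then applied to $G_t$, the return observed while following the behavior policy. Unweighted, these returns have expectation $v_b(s)$; after reweighting we have

    $$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$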

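  • Ordinary importance sampling takes a simple average of the scaled returns. Let $\mathcal{T}(s)$ denote the set of time steps at which $s$ is visited, and let $T(t)$ denote the first time of termination after $t$. Then

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

  • Weighted importance sampling uses a weighted average instead:

    $$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

    or $V(s) = 0$ if the denominator is zero.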
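The ordinary and weighted estimators above can be compared with a minimal Python sketch. The dictionary-based policies, the one-state example, and the function names `importance_ratio`, `ordinary_is`, and `weighted_is` are assumptions made for illustration; they are not from Sutton and Barto.

```python
from math import prod

def importance_ratio(trajectory, pi, b):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of a trajectory."""
    return prod(pi[s][a] / b[s][a] for s, a in trajectory)

def ordinary_is(ratios, returns):
    """Ordinary importance sampling: simple average of the ratio-scaled returns."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(returns)

def weighted_is(ratios, returns):
    """Weighted importance sampling: ratio-weighted average, 0 if all ratios are 0."""
    denom = sum(ratios)
    if denom == 0:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / denom

# Hypothetical one-state, two-action example: the target policy always picks
# "left", while the behavior policy picks either action with probability 0.5.
pi = {"s": {"left": 1.0, "right": 0.0}}
b = {"s": {"left": 0.5, "right": 0.5}}

# Each episode is (trajectory of (state, action) pairs, observed return G).
episodes = [
    ([("s", "left")], 1.0),
    ([("s", "right")], 5.0),
    ([("s", "left")], 3.0),
]

ratios = [importance_ratio(traj, pi, b) for traj, _ in episodes]
returns = [g for _, g in episodes]

print("ordinary:", ordinary_is(ratios, returns))
print("weighted:", weighted_is(ratios, returns))
```

With these three episodes the ratios are 2, 0, and 2, so the ordinary estimate is 8/3 and the weighted estimate is 2. The weighted estimate equals the mean return of the episodes consistent with the target policy, while the ordinary estimate is pulled around by the size of the ratios.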



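  • In the context of discounted returns, we may make use of discounting-aware importance sampling, which can reduce the variance of the estimate.
    • We interpret the discount rate $\gamma$ as a degree of partial termination. The return can be written as

      $$G_t = (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar{G}_{t:h} + \gamma^{T-t-1} \bar{G}_{t:T}$$

      where $\bar{G}_{t:h} = R_{t+1} + R_{t+2} + \dots + R_h$ is the flat return (the discount rate is $1$ and the sum runs up to horizon $h$).
    • The estimator is then obtained by using the standard formula for importance sampling (see above), but with each flat return scaled by a ratio truncated at its own horizon: $\bar{G}_{t:h}$ is scaled by $\rho_{t:h-1}$ and $\bar{G}_{t:T(t)}$ by $\rho_{t:T(t)-1}$. In the ordinary estimator, the term $\rho_{t:T(t)-1} G_t$ is replaced by

      $$(1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)}$$

      Here $T(t)$ denotes the first termination time step after $t$.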
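As a rough illustration of the discounting-aware term for a single episode starting at $t = 0$, the sketch below accumulates the flat return and the truncated ratio horizon by horizon. The function name `discounting_aware_return` and the list-based inputs are assumptions for this example only.

```python
def discounting_aware_return(rewards, step_ratios, gamma):
    """Discounting-aware importance-sampled return for an episode, from t = 0.

    rewards[k]     holds R_{k+1}
    step_ratios[k] holds pi(A_k | S_k) / b(A_k | S_k)
    """
    T = len(rewards)
    total = 0.0
    flat_return = 0.0  # flat return up to horizon h (no discounting)
    rho = 1.0          # ratio truncated at horizon h, i.e. rho_{0:h-1}
    for h in range(1, T + 1):
        flat_return += rewards[h - 1]
        rho *= step_ratios[h - 1]
        if h < T:
            # Partial termination at horizon h with degree (1 - gamma).
            weight = (1 - gamma) * gamma ** (h - 1)
        else:
            # Whatever has not partially terminated ends at the true horizon T.
            weight = gamma ** (T - 1)
        total += weight * rho * flat_return
    return total

# Sanity check: with all ratios equal to 1 (on-policy), the result should
# match the usual discounted return 1 + 0.9 * 2 + 0.81 * 3.
print(discounting_aware_return([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.9))
```

With $\gamma = 1$ the partial-termination weights vanish and the function reduces to the full ratio times the undiscounted return, which is the standard scaled return.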


  • The per-step importance sampling ratio for target policy $\pi$ and behavior policy $b$ is defined as

    $$\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$


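  • We may also consider per-decision importance sampling, based on the observation that the ratio factors for actions taken after a reward is received do not change its expectation. Writing $\rho_{t:h}$ for the product of per-step ratios from $t$ through $h$,

    $$\mathbb{E}[\rho_{t:T-1} R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1} R_{t+k}]$$

    In place of $\rho_{t:T-1} G_t$, we use the following in ordinary importance sampling:

    $$\tilde{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \dots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$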
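The per-decision return can be sketched as a single pass over an episode, where each reward is scaled only by the ratios of the actions that preceded it. The name `per_decision_return` and the input layout are assumptions for illustration.

```python
def per_decision_return(rewards, step_ratios, gamma):
    """Per-decision importance-sampled return for an episode, from t = 0.

    rewards[k]     holds R_{k+1}
    step_ratios[k] holds pi(A_k | S_k) / b(A_k | S_k)
    """
    total = 0.0
    rho = 1.0       # running product rho_{0:k}
    discount = 1.0  # running discount gamma^k
    for reward, ratio in zip(rewards, step_ratios):
        rho *= ratio
        total += discount * rho * reward
        discount *= gamma
    return total

# Hypothetical episode whose last action has zero probability under the target.
rewards = [1.0, 2.0, 3.0]
step_ratios = [2.0, 0.5, 0.0]
print(per_decision_return(rewards, step_ratios, 0.5))
```

In this example the full-trajectory ratio is zero, so the standard scaled return would discard the whole episode, whereas the per-decision return still uses the first two rewards.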

N-step Returns

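  • We extend importance sampling to n-step returns by using the following importance sampling ratio, which covers only the actions that affect the n-step return:

    $$\rho_{t:h} = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

    where $\pi$ is the target policy and $b$ is the behavior policy.
    • When an action would never be taken by the target policy ($\pi(A_k \mid S_k) = 0$), the ratio is zero and the n-step return is ignored entirely.
    • When an action has $\pi(A_k \mid S_k) \gg b(A_k \mid S_k)$, it is characteristic of $\pi$ but is rarely selected by $b$. To compensate for its rare appearance in the data, the ratio gives it a higher weight (which is consistent with the importance sampling ratio's properties).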
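The two cases above can be checked with a small sketch of the truncated ratio. The function name `n_step_ratio` and the per-step probability lists are hypothetical; the lists hold the probabilities each policy assigned to the action actually taken at each step.

```python
def n_step_ratio(pi_probs, b_probs, t, h):
    """Ratio over steps k = t .. min(h, T - 1), where T = len(pi_probs).

    pi_probs[k] and b_probs[k] are the probabilities that the target and
    behavior policies assign to the action actually taken at step k.
    """
    T = len(pi_probs)
    ratio = 1.0
    for k in range(t, min(h, T - 1) + 1):
        ratio *= pi_probs[k] / b_probs[k]
    return ratio

# Hypothetical three-step episode (T = 3).
pi_probs = [1.0, 0.0, 0.5]
b_probs = [0.5, 0.5, 0.25]

print(n_step_ratio(pi_probs, b_probs, 0, 0))  # action favored by the target: weight above 1
print(n_step_ratio(pi_probs, b_probs, 0, 1))  # includes an action the target never takes: weight 0
print(n_step_ratio(pi_probs, b_probs, 2, 5))  # horizon past the episode end is truncated at T - 1
```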


  • Sutton and Barto Ch. 5, 7
    • 5.5 - more information about importance sampling.
    • 5.8 - discounting-aware importance sampling.
    • 5.9 - per-decision importance sampling.
    • 7.3 - Importance Sampling for n-step Off-policy Learning.


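  1. Note: It is less clear whether there is a per-decision version of weighted importance sampling. According to Sutton and Barto, the estimators proposed for this so far are not consistent, that is, they do not converge to the true value with infinite data.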