• The Mutual Information measures how similar the joint distribution $p(x, y)$ is to the factored distribution $p(x)\,p(y)$. It is therefore defined as
  $$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

  • This quantifies how much information can be extracted about one random variable from observations of the other.

  • It can also be expressed in terms of entropies as
    $$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$

  • It can also be expressed as the expected value of the pointwise mutual information.
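
To make the definition concrete, here is a minimal Python sketch (the `joint` table and the function names are illustrative, not from the source): it computes $I(X; Y)$ of a discrete joint distribution directly from the definition above and checks it against the entropy decomposition $H(X) + H(Y) - H(X, Y)$.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint: np.ndarray) -> float:
    """I(X; Y) in nats, computed directly from a discrete joint table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x) as a column
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y) as a row
    mask = joint > 0                         # convention: 0 * log(0) = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# A dependent pair of binary variables (rows index x, columns index y).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
mi = mutual_information(joint)
# The same quantity via the entropy decomposition H(X) + H(Y) - H(X, Y).
alt = entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint.ravel())
print(mi, alt)  # both print ≈ 0.1927 nats
```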

Approximation

  • A simple approximation for continuous random variables is to bin them (bin sizes need not be equal) and treat them as if they were discrete; a sketch of this plug-in estimator appears after this list.

    • Limitation: This can give poor estimates if the underlying distributions aren’t smooth.
  • For continuous random variables, the mutual information can also be approximated using the Maximal Information Coefficient (MIC). We define
    $$I^{*}(X, Y; a, b) = \max_{G \in \mathcal{G}_{a \times b}} I(X_G; Y_G)$$

    where $\mathcal{G}_{a \times b}$ denotes the set of 2D grids of size $a \times b$ and $X_G, Y_G$ are the discretizations of the variables on grid $G$. Then, the maximal information coefficient is defined as
    $$\operatorname{MIC}(X; Y) = \max_{a b < B(n)} \frac{I^{*}(X, Y; a, b)}{\log \min(a, b)}$$

    where $B(n)$ is a sample-size-dependent bound on the number of bins we can use (a common choice is $B(n) = n^{0.6}$); a brute-force sketch of this definition appears after this list.

    • A MIC of $0$ represents no relationship between the variables.
    • A MIC of $1$ represents a noise-free relationship of any form, not just a linear one.
  • Kraskov, Stoegbauer, and Grassberger 1 propose an approximation for mutual information based on $k$-NN distances. The general idea is to estimate the entropy using the average distance to the $k$-nearest neighbors over all data points; a sketch of the first estimator appears after this list.

    • We use the metric space $Z = X \times Y$ with the maximum norm defined as
      $$\|z - z'\| = \max\{\|x - x'\|, \|y - y'\|\}$$

      We also denote $\epsilon(i)/2$ as the distance from $z_i$ to its $k$-th nearest neighbor, and similarly $\epsilon_x(i)/2$ and $\epsilon_y(i)/2$ for the distances between the same points projected into the $X$ and $Y$ subspaces.

      Further denote $n_x(i)$ as the number of points $x_j$ whose distance from $x_i$ is strictly less than $\epsilon(i)/2$, and analogously $n_y(i)$ in the $Y$ subspace.

      We take inspiration from the Kozachenko-Leonenko Estimate for Entropy and use the same notation defined there.

    • The first estimator is given by
      $$I^{(1)}(X; Y) = \psi(k) - \big\langle \psi(n_x + 1) + \psi(n_y + 1) \big\rangle + \psi(N)$$
      where $\psi$ is the digamma function and $\langle \cdot \rangle$ denotes the average over all $N$ data points.

      • The approximation is derived using the entropy estimator
        $$\hat{H}(X) = -\psi(k) + \psi(N) + \log c_{d} + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon(i)$$
      • To approximate each marginal entropy, we use
        $$\hat{H}(X) \approx -\frac{1}{N} \sum_{i=1}^{N} \psi\big(n_x(i) + 1\big) + \psi(N) + \log c_{d_x} + \frac{d_x}{N} \sum_{i=1}^{N} \log \epsilon(i)$$
        where the approximation is derived by supposing that the $\big(n_x(i) + 1\big)$-th neighbor of $x_i$ lies at distance $\epsilon(i)/2$.
      • Limitation: The derivation applies the Kozachenko-Leonenko estimator exactly in only one direction, since $\epsilon(i)$ coincides with the projected $k$-th neighbor distance in only one of the two marginal subspaces.
    • The second approximation uses hyperrectangles instead of hypercubes. It is given by
      $$I^{(2)}(X; Y) = \psi(k) - \frac{1}{k} - \big\langle \psi(n_x) + \psi(n_y) \big\rangle + \psi(N)$$

      • The general idea is to replace the probability mass $P_i(\epsilon)$ in the Kozachenko-Leonenko approximation with the following: denote $P_i(\epsilon_x, \epsilon_y)$ as the mass of the rectangle of size $\epsilon_x(i) \times \epsilon_y(i)$ centered at $z_i$, whereas $P_i(\epsilon)$ is the mass of the square of size $\epsilon(i) \times \epsilon(i)$.
      • Caveat: The above is technically valid only for approximating the entropy of the joint distribution; the corresponding estimates of the marginal entropies hold only up to additional correction terms.
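
As a companion to the binning approach above, the following is a minimal sketch assuming NumPy (the name `binned_mi`, the bin count, and the Gaussian test case are illustrative choices, not from the source): it bins two continuous samples with `np.histogram2d`, applies the discrete formula for $I(X; Y)$, and compares against the closed-form value $-\tfrac{1}{2}\log(1 - \rho^2)$ for correlated Gaussians.

```python
import numpy as np

def binned_mi(x: np.ndarray, y: np.ndarray, bins: int = 16) -> float:
    """Plug-in estimate of I(X; Y) in nats from a 2D histogram of the samples."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()              # empirical joint distribution
    px = joint.sum(axis=1, keepdims=True)      # empirical marginal of X
    py = joint.sum(axis=0, keepdims=True)      # empirical marginal of Y
    mask = joint > 0                           # convention: 0 * log(0) = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Correlated Gaussians: the true MI is -0.5 * log(1 - rho^2) ≈ 0.51 nats.
rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(5000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(5000)
print(binned_mi(x, y), -0.5 * np.log(1 - rho**2))  # estimate vs. exact value
```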
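
The MIC sketch below is a deliberately naive brute force over equal-frequency grids with $a b < n^{0.6}$. It only illustrates the definition above and is not the dynamic-programming algorithm of Reshef et al., so it generally underestimates the true maximum over all grid placements; the names `naive_mic` and `discretized_mi_bits` are illustrative.

```python
import numpy as np

def discretized_mi_bits(x, y, a, b):
    """MI (in bits) of x, y discretized on an a-by-b equal-frequency grid."""
    x_edges = np.quantile(x, np.linspace(0, 1, a + 1))
    y_edges = np.quantile(y, np.linspace(0, 1, b + 1))
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

def naive_mic(x, y):
    """Brute-force search over equal-frequency grids with a * b < n ** 0.6."""
    n = len(x)
    bound = n ** 0.6                   # sample-size-dependent bound B(n)
    best = 0.0
    for a in range(2, int(bound) + 1):
        for b in range(2, int(bound / a) + 1):
            mi = discretized_mi_bits(x, y, a, b)
            best = max(best, mi / np.log2(min(a, b)))   # normalise by log min(a, b)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 500)
print(naive_mic(x, np.sin(x)))                 # large: noise-free functional relationship
print(naive_mic(x, rng.uniform(-3, 3, 500)))   # small: independent variables
```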
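
Finally, a brute-force sketch of the first KSG estimator $I^{(1)}$ (the $O(N^2)$ pairwise-distance approach and the Gaussian test case are illustrative choices; a KD-tree would be used in practice): it computes $\epsilon(i)/2$ as the max-norm distance to the $k$-th joint neighbor, counts $n_x(i)$ and $n_y(i)$ strictly inside that radius in each marginal space, and averages the digamma terms.

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x: np.ndarray, y: np.ndarray, k: int = 3) -> float:
    """KSG estimator I^(1) of I(X; Y) in nats, using brute-force O(N^2) distances."""
    x = x.reshape(len(x), -1)
    y = y.reshape(len(y), -1)
    n = len(x)
    # Pairwise max-norm (Chebyshev) distances in each marginal space.
    dx = np.abs(x[:, None, :] - x[None, :, :]).max(axis=-1)
    dy = np.abs(y[:, None, :] - y[None, :, :]).max(axis=-1)
    dz = np.maximum(dx, dy)                      # max norm on the joint space
    np.fill_diagonal(dz, np.inf)                 # a point is not its own neighbour
    # eps(i)/2 = distance from z_i to its k-th nearest neighbour in the joint space.
    eps_half = np.sort(dz, axis=1)[:, k - 1]
    # n_x(i), n_y(i): points strictly closer than eps(i)/2 in each marginal space.
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    nx = (dx < eps_half[:, None]).sum(axis=1)
    ny = (dy < eps_half[:, None]).sum(axis=1)
    return float(digamma(k) - np.mean(digamma(nx + 1) + digamma(ny + 1)) + digamma(n))

# Correlated Gaussians again: the true MI is -0.5 * log(1 - rho^2) ≈ 0.51 nats.
rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(1000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(1000)
print(ksg_mi(x, y, k=3), -0.5 * np.log(1 - rho**2))  # estimate vs. exact value
```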

Links

Footnotes

  1. Kraskov, Stoegbauer, and Grassberger (2003) Estimating Mutual Information