- The Mutual Information determines how similar the joint distribution $p(X, Y)$ is to the factored distribution $p(X)p(Y)$. It is therefore defined as
  $$I(X; Y) = \mathrm{KL}\left(p(X, Y) \,\|\, p(X)p(Y)\right) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
  This determines how much information can be extracted about one random variable given observations of another.
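As a quick numerical check of this definition, the sketch below (assuming NumPy; the joint table values are made up) evaluates the sum above for a small discrete joint distribution.

```python
import numpy as np

# Arbitrary toy joint distribution p(x, y) over two binary variables.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, 2)

# I(X;Y) = sum_{x,y} p(x,y) * log[ p(x,y) / (p(x) p(y)) ]
mask = p_xy > 0  # skip zero-probability cells to avoid log(0)
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))
print(f"I(X;Y) = {mi:.4f} nats")
```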
- It can also be expressed in terms of entropies as
  $$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
- It can also be expressed as the expected value of the pointwise mutual information:
  $$I(X; Y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] = \mathbb{E}_{p(x, y)}\left[\mathrm{PMI}(x, y)\right]$$
Approximation
- A simple approximation for continuous random variables is to bin them (bin sizes need not be the same) and treat them as if they were discrete (a sketch follows below).
  - Limitation: This does not necessarily work if the underlying distributions aren't smooth.
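A minimal sketch of this binning approach, assuming NumPy; the function name and the (deliberately unequal) bin counts are arbitrary choices. Binning tends to bias the estimate, so the numbers are only indicative.

```python
import numpy as np

def binned_mi(x, y, bins=(16, 12)):
    """Approximate I(X;Y) by discretizing continuous samples; bin counts need not match."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

# Correlated Gaussian example; the true MI is -0.5 * log(1 - rho^2) ≈ 0.51 nats for rho = 0.8.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)
print(binned_mi(x, y))
```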
- For continuous random variables, the mutual information can be approximated using the Maximal Information Coefficient (MIC). We define
  $$I^*(X, Y; x, y) = \max_{G \in \mathcal{G}(x, y)} I\left(X|_G;\, Y|_G\right)$$
  where $\mathcal{G}(x, y)$ denotes the set of 2D grids of size $x \times y$ and $X|_G, Y|_G$ are discretizations of the variables on this grid. Then, the maximal information coefficient is defined as
  $$\mathrm{MIC} = \max_{x, y \,:\, x y \le B(n)} \frac{I^*(X, Y; x, y)}{\log \min(x, y)}$$
  where $B(n)$ is a sample-size-dependent bound on the number of bins we can use (a simplified search sketch follows this list).
  - A MIC of 0 represents no relationship between the variables.
  - A MIC of 1 represents a noise-free relationship of any form, not just linear.
- A MIC of
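The exact MIC maximizes over all grids of each size, which requires a dedicated search procedure; the sketch below is a simplification that only considers equal-frequency grids, so it generally lower-bounds the true MIC. The function name and the choice $B(n) = n^{0.6}$ are assumptions here (the latter is a common default).

```python
import numpy as np

def simplified_mic(x, y, alpha=0.6):
    """Rough MIC sketch: searches only equal-frequency grids rather than all grids,
    so it generally lower-bounds the true MIC. B(n) = n**alpha caps the grid size x*y."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    b = int(n ** alpha)
    best = 0.0
    for nx in range(2, b // 2 + 1):
        for ny in range(2, b // nx + 1):
            # Equal-frequency discretization onto an nx-by-ny grid.
            x_cuts = np.quantile(x, np.linspace(0, 1, nx + 1)[1:-1])
            y_cuts = np.quantile(y, np.linspace(0, 1, ny + 1)[1:-1])
            counts = np.zeros((nx, ny))
            np.add.at(counts, (np.searchsorted(x_cuts, x), np.searchsorted(y_cuts, y)), 1)
            p_xy = counts / n
            p_x = p_xy.sum(axis=1, keepdims=True)
            p_y = p_xy.sum(axis=0, keepdims=True)
            mask = p_xy > 0
            mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))
            best = max(best, mi / np.log(min(nx, ny)))
    return best

# Nonlinear, nearly noise-free relationship: the score should be far above that of independent noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = x ** 2 + 0.05 * rng.normal(size=1000)
print(simplified_mic(x, y))
```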
- Kraskov et al.[^1] propose an approximation for mutual information based on k-NN distances. The general idea is to estimate the entropy using the average distance to the $k$-Nearest Neighbors over all data points.
- We use the metric space $Z = X \times Y$ with the norm defined as
  $$\|z - z'\| = \max\left\{\|x - x'\|,\ \|y - y'\|\right\}$$
  We also denote $\epsilon(i)/2$ as the distance from $z_i$ to its $k$-th nearest neighbor, and similarly $\epsilon_x(i)/2$ and $\epsilon_y(i)/2$ for the distances between the same points projected into $X$ and $Y$. Further denote $n_x(i)$ as the number of points $x_j$ whose distance from $x_i$ is strictly less than $\epsilon(i)/2$, and analogously $n_y(i)$ for $Y$. We take inspiration from the Kozachenko-Leonenko Estimate for Entropy and use the same notation defined there.
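To make the notation concrete, here is a brute-force sketch of $\epsilon(i)$, $\epsilon_x(i)$, $\epsilon_y(i)$ and the counts $n_x(i)$, $n_y(i)$, assuming NumPy, univariate $X$ and $Y$, and $O(N^2)$ pairwise distances; the function name is arbitrary. Note that some implementations instead take $\epsilon_x(i)$, $\epsilon_y(i)$ from the smallest rectangle containing all $k$ neighbors rather than from the $k$-th neighbor alone.

```python
import numpy as np

def ksg_neighborhoods(x, y, k=3):
    """Compute eps(i), eps_x(i), eps_y(i), n_x(i), n_y(i) for 1-D samples x, y."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])   # pairwise distances in X
    dy = np.abs(y[:, None] - y[None, :])   # pairwise distances in Y
    dz = np.maximum(dx, dy)                # max-norm distances in Z = (X, Y)
    order = np.argsort(dz, axis=1)         # column 0 is the point itself (distance 0)
    kth = order[:, k]                      # index of the k-th nearest joint neighbor
    rows = np.arange(n)
    eps = 2.0 * dz[rows, kth]              # eps(i)/2 = distance to the k-th neighbor
    eps_x = 2.0 * dx[rows, kth]            # projection of that same pair onto X ...
    eps_y = 2.0 * dy[rows, kth]            # ... and onto Y
    # Counts of points strictly closer than eps(i)/2 in each marginal (excluding i itself).
    n_x = np.sum(dx < (eps / 2.0)[:, None], axis=1) - 1
    n_y = np.sum(dy < (eps / 2.0)[:, None], axis=1) - 1
    return eps, eps_x, eps_y, n_x, n_y

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 500))
eps, eps_x, eps_y, n_x, n_y = ksg_neighborhoods(x, y, k=3)
```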
- The first estimator $I^{(1)}(X, Y)$ is given by
  $$I^{(1)}(X, Y) = \psi(k) - \left\langle \psi(n_x + 1) + \psi(n_y + 1) \right\rangle + \psi(N)$$
  where $\psi$ is the digamma function and $\langle \cdot \rangle$ denotes the average over all data points (a code sketch follows this list).
  - The approximation is derived using the entropy estimator
    $$\hat{H}(X) = -\psi(k) + \psi(N) + \log c_d + \frac{d}{N} \sum_{i=1}^{N} \log \epsilon(i)$$
    where $c_d$ is the volume of the $d$-dimensional unit ball.
  - To approximate the entropy of each marginal distribution, we use
    $$\hat{H}(X) \approx -\frac{1}{N} \sum_{i=1}^{N} \psi(n_x(i) + 1) + \psi(N) + \log c_{d_x} + \frac{d_x}{N} \sum_{i=1}^{N} \log \epsilon(i)$$
    where the approximation is derived by supposing that the $(n_x(i) + 1)$-th neighbor of $x_i$ has distance $\epsilon(i)/2$.
  - Limitation: The derivation applies the Kozachenko-Leonenko estimator exactly only in the joint space; in the marginal spaces, $\epsilon(i)$ is generally not the distance to the $(n_x(i) + 1)$-th neighbor, so the marginal estimates carry additional error.
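A self-contained sketch of $I^{(1)}$ for one-dimensional $X$ and $Y$, assuming NumPy and SciPy for the digamma function; distances are computed by brute force and the function name is arbitrary.

```python
import numpy as np
from scipy.special import digamma

def ksg_mi_1(x, y, k=3):
    """KSG estimator I^(1)(X, Y) for 1-D samples, O(N^2) brute-force distances."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)                      # max-norm in the joint space
    half_eps = np.sort(dz, axis=1)[:, k]         # eps(i)/2 (column 0 is the self-distance)
    # n_x(i), n_y(i): neighbors strictly within eps(i)/2 in each marginal space.
    n_x = np.sum(dx < half_eps[:, None], axis=1) - 1
    n_y = np.sum(dy < half_eps[:, None], axis=1) - 1
    # I^(1) = psi(k) - <psi(n_x + 1) + psi(n_y + 1)> + psi(N)
    return digamma(k) - np.mean(digamma(n_x + 1) + digamma(n_y + 1)) + digamma(n)

# Correlated Gaussians: the estimate should be near the true value -0.5 * log(1 - rho^2).
rng = np.random.default_rng(0)
rho = 0.8
x = rng.normal(size=1000)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=1000)
print(ksg_mi_1(x, y), -0.5 * np.log(1 - rho**2))
```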
- The second approximation $I^{(2)}(X, Y)$ uses hyperrectangles instead of hypercubes and is given by
  $$I^{(2)}(X, Y) = \psi(k) - \frac{1}{k} - \left\langle \psi(n_x) + \psi(n_y) \right\rangle + \psi(N)$$
  where $n_x(i)$ and $n_y(i)$ now count the points within distance $\epsilon_x(i)/2$ and $\epsilon_y(i)/2$ (inclusive) of $x_i$ and $y_i$ (a code sketch follows this list).
  - The general idea is to replace the probability mass $p_i(\epsilon)$ used in the Kozachenko-Leonenko approximation with the following: denote $\tilde{p}_i$ as the mass of the rectangle of size $\epsilon_x(i) \times \epsilon_y(i)$ centered at $z_i$, while $p_i$ is the mass of the square of size $\epsilon(i)$; the derivation uses $\tilde{p}_i$ in place of $p_i$.
  - Caveat: The above is technically valid only for approximating the entropy of the joint distribution. The marginal entropies require corrections on the order of $1/n_x(i)$ for $H(X)$ and $1/n_y(i)$ for $H(Y)$.
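A matching sketch of $I^{(2)}$ under the same assumptions (one-dimensional variables, NumPy/SciPy, brute-force distances). Here $\epsilon_x(i)$ and $\epsilon_y(i)$ are taken as the projected distances to the $k$-th joint-space neighbor, following the definition above, and the marginal counts include points exactly on the boundary.

```python
import numpy as np
from scipy.special import digamma

def ksg_mi_2(x, y, k=3):
    """KSG estimator I^(2)(X, Y) for 1-D samples, using hyperrectangles."""
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)
    kth = np.argsort(dz, axis=1)[:, k]           # index of the k-th nearest joint neighbor
    rows = np.arange(n)
    half_eps_x = dx[rows, kth]                   # eps_x(i)/2: projected distance onto X
    half_eps_y = dy[rows, kth]                   # eps_y(i)/2: projected distance onto Y
    # n_x(i), n_y(i): points within eps_x(i)/2 and eps_y(i)/2 (inclusive), excluding i itself.
    n_x = np.sum(dx <= half_eps_x[:, None], axis=1) - 1
    n_y = np.sum(dy <= half_eps_y[:, None], axis=1) - 1
    # I^(2) = psi(k) - 1/k - <psi(n_x) + psi(n_y)> + psi(N)
    return digamma(k) - 1.0 / k - np.mean(digamma(n_x) + digamma(n_y)) + digamma(n)
```

For real use, both sketches should swap the $O(N^2)$ distance matrices for a KD-tree neighbor search (e.g. `scipy.spatial.cKDTree` with the Chebyshev metric) to handle larger sample sizes.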
Links
- Murphy Ch. 2.8
- Probability Theory - more on probability, which is the basis of Information Theory
Footnotes
[^1]: Kraskov, Stoegbauer, and Grassberger (2003), Estimating Mutual Information.