- The study of efficient data compression, and of robust and reliable data transmission.
Important Quantities
- Let $p$ be a probability distribution. The entropy is defined as
    $$H(p) = -\sum_x p(x) \log_2 p(x)$$
    - It can be interpreted as the expected amount of surprise we have about a given event. That is, it is the expected amount of information we gain from the distribution.
    - This also corresponds to the degree to which the distribution is uniform.
    - This also corresponds to the amount of uncertainty we have about the distribution (see the sketch below).
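A minimal numerical sketch of the definition, using numpy (the helper name `entropy` is my own, not from the notes):

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution p, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))          # 1.0 bit: uniform, maximal uncertainty
print(entropy([0.9, 0.1]))          # ~0.47 bits: less uniform, less surprise on average
print(entropy([1.0, 0.0]))          # 0.0 bits: no uncertainty at all
```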
- Let $p$ and $q$ be probability distributions. The cross entropy is defined as
    $$H(p, q) = -\sum_x p(x) \log_2 q(x)$$
    - It is a measure of the average number of bits needed to identify an event drawn from the set if the coding scheme is optimized for an estimated probability distribution $q$ rather than the true distribution $p$ (sketched below).
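A minimal sketch of the definition (the helper name `cross_entropy` is my own), showing that the cross entropy equals $H(p)$ only when the code is built from the true distribution:

```python
import numpy as np

def cross_entropy(p, q):
    """Average bits to code samples from p using a code optimized for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                    # terms with p(x) = 0 contribute nothing
    return -np.sum(p[mask] * np.log2(q[mask]))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(cross_entropy(p, p))          # 1.5 bits = H(p): code matches the true distribution
print(cross_entropy(p, q))          # ~1.58 bits: extra cost from coding with q instead
```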
- The Kullback-Leibler divergence is defined as
    $$\mathrm{KL}(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$
    - It measures the coding inefficiency from using a model $q$ to compress the data when the true distribution is $p$.
    - It also measures the dissimilarity between $p$ and $q$.
    - It can also be formulated as:
        $$\mathrm{KL}(p \| q) = H(p, q) - H(p)$$
        Formulated this way, the KL divergence is the average number of extra bits needed to encode the data due to the fact that we used distribution $q$ rather than the true distribution $p$.
        - This formulation informally motivates the inequality $H(p, q) \geq H(p)$, i.e. $\mathrm{KL}(p \| q) \geq 0$, with equality iff $p = q$ (see the sketch below).
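A small self-contained check of the two formulations above, on a hypothetical pair of distributions:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])     # true distribution
q = np.array([1/3, 1/3, 1/3])       # model used for coding

kl = np.sum(p * np.log2(p / q))                                # direct definition of KL(p || q)
extra_bits = -np.sum(p * np.log2(q)) + np.sum(p * np.log2(p))  # H(p, q) - H(p)

print(kl, extra_bits)               # both ~0.085 bits; KL >= 0, so H(p, q) >= H(p)
```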
- The conditional entropy is defined as
    $$H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$$
    - It measures how much entropy $Y$ has remaining once we have learnt the value of $X$ (sketched below).
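A minimal sketch on a hypothetical 2x2 joint table (the values are chosen only for illustration):

```python
import numpy as np

# Hypothetical joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])

p_x = p_xy.sum(axis=1, keepdims=True)       # marginal p(x)
p_y_given_x = p_xy / p_x                    # conditional p(y | x)

H_Y_given_X = -np.sum(p_xy * np.log2(p_y_given_x))
p_y = p_xy.sum(axis=0)
H_Y = -np.sum(p_y * np.log2(p_y))
print(H_Y_given_X, H_Y)    # ~0.57 vs ~0.93 bits: observing X reduces uncertainty about Y
```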
- The Pointwise Mutual Information between two events $x$ and $y$ is defined as
    $$\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)} = \log_2 \frac{p(x \mid y)}{p(x)} = \log_2 \frac{p(y \mid x)}{p(y)}$$
    - It measures the discrepancy between how often the events occur together and how often they would be expected to occur together by chance.
    - It is also the amount we learn from updating the prior $p(x)$ into the posterior $p(x \mid y)$ (see the sketch below).
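A sketch of the pointwise version for a single pair of events, reusing the same hypothetical joint table:

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],              # hypothetical joint p(x, y)
                 [0.05, 0.55]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

x, y = 1, 1                                 # one particular pair of events
pmi = np.log2(p_xy[x, y] / (p_x[x] * p_y[y]))
print(pmi)                                  # ~0.50 bits > 0: they co-occur more than chance predicts
print(np.log2((p_xy[x, y] / p_y[y]) / p_x[x]))  # same value via log2(p(x | y) / p(x))
```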
- The Mutual Information determines how similar the joint distribution $p(X, Y)$ is to the factored distribution $p(X)\,p(Y)$. It is therefore defined as
    $$I(X; Y) = \mathrm{KL}\big(p(X, Y) \,\|\, p(X)\,p(Y)\big) = \sum_x \sum_y p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}$$
    - This determines how much information can be extracted about one random variable by observing another.
    - It can also be expressed as $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.
    - It can also be expressed as the expected value of the pointwise mutual information (see the sketch below).
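A sketch computing the mutual information of the same hypothetical table as the expected PMI; the result also matches $H(Y) - H(Y \mid X)$ from the conditional-entropy sketch above:

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],              # hypothetical joint p(x, y)
                 [0.05, 0.55]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

mi = np.sum(p_xy * np.log2(p_xy / (p_x * p_y)))   # KL(p(X, Y) || p(X) p(Y))
print(mi)             # ~0.36 bits = H(Y) - H(Y | X); 0 iff X and Y are independent
```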
- For continuous random variables, the mutual information can be approximated using the Maximal Information Coefficient (MIC). We define
    $$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} I\big(X(G); Y(G)\big)}{\log_2 \min(x, y)}$$
    where $\mathcal{G}(x, y)$ denotes the set of 2D grids of size $x \times y$, and $X(G), Y(G)$ are discretizations of the variables on this grid. Then, the maximal information coefficient is defined as
    $$\mathrm{MIC} = \max_{x, y \,:\, xy < B} m(x, y)$$
    where $B$ is a sample-size-dependent bound on the number of bins we can use.
    - A MIC of $0$ represents no relationship between the variables.
    - A MIC of $1$ represents a noise-free relationship of any form, not just linear (see the rough sketch below).
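A rough sketch of the idea only, assuming quantile (equal-count) grids and a brute-force search; the published MIC algorithm optimises grid placement far more carefully, so treat this as illustrative rather than faithful. The helper names are mine, and $B = n^{0.6}$ is a commonly used default bound:

```python
import numpy as np

def discrete_mi(counts):
    """Mutual information (bits) of a 2D contingency table of counts."""
    p = counts / counts.sum()
    p_x = p.sum(axis=1, keepdims=True)
    p_y = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / (p_x * p_y)[nz]))

def mic_sketch(x, y, B=None):
    """Approximate MIC: best normalized MI over quantile grids with nx * ny < B."""
    n = len(x)
    B = B or int(n ** 0.6)                  # a common sample-size-dependent bound
    best = 0.0
    for nx in range(2, B):
        for ny in range(2, B):
            if nx * ny >= B:
                continue
            x_edges = np.quantile(x, np.linspace(0, 1, nx + 1))
            y_edges = np.quantile(y, np.linspace(0, 1, ny + 1))
            counts, _, _ = np.histogram2d(x, y, bins=(x_edges, y_edges))
            best = max(best, discrete_mi(counts) / np.log2(min(nx, ny)))
    return best

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
print(mic_sketch(x, x ** 2))                   # high (towards 1): noise-free but non-linear relationship
print(mic_sketch(x, rng.uniform(-1, 1, 500)))  # much lower (towards 0): no relationship
```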
Links
- Probability Theory - more on probability, which is the basis of Information Theory.
- Murphy Ch. 2.8