• The study of efficient data compression and of robust, reliable data transmission.

Important Quantities

  • Let $p$ be the probability distribution of a discrete random variable $X$ with $K$ states. The entropy $\mathbb{H}(X)$ is defined as
    $$\mathbb{H}(X) = -\sum_{k=1}^{K} p(X = k)\, \log_2 p(X = k)$$
    • It can be interpreted as the expected surprise of an event drawn from the distribution, i.e. the expected amount of information gained by observing a draw from it.
    • It also measures how close the distribution is to uniform: the uniform distribution has maximal entropy.
    • It also corresponds to the amount of uncertainty we have about a draw from the distribution.
  • Let $p$ and $q$ be probability distributions over the same $K$ states. The cross entropy $\mathbb{H}(p, q)$ is defined as
    $$\mathbb{H}(p, q) = -\sum_{k=1}^{K} p_k \log_2 q_k$$
    • It is the average number of bits needed to identify an event drawn from the set when the coding scheme is optimized for an estimated probability distribution $q$ rather than the true distribution $p$.
  • The Kullback-Leibler Divergence between $p$ and $q$ is defined as
    $$\mathbb{KL}(p \,\|\, q) = \sum_{k=1}^{K} p_k \log_2 \frac{p_k}{q_k}$$
    • It measures the coding inefficiency incurred by using the model $q$ to compress data whose true distribution is $p$.
    • It also measures the dissimilarity between $p$ and $q$.
    • It can also be formulated as:
      $$\mathbb{KL}(p \,\|\, q) = \mathbb{H}(p, q) - \mathbb{H}(p)$$
      Formulated this way, the KL divergence is the average number of extra bits needed to encode the data because we used the distribution $q$ rather than the true distribution $p$.
      • This formulation informally motivates the inequality $\mathbb{H}(p, q) \geq \mathbb{H}(p)$, i.e. $\mathbb{KL}(p \,\|\, q) \geq 0$, with equality iff $p = q$ (a numerical sketch of these quantities appears after this list).
  • The conditional entropy of $Y$ given $X$ is defined as
    $$\mathbb{H}(Y \mid X) = \sum_{x} p(x)\, \mathbb{H}(Y \mid X = x) = \mathbb{H}(X, Y) - \mathbb{H}(X)$$
    • It measures how much entropy $Y$ has remaining once we have learnt the value of $X$.
  • The Pointwise Mutual Information between two events $x$ and $y$ is defined as
    $$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)} = \log \frac{p(x \mid y)}{p(x)} = \log \frac{p(y \mid x)}{p(y)}$$
    • It measures the discrepancy between how often $x$ and $y$ occur together and how often they would co-occur by chance if they were independent.
    • It is also the amount we learn from updating the prior $p(x)$ into the posterior $p(x \mid y)$.
  • The Mutual Information measures how similar the joint distribution $p(X, Y)$ is to the factored distribution $p(X)\, p(Y)$. It is therefore defined as
    $$\mathbb{I}(X; Y) = \mathbb{KL}\big(p(X, Y) \,\|\, p(X)\, p(Y)\big) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
    • This determines how much information we can extract about one random variable by observing the other.
    • It can also be expressed as $\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X)$.
    • It can also be expressed as the expected value of the pointwise mutual information (the joint-distribution sketch after this list checks these identities numerically).
    • For continuous random variables, the mutual information can be approximated using the Maximal Information Coefficient (MIC). We define
      $$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} \mathbb{I}\big(X(G); Y(G)\big)}{\log \min(x, y)}$$
      Where $\mathcal{G}(x, y)$ denotes the set of 2D grids of size $x \times y$, and $X(G), Y(G)$ are discretizations of the variables onto this grid. Then, the maximal information coefficient is defined as
      $$\mathrm{MIC} = \max_{x, y \,:\, xy < B} m(x, y)$$
      Where $B$ is a sample-size-dependent bound on the number of bins we can use (a rough numerical sketch appears after this list).
      • A MIC of 0 represents no relationship between the variables.
      • A MIC of 1 represents a noise-free relationship of any form, not just a linear one.
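
As a concrete illustration of the entropy, cross entropy, and KL divergence definitions above, here is a minimal NumPy sketch; the distributions `p` and `q` are made up for the example, and base-2 logarithms are used so everything is measured in bits.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_k p_k log2 p_k, in bits (0 log 0 is treated as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log2 q_k: bits needed to code draws from p with a code
    built for q (assumes q_k > 0 wherever p_k > 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p) = sum_k p_k log2 (p_k / q_k)."""
    return cross_entropy(p, q) - entropy(p)

# Example distributions over 4 states (made up for illustration).
p = np.array([0.5, 0.25, 0.125, 0.125])   # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])    # model: uniform

print(entropy(p))           # 1.75 bits
print(entropy(q))           # 2.0 bits -- the uniform distribution has maximal entropy
print(cross_entropy(p, q))  # 2.0 bits
print(kl_divergence(p, q))  # 0.25 extra bits; always >= 0, and 0 iff p == q
```

Note that the cross entropy (2.0 bits) exceeds the entropy of $p$ (1.75 bits), and the gap is exactly the KL divergence of 0.25 bits.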
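
The joint-distribution sketch referenced above: starting from a small, made-up joint table $p(x, y)$, it computes the conditional entropy, the pointwise mutual information of one pair of events, and the mutual information, and checks the identity $\mathbb{I}(X; Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X)$ numerically. All numbers in the table are arbitrary.

```python
import numpy as np

# Made-up joint distribution p(x, y) over X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
p_xy = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.15, 0.30]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Conditional entropy H(Y | X) = sum_x p(x) H(Y | X = x).
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

# Pointwise mutual information of a single pair of events, e.g. (X=0, Y=0).
pmi_00 = np.log2(p_xy[0, 0] / (p_x[0] * p_y[0]))

# Mutual information I(X; Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ],
# i.e. the expected value of the PMI under p(x, y).
mask = p_xy > 0
I_XY = np.sum(p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask]))

# Check the identity I(X; Y) = H(Y) - H(Y | X): the two printed numbers agree.
print(I_XY, entropy(p_y) - H_Y_given_X)
print(pmi_00)   # > 0: X=0 and Y=0 co-occur more often than chance would suggest
```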
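
Finally, a rough sketch of the MIC idea. This is not the full algorithm: the true definition maximizes over all placements of each $x \times y$ grid (practical implementations search grid placements much more cleverly), whereas this sketch only tries equal-frequency bins, so it is a crude approximation for illustration only. The bound $B = n^{0.6}$ follows the commonly cited default; the function names and sample data here are made up.

```python
import numpy as np

def discrete_mi(x_bins, y_bins):
    """Mutual information (bits) between two discrete label arrays, via the empirical joint."""
    joint = np.zeros((x_bins.max() + 1, y_bins.max() + 1))
    for i, j in zip(x_bins, y_bins):
        joint[i, j] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

def quantile_bins(v, k):
    """Discretize v into k roughly equal-frequency bins, labelled 0..k-1."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(edges, v)

def mic_sketch(x, y, B=None):
    """Rough MIC approximation: search grid sizes (kx, ky) with kx*ky < B,
    but only with quantile-based grids rather than all grid placements."""
    n = len(x)
    B = B if B is not None else n ** 0.6          # sample-size-dependent bound
    best = 0.0
    for kx in range(2, int(B) + 1):
        for ky in range(2, int(B) + 1):
            if kx * ky >= B:
                break
            mi = discrete_mi(quantile_bins(x, kx), quantile_bins(y, ky))
            best = max(best, mi / np.log2(min(kx, ky)))   # normalize into [0, 1]
    return best

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
print(mic_sketch(x, x ** 2))                    # near 1: strong but non-linear relationship
print(mic_sketch(x, rng.uniform(-1, 1, 500)))   # much closer to 0: independent variables
```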

Links