• Suppose we have a posterior predictive distribution that determines whether a new observation $\tilde{x}$ is consistent with some dataset $\mathcal{D}$. We represent this with $p(\tilde{x} \mid \mathcal{D})$ and describe it with some model (hypothesis) $h$.
  • We become more certain about this probability distribution with more data.

Bayesian Components

  • We follow Occam’s Razor; that is, we choose to describe $\mathcal{D}$ with the simplest model consistent with the dataset. This is quantified using the likelihood that the data are consistent with the chosen hypothesis, $p(\mathcal{D} \mid h)$; under the size principle this is $\left(1/\lvert h \rvert\right)^{N}$ for a hypothesis consistent with all $N$ observations, so simpler (smaller) hypotheses receive higher likelihood.
  • We can also incorporate beliefs about the space of models using the prior $p(h)$.
  • Our posterior, then, is obtained as the normalized product of the likelihood and the prior: $p(h \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid h)\, p(h)}{\sum_{h' \in \mathcal{H}} p(\mathcal{D} \mid h')\, p(h')}$. The posterior is the internal belief state about the world. It is updated and validated via observations about the world (see the sketch below).
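
Below is a minimal sketch of this normalized product over a small discrete hypothesis space. The two hypotheses, their heads-probabilities, the prior weights, and the observed flips are all made-up values for illustration.

```python
# A minimal sketch (hypothetical hypotheses, priors, and data) of obtaining the
# posterior over a small discrete hypothesis space as the normalized product
# of likelihood and prior.

hypotheses = {"fair": 0.5, "biased": 0.9}   # assumed heads-probabilities
prior = {"fair": 0.8, "biased": 0.2}        # beliefs about the model space
data = [1, 1, 1, 0, 1]                      # observed flips (1 = heads)

def likelihood(theta, data):
    """p(D | h): probability of the observed flips under heads-probability theta."""
    n_heads = sum(data)
    n_tails = len(data) - n_heads
    return theta ** n_heads * (1 - theta) ** n_tails

# Unnormalized posterior p(D | h) p(h), then normalize over the hypothesis space.
unnorm = {h: likelihood(theta, data) * prior[h] for h, theta in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)
```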

Estimates

  • In general, when we have enough data, the posterior peaks on a single hypothesis $\hat{h}$, namely the Maximum a Posteriori (MAP) estimate given by $\hat{h}_{\text{MAP}} = \operatorname*{argmax}_{h} p(h \mid \mathcal{D}) = \operatorname*{argmax}_{h} p(\mathcal{D} \mid h)\, p(h)$

  • Another estimate is the Maximum Likelihood Estimate (MLE), where we choose the hypothesis that maximizes the likelihood of the data being observed given the model, or $\hat{h}_{\text{MLE}} = \operatorname*{argmax}_{h} p(\mathcal{D} \mid h)$

  • As the size of the dataset increases, $\hat{h}_{\text{MAP}}$ converges to $\hat{h}_{\text{MLE}}$. With enough data, the data overwhelms any prior assumptions.

    • To see this, note that $\hat{h}_{\text{MAP}} = \operatorname*{argmax}_{h} \left[\log p(\mathcal{D} \mid h) + \log p(h)\right]$, and $\log p(\mathcal{D} \mid h)$ scales with more data, whereas $\log p(h)$ does not (see the sketch below).
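
The following sketch illustrates this on a discretized coin-bias parameter; the prior, the true bias, and the sample sizes are all assumed values. As $N$ grows, the log-likelihood term dominates the fixed log-prior term and the MAP estimate approaches the MLE.

```python
import numpy as np

# A small illustration (made-up prior and data): MAP vs. MLE on a grid of
# coin-bias values. The log-likelihood term grows with N, while the log-prior
# term stays fixed, so the MAP estimate approaches the MLE as N increases.

rng = np.random.default_rng(0)
thetas = np.linspace(0.01, 0.99, 99)             # discretized parameter space
log_prior = np.log(thetas ** 4 * (1 - thetas))   # an assumed prior favoring large theta

true_theta = 0.3
for n in (5, 50, 5000):
    flips = rng.random(n) < true_theta
    n1 = int(flips.sum())
    n0 = n - n1
    log_lik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)   # scales with n
    mle = thetas[np.argmax(log_lik)]
    map_est = thetas[np.argmax(log_lik + log_prior)]          # log-prior does not scale
    print(f"n={n:5d}  MLE={mle:.2f}  MAP={map_est:.2f}")
```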

Posterior Predictive Distributions

  • The posterior predictive distribution can be calculated using Bayes Model Averaging: $p(\tilde{x} \mid \mathcal{D}) = \sum_{h} p(\tilde{x} \mid h)\, p(h \mid \mathcal{D})$
  • It can also be approximated using the Plug-in Approximation, given as $p(\tilde{x} \mid \mathcal{D}) \approx p(\tilde{x} \mid \hat{h})$
    That is, we can approximate the posterior predictive distribution using a model derived from some point estimate $\hat{h}$ (i.e., the MLE or MAP)
    • Note: This underestimates the uncertainty.
    • The estimate will generally differ from the fully Bayesian one, but the two converge to the same value with more data (see the comparison below).
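
A small comparison of the two, using the same hypothetical two-hypothesis coin setup as above; all numbers are assumed for illustration.

```python
# A sketch (hypothetical two-hypothesis coin example) comparing Bayes Model
# Averaging with the plug-in approximation for p(next flip = heads | D).

hypotheses = {"fair": 0.5, "biased": 0.9}   # assumed heads-probabilities
prior = {"fair": 0.8, "biased": 0.2}
data = [1, 1, 1]                            # three heads in a row

def likelihood(theta, data):
    return theta ** sum(data) * (1 - theta) ** (len(data) - sum(data))

unnorm = {h: likelihood(t, data) * prior[h] for h, t in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}

# Bayes Model Averaging: sum_h p(x | h) p(h | D)
bma = sum(hypotheses[h] * posterior[h] for h in hypotheses)

# Plug-in approximation: p(x | h_MAP), which ignores posterior uncertainty over h
h_map = max(posterior, key=posterior.get)
plugin = hypotheses[h_map]

print(f"BMA: {bma:.3f}   plug-in ({h_map}): {plugin:.3f}")
```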

Extension

  • The Bayesian model can be extended to the case of continuous distributions, essentially by replacing sums with integrals and finite hypothesis sets with continuous parameter spaces.
  • To characterize the dataset $\mathcal{D}$, we can use a set of sufficient statistics $s(\mathcal{D})$ which describe the dataset fully. That is, if we have $\theta$ as a parameter, we get $p(\theta \mid \mathcal{D}) = p(\theta \mid s(\mathcal{D}))$ (see the check below).
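
A quick check of this, assuming the Beta-Bernoulli setting introduced in the next section: two differently ordered datasets with the same success/failure counts yield the same posterior.

```python
from scipy import stats

# A quick check (assumed Beta-Bernoulli setting): the posterior depends on the
# data only through the sufficient statistics (N1, N0) = (#successes, #failures).

a, b = 2.0, 2.0                      # assumed Beta(a, b) prior
d1 = [1, 0, 1, 1, 0, 1]              # two differently ordered datasets ...
d2 = [0, 1, 1, 0, 1, 1]              # ... with the same counts

def posterior(data):
    n1 = sum(data)
    n0 = len(data) - n1
    return stats.beta(a + n1, b + n0)  # conjugate update uses only (n1, n0)

print(posterior(d1).mean(), posterior(d2).mean())  # identical posteriors
```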

Beta Binomial Model

  • One extension of the above is the Beta-Binomial, which is applicable to a binary (success/failure) outcome such as a binary classification task. That is, the likelihood is $p(\mathcal{D} \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$ and the conjugate prior is $\text{Beta}(\theta \mid a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$,
    where $\theta$ is the probability of success and $N_1$, $N_0$ are the observed counts of successes and failures.
  • Any additional evidence constitutes updating $a$ and $b$ above based on the new successes and failures observed: the posterior is $\text{Beta}(\theta \mid a + N_1, b + N_0)$.
  • The posterior mean of the Beta-Binomial is a convex combination of the prior mean and the MLE: $\mathbb{E}[\theta \mid \mathcal{D}] = \lambda\, m + (1-\lambda)\, \hat{\theta}_{\text{MLE}}$, where $m = \frac{a}{a+b}$ is the prior mean and $\lambda = \frac{a+b}{a+b+N}$.
  • The posterior variance decreases at a rate of $O(1/N)$, where $N$ is the number of samples; roughly, $\operatorname{var}[\theta \mid \mathcal{D}] \approx \frac{\hat{\theta}(1-\hat{\theta})}{N}$. The variance is maximized when $\hat{\theta} = 0.5$ and minimized when $\hat{\theta}$ is close to $0$ or $1$.
    • This implies that uncertainty (entropy) is at a maximum when success and failure are equally likely (see here for more)
  • On its own, the model highlights the sparse data problem, wherein our MLE estimate $\hat{\theta}_{\text{MLE}} = N_1/N$ will be close to $0$ or $1$ when the dataset is small (e.g., exactly $0$ if no successes have been observed yet). This approximation is often poor.
    • This can be mitigated with Laplace’s Rule of Succession, wherein we assume a uniform prior $\text{Beta}(1, 1)$ and plug in the posterior mean, so that $p(\tilde{x} = 1 \mid \mathcal{D}) = \frac{N_1 + 1}{N_0 + N_1 + 2}$, where $N_0$ and $N_1$ are the counts of failures and successes respectively. This is called Add-One Smoothing (see the sketch after this list).
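
A worked sketch of the quantities above (conjugate update, posterior mean as a convex combination, posterior variance, and add-one smoothing); the prior hyperparameters and counts are assumed values.

```python
# A worked sketch of the Beta-Binomial quantities above (all numbers assumed):
# conjugate update, posterior mean as a convex combination of prior mean and MLE,
# posterior variance, and Laplace's rule of succession (add-one smoothing).

a, b = 2.0, 2.0            # assumed Beta(a, b) prior
n1, n0 = 3, 1              # observed successes / failures
n = n1 + n0

# Conjugate update: posterior is Beta(a + n1, b + n0)
a_post, b_post = a + n1, b + n0

# Posterior mean = lambda * prior_mean + (1 - lambda) * MLE
prior_mean = a / (a + b)
mle = n1 / n
lam = (a + b) / (a + b + n)
post_mean = a_post / (a_post + b_post)
assert abs(post_mean - (lam * prior_mean + (1 - lam) * mle)) < 1e-12

# Posterior variance (Beta variance formula), which shrinks roughly like 1/N
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

# Laplace's rule of succession: uniform Beta(1, 1) prior, predict with the posterior mean
laplace = (n1 + 1) / (n1 + n0 + 2)

print(post_mean, post_var, laplace)
```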

Dirichlet Multinomial Model

  • The Dirichlet Multinomial Model is a generalization of the Beta-Binomial model, but for $K$ outcomes instead of two.

  • The update rule for the posterior given new evidence is simply an extension of the rule for the Beta-Binomial model. That is,

    $p(\theta \mid \mathcal{D}) = \text{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_K + N_K)$

    where $N_k$ denotes the number of times the $k$-th event occurs, and

    $\alpha = (\alpha_1, \ldots, \alpha_K)$

    denotes the hyperparameters in the prior. That is, we assume a prior of $\text{Dir}(\theta \mid \alpha_1, \ldots, \alpha_K)$.

  • The MLE $\hat{\theta}_k = \frac{N_k}{N}$ (which equals the MAP estimate under a uniform prior) and the MAP estimate $\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$ (with $\alpha_0 = \sum_k \alpha_k$) are just extensions of the corresponding estimates for the Beta-Binomial.

  • The posterior predictive distribution is likewise an extension of that of the Beta-Binomial: $p(X = j \mid \mathcal{D}) = \frac{\alpha_j + N_j}{\alpha_0 + N}$ (see the sketch below).
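
A sketch of the Dirichlet-Multinomial update and the estimates above, with assumed (made-up) counts and a uniform prior.

```python
import numpy as np

# A sketch of the Dirichlet-Multinomial update (hypothetical dice-roll counts):
# posterior hyperparameters are alpha_k + N_k, and the posterior predictive
# probability of outcome j is (alpha_j + N_j) / (alpha_0 + N).

K = 6
alpha = np.ones(K)                       # assumed uniform Dir(1, ..., 1) prior
counts = np.array([3, 0, 2, 1, 0, 4])    # N_k: times each outcome was observed
N = counts.sum()

alpha_post = alpha + counts              # posterior: Dir(alpha_1 + N_1, ..., alpha_K + N_K)

mle = counts / N                                         # can assign zero probability
post_pred = alpha_post / alpha_post.sum()                # (alpha_j + N_j) / (alpha_0 + N), never zero
map_est = (alpha_post - 1) / (alpha_post.sum() - K)      # MAP; equals the MLE under a uniform prior

print(mle, post_pred, map_est, sep="\n")
```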

Links