• Suppose we have a posterior predictive distribution that determines whether a new observation $\tilde{x}$ is consistent with some dataset $\mathcal{D}$. We represent this with $p(\tilde{x} \mid \mathcal{D})$ and describe it with some model (hypothesis) $h$.
  • We become more certain about this probability distribution with more data.

Bayesian Components

  • We follow Occam’s Razor; that is, we choose to describe $\mathcal{D}$ with the simplest model consistent with the dataset. This is quantified using the likelihood that the data are consistent with the chosen hypothesis, $p(\mathcal{D} \mid h)$; under the size principle this is $\left(1/\lvert h \rvert\right)^{N}$ for a hypothesis consistent with all $N$ observations, so simpler (smaller) hypotheses receive higher likelihood.
  • We can also incorporate beliefs about the space of models using the prior $p(h)$.
  • Our posterior, then, is obtained as the normalized product of the likelihood and the prior: $p(h \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid h)\, p(h)}{\sum_{h' \in \mathcal{H}} p(\mathcal{D} \mid h')\, p(h')}$. The posterior is the internal belief state about the world. It is updated and validated via observations about the world (see the sketch below).
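
Below is a minimal sketch of this normalized product over a small discrete hypothesis space. The two hypotheses, their heads-probabilities, the prior weights, and the observed flips are all made-up values for illustration.

```python
# A minimal sketch (hypothetical hypotheses, priors, and data) of obtaining the
# posterior over a small discrete hypothesis space as the normalized product
# of likelihood and prior.

hypotheses = {"fair": 0.5, "biased": 0.9}   # assumed heads-probabilities
prior = {"fair": 0.8, "biased": 0.2}        # beliefs about the model space
data = [1, 1, 1, 0, 1]                      # observed flips (1 = heads)

def likelihood(theta, data):
    """p(D | h): probability of the observed flips under heads-probability theta."""
    n_heads = sum(data)
    n_tails = len(data) - n_heads
    return theta ** n_heads * (1 - theta) ** n_tails

# Unnormalized posterior p(D | h) p(h), then normalize over the hypothesis space.
unnorm = {h: likelihood(theta, data) * prior[h] for h, theta in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}
print(posterior)
```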

Estimates

  • In general, when we have enough data, the posterior peaks on a single hypothesis $\hat{h}$, namely the Maximum a Posteriori (MAP) estimate given by $\hat{h}_{\text{MAP}} = \operatorname*{argmax}_{h} p(h \mid \mathcal{D}) = \operatorname*{argmax}_{h} p(\mathcal{D} \mid h)\, p(h)$

  • Another estimate is the Maximum Likelihood Estimate (MLE), where we choose the hypothesis that maximizes the likelihood of the data being observed given the model, or $\hat{h}_{\text{MLE}} = \operatorname*{argmax}_{h} p(\mathcal{D} \mid h)$

  • As the size of the dataset increases, $\hat{h}_{\text{MAP}}$ converges to $\hat{h}_{\text{MLE}}$. With enough data, the data overwhelms any prior assumptions.

    • To see this, note that $\hat{h}_{\text{MAP}} = \operatorname*{argmax}_{h} \left[\log p(\mathcal{D} \mid h) + \log p(h)\right]$, and $\log p(\mathcal{D} \mid h)$ scales with more data, whereas $\log p(h)$ does not (see the sketch below).
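
The following sketch illustrates this on a discretized coin-bias parameter; the prior, the true bias, and the sample sizes are all assumed values. As $N$ grows, the log-likelihood term dominates the fixed log-prior term and the MAP estimate approaches the MLE.

```python
import numpy as np

# A small illustration (made-up prior and data): MAP vs. MLE on a grid of
# coin-bias values. The log-likelihood term grows with N, while the log-prior
# term stays fixed, so the MAP estimate approaches the MLE as N increases.

rng = np.random.default_rng(0)
thetas = np.linspace(0.01, 0.99, 99)             # discretized parameter space
log_prior = np.log(thetas ** 4 * (1 - thetas))   # an assumed prior favoring large theta

true_theta = 0.3
for n in (5, 50, 5000):
    flips = rng.random(n) < true_theta
    n1 = int(flips.sum())
    n0 = n - n1
    log_lik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)   # scales with n
    mle = thetas[np.argmax(log_lik)]
    map_est = thetas[np.argmax(log_lik + log_prior)]          # log-prior does not scale
    print(f"n={n:5d}  MLE={mle:.2f}  MAP={map_est:.2f}")
```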

Posterior Predictive Distributions

  • The posterior predictive distribution can be calculated using Bayes Model Averaging: $p(\tilde{x} \mid \mathcal{D}) = \sum_{h} p(\tilde{x} \mid h)\, p(h \mid \mathcal{D})$
  • It can also be approximated using the Plug-in Approximation, given as $p(\tilde{x} \mid \mathcal{D}) \approx p(\tilde{x} \mid \hat{h})$
    That is, we can approximate the posterior predictive distribution using a model derived from some point estimate $\hat{h}$ (i.e., the MLE or MAP)
    • Note: This underestimates the uncertainty.
    • The estimate will generally differ from the fully Bayesian one, but the two converge to the same value with more data (see the comparison below).
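
A small comparison of the two, using the same hypothetical two-hypothesis coin setup as above; all numbers are assumed for illustration.

```python
# A sketch (hypothetical two-hypothesis coin example) comparing Bayes Model
# Averaging with the plug-in approximation for p(next flip = heads | D).

hypotheses = {"fair": 0.5, "biased": 0.9}   # assumed heads-probabilities
prior = {"fair": 0.8, "biased": 0.2}
data = [1, 1, 1]                            # three heads in a row

def likelihood(theta, data):
    return theta ** sum(data) * (1 - theta) ** (len(data) - sum(data))

unnorm = {h: likelihood(t, data) * prior[h] for h, t in hypotheses.items()}
evidence = sum(unnorm.values())
posterior = {h: v / evidence for h, v in unnorm.items()}

# Bayes Model Averaging: sum_h p(x | h) p(h | D)
bma = sum(hypotheses[h] * posterior[h] for h in hypotheses)

# Plug-in approximation: p(x | h_MAP), which ignores posterior uncertainty over h
h_map = max(posterior, key=posterior.get)
plugin = hypotheses[h_map]

print(f"BMA: {bma:.3f}   plug-in ({h_map}): {plugin:.3f}")
```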

Extension

  • The Bayesian model can be extended to the case of continuous distributions, essentially by replacing sums with integrals and finite hypothesis sets with continuous parameter spaces.
  • To characterize the dataset $\mathcal{D}$, we can use a set of sufficient statistics $s(\mathcal{D})$ which describe the dataset fully. That is, if we have $\theta$ as a parameter, we get $p(\theta \mid \mathcal{D}) = p(\theta \mid s(\mathcal{D}))$ (see the check below).
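
A quick check of this, assuming the Beta-Bernoulli setting introduced in the next section: two differently ordered datasets with the same success/failure counts yield the same posterior.

```python
from scipy import stats

# A quick check (assumed Beta-Bernoulli setting): the posterior depends on the
# data only through the sufficient statistics (N1, N0) = (#successes, #failures).

a, b = 2.0, 2.0                      # assumed Beta(a, b) prior
d1 = [1, 0, 1, 1, 0, 1]              # two differently ordered datasets ...
d2 = [0, 1, 1, 0, 1, 1]              # ... with the same counts

def posterior(data):
    n1 = sum(data)
    n0 = len(data) - n1
    return stats.beta(a + n1, b + n0)  # conjugate update uses only (n1, n0)

print(posterior(d1).mean(), posterior(d2).mean())  # identical posteriors
```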

Beta Binomial Model

  • One extension of the above is the Beta-Binomial, which is applicable to a binary (success/failure) outcome such as a binary classification task. That is, the likelihood is $p(\mathcal{D} \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$ and the conjugate prior is $\text{Beta}(\theta \mid a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$,
    where $\theta$ is the probability of success and $N_1$, $N_0$ are the observed counts of successes and failures.
  • Any additional evidence constitutes updating $a$ and $b$ above based on the new successes and failures observed: the posterior is $\text{Beta}(\theta \mid a + N_1, b + N_0)$.
  • The posterior mean of the Beta-Binomial is a convex combination of the prior mean and the MLE: $\mathbb{E}[\theta \mid \mathcal{D}] = \lambda\, m + (1-\lambda)\, \hat{\theta}_{\text{MLE}}$, where $m = \frac{a}{a+b}$ is the prior mean and $\lambda = \frac{a+b}{a+b+N}$.
  • The posterior variance decreases at a rate of $O(1/N)$, where $N$ is the number of samples; roughly, $\operatorname{var}[\theta \mid \mathcal{D}] \approx \frac{\hat{\theta}(1-\hat{\theta})}{N}$. The variance is maximized when $\hat{\theta} = 0.5$ and minimized when $\hat{\theta}$ is close to $0$ or $1$.
    • This implies that uncertainty (entropy) is at a maximum when success and failure are equally likely (see here for more)
  • On its own, the model highlights the sparse data problem, wherein our MLE estimate $\hat{\theta}_{\text{MLE}} = N_1/N$ will be close to $0$ or $1$ when the dataset is small (e.g., exactly $0$ if no successes have been observed yet). This approximation is often poor.
    • This can be mitigated with Laplace’s Rule of Succession, wherein we assume a uniform prior $\text{Beta}(1, 1)$ and plug in the posterior mean, so that $p(\tilde{x} = 1 \mid \mathcal{D}) = \frac{N_1 + 1}{N_0 + N_1 + 2}$, where $N_0$ and $N_1$ are the counts of failures and successes respectively. This is called Add-One Smoothing (see the sketch after this list).
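
A worked sketch of the quantities above (conjugate update, posterior mean as a convex combination, posterior variance, and add-one smoothing); the prior hyperparameters and counts are assumed values.

```python
# A worked sketch of the Beta-Binomial quantities above (all numbers assumed):
# conjugate update, posterior mean as a convex combination of prior mean and MLE,
# posterior variance, and Laplace's rule of succession (add-one smoothing).

a, b = 2.0, 2.0            # assumed Beta(a, b) prior
n1, n0 = 3, 1              # observed successes / failures
n = n1 + n0

# Conjugate update: posterior is Beta(a + n1, b + n0)
a_post, b_post = a + n1, b + n0

# Posterior mean = lambda * prior_mean + (1 - lambda) * MLE
prior_mean = a / (a + b)
mle = n1 / n
lam = (a + b) / (a + b + n)
post_mean = a_post / (a_post + b_post)
assert abs(post_mean - (lam * prior_mean + (1 - lam) * mle)) < 1e-12

# Posterior variance (Beta variance formula), which shrinks roughly like 1/N
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

# Laplace's rule of succession: uniform Beta(1, 1) prior, predict with the posterior mean
laplace = (n1 + 1) / (n1 + n0 + 2)

print(post_mean, post_var, laplace)
```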

Dirichlet Multinomial Model

  • The Dirichlet Multinomial Model is a generalization of the Beta-Binomial model, but for $K$ outcomes instead of two.

  • The update rule for the posterior given new evidence is simply an extension of the rule for the Beta-Binomial model. That is,

    $p(\theta \mid \mathcal{D}) = \text{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_K + N_K)$

    where $N_k$ denotes the number of times the $k$-th event occurs, and

    $\alpha = (\alpha_1, \ldots, \alpha_K)$

    denotes the hyperparameters in the prior. That is, we assume a prior of $\text{Dir}(\theta \mid \alpha_1, \ldots, \alpha_K)$.

  • The MLE $\hat{\theta}_k = \frac{N_k}{N}$ (which equals the MAP estimate under a uniform prior) and the MAP estimate $\hat{\theta}_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$ (with $\alpha_0 = \sum_k \alpha_k$) are just extensions of the corresponding estimates for the Beta-Binomial.

  • The posterior predictive distribution is likewise an extension of that of the Beta-Binomial: $p(X = j \mid \mathcal{D}) = \frac{\alpha_j + N_j}{\alpha_0 + N}$ (see the sketch below).
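
A sketch of the Dirichlet-Multinomial update and the estimates above, with assumed (made-up) counts and a uniform prior.

```python
import numpy as np

# A sketch of the Dirichlet-Multinomial update (hypothetical dice-roll counts):
# posterior hyperparameters are alpha_k + N_k, and the posterior predictive
# probability of outcome j is (alpha_j + N_j) / (alpha_0 + N).

K = 6
alpha = np.ones(K)                       # assumed uniform Dir(1, ..., 1) prior
counts = np.array([3, 0, 2, 1, 0, 4])    # N_k: times each outcome was observed
N = counts.sum()

alpha_post = alpha + counts              # posterior: Dir(alpha_1 + N_1, ..., alpha_K + N_K)

mle = counts / N                                         # can assign zero probability
post_pred = alpha_post / alpha_post.sum()                # (alpha_j + N_j) / (alpha_0 + N), never zero
map_est = (alpha_post - 1) / (alpha_post.sum() - K)      # MAP; equals the MLE under a uniform prior

print(mle, post_pred, map_est, sep="\n")
```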

Links