• It makes use of the posterior distribution to summarize all our uncertainty about a random variable (compare with Frequentist Statistics).
  • It treats the dataset as fixed and the parameters as unknown.

Posteriors

Maximum A Posteriori

  • Make use of a point estimate for the posterior (via mean, median or mode).
  • For MAP estimates, we make use of the mode because it can be computed efficiently (as an optimization problem) and is easy to interpret.
  • Some drawbacks.
    • It is a point estimate so there is no measure of uncertainty.
      • Solution: Use credible intervals
    • It overfits as an estimator because we don’t capture uncertainties.
    • The mode is a poor choice of summary statistic, since it can be unrepresentative of the bulk of a skewed or multimodal distribution.
      • Solution: Use Decision Theory-based methods.
    • It is not invariant to reparameterization. Changing from one representation to another equivalent representation changes the MAP estimate.
      • One solution to this is to optimize the following
        $\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta)\, |\mathbf{I}(\theta)|^{-1/2}$
        Where $|\mathbf{I}(\theta)|$ is the determinant of the Fisher information matrix associated with $p(x \mid \theta)$, which is parameterization independent.
      • Unfortunately, optimizing the above is difficult.
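The non-invariance can be checked numerically. A minimal sketch, assuming a hypothetical Beta(2, 5) posterior: the mode found in the $\theta$ parameterization differs from the mode found after the logit reparameterization, once mapped back.

```python
import numpy as np

# Hypothetical posterior: Beta(2, 5). Its mode is (a-1)/(a+b-2) = 1/5.
a, b = 2.0, 5.0
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)

# Unnormalized log-density in the theta parameterization.
log_p_theta = (a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta)
map_theta = theta[np.argmax(log_p_theta)]

# Reparameterize as phi = logit(theta): the density gains a Jacobian
# factor d(theta)/d(phi) = theta * (1 - theta).
log_p_phi = log_p_theta + np.log(theta) + np.log1p(-theta)
map_phi_mapped_back = theta[np.argmax(log_p_phi)]  # phi-space MAP, in theta units

print(map_theta, map_phi_mapped_back)  # roughly 1/5 vs 2/7: different points
```

The Jacobian of the change of variables shifts the location of the maximum, which is exactly why the MAP estimate is not reparameterization invariant.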

Credible Intervals

  • We can specify a contiguous region that contains $1-\alpha$ of the posterior distribution’s mass, leaving $\alpha/2$ of the mass in each tail (either side).
    • Hence why for Normal distributions, we report $\mu \pm 1.96\sigma$ (roughly $\mu \pm 2\sigma$) for $\alpha = 0.05$.
    • These are called credible intervals since their mass is $1-\alpha$. They are central if they have $\alpha/2$ mass on either side.
    • Credible intervals are not confidence intervals. Confidence intervals are statements about the sampling distribution of an estimator; credible intervals act on the posterior distribution.
  • We can also use a Monte Carlo approximation: draw $S$ samples from the posterior.
    • Sort the samples.
    • Then, find the entries at ranks $\lfloor (\alpha/2)\, S \rfloor$ and $\lfloor (1 - \alpha/2)\, S \rfloor$ in the sorted list.
  • We may also use the highest posterior density (HPD) region / interval, which contains the set of most probable points that in total constitute $1-\alpha$ of the mass. That is, we find the threshold $p^*$ such that
    $1 - \alpha = \int_{\theta : p(\theta \mid \mathcal{D}) > p^*} p(\theta \mid \mathcal{D})\, d\theta$
    The HPD is then the set we integrated over: $C_\alpha(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \ge p^*\}$
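Both interval types can be estimated from posterior samples. A minimal sketch, assuming a standard Normal stand-in for the posterior (so both intervals should come out near $(-1.96, 1.96)$); the HPD is found as the narrowest window of sorted samples containing $1-\alpha$ of them:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in posterior: samples from N(0, 1), sorted once up front.
samples = np.sort(rng.normal(size=100_000))
S, alpha = samples.size, 0.05

# Central credible interval: entries at ranks (alpha/2)S and (1 - alpha/2)S.
central = (samples[int(alpha / 2 * S)], samples[int((1 - alpha / 2) * S)])

# HPD interval: the narrowest window that still contains (1 - alpha) mass.
n_keep = int(np.floor((1 - alpha) * S))
widths = samples[n_keep:] - samples[: S - n_keep]
i = np.argmin(widths)
hpd = (samples[i], samples[i + n_keep])
```

For a symmetric unimodal posterior the two intervals coincide; for skewed posteriors the HPD is strictly narrower than the central interval.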

Model Selection

  • Problem: Given a set of models of varying complexities, how should we choose the best model?

Cross Validation

  • Estimate the generalization error of all the candidate models, and pick the model that seems to be the best.
  • Requires fitting each model $K$ times, once per fold of $K$-fold cross validation.

Bayesian Model Selection

  • Compute the posterior over models and use the MAP model. That is, calculate
    $p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m'} p(\mathcal{D} \mid m')\, p(m')}$
    And choose $\hat{m} = \operatorname*{argmax}_{m}\, p(m \mid \mathcal{D})$
  • Assuming a uniform prior over models, this reduces to picking the model that maximizes $p(\mathcal{D} \mid m)$, called the marginal likelihood or the evidence for model $m$. This is the model-level analogue of MLE:
    $p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid m)\, d\theta$
  • The intuition behind the marginal likelihood is that it prevents overfitting: it does not simply reward models with more parameters. This is Bayesian Occam’s razor.
    • Complex models can predict many datasets, so they must spread their probability mass thinly; they cannot assign as large a probability to any one dataset as a simpler model that fits it. This follows from the conservation of probability mass ($\sum_{\mathcal{D}'} p(\mathcal{D}' \mid m) = 1$).

Marginal Likelihood

  • Computing the marginal likelihood.

    Let $q(\theta)$, $q(\mathcal{D} \mid \theta)$, $q(\theta \mid \mathcal{D})$ be unnormalized distributions and $Z_0$, $Z_\ell$, $Z_N$ their normalization constants, so that $p(\theta) = q(\theta)/Z_0$ is the prior, $p(\mathcal{D} \mid \theta) = q(\mathcal{D} \mid \theta)/Z_\ell$ is the likelihood, and $p(\theta \mid \mathcal{D}) = q(\theta \mid \mathcal{D})/Z_N$ is the posterior.

    We have that the marginal likelihood (the conditioning on $m$ omitted for convenience) is

    $p(\mathcal{D}) = \frac{Z_N}{Z_0 Z_\ell}$

  • Some useful marginal likelihoods

    • Beta prior (Bernoulli likelihood, with $N_1$ heads and $N_0$ tails):

      $p(\mathcal{D}) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$

    • Dirichlet prior (multinoulli likelihood, with counts $\mathbf{N}$):

      $p(\mathcal{D}) = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}$, where $B(\cdot)$ here is the multivariate Beta function.

    • Gaussian-Gaussian-Wishart (Normal-inverse-Wishart prior):

      $p(\mathcal{D}) = \frac{1}{\pi^{ND/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \frac{|\mathbf{S}_0|^{\nu_0/2}}{|\mathbf{S}_N|^{\nu_N/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}$
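The Beta-Bernoulli case is simple enough to compute directly. A minimal sketch, working in log space via `lgamma` to avoid overflow for large counts ($N_1$ heads, $N_0$ tails, Beta$(\alpha_1, \alpha_0)$ prior):

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(n1, n0, a1=1.0, a0=1.0):
    # log p(D) = log B(a1 + N1, a0 + N0) - log B(a1, a0)
    return log_beta_fn(a1 + n1, a0 + n0) - log_beta_fn(a1, a0)

# Under a uniform Beta(1, 1) prior, 1 head and 1 tail gives p(D) = 1/6.
print(exp(log_marginal_likelihood(1, 1)))
```

Note that $p(\mathcal{D})$ shrinks quickly as $N$ grows, which is why log space is the practical representation.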

  • The Bayesian Information Criterion (BIC) is an approximation to the log of the integral in the marginal likelihood. It has a penalized log-likelihood form, where the penalty depends on model complexity. That is

    $\text{BIC} \triangleq \log p(\mathcal{D} \mid \hat{\theta}) - \frac{\text{dof}(\hat{\theta})}{2} \log N$

    Where $\hat{\theta}$ is a point estimate (MLE / MAP) and $\text{dof}(\hat{\theta})$ is the degrees of freedom (number of free parameters) of the model.

    We aim to maximize the BIC score.

  • We can also define a BIC-cost that we minimize

    $\text{BIC-cost} \triangleq -2 \log p(\mathcal{D} \mid \hat{\theta}) + \text{dof}(\hat{\theta}) \log N$

    • The minimum description length (MDL) principle is tied to this. It says the score of a model is how well it fits the data minus how complex the model is to define.
  • For complex models, we may use the Akaike information criterion (AIC). It has a smaller penalty than BIC:

    $\text{AIC} \triangleq \log p(\mathcal{D} \mid \hat{\theta}_{\text{MLE}}) - \text{dof}(\hat{\theta})$
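The BIC score can be used directly for model selection. A minimal sketch under assumed toy settings: polynomial regression with Gaussian noise, where `dof` counts the fitted coefficients plus the noise variance (both choices here are illustrative, not the only convention):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)  # data from a degree-1 model

def bic_score(deg):
    # Fit a polynomial of the given degree by least squares (the Gaussian MLE).
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    n = x.size
    sigma2 = resid @ resid / n                      # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    dof = deg + 2                                   # deg+1 coefficients plus sigma2
    return loglik - 0.5 * dof * np.log(n)           # BIC = loglik - dof/2 * log N

scores = {d: bic_score(d) for d in range(6)}
# The degree-1 model should score best here: higher degrees barely improve
# the fit, and the complexity penalty outweighs the gain.
```

Larger datasets make the $\log N$ penalty harsher, so BIC prefers simpler models than AIC does as $N$ grows.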

  • The prior has an effect on the marginal likelihood since the marginal likelihood is an average of likelihoods weighted by the prior over the parameter.

    • When the prior itself is uncertain, put a prior on the prior. Usually uninformative hyper-priors suffice.
    • We compute
      $p(\mathcal{D} \mid m) = \int \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta, m)\, p(\eta \mid m)\, d\theta\, d\eta$

Hierarchical Bayes

  • In Hierarchical Bayes, we assume uncertainty about the priors. That is, we put priors on the priors: for parameters $\theta$ we put the prior $p(\theta \mid \eta)$ for hyperparameters $\eta$, with a hyper-prior $p(\eta)$ on top.

  • In graphical form it is represented as the chain $\eta \rightarrow \theta \rightarrow \mathcal{D}$

  • This can be coupled with parameter tying wherein we assume that parameters are shared across multiple distributions.

    • In such a case, we assume each $\theta_i$ is drawn from some common distribution parameterized by $\eta$.
    • This leads to multi-task learning.

Empirical Bayes

  • We may also use empirical Bayes, also called type II maximum likelihood or the evidence procedure, where via numerical optimization we optimize

    $\hat{\eta} = \operatorname*{argmax}_{\eta}\, p(\mathcal{D} \mid \eta) = \operatorname*{argmax}_{\eta} \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$

  • It is a computationally cheap approximation to inference in a hierarchical Bayes model.

  • It makes use of a point estimate for the hyperparameter $\eta$. Since the hyperparameters are typically much lower-dimensional than the parameters, this is less prone to overfitting.

  • For deeper hierarchical models, the higher the level at which we stop integrating and start optimizing, the “more Bayesian” the method: performing empirical Bayes only at the top of the hierarchy stays closest to fully Bayesian inference.

  • This gives a crude framework for why hyperparameter optimization works. We can simply use an estimate instead of integrating over hyperparameter space.
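A minimal empirical-Bayes sketch under assumed toy settings: each of $J$ groups has a latent mean drawn from $\mathcal{N}(0, \tau^2)$ with one noisy observation of known noise variance, so the type-II MLE of $\tau^2$ is available in closed form and plugs into a shrinkage estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0        # known observation noise variance
tau2_true = 4.0     # true hyperparameter (prior variance), to be recovered
J = 5_000

theta = rng.normal(0.0, np.sqrt(tau2_true), size=J)  # latent group means
y = rng.normal(theta, np.sqrt(sigma2))               # one observation per group

# Marginally y_j ~ N(0, tau2 + sigma2), so the type-II MLE of tau2 is the
# excess of the observed second moment over the known noise variance.
tau2_hat = max(0.0, (y ** 2).mean() - sigma2)

# Plug-in posterior means shrink each observation toward the prior mean 0.
shrinkage = tau2_hat / (tau2_hat + sigma2)
theta_post = shrinkage * y
```

This is the point of the section: instead of integrating over $\tau^2$, we optimize it once and reuse the estimate everywhere, which is cheap but ignores hyperparameter uncertainty.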

Bayes Factors

  • The Bayes factor is the ratio of marginal likelihoods between models $M_1$ and $M_0$. That is
    $BF_{1,0} \triangleq \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)}$
    When $BF_{1,0} > 1$, prefer $M_1$; else, prefer $M_0$.
  • The Bayes factor tends to favor the simpler model unless the complex one fits clearly better, since a complex model must spread its marginal likelihood thinly over many possible datasets.
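A minimal sketch of the classic coin example: $M_0$ is a fair coin, $M_1$ lets $\theta \sim \text{Beta}(1,1)$, and the Bayes factor is the ratio of the two marginal likelihoods (computed in log space):

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bayes_factor(n_heads, n_tails):
    # M0: fair coin with theta fixed at 0.5.
    log_m0 = (n_heads + n_tails) * log(0.5)
    # M1: theta ~ Beta(1, 1); marginal likelihood is B(N1+1, N0+1) / B(1, 1).
    log_m1 = log_beta_fn(n_heads + 1, n_tails + 1) - log_beta_fn(1, 1)
    return log_m1 - log_m0

print(exp(log_bayes_factor(5, 5)))  # < 1: the simpler fair-coin model wins
print(exp(log_bayes_factor(9, 1)))  # > 1: evidence favors the biased-coin model
```

With balanced data the simpler model is preferred even though $M_1$ nests it, which is Bayesian Occam's razor at work.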

Priors

  • The Jeffreys-Lindley paradox arises from a poor choice of priors: with improper or extremely vague priors, Bayes factors can strongly favor the simpler model regardless of the data.

  • Priors capture initial assumptions about the world.

  • Uninformative Priors assume that we have no strong beliefs about the parameter $\theta$.

  • Improper Priors are those that do not integrate to $1$. They are still useful considering we normalize the posterior anyway. Note that this requires the posterior to be proper.

  • The Haldane prior is defined as the most non-informative prior of a Bernoulli process. It is the improper prior

    $p(\theta) \propto \text{Beta}(\theta \mid 0, 0) \propto \theta^{-1} (1 - \theta)^{-1}$

    It behaves like a mixture of two equal point masses at $\theta = 0$ and $\theta = 1$.

  • The Jeffreys prior is a non-informative prior. It is defined as the probability distribution

    $p(\theta) \propto \sqrt{\det \mathbf{I}(\theta)}$

    Where $\mathbf{I}(\theta)$ is the Fisher information matrix.

    • The idea is that any reparameterization of an uninformative prior should also be uninformative; constructing the prior from the Fisher information makes it invariant under reparameterization. For a Bernoulli parameter it yields $\text{Beta}(\theta \mid \tfrac{1}{2}, \tfrac{1}{2})$.
    • The Jeffreys prior is translation invariant for location parameters and scale invariant for scale parameters.
  • Robust Priors have heavy tails, to avoid forcing estimates to be close to the prior mean. This is done so that the prior has little effect on the result.

  • Sensitivity Analysis can be used to check if the choice of prior was appropriate by checking how the conclusion changes in response to changes in modeling assumptions. Insensitivity to priors is desirable.

  • An important theorem: a mixture of conjugate priors is also conjugate.

    • Suppose $z$ is an indicator variable meaning that $\theta$ comes from mixture component $k$. Then we have a prior
      $p(\theta) = \sum_k p(z = k)\, p(\theta \mid z = k)$
      Where each $p(\theta \mid z = k)$ is conjugate and the $p(z = k)$ are the prior mixing weights. The posterior is given as
      $p(\theta \mid \mathcal{D}) = \sum_k p(z = k \mid \mathcal{D})\, p(\theta \mid \mathcal{D}, z = k)$
      Where the posterior mixing weights are given by
      $p(z = k \mid \mathcal{D}) = \frac{p(z = k)\, p(\mathcal{D} \mid z = k)}{\sum_{k'} p(z = k')\, p(\mathcal{D} \mid z = k')}$
      Where $p(\mathcal{D} \mid z = k)$ is the marginal likelihood for mixture component $k$.
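A minimal sketch of this theorem for coins, assuming a hypothetical prior mixing a "roughly fair" and a "heads-biased" Beta component; each component updates conjugately and the mixing weights are reweighted by their marginal likelihoods:

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Hypothetical prior: equal mixture of Beta(20, 20) ("roughly fair")
# and Beta(30, 10) ("biased toward heads").
prior = [(0.5, 20.0, 20.0), (0.5, 30.0, 10.0)]
n1, n0 = 20, 10  # observed heads and tails

posterior = []
for w, a, b in prior:
    # Marginal likelihood of the data under this component.
    log_ml = log_beta_fn(a + n1, b + n0) - log_beta_fn(a, b)
    posterior.append([w * exp(log_ml), a + n1, b + n0])

z = sum(p[0] for p in posterior)
for p in posterior:
    p[0] /= z  # posterior mixing weights p(z = k | D)

# posterior is again a mixture of Betas: [weight, a + N1, b + N0] per component.
```

Since 20 heads / 10 tails sits closer to the biased component's prior mean, that component's posterior weight grows, while each component's Beta parameters update exactly as in the single-prior conjugate case.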

Bayesian Decision Theory

  • The game: Based on some unknown label $y$, we are given an observation $x$. We then make a decision (take an action) $a$ and incur a loss $L(y, a)$ which measures how compatible the action was with the label.

  • The goal: Make a decision procedure $\delta(x)$ which specifies the optimal action for each input 1 such that the expected loss is minimized:

    $\delta(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \mathbb{E}\left[L(y, a)\right]$

    Or, if we specify a utility function $U(y, a) = -L(y, a)$ so that utility is maximized,

    $\delta(x) = \operatorname*{argmax}_{a \in \mathcal{A}} \mathbb{E}\left[U(y, a)\right]$

    we get the equivalent maximum expected utility principle.

  • The expectation is taken with respect to $p(y \mid x)$: it is the expected loss over the label given the observations.

  • For Bayesian Decision Theory, we minimize the posterior expected loss

    $\rho(a \mid x) \triangleq \mathbb{E}_{p(y \mid x)}\left[L(y, a)\right] = \sum_y L(y, a)\, p(y \mid x)$

  • The Bayes decision rule (Bayes estimator) is given as minimizing the posterior expected loss

    $\delta(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \rho(a \mid x)$

  • Some common rules:

    • The MAP estimate minimizes the 0-1 loss.
    • The MAP estimate can minimize the loss in the case where we include a “don’t know” / reject class, under specific conditions on the rejection cost.
    • The posterior mean minimizes the $\ell_2$ (squared) loss. This is the minimum mean squared error (MMSE) estimate.
    • The posterior median minimizes the $\ell_1$ (absolute) loss.
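The mean-vs-median claims can be checked numerically. A minimal sketch, assuming a skewed Gamma stand-in for $p(y \mid x)$ so that mean and median differ, and a grid of candidate actions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Samples standing in for a skewed posterior p(y | x): Gamma(2, 1),
# whose mean (2.0) and median (~1.68) differ noticeably.
y = rng.gamma(shape=2.0, scale=1.0, size=50_000)

actions = np.linspace(0.0, 10.0, 1_001)
l2_risk = np.array([np.mean((y - a) ** 2) for a in actions])
l1_risk = np.array([np.mean(np.abs(y - a)) for a in actions])

best_l2 = actions[np.argmin(l2_risk)]  # matches the sample mean
best_l1 = actions[np.argmin(l1_risk)]  # matches the sample median
```

For symmetric posteriors the two optimal actions coincide, which is why the choice of loss only matters when the posterior is skewed or heavy-tailed.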
  • For supervised learning tasks, this rule can be translated to minimizing generalization error. Given unknown parameters $\theta^*$ of nature and a predictor (action) $\delta$, with cost function $\ell(y, \delta(x))$ that determines how far the prediction $\delta(x)$ is from the ground truth $y$, we have

    $L(\theta^*, \delta) \triangleq \mathbb{E}_{(x, y) \sim p(x, y \mid \theta^*)}\left[\ell(y, \delta(x))\right]$

    Which is the generalization error.

    • The goal becomes minimizing it: $\delta^* = \operatorname*{argmin}_{\delta}\, L(\theta^*, \delta)$

Topics

Links

  • Murphy Ch. 5

    • 5.6 - More on Hierarchical and Empirical Bayes
    • 5.7.1 - Bayes estimators for common loss functions
  • The Fisher Information - more on the Fisher information. Essentially it measures the amount of information a variable carries about a parameter.

  • Probability Theory - more on the basics of probability. Bayes’ theorem is stated here.

  • Bayesian Models

Footnotes

  1. Not unlike Reinforcement Learning