• It makes use of the posterior distribution to summarize all our uncertainty about a random variable (compare with Frequentist Statistics).
  • It treats the dataset as fixed and the parameters as unknown.

Posteriors

Maximum A Posteriori

  • Make use of a point estimate for the posterior (via mean, median or mode).
  • For MAP estimates, we make use of the mode because it can be computed efficiently (as an optimization problem) and is easy to interpret.
  • Some drawbacks.
    • It is a point estimate so there is no measure of uncertainty.
      • Solution: Use credible intervals
    • It overfits as an estimator because we don’t capture uncertainties.
    • The mode is a poor choice of summary statistic, since it can be unrepresentative of the bulk of a skewed or multimodal distribution.
      • Solution: Use Decision Theory-based methods.
    • It is not invariant to reparameterization. Changing from one representation to another equivalent representation changes the MAP estimate.
      • One solution to this is to optimize the following
        $\hat{\theta} = \operatorname*{argmax}_{\theta}\; p(\mathcal{D} \mid \theta)\, p(\theta)\, |\mathbf{I}(\theta)|^{-1/2}$
        Where $|\mathbf{I}(\theta)|$ is the determinant of the Fisher information matrix associated with $p(x \mid \theta)$, which is parameterization independent.
      • Unfortunately, optimizing the above is difficult.
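The non-invariance can be checked numerically. A minimal sketch, assuming a hypothetical Beta(2, 5) posterior: the mode found in the $\theta$ parameterization differs from the mode found after the logit reparameterization, once mapped back.

```python
import numpy as np

# Hypothetical posterior: Beta(2, 5). Its mode is (a-1)/(a+b-2) = 1/5.
a, b = 2.0, 5.0
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)

# Unnormalized log-density in the theta parameterization.
log_p_theta = (a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta)
map_theta = theta[np.argmax(log_p_theta)]

# Reparameterize as phi = logit(theta): the density gains a Jacobian
# factor d(theta)/d(phi) = theta * (1 - theta).
log_p_phi = log_p_theta + np.log(theta) + np.log1p(-theta)
map_phi_mapped_back = theta[np.argmax(log_p_phi)]  # phi-space MAP, in theta units

print(map_theta, map_phi_mapped_back)  # roughly 1/5 vs 2/7: different points
```

The Jacobian of the change of variables shifts the location of the maximum, which is exactly why the MAP estimate is not reparameterization invariant.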

Credible Intervals

  • We can specify a contiguous region that contains $1-\alpha$ of the posterior distribution’s mass, leaving $\alpha/2$ of the mass in each tail (either side).
    • Hence why for Normal distributions, we report $\mu \pm 1.96\sigma$ (roughly $\mu \pm 2\sigma$) for $\alpha = 0.05$.
    • These are called credible intervals since their mass is $1-\alpha$. They are central if they have $\alpha/2$ mass on either side.
    • Credible intervals are not confidence intervals. Confidence intervals are statements about the sampling distribution of an estimator; credible intervals act on the posterior distribution.
  • We can also use a Monte Carlo approximation: draw $S$ samples from the posterior.
    • Sort the samples.
    • Then, find the entries at ranks $\lfloor (\alpha/2)\, S \rfloor$ and $\lfloor (1 - \alpha/2)\, S \rfloor$ in the sorted list.
  • We may also use the highest posterior density (HPD) region / interval, which contains the set of most probable points that in total constitute $1-\alpha$ of the mass. That is, we find the threshold $p^*$ such that
    $1 - \alpha = \int_{\theta : p(\theta \mid \mathcal{D}) > p^*} p(\theta \mid \mathcal{D})\, d\theta$
    The HPD is then the set we integrated over: $C_\alpha(\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \ge p^*\}$
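Both interval types can be estimated from posterior samples. A minimal sketch, assuming a standard Normal stand-in for the posterior (so both intervals should come out near $(-1.96, 1.96)$); the HPD is found as the narrowest window of sorted samples containing $1-\alpha$ of them:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in posterior: samples from N(0, 1), sorted once up front.
samples = np.sort(rng.normal(size=100_000))
S, alpha = samples.size, 0.05

# Central credible interval: entries at ranks (alpha/2)S and (1 - alpha/2)S.
central = (samples[int(alpha / 2 * S)], samples[int((1 - alpha / 2) * S)])

# HPD interval: the narrowest window that still contains (1 - alpha) mass.
n_keep = int(np.floor((1 - alpha) * S))
widths = samples[n_keep:] - samples[: S - n_keep]
i = np.argmin(widths)
hpd = (samples[i], samples[i + n_keep])
```

For a symmetric unimodal posterior the two intervals coincide; for skewed posteriors the HPD is strictly narrower than the central interval.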

Model Selection

  • Problem: Given a set of models of varying complexities, how should we choose the best model?

Cross Validation

  • Estimate the generalization error of all the candidate models, and pick the model that seems to be the best.
  • Requires fitting each model $K$ times, once per fold of $K$-fold cross validation.

Bayesian Model Selection

  • Compute the posterior over models and use the MAP model. That is, calculate
    $p(m \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid m)\, p(m)}{\sum_{m'} p(\mathcal{D} \mid m')\, p(m')}$
    And choose $\hat{m} = \operatorname*{argmax}_{m}\, p(m \mid \mathcal{D})$
  • Assuming a uniform prior over models, this reduces to picking the model that maximizes $p(\mathcal{D} \mid m)$, called the marginal likelihood or the evidence for model $m$. This is the model-level analogue of MLE:
    $p(\mathcal{D} \mid m) = \int p(\mathcal{D} \mid \theta)\, p(\theta \mid m)\, d\theta$
  • The intuition behind the marginal likelihood is that it prevents overfitting: it does not simply reward models with more parameters. This is Bayesian Occam’s razor.
    • Complex models can predict many datasets, so they must spread their probability mass thinly; they cannot assign as large a probability to any one dataset as a simpler model that fits it. This follows from the conservation of probability mass ($\sum_{\mathcal{D}'} p(\mathcal{D}' \mid m) = 1$).

Marginal Likelihood

  • Computing the marginal likelihood.

    Let $q(\theta)$, $q(\mathcal{D} \mid \theta)$, $q(\theta \mid \mathcal{D})$ be unnormalized distributions and $Z_0$, $Z_\ell$, $Z_N$ their normalization constants, so that $p(\theta) = q(\theta)/Z_0$ is the prior, $p(\mathcal{D} \mid \theta) = q(\mathcal{D} \mid \theta)/Z_\ell$ is the likelihood, and $p(\theta \mid \mathcal{D}) = q(\theta \mid \mathcal{D})/Z_N$ is the posterior.

    We have that the marginal likelihood (the conditioning on $m$ omitted for convenience) is

    $p(\mathcal{D}) = \frac{Z_N}{Z_0 Z_\ell}$

  • Some useful marginal likelihoods

    • Beta prior (Bernoulli likelihood, with $N_1$ heads and $N_0$ tails):

      $p(\mathcal{D}) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$

    • Dirichlet prior (multinoulli likelihood, with counts $\mathbf{N}$):

      $p(\mathcal{D}) = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}$, where $B(\cdot)$ here is the multivariate Beta function.

    • Gaussian-Gaussian-Wishart (Normal-inverse-Wishart prior):

      $p(\mathcal{D}) = \frac{1}{\pi^{ND/2}} \left(\frac{\kappa_0}{\kappa_N}\right)^{D/2} \frac{|\mathbf{S}_0|^{\nu_0/2}}{|\mathbf{S}_N|^{\nu_N/2}} \frac{\Gamma_D(\nu_N/2)}{\Gamma_D(\nu_0/2)}$
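The Beta-Bernoulli case is simple enough to compute directly. A minimal sketch, working in log space via `lgamma` to avoid overflow for large counts ($N_1$ heads, $N_0$ tails, Beta$(\alpha_1, \alpha_0)$ prior):

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(n1, n0, a1=1.0, a0=1.0):
    # log p(D) = log B(a1 + N1, a0 + N0) - log B(a1, a0)
    return log_beta_fn(a1 + n1, a0 + n0) - log_beta_fn(a1, a0)

# Under a uniform Beta(1, 1) prior, 1 head and 1 tail gives p(D) = 1/6.
print(exp(log_marginal_likelihood(1, 1)))
```

Note that $p(\mathcal{D})$ shrinks quickly as $N$ grows, which is why log space is the practical representation.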

  • The Bayesian Information Criterion (BIC) is an approximation to the log of the integral in the marginal likelihood. It has a penalized log-likelihood form, where the penalty depends on model complexity. That is

    $\text{BIC} \triangleq \log p(\mathcal{D} \mid \hat{\theta}) - \frac{\text{dof}(\hat{\theta})}{2} \log N$

    Where $\hat{\theta}$ is a point estimate (MLE / MAP) and $\text{dof}(\hat{\theta})$ is the degrees of freedom (number of free parameters) of the model.

    We aim to maximize the BIC score.

  • We can also define a BIC-cost that we minimize

    $\text{BIC-cost} \triangleq -2 \log p(\mathcal{D} \mid \hat{\theta}) + \text{dof}(\hat{\theta}) \log N$

    • The minimum description length (MDL) principle is tied to this. It says the score of a model is how well it fits the data minus how complex the model is to define.
  • For complex models, we may use the Akaike information criterion (AIC). It has a smaller penalty than BIC:

    $\text{AIC} \triangleq \log p(\mathcal{D} \mid \hat{\theta}_{\text{MLE}}) - \text{dof}(\hat{\theta})$
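The BIC score can be used directly for model selection. A minimal sketch under assumed toy settings: polynomial regression with Gaussian noise, where `dof` counts the fitted coefficients plus the noise variance (both choices here are illustrative, not the only convention):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)  # data from a degree-1 model

def bic_score(deg):
    # Fit a polynomial of the given degree by least squares (the Gaussian MLE).
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    n = x.size
    sigma2 = resid @ resid / n                      # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    dof = deg + 2                                   # deg+1 coefficients plus sigma2
    return loglik - 0.5 * dof * np.log(n)           # BIC = loglik - dof/2 * log N

scores = {d: bic_score(d) for d in range(6)}
# The degree-1 model should score best here: higher degrees barely improve
# the fit, and the complexity penalty outweighs the gain.
```

Larger datasets make the $\log N$ penalty harsher, so BIC prefers simpler models than AIC does as $N$ grows.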

  • The prior has an effect on the marginal likelihood since the marginal likelihood is an average of likelihoods weighted by the prior over the parameter.

    • When the prior itself is uncertain, put a prior on the prior. Usually uninformative hyper-priors suffice.
    • We compute
      $p(\mathcal{D} \mid m) = \int \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta, m)\, p(\eta \mid m)\, d\theta\, d\eta$

Hierarchical Bayes

  • In Hierarchical Bayes, we assume uncertainty about the priors. That is, we put priors on the priors: for parameters $\theta$ we put the prior $p(\theta \mid \eta)$ for hyperparameters $\eta$, with a hyper-prior $p(\eta)$ on top.

  • In graphical form it is represented as the chain $\eta \rightarrow \theta \rightarrow \mathcal{D}$

  • This can be coupled with parameter tying wherein we assume that parameters are shared across multiple distributions.

    • In such a case, we assume each $\theta_i$ is drawn from some common distribution parameterized by $\eta$.
    • This leads to multi-task learning.

Empirical Bayes

  • We may also use empirical Bayes, also called type II maximum likelihood or the evidence procedure, where via numerical optimization we optimize

    $\hat{\eta} = \operatorname*{argmax}_{\eta}\, p(\mathcal{D} \mid \eta) = \operatorname*{argmax}_{\eta} \int p(\mathcal{D} \mid \theta)\, p(\theta \mid \eta)\, d\theta$

  • It is a computationally cheap approximation to inference in a hierarchical Bayes model.

  • It makes use of a point estimate for the hyperparameter $\eta$. Since the hyperparameters are typically much lower-dimensional than the parameters, this is less prone to overfitting.

  • For deeper hierarchical models, the higher the level at which we stop integrating and start optimizing, the “more Bayesian” the method: performing empirical Bayes only at the top of the hierarchy stays closest to fully Bayesian inference.

  • This gives a crude framework for why hyperparameter optimization works. We can simply use an estimate instead of integrating over hyperparameter space.
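A minimal empirical-Bayes sketch under assumed toy settings: each of $J$ groups has a latent mean drawn from $\mathcal{N}(0, \tau^2)$ with one noisy observation of known noise variance, so the type-II MLE of $\tau^2$ is available in closed form and plugs into a shrinkage estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0        # known observation noise variance
tau2_true = 4.0     # true hyperparameter (prior variance), to be recovered
J = 5_000

theta = rng.normal(0.0, np.sqrt(tau2_true), size=J)  # latent group means
y = rng.normal(theta, np.sqrt(sigma2))               # one observation per group

# Marginally y_j ~ N(0, tau2 + sigma2), so the type-II MLE of tau2 is the
# excess of the observed second moment over the known noise variance.
tau2_hat = max(0.0, (y ** 2).mean() - sigma2)

# Plug-in posterior means shrink each observation toward the prior mean 0.
shrinkage = tau2_hat / (tau2_hat + sigma2)
theta_post = shrinkage * y
```

This is the point of the section: instead of integrating over $\tau^2$, we optimize it once and reuse the estimate everywhere, which is cheap but ignores hyperparameter uncertainty.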

Bayes Factors

  • The Bayes factor is the ratio of marginal likelihoods between models $M_1$ and $M_0$. That is
    $BF_{1,0} \triangleq \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)}$
    When $BF_{1,0} > 1$, prefer $M_1$; else, prefer $M_0$.
  • The Bayes factor tends to favor the simpler model unless the complex one fits clearly better, since a complex model must spread its marginal likelihood thinly over many possible datasets.
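A minimal sketch of the classic coin example: $M_0$ is a fair coin, $M_1$ lets $\theta \sim \text{Beta}(1,1)$, and the Bayes factor is the ratio of the two marginal likelihoods (computed in log space):

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bayes_factor(n_heads, n_tails):
    # M0: fair coin with theta fixed at 0.5.
    log_m0 = (n_heads + n_tails) * log(0.5)
    # M1: theta ~ Beta(1, 1); marginal likelihood is B(N1+1, N0+1) / B(1, 1).
    log_m1 = log_beta_fn(n_heads + 1, n_tails + 1) - log_beta_fn(1, 1)
    return log_m1 - log_m0

print(exp(log_bayes_factor(5, 5)))  # < 1: the simpler fair-coin model wins
print(exp(log_bayes_factor(9, 1)))  # > 1: evidence favors the biased-coin model
```

With balanced data the simpler model is preferred even though $M_1$ nests it, which is Bayesian Occam's razor at work.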

Priors

  • The Jeffreys-Lindley paradox arises from a poor choice of priors: with improper or extremely vague priors, Bayes factors can strongly favor the simpler model regardless of the data.

  • Priors capture initial assumptions about the world.

  • Uninformative Priors assume that we have no strong beliefs about the parameter $\theta$.

  • Improper Priors are those that do not integrate to $1$. They are still useful considering we normalize the posterior anyway. Note that this requires the posterior to be proper.

  • The Haldane prior is defined as the most non-informative prior of a Bernoulli process. It is the improper prior

    $p(\theta) \propto \text{Beta}(\theta \mid 0, 0) \propto \theta^{-1} (1 - \theta)^{-1}$

    It behaves like a mixture of two equal point masses at $\theta = 0$ and $\theta = 1$.

  • The Jeffreys prior is a non-informative prior. It is defined as the probability distribution

    $p(\theta) \propto \sqrt{\det \mathbf{I}(\theta)}$

    Where $\mathbf{I}(\theta)$ is the Fisher information matrix.

    • The idea is that any reparameterization of an uninformative prior should also be uninformative; constructing the prior from the Fisher information makes it invariant under reparameterization. For a Bernoulli parameter it yields $\text{Beta}(\theta \mid \tfrac{1}{2}, \tfrac{1}{2})$.
    • The Jeffreys prior is translation invariant for location parameters and scale invariant for scale parameters.
  • Robust Priors have heavy tails, to avoid forcing estimates to be close to the prior mean. This is done so that the prior has little effect on the result.

  • Sensitivity Analysis can be used to check if the choice of prior was appropriate by checking how the conclusion changes in response to changes in modeling assumptions. Insensitivity to priors is desirable.

  • An important theorem: a mixture of conjugate priors is also conjugate.

    • Suppose $z$ is an indicator variable meaning that $\theta$ comes from mixture component $k$. Then we have a prior
      $p(\theta) = \sum_k p(z = k)\, p(\theta \mid z = k)$
      Where each $p(\theta \mid z = k)$ is conjugate and the $p(z = k)$ are the prior mixing weights. The posterior is given as
      $p(\theta \mid \mathcal{D}) = \sum_k p(z = k \mid \mathcal{D})\, p(\theta \mid \mathcal{D}, z = k)$
      Where the posterior mixing weights are given by
      $p(z = k \mid \mathcal{D}) = \frac{p(z = k)\, p(\mathcal{D} \mid z = k)}{\sum_{k'} p(z = k')\, p(\mathcal{D} \mid z = k')}$
      Where $p(\mathcal{D} \mid z = k)$ is the marginal likelihood for mixture component $k$.
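A minimal sketch of this theorem for coins, assuming a hypothetical prior mixing a "roughly fair" and a "heads-biased" Beta component; each component updates conjugately and the mixing weights are reweighted by their marginal likelihoods:

```python
from math import exp, lgamma

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Hypothetical prior: equal mixture of Beta(20, 20) ("roughly fair")
# and Beta(30, 10) ("biased toward heads").
prior = [(0.5, 20.0, 20.0), (0.5, 30.0, 10.0)]
n1, n0 = 20, 10  # observed heads and tails

posterior = []
for w, a, b in prior:
    # Marginal likelihood of the data under this component.
    log_ml = log_beta_fn(a + n1, b + n0) - log_beta_fn(a, b)
    posterior.append([w * exp(log_ml), a + n1, b + n0])

z = sum(p[0] for p in posterior)
for p in posterior:
    p[0] /= z  # posterior mixing weights p(z = k | D)

# posterior is again a mixture of Betas: [weight, a + N1, b + N0] per component.
```

Since 20 heads / 10 tails sits closer to the biased component's prior mean, that component's posterior weight grows, while each component's Beta parameters update exactly as in the single-prior conjugate case.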

Bayesian Decision Theory

  • The game: Based on some unknown label $y$, we are given an observation $x$. We then make a decision (take an action) $a$ and incur a loss $L(y, a)$ which measures how compatible the action was with the label.

  • The goal: Make a decision procedure $\delta(x)$ which specifies the optimal action for each input 1 such that the expected loss is minimized:

    $\delta(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \mathbb{E}\left[L(y, a)\right]$

    Or, if we specify a utility function $U(y, a) = -L(y, a)$ so that utility is maximized,

    $\delta(x) = \operatorname*{argmax}_{a \in \mathcal{A}} \mathbb{E}\left[U(y, a)\right]$

    we get the equivalent maximum expected utility principle.

  • The expectation is taken with respect to $p(y \mid x)$: it is the expected loss over the label given the observations.

  • For Bayesian Decision Theory, we minimize the posterior expected loss

    $\rho(a \mid x) \triangleq \mathbb{E}_{p(y \mid x)}\left[L(y, a)\right] = \sum_y L(y, a)\, p(y \mid x)$

  • The Bayes decision rule (Bayes estimator) is given as minimizing the posterior expected loss

    $\delta(x) = \operatorname*{argmin}_{a \in \mathcal{A}} \rho(a \mid x)$

  • Some common rules:

    • The MAP estimate minimizes the 0-1 loss.
    • The MAP estimate can minimize the loss in the case where we include a “don’t know” / reject class, under specific conditions on the rejection cost.
    • The posterior mean minimizes the $\ell_2$ (squared) loss. This is the minimum mean squared error (MMSE) estimate.
    • The posterior median minimizes the $\ell_1$ (absolute) loss.
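The mean-vs-median claims can be checked numerically. A minimal sketch, assuming a skewed Gamma stand-in for $p(y \mid x)$ so that mean and median differ, and a grid of candidate actions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Samples standing in for a skewed posterior p(y | x): Gamma(2, 1),
# whose mean (2.0) and median (~1.68) differ noticeably.
y = rng.gamma(shape=2.0, scale=1.0, size=50_000)

actions = np.linspace(0.0, 10.0, 1_001)
l2_risk = np.array([np.mean((y - a) ** 2) for a in actions])
l1_risk = np.array([np.mean(np.abs(y - a)) for a in actions])

best_l2 = actions[np.argmin(l2_risk)]  # matches the sample mean
best_l1 = actions[np.argmin(l1_risk)]  # matches the sample median
```

For symmetric posteriors the two optimal actions coincide, which is why the choice of loss only matters when the posterior is skewed or heavy-tailed.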
  • For supervised learning tasks, this rule can be translated to minimizing generalization error. Given unknown parameters $\theta^*$ of nature and a predictor (action) $\delta$, with cost function $\ell(y, \delta(x))$ that determines how far the prediction $\delta(x)$ is from the ground truth $y$, we have

    $L(\theta^*, \delta) \triangleq \mathbb{E}_{(x, y) \sim p(x, y \mid \theta^*)}\left[\ell(y, \delta(x))\right]$

    Which is the generalization error.

    • The goal becomes minimizing it: $\delta^* = \operatorname*{argmin}_{\delta}\, L(\theta^*, \delta)$

Topics

Links

  • Murphy Ch. 5

    • 5.6 - More on Hierarchical and Empirical Bayes
    • 5.7.1 - Bayes estimators for common loss functions
  • The Fisher Information - more on the Fisher information. Essentially it measures the amount of information a variable carries about a parameter.

  • Probability Theory - more on the basics of probability. Bayes’ theorem is stated here.

  • Bayesian Models

Footnotes

  1. Not unlike Reinforcement Learning