• We may apply Bayesian statistics to compute a posterior distribution over the weights and obtain a MAP estimate as follows:

    $p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w), \qquad \hat{w}_{\text{MAP}} = \arg\max_w \big[\log p(\mathcal{D} \mid w) + \log p(w)\big]$

    We may then update the parameters sequentially in a Bayesian manner, using the posterior after each batch of data as the prior for the next.

  • Under the assumption of conditional independence (i.i.d. observations given $w$), we have

    $p(y \mid X, w) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top x_n, \sigma^2)$

    Taking the negative log of the likelihood gives, up to scaling and additive constants, the Mean Squared Error:

    $-\log p(y \mid X, w) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \text{const}$
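
A minimal NumPy sketch of this correspondence (the synthetic data, dimensions, and noise level $\sigma$ are assumptions for illustration): since the negative log-likelihood is an affine function of the MSE, the weights that maximize the i.i.d. Gaussian likelihood are exactly the least-squares weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed setup for illustration).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3                               # observation-noise std (assumed known)
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_likelihood(w):
    """-log p(y | X, w) for i.i.d. Gaussian noise: n/(2*sigma^2) * MSE + const."""
    return 0.5 * np.sum((y - X @ w) ** 2) / sigma**2 + 0.5 * n * np.log(2 * np.pi * sigma**2)

def mse(w):
    return np.mean((y - X @ w) ** 2)

# The least-squares fit minimizes the MSE, and therefore also the negative log-likelihood.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, w in [("true", w_true), ("least-squares", w_ls), ("zeros", np.zeros(d))]:
    print(f"{name:>13}:  MSE = {mse(w):7.4f}   NLL = {neg_log_likelihood(w):9.2f}")
```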

Regularization using Priors

  • Depending on the choice of prior, the MAP estimate has the same effect as regularization: the negative log prior acts as the penalty term.

Gaussian Prior

  • With a Gaussian prior $w \sim \mathcal{N}(w_0, V_0)$ and a Gaussian likelihood, we get a Gaussian posterior $p(w \mid \mathcal{D}, \sigma^2) = \mathcal{N}(w \mid w_N, V_N)$, and the posterior predictive variance at an input $x$ is

    $\sigma_N^2(x) = \sigma^2 + x^\top V_N x$

    where $\sigma^2$ is the variance of the observation noise, and $x^\top V_N x$ is the variance in the parameters (which depends on how close $x$ is to the training data). This captures uncertainty as we move away from the training data, allowing the model to express both what is known and what is unknown (see the sketch after this list).

    • We are unable to capture this kind of uncertainty using a plug-in approximation (i.e., using only the MAP estimate), since it discards the posterior variance.
  • With unknown variance $\sigma^2$ and likelihood

    $p(y \mid X, w, \sigma^2) = \mathcal{N}(y \mid Xw, \sigma^2 I_N)$

    the conjugate prior over the joint parameters $(w, \sigma^2)$ and the resulting posterior follow a Normal Inverse Gamma (NIG) distribution. (see more at Murphy Ch. 7.6.3)

    • The marginal posterior over the variance $\sigma^2$ follows an Inverse Gamma distribution.
    • The marginal posterior over the weights follows a T distribution.
    • The posterior predictive distribution also follows a T distribution.
    • We may use a g-prior of the form

      $p(w \mid \sigma^2) = \mathcal{N}\!\left(w \mid 0,\; g\,\sigma^2 (X^\top X)^{-1}\right)$

      where $g$ controls the strength of the prior (and has a regularization effect), and we scale by $(X^\top X)^{-1}$ so that the posterior is invariant to input scaling.
      • Using a g-prior with $g = N$, where $N$ is the number of samples, gives the unit information prior, which contains as much information as one sample.
  • If the weights are sampled from $w \sim \mathcal{N}(0, \tau^2 I)$, then

    $-\log p(w) = \frac{1}{2\tau^2} \lVert w \rVert_2^2 + \text{const}$

    which is the regularization term in L2-regularization with $\lambda = \sigma^2 / \tau^2$.

    • The Gaussian prior favors values of $w$ close to $0$, and it takes strong evidence to push any component to a large value (the quadratic penalty grows quickly). Hence, all components of $w$ tend to be roughly the same magnitude and small, effectively regularizing.
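
The following NumPy sketch ties the bullets above together (the 1-D data, the noise variance $\sigma^2$, and the prior variance $\tau^2$ are assumptions for illustration): the posterior mean under the Gaussian prior coincides with the ridge (L2) solution with $\lambda = \sigma^2/\tau^2$, and the posterior predictive variance $\sigma^2 + x^\top V_N x$ grows as the query point moves away from the training inputs.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D inputs with a bias feature so we can query points far from the data (assumed setup).
n = 30
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])            # design matrix [1, x]
sigma2, tau2 = 0.2**2, 1.0**2                   # noise variance, prior variance (assumed)
y = 0.5 + 2.0 * x + np.sqrt(sigma2) * rng.normal(size=n)

# Gaussian prior w ~ N(0, tau2 * I) + Gaussian likelihood  =>  Gaussian posterior N(w_N, V_N).
V_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
w_N = V_N @ X.T @ y / sigma2

# The posterior mean equals the ridge estimate with lambda = sigma2 / tau2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
assert np.allclose(w_N, w_ridge)

# Posterior predictive variance sigma2 + phi^T V_N phi grows away from the training inputs.
for x_star in (0.0, 1.0, 5.0):
    phi = np.array([1.0, x_star])
    print(f"x* = {x_star:4.1f}   predictive std = {np.sqrt(sigma2 + phi @ V_N @ phi):.3f}")
```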

Laplacian Prior

  • If the weights are sampled from $w \sim \mathrm{Laplace}(0, b)$, i.e. $p(w) \propto \exp(-\lVert w \rVert_1 / b)$, then

    $-\log p(w) = \frac{1}{b} \lVert w \rVert_1 + \text{const}$

    which is the regularization term in L1-regularization.

  • While it also favors values of $w$ close to 0, it does not take as much evidence for an individual component to become large, since the penalty grows only linearly rather than quadratically.

  • Since we are minimizing, however, this means that only a few components of $w$ will take on large values, while the rest are driven to zero, yielding sparse solutions (as the sketch below illustrates).
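
A small NumPy comparison of the two MAP problems (the synthetic sparse data and the value of $\lambda$ are assumptions; the L1 solver here is plain proximal gradient / ISTA, one of several ways to solve the lasso): the Gaussian-prior (L2) solution keeps every component nonzero but small, while the Laplace-prior (L1) solution sets most components exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in which only 3 of 20 true weights are nonzero (illustrative assumption).
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.1                                        # regularization strength (assumed)

# L2 / Gaussian prior: closed-form ridge solution of (1/2n)||y - Xw||^2 + (lam/2)||w||^2.
w_l2 = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# L1 / Laplace prior: proximal gradient (ISTA) with soft-thresholding.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant of the gradient
w_l1 = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w_l1 - y) / n
    w_l1 = soft_threshold(w_l1 - step * grad, step * lam)

print("nonzero components, L2:", int(np.sum(np.abs(w_l2) > 1e-6)))   # typically all 20
print("nonzero components, L1:", int(np.sum(np.abs(w_l1) > 1e-6)))   # close to 3
```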

Empirical Bayes

  • If the prior hyperparameters are unknown, we can use empirical Bayes as an alternative to cross-validation. The goal is to choose the hyperparameter (e.g., the prior precision $\alpha$, or equivalently the regularization strength $\lambda$) that maximizes the marginal likelihood $p(\mathcal{D} \mid \alpha)$.
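
A minimal sketch of empirical Bayes for this model (the synthetic data, the known noise variance, and the grid of candidate precisions are assumptions): with prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$, the marginal likelihood is $p(y \mid X, \alpha) = \mathcal{N}(y \mid 0,\ \sigma^2 I + \alpha^{-1} X X^\top)$, and we simply pick the $\alpha$ with the highest evidence rather than cross-validating.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed setup): y = Xw + noise, with prior w ~ N(0, (1/alpha) I).
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)                  # drawn with unit variance, so alpha near 1 is "right"
sigma2 = 0.5**2                              # observation-noise variance (assumed known)
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

def log_marginal_likelihood(alpha):
    """log p(y | X, alpha) = log N(y | 0, sigma2 * I + (1/alpha) * X X^T)."""
    C = sigma2 * np.eye(n) + (1.0 / alpha) * (X @ X.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Empirical Bayes: choose the prior precision that maximizes the evidence.
alphas = np.logspace(-3, 3, 61)
scores = [log_marginal_likelihood(a) for a in alphas]
print(f"evidence-maximizing alpha ~= {alphas[int(np.argmax(scores))]:.3g}")
```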

Links