• We may apply Bayesian statistics to compute a posterior distribution over the weights and obtain a MAP estimate as follows:

    $p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w), \qquad \hat{w}_{\text{MAP}} = \arg\max_w \big[\log p(\mathcal{D} \mid w) + \log p(w)\big]$

    We may then update the parameters sequentially in a Bayesian manner, using the posterior after each batch of data as the prior for the next.

  • Under the assumption of conditional independence (i.i.d. observations given $w$), we have

    $p(y \mid X, w) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top x_n, \sigma^2)$

    Taking the negative log of the likelihood gives, up to scaling and additive constants, the Mean Squared Error:

    $-\log p(y \mid X, w) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \text{const}$
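
A minimal NumPy sketch of this correspondence (the synthetic data, dimensions, and noise level $\sigma$ are assumptions for illustration): since the negative log-likelihood is an affine function of the MSE, the weights that maximize the i.i.d. Gaussian likelihood are exactly the least-squares weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed setup for illustration).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3                               # observation-noise std (assumed known)
y = X @ w_true + sigma * rng.normal(size=n)

def neg_log_likelihood(w):
    """-log p(y | X, w) for i.i.d. Gaussian noise: n/(2*sigma^2) * MSE + const."""
    return 0.5 * np.sum((y - X @ w) ** 2) / sigma**2 + 0.5 * n * np.log(2 * np.pi * sigma**2)

def mse(w):
    return np.mean((y - X @ w) ** 2)

# The least-squares fit minimizes the MSE, and therefore also the negative log-likelihood.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, w in [("true", w_true), ("least-squares", w_ls), ("zeros", np.zeros(d))]:
    print(f"{name:>13}:  MSE = {mse(w):7.4f}   NLL = {neg_log_likelihood(w):9.2f}")
```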

Regularization using Priors

  • Depending on the choice of prior, the MAP estimate has the same effect as regularization: the negative log prior acts as the penalty term.

Gaussian Prior

  • With a Gaussian prior $w \sim \mathcal{N}(w_0, V_0)$ and a Gaussian likelihood, we get a Gaussian posterior $p(w \mid \mathcal{D}, \sigma^2) = \mathcal{N}(w \mid w_N, V_N)$, and the posterior predictive variance at an input $x$ is

    $\sigma_N^2(x) = \sigma^2 + x^\top V_N x$

    where $\sigma^2$ is the variance of the observation noise, and $x^\top V_N x$ is the variance in the parameters (which depends on how close $x$ is to the training data). This captures uncertainty as we move away from the training data, allowing the model to express both what is known and what is unknown (see the sketch after this list).

    • We are unable to capture this kind of uncertainty using a plug-in approximation (i.e., using only the MAP estimate), since it discards the posterior variance.
  • With unknown variance $\sigma^2$ and likelihood

    $p(y \mid X, w, \sigma^2) = \mathcal{N}(y \mid Xw, \sigma^2 I_N)$

    the conjugate prior over the joint parameters $(w, \sigma^2)$ and the resulting posterior follow a Normal Inverse Gamma (NIG) distribution. (see more at Murphy Ch. 7.6.3)

    • The marginal posterior over the variance $\sigma^2$ follows an Inverse Gamma distribution.
    • The marginal posterior over the weights follows a T distribution.
    • The posterior predictive distribution also follows a T distribution.
    • We may use a g-prior of the form

      $p(w \mid \sigma^2) = \mathcal{N}\!\left(w \mid 0,\; g\,\sigma^2 (X^\top X)^{-1}\right)$

      where $g$ controls the strength of the prior (and has a regularization effect), and we scale by $(X^\top X)^{-1}$ so that the posterior is invariant to input scaling.
      • Using a g-prior with $g = N$, where $N$ is the number of samples, gives the unit information prior, which contains as much information as one sample.
  • If the weights are sampled from $w \sim \mathcal{N}(0, \tau^2 I)$, then

    $-\log p(w) = \frac{1}{2\tau^2} \lVert w \rVert_2^2 + \text{const}$

    which is the regularization term in L2-regularization with $\lambda = \sigma^2 / \tau^2$.

    • The Gaussian prior favors values of $w$ close to $0$, and it takes strong evidence to push any component to a large value (the quadratic penalty grows quickly). Hence, all components of $w$ tend to be roughly the same magnitude and small, effectively regularizing.
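
The following NumPy sketch ties the bullets above together (the 1-D data, the noise variance $\sigma^2$, and the prior variance $\tau^2$ are assumptions for illustration): the posterior mean under the Gaussian prior coincides with the ridge (L2) solution with $\lambda = \sigma^2/\tau^2$, and the posterior predictive variance $\sigma^2 + x^\top V_N x$ grows as the query point moves away from the training inputs.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D inputs with a bias feature so we can query points far from the data (assumed setup).
n = 30
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])            # design matrix [1, x]
sigma2, tau2 = 0.2**2, 1.0**2                   # noise variance, prior variance (assumed)
y = 0.5 + 2.0 * x + np.sqrt(sigma2) * rng.normal(size=n)

# Gaussian prior w ~ N(0, tau2 * I) + Gaussian likelihood  =>  Gaussian posterior N(w_N, V_N).
V_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)
w_N = V_N @ X.T @ y / sigma2

# The posterior mean equals the ridge estimate with lambda = sigma2 / tau2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
assert np.allclose(w_N, w_ridge)

# Posterior predictive variance sigma2 + phi^T V_N phi grows away from the training inputs.
for x_star in (0.0, 1.0, 5.0):
    phi = np.array([1.0, x_star])
    print(f"x* = {x_star:4.1f}   predictive std = {np.sqrt(sigma2 + phi @ V_N @ phi):.3f}")
```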

Laplacian Prior

  • If the weights are sampled from $w \sim \mathrm{Laplace}(0, b)$, i.e. $p(w) \propto \exp(-\lVert w \rVert_1 / b)$, then

    $-\log p(w) = \frac{1}{b} \lVert w \rVert_1 + \text{const}$

    which is the regularization term in L1-regularization.

  • While it also favors values of $w$ close to 0, it does not take as much evidence for an individual component to become large, since the penalty grows only linearly rather than quadratically.

  • Since we are minimizing, however, this means that only a few components of $w$ will take on large values, while the rest are driven to zero, yielding sparse solutions (as the sketch below illustrates).
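
A small NumPy comparison of the two MAP problems (the synthetic sparse data and the value of $\lambda$ are assumptions; the L1 solver here is plain proximal gradient / ISTA, one of several ways to solve the lasso): the Gaussian-prior (L2) solution keeps every component nonzero but small, while the Laplace-prior (L1) solution sets most components exactly to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in which only 3 of 20 true weights are nonzero (illustrative assumption).
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.1                                        # regularization strength (assumed)

# L2 / Gaussian prior: closed-form ridge solution of (1/2n)||y - Xw||^2 + (lam/2)||w||^2.
w_l2 = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# L1 / Laplace prior: proximal gradient (ISTA) with soft-thresholding.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant of the gradient
w_l1 = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w_l1 - y) / n
    w_l1 = soft_threshold(w_l1 - step * grad, step * lam)

print("nonzero components, L2:", int(np.sum(np.abs(w_l2) > 1e-6)))   # typically all 20
print("nonzero components, L1:", int(np.sum(np.abs(w_l1) > 1e-6)))   # close to 3
```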

Empirical Bayes

  • If the prior hyperparameters are unknown, we can use empirical Bayes as an alternative to cross-validation. The goal is to choose the hyperparameter (e.g., the prior precision $\alpha$, or equivalently the regularization strength $\lambda$) that maximizes the marginal likelihood $p(\mathcal{D} \mid \alpha)$.
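
A minimal sketch of empirical Bayes for this model (the synthetic data, the known noise variance, and the grid of candidate precisions are assumptions): with prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$, the marginal likelihood is $p(y \mid X, \alpha) = \mathcal{N}(y \mid 0,\ \sigma^2 I + \alpha^{-1} X X^\top)$, and we simply pick the $\alpha$ with the highest evidence rather than cross-validating.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed setup): y = Xw + noise, with prior w ~ N(0, (1/alpha) I).
n, d = 50, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)                  # drawn with unit variance, so alpha near 1 is "right"
sigma2 = 0.5**2                              # observation-noise variance (assumed known)
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

def log_marginal_likelihood(alpha):
    """log p(y | X, alpha) = log N(y | 0, sigma2 * I + (1/alpha) * X X^T)."""
    C = sigma2 * np.eye(n) + (1.0 / alpha) * (X @ X.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Empirical Bayes: choose the prior precision that maximizes the evidence.
alphas = np.logspace(-3, 3, 61)
scores = [log_marginal_likelihood(a) for a in alphas]
print(f"evidence-maximizing alpha ~= {alphas[int(np.argmax(scores))]:.3g}")
```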

Links