- We may apply Bayesian statistics to compute a posterior distribution over the weights and obtain a MAP estimate as follows:
$$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}), \qquad \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w}) \right]$$
We may then sequentially update the parameters in a Bayesian manner, using the posterior from one batch of data as the prior for the next (see the sketch below).
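A minimal sketch of such sequential updating, assuming the conjugate Gaussian prior and likelihood developed in the Gaussian section below, with made-up noise and prior variances `sigma2` and `tau2`:

```python
import numpy as np

def gaussian_posterior(X, y, prior_mean, prior_cov, sigma2):
    """Conjugate update: Gaussian prior x Gaussian likelihood -> Gaussian posterior."""
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = prior_prec + X.T @ X / sigma2                  # posterior precision
    post_cov = np.linalg.inv(post_prec)                        # posterior covariance
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / sigma2)
    return post_mean, post_cov

rng = np.random.default_rng(0)
d, sigma2, tau2 = 3, 0.25, 1.0
w_true = rng.normal(size=d)

# Start from the prior N(0, tau2 * I) and fold in one batch of data at a time.
mean, cov = np.zeros(d), tau2 * np.eye(d)
for batch in range(5):
    X = rng.normal(size=(20, d))
    y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=20)
    mean, cov = gaussian_posterior(X, y, mean, cov, sigma2)    # posterior becomes the next prior
    print(f"batch {batch}: posterior mean = {mean.round(3)}")
```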
- Under the assumption of Conditional Independence, we have
$$p(\mathcal{D} \mid \mathbf{w}) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2\right)$$
Taking the negative log of the likelihood above gives us the Mean Square Error (up to scaling and additive constants), so maximizing the likelihood is equivalent to minimizing the MSE.
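Spelling that step out, assuming the Gaussian likelihood above with fixed noise variance $\sigma^2$:
$$-\log p(\mathcal{D} \mid \mathbf{w}) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \frac{N}{2}\log\left(2\pi\sigma^2\right)$$
Only the first term depends on $\mathbf{w}$, and it is proportional to the MSE.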
Regularization using Priors
- Placing a prior on the weights has the same effect as Regularization; the specific penalty depends on the choice of prior.
Gaussian
- With a Gaussian prior and Gaussian likelihood, we get a Gaussian posterior over the weights, $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \hat{\mathbf{w}}, \hat{\boldsymbol{\Sigma}})$, and the predictive distribution at a test point $\mathbf{x}_*$ is Gaussian with variance
$$\sigma^2_*(\mathbf{x}_*) = \sigma^2 + \mathbf{x}_*^\top \hat{\boldsymbol{\Sigma}}\, \mathbf{x}_*$$
Where $\sigma^2$ is the variance of the observation noise, and $\mathbf{x}_*^\top \hat{\boldsymbol{\Sigma}}\, \mathbf{x}_*$ is the variance due to the parameters (i.e., it depends on how close $\mathbf{x}_*$ is to the training data). This captures uncertainty as we move away from the training data, allowing the model to capture what is known and what is unknown.
- We are unable to capture this kind of uncertainty using a plug-in approximation (i.e., using the MAP estimate alone), since the predictive variance is then just the constant $\sigma^2$ (see the sketch below).
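A minimal sketch of this, assuming a 1-D linear model (features $[1, x]$) with made-up noise and prior variances `sigma2` and `tau2`; the Bayesian predictive variance grows for test inputs far from the training inputs, while the MAP plug-in always reports the constant `sigma2`:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 0.25, 1.0

# Training inputs clustered in [-1, 1]; design matrix has columns [1, x].
x_train = rng.uniform(-1.0, 1.0, size=30)
X = np.column_stack([np.ones_like(x_train), x_train])
y = 1.0 + 2.0 * x_train + np.sqrt(sigma2) * rng.normal(size=30)

# Gaussian posterior over weights with prior N(0, tau2 * I).
post_cov = np.linalg.inv(np.eye(2) / tau2 + X.T @ X / sigma2)
post_mean = post_cov @ (X.T @ y / sigma2)       # equals the MAP estimate here

for x_star in [0.0, 1.0, 3.0, 10.0]:
    phi = np.array([1.0, x_star])
    pred_var = sigma2 + phi @ post_cov @ phi    # observation noise + parameter uncertainty
    print(f"x*={x_star:5.1f}  bayes var={pred_var:.3f}  plug-in var={sigma2:.3f}")
```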
- With unknown variance $\sigma^2$, and likelihood
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2) = \mathcal{N}\!\left(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}\right)$$
the conjugate prior and the posterior of the joint parameter distribution $p(\mathbf{w}, \sigma^2)$ follow a Normal Inverse Gamma distribution (see more at Murphy Ch. 7.6.3).
- The marginal posterior of the variance $\sigma^2$ follows an Inverse Gamma distribution.
- The marginal posterior of the weights follows a T distribution.
- The posterior predictive distribution also follows a T distribution.
- We may use a g-prior of the form
$$p(\mathbf{w}) = \mathcal{N}\!\left(\mathbf{w} \mid \mathbf{0},\; g\,\sigma^2 \left(\mathbf{X}^\top \mathbf{X}\right)^{-1}\right)$$
Where $g$ controls the strength of the prior (and has a regularization effect), and we scale by $(\mathbf{X}^\top \mathbf{X})^{-1}$ so that the posterior is invariant to input scaling.
- Using a g-prior with $g = N$, where $N$ is the number of samples, gives the unit information prior, which contains as much information as one sample (see the sketch below).
- Using a g-prior with
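A minimal sketch of the unit information g-prior, assuming a fixed, known noise variance `sigma2` for simplicity (rather than the full Normal Inverse Gamma treatment above); under this prior the posterior mean is the least-squares estimate shrunk by $g/(g+1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma2 = 100, 3, 0.5
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sqrt(sigma2) * rng.normal(size=N)

g = N                                               # unit information prior: g = number of samples
prior_cov = g * sigma2 * np.linalg.inv(X.T @ X)     # g-prior covariance

# Gaussian posterior under the g-prior (zero prior mean, known sigma2).
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + X.T @ X / sigma2)
post_mean = post_cov @ (X.T @ y / sigma2)

# The same thing, written as shrunken least squares.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(post_mean)
print(g / (g + 1) * w_ols)                          # matches post_mean
```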
- If weights are sampled from $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$, then
$$-\log p(\mathbf{w}) = \frac{1}{2\tau^2}\,\lVert \mathbf{w} \rVert_2^2 + \text{const}$$
Which is the regularization term in L2-regularization with $\lambda = \sigma^2 / \tau^2$. The prior favors values closer to $\mathbf{0}$, and it takes strong evidence to update a component to become larger. Hence, all components in $\mathbf{w}$ tend to be roughly the same and small, effectively regularizing (see the check below).
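A quick numerical check of that correspondence, with made-up values for `sigma2` and `tau2`: the MAP estimate under the $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ prior matches ridge regression with $\lambda = \sigma^2 / \tau^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma2, tau2 = 50, 4, 0.5, 2.0
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=N)

lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)                    # ridge solution
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(d) / tau2, X.T @ y / sigma2)   # Gaussian-prior MAP

print(np.allclose(w_ridge, w_map))   # True
```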
Laplacian Prior
- If weights are sampled from $w_j \sim \text{Laplace}(0, b)$, then
$$-\log p(\mathbf{w}) = \frac{1}{b}\,\lVert \mathbf{w} \rVert_1 + \text{const}$$
Which is the regularization term in L1-regularization.
- While it also favors values of $\mathbf{w}$ closer to $0$, it does not take as much evidence to update a component to be something large, since the Laplace prior has heavier tails than the Gaussian.
- Since we are minimizing, however, this means that only a few components of $\mathbf{w}$ will actually take a large value, while the rest are pushed towards zero, giving sparse solutions (see the sketch below).
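A minimal sketch of that sparsity effect, using scikit-learn's `Lasso` (L1 penalty, the Laplace-prior MAP) and `Ridge` (L2 penalty, the Gaussian-prior MAP); the data and penalty strength are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N, d = 100, 20
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [4.0, -3.0, 2.0]                  # only 3 informative features
y = X @ w_true + 0.5 * rng.normal(size=N)

lasso = Lasso(alpha=0.5).fit(X, y)             # L1 penalty <-> Laplace prior
ridge = Ridge(alpha=0.5).fit(X, y)             # L2 penalty <-> Gaussian prior

print("lasso coefficients at exactly zero:", int(np.sum(lasso.coef_ == 0.0)))   # most of them
print("ridge coefficients at exactly zero:", int(np.sum(ridge.coef_ == 0.0)))   # typically none
```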
Empirical Bayes
- When the prior hyperparameters are unknown, we can use empirical Bayes as an alternative to cross-validation. The goal is to choose the hyperparameters $\boldsymbol{\eta}$ that maximize the marginal likelihood:
$$\hat{\boldsymbol{\eta}} = \arg\max_{\boldsymbol{\eta}} p(\mathcal{D} \mid \boldsymbol{\eta}) = \arg\max_{\boldsymbol{\eta}} \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\eta})\, d\mathbf{w}$$
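A minimal sketch for the Gaussian model above: grid-search the prior precision $\alpha$ (with the noise variance taken as known) by evaluating the log marginal likelihood in closed form; the grid and data here are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
N, d, sigma2 = 80, 5, 0.5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=N)

def log_marginal_likelihood(alpha):
    """log p(y | X, alpha): integrating out w ~ N(0, I / alpha) gives
    y ~ N(0, sigma2 * I + X X^T / alpha)."""
    cov = sigma2 * np.eye(N) + X @ X.T / alpha
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)

alphas = np.logspace(-3, 3, 25)                     # candidate prior precisions
best_alpha = max(alphas, key=log_marginal_likelihood)
print("empirical Bayes choice of alpha:", best_alpha)
```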
Links
- Linear Models
- Murphy Ch. 7
- Zhang et al. Ch. 3 - more on Linear Regression
- Bayesian Linear Regression: Data Science Concepts