Basics
- Murphy 4.1.1: If we have $N$ iid samples $x_i \sim \mathcal{N}(\mu, \Sigma)$, then the MLE for the parameters is given by the sample mean and covariance. That is,
  $$\hat{\mu}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}, \qquad \hat{\Sigma}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})^T$$
  By the law of large numbers / central limit theorem, with more samples the sample mean and covariance become better estimates of the true mean and covariance (see the sketch at the end of this section).
- The Gaussian is widely used in modelling since it is the distribution with maximum entropy subject to having a specified mean and covariance.
- Murphy 4.1.2: Let $q(x)$ be any density satisfying $\int q(x)\, x_i x_j \, dx = \Sigma_{ij}$, and let $p = \mathcal{N}(0, \Sigma)$. Then $h(q) \le h(p)$.
- When the covariance matrix is unknown, but the underlying distribution is known to be Gaussian, we may place an inverse Wishart prior on the covariance matrix.
    - The inverse Wishart is the corresponding conjugate prior for the covariance (with known mean).
    - For the precision matrix, we can simply use the Wishart distribution.
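A minimal numpy sketch of the two ideas above, under assumed data and prior hyperparameters (`nu0` and `Psi0` are illustrative choices, not values from the text): compute the MLE of the mean and covariance, then form the conjugate inverse Wishart posterior over $\Sigma$ when the mean is known.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 5000
true_mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(D, D))
true_Sigma = A @ A.T + D * np.eye(D)            # a random SPD covariance
X = rng.multivariate_normal(true_mu, true_Sigma, size=N)

# MLE (Murphy 4.1.1): sample mean and sample covariance
mu_mle = X.mean(axis=0)
Xc = X - mu_mle
Sigma_mle = Xc.T @ Xc / N

# Conjugate inverse Wishart update for Sigma with *known* mean:
# prior IW(Psi0, nu0)  ->  posterior IW(Psi0 + S, nu0 + N),
# where S is the scatter matrix about the known mean.
nu0, Psi0 = D + 2, np.eye(D)                    # assumed prior hyperparameters
S = (X - true_mu).T @ (X - true_mu)
nu_post, Psi_post = nu0 + N, Psi0 + S
Sigma_post_mean = Psi_post / (nu_post - D - 1)  # mean of an inverse Wishart

print("max |Sigma_mle  - Sigma_true|:", np.abs(Sigma_mle - true_Sigma).max())
print("max |Sigma_post - Sigma_true|:", np.abs(Sigma_post_mean - true_Sigma).max())
```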
Gaussian Discriminant Analysis
- Gaussian Discriminant Analysis (GDA) is a generalization of the Naive Bayes classifier. It involves the following class-conditional densities:
  $$p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$$
    - It becomes the Naive Bayes classifier if $\Sigma_c$ is diagonal, since this implies (class-conditional) feature independence.
    - Its MLE for the mean and covariance matrix is equal to the sample mean and covariance for each class (see Murphy 4.2.4).
- The Nearest Centroids Classifier is defined as follows (GDA with a uniform class prior):
  $$\hat{y}(x) = \arg\min_c \, (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)$$
  i.e. assign $x$ to the class whose centroid is closest in Mahalanobis distance.
- Quadratic Discriminant Analysis (QDA) involves using the definition of the Gaussian density for the posterior over class labels $y$. That is,
  $$p(y = c \mid x, \theta) = \frac{\pi_c \, |2\pi\Sigma_c|^{-1/2} \exp\!\left[-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1}(x - \mu_c)\right]}{\sum_{c'} \pi_{c'} \, |2\pi\Sigma_{c'}|^{-1/2} \exp\!\left[-\frac{1}{2}(x - \mu_{c'})^T \Sigma_{c'}^{-1}(x - \mu_{c'})\right]}$$
    - Notably, we assume that the class-conditional densities $p(x \mid y = c)$ follow Gaussian distributions.
    - The decision boundaries obtained from logs of probabilities follow quadratic curves.
    - The discriminant comes from taking the log of the probability:
      $$\log p(y = c \mid x, \theta) = \log \pi_c - \frac{1}{2}\log|2\pi\Sigma_c| - \frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1}(x - \mu_c) + \text{const}$$
      Observe the quadratic form in $x$.
- Linear Discriminant Analysis (LDA) simplifies QDA by tying the covariance matrices across classes: $\Sigma_c = \Sigma$ for all $c$.
    - Considering logs of probabilities gives us linear decision boundaries, since the quadratic term $x^T\Sigma^{-1}x$ is shared by all classes and cancels.
    - LDA is also a dimensionality reduction process which aims to maximize the distance between projected class means and minimize the variance within each class distribution. That is, it maximizes separability of the data.
- One way to fit discriminant analysis to data is to use the MLE estimate. However, this is prone to overfitting. This can be mitigated in the following ways (see the sketch at the end of this section):
    - Use a diagonal $\Sigma_c$, equivalent to Naive Bayes.
    - Use parameter sharing: have $\Sigma_c = \Sigma$ for all classes $c$, equivalent to LDA.
    - Use diagonal covariance LDA.
    - Impose a prior and integrate it out.
    - Use MAP estimates.
    - Use dimensionality reduction.
- Regularized LDA involves the addition of a regularization term, making use of an inverse Wishart prior, so that the new regularized covariance matrix is given by
  $$\hat{\Sigma} = \lambda\,\mathrm{diag}(\hat{\Sigma}_{\text{MLE}}) + (1 - \lambda)\,\hat{\Sigma}_{\text{MLE}}$$
  where $\mathrm{diag}(\cdot)$ denotes taking only the diagonal entries and $\lambda \in [0, 1]$ is the regularization parameter; $\lambda$ is directly proportional to the strength of the prior used.
    - The estimate for highly dimensional data can be computed by using the Singular Value Decomposition of the data and making use of the (lower-dimensional) empirical covariance matrices instead.
- Diagonal LDA involves using a diagonal covariance matrix for each class $c$.
    - This is equivalent to Regularized LDA with $\lambda = 1$.
    - We classify based on the following rule:
      $$\hat{y}(x) = \arg\max_c \left[\log \pi_c - \sum_{j=1}^{D} \frac{(x_j - \bar{x}_{cj})^2}{2 s_j^2}\right]$$
      where we set $\bar{x}_{cj}$ as the sample mean of the $j$-th feature for data in class $c$. $s_j^2$ is set using the pooled empirical variance, calculated by
      $$s_j^2 = \frac{\sum_{c=1}^{C}\sum_{i : y_i = c} (x_{ij} - \bar{x}_{cj})^2}{N - C}$$
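The following is a minimal numpy sketch (my own function names and synthetic data, not code from the text) of the variants above: per-class MLE fitting, the quadratic log-discriminant of QDA, covariance tying for LDA, and the $\lambda$-shrinkage covariance of Regularized LDA.

```python
import numpy as np

def fit_gda(X, y, shrinkage=0.0, tied=False):
    """Per-class MLE for GDA; optionally tie covariances (LDA) or apply
    diagonal shrinkage (Regularized LDA with parameter `shrinkage`)."""
    classes = np.unique(y)
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]
        priors.append(len(Xc) / len(X))
        mu = Xc.mean(axis=0)
        means.append(mu)
        covs.append((Xc - mu).T @ (Xc - mu) / len(Xc))
    priors, means, covs = np.array(priors), np.array(means), np.array(covs)
    if tied:                      # share one pooled covariance across classes
        pooled = np.average(covs, axis=0, weights=priors)
        covs = np.repeat(pooled[None], len(classes), axis=0)
    if shrinkage > 0:             # lam * diag(Sigma) + (1 - lam) * Sigma
        covs = np.array([shrinkage * np.diag(np.diag(S)) + (1 - shrinkage) * S
                         for S in covs])
    return classes, priors, means, covs

def predict(X, classes, priors, means, covs):
    """Classify by the per-class quadratic log-discriminant."""
    scores = []
    for pi, mu, S in zip(priors, means, covs):
        diff = X - mu
        _, logdet = np.linalg.slogdet(2 * np.pi * S)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        scores.append(np.log(pi) - 0.5 * logdet - 0.5 * maha)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

# Synthetic two-class data (assumed for illustration)
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[1.0, -0.3], [-0.3, 0.5]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

for name, kwargs in [("QDA", {}), ("LDA (tied)", {"tied": True}),
                     ("Regularized", {"shrinkage": 0.5})]:
    params = fit_gda(X, y, **kwargs)
    acc = (predict(X, *params) == y).mean()
    print(f"{name:12s} training accuracy: {acc:.3f}")
```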
Gaussian Inferencing
- Murphy 4.3.1: Suppose $x = (x_1, x_2)$ is jointly Gaussian with parameters
  $$\mu = \begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}$$
  Then the marginals are given by
  $$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$
  And the conditional is given by
  $$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
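A short numpy sketch of the conditional formula above, using an assumed 3-D joint (the particular $\mu$, $\Sigma$, partition, and observed value are illustrative):

```python
import numpy as np

# An assumed joint Gaussian over x = (x1, x2) with dim(x1) = 2, dim(x2) = 1
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
i1, i2 = [0, 1], [2]                       # index sets for x1 and x2
S11, S12 = Sigma[np.ix_(i1, i1)], Sigma[np.ix_(i1, i2)]
S21, S22 = Sigma[np.ix_(i2, i1)], Sigma[np.ix_(i2, i2)]

x2_obs = np.array([0.5])                   # observed value of x2

# Marginal of x1 is simply N(mu1, Sigma11); conditional uses the Schur complement
mu_cond = mu[i1] + S12 @ np.linalg.solve(S22, x2_obs - mu[i2])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

print("p(x1 | x2=0.5) mean:", mu_cond)
print("p(x1 | x2=0.5) cov :\n", Sigma_cond)
```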
Gaussian Interpolation
- The Gaussian part of this comes from the assumption outlined in its prior. Each entry is the average of its neighbors plus Gaussian noise.
Noise-Free Observations
- Problem: We estimate a 1-D function on the interval $[0, T]$, such that $y_i = f(t_i)$ for $N$ observed points $t_i$.
- Assume: For the points between observations, the function is smooth.
- Start by discretizing. Define
  $$x_j = f(s_j), \qquad s_j = jh, \qquad h = \frac{T}{D}, \qquad 1 \le j \le D$$
  That is, take $D$ evenly spaced samples in the desired interval.
- Prior: Assume that each $x_j$ is the average of its neighbors plus Gaussian noise:
  $$x_j = \frac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \qquad \epsilon \sim \mathcal{N}\!\left(0, \tfrac{1}{\lambda}I\right)$$
  The precision $\lambda$ controls how much it is believed that the function is smooth (higher = smoother). This can be encoded in a matrix $L$, the $(D-2)\times D$ second-order finite difference matrix, so that
  $$Lx = \epsilon, \qquad L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & \\ & -1 & 2 & -1 & \\ & & \ddots & \ddots & \ddots \end{pmatrix}$$
  Assume scaling by $\sqrt{\lambda}$ has been absorbed into $L$. Let $\Lambda = L^T L$ be the prior precision, so the prior is $p(x) \propto \exp\!\left(-\tfrac{1}{2}x^T\Lambda x\right)$.
- Let $x_2$ be the $N$ noise-free observations and $x_1$ be the $D - N$ unknowns. Partition $x = (x_1, x_2)$ accordingly. Partition $\Lambda$ accordingly as well into a block matrix. The conditional now becomes (see the sketch after this list)
  $$p(x_1 \mid x_2) = \mathcal{N}(\mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = -\Lambda_{11}^{-1}\Lambda_{12}x_2, \qquad \Sigma_{1|2} = \Lambda_{11}^{-1}$$
- For noiseless data, $\lambda$ has no effect on the smoothness of the posterior mean estimate (it cancels out of $\mu_{1|2}$); it only scales the posterior variance.
- We can treat each data point as being some random variable. Imputation can then be done by computing the distribution of the missing variables conditioned on the known variables, and taking marginals of it as needed.
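A numpy sketch of the noise-free case (grid size, smoothness $\lambda$, observation locations, and the test function are assumed for illustration): build the second-difference matrix $L$, form $\Lambda = L^TL$, partition it, and condition on the observed entries.

```python
import numpy as np

D, lam = 100, 30.0                        # grid size and smoothness (assumed)
t = np.linspace(0, 1, D)

# Second-order finite-difference matrix, scaled by sqrt(lambda)
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = 0.5 * np.array([-1.0, 2.0, -1.0])
L *= np.sqrt(lam)
Lam = L.T @ L                              # prior precision (rank D - 2)

# Noise-free observations of a smooth test function at a few grid points
obs_idx = np.array([5, 30, 60, 95])
x2 = np.sin(2 * np.pi * t[obs_idx])
unk_idx = np.setdiff1d(np.arange(D), obs_idx)

# Partition the precision and condition: mu_{1|2} = -Lam11^{-1} Lam12 x2
Lam11 = Lam[np.ix_(unk_idx, unk_idx)]
Lam12 = Lam[np.ix_(unk_idx, obs_idx)]
mu_1g2 = -np.linalg.solve(Lam11, Lam12 @ x2)   # lambda cancels in the mean...
Sigma_1g2 = np.linalg.inv(Lam11)               # ...but not in the covariance

x_post = np.empty(D)
x_post[obs_idx], x_post[unk_idx] = x2, mu_1g2
print("max posterior std of the unknowns:", np.sqrt(np.diag(Sigma_1g2)).max())
```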
Noisy Data
- In addition to the above, assume further that $y = Ax + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ and $\Sigma_y = \sigma^2 I$, with signal noise $\sigma^2$, and $A$ is an appropriately sized projection matrix that selects out the observed elements.
- The estimation is similar to that for noiseless data.
- A strong prior, determined by large $\lambda$, causes smooth estimates and small uncertainty.
- A weak prior, determined by small $\lambda$, causes rougher estimates and high uncertainty.
- The posterior mean is given by solving the following optimization problem, in a process called **Tikhonov Regularization** (aka Ridge / L2 regression):
  $$\hat{x} = \arg\min_x \; \frac{1}{2\sigma^2}\|Ax - y\|_2^2 + \frac{\lambda}{2}\|L_0 x\|_2^2$$
  where $L_0$ is the unscaled finite difference matrix.
- Observe that the above simply adds a penalty term for the roughness of the estimate to the data-fit loss (MSE); the second term is the regularization term (see the sketch below).
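A sketch of the noisy case in numpy (noise level, $\lambda$, grid size, and observation pattern are assumed): the posterior mean solves the normal equations $(A^TA/\sigma^2 + \lambda L_0^TL_0)\hat{x} = A^Ty/\sigma^2$, and the posterior covariance is the inverse of that combined precision.

```python
import numpy as np

D, N, lam, sigma = 100, 15, 30.0, 0.1      # assumed sizes and hyperparameters
rng = np.random.default_rng(2)
t = np.linspace(0, 1, D)

# Unscaled second-difference (roughness) operator L0
L0 = np.zeros((D - 2, D))
for j in range(D - 2):
    L0[j, j:j + 3] = 0.5 * np.array([-1.0, 2.0, -1.0])

# Projection matrix A selects N observed grid points; y are noisy values there
obs_idx = np.sort(rng.choice(D, size=N, replace=False))
A = np.eye(D)[obs_idx]
y = np.sin(2 * np.pi * t[obs_idx]) + sigma * rng.normal(size=N)

# Posterior mean = Tikhonov / ridge solution of
#   min_x  ||Ax - y||^2 / (2 sigma^2)  +  (lam / 2) ||L0 x||^2
lhs = A.T @ A / sigma**2 + lam * (L0.T @ L0)
rhs = A.T @ y / sigma**2
x_map = np.linalg.solve(lhs, rhs)

# Posterior covariance is the inverse of the combined precision matrix
post_cov = np.linalg.inv(lhs)
print("max |residual| at observed points:", np.abs(A @ x_map - y).max())
print("max posterior std:", np.sqrt(np.diag(post_cov)).max())
```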
Information Form
- Let $x \sim \mathcal{N}(\mu, \Sigma)$. We have that $\mathbb{E}[x] = \mu$ and $\mathrm{cov}[x] = \Sigma$. These are the moment parameters of the distribution.
- The canonical parameters are
  $$\Lambda \triangleq \Sigma^{-1}, \qquad \xi \triangleq \Sigma^{-1}\mu$$
- The MVN in information form is defined as
  $$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}|\Lambda|^{1/2}\exp\!\left[-\tfrac{1}{2}\left(x^T\Lambda x + \xi^T\Lambda^{-1}\xi - 2x^T\xi\right)\right]$$
- Marginalization is easier in moment form, and conditioning is easier in information form.
- In canonical form, multiplication of two Gaussians is easy:
  $$\mathcal{N}_c(\xi_f, \Lambda_f)\,\mathcal{N}_c(\xi_g, \Lambda_g) \propto \mathcal{N}_c(\xi_f + \xi_g, \Lambda_f + \Lambda_g)$$
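A small numpy sketch of the moment/canonical conversions and the product rule above (the two example Gaussians are assumed):

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Moment -> canonical: Lambda = Sigma^-1, xi = Sigma^-1 mu."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam

def to_moment(xi, Lam):
    """Canonical -> moment: Sigma = Lambda^-1, mu = Lambda^-1 xi."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ xi, Sigma

# Two assumed 2-D Gaussians
mu_f, Sig_f = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu_g, Sig_g = np.array([2.0, -1.0]), np.array([[0.8, -0.1], [-0.1, 1.2]])

xi_f, Lam_f = to_canonical(mu_f, Sig_f)
xi_g, Lam_g = to_canonical(mu_g, Sig_g)

# Product of Gaussians: just add the canonical parameters
xi_prod, Lam_prod = xi_f + xi_g, Lam_f + Lam_g
mu_prod, Sig_prod = to_moment(xi_prod, Lam_prod)

print("product mean:", mu_prod)
print("product cov :\n", Sig_prod)
```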
Linear Gaussian Systems
- Let $x \in \mathbb{R}^{D_x}$ be a hidden variable and $y \in \mathbb{R}^{D_y}$ be a noisy observation of $x$, where
  $$p(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_y)$$
  Here $A$ is a $D_y \times D_x$ matrix. This is a linear Gaussian system, represented by $x \to y$.
Murphy 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, the posterior
is given byLikelihood-Prior Tradeoff
Likelihood-Prior Tradeoff
- Suppose we have $N$ noisy measurements $y_i$ of some underlying quantity $x$. Suppose the measurement noise has fixed precision $\lambda_y = 1/\sigma^2$. The likelihood is
  $$p(y_i \mid x) = \mathcal{N}(y_i \mid x, \lambda_y^{-1})$$
  The prior is
  $$p(x) = \mathcal{N}(x \mid \mu_0, \lambda_0^{-1})$$
  Now, we compute the posterior, assuming the measurements are independent (so the likelihood factorizes). We have that $x \mid y$ is normally distributed with precision and mean
  $$\lambda_N = \lambda_0 + N\lambda_y, \qquad \mu_N = \frac{N\lambda_y\bar{y} + \lambda_0\mu_0}{\lambda_N}$$
  That is, the posterior mean is a compromise between the MLE $\bar{y}$ and the prior mean $\mu_0$. The signal strength is determined by $\lambda_y$; a weak signal relative to the prior precision $\lambda_0$ gives more weight to the prior.
- We can also analyze updating the posterior sequentially, one observation $y$ at a time. Let $\Sigma_y, \Sigma_0, \Sigma_1$ be the variances of the likelihood, prior, and posterior respectively. Then the updated posterior will have
  $$\Sigma_1 = \left(\frac{1}{\Sigma_0} + \frac{1}{\Sigma_y}\right)^{-1} = \frac{\Sigma_y\Sigma_0}{\Sigma_y + \Sigma_0}$$
  The posterior mean will become (see the sketch after this list)
  $$\mu_1 = \frac{\Sigma_y}{\Sigma_y + \Sigma_0}\mu_0 + \frac{\Sigma_0}{\Sigma_y + \Sigma_0}y = \mu_0 + (y - \mu_0)\frac{\Sigma_0}{\Sigma_y + \Sigma_0} = y - (y - \mu_0)\frac{\Sigma_y}{\Sigma_y + \Sigma_0}$$
  The third form above adjusts the data towards the prior mean; this is called shrinkage. It is the reduction in the effects of sampling variation.
- The amount of shrinkage can be quantified using the signal-to-noise ratio, wherein
  $$\mathrm{SNR} \triangleq \frac{\mathbb{E}[X^2]}{\mathbb{E}[\epsilon^2]} = \frac{\Sigma_0 + \mu_0^2}{\Sigma_y}$$
  where $x \sim \mathcal{N}(\mu_0, \Sigma_0)$ is the true signal, and $y = x + \epsilon$ is the observed signal, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ is the noise.
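A small Python sketch of the sequential scalar update and the shrinkage weight (the prior, noise level, and measurements are assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
x_true = 3.0
lam_y = 1.0 / 0.5**2            # measurement precision (sigma = 0.5, assumed)
mu, lam = 0.0, 1.0 / 2.0**2     # prior N(0, 2^2), assumed

for n, y in enumerate(x_true + rng.normal(scale=0.5, size=5), start=1):
    # Sequential update: precisions add, mean is a precision-weighted average
    lam_new = lam + lam_y
    w = lam_y / lam_new                     # weight on the new data point
    mu = (1 - w) * mu + w * y               # equivalently: y - (y - mu) * lam / lam_new
    lam = lam_new
    print(f"after {n} obs: posterior mean {mu:.3f}, std {lam**-0.5:.3f}, "
          f"data weight {w:.2f}")
```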
Links
- Theorem 4.1.2 quadratic form
- 4.2 - more on QDA and LDA
- Probability Distributions Zoo - More on the Multivariate Gaussian Distribution
- Information Theory - more on entropy.