Basics

  • Murphy 4.1.1: If we have $N$ iid samples $x_i \sim \mathcal{N}(\mu, \Sigma)$, then the MLE for the parameters is given by the sample mean and covariance (a small NumPy check appears after this list). That is,
    $$\hat{\mu}_{\text{MLE}} = \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i, \qquad \hat{\Sigma}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T$$

    • By the law of large numbers, the sample mean and covariance are consistent estimators: with more samples, they converge to the true mean and covariance.
  • The Gaussian is widely used in modelling since it is the maximum-entropy distribution subject to having a specified mean and covariance.

    • Murphy 4.1.2: Let $q(x)$ be any density satisfying $\int q(x)\, x_i x_j \, dx = \Sigma_{ij}$.
      Let $p(x) = \mathcal{N}(x \mid 0, \Sigma)$. Then $h(q) \le h(p)$, where $h(\cdot)$ denotes differential entropy.
  • When the covariance matrix is unknown but the underlying distribution is known to be Gaussian, we may use the inverse Wishart distribution to model our uncertainty about the covariance matrix.

    • The corresponding conjugate prior would be the inverse Wishart as well.
    • For the precision matrix, we can simply use the Wishart distribution.
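
A minimal NumPy check of the MLE result above, using synthetic data drawn from a 2-D Gaussian; the parameter values and sample size below are illustrative only, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth parameters and sample size.
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
N = 10_000
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # shape (N, 2)

# MLE: sample mean and (1/N) sample covariance.
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N

print(mu_hat)     # approaches mu_true as N grows
print(Sigma_hat)  # approaches Sigma_true as N grows
```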

Gaussian Discriminant Analysis

  • Gaussian Discriminant Analysis is a generalization of the Naive Bayes classifier. It involves modelling the class-conditional densities as multivariate Gaussians:
    $$p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$$

    • It becomes the Naive Bayes classifier if $\Sigma_c$ is diagonal, since this implies feature independence (conditional on the class).
    • The MLE for the class means and covariance matrices is given by the per-class sample means and covariances (see Murphy 4.2.4).
  • The Nearest Centroid Classifier assigns $x$ to the class whose mean is closest in Mahalanobis distance (assuming uniform class priors):
    $$\hat{y}(x) = \arg\min_c\, (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)$$

  • Quadratic Discriminant Analysis (QDA) plugs the Gaussian class-conditional densities into the posterior over class labels $y$. That is,
    $$p(y = c \mid x, \theta) = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\left[-\tfrac12 (x - \mu_c)^T\Sigma_c^{-1}(x - \mu_c)\right]}{\sum_{c'} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\left[-\tfrac12 (x - \mu_{c'})^T\Sigma_{c'}^{-1}(x - \mu_{c'})\right]}$$
    (A NumPy sketch of fitting these densities and evaluating the discriminant appears after this list.)

    • Notably, we assume that the class conditional densities follow Gaussian distributions.
    • The decision boundaries using logs of probabilities follow quadratic curves.
    • The discriminant comes from taking the log of the probability:
      $$\log p(y = c \mid x, \theta) = \log \pi_c - \tfrac12 \log|2\pi\Sigma_c| - \tfrac12 (x - \mu_c)^T\Sigma_c^{-1}(x - \mu_c) + \text{const}$$
      Observe the quadratic form in $x$.
  • Linear Discriminant Analysis (LDA) simplifies QDA by tying the covariance matrices across classes, $\Sigma_c = \Sigma$. The quadratic term $x^T\Sigma^{-1}x$ is then shared by all classes and cancels in the posterior, leaving
    $$p(y = c \mid x, \theta) \propto \pi_c \exp\left(\mu_c^T\Sigma^{-1}x - \tfrac12 \mu_c^T\Sigma^{-1}\mu_c\right)$$

    • Considering logs of probabilities gives us linear decision boundaries.
    • LDA can also be used for dimensionality reduction: it seeks projections that maximize the distance between the projected class means while minimizing the variance within each class. That is, it maximizes the separability of the data.
  • One way to fit discriminant analysis to data is to use the MLE estimate. However, this is prone to overfitting, which can be mitigated in the following ways:

    • Use a diagonal $\Sigma_c$, which is equivalent to Naive Bayes.
    • Use parameter sharing: set $\Sigma_c = \Sigma$ for all classes $c$, which is equivalent to LDA.
    • Use diagonal covariance LDA
    • Impose a prior and integrate it out.
    • Use MAP estimates
    • Dimensionality Reduction
  • Regularized LDA involves the addition of a regularization term, making use of the Wishart prior, so that the new regularized covariance matrix is given by

    $$\hat{\Sigma} = \lambda\,\mathrm{diag}(\hat{\Sigma}_{\text{MLE}}) + (1 - \lambda)\,\hat{\Sigma}_{\text{MLE}}$$

    where $\mathrm{diag}(\cdot)$ denotes taking only the diagonal entries and $\lambda$ is the regularization parameter.

    • $\lambda$ is directly related to the strength of the Wishart prior used: larger $\lambda$ corresponds to a stronger prior.
    • The MLE for high-dimensional data can be computed by using the Singular Value Decomposition of the data matrix and making use of the empirical covariance matrices instead.
  • Diagonal LDA involves using a diagonal covariance matrix for each class $c$.

    • This is equivalent to Regularized LDA with $\lambda = 1$.

    • We classify based on the following rule:

      $$\hat{y}(x) = \arg\max_c \left[\log \pi_c - \sum_{j=1}^D \frac{(x_j - \mu_{cj})^2}{2\sigma_j^2}\right]$$

      where $\mu_{cj}$ is set as the sample mean of the $j$-th feature for data in class $c$.

      $\sigma_j^2$ is set using the pooled empirical variance, calculated by

      $$\hat{\sigma}_j^2 = \frac{\sum_{c=1}^C \sum_{i : y_i = c} (x_{ij} - \hat{\mu}_{cj})^2}{N - C}$$
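
A minimal NumPy sketch of fitting the per-class Gaussians by MLE and classifying with the quadratic log-discriminant above; the function names (`fit_gda`, `qda_predict`) are mine, not Murphy's:

```python
import numpy as np

def fit_gda(X, y):
    """MLE fit: class priors, means, and full covariances (QDA)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        pi_c = len(Xc) / len(X)                        # class prior
        mu_c = Xc.mean(axis=0)                         # per-class sample mean
        Sigma_c = np.cov(Xc, rowvar=False, bias=True)  # per-class 1/N covariance
        params[c] = (pi_c, mu_c, Sigma_c)
    return params

def qda_predict(params, X):
    """Pick the class maximizing the quadratic log-discriminant."""
    classes = list(params)
    scores = np.empty((len(classes), len(X)))
    for k, c in enumerate(classes):
        pi_c, mu_c, Sigma_c = params[c]
        diff = X - mu_c
        _, logdet = np.linalg.slogdet(Sigma_c)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma_c), diff)
        scores[k] = np.log(pi_c) - 0.5 * logdet - 0.5 * maha
    return np.array(classes)[np.argmax(scores, axis=0)]
```

Usage would look like `qda_predict(fit_gda(X_train, y_train), X_test)`; tying or diagonalizing the covariances inside `fit_gda` recovers LDA or diagonal LDA respectively.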

Gaussian Inferencing

  • Murphy 4.3.1: Suppose $x = (x_1, x_2)$ is jointly Gaussian with parameters

    $$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

    Then the marginals are given by

    $$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$

    and the conditional is given by

    $$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

    (A NumPy sketch of the conditioning step follows.)
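
A small NumPy sketch of the conditioning formula in Murphy 4.3.1; the helper name `condition_gaussian` and the example numbers are mine:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Return the parameters of p(x1 | x2) for a jointly Gaussian (x1, x2)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)            # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + K @ (x2 - mu2)          # conditional mean
    Sigma_cond = S11 - K @ S12.T            # Schur complement = conditional covariance
    return mu_cond, Sigma_cond

# Example: condition the first coordinate on the second.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(condition_gaussian(mu, Sigma, [0], [1], np.array([2.0])))
```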

Gaussian Interpolation

  • The Gaussian part of this comes from the prior assumption: each entry is the average of its neighbors plus Gaussian noise.

Noise-Free Observations

  • Problem: We estimate a 1-D function $f$ on the interval $[0, T]$ such that $y_i = f(t_i)$ for $N$ observed points $t_i$.

  • Assume: For the points between observations, the function is smooth.

  • Start by discretizing. Define

    $$x_j = f(s_j), \qquad s_j = jh, \qquad h = \frac{T}{D}$$

    That is, take $D$ evenly spaced samples in the desired interval.

  • Prior: Assume that each $x_j$ is the average of its neighbors plus Gaussian noise:

    $$x_j = \tfrac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \qquad \epsilon \sim \mathcal{N}(0, (1/\lambda)I)$$

    $\lambda$ controls how much it is believed that the function is smooth (higher = smoother).

    This can be encoded in a matrix $L$ so that $Lx = \epsilon$, where $L$ is the $(D-2)\times D$ second-order finite difference matrix

    $$L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix}$$

    Assume $L$ has been scaled to absorb $\lambda$. Let $\Lambda = L^T L$ be the prior precision, so that $p(x) = \mathcal{N}(x \mid 0, \Lambda^{-1})$.

  • Let $x_2$ be the $N$ noise-free observations and $x_1$ be the $D - N$ unknown function values.

    Partition $x$ accordingly. Partition $\Lambda$ accordingly as well into a block matrix:

    $$\Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

    The conditional now becomes

    $$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = -\Lambda_{11}^{-1}\Lambda_{12}\, x_2, \qquad \Sigma_{1|2} = \Lambda_{11}^{-1}$$

    (A NumPy sketch of this computation appears after this section's bullets.)

  • For noiseless data, $\lambda$ has no effect on the posterior mean estimate, since it cancels in $-\Lambda_{11}^{-1}\Lambda_{12}x_2$; it only scales the posterior uncertainty.

  • We can treat each missing data point as a random variable. Imputation can then be done by computing the conditional distribution of the missing variables given the known variables.
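
A minimal NumPy sketch of the noise-free interpolation above; the grid size, smoothness parameter, observation indices, and values are illustrative only:

```python
import numpy as np

D, lam = 100, 30.0                               # grid size and smoothness (illustrative)
obs_idx = np.array([0, 25, 50, 75, 99])          # indices of the noise-free observations x_2
obs_val = np.array([0.0, 1.0, -0.5, 0.3, 0.0])   # observed function values

# (D-2) x D second-order finite difference matrix.
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]

Lambda = lam * (L.T @ L)                         # prior precision
hid_idx = np.setdiff1d(np.arange(D), obs_idx)    # indices of the unknowns x_1

L11 = Lambda[np.ix_(hid_idx, hid_idx)]
L12 = Lambda[np.ix_(hid_idx, obs_idx)]

mu_1g2 = -np.linalg.solve(L11, L12 @ obs_val)    # posterior mean of the unknowns
Sigma_1g2 = np.linalg.inv(L11)                   # posterior covariance
```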

Noisy Data

  • In addition to the above, assume further that $y = Ax + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ and $\Sigma_y = \sigma^2 I$, with signal noise $\sigma^2$, and $A$ is an appropriately sized ($N \times D$) projection matrix that selects out the observed elements.

  • The estimation is similar to that for noiseless data.

  • A strong prior (large $\lambda$) yields smooth estimates with small posterior uncertainty.

  • A weak prior (small $\lambda$) yields rougher estimates with high posterior uncertainty.

  • The posterior mean is given by solving the following optimization problem, a process called **Tikhonov regularization** (aka ridge / L2 regression); a NumPy sketch follows this list:

    $$\hat{x} = \arg\min_x\; \frac{1}{2\sigma^2}\|Ax - y\|_2^2 + \frac{\lambda}{2}\|Lx\|_2^2$$

  • Observe that the above simply adds the data-fit loss (MSE) to a penalty term for the roughness of the estimate; the second term is the regularization term.
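
A sketch of the same computation with noisy observations, solving the Tikhonov-regularized problem via the posterior precision; sizes, indices, and noise level are again illustrative:

```python
import numpy as np

D, N, lam, sigma2 = 100, 5, 30.0, 0.01           # illustrative sizes, smoothness, noise
obs_idx = np.array([0, 25, 50, 75, 99])
y = np.array([0.0, 1.0, -0.5, 0.3, 0.0])         # noisy observations

A = np.zeros((N, D))
A[np.arange(N), obs_idx] = 1.0                   # projection matrix selecting observed entries

L = np.zeros((D - 2, D))                         # second-order finite difference matrix
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]

# Posterior precision and mean; the mean is the Tikhonov / ridge solution.
prec_post = A.T @ A / sigma2 + lam * (L.T @ L)
mu_post = np.linalg.solve(prec_post, A.T @ y / sigma2)
```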

Information Form

  • Let $x \sim \mathcal{N}(\mu, \Sigma)$. We have that $\mathbb{E}[x] = \mu$ and $\mathrm{cov}[x] = \Sigma$. These are the moment parameters of the distribution.

  • The canonical (information) parameters are
    $$\Lambda = \Sigma^{-1}, \qquad \xi = \Sigma^{-1}\mu$$
    with the moment parameters recovered as $\mu = \Lambda^{-1}\xi$ and $\Sigma = \Lambda^{-1}$.

  • The MVN in information form is defined as
    $$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}|\Lambda|^{1/2}\exp\left[-\tfrac12\left(x^T\Lambda x + \xi^T\Lambda^{-1}\xi - 2x^T\xi\right)\right]$$

  • Marginalization is easier in moment form, and conditioning is easier in information form

  • In canonical form, multiplication of two Gaussians is easy (see the sketch below):
    $$\mathcal{N}_c(\xi_f, \lambda_f)\,\mathcal{N}_c(\xi_g, \lambda_g) \propto \mathcal{N}_c(\xi_f + \xi_g, \lambda_f + \lambda_g)$$
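
A small sketch of the moment/canonical conversions and the product rule above; the helper names are mine:

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Moment parameters -> canonical (information) parameters."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam                  # xi = Sigma^{-1} mu, Lambda = Sigma^{-1}

def to_moment(xi, Lam):
    """Canonical parameters -> moment parameters."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ xi, Sigma

def multiply_canonical(xi1, Lam1, xi2, Lam2):
    """Unnormalized product of two Gaussians: just add canonical parameters."""
    return xi1 + xi2, Lam1 + Lam2
```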

Linear Gaussian Systems

  • Let $y \in \mathbb{R}^{D_y}$ be a noisy observation of a hidden variable $x \in \mathbb{R}^{D_x}$, where

    $$p(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_y)$$

    where $A$ is a $D_y \times D_x$ matrix. This is a linear Gaussian system, represented by $x \to y$.

  • Murphy 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, the posterior is given by

    $$p(x \mid y) = \mathcal{N}(x \mid \mu_{x|y}, \Sigma_{x|y}), \qquad \Sigma_{x|y}^{-1} = \Sigma_x^{-1} + A^T\Sigma_y^{-1}A, \qquad \mu_{x|y} = \Sigma_{x|y}\left[A^T\Sigma_y^{-1}(y - b) + \Sigma_x^{-1}\mu_x\right]$$

    (A NumPy sketch of this update appears at the end of this section.)

    Likelihood-Prior Tradeoff

    • Suppose we have $N$ noisy measurements $y_i$ of some underlying quantity $x$. Suppose the measurement noise has fixed precision $\lambda_y = 1/\sigma^2$. The likelihood is

      $$p(y_i \mid x) = \mathcal{N}(y_i \mid x, \lambda_y^{-1})$$

      The prior is

      $$p(x) = \mathcal{N}(x \mid \mu_0, \lambda_0^{-1})$$

      Now we compute the posterior, assuming the measurements are conditionally independent given $x$ (so $\Sigma_y = \lambda_y^{-1}I$). We have that $p(x \mid y_1, \dots, y_N) = \mathcal{N}(x \mid \mu_N, \lambda_N^{-1})$, with precision and mean

      $$\lambda_N = \lambda_0 + N\lambda_y, \qquad \mu_N = \frac{N\lambda_y \bar{y} + \lambda_0\mu_0}{\lambda_N}$$

      That is, the posterior mean is a compromise between the MLE $\bar{y}$ and the prior mean $\mu_0$. The relative weights are determined by the signal strength $N\lambda_y$ and the prior strength $\lambda_0$; a stronger prior gives more weight to the prior mean.

  • We can also analyze updating the posterior sequentially, one measurement $y$ at a time. Let $\Sigma_y$, $\Sigma_0$, and $\Sigma_{x|y}$ be the variances of the likelihood, prior, and posterior respectively. Then the updated posterior will have

    $$\Sigma_{x|y} = \frac{\Sigma_y \Sigma_0}{\Sigma_y + \Sigma_0}$$

    The posterior mean will become

    $$\mu_{x|y} = \Sigma_{x|y}\left(\frac{\mu_0}{\Sigma_0} + \frac{y}{\Sigma_y}\right) = \mu_0 + (y - \mu_0)\frac{\Sigma_0}{\Sigma_0 + \Sigma_y} = y - (y - \mu_0)\frac{\Sigma_y}{\Sigma_0 + \Sigma_y}$$

    The third form above expresses the measurement $y$ adjusted towards the prior mean; this adjustment is called shrinkage. It is the reduction in the effects of sampling variation.

  • The shrinkage can be quantified using the signal-to-noise ratio, wherein

    $$\text{SNR} \triangleq \frac{\mathbb{E}[X^2]}{\mathbb{E}[\epsilon^2]} = \frac{\Sigma_0 + \mu_0^2}{\Sigma_y}$$

    where

    • $x \sim \mathcal{N}(\mu_0, \Sigma_0)$ is the true signal, and
    • $y = x + \epsilon$ is the observed signal, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ is the observation noise.
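
A compact NumPy sketch of the posterior update in Murphy 4.4.1, followed by the scalar likelihood-prior tradeoff as a special case; the function name and example values are illustrative:

```python
import numpy as np

def lgs_posterior(mu_x, Sigma_x, A, b, Sigma_y, y):
    """p(x | y) for the linear Gaussian system p(y|x) = N(Ax + b, Sigma_y)."""
    Sy_inv = np.linalg.inv(Sigma_y)
    prec_post = np.linalg.inv(Sigma_x) + A.T @ Sy_inv @ A   # posterior precision
    Sigma_post = np.linalg.inv(prec_post)
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + np.linalg.solve(Sigma_x, mu_x))
    return mu_post, Sigma_post

# Scalar special case: N noisy measurements of x with precision lam_y and
# prior N(mu0, 1/lam0); recovers mu_N = (N*lam_y*ybar + lam0*mu0) / lam_N.
ys = np.array([1.2, 0.8, 1.1])
lam_y, mu0, lam0 = 4.0, 0.0, 1.0
A = np.ones((len(ys), 1))                        # each y_i = x + noise
mu, Sig = lgs_posterior(np.array([mu0]), np.array([[1 / lam0]]),
                        A, np.zeros(len(ys)), np.eye(len(ys)) / lam_y, ys)
lam_N = lam0 + len(ys) * lam_y
print(mu, (len(ys) * lam_y * ys.mean() + lam0 * mu0) / lam_N)  # should match
```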

Links