Basics

  • Murphy 4.1.1: If we have $N$ iid samples $x_i \sim \mathcal{N}(\mu, \Sigma)$, then the MLE for the parameters is given by the sample mean and covariance (a small NumPy check appears after this list). That is,
    $$\hat{\mu}_{\text{MLE}} = \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i, \qquad \hat{\Sigma}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T$$

    • By the law of large numbers, the sample mean and covariance are consistent estimators: with more samples, they converge to the true mean and covariance.
  • The Gaussian is widely used in modelling since it is the maximum-entropy distribution subject to having a specified mean and covariance.

    • Murphy 4.1.2: Let $q(x)$ be any density satisfying $\int q(x)\, x_i x_j \, dx = \Sigma_{ij}$.
      Let $p(x) = \mathcal{N}(x \mid 0, \Sigma)$. Then $h(q) \le h(p)$, where $h(\cdot)$ denotes differential entropy.
  • When the covariance matrix is unknown but the underlying distribution is known to be Gaussian, we may use the inverse Wishart distribution to model our uncertainty about the covariance matrix.

    • The corresponding conjugate prior would be the inverse Wishart as well.
    • For the precision matrix, we can simply use the Wishart distribution.
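
A minimal NumPy check of the MLE result above, using synthetic data drawn from a 2-D Gaussian; the parameter values and sample size below are illustrative only, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth parameters and sample size.
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
N = 10_000
X = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # shape (N, 2)

# MLE: sample mean and (1/N) sample covariance.
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / N

print(mu_hat)     # approaches mu_true as N grows
print(Sigma_hat)  # approaches Sigma_true as N grows
```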

Gaussian Discriminant Analysis

  • Gaussian Discriminant Analysis is a generalization of the Naive Bayes classifier. It involves modelling the class-conditional densities as multivariate Gaussians:
    $$p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c)$$

    • It becomes the Naive Bayes classifier if $\Sigma_c$ is diagonal, since this implies feature independence (conditional on the class).
    • The MLE for the class means and covariance matrices is given by the per-class sample means and covariances (see Murphy 4.2.4).
  • The Nearest Centroid Classifier assigns $x$ to the class whose mean is closest in Mahalanobis distance (assuming uniform class priors):
    $$\hat{y}(x) = \arg\min_c\, (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)$$

  • Quadratic Discriminant Analysis (QDA) plugs the Gaussian class-conditional densities into the posterior over class labels $y$. That is,
    $$p(y = c \mid x, \theta) = \frac{\pi_c\, |2\pi\Sigma_c|^{-1/2} \exp\left[-\tfrac12 (x - \mu_c)^T\Sigma_c^{-1}(x - \mu_c)\right]}{\sum_{c'} \pi_{c'}\, |2\pi\Sigma_{c'}|^{-1/2} \exp\left[-\tfrac12 (x - \mu_{c'})^T\Sigma_{c'}^{-1}(x - \mu_{c'})\right]}$$
    (A NumPy sketch of fitting these densities and evaluating the discriminant appears after this list.)

    • Notably, we assume that the class conditional densities follow Gaussian distributions.
    • The decision boundaries using logs of probabilities follow quadratic curves.
    • The discriminant comes from taking the log of the probability:
      $$\log p(y = c \mid x, \theta) = \log \pi_c - \tfrac12 \log|2\pi\Sigma_c| - \tfrac12 (x - \mu_c)^T\Sigma_c^{-1}(x - \mu_c) + \text{const}$$
      Observe the quadratic form in $x$.
  • Linear Discriminant Analysis (LDA) simplifies QDA by tying the covariance matrices across classes, $\Sigma_c = \Sigma$. The quadratic term $x^T\Sigma^{-1}x$ is then shared by all classes and cancels in the posterior, leaving
    $$p(y = c \mid x, \theta) \propto \pi_c \exp\left(\mu_c^T\Sigma^{-1}x - \tfrac12 \mu_c^T\Sigma^{-1}\mu_c\right)$$

    • Considering logs of probabilities gives us linear decision boundaries.
    • LDA can also be used for dimensionality reduction: it seeks projections that maximize the distance between the projected class means while minimizing the variance within each class. That is, it maximizes the separability of the data.
  • One way to fit discriminant analysis to data is to use the MLE estimate. However, this is prone to overfitting, which can be mitigated in the following ways:

    • Use a diagonal $\Sigma_c$, which is equivalent to Naive Bayes.
    • Use parameter sharing: set $\Sigma_c = \Sigma$ for all classes $c$, which is equivalent to LDA.
    • Use diagonal covariance LDA
    • Impose a prior and integrate it out.
    • Use MAP estimates
    • Dimensionality Reduction
  • Regularized LDA involves the addition of a regularization term, making use of the Wishart prior, so that the new regularized covariance matrix is given by

    $$\hat{\Sigma} = \lambda\,\mathrm{diag}(\hat{\Sigma}_{\text{MLE}}) + (1 - \lambda)\,\hat{\Sigma}_{\text{MLE}}$$

    where $\mathrm{diag}(\cdot)$ denotes taking only the diagonal entries and $\lambda$ is the regularization parameter.

    • $\lambda$ is directly related to the strength of the Wishart prior used: larger $\lambda$ corresponds to a stronger prior.
    • The MLE for high-dimensional data can be computed by using the Singular Value Decomposition of the data matrix and making use of the empirical covariance matrices instead.
  • Diagonal LDA involves using a diagonal covariance matrix for each class $c$.

    • This is equivalent to Regularized LDA with $\lambda = 1$.

    • We classify based on the following rule:

      $$\hat{y}(x) = \arg\max_c \left[\log \pi_c - \sum_{j=1}^D \frac{(x_j - \mu_{cj})^2}{2\sigma_j^2}\right]$$

      where $\mu_{cj}$ is set as the sample mean of the $j$-th feature for data in class $c$.

      $\sigma_j^2$ is set using the pooled empirical variance, calculated by

      $$\hat{\sigma}_j^2 = \frac{\sum_{c=1}^C \sum_{i : y_i = c} (x_{ij} - \hat{\mu}_{cj})^2}{N - C}$$
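
A minimal NumPy sketch of fitting the per-class Gaussians by MLE and classifying with the quadratic log-discriminant above; the function names (`fit_gda`, `qda_predict`) are mine, not Murphy's:

```python
import numpy as np

def fit_gda(X, y):
    """MLE fit: class priors, means, and full covariances (QDA)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        pi_c = len(Xc) / len(X)                        # class prior
        mu_c = Xc.mean(axis=0)                         # per-class sample mean
        Sigma_c = np.cov(Xc, rowvar=False, bias=True)  # per-class 1/N covariance
        params[c] = (pi_c, mu_c, Sigma_c)
    return params

def qda_predict(params, X):
    """Pick the class maximizing the quadratic log-discriminant."""
    classes = list(params)
    scores = np.empty((len(classes), len(X)))
    for k, c in enumerate(classes):
        pi_c, mu_c, Sigma_c = params[c]
        diff = X - mu_c
        _, logdet = np.linalg.slogdet(Sigma_c)
        maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma_c), diff)
        scores[k] = np.log(pi_c) - 0.5 * logdet - 0.5 * maha
    return np.array(classes)[np.argmax(scores, axis=0)]
```

Usage would look like `qda_predict(fit_gda(X_train, y_train), X_test)`; tying or diagonalizing the covariances inside `fit_gda` recovers LDA or diagonal LDA respectively.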

Gaussian Inferencing

  • Murphy 4.3.1: Suppose $x = (x_1, x_2)$ is jointly Gaussian with parameters

    $$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

    Then the marginals are given by

    $$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$

    and the conditional is given by

    $$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

    (A NumPy sketch of the conditioning step follows.)
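
A small NumPy sketch of the conditioning formula in Murphy 4.3.1; the helper name `condition_gaussian` and the example numbers are mine:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Return the parameters of p(x1 | x2) for a jointly Gaussian (x1, x2)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)            # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + K @ (x2 - mu2)          # conditional mean
    Sigma_cond = S11 - K @ S12.T            # Schur complement = conditional covariance
    return mu_cond, Sigma_cond

# Example: condition the first coordinate on the second.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(condition_gaussian(mu, Sigma, [0], [1], np.array([2.0])))
```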

Gaussian Interpolation

  • The Gaussian part of this comes from the prior assumption: each entry is the average of its neighbors plus Gaussian noise.

Noise-Free Observations

  • Problem: We estimate a 1-D function $f$ on the interval $[0, T]$ such that $y_i = f(t_i)$ for $N$ observed points $t_i$.

  • Assume: For the points between observations, the function is smooth.

  • Start by discretizing. Define

    $$x_j = f(s_j), \qquad s_j = jh, \qquad h = \frac{T}{D}$$

    That is, take $D$ evenly spaced samples in the desired interval.

  • Prior: Assume that each $x_j$ is the average of its neighbors plus Gaussian noise:

    $$x_j = \tfrac{1}{2}(x_{j-1} + x_{j+1}) + \epsilon_j, \qquad \epsilon \sim \mathcal{N}(0, (1/\lambda)I)$$

    $\lambda$ controls how much it is believed that the function is smooth (higher = smoother).

    This can be encoded in a matrix $L$ so that $Lx = \epsilon$, where $L$ is the $(D-2)\times D$ second-order finite difference matrix

    $$L = \frac{1}{2}\begin{pmatrix} -1 & 2 & -1 & & & \\ & -1 & 2 & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & 2 & -1 \end{pmatrix}$$

    Assume $L$ has been scaled to absorb $\lambda$. Let $\Lambda = L^T L$ be the prior precision, so that $p(x) = \mathcal{N}(x \mid 0, \Lambda^{-1})$.

  • Let $x_2$ be the $N$ noise-free observations and $x_1$ be the $D - N$ unknown function values.

    Partition $x$ accordingly. Partition $\Lambda$ accordingly as well into a block matrix:

    $$\Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

    The conditional now becomes

    $$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2}), \qquad \mu_{1|2} = -\Lambda_{11}^{-1}\Lambda_{12}\, x_2, \qquad \Sigma_{1|2} = \Lambda_{11}^{-1}$$

    (A NumPy sketch of this computation appears after this section's bullets.)

  • For noiseless data, $\lambda$ has no effect on the posterior mean estimate, since it cancels in $-\Lambda_{11}^{-1}\Lambda_{12}x_2$; it only scales the posterior uncertainty.

  • We can treat each missing data point as a random variable. Imputation can then be done by computing the conditional distribution of the missing variables given the known variables.
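
A minimal NumPy sketch of the noise-free interpolation above; the grid size, smoothness parameter, observation indices, and values are illustrative only:

```python
import numpy as np

D, lam = 100, 30.0                               # grid size and smoothness (illustrative)
obs_idx = np.array([0, 25, 50, 75, 99])          # indices of the noise-free observations x_2
obs_val = np.array([0.0, 1.0, -0.5, 0.3, 0.0])   # observed function values

# (D-2) x D second-order finite difference matrix.
L = np.zeros((D - 2, D))
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]

Lambda = lam * (L.T @ L)                         # prior precision
hid_idx = np.setdiff1d(np.arange(D), obs_idx)    # indices of the unknowns x_1

L11 = Lambda[np.ix_(hid_idx, hid_idx)]
L12 = Lambda[np.ix_(hid_idx, obs_idx)]

mu_1g2 = -np.linalg.solve(L11, L12 @ obs_val)    # posterior mean of the unknowns
Sigma_1g2 = np.linalg.inv(L11)                   # posterior covariance
```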

Noisy Data

  • In addition to the above, assume further that $y = Ax + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ and $\Sigma_y = \sigma^2 I$, with signal noise $\sigma^2$, and $A$ is an appropriately sized ($N \times D$) projection matrix that selects out the observed elements.

  • The estimation is similar to that for noiseless data.

  • A strong prior (large $\lambda$) yields smooth estimates with small posterior uncertainty.

  • A weak prior (small $\lambda$) yields rougher estimates with high posterior uncertainty.

  • The posterior mean is given by solving the following optimization problem, a process called **Tikhonov regularization** (aka ridge / L2 regression); a NumPy sketch follows this list:

    $$\hat{x} = \arg\min_x\; \frac{1}{2\sigma^2}\|Ax - y\|_2^2 + \frac{\lambda}{2}\|Lx\|_2^2$$

  • Observe that the above simply adds the data-fit loss (MSE) to a penalty term for the roughness of the estimate; the second term is the regularization term.
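
A sketch of the same computation with noisy observations, solving the Tikhonov-regularized problem via the posterior precision; sizes, indices, and noise level are again illustrative:

```python
import numpy as np

D, N, lam, sigma2 = 100, 5, 30.0, 0.01           # illustrative sizes, smoothness, noise
obs_idx = np.array([0, 25, 50, 75, 99])
y = np.array([0.0, 1.0, -0.5, 0.3, 0.0])         # noisy observations

A = np.zeros((N, D))
A[np.arange(N), obs_idx] = 1.0                   # projection matrix selecting observed entries

L = np.zeros((D - 2, D))                         # second-order finite difference matrix
for j in range(D - 2):
    L[j, j:j + 3] = [-0.5, 1.0, -0.5]

# Posterior precision and mean; the mean is the Tikhonov / ridge solution.
prec_post = A.T @ A / sigma2 + lam * (L.T @ L)
mu_post = np.linalg.solve(prec_post, A.T @ y / sigma2)
```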

Information Form

  • Let $x \sim \mathcal{N}(\mu, \Sigma)$. We have that $\mathbb{E}[x] = \mu$ and $\mathrm{cov}[x] = \Sigma$. These are the moment parameters of the distribution.

  • The canonical (information) parameters are
    $$\Lambda = \Sigma^{-1}, \qquad \xi = \Sigma^{-1}\mu$$
    with the moment parameters recovered as $\mu = \Lambda^{-1}\xi$ and $\Sigma = \Lambda^{-1}$.

  • The MVN in information form is defined as
    $$\mathcal{N}_c(x \mid \xi, \Lambda) = (2\pi)^{-D/2}|\Lambda|^{1/2}\exp\left[-\tfrac12\left(x^T\Lambda x + \xi^T\Lambda^{-1}\xi - 2x^T\xi\right)\right]$$

  • Marginalization is easier in moment form, and conditioning is easier in information form

  • In canonical form, multiplication of two Gaussians is easy (see the sketch below):
    $$\mathcal{N}_c(\xi_f, \lambda_f)\,\mathcal{N}_c(\xi_g, \lambda_g) \propto \mathcal{N}_c(\xi_f + \xi_g, \lambda_f + \lambda_g)$$
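
A small sketch of the moment/canonical conversions and the product rule above; the helper names are mine:

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Moment parameters -> canonical (information) parameters."""
    Lam = np.linalg.inv(Sigma)
    return Lam @ mu, Lam                  # xi = Sigma^{-1} mu, Lambda = Sigma^{-1}

def to_moment(xi, Lam):
    """Canonical parameters -> moment parameters."""
    Sigma = np.linalg.inv(Lam)
    return Sigma @ xi, Sigma

def multiply_canonical(xi1, Lam1, xi2, Lam2):
    """Unnormalized product of two Gaussians: just add canonical parameters."""
    return xi1 + xi2, Lam1 + Lam2
```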

Linear Gaussian Systems

  • Let $y \in \mathbb{R}^{D_y}$ be a noisy observation of a hidden variable $x \in \mathbb{R}^{D_x}$, where

    $$p(x) = \mathcal{N}(x \mid \mu_x, \Sigma_x), \qquad p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_y)$$

    where $A$ is a $D_y \times D_x$ matrix. This is a linear Gaussian system, represented by $x \to y$.

  • Murphy 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, the posterior is given by

    $$p(x \mid y) = \mathcal{N}(x \mid \mu_{x|y}, \Sigma_{x|y}), \qquad \Sigma_{x|y}^{-1} = \Sigma_x^{-1} + A^T\Sigma_y^{-1}A, \qquad \mu_{x|y} = \Sigma_{x|y}\left[A^T\Sigma_y^{-1}(y - b) + \Sigma_x^{-1}\mu_x\right]$$

    (A NumPy sketch of this update appears at the end of this section.)

    Likelihood-Prior Tradeoff

    • Suppose we have $N$ noisy measurements $y_i$ of some underlying quantity $x$. Suppose the measurement noise has fixed precision $\lambda_y = 1/\sigma^2$. The likelihood is

      $$p(y_i \mid x) = \mathcal{N}(y_i \mid x, \lambda_y^{-1})$$

      The prior is

      $$p(x) = \mathcal{N}(x \mid \mu_0, \lambda_0^{-1})$$

      Now we compute the posterior, assuming the measurements are conditionally independent given $x$ (so $\Sigma_y = \lambda_y^{-1}I$). We have that $p(x \mid y_1, \dots, y_N) = \mathcal{N}(x \mid \mu_N, \lambda_N^{-1})$, with precision and mean

      $$\lambda_N = \lambda_0 + N\lambda_y, \qquad \mu_N = \frac{N\lambda_y \bar{y} + \lambda_0\mu_0}{\lambda_N}$$

      That is, the posterior mean is a compromise between the MLE $\bar{y}$ and the prior mean $\mu_0$. The relative weights are determined by the signal strength $N\lambda_y$ and the prior strength $\lambda_0$; a stronger prior gives more weight to the prior mean.

  • We can also analyze updating the posterior sequentially, one measurement $y$ at a time. Let $\Sigma_y$, $\Sigma_0$, and $\Sigma_{x|y}$ be the variances of the likelihood, prior, and posterior respectively. Then the updated posterior will have

    $$\Sigma_{x|y} = \frac{\Sigma_y \Sigma_0}{\Sigma_y + \Sigma_0}$$

    The posterior mean will become

    $$\mu_{x|y} = \Sigma_{x|y}\left(\frac{\mu_0}{\Sigma_0} + \frac{y}{\Sigma_y}\right) = \mu_0 + (y - \mu_0)\frac{\Sigma_0}{\Sigma_0 + \Sigma_y} = y - (y - \mu_0)\frac{\Sigma_y}{\Sigma_0 + \Sigma_y}$$

    The third form above expresses the measurement $y$ adjusted towards the prior mean; this adjustment is called shrinkage. It is the reduction in the effects of sampling variation.

  • The shrinkage can be quantified using the signal-to-noise ratio, wherein

    $$\text{SNR} \triangleq \frac{\mathbb{E}[X^2]}{\mathbb{E}[\epsilon^2]} = \frac{\Sigma_0 + \mu_0^2}{\Sigma_y}$$

    where

    • $x \sim \mathcal{N}(\mu_0, \Sigma_0)$ is the true signal, and
    • $y = x + \epsilon$ is the observed signal, where $\epsilon \sim \mathcal{N}(0, \Sigma_y)$ is the observation noise.
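
A compact NumPy sketch of the posterior update in Murphy 4.4.1, followed by the scalar likelihood-prior tradeoff as a special case; the function name and example values are illustrative:

```python
import numpy as np

def lgs_posterior(mu_x, Sigma_x, A, b, Sigma_y, y):
    """p(x | y) for the linear Gaussian system p(y|x) = N(Ax + b, Sigma_y)."""
    Sy_inv = np.linalg.inv(Sigma_y)
    prec_post = np.linalg.inv(Sigma_x) + A.T @ Sy_inv @ A   # posterior precision
    Sigma_post = np.linalg.inv(prec_post)
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + np.linalg.solve(Sigma_x, mu_x))
    return mu_post, Sigma_post

# Scalar special case: N noisy measurements of x with precision lam_y and
# prior N(mu0, 1/lam0); recovers mu_N = (N*lam_y*ybar + lam0*mu0) / lam_N.
ys = np.array([1.2, 0.8, 1.1])
lam_y, mu0, lam0 = 4.0, 0.0, 1.0
A = np.ones((len(ys), 1))                        # each y_i = x + noise
mu, Sig = lgs_posterior(np.array([mu0]), np.array([[1 / lam0]]),
                        A, np.zeros(len(ys)), np.eye(len(ys)) / lam_y, ys)
lam_N = lam0 + len(ys) * lam_y
print(mu, (len(ys) * lam_y * ys.mean() + lam0 * mu0) / lam_N)  # should match
```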

Links