• We can think of linear models as operating on features, which are effectively basis functions that form a linear basis for an appropriate set of functions.

Linear Regression

Specifications

  • It is a model of the form
    $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2)$

    Or in another form, we have Gaussian noise such that
    $y = \mathbf{w}^\top \mathbf{x} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$

  • Under the iid assumption, minimizing the negative log likelihood (using the MVN density) gives a loss function specified by the residual sum of squares,
    $\operatorname{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2$

    • This loss function is convex.
    • We may obtain a solution using the following normal equation (see the sketch after this list):
      $\hat{\mathbf{w}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
      Under the assumption that $\mathbf{X}^\top \mathbf{X}$ is invertible, where $\mathbf{X}$ is the design matrix.
  • Geometrically: the goal is to find the best-fit hyperplane in feature space such that the sum of squared residuals is minimized; equivalently, $\mathbf{X}\hat{\mathbf{w}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.
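
A minimal numpy sketch of the normal-equation solution (the synthetic data and variable names are illustrative assumptions, not from these notes):

```python
import numpy as np

# Synthetic data: y = X w + Gaussian noise (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])  # design matrix with a bias column
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)

# Normal equation: w_hat = (X^T X)^{-1} X^T y, assuming X^T X is invertible.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem more stably (via SVD).
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat, w_lstsq)
```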

Extensions

  • We may extend this by replacing $\mathbf{x}$ with a basis function expansion $\boldsymbol{\phi}(\mathbf{x})$ to allow for non-linear features.

  • We may use robust regression if the data has many outliers. Here, we replace the Gaussian distribution of the noise $\epsilon$ with a heavy-tailed distribution such as the Laplace or the Student-t distribution.

    • For a Laplace likelihood, we may choose to minimize the Negative Log Likelihood via Linear Programming. Alternatively, we may use the Huber Loss

      • The Huber loss is equivalent to the $\ell_2$ loss for small residuals and to the $\ell_1$ loss for large residuals, so it stays differentiable everywhere while accounting for outliers in the data (see the sketch after this list).
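
A minimal sketch of robust fitting with the Huber loss, assuming numpy and scipy are available (the `huber`/`fit_huber` helpers, the threshold `delta`, and the toy data are illustrative, not from these notes):

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1.0):
    """Quadratic (L2-like) for |r| <= delta, linear (L1-like) beyond it."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def fit_huber(X, y, delta=1.0):
    """Minimize the summed Huber loss of the residuals y - X w."""
    obj = lambda w: huber(y - X @ w, delta).sum()
    return minimize(obj, np.zeros(X.shape[1])).x

# Toy data with one gross outlier.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-3, 3, 50)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(50)
y[0] += 50.0                                  # corrupt a single target

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # squared-error fit
w_huber = fit_huber(X, y)
print(w_ols, w_huber)
```

The squared-error fit is pulled by the single corrupted point, while the Huber fit typically stays much closer to the true intercept and slope.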

Regularized Regression

  • Regularized regression essentially involves controlling the distribution of the weights such that they end up being “simple”
  • We may use Ridge Regression, or weight decay, by using a Maximum A Posteriori estimate with a Gaussian prior. That is, we place the prior $p(\mathbf{w}) = \prod_j \mathcal{N}(w_j \mid 0, \tau^2)$ and minimize
    $J(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$
    Where $\lambda \triangleq \sigma^2 / \tau^2$ controls the strength of the prior.
    • The corresponding solution is given as
      $\hat{\mathbf{w}}_{\text{ridge}} = (\lambda \mathbf{I}_D + \mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
      Where $\lambda = \sigma^2 / \tau^2$; $\sigma^2$ is given by the variance of the Gaussian noise (see above) and $\tau^2$ is the prior variance.
    • This also tends to be easier to fit numerically by running a QR decomposition on a design matrix augmented with the Cholesky factor of the prior precision matrix (see Murphy 7.5.2 and the sketch after this list).
      • When the number of dimensions is greater than the number of samples ($D > N$), we can perform an SVD first for dimensionality reduction.
    • Ridge Regression shrinks most strongly the directions corresponding to the principal components with the smallest singular values (which thus have the highest posterior variance; see Murphy 7.5.3).
    • Ill-determined parameters are reduced towards zero through shrinkage.
  • We can also use Principal Components Regression by regressing on the principal components. However, this is not as effective as Ridge Regression since it discards the low-variance dimensions entirely (compared to Ridge Regression’s soft shrinkage).
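
A minimal numpy sketch of the ridge solution, including the augmented-design-matrix QR route mentioned above (the synthetic data, variable names, and choice of $\lambda$ are illustrative assumptions):

```python
import numpy as np

def ridge_normal_eq(X, y, lam):
    """Closed form: w = (lam * I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

def ridge_qr(X, y, lam):
    """Augment X with sqrt(lam) * I (a Cholesky factor of the prior precision
    lam * I) and the targets with zeros, then solve ordinary least squares via QR."""
    D = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
    y_aug = np.concatenate([y, np.zeros(D)])
    Q, R = np.linalg.qr(X_aug)
    return np.linalg.solve(R, Q.T @ y_aug)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
print(np.allclose(ridge_normal_eq(X, y, 0.5), ridge_qr(X, y, 0.5)))  # True
```

Both routes give the same estimate; the QR version avoids forming $\mathbf{X}^\top \mathbf{X}$ explicitly, which is better conditioned numerically.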

Regularization

  • L1 Regularization involves using the regularization term $\lVert \mathbf{w} \rVert_1 = \sum_j \lvert w_j \rvert$
  • L2 Regularization involves using the regularization term $\lVert \mathbf{w} \rVert_2^2 = \sum_j w_j^2$
    As a consequence, no single feature’s weight can overpower the others (see the sketch after this list).
    • This is useful in datasets that have multicollinearity.
  • The exact mechanisms behind regularization are described here.
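
A minimal sketch contrasting the two penalties, assuming scikit-learn is available (the toy data and the `alpha` value are illustrative assumptions, not from these notes):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)   # only two informative features
y = X @ w_true + 0.5 * rng.standard_normal(200)

lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: tends to set uninformative weights exactly to 0
ridge = Ridge(alpha=0.5).fit(X, y)  # L2 penalty: shrinks all weights, rarely exactly to 0
print(np.round(lasso.coef_, 2))
print(np.round(ridge.coef_, 2))
```

The L1 fit typically reports exact zeros for the uninformative features, while the L2 fit shrinks every coefficient without zeroing any, which is the softer behaviour that helps under multicollinearity.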

Logistic Regression

Basis Functions

Polynomial Basis

Fourier Basis

Radial Basis Function

Topics

Links