- We can think of linear models as operating on features, which are effectively basis functions that form a linear basis for an appropriate set of functions.
Linear Regression
Specifications
- It is a model of the form $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^{\top}\mathbf{x}, \sigma^2)$. Or, in another form, we have Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ such that $y = \mathbf{w}^{\top}\mathbf{x} + \epsilon$.
- Under the assumption of i.i.d. data, minimizing the negative log likelihood using the MVN formula gives a loss function specified by the residual sum of squares, $\operatorname{RSS}(\mathbf{w}) = \sum_{n=1}^{N} \left(y_n - \mathbf{w}^{\top}\mathbf{x}_n\right)^2$.
- This loss function is convex.
- We may obtain a solution using the following normal equation (see the sketch after this list): $\hat{\mathbf{w}}_{\text{OLS}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, under the assumption of invertibility of $\mathbf{X}^{\top}\mathbf{X}$, where $\mathbf{X}$ is the design matrix.
- Geometrically: the goal is to find the best-fit line in feature space such that the sum of squared residuals is minimized.
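As a concrete illustration, here is a minimal NumPy sketch of the closed-form OLS solution via the normal equation; the synthetic data and the names (`X`, `y`, `w_hat`) are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + Gaussian noise
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])  # design matrix with a bias column
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equation: w_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```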
Extensions
- We may extend this by replacing $\mathbf{x}$ with a basis function expansion $\boldsymbol{\phi}(\mathbf{x})$ to allow for non-linear features.
- We may use robust regression if the data has many outliers. Here, we replace the distribution of the noise $\epsilon$ with a heavy-tailed distribution such as the Laplace or the Student-$t$ distribution.
- For a Laplace likelihood, we may choose to minimize the negative log likelihood via linear programming. Alternatively, we may use the Huber loss, $L_{\delta}(r) = \frac{1}{2}r^2$ for $|r| \le \delta$ and $\delta\lvert r\rvert - \frac{1}{2}\delta^2$ otherwise (a sketch follows this list).
- The Huber loss is equivalent to the $\ell_2$ loss for small residuals but accounts for outliers in the data by behaving like the $\ell_1$ loss for large residuals.
- Regularized regression essentially involves controlling the distribution of the weights such that the weights end up being “simple”.
- We may use Ridge Regression, or weight decay, by using a Maximum A Posteriori estimate with a Gaussian prior. That is, we have that $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$, where $\tau^2$ controls the strength of the prior.
- The corresponding solution is given as $\hat{\mathbf{w}}_{\text{ridge}} = (\lambda \mathbf{I}_D + \mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, where $\lambda = \sigma^2 / \tau^2$ and $\sigma^2$ is given by the variance of the Gaussian noise (see above, and the ridge sketch after this list).
- This also tends to be easier to fit numerically using a QR decomposition on a design matrix augmented with the Cholesky decomposition of the prior precision matrix (see Murphy 7.5.2).
- When the number of dimensions $D$ is greater than the number of samples $N$ ($D > N$), we can perform an SVD first for dimensionality reduction.
- Ridge Regression shrinks most strongly the principal components with the smallest singular values, which are the directions with the highest posterior variance (see Murphy 7.5.3).
- Ill-determined parameters are reduced towards $0$ through shrinkage.
- We can also use Principal Components Regression by applying regression on the principal components. However, this is not as effective as Ridge Regression, since it ignores the discarded dimensions entirely (compared to Ridge Regression's soft shrinkage).
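To make the robust-regression bullets above concrete, here is a minimal sketch of the Huber loss and a robust fit using SciPy's general-purpose `minimize`; the threshold `delta` and the helper names (`huber_loss`, `fit_huber`) are illustrative choices, not the only way to do this.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def fit_huber(X, y, delta=1.0):
    """Minimize the summed Huber loss over the weights (a sketch, not production code)."""
    def objective(w):
        return huber_loss(y - X @ w, delta).sum()
    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0).x

# Example with a few gross outliers
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.1, size=50)
y[:5] += 20.0  # corrupt a few points

w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # pulled towards the outliers
w_huber = fit_huber(X, y)                   # typically much closer to [0.5, 2.0]
print(w_ols, w_huber)
```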
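And a sketch of the ridge (MAP) closed-form solution from the bullets above, with $\lambda = \sigma^2 / \tau^2$; the particular values of `sigma2` and `tau2` here are arbitrary.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """MAP estimate with a Gaussian prior: (lam * I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=30)

sigma2, tau2 = 0.5**2, 1.0**2      # noise variance and prior variance (arbitrary here)
lam = sigma2 / tau2                # strength of the prior relative to the likelihood
print(ridge_fit(X, y, lam))
```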
Regularization
- L1 Regularization involves using the regularization term $\lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_j \lvert w_j \rvert$.
- L2 Regularization involves using the regularization term $\lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_j w_j^2$. As a consequence, features do not overpower each other (see the sketch below).
- This is useful in datasets that have multicollinearity.
- The exact mechanisms behind regularization are described here.
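A quick way to see the difference between the two penalties is to fit both on nearly collinear features; this sketch assumes scikit-learn's `Lasso` and `Ridge`, with arbitrary regularization strengths.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)       # nearly collinear with x1
X = np.column_stack([x1, x2, rng.normal(size=N)])
y = 3.0 * x1 + rng.normal(scale=0.1, size=N)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: tends to zero out one of the collinear features
print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: spreads weight across the collinear features
```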
Logistic Regression
Basis Functions
Polynomial Basis
Fourier Basis
Radial Basis Function
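A minimal sketch of what these expansions look like as feature maps (the degrees, frequencies, centres, and widths below are arbitrary choices); any of them can be used as the design matrix in the linear models above.

```python
import numpy as np

def polynomial_basis(x, degree=3):
    """phi(x) = [1, x, x^2, ..., x^degree] for a vector of scalar inputs x."""
    return np.column_stack([x**d for d in range(degree + 1)])

def fourier_basis(x, K=3):
    """phi(x) = [1, sin(kx), cos(kx) for k = 1..K]."""
    feats = [np.ones_like(x)]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.column_stack(feats)

def rbf_basis(x, centres, width=1.0):
    """phi(x) = [exp(-(x - c)^2 / (2 * width^2)) for each centre c]."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * width**2))

x = np.linspace(-3, 3, 100)
Phi = rbf_basis(x, centres=np.linspace(-3, 3, 5))
print(Phi.shape)  # (100, 5) -- use Phi as the design matrix in place of the raw inputs
```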
Topics
Links
- Murphy Ch. 7 - Linear Regression, Robust Regression, Ridge Regression, Bayesian Linear Regression.
- Zhang et al. Ch. 3 - more on Linear Regression.