- We can think of linear models as operating on features, which are effectively basis functions that form a linear basis for an appropriate set of functions.
Linear Regression
Specifications
- It is a model of the form $p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^{\top}\mathbf{x}, \sigma^2)$. Or, in another form, we have Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ such that $y = \mathbf{w}^{\top}\mathbf{x} + \epsilon$.
- Under the assumption of i.i.d. data, minimizing the negative log likelihood using the MVN formula gives a loss function specified by the residual sum of squares, $\operatorname{RSS}(\mathbf{w}) = \sum_{n=1}^{N} \left(y_n - \mathbf{w}^{\top}\mathbf{x}_n\right)^2$.
- This loss function is convex.
- We may obtain a solution using the following normal equation (see the sketch after this list): $\hat{\mathbf{w}}_{\text{OLS}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, under the assumption of invertibility of $\mathbf{X}^{\top}\mathbf{X}$, where $\mathbf{X}$ is the design matrix.
- Geometrically: the goal is to find the best-fit line in feature space such that the sum of squared residuals is minimized.
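As a concrete illustration, here is a minimal NumPy sketch of the closed-form OLS solution via the normal equation; the synthetic data and the names (`X`, `y`, `w_hat`) are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + Gaussian noise
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])  # design matrix with a bias column
w_true = np.array([1.0, 2.0, -0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equation: w_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```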
Extensions
- We may extend this by replacing $\mathbf{x}$ with a basis function expansion $\boldsymbol{\phi}(\mathbf{x})$ to allow for non-linear features.
- We may use robust regression if the data has many outliers. Here, we replace the distribution of the noise $\epsilon$ with a heavy-tailed distribution such as the Laplace or the Student-$t$ distribution.
- For a Laplace likelihood, we may choose to minimize the negative log likelihood via linear programming. Alternatively, we may use the Huber loss, $L_{\delta}(r) = \frac{1}{2}r^2$ for $|r| \le \delta$ and $\delta\lvert r\rvert - \frac{1}{2}\delta^2$ otherwise (a sketch follows this list).
- The Huber loss is equivalent to the $\ell_2$ loss for small residuals but accounts for outliers in the data by behaving like the $\ell_1$ loss for large residuals.
- Regularized regression essentially involves controlling the distribution of the weights such that the weights end up being “simple”.
- We may use Ridge Regression, or weight decay, by using a Maximum A Posteriori estimate with a Gaussian prior. That is, we have that $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \tau^2 \mathbf{I})$, where $\tau^2$ controls the strength of the prior.
- The corresponding solution is given as $\hat{\mathbf{w}}_{\text{ridge}} = (\lambda \mathbf{I}_D + \mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, where $\lambda = \sigma^2 / \tau^2$ and $\sigma^2$ is given by the variance of the Gaussian noise (see above, and the ridge sketch after this list).
- This also tends to be easier to fit numerically using a QR decomposition on a design matrix augmented with the Cholesky decomposition of the prior precision matrix (see Murphy 7.5.2).
- When the number of dimensions $D$ is greater than the number of samples $N$ ($D > N$), we can perform an SVD first for dimensionality reduction.
- Ridge Regression shrinks most strongly the principal components with the smallest singular values, which are the directions with the highest posterior variance (see Murphy 7.5.3).
- Ill-determined parameters are reduced towards $0$ through shrinkage.
- We can also use Principal Components Regression by applying regression on the principal components. However, this is not as effective as Ridge Regression, since it ignores the discarded dimensions entirely (compared to Ridge Regression's soft shrinkage).
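To make the robust-regression bullets above concrete, here is a minimal sketch of the Huber loss and a robust fit using SciPy's general-purpose `minimize`; the threshold `delta` and the helper names (`huber_loss`, `fit_huber`) are illustrative choices, not the only way to do this.

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

def fit_huber(X, y, delta=1.0):
    """Minimize the summed Huber loss over the weights (a sketch, not production code)."""
    def objective(w):
        return huber_loss(y - X @ w, delta).sum()
    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0).x

# Example with a few gross outliers
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([0.5, 2.0]) + rng.normal(scale=0.1, size=50)
y[:5] += 20.0  # corrupt a few points

w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # pulled towards the outliers
w_huber = fit_huber(X, y)                   # typically much closer to [0.5, 2.0]
print(w_ols, w_huber)
```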
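And a sketch of the ridge (MAP) closed-form solution from the bullets above, with $\lambda = \sigma^2 / \tau^2$; the particular values of `sigma2` and `tau2` here are arbitrary.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """MAP estimate with a Gaussian prior: (lam * I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=30)

sigma2, tau2 = 0.5**2, 1.0**2      # noise variance and prior variance (arbitrary here)
lam = sigma2 / tau2                # strength of the prior relative to the likelihood
print(ridge_fit(X, y, lam))
```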
Regularization
- L1 Regularization involves using the regularization term $\lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_j \lvert w_j \rvert$.
- L2 Regularization involves using the regularization term $\lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_j w_j^2$. As a consequence, features do not overpower each other (see the sketch below).
- This is useful in datasets that have multicollinearity.
- The exact mechanisms behind regularization are described here.
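A quick way to see the difference between the two penalties is to fit both on nearly collinear features; this sketch assumes scikit-learn's `Lasso` and `Ridge`, with arbitrary regularization strengths.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
N = 200
x1 = rng.normal(size=N)
x2 = x1 + 0.01 * rng.normal(size=N)       # nearly collinear with x1
X = np.column_stack([x1, x2, rng.normal(size=N)])
y = 3.0 * x1 + rng.normal(scale=0.1, size=N)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # L1: tends to zero out one of the collinear features
print(Ridge(alpha=1.0).fit(X, y).coef_)   # L2: spreads weight across the collinear features
```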
Logistic Regression
Basis Functions
Polynomial Basis
Fourier Basis
Radial Basis Function
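A minimal sketch of what these expansions look like as feature maps (the degrees, frequencies, centres, and widths below are arbitrary choices); any of them can be used as the design matrix in the linear models above.

```python
import numpy as np

def polynomial_basis(x, degree=3):
    """phi(x) = [1, x, x^2, ..., x^degree] for a vector of scalar inputs x."""
    return np.column_stack([x**d for d in range(degree + 1)])

def fourier_basis(x, K=3):
    """phi(x) = [1, sin(kx), cos(kx) for k = 1..K]."""
    feats = [np.ones_like(x)]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.column_stack(feats)

def rbf_basis(x, centres, width=1.0):
    """phi(x) = [exp(-(x - c)^2 / (2 * width^2)) for each centre c]."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2 * width**2))

x = np.linspace(-3, 3, 100)
Phi = rbf_basis(x, centres=np.linspace(-3, 3, 5))
print(Phi.shape)  # (100, 5) -- use Phi as the design matrix in place of the raw inputs
```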
Topics
Links
- Murphy Ch. 7 - Linear Regression, Robust Regression, Ridge Regression, Bayesian Linear Regression.
- Zhang et al. Ch. 3 - more on Linear Regression.