- It makes use of the sampling distribution from which we derive the observations (compare with Bayesian Statistics).
- It treats the dataset as random and the parameters as unknown but fixed. Thus we make use of parameter estimates.
- An estimator $\hat{\theta}(\cdot)$ is applied to data $\mathcal{D}$ so that the parameter estimate is obtained by $\hat{\theta} = \hat{\theta}(\mathcal{D})$.
- The sampling distribution is obtained by sampling many datasets and computing parameter estimates. The sampling distribution is the distribution of these estimates $\hat{\theta}(\mathcal{D})$.
  - This can be obtained with bootstrapping, where we sample many datasets and use the empirical distribution to estimate the sampling distribution (see the sketch after this list). The sampling distribution and the posterior distribution are similar assuming a weak prior.
  - Under certain conditions, as the sample size tends to infinity, the sampling distribution of the MLE becomes Gaussian (see more in Murphy 6.2.2).
- We use heuristics since we do not have an automatic way of choosing between estimators.
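To make the bootstrap idea above concrete, here is a minimal NumPy sketch. The exponential data, the sample-mean estimator, and the number of resamples are illustrative choices, not anything prescribed by these notes: one observed dataset is resampled with replacement many times, the estimator is recomputed on each resample, and the empirical distribution of those estimates approximates the sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# One observed dataset (illustrative): n draws from some unknown distribution.
data = rng.exponential(scale=2.0, size=200)

def estimator(sample):
    """The parameter estimate we care about; here, the sample mean."""
    return sample.mean()

# Bootstrap: resample the observed data with replacement many times and
# recompute the estimate. The empirical distribution of these estimates
# approximates the sampling distribution of the estimator.
n_boot = 5000
boot_estimates = np.array([
    estimator(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

print("point estimate:", estimator(data))
print("bootstrap std error:", boot_estimates.std(ddof=1))
print("95% percentile interval:", np.percentile(boot_estimates, [2.5, 97.5]))
```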
Properties of Good Estimators
- An estimator is consistent if it eventually recovers the true parameters that generated the data as the sample size goes to infinity. That is, $\hat{\theta}(\mathcal{D}) \to \theta^*$ as $|\mathcal{D}| \to \infty$.
- The bias of an estimator is defined as $\text{bias}(\hat{\theta}(\cdot)) = \mathbb{E}_{p(\mathcal{D} \mid \theta^*)}\left[\hat{\theta}(\mathcal{D}) - \theta^*\right]$.
  - An estimator is unbiased if its bias is $0$. This means the sampling distribution is centered on the true parameter.
  - Another way to say this is that $\mathbb{E}[\hat{\theta}(\mathcal{D})] = \theta^*$ (see the first sketch after this list).
  - It is the inherent error that you obtain from the model even with infinite training data. This is due to the model being inherently “biased” towards a particular kind of solution.
- Another consideration is minimum variance.
  - Cramér-Rao lower bound. Let $X_1, \dots, X_n \sim p(X \mid \theta_0)$ and $\hat{\theta} = \hat{\theta}(x_1, \dots, x_n)$ be an unbiased estimator of $\theta_0$. Then, under various smoothness assumptions on $p(X \mid \theta_0)$, we have $\text{Var}\left[\hat{\theta}\right] \geq \frac{1}{n I(\theta_0)}$, where $I(\theta_0)$ is the Fisher information matrix (see the second sketch after this list).
  - The variance is the degree to which the classifier varies if it is trained on a different dataset.
- The Maximum Likelihood Estimator is asymptotically optimal. It is consistent, unbiased in the limit, and achieves the Cramér-Rao lower bound.
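As a numerical illustration of bias and consistency, here is a sketch under illustrative choices (a zero-mean Gaussian with variance 4 and a handful of sample sizes): the MLE of the Gaussian variance divides by $n$ and is therefore biased downward by a factor of $(n-1)/n$, while the $n-1$ version is unbiased; the MLE's bias shrinks as $n$ grows, consistent with it being asymptotically unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0  # variance of the data-generating Gaussian (illustrative)

for n in [5, 20, 100, 1000]:
    # Sample many datasets of size n and apply each estimator to every one;
    # averaging over datasets approximates E[theta_hat] under p(D | theta*).
    datasets = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(20000, n))
    mle_var = datasets.var(axis=1, ddof=0)       # divides by n   -> biased
    unbiased_var = datasets.var(axis=1, ddof=1)  # divides by n-1 -> unbiased
    print(f"n={n:4d}  E[MLE]~{mle_var.mean():.3f}  "
          f"E[unbiased]~{unbiased_var.mean():.3f}  "
          f"theory E[MLE]={(n - 1) / n * true_var:.3f}")
```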
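And a quick check of the Cramér-Rao bound, again as a sketch with illustrative parameters: for a Gaussian with known variance, the Fisher information for the mean in a sample of size $n$ is $n / \sigma^2$, so the bound is $\sigma^2 / n$; the sample mean is unbiased and attains it.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 1.5, 2.0, 50   # illustrative parameters
n_datasets = 100_000

# Sampling distribution of the sample mean, estimated by simulation.
datasets = rng.normal(mu, sigma, size=(n_datasets, n))
sample_means = datasets.mean(axis=1)

# For N(mu, sigma^2) with known sigma, the Fisher information for mu in a
# sample of size n is n / sigma^2, so the Cramer-Rao bound is sigma^2 / n.
crlb = sigma**2 / n
print("Var[sample mean] ~", sample_means.var(ddof=1))
print("Cramer-Rao bound  =", crlb)
# The sample mean is unbiased and attains the bound (it is efficient).
```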
Bias-Variance Tradeoff
- If we are using Mean Squared Error, we can show that for a given estimator or model, $\text{MSE} = \text{bias}^2 + \text{variance}$.
- In terms of modeling tasks, high variance means that the model is more flexible and adapts strongly to the particular dataset seen by the estimator. However, it is also prone to overfitting as it adapts to all patterns, including noise.
- High bias means that the model is less sensitive to noise, and thus it can generalize well. However, it is also less flexible, and if the model is biased too far away from the true parameters, it will underfit.
- Note that for classification with 0-1 loss, this decomposition does not hold.
  - For a correct estimate, the bias is negative, and it pays to decrease the variance.
  - For an incorrect estimate, the bias is positive, and it pays to increase the variance.
- Here is a more formal statement of this tradeoff. Suppose we have a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. Let $f$ be the true function such that
$$y = f(x) + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise. Let $\hat{f}$ be the approximation to $f$ trained on $\mathcal{D}$. We can then quantify how good $\hat{f}$ is based on the MSE, which can be framed as
$$\text{MSE}(x) = \mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right].$$
It can then be decomposed as follows. Let us drop the subscript $x$ for convenience. $f$ is independent of our choice of $\mathcal{D}$, so $\mathbb{E}[y] = \mathbb{E}[f + \epsilon] = f$. The variance of $y$ is computed using the variance of the noise term; this follows because $f$ is deterministic and $\epsilon$ has zero mean:
$$\text{Var}[y] = \mathbb{E}\left[(y - \mathbb{E}[y])^2\right] = \mathbb{E}\left[\epsilon^2\right] = \sigma^2.$$
The MSE then has the following computation, where the cross term $\mathbb{E}\left[\epsilon \hat{f}\right]$ vanishes because $\hat{f}$ is independent of the noise:
$$
\begin{aligned}
\mathbb{E}\left[\left(y - \hat{f}\right)^2\right]
&= \mathbb{E}\left[y^2\right] - 2\,\mathbb{E}\left[y \hat{f}\right] + \mathbb{E}\left[\hat{f}^2\right] \\
&= \text{Var}[y] + f^2 - 2 f\,\mathbb{E}\left[\hat{f}\right] + \text{Var}\left[\hat{f}\right] + \mathbb{E}\left[\hat{f}\right]^2 \\
&= \left(f - \mathbb{E}\left[\hat{f}\right]\right)^2 + \text{Var}\left[\hat{f}\right] + \sigma^2 \\
&= \text{bias}\left[\hat{f}\right]^2 + \text{Var}\left[\hat{f}\right] + \sigma^2.
\end{aligned}
$$
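The decomposition can be checked numerically. The sketch below makes arbitrary illustrative choices (a sine as the true function, Gaussian noise with $\sigma = 0.3$, and polynomial models of increasing degree): for each model it trains on many independent datasets, estimates the bias and variance of the prediction at a single test point, and compares $\text{bias}^2 + \text{variance} + \sigma^2$ against a Monte Carlo estimate of the MSE at that point.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3                        # noise std (illustrative)
f = np.sin                         # true function (illustrative)
x_train = np.linspace(0, np.pi, 20)
x0 = 1.0                           # test point at which we decompose the MSE

for degree in [1, 3, 9]:           # low degree -> high bias, high degree -> high variance
    preds, sq_errors = [], []
    for _ in range(2000):          # many independent training sets D
        y_train = f(x_train) + rng.normal(0, sigma, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        pred = np.polyval(coeffs, x0)          # f_hat(x0) trained on this D
        y0 = f(x0) + rng.normal(0, sigma)      # fresh noisy test label at x0
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    print(f"degree={degree}:  MSE~{np.mean(sq_errors):.4f}  "
          f"bias^2+var+sigma^2~{bias2 + variance + sigma**2:.4f}")
```

The low-degree fit should show the bias term dominating while the high-degree fit shows the variance term dominating, matching the qualitative bullets above.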