• It bases uncertainty on the sampling distribution, i.e., the variation we would see if the observations were drawn again from the data-generating distribution (compare with Bayesian Statistics)

  • It treats the dataset as random and the parameters as unknown but fixed. We therefore work with parameter estimates computed from the data.

    • An estimator $\delta$ is applied to the data $\mathcal{D}$ so that the parameter estimate is obtained by $\hat{\theta} = \delta(\mathcal{D})$

    • The sampling distribution is obtained by sampling many datasets $\mathcal{D}^{(s)}$ from the true model and computing the parameter estimate on each. The sampling distribution is the distribution of $\hat{\theta}^{(s)} = \delta(\mathcal{D}^{(s)})$ across these datasets.

      • This can be approximated with bootstrapping, where we resample many datasets with replacement from the observed data and use the empirical distribution of the resulting estimates to estimate the sampling distribution (see the sketch after this list). The sampling distribution and the posterior distribution are similar assuming a weak prior.
      • Under certain conditions, as the sample size tends to infinity, the sampling distribution of the MLE becomes Gaussian (see more in Murphy 6.2.2)
  • We use heuristics since we do not have an automatic way of choosing between estimators.
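
A minimal sketch of the bootstrap idea above, in Python with NumPy. The Gaussian-mean setup and names such as `n_boot` are illustrative assumptions, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed dataset: n draws from some data-generating distribution
# (illustratively, a Gaussian with true mean 5.0).
data = rng.normal(loc=5.0, scale=2.0, size=100)

def estimator(x):
    """MLE of the mean of a Gaussian: the sample average."""
    return x.mean()

# Bootstrap: resample the observed data with replacement many times
# and apply the estimator to each resampled dataset.
n_boot = 2000
boot_estimates = np.array([
    estimator(rng.choice(data, size=data.size, replace=True))
    for _ in range(n_boot)
])

# The empirical distribution of boot_estimates approximates the
# sampling distribution of the estimator.
print("bootstrap mean of estimates:", boot_estimates.mean())
print("bootstrap std (approximate standard error):", boot_estimates.std())
```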

Properties of Good Estimators

  • An estimator is consistent if it eventually recovers the true parameters that generated the data as the sample size goes to infinity. That is, $\hat{\theta}(\mathcal{D}) \to \theta^*$ (in probability) as $|\mathcal{D}| \to \infty$.

  • The bias of an estimator is defined as $\operatorname{bias}(\hat{\theta}(\cdot)) = \mathbb{E}_{p(\mathcal{D} \mid \theta^*)}\left[\hat{\theta}(\mathcal{D}) - \theta^*\right]$, where $\theta^*$ is the true parameter.

    • An estimator is unbiased if its bias is $0$. This means the sampling distribution is centered on the true parameter (see the simulation after this list).
    • Another way to say this is that $\mathbb{E}\left[\hat{\theta}(\mathcal{D})\right] = \theta^*$.
    • In the modeling sense, bias is the inherent error that remains even with infinite training data. This is due to the model being inherently “biased” towards a particular kind of solution.
  • Another consideration is minimum variance.

    • Cramér-Rao lower bound. Let $X_1, \dots, X_n \sim p(X \mid \theta_0)$ and let $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ be an unbiased estimator of $\theta_0$. Then, under various smoothness assumptions on $p(X \mid \theta_0)$, we have $\operatorname{Var}\left[\hat{\theta}\right] \geq \frac{1}{n I(\theta_0)}$,
      where $I(\theta_0)$ is the Fisher information matrix.
    • The variance is the degree to which the estimate varies if it is trained on a different dataset drawn from the same distribution.
  • The Maximum Likelihood Estimator is asymptotically optimal: it is consistent, asymptotically unbiased, and asymptotically achieves the Cramér-Rao lower bound.
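
A hedged simulation of the bias and variance definitions above, approximating the sampling distribution by repeatedly drawing fresh datasets from a known Gaussian. The variance-estimator example and the specific sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # variance of the data-generating Gaussian

def mle_var(x):
    """MLE of the variance (divides by n): biased for finite n."""
    return np.mean((x - x.mean()) ** 2)

def unbiased_var(x):
    """Sample variance dividing by n - 1: unbiased."""
    return np.var(x, ddof=1)

# Approximate each estimator's sampling distribution by applying it
# to many datasets drawn from the true model.
for n in (10, 100, 1000):
    datasets = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(5000, n))
    mle_est = np.array([mle_var(d) for d in datasets])
    unb_est = np.array([unbiased_var(d) for d in datasets])
    print(f"n={n:4d}  bias(MLE)={mle_est.mean() - true_var:+.3f}  "
          f"bias(unbiased)={unb_est.mean() - true_var:+.3f}  "
          f"var(MLE)={mle_est.var():.3f}")
```

The bias of the MLE shrinks toward zero and its variance decreases as the sample size grows, illustrating consistency and asymptotic unbiasedness.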

Bias-Variance Tradeoff

  • If we are using Mean Squared Error, we can show that for a given estimator $\hat{\theta}$, $\mathbb{E}\left[\left(\hat{\theta} - \theta^*\right)^2\right] = \operatorname{bias}^2\left(\hat{\theta}\right) + \operatorname{Var}\left[\hat{\theta}\right]$

  • In terms of modeling tasks, high variance means the model is flexible and adapts to the particular training set, including its noise. It is therefore prone to overfitting, since it fits patterns that do not generalize.

  • High bias means the model is less sensitive to noise, which can help it generalize. However, it is also less flexible, and if the solutions it is biased towards are too far from the true function, it will underfit.

  • Note that for classification with 0-1 loss, this additive decomposition does not hold.

    • If the estimate is on the correct side of the decision boundary, the bias is negative, and it pays to decrease the variance.
    • If the estimate is on the wrong side of the decision boundary, the bias is positive, and it can pay to increase the variance.
  • Here is a more formal statement of this tradeoff. Suppose we have a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$. Let $f$ be the true function such that

    $$y = f(x) + \epsilon,$$

    where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise.

    Let $\hat{f}$ be the approximation to $f$ trained on $\mathcal{D}$. We can then quantify how good $\hat{f}$ is based on the MSE.

    The MSE can be framed as

    $$\operatorname{MSE} = \mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right],$$

    where the expectation is taken over the noise and over training sets $\mathcal{D}$.

    It can then be decomposed as follows. Let us drop the subscript $x$ for convenience.

    $$\mathbb{E}\left[(y - \hat{f})^2\right] = \mathbb{E}\left[y^2\right] + \mathbb{E}\left[\hat{f}^2\right] - 2\,\mathbb{E}\left[y\hat{f}\right] = \operatorname{Var}[y] + \mathbb{E}[y]^2 + \operatorname{Var}[\hat{f}] + \mathbb{E}[\hat{f}]^2 - 2\,\mathbb{E}\left[y\hat{f}\right]$$

    $f$ is deterministic and independent of our choice of $\hat{f}$, so $\mathbb{E}[y] = f$ and $\mathbb{E}\left[y\hat{f}\right] = f\,\mathbb{E}[\hat{f}]$.

    The variance of $y$ is computed using the variance of the noise term: $\operatorname{Var}[y] = \operatorname{Var}[\epsilon] = \sigma^2$. This follows because $\hat{f}$ is independent of the noise and $f$ is deterministic.

    The MSE then has the following computation (verified numerically in the sketch below):

    $$\operatorname{MSE} = \sigma^2 + \left(f - \mathbb{E}[\hat{f}]\right)^2 + \operatorname{Var}[\hat{f}] = \sigma^2 + \operatorname{Bias}\left[\hat{f}\right]^2 + \operatorname{Var}\left[\hat{f}\right].$$
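
A minimal Python sketch that checks this decomposition numerically. The true function, the polynomial model, and all variable names are illustrative assumptions; only the decomposition itself comes from the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True function (illustrative choice)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3            # standard deviation of the Gaussian noise
n_train, n_trials = 30, 2000
degree = 3             # degree of the fitted polynomial model
x_test = 0.4           # single test input at which the error is decomposed

preds, sq_errors = [], []
for _ in range(n_trials):
    # Draw a fresh training set D and fit \hat{f} on it.
    x = rng.uniform(0.0, 1.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)
    # Evaluate \hat{f}(x_test) against a noisy target y = f(x_test) + eps.
    pred = np.polyval(coeffs, x_test)
    y_test = f(x_test) + rng.normal(0.0, sigma)
    preds.append(pred)
    sq_errors.append((y_test - pred) ** 2)

preds = np.array(preds)
mse = np.mean(sq_errors)
bias2 = (preds.mean() - f(x_test)) ** 2
variance = preds.var()

# The two printed quantities should agree up to Monte Carlo error.
print(f"MSE                     : {mse:.4f}")
print(f"bias^2 + var + sigma^2  : {bias2 + variance + sigma ** 2:.4f}")
```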

Links