Generalization Error

  • The generalization error of $f$ with the loss function $\ell$ is denoted as $R(f) = \mathbb{E}_{(x,y)\sim p^*}\,[\ell(y, f(x))]$

  • This serves as a metric to determine how a model performs on unseen data. A generalization error that is high relative to the training error may indicate that the model is overfitting (see below for more)

  • The generalization error typically cannot be computed exactly since the true distribution $p^*$ is unknown. Instead, we estimate it by computing the same average loss used for the training error, but over a set of unseen samples (see the sketch at the end of this list)

    We say that the model generalizes if the empirical error converges to the generalization error, namely by asserting that $\hat{R}_n(f) \to R(f)$ as $n \to \infty$

  • The Empirical Error, denoted $\hat{R}_n(f)$, is defined as the proportion of incorrectly classified values to the size of the dataset: $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[f(x_i) \neq y_i]$

  • The training error of the function $f$ using the loss function $\ell$ is denoted as $\hat{R}_{\text{train}}(f)$ and is defined as $\hat{R}_{\text{train}}(f) = \frac{1}{N_{\text{train}}}\sum_{i=1}^{N_{\text{train}}} \ell(y_i, f(x_i))$

  • The generalization gap is defined as the difference between the training error and the generalization error (or any estimate thereof).
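
The held-out estimate described above can be made concrete with a small sketch; the synthetic data, the least-squares model, and the squared loss are illustrative assumptions, not part of these notes:

```python
# Minimal sketch: estimating training error vs. held-out ("generalization") error.
# Assumes a toy least-squares regression model and squared loss; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

# Hold out 25% of the data as an unseen evaluation set
split = 150
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Fit a linear model by least squares (the "learning algorithm")
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def avg_loss(Xs, ys):
    """Average squared loss over the given samples."""
    preds = Xs @ w
    return np.mean((ys - preds) ** 2)

train_error = avg_loss(X_train, y_train)   # training error
test_error = avg_loss(X_test, y_test)      # estimate of the generalization error
print(f"train={train_error:.4f}  test={test_error:.4f}  gap={test_error - train_error:.4f}")
```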

Empirical Risk Minimization

  • We reframe the problem as follows. Let $\ell(y, \hat{y})$ be a loss function with $y$ being the true response, and $\hat{y} = f(x)$ the prediction given the input $x$.

    Thus, we are predicting observable quantities

  • The risk is now defined (for the discrete case) as $R(f) = \mathbb{E}_{p^*(x,y)}[\ell(y, f(x))] = \sum_{x}\sum_{y} \ell(y, f(x))\, p^*(x, y)$

    Where $p^*(x, y)$ represents the true distribution, approximated by real data through the empirical distribution $p_{\mathcal{D}}(x, y)$.

    We can define something similar by replacing the summations with integrals.

    • The Risk Identity is defined as (for the continuous case) $R(f) = \iint \ell(y, f(x))\, p^*(x, y)\, dx\, dy$

  • The empirical risk is then defined as $R_{\text{emp}}(f) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(x_n))$

    That is, it is the risk when we use the empirical distribution instead of the true distribution (in other words, average loss over the dataset).

  • The goal, then, becomes to minimize the empirical risk (see the sketch below).
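
A minimal sketch of empirical risk minimization, assuming a linear predictor, squared loss, and plain gradient descent (all illustrative choices, not prescribed by the notes):

```python
# Sketch of empirical risk minimization: gradient descent on the average
# squared loss of a linear predictor. Data and names are purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=100)

w = np.zeros(2)   # parameters of the predictor f(x) = w.x
lr = 0.1          # step size

def empirical_risk(w):
    # R_emp(f) = (1/N) * sum of losses over the dataset
    return np.mean((y - X @ w) ** 2)

for _ in range(500):
    grad = -2 * X.T @ (y - X @ w) / len(y)   # gradient of the empirical risk
    w -= lr * grad

print("estimated w:", w, " empirical risk:", empirical_risk(w))
```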

Regularization

  • Minimizing the empirical risk alone will usually result in overfitting. This is because we are implicitly assuming that the empirical distribution equals the true distribution, which is not necessarily true due to sampling error
  • Hence we regularize by adding a complexity penalty $C(f)$ weighted with $\lambda$, minimizing $R_{\text{emp}}(f) + \lambda C(f)$ instead (see the sketch below)
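
One concrete (assumed) instance of this is an L2 penalty on the weights of a linear model, i.e., ridge regression; the closed form below is a sketch of that special case, not the only choice of penalty:

```python
# Sketch: regularized empirical risk minimization with an assumed L2 penalty
# (ridge regression). The closed-form solution minimizes
#   (1/N) * ||y - Xw||^2 + lam * ||w||^2
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))            # few samples, many features -> prone to overfit
y = X[:, 0] + 0.1 * rng.normal(size=30)

lam = 0.5                                 # regularization weight (lambda)
N, D = X.shape
w_ridge = np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)
w_unreg = np.linalg.lstsq(X, y, rcond=None)[0]

print("||w_unreg|| =", np.linalg.norm(w_unreg))
print("||w_ridge|| =", np.linalg.norm(w_ridge))   # shrunk toward zero by the penalty
```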

Choosing λ

  • We cannot use the training set to approximate the true risk when choosing $\lambda$, since minimizing training error would simply favor the least-regularized (most flexible) model.
  • We use the structural risk minimization principle, where $\hat{\lambda} = \operatorname*{argmin}_{\lambda} \hat{R}(f_{\lambda})$
    Where $\hat{R}(\cdot)$ is an estimate of the risk (e.g., a cross-validation estimate).
  • We model the uncertainty of our risk estimates via their standard error. The one-standard-error rule is a heuristic where we choose the simplest model (Occam’s razor) whose estimated risk is within one standard error of the best (lowest) estimate (see the sketch below).
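
A sketch of the one-standard-error rule, assuming per-fold risk estimates are already available for each candidate λ (the numbers are made up for illustration):

```python
# Sketch of the one-standard-error rule over assumed per-fold risk estimates.
import numpy as np

lambdas = np.array([0.01, 0.1, 1.0, 10.0])   # candidates; "simplest" = most regularized
fold_risks = np.array([                       # shape (num_lambdas, num_folds), illustrative
    [0.30, 0.34, 0.28, 0.33, 0.31],
    [0.27, 0.29, 0.26, 0.30, 0.28],
    [0.28, 0.29, 0.27, 0.29, 0.28],
    [0.40, 0.42, 0.39, 0.41, 0.40],
])

mean_risk = fold_risks.mean(axis=1)
std_err = fold_risks.std(axis=1, ddof=1) / np.sqrt(fold_risks.shape[1])

best = mean_risk.argmin()
threshold = mean_risk[best] + std_err[best]

# Simplest (most regularized) model whose mean risk is within one std. error of the best
chosen = max(i for i in range(len(lambdas)) if mean_risk[i] <= threshold)
print("best lambda:", lambdas[best], " chosen lambda:", lambdas[chosen])
```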

Cross Validation

  • Let $\mathcal{D}$ be the data, $N$ the number of data cases, $\mathcal{D}_k$ the $k$th data fold, and $\mathcal{D}_{-k}$ all the other data. Let $\mathcal{F}$ be a learning algorithm which outputs a parameter vector given a dataset and some model index $m$ (i.e., hyperparameters): $\hat{\theta}_m = \mathcal{F}(\mathcal{D}, m)$.

    Let $\mathcal{P}$ be a prediction function that takes an input $x$ and a parameter vector $\hat{\theta}$ and returns a prediction $\hat{y} = \mathcal{P}(x, \hat{\theta})$

    The fit-predict cycle is denoted as $f_m(x, \mathcal{D}) = \mathcal{P}(x, \mathcal{F}(\mathcal{D}, m))$

  • The $K$-fold CV estimate of the risk is defined as $R(m, \mathcal{D}, K) = \frac{1}{N}\sum_{k=1}^{K}\sum_{i \in \mathcal{D}_k} \ell\big(y_i,\, \mathcal{P}(x_i, \mathcal{F}(\mathcal{D}_{-k}, m))\big)$

  • We then call the fitting algorithm once per fold. Let $f_m^k(x) = \mathcal{P}(x, \mathcal{F}(\mathcal{D}_{-k}, m))$

    be the function that was trained on all data except for the test data in fold $k$. The estimate then becomes $R(m, \mathcal{D}, K) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i,\, f_m^{k(i)}(x_i)\big)$

    Where $k(i)$ is the fold in which $i$ is used as test data.

  • In summary: divide the dataset into $K$ folds; each fold is used once as the validation set for a model trained on the remaining $K-1$ folds (see the sketch after this list).

    • The CV estimate of the risk is then the average loss across all folds.
  • If $K = N$, this is leave-one-out cross-validation.
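
A sketch of the $K$-fold procedure, reusing the assumed ridge-style fit/predict functions as illustrative stand-ins for $\mathcal{F}$ and $\mathcal{P}$:

```python
# Minimal K-fold cross-validation sketch. Fold splitting and the fit/predict
# helpers are illustrative, not a specific library API.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=120)

def fit(X, y, lam):
    """Learning algorithm F: returns a parameter vector given data and hyperparameter lam."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)

def predict(X, theta):
    """Prediction function P."""
    return X @ theta

K = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, K)

losses = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    theta = fit(X[train_idx], y[train_idx], lam=0.1)   # trained on D_{-k}
    preds = predict(X[test_idx], theta)                # evaluated on fold k
    losses.append(np.mean((y[test_idx] - preds) ** 2))

print("K-fold CV risk estimate:", np.mean(losses))
```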

Theoretical Upper Bounds

  • Murphy 6.5.1: For any data distribution $p^*$, and any dataset $\mathcal{D}$ of size $N$ drawn from $p^*$, the probability that our estimate of the error rate will be more than $\epsilon$ wrong in the worst case is upper bounded as follows: $P\big(\max_{h \in \mathcal{H}} |\operatorname{err}(h, \mathcal{D}) - \operatorname{err}(h, p^*)| > \epsilon\big) \le 2\,\dim(\mathcal{H})\, e^{-2N\epsilon^2}$

    Where $\mathcal{H}$ denotes the (finite) hypothesis space, so that $\dim(\mathcal{H}) = |\mathcal{H}|$ is its size.

    • The worst-case gap between training error and generalization error grows with a larger hypothesis space and shrinks with a larger dataset. This is the key insight of statistical learning theory.
    • For infinite-dimensional hypothesis spaces, we use the Vapnik–Chervonenkis (VC) dimension in place of $|\mathcal{H}|$
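
To get a feel for the bound, a small numeric evaluation with assumed values of $|\mathcal{H}|$, $\epsilon$, and $N$ (the values are illustrative only):

```python
# Worked numeric example of the bound above:
#   P(worst-case estimation error > eps) <= 2 * |H| * exp(-2 * N * eps**2)
import math

H_size = 1000   # size of the (finite) hypothesis space
eps = 0.05      # tolerated error in the estimate
for N in (500, 2000, 10000):
    bound = 2 * H_size * math.exp(-2 * N * eps ** 2)
    print(f"N={N:>6}: bound={bound:.4g}")
# The bound is vacuous (>1) for small N and becomes meaningful as N grows.
```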

Relation with Population Error

  • The population error, denoted $\epsilon(f)$, is the expected fraction of examples in an underlying population, characterized by probability density $p(x, y)$, for which the model disagrees with the ground truth: $\epsilon(f) = \mathbb{E}_{(x,y)\sim p}\,[\mathbb{1}(f(x) \neq y)]$

  • By the Central Limit Theorem, we can show that the empirical error approaches the population error at a rate of $O(1/\sqrt{n})$.

  • The asymptotic standard deviation of the estimate cannot be greater than $\sqrt{0.25/n} = \frac{1}{2\sqrt{n}}$ by the central limit theorem.

    • This follows by reparameterizing each error indicator as a Bernoulli variable with success probability $\epsilon$ (i.e., the probability that the model errs). Thus, we get the per-indicator variance $\epsilon(1 - \epsilon)$,
      Which is highest when $\epsilon = 0.5$, giving a worst-case variance of $0.25$.
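
A short worked version of that argument, written out as a sketch under the Bernoulli assumption above:

```latex
% Worked bound under the Bernoulli-error assumption (sketch)
\[
\hat{\epsilon}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad Z_i \sim \mathrm{Bernoulli}(\epsilon),
\]
\[
\operatorname{Var}[\hat{\epsilon}_n] = \frac{\epsilon(1-\epsilon)}{n} \le \frac{0.25}{n}
\quad\Longrightarrow\quad
\operatorname{sd}[\hat{\epsilon}_n] \le \sqrt{\frac{0.25}{n}} = \frac{1}{2\sqrt{n}}.
\]
```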

Confusion Matrices

  • See Here for all the entries in the confusion matrix (i.e., TPR, FPR).
  • The Receiver Operating Characteristic (ROC) curve shows the TPR and FPR as a function of the classification threshold $\tau$.
    • Good classification systems will hug the left axis and then the top axis.
    • The quality is quantified as Area Under the Curve (AUC). Higher = better.
    • The Equal Error Rate (EER) is the value at the threshold $\tau$ such that the false positive rate equals the false negative rate. Lower = better.
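
A sketch of tracing a ROC curve by sweeping the threshold $\tau$ over classifier scores and integrating with the trapezoid rule; the labels and scores are synthetic and the variable names are illustrative:

```python
# Sketch: ROC curve from threshold sweeping, AUC via the trapezoid rule.
import numpy as np

rng = np.random.default_rng(4)
labels = rng.integers(0, 2, size=500)           # ground-truth 0/1 labels
scores = labels + 0.8 * rng.normal(size=500)    # noisy classifier scores

thresholds = np.sort(np.unique(scores))[::-1]   # sweep from high to low
tprs, fprs = [], []
for tau in thresholds:
    preds = scores >= tau
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    tprs.append(tp / np.sum(labels == 1))       # true positive rate
    fprs.append(fp / np.sum(labels == 0))       # false positive rate

tprs, fprs = np.array(tprs), np.array(fprs)
auc = np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2)   # area under the curve
print(f"AUC ~ {auc:.3f}")
```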

Accuracy

  • Accuracy measures the proportion of correct predictions over all classifications performed

  • Although accuracy can give an idea for a model’s performance, it may not necessarily reflect whether or not the model has truly learned patterns in the data.

  • It also weights every correct classification equally, which implicitly assumes that the population proportion of each class is roughly equal; this does not always hold (see the class-imbalance example below)
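
A tiny illustration of that failure mode, assuming a 95/5 class split and a classifier that always predicts the majority class:

```python
# Accuracy under class imbalance: a trivial majority-class classifier looks
# "accurate" without learning anything. Data is illustrative.
import numpy as np

labels = np.array([0] * 95 + [1] * 5)   # 95% negatives, 5% positives
preds = np.zeros_like(labels)           # trivial classifier: always predict 0

accuracy = np.mean(preds == labels)
print(f"accuracy = {accuracy:.2f}")     # 0.95, despite never detecting a positive
```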

Precision-Recall

  • The precision measures what fraction of predicted positives are actually positive, i.e., $\mathrm{Precision} = \frac{TP}{TP + FP}$

  • The recall measures what fraction of the ground-truth positive samples are detected, i.e., $\mathrm{Recall} = \frac{TP}{TP + FN}$

  • The precision-recall curve can be plotted as a function of the threshold $\tau$. Better classifiers hug the top right.

  • The precision at $k$ is the precision at a fixed recall level, obtained by considering only the top-$k$ ranked predictions; averaging precision over recall levels gives the average precision.

  • The F1 score combines precision and recall. It is the harmonic mean of the two: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

    • For multi-class classification, we can generalize the score as a macro-averaged $F_1$, defined as the average of the per-class $F_1$ scores obtained by identifying each class against all other classes. That is, $F_1^{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} F_1(c)$
    • Alternatively, we can use a micro-averaged score where we pool the counts from each class’s contingency table (that is, by summing the $TP$, $FP$, and $FN$ counts across all classes) before computing a single score (see the sketch below)
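
A sketch computing precision, recall, F1, and the macro/micro averages from assumed per-class contingency counts (the counts and helper names are illustrative):

```python
# Precision, recall, F1, and macro/micro averaging from raw per-class counts.
import numpy as np

# Per-class (TP, FP, FN) counts for a 3-class problem (illustrative)
counts = {"A": (40, 10, 5), "B": (25, 5, 20), "C": (10, 2, 3)}

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro-average: mean of per-class F1 scores
macro_f1 = np.mean([f1(*c) for c in counts.values()])

# Micro-average: pool the counts across classes, then compute a single F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```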

Multiple Hypothesis Testing

  • Problem: we need to make classification decisions for many hypotheses simultaneously, which inflates the number of false positives if each decision is thresholded naively.

  • We can minimize the false discovery rate, defined as $FDR(\tau, \mathcal{D}) \triangleq \frac{1}{N(\tau)} \sum_i (1 - p_i)\,\mathbb{1}(p_i > \tau)$

    Where $N(\tau) = \sum_i \mathbb{1}(p_i > \tau)$ is the number of positively classified items and

    Where $p_i$ is the belief that hypothesis $i$ is true (its posterior probability given the data).

  • The Direct Posterior Probability approach involves modifying the decision threshold $\tau$ until the estimated FDR falls below a desired level.
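
A sketch of that idea, using made-up posterior probabilities and a simple scan over thresholds (the helper and target value are assumptions for illustration):

```python
# Sketch of the direct-posterior-probability idea: lower the threshold tau
# as far as possible while the estimated FDR stays below a target.
import numpy as np

rng = np.random.default_rng(5)
p = rng.uniform(size=200)   # p_i: belief that hypothesis i is true (made up)

def fdr(p, tau):
    discovered = p > tau
    if discovered.sum() == 0:
        return 0.0
    return np.sum((1 - p)[discovered]) / discovered.sum()

target = 0.05
taus = np.linspace(0, 1, 101)
# Smallest tau (most discoveries) whose estimated FDR is still below the target
tau_star = min(t for t in taus if fdr(p, t) <= target)
print(f"tau* = {tau_star:.2f}, discoveries = {(p > tau_star).sum()}, FDR = {fdr(p, tau_star):.3f}")
```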

Underfitting

  • Underfitting pertains to a phenomenon where a model cannot capture the underlying patterns within the data.

  • The model is incapable of capturing the complexity of the problem. Whether we are training or testing, there is no difference in model performance.

  • More formally, we can use the generalization and training error.

    If the gap $R(f) - \hat{R}_{\text{train}}(f)$ is small while $R(f)$ and $\hat{R}_{\text{train}}(f)$ themselves are substantially high, the model has underfitted

Overfitting

  • Overfitting pertains to a phenomenon where the model corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably.

  • In other words, the model does not generalize well: it has effectively memorized the training data rather than learning the underlying patterns.

  • If the generalization gap $R(f) - \hat{R}_{\text{train}}(f)$ is large by a significant amount, then we may have an overfitted model.

  • Remedying overfitting typically requires either more training data or regularization techniques.

Links