Generalization Error

  • The generalization error of $f$ with the loss function $\ell$ is denoted as $R(f) = \mathbb{E}_{(x,y)\sim p^*}\,[\ell(y, f(x))]$

  • This serves as a metric to determine how a model performs on unseen data. A generalization error that is high relative to the training error may indicate that the model is overfitting (see below for more)

  • The generalization error typically cannot be computed exactly since the true distribution $p^*$ is unknown. Instead, we estimate it by computing the same average loss used for the training error, but over a set of unseen samples (see the sketch at the end of this list)

    We say that the model generalizes if the empirical error converges to the generalization error, namely by asserting that $\hat{R}_n(f) \to R(f)$ as $n \to \infty$

  • The Empirical Error, denoted $\hat{R}_n(f)$, is defined as the proportion of incorrectly classified values to the size of the dataset: $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[f(x_i) \neq y_i]$

  • The training error of the function $f$ using the loss function $\ell$ is denoted as $\hat{R}_{\text{train}}(f)$ and is defined as $\hat{R}_{\text{train}}(f) = \frac{1}{N_{\text{train}}}\sum_{i=1}^{N_{\text{train}}} \ell(y_i, f(x_i))$

  • The generalization gap is defined as the difference between the training error and the generalization error (or any estimate thereof).
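
The held-out estimate described above can be made concrete with a small sketch; the synthetic data, the least-squares model, and the squared loss are illustrative assumptions, not part of these notes:

```python
# Minimal sketch: estimating training error vs. held-out ("generalization") error.
# Assumes a toy least-squares regression model and squared loss; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

# Hold out 25% of the data as an unseen evaluation set
split = 150
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Fit a linear model by least squares (the "learning algorithm")
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def avg_loss(Xs, ys):
    """Average squared loss over the given samples."""
    preds = Xs @ w
    return np.mean((ys - preds) ** 2)

train_error = avg_loss(X_train, y_train)   # training error
test_error = avg_loss(X_test, y_test)      # estimate of the generalization error
print(f"train={train_error:.4f}  test={test_error:.4f}  gap={test_error - train_error:.4f}")
```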

Empirical Risk Minimization

  • We reframe the problem as follows. Let $\ell(y, \hat{y})$ be a loss function with $y$ being the true response, and $\hat{y} = f(x)$ the prediction given the input $x$.

    Thus, we are predicting observable quantities

  • The risk is now defined (for the discrete case) as $R(f) = \mathbb{E}_{p^*(x,y)}[\ell(y, f(x))] = \sum_{x}\sum_{y} \ell(y, f(x))\, p^*(x, y)$

    Where $p^*(x, y)$ represents the true distribution, approximated by real data through the empirical distribution $p_{\mathcal{D}}(x, y)$.

    We can define something similar by replacing the summations with integrals.

    • The Risk Identity is defined as (for the continuous case) $R(f) = \iint \ell(y, f(x))\, p^*(x, y)\, dx\, dy$

  • The empirical risk is then defined as $R_{\text{emp}}(f) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, f(x_n))$

    That is, it is the risk when we use the empirical distribution instead of the true distribution (in other words, average loss over the dataset).

  • The goal, then, becomes to minimize the empirical risk (see the sketch below).
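
A minimal sketch of empirical risk minimization, assuming a linear predictor, squared loss, and plain gradient descent (all illustrative choices, not prescribed by the notes):

```python
# Sketch of empirical risk minimization: gradient descent on the average
# squared loss of a linear predictor. Data and names are purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=100)

w = np.zeros(2)   # parameters of the predictor f(x) = w.x
lr = 0.1          # step size

def empirical_risk(w):
    # R_emp(f) = (1/N) * sum of losses over the dataset
    return np.mean((y - X @ w) ** 2)

for _ in range(500):
    grad = -2 * X.T @ (y - X @ w) / len(y)   # gradient of the empirical risk
    w -= lr * grad

print("estimated w:", w, " empirical risk:", empirical_risk(w))
```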

Regularization

  • Minimizing the empirical risk alone will usually result in overfitting. This is because we are implicitly assuming that the empirical distribution equals the true distribution, which is not necessarily true due to sampling error
  • Hence we regularize by adding a complexity penalty $C(f)$ weighted with $\lambda$, minimizing $R_{\text{emp}}(f) + \lambda C(f)$ instead (see the sketch below)
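
One concrete (assumed) instance of this is an L2 penalty on the weights of a linear model, i.e., ridge regression; the closed form below is a sketch of that special case, not the only choice of penalty:

```python
# Sketch: regularized empirical risk minimization with an assumed L2 penalty
# (ridge regression). The closed-form solution minimizes
#   (1/N) * ||y - Xw||^2 + lam * ||w||^2
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))            # few samples, many features -> prone to overfit
y = X[:, 0] + 0.1 * rng.normal(size=30)

lam = 0.5                                 # regularization weight (lambda)
N, D = X.shape
w_ridge = np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)
w_unreg = np.linalg.lstsq(X, y, rcond=None)[0]

print("||w_unreg|| =", np.linalg.norm(w_unreg))
print("||w_ridge|| =", np.linalg.norm(w_ridge))   # shrunk toward zero by the penalty
```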

Choosing λ

  • We cannot use the training set to approximate the true risk when choosing $\lambda$, since minimizing training error would simply favor the least-regularized (most flexible) model.
  • We use the structural risk minimization principle, where $\hat{\lambda} = \operatorname*{argmin}_{\lambda} \hat{R}(f_{\lambda})$
    Where $\hat{R}(\cdot)$ is an estimate of the risk (e.g., a cross-validation estimate).
  • We model the uncertainty of our risk estimates via their standard error. The one-standard-error rule is a heuristic where we choose the simplest model (Occam’s razor) whose estimated risk is within one standard error of the best (lowest) estimate (see the sketch below).
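
A sketch of the one-standard-error rule, assuming per-fold risk estimates are already available for each candidate λ (the numbers are made up for illustration):

```python
# Sketch of the one-standard-error rule over assumed per-fold risk estimates.
import numpy as np

lambdas = np.array([0.01, 0.1, 1.0, 10.0])   # candidates; "simplest" = most regularized
fold_risks = np.array([                       # shape (num_lambdas, num_folds), illustrative
    [0.30, 0.34, 0.28, 0.33, 0.31],
    [0.27, 0.29, 0.26, 0.30, 0.28],
    [0.28, 0.29, 0.27, 0.29, 0.28],
    [0.40, 0.42, 0.39, 0.41, 0.40],
])

mean_risk = fold_risks.mean(axis=1)
std_err = fold_risks.std(axis=1, ddof=1) / np.sqrt(fold_risks.shape[1])

best = mean_risk.argmin()
threshold = mean_risk[best] + std_err[best]

# Simplest (most regularized) model whose mean risk is within one std. error of the best
chosen = max(i for i in range(len(lambdas)) if mean_risk[i] <= threshold)
print("best lambda:", lambdas[best], " chosen lambda:", lambdas[chosen])
```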

Cross Validation

  • Let $\mathcal{D}$ be the data, $N$ the number of data cases, $\mathcal{D}_k$ the $k$th data fold, and $\mathcal{D}_{-k}$ all the other data. Let $\mathcal{F}$ be a learning algorithm which outputs a parameter vector given a dataset and some model index $m$ (i.e., hyperparameters): $\hat{\theta}_m = \mathcal{F}(\mathcal{D}, m)$.

    Let $\mathcal{P}$ be a prediction function that takes an input $x$ and a parameter vector $\hat{\theta}$ and returns a prediction $\hat{y} = \mathcal{P}(x, \hat{\theta})$

    The fit-predict cycle is denoted as $f_m(x, \mathcal{D}) = \mathcal{P}(x, \mathcal{F}(\mathcal{D}, m))$

  • The $K$-fold CV estimate of the risk is defined as $R(m, \mathcal{D}, K) = \frac{1}{N}\sum_{k=1}^{K}\sum_{i \in \mathcal{D}_k} \ell\big(y_i,\, \mathcal{P}(x_i, \mathcal{F}(\mathcal{D}_{-k}, m))\big)$

  • We then call the fitting algorithm once per fold. Let $f_m^k(x) = \mathcal{P}(x, \mathcal{F}(\mathcal{D}_{-k}, m))$

    be the function that was trained on all data except for the test data in fold $k$. The estimate then becomes $R(m, \mathcal{D}, K) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(y_i,\, f_m^{k(i)}(x_i)\big)$

    Where $k(i)$ is the fold in which $i$ is used as test data.

  • In summary: divide the dataset into $K$ folds; each fold is used once as the validation set for a model trained on the remaining $K-1$ folds (see the sketch after this list).

    • The CV estimate of the risk is then the average loss across all folds.
  • If $K = N$, this is leave-one-out cross-validation.
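
A sketch of the $K$-fold procedure, reusing the assumed ridge-style fit/predict functions as illustrative stand-ins for $\mathcal{F}$ and $\mathcal{P}$:

```python
# Minimal K-fold cross-validation sketch. Fold splitting and the fit/predict
# helpers are illustrative, not a specific library API.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=120)

def fit(X, y, lam):
    """Learning algorithm F: returns a parameter vector given data and hyperparameter lam."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)

def predict(X, theta):
    """Prediction function P."""
    return X @ theta

K = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, K)

losses = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    theta = fit(X[train_idx], y[train_idx], lam=0.1)   # trained on D_{-k}
    preds = predict(X[test_idx], theta)                # evaluated on fold k
    losses.append(np.mean((y[test_idx] - preds) ** 2))

print("K-fold CV risk estimate:", np.mean(losses))
```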

Theoretical Upper Bounds

  • Murphy 6.5.1: For any data distribution $p^*$, and any dataset $\mathcal{D}$ of size $N$ drawn from $p^*$, the probability that our estimate of the error rate will be more than $\epsilon$ wrong in the worst case is upper bounded as follows: $P\big(\max_{h \in \mathcal{H}} |\operatorname{err}(h, \mathcal{D}) - \operatorname{err}(h, p^*)| > \epsilon\big) \le 2\,\dim(\mathcal{H})\, e^{-2N\epsilon^2}$

    Where $\mathcal{H}$ denotes the (finite) hypothesis space, so that $\dim(\mathcal{H}) = |\mathcal{H}|$ is its size.

    • The worst-case gap between training error and generalization error grows with a larger hypothesis space and shrinks with a larger dataset. This is the key insight of statistical learning theory.
    • For infinite-dimensional hypothesis spaces, we use the Vapnik–Chervonenkis (VC) dimension in place of $|\mathcal{H}|$
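
To get a feel for the bound, a small numeric evaluation with assumed values of $|\mathcal{H}|$, $\epsilon$, and $N$ (the values are illustrative only):

```python
# Worked numeric example of the bound above:
#   P(worst-case estimation error > eps) <= 2 * |H| * exp(-2 * N * eps**2)
import math

H_size = 1000   # size of the (finite) hypothesis space
eps = 0.05      # tolerated error in the estimate
for N in (500, 2000, 10000):
    bound = 2 * H_size * math.exp(-2 * N * eps ** 2)
    print(f"N={N:>6}: bound={bound:.4g}")
# The bound is vacuous (>1) for small N and becomes meaningful as N grows.
```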

Relation with Population Error

  • The population error, denoted $\epsilon(f)$, is the expected fraction of examples in an underlying population, characterized by probability density $p(x, y)$, for which the model disagrees with the ground truth: $\epsilon(f) = \mathbb{E}_{(x,y)\sim p}\,[\mathbb{1}(f(x) \neq y)]$

  • By the Central Limit Theorem, we can show that the empirical error approaches the population error at a rate of $O(1/\sqrt{n})$.

  • The asymptotic standard deviation of the estimate cannot be greater than $\sqrt{0.25/n} = \frac{1}{2\sqrt{n}}$ by the central limit theorem.

    • This follows by reparameterizing each error indicator as a Bernoulli variable with success probability $\epsilon$ (i.e., the probability that the model errs). Thus, we get the per-indicator variance $\epsilon(1 - \epsilon)$,
      Which is highest when $\epsilon = 0.5$, giving a worst-case variance of $0.25$.
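
A short worked version of that argument, written out as a sketch under the Bernoulli assumption above:

```latex
% Worked bound under the Bernoulli-error assumption (sketch)
\[
\hat{\epsilon}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad Z_i \sim \mathrm{Bernoulli}(\epsilon),
\]
\[
\operatorname{Var}[\hat{\epsilon}_n] = \frac{\epsilon(1-\epsilon)}{n} \le \frac{0.25}{n}
\quad\Longrightarrow\quad
\operatorname{sd}[\hat{\epsilon}_n] \le \sqrt{\frac{0.25}{n}} = \frac{1}{2\sqrt{n}}.
\]
```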

Confusion Matrices

  • See Here for all the entries in the confusion matrix (i.e., TPR, FPR).
  • The Receiver Operating Characteristic (ROC) curve shows the TPR and FPR as a function of the classification threshold $\tau$.
    • Good classification systems will hug the left axis and then the top axis.
    • The quality is quantified as Area Under the Curve (AUC). Higher = better.
    • The Equal Error Rate (EER) is the value at the threshold $\tau$ such that the false positive rate equals the false negative rate. Lower = better.
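
A sketch of tracing a ROC curve by sweeping the threshold $\tau$ over classifier scores and integrating with the trapezoid rule; the labels and scores are synthetic and the variable names are illustrative:

```python
# Sketch: ROC curve from threshold sweeping, AUC via the trapezoid rule.
import numpy as np

rng = np.random.default_rng(4)
labels = rng.integers(0, 2, size=500)           # ground-truth 0/1 labels
scores = labels + 0.8 * rng.normal(size=500)    # noisy classifier scores

thresholds = np.sort(np.unique(scores))[::-1]   # sweep from high to low
tprs, fprs = [], []
for tau in thresholds:
    preds = scores >= tau
    tp = np.sum(preds & (labels == 1))
    fp = np.sum(preds & (labels == 0))
    tprs.append(tp / np.sum(labels == 1))       # true positive rate
    fprs.append(fp / np.sum(labels == 0))       # false positive rate

tprs, fprs = np.array(tprs), np.array(fprs)
auc = np.sum(np.diff(fprs) * (tprs[1:] + tprs[:-1]) / 2)   # area under the curve
print(f"AUC ~ {auc:.3f}")
```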

Accuracy

  • Accuracy measures the proportion of correct predictions over all classifications performed

  • Although accuracy can give an idea for a model’s performance, it may not necessarily reflect whether or not the model has truly learned patterns in the data.

  • It also weights every correct classification equally, which implicitly assumes that the population proportion of each class is roughly equal; this does not always hold (see the class-imbalance example below)
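
A tiny illustration of that failure mode, assuming a 95/5 class split and a classifier that always predicts the majority class:

```python
# Accuracy under class imbalance: a trivial majority-class classifier looks
# "accurate" without learning anything. Data is illustrative.
import numpy as np

labels = np.array([0] * 95 + [1] * 5)   # 95% negatives, 5% positives
preds = np.zeros_like(labels)           # trivial classifier: always predict 0

accuracy = np.mean(preds == labels)
print(f"accuracy = {accuracy:.2f}")     # 0.95, despite never detecting a positive
```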

Precision-Recall

  • The precision measures what fraction of predicted positives are actually positive, i.e., $\mathrm{Precision} = \frac{TP}{TP + FP}$

  • The recall measures what fraction of the ground-truth positive samples are detected, i.e., $\mathrm{Recall} = \frac{TP}{TP + FN}$

  • The precision-recall curve can be plotted as a function of the threshold $\tau$. Better classifiers hug the top right.

  • The precision at $k$ is the precision at a fixed recall level, obtained by considering only the top-$k$ ranked predictions; averaging precision over recall levels gives the average precision.

  • The F1 score combines precision and recall. It is the harmonic mean of the two: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

    • For multi-class classification, we can generalize the score as a macro-averaged $F_1$, defined as the average of the per-class $F_1$ scores obtained by identifying each class against all other classes. That is, $F_1^{\text{macro}} = \frac{1}{C}\sum_{c=1}^{C} F_1(c)$
    • Alternatively, we can use a micro-averaged score where we pool the counts from each class’s contingency table (that is, by summing the $TP$, $FP$, and $FN$ counts across all classes) before computing a single score (see the sketch below)
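
A sketch computing precision, recall, F1, and the macro/micro averages from assumed per-class contingency counts (the counts and helper names are illustrative):

```python
# Precision, recall, F1, and macro/micro averaging from raw per-class counts.
import numpy as np

# Per-class (TP, FP, FN) counts for a 3-class problem (illustrative)
counts = {"A": (40, 10, 5), "B": (25, 5, 20), "C": (10, 2, 3)}

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro-average: mean of per-class F1 scores
macro_f1 = np.mean([f1(*c) for c in counts.values()])

# Micro-average: pool the counts across classes, then compute a single F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```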

Multiple Hypothesis Testing

  • Problem: we need to make classification decisions for many hypotheses simultaneously, which inflates the number of false positives if each decision is thresholded naively.

  • We can minimize the false discovery rate, defined as $FDR(\tau, \mathcal{D}) \triangleq \frac{1}{N(\tau)} \sum_i (1 - p_i)\,\mathbb{1}(p_i > \tau)$

    Where $N(\tau) = \sum_i \mathbb{1}(p_i > \tau)$ is the number of positively classified items and

    Where $p_i$ is the belief that hypothesis $i$ is true (its posterior probability given the data).

  • The Direct Posterior Probability approach involves modifying the decision threshold $\tau$ until the estimated FDR falls below a desired level.
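
A sketch of that idea, using made-up posterior probabilities and a simple scan over thresholds (the helper and target value are assumptions for illustration):

```python
# Sketch of the direct-posterior-probability idea: lower the threshold tau
# as far as possible while the estimated FDR stays below a target.
import numpy as np

rng = np.random.default_rng(5)
p = rng.uniform(size=200)   # p_i: belief that hypothesis i is true (made up)

def fdr(p, tau):
    discovered = p > tau
    if discovered.sum() == 0:
        return 0.0
    return np.sum((1 - p)[discovered]) / discovered.sum()

target = 0.05
taus = np.linspace(0, 1, 101)
# Smallest tau (most discoveries) whose estimated FDR is still below the target
tau_star = min(t for t in taus if fdr(p, t) <= target)
print(f"tau* = {tau_star:.2f}, discoveries = {(p > tau_star).sum()}, FDR = {fdr(p, tau_star):.3f}")
```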

Underfitting

  • Underfitting pertains to a phenomenon where a model cannot capture the underlying patterns within the data.

  • The model is incapable of capturing the complexity of the problem. Whether we are training or testing, there is no difference in model performance.

  • More formally, we can use the generalization and training error.

    If the gap $R(f) - \hat{R}_{\text{train}}(f)$ is small while $R(f)$ and $\hat{R}_{\text{train}}(f)$ themselves are substantially high, the model has underfitted

Overfitting

  • Overfitting pertains to a phenomenon where the model corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably.

  • In other words, the model does not generalize well: it has effectively memorized the training data rather than learning the underlying patterns.

  • If the generalization gap $R(f) - \hat{R}_{\text{train}}(f)$ is large by a significant amount, then we may have an overfitted model.

  • Remedying overfitting typically requires either more training data or regularization techniques.

Links