Generalization Error
- The generalization error of $f$ with the loss function $\ell$ is denoted as
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p^*}\left[\ell\left(y, f(\mathbf{x})\right)\right]$$
- This serves as a metric to determine how a model performs on unseen data. A high generalization error may indicate that the model is overfitting (see below for more).
- The generalization error is typically not computed directly since the true probability distribution is unknown. Instead, we evaluate over a set of unseen samples, in the same vein as the training error but using held-out data.
- We say that the model generalizes if the empirical risk converges to the generalization error, namely by asserting that
$$\lim_{N \to \infty} \hat{R}(f, \mathcal{D}_N) = R(f)$$
- The empirical error, denoted $\hat{\epsilon}(f, \mathcal{D})$, is defined as the proportion of incorrectly classified samples to the size of the dataset:
$$\hat{\epsilon}(f, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}\left(f(\mathbf{x}_n) \neq y_n\right)$$
- The training error of the function $f$ using the loss function $\ell$ is denoted $\hat{R}(f, \mathcal{D}_{\text{train}})$ and is defined as
$$\hat{R}(f, \mathcal{D}_{\text{train}}) = \frac{1}{N_{\text{train}}} \sum_{n=1}^{N_{\text{train}}} \ell\left(y_n, f(\mathbf{x}_n)\right)$$
The generalization gap is defined as the difference between the training error and the generalization error or any estimate thereof.
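As an illustrative sketch (not from the notes; a trivial threshold classifier on synthetic data with the 0-1 loss), the quantities above can be estimated as follows:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Empirical (0-1) error: fraction of misclassified samples."""
    return np.mean(y_true != y_pred)

# Synthetic 1-D data: the label is 1 when x > 0, with 10% label noise.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x > 0).astype(int) ^ (rng.random(1000) < 0.1).astype(int)

# Split into training data and held-out (unseen) data.
x_train, y_train = x[:700], y[:700]
x_test, y_test = x[700:], y[700:]

# A trivial "model": threshold at the training-set mean.
threshold = x_train.mean()
def predict(xs):
    return (xs > threshold).astype(int)

train_error = zero_one_loss(y_train, predict(x_train))  # training error
test_error = zero_one_loss(y_test, predict(x_test))     # estimate of the generalization error
print("generalization gap estimate:", test_error - train_error)
```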
Empirical Risk Minimization
- We reframe the problem as follows. Let $\ell(y, \hat{y})$ be a loss function, with $y$ being the true response and $\hat{y} = f(\mathbf{x})$ the prediction given the input $\mathbf{x}$. Thus, we are predicting observable quantities.
- The risk is now defined as
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p^*}\left[\ell\left(y, f(\mathbf{x})\right)\right]$$
where $p^*$ represents the true distribution, approximated by real data through the empirical distribution. For continuous distributions, we can define something similar by replacing the summations with integrals.
- The risk identity (for the continuous case) is defined as
$$R(f) = \iint \ell\left(y, f(\mathbf{x})\right)\, p^*(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
- The empirical risk is then defined as
$$\hat{R}(f, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^{N} \ell\left(y_n, f(\mathbf{x}_n)\right)$$
That is, it is the risk when we use the empirical distribution instead of the true distribution (in other words, the average loss over the dataset).
- The goal, then, becomes to minimize the empirical risk.
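A minimal sketch of empirical risk minimization, assuming a squared-error loss and a linear model on synthetic data (the learning rate and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data generated from a "true" linear relationship plus noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def empirical_risk(w, X, y):
    """Average squared-error loss over the dataset."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Minimize the empirical risk with plain gradient descent.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the empirical risk
    w -= lr * grad

print("estimated weights:", w)
print("empirical risk:", empirical_risk(w, X, y))
```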
Regularization
- Minimizing the empirical risk will usually result in overfitting. This is because of our assumption that the true distribution is equal to the empirical distribution (which is not necessarily true due to sampling error).
- Hence we regularize by adding a complexity penalty $C(\boldsymbol{\theta})$ weighted with $\lambda \geq 0$:
$$\hat{R}_{\lambda}(f, \mathcal{D}) = \hat{R}(f, \mathcal{D}) + \lambda\, C(\boldsymbol{\theta})$$
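Continuing in the same spirit, a minimal sketch of the regularized objective, assuming a squared $\ell_2$ penalty $C(\boldsymbol{\theta}) = \lVert \mathbf{w} \rVert^2$ (in which case the minimizer is available in closed form):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N) * ||X w - y||^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Larger lambda shrinks the weights toward zero (a simpler model).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + 0.5 * rng.normal(size=50)
for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```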
Choosing $\lambda$
- We cannot use the training set to approximate the true risk.
- We use the structural risk minimization principle, where
$$\hat{\lambda} = \arg\min_{\lambda} \hat{R}(f_\lambda)$$
and $\hat{R}(f_\lambda)$ is an estimate of the risk (e.g., from a validation set or cross validation).
- We model the uncertainty of our estimates via the standard error. The one-standard-error rule is a heuristic where we choose the simplest model (Occam's razor) whose estimated risk is no more than one standard error above that of the best model.
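A minimal sketch of the one-standard-error rule, assuming we already have per-fold CV losses for each candidate $\lambda$ (the numbers below are made up) and that candidates are ordered from simplest, i.e. most regularized, to most complex:

```python
import numpy as np

# Per-fold CV losses for each candidate lambda, ordered from most regularized
# (simplest model) to least regularized (most complex). Values are illustrative.
lambdas = np.array([10.0, 1.0, 0.1, 0.01])
cv_losses = np.array([
    [0.40, 0.42, 0.39, 0.41, 0.43],   # lambda = 10.0
    [0.30, 0.29, 0.31, 0.30, 0.30],   # lambda = 1.0
    [0.26, 0.31, 0.27, 0.33, 0.28],   # lambda = 0.1
    [0.29, 0.33, 0.28, 0.32, 0.30],   # lambda = 0.01
])

means = cv_losses.mean(axis=1)
ses = cv_losses.std(axis=1, ddof=1) / np.sqrt(cv_losses.shape[1])  # standard errors

best = means.argmin()
threshold = means[best] + ses[best]
# Simplest (most regularized) model whose mean CV loss is within one SE of the best.
chosen = next(i for i in range(len(lambdas)) if means[i] <= threshold)
print("best lambda:", lambdas[best], "chosen lambda:", lambdas[chosen])
```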
Cross Validation
- Let $\mathcal{D}$ be the data and $N$ the number of data cases. Let $\mathcal{D}_k$ be the $k$th data fold, and denote all the other data by $\mathcal{D}_{-k}$. Let $\mathcal{F}(\mathcal{D}, m)$ be a learning algorithm which outputs a parameter vector $\hat{\boldsymbol{\theta}}$ given a dataset $\mathcal{D}$ and some model index $m$ (i.e., hyperparameters). Let $\mathcal{P}(\mathbf{x}, \hat{\boldsymbol{\theta}})$ be a prediction function that takes an input and a parameter vector and returns a prediction. The fit-predict cycle is denoted as
$$f_m(\mathbf{x}) = \mathcal{P}\left(\mathbf{x}, \mathcal{F}(\mathcal{D}, m)\right)$$
- The $K$-fold CV estimate of the risk is defined as
$$R(m, \mathcal{D}, K) \triangleq \frac{1}{N} \sum_{k=1}^{K} \sum_{n \in \mathcal{D}_k} \ell\left(y_n, \mathcal{P}\left(\mathbf{x}_n, \mathcal{F}(\mathcal{D}_{-k}, m)\right)\right)$$
- We then call the fitting algorithm once per fold. Let $f_m^{-k}(\mathbf{x}) = \mathcal{P}\left(\mathbf{x}, \mathcal{F}(\mathcal{D}_{-k}, m)\right)$ be the function that was trained on all data except for the test data in fold $k$. The estimate then becomes
$$R(m, \mathcal{D}, K) = \frac{1}{N} \sum_{n=1}^{N} \ell\left(y_n, f_m^{-k(n)}(\mathbf{x}_n)\right)$$
where $k(n)$ is the fold in which $n$ is used as test data.
- In summary: divide the dataset into $K$ folds; each fold will be used as a validation set for one of $K$ models not trained on this data.
- The empirical risk is then the average loss across all $K$ folds.
- If $K = N$, this is leave-one-out cross validation.
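A minimal sketch of $K$-fold cross validation, with a nearest-class-mean classifier standing in for the fit function $\mathcal{F}$ and predict function $\mathcal{P}$, and the 0-1 loss standing in for $\ell$ (all illustrative choices):

```python
import numpy as np

def fit(X, y):
    """F: learning algorithm; returns a parameter vector (the class means)."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(x, theta):
    """P: prediction function; predicts the nearest class mean."""
    return int(np.argmin(np.linalg.norm(theta - x, axis=1)))

def kfold_cv_risk(X, y, K, loss=lambda yt, yp: float(yt != yp)):
    """K-fold CV estimate of the risk: average loss over all N held-out predictions."""
    N = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(N), K)
    total = 0.0
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.setdiff1d(np.arange(N), test_idx)
        theta = fit(X[train_idx], y[train_idx])          # trained on D_{-k}
        total += sum(loss(y[n], predict(X[n], theta)) for n in test_idx)
    return total / N

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=+1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

print("5-fold CV risk:", kfold_cv_risk(X, y, K=5))
print("LOOCV risk:", kfold_cv_risk(X, y, K=len(y)))   # K = N
```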
Theoretical Upper Bounds
- Murphy 6.5.1: For any data distribution $p^*$, and any dataset $\mathcal{D}$ of size $N$ drawn from $p^*$, the probability that our estimate of the error rate will be more than $\epsilon$ wrong, in the worst case, is upper bounded as follows:
$$P\left(\max_{f \in \mathcal{H}} \left|\hat{\epsilon}(f, \mathcal{D}) - \epsilon(f, p^*)\right| > \epsilon\right) \leq 2\, |\mathcal{H}|\, e^{-2 N \epsilon^2}$$
where $\mathcal{H}$ denotes the (finite) hypothesis space, so that $f \in \mathcal{H}$.
- The bound on the generalization gap grows with a larger hypothesis space and shrinks with a larger dataset. This is the key insight of statistical learning theory.
- For hypothesis spaces with infinitely many hypotheses, we use the Vapnik-Chervonenkis (VC) dimension in place of $|\mathcal{H}|$.
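As a worked example, solving $2\,|\mathcal{H}|\,e^{-2N\epsilon^2} \leq \delta$ for $N$ gives $N \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$; the numbers below are arbitrary:

```python
import math

def samples_needed(hypothesis_space_size, epsilon, delta):
    """Smallest N such that 2*|H| * exp(-2 * N * epsilon^2) <= delta."""
    return math.ceil(math.log(2 * hypothesis_space_size / delta) / (2 * epsilon ** 2))

# e.g. |H| = 10^6 hypotheses, worst-case deviation at most 0.05, failure probability 0.01
print(samples_needed(10**6, epsilon=0.05, delta=0.01))  # -> 3823 samples
```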
Relation with Population Error
- The population error, denoted $\epsilon$, is the expected fraction of examples in an underlying population, characterized by the probability density $p(\mathbf{x}, y)$, for which the model disagrees with the ground truth:
$$\epsilon(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p}\left[\mathbb{I}\left(f(\mathbf{x}) \neq y\right)\right]$$
- By the Central Limit Theorem, we can show that the empirical error approaches the population error at a rate of $O(1/\sqrt{n})$.
- The asymptotic standard deviation of the estimate cannot be greater than $0.5/\sqrt{n}$, by the Central Limit Theorem.
- This follows by treating each disagreement indicator as a Bernoulli random variable with success probability $\epsilon$ (i.e., $\mathbb{I}(f(\mathbf{x}) \neq y) \sim \mathrm{Bernoulli}(\epsilon)$). Thus, we get the variance $\epsilon(1 - \epsilon)$, which is highest when $\epsilon = 0.5$, so the standard deviation of the empirical error is at most $\sqrt{0.25/n} = 0.5/\sqrt{n}$.
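A quick numerical check of this claim (the population error and sample sizes below are made up): simulating many test sets of size $n$, the spread of the empirical error stays below $0.5/\sqrt{n}$.

```python
import numpy as np

n = 10_000
true_error = 0.3   # population error of a hypothetical classifier

# Simulate many test sets: each empirical error is a mean of n Bernoulli draws.
rng = np.random.default_rng(0)
empirical_errors = rng.binomial(n, true_error, size=5_000) / n

print("observed std of empirical error:", empirical_errors.std())
print("CLT prediction sqrt(eps*(1-eps)/n):", np.sqrt(true_error * (1 - true_error) / n))
print("worst-case bound 0.5/sqrt(n):", 0.5 / np.sqrt(n))
```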
Confusion Matrices
- See Here for all the entries in the confusion matrix (i.e., TPR, FPR).
- The Receiver Operating Characteristic (ROC) curve shows the $TPR$ and $FPR$ as a function of the classification boundary $\tau$.
- Good classification systems will hug the left axis and then the top axis.
- The quality is quantified as Area Under the Curve (AUC). Higher = better.
- The Equal Error Rate (EER) is the value at which $FPR = FNR$. Lower = better.
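A minimal sketch computing ROC points, the AUC (via the trapezoid rule), and an approximate EER directly from classifier scores (the scores and labels below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives.
y_true = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

# Sweep the threshold tau over all observed scores (from high to low).
thresholds = np.sort(np.unique(scores))[::-1]
tpr, fpr = [], []
for tau in thresholds:
    y_pred = scores >= tau
    tpr.append(np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1))  # true positive rate
    fpr.append(np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0))  # false positive rate
tpr, fpr = np.array(tpr), np.array(fpr)

auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # area under the ROC curve (trapezoid rule)
fnr = 1 - tpr
eer_idx = np.argmin(np.abs(fpr - fnr))                 # point where FPR is closest to FNR
print("AUC:", auc, "approx. EER:", (fpr[eer_idx] + fnr[eer_idx]) / 2)
```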
Accuracy
- Accuracy measures the proportion of correct predictions over all classifications performed, i.e. $\frac{TP + TN}{TP + TN + FP + FN}$ in the binary case.
- Although accuracy can give an idea of a model's performance, it may not necessarily reflect whether or not the model has truly learned patterns in the data.
- It is also skewed by the fact that we weight each correct classification equally, which assumes that the population proportion of each class is roughly equal, which does not always hold.
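A small illustration of that failure mode (the class proportions are made up): a model that always predicts the majority class attains high accuracy while detecting nothing.

```python
import numpy as np

# Imbalanced labels: 95% negative, 5% positive.
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros_like(y_true)          # "model" that always predicts the majority class

accuracy = np.mean(y_true == y_pred)
recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
print(accuracy, recall)                  # 0.95 accuracy, but 0.0 recall
```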
Precision-Recall
- The precision measures what fraction of the predicted positive labels are actually positive (i.e., $P = \frac{TP}{TP + FP}$).
- The recall measures what fraction of the ground-truth positive samples are detected (i.e., $R = \frac{TP}{TP + FN}$).
- The precision-recall curve can be plotted as a function of the threshold $\tau$. Better classifiers hug the top right.
- The precision at $k$ is the precision at a fixed recall level (the top $k$ returned items); averaging this over recall levels gives the average precision.
- The F1 score combines precision and recall. It is the harmonic mean of the two:
$$F_1 = \frac{2 P R}{P + R}$$
- For multi-class classification, we can generalize the $F_1$ score as a macro-averaged score, defined as the average of the per-class scores where we identify each class against all other classes. That is,
$$F_1^{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} F_1(c)$$
- Alternatively, we can use a micro-averaged $F_1$ score where we pool all counts from each class's contingency table (that is, by pooling all $TP_c$, $FP_c$, and $FN_c$ across all classes).
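A minimal sketch of precision, recall, and the macro- and micro-averaged $F_1$ scores from one-vs-rest contingency counts (the labels below are illustrative):

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from contingency counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_micro_f1(y_true, y_pred, classes):
    counts = []
    for c in classes:   # one-vs-rest counts per class
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        counts.append((tp, fp, fn))
    macro = np.mean([f1_from_counts(*c) for c in counts])   # average of per-class F1
    micro = f1_from_counts(*np.sum(counts, axis=0))          # F1 on pooled counts
    return macro, micro

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 2])
print(macro_micro_f1(y_true, y_pred, classes=[0, 1, 2]))
```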
Multiple Hypothesis Testing
- Problem: we need to classify multiple hypotheses simultaneously.
- We can minimize the false discovery rate, defined as
$$FDR(\tau, \mathcal{D}) \triangleq \frac{1}{N(\tau)} \sum_{i} (1 - p_i)\, \mathbb{I}(p_i > \tau)$$
where $N(\tau) = \sum_i \mathbb{I}(p_i > \tau)$ is the number of positively classified items and $p_i$ is the (posterior) belief that hypothesis $i$ is true.
- The Direct Posterior Probability approach involves modifying the decision threshold $\tau$ (so that the FDR stays below a desired level).
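A minimal sketch of this idea under the posterior-probability formulation above (the posterior values and the 10% target are made up): compute the posterior expected FDR at each candidate threshold $\tau$ and keep the most permissive one that stays below the target.

```python
import numpy as np

def posterior_fdr(posteriors, tau):
    """Posterior expected false discovery rate at threshold tau:
    mean of (1 - p_i) over the hypotheses declared positive (p_i > tau)."""
    discovered = posteriors > tau
    if not discovered.any():
        return 0.0
    return np.mean(1.0 - posteriors[discovered])

# Illustrative posterior probabilities that each hypothesis is true.
posteriors = np.array([0.99, 0.97, 0.95, 0.90, 0.80, 0.60, 0.40, 0.10])

# Choose the smallest (most permissive) threshold keeping the FDR below 10%.
target = 0.10
for tau in sorted(posteriors):
    if posterior_fdr(posteriors, tau - 1e-9) <= target:
        print("chosen tau:", tau, "FDR:", posterior_fdr(posteriors, tau - 1e-9))
        break
```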
Underfitting
- Underfitting pertains to a phenomenon where a model cannot capture the underlying patterns within the data.
- The model is incapable of capturing the complexity of the problem. Whether we are training or testing, there is no difference in model performance.
- More formally, we can use the generalization and training error: if $R(f) - \hat{R}(f, \mathcal{D}_{\text{train}})$ is small while $R(f)$ and $\hat{R}(f, \mathcal{D}_{\text{train}})$ themselves are substantially high, the model has underfitted.
Overfitting
- Overfitting pertains to a phenomenon where the model corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
- In other words, the model cannot generalize well; it has only learnt to recall the data rather than to recognize patterns.
- If $R(f) - \hat{R}(f, \mathcal{D}_{\text{train}})$ is large by a significant amount, then we may have an overfitted model.
- Addressing overfitting typically necessitates either more training data or regularization techniques.