• One common metric is the perplexity. It is the exponential of the cross entropy loss averaged over all tokens of a sequence. More formally, for a sequence $x_1, \dots, x_N$,

    $$\text{PPL}(x_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{1:i-1})\right)$$

    The language model provides us with the conditional probability terms $p(x_i \mid x_{1:i-1})$ in the summation.

    • It is based on assessing quality by determining how surprising a piece of text is. More surprising = higher perplexity = worse.
    • From an Information Theory perspective, the cross entropy corresponds to the average number of bits we would need to encode each token of the sequence, i.e., how hard it is to predict what comes next.
    • It can be understood as the geometric mean of the number of effective choices the model faces when deciding which token to pick next.
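
    A minimal sketch of this computation, assuming we already have the model's per-token conditional probabilities (the helper name and interface are illustrative, not from any particular library):

    ```python
    import math

    def perplexity(token_probs):
        """Perplexity of a sequence given p(x_i | x_<i) for each token."""
        n = len(token_probs)
        # Average negative log-likelihood (cross entropy) over the sequence.
        avg_nll = -sum(math.log(p) for p in token_probs) / n
        return math.exp(avg_nll)

    # A model that assigns probability 0.25 to every token has perplexity 4:
    # an "effective branching factor" of 4 choices per step.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ~4.0
    ```
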
  • The Brevity Penalty punishes candidate strings that are too short.

    Let $\hat S = (\hat y^{(1)}, \dots, \hat y^{(M)})$ be a set of candidate strings and $S_i = (y^{(i,1)}, \dots, y^{(i,N_i)})$ be the set of reference strings for the $i$-th candidate string. Let $S = (S_1, \dots, S_M)$.

    The brevity penalty is defined as

    $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

    We define $c$ and $r$ as follows.

    $c$ is the length of the candidate corpus. That is,

    $$c = \sum_{i=1}^{M} \left|\hat y^{(i)}\right|$$

    where $|\hat y^{(i)}|$ is the length of the $i$-th candidate string.

    $r$ is the effective reference corpus length, defined as

    $$r = \sum_{i=1}^{M} \left|y^{(i,j_i)}\right|, \qquad j_i = \arg\min_{j} \,\Big|\, |y^{(i,j)}| - |\hat y^{(i)}| \,\Big|$$

    That is, for each entry $\hat y^{(i)}$ in the candidate corpus, we find the sentence from $S_i$ (the set of corresponding reference strings) whose length is as close as possible to $|\hat y^{(i)}|$, and sum up those lengths.
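
    A sketch of the brevity penalty under these definitions (whitespace tokenisation and the function name are assumptions, not part of the notes):

    ```python
    import math

    def brevity_penalty(candidates, references):
        """candidates: list of candidate strings.
        references[i]: list of reference strings for candidates[i]."""
        c = sum(len(cand.split()) for cand in candidates)  # candidate corpus length
        r = 0
        for cand, refs in zip(candidates, references):
            cand_len = len(cand.split())
            # Reference whose length is closest to this candidate's length.
            closest = min(refs, key=lambda ref: abs(len(ref.split()) - cand_len))
            r += len(closest.split())
        return 1.0 if c > r else math.exp(1 - r / c)
    ```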

  • The $n$-gram Precision tells us how many instances of an $n$-gram from the candidate string appear in the reference string.1

    However, note that it does not take into account the fact that the candidate string may contain more copies of an $n$-gram than the reference string does.

    It is computed as follows. Let $\hat y$ be a candidate string, $y$ be a reference string, $G_n(s)$ be the number of $n$-grams in a string $s$, and $\delta(s, y)$ be a modification of the Kronecker Delta function that is $1$ if $s$ is a substring of $y$ and $0$ otherwise.

    The $n$-gram precision is defined as

    $$p_n(\hat y; y) = \frac{1}{G_n(\hat y)} \sum_{s} \delta(s, y)$$

    where the sum runs over every $n$-gram occurrence $s$ in $\hat y$ (duplicates included).
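
    A sketch of this (unmodified) $n$-gram precision, again assuming whitespace tokenisation:

    ```python
    def ngrams(tokens, n):
        """All n-gram occurrences (with duplicates) in a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_precision(candidate, reference, n):
        cand_grams = ngrams(candidate.split(), n)
        ref_grams = set(ngrams(reference.split(), n))
        # delta(s, y): 1 if the n-gram s occurs anywhere in the reference.
        hits = sum(1 for g in cand_grams if g in ref_grams)
        return hits / len(cand_grams)

    # Degenerate example: repeating a reference word still yields a perfect score,
    # which is exactly the flaw the modified precision below fixes.
    print(ngram_precision("the the the the", "the cat sat on the mat", 1))  # 1.0
    ```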

  • The Modified $n$-gram precision is defined similarly to the regular $n$-gram precision.

    Let $\hat y$ be a candidate string, $y$ be a reference string, $G_n(s)$ be the number of $n$-grams in a string $s$, and $C(s, y)$ be the number of times $s$ occurs as a substring of $y$.

    The modified $n$-gram precision is defined as

    $$p_n(\hat y; y) = \frac{1}{G_n(\hat y)} \sum_{s} \min\left(C(s, \hat y),\, C(s, y)\right)$$

    where the sum runs over the distinct $n$-grams $s$ appearing in $\hat y$.

    • If $p_n(\hat y; y) = 0$, then none of the $n$-gram substrings in the candidate appear in the reference.

    • If $p_n(\hat y; y) = 1$, then every $n$-gram in the candidate appears in the reference at least as many times as it appears in the candidate.

    • Intuition: Begin by measuring how many instances of the $n$-grams in the candidate appear in the reference string. This is done in the expression

      $$\sum_{s} C(s, y)$$

      where $s$ ranges over the distinct $n$-grams of the candidate.

      If we encounter duplicates (i.e., $C(s, y) > 1$), we count each duplicate.

      However, we need to keep in mind the case where the candidate contains fewer copies of an $n$-gram than the reference does (for example, because the candidate string is too short); otherwise this count could exceed the number of $n$-grams in the candidate. This is rectified by adding a minimum function: $\min\left(C(s, \hat y),\, C(s, y)\right)$. Essentially, what this does is determine the number of common substrings (allowing duplicates) that appear in both the candidate and the reference.

      Finally, we obtain $p_n$ simply by normalizing the expression above: we divide by the denominator $G_n(\hat y)$, the total number of $n$-grams in the candidate.

    • If we have an entire corpus of candidates $\hat S = (\hat y^{(1)}, \dots, \hat y^{(M)})$ and reference corpora $S = (S_1, \dots, S_M)$, where $S_i$ is the set of reference strings for the $i$-th candidate string, then we can extend the definition to apply to the corpus:

      $$p_n(\hat S; S) = \frac{\sum_{i=1}^{M} \sum_{s} \min\left(C(s, \hat y^{(i)}),\, \max_{y \in S_i} C(s, y)\right)}{\sum_{i=1}^{M} G_n(\hat y^{(i)})}$$

      where the inner sum runs over the distinct $n$-grams $s$ of $\hat y^{(i)}$, and each count is clipped by the reference in $S_i$ that contains $s$ the most times.
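
    A sketch of the corpus-level modified precision under these definitions (whitespace tokenisation and the helper names are assumptions; `Counter` comes from the standard library):

    ```python
    from collections import Counter

    def ngram_counts(tokens, n):
        """Counter mapping each n-gram to the number of times it occurs."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def modified_precision(candidates, references, n):
        """candidates: list of strings; references[i]: list of reference strings for candidates[i]."""
        clipped_hits, total = 0, 0
        for cand, refs in zip(candidates, references):
            cand_counts = ngram_counts(cand.split(), n)
            ref_counts = [ngram_counts(ref.split(), n) for ref in refs]
            for gram, count in cand_counts.items():
                # Clip by the most times any single reference contains this n-gram.
                clipped_hits += min(count, max(rc[gram] for rc in ref_counts))
            total += sum(cand_counts.values())  # G_n of this candidate
        return clipped_hits / total

    # The degenerate candidate from before now scores 2/4 instead of 1.0.
    print(modified_precision(["the the the the"], [["the cat sat on the mat"]], 1))  # 0.5
    ```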

  • The Bilingual Evaluation Understudy (BLEU) is a performance metric that evaluates how likely an $n$-gram from a candidate string is to appear in the target (reference) sequences, combined with a penalty for candidates that are too short.

    Let $\hat S = (\hat y^{(1)}, \dots, \hat y^{(M)})$ be a set of candidate strings and $S_i = (y^{(i,1)}, \dots, y^{(i,N_i)})$ be the set of reference strings for the $i$-th candidate string. Let $S = (S_1, \dots, S_M)$. Let $p_n(\hat S; S)$ be the modified $n$-gram precision, $BP$ denote the brevity penalty, and $w = (w_1, \dots, w_N)$ be a weighting vector whose entries form a probability distribution (i.e., $w_n \ge 0$ and the sum of all weights is $1$). This is chosen by us.

    The Bilingual Evaluation Understudy is defined as

    $$\mathrm{BLEU}_w(\hat S; S) = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n(\hat S; S)\right)$$
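
    Putting the pieces together, a sketch of BLEU in terms of the `brevity_penalty` and `modified_precision` helpers sketched above (uniform weights $w_n = 1/N$ with $N = 4$ are a common but not mandatory choice):

    ```python
    import math

    def bleu(candidates, references, max_n=4):
        """BLEU with uniform weights over 1-grams .. max_n-grams.
        Relies on brevity_penalty and modified_precision from the sketches above."""
        weights = [1.0 / max_n] * max_n
        weighted_log_precisions = []
        for n, w in enumerate(weights, start=1):
            p_n = modified_precision(candidates, references, n)
            if p_n == 0:
                return 0.0  # log(0) is undefined; the weighted geometric mean collapses to 0
            weighted_log_precisions.append(w * math.log(p_n))
        return brevity_penalty(candidates, references) * math.exp(sum(weighted_log_precisions))
    ```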

Footnotes

  1. Analogous to precision.