- A language model is a model that estimates the joint probability of a sequence, i.e., how likely that sequence is under some distribution.
- The modern approach for language models is autoregressive (also called causal language modeling). It models prediction by decomposing the joint probability of a sequence using the chain rule of probability:
$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$
We may also simplify the above if we make use of the Markov property, conditioning only on the most recent $n-1$ tokens:
$$P(x_t \mid x_1, \dots, x_{t-1}) \approx P(x_t \mid x_{t-n+1}, \dots, x_{t-1})$$
The goal is then to predict $x_t$ given the previous tokens in the sequence. We refer to the previous tokens as the context (a small sketch follows).
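A minimal sketch of this factorization under a first-order Markov assumption, using a made-up bigram table in place of a learned model (the names `bigram_probs` and `sequence_log_prob` are illustrative, not from any library):

```python
import math

# Toy conditional probabilities P(x_t | x_{t-1}) for a tiny vocabulary.
# "<s>" marks the start of the sequence; the numbers are made up for illustration.
bigram_probs = {
    ("<s>", "the"): 0.5, ("<s>", "a"): 0.5,
    ("the", "cat"): 0.4, ("the", "dog"): 0.6,
    ("a", "cat"): 0.7,   ("a", "dog"): 0.3,
    ("cat", "sat"): 1.0, ("dog", "sat"): 1.0,
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x_1, ..., x_T) = sum_t log P(x_t | context_t)."""
    context = "<s>"
    total = 0.0
    for token in tokens:
        total += math.log(bigram_probs[(context, token)])
        context = token  # under the Markov assumption the context is just the previous token
    return total

print(sequence_log_prob(["the", "cat", "sat"]))  # log(0.5) + log(0.4) + log(1.0)
```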
Pipeline
- In general, the pipeline for language data is:
- Clean the data (Optional depending on problem)
    - Remove stop words, i.e., commonly used words that carry little meaning.
    - Normalize the data — everything should be in a consistent format.
- Sequence Partitioning - we partition a corpus into smaller subsequences. Let $T$ be the size of the corpus and $n$ the size of each $n$-gram. At the beginning of an epoch, discard the first $d$ tokens, where $d \in \{0, \dots, n-1\}$ is uniformly sampled at random. The rest of the sequence is then partitioned into $m = \lfloor (T - d) / n \rfloor$ subsequences of $n$ tokens each (see the sketch after this list).
- Load inputs into memory
- Tokenization — the process of splitting a string into a sequence of small, indivisible units called tokens.
- Build a vocabulary to associate each vocabulary element with a numerical index.
- Convert the text into a sequence of numerical indices.
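A minimal sketch of the tokenization, vocabulary, numericalization, and random sequence-partitioning steps above, assuming simple whitespace tokenization on a toy corpus (the names `vocab` and `partition` are illustrative):

```python
import random

# Load the corpus and tokenize by splitting on whitespace.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Build a vocabulary associating each token with a numerical index.
vocab = {token: idx for idx, token in enumerate(sorted(set(corpus)))}

# Convert the text into a sequence of numerical indices.
indices = [vocab[token] for token in corpus]

def partition(seq, n):
    """Discard the first d tokens (d sampled uniformly from 0..n-1), then split
    the rest into m = (T - d) // n subsequences of n tokens each."""
    d = random.randint(0, n - 1)
    seq = seq[d:]
    m = len(seq) // n
    return [seq[i * n:(i + 1) * n] for i in range(m)]

print(partition(indices, n=3))
```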
Basic Models
Frequency-Based Estimation
- Language models can be used to estimate the conditional probability of any given word from its relative frequency in the training set. Given words $x$, $x'$, and $x''$, we have
$$\hat{P}(x) = \frac{n(x)}{n}, \qquad \hat{P}(x' \mid x) = \frac{n(x, x')}{n(x)}, \qquad \hat{P}(x'' \mid x, x') = \frac{n(x, x', x'')}{n(x, x')}$$
where $n(x)$ and $n(x, x')$ denote the number of occurrences of $x$, and of $x$ followed by $x'$ in that order, and $n$ is the total number of words in the corpus.
- Due to the number of possible arrangements an $n$-gram may have, many $n$-grams never appear in the training set, so the probability above may come out as zero.
- One technique to address this is Laplace smoothing: we add a small constant to all counts so that, for example,
$$\hat{P}(x' \mid x) = \frac{n(x, x') + \epsilon}{n(x) + \epsilon \, |\mathcal{V}|}$$
where $|\mathcal{V}|$ is the vocabulary size (see the sketch at the end of this section).
This has a few drawbacks:
- Many $n$-grams occur very rarely, which makes their estimates unreliable even with smoothing.
- We need to store all the counts, which can get unwieldy for large vocabularies.
- This ignores the meaning of words.
- Long word sequences are likely to be rare or entirely unseen, so count-based estimates break down for them.
- Transformer Model - more on the transformer model
- Large Language Model - an expansion of LMs
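A minimal sketch of frequency-based bigram estimation with Laplace smoothing, assuming the counting scheme above (the names `unigram_counts`, `bigram_counts`, and `smoothed_prob` are illustrative):

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))

# n(x): unigram counts, and n(x, x'): counts of x followed by x'.
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def smoothed_prob(x, x_next, eps=0.1):
    """Laplace-smoothed estimate (n(x, x') + eps) / (n(x) + eps * |V|)."""
    return (bigram_counts[(x, x_next)] + eps) / (unigram_counts[x] + eps * len(vocab))

print(smoothed_prob("the", "cat"))  # observed pair: dominated by its relative frequency
print(smoothed_prob("cat", "dog"))  # unseen pair: small but non-zero thanks to smoothing
```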
Techniques
- One technique we can employ to reduce parameters is weight tying: assuming our language model has an embedding layer, we make that layer share its parameters with the final layer of the language model (see the sketch after this list).
- This not only reduces the parameters but also improves perplexity.
- This is based on the insight that, for next-token prediction, the embedding matrix (of shape $|\mathcal{V}| \times d$) and the weight matrix of the final layer often have the same dimensions (up to transposition).
- C5W3LO4 Beam Search - beam search is an algorithm similar to BFS and DFS (but it is not guaranteed to find the maximum), wherein, given beam width $k$, we keep only the top $k$ most likely partial outputs at each step of the search. The goal is to find the most likely $T$-length sentence using this search (see the sketch after this list).
- C5W3LO4 Refining Beam Search - use length normalization to improve beam search: maximize the log-likelihood and average it over the sentence length so that longer sentences are not unfairly penalized.
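A minimal sketch of weight tying, assuming a toy PyTorch language model (the module name `TinyLM` and its sizes are illustrative, not from any particular source):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # weight shape (|V|, d)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight shape (|V|, d)
        self.lm_head.weight = self.embed.weight                     # weight tying: share parameters

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return self.lm_head(hidden)                                 # logits over the vocabulary

model = TinyLM(vocab_size=100, d_model=32)
logits = model(torch.randint(0, 100, (2, 5)))    # a batch of 2 sequences of length 5
print(logits.shape)                              # torch.Size([2, 5, 100])
```

Because the two weight matrices have the same shape, assigning one `nn.Parameter` to both layers removes one full copy of those parameters while keeping both layers functional.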
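A minimal sketch of beam search with length normalization, assuming a scoring function `log_prob(context, token)` stands in for a trained model (here it returns a uniform toy distribution; all names are illustrative):

```python
import math

vocab = ["a", "b", "<eos>"]

def log_prob(context, token):
    # Placeholder: a real language model would score the token given the context.
    return math.log(1.0 / len(vocab))

def beam_search(k=3, max_len=5, alpha=0.7):
    beams = [([], 0.0)]                                   # (tokens so far, total log-likelihood)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))        # finished hypotheses carry over
                continue
            for token in vocab:
                candidates.append((tokens + [token], score + log_prob(tokens, token)))
        # Keep only the top-k most likely partial outputs at each step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    # Length normalization: average the log-likelihood over sentence length (exponent alpha).
    return max(beams, key=lambda c: c[1] / (len(c[0]) ** alpha))

print(beam_search())
```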
Papers
- On Natural Language Processing and Plan Recognition by Geib and Steedman (2007)
- SQuAD — 100,000+ Questions for Machine Comprehension of Text by Rajpurkar, Zhang, Lopyrev, and Liang (Oct 11, 2016)
- An Efficient Framework for Learning Sentence Representations by Logeswaran and Lee (2018)
- GLUE — A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding by Wang et al. (Feb 22, 2019)
- UnifiedSKG — Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models by Xie et al. (Oct 18, 2022)