• A language model is a model that estimates the joint probability of a sequence of tokens, i.e., how likely the sequence is under the distribution that generated the data.

  • Language models are typically autoregressive: they model prediction by decomposing the joint probability of a sequence using the chain rule of probability,

    $$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

    We may also simplify the above if we make use of the Markov property: conditioning each token only on the previous $n$ tokens gives

    $$P(x_t \mid x_1, \ldots, x_{t-1}) \approx P(x_t \mid x_{t-n}, \ldots, x_{t-1})$$
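    As a toy illustration, here is a minimal Python sketch of the chain rule under a first-order Markov assumption; the probability table is hypothetical, filled in by hand purely for illustration:

```python
# Toy conditional probabilities P(next | prev) for a first-order (bigram) model.
# These values are hypothetical, chosen by hand for illustration.
cond_prob = {
    "<s>": {"the": 0.6, "a": 0.4},   # "<s>" marks the start of the sequence
    "the": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.5, "ran": 0.5},
}

def sequence_probability(tokens):
    # Chain rule with a first-order Markov assumption:
    # P(x_1, ..., x_T) = prod_t P(x_t | x_{t-1})
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= cond_prob.get(prev, {}).get(tok, 0.0)
        prev = tok
    return prob

print(sequence_probability(["the", "cat", "sat"]))  # 0.6 * 0.3 * 0.5 = 0.09
```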

Pipeline

  • In general, the pipeline for language data is:
    • Clean the data (optional, depending on the problem)
      • Remove stop words, i.e., commonly used words that carry little information.

      • Normalize the data — everything should be in a consistent format

      • Sequence Partitioning — we partition a corpus into smaller subsequences (see the sketch after this list).

        Let $T$ be the size of the corpus and $n$ the size of each $n$-gram.

        At the beginning of an epoch, discard the first $d$ tokens, where $d$ is sampled uniformly at random from $\{0, 1, \ldots, n-1\}$.

        The rest of the sequence is then partitioned into $\lfloor (T - d)/n \rfloor$ subsequences of length $n$.

    • Load inputs into memory
    • Tokenization — the process of splitting a string into a sequence of small, indivisible units called tokens.
    • Build a vocabulary to associate each vocabulary element with a numerical index.
    • Convert the text into a sequence of numerical indices (a sketch of these last steps follows).
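
    A minimal Python sketch of the tokenization, vocabulary, and index-conversion steps; the whitespace tokenizer and toy corpus are simplifying assumptions (real pipelines often use subword tokenizers):

```python
from collections import Counter

def tokenize(text):
    # Simplest possible word-level tokenizer: lowercase and split on whitespace.
    return text.lower().split()

corpus = "the cat sat on the mat . the dog sat ."  # toy corpus for illustration
tokens = tokenize(corpus)

# Build a vocabulary mapping each token to a numerical index, most frequent
# first; index 0 is reserved for "<unk>", the unknown-token placeholder.
vocab = {"<unk>": 0}
for tok, _ in Counter(tokens).most_common():
    vocab[tok] = len(vocab)

# Convert the text into a sequence of numerical indices.
indices = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
print(indices)
```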
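    And a sketch of the random-offset sequence partitioning described above, assuming the corpus has already been converted to a list of token indices:

```python
import random

def partition(corpus_ids, n):
    # Discard the first d tokens, with d drawn uniformly from {0, ..., n-1},
    # then split the remainder into floor((T - d) / n) subsequences of length n.
    d = random.randint(0, n - 1)
    trimmed = corpus_ids[d:]
    return [trimmed[i * n : (i + 1) * n] for i in range(len(trimmed) // n)]

# A fresh offset each epoch shifts the subsequence boundaries.
print(partition(list(range(20)), n=5))
```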

Basic Models

Frequency-Based Estimation

  • We can estimate the conditional probability of any given word from its relative frequency in the training set.

    Given words $x$ and $x'$, we have

    $$\hat{P}(x' \mid x) = \frac{n(x, x')}{n(x)},$$

    where $n(x, x')$ denotes the number of occurrences of $x$ and $x'$ in that order.

    • Due to the number of possible arrangements an $n$-gram may have, many $n$-grams never appear in the training set, so we may find the probability above to be zero.

    • One technique to solve this is Laplace Smoothing: we add a small constant $\epsilon$ to all counts so that

      $$\hat{P}(x' \mid x) = \frac{n(x, x') + \epsilon}{n(x) + \epsilon |\mathcal{V}|},$$

      where $|\mathcal{V}|$ is the size of the vocabulary (see the sketch after the list below).

      This has a few drawbacks:

      • Many $n$-grams occur very rarely, which makes their smoothed estimates still unreliable.
      • We need to store all the counts which can get unwieldy for large vocabularies.
      • This ignores the meaning of words.
      • Long word sequences are likely to be rare, so counts grow ever sparser as $n$ increases.
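
    A minimal sketch of both estimates on a toy corpus; the corpus and the choice $\epsilon = 1$ are illustrative assumptions:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # toy corpus

unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigram)

def p_mle(x, x_next):
    # Relative-frequency estimate n(x, x') / n(x); zero for unseen bigrams.
    return bigram[(x, x_next)] / unigram[x]

def p_laplace(x, x_next, eps=1.0):
    # Laplace smoothing: add a constant to every count so that no
    # conditional probability is exactly zero.
    return (bigram[(x, x_next)] + eps) / (unigram[x] + eps * vocab_size)

print(p_mle("the", "cat"))      # 2/3 -- "the cat" occurs twice, "the" thrice
print(p_mle("the", "dog"))      # 0.0 -- unseen bigram
print(p_laplace("the", "dog"))  # 1/9 -- small but nonzero
```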

Papers

  • On Natural Language Processing and Plan Recognition by Geib and Steedman (2007)

  • SQuAD — 100,000+ Questions for Machine Comprehension of Text by Rajpurkar, Zhang, Lopyrev, and Liang (Oct 11, 2016)

  • An Efficient Framework for Learning Sentence Representations by Logeswaran and Lee (2018)

  • GLUE — A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding by Wang et al. (Feb 22, 2019)

  • UnifiedSKG — Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models by Xie et al. (Oct 18, 2022)

Links