- A Large Language Model (LLM) is a machine learning model for language embedding and text generation, characterized by a large number of parameters (on the order of hundreds of billions) and by being trained on very large datasets (trillions of tokens).
- It is debatable whether LLMs actually learn to reason or whether they simply perform sophisticated pattern recognition.
- When pre-training an LLM, we typically fill the full context window with text. To separate different text snippets (i.e., separate documents), we insert EOS tokens between them.
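This packing scheme can be sketched as follows; the token IDs and the value of the EOS token are hypothetical, and a real pipeline would operate on tokenizer output rather than hand-written lists:

```python
# Sketch: packing pre-training documents into fixed-size context windows,
# separated by an EOS token. EOS_ID and the token IDs are hypothetical.
EOS_ID = 0
CONTEXT_LEN = 8  # tiny window for illustration

def pack_documents(docs, context_len=CONTEXT_LEN, eos_id=EOS_ID):
    """Concatenate tokenized docs with EOS separators, then chunk."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary
    # Split the flat token stream into full context windows.
    return [stream[i:i + context_len]
            for i in range(0, len(stream) - context_len + 1, context_len)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_documents(docs))  # [[5, 6, 7, 0, 8, 9, 0, 10]]
```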
- For the training corpus, it is desirable to filter out:
- Sources with personally identifiable information
- Boilerplate text
- Duplicate text or documents.
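A minimal sketch of such a filter, using exact deduplication via hashing and a crude regex-based PII check; the regexes here are illustrative only, and production pipelines use far more sophisticated detectors and near-duplicate matching:

```python
import hashlib
import re

# Illustrative, not production-grade, PII patterns.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def filter_corpus(docs):
    seen = set()
    kept = []
    for doc in docs:
        if EMAIL_RE.search(doc) or PHONE_RE.search(doc):
            continue  # drop documents containing PII
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = ["hello world", "hello world", "contact me at a@b.com"]
print(filter_corpus(corpus))  # ['hello world']
```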
Key Facts
- LLMs can be fine-tuned using only a small amount of data. This makes them more economical to use (ignoring hardware costs).
- LLMs can be preconditioned to take on or speak in certain roles, or tuned for specific tasks.
- LLMs scale not only with the number of parameters but also with the amount of training data used.
- At scale, LLMs exhibit useful emergent behaviors such as:
- In-context learning (Language Models are Few-Shot Learners by Brown et al. (Jul. 22, 2020)|Brown et al. (2020)) and zero-shot generalization to unseen tasks.
- Amenability to instruction tuning (Finetuned Language Models are Zero-Shot Learners by Wei et al. (Feb 8, 2022)|Wei et al. (2022))
- Chinchilla scaling (i.e., increased performance with more training tokens) (Training Compute-Optimal Large Language Models by Hoffmann et al. (Mar 29, 2022)|Hoffmann et al. (2022))
- Chain-of-thought prompting (Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Wei et al. (Jan 10, 2023)|Wei et al. (2023))
- LLMs acquire the biases present in their training data.
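The in-context learning behavior above can be sketched as simple prompt assembly: a few input/output exemplars are prepended to the query, and the model is expected to continue the pattern. The template and exemplars here are hypothetical, and no model call is made:

```python
# Sketch: building a few-shot (in-context learning) prompt.
def few_shot_prompt(examples, query):
    """Format (input, output) exemplars followed by the new query."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\nInput: {query}\nOutput:"

examples = [("2 + 2", "4"), ("3 + 5", "8")]
print(few_shot_prompt(examples, "7 + 1"))
```

With zero exemplars this degenerates into a zero-shot prompt, which is why the two behaviors are often discussed together.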
Workflows
- 1 suggests that when building complex workflows we can get good results by exploiting the increasing context length of models (i.e., using many-shot learning or carefully tuned prompts).
This leads to a pipeline:
- Start with quick and simple prompts.
- Iteratively flesh out the prompt based on where the output falls short. This may lead to mega-prompts.
- Consider few-shot or many-shot learning, or fine-tuning.
- Break down the task into subtasks and use an agentic workflow.
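The final, agentic step of the pipeline can be sketched as chained model calls, one per subtask. The `llm` function here is a hypothetical placeholder for a real model call, and the summarization task and prompts are illustrative:

```python
# Sketch: an agentic workflow that breaks a task into subtasks,
# each handled by its own focused prompt.
def llm(prompt):
    """Hypothetical placeholder for a real LLM API call."""
    return f"<model answer to: {prompt!r}>"

def agentic_summarize(document):
    # Subtask 1: decompose the input.
    outline = llm(f"List the key sections of this text:\n{document}")
    # Subtask 2: handle each piece with a focused prompt.
    draft = llm(f"Summarize each section:\n{outline}")
    # Subtask 3: combine and polish.
    return llm(f"Merge into one concise summary:\n{draft}")

result = agentic_summarize("some long document")
```

The point of the decomposition is that each prompt stays small and checkable, rather than one mega-prompt doing everything at once.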
Variants
- According to 2
- Left-to-right LMs are the most commonly used. They scan text autoregressively, in the manner of a decoder.
- Masked LMs model bidirectional contexts, in the manner of an encoder.
- Prefix Language Models are left-to-right LMs that decode an output conditioned on an input, which is encoded by the same model parameters but with a fully visible attention mask and possibly some corruption of the input.
- Encoder-Decoder architectures mimic the full transformer.
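The attention patterns that distinguish these variants can be sketched with toy masks, where entry `[i][j] = 1` means position `i` may attend to position `j`. The sequence length and prefix size are arbitrary choices for illustration:

```python
# Toy attention masks for the LM variants above.
n, prefix = 4, 2  # sequence length 4, with a 2-token prefix/input

# Left-to-right (decoder-style): each position sees only itself and the past.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Masked LM (encoder-style): every position sees the whole sequence.
bidirectional = [[1] * n for _ in range(n)]

# Prefix LM: full attention within the prefix, causal over the rest.
prefix_lm = [[1 if j < prefix or j <= i else 0 for j in range(n)]
             for i in range(n)]

for row in prefix_lm:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

An encoder-decoder model combines two of these: a bidirectional mask over the input and a causal mask (plus cross-attention) over the output.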
Topics
- Language Model Sampling
- LLM Fine Tuning
- Prompt Engineering - an increasingly important technique for using LLMs that involves tuning the input prompts.
- Instruction Tuning - a technique for getting an NLP model to understand and follow instructions.
Foundational Models
Links
- Transformer Model - discusses one of the most common mechanisms used in LLMs.
- Language Model - a discussion of some of the earlier and smaller language models.
- How To Generate - Hugging Face article on decoding strategies.
- Flowise - a tool for working with LLMs.