Query-Key-Value Model

  • The design of the transformer is motivated by observations from databases and retrieval systems, where:
    • We have a set of keys to index the data found in the database.
    • Each key is associated with a value.
    • The user can retrieve data through queries.
  • Such a design leads to the following implications (a toy lookup sketch follows this list).
    • We can design queries that operate on key-value pairs such that they are valid regardless of the database size.
    • The same query can receive different answers according to the context of the database.
    • The code executed to operate on a large database can be simple.
    • There is no need to compress or simplify the database to make the operations effective.
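
A toy Python sketch of this key-value intuition (the keys, values, and scoring function are invented for illustration): an exact-match lookup returns a single value, while an attention-style "soft" lookup scores the query against every key and returns a weighted combination of all values.

    import math

    # Toy key-value store; the contents are made up.
    keys   = [1.0, 2.0, 3.0]
    values = [10.0, 20.0, 30.0]

    # Hard lookup: the query must match a key exactly.
    def hard_lookup(query):
        return values[keys.index(query)]

    # Soft lookup: score the query against every key, normalize the scores
    # with a softmax, and return a weighted combination of all values. This
    # works for any table size, and the same query gets a different answer
    # as the contents of the table change.
    def soft_lookup(query):
        scores  = [-(query - k) ** 2 for k in keys]        # similarity score
        exps    = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]            # softmax
        return sum(w * v for w, v in zip(weights, values))

    print(hard_lookup(2.0))   # 20.0
    print(soft_lookup(2.2))   # ~21.7: mostly the value for key 2, some for 3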

Attention

  • Attention is a mechanism that allows a model to selectively focus on particular tokens within some sequential input.
  • Attention is a linear combination of the values, where the weights are a function of the query and the keys. Mathematically (a NumPy sketch follows this list):
    • Attention(q, D) = Σ_i α(q, k_i) v_i, where D = {(k_1, v_1), …, (k_m, v_m)} is the set of key-value pairs.
    • A scoring function a(q, k_i) is used to determine how well the query matches a key; it can be additive or a scaled dot product, a(q, k) = qᵀk / √d. The weights α(q, k_i) are obtained by normalizing the scores with softmax.
  • Attention can be extended to multi-head attention, where the queries, keys, and values are linearly projected several times and attention is applied to each projection in parallel (each such attention operation is called a head); see the multi-head sketch below.
    • The idea is that each head attends to different parts of the input.
    • At the end of the pipeline, the outputs of all heads are concatenated.
  • Self-attention pertains to an attention mechanism where the tokens are used as the source of the queries, keys, and values.
    • In a sense, every token can attend to every other token in the sequence.
    • Note that because of this design, attention is permutation-invariant: information about token order is lost.
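
A minimal NumPy sketch of the attention operation above with the scaled dot-product score (the array shapes and dimensions are illustrative assumptions). Passing the same token matrix as queries, keys, and values gives self-attention.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)       # a(q, k): how well q matches each k
        weights = softmax(scores, axis=-1)  # alpha: normalized attention weights
        return weights @ V                  # linear combination of the values

    # Self-attention: the tokens are the source of queries, keys, and values.
    tokens = np.random.randn(5, 8)          # 5 tokens, embedding dimension 8
    out = scaled_dot_product_attention(tokens, tokens, tokens)
    print(out.shape)                        # (5, 8)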
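
And a sketch of multi-head attention, continuing from the code above; the projection matrices here are random stand-ins for the learned per-head weights.

    def multi_head_attention(Q, K, V, num_heads):
        d = Q.shape[-1]
        d_head = d // num_heads
        rng = np.random.default_rng(0)
        heads = []
        for _ in range(num_heads):
            # Random stand-ins for the learned per-head projections.
            Wq, Wk, Wv = (rng.standard_normal((d, d_head)) for _ in range(3))
            heads.append(scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv))
        return np.concatenate(heads, axis=-1)  # concatenate the head outputs

    out = multi_head_attention(tokens, tokens, tokens, num_heads=2)
    print(out.shape)  # (5, 8): two heads of width 4, concatenated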

The Transformer Model

  • The transformer model makes use of the attention mechanism to transform an input sequence into an output sequence.
  • The model takes in an embedding of each input token (the tokens coming from a tokenizer such as byte pair encoding), which acts as a more compact representation of the input data.
  • The model first applies positional encoding to represent the position of each token in the sequence. This is done to account for the fact that the order in which tokens appear in a sequence is important (a sketch of the sinusoidal encoding follows this list).
    • The original positional encoding proposed encodes both the absolute position of a token and its position relative to other tokens. For any fixed offset δ, the positional encoding at position i + δ can be obtained through a linear projection of the encoding at position i.
  • The model then consists of an encoder and a decoder
    • The encoder takes in the input sequence and outputs a latent space representation called a context variable for each position in the input sequence.
      • Encoders can attend to the whole sequence as needed.
    • The decoder takes in the context variable as well as another sequence and produces an output sequence.
      • The architecture for a decoder is mostly similar to the encoder except for the presence of a specialized layer.
      • It makes use of encoder-decoder attention, which performs attention such that the queries come from the previous decoder layer, and the keys and values come from the encoder outputs.
      • Decoders can only attend to the tokens that have already been generated; this is enforced with masking (see the causal-masking sketch after this list).
      • In deployment, the decoder is simply fed the outputs it has generated so far.
    • In translation terms, the encoder builds the retrieval system (the keys and values) for one language, and the decoder produces the queries from another language.
  • The encoder and decoder need not be used together. They can be taken and trained separately.
    • Encoder-Only Transformers are best suited for tasks where there is a need to understand the full sequence within one language.
      • They are pre-trained by corrupting the given sequence and tasking the model with reconstructing the initial sequence.
    • Decoder-Only Transformers are best suited for tasks that involve text-generation.
      • They are pre-trained by predicting the next word of the sequence.
  • Transformers scale very well with increased model size and training computation.
    • This scaling follows a power law (a small numeric sketch follows this list).
    • This scaling is made practical by the fact that the transformer is easily parallelizable, enabling much larger and deeper architectures to be trained efficiently.
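
A sketch of the original sinusoidal positional encoding mentioned above (the function and variable names are my own): even dimensions use sine and odd dimensions cosine at geometrically spaced frequencies, which is what gives the fixed-offset linear-projection property.

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """P[i, 2j] = sin(i / 10000^(2j/d)), P[i, 2j+1] = cos(i / 10000^(2j/d))."""
        positions = np.arange(max_len)[:, None]                  # (max_len, 1)
        freqs = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
        P = np.zeros((max_len, d_model))
        P[:, 0::2] = np.sin(positions / freqs)
        P[:, 1::2] = np.cos(positions / freqs)
        return P

    # The encoding is added to the token embeddings before the first layer.
    P = sinusoidal_positional_encoding(max_len=50, d_model=8)
    print(P.shape)  # (50, 8)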
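
A sketch of the masking that enforces "decoders can only attend to tokens that have already been generated": scores for future positions are set to -inf before the softmax, so they receive zero weight. Encoder-decoder attention, by contrast, is just the earlier scaled_dot_product_attention with queries from the decoder and keys/values from the encoder outputs.

    import numpy as np

    def causal_self_attention(X):
        """X: (n, d). Each position attends only to itself and earlier ones."""
        n, d = X.shape
        scores = X @ X.T / np.sqrt(d)
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
        scores = np.where(mask, -np.inf, scores)          # block future tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ X

    out = causal_self_attention(np.random.randn(5, 8))
    print(out.shape)  # (5, 8)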
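
The power-law claim can be written as L(C) ≈ a · C^(-b) for training compute C. A tiny sketch of the functional form; the coefficients a and b below are invented placeholders, not fitted values.

    # Under a power law, loss falls polynomially (not exponentially) in compute,
    # so each constant-factor reduction in loss costs a multiplicative increase
    # in compute. Coefficients are made up for illustration.
    a, b = 10.0, 0.05
    for compute in [1e18, 1e20, 1e22]:
        loss = a * compute ** (-b)
        print(f"compute={compute:.0e}  predicted loss={loss:.2f}")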

Papers

  • Attention Is All You Need by Vaswani et al. (Dec 6, 2017) - the seminal Transformer paper.

  • Universal Transformers by Dehghani et al. (Mar 5, 2019)

  • Generating Long Sequences with Sparse Transformers by Child, Gray, Radford, Sutskever (Apr 23, 2019)

  • The Evolved Transformer by So, Liang, Le (May 17, 2019)

Links