- The transformer model uses the attention mechanism to transform an input sequence into an output sequence.
- The model takes in an embedding (from some tokenizer or byte pair encoding) that acts as a more compact representation of the input data.
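A minimal sketch of this step, assuming PyTorch (the vocabulary size, model dimension, and token IDs below are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 10k-token vocabulary, 512-dimensional model.
vocab_size, d_model = 10_000, 512
embedding = nn.Embedding(vocab_size, d_model)

# Token IDs produced by some tokenizer (e.g. byte pair encoding).
token_ids = torch.tensor([[5, 42, 901, 7]])   # shape (batch=1, seq_len=4)
x = embedding(token_ids)                      # shape (1, 4, 512): one dense vector per token
```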
- The model first performs positional encoding to represent the position of each token in the sequence. This accounts for the fact that the order in which tokens appear is important.
- The original positional encoding proposed encodes both the absolute position of a token and its position relative to other tokens. For any fixed offset k, the positional encoding at position pos + k can be obtained through a linear projection of the encoding at position pos.
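A sketch of the original sinusoidal encoding, assuming PyTorch (the sequence length and dimension are arbitrary). Each pair of dimensions is a sine/cosine at one fixed frequency, so shifting the position by a fixed offset amounts to rotating that pair, which is what makes the linear-projection property above hold:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same angle)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    i = torch.arange(d_model // 2, dtype=torch.float32).unsqueeze(0)    # (1, d_model/2)
    angles = pos / torch.pow(torch.tensor(10000.0), 2 * i / d_model)    # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)  # added to the token embeddings
```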
- The model then consists of an encoder and a decoder.
- The encoder takes in the input sequence and outputs a latent space representation called a context variable for each position in the input sequence.
- Encoders can attend to the whole sequence as needed.
- The decoder takes in the context variable as well as another sequence and produces an output sequence.
- The architecture of the decoder is mostly similar to the encoder, except for the presence of an additional specialized attention sub-layer.
- It makes use of encoder-decoder attention, which performs attention such that the queries come from the previous decoder layer, and the keys and values come from the encoder outputs (see the sketch after this list).
- Decoders can only attend to the tokens that have already been generated.
- In deployment, the decoder is simply fed the outputs it has generated so far.
- In the machine-translation analogy, the encoder output acts as the retrieval system (keys and values) for one language, while the decoder produces the queries from the other language. Queries, keys, and values are all computed by the model itself.
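A minimal sketch of both attention patterns, assuming PyTorch (the shapes and random inputs are illustrative, and the learned linear projections that produce the queries, keys, and values in the real model are omitted):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional attention mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (..., q_len, k_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

batch, src_len, tgt_len, d_model = 2, 6, 4, 512
encoder_out   = torch.randn(batch, src_len, d_model)         # context variables, one per source position
decoder_state = torch.randn(batch, tgt_len, d_model)         # output of the previous decoder layer

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
cross = scaled_dot_product_attention(decoder_state, encoder_out, encoder_out)

# Decoder self-attention: a causal mask restricts each position to already-generated tokens.
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len))
self_attn = scaled_dot_product_attention(decoder_state, decoder_state, decoder_state,
                                         mask=causal_mask)
```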
- The encoder and decoder need not be used together; each can be taken and trained separately.
- Encoder-Only Transformers are best suited for tasks where there is a need to understand the full sequence within one language.
- They are pre-trained by corrupting the given sequence and tasking the model with reconstructing the initial sequence.
- Decoder-Only Transformers are best suited for tasks that involve text-generation.
- They are pre-trained by predicting the next word of the sequence.
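A minimal sketch contrasting the two pretraining setups (the token IDs and the mask ID below are made up):

```python
import torch

tokens = torch.tensor([11, 42, 7, 99, 3])     # a toy token sequence
MASK_ID = 0                                   # hypothetical "[MASK]" token ID

# Encoder-only pretraining (masked reconstruction): corrupt the input,
# then train the model to recover the original sequence.
corrupted = tokens.clone()
corrupted[2] = MASK_ID                        # input:  [11, 42, MASK, 99, 3]
reconstruction_target = tokens                # target: [11, 42, 7,    99, 3]

# Decoder-only pretraining (next-token prediction): at every position,
# the target is simply the following token.
lm_input  = tokens[:-1]                       # [11, 42, 7, 99]
lm_target = tokens[1:]                        # [42, 7, 99, 3]
```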
- Transformers scale very well with increased model size and training computation.
- This scaling follows a power law.
- Such scaling is practical in part because the transformer is easily parallelizable, which makes deeper and larger architectures feasible to train without a corresponding loss in performance.
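The power law mentioned above is commonly written in a form like the following (an illustrative form with empirically fitted constants, not something stated in these notes):

```latex
% Illustrative form: test loss falls as a power of the parameter count N.
% N_c and \alpha_N are empirically fitted constants (assumed notation).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```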
Papers
- Attention Is All You Need by Vaswani et al. (Dec 6, 2017) - the seminal Transformer paper.
- Universal Transformers by Dehghani et al. (Mar 5, 2019)
- Generating Long Sequences with Sparse Transformers by Child, Gray, Radford, and Sutskever (Apr 23, 2019)
- The Evolved Transformer by So, Liang, and Le (May 17, 2019)
Links
- Zhang et al., Ch. 11 - for everything about the basics of the transformer model.
- All about Attention - more about attention