• The transformer model uses the attention mechanism to transform an input sequence into an output sequence.

  • The model takes in embeddings of the input tokens (produced by a tokenizer such as byte-pair encoding, followed by an embedding lookup), which act as a more compact representation of the input data (see the toy sketch below).
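A toy NumPy sketch of that embedding step; the vocabulary, token ids, and dimensions below are made up for illustration, and a real model would use a learned BPE vocabulary and a trained embedding matrix:

```python
import numpy as np

# Hypothetical tiny vocabulary produced by some tokenizer / byte-pair encoding.
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}
d_model = 8

# Embedding matrix: one d_model-dimensional vector per vocabulary entry
# (learned during training; random here just to make the sketch runnable).
embedding_matrix = np.random.randn(len(vocab), d_model)

token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in "the cat sat".split()]
embeddings = embedding_matrix[token_ids]   # shape (3, d_model): the model's input
print(embeddings.shape)
```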

  • The model first applies positional encoding to represent the position of each token in the sequence. This is done to account for the fact that the order in which tokens appear in the sequence is important.

    • The original positional encoding proposed uses sinusoids and encodes both the absolute position of a token and its position relative to other tokens. For any fixed offset k, the positional encoding at position pos + k can be obtained through a linear transformation of the positional encoding at position pos (see the sketch below).
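A minimal NumPy sketch of the sinusoidal positional encoding from the original paper (the function name is mine, and it assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix with PE[pos, 2i] = sin(pos / 10000^(2i/d_model))
    and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # 2i for i = 0, 1, ...
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)   # (50, 64); this matrix is simply added to the token embeddings
```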
  • The model then consists of an encoder and a decoder

    • The encoder takes in the input sequence and outputs a latent space representation called a context variable for each position in the input sequence.
      • Encoders can attend to the whole sequence as needed.
    • The decoder takes in the context variable as well as another sequence and produces an output sequence.
      • The architecture of a decoder block is mostly similar to that of an encoder block, except for the presence of an additional attention sub-layer.
      • That sub-layer performs encoder-decoder attention, in which the queries come from the previous decoder layer and the keys and values come from the encoder outputs.
      • Decoders can only attend to the tokens that have already been generated.
      • In deployment, the decoder is simply fed the outputs it has generated so far.
    • In the translation analogy, the encoder produces the retrieval system (the keys and values) for one language and the decoder produces the queries from the other language. Queries, keys, and values are all computed by the model (see the attention sketch below).
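A minimal NumPy sketch of scaled dot-product attention and the two ways the decoder uses it; the learned projection matrices (W_Q, W_K, W_V) are omitted, and all shapes and variable names here are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (tgt_len, d_k); k: (src_len, d_k); v: (src_len, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # (tgt_len, src_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # hide disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over source positions
    return weights @ v                                 # (tgt_len, d_v)

enc_out = np.random.randn(10, 64)    # encoder outputs ("context variables"), src_len = 10
dec_state = np.random.randn(4, 64)   # decoder states for the 4 tokens generated so far

# Encoder-decoder attention: queries from the decoder, keys/values from the encoder.
ctx = scaled_dot_product_attention(dec_state, enc_out, enc_out)

# Masked (causal) self-attention inside the decoder: position i attends only to positions <= i.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
self_ctx = scaled_dot_product_attention(dec_state, dec_state, dec_state, mask=causal_mask)
```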
  • The encoder and decoder need not be used together. They can be taken and trained separately.

    • Encoder-Only Transformers are best suited for tasks that require understanding the full sequence within a single language (e.g. classification).
      • They are pre-trained by corrupting the given sequence and tasking the model with reconstructing the initial sequence.
    • Decoder-Only Transformers are best suited for tasks that involve text generation.
      • They are pre-trained by predicting the next word of the sequence (both pre-training objectives are sketched below).
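A toy sketch of how the two pre-training targets are typically constructed; the 15% mask rate and the [MASK] token follow common BERT-style practice and are assumptions here, not details from these notes:

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Encoder-only style pre-training: corrupt the sequence, reconstruct the original tokens.
corrupted, mlm_targets = [], []
for tok in tokens:
    if random.random() < 0.15:       # mask roughly 15% of tokens (rate is illustrative)
        corrupted.append("[MASK]")
        mlm_targets.append(tok)      # the model must recover these
    else:
        corrupted.append(tok)
        mlm_targets.append(None)     # no loss on uncorrupted positions

# Decoder-only style pre-training: predict the next token at every position.
lm_inputs = tokens[:-1]              # ["the", "cat", "sat", "on", "the"]
lm_labels = tokens[1:]               # ["cat", "sat", "on", "the", "mat"]
```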
  • Transformers scale very well with increased model size and training computation.

    • This scaling follows a power law.
    • This scaling is helped by the fact that the transformer is easily parallelizable (all positions in a sequence can be processed at once), which enables training deeper architectures without a loss in performance (a toy sketch of the power-law form follows below).
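A minimal sketch of what "follows a power law" means here; the functional form loss = a * x^(-alpha) + floor is the standard one from the scaling-law literature, but the constants below are placeholders, not measured values:

```python
# x can stand for parameter count, dataset size, or training compute.
# All constants are made up purely so the example runs; they are not real measurements.
def power_law_loss(x, a=1.0, alpha=0.05, loss_floor=1.7):
    return a * x ** (-alpha) + loss_floor

for flops in [1e18, 1e20, 1e22]:
    print(f"{flops:.0e} FLOPs -> predicted loss {power_law_loss(flops):.3f}")
```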

Papers

  • Attention Is All You Need by Vaswani et al. (Dec 6, 2017) - the seminal Transformer paper.

  • Universal Transformers by Dehghani et al. (Mar 5, 2019)

  • Generating Long Sequences with Sparse Transformers by Child, Gray, Radford, Sutskever (Apr 23, 2019)

  • The Evolved Transformer by So, Liang, Le (May 17, 2019)

Links