RNN Encoder-Decoder

  • An RNN Encoder-Decoder is a Recurrent Neural Network architecture comprising two units, both of which are RNNs.

    • An encoder takes a variable-length sequence as input and encodes it into a fixed-length hidden state.
    • A decoder takes the hidden state from the encoder together with the leftwards context of the target sequence, and predicts the subsequent token in the target sequence.
  • At training time, the decoder is conditioned on the preceding tokens of the ground-truth target sequence; this is known as teacher forcing. At test time, however, we condition the decoder on the tokens it has already predicted.
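
    As a minimal sketch (not the reference implementation), the difference comes down to how the decoder input is constructed at each stage; the token ids and special symbols below are made-up examples.

    ```python
    import torch

    # Hypothetical token ids; the <bos>/<eos> values are assumptions for illustration.
    bos, eos = 1, 2
    Y = torch.tensor([[5, 7, 9, eos]])     # ground-truth target sequence (batch of 1)

    # Training (teacher forcing): the decoder input is the ground-truth target
    # shifted right by one position, starting with <bos>.
    dec_input_train = torch.cat([torch.tensor([[bos]]), Y[:, :-1]], dim=1)
    # -> tensor([[1, 5, 7, 9]])

    # Test time: the decoder starts from <bos> alone and is then fed its own
    # previous prediction at every step (see the generation sketch further below).
    dec_input_test = torch.tensor([[bos]])
    ```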

Encoder-Decoder RNN. Image taken from Zhang et al.
  • The encoder transforms the input data as follows.

    Suppose our input sequence is $x_1, \ldots, x_T$. Then, at time step $t$, we obtain a hidden state as

    $$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}),$$

    where we concatenate the input feature vector with the previous hidden state.

    After this, we transform the hidden states into a context variable $\mathbf{c}$ such that

    $$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T),$$

    where a common choice for $q$ is simply to take the final hidden state, $\mathbf{c} = \mathbf{h}_T$.
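
    A minimal sketch of such an encoder, assuming PyTorch and a GRU; the class name, sizes, and the choice $\mathbf{c} = \mathbf{h}_T$ are illustrative, not prescribed by the source.

    ```python
    import torch
    from torch import nn

    class Seq2SeqEncoder(nn.Module):
        """Encode a variable-length token sequence into hidden states (a sketch)."""
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.GRU(embed_size, num_hiddens, num_layers)

        def forward(self, X):                 # X: (batch, num_steps) of token ids
            X = self.embedding(X)             # (batch, num_steps, embed_size)
            X = X.permute(1, 0, 2)            # GRU expects (num_steps, batch, embed_size)
            outputs, state = self.rnn(X)      # outputs: h_1..h_T; state: final h_T per layer
            # Here the context variable c = q(h_1, ..., h_T) is taken to be h_T,
            # i.e. the final hidden state returned in `state`.
            return outputs, state

    # Shape check: 4 sequences of 7 tokens -> outputs (7, 4, 16), state (2, 4, 16).
    encoder = Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16, num_layers=2)
    outputs, state = encoder(torch.zeros((4, 7), dtype=torch.long))
    ```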

  • Let $y_1, y_2, \ldots, y_{T'}$ be the output sequence.

    For each time step $t'$, we assign a conditional probability to the next token based on the previous output tokens and the context variable.

    That is, the decoder models

    $$P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}).$$

    To perform the actual prediction, we take the previous target token $y_{t'-1}$, the hidden state from the previous time step $\mathbf{s}_{t'-1}$, and the context variable $\mathbf{c}$.

    We obtain the new hidden state as

    $$\mathbf{s}_{t'} = g(y_{t'-1}, \mathbf{c}, \mathbf{s}_{t'-1}),$$

    where $g$ is some function that describes the decoder's transformation.

    An output layer is then applied to $\mathbf{s}_{t'}$ to compute the conditional probabilities, and a softmax over them generates the token $y_{t'}$.

    In practice, we continue generating tokens with the decoder simply by feeding the sequence, shifted by one step to include each newly predicted token, back into the model, as sketched below.
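
    A corresponding decoder sketch, again assuming PyTorch and a GRU with illustrative names: it concatenates the context variable with each embedded input token, updates the hidden state via $g$, applies the output layer, and, at test time, feeds each predicted token back in greedily.

    ```python
    import torch
    from torch import nn

    class Seq2SeqDecoder(nn.Module):
        """Predict the next token from y_{t'-1}, s_{t'-1} and c (a sketch)."""
        def __init__(self, vocab_size, embed_size, num_hiddens, num_layers):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_size)
            # The context variable is concatenated with the embedded input token,
            # so the RNN input size is embed_size + num_hiddens.
            self.rnn = nn.GRU(embed_size + num_hiddens, num_hiddens, num_layers)
            self.dense = nn.Linear(num_hiddens, vocab_size)   # output layer

        def forward(self, X, state):          # X: (batch, num_steps); state: from encoder
            X = self.embedding(X).permute(1, 0, 2)
            # Use the encoder's final top-layer hidden state as the context c and
            # broadcast it across all decoding time steps.
            context = state[-1].repeat(X.shape[0], 1, 1)
            out, state = self.rnn(torch.cat((X, context), dim=2), state)
            logits = self.dense(out).permute(1, 0, 2)         # (batch, num_steps, vocab)
            return logits, state

    def greedy_decode(decoder, enc_state, bos, num_steps):
        """Generate tokens one at a time, feeding each prediction back in."""
        dec_X, state, preds = torch.tensor([[bos]]), enc_state, []
        for _ in range(num_steps):
            logits, state = decoder(dec_X, state)
            dec_X = logits.argmax(dim=2)      # softmax is monotone, so argmax over logits
            preds.append(dec_X.item())
        return preds
    ```

    During training, the whole teacher-forced input can be passed in a single call, e.g. `decoder(dec_input_train, enc_state)`, and the returned logits are scored against the target sequence.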

  • Since the decoder output goes through a softmax, we use the cross-entropy Loss Function.

  • We perform an additional step during training and mask irrelevant entries with zeros so that they do not affect the loss (at the current time step). The masking is necessary because we pad every sequence to a common length, and the padding tokens should not contribute to the loss.
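
    A sketch of such a masked cross-entropy loss, assuming PyTorch; the function name and the example shapes are illustrative.

    ```python
    import torch
    from torch import nn

    def masked_cross_entropy(logits, labels, valid_len):
        """Cross-entropy that ignores padding positions (a sketch).

        logits: (batch, num_steps, vocab), labels: (batch, num_steps),
        valid_len: (batch,) giving the unpadded length of each target sequence.
        """
        num_steps = labels.shape[1]
        # Build a 0/1 mask: positions at or beyond valid_len are padding and are
        # zeroed out so they do not contribute to the loss.
        mask = (torch.arange(num_steps)[None, :] < valid_len[:, None]).float()
        unweighted = nn.functional.cross_entropy(
            logits.permute(0, 2, 1), labels, reduction='none')   # (batch, num_steps)
        return (unweighted * mask).mean(dim=1)

    # Example: 2 sequences of 4 steps over a vocab of 10; the second sequence has
    # only 2 real tokens, so its last two positions are masked out of the loss.
    loss = masked_cross_entropy(torch.ones(2, 4, 10),
                                torch.ones((2, 4), dtype=torch.long),
                                torch.tensor([4, 2]))
    ```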

Transformers

  • See Transformer Model. Transformers have the advantage that their latent representation can be of variable size: the input is not compressed into a single fixed-length hidden state.

Links