- Positional encoding accounts for the fact that the order in which tokens appear in a sequence is important.
- The original positional encoding proposed encodes both the absolute position of a token and its position relative to other tokens: for any fixed offset $k$, the positional encoding at position $t + k$ can be obtained through a linear projection of the encoding at position $t$.
- An ideal positional encoding has the following properties[^1]:
- Unique encoding for each position across sequences regardless of sequence length.
- Linear relation between two encoded positions (for simplicity).
- Generalizable to longer sequences than those encountered in training.
- Deterministically generated.
- Extensible to multiple dimensions.
- Positional encoding can be applied either to the input of the first layer only or at every layer. The latter tends to be more performant.
- The positional encoding $\mathbf{P} \in \mathbb{R}^{n \times d}$ is applied to the input embeddings $\mathbf{X} \in \mathbb{R}^{n \times d}$ additively:
  $$\mathbf{X} + \mathbf{P}$$
- Sinusoidal Positional Embedding constructs $\mathbf{P}$ as follows:
  $$p_{i,2j} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad p_{i,2j+1} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)$$
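  A minimal PyTorch sketch of building this table (the function name and the `n`, `d` arguments are illustrative, not from any particular library):

  ```python
  import torch

  def sinusoidal_positional_encoding(n: int, d: int) -> torch.Tensor:
      """Build the (n, d) sinusoidal table P from the formula above (d assumed even)."""
      i = torch.arange(n, dtype=torch.float32).unsqueeze(1)                        # positions, (n, 1)
      denom = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)   # 10000^(2j/d), (d/2,)
      P = torch.zeros(n, d)
      P[:, 0::2] = torch.sin(i / denom)   # even columns
      P[:, 1::2] = torch.cos(i / denom)   # odd columns
      return P

  # Usage: X is (n, d) token embeddings; the encoding is simply added.
  # X = X + sinusoidal_positional_encoding(*X.shape)
  ```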
- Learned Positional Encoding treats $\mathbf{P}$ as a trainable matrix. Layers can either share a single positional encoding matrix or learn independent ones.
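  A sketch of the learned variant as a hypothetical `nn.Module` (the class name, initialization, and the fixed `max_len` are assumptions for illustration):

  ```python
  import torch
  import torch.nn as nn

  class LearnedPositionalEncoding(nn.Module):
      """P as a trainable (max_len, d) matrix, added to the input embeddings."""
      def __init__(self, max_len: int, d: int):
          super().__init__()
          self.P = nn.Parameter(torch.empty(max_len, d))
          nn.init.normal_(self.P, std=0.02)   # a common initialization choice

      def forward(self, X: torch.Tensor) -> torch.Tensor:
          # X: (n, d) with n <= max_len; slice the table to the current length.
          return X + self.P[: X.shape[0]]
  ```

  Sharing one such module across layers or instantiating it per layer gives the shared / independent variants mentioned above.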
- Rotary Position Embedding (RoPE)[^2] encodes the absolute position with a rotation matrix while incorporating explicit relative position dependency.
- We formulate the inner product of the query $q_m = f_q(x_q, m)$ and the key $k_n = f_k(x_k, n)$ such that it is a function only of the word embeddings and the relative distance $m - n$. Thus
  $$\langle f_q(x_q, m),\, f_k(x_k, n) \rangle = g(x_q, x_k, m - n)$$
- Consider the simpler case of $d = 2$ dimensional embeddings. By our stipulation,
  $$\langle f_q(x_q, m),\, f_k(x_k, n) \rangle = g(x_q, x_k, m - n).$$
  Further assume we have the initial conditions $f_q(x_q, 0) = q$ and $f_k(x_k, 0) = k$, which encode no position information.
- We use complex numbers to represent the $2$-dimensional vectors. Let $R_f$ and $\Theta_f$ denote the radial and angular components of $f$, so that we can rewrite $f = R_f e^{i\Theta_f}$ (and similarly $g = R_g e^{i\Theta_g}$). Applying this to $f_q$ and $f_k$ we get the relations
  $$R_q(x_q, m)\, R_k(x_k, n) = R_g(x_q, x_k, m - n)$$
  $$\Theta_q(x_q, m) - \Theta_k(x_k, n) = \Theta_g(x_q, x_k, m - n)$$
  where $q = \lVert q \rVert e^{i\theta_q}$ and $k = \lVert k \rVert e^{i\theta_k}$. We can find a solution by setting $m = n$. This gives us
  $$R_q(x_q, m)\, R_k(x_k, m) = R_g(x_q, x_k, 0) = R_q(x_q, 0)\, R_k(x_k, 0) = \lVert q \rVert\, \lVert k \rVert$$
  One solution could therefore be to set $R_q(x_q, m) = \lVert q \rVert$ and $R_k(x_k, n) = \lVert k \rVert$. Thus the radial functions are independent from position. Furthermore, since
  $$\Theta_q(x_q, m) - \Theta_k(x_k, m) = \Theta_g(x_q, x_k, 0) = \theta_q - \theta_k,$$
  the quantity $\Theta_q(x_q, m) - \theta_q = \Theta_k(x_k, m) - \theta_k$ does not depend on the query or the key, so we simply set $\Theta_q(x_q, m) - \theta_q = \Theta_k(x_k, m) - \theta_k = \phi(m)$ for some function $\phi$. Also setting $n = m - 1$ gives
  $$\phi(m) - \phi(m - 1) = \Theta_g(x_q, x_k, 1) + \theta_k - \theta_q,$$
  which is a constant. Thus $\phi$ must be linear. Using $\phi(m) = m\theta + \gamma$, and the initial conditions $q = \lVert q \rVert e^{i\theta_q}$ and $k = \lVert k \rVert e^{i\theta_k}$, let us set $\gamma = 0$. The final solution is
  $$f_q(x_q, m) = \lVert q \rVert e^{i(\theta_q + m\theta)} = q\, e^{i m\theta}, \qquad f_k(x_k, n) = k\, e^{i n\theta}.$$
  Therefore
  $$g(x_q, x_k, m - n) = \operatorname{Re}\!\left[q\, \overline{k}\, e^{i(m - n)\theta}\right]$$
  where $\overline{k}$ denotes complex conjugation.
- We can represent the solution using a rotation matrix of the form
  $$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} W_{\{q,k\}}\, x_m$$
- In the general form where $x_m \in \mathbb{R}^d$, where $d$ is even, we divide the space into $d/2$ subspaces and, by linearity, $f_{\{q,k\}}$ (for both $q$ and $k$) is of the form
  $$f_{\{q,k\}}(x_m, m) = R^d_{\Theta, m}\, W_{\{q,k\}}\, x_m$$
  where
  $$R^d_{\Theta, m} = \operatorname{diag}\!\left(M_1, \ldots, M_{d/2}\right), \qquad M_i = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$
  and $\Theta = \{\theta_i \mid i = 1, \ldots, d/2\}$ is the set of per-subspace rotation angles.
- The inner product is therefore
  $$q_m^\top k_n = \left(R^d_{\Theta, m} W_q x_m\right)^\top \left(R^d_{\Theta, n} W_k x_n\right) = x_m^\top W_q^\top R^d_{\Theta, n-m} W_k x_n, \qquad R^d_{\Theta, n-m} = \left(R^d_{\Theta, m}\right)^\top R^d_{\Theta, n}$$
- For completeness, set $\theta_i = 10000^{-2(i-1)/d}$, matching the sinusoidal encoding. Intuitively, tokens should have less influence on other tokens that are relatively far away.
- Because $R^d_{\Theta, m}$ is sparse, we can perform multiplication with it using the Hadamard product $\otimes$ as follows (a code sketch follows this list):
  $$R^d_{\Theta, m}\, x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}$$
- Notice we do not add position information to the values.
- One main challenge with RoPE is extrapolation to sequence lengths longer than those seen during training.
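  A minimal sketch of applying the rotation with element-wise products, as in the Hadamard form above (the function name and the pairing of adjacent dimensions are illustrative assumptions):

  ```python
  import torch

  def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
      """Rotate x of shape (seq_len, d) position-wise, with d even.

      Implements R^d_{Theta,m} x using only element-wise products, pairing
      dimensions (1, 2), (3, 4), ... as in the block-diagonal matrix.
      """
      seq_len, d = x.shape
      theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # theta_i = 10000^{-2(i-1)/d}
      m = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)         # positions, (seq_len, 1)
      angles = m * theta                                                  # (seq_len, d/2)
      cos = angles.cos().repeat_interleave(2, dim=-1)                     # (cos mθ1, cos mθ1, cos mθ2, ...)
      sin = angles.sin().repeat_interleave(2, dim=-1)
      # Build (-x2, x1, -x4, x3, ...): negate the second of each pair and swap.
      x_rot = torch.stack((-x[:, 1::2], x[:, 0::2]), dim=-1).reshape(seq_len, d)
      return x * cos + x_rot * sin

  # Applied to the projected queries and keys only; the values are left untouched.
  ```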
- Transformer-XL[^3] proposes a modified relative positional encoding for use in models that have extended context lengths.
- We use the following decomposition of the attention score between query position $i$ and key position $j$ under absolute positional encoding. Let $U_j$ be the $j$-th row of the positional encoding matrix $U$ and $E_{x_i}$ the embedding of token $x_i$:
  $$A^{\text{abs}}_{i,j} = \underbrace{E_{x_i}^\top W_q^\top W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_k U_j}_{(b)} + \underbrace{U_i^\top W_q^\top W_k E_{x_j}}_{(c)} + \underbrace{U_i^\top W_q^\top W_k U_j}_{(d)}$$
  We perform the following reparameterization:
- Replace the absolute position embedding $U_j$ with its relative counterpart $R_{i-j}$, where $R$ corresponds to the sinusoidal position embedding (not learned). Only the relative distance matters for where to attend.
- Replace $U_i^\top W_q^\top$ with trainable parameters $u$ for term $(c)$ and $v$ for term $(d)$, corresponding to content and location respectively.
- Since the query vector is the same for all query positions, the bias should remain the same regardless of the query position.
- Split $W_k$ into $W_{k,E}$ and $W_{k,R}$ for content-based and location-based key vectors.
- The final reparameterization now looks like (a code sketch follows this list)
  $$A^{\text{rel}}_{i,j} = \underbrace{E_{x_i}^\top W_q^\top W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^\top W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^\top W_{k,R} R_{i-j}}_{(d)}$$
- The first term $(a)$ corresponds to content-based addressing.
- The second term $(b)$ corresponds to a content-dependent positional bias.
- The third term $(c)$ corresponds to a global content bias.
- The fourth term $(d)$ corresponds to a global positional bias.
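  A naive sketch of the four-term score above. All names are illustrative, and the double loop is written for clarity rather than efficiency (Transformer-XL itself uses a relative-shift trick rather than looping over all offsets):

  ```python
  import torch

  def rel_attention_scores(E, Wq, WkE, WkR, R, u, v):
      """Naive computation of A^rel for one head.

      E:   (n, d_model) token embeddings
      Wq, WkE, WkR: (d_head, d_model) projections
      R:   (2n - 1, d_model) sinusoidal table with R[(i - j) + n - 1] = R_{i-j}
      u, v: (d_head,) global content / positional biases
      """
      n = E.shape[0]
      q = E @ Wq.T        # query content, (n, d_head)
      kE = E @ WkE.T      # content-based keys, (n, d_head)
      kR = R @ WkR.T      # position-based keys, (2n - 1, d_head)
      A = torch.empty(n, n)
      for i in range(n):
          for j in range(n):
              r = kR[(i - j) + n - 1]
              A[i, j] = (q[i] @ kE[j]   # (a) content-based addressing
                         + q[i] @ r     # (b) content-dependent positional bias
                         + u @ kE[j]    # (c) global content bias
                         + v @ r)       # (d) global positional bias
      return A
  ```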
- Shaw, Uszkoreit and Vaswani[^4] introduce relative positional encoding.
- Rationale: This is one approach to consider arbitrary pairwise relations between any two input tokens. We treat the input as a labeled fully connected digraph.
- The edge between input elements $x_i$ and $x_j$ is represented by the vectors $a^K_{ij}, a^V_{ij} \in \mathbb{R}^{d_a}$. We apply the positional embedding by modifying the keys and values:
  $$e_{ij} = \frac{x_i W^Q \left(x_j W^K + a^K_{ij}\right)^\top}{\sqrt{d_z}}, \qquad z_i = \sum_{j=1}^{n} \alpha_{ij}\left(x_j W^V + a^V_{ij}\right)$$
- We clip the relative distance to a maximum absolute value $k$.
- Rationale: This lets us generalize to sequence lengths not seen in training. It is also hypothesized that precise positional information is not useful beyond a certain distance.
- We consider $2k + 1$ edge labels (relative positions within the interval $[-k, k]$). We then obtain
  $$a^K_{ij} = w^K_{\mathrm{clip}(j - i,\, k)}, \qquad a^V_{ij} = w^V_{\mathrm{clip}(j - i,\, k)}, \qquad \mathrm{clip}(x, k) = \max(-k, \min(k, x))$$
  where $w^K = (w^K_{-k}, \ldots, w^K_{k})$ and $w^V = (w^V_{-k}, \ldots, w^V_{k})$ are then treated as learnable parameters.
- For efficiency, we share the relative position representations across attention heads and across sequences in a batch.
- When computing the scaled dot product, we separate the computation and perform tensor reshaping on the second term in the following:
  $$e_{ij} = \frac{x_i W^Q \left(x_j W^K\right)^\top + x_i W^Q \left(a^K_{ij}\right)^\top}{\sqrt{d_z}}$$
- The computation for the output using $a^V_{ij}$ is done similarly, separating $x_j W^V$ and $a^V_{ij}$ and reshaping the term corresponding to $a^V_{ij}$.
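  A sketch combining the clipped lookup and the separated score computation; `einsum` stands in for the paper's reshaping trick, and all names are illustrative:

  ```python
  import torch

  def relative_position_scores(X, Wq, Wk, wK, k):
      """Compute e_ij with clipped relative key embeddings for one head.

      X:  (n, d_model) inputs; Wq, Wk: (d_model, d_z) projections
      wK: (2k + 1, d_z) learnable table indexed by clip(j - i, k) + k
      """
      n, d_z = X.shape[0], Wq.shape[1]
      i = torch.arange(n).unsqueeze(1)                 # (n, 1)
      j = torch.arange(n).unsqueeze(0)                 # (1, n)
      rel = torch.clamp(j - i, -k, k) + k              # offsets shifted into [0, 2k]
      aK = wK[rel]                                     # a^K_{ij}, (n, n, d_z)
      Q, K = X @ Wq, X @ Wk                            # (n, d_z) each
      term1 = Q @ K.T                                  # x_i W^Q (x_j W^K)^T
      term2 = torch.einsum("id,ijd->ij", Q, aK)        # x_i W^Q (a^K_{ij})^T
      return (term1 + term2) / d_z ** 0.5

  # The value-side term a^V_{ij} is handled analogously when computing z_i.
  ```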
Incorporating Long Contexts
- Attention with Linear Biases (ALiBi)[^5] proposes a position encoding method where the query-key attention scores are biased with a penalty proportional to their distance.
- The authors speculate that the failure of models to extrapolate to longer sequences is due to the choice of positional encoding.
- ALiBi allows the model to be trained on short sequences and evaluated on longer ones, so training incurs a lower cost.
- ALiBi simply modifies the attention mechanism by introducing a static, non-learned bias. For the $i$-th query, we compute (a code sketch follows this list)
  $$\mathrm{softmax}\!\left(\mathbf{q}_i \mathbf{K}^\top + m \cdot \left[-(i-1), \ldots, -2, -1, 0\right]\right)$$
  where $m$ is a head-specific slope parameter.
- The paper uses a geometric sequence for $m$: for $8$ heads we have $\frac{1}{2^1}, \frac{1}{2^2}, \ldots, \frac{1}{2^8}$.
- ALiBi has an inductive bias towards recency: the penalty decreases as the distance between a key and a query diminishes.
- ALiBi’s decrease in perplexity when given longer sequences is largely explained by its improved avoidance of the early token curse.
- Limitation: When using a larger context during validation, ALiBi might not actually be using contexts longer than the one it was trained on.
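  A sketch of the static bias for causal attention. The general slope formula $2^{-8(h+1)/H}$ (from the paper) reduces to the $\frac{1}{2^1}, \ldots, \frac{1}{2^8}$ sequence above when $H = 8$; function and variable names are illustrative:

  ```python
  import torch

  def alibi_bias(n: int, num_heads: int) -> torch.Tensor:
      """Static (num_heads, n, n) bias added to q·k scores before the causal softmax.

      Slopes form a geometric sequence; for num_heads = 8 this is 1/2, 1/4, ..., 1/256.
      (Power-of-two head counts assumed, as in the paper's simplest setting.)
      """
      slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
      i = torch.arange(n).unsqueeze(1)           # query positions
      j = torch.arange(n).unsqueeze(0)           # key positions
      distance = (j - i).clamp(max=0).float()    # 0, -1, -2, ... to the left of the diagonal
      return slopes.view(-1, 1, 1) * distance    # penalty grows with query-key distance

  # scores = q @ k.transpose(-2, -1) + alibi_bias(n, num_heads)
  # Future positions are handled by the usual causal mask before the softmax.
  ```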
- Wu et al.[^6] propose the DA-Transformer (Distance-Aware Transformer), which incorporates the real distance between tokens when re-scaling the raw self-attention weights.
- Rationale: Global and local context modelling usually have different distance preferences for attention.
- In each attention head $i$, we use a learnable scalar parameter $w_i$ to weight the relative distance. Let $\mathbf{R}$ be the relative distance matrix where $R_{jk} = |j - k|$; then $\mathbf{R}^{(i)} = w_i \mathbf{R}$. It is stipulated that a more positive $w_i$ favors long-distance information, while a more negative $w_i$ favors short-term context.
- We then design a function $f$ to obtain the rescaled coefficients $\hat{\mathbf{R}}^{(i)} = f(\mathbf{R}^{(i)})$. The function satisfies:
- $f(0) = 1$, since zero distance should not influence the attention weights.
- $\lim_{x \to -\infty} f(x) = 0$: if attention prefers local information, long-distance information should be suppressed.
- $\lim_{x \to +\infty} f(x)$ is bounded. We introduce this bound so that the model can process long sequences without over-emphasizing distant contexts.
- The scale of $f$ is tunable, to adjust the intensity of distance information.
- $f$ is monotone.
- Our choice for $f$ is a learnable sigmoid function
  $$\hat{R}^{(i)}_{jk} = f\!\left(R^{(i)}_{jk}\right) = \frac{1 + \exp(v_i)}{1 + \exp\!\left(v_i - R^{(i)}_{jk}\right)}$$
  where $v_i$ is a head-specific learnable parameter.
- The re-scaled coefficients are used to adjust the attention weights. We obtain the attention matrix as follows (see the sketch after this list), which is then multiplied with the values $\mathbf{V}^{(i)}$:
  $$\mathbf{A}^{(i)} = \mathrm{softmax}\!\left(\frac{\mathrm{ReLU}\!\left(\mathbf{Q}^{(i)} \mathbf{K}^{(i)\top}\right) \odot \hat{\mathbf{R}}^{(i)}}{\sqrt{d}}\right)$$
- The scaling is multiplicative, since adding the distance term might over-amplify the attention weights.
- The ReLU is incorporated since the raw attention scores $\mathbf{Q}^{(i)} \mathbf{K}^{(i)\top}$ can be both positive and negative, so multiplying them directly would not reflect the distance information consistently.
- ReLU also adds sparsity, since only positive attention scores are amplified.
- The extra time and space complexity for computing the above is quadratic in the sequence length, since $\hat{\mathbf{R}}^{(i)}$ is an $N \times N$ matrix for each head.
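A single-head sketch of the re-scaled attention above; `w_i` and `v_i` are the head-specific learnable scalars, and all other names are illustrative assumptions:

```python
import torch

def da_attention_head(Q, K, V, w_i: torch.Tensor, v_i: torch.Tensor) -> torch.Tensor:
    """Distance-aware attention for one head.

    Q, K, V: (n, d) per-head projections; w_i, v_i: scalar tensors (e.g. nn.Parameter).
    """
    n, d = Q.shape
    pos = torch.arange(n, dtype=torch.float32)
    R = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs()              # R[j, k] = |j - k|
    R_w = w_i * R                                                # head-specific distance weighting
    R_hat = (1 + torch.exp(v_i)) / (1 + torch.exp(v_i - R_w))    # learnable sigmoid, f(0) = 1
    scores = torch.relu(Q @ K.T) * R_hat / d ** 0.5              # multiplicative re-scaling
    return torch.softmax(scores, dim=-1) @ V
```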
Links
- Zhang et al., Ch. 11 - for everything about the basics of the transformer model.
- All about Attention - more about attention.
Footnotes
[^1]: RoPE has all these properties. See https://huggingface.co/blog/designing-positional-encoding
[^2]: Su et al. (2021) RoFormer: Enhanced Transformer with Rotary Position Embedding
[^3]: Dai et al. (2019) Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
[^4]: Shaw, Uszkoreit, Vaswani (2018) Self-Attention with Relative Position Representations
[^5]: Press, Smith, Lewis (2021) Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
[^6]: Wu, Wu, Huang (2021) DA-Transformer: Distance-Aware Transformer