- Sampling in the context of NLP refers to the procedure we use to generate the next token.
- Sampling inherently has an Exploration-Exploitation tradeoff.
    - High exploration would mean more diverse outputs at the cost of quality, cohesion, and truthfulness.
    - High exploitation would mean higher-quality outputs at the cost of diversity and creativity.
- Let $x_t$ be the current word and $x_{<t}$ be the context. Also let $V$ be the vocabulary of tokens.
- Greedy Decoding involves generating the most likely word given the context. That is, $x_t = \arg\max_{w \in V} P(w \mid x_{<t})$.
    - Limitation: Deterministic. The resulting text is thus generic.
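A minimal sketch of greedy decoding, assuming the model's next-token distribution $P(w \mid x_{<t})$ is already available as a NumPy array of probabilities over the vocabulary (the values below are made up for illustration):

```python
import numpy as np

def greedy_decode(probs: np.ndarray) -> int:
    """Pick the single most likely token index from P(w | context)."""
    return int(np.argmax(probs))

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = np.array([0.05, 0.50, 0.20, 0.15, 0.10])
print(greedy_decode(probs))  # always returns 1 -- fully deterministic
```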
- Random Sampling involves choosing $x_t$ randomly based on the resulting probability distribution. Thus $x_t \sim P(w \mid x_{<t})$.
    - Limitation: This will still pick rare words that are very unlikely to make the generated text sensible. While any single rare word has low probability on its own, the many rare words in the tail collectively carry enough probability mass that some rare word is likely to be sampled.
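A corresponding sketch for random sampling, under the same assumption that `probs` holds $P(w \mid x_{<t})$ as a NumPy array:

```python
import numpy as np

def random_sample(probs: np.ndarray, rng: np.random.Generator) -> int:
    """Draw a token index directly from P(w | context); rare words can be picked."""
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.50, 0.20, 0.15, 0.10])
print(random_sample(probs, rng))  # any index can come out, including rare ones
```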
- Top-k sampling generalizes greedy decoding.[^1]
- We do the following (sketched in code below):
    - Choose hyperparameter $k$.
    - Choose the top $k$ words based on the probability distribution $P(w \mid x_{<t})$.
    - Truncate the distribution to only include the top $k$ words.
    - Renormalize the distribution.
    - Sample from the renormalized distribution.
- Limitation: This implicitly assumes that the conditional probability distributions have the same shape regardless of context. However, this is not necessarily the case, as the top-k choices might constitute only a small probability mass.
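A sketch of the steps above, again assuming `probs` is the full next-token distribution as a NumPy array (the helper name `top_k_sample` and the toy values are illustrative):

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Truncate P(w | context) to the k most likely tokens, renormalize, then sample."""
    top_idx = np.argsort(probs)[-k:]             # indices of the top-k tokens
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()      # renormalize the truncated distribution
    return int(rng.choice(top_idx, p=top_probs))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.50, 0.20, 0.15, 0.10])
print(top_k_sample(probs, k=2, rng=rng))         # only indices 1 or 2 can be returned
```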
- Nucleus / Top-p sampling improves top-k sampling by instead choosing based on the top $p$ of the probability mass distribution.
    - The top-$p$ vocabulary $V^{(p)}$ is the smallest non-empty set of tokens such that $\sum_{w \in V^{(p)}} P(w \mid x_{<t}) \geq p$. We sample from the top-$p$ vocabulary.
    - By using probabilities instead of just the top $k$ words, we have a sampling method that is more robust against changes in context.
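A sketch of nucleus sampling under the same assumptions (a NumPy probability array; the name `top_p_sample` is illustrative):

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator) -> int:
    """Sample from the smallest set of tokens whose cumulative probability is >= p."""
    order = np.argsort(probs)[::-1]                    # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # smallest prefix with mass >= p
    keep = order[:cutoff]                              # the top-p vocabulary
    keep_probs = probs[keep] / probs[keep].sum()       # renormalize
    return int(rng.choice(keep, p=keep_probs))

rng = np.random.default_rng(0)
probs = np.array([0.05, 0.50, 0.20, 0.15, 0.10])
print(top_p_sample(probs, p=0.65, rng=rng))            # keeps {1, 2}: 0.5 + 0.2 >= 0.65
```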
- Temperature Sampling involves reshaping the distribution rather than truncating it.
    - We introduce the temperature parameter $\tau$ which controls this reshaping process.
    - We apply the temperature parameter prior to softmax. With $z_w$ denoting the logit of word $w$, the distribution $P_\tau$ is then obtained as $P_\tau(w \mid x_{<t}) = \frac{\exp(z_w / \tau)}{\sum_{w' \in V} \exp(z_{w'} / \tau)}$.
    - When $\tau$ is close to $1$, the distribution doesn't change.
    - When $\tau < 1$, we give more weight to more likely words and less weight to less likely words. As $\tau \to 0$, the most likely word is given probability $1$.
    - When $\tau > 1$, we give less weight to more likely words and more weight to less likely words. As $\tau \to \infty$, the distribution becomes more uniform.
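A sketch of temperature sampling; here the input is the vector of pre-softmax logits $z$ rather than probabilities (the logit values are made up for illustration):

```python
import numpy as np

def temperature_sample(logits: np.ndarray, tau: float, rng: np.random.Generator) -> int:
    """Divide the logits by tau, apply softmax, and sample from the reshaped distribution."""
    scaled = logits / tau
    scaled = scaled - scaled.max()                 # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax over the rescaled logits
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([1.0, 3.0, 2.0, 0.5, -1.0])        # hypothetical pre-softmax scores
print(temperature_sample(logits, tau=0.5, rng=rng))  # tau < 1: sharper, favors index 1
print(temperature_sample(logits, tau=2.0, rng=rng))  # tau > 1: flatter, more diverse
```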
- C5W3LO4 Beam Search - beam search is an algorithm similar to BFS and DFS (but is not guaranteed to find maxima), wherein given beam width $B$, we select the top $B$ most likely partial outputs at each step of the search. The goal is to find the most likely $T$-length sentence using this search.
- C5W3LO4 Refining Beam Search - use length normalization techniques to optimize beam search (maximize the log likelihood, averaged over sentence length).
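A rough sketch of beam search with length normalization, assuming a hypothetical `step_fn(sequence)` callback that returns the model's next-token distribution as a NumPy array; the names here are illustrative, not from the lectures:

```python
import numpy as np

def beam_search(step_fn, start_token: int, beam_width: int, length: int):
    """Keep the beam_width most likely partial sequences at each step (not guaranteed optimal)."""
    beams = [(0.0, [start_token])]               # (log-likelihood, sequence)
    for _ in range(length):
        candidates = []
        for log_p, seq in beams:
            probs = step_fn(seq)                 # P(next token | seq)
            for token, p in enumerate(probs):
                if p > 0:
                    candidates.append((log_p + np.log(p), seq + [token]))
        # keep only the beam_width highest-scoring candidates
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # length normalization: average the log-likelihood over sentence length
    return max(beams, key=lambda c: c[0] / len(c[1]))

# Toy "model": a fixed next-token distribution that ignores the context.
toy_step = lambda seq: np.array([0.1, 0.6, 0.3])
print(beam_search(toy_step, start_token=0, beam_width=2, length=3))
```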
Links
Footnotes
[^1]: When $k = 1$, top-k sampling degenerates to greedy decoding.