In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models. A language model is a probability distribution over sentences: it’s both able to generate plausible human-written sentences (if it’s a good language model) and to evaluate the goodness of already written sentences. Presented with a well-written document, a good language model should be able to give it a higher probability than a badly written document, i.e. it should not be “perplexed” when presented with a well-written document.
Thus, the perplexity metric in NLP is a way to capture the degree of ‘uncertainty’ a model has in predicting (i.e. assigning probabilities to) text. Now, let’s try to compute the probabilities assigned by language models to some example sentences and derive an intuitive explanation of what perplexity is.
Suppose we have trained a small language model over an English corpus. The model is only able to predict the probability of the next word in the sentence from a small subset of six words: “a”, “the”, “red”, “fox”, “dog”, and “.”.
Let’s compute the probability of the sentence W, which is “a red fox.”.
The probability of a generic sentence W, made of the words w1, w2, up to wn, can be expressed as the following:
P(W) = P(w1, w2, …, wn)
Using our specific sentence W, the probability can be extended as the following:
P(“a red fox.”) =
P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)
Suppose our language model has learned a probability distribution over which of the six words can appear as the first word of a sentence. According to this distribution, the probability of “a” as the first word of a sentence is:
P(“a”) = 0.4
Next, the model also defines a distribution over which word follows “a”. According to this distribution, the probability of “red” as the second word in the sentence after “a” is:
P(“red” | “a”) = 0.27
Similarly, suppose the model assigns P(“fox” | “a red”) = 0.55 and P(“.” | “a red fox”) = 0.79.
Finally, the probability assigned by our language model to the whole sentence “a red fox.” is:
P(“a red fox.”) =
P(“a”) * P(“red” | “a”) * P(“fox” | “a red”) * P(“.” | “a red fox”)
= 0.4 * 0.27 * 0.55 * 0.79
= 0.0469
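As a quick sanity check, here is a minimal Python sketch that multiplies these conditional probabilities together. The values are the illustrative ones from the example above, not the output of a real trained model:

```python
import math

# Conditional probabilities from the example above (illustrative values,
# not from a real trained model).
conditional_probs = [
    0.4,   # P("a")
    0.27,  # P("red" | "a")
    0.55,  # P("fox" | "a red")
    0.79,  # P("." | "a red fox")
]

# Chain rule: the sentence probability is the product of the conditionals.
sentence_prob = math.prod(conditional_probs)
print(f"P('a red fox.') = {sentence_prob:.4f}")  # ~0.0469
```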
It would be nice to compare the probabilities assigned to different sentences to see which sentences are better predicted by the language model. However, since the probability of a sentence is obtained from a product of probabilities, the longer the sentence the lower will be its probability (since it’s a product of factors with values smaller than one). We should find a way of measuring these sentence probabilities, without the influence of the sentence length.
This can be done by normalizing the sentence probability by the number of words in the sentence. Since the probability of a sentence is obtained by multiplying many factors, we can average them using the geometric mean.
Let’s call Pnorm(W) the normalized probability of the sentence W. Let n be the number of words in W. Then, applying the geometric mean:
Pnorm(W) = P(W) ^ (1 / n)
Using our specific sentence “a red fox.”:
Pnorm(“a red fox.”) = P(“a red fox.”) ^ (1 / 4) = 0.465
Great! This number can now be used to compare the probabilities of sentences with different lengths. The higher this number is on a well-written sentence, the better the language model.
So, what does this have to do with perplexity? Well, perplexity is just the reciprocal of this number.
Let’s call PP(W) the perplexity computed over the sentence W. Then:
PP(W) = 1 / Pnorm(W)
= 1 / (P(W) ^ (1 / n))
= (1 / P(W)) ^ (1 / n)
This is the formula of perplexity. Since perplexity is just the reciprocal of the normalized probability, the lower the perplexity over a well-written sentence, the better the language model. Let’s try computing the perplexity with a second language model that assigns equal probability to each word at each prediction. Since this language model can predict only six words, the probability of each word will be 1/6.
P(“a red fox.”) = (1/6) ^ 4 = 0.00077
Pnorm(“a red fox.”) = P(“a red fox.”) ^ (1/4) = 1/6
PP(“a red fox.”) = 1 / Pnorm(“a red fox.”) = 6
…which, as expected, is a higher perplexity than the one produced by the well-trained language model.
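Putting the last few steps together, here is a small Python sketch that computes the normalized probability and the perplexity for both the trained model and the uniform six-word model. Again, the probabilities are the illustrative values from the example:

```python
import math

def normalized_prob(conditional_probs):
    """Geometric mean of the conditional word probabilities."""
    p = math.prod(conditional_probs)
    return p ** (1 / len(conditional_probs))

def perplexity(conditional_probs):
    """Reciprocal of the normalized sentence probability."""
    return 1 / normalized_prob(conditional_probs)

trained_model = [0.4, 0.27, 0.55, 0.79]   # factors for "a red fox." (example values)
uniform_model = [1 / 6] * 4               # each of the 6 words is equally likely

print(perplexity(trained_model))  # ~2.15
print(perplexity(uniform_model))  # 6.0
```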
Perplexity can also be computed starting from the concept of Shannon entropy. Let’s call H(W) the entropy of the language model when predicting a sentence W. Then, it turns out that:
PP(W) = 2 ^ (H(W))
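We can check this relationship numerically on the example sentence, computing H(W) as the average negative log2 probability per word (same illustrative probabilities as before):

```python
import math

probs = [0.4, 0.27, 0.55, 0.79]            # conditional probabilities for "a red fox."
n = len(probs)

H = -sum(math.log2(p) for p in probs) / n  # per-word entropy (in bits)
print(2 ** H)                              # ~2.15
print(math.prod(probs) ** (-1 / n))        # same value: inverse normalized probability
```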
This means that, when we optimize our language model, the following objectives are all more or less equivalent:
- Maximizing the normalized probability Pnorm(W) that the model assigns to well-written sentences.
- Minimizing the perplexity PP(W) of the model over well-written sentences.
- Minimizing the entropy H(W) of the model over well-written sentences.
Let’s now go over these ideas again in more detail, starting from the definition of a language model. A language model is a statistical model that assigns probabilities to words and sentences. Typically, we might be trying to guess the next word w in a sentence given all previous words, often referred to as the “history”.
For example, given the history “For dinner I’m making __”, what’s the probability that the next word is “cement”? What’s the probability that the next word is “fajitas”?
Hopefully, P(fajitas|For dinner I’m making) > P(cement|For dinner I’m making).
We are also often interested in the probability that our model assigns to a full sentence W made of the sequence of words (w_1,w_2,…,w_N).
For example, we’d like a model to assign higher probabilities to sentences that are real and syntactically correct.
A unigram model only works at the level of individual words. Given a sequence of words W, a unigram model would output the probability:

P(W) = P(w_1) * P(w_2) * … * P(w_N)
where the individual probabilities P(w_i) could for example be estimated based on the frequency of the words in the training corpus.
An n-gram model, instead, looks at the previous (n-1) words to estimate the next one. For example, a trigram model would look at the previous 2 words, so that:

P(w_i | w_1, …, w_(i-1)) ≈ P(w_i | w_(i-2), w_(i-1))
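As a rough sketch of how these probabilities could be estimated in practice, here is a minimal Python example based on relative frequencies. The tiny corpus and the helper functions (p_unigram, p_trigram) are made up purely for illustration:

```python
from collections import Counter

# A made-up toy corpus, just for illustration.
corpus = "the red fox saw the dog . the dog saw the red fox .".split()

# Unigram model: P(w) is the relative frequency of w in the corpus.
unigram_counts = Counter(corpus)
total = len(corpus)

def p_unigram(w):
    return unigram_counts[w] / total

# Trigram model: P(w | u, v) estimated from counts of (u, v, w) and (u, v).
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, u, v):
    if bigram_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_unigram("the"))                # 4/14
print(p_trigram("fox", "the", "red"))  # 2/2 = 1.0
```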
Language models can be embedded in more complex systems to aid in performing language tasks such as translation, classification, speech recognition, etc.
Perplexity is an evaluation metric for language models. But why would we want to use it? Why can’t we just look at the loss/accuracy of our final system on the task we care about?
We can in fact use two different approaches to evaluate and compare language models:
- Extrinsic evaluation: embedding the language model in a downstream system (e.g. a translation or speech recognition system) and measuring how much that system improves on its task. This is the most direct measure, but it is expensive and slow, since it requires building and training the full system for every model we want to compare.
- Intrinsic evaluation: measuring the quality of the language model itself, independently of any specific application. Perplexity is the standard intrinsic metric: it is much cheaper to compute, although a lower perplexity does not always guarantee better performance on the downstream task.
The most frequently seen definition of perplexity is as the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 … w_N) ^ (-1/N)

In this section, we’ll see why it makes sense.
First of all, what makes a good language model? As mentioned earlier, we want our model to assign high probabilities to sentences that are real and syntactically correct, and low probabilities to fake, incorrect, or highly infrequent sentences. Assuming our dataset is made of sentences that are in fact real and correct, this means that the best model will be the one that assigns the highest probability to the test set. Intuitively, if a model assigns a high probability to the test set, it means that it is not surprised to see it (it’s not perplexed by it), which means that it has a good understanding of how the language works.
However, it’s worth noting that datasets can have varying numbers of sentences, and sentences can have varying numbers of words. Clearly, adding more sentences introduces more uncertainty, so other things being equal a larger test set is likely to have a lower probability than a smaller one. Ideally, we’d like to have a metric that is independent of the size of the dataset. We could obtain this by normalizing the probability of the test set by the total number of words, which would give us a per-word measure.
How do we do this? If what we wanted to normalize was the sum of some terms, we could just divide it by the number of words to get a per-word measure. But the probability of a sequence of words is given by a product.
For example, let’s take a unigram model:

P(W) = P(w_1) * P(w_2) * … * P(w_N)
How do we normalize this probability? It’s easier to do it by looking at the log probability, which turns the product into a sum:

ln P(W) = ln P(w_1) + ln P(w_2) + … + ln P(w_N)
We can now normalize this by dividing by N to obtain the per-word log probability:

(1/N) * ln P(W) = (1/N) * [ln P(w_1) + ln P(w_2) + … + ln P(w_N)]
… and then remove the log by exponentiating:

Pnorm(W) = e ^ ((1/N) * ln P(W)) = P(W) ^ (1/N)
We can see that we’ve obtained normalization by taking the N-th root.
Now going back to our original equation for perplexity, we can see that we can interpret it as the inverse probability of the test set, normalized by the number of words in the test set:

PP(W) = (1 / P(W)) ^ (1/N) = P(w_1 w_2 … w_N) ^ (-1/N)
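In code, working in log space has the extra benefit of avoiding numerical underflow on long test sets. Here is a minimal sketch for a unigram model, assuming p_unigram(w) is a function that returns the model’s probability of word w:

```python
import math

def perplexity(test_words, p_unigram):
    """Perplexity of a unigram model over a list of test words.

    Computed in log space: exp(-(1/N) * sum(log P(w_i))).
    """
    n = len(test_words)
    log_prob = sum(math.log(p_unigram(w)) for w in test_words)
    return math.exp(-log_prob / n)

# Toy usage with a uniform distribution over a 6-word vocabulary:
print(perplexity(["a", "red", "fox", "."], lambda w: 1 / 6))  # 6.0
```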
Note: if you need a refresher on entropy I heartily recommend this document by Sriram Vajapeyam.
Perplexity can also be defined as the exponential of the cross-entropy:

PP(W) = 2 ^ (H(W)), where H(W) = -(1/N) * log2 P(w_1 w_2 … w_N)
First of all, we can easily check that this is in fact equivalent to the previous definition:

2 ^ (H(W)) = 2 ^ (-(1/N) * log2 P(W)) = P(W) ^ (-1/N)
But how can we explain this definition based on cross-entropy?
We know that entropy can be interpreted as the average number of bits required to store the information in a variable, and it’s given by:

H(p) = - Σ_x p(x) * log2 p(x)
We also know that the cross-entropy is given by:

H(p, q) = - Σ_x p(x) * log2 q(x)
which can be interpreted as the average number of bits required to store the information in a variable, if instead of the real probability distribution p we’re using an estimated distribution q.
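For explicit discrete distributions, both quantities are easy to compute directly from their definitions. Here is a small sketch with made-up example distributions:

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Made-up example: true distribution p vs. an estimated distribution q.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.5, "c": 0.25}

print(entropy(p))           # 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits (always >= the entropy of p)
```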
In our case, p is the real distribution of our language, while q is the distribution estimated by our model on the training set. Clearly, we can’t know the real p, but given a long enough sequence of words W (so a large N), we can approximate the per-word cross-entropy using the Shannon-McMillan-Breiman theorem:

H(p, q) ≈ -(1/N) * log2 q(w_1 w_2 … w_N)
Let’s rewrite this to be consistent with the notation used in the previous section. Given a sequence of words W of length N and a trained language model P, we approximate the cross-entropy as:

H(W) = -(1/N) * log2 P(w_1 w_2 … w_N)
Let’s look again at our definition of perplexity:

PP(W) = 2 ^ (H(W))
From what we know of cross-entropy we can say that H(W) is the average number of bits needed to encode each word. This means that the perplexity 2^H(W) is the average number of words that can be encoded using H(W) bits.
How can we interpret this? We can look at perplexity as the weighted branching factor.
So we’ve said: the cross-entropy H(W) is the average number of bits needed to encode each word, and the perplexity PP(W) = 2 ^ (H(W)) is the average number of words that can be encoded using those bits.
For example, if we find that H(W) = 2, it means that on average each word needs 2 bits to be encoded, and using 2 bits we can encode 2² = 4 words.
But what does this mean? For simplicity, let’s forget about language and words for a moment and imagine that our model is actually trying to predict the outcome of rolling a die. A regular die has 6 sides, so the branching factor of the die is 6. The branching factor simply indicates how many possible outcomes there are whenever we roll.
Let’s say we train our model on this fair die, and the model learns that each time we roll there is a 1/6 probability of getting any side. Then let’s say we create a test set by rolling the die 10 more times and we obtain the (highly unimaginative) sequence of outcomes T = {1, 2, 3, 4, 5, 6, 1, 2, 3, 4}. What’s the perplexity of our model on this test set? Since the model assigns probability 1/6 to every outcome, the probability of the test set is (1/6)^10, and:

PP(T) = ((1/6) ^ 10) ^ (-1/10) = 6
So the perplexity matches the branching factor.
Let’s now imagine that we have an unfair die, which rolls a 6 with a probability of 7/12, and all the other sides with a probability of 1/12 each. We again train a model on a training set created with this unfair die so that it will learn these probabilities. We then create a new test set T by rolling the die 12 times: we get a 6 on 7 of the rolls, and other numbers on the remaining 5 rolls. What’s the perplexity now?

PP(T) = ((7/12) ^ 7 * (1/12) ^ 5) ^ (-1/12) ≈ 3.9
The perplexity is lower. This is because our model now knows that rolling a 6 is more probable than any other number, so it’s less “surprised” to see one, and since there are more 6s in the test set than other numbers, the overall “surprise” associated with the test set is lower. The branching factor is still 6, because all 6 numbers are still possible options at any roll. However, the weighted branching factor is now lower, due to one option being a lot more likely than the others. This is like saying that under these new conditions, at each roll our model is as uncertain of the outcome as if it had to pick between 4 different options, as opposed to 6 when all sides had equal probability.
To clarify this further, let’s push it to the extreme. Let’s say we now have an unfair die that gives a 6 with 99% probability, and the other numbers with a probability of 1/500 each. We again train the model on this die and then create a test set with 100 rolls where we get a 6 99 times and another number once. The perplexity is now:

PP(T) = (0.99 ^ 99 * (1/500)) ^ (-1/100) ≈ 1.07
The branching factor is still 6, but the weighted branching factor is now very close to 1, because at each roll the model is almost certain that it’s going to be a 6, and rightfully so. So while technically at each roll there are still 6 possible options, there is only 1 option that is a strong favorite.
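All three die examples can be reproduced with a few lines of Python, building the test sets exactly as described above:

```python
import math

def perplexity(probs_of_outcomes):
    """Perplexity of a test set, given the model probability of each observed outcome."""
    n = len(probs_of_outcomes)
    log_prob = sum(math.log(p) for p in probs_of_outcomes)
    return math.exp(-log_prob / n)

# Fair die: 10 rolls, each with model probability 1/6.
print(perplexity([1 / 6] * 10))                 # 6.0

# Unfair die (6 with prob 7/12): 12 rolls, seven 6s and five other numbers.
print(perplexity([7 / 12] * 7 + [1 / 12] * 5))  # ~3.9

# Very unfair die (6 with prob 0.99): 100 rolls, ninety-nine 6s and one other number.
print(perplexity([0.99] * 99 + [1 / 500]))      # ~1.07
```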
Let’s tie this back to language models and cross-entropy.
First of all, if we have a language model that’s trying to guess the next word, the branching factor is simply the number of words that are possible at each point, which is just the size of the vocabulary.
We said earlier that perplexity in a language model is the average number of words that can be encoded using H(W) bits. We can now see that this simply represents the average branching factor of the model. As we said earlier, if we find a cross-entropy value of 2, this indicates a perplexity of 4, which is the “average number of words that can be encoded”, and that’s simply the average branching factor. All this means is that when trying to guess the next word, our model is as confused as if it had to pick between 4 different words.
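As a final sanity check on this interpretation: a model that assigns uniform probability 1/V to every word in a vocabulary of size V has perplexity exactly V, just like the fair die. The vocabulary sizes below are only examples:

```python
import math

def uniform_perplexity(vocab_size, num_words=100):
    """Perplexity of a uniform model over a vocabulary of `vocab_size` words."""
    probs = [1 / vocab_size] * num_words
    return math.exp(-sum(math.log(p) for p in probs) / num_words)

print(uniform_perplexity(6))       # ~6.0     (the six-word model from earlier)
print(uniform_perplexity(50_000))  # ~50000.0 (a more realistic vocabulary size)
```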
References:
https://towardsdatascience.com/perplexity-in-language-models-87a196019a94