Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation are method of estimating parameters of statistical models.
Despite a bit of advanced mathematics behind the methods, the idea of MLE and MAP are quite simple and intuitively understandable. In this article, I’m going to explain what MLE and MAP are, with a focus on the intuition of the methods along with the mathematics behind.
In 2018-19 season, Liverpool FC won 30 matches out of 38 matches in Premier league. Having this data, we’d like to make a guess at the probability that Liverpool FC wins a match in the next season.
The simplest guess here would be 30/38 = 79%, which is the best possible guess based on the data. This actually is an estimation with MLE method.
Then, assume we know that Liverpool’s winning percentages for the past few seasons were around 50%. Do you think our best guess is still 79%? I think some value between 50% and 79% would be more realistic, considering the prior knowledge as well as the data from this season. This is an estimation with MAP method.
I believe ideas above are pretty simple. But for more precise understanding, I will elaborate mathematical details of MLE and MAP in the following sections.
Before going into each of methods, let me clarify the model and the parameter in this example, as MLE and MAP are method of estimating parameters of statistical models.
In this example, we’re simplifying that Liverpool has a single winning probability (let’s call this as θ) throughout all matches across seasons, regardless of uniqueness of each match and any complex factors of real football matches. On the other words, we’re assuming each of Liverpool’s match as a Bernoulli trial with the winning probability θ.
With this assumption, we can describe probability that Liverpool wins k times out of n matches for any given number k and n (k≤n). More precisely, we assume that the number of wins of Liverpool follows binomial distribution with parameter θ. The formula of the probability that Liverpool wins k times out of n matches, given the winning probability θ, is below.
This simplification (describing the probability using just a single parameter θ regardless of real world complexity) is the statistical modelling of this example, and θ is the parameter to be estimated.
From the next section, let’s estimate this θ with MLE and MAP.
In the previous section, we got the formula of probability that Liverpool wins k times out of n matches for given θ.
Since we have the observed data from this season, which is 30 wins out of 38 matches (let’s call this data as D), we can calculate P(D|θ) — the probability that this data D is observed for given θ. Let’s calculate P(D|θ) for θ=0.1 and θ=0.7 as examples.
When Liverpool’s winning probability θ = 0.1, the probability that this data D (30 wins in 38 matches) is observed is following.
P(D|θ) = 0.00000000000000000000211. So, if Liverpool’s winning probability θ is actually 0.1, this data D (30 wins in 38 matches) is extremely unlikely to be observed. Then what if θ = 0.7?
Much higher than previous one. So if Liverpool’s winning probability θ is 0.7, this data D is much more likely to be observed than when θ = 0.1.
Based on this comparison, we would be able to say that θ is more likely to be 0.7 than 0.1 considering the actual observed data D.
Here, we’ve been calculating the probability that D is observed for each θ, but at the same time, we can also say that we’ve been checking likelihood of each value of θ based on the observed data. Because of this, P(D|θ) is also considered as Likelihood of θ. The next question here is, what is the exact value of θ which maximise the likelihood P(D|θ)? Yes, this is the Maximum Likelihood Estimation!
The value of θ maximising the likelihood can be obtained by having derivative of the likelihood function with respect to θ, and setting it to zero.
By solving this, θ = 0,1 or k/n. Since likelihood goes to zero when θ= 0 or 1, the value of θ maximise the likelihood is k/n.
In this example, the estimated value of θ is 30/38 = 78.9% when estimated with MLE.
MLE is powerful when you have enough data. However, it doesn’t work well when observed data size is small. For example, if Liverpool only had 2 matches and they won the 2 matches, then the estimated value of θ by MLE is 2/2 = 1. It means that the estimation says Liverpool wins 100%, which is unrealistic estimation. MAP can help dealing with this issue.
Assume that we have a prior knowledge that Liverpool’s winning percentage for the past few seasons were around 50%.
Then, without the data from this season, we already have somewhat idea of potential value of θ. Based (only) on the prior knowledge, the value of θ is most likely to be 0.5, and less likely to be 0 or 1. On the other words, the probability of θ=0.5 is higher than θ=0 or 1. Calling this as the prior probability P(θ), and if we visualise this, it would be like below.
Then, having the observed data D (30 win out of 38 matches) from this season, we can update this P(θ) which is based only on the prior knowledge. The updated probability of θ given D is expressed as P(θ|D) and called the posterior probability.
Now, we want to know the best guess of θ considering both our prior knowledge and the observed data. It means maximising P(θ|D) and it’s the MAP estimation.
The question here is, how to calculate P(θ|D)? So far in this article, we checked the way to calculate P(D|θ) but haven’t seen the way to calculate P(θ|D). To do so, we need to use Bayes’ theorem below.
I don’t go deep into Bayes’ theorem in this article, but with this theorem, we can calculate the posterior probability P(θ|D) using the likelihood P(D|θ) and the prior probability P(θ).
There’s P(D) in the equation, but P(D) is independent to the value of θ. Since we’re only interested in finding θ maximising P(θ|D), we can ignore P(D) in our maximisation.
The equation above means that the maximisation of the posterior probability P(θ|D) with respect to θ is equal to the maximisation of the product of Likelihood P(D|θ) and Prior probability P(θ) with respect to θ.
We discussed what P(θ) means in earlier part of this section, but we haven’t go into the formula yet. Intrinsically, we can use any formulas describing probability distribution as P(θ) to express our prior knowledge well. However, for the computational simplicity, specific probability distributions are used corresponding to the probability distribution of likelihood. It’s called conjugate prior distribution.
In this example, the likelihood P(D|θ) follows binomial distribution. Since the conjugate prior of binomial distribution is Beta distribution, we use Beta distribution to express P(θ) here. Beta distribution is described as below.
Where, α and β are called hyperparameter, which cannot be determined by data. Rather we set them subjectively to express our prior knowledge well. For example, graphs below are some visualisation of Beta distribution with different values of α and β. You can see the top left graph is the one we used in the example above (expressing that θ=0.5 is the most likely value based on the prior knowledge), and the top right graph is also expressing the same prior knowledge but this one is for the believer that past seasons’ results are reflecting Liverpool’s true capability very well.
A note here about the bottom right graph: when α=1 and β=1, it means we don’t have any prior knowledge about θ. In this case the estimation will be completely same as the one by MLE.
So, by now we have all the components to calculate P(D|θ)P(θ) to maximise.
As same as MLE, we can get θ maximising this by having derivative of the this function with respect to θ, and setting it to zero.
By solving this, we obtain following.
In this example, assuming we use α=10 and β=10, then θ=(30+10–1)/(38+10+10–2) = 39/56 = 69.6%
MLE, MAP and Bayesian inference are methods to deduce properties of a probability distribution behind observed data. That being said, there’s a big difference between MLE/MAP and Bayesian inference.
Let’s start from the recap of MLE and MAP.
Given the observed data D, estimations of a probabilistic model’s parameter θ by MLE and MAP are the following.
MLE gives you the value which maximises the Likelihood P(D|θ). And MAP gives you the value which maximises the posterior probability P(θ|D). As both methods give you a single fixed value, they’re considered as point estimators.
On the other hand, Bayesian inference fully calculates the posterior probability distribution, as below formula. Hence the output is not a single value but a probability density function (when θ is a continuous variable) or a probability mass function (when θ is a discrete variable).
This is the difference between MLE/MAP and Bayesian inference. MLE and MAP returns a single fixed value, but Bayesian inference returns probability density (or mass) function.
But why we even need to fully calculate the distribution, when we have MLE and MAP to determine the value of θ ?To answer this question, let’s see the case when MAP (and other point estimators) doesn’t work well.
Assume you’re in a casino with full of slot machines with 50% winning probability. After playing for a while, you heard the rumour that there’s one special slot machine with 67% winning probability.
Now, you’re observing people playing 2 suspicious slot machines (you’re sure that one of those is the special slot machine!) and got the following data.
Machine A: 3 wins out of 4 plays
Machine B: 81 wins out of 121 plays
By intuition, you would think machine B is the special one. Because 3 wins out of 4 plays on machine A could just happen by chance. But machine B’s data doesn’t look like happening by chance.
But just in case, you decided to estimate those 2 machines’ winning probabilities by MAP with hyperparameters α=β=2. (Assuming that the results (k wins out of n plays) follow binomial distribution with the slot machine’s winning probability θ as its parameter.)
The formula and results are below.
Machine A: (3+2–1)/(4+2+2–2) = 4/6 = 66.7%
Machine B: (81+2–1)/(121+2+2–2) = 82/123 = 66.7%
Unlike your intuition, estimated winning probability θ by MAP for the 2 machines are exactly same. Hence, by MAP, you cannot determine which one is the special slot machine.
But really? Isn’t it looking obvious that Machine B is more likely to be the special one?
To see if there really be no difference between machine A and machine B, let’s fully calculate the posterior probability distribution, not only MAP estimates.
In the case above, the posterior probability distribution P(θ|D) is calculated as below. (Detailed computation will be covered in the next section.)
And P(θ|D) for machine A and machine B are drawn as below.
Although both distributions have their mode on θ=0.666… (that’s why their MAP estimates are the same value), the shapes of the distributions are quite different. Density around the mode is much higher in the distribution of machine B than the one of machine A. This is why you want to calculate full distribution, not only the MAP estimate.
As we skipped the computation of P(θ|D) in the previous section, let’s go through the detailed calculation process in this section.
Both MAP and Bayesian inference are based on Bayes’ theorem. The computational difference between Bayesian inference and MAP is that, in Bayesian inference, we need to calculate P(D) called marginal likelihood or evidence. It’s the denominator of Bayes’ theorem and it assures that the integrated value* of P(θ|D) over all possible θ becomes 1. (* Sum of P(θ|D), if θ is a discrete variable.)
P(D) is obtained by marginalisation of joint probability. When θ is a continuous variable, the formula is as below.
Considering the chain rule, we obtain the following formula.
Now, put this into the original formula of the posterior probability distribution. Calculating below is the goal of Bayesian Inference.
Let’s calculate P(θ|D) for the case above.
Beginning with P(D|θ) — Likelihood — which is the probability that data D is observed when parameter θ is given. In the case above, D is “3 wins out of 4 matches”, and parameter θ is the winning probability of machine A. As we assume that the number of wins follows binomial distribution, the formula is as below, where n is the number of matches and k is the number of wins.
Then P(θ) — the prior probability distribution of θ — which is the probability distribution expressing our prior knowledge about θ. Here, specific probability distributions are used corresponding to the probability distribution of Likelihood P(D|θ). It’s called conjugate prior distribution.
Since the conjugate prior of binomial distribution is Beta distribution, we use Beta distribution to express P(θ) here. Beta distribution is described as below, where α and β are hyperparameters.
Now we got P(D|θ)P(θ) — the numerator of the formula — as below.
Then, P(D) — the denominator of the formula — is calculated as follows. Note that the possible range of θ is 0 ≤ θ ≤ 1.
With Euler integral of the first kind, the above formula can be deformed to below.
Finally, we can obtain P(θ|D) as below.
As you may have noticed, the estimate by MAP is the mode of the posterior distribution. But we can also use other statistics for the point estimation, such as expected value of θ|D. The estimation using the expected value of θ|D is called Expected A Posteriori.
Let’s estimate the winning probability of the 2 machines using EAP. From the discussion above, P(θ|D) in this case is below.
Thus the estimate is described as below.
Hence, EAP estimate of 2 machines’ winning probabilities with hyperparameters α=β=2 are below.
Machine A: (3+2)/(4+2+2) = 5/8 = 62.5%
Machine B: (81+2)/(121+2+2) = 83/125 = 66.4%
As seen above, Bayesian inference provides much more information than point estimators like MLE and MAP. However, it also has a drawback — the complexity of its integral computation. The case in this article was quite simple and solved analytically, but it’s not always the case in real-world applications. We then need to use MCMC or other algorithms as a substitute for the direct integral computation.
Hope this article helped you to understand Bayesian inference.