In this note, we examine (from a problem in Duda and Hart) the EM algorithm to maximize the probability of ‘good’ data in a missing data problem. I found it enormously insightful to work it out. Also, for similar reasons, I have grown extremely fond of Duda and Hart. We should note that the treatment is slightly different in Bishop, in that they approach it from a latent variable standpoint, motivated by the mixtures of gaussians formulations. I think I have been able to reconcile the differences between the two treatments, which seem to be primarily notational in nature. In either case, one wishes to estimate parameters that maximize some likelihood function by marginalizing over the missing variables. There is a much pithier version in Wikipedia (Expectation Maximization), but well, it is ordained that we must all elicit satisfaction in rediscovering the wheel, as it were. The point is, does this likelihood function actually maximize the correct thing? That is what we wish to prove here. Our goal is to find a set of parameters that maximizes the probability of the ‘good’ data (yes, ‘good’, ‘good’, ‘good’, ‘good’). This is the crux of the EM algorithm, the meaning of which is contained in the famed function from Dempster 1977 which one should some day work out in full.

Consider the EM algorithm for a dataset consisting of good data and missing or bad data ; .

In the above, it is to be noted that want to maximize the probability of the ‘good’ data . This is not quite the same as maximizing the probability of ‘complete’ dataset . In terms of notation, we should emphasize that the maximization occurs for a given parameter state , which we wish to discover through EM. So the probability is properly written as , although for obvious reasons it is quite cumbersome to carry it around.

The likelihood in question manifests as an expectation over the complete dataset, and the expectation is carried over . Now, in Bishop (and in other places), they denote this by . Here, is the missing or bad data (or latent data, as referred to in Bishop), and is the good or known data. The complete dataset probability would be be .

The recipe that is given in the books is to carry out an expectation over the conditional: or , which appears to be totally arbitrary on first impressions. We show here that it increases the likelihood function with each iteration.

Here is a brief description of the EM procedure. Given a parametrized model we wish to obtain which maximizes the probability of good data , as embodied by the likelihood function .

In the E-step we set , and get the expectation of over . In the M step, we set our parameter to that which maximizes .

E-step:

M-step:

In Bishop’s notation (with and ), the E step becomes:

Or in other words, we compute the expectation over . This is mentioned in his E-M chapter, and then extended in the chapter on variational inference.

Bishop (as in, Bishop the book) states that this procedure might appear arbitrary. Why, one might ask, would one want to take the expectation over the conditional and how does it help? Or even, if one is naive, what might we be trying to compute from this procedure.

This becomes apparent from the problem in Duda and Hart. Incidentally, the book also has a very illustrative example to clarify the machinery. The point is that we only have ‘good’ data, so we must find a set of parameters that maximizes the likelihood of the observed, good data. Our problem statement is thus to maximize . Owing to practicalities which become clear when we work through the mixtures of gaussians parameter estimation problem, this (the complete dataset formulation) brings about a simplification in the algebra.

Our aim is to show that EM maximizes the (log) likelihood of good data. Putting it more concretely, we must show that the EM iterations increase the likelihood of good data with each iteration.

Set – this is the value of we started with in the first iteration.

We wish to show that after each iteration. Now, since we maximize , we have .

We therefore only need to prove that the second term (which looks like an entropy) increases with each iteration.

i.e. (notice that the original equation takes the negative of ).

Use the identity (simply check that and the functions are monotonic). Then the integral becomes

Using Bayes again, we find that (after omitting for notational convenience)

Therefore the expression in the integral containing the inequality above really evaluates to zero and we have

Rewriting the equations for convenience we have

We have hence proved that EM increases the likelihood of ‘good’ data after each iteration.