Vida de Niki

Section V: What the Thunder Said.

Lines 411-414

“Dayadhvam: I have heard the key

Turn in the door once and turn once only

We think of the key, each in his prison

Thinking of the key, each confirms a prison.”


            Regardless of Eliot’s theory of the objective correlative (according to which it would be enough for me to infer the presence here of some oriental wisdom lecturing on solitude and selfishness), I will start this paper by giving the sources of these verses. After all, no man is like another, or knows what another knows, and a much vaster tradition permeated Eliot than it does me, and probably many others.

            These sources are three-fold: mythical, literary, and philosophical.

In the first place, we find a reference to the Hindu fables called Upanishads. Gods, men, and demons, having concluded their study of sacred knowledge, ask their father Prajapati…


The Objective Function in the Variational Autoencoder

In this note, I derive the VAE objective function. The basic idea is to obtain a lower bound for the probability of interest p(X). This is often said to be a cheaper alternative to Monte Carlo sampling. We would like to replace \log p(X) with a lower bound \mathcal{L}(X, \theta) and maximize \mathcal{L}. In other words, the difference between \log p(X) and \mathcal{L}(X, \theta) is to be minimized (this difference turns out to be a Kullback-Leibler divergence). The procedure is very similar to EM, where we also discover parameters that maximize \log p(X|\theta). Thus, at every iteration, we update the distribution parameters so that the log likelihood increases.

The VAE is a generative process. Very, very loosely (sorry, I need my tricycle training wheels on) we have the following. There is input X, from which we obtain Z; we therefore need p(Z|X). Then we sample Z (for which we need a prior p(Z)) to generate X. The process is laid out as neural networks, with some parameters to propagate the inputs x to the intermediate quantities of p(Z|X), manufacture Z, pass it on to the generating layers in p(X|Z), and then obtain X. p(Z|X) is an encoding distribution, since it encodes the values of X into Z. In the same vein, p(X|Z) is a generating distribution, because we generate new X from Z.

The catch here is that the VAE uses an approximate distribution q(Z|X) instead of p(Z|X). We sample from this approximate posterior, and take expectations. This process involves some ingenuity. The details are available in the original paper by Kingma and Welling, the tutorial by Carl Doersch, and web posts (e.g. Dustin Tran, kvfrans). Special mention must be made of Shakir Mohamed’s blog ‘The Spectator’, which is not only instructive but also propagates joy, in a manner of speaking (Rezende and Mohamed, 2014 has a complementary presentation of the gradient descent treatment of variational inference for deep latent generative models). See the ‘reparametrization’ trick and the ‘log-derivative’ trick in Shakir’s blog.
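As a concrete illustration of the reparametrization trick mentioned above, here is a minimal one-dimensional Python sketch (the function name and the scalar setting are my own simplifications): instead of sampling z directly from q(Z|X) = N(mu, sigma^2), we write z as a deterministic function of (mu, log sigma^2) plus independent standard normal noise, so that gradients can flow through mu and sigma.

```python
import math
import random

def reparametrize(mu, log_var, eps=None):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    Writing the sample as a deterministic function of (mu, log_var)
    plus independent noise is what lets gradient descent see mu and
    log_var; sampling z directly would block the gradient.
    """
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps
```

With eps held fixed the sample is reproducible: reparametrize(1.0, 0.0, eps=0.0) returns 1.0, since sigma = exp(0) = 1 and the noise contributes nothing.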

In the VAE, we state our problem in terms of the lower bound. Since this lower bound is to be maximized, the problem then becomes one of optimization, rather than say, as is mentioned in Blei’s review paper “Variational Inference: A Review for Statisticians”, one of sampling from a Markov chain. The way they go about this is to cast it as a neural network whose parameters are learnt with gradient descent. Latent variables {Z} are used as generators from which we will re-create new data {X}. Our goal is to get a handle on {p(Z|X)} so that we remain faithful to the data. But instead of {p(Z|X)}, we work with an approximation {q(Z|X)}.

1. Deriving the VAE objective

Our exercise here is to elaborate on how the VAE objective is derived. But before we do that, we should note its macroscopic form involving the lower bound for the log-likelihood of the measure of interest {X}. Here, {X} could be some input such as an image. We posit that there are hidden variables {Z} from which we can produce {X}. Furthermore, we introduce parameters {\theta} and {\phi} predicating the generating/decoding process {p_\theta (X|Z)} and the encoding process {q_\phi (Z|X)}. We set up the problem as a neural network, to maximize the objective function and learn the parameters using gradient descent.

The lower bound {\mathcal{L}} is a surrogate for our distribution of interest {p(X)}. In the VAE paper’s terminology, {p(X)} comes from a generating process, parametrized by {\theta}. More particularly, we attempt to learn {\theta} from {p(X|Z)}. The objective is the following:

\displaystyle \begin{array}{ccc} \log p_\theta (X) \geq \mathcal{L}(\theta, \phi; X) = E_{z\sim q_{\phi}(Z|X)} (\log p_\theta(X|Z)) - \mathcal{D}_{KL}(q_{\phi}(Z|X)||p_\theta(Z)) \end{array} \ \ \ \ \ (1)

Let us derive the aforementioned. Just as in the earlier post where we proved that EM is guaranteed to increase the log likelihood of {X} at every iteration, we start by invoking Bayes on {p(X)}.

\displaystyle p(X) = \frac{p(X, Z)}{p(Z|X)} \ \ \ \ \ (2)

Taking logs, writing {p(X, Z) = p(X|Z)p(Z)}, and adding and subtracting the log of the proxy sampling distribution {q(Z|X)}, we get

\displaystyle \log p(X) = \log p(X|Z) - \log \left\{ \frac{q(Z|X)}{p(Z)} \right\} +\log \left\{ \frac{q(Z|X)}{p(Z|X)} \right\} \ \ \ \ \ (3)

In the textbooks (say, Bishop), they use {q(Z)} as the sampling distribution. Kingma and Welling frame the problem in terms of an encoder-decoder process, so they use the conditional form {q(Z|X)} instead of {q(Z)}. In this view, we would like to manufacture hidden variables {Z} that come from {X}. Therefore, for a given {X} we would like to get a handle on what {Z} is, given some assumptions about the form that this distribution will take: the conditionals p(X|Z) and q(Z|X) are assumed to be Gaussians with trainable means and diagonal covariances.
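Under the Gaussian assumptions above, the KL term between the encoder q(Z|X) = N(mu, diag(sigma^2)) and a standard normal prior p(Z) = N(0, I) has a well-known closed form, 0.5 * sum_j (mu_j^2 + sigma_j^2 - 1 - log sigma_j^2). A small sketch (the function name is mine, and the standard normal prior is an assumption matching the common choice):

```python
import math

def kl_diag_gaussian_vs_std_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form:
       0.5 * sum_j (mu_j^2 + sigma_j^2 - 1 - log sigma_j^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

As a sanity check, kl_diag_gaussian_vs_std_normal([0.0], [0.0]) returns 0.0, since q then coincides with the prior; any other (mu, log_var) gives a strictly positive value.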

Next, we recast the above equation in terms of expectations taken over the sampling distribution {q(Z|X)} (or more precisely, {q_\phi (Z|X)}, with {\phi} being the parameters of the neural network used to train {\mu(x)} and {\Sigma(x)} for the Gaussian implied above).

While taking expectations, we will make use of the following identity to marginalize over {Z} (it holds for any conditional density, {q(Z|X)} included).

\displaystyle \int p(Z|X) dZ = \int \frac{p(Z,X)}{p(X)} dZ = \frac{p(X)}{p(X)} = 1 \ \ \ \ \ (4)

Take expectations:

\displaystyle \begin{array}{lll} \int \log p(X) q(Z|X) dZ = & \int \log p(X|Z) q(Z|X) dZ \\ & -\int \log \left\{ \frac{q(Z|X)}{p(Z)} \right\} q(Z|X) dZ +\int \log \left\{ \frac{q(Z|X)}{p(Z|X)} \right\} q(Z|X) dZ \end{array} \ \ \ \ \ (5)


\displaystyle \log p(X) = E_{z\sim q(Z|X)} \log p(X|Z) - \mathcal{D}_{KL}(q(Z|X)||p(Z)) +\mathcal{D}_{KL} (q(Z|X)||p(Z|X)) \ \ \ \ \ (6)

Rearrange to get

\displaystyle \log p(X) - \mathcal{D}_{KL} (q(Z|X)||p(Z|X)) = E_{z\sim q(Z|X)} \log p(X|Z) - \mathcal{D}_{KL}(q(Z|X)||p(Z)) \ \ \ \ \ (7)

Recognizing that the Kullback-Leibler divergence is a non-negative quantity, we can state that the LHS (and hence the RHS, to which it is equal) is a lower bound for the log-likelihood {\log p(X)}.

\displaystyle \log p(X) \geq \log p(X) - \mathcal{D}_{KL} (q(Z|X)||p(Z|X)) = \mathcal{L} \ \ \ \ \ (8)

When we view these in the context of the encoder-decoder process of the VAE, we must include the parameters of the distributions. {p_\theta (X)} comes from a generating process, parametrized by {\theta}. {q_\phi(Z|X)} is the variational distribution, parametrized by {\phi}. We must also include a prior {p(Z)}, since {Z} is marginalized over and the LHS is not a function of {Z}.

Putting it all together, we get the objective function for the lower bound. This lower bound must be maximized so that our approximation {\mathcal{L}} comes as close as possible to our original probability {\log p(X)}. In other words, we need to find parameters {\theta, \phi} that maximize {\mathcal{L}}.

\displaystyle \mathcal{L}(\theta, \phi; X) = E_{z\sim q_{\phi}(Z|X)} (\log p_\theta(X|Z)) - \mathcal{D}_{KL}(q_{\phi}(Z|X)||p_\theta(Z)) \ \ \ \ \ (9)
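To make the objective concrete, here is a toy one-dimensional Monte Carlo estimate of \mathcal{L}: the expectation term is approximated by sampling z from q via the reparametrization, and the KL term against a standard normal prior is computed in closed form. The decoder, the unit observation noise, and all names here are illustrative assumptions, not the paper's architecture.

```python
import math
import random

def elbo_estimate(x, mu_z, log_var_z, decode, n_samples=10):
    """Monte Carlo estimate of L = E_q[log p(x|z)] - KL(q(z|x) || p(z)),
    assuming q(z|x) = N(mu_z, sigma_z^2), p(z) = N(0, 1), and
    p(x|z) = N(decode(z), 1) -- deliberately minimal, scalar choices.
    """
    sigma_z = math.exp(0.5 * log_var_z)
    recon = 0.0
    for _ in range(n_samples):
        z = mu_z + sigma_z * random.gauss(0.0, 1.0)  # reparametrized sample
        x_hat = decode(z)
        # log N(x; x_hat, 1): Gaussian log-likelihood with unit variance
        recon += -0.5 * math.log(2 * math.pi) - 0.5 * (x - x_hat) ** 2
    recon /= n_samples
    # Closed-form KL( N(mu_z, sigma_z^2) || N(0, 1) )
    kl = 0.5 * (mu_z ** 2 + math.exp(log_var_z) - 1.0 - log_var_z)
    return recon - kl
```

In a real VAE the two terms are produced by the encoder and decoder networks and maximized with gradient descent; here decode is just a function argument, so the sketch only exhibits the structure of equation (9).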

The key and the prison

The Waste Land is a nebulous and murky soup of words pieced together from various literary sources, with any number of interpretations. In keeping with my current somewhat reflective mood (a good thing, because I seldom allow myself the time to reflect these days; Eliot might chuckle wryly from beyond his grave) I am musing over it by concocting this dour interpretation of his corruption of “Dayadhvam” from the Brihadaranyaka Upanishad. Thanks to this, I was actually going over Eknath Easwaran’s translation of the Upanishads, and while I didn’t get to this particular Upanishad (not that I gave more than a cursory treatment to the ones preceding it), I did see that it had a section on

“What the thunder said”.

The words ring through with a reasonable degree of similarity [Oh hello, generative model, can you capture upanishad and dump out Eliot reasonably unsatisfactorily?].

The Dayadhvam passage says the following:

“We think of the key, each in his prison

Thinking of the key, each confirms a prison”


Casting it in terms of questions, answers, and the number forty-two, Dayadhvam represents the trap inherent in discovering the answer to life, the universe, and everything, which, without a question underlying it, is meaningless.

Likewise, I thought I might want to cavil gently about the reasoning that behind any action is an underlying motivation, a plot whose end goal is to produce some kind of dog-treat, to be coveted and worked for, and serving as an end to be savored. We are all looking for keys that give us something: water, houses, loan approvals, legacies, and so on and so forth. In fact, we are even encouraged to think in these terms when we look at current wisdom on goal orientation, on getting the maximum out of a deal (works well for one side), and on winning at all costs. But at some level, I feel that it all extols grasping behavior, which may or may not translate to happiness. And thus, does it even do to think of keys that open up some imaginary vault containing unfathomable riches?

As a working model though, it does explain many of our actions. Also, people seem to judge other people’s motivations based on this simplistic notion that there must be something (usually material) that drives whatever they are doing at any given moment.

Going south in the winter

The sad voice of Marie in “The Waste Land”, recalling in a wistful tone her frightening, exhilarating trips sliding down the mountain slopes on a sled, and the dull meaninglessness of her life nowadays, when all she can do is read much of the night and go south in the winter, is quite evocative of some of our slightly banal lives.

There are some similarities to Mrs. Dalloway – a depressing book that I read some ten years ago – where one could say that her life is as good as done, but it might appear that Eliot’s poem is rebelling against its fate, whereas Mrs. Dalloway is simply going through the motions (her alter ego, Septimus Warren Smith, obliges by committing suicide). I should read that book again. It was beautiful, somehow.

I am going to play some more blitz before setting my lands in order for the day. Oh, I learnt with some delight that the so-called probit function is nothing but the inverse of our beloved error function (or rather, of the integral of the Gaussian) that we see so often in combustion equations.




The ELBO – another view of EM

I found this discussion pertaining to EM – mostly motivated by the EM/mixtures-of-Gaussians ideas in Murphy – quite illuminating yesterday, while I was killing time at a Castro Street cafe in Mountain View. So I gave Bishop (naturally, the mother of all books relating to this subject) a dusting in the hopes of gleaning from it a little insight.

The EM procedure can be seen in light of a quantity called the ‘lower bound’, which then leads us to other things, notably, variational inference. This lower bound is referred to in places as the “ELBO” (evidence lower bound). When cast this way, we choose a certain function to base our operations around – referred to as ‘q’. Then the E and M steps take on some meaning. The E step maximizes the lower bound keeping the old parameters \theta^{old} fixed, which occurs when the Kullback-Leibler divergence (the second term) vanishes. And the M step maximizes the lower bound with respect to \theta, thus increasing the value of \log p(x). Our aim is, after all, to maximize the likelihood with respect to \theta. Let us not forget, while we are grappling with all this machinery, that we would like to get MLE estimates for \theta, given that we have a dataset with observed data X and hidden data Z, and the complete dataset is (X, Z).

\begin{array}{ccc} \log p(X|\theta) = \sum \limits_Z q(Z) \log\left \{ \frac{p(X,Z|\theta)}{q(Z)} \right \} - \sum \limits_Z q(Z) \log\left \{ \frac{p(Z|X, \theta)}{q(Z)} \right \}  \end{array}

The terms containing q cancel out, and the remainder lends itself to the invocation of Bayes, thus giving the LHS. I find it remarkable and beautiful that we can concoct an artificial expression such as the above and base whole theories upon it. But then, maybe that is the way the world works, and that this world view constitutes that of the neophyte, naively inspecting the world with his myopic eyes.

In any case,  the first term in the RHS is the “ELBO”: \mathcal{L}(q, \theta), and the second term is the Kullback-Leibler divergence, a peculiar quantity that is always greater than or equal to zero: KL(q||p) (prove it!).

\begin{array}{ccc}  \log p(X|\theta)  = \mathcal{L}(q, \theta) + KL(q||p)\\  \mathcal{L}(q,  \theta)  = \sum \limits_Z q(Z) \log \left \{ \frac{p(X, Z|\theta)}{q(Z)} \right \} \\  KL(q||p) = - \sum \limits_Z q(Z) \log \left \{ \frac{p(Z|X, \theta)}{q(Z)} \right \}  \end{array}
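The decomposition can be checked numerically with a toy discrete model: take Z \in \{0, 1\}, fix the joint p(x, Z) at the observed x, pick any q(Z), and verify that \log p(x) = \mathcal{L}(q) + KL(q||p) holds exactly (the numbers below are arbitrary):

```python
import math

# Toy discrete model: Z in {0, 1}, theta fixed and suppressed.
p_joint = [0.3, 0.1]   # p(x, Z=0), p(x, Z=1) at the observed x
p_x = sum(p_joint)     # p(x), by marginalizing over Z
q = [0.6, 0.4]         # an arbitrary distribution q(Z)

# ELBO: sum_Z q(Z) log{ p(x, Z) / q(Z) }
elbo = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))
# KL(q || p(Z|x)) = -sum_Z q(Z) log{ p(Z|x) / q(Z) }
kl = -sum(qz * math.log((pj / p_x) / qz) for qz, pj in zip(q, p_joint))

# The decomposition log p(x) = L(q) + KL(q||p) holds exactly,
# and the KL term is non-negative, as it must be.
assert abs(math.log(p_x) - (elbo + kl)) < 1e-12
assert kl >= 0.0
```

Trying other values of q shows the trade-off directly: the closer q gets to the true posterior [0.75, 0.25], the smaller the KL term and the tighter the bound.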

Note also that we could normalize q(Z) so that it becomes a distribution, whereupon the quantities are simply going to become expectations with respect to this q.

In the E step, they choose \theta = \theta^{old} and maximize the lower bound, which happens when we set the Kullback-Leibler divergence to zero, or when we set (with the LHS being \log p(X|\theta^{old}))

q = p(Z|X, \theta^{old})

The term that remains is the lower bound, which is maximized in the M step:

\begin{array}{ccc}  \mathcal{L}(q, \theta) &=& \sum \limits_Z p(Z|X, \theta^{old}) \log \left \{ \frac{ p(Z, X|\theta)}{p(Z|X, \theta^{old})} \right \} \\  &=& \sum \limits_Z p(Z|X, \theta^{old}) \log p(Z, X|\theta) - \sum \limits_Z p(Z|X, \theta^{old}) \log p(Z|X, \theta^{old}) \\  &=& Q(\theta, \theta^{old}) - \sum \limits_Z p(Z|X, \theta^{old}) \log p(Z|X, \theta^{old})  \end{array}

The second term in the above is an entropy – a constant with respect to \theta, so when we maximize the ELBO, we do it on the familiar Q(\theta, \theta^{old}).

We pick \theta such that (as per the recipe)

\theta = \underset{\theta}{\arg \max} \ Q(\theta, \theta^{old})

Note also, that after the M step, our Kullback-Leibler divergence will be non-zero.


\begin{array}{ccc}  KL(q||p) = \sum \limits_Z p(Z|X, \theta^{old}) \log \left \{ \frac{p(Z|X, \theta^{old})}{p(Z|X, \theta)} \right \} > 0  \end{array}

All that being said and done, this exercise seems to be a way of introducing us to the approximate distribution q(Z) so that we may get started with variational inference, which picks, from a tractable family, the distribution with the smallest Kullback-Leibler divergence to the true posterior.
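The E and M steps above can be sketched in code for the simplest mixture case: two unit-variance Gaussians with equal, fixed weights, where only the means are estimated (a deliberate simplification; the function names are mine). The sketch records the log likelihood after each iteration, which, per the argument above, never decreases:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mu, n_iters=20, pi=0.5, var=1.0):
    """EM for a two-component mixture with fixed weights and variances,
    estimating only the means. Returns the means and the log-likelihood
    trace, which EM guarantees to be non-decreasing."""
    mu1, mu2 = mu
    logliks = []
    for _ in range(n_iters):
        # E step: responsibilities r_n = p(z_n = 1 | x_n, theta_old)
        r = []
        for x in data:
            a = pi * normal_pdf(x, mu1, var)
            b = (1 - pi) * normal_pdf(x, mu2, var)
            r.append(a / (a + b))
        # M step: maximize Q -> responsibility-weighted averages
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
        logliks.append(sum(math.log(pi * normal_pdf(x, mu1, var)
                                    + (1 - pi) * normal_pdf(x, mu2, var))
                           for x in data))
    return (mu1, mu2), logliks
```

On well-separated data such as [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1], starting from means (-1, 1), the estimates settle near -2 and 2 and the recorded log likelihood climbs monotonically.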

The hyacinth girl

This is a glorious beginning to the cruelest month of the year, April, and one must mark it.

I had taken to tweeting fragmentary bits of thought relating to “The Waste Land” – as Eliot himself might have said, the little fire has turned into a conflagration, burning with quadrupled intensity. Nevertheless, it is apparent (not that one really cares) that tweeting does justice neither to the obsession, nor to myself, for its expressive power is limited, and we are prone to verbal diarrhea. So here we are, manically swinging between the aridity of twitter’s word limit and drowning ourselves in a torrent of words, making all sorts of apologies to that effect, mixed metaphors and all.

Several snippets can be found in the poem where a person in question is not a person, but could instead be a zombie, missing his or her soul, an aftermath of some sort of gargantuan nervous breakdown.

Here is one.

(From the hyacinth girl)

“… I could not speak, and my eyes failed
I was neither living nor dead, and I knew … NOTHING.
Looking into the heart of light, the silence
Öd’ und leer das Meer.

I can only imagine the emptiness of this poor creature, suffering from some unimaginable pain when she says this. We should see (especially) Fiona Shaw’s chilling rendition to get a feel. Her eyes look somewhere far into the distance in a state of raving madness. Naturally, one can put many different spins on it based on how we identify with the scene. One of them that crossed my mind is PTSD (or Post Traumatic Stress Disorder), a nervous breakdown (not that I have seen a case in real life) that we had seen alluded to in TV serials growing up – I say “growing up” because nowadays, our lives have altered and we can’t sit and watch TV for more than ten minutes at a time (sigh).

On estimating parameters that maximize the likelihood of good data in the EM procedure

In this note, we examine (from a problem in Duda and Hart) the EM algorithm to maximize the probability of ‘good’ data in a missing data problem. I found it enormously insightful to work it out. Also, for similar reasons, I have grown extremely fond of Duda and Hart. We should note that the treatment is slightly different in Bishop, in that they approach it from a latent variable standpoint, motivated by the mixtures of Gaussians formulations. I think I have been able to reconcile the differences between the two treatments, which seem to be primarily notational in nature. In either case, one wishes to estimate parameters that maximize some likelihood function by marginalizing over the missing variables. There is a much pithier version in Wikipedia (Expectation Maximization), but well, it is ordained that we must all derive satisfaction from rediscovering the wheel, as it were. The point is, does this procedure actually maximize the correct thing? That is what we wish to prove here. Our goal is to find a set of parameters that maximizes the probability of the ‘good’ data (yes, ‘good’, ‘good’, ‘good’, ‘good’). This is the crux of the EM algorithm, the meaning of which is contained in the famed Q function from Dempster 1977, which one should some day work out in full.

Consider the EM algorithm for a dataset \mathcal{D} consisting of good data \mathcal D_g and missing or bad data \mathcal D_b; \mathcal D = [\mathcal D_g, \mathcal D_b].

\begin{array}{ccc}  p(\mathcal{D}_g ) = \frac{p(\mathcal{D}_b, \mathcal{D}_g)}{p(\mathcal{D}_b|\mathcal{D}_g)}  \end{array}

In the above, it is to be noted that we want to maximize the probability of the ‘good’ data p(\mathcal{D}_g). This is not quite the same as maximizing the probability of the ‘complete’ dataset p(\mathcal{D}_g, \mathcal{D}_b)= p(\mathcal{D}). In terms of notation, we should emphasize that the maximization occurs for a given parameter state \theta, which we wish to discover through EM. So the probability is properly written as p(\mathcal{D}_g|\theta), although for obvious reasons it is quite cumbersome to carry it around.

The likelihood in question manifests as an expectation over the complete dataset, and the expectation is carried over p(\mathcal{D}_b|\mathcal{D}_g). Now, in Bishop (and in other places), they denote this by p(Z|X). Here, Z is the missing or bad data (or latent data, as referred to in Bishop), and X is the good or known data. The complete dataset probability would be p(Z, X).

The recipe that is given in the books is to carry out an expectation over the conditional: p(\mathcal{D}_b|\mathcal{D}_g) or p(Z|X), which appears to be totally arbitrary on first impressions. We show here that it increases the likelihood function with each iteration.

Here is a brief description of the EM procedure. Given a parametrized model p(x_g, x_b|\theta) we wish to obtain \theta which maximizes the probability of good data p(x_g), as embodied by the likelihood function \log p(x_g).

In the E-step we set \theta = \theta_{old}, and take the expectation of \log p(x_b, x_g|\theta) over p(x_b|x_g, \theta_{old}). In the M step, we set our parameter \theta to that which maximizes Q(\theta, \theta_{old}).


\begin{array}{ccc}  Q(\theta, \theta_{old}) = \int \log p(x_b, x_g| \theta) p(x_b|x_g, \theta_{old}) dx_b  \end{array}


\begin{array}{ccc}  \theta =\underset{\theta}{\arg \max} \ Q(\theta, \theta_{old})  \end{array}

In Bishop’s notation (with Z and X), the E step becomes:

\begin{array}{ccc}  Q(\theta, \theta_{old}) = \int \log p(Z,X|\theta) p(Z|X, \theta_{old}) dZ  \end{array}

Or in other words, we compute the expectation over Z|X. This is mentioned in his E-M chapter, and then extended in the chapter on variational inference.

\begin{array}{ccc}  Q(\theta, \theta_{old}) = E_{Z|X, \theta_{old}} \log p(Z, X|\theta)  \end{array}

Bishop (as in, Bishop the book) states that this procedure might appear arbitrary. Why, one might ask, would one want to take the expectation over the conditional, and how does it help? Or even, if one is naive, what might we be trying to compute from this procedure?

This becomes apparent from the problem in Duda and Hart. Incidentally, the book also has a very illustrative example to clarify the machinery. The point is that we only have ‘good’ data, so we must find a set of parameters that maximizes the likelihood of the observed, good data. Our problem statement is thus to maximize p(x_g). Owing to practicalities which become clear when we work through the mixtures of Gaussians parameter estimation problem, this (the complete dataset formulation) brings about a simplification in the algebra.

Our aim is to show that EM maximizes the (log) likelihood of good data. Putting it more concretely, we must show that the EM iterations increase the likelihood of good data with each iteration.

\begin{array}{ccc}  \log p(x_g|\theta) = \log p(x_b, x_g|\theta) -\log p(x_b|x_g, \theta)  \end{array}

\begin{array}{ccc}  E = \int \log p(x_g|\theta) p(x_b|x_g, \theta_{old}) dx_b & = & \int \log p(x_b, x_g|\theta) p(x_b|x_g, \theta_{old}) dx_b - \int \log p(x_b|x_g, \theta) p(x_b| x_g, \theta_{old}) dx_b\\  &=& Q(\theta, \theta_{old}) - \int \log p(x_b|x_g, \theta) p(x_b|x_g, \theta_{old}) dx_b \\  &=& Q(\theta, \theta_{old}) - H(\theta, \theta_{old})  \end{array}

Set \theta = \theta_{old} – this is the value of \theta we started with in the first iteration.

\begin{array} {ccc}  E_{old} &=& Q(\theta_{old}, \theta_{old}) - \int \log p(x_b|x_g, \theta_{old}) p(x_b|x_g, \theta_{old}) dx_b \\  &=& Q(\theta_{old}, \theta_{old}) -H(\theta_{old}, \theta_{old})  \end{array}

We wish to show that E\geq E_{old} after each iteration. Now, since we maximize Q(\theta, \theta_{old}), we have Q(\theta, \theta_{old}) \geq Q(\theta_{old}, \theta_{old}).

We therefore only need to prove that the second term (which looks like an entropy) does not increase with each iteration.

i.e. H(\theta, \theta_{old}) \leq H(\theta_{old}, \theta_{old}) (notice that the original equation takes the negative of H).

\begin{array}{ccc}  H(\theta, \theta_{old}) - H(\theta_{old}, \theta_{old}) = \int \log \frac{ p(x_b|x_g, \theta)}{p(x_b|x_g, \theta_{old})} p(x_b|x_g, \theta_{old}) dx_b  \end{array}

Use the identity \log{x} \leq (x-1) (the log is concave, and x-1 is its tangent at x = 1, so the line lies above the curve everywhere). Then the integral becomes

\begin{array}{ccc}  H(\theta, \theta_{old}) - H(\theta_{old}, \theta_{old}) &\leq& \int \left\{\frac{p(x_b|x_g, \theta)}{p(x_b|x_g, \theta_{old})}-1\right\}  p(x_b|x_g, \theta_{old}) dx_b \\  &=& \int p(x_b|x_g, \theta) - p(x_b|x_g, \theta_{old}) dx_b  \end{array}

Using Bayes again, we find that (after omitting \theta for notational convenience)

\begin{array}{ccc}  \int p(x_b|x_g) dx_b = \frac{\int p(x_b, x_g) dx_b}{p(x_g)} = \frac{p(x_g)}{p(x_g)} = 1  \end{array}

Both conditionals integrate to one, so the integral in the H inequality above evaluates to zero, and we have

\begin{array}{ccc}  H(\theta, \theta_{old}) - H(\theta_{old}, \theta_{old}) \leq 0  \end{array}

Rewriting the equations for convenience we have

\begin{array} {ccc}  E-E_{old} &=& Q(\theta, \theta_{old}) - Q(\theta_{old}, \theta_{old}) - (H(\theta, \theta_{old})-H(\theta_{old}, \theta_{old}))\\  &\geq& 0  \end{array}

We have hence proved that EM increases the likelihood of ‘good’ data after each iteration.
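A toy numerical check of this result (my own simplification of the Duda and Hart setting): data from a unit-variance Gaussian with some values missing completely at random. The E step replaces each missing value by its conditional mean \mu_{old}, the M step averages the completed data, and the log likelihood of the good data alone increases at every iteration, converging to its maximizer (the mean of the good data):

```python
import math

def em_missing_mean(x_good, n_missing, mu0, n_iters=25):
    """EM for the mean of a unit-variance Gaussian when n_missing values
    are missing completely at random. Since p(x_b | x_g, mu_old) = N(mu_old, 1),
    maximizing Q gives the mean of the data "completed" with mu_old."""
    n_g = len(x_good)
    mu = mu0
    logliks = []
    for _ in range(n_iters):
        # E step + M step combined: mean of good data plus imputed values
        mu = (sum(x_good) + n_missing * mu) / (n_g + n_missing)
        # Log-likelihood of the *good* data only: log p(x_g | mu)
        logliks.append(sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2
                           for x in x_good))
    return mu, logliks
```

With x_good = [1.0, 2.0, 3.0], two missing values, and a poor start mu0 = 10.0, the iterates contract toward 2.0 (the good-data MLE), and the recorded log p(x_g) trace is monotone increasing, exactly as the proof above demands.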