The Laplace Formula

I saw this formula in Bender and Orszag’s book, in the chapter on asymptotic analysis of integrals. The intention (in yoga class, they ask us to remember one’s intentions) has always been to complete that chapter, but as with all other things, ‘progress’ has been essentially non-existent when one did want to make progress; now that one has given up on that term, things seem to be moving – if ever so glacially – in the opposite direction. Given a choice, I would happily work out Bender and Orszag all my life. Real happiness comes from immersion, when one is actually working out these integrals, trivial as they are. Conversely, I conjecture that remembering those dear old days when one was indeed in the flow brings lasting unhappiness.

Be all that as it may, I turn to this simple idea concerning the asymptotic behavior of an integral from Bender & Orszag. It leads nicely to other very interesting ideas such as Watson’s Lemma, steepest descent and stationary phase. The machinery is straightforward and beautiful.

Consider the integral

I(x) = \int_a^b f(t) e^{x \phi(t)} dt

Laplace’s method posits that if \phi(t) attains its maximum in the interval a\leq t\leq b at t=c (and f(c)\neq 0), then it is only the neighborhood of t=c that contributes to the asymptotic expansion of I(x) for large x, i.e. x\to \infty.

Consider the problem of obtaining the leading order behavior of the function

I(x) = \int_0^{10} \frac{e^{-xt}}{1+t}dt

Here, we can replace the upper limit 10 with an arbitrarily small quantity \epsilon, since it is the neighborhood of t=0 (where \phi(t)=-t attains its maximum) that contributes the leading order behavior for large x.

I(x;\epsilon) = \int_0^{\epsilon} \frac{e^{-xt}}{1+t} dt

Next, we replace (1+t)^{-1} with its leading order term in the Taylor expansion (=1) to get the approximation as x\to \infty

\int_0^\epsilon \frac{e^{-xt}}{1+t} dt \sim \frac{1}{x}
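As a quick numerical check (pure Python; the Simpson integrator and the values of x are my own choices, not from the book), we can verify that x \cdot I(x) approaches 1:

```python
import math

# Numerical check of the Laplace result x * I(x) -> 1 as x -> infinity,
# using composite Simpson's rule on [0, 10] (scipy's quad would serve
# equally well).
def simpson(f, a, b, n=200000):
    h = (b - a) / n
    s = f(a) + f(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * f(a + k * h)
    return s * h / 3

ratios = {}
for x in (10.0, 50.0, 200.0):
    I = simpson(lambda t: math.exp(-x * t) / (1 + t), 0.0, 10.0)
    ratios[x] = x * I

print(ratios)  # values approach 1 with an O(1/x) correction
```

The correction is O(1/x), coming from the next Taylor term of (1+t)^{-1}.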


There are two examples that I can relate to immediately:

1) Flame chemistry with large activation temperature.

\dot{\omega}(x) = Y(x) e^{-k/T(x)}

Here, the activation temperature k is some very large number. If we use the Laplace formula, we can immediately see that the flame is a singular quantity, a boundary layer, because the burning rate \dot{m} = \int_{-\infty}^{\infty} \dot{\omega}(x) dx should come from the neighborhood of the region where the flame is at its hottest. This temperature is known as the adiabatic flame temperature or simply, the flame temperature.

2) Gaussian with small variance
In its extreme form, we have the Dirac delta function. But otherwise, we can see that essentially all the probability mass comes from around the mean of the Gaussian.

Note: There is another Laplace formula in statistics, the basic idea of which is to fit a gaussian around a point of interest in order to calculate distributions.
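To illustrate that statistical Laplace approximation (the toy log-density and the sharpness parameter n below are made up for illustration):

```python
import math

# Laplace approximation: fit a Gaussian at the mode to estimate a
# normalizing constant.  Toy unnormalized log-density n*g(theta) with
# g(theta) = -theta^2/2 - theta^4/4 (mode at 0, g''(0) = -1); n is a
# sharpness parameter playing the role of the large x above.
n = 50.0
g = lambda th: -th ** 2 / 2 - th ** 4 / 4

# "Exact" normalizer by Simpson's rule; the tails beyond |theta| = 2 are negligible.
m = 4000
h = 4.0 / m
Z_exact = sum((4 if k % 2 else 2) * math.exp(n * g(-2 + k * h)) for k in range(1, m))
Z_exact = (Z_exact + math.exp(n * g(-2.0)) + math.exp(n * g(2.0))) * h / 3

# Laplace: Z ~ e^{n g(0)} * sqrt(2 pi / (n |g''(0)|))
Z_laplace = math.sqrt(2 * math.pi / n)

rel_err = abs(Z_laplace - Z_exact) / Z_exact
print(Z_exact, Z_laplace, rel_err)  # relative error is O(1/n)
```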


seq2seq with RNN encoder-decoder

There are quite a few tutorials on attention-based seq2seq models in the internet ether, two of which (a PyTorch tutorial among them) I found excellent.

As anyone who has been touched by these things will know by now, seq2seq models have transformed the state of the art in neural machine translation and, more recently (and perhaps less visibly), in speech synthesis. Google’s “Tacotron” is, at its core, a very sophisticated seq2seq model: one feeds in a character sequence as text and it emits a speech signal in the form of a spectrogram. That seq2seq models appear as wrappers and front-page tutorials is proof enough that this is very serious business.

My own intuition (such as it is) comes from the original papers from the Bahdanau-Cho hall of fame. In particular, the appendices in their papers are worth their weight in osmium, maybe. In this note, I sketch out the equations from the first of these Montreal papers (Cho et al. 2014) in a loose way. In subsequent posts, I hope to cover Bahdanau et al. and its variant by Vinyals et al., with some code that I borrowed from the aforementioned PyTorch tutorial, modified lightly to suit my ends.

We start with Kyunghyun Cho’s paper, which broaches the seq2seq model without attention. The authors call this iteration the RNN encoder-decoder. The idea is that one translates a variable length input sequence to a variable length output sequence. The zeroth order application is in Neural Machine Translation:

English: “I am not a small black cat” [7 words]
Français: “Je ne suis pas un petit chat noir” [8 words]

Not only is the length of the sequence different between the two languages, but also, some of the tokens are permuted – (black cat → chat noir).

We train by teaching our machine how the input sequence corresponds to its output equivalent. However, we don’t directly translate tokens between inputs and outputs, because of the properties alluded to above. Formally:

a) The input and output sequences are of varying lengths
b) The ordering is not one-to-one or in order [black cat → chat noir].

They get around these issues by encoding the input sequence into a context vector, which somehow stores the meaning of the sequence. The context vector c is now our basis for comparison with the ground truth. In other words, the variable length input sequence is encoded into a fixed length context vector, which is then used to make output predictions, dreamed up one symbol at a time (Alex Graves likens the whole process to a dream sequence in Nando de Freitas’ lecture series). Since both the inputs and outputs are sequences, they are parametrized by recurrent neural networks. The Montreal group of papers uses GRUs (they created these things), but we can use pretty much any kind of RNN we want – LSTMs, vanilla RNNs or GRUs. More formally, this is an encoder-decoder architecture in which the encoder maps the inputs to a fixed length context vector, while the decoder uses the context vector to make predictions; the two are trained end-to-end. This contributes to the attraction of these models, since they free us (supposedly) from the need for painstakingly cultivated domain-specific knowledge in the area under consideration, such as machine translation or speech synthesis.

Assume that the inputs are x^1, x^2, \ldots, x^T, and we would like to produce an output sequence y^1, y^2,\ldots, y^{T'}. We map the input sequence x^i to a context vector c with an RNN, whose hidden units we denote by h^{i}. The output sequence is also predicted with an RNN, with hidden units h'^{j}.


h^i = RNN^{enc} (h^{i-1}, x^i)
c = g(h^T)

The function g in the original paper (Cho et al. 2014) is tanh:

c = \tanh(Vh^T)
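As a minimal sketch in PyTorch (sizes, names and the use of nn.GRU are my own illustrative choices, not the paper’s exact architecture):

```python
import torch
import torch.nn as nn

# Minimal sketch of the RNN encoder: run a GRU over the input sequence and
# squash the final hidden state into a context vector c = tanh(V h^T).
input_size, hidden_size, context_size = 8, 16, 16
seq_len, batch_size = 5, 3

gru = nn.GRU(input_size, hidden_size)          # RNN^{enc}
V = nn.Linear(hidden_size, context_size, bias=False)

x = torch.randn(seq_len, batch_size, input_size)   # x^1 ... x^T
outputs, h_T = gru(x)                          # h_T: (1, batch, hidden_size)
c = torch.tanh(V(h_T.squeeze(0)))              # c = tanh(V h^T), (batch, context)
print(c.shape)
```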

The decoder’s RNN takes c as input and then progressively makes predictions for y^j (or rather, for p(y^j_k=1|y^1, \ldots, y^{j-1}, c)). There is a connection from the encoder’s c at all times, since it represents the input sequence x, as indicated by the conditioning.

h'^t = RNN^{dec} (h'^{t-1}, y^{t-1}, c)

Here’s where the appendices provide invaluable insight into the model; they also remind us that RNNs are nothing more than feed forward networks whose weights are trained with BPTT. The function RNN^{dec} is a set of linear transformations, implemented with the Gated Recurrent Unit (GRU) in the paper, but at the end of these operations we are left with a hidden unit at time t (h'^{t}), which is all we care about at a high level. Note also that subscripts such as j and k index words in the output dictionary.


The softmax operation allows one to create a valid probability distribution for the output, which is then trained with the output labels using a multinomial loss function. Keeping in mind the one-hot encoding labeling scheme, with \hat{y} as the labels, we can write the loss function at a given timestep t as
\mathcal{L} = \sum_{j} \hat{y}^t_j \log p(y^t_j=1|y^1, y^2, \ldots, y^{t-1}, x)

The overall probability that one is trying to maximize is:
p(y|x) = \prod_{t=1}^{T'} p(y^t|y^{t-1}, y^{t-2}, \ldots, x)

In the paper, the softmax function uses a quantity called s, which comes from the max-out unit (due to Goodfellow et al.), an activation designed to pair well with dropout.

s_i^t = \max(s'^t_{2i-1}, s'^t_{2i})

with s'^t being a linear combination of h'^t, y^{t-1} and c:

s'^{t} = O_{h} h'^{t} + O_y y^{t-1} + O_c c
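A minimal sketch of this output stage (sizes and layer names such as O_h, O_y, O_c and G are made up for illustration; this is not the paper’s exact code):

```python
import torch
import torch.nn as nn

# Sketch of the decoder output stage: linear combination -> maxout -> softmax.
hidden_size, embed_size, context_size, vocab_size = 16, 8, 16, 100
batch_size, s_size = 3, 20            # maxout halves 2*s_size -> s_size

O_h = nn.Linear(hidden_size, 2 * s_size, bias=False)
O_y = nn.Linear(embed_size, 2 * s_size, bias=False)
O_c = nn.Linear(context_size, 2 * s_size, bias=False)
G = nn.Linear(s_size, vocab_size)     # projection to the output vocabulary

h_dec = torch.randn(batch_size, hidden_size)   # h'^t
y_prev = torch.randn(batch_size, embed_size)   # embedding of y^{t-1}
c = torch.randn(batch_size, context_size)

s_prime = O_h(h_dec) + O_y(y_prev) + O_c(c)    # s'^t
s = s_prime.view(batch_size, s_size, 2).max(dim=2)[0]   # pairwise maxout
p = torch.softmax(G(s), dim=1)                 # p(y^t_j = 1 | ...)
print(p.shape)  # each row is a distribution over the vocabulary
```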

In the companion paper by Bahdanau et al. (with the same cast of authors), they create the attention mechanism, wherein the context vector now weights relevant mappings from the input to the output: the single encoding c becomes a context c^i that attends to parts of the input sequence and is better aligned with the current output word. Apparently, this considerably improves the performance of the RNN encoder-decoder on longer phrases.


1. Cho et al. 2014
2. Bahdanau et al. 2014
3. Vinyals et al. 2014
4. Sutskever et al. 2014
5. Goodfellow et al. 2013
6. Tacotron

Dominant balance in an optically thick formulation for radiating media

In this post, I demonstrate how we can get an optically thick approximation for the radiation transport equation in non-scattering media.

Consider the ray tracing equation (it’s the radiation transport equation but imagine that it will end up in a ray tracer eventually) without scattering:

\frac{dI}{ds} = \kappa(s)(I_b - I)

Here, \kappa(s) is the absorption coefficient, a complicated function of the medium’s temperature that varies spatially across the medium; s is an arbitrary ray path in some angular direction. The term containing I_b is the black body (emission) intensity, and the one with I denotes absorption of energy by the medium. The above is a restatement of the radiation energy balance: the energy gained by a ‘ray’ is the negative of the energy gained by the medium. The equation is linear, so we can multiply by an integrating factor to get a solution. This forms the basis of ray tracing calculations in a multitude of scenarios, such as flame radiation heat transfer, rendering and planetary media.

Before we do anything, let us first non-dimensionalize the equations by bringing in the notion of ‘optical thickness’. Broadly put, the optical thickness is an indicator of how far we can see into the medium. If the medium is ‘thin’, very little energy is lost to the medium through absorption and we can see quite far into it. However, if the medium is ‘thick’, the light (or heat) energy is all absorbed by the medium which strongly attenuates the amount of energy passing through. This would be the case, for example, on a very foggy day. In fire phenomena, strongly absorbing media exist in thick clouds of smoke surrounding pool or oil fires where everything is black and no heat comes out. My co-advisor told me that we can stand pretty close to these thick oil fires without getting burnt. The flame loses heat through radiative emission, but the heat feeding back through the large, cold (relatively) smoke layer replenishes it.

Define a macroscopic length scale l and a reference absorption coefficient \bar{\kappa}. The absorption coefficient has units of 1/length, so the product \bar{\kappa}l is a dimensionless number, which we will call \delta: a dimensionless optical thickness scale.

Path integral

Upon non-dimensionalization, our ray tracing equation becomes

\frac{dI}{\kappa' ds'} = \bar{\kappa} l (I_b - I)

where s'=s/l and \kappa' = \kappa/\bar{\kappa}. The dimensions of the intensities on both sides cancel out, so we just leave that term as it is.

The optical thickness is sometimes written as the integral \int \kappa\, ds, which motivates writing the denominator on the LHS as d\xi = \kappa' ds'. We replace the constant \bar{\kappa} l on the RHS by the dimensionless optical thickness \delta. After these manipulations, the ray tracing equation looks as follows:

\frac{dI}{d\xi} = \delta(I_b - I)

We can integrate this equation by using the integrating factor e^{\delta \xi}:

\frac{d}{d\xi} \left(e^{\delta \xi} I\right) = \delta I_b e^{\delta \xi}
Ie^{\delta \xi} - I_0 = \delta \int_0^{\xi} e^{\delta \xi'} I_b(\xi') d\xi'


I = I_0 e^{-\delta \xi} + \delta e^{-\delta\xi} \int_0^{\xi} I_b e^{\delta\xi'} d\xi'

If I_b=const then we get

I = I_0 e^{-\delta \xi} +I_b (1-e^{-\delta \xi})

The above equation basically says that the intensity I_0 gets attenuated through a medium based on how optically thick it is. If \delta \ll 1 then we won’t have much attenuation of the original ray. The second term denotes the amount of energy the ray picks up from the emission of radiation as it travels through the medium.
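As a quick sanity check (pure Python; the values of \delta, I_0 and I_b are made up), we can integrate dI/d\xi = \delta(I_b - I) numerically and compare with the closed form:

```python
import math

# Integrate dI/dxi = delta*(I_b - I) with classical RK4 and compare with
# the closed-form solution for constant I_b.  All parameter values are made up.
delta, I0, Ib = 3.0, 5.0, 1.0
xi_end, n = 2.0, 2000
h = xi_end / n

f = lambda I: delta * (Ib - I)
I = I0
for _ in range(n):
    k1 = f(I)
    k2 = f(I + 0.5 * h * k1)
    k3 = f(I + 0.5 * h * k2)
    k4 = f(I + h * k3)
    I += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

exact = I0 * math.exp(-delta * xi_end) + Ib * (1 - math.exp(-delta * xi_end))
print(I, exact)  # agree to high accuracy
```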

For thin media (\delta \ll 1), attenuation amounts to a small linear correction to the original intensity I_0, which is what we would expect in thin, clear, cloudless conditions:

I_0 e^{-\delta \xi} \approx I_0(1-\delta \xi)\approx I_0

Thick media

Things get much more interesting when we go to thick media. By inspection, we can suspect that the intensity must be of the same order as the black body intensity I_b since the first term will disappear (\delta \gg 1). There’s an interesting way to show this using dominant balance, which is really the main point of this post. Consider again the intensity equation:

\frac{dI}{d\xi} = \delta(I_b - I)

The left hand side is O(1), but the right hand side is multiplied by the large quantity \delta. Since the RHS must also be O(1), this implies, perforce, that

I_b-I \sim O(\delta^{-1})
I\sim I_b = I^{(0)}

We can do a formal perturbation expansion on I with \delta^{-1} as the small parameter:

I = I^{(0)} + \delta^{-1} I^{(1)} + \delta^{-2} I^{(2)} + O(\delta^{-3})

Substituting and collecting terms at each order on both sides, the O(\delta) balance recovers I^{(0)} = I_b, while the O(1) balance gives dI^{(0)}/d\xi = -I^{(1)}, so that

I^{(1)} = -\frac{dI_b}{d\xi}, \quad I^{(2)} = \frac{d^2 I_b}{d\xi^2}, \ldots

with the recurrence relation

I^{(n)} = -\frac {dI^{(n-1)}}{d\xi}

This gives us the full asymptotic series

I = I_b - \delta^{-1}\frac{dI_b}{d\xi} + \delta^{-2} \frac{d^2 I_b}{d\xi^2} + O(\delta^{-3})
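We can check this numerically against the exact integrating-factor solution, taking I_b(\xi) = \sin\xi as a made-up emission profile (the integral \int e^{at}\sin t\, dt has a closed form, so no quadrature is needed). The three-term expansion I \approx I_b - \delta^{-1} I_b' + \delta^{-2} I_b'' agrees with the exact solution to O(\delta^{-3}) away from the boundary layer:

```python
import math

# Exact solution I(xi) = I0 e^{-d xi} + d e^{-d xi} * int_0^xi Ib e^{d xi'} dxi'
# with Ib(xi) = sin(xi), using int e^{a t} sin t dt = e^{a t}(a sin t - cos t)/(a^2+1).
d, I0, xi = 50.0, 0.0, 1.0

F = lambda t: math.exp(d * t) * (d * math.sin(t) - math.cos(t)) / (d ** 2 + 1)
I_exact = I0 * math.exp(-d * xi) + d * math.exp(-d * xi) * (F(xi) - F(0.0))

# Optically thick expansion: Ib - (1/d) Ib' + (1/d^2) Ib''
I_asym = math.sin(xi) - math.cos(xi) / d - math.sin(xi) / d ** 2
print(I_exact, I_asym)  # differ by O(d^{-3}) away from the boundary layer
```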

Sidenote on computation

While the simple equations above make for a nice pen and paper exercise (mostly, I suppose, not all that useful), they take on very special significance when we code them up. The ray equation as written is for a single angular ray shot from the light source or radiant source (more like the camera in graphics ray tracing, but let’s not go there), and we traverse along the path of the ray. Coupled to the Navier-Stokes equations, this kind of calculation entails interactions with an Eulerian grid: for each of the angular rays, we must compute intersections of the ray with the grid or with objects in the medium, in addition to carrying out the path integral along the medium with the variable black body intensity. Since I_b \sim T^4, with T\sim 2000 K in the flame zone and 300 K in ambient air, it stands to reason that computing the path integral is expensive, and that we should try our best to find shortcuts (approximations) to lower the computational cost.

A note on batching operations in PyTorch

While implementing ‘DRAW’ in PyTorch, I ran into some surprising problems in carrying out matrix-vector products in a batched way, which I relate through the concrete case of computing the filterbank and read/write operations from the paper.

Matrix computations for attention patches or glimpse windows

The attention mechanism in DRAW involves two operations referred to as reading and writing. In the read operation, one creates a smaller glimpse vector of size 12×12 from the larger MNIST image of size 28×28. The write operation is the inverse, wherein the 12×12 image is transformed to a 28×28 image to write into the global ‘canvas’. These operations are carried out by means of matrix transformations. So, if x is the 28×28 image, then we apply a matrix transformation F_Y x F_X^T to create the attention patch.

Read: Transform large image to small image
x \ (28\times28)=> F_Y x F_X^T \ (12 \times 12).

Write: Transform small image to large image
w \ (12 \times 12)=> F_Y^T w F_X \ (28 \times 28)

There is also a scalar intensity parameter that they call \gamma that multiplies (or divides) the read and write operations.

Read: \gamma F_Y x F_X^T
Write: \frac{1}{\gamma} F_Y^T w F_X

If this \gamma vanishes, one might suppose that the calculation will blow up, something we can try to prevent by adding some small noise. But I found that when the thing did blow up, there wasn’t really anything to be done about it anyway; changing the GRU to an LSTM seems to help in this case.

The matrices F_Y, F_X are referred to in the paper as ‘filterbanks’, which are essentially Gaussian kernels around a point of interest. In 1D, if F is built from \mathcal{N}(\mu, \sigma), it applies a Gaussian filter centered at x=\mu. More on that in a dedicated post.
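As a rough sketch of such a filterbank (a 1D construction with made-up centers and \sigma; the paper’s version also involves learned center, stride and variance parameters):

```python
import numpy as np

# 1D Gaussian filterbank: row i of F is a normalized Gaussian centered at mu_i.
# Applying F to a length-A signal produces N weighted local averages (a "glimpse").
A, N, sigma = 28, 12, 1.0
mu = np.linspace(4.0, 24.0, N)              # made-up glimpse centers

a = np.arange(A, dtype=float)
F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
F /= F.sum(axis=1, keepdims=True)           # each row sums to 1

signal = np.random.randn(A)
glimpse = F @ signal                        # (N x A) @ (A,) -> (N,)
print(F.shape, glimpse.shape)
```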

Now, since variables in PyTorch all have a batch_size dimension to them, we would like to have them computed in vectorized fashion (or written in CUDA, which would be conceptually more intuitive). A naive loop for the read operation could look like this:

# x[i]: B x A
# F_y[i]: N x B, F_x[i]: N x A

for i in range(batch_size):
    F_x_t = torch.t(F_x[i])  # transpose: A x N

    # is matrix multiplication
    tmp1 =[i],[i], F_x_t)) * gamma[i]

It would have been nice if the framework automatically vectorized the above computation, in the manner of OpenMP or OpenACC, in which case we could use PyTorch as a GPU computing wrapper. Since it does not, we have to either write the loop in CUDA or use PyTorch’s batching methods, which thankfully exist.

In order to do batch matrix multiplies, we should have the outer index as the batch variable, and the rest as a matrix, a tensor of 2 dimensions.

We do this with heavy use of the permute and expand functions:

#Do the transpose first
#permute indices 1, 2; 0 is batch index
#F_x_t => batch_size x A x N
F_x_t = F_x.permute(0, 2, 1)

#pad gamma with extra dimensions to make it a matrix
#(batch_size) => (NxNxbatch_size)
gamma = gamma.expand(N, N, batch_size)

#now permute to make batch_size the outer index
gamma = gamma.permute(2, 0, 1)

Our variables are now of the following shapes:

gamma: batch_size x N x N
x: batch_size x B x A
F_x_t: batch_size x A x N

Do a batch multiplication of x and F_X^T by invoking the bmm method.

tmp = x.bmm(F_x_t) # batch_size x B x N

Now multiply by F_Y to give the glimpse (read) patch of shape batch_size x N x N (times \gamma).

tmp = F_y.bmm(tmp)

Multiply this by \gamma to get the glimpse vector. This is an elementwise multiplication, simply done by using the asterisk operator.

w = gamma * tmp

The code can be found here:

RNNCell Modules in PyTorch to implement DRAW

I realized that in the last few months, I’ve spent a lot of time reading about generative modeling in general, with a fair bit of nonsense rhapsodizing about this and that, as one often does when one sees things the first time. I can see that I’ll be working a lot with RNNs in the near future, so I decided to get my hands dirty with pytorch’s RNN offerings. There is little else to do anyway in the heat.

I am working on the “DRAW” paper by Gregor et al.

This is a natural extension of the Variational Autoencoder formulation of Kingma and Welling, and Rezende and Mohamed. The paper appeals to the idea that we can improve upon the VAE’s handiwork by iteratively refining its output over the course of several time steps; there is thus a temporal, sequential aspect that comes in. In addition, they add a spatial attention mechanism wherein one ‘attends’ to portions of the image, so as to improve it in small NxN patches of the larger image. I am posting on the first part now, which only uses the RNNs. It might be a few more days before I can finish the second part.

Initially, I thought that we just have to pick from pytorch’s RNN modules (LSTM, GRU, vanilla RNN, etc.) and build up the layers in a straightforward way, as one does on paper. But then, some complications emerged, necessitating disconnected explorations to figure out the API. Firstly, we should all be aware of PyTorch’s way of creating arrays (well, I’ve not used any other frameworks except for caffe, and that too, only for benchmarking runs, so it’s all quite new to me) which demands that we include the batch size during initialization.

So for example, if we want to create an input x of size 784 (as in MNIST), we must also pass the batch size variable as input:

x = Variable(torch.randn(batch_size, input_size))

We do this in most of our initializations. The first dimension is the batch size. However, RNN variables have an additional dimension, which is the sequence length.

x_rnn = Variable(torch.zeros(seq_len, batch_size, input_size))

This is apparent in retrospect from the documentation (see under RNN), but we have to play with it in order to make sure. When we unroll this entity, we get seq_len copies of the unrolled form:

seq_len * Variable(batch_size, input_size)

Let us look at the example given in the documentation page:

>>> rnn = nn.GRU(10, 20, 2)
>>> input = Variable(torch.randn(5, 3, 10))
>>> h0 = Variable(torch.randn(2, 3, 20))
>>> output, hn = rnn(input, h0)

This defines a GRU of the following form:

rnn (input_size, hidden_size, num_hidden_layers)

We should note that the function curiously returns two outputs: output and hidden. The first (‘output’) contains the sequence of outputs from the last hidden layer, while ‘hidden’ contains the states of all the hidden layers from the last time step, which we can verify from the ‘size()’ method.

‘output’ is of shape (seq_len, batch_size, hidden_size) . It contains the sequence of hidden layer outputs from the last hidden layer.

>>> output.size()
torch.Size([5, 3, 20])
(i.e. torch.Size([seq_len, batch_size, hidden_size]))

I find the purpose of ‘hidden’ a little enigmatic. It supposedly contains the hidden layer outputs from the last timestep in the sequence t = seq_len.

>>> hn.size()
torch.Size([2, 3, 20])
(i.e. torch.Size([num_hidden_layers, batch_size, hidden_size]))

The hidden layer can be bi-directional. Apparently, the default (as we might expect) is a standard uni-directional RNN. The documentation clarifies this:

h_n (num_layers * num_directions, batch, hidden_size)

It helps to remember that the quantity they call ‘output’ is really the hidden layer. The output of an RNN is the hidden variable which we then do something with:

h_t = f(x_t, h_{t-1}) = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_{xh} + b_{hh})

y_t = g(h_t)

In my experiments I used GRUCell because it seemed intuitive to set up at that time.


Note: I think in the above, we can replace the two RNNs used in the encoder (one each for \mu and \sigma) with a single RNN, as can be made out from “Generating Sentences from a Continuous Space”.


RNNCell module

In DRAW, we need a connection from the decoder from the previous timestep. Specifically:

h_t^{enc} = RNN^{enc} (h_{t-1}^{enc}, [r_t, h_{t-1}^{dec}])

They define a cat operation [,] to concatenate two tensors.

[u, v] = cat(u, v)

As we can make out from the hand-written figure (sorry, but that’s just the most efficient way, factoring in things such as laziness), at each time step the VAE encoder takes in the output of the read operation, which then gets encoded into the latent embedding Q(z_t|x, z_{1:t-1}) = Q(z_t|h_t^{enc}). Furthermore, at each timestep we feed in the same input image x, together with the previous timestep’s canvas and the decoder output. In that sense, the sequence is actually defined by the quantity r_t, the output of read:

r_t = read(x, \hat{x}, h_{t-1}^{dec})

The read operation is to be implemented; this is done differently depending on whether or not we put in spatial attention. Nevertheless, we can make rough sense of it from a line in the paper: “Moreover the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behaviour so far”.

In our experiments, we used the RNNCell (or more precisely, the GRUCell) to handle the sequence, with a manual for loop to do the time stepping – the most intuitive way, if I may say so. In the forward method of the class, we create the set of operations comprising DRAW:

for seq in range(T):
    x_hat = x - F.sigmoid(c)  # error image
    r =, x_hat)  # cat operation
    # encoder output == Q_mu, Q_sigma
    mu, h_mu, logvar, h_logvar = self.encoder_RNN(r, h_mu, h_logvar, h_dec, seq)
    z = self.reparametrize_and_sample(mu, logvar)
    c, h_dec = self.decoder_network(z, h_dec, c)  # c is the canvas

Naturally, the RNNCell layers handle each individual timestep rather than batching the whole sequence together. The API is as follows:

>>> rnn = nn.RNNCell(10, 20) #(input_size, hidden_size)
>>> input = Variable(torch.randn(6, 3, 10)) #(seq_len, batch_size, input_size)
>>> hx = Variable(torch.randn(3, 20)) #(batch_size, hidden_size)
>>> output = []
>>> for i in range(6): #time advance
... hx = rnn(input[i], hx) # advance the hidden state by one timestep
... output.append(hx) #add to sequence

In addition to the vanilla RNNCell, also included in PyTorch are the GRU and LSTM variants.

I hope to put up a more descriptive post (with feeling!) of DRAW. But for now, I have what seems to be a quasi working implementation without the attention mechanism. In the figures below, we can see that there is a qualitative improvement in the figures as we add refinement timesteps to it.

The code may be found here:

Monte Carlo EM on the posterior predictive

In the last few posts, I have been waxing eloquent about the marvels of the Variational Autoencoder:

Much have I travell’d in the realms of gold,
And many goodly states and kingdoms seen;
Then felt I like some watcher of the skies
When a new planet swims into his ken …

This time, I add a few expository footnotes (shall we say, handnotes – since when did we start taking notes with the feet?) to the EM algorithm, attempting to reinterpret some lines from Wei and Tanner 1990. The expedition started with Bishop, so it has all been done before. Nevertheless, I will reinvent the wheel again by reproducing their notes.

Let us first start with a little recap of EM for maximizing the parameters of the observed data, keeping our notation consistent with Wei and Tanner. Let y be the observed data, z the latent variables (e.g. the encoded representations in the Frey faces) and \theta the parameters characterizing the distribution. We seek to maximize the probability of the observed data (better sounding than the ‘good’ data of an earlier post), \log p(y), which our EM procedure (we have already convinced ourselves, quite tediously, that the procedure works, as chronicled in an earlier post) guarantees to increase. We wish to compute the so-called Q function.

First, the E step:
Q(\theta, \theta_0,y) = \int_z \log p(z,y|\theta) p(z|y,\theta_0) dz

In the M step, we maximize Q with respect to \theta. We then have Q(\theta, \theta_0) \geq Q(\theta_0,\theta_0). We can prove that this leads to an increase in the likelihood \log p(y).

In Wei and Tanner, they use the same idea to maximize the posterior p(\theta|y). They take it a step further by using Monte Carlo sampling in the E step. This is useful when computing the E step analytically is not very feasible.

The posterior distribution p(\theta|y)

Wei and Tanner 1990 present an expression for the Q function for the posterior, which I derive here accordingly. The Q function for the posterior is
Q(\theta, \theta_0) = \int_z \log (p(\theta|z,y)) p(z|\theta_0,y) dz

We start by invoking Bayes’ theorem to derive the quantities of interest. The posterior is an integral over the hidden states z.

p(\theta|y) = \int_z p(\theta,z|y) dz
p(\theta|y) = \int_z p(\theta|y,z) p(z|y) dz
We use Jensen’s inequality to prove that the EM objective increases the likelihood function \log p(\theta|y).

We know that since \log x is a concave function we have

\log [E_q (f(x))]\geq E_q [\log f(x)]

where q is a distribution that weights f(x) as follows:

E_{q(x)} [f(x)] = \int q(x) f(x)dx \approx \frac{1}{L} \sum_{l=1}^L f(x^l), \quad x^l \sim q(x)

We wish to prove that the likelihood increases with every step of the EM procedure:

\log p(\theta|y) - \log p(\theta_0|y) \geq 0

Start by expanding p(\theta|y)

\begin{array}{ccc} \log p(\theta|y) &=& \log \int p(z,\theta|y) dz \\  &=& \log \int \frac{p(z,\theta|y)}{p(z|\theta_0,y)} p(z|\theta_0,y) dz \end{array}

Apply Jensen’s inequality to give

\log p(\theta|y) \geq E_{p(z|\theta_0,y)} \log\left \{\frac{p(z,\theta|y)}{p(z|\theta_0,y)}\right \}

\begin{array}{ccc} \log p(\theta|y) -\log p(\theta_0|y)   & \geq &  E_{p(z|\theta_0,y)} \log\left \{\frac{p(z,\theta|y)}{p(z|\theta_0,y)}\right \} -\log p(\theta_0|y) \\ &=& E_{p(z|\theta_0,y)} \log\left \{\frac{p(z,\theta|y)}{p(z|\theta_0,y)p(\theta_0|y)}\right \} \\ &=& E_{p(z|\theta_0,y)} \log\left \{\frac{p(z,\theta|y)}{p(z, \theta_0|y)}\right \} \\ &=& E_{p(z|\theta_0,y)} \log\left \{\frac{p(\theta|z,y)}{p(\theta_0|z,y)}\right \} \end{array}

We therefore have

\log p(\theta|y) -\log p(\theta_0|y) \geq E_{p(z|\theta_0,y)} \log\left \{\frac{p(\theta|z,y)}{p(\theta_0|z,y)}\right \} = Q(\theta, \theta_0)-Q(\theta_0, \theta_0)

The upshot is that if we can generate \theta such that Q(\theta,\theta_0)\geq Q(\theta_0, \theta_0) then we guarantee that the EM procedure for the posterior increases the likelihood of the iterates produced.

Monte Carlo EM

The E step above can be written as a Monte Carlo approximation by sampling over z. Apparently, we do this when an analytical form is difficult to obtain.

\begin{array}{ccl} Q(\theta, \theta_0) & = & \int_z \log (p(\theta|z,y)) p(z|\theta_0,y) dz \\ & = & E_{z \sim p(z|\theta_0,y)} \log p(\theta|z,y) \\ & \approx & \frac{1}{L} \sum_{l=1}^{L} \log p(\theta|z^l,y), \quad z^l \sim p(z|\theta_0,y) \end{array}

The M step remains essentially unchanged in principle: we maximize Q with respect to \theta, and use that for the next iteration.
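As a toy illustration (the model below is made up: z_i \sim N(\theta, 1) and y_i|z_i \sim N(z_i, 1), for which the E-step posterior z_i|y_i, \theta_0 \sim N((\theta_0+y_i)/2, 1/2) is known in closed form and the exact MLE is \theta = \bar{y}), the Monte Carlo EM loop looks like:

```python
import math
import random

random.seed(0)

# Toy Monte Carlo EM.  Made-up model:
#   z_i ~ N(theta, 1),  y_i | z_i ~ N(z_i, 1)
# E-step posterior: z_i | y_i, theta0 ~ N((theta0 + y_i)/2, 1/2)
# M-step: theta = average of the (sampled) z_i.
# Marginally y_i ~ N(theta, 2), so the exact MLE is theta = mean(y).
true_theta, n = 2.0, 500
y = [random.gauss(true_theta, math.sqrt(2.0)) for _ in range(n)]
y_bar = sum(y) / n

theta, L = 0.0, 50            # initial guess, MC samples per data point
for _ in range(30):
    # Monte Carlo E-step: sample z from the posterior at the current theta
    z_samples = [random.gauss((theta + yi) / 2, math.sqrt(0.5))
                 for yi in y for _ in range(L)]
    # M-step: maximize Q; here it reduces to the sample mean
    theta = sum(z_samples) / len(z_samples)

print(theta, y_bar)  # theta converges to (approximately) the MLE y_bar
```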

“Minibatches” and computing Monte Carlo expectations in the VAE

In Kingma and Welling, they note that the objective function is computed in minibatches. Random samples totalling some minibatch size are picked and then we just extrapolate our objective loss for the entire dataset.

For a given pattern or sample \mathbf{x^i} we have:

\mathcal{L}(\theta, \phi, \mathbf{x^i}) = -D_{KL} (q_\phi(\mathbf{z}|\mathbf{x^i})||p_\theta(\mathbf{z}))+ E_{\mathbf{z}\sim q_\phi(\mathbf{z|x^i})} [\log p_\theta(\mathbf{x^i|z})]

The second term containing the expectation is to be obtained by using several samples of \mathbf{z}.

E_{\mathbf{z}\sim q_\phi(\mathbf{z|x^i})}[\log p_\theta(\mathbf{x^i|z})] \approx \frac{1}{L} \sum_{l=1}^L \log p_\theta(\mathbf{x^i|z^{i,l}})

This means that we must pick L samples of \mathbf{z} from q_\phi(\mathbf{z|x^i}) in order to take expectations. It is then argued that we can get away with the slightly sleazy approximation of choosing a single sample as long as we have a large enough batch size (say, 100). The calculations assume that L=1.
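The argument is easy to see numerically (a made-up toy: f(z) = z^2 with z \sim N(0,1), so the true expectation is 1):

```python
import random

random.seed(0)

# L = 1 Monte Carlo estimates of E_{z ~ N(0,1)}[z^2] (= 1) are individually
# noisy, but averaging them over a minibatch of 100 "datapoints" tames the
# variance -- the same effect as drawing many samples per datapoint.
minibatch = 100
single_sample_estimates = [random.gauss(0, 1) ** 2 for _ in range(minibatch)]
batch_estimate = sum(single_sample_estimates) / minibatch
print(batch_estimate)  # close to 1
```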

I found the idea profoundly fascinating: they start with a Monte Carlo estimate, and then they just hand-wave it away with this very clever argument. In the end, the objective function ends up looking much simpler than we would expect, given all those expectations.

The minibatch idea is invoked to formulate the objective function (this is before the ingenious reparametrization trick). In essence, we just pick M samples, sum up the losses from these samples, and then linearly extrapolate that for the entire dataset N.

\mathcal{L}^N (\theta, \phi, \mathbf{X}) \approx \frac{N}{M} \sum_{i=1}^{M}{\mathcal{L}(\theta, \phi, \mathbf{x^i})}

The notation makes this point explicit. The LHS has the entire dataset X, whereas the RHS contains the minibatch samples M. This is, in my view, the most chilling explanation of minibatch gradient descent I’ve seen in my rather limited experience with these things.

Edit: The Monte Carlo expectation with L=1 is not really a totally sleazy approximation, because during training we take two expectations, one over the Monte Carlo samples and one over the minibatch, so in essence we obtain randomness by sampling many, many values. The point is that, in the end, we need functions of \theta, \phi that can be backpropagated through.