Shrinkage – Bayes meets MLE

I am Lazarus, come from the dead,

Come back to tell you all – I shall tell you all

These are words from “The Love Song of J. Alfred Prufrock”, a work coming from the greatest modernist poet of them all, the one and only Eliot – Thomas Stearns. Pretentious quotes aside, and with no snide contexts hiding beneath the white fog at the golden gate bridge that is less than an hour away, let us now get to the point. All is well. Every so often, we disappear, and then reappear. But we take on a different form, hopefully also with shape.

This time, I would like to make a few notes on some quite basic Bayesian modeling ideas, with a view to seeing if I can put these thoughts in a coherent way. As they say, it is only when you explain things succinctly do you understand them. Through a mathematical example, I would like to bring out the connection between a Bayesian model’s prior belief, and how it gets adjusted when data arrives. Specifically, the example demonstrates that when we have a large number of data points, the estimate for the model becomes more and more precise, becoming explainable by the sample mean. When the amount of data is small, we do not have enough information for it to explain the observations precisely. In this case, it becomes important to have a prior belief. This belief gets adjusted by data as it arrives.

Single parameter model

This is a very elementary example, and completely lifted from BDA. Consider data y, parameterized by the model \theta, with a gaussian likelihood.

p(y|\theta) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2\sigma^2} (y-\theta)^2}

We also use a gaussian prior

p(\theta) \propto \exp(-\frac{1}{\tau_0^2}(\theta-\mu_0)^2)

Now, to get the posterior, we use Bayes, but as we are ‘given’ the data, we treat y as constant, so it basically becomes the product of the prior times likelihood, with an unknown normalizing constant p(y).

p(\theta|y) \propto p(\theta) p(y|\theta)

Now when we plug our individual formulas into this, we get a product of two gaussians, which is also a gaussian.

p(\theta|y) \propto \exp \left( -\frac{1}{2} \left(\frac{(y-\theta)^2}{\sigma^2}+\frac{(\theta-\mu_0)^2}{\tau_0^2}\right)\right)

After going through the exercise of completing the square (and treating terms not containing \theta as constant, etc), we get

p(\theta|y) \propto \exp\left(-\frac{1}{2\tau_1^2} (\theta - \mu_1)^2\right)

where \mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2} y} {\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} and \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}

We can interpret this result as a compromise between the prior \mu_0 and the ‘data’ term as connoted by y – in the first version below, we ‘adjust’ the estimate given by the prior to account for the data.

\mu_1 = \mu_0 + (y-\mu_0) \frac{\tau_0^2}{\sigma^2+\tau_0^2}

Equivalently, we can say that the data has ‘shrunk’ towards the prior mean:

\mu_1 = y-(y-\mu_0)\frac{\sigma^2}{\sigma^2+\tau_0^2}

Model with multiple observations

It’s a bit more interesting when we have more than one data point. What happens when we have lots of lots of data, as we would in say, a general deep learning setting?

We use the same ideas, with iid assumptions and come up with a similar formulation, summarized by a sample mean:

p(\theta|y) \propto p(\theta) p(y|\theta) = p(\theta) \Pi_1^n p(y_i|\theta)

p(\theta|y) \propto \exp\left(-\frac{1}{2} \left(\frac{1}{\tau_0^2} (\theta-\mu_0^2) + \frac{1}{\sigma^2} \Sigma_{i=1}^{n} (y_i-\theta)^2 \right)\right)

This can be summarized with the sample mean – sufficient statistics – \bar{y} = \frac{1}{n} \Sigma_i y_i.

p(\theta|y_1\cdots y_n) = p(\theta|\bar{y}) = N(\theta|\mu_n, \tau_n^2)


\mu_n = \frac{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} and \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}

Now, after all this, we get to the crux or nub. Initially, when there is no data, we rely entirely on the prior belief. As data starts to arrive, the prior (embodied by \tau_0) gets slowly weighted out by the data term. When we have a large number of data points n\to \infty, our uncertainty term vanishes:

p(\theta|y) \approx N(\theta | \bar{y}, \sigma^2/n)

Here, we converge to an estimate parameterized by the sample mean, and the variance parameter goes to zero. In other words, as we get more and more points, we are able to make a more accurate prediction of the mean. In a perverse way, this struck me as one reason why we resort to maximum likelihood in deep learning problems. As the amount of data increases, the uncertainty in estimating the model decreases, with the parameter eventually converging to a point estimate.


BDA3, Chapter 2:

Reparameterization trick

In this note, we take a look at the reparameterization trick, an idea that forms the basis of the Variational Autoencoder. My material here comes from the fantastic paper by Ruiz et al [1]. The main idea is that the reparameterization trick [4,5] gives us a lower variance estimator than that obtained from the score function gradient, but suffers because the class of distributions to which it can be applied is somewhat limited – Kingma and Welling discusses Bernaulli and Gaussian variants. Nevertheless, it is possible to remedy this defect, as is done in papers [1], [2], [3]. The paper by Ruiz is especially instructive. It walks us through the machinery involved in reparameterization and discusses variance reduction [Casella and Berger, Robert and Casella] for variational inference – also [BBVI]. I was originally intending on writing a much more complete summary of the papers [1], [2], [3] but ran out of juice, leaving it for the future as might transpire (or not). If I may insert inappropriately (also, perhaps to acknowledge a decade that just ended):

What might have been is an abstraction
Remaining a perpetual possibility
Only in a world of speculation.
What might have been and what has been
Point to one end, which is always present.

Notwithstanding such tangential musings, it behooves us to study [Casella and Berger]. The authors note that it is a 22 month long course to learn statistics the hard way. Incidentally, the style of books bearing George Casella’s imprint – either actual or inspirational – ([Robert and Casella], [Robert]) very much reminds me of [Bender and Orszag], with engaging quotes from Holmes and their straight forward slant on equations.

The Variational Objective

Recall that in variational problems we are interested in obtaining the ELBO. We briefly derive this using Jensen’s inequality (\log E_q(.) \geq E_q\log(.))

Take the joint form presented in Ruiz:

\begin{aligned} p(x,z) = \frac{p(x,z)}{q(z;v)} q(z;v) \end{aligned}

\begin{aligned} \log \int p(x,z) dz &=& \log \int \frac{p(x,z)}{q(z;v)} q(z;v) dz \\ &\geq & \int \log \frac{p(x,z)}{q(z;v)} q(z;v) dz \end{aligned}

\begin{aligned} \mathcal{L}_v = E_{q(z;v)}[\log p(x,z) - \log q(z;v)] = E_{q(z;v)} [f(z)] + H(q(z;v)) \end{aligned}

The equation written in this form is quite instructive. We want to optimize the joint p(x,z) by varying the variational parameters v through the surrogate distribution q.

Gradient estimation with score function

Our aim is to obtain Monte Carlo estimates of the gradient (by taking expectations), but this is not possible as is. However, it is possible to arrange this as follows:
\begin{aligned} \int \nabla_v q(z;v) f(z) dz &=& \int q(z;v) \nabla_v \log q(z;v) f(z) dz \ &=& E_q(z;v) [f(z) \log q(z;v)] \end{aligned}

This is known as the score function estimator, or the REINFORCE gradient. We can view it as a discrete gradient that allows us to take gradients of non-differentiable functions by taking samples.

E_q(z;v)[f(z) \log q(z;v)] = \frac{1}{L} \sum_{l=1}^L [f(z^l) \log q(z^l;v)]

However, the estimate obtained tends to be noisy, and needs provisos for variance reduction – e.g. Rao-Black Wellization, control variates.

Gradient of expectation, expectation of gradient

A principal contribution of the VAE approach is that we have an alternative way to derive the estimator, one that is generally of lower variance than the score function method described above. That being said, it has the drawback that the method is not as widely applicable as the score function approach.

Recall that we would like to take gradients of the term containing the log joint in the ELBO.

\nabla_v \mathcal{L} = \nabla_v E_{q(z;v)} [f(z)] = \nabla_v \int q(z;v) f(z) dz + \cdots

In this equation, we can take samples f(z^l), but as the estimator contains variational parameters v within it, we cannot carry out any sort of diffentiation operations to it with respect to v– necessary to take gradients. The reparameterization trick gets around this problem.

We derive an alternative estimator by transforming to a distribution that does not depend on v so that we can now take the gradient operator inside the expectation. For this to work, we rely on what they call a ‘standardization’ operation to transform q(z;v) into another distribution q(\epsilon) independent of v (and other terms containing v). In the end, we want to have something like this:

\nabla_v E_q(z;v) f(z) = \nabla_v E_{q(\epsilon)} f(\tau(\epsilon;v)) g(v;\epsilon) = E_{q(\epsilon)} \nabla_v (\cdots)

We have pushed the gradient operator inside the expectation which allows us to take samples and allow taking gradients from it.

That’s a lot of words. Let us derive this estimator to put it more concretely.

We assume that there exists an invertable transformation z= \tau(\epsilon;v), \epsilon=\tau^{-1} (z;v) with pdfs q(z;v), q_\epsilon(\epsilon;v). In some cases, it is possible to find a transformation such that the reparameterized distribution is independent of the variational parameters v. For example, consider the standard normal distribution:

q(\epsilon) = \frac{1}{\sqrt{2\pi}} e^{-\epsilon^2/2}

We can consider this as the standardized version of a normal distribution N(\mu, \sigma).

\begin{aligned} z = \tau(\epsilon;\mu, \sigma) &=& \mu + \epsilon \sigma \\ \epsilon &=& \frac{z-\mu}{\sigma} \end{aligned}

Transform (pretend 1D for now):
\begin{aligned} q(z) = q_\epsilon(\epsilon) |\frac{d\epsilon}{dz}| &=& \frac{1}{\sigma} q_\epsilon(\epsilon) \\ &=& \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{z-\mu}{\sigma})^2} &=& N(\mu,\sigma) \end{aligned}

For more general cases, the differential is replaced by a Jacobian:

\begin{aligned} J(\epsilon;v) = |\det \nabla_\epsilon \tau(\epsilon; v)| \\ q(\epsilon;v) = q(\tau(\epsilon;v)) J(\epsilon;v) \end{aligned}

After standardization, we lose dependence on v: q(\epsilon;v) = q_\epsilon(\epsilon) so that

\begin{aligned} \int q(z;v) f(z) dz &=& \int q_\epsilon(\epsilon)) J f(\tau(\epsilon;v)) d\epsilon J^{-1} \\ &=& \int q_\epsilon(\epsilon) f(\tau(\epsilon;v)) d\epsilon \end{aligned}

Now we can take gradient and move it inside the expectation:

\begin{aligned} \nabla_v \int q(z;v) f(z) dz &=& \nabla_v \int q_\epsilon(\epsilon) f(\tau(\epsilon;v)) d\epsilon \\ &=& \int q_\epsilon(\epsilon) \nabla_v f(\tau(\epsilon;v)) d\epsilon \\ &=& E_{q(\epsilon)} \nabla_v f(\tau(\epsilon;v)) \end{aligned}

Reparameterization in more general cases

The standardization procedure is now extended so that \epsilon is weakly dependent on v – it has zero mean, but it’s first moment does not depend on v. Nevertheless, q_\epsilon(\epsilon;v) has dependence on the variational parameters. In this case, the expectation will have to be evaluated term by term with chain rule

\nabla_v E_{q(z;v)} [f(z)] = \nabla_v E_{q_\epsilon(\epsilon;v)} [f (\tau(\epsilon;v))] = \nabla_v \int q_\epsilon(\epsilon;v) f(\tau(\epsilon;v)) d\epsilon

\nabla_v E_{q(z;v)}[f(z)] = \int q_\epsilon(\epsilon;v) \nabla_v f(\tau(\epsilon;v)) d\epsilon + \int q_\epsilon(\epsilon;v) f(\tau(\epsilon;v))\nabla_v \log q_\epsilon(\epsilon;v) d\epsilon

As we can see, the first term is the regular reparameterization gradient. The second term is the score function estimator, a correction term for this version of this standardization setup.

\begin{aligned} g^{rep} = E_{q_\epsilon(\epsilon;v)} \nabla_v f(\tau(\epsilon;v)) \\ g^{corr} = E_{q_\epsilon(\epsilon;v)} f(\tau(\epsilon;v)) \nabla_v \log q_\epsilon(\epsilon;v) \\ \mathcal{L}_v = g^{rep} + g^{corr} + \nabla_v H[q(z;v)] \end{aligned}

In the case of the normal distribution, the second term g^{corr} vanishes.


The terms are massaged so as to look like control variates, an idea used in Monte Carlo variance reduction. The authors note that while Rao-Blackwellization is not used in the paper, it is perfectly reasonable to use the setup in conjuction with it, as is done in Black Box Variational Inference [BBVI], where both Rao-Blackwellization and control variates are used to reduce the variance of the estimator.

The basic idea of control variates is as follows ([Casella and Berger] – Chapter 7 on “Point Estimation”). Given an estimator W satisfying E_\theta W = \tau(\theta), we seek to find another estimator \phi_a of lower variance, using an estimator U with E_\theta U = 0:

\phi_a = W + aU

The variance for this estimator is (Casella and Berger):

Var_\theta \phi_a = Var_\theta(W+aU) = Var_\theta W + 2a Cov_\theta(W,U) + a^2 Var_\theta U

We would get lower variance for \phi_a than W if we could find a such that 2a Cov_\theta(W,U) + a^2 Var_\theta U <0.

To get back to our variational estimator, rewrite as follows for it to be interpretable as control variates (see [1]):

\begin{aligned} \nabla_v E_{q(z;v)} [f(z)] &= E_{q(z;v)} [f(z) \nabla_v \log q(z;v)] \\ & + E_{q(z;v)}[\nabla_z f(z) h(\tau^{-1} (z;v);v)] \\ & + E_{q(z;v)} [f(z) (\nabla_z \log q(z;v) h(\tau^{-1}(z;v);v)+u(\tau^{-1}(z;v);v))] \end{aligned}

In the first line, we have the score function expression, which is modified in subsequent lines. It is not entirely clear to me how the expectation of the terms that correct the noisy gradient is zero, but I suppose we will take it in the spirit with which it was intended.


[1] Ruiz et al: The generalized reparameterization gradient:

[2] Naesseth et al: Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms:

[3] Figurnov et al: Implicit Reparameterization Gradients:

[4] Kingma and Welling: Autoencoding Variational Bayes:

[5] Rezende, Mohamed and Wierstra: Stochastic Backpropagation and Approximate Inference in Deep Generative Models:

[BBVI] Black Box Variational Inference:

[Casella and Berger]: Statistical Inference:

[Robert and Casella]: Monte Carlo Statistical Methods:

[Robert]: The Bayesian Choice:

[Bender and Orszag]: Advanced Mathematical Methods for Scientists and Engineers:

Transforming between probability distributions

I wanted to write about normalizing flows, but instead I am putting up something a bit more fundamental, as it forms a building block in the normalizing flows scheme (and even in deriving the reparameterization trick [8,9]), and one which had caused some mild discomfort when I had first looked that body of work (Variational Normalizing Flows [1], Inverse Autoregressive Flow [2], NICE [3], RealNVP [4], Glow [5], Masked Autoregressive Flow [6] (a must read) and this comprehensive review by Deepmind authors [7]).

The question to be asked is this: given a pdf f_X(x), and a transformation \mathcal{X} \to \mathcal{Y}, how do we calculate the pdf of the variable y?

Intuitively, we can call it conservation of probability mass:

\int f_Y(y) dy = \int f_X(x) dx

or f_Y(y) = f_X(x) |dx/dy| (use modulus to keep sign positive)

Below, we go over the machinery in Casella and Berger [0].

CDF of transformed function

Consider the pdf of a continuous random variable X: f_X(X), with the transformation from X \to Y: y=g(x). Then the CDF of Y is given by:

\begin{aligned} F_Y(y) &=& P(Y\leq y) \\ &=& \int_{g^{-1}(y)\leq y} f_X(x) dx \end{aligned}

Here, we should make note of whether the transformation g(x) is an increasing or decreasing function. In the case of it being an increasing function, the expression becomes:

\begin{aligned} F_Y(y) &=& \int_{-\infty}^{g^{-1}(y)} f_X(x) dx \\ &=& F_X(g^{-1}(y)) \end{aligned}

assuming that X is supported in (-\infty, \infty). Likewise, if g(x) is a decreasing function, we take the limits from x to \infty. This is because for any x>x_0, g(x_0)>g(x) in this case. This is made use of in the formula for P(Y\leq y).

\begin{aligned} F_Y(y) &=& \int_{g^{-1}(y)}^{\infty} f_X(x) dx \\ &=& 1-F_X(g^{-1}(y)) \end{aligned}

Typically (again, from Casella and Berger), we sum up the contributions in piecewise fashion where the functions are increasing/decreasing.

Deriving the PDF

By definition, the pdf is obtained by differentiating the cdf. As before, we treat the case for the function g(x) being increasing or decreasing separately.

When g(x) is an increasing function
\begin{aligned} f_Y(y) &=& \frac{dF_Y}{dy} \\ &=& f_X(g^{-1}(y)) \frac{d g^{-1}(y)}{dy} \\ &=& f_X(x) \frac{dx}{dy} \end{aligned}

The case where g(x) is a decreasing function gets a negative sign.

\begin{aligned} f_Y(y) = -f_X(x) \frac{dx}{dy} \end{aligned}

We can combine both of these into a single formula

\begin{aligned} f_Y(y) = f_X(x) \lvert \frac{dx}{dy} \rvert \end{aligned}

Multidimensional form

In Casella and Berger, the multivariate form is discussed in the chapter “Multiple Random Variables”, with some additional constraints that the function must be one-one and onto. We can see why these constraints arise: In the univariate case above g^{-1}(x) needs to exist, so the function must be one-one (i.e. have only one equivalent mapping in the inverse), and onto (no point must be left out).

Assuming that for a continuous random vector (X, Y), we have a transformation (with sloppy notation) u=g_1(x,y), v = g_2(x,y), functions that are one-one and onto, we should have an inverse transformation x = h_1(u,v), y = h_2(u,v).

The transformed distributions are related analogously to the univariate case, with the derivative now being replaced by a Jacobian, whose absolute value we use.

f_{U, V} = f_{X, Y} (h_1(u, v), h_2(u, v)) |J|


J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix}

Use in flow based models

The above formalism forms the general idea behind flow based generative models. We start with some distribution q(z) (which could be, say, a gaussian), and then transform it into more and more expressive distributions. We do the bookkeeping by means of the transformation machinery described above.


Variational Normalizing Flows [1]


0. Casella and Berger:
1. Variational Normalizing Flows:
2. Inverse Autoregressive Flow:
3. NICE:
4. RealNVP:
5. Glow:
6. Masked Autoregressive Flow:
7. Review by Deepmind authors:
8. Kingma and Welling:
9. Rezende, Mohamed and Wierstra:

A simple argument for gradient clipping in WGAN

In this note, we explain through an argument and a bit of hand waving, why gradient clipping makes sense in the Wasserstein GAN [1].

Primal formulation

We briefly examine the main ideas behind the WGAN work for context.

We would like to produce samples from a generator x \sim P_\theta, which we would like to come from the same distribution as the ground truth x \sim P_r. This is achieved by minimizing the Wasserstein distance between P_\theta and P_r. We can propose this problem in primal space through Optimal Transport.

W = \inf_\theta \int c(x,y) d \gamma(x,y)
where x \sim P_\theta, y \sim P_r; i.e. we sample z and send it through the generator, or in other words, we push forward z to x: x = G_\# z; and d\gamma(x,y) = \gamma(x,y) dx dy. The Wasserstein distance is characterized by a particular type of joint distribution \gamma(x,y) of P_r and P_\theta, in that we must look for candidates whose marginals reduce to P_r, P_\theta\int \gamma(x,y) dx = P_r(y); \int \gamma(x,y) dy = P_\theta(x).

The form for the distance measure c can be specified as a Euclidean distance, in which case we term this quantity a Wasserstein distance:

c(x,y) = ||x-y||^p

The Wasserstein distance will have a p^{th} root:

W= \inf_\theta(\int ||x-y||^p d \gamma) ^{1/p}

Generally in the literature, we find use of p=1,2. The original Monge problem was with p=1, c(x,y) = ||x-y||. When we use the squared distance with p=2, we come across something called “Brenier’s theorem” in the Monge problem.

In the current setting, p=1 is used, as it becomes expedient to set up the discriminator form this way. In principle, we can directly optimize the Wasserstein distance itself through Sinkhorn divergences (see [3]). However, it is as such not so simple (expensive?) to optimize the Wasserstein distance, so the WGAN paper used an ingenious trick, of transforming the problem to its dual equivalent through the Kantorovich-Rubinstein inequality. This is possible only when we use p=1. I hope to put up a proof of the Kantorovich-Rubinstein inequality in the not so distant future.

The WGAN discriminator objective

After transforming the problem to its dual form with the Kantorovich-Rubinstein inequality, we get the following expression for the discriminator objective:


Expanding that a bit, we would like to get the discriminator f (parameterized by w) that maximizes the above. The constraint, of course is that these functions should be 1-Lipshitz continuous.


As we can see, the form is amenable to gradient descent:


Next, we talk about how the Lipshitz constraint is imposed, which, if we are to remind ourselves, is the raison d’etre for this note.

1-Lipshitz continuity and gradient clipping

Let us first sketch out the Lipshitz formula in black and white.

Given a real valued function f:R\to R, a function is Lipshitz continuous [4] if there exists a constant K such that

|f(x_1) -f(x_2)| \leq K |x_1-x_2|

The Kantorovich-Rubinstein inequality demands that we have K=1. The question then arises as to how we can enforce this constraint. The paper’s recipe is to ‘clip’ the weights so that they lie in a ball [-0.01, 0.01]^l (l being the dimensionality). Here, we try to intuit why this might work through some simple analysis.

Consider a discriminator with a single layer, carrying out the operation:

f(x) = Wx + b

We can do away with the non-linearity for now (well, ReLU will zero things out, so we can be sure that it won’t increase the norm). Then when we take norms,

|f(x_1) -f(x_2)| = |W (x_1-x_2)| = |W| |x_1-x_2|

For it to satisfy the 1-Lipshitz constraint we need |W| \leq 1. One way of enforcing this is to arbitrarily clip the weights to some small value. They use [-0.01, 0.01]^l. The authors note that this is a ‘clearly terrible’ way of enforcing the condition because it reduces capacity, and could also lead to vanishing gradients.

In the follow up work [3], the authors implement the constraint in a more principled way by means of an additional penalty term that asks that the gradient norm be less than unity. This appeals to the idea that the lipshitz equation is like a slope, and when we take two points x_1, x_2, that are very close to each other, we get the gradient.



1. Wasserstein GAN:
2. Improved training of Wasserstein GANs:
3. Learning generative models with Sinkhorn divergences:
4. Lipshitz conitnuity:

Deriving the Kantorovich duality

It has been more than 2 1/2 years since the WGAN paper came out. This was a landmark effort that brought to light connections between optimal transport and GANs. It is also a formidable paper to come to terms with (I would not have felt up to reviewing that paper, not without understanding the duality proofs). And of course, the work is a shining example of a method that can be used in practice in applications. I have never felt up to putting up a review for this paper – so far – because there are some components there that I haven’t been able to prove. And when you put up results that you don’t fully have a feel for, it feels a bit disingenuous. The story is akin to the analogous situation in Chess where lets say, you are asked to evaluate a position with a bishop, knight and king against king, which you know is a win, but cannot do it over the board. I have also had some realizations along the way. Chief among them is that we cannot be self-respecting practitioners without studying convex optimization. This is probably as important as probability theory is, perhaps even more so. The second realization was more of a prosaic affirmation, that we ought to do some – useless – pen and paper work to keep our sanity. When there are so many things we have no control over (e.g. the capriciousness of the attention curve), it is reassuring that there are things that go beyond getting things to work, build and engineer. You can take all the time you want to, but in the end, there is truth, and that allows us to let go of the stress that our chaotic world produces.

Thanks to the WGAN paper, and the somewhat unrelated follow up on Wasserstein Autoencoders – which is eminently much more readable – I have been digging up Optimal Transport literature for instruction and entertainment. Maybe application also some day.

Having dispensed with the pourparlers, we can now get on with the matter of the moment (I hope the old sweats had found something else to do when they looked away). This time, I would like to provide a proof of the Kantorovich duality from an unexpectedly accessible tutorial on OT. It is to be noted that this is NOT the Kantorovich-Rubinstein duality used in the WGAN work. That emerges as a consequence of the Kantorovich duality, and is treated separately. My material comes from the excellent tutorial [1].

For the record, what bothered me in the WGAN paper most was the Kantorovich-Rubinstein duality which is a critical step in the paper’s presentation. In order to do justice to the paper, it is necessary for us to understand duality proofs. The Kantorovich-Rubinstein inequality arises as a consequence of the main duality proof, which the current article makes an attempt at explaining for the discrete problem.

Optimal problem definition

Before we set up the discrete problem from [1], it helps to introduce ourselves with the main ideas. The optimal transport problem can be seen as a transporting mass between a source and target domain, characterized by a joint function \gamma, the transport plan. The joint has the property that when we take its marginals, we get the source and target masses.


We would like to take the minimum cost computed from all the \gammas available, giving us an optimal metric. The quantity c(x,y) is the distance. When we take the Euclidean distance ||x-y||^p, we call this a Wasserstein distance. The original problem by Monge concerned itself with moving mounds of earth between points, with c=||x-y||. The problem was defined such that one wanted to transport earth from a point x to another point T(x) using a mass conservation condition. Apparently, this formulation posed several problems, such as non-existence (as mass cannot be split).

The above equation is written for the continuous problem. In the discrete setting, we replace the integral by a sum, so that what results is a dot product \sum C_{ij} \gamma_{ij}. We work in the discrete setting to stay compatible with the notes in [1].

Duality proof for the Kantorovich problem in the discrete setting

Let us consider the problem where we have m point masses U = (u_i)_{i=1,2,\cdots, m}, and n point masses V=(v_j)_{j=1,2,\cdots, n}, and we would like to transport U to V with minimum cost. The transportation plan in this case is the ‘joint’ \gamma, and the cost is C where \gamma = \gamma_{ij}, C = C_{ij} with i=1,\cdots, m, j=1, \cdots, n.

The problem becomes:


Here, it is worth our while look at the dimensions of the quantities. U is of dimension m \times 1; \gamma is of dimension mn \times 1 (m \times n flattened into a vector), so the quantity P_X should be of dimension m \times mn. Likewise, P_Y should be of dimension n \times mn. We are now ready to set up the objective.

Arriving at the dual form

With our primal discrete OT problem having been formulated, we first add a few terms to it, carefully making sure that it does not alter the expression by suitably adding infs and sups. We recast the original problem (the primal) into an optimization problem over \varphi, \psi.


We take the supremum of this expression over \varphi, \psi. When the constraints P_X \gamma = U; P_Y \gamma = V are satisfied, it equals the desired expression \langle C, \gamma \rangle . Otherwise, its maximum would be \infty. In fact, we can adjust the values of \varphi, \psi so that the expression becomes arbitrarily large.


We have made no changes to the original problem, except for adding a few extra layers of logic that turn the problem into finding an optimizer over \varphi, \psi.

I should note that this logic had eluded comprehension until I had chanced upon the tutorial [1], when something moved from under the brown fog of the brain. In some ways, one either gets it or one doesn’t. If one doesn’t, then it grates the soul, like a taxi throbbing waiting, while it fixes us with the aforesaid logic, leaving us sprawling on a pin, pinned and wriggling against the wall. And then, it emerges, announcing its arrival to say:

“I am Lazarus, come from the dead
Come back to tell you all
I shall tell you all …”

After setting up the indicator variables, we now add another piece. Take the infimum on both sides. When we do this, the branch with \infty vanishes, as we can find a lower value for the expression, which in this case has to be the one that satisfies the constraint.


It can be seen that the resulting expression is the equation for the primal. Observe also that the way the constraint gets manipulated is rather reminescent of the Lagrange multiplier machinery for constrained optimization (Boyd and Vandenberghe).

In the next few manipulations, we arrive at the expression for the dual. This time, we appeal to another trick of swapping the inf and the sup.


The four lines constituting the manipulation are explained below. In the first (equation (3)), we restate the expression as before. In the second, we swap inf and sup. This step is called the minimax principle, wisdom that I will take for granted (I contradict what I said earlier, about not putting up things that I don’t understand, but we will look the other way for now). The third equation is a rearrangement, fairly uncomplicated. The last step (equation (6)) needs a bit of explanation. In this, we again reinterpret the problem as one of constrained optimization, as we do with Lagrange multipliers. The constraint P_X^t \varphi + P_Y^t \psi \leq C is to be viewed in a component wise sense.

And there we have it. We have just derived the duality equations subject to the cost constraint.


In fact, it seems that we have equality between the primal and dual problems (is that correct?).

I = inf_\gamma \langle C, \gamma \rangle = sup_{\varphi, \psi} L(\varphi, \psi)


1. Notions of optimal transport theory and how to implement them on a computer

2. Computational Optimal Transport (Peyre and Cuturi):

3. WGAN:

4. WAE:

5. Convex Optimization (Boyd and Vandenberghe):

A reshaping bug

In Tacotron, it is recommended that we generate several output tokens at each decoder timestep, and then use one (or all, or some combination thereof) of them as input for the next timestep. While coding this up, I inadvertently created a bug for myself which went undetected for a long long time. To set things up in black and white, we create an example that shows how we should (and should not) reshape pytorch tensors in these scenarios.

Assume that we have a vector of size 8 in batches of 5 elements (5,8). In the end, we would like to produce 2 batches of 5, with vectors of size 4 – we break off the size 8 vector into two (2,5,4). The reason we do this is that our desired decoder output per timestep is of size 4, but since we produce a vector of size 8 by agglomerating two frames/timesteps into one, we would need to break it to read off the individual vectors.



Now, we would like to reshape this vector to produce 2 batches of frames of size 4. So the first four columns go into the first bucket, while the rest goes into the other bucket.




The point behind shaping it as (2,5,4) is that we can read them off as x[0] and x[1] which comprise our desired output for timesteps t+0, t+1, with the total number of decoder timesteps being (we would actually stop when the EOS token is produced, but disregard that for now) 2T.

The wrong way

The above is all obvious when we look at it retrospectively (and I, like Tiresius, perceived the bug and foretold the rest), but in my code, I had it as follows:



So, now we are traversing each row to grab 4 elements as we go along, and then dropping them in the same bucket until we do this 5 times, after which we drop the remainder in the other bucket. The correct way to do it would be to drop them off alternately in different buckets.


Needless to say, the choir will sneer for being preached upon. Nevertheless, it is fair to say that we mustn’t just make the shapes fit while reshaping. The ordering also matters, and its not a bad thing to be mindful of that to save us some heartaches and unnatural shocks that code is heir to.

Tacotron papers

A good paper comes with a good name, giving it the mnemonic that makes it indexable by Natural Intelligence (NI), with exactly zero recall overhead, and none of that tedious mucking about with obfuscated lookup tables pasted in the references section. I wonder if the poor (we assume mostly competent) reviewer will even bother to check the accuracy of these references (as long as we don’t get any latex lacunae ??) in their already overburdened roles of parsing difficult to parse papers, understanding their essence and then their minutiae, and on top of that providing meaningful feedback to the authors.

Tacotron is an engine for Text To Speech (TTS) designed as a supercharged seq2seq model with several fittings put in place to make it work. The curious sounding name originates – as mentioned in the paper – from obtaining a majority vote in the contest between Tacos and Sushis, with the greater number of its esteemed authors evincing their preference for the former. In this post, I want to describe what I understand about these architectures designed to produce speech from text, and the broader goal of disentangling stylistic factors such as prosody and speaker identity with the aim of combining them in suitable ways.

Tacotron is a seq2seq model taking in text as input and dumping speech as output. As we all know, seq2seq models are used ubiquitously in machine translation and speech recognition tasks. But over the last two or three years, these models have become usable in speech synthesis as well, largely owing to the Tacotron (Google) and DeepVoice (Baidu) works. In TTS, we take in characters or phonemes as input, and dump out as a speech representation. At the time of publication (early 2017), this model was unique in that it demonstrated a nearly end to end architecture that could produce plausible sounding speech. Related in philosophy was the work by Baidu (DeepVoice 1 – I think) which also had a seq2seq TTS setup, but which was not really end to end because it needed to convert text (grapheme) to phoneme and then pass that in to subsequent stages. The DeepVoice architecture has also evolved with time, and has produced complex TTS setups adapted for tasks such as speaker adaptation and creating speaker embeddings in transfer learning contexts with few data. Nevertheless, we should note that speech synthesis is probably never going to be end to end (as the title of the paper says) owing to the complexity of the pipeline. In Tacotron, we generate speech representations which need to be converted to audio in subsequent steps. This makes speech a somewhat difficult problem to approach – speaking personally – for the newbie unlike images where we can readily see the output. Even so, we can make do by looking at spectrograms. The other problem in speech is the lack of availability of good data. It costs a lot of money to hire a professional speaker!


At a high level, we could envisage Tacotron as an encoder-decoder setup where text embeddings get summarized into a context by the encoder, which is then used to generate output speech frames by the decoder. Attention modeling is absolutely vital in this setup for it to generalize to unseen inputs. Unlike NMT or ASR, the output is inflated, so that we have many output frames for an input frame. Text is a highly compressed representation whereas speech (depending on the representation) is highly uncompressed.

Several improvements are made to bolster the attention encoder-decoder model, which we describe below.

  1. Prenet: This is a bottleneck layer of sorts consisting of full connections. Importantly, it uses dropout which serves as a regularization mechanism for it to generalize to test samples. The paper mentions that scheduled sampling does not work well for them. Prenet is also used in the decoder.
  2. CBHG: “Convolutions, FilterBanks and Highway layers, Gated Recurrent Units”. We could view the CBH parts as preprocessing layers taking 1-3-5-… convolutions (Conv1d) and stacking all of them up after maxpooling them. It makes note of relationships between words of varying lengths (n-grams) and collates them together. This way, we agglomerate the input characters to a more meaningful feature representation taking into account the context at the word level. These are then sent to a stack of highway layers – a play on the residual networks idea – before being handed off to the recurrent encoder. I’ve described the architecture in more detail in another post. The “G” in CBHG is the recurrent encoder, specified as a stack of bidirectional Gated Recurrent Units.
  3. Reduction in output timesteps: Since we produce several similar looking speech frames, the attention mechanism won’t really move from frame to frame. To alleviate this problem, the decoder is made to swallow inputs only every ‘r’ frames, while we dump r frames as output. For example, if r=2, then we dump 2 frames as output, but we only feed in the last frame as input to the decoder. Since we reduce the number of timesteps, the recurrent model should have an easier time with this approach. The authors note that this also helps the model in learning attention.

Apart from the above tricks, the components are setup the usual way. The workflow for data is as follows.


From “Fully Character-Level Neural Machine Translation Without Explicit Segmentation” – the basis for CBHG


The original Tacotron architecture



Tacotron 2 – a simpler architecture

We pass the processed character inputs to the prenet and CBHG, producing encoder hidden units. These hidden units contain linguistic information, replacing the linguistic context used in older approaches (Statistical Parametric Speech Synthesis or SPSS). Since this is the ‘text’ summary, we can also think of it as seed for other tasks such as in the Tacotron GST works (Global Style Tokens) where voices of different prosodic styles are summarized into tokens which can then be ‘mixed’ with the text summaries generated above.

On the decoder side, we first compute attention in the so-called attention RNN. This term caused me a lot of confusion initially. The ‘attention RNN’ is essentially the block where the we compute the attention context, which is then concatenated with the input (also ejected by a prenet) and then sent to a recurrent unit. The input here is a mel spectrogram frame. The output of the attention RNN is then sent to a stack of recurrent units and projected back to the mel dimensions. Also, this stack uses residual connections.

Finally, instead of generating a single mel frame for each decoder step, we produce ‘r’ output frames, the last of which gets used as input for the next decoder step. This is termed the ‘reduction factor’ in the paper, with the number of decoder timesteps getting reduced by a factor or ‘r’. As we are producing a highly uncompressed speech output, it makes sense to assist the RNNs by reducing their workload. It is also said to be critical in helping the model learn attention.

Generalization comes from dropout in this case, with ground truth being fed – teacher forced – during training as input decoder frames. Contrast this with running in inference mode where one has to use decoder generated outputs as input for the next timestep. In scheduled sampling, the amount of teacher forcing is tapered down as the model trains. Initially, a large amount of ground truth is used, but as the models learns with time, we slowly taper off to inference mode with some sort of schedule. The other way is to use GANs to make the inference mode (fake) behave like teacher forced output (real) with a discriminator being used to tell them apart; the idea being that the adversarial game results in the generator producing inference mode samples that are indistinguishable from the desired, teacher forced mode output. This approach is named “Professor Forcing”.

Postprocessing network

Mel spectrogram frames generated by the network are converted back to audio with the postprocessing scheme. This postprocessing scheme is described in the literature as a ‘vocoder’ or backend. In the original tacotron work, the process involves first converting the mel frames into linear spectrogram frames with a neural network to learn the mapping. We actually lose information going from linear to mel spectrogram frames. The mapping is learnt using a CBHG network (not necessarily encoder-decoder as the input and output sequences have the same lengths) as used in the front end part of the setup computing linguistic context. In the end, audio is produced by carrying out an iterative procedure called Griffin-Lim on the linear spectrogram frames. In more recent works, the backend part is replaced by a Wavenet for better quality samples.

Refinements in Tacotron 2

Tacotron 2’s setup is much like its predecessor, but is somewhat simplified, in in that it uses convolutions instead of CBHG, and does away with the (attention RNN + decoder layer stack) and instead uses two 1024 unit decoder layers. In addition to this, it also has provisions to predict the probability of emission of the ‘STOP’ token, since the original Tacotron had problems predicting the end of sequence token and tended to get stuck. Also, Tacotron 2 discards the reduction factor, but adds location sensitive attention as in Chorowski et al’s ASR work to help the attention move forward. Supposedly, these changes obviate the need for the r-trick. In addition to location sensitive attention, the GST Tacotron works also use Alex Graves’ GMM attention from the handwriting synthesis works.

There are many other minutiae as well such as using MSE loss instead of L1,  which I suppose would qualify as tricks to be noted if one is actually creating these architectures.

In addition to architectural differences, the important bit is that Tacotron2 uses Wavenet instead of Griffin-Lim to get back the audio signal which makes for very realistic sounding speech.

Controlling speech with style tokens

The original Tacotron work was designed for a single speaker. While this is an important task in and of itself, one would also like to control various factors associated with speech. For example, can we produce speech from text for different speakers, and can we modify prosody? Unlike images where these sorts of unsupervised learning problems have been shown to be amenable to solutions (UNIT, CycleGAN, etc.), in speech we are very much in the wild west, and the gold rush is on. Built on top of Tacotron, Google researchers have made attempts at exercising finer control over factors by creating embeddings for prosodic style and speakers (also in Baidu’s works), which they call Global Style Tokens (GST).


Style Tokens are vectors exemplifying prosodic style which are shared across examples and learnt during training. They are randomly initialized, and compared against a ‘reference’ encoding (which is just a training audio example ingested by the reference encoder module) by means of attention, so that our audio example is now a weighted sum of all the style tokens. During inference, we can manually adjust the weights of each style token, or simply supply a reference encoding (again, spat out by the style encoder after putting in reference audio).



1. UNIT: Unsupervised Image to Image Translation Networks:

2. CycleGAN:

3. Tacotron: Towards End to End Speech Synthesis:

4. Tacotron-2: Natural TTS Synthesis by Concatenating Wavenet on Mel Spectrogram Predictions:

5. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron:

6. (GST) Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis:

7. Alex Barron’s notes (excellent) describing Tacotron, implementation and woes:

8. Baidu Deep Voice 3:

9. (Speaker adaptation with Deep Voice) Neural Voice Cloning with a few samples:

10. Professor Forcing:

11. Alex Graves: “Generating Sequences With Recurrent Neural Networks”:

12. CBHG: Fully Character-Level Neural Machine Translation Without Explicit Segmentation:

Location based attention – from “Attention based Models for Speech Recognition”

“We have lingered in the chambers of the sea

By seagirls wreathed with seaweed red and brown

Till human voices wake us and we drown”.

These lines are from Eliot’s “The love song of J. Alfred Prufrock”. To take that to our more banal reality – not the ones with the cicada or dry grass singing – the mermaids that we have seen singing (each to each!) are the beautiful results that we see in the papers, and the lingering in the chambers of the sea could perhaps be this utopia that we envisage in our never-ever land submerged under. Nevertheless, as Eliot  notes dolefully, the mermaids don’t sing to us, and when we wake up from that sleep (O city city … I can sometimes besides a public bar in lower Thames Street …), we realize that not only do the mermaids spurn us with their ardent refusal to sing, but also that our watery singing chamber turns into a watery grave, submerged, inundated, buried. Somehow, in Eliot, the water seems to form such an important theme, from the wet beginning of “The Fire Sermon”, to the drowning of the Phoenician sailor in “Death by Water”, to the grand ending with visions of the Fisher King sitting upon the shore of his waste land fishing, with a desolate, arid (and beautiful!) landscape behind him wondering about his legacy.

In our case, the works and days of hands consist of restless – endless – hours watching our little attention curve appearing (mostly, not, as we would surmise) with its wispy sharpness before us, thereby announcing that the model is indeed learning something. Again, as he has observed only too wisely, “I have wept and fasted, wept and prayed. I have seen the moment of my greatness flicker, and in short I was afraid”.

With that entirely unnecessary digression into the drowning excesses of Eliotism – think of the opening chant in Ashtanga practice, or Vogon poetry, as suits one’s taste –  we bring our attention towards matters of the moment.

In keeping with our foray into attention based seq2seq models, which we know and love, I would like to put my thoughts down on what seems like a necessary enhancement to the content based attention mechanism by putting in a location component to it. This idea comes from the paper by Chorowski et al detailing their setup on speech recognition.

Now (the old sweats can take a breather), in the Bahdanau work for NMT, they present the so-called content based attention, in which a hidden unit of the decoder is compared against all encoder outputs for similarity (which, as we know could be some function – a dot product, or a learned function expressed as a perceptron). This works well for NMT, where all the tokens in the input and output are different. However, when  we replace the input sequence with a speech representation – say, a spectrogram –  there are many many frames of input that must be compared against, and these frames don’t change much from frame to frame, meaning that the attention will be spread out over all these frames, and they will be ranked equally – not quite entirely the result we want. The paper says that we must encourage the attention mechanism to move forward by noting what the attention was in the previous timestep. The point is that it has many nearly identical frames to score and could therefore use a hint on what the attention was in the previous step, so that it focuses on that particular frame’s vicinity. The other approach – as described in the previous post – is to cluster frames by reducing the number of timesteps in a hierarchical way (as in Listen Attend and Spell).

The Chorowski paper is a seminal work. It was first introduced in the context of speech recognition as an extension of sorts to the  machine translation work by Bahdanau. More recently, it has been used in speech synthesis in the Tacotron series of papers. The idea is to design an attention mechanism that looks at the previous time step’s attention and picks from it by using a learned filter.

“Informally, we would like an attention model that uses the previous alignment αi−1 to select a short list of elements from h, from which the content-based attention, in Eqs. (5)–(6), will select the relevant ones without confusion.”

This is carried out by convolving the previous timestep’s attention with a matrix F (kxr), to produce k vectors for each of the encoder timesteps considered. chorowski_attention

The way it is written, the scheme seems a little cryptic (or is it?), until we note that in the above, we take one dimensional convolutions on the attention (after all, it is a 1D vector) in equation (8), with k filters with filter size r. What is not mentioned is the size of the convolved vector, which is probably user defined.  I found code from Keith Ito’s Tacotron implementation in tensorflow which seems to corroborate this interpretation, with the convolution designed such that the input and output sizes are made the same by appropriately sizing the padding cells.

Be all that as it may, one might perhaps summarize the game as using a filtered version of the previous timestep’s attention so that it’s influence is felt in the new attention output. Again, as is usual in such scenarios, convolutions are carried out on the time axis. Another way of looking at it is to consider the spectrogram as an image, so that we look at a given region of the image as noted by the attention vector.

The paper notes that this form of attention drew inspiration from Alex Graves’ GMM based attention – called ‘window weights’ in that work – where the weighting parameters decide the location, width and importance of the windows. This also seems to motivate location based attention in the DRAW paper. But it is important to note that the location of the window is forced to move forward at every time step.




Other pertinent tricks in the paper:

A. Sharpening attention in long utterances with use being made of a softmax temperature beta. The rationale for this trick is that in longer utterances when we have a long (unnormalized) attention vector ((0.1, 0.4, 0.8, -0.4, -0.5, 0.9, etc), we would like to sharpen the attention so that it focuses on a few pertinent frames. So, if we keep the temperature parameter as 10, for example, then we will get (exp(1), exp(4), exp(8), …) and naturally, this will completely obliterate most but the largest components as compared with (exp(0.1), exp(0.4), exp(0.8),…). In other words, it amplifies – or sharpens – the attention.


B. Smoothening attention for shorter utterances. In this scenario, it is claimed that one wants to smoothen instead of sharpening focus. The reasoning given is that the sharp model (top scored frames) performs poorly in the case of short utterances, leading them to try out smoothening it out a little. Another way to reason it out is that in smaller utterances, the contribution from each individual unit becomes all the more important.


While the ideas might seem somewhat arbitrary, I take these as tricks that one should incorporate into one’s own practice as needed.

These tricks find use in the Tacotron series of works, notably in Tacotron2, where a location sensitive attention mechanism is used to inform the attention mechanism that it must move forward.

Below is a pytorch version of the Tacotron implementation referred to above.


The line calling self.location_conv corresponds to equation (8), producing ‘k’ vectors of the same size, with ‘k’ being the number of filters in this case. We then make a linear operation on this to resize it to the hidden dimension in self.location_layer, which gets added on to the attention query.

When we call the attention mechanism, we call both the content (this is the standard concat attention) and the location sensitive versions (in the code above), as the equation (9) in the paper shows, after which we take a softmax of these energies. All this is available in r9y9’s implementation of Tacotron (minus maybe the location component).





1. Attention-Based Models for Speech Recognition (Chorowski et. al.):

2. Tacotron 2:

3. Keith Ito’s code:

4. r9y9’s  Tacotron implementation in PyTorch:

5. Alex Graves’ “Generating Sentences with Recurrent Neural Networks”:



Hierarchical Encoders – from Listen, Attend and Spell

In machine translation literature, we come across the notion of content based attention, made famous by the paper from Bahdanau. It builds upon the encoder-decoder translation model in which the input sequence is encoded into a single, fixed context vector, from which we decode tokens of the output. The attention mechanism improves upon this idea by deciding which frames of the input sequence are important while generating a given output frame. Not only does this add more ‘color’ to the context vector in longer sequences, but it is also extremely important for the model to learn the attention in order to decode things during inference.

To expand on this point, we could think of minimizing reconstruction cost, which a network will do when it sees teacher forced output (i.e. when we feed with ground truth labels), possibly by ignoring context. But during inference, there is no ground truth and it must decode from its own generations. If it learns attention, it knows exactly which symbols of the input it should look up – fixing it in a formulated phrase, sprawling on a pin, wriggling on the wall, and in short the translator has been nailed.

The point of the the current post is to introduce – no, speculate upon is more like it – the idea of clustering input frames in the context of speech recognition. The input in this case is a speech representation – quite uncompressed – and the output desired is text – or phoneme, which is more appropriate.

Clustering with hierarchical encoders

It is natural to pose the ASR problem as a language model, a seq2seq problem with attention, since we have time varying input and output sequences, and we could simply try to replace the NMT input/output language translation task with speech/text transcription. However, there is a little problem in that speech is a highly uncompressed sequence, containing among other things, speaker information, the duration of voiced phonemes. When we apply the NMT seq2seq attention model as is, we see that a given output phoneme will correspond to several frames of the input spectrogram features. Our model will have problems learning attention in this case.


One could think of ‘clustering’ the input spectrogram frames to circumvent this issue, so that we now focus on zones of the input rather than individual spectrogram frames. This is what the Listen, Attend and Spell paper does indirectly. They set up a hierarchical encoder by reducing the number of timesteps in the input by using stacked hidden recurrent layers, with each layer in the stack containing fewer timesteps than the ones below. At the top of the stack, we have a much reduced number of timesteps, and we use this layer as the hidden representation for the encoder side calculations for attention.




Coding it up

I looked up a pytorch implementation of the LAS paper (with additional hooks for self-attention) here, which I abstracted into a hacky notebook.


The key operation is to reshape the frames of a given layer in a stack, agglomerating two frames into a timestep, thus doubling the size of the vector (easier seen in code than explained). This is passed on into the RNN for the next layer of the stack.


  1. Listen, Attend and Spell
  2. Pytorch implementation of LAS:
  3. Example notebook:
  4. Neural Machine Translation by Jointly Learning to Align and Translate:

Wasserstein Autoencoders

In the last several weeks, I have been looking over VAEGAN papers in order to use one for practical work problems. The Wasserstein GAN is easily extended to a VAEGAN formulation, as is the LS-GAN (loss sensitive GAN – a brilliancy). But the survey brought up the very intriguing Wasserstein Autoencoder, which is really not an extension of the VAE/GAN at all, in the sense that it does not seek to replace terms of a VAE with adversarial GAN components. Instead, it constructs an autoencoder with a set of arguments using optimal transport or Wasserstein distances, which can also function as a generative model. It seems to be a modified version of the work “From optimal transport to generative modeling – the VEGAN cookbook”.

Why is the paper interesting?

  1. It provides an ostensibly simple recipe to implement a non-blurry VAE (there is no V, but lets just think of it as one because it can also generate).
  2. It has setups using a GAN or an MMD. The latter is potentially useful because we can do away with the pain involved in tuning the GAN
  3. It provides what looks like an elegant and logical way to cast the Wasserstein distance metric to setup the VAE/GAN problem.
  4. It is another instructive example of the VAEGAN toolbox setup involving a reconstruction term and a regularization term – only, that in this case they do not – arbitrarily – hack off a KLD with a GAN, but arrive at it through a constraint which appears as a penalty term.
  5. Generalizing VAE like formulations. The paper gives three instructive VAEGAN model comparisons, unifying them thematically – Adversarial Autoencoders (AAE), Adversarial Variational Bayes (AVB), and the original Variational Autoencoders (VAE). These generalizations arise for the case with random decoders – the paper introduces the idea with deterministic decodes, and then extends it to random decoders – with play on the regularizer of the VAE which these papers replace with a GAN.
  6. An explanation for why VAEs tend to be blurry. In the VAE, we make Q(Z|X) match the prior p_z(z) for each point. But since there will be overlap between Q(Z|X) between points, the way I see it, we will be averaging across these overlapping Q(Z|X) leading to blurriness. However, in the WAE, the recognition model is forced to match the prior in an aggregated, global sense \int q(z|x) p(x) dx = q_z(z) = p_z(z), so that we are drawing from this global quantity and not averaging out over overlapping Q(Z|X). This explanation obviously needs some parsing.
  7. An example of setting up optimization objectives by putting terms together. Machine learning works often put together various terms in order to satisfy a certain constraint. This paper illustrates how we combine the VAE like reconstruction with the regularization component with a penalized objective formulation. Qz

Marginality and transport cost

The Earth mover or Wasserstein distance is characterized by a joint distribution \Gamma(X,Y) between two measures X \sim P_X and Y \sim P_G such that their marginals equal P_G and P_X respectively. Furthermore, the EMD itself is a cost – a euclidian distance – associated with moving material between X and Y.

Concretely, we should be aware of the following three equations

p_X(x) = \int \gamma(x,y) dy

p_G(y) = \int \gamma(x,y) dx

cost = \int c(x,y) d\gamma  = E_{x,y\sim \gamma} c(x,y)

The Wasserstein problem seeks to find the joint that gives the minimum cost. Posed as such, we get the following equation. It is to be noted that as such, getting the Wasserstein distance is in itself an optimization problem.


In the famous WGAN work by Arjrovsky et al, they work with Kantorovich’s dual formulation which casts it as an upper bound.


In the WAE paper, the goal is to parameterize the generating distribution in terms of an intermediate distribution Q(Z|X) , so that we go from X to latent variables Z, and then get the generated samples Y from Z with P(Y|Z). When we do this, it starts to look like a VAE.

Latent variable model

Now, in a GAN we would like to minimize the distance between the distributions P_X and P_G, which are the real and generating distributions respectively. G comes from a latent variable model by sampling z from the prior p_z(z).

p_G(x) = \int p(x|z) p_z(z) dz

In order to make the connection with an autoencoder, we should map z to the input x. In the VAE, this is done through the recognition model Q(z|x) using a variational approximation to set it up. In the WAE, we do not have a variational lower bound, but we appeal to the same ideas of using an intermediate recognition like model, with qualifications. How they arrive at these qualifications constitutes the brilliancy of this paper.

Wasserstein distance between P_X and P_G

Starting with the definition of the EMD, we wish to find a particular set of couplings that allow us to go from X to Z to Y. The joint \Gamma from which we draw x,y must satisfy the marginal property as indicated above. First, we note that \Gamma(X,Y) = \Gamma(Y|X) P_X(X). Our goal is to factorize \Gamma(Y|X) in terms of the intermediate latent representation Z.

To this end, consider the candidate joint distribution

\gamma(x,y) = \int p(y|z) q(z|x) p_X(x) dz

Integrate wrt x and use the marginality constraint

\int \gamma(x,y) dx = p_G(y) = \int \int p(y|z) q(z|x)p_X(x) dx dz

By inspection we see that this will hold if (because we can pull out the terms only dependent on x)

\int q(z|x) p_X(x) dx = p_Z(z)

This is an important point, which gives us a a constraint to work with. We are now ready to create our objective function. In the initial, illustrative problem the paper makes the assumption that the decoder is deterministic. i.e. Y = G(Z), or P(Y|Z) = \delta_G(z) = \delta(y-G(z)). Later on, they move on to generalizing the result for random decoders, and make the connection with other VAE/GAN examples (AAE, AVB, VAE).

Set up the Wasserstein objective as follows:
Minimize the earth mover distance between x \sim p_X(x), y \sim p_G(y) with y being factorized through a latent variable model P(Y|Z) with y \sim p(y|z), and a prior p_z(z).

W(P_X,P_Y) = \inf_\gamma \int c(x,y) d\gamma(x,y)

Using the factorization for \gamma through Q, and assuming a deterministic decoder, with the constraint noted above, we can massage W as

W(P_X,P_Y) = \inf_\gamma \int c(x,y) p(y|z) q(z|x) p_X(x) dz dx dy

But since p(y|z) = \delta(y-G(z)) the term in y gets integrated out, noting that

\int \delta(x-a) f(x) dx = f(a)

The objective then becomes

W(P_X,P_Z) = \int c(x,G(z)) q(z|x) p_X(x) dx dz

Written in shorthand, we get

W(P_X,P_Z) = \inf_{q(z|x)} E_X E_{q(z|x)} c(x,G(z))

with the constraint \int q(z|x) p_x(x) dx = p_z(z)

Penalized objective

In order to enforce the constraint above, the paper adds a penalty term to make \int q(z|x) p_X(x) match p_z(z). Together with the reconstruction term coming from the primal of the optimal transport formulation, we get what looks like the components of a generative model like the VAE – a reconstruction term, plus, a regularizer term. The regularaizer gives the model its generative characteristics, in that without it we would get a regular autoencoder which will know how to reconstruct input, but will have ‘holes’ in Z in those places that don’t have training data. In other words, we won’t be able to draw from a latent representation.

W(P,Q) = \inf_{q(z|x)} E_{q(z|x)} [c(x,G(z))] + \lambda D_{GAN} (q_z, p_z)

where q_z = E_X q(z|x).

The GAN ensures that the aggregated posterior (that is the term used in the AAE paper) matches the prior. The first term serves as the reconstruction loss of the autoencoder.


While these notes are in no way a complete review of the paper, I felt that it was important to work out the derivations to my satisfaction. I think the rest of the paper is fairly straightforward in how it describes the training and inference process, and the generalizations with other VAE/GAN models. Nevertheless, for the practitioner, such recipes are of value inasmuch as they allow for experimentation with generative models for real world tasks.


  1. From optimal transport to generative modeling: The VEGAN cookbook –
  2. Wasserstein Auto-Encoders –
  3. Wasserstein GAN –
  4. Loss-Sensitive Generative Adversarial Networks on Lipshitz Densities –
  5. Adversarial Autoencoders –
  6. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks –