Tacotron papers

A good paper comes with a good name, giving it the mnemonic that makes it indexable by Natural Intelligence (NI), with exactly zero recall overhead, and none of that tedious mucking about with obfuscated lookup tables pasted in the references section. I wonder if the poor (we assume mostly competent) reviewer will even bother to check the accuracy of these references (as long as we don’t get any latex lacunae ??) in their already overburdened roles of parsing difficult to parse papers, understanding their essence and then their minutiae, and on top of that providing meaningful feedback to the authors.

Tacotron is an engine for Text To Speech (TTS) designed as a supercharged seq2seq model with several fittings put in place to make it work. The curious sounding name originates – as mentioned in the paper – from obtaining a majority vote in the contest between Tacos and Sushis, with the greater number of its esteemed authors evincing their preference for the former. In this post, I want to describe what I understand about these architectures designed to produce speech from text, and the broader goal of disentangling stylistic factors such as prosody and speaker identity with the aim of combining them in suitable ways.

Tacotron is a seq2seq model taking in text as input and dumping speech as output. As we all know, seq2seq models are used ubiquitously in machine translation and speech recognition tasks. But over the last two or three years, these models have become usable in speech synthesis as well, largely owing to the Tacotron (Google) and DeepVoice (Baidu) works. In TTS, we take in characters or phonemes as input, and dump out a speech representation. At the time of publication (early 2017), this model was unique in that it demonstrated a nearly end to end architecture that could produce plausible sounding speech. Related in philosophy was the work by Baidu (DeepVoice 1 – I think) which also had a seq2seq TTS setup, but which was not really end to end because it needed to convert text (grapheme) to phoneme and then pass that in to subsequent stages. The DeepVoice architecture has also evolved with time, and has produced complex TTS setups adapted for tasks such as speaker adaptation and creating speaker embeddings in transfer learning contexts with little data. Nevertheless, we should note that speech synthesis is probably never going to be end to end (as the title of the paper says) owing to the complexity of the pipeline. In Tacotron, we generate speech representations which need to be converted to audio in subsequent steps. This makes speech a somewhat difficult problem to approach – speaking personally – for the newbie unlike images where we can readily see the output. Even so, we can make do by looking at spectrograms. The other problem in speech is the lack of availability of good data. It costs a lot of money to hire a professional speaker!


At a high level, we could envisage Tacotron as an encoder-decoder setup where text embeddings get summarized into a context by the encoder, which is then used to generate output speech frames by the decoder. Attention modeling is absolutely vital in this setup for it to generalize to unseen inputs. Unlike NMT or ASR, the output is inflated, so that we have many output frames for an input frame. Text is a highly compressed representation whereas speech (depending on the representation) is highly uncompressed.

Several improvements are made to bolster the attention encoder-decoder model, which we describe below.

  1. Prenet: This is a bottleneck layer of sorts consisting of full connections. Importantly, it uses dropout which serves as a regularization mechanism for it to generalize to test samples. The paper mentions that scheduled sampling does not work well for them. Prenet is also used in the decoder.
  2. CBHG: “Convolutions, FilterBanks and Highway layers, Gated Recurrent Units”. We could view the CBH parts as preprocessing layers: a bank of 1D convolutions (Conv1d) with kernel widths 1, 2, 3, …, K is applied to the input, and the outputs are stacked together and then max pooled along time. The bank makes note of relationships between groups of characters of varying lengths (n-grams) and collates them together. This way, we agglomerate the input characters to a more meaningful feature representation taking into account the context at the word level. These are then sent to a stack of highway layers – a play on the residual networks idea – before being handed off to the recurrent encoder. I’ve described the architecture in more detail in another post. The “G” in CBHG is the recurrent encoder, specified as a bidirectional Gated Recurrent Unit (GRU) layer.
  3. Reduction in output timesteps: Since we produce several similar looking speech frames, the attention mechanism won’t really move from frame to frame. To alleviate this problem, the decoder is made to swallow inputs only every ‘r’ frames, while we dump r frames as output. For example, if r=2, then we dump 2 frames as output, but we only feed in the last frame as input to the decoder. Since we reduce the number of timesteps, the recurrent model should have an easier time with this approach. The authors note that this also helps the model in learning attention.
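The highway layers from item 2 are easier to see in code than in words. What follows is a minimal, dependency-free sketch, assuming the standard highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform H and a sigmoid gate T. The names (highway, linear, w_h, w_t and so on) are mine for illustration and do not come from the paper or from any library.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(x, weights, bias):
    # weights is a list of rows; returns weights @ x + bias
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway(x, w_h, b_h, w_t, b_t):
    # H(x): candidate transform (ReLU is an assumed choice of nonlinearity)
    h = [max(0.0, v) for v in linear(x, w_h, b_h)]
    # T(x): transform gate in (0, 1)
    t = [sigmoid(v) for v in linear(x, w_t, b_t)]
    # gate mixes the transform with the untouched input (the "carry" path)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

When the gate is shut (a very negative gate bias) the layer passes its input through untouched, which is the sense in which it is a play on residual connections; when the gate is open the layer behaves like an ordinary fully connected layer.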

Apart from the above tricks, the components are set up the usual way. The workflow for data is as follows.


From “Fully Character-Level Neural Machine Translation Without Explicit Segmentation” – the basis for CBHG
The original Tacotron architecture



Tacotron 2 – a simpler architecture

We pass the processed character inputs to the prenet and CBHG, producing encoder hidden units. These hidden units contain linguistic information, replacing the linguistic context used in older approaches (Statistical Parametric Speech Synthesis or SPSS). Since this is the ‘text’ summary, we can also think of it as seed for other tasks such as in the Tacotron GST works (Global Style Tokens) where voices of different prosodic styles are summarized into tokens which can then be ‘mixed’ with the text summaries generated above.

On the decoder side, we first compute attention in the so-called attention RNN. This term caused me a lot of confusion initially. The ‘attention RNN’ is essentially the block where we compute the attention context, which is then concatenated with the input (also ejected by a prenet) and then sent to a recurrent unit. The input here is a mel spectrogram frame. The output of the attention RNN is then sent to a stack of recurrent units and projected back to the mel dimensions. Also, this stack uses residual connections.

Finally, instead of generating a single mel frame for each decoder step, we produce ‘r’ output frames, the last of which gets used as input for the next decoder step. This is termed the ‘reduction factor’ in the paper, with the number of decoder timesteps getting reduced by a factor of ‘r’. As we are producing a highly uncompressed speech output, it makes sense to assist the RNNs by reducing their workload. It is also said to be critical in helping the model learn attention.
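The bookkeeping of the reduction factor can be sketched as a plain loop. This is only an illustration under stated assumptions: toy_step is a hypothetical stand-in for the real decoder (it just adds one to the previous frame), and decode, n_mels and the all-zero starting frame are names and choices of mine, not the paper’s.

```python
def toy_step(prev_frame, r):
    # Hypothetical stand-in for one decoder step: emits r frames,
    # each one being the previous frame plus 1.
    out = []
    frame = prev_frame
    for _ in range(r):
        frame = [v + 1.0 for v in frame]
        out.append(frame)
    return out


def decode(num_frames, r, step_fn, n_mels=3):
    prev = [0.0] * n_mels          # assumed all-zero starting frame
    frames = []
    steps = 0
    while len(frames) < num_frames:
        out = step_fn(prev, r)     # r frames come out of a single step
        frames.extend(out)
        prev = out[-1]             # only the last frame is fed back in
        steps += 1
    return frames[:num_frames], steps


frames, steps = decode(num_frames=6, r=2, step_fn=toy_step)
print(len(frames), steps)
```

With r=2 the loop takes half as many decoder steps as with r=1 for the same number of output frames, which is the whole point of the trick.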

Generalization comes from dropout in this case, with ground truth being fed – teacher forced – during training as input decoder frames. Contrast this with running in inference mode where one has to use decoder generated outputs as input for the next timestep. In scheduled sampling, the amount of teacher forcing is tapered down as the model trains. Initially, a large amount of ground truth is used, but as the model learns with time, we slowly taper off to inference mode with some sort of schedule. The other way is to use GANs to make the inference mode (fake) behave like teacher forced output (real) with a discriminator being used to tell them apart; the idea being that the adversarial game results in the generator producing inference mode samples that are indistinguishable from the desired, teacher forced mode output. This approach is named “Professor Forcing”.
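To make the scheduled sampling idea concrete, here is a small sketch of a teacher forcing schedule. The linear decay is an assumption on my part (other schedules are possible), and both function names are hypothetical.

```python
import random


def teacher_forcing_prob(step, total_steps, start=1.0, end=0.0):
    # Assumed linear schedule: start with full teacher forcing and
    # taper down to pure inference mode by the end of training.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def choose_decoder_input(ground_truth, prediction, p_teacher, rng):
    # With probability p_teacher feed the ground truth frame,
    # otherwise feed the model's own previous output.
    return ground_truth if rng.random() < p_teacher else prediction


rng = random.Random(0)
for step in (0, 50, 100):
    p = teacher_forcing_prob(step, total_steps=100)
    print(step, p, choose_decoder_input("truth", "generated", p, rng))
```

At step 0 the decoder always sees ground truth; by the last step it only sees its own generations, which is what it will have to live with at inference time.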

Postprocessing network

Mel spectrogram frames generated by the network are converted back to audio with the postprocessing scheme. This postprocessing scheme is described in the literature as a ‘vocoder’ or backend. In the original tacotron work, the process involves first converting the mel frames into linear spectrogram frames with a neural network to learn the mapping. We actually lose information going from linear to mel spectrogram frames. The mapping is learnt using a CBHG network (not necessarily encoder-decoder as the input and output sequences have the same lengths) as used in the front end part of the setup computing linguistic context. In the end, audio is produced by carrying out an iterative procedure called Griffin-Lim on the linear spectrogram frames. In more recent works, the backend part is replaced by a Wavenet for better quality samples.

Refinements in Tacotron 2

Tacotron 2’s setup is much like its predecessor, but is somewhat simplified, in that it uses convolutions instead of CBHG, and does away with the (attention RNN + decoder layer stack) and instead uses two 1024 unit decoder layers. In addition to this, it also has provisions to predict the probability of emission of the ‘STOP’ token, since the original Tacotron had problems predicting the end of sequence token and tended to get stuck. Also, Tacotron 2 discards the reduction factor, but adds location sensitive attention as in Chorowski et al.’s ASR work to help the attention move forward. Supposedly, these changes obviate the need for the r-trick. In addition to location sensitive attention, the GST Tacotron works also use Alex Graves’ GMM attention from the handwriting synthesis works.

There are many other minutiae as well such as using MSE loss instead of L1,  which I suppose would qualify as tricks to be noted if one is actually creating these architectures.

In addition to architectural differences, the important bit is that Tacotron2 uses Wavenet instead of Griffin-Lim to get back the audio signal which makes for very realistic sounding speech.

Controlling speech with style tokens

The original Tacotron work was designed for a single speaker. While this is an important task in and of itself, one would also like to control various factors associated with speech. For example, can we produce speech from text for different speakers, and can we modify prosody? Unlike images where these sorts of unsupervised learning problems have been shown to be amenable to solutions (UNIT, CycleGAN, etc.), in speech we are very much in the wild west, and the gold rush is on. Built on top of Tacotron, Google researchers have made attempts at exercising finer control over factors by creating embeddings for prosodic style and speakers (also in Baidu’s works), which they call Global Style Tokens (GST).


Style Tokens are vectors exemplifying prosodic style which are shared across examples and learnt during training. They are randomly initialized, and compared against a ‘reference’ encoding (which is just a training audio example ingested by the reference encoder module) by means of attention, so that our audio example is now represented as a weighted sum of all the style tokens. During inference, we can manually adjust the weights of each style token, or simply supply a reference encoding (again, spat out by the reference encoder after putting in reference audio).
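The weighted sum over style tokens can be sketched as follows. This is a simplification under stated assumptions: I use plain dot-product attention with a softmax, whereas the attention module in the paper is more elaborate, and the names style_embedding and combine are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(tokens, weights):
    # weighted sum of the token vectors
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(dim)]


def style_embedding(reference, tokens):
    # Compare the reference encoding against every token (dot product,
    # an assumed similarity), then mix the tokens with those weights.
    scores = [sum(r * t for r, t in zip(reference, tok)) for tok in tokens]
    weights = softmax(scores)
    return combine(tokens, weights), weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # learnt in practice, fixed here
embedding, weights = style_embedding([2.0, 0.0], tokens)
manual = combine(tokens, [1.0, 0.0, 0.0])        # inference: set weights by hand
print(embedding, weights, manual)
```

The two inference modes from the paragraph map to the two calls at the bottom: style_embedding derives the weights from reference audio, while combine lets us dial them in manually.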



1. UNIT: Unsupervised Image to Image Translation Networks: https://arxiv.org/abs/1703.00848

2. CycleGAN: https://arxiv.org/abs/1703.10593

3. Tacotron: Towards End to End Speech Synthesis: https://arxiv.org/abs/1703.10135

4. Tacotron-2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions: https://arxiv.org/abs/1712.05884

5. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron: https://arxiv.org/abs/1803.09047

6. (GST) Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis: https://arxiv.org/abs/1803.09017

7. Alex Barron’s notes (excellent) describing Tacotron, implementation and woes: http://web.stanford.edu/class/cs224s/reports/Alex_Barron.pdf

8. Baidu Deep Voice 3: http://research.baidu.com/Blog/index-view?id=91

9. (Speaker adaptation with Deep Voice) Neural Voice Cloning with a few samples: https://arxiv.org/abs/1802.06006

10. Professor Forcing: https://arxiv.org/abs/1610.09038

11. Alex Graves: “Generating Sequences With Recurrent Neural Networks”: https://arxiv.org/abs/1308.0850

12. CBHG: Fully Character-Level Neural Machine Translation Without Explicit Segmentation: https://arxiv.org/abs/1610.03017


Location based attention – from “Attention based Models for Speech Recognition”

“We have lingered in the chambers of the sea

By seagirls wreathed with seaweed red and brown

Till human voices wake us and we drown”.

These lines are from Eliot’s “The Love Song of J. Alfred Prufrock”. To take that to our more banal reality – not the ones with the cicada or dry grass singing – the mermaids that we have seen singing (each to each!) are the beautiful results that we see in the papers, and the lingering in the chambers of the sea could perhaps be this utopia that we envisage in our never-ever land submerged under. Nevertheless, as Eliot notes dolefully, the mermaids don’t sing to us, and when we wake up from that sleep (O City city … I can sometimes hear beside a public bar in Lower Thames Street …), we realize that not only do the mermaids spurn us with their ardent refusal to sing, but also that our watery singing chamber turns into a watery grave, submerged, inundated, buried. Somehow, in Eliot, the water seems to form such an important theme, from the wet beginning of “The Fire Sermon”, to the drowning of the Phoenician sailor in “Death by Water”, to the grand ending with visions of the Fisher King sitting upon the shore of his waste land fishing, with a desolate, arid (and beautiful!) landscape behind him wondering about his legacy.

In our case, the works and days of hands consist of restless – endless – hours watching our little attention curve appearing (mostly, not, as we would surmise) with its wispy sharpness before us, thereby announcing that the model is indeed learning something. Again, as he has observed only too wisely, “I have wept and fasted, wept and prayed. I have seen the moment of my greatness flicker, and in short I was afraid”.

With that entirely unnecessary digression into the drowning excesses of Eliotism – think of the opening chant in Ashtanga practice, or Vogon poetry, as suits one’s taste –  we bring our attention towards matters of the moment.

In keeping with our foray into attention based seq2seq models, which we know and love, I would like to put my thoughts down on what seems like a necessary enhancement to the content based attention mechanism by putting in a location component to it. This idea comes from the paper by Chorowski et al detailing their setup on speech recognition.

Now (the old sweats can take a breather), in the Bahdanau work for NMT, they present the so-called content based attention, in which a hidden unit of the decoder is compared against all encoder outputs for similarity (which, as we know could be some function – a dot product, or a learned function expressed as a perceptron). This works well for NMT, where all the tokens in the input and output are different. However, when  we replace the input sequence with a speech representation – say, a spectrogram –  there are many many frames of input that must be compared against, and these frames don’t change much from frame to frame, meaning that the attention will be spread out over all these frames, and they will be ranked equally – not quite entirely the result we want. The paper says that we must encourage the attention mechanism to move forward by noting what the attention was in the previous timestep. The point is that it has many nearly identical frames to score and could therefore use a hint on what the attention was in the previous step, so that it focuses on that particular frame’s vicinity. The other approach – as described in the previous post – is to cluster frames by reducing the number of timesteps in a hierarchical way (as in Listen Attend and Spell).

The Chorowski paper is a seminal work. It was first introduced in the context of speech recognition as an extension of sorts to the  machine translation work by Bahdanau. More recently, it has been used in speech synthesis in the Tacotron series of papers. The idea is to design an attention mechanism that looks at the previous time step’s attention and picks from it by using a learned filter.

“Informally, we would like an attention model that uses the previous alignment \alpha_{i-1} to select a short list of elements from h, from which the content-based attention, in Eqs. (5)–(6), will select the relevant ones without confusion.”

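This is carried out by convolving the previous timestep’s attention with a matrix F (k×r), to produce a vector of k location features for each of the encoder timesteps considered. In the paper’s notation, these are equations (8) and (9):

f_i = F * \alpha_{i-1}

e_{i,j} = w^T \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)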

The way it is written, the scheme seems a little cryptic (or is it?), until we note that in the above, we take one dimensional convolutions on the attention (after all, it is a 1D vector) in equation (8), with k filters with filter size r. What is not mentioned is the size of the convolved vector, which is probably user defined.  I found code from Keith Ito’s Tacotron implementation in tensorflow which seems to corroborate this interpretation, with the convolution designed such that the input and output sizes are made the same by appropriately sizing the padding cells.
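My reading of equation (8) can be written out as a short sketch: k filters of width r slide over the previous attention vector, with zero padding so that the output has one k-dimensional feature vector per encoder timestep. This assumes an odd filter width and, as deep learning frameworks do, computes a cross-correlation rather than a flipped convolution; the function name location_features is hypothetical.

```python
def location_features(prev_attention, filters):
    """prev_attention: list of T attention weights from the previous step.
    filters: k lists, each of (odd) width r.
    Returns T lists of k values, one feature vector per encoder timestep."""
    r = len(filters[0])
    pad = (r - 1) // 2                       # 'same' padding for odd r
    padded = [0.0] * pad + list(prev_attention) + [0.0] * pad
    features = []
    for j in range(len(prev_attention)):
        window = padded[j:j + r]             # neighbourhood of timestep j
        features.append([sum(w * a for w, a in zip(f, window))
                         for f in filters])
    return features


prev = [0.0, 0.0, 1.0, 0.0, 0.0]             # attention peaked at timestep 2
filters = [[0.0, 1.0, 0.0],                  # copies the attention
           [1.0, 1.0, 1.0]]                  # smears it over neighbours
for j, f in enumerate(location_features(prev, filters)):
    print(j, f)
```

The second filter shows the intent: timesteps next to the previous peak receive a nonzero feature, which is the hint the attention needs about where it was looking a moment ago.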

Be all that as it may, one might perhaps summarize the game as using a filtered version of the previous timestep’s attention so that its influence is felt in the new attention output. Again, as is usual in such scenarios, convolutions are carried out on the time axis. Another way of looking at it is to consider the spectrogram as an image, so that we look at a given region of the image as noted by the attention vector.

The paper notes that this form of attention drew inspiration from Alex Graves’ GMM based attention – called ‘window weights’ in that work – where the weighting parameters decide the location, width and importance of the windows. This also seems to motivate location based attention in the DRAW paper. But it is important to note that the location of the window is forced to move forward at every time step.




Other pertinent tricks in the paper:

A. Sharpening attention in long utterances by means of a softmax inverse temperature beta (beta > 1), which multiplies the scores before the softmax is taken. The rationale for this trick is that in longer utterances when we have a long (unnormalized) attention vector (0.1, 0.4, 0.8, -0.4, -0.5, 0.9, etc.), we would like to sharpen the attention so that it focuses on a few pertinent frames. So, if we set beta to 10, for example, then we will get (exp(1), exp(4), exp(8), …) and naturally, this will completely obliterate all but the largest components as compared with (exp(0.1), exp(0.4), exp(0.8), …). In other words, it amplifies – or sharpens – the attention.


B. Smoothening attention for shorter utterances. In this scenario, it is claimed that one wants to smoothen instead of sharpening focus. The reasoning given is that the sharp model (top scored frames) performs poorly in the case of short utterances, leading them to try out smoothening it out a little. Another way to reason it out is that in smaller utterances, the contribution from each individual unit becomes all the more important.
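Both tricks are small enough to try out on the example scores from item A. In this sketch, sharpening multiplies the scores by beta before the softmax, and smoothing follows the paper’s suggestion of replacing the exponential with a bounded logistic sigmoid before normalizing. The function names are mine.

```python
import math


def softmax(scores, beta=1.0):
    # beta > 1 sharpens the distribution; beta = 1 is the plain softmax
    scaled = [beta * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed(scores):
    # bounded sigmoid in place of the unbounded exponential
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sig)
    return [s / total for s in sig]


scores = [0.1, 0.4, 0.8, -0.4, -0.5, 0.9]
print("plain    ", [round(a, 3) for a in softmax(scores)])
print("sharpened", [round(a, 3) for a in softmax(scores, beta=10.0)])
print("smoothed ", [round(a, 3) for a in smoothed(scores)])
```

The largest weight grows under sharpening and shrinks under smoothing, while all three variants still sum to one.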


While the ideas might seem somewhat arbitrary, I take these as tricks that one should incorporate into one’s own practice as needed.

These tricks find use in the Tacotron series of works, notably in Tacotron2, where a location sensitive attention mechanism is used to inform the attention mechanism that it must move forward.

Below is a pytorch version of the Tacotron implementation referred to above.


The line calling self.location_conv corresponds to equation (8), producing ‘k’ vectors of the same size, with ‘k’ being the number of filters in this case. We then make a linear operation on this to resize it to the hidden dimension in self.location_layer, which gets added on to the attention query.

When we call the attention mechanism, we call both the content (this is the standard concat attention) and the location sensitive versions (in the code above), as the equation (9) in the paper shows, after which we take a softmax of these energies. All this is available in r9y9’s implementation of Tacotron (minus maybe the location component).
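The way the content and location terms come together in equation (9) can be sketched without any framework. To keep it short, I assume that the query, the encoder outputs and the location features have already been projected to a common hidden size (what the linear layers would do), so the inputs here are those projections; attention_weights and its argument names are hypothetical.

```python
import math


def attention_weights(query_proj, memory_proj, location_proj, v):
    """query_proj: projected decoder state, length d.
    memory_proj: T projected encoder outputs, each of length d.
    location_proj: T projected location features, each of length d.
    v: scoring vector of length d."""
    energies = []
    for mem_j, loc_j in zip(memory_proj, location_proj):
        # content term and location term are simply added before the tanh
        hidden = [math.tanh(q + m + l)
                  for q, m, l in zip(query_proj, mem_j, loc_j)]
        energies.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    peak = max(energies)
    exps = [math.exp(e - peak) for e in energies]
    total = sum(exps)
    return [e / total for e in exps]


# Three identical frames: content alone cannot tell them apart.
memory = [[0.5], [0.5], [0.5]]
no_hint = attention_weights([0.0], memory, [[0.0], [0.0], [0.0]], [1.0])
hinted = attention_weights([0.0], memory, [[0.0], [1.0], [0.0]], [1.0])
print(no_hint, hinted)
```

With identical frames the content-only attention is spread out evenly, which is the failure mode described earlier; the location term breaks the tie in favour of the frame it points at.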





1. Attention-Based Models for Speech Recognition (Chorowski et al.): https://arxiv.org/abs/1506.07503

2. Tacotron 2: https://arxiv.org/abs/1712.05884

3. Keith Ito’s code: https://github.com/keithito/tacotron/issues/117

4. r9y9’s  Tacotron implementation in PyTorch: https://github.com/r9y9/tacotron_pytorch

5. Alex Graves’ “Generating Sequences With Recurrent Neural Networks”: https://arxiv.org/pdf/1308.0850.pdf



Hierarchical Encoders – from Listen, Attend and Spell

In machine translation literature, we come across the notion of content based attention, made famous by the paper from Bahdanau. It builds upon the encoder-decoder translation model in which the input sequence is encoded into a single, fixed context vector, from which we decode tokens of the output. The attention mechanism improves upon this idea by deciding which frames of the input sequence are important while generating a given output frame. Not only does this add more ‘color’ to the context vector in longer sequences, but it is also extremely important for the model to learn the attention in order to decode things during inference.

To expand on this point, we could think of minimizing reconstruction cost, which a network will do when it sees teacher forced output (i.e. when we feed with ground truth labels), possibly by ignoring context. But during inference, there is no ground truth and it must decode from its own generations. If it learns attention, it knows exactly which symbols of the input it should look up – fixing it in a formulated phrase, sprawling on a pin, wriggling on the wall, and in short the translator has been nailed.

The point of the current post is to introduce – no, speculate upon is more like it – the idea of clustering input frames in the context of speech recognition. The input in this case is a speech representation – quite uncompressed – and the output desired is text – or phoneme, which is more appropriate.

Clustering with hierarchical encoders

It is natural to pose the ASR problem as a language model, a seq2seq problem with attention, since we have time varying input and output sequences, and we could simply try to replace the NMT input/output language translation task with speech/text transcription. However, there is a little problem in that speech is a highly uncompressed sequence, containing, among other things, speaker information and the duration of voiced phonemes. When we apply the NMT seq2seq attention model as is, we see that a given output phoneme will correspond to several frames of the input spectrogram features. Our model will have problems learning attention in this case.


One could think of ‘clustering’ the input spectrogram frames to circumvent this issue, so that we now focus on zones of the input rather than individual spectrogram frames. This is what the Listen, Attend and Spell paper does indirectly. They set up a hierarchical encoder by reducing the number of timesteps in the input by using stacked hidden recurrent layers, with each layer in the stack containing fewer timesteps than the ones below. At the top of the stack, we have a much reduced number of timesteps, and we use this layer as the hidden representation for the encoder side calculations for attention.




Coding it up

I looked up a pytorch implementation of the LAS paper (with additional hooks for self-attention) here, which I abstracted into a hacky notebook.


The key operation is to reshape the frames of a given layer in a stack, agglomerating two frames into a timestep, thus doubling the size of the vector (easier seen in code than explained). This is passed on into the RNN for the next layer of the stack.
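Since the reshape is easier seen in code than explained, here is a dependency-free sketch of it on plain lists. Dropping a trailing odd frame is an assumption of mine (padding would be another option), and merge_pairs is a hypothetical name.

```python
def merge_pairs(frames):
    """Concatenate consecutive pairs of frames: T frames of size d
    become T // 2 frames of size 2 * d."""
    if len(frames) % 2 == 1:
        frames = frames[:-1]     # assumed convention: drop the odd frame
    return [frames[t] + frames[t + 1] for t in range(0, len(frames), 2)]


layer = [[float(t), float(t) + 0.5] for t in range(8)]   # 8 frames of size 2
for level in range(3):
    layer = merge_pairs(layer)   # an RNN would run on this before the next merge
    print(level, len(layer), len(layer[0]))
```

Each level halves the number of timesteps and doubles the feature size, so three levels leave the attention with an eighth of the frames to score.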


  1. Listen, Attend and Spell https://arxiv.org/abs/1508.01211
  2. Pytorch implementation of LAS: https://github.com/Alexander-H-Liu/Listen-Attend-and-Spell-Pytorch
  3. Example notebook: https://github.com/pravn/pyramidal_rnns/blob/master/pyramid_setup.ipynb
  4. Neural Machine Translation by Jointly Learning to Align and Translate: https://arxiv.org/abs/1409.0473

Wasserstein Autoencoders

In the last several weeks, I have been looking over VAEGAN papers in order to use one for practical work problems. The Wasserstein GAN is easily extended to a VAEGAN formulation, as is the LS-GAN (loss sensitive GAN – a brilliancy). But the survey brought up the very intriguing Wasserstein Autoencoder, which is really not an extension of the VAE/GAN at all, in the sense that it does not seek to replace terms of a VAE with adversarial GAN components. Instead, it constructs an autoencoder with a set of arguments using optimal transport or Wasserstein distances, which can also function as a generative model. It seems to be a modified version of the work “From optimal transport to generative modeling – the VEGAN cookbook”.

Why is the paper interesting?

  1. It provides an ostensibly simple recipe to implement a non-blurry VAE (there is no V, but let’s just think of it as one because it can also generate).
  2. It has setups using a GAN or an MMD. The latter is potentially useful because we can do away with the pain involved in tuning the GAN.
  3. It provides what looks like an elegant and logical way to cast the Wasserstein distance metric to set up the VAE/GAN problem.
  4. It is another instructive example of the VAEGAN toolbox setup involving a reconstruction term and a regularization term – only, that in this case they do not – arbitrarily – hack off a KLD with a GAN, but arrive at it through a constraint which appears as a penalty term.
  5. Generalizing VAE like formulations. The paper gives three instructive VAEGAN model comparisons, unifying them thematically – Adversarial Autoencoders (AAE), Adversarial Variational Bayes (AVB), and the original Variational Autoencoders (VAE). These generalizations arise for the case with random decoders – the paper introduces the idea with deterministic decoders, and then extends it to random decoders – with play on the regularizer of the VAE which these papers replace with a GAN.
  6. An explanation for why VAEs tend to be blurry. In the VAE, we make Q(Z|X) match the prior p_z(z) for each point. But since there will be overlap between Q(Z|X) between points, the way I see it, we will be averaging across these overlapping Q(Z|X) leading to blurriness. However, in the WAE, the recognition model is forced to match the prior in an aggregated, global sense \int q(z|x) p(x) dx = q_z(z) = p_z(z), so that we are drawing from this global quantity and not averaging out over overlapping Q(Z|X). This explanation obviously needs some parsing.
  7. An example of setting up optimization objectives by putting terms together. Machine learning works often put together various terms in order to satisfy a certain constraint. This paper illustrates how we combine the VAE like reconstruction with the regularization component with a penalized objective formulation.

Marginality and transport cost

The Earth mover or Wasserstein distance is characterized by a joint distribution \Gamma(X,Y) between two measures X \sim P_X and Y \sim P_G such that its marginals equal P_X and P_G respectively. Furthermore, the EMD itself is a cost – a Euclidean distance – associated with moving material between X and Y.

Concretely, we should be aware of the following three equations

p_X(x) = \int \gamma(x,y) dy

p_G(y) = \int \gamma(x,y) dx

cost = \int \gamma(x,y) c(x,y) dx dy = E_{x,y\sim \gamma} c(x,y)
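A discrete toy version of these three equations helps fix ideas. In the sketch below the joint is a small table gamma[i][j], the marginals are row and column sums, and the cost is the sum of gamma times c. The two-point example and the function names are made up for illustration.

```python
def marginals(gamma):
    # p_X(x): sum over y (rows); p_G(y): sum over x (columns)
    p_x = [sum(row) for row in gamma]
    p_g = [sum(gamma[i][j] for i in range(len(gamma)))
           for j in range(len(gamma[0]))]
    return p_x, p_g


def transport_cost(gamma, cost):
    # discrete analogue of \int \gamma(x,y) c(x,y) dx dy
    return sum(g * c
               for g_row, c_row in zip(gamma, cost)
               for g, c in zip(g_row, c_row))


points = [0.0, 1.0]
cost = [[abs(x - y) for y in points] for x in points]

stay_put = [[0.5, 0.0], [0.0, 0.5]]        # move nothing
independent = [[0.25, 0.25], [0.25, 0.25]]  # spread mass everywhere

for gamma in (stay_put, independent):
    print(marginals(gamma), transport_cost(gamma, cost))
```

Both joints have the same marginals, yet their costs differ, which is why picking the joint is itself an optimization problem.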

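The Wasserstein problem seeks to find the joint that gives the minimum cost, where the infimum runs over all joints \gamma with the marginals given above. Posed as such, we get the following equation. It is to be noted that as such, getting the Wasserstein distance is in itself an optimization problem.

W(P_X,P_G) = \inf_\gamma \int c(x,y) \gamma(x,y) dx dy
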
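In the famous WGAN work by Arjovsky et al., they work with Kantorovich’s dual formulation, which casts it as a supremum over 1-Lipschitz functions f.

W(P_X,P_G) = \sup_{||f||_L \le 1} E_{x \sim P_X} f(x) - E_{y \sim P_G} f(y)
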
In the WAE paper, the goal is to parameterize the generating distribution in terms of an intermediate distribution Q(Z|X) , so that we go from X to latent variables Z, and then get the generated samples Y from Z with P(Y|Z). When we do this, it starts to look like a VAE.

Latent variable model

Now, in a GAN we would like to minimize the distance between the distributions P_X and P_G, which are the real and generating distributions respectively. G comes from a latent variable model by sampling z from the prior p_z(z).

p_G(x) = \int p(x|z) p_z(z) dz

In order to make the connection with an autoencoder, we should map z to the input x. In the VAE, this is done through the recognition model Q(z|x) using a variational approximation to set it up. In the WAE, we do not have a variational lower bound, but we appeal to the same ideas of using an intermediate recognition like model, with qualifications. How they arrive at these qualifications constitutes the brilliancy of this paper.

Wasserstein distance between P_X and P_G

Starting with the definition of the EMD, we wish to find a particular set of couplings that allow us to go from X to Z to Y. The joint \Gamma from which we draw x,y must satisfy the marginal property as indicated above. First, we note that \Gamma(X,Y) = \Gamma(Y|X) P_X(X). Our goal is to factorize \Gamma(Y|X) in terms of the intermediate latent representation Z.

To this end, consider the candidate joint distribution

\gamma(x,y) = \int p(y|z) q(z|x) p_X(x) dz

Integrate wrt x and use the marginality constraint

\int \gamma(x,y) dx = p_G(y) = \int \int p(y|z) q(z|x)p_X(x) dx dz

By inspection we see that this will hold if (because we can pull out the terms only dependent on x)

\int q(z|x) p_X(x) dx = p_Z(z)

This is an important point, which gives us a constraint to work with. We are now ready to create our objective function. In the initial, illustrative problem the paper makes the assumption that the decoder is deterministic, i.e. Y = G(Z), or P(Y|Z) = \delta(y-G(z)). Later on, they move on to generalizing the result for random decoders, and make the connection with other VAE/GAN examples (AAE, AVB, VAE).

Set up the Wasserstein objective as follows:
Minimize the earth mover distance between x \sim p_X(x), y \sim p_G(y) with y being factorized through a latent variable model P(Y|Z) with y \sim p(y|z), and a prior p_z(z).

W(P_X,P_G) = \inf_\gamma \int c(x,y) \gamma(x,y) dx dy

Using the factorization for \gamma through Q, and assuming a deterministic decoder, with the constraint noted above, we can massage W as

W(P_X,P_G) = \inf_{q(z|x)} \int c(x,y) p(y|z) q(z|x) p_X(x) dz dx dy

But since p(y|z) = \delta(y-G(z)) the term in y gets integrated out, noting that

\int \delta(x-a) f(x) dx = f(a)

The objective then becomes

W(P_X,P_G) = \inf_{q(z|x)} \int c(x,G(z)) q(z|x) p_X(x) dx dz

Written in shorthand, we get

W(P_X,P_G) = \inf_{q(z|x)} E_X E_{q(z|x)} c(x,G(z))

with the constraint \int q(z|x) p_X(x) dx = p_z(z)

Penalized objective

In order to enforce the constraint above, the paper adds a penalty term to make \int q(z|x) p_X(x) dx match p_z(z). Together with the reconstruction term coming from the primal of the optimal transport formulation, we get what looks like the components of a generative model like the VAE – a reconstruction term plus a regularizer term. The regularizer gives the model its generative characteristics, in that without it we would get a regular autoencoder which will know how to reconstruct input, but will have ‘holes’ in Z in those places that don’t have training data. In other words, we won’t be able to draw from a latent representation.

D_{WAE}(P_X,P_G) = \inf_{q(z|x)} E_X E_{q(z|x)} [c(x,G(z))] + \lambda D_{GAN} (q_z, p_z)

where q_z = E_X q(z|x).

The GAN ensures that the aggregated posterior (that is the term used in the AAE paper) matches the prior. The first term serves as the reconstruction loss of the autoencoder.
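
As a rough numerical sketch of the penalized objective (not the paper's setup): the toy below uses 1-D deterministic linear encoders and decoders, a squared-error cost c, and, in place of D_{GAN}, a crude moment-matching penalty against p_z = N(0,1). All names here (`penalized_objective`, `moment_penalty`, the two encoder/decoder pairs, the value of lam) are made up for illustration. Both autoencoders reconstruct perfectly, but only the first places its codes where the prior expects them.

```python
import random


def moment_penalty(zs):
    """Crude stand-in for D_GAN(q_z, p_z): mismatch of the first two moments with N(0, 1)."""
    n = len(zs)
    mean = sum(zs) / n
    var = sum((z - mean) ** 2 for z in zs) / n
    return mean ** 2 + (var - 1.0) ** 2


def penalized_objective(xs, encode, decode, lam):
    """Monte-Carlo estimate of E_X E_q c(x, G(z)) + lam * penalty, with c the squared error."""
    zs = [encode(x) for x in xs]
    recon = sum((x - decode(z)) ** 2 for x, z in zip(xs, zs)) / len(xs)
    penalty = moment_penalty(zs)
    return recon + lam * penalty, recon, penalty


rng = random.Random(0)
xs = [rng.gauss(3.0, 2.0) for _ in range(5000)]  # toy data, x ~ N(3, 2^2)

# Autoencoder A: codes are standardized, so q_z is close to N(0, 1).
total_a, recon_a, pen_a = penalized_objective(
    xs, lambda x: (x - 3.0) / 2.0, lambda z: 3.0 + 2.0 * z, lam=10.0)

# Autoencoder B: identity maps, perfect reconstruction but q_z is N(3, 2^2).
total_b, recon_b, pen_b = penalized_objective(
    xs, lambda x: x, lambda z: z, lam=10.0)

print(f"A: recon={recon_a:.3g} penalty={pen_a:.3g} total={total_a:.3g}")
print(f"B: recon={recon_b:.3g} penalty={pen_b:.3g} total={total_b:.3g}")
```

Autoencoder B is the "regular autoencoder" case: sampling z from N(0,1) and decoding would mostly land away from where its codes live, which is what the penalty term is there to discourage.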


While these notes are in no way a complete review of the paper, I felt that it was important to work out the derivations to my satisfaction. I think the rest of the paper is fairly straightforward in how it describes the training and inference process, and the generalizations with other VAE/GAN models. Nevertheless, for the practitioner, such recipes are of value inasmuch as they allow for experimentation with generative models for real world tasks.


  1. From optimal transport to generative modeling: The VEGAN cookbook – https://arxiv.org/abs/1705.07642
  2. Wasserstein Auto-Encoders – https://arxiv.org/abs/1711.01558
  3. Wasserstein GAN – https://arxiv.org/abs/1701.07875
  4. Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities – https://arxiv.org/abs/1701.06264
  5. Adversarial Autoencoders – https://arxiv.org/abs/1511.05644
  6. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks – https://arxiv.org/abs/1701.04722

The data_parallel clause in pytorch

Some very quick and dirty notes on running on multiple GPUs using nn.parallel.data_parallel, the functional counterpart of the nn.DataParallel module.

I found some code from the dcgan sample. Assume that the layer is written as follows:

layer = nn.Sequential(nn.Conv2d(...),etc.)

Call as follows:

output = layer(input)

To run in parallel, first issue the .cuda() call.


Now wrap in the data_parallel clause and do feed forward calculation:

output = nn.parallel.data_parallel(layer,input)

1) https://github.com/pytorch/examples/tree/master/dcgan
2) http://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

A note on highway layers

My introduction to highway layers was from the Tacotron paper where they form part of a much more complex network that they call “CBHG”. Needless to say, it had the immediate effect of complete stupefaction (or is it stupefication?), not quite entirely unlike having one’s brains smashed in by a slice of lemon wrapped around a large gold brick. Nevertheless, over time I have come to accept that Tacotron, as intimidating as it is, is mostly a superstructure erected upon simpler conceptions that we all know and love. I should definitely sit down to put my thoughts on paper to see if I can explain things in a cogent way.

Highway layers are a gated generalization of the widely used residual connection idea. With residual connections, we add the input to the output of some network in order to ease learning, since it is otherwise difficult for gradients to pass through very deep network stacks.

y = f(x) + x

The highway layer works directly on this formulation by modulating how much of the input signal is added to the output. Naturally, in the above, f(x) and x are of the same size.

y = c. f(x) + (1-c). x

Now we add another little wrinkle by making c a trainable quantity, gated by sending to a sigmoid unit to make it lie between 0 and 1.

y = c(x) . f(x) + (1-c(x)) . x

Next, we make c(x) a vector, which would turn the products into element wise products.

y = c(x) * f(x) + (1-c(x)) * x
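
To see what the gate does numerically, here is a small pure-Python sketch of the element wise formula. The transform `f` and the gate logits are made up for illustration; in a real layer both would come from trained linear layers.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def highway(x, f, gate_logits):
    """y_i = c_i * f(x)_i + (1 - c_i) * x_i, with c = sigmoid(gate_logits)."""
    fx = f(x)
    c = [sigmoid(a) for a in gate_logits]
    return [ci * fi + (1.0 - ci) * xi for ci, fi, xi in zip(c, fx, x)]


def f(v):
    """Hypothetical transform: a scaled relu."""
    return [max(0.0, 2.0 * vi) for vi in v]


x = [1.0, -2.0, 0.5]

carry = highway(x, f, [-20.0] * 3)      # gate close to 0: the input passes through
transform = highway(x, f, [20.0] * 3)   # gate close to 1: the output is f(x)
mixed = highway(x, f, [0.0] * 3)        # gate equal to 0.5: average of f(x) and x
print(carry, transform, mixed)
```

With the gate closed we recover the identity, and with the gate open we recover a plain feed forward layer; training decides, per element, where in between to sit.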

We are now almost done. What remains is to specify the forms for the functions c and f. Tacotron uses fully connected (linear) layers for both.

Rewriting in the more formal notation of the paper, where H plays the role of f and T (the transform gate) plays the role of c:

y(x) = H(x, W_H) * T(x, W_T) + (1-T(x,W_T)) * x

H(x), T(x) are neural network outputs obtained by passing x through a fully connected layer followed by a non-linearity: relu (or some other) for H and sigmoid for the gate T.

In pytorch, we create them as follows (again, mutatis mutandis formatting maladies, etc.)

import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, in_size, out_size):
        super().__init__()
        # input size == output size, so that H, T and x combine element wise
        self.H = nn.Linear(in_size, out_size)  # transform
        self.T = nn.Linear(in_size, out_size)  # gate
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        H = self.relu(self.H(x))      # candidate output
        T = self.sigmoid(self.T(x))   # gate, between 0 and 1
        return H * T + (1 - T) * x

As we can see, the code is quite simple. In practice, we use a stack of highway layers. It forms part of the so-called CBHG layer in Tacotron, and is therefore included in that layer’s setup. They use a stack of n(=4) highway layers.

# excerpt from CBHG.__init__
self.highways = nn.ModuleList(
    [Highway(in_dim, in_dim) for _ in range(4)])

Now they call the highway layer.

# excerpt from CBHG.forward
for highway in self.highways:
    x = highway(x)

The code snippets above were adapted from r9y9's implementation, one of the few pytorch implementations of Tacotron available online.



1) “CBHG” (named thus in Tacotron): https://arxiv.org/abs/1610.03017
2) Tacotron: https://arxiv.org/abs/1703.10135; https://google.github.io/tacotron
3) Highway layers: https://arxiv.org/abs/1505.00387
4) Medium article: https://medium.com/jim-fleming/highway-networks-with-tensorflow-1e6dfa667daa
5) Tacotron pytorch code by r9y9: https://github.com/r9y9/tacotron_pytorch/blob/master/tacotron_pytorch/tacotron.py
6) Scratch notebook with highway layer: https://github.com/pravn/HighwayLayerTest

Learned Losses in the VAE-GAN

In this post, I make a few notes on the paper “Autoencoding beyond Pixels using a learned similarity metric”.


When we train a neural network to generate some data (or classify), we would like to produce samples that we compare with some ground truth. However, how we make this comparison is arbitrary. Quite often, we see use of the L2 loss.

\mathcal{L} = (x-x_0)^2

But we also often see the L1 loss which is |x-x_0|. For that matter, we can pretty much define any kind of loss function f(x)-f(x_0) or maybe f(x-x_0).

When we use the L2 loss, the interpretation is that the output distribution is a gaussian. We would like to maximize the log likelihood implied by this proposition, boiling down to minimizing the mean squared error (MSE). This is maximum likelihood learning.

A GAN, on the other hand, does not make any assumptions about the form of the loss function. Instead, it uses a critic, or discriminator, to tell us whether or not the samples are from the desired probability distribution. The signal from the discriminator (which minimizes the cross entropy between real and fake) is then fed back to the generator, which then learns to produce better samples. Likewise, the discriminator is also trained in tandem. The wisdom here is that a GAN does not suffer from posing an artificially conceived loss function such as an L1 or L2 loss, which can be inadequate for the complex distributions arising in data. The authors claim that this is one reason why GANs are able to produce stellar looking samples.


The paper improves the reconstructions of the vanilla VAE by using a GAN discriminator to learn the loss function, instead of using the MSE as is done in the original VAE formulation.


Normally, we assume that the VAE decoder – the reconstruction model – produces output differing from the actual value in a gaussian way.

p(x|z) = \mathcal{N}(x|x_{decoder}, I)

When we take the negative log-likelihood, this becomes the squared error (for a single sample), up to an additive constant that does not depend on the decoder

\mathcal{L}_{decoder} = -\log p(x|z)= NLL = \frac{1}{2} (x-x_{decoder})^2 + const
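
A quick numerical check of this equivalence, as a minimal sketch with made-up numbers: the negative log-likelihood of a unit-variance gaussian and half the squared error differ only by a constant, 0.5 * d * log(2 * pi) for a d-dimensional x, so they share the same minimizer.

```python
import math


def gaussian_nll(x, mean):
    """-log N(x | mean, I) for vectors given as lists of floats."""
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return 0.5 * sq + 0.5 * len(x) * math.log(2.0 * math.pi)


def half_squared_error(x, mean):
    return 0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mean))


x = [0.2, 0.9, 0.4]
for x_decoder in ([0.0, 1.0, 0.5], [0.2, 0.8, 0.4]):
    gap = gaussian_nll(x, x_decoder) - half_squared_error(x, x_decoder)
    print(gap)  # the same constant (up to rounding) for every x_decoder
```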

It can be minimized using a gradient-descent algorithm. However, the reconstructions of the VAE are well known to be blurry. To remedy this defect, the authors replace the VAE’s decoder loss function with a loss metric learned through a GAN.

Thus, the setup does not make assumptions about the loss function. Instead, it uses the GAN’s discriminator to tell whether the samples are real or fake. This information is used by the generator (which is the VAE’s decoder) to produce more realistic samples. Eventually, the generator produces samples from the proper generating process when it succeeds in fooling the discriminator. All this is of course standard dialogue in GANs.

\mathcal{L} = learned loss

The loss function for the VAE is (and the goal is to minimize L)

-\mathcal{L} = E_{z\sim q_\phi(z|x)} [\log p_\theta(x|z)] - KL(q_\phi(z|x)||p_\theta(z))

where \phi, \theta are the encoder and decoder neural network parameters, and the KL term is the so-called prior (regularization) term of the VAE.
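
For reference, a minimal sketch of the KL term in closed form. It assumes the common choice (not fixed by the text above) of a diagonal gaussian encoder with mean `mu` and log-variance `logvar`, and a unit normal prior; the function name is made up.

```python
import math


def kl_to_unit_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


print(kl_to_unit_normal([0.0, 0.0], [0.0, 0.0]))           # posterior equal to the prior
print(kl_to_unit_normal([1.0, 0.0], [0.0, 0.0]))           # shifted mean
print(kl_to_unit_normal([0.0, 0.0], [math.log(4.0), 0.0])) # inflated variance
```

The term vanishes only when the encoder output matches the prior exactly, and grows as the mean drifts from zero or the variance from one.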

Our problem here is to propose forms for \log p_\theta(x|z). In the original VAE, we assume that the samples produced differ from the ground truth in a gaussian way, as noted above. In the present work, the reconstructions and ground truth are sent to a neural network (the GAN discriminator), and the outputs of one of its hidden layers are assumed to differ in a gaussian way. It is easier written than said.

p(D^l(x_{real})|z) = \mathcal{N} (D^l(x_{real})|D^l(x_{fake}), I)

Here, the l^{th} layer of the discriminator is chosen to measure the difference between real and fake, to be used as VAE loss. This is from the last but one hidden layer. To rationalize the approach, we can take D^l as an identity mapping, in which case it becomes the same as the original formulation in Kingma and Welling.

We thus have two sets of equations:

\mathcal{L}_{vae} = MSE(D^{l} (x_{decoder}),D^{l} (x_{real})) +prior

-\mathcal{L}_{GAN} = E_{x\sim p_{data}} \log(D(x)) + E_{x\sim p_{model}} \log(1-D(x))

It is to be emphasized that we use the l^{th} layer output of the discriminator for use in the VAE’s loss term. The output of the GAN D(x) is a scalar quantity. Originally, I had been using the GAN output D(x) in the VAE, but this did not reconstruct the input properly. A hand waving explanation for this is that the scalar output produced when we use the discriminator’s final output is not sufficiently rich in capturing the differences between real and fake.
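
A toy illustration of that hand waving explanation, with a hand-picked (untrained, entirely hypothetical) discriminator: two clearly different inputs can receive exactly the same scalar D(x), so a loss on the scalar sees no error, while a loss on the hidden features does.

```python
import math


def relu(v):
    return [max(0.0, vi) for vi in v]


def toy_discriminator(x):
    """Hand-picked toy D: hidden features h = relu(x), scalar output sigmoid(sum(h) - 1)."""
    h = relu(x)
    score = 1.0 / (1.0 + math.exp(-(sum(h) - 1.0)))
    return score, h


def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


x_real = [2.0, 0.0]
x_fake = [0.0, 2.0]  # a poor reconstruction of x_real

d_real, h_real = toy_discriminator(x_real)
d_fake, h_fake = toy_discriminator(x_fake)

scalar_loss = (d_real - d_fake) ** 2   # comparing the final outputs D(x)
feature_loss = mse(h_real, h_fake)     # comparing hidden-layer outputs D^l(x)
print(scalar_loss, feature_loss)
```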

Auxiliary generator

In the paper, they claim that they get better results by augmenting the signal with an auxiliary generator setup. This is a refinement over the single generator/decoder+discriminator setup described above.

We have noted above that the decoder of the VAE also functions as the generator of the GAN, which generates a ‘fake’ \tilde{x}.

Z_{encoder} \to \tilde{x}


Z_{encoder} = \mu(x) + \epsilon \sigma(x)

with \epsilon\sim \mathcal{N}(0,I) as is usual in the VAE.

In addition to this, we now sample from a unit normal and use the same network as in the decoder (whose weights we now share) to generate an auxiliary sample X_p.

Z_p \to X_p

Here, Z_p \sim \mathcal{N}(0,I). I was a little unsure how to interpret this term, in that we could also use

Z_p = \mu(x) + \delta \sigma(x)

with \delta \sim \mathcal{N}(0,I). But doing so would mean that we are essentially using the VAE’s encoder output again. So it has to be the first interpretation, with Z_p drawn from the unit normal prior.

We will then have two generated outputs: one from the VAE’s decoder, \tilde{X}, and an auxiliary output X_p. We treat both of these as fakes to augment the GAN loss.
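
The two kinds of latent samples can be sketched in a few lines of plain Python. The encoder outputs `mu` and `sigma` below are made-up numbers standing in for \mu(x) and \sigma(x); the function names are hypothetical.

```python
import random


def reparameterize(mu, sigma, rng):
    """z = mu + eps * sigma with eps ~ N(0, I): a sample from the encoder q(z|x)."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]


def sample_prior(dim, rng):
    """z_p ~ N(0, I): independent of x, used for the auxiliary sample X_p."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
mu, sigma = [2.0, -1.0], [0.5, 0.1]     # made-up encoder outputs for one input x

z_enc = reparameterize(mu, sigma, rng)  # fed to the decoder to get x_tilde
z_p = sample_prior(len(mu), rng)        # fed to the same decoder to get X_p
print(z_enc, z_p)
```

The first sample stays near the code of a particular training input, while the second can land anywhere the prior has mass.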

There’s also a tuning factor \gamma that weights the VAE reconstruction (feature) loss against the GAN loss in the decoder objective.


The following objectives are to be minimized.

A. Encoder

\begin{array}{ll} -\mathcal{L}_{Encoder} &= E_{z\sim q_\phi(z|x)} [\log (p_\theta(x|z))] -KL(q_\phi(z|x)||p_\theta(z))\\ &= -(\mathcal{L}_{D^l} + \mathcal{L}_{prior}) \end{array}

B. Decoder/Generator

-\mathcal{L}_{Decoder} = -(\gamma \mathcal{L}_{D^l} -\mathcal{L}_{GAN})

C. GAN discriminator

-\mathcal{L}_{GAN} = E_{x\sim p_{fake}} [\log(1-D(x))] + E_{z_p\sim \mathcal{N}(0,I)} [\log(1-D(G(z_p)))] + E_{x\sim p_{real}}[\log(D(x))]
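
As a sanity check on the signs, here is a small sketch that evaluates the discriminator and decoder objectives from lists of discriminator outputs. The batches and the helper names are made up; expectations are replaced by batch means.

```python
import math


def mean_log(values, eps=1e-12):
    """Average of log(v), clamped away from log(0)."""
    return sum(math.log(max(v, eps)) for v in values) / len(values)


def discriminator_loss(d_real, d_recon, d_aux):
    """L_GAN: cross entropy with real samples labelled 1 and both kinds of fakes labelled 0."""
    return -(mean_log(d_real)
             + mean_log([1.0 - d for d in d_recon])
             + mean_log([1.0 - d for d in d_aux]))


def decoder_loss(feature_loss, l_gan, gamma):
    """L_Decoder = gamma * L_{D^l} - L_GAN: reconstruct well while raising the discriminator's loss."""
    return gamma * feature_loss - l_gan


# Discriminator outputs D(.) in (0, 1) for made-up batches of two samples each.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
sharp = discriminator_loss([0.99, 0.98], [0.02, 0.01], [0.03, 0.01])
print(confused, sharp)
```

A discriminator that separates real from fake well has a small L_GAN, which in turn makes the decoder's objective larger; the decoder lowers its loss by pushing D back towards confusion.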


After much fumbling around, I seem to have an implementation that improves upon the vanilla VAE’s reconstructions for MNIST. I plucked code from the dcgan sample in pytorch with arbitrary convolution/deconvolution layers. For one thing, it was not easy for me (without much experience training GANs) to get the generator/discriminator dynamics right. It looks like D is way better than G for the architecture used. However, it does produce plausible reconstructions, though I am sure we can improve the code a bit.

Here’s the pytorch implementation: https://github.com/pravn/GAN_VAE_hybrids

The samples below don’t use the auxiliary generator.

The vanilla VAE’s output is noticeably blurrier (ground truth not shown).


Fine print

There are a few things I struggled with:

1) Using a hidden layer output to compute loss. For a long time, I was trying to reconstruct with the output of the discriminator, which has only one scalar output.


This reconstructs only one or two digits correctly. It is therefore necessary to use the hidden layer in order to get correct reconstructions.

2) Retain graph: We have three networks: Encoder (VAE), Generator, and GAN discriminator. After backpropagating through one of these, pytorch frees the intermediate buffers of the graph in order to save memory (if I understand the error message correctly). The error message suggests passing retain_graph=True to the .backward() call to keep them.


3) Sharing weights (although this eventually turned out to be unnecessary):
I was not able to find a principled way to share weights across networks. However, we can manually ship the weights across by storing them.

E.g. The Decoder/Generator class below:

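import torch.nn as nn

class Aux(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def return_weights(self):
        return self.fc3.weight, self.fc4.weight

    def decode(self, z):
        # assumed here to mirror the decoder of the pytorch VAE example
        h3 = self.relu(self.fc3(z))
        return self.sigmoid(self.fc4(h3))

    def forward(self, z, fc3_weight, fc4_weight):
        # use the weights (nn.Parameter objects) shipped over from the other network;
        # note that the biases are not shared this way
        self.fc3.weight = fc3_weight
        self.fc4.weight = fc4_weight
        return self.decode(z)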


We get the weights with x.weight, and then we copy things over for use in the other network.

4) Alternative formulation/loss for GAN
Instead of the correct form that comes about from the cross entropy formulation for GAN, I used the so-called alternative form (well, that’s the name used in the alpha-GAN paper) which helps in alleviating the vanishing gradient problem that the original form is prone to. This was used in the decoder or generator.

Original form:
\mathcal{L}_{GAN,gen} = E_{x\sim p_{fake}} \log(1-D(x))

Alternative form:
\mathcal{L}_{GAN,gen} = -E_{x\sim p_{fake}} \log(D(x))
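
The vanishing gradient argument can be checked by hand. Writing D = sigmoid(a) for the discriminator's logit a on a fake sample, the generator's loss log(1 - D) has gradient -D with respect to a, while -log(D) has gradient D - 1. The sketch below (function names are made up) evaluates both when the discriminator confidently rejects the fake.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def original_loss(a):
    """log(1 - D) for one fake sample, with D = sigmoid(a)."""
    return math.log(1.0 - sigmoid(a))


def alternative_loss(a):
    """-log(D) for one fake sample, with D = sigmoid(a)."""
    return -math.log(sigmoid(a))


def original_grad(a):
    return -sigmoid(a)          # d/da log(1 - sigmoid(a))


def alternative_grad(a):
    return sigmoid(a) - 1.0     # d/da -log(sigmoid(a))


a = -10.0  # the discriminator confidently rejects the fake, D is tiny
print(original_grad(a), alternative_grad(a))
```

Both gradients point the same way (towards a larger D), but early in training, when D rejects the fakes easily, the original form passes almost no signal back while the alternative form passes a gradient of nearly unit size.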

Update – 6th Oct 2018

It is easiest to decouple the encoder, decoder and discriminator and backpropagate losses accordingly. I have an implementation in the EBGAN version here: https://github.com/pravn/vae_ebgan_mnist.

There is also a bug in most of my other (badly written) VAE code carrying over from code I had pulled from the pytorch examples repo – the KLD is normalized incorrectly by the number of pixels (==784 for MNIST). If we remove that normalization factor, we see horribly blurred, indecipherable images.

For the generator side, we do two generations, one for the reconstruction, and the other, an adversarial GAN like generation.

For the reconstruction stream, first compute the latent code (after reparametrization):

mu, logvar, z_enc = netE(images)

Pass the latent code to the decoder/generator

recon, _ = netG(z_enc)

Now, compute the auxiliary/adversarial component by transforming noise to an image (this is the regular GAN part). When I wrote this article initially, I did not understand why this term was needed. However, now I feel that we need this to bolster the signal (again, as the paper says). Imagine the scenario where we have a complicated VAE architecture. Our task essentially boils down to using a trained discriminator, since we compute the feature loss MSE(D(x), D(G(E(x)))). An auxiliary decoder setup that uses noise would therefore make sense, because none of these quantities are available initially, so it would take a while to learn a good D. Nonetheless, questions remain on the nature of latent codes learnt by this setup. In the plain GAN setup, we are using codes from all over the manifold N(0,I), whereas in the VAEGAN case, we are using codes that are part of the training manifold. I had seen this point in a paper which eludes memory currently. Nevertheless, I suspect that one way to train this setup would be to learn a discriminator first (train the regular GAN), and then use the trained discriminator to refine the VAE/GAN subsequently.

noise = noise.data.normal_(0,1)
aux_fake, _ = netG(noise)

Now, for D we have two loss components – one from the reconstruction term, and the other from the adversarial noise to image term.

D_fake, features_fake = netD(recon)  # reconstruction
D_aux_fake, _ = netD(aux_fake)  # auxiliary, adversarial component
GAN_loss_G = LAMBDA * (D_aux_fake - aux_fake).pow(2).mean()  # EBGAN loss
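
For context on the last line: in EBGAN the discriminator is itself an autoencoder, and the "energy" of a sample is its reconstruction error under that autoencoder. Below is a minimal sketch of the two losses from the EBGAN paper, L_D = D(x) + max(0, m - D(G(z))) and L_G = D(G(z)), using the mean squared error as the energy as in the snippet above. The margin value, the toy vectors and the function names are made up, and the LAMBDA weighting is left out.

```python
def energy(x, x_reconstructed):
    """Reconstruction error of the autoencoding discriminator (mean squared error)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_reconstructed)) / len(x)


def ebgan_discriminator_loss(energy_real, energy_fake, margin):
    """L_D = D(x) + max(0, margin - D(G(z))): low energy for real, at least `margin` for fake."""
    return energy_real + max(0.0, margin - energy_fake)


def ebgan_generator_loss(energy_fake):
    """L_G = D(G(z)): the generator wants its samples to have low energy."""
    return energy_fake


e_real = energy([0.0, 1.0], [0.1, 0.9])   # real sample, well reconstructed by D
e_fake = energy([1.0, 1.0], [0.0, 0.0])   # fake sample, poorly reconstructed by D
print(ebgan_discriminator_loss(e_real, e_fake, margin=2.0), ebgan_generator_loss(e_fake))
```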

While backpropagating, we can call .backward() for both the encoder and decoder.


The EBGAN experiment is part of a more practical try with real images using the DCGAN architecture. However, this exercise with MNIST was extremely instructive in the sense that it demonstrated how fragile the training process is. We often see instability in this minimax problem: G and D cannot find equilibrium, in that what is good for G might not be good for D and vice versa. After many painful tries, what finally worked (and even that, not very well) was to copy the experiments given in the EBGAN paper for MNIST. Also of interest is the use of the many tricks given in the DCGAN paper and WGAN experiments: train D and G alternately, vary the schedule, set weights, watch for G becoming too good, etc.


1) Autoencoding beyond Pixels using a learned similarity metric: https://arxiv.org/abs/1512.09300

2) Kingma and Welling: https://arxiv.org/abs/1312.6114

3) Generative adversarial networks: https://arxiv.org/abs/1406.2661

4) alpha-GAN: https://arxiv.org/abs/1706.04987

5) Energy based generative adversarial network: https://arxiv.org/abs/1609.03126

6) Unsupervised representation learning with deep convolutional generative adversarial networks: https://arxiv.org/abs/1511.06434