The data_parallel clause in pytorch

Some very quick and dirty notes on running on multiple GPUs using the nn.DataParallel module.

I found some code from the dcgan sample. Assume that the layer is written as follows:

layer = nn.Sequential(nn.Conv2d(...),etc.)

Call as follows:

output = layer(input)

To run in parallel, first issue the .cuda() call.


Now wrap it in the data_parallel clause and do the feed-forward calculation:

output = nn.parallel.data_parallel(layer,input)
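Putting the pieces together, a minimal sketch (the layer sizes and the device list here are illustrative, not from the dcgan sample):

import torch
import torch.nn as nn
from torch.autograd import Variable

layer = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU()).cuda()
input = Variable(torch.randn(16, 3, 64, 64).cuda())

# replicates `layer` on the visible GPUs, scatters the batch across them,
# runs the forward pass in parallel, and gathers the outputs on device 0
output = nn.parallel.data_parallel(layer, input)

# we can also restrict the devices used
output = nn.parallel.data_parallel(layer, input, device_ids=[0, 1])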



A note on highway layers

My introduction to highway layers was from the Tacotron paper, where they form a part of a much more complex network that they call “CBHG”. Needless to say, it had the immediate effect of complete stupefaction (or is it stupefication?), not quite entirely unlike having one’s brains smashed in by a slice of lemon wrapped around a large gold brick. Nevertheless, over time I have come to accept that Tacotron, as intimidating as it is, is mostly a superstructure erected upon simpler conceptions that we all know and love. I should definitely sit down and put my thoughts on paper to see whether I can explain things in a cogent way.

Highway layers are closely related to the widely used residual-connections idea. With residual connections, we add the input to the output of some network in order to aid learning, since gradients otherwise find it difficult to pass through very deep network stacks.

y = f(x) + x

The highway layer works directly on this formulation by modulating how much of the input signal is added to the output. Naturally, in the above, f(x) and x are of the same size.

y = c \cdot f(x) + (1-c) \cdot x

Now we add another little wrinkle by making c a trainable quantity, sent through a sigmoid unit so that it lies between 0 and 1.

y = c(x) \cdot f(x) + (1-c(x)) \cdot x

Next, we make c(x) a vector, which turns the products into element-wise products.

y = c(x) * f(x) + (1-c(x)) * x

We are now almost done. What remains is to specify the forms of the functions c and f. Tacotron uses fully connected linear layers.

Rewriting in the more formal notation of the paper:

y(x) = H(x, W_H) * T(x, W_T) + (1-H(x,W_H)) * x

H(x) and T(x) are neural network outputs obtained by passing x through fully connected layers followed by a non-linearity – a sigmoid for H and a ReLU (or some other non-linearity) for T.

In pytorch, we create them as follows, wrapped into an nn.Module (again, mutatis mutandis formatting maladies, etc.):

class Highway(nn.Module):
    def __init__(self, size_in, size_out):
        super(Highway, self).__init__()
        # input size == output size, for the skip connection to make sense
        self.H = nn.Linear(size_in, size_out)
        self.T = nn.Linear(size_in, size_out)  # ditto

        # define relu and sigmoid operations
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        H = self.sigmoid(self.H(x))  # the gate
        T = self.relu(self.T(x))     # the transform

        return H * T + (1 - H) * x
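A quick usage sketch (the sizes are arbitrary):

highway = Highway(128, 128)
x = Variable(torch.randn(32, 128))  # batch of 32, feature size 128
y = highway(x)                      # same shape as the input: 32 x 128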

As we can see, the code is quite simple. In practice, we use a stack of highway layers. It forms part of the so-called CBHG layer in Tacotron, and is therefore included in that layer’s setup. They use a stack of n(=4) highway layers.

class CBHG(nn.Module):
    def __init__(self, in_dim):
        super(CBHG, self).__init__()
        self.highways = nn.ModuleList(
            [Highway(in_dim, in_dim) for _ in range(4)])

Now they call the highway layers in sequence:

for highway in self.highways:
    x = highway(x)

The code snippets above were lifted from one of the only pytorch implementations of Tacotron available online.


1) “CBHG” (named thus in Tacotron):
2) Tacotron:
3) Highway layers:
4) Medium article:
5) Tacotron pytorch code by r9y9:
6) Scratch notebook with highway layer:

Learned Losses in the VAE-GAN

In this post, I make a few notes on the paper
“Autoencoding beyond Pixels using a learned similarity metric”.


When we train a neural network to generate some data (or classify), we would like to produce samples that we compare with some ground truth. However, how we make this comparison is arbitrary. Quite often, we see use of the L2 loss.

\mathcal{L} = (x-x_0)^2

But we also often see the L1 loss which is |x-x_0|. For that matter, we can pretty much define any kind of loss function f(x)-f(x_0) or maybe f(x-x_0).

When we use the L2 loss, the interpretation is that the output distribution is a gaussian, p(y|x) = \mathcal{N}(y|y_0, I). We would like to maximize the log-likelihood implied by this proposition, which boils down to minimizing the mean squared error (MSE). This is maximum likelihood learning.
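Spelling the step out: for a unit-variance gaussian, the negative log-likelihood is the squared error up to an additive constant,

-\log p(y|x) = \frac{1}{2}\|y - y_0\|^2 + \frac{d}{2}\log(2\pi)

where d is the dimension of y, so maximizing the likelihood is the same as minimizing the MSE.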

A GAN, on the other hand, does not make any assumptions about the form of the loss function. Instead, it uses a critic, or discriminator, to tell us whether or not the samples are from the desired probability distribution. The signal from the discriminator (which minimizes the cross entropy between real and fake) is then fed back to the generator, which learns to produce better samples. Likewise, the discriminator is also trained in tandem. The wisdom here is that a GAN does not suffer from an artificially conceived loss function such as an L1 or L2 loss, which can be inadequate for the complex distributions arising in data. The authors claim that this is one reason why GANs are able to produce stellar looking samples.


The paper improves upon the vanilla VAE’s reconstructions by using a GAN discriminator to learn the loss function, instead of the MSE error used in the original VAE formulation.


Normally, we assume that the VAE decoder – the reconstruction model – produces output differing from the actual value in a gaussian way.

p(x|z) = \mathcal{N}(x|x_{decoder}, I)

When we take the negative log-likelihood, this becomes the mean-squared error (for a single sample):

\mathcal{L}_{decoder} = -\log p(x|z) = \frac{1}{2} (x-x_{decoder})^2

It can be minimized using a gradient-descent algorithm. However, as is well known, the reconstructions of the VAE tend to be blurry. To remedy this defect, the authors replace the VAE’s decoder loss function with a similarity metric learned through a GAN.

Thus, the setup does not make assumptions about the loss function. Instead, it uses the GAN’s discriminator to tell whether the samples are real or fake. This information is used by the generator (which is the VAE’s decoder) to produce more realistic samples. Eventually, the generator produces samples from the proper generating process when it succeeds in fooling the discriminator. All this is of course standard dialogue in GANs.

\mathcal{L} = \text{learned loss}

The loss function for the VAE is

-\mathcal{L} = E_{z\sim q_\phi(z|x)} [\log p_\theta(x|z)] - KL(q_\phi(z|x)||p_\theta(z))

where \phi, \theta are the encoder and decoder neural network parameters, and the KL term is the so-called prior term of the VAE.

Our problem here is to propose forms for \log p_\theta(x|z). In the original VAE, we assume that the samples produced differ from the ground truth in a gaussian way, as noted above. In the present work, the reconstructions and ground truth are sent to a neural network (the GAN discriminator), and the outputs from one of the discriminator’s hidden layers are assumed to differ in a gaussian way. It is easier written than said.

p(x|z) = \mathcal{N} (D^l(x_{real})|D^l(x_{fake}), I)

Here, the l^{th} layer of the discriminator is chosen to measure the difference between real and fake, to be used as the VAE loss. This is the last-but-one hidden layer. To rationalize the approach, we can take D as an identity mapping, whereupon it becomes the same as the original formulation in Kingma and Welling.

We thus have two sets of equations:

-\mathcal{L}_{vae} = MSE(D^{l}(x_{decoder}), D^{l}(x_{real})) + prior

-\mathcal{L}_{GAN} = E_{x\sim p_{data}} \log(D(x)) + E_{x\sim p_{model}} \log(1-D(x))

It is to be emphasized that we use the l^{th} layer output of the discriminator in the VAE’s loss term; the final output of the GAN, D(x), is a scalar quantity. Originally, I had been using the GAN output D(x) in the VAE loss, but this did not reconstruct the input properly. A hand-waving explanation is that the scalar produced by the discriminator’s final output is not sufficiently rich to capture the differences between real and fake.
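A sketch of the idea (the discriminator architecture here is made up, and x_real, x_decoder are assumed from the surrounding training loop; the essential trick is returning an intermediate feature map alongside the final score):

import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.features = nn.Sequential(
            nn.Linear(784, 400), nn.ReLU(),
            nn.Linear(400, 200), nn.ReLU(),
        )
        self.classify = nn.Sequential(nn.Linear(200, 1), nn.Sigmoid())

    def forward(self, x):
        feat = self.features(x)           # D^l(x): an intermediate hidden layer
        return self.classify(feat), feat  # scalar score and hidden features

# VAE reconstruction term: MSE in D^l feature space instead of pixel space
D = Discriminator()
score_real, feat_real = D(x_real)
score_fake, feat_fake = D(x_decoder)
recon_loss = F.mse_loss(feat_fake, feat_real.detach())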

Auxiliary generator

In the paper, they claim to get better results by augmenting the signal with an auxiliary generator setup. This is a refinement over the single generator/decoder + discriminator setup described above, meant to bolster the training signal.

We have noted above that the decoder of the VAE also functions as the generator of the GAN, which generates a ‘fake’ \tilde{x}.

Z_{encoder} \to \tilde{x}


Z_{encoder} = \mu(x) + \epsilon \sigma(x)

with \epsilon\sim \mathcal{N}(0,I) as is usual in the VAE.

In addition to this, we now sample from a unit normal and use the same network as in the decoder (whose weights we now share) to generate an auxiliary sample X_p.

Z_p \to X_p

Here, Z_p \sim \mathcal{N}(0,I). I was a little unsure how to interpret this term, in that we could also use

Z_p = \mu(x) + \delta \sigma(x)

with \delta \sim \mathcal{N}(0,I). But doing so would mean that we are essentially using the VAE’s encoder output, so it has to be the first interpretation.

We will then have two generated outputs – one from the VAE’s decoder, \tilde{x}, and an auxiliary output X_p. We treat both of these as fakes to augment the GAN loss.
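In code, the two paths might look like this (a sketch; decoder, mu, logvar, batch_size and latent_dim are assumed from the surrounding setup):

# posterior sample for the reconstruction path (the usual reparameterization)
std = logvar.mul(0.5).exp()
eps = Variable(torch.randn(std.size()))
z_enc = mu + eps * std
x_tilde = decoder(z_enc)  # reconstruction, fake #1

# prior sample for the auxiliary path: same decoder, shared weights
z_p = Variable(torch.randn(batch_size, latent_dim))
x_p = decoder(z_p)        # auxiliary sample, fake #2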

There’s also a fudge factor/tuning term (the \gamma in the decoder objective below) that weights the VAE and GAN contributions.


The following objectives are to be minimized.

A. Encoder

\begin{array}{ll} -\mathcal{L}_{Encoder} &= E_{z\sim q_\phi(z|x)} [\log (p_\theta(x|z))] - KL(q_\phi(z|x)||p_\theta(z))\\ &= -(\mathcal{L}_{D^l} + \mathcal{L}_{prior}) \end{array}

B. Decoder/Generator

-\mathcal{L}_{Decoder} = -(\gamma \mathcal{L}_{D^l} -\mathcal{L}_{GAN})

C. GAN discriminator

-\mathcal{L}_{GAN} = E_{x\sim p_{fake}} [\log(1-D(x))] + E_{z_p\sim \mathcal{N}(0,I)} [\log(1-D(G(z_p)))] + E_{x\sim p_{real}}[\log(D(x))]


After much fumbling around, I seem to have an implementation that improves upon the vanilla VAE’s reconstructions for MNIST. I plucked code from the dcgan sample in pytorch with arbitrary convolution/deconvolution layers. For one thing, it was not easy for me (with not too much experience training GANs) to get the generator/discriminator dynamics right. It looks like D is way better than G for the architecture used. However, it does produce plausible reconstructions, though I am sure the code can be improved a bit.

Here’s the pytorch implementation:

The samples below don’t use the auxiliary generator.

The vanilla VAE’s output is noticeably blurrier (ground truth not shown).


Fine print

There are a few things I struggled with:

1) Using a hidden layer output to compute the loss. For a long time, I was trying to reconstruct with the final output of the discriminator, which is only a single scalar.


This reconstructs only one or two digits correctly. It is therefore necessary to use the hidden layer in order to get correct reconstructions.

2) Retain graph: We have three networks: the encoder (VAE), the generator, and the GAN discriminator. After backpropagating through one of them, pytorch frees the intermediate buffers in order to save memory (if I understand the error message correctly). However, the error message does suggest that we add retain_graph=True to the backward() call to keep them.
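For instance (the loss names here are hypothetical):

# the first backward pass keeps the buffers alive for later passes
loss_vae.backward(retain_graph=True)
# the last backward through the shared graph is free to release them
loss_gan.backward()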


3) Sharing weights (although this eventually turned out to be unnecessary):
I was not able to find a principled way to share weights across networks. However, we can manually ship the weights across by storing them.

E.g. the Decoder/Generator class below (the decode method is filled in here with the usual 20 -> 400 -> 784 MNIST stack so the snippet is self-contained):

class Aux(nn.Module):
    def __init__(self):
        super(Aux, self).__init__()
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def return_weights(self):
        return self.fc3.weight, self.fc4.weight

    def decode(self, z):
        # the usual MNIST decoder stack: 20 -> 400 -> 784
        return self.sigmoid(self.fc4(self.relu(self.fc3(z))))

    def forward(self, z, fc3_weight, fc4_weight):
        # overwrite the local weights with the ones shipped from the other network
        self.fc3.weight = fc3_weight
        self.fc4.weight = fc4_weight
        return self.decode(z)

We get the weights with x.weight, and then we copy things over for use in the other network.

4) Alternative formulation/loss for GAN
Instead of the correct form that comes about from the cross-entropy formulation of the GAN, I used the so-called alternative form (well, that’s the name used in the alpha-GAN paper), which helps alleviate the vanishing gradient problem that the original form is prone to. This was used in the decoder, or generator.

Original form:
-\mathcal{L}_{GAN,gen} = E_{x\sim p_{fake}} \log(1-D(x))

Alternative form:
-\mathcal{L}_{GAN,gen} = -E_{x\sim p_{fake}} \log(D(x))
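With a BCE criterion in PyTorch, the alternative form amounts to scoring the fakes against ‘real’ labels (a sketch; D, G and z are assumed from the setup above):

criterion = nn.BCELoss()
d_fake = D(G(z))
real_labels = Variable(torch.ones(d_fake.size()))
# BCE against ones gives -log(D(G(z))), the alternative generator loss
g_loss = criterion(d_fake, real_labels)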


1) Autoencoding beyond Pixels using a learned similarity metric:

2) Kingma and Welling:

3) Generative adversarial networks:

4) alpha-GAN:

A note on batching operations in PyTorch

While implementing ‘DRAW’ in PyTorch, I ran into some surprising problems in carrying out matrix-vector products in a batched way, which I relate through the concrete case of computing the filterbank and read/write operations from the paper.

Matrix computations for attention patches or glimpse windows

The attention mechanism in DRAW involves two operations referred to as reading and writing. In the read operation, one creates a smaller glimpse vector of size 12×12 from the larger MNIST image of size 28×28. The write operation is the inverse, wherein the 12×12 image is transformed to a 28×28 image to write into the global ‘canvas’. These operations are carried out by means of matrix transformations. So, if x is the 28×28 image, then we apply a matrix transformation F_Y x F_X^T to create the attention patch.

Read: transform the large image into a small patch
x \ (28 \times 28) \Rightarrow F_Y x F_X^T \ (12 \times 12)

Write: transform the small patch into a large image
w \ (12 \times 12) \Rightarrow F_Y^T w F_X \ (28 \times 28)

There is also a scalar intensity parameter that they call \gamma that multiplies (or divides) the read and write operations.

Read: \gamma F_Y x F_X^T
Write: \frac{1}{\gamma} F_Y^T w F_X

If this \gamma vanishes, one might suppose that the calculation will blow up, something we can try to prevent by adding some small noise. But I found that if the thing blew up, there isn’t really anything we can do about it anyway. Changing the GRU to an LSTM seems to help in this case.

The matrices F_Y, F_X are referred to in the paper as ‘filter banks’ which are essentially gaussian kernels around a point of interest. In 1D, if we say that F(x)= \mathcal{N}(\mu, \sigma) then it will apply a gaussian filter at the point x=\mu. More on that in a dedicated post.

Now, since variables in pytorch all have a batch_size dimension to them, we would like to have them computed in vectorized fashion (or write them in CUDA, which would be more intuitive conceptually). The loop for the read operation could look like this:

# x[i]: B x A
# F_y[i]: N x B, F_x[i]: N x A

for i in range(batch_size):
    F_x_t = torch.t(F_x[i])  # transpose: A x N

    # is for matrix multiplication
    tmp1 =[i],[i], F_x_t)) * gamma[i]  # N x N glimpse

It would have been nice if the framework automatically vectorized the above computation, sort of like OpenMP or OpenACC, in which case we can try to use PyTorch as a GPU computing wrapper. But since this does not happen, we have to either write the loop in CUDA or to use PyTorch’s batching methods which thankfully happen to exist.

In order to do batch matrix multiplies, we should have the outer index as the batch variable, and the rest as a matrix, a tensor of 2 dimensions.

We do this with heavy use of the permute and expand functions:

#Do the transpose first
#permute indices 1, 2; 0 is batch index
#F_x_t => batch_size x A x N
F_x_t = F_x.permute(0, 2, 1)

#pad gamma with extra dimensions to make it a matrix
#(batch_size) => (NxNxbatch_size)
gamma = gamma.expand(N, N, batch_size)

#now permute to make batch_size the outer index
gamma = gamma.permute(2, 0, 1)

Our variables are now of the following shapes:

gamma: batch_size x N x N
x: batch_size x B x A
F_x_t: batch_size x A x N

Do a batch multiplication of x and F_X^T by invoking the bmm method.

tmp = x.bmm(F_x_t) # batch_size x B x N

Now multiply by F_Y to give the read patch (the glimpse) of shape batch_size x N x N (up to the factor \gamma).

tmp = F_y.bmm(tmp)

Multiply this by \gamma to get the glimpse vector. This is an elementwise multiplication, simply done by using the asterisk operator.

w = gamma * tmp
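As a sanity check, the batched version can be compared against the explicit loop (the sizes are made up for illustration):

batch_size, A, B, N = 4, 28, 28, 12
x = torch.randn(batch_size, B, A)
F_x = torch.randn(batch_size, N, A)
F_y = torch.randn(batch_size, N, B)
gamma = torch.randn(batch_size)

batched = gamma.expand(N, N, batch_size).permute(2, 0, 1) \
    * F_y.bmm(x.bmm(F_x.permute(0, 2, 1)))
looped = torch.stack([gamma[i] *[i],[i], torch.t(F_x[i])))
                      for i in range(batch_size)])
print((batched - looped).abs().max())  # ~0, up to float tolerance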

The code can be found here: in

RNNCell Modules in PyTorch to implement DRAW

I realized that in the last few months, I’ve spent a lot of time reading about generative modeling in general, with a fair bit of nonsense rhapsodizing about this and that, as one often does when one sees things the first time. I can see that I’ll be working a lot with RNNs in the near future, so I decided to get my hands dirty with pytorch’s RNN offerings. There is little else to do anyway in the heat.

I am working on the “DRAW” paper by Gregor et al.

This is a natural extension of the Variational Autoencoder formulation by Kingma and Welling, Rezende and Mohamed. The paper appeals to the idea that we can improve upon the VAE’s handiwork by iteratively refining its output over the course of several time steps. There is thus a temporal, sequential aspect that comes in. In addition, they also add a spatial attention mechanism wherein one ‘attends’ to portions of the image so as to improve them in small NxN patches of a larger image. I am posting the first part now, which only uses the RNNs. It might be a few more days before I can finish the second part.

Initially, I thought that we just have to pick from pytorch’s RNN modules (LSTM, GRU, vanilla RNN, etc.) and build up the layers in a straightforward way, as one does on paper. But then, some complications emerged, necessitating disconnected explorations to figure out the API. Firstly, we should all be aware of PyTorch’s way of creating arrays (well, I’ve not used any other frameworks except for caffe, and that too, only for benchmarking runs, so it’s all quite new to me) which demands that we include the batch size during initialization.

So for example, if we want to create an input x of size 784 (as in MNIST), we must also pass the batch size variable as input:

x = Variable(torch.randn(batch_size, input_size))

We do this in most of our initializations. The first dimension is the batch size. However, RNN variables have an additional dimension, which is the sequence length.

x_rnn = Variable(torch.zeros(seq_len, batch_size, input_size))

This is apparent in retrospect in the documentation ( – see under RNN), but we have to play with it in order to make sure. When we unroll this entity, we get seq_len copies of the per-timestep variable:

seq_len * Variable(batch_size, input_size)

Let us look at the example given in the documentation page:

>>> rnn = nn.GRU(10, 20, 2)
>>> input = Variable(torch.randn(5, 3, 10))
>>> h0 = Variable(torch.randn(2, 3, 20))
>>> output, hn = rnn(input, h0)

This defines a GRU of the following form:

rnn (input_size, hidden_size, num_hidden_layers)

We should note that the function curiously returns two outputs: output and hidden. The first (output) contains the hidden states of the last layer for every timestep, while hidden contains the hidden states of all the layers from the last time step, which we can verify with the size() method.

‘output’ is of shape (seq_len, batch_size, hidden_size). It contains the sequence of hidden layer outputs from the last hidden layer.

>>>torch.Size([5, 3, 20])
(or torch.Size([seq_len, batch_size, hidden_size]))

I find the purpose of ‘hidden’ a little enigmatic. It supposedly contains the hidden layer outputs from the last timestep in the sequence t = seq_len.

>>>torch.Size([2, 3, 20])
(or torch.Size([num_hidden_layers, batch_size, hidden_size]))
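A quick check ties the two together: for this uni-directional GRU, the last timestep of ‘output’ coincides with the last layer’s slice of ‘hidden’.

>>> torch.equal(, hn[-1].data)
True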

The hidden layer can be bi-directional. Apparently, the default (as we might expect) is a standard uni-directional RNN. The documentation clarifies this:

h_n (num_layers * num_directions, batch, hidden_size)

It helps to remember that the quantity they call ‘output’ is really the hidden layer. The output of an RNN is the hidden variable which we then do something with:

h_t = f(x_t, h_{t-1}) = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_{xh} + b_{hh})

y_t = g(h_t)

In my experiments I used GRUCell because it seemed intuitive to set up at that time.


Note: I think in the above, we can replace the 2 RNNs used in the encoder (one each for \mu, \sigma) with a single RNN – as can be made out from the clipping below from “Generating Sentences from a Continuous Space”:


RNNCell module

In DRAW, we need a connection from the decoder from the previous timestep. Specifically:

h_t^{enc} = RNN^{enc} (h_{t-1}^{enc}, [r_t, h_{t-1}^{dec}])

They define a cat operation [,] to concatenate two tensors.

[u, v] = cat(u, v)

As we can make out from the hand-drawn figure (sorry, but that’s just the most efficient way, factoring in things such as laziness), at each time step the VAE encoder takes in the output of the read operation, which then gets encoded into the latent embedding Q(z_t|x, z_{1:t-1}) = Q(z_t|h_t^{enc}). Furthermore, at each timestep we take in the same input image x, together with the previous timestep’s output image and the decoder output, to create the sequence. In that sense the sequence is actually defined by the quantity r_t, the output of read:

r_t = read(x, \hat{x}, h_{t-1}^{dec})

The read operation is to be implemented. This is done differently depending on whether or not we put in spatial attention. Nevertheless, we can in a rough way make sense of it from a line in the paper:
“Moreover the encoder is privy to the decoder’s previous outputs, allowing it to tailor the codes it sends according to the decoder’s behaviour so far.”

In our experiments, we used the RNNCell (or more precisely, the GRUCell) to handle the sequence, with a manual for loop to do the time stepping – the most intuitive way, if I may say so. In the forward method of the class, we create the set of operations comprising DRAW:

for seq in range(T):
    x_hat = x - F.sigmoid(c)  # error image
    r =, x_hat), 1)  # cat operation along the feature dimension
    # encoder output == Q_mu, Q_sigma
    mu, h_mu, logvar, h_logvar = self.encoder_RNN(r, h_mu, h_logvar, h_dec, seq)
    z = self.reparametrize_and_sample(mu, logvar)
    c, h_dec = self.decoder_network(z, h_dec, c)  # c is the canvas

Naturally, the RNN layers handle each individual timestep rather than batching the whole sequence together:

The API is as follows:

>>> rnn = nn.RNNCell(10, 20)  # (input_size, hidden_size)
>>> input = Variable(torch.randn(6, 3, 10))  # (seq_len, batch_size, input_size)
>>> hx = Variable(torch.randn(3, 20))  # (batch_size, hidden_size)
>>> output = []
>>> for i in range(6):  # time advance
...     hx = rnn(input[i], hx)  # one timestep
...     output.append(hx)  # add to sequence

In addition to the vanilla RNNCell, also included in PyTorch are the GRU and LSTM variants.

I hope to put up a more descriptive post (with feeling!) on DRAW. But for now, I have what seems to be a quasi-working implementation without the attention mechanism. In the figures below, we can see that there is a qualitative improvement in the figures as we add refinement timesteps.

The code may be found here:

nn.Sequential in PyTorch

I haven’t been doing any writing at all in recent times. Part of the reason for that is that every time I sit down to create something interesting, I get stuck tying the threads together and then have to rewind back to their predecessors, and so forth. In short, it has been a very disorganized experience as far as putting up a coherent structure goes. It also seems that I have stopped reading papers, with all this busyness of writing code – a tendency one must guard against.

For now though, I have a rather more prosaic bit on PyTorch API to set up a chain of operations. I found this in the convolutional GAN sample.

It comes in handy when we want to chain blocks of operations (e.g. conv+batch norm + relu) together. But generally, I suppose it is just economical use of code. It is immensely helpful in defining blocks for a VAE:

self.dec_ = nn.Sequential(
    nn.ConvTranspose2d(20, 20*8, 4, 1, 0, bias=False),     # (ic, oc, kernel, stride, padding); 1x1 -> 4x4
    nn.ConvTranspose2d(20*8, 20*16, 4, 2, 1, bias=False),  # 4x4 -> 8x8
    nn.ConvTranspose2d(20*16, 20*32, 4, 2, 1, bias=False), # 8x8 -> 16x16
    nn.ConvTranspose2d(20*32, 1, 2, 2, 2, bias=False),     # 16x16 -> 28x28
)

This code creates the architecture for the decoder in the VAE, where a latent vector of size 20 is grown to an MNIST digit of size 28×28 by modifying dcgan code to fit MNIST sizes.

Naturally, it would be quite tedious to define functions for each of the operations above. Like so.

self.conv_transpose_1 = nn.ConvTranspose2d(20, 20*8, 4, 1, 0, bias=False)
self.bn1 = nn.BatchNorm2d(20*8)
self.relu = nn.ReLU()

And in the calling function, decoder(), we chain up these operations.

def decoder(self, z):
    z = z.view(-1, z.size(1), 1, 1)
    o1 = self.conv_transpose_1(z)
    o2 = self.bn1(o1)
    o3 = self.relu(o2)
    # ... and so on for the remaining blocks

An explanation is in order for ConvTranspose2d. The way it is done in pytorch is to pretend that we are going backwards: work out the conv2d that would shrink the image, then reuse those same parameters in ConvTranspose2d to go the other way – the API takes care of how it is done underneath.

A simple case first: to reduce from 28×28 to 12×12, we take a kernel size of 5 with no padding and a stride of 2.

c = nn.Conv2d(input_channels, output_channels, kernel_size, stride, padding)

If we want to do 28×28 to 12×12, we define a kernel like this:

c = nn.Conv2d(input_channels, output_channels, 5, 2, 0)

To go the other way, from 12×12 to 28×28, we should do a transpose convolution.

c = nn.ConvTranspose2d(input_channels, output_channels, 5, 2, 0)

Let’s do this on an example with strides and padding: 28×28 -> 16×16.

Use the same formula we would use to do the convolution (28×28 -> 16×16), but now put the parameters in the definition of the transpose convolution kernel. We should really work the formula out, but it is elaborately explained in the old theano page.

The formula for the normal conv2d (well, also conv1d, so it qualifies as abuse of dimension) is:

o = (i - k + 2p)/s + 1

where o is the output size, i is the input size, k is the kernel size, p is the padding, and s is the stride.

Plug in i=28, k=2, p=2, s=2, setting the batch size to 5 and the channels to 1, to get o=16:

q = nn.Conv2d(1, 1, 2, 2, 2)
x = Variable(torch.randn(5, 1, 28, 28))
z = q(x)
z.size()  # torch.Size([5, 1, 16, 16])

The transpose convolution operation is as follows:

tc = nn.ConvTranspose2d(1, 1, 2, 2, 2)
x = Variable(torch.ones(5, 1, 16, 16))
y = tc(x)
y.size()  # torch.Size([5, 1, 28, 28])

We can find this attempt in scratch.ipynb here

That repo is an attempt at implementing the GAN VAE paper
“Autoencoding beyond pixels using a learned similarity metric”.

Bahdanau attention

In an earlier post, I had written about seq2seq without attention by way of introducing the idea. This time, we extend upon that by adding attention to the setup. In the regular seq2seq model, we embed our input sequence x = {x_1, x_2, ..., x_T} into a context vector c, which is then used to make predictions. In the attention variant, the context vector c is replaced by a customized context c_i for each hidden decoder state s_{i-1}. The result is a weighted sum of contributions from all of the input hidden vectors.

c_i = \sum_j \alpha_{ij} h_j

This operation computes how we weight the input hidden vectors h_j (these could be bidirectional, in which case we concatenate the forward and backward hidden states). Naturally, if the input and output hidden states are ‘aligned’, then \alpha_{ij} would be quite high for those states. In practice, more than one input word could be aligned with an output word. For example, some words in English will have direct correspondences in French (de == of; le, la == the), but some words can have multiple correspondences (je ne suis pas == I am not), so the alignment should register these features. Here we know that (je/I) form a pair and (suis/am) form another pair, but we also know that (ne suis pas/am not) should occur together and will have non-zero alphas when grouped together, not to mention the difference in sequence lengths.

The attention/alignment parameters \alpha_{ij} are computed as a non-linear function of the hidden units, yielding a score a_{ij} which is then softmaxed so that the weights lie between 0 and 1.

a_{ij} = f(s_{i-1}, h_j) = v' \tanh(W_1 s_{i-1} + W_2 h_j + b)

\alpha_{ij} = softmax(a_{ij}) = \frac{\exp(a_{ij})}{\sum_j \exp(a_{ij})}
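A minimal PyTorch sketch of this scoring function (the layer names and sizes are mine, not from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BahdanauAttention(nn.Module):
    def __init__(self, dec_size, enc_size, attn_size):
        super(BahdanauAttention, self).__init__()
        self.W1 = nn.Linear(dec_size, attn_size)  # acts on s_{i-1}
        self.W2 = nn.Linear(enc_size, attn_size)  # acts on h_j
        self.v = nn.Linear(attn_size, 1, bias=False)

    def forward(self, s_prev, h):
        # s_prev: batch x dec_size; h: seq_len x batch x enc_size
        a = self.v(torch.tanh(self.W1(s_prev).unsqueeze(0) + self.W2(h)))
        alpha = F.softmax(a, dim=0)   # normalize over the input positions j
        context = (alpha * h).sum(0)  # c_i: batch x enc_size
        return context, alpha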

Once we compute the context c_i, we can use it to produce predictions for the output hidden states:

s_i = F(s_{i-1}, c_i, y_{i-1})

Computing the hidden decoder state

As we can make out from the equation above, we would like to formulate the decoder state as an RNN:

s_i = F(s_{i-1}, u_{i-1})

where u_{i-1} is some function of c_i and y_{i-1}. A most obvious way to combine the two vectors is by concatenating c_i and y_{i-1}. This is the way they do it in Vinyals’ “Grammar as a foreign language”.

Define [u,v] = concat(u,v) to get
s_i = RNN(s_{i-1}, [c_i, y_{i-1}])

In the Bahdanau paper, they don’t explicitly use concatenation but leave it as a general non-linear combination; this can be interpreted as a concatenation (see Thang Luong). We paste the GRU equations from the paper, s_i = GRU(s_{i-1}, c_i, y_{i-1}). The outputs y_i are one-hot vectors which are transformed into a dense embedding, denoted by the matrix E.

s_i = (1-z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i \\ \tilde{s}_i = \tanh (W Ey_{i-1} + U [r_i \circ s_{i-1}] + Cc_i)\\ z_i = \sigma(W_z Ey_{i-1} + U_z s_{i-1} + C_z c_i)\\ r_i = \sigma(W_r Ey_{i-1} + U_r s_{i-1} + C_r c_i)

Annotations and bidirectionality

The Bahdanau paper uses a bidirectional RNN for the encoder. This computes hidden units for the sequence with the normal ordering (left to right) and reversed ordering (right to left) (see CS224D notes by Richard Socher).

\overrightarrow{h}_i = f(\overrightarrow{W} x_i + \overrightarrow{V} \overrightarrow{h}_{i-1} + \overrightarrow{b}) \\ \overleftarrow{h}_i = f(\overleftarrow{W} x_i + \overleftarrow{V} \overleftarrow{h}_{i+1} + \overleftarrow{b})

The so called ‘annotations’ h are a concatenation of the forward and backward hidden vectors which are then used to compute context vectors.

h_j = concat(\overrightarrow{h}_j, \overleftarrow{h}_j) = [\overrightarrow{h}_j, \overleftarrow{h}_j] \\ c_i= \sum_j \alpha_{ij} h_j


This part I don’t understand very well – I haven’t read the Goodfellow (maxout) paper. However, the equations are clear enough. It gets ugly when we spell it out. As ugly as it looks, it’s probably the only way we can convince ourselves when we lack the proper intuition that allows us to handwave the equations away.

p(y_i|s_i, y_{i-1}, c_i) \propto \exp(y^T_i W_o t_i)

Naturally, t_i is a function of s_i, y_{i-1}, c_i, going into the softmax operation to get the probability of the next word.

t_{i,j} = \max\left\{\tilde{t}_{i, 2j-1}, \tilde{t}_{i,2j}\right\}

with the tilde quantities being linear combinations of s_{i-1}, y_{i-1}, c_i:

\tilde{t}_i = U_o s_{i-1} + V_o Ey_{i-1} + C_o c_i

i.e. \tilde{t}_i = f(s_{i-1}, [Ey_{i-1}, c_i])

It is to be ensured that the weights for the tilde-t computation are of the proper shape. The maxout process creates a vector \tilde{t} of size 2l, from which we pick the maximum between pairs of adjacent elements. If we want element j of the vector t, we select the maximum of elements 2j-1, 2j of \tilde{t}. Our matrices to compute the tilde vectors are sized accordingly. It is all pretty tedious business, but thankfully, the paper is written such that we can work everything out ourselves notwithstanding the notation. So if we have K_y words in the output dictionary, and t is a vector of length l, the quantities have the following shapes:

W_o: K_y \times l
t: l \times 1
\tilde{t}: 2l \times 1
s: n \times 1
U_o: 2l \times n
y_i: K_y \times 1, etc.
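Incidentally, the pairwise max itself is a one-liner in PyTorch (a sketch; t_tilde is a batch of \tilde{t} vectors of size 2l):

# t_tilde: batch_size x 2l  ->  t: batch_size x l
t = t_tilde.view(t_tilde.size(0), -1, 2).max(dim=2)[0]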

Alternative forms for alignment function

The alignment model a_{ij}, which scores the output words against their input equivalents via the hidden states, can be proposed differently (Luong et al.). They give some alternatives (below, we keep the notation consistent with Bahdanau – s_{i-1} for the hidden decoder state, h_j for the hidden encoder state; the Luong paper uses \bar{h}_s for the source and h_t for the target).

\begin{array}{lll} a_{ij} & = & s_{i-1}^T h_j \ (\text{dot}) \\ & = & s_{i-1}^T W_a h_j \ (\text{general}) \\ & = & v^T \tanh(W_a[s_{i-1}, h_j]) \ (\text{concat; Bahdanau}) \end{array}


1. Cho et al. 2014:
2. Bahdanau et al. 2014:
3. Vinyals et al. 2014:
4. Sutskever et al. 2014:
5. Goodfellow et al. 2013:
6. Tacotron:
7. Luong et al.:
8. CS224D: