“We have lingered in the chambers of the sea
By sea-girls wreathed with seaweed red and brown
Till human voices wake us, and we drown”.
These lines are from Eliot’s “The Love Song of J. Alfred Prufrock”. To bring that down to our more banal reality – not the one with the cicada or the dry grass singing – the mermaids that we have seen singing (each to each!) are the beautiful results we see in the papers, and the lingering in the chambers of the sea could perhaps be the utopia we envisage in our submerged never-never land. Nevertheless, as Eliot notes dolefully, the mermaids do not sing to us, and when we wake from that sleep (O City city … I can sometimes hear beside a public bar in Lower Thames Street …), we realize not only that the mermaids spurn us with their ardent refusal to sing, but also that our watery singing chamber turns into a watery grave: submerged, inundated, buried. Water seems to form such an important theme in Eliot, from the wet beginning of “The Fire Sermon”, to the drowning of the Phoenician sailor in “Death by Water”, to the grand ending with the Fisher King sitting upon the shore of his waste land, fishing, wondering about his legacy, with a desolate, arid (and beautiful!) landscape behind him.
In our case, the works and days of hands consist of restless – endless – hours watching our little attention curve appear (mostly not, as we might surmise) with its wispy sharpness before us, thereby announcing that the model is indeed learning something. Again, as he observed only too wisely, “I have wept and fasted, wept and prayed. I have seen the moment of my greatness flicker, and in short, I was afraid”.
With that entirely unnecessary digression into the drowning excesses of Eliotism – think of the opening chant in Ashtanga practice, or Vogon poetry, as suits one’s taste – we bring our attention towards matters of the moment.
In keeping with our foray into attention-based seq2seq models, which we know and love, I would like to put my thoughts down on what seems like a necessary enhancement to the content-based attention mechanism: adding a location component to it. This idea comes from the paper by Chorowski et al. detailing their setup for speech recognition.
Now (the old sweats can take a breather), in the Bahdanau work for NMT, they present the so-called content-based attention, in which a hidden state of the decoder is compared against all encoder outputs for similarity (the similarity, as we know, could be some function – a dot product, or a learned function expressed as a small perceptron). This works well for NMT, where the input and output tokens are distinct, discrete words. However, when we replace the input sequence with a speech representation – say, a spectrogram – there are many, many frames of input to compare against, and consecutive frames change very little, so the attention gets spread out over all of them and they are ranked nearly equally – not quite the result we want. The paper proposes that we encourage the attention mechanism to move forward by noting what the attention was at the previous timestep. The point is that the model has many nearly identical frames to score, and could therefore use a hint about where the attention was at the previous step, so that it focuses on that vicinity. The other approach – described in the previous post – is to cluster frames by reducing the number of timesteps hierarchically (as in Listen, Attend and Spell).
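To make the problem concrete, here is a tiny sketch (with made-up tensor shapes) of plain dot-product content scoring; note how nearly identical frames necessarily receive nearly identical weights:

```python
import torch

# Content-based scoring with a plain dot product; the "learned perceptron"
# variant would replace the dot product with a small MLP.
dec_state = torch.randn(1, 512)         # decoder hidden state s_{i-1}
enc_outputs = torch.randn(1, 100, 512)  # T = 100 encoder frames h_1 .. h_T

scores = torch.bmm(enc_outputs, dec_state.unsqueeze(-1)).squeeze(-1)  # (1, 100)
attention = torch.softmax(scores, dim=-1)                             # one weight per frame
# If many frames are nearly identical, their scores -- and hence their
# attention weights -- come out nearly identical too, which is exactly the problem.
```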
The Chorowski paper is a seminal work. The attention mechanism it describes was first introduced in the context of speech recognition, as an extension of sorts to the machine-translation work of Bahdanau; more recently, it has been used for speech synthesis in the Tacotron series of papers. The idea is to design an attention mechanism that looks at the previous timestep’s attention and picks from it using a learned filter.
“Informally, we would like an attention model that uses the previous alignment αi−1 to select a short list of elements from h, from which the content-based attention, in Eqs. (5)–(6), will select the relevant ones without confusion.”
This is carried out by convolving the previous timestep’s attention with a matrix F (of size k × r), to produce a k-dimensional feature vector for each of the encoder timesteps considered.
The way it is written, the scheme seems a little cryptic (or is it?), until we note that equation (8) amounts to a one-dimensional convolution on the attention (after all, it is a 1D vector), with k filters of width r. What is not mentioned is the length of the convolved output, which is presumably user-defined. Keith Ito’s TensorFlow implementation of Tacotron seems to corroborate this interpretation, with the convolution designed so that the input and output lengths match by using appropriately sized (“same”) padding.
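A minimal sketch of that reading of equation (8), with a random stand-in for the learned matrix F and “same” padding so the output length matches the number of encoder frames:

```python
import torch
import torch.nn.functional as F

prev_attention = torch.softmax(torch.randn(1, 1, 100), dim=-1)  # (batch, 1, T): previous alignment
k, r = 32, 31                    # k filters of width r, as in Eq. (8)
F_matrix = torch.randn(k, 1, r)  # stand-in for the learned matrix F

# "Same" padding keeps the output length equal to T, as in Keith Ito's code.
f = F.conv1d(prev_attention, F_matrix, padding=(r - 1) // 2)
print(f.shape)  # torch.Size([1, 32, 100]): a k-dimensional feature per encoder timestep
```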
Be all that as it may, one might perhaps summarize the game as using a filtered version of the previous timestep’s attention so that its influence is felt in the new attention output. Again, as is usual in such scenarios, the convolutions are carried out along the time axis. Another way of looking at it is to consider the spectrogram as an image, so that we attend to a given region of the image as indicated by the attention vector.
The paper notes that this form of attention drew inspiration from Alex Graves’ GMM-based attention – called ‘window weights’ in that work – where the weighting parameters decide the location, width, and importance of the windows. This also seems to motivate the location-based attention in the DRAW paper. It is important to note that in Graves’ scheme the location of the window is forced to move forward at every timestep, since the increment to its position is an exponential and hence always positive.
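For reference, a rough sketch of the Graves-style window (the parameter names here are mine); the exponential in the kappa update is what keeps the window moving forward:

```python
import torch

def graves_window(kappa_prev, alpha_hat, beta_hat, kappa_hat, enc_len):
    """One step of Graves-style GMM 'window weights'; each parameter has shape (batch, K)."""
    alpha = torch.exp(alpha_hat)               # importance of each mixture component
    beta = torch.exp(beta_hat)                 # width (inverse variance) of each window
    kappa = kappa_prev + torch.exp(kappa_hat)  # location: the increment is always positive
    u = torch.arange(enc_len, dtype=torch.float32)  # encoder positions 0 .. U-1
    phi = (alpha.unsqueeze(-1)
           * torch.exp(-beta.unsqueeze(-1) * (kappa.unsqueeze(-1) - u) ** 2)).sum(dim=1)
    return phi, kappa  # phi: (batch, enc_len) window weights; kappa is carried to the next step
```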
Other pertinent tricks in the paper:
A. Sharpening attention in long utterances, using a softmax temperature (strictly, an inverse temperature, or sharpening factor, β). The rationale for this trick is that in longer utterances we have a long (unnormalized) attention vector – say (0.1, 0.4, 0.8, -0.4, -0.5, 0.9, …) – and we would like to sharpen the attention so that it focuses on a few pertinent frames. If we set β to 10, for example, then we get (exp(1), exp(4), exp(8), …), which suppresses all but the largest components, as compared with (exp(0.1), exp(0.4), exp(0.8), …). In other words, it amplifies – or sharpens – the attention.
B. Smoothing attention for shorter utterances. In this scenario, it is claimed that one wants to smooth the focus rather than sharpen it; concretely, the exponential in the softmax is replaced with a sigmoid before normalizing. The reasoning given is that the sharp model (keeping only the top-scored frames) performs poorly on short utterances, which led the authors to smooth it out a little. Another way to reason about it is that in shorter utterances the contribution of each individual frame becomes all the more important. A small sketch of both variants follows this list.
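Here is that sketch; the β scaling and the sigmoid-based smoothing follow the paper’s description, while the example energies are illustrative:

```python
import torch

def sharpened_attention(energies, beta=2.0):
    # Sharpening: alpha_j is proportional to exp(beta * e_j); beta > 1 concentrates the weights.
    return torch.softmax(beta * energies, dim=-1)

def smoothed_attention(energies):
    # Smoothing: replace exp with a sigmoid, then normalize, so no single
    # frame can dominate the distribution as easily.
    sig = torch.sigmoid(energies)
    return sig / sig.sum(dim=-1, keepdim=True)

e = torch.tensor([0.1, 0.4, 0.8, -0.4, -0.5, 0.9])
print(sharpened_attention(e, beta=10.0))  # nearly all mass on the largest energies
print(smoothed_attention(e))              # a flatter, gentler distribution
```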
While the ideas might seem somewhat arbitrary, I take these as tricks that one should incorporate into one’s own practice as needed.
These tricks find use in the Tacotron series of works, notably in Tacotron 2, where a location-sensitive attention mechanism is used to encourage the attention to move forward.
Below is a PyTorch sketch of such a location-sensitive attention module, along the lines of the Tacotron implementation referred to above (the layer sizes here are illustrative).
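```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """A sketch of hybrid (content + location) attention; layer sizes are illustrative."""

    def __init__(self, query_dim=1024, memory_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)    # W s_{i-1}
        self.memory_layer = nn.Linear(memory_dim, attn_dim, bias=False)  # V h_j
        # Eq. (8): convolve the previous alignment with k filters of width r,
        # with "same" padding so the output length equals the number of encoder steps.
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)  # U f_{i,j}
        self.v = nn.Linear(attn_dim, 1, bias=False)                       # w^T tanh(...)

    def forward(self, query, memory, prev_alignment):
        # query: (B, query_dim), memory: (B, T, memory_dim), prev_alignment: (B, T)
        f = self.location_conv(prev_alignment.unsqueeze(1))  # (B, n_filters, T)
        f = self.location_layer(f.transpose(1, 2))           # (B, T, attn_dim)
        # Eq. (9): content and location terms combined into a single energy.
        energies = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) + self.memory_layer(memory) + f
        )).squeeze(-1)                                        # (B, T)
        alignment = F.softmax(energies, dim=-1)
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)  # (B, memory_dim)
        return context, alignment
```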
The line calling self.location_conv corresponds to equation (8), producing k filtered copies of the previous alignment, each the same length as the input, with k being the number of filters. We then apply a linear layer, self.location_layer, to project these features to the attention dimension, and the result gets added on to the (projected) attention query.
When we call the attention mechanism, we combine the content term (the standard “concat”, or additive, attention over the query and the encoder memory) and the location-sensitive term (the convolved previous alignment) into a single energy, as equation (9) in the paper shows, after which we take a softmax over these energies. All of this is also available in r9y9’s PyTorch implementation of Tacotron (minus, perhaps, the location component).
1. Attention-Based Models for Speech Recognition (Chorowski et al.): https://arxiv.org/abs/1506.07503
2. Tacotron 2: https://arxiv.org/abs/1712.05884
3. Keith Ito’s code: https://github.com/keithito/tacotron/issues/117
4. r9y9’s Tacotron implementation in PyTorch: https://github.com/r9y9/tacotron_pytorch
5. Alex Graves’ “Generating Sequences with Recurrent Neural Networks”: https://arxiv.org/pdf/1308.0850.pdf