We all know that attention is to a language model what focus is to a human. It is the still point of the turning world, neither flesh nor fleshless, neither from nor toward. If your model learns attention, it releases one from action and suffering, from the inner and outer compulsion, yet surrounded by a grace and dignity with the white light still moving.
I set up a little experiment to extract this elixir from a model that had already learnt attention. The task was to convert between source and target voices (or representations thereof) in a supervised way, but with only a small number of (paired) samples. The following considerations are germane:
- I was doing transfer learning to fine-tune on a tiny dataset (1000 utterances). The pretraining was fashioned as a self-supervised task – an autoencoder that learnt to reconstruct the source utterance. It used a much larger dataset, and the resulting network was therefore quite large. The smaller dataset, however, merited a smaller network, which raised the question of whether one could ‘distill’ a smaller network from the bigger one to avoid overfitting.
- Learning attention is critical in this type of sequence model, so we would like to find ways of transferring attention that has already been learnt, if it is available. In a model as small as the one I wanted, learning attention from scratch was not feasible with such a small number of examples.
- Compression – not considered here, but quite a pertinent thought: how do we create a model that could sit on a small device such as a cell phone?
- Low-resource languages: the PARP paper (linked below) was recently brought to my attention, in the context of speech recognition. There isn’t much data in such settings, so a transfer learning/distillation type approach is proposed. Here, however, the smaller network is created by pruning and quantization – most interesting!
Generally, these ideas form part of a bigger theme: knowledge distillation. I was trying to find ways of adapting Hinton’s ‘soft targets’ idea, but it seems much more readily applicable to classification problems. A literature search might unearth ways of adapting it to synthesis problems.
The model can be loosely thought of as a seq2seq encoder-decoder RNN with attention, à la Tacotron (yes, we have fallen behind the times, apparently). There is a large teacher (600 hidden units in the encoder), and we would like to create a smaller student (300 hidden units). Here’s the recipe for transferring attention.
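To fix notation for the sketches that follow, here is a hypothetical configuration of the two models; only the encoder widths come from the description above, the rest is an illustrative assumption:

```python
# Hypothetical Tacotron-style seq2seq sizes. Only the encoder hidden widths
# (600 vs. 300) are from the experiment; the other fields are placeholders.
teacher_config = {"encoder_hidden_units": 600, "attention": "content-based"}
student_config = {"encoder_hidden_units": 300, "attention": "content-based"}
```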
Attention as a probability distribution
Since we normalize the attention scores, each attention vector is essentially a probability distribution: its values lie between 0 and 1 and sum to 1. We can therefore formulate the learning problem as one where we minimize the Kullback–Leibler divergence between the teacher's and the student's attention vectors, and include this as a loss. We assume that the number of timesteps is the same in the teacher and the student (see the references below).
KLD between teacher and student:

$$D_{\mathrm{KL}}\left(\alpha^{\mathrm{teacher}} \,\|\, \alpha^{\mathrm{student}}\right) = \sum_{t=1}^{T} \alpha^{\mathrm{teacher}}_{t} \log \frac{\alpha^{\mathrm{teacher}}_{t}}{\alpha^{\mathrm{student}}_{t}}$$

Here, the index t runs over the T attention states. When we expand the logarithm, the term that involves only the teacher is a constant, so minimizing the KLD reduces to minimizing the cross entropy $-\sum_{t=1}^{T} \alpha^{\mathrm{teacher}}_{t} \log \alpha^{\mathrm{student}}_{t}$.
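As a quick numerical check of that equivalence (a sketch only – the tensor names here are made up), the KLD and the cross entropy differ exactly by the teacher's entropy, which does not depend on the student:

```python
import torch

# Hypothetical attention vectors over T = 5 encoder timesteps,
# softmax-normalized so that each sums to 1.
teacher_attn = torch.softmax(torch.randn(5), dim=-1)
student_attn = torch.softmax(torch.randn(5), dim=-1)

kld = torch.sum(teacher_attn * (teacher_attn.log() - student_attn.log()))
cross_entropy = -torch.sum(teacher_attn * student_attn.log())
teacher_entropy = -torch.sum(teacher_attn * teacher_attn.log())

# KLD = cross entropy - teacher entropy; the teacher entropy is constant
# with respect to the student, so either objective gives the same gradients.
assert torch.allclose(kld, cross_entropy - teacher_entropy)
```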
This can be incorporated into our losses, keeping all other things in the model the same.
Assume that we have a trained teacher whose weights we have stored. To train the student, we use the same workflow as when training from scratch, with the difference that we also run a forward pass through the teacher. We then collect the attention weights from the teacher and from the student, compute the cross entropy between them as shown above, and add it to the overall loss.
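For concreteness, here is a minimal PyTorch-style sketch of one such training step; the model interfaces, tensor shapes, reconstruction loss, and the weight lambda_attn are assumptions for illustration, not the exact code used:

```python
import torch
import torch.nn.functional as F

def train_step(teacher, student, optimizer, src, tgt, lambda_attn=1.0):
    """One training step with an attention-distillation term in the loss.

    Both models are assumed to return (output, attention), where attention
    has shape (batch, decoder_steps, encoder_steps) and is softmax-normalized.
    """
    teacher.eval()
    student.train()

    # Forward pass through the frozen teacher; we only keep its attention weights.
    with torch.no_grad():
        _, teacher_attn = teacher(src, tgt)

    # Usual forward pass through the student.
    output, student_attn = student(src, tgt)

    # Ordinary reconstruction loss on the target features (assumed L1 here).
    recon_loss = F.l1_loss(output, tgt)

    # Cross entropy between teacher and student attention distributions,
    # summed over encoder steps and averaged over batch and decoder steps.
    attn_loss = -(teacher_attn * torch.log(student_attn + 1e-8)).sum(dim=-1).mean()

    loss = recon_loss + lambda_attn * attn_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```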
I found that the setup learns attention immediately – at the end of epoch 0 – when we insert the distillation loss.
In the top row, we see that the setup learns (or rather, transfers) attention in the student very quickly, in a single epoch. It is a little blurry because the model isn’t fully trained yet. In the second and third rows, we see test-time output. The target is slightly longer than the source, and the attention is a lot crisper. The generation at the bottom right shows that while it gets the general features right (number of segments and duration), it hasn’t quite overcome the domain gap between source and target yet. This could use further investigation: more data, and additional losses (adversarial, contrastive, etc.).
References
- Knowledge distillation: https://arxiv.org/abs/1503.02531; video: https://youtu.be/EK61htlw8hY
- Parallel Neural Text to Speech: https://openreview.net/pdf?id=BJeFQ0NtPS
- Parallel WaveNet – here also, a distillation-type approach is used: https://arxiv.org/abs/1711.10433
- PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition: https://openreview.net/forum?id=UoVpP8R2Vn