A good paper comes with a good name, giving it the mnemonic that makes it indexable by Natural Intelligence (NI), with exactly zero recall overhead, and none of that tedious mucking about with obfuscated lookup tables pasted in the references section. I wonder if the poor (we assume mostly competent) reviewer will even bother to check the accuracy of these references (as long as we don’t get any latex lacunae ??) in their already overburdened roles of parsing difficult to parse papers, understanding their essence and then their minutiae, and on top of that providing meaningful feedback to the authors.
Tacotron is an engine for Text To Speech (TTS) designed as a supercharged seq2seq model with several fittings put in place to make it work. The curious sounding name originates – as mentioned in the paper – from obtaining a majority vote in the contest between Tacos and Sushis, with the greater number of its esteemed authors evincing their preference for the former. In this post, I want to describe what I understand about these architectures designed to produce speech from text, and the broader goal of disentangling stylistic factors such as prosody and speaker identity with the aim of combining them in suitable ways.
Tacotron is a seq2seq model taking in text as input and dumping speech as output. As we all know, seq2seq models are used ubiquitously in machine translation and speech recognition tasks. But over the last two or three years, these models have become usable in speech synthesis as well, largely owing to the Tacotron (Google) and DeepVoice (Baidu) works. In TTS, we take in characters or phonemes as input, and dump out as a speech representation. At the time of publication (early 2017), this model was unique in that it demonstrated a nearly end to end architecture that could produce plausible sounding speech. Related in philosophy was the work by Baidu (DeepVoice 1 – I think) which also had a seq2seq TTS setup, but which was not really end to end because it needed to convert text (grapheme) to phoneme and then pass that in to subsequent stages. The DeepVoice architecture has also evolved with time, and has produced complex TTS setups adapted for tasks such as speaker adaptation and creating speaker embeddings in transfer learning contexts with few data. Nevertheless, we should note that speech synthesis is probably never going to be end to end (as the title of the paper says) owing to the complexity of the pipeline. In Tacotron, we generate speech representations which need to be converted to audio in subsequent steps. This makes speech a somewhat difficult problem to approach – speaking personally – for the newbie unlike images where we can readily see the output. Even so, we can make do by looking at spectrograms. The other problem in speech is the lack of availability of good data. It costs a lot of money to hire a professional speaker!
At a high level, we could envisage Tacotron as an encoder-decoder setup where text embeddings get summarized into a context by the encoder, which is then used to generate output speech frames by the decoder. Attention modeling is absolutely vital in this setup for it to generalize to unseen inputs. Unlike NMT or ASR, the output is inflated, so that we have many output frames for an input frame. Text is a highly compressed representation whereas speech (depending on the representation) is highly uncompressed.
Several improvements are made to bolster the attention encoder-decoder model, which we describe below.
- Prenet: This is a bottleneck layer of sorts consisting of full connections. Importantly, it uses dropout which serves as a regularization mechanism for it to generalize to test samples. The paper mentions that scheduled sampling does not work well for them. Prenet is also used in the decoder.
- CBHG: “Convolutions, FilterBanks and Highway layers, Gated Recurrent Units”. We could view the CBH parts as preprocessing layers taking 1-3-5-… convolutions (Conv1d) and stacking all of them up after maxpooling them. It makes note of relationships between words of varying lengths (n-grams) and collates them together. This way, we agglomerate the input characters to a more meaningful feature representation taking into account the context at the word level. These are then sent to a stack of highway layers – a play on the residual networks idea – before being handed off to the recurrent encoder. I’ve described the architecture in more detail in another post. The “G” in CBHG is the recurrent encoder, specified as a stack of bidirectional Gated Recurrent Units.
- Reduction in output timesteps: Since we produce several similar looking speech frames, the attention mechanism won’t really move from frame to frame. To alleviate this problem, the decoder is made to swallow inputs only every ‘r’ frames, while we dump r frames as output. For example, if r=2, then we dump 2 frames as output, but we only feed in the last frame as input to the decoder. Since we reduce the number of timesteps, the recurrent model should have an easier time with this approach. The authors note that this also helps the model in learning attention.
Apart from the above tricks, the components are setup the usual way. The workflow for data is as follows.
We pass the processed character inputs to the prenet and CBHG, producing encoder hidden units. These hidden units contain linguistic information, replacing the linguistic context used in older approaches (Statistical Parametric Speech Synthesis or SPSS). Since this is the ‘text’ summary, we can also think of it as seed for other tasks such as in the Tacotron GST works (Global Style Tokens) where voices of different prosodic styles are summarized into tokens which can then be ‘mixed’ with the text summaries generated above.
On the decoder side, we first compute attention in the so-called attention RNN. This term caused me a lot of confusion initially. The ‘attention RNN’ is essentially the block where the we compute the attention context, which is then concatenated with the input (also ejected by a prenet) and then sent to a recurrent unit. The input here is a mel spectrogram frame. The output of the attention RNN is then sent to a stack of recurrent units and projected back to the mel dimensions. Also, this stack uses residual connections.
Finally, instead of generating a single mel frame for each decoder step, we produce ‘r’ output frames, the last of which gets used as input for the next decoder step. This is termed the ‘reduction factor’ in the paper, with the number of decoder timesteps getting reduced by a factor or ‘r’. As we are producing a highly uncompressed speech output, it makes sense to assist the RNNs by reducing their workload. It is also said to be critical in helping the model learn attention.
Generalization comes from dropout in this case, with ground truth being fed – teacher forced – during training as input decoder frames. Contrast this with running in inference mode where one has to use decoder generated outputs as input for the next timestep. In scheduled sampling, the amount of teacher forcing is tapered down as the model trains. Initially, a large amount of ground truth is used, but as the models learns with time, we slowly taper off to inference mode with some sort of schedule. The other way is to use GANs to make the inference mode (fake) behave like teacher forced output (real) with a discriminator being used to tell them apart; the idea being that the adversarial game results in the generator producing inference mode samples that are indistinguishable from the desired, teacher forced mode output. This approach is named “Professor Forcing”.
Mel spectrogram frames generated by the network are converted back to audio with the postprocessing scheme. This postprocessing scheme is described in the literature as a ‘vocoder’ or backend. In the original tacotron work, the process involves first converting the mel frames into linear spectrogram frames with a neural network to learn the mapping. We actually lose information going from linear to mel spectrogram frames. The mapping is learnt using a CBHG network (not necessarily encoder-decoder as the input and output sequences have the same lengths) as used in the front end part of the setup computing linguistic context. In the end, audio is produced by carrying out an iterative procedure called Griffin-Lim on the linear spectrogram frames. In more recent works, the backend part is replaced by a Wavenet for better quality samples.
Refinements in Tacotron 2
Tacotron 2’s setup is much like its predecessor, but is somewhat simplified, in in that it uses convolutions instead of CBHG, and does away with the (attention RNN + decoder layer stack) and instead uses two 1024 unit decoder layers. In addition to this, it also has provisions to predict the probability of emission of the ‘STOP’ token, since the original Tacotron had problems predicting the end of sequence token and tended to get stuck. Also, Tacotron 2 discards the reduction factor, but adds location sensitive attention as in Chorowski et al’s ASR work to help the attention move forward. Supposedly, these changes obviate the need for the r-trick. In addition to location sensitive attention, the GST Tacotron works also use Alex Graves’ GMM attention from the handwriting synthesis works.
There are many other minutiae as well such as using MSE loss instead of L1, which I suppose would qualify as tricks to be noted if one is actually creating these architectures.
Controlling speech with style tokens
The original Tacotron work was designed for a single speaker. While this is an important task in and of itself, one would also like to control various factors associated with speech. For example, can we produce speech from text for different speakers, and can we modify prosody? Unlike images where these sorts of unsupervised learning problems have been shown to be amenable to solutions (UNIT, CycleGAN, etc.), in speech we are very much in the wild west, and the gold rush is on. Built on top of Tacotron, Google researchers have made attempts at exercising finer control over factors by creating embeddings for prosodic style and speakers (also in Baidu’s works), which they call Global Style Tokens (GST).
Style Tokens are vectors exemplifying prosodic style which are shared across examples and learnt during training. They are randomly initialized, and compared against a ‘reference’ encoding (which is just a training audio example ingested by the reference encoder module) by means of attention, so that our audio example is now a weighted sum of all the style tokens. During inference, we can manually adjust the weights of each style token, or simply supply a reference encoding (again, spat out by the style encoder after putting in reference audio).
1. UNIT: Unsupervised Image to Image Translation Networks: https://arxiv.org/abs/1703.00848
2. CycleGAN: https://arxiv.org/abs/1703.10593
3. Tacotron: Towards End to End Speech Synthesis: https://arxiv.org/abs/1703.10135
4. Tacotron-2: Natural TTS Synthesis by Concatenating Wavenet on Mel Spectrogram Predictions: https://arxiv.org/abs/1712.05884
5. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron: https://arxiv.org/abs/1803.09047
6. (GST) Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis: https://arxiv.org/abs/1803.09017
7. Alex Barron’s notes (excellent) describing Tacotron, implementation and woes: http://web.stanford.edu/class/cs224s/reports/Alex_Barron.pdf
8. Baidu Deep Voice 3: http://research.baidu.com/Blog/index-view?id=91
9. (Speaker adaptation with Deep Voice) Neural Voice Cloning with a few samples: https://arxiv.org/abs/1802.06006
10. Professor Forcing: https://arxiv.org/abs/1610.090381
11. Alex Graves: “Generating Sequences With Recurrent Neural Networks”: https://arxiv.org/abs/1308.0850
12. CBHG: Fully Character-Level Neural Machine Translation Without Explicit Segmentation: https://arxiv.org/abs/1610.03017