Google recently developed Tacotron 2, a neural text-to-speech system that generates natural-sounding speech from text.
Text-to-speech (TTS) systems generally take complex linguistic features as input, but Tacotron 2 relies on neural networks trained on speech examples and their corresponding text transcripts.
“Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved,” wrote the Tacotron developers.
The developers combined ideas from Google’s earlier work on WaveNet and Tacotron and extended them to build Tacotron 2.
According to the research paper published by Google, Tacotron 2 consists of two deep neural networks: one translates the text into a spectrogram (a visual representation of how the frequency content of a sound changes over time), and the other, based on WaveNet, reads the spectrogram and produces the corresponding audio.
Speech produced by Tacotron 2 can emphasize words, pronounce names accurately, and stress italicized or capitalized words. It can also distinguish between a noun and a verb that share a spelling and pronounce each accordingly. For example, it can tell the noun ‘present’ from the verb ‘present’ based on context.
Google summarized how Tacotron 2 works as follows: “We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.”
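To put those numbers in perspective, the short Python sketch below works out how much data sits between the two networks. It is only an illustration built from the three figures in the quote (24 kHz, 12.5 milliseconds, 80 dimensions); the function names and the 2-second clip are hypothetical, and none of this is Google's code.

```python
import math

# Figures taken from the quote above.
SAMPLE_RATE_HZ = 24_000      # output waveform sample rate
FRAME_SHIFT_MS = 12.5        # one spectrogram frame every 12.5 ms
FEATURES_PER_FRAME = 80      # 80-dimensional spectrogram


def samples_per_frame() -> int:
    """Waveform samples that correspond to one spectrogram frame."""
    return round(SAMPLE_RATE_HZ * FRAME_SHIFT_MS / 1000)


def spectrogram_shape(duration_s: float) -> tuple[int, int]:
    """(frames, features) needed to cover a clip, rounding frames up."""
    frames = math.ceil(duration_s * 1000 / FRAME_SHIFT_MS)
    return frames, FEATURES_PER_FRAME


if __name__ == "__main__":
    # Hypothetical 2-second utterance, chosen only for illustration.
    frames, features = spectrogram_shape(2.0)
    print(samples_per_frame())           # 300
    print(frames, features)              # 160 80
    print(frames * samples_per_frame())  # 48000
```

In other words, for every second of speech the first network would produce 80 frames of 80 values each, and the second network would have to turn each of those frames into 300 audio samples.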
The Tacotron developers also noted that the system still has limitations. It has difficulty pronouncing complex words such as ‘decorum’ and ‘merlot’, it cannot be directed to sound happy or sad, and it sometimes generates strange noises.