The machine-learning-powered text-to-speech synthesis that Google uses in its own products, such as Maps and Google Assistant, is now coming to Google Cloud Platform as Cloud Text-to-Speech.
The move will enable developers to add Google’s text-to-speech functionality to their own applications and build more interactive interfaces.
Cloud Text-to-Speech offers 32 different voices in 12 languages and lets developers customize voice pitch, speaking rate, and volume gain.
Google claims the service correctly pronounces complex text such as names, dates, times, and addresses. It also supports multiple audio formats, including MP3 and WAV.
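To make the customization options concrete, here is a minimal sketch of assembling a request body for the service's public REST endpoint (`v1 text:synthesize`). The field names follow the published API; the specific voice name and parameter values are placeholder assumptions for illustration, and the request is only built here, not sent.

```python
# Sketch: build a JSON body for a POST to
# https://texttospeech.googleapis.com/v1/text:synthesize
# Field names follow the public Cloud Text-to-Speech REST API;
# the voice name and tuning values below are placeholder choices.
import json

def build_synthesis_request(text, voice_name="en-US-Wavenet-A",
                            language_code="en-US", speaking_rate=1.0,
                            pitch=0.0, volume_gain_db=0.0,
                            audio_encoding="MP3"):
    """Assemble the request body for a text:synthesize call."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": audio_encoding,   # e.g. MP3, or LINEAR16 for WAV
            "speakingRate": speaking_rate,     # 1.0 is the voice's normal speed
            "pitch": pitch,                    # offset in semitones
            "volumeGainDb": volume_gain_db,    # gain in dB relative to normal
        },
    }

body = build_synthesis_request("Hello from Cloud Text-to-Speech!",
                               speaking_rate=1.1, pitch=-2.0)
print(json.dumps(body, indent=2))
```

The response to such a request contains the synthesized audio as a base64-encoded string, which the application decodes and writes out in the requested format.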
Cloud Text-to-Speech is integrated with DeepMind’s WaveNet technology, so developers can select from a set of high-quality voices. WaveNet is a neural network trained on many samples of speech that creates raw audio waveforms from scratch. Given text input, the WaveNet model generates the waveform one audio sample at a time, producing more natural-sounding speech.
The WaveNet model in Cloud Text-to-Speech is an improved version of the original, which launched in late 2016. The updated model generates raw waveforms at 24,000 samples per second, producing one second of speech in just 50 milliseconds. Sample resolution has also been increased from 8 bits to 16 bits, yielding higher-fidelity audio that sounds closer to a human voice.
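The sample-by-sample generation described above can be illustrated with a toy loop. This is not WaveNet itself; the "predictor" here is a stand-in decaying sine wave, whereas the real model runs a neural network over the history of previously generated samples at each step.

```python
# Toy illustration (not WaveNet) of autoregressive audio generation:
# each new sample is computed from the samples generated so far,
# the way WaveNet conditions each output sample on the waveform history.
import math

SAMPLE_RATE = 24000  # samples per second, matching the updated WaveNet model

def next_sample(history):
    # Stand-in predictor: a real model would evaluate a neural network
    # on `history`; here we just continue a decaying 440 Hz sine wave.
    t = len(history)
    return 0.5 * math.exp(-t / SAMPLE_RATE) * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)

def generate(num_samples):
    waveform = []
    for _ in range(num_samples):
        waveform.append(next_sample(waveform))  # one sample at a time
    return waveform

# One second of audio at 24 kHz is 24,000 samples.
audio = generate(SAMPLE_RATE)
print(len(audio))  # 24000
```

The loop makes clear why generation speed matters: producing one second of audio requires 24,000 sequential model evaluations, which is why the reported 50-millisecond synthesis time is a significant engineering result.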
The service is intended for three main verticals: call centers, IoT devices, and text-based media. Call centers can use Cloud Text-to-Speech for interactive voice response systems and real-time natural language conversations.
It can be embedded in IoT devices so that the devices can talk back to users, and it can convert news articles and books into spoken formats such as podcasts or audiobooks. The service is now generally available to developers on Google Cloud Platform.