Google Cloud updates AI-powered speech tools for enterprises

Google Cloud announced last week it’s updating its Text-to-Speech products with more voices and more languages. Google has also improved the quality of its Speech-to-Text transcription tools and is bringing some of their features into general availability. The updates should help developers build intelligent voice applications that reach millions more people and function more effectively.

For Text-to-Speech, Google has roughly doubled the number of voices available since its last update in August. It’s added support for seven new languages or variants, including Danish, Portuguese/Portugal, Russian, Polish, Slovakian, Ukrainian and Norwegian Bokmål — all in beta. The product now supports a total of 21 languages.
Across those new languages, Google has added 31 new WaveNet voices and 24 new standard voices. Google says it now supports a total of 106 voices.
WaveNet is a deep neural network for generating raw audio, which creates voices that are more natural-sounding than standard text-to-speech voices. The technology was created by DeepMind, the AI company Google acquired in 2014.
“Thanks to unique access to WaveNet technology powered by Google Cloud TPUs, we can build new voices and languages faster and easier than is typical in the industry,” Google product manager Dan Aharon said in a blog post.
Google’s primary competition for Text-to-Speech services is Amazon Web Services’ Polly, which according to its website currently offers 58 voices.
In addition to adding new voices, Google is making its Text-to-Speech Device Profiles feature generally available. This lets customers optimize audio playback for different types of hardware, such as headphones for media applications like podcasts.
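For developers wondering what selecting a device profile might look like in practice, here is a minimal sketch that builds the JSON body of a synthesis request. It assumes the public REST endpoint and the `effectsProfileId` field of the audio configuration, neither of which is named in Google's announcement. The helper name `build_synthesize_request`, the sample sentence and the default profile are illustrative choices only.

```python
import json

# Assumed REST endpoint for synthesis (not stated in the article).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, language_code="da-DK",
                             profile="headphone-class-device"):
    """Build a request body that asks for audio tuned to one device class.

    Hypothetical helper: it only assembles the JSON and sends nothing.
    """
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {
            "audioEncoding": "MP3",
            # Device profiles are passed as a list of profile IDs.
            "effectsProfileId": [profile],
        },
    }


if __name__ == "__main__":
    body = build_synthesize_request("Velkommen til dagens podcast.")
    print(json.dumps(body, indent=2, ensure_ascii=False))
```

Sending the body requires an authenticated POST to the endpoint, which is left out here to keep the sketch self-contained.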

Meanwhile, for Speech-to-Text, Google is bringing into general availability premium models for video and enhanced phone, which were rolled out in beta last year. The video model, which is based on technology similar to what YouTube uses for automatic captioning, now has 64 percent fewer transcription errors, Google announced. The enhanced phone model now has 62 percent fewer errors.
Google was able to improve the models by requiring customers who used the premium services to share usage data via data logging. Starting now, customers can use the enhanced phone model without opting into data sharing, while those who opt in will pay a lower rate. Prices are also lower for all premium video model customers, and those who opt into data sharing will get an additional discount.
Google is also announcing the general availability of multi-channel recognition, which helps the Speech-to-Text API distinguish between multiple audio channels. This is useful in scenarios involving multiple people, such as meeting analytics.
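As a rough illustration of multi-channel recognition, the sketch below builds a recognition request that asks for each audio channel to be transcribed separately, then groups the returned transcripts by channel. It assumes the REST field names `audioChannelCount`, `enableSeparateRecognitionPerChannel` and `channelTag`, which are not mentioned in the announcement. The bucket path, helper names and sample response are made up for the example.

```python
# Assumed REST endpoint for recognition (not stated in the article).
RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_multichannel_request(gcs_uri, channels=2, language_code="en-US"):
    """Build a request body asking for one transcript per audio channel.

    Hypothetical helper: it only assembles the JSON and sends nothing.
    """
    return {
        "config": {
            "languageCode": language_code,
            "audioChannelCount": channels,
            "enableSeparateRecognitionPerChannel": True,
        },
        "audio": {"uri": gcs_uri},
    }


def group_by_channel(response):
    """Collect the top transcript of every result under its channel tag."""
    by_channel = {}
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            tag = result.get("channelTag")
            transcript = alternatives[0].get("transcript", "")
            by_channel.setdefault(tag, []).append(transcript)
    return by_channel


if __name__ == "__main__":
    # Illustrative response shape only, not real service output.
    sample = {
        "results": [
            {"channelTag": 1, "alternatives": [{"transcript": "hello"}]},
            {"channelTag": 2, "alternatives": [{"transcript": "hi there"}]},
            {"channelTag": 1, "alternatives": [{"transcript": "shall we start"}]},
        ]
    }
    print(build_multichannel_request("gs://example-bucket/meeting.wav"))
    print(group_by_channel(sample))
```

For meeting analytics, grouping by channel is what lets each participant's speech be analyzed on its own, provided every speaker is recorded on a separate channel.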