Bing Speech API – Automated Machine Learning For Speech Recognition With Text Streaming

Sudipto

Published: 23 Feb 2018

Category: Advanced Web Development, Artificial Intelligence (AI), New Technologies

The Speech-to-Text API lets you build voice-triggered apps. It can recognize audio in real time, whether the audio comes from a microphone, another audio source, or a file, and real-time streaming is available in every case. The Text-to-Speech API, in turn, lets you build apps that speak to the user: when the application needs to “talk” back, the API converts text into audio that can be played back.
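As a rough illustration of the speech-to-text side, the sketch below only builds an HTTP request for a short audio clip and does not send it. The endpoint URL, the query parameters, and the header names are assumptions based on the Bing Speech REST interface and should be checked against the official documentation; `build_recognition_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed REST endpoint for short-utterance recognition (verify before use).
STT_ENDPOINT = (
    "https://speech.platform.bing.com/speech/recognition/"
    "interactive/cognitiveservices/v1"
)


def build_recognition_request(wav_bytes: bytes, subscription_key: str,
                              language: str = "en-US") -> Request:
    """Build (but do not send) a POST request carrying WAV audio."""
    query = urlencode({"language": language, "format": "simple"})
    headers = {
        # Assumed header names; the key comes from your Azure subscription.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codec=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return Request(f"{STT_ENDPOINT}?{query}", data=wav_bytes,
                   headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would perform the call; keeping request construction separate makes it easy to inspect the URL and headers first.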

TTS support for 34 languages

Voice is increasingly used to interact with smart devices. The ability to provide not only voice input but also voice output, or Text-to-Speech (TTS), is critical for systems that support artificial intelligence, and TTS is essential for applications that enable accessibility. Microsoft currently offers its speech cognitive services in 34 languages. The six new TTS languages added to its Speech API Cognitive Services are:

Bulgarian
Slovenian
Tamil
Malay
Vietnamese
Croatian

The six new TTS languages will become available through the Microsoft Translator Service API and the Microsoft Translator apps by the end of February 2018. All 34 languages are available across 48 locales and 78 voice fonts. The Text-to-Speech API can be used for hands-free communication, on its own for accessibility, or for any other machine-to-human interaction. The Bing Speech API can also be combined with other Cognitive Services APIs, such as Language Understanding, to develop comprehensive voice-driven solutions.
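Text-to-speech services commonly accept their input as SSML (Speech Synthesis Markup Language), where the locale and the voice font are named explicitly. The sketch below assembles such a document; `build_ssml` is a hypothetical helper, and the voice name passed in is a placeholder, since the actual voice font names for each locale have to be taken from the service documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, locale: str, voice_name: str) -> str:
    """Wrap plain text in a minimal SSML document for one locale and voice."""
    return (
        f'<speak version="1.0" xml:lang={quoteattr(locale)}>'
        f'<voice xml:lang={quoteattr(locale)} name={quoteattr(voice_name)}>'
        f'{escape(text)}'  # escape &, < and > so the markup stays well-formed
        '</voice></speak>'
    )
```

For example, `build_ssml("Tom & Jerry", "en-US", "ExampleVoice")` produces markup in which the ampersand is escaped as `&amp;`. Switching language is then a matter of changing the locale and the voice name.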

Advances enabled by neural networks have driven speech recognition technology that has transformed daily life, from digital assistants and email dictation to the transcription of meetings.

Customized language models for recognition accuracy

Acoustic and language models are two important components of a speech recognition system. If your application uses vocabulary that is rare in everyday conversation, customizing the language model can significantly improve recognition accuracy. You can upload text data, such as colloquial sentences or phrases from the target domain, to build language models that any device can access through the Speech API.
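Before uploading, the domain text usually needs some tidying. The sketch below assumes the service expects a plain-text file with one sentence or phrase per line; that format, and the helper name `prepare_language_data`, are assumptions for illustration rather than a documented requirement.

```python
def prepare_language_data(lines):
    """Normalize raw domain sentences into one-utterance-per-line text."""
    cleaned = []
    for line in lines:
        # Collapse runs of whitespace and trim the ends.
        sentence = " ".join(line.split())
        if sentence:  # skip blank lines
            cleaned.append(sentence)
    if not cleaned:
        return ""
    return "\n".join(cleaned) + "\n"
```

The returned string can be written to a UTF-8 text file and uploaded as the language data for the custom model.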

Custom speech service

University lectures are a typical example, as domain-specific terminology is used extensively in them and specific terms need to be transcribed correctly. Microsoft's Presentation Translator offers highly accurate results for domain-specific audio. The Custom Speech Service allows the language and acoustic models to be adapted with zero coding.

Text results in real time

The Speech API supports a range of devices, including mobile phones, laptops, and IoT devices such as televisions and cars. It can handle noisy audio from many environments without additional noise cancellation. Through text streaming, the API returns results in real time: recognized text appears while the user is still speaking.
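To make the idea of text streaming concrete, the sketch below shows how a client might turn a stream of partial and final results into the text shown on screen. The message shape (`type` set to `hypothesis` or `phrase`, plus a `text` field) is a simplified assumption made for this example, not the service's actual wire format, and `render_stream` is a hypothetical name.

```python
import json


def render_stream(messages):
    """Yield the text a UI would display after each incoming message."""
    finalized = []  # phrases the recognizer has committed to
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "hypothesis":
            # Partial result: show it after the committed text, but do not keep it.
            yield " ".join(finalized + [msg["text"]])
        elif msg["type"] == "phrase":
            # Final result for this utterance: it replaces the partial text.
            finalized.append(msg["text"])
            yield " ".join(finalized)
```

Each hypothesis overwrites the previous one on screen, which is why the text appears to grow and correct itself while the user is still speaking; only the final phrases are retained.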

With the Speech API, developers can draw on advances in AI to build new and transformative experiences for their customers. For a deeper explanation of how speech recognition works, see the explainer videos on Presentation Translator on Microsoft Azure.
