The Race Begins: NVIDIA Takes On Speech AI Technology
Armed with its artificial intelligence speech technology, NVIDIA is making its bid to lead the field.
Today, Nvidia spoke at the Voice AI Summit about its new speech artificial intelligence (AI) ecosystem, developed in collaboration with Mozilla Common Voice. The ecosystem focuses on creating open-source pretrained models and crowdsourcing multilingual voice corpora. Nvidia and Mozilla Common Voice aim to accelerate the development of universal automatic speech recognition systems that work for speakers of all languages. Nvidia found that popular voice assistants such as Amazon Alexa and Google Home support less than 1% of the world's spoken languages. To address this, the company wants to improve linguistic inclusion in speech AI and make speech data more accessible for under-resourced languages.
Nvidia asserts that linguistic inclusion for speech AI has numerous data health advantages, such as helping AI models understand speaker variation and a range of noise profiles. The new speech AI ecosystem supports the development, maintenance, and improvement of speech AI models, datasets, and user interfaces for linguistic diversity, usability, and experience. Users can train their models on Mozilla Common Voice datasets, and those pretrained models are then made available as high-quality automatic speech recognition architectures. Other companies and individuals around the world can then adapt and use those architectures to build their own speech AI applications. “Language variety must be captured while maintaining demographic diversity,” said Caroline de Brito Gottlieb, product manager at Nvidia. “Underserved dialects, sociolects, pidgins, and accents are a few crucial elements influencing speech variety. We hope that this collaboration will enable communities to develop speech datasets and models for use in any language or situation.”
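For readers who want a concrete picture of that workflow, the snippet below is a minimal sketch of loading a pretrained NeMo ASR checkpoint and fine-tuning it on a Common Voice-style manifest. The checkpoint name, manifest path, and hyperparameters are illustrative assumptions, not details from Nvidia's announcement.

```python
# A minimal sketch: load a pretrained NeMo ASR model and fine-tune it on a
# Common Voice-style manifest. Model name, file paths, and hyperparameters
# are illustrative placeholders.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Pull a pretrained English CTC model from NGC.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)

# NeMo expects a JSON-lines manifest with audio_filepath, duration, and text
# fields; Common Voice clips can be converted into this format.
train_config = OmegaConf.create({
    "manifest_filepath": "commonvoice_train_manifest.json",  # hypothetical path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_config)

# Fine-tune with PyTorch Lightning on a single GPU.
trainer = pl.Trainer(max_epochs=5, accelerator="gpu", devices=1)
trainer.fit(asr_model)

# Quick sanity check on a local recording.
print(asr_model.transcribe(["sample.wav"]))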
Siddharth Sharma, head of product marketing, AI, and deep learning at Nvidia, said that the voice AI ecosystem “extensively focuses on not only the diversity of languages but also on accents and noise profiles that different language speakers throughout the globe have.” The company is developing speech AI for several use cases, including automatic speech recognition (ASR), artificial speech translation (AST), and text-to-speech, which it says has been a distinct focus at Nvidia. Nvidia Riva, part of the Nvidia AI platform, offers cutting-edge, GPU-optimized workflows for designing and deploying fully customizable, real-time AI pipelines for applications such as contact center agent assists, virtual assistants, digital avatars, brand voices, and video conferencing transcription. Applications built with Riva can be deployed in any cloud, in any data center, at the edge, or on embedded hardware.
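To illustrate what such a pipeline looks like from the application side, here is a hedged sketch of offline transcription with the nvidia-riva-client Python package. It assumes a Riva server is already running at the given address; the server URI, audio file, and configuration values are placeholders rather than details from Nvidia's announcement.

```python
# A sketch of offline ASR against a running Riva server using the
# nvidia-riva-client package; the URI, audio file, and settings are placeholders.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")   # address of the Riva server
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)
# Fill in encoding and sample rate from the WAV header.
riva.client.add_audio_file_specs_to_config(config, "sample.wav")

with open("sample.wav", "rb") as f:
    audio_bytes = f.read()

# Send the whole file for batch (offline) recognition and print the transcript.
response = asr_service.offline_recognize(audio_bytes, config)
for result in response.results:
    print(result.alternatives[0].transcript)
```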
NCS, the Singapore government's transportation technology partner, adapted Nvidia's Riva FastPitch model and built its own Singaporean English text-to-speech engine using the voice data of local speakers. Breeze, an app for local drivers created by NCS, translates languages including Mandarin, Hokkien, Malay, and Tamil into Singaporean English with the same expressiveness and clarity a native Singaporean speaker would use. T-Mobile, a multinational mobile communications provider, also collaborated with Nvidia to create AI-based software for its customer experience centers that transcribes real-time customer conversations and makes recommendations to thousands of front-line employees. T-Mobile built the software with Riva and Nvidia NeMo, an open-source framework for state-of-the-art conversational AI models. With these Nvidia tools, T-Mobile engineers were able to fine-tune ASR models on the company's own datasets and accurately interpret customer jargon in noisy environments.
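To give a sense of how a FastPitch-based engine like the one NCS built is typically driven, the sketch below synthesizes speech with NeMo's public pretrained FastPitch and HiFi-GAN checkpoints. The checkpoint names and output settings are assumptions based on NeMo's published models, not the NCS or T-Mobile deployments themselves.

```python
# A minimal NeMo TTS sketch: FastPitch generates a mel-spectrogram and a
# HiFi-GAN vocoder turns it into audio. Checkpoint names are illustrative;
# a deployment like NCS's would use a model fine-tuned on local speaker data.
import soundfile as sf
import torch
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
vocoder = HifiGanModel.from_pretrained("tts_hifigan").eval()

text = "Turn left at the next junction."
with torch.no_grad():
    tokens = spec_generator.parse(text)
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Write the waveform to disk; 22050 Hz is the sample rate of these checkpoints.
sf.write("speech.wav", audio.squeeze().cpu().numpy(), samplerate=22050)
```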
According to Sharma, Nvidia wants to bring recent advances in AST and next-generation speech AI to real-time metaverse use cases. Today, he noted, only slow translation from one language to another is possible, and it has to go through text. But in the future, he continued, “you can have people in the metaverse who are all able to have instant translation with each other despite speaking so many different languages.” The next phase, he added, “is designing technologies that will enable fluid interactions with people throughout the globe through real-time text-to-speech and speech recognition.”