Given just a few seconds of audio input, a new AI system can produce natural-sounding speech and music. Google's new AI, AudioLM, generates audio that matches the style of the prompt, including complex sounds such as piano music or people talking, with almost no audible difference from the original recording. The technique could speed up the process of training AI to generate audio, and it may one day be used to automatically create soundtracks for videos.
AI-generated voices are already widespread in home assistants such as Alexa, which rely on natural language processing. And AI music systems such as OpenAI's Jukebox have produced impressive results, but most existing methods require people to prepare transcriptions and label text-based training data, which takes considerable time and effort. Jukebox, for instance, uses text-based data to generate song lyrics.
AudioLM, described in a recent non-peer-reviewed paper, requires no transcription or labeling. Instead, audio databases are fed into the program, and machine learning is used to compress the audio files into short sound clips, called "tokens," without losing much of the original quality. This tokenized training data is then fed into a machine-learning model that uses natural language processing to learn the sound's patterns.
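To make the tokenization step concrete, here is a minimal sketch in Python. It is not Google's actual tokenizer (AudioLM uses learned neural audio codecs); it stands in for the idea with simple vector quantization: the waveform is cut into short frames, and each frame is mapped to the ID of its nearest codebook entry, so the audio becomes a sequence of integer tokens. The frame length, codebook size, and synthetic test signal are illustrative assumptions.

```python
# Toy illustration of turning raw audio into discrete "tokens" via
# vector quantization. This is a stand-in for AudioLM's learned neural
# codec, not the real thing: chop the waveform into frames and map each
# frame to the ID of its nearest codebook entry.
import numpy as np
from sklearn.cluster import KMeans

SAMPLE_RATE = 16_000
FRAME_LEN = 320           # 20 ms frames at 16 kHz (assumed)
CODEBOOK_SIZE = 64        # number of distinct token IDs (assumed)

# Synthetic "recording": a few seconds of a slowly changing tone.
t = np.arange(SAMPLE_RATE * 4) / SAMPLE_RATE
audio = np.sin(2 * np.pi * (220 + 40 * np.sin(0.5 * t)) * t)

# Split the waveform into fixed-length frames.
n_frames = len(audio) // FRAME_LEN
frames = audio[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)

# Learn a codebook over the frames and assign each frame a token ID.
codebook = KMeans(n_clusters=CODEBOOK_SIZE, n_init=10, random_state=0)
tokens = codebook.fit_predict(frames)

print(tokens[:20])  # e.g. a compact symbolic view of the audio: [17 17 42 5 ...]
```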
To generate audio, a few seconds of sound are fed into AudioLM, which then predicts what comes next. The process is comparable to the way language models such as GPT-3 predict which words and sentences typically follow one another.
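The continuation step can be sketched in the same spirit, reusing the `tokens` and `CODEBOOK_SIZE` from the snippet above. AudioLM uses Transformer language models for this stage; the toy bigram table below is only a stand-in that shows the autoregressive idea: condition on the prompt tokens, then repeatedly sample a plausible next token.

```python
# Minimal sketch of the "predict what comes next" step over token IDs.
# A simple bigram table stands in for AudioLM's Transformer language model.
import numpy as np

def train_bigram(tokens, vocab_size):
    """Count token-to-token transitions and normalize into probabilities."""
    counts = np.ones((vocab_size, vocab_size))  # add-one smoothing
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        counts[prev, cur] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def continue_sequence(prompt, transitions, n_new, rng):
    """Autoregressively extend the prompt by sampling from the model."""
    seq = list(prompt)
    for _ in range(n_new):
        probs = transitions[seq[-1]]
        seq.append(int(rng.choice(len(probs), p=probs)))
    return seq

rng = np.random.default_rng(0)
transitions = train_bigram(tokens, CODEBOOK_SIZE)

prompt = tokens[:150]   # "a few seconds" of tokenized audio as the prompt
generated = continue_sequence(prompt, transitions, n_new=300, rng=rng)
# In the real system, the new token IDs would then be decoded back into
# a waveform by the neural codec.
```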
The audio samples the team produced sound fairly natural. In particular, piano music generated with AudioLM sounds more fluid than piano music generated with previous AI techniques, which tends to sound chaotic.
According to Carnegie Mellon University's Roger Dannenberg, who studies computer-generated music, AudioLM already has far better sound quality than earlier music creation software. He claims that AudioLM is surprisingly effective at recreating some of the repetitive rhythms found in music created by humans.
To produce realistic piano music, AudioLM has to capture in fine detail the subtle vibrations contained in each note when the piano keys are struck. It also has to sustain the music's rhythms and harmonies over time.
AudioLM is not just for music. Because it was trained on a library of recordings of humans speaking sentences, the system can also generate speech that continues in the accent and cadence of the original speaker, although at this point those sentences may still sound like non sequiturs. AudioLM learns which kinds of sound clips tend to occur together and uses that process in reverse to produce sentences. It also has the advantage of picking up the pauses and exclamations that are natural in spoken language but hard to convey in text.
Rupal Patel, who researches information and speech science at Northeastern University, says that earlier AI-based audio generation could capture those nuances only if they were explicitly annotated in the training data. AudioLM, by contrast, picks up such characteristics automatically from the input data, which adds to the sense of realism.
In the future, AI-generated music could be used to create more natural-sounding soundtracks for slideshows and videos. The team hopes to produce more complex sounds, such as a band with several instruments or audio that mimics a recording of a tropical rainforest.
Patel argues that the ethical implications of the technology also need to be considered. It is important to determine whether the musicians whose clips are used as training data will receive attribution or royalties from the final product, an issue that has already arisen with text-to-image AIs. AI-generated speech that is indistinguishable from the real thing could also become convincing enough to make spreading misinformation easier.
In the paper, the researchers state that they are already considering and working to mitigate these problems, for instance by developing techniques to distinguish natural sounds from sounds produced using AudioLM.