For all the advances in AI video generation, it still takes a lot of source material, such as headshots or video footage shot from different angles, to create a convincing simulated version of someone's likeness. Faking a voice is a different story: Microsoft researchers recently revealed a new artificial intelligence tool that can imitate a person's voice using a sample just three seconds long.
The new tool, a "neural codec language model" called VALL-E, builds on EnCodec, Meta's audio compression technology revealed late last year, which uses AI to compress better-than-CD-quality audio to data rates 10 times smaller than MP3 files, with no appreciable loss of quality. Meta envisioned EnCodec as a way to improve the quality of phone calls in areas with spotty cell coverage, or to reduce bandwidth demands for music streaming services, but Microsoft is leveraging the technology as a way to make text-to-speech synthesis sound more realistic from a very limited source sample.
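The key idea behind neural codecs like EnCodec is representing each short slice of audio as a handful of small integers: indices into learned codebooks, quantized in stages so each stage encodes what the previous one missed. The sketch below is a minimal, illustrative residual vector quantizer in NumPy; the codebook sizes are arbitrary and the codebooks are random rather than learned, so it shows only the mechanism, not the real EnCodec model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 quantization stages, 16 entries per codebook,
# 8-dimensional audio frames. A real neural codec learns its codebooks;
# here they are random, with a zero entry so quantization never hurts.
n_stages, n_entries, dim = 4, 16, 8
codebooks = rng.normal(size=(n_stages, n_entries, dim))
codebooks[:, 0] = 0.0

def rvq_encode(frame, codebooks):
    """Quantize one frame into one integer index per stage."""
    residual = frame.copy()
    indices = []
    for book in codebooks:
        # Pick the codebook entry closest to the current residual.
        dists = np.linalg.norm(book - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # The next stage encodes whatever this stage left over.
        residual = residual - book[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the frame by summing the chosen entries."""
    return sum(book[i] for book, i in zip(codebooks, indices))

frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes)  # a few small integers stand in for the whole frame
print(np.linalg.norm(frame - recon) <= np.linalg.norm(frame))
```

Those discrete codes are what make the "language model" part possible: a model can predict sequences of codec tokens the same way a text model predicts words, then decode them back into audio.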
Today's text-to-speech systems can already produce very realistic-sounding voices, which is why smart assistants sound so authentic even though their verbal responses are generated on the fly. But they require clean, high-quality training data, usually captured in a recording studio with professional equipment. Microsoft's approach lets VALL-E imitate almost anyone's voice without that person spending weeks in a studio. Instead, the tool was trained on Meta's LibriLight dataset, which contains 60,000 hours of recorded English-language speech from more than 7,000 unique speakers, extracted and processed from LibriVox audiobooks, all of which are in the public domain.
Microsoft has shared an extensive collection of samples generated by VALL-E so you can hear for yourself just how capable its voice-simulation skills are, although the results are currently mixed. The tool occasionally struggles to recreate accents, including subtle ones from source samples where the speaker sounds Irish, and its ability to change the emotional delivery of a given sentence is at times laughable. But more often than not, the samples VALL-E generates sound natural, warm, and almost impossible to distinguish from the original speaker in the three-second source clip.
In its current form, trained on LibriLight, VALL-E is limited to simulating English speech, and while its performance is far from perfect, it will almost certainly improve as its sample dataset is expanded further. That improvement, however, will be left to Microsoft's researchers, as the team is not releasing the tool's source code. In a recently published research paper detailing VALL-E's development, its creators acknowledge the risks it poses:
"Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models."