Microsoft researchers have recently announced a new AI model called VALL-E that can closely simulate a person’s voice when given just a three-second audio sample. The model, which Microsoft calls a “neural codec language model,” builds on EnCodec, an audio compression technology that Meta announced in October 2022. EnCodec compresses audio into sequences of discrete codec tokens, and VALL-E learns to predict those tokens from a text prompt and the short voice sample, effectively treating speech synthesis as a language-modeling task over audio codes.
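To make the codec-token idea concrete, here is a minimal sketch using Meta’s open-source encodec package; the usage mirrors the library’s published README, though the file name is illustrative and details may vary by version. It turns a short clip into the grid of discrete tokens that VALL-E then models much like text.

```python
# Minimal sketch: turning a short clip into discrete EnCodec tokens,
# the representation VALL-E treats as a "language" of audio.
# Requires: pip install encodec torchaudio
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Pretrained 24 kHz codec; the target bandwidth determines how many
# parallel codebooks (token streams) the encoder emits.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("speaker_sample.wav")  # e.g. a ~3-second clip
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))

# codes has shape [batch, n_codebooks, n_timesteps]: integer tokens
# that a language model can be trained to predict from text.
codes = torch.cat([frame for frame, _ in encoded_frames], dim=-1)
print(codes.shape)
```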
One of the most exciting potential uses for VALL-E is high-quality text-to-speech. Because the model can closely imitate a person’s voice from such a short sample, it could significantly improve the quality of text-to-speech technology. It can also preserve the speaker’s emotional tone, which could make the synthesized speech feel more natural and lifelike.
Another potential application for VALL-E is speech editing. Given a recording of a person and a text transcript, VALL-E can edit the recording to make the person sound as if they said something different. This could be useful in a number of scenarios, such as film production, voice-over work, and even news reporting.
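Since VALL-E itself has not been released, the editing workflow can only be sketched. The toy snippet below illustrates the bookkeeping such an editor would need: diff the original transcript against the edited one, keep the audio aligned to the unchanged words, and mark the changed span for resynthesis. This is purely an illustration of the idea, not VALL-E’s actual procedure.

```python
# Toy illustration of speech-editing bookkeeping: diff the two
# transcripts, keep audio for unchanged words, and flag the changed
# span for resynthesis in the speaker's voice.
import difflib

original = "we ship the update on friday".split()
edited = "we ship the update on monday".split()

matcher = difflib.SequenceMatcher(a=original, b=edited)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        print(f"keep original audio for words {original[i1:i2]}")
    else:
        print(f"resynthesize words {edited[j1:j2]} in the speaker's voice")
```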
VALL-E could also be used for audio content creation, especially when combined with other generative AI models like GPT-3. Imagine being able to generate realistic audio content for podcasts, audiobooks, or even music with a virtual artist. This could open up new possibilities for content creation and distribution, and could even lead to new forms of entertainment.
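As a rough sketch of what such a pipeline might look like, the snippet below asks GPT-3 for a short script (using the OpenAI completion API as it existed at the time) and hands it to a made-up vall_e_synthesize placeholder, since no public VALL-E interface exists.

```python
# Hypothetical pipeline: a text model writes the script, a voice model
# reads it. The OpenAI call matches the completion API of the GPT-3
# era; vall_e_synthesize is an invented placeholder, as VALL-E has no
# public release.
import openai

openai.api_key = "YOUR_API_KEY"

def write_podcast_intro(topic: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Write a three-sentence podcast intro about {topic}.",
        max_tokens=120,
    )
    return response["choices"][0]["text"].strip()

def vall_e_synthesize(text: str, speaker_sample: str) -> bytes:
    """Placeholder: would return audio of `text` spoken in the voice
    captured by the three-second `speaker_sample` clip."""
    raise NotImplementedError("VALL-E is not publicly available")

script = write_podcast_intro("the history of speech synthesis")
# audio = vall_e_synthesize(script, "host_sample.wav")  # would fail today
```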
To train the model, Microsoft used an audio library called LibriLight, which contains 60,000 hours of English-language speech from more than 7,000 speakers, most of it drawn from LibriVox public-domain audiobooks. The model produces its best results when the voice in the three-second sample closely matches a voice in the training data.
VALL-E can also imitate the “acoustic environment” of the sample audio. For example, if the sample came from a telephone call, the synthesized output will simulate the acoustic and frequency properties of a phone call, making it sound as if it were recorded over the phone. Additionally, the model can generate variations in voice tone by changing the random seed used in the generation process, further expanding the possibilities of its use.
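Here is a stand-in sketch of what seed-controlled variation means in practice; the model internals are faked with random tensors, since the real model is unavailable. The point is only that sampling under a fixed seed is reproducible, while a new seed yields a different “take” of the same line.

```python
# Stand-in sketch of seed-controlled variation. The model internals
# are faked; the point is that sampling under a fixed seed is
# reproducible, while a new seed gives a new "take".
import torch

def synthesize(text: str, speaker_codes: torch.Tensor, seed: int) -> torch.Tensor:
    torch.manual_seed(seed)  # pin the sampling randomness
    # A real model would autoregressively *sample* acoustic tokens
    # conditioned on `text` and `speaker_codes`; we fake the logits.
    logits = torch.randn(8, 1024)
    tokens = torch.multinomial(logits.softmax(dim=-1), num_samples=1)
    return tokens

codes = torch.zeros(1, 8, 225, dtype=torch.long)  # illustrative 3 s of codec tokens
take_a = synthesize("Hello there.", codes, seed=0)
take_b = synthesize("Hello there.", codes, seed=1)   # a different delivery
take_a2 = synthesize("Hello there.", codes, seed=0)  # identical to take_a
assert torch.equal(take_a, take_a2)
```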
Overall, the release of VALL-E is an exciting development in the field of AI and speech synthesis. With the ability to closely imitate a person’s voice, preserve emotional tone and mimic acoustic environments, the model has the potential to revolutionize the way we think about text-to-speech, speech editing and audio content creation.