What is VALL-E?
Microsoft calls VALL-E a “neural codec language model” that generates audio from text input and a short sample of a target speaker. It can mimic a voice from a sample as short as 3 seconds. VALL-E is not generally available yet.
Researchers say they have trained VALL-E on 60,000 hours of English-language speech — hundreds of times more than existing systems — from 7,000-plus speakers in Meta’s LibriLight audio library.
For the mimicry to work well, the target speaker’s voice must closely resemble voices in the training data. VALL-E then draws on that training to read a desired text aloud in the target speaker’s voice.
AI can mimic emotions
Notably, the model mimics not just the pitch and timbre of a voice but also the speaker’s emotional tone and the acoustics of the recording environment. If the voice sample contains a background disturbance, VALL-E will reproduce the voice as if that disturbance were present.
“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” the team of researchers says.
Use case and threats
The model could be used for customised text-to-speech applications, media production, or robotics. However, it also poses a threat if misused.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating,” the company said.
For example, scammers could use VALL-E to make spam calls sound convincing. Politicians and other public figures could be impersonated, as we have already seen with deepfakes. Applications that rely on voice commands or voice-based authentication could also be compromised. Furthermore, VALL-E may displace the jobs of voice artists.
Microsoft has also published an ethics statement, which says that “the experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker.”
“However, when the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” it said.