VALL-E: 5 things to know about Microsoft’s AI model that can mimic any voice in 3 seconds

Microsoft has shown off VALL-E, a text-to-speech AI model that can simulate any voice from a short audio sample. It matches not only the voice but also the speaker's emotion and the acoustics of the room. While it has many legitimate uses, it also raises ethical concerns. Plenty of samples are available on GitHub to listen to; here are five things to know about VALL-E.
What is VALL-E?
Microsoft calls VALL-E a “neural codec language model” that generates audio from text input plus a short sample from a target speaker. It can mimic a voice after listening to a sample as short as 3 seconds. VALL-E is not generally available yet.
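To make the “neural codec language model” idea concrete, here is a minimal toy sketch of the zero-shot TTS pipeline the researchers describe: a codec encodes the 3-second speaker clip into discrete acoustic tokens, a language model continues that token sequence conditioned on the input text, and a decoder turns tokens back into audio. All three components below are dummy stand-ins invented for illustration, not Microsoft's actual models.

```python
# Toy illustration of VALL-E's pipeline shape (assumed, not official code).
# Real systems use a neural audio codec and a transformer language model;
# here each stage is a trivial placeholder so the flow is runnable.

def encode_prompt(audio_samples):
    """Stand-in for a neural audio codec: quantise a short speaker
    recording into a sequence of discrete acoustic tokens."""
    return [hash(s) % 1024 for s in audio_samples]  # toy quantisation

def codec_language_model(text, prompt_tokens):
    """Stand-in for the codec language model: conditioned on the text and
    the speaker's prompt tokens, predict acoustic tokens for new speech."""
    return prompt_tokens + [ord(c) % 1024 for c in text]  # toy continuation

def decode_tokens(tokens):
    """Stand-in for the codec decoder: map acoustic tokens back to a waveform."""
    return [t / 1024.0 for t in tokens]

def synthesise(text, speaker_clip):
    prompt_tokens = encode_prompt(speaker_clip)          # ~3 s of target voice
    acoustic_tokens = codec_language_model(text, prompt_tokens)
    return decode_tokens(acoustic_tokens)

waveform = synthesise("Hello world", [0.1, -0.2, 0.05])
print(len(waveform))
```

The key design point this sketch mirrors is that speech generation is framed as next-token prediction over codec tokens, so a new voice needs only a short token prompt rather than retraining.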
Training models
The researchers say they trained VALL-E on 60,000 hours of English-language speech from more than 7,000 speakers in Meta’s LibriLight audio library — hundreds of times more data than existing systems use.
For VALL-E to mimic a voice well, the target speaker’s voice must closely resemble a voice in the training data. The AI then draws on that training to read a desired text aloud in the target speaker’s voice.

AI can mimic emotions
Notably, the model can mimic not only the pitch and timbre of a voice but also the speaker’s emotional tone and the acoustics of the room. This means that if the sample voice contains background disturbance, VALL-E will reproduce the voice as if the same disturbance is present.
“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” the team of researchers says.
Use case and threats
The AI model could be used for customised text-to-speech applications, media production, or robotics. However, it poses a potential threat if misused.
“Since VALL-E could synthesise speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating,” the company said.

For example, scammers could use VALL-E to make spam calls sound convincing enough to con people. Politicians and other public figures could be impersonated, as we have already seen with deepfakes. Applications that rely on voice commands or voice-based authentication could be compromised. Furthermore, VALL-E may threaten the jobs of voice artists.
Ethical statement
The researchers also include an ethics statement, which says that “the experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker.”
“However, when the model is generalised to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech,” it said.