Jiarui Hai1*, Karan Thakkar1*, Helin Wang1, Zengyi Qin2,3, Mounya Elhilali1
1Johns Hopkins University, Baltimore, MD, USA
2Massachusetts Institute of Technology, Cambridge, MA, USA
3MyShell.ai, USA
*equal contribution
DreamVoice is an innovative approach to voice conversion (VC) that leverages text-guided generation to create personalized and versatile voice experiences. Unlike traditional VC methods, which require a target recording during inference, DreamVoice introduces a more intuitive solution by allowing users to specify desired voice timbres through text prompts.
Source | Prompt | DreamVC | DreamVG+ReDiffVC | DreamVG+FreeVC |
---|---|---|---|---|
A smooth young voice with a gender-neutral tone, that sounds cute. | ||||
Authoritative sounding person, who is gender-ambiguous and adult. | ||||
Senior's voice who can sound like a male or female with a smooth voice, perfect for storytelling. | ||||
A female adult voice with a warm and bright voice, perfect for client and public interaction. | ||||
A dark, smooth, and authoritative adult female voice, who sounds attractive and ideal for storytelling. | ||||
A teenage girl's voice, characterized by brightness, smoothness, and nasal quality. | ||||
Rough sounding attractive teenage girl with a voice suited for client and public interaction. | ||||
A teenage girl's voice that is smooth, warm, and attractive, perfect for captivating storytelling. | ||||
A senior female voice, dark, authoritative, and strong, ideal for diplomacy and judiciary roles. | ||||
A senior woman's voice carries with warmth, depth, and an authoritative tone. | ||||
Adult male voice, dark and smooth, authoritative and attractive. | ||||
A mature male voice, bright and engaging, good for client and public interaction. | ||||
Young boy with a bright, weak, and nasal voice. | ||||
Teenager's voice that is rough and weak. | ||||
A senior male voice, with a rough texture. |
Modules in Blue
are pre-trained models and remain Frozen
during training, while Modules in Green
are Trainable
.Yellow Blocks
represent the source speaker's Content Information
while Red Blocks
represent the target speaker's Timbre Information
.Purple blocks
correspond to the Converted Speech
.LM
represents the Language Model.KV
represents Cross-Attention (Vaswani et al., 2017) and FiLM
represents Feature-wise Linear Modulation layers (Perez et al., 2018) used for fusing Text Prompt and diffusion step t
respectively.SDE solver
is the stochastic differential equations for the diffusion sampling.Prompt
is the text description about the desired target voice.t
is the diffusion step.Content
is the content embedding of the source speaker.s
is the speaker embedding of the target voice.m
is the mel-spectrogram.m_t
and s_t
represent the noisy versions of the mel-spectrogram and the speaker embedding at the diffusion step t
.