
Expressive and engaging text-to-speech, generated from a short audio sample.
With a prompt of only 5 seconds or less, the Magpie TTS Flow model can analyze a speaker’s voice and replicate voice qualities such as pitch, timbre, and speech rate, achieving a speaker similarity of over 70% and a mean opinion score (MOS) of 4.40. While preserving the characteristics that make up a speaker’s unique voice signature, it can produce high-quality speech audio when used in combination with a vocoder model such as BigVGAN [1].
Magpie TTS Flow [2] is an alignment-aware pre-training method that builds upon the E2 TTS [3] training framework to learn the alignment between unit sequences and speech frames. By using de-duplicated units that retain only phonetic content, Magpie TTS Flow learns this alignment without relying on a phoneme duration predictor. This allows direct application to zero-shot voice conversion, where phonetic content can be transferred to the target speaker’s voice without additional fine-tuning. The model is packaged with BigVGAN, a universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning.
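The idea of de-duplicated units can be illustrated with a minimal sketch. The unit IDs below are invented for illustration, and `dedupe_units` is a hypothetical helper rather than part of the Magpie TTS Flow code. It only shows how collapsing consecutive repeats discards per-unit duration while keeping the order of the phonetic content.

```python
from itertools import groupby


def dedupe_units(units):
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _run in groupby(units)]


# Hypothetical frame-level unit IDs, one per speech frame.
frame_units = [12, 12, 12, 7, 7, 31, 31, 31, 31, 7]

# Run lengths (durations) are dropped; the unit order is preserved.
deduped = dedupe_units(frame_units)
print(deduped)  # [12, 7, 31, 7]
```

Because the run lengths are gone, the mapping from each remaining unit to a span of speech frames has to be learned by the model, which is the alignment the paragraph above refers to.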
This model is ready for commercial use.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
NVIDIA AI Foundation Models Community License Agreement
[1] BigVGAN: A Universal Neural Vocoder with Large-Scale Training
[2] Magpie-TTS-Flow Paper
[3] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
[4] Flow Matching for Generative Modeling
Architecture Type: Flow Matching
Network Architecture: Optimal Transport Conditional Flow Matching (OT-CFM)-based Masked Speech Modeling
Flow Matching (FM) [4] is a simulation-free approach for training Continuous Normalizing Flows (CNFs) based on regressing vector fields of fixed conditional probability paths. It is compatible with a general family of Gaussian probability paths for transforming between noise and data samples, which subsumes existing diffusion paths as specific instances. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization.
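As a rough illustration of the OT conditional path from the Flow Matching paper [4], the sketch below computes a point on the straight-line path between a noise sample and a data sample, together with the velocity that a model would be trained to regress. The function names, the toy three-dimensional vectors, and the value of `SIGMA_MIN` are assumptions made for this example; this is not the model's training code.

```python
import random

SIGMA_MIN = 1e-4  # assumed illustrative value, not taken from the model


def ot_path_point(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on the OT conditional path from noise x0 to data x1."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * a + t * b for a, b in zip(x0, x1)]


def ot_target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Regression target for the vector field; it does not depend on t."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def squared_error(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


random.seed(0)
x1 = [0.5, -1.0, 2.0]                        # toy "data" sample
x0 = [random.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
t = random.random()                          # time drawn uniformly from [0, 1)

x_t = ot_path_point(x0, x1, t)
u_t = ot_target_velocity(x0, x1)
# A model v(x_t, t) would be trained to minimise squared_error(v(x_t, t), u_t).
```

The path is linear in `t`, so the target velocity is constant along it. This is the property behind the claim that OT paths are simpler to follow at sampling time than curved diffusion paths.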
Input Type: Text + Audio
Input Format:
For Text: Strings (Graphemes in US English)
For Audio: wav file
Input Parameters:
For text: One-Dimensional (1D)
For audio prompt: Two-Dimensional (batch x time)
Other Properties related to Input:
For Audio: Recommended prompt format: mono, 16-bit PCM-encoded audio; sampling rate of 22.05 kHz; duration between 3 and 5 seconds.
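A prompt file can be checked against the recommended format before it is sent to the model. The following is a minimal sketch using only Python's standard `wave` module; `check_prompt` and `write_silence` are hypothetical helpers written for this example, and the silent test files exist only to exercise the check.

```python
import os
import tempfile
import wave

RECOMMENDED_RATE = 22050  # 22.05 kHz


def check_prompt(path):
    """Return a list of deviations from the recommended prompt format."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if wav.getnchannels() != 1:
            problems.append("not mono")
        if wav.getsampwidth() != 2:
            problems.append("not 16-bit")
        if rate != RECOMMENDED_RATE:
            problems.append("sampling rate is not 22.05 kHz")
        if not 3.0 <= duration <= 5.0:
            problems.append("duration outside 3 to 5 seconds")
    return problems


def write_silence(path, seconds, rate=RECOMMENDED_RATE, channels=1):
    """Write a silent 16-bit PCM WAV file, used here only as test input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, seconds=4)
    write_silence(bad, seconds=1, rate=16000, channels=2)
    good_report = check_prompt(good)  # no deviations expected
    bad_report = check_prompt(bad)    # stereo, 16 kHz, too short
```

An empty list means the file matches the recommendation. Since the format is described as recommended rather than required, a non-empty list is best treated as a warning.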
Output Type: Audio
Output Format: Audio of shape (batch x time) in wav format
Output Parameters: Two-Dimensional (batch x time)
Other Properties related to Output: Mono, 16-bit PCM-encoded audio; sampling rate of 22.05 kHz; maximum length of 20 seconds.
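Taken together, the output properties imply an upper bound of 22,050 × 20 = 441,000 samples per utterance. The sketch below encodes a list of float samples in this output format using only the Python standard library. `to_wav_bytes` is a hypothetical helper, and the assumption that samples arrive as floats in [-1, 1] is made for this example and not stated by the model card.

```python
import io
import struct
import wave

SAMPLE_RATE = 22050                      # 22.05 kHz
MAX_SECONDS = 20
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 441000 samples


def to_wav_bytes(samples):
    """Encode float samples in [-1, 1] as mono 16-bit PCM WAV bytes."""
    if len(samples) > MAX_SAMPLES:
        raise ValueError("audio exceeds the 20 second maximum length")
    # Clamp to [-1, 1], then scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack("<%dh" % len(ints), *ints)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence as a trivial example input.
one_second = to_wav_bytes([0.0] * SAMPLE_RATE)
```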
Supported Operating System(s):
Model Version(s): Magpie-TTS-Flow_v1
Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.