Copyright © 2026 NVIDIA Corporation

nvidia

magpie-tts-flow

API Endpoint

Expressive and engaging text-to-speech, generated from a short audio sample.

NVIDIA NIM, NVIDIA Riva, TTS, Text-to-Speech
Accelerated by DGX Cloud

Speech Synthesis: Magpie TTS Flow Model Overview

Description:

From an audio prompt of 5 seconds or less, the Magpie TTS Flow model analyzes a speaker’s voice and replicates voice qualities such as pitch, timbre, and speech rate, achieving a speaker similarity of over 70% and a mean opinion score (MOS) of 4.40. Because it preserves the characteristics that make up a speaker’s unique voice signature, it produces high-quality speech audio when combined with a vocoder model such as BigVGAN [1].

Magpie TTS Flow [2] is an alignment-aware pre-training method that builds on the E2 TTS [3] training framework to learn the alignment between unit sequences and speech frames. By using de-duplicated units that retain only phonetic content, Magpie TTS Flow learns this alignment without relying on a phoneme duration predictor. The same approach applies directly to zero-shot voice conversion, where phonetic content is transferred to the target speaker’s voice without additional fine-tuning. The model is packaged with BigVGAN, a universal vocoder that generalizes well to various out-of-distribution scenarios without fine-tuning.
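
As a rough illustration of what "de-duplicated units" means, the sketch below collapses runs of identical consecutive unit IDs, which discards per-unit duration and keeps only the order of the content. The function name and the example unit IDs are hypothetical, and the assumption that de-duplication means merging consecutive repeats follows common practice for discrete speech units; the model's actual preprocessing may differ.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse each run of identical consecutive unit IDs into one ID.

    Assumption: 'de-duplicated' means merging consecutive repeats, so the
    result no longer encodes how many frames each unit lasted.
    """
    return [unit for unit, _run in groupby(units)]


# Hypothetical frame-level unit IDs (one ID per speech frame).
frame_units = [12, 12, 12, 7, 7, 31, 31, 31, 31, 7]
print(deduplicate_units(frame_units))  # [12, 7, 31, 7]
```

Because the durations are removed from the input, the model has to learn the mapping from the shorter unit sequence back to speech frames, which is the alignment described above.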

This model is ready for commercial use.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

License/Terms of Use:

NVIDIA AI Foundation Models Community License Agreement

References:

[1] BigVGAN: A Universal Neural Vocoder with Large-Scale Training
[2] Magpie-TTS-Flow Paper
[3] E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
[4] Flow Matching for Generative Modeling

Model Architecture:

Architecture Type: Flow Matching
Network Architecture: Optimal Transport Conditional Flow Matching (OT-CFM)-based Masked Speech Modeling

Flow Matching [4] (FM) is a simulation-free approach for training Continuous Normalizing Flows (CNFs) based on regressing vector fields of fixed conditional probability paths. It is compatible with a general family of Gaussian probability paths for transforming between noise and data samples, which subsumes existing diffusion paths as specific instances. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization.
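
The OT conditional path can be written down in a few lines. The sketch below follows the formulation in the Flow Matching paper [4]: a noise sample x0 is moved toward a data sample x1 along a straight line, and the regression target is the constant velocity of that line. It is a minimal, dependency-free illustration and not the Magpie TTS Flow training code; the function names and the default `sigma_min` value are assumptions.

```python
def ot_conditional_path(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for the OT conditional path at time t in [0, 1].

    x0: noise sample, x1: data sample (lists of floats of equal length).
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0   (target vector field, constant in t)
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def flow_matching_loss(predicted, target):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# Toy usage: a real model would predict the vector field from (x_t, t).
x_t, u_t = ot_conditional_path([1.0, -2.0], [0.5, 0.5], t=0.3)
loss = flow_matching_loss([0.0, 0.0], u_t)
```

At t = 0 the path returns the noise sample, and at t = 1 it returns the data sample plus a residual of scale `sigma_min`. Because the target velocity does not depend on t, the trajectories are straight lines.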

Input:

Input Type: Text + Audio
Input Format:
For Text: Strings (Graphemes in US English)
For Audio: wav file
Input Parameters:
For text: One-Dimensional (1D)
For audio prompt: Two-Dimensional (batch x time)
Other Properties related to Input:
For Audio: Recommended format for prompt: Mono, PCM-encoded 16-bit audio; sampling rate of 22.05 kHz; duration between 3 and 5 seconds.
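
A prompt file can be checked against the recommended format before it is sent. The sketch below uses only Python's standard `wave` module; the function name `check_prompt` and the wording of the messages are hypothetical, and the check is a client-side convenience under the stated recommendations, not part of the model's API.

```python
import wave

RECOMMENDED_RATE_HZ = 22050        # 22.05 kHz
RECOMMENDED_CHANNELS = 1           # mono
RECOMMENDED_SAMPLE_WIDTH = 2       # 16-bit PCM = 2 bytes per sample
MIN_SECONDS, MAX_SECONDS = 3.0, 5.0


def check_prompt(source):
    """Return a list of deviations from the recommended prompt format.

    source: a path to a PCM wav file or a binary file-like object.
    An empty list means the prompt matches the recommendations.
    """
    with wave.open(source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if channels != RECOMMENDED_CHANNELS:
        problems.append(f"expected mono audio, found {channels} channels")
    if sample_width != RECOMMENDED_SAMPLE_WIDTH:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != RECOMMENDED_RATE_HZ:
        problems.append(f"expected 22050 Hz, found {rate} Hz")
    duration = frames / rate if rate else 0.0
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"expected 3 to 5 seconds, found {duration:.2f} s")
    return problems
```

Calling `check_prompt("prompt.wav")` on a hypothetical file returns an empty list when the file is mono, 16-bit, 22.05 kHz, and 3 to 5 seconds long.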

Output:

Output Type: Audio
Output Format: Audio of shape (batch x time) in wav format
Output Parameters: Two-Dimensional (batch x time)
Other Properties related to Output: Mono, PCM-encoded 16-bit audio; sampling rate of 22.05 kHz; maximum length of 20 seconds.
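
The output properties imply an upper bound on buffer size, which can be useful when preallocating memory on the client. The arithmetic below is a small sketch; the constant and function names are illustrative only.

```python
SAMPLE_RATE_HZ = 22050     # 22.05 kHz
BYTES_PER_SAMPLE = 2       # 16-bit PCM
MAX_OUTPUT_SECONDS = 20    # maximum output length

# Largest number of samples and raw PCM bytes in one mono utterance.
max_samples = SAMPLE_RATE_HZ * MAX_OUTPUT_SECONDS    # 441000
max_pcm_bytes = max_samples * BYTES_PER_SAMPLE       # 882000


def duration_seconds(num_samples, rate_hz=SAMPLE_RATE_HZ):
    """Convert a sample count at the given rate to a duration in seconds."""
    return num_samples / rate_hz
```

A single mono output therefore holds at most 441,000 samples, or 882,000 bytes of raw PCM data before the wav header is added.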

Supported Operating System(s):

  • Linux

Model Version(s):

Magpie-TTS-Flow_v1

Inference:

Engine: Triton
Test Hardware:

  • NVIDIA A100 GPU
  • NVIDIA A30 GPU
  • NVIDIA A10 GPU
  • NVIDIA H100 GPU
  • NVIDIA L4 GPU
  • NVIDIA L40 GPU

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.