
Natural and expressive voices in multiple languages. For voice agents and brand ambassadors.
Overview
The model is an end-to-end multilingual neural text-to-speech (TTS) model that generates speech in nine languages: English (US), Spanish (Spain), German (Germany), French (France), Italian, Vietnamese, Mandarin Chinese, Hindi, and Japanese. It predicts discrete audio codec tokens autoregressively using a transformer encoder-decoder architecture and supports at least one male and one female speaker for each language. The model employs multi-codebook prediction (typically 8 codebooks) with optional local-transformer refinement for high-quality audio generation, and leverages techniques such as attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. It supports both batch inference for complete utterances and long-form inference for very long text inputs via a sliding-window mechanism. The generated codec tokens are then converted to a speech waveform by a frozen pretrained audio codec model.
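As a rough illustration of the classifier-free guidance (CFG) technique mentioned above, the sketch below blends conditional and unconditional next-token logits with a guidance scale. This is an illustrative toy, not the model's actual implementation: the function names, logit values, and the specific blending formula are assumptions.

```python
# Hypothetical sketch of classifier-free guidance (CFG) over next-token
# logits. scale = 0 is purely unconditional, scale = 1 purely conditional,
# scale > 1 amplifies the conditioning (text) signal.

def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits per token."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def greedy_pick(logits):
    """Pick the index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

cond = [0.2, 1.5, 0.1]    # toy logits with text conditioning
uncond = [0.4, 0.3, 0.2]  # toy logits without conditioning
guided = cfg_logits(cond, uncond, 2.5)
token = greedy_pick(guided)
```

In practice the guided logits would feed a sampling step rather than a greedy argmax, but the blending idea is the same.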
This model is ready for commercial use.
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use, and the use of this model is governed by the NVIDIA Open Model License Agreement.
Global
For streaming voice-agent use cases and offline speech generation from text, in multiple languages.
Build.Nvidia.com [02/27/2026] via [https://build.nvidia.com/nvidia/magpie-tts-multilingual]
NGC [02/27/2026] via [https://registry.ngc.nvidia.com/orgs/nvstaging/teams/nim/containers/magpie-tts-multilingual]
TTS model papers:
Audio codec paper:
Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers
Network Architecture:
Causal Transformer Encoder with 6 layers, a learnable positional encoding of length 2048, and one Layer Normalization output layer.
Causal Transformer Decoder with 12 layers, a learnable positional encoding of length 2048, and one Layer Normalization output layer.
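To make the multi-codebook decoding scheme concrete, the toy below emits one token per codebook at each audio frame, as the overview describes (typically 8 codebooks). Everything here is illustrative: the codebook size, the random stand-in for the decoder, and all function names are assumptions, not the model's real interfaces.

```python
# Toy of multi-codebook autoregressive generation: each decoding step
# produces one discrete token per codebook, so a generated utterance is
# a (num_frames x NUM_CODEBOOKS) grid of codec-token indices.
import random

NUM_CODEBOOKS = 8     # per the model card ("typically 8 codebooks")
CODEBOOK_SIZE = 1024  # assumed vocabulary size per codebook

def fake_decoder_step(history, rng):
    """Stand-in for the transformer decoder: one token per codebook."""
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]

def generate_frames(num_frames, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(num_frames):
        history.append(fake_decoder_step(history, rng))
    return history  # num_frames x NUM_CODEBOOKS token indices

frames = generate_frames(5)
```

In the real model these token grids would be produced by the decoder (optionally refined by the local transformer) and then passed to the frozen audio codec for waveform synthesis.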
Number of model parameters: 3.57 × 10⁸ total (241 M trainable)
Computational Load: 1.62 × 10²¹ FLOP
Cumulative Compute: ~2,230 kWh
Estimated Energy and Emissions for Model Training: ~0.72 tCO2e
Input Type(s): Text, Audio (optional, for zero-shot voice cloning)
Input Format(s):
Input Parameters:
Output Type(s): Audio
Output Format(s):
Output Parameters:
Other Properties Related to Output: Mono, 16-bit PCM-encoded audio; sampling rate of 22.05 kHz; audio output with dimensions (B × T), where B is the batch size and T is the time dimension.
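The output format above (mono, 16-bit PCM, 22.05 kHz) can be written to a WAV file with Python's standard library alone. This is a minimal sketch: the sine wave stands in for the model's generated waveform, and the file name is arbitrary.

```python
# Write mono, 16-bit PCM audio at 22.05 kHz -- the model card's output
# format -- using only the standard library.
import math
import struct
import wave

SAMPLE_RATE = 22050  # 22.05 kHz, per the model card

def write_pcm16_wav(path, samples):
    """samples: iterable of floats in [-1.0, 1.0], written as 16-bit PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# 0.1 s of a 440 Hz tone as placeholder audio
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(int(0.1 * SAMPLE_RATE))]
write_pcm16_wav("out.wav", tone)
```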
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Preferred/Supported Operating System(s):
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
RivaTTS_MagpieTTS_Multilingual v4.0
The following datasets were used to train the model, along with additional datasets focused on speech and ASR.
Data Modality
Audio Training Data Size
Data Collection Method by dataset
Labeling Method by dataset
Properties:
Size of training set: 38k hours of audio
Modality: Audio (speech signal)
Nature of the content: Audio books
Language: Multilingual (En, Es, De, Fr, Vi, It, Zh, Hi, Ja)
Sensor Type: Microphones
Benchmark Score
Data Collection Method by dataset:
Labeling Method by dataset:
Properties:
Modality: Audio (speech signal)
Nature of the content: Audio books and Newspaper passages
Language: Multilingual (En, Es, De, Fr, Hi, Ja)
Sensor Type: Microphones
| Dataset | CER (%) | SV-SSIM |
|---|---|---|
| LibriTTS test-clean | 0.34 | 83.49 |
| Spanish CML | 1.14 | 71.53 |
| German CML | 0.66 | 62.59 |
| French CML | 2.70 | 70.34 |
| Italian | 4.00 | 66.71 |
| Vietnamese | 0.60 | 72.35 |
| Mandarin | 4.24 | - |
| Hindi | 0.86 | 75.59 |
| Japanese | 1.12 | 74.82 |
Acceleration Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.