
Copyright © 2026 NVIDIA Corporation

nvidia

magpie-tts-multilingual

Downloadable

Natural and expressive voices in multiple languages. For voice agents and brand ambassadors.

NVIDIA NIM | NVIDIA Riva | TTS | multilingual
Accelerated by DGX Cloud

Overview

Speech Synthesis: Magpie TTS Multilingual Model Overview

Description:

The model is an end-to-end multilingual neural text-to-speech model that generates speech in nine languages (English (US), Spanish (Spain), German (Germany), French (France), Italian, Vietnamese, Mandarin Chinese, Hindi, and Japanese) by predicting discrete audio codec tokens autoregressively with a transformer encoder-decoder architecture. It supports at least one male and one female speaker for each language. It employs multi-codebook prediction (typically 8 codebooks) with optional local-transformer refinement for high-quality audio generation, and leverages techniques such as attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. The model supports both batch inference for complete utterances and long-form inference for very long text inputs via a sliding-window mechanism. The generated codec tokens are then converted to a speech waveform by a frozen, pretrained audio codec model.
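The autoregressive multi-codebook decoding described above can be sketched as a loop that emits one frame of tokens (one token per codebook) at a time. This is a toy illustration only, not the model's actual network: `predict_next_tokens` stands in for the transformer decoder, and the frame count, codebook count, and vocabulary size are assumptions.

```python
import random

def predict_next_tokens(prev_frames, num_codebooks=8, vocab_size=1024):
    """Stand-in for the transformer decoder: one token per codebook per step.
    (In the real model these would come from the encoder-decoder network.)"""
    random.seed(len(prev_frames))  # deterministic toy behaviour
    return [random.randrange(vocab_size) for _ in range(num_codebooks)]

def autoregressive_decode(text, max_steps=5, num_codebooks=8):
    """Generate a (T x num_codebooks) grid of codec tokens one frame at a time,
    each frame conditioned on all previously generated frames."""
    frames = []
    for _ in range(max_steps):
        frames.append(predict_next_tokens(frames, num_codebooks))
    return frames  # a frozen codec decoder would map this grid to a waveform

frames = autoregressive_decode("Hello world")
print(len(frames), len(frames[0]))  # → 5 8
```

In the actual model, the token grid is handed to the frozen audio codec's decoder to produce the waveform; here it is simply returned.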

This model is ready for commercial use.

License/Terms of Use:

GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; and the use of this model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography:

Global

Use Case:

Streaming voice-agent use cases and offline speech generation from text, in multiple languages.
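For the offline long-form case, the overview mentions a sliding-window mechanism over very long inputs. A minimal sketch of the text-side preparation is to split input at sentence boundaries so each chunk fits a synthesis window; the splitting rule and the 200-character limit here are illustrative assumptions, not the model's actual windowing.

```python
import re

def chunk_text(text, max_chars=200):
    """Split long input at sentence boundaries so each chunk fits a
    synthesis window; the max_chars limit is illustrative only."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two. Three.", max_chars=8))  # → ['One.', 'Two.', 'Three.']
```

Each chunk would then be synthesized in turn, with the model's sliding window handling continuity across chunk boundaries.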

Release Date:

Build.Nvidia.com [02/27/2026] via [https://build.nvidia.com/nvidia/magpie-tts-multilingual]
NGC [02/27/2026] via [https://registry.ngc.nvidia.com/orgs/nvstaging/teams/nim/containers/magpie-tts-multilingual]

Reference(s):

TTS model papers:

  • Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
  • Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Audio codec paper:

  • Low Frame-rate Speech Codec: A Codec Designed for Fast High-quality Speech LLM Training and Inference
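The Koel-TTS paper above uses classifier-free guidance (CFG), also named in the overview. As a rough illustration (not the model's actual implementation), CFG blends conditional and unconditional logits at inference time; the guidance scale here is an arbitrary assumed value.

```python
def cfg_logits(cond, uncond, scale=2.5):
    """Classifier-free guidance: push logits away from the unconditional
    distribution toward the text-conditioned one. `scale` is illustrative;
    scale > 1 strengthens conditioning, scale = 1 recovers plain decoding."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# With scale=1.0 the guided logits reduce to the conditional ones.
print(cfg_logits([2.0, 0.0], [1.0, 1.0], scale=1.0))  # → [2.0, 0.0]
```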

Model Architecture:

Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers

Network Architecture:

  1. Causal Transformer Encoder with 6 layers, a learnable positional encoding of length 2048, and one Layer Normalization output layer.

  2. Causal Transformer Decoder with 12 layers, a learnable positional encoding of length 2048, and one Layer Normalization output layer.

Number of model parameters: 3.57 × 10^8 (241 M trainable)

Computational Load:

Computational Load: 1.62 × 10²¹ FLOP
Cumulative Compute: ~2,230 kWh
Estimated Energy and Emissions for Model Training: ~0.72 tCO2e

Input(s):

Input Type(s): Text, Audio (optional for zeroshot voice cloning)

Input Format(s):

  • Text: Strings

Input Parameters:

  • Text: One-Dimensional (1D)

Output(s):

Output Type(s): Audio

Output Format(s):

  • Audio: WAV

Output Parameters:

  • Audio: One-Dimensional (1D)

Other Properties Related to Output: Mono, 16-bit PCM audio at a 22.05 kHz sampling rate; audio output with dimensions (B × T), where B is the batch size and T is the time dimension.
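The stated output format (mono, 16-bit PCM, 22.05 kHz) maps directly onto a standard WAV container. As a sketch of consuming such output, the standard-library `wave` module can wrap raw samples with exactly those properties; the test tone and helper name are illustrative, not part of any NVIDIA API.

```python
import io
import math
import struct
import wave

def write_wav(samples, path_or_buf, sample_rate=22050):
    """Wrap mono 16-bit PCM samples (floats in [-1, 1]) in a WAV container,
    matching the model's stated output format."""
    with wave.open(path_or_buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit PCM
        w.setframerate(sample_rate)
        pcm = struct.pack(
            f"<{len(samples)}h",
            *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
        )
        w.writeframes(pcm)

# 0.1 s of a 440 Hz test tone at 22.05 kHz
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 22050) for t in range(2205)]
buf = io.BytesIO()
write_wav(tone, buf)
```

Replacing `buf` with a filename writes a playable `.wav` file.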

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Riva 2.15.0

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
    • NVIDIA A100 GPU
    • NVIDIA A30 GPU
    • NVIDIA A10 GPU
  • NVIDIA Hopper
    • NVIDIA H100 GPU
  • NVIDIA Lovelace
    • NVIDIA L4 GPU
    • NVIDIA L40 GPU

Preferred/Supported Operating System(s):

  • Linux
  • Linux for Tegra (L4T)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

RivaTTS_MagpieTTS_Multilingual v4.0

Training and Evaluation Datasets:

Training Dataset:

The following datasets were used to train the model, including additional datasets focused on speech and ASR.

  • Hi-FiTTS En: link
  • HiFiTTS-2 A Large-Scale High Bandwidth Speech Dataset En: link
  • LibriTTS En: link
  • Jensen Huang 2020 Keynote En: Internal Dataset
  • RIVA Speakers En: Internal Dataset
  • CML-TTS Es: Link
  • Riva Speakers Es: Internal Dataset
  • CML-TTS Fr: Link
  • Riva Speakers Fr: Internal Dataset
  • CML-TTS It: Link
  • CML-TTS De: Link
  • Large-scale Vietnamese speech corpus (LSVSC) Vi: Link
  • InfoRe-2 Vi: Link
  • InfoRe-1 Vi: Link
  • LongFemale Vi: Internal Dataset
  • publicly available internet scale data Vi: Internal Dataset
  • RIVA Speakers Zh: Internal Dataset
  • publicly available internet scale data Zh: Internal Dataset
  • AI4Bharat Hi: link
  • Common voice Ja: Link
  • Japanese Anime Speech Ja: Link
  • Emilia YODAS Ja: Link
  • King ASR Ja: Internal Dataset

Data Modality

  • [Audio]

Audio Training Data Size

  • 60,000 Hours

Data Collection Method by dataset

  • [Human]

Labeling Method by dataset

  • [Hybrid: Human, Synthetic] - Human recorded data points were preprocessed algorithmically.

Properties:

  • Number of data items in training set: 38k hours
  • Modality: Audio (speech signal)
  • Nature of the content: Audio books
  • Language: Multilingual (En, Es, De, Fr, Vi, It, Zh, Hi, Ja)
  • Sensor Type: Microphones

Evaluation Dataset:

  • LibriTTS test-clean: link
  • CML-TTS Es: Link
  • CML-TTS Fr: Link
  • CML-TTS De: Link

Benchmark Score

Data Collection Method by dataset:

  • [Human]

Labeling Method by dataset:

  • [Human]
  • [Hybrid: Human, Synthetic] - Human labeled data points are mixed and matched to create more variabilities.

Properties:

  • Modality: Audio (speech signal)
  • Nature of the content: Audio books and newspaper passages
  • Language: Multilingual (En, Es, De, Fr, Hi, Ja)
  • Sensor Type: Microphones

  Evaluation set        CER (%)   SV-SSIM
  LibriTTS test-clean   0.34      83.49
  Spanish CML           1.14      71.53
  German CML            0.66      62.59
  French CML            2.70      70.34
  Italian               4.00      66.71
  Vietnamese            0.60      72.35
  Mandarin              4.24      -
  Hindi                 0.86      75.59
  Japanese              1.12      74.82
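The CER column above is a character error rate: the edit distance between the transcript of the synthesized audio and the reference text, divided by the reference length. A minimal sketch using standard Levenshtein distance (not NVIDIA's exact scoring pipeline, which also involves an ASR model to transcribe the audio):

```python
def char_error_rate(ref, hyp):
    """CER = Levenshtein distance between character sequences,
    divided by the reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One dropped character in an 11-character reference: CER ≈ 9.09 %
print(round(100 * char_error_rate("hello world", "helo world"), 2))  # → 9.09
```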

Inference:

Acceleration Engine: Triton
Test Hardware:

  • NVIDIA A100 GPU
  • NVIDIA A30 GPU
  • NVIDIA A10 GPU
  • NVIDIA H100 GPU
  • NVIDIA L4 GPU
  • NVIDIA L40 GPU

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.