Natural and expressive voices in multiple languages. For voice agents and brand ambassadors.
Overview
The Magpie TTS Multilingual model converts text into audio (speech).
Magpie TTS is a generative model designed to serve as the first stage of a neural text-to-speech system, used in conjunction with an audio codec model. The model uses the International Phonetic Alphabet (IPA) for training and inference and can output a female or a male voice for US English and European Spanish; for French, it uses character-based encoding.
Audio Codec is a neural codec model for speech applications. It is the second part of a two-stage speech synthesis pipeline.
This model is ready for commercial use.
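For orientation, here is a minimal sketch of requesting speech from this two-stage pipeline through the Riva Python client (nvidia-riva-client), since the model is served via Riva (see Runtime Engine(s) below). The server address and voice name are illustrative assumptions and are not specified on this card.

```python
import riva.client

# Assumed: a Riva 2.19.0+ server reachable at this address with a Magpie TTS
# Multilingual voice installed. Address and voice name are illustrative only.
auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
tts = riva.client.SpeechSynthesisService(auth)

response = tts.synthesize(
    text="Hello! How can I help you today?",
    voice_name="Magpie-Multilingual.EN-US.Female-1",  # hypothetical voice name
    language_code="en-US",
    sample_rate_hz=22050,  # matches the 22.05 kHz output documented below
)

# response.audio holds the raw mono 16-bit PCM samples
with open("reply.pcm", "wb") as f:
    f.write(response.audio)
```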
License: NVIDIA AI Foundation Models Community License Agreement
TTS model papers:
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Low Frame-rate Speech Codec: A Codec Designed for Fast High-quality Speech LLM Training and Inference
Network Architecture: T5-TTS + Audio codec

T5-TTS is an encoder-decoder transformer model for text-to-speech synthesis that improves robustness by learning a monotonic alignment between text and speech tokens. The model takes text tokens and reference audio codes as input and autoregressively predicts the acoustic tokens of the target speech.

Low Frame Rate Speech Codec - 21Hz is a neural audio compression model that quantizes speech or audio signals into discrete tokens at a low temporal rate of 21 frames per second. It employs a multi-stage encoding process to compress the input audio into a sequence of discrete codes while preserving essential acoustic characteristics despite the aggressive temporal compression. During encoding, it analyzes longer windows of audio to capture relevant acoustic features before downsampling to 21 Hz; during decoding, it uses neural upsampling to reconstruct high-fidelity audio at the original sampling rate. The low frame rate allows for efficient storage and transmission while still maintaining reasonable audio quality for applications like speech synthesis and audio compression.
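To make the 21 Hz figure concrete, the short computation below relates the codec's token rate to the 22.05 kHz output waveform and the 20-second maximum output length listed on this card.

```python
# Relating the codec's 21 frames/s token rate to the output waveform.
sample_rate_hz = 22_050       # output sampling rate
codec_frame_rate_hz = 21      # Low Frame Rate Speech Codec token rate
max_clip_seconds = 20         # maximum output length

samples_per_frame = sample_rate_hz / codec_frame_rate_hz      # 1050 samples
frames_for_max_clip = codec_frame_rate_hz * max_clip_seconds  # 420 frames

print(f"Each codec frame covers {samples_per_frame:.0f} waveform samples.")
print(f"A {max_clip_seconds} s utterance corresponds to {frames_for_max_clip} codec frames.")
```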
Input Type: Text
Input Format: Strings (Graphemes in US English)
Input Parameters: One-Dimensional (1D)
Output Type: Audio
Output Format: Audio of shape (batch x time) in WAV format
Output Parameters: Mono, 16-bit PCM-encoded audio; 22.05 kHz sampling rate; 20-second maximum length. Depending on input, this model can output a female or a male voice for US English, with two (2) emotions for the female voice and six (6) emotions for the male voice. The female voice supports “neutral” and “calm”; the male voice supports “neutral,” “calm,” “happy,” “fearful,” “sad,” and “angry.”
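Since the model's output is raw mono 16-bit PCM at 22.05 kHz, a common follow-up step is wrapping those bytes in a .wav container. A minimal sketch using Python's standard wave module follows; `pcm_bytes` is a placeholder for the audio bytes returned by whichever serving path you use.

```python
import wave

def save_wav(pcm_bytes: bytes, path: str = "magpie_output.wav") -> None:
    """Wrap raw model output (mono, 16-bit PCM, 22.05 kHz) in a .wav container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 16-bit PCM -> 2 bytes per sample
        wav_file.setframerate(22_050)  # 22.05 kHz sampling rate
        wav_file.writeframes(pcm_bytes)
```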
Runtime Engine(s): Riva 2.19.0 or greater
Supported Hardware Platform(s):
Supported Operating System(s):
Engine: Triton
Test Hardware:
magpie-tts-multilingual v1
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
This trial is governed by the NVIDIA API Trial Terms of Service. The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.