Natural and expressive voices in multiple languages. For voice agents and brand ambassadors.
Overview
The Magpie TTS Multilingual model converts text into audio (speech).
Magpie TTS is a generative model designed to serve as the first stage of a neural text-to-speech system, used in conjunction with an audio codec model. The model uses the International Phonetic Alphabet (IPA) for training and inference and can output a female or a male voice for US English and European Spanish; for French, it uses character-based encoding.
Audio Codec is a neural codec model for speech applications. It is the second part of a two-stage speech synthesis pipeline.
This model is ready for commercial use.
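For orientation, here is a minimal sketch of requesting speech from this two-stage pipeline through the Riva Python client (nvidia-riva-client), since the model is served via Riva (see Runtime Engine(s) below). The server address and voice name are illustrative assumptions and are not specified on this card.

```python
import riva.client

# Assumed: a Riva 2.19.0+ server reachable at this address with a Magpie TTS
# Multilingual voice installed. Address and voice name are illustrative only.
auth = riva.client.Auth(uri="localhost:50051", use_ssl=False)
tts = riva.client.SpeechSynthesisService(auth)

response = tts.synthesize(
    text="Hello! How can I help you today?",
    voice_name="Magpie-Multilingual.EN-US.Female-1",  # hypothetical voice name
    language_code="en-US",
    sample_rate_hz=22050,  # matches the 22.05 kHz output documented below
)

# response.audio holds the raw mono 16-bit PCM samples
with open("reply.pcm", "wb") as f:
    f.write(response.audio)
```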
License: NVIDIA AI Foundation Models Community License Agreement
TTS model papers:
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment
Low Frame-rate Speech Codec: A Codec Designed for Fast High-quality Speech LLM Training and Inference
Network Architecture: T5-TTS + Audio codec

T5-TTS is an encoder-decoder transformer model for text-to-speech synthesis that improves robustness by learning a monotonic alignment between text and speech tokens. The model takes text tokens and reference audio codes as input and autoregressively predicts the acoustic tokens of the target speech.

Low Frame Rate Speech Codec - 21Hz is a neural audio compression model that quantizes speech or audio signals into discrete tokens at a low temporal rate of 21 frames per second. It employs a multi-stage encoding process to compress the input audio into a sequence of discrete codes while preserving essential acoustic characteristics despite the aggressive temporal compression. During encoding, it analyzes longer windows of audio to capture relevant acoustic features before downsampling to 21 Hz; during decoding, it uses neural upsampling to reconstruct high-fidelity audio at the original sampling rate. The low frame rate allows for efficient storage and transmission while still maintaining reasonable audio quality for applications like speech synthesis and audio compression.
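To make the 21 Hz figure concrete, the short computation below relates the codec's token rate to the 22.05 kHz output waveform and the 20-second maximum output length listed on this card.

```python
# Relating the codec's 21 frames/s token rate to the output waveform.
sample_rate_hz = 22_050       # output sampling rate
codec_frame_rate_hz = 21      # Low Frame Rate Speech Codec token rate
max_clip_seconds = 20         # maximum output length

samples_per_frame = sample_rate_hz / codec_frame_rate_hz      # 1050 samples
frames_for_max_clip = codec_frame_rate_hz * max_clip_seconds  # 420 frames

print(f"Each codec frame covers {samples_per_frame:.0f} waveform samples.")
print(f"A {max_clip_seconds} s utterance corresponds to {frames_for_max_clip} codec frames.")
```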
Input Type: Text
Input Format: Strings (Graphemes in US English)
Input Parameters: One-Dimensional (1D)
Output Type: Audio
Output Format: Audio of shape (batch x time) in WAV format
Output Parameters: Mono, 16-bit PCM-encoded audio; 22.05 kHz sampling rate; 20-second maximum length. Depending on input, this model can output a female or a male voice for US English, with two (2) emotions for the female voice and six (6) emotions for the male voice. The female voice supports “neutral” and “calm”; the male voice supports “neutral,” “calm,” “happy,” “fearful,” “sad,” and “angry.”
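Since the model's output is raw mono 16-bit PCM at 22.05 kHz, a common follow-up step is wrapping those bytes in a .wav container. A minimal sketch using Python's standard wave module follows; `pcm_bytes` is a placeholder for the audio bytes returned by whichever serving path you use.

```python
import wave

def save_wav(pcm_bytes: bytes, path: str = "magpie_output.wav") -> None:
    """Wrap raw model output (mono, 16-bit PCM, 22.05 kHz) in a .wav container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 16-bit PCM -> 2 bytes per sample
        wav_file.setframerate(22_050)  # 22.05 kHz sampling rate
        wav_file.writeframes(pcm_bytes)
```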
Runtime Engine(s): Riva 2.19.0 or greater
Supported Hardware Platform(s):
Supported Operating System(s):
Engine: Triton
Test Hardware:
magpie-tts-multilingual v1
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
This trial is governed by the NVIDIA API Trial Terms of Service. The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.