
Natural and expressive voices in 23 languages. For voice agents and brand ambassadors.
Chatterbox Multilingual is a 500M-parameter, end-to-end multilingual text-to-speech (TTS) model from Resemble AI. It generates expressive, natural speech across 23 languages using a T3 (Text-to-Speech Token Generator) architecture paired with an S3Gen diffusion-based decoder with flow matching, enabling high-fidelity audio generation and robust cross-lingual synthesis.
Chatterbox TTS Multilingual was developed by Resemble AI as a part of Chatterbox.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see link to Non-NVIDIA Resemble AI Model Card.
Governing Terms: The Chatterbox model is governed by the NVIDIA Open Model Agreement.
ADDITIONAL INFORMATION: The Chatterbox base model is governed by the MIT License.
Global
Developers building multilingual voice agents, content localization systems, interactive media, language learning tools, and accessibility applications. Suitable for applications requiring expressive, natural speech synthesis in 23 languages.
Build.Nvidia.com: https://build.nvidia.com/resemble-ai/chatterbox-multilingual
NGC: https://registry.ngc.nvidia.com/orgs/nim/containers/chatterbox-multilingual
HuggingFace: https://huggingface.co/ResembleAI/chatterbox-multilingual
GitHub: https://github.com/resemble-ai/chatterbox
Resemble AI Chatterbox GitHub
Chatterbox Multilingual announcement blog
Hugging Face model card
PerTh Watermarker GitHub
NVIDIA API Trial Service Terms of Use / NVIDIA AI Enterprise EULA
Architecture Type: Transformer
Network Architecture: T3 token generator + S3Gen diffusion decoder with flow matching
Number of model parameters: 500M (5.0*10^8)
Input Type(s): Text
Input Format(s): UTF-8 string
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Input text to synthesize. Language code (BCP-47) required. Emotion exaggeration control: float parameter 0.0-1.0 with recommended range 0.4-0.7.
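The input constraints above can be checked before a request is sent. The following is a minimal sketch, not part of any official client: `SynthesisRequest` and `validate_request` are hypothetical names, the default exaggeration of 0.5 is only illustrative, and the language check is a loose shape test rather than full BCP-47 validation.

```python
import re
from dataclasses import dataclass

# Loose shape check only (assumption): a 2-3 letter primary subtag followed by
# optional subtags. Full BCP-47 validation is considerably more involved.
_LANG_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical request container; field names are illustrative."""
    text: str
    language: str
    exaggeration: float = 0.5  # illustrative default inside the recommended range


def validate_request(req):
    """Raise ValueError on hard violations; return a list of soft warnings."""
    if not req.text.strip():
        raise ValueError("text must be non-empty")
    try:
        # Input format is a UTF-8 string; lone surrogates fail to encode.
        req.text.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError("text cannot be encoded as UTF-8") from exc
    if not _LANG_TAG.match(req.language):
        raise ValueError(f"language code does not look like BCP-47: {req.language!r}")
    if not 0.0 <= req.exaggeration <= 1.0:
        raise ValueError("exaggeration must be within 0.0-1.0")

    warnings = []
    if not 0.4 <= req.exaggeration <= 0.7:
        warnings.append("exaggeration is outside the recommended range 0.4-0.7")
    return warnings
```

A value such as 0.9 passes the hard check but returns a warning, which mirrors the distinction between the allowed range and the recommended range.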
Output Type(s): Audio
Output Format: WAV
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Output is 24 kHz, mono, 16-bit PCM WAV audio. Recommended maximum per generation: ~15 seconds of audio.
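To make the output format concrete, the sketch below uses only the Python standard library to write and inspect a WAV file with the same layout (24 kHz, mono, 16-bit PCM) and to estimate the payload size at the recommended 15-second limit. The tone generator stands in for model output; the helper names are hypothetical.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000   # Hz, per the output description
CHANNELS = 1           # mono
SAMPLE_WIDTH = 2       # bytes per sample (16-bit PCM)
MAX_SECONDS = 15       # recommended maximum per generation (approximate)


def pcm_bytes(seconds):
    """Size of the raw PCM payload (excluding the WAV header) in bytes."""
    return round(seconds * SAMPLE_RATE) * CHANNELS * SAMPLE_WIDTH


def write_tone(seconds=0.5, freq=440.0):
    """Return WAV bytes for a sine tone; a stand-in for synthesized speech."""
    n = round(seconds * SAMPLE_RATE)
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return buf.getvalue()


def describe(wav_data):
    """Return (channels, bits per sample, sample rate, duration in seconds)."""
    with wave.open(io.BytesIO(wav_data), "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate(), duration


print(pcm_bytes(MAX_SECONDS))  # 720000 bytes of PCM data for 15 seconds
```

At 48,000 bytes per second, a 15-second generation is about 0.72 MB of audio data, which is useful when sizing response buffers.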
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine(s): Triton
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace
Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Chatterbox Multilingual v1.0
The model is integrated via NVIDIA NIM (a Triton-based microservice) with support for NVIDIA Ampere (A100, A30, A10), Hopper (H100), and Ada Lovelace (L4, L40, RTX 6000 Ada) GPUs. The preferred OS is Linux. The model can be used with the open-source weights or upgraded to the Pro version for enterprise features.
Data Modality: Text
Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed
Benchmark Score: Undisclosed
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Evaluated on diverse multilingual text-to-speech tasks across 23 languages, including stress testing for pronunciation, prosody, and cross-lingual synthesis.
Known limitations include variable quality in non-English languages, pronunciation issues in Castilian Spanish, and boundary artifacts in long text handling.
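One common way to work around long-text boundary artifacts is to split the input at sentence boundaries and synthesize each chunk separately. The sketch below illustrates that general technique only; it is not Resemble AI's or NVIDIA's documented method. The function name `chunk_text` and the 200-character budget are hypothetical (a rough proxy for short generations), and the splitter assumes Latin-script punctuation followed by whitespace.

```python
import re


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact rather than being
    cut mid-sentence, so a chunk can exceed the budget in that case.
    """
    # Split after '.', '!' or '?' when followed by whitespace (assumption:
    # Latin-script punctuation; other scripts need a different splitter).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk
            current = sentence       # start a new one with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Each chunk can then be synthesized on its own and the resulting audio concatenated; whether a short pause or crossfade between chunks sounds better depends on the content and should be checked by listening.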
Acceleration Engine: Triton
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.