Chatterbox TTS Multilingual Overview

Description:

Chatterbox Multilingual is a 500M-parameter, end-to-end multilingual text-to-speech (TTS) model from Resemble AI. It generates expressive, natural speech across 23 languages using a T3 (Text-to-Speech Token Generator) architecture paired with an S3Gen diffusion-based decoder with flow matching, enabling high-fidelity audio generation and robust cross-lingual synthesis.
Chatterbox TTS Multilingual was developed by Resemble AI as a part of Chatterbox.
This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA Resemble AI Model Card.

License/Terms of Use:

Governing Terms: The Chatterbox model is governed by the NVIDIA Open Model Agreement.

ADDITIONAL INFORMATION: The Chatterbox base model is governed by the MIT License.

Deployment Geography:

Global

Use Case:

Developers building multilingual voice agents, content localization systems, interactive media, language learning tools, and accessibility applications. Suitable for applications requiring expressive, natural speech synthesis in 23 languages.

Release Date: May 26, 2026

Build.Nvidia.com: https://build.nvidia.com/resemble-ai/chatterbox-multilingual
NGC: https://registry.ngc.nvidia.com/orgs/nim/containers/chatterbox-multilingual
HuggingFace: https://huggingface.co/ResembleAI/chatterbox-multilingual
GitHub: https://github.com/resemble-ai/chatterbox

Reference(s):

Resemble AI Chatterbox GitHub
Chatterbox Multilingual announcement blog
Hugging Face model card
PerTh Watermarker GitHub
NVIDIA API Trial Service Terms of Use / NVIDIA AI Enterprise EULA

Model Architecture:

Architecture Type: Transformer
Network Architecture: T3 token generator + S3Gen diffusion decoder with flow matching
Number of model parameters: 500M (5.0*10^8)
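As a rough sizing aid, the stated parameter count can be turned into an estimate of weight memory. The sketch below is plain arithmetic: the bytes-per-parameter table lists common numeric formats and is not taken from this card, which does not say which precision the weights ship in. The figures cover weights only, not activations, caches, or runtime overhead.

```python
# Rough weight-memory estimate from the parameter count stated in the card.
PARAM_COUNT = 500_000_000  # 5.0 * 10^8

# Bytes per parameter for common numeric formats.
# Assumption: the card does not state which format is actually used.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(param_count: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return param_count * BYTES_PER_PARAM[dtype] / (1024 ** 3)


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gib(PARAM_COUNT, dtype):.2f} GiB")
```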

Input:

Input Type(s): Text
Input Format(s): UTF-8 string
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The input is the text to synthesize. A language code (BCP-47) is required. Emotion exaggeration is controlled by a float parameter from 0.0 to 1.0, with a recommended range of 0.4 to 0.7.
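The input constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's API: SynthesisRequest, its field names, and the default exaggeration of 0.5 are hypothetical, and the language-tag pattern is a simplified shape check rather than a full BCP-47 parser.

```python
import re
from dataclasses import dataclass

# Simplified shape check for tags such as "en", "fr", or "pt-BR".
# Assumption: this is NOT a complete BCP-47 validator.
_LANG_TAG = re.compile(r"^[A-Za-z]{2,3}(-[A-Za-z0-9]{2,8})*$")

# Recommended exaggeration range from the card.
RECOMMENDED_EXAGGERATION = (0.4, 0.7)


@dataclass
class SynthesisRequest:
    """Hypothetical request container; field names are illustrative."""
    text: str
    language: str
    exaggeration: float = 0.5  # hypothetical default inside the recommended range


def validate(req: SynthesisRequest) -> list:
    """Raise ValueError on invalid input; return a list of non-fatal warnings."""
    if not req.text.strip():
        raise ValueError("text must not be empty")
    try:
        req.text.encode("utf-8")  # the card specifies a UTF-8 string
    except UnicodeEncodeError as exc:
        raise ValueError("text is not encodable as UTF-8") from exc
    if not _LANG_TAG.match(req.language):
        raise ValueError(f"language code does not look like BCP-47: {req.language!r}")
    if not 0.0 <= req.exaggeration <= 1.0:
        raise ValueError("exaggeration must be between 0.0 and 1.0")

    warnings = []
    low, high = RECOMMENDED_EXAGGERATION
    if not low <= req.exaggeration <= high:
        warnings.append(f"exaggeration {req.exaggeration} is outside the recommended {low}-{high}")
    return warnings
```

Treating values outside the recommended range as warnings rather than errors keeps the full 0.0-1.0 range usable, which matches how the card describes the parameter.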

Output:

Output Type(s): Audio
Output Format: WAV
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Output is 24 kHz, mono, 16-bit PCM WAV audio. Recommended maximum per generation: ~15 seconds of audio.
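A client can confirm that returned audio matches these properties using only the Python standard library. This is a sketch under assumptions: the function names are illustrative, and make_silence merely stands in for real model output so the example runs without the service.

```python
import io
import wave

# Output properties stated in the card.
SAMPLE_RATE_HZ = 24_000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM
MAX_SECONDS = 15.0      # recommended maximum per generation (approximate)


def pcm_bytes(seconds: float) -> int:
    """Size of the raw PCM payload for a clip of the given length."""
    return int(seconds * SAMPLE_RATE_HZ) * SAMPLE_WIDTH_BYTES * CHANNELS


def describe_wav(data: bytes) -> dict:
    """Read the header of an in-memory WAV file and report its format."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_card(info: dict) -> bool:
    """True if the audio is 24 kHz, mono, 16-bit, as the card describes."""
    return (
        info["sample_rate_hz"] == SAMPLE_RATE_HZ
        and info["channels"] == CHANNELS
        and info["sample_width_bytes"] == SAMPLE_WIDTH_BYTES
    )


def make_silence(seconds: float) -> bytes:
    """Build a silent WAV in the card's format (placeholder for model output)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00\x00" * int(seconds * SAMPLE_RATE_HZ))
    return buf.getvalue()
```

At these settings, a clip at the recommended 15-second maximum carries 15 x 24,000 x 2 = 720,000 bytes of PCM data, excluding the WAV header.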

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): Triton
Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Hopper
NVIDIA Lovelace

Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Chatterbox Multilingual v1.0

The model is integrated via NVIDIA NIM (a Triton-based microservice) with support for NVIDIA Ampere (A100, A30, A10), Hopper (H100), and Ada Lovelace (L4, L40, RTX 6000 Ada) GPUs. The preferred OS is Linux. The model can be used with the open-source weights or upgraded to the Pro version for enterprise features.

Training, Testing, and Evaluation Datasets:

Training Dataset:

Data Modality: Text

Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed

Testing Dataset:

Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Undisclosed

Evaluation Dataset:

Benchmark Score: Undisclosed

Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties (Quantity, Dataset Descriptions, Sensor(s)): Evaluated on diverse multilingual text-to-speech tasks across 23 languages, including stress testing for pronunciation, prosody, and cross-lingual synthesis.

Key Considerations:

Known limitations include variable quality in non-English languages, pronunciation issues in Castilian Spanish, and boundary artifacts in long text handling.
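For long inputs, one common approach is to split the text at sentence boundaries and synthesize each piece separately, so that no single generation grows too long. The sketch below illustrates that idea only; it is not described in this card as the model's own method. The chunk_text name and the 200-character budget are hypothetical, since the card gives no mapping from characters to seconds of audio, and the regular expression is a simple heuristic that will mis-split abbreviations.

```python
import re

# Split after Latin sentence punctuation followed by whitespace, or directly
# after CJK full-width sentence punctuation (which has no trailing space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text: str, max_chars: int = 200) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars.

    max_chars is a hypothetical budget; tune it per language and voice.
    A single sentence longer than the budget is kept whole, not cut mid-sentence.
    Sentences are rejoined with a space, which may be unwanted for CJK text.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s and s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized on its own and the resulting audio concatenated. Splitting only at sentence ends places the joins at natural pauses, though whether that avoids the boundary artifacts noted above would need to be tested per use case.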

Inference:

Acceleration Engine: Triton
Test Hardware:

NVIDIA A100 GPU
NVIDIA A30 GPU
NVIDIA A10 GPU
NVIDIA H100 GPU
NVIDIA L4 GPU
NVIDIA L40 GPU

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
