
A multilingual, hybrid-reasoning model optimized for Indian language tasks, programming, and mathematical reasoning.
Sarvam-m generates human-like text for a smooth, accessible multilingual chat experience and is intended for general-purpose conversation and text-generation tasks.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to the Non-NVIDIA sarvamai/sarvam-m model card.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. The model is governed by the NVIDIA Community Model License Agreement; ADDITIONAL INFORMATION: Apache License Version 2.0.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Deployment Geography: Global
This model is intended for users seeking a seamless multilingual chatting experience, particularly those interested in Indian languages and cultures. It can be used by developers, researchers, and individuals looking to leverage its advanced reasoning capabilities for coding, math, and general conversation purposes.
Hugging Face: 05/23/2025 via sarvamai/sarvam-m.
Build.NVIDIA.com: 07/25/2025 via link.
Architecture Type: Hybrid-reasoning Transformer
Network Architecture: Mistral-Small
This model was developed based on Mistral-Small-3.1-24B-Base-2503; see sarvamai/sarvam-m.
Number of model parameters: 23.6B
Input Type(s): Text
Input Format(s): Strings
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Supports the full 32,768-token context window; the max_new_tokens=8192 shown in the Quickstart bounds generated output, not the input context.
Output Type(s): Text
Output Format(s): Strings
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Context length of 32,768 tokens, with sliding-window attention over 4,096 tokens.
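As a quick sanity check, these window sizes can usually be read back from the checkpoint config. A minimal sketch follows; the field names max_position_embeddings and sliding_window are assumptions about the Mistral-style config and may be absent on some revisions, so they are read defensively:

# Sketch: inspect the context-window settings from the published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sarvamai/sarvam-m")
print("max context:", getattr(config, "max_position_embeddings", None))  # expected: 32768
print("sliding window:", getattr(config, "sliding_window", None))        # expected: 4096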
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine: vLLM
Supported Operating System(s): Linux
Model Version(s): sarvam-m-v1.0
The following code snippet demonstrates how to use sarvam-m with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sarvamai/sarvam-m"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# prepare the model input
prompt = "Who are you and what is your purpose on this planet?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant prefix so the model generates a reply
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs, max_new_tokens=8192)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n").rstrip("</s>")
else:
    reasoning_content = ""
    content = output_text.rstrip("</s>")

print("reasoning content:", reasoning_content)
print("content:", content)
For thinking mode, we recommend temperature=0.5; for no-think mode, temperature=0.2.
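As a minimal sketch of applying these recommendations (reusing the model and model_inputs objects from the Quickstart above; do_sample=True is an assumption needed for temperature to take effect):

# Sketch: pass the recommended sampling temperature to generate().
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=True,   # sampling must be enabled for temperature to apply
    temperature=0.5,  # recommended for thinking mode; use 0.2 in no-think mode
)

Sarvam also exposes the model through an OpenAI-compatible API; the following example runs a two-turn conversation with thinking mode enabled.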
from openai import OpenAI

base_url = "https://api.sarvam.ai/v1"
model_name = "sarvam-m"
api_key = "Your-API-Key"  # get it from https://dashboard.sarvam.ai/

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
).with_options(max_retries=1)

messages = [
    {"role": "system", "content": "You're a helpful AI assistant"},
    {"role": "user", "content": "Explain quantum computing in simple terms"},
]

response1 = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort="medium",  # Enables thinking mode. Pass `None` to disable.
    max_completion_tokens=4096,
)
print("First response:", response1.choices[0].message.content)

# Build messages for the second turn (using the previous response as context)
messages.extend(
    [
        {
            "role": "assistant",
            "content": response1.choices[0].message.content,
        },
        {"role": "user", "content": "Can you give an analogy for superposition?"},
    ]
)

response2 = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort="medium",
    max_completion_tokens=8192,
)
print("Follow-up response:", response2.choices[0].message.content)
Refer to the Sarvam Chat Completions API docs.
reasoning_effort accepts three values (low, medium, and high), consistent with the OpenAI API spec. Setting any of the three simply enables sarvam-m's thinking mode.
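Conversely, a minimal sketch of a non-thinking request (reusing client, model_name, and messages from the example above; per the comment there, passing None disables thinking):

# Sketch: disable thinking mode by passing None for reasoning_effort.
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort=None,  # None disables thinking mode
    max_completion_tokens=1024,
)
print(response.choices[0].message.content)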
For easy deployment, use vllm>=0.8.5 and create an OpenAI-compatible API endpoint with vllm serve sarvamai/sarvam-m. To use vLLM from Python, you can do the following.
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [{"role": "user", "content": "Why is 42 the best number?"}]

# By default, thinking mode is enabled.
# If you want to disable thinking, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)
output_text = response.choices[0].message.content

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n")
else:
    reasoning_content = ""
    content = output_text

print("reasoning content:", reasoning_content)
print("content:", content)

# For the next round, add the model's response directly as an assistant turn.
messages.append({"role": "assistant", "content": output_text})
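Building on the comment in the example above, here is a minimal sketch of a request with thinking disabled (reusing client and model from that example; the chat_template_kwargs passthrough is the mechanism the comment names):

# Sketch: disable thinking mode via the chat-template passthrough noted above.
response_no_think = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Why is 42 the best number?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response_no_think.choices[0].message.content)  # no </think> block expected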
The testing datasets comprise a wide range of tasks and sizes:
Link: Undisclosed
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by dataset: Hybrid: Human, Automated
Properties:
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties:
IFEval: Contains over 500 prompts designed to test a model's ability to adhere to complex instructions.
GSM8K: A dataset of 8,500 high-quality, linguistically diverse grade-school math word problems requiring multi-step reasoning.
MATH: A challenging dataset of 12,500 problems from mathematics competitions.
Big Bench Hard (BBH): A collection of 23 difficult tasks that are beyond the capabilities of most current language models.
MMLU: A massive multitask benchmark consisting of multiple-choice questions across 57 subjects, designed to test a model's general knowledge and problem-solving abilities.
TruthfulQA: Comprises 817 questions across 38 categories, designed to measure a model's tendency to produce truthful answers.
Indic Language Benchmarks: The model was also evaluated on its performance in 11 Indian languages, showing significant improvement over baseline models on various language tasks.
Acceleration Engine: vLLM
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.