
A multilingual, hybrid-reasoning model optimized for Indian language tasks, programming, and mathematical reasoning.
Sarvam-m generates human-like text for a smooth, accessible multilingual chat experience and is intended for general-purpose conversation and text-generation tasks.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to the Non-NVIDIA sarvamai/sarvam-m model card.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. The model is governed by the NVIDIA Community Model License Agreement; ADDITIONAL INFORMATION: Apache License Version 2.0.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Deployment Geography: Global
This model is intended for users seeking a seamless multilingual chatting experience, particularly those interested in Indian languages and cultures. It can be used by developers, researchers, and individuals looking to leverage its advanced reasoning capabilities for coding, math, and general conversation purposes.
Hugging Face: 05/23/2025 via sarvamai/sarvam-m.
Build.NVIDIA.com: 07/25/2025 via link.
Architecture Type: Hybrid-reasoning Transformer
Network Architecture: Mistral-Small
This model was developed based on Mistral-Small-3.1-24B-Base-2503; see sarvamai/sarvam-m.
Number of model parameters: 23.6B
Input Type(s): Text
Input Format(s): Strings
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Supports the full 32,768-token context window; the max_new_tokens=8192 shown in the Quickstart bounds generated output, not the input context.
Output Type(s): Text
Output Format(s): Strings
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Context length of 32,768 tokens, with sliding-window attention over 4,096 tokens.
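As a quick sanity check, these window sizes can usually be read back from the checkpoint config. A minimal sketch follows; the field names max_position_embeddings and sliding_window are assumptions about the Mistral-style config and may be absent on some revisions, so they are read defensively:

# Sketch: inspect the context-window settings from the published config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sarvamai/sarvam-m")
print("max context:", getattr(config, "max_position_embeddings", None))  # expected: 32768
print("sliding window:", getattr(config, "sliding_window", None))        # expected: 4096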
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Runtime Engine: vLLM
Supported Operating System(s): Linux
Model Version(s): sarvam-m-v1.0
The following code snippet demonstrates how to use sarvam-m with Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sarvamai/sarvam-m"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# prepare the model input
prompt = "Who are you and what is your purpose on this planet?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant prefix so the model generates a reply
    enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs, max_new_tokens=8192)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
output_text = tokenizer.decode(output_ids)

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n").rstrip("</s>")
else:
    reasoning_content = ""
    content = output_text.rstrip("</s>")

print("reasoning content:", reasoning_content)
print("content:", content)
For thinking mode, we recommend temperature=0.5; for no-think mode, temperature=0.2.
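As a minimal sketch of applying these recommendations (reusing the model and model_inputs objects from the Quickstart above; do_sample=True is an assumption needed for temperature to take effect):

# Sketch: pass the recommended sampling temperature to generate().
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=True,   # sampling must be enabled for temperature to apply
    temperature=0.5,  # recommended for thinking mode; use 0.2 in no-think mode
)

Sarvam also exposes the model through an OpenAI-compatible API; the following example runs a two-turn conversation with thinking mode enabled.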
from openai import OpenAI

base_url = "https://api.sarvam.ai/v1"
model_name = "sarvam-m"
api_key = "Your-API-Key"  # get it from https://dashboard.sarvam.ai/

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
).with_options(max_retries=1)

messages = [
    {"role": "system", "content": "You're a helpful AI assistant"},
    {"role": "user", "content": "Explain quantum computing in simple terms"},
]

response1 = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort="medium",  # Enables thinking mode. Pass `None` to disable.
    max_completion_tokens=4096,
)
print("First response:", response1.choices[0].message.content)

# Build messages for the second turn (using the previous response as context)
messages.extend(
    [
        {
            "role": "assistant",
            "content": response1.choices[0].message.content,
        },
        {"role": "user", "content": "Can you give an analogy for superposition?"},
    ]
)

response2 = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort="medium",
    max_completion_tokens=8192,
)
print("Follow-up response:", response2.choices[0].message.content)
Refer to the Sarvam Chat Completions API docs.
reasoning_effort accepts three values (low, medium, and high), consistent with the OpenAI API spec. Setting any of the three simply enables sarvam-m's thinking mode.
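Conversely, a minimal sketch of a non-thinking request (reusing client, model_name, and messages from the example above; per the comment there, passing None disables thinking):

# Sketch: disable thinking mode by passing None for reasoning_effort.
response = client.chat.completions.create(
    model=model_name,
    messages=messages,
    reasoning_effort=None,  # None disables thinking mode
    max_completion_tokens=1024,
)
print(response.choices[0].message.content)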
For easy deployment, use vllm>=0.8.5 and create an OpenAI-compatible API endpoint with vllm serve sarvamai/sarvam-m. To use vLLM from Python, you can do the following.
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

messages = [{"role": "user", "content": "Why is 42 the best number?"}]

# By default, thinking mode is enabled.
# If you want to disable thinking, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)
output_text = response.choices[0].message.content

if "</think>" in output_text:
    reasoning_content = output_text.split("</think>")[0].rstrip("\n")
    content = output_text.split("</think>")[-1].lstrip("\n")
else:
    reasoning_content = ""
    content = output_text

print("reasoning content:", reasoning_content)
print("content:", content)

# For the next round, add the model's response directly as an assistant turn.
messages.append({"role": "assistant", "content": output_text})
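Building on the comment in the example above, here is a minimal sketch of a request with thinking disabled (reusing client and model from that example; the chat_template_kwargs passthrough is the mechanism the comment names):

# Sketch: disable thinking mode via the chat-template passthrough noted above.
response_no_think = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Why is 42 the best number?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response_no_think.choices[0].message.content)  # no </think> block expected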
The testing datasets comprise a wide range of tasks and sizes:
Link: Undisclosed
Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by dataset: Hybrid: Human, Automated
Properties:
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties:
IFEval: Contains over 500 prompts designed to test a model's ability to adhere to complex instructions.
GSM8K: A dataset of 8,500 high-quality, linguistically diverse grade-school math word problems requiring multi-step reasoning.
MATH: A challenging dataset of 12,500 problems from mathematics competitions.
Big Bench Hard (BBH): A collection of 23 difficult tasks that are beyond the capabilities of most current language models.
MMLU: A massive multitask benchmark consisting of multiple-choice questions across 57 subjects, designed to test a model's general knowledge and problem-solving abilities.
TruthfulQA: Comprises 817 questions across 38 categories, designed to measure a model's tendency to produce truthful answers.
Indic Language Benchmarks: The model was also evaluated on its performance in 11 Indian languages, showing significant improvement over baseline models on various language tasks.
Acceleration Engine: vLLM
Test Hardware:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.