Powerful, multimodal language model designed for enterprise applications, including software development, data analysis, and reasoning.
Mistral Medium 3 is a frontier-class dense language model optimized for enterprise use. It delivers state-of-the-art performance at significantly lower cost—up to 8× cheaper than leading alternatives—while maintaining high usability, adaptability, and deployability in enterprise environments. Designed to excel in professional workloads like coding, STEM reasoning, and multimodal understanding, it supports hybrid and self-hosted deployment, full model customization, and seamless integration into enterprise systems.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA Mistral Medium 3 Model Card.
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. To deploy and customize the model in your environment, please contact Mistral.
Deployment Geography: Global
Use Case: Enterprise and research users leveraging high-performance LLMs for reasoning, multilingual understanding, and coding tasks.
Reference(s): https://mistral.ai/news/mistral-medium-3
Architecture Type: Transformer-based dense decoder-only autoregressive LLM
This model was developed based on: Proprietary Mistral architecture
Input Type(s): Text
Input Format(s): String
Input Parameters: 2D token sequences
Other Properties Related to Input: mistral-tokenizer
Output Type(s): Text
Output Format: String
Output Parameters: 2D token sequences
Other Properties Related to Output:
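Since the interface above is plain text in, text out, the model can be driven from any OpenAI-compatible client. Below is a minimal sketch; the endpoint URL, environment variable, and model identifier are assumptions to adapt to your own deployment (NVIDIA API trial, Mistral's platform, or self-hosted):

```python
# Minimal sketch: querying Mistral Medium 3 through an OpenAI-compatible
# endpoint with the `openai` Python client. The base URL, API key variable,
# and model identifier are assumptions -- substitute your deployment's values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed env variable
)

# Text in, text out: tokenization with mistral-tokenizer happens server-side;
# the client only sends and receives strings.
response = client.chat.completions.create(
    model="mistralai/mistral-medium-3-instruct",  # assumed model id
    messages=[
        {"role": "user",
         "content": "Write a Python function that merges two sorted lists."}
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```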
[Preferred/Supported] Operating System(s):
Model Version(s): Mistral Medium 3 (2505)
Training Dataset:
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties: Undisclosed
Evaluation Dataset:
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties: Undisclosed
Mistral Medium 3 is designed to be frontier-class, particularly in categories of professional use. In the evaluations below, we use numbers previously reported by other providers wherever available; otherwise we use our own evaluation harness. Mistral Medium 3 particularly stands out in coding and STEM tasks, where it comes close to its very large and much slower competitors.
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
CODING | ||||||
HumanEval 0-shot | 92.1% | 85.4% | 91.5% | 92.1% | 82.9% | 93.3% |
LiveCodeBench (v6) 0-shot | 30.3% | 28.7% | 31.4% | 36.0% | 26.3% | 42.9% |
MultiPL-E average 0-shot | 81.4% | 76.4% | 79.8% | 83.4% | 73.1% | 84.9% |
INSTRUCTION FOLLOWING | ||||||
ArenaHard 0-shot | 97.1% | 91.8% | 95.4% | 93.2% | 95.1% | 97.3% |
IFEval 0-shot | 89.4% | 88.9% | 87.2% | 91.8% | 89.7% | 89.1% |
MATH | ||||||
Math500 Instruct 0-shot | 91.0% | 90.0% | 76.4% | 83.0% | 82.0% | 93.8% |
KNOWLEDGE | ||||||
GPQA Diamond 0-shot CoT | 57.1% | 61.1% | 52.5% | 69.7% | 46.5% | 61.1% |
MMLU Pro 0-shot CoT | 77.2% | 80.4% | 75.8% | 80.0% | 68.9% | 81.1% |
LONG CONTEXT | ||||||
RULER 32K | 96.0% | 94.8% | 96.0% | 95.7% | 95.6% | 95.8% |
RULER 128K | 90.2% | 86.7% | 88.9% | 93.8% | 91.2% | 91.9% |
MULTIMODAL | ||||||
MMMU 0-shot | 66.1% | 71.8% | 66.1% | 71.3% | No multimodal support | No multimodal support |
DocVQA 0-shot | 95.3% | 94.1% | 85.9% | 84.3% | No multimodal support | No multimodal support |
AI2D 0-shot | 93.7% | 84.4% | 93.3% | 78.8% | No multimodal support | No multimodal support |
ChartQA 0-shot | 82.6% | 90.4% | 86.0% | 76.3% | No multimodal support | No multimodal support |
*Performance accuracy on all benchmarks was obtained through the same internal evaluation pipeline.
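For the multimodal benchmarks above, a request pairs an image with a text prompt. A hedged sketch using the same OpenAI-compatible message format as the earlier example (the endpoint and model id are again assumptions):

```python
# Sketch of a multimodal (image + text) request in the OpenAI-compatible
# chat format. Endpoint, env variable, and model id are assumptions.
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed env variable
)

# Encode a local chart image as a data URL (DocVQA/ChartQA-style usage).
with open("quarterly_report.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="mistralai/mistral-medium-3-instruct",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the main trend in this chart."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```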
In addition to academic benchmarks, we report third-party human evaluations that are more representative of real-world use cases. Mistral Medium 3 continues to shine in the coding domain and delivers much better performance across the board than some of its much larger competitors.
Competitor | Mistral Wins (%) | Other Model Wins (%) |
---|---|---|
Claude Sonnet 3.7 | 40.00 | 60.00 |
DeepSeek 3.1 | 37.50 | 62.50 |
GPT-4o | 50.00 | 50.00 |
Command-A | 69.23 | 30.77 |
Llama 4 Maverick | 81.82 | 18.18 |
Domain | Mistral Win Rate (%) | Llama 4 Maverick Win Rate (%) |
---|---|---|
Coding | 81.82 | 18.18 |
Multimodal | 53.85 | 46.15 |
English | 66.67 | 33.33 |
French | 71.43 | 28.57 |
Spanish | 73.33 | 26.67 |
German | 62.50 | 37.50 |
Arabic | 64.71 | 35.29 |
Engine: Compatible with open-source inference engines such as vLLM
Test Hardware:
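For self-hosted deployments, a minimal offline-inference sketch with vLLM follows; the local weight path is a placeholder, assuming weights obtained from Mistral under an enterprise license:

```python
# Minimal sketch of self-hosted inference with vLLM's offline API.
# The weight path below is a placeholder for weights licensed from Mistral.
from vllm import LLM, SamplingParams

llm = LLM(model="/models/mistral-medium-3")  # placeholder weight path
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Explain the difference between a dense and a mixture-of-experts LLM."],
    params,
)
print(outputs[0].outputs[0].text)
```

vLLM also exposes an OpenAI-compatible HTTP server (`vllm serve`), so the client snippets above work unchanged against a self-hosted endpoint.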
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.