Lightweight reasoning model for applications in latency bound, memory/compute constrained environments

Phi-4-mini-flash-reasoning is a lightweight, 3.8B-parameter open model built on synthetic, reasoning-dense data and fine-tuned specifically for advanced mathematical reasoning. It belongs to the Phi-4 family and supports a 64K-token context length.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. It has been produced to a third-party’s requirements for this application and use-case. See the external card: Phi-4-mini-flash-reasoning Model Card.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: MIT.
Global
Primary Use Cases
Use‑Case Considerations
This model is designed and evaluated for math reasoning only. Developers should carefully evaluate accuracy, safety, fairness, multilingual performance, and legal compliance before use—especially in high‑risk domains.
Build.NVIDIA.com: 07/18/2025 (link)
Hugging Face: 07/09/2025 (link)
Input Type(s): Text
Input Formats: String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Context length of up to 64K tokens. The model is best suited for prompts using the chat format.
Output Type(s): Text
Output Formats: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Generates step-by-step math reasoning of up to 32K tokens.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Note that by default, the Phi-4-mini-flash-reasoning model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
Supported Operating System(s): Linux
phi-4-mini-flash-reasoning v1.0
Data Collection Method by dataset: Hybrid: Synthetic, Automated, Human
Labeling Method by dataset: Hybrid: Synthetic, Automated
Properties:
Additional Training Details:
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties: Undisclosed
Benchmark Score (Pass@1): AIME 2024: 52.29%; AIME 2025: 33.59%; Math-500: 92.45%; GPQA-Diamond: 45.08%
Data Collection Method by dataset: Public benchmark datasets
Labeling Method by dataset: Benchmark gold answers
Properties: Math-500, AIME 2024, AIME 2025, GPQA-Diamond
To understand its capabilities, the 3.8B-parameter Phi-4-mini-flash-reasoning model was compared with a set of models over a variety of reasoning benchmarks. We use a more accurate evaluation in which Pass@1 accuracy is averaged over 64 samples for AIME 2024/2025 and over 8 samples for Math-500 and GPQA Diamond. A high-level overview of the model quality is as follows:
| Model | AIME 2024 | AIME 2025 | Math‑500 | GPQA Diamond |
|---|---|---|---|---|
| DeepSeek‑R1‑Distill‑Qwen‑1.5B | 29.58 | 20.78 | 84.50 | 37.69 |
| DeepSeek‑R1‑Distill‑Qwen‑7B | 53.70 | 35.94 | 93.03 | 47.85 |
| DeepSeek‑R1‑Distill‑Llama‑8B | 43.96 | 27.34 | 87.48 | 45.83 |
| Bespoke‑Stratos‑7B | 21.51 | 18.28 | 80.73 | 38.51 |
| OpenThinker‑7B | 29.69 | 24.32 | 87.25 | 41.60 |
| Phi‑4‑mini‑Reasoning (3.8 B) | 48.13 | 31.77 | 91.20 | 44.51 |
| Phi‑4‑mini‑Flash‑Reasoning (3.8 B) | 52.29 | 33.59 | 92.45 | 45.08 |
Overall, with only 3.8B parameters, the model achieves a level of math and science reasoning comparable to much larger models. However, it is still fundamentally limited by its size for certain tasks: the model simply does not have the capacity to store extensive factual knowledge, so users may encounter factual inaccuracies. This weakness may be mitigated by augmenting Phi-4-mini-flash-reasoning with a search engine, particularly when using the model in a RAG setting, as loosely illustrated below.
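As a loose illustration of the RAG setting mentioned above, retrieved passages can simply be prepended to the user message before applying the chat template. This is only a sketch; retrieve_passages is a hypothetical stand-in for whatever search or retrieval component is used.

# Sketch of RAG-style prompting: prepend retrieved context to the user question.
# `retrieve_passages` is a hypothetical retrieval function, not part of the model.
def build_rag_messages(question, retrieve_passages, top_k=3):
    passages = retrieve_passages(question, top_k=top_k)
    context = "\n\n".join(passages)
    return [{
        "role": "user",
        "content": f"Use the following context if it is relevant:\n\n{context}\n\nQuestion: {question}",
    }]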
We include a brief word on methodology here, and in particular how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks, to ensure an apples-to-apples comparison across models. Indeed, this is our default approach, and it has been the case for the vast majority of models we have run to date. For all benchmarks, we use the same generation configuration, such as the same maximum sequence length (32,768) and the same temperature, to ensure a fair comparison.
Benchmark datasets
We evaluate the model with three of the most popular math benchmarks on which the strongest reasoning models compete: Math-500, AIME 2024, and AIME 2025, with GPQA Diamond additionally reported for science reasoning.
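As a concrete illustration of the averaged Pass@1 metric used above, the following sketch draws k samples per problem and averages per-sample correctness over the dataset; generate_solution and is_correct are hypothetical stand-ins for the actual sampling and answer-checking code.

# Sketch of averaged Pass@1: sample k completions per problem and average
# per-sample correctness; the dataset score is the mean over problems.
def averaged_pass_at_1(problems, k, generate_solution, is_correct):
    scores = []
    for problem in problems:
        correct = sum(
            is_correct(generate_solution(problem["question"]), problem["answer"])
            for _ in range(k)
        )
        scores.append(correct / k)
    return 100.0 * sum(scores) / len(scores)

# Example: k=64 for AIME 2024/2025, k=8 for Math-500 and GPQA Diamond.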
Acceleration Engine: vLLM
Test Hardware: NVIDIA Ada Lovelace
This compact 3.8B model achieves near-linear latency growth with token count (2–3× lower latency and up to 10× higher throughput than Phi-4-mini-reasoning on a single A100-80GB).
The two figures below compare the latency and throughput performance of the Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning models under the vLLM inference framework. All evaluations were performed on a single NVIDIA A100-80GB GPU with tensor parallelism disabled (TP = 1). The Phi-4-mini-flash-reasoning model, which incorporates a decoder-hybrid-decoder architecture combining attention and state space model (SSM) layers, exhibits significantly greater computational efficiency, achieving a 2–3× reduction in average latency and up to a 10× improvement in throughput when processing user requests with a 2K prompt length and 32K generation length. Furthermore, Phi-4-mini-flash-reasoning demonstrates near-linear growth in latency with respect to the number of generated tokens (up to 32K), in contrast to the quadratic growth observed in Phi-4-mini-reasoning. These findings indicate that Phi-4-mini-flash-reasoning is more scalable and better suited for long-sequence generation tasks.
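For reference, a minimal offline-inference sketch with vLLM on a single GPU (tensor_parallel_size=1) is shown below. It uses the chat-format prompt described later in this card and is illustrative only, not the exact script used for the latency and throughput measurements.

# Minimal vLLM offline-inference sketch (single GPU, TP=1); illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-4-mini-flash-reasoning",
    trust_remote_code=True,
    tensor_parallel_size=1,
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)
prompt = "<|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)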
Phi-4-mini-flash-reasoning supports a vocabulary size of up to 200,064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, and they can also be extended up to the model's vocabulary size.
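As a hedged sketch of extending the vocabulary for downstream fine-tuning, the snippet below adds new special tokens and keeps the embedding table in sync; the token strings are hypothetical examples and should be adapted to the placeholder conventions in the tokenizer files.

# Illustrative sketch: register custom tokens for fine-tuning and resize embeddings.
# The token strings below are hypothetical examples, not reserved names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-flash-reasoning"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|my_tool_call|>", "<|my_tool_result|>"]}
)
if num_added > 0:
    # Keep the embedding matrix consistent with the extended tokenizer.
    model.resize_token_embeddings(len(tokenizer))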
Given the nature of the training data, the Phi-4-mini-flash-reasoning model is best suited for prompts using specific formats. Below are the two primary formats:
This format is used for general conversation and instructions:
<|system|>Your name is Phi, an AI math expert developed by Microsoft.<|end|><|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>
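The same prompt string can be produced programmatically; a minimal sketch using the tokenizer's chat template (assuming the bundled template emits the <|system|>/<|user|>/<|assistant|> markers shown above) is:

# Sketch: render the chat-format prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-flash-reasoning")
messages = [
    {"role": "system", "content": "Your name is Phi, an AI math expert developed by Microsoft."},
    {"role": "user", "content": "How to solve 3*x^2+4*x+5=1?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # Expected to match the format shown above.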
Phi-4-mini-flash-reasoning has been integrated into transformers version 4.51.3. The currently installed transformers version can be verified with pip list | grep transformers.
Python 3.8 and 3.10 will work best.
List of required packages (an example install command follows the list):
flash_attn==2.7.4.post1
torch==2.6.0
mamba-ssm==2.2.4
causal-conv1d==1.5.0.post8
transformers==4.46.1
accelerate==1.4.0
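A possible install sequence for the pinned packages above is sketched below; it assumes a CUDA-enabled environment, since flash_attn, mamba-ssm, and causal-conv1d build against the local CUDA toolkit.

# Example install of the pinned requirements (assumes a CUDA toolchain is available).
pip install torch==2.6.0 accelerate==1.4.0 transformers==4.46.1
pip install flash_attn==2.7.4.post1 mamba-ssm==2.2.4 causal-conv1d==1.5.0.post8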
Phi-4-mini-flash-reasoning is also available in Azure AI Studio
After obtaining the Phi-4-mini-flash-reasoning model checkpoints, users can use this sample code for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the random seed for reproducible sampling.
torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-flash-reasoning"

# Load the checkpoint onto the GPU; torch_dtype="auto" keeps the dtype stored in the
# checkpoint, and trust_remote_code allows loading the model's custom architecture code.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the question in the chat format expected by the model.
messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?",
}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Sample a step-by-step solution (up to 32K new tokens).
outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens (strip the prompt).
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
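For interactive use, the same call can stream tokens as they are generated; a small variation using transformers' TextStreamer (reusing the model, tokenizer, and inputs created above) is sketched below.

# Optional: stream the reasoning trace token by token instead of decoding at the end.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
    streamer=streamer,
)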
The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (supervised fine-tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback), utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as various questions and answers targeting multiple safety categories.
Phi-4-Mini-Flash-Reasoning was developed in accordance with Microsoft's responsible AI principles. Potential safety risks in the model’s responses were assessed using the Azure AI Foundry’s Risk and Safety Evaluation framework, focusing on harmful content, direct jailbreak, and model groundedness. The Phi-4-Mini-Flash-Reasoning Model Card contains additional information about our approach to safety and responsible AI considerations that developers should be aware of when using this model.
Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi-4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and to leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.