Cutting-edge open multimodal model excelling in high-quality reasoning from image and audio inputs.
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. The model underwent an enhancement process incorporating both supervised fine-tuning and direct preference optimization to support precise instruction adherence and safety measures.
This model is ready for commercial/non-commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA Phi-4-Multimodal-Instruct.
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: MIT License.
Global
February 2025
Phi-4-Multimodal-Instruct Model Card
The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require memory/compute-constrained environments, latency-bound scenarios, strong reasoning, function and tool calling, general image understanding, optical character recognition, and chart and table understanding.
The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models and multimodal models, as well as performance differences across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could use a speech recognition model to talk to the Mini and Vision models. To achieve this, users needed a pipeline of two models: one model to transcribe the audio to text, and another model for the language or vision tasks. With this pipeline, the core model was not provided the full breadth of input information – for example, it could not directly observe multiple speakers or background noise, or jointly align speech, vision, and language information in the same representation space.
With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model employs a new architecture, a larger vocabulary for efficiency, multilingual and multimodal support, and improved post-training techniques for instruction following and function calling, along with additional data, leading to substantial gains on key multimodal capabilities.
It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4-multimodal-instruct is welcomed and crucial to the model’s evolution and improvement. Thank you for being part of this journey!
Architecture Type: Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model uses the pretrained Phi-4-Mini as the backbone language model, together with advanced vision and speech encoders and adapters.
Input Type(s): Text, Image, Audio
Input Format(s): String, [.png, .jpg, .jpeg], [.mp3, .wav]
Input Parameters: [1D, 2D]
Other Properties Related to Input: Languages in training data: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Note that these are for TEXT only. There is limited language support for IMAGE and AUDIO modalities.
Output Type(s): Text
Output Format(s): String
Output Parameters: 1D
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
Phi-4-multimodal-instruct v1.0
Data Collection Methods: [Hybrid: Automated, Human, Synthetic]
GPUs: 512 A100-80G
Training Time: 28 days
Training Data: 5T text tokens, 2.3M speech hours, and 1.1T image-text tokens
Training Dates: Trained between December 2024 and January 2025
Status: This is a static model trained on offline datasets with the cutoff date of June 2024 for publicly available data.
Languages in training data: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Note that these are for TEXT only. There is limited language support for IMAGE and AUDIO modalities.
Vision: English
Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
Phi-4-multimodal-instruct’s training data includes a wide variety of sources, totaling 5 trillion text tokens, and is a combination of filtered publicly available documents and images, synthetic data, and human-labeled data.
Focus was placed on the quality of data that could potentially improve the model’s reasoning ability, and the publicly available documents were filtered to contain a preferred level of knowledge. For example, the result of a game in the Premier League on a particular day might be good training data for large foundation models, but such information was removed for Phi-4-multimodal-instruct to leave more model capacity for reasoning, given the model’s small size. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.
The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
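As an illustrative sketch only (the actual normalization, tokenizer, n-gram size, and matching threshold are not specified here, so the values below are assumptions), the decontamination check can be approximated as follows:

# Hypothetical sketch of the n-gram decontamination check. The whitespace
# tokenization, n-gram size (13), and 0.8 threshold are illustrative
# assumptions, not the values used for Phi-4-multimodal-instruct.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(samples, benchmark_texts, n=13, threshold=0.8):
    # Build the benchmark n-gram index once.
    benchmark_ngrams = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text.lower().split(), n)

    kept, flagged = [], []
    for sample in samples:
        sample_ngrams = ngrams(sample.lower().split(), n)
        matched = len(sample_ngrams & benchmark_ngrams)
        ratio = matched / len(sample_ngrams) if sample_ngrams else 0.0
        if ratio >= threshold:
            flagged.append((sample, ratio))  # contaminated: record for the report
        else:
            kept.append(sample)
    return kept, flagged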
The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.
Various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets were leveraged to evaluate Phi-4 models’ propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. For this release, the red teaming effort focused on the newest Audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to providing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied to all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low resource languages, and risk areas that account for cultural nuances where those languages are spoken.
To assess model safety in scenarios involving both text and images, Microsoft’s Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may be susceptible to language-specific attack prompts and cultural context.
In addition to extensive red teaming, the safety of the model was assessed through three distinct evaluations. First, as performed with text and vision inputs, Microsoft’s Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model’s responses to speech prompts. Second, Microsoft’s Speech Fairness evaluation was run to verify that speech-to-text transcription worked well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, medical condition, etc.) from the voice of a user.
To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform. Users can refer to the Phi-4-Mini model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:
Phi-4-multimodal-instruct demonstrated strong performance in speech tasks.
Phi-4-multimodal-instruct can process image and audio together. The table below shows the model quality when the query about the vision content is provided as synthetic speech, on chart/table understanding and document reasoning tasks. Compared to other state-of-the-art omni models, Phi-4-multimodal-instruct achieves stronger performance on multiple benchmarks.
Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
---|---|---|---|---|---|
s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |
To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. Below is a high-level overview of the model quality on representative benchmarks:
Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | Intern VL 2.5-4B | Qwen 2.5-VL-7B-ins | Intern VL 2.5-8B | Gemini 2.0-Flash Lite-prv-0205 | Gemini2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|---|
Popular aggregated benchmark | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
MMMU-Pro (std / vision) | 38.5 | 21.8 | 29.9 | 32.4 | 38.7 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
Video MME (16 frames) | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.3 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |
Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much bigger models on multi-frame capabilities. BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
---|---|---|---|---|---|---|---|---|---|
Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
Overall | 61.3 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
Given the nature of the training data, the Phi-4-multimodal-instruct model is best suited for prompts using the chat formats as follows:
Text Chat Format
This format is used for general conversation and instructions:
<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
Tool-enabled Function Call Format for Text
This format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by <|tool|> and <|/tool|> tokens. The tools should be specified in JSON format, using a JSON dump structure. For example:
<|system|>You are a helpful assistant with some tools.<|tool|> [{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}] <|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
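For illustration, a minimal Python sketch of assembling this prompt (the tool definition is the example above; json.dumps produces the required JSON dump structure):

import json

# Minimal sketch: wrap the JSON-dumped tool list in <|tool|> ... <|/tool|> tokens.
tools = [{
    "name": "get_weather_updates",
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.",
            "type": "str",
            "default": "London",
        }
    },
}]
prompt = (
    f"<|system|>You are a helpful assistant with some tools."
    f"<|tool|>{json.dumps(tools)}<|/tool|><|end|>"
    f"<|user|>What is the weather like in Paris today?<|end|><|assistant|>"
)
print(prompt)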
Vision-Language Format
This format is used for conversation with image:
<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>
For multiple images, the user needs to insert multiple image placeholders in the prompt as below:
<|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>
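A small sketch of generating these numbered placeholders programmatically (the helper name is hypothetical):

# Hypothetical helper: one numbered <|image_i|> placeholder per image, 1-indexed.
def build_multi_image_prompt(num_images, instruction):
    placeholders = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>{placeholders}{instruction}<|end|><|assistant|>"

print(build_multi_image_prompt(3, "Summarize the content of the images."))
# -> <|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>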
Speech-Language Format
This format is used for various speech and audio tasks:
<|user|><|audio_1|>{task prompt}<|end|><|assistant|>
The task prompt can vary for different tasks. For example:
<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>
<|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>
<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
<|user|><|audio_1|><|end|><|assistant|>
Vision-Speech Format
This format is used for conversation with image and audio. The audio may contain a query related to the image:
<|user|><|image_1|><|audio_1|><|end|><|assistant|>
For multiple images, the user needs to insert multiple image placeholders in the prompt as below:
<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
After obtaining the Phi-4-Multimodal-Instruct model checkpoints, users can use the following sample code for inference.
import io
import requests
import soundfile
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "microsoft/Phi-4-multimodal-instruct"

# Load the processor and model (remote code is required for this architecture).
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
)
generation_config = GenerationConfig.from_pretrained(model_path)

user_prompt = "<|user|>"
assistant_prompt = "<|assistant|>"
prompt_suffix = "<|end|>"

# Vision example: ask a question about an image fetched from a URL.
prompt = f"{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f">>> Response\n{response}")

# Speech example: transcribe and translate an audio clip downloaded from a URL.
speech_prompt = (
    "Transcribe the audio to text, and then translate the audio to French. "
    "Use <sep> as a separator between the original transcript and the translation."
)
prompt = f"{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
audio_url = "https://voiceage.com/wbsamples/in_mono/Trailer.wav"
audio = soundfile.read(io.BytesIO(requests.get(audio_url).content))
inputs = processor(text=prompt, audios=[audio], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f">>> Response\n{response}")
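For the vision-speech format described earlier, a hedged sketch (reusing the processor, model, prompt tokens, and generation_config from the sample above; the local file paths are placeholders, and passing images and audios in a single processor call is assumed to be supported) could look like:

# Hedged sketch: combined vision-speech input, reusing objects from the sample above.
# "chart.png" and "question.wav" are placeholder paths.
image = Image.open("chart.png")
audio = soundfile.read("question.wav")  # returns (samples, sampling_rate)
prompt = f"{user_prompt}<|image_1|><|audio_1|>{prompt_suffix}{assistant_prompt}"
inputs = processor(text=prompt, images=image, audios=[audio], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs, max_new_tokens=1000, generation_config=generation_config)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])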
Engine: vLLM
Test Hardware: NVIDIA H100
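As a hedged serving sketch (assuming the installed vLLM version supports this model's multimodal inputs), the model can be exposed through vLLM's OpenAI-compatible server, started for example with the command vllm serve microsoft/Phi-4-multimodal-instruct --trust-remote-code, and then queried with the standard OpenAI client; the endpoint URL and image URL below are illustrative:

# Hedged sketch: query a vLLM OpenAI-compatible endpoint serving this model.
# The base_url and image URL are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="microsoft/Phi-4-multimodal-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in detail."},
            {"type": "image_url", "image_url": {"url": "https://www.ilankelman.org/stopsigns/australia.jpg"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)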
Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Limiting behaviors to be aware of include performance differences across languages and modalities and the potential to produce inaccurate or otherwise undesirable outputs.
Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. The Phi-4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and to leverage them as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include accuracy, safety, and fairness in the specific downstream use case, particularly for high-risk scenarios.
Ethical considerations and guidelines. NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.