Cutting-edge open multimodal model excelling in high-quality reasoning from images.
Description | Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. This model is ready for commercial and research use. |
Release date | August 20, 2024 |
License | MIT |
Architecture | Phi-3.5-vision has 4.2B parameters and contains an image encoder, connector, projector, and the Phi-3 Mini language model.
Inputs | Text and Image. It’s best suited for prompts using the chat format. |
Context length | 128K tokens |
Outputs | Generated text (string) in response to the input |
Status | This is a static model trained on an offline dataset with a cutoff date of March 15, 2024. Future versions of the tuned models may be released as we improve the models.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case.
🏡 Phi-3 Portal
📰 Phi-3 Microsoft Blog
📖 Phi-3 Technical Report
🛠️ Phi-3 on Azure AI Studio
👩‍🍳 Phi-3 Cookbook
In this release, the model enables multi-frame image understanding and reasoning, based on valuable customer feedback. The hero multi-frame capabilities include detailed image comparison, multi-image summarization/storytelling, and video summarization, which have broad applications in many scenarios. We also observed performance improvements on most single-image benchmarks, e.g., boosting MMMU performance from 40.2 to 43.0, MMBench performance from 80.5 to 81.9, and the document understanding benchmark TextVQA from 70.9 to 72.0. We believe most use cases will benefit from this release, but we encourage users to test the new model in their AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.
The model is intended for broad commercial and research use in English. The model provides uses for general purpose AI systems and applications with visual and text input capabilities that require:
1. Memory/compute constrained environments
2. Latency bound scenarios
3. General image understanding
4. Optical character recognition
5. Chart and table understanding
6. Multiple image comparison
7. Multi-image or video clip summarization
Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI-powered features.
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
Given the nature of the training data, the Phi-3.5-vision model is best suited for prompts using the chat format as follows:
Single image:
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
Multi-turn conversations:
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
For multi-image usage, add multiple image placeholders at the front of the prompt. The <|image_{}|> index should start from 1. An example prompt is shown below:
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
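The prompt string does not need to be assembled entirely by hand. Below is a minimal sketch, assuming the same transformers AutoProcessor used in the inference sample later in this card; apply_chat_template adds the <|user|>, <|end|>, and <|assistant|> markers, so only the image placeholders are concatenated manually. The image count and question are illustrative placeholders.

```python
# Minimal sketch (assumption: the same transformers AutoProcessor as in the
# inference sample below). apply_chat_template wraps the user content in
# <|user|> ... <|end|>\n<|assistant|>\n, so only the <|image_i|> placeholders
# need to be added manually.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True
)

num_images = 4  # hypothetical number of images, for illustration only
placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
messages = [{"role": "user", "content": placeholders + "Compare these images in detail."}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # <|user|>\n<|image_1|>\n ... <|image_4|>\nCompare these images in detail.<|end|>\n<|assistant|>\n
```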
After obtaining the Phi-3.5-vision-instruct model checkpoints, users can use the following sample code for inference.
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Note: use _attn_implementation='eager' if flash_attn is not installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation='flash_attention_2',
)

# num_crops=4 is recommended for multi-frame inputs; use num_crops=16 for single-frame tasks.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Download 19 slide images and build the matching <|image_i|> placeholders.
images = []
placeholder = ""
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args,
)

# Remove the input tokens so only the newly generated text is decoded.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```
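For single-image or multi-turn use, the already-loaded model and processor can be reused. The snippet below is an illustrative sketch rather than part of the official sample; the image URL and questions are hypothetical placeholders, and it simply follows the single-image and multi-turn chat formats shown earlier.

```python
# Hedged sketch reusing `model` and `processor` from the sample code above.
# The image URL and questions are hypothetical placeholders.
from PIL import Image
import requests

url = "https://example.com/sample_chart.png"  # hypothetical image URL
image = Image.open(requests.get(url, stream=True).raw)

# First turn: a single image, so only <|image_1|> is used.
messages = [{"role": "user", "content": "<|image_1|>\nWhat trend does this chart show?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=500,
    do_sample=False,
)
answer = processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]

# Second turn: append the assistant response and ask a follow-up question,
# matching the multi-turn chat format shown above.
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Summarize that trend in one sentence."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    max_new_tokens=500,
    do_sample=False,
)
print(processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```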
Dates | Trained between July and August 2024 |
Training time | 6 days |
Training data | 500B tokens (vision tokens + text tokens) |
Our training data includes a wide variety of sources, and is a combination of 1) publicly available documents filtered rigorously for quality, selected high-quality educational data, and code; 2) selected high-quality image-text interleave data; 3) newly created synthetic, "textbook-like" data for the purpose of teaching math, coding, common sense reasoning, and general knowledge of the world, along with newly created image data (e.g., charts, tables, diagrams, slides) and newly created multi-image and video data (e.g., short video clips, pairs of two similar images); and 4) high-quality chat format supervised data covering various topics to reflect human preferences on aspects such as instruction-following, truthfulness, honesty, and helpfulness.
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
Below are the comparison results on existing multi-image benchmarks. On average, our model outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities and video summarization.
BLINK: a benchmark with 14 visual tasks that humans can solve very quickly but that remain hard for current multimodal LLMs.
Benchmark | Phi-3.5-vision-instruct | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
---|---|---|---|---|---|---|---|---|---|
Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
Forensic Detection | 92.4 | 31.1 | 40.9 | 34.1 | 54.5 | 38.6 | 67.4 | 60.6 | 75.8 |
Functional Correspondence | 29.2 | 34.6 | 24.6 | 24.6 | 33.1 | 26.9 | 33.8 | 31.5 | 43.8 |
IQ Test | 25.3 | 26.7 | 26.0 | 30.7 | 25.3 | 29.3 | 26.0 | 34.0 | 19.3 |
Jigsaw | 68.0 | 86.0 | 55.3 | 52.7 | 71.3 | 72.7 | 57.3 | 68.0 | 67.3 |
Multi-View Reasoning | 54.1 | 44.4 | 48.9 | 42.9 | 48.9 | 48.1 | 55.6 | 49.6 | 46.6 |
Object Localization | 49.2 | 54.9 | 53.3 | 54.1 | 57.3 | 57.4 | 62.3 | 65.6 | 68.0 |
Relative Depth | 69.4 | 77.4 | 63.7 | 67.7 | 32.8 | 58.1 | 71.8 | 76.6 | 71.0 |
Relative Reflectance | 37.3 | 34.3 | 32.8 | 38.8 | 32.8 | 27.6 | 36.6 | 38.8 | 40.3 |
Semantic Correspondence | 36.7 | 31.7 | 31.7 | 22.3 | 32.4 | 31.7 | 45.3 | 48.9 | 54.0 |
Spatial Relation | 65.7 | 75.5 | 78.3 | 78.3 | 55.9 | 81.1 | 60.1 | 79.0 | 84.6 |
Visual Correspondence | 53.5 | 40.7 | 34.9 | 33.1 | 29.7 | 52.9 | 72.1 | 81.4 | 86.0 |
Visual Similarity | 83.0 | 91.9 | 48.1 | 45.2 | 47.4 | 77.8 | 84.4 | 81.5 | 88.1 |
Overall | 57.0 | 53.1 | 45.9 | 45.4 | 45.1 | 51.9 | 56.5 | 61.0 | 63.2 |
Video-MME: a benchmark that comprehensively assesses the capabilities of MLLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
Benchmark | Phi-3.5-vision-instruct | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
---|---|---|---|---|---|---|---|---|---|
short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |
long (30-60min) | 43.8 | 41.2 | 42.6 | 46.6 | 52.1 | 53.9 | 46.6 | 53.2 | 59.6 |
Overall | 50.8 | 50.2 | 49.9 | 52.6 | 62.3 | 61.2 | 55.9 | 62.6 | 68.4 |
Engine: TensorRT
Test Hardware:
Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
- Quality of Service: the Phi models are trained primarily on English text, so languages other than English may experience worse performance.
- Representation of Harms and Perpetuation of Stereotypes: the model can over- or under-represent groups of people, or reinforce demeaning or negative stereotypes.
- Inappropriate or Offensive Content: the model may produce inappropriate or offensive content without additional mitigations specific to the use case.
- Information Reliability: the model can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Limited Scope for Code: the majority of the training code data is based in Python and uses common packages; generated scripts in other languages or using other packages should be manually verified.
Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g., privacy, trade, etc.). Important areas for consideration include:
- Allocation: the model may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities without further assessments and additional debiasing techniques.
- High-Risk Scenarios: developers should assess the suitability of using the model in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm.
- Misinformation: the model may produce inaccurate information; developers should follow transparency best practices and build feedback mechanisms and pipelines to ground responses in use-case-specific, contextual information.
- Generation of Harmful Content: developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: other forms of misuse, such as fraud, spam, or malware production, may be possible; developers should ensure their applications do not violate applicable laws and regulations.
Please report security vulnerabilities or NVIDIA AI Concerns here.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.