---
title: "florence-2"
publisher: "microsoft"
type: "endpoint"
updated: "2025-04-14T23:33:28.525Z"
description: "Vision foundation model capable of performing diverse computer vision and vision language tasks."
canonical: "https://build.nvidia.com/microsoft/microsoft-florence-2"
---

# Model Overview

## Description:
Florence-2 is an advanced vision foundation model using a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection and segmentation.

This model is ready for non-commercial use.  <br>

## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to the [Florence-2 Model Card](https://huggingface.co/microsoft/Florence-2-large-ft).

### License/Terms of Use
[MIT license](https://huggingface.co/microsoft/Florence-2-large-ft/resolve/main/LICENSE).

## References:
+ [Florence-2 technical report](https://arxiv.org/abs/2311.06242)
+ [Jupyter Notebook for inference and visualization of Florence-2 model](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)

## Model Architecture:
**Architecture Type:** Transformer-Based  <br>
**Network Architecture:** DaViT; standard encoder-decoder  <br>

## Input:
**Input Type(s):** Image, Text. <br>
**Input Format(s):** Red, Green, Blue (RGB), String <br>
**Input Parameters:** Two Dimensional (2D) <br>
**Other Properties Related to Input:** Task prompt.

The model can perform 14 different vision-language and computer vision tasks. The input ```content``` field should be formatted as ```"<TASK_PROMPT><text_prompt (only when needed)><img>"```. The task prompt comes first and selects the task. The image can be supplied either as base64 data or as an NVCF asset ID. Some tasks require a text prompt, which goes between the task prompt and the image tag. The supported tasks are:
* Caption
* Detailed Caption
* More Detailed Caption
* Region to category
* Region to description
* Caption to Phrase Grounding
* Object Detection
* Dense Region Caption
* Region proposal
* Open vocabulary detection
* Referring expression segmentation
* Region to segmentation
* Optical character recognition
* Optical character recognition with region
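
The task names above correspond to the prompt tokens used in the Prototype section of this card. The following is a minimal sketch of assembling the ```content``` string. The names `TASK_PROMPTS`, `TASKS_WITH_TEXT_PROMPT`, `image_tag_from_asset_id`, `image_tag_from_bytes` and `build_content` are illustrative helpers, not part of any NVIDIA API, and the image tag layout is copied from the examples in this card.

```python
import base64

# Task name -> prompt token, taken from the Prototype examples in this card.
TASK_PROMPTS = {
    "Caption": "<CAPTION>",
    "Detailed Caption": "<DETAILED_CAPTION>",
    "More Detailed Caption": "<MORE_DETAILED_CAPTION>",
    "Region to category": "<REGION_TO_CATEGORY>",
    "Region to description": "<REGION_TO_DESCRIPTION>",
    "Caption to Phrase Grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "Object Detection": "<OD>",
    "Dense Region Caption": "<DENSE_REGION_CAPTION>",
    "Region proposal": "<REGION_PROPOSAL>",
    "Open vocabulary detection": "<OPEN_VOCABULARY_DETECTION>",
    "Referring expression segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "Region to segmentation": "<REGION_TO_SEGMENTATION>",
    "Optical character recognition": "<OCR>",
    "Optical character recognition with region": "<OCR_WITH_REGION>",
}

# Tasks that take a text prompt between the task token and the image tag.
TASKS_WITH_TEXT_PROMPT = {
    "Caption to Phrase Grounding",
    "Referring expression segmentation",
    "Open vocabulary detection",
    "Region to segmentation",
    "Region to category",
    "Region to description",
}


def image_tag_from_asset_id(asset_id, mime="image/jpeg"):
    """Image tag that references an already uploaded NVCF asset."""
    return f'<img src="data:{mime};asset_id,{asset_id}" />'


def image_tag_from_bytes(image_bytes, mime="image/jpeg"):
    """Image tag that embeds the image itself as base64 data."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" />'


def build_content(task, image_tag, text_prompt=""):
    """Concatenate task token, optional text prompt and image tag."""
    needs_text = task in TASKS_WITH_TEXT_PROMPT
    if needs_text and not text_prompt:
        raise ValueError(f"task {task!r} requires a text prompt")
    if not needs_text and text_prompt:
        raise ValueError(f"task {task!r} does not take a text prompt")
    return f"{TASK_PROMPTS[task]}{text_prompt}{image_tag}"
```

For instance, `build_content("Open vocabulary detection", image_tag_from_asset_id(asset_id), "dog")` yields a string of the same shape as the open vocabulary detection example in this section.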

For ```<CAPTION_TO_PHRASE_GROUNDING>```, ```<REFERRING_EXPRESSION_SEGMENTATION>``` and ```<OPEN_VOCABULARY_DETECTION>```, the text prompt is a plain-text description. For example: ```'<OPEN_VOCABULARY_DETECTION>dog<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'```.

For ```<REGION_TO_SEGMENTATION>```, ```<REGION_TO_CATEGORY>``` and ```<REGION_TO_DESCRIPTION>```, the text prompt must be formatted as ```<loc_x1><loc_y1><loc_x2><loc_y2>```, which encodes the bounding box of the region of interest in normalized coordinates, calculated as shown below. For example: ```'<REGION_TO_SEGMENTATION><loc_2><loc_3><loc_998><loc_997><img src="data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAICAIAAABLbSncAAAAGUlEQVR4nGK5nHuGARtgwio6aCUAAQAA//+evgIfjH1FEwAAAABJRU5ErkJggg==" />'```.
```
x1=int(top_left_x_coor/width*999)
y1=int(top_left_y_coor/height*999)
x2=int(bottom_right_x_coor/width*999)
y2=int(bottom_right_y_coor/height*999)
```
Other tasks don't take text prompt input. For example: ```'<CAPTION><img src="data:image/png;asset_id,868f5924-8ef2-8g3c-866e-87bb423126cb" />'```.
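
The normalization formula for region prompts can be wrapped in a small helper. This is a sketch only: `bbox_to_loc_prompt` is a hypothetical name, and the range checks are an added assumption (the card does not say how out-of-image boxes are handled).

```python
def bbox_to_loc_prompt(x1, y1, x2, y2, width, height):
    """Convert a pixel-space box into '<loc_x1><loc_y1><loc_x2><loc_y2>'.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    both in pixels; width and height are the image dimensions in pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if not (0 <= x1 <= x2 <= width and 0 <= y1 <= y2 <= height):
        raise ValueError("box must lie inside the image")
    # Same arithmetic as the formula above: scale by 999 and truncate.
    coords = (
        int(x1 / width * 999),
        int(y1 / height * 999),
        int(x2 / width * 999),
        int(y2 / height * 999),
    )
    return "".join(f"<loc_{value}>" for value in coords)


# A box covering the top-left quarter-to-half of a 200x100 image.
print(bbox_to_loc_prompt(50, 25, 100, 50, 200, 100))
# prints <loc_249><loc_249><loc_499><loc_499>
```

The returned string is the text prompt; it still has to be placed between the task prompt and the image tag.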

<br>

## Output:
**Output Type(s):** Text, Bounding Box, Segmentation Mask <br>
**Output Format:** String or Dictionary (Text), Image (RGB, Black & White) <br>
**Output Parameters:** One Dimensional (1D)- Text, 2D- Bounding Box, Segmentation Mask <br>
**Other Properties Related to Output:** <br>
The response is returned as a zip archive, which must be saved and extracted. The archive contains an overlay image (when a bounding box or segmentation is generated) and a ```<id>.response``` JSON file.
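
As an alternative to writing the archive to disk first, the response bytes can be unpacked in memory. The sketch below assumes only what is stated above: the archive holds one file whose name ends in `.response` and contains JSON, plus optional other files such as the overlay image. `read_response_archive` is a hypothetical helper name.

```python
import io
import json
import zipfile


def read_response_archive(zip_bytes):
    """Return (parsed .response JSON, {other file name: raw bytes})."""
    response_json = None
    other_files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            if name.endswith(".response"):
                response_json = json.loads(data)
            else:
                # For example the overlay image; its name and format
                # are not specified in this card.
                other_files[name] = data
    if response_json is None:
        raise ValueError("archive does not contain a .response file")
    return response_json, other_files
```

With the `requests` library, `response.content` from the inference call could be passed directly as `zip_bytes`.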

For caption related tasks, the output is saved in ```"content": "<TASK_PROMPT>caption"```. For example, ```"content": "<CAPTION>A black and brown dog in a grass field"``` <br>

For bounding boxes or segmentation masks, the output is saved in ```"entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons": []}```. For example, ```"entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}``` <br>
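
Once the `content` string or the `entities` object has been located in the parsed `.response` JSON, the values can be post-processed along these lines. This is a sketch under assumptions: the exact nesting of these fields inside the JSON file and the coordinate order of each box are not stated in this card, and both function names are hypothetical.

```python
import re


def split_task_and_caption(content):
    """Split '<TASK_PROMPT>caption' into ('<TASK_PROMPT>', 'caption')."""
    match = re.match(r"^(<[A-Z_]+>)(.*)$", content, flags=re.DOTALL)
    if match is None:
        return "", content
    return match.group(1), match.group(2).strip()


def pair_labels_with_boxes(entities):
    """Pair each label with its bounding box; null fields count as empty."""
    labels = entities.get("labels") or []
    bboxes = entities.get("bboxes") or []
    return list(zip(labels, bboxes))


# Values copied from the examples above.
entities = {
    "bboxes": [[192.47, 68.882, 611.081, 346.83],
               [1.529, 240.178, 611.081, 403.394]],
    "quad_boxes": None,
    "labels": ["A black and brown dog", "a grass field"],
    "bboxes_labels": None,
    "polygons": None,
}
for label, box in pair_labels_with_boxes(entities):
    print(label, box)
```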

## Software Integration:
**Runtime Engine(s):**
* PyTorch <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Blackwell <br>
* NVIDIA Jetson  <br>
* NVIDIA Hopper <br>
* NVIDIA Lovelace <br>
* NVIDIA Pascal <br>
* NVIDIA Turing <br>
* NVIDIA Volta <br>

**Supported Operating System(s):** <br>
* Linux <br>
* Windows <br>

## Model Version(s):
* Florence-2-base  <br>
* Florence-2-large  <br>
* Florence-2-base-ft  <br>
* Florence-2-large-ft  <br>

# Training and Testing Datasets:

## Training Dataset:

**Link**

* FLD-5B dataset (Microsoft) <br>

**Data Collection Method by dataset** <br>

* Hybrid: Human, Automatic/Sensors  <br>

**Labeling Method by dataset** <br>

* Hybrid: Human, Automatic/Sensors  <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s))**

* The dataset consists of images annotated for a diverse set of tasks, including captioning, detection, segmentation and optical character recognition. It contains 126 million images with 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations across the different tasks. <br>

## Testing Dataset:
**Link**

* [COCO](https://cocodataset.org/), [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) <br>

**Data Collection Method by dataset** <br>

* Hybrid: Human, Automatic/Sensors  <br>

**Labeling Method by dataset** <br>

* Hybrid: Human, Automatic/Sensors  <br>

**Properties (Quantity, Dataset Descriptions, Sensor(s))**

* COCO: A large-scale object detection, segmentation, and captioning dataset with 330K images (more than 200K labeled) and 1.5 million object instances <br>
* Flickr30k: 31,000 images collected from Flickr, each with five reference sentences provided by human annotators <br>

## Inference:
**Engine:** PyTorch <br>
**Test Hardware:** <br>
* NVIDIA L40 <br>

## Bias

| Field | Response |
| -- | -- |
|Participation considerations from adversely impacted groups [(protected classes)](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None of the Above |
| Measures taken to mitigate against unwanted bias: | No measures taken to mitigate against unwanted bias.|

## Explainability

| Field | Response |
| -- | -- |
| Intended Application(s) & Domain(s): | The primary use of Florence-2 is research on large multimodal models. |
| Model Type: | Text generation, object detection, segmentation, and OCR from image |
| Intended Users: | The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. |
| Output: | Text, detection, segmentation |
| Describe how the model works: | Text generation, object detection, segmentation, and OCR from image |
| Technical Limitations: | |
| Verified to have met prescribed NVIDIA standards: | |
| Performance Metrics: | Accuracy |
| Licensing: | [MIT License](https://huggingface.co/microsoft/Florence-2-large/resolve/main/LICENSE) |

## Privacy

| Field | Response |
| -- | -- |
| Generatable or reverse engineerable personally-identifiable information (PII)? | None |
| Protected classes used to create this model? | Not Applicable (No PII) |
| Was consent obtained for any PII used? | Not Applicable (No PII) |
| How often is dataset reviewed? | Before Release |
| Is a mechanism in place to honor data subject right of access or deletion of personal data? | No |
| If PII collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable |
| If PII collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable |
| If PII collected for the development of this AI model, was it minimized to only what was required? | Not Applicable |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
| Applicable NVIDIA Privacy Policy | [https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/) |

## Prototype

```python
# The model can perform 14 different vision-language and computer vision tasks.
# The input "content" field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>".
# The task prompt comes first. The image can be supplied as base64 data or as an NVCF asset ID.
# Some tasks require a text prompt, which goes between the task prompt and the image tag. The prompts list below has one example per task.
# For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, <OPEN_VOCABULARY_DETECTION>, the text prompt can be replaced with any other description.
# For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, <REGION_TO_DESCRIPTION>, the text prompt is formatted as <loc_x1><loc_y1><loc_x2><loc_y2>,
# the normalized coordinates of the region-of-interest bounding box: x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
# x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).
import os
import sys
import zipfile
import requests

nvai_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"
header_auth = f'Bearer {os.getenv("API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC", "")}'
prompts = [
    "<CAPTION>",
    "<DETAILED_CAPTION>",
    "<MORE_DETAILED_CAPTION>",
    "<OD>",
    "<DENSE_REGION_CAPTION>",
    "<REGION_PROPOSAL>",
    "<CAPTION_TO_PHRASE_GROUNDING>A black and brown dog is laying on a grass field.",
    "<REFERRING_EXPRESSION_SEGMENTATION>a black and brown dog",
    "<REGION_TO_SEGMENTATION><loc_312><loc_168><loc_998><loc_846>",
    "<OPEN_VOCABULARY_DETECTION>a black and brown dog",
    "<REGION_TO_CATEGORY><loc_312><loc_168><loc_998><loc_846>",
    "<REGION_TO_DESCRIPTION><loc_312><loc_168><loc_998><loc_846>",
    "<OCR>",
    "<OCR_WITH_REGION>",
]

def _upload_asset(input, description):
    """
    Uploads an asset to the NVCF API.
    :param input: The binary asset to upload
    :param description: A description of the asset
    :return: The ID of the uploaded asset
    """

    # Step 1: request an upload URL and an asset ID.
    authorize = requests.post(
        "https://api.nvcf.nvidia.com/v2/nvcf/assets",
        headers={
            "Authorization": header_auth,
            "Content-Type": "application/json",
            "accept": "application/json",
        },
        json={"contentType": "image/jpeg", "description": description},
        timeout=30,
    )
    authorize.raise_for_status()

    # Step 2: upload the binary data to the returned URL.
    response = requests.put(
        authorize.json()["uploadUrl"],
        data=input,
        headers={
            "x-amz-meta-nvcf-asset-description": description,
            "content-type": "image/jpeg",
        },
        timeout=300,
    )

    response.raise_for_status()
    return str(authorize.json()["assetId"])

def _generate_content(task_id, asset_id):
    """Builds the request content: the task prompt followed by the image tag."""
    if task_id < 0 or task_id >= len(prompts):
        print(f"task_id should be within [0, {len(prompts)-1}]")
        sys.exit(1)
    prompt = prompts[task_id]
    content = f'{prompt}<img src="data:image/jpeg;asset_id,{asset_id}" />'
    return content

if __name__ == "__main__":
    """Uploads an image of your choosing to the NVCF API and sends a request
    to the Florence-2 model for the selected task. The response is saved to
    <result_dir>
    """

    if len(sys.argv) != 4:
        print("Usage: python test.py <test_image> <result_dir> <task_id>\n"
              "For example: python test.py car.jpg result_dir 0")
        sys.exit(1)

    if len(os.getenv("API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC", "")) == 0:
        print("API_KEY not set. Please export API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<Your API Key> as environment variable.")
        sys.exit(1)

    # Upload the local image; the file is closed once the upload finishes.
    with open(sys.argv[1], "rb") as image_file:
        asset_id = _upload_asset(image_file, "Test Image")
    # Build the request content from the task prompt and the returned asset ID.
    content = _generate_content(int(sys.argv[3]), asset_id)
    inputs = {
        "messages": [{
            "role": "user",
            "content": content
        }]
    }
    headers = {
        "Content-Type": "application/json",
        "NVCF-INPUT-ASSET-REFERENCES": asset_id,
        "NVCF-FUNCTION-ASSET-IDS": asset_id,
        "Authorization": header_auth,
        "Accept": "application/json"
    }

    print(asset_id, inputs)

    # Send the request to the NIM API.
    response = requests.post(nvai_url, headers=headers, json=inputs)
    response.raise_for_status()

    # Save the zip archive returned by the API, then extract it.
    with open(f"{sys.argv[2]}.zip", "wb") as out:
        out.write(response.content)

    with zipfile.ZipFile(f"{sys.argv[2]}.zip", "r") as z:
        z.extractall(sys.argv[2])

    print(f"Response saved to path: {sys.argv[2]}. File list: {os.listdir(sys.argv[2])}")
```

```javascript
// The model can perform 14 different vision-language and computer vision tasks.
// The input "content" field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>".
// The task prompt comes first. The image can be supplied as base64 data or as an NVCF asset ID.
// Some tasks require a text prompt, which goes between the task prompt and the image tag. The prompts list below has one example per task.
// For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, <OPEN_VOCABULARY_DETECTION>, the text prompt can be replaced with any other description.
// For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, <REGION_TO_DESCRIPTION>, the text prompt is formatted as <loc_x1><loc_y1><loc_x2><loc_y2>,
// the normalized coordinates of the region-of-interest bounding box: x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
// x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).
// Prerequisites:
// npm install decompress@4.2.1 node-fetch@^2.7.0
const fs = require("fs");
const decompress = require("decompress");
const fetch = require("node-fetch");

const nvai_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2";

const header_auth = `Bearer ${process.env.API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC}`;

const prompts = [
  "<CAPTION>",
  "<DETAILED_CAPTION>",
  "<MORE_DETAILED_CAPTION>",
  "<OD>",
  "<DENSE_REGION_CAPTION>",
  "<REGION_PROPOSAL>",
  "<CAPTION_TO_PHRASE_GROUNDING>A black and brown dog is laying on a grass field.",
  "<REFERRING_EXPRESSION_SEGMENTATION>a black and brown dog",
  "<REGION_TO_SEGMENTATION><loc_312><loc_168><loc_998><loc_846>",
  "<OPEN_VOCABULARY_DETECTION>a black and brown dog",
  "<REGION_TO_CATEGORY><loc_312><loc_168><loc_998><loc_846>",
  "<REGION_TO_DESCRIPTION><loc_312><loc_168><loc_998><loc_846>",
  "<OCR>",
  "<OCR_WITH_REGION>"
];

// Uploads a local image file to the NVCF API and returns its asset ID.
async function uploadAsset(input, description) {
  const assets_url = "https://api.nvcf.nvidia.com/v2/nvcf/assets";

  const headers = {
    Authorization: header_auth,
    "Content-Type": "application/json",
    accept: "application/json",
  };

  const put_headers = {
    "x-amz-meta-nvcf-asset-description": description,
    "content-type": "image/jpeg",
  };

  const payload = {
    contentType: "image/jpeg",
    description: description,
  };

  // Step 1: request an upload URL and an asset ID.
  const response = await fetch(assets_url, {
    method: "POST",
    body: JSON.stringify(payload),
    headers: headers,
  });
  if (!response.ok) {
    throw new Error(`Asset authorization failed with status ${response.status}`);
  }

  const data = await response.json();

  const asset_url = data["uploadUrl"];
  const asset_id = data["assetId"];

  const fileData = fs.readFileSync(input);

  // Step 2: upload the binary data to the returned URL.
  const upload = await fetch(asset_url, {
    method: "PUT",
    body: fileData,
    headers: put_headers,
  });
  if (!upload.ok) {
    throw new Error(`Asset upload failed with status ${upload.status}`);
  }

  return asset_id.toString();
}

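// Builds the request content: the task prompt followed by the image tag.
async function generateContent(taskId, assetId) {
  const index = Number(taskId);
  if (!Number.isInteger(index) || index < 0 || index >= prompts.length) {
    console.log(`task_id should be within [0, ${prompts.length - 1}]`);
    process.exit(1);
  }
  return `${prompts[index]}<img src="data:image/jpeg;asset_id,${assetId}" />`;
}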

(async () => {
  if (process.argv.length !== 5) {
    console.log(
      "Usage: node test.js <test_image> <result_dir> <task_id>\nFor example: node test.js car.jpg result_dir 0"
    );
    process.exit(1);
  }

  // The variable is undefined when unset, so test for a falsy value.
  if (!process.env.API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC) {
    console.log(
      "API_KEY not set. Please export API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<Your API Key> as environment variable."
    );
    process.exit(1);
  }

  const assetId = await uploadAsset(process.argv[2], "Test Image");
  const content = await generateContent(process.argv[4], assetId);

  const inputs = {
    messages: [{
      role: "user",
      content: content
    }]
  };

  // Send the request to the NIM API.
  const response = await fetch(nvai_url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "NVCF-INPUT-ASSET-REFERENCES": `${assetId}`,
      "NVCF-FUNCTION-ASSET-IDS": `${assetId}`,
      Authorization: header_auth,
    },
    body: JSON.stringify(inputs),
  });
  if (!response.ok) {
    throw new Error(`Inference request failed with status ${response.status}`);
  }

  // Save the zip archive returned by the API.
  const buffer = await response.arrayBuffer();
  fs.writeFileSync(`${process.argv[3]}.zip`, Buffer.from(buffer));

  // Unzip the response
  await decompress(`${process.argv[3]}.zip`, process.argv[3]);

  // Log the output directory and its contents
  console.log(`Response saved to ${process.argv[3]}`);
  console.log(fs.readdirSync(process.argv[3]));
})();
```

```bash
#!/bin/bash
# The model can perform 14 different vision-language and computer vision tasks.
# The input "content" field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>".
# The task prompt comes first. The image can be supplied as base64 data or as an NVCF asset ID.
# Some tasks require a text prompt, which goes between the task prompt and the image tag. The prompts list below has one example per task.
# For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, <OPEN_VOCABULARY_DETECTION>, the text prompt can be replaced with any other description.
# For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, <REGION_TO_DESCRIPTION>, the text prompt is formatted as <loc_x1><loc_y1><loc_x2><loc_y2>,
# the normalized coordinates of the region-of-interest bounding box: x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
# x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).

set -e

# Check arguments
if [ "$#" -ne 3 ]; then
    printf "Usage: ./test.sh <test_image> <result_dir> <task_id>\nFor example: ./test.sh car.jpg result_dir 0\n"
    exit 1
fi

if [[ -z "${API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC}" ]]; then
    echo "API_KEY not set. Please export API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<Your API Key> as environment variable."
    exit 1
fi

# Set variables
nvai_url="https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"
api_key=$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC
assets_url="https://api.nvcf.nvidia.com/v2/nvcf/assets"

prompts=(
    "<CAPTION>"
    "<DETAILED_CAPTION>"
    "<MORE_DETAILED_CAPTION>"
    "<OD>"
    "<DENSE_REGION_CAPTION>"
    "<REGION_PROPOSAL>"
    "<CAPTION_TO_PHRASE_GROUNDING>A black and brown dog is laying on a grass field."
    "<REFERRING_EXPRESSION_SEGMENTATION>a black and brown dog"
    "<REGION_TO_SEGMENTATION><loc_312><loc_168><loc_998><loc_846>"
    "<OPEN_VOCABULARY_DETECTION>a black and brown dog"
    "<REGION_TO_CATEGORY><loc_312><loc_168><loc_998><loc_846>"
    "<REGION_TO_DESCRIPTION><loc_312><loc_168><loc_998><loc_846>"
    "<OCR>"
    "<OCR_WITH_REGION>"
)

content_type="image/jpeg"
description="Test Image"

# Function to upload an asset
upload_asset() {
    local input=$1
    local description=$2
    local authorize upload_url asset_id

    # Authorize upload
    authorize=$(curl -s -X POST "$assets_url" \
        -H "Authorization: Bearer $api_key" \
        -H "Content-Type: application/json" \
        -H "accept: application/json" \
        -d "{\"contentType\": \"$content_type\", \"description\": \"$description\"}")

    # Get upload URL and asset ID
    upload_url=$(echo "$authorize" | jq -r '.uploadUrl')
    asset_id=$(echo "$authorize" | jq -r '.assetId')

    # Upload asset
    curl -s -X PUT "$upload_url" \
        -H "x-amz-meta-nvcf-asset-description: $description" \
        -H "content-type: $content_type" \
        --upload-file "$input"

    echo "$asset_id"
}

# Function to generate content
generate_content() {
    local task_id=$1
    local asset_id=$2
    local prompt=${prompts[$task_id]}
    # The quotes are backslash-escaped because the result is embedded in a JSON string.
    local content="$prompt<img src=\\\"data:image/jpeg;asset_id,$asset_id\\\" />"

    echo "$content"
}

# Upload the image (arguments are quoted so that "Test Image" stays one argument)
asset_id=$(upload_asset "$1" "$description")
content=$(generate_content "$3" "$asset_id")
echo '{
    "messages":[{
        "role": "user",
        "content": "'"$content"'"
    }]
}' > payload.json

mkdir -p "$2"
# Send the request to the NIM API and read the result URL from the location header
location=$(curl -D - -s -X POST "$nvai_url" \
    -H "Content-Type: application/json" \
    -H "NVCF-INPUT-ASSET-REFERENCES: $asset_id" \
    -H "NVCF-FUNCTION-ASSET-IDS: $asset_id" \
    -H "Authorization: Bearer $api_key" \
    -d @payload.json \
    | grep location | awk '{print $2}' | tr -d '\n\r ",')

# Download the .zip file from the location header
curl -s "$location" -o "$2.zip"

# Unzip the file
unzip -q "$2.zip" -d "$2"

echo "Response saved to $2.zip"
ls "$2"
```