---
title: "florence-2"
publisher: "microsoft"
type: "endpoint"
updated: "2025-04-14T23:33:28.525Z"
description: "Vision foundation model capable of performing diverse computer vision and vision language tasks."
canonical: "https://build.nvidia.com/microsoft/microsoft-florence-2"
---

# Model Overview

## Description:

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection and segmentation.

This model is ready for non-commercial use.

## Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to the [Florence-2 Model Card](https://huggingface.co/microsoft/Florence-2-large-ft).

### License/Terms of Use

[MIT license](https://huggingface.co/microsoft/Florence-2-large-ft/resolve/main/LICENSE).

## References:

+ [Florence-2 technical report](https://arxiv.org/abs/2311.06242)
+ [Jupyter Notebook for inference and visualization of Florence-2 model](https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb)

## Model Architecture:

**Architecture Type:** Transformer-Based
**Network Architecture:** DaViT; standard encoder-decoder

## Input:

**Input Type(s):** Image, Text.
**Input Format(s):** Red, Green, Blue (RGB), String
**Input Parameters:** Two Dimensional (2D)
**Other Properties Related to Input:** Task prompt. The model can perform 14 different vision language model and computer vision tasks. The input ```content``` field is a single string that starts with the task type. The image can be supplied either as base64 data or as an NvCF asset ID. Some tasks also require a text prompt in the same string. The supported tasks are:

* Caption
* Detailed Caption
* More Detailed Caption
* Region to category
* Region to description
* Caption to Phrase Grounding
* Object Detection
* Dense Region Caption
* Region proposal
* Open vocabulary detection
* Referring expression segmentation
* Region to segmentation
* Optical character recognition
* Optical character recognition with region

For Caption to Phrase Grounding, Open vocabulary detection, and Referring expression segmentation, the text prompt is a normal description, for example ```dog```.

For Region to category, Region to description, and Region to segmentation, the text prompt must give the normalized coordinates of the region-of-interest bounding box, calculated as follows:

```
x1=int(top_left_x_coor/width*999)
y1=int(top_left_y_coor/height*999)
x2=int(bottom_right_x_coor/width*999)
y2=int(bottom_right_y_coor/height*999)
```

The other tasks don't take a text prompt.
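
The coordinate normalization above can be sketched as a small Python helper. This is an illustration only, not part of the NVIDIA sample: the function names `normalize_bbox` and `to_loc_prompt` are hypothetical, and the `<loc_N>` token encoding is an assumption based on the location-token convention used in the Florence-2 model card, so verify it against the endpoint before relying on it.

```python
def normalize_bbox(x1_px, y1_px, x2_px, y2_px, width, height):
    """Map pixel corner coordinates to integers in [0, 999] using the formulas above."""
    return (
        int(x1_px / width * 999),
        int(y1_px / height * 999),
        int(x2_px / width * 999),
        int(y2_px / height * 999),
    )


def to_loc_prompt(coords):
    """Assumed text-prompt encoding: one <loc_N> token per normalized coordinate."""
    return "".join(f"<loc_{c}>" for c in coords)


# A 640x480 image with a region from pixel (200, 100) to pixel (640, 480).
coords = normalize_bbox(200, 100, 640, 480, width=640, height=480)
print(coords)                 # (312, 208, 999, 999)
print(to_loc_prompt(coords))  # <loc_312><loc_208><loc_999><loc_999>
```

Because the scale factor is 999 rather than 1000, a corner on the far image edge maps to 999, the largest value the formulas can produce.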

## Output:

**Output Type(s):** Text, Bounding Box, Segmentation Mask
**Output Format:** String or Dictionary (Text), Image (RGB, Black & White)
**Output Parameters:** One Dimensional (1D)- Text, 2D- Bounding Box, Segmentation Mask
**Other Properties Related to Output:**
The response is a zip archive: save it to a file and extract it. The archive contains an overlay image (when a bounding box or segmentation is generated) and a ```.response``` JSON file. For caption-related tasks, the output is stored as ```"content": "caption"```, for example ```"content": "A black and brown dog in a grass field"```.
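
A minimal Python sketch of the save-and-extract step might look like the following. The helper name `extract_and_load` is hypothetical, and the demo archive (file name `demo.response` and its content) is a stand-in built locally so the snippet runs without calling the API; the real archive layout may differ.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def extract_and_load(zip_path, out_dir):
    """Extract the response archive and parse every *.response file as JSON."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(out_dir).rglob("*.response"))
    }


# Self-contained demo: build a stand-in archive, then extract and load it.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "result.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr(
            "demo.response",
            json.dumps({"content": "A black and brown dog in a grass field"}),
        )
    loaded = extract_and_load(demo_zip, Path(tmp) / "result")

print(loaded["demo.response"]["content"])
```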
For bounding boxes or segmentation masks, the output is stored as ```"entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons": []}```. For example, ```"entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}```

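
To show how the ```entities``` structure can be consumed, here is a short Python sketch that pairs each label with its bounding box, using the example values above. It rests on two assumptions that the text does not confirm: that ```entities``` sits at the top level of the parsed JSON (the real ```.response``` file may nest it deeper), and that each box is ordered as top-left x, top-left y, bottom-right x, bottom-right y. The helper name `labeled_boxes` is hypothetical.

```python
import json

# Example values from the text, wrapped as a JSON document (assumed layout).
raw = (
    '{"entities": {"bboxes": [[192.47, 68.882, 611.081, 346.83], '
    '[1.529, 240.178, 611.081, 403.394]], "quad_boxes": null, '
    '"labels": ["A black and brown dog", "a grass field"], '
    '"bboxes_labels": null, "polygons": null}}'
)


def labeled_boxes(response):
    """Return (label, bbox) pairs; missing or null fields yield an empty list."""
    entities = response.get("entities") or {}
    labels = entities.get("labels") or []
    bboxes = entities.get("bboxes") or []
    return list(zip(labels, bboxes))


pairs = labeled_boxes(json.loads(raw))
for label, (x1, y1, x2, y2) in pairs:
    # Assumed corner order: top-left (x1, y1), bottom-right (x2, y2).
    print(f"{label}: ({x1}, {y1}) to ({x2}, {y2})")
```

Treating ```null``` fields as empty lists keeps the helper usable for tasks that return only some of the fields.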
## Software Integration:

**Runtime Engine(s):**
* PyTorch
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Jetson
* NVIDIA Hopper
* NVIDIA Lovelace
* NVIDIA Pascal
* NVIDIA Turing
* NVIDIA Volta
**Supported Operating System(s):**
* Linux
* Windows

## Model Version(s):

* Florence-2-base
* Florence-2-large
* Florence-2-base-ft
* Florence-2-large-ft

# Training and Testing Datasets:

## Training Dataset:

**Link**
* FLD-5B dataset (Microsoft)
**Data Collection Method by dataset**
* Hybrid: Human, Automatic/Sensors
**Labeling Method by dataset**
* Hybrid: Human, Automatic/Sensors
**Properties (Quantity, Dataset Descriptions, Sensor(s))**
* The dataset consists of images collected for a diverse set of purposes, including captioning, detection, segmentation and optical character recognition. There are 126 million images, 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations across different tasks.

## Testing Dataset:

**Link**
* [COCO](https://cocodataset.org/), [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/)
**Data Collection Method by dataset**
* Hybrid: Human, Automatic/Sensors
**Labeling Method by dataset**
* Hybrid: Human, Automatic/Sensors
**Properties (Quantity, Dataset Descriptions, Sensor(s))**
* COCO: COCO is a large-scale object detection, segmentation, and captioning dataset with 330K images (>200K labeled) and 1.5 million object instances.
* Flickr30k: The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with five reference sentences provided by human annotators.

## Inference:

**Engine:** PyTorch
**Test Hardware:**
* NVIDIA L40

## Bias

| Field | Response |
| -- | -- |
| Participation considerations from adversely impacted groups [(protected classes)](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None of the Above |
| Measures taken to mitigate against unwanted bias: | No measures taken to mitigate against unwanted bias. |

## Explainability

| Field | Response |
| -- | -- |
| Intended Application(s) & Domain(s): | The primary use of Florence-2 is research on large multimodal models. |
| Model Type: | Text generation, object detection, segmentation, and OCR from image |
| Intended Users: | The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. |
| Output: | Text, detection, segmentation |
| Describe how the model works: | Text generation, object detection, segmentation, and OCR from image |
| Technical Limitations: | |
| Verified to have met prescribed NVIDIA standards: | |
| Performance Metrics: | Accuracy |
| Licensing: | [MIT License](https://huggingface.co/microsoft/Florence-2-large/resolve/main/LICENSE) |

## Privacy

| Field | Response |
| -- | -- |
| Generatable or reverse engineerable personally-identifiable information (PII)? | None |
| Protected classes used to create this model? | Not Applicable (No PII) |
| Was consent obtained for any PII used? | Not Applicable (No PII) |
| How often is dataset reviewed? | Before Release |
| Is a mechanism in place to honor data subject right of access or deletion of personal data? | No |
| If PII collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable |
| If PII collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable |
| If PII collected for the development of this AI model, was it minimized to only what was required? | Not Applicable |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
| Applicable NVIDIA Privacy Policy | [https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/](https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/) |

## Prototype

```python
# The model can perform 14 different vision language model and computer vision tasks.
# The input "content" field is a single string that starts with the task type.
# The image can be supplied either as base64 data or as an NvCF asset ID.
# Some tasks also require a text prompt. Below are the examples for each task.
# For Caption to Phrase Grounding, Referring expression segmentation, and
# Open vocabulary detection, users can change the text prompt to other descriptions.
# For Region to segmentation, Region to category, and Region to description, the text
# prompt gives the normalized coordinates of the region-of-interest bbox:
# x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
# x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).

import os
import sys
import zipfile

import requests

nvai_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"
header_auth = f'Bearer {os.getenv("API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC", "")}'

# One prompt per task, indexed by task_id.
prompts = ["", "", "", "", "", "",
           "A black and brown dog is laying on a grass field.",
           "a black and brown dog", "", "a black and brown dog",
           "", "", "", ""]


def _upload_asset(image_file, description):
    """
    Uploads an asset to the NVCF API.

    :param image_file: The binary asset to upload
    :param description: A description of the asset
    """
    # Step 1: ask NVCF for an upload URL and an asset ID.
    authorize = requests.post(
        "https://api.nvcf.nvidia.com/v2/nvcf/assets",
        headers={
            "Authorization": header_auth,
            "Content-Type": "application/json",
            "accept": "application/json",
        },
        json={"contentType": "image/jpeg", "description": description},
        timeout=30,
    )
    authorize.raise_for_status()

    # Step 2: upload the image bytes to the returned URL.
    response = requests.put(
        authorize.json()["uploadUrl"],
        data=image_file,
        headers={
            "x-amz-meta-nvcf-asset-description": description,
            "content-type": "image/jpeg",
        },
        timeout=300,
    )
    response.raise_for_status()
    return str(authorize.json()["assetId"])


def _generate_content(task_id, asset_id):
    if task_id < 0 or task_id >= len(prompts):
        print(f"task_id should be within [0, {len(prompts)-1}]")
        sys.exit(1)
    prompt = prompts[task_id]
    content = f'{prompt}'
    return content


if __name__ == "__main__":
    # Uploads an image of your choosing to the NVCF API and sends a request
    # to the Florence-2 model. The response is saved to the given output directory.
    if len(sys.argv) != 4:
        print("Usage: python test.py <image> <output_dir> <task_id>\n"
              "For example: python test.py car.jpg result_dir 0")
        sys.exit(1)

    if len(os.getenv("API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC", "")) == 0:
        print("API_KEY not set. Please export "
              "API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<your API key> as environment variable.")
        sys.exit(1)

    # Upload the local image; the returned asset ID is referenced in the request headers.
    with open(sys.argv[1], "rb") as image_file:
        asset_id = _upload_asset(image_file, "Test Image")
    content = _generate_content(int(sys.argv[3]), asset_id)

    inputs = {
        "messages": [{
            "role": "user",
            "content": content
        }]
    }
    headers = {
        "Content-Type": "application/json",
        "NVCF-INPUT-ASSET-REFERENCES": asset_id,
        "NVCF-FUNCTION-ASSET-IDS": asset_id,
        "Authorization": header_auth,
        "Accept": "application/json"
    }
    print(asset_id, inputs)

    # Send the request to the NIM API.
    response = requests.post(nvai_url, headers=headers, json=inputs)
    response.raise_for_status()

    # The response body is a zip archive: save it, then extract it.
    with open(f"{sys.argv[2]}.zip", "wb") as out:
        out.write(response.content)
    with zipfile.ZipFile(f"{sys.argv[2]}.zip", "r") as z:
        z.extractall(sys.argv[2])

    print(f"Response saved to path: {sys.argv[2]}. File list: {os.listdir(sys.argv[2])}")
```

```javascript
// The model can perform 14 different vision language model and computer vision tasks.
// The input "content" field is a single string that starts with the task type.
// The image can be supplied either as base64 data or as an NvCF asset ID.
// Some tasks also require a text prompt. Below are the examples for each task.
// For Caption to Phrase Grounding, Referring expression segmentation, and
// Open vocabulary detection, users can change the text prompt to other descriptions.
// For Region to segmentation, Region to category, and Region to description, the text
// prompt gives the normalized coordinates of the region-of-interest bbox:
// x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
// x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).

// Prerequisites:
// npm install decompress@4.2.1 node-fetch@^2.7.0

const fs = require("fs");
const decompress = require("decompress");
const fetch = require("node-fetch");

const nvai_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2";
const header_auth = `Bearer ${process.env.API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC}`;

// One prompt per task, indexed by task ID.
const prompts = [
  "", "", "", "", "", "",
  "A black and brown dog is laying on a grass field.",
  "a black and brown dog", "", "a black and brown dog",
  "", "", "", ""
];

async function uploadAsset(input, description) {
  const assets_url = "https://api.nvcf.nvidia.com/v2/nvcf/assets";

  const headers = {
    Authorization: header_auth,
    "Content-Type": "application/json",
    accept: "application/json",
  };

  const put_headers = {
    "x-amz-meta-nvcf-asset-description": description,
    "content-type": "image/jpeg",
  };

  const payload = {
    contentType: "image/jpeg",
    description: description,
  };

  // Step 1: ask NVCF for an upload URL and an asset ID.
  const response = await fetch(assets_url, {
    method: "POST",
    body: JSON.stringify(payload),
    headers: headers,
  });
  const data = await response.json();
  const asset_url = data["uploadUrl"];
  const asset_id = data["assetId"];

  // Step 2: upload the image bytes to the returned URL.
  const fileData = fs.readFileSync(input);
  await fetch(asset_url, {
    method: "PUT",
    body: fileData,
    headers: put_headers,
  });

  return asset_id.toString();
}

async function generateContent(taskId, assetId) {
  return `${prompts[taskId]}`;
}

(async () => {
  if (process.argv.length !== 5) {
    console.log(
      "Usage: node test.js <image> <output_dir> <task_id>\nFor example: node test.js car.jpg result_dir 0"
    );
    process.exit(1);
  }

  // The variable is undefined when unset, so test for a falsy value rather than its length.
  if (!process.env.API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC) {
    console.log(
      "API_KEY not set. Please export API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<your API key> as environment variable."
    );
    process.exit(1);
  }

  const assetId = await uploadAsset(process.argv[2], "Test Image");
  const content = await generateContent(Number(process.argv[4]), assetId);

  const inputs = {
    messages: [{ role: "user", content: content }]
  };

  const response = await fetch(nvai_url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "NVCF-INPUT-ASSET-REFERENCES": `${assetId}`,
      "NVCF-FUNCTION-ASSET-IDS": `${assetId}`,
      Authorization: header_auth,
    },
    body: JSON.stringify(inputs),
  });

  // The response body is a zip archive: save it, then extract it.
  const buffer = await response.arrayBuffer();
  fs.writeFileSync(`${process.argv[3]}.zip`, Buffer.from(buffer));

  // Unzip the response
  await decompress(`${process.argv[3]}.zip`, process.argv[3]);

  // Log the output directory and its contents
  console.log(`Response saved to ${process.argv[3]}`);
  console.log(fs.readdirSync(process.argv[3]));
})();
```

```bash
#!/bin/bash

# The model can perform 14 different vision language model and computer vision tasks.
# The input "content" field is a single string that starts with the task type.
# The image can be supplied either as base64 data or as an NvCF asset ID.
# Some tasks also require a text prompt. Below are the examples for each task.
# For Caption to Phrase Grounding, Referring expression segmentation, and
# Open vocabulary detection, users can change the text prompt to other descriptions.
# For Region to segmentation, Region to category, and Region to description, the text
# prompt gives the normalized coordinates of the region-of-interest bbox:
# x1=int(top_left_x_coor/width*999), y1=int(top_left_y_coor/height*999),
# x2=int(bottom_right_x_coor/width*999), y2=int(bottom_right_y_coor/height*999).

set -e

# Check arguments
if [ "$#" -ne 3 ]; then
    printf "Usage: ./test.sh <image> <output_dir> <task_id>\nFor example: ./test.sh car.jpg result_dir 0\n"
    exit 1
fi

if [[ -z "${API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC}" ]]; then
    echo "API_KEY not set. Please export API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC=<your API key> as environment variable."
    exit 1
fi

# Set variables
nvai_url="https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"
api_key=$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC
assets_url="https://api.nvcf.nvidia.com/v2/nvcf/assets"
# One prompt per task, indexed by task ID
prompts=(
    "" "" "" "" "" ""
    "A black and brown dog is laying on a grass field."
    "a black and brown dog" "" "a black and brown dog"
    "" "" "" ""
)
content_type="image/jpeg"
description="Test Image"

# Function to upload an asset
upload_asset() {
    local input=$1
    local description=$2

    # Authorize upload
    authorize=$(curl -s -X POST "$assets_url" \
        -H "Authorization: Bearer $api_key" \
        -H "Content-Type: application/json" \
        -H "accept: application/json" \
        -d "{\"contentType\": \"$content_type\", \"description\": \"$description\"}")

    # Get upload URL and asset ID
    upload_url=$(echo "$authorize" | jq -r '.uploadUrl')
    asset_id=$(echo "$authorize" | jq -r '.assetId')

    # Upload asset
    curl -s -X PUT "$upload_url" \
        -H "x-amz-meta-nvcf-asset-description: $description" \
        -H "content-type: $content_type" \
        --upload-file "$input"

    echo "$asset_id"
}

# Function to generate content
generate_content() {
    local task_id=$1
    local asset_id=$2
    prompt=${prompts[$task_id]}
    content="$prompt"
    echo "$content"
}

# Upload the image (quoted so the two-word description stays a single argument)
asset_id=$(upload_asset "$1" "$description")
content=$(generate_content "$3" "$asset_id")

echo '{
    "messages":[{
        "role": "user",
        "content": "'"$content"'"
    }]
}' > payload.json

mkdir -p "$2"

# Send the request to the microservice and read the location header from the response
location=$(curl -D - -s -X POST "$nvai_url" \
    -H "Content-Type: application/json" \
    -H "NVCF-INPUT-ASSET-REFERENCES: $asset_id" \
    -H "NVCF-FUNCTION-ASSET-IDS: $asset_id" \
    -H "Authorization: Bearer $api_key" \
    -d @payload.json \
    | grep location | awk '{print $2}' | tr -d '\n\r ",')

# Download the .zip file from the location header
curl -s "$location" > "$2.zip"

# Unzip the file
unzip -q "$2.zip" -d "$2"
echo "Response saved to $2.zip"
ls "$2"
```