Vision foundation model capable of performing diverse computer vision and vision language tasks.
Florence-2 is an advanced vision foundation model using a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection and segmentation.
This model is ready for non-commercial use.
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Florence-2 Model Card.
Architecture Type: Transformer-Based
Network Architecture: DaViT vision encoder with a standard transformer encoder-decoder
Input Type(s): Image, Text.
Input Format(s): Red, Green, Blue (RGB), String
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Task prompt.
The model can perform 14 different vision-language and computer vision tasks. The input content field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>". Users must specify the task type at the beginning. The image can be passed either as base64-encoded data or as an NVCF asset ID. Some tasks also require a text prompt, which is placed between the task prompt and the image tag. The supported tasks are listed below:
For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, and <OPEN_VOCABULARY_DETECTION>, the text prompt is a normal description. For example: '<OPEN_VOCABULARY_DETECTION>dog<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'.
For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, and <REGION_TO_DESCRIPTION>, the text prompt must be formatted as <loc_x1><loc_y1><loc_x2><loc_y2>, which encodes the normalized coordinates of the region-of-interest bounding box, calculated as shown below. For example: '<REGION_TO_SEGMENTATION><loc_2><loc_3><loc_998><loc_997><img src="data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAICAIAAABLbSncAAAAGUlEQVR4nGK5nHuGARtgwio6aCUAAQAA//+evgIfjH1FEwAAAABJRU5ErkJggg==" />'.
x1 = int(top_left_x_coor / width * 999)
y1 = int(top_left_y_coor / height * 999)
x2 = int(bottom_right_x_coor / width * 999)
y2 = int(bottom_right_y_coor / height * 999)
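To make the conversion concrete, here is a minimal Python sketch that turns a pixel-space bounding box into the <loc_*> prompt tokens described above. The helper name and example values are illustrative, not part of the API.

```python
def bbox_to_loc_tokens(top_left_x, top_left_y, bottom_right_x, bottom_right_y,
                       width, height):
    """Convert a pixel-space bbox into the normalized <loc_*> prompt tokens."""
    x1 = int(top_left_x / width * 999)
    y1 = int(top_left_y / height * 999)
    x2 = int(bottom_right_x / width * 999)
    y2 = int(bottom_right_y / height * 999)
    return f"<loc_{x1}><loc_{y1}><loc_{x2}><loc_{y2}>"

# Example: a 100x150 pixel region with its top-left corner at (10, 20)
# inside a 640x480 image.
print(bbox_to_loc_tokens(10, 20, 110, 170, 640, 480))
# -> "<loc_15><loc_41><loc_171><loc_353>"
```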
Other tasks don't take text prompt input. For example: '<CAPTION><img src="data:image/png;asset_id,868f5924-8ef2-8g3c-866e-87bb423126cb" />'.
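For reference, below is a minimal sketch of one way to send a request with a base64-encoded image. The endpoint URL, header names, and the messages-style payload layout are assumptions based on hosted NIM API conventions and should be verified against the current API documentation.

```python
import base64
import requests

# Assumed endpoint and auth scheme; verify against the current API docs.
INVOKE_URL = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"
API_KEY = "nvapi-..."  # your API key

with open("dog.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "messages": [
        {
            "role": "user",
            # Task prompt, optional text prompt, then the image tag.
            "content": f'<CAPTION><img src="data:image/jpeg;base64,{image_b64}" />',
        }
    ]
}

response = requests.post(
    INVOKE_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "application/zip",  # assumed; the response is a zip archive
    },
    json=payload,
)
response.raise_for_status()

# Save the returned archive for extraction (see the output description below).
with open("florence2_response.zip", "wb") as f:
    f.write(response.content)
```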
Output Type(s): Text, Bounding Box, Segmentation Mask
Output Format: String or Dictionary (Text), Image (RGB, Black & White)
Output Parameters: One Dimensional (1D)- Text, 2D- Bounding Box, Segmentation Mask
Other Properties Related to Output:
The response data needs to be saved into a zip file and extracted. It contains an overlay image (when a bounding box or segmentation mask is generated) and a <id>.response JSON file.
For caption-related tasks, the output is saved in "content": "<TASK_PROMPT>caption". For example, "content": "<CAPTION>A black and brown dog in a grass field".
For bounding boxes or segmentation masks, the output is saved in "entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons": []}. For example, "entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}.
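As a sketch of post-processing, assuming only the zip layout described above (an overlay image plus a <id>.response JSON file, saved from the request sketch earlier), the archive could be extracted and the "content" and "entities" fields located like this; the exact JSON nesting is not assumed.

```python
import json
import zipfile

ZIP_PATH = "florence2_response.zip"  # archive saved from the request above

with zipfile.ZipFile(ZIP_PATH) as zf:
    zf.extractall("florence2_output")
    # The <id>.response file holds the JSON result; any other file is an overlay image.
    response_name = next(n for n in zf.namelist() if n.endswith(".response"))
    with zf.open(response_name) as f:
        result = json.load(f)

def find_key(obj, key):
    """Search the parsed JSON for the first occurrence of a key, at any depth."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for value in obj:
            found = find_key(value, key)
            if found is not None:
                return found
    return None

print("content:", find_key(result, "content"))    # caption text, e.g. "<CAPTION>..."
print("entities:", find_key(result, "entities"))  # bboxes / labels / polygons
```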
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
[Preferred/Supported] Operating System(s):
Training Dataset:
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Evaluation Dataset:
Link:
Data Collection Method by dataset:
Labeling Method by dataset:
Properties (Quantity, Dataset Descriptions, Sensor(s)):
Engine: PyTorch
Test Hardware: