# Model Overview

## Description:

**SAM 2** (Segment Anything Model 2) is a segmentation model that enables fast, precise selection of objects in videos and images. Released by Meta's FAIR (Fundamental AI Research) team, SAM 2 produces segmentation masks of the object of interest in a single image or throughout all video frames, even if the object disappears from view. The model can be used for data annotation, object tracking, or general-purpose segmentation.

In addition, SAM 2 provides a flexible prompt-based interface that lets users identify objects with a click, bounding box, or mask. Users can select one or more objects in any video frame and use extra prompts to adjust the model’s predictions. SAM 2's efficient video processing with streaming inference makes it suitable for real-time, streaming video applications.

The model's capabilities have been extended to support segmentation via text-based prompts with GroundingDINO, an open-vocabulary object detection model that can detect one or more objects in a frame based on the text input.

## Third-Party Community Consideration

The SAM 2 model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case. For the **SAM 2 Model Card**, refer to the **Model, data and annotation cards** section of the [SAM 2 paper](https://arxiv.org/abs/2408.00714).

GroundingDINO is an NVIDIA model, pre-trained on a wide range of commercial datasets where the annotations were either human generated or pseudo-labeled. For more information, refer to the [Grounding DINO model card](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/grounding_dino).

### License/Terms of Use

**GOVERNING TERMS:** Your use of this API is governed by the [NVIDIA API Trial Service Terms of Use](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf); and the use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).

## References:

+ [SAM 2 paper](https://arxiv.org/abs/2408.00714)
+ [SAM 2 official code repo](https://github.com/facebookresearch/sam2)

## Model Architecture

**Architecture Type:** Transformer

**Network Architecture:** SAM2. For more details, refer to the **Model** section in the [SAM 2 paper](https://arxiv.org/abs/2408.00714).

## Input:

* **Input Types:** Image, Video, Integers (Visual Prompts), Text
* **Input Formats:** Image - JPEG, PNG; Video - MP4
* **Input Parameters:** 2D, 3D
* **Other Properties Related to Input:**
  - Visual prompts are (X, Y) points, each with an include/exclude label, that the user selects by clicking on the image or video in the UI. They let the user mark the regions of interest to segment in an image or to track in a video.
  - Users can also provide a text description to detect and segment the object of interest.

## Output:

* **Output Types:** Image, Video, Integers (Segmentation mask)
* **Output Format:** Image - JPEG, PNG; Video - MP4
* **Output Parameters:** 2D, 3D
* **Other Properties Related to Output:**
  - For image input, the output is the image with segmentation mask(s) overlaid on the object(s) of interest.
  - For video input, the output is the video with segmentation mask(s) overlaid on the object(s) of interest.
  - The segmentation mask has the same resolution as the input image/video. Background pixels have the value 0, and the pixels of each object of interest carry that object's ID.

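The prompt and mask conventions described above can be illustrated with a small, self-contained sketch. It is not the API of this service or of the SAM 2 library: the `PointPrompt` class, the helper names, and the toy mask are hypothetical and exist only for illustration. The sketch assumes the mask arrives as a row-major 2D grid of integers, and uses plain Python lists where a NumPy array would be more typical in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointPrompt:
    """Hypothetical container for one click: pixel position plus label."""
    x: int
    y: int
    include: bool  # True = include the region, False = exclude it


def validate_prompts(prompts, width, height):
    """Keep only the prompts whose (x, y) falls inside the frame."""
    return [p for p in prompts if 0 <= p.x < width and 0 <= p.y < height]


def split_by_object_id(mask):
    """Map each non-zero object id to the (x, y) pixels it covers.

    `mask` is a row-major 2D grid at the input resolution:
    0 is background, any other value is an object id.
    """
    objects = {}
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value != 0:
                objects.setdefault(value, []).append((x, y))
    return objects


def bounding_box(pixels):
    """Tight (x_min, y_min, x_max, y_max) box around a pixel list."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


# Toy mask, 4 rows by 5 columns, containing two objects with ids 1 and 2.
mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 0, 2],
]
# Two clicks inside the frame and one outside it.
prompts = [
    PointPrompt(1, 0, True),
    PointPrompt(3, 3, False),
    PointPrompt(9, 9, True),
]

valid = validate_prompts(prompts, width=5, height=4)
objects = split_by_object_id(mask)
for object_id, pixels in sorted(objects.items()):
    print(object_id, len(pixels), bounding_box(pixels))
```

Splitting the integer mask by object ID gives one pixel set per object, which is a convenient starting point for per-object statistics such as area or bounding box.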
## Model Version(s):

* [sam2_hiera_large](https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt)
* [GroundingDINO Swin-Tiny](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/grounding_dino/files?version=grounding_dino_swin_tiny_commercial_deployable_v1.0)

# Datasets:

For details about the dataset, refer to the [SA-V Dataset](https://ai.meta.com/datasets/segment-anything-video/) and the **Model, data and annotation cards** section of the [SAM 2 paper](https://arxiv.org/abs/2408.00714).

## Training and Testing Datasets:

**Link**

[SA-V Dataset Download Link](https://ai.meta.com/datasets/segment-anything-video-downloads/)

**Data Collection Method by dataset**

* Humans: Videos were collected by crowdworkers with unknown equipment via a contracted third-party vendor.

**Labeling Method by dataset**

* Hybrid: Human and Automatic. Masks were generated by the Meta Segment Anything Model 2 (SAM 2) and human annotators.

**Properties (Quantity, Dataset Descriptions, Sensor(s))**

* The SA-V dataset consists of 51K diverse videos and 643K spatio-temporal segmentation masks (i.e., masklets). The videos vary in subject matter. Common themes include locations, objects, and scenes. Masks range from large-scale objects such as buildings to fine-grained details such as interior decorations.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).