meta/sam2
SAM 2 is a segmentation model that enables fast, precise selection of any object in any video or image.
Model Overview
Description:
SAM 2 (Segment Anything Model 2) is a segmentation model that enables fast, precise selection of objects in videos and images. Released by Meta AI Research (FAIR), SAM 2 produces segmentation masks for the object of interest in a single image or throughout all video frames, even if the object temporarily disappears from view. The model can be used for data annotation, object tracking, or general segmentation tasks.
In addition, SAM 2 provides a flexible prompt-based interface that lets users identify objects with a click, a bounding box, or a mask. Users can select one or more objects in any video frame and refine the model's predictions with additional prompts. SAM 2's efficient streaming inference makes it suitable for real-time video applications.
The model's capabilities have been extended to support segmentation via text-based prompts with Grounding DINO, an open-vocabulary object detection model that can detect one or more objects in a frame based on text input.
Third-Party Community Consideration
The SAM 2 model is not owned or developed by NVIDIA. It was developed and built to a third party's requirements for this application and use case. For the SAM 2 model card, refer to the "Model, data and annotation cards" section of the SAM 2 paper.
Grounding DINO is an NVIDIA model, pre-trained on a wide range of commercial datasets whose annotations were either human-generated or pseudo-labeled. For more information, refer to the Grounding DINO model card.
License/Terms of Use
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; and the use of this model is governed by the NVIDIA Community Model License.
References:
Model Architecture
Architecture Type: Transformer
Network Architecture: SAM 2. For more details, refer to the Model section of the SAM 2 paper.
Input:
- Input Types: Image, Video, Integers (Visual Prompts), Text
- Input Formats: Image - JPEG, PNG; Video - MP4
- Input Parameters: 2D, 3D
- Other Properties Related to Input:
  - Visual prompts are (X, Y) points with include/exclude labels, selected by the user by clicking on the image or video in the UI. They allow the user to select the regions of interest for segmentation in an image, or the regions of interest to be tracked in a video.
  - Users can also provide a text description to detect and segment the object of interest.
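The point-style visual prompts described above can be sketched as plain arrays. This is a minimal illustration of the data shape only; the coordinates, labels, and array layout below are illustrative assumptions, not the actual SAM 2 API.

```python
import numpy as np

# (X, Y) pixel coordinates of user clicks on the image or video frame.
# These values are invented for illustration.
point_coords = np.array([[420, 310],   # click on the object of interest
                         [150,  90]])  # click on a region to exclude

# One label per point: 1 = include (foreground), 0 = exclude (background)
point_labels = np.array([1, 0])

# Each point must have exactly one label and two coordinates
assert point_coords.shape == (len(point_labels), 2)
```

A bounding-box prompt would follow the same idea, with each box given as two corner points instead of a single click.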
Output:
- Output Types: Image, Video, Integers (Segmentation Mask)
- Output Formats: Image - JPEG, PNG; Video - MP4
- Output Parameters: 2D, 3D
- Other Properties Related to Output:
  - For image input, the output is the image with segmentation mask(s) overlaid on the object(s) of interest.
  - For video input, the output is the video with segmentation mask(s) overlaid on the object(s) of interest.
  - The segmentation mask matches the input image/video resolution; the background is represented with the value 0 and each object of interest is represented with its respective object ID.
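The mask encoding described above (background as 0, objects as per-object IDs) can be illustrated with a small NumPy sketch; the mask dimensions and values here are invented for illustration.

```python
import numpy as np

# Illustrative 2D mask at the input resolution:
# 0 = background, positive integers = object IDs.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:3, 1:4] = 1   # region segmented as object ID 1
mask[4:6, 5:8] = 2   # region segmented as object ID 2

# Recover the set of object IDs present in the mask (excluding background)
object_ids = np.unique(mask[mask > 0])
print(object_ids.tolist())  # [1, 2]
```

For video, one such mask is produced per frame, with object IDs kept consistent across frames so that each object can be tracked over time.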
Model Version(s):
Datasets:
For details about the dataset, refer to the SA-V dataset and the "Model, data and annotation cards" section of the SAM 2 paper.
Training and Testing Datasets:
Link: SA-V Dataset download.
Data Collection Method by dataset
- Humans: Videos were collected by crowdworkers with unknown equipment via a contracted third-party vendor.
Labeling Method by dataset
- Hybrid: Human and Automatic. Masks generated by the Meta Segment Anything Model 2 (SAM 2) and human annotators.
Properties (Quantity, Dataset Descriptions, Sensor(s))
- The SA-V dataset consists of 51K diverse videos and 643K spatio-temporal segmentation masks (masklets). The videos vary in subject matter; common themes include locations, objects, and scenes. Masks range from large-scale objects such as buildings to fine-grained details such as interior decorations.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.