SAM 2 is a segmentation model that enables fast, precise selection of any object in any video or image.
SAM 2 (Segment Anything Model 2) was released by Meta FAIR (Facebook Artificial Intelligence Research). It produces segmentation masks for an object of interest in a single image or across all frames of a video, even if the object disappears from view. The model can be used for data annotation, object tracking, and general-purpose segmentation.
In addition, SAM 2 provides a flexible prompt-based interface that lets users identify objects with a click, bounding box, or mask. Users can select one or more objects in any video frame and use extra prompts to adjust the model's predictions. SAM 2's efficient video processing with streaming inference makes it suitable for real-time, streaming video applications.
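To make the prompt types above concrete, here is a minimal sketch of how click and bounding-box prompts might be represented as integers and checked against the frame size before being sent to a model. The class names, field names, and helper functions are hypothetical and are not part of any SAM 2 or NVIDIA API; the label convention (1 for a foreground click, 0 for a background click) is an assumption used here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointPrompt:
    """Hypothetical click prompt at pixel (x, y).

    Assumed convention: label 1 marks the object, label 0 marks background
    (useful as an extra prompt to correct a prediction).
    """
    x: int
    y: int
    label: int = 1


@dataclass(frozen=True)
class BoxPrompt:
    """Hypothetical box prompt: top-left (x0, y0), bottom-right (x1, y1), in pixels."""
    x0: int
    y0: int
    x1: int
    y1: int


def validate_point(p: PointPrompt, width: int, height: int) -> None:
    """Raise ValueError if the click lies outside the frame or has a bad label."""
    if not (0 <= p.x < width and 0 <= p.y < height):
        raise ValueError(f"point ({p.x}, {p.y}) is outside a {width}x{height} frame")
    if p.label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {p.label}")


def validate_box(b: BoxPrompt, width: int, height: int) -> None:
    """Raise ValueError if the box is empty, inverted, or outside the frame."""
    if not (0 <= b.x0 < b.x1 <= width and 0 <= b.y0 < b.y1 <= height):
        raise ValueError(f"box {b} does not fit a {width}x{height} frame")
```

Validating prompts on the client side is a design choice rather than a requirement: it catches out-of-range coordinates early instead of relying on the service to reject them.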
The model's capabilities have been extended to support segmentation via text-based prompts with GroundingDINO, an open vocabulary object detection model that can detect one or multiple objects in a frame based on the text input.
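One way to picture the text-prompt path is as a two-stage pipeline: the detector turns a phrase into boxes, and each box then acts as a bounding-box prompt for segmentation. The sketch below shows only the glue step, under stated assumptions: that detections arrive as (label, score, box) tuples, that boxes are normalized (center x, center y, width, height) values in [0, 1], and that a score threshold of 0.35 is reasonable. None of these details come from this model card, and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumed detection layout: (label, confidence score, normalized cxcywh box).
Detection = Tuple[str, float, Tuple[float, float, float, float]]
PixelBox = Tuple[int, int, int, int]


def to_pixel_box(box: Tuple[float, float, float, float],
                 width: int, height: int) -> PixelBox:
    """Convert a normalized (cx, cy, w, h) box to integer pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = int(round((cx - w / 2) * width))
    y0 = int(round((cy - h / 2) * height))
    x1 = int(round((cx + w / 2) * width))
    y1 = int(round((cy + h / 2) * height))

    # Clamp to the frame so the result is always a valid pixel region.
    def clamp(value: int, upper: int) -> int:
        return max(0, min(upper, value))

    return clamp(x0, width), clamp(y0, height), clamp(x1, width), clamp(y1, height)


def detections_to_box_prompts(detections: List[Detection], width: int, height: int,
                              min_score: float = 0.35) -> List[Tuple[str, PixelBox]]:
    """Keep detections at or above min_score and return them as pixel box prompts."""
    return [
        (label, to_pixel_box(box, width, height))
        for label, score, box in detections
        if score >= min_score
    ]
```

For example, a detection `("dog", 0.9, (0.5, 0.5, 0.5, 0.5))` on a 640x480 frame would become the box prompt `(160, 120, 480, 360)`.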
The SAM 2 model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case. For the SAM 2 model card, refer to the "Model, data and annotation cards" section of the SAM 2 paper.
GroundingDINO is an NVIDIA model, pre-trained on a wide range of commercial datasets where the annotations were either human-generated or pseudo-labeled. For more information, refer to the Grounding DINO model card.
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; and the use of this model is governed by the NVIDIA Community Model License.
Architecture Type: Transformer
Network Architecture: SAM 2. For more details, refer to the Model section of the SAM 2 paper.
Input Types: Image, Video, Integers (Visual Prompts), Text
Input Formats: Image - JPEG, PNG; Video - MP4
Input Parameters: 2D, 3D
Other Properties Related to Input:
Output Types: Image, Video, Integers (Segmentation Mask)
Output Formats: Image - JPEG, PNG; Video - MP4
Output Parameters: 2D, 3D
Other Properties Related to Output:
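As an illustration of working with an integer segmentation mask, the sketch below computes the pixel area and tight bounding box of a mask. It assumes the mask is a 2D grid of integers in which nonzero values mark object pixels; this model card does not specify the exact mask encoding, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # assumed layout: mask[y][x], nonzero = object pixel


def mask_area(mask: Mask) -> int:
    """Count the pixels that belong to the object."""
    return sum(1 for row in mask for value in row if value)


def mask_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the inclusive (x_min, y_min, x_max, y_max) box around the object.

    Returns None for an empty mask, for example a video frame in which
    the tracked object is not visible.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return min(xs), min(ys), max(xs), max(ys)
```

Returning `None` for an empty mask lets downstream code distinguish "object not visible in this frame" from a very small detection.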
For details about the dataset, refer to the SA-V Dataset and the "Model, data and annotation cards" section of the SAM 2 paper.
Link
SA-V Dataset Download Link.
Data Collection Method by dataset
Labeling Method by dataset
Properties (Quantity, Dataset Descriptions, Sensor(s))
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.