## Model Overview ### Description: VISTA-3D is a specialized interactive foundation model for 3D medical imaging. It excels in providing accurate and adaptable segmentation analysis across anatomies and modalities. Utilizing a multi-head architecture, VISTA-3D adapts to varying conditions and anatomical areas, helping guide users' annotation workflow. This model is for research purposes and not for clinical usage. * **Segment everything:** Enables whole body exploration, crucial for understanding complex diseases affecting multiple organs and for holistic treatment planning. * **Segment using class:** Provides detailed sectional views based on specific classes, essential for targeted disease analysis or organ mapping, such as tumor identification in critical organs. * **Segment point prompts:** Enhances segmentation precision through user-directed, click-based selection. This interactive approach accelerates the creation of accurate ground-truth data, essential in medical imaging analysis. ### Training Data Details The **VISTA3D model was trained on a large and diverse dataset of 11454 3D CT volumes**. This dataset was curated from in-house and publicly available sources. The training data encompassed a **wide range of acquisition protocols**. The **spatial resolutions** of the scans varied significantly, ranging **from 0.45 × 0.45 × 0.45 mm³ to 1.50 × 1.50 × 7.50 mm³**, with a **median resolution of 0.88 × 0.88 × 1.50 mm³**. This indicates that the training data included scans with varying slice thicknesses and in-plane resolutions [our conversation history]. **Information regarding the gender breakdown of the participants within these datasets is not explicitly provided in the paper or its supplementary material** [our conversation history, 47]. While Table 1 in the supplementary material lists the datasets used and the number of cases, it **does not include demographic information like gender** [our conversation history]. Similarly, Figure 1 in the supplementary material shows the distribution of annotated voxels per class but does not include gender information [our conversation history]. Other relevant details include: * The dataset includes **voxel-wise annotations of anatomical structures and lesions**. * Pseudo-labels of **117 classes** were generated using TotalSegmentator. * Supervoxels were generated for every scan using SAM pre-trained weights. * Each data source was randomly split into **64% training, 16% validation, and 20% test sets**. #### Intended Use The **VISTA3D model is intended to facilitate clinicians and researchers using 3D Computed Tomography (CT) images**. As a highly accurate and clinically applicable segmentation foundation model, it aims to streamline workflows in medical image analysis. Specifically, **CT image segmentation can aid in diagnosis, treatment planning, and disease monitoring** by providing detailed morphological information of body structures and abnormalities. VISTA3D aims to reduce the time-consuming and tedious nature of manual segmentation in clinical practice. #### Capabilities VISTA3D possesses several essential capabilities for 3D CT image segmentation: * **Accurate Automatic Segmentation:** VISTA3D provides **accurate out-of-the-box segmentation for 127 common types of human anatomical structures and various lesions**. For supported classes with sufficient labeled data, the model aims to achieve state-of-the-art or comparable performance to dataset-specific models. * **Interactive Refinement:** The model supports **3D interactive segmentation**, allowing users to conveniently edit and refine automatic segmentation results through point clicks. This enables effective correction of inaccurate automatic segmentations. * **Zero-Shot Segmentation:** VISTA3D exhibits **state-of-the-art zero-shot performance** for unseen classes using its interactive branch. This allows users to interactively annotate novel structures with minimal annotation effort. This capability is enhanced by the distillation of image understanding from SAM through the generation of 3D supervoxels. * **3D Operation:** The model operates directly on **3D volumetric images**, leveraging 3D visual contexts rather than relying on time-consuming 2D slice-by-slice methods. * **Few-Shot/Transfer Learning:** VISTA3D demonstrates **strong transfer learning ability**, allowing users to quickly adapt the model to perform segmentation on new classes with only a few annotated examples. The VISTA3D model architecture includes two branches, an **automatic branch for direct segmentation of supported classes** and an **interactive branch that accepts user clicks** for both supported and novel zero-shot classes. These branches share the same image encoder. ### Terms of use By using this model, you are agreeing to the [terms and conditions](https://docs.nvidia.com/ai-foundation-models-community-license.pdf) of the license. ### References(s): Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick, 2023. High-resolution 3D abdominal segmentation with random patch network fusion. Segment Anything. arXiv:2304.02643 ### Model Architecture: **Architecture Type:** Transformer
**Network Architecture:** SegResNet + Prompt Encoding ### Input: **Input Type(s):** Computed Tomography (CT) Image
**Input Format(s):** (Neuroimaging Informatics Technology Initiative) NIfTI
**Input Parameters:** Three-Dimensional (3D)
**Other Properties Related to Input:** Array of Class/Point Information ### Output: **Output Type(s):** Image
**Output Format:** NIfTI
**Output Parameters:** 3D
## Software Integration: **Runtime Engine(s):** MONAI Core v.1.4
**Supported Hardware Microarchitecture Compatibility:**
* Ampere
* Hopper
**[Preferred/Supported] Operating System(s):**
* Linux
## Inference: **Engine:** [Triton](https://developer.nvidia.com/triton-inference-server)
**Test Hardware:** A100, H100, L40
## Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). ## Terms of use By using this model, you are agreeing to the [terms and conditions](https://docs.nvidia.com/ai-foundation-models-community-license.pdf) of the license. ## References(s) He, Y., Guo, P., Tang, Y., Myronenko, A., Nath, V., Xu, Z., ... & Li, W. (2024). Vista3d: Versatile imaging segmentation and annotation model for 3d computed tomography. CVPR2025.