Multi-modal vision-language model that understands text/images and generates informative responses
NeVA is NVIDIA's version of the LLaVA model, in which the open-source LLaMA model is replaced with a GPT model trained by NVIDIA. At a high level, the image is encoded with a frozen Hugging Face CLIP model and projected to the text embedding dimensions. The projected image features are then concatenated with the embeddings of the text prompt and passed through the language model. Training happens in two stages: pretraining on image-caption pairs (CC3M), followed by fine-tuning on synthetic instruction-following data generated by GPT-4.
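The data flow described above can be illustrated with a minimal PyTorch-style sketch. All module and variable names below (`vision_encoder`, `projection`, `language_model`, and the dimension values) are illustrative assumptions, not the actual NeVA implementation:

```python
import torch
import torch.nn as nn

class NeVALikeModel(nn.Module):
    """Illustrative sketch of a LLaVA/NeVA-style architecture (not the actual implementation)."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, text_dim=6144):
        super().__init__()
        self.vision_encoder = vision_encoder      # frozen CLIP image encoder
        self.language_model = language_model      # GPT-style language model backbone
        # Linear projection from the CLIP feature space to the LM's embedding dimension
        self.projection = nn.Linear(vision_dim, text_dim)

        # The CLIP encoder stays frozen
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

    def forward(self, image, prompt_token_ids):
        # 1. Encode the RGB image into patch features with the frozen CLIP encoder
        with torch.no_grad():
            image_features = self.vision_encoder(image)       # (B, num_patches, vision_dim)

        # 2. Project image features into the text embedding space
        image_embeds = self.projection(image_features)         # (B, num_patches, text_dim)

        # 3. Embed the text prompt and concatenate the image embeddings in front of it
        text_embeds = self.language_model.embed_tokens(prompt_token_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

        # 4. Run the combined sequence through the language model
        return self.language_model(inputs_embeds=inputs_embeds)
```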
Architecture Type: Transformer
Network Architecture: GPT + CLIP
Model version: 22B
Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: temperature, top-p, max output tokens, seed
Output Format: Text
Output Parameters: None
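As an illustration of how these inputs are typically supplied, the sketch below assembles a request with a base64-encoded RGB image, a text prompt, and the sampling parameters listed above. The endpoint URL and payload field names are placeholders, not the documented API; consult the official NeVA API reference for the actual request schema:

```python
import base64
import requests

# Hypothetical endpoint for illustration only.
INVOKE_URL = "https://example.invalid/v1/neva/generate"

# Encode an RGB image as base64 so it can be sent alongside the text prompt
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "Describe this image in detail.",
    "image": image_b64,        # RGB image, base64-encoded
    "temperature": 0.2,        # sampling temperature
    "top_p": 0.7,              # nucleus sampling threshold
    "max_tokens": 512,         # maximum output tokens
    "seed": 42,                # for reproducible sampling
}

response = requests.post(INVOKE_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json())         # the output is text; no additional output parameters
```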
Runtime(s): N/A
Supported Hardware Platform(s): Hopper, Ampere/Turing
Supported Operating System(s): Linux
Link: CC-3M
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of CC3M images and captions filtered to 595,000 samples.
Dataset License:
Link: Synthetic data generated by GPT-4
Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of 158,000 samples generated synthetically by GPT-4. It is a mix of short question-answer pairs, detailed image descriptions, and higher-level reasoning questions.
Dataset License: CC BY-NC 4.0
Engine: Triton and TensorRT-LLM
Test Hardware: Other
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.