nvidia / neva-22b

Multi-modal vision-language model that understands text/images and generates informative responses

Model Overview


NeVA is NVIDIA's version of the LLaVA model, in which the open-source LLaMA model is replaced with a GPT model trained by NVIDIA. At a high level, the image is encoded by a frozen Hugging Face CLIP model and projected to the text embedding dimension. The projected image embeddings are then concatenated with the prompt embeddings and passed through the language model. Training happens in two stages:

  • Pretraining: The language model is frozen and only the projection layer (which maps the image encoding to the embedding space) is trained, using image-caption pairs.
  • Finetuning: The language model is trained along with the projection layer, using synthetic instruction data generated with GPT-4.
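
The data flow and the two training stages described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not NVIDIA's implementation: the dimensions, the function names (`project_rows`, `build_multimodal_input`), the `TRAINABLE` table, and the choice to place the image embeddings before the prompt embeddings are all hypothetical. The model card only states that the two are concatenated.

```python
import random


def project_rows(rows, weight):
    """Multiply each row vector (length d_in) by a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) for j in range(d_out)]
        for row in rows
    ]


def build_multimodal_input(image_features, prompt_embeddings, projection):
    """Project image features to the text embedding size and concatenate.

    Assumption: image embeddings are placed before the prompt embeddings.
    """
    projected = project_rows(image_features, projection)
    return projected + prompt_embeddings


# Which components are updated in each stage, following the description above.
# The CLIP encoder is described as frozen, so it is never trained here.
TRAINABLE = {
    "pretraining": {"clip_encoder": False, "projection": True, "language_model": False},
    "finetuning": {"clip_encoder": False, "projection": True, "language_model": True},
}

# Toy sizes (hypothetical); the real model uses far larger dimensions.
VISION_DIM, TEXT_DIM = 4, 6
NUM_PATCHES, NUM_TOKENS = 3, 5

rng = random.Random(0)
# Stand-ins for the frozen CLIP output and the prompt token embeddings.
image_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
prompt_embeddings = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(NUM_TOKENS)]
projection = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]

sequence = build_multimodal_input(image_features, prompt_embeddings, projection)
print(len(sequence), len(sequence[0]))  # 8 6
```

The combined sequence has one row per image patch plus one per prompt token, all of the text embedding width, which is what lets the language model consume both modalities as a single input.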


Model Architecture:

Architecture Type: Transformer
Network Architecture: GPT + CLIP
Model version: 22B


Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: temperature, top-p, max output tokens, seed
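
The model card lists these parameters without defining them. As a rough guide, the sketch below shows how temperature, top-p, max output tokens, and seed usually interact in generic nucleus sampling. It is not the NeVA or serving-engine implementation; `sample_token`, `generate`, and `toy_logits` are hypothetical names, and the fixed logits stand in for a real model.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Pick one token index using temperature scaling and top-p (nucleus) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Temperature rescales the logits: lower values sharpen the distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for index in ranked:
        nucleus.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    weights = [probs[index] for index in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


def generate(next_logits, max_output_tokens, temperature, top_p, seed):
    """Sample up to max_output_tokens tokens; the seed makes the run repeatable."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_output_tokens):
        tokens.append(sample_token(next_logits(tokens), temperature, top_p, rng))
    return tokens


def toy_logits(tokens):
    """Hypothetical stand-in for a model: fixed logits over a 4-token vocabulary."""
    return [0.1, 2.0, 0.3, 1.0]


first = generate(toy_logits, max_output_tokens=8, temperature=0.7, top_p=0.9, seed=42)
second = generate(toy_logits, max_output_tokens=8, temperature=0.7, top_p=0.9, seed=42)
print(first == second)  # True: same seed, same parameters, same output
```

In this sketch a very small top-p reduces the candidate set to the single most likely token, so sampling becomes effectively greedy, while a higher temperature spreads probability across more tokens.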


Output Format: Text
Output Parameters: None

Software Integration:

Runtime(s): N/A
Supported Hardware Platform(s): Hopper, Ampere/Turing
Supported Operating System(s): Linux

Training & Finetuning:

Pretraining Dataset:

Link: CC-3M

Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset consists of CC-3M images and captions, filtered to 595,000 samples.

Dataset License:

Finetuning Dataset:

Link: Synthetic data generated by GPT-4

Properties (Quantity, Dataset Descriptions, Sensor(s)):
The dataset has 158,000 samples generated synthetically by GPT-4. It consists of a mix of short question-answer pairs, detailed image descriptions, and higher-level reasoning questions.

Dataset License: CC BY-NC 4.0


Engine: Triton and TensorRT-LLM
Test Hardware: Other

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.