
GPU-accelerated model optimized for scoring how likely a given passage is to contain the information needed to answer a question.
The llama-nemotron-rerank-vl-1b-v2 was developed by NVIDIA for multimodal question-answering retrieval. It is optimized for providing a logit score that represents how relevant a document page is to a given query. The model can process documents in the form of image, text, or image and text combined. The expected images are screenshots of document pages or slides. Documents are ranked given a user query in text form. The model supports images containing text, tables, charts, and infographics. We report the model's performance by evaluating it on the popular ViDoRe V1 and V2 benchmarks, the new ViDoRe V3 benchmark (see the ViDoRe leaderboard for details), and two internally curated visual retrieval datasets.
The reranking model serves as a key component of a multimodal retrieval system, such as a vision RAG pipeline, where it helps improve overall accuracy. A multimodal retrieval system often uses a dense multimodal embedding model to return relevant documents for the input query. A reranking model can then reorder those candidates into a final ranking. The reranking model takes query-document pairs as input, so its self-attention can model deeper interactions between their tokens. Applying a ranking model to every document in the knowledge base for each query does not scale; therefore, ranking models are typically deployed to rerank only the top candidate documents retrieved by embedding models.
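The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `toy_embed_score` and `toy_cross_score` are hypothetical stand-ins for the embedding model and the reranker, and `retrieve_then_rerank` is a name introduced here for the example.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    embed_score: Callable[[str, str], float],
    cross_score: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Stage 1 scores the whole corpus cheaply; stage 2 reranks only the top_k."""
    # Stage 1: cheap scorer over every document, keep the best top_k candidates.
    candidates = sorted(
        corpus, key=lambda doc: embed_score(query, doc), reverse=True
    )[:top_k]
    # Stage 2: expensive pairwise scorer applied to the candidates only.
    reranked = [(doc, cross_score(query, doc)) for doc in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy scorers: word overlap (stage 1) and overlap ratio (stage 2).
def toy_embed_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_cross_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(d), 1)


corpus = [
    "robots use ai to plan",
    "ai for cars on the road and more",
    "dna models",
    "cooking pasta",
]
ranking = retrieve_then_rerank(
    "ai robots", corpus, toy_embed_score, toy_cross_score, top_k=2
)
```

The expensive scorer runs on only `top_k` pairs per query, which is why the reranker is placed after the embedding stage rather than applied to the whole knowledge base.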
This model is ready for commercial use.
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service; and use of this model is governed by the NVIDIA Open Model License.
ADDITIONAL INFORMATION: Llama 3.2 Community License Agreement. Built with Llama.
Global
The llama-nemotron-rerank-vl-1b-v2 is most suitable for users who want to build a multimodal question-and-answer application over a large corpus, leveraging the latest dense retrieval technologies.
Hugging Face: 12/18/2025 via https://huggingface.co/nvidia/llama-nemotron-rerank-vl-1b-v2
Build.NVIDIA.com: 03/13/2026 via link
Technical report - "Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model"
Architecture Type: Transformer
Network Architecture: Eagle VLM architecture with the SigLIP 2 400M vision encoder and the llama-nemotron-rerank-1b-v2 model as the language model.
The llama-nemotron-rerank-vl-1b-v2 is a cross-encoder model with approximately 1.7B parameters. It is a fine-tuned version of an NVIDIA Eagle-family model, which consists of the SigLIP 2 400M vision encoder and the Llama 3.2 1B language model. The final embedding output by the decoder is aggregated using a mean pooling strategy, and a binary classification head is fine-tuned for the ranking task. The CrossEntropy loss is used to maximize the likelihood of (visual) documents containing information to answer the question and minimize the likelihood for (negative) documents that do not contain information to answer the question.
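To make the scoring head concrete, here is a minimal pure-Python sketch of mean pooling followed by a linear head and a cross-entropy loss on the resulting logit. It is an illustration under assumptions: the function names, the single-logit head shape, and the per-pair binary form of the loss are not confirmed by this model card, which only states that mean pooling, a binary classification head, and a CrossEntropy loss are used.

```python
import math
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, so padding is ignored."""
    dim = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    count = max(len(kept), 1)
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]


def relevance_logit(pooled: List[float], weight: List[float], bias: float) -> float:
    """Linear head mapping the pooled vector to one relevance logit (assumed shape)."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias


def binary_cross_entropy(logit: float, label: int) -> float:
    """Cross-entropy for one (query, document) pair, computed from the raw logit."""
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Toy values: three token vectors, the last one is padding.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
logit = relevance_logit(pooled, weight=[0.5, -0.5], bias=0.1)
loss_if_positive = binary_cross_entropy(logit, label=1)
```

Minimizing this loss pushes the logit up for documents labeled as containing the answer and down for negatives, which matches the training objective described above.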
The vision-language model reranker incorporates key innovations from NVIDIA, including Eagle 2 work which uses a tiling-based VLM architecture, and nemoretriever-parse. The Eagle 2 architecture, available on Hugging Face, significantly enhances multimodal understanding through its dynamic tiling and mixture of vision encoders design. It particularly improves performance on tasks that involve high-resolution images and complex visual content.
Input Type(s): Image, Text
Input Format(s):
Input Parameters:
Other Properties Related to Input:
The model was fine-tuned exclusively on image data, using max_input_tiles = 4 and the maximum context length of 2048 tokens. For evaluation, it was tested on image-only, image+text, and text-only inputs, with max_input_tiles = 6 and the maximum context length of 10240 tokens. Inputs exceeding the maximum length are truncated.
Output Type(s): Floats
Output Format(s): List of Floats
Output Parameters: 1D
Other Properties Related to Output: Each value corresponds to a raw logit. Users can choose to apply a Sigmoid activation function to the logits to convert them into probabilities during model usage.
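As a small illustration of the optional Sigmoid step, the sketch below converts raw logits into probabilities using only the standard library. The logit values are made up for the example, and `logits_to_probabilities` is a hypothetical helper name; in PyTorch code, `torch.sigmoid(logits)` does the same thing on a tensor.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logits_to_probabilities(logits: List[float]) -> List[float]:
    """Map each raw relevance logit to a value in (0, 1)."""
    return [sigmoid(value) for value in logits]


example_logits = [2.0, 0.0, -3.0]  # made-up values, not model output
probabilities = logits_to_probabilities(example_logits)
```

Because the Sigmoid function is monotonic, applying it changes the scale of the scores but not the ranking order of the documents.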
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
The model requires transformers version >= 4.56.0, and optionally flash-attention.
pip install "transformers>=4.56.0"
pip install "flash-attn>=2.6.3,<2.8" --no-build-isolation
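Because flash-attention is optional, one way to avoid a hard failure when it is not installed is to pick the attention implementation at runtime. The helper below is a hypothetical sketch: `pick_attn_implementation` is not part of transformers, and whether this model's remote code accepts the `"eager"` fallback is an assumption to verify before relying on it.

```python
import importlib.util


def pick_attn_implementation(fallback: str = "eager") -> str:
    """Return "flash_attention_2" if the flash_attn package is importable, else the fallback."""
    # find_spec only checks that the package can be located; it does not
    # confirm that a compatible GPU is present.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return fallback


attn_implementation = pick_attn_implementation()
```

The returned string could then be passed as the `attn_implementation` argument of `from_pretrained` in place of the hard-coded value.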
import torch
from transformers import AutoModelForSequenceClassification, AutoProcessor
from transformers.image_utils import load_image

modality = "image"

# Load model
model_path = "nvidia/llama-nemotron-rerank-vl-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map="auto"
).eval()

# Build processor kwargs (base settings)
processor_kwargs = {
    "trust_remote_code": True,
    "max_input_tiles": 6,
    "use_thumbnail": True
}

# Set rerank_max_length based on modality
if modality == "image":
    processor_kwargs["rerank_max_length"] = 2048
elif modality == "image_text":
    processor_kwargs["rerank_max_length"] = 10240
elif modality == "text":
    processor_kwargs["rerank_max_length"] = 8192

# Load processor with modality-specific kwargs
processor = AutoProcessor.from_pretrained(
    model_path,
    **processor_kwargs
)

query = "How is AI improving the intelligence and capabilities of robots?"
image_paths = [
    "https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg",
    "https://blogs.nvidia.com/wp-content/uploads/2018/01/automotive-key-visual-corp-blog-level4-av-og-1280x680-1.png",
    "https://developer-blogs.nvidia.com/wp-content/uploads/2025/02/hc-press-evo2-nim-25-featured-b.jpg"
]

# Load all images
images = [load_image(img_path) for img_path in image_paths]

# Text descriptions corresponding to each image/document
document_texts = [
    "AI enables robots to perceive, plan, and act autonomously.",
    "AI is transforming autonomous vehicles by enabling safer, smarter, and more reliable decision-making on the road.",
    "A biological foundation model designed to analyze and generate DNA, RNA, and protein sequences.",
]

if modality == "image":
    # Prepare inputs: same query, different images
    examples = [{
        "question": query,
        "doc_text": "",
        "doc_image": image
    } for image in images]
elif modality == "image_text":
    # Prepare inputs: same query, paired images and texts
    examples = [{
        "question": query,
        "doc_text": doc_text,
        "doc_image": image
    } for image, doc_text in zip(images, document_texts)]
elif modality == "text":
    # Prepare inputs: same query, different texts
    examples = [{
        "question": query,
        "doc_text": doc_text,
        "doc_image": ""
    } for doc_text in document_texts]
else:
    raise ValueError(f"Invalid modality: {modality}. Must be 'image', 'image_text', or 'text'")

# Process with processor
batch_dict = processor.process_queries_documents_crossencoder(examples)

# Move to device
batch_dict = {
    k: v.to(device) if isinstance(v, torch.Tensor) else v
    for k, v in batch_dict.items()
}

# Run inference
with torch.no_grad():
    outputs = model(**batch_dict, return_dict=True)

# Get logits
logits = outputs.logits
logits_flat = logits.squeeze(-1)

# Get sorted indices (highest to lowest)
sorted_indices = torch.argsort(logits_flat, descending=True)

print(f"\nRanking (highest to lowest relevance for the modality {modality}):")
for rank, idx in enumerate(sorted_indices, 1):
    doc_idx = idx.item()
    logit_val = logits_flat[doc_idx].item()
    if modality == "text":
        print(f" Rank {rank}: logit={logit_val:.4f} | Text: {document_texts[doc_idx]}")
    else:  # image or image_text modality
        print(f" Rank {rank}: logit={logit_val:.4f} | Image: {image_paths[doc_idx]}")
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Blackwell
Preferred/Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
llama-nemotron-rerank-vl-1b-v2
The development of large-scale, public open-QA datasets has driven significant progress in powerful vision-language models, as well as vision embedding and reranking models. However, the following issues limit the use of these models in commercial settings:
NVIDIA's training dataset is based on public QA datasets, and only includes datasets that have a license for commercial applications.
Properties: The model was fine-tuned on publicly available image datasets. For the image corpora whose original queries were produced using proprietary models, we also generated synthetic queries.
Data Modality
Image Training Data Size
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
We evaluated the embedding + reranking pipeline on a set of evaluation benchmarks, applying the ranking model to the candidates retrieved by the llama-nemotron-embed-vl-1b-v2 model.
Vision document retrieval benchmarks
We evaluated llama-nemotron-rerank-vl-1b-v2 on five visual document retrieval datasets: the popular ViDoRe V1 and V2, the new ViDoRe V3, and two internal visual document retrieval datasets (DigitalCorpora-10k and Earnings V2).
For those interested in reproducing our results, one of our internal datasets (DigitalCorpora-10k) can be created by following instructions in this notebook from the NeMo Retriever Extraction GitHub repository.
Text retrieval benchmarks
We evaluated llama-nemotron-rerank-vl-1b-v2 on 92 text retrieval datasets, from the benchmarks BEIR, MIRACL (multi-language), MLQA (cross-language) and MLDR (long-context).
In this section, we report the performance of llama-nemotron-rerank-vl-1b-v2 on different input modalities. As the table below shows, adding the VLM reranking model on top of the VLM embedding baseline raises the Avg Recall@5 on the 5 evaluation datasets by approximately 7.2% for the text modality, 6.9% for the image modality, and 6% for the image + text modality. These figures are relative improvements over the baseline (for example, from 71.04% to 76.12% for text).
Note: Image+Text modality means that both the page image and its text (extracted using ingestion libraries like NV-Ingest) are fed as input to the reranking model for more accurate representation and retrieval.
Visual Document Retrieval benchmarks — Avg Recall@5 on DC10k, Earnings V2, ViDoRe V1, V2, V3
| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
| + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
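For readers less familiar with the metric in the table above, the sketch below shows the standard definition of Recall@k for a single query, plus the relative-gain arithmetic behind the quoted improvements. The helper names are hypothetical, and how the authors average across queries and datasets is not stated here, so this illustrates the metric rather than the exact evaluation code.

```python
from typing import List, Set


def recall_at_k(ranked_doc_ids: List[str], relevant_doc_ids: Set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k of the ranking."""
    if not relevant_doc_ids:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant_doc_ids)
    return hits / len(relevant_doc_ids)


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# One of two relevant pages ("a") is ranked in the top 5; "f" is ranked sixth.
example_recall = recall_at_k(["a", "b", "c", "d", "e", "f"], {"a", "f"}, k=5)

# Relative gains computed from the Avg Recall@5 values in the table.
text_gain = relative_gain_percent(71.04, 76.12)
image_gain = relative_gain_percent(71.20, 76.12)
image_text_gain = relative_gain_percent(73.24, 77.64)
```

Rounded to one decimal place, the three gains come out to 7.2, 6.9, and 6.0 percent, matching the figures quoted for the text, image, and image + text modalities.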
The table below compares the retrieval accuracy of llama-nemotron-rerank-vl-1b-v2 with two other publicly available multimodal reranker models: jina-reranker-m0 and MonoQwen2-VL-v0.1. The Jina model does not have a commercial license and does not support the image+text modality out of the box, so we report only image-only and text-only evaluation scores for it.
| Model | Text | Image | Image+Text |
|---|---|---|---|
| llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
| jina-reranker-m0 | 69.31% | 78.33% | NA |
| MonoQwen2-VL-v0.1 | 74.70% | 75.80% | 75.98% |
The llama-nemotron-rerank-vl-1b-v2 demonstrates competitive retrieval accuracy on text retrieval benchmarks, comparable to NVIDIA's text-only reranking model llama-nemotron-rerank-1b-v2. This means you can deploy NVIDIA's VLM-based llama-nemotron-embed-vl-1b-v2 embedding model together with the llama-nemotron-rerank-vl-1b-v2 reranking model, regardless of whether your retrieval corpus consists of images, text, or both.
Text Retrieval benchmarks (chunk retrieval) — Avg. Recall@5
| Model | BEIR retrieval + TechQA | MIRACL | MLQA | MLDR | Average |
|---|---|---|---|---|---|
| llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2 | 73.64% | 65.80% | 86.83% | 68.49% | 73.69% |
| llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 73.18% | 65.71% | 87.05% | 69.98% | 73.98% |
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties: More details on ViDoRe benchmarks can be found on their Hugging Face page.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Explainability, Bias, Safety, and Privacy sections.
Please report security vulnerabilities or NVIDIA AI Concerns here.