---
title: "llama-3.2-nemoretriever-500m-rerank-v2"
publisher: "nvidia"
type: "endpoint"
updated: "2025-06-25T20:34:06.600Z"
description: "GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question."
canonical: "https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-500m-rerank-v2"
---

## **Model Overview**

### **Description**

The Llama 3.2 NeMo Retriever Reranking 500M model is optimized for providing a logit score that represents how relevant one or more documents are to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8192 tokens). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.

The reranking model is a component of a text retrieval system that improves overall accuracy. A text retrieval system often uses an embedding model (dense) or a lexical search (sparse) index to return relevant text passages given the input. A reranking model can then rerank the potential candidates into a final order. Because the reranking model takes question-passage pairs as input, it can apply cross-attention between the words of the question and the passage. Applying a ranking model to every document in the knowledge base is not feasible, so ranking models are often deployed in combination with embedding models.
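The two-stage flow described above can be sketched as follows. This is a minimal illustration rather than the NIM implementation: `retrieve_then_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers are simple word-overlap heuristics standing in for the embedding model and the reranking model.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score the whole corpus cheaply, then rerank only the top_k candidates."""
    # Stage 1: cheap scoring over every passage (stand-in for dense or sparse retrieval).
    candidates = sorted(
        corpus, key=lambda p: first_stage_score(query, p), reverse=True
    )[:top_k]
    # Stage 2: more expensive pairwise scoring, applied to the shortlist only.
    scored = [(p, rerank_score(query, p)) for p in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy scorers, used only to make the sketch runnable.
def word_overlap(query: str, passage: str) -> float:
    """Number of distinct query words that appear in the passage."""
    q = set(query.lower().split())
    return float(len(q & set(passage.lower().split())))


def overlap_ratio(query: str, passage: str) -> float:
    """Fraction of passage words that are query words."""
    words = passage.lower().split()
    q = set(query.lower().split())
    return sum(w in q for w in words) / max(len(words), 1)


corpus = [
    "the gpu has high memory bandwidth",
    "memory bandwidth of the gpu is measured in terabytes per second today",
    "bananas are yellow",
    "gpu memory bandwidth",
]
ranked = retrieve_then_rerank(
    "gpu memory bandwidth", corpus, word_overlap, overlap_ratio, top_k=3
)
for passage, score in ranked:
    print(f"{score:.2f}  {passage}")
```

The point of the split is cost: the first stage touches every passage, while the pairwise scorer only sees `top_k` candidates per query.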

This 500M version is pruned from the 1B version. It shares the same overall architecture but is smaller and faster. Users should expect 90-95% of the accuracy of the 1B version, with lower latency (as much as a 2-3x improvement) and reduced memory usage.

This model is ready for commercial use.

The Llama 3.2 NeMo Retriever Reranking 500M model is a part of the NeMo Retriever collection of NIM, which provides state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for their domain-specific use cases, such as information technology, human resource help assistants, and research & development assistants.

### **License/Terms of use**

GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); except for the model which is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).

ADDITIONAL INFORMATION: [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama.

### **Intended use**

The Llama 3.2 NeMo Retriever Reranking 500M model is most suitable for users who are focused on performance and latency, and want to improve their multilingual retrieval tasks by reranking a set of candidates for a given question.

### **Model Architecture: Llama-3.2 500M Ranker**

**Architecture Type:** Transformer <br>
**Network Architecture:** Fine-tuned meta-llama/Llama-3.2-1B

The Llama 3.2 NeMo Retriever Reranking 500M model is a transformer encoder fine-tuned for contrastive learning. We employ bi-directional attention when fine-tuning for higher accuracy. The last embedding output by the decoder model is used with a mean pooling strategy, and a binary classification head is fine-tuned for the ranking task.

Ranking models for text ranking are typically trained as a cross-encoder for sentence classification. This involves predicting the relevancy of a sentence pair (for example, question and chunked passages). The CrossEntropy loss is used to maximize the likelihood of passages containing information to answer the question and minimize the likelihood for (negative) passages that do not contain information to answer the question.
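As a rough illustration of this objective, the sketch below computes a binary cross-entropy loss over question-passage logits, where label 1 marks a passage that answers the question and label 0 marks a negative passage. The exact loss formulation and batching used in training are not given in this card, so the function name and the toy values are assumptions.

```python
import math
from typing import List


def binary_cross_entropy_with_logits(logits: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy over (logit, label) pairs."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Toy logits: one positive passage and one negative passage.
well_ranked = binary_cross_entropy_with_logits([4.0, -4.0], [1, 0])
badly_ranked = binary_cross_entropy_with_logits([-4.0, 4.0], [1, 0])
print(well_ranked < badly_ranked)
```

The loss is small when positives receive high logits and negatives receive low ones, and grows when the two are swapped.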

We train the model on public datasets described in the Dataset and Training section.

### **Input**

**Input Type:** Pair of Texts <br>
**Input Format:** List of text pairs <br>
**Input Parameters:** 1D <br>
**Other Properties Related to Input:** The model was trained on question and answering over text documents from multiple languages. It was evaluated to work successfully with up to a sequence length of 8192 tokens. Longer texts are recommended to be either chunked or truncated.
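A minimal sketch of the two options for long texts, truncation and chunking, is shown below. It operates on a list of tokens and leaves tokenization to the caller. The helper names and the `overlap` parameter are hypothetical, and the card does not state whether the 8192-token limit applies to the passage alone or to the full question-passage pair.

```python
from typing import List

MAX_TOKENS = 8192  # sequence length the model was evaluated with, per this card


def truncate(tokens: List[str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Keep only the first max_tokens tokens."""
    return tokens[:max_tokens]


def chunk(
    tokens: List[str], max_tokens: int = MAX_TOKENS, overlap: int = 0
) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    if not tokens:
        return []
    step = max_tokens - overlap
    # Stop early enough that the final window is not fully contained in the previous one.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + max_tokens] for i in range(0, last_start, step)]


tokens = [str(i) for i in range(10)]
print(truncate(tokens, 4))
print(chunk(tokens, max_tokens=4, overlap=1))
```

With chunking, each chunk can be scored against the question separately; how the per-chunk scores are combined (for example, taking the maximum) is left to the application.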

### **Output**

**Output Type:** Floats <br>
**Output Format:** List of floats <br>
**Output Parameters:** 1D <br>
**Other Properties Related to Output:** Each float is the probability score (or raw logit) of one question-passage pair. Users can decide to apply a Sigmoid activation function to the logits in their usage of the model.
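Applying a Sigmoid to raw logits, as mentioned above, could look like the following sketch. The function names and the example logits are illustrative only.

```python
import math
from typing import List


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def logits_to_probabilities(logits: List[float]) -> List[float]:
    """Convert a list of raw logits into probability-like scores."""
    return [sigmoid(z) for z in logits]


example_logits = [3.0, 0.0, -2.5]  # made-up values for illustration
probabilities = logits_to_probabilities(example_logits)
print(probabilities)
```

Because the Sigmoid is monotonic, it changes the scale of the scores but not the order of the passages.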

### **Software Integration**

**Runtime:** **Llama 3.2 NeMo Retriever Reranking 500M** NIM <br>
**Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace <br>
**Supported Operating System(s):** Linux <br>

### **Model Version(s)**

Llama 3.2 NeMo Retriever Reranking 500M <br>
Short Name: llama-3-2-nemoretriever-rerankqa-500m <br>

## **Training Dataset & Evaluation**

### **Training Dataset**

The development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset named [MSMARCO](https://microsoft.github.io/msmarco/) restricts commercial licensing, limiting the use of these models in commercial settings. To address this, NVIDIA created its own training dataset blend based on public QA datasets, which each have a license for commercial applications.

**Data Collection Method by dataset**: Automated, Unknown

**Labeling Method by dataset:** Automated, Unknown

**Properties:** This model was trained on 800k samples from public datasets.

### **Evaluation Results**

We evaluate the pipelines on a set of evaluation benchmarks. We applied the ranking model to the candidates retrieved from a retrieval embedding model.

Overall, the pipeline llama-3.2-nv-embedqa-1b-v2 + llama-3-2-nemoretriever-rerankqa-500m provides high BEIR+TechQA accuracy with multilingual and crosslingual support. The llama-3-2-nemoretriever-rerankqa-500m ranking model is 3.5x smaller than the nv-rerankqa-mistral-4b-v3 model.

**Data Collection Method by Dataset**

| Dataset   | Data Collection Method                                                                 |
|-----------|----------------------------------------------------------------------------------------|
| NQ        | Real Google search queries paired with corresponding Wikipedia articles               |
| HotpotQA  | Collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.|
| FiQA      | Collection from StackExchange posts in personal finance domain and user-generated content from WorldEconomicForum |
| TechQA    | Curated from real user questions on technical forums and IBM developer community      |
| MIRACL    | Collection from Wikipedia articles across 18 different languages                      |
| MLQA      | Parallel text extraction from Wikipedia articles in 7 languages                       |
| MLDR      | Collection from Wikipedia and mC4 multilingual corpus with cross-lingual alignment techniques |

**Labeling Method by Dataset**

| Dataset   | Labeling Method                                                                       |
|-----------|---------------------------------------------------------------------------------------|
| NQ        | Combination of automated processes and human annotators identifying answer spans in Wikipedia articles |
| HotpotQA  | Manual labeling                                                      |
| FiQA      | Combination of accepted answers from StackExchange and manually annotated sentiment scores for financial texts |
| TechQA    | Manual curation and labeling by domain experts                                       |
| MIRACL    | Combined automated labeling with human verification across 18 languages             |
| MLQA      | Manual alignment and verification of parallel texts across 7 languages              |
| MLDR      | Automated labeling from document sections and cross-references                      |

We evaluated the NVIDIA Retrieval QA Embedding Model against open and commercial retriever models from the literature on academic benchmarks for question-answering: [NQ](https://huggingface.co/datasets/BeIR/nq), [HotpotQA](https://huggingface.co/datasets/hotpot_qa), and [FiQA (Finance Q\&A)](https://huggingface.co/datasets/BeIR/fiqa) from the BeIR benchmark, plus the TechQA dataset. The metric used in this benchmark was Recall@5. Because a reranker operates on retrieved candidates, the ranking model is applied to the output of an embedding model.

| Open & Commercial Reranker Models | Average Recall@5 on NQ, HotpotQA, FiQA, TechQA dataset |
| ----- | ----- |
| llama-3.2-nv-embedqa-1b-v2 + **llama-3-2-nemoretriever-rerankqa-500m** | **72.03%** |
| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nemoretriever-rerankqa-1b-v2 | 73.64% |
| llama-3.2-nv-embedqa-1b-v2 | 68.60% |
| nv-embedqa-e5-v5 \+ nv-rerankQA-mistral-4b-v3 | 75.45% |
| nv-embedqa-e5-v5 | 62.07% |
| nv-embedqa-e5-v4 | 57.65% |
| e5-large\_unsupervised | 48.03% |
| BM25 | 44.67% |
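For readers unfamiliar with the metric, the sketch below shows one common definition of Recall@k: the fraction of a query's relevant passages that appear in the top k results. The card does not spell out the exact variant used for the reported numbers, so this definition, the function names, and the toy document IDs are assumptions.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of relevant passages found among the top k ranked passages."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Average Recall@k over (ranking, relevant set) pairs, one pair per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy data: two queries with hypothetical document IDs.
runs = [
    (["d3", "d1", "d9", "d4", "d7", "d2"], {"d1", "d2"}),  # d2 falls outside the top 5
    (["d5", "d8", "d6"], {"d5"}),
]
print(mean_recall_at_k(runs, k=5))
```

A reranker improves this metric by moving relevant passages that the first stage placed below position k into the top k.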

We evaluated the model’s multilingual capabilities on the [MIRACL](https://github.com/project-miracl/miracl) academic benchmark, a multilingual retrieval dataset, across 15 languages and on an additional 11 languages that were translated from the English and Spanish versions of MIRACL. The reported scores are based on a custom subsampled version that selects hard negatives for each query to reduce the corpus size.

| Open & Commercial Retrieval Models | Average Recall@5 on MIRACL multilingual datasets |
| :---- | :---- |
| llama-3.2-nv-embedqa-1b-v2 + **llama-3-2-nemoretriever-rerankqa-500m** | **64.24%** |
| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nemoretriever-rerankqa-1b-v2 | 65.80% |
| llama-3.2-nv-embedqa-1b-v2 | 60.75% |
| nv-embedqa-mistral-7b-v2 | 50.42% |
| BM25 | 26.51% |

We evaluated the cross-lingual capabilities on the academic benchmark [MLQA](https://github.com/facebookresearch/MLQA/) based on 7 languages (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese). We consider only the evaluation sets in which the query and the documents are in different languages. We calculate the average Recall@5 across the 42 different language pairs.

| Open & Commercial Retrieval Models | Average Recall@5 on MLQA dataset with different languages |
| :---- | :---- |
| llama-3.2-nv-embedqa-1b-v2 + **llama-3-2-nemoretriever-rerankqa-500m** | **82.27%** |
| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nemoretriever-rerankqa-1b-v2 | 86.83% |
| llama-3.2-nv-embedqa-1b-v2 | 79.86% |
| nv-embedqa-mistral-7b-v2 | 68.38% |
| BM25 | 13.01% |
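The count of 42 language pairs follows from taking ordered (query language, document language) pairs of distinct languages from the 7 MLQA languages, that is, 7 × 6. A short check, assuming the pairs are ordered:

```python
from itertools import permutations

languages = ["Arabic", "Chinese", "English", "German", "Hindi", "Spanish", "Vietnamese"]

# Ordered pairs (query language, document language) with two different languages.
pairs = list(permutations(languages, 2))
print(len(pairs))  # 42
```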

We evaluated the support of long documents on the academic benchmark [Multilingual Long-Document Retrieval (MLDR)](https://huggingface.co/datasets/Shitao/MLDR) built on Wikipedia and mC4, covering 12 typologically diverse languages. The English version has a median length of 2399 tokens and a 90th percentile of 7483 tokens using the Llama 3.2 tokenizer.

| Open & Commercial Retrieval Models | Average Recall@5 on MLDR |
| :---- | :---- |
| llama-3.2-nv-embedqa-1b-v2 + **llama-3-2-nemoretriever-rerankqa-500m** | **65.39%** |
| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nemoretriever-rerankqa-1b-v2 | 70.69% |
| llama-3.2-nv-embedqa-1b-v2 | 59.55% |
| nv-embedqa-mistral-7b-v2 | 43.24% |
| BM25 | 71.39% |

**Properties**
The evaluation datasets are based on three [MTEB/BEIR](https://github.com/beir-cellar/beir) TextQA datasets, the TechQA dataset, and MIRACL multilingual retrieval datasets, which are all public datasets. Their sizes range from tens of thousands of entries up to 5M, depending on the dataset.

### Inference
**Engine:** TensorRT <br>
**Test Hardware:** A100 PCIe/SXM and A10G

## **Ethical Considerations**

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

**For more detailed information on ethical considerations for this model**, please see the Model Card++ subcards.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Bias

| Field | Response |
| ----- | ----- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |

## Explainability

| Field | Response |
| ----- | ----- |
| Intended Application & Domain: | Passage ranking for question and answer retrieval. |
| Model Type: | Transformer encoder |
| Intended User: | Generative AI creators working with conversational AI models - most suitable for users who want to improve their multilingual retrieval tasks by reranking a set of candidates for a given question. |
| Output: | List of Floats (Score/Logit indicating if a passage is relevant to a question) |
| Describe how the model works: | Model provides a score about the likelihood the passage contains the information to answer the question. |
| Verified to have met prescribed quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and Latency |
| Potential Known Risks: | This model is not guaranteed to provide a meaningful ranking of passage(s) for a given question. |
| Licensing: | GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); except for the model which is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).<br><br>ADDITIONAL INFORMATION: [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama. |
| Technical Limitations | The model’s maximum sequence length is 8192 tokens, so longer text inputs should be truncated. |

## Privacy

| Field | Response |
| ----- | ----- |
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |

## Safety & Security

| Field | Response |
| ----- | ----- |
| Model Application(s): | Text Reranking for Retrieval |
| Describe the physical safety impact (if present). | Not Applicable |
| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |
| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |

## Prototype

```python
import requests

invoke_url = "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nemoretriever-500m-rerank-v2/reranking"

headers = {
    # Append your API key after "Bearer ".
    "Authorization": "Bearer ",
    "Accept": "application/json",
}

# Request body; the content is left empty here and must be filled in.
payload = {
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ]
}

# re-use connections
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)

# Raise an exception on HTTP 4xx/5xx responses.
response.raise_for_status()
response_body = response.json()
print(response_body)
```

```javascript
import fetch from "node-fetch";

const invokeUrl = "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nemoretriever-500m-rerank-v2/reranking";

const headers = {
  "Authorization": "Bearer ",
  "Accept": "application/json",
};

const payload = {
  "messages": [
    {
      "role": "user",
      "content": ""
    }
  ]
};

const response = await fetch(invokeUrl, {
  method: "post",
  body: JSON.stringify(payload),
  headers: { "Content-Type": "application/json", ...headers }
});

const responseBody = await response.json();

console.log(JSON.stringify(responseBody));
```

```bash
invoke_url='https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nemoretriever-500m-rerank-v2/reranking'

authorization_header='Authorization: Bearer '
accept_header='Accept: application/json'
content_type_header='Content-Type: application/json'

data=$'{
"messages": [
{
"role": "user",
"content": ""
}
]
}'

response=$(curl --silent -i -w "\n%{http_code}" --request POST \
--url "$invoke_url" \
--header "$authorization_header" \
--header "$accept_header" \
--header "$content_type_header" \
--data "$data"
)

echo "$response"
```
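
The prototype snippets above leave the request body as a placeholder. Since the model takes a question and a list of candidate passages, a filled-in request might look like the sketch below. The field names (`model`, `query`, `passages`, `rankings`, `index`, `logit`) and the model identifier are assumptions that this card does not confirm, so check the endpoint's API reference before relying on them. The helper functions are hypothetical, and the response is mocked with made-up logits so that the example runs offline.

```python
from typing import Any, Dict, List, Tuple

# Assumed model identifier, derived from the publisher and title in this card.
ASSUMED_MODEL = "nvidia/llama-3.2-nemoretriever-500m-rerank-v2"


def build_rerank_payload(
    question: str, passages: List[str], model: str = ASSUMED_MODEL
) -> Dict[str, Any]:
    """Build a request body in the assumed query/passages shape."""
    return {
        "model": model,
        "query": {"text": question},
        "passages": [{"text": p} for p in passages],
    }


def order_passages(
    passages: List[str], response_body: Dict[str, Any]
) -> List[Tuple[str, float]]:
    """Pair each passage with its logit (assumed response shape), best first."""
    scored = [(passages[r["index"]], r["logit"]) for r in response_body["rankings"]]
    return sorted(scored, key=lambda item: item[1], reverse=True)


passages = ["Paris is the capital of France.", "Bananas are yellow."]
payload = build_rerank_payload("What is the capital of France?", passages)

# Mocked response with made-up logits, for illustration only.
mock_response = {"rankings": [{"index": 1, "logit": -7.5}, {"index": 0, "logit": 3.2}]}
ordered = order_passages(passages, mock_response)
print(ordered[0][0])
```

In a real call, `payload` would replace the placeholder body in the `session.post(...)` request, and `response.json()` would take the place of `mock_response`.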