---
title: "rerank-qa-mistral-4b"
publisher: "nvidia"
type: "endpoint"
updated: "2025-01-17T20:06:15.644Z"
description: "GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question."
canonical: "https://build.nvidia.com/nvidia/rerank-qa-mistral-4b"
---

## Model Overview

### Description
The NVIDIA Retrieval QA Ranking Model is optimized for providing a probability score that a given passage contains the information to answer a question. The ranking model is a component of a text retrieval system that improves overall accuracy. A text retrieval system often uses an embedding model (dense) or a lexical search (sparse) index to return relevant text passages for a given input. A ranking model can then rerank the candidate passages into a final order. Because the ranking model takes query-passage pairs as input, it can apply cross-attention between the words of the query and the passage. Applying a ranking model to every document in the knowledge base would not be feasible, so ranking models are typically deployed in combination with embedding models.
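
The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the NeMo Retriever implementation: `retrieve_then_rerank`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scoring functions only stand in for an embedding model and a ranking model.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    k_candidates: int = 100,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top `k_final` passages after reranking `k_candidates` candidates."""
    # Stage 1: cheap scoring over the whole corpus (embedding or lexical search).
    candidates = sorted(
        corpus, key=lambda p: first_stage_score(query, p), reverse=True
    )[:k_candidates]
    # Stage 2: costly pairwise scoring, applied only to the short candidate list.
    scored = [(p, rerank_score(query, p)) for p in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k_final]


# Toy stand-ins (assumptions) so the sketch runs without any model.
def word_overlap(query: str, passage: str) -> float:
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def overlap_ratio(query: str, passage: str) -> float:
    return word_overlap(query, passage) / max(len(passage.split()), 1)


corpus = [
    "The GPU has 80 GB of memory.",
    "The memory fan on the GPU is loud and the case is large and heavy and black.",
    "Bananas are yellow.",
]
top = retrieve_then_rerank(
    "how much memory does the GPU have",
    corpus,
    word_overlap,
    overlap_ratio,
    k_candidates=2,
    k_final=1,
)
print(top[0][0])
```

In this toy run the first stage ranks the longer passage first, and the second stage promotes the shorter, more focused one. In a real deployment the second scorer would be the ranking model, which is why it is only run on the candidate list.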

NVIDIA Retrieval QA Ranking Model is a part of NVIDIA NeMo Retriever, which provides state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for their domain-specific use cases, such as Information Technology, Human Resource help assistants, and Research & Development research assistants.

### Terms of use
The use of this model is governed by the [NVIDIA NeMo Foundational Models Evaluation License Agreement](https://registry.ngc.nvidia.com/orgs/ohlfw0olaadg/teams/ea-participants/resources/nemo_foundational_models_evaluation_license/files).

### Model Architecture: Mistral-4B Ranker

**Architecture Type:** Transformer <br>
**Network Architecture:**  Fine-tuned Mistral-7B-v0.1 LLM (only first 16 layers)  <br>

The NVIDIA Retrieval QA Ranking Model is a transformer encoder: a LoRA fine-tuned version of [Mistral-7B-v0.1 LLM](https://huggingface.co/mistralai/Mistral-7B-v0.1) that uses only the first 16 layers for higher throughput. The last embedding output by the decoder model is used as the pooled representation, and a binary classification head is fine-tuned for the ranking task.

Ranking models for text ranking are typically trained using a cross-encoder architecture for sentence classification. This involves predicting the relevance of a pair of texts (for example, a query and a chunked passage). The binary cross-entropy loss is used to maximize the likelihood for passages that contain the information to answer the query and minimize the likelihood for passages that do not.
We train the model on the private and public datasets described in the Training Dataset & Evaluation section.
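
As a rough illustration of the loss mentioned above, the sketch below computes binary cross-entropy directly from raw logits and 0/1 relevance labels. The function name and the example values are assumptions for illustration; the actual training code is not described in this card.

```python
import math
from typing import Sequence


def binary_cross_entropy_with_logits(
    logits: Sequence[float], labels: Sequence[int]
) -> float:
    """Mean binary cross-entropy between sigmoid(logit) and a 0/1 label."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))],
        # where s is the sigmoid function.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Label 1: the passage answers the query; label 0: it does not.
good = binary_cross_entropy_with_logits([4.0, -4.0], [1, 0])
bad = binary_cross_entropy_with_logits([-4.0, 4.0], [1, 0])
print(good < bad)
```

The loss is small when relevant passages receive high logits and irrelevant passages receive low ones, and large when the scores are reversed.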

### Intended use
The NVIDIA Retrieval QA Ranking model is most suitable for users who want to improve their retrieval systems by reranking a set of candidates for a given question.

### Input

**Input Type:** Pair of texts <br>
**Input Format:** list of text pairs <br>

### Output

**Output Type:** floats <br>
**Output Format:** List of floats, each a probability score (or a raw logit). The user can decide whether a sigmoid activation function is applied to the logits. <br>
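
If raw logits are returned, they can be turned into probability scores on the client side. The following is a minimal sketch, assuming the scores arrive as a plain list of floats in passage order; `rank_passages` is a hypothetical helper, not part of any NVIDIA API.

```python
import math
from typing import List, Tuple


def sigmoid(logit: float) -> float:
    # Branch on the sign to avoid overflow in math.exp for large magnitudes.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def rank_passages(logits: List[float]) -> List[Tuple[int, float]]:
    """Return (passage index, probability) pairs, most relevant first."""
    probs = [(i, sigmoid(z)) for i, z in enumerate(logits)]
    return sorted(probs, key=lambda pair: pair[1], reverse=True)


ranking = rank_passages([-1.2, 3.4, 0.0])
print([i for i, _ in ranking])  # [1, 2, 0]
```

Because the sigmoid is monotonic, sorting by logits or by probabilities yields the same order; the activation only matters when a calibrated score or a threshold is needed.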

### Model Version(s)

NVIDIA Retrieval QA Text Reranking Mistral 4B-1

## Training Dataset & Evaluation

### Training Dataset
The development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset named [MSMARCO](https://microsoft.github.io/msmarco/) restricts commercial licensing, limiting the use of these models in commercial settings. To address this, we created our own internal open-domain QA dataset to train commercially viable text QA models. For NVIDIA proprietary data collection, we searched passages from web logs and selected a collection of passages relevant to customer use cases for annotation by the NVIDIA internal data annotation team.

The training dataset details are as follows:

**Use Case:** Information retrieval for question and answering over text documents. <br>

**Data Sources:** <br>
- Public datasets licensed for commercial use. <br>
- Text from public websites. <br>
- Annotations created by NVIDIA’s internal team. <br>

**Language:** English (US) <br>
**Domains:** Knowledge, Description, Numeric (unit, time), Entity, Location, Person <br>
**Volume:** 400k samples from public dataset <br>
**High Level Schema:** <br>
- query: question text <br>
- doc: full document that contains the answer <br>
- chunk: section of the document that contains the answer <br>
- relevancy label: rating of how relevant the passage is to the question <br>
- span: exact token range in the chunk that contains the answer <br>
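
The high-level schema above could be represented as a record type like the one below. The field types are assumptions (the card does not specify them), and the sample values are invented purely to show the shape of a record.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TrainingSample:
    query: str              # question text
    doc: str                # full document that contains the answer
    chunk: str              # section of the document that contains the answer
    relevancy_label: int    # assumed: integer rating of passage relevance
    span: Tuple[int, int]   # assumed: (start, end) token range, end-exclusive


# Illustrative record only; not taken from the actual training data.
sample = TrainingSample(
    query="What color is the sky?",
    doc="The sky is blue. Grass is green.",
    chunk="The sky is blue.",
    relevancy_label=1,
    span=(3, 4),
)
start, end = sample.span
print(sample.chunk.split()[start:end])
```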

### Evaluation Results
We evaluated the NVIDIA Retrieval QA Ranking Model against open and commercial retriever models from the literature on academic benchmarks: [NQ](https://huggingface.co/datasets/BeIR/nq), [HotpotQA](https://huggingface.co/datasets/hotpot_qa), and [FiQA (Finance Q&A)](https://huggingface.co/datasets/BeIR/fiqa) from the BeIR benchmark. The metric used in this benchmark was Recall@5. As described above, the ranking model is applied to the output of an embedding model.

| Open & Commercial Retrieval Models   |Average Recall@5 on NQ, HotpotQA, FiQA dataset|
|--------------------------------------------------------------------------|----------------------------|
| NVIDIA Retrieval QA Embedding + NVIDIA Retrieval QA Ranking (Mistral-4B) | 70.60%                     |
| NVIDIA Retrieval QA Embedding                                            | 55.95%                     |
| E5-Large_unsupervised                                                    | 47.57%                     |
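
For reference, Recall@k can be computed as in the sketch below. It assumes the common definition (the fraction of a query's relevant passages that appear in the top k results, averaged over queries); the exact evaluation script behind the numbers above is not given in this card, and the IDs in the example are made up.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 5
) -> float:
    """Average over queries of the share of relevant passages in the top k."""
    per_query = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # queries without relevant passages are skipped
        top_k = set(ranked.get(query_id, [])[:k])
        per_query.append(len(top_k & relevant_ids) / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0


ranked = {
    "q1": ["p3", "p1", "p9", "p4", "p7", "p2"],
    "q2": ["p5", "p6", "p8", "p0", "p1"],
}
relevant = {"q1": {"p1", "p2"}, "q2": {"p5"}}
print(recall_at_k(ranked, relevant, k=5))  # 0.75
```

Here `q1` finds one of its two relevant passages in the top five and `q2` finds its only one, giving (0.5 + 1.0) / 2 = 0.75.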

We also evaluated our embedding model with real internal customer datasets from the telco, IT, consulting, and energy industries. The metric was Recall@5, to emulate a retrieval-augmented generation (RAG) scenario in which the top five most relevant passages are provided as context in the prompt for the LLM that responds to the question. We compared our model’s information retrieval accuracy to a number of well-known embedding models made available by the AI community, including ones trained on non-commercial datasets (which are marked with "*").

| Retrieval Model              | Average Recall@5 on Internal Customer Datasets |
|----------------------------------------------------------------|-----------------------------|
| NVIDIA Retrieval QA Embedding + NVIDIA Retrieval QA Ranking    | 79.22%                      |
| NVIDIA Retrieval QA                                            | 74.3%                       |
| DRAGON*                                                        | 72.7%                       |
| E5-Large*                                                      | 71.7%                       |
| BGE*                                                           | 71.1%                       |
| GTR*                                                           | 71.0%                       |
| Contriever*                                                    | 69.0%                       |
| GTE*                                                           | 63.9%                       |
| E5-Large_unsupervised                                          | 61.6%                       |
| BM25                                                           | 55.6%                       |

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [here](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nvolve-29k/bias). Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Bias

| Field              | Response |
|:-------------------|:---------|
| Participation considerations from adversely impacted groups [protected classes](https://calcivilrights.ca.gov/disputeresolution/protected-characteristics/) in model design and testing    | None   |
| Measures taken to mitigate against unwanted bias                                                      | None   |

## Explainability

| Field              | Response |
|:-------------------|:---------|
|Intended Application & Domain: | Passage and query ranking for question and answer retrieval    |
|Model Type:                    | Transformer encoder                                            |
|Intended User:                 | Generative AI creators working with conversational AI models.  |
|Output:                        | Score indicating whether a passage answers a question (sentence classification) |
|Describe how the model works:  | The question and passage are concatenated as a pair, and the transformer model provides a score for the likelihood that the passage contains the information to answer the question  |
|Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
|Verified to have met prescribed NVIDIA quality standards: | Yes |
|Performance Metrics:           | Throughput and Latency                                         |
|Potential Known Risks:         | The model was trained on data that may contain toxic language and societal biases originally crawled from the Internet. Therefore, the model may amplify those biases, for example, associating certain genders with certain social stereotypes. |
|Licensing:                     | [NVIDIA NeMo Foundational Models Evaluation License Agreement](https://registry.ngc.nvidia.com/orgs/ohlfw0olaadg/teams/ea-participants/resources/nemo_foundational_models_evaluation_license/files)|
|Technical Limitations:         | The model was trained on data that may contain toxic language and societal biases originally crawled from the Internet. Therefore, the model may amplify those biases, for example, associating certain genders with certain social stereotypes. The model's maximum context length is 8192 tokens. Texts longer than the maximum length must either be chunked or truncated.|
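
The chunk-or-truncate requirement in the Technical Limitations row can be handled before a passage is sent for ranking. The sketch below is one possible approach under stated assumptions: it operates on a list of tokens that has already been produced by the model's tokenizer (plain characters are used as a stand-in in the example), and `chunk_tokens`, `truncate_tokens`, and the `overlap` parameter are hypothetical.

```python
from typing import List


def truncate_tokens(tokens: List[str], max_len: int) -> List[str]:
    """Keep only the first `max_len` tokens."""
    return tokens[:max_len]


def chunk_tokens(tokens: List[str], max_len: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `max_len` tokens."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if not tokens:
        return []
    step = max_len - overlap
    # Stop once the remaining tokens are already covered by the previous window.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + max_len] for i in range(0, last_start, step)]


# Stand-in tokens; a real pipeline would use the model's tokenizer
# and a limit derived from the 8192-token context length.
tokens = list("abcdefghij")
chunks = chunk_tokens(tokens, max_len=4, overlap=1)
print(chunks)
```

Since the model scores query-passage pairs, a practical limit for the passage would likely need to leave room for the query tokens as well; that budget is an assumption and not stated in this card.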

## Privacy

| Field              | Response |
|:-------------------|:---------|
|Generatable or reverse engineerable personally-identifiable information (PII)?    | None             |
|Was consent obtained for any PII used?                                            | Not Applicable       |
|PII used to create this model?                                                    | None                 |
|How often is dataset reviewed?                                                    | Before Release |
|Is a mechanism in place to honor data subject right of access or deletion of personal data? | No         |
|If PII collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable |
|If PII collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable |
|If PII collected for the development of this AI model, was it minimized to only what was required? | Not Applicable |
|Is there provenance for all datasets used in training?                            | Yes                  |
|Are we able to identify and trace source of dataset?                              | Yes                  |
|Does data labeling (annotation, metadata) comply with privacy laws?               | Yes                  | 
|Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data|

## Safety & Security

| Field              | Response |
|:-------------------|:---------|
|Model Application(s):                                 | Text Embedding for Retrieval    |
|Describe the life-critical impact (if present).     | Not Applicable                  |
|Use Case Restrictions:| Evaluation license for Non-Commercial Use Only. |
|Model and dataset restrictions:| The Principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |

## Prototype

```python
import requests

invoke_url = "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking"

headers = {
    "Authorization": "Bearer ",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ]
}

# re-use connections
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)

response.raise_for_status()
response_body = response.json()
print(response_body)
```

```javascript
import fetch from "node-fetch";

const invokeUrl = "https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking";

const headers = {
  "Authorization": "Bearer ",
  "Accept": "application/json",
};

const payload = {
  "messages": [
    {
      "role": "user",
      "content": ""
    }
  ]
};

let response = await fetch(invokeUrl, {
  method: "post",
  body: JSON.stringify(payload),
  headers: { "Content-Type": "application/json", ...headers }
});

let response_body = await response.json();

console.log(JSON.stringify(response_body));
```

```bash
invoke_url='https://ai.api.nvidia.com/v1/retrieval/nvidia/reranking'

authorization_header='Authorization: Bearer '
accept_header='Accept: application/json'
content_type_header='Content-Type: application/json'

data=$'{
  "messages": [
    {
      "role": "user",
      "content": ""
    }
  ]
}'

response=$(curl --silent -i -w "\n%{http_code}" --request POST \
  --url "$invoke_url" \
  --header "$authorization_header" \
  --header "$accept_header" \
  --header "$content_type_header" \
  --data "$data"
)

echo "$response"
```