---
title: "llama-3_2-nemoretriever-300m-embed-v2"
publisher: "nvidia"
type: "endpoint"
updated: "2025-10-02T21:05:28.355Z"
description: "Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages."
canonical: "https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-300m-embed-v2"
---

## **Model Overview**

### **Description**

The Llama 3.2 NeMo Retriever Embedding 300M model version 2 is optimized for **multilingual and cross-lingual** text question-answering retrieval with **support for long documents (up to 8192 tokens)**. This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.

In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.

An embedding model is a crucial component of a text retrieval system, as it transforms textual information into dense vector representations. Such models are typically transformer encoders that process the tokens of the input text (for example, a question or a passage) and output an embedding.
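
To illustrate how dense vectors are used for retrieval, here is a minimal sketch that ranks passages by their similarity to a query. It assumes cosine similarity as the scoring function (this card does not state which similarity measure is used), and the 3-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real model embeddings.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query, passages)
print(ranking)
```

In a real system the passage vectors would be computed once, stored in a vector index, and only the query would be embedded at search time.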

The Llama 3.2 NeMo Retriever Embedding 300M model version 2 is a part of the NVIDIA NeMo Retriever collection of NIMs, which provide state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for domain-specific use cases, such as information technology, human resource help assistants, and research and development assistants.

This model is ready for commercial use.

### **License/Terms of use**

**GOVERNING TERMS:** The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). ADDITIONAL INFORMATION: [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama.

**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**

### Deployment Geography:
Global <br>

### Use Case:
The Llama 3.2 NeMo Retriever Embedding 300M model version 2 is most suitable for users who want to build a multilingual question-and-answer application over a large text corpus, leveraging the latest dense retrieval technologies.

### Release Date:
**Build.NVIDIA.com:** 9/29/2025 via [link](https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-300m-embed-v2) <br>

### **Model Architecture**

**Architecture Type:** Transformer<br>
**Network Architecture:** Fine-tuned Llama3.2 300M Retriever<br>

This NeMo Retriever embedding model is a transformer encoder with 9 layers and an embedding size of 2048. It was pruned and distilled from the Llama 3.2-nv-embedqa-1b-v1 model and then trained on public and synthetic datasets. Training used the AdamW optimizer with 100 warm-up steps, a learning rate of 5e-6, and the WarmupDecayLR scheduler.

Embedding models for text retrieval are typically trained using a bi-encoder architecture, in which the two texts of a pair (for example, a query and a chunked passage) are encoded independently by the embedding model. Contrastive learning then maximizes the similarity between the query and the passage that contains the answer, while minimizing the similarity between the query and sampled negative passages that are not useful for answering the question.
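
The contrastive objective can be sketched with an InfoNCE-style loss for a single query. This is an illustration only: the card does not specify the exact loss function, and the `temperature` value and the similarity scores below are assumptions chosen for the example.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE-style loss for one query.

    pos_score: similarity between the query and its positive passage.
    neg_scores: similarities between the query and sampled negatives.
    temperature: assumed scaling factor; not specified in the model card.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability of the positive passage under a softmax.
    return log_sum_exp - logits[0]


# The loss is small when the positive scores well above the negatives...
well_separated = info_nce_loss(0.9, [0.1, 0.2, 0.0])
# ...and larger when negatives score as high as the positive.
poorly_separated = info_nce_loss(0.3, [0.4, 0.35, 0.25])
print(well_separated, poorly_separated)
```

Minimizing this quantity pushes the positive passage's similarity up and the negatives' similarities down, which matches the behavior described above.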

### Computational Load:
**Cumulative Compute:** 6.67E+22<br>
**Estimated Energy and Emissions for Model Training:** 259,500 kWh | 107 tons CO2eq<br>

This model's cumulative compute is dominated by the llama3.2-1b model training; estimates of the base model's compute and energy/emissions usage are sourced from [epoch.ai](https://epoch.ai/data/ai-models?view=table#explore-the-data) and the [llama3.2-1b model card](https://huggingface.co/meta-llama/Llama-3.2-1B).

### **Input**

| Property | Query | Document |
|----------|-------|----------|
| Input Type | Text | Text |
| Input Format | List of strings | List of strings |
| Input Parameter | 1D | 1D |
| Other Properties | The model's maximum context length is 8192 tokens. Texts longer than maximum length must either be chunked or truncated. | The model's maximum context length is 8192 tokens. Texts longer than maximum length must either be chunked or truncated. |
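
As a rough sketch of the two options for over-length inputs, the helpers below split or truncate a sequence of token IDs. The function names and the `overlap` parameter are hypothetical, and the integer list stands in for the output of a real tokenizer, which this example does not call.

```python
def chunk_tokens(token_ids, max_len=8192, overlap=0):
    """Split a token sequence into windows of at most max_len tokens.

    overlap is a hypothetical knob: the number of tokens shared between
    consecutive windows. The model card does not prescribe a value.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # The last window already reaches the end of the text.
    return chunks


def truncate_tokens(token_ids, max_len=8192):
    """Keep only the first max_len tokens."""
    return token_ids[:max_len]


tokens = list(range(20000))  # Stand-in for tokenizer output.
chunks = chunk_tokens(tokens, max_len=8192, overlap=256)
print([len(c) for c in chunks])
```

Chunking keeps all of the text at the cost of more vectors per document, while truncation keeps one vector but discards everything past the limit.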

### **Output**

**Output Type:** Floats<br>
**Output Format:** List of floats<br>
**Output Parameters:** 1D<br>
**Other Properties Related to Output:** The model outputs one embedding vector per text string. The maximum dimension is 2048, and the output dimension can be configured to 384, 512, 768, 1024, or 2048.<br>
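
To make the storage effect of the configurable output dimension concrete, the following sketch estimates the raw size of a vector store at each supported dimension. It assumes float32 storage and a hypothetical corpus of 10 million passages, and it ignores any index overhead.

```python
BYTES_PER_FLOAT32 = 4  # Assumption: embeddings are stored as float32.
SUPPORTED_DIMS = (384, 512, 768, 1024, 2048)  # From the output properties above.


def index_size_gib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size in GiB of num_vectors dense vectors, excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 2**30


num_passages = 10_000_000  # Hypothetical corpus size.
sizes = {dim: index_size_gib(num_passages, dim) for dim in SUPPORTED_DIMS}
for dim, gib in sizes.items():
    print(f"{dim:>5} dims: {gib:6.1f} GiB")
```

Storage scales linearly with the dimension, so a smaller dimension trades retrieval accuracy for a proportionally smaller footprint. The accuracy cost at each dimension is not reported in this card.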

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

### **Software Integration**

**Runtime Engine:** NeMo Retriever embedding NIM<br>
**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace<br>
**Supported Operating System(s):** Linux<br>

### **Model Version(s)**

Llama 3.2 NeMo Retriever Embedding 300M v2<br>
Short Name: `llama-3.2-nemoretriever-300m-embed-v2`<br>

## **Training Dataset & Evaluation**

### **Training Dataset**

The development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset, MS MARCO, restricts commercial licensing, limiting the use of models trained on it in commercial settings. To address this, NVIDIA created its own training dataset blend from public QA datasets, each of which is licensed for commercial applications, and from synthetic QA datasets created using Llama 3.1 70B Instruct. For long-context retrieval, synthetic datasets were created using the same methodology as the [MLDR](https://huggingface.co/datasets/Shitao/MLDR) training datasets.

**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic<br>
**Labeling Method by dataset**: Hybrid: Automated, Human, Synthetic<br>
**Properties:** Semi-supervised pre-training on 12M samples from public datasets and fine-tuning on 1M samples from public and synthetic datasets.<br>

### **Evaluation Datasets**

We evaluated the NeMo Retriever embedding model against open and commercial retriever models from the literature on academic question-answering benchmarks: [NQ](https://huggingface.co/datasets/BeIR/nq), [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa), and [FiQA (Finance Q&A)](https://huggingface.co/datasets/BeIR/fiqa) from the BeIR benchmark, plus the TechQA dataset. The model was evaluated offline on A100 GPUs using its PyTorch checkpoint. The metric for this benchmark was Recall@5.
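
For readers unfamiliar with the metric, here is a minimal sketch of one common definition of Recall@k: the fraction of a query's relevant documents that appear in the top-k results, averaged over queries. The exact variant used for the reported numbers is not specified in this card, and the document IDs below are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k=5):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical retrieval results for two queries.
runs = [
    (["d3", "d7", "d1", "d9", "d4", "d2"], {"d1", "d2"}),  # d2 is ranked 6th
    (["d5", "d8", "d6", "d0", "d3"], {"d8"}),
]
score = mean_recall_at_k(runs, k=5)
print(score)
```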

We also evaluated multilingual capabilities on the academic benchmark [MIRACL](https://github.com/project-miracl/miracl) across 15 languages, and translated the English and Spanish versions of MIRACL into 11 additional languages. The reported scores are based on an internal version of MIRACL in which hard negatives were selected for each query to reduce the corpus size.

We evaluated the model's capabilities on the academic benchmark [MLQA](https://github.com/facebookresearch/MLQA/) based on 7 languages (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese). We considered only the evaluation sets in which the query and the documents are in the same language.

We evaluated support for long documents on the academic benchmark [Multilingual Long-Document Retrieval (MLDR)](https://huggingface.co/datasets/Shitao/MLDR), which is built on Wikipedia and mC4 and covers 12 typologically diverse languages. The English version has a median length of 2399 tokens and a 90th-percentile length of 7483 tokens using the Llama 3.2 tokenizer. The MLDR questions are synthetically generated with an LLM, which tends to produce questions that share keywords with the positive document and might not be representative of real user queries. This characteristic of the dataset benefits sparse retrieval methods such as BM25.

**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic<br>
**Labeling Method by dataset**: Hybrid: Automated, Human, Synthetic<br>
**Properties:** The evaluation datasets are based on [MTEB/BEIR](https://github.com/beir-cellar/beir), TextQA, TechQA, [MIRACL](https://github.com/project-miracl/miracl), [MLQA](https://github.com/facebookresearch/MLQA), and [MLDR](https://huggingface.co/datasets/Shitao/MLDR). Dataset sizes range from tens of thousands up to 5M, depending on the dataset.<br>

### **Evaluation Results**

| Open & Commercial Retrieval Models | Average Recall@5 on NQ, HotpotQA, FiQA, TechQA dataset |
| ----- | ----- |
| llama-3.2-nemoretriever-300m-embed-v2 (embedding dim 2048) | 62.92% |
| intfloat/multilingual-e5-large | 61.23% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 60.9% |
| Alibaba-NLP/gte-multilingual-base | 57.09% |
| BAAI/bge-m3 | 57.84% |
| nv-embedqa-e5-v5 | 62.07% |
| e5-large-unsupervised | 48.03% |
| BM25 | 44.67%  |

| Open & Commercial Retrieval Models | Average Recall@5 on MIRACL (multilingual) |
| ----- | ----- |
| llama-3.2-nemoretriever-300m-embed-v2 (embedding dim 2048) | 66.12% |
| intfloat/multilingual-e5-large | 64.27% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 60.28% |
| Alibaba-NLP/gte-multilingual-base | 63.27% |
| BAAI/bge-m3 | 67.67% |
| BM25 | 26.51% |

| Open & Commercial Retrieval Models | Average Recall@5 on MLQA dataset with different languages |
| ----- | ----- |
| llama-3.2-nemoretriever-300m-embed-v2 (embedding dim 2048) | 75.91% |
| intfloat/multilingual-e5-large | 77.21% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 53.34% |
| Alibaba-NLP/gte-multilingual-base | 71.08% |
| BAAI/bge-m3 | 74.21% |
| BM25 | 13.01% |

| Open & Commercial Retrieval Models | Average Recall@5 on MLDR |
| ----- | ----- |
| llama-3.2-nemoretriever-300m-embed-v2 (embedding dim 2048) | 53.27% |
| intfloat/multilingual-e5-large | 38.46% |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 36.42% |
| Alibaba-NLP/gte-multilingual-base | 62.13% |
| BAAI/bge-m3 | 57.85% |
| BM25 | 71.39% |

**Inference**<br>
**Engine:** TensorRT<br>
**Test Hardware:** L40s<br>

## **Ethical Considerations**

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.

For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Get Help

### Enterprise Support
Get access to knowledge base articles and support cases or submit a ticket at the [NVIDIA AI Enterprise Support Services page](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).

### NVIDIA NIM Documentation
Visit the [NeMo Retriever docs page](https://docs.nvidia.com/nemo/retriever/index.html) for release documentation, deployment guides and more.

## Bias

| Field | Response |
| ----- | ------ |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |

## Explainability

| Field | Response |
| ----- | ----- |
| Intended Application & Domain: | Document and query embedding for question and answer retrieval. |
| Model Type: | Transformer encoder. |
| Intended User: | Generative AI creators working with conversational AI models. Users who want to build a multilingual question and answer application over a large text corpus, leveraging the latest dense retrieval technologies. |
| Output: | Array of float numbers (Dense Vector Representation for the input text). |
| Describe how the model works: | Model transforms the input into a dense vector representation. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |
| Technical Limitations: | The model's maximum sequence length is 8192 tokens. Longer text inputs must be chunked or truncated. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and Latency. |
| Potential Known Risks: | This model does not guarantee to always retrieve the correct passage(s) for a given query. |
| Licensing & Terms of Use: | **GOVERNING TERMS:** The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). ADDITIONAL INFORMATION: [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/). Built with Llama. |

## Privacy

| Field | Response |
| ----- | ----- |
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| How often is dataset reviewed? | Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

## Safety & Security

| Field | Response |
| ----- | ----- |
| Model Application(s): | Document Embedding for Retrieval. User queries can be text and documents can be text. |
| Use Case Restrictions: | Abide by [Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to. |
| Describe the life critical impact (if present) | Not applicable |

## Prototype

```bash
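# Endpoint for the hosted embeddings API.
invoke_url='https://integrate.api.nvidia.com/v1/embeddings'

# Read the API key from the NVIDIA_API_KEY environment variable.
authorization_header="Authorization: Bearer $NVIDIA_API_KEY"
accept_header='Accept: application/json'
content_type_header='Content-Type: application/json'

# "input" and "input_type" hold example values; use "passage" when embedding documents.
data=$'{
"input": ["What is the capital of France?"],
"model": "nvidia/llama-3.2-nemoretriever-300m-embed-v2",
"input_type": "query",
"encoding_format": "float",
"truncate": "NONE"
}'

# -i includes the response headers; -w appends the HTTP status code on its own line.
response=$(curl --silent -i -w "\n%{http_code}" --request POST \
--url "$invoke_url" \
--header "$authorization_header" \
--header "$accept_header" \
--header "$content_type_header" \
--data "$data"
)

echo "$response"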
```

```python
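import os

from openai import OpenAI

# Read the API key from the NVIDIA_API_KEY environment variable.
client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)

response = client.embeddings.create(
    input=["What is the capital of France?"],  # example text to embed
    model="nvidia/llama-3.2-nemoretriever-300m-embed-v2",
    encoding_format="float",
    # input_type is "query" for search queries and "passage" for documents.
    extra_body={"input_type": "query", "truncate": "NONE"},
)

print(response.data[0].embedding)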
```

```python
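import os

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# Read the API key from the NVIDIA_API_KEY environment variable.
client = NVIDIAEmbeddings(
    model="nvidia/llama-3.2-nemoretriever-300m-embed-v2",
    api_key=os.environ["NVIDIA_API_KEY"],
    truncate="NONE",
)

# embed_query embeds a search query; use embed_documents for passages.
embedding = client.embed_query("What is the capital of France?")  # example query
print(embedding)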
```