---
title: "llama-nemotron-embed-1b-v2"
publisher: "nvidia"
type: "endpoint"
updated: "2026-03-04T21:36:18.776Z"
description: "Multilingual, cross-lingual embedding model for long-document QA retrieval, supporting 26 languages."
canonical: "https://build.nvidia.com/nvidia/llama-nemotron-embed-1b-v2"
---

# Model Overview

### Description:

llama-nemotron-embed-1b-v2 is optimized for **multilingual and cross-lingual** text question-answering retrieval with **support for long documents (up to 8192 tokens)** and **dynamic embedding size (Matryoshka embeddings)**. The model was evaluated across 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.

llama-nemotron-embed-1b-v2 was developed by NVIDIA as part of the Nemotron retriever family.

In addition to enabling multilingual and cross-lingual retrieval, this model reduces the data storage footprint through dynamic embedding sizes and a longer supported input length (which means fewer chunks per document), making it feasible to handle large-scale datasets efficiently.
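
To make the storage trade-off concrete, the sketch below estimates raw vector storage at each supported embedding size. It assumes uncompressed float32 storage (4 bytes per value), ignores index overhead, and uses a hypothetical corpus of 1 million passages; none of these figures come from the model card.

```python
# Rough raw-storage estimate for embeddings at each supported output size.
# Assumptions (not from the model card): float32 storage (4 bytes/value),
# no compression or index overhead, and a hypothetical 1M-passage corpus.
BYTES_PER_FLOAT32 = 4
SUPPORTED_DIMS = [384, 512, 768, 1024, 2048]


def storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Return the raw bytes needed to store num_vectors embeddings of size dim."""
    return num_vectors * dim * bytes_per_value


num_passages = 1_000_000  # hypothetical corpus size
full = storage_bytes(num_passages, 2048)
for dim in SUPPORTED_DIMS:
    size = storage_bytes(num_passages, dim)
    print(f"dim={dim:4d}: {size / 1e9:.3f} GB ({full / size:.2f}x smaller than 2048)")
```

Under these assumptions, a 384-dimensional vector takes 1,536 bytes versus 8,192 bytes at 2048 dimensions, roughly a 5.3x reduction; the evaluation tables in this card show the accuracy cost of that choice.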

Embedding models are a key component of a retrieval system, transforming text into dense vector representations (embeddings) for indexing and similarity search.
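
As a minimal illustration of the similarity-search step, the sketch below ranks passages by cosine similarity to a query. Cosine similarity is an assumption here, and the 4-dimensional vectors are made-up placeholders rather than real outputs of this model; a production system would typically replace the linear scan with a vector database or approximate nearest-neighbor index.

```python
import math

# Toy dense-retrieval step: rank passages by cosine similarity to a query.
# The 4-dimensional vectors are made-up placeholders, not real outputs of
# llama-nemotron-embed-1b-v2 (which returns up to 2048 dimensions).


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], passages: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (passage_id, score) pairs most similar to the query, best first."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in passages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


passage_index = {
    "p1": [0.9, 0.1, 0.0, 0.1],
    "p2": [0.1, 0.8, 0.2, 0.0],
    "p3": [0.7, 0.3, 0.1, 0.2],
}
query_vec = [1.0, 0.2, 0.0, 0.1]
print(top_k(query_vec, passage_index, k=2))
```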

*This model is ready for commercial/non-commercial use.*

### License/Terms of Use:

**GOVERNING TERMS:** The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/).

**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.**

**Model Developer:** NVIDIA

### Deployment Geography:

**Global**

### Use Case:

This model is most suitable for users who want to build a multilingual question-and-answer application over a large text corpus, leveraging dense retrieval technologies.

### Release Date:

Build.NVIDIA.com 2/27/2026 via [llama-nemotron-embed-1b-v2](https://build.nvidia.com/nvidia/llama-nemotron-embed-1b-v2) <br>

## Reference(s):

- [NVIDIA NeMo Retriever Documentation](https://docs.nvidia.com/nemo/retriever/index.html)

## Model Architecture:

**Architecture Type:** Transformer encoder <br>
**Network Architecture:** Fine-tuned Llama 3.2 1B retriever <br>
**Embedding Dimension:** 2048 <br>
**Max Sequence Length:** 8192 <br>
**Number of Model Parameters:** 1B

This embedding model is a transformer encoder fine-tuned for contrastive learning using a bi-encoder setup. Query and passage texts are encoded independently, and contrastive learning maximizes similarity for relevant query–passage pairs while minimizing similarity for negative passages. The model supports Matryoshka embeddings to enable dynamic output dimensions.
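
The card does not state the exact contrastive loss. A common choice for bi-encoder retrievers is an InfoNCE-style objective, sketched below purely as an illustration of "maximize similarity to the positive, minimize it to negatives"; the temperature value and the similarity scores are assumptions, not NVIDIA's training settings.

```python
import math

# InfoNCE-style contrastive loss for a single query (illustrative only: the
# model card does not state the exact loss or temperature used in training).
# sim_pos: similarity between the query and its relevant passage.
# sim_negs: similarities between the query and negative passages.


def info_nce_loss(sim_pos: float, sim_negs: list[float], temperature: float = 0.05) -> float:
    """Return -log of the softmax probability assigned to the positive passage."""
    logits = [s / temperature for s in [sim_pos, *sim_negs]]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[0]


# Minimizing the loss pushes sim_pos up and sim_negs down.
easy = info_nce_loss(sim_pos=0.9, sim_negs=[0.1, 0.2, 0.0])
hard = info_nce_loss(sim_pos=0.5, sim_negs=[0.45, 0.48, 0.4])
print(f"easy={easy:.4f} hard={hard:.4f}")
```

A well-separated positive yields a loss near zero, while a positive that is barely more similar than its negatives yields a much larger loss, which is the signal that drives training.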

## Input(s):

**Input Type(s):** Text <br>
**Input Format(s):** String / List of strings <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Text inputs longer than the maximum context length must be truncated or chunked.
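
One way to handle over-length inputs is to split them into overlapping windows before embedding. The sketch below operates on a list of tokens; whitespace splitting stands in for the model's real tokenizer (an assumption: true Llama 3.2 token counts will differ), and the 256-token overlap is an arbitrary illustrative choice.

```python
# Split a token sequence into overlapping windows that fit the model's
# 8192-token limit. Whitespace "tokens" are a crude stand-in for the real
# tokenizer, so leave headroom in practice.
MAX_TOKENS = 8192


def chunk_tokens(tokens: list[str], max_tokens: int = MAX_TOKENS, overlap: int = 256) -> list[list[str]]:
    """Return windows of at most max_tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


document = " ".join(f"w{i}" for i in range(20_000)).split()
windows = chunk_tokens(document)
print([len(w) for w in windows])
```

Each window is then embedded separately; how per-chunk scores are combined at query time (for example, taking the best-scoring chunk) is an application choice.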

## Output(s)

**Output Type(s):** Floats (dense vector embeddings) <br>
**Output Format(s):** List of floats <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Embedding vectors can be configured to output dimensions of 384, 512, 768, 1024, or 2048.
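
With Matryoshka embeddings, a smaller embedding is the leading slice of the full vector. The sketch below shows client-side truncation, assuming you start from full 2048-dimensional outputs and compare vectors with cosine or dot-product similarity: keep the first `dim` values and re-normalize to unit length. The random input vector is a placeholder; whether and how the hosted endpoint returns smaller sizes directly should be checked in the API reference.

```python
import math
import random

# Matryoshka-style truncation: keep the leading `dim` values of a full
# embedding and L2-normalize the result. Assumes the input is a full
# 2048-dimensional vector; the values below are random placeholders.
SUPPORTED_DIMS = (384, 512, 768, 1024, 2048)


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}")
    if len(vector) < dim:
        raise ValueError("vector is shorter than the requested dimension")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix


random.seed(0)
full_vector = [random.uniform(-1, 1) for _ in range(2048)]
small = truncate_embedding(full_vector, 384)
print(len(small), round(sum(x * x for x in small), 6))
```

Vectors truncated to different sizes are not comparable, so queries and passages should be indexed and searched at the same dimension.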

__Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.__

## Software Integration:

**Runtime Engines:** TensorRT <br>
**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Hopper <br>
NVIDIA Lovelace <br>
**Preferred/Supported Operating System(s):**
Linux

__The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.__

## Model Version(s):

llama-nemotron-embed-1b-v2

Short Name: `llama-nemotron-embed-1b-v2`

## Training, Testing, and Evaluation Datasets:

### **Training Dataset**

**Data Modality:** Text <br>
**Training Data Collection:** Hybrid: Automated, Undisclosed <br>
**Training Labeling:** Hybrid: Automated, Undisclosed <br>
**Training Properties:** NVIDIA used a training blend of public QA datasets with commercial-use-eligible licenses (avoiding datasets such as MS MARCO that restrict commercial use). The model was pre-trained with semi-supervised learning on ~12M samples from public datasets and then fine-tuned on ~1M samples from public datasets.

### **Evaluation Dataset**

**Evaluation Data Collection:** Hybrid: Automated, Undisclosed <br>
**Evaluation Labeling:** Hybrid: Automated, Undisclosed <br>
**Evaluation Properties:** Evaluated in comparison to open and commercial retriever models on academic question-answering benchmarks (NQ, HotpotQA, FiQA from BEIR, and TechQA). Metric used: Recall@5. Additional evaluations include MIRACL multilingual retrieval, MLQA cross-lingual retrieval, and MLDR long-document retrieval. <br><br>
**Selected results (Recall@5):**

| Open & Commercial Retrieval Models | Average Recall@5 on NQ, HotpotQA, FiQA, TechQA |
|:--|--:|
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 68.60% |
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 64.48% |
| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 68.97% |
| nv-embedqa-mistral-7b-v2 | 72.97% |
| nv-embedqa-mistral-7b-v1 | 64.93% |
| nv-embedqa-e5-v5 | 62.07% |
| nv-embedqa-e5-v4 | 57.65% |
| e5-large-unsupervised | 48.03% |
| BM25 | 44.67% |

| Open & Commercial Retrieval Models | Average Recall@5 on MIRACL multilingual |
|:--|--:|
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 60.75% |
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 58.62% |
| llama-3.2-nv-embedqa-1b-v1 | 60.07% |
| nv-embedqa-mistral-7b-v2 | 50.42% |
| BM25 | 26.51% |

| Open & Commercial Retrieval Models | Average Recall@5 on MLQA (different languages) |
|:--|--:|
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 79.86% |
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 71.61% |
| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 78.77% |
| nv-embedqa-mistral-7b-v2 | 68.38% |
| BM25 | 13.01% |

| Open & Commercial Retrieval Models | Average Recall@5 on MLDR |
|:--|--:|
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 59.55% |
| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 54.77% |
| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 60.49% |
| nv-embedqa-mistral-7b-v2 | 43.24% |
| BM25 | 71.39% |
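
Recall@5, the metric in the tables above, measures how many of a query's relevant passages appear among its top 5 retrieved results, averaged over queries. The exact averaging used for each benchmark is not specified in this card; the sketch below uses a common definition with hypothetical query and passage ids.

```python
# Recall@k: fraction of each query's relevant passages found in its top-k
# results, averaged over queries. Query and passage ids are hypothetical.


def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 5) -> float:
    scores = []
    for query_id, relevant_ids in relevant.items():
        if not relevant_ids:
            continue  # queries without relevance judgments are skipped
        top_k_ids = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k_ids & relevant_ids) / len(relevant_ids))
    return sum(scores) / len(scores) if scores else 0.0


retrieved = {
    "q1": ["p3", "p7", "p1", "p9", "p4", "p2"],
    "q2": ["p5", "p8", "p6", "p0", "p2"],
}
relevant = {
    "q1": {"p1", "p2"},  # p1 is in the top 5, p2 is ranked 6th
    "q2": {"p5"},
}
print(f"Recall@5 = {recall_at_k(retrieved, relevant, k=5):.2%}")
```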

## Inference:

**Acceleration Engine:** TensorRT <br>
**Test Hardware:** NVIDIA L40S

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Bias

| Field | Response |
| ----- | ------ |
| Participation considerations from adversely impacted groups [protected classes](https://calcivilrights.ca.gov/disputeresolution/protected-characteristics/) in model design and testing | Not Applicable |
| Measures taken to mitigate against unwanted bias | Not Applicable |

## Explainability

| Field | Response |
| ----- | ----- |
| Intended Application & Domain: | Document and query embedding for question and answer retrieval. |
| Model Type: | Transformer encoder. |
| Intended User: | Generative AI creators working with conversational AI models. Users who want to build a multilingual question and answer application over a large text corpus, leveraging the latest dense retrieval technologies. |
| Output: | Array of float numbers (Dense Vector Representation for the input text). |
| Describe how the model works: | Model transforms the input into a dense vector representation. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |
| Technical Limitations: | The model's maximum sequence length is 8192 tokens. Longer text inputs should be truncated or chunked. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and Latency. |
| Potential Known Risks: | This model does not guarantee to always retrieve the correct passage(s) for a given query. |
| Licensing & Terms of Use: | **GOVERNING TERMS:** The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/). |

## Privacy

| Field | Response |
| ----- | ----- |
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
| How often is dataset reviewed? | Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

## Safety & Security

| Field | Response |
| ----- | ----- |
| Model Application(s): | Document Embedding for Retrieval. User queries can be text and documents can be text. |
| Use Case Restrictions: | **GOVERNING TERMS:** The trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.2 Community Model License Agreement](https://www.llama.com/llama3_2/license/). |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
| Describe the life critical impact (if present) | Not applicable |

## Prototype

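```bash
# Assumes your API key is exported as NVIDIA_API_KEY.
invoke_url='https://integrate.api.nvidia.com/v1/embeddings'

authorization_header="Authorization: Bearer $NVIDIA_API_KEY"
accept_header='Accept: application/json'
content_type_header='Content-Type: application/json'

# The embeddings endpoint takes "input" (not chat-style "messages") and a model name.
# input_type is "query" for search queries and "passage" for documents.
# Replace the example input with your own text.
data=$'{
"model": "nvidia/llama-nemotron-embed-1b-v2",
"input": ["What is the capital of France?"],
"input_type": "query",
"encoding_format": "float",
"truncate": "NONE"
}'

# -i prints response headers; -w appends the HTTP status code on a final line.
response=$(curl --silent -i -w "\n%{http_code}" --request POST \
--url "$invoke_url" \
--header "$authorization_header" \
--header "$accept_header" \
--header "$content_type_header" \
--data "$data"
)

echo "$response"
```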
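
The curl command above prints the response headers (`-i`), the JSON body, and the HTTP status code (`-w`) in one stream. The sketch below splits that output and extracts the vectors. It assumes a single header block (no redirects or `100 Continue`) and the OpenAI-style `{"data": [{"index": ..., "embedding": [...]}]}` response shape; the sample output is made up for illustration.

```python
import json

# Split the text printed by the curl command into status, headers, and JSON
# body, then pull out the embedding vectors in input order.


def parse_curl_output(raw: str) -> tuple[int, dict[str, str], dict]:
    raw, _, status_line = raw.rstrip("\n").rpartition("\n")  # last line is the -w status code
    header_text, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    headers = {}
    for line in header_text.split("\n")[1:]:  # skip the "HTTP/x 200" status line
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_line), headers, json.loads(body)


def embeddings_from_body(body: dict) -> list[list[float]]:
    """Return embeddings ordered by their 'index' field."""
    return [item["embedding"] for item in sorted(body["data"], key=lambda item: item["index"])]


sample_output = (  # hypothetical output with tiny placeholder vectors
    "HTTP/2 200 \r\n"
    "content-type: application/json\r\n"
    "\r\n"
    '{"object": "list", "data": ['
    '{"index": 1, "embedding": [0.3, 0.4], "object": "embedding"}, '
    '{"index": 0, "embedding": [0.1, 0.2], "object": "embedding"}]}'
    "\n200"
)
status, headers, body = parse_curl_output(sample_output)
print(status, headers["content-type"], embeddings_from_body(body))
```

If you only need the body, dropping `-i` and `-w` from the curl call avoids this parsing step entirely.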

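```python
import os

from openai import OpenAI

# Reads the API key from the NVIDIA_API_KEY environment variable.
client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1"
)

response = client.embeddings.create(
    input=["What is the capital of France?"],  # example text; replace with your own
    model="nvidia/llama-nemotron-embed-1b-v2",
    encoding_format="float",
    # input_type is "query" for search queries and "passage" for documents.
    extra_body={"input_type": "query", "truncate": "NONE"}
)

print(response.data[0].embedding)
```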

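```python
import os

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

client = NVIDIAEmbeddings(
    model="nvidia/llama-nemotron-embed-1b-v2",
    api_key=os.environ["NVIDIA_API_KEY"],
    truncate="NONE",
)

# embed_query encodes text as a search query; use embed_documents for passages.
embedding = client.embed_query("What is the capital of France?")  # example text
print(embedding)
```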