The NV-EmbedCode model is a 7B Mistral-based embedding model optimized for code retrieval, supporting text, code, and hybrid queries.
Code retrieval is a critical task in many domains including coding assistance, code explanation, summarization, and documentation search. NV-EmbedCode transforms the input code or textual data into dense vector representations, known as embeddings, enabling effective retrieval and search.
This model is ready for commercial use.
NV-EmbedCode is part of NVIDIA's effort to provide state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. The models that form the core of this solution have been trained using responsibly selected, auditable data sources.
The NV-EmbedCode model is most suitable for users who want to build a code retrieval system over a large text or code corpus, leveraging the latest dense retrieval technologies.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement and the Apache License 2.0.
Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.
Architecture Type: Transformer
Network Architecture: Fine-tuned NVIDIA Retrieval QA Mistral 7B Embedding model
Embedding Dimension: 4096
Parameter Count: 7.1 billion
The NV-EmbedCode model is a transformer encoder: a fine-tuned version of the NVIDIA Retrieval QA Mistral 7B Embedding model, with 32 layers and an embedding size of 4096, trained on public datasets. Mistral models are pre-trained with causal attention. Because our research demonstrated that bi-directional attention improves performance, the NV-Embed series of models uses bi-directional attention. Embedding models for retrieval are typically trained using a bi-encoder architecture: a query and a chunked passage are encoded independently by the embedding model, and contrastive learning is used to maximize the similarity between the query and its relevant (positive) passage while minimizing the similarity to irrelevant (negative) passages.
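The contrastive objective described above can be sketched as an InfoNCE-style loss with in-batch negatives. This is a minimal illustrative sketch, not the model's actual training code; the function name, temperature value, and use of in-batch negatives are assumptions.

```python
import numpy as np

def contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (illustrative sketch).

    query_emb, passage_emb: (batch, dim) arrays. Row i of passage_emb is
    the positive passage for query i; every other row in the batch
    serves as a negative.
    """
    # Normalize to unit length so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature
    # Softmax cross-entropy with the diagonal (the positives) as targets.
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: random "embeddings" just to exercise the shapes.
rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
```

Maximizing the diagonal of the similarity matrix while the softmax pushes down the off-diagonal entries is what pulls queries toward their positive passages and away from the negatives.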
NVIDIA Code Embedding v1
Short name: NV-EmbedCode-v1
Input Type: Code or text
Input Format: List of strings (any list length, any string length)
Other Properties Related to Input: The model was trained with documents of up to 512 tokens; however, like Mistral-7B, it has a theoretical attention span of approximately 131K tokens.
Output Type: Floats
Output Format: List of float arrays (same length as input list, 4096 dimensions per float array)
Other Properties Related to Output: Model outputs embedding vectors of dimension 4096 for each text string.
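The input/output contract above (list of strings in, one 4096-dimensional float vector per string out) lends itself to a simple cosine-similarity lookup. In this sketch, `embed` is a hypothetical stand-in for a call to the deployed model (for example via the NeMo Retriever Text Embedding NIM); only its shapes match the card, and the random vectors it returns carry no semantics.

```python
import numpy as np

EMBED_DIM = 4096  # embedding dimension from the model card

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical stand-in for the model: list of strings in,
    one 4096-dim float vector per string out. A real deployment
    would call the embedding service here instead."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    return rng.normal(size=(len(texts), EMBED_DIM)).astype(np.float32)

def top_k(query: str, corpus_emb: np.ndarray, k: int = 5) -> list[int]:
    """Indices of the k corpus vectors most cosine-similar to the query."""
    q = embed([query])[0]
    q = q / np.linalg.norm(q)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = c @ q
    return list(np.argsort(-sims)[:k])

corpus = ["def add(a, b): return a + b", "SELECT * FROM users"]
corpus_emb = embed(corpus)   # shape: (2, 4096)
hits = top_k("function that adds two numbers", corpus_emb, k=1)
```

In practice the corpus embeddings would be computed once and stored in a vector index; only the query is embedded at search time.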
Our training dataset is a carefully curated blend of multiple sources. It includes publicly available code retrieval datasets with commercial licenses, issue description–code pairs sourced from public GitHub repositories with commercial licenses, and synthetic data generated in response to coding questions. We prefix the queries with task-specific instructions, following our research on NV-Embed. For general tasks, we used "Instruct: Retrieve code or text based on user query.\nQuery:". The instruction can be changed based on the retrieval task.
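The query-side instruction prefix above can be applied with a small helper. This is an illustrative sketch; the helper name is hypothetical, and embedding documents without any prefix is an assumption consistent with the card, which mentions prefixing queries only.

```python
# General-task instruction exactly as quoted in the model card.
GENERAL_INSTRUCTION = "Instruct: Retrieve code or text based on user query.\nQuery:"

def format_query(query: str, instruction: str = GENERAL_INSTRUCTION) -> str:
    """Prepend the task-specific instruction to a retrieval query before
    embedding. Documents are embedded as-is (assumption: the card only
    describes prefixing queries)."""
    return instruction + query

q = format_query("binary search implementation in Python")
```

Swapping in a different `instruction` string adapts the same helper to other retrieval tasks.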
The training dataset details are as follows:
Use Case: Code retrieval from text or code data.
Data Sources: Public datasets licensed for commercial use and synthetically-generated data.
Language: English (US), programming languages including Python, C/C++, Java, JavaScript, SQL, Go, Ruby, PHP.
Volume: 534k query-positive document pairs.
Data Collection Method by dataset: Unknown
Labeling Method by dataset: The synthetic data is generated using DeepSeek-V2.5.
We evaluated the NV-EmbedCode model using the CoIR benchmark and a curated set based on SWE-bench. CoIR consists of 10 code datasets across four retrieval tasks: (1) Text-to-Code Retrieval, (2) Code-to-Code Retrieval, (3) Code-to-Text Retrieval, and (4) Hybrid Code Retrieval. The default evaluation metric for CoIR is average NDCG@10 across all datasets. SWE-bench originally consists of real-world software engineering problems from GitHub issues and their corresponding pull requests. We adapted it into a retrieval task, where the goal is to identify the files that need to be edited to resolve an issue. These files are identified using the pull request that solved the issue. For SWE-bench Lite, we use Recall@1 to measure whether the top retrieved file is the correct one for resolving the issue, as each instance typically involves editing just one file.
| Retrieval Method | CoIR Main Score (NDCG@10) | SWE-bench Lite (Recall@1) |
|---|---|---|
| NV-EmbedCode | 72.45% | 70.33% |
| NV-EmbedQA-Mistral-7B-v2 | 60.08% | 61.33% |
| SFR-Embedding-Code-2B_R | 67.41% | 47.00% |
| SFR-Mistral-2_R | 61.85% | 60.33% |
| BM25 | - | 42.33% |
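The two metrics in the table can be sketched as follows, assuming binary relevance judgments. The function names and the toy rankings are illustrative, not part of either benchmark's official tooling.

```python
import math

def recall_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked item is relevant, else 0.0. Used here for
    SWE-bench Lite, where each instance typically edits one file."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0

def ndcg_at_10(ranked_ids, relevant_ids):
    """Binary-relevance NDCG@10 (the default CoIR metric): discounted
    gain of relevant hits in the top 10, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:10])
              if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_ids), 10)))
    return dcg / ideal if ideal else 0.0
```

Per-query scores like these are averaged over all queries (and, for CoIR, over all datasets) to produce the table's numbers.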
Runtime: NeMo Retriever Text Embedding NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace
Supported Operating System(s): Linux
Engine: TensorRT
Test Hardware: See the Support Matrix in the NIM documentation.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.