The NVIDIA AI Blueprint for Retrieval-Augmented Generation (RAG) is a production-ready reference workflow that provides a complete foundation for building scalable, customizable pipelines for both retrieval and generation. Powered by NVIDIA NeMo Retriever models and NVIDIA Llama Nemotron models, the blueprint is optimized for high accuracy, strong reasoning, and enterprise-scale throughput.
It provides built-in support for multimodal data ingestion, advanced retrieval, reranking, and reflection techniques, and integrates seamlessly into LLM-powered workflows. By connecting language models to enterprise data across text, tables, charts, audio, and infographics from millions of documents, it enables truly context-aware, generative responses.
Beyond retrieval and generation, the blueprint includes governance, observability, and safety features to meet enterprise requirements, along with developer-friendly APIs, telemetry, and evaluation frameworks for streamlined experimentation and deployment. GPU acceleration ensures unmatched performance at scale, while flexible plug-ins and customizability let teams adapt the solution to their unique use cases.
Whether you’re building enterprise search, knowledge assistants, generative copilots, or vertical AI workflows, the NVIDIA AI Blueprint for RAG delivers everything needed to move from prototype to production with confidence. It can be used standalone, combined with other NVIDIA Blueprints, or integrated into an agentic workflow to support more advanced reasoning-driven applications. For example, this blueprint serves as a foundational building block in the AI Agent for Enterprise Research blueprint.
Get started with this reference architecture to ground AI-driven decisions and generation in relevant enterprise data.
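To make the workflow concrete, the minimal sketch below shows the retrieve-then-generate pattern the blueprint automates: fetch passages relevant to a question, then generate an answer grounded in them. This is not the blueprint's own API; the retrieve_passages helper is a hypothetical stand-in for the ingestion and retrieval pipeline, and the hosted endpoint and model identifier are assumptions to adjust for your deployment.

```python
# Minimal sketch of the retrieve-then-generate pattern this blueprint automates.
# Assumptions: an OpenAI-compatible NVIDIA endpoint is used for generation, and
# retrieve_passages() stands in for NeMo Retriever embedding, vector search, and reranking.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NVIDIA API Catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

def retrieve_passages(question: str) -> list[str]:
    """Hypothetical stand-in for the blueprint's retrieval pipeline."""
    return ["<passage 1 from your enterprise corpus>", "<passage 2>"]

question = "What was our Q3 revenue by region?"
context = "\n\n".join(retrieve_passages(question))

completion = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",  # assumed model identifier
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```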
Data Ingestion and Processing
Vector Database and Retrieval
Multimodal and Advanced Generation
Governance
Observability and Telemetry
Other
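The retrieval stages named above can also be exercised independently. As a hedged illustration of the reranking step, the sketch below scores candidate passages against a query using the NeMo Retriever reranking model listed later in this card; the hosted endpoint URL, request fields, and response shape are assumptions patterned on NVIDIA's hosted retrieval NIMs and should be checked against the blueprint's deployed services.

```python
# Hedged sketch of the reranking step: score candidate chunks against the query.
# The endpoint URL and payload shape are assumptions modeled on NVIDIA's hosted
# retrieval NIMs; a local NIM deployment would expose a similar API on your host.
import os
import requests

url = "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"

payload = {
    "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "query": {"text": "How do I rotate the service credentials?"},
    "passages": [
        {"text": "Credentials are rotated from the admin console under Security."},
        {"text": "The cafeteria menu changes every Monday."},
    ],
}
headers = {"Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}"}

response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()
print(response.json())  # rankings of passages by relevance (field names may vary)
```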
Hardware Requirements
The blueprint offers two primary modes of deployment. By default, it deploys the referenced NIM microservices locally. The minimum required hardware for each deployment method is listed below; these requirements change if optional configuration settings are enabled.
Docker
Kubernetes
The blueprint also allows the use of NVIDIA NGC-hosted models, in which case only one GPU is required locally to host the NVIDIA cuVS-accelerated vector database.
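For that hosted-model path, a minimal sketch of calling a hosted model through an OpenAI-compatible client is shown below. The base URL and the embedding call's extra parameters are assumptions based on common usage of the NVIDIA API Catalog and the embedding model named in this card; verify them against the blueprint's configuration before use.

```python
# Hedged sketch: using an NVIDIA-hosted embedding model instead of a local NIM,
# so only the cuVS-accelerated vector database needs a local GPU.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed NVIDIA API Catalog endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

# input_type is assumed to be "query" for search queries and "passage" for documents
# being indexed; confirm these parameters against the model's documentation.
response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input=["Which plants qualify for the renewable energy credit?"],
    encoding_format="float",
    extra_body={"input_type": "query", "truncate": "NONE"},
)
print(len(response.data[0].embedding))  # embedding dimensionality
```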
OS Requirements
Deployment Options
NVIDIA Technology
3rd Party Software
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure the models meet requirements for the relevant industry and use case and address unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI concerns here.
Use of the models in this blueprint is governed by the NVIDIA AI Foundation Models Community License.
This blueprint is governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Software License Agreement and the NVIDIA Agreements | Enterprise Software | Product Specific Terms for AI Product. The models are governed by the NVIDIA Agreements | Enterprise Software | NVIDIA Community Model License, and the NVIDIA RAG dataset is governed by the NVIDIA Asset License Agreement. The following models built with Llama are governed by the Llama 3.2 Community License Agreement: nvidia/llama-3.2-nv-embedqa-1b-v2, nvidia/llama-3.2-nv-rerankqa-1b-v2, and llama-3.2-nemoretriever-1b-vlm-embed-v1.
ADDITIONAL INFORMATION:
The llama-3.1-nemotron-nano-vl-8b-v1, llama-3.1-nemoguard-8b-content-safety, and llama-3.1-nemoguard-8b-topic-control models are governed by the Llama 3.1 Community License Agreement. The nvidia/llama-3.2-nv-embedqa-1b-v2, nvidia/llama-3.2-nv-rerankqa-1b-v2, and llama-3.2-nemoretriever-1b-vlm-embed-v1 models are governed by the Llama 3.2 Community License Agreement. The llama-3.3-nemotron-super-49b-v1.5 model is governed by the Llama 3.3 Community License Agreement. Built with Llama. NVIDIA Ingest and the nemoretriever-page-elements-v2, nemoretriever-table-structure-v1, nemoretriever-graphic-elements-v1, paddleocr, and nemoretriever-ocr-v1 models are licensed under Apache 2.0.

Power fast, accurate semantic search across multimodal enterprise data with NVIDIA’s RAG Blueprint—built on NeMo Retriever and Nemotron models—to connect your agents to trusted, authoritative sources of knowledge.