Copyright © 2025 NVIDIA Corporation

nvidia

nemoretriever-ocr-v1

Run Anywhere

Powerful OCR model for fast, accurate real-world image text extraction, layout, and structure analysis.

Tags: Table Extraction, data ingestion, extraction, nemo retriever, Optical Character Recognition

Model Overview

Description

The NeMo Retriever OCR v1 model is a state-of-the-art text recognition model designed for robust end-to-end optical character recognition (OCR) on complex real-world images. It integrates three core neural network modules: a detector for text region localization, a recognizer for transcription of detected regions, and a relational model for layout and structure analysis.

This model is optimized for a wide variety of OCR tasks, including multi-line, multi-block, and natural scene text, and supports advanced reading order analysis via its relational model component. NeMo Retriever OCR v1 has been developed to be production-ready and commercially usable, with a focus on speed and accuracy on both document and natural scene images.

The NeMo Retriever OCR v1 model is part of the NVIDIA NeMo Retriever collection of NIM microservices, which provides state-of-the-art, commercially ready models and microservices optimized for low latency and high throughput. The collection features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can customize them for domain-specific use cases, such as information technology help assistants, human resources help assistants, and research and development assistants.

This model is ready for commercial use.

License/Terms of use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

Deployment Geography:

Global

Use Case:

The NeMo Retriever OCR v1 model is designed for high-accuracy and high-speed extraction of textual information from images, making it ideal for powering multimodal retrieval systems, Retrieval-Augmented Generation (RAG) pipelines, and agentic applications that require seamless integration of visual and language understanding. Its robust performance and efficiency make it an excellent choice for next-generation AI systems that demand both precision and scalability across diverse real-world content.

Release Date:

NGC: July 15, 2025

Model Architecture

Architecture Type: Hybrid detector–recognizer with document-level relational modeling

The NeMo Retriever OCR v1 model integrates three specialized neural components:

  • Text Detector: Utilizes a RegNetY-8GF convolutional backbone for high-accuracy localization of text regions within images.
  • Text Recognizer: Employs a Transformer-based sequence recognizer to transcribe text from detected regions, supporting variable word and line lengths.
  • Relational Model: Applies a multi-layer global relational module to predict logical groupings, reading order, and layout relationships across detected text elements.

All components are trained jointly in an end-to-end fashion, providing robust, scalable, and production-ready OCR for diverse document and scene images.

Parameter Counts:

Component          Parameters
Detector           45,268,472
Recognizer          4,944,346
Relational model    2,254,422
Total              52,467,240
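
As a quick sanity check on the table above, the following sketch sums the three component counts and prints each component's share of the total. The variable names are illustrative only; the numbers are copied from the table.

```python
# Parameter counts as listed in the table above.
param_counts = {
    "Detector": 45_268_472,
    "Recognizer": 4_944_346,
    "Relational model": 2_254_422,
}

total = sum(param_counts.values())
assert total == 52_467_240  # matches the published total

# Print each component with its share of the total parameter count.
for name, count in param_counts.items():
    print(f"{name:<17} {count:>12,}  {count / total:6.1%}")
print(f"{'Total':<17} {total:>12,}")
```

The detector accounts for roughly 86% of the parameters, so the recognizer and relational model add comparatively little to the model size.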

Input

Input Type & Format: Image (RGB, PNG/JPEG, float32/uint8), aggregation level (word, sentence, or paragraph)
Input Parameters: 3 x H x W (single image) or B x 3 x H x W (batch)
Input Range: [0, 1] (float32) or [0, 255] (uint8, auto-converted)
Other Properties: Handles both single images and batches. Automatic multi-scale resizing for best accuracy.
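
To make the layout and range conventions above concrete, here is a minimal pure-Python sketch that turns an H x W x 3 image of uint8 values into a 3 x H x W structure of floats in [0, 1], then wraps it into a batch of size 1. The function names are hypothetical, the channel-last starting layout is an assumption, and dividing by 255 is only an assumption about what the automatic uint8 conversion amounts to. In practice an array library would do this work.

```python
def to_chw_unit_range(image_hwc):
    """Convert an H x W x 3 nested list of uint8 values to 3 x H x W floats in [0, 1].

    Assumption: uint8 values are scaled by 1/255; the model card does not
    document the exact conversion.
    """
    height, width = len(image_hwc), len(image_hwc[0])
    return [
        [[image_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(3)
    ]


def shape(nested):
    """Return the dimensions of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


tiny = [[[0, 128, 255], [255, 255, 255]]]  # a 1 x 2 RGB image, channel-last
single = to_chw_unit_range(tiny)           # 3 x H x W
batch = [single]                           # B x 3 x H x W with B = 1
print(shape(single), shape(batch))
```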

Output

Output Type: Structured OCR results: a list of detected text regions (bounding boxes), recognized text, and confidence scores
Output Format: Bounding boxes: tuple of floats; recognized text: string; confidence score: float
Output Parameters: Bounding boxes: 1D list of bounding box coordinates; recognized text: 1D list of strings; confidence score: 1D list of floats
Other Properties: See the sample output for an example of the model output

Sample output

ocr_boxes = [[[15.552736282348633, 43.141815185546875],
  [150.00149536132812, 43.141815185546875],
  [150.00149536132812, 56.845645904541016],
  [15.552736282348633, 56.845645904541016]],
 [[298.3145751953125, 44.43315124511719],
  [356.93585205078125, 44.43315124511719],
  [356.93585205078125, 57.34814453125],
  [298.3145751953125, 57.34814453125]],
 [[15.44686508178711, 13.67985725402832],
  [233.15859985351562, 13.67985725402832],
  [233.15859985351562, 27.376562118530273],
  [15.44686508178711, 27.376562118530273]],
 [[298.51727294921875, 14.268900871276855],
  [356.9850769042969, 14.268900871276855],
  [356.9850769042969, 27.790447235107422],
  [298.51727294921875, 27.790447235107422]]]

ocr_txts = ['The previous notice was dated',
 '22 April 2016',
 'The previous notice was given to the company on',
 '22 April 2016']

ocr_confs = [0.97730815, 0.98834222, 0.96804602, 0.98499225]
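
The three parallel lists above can be combined into one record per text region. The sketch below does that, filters by confidence, and applies a naive top-to-bottom, left-to-right ordering. The helper names, the `min_conf` threshold, and the `row_tolerance` value are all hypothetical, and the ordering heuristic is only a stand-in for illustration, not the model's relational reading-order analysis. Coordinates are the sample values rounded to two decimals.

```python
# Sample values from above, rounded to two decimals for brevity.
ocr_boxes = [
    [[15.55, 43.14], [150.00, 43.14], [150.00, 56.85], [15.55, 56.85]],
    [[298.31, 44.43], [356.94, 44.43], [356.94, 57.35], [298.31, 57.35]],
    [[15.45, 13.68], [233.16, 13.68], [233.16, 27.38], [15.45, 27.38]],
    [[298.52, 14.27], [356.99, 14.27], [356.99, 27.79], [298.52, 27.79]],
]
ocr_txts = [
    "The previous notice was dated",
    "22 April 2016",
    "The previous notice was given to the company on",
    "22 April 2016",
]
ocr_confs = [0.9773, 0.9883, 0.9680, 0.9850]


def to_records(boxes, texts, confs, min_conf=0.5):
    """Zip the parallel lists into dicts, dropping low-confidence regions."""
    records = []
    for box, text, conf in zip(boxes, texts, confs):
        if conf < min_conf:
            continue
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        records.append({
            "text": text, "conf": conf,
            "left": min(xs), "top": min(ys),
            "right": max(xs), "bottom": max(ys),
        })
    return records


def naive_reading_order(records, row_tolerance=5.0):
    """Group regions into rows by their top edge, then sort each row by left edge."""
    rows = []
    for rec in sorted(records, key=lambda r: r["top"]):
        if rows and abs(rec["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(rec)
        else:
            rows.append([rec])
    return [rec for row in rows for rec in sorted(row, key=lambda r: r["left"])]


records = to_records(ocr_boxes, ocr_txts, ocr_confs)
lines = [rec["text"] for rec in naive_reading_order(records)]
print(lines)
```

Using `min`/`max` over the four corner points avoids assuming a particular corner order in each box.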

NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. Leveraging NVIDIA hardware (GPUs, CUDA), the model achieves fast inference for production-scale deployments.

Software Integration

Runtime Engine: NeMo Retriever OCR v1 NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
Supported Operating System(s): Linux

Model Version(s)

NeMo Retriever OCR v1
Short Name: nemoretriever-ocr-v1

Training Dataset & Evaluation

Training Dataset

The model is trained on a large-scale, curated mix of public and proprietary OCR datasets, focusing on high diversity of document layouts and scene images. The training set includes synthetic and real images with varied noise and backgrounds, filtered for commercial use eligibility.

Data Collection Method: Hybrid (Automated, Human, Synthetic)
Labeling Method: Hybrid (Automated, Human, Synthetic)
Properties: Includes scanned documents, natural scene images, receipts, and business documents.

Evaluation Datasets

The NeMo Retriever OCR v1 model is evaluated on several NVIDIA internal datasets for various tasks, such as pure OCR, table content extraction, and document retrieval.

Data Collection Method: Hybrid (Automated, Human, Synthetic)
Labeling Method: Hybrid (Automated, Human, Synthetic)
Properties: Benchmarks include challenging scene images, documents with varied layouts, and multi-language data.

Evaluation Results

We benchmarked NeMo Retriever OCR v1 on internal evaluation datasets against PaddleOCR on various tasks, such as pure OCR (Character Error Rate), table content extraction (TEDS), and document retrieval (Recall@5).
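
For readers unfamiliar with the metric, Character Error Rate is commonly defined as the character-level edit distance between the predicted and reference text, divided by the reference length. The sketch below implements that common definition; the exact normalization used in the internal evaluation is not stated here, so treat this as an illustration and the function names as hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character ("l" read as "1") in a 13-character reference.
cer = character_error_rate("22 April 2016", "22 Apri1 2016")
print(f"{cer:.4f}")
```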

Metric                                NeMo Retriever OCR v1   PaddleOCR   Net change
Character Error Rate                  0.1633                  0.2029      -19.5% ✔️
Bag-of-character Error Rate           0.0453                  0.0512      -11.5% ✔️
Bag-of-word Error Rate                0.1203                  0.2748      -56.2% ✔️
Table Extraction TEDS                 0.781                   0.781       0.0% ⚖️
Public Earnings Multimodal Recall@5   0.779                   0.775       +0.5% ✔️
Digital Corpora Multimodal Recall@5   0.901                   0.883       +2.0% ✔️
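
The "Net change" column appears consistent with the relative difference against the PaddleOCR value, that is, (NeMo − PaddleOCR) / PaddleOCR. The sketch below recomputes it from the table under that assumption; the function name is illustrative. Note that lower is better for the three error rates, while higher is better for TEDS and Recall@5.

```python
# (metric, NeMo Retriever OCR v1, PaddleOCR) as listed in the table above.
rows = [
    ("Character Error Rate", 0.1633, 0.2029),
    ("Bag-of-character Error Rate", 0.0453, 0.0512),
    ("Bag-of-word Error Rate", 0.1203, 0.2748),
    ("Table Extraction TEDS", 0.781, 0.781),
    ("Public Earnings Multimodal Recall@5", 0.779, 0.775),
    ("Digital Corpora Multimodal Recall@5", 0.901, 0.883),
]


def relative_change(value, baseline):
    """Relative difference of value against baseline, as a fraction."""
    return (value - baseline) / baseline


for name, ours, paddle in rows:
    print(f"{name:<37} {relative_change(ours, paddle):+.1%}")
```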

Detailed Performance Analysis

The model demonstrates robust performance on complex layouts, noisy backgrounds, and challenging real-world scenes. Reading order and block detection are powered by the relational module, supporting downstream applications such as chart-to-text, table-to-text, and infographic-to-text extraction.

Inference
Engine: TensorRT, PyTorch
Test Hardware: H100 PCIe/SXM, A100 PCIe/SXM, L40S, L4, and A10G

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.

For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.
