
Nemotron OCR v2 is a state-of-the-art multilingual text recognition model designed for robust end-to-end optical character recognition (OCR) on complex real-world images.
It integrates three core neural network modules: a detector for text region localization, a recognizer for transcription of detected regions, and a relational model for layout and structure analysis.
This model is optimized for a wide variety of OCR tasks, including multi-line, multi-block, and natural scene text, and supports advanced reading order analysis via its relational model component. nemotron-ocr-v2 supports multiple languages and is production-ready with a focus on speed and accuracy on both document and natural scene images.
nemotron-ocr-v2 is part of the NVIDIA NeMo Retriever collection, which provides state-of-the-art, commercially-ready models and microservices optimized for the lowest latency and highest throughput.
This model is ready for commercial use.
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.
You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.
Model Developer: NVIDIA
Deployment Geography: Global
This model is designed for high-accuracy and high-speed extraction of textual information from images across multiple languages, making it ideal for powering multimodal retrieval systems, RAG pipelines, and agentic applications that require seamless integration of visual and language understanding.
Release Date:
Build.NVIDIA.com 06/12/2026 via nemotron-ocr-v2
Hugging Face 04/15/2026 via nvidia/nemotron-ocr-v2
NGC 06/12/2026 via nemotron-ocr-v2
Architecture Type: Hybrid detector-recognizer with document-level relational modeling
nemotron-ocr-v2 is available in two variants: v2_english and v2_multilingual.
Both variants share the same three-component architecture: a text detector, a text recognizer, and a relational model.
All components are trained jointly in an end-to-end fashion, providing robust, scalable, and production-ready OCR for diverse document and scene images.
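The data flow between the three components at inference time can be sketched as follows. This is a minimal illustration only: `run_pipeline`, `Region`, and the three `fake_*` stages are hypothetical names invented for this example and are not part of the nemotron-ocr-v2 API. In particular, the sort used in `fake_relate` is a simple placeholder for the learned relational model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), illustrative only


@dataclass
class Region:
    box: Box
    text: str


def run_pipeline(
    image: object,
    detect: Callable[[object], List[Box]],
    recognize: Callable[[object, Box], str],
    relate: Callable[[List[Region]], List[int]],
) -> List[Region]:
    """Chain the three stages: localize, transcribe, then order the regions."""
    boxes = detect(image)
    regions = [Region(box, recognize(image, box)) for box in boxes]
    order = relate(regions)  # indices of `regions` in reading order
    return [regions[i] for i in order]


# Stand-in stages (NOT the real networks): fixed boxes, a lookup table, and a
# top-to-bottom, left-to-right sort in place of the learned relational model.
_FAKE_TEXT: Dict[Box, str] = {
    (298.0, 44.0, 357.0, 57.0): "22 April 2016",
    (15.0, 43.0, 150.0, 57.0): "The previous notice was dated",
}


def fake_detect(image: object) -> List[Box]:
    return list(_FAKE_TEXT)


def fake_recognize(image: object, box: Box) -> str:
    return _FAKE_TEXT[box]


def fake_relate(regions: List[Region]) -> List[int]:
    return sorted(
        range(len(regions)),
        key=lambda i: (regions[i].box[1], regions[i].box[0]),
    )


ordered = run_pipeline(None, fake_detect, fake_recognize, fake_relate)
print([r.text for r in ordered])
```

Passing the stages in as callables keeps the sketch independent of any particular runtime. The actual model trains all three components jointly, which this inference-only outline does not attempt to show.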
Network Architecture: RegNetX-8GF
| Spec | v2_english | v2_multilingual |
|---|---|---|
| Transformer layers | 3 | 6 |
| Hidden dimension | 256 | 512 |
| FFN width | 1024 | 2048 |
| Attention heads | 8 | 8 |
| Max sequence length | 32 | 128 |
| Character set size | 855 | 14,244 |
Parameter Counts — v2_english:
| Component | Parameters |
|---|---|
| Detector | 45,445,259 |
| Recognizer | 6,130,657 |
| Relational model | 2,255,419 |
| Total | 53,831,335 |
Parameter Counts — v2_multilingual:
| Component | Parameters |
|---|---|
| Detector | 45,445,259 |
| Recognizer | 36,119,598 |
| Relational model | 2,288,187 |
| Total | 83,853,044 |
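As a quick arithmetic check on the two parameter tables, the short Python sketch below sums the per-component counts and reports each component's share of the total. The dictionary layout and the helper name `summarize` are illustrative choices and not part of any NVIDIA API. The numbers are copied from the tables.

```python
# Per-component parameter counts, copied from the tables above.
PARAMS = {
    "v2_english": {
        "detector": 45_445_259,
        "recognizer": 6_130_657,
        "relational": 2_255_419,
    },
    "v2_multilingual": {
        "detector": 45_445_259,
        "recognizer": 36_119_598,
        "relational": 2_288_187,
    },
}


def summarize(variant: str) -> dict:
    """Return the total parameter count and each component's share (hypothetical helper)."""
    counts = PARAMS[variant]
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return {"total": total, "shares": shares}


for variant in PARAMS:
    info = summarize(variant)
    shares = ", ".join(f"{name} {share:.1%}" for name, share in info["shares"].items())
    print(f"{variant}: {info['total']:,} parameters ({shares})")
```

The computed totals agree with the tables (53,831,335 and 83,853,044). The detector count is the same in both variants, so the difference of 30,021,709 parameters comes from the larger recognizer (29,988,941) and a small increase in the relational model (32,768).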
Input:
| Property | Value |
|---|---|
| Input Type & Format | Image (RGB, PNG/JPEG, float32/uint8), aggregation level (word, sentence, or paragraph) |
| Input Parameters | 3 x H x W (single image) or B x 3 x H x W (batch) |
| Input Range | [0, 1] (float32) or [0, 255] (uint8, auto-converted) |
| Other Properties | Handles both single images and batches. Automatic multi-scale resizing for best accuracy. |
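The input constraints in the table can be checked before a request is sent. The sketch below is one possible pre-flight validation in plain Python. The function names are hypothetical, and dividing uint8 values by 255 is an assumption about what "auto-converted" means, since the table does not state the exact conversion.

```python
from typing import Sequence, Tuple

# Aggregation levels listed in the Input table.
AGGREGATION_LEVELS = ("word", "sentence", "paragraph")


def normalize_shape(shape: Sequence[int]) -> Tuple[int, ...]:
    """Return a B x 3 x H x W shape for a single image (3 x H x W) or a batch."""
    dims = tuple(shape)
    if len(dims) == 3:  # single image: add a batch dimension of 1
        dims = (1, *dims)
    if len(dims) != 4 or dims[1] != 3 or min(dims) < 1:
        raise ValueError(f"expected 3 x H x W or B x 3 x H x W, got {tuple(shape)}")
    return dims


def scale_pixel(value: float, dtype: str) -> float:
    """Map one pixel value into [0, 1] (assumes uint8 is scaled by 1/255)."""
    if dtype == "uint8":
        if not 0 <= value <= 255:
            raise ValueError(f"uint8 pixel out of range: {value}")
        return value / 255.0
    if dtype == "float32":
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"float32 pixel out of range: {value}")
        return float(value)
    raise ValueError(f"unsupported dtype: {dtype}")


def check_aggregation(level: str) -> str:
    """Validate the requested aggregation level."""
    if level not in AGGREGATION_LEVELS:
        raise ValueError(f"aggregation level must be one of {AGGREGATION_LEVELS}")
    return level


print(normalize_shape((3, 480, 640)))     # single image becomes a batch of one
print(normalize_shape((8, 3, 480, 640)))  # batches pass through unchanged
print(scale_pixel(255, "uint8"), check_aggregation("paragraph"))
```

A channels-last shape such as `(480, 640, 3)` is rejected by `normalize_shape`, which is a common source of errors when images come straight from an image-loading library.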
Output:
| Property | Value |
|---|---|
| Output Type | Structured OCR results: a list of detected text regions (bounding boxes), recognized text, and confidence scores |
| Output Format | Bounding boxes: tuple of floats, recognized text: string, confidence score: float |
| Output Parameters | Bounding boxes: 1D list of bounding box coordinates, recognized text: 1D list of strings, confidence score: 1D list of floats |
| Other Properties | See the sample output below for an example. |
Sample output:

    ocr_boxes = [[[15.552736282348633, 43.141815185546875],
                  [150.00149536132812, 43.141815185546875],
                  [150.00149536132812, 56.845645904541016],
                  [15.552736282348633, 56.845645904541016]],
                 [[298.3145751953125, 44.43315124511719],
                  [356.93585205078125, 44.43315124511719],
                  [356.93585205078125, 57.34814453125],
                  [298.3145751953125, 57.34814453125]]]
    ocr_txts = ['The previous notice was dated', '22 April 2016']
    ocr_confs = [0.97730815, 0.98834222]

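The three parallel lists in the sample output can be combined into one record per text region. The sketch below shows one way to do that in plain Python. `TextRegion`, `to_regions`, `axis_aligned_bounds`, and `filter_confident` are hypothetical helpers, and reading each corner as an (x, y) pair is an assumption inferred from the sample values.

```python
from typing import List, NamedTuple, Sequence, Tuple

Point = Tuple[float, float]  # assumed to be (x, y)


class TextRegion(NamedTuple):
    text: str
    confidence: float
    box: List[Point]  # four corners, as in the sample output


def to_regions(boxes: Sequence, texts: Sequence[str], confs: Sequence[float]) -> List[TextRegion]:
    """Zip the parallel output lists into one record per detected region."""
    if not (len(boxes) == len(texts) == len(confs)):
        raise ValueError("boxes, texts, and confidences must have equal length")
    return [
        TextRegion(text, float(conf), [(p[0], p[1]) for p in box])
        for box, text, conf in zip(boxes, texts, confs)
    ]


def axis_aligned_bounds(box: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a four-corner box."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def filter_confident(regions: Sequence[TextRegion], threshold: float) -> List[TextRegion]:
    """Keep regions whose confidence is at least `threshold`."""
    return [r for r in regions if r.confidence >= threshold]


# Values copied from the sample output.
ocr_boxes = [
    [[15.552736282348633, 43.141815185546875], [150.00149536132812, 43.141815185546875],
     [150.00149536132812, 56.845645904541016], [15.552736282348633, 56.845645904541016]],
    [[298.3145751953125, 44.43315124511719], [356.93585205078125, 44.43315124511719],
     [356.93585205078125, 57.34814453125], [298.3145751953125, 57.34814453125]],
]
ocr_txts = ['The previous notice was dated', '22 April 2016']
ocr_confs = [0.97730815, 0.98834222]

for region in to_regions(ocr_boxes, ocr_txts, ocr_confs):
    print(region.text, region.confidence, axis_aligned_bounds(region.box))
```

With a threshold of 0.98, `filter_confident` keeps only the second region from this sample. A suitable threshold depends on the application and should be chosen using use-case-specific data.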
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference than CPU-only solutions.
Runtime Engines: TensorRT, PyTorch
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
Operating Systems: Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
nemotron-ocr-v2 (variants: v2_english, v2_multilingual)
Short Name: nemotron-ocr-v2
Data Modality: Image
Training Data Collection: Hybrid (Automated, Human, Synthetic)
Training Labeling: Hybrid (Automated, Human, Synthetic)
Training Properties: Trained on a large-scale, curated mix of public and proprietary OCR datasets, focusing on high diversity of document layouts and natural scene images. The training set includes synthetic and real images with varied noise and backgrounds, filtered for commercial use eligibility. Includes scanned documents, natural scene images, receipts, and business documents.
Evaluation Data Collection: Hybrid (Automated, Human, Synthetic)
Evaluation Labeling: Hybrid (Automated, Human, Synthetic)
Evaluation Properties: Evaluated on OmniDocBench (crop-level) and SynthDoG (page-level, 7 languages) benchmarks.
| Dataset | Type | Samples |
|---|---|---|
| OmniDocBench | Crop-level OCR | 23,378 crops |
| SynthDoG (7 languages) | Page-level OCR | 100 pages/lang |
Acceleration Engine: TensorRT
Test Software: TensorRT
Test Hardware: NVIDIA L40S
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.